LOW POWER AND RELIABILITY ASSESSMENT TECHNIQUES FOR
ADVANCED PROCESSOR DESIGN
by
Nasir Mohyuddin
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER ENGINEERING)
August 2010
Copyright 2010 Nasir Mohyuddin
Dedication
This thesis is dedicated to my parents and my wife, who provided all kinds of support
throughout the course of my PhD studies and research.
Acknowledgements
First of all, I thank Almighty Allah (God), who gave me the strength and ability
to accomplish all this. I would also like to express my profound gratitude to my PhD
advisor, Dr. Massoud Pedram, who kindly undertook the supervision of my
research. Despite his heavy commitments and extremely busy research schedule, he
provided consistent technical guidance and support throughout the course of my
research.
I would like to express my sincere thanks to Dr. Jeffery Draper, Dr. Sandeep Gupta,
Dr. Murali Annavaram, Dr. Ming-Deh Huang and Dr. Aiichiro Nakano, who kindly
consented to be on my PhD guidance committee and helped me shape this research
work.
Table of Contents
Dedication
Acknowledgements
List of Tables
List of Figures
Abstract
CHAPTER 1: INTRODUCTION
1.1 Motivational Background
1.2 Low Power Technique to Reduce Static Power Dissipation
1.3 Low Power Technique for Dynamic Power Dissipation
1.4 Reliability Considerations for Processor Design
1.5 Organization
CHAPTER 2: STATIC POWER CONTROL
2.1 Leakage Power Phenomenon in CMOS Circuits
2.2 Controlling Leakage Power in Cache Hierarchy
2.3 Related Schemes to Reduce Leakage Power
2.4 Controlling Leakage in Slumberous Caches
2.5 Technological Findings to Evaluate Slumberous Caches
2.6 Architectural Simulations
2.7 Evaluations of Slumberous Caches for Famous Cache Replacement Algorithms
CHAPTER 3: DYNAMIC POWER CONTROL
3.1 Clock Gating
3.2 Using Branch Mispredictions for Clock Gating
3.3 Proposed Clock Gating Architecture
3.4 Experimental Results
3.5 Conclusions
CHAPTER 4: BOOLEAN DIFFERENCE CALCULUS BASED ERROR CALCULATOR
4.1 Error Propagation Using Boolean Difference Calculus
4.2 Proposed Error Propagation Model
4.3 Practical Considerations
4.4 Simulation Results
4.5 Extensions to BDEC
4.6 Conclusions
CHAPTER 5: CONCLUSIONS
5.1 Scope of the Proposed Techniques
5.2 Limitations of the Proposed Techniques
5.3 Possible Extensions of the Presented Techniques
BIBLIOGRAPHY
List of Tables
Table 2.1: Supply and threshold voltages for different technologies
Table 2.2: Leakage current of one cell at different power save levels using predictive models [26]
Table 2.3: Energy per transition per byte between different tranquility levels
Table 2.4: 8FO4 clock frequencies used for our simulations
Table 2.5: Wake up penalties in terms of cycles
Table 2.6: Average leakage power dissipated and maximum leakage power savings per byte
Table 2.7: Maximum percent leakage power savings
Table 2.8: Baseline microprocessor Simulation Model
Table 2.9: The single simpoints for simulations of the Spec2000 benchmarks
Table 2.10: Cache-line replacement priorities for different combinations of Bit_0, Bit_1 and Bit_2
Table 2.11: Average dynamic power costs (nW) for various schemes under PLRU
Table 2.12: Replacement priority assignment for LRU
Table 2.13: Cache-line replacement priorities for hits in different cache line LRU
Table 2.14: Dynamic power costs for different benchmarks for TL4 schemes under PLRU and LRU
Table 2.15: Average dynamic power costs for various schemes under LRU
Table 2.16: Comparing performance impact of various power saving schemes under LRU and PLRU
Table 2.17: Replacement priority assignment for MRR
Table 2.18: Cache-line replacement priorities for hits in different cache line under MRR
Table 2.19: Comparing IPCs of various replacement algorithms
Table 2.20: Dynamic power costs for different benchmarks for TL4 schemes under PLRU and LRU
Table 2.21: Average dynamic power costs for various schemes under MRR
Table 2.22: Comparing performance impact of power saving schemes under PLRU, LRU and MRR
Table 2.23: Comparing all 12 leakage control schemes with respect to LESMs
Table 3.1: Processor Model used for Evaluations
Table 3.2: IPC Degradation for WPCG
Table 4.1: Output error probability with re-convergent fanout
Table 4.2: Percent error reduction in output error probability using BDEC +Collapsing
Table 4.3: Circuit reliability for tree-structured circuits having relatively small number of PIs
Table 4.4: Circuit Reliability for Tree-Structured Circuits having relatively Large Number of PIs
Table 4.5: Circuit Reliability and Efficiency of BDEC Compared to PGM and PTM
Table 4.6: Runtime Comparison between BDEC and PTM for some Large Benchmark Circuits
Table 4.7: Circuit Reliability for Large Benchmark Circuits
Table 4.8: BDEC Circuit Reliability Compared to MC Simulations for Large Benchmark Circuits
List of Figures
Figure 2.1: Schematic view of a basic CMOS inverter
Figure 2.2: A basic CMOS inverter with different N and P type regions inferring diodes
Figure 2.3: Control circuitry to implement proposed scheme
Figure 2.4: Six transistor SRAM Cell with wordline and bitlines
Figure 2.5: Charge sharing between cache lines at different tranquility levels
Figure 2.6: The power wake up cycle from different levels of tranquility in the 70nm technology
Figure 2.7: Maximum percent leakage power savings for various schemes
Figure 2.8: PLRU implementation for 4-way caches
Figure 2.9: How hits at different Priority levels affect the cache under PLRU policy
Figure 2.10: Dynamic power incurred per byte for L1 data cache for TL4 under PLRU policy
Figure 2.11: Dynamic power incurred per byte for L1 data cache for TL2-T4 under PLRU policy
Figure 2.12: Average dynamic power cost for L1 data cache for different power schemes under PLRU
Figure 2.13: Net Savings per byte for L1 data cache for TL4 under PLRU
Figure 2.14: % Net Savings per byte for L1 data cache for TL4 scheme under PLRU policy
Figure 2.15: Power savings for L1 data cache under various schemes in future technologies for PLRU
Figure 2.16: Average net leakage power savings for L1 data cache for various schemes under PLRU
Figure 2.17: Percent increase in L1 access time for hits for various schemes under PLRU
Figure 2.18: How hits at different Priority levels affect the cache under LRU policy
Figure 2.19: Dynamic power cost for L1 data cache for TL4 scheme under LRU policy
Figure 2.20: Net savings for L1 data cache for TL4 under LRU policy
Figure 2.21: Percent net savings for L1 data cache for TL4 under LRU
Figure 2.22: Power savings for L1 data cache under various schemes in future technologies for LRU
Figure 2.23: Average net leakage power savings for L1 data cache for various schemes under LRU
Figure 2.24: Percent increase in L1 access time for hits for various schemes under LRU
Figure 2.25: How hits at different Priority levels affect the cache under MRR replacement policy
Figure 2.26: Dynamic power incurred per byte for L1 data cache for TL2-T4 under MRR policy
Figure 2.27: Net savings per byte for L1 data cache for TL2-T4 under MRR
Figure 2.28: Percent net savings per byte for L1 data cache for TL2-T4 scheme under MRR
Figure 2.29: Power savings for L1 data cache under various schemes in future technologies for MRR
Figure 2.30: Average net leakage power savings for L1 data cache for various schemes under MRR
Figure 2.31: Percent increase in L1 access time for hits for various schemes under MRR
Figure 2.32: Dynamic power cost for L2 cache for various schemes in future technologies for PLRU
Figure 2.33: Leakage power savings for L2 cache for various schemes in future technologies for PLRU
Figure 2.34: Percent increase in L2 access time for hits for various schemes under PLRU
Figure 3.1: Idle time and wrong-path instruction fraction for integer ALUs
Figure 3.2: Average number of wrong-path instructions per mispredicted branch
Figure 3.3: PFCG Architecture
Figure 3.4: The WPCG Architecture
Figure 3.5: Circuitry used to detect wrong-path instructions
Figure 3.6: Usage cycles fraction in integer ALUs and % decrease in the usage cycles due to WPCG
Figure 3.7: Energy consumption in the combinational logic and stage registers of the integer ALUs
Figure 3.8: Reduction in register file and cache accesses due to WPCG
Figure 3.9: Energy dissipation distribution for different benchmark
Figure 4.1: Gate implementing function f
Figure 4.2: A faulty buffer with erroneous input
Figure 4.3: The proposed model for a general faulty gate
Figure 4.4: A 2-input faulty AND gate with erroneous inputs
Figure 4.5: Balanced tree implementation of 4-input AND gate
Figure 4.6: Re-convergent fanout in a 2-to-1 Multiplexer
Abstract
The rapid scaling of silicon technologies over the past decade has introduced
some strenuous constraints for processor design. Technology progression has
exacerbated the power problem, which in turn has increased the need to consider
design reliability. Static power dissipation, which used to be negligible in the past, is expected to
be a major component of overall processor power dissipation. In this research, we present
techniques that reduce both static and dynamic power dissipation in modern processors
without compromising processor reliability as a whole. We also present a tool, BDEC,
that can be used to compare the reliability of different possible implementations of major
processor functional units, so that more reliable implementations can be chosen in future processor
designs.
The proposed techniques reduce both static and dynamic power dissipation in
modern processor designs. An important aspect of low power design
is that the low power technique must not itself be power hungry. The strength of the
presented low power techniques is that they do not carry a large hardware implementation
cost; rather, they use existing hardware to control power dissipation. The power dissipation
overhead of the presented techniques is minimal.
Ever smaller feature sizes and the increased power density of modern
processors, which has resulted in high chip temperatures, have increased the importance of
reliability considerations in processor design. Since in the future we will need to build reliable
systems from unreliable components, modern processor design will need to
consider design reliability during the early phases of the design. To help
compare various design alternatives, we developed a tool, BDEC, which gives the
reliability of a combinational circuit in terms of gate error probabilities and error
probabilities on the primary inputs.
Chapter 1: Introduction
1.1 Motivational Background
Power dissipation has been a first-order design concern for the last two decades.
Processor chip temperature has been on the rise throughout that period; early on it was compared to that of a hot
plate and was expected to reach the temperature of the Sun’s surface if no measures were
taken to reduce power dissipation in advanced microprocessor design. For every
component of modern processor design, researchers have proposed solutions, workable within the
technological constraints of the time, to reduce
power dissipation and the resulting temperature rise. As technology kept scaling in accordance with
Moore’s law [50], more and more transistors became available to designers to implement
design novelties. Since memory latency had been a limiting factor in the execution speed
of an application on a given processor, it was attractive for processor
designers to move more and more memory on chip; as a result we see level-1, level-2
and even level-3 cache memories implemented on chip in modern processors [13]. In
modern processors a significant fraction of the total die area is occupied by cache memories;
e.g., 60% of the StrongARM die area is cache [69]. Therefore caches, which are
implemented as SRAM memories, provide the greatest opportunity for static power
reduction.
Dynamic power is mostly consumed by the clock network, which accounts for a large
fraction of a chip’s total energy consumption. One popular technique, clock gating, prevents
the clock signal from reaching idle functional units. Clock gating is a very effective way
of reducing power and energy throughout a processor and is implemented in numerous
commercial systems.
Reliability is fast becoming a primary design concern. Designers are
accustomed to designing systems using reliable gates and components. It is widely believed
that in the future we will need to build systems using unreliable gates and components.
Instead of deterministic gates and components we will have probabilistic gates
and sub-components, which provide the correct output only with a certain probability.
1.2 Low Power Technique to Reduce Static Power Dissipation
Static power is dissipated at all times and is not a function of circuit activity. Static
power is also referred to as leakage power. Effective leakage power reduction techniques
for on-chip caches are based on switching SRAM memory cells to a low leakage mode
when they are not accessed. The low leakage mode voltage depends on the technology
and is a function of the transistor threshold voltage and noise margins; it should be high
enough to avoid data corruption. As we lower the voltage applied to the SRAM cells, the
leakage current decreases drastically. This results in less static power dissipation.
However, whenever a cell in a lower leakage power mode is accessed, power levels
may change, which results in dynamic power consumption and performance penalties. A
trade-off must be reached between the amount of leakage power saved on one hand and the impact on
dynamic power and performance on the other.
1.2.1 Slumberous caches
In slumberous caches the power level of cache lines is controlled by the cache
replacement policy. The cache lines are maintained at different power save modes called
"tranquility levels", which depend on their replacement priorities. The slumberous
cache idea offers a trade-off between the amount of leakage power saved on one hand
and the impact on dynamic power and performance on the other.
1.3 Low Power Technique for Dynamic Power Dissipation
Dynamic power dissipation control has always been an important design
consideration in digital circuit design. Chip sizes that grow with technology advancement
further increase the importance of low power techniques, especially those that incur
low power and performance overheads.
In this research a well-known design technique, clock gating, is used to save the dynamic
power dissipated in executing wrong-path instructions in advanced processors because of
branch mispredictions. Wrong Path instructions based Clock Gating (WPCG) detects
wrong-path instructions in the event of a branch misprediction and prevents them from
being issued to the functional units (FUs); it subsequently disables the clock of these
FUs and also reduces the stress on the register file and cache. Simulations demonstrate
that more than 92% of all wrong-path instructions can be detected and stopped from
being executed. The WPCG architecture results in 16.26% chip-wide energy savings,
which is 2.33% more than that of the baseline Pipelined Functional units Clock Gating
(PFCG) scheme.
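The idea can be illustrated with a toy Python model: once a branch is known to be mispredicted, queued instructions younger than that branch are dropped before issue, and a functional unit with no remaining work has its clock disabled. This is only a sketch under stated assumptions. The `Instr` fields, the sequence-number comparison and both function names are hypothetical and are not taken from the WPCG circuitry described in Chapter 3.

```python
from dataclasses import dataclass


@dataclass
class Instr:
    seq: int  # program-order sequence number (hypothetical field)
    fu: str   # functional unit the instruction needs, e.g. "alu0"


def squash_wrong_path(issue_queue, mispredicted_branch_seq):
    """Drop every queued instruction younger than the mispredicted branch."""
    return [i for i in issue_queue if i.seq <= mispredicted_branch_seq]


def clock_enables(issue_queue, fus):
    """Enable a functional unit's clock only if some queued instruction needs it."""
    needed = {i.fu for i in issue_queue}
    return {fu: fu in needed for fu in fus}


queue = [Instr(1, "alu0"), Instr(2, "alu1"), Instr(3, "alu1")]
queue = squash_wrong_path(queue, mispredicted_branch_seq=1)
enables = clock_enables(queue, ["alu0", "alu1"])
```

In this example only the instruction with sequence number 1 survives, so the model keeps the clock of `alu0` enabled and gates `alu1`.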
1.4 Reliability Considerations for Processor Design
The slumberous cache idea of keeping the most recently used cache line in active power
mode (full rail power) proves to be better in terms of memory reliability than
the original drowsy cache idea, which lowers the voltage of all the cache lines irrespective of
their usage. As mentioned in [44], soft error rates at lower supply voltages are an order of
magnitude higher than in full rail power mode.
1.5 Organization
This thesis is organized as follows. Chapter 2 discusses the low power technique
for static power control in caches. It also compares various slumberous cache
configurations with respect to performance impact and leakage power saved, and proposes the best
slumberous cache configurations for both L1 and L2 data caches. Chapter 3 presents the
dynamic power control technique using clock gating, in which branch misprediction information
is used to clock gate functional units in the processor pipeline. Chapter 4 describes the Boolean
Difference Error Calculator (BDEC) in detail. It also shows how BDEC can be used to
design reliable functional units for future processors. Chapter 5 briefly reviews the
general scope and limitations of the proposed low power and reliability techniques along
with some possible future extensions of this research work.
Chapter 2: Static Power Control
Traditionally, computer architects have mostly been concerned with performance, cost
and reliability. Power considerations were secondary. Moreover, computer architects are
used to ignoring and abstracting away the technology-level details of their designs. In recent years,
this situation has changed dramatically and power is becoming one of the primary design
parameters at both the architecture and physical design levels. Several factors have
contributed to this trend. Perhaps the primary driving factor has been the remarkable
success and growth of the class of personal computing devices (portable desktops, audio- and
video-based multimedia products) and wireless communications systems (personal
digital assistants and personal communicators), which demand high-speed computation
and complex functionality with low power consumption. In high-end machines, power
dissipation and its effect on temperature, cooling and performance are becoming the
major limiting factors for feature size and frequency scaling.
There are two types of power dissipated in a chip: dynamic power and static power.
Dynamic power is incurred whenever the state of a circuit changes, whereas static power
is dissipated (leaked) in each and every circuit, all the time, independently of its changes
of state. The International Technology Roadmap for Semiconductors (ITRS) produced by
the Semiconductor Industry Association predicts that leakage current Ioff will double with
each generation for both high-performance (low threshold voltage Vt, high leakage) and
low-power (high Vt, low leakage) transistors [66].
Figure 2.1: Schematic view of a basic CMOS inverter
Different techniques apply to dynamic and static power reduction. Static power is the
focus of this chapter. Static power is also often referred to as leakage power.
2.1 Leakage Power Phenomenon in CMOS Circuits
To understand leakage power one can look at the basic structure of a CMOS inverter,
shown in Figure 2.1. The three major sources of leakage are sub-threshold leakage,
substrate leakage, and leakage through the gate oxide [17]. In Figure 2.2 we model the
sources of leakage by diodes. Sub-threshold leakage is due to diodes D2 and D4. When the
input of the inverter is low and the output is high, the reverse-biased diode D2 causes
sub-threshold leakage. Conversely, when the input is high and the output is low, the
reverse-biased diode D4 causes sub-threshold leakage.
Diode D3 between the power supply VDD and ground GND is responsible for the
substrate leakage. The overall substrate leakage is proportional to the dimensions and
number of devices grown in the n-wells over the p-substrate. Since the substrate is lightly
doped, leakage through the substrate is very small compared to sub-threshold leakage.
The leakage through the gate oxide (diodes D1 and D5) is also very small. The ways to
reduce substrate leakage and gate oxide leakage are mostly technology-level techniques
such as twin-tub and SOI technologies.
Sub-threshold leakage is currently the largest of these three components, and is
bound to increase in future fabrication technologies as threshold voltages are scaled down
[10]. In this chapter, we focus on sub-threshold leakage. We ignore gate oxide and
substrate leakages and the techniques proposed in this research do not address these
leakages.
Figure 2.2: A basic CMOS inverter with different N and P type regions inferring diodes
2.2 Controlling Leakage Power in Cache Hierarchy
Effective leakage power reduction techniques for SRAM-based memories, e.g. on-chip
caches, are based on switching memory cells to a low leakage mode when they are not
accessed. However, whenever a memory cell in a lower leakage power mode is accessed,
power levels may change, which results in dynamic power consumption and performance
penalties. A trade-off must be reached between the amount of leakage power saved on one hand and the
impact on dynamic power and performance on the other.
To achieve this trade-off in the context of the cache hierarchy, we introduce
"slumberous caches", in which the power level of set-associative cache lines is controlled
by the cache replacement policy. The replacement policy is useful in set-associative
caches to improve the hit rate of the cache because it exploits the locality property of
memory accesses. This same locality property can be exploited to optimize the trade-off
between static power, dynamic power and performance. In a slumberous cache, cache
lines are maintained at different power save modes which we call "tranquility levels".
The lines in each set of a slumberous cache are maintained at tranquility levels which
depend on their order of replacement priorities.
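As a rough illustration of the idea, the following Python sketch models one set of an n-way cache under true LRU and derives each line's tranquility level from its replacement rank. The class name, the level numbering (0 for full rail voltage, larger numbers for deeper power save) and the particular rank-to-level mapping are assumptions made for illustration only; they are not the schemes evaluated in this chapter.

```python
class SlumberousSet:
    """One set of an n-way cache under true LRU (illustrative model only)."""

    def __init__(self, ways, level_for_rank):
        # level_for_rank[i] is the tranquility level given to the line at
        # replacement priority P(i+1); the mapping is a hypothetical example.
        assert len(level_for_rank) == ways
        self.ways = ways
        self.level_for_rank = list(level_for_rank)
        self.order = []  # tags ordered from P1 (most recent) to Pn (victim)

    def level_of(self, tag):
        """Tranquility level implied by the line's current replacement rank."""
        return self.level_for_rank[self.order.index(tag)]

    def access(self, tag):
        """Access a tag; return (hit, level the line had to be woken from)."""
        if tag in self.order:
            hit, woke_from = True, self.level_of(tag)
            self.order.remove(tag)
        else:
            hit, woke_from = False, None
            if len(self.order) == self.ways:
                self.order.pop()  # evict the line at priority Pn
        self.order.insert(0, tag)  # the accessed line becomes P1
        return hit, woke_from


cache_set = SlumberousSet(4, [0, 1, 2, 3])
for tag in "ABCD":
    cache_set.access(tag)
result = cache_set.access("A")  # "A" had drifted down to priority P4
```

In this model a hit on a low-priority line is the expensive case, since the line must be woken from the deepest level, while lines near P1 stay at or close to full voltage. Locality makes the cheap case the common one.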
The effectiveness of this idea is first evaluated in the context of PLRU (Pseudo Least
Recently Used), a common cache replacement algorithm. It is then extended to a couple of
other replacement algorithms. We explore various schemes for the tranquility levels
assigned to lines and compare overall power and performance impacts. As technology
scales down, the dynamic power and performance penalties required to energize
slumberous cache lines drop drastically while the leakage power savings remain roughly
steady.
2.3 Related Schemes to Reduce Leakage Power
Several ideas have been explored to reduce leakage power at the architectural level
in microprocessors. All of these leakage power saving schemes rely on some changes at
the circuit level to cut off power [50] to cache lines or to switch them to reduced
(drowsy) voltage levels. When power to a line is cut off, the information is lost, a backup
copy must exist in the hierarchy, and the next access to the line causes a miss. Drowsy
voltage levels are such that the information is not lost, but the line must first be energized
to full voltage before it can be accessed, which results in dynamic power and
performance penalties.
M. Martonosi et al. [33] proposed the Cache Decay scheme. Leakage power can be saved
by invalidating and "turning off" cache lines when they hold data not likely to be reused.
Success relies on accurately predicting the cache line dead periods, i.e. the
periods when a cache line is sitting idle and is useless, only consuming static power. A
cache line is turned off if a preset number of cycles (called the “decay interval”) have passed
since the cache line’s last access. This results in a 70% leakage energy reduction. An
adaptive scheme that chooses the best decay interval for each generation of a cache line
on the fly is also proposed. The problem with this scheme is that shutting off a cache line
too early increases the miss rate, which affects overall performance and incurs additional
dynamic power.
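The decay rule described above can be sketched for a single cache line in Python. This is a simplified model, assuming one idle counter per line that is reset on every access; the class and method names are hypothetical and the counter organization is not taken from [33].

```python
class DecayLine:
    """A single cache line with a fixed decay interval (illustrative only)."""

    def __init__(self, decay_interval):
        self.decay_interval = decay_interval
        self.idle_cycles = 0
        self.powered = False  # the line starts empty and turned off
        self.tag = None

    def tick(self):
        """Advance one cycle; turn the line off once it has idled long enough."""
        if self.powered:
            self.idle_cycles += 1
            if self.idle_cycles >= self.decay_interval:
                self.powered = False  # power is cut, so the contents are lost
                self.tag = None

    def access(self, tag):
        """Return True on a hit; a miss refills the line and powers it again."""
        hit = self.powered and self.tag == tag
        self.powered = True
        self.tag = tag
        self.idle_cycles = 0
        return hit


line = DecayLine(decay_interval=3)
line.access("X")             # cold miss, line is filled and powered
line.tick()
reused = line.access("X")    # reused before the interval expires: a hit
for _ in range(3):
    line.tick()              # idle for a full decay interval: line is turned off
decayed = line.access("X")   # the same data now misses
```

The last access shows the cost mentioned above: data that would have hit in a conventional cache misses because the line was switched off too early.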
A compiler based strategy to reduce leakage energy was proposed by W. Zhang et al.
for instruction caches [72]. Their scheme is based on marking the last usage of
instructions by a special instruction which turns off the cache line. To limit the frequency
of these special instructions, the authors turn off instructions at the loop granularity level.
At the exit from a loop that will not be visited again, the cache lines are turned off.
The concept of a resizable cache was proposed by Babak Falsafi et al. [70]. This
method exploits the fact that cache utilization varies from application to application and
also within an application. Statically or dynamically varying the cache size by turning
off unused cache portions can therefore save a lot of static energy. Two different schemes were used
to vary the cache size. “Selective-ways” changes the cache set associativity and
“Selective-sets” changes the number of cache sets according to cache usage. Static resizing is
done across entire applications and dynamic resizing changes the cache size on demand
during execution.
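The effect of the two resizing schemes on powered capacity follows from the usual relation capacity = sets × ways × line size. The short Python sketch below works through one example; the function names and the 128-set, 4-way, 64-byte configuration are illustrative assumptions, not parameters from [70].

```python
def cache_bytes(num_sets, ways, line_bytes):
    """Capacity of a set-associative cache in bytes."""
    return num_sets * ways * line_bytes


def selective_ways(num_sets, line_bytes, active_ways):
    """Keep every set but power only `active_ways` ways in each set."""
    return cache_bytes(num_sets, active_ways, line_bytes)


def selective_sets(ways, line_bytes, active_sets):
    """Keep the associativity but power only `active_sets` sets."""
    return cache_bytes(active_sets, ways, line_bytes)


full = cache_bytes(128, 4, 64)                   # 32 KB baseline
half_by_ways = selective_ways(128, 64, 2)        # 2 of 4 ways powered
half_by_sets = selective_sets(4, 64, 64)         # 64 of 128 sets powered
```

Both calls halve the powered capacity, but they differ in what is given up: the first lowers associativity, the second keeps associativity and changes how addresses map to sets.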
Dynamic threshold modulation [24] using MTCMOS is applied at cache line
granularity: the threshold voltage of the transistors in the SRAM cell is
dynamically increased when the cell is set to sleep mode by raising the source-to-body
voltage of the transistors in the circuit. This higher Vt reduces the leakage current while
allowing the memory cell to maintain its state even in sleep mode.
All the schemes described so far rely on shutting down parts of the cache. Our
approach is based on the idea of drowsy caches [20]. Drowsy cache lines are never
completely shut down. Every cache line can be at one of two voltage levels: full rail voltage or
drowsy mode voltage. In drowsy mode the supply voltage of the cache line is lowered to
the minimum possible level that does not corrupt the data. A drowsy bit selects
between full rail voltage and drowsy voltage mode. All cache lines are put in
drowsy mode at regular intervals of 2000 cycles. Results for Spec2000 benchmarks
suggest that, for most of the benchmarks, 90% of the cache lines can be kept in drowsy
mode. With wake-up penalties for a drowsy cache line of no more than one cycle, the
authors' results show that the total leakage energy was reduced by an average of 71%
when tags were always awake and by an average of 76% using the drowsy tag scheme,
with modest performance impact. In the same vein, drowsy instruction caches
[34] use a cache bank prediction scheme to predict which bank of the instruction cache to put
in drowsy mode and which to turn on.
2.4 Controlling Leakage in Slumberous Caches
In slumberous caches we propose to reduce the leakage power by controlling the
voltage levels of lines in the same cache set with the replacement policy.
2.4.1 Tranquility Levels
We consider set-associative caches with at least two priority levels for replacement
within a set. (Thus our approach is not suitable for random replacement policies or direct-mapped
caches.) In an n-way cache, the cache lines are ranked with respect to their
priority of replacement, P1, ..., Pn. The line with priority level Pn is always selected for
replacement. We dynamically assign different voltage levels to the cache lines at
different priority levels, based on the information kept in the replacement policy state
bits. These voltage levels are called tranquility levels, and various schemes are possible
in general to assign tranquility levels (T1, ..., Tn) to replacement priorities (P1, P2, ..., Pn).
T1 is the highest voltage level and Tn is the lowest voltage level. The lowest possible
supply voltage must be at least 200mV above the threshold voltage so that
ambient noise does not flip bits of the line.
Figure 2.3 shows a simple circuit to switch a cache line from one tranquility level to
another for a 4-way set associative cache. The four power rails remain energized with the
different voltage levels. The replacement policy state bits control the transistors feeding
the voltage level to the line. Multiple switching transistors can be distributed along the
power rails of the cache lines to avoid current bottlenecks. In all schemes considered in
this paper, cache tags and state bits are always at full rail voltage so that no clock cycle is
wasted in waking them up. This framework allows for the design of slumberous caches,
in which leakage power is controlled by the replacement policy.
Figure 2.3: Control circuitry to implement proposed scheme
2.4.2 Maximum Leakage Power Savings
The leakage power per byte without any power saving scheme is given by the
following equation:
$$P_{leakage} = 8 \cdot I_{leakage} \cdot V_{dd} \qquad (2.1)$$
where $I_{leakage}$ is the leakage current per bit and $V_{dd}$ is the full rail voltage.
For an n-way set-associative slumberous cache the maximum leakage power saving
per byte is
$$P_{saving} = 8 \cdot \left( V_{dd} \cdot I_{leakage} - \frac{1}{n} \sum_{i=1}^{n} V_i \cdot I_i \right) \qquad (2.2)$$
where $V_i$ and $I_i$ are the voltage level and the per-bit leakage current of the
tranquility level of the i-th line in a set, and n is the number of ways of associativity. This
maximum saving depends only on the technology, the number of ways, and the
tranquility levels. However, to reap the benefits of this maximum power saving we need
to solve several problems and trade-offs.
2.4.3 Leakage Control Schemes
The simplest scheme is to keep all lines in a set at the same tranquility level,
independently of the replacement priorities and, if needed, to wake up a line every time
it is accessed. Schemes with only one tranquility level are called TL1 (Tranquility
Level One), with a suffix -Ti indicating the voltage of that tranquility level. Hence all one
tranquility level schemes are denoted TL1-Ti. TL1-T1 is the scheme that does not
have any leakage savings, as all lines are at the full rail voltage at all times, and TL1-Tn
has the maximum savings, as all lines are at the lowest tranquility level. TL1-T1
has no performance impact and TL1-Tn has the worst performance impact,
as the worst wake-up penalty is incurred every time a line is accessed. TL1-Tn also
incurs dynamic power each time a cache line is accessed. So a trade-off must be
made between leakage power savings and performance impact.
To improve this trade-off, we exploit the fact that the MRU (Most Recently Used)
line is very likely to be referenced over and over again. Our evaluations show that for
different replacement policies, on average more than 94% of data hits are to the MRU
line. Thus we can keep the MRU line at T1, while keeping all the other lines in the set at
one tranquility level, T2, T3, ... or Tn. This will reduce the leakage power savings but at the
same time it will reduce the performance loss and the dynamic energy needed to wake up
lines. Hence we have TL2 (Two Tranquility Levels) schemes. Because one tranquility level
in a TL2 scheme is always T1, we denote two tranquility level schemes as TL2-Ti,
where Ti is the tranquility level employed for non-MRU lines, i.e. T2, T3, ... Tn. TL2-T2
means a scheme that has two tranquility levels, T1 for the MRU line and T2 for all non-MRU
lines. Similarly we can have TL2-T3, TL2-T4, ... TL2-Tn.
Finally, more than two tranquility levels can also be used; hence we will have TL3,
TL4, ... TLn, where n is the number of ways in a set associative cache. For TLn each
priority level P1, P2, ..., Pn is associated with a different level of tranquility T1,
T2, ..., Tn (respectively). We have considered linearly distributed voltages for tranquility
levels between the lowest possible operating voltage (deepest tranquility state), i.e. Vt +
200mV, and the full power supply voltage (wake-up state). Other distributions are
possible, but are beyond the scope of this research.
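The naming convention above can be illustrated with a short Python sketch that maps a scheme name to the tranquility level used at each replacement priority. The function name `tranquility_levels` and its parsing rules are hypothetical and introduced only for illustration; the sketch covers TL1-Ti, TL2-Ti and TLn (n equal to the number of ways) and deliberately rejects intermediate schemes such as TL3 in a 4-way cache, whose priority-to-level mapping is not spelled out here.

```python
import re


def tranquility_levels(scheme: str, n_ways: int = 4) -> list[int]:
    """Return the tranquility level index used at priorities P1..Pn.

    Element k-1 is the level for priority Pk; 1 means T1 (full rail voltage).
    Hypothetical helper, not part of any simulator.
    """
    m = re.fullmatch(r"TL(\d+)(?:-T(\d+))?", scheme)
    if m is None:
        raise ValueError(f"unrecognized scheme name: {scheme}")
    n_levels = int(m.group(1))
    ti = int(m.group(2)) if m.group(2) else None

    if n_levels == 1 and ti is not None and 1 <= ti <= n_ways:
        return [ti] * n_ways                      # every line at Ti
    if n_levels == 2 and ti is not None and 2 <= ti <= n_ways:
        return [1] + [ti] * (n_ways - 1)          # MRU at T1, the rest at Ti
    if n_levels == n_ways and ti is None:
        return list(range(1, n_ways + 1))         # Pk at Tk
    raise ValueError(f"mapping for {scheme} is not defined in this sketch")


for name in ("TL1-T4", "TL2-T3", "TL4"):
    print(name, tranquility_levels(name))
```

For a 4-way cache this prints one level per priority, so TL2-T3 keeps only the P1 (MRU) line at T1.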
2.5 Technological Findings to Evaluate Slumberous Caches
We have carried out many evaluations, both technological and architectural, to assess the
trade-offs between leakage power, dynamic power and performance in the design of
slumberous caches. Technological evaluations are discussed in this section.
2.5.1 Leakage Power of Different Tranquility Levels
An hspice deck was set up with a standard SRAM cell to measure the leakage power
of one SRAM cell, shown in Figure 2.4. We simulated the cell over present and future
technologies using presently available and predictive technology models [26].
Figure 2.4: Six transistor SRAM Cell with wordline and bitlines
Next we establish the minimum voltage level that guarantees a consistent state of the
memory. A simple noise analysis suggests that the minimum voltage should be
approximately 200mV above Vth, the threshold voltage. Vth for each
technology is determined through simulations using predictive models [26]. Table 2.1
shows the operating supply voltages and threshold voltages suggested by our simulations.
Table 2.1: Supply and threshold voltages for different technologies
Technology Supply (V) Vth (V)
130nm 1.3 0.596
100nm 1.1 0.546
70nm 0.9 0.394
We used the full rail Vdd level for T1 and Vth + 200mV for the T4 level. Power-save
voltage levels for T2 and T3 are selected by linear interpolation between T1 and T4. We
have run extensive simulations over different technologies to verify the correct operation
of an SRAM cell at different tranquility levels, and the outcome is shown in Table 2.2.
The leakage current increases significantly as the threshold voltage decreases with
technology scaling. Also, leakage current decreases with the voltage across the
tranquility levels within a technology.
Table 2.2: Leakage current of one cell at different power save levels using predictive models [26]
Operating voltage V (V) / steady state leakage current per bit I (nA) for different tranquility levels
                 T1               T2               T3               T4
Technology   V (V)   I (nA)   V (V)   I (nA)   V (V)   I (nA)   V (V)   I (nA)
130nm        1.3     0.948    1.1     0.673    0.9     0.550    0.7     0.475
100nm        1.1     2.522    0.95    1.818    0.8     1.481    0.65    1.292
70nm         0.9     8.949    0.8     7.321    0.7     6.340    0.6     5.655
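As a quick check of the linear interpolation between T1 and T4, the following Python sketch regenerates the four voltage levels of each technology from the T1 and T4 voltages listed in Table 2.2. The function name `interpolate_levels` is hypothetical, and taking the endpoints directly from the table (rather than recomputing T4 from Vth) is an assumption of this sketch.

```python
def interpolate_levels(v_full: float, v_lowest: float, n_levels: int = 4) -> list[float]:
    """Linearly spaced supply voltages from T1 (v_full) down to Tn (v_lowest)."""
    step = (v_full - v_lowest) / (n_levels - 1)
    # Round to millivolts to hide floating-point noise.
    return [round(v_full - k * step, 3) for k in range(n_levels)]


# (T1 voltage, T4 voltage) in volts, copied from Table 2.2
ENDPOINTS = {"130nm": (1.3, 0.7), "100nm": (1.1, 0.65), "70nm": (0.9, 0.6)}

for tech, (v_t1, v_t4) in ENDPOINTS.items():
    print(tech, interpolate_levels(v_t1, v_t4))
```

The intermediate values produced this way coincide with the T2 and T3 voltages listed in the table.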
2.5.2 Dynamic Power Costs
Table 2.3 shows the energy required to switch between various tranquility levels in
different technologies.
Table 2.3: Energy per transition per byte between different tranquility levels
Energy per transition (joules)
Technology T1<->T2 T1<->T3 T1<->T4 T2<->T3 T2<->T4 T3<->T4
130nm 8.68E-15 3.47E-14 7.81E-14 8.68E-15 3.47E-14 8.68E-15
100nm 3.39E-15 1.36E-14 3.05E-14 3.39E-15 1.36E-14 3.39E-15
70nm 1.10E-15 4.39E-15 9.87E-15 1.10E-15 4.39E-15 1.10E-15
To derive the expression for dynamic power consumption we start with a basic
circuit in which a capacitor is forcibly switched through an energy-dissipating switching device,
as shown in Figure 2.5. Theoretically no energy is dissipated by a capacitor in
switching from one voltage level to another; the energy is dissipated in the non-ideal
resistive switching device.
Figure 2.5: Charge sharing between cache lines at different tranquility levels
To analyze this, we start ab initio. Let:
C be the capacitance of the capacitor
Vi be the initial voltage of the capacitor
Vf be the final voltage of the capacitor
Instantaneous current through the capacitor during switching:
$$I_c = C \frac{dv_c}{dt}$$
$I_c = I_s$ (current through the switching device)
The power dissipation in the switching device:
$$P_s = V_s I_s$$
Plugging in $I_c$ we have
$$P_s = V_s \times C \frac{dv_c}{dt}$$
$$P_s \times dt = V_s \times C \, dv_c$$
Writing the voltage across the switching device $V_s$ in terms of the voltage across the capacitor $V_c$:
$$V_s = V_f - V_c$$
$$P_s \times dt = (V_f - V_c) \times C \, dv_c$$
Integrating the above equation over the switching time $T_s$, during which $V_c$ goes from $V_i$ to $V_f$:
$$\int_0^{T_s} P_s \, dt = C \int_{V_i}^{V_f} (V_f - V_c) \, dv_c$$
$$= C V_f \int_{V_i}^{V_f} dv_c - C \int_{V_i}^{V_f} V_c \, dv_c$$
$$= C V_f (V_f - V_i) - \frac{C}{2} \left( V_f^2 - V_i^2 \right)$$
$$= C (V_f - V_i) \left[ V_f - \frac{1}{2} (V_f + V_i) \right]$$
$$= C (V_f - V_i) \cdot \frac{1}{2} (V_f - V_i)$$
$$= \frac{C}{2} (V_f - V_i)^2$$
$$E = \frac{1}{2} C \, \Delta V^2$$
Hence when a cache line with total line capacitance C is switched from a
tranquility level Ti to a tranquility level Tf, the energy dissipated is given by:
$$E = \frac{1}{2} \cdot C \, (V_i - V_f)^2 \qquad (2.3)$$
The total amount of dynamic energy depends on the replacement policy, the
benchmarks, and the levels of tranquility. We must consider the effect of benchmarks to
evaluate dynamic power costs.
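The quadratic dependence on the voltage step can be checked numerically against Table 2.3. The Python sketch below is illustrative only: the per-byte capacitance for 130nm is not reported in the text, so it is back-solved here from the T1<->T2 entry of Table 2.3 (an assumption of this sketch), and the names `transition_energy` and `C_130NM` are hypothetical.

```python
def transition_energy(c_line: float, v_i: float, v_f: float) -> float:
    """Energy (J) dissipated when a capacitance c_line (F) is switched from v_i to v_f (V)."""
    return 0.5 * c_line * (v_i - v_f) ** 2


# Assumption: capacitance per byte inferred from the 130nm T1<->T2 entry
# of Table 2.3 (8.68E-15 J for a 0.2 V step); not a value given in the text.
C_130NM = 2 * 8.68e-15 / (1.3 - 1.1) ** 2

e_t1_t3 = transition_energy(C_130NM, 1.3, 0.9)   # 0.4 V step
e_t1_t4 = transition_energy(C_130NM, 1.3, 0.7)   # 0.6 V step
print(f"T1<->T3: {e_t1_t3:.2e} J, T1<->T4: {e_t1_t4:.2e} J")
```

Doubling or tripling the voltage step multiplies the energy by four or nine, which is the pattern visible along each row of Table 2.3.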
2.5.3 Performance Costs
Transitions between tranquility levels come at a cost in terms of performance. To
determine the exact wake-up time, we have run simulations to measure the time needed
to wake up an SRAM cell from each tranquility level to the full power mode. Figure 2.6
shows the simulated curves obtained by switching the power rails of an SRAM cell for 70nm
technology. We switch lines from different tranquility level voltages to full rail voltage
(from the left: first from T2, then from T3 and finally from T4). Time is measured
for each transition and compared to the clock periods proposed by Agarwal et al. [2] for
the corresponding technology.
8FO4 clock frequencies for the considered technologies, as suggested by Agarwal et
al. [2], are given in Table 2.4.
Table 2.4: 8FO4 clock frequencies used for our simulations
Technology   8FO4 Clock (GHz)   Cycle time (ns)
130nm        2.67               0.37
100nm        3.47               0.29
70nm         4.96               0.20
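The cycle times in Table 2.4 are the reciprocals of the clock frequencies, and a measured wake-up time can be converted into a penalty in cycles once the cycle time is known. The Python sketch below shows both steps. Rounding the penalty up to a whole number of cycles is an assumption of this sketch, and the 0.3 ns wake-up time in the example is a made-up illustrative value, not a measurement from our simulations.

```python
import math

# 8FO4 clock frequencies in GHz, copied from Table 2.4
CLOCK_GHZ = {"130nm": 2.67, "100nm": 3.47, "70nm": 4.96}


def cycle_time_ns(freq_ghz: float) -> float:
    """Clock period in nanoseconds for a frequency given in GHz."""
    return 1.0 / freq_ghz


def wakeup_cycles(wake_time_ns: float, freq_ghz: float) -> int:
    """Whole clock cycles needed to cover a wake-up time (assumed rounding up)."""
    return math.ceil(wake_time_ns * freq_ghz)


for tech, f in CLOCK_GHZ.items():
    print(tech, round(cycle_time_ns(f), 2), "ns")

# Hypothetical wake-up time, for illustration only.
print(wakeup_cycles(0.3, CLOCK_GHZ["70nm"]), "cycles")
```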
Figure 2.6: The power wake up cycle from different levels of tranquility in the 70nm technology
Table 2.5 shows wake-up penalties in terms of clock cycles. Our hspice simulations
using the predictive technologies [26] revealed that the wake-up penalty from the T2 level is
1 cycle and is 2 cycles from both T3 and T4. We observed that the trend is towards
increasing wake-up penalties, consistent with the observation by Agarwal et al. [2] that cache
access time is not scaling as fast as the clock. So for future technologies the wake-up penalty
from the lowest level of tranquility will be 3 cycles or even more.
Table 2.5: Wake up penalties in terms of cycles
Wake-up penalty (cycles)
Technology T1 T2 T3 T4
130nm 0 1 2 2
100nm 0 1 2 2
70nm 0 1 2 2
2.5.4 Maximum Possible Leakage Power Savings in Slumberous Caches
In this section we ignore any dynamic power costs of switching from one tranquility
level to another, to get an upper bound on the leakage power saving that
can be achieved by the different schemes discussed in Section 2.4.3. An upper bound for any
leakage power scheme employing DVS (Dynamic Voltage Scaling) is obtained by
keeping all cache lines at the lowest possible voltage level at all times, i.e. TL1-T4 for a
4-way set associative cache. All hits and misses will wake up one cache line, but
immediately put it back to the T4 level. In this section a 4-way set associative cache is
considered. Upper bounds are calculated for various slumberous cache schemes, viz. TL1-T4,
TL4 (each of the 4 replacement priority levels has its own tranquility level), TL2-T2,
TL2-T3 and TL2-T4. Table 2.6 shows the upper bounds on the average
leakage power saved per byte for these schemes.
Table 2.7 and Figure 2.7 show the same information in terms of percent savings
(relative to total leakage power). The maximum savings depend only on the technology,
the number of cache ways, and the tranquility levels. They are independent of the
replacement policy (because, at any time, the same number of lines is at any one
tranquility level). They do not include the dynamic power needed to switch between
tranquility levels (which is why we call them maximum).
Table 2.6: Average leakage power dissipated and maximum leakage power savings per byte
             Average Static Power        Average Static Power Saved per Byte (nW)
Technology   dissipated per Byte (nW)    TL1-T4   TL4     TL2-T2   TL2-T3   TL2-T4
130nm        9.86                        7.20     4.26    2.95     4.42     5.40
100nm        22.19                       15.48    9.14    6.28     9.54     11.61
70nm         64.43                       37.29    20.95   13.18    21.70    27.97
Table 2.7: Maximum percent leakage power savings
Technology TL1-T4 TL4 TL2-T2 TL2-T3 TL2-T4
130nm 73.00% 43.18% 29.92% 44.85% 54.75%
100nm 69.73% 41.19% 28.31% 42.97% 52.30%
70nm 57.87% 32.51% 20.46% 33.67% 43.40%
Of all approaches, TL1-T4, which keeps all lines at the minimum power level,
yields the best reduction of leakage power. TL2-T4 is second, TL2-T2 is last, and TL4 is
in between. These observations follow directly from the fact that leakage power savings
depend on tranquility levels. However, one must also contend with dynamic power and
performance penalties before making a final judgment.
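The upper bounds discussed above can be recomputed from the per-bit voltages and leakage currents of Table 2.2 using equations (2.1) and (2.2). The following Python sketch does so for a 4-way cache. The dictionary and function names are hypothetical, and small differences from the tabulated values can arise from rounding of the inputs.

```python
# (voltage in V, per-bit leakage current in nA) for T1..T4, copied from Table 2.2
LEVELS = {
    "130nm": [(1.3, 0.948), (1.1, 0.673), (0.9, 0.550), (0.7, 0.475)],
    "100nm": [(1.1, 2.522), (0.95, 1.818), (0.8, 1.481), (0.65, 1.292)],
    "70nm": [(0.9, 8.949), (0.8, 7.321), (0.7, 6.340), (0.6, 5.655)],
}

# Tranquility level held at priorities P1..P4 for each scheme
SCHEMES = {
    "TL1-T4": [4, 4, 4, 4],
    "TL4": [1, 2, 3, 4],
    "TL2-T2": [1, 2, 2, 2],
    "TL2-T3": [1, 3, 3, 3],
    "TL2-T4": [1, 4, 4, 4],
}


def baseline_power(tech: str) -> float:
    """Leakage power per byte (nW) with every line at T1, equation (2.1)."""
    v_dd, i_leak = LEVELS[tech][0]
    return 8 * v_dd * i_leak


def max_saving(tech: str, scheme: str) -> float:
    """Maximum leakage power saved per byte (nW), equation (2.2)."""
    assigned = SCHEMES[scheme]
    avg_vi = sum(LEVELS[tech][t - 1][0] * LEVELS[tech][t - 1][1] for t in assigned)
    avg_vi /= len(assigned)
    return baseline_power(tech) - 8 * avg_vi


for tech in LEVELS:
    row = ", ".join(f"{s}: {max_saving(tech, s):.2f}" for s in SCHEMES)
    print(f"{tech}: baseline {baseline_power(tech):.2f} nW; {row}")
```

Dividing each saving by the baseline power of the same technology gives percent savings comparable to those in Table 2.7.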
Figure 2.7: Maximum percent leakage power savings for various schemes
Though leakage energy savings increase exponentially with technology scaling, as
we will see in Section 2.7 (Evaluations of Slumberous Caches for Famous Cache Replacement
Algorithms), percent leakage savings decrease as technology scales (Figure 2.7). This
decrease in percent savings occurs because the difference between the T1 and Tn levels shrinks
with technology scaling. For the technologies considered it falls from 0.6V to 0.3V, i.e.
a 50% reduction. So ultimately state-destroying leakage energy saving techniques seem likely to
survive. Using replacement priority information for state-destroying leakage saving is
beyond the scope of this research.
2.6 Architectural Simulations
Whereas the leakage power savings are independent of the benchmarks, we must run
architectural simulations to understand the dynamic power and performance implications
of each power saving scheme.
2.6.1 Simplescalar Simulations
We modified the simplescalar code to provide the required statistics for the
calculation of the average power dissipation by the L1 data cache for different Spec2000
benchmark programs. Table 2.8 gives the processor model used for our simulations.
Table 2.8: Baseline microprocessor Simulation Model
Instruction Cache 16k 2-way set-associative, 32 byte blocks, 1 cycle latency
Data Cache 8k 4-way set-associative, 32 byte blocks, 1 cycle latency
Unified L2 Cache 1Meg 4-way set-associative, 64 byte blocks, 20 cycle latency
Memory 100 cycle round trip access
Out-of-Order Issue out-of-order issue of up to 4 instructions / cycle, 128 entry re-order buffer
Architecture Registers 32 integer, 32 floating point
Functional Units 4-integer ALU, 2-load/store units, 4-FP ALUs, 1-integer MULT/DIV, 1-FP MULT/DIV
We have used single sample Simpoints [65] of 100-million instructions each for
selected spec2000 programs. The resulting Simpoints are given in Table 2.9.
Table 2.9: The single simpoints for simulations of the Spec2000 benchmarks
Spec2000 Benchmark   gzip   gcc   mcf   parser   vpr    bzip2   twolf   equake   art
Single Simpoint      814    960   369   1030     1722   184     11      5496     42
2.7 Evaluations of Slumberous Caches for Famous Cache Replacement
Algorithms
The concept of slumberous caches can be applied to various replacement policies
with at least two priority levels; we considered three replacement policies, namely LRU
(Least Recently Used), PLRU (Pseudo LRU), and MRR (Modified Random
Replacement). We have concentrated on the design of the L1 data cache in the Pentium 4,
an 8k 4-way set associative cache with 32-byte lines.
2.7.1 Pseudo LRU (PLRU)
For completeness we review the PLRU policy, which approximates LRU. LRU is
difficult to maintain in wide caches because of the complexity of updating the state bits that
keep track of replacement priorities. To implement PLRU in a 4-way cache, we need
three state bits called Bit_0, Bit_1 and Bit_2. Table 2.10 shows the cache line priority
levels for different combinations of these three bits. The line at P1 is the MRU line and the line
at P4 is the line to replace.
Table 2.10: Cache-line replacement priorities for different combinations of Bit_0, Bit_1 and Bit_2
Bit_2 Bit_1 Bit_0 Line_0 Line_1 Line_2 Line_3
0 0 0 P4 P3 P2 P1
0 0 1 P2 P1 P4 P3
0 1 0 P3 P4 P2 P1
0 1 1 P1 P2 P4 P3
1 0 0 P4 P3 P1 P2
1 0 1 P2 P1 P3 P4
1 1 0 P3 P4 P1 P2
1 1 1 P1 P2 P3 P4
Figure 2.8 shows how the state bits are used to select a victim line. Bit_0 selects
between two groups of cache lines, group_0 (line_0 and line_1) or group_1 (line_2 and
line_3). Bit_1 selects between line_0 and line_1 and Bit_2 selects between line_2 and
line_3. If Bit_0 is 0 we don’t care about Bit_2, and Bit_1 decides which of line_0 or
line_1 is replaced. Similarly, if Bit_0 is 1 we don’t care about Bit_1, and Bit_2 selects
between line_2 and line_3. When a cache line is referenced we update the
state bits; e.g. if line_0 is accessed we set Bit_0 to 1 so that the next victim will be in
group_1, and we also set Bit_1 to 1 so that the next time group_0 is selected, line_1 will
be selected for replacement.
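The victim selection and update rules just described can be captured in a few lines of Python. This is a minimal behavioral sketch, not the hardware implementation: the class name `PlruSet` and its methods are hypothetical, and the priority ranking is derived from the bit roles in the text so that it can be compared with the rows of Table 2.10.

```python
class PlruSet:
    """Behavioral model of the three PLRU state bits of one 4-way set."""

    def __init__(self, bit_0: int = 0, bit_1: int = 0, bit_2: int = 0):
        self.bit_0, self.bit_1, self.bit_2 = bit_0, bit_1, bit_2

    def victim(self) -> int:
        """Index of the line to replace (priority P4)."""
        if self.bit_0 == 0:
            return self.bit_1          # group_0: 0 -> line_0, 1 -> line_1
        return 2 + self.bit_2          # group_1: 0 -> line_2, 1 -> line_3

    def touch(self, line: int) -> None:
        """Update the state bits when `line` is referenced."""
        if line < 2:
            self.bit_0 = 1             # next victim comes from group_1
            self.bit_1 = 1 - line      # within group_0, point at the other line
        else:
            self.bit_0 = 0             # next victim comes from group_0
            self.bit_2 = 3 - line      # within group_1, point at the other line

    def priorities(self) -> list[int]:
        """Priority (1 for P1 .. 4 for P4) of line_0..line_3."""
        g0_victim, g0_other = self.bit_1, 1 - self.bit_1
        g1_victim, g1_other = 2 + self.bit_2, 3 - self.bit_2
        if self.bit_0 == 0:            # group_0 holds P4 and P3
            order = [g1_other, g1_victim, g0_other, g0_victim]
        else:                          # group_1 holds P4 and P3
            order = [g0_other, g0_victim, g1_other, g1_victim]
        ranks = [0] * 4
        for rank, line in enumerate(order, start=1):
            ranks[line] = rank
        return ranks


s = PlruSet(bit_0=0, bit_1=0, bit_2=0)
print(s.priorities(), "victim:", s.victim())
s.touch(1)                             # hit on line_1
print(s.priorities(), "victim:", s.victim())
```

Starting from all bits at 0, a hit on line_1 moves the set to the state with Bit_0 = 1 and Bit_1 = Bit_2 = 0, and the printed priorities can be checked against the corresponding rows of Table 2.10.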
Figure 2.8: PLRU implementation for 4-way caches
Figure 2.9 shows the priority level transitions of cache lines on an access to a set. For
example, if a hit occurs at a cache line whose priority level is P3, then its priority goes to
P1, the priority of the line previously at P4 goes to P2, the line at P1 goes to P3 and the
line at P2 goes to P4. These priority level transitions are dictated by PLRU and result in
various tranquility level transitions, depending on the control scheme employed.
Figure 2.9: How hits at different Priority levels affect the cache under PLRU policy
Leakage savings under the PLRU algorithm are evaluated for the schemes described in
Section 2.5.4 (Maximum Possible Leakage Power Savings in Slumberous Caches).
2.7.1.1 Dynamic Power Penalties
Figure 2.10 shows the dynamic power required per byte to save leakage power in L1 data
cache for TL4 under PLRU.
Figure 2.10: Dynamic power incurred per byte for L1 data cache for TL4 under PLRU policy
This power varies over a wide range across the benchmarks and depends on the
number of hits at the P2-P4 levels and the number of misses. The dynamic power costs
under TL2-T4 differ from those under TL4 for the various benchmarks (see Figure 2.11); the
reason is the increased dynamic cost of hits in P2 and P3 and of misses. On average the dynamic
power cost more than doubles (Table 2.11). Dynamic power is significantly reduced as technology
scales. This is because the range of voltage levels between T1 and T4 shrinks as technology scales
and the dynamic power cost is proportional to the square of the voltage
difference.
Figure 2.11: Dynamic power incurred per byte for L1 data cache for TL2-T4 under PLRU policy
In Table 2.11 and Figure 2.12 we compare the average dynamic power
consumed by various schemes to save leakage power. The dynamic power is the average
dynamic power across all benchmarks, obtained by summing up the dynamic energy
needed for all the benchmarks and dividing the sum by the total execution time. In all
cases, dynamic power is by far the worst in TL1-T4, although the gap closes quickly
with scaled-down technologies. The curves can be explained by the voltage difference
between drowsy and full rail levels in the different schemes.
Figure 2.12: Average dynamic power cost for L1 data cache for different power schemes under PLRU
Table 2.11: Average dynamic power costs (nW) for various schemes under PLRU
Technology TL1-T4 TL4 TL2-T2 TL2-T3 TL2-T4
130nm 20.07 1.12 0.33 1.31 2.96
100nm 10.20 0.57 0.17 0.67 1.50
70nm 4.71 0.26 0.08 0.31 0.69
It is clear that we need to consider both the effects of leakage power and dynamic
power caused by the leakage power scheme in our evaluations.
The net power savings is:
Net Saving = Leakage Power Saved - Dynamic Power Incurred
[Figure 2.13 plot. x-axis: Spec2000 Benchmarks (gzip, gcc, mcf, parser, vpr, bzip2, twolf, equake, art, Average); y-axis: Net Savings per Byte (nW); series: 130nm, 100nm, 70nm]
Figure 2.13: Net Savings per byte for L1 data cache for TL4 under PLRU
Figure 2.13 shows the Net Savings for different benchmarks for TL4 scheme under
PLRU. We see that, as technology scales, net power savings become independent of the
benchmark because the dynamic power becomes negligible and the static power saved is
independent of the benchmark.
Whereas the Net Savings increases exponentially with scaled down technology, it is
important to look at the percent savings, as the leakage power also grows exponentially
with the technology.
The percent leakage power savings is:
%Net Saving = (Net Saving / Total Leakage Power) × 100
The %Net Savings for TL4 under PLRU are shown in Figure 2.14, across the
benchmarks. It shows that the percent savings remains steady across technologies, and
also becomes independent of the benchmark as technology scales down.
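As a worked example of the two definitions, the Python sketch below combines the average values of Table 2.6 (total leakage power and maximum leakage power saved per byte) with the average dynamic power costs of Table 2.11. Treating the Table 2.6 maxima as the leakage power saved, and treating both tables as per-byte averages, are assumptions of this sketch; the dictionary and function names are hypothetical.

```python
SCHEMES = ("TL1-T4", "TL4", "TL2-T2", "TL2-T3", "TL2-T4")

# Average static power dissipated per byte (nW), Table 2.6
TOTAL_LEAKAGE = {"130nm": 9.86, "100nm": 22.19, "70nm": 64.43}

# Average static power saved per byte (nW), Table 2.6
LEAKAGE_SAVED = {
    "130nm": (7.20, 4.26, 2.95, 4.42, 5.40),
    "100nm": (15.48, 9.14, 6.28, 9.54, 11.61),
    "70nm": (37.29, 20.95, 13.18, 21.70, 27.97),
}

# Average dynamic power cost (nW) under PLRU, Table 2.11
DYNAMIC_COST = {
    "130nm": (20.07, 1.12, 0.33, 1.31, 2.96),
    "100nm": (10.20, 0.57, 0.17, 0.67, 1.50),
    "70nm": (4.71, 0.26, 0.08, 0.31, 0.69),
}


def net_saving(tech: str, scheme: str) -> float:
    """Net Saving = Leakage Power Saved - Dynamic Power Incurred (nW per byte)."""
    k = SCHEMES.index(scheme)
    return LEAKAGE_SAVED[tech][k] - DYNAMIC_COST[tech][k]


def pct_net_saving(tech: str, scheme: str) -> float:
    """%Net Saving = (Net Saving / Total Leakage Power) x 100."""
    return 100.0 * net_saving(tech, scheme) / TOTAL_LEAKAGE[tech]


for tech in TOTAL_LEAKAGE:
    for scheme in SCHEMES:
        print(f"{tech} {scheme}: {net_saving(tech, scheme):7.2f} nW, "
              f"{pct_net_saving(tech, scheme):6.1f} %")
```

With these inputs the net saving of TL1-T4 is negative at 130nm, because its dynamic cost exceeds the leakage power it saves.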
[Figure 2.14 plot. x-axis: Spec2000 Benchmarks (gzip, gcc, mcf, parser, vpr, bzip2, twolf, equake, art, Average); y-axis: % Net Savings per Byte; series: 130nm, 100nm, 70nm]
Figure 2.14: % Net Savings per byte for L1 data cache for TL4 scheme under PLRU policy
Figure 2.15 shows the average net leakage power saved per byte across all
benchmarks. Regardless of the power scheme, Net Savings increase exponentially with
technology in all cases. Because of the explosive increase of leakage power and the rapid
drop in dynamic power with technology scaling, the net gains obtainable by TL1-T4 (i.e.
all lines at T4) are on a steeper upwards curve than those for any of the schemes
governed by the replacement policy. We observe that TL2-T4 gave the most savings
among schemes dictated by the replacement policy and that, for the most advanced
technology we have looked at, its savings are roughly equal to those of TL1-T4.
[Figure 2.15 plot. x-axis: Technology (130nm, 100nm, 70nm); y-axis: Average Net Leakage Power Saved per Byte (nW); series: TL1-T4, TL4, TL2-T2, TL2-T3, TL2-T4]
Figure 2.15: Power savings for L1 data cache under various schemes in future technologies for PLRU
Finally, the average net leakage power saving is compared to the maximum saving
computed in Section 2.5.4; all savings are shown as a percent of the TL1-T4 savings of Section
2.5.4 (see Figure 2.16).
Figure 2.16: Average net leakage power savings for L1 data cache for various schemes under PLRU
2.7.1.2 Performance Penalties
Figure 2.17 shows the average increase in L1 cache hit access time (in %) taken over
all the benchmarks. We do not show the case of TL1-T4, as TL1-T4 increases hit latency
by 200% and would obfuscate the comparison between the schemes driven by the
replacement policy, if we included it.
For the TL4 scheme we see approximately a 7% average increase in hit latency over
a cache with no leakage power scheme. The only scheme better than TL4 is TL2-T2, but
this scheme has the least leakage savings and does not trend well. TL2-T4 results in a 12%
increase in hit latency, and the increasing trend in wake-up penalties suggests that TL2-T4
may not scale well w.r.t. performance, whereas TL4 will continue to save leakage power
with little performance impact.
Figure 2.17: Percent increase in L1 access time for hits for various schemes under PLRU
2.7.2 LRU (Least Recently Used) Policy
Though LRU is considered hard to implement, some real-world systems do use
it, e.g. the UltraSPARC IV used LRU for its L2 cache [57]; hence we evaluate it in the same
way as PLRU.
There can be many possible implementations of LRU for a 4-way set associative
cache. For the sake of discussion consider an implementation that employs 2 bits per way
to assign a replacement priority to each line. Initially all bits are reset, hence all lines are at
the same priority level. Table 2.12 shows the replacement priority assigned to each bit
combination. Table 2.13 shows how hits in different cache lines of a specific set change
the replacement priority of the cache lines in the same set under the LRU policy.
Table 2.12: Replacement priority assignment for LRU
Bit Combination Replacement Priority
00 P1
01 P2
10 P3
11 P4
Table 2.13: Cache-line replacement priorities for hits in different cache lines under LRU
Hit @ Line_0 Line_1 Line_2 Line_3
Line_0 P1 P2 P3 P4
Line_1 P2 P1 P3 P4
Line_3 P3 P2 P4 P1
Line_1 P3 P1 P4 P2
Line_0 P1 P2 P4 P3
Line_2 P2 P3 P1 P4
Line_3 P3 P4 P2 P1
Line_2 P3 P4 P1 P2
Figure 2.18 shows the priority level transitions of cache lines on an access to a set
under LRU. For example, if a hit occurs at a cache line whose priority level is P3, then its
priority goes to P1, the priority of the line previously at P4 will remain unchanged, the
line previously at P1 goes to P2 and the line at P2 goes to P3. These priority level
transitions are dictated by LRU and result in various tranquility level transitions,
depending on the control scheme employed.
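The LRU priority update described above can be sketched in Python as follows. The function name `lru_hit` is hypothetical; the sketch tracks priorities directly as integers (1 for P1 .. 4 for P4) rather than the 2-bit encoding of Table 2.12, and it starts from the state in the first row of Table 2.13.

```python
def lru_hit(priorities: list[int], line: int) -> list[int]:
    """Return the new priorities of all lines after a hit on `line`.

    The hit line moves to P1; every line that was ahead of it (smaller
    priority number) moves back by one; lines behind it are unchanged.
    """
    old = priorities[line]
    return [
        1 if k == line else (p + 1 if p < old else p)
        for k, p in enumerate(priorities)
    ]


state = [1, 2, 3, 4]                    # first row of Table 2.13
for hit in (1, 3, 1, 0, 2, 3, 2):       # remaining hit sequence of Table 2.13
    state = lru_hit(state, hit)
    print(f"hit @ line_{hit}:", ["P%d" % p for p in state])
```

Each printed row can be compared with the corresponding row of Table 2.13. A miss can be modeled in the same way by treating the refilled P4 line as the hit line, which is an assumption of this sketch.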
Figure 2.18: How hits at different Priority levels affect the cache under LRU policy
Similar to the PLRU evaluations, the various LRU schemes are denoted as TL2-T2,
TL2-T3, TL2-T4 and TL4.
2.7.2.1 Dynamic Power Penalties
Figure 2.19 shows the dynamic power required per byte to save leakage power in L1 data
cache for TL4.
Figure 2.19: Dynamic power cost for L1 data cache for TL4 scheme under LRU policy
Although Figure 2.19 looks very similar to Figure 2.10, the dynamic power
costs for the same benchmark differ between the two replacement algorithms considered.
Table 2.14 compares TL4 schemes under LRU and PLRU with respect to dynamic power
consumption. It is interesting to observe that the dynamic power incurred is always lower for
PLRU except for art, which has a miss rate of almost 50%. This confirms the intuition that
more complex algorithms are in general more power hungry. In both cases dynamic
power is significantly reduced as technology scales. Hence for 70nm technology both
policies on average consume nearly the same power.
Table 2.14: Dynamic power costs for different benchmarks for TL4 schemes under PLRU and LRU
            130nm             100nm             70nm
Benchmark   LRU      PLRU     LRU      PLRU     LRU      PLRU
gzip        0.8050   0.8027   0.4089   0.4077   0.1889   0.1883
gcc         0.3654   0.3527   0.1856   0.1791   0.0857   0.0827
mcf         0.9570   0.9411   0.4861   0.4780   0.2245   0.2208
parser      0.8727   0.7920   0.4433   0.4023   0.2047   0.1858
vpr         0.7508   0.7253   0.3814   0.3684   0.1761   0.1702
bzip2       0.4665   0.4543   0.2369   0.2307   0.1094   0.1066
twolf       0.6467   0.6182   0.3285   0.3140   0.1517   0.1450
equake      1.9952   1.9833   1.0134   1.0074   0.4681   0.4653
art         3.3741   3.4094   1.7138   1.7318   0.7916   0.7999
Average     1.1370   1.1199   0.5775   0.5688   0.2668   0.2627
Table 2.15 shows the average dynamic power consumed by various schemes
to save leakage power under the LRU policy. These values follow the same trend as those in
Table 2.11 but are slightly higher.
The net saving for LRU is calculated in the same way as for PLRU.
Table 2.15: Average dynamic power costs for various schemes under LRU
Technology TL1-T4 TL4 TL2-T2 TL2-T3 TL2-T4
130nm 20.51 1.14 0.34 1.35 3.03
100nm 10.42 0.58 0.17 0.68 1.54
70nm 4.81 0.27 0.08 0.32 0.71
Figure 2.20 shows the Net Savings across the benchmarks for TL4 under
LRU. The observation is the same: as technology
scales, net power savings become independent of the benchmark because the dynamic
power becomes negligible and the static power saved is independent of the benchmark.
Figure 2.20: Net savings for L1 data cache for TL4 under LRU policy
The % Net Savings for TL4 under LRU across different benchmarks are shown in
Figure 2.21. It shows that the percent savings remain steady across technologies, and also
become independent of the benchmark as technology scales down.
Figure 2.21: Percent net savings for L1 data cache for TL4 under LRU
Figure 2.22 shows the average net leakage power saved per byte across all
benchmarks. Regardless of the power scheme, Net Savings increases exponentially with
technology in all cases. Because of the explosive increase of leakage power and the rapid
drop in dynamic power with technology scaling, the net gains obtainable by TL1-T4 (i.e.
all lines at T4) are on a steeper upwards curve than those for any of the schemes
governed by the replacement policy. It is observed that TL2-T4 gave the most savings
among schemes dictated by the replacement policy and that, for the most advanced
technology we have looked at, its savings are roughly equal to those of TL1-T4.
Figure 2.22: Power savings for L1 data cache under various schemes in future technologies for LRU
Figure 2.23 compares the average net leakage power saving as a percent of the TL1-T4
savings of Section 2.5.4.
Figure 2.23: Average net leakage power savings for L1 data cache for various schemes under LRU
2.7.2.2 Performance Penalties
Figure 2.24 shows the average increase in L1 cache hit access time (in %) taken over
all the benchmarks. TL1-T4 is not shown, as TL1-T4 increases hit latency by 200% and
would obfuscate the comparison between the schemes driven by the replacement policy,
if we included it.
For the TL4 scheme we see approximately an 8% average increase in hit latency over a
cache with no leakage power scheme. The only scheme better than TL4 is TL2-T2, but
this scheme has the least leakage savings and does not trend well. TL2-T4 results in a 14%
increase in hit latency, and the increasing trend in wake-up penalties suggests that TL2-T4
may not scale well w.r.t. performance, whereas TL4 will continue to save leakage power
with little performance impact.
Table 2.16 compares various power saving schemes for PLRU and LRU, where
performance impact means the percent increase in the hit latency of the L1 data cache. In all
cases the performance impact of LRU is greater than that of PLRU. The average miss rate
for the considered benchmarks is lower for LRU (9.40%) than for PLRU (9.62%); hence
this difference in performance arises because LRU has 2.6% more hits in non-MRU cache
lines than PLRU.
Table 2.16: Comparing performance impact of various power saving schemes under LRU and PLRU
Power Saving Scheme Performance impact (PLRU) Performance impact (LRU)
TL4 7.20% 7.57%
TL2-T2 6.26% 7.11%
TL2-T3 12.52% 14.23%
TL2-T4 12.52% 14.23%
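The structure of these numbers can be understood with a simple model in which each hit pays the wake-up penalty of the tranquility level its line was held at. The Python sketch below is only an illustration under stated assumptions: it uses the penalties of Table 2.5 and the 1-cycle L1 data cache latency of Table 2.8, but the per-priority hit fractions are made-up illustrative values (the text only states that more than 94% of hits go to the MRU line), and the function name `hit_latency_increase_pct` is hypothetical.

```python
# Wake-up penalty in cycles from tranquility levels T1..T4 (Table 2.5)
WAKE_PENALTY = {1: 0, 2: 1, 3: 2, 4: 2}


def hit_latency_increase_pct(hit_fractions, levels, base_latency_cycles=1):
    """Average percent increase in hit latency.

    hit_fractions[k] is the fraction of hits at priority P(k+1) and
    levels[k] is the tranquility level held at that priority.
    """
    extra = sum(f * WAKE_PENALTY[t] for f, t in zip(hit_fractions, levels))
    return 100.0 * extra / base_latency_cycles


# Hypothetical hit distribution over P1..P4, for illustration only.
hits = [0.94, 0.04, 0.01, 0.01]

for name, levels in {
    "TL1-T4": [4, 4, 4, 4],
    "TL4": [1, 2, 3, 4],
    "TL2-T2": [1, 2, 2, 2],
    "TL2-T3": [1, 3, 3, 3],
    "TL2-T4": [1, 4, 4, 4],
}.items():
    print(f"{name}: {hit_latency_increase_pct(hits, levels):.1f} %")
```

Under this model TL2-T3 and TL2-T4 always give the same result, because both levels have a 2-cycle penalty, and that result is twice the TL2-T2 value; the same pattern appears in Table 2.16.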
Figure 2.24: Percent increase in L1 access time for hits for various schemes under LRU
2.7.3 Modified Random Replacement (MRR) Policy
It was mentioned in Section 2.4 that the slumberous cache idea can be applied only to
replacement algorithms that have at least two priority levels. The random replacement
algorithm is used in real-world systems, e.g. Sun’s Niagara [36]. To make it fit the
slumberous cache idea, MRU information is added to the random replacement algorithm, and
the result is named Modified Random Replacement (MRR). MRR has two priority levels, i.e. MRU and
Non-MRU. For a 4-way set-associative cache, three ways will be at the Non-MRU priority
level and one at the MRU level; hence a single bit is needed to differentiate between the
priority levels.
Table 2.17: Replacement priority assignment for MRR
Bit Replacement Priority
0 P2
1 P1
Table 2.18 shows how replacement priorities for different cache lines change
corresponding to hits in different ways within a set.
Table 2.18: Cache-line replacement priorities for hits in different cache line under MRR
Hit @ Line_0 Line_1 Line_2 Line_3
Line_0 P1 P2 P2 P2
Line_1 P2 P1 P2 P2
Line_3 P2 P2 P2 P1
Line_3 P2 P2 P2 P1
Line_0 P1 P2 P2 P2
Line_2 P2 P2 P1 P2
Line_3 P2 P2 P2 P1
Line_2 P2 P2 P1 P2
Figure 2.25: How hits at different Priority levels affect the cache under MRR replacement policy
Figure 2.25 shows the priority level transitions of cache lines on an access to a set
under MRR. For example, if a hit occurs at a cache line whose priority level is P2, then
its priority goes to P1 and the priority of the line previously at P1 goes to P2. These priority
level transitions are dictated by MRR and result in various tranquility level transitions,
depending on the control scheme employed.
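The MRR bookkeeping for one cache set can be illustrated with a minimal sketch. The class and method names are hypothetical, and choosing the victim uniformly at random among the Non-MRU (P2) lines is an assumption; the text only states that MRU information is added to random replacement. The bit encoding follows Table 2.17 (1 means P1, 0 means P2).

```python
import random


class MRRSet:
    """One cache set under Modified Random Replacement (illustrative only)."""

    def __init__(self, ways=4, rng=None):
        # One bit per line, as in Table 2.17: 1 -> P1 (MRU), 0 -> P2 (Non-MRU).
        self.mru_bit = [0] * ways
        self.rng = rng or random.Random()

    def touch(self, way):
        """A hit in `way` moves it to P1; the previous P1 line drops to P2."""
        self.mru_bit = [1 if w == way else 0 for w in range(len(self.mru_bit))]

    def priorities(self):
        return ["P1" if bit else "P2" for bit in self.mru_bit]

    def victim(self):
        """Assumed policy: pick a replacement victim at random among P2 lines."""
        candidates = [w for w, bit in enumerate(self.mru_bit) if bit == 0]
        return self.rng.choice(candidates)


cache_set = MRRSet(ways=4, rng=random.Random(0))
for hit_way in (0, 1, 3):
    cache_set.touch(hit_way)
    print(f"Hit @ Line_{hit_way}:", cache_set.priorities())
```

Each printed row has exactly one P1 entry, in the column of the line that was hit, which is the pattern of the rows of Table 2.18.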
MRR performance compared to LRU and PLRU is shown in Table 2.19. As far as
IPC is concerned, MRR is a little better than PLRU and a little worse than LRU for the
selected benchmarks.
Table 2.19: Comparing IPCs of various replacement algorithms
IPC
MRR PLRU LRU
gzip 2.02 1.99 2.04
gcc 0.53 0.52 0.52
mcf 0.86 0.86 0.87
parser 2.06 2.08 2.12
vpr 0.90 0.90 0.91
bzip2 0.68 0.67 0.68
twolf 1.45 1.38 1.47
equake 0.99 1.00 0.99
art 2.00 1.43 1.44
Average 1.19 1.17 1.20
TL4 is not possible for MRR, so, similar to the evaluations for PLRU and LRU,
schemes TL2-T2, TL2-T3 and TL2-T4 are considered.
2.7.3.1 Dynamic Power Penalties
Figure 2.26 shows the dynamic power required per byte to save leakage power in L1 data
cache for TL2-T4 under MRR.
Figure 2.26: Dynamic power incurred per byte for L1 data cache for TL2-T4 under MRR policy
Table 2.20 compares the TL2-T4 scheme for the PLRU, LRU and MRR replacement
policies with respect to dynamic power consumption. It is interesting to observe that in
all cases MRR is as good as PLRU. LRU, being more complex to implement, is more
power hungry. In all cases dynamic power is significantly reduced as technology scales.
Hence for 70nm technology all cases incur almost the same dynamic power cost.
Table 2.20: Dynamic power costs for different benchmarks for the TL2-T4 scheme under PLRU, LRU and MRR
PLRU LRU MRR PLRU LRU MRR
130nm 130nm 130nm 70nm 70nm 70nm
gzip 1.19 1.26 1.19 0.28 0.29 0.28
gcc 0.85 0.87 0.85 0.20 0.20 0.20
mcf 10.53 10.74 10.53 2.47 2.52 2.47
parser 1.86 2.05 1.86 0.44 0.48 0.44
vpr 1.78 1.82 1.78 0.42 0.43 0.42
bzip2 0.86 0.89 0.86 0.20 0.21 0.20
twolf 1.20 1.26 1.20 0.28 0.30 0.28
equake 5.05 5.16 5.05 1.18 1.21 1.18
art 3.28 3.25 3.28 0.77 0.76 0.77
Average 2.96 3.03 2.96 0.69 0.71 0.69
Table 2.21: Average dynamic power costs for various schemes under MRR
TL1-T4 TL2-T2 TL2-T3 TL2-T4 TL1-T4
130nm 20.51 0.33 1.31 2.96 20.51
100nm 10.42 0.17 0.67 1.50 10.42
70nm 4.81 0.08 0.31 0.69 4.81
In Table 2.21 the average dynamic power consumed by various schemes
under the MRR policy is shown. Net savings for MRR are calculated as for LRU and PLRU.
Figure 2.27 shows the net savings for different benchmarks in the case of TL2-T4 under
MRR. The observation is the same, i.e. as technology scales, net power savings become
independent of the benchmark because the dynamic power becomes negligible and the
static power saved is independent of the benchmark.
Figure 2.27: Net savings per byte for L1 data cache for TL2-T4 under MRR
The percent net savings for TL2-T4 across different benchmarks are shown in Figure
2.28. It shows that the percent savings remain steady across technologies and also
become independent of the benchmark as technology scales down.
Figure 2.28: Percent net savings per byte for L1 data cache for TL2-T4 scheme under MRR
Figure 2.29 shows the average net leakage power saved per byte across all
benchmarks. Regardless of the power scheme, Net Savings increases exponentially with
technology in all cases. Because of the explosive increase of leakage power and the rapid
drop in dynamic power with technology scaling, the net gains obtainable by TL1-T4 (i.e.
all lines at T4) are on a steeper upwards curve than those for any of the schemes
governed by the replacement policy. It is observed that TL2-T4 gave the most savings
among schemes dictated by the replacement policy and that, for the most advanced
technology we have looked at, its savings are roughly equal to those of TL1-T4.
Figure 2.29: Power savings for L1 data cache under various schemes in future technologies for MRR
Figure 2.30: Average net leakage power savings for L1 data cache for various schemes under MRR
Finally, the average net leakage power saving is compared to the maximum saving
computed in Section 2.5.4. All savings are shown as a percentage of the TL1-T4 savings of
Section 2.5.4 (see Figure 2.30).
2.7.3.2 Performance Penalties
Figure 2.31 shows the average increase in L1 cache hit access time (in %) taken over
all the benchmarks. TL1-T4 is not shown, as TL1-T4 increases hit latency by 200% and
would obfuscate the comparison between the schemes driven by the replacement policy,
if we included it.
For the TL2-T4 scheme we see approximately 13% average increase in hit latency
over a cache with no leakage power scheme. TL2-T2 and TL2-T3 both have the same
performance impact of approximately 6%.
Figure 2.31: Percent increase in L1 access time for hits for various schemes under MRR
Table 2.22 compares various power saving schemes under PLRU, LRU and MRR, where
performance impact means the percent increase in the hit latency of the L1 data cache. In all
cases the performance impact of LRU is the largest, whereas that of MRR is the
smallest.
Table 2.22: Comparing performance impact of power saving schemes under PLRU, LRU and MRR
Power Saving Scheme Performance impact (PLRU) Performance impact (LRU) Performance impact (MRR)
TL4 7.20% 7.57% NA
TL2-T2 6.26% 7.11% 6.17%
TL2-T3 12.52% 14.23% 12.33%
TL2-T4 12.52% 14.23% 12.33%
2.7.4 Best Leakage Control Scheme for L1 data Cache
So far 12 leakage control schemes based on the replacement algorithm were
discussed. To find the best of these, a metric is devised: the net leakage saving per
percent increase in the hit latency of the L1 data cache. This metric will be referred to as the
Leakage Energy Saving Metric (LESM) in the rest of the document.
Table 2.23: Comparing all 12 leakage control schemes with respect to LESMs
LESM
130nm 100nm 70nm
PLRU
TL4 0.44 1.19 2.87
TL2-T2 0.44 0.99 2.10
TL2-T3 0.29 0.73 1.72
TL2-T4 0.28 0.85 2.20
LRU
TL4 0.41 1.13 2.73
TL2-T2 0.38 0.87 1.85
TL2-T3 0.25 0.64 1.51
TL2-T4 0.24 0.75 1.93
MRR
TL2-T2 0.44 1.00 2.13
TL2-T3 0.29 0.74 1.74
TL2-T4 0.28 0.86 2.23
As per Table 2.23, PLRU4 is the best scheme for all technologies considered. For
130nm technology PTL2-T2 and MRR-T2 are as good as PLRU4. PLRU4 is a slumberous
scheme in the real sense, whereas the others are drowsy-compatible versions using replacement
policies. Hence slumberous caches are shown to be better than drowsy caches as
technology scales down.
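The ranking read off Table 2.23 can be reproduced with a small sketch. The LESM values are copied from Table 2.23; the function and variable names are hypothetical, and `lesm` simply encodes the definition of Section 2.7.4 (net leakage saving divided by percent increase in hit latency) without assuming particular units for the saving.

```python
TECHS = ("130nm", "100nm", "70nm")

# LESM values from Table 2.23, keyed by (replacement policy, scheme).
LESM = {
    ("PLRU", "TL4"): (0.44, 1.19, 2.87),
    ("PLRU", "TL2-T2"): (0.44, 0.99, 2.10),
    ("PLRU", "TL2-T3"): (0.29, 0.73, 1.72),
    ("PLRU", "TL2-T4"): (0.28, 0.85, 2.20),
    ("LRU", "TL4"): (0.41, 1.13, 2.73),
    ("LRU", "TL2-T2"): (0.38, 0.87, 1.85),
    ("LRU", "TL2-T3"): (0.25, 0.64, 1.51),
    ("LRU", "TL2-T4"): (0.24, 0.75, 1.93),
    ("MRR", "TL2-T2"): (0.44, 1.00, 2.13),
    ("MRR", "TL2-T3"): (0.29, 0.74, 1.74),
    ("MRR", "TL2-T4"): (0.28, 0.86, 2.23),
}


def lesm(net_leakage_saving, hit_latency_increase_pct):
    """Net leakage saving per percent increase in L1 hit latency."""
    return net_leakage_saving / hit_latency_increase_pct


def best_schemes(tech):
    """Return every (policy, scheme) pair that attains the highest LESM."""
    col = TECHS.index(tech)
    top = max(values[col] for values in LESM.values())
    return sorted(key for key, values in LESM.items() if values[col] == top)


for tech in TECHS:
    print(tech, best_schemes(tech))
```

For 130nm the sketch returns three tied entries (TL4 and TL2-T2 under PLRU, TL2-T2 under MRR); for 100nm and 70nm only TL4 under PLRU remains.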
2.7.5 Extension to Unified L2 Cache
To evaluate the leakage power saving schemes in the context of a unified L2 cache,
a 4-way set-associative cache with the PLRU replacement policy is considered. The block
size is 64 bytes and the cache size is 1 MByte. Since the L2 cache is not accessed as
frequently as L1, the dynamic power expended in switching between the tranquility levels
becomes negligible. The average dynamic power incurred for the unified L2 cache is on the order
of pW per byte whereas, for L1 data caches, it was on the order of nW. Thus dynamic
power costs are negligible across all schemes, as shown in Figure 2.32.
Figure 2.32: Dynamic power cost for L2 cache for various schemes in future technologies for PLRU
When we compare net leakage power savings per byte in L2 for different schemes,
we see that TL1-T4 is always the best (see Figure 2.33) and is also trending up faster with
technology.
Figure 2.33: Leakage power savings for L2 cache for various schemes in future technologies for PLRU
The L2 cache hit latency of the CPU model used is 20 cycles, so even if we put the whole
L2 cache at the T4 level at all times we will have only a 10% increase in L2 latency. The impact on
L2 cache latency has a curve similar to that of L1 but is less than 2% in all cases (see
Figure 2.34).
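The 10% figure is simple arithmetic, sketched below. The 2-cycle wake-up penalty for T4 is an assumption inferred from the numbers quoted here (10% of a 20-cycle L2 hit), as is the 1-cycle L1 hit latency implied by the 200% increase reported for TL1-T4 on L1; the function name is hypothetical.

```python
def hit_latency_increase_pct(base_hit_cycles, wakeup_cycles):
    """Percent increase in hit latency when every hit pays the wake-up penalty."""
    return 100.0 * wakeup_cycles / base_hit_cycles


T4_WAKEUP_CYCLES = 2  # assumed value, inferred from the 10% figure for L2

print(hit_latency_increase_pct(20, T4_WAKEUP_CYCLES))  # L2 with a 20-cycle hit
print(hit_latency_increase_pct(1, T4_WAKEUP_CYCLES))   # L1, assuming a 1-cycle hit
```

The same fixed penalty that triples an assumed single-cycle L1 hit adds only a tenth to the 20-cycle L2 hit, which is why TL1-T4 is attractive for L2.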
Figure 2.34: Percent increase in L2 access time for hits for various schemes under PLRU
2.7.6 Immunity to Soft Errors and Reliability of Slumberous Caches
Soft errors will be a main concern in future microprocessors [69] owing to miniature
feature sizes and larger chip areas. As cache memories occupy most of the chip’s
real estate today, they have to be more reliable. Reducing the voltage of cache lines makes
them more vulnerable to soft errors; as mentioned in [16], soft error vulnerability
increases exponentially with decreasing supply voltage. Further, as pointed out in [35],
MRU lines are more vulnerable to soft errors. Hence slumberous cache schemes, which
never put MRU lines into drowsy mode and gradually decrease the voltage of a cache
line as its replacement priority is lowered, seem more promising in view of power,
performance and reliability. A detailed evaluation of the reliability of slumberous caches is
beyond the scope of this research.
2.7.7 Conclusions
From this research it is established that substantial leakage energy can be saved in future
technologies if some tranquility levels for the caches are selected and individual cache
lines are switched to a tranquility level proportional to their frequency of utilization.
The replacement policy can be used to discriminate between more frequently used and less
frequently used cache lines to decide which power-save level they should be
switched to. Our experimental results showed that on average 45% to 32% of the leakage
power can be saved for the 130nm to 70nm technologies. The dynamic energy cost to implement
the proposed scheme becomes negligible as technology scales and also as cache sizes
increase for a fixed technology. So the above-mentioned percent savings are independent of
the program execution and cache size used. The performance effect of this scheme is very
small, and it decreases towards no performance impact as technology scales.
Two-priority-level schemes were considered to contrast with drowsy caches. Our
scheme is similar to the drowsy cache scheme in that we also reduce the supply voltage
to different cache lines. But the drowsy cache scheme puts the entire cache into drowsy mode
at regular intervals in the case of the L1 data cache, and for the L1 instruction cache it
introduces a form of bank prediction to put an entire bank into drowsy mode. To
mitigate the performance impact we never reduce the supply voltage of P1 priority level
cache lines; we only put the P2-P4 levels either at multiple levels of tranquility, in the case of TL4,
or at two levels of tranquility, as in the two-level schemes discussed. For the two-level
schemes, different cases of assigning the T2, T3 or T4 voltage level to all three
priority levels from P2-P4 were considered. Comparing all 12 schemes for the L1 data cache
with respect to LESMs showed TL4 under PLRU to be the best scheme, which also
indicates the superiority of the slumberous scheme over drowsy-type schemes. On account of its very
small dynamic cost and very small performance impact, TL1-T4 seems to be the best case for
L2 caches, i.e. put the whole L2 cache in the deepest tranquility level, wake a cache line up
only when needed and put it back to sleep in the very next cycle.
Another important point is that although percent leakage energy savings
decrease as technology scales, on account of the decreasing voltage difference between
different tranquility levels, the absolute leakage energy savings increase 2-4 times
(depending on the cache size and replacement policy) from 130nm to 70nm technology.
Chapter 3: Dynamic Power Control
The low-power technique presented in the last chapter addressed the static aspect of the
power dissipated in a modern processor. In this chapter we present a technique to reduce
the dynamic power dissipated in a processor chip. Since the clock tree in modern processors
dissipates a significant portion of the total chip dynamic power, we target reducing the power
dissipated in the clock tree.
3.1 Clock Gating
Clock gating is a well known technique used to reduce power dissipation in clock
associated circuitry. The idea of clock gating is to shut down the clock of any component
whenever it is not being used (accessed). It involves inserting combinational logic along
the clock path to prevent the unnecessary switching of sequential elements. The
conditions under which the transition of a register may be safely blocked should
automatically be detected. This problem is the target of our paper.
In out-of-order superscalar processors, branch mispredictions cause wrong-path
instructions to be executed since there is a lag between the branch prediction, actual
branch resolution, and subsequent commit of the branch. The wrong-path instructions are
of course never committed to the actual state of the processor; however, because they are
issued and executed, they can give rise to two negative effects: performance degradation
and power waste.
Many researchers have worked on eliminating or reducing the power consumed by
wrong-path instructions. These schemes are primarily probabilistic in nature. They rely
on some kind of branch history as explained next. The pipeline gating technique of [36]
assigns confidence levels about their prediction accuracy to branches based on their
prediction history. When the number of low confidence branches exceeds a preset
threshold, the instruction fetch and decode are stopped. This method suffers from both
performance overhead and lost energy saving opportunities since some low confidence
branches may be predicted correctly while some high confidence branches are in fact
predicted wrongly. Reference [4] improves on the all-or-nothing throttling mechanism of
[36] by having different types and degrees of throttling. The pipeline balancing technique of
[6] monitors the IPC value over a 256-instruction window and disables clusters of
functional units upon detection of the low IPC state (assuming that the program execution
will stay in that state during the next instruction window). Since decisions are taken over
a period of 256 instructions, rapid changes in program behavior result in performance
loss and energy waste.
In [43] the authors propose a deterministic clock gating (DCG) approach which takes
advantage of the resource utilization information available in advance. When it is known
ahead of time (i.e., at the issue stage) that some of the processor resources will not be
used, clock gating signals are generated, at the issue stage, to clock-gate these resources
during their idle times. Another approach, called transparent clock gating [31], enhances
the existing clock gating in latch-based pipelines by keeping the latches transparent by
default, i.e., by not clocking them. Latches are clocked only when there is a need to avoid
a data race condition. The register-level clock gating of [32] introduces the concept of clock
gating parts of stage registers, i.e., when there are not enough instructions to be issued,
parts of the stage register associated with the issue stage are clock gated. In [10] the authors
present a value-based clock gating scheme, which exploits the fact that although the
processor word size has increased to 64 bits and beyond, arithmetic operations on much
smaller bit widths are more common. So while performing operations on smaller
numerical values, higher order bits of the functional units can be clock gated.
Most of the previous work on clock gating either ignores the fact that a noticeable
fraction of the total power is dissipated in executing wrong-path instructions during
branch misprediction or uses a probabilistic approach to avoid the resulting power waste.
In this research we take branch misprediction as an opportunity for clock gating the
unnecessarily-used processor resources by deterministically detecting the wrong-path
instructions.
3.2 Using Branch Mispredictions for Clock Gating
Many of the currently available state-of-the-art microprocessors have complex
pipelines with multiple functional units and very wide issue widths [36] in order to offer
high level of parallelism in hardware. Clearly the Instruction Level Parallelism (ILP)
varies a lot across different applications; as a result, not all applications are able to utilize
the full set of resources in a modern processor. In addition, since the processors are
designed to account for the peak performance, many of the applications which are not
able to exploit the available hardware resources end up underutilizing them. Figure 3.1
shows this underutilization across different benchmarks in terms of average percentage of
idle cycles for integer ALUs in a SimpleScalar simulation with an issue width of 4. As
we can see, the integer ALU is idle, on average, for 41.02% of the time. This resource
underutilization provides us with opportunities for power saving by clock gating.
[Figure 3.1 plot: bars for Idle Cycles and Instructions, as a percentage (0 to 70), for BZIP, GCC, GZIP, CJPEG, DJPEG, APSI, EQUAKE, MESA, WUPWISE and Average]
Figure 3.1: Idle time and wrong-path instruction fraction for integer ALUs
Most of the current state-of-the-art microprocessors employ aggressive branch
prediction in order to boost performance. Although branch predictors help increase the
processor performance, when a branch is mispredicted, many of the wrong-path
instructions (i.e., instructions that are on the predicted path of the mispredicted branch)
are still executed. Due to the out-of-order execution in modern processors, at the time
when a branch is resolved and found to be mispredicted, there can be a mix of correct
path and wrong-path instructions in the execution pipelines and the instruction queue.
Because of the prohibitive complexity of a selective squashing mechanism, many processor
architectures do not flush the pipeline until the mispredicted branch reaches the head of
the ReOrder Buffer (ROB), so that one is assured that all the instructions on the correct
path have retired. (Note that instruction fetch and decode are stopped upon detecting a
branch misprediction.) As a result many of the wrong-path instructions are still executed
only to be thrown away when the pipeline is flushed. Figure 3.1 shows the fraction of
instructions that are executed but never committed (retired) due to mispredicted branches,
with respect to the total number of instructions executed. This estimate is obtained from
SimpleScalar simulation, using the processor configuration that is described in detail in
the experimental results section, and shows that on average around 8.29% of the
executed instructions are due to mispredicted branches. These instructions not only
consume power in functional units during their execution, but also consume power in (i)
register files by reading their input operands; and (ii) caches by executing wrong-path
loads. The impact of these wrong-path instructions on power dissipation is even more
severe with deeper pipelines on account of increased branch misprediction penalty.
As stated earlier, many of the wrong-path instructions are executed even after the
branch is resolved. More precisely, when a branch is resolved to be mispredicted, there
may exist wrong-path instructions which a) have already been issued and thus they either
are in the pipeline or have been completed (type (i)), or b) have not been issued yet, i.e.,
they are still in the issue queue (IQ) (type (ii)). By the time the mispredicted branch
reaches the head of the ROB, many of the instructions which are still in IQ (type (ii))
could be issued to execution units. It is quite expensive (from a hardware cost and control
point of view) to identify and prune type (i) instructions. Fortunately, it is easy to stop the
second set of instructions from being issued, which in turn can result in considerable
power saving.
In Figure 3.2, the bars on the left within each set show the average number of type (i)
+ type (ii) instructions when the mispredicted branch retires. This number tells us the
average number of wrong-path instructions that could be prevented from being issued if
we had a perfect oracle that would tell us which instruction is or will be in the wrong-path.
The bars on the right within each set show the average number of type (ii)
instructions when the mispredicted branch retires, i.e., the wrong-path instructions issued after
the branch is resolved to be mispredicted and before it retires. These are the wrong-path
instructions which can actually be prevented from being issued and executed. These
results show that 92.63% of the wrong-path instructions are issued after the branch is
resolved, which provides a great opportunity for power saving via clock gating.
[Figure 3.2 plot: bars for Type(i)+Type(ii) and Type(ii), as average instructions (0.00 to 25.00), for BZIP, GCC, GZIP, CJPEG, DJPEG, APSI, EQUAKE, MESA, WUPWISE and Average]
Figure 3.2: Average number of wrong-path instructions per mispredicted branch
3.3 Proposed Clock Gating Architecture
Based on the aforesaid observations, we present two clock gating techniques that 1)
make use of idle cycles in pipelined functional units when some stage of the functional
unit is idle, and 2) prevent wrong-path instructions of type (ii) from being issued.
The first clock gating technique, called Pipelined Functional unit Clock Gating
(PFCG), is straightforward and is presented and implemented here only to serve as a
baseline against which the power efficiency of the second technique, i.e., WPCG, is
compared.
3.3.1 Pipelined Functional Unit Clock Gating (PFCG)
Figure 3.3 depicts the PFCG technique at the architectural level. The proposed
architecture utilizes the idleness of various stages of structurally-pipelined functional
units in a processor pipeline.
Note that different stages of a pipelined FU can be idle due to any of a number of
reasons:
o Typically the total number of FUs, including integer and floating point
functional units, is larger than the processor’s issue width. Hence not all the
FUs are used in every cycle of the program’s execution.
o Different applications exhibit different degrees of instruction level
parallelism (ILP) and therefore the FU’s usage varies across different
programs.
o Different application programs exercise different sets of FUs. For example,
integer programs will use a completely different set of FUs (integer
ones) compared to the floating point programs.
o Because FUs are structurally pipelined with multi-cycle latencies (but a
throughput of 1 operation per cycle), depending on the number of operations
that are concurrently being executed on the same functional unit, one or more
stages of the pipelined FU may be idle at any given clock cycle.
[Figure 3.3 diagram; recoverable labels: Issue Logic, Data Bus, To writeback]
Figure 3.3: PFCG Architecture
In modern processors, the decoded instructions, after renaming, are stored in an
issue queue (IQ), where they wait for their input operands to become available (if these
operands are being produced by some instruction in the pipeline). The issue logic
examines all instructions that have both of their operands ready and issues n instructions
(for an issue width of n) to appropriate FUs assuming that the corresponding FUs are
available. We define a pipeline stage of an FU as an input register set plus the
combinational logic that succeeds it. In the presented clock gating (CG) architecture, each
stage register set of the FU is appended with a one-bit register called Clock Enable Bit
register (CEBit). The CEBit of stage i of FU j controls the clock of stage i+1 of that FU.
(Note that since the last stage of the FU will not be used to gate any clock signal, it is not
appended with the CEBit).
The clock fed to each stage register set, except for the CEBit register which is never
clock gated, goes through an AND gate. The AND gate essentially takes the clock and
the CEBit of the previous stage and performs logical AND on them to produce the clock
that will be fed to the current stage. Hence, during a particular clock cycle, if the CEBit of
the previous stage is ‘0’, the clock for the current stage is masked for that cycle. As
shown in the figure, the CEBit propagates through subsequent stages at each clock cycle
thanks to the CEBit shift register structure.
The CEBit register of the first stage of each FU is set either to ‘0’ or to ‘1’ by the
issue logic via the issued bit (cf. Figure 3.3). If, during a particular cycle m, no instruction
is issued to the FU, then the issued bit will be set to ‘0’, indicating that no instruction is
issued to this particular FU during cycle m. The issued bit is also used to gate the clock of
the first stage. In the subsequent clock cycles as the CEBit travels through the subsequent
stages of an FU, it appropriately gates the clock of those stages.
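The CEBit mechanism can be illustrated with a cycle-level sketch for a single functional unit. This is a behavioral model under stated assumptions, not the circuit itself: the function name and the list-based representation are hypothetical, and the clock enable of each stage is modeled as a 0/1 flag per cycle.

```python
def simulate_pfcg(issued_bits, num_stages):
    """Return, for each cycle, the clock-enable flag of every FU stage.

    issued_bits[m] is the issued bit for cycle m (1 if an instruction is
    issued to this FU in cycle m). Stage 0 is gated by the issued bit;
    stage i+1 is gated by the CEBit appended to stage i.
    """
    # The last stage carries no CEBit, so there are num_stages - 1 of them.
    cebits = [0] * (num_stages - 1)
    trace = []
    for issued in issued_bits:
        # Clock enables seen by the stages during this cycle.
        trace.append([issued] + cebits)
        # CEBit registers are never clock gated: they shift every cycle.
        if cebits:
            cebits = [issued] + cebits[:-1]
    return trace


trace = simulate_pfcg([1, 0, 1, 0, 0], num_stages=3)
for cycle, enables in enumerate(trace):
    print(cycle, enables)
clocked = sum(sum(enables) for enables in trace)
print("clocked stage-cycles:", clocked, "of", 3 * len(trace))
```

In this run only 6 of the 15 stage-cycles receive a clock edge; the remaining stage registers are masked because no instruction occupies them in that cycle.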
3.3.2 Wrong-Path instruction Clock Gating
We saw in Section 3.2 that on average 8.29% of the total executed instructions are
never committed due to wrong-path instructions on mispredicted branches. Figure 3.2
showed, on average, how many wrong-path instructions can be prevented from being
issued when the branch is resolved and is known to be mispredicted. As seen, when the
branch is mispredicted, the majority of the wrong-path instructions that would otherwise be issued can be blocked
since the majority of these wrong-path instructions are still in the IQ. Therefore, we propose
a clock gating technique that eliminates the switching activity in the logic and the stage
registers due to wrong-path instructions.
Figure 3.4 shows the architecture of Wrong-Path instructions Clock Gating (WPCG).
Note that when a branch is resolved to be mispredicted, the instructions in the IQ may be
correct path instructions (i.e., instructions that were fetched before the mispredicted
branch instruction) or wrong-path instructions (i.e., instructions that have been fetched
after the mispredicted branch instruction). Therefore, in the WPCG architecture, the IQ is
augmented with some logic to determine whether the instruction selected by the issue
logic is a wrong-path instruction or not.
Figure 3.4: The WPCG Architecture
As depicted in Figure 3.4, the misprediction bit is set to ‘0’ initially when the correct
path instructions are being executed and no branch misprediction has taken place. When a
branch is resolved to be mispredicted, the mispredicted_branch_rob_id (MBR_id)
register is updated with the ROB ID of the branch (branch_rob_id) in the next clock
cycle. At the same time, the misprediction bit will be set to ‘1’. This will enable the range
comparator in front of each issue port of the IQ, which will subsequently determine
whether the instruction being issued is a wrong-path instruction or not.
The AND gate in front of each issue port essentially takes the ROB ID of the
selected instruction and ANDs it with the misprediction bit. This is necessary since we do
not want unnecessary switching activity in the comparator circuit when the branch is
predicted correctly. Hence, in the event of misprediction, the ROB ID of the selected
instruction is available to the comparator. Furthermore the comparator also receives the
tail of the ROB as input to determine if the selected instruction is between the
mispredicted branch and the tail of the ROB. If it is, then the comparator will output a
‘1’, indicating that the selected instruction is in the wrong-path and thus it should not be
executed. The inverted output of the comparator goes to a 2-to-1 MUX controlled by the
misprediction bit.
In the event of a misprediction, the inverted output of the comparator is chosen to set
the value in the CEBit register of the first stage of the FU. This output is also used to
clock gate the first stage register set of the FU. Note that when the branch is not
mispredicted, the added circuitry is functionally equivalent to the PFCG architecture (cf.
Figure 3.3) and consumes minimal power since there will be no switching activity in the
comparators.
When the head of the ROB reaches the mispredicted branch, we will flush the ROB
and the pipeline. At that time, the misprediction bit will be reset so that starting with the
next clock cycle, the WPCG is disabled.
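The control flow around the misprediction bit and the MBR_id register can be summarized in a behavioral sketch. The class and method names are hypothetical. Two simplifications are assumptions of this sketch rather than statements of the design: the register update is modeled as immediate instead of taking effect in the next clock cycle, and the inverted comparator output is combined with the issued bit so that an empty issue slot still gates the first stage.

```python
class WpcgControl:
    """Behavioral model of the WPCG control state for one issue port."""

    def __init__(self):
        self.misprediction_bit = 0  # '0' while on the correct path
        self.mbr_id = None          # mispredicted_branch_rob_id register

    def on_branch_mispredicted(self, branch_rob_id):
        # Record the branch and enable the range comparator.
        self.mbr_id = branch_rob_id
        self.misprediction_bit = 1

    def on_flush(self):
        # The mispredicted branch reached the ROB head: disable WPCG.
        self.misprediction_bit = 0

    def first_stage_cebit(self, issued_bit, comparator_out):
        """Value written to the first-stage CEBit (2-to-1 MUX behavior).

        comparator_out is 1 when the selected instruction is wrong-path.
        """
        if self.misprediction_bit:
            return int(bool(issued_bit) and not comparator_out)
        # Without a pending misprediction the logic behaves like PFCG.
        return int(bool(issued_bit))


ctrl = WpcgControl()
print(ctrl.first_stage_cebit(1, 1))  # comparator ignored: behaves like PFCG
ctrl.on_branch_mispredicted(5)
print(ctrl.first_stage_cebit(1, 1))  # wrong-path instruction is gated
print(ctrl.first_stage_cebit(1, 0))  # correct-path instruction still runs
ctrl.on_flush()
```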
It is important to emphasize the fact that, in out-of-order processors all types of
instructions can be potentially executed out of order, and therefore, branches can also be
executed out of order. Hence, once we detect a branch misprediction and update the
MBR_id register and set the misprediction bit to ‘1’, it is possible that an older branch
gets executed and gets resolved to be mispredicted. An older branch can still be issued
and executed since it falls into the correct path with respect to the mispredicted branch
whose ROB ID is stored in the MBR_id register. Therefore, if an older branch is resolved
to be mispredicted, we should update the MBR_id register with the ROB ID of the just-resolved
older branch since updating the MBR_id register with this new branch will
cover more wrong-path instructions. For the sake of completeness we mention that if a
younger branch gets resolved to be mispredicted, then we do not alter the content of
MBR_id register. Note however that this scenario is not possible since if a branch is
younger than the branch whose ROB ID is in the MBR_id register, then the younger
branch will fall into the category of wrong-path instructions with respect to the branch
whose ROB ID is in MBR_id register. Thus if a branch is resolved to be mispredicted
while the misprediction bit is set to ‘1’, then this newly mispredicted branch must be
older and we update the MBR_id register. Since we update the MBR_id register any time
a branch is mispredicted, we are already taking care of this scenario.
Furthermore, it is possible that more than one branch gets resolved to be
mispredicted in the same cycle. In this case, ideally, we would like to select the branch
that is the oldest and update MBR_id register with the ROB ID of that branch. But this
would require comparison between the ROB IDs of all the branches that are resolved to
be mispredicted in the same cycle. Our simulation results show that, on average, only
6.25% of the total mispredicted branches are resolved in the same cycle. Therefore, in
order to avoid the overhead of multiple range comparators, we select only one of the
mispredicted branches from one of the Branch Execution Units with a predefined
priority.
3.3.3 Hardware Overhead
Figure 3.5 shows the design of the range comparator block used in the WPCG
architecture. As shown in the figure we actually need 3 comparators. This is because the
ROB is a circular queue where the head of the ROB points to the earliest (oldest)
instruction whereas the tail of the ROB points to the latest (youngest) instruction.
Due to this circular queue structure, we must deal with two different scenarios in
order to determine whether the instruction being issued is a wrong-path instruction or not.
For this purpose, we use three comparators. Comparator C1 compares the tail of the ROB
with the ROB ID of the mispredicted branch. Comparator C2 compares the ROB ID of
the instruction being issued (ROB_id) with the tail of the ROB whereas comparator C3
compares the ROB ID of the instruction being issued with the ROB ID of the
mispredicted branch.
Figure 3.5: Circuitry used to detect wrong-path instructions
Essentially we want to determine if the ROB ID of the instruction being issued is in
between the mispredicted branch and ROB_tail. If so, the ROB ID belongs to a wrong-path
instruction, since the instructions following the branch are from the mispredicted
path. As shown in Figure 3.5, there are two possible scenarios:
o Case 1: ROB_tail is larger than the mispredicted branch’s ROB ID
(mispredicted_branch_rob_id in Figure 3.5). In this case the instruction being
issued is on the wrong-path exactly if its ROB ID is larger than the
mispredicted_branch_rob_id and smaller than the ROB_tail. This task is
accomplished by the AND gate in the dotted rectangle.
o Case 2: ROB_tail is smaller than the mispredicted_branch_rob_id. In this case the
instruction being issued is on the wrong-path exactly if its ROB ID is larger than the
mispredicted_branch_rob_id or it is smaller than the ROB_tail. This task is
accomplished by the gates in the dotted oval.
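The two cases above can be sketched behaviorally as follows. This is a minimal Python sketch under stated assumptions, not the Hspice circuit of Figure 3.5: the function name `is_wrong_path` is hypothetical, the comparisons are strict as in the text, and the case where ROB_tail equals mispredicted_branch_rob_id (which the text does not discuss) is assumed here to fall under Case 2.

```python
def is_wrong_path(rob_id: int, mispredicted_branch_rob_id: int, rob_tail: int) -> bool:
    """Behavioral model of the range comparator for a circular ROB.

    Hypothetical helper for illustration; names follow Figure 3.5.
    """
    c1 = rob_tail > mispredicted_branch_rob_id   # C1: ROB tail vs. mispredicted branch
    c2 = rob_id < rob_tail                       # C2: issued instruction vs. ROB tail
    c3 = rob_id > mispredicted_branch_rob_id     # C3: issued instruction vs. mispredicted branch
    if c1:
        # Case 1 (no wrap-around): strictly between the branch and the tail.
        return c3 and c2
    # Case 2 (wrap-around, assumed to include tail == branch): after the branch
    # up to the end of the queue, or from the start of the queue up to the tail.
    return c3 or c2
```

For example, with a mispredicted branch at ROB ID 100 and ROB_tail at 10 (the wrap-around case), an instruction with ROB ID 5 is flagged as wrong-path while one with ROB ID 50 is not.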
Notice that the inputs of the comparators do not switch when the branch is not
mispredicted. This is due to the fact that the ROB_tail and mispredicted_branch_rob_id
registers (cf. Figure 3.5) are updated only in the event of misprediction. Therefore, they
do not consume any power during the correct path execution. We implemented this
circuit in Hspice and carried out the energy overhead analysis. The results presented in
the experimental section account for this overhead.
3.3.4 Timing Overhead
Potentially there can be a timing penalty for routing the misprediction bit and the
mispredicted_branch_rob_id from the Execution stage back to the Issue stage. In the
conventional processor implementations the branch misprediction information is sent to
the Fetch and the Commit stages and the additional routing cost to get it to the Issue stage
could be quite low. Hence we expect this additional reverse signal path to have little
or no impact on the clock cycle time. If, however, this becomes a concern, then we can
also pipeline the reverse routing path for the misprediction bit signal from the Execution
Unit to the Issue Logic; this will allow some wrong-path instructions to be issued into the
pipeline, which reduces the energy savings of the WPCG technique, but will have no
other performance or functional effects.
More generally, the WPCG architecture adds some logic to determine if the
instruction is a wrong-path instruction, and thus, it adds some delay although the impact
of this delay on the clock cycle time depends on which pipeline stage is the most timing
critical one. In the worst case scenario, we must pipeline the issue logic, resulting in an
extra clock cycle penalty for detecting wrong-path instructions. This additional stage will
be bypassed when the branches are predicted correctly and therefore the penalty reduces
to the Mux delay without any extra clock cycle penalty. In our simulations we pipelined
this logic to account for the worst case scenario when the delay of the logic is too high to
be accommodated within the issue cycle. Therefore, the simulation results presented in the
experimental section account for the associated performance penalty.
3.4 Experimental Results
To carry out the evaluation of the proposed clock gating scheme, we used a
simplescalar-based simulation platform. The PFCG and WPCG methods were
implemented in simplescalar [27] with appropriate modifications to simplescalar to
implement realistic branch execution. The processor model used for the evaluations is
described in Table 3.1. The benchmarks used for the evaluation included a few integer
SPEC 2000 benchmarks (bzip, gzip, gcc) and a few floating point SPEC 2000
benchmarks (wupwise, apsi, mesa, equake) [28] along with a couple of multimedia
benchmarks (djpeg, cjpeg) [41]. The subset of benchmarks was chosen so that it exhibits
the same average branch prediction rate as the full suite it represents [57][21].
All benchmarks were run by fast-forwarding 300M instructions, followed by cycle-accurate
out-of-order simulation of 1B instructions. From the simplescalar simulations, we
obtained the access counts for various structures such as the integer functional units,
register files, and caches.
To report the energy savings of the proposed clock gating scheme (while accounting
for the overhead of the added circuitry), we used Hspice-based simulations using a 45nm
CMOS technology obtained from the predictive technology models [26]. Input registers
of different stages of an FU were modeled as master-slave Flip Flops, implemented at the
transistor-level, and simulated with Hspice to obtain the energy consumption when the
clock is not gated as well as when the clock is gated. Furthermore, to model a typical
integer ALU, we assumed for simplicity that an integer ALU consists of an adder; we designed
a 32-bit adder at the transistor level and simulated it with Hspice. In
order to obtain the energy consumption in the adder circuit, we divided the average
switching activity per bit of the adder input operands into four ranges: [0, 25%), [25%,
50%), [50%, 75%) and [75%, 100%]. The corresponding energy consumptions were
obtained by Hspice by performing Monte Carlo simulation of the adder circuit under
appropriate bit-level switching activities taken from Simplescalar simulations. More
precisely, we obtained the average bit-level switching activities for inputs of various
integer ALUs in the target processor from simplescalar simulations and used these
activity values to estimate power savings on the adder circuit.
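The lookup described above can be sketched as follows. This is only an illustrative Python sketch: the function names `activity_range` and `adder_energy` are hypothetical, and the per-range energy values are left as a parameter because they come from the Hspice Monte Carlo runs, which are not reproduced here.

```python
def activity_range(alpha: float) -> int:
    """Map an average per-bit switching activity in [0, 1] to a range index.

    Index 0: [0, 25%), 1: [25%, 50%), 2: [50%, 75%), 3: [75%, 100%].
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("switching activity must lie in [0, 1]")
    # int(alpha * 4) yields 0..4; clamp 4 (alpha == 1.0) into the last, closed range.
    return min(int(alpha * 4), 3)


def adder_energy(activities, energy_per_range):
    """Sum the adder energy over operations, one range lookup per operation.

    energy_per_range holds four energy values (one per range), assumed to be
    supplied by the circuit-level simulation.
    """
    return sum(energy_per_range[activity_range(a)] for a in activities)
```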
Table 3.1: Processor Model used for Evaluations
| Processor widths | Fetch, Decode, Issue and Commit: 4 |
| ROB | 128/64 |
| LSQ | 64/32 |
| Caches | L1 I/D Cache: 64KB, 2-way, hit latency 1 cycle; Unified L2 Cache: 2MB, 8-way, hit latency 12 cycles |
| Memory Latency | 100 cycles |
| Branch Predictor | Gshare predictor with table size 4096; BTB 1024 2 |
| Functional Units | Integer ALUs: 4; Integer Multiplier/Dividers: 2 |
To model the register file and cache structures, we used CACTI [25] with the 45nm
CMOS technology parameters and the machine configuration reported in Table 3.1.
We evaluated two processor configurations with respect to ROB and LSQ sizes,
denoted as ROB/LSQ set to 64/32 and 128/64. The proposed clock gating solution performs
better with larger ROB and LSQ sizes, because the impact of a branch misprediction
grows with these sizes and we encounter more opportunities to save
energy (cf. Figure 3.6). Increasing the issue width also increases the number of
instructions per mispredicted branch [14]; thus, it will have a similar effect.
[Figure 3.6 plot: X-axis: BZIP, GCC, GZIP, CJPEG, DJPEG, APSI, EQUAKE, MESA, WUPWISE, Average; left Y-axis: % Usage Cycles (0 to 80), series 64/32 PFCG, 64/32 WPCG, 128/64 PFCG, 128/64 WPCG; right Y-axis: % Improvement (0.00 to 8.00), series 64/32, 128/64]
Figure 3.6: Usage cycles fraction in integer ALUs and % decrease in the usage cycles due to WPCG
Figure 3.6 on the primary (left) Y-axis, shows the average value of the percentage of
usage cycles in integer ALUs for different benchmarks. The PFCG scheme takes
advantage of the fact that ALU usage is not 100% and gates the clock signal of the stage
registers of different ALUs during the idle cycles, and hence, saves power. The WPCG
scheme, which after detecting a branch misprediction does not issue wrong-path
instructions, increases the idle cycle fraction and reduces the ALU usage, as shown on the
secondary (right) Y-axis of Figure 3.6. On average, WPCG reduces ALU usage cycles by
2.95% for ROB/LSQ=64/32 and 3.87% for ROB/LSQ=128/64. It is evident from these
results that WPCG creates more opportunities for clock gating compared to PFCG.
Of the presented clock gating schemes, the PFCG technique incurs negligible
overhead: a one-bit register for the CEBit per 32- or 64-bit register. The WPCG technique
incurs only moderate energy overhead because we activate the wrong-path instruction
detection circuitry of Figure 3.5 only after detecting a mispredicted branch. This energy
overhead is accounted for by implementing the circuitry of
Figure 3.5 in Hspice. Note that, as mentioned earlier, the WPCG technique also reduces
switching activity in the combinational logic between the clock gated register sets since it
prevents the wrong-path instructions from being issued. Hence WPCG saves power not
only on clock pins of the stage registers but also in the combinational logic blocks. Figure
3.7 shows the energy consumption in the stage registers and the combinational logic of
the integer ALUs for PFCG and WPCG schemes with the ROB/LSQ configuration of
128/64. On average, WPCG expends 2.43% less energy in the combinational logic of
ALUs and 2.41% less energy in stage registers compared to PFCG.
[Figure 3.7 plot: X-axis: BZIP, GCC, GZIP, CJPEG, DJPEG, APSI, EQUAKE, MESA, WUPWISE, Average; Y-axis: Energy (mJ), 0.0 to 2.0; series: Clock Pins PFCG, Clock Pins WPCG, Logic PFCG, Logic WPCG]
Figure 3.7: Energy consumption in the combinational logic and stage registers of the integer ALUs
Since the WPCG scheme prevents the wrong-path instructions from being executed,
it reduces register file read accesses as most of the wrong-path instructions will access the
register file to read input operands. Furthermore the cache accesses are also typically
reduced since the wrong-path instructions can include load instructions. Notice that the
store accesses to the cache are not affected since stores are executed only on commit. We
used the CACTI tool [25] to obtain the per-access dynamic energy dissipation for the L1 data caches
and the register files implemented in the 45nm PTM technology. Note that CACTI is
equipped with detailed models of memory structures including decoders, sense
amplifiers, bit lines, word lines, interconnect, etc.
Figure 3.8 depicts the percentage reduction in the number of accesses made to the
register files and L1 data cache for WPCG. As shown in this figure, WPCG with the
ROB/LSQ configuration of 128/64 reduces the register file accesses by 3.69% and L1
data cache accesses by 2.60%, resulting in similar energy reduction in register file and L1
data cache. It was reported by [52] that wrong-path instructions may do useful pre-fetches
that can in turn result in reducing the overall execution time for the whole benchmark;
however we did not notice any such effect for our selected benchmarks. This is likely
because of the smaller issue-width, memory latency, and branch misprediction penalty
values used in our simulations (in contrast to the aggressive values assumed in [52], we
assumed parameter values that match today’s commercial processor implementations).
[Figure 3.8 plot: X-axis: BZIP, GCC, GZIP, CJPEG, DJPEG, APSI, EQUAKE, MESA, WUPWISE, Average; Y-axis: Reduction in Accesses (%), 0.0 to 9.0; series: Register File 64/32, Register File 128/64, L1 Data Cache 64/32, L1 Data Cache 128/64]
Figure 3.8: Reduction in register file and cache accesses due to WPCG
Though WPCG incurs a cycle penalty in detecting wrong-path instructions after
a mispredicted branch, it has little effect on the overall IPC since misprediction rates are
normally very low. Table 3.2 shows that, for the simulated benchmarks, WPCG incurs
less than 1% IPC degradation on average.
Table 3.2: IPC Degradation for WPCG (% Change in IPC)
| Benchmarks | ROB/LSQ: 64/32 | ROB/LSQ: 128/64 |
| BZIP | 0.07 | 0.13 |
| GCC | 0.61 | 0.66 |
| GZIP | 0.32 | 0.41 |
| CJPEG | 0.39 | 0.40 |
| DJPEG | 0.22 | 0.34 |
| APSI | 0.56 | 0.33 |
| EQUAKE | 0.58 | 0.14 |
| MESA | 0.87 | 0.74 |
| WUPWISE | 0.91 | 1.81 |
| Average | 0.63 | 0.39 |
Figure 3.9 shows the distribution of energy dissipations among the major on-chip
components obtained by simplescalar/Wattch [10] simulation for 130nm technology.
Among these components, the techniques proposed in this paper are aimed at reducing
power in clock, data cache, register file and ALU. The baseline PFCG saves, on average,
38.50% energy in the clock tree, which translates into 13.93% energy savings over all
these major on-chip components. In comparison, WPCG saves an additional 2.05% in the
clock tree, 2.43% in the ALUs, 3.69% in the register file and 2.60% in the data cache, which
translates into 16.26% energy savings over the major on-chip components.
[Figure 3.9 plot: X-axis: BZIP, GCC, GZIP, CJPEG, DJPEG, APSI, EQUAKE, MESA, WUPWISE, Average; Y-axis: 0% to 100%; stacked components: Clock, Resultbus, ALU, D cache, I cache, Reg. File, LSQ, IQ, Rename]
Figure 3.9: Energy dissipation distribution for different benchmarks
3.5 Conclusions
We presented a clock gating scheme that deterministically clock gates the functional
units in modern out-of-order superscalar processors to save power. The baseline clock gating
scheme, PFCG, clock gates the stage registers associated with the FUs during their idle
cycles. On average, PFCG reduces energy consumption by 13.93% over the major on-chip
components compared to the case with no clock gating. WPCG takes branch mispredictions
as opportunities to save energy and blocks wrong-path instructions from executing after a
branch has been resolved as mispredicted. WPCG provides more idle cycles for PFCG to exploit for clock
gating and also reduces register file and cache accesses caused by wrong-path instruction
execution. Accumulating all these energy benefits, WPCG on the average saves 16.26%
energy over the major on-chip components.
Chapter 4: Boolean Difference Calculus Based Error Calculator
As CMOS enters the nano-scale regime, device failure mechanisms such as crosstalk,
manufacturing variability, and soft errors become significant design concerns. Being
probabilistic by nature, these failure sources have pushed CMOS technology toward
stochastic CMOS [39]. For example, capacitive and inductive coupling between parallel
adjacent wires in nano-scale CMOS Integrated Circuits (ICs) is a potential source of
crosstalk between the wires, and crosstalk can cause a flipping error on the victim signal
[59]. In addition to probabilistic CMOS, promising nanotechnology devices such as
quantum dots are used in technologies such as Quantum Cellular Automata (QCA). Most
of these emerging technologies are inherently probabilistic. This has made reliability
analysis an essential piece of circuit design. Reliability analysis will be even more
significant in designing reliable circuits using unreliable components [30][7].
Circuit reliability will thus be an important tradeoff factor which has to be taken care
of similar to traditional design tradeoff factors such as performance, area, and power. To
include the reliability into the design tradeoff equations, there must exist a good measure
for the circuit reliability, and there must exist fast and robust tools that, similar to timing
analyzer and power estimator tools, are capable of estimating circuit reliability at
different design levels. The authors of [38] proposed a Probabilistic Transfer Matrix
(PTM) method to calculate the output signal error probability for a circuit, while [1]
presents a method based on the Probabilistic Decision Diagrams (PDDs) to perform this
task.
In this chapter we first introduce a probabilistic gate level error propagation model
based on the concept of Boolean difference to propagate errors from inputs to output of a
general gate. We then apply this model to account for the error propagation in a given
circuit and finally estimate the error probability at the circuit outputs. Note that in the
proposed model a gate’s Boolean function is used to determine the error propagation in
the gate. An error at an output of a gate is due to its input(s) and/or the gate itself being
erroneous. The internal gate error in this work is modeled as an output flipping event.
This means that, when a faulty gate makes an error, it flips (changes a “1” to a “0” and a
“0” to a “1”) the output value that it would otherwise have generated for the given inputs; this is the Von
Neumann error model. In the rest of this chapter, we call our circuit error estimation
technique the Boolean Difference-based Error Calculator, or BDEC for short, and we
assume that a defective logic gate produces the wrong output value for every input
combination. This is a more pessimistic defect model than the stuck-at-fault model.
The authors of [38] use a PTM for each gate to represent the error propagation
from the input(s) to the output(s) of a gate. They also define operations such as
matrix multiplication and tensor product that use the gate PTMs to generate and propagate
error probabilities at different nodes in a circuit level by level. Despite its accuracy in
calculating signal error probability, the PTM technique suffers from an extremely large
number of computationally intensive tasks, namely regular and tensor matrix products. This
makes the PTM technique extremely memory intensive and very slow. In particular, for
larger circuits, the size of the PTMs grows too fast for the deeper nodes in the circuit,
making PTM an inefficient or even infeasible technique for error rate calculation in a
general circuit. References [8] and [54] developed a methodology based on probabilistic
model checking (PMC) to evaluate the circuit reliability. The issue of excessive memory
requirement of PMC when the circuit size is large was successfully addressed in [9].
However, the time complexity still remains a problem. In fact, the authors of [9] show
that the run time for their space-efficient approach is even worse than that of the original
approach.
Boolean difference calculus was introduced and used by [63] and [3] to analyze
single faults. It was then extended by [40] and [15] to handle multiple-fault situations.
However, these works consider only stuck-at faults; they do not consider the case in which the
logic gates themselves can be erroneous, so that a gate-induced output error may
nullify the effect of errors at the gate’s input(s). The authors of [60] use Bayesian networks to
calculate the output error probabilities without considering the input signal probabilities.
The author in [1] uses probabilistic decision diagrams (PDD) to calculate the error
probabilities at the outputs using probabilistic gates. While PDDs are much more
efficient than PTM for average case, the worst-case complexity of both PTM and PDD-based
error calculators is exponential in the number of inputs in the circuit.
In contrast, we will show in Section 4.4 that BDEC calculates the circuit error
probability much faster than PTM while achieving results as accurate as those of PTM. We will
show that BDEC requires a single pass over the circuit nodes using a post-order (reverse
DFS) traversal to calculate the error probabilities at the output of each gate as we move
from the primary inputs to the primary outputs; hence, the complexity is O(N), where N is
the number of gates in the circuit and O(.) denotes big-O notation.
4.1 Error Propagation Using Boolean Difference Calculus
Some key concepts and notation that will be used in the remainder of this chapter are
discussed next.
4.1.1 Partial Boolean difference
The partial Boolean difference of function f(x1, x2, …, xn) with respect to one variable
or a subset of its variables [15] is defined as:
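$$
\begin{aligned}
\frac{\partial f}{\partial x_i} &= f_{x_i} \oplus f_{\bar{x}_i} \\
\frac{\partial f}{\partial (x_{i_1} x_{i_2} \cdots x_{i_k})} &= \frac{\partial f}{\partial (\bar{x}_{i_1} \bar{x}_{i_2} \cdots \bar{x}_{i_k})} = f_{x_{i_1} x_{i_2} \cdots x_{i_k}} \oplus f_{\bar{x}_{i_1} \bar{x}_{i_2} \cdots \bar{x}_{i_k}}
\end{aligned}
\tag{4.1}
$$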
where $\oplus$ represents the XOR operator and $f_{x_i}$ is the co-factor of f with respect to $x_i$, i.e.,
$$
\begin{aligned}
f_{x_i} &= f(x_1, \ldots, x_{i-1}, x_i = 1, x_{i+1}, \ldots, x_n) \\
f_{\bar{x}_i} &= f(x_1, \ldots, x_{i-1}, x_i = 0, x_{i+1}, \ldots, x_n)
\end{aligned}
\tag{4.2}
$$
Higher order co-factors of f can be defined similarly. The partial Boolean difference
of f with respect to xi expresses the condition (with respect to other variables) under
which f is sensitive to a change in the input variable xi. More precisely, if the logic values
of {x1, …, xi-1, xi+1, …, xn} are such that ∂f/∂xi = 1, then a change in the input value xi will
change the output value of f. However, when ∂f/∂xi = 0, changing the logic value of xi
will not affect the output value of f.
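The definition can be checked mechanically on small functions. The following is a minimal Python sketch, not part of the BDEC tool; the helper names `cofactor` and `partial_difference` are hypothetical, and a Boolean function is assumed to be given as a Python callable over 0/1 arguments.

```python
from itertools import product


def cofactor(f, i, value):
    """Return the co-factor of f with input i (0-based) fixed to value."""
    def g(*x):
        x = list(x)
        x[i] = value          # override input i, keep the other inputs free
        return f(*x)
    return g


def partial_difference(f, i):
    """Return the partial Boolean difference of f with respect to input i.

    The result is the XOR of the two co-factors; it does not depend on input i.
    """
    f1, f0 = cofactor(f, i, 1), cofactor(f, i, 0)
    return lambda *x: f1(*x) ^ f0(*x)


# Example: for a 3-input AND, the difference with respect to x1 equals x2 AND x3,
# i.e., a change in x1 is visible at the output only when x2 = x3 = 1.
and3 = lambda x1, x2, x3: x1 & x2 & x3
d_and3_x1 = partial_difference(and3, 0)
sensitivity_table = {x: d_and3_x1(*x) for x in product((0, 1), repeat=3)}
```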
It is worth mentioning that the order-k partial Boolean difference defined in
Equation 4.1 is different from the kth Boolean difference of function f as used in [40],
which is denoted by $\partial^k f / \partial x_{i_1} \cdots \partial x_{i_k}$. For example, the 2nd Boolean difference of function f
with respect to xi and xj is defined as:
$$
\frac{\partial^2 f}{\partial x_i \partial x_j} = \frac{\partial}{\partial x_i}\left(\frac{\partial f}{\partial x_j}\right) = f_{x_i x_j} \oplus f_{\bar{x}_i x_j} \oplus f_{x_i \bar{x}_j} \oplus f_{\bar{x}_i \bar{x}_j}
\tag{4.3}
$$
Therefore, in general, ∂2f/∂xi∂xj≠∂f/∂(xixj).
4.1.2 Total Boolean difference
Similar to the partial Boolean difference that shows the conditions under which a
Boolean function is sensitive to change of any of its input variables, we can define total
Boolean difference showing the condition under which the output of the Boolean function
f is sensitive to the simultaneous changes in all the variables of a subset of input
variables. For example, the total Boolean difference of function f with respect to xixj is
defined as:
$$
\frac{\Delta f}{\Delta(x_i x_j)} = (x_i x_j + \bar{x}_i \bar{x}_j)\,\frac{\partial f}{\partial(x_i x_j)} + (\bar{x}_i x_j + x_i \bar{x}_j)\,\frac{\partial f}{\partial(\bar{x}_i x_j)}
\tag{4.4}
$$
where Δf/Δ(xixj) describes the conditions under which the output of f is sensitive to a
simultaneous change in xi and xj. That is, the value of f changes as a result of the
simultaneous change. Some examples for simultaneous changes in xi and xj are
transitioning from xi=xj=1 to xi=xj=0 and vice versa, or from xi=1, xj=0 to xi=0, xj=1 and
vice versa. However, transitions in the form of xi=xj=1 to xi=1, xj=0 or xi=1, xj=0 to
xi=0, xj=0 are not simultaneous changes. Note that ∂f/∂(xixj) describes the conditions
when a transition from xi=xj=1 to xi=xj=0 and vice versa changes the value of function f.
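It can be shown that the total Boolean difference in Equation 4.4 can be written in
the form of:
$$
\frac{\Delta f}{\Delta(x_i x_j)} = \frac{\partial f}{\partial x_i} \oplus \frac{\partial f}{\partial x_j} \oplus \frac{\partial^2 f}{\partial x_i \partial x_j}
\tag{4.5}
$$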
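The equivalence of the two forms of the total Boolean difference can be checked exhaustively for small functions. The sketch below is an illustration only, with hypothetical helper names (`cof`, `partial`, `total_eq44`, `total_eq45`); a Boolean function is assumed to be a Python callable that takes one tuple of 0/1 values.

```python
from itertools import product


def cof(f, fixed):
    """Co-factor of f: inputs listed in `fixed` (index -> 0/1) are overridden."""
    return lambda x: f(tuple(fixed.get(k, v) for k, v in enumerate(x)))


def partial(f, i):
    """Partial Boolean difference of f with respect to input i."""
    return lambda x: cof(f, {i: 1})(x) ^ cof(f, {i: 0})(x)


def total_eq44(f, i, j):
    """Total Boolean difference in the sum-of-products form (Equation 4.4)."""
    d_same = lambda x: cof(f, {i: 1, j: 1})(x) ^ cof(f, {i: 0, j: 0})(x)
    d_diff = lambda x: cof(f, {i: 0, j: 1})(x) ^ cof(f, {i: 1, j: 0})(x)
    # The coefficient (xi xj + ~xi ~xj) is 1 exactly when xi == xj;
    # otherwise the coefficient (~xi xj + xi ~xj) is 1.
    return lambda x: d_same(x) if x[i] == x[j] else d_diff(x)


def total_eq45(f, i, j):
    """Total Boolean difference in the XOR form (Equation 4.5)."""
    second = partial(partial(f, j), i)   # 2nd Boolean difference (Equation 4.3)
    return lambda x: partial(f, i)(x) ^ partial(f, j)(x) ^ second(x)


# Exhaustive check over all 256 Boolean functions of three inputs.
inputs = list(product((0, 1), repeat=3))
for bits in product((0, 1), repeat=8):
    truth_table = dict(zip(inputs, bits))
    f = truth_table.__getitem__
    assert all(total_eq44(f, 0, 1)(x) == total_eq45(f, 0, 1)(x) for x in inputs)
```

The loop only confirms the identity for three-input functions and one variable pair; it is a sanity check, not a proof.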
The total Boolean difference with respect to three variables is:
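$$
\begin{aligned}
\frac{\Delta f}{\Delta(x_1 x_2 x_3)} ={} & (x_1 x_2 x_3 + \bar{x}_1 \bar{x}_2 \bar{x}_3)\,\frac{\partial f}{\partial(x_1 x_2 x_3)} \\
& + (x_1 x_2 \bar{x}_3 + \bar{x}_1 \bar{x}_2 x_3)\,\frac{\partial f}{\partial(x_1 x_2 \bar{x}_3)} \\
& + (x_1 \bar{x}_2 x_3 + \bar{x}_1 x_2 \bar{x}_3)\,\frac{\partial f}{\partial(x_1 \bar{x}_2 x_3)} \\
& + (\bar{x}_1 x_2 x_3 + x_1 \bar{x}_2 \bar{x}_3)\,\frac{\partial f}{\partial(\bar{x}_1 x_2 x_3)}
\end{aligned}
\tag{4.6}
$$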
It is straightforward to verify that:
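$$
\begin{aligned}
\frac{\Delta f}{\Delta(x_1 x_2 x_3)} ={} & \frac{\partial f}{\partial x_1} \oplus \frac{\partial f}{\partial x_2} \oplus \frac{\partial f}{\partial x_3} \oplus \frac{\partial^2 f}{\partial x_1 \partial x_2} \\
& \oplus \frac{\partial^2 f}{\partial x_2 \partial x_3} \oplus \frac{\partial^2 f}{\partial x_1 \partial x_3} \oplus \frac{\partial^3 f}{\partial x_1 \partial x_2 \partial x_3}
\end{aligned}
\tag{4.7}
$$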
In general, the total Boolean difference of a function f with respect to an n-variable subset of
its inputs can be written as:
$$
\frac{\Delta f}{\Delta(x_{i_1} x_{i_2} \cdots x_{i_n})} = \sum_{j=0}^{2^{n-1}-1} \left(m_j + m_{2^n - j - 1}\right) \frac{\partial f}{\partial m_j}
\tag{4.8}
$$
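where the $m_j$ are defined as follows:
$$
\begin{aligned}
m_0 &= \bar{x}_{i_1} \bar{x}_{i_2} \cdots \bar{x}_{i_{n-1}} \bar{x}_{i_n} \\
m_1 &= \bar{x}_{i_1} \bar{x}_{i_2} \cdots \bar{x}_{i_{n-1}} x_{i_n} \\
&\;\;\vdots \\
m_{2^n - 1} &= x_{i_1} x_{i_2} \cdots x_{i_{n-1}} x_{i_n},
\end{aligned}
\tag{4.9}
$$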
and we have:
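$$
\frac{\partial f}{\partial m_j} = \frac{\partial f}{\partial (x^*_{i_1} x^*_{i_2} \cdots x^*_{i_{n-1}} x^*_{i_n})} \quad \text{where } m_j = x^*_{i_1} x^*_{i_2} \cdots x^*_{i_{n-1}} x^*_{i_n}
\tag{4.10}
$$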
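As a consistency check, the general expression can be specialized to two variables. The worked step below is a sketch under two assumptions: the two variables are named x1 and x2 for readability, and the minterms are ordered as in Equation 4.9, so that m0 = x̄1x̄2, m1 = x̄1x2, m2 = x1x̄2 and m3 = x1x2.

$$
\begin{aligned}
\frac{\Delta f}{\Delta(x_1 x_2)} &= \sum_{j=0}^{1} \left(m_j + m_{3-j}\right) \frac{\partial f}{\partial m_j} \\
&= (\bar{x}_1 \bar{x}_2 + x_1 x_2)\,\frac{\partial f}{\partial(\bar{x}_1 \bar{x}_2)} + (\bar{x}_1 x_2 + x_1 \bar{x}_2)\,\frac{\partial f}{\partial(\bar{x}_1 x_2)}
\end{aligned}
$$

Since $\partial f / \partial(\bar{x}_1 \bar{x}_2) = f_{\bar{x}_1 \bar{x}_2} \oplus f_{x_1 x_2} = \partial f / \partial(x_1 x_2)$, this is the two-variable total Boolean difference of Equation 4.4.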
4.1.3 Signal and error probabilities
Signal probability is defined as the probability for a signal value to be “1”. That is:
$$
p_i = \Pr\{x_i = 1\}
\tag{4.11}
$$
Gate error probability is denoted by εg and is defined as the probability that a gate
generates an erroneous output, independent of its applied inputs. Such a gate is
sometimes called a (1-εg)-reliable gate. Signal error probability is defined as the probability
of error on a signal line. If the signal line is the output of a gate, the error can be either
due to error at the gate input(s) or the gate error itself. We denote the error probability on
signal line xi by εi.
We are interested in determining the circuit output error rates, given the circuit input
error rates under the assumption that each gate in the circuit can fail independently with a
probability of εg. In other words, we account for the general case of multiple
simultaneous gate failures.
4.2 Proposed Error Propagation Model
In this section we propose our gate error model in the Boolean difference calculus
notation. The gate error model is then used to calculate the error probability and
reliability at outputs of a circuit.
4.2.1 Gate Error Model
Figure 4.1 shows a general logic gate realizing Boolean function f, with gate error
probability of εg. The signal probabilities at the inputs, i.e., probabilities for input signals
being 1, are p1, p2, …, pn while the input error probabilities are ε1, ε2, …, εn. The output
error probability is εz.
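[Figure 4.1 diagram: gate f with gate error probability εg; inputs labeled (p1, ε1), (p2, ε2), …, (pn, εn); output error probability εz]
Figure 4.1: Gate implementing function f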
First consider the error probability equation for a buffer gate shown in Figure 4.2.
The error occurs at the output if (i) the input is erroneous and the gate is error free or (ii)
the gate is erroneous and the input is error free. Therefore, assuming independent faults
for the input and the gate, the output error probability for a buffer can be written as:
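$$
\varepsilon_z = \varepsilon_{in}(1 - \varepsilon_g) + (1 - \varepsilon_{in})\,\varepsilon_g = \varepsilon_g + (1 - 2\varepsilon_g)\,\varepsilon_{in}
\tag{4.12}
$$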
where εin is the error probability at the input of the buffer. It can be seen from this
equation that the output error probability for a buffer is independent of the input signal
probability. Note that Equation 4.12 can also be used to express the output error probability of
an inverter gate.
[Figure 4.2 diagram: buffer with gate error probability εg; input labeled (pin, εin); output error probability εz]
Figure 4.2: A faulty buffer with erroneous input
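To illustrate the buffer equation, the short Python sketch below applies it repeatedly along a chain of identical faulty buffers. The function names `buffer_output_error` and `chain_error` are hypothetical, and the chain itself is an assumed example rather than a circuit from this chapter.

```python
def buffer_output_error(eps_in: float, eps_g: float) -> float:
    """Output error probability of a faulty buffer (or inverter).

    Either the input is wrong and the gate is correct, or the input is
    correct and the gate flips its output; the two faults are independent.
    """
    return eps_in * (1 - eps_g) + (1 - eps_in) * eps_g


def chain_error(eps_in: float, eps_g: float, stages: int) -> float:
    """Error probability after `stages` identical faulty buffers in series."""
    eps = eps_in
    for _ in range(stages):
        eps = buffer_output_error(eps, eps_g)
    return eps
```

Rewriting the buffer equation as εz − 1/2 = (1 − 2εg)(εin − 1/2) shows that each stage scales the distance from 1/2 by (1 − 2εg), so for 0 < εg < 1 the error probability along such a chain approaches 1/2.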
We can model each faulty gate with erroneous inputs as an ideal (no fault) gate with
the same functionality and the same inputs in series with a faulty buffer as shown in
Figure 4.3.
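[Figure 4.3 diagram: a faulty gate f with gate error probability εg, inputs (p1, ε1), (p2, ε2), …, (pn, εn) and output error probability εz, shown as equivalent to an ideal gate f with the same inputs whose output (pin, εin) drives a faulty buffer with gate error probability εg and output error probability εz]
Figure 4.3: The proposed model for a general faulty gate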
Now consider a general two-input gate. Using the fault model discussed above, we
can write the output error probability considering all the cases of no error, single error
and double errors at the input and the error in the gate itself. We can write the general
equation for the error probability at the output, εz, as:
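$$
\varepsilon_z = \varepsilon_g + (1 - 2\varepsilon_g)\,\underbrace{\left( \varepsilon_1 (1 - \varepsilon_2) \Pr\left\{\frac{\partial f}{\partial x_1}\right\} + (1 - \varepsilon_1)\,\varepsilon_2 \Pr\left\{\frac{\partial f}{\partial x_2}\right\} + \varepsilon_1 \varepsilon_2 \Pr\left\{\frac{\Delta f}{\Delta(x_1 x_2)}\right\} \right)}_{\varepsilon_{in}}
\tag{4.13}
$$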
where Pr{.} represents the signal probability function and returns the probability of its
Boolean argument to be “1”. The first and the second terms in εin account for the error at
the output of the ideal gate due to single input errors at the first and the second inputs,
respectively. Note that an error at each input of the ideal gate propagates to the output of this
gate only if the other inputs are not masking it. The non-masking probability for each
input error is taken into account by calculating the signal probability of the partial
Boolean difference of the function f with respect to the corresponding input. The first two
terms in εin only account for the cases when we have single input errors at the input of the
ideal gate; however, an error can also occur when both inputs are erroneous simultaneously.
This is taken into account by multiplying the probability of having simultaneous errors at
both inputs, i.e., ε1ε2, with the probability of this error to be propagated to the output of
the ideal gate, i.e., the signal probability of the total Boolean difference of f with respect
to x1x2.
For 2-input AND gate (f=x1x2) shown in Figure 4.4 we have:
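$$
\begin{aligned}
\Pr\left\{\frac{\partial f}{\partial x_1}\right\} &= \Pr\{x_2\} = p_2, \qquad \Pr\left\{\frac{\partial f}{\partial x_2}\right\} = \Pr\{x_1\} = p_1 \\
\Pr\left\{\frac{\Delta f}{\Delta(x_1 x_2)}\right\} &= \Pr\{x_1 x_2 + \bar{x}_1 \bar{x}_2\} = p_1 p_2 + (1 - p_1)(1 - p_2) \\
&= 1 - (p_1 + p_2) + 2 p_1 p_2
\end{aligned}
\tag{4.14}
$$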
Plugging Equation 4.14 into Equation 4.13 and after some simplifications we have:
$$
\varepsilon_{AND2} = \varepsilon_g + (1 - 2\varepsilon_g)\left(\varepsilon_1 p_2 + \varepsilon_2 p_1 + \varepsilon_1 \varepsilon_2 \left(1 - 2(p_1 + p_2) + 2 p_1 p_2\right)\right)
\tag{4.15}
$$
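[Figure 4.4 diagram: 2-input AND gate with gate error probability εg; inputs labeled (p1, ε1), (p2, ε2); output error probability εAND2]
Figure 4.4: A 2-input faulty AND gate with erroneous inputs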
Similarly, the error probability for the case of 2-input OR can be calculated as:
$$
\varepsilon_{OR2} = \varepsilon_g + (1 - 2\varepsilon_g)\left(\varepsilon_1 (1 - p_2) + \varepsilon_2 (1 - p_1) + \varepsilon_1 \varepsilon_2 \left(2 p_1 p_2 - 1\right)\right)
\tag{4.16}
$$
And for 2-input XOR gate we have:
$$
\varepsilon_{XOR2} = \varepsilon_g + (1 - 2\varepsilon_g)\left(\varepsilon_1 + \varepsilon_2 - 2\varepsilon_1 \varepsilon_2\right)
\tag{4.17}
$$
It is interesting to note that the error probability at the output of the XOR gate is
independent of the input signal probabilities. Generally, the 2-input XOR gate exhibits
a larger output error than the 2-input OR and AND gates. This is expected since XOR
gates show maximum sensitivity to input errors (XOR, like inversion, is an entropy-preserving
function).
The output error expression for a 3-input gate is:
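$$
\begin{aligned}
\varepsilon_z = \varepsilon_g + (1 - 2\varepsilon_g) \Big[ & \varepsilon_1 (1 - \varepsilon_2 - \varepsilon_3 + \varepsilon_2 \varepsilon_3) \Pr\left\{\frac{\partial f}{\partial x_1}\right\} \\
& + \varepsilon_2 (1 - \varepsilon_1 - \varepsilon_3 + \varepsilon_1 \varepsilon_3) \Pr\left\{\frac{\partial f}{\partial x_2}\right\} \\
& + \varepsilon_3 (1 - \varepsilon_1 - \varepsilon_2 + \varepsilon_1 \varepsilon_2) \Pr\left\{\frac{\partial f}{\partial x_3}\right\} \\
& + \varepsilon_1 \varepsilon_2 (1 - \varepsilon_3) \Pr\left\{\frac{\Delta f}{\Delta(x_1 x_2)}\right\} \\
& + \varepsilon_2 \varepsilon_3 (1 - \varepsilon_1) \Pr\left\{\frac{\Delta f}{\Delta(x_2 x_3)}\right\} \\
& + \varepsilon_1 \varepsilon_3 (1 - \varepsilon_2) \Pr\left\{\frac{\Delta f}{\Delta(x_1 x_3)}\right\} \\
& + \varepsilon_1 \varepsilon_2 \varepsilon_3 \Pr\left\{\frac{\Delta f}{\Delta(x_1 x_2 x_3)}\right\} \Big]
\end{aligned}
\tag{4.18}
$$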
As an example of a 3-input gate, we can use Equation 4.18 to calculate the
probability of error for the case of 3-input AND gate. We can show that the output error
probability can be calculated as:
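$$
\begin{aligned}
\varepsilon_{AND3} = \varepsilon_g + (1 - 2\varepsilon_g) \big[ & p_2 p_3\, \varepsilon_1 + p_1 p_3\, \varepsilon_2 + p_1 p_2\, \varepsilon_3 \\
& + p_3 \left(1 - 2(p_1 + p_2) + 2 p_1 p_2\right) \varepsilon_1 \varepsilon_2 \\
& + p_1 \left(1 - 2(p_2 + p_3) + 2 p_2 p_3\right) \varepsilon_2 \varepsilon_3 \\
& + p_2 \left(1 - 2(p_1 + p_3) + 2 p_1 p_3\right) \varepsilon_1 \varepsilon_3 \\
& + \left(1 - 2(p_1 + p_2 + p_3) + 4(p_1 p_2 + p_2 p_3 + p_1 p_3) - 6 p_1 p_2 p_3\right) \varepsilon_1 \varepsilon_2 \varepsilon_3 \big]
\end{aligned}
\tag{4.19}
$$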
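The gate-level expressions above can be cross-checked numerically. The following Python sketch is not the BDEC implementation: it is a brute-force evaluation of the same model (an ideal gate followed by a faulty buffer) for a gate with any small number of inputs, under the assumption that input signals and input errors are mutually independent. The function name `gate_output_error` and the representation of a gate as a callable over a tuple of bits are hypothetical.

```python
from itertools import product


def gate_output_error(f, p, eps, eps_g):
    """Output error probability of an n-input gate with erroneous inputs.

    f     -- Boolean function taking a tuple of n bits and returning 0 or 1
    p     -- signal probabilities, p[i] = Pr{x_i = 1}
    eps   -- input error probabilities
    eps_g -- gate error probability
    """
    n = len(p)
    eps_in = 0.0
    for x in product((0, 1), repeat=n):          # fault-free input values
        pr_x = 1.0
        for xi, pi in zip(x, p):
            pr_x *= pi if xi else 1 - pi
        for e in product((0, 1), repeat=n):      # e[i] = 1 means input i is flipped
            if not any(e):
                continue                         # no input error, nothing to propagate
            pr_e = 1.0
            for ei, eps_i in zip(e, eps):
                pr_e *= eps_i if ei else 1 - eps_i
            flipped = tuple(xi ^ ei for xi, ei in zip(x, e))
            if f(flipped) != f(x):               # the ideal gate output changes
                eps_in += pr_x * pr_e
    # Ideal gate followed by a faulty buffer, as in the buffer equation.
    return eps_g + (1 - 2 * eps_g) * eps_in


# Example gates, each taking a tuple of input bits.
and2 = lambda x: x[0] & x[1]
or2 = lambda x: x[0] | x[1]
xor2 = lambda x: x[0] ^ x[1]
and3 = lambda x: x[0] & x[1] & x[2]
```

The enumeration costs on the order of 4^n operations per gate, so it is only meant for validating the closed-form expressions on gates with a few inputs.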
Now we give a general expression for a 4-input logic gate as:
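$$
\begin{aligned}
\varepsilon_z = \varepsilon_g + (1 - 2\varepsilon_g) \Big[ & \sum_{i} \varepsilon_i \prod_{j \neq i} (1 - \varepsilon_j) \Pr\left\{\frac{\partial f}{\partial x_i}\right\} \\
& + \sum_{i < j} \varepsilon_i \varepsilon_j \prod_{k \neq i, j} (1 - \varepsilon_k) \Pr\left\{\frac{\Delta f}{\Delta(x_i x_j)}\right\} \\
& + \sum_{i < j < k} \varepsilon_i \varepsilon_j \varepsilon_k \prod_{l \neq i, j, k} (1 - \varepsilon_l) \Pr\left\{\frac{\Delta f}{\Delta(x_i x_j x_k)}\right\} \\
& + \varepsilon_1 \varepsilon_2 \varepsilon_3 \varepsilon_4 \Pr\left\{\frac{\Delta f}{\Delta(x_1 x_2 x_3 x_4)}\right\} \Big]
\end{aligned}
$$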
Object Description
| Title | Low power and reliability assessment techniques for advanced processor design |
| Author | Mohyuddin, Nasir |
| Author email | mohyuddi@usc.edu; nasir.mohyuddin@gmail.com |
| Degree | Doctor of Philosophy |
| Document type | Dissertation |
| Degree program | Computer Engineering |
| School | Viterbi School of Engineering |
| Date defended/completed | 2010-05-26 |
| Date submitted | 2010 |
| Restricted until | Unrestricted |
| Date published | 2010-07-22 |
| Advisor (committee chair) | Pedram, Massoud |
| Advisor (committee member) | Gupta, Sandeep K.; Draper, Jeffrey T.; Nakano, Aiichiro |
| Abstract | The rapid scaling of silicon technologies over the past decade has introduced some strenuous constraints for processor design. The technology progression has exacerbated the power problem which has further increased the necessity to consider design reliability. Static power dissipation that used to be negligible in past is expected to be a major component of overall processor dissipation. In this research, we present techniques that reduce both static and dynamic power dissipation in modern processors while not compromising the processor reliability as whole. We also present a tool BDEC that can be used to compare the reliability of different possible implementations of major processor functional units to choose more reliable implementations in future processor designs.; The proposed techniques reduce both static and dynamic power dissipation in modern processor designs. While dealing with low power design one important design aspect is that the low power technique must not be power hungry itself. The beauty of the presented low power techniques is that they do not have huge hardware implementation cost, rather they use existing hardware to control power dissipation. Power dissipation overhead of presented techniques is minimal.; Smaller and smaller feature sizes and increased power density of modern processors which has resulted in high chip temperatures has increased the importance of reliability consideration in processor design. Since in future we will need to build reliable systems using unreliable components modern processor design will need to be geared towards considering design reliability during the early phases of the design. To help compare various design alternatives we developed a tool: BDEC which gives the reliability of a combinational circuit in terms of gate error probabilities and input error probabilities on the primary inputs. |
| Keyword | branch misprediction; instruction queue; dynamic power dissipation; static power dissipation; clock gating; drowsy cache; functional units; leakage power; cache hierarchy; cache replacement policy; LRU (least recently used); PLRU (pseudo least recently used); simplescalar; simpoints; MRR (modified random replacement policy); soft errors; reliability; instruction level parallelism (ILP); wrong-path instructions; branch misprediction penalty; re-order buffer (ROB); instructions per clock (IPC); probabilistic transfer matrix (PTM); error propagation; Von Neumann fault; error probability; partial Boolean difference; co-factor; signal probability; post-order (reverse DFS); reconvergent fanout; soft error rate (SER); quantum-dot cellular automaton (QCA) |
| Language | English |
| Part of collection | University of Southern California dissertations and theses |
| Publisher (of the original version) | University of Southern California |
| Place of publication (of the original version) | Los Angeles, California |
| Publisher (of the digital version) | University of Southern California. Libraries |
| Provenance | Electronically uploaded by the author |
| Type | texts |
| Legacy record ID | usctheses-m3209 |
| Rights | Mohyuddin, Nasir |
| Repository name | Libraries, University of Southern California |
| Repository address | Los Angeles, California |
| Repository email | http://www.usc.edu/isd/libraries/services/ask_a_librarian/email/ |
| Filename | etd-Mohyuddin-3830; mohyuddi_nasir_thesis_vs5 |
Description
| Title | Page 1 |
| Full text | LOW POWER AND RELIABILITY ASSESSMENT TECHNIQUES FOR ADVANCED PROCESSOR DESIGN by Nasir Mohyuddin A Dissertation Presented to the FACULTY OF THE USC GRADUATE SCHOOL UNIVERSITY OF SOUTHERN CALIFORNIA In Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (COMPUTER ENGINEERING) August 2010 Copyright 2010 Nasir Mohyuddin
Dedication
This thesis is dedicated to my parents and wife, who provided all kinds of support throughout the course of my PhD studies and research.
Acknowledgements
First of all I would thank Almighty Allah (God) who gave me the strength and ability to accomplish all this. I would also like to present my profound gratitude to my PhD advisor Dr. Massoud Pedram, who has been kind to undertake the supervision of my research. Despite his heavy commitments and extremely busy research schedules, he provided consistent technical guidance and support throughout the course of my research. I would like to express my sincere thanks to Dr. Jeffery Draper, Dr. Sandeep Gupta, Dr. Murali Annavaram, Dr. Ming-Deh Huang and Dr. Aiichiro Nakano, who kindly consented to be on my PhD guidance committee and helped me to shape this research work.
Table of Contents
Dedication
Acknowledgements
List of Tables
List of Figures
Abstract
CHAPTER 1: INTRODUCTION
1.1 Motivational Background
1.2 Low Power Technique to Reduce Static Power Dissipation
1.3 Low Power Technique for Dynamic Power Dissipation
1.4 Reliability Considerations for Processor Design
1.5 Organization
CHAPTER 2: STATIC POWER CONTROL
2.1 Leakage Power Phenomenon in CMOS Circuits
2.2 Controlling Leakage Power in Cache Hierarchy
2.3 Related Schemes to Reduce Leakage Power
2.4 Controlling Leakage in Slumberous Caches
2.5 Technological Findings to Evaluate Slumberous Caches
2.6 Architectural Simulations
2.7 Evaluations of Slumberous Caches for Famous Cache Replacement Algorithms
CHAPTER 3: DYNAMIC POWER CONTROL
3.1 Clock Gating
3.2 Using Branch Mispredictions for Clock Gating
3.3 Proposed Clock Gating Architecture
3.4 Experimental Results
3.5 Conclusions
CHAPTER 4: BOOLEAN DIFFERENCE CALCULUS BASED ERROR CALCULATOR
4.1 Error Propagation Using Boolean Difference Calculus
4.2 Proposed Error Propagation Model
4.3 Practical Considerations
4.4 Simulation Results
4.5 Extensions to BDEC
4.6 Conclusions
CHAPTER 5: CONCLUSIONS
5.1 Scope of the Proposed Techniques
5.2 Limitations of the Proposed Techniques
5.3 Possible Extensions of the Presented Techniques
BIBLIOGRAPHY

List of Tables
Table 2.1: Supply and threshold voltages for different technologies
Table 2.2: Leakage current of one cell at different power save levels using predictive models [26]
Table 2.3: Energy per transition per byte between different tranquility levels
Table 2.4: 8FO4 clock frequencies used for our simulations
Table 2.5: Wake up penalties in terms of cycles
Table 2.6: Average leakage power dissipated and maximum leakage power savings per byte
Table 2.7: Maximum percent leakage power savings
Table 2.8: Baseline microprocessor Simulation Model
Table 2.9: The single simpoints for simulations of the Spec2000 benchmarks
Table 2.10: Cache-line replacement priorities for different combinations of Bit_0, Bit_1 and Bit_2
Table 2.11: Average dynamic power costs (nW) for various schemes under PLRU
Table 2.12: Replacement priority assignment for LRU
Table 2.13: Cache-line replacement priorities for hits in different cache line LRU
Table 2.14: Dynamic power costs for different benchmarks for TL4 schemes under PLRU and LRU
Table 2.15: Average dynamic power costs for various schemes under LRU
Table 2.16: Comparing performance impact of various power saving schemes under LRU and PLRU
Table 2.17: Replacement priority assignment for MRR
Table 2.18: Cache-line replacement priorities for hits in different cache line under MRR
Table 2.19: Comparing IPCs of various replacement algorithms
Table 2.20: Dynamic power costs for different benchmarks for TL4 schemes under PLRU and LRU
Table 2.21: Average dynamic power costs for various schemes under MRR
Table 2.22: Comparing performance impact of power saving schemes under PLRU, LRU and MRR
Table 2.23: Comparing all 12 leakage control schemes with respect to LESMs
Table 3.1: Processor Model used for Evaluations
Table 3.2: IPC Degradation for WPCG
Table 4.1: Output error probability with re-convergent fanout
Table 4.2: Percent error reduction in output error probability using BDEC +Collapsing
Table 4.3: Circuit reliability for tree-structured circuits having relatively small number of PIs
Table 4.4: Circuit Reliability for Tree-Structured Circuits having relatively Large Number of PIs
Table 4.5: Circuit Reliability and Efficiency of BDEC Compared to PGM and PTM
Table 4.6: Runtime Comparison between BDEC and PTM for some Large Benchmark Circuits
Table 4.7: Circuit Reliability for Large Benchmark Circuits
Table 4.8: BDEC Circuit Reliability Compared to MC Simulations for Large Benchmark Circuits

List of Figures
Figure 2.1: Schematic view of a basic CMOS inverter
Figure 2.2: A basic CMOS inverter with different N and P type regions inferring diodes
Figure 2.3: Control circuitry to implement proposed scheme
Figure 2.4: Six transistor SRAM Cell with wordline and bitlines
Figure 2.5: Charge sharing between cache lines at different tranquility levels
Figure 2.6: The power wake up cycle from different levels of tranquility in the 70nm technology
Figure 2.7: Maximum percent leakage power savings for various schemes
Figure 2.8: PLRU implementation for 4-way caches
Figure 2.9: How hits at different Priority levels affect the cache under PLRU policy
Figure 2.10: Dynamic power incurred per byte for L1 data cache for TL4 under PLRU policy
Figure 2.11: Dynamic power incurred per byte for L1 data cache for TL2-T4 under PLRU policy
Figure 2.12: Average dynamic power cost for L1 data cache for different power schemes under PLRU
Figure 2.13: Net Savings per byte for L1 data cache for TL4 under PLRU
Figure 2.14: % Net Savings per byte for L1 data cache for TL4 scheme under PLRU policy
Figure 2.15: Power savings for L1 data cache under various schemes in future technologies for PLRU
Figure 2.16: Average net leakage power savings for L1 data cache for various schemes under PLRU
Figure 2.17: Percent increase in L1 access time for hits for various schemes under PLRU
Figure 2.18: How hits at different Priority levels affect the cache under LRU policy
Figure 2.19: Dynamic power cost for L1 data cache for TL4 scheme under LRU policy
Figure 2.20: Net savings for L1 data cache for TL4 under LRU policy
Figure 2.21: Percent net savings for L1 data cache for TL4 under LRU
Figure 2.22: Power savings for L1 data cache under various schemes in future technologies for LRU
Figure 2.23: Average net leakage power savings for L1 data cache for various schemes under LRU
Figure 2.24: Percent increase in L1 access time for hits for various schemes under LRU
Figure 2.25: How hits at different Priority levels affect the cache under MRR replacement policy
Figure 2.26: Dynamic power incurred per byte for L1 data cache for TL2-T4 under MRR policy
Figure 2.27: Net savings per byte for L1 data cache for TL2-T4 under MRR
Figure 2.28: Percent net savings per byte for L1 data cache for TL2-T4 scheme under MRR
Figure 2.29: Power savings for L1 data cache under various schemes in future technologies for MRR
Figure 2.30: Average net leakage power savings for L1 data cache for various schemes under MRR
Figure 2.31: Percent increase in L1 access time for hits for various schemes under MRR
Figure 2.32: Dynamic power cost for L2 cache for various schemes in future technologies for PLRU
Figure 2.33: Leakage power savings for L2 cache for various schemes in future technologies for PLRU
Figure 2.34: Percent increase in L2 access time for hits for various schemes under PLRU
Figure 3.1: Idle time and wrong-path instruction fraction for integer ALUs
Figure 3.2: Average number of wrong-path instructions per mispredicted branch
Figure 3.3: PFCG Architecture
Figure 3.4: The WPCG Architecture
Figure 3.5: Circuitry used to detect wrong-path instructions
Figure 3.6: Usage cycles fraction in integer ALUs and % decrease in the usage cycles due to WPCG
Figure 3.7: Energy consumption in the combinational logic and stage registers of the integer ALUs
Figure 3.8: Reduction in register file and cache accesses due to WPCG
Figure 3.9: Energy dissipation distribution for different benchmark
Figure 4.1: Gate implementing function f
Figure 4.2: A faulty buffer with erroneous input
Figure 4.3: The proposed model for a general faulty gate
Figure 4.4: A 2-input faulty AND gate with erroneous inputs
Figure 4.5: Balanced tree implementation of 4-input AND gate
Figure 4.6: Re-convergent fanout in a 2-to-1 Multiplexer

Abstract
The rapid scaling of silicon technologies over the past decade has introduced some strenuous constraints for processor design. The technology progression has exacerbated the power problem, which has further increased the necessity to consider design reliability. Static power dissipation, which used to be negligible in the past, is expected to be a major component of overall processor dissipation. In this research, we present techniques that reduce both static and dynamic power dissipation in modern processors while not compromising the processor reliability as a whole. We also present a tool, BDEC, that can be used to compare the reliability of different possible implementations of major processor functional units in order to choose more reliable implementations in future processor designs. The proposed techniques reduce both static and dynamic power dissipation in modern processor designs. When dealing with low power design, one important design aspect is that the low power technique must not be power hungry itself.
The beauty of the presented low power techniques is that they do not have a huge hardware implementation cost; rather, they use existing hardware to control power dissipation. The power dissipation overhead of the presented techniques is minimal. Smaller and smaller feature sizes and the increased power density of modern processors, which has resulted in high chip temperatures, have increased the importance of reliability considerations in processor design. Since in the future we will need to build reliable systems using unreliable components, modern processor design will need to be geared towards considering design reliability during the early phases of the design. To help compare various design alternatives we developed a tool, BDEC, which gives the reliability of a combinational circuit in terms of gate error probabilities and input error probabilities on the primary inputs.
Chapter 1: Introduction
1.1 Motivational Background
Power dissipation has been a first-order design concern for the last two decades. Processor chip temperature has been on the rise since then; initially it was compared to a hot plate and was expected to reach the temperature of the Sun’s surface if no measures were taken to reduce power dissipation in advanced microprocessor design. For every component of modern processor design, researchers have proposed solutions, workable within the technological constraints of the time, to reduce power dissipation and the resulting temperature rise. As technology kept scaling in accordance with Moore’s law [50], more and more transistors became available to designers to implement design novelties. Since memory latency had been a limiting factor in the execution speed of an application on a specific processor, it had been an attractive choice for processor designers to move more and more memory on chip; as a result we see level-1, level-2 and even level-3 cache memories implemented on chip in modern processors [13].
In modern processors a significant fraction of the total die area is occupied by cache memories, e.g. 60% of the StrongARM die area is cache [69]. Therefore caches, which are implemented as SRAM memories, provide the greatest opportunity for static power reduction. Dynamic power is mostly consumed by the clock network, which accounts for a large fraction of a chip’s total energy consumption. One popular technique, clock gating, prevents the clock signal from reaching idle functional units. Clock gating is a very effective way of reducing power and energy throughout a processor and is implemented in numerous commercial systems. Reliability is fast becoming a primary design concern. Designers are accustomed to designing systems using reliable gates and components. It is widely believed that in the future we will need to build systems using unreliable gates and components. Instead of having deterministic gates and components we will have probabilistic gates and sub-components which will provide correct output only with a certain probability.
1.2 Low Power Technique to Reduce Static Power Dissipation
Static power is dissipated at all times and is not a function of circuit activity. Static power is also referred to as leakage power. Effective leakage power reduction techniques for on-chip caches are based on switching SRAM memory cells to a low leakage mode when they are not accessed. The low leakage mode voltage depends on the technology and is a function of transistor threshold voltage and noise margins; it should be high enough to avoid data corruption. As we lower the voltage applied to the SRAM cells the leakage current decreases drastically. This results in less static power dissipation. However, whenever a cell in a lower leakage power mode is accessed, power levels may change, which results in dynamic power consumption and performance penalties.
A trade-off between the amount of leakage power saved on one hand, and the impact on dynamic power and performance on the other hand, must be reached.
1.2.1 Slumberous caches
In slumberous caches the power level of cache lines is controlled with the cache replacement policy. The cache lines are maintained at different power save modes called "tranquility levels", which depend on their order of replacement priorities. The slumberous cache idea offers a trade-off between the amount of leakage power saved on one hand and the impact on dynamic power and performance on the other hand.
1.3 Low Power Technique for Dynamic Power Dissipation
Dynamic power dissipation control has always been an important design consideration in digital circuit design. Increasing chip sizes with technology advancement further increase the importance of low power techniques, especially those that incur low power and performance overheads. In this research a well known design technique, clock gating, is used to save the dynamic power dissipated in executing wrong-path instructions in advanced processors because of branch mispredictions. Wrong-Path instructions based Clock Gating (WPCG) detects wrong-path instructions in the event of a branch misprediction and prevents them from being issued to the functional units (FUs); subsequently, it disables the clock of these FUs and also reduces the stress on the register file and cache. Simulations demonstrate that more than 92% of all wrong-path instructions can be detected and stopped from being executed. The WPCG architecture results in 16.26% chip-wide energy savings, which is 2.33% more than that of the baseline Pipelined Functional units Clock Gating (PFCG) scheme.
1.4 Reliability Considerations for Processor Design
The slumberous cache idea of keeping the most recently used cache line in active power mode (full rail power) proves to be better in terms of memory reliability when compared to the original drowsy cache idea, which lowers the voltage of all the cache lines irrespective of their usage. As mentioned in [44], the soft error rates at lower supply voltages are an order of magnitude higher than in full rail power mode.
1.5 Organization
This thesis is organized as follows. Chapter 2 discusses the low power technique for static power control in caches. It also compares various slumberous cache configurations with respect to performance impact and leakage power saved, and proposes the best slumberous cache configurations for both L1 and L2 data caches. Chapter 3 covers the dynamic power control technique using clock gating. Branch misprediction information is used to clock gate functional units in the processor pipeline. Chapter 4 describes the Boolean Difference Error Calculator (BDEC) in detail. It also shows how BDEC can be used to design reliable functional units for future processors. Chapter 5 briefly reviews the general scope and limitations of the proposed low power and reliability techniques along with some possible future extensions of this research work.
Chapter 2: Static Power Control
Traditionally, computer architects have mostly been concerned about performance, cost and reliability. Power considerations were secondary. Moreover, computer architects are used to ignoring and abstracting away the technology level details of their design. In recent years, this situation has dramatically changed and power is becoming one of the primary design parameters at both architecture and physical design levels. Several factors have contributed to this trend.
Perhaps the primary driving factor has been the remarkable success and growth of the class of personal computing devices (portable desktops, audio- and video-based multimedia products) and wireless communications systems (personal digital assistants and personal communicators), which demand high-speed computation and complex functionality with low power consumption. In high-end machines, power dissipation and its effect on temperature, cooling and performance are becoming the major limiting factors to feature size and frequency scaling. There are two types of power dissipated in a chip: dynamic power and static power. Dynamic power is incurred whenever the state of a circuit changes, whereas static power is dissipated (leaked) in each and every circuit, all the time, independently of its changes of state. The International Technology Roadmap for Semiconductors (ITRS) produced by the Semiconductor Industry Association predicts that leakage current Ioff will double with each generation for both high-performance (low threshold voltage Vt, high leakage) and low-power (high Vt, low leakage) transistors [66].
Figure 2.1: Schematic view of a basic CMOS inverter
Different techniques apply to dynamic and static power reduction. Static power is the focus of this chapter. Static power is also often referred to as leakage power.
2.1 Leakage Power Phenomenon in CMOS Circuits
To understand leakage power one can look at the basic structure of a CMOS inverter, shown in Figure 2.1. The three major sources of leakage are sub-threshold leakage, substrate leakage, and leakage through gate oxide [17]. In Figure 2.2 we model the sources of leakage by diodes. Sub-threshold leakage is due to diodes D2 and D4. When the input of the inverter is low and the output is high, the reverse biased diode D2 causes sub-threshold leakage. Conversely, when the input is high and the output is low, the reverse biased diode D4 causes sub-threshold leakage.
Diode D3 between the power supply VDD and ground GND is responsible for the substrate leakage. The overall substrate leakage is proportional to the dimensions and number of devices grown in the n-wells over the p-substrate. Since the substrate is lightly doped, leakage through the substrate is very small as compared to sub-threshold leakage. The leakage through the gate oxide (diodes D1 and D5) is also very small. The ways to reduce substrate leakage and gate oxide leakage are mostly technology level techniques such as twin tub and SOI technologies. Sub-threshold leakage is currently the largest of these three components, and is bound to increase in future fabrication technologies as threshold voltages are scaled down [10]. In this chapter, we focus on sub-threshold leakage. We ignore gate oxide and substrate leakages, and the techniques proposed in this research do not address them.
Figure 2.2: A basic CMOS inverter with different N and P type regions inferring diodes
2.2 Controlling Leakage Power in Cache Hierarchy
Effective leakage power reduction techniques for SRAM based memories, e.g. on-chip caches, are based on switching memory cells to a low leakage mode when they are not accessed. However, whenever a memory cell in a lower leakage power mode is accessed, power levels may change, which results in dynamic power consumption and performance penalties. A trade-off between the amount of leakage power saved on one hand, and the impact on dynamic power and performance on the other hand, must be reached. To achieve this trade-off in the context of the cache hierarchy, we introduce "slumberous caches" in which the power level of set-associative cache lines is controlled with the cache replacement policy. The replacement policy is useful in set-associative caches to improve the hit rate of the cache because it exploits the locality property of memory accesses.
This same locality property can be exploited to optimize the trade-off between static power, dynamic power and performance. In a slumberous cache, cache lines are maintained at different power save modes which we call "tranquility levels". The lines in each set of a slumberous cache are maintained at tranquility levels which depend on their order of replacement priorities. The effectiveness of this idea is first evaluated in the context of PLRU (Pseudo Least Recently Used), a common cache replacement algorithm. Then it is extended to a couple of other replacement algorithms. We explore various schemes for the tranquility levels assigned to lines and compare overall power and performance impacts. As technology scales down, the dynamic power and performance penalties required to energize slumberous cache lines drop drastically while the leakage power savings remain roughly steady.
2.3 Related Schemes to Reduce Leakage Power
Several ideas have been explored to reduce leakage power at the architectural level in microprocessors. All of these leakage power saving schemes rely on some changes at the circuit level to cut off power [50] to cache lines or to switch them to reduced (drowsy) voltage levels. When power to a line is cut off, the information is lost, a backup copy must exist in the hierarchy, and the next access to the line causes a miss. Drowsy voltage levels are such that the information is not lost, but the line must first be energized to full voltage before it can be accessed, which results in dynamic power and performance penalties. M. Martonosi et al. [33] proposed the Cache Decay scheme. By invalidating and "turning off" cache lines when they hold data not likely to be reused, leakage power can be saved. Success relies on accurately predicting the cache line dead periods, i.e. the periods when a cache line is sitting idle and is useless, only consuming static power.
A cache line is turned off if a preset number of cycles (called the “decay interval”) has passed since the cache line’s last access. This results in a 70% leakage energy reduction. An adaptive scheme that chooses the best decay interval for each generation of a cache line on the fly is also proposed. The problem with this scheme is that early shut-off of a cache line increases the miss rate, consequently affecting overall performance and incurring dynamic power. A compiler based strategy to reduce leakage energy was proposed by W. Zhang et al. for instruction caches [72]. Their scheme is based on marking the last usage of instructions by a special instruction which turns off the cache line. To limit the frequency of these special instructions, the authors turn off instructions at the loop granularity level. At the exit from a loop that will not be visited again, the cache lines are turned off. The concept of a resizable cache was proposed by Babak Falsafi et al. [70]. This method exploits the fact that cache utilization varies from application to application and also within an application. So, statically or dynamically varying the cache size by turning off unused cache portions can save a lot of static energy. Two different schemes were used to vary the cache size. “Selective-ways” changes the cache set associativity and “Selective-sets” changes the cache set sizes according to cache usage. Static resizing is done across entire applications and dynamic resizing changes the cache size on demand during execution. Dynamic threshold modulation [24] using MTCMOS can be applied at cache line granularity: the threshold voltage of the transistors in the SRAM cell is dynamically increased when the cell is set to sleep mode by raising the source-to-body voltage of the transistors in the circuit. This higher Vt reduces the leakage current while allowing the memory cell to maintain its state even in sleep mode. All the schemes described so far rely on shutting down parts of the cache.
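As a rough illustration of the decay-interval rule summarized above for Cache Decay, the following Python sketch models a single cache line that is powered off once it has been idle for a preset number of cycles, and that misses on the next access because its contents were lost. This is a simplified behavioral model, not the hardware of [33]; the names `DecayLine`, `tick` and `access`, and the one-counter-per-line organization, are assumptions made only for this example.

```python
from dataclasses import dataclass


@dataclass
class DecayLine:
    """Hypothetical model of one cache line under a decay policy."""
    valid: bool = True
    powered: bool = True
    idle_cycles: int = 0


def tick(line: DecayLine, decay_interval: int) -> None:
    """Advance one cycle; power the line off once it has been idle too long."""
    if not line.powered:
        return
    line.idle_cycles += 1
    if line.idle_cycles >= decay_interval:
        # Cutting power loses the data, so the line is invalidated as well.
        line.powered = False
        line.valid = False


def access(line: DecayLine) -> bool:
    """Access the line; return True on a hit, False on a (decay-induced) miss."""
    hit = line.powered and line.valid
    # Either way the line ends up powered, valid (refilled on a miss) and fresh.
    line.powered = True
    line.valid = True
    line.idle_cycles = 0
    return hit
```

A short decay interval turns lines off sooner and saves more leakage, but it also makes a premature shut-off, and hence an extra miss, more likely; this is the trade-off the text attributes to the scheme.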
Our approach is based on the idea of drowsy caches [20]. Drowsy cache lines are never completely shut down. Every cache line can be at one of two voltage levels: the full rail voltage or the drowsy mode voltage. In drowsy mode the supply voltage of the cache line is lowered to the minimum level that does not corrupt the data. A drowsy bit selects between the full rail voltage and the drowsy voltage. All cache lines are put in drowsy mode at regular intervals of 2000 cycles. Results for Spec2000 benchmarks suggest that, for most of the benchmarks, 90% of the cache lines can be kept in drowsy mode. With wake-up penalties of no more than one cycle for a drowsy cache line, the authors' results show that the total leakage energy was reduced by an average of 71% when tags were always awake and by an average of 76% using the drowsy tag scheme, with modest performance impact. In the same vein, drowsy instruction caches [34] use a cache bank prediction scheme to predict which bank of the instruction cache to put in drowsy mode and which to turn on.

2.4 Controlling Leakage in Slumberous Caches

In slumberous caches we propose to reduce the leakage power by using the replacement policy to control the voltage levels of the lines in the same cache set.

2.4.1 Tranquility Levels

We consider set-associative caches with at least two priority levels for replacement within a set. (Thus our approach is not suitable for random replacement policies or direct-mapped caches.) In an n-way cache, the cache lines are ranked with respect to their priority of replacement, P1, ..., Pn. The line with priority level Pn is always selected for replacement. We dynamically assign different voltage levels to the cache lines at different priority levels, based on the information kept in the replacement policy state bits. These voltage levels are called tranquility levels, and in general various schemes are possible for assigning tranquility levels (T1, ..., Tn) to replacement priorities (P1, ..., Pn).
T1 is the highest voltage level and Tn is the lowest voltage level. The lowest possible supply voltage must stay at least 200mV above the threshold voltage so that ambient noise cannot flip bits of the line. Figure 2.3 shows a simple circuit that switches a cache line from one tranquility level to another in a 4-way set-associative cache. The four power rails remain energized at the different voltage levels. The replacement policy state bits control the transistors feeding the voltage level to the line. Multiple switching transistors can be distributed along the power rails of the cache lines to avoid current bottlenecks. In all schemes considered in this paper, cache tags and state bits are always at full rail voltage so that no clock cycle is wasted in waking them up. This framework allows for the design of slumberous caches, in which leakage power is controlled by the replacement policy.

Figure 2.3: Control circuitry to implement proposed scheme

2.4.2 Maximum Leakage Power Savings

The leakage power per byte without any power saving scheme is given by the following equation:

$$P_{leakage} = 8 \cdot I_{leakage} \cdot V_{dd} \qquad (2.1)$$

where $I_{leakage}$ is the leakage current per bit and $V_{dd}$ is the full rail voltage. For an n-way set-associative slumberous cache the maximum leakage power saving per byte is

$$P_{saving} = 8 \left( V_{dd} \cdot I_{leakage} - \frac{1}{n} \sum_{i=1}^{n} V_i \cdot I_i \right) \qquad (2.2)$$

where $V_i$ and $I_i$ are the voltage level and the per-bit leakage current of the tranquility level of the i-th line in a set, and n is the number of ways of associativity. This maximum saving depends only on the technology, the number of ways, and the tranquility levels. However, to reap the benefits of this maximum power saving we need to solve several problems and trade-offs.

2.4.3 Leakage Control Schemes

The simplest scheme is to keep all lines in a set at the same tranquility level, independently of the replacement priorities and, if needed, to wake up the line every time it is accessed.
The schemes with only one tranquility level will be called TL1 (Tranquility Level One) and will use the suffix -Ti to indicate the voltage of that tranquility level. Hence all one-tranquility-level schemes will be denoted as TL1-Ti. TL1-T1 is the scheme that has no leakage savings, as all lines are at the full rail voltage at all times, and TL1-Tn has the maximum savings, as all lines are at the lowest tranquility level. TL1-T1 has no performance impact, whereas TL1-Tn has the worst performance impact, because the worst wake-up penalty is incurred every time a line is accessed. TL1-Tn also incurs dynamic power each time a cache line is accessed. So a trade-off must be made between leakage power savings and performance impact.

To improve this trade-off, we exploit the fact that the MRU (Most Recently Used) line is very likely to be referenced over and over again. Our evaluations show that, for different replacement policies, on average more than 94% of data hits are to the MRU line. Thus we can keep the MRU line at T1, while keeping all the other lines in the set at one tranquility level, T2, T3, ... or Tn. This reduces the leakage power savings, but at the same time it reduces the performance loss and the dynamic energy needed to wake up lines. Hence we have TL2 (Two Tranquility Levels) schemes. Because one tranquility level of a TL2 scheme is T1 by default, we denote two-tranquility-level schemes as TL2-Ti, where Ti is the tranquility level employed for non-MRU lines, i.e. T2, T3, ..., Tn. TL2-T2 means a scheme that has two tranquility levels, T1 for the MRU line and T2 for all non-MRU lines. Similarly we can have TL2-T3, TL2-T4, ..., TL2-Tn.

Finally, more than two tranquility levels can also be used, giving TL3, TL4, ..., TLn, where n is the number of ways in a set-associative cache. For TLn each priority level P1, P2, ..., Pn is associated with a different level of tranquility T1, T2, ..., Tn (respectively).
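The naming convention above can be made concrete with a small sketch. The function name `tranquility_levels` and the string parsing are illustrative assumptions, not part of the evaluated design; the sketch covers only the three families spelled out in the text (TL1-Ti, TL2-Ti and TLn) and rejects anything else, since intermediate schemes such as TL3 on a 4-way cache are not specified in detail here.

```python
def tranquility_levels(scheme: str, n_ways: int) -> list:
    """Map replacement priorities P1..Pn to tranquility level indices.

    Element k of the result is the tranquility level (1 = T1, full rail)
    assigned to the line at priority P(k+1). The scheme-name parsing is a
    hypothetical convenience for this sketch.
    """
    name, _, level = scheme.partition("-T")   # "TL2-T3" -> ("TL2", "-T", "3")
    count = int(name[2:])                     # number of distinct levels
    if count == 1 and level:
        # TL1-Ti: every line sits at the same level Ti.
        return [int(level)] * n_ways
    if count == 2 and level:
        # TL2-Ti: MRU line (P1) at T1, all other lines at Ti.
        return [1] + [int(level)] * (n_ways - 1)
    if count == n_ways and not level:
        # TLn: priority Pk is paired with tranquility level Tk.
        return list(range(1, n_ways + 1))
    raise ValueError(f"scheme {scheme!r} is not covered by this sketch")


for s in ("TL1-T4", "TL2-T3", "TL4"):
    print(s, tranquility_levels(s, 4))
```

For a 4-way cache, `tranquility_levels("TL2-T3", 4)` returns `[1, 3, 3, 3]`: only the MRU line stays at full rail.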
We have considered linearly distributed voltages for the tranquility levels, between the lowest possible operating voltage (deepest tranquility state), i.e. Vth + 200mV, and the full power supply voltage (wake-up state). Other distributions are possible, but are beyond the scope of this research.

2.5 Technological Findings to Evaluate Slumberous Caches

We have carried out many evaluations, both technological and architectural, to assess the trade-offs between leakage power, dynamic power and performance in the design of slumberous caches. Technological evaluations are discussed in this section.

2.5.1 Leakage Power of Different Tranquility Levels

An HSPICE deck was set up with a standard SRAM cell, shown in Figure 2.4, to measure the leakage power of one cell. We simulated the cell over present and future technologies using presently available and predictive technology models [26].

Figure 2.4: Six transistor SRAM Cell with wordline and bitlines

Next we establish the minimum voltage level that guarantees a consistent state of the memory. A simple noise analysis suggests that the minimum voltage should be approximately 200mV above Vth, the threshold voltage. Vth for each technology is determined through simulations using predictive models [26]. Table 2.1 shows the operating supply voltages and threshold voltages suggested by our simulations.

Table 2.1: Supply and threshold voltages for different technologies

Technology   Supply (V)   Vth (V)
130nm        1.3          0.596
100nm        1.1          0.546
70nm         0.9          0.394

We used the full rail Vdd level for T1 and Vth + 200mV for the T4 level. The power-save voltage levels for T2 and T3 are selected by linear interpolation between T1 and T4. We have run extensive simulations over different technologies to verify the correct operation of an SRAM cell at different tranquility levels, and the outcome is shown in Table 2.2. The leakage current increases significantly as the threshold voltage decreases with technology scaling.
Also, within a technology, the leakage current decreases with the voltage across the tranquility levels.

Table 2.2: Leakage current of one cell at different power save levels using predictive models [26] (operating voltage / steady state leakage current per bit for each tranquility level)

             T1                 T2                 T3                 T4
Technology   V (V)   I (nA)     V (V)   I (nA)     V (V)   I (nA)     V (V)   I (nA)
130nm        1.3     0.948      1.1     0.673      0.9     0.550      0.7     0.475
100nm        1.1     2.522      0.95    1.818      0.8     1.481      0.65    1.292
70nm         0.9     8.949      0.8     7.321      0.7     6.340      0.6     5.655

2.5.2 Dynamic Power Costs

Table 2.3 shows the energy required to switch between various tranquility levels in different technologies.

Table 2.3: Energy per transition per byte between different tranquility levels (joules)

Technology   T1<->T2    T1<->T3    T1<->T4    T2<->T3    T2<->T4    T3<->T4
130nm        8.68E-15   3.47E-14   7.81E-14   8.68E-15   3.47E-14   8.68E-15
100nm        3.39E-15   1.36E-14   3.05E-14   3.39E-15   1.36E-14   3.39E-15
70nm         1.10E-15   4.39E-15   9.87E-15   1.10E-15   4.39E-15   1.10E-15

To derive the expression for dynamic power consumption we start with a basic circuit in which a capacitor is forcibly switched through an energy-dissipating switching device, as shown in Figure 2.5. Theoretically no energy is dissipated by a capacitor in switching from one voltage level to another; the energy is dissipated in the non-ideal resistive switching device.
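As a numerical cross-check of Table 2.3, the sketch below assumes the standard switching loss of a capacitor, (1/2)·C·ΔV², which is the form the derivation in this section arrives at. The effective capacitance per byte is not stated in the text; here it is backed out from the T1<->T2 entry of each technology, which is purely an assumption of this sketch, and the remaining entries are then predicted from the Table 2.2 voltages. The function names are illustrative.

```python
# Supply voltage (V) of T1..T4 per technology, taken from Table 2.2.
VOLTAGES = {
    "130nm": [1.3, 1.1, 0.9, 0.7],
    "100nm": [1.1, 0.95, 0.8, 0.65],
    "70nm": [0.9, 0.8, 0.7, 0.6],
}

# T1<->T2 energy per transition per byte (J), taken from Table 2.3.
E_T1_T2 = {"130nm": 8.68e-15, "100nm": 3.39e-15, "70nm": 1.10e-15}


def capacitance_per_byte(tech: str) -> float:
    """Effective capacitance (F) implied by the T1<->T2 entry.

    Assumption of this sketch: E = 0.5 * C * dV**2 holds exactly for that
    entry, so C = 2 * E / dV**2.
    """
    v = VOLTAGES[tech]
    return 2.0 * E_T1_T2[tech] / (v[0] - v[1]) ** 2


def transition_energy(tech: str, a: int, b: int) -> float:
    """Predicted energy (J) to switch one byte between levels Ta and Tb."""
    v = VOLTAGES[tech]
    dv = v[a - 1] - v[b - 1]
    return 0.5 * capacitance_per_byte(tech) * dv ** 2


for tech in VOLTAGES:
    print(tech, f"T1<->T4 = {transition_energy(tech, 1, 4):.3g} J")
```

Under this assumption the predicted T1<->T3 and T1<->T4 energies agree with Table 2.3 to within about 1%, which reflects the quadratic dependence on the voltage step: doubling the step quadruples the energy.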
Figure 2.5: Charge sharing between cache lines at different tranquility levels

We analyze the circuit from first principles. Let $C$ be the capacitance of the capacitor, $V_i$ its initial voltage and $V_f$ its final voltage.

The instantaneous current through the capacitor during switching is

$$I_c = C \frac{dv_c}{dt}, \qquad I_c = I_s \ \text{(current through the switching device)}$$

The power dissipated in the switching device is

$$P_s = V_s I_s$$

Substituting $I_c$ for $I_s$ gives

$$P_s = V_s \cdot C \frac{dv_c}{dt} \quad\Longrightarrow\quad P_s \, dt = V_s \cdot C \, dv_c$$

Writing the voltage across the switching device, $V_s$, in terms of the voltage across the capacitor, $V_c$:

$$V_s = V_f - V_c \quad\Longrightarrow\quad P_s \, dt = (V_f - V_c) \cdot C \, dv_c$$

Integrating over the switching time $T_s$, during which $V_c$ goes from $V_i$ to $V_f$:

$$\int_0^{T_s} P_s \, dt = C \int_{V_i}^{V_f} (V_f - V_c) \, dv_c$$

$$= C V_f \int_{V_i}^{V_f} dv_c - C \int_{V_i}^{V_f} V_c \, dv_c$$

$$= C V_f (V_f - V_i) - \frac{C}{2} \left( V_f^2 - V_i^2 \right)$$

$$= C (V_f - V_i) \left[ V_f - \frac{1}{2} (V_f + V_i) \right]$$

$$= C (V_f - V_i) \cdot \frac{1}{2} (V_f - V_i)$$

$$= \frac{C}{2} (V_f - V_i)^2$$

$$E = \frac{1}{2} C \, \Delta V^2$$

Hence, when a cache line with total line capacitance $C$ is switched from a tranquility level Ti to a tranquility level Tf, the energy dissipated is

$$E = \frac{1}{2} \cdot C \, (V_i - V_f)^2 \qquad (2.3)$$

The total amount of dynamic energy depends on the replacement policy, the benchmarks, and the levels of tranquility. We must consider the effect of the benchmarks to evaluate dynamic power costs.

2.5.3 Performance Costs

Transitions between tranquility levels come at a cost in terms of performance. To determine the exact wake-up time, we have run simulations to measure the time needed to wake up an SRAM cell from each tranquility level to the full power mode. Figure 2.6 shows the simulated curves obtained by switching the power rails of an SRAM cell in the 70nm technology. We switch lines from the different tranquility level voltages to full rail voltage (from left: first from T2, then from T3 and finally from T4). The time is measured for each transition and compared to the clock periods proposed by Agarwal et al. [2] for the corresponding technology.
The 8FO4 clock frequencies for the considered technologies, as suggested by Agarwal et al. [2], are given in Table 2.4.

Table 2.4: 8FO4 clock frequencies used for our simulations

Technology   8FO4 Clock (GHz)   Cycle time (ns)
130nm        2.67               0.37
100nm        3.47               0.29
70nm         4.96               0.20

Figure 2.6: The power wake up cycle from different levels of tranquility in the 70nm technology

Table 2.5 shows the wake-up penalties in terms of clock cycles. Our HSPICE simulations using the predictive technologies [26] revealed that the wake-up penalty from the T2 level is 1 cycle and that it is 2 cycles from both T3 and T4. We observed that the trend is towards an increasing wake-up penalty, in line with the observation by Agarwal et al. [2] that cache access time is not scaling as fast as the clock. So for future technologies the wake-up penalty from the lowest level of tranquility will be 3 cycles or even more.

Table 2.5: Wake up penalties in terms of cycles

Technology   T1   T2   T3   T4
130nm        0    1    2    2
100nm        0    1    2    2
70nm         0    1    2    2

2.5.4 Maximum Possible Leakage Power Savings in Slumberous Caches

In this section we ignore any dynamic power cost of switching from one tranquility level to another, to get an upper bound on the leakage power saving that can be achieved by the different schemes discussed in Section 2.4.3. An upper bound for any leakage power scheme employing DVS (Dynamic Voltage Scaling) is obtained by keeping all cache lines at the lowest possible voltage level at all times, i.e. TL1-T4 for a 4-way set-associative cache. All hits and misses wake up one cache line, but immediately put it back to the T4 level. In this section a 4-way set-associative cache is considered. Upper bounds are calculated for various slumberous cache schemes, viz. TL1-T4, TL4 (each of the 4 replacement priority levels has a separate tranquility level), TL2-T2, TL2-T3 and TL2-T4. Table 2.6 shows the upper bounds for the average leakage power saved per byte for the above-mentioned schemes.
Table 2.7 and Figure 2.7 show the same information in terms of the % savings (relative to total leakage power). The maximum savings depend only on the technology, the number of cache ways, and the tranquility levels. They are independent of the replacement policy (because, at any time, the same number of lines is at any one tranquility level). They do not include the dynamic power needed to switch between tranquility levels (that is why we call them maximum).

Table 2.6: Average leakage power dissipated and maximum leakage power savings per byte

             Average Static Power       Average Static Power Saved per Byte (nW)
Technology   dissipated per Byte (nW)   TL1-T4   TL4     TL2-T2   TL2-T3   TL2-T4
130nm        9.86                       7.20     4.26    2.95     4.42     5.40
100nm        22.19                      15.48    9.14    6.28     9.54     11.61
70nm         64.43                      37.29    20.95   13.18    21.70    27.97

Table 2.7: Maximum percent leakage power savings

Technology   TL1-T4   TL4      TL2-T2   TL2-T3   TL2-T4
130nm        73.00%   43.18%   29.92%   44.85%   54.75%
100nm        69.73%   41.19%   28.31%   42.97%   52.30%
70nm         57.87%   32.51%   20.46%   33.67%   43.40%

Of all approaches, TL1-T4, which keeps all lines at the minimum power level, yields the best reduction of leakage power. TL2-T4 is second, TL2-T2 is last, and TL4 is in between. These observations should be obvious, given that leakage power savings depend on tranquility levels. However, one must also contend with dynamic power and performance penalties before making a final judgment.

Figure 2.7: Maximum percent leakage power savings for various schemes

Though leakage energy savings increase exponentially with technology scaling, as we will see in Section 2.7, the percent leakage savings decrease as technology scales (Figure 2.7). This decrease in percent savings occurs because the difference between the T1 and Tn levels shrinks with technology scaling. For the technologies considered it drops from 0.6V to 0.3V, i.e. a 50% reduction. So ultimately state-destroying leakage energy saving techniques seem likely to survive.
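The entries of Table 2.6 can be reproduced from Equations 2.1 and 2.2 and the per-bit figures of Table 2.2. The sketch below is only a cross-check under that reading of the equations; the names `power_per_byte` and `max_saving` are illustrative, and a scheme is described by the list of tranquility level indices held by the four ways of a set (for example `[1, 4, 4, 4]` for TL2-T4).

```python
# (voltage in V, per-bit leakage current in nA) for T1..T4, from Table 2.2.
TABLE_2_2 = {
    "130nm": [(1.3, 0.948), (1.1, 0.673), (0.9, 0.550), (0.7, 0.475)],
    "100nm": [(1.1, 2.522), (0.95, 1.818), (0.8, 1.481), (0.65, 1.292)],
    "70nm": [(0.9, 8.949), (0.8, 7.321), (0.7, 6.340), (0.6, 5.655)],
}


def power_per_byte(voltage: float, current_na: float) -> float:
    """Leakage power of one byte in nW (Equation 2.1, 8 bits per byte)."""
    return 8.0 * current_na * voltage


def max_saving(tech: str, levels: list) -> float:
    """Maximum leakage power saved per byte in nW (Equation 2.2).

    `levels` holds the tranquility level index (1..4) of each way in a set.
    """
    cells = TABLE_2_2[tech]
    full_rail = power_per_byte(*cells[0])
    average = sum(power_per_byte(*cells[t - 1]) for t in levels) / len(levels)
    return full_rail - average


SCHEMES = {
    "TL1-T4": [4, 4, 4, 4],
    "TL4": [1, 2, 3, 4],
    "TL2-T2": [1, 2, 2, 2],
    "TL2-T3": [1, 3, 3, 3],
    "TL2-T4": [1, 4, 4, 4],
}

for tech in TABLE_2_2:
    total = power_per_byte(*TABLE_2_2[tech][0])
    row = ", ".join(
        f"{name}={max_saving(tech, lv):.2f} nW ({100 * max_saving(tech, lv) / total:.2f}%)"
        for name, lv in SCHEMES.items()
    )
    print(f"{tech}: total={total:.2f} nW; {row}")
```

Because every set always holds the same mix of tranquility levels, the result does not depend on which line is at which level, which is why these bounds are independent of the replacement policy.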
Using replacement priority information for state-destroying leakage saving is beyond the scope of this research.

2.6 Architectural Simulations

Whereas the leakage power savings are independent of the benchmarks, we must run architectural simulations to understand the dynamic power and performance implications of each power saving scheme.

2.6.1 Simplescalar Simulations

We modified the Simplescalar code to provide the statistics required to calculate the average power dissipation of the L1 data cache for different Spec2000 benchmark programs. Table 2.8 gives the processor model used for our simulations.

Table 2.8: Baseline microprocessor Simulation Model

Instruction Cache        16k 2-way set-associative, 32 byte blocks, 1 cycle latency
Data Cache               8k 4-way set-associative, 32 byte blocks, 1 cycle latency
Unified L2 Cache         1Meg 4-way set-associative, 64 byte blocks, 20 cycle latency
Memory                   100 cycle round trip access
Out-of-Order Issue       out-of-order issue of up to 4 instructions / cycle, 128 entry re-order buffer
Architecture Registers   32 integer, 32 floating point
Functional Units         4-integer ALU, 2-load/store units, 4-FP ALUs, 1-integer MULT/DIV, 1-FP MULT/DIV

We have used single-sample Simpoints [65] of 100 million instructions each for selected Spec2000 programs. The resulting Simpoints are given in Table 2.9.

Table 2.9: The single simpoints for simulations of the Spec2000 benchmarks

Spec2000 Benchmark   gzip   gcc   mcf   parser   vpr    bzip2   twolf   equake   art
Single Simpoint      814    960   369   1030     1722   184     11      5496     42

2.7 Evaluations of Slumberous Caches for Famous Cache Replacement Algorithms

The concept of slumberous caches can be applied to various replacement policies with at least two priority levels; we considered three replacement policies, namely LRU (Least Recently Used), PLRU (Pseudo LRU), and MRR (Modified Random Replacement). We have concentrated on the design of the L1 data cache in the Pentium 4, an 8k 4-way set-associative cache with 32-byte lines.
2.7.1 Pseudo LRU (PLRU)

For completeness we review the PLRU policy. PLRU approximates LRU. LRU is difficult to maintain in wide caches because of the complexity of updating the state bits that keep track of replacement priorities. To implement PLRU in a 4-way cache, we need three state bits called Bit_0, Bit_1 and Bit_2. Table 2.10 shows the cache line priority levels for the different combinations of these three bits. The line at P1 is the MRU line and the line at P4 is the line to replace.

Table 2.10: Cache-line replacement priorities for different combinations of Bit_0, Bit_1 and Bit_2

Bit_2   Bit_1   Bit_0   Line_0   Line_1   Line_2   Line_3
0       0       0       P4       P3       P2       P1
0       0       1       P2       P1       P4       P3
0       1       0       P3       P4       P2       P1
0       1       1       P1       P2       P4       P3
1       0       0       P4       P3       P1       P2
1       0       1       P2       P1       P3       P4
1       1       0       P3       P4       P1       P2
1       1       1       P1       P2       P3       P4

Figure 2.8 shows how the state bits are used to select a victim line. Bit_0 selects between two groups of cache lines, group_0 (line_0 and line_1) or group_1 (line_2 and line_3). Bit_1 selects between line_0 and line_1, and Bit_2 selects between line_2 and line_3. If Bit_0 is 0 we do not care about Bit_2, and Bit_1 decides which of line_0 or line_1 is replaced. Similarly, if Bit_0 is 1 we do not care about Bit_1, and Bit_2 selects between line_2 and line_3. When a cache line is referenced we update the state bits. For example, if line_0 is accessed we set Bit_0 to 1, so that the next victim will be in group_1, and we also set Bit_1 to 1, so that the next time group_0 is selected line_1 will be chosen for replacement.

Figure 2.8: PLRU implementation for 4-way caches

Figure 2.9 shows the priority level transitions of cache lines on an access to a set. For example, if a hit occurs at a cache line whose priority level is P3, then its priority goes to P1, the priority of the line previously at P4 goes to P2, the line at P1 goes to P3 and the line at P2 goes to P4.
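The bit manipulation described above can be sketched in a few lines. This is an illustrative model only: the function names are assumptions, and the rule used to derive P1..P4 from the three bits (victim P4, its group partner P3, the would-be victim of the other group P2, the remaining line P1) is inferred from Table 2.10 rather than stated in the text.

```python
def touch(bits: tuple, line: int) -> tuple:
    """Return the new (Bit_0, Bit_1, Bit_2) after an access to `line` (0..3)."""
    bit0, bit1, bit2 = bits
    if line < 2:
        bit0 = 1          # next victim comes from group_1
        bit1 = 1 - line   # within group_0, point at the line NOT just used
    else:
        bit0 = 0          # next victim comes from group_0
        bit2 = 3 - line   # within group_1, point at the line NOT just used
    return (bit0, bit1, bit2)


def victim(bits: tuple) -> int:
    """Line selected for replacement (the line at priority P4)."""
    bit0, bit1, bit2 = bits
    return bit1 if bit0 == 0 else 2 + bit2


def priorities(bits: tuple) -> list:
    """Replacement priority (1 = P1/MRU .. 4 = P4/victim) of line_0..line_3."""
    bit0, bit1, bit2 = bits
    group0 = (bit1, 1 - bit1)        # (would-be victim, partner) in group_0
    group1 = (2 + bit2, 3 - bit2)    # (would-be victim, partner) in group_1
    doomed, safe = (group0, group1) if bit0 == 0 else (group1, group0)
    pri = [0, 0, 0, 0]
    pri[doomed[0]], pri[doomed[1]] = 4, 3
    pri[safe[0]], pri[safe[1]] = 2, 1
    return pri


state = (0, 0, 0)                    # Bit_0 = Bit_1 = Bit_2 = 0
print(priorities(state))             # line_0 is the victim here
state = touch(state, 1)              # hit on line_1, which was at P3
print(priorities(state))
```

Starting from all bits at 0, a hit on line_1 (priority P3) moves it to P1 and sends the old P4, P1 and P2 lines to P2, P3 and P4 respectively, which matches the transition described for Figure 2.9.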
These priority level transitions are dictated by PLRU and result in various tranquility level transitions, depending on the control scheme employed.

Figure 2.9: How hits at different Priority levels affect the cache under PLRU policy

Leakage savings under the PLRU algorithm are evaluated for the schemes described in Section 2.5.4.

2.7.1.1 Dynamic Power Penalties

Figure 2.10 shows the dynamic power required per byte to save leakage power in the L1 data cache for TL4 under PLRU.

Figure 2.10: Dynamic power incurred per byte for L1 data cache for TL4 under PLRU policy

This power varies over a wide range across the benchmarks and depends upon the number of hits in the P2-P4 levels and the number of misses. The dynamic power costs under TL2-T4 differ from those under TL4 (see Figure 2.11); the reason is the increased dynamic cost of hits in P2 and P3 and of misses. On average the dynamic power cost more than doubles (from 1.12nW to 2.96nW at 130nm, Table 2.11). Dynamic power is significantly reduced as technology scales. This is because the range of voltage levels between T1 and T4 shrinks as technology scales, and the dynamic power cost is proportional to the square of the voltage difference (Equation 2.3).

Figure 2.11: Dynamic power incurred per byte for L1 data cache for TL2-T4 under PLRU policy

In Table 2.11 and Figure 2.12 we compare the average dynamic power consumed by the various schemes to save leakage power. The dynamic power is the average dynamic power across all benchmarks, obtained by summing the dynamic energy needed for all the benchmarks and dividing the sum by the total execution time. In all cases, dynamic power is by far the worst in TL1-T4, although the gap closes quickly with scaled-down technologies. The curves can be explained by the voltage difference between the drowsy and full rail levels in the different schemes.
Figure 2.12: Average dynamic power cost for L1 data cache for different power schemes under PLRU

Table 2.11: Average dynamic power costs (nW) for various schemes under PLRU

Technology   TL1-T4   TL4    TL2-T2   TL2-T3   TL2-T4
130nm        20.07    1.12   0.33     1.31     2.96
100nm        10.20    0.57   0.17     0.67     1.50
70nm         4.71     0.26   0.08     0.31     0.69

It is clear that our evaluations must consider both the leakage power saved and the dynamic power incurred by the leakage power scheme. The net power saving is:

$$\text{Net Saving} = \text{Leakage Power Saved} - \text{Dynamic Power Incurred}$$

Figure 2.13: Net Savings per byte for L1 data cache for TL4 under PLRU (x-axis: Spec2000 benchmarks and their average; y-axis: Net Savings per Byte (nW); series: 130nm, 100nm, 70nm)

Figure 2.13 shows the Net Savings for different benchmarks for the TL4 scheme under PLRU. We see that, as technology scales, net power savings become independent of the benchmark, because the dynamic power becomes negligible and the static power saved is independent of the benchmark. Whereas the Net Savings increase exponentially with scaled-down technology, it is important to look at the percent savings, as the leakage power also grows exponentially with the technology.

The percent net saving is:

$$\%\,\text{Net Saving} = \frac{\text{Net Saving}}{\text{Total Leakage Power}} \times 100$$

The % Net Savings for TL4 under PLRU are shown across the benchmarks in Figure 2.14. It shows that the percent savings remain steady across technologies, and also become independent of the benchmark as technology scales down.

Figure 2.14: % Net Savings per byte for L1 data cache for TL4 scheme under PLRU policy (x-axis: Spec2000 benchmarks and their average; y-axis: % Net Savings per Byte; series: 130nm, 100nm, 70nm)

Figure 2.15 shows the average net leakage power saved per byte across all benchmarks. Regardless of the power scheme, Net Savings increase exponentially with technology in all cases.
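The two definitions above can be applied directly to the benchmark-averaged figures already tabulated. The sketch below combines the leakage power saved per byte from Table 2.6 with the average dynamic power costs under PLRU from Table 2.11; treating the Table 2.11 values as per-byte quantities that can be subtracted directly is an assumption of this sketch, and the function names are illustrative.

```python
# Total leakage power per byte (nW), from Table 2.6.
TOTAL_LEAKAGE = {"130nm": 9.86, "100nm": 22.19, "70nm": 64.43}

# Maximum leakage power saved per byte (nW), from Table 2.6.
LEAKAGE_SAVED = {
    "TL1-T4": {"130nm": 7.20, "100nm": 15.48, "70nm": 37.29},
    "TL4": {"130nm": 4.26, "100nm": 9.14, "70nm": 20.95},
    "TL2-T2": {"130nm": 2.95, "100nm": 6.28, "70nm": 13.18},
    "TL2-T3": {"130nm": 4.42, "100nm": 9.54, "70nm": 21.70},
    "TL2-T4": {"130nm": 5.40, "100nm": 11.61, "70nm": 27.97},
}

# Average dynamic power cost under PLRU (nW), from Table 2.11.
# Assumption: these are per-byte values on the same basis as Table 2.6.
DYNAMIC_COST = {
    "TL1-T4": {"130nm": 20.07, "100nm": 10.20, "70nm": 4.71},
    "TL4": {"130nm": 1.12, "100nm": 0.57, "70nm": 0.26},
    "TL2-T2": {"130nm": 0.33, "100nm": 0.17, "70nm": 0.08},
    "TL2-T3": {"130nm": 1.31, "100nm": 0.67, "70nm": 0.31},
    "TL2-T4": {"130nm": 2.96, "100nm": 1.50, "70nm": 0.69},
}


def net_saving(scheme: str, tech: str) -> float:
    """Net Saving = Leakage Power Saved - Dynamic Power Incurred (nW)."""
    return LEAKAGE_SAVED[scheme][tech] - DYNAMIC_COST[scheme][tech]


def percent_net_saving(scheme: str, tech: str) -> float:
    """%Net Saving = Net Saving / Total Leakage Power * 100."""
    return 100.0 * net_saving(scheme, tech) / TOTAL_LEAKAGE[tech]


for scheme in LEAKAGE_SAVED:
    cells = ", ".join(
        f"{tech}: {net_saving(scheme, tech):+.2f} nW ({percent_net_saving(scheme, tech):+.1f}%)"
        for tech in TOTAL_LEAKAGE
    )
    print(f"{scheme:7s} {cells}")
```

With these inputs the net saving of TL1-T4 at 130nm comes out negative, because the 20.07nW dynamic cost exceeds the 7.20nW of leakage saved; this is the kind of case that makes the net figure, rather than the leakage bound alone, the relevant metric.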
Because of the explosive increase of leakage power and the rapid drop in dynamic power with technology scaling, the net gains obtainable by TL1-T4 (i.e. all lines at T4) are on a steeper upward curve than those for any of the schemes governed by the replacement policy. We observe that TL2-T4 gives the most savings among the schemes dictated by the replacement policy and that, for the most advanced technology we have looked at, its savings are roughly equal to those of TL1-T4.

Figure 2.15: Power savings for L1 data cache under various schemes in future technologies for PLRU (x-axis: Technology, 130nm, 100nm, 70nm; y-axis: Average Net Leakage Power Saved per Byte (nW); series: TL1-T4, TL4, TL2-T2, TL2-T3, TL2-T4)

Finally, the average net leakage power saving is compared to the maximum saving computed in Section 2.5.4; all savings are shown as a percent of the TL1-T4 savings of Section 2.5.4 (see Figure 2.16).

Figure 2.16: Average net leakage power savings for L1 data cache for various schemes under PLRU

2.7.1.2 Performance Penalties

Figure 2.17 shows the average increase in L1 cache hit access time (in %) taken over all the benchmarks. We do not show the case of TL1-T4, as TL1-T4 increases hit latency by 200% and including it would obscure the comparison between the schemes driven by the replacement policy. For the TL4 scheme we see approximately a 7% average increase in hit latency over a cache with no leakage power scheme. The only scheme better than TL4 is TL2-T2, but this scheme has the least leakage savings and does not trend well. TL2-T4 results in a 12% increase in hit latency, and the increasing trend in wake-up penalties suggests that TL2-T4 may not scale well with respect to performance, whereas TL4 will continue to save leakage power with little performance impact.

Figure 2.17: Percent increase in L1 access time for hits for various schemes under PLRU

2.7.2 LRU (Least Recently Used) Policy

Though LRU is considered hard to implement, some real-world systems do use it, e.g. the UltraSPARC IV used LRU for its L2 cache [57]; hence we evaluate it in the same way as PLRU.

There are many possible implementations of LRU for a 4-way set-associative cache. For the sake of discussion, consider an implementation that employs 2 bits per way to assign a replacement priority to each line. Initially all bits are reset, so all lines are at the same priority level. Table 2.12 shows the replacement priority assigned to each bit combination. Table 2.13 shows how hits in different cache lines of a specific set change the replacement priority of the cache lines in the same set under the LRU policy.

Table 2.12: Replacement priority assignment for LRU

Bit Combination   Replacement Priority
00                P1
01                P2
10                P3
11                P4

Table 2.13: Cache-line replacement priorities for hits in different cache line LRU

Hit @    Line_0   Line_1   Line_2   Line_3
Line_0   P1       P2       P3       P4
Line_1   P2       P1       P3       P4
Line_3   P3       P2       P4       P1
Line_1   P3       P1       P4       P2
Line_0   P1       P2       P4       P3
Line_2   P2       P3       P1       P4
Line_3   P3       P4       P2       P1
Line_2   P3       P4       P1       P2

Figure 2.18 shows the priority level transitions of cache lines on an access to a set under LRU. For example, if a hit occurs at a cache line whose priority level is P3, then its priority goes to P1, the priority of the line previously at P4 remains unchanged, the line previously at P1 goes to P2 and the line at P2 goes to P3. These priority level transitions are dictated by LRU and result in various tranquility level transitions, depending on the control scheme employed.

Figure 2.18: How hits at different Priority levels affect the cache under LRU policy

Similar to the PLRU evaluations, the various LRU schemes are denoted as TL2-T2, TL2-T3, TL2-T4 and TL4.
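The hit sequence of Table 2.13 can be replayed with a simple recency list, which is one of the many possible LRU implementations mentioned above and not the 2-bits-per-way encoding of Table 2.12. The sketch assumes that the set starts in the order line_0, line_1, line_2, line_3 from most to least recently used; the text only says that all bits start reset, so this initial order is an assumption, as are the function names.

```python
def hit(order: list, line: int) -> list:
    """Move `line` to the MRU position; `order` lists lines from MRU to LRU."""
    return [line] + [other for other in order if other != line]


def priorities(order: list) -> list:
    """Replacement priority (1 = P1/MRU .. n = Pn/victim) of line_0..line_n-1."""
    pri = [0] * len(order)
    for rank, line in enumerate(order, start=1):
        pri[line] = rank
    return pri


def replay(hits: list, order: list) -> list:
    """Return the priority row produced after each hit in `hits`."""
    rows = []
    for line in hits:
        order = hit(order, line)
        rows.append(priorities(order))
    return rows


# Hit sequence of Table 2.13, starting from an assumed initial order.
for line, row in zip(
    [0, 1, 3, 1, 0, 2, 3, 2], replay([0, 1, 3, 1, 0, 2, 3, 2], [0, 1, 2, 3])
):
    print(f"hit @ line_{line}: " + " ".join(f"P{p}" for p in row))
```

Unlike PLRU, a hit only demotes the lines that were more recent than the one hit, so a hit at P3 leaves the P4 line where it is. Under a TLn scheme this means fewer lines change tranquility level per access at the deep levels, but every line between P1 and the hit position moves one level down.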
2.7.2.1 Dynamic Power Penalties

Figure 2.19 shows the dynamic power required per byte to save leakage power in the L1 data cache for TL4.

Figure 2.19: Dynamic power cost for L1 data cache for TL4 scheme under LRU policy

Although Figure 2.19 looks very similar to Figure 2.10, the dynamic power costs for the same benchmark actually differ between the two replacement algorithms considered. Table 2.14 compares the TL4 schemes under LRU and PLRU with respect to dynamic power consumption. It is interesting to observe that the dynamic power incurred is always lower for PLRU except for art, which has a miss rate of almost 50%. This confirms the intuition that more complex algorithms are in general more power hungry. In both cases dynamic power is significantly reduced as technology scales. Hence for the 70nm technology both policies consume nearly the same power on average.

Table 2.14: Dynamic power costs for different benchmarks for TL4 schemes under PLRU and LRU

            130nm             100nm             70nm
Benchmark   LRU      PLRU     LRU      PLRU     LRU      PLRU
gzip        0.8050   0.8027   0.4089   0.4077   0.1889   0.1883
gcc         0.3654   0.3527   0.1856   0.1791   0.0857   0.0827
mcf         0.9570   0.9411   0.4861   0.4780   0.2245   0.2208
parser      0.8727   0.7920   0.4433   0.4023   0.2047   0.1858
vpr         0.7508   0.7253   0.3814   0.3684   0.1761   0.1702
bzip2       0.4665   0.4543   0.2369   0.2307   0.1094   0.1066
twolf       0.6467   0.6182   0.3285   0.3140   0.1517   0.1450
equake      1.9952   1.9833   1.0134   1.0074   0.4681   0.4653
art         3.3741   3.4094   1.7138   1.7318   0.7916   0.7999
Average     1.1370   1.1199   0.5775   0.5688   0.2668   0.2627

Table 2.15 shows the average dynamic power consumed by the various schemes to save leakage power under the LRU policy. These values follow the same trend as in Table 2.11 but are slightly higher. The net saving for LRU is calculated in the same way as for PLRU.
Table 2.15: Average dynamic power costs for various schemes under LRU

Technology   TL1-T4   TL4    TL2-T2   TL2-T3   TL2-T4
130nm        20.51    1.14   0.34     1.35     3.03
100nm        10.42    0.58   0.17     0.68     1.54
70nm         4.81     0.27   0.08     0.32     0.71

Figure 2.20 shows the Net Savings across the benchmarks for TL4 under LRU. The observation is the same: as technology scales, net power savings become independent of the benchmark, because the dynamic power becomes negligible and the static power saved is independent of the benchmark.

Figure 2.20: Net savings for L1 data cache for TL4 under LRU policy

The % Net Savings for TL4 under LRU across different benchmarks are shown in Figure 2.21. It shows that the percent savings remain steady across technologies, and also become independent of the benchmark as technology scales down.

Figure 2.21: Percent net savings for L1 data cache for TL4 under LRU

Figure 2.22 shows the average net leakage power saved per byte across all benchmarks. Regardless of the power scheme, Net Savings increase exponentially with technology in all cases. Because of the explosive increase of leakage power and the rapid drop in dynamic power with technology scaling, the net gains obtainable by TL1-T4 (i.e. all lines at T4) are on a steeper upward curve than those for any of the schemes governed by the replacement policy. TL2-T4 gives the most savings among the schemes dictated by the replacement policy and, for the most advanced technology we have looked at, its savings are roughly equal to those of TL1-T4.

Figure 2.22: Power savings for L1 data cache under various schemes in future technologies for LRU

Figure 2.23 compares the average net leakage power saving as a percent of the TL1-T4 savings of Section 2.5.4.

Figure 2.23: Average net leakage power savings for L1 data cache for various schemes under LRU

2.7.2.2 Performance Penalties

Figure 2.24 shows the average increase in L1 cache hit access time (in %) taken over all the benchmarks.
TL1-T4 is not shown, as TL1-T4 increases hit latency by 200% and including it would obscure the comparison between the schemes driven by the replacement policy. For the TL4 scheme we see approximately an 8% average increase in hit latency over a cache with no leakage power scheme. The only scheme better than TL4 is TL2-T2, but this scheme has the least leakage savings and does not trend well. TL2-T4 results in a 14% increase in hit latency, and the increasing trend in wake-up penalties suggests that TL2-T4 may not scale well with respect to performance, whereas TL4 will continue to save leakage power with little performance impact.

Table 2.16 compares the various power saving schemes for PLRU and LRU, where performance impact means the percent increase in the hit latency of the L1 data cache. In all cases the performance impact of LRU is greater than that of PLRU. The average miss rate over the considered benchmarks is lower for LRU (9.40%) than for PLRU (9.62%), so this difference in performance arises because LRU has 2.6% more hits in non-MRU cache lines than PLRU.

Table 2.16: Comparing performance impact of various power saving schemes under LRU and PLRU

Power Saving Scheme   Performance impact (PLRU)   Performance impact (LRU)
TL4                   7.20%                       7.57%
TL2-T2                6.26%                       7.11%
TL2-T3                12.52%                      14.23%
TL2-T4                12.52%                      14.23%

Figure 2.24: Percent increase in L1 access time for hits for various schemes under LRU

2.7.3 Modified Random Replacement (MRR) Policy

It was mentioned in Section 2.4 that the slumberous cache idea can be applied only to replacement algorithms that have at least two priority levels. The random replacement algorithm is used in real-world systems, e.g. Sun's Niagara [36]. To make it fit the slumberous cache idea, MRU information is added to the random replacement algorithm, and the result is named Modified Random Replacement. MRR has two priority levels, i.e. MRU and Non-MRU.
For a 4-way set associative cache, three ways will be at the Non-MRU priority level and one at the MRU level; hence a single bit is needed to differentiate between the priority levels.

Table 2.17: Replacement priority assignment for MRR

Bit    Replacement Priority
0      P2
1      P1

Table 2.18 shows how replacement priorities for different cache lines change corresponding to hits in different ways within a set.

Table 2.18: Cache-line replacement priorities for hits in different cache lines under MRR

Hit @     Line_0    Line_1    Line_2    Line_3
Line_0    P1        P2        P2        P2
Line_1    P2        P1        P2        P2
Line_3    P2        P2        P2        P1
Line_3    P2        P2        P2        P1
Line_0    P1        P2        P2        P2
Line_2    P2        P2        P1        P2
Line_3    P2        P2        P2        P1
Line_2    P2        P2        P1        P2

Figure 2.25: How hits at different Priority levels affect the cache under MRR replacement policy

Figure 2.25 shows the priority level transitions of cache lines on an access to a set under MRR. For example, if a hit occurs at a cache line whose priority level is P2, then its priority goes to P1 and the priority of the line previously at P1 goes to P2. These priority level transitions are dictated by MRR and result in various tranquility level transitions, depending on the control scheme employed.

MRR performance compared to LRU and PLRU is shown in Table 2.19. As far as IPC is concerned, MRR is a little better than PLRU and a little worse than LRU for the selected benchmarks.

Table 2.19: Comparing IPCs of various replacement algorithms

IPC        MRR     PLRU    LRU
gzip       2.02    1.99    2.04
gcc        0.53    0.52    0.52
mcf        0.86    0.86    0.87
parser     2.06    2.08    2.12
vpr        0.90    0.90    0.91
bzip2      0.68    0.67    0.68
twolf      1.45    1.38    1.47
equake     0.99    1.00    0.99
art        2.00    1.43    1.44
Average    1.19    1.17    1.20

TL4 is not possible for MRR, so, similar to the evaluations for PLRU and LRU, schemes TL2-T2, TL2-T3 and TL2-T4 are considered.

2.7.3.1 Dynamic Power Penalties

Figure 2.26 shows the dynamic power required per byte to save leakage power in L1 data cache for TL2-T4 under MRR.
Figure 2.26: Dynamic power incurred per byte for L1 data cache for TL2-T4 under MRR policy

Table 2.20 compares the TL2-T4 scheme for the PLRU, LRU and MRR replacement policies with respect to dynamic power consumption. It is interesting to observe that in all cases MRR is as good as PLRU. LRU, being more complex to implement, is more power hungry. In all cases dynamic power is significantly reduced as technology scales. Hence for 70nm technology all cases incur almost the same dynamic power cost.

Table 2.20: Dynamic power costs for different benchmarks for the TL2-T4 scheme under PLRU, LRU and MRR

           PLRU     LRU      MRR      PLRU     LRU      MRR
           130nm    130nm    130nm    70nm     70nm     70nm
gzip       1.19     1.26     1.19     0.28     0.29     0.28
gcc        0.85     0.87     0.85     0.20     0.20     0.20
mcf        10.53    10.74    10.53    2.47     2.52     2.47
parser     1.86     2.05     1.86     0.44     0.48     0.44
vpr        1.78     1.82     1.78     0.42     0.43     0.42
bzip2      0.86     0.89     0.86     0.20     0.21     0.20
twolf      1.20     1.26     1.20     0.28     0.30     0.28
equake     5.05     5.16     5.05     1.18     1.21     1.18
art        3.28     3.25     3.28     0.77     0.76     0.77
Average    2.96     3.03     2.96     0.69     0.71     0.69

Table 2.21: Average dynamic power costs for various schemes under MRR

          TL1-T4    TL2-T2    TL2-T3    TL2-T4
130nm     20.51     0.33      1.31      2.96
100nm     10.42     0.17      0.67      1.50
70nm      4.81      0.08      0.31      0.69

Table 2.21 shows the average dynamic power consumed by the various schemes under the MRR policy. Net savings for MRR are calculated in the same way as for LRU and PLRU. Figure 2.27 shows the Net Savings for different benchmarks in the case of TL2-T4 under MRR. The observation is the same: as technology scales, net power savings become independent of the benchmark because the dynamic power becomes negligible and the static power saved is independent of the benchmark.

Figure 2.27: Net savings per byte for L1 data cache for TL2-T4 under MRR

The % Net Savings for TL2-T4 across different benchmarks are shown in Figure 2.28. It shows that the percent savings remain steady across technologies and also become independent of the benchmark as technology scales down.
Figure 2.28: Percent net savings per byte for L1 data cache for TL2-T4 scheme under MRR

Figure 2.29 shows the average net leakage power saved per byte across all benchmarks. Regardless of the power scheme, Net Savings increase exponentially with technology scaling in all cases. Because of the explosive increase of leakage power and the rapid drop in dynamic power with technology scaling, the net gains obtainable by TL1-T4 (i.e. all lines at T4) are on a steeper upward curve than those for any of the schemes governed by the replacement policy. It is observed that TL2-T4 gave the most savings among schemes dictated by the replacement policy and that, for the most advanced technology we have looked at, its savings are roughly equal to those of TL1-T4.

Figure 2.29: Power savings for L1 data cache under various schemes in future technologies for MRR

Figure 2.30: Average net leakage power savings for L1 data cache for various schemes under MRR

Finally, the average net leakage power saving is compared to the maximum saving computed in Section 2.5.4; all savings are shown as a percent of the TL1-T4 savings of Section 2.5.4 (see Figure 2.30).

2.7.3.2 Performance Penalties

Figure 2.31 shows the average increase in L1 cache hit access time (in %) taken over all the benchmarks. TL1-T4 is not shown, as it increases hit latency by 200% and would obscure the comparison between the schemes driven by the replacement policy. For the TL2-T4 scheme we see an approximately 13% average increase in hit latency over a cache with no leakage power scheme. TL2-T2 and TL2-T3 both have the same performance impact of approximately 6%.

Figure 2.31: Percent increase in L1 access time for hits for various schemes under MRR

Table 2.22 compares the power saving schemes under PLRU, LRU and MRR, where performance impact means the percent increase in the hit latency of the L1 data cache. In all cases the performance impact of LRU is the maximum, whereas that of MRR is the minimum.
Table 2.22: Comparing performance impact of power saving schemes under PLRU, LRU and MRR

Power Saving Scheme    Performance impact (PLRU)    Performance impact (LRU)    Performance impact (MRR)
TL4                    7.20%                        7.57%                       NA
TL2-T2                 6.26%                        7.11%                       6.17%
TL2-T3                 12.52%                       14.23%                      12.33%
TL2-T4                 12.52%                       14.23%                      12.33%

2.7.4 Best Leakage Control Scheme for L1 data Cache

So far 12 leakage control schemes based on the replacement algorithm were discussed. To find the best of these, a metric is devised: the net leakage saving per percent increase in the hit latency of the L1 data cache. This metric will be referred to as the Leakage Energy Saving Metric (LESM) in the rest of the document.

Table 2.23: Comparing all 12 leakage control schemes with respect to LESMs

LESM              130nm    100nm    70nm
PLRU    TL4       0.44     1.19     2.87
        TL2-T2    0.44     0.99     2.10
        TL2-T3    0.29     0.73     1.72
        TL2-T4    0.28     0.85     2.20
LRU     TL4       0.41     1.13     2.73
        TL2-T2    0.38     0.87     1.85
        TL2-T3    0.25     0.64     1.51
        TL2-T4    0.24     0.75     1.93
MRR     TL2-T2    0.44     1.00     2.13
        TL2-T3    0.29     0.74     1.74
        TL2-T4    0.28     0.86     2.23

As per Table 2.23, PLRU4 is the best scheme for all technologies considered. For 130nm technology, TL2-T2 under PLRU and TL2-T2 under MRR are as good as PLRU4. PLRU4 is a slumberous scheme in the real sense, whereas the others are drowsy-compatible versions that use replacement policies. Hence slumberous caches are shown to be better than drowsy caches as technology scales down.

2.7.5 Extension to Unified L2 Cache

To evaluate the leakage power saving schemes in the context of a unified L2 cache, a 4-way set-associative cache with PLRU replacement policy is considered. The block size is 64 bytes and the cache size is 1 Mbyte. Since the L2 cache is not accessed as frequently as L1, the dynamic power expended in switching between the tranquility levels becomes negligible. The average dynamic power incurred for the unified L2 cache is on the order of pWs per byte whereas, for L1 data caches, it was on the order of nWs. Thus dynamic power costs are negligible across all schemes, as shown in Figure 2.32.
Figure 2.32: Dynamic power cost for L2 cache for various schemes in future technologies for PLRU

When we compare net leakage power savings per byte in L2 for different schemes, we see that TL1-T4 is always the best (see Figure 2.33) and is also trending up faster with technology.

Figure 2.33: Leakage power savings for L2 cache for various schemes in future technologies for PLRU

The L2 cache hit latency of the CPU model used is 20 cycles, so even if we put the whole L2 cache at the T4 level at all times we will have only a 10% increase in L2 latency. The impact on L2 cache latency has a similar curve to that of L1 but is less than 2% in all cases (see Figure 2.34).

Figure 2.34: Percent increase in L2 access time for hits for various schemes under PLRU

2.7.6 Immunity to Soft Errors and Reliability of Slumberous Caches

Soft errors will be a main concern in future microprocessors [69] owing to miniature feature sizes and larger chip areas. As cache memories occupy most of the chip’s real estate today, they have to be more reliable. Reducing the voltage of cache lines makes them more vulnerable to soft errors; as mentioned by [16], soft error vulnerability increases exponentially with decreasing supply voltage. Further, as pointed out by [35], MRU lines are more vulnerable to soft errors. Hence slumberous cache schemes, which never put MRU lines into drowsy mode and gradually decrease the voltage of a cache line as its replacement priority is lowered, seem more promising in view of power, performance and reliability. A detailed evaluation of the reliability of slumberous caches is beyond the scope of this research.

2.7.7 Conclusions

From this research it is established that a large amount of leakage energy can be saved in future technologies if some tranquility levels for the caches are selected and individual cache lines are switched to a tranquility level proportional to their frequency of utilization.
The replacement policy can be used to discriminate between more frequently used and less frequently used cache lines in order to decide which power save level they should be switched to. Our experimental results showed that, on average, 45%-32% of leakage power can be saved for the 130nm-70nm technologies, respectively. The dynamic energy cost to implement the proposed scheme becomes negligible as technology scales and also as cache size increases for a fixed technology. So the above-mentioned percent savings are independent of the program execution and the cache size used. The performance effect of this scheme is very small, and it decreases towards no performance impact as technology scales.

Two-priority-level schemes were considered to contrast with drowsy caches. Our scheme is similar to the drowsy cache scheme in that we also reduce the supply voltage to different cache lines. But the drowsy cache scheme puts the entire cache into drowsy mode at regular intervals in the case of the L1 data cache, and for the L1 instruction cache it introduces a form of bank prediction to put an entire bank into drowsy mode. To mitigate the performance impact we never reduce the supply voltage of P1 priority level cache lines; we only put the P2-P4 levels either at multiple levels of tranquility, in the case of TL4, or at two levels of tranquility, as in the two-level schemes discussed. For the two-level schemes, the different cases of assigning the T2, T3 or T4 voltage level to all three priority levels from P2 to P4 were considered. Comparing all 12 schemes for the L1 data cache with respect to LESMs showed TL4 under PLRU to be the best scheme, which also shows the superiority of the slumberous scheme over drowsy-type schemes.
On account of its very small dynamic cost and very small performance impact, TL1-T4 seems to be the best case for L2 caches, i.e. put the whole L2 cache in the deepest tranquility level, wake a cache line up only when needed, and put it back to sleep in the very next cycle. Another important thing to mention is that, although percent leakage energy savings decrease as technology scales on account of the decreasing voltage difference between tranquility levels, the absolute leakage energy savings increase 2-4 times (depending on the cache size and replacement policy) from 130nm to 70nm technology.

Chapter 3: Dynamic Power Control

The low power technique presented in the last chapter addressed the static aspect of the power dissipated in a modern processor. In this chapter we present a technique to reduce the dynamic power dissipated in a processor chip. Since the clock tree in modern processors dissipates a significant portion of the total chip dynamic power, we target reducing the power dissipated in the clock tree.

3.1 Clock Gating

Clock gating is a well known technique used to reduce power dissipation in clock associated circuitry. The idea of clock gating is to shut down the clock of any component whenever it is not being used (accessed). It involves inserting combinational logic along the clock path to prevent the unnecessary switching of sequential elements. The conditions under which the transition of a register may be safely blocked should automatically be detected. This problem is the target of our paper.

In out-of-order superscalar processors, branch mispredictions cause wrong-path instructions to be executed since there is a lag between the branch prediction, actual branch resolution, and subsequent commit of the branch. The wrong-path instructions are of course never committed to the actual state of the processor; however, because they are issued and executed, they can give rise to two negative effects: performance degradation and power waste.
Many researchers have worked on eliminating or reducing the power consumed by wrong-path instructions. These schemes are primarily probabilistic in nature. They rely on some kind of branch history as explained next. The pipeline gating technique of [36] assigns confidence levels about their prediction accuracy to branches based on their prediction history. When the number of low confidence branches exceeds a preset threshold, the instruction fetch and decode are stopped. This method suffers from both performance overhead and lost energy saving opportunities since some low confidence branches may be predicted correctly while some high confidence branches are in fact predicted wrongly. Reference [4] improves on the all-or-nothing throttling mechanism of [36] by having different types and degrees of throttling. The pipeline balancing technique of [6] monitors the IPC value over a 256-instruction window and disables clusters of functional units upon detection of the low IPC state (assuming that the program execution will stay in that state during the next instruction window). Since decisions are taken over a period of 256 instructions, rapid changes in program behavior result in performance loss and energy waste.

In [43] the authors propose a deterministic clock gating (DCG) approach which takes advantage of the resource utilization information available in advance. When it is known ahead of time (i.e., at the issue stage) that some of the processor resources will not be used, clock gating signals are generated, at the issue stage, to clock-gate these resources during their idle times. Another approach, called transparent clock gating [31], enhances the existing clock gating in latch-based pipelines by keeping the latches transparent by default, i.e., by not clocking them. Latches are clocked only when there is a need to avoid a data race condition.
Register level clock gating of [32] introduces the concept of clock gating parts of stage registers, i.e., when there are not enough instructions to be issued, parts of the stage register associated with the issue stage are clock gated. In [10] the authors present a value-based clock gating scheme, which exploits the fact that although the processor word size has increased to 64 bits and beyond, arithmetic operations on much smaller bit widths are more common. So while performing operations on smaller numerical values, higher order bits of the functional units can be clock gated.

Most of the previous work on clock gating either ignores the fact that a noticeable fraction of the total power is dissipated in executing wrong-path instructions during branch misprediction or uses a probabilistic approach to avoid the resulting power waste. In this research we take branch misprediction as an opportunity for clock gating the unnecessarily-used processor resources by deterministically detecting the wrong-path instructions.

3.2 Using Branch Mispredictions for Clock Gating

Many of the currently available state-of-the-art microprocessors have complex pipelines with multiple functional units and very wide issue widths [36] in order to offer a high level of parallelism in hardware. Clearly the Instruction Level Parallelism (ILP) varies a lot across different applications; as a result, not all applications are able to utilize the full set of resources in a modern processor. In addition, since the processors are designed to account for the peak performance, many of the applications which are not able to exploit the available hardware resources end up underutilizing them. Figure 3.1 shows this underutilization across different benchmarks in terms of the average percentage of idle cycles for integer ALUs with simplescalar simulation using an issue width of 4. As we can see, the integer ALU is idle, on average, for 41.02% of the time.
This resource underutilization provides us with opportunities for power saving by clock gating.

Figure 3.1: Idle time and wrong-path instruction fraction for integer ALUs (bar chart; vertical axis: Percentage (%); series: Idle Cycles, Instructions; benchmarks: BZIP, GCC, GZIP, CJPEG, DJPEG, APSI, EQUAKE, MESA, WUPWISE, Average)

Most of the current state-of-the-art microprocessors employ aggressive branch prediction in order to boost performance. Although branch predictors help increase the processor performance, when a branch is mispredicted, many of the wrong-path instructions (i.e., instructions that are on the predicted path of the mispredicted branch) are still executed. Due to the out-of-order execution in modern processors, at the time when a branch is resolved and found to be mispredicted, there can be a mix of correct path and wrong-path instructions in the execution pipelines and the instruction queue. Because of the prohibitive complexity of a selective squashing mechanism, many processor architectures do not flush the pipeline until the mispredicted branch reaches the head of the ReOrder Buffer (ROB) so that one is assured that all the instructions on the correct path have retired. (Note that instruction fetch and decode are stopped upon detecting a branch misprediction.) As a result many of the wrong-path instructions are still executed only to be thrown away when the pipeline is flushed. Figure 3.1 shows the fraction of instructions that are executed but never committed (retired), due to mispredicted branches, with respect to the total number of instructions executed. This estimate is obtained from simplescalar simulation, using the processor configuration that is described in detail in the experimental results section, which shows that on average around 8.29% of the executed instructions are due to mispredicted branches.
These instructions not only consume power in functional units during their execution, but also consume power in (i) register files by reading their input operands; and (ii) caches by executing wrong-path loads. The impact of these wrong-path instructions on power dissipation is even more severe with deeper pipelines on account of the increased branch misprediction penalty.

As stated earlier, many of the wrong-path instructions are executed even after the branch is resolved. More precisely, when a branch is resolved to be mispredicted, there may exist wrong-path instructions which a) have already been issued and thus either are in the pipeline or have been completed (type (i)), or b) have not been issued yet, i.e., they are still in the issue queue (IQ) (type (ii)). By the time the mispredicted branch reaches the head of the ROB, many of the instructions which are still in the IQ (type (ii)) could be issued to execution units. It is quite expensive (from a hardware cost and control point of view) to identify and prune type (i) instructions. Fortunately, it is easy to stop the second set of instructions from being issued, which in turn can result in considerable power saving.

In Figure 3.2, the bars on the left within each set show the average number of type (i) + type (ii) instructions when the mispredicted branch retires. This number tells us the average number of wrong-path instructions that could be prevented from being issued if we had a perfect oracle that would tell us which instruction is or will be in the wrong-path. The bars on the right within each set show the average number of type (ii) instructions when the mispredicted branch retires, i.e., the wrong instructions issued after the branch is resolved to be mispredicted and before it retires. These are the wrong-path instructions which can actually be prevented from being issued and executed.
These results show that 92.63% of the wrong-path instructions are issued after the branch is resolved, which provides a great opportunity for power saving via clock gating.

Figure 3.2: Average number of wrong-path instructions per mispredicted branch (bar chart; vertical axis: Average Instructions; series: Type(i)+Type(ii), Type(ii); benchmarks: BZIP, GCC, GZIP, CJPEG, DJPEG, APSI, EQUAKE, MESA, WUPWISE, Average)

3.3 Proposed Clock Gating Architecture

Based on the aforesaid observations, we present two clock gating techniques that 1) make use of idle cycles in pipelined functional units when some stage of the functional unit is idle, and 2) prevent wrong-path instructions of type (ii) from being issued. The first clock gating technique, called Pipelined Functional unit Clock Gating (PFCG), is straightforward and is presented and implemented here only to serve as a baseline against which the power efficiency of the second technique, i.e., WPCG, is compared.

3.3.1 Pipelined Functional Unit Clock Gating (PFCG)

Figure 3.3 depicts the PFCG technique at the architectural level. The proposed architecture utilizes the idleness of various stages of structurally-pipelined functional units in a processor pipeline. Note that different stages of a pipelined FU can be idle due to any of a number of reasons:

o Typically the total number of FUs, including integer and floating point functional units, is larger than the processor’s issue width. Hence not all the FUs are used in every cycle of the program’s execution.
o Different applications exhibit different degrees of instruction level parallelism (ILP) and therefore the FU’s usage varies across different programs.
o Different application programs exercise different sets of FUs. For example, integer programs will be using a completely different set of FUs (integer ones) compared to the floating point programs.
o Because of structurally pipelined FUs with multi clock cycle latencies (but a throughput of 1 operation per cycle), depending on the number of operations that are concurrently being executed on the same functional unit, one or more stages of the pipelined FU may be idle at any given clock cycle.

Figure 3.3: PFCG Architecture (block diagram; labels: Issue Logic, Data Bus, To writeback)

In modern processors, the decoded instructions, after renaming, are stored in an issue queue (IQ), where they wait for their input operands to become available (if these operands are being produced by some instruction in the pipeline). The issue logic examines all instructions that have both of their operands ready and issues n instructions (for an issue width of n) to appropriate FUs, assuming that the corresponding FUs are available. We define a pipeline stage of an FU as an input register set plus the combinational logic that succeeds it.

In the presented clock gating (CG) architecture, each stage register set of the FU is appended with a one-bit register called the Clock Enable Bit register (CEBit). The CEBit of stage i of FU j controls the clock of stage i+1 of that FU. (Note that since the last stage of the FU will not be used to gate any clock signal, it is not appended with the CEBit.) The clock fed to each stage register set, except for the CEBit register which is never clock gated, goes through an AND gate. The AND gate essentially takes the clock and the CEBit of the previous stage and performs a logical AND on them to produce the clock that will be fed to the current stage. Hence, during a particular clock cycle, if the CEBit of the previous stage is ‘0’, the clock for the current stage is masked for that cycle. As shown in the figure, the CEBit propagates through subsequent stages at each clock cycle thanks to the CEBit shift register structure. The CEBit register of the first stage of each FU is set either to ‘0’ or to ‘1’ by the issue logic via the issued bit (cf. Figure 3.3).
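The CEBit shift-register behaviour described above can be mimicked in a few lines of Python. This is only a behavioural sketch under stated assumptions, not the circuit of Figure 3.3: the class name `PipelinedFU`, the method `tick`, and the list-based representation of the CEBit chain are hypothetical, and clock edges are modelled as booleans rather than signals.

```python
class PipelinedFU:
    """CEBit chain of one structurally pipelined FU (behavioural sketch)."""

    def __init__(self, stages):
        self.stages = stages
        # cebit[i] sits beside the register set of stage i and gates stage i+1.
        # The last stage gates nothing, so it carries no CEBit.
        self.cebit = [0] * (stages - 1)

    def tick(self, issued):
        """Advance one cycle; return which stage register sets are clocked."""
        # Stage 0 is gated by the issued bit, stage i+1 by the CEBit of stage i.
        clocked = [bool(issued)] + [bool(b) for b in self.cebit]
        # CEBit registers are never gated: the chain shifts by one stage.
        self.cebit = ([int(issued)] + self.cebit)[: self.stages - 1]
        return clocked
```

For a 3-stage FU, one issued operation followed by idle cycles clocks stage 0, then stage 1, then stage 2, and nothing afterwards; every `False` entry returned by `tick` corresponds to a masked clock for that stage in that cycle.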
If, during a particular cycle m, no instruction is issued to the FU, then the issued bit will be set to ‘0’, indicating that no instruction is issued to this particular FU during cycle m. The issued bit is also used to gate the clock of the first stage. In the subsequent clock cycles, as the CEBit travels through the subsequent stages of an FU, it appropriately gates the clock of those stages.

3.3.2 Wrong-Path instruction Clock Gating

We saw in Section 3.2 that on average 8.29% of the total executed instructions are never committed due to wrong-path instructions on mispredicted branches. Figure 3.2 showed, on average, how many wrong-path instructions can be prevented from being issued when the branch is resolved and is known to be mispredicted. As seen, when the branch is mispredicted, the majority of the issued wrong-path instructions can be blocked since the majority of these wrong-path instructions are still in the IQ. Therefore, we propose a clock gating technique that eliminates the switching activity in the logic and the stage registers due to wrong-path instructions.

Figure 3.4 shows the architecture of Wrong-Path instructions Clock Gating (WPCG). Note that when a branch is resolved to be mispredicted, the instructions in the IQ may be correct path instructions (i.e., instructions that were fetched before the mispredicted branch instruction) or wrong-path instructions (i.e., instructions that have been fetched after the mispredicted branch instruction). Therefore, in the WPCG architecture, the IQ is augmented with some logic to determine whether the instruction selected by the issue logic is a wrong-path instruction or not.

Figure 3.4: The WPCG Architecture

As depicted in Figure 3.4, the misprediction bit is set to ‘0’ initially when the correct path instructions are being executed and no branch misprediction has taken place.
When a branch is resolved to be mispredicted, the mispredicted_branch_rob_id (MBR_id) register is updated with the ROB ID of the branch (branch_rob_id) in the next clock cycle. At the same time, the misprediction bit will be set to ‘1’. This will enable the range comparator in front of each issue port of the IQ, which will subsequently determine whether the instruction being issued is a wrong-path instruction or not. The AND gate in front of each issue port essentially takes the ROB ID of the selected instruction and ANDs it with the misprediction bit. This is necessary since we do not want unnecessary switching activity in the comparator circuit when the branch is predicted correctly. Hence, in the event of misprediction, the ROB ID of the selected instruction is available to the comparator. Furthermore, the comparator also receives the tail of the ROB as input to determine if the selected instruction is between the mispredicted branch and the tail of the ROB. If it is, then the comparator will output a ‘1’, indicating that the selected instruction is in the wrong-path and thus should not be executed. The inverted output of the comparator goes to a 2-to-1 MUX controlled by the misprediction bit. In the event of a misprediction, the inverted output of the comparator is chosen to set the value in the CEBit register of the first stage of the FU. This output is also used to clock gate the first stage register set of the FU.

Note that when the branch is not mispredicted, the added circuitry is functionally equivalent to the PFCG architecture (cf. Figure 3.3) and consumes minimal power since there will be no switching activity in the comparators. When the head of the ROB reaches the mispredicted branch, we will flush the ROB and the pipeline. At that time, the misprediction bit will be reset so that, starting with the next clock cycle, the WPCG is disabled.
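The issue-port decision described above can be sketched in Python as a small behavioural model. The function names `is_wrong_path` and `first_stage_enable` are hypothetical, ROB IDs are assumed to be plain integer indices into a circular buffer, and the strict comparisons follow the two wrap-around cases that Section 3.3.3 spells out for the range comparator. Whether an instruction whose ROB ID equals the tail is covered depends on the ROB's tail convention, which this sketch does not settle.

```python
def is_wrong_path(rob_id, mbr_id, rob_tail):
    """Is rob_id strictly between the mispredicted branch and the ROB tail?"""
    c1 = rob_tail > mbr_id   # True when the range does not wrap around
    c2 = rob_id < rob_tail
    c3 = rob_id > mbr_id
    if c1:
        return c2 and c3     # no wrap: both bounds must hold
    return c2 or c3          # wrap: either side of the wrap point qualifies


def first_stage_enable(selected, rob_id, mispredicted, mbr_id, rob_tail):
    """Value written to the first-stage CEBit for one issue port."""
    if not selected:
        return 0             # nothing issued to this FU: plain PFCG behaviour
    if not mispredicted:
        return 1             # comparator stays idle, instruction proceeds
    # Inverted comparator output: wrong-path instructions get a 0.
    return 0 if is_wrong_path(rob_id, mbr_id, rob_tail) else 1
```

With the mispredicted branch at ROB ID 6 and the tail at 2 in an 8-entry ROB, entries 7, 0 and 1 are flagged as wrong-path while entries 3, 4 and 5 are not, so only the latter would be allowed to set the first-stage CEBit to 1.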
It is important to emphasize the fact that, in out-of-order processors, all types of instructions can potentially be executed out of order, and therefore branches can also be executed out of order. Hence, once we detect a branch misprediction, update the MBR_id register and set the misprediction bit to ‘1’, it is possible that an older branch gets executed and gets resolved to be mispredicted. An older branch can still be issued and executed since it falls into the correct path with respect to the mispredicted branch whose ROB ID is stored in the MBR_id register. Therefore, if an older branch is resolved to be mispredicted, we should update the MBR_id register with the ROB ID of the just-resolved older branch, since updating the MBR_id register with this new branch will cover more wrong-path instructions.

For the sake of completeness we mention that if a younger branch gets resolved to be mispredicted, then we do not alter the content of the MBR_id register. Note however that this scenario is not possible, since if a branch is younger than the branch whose ROB ID is in the MBR_id register, then the younger branch will fall into the category of wrong-path instructions with respect to the branch whose ROB ID is in the MBR_id register. Thus if a branch is resolved to be mispredicted while the misprediction bit is set to ‘1’, then this newly mispredicted branch must be older and we update the MBR_id register. Since we update the MBR_id register any time a branch is mispredicted, we are already taking care of this scenario.

Furthermore, it is possible that more than one branch gets resolved to be mispredicted in the same cycle. In this case, ideally, we would like to select the branch that is the oldest and update the MBR_id register with the ROB ID of that branch. But this would require comparison between the ROB IDs of all the branches that are resolved to
Our simulation results show that, on average, only 6.25% of the total mispredicted branches are resolved in the same cycle. Therefore, in order to avoid the overhead of multiple range comparators, we select only one of the mispredicted branches from one of the Branch Execution Units with a predefined priority.

3.3.3 Hardware Overhead

Figure 3.5 shows the design of the range comparator block used in the WPCG architecture. As shown in the figure, we actually need 3 comparators. This is because the ROB is a circular queue where the head of the ROB points to the earliest (oldest) instruction whereas the tail of the ROB points to the latest (youngest) instruction. Due to this circular queue structure, we must deal with two different scenarios in order to determine whether the instruction being issued is a wrong-path instruction or not. For this purpose, we use three comparators. Comparator C1 compares the tail of the ROB with the ROB ID of the mispredicted branch. Comparator C2 compares the ROB ID of the instruction being issued (ROB_id) with the tail of the ROB, whereas comparator C3 compares the ROB ID of the instruction being issued with the ROB ID of the mispredicted branch.

Figure 3.5: Circuitry used to detect wrong-path instructions

Essentially we want to determine if the ROB ID of the instruction being issued is in between the mispredicted branch and ROB_tail. If so, the ROB ID belongs to a wrong-path instruction since the instructions following the branch are from the mispredicted path. As shown in Figure 3.5 there are two possible scenarios:

o Case 1: ROB_tail is larger than the mispredicted branch’s ROB ID (mispredicted_branch_rob_id in Figure 3.5). In this case the instruction being issued is on the wrong-path exactly if its ROB ID is larger than the mispredicted_branch_rob_id and smaller than the ROB_tail. This task is accomplished by the AND gate in the dotted rectangle.
o Case 2: ROB_tail is smaller than the mispredicted_branch_rob_id.
In this case the instruction being issued is on the wrong path exactly if its ROB ID is larger than the mispredicted_branch_rob_id or smaller than the ROB_tail. This task is accomplished by the gates in the dotted oval.

Notice that the inputs of the comparators do not switch when the branch is not mispredicted. This is because the ROB_tail and mispredicted_branch_rob_id registers (cf. Figure 3.5) are updated only in the event of a misprediction. Therefore, the comparators do not consume any power during correct-path execution. We implemented this circuit in Hspice and carried out the energy overhead analysis. The results presented in the experimental section account for this overhead.

3.3.4 Timing Overhead

There can potentially be a timing penalty for routing the misprediction bit and the mispredicted_branch_rob_id from the Execution stage back to the Issue stage. In conventional processor implementations the branch misprediction information is already sent to the Fetch and Commit stages, so the additional routing cost of getting it to the Issue stage could be quite low. Hence we expect this additional reverse signal path to have little or no impact on the clock cycle time. If, however, this becomes a concern, we can pipeline the reverse routing path for the misprediction bit signal from the Execution Unit to the Issue Logic. Doing so allows some wrong-path instructions to be issued into the pipeline, which reduces the energy savings of the WPCG technique but has no other performance or functional effects.

More generally, the WPCG architecture adds some logic to determine whether an instruction is a wrong-path instruction, and thus it adds some delay, although the impact of this delay on the clock cycle time depends on which pipeline stage is the most timing-critical one. In the worst case scenario, we must pipeline the issue logic, resulting in an extra clock cycle penalty for detecting wrong-path instructions.
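The wrong-path check performed by this added logic (the two cases of Section 3.3.3) can be illustrated in software. The following is a minimal sketch, not the circuit of Figure 3.5 itself: the function name and argument names are hypothetical, ROB IDs are assumed to be plain integers, and the tie case in which the tail equals the branch's ROB ID, which the text does not discuss, simply falls through to Case 2.

```python
def is_wrong_path(rob_id: int, mbr_id: int, rob_tail: int) -> bool:
    """Return True if the instruction at rob_id lies between the mispredicted
    branch (mbr_id) and the ROB tail (rob_tail) of a circular ROB.

    Hypothetical helper: the three comparisons mirror comparators C1, C2
    and C3 described in Section 3.3.3.
    """
    c1 = rob_tail > mbr_id  # C1: tail vs. mispredicted branch (selects the case)
    c2 = rob_id < rob_tail  # C2: issuing instruction vs. tail
    c3 = rob_id > mbr_id    # C3: issuing instruction vs. mispredicted branch

    if c1:
        # Case 1: the window between branch and tail does not wrap around.
        return c3 and c2
    # Case 2: the window wraps around the end of the circular queue.
    # Assumption: rob_tail == mbr_id is also treated as Case 2 here.
    return c3 or c2
```

For example, with mbr_id = 6 and rob_tail = 2 in an 8-entry ROB, the sketch flags IDs 7, 0 and 1 as wrong-path, while IDs 3, 4 and 5 are left alone.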
This additional stage will be bypassed when branches are predicted correctly, so in that case the penalty reduces to the Mux delay without any extra clock cycle penalty. In our simulations we pipelined this logic to account for the worst case scenario, in which the delay of the logic is too high to be accommodated within the issue cycle. The simulation results presented in the experimental section therefore account for the associated performance penalty.

3.4 Experimental Results

To evaluate the proposed clock gating scheme, we used a SimpleScalar-based simulation platform. The PFCG and WPCG methods were implemented in SimpleScalar [27], with appropriate modifications to SimpleScalar to implement realistic branch execution. The processor model used for the evaluations is described in Table 3.1. The benchmarks used for the evaluation included a few integer SPEC 2000 benchmarks (bzip, gzip, gcc) and a few floating point SPEC 2000 benchmarks (wupwise, apsi, mesa, equake) [28], along with a couple of multimedia benchmarks (djpeg, cjpeg) [41]. The subset of benchmarks was chosen so that it exhibits the same average branch prediction rate as the full suite it represents [57][21]. All benchmarks were run by fast-forwarding 300M instructions, followed by cycle-accurate out-of-order simulation of 1B instructions. From the SimpleScalar simulations, we obtained the access counts for various structures such as the integer functional units, register files, and caches.

To report the energy savings of the proposed clock gating scheme (while accounting for the overhead of the added circuitry), we used Hspice-based simulations with a 45nm CMOS technology obtained from the predictive technology models [26]. Input registers of the different stages of an FU were modeled as master-slave flip-flops, implemented at the transistor level, and simulated with Hspice to obtain the energy consumption when the clock is not gated as well as when the clock is gated.
Furthermore, to model a typical integer ALU, we designed and implemented a 32-bit adder at the transistor level and simulated it with Hspice, assuming for simplicity that an integer ALU consists of an adder. To obtain the energy consumption of the adder circuit, we divided the average switching activity per bit of the adder input operands into four ranges: [0, 25%), [25%, 50%), [50%, 75%) and [75%, 100%]. The corresponding energy consumptions were obtained with Hspice by performing Monte Carlo simulation of the adder circuit under appropriate bit-level switching activities taken from the SimpleScalar simulations. More precisely, we obtained the average bit-level switching activities for the inputs of the various integer ALUs in the target processor from the SimpleScalar simulations and used these activity values to estimate the power savings in the adder circuit.

Table 3.1: Processor Model used for Evaluations
Parameter           Value
Processor widths    Fetch, Decode, Issue and Commit: 4
ROB                 128/64
LSQ                 64/32
Caches              L1 I/D Cache 64KB 2-way, hit latency: 1 cycle; unified L2 Cache of 2MB, 8-way, hit latency: 12 cycles
Memory latency      100 cycles
Branch predictor    Gshare predictor with table size: 4096
BTB                 1024 2
Functional units    Integer ALUs: 4; Integer Multiplier/Dividers: 2

To model the register file and cache structures, we used CACTI [25] with the 45nm CMOS technology parameters and the machine configuration reported in Table 3.1. We evaluated two processor configurations with respect to ROB and LSQ sizes, denoted as ROB/LSQ set to 64/32 and 128/64. As the ROB and LSQ sizes increase, the proposed clock gating solution performs better, since the impact of a branch misprediction grows and we encounter more opportunities to save energy (cf. Figure 3.6). Increasing the issue width also increases the number of instructions per mispredicted branch [14]; thus, it will have a similar effect.
Figure 3.6: Usage cycles fraction in integer ALUs and % decrease in the usage cycles due to WPCG (left axis: % usage cycles for PFCG and WPCG; right axis: % improvement; ROB/LSQ configurations 64/32 and 128/64; benchmarks BZIP, GCC, GZIP, CJPEG, DJPEG, APSI, EQUAKE, MESA, WUPWISE and their average)

On its primary (left) Y-axis, Figure 3.6 shows the average percentage of usage cycles in the integer ALUs for the different benchmarks. The PFCG scheme takes advantage of the fact that ALU usage is not 100%: it gates the clock signal of the stage registers of the different ALUs during idle cycles and hence saves power. The WPCG scheme, which does not issue wrong-path instructions after detecting a branch misprediction, increases the idle cycle fraction and reduces the ALU usage, as shown on the secondary (right) Y-axis of Figure 3.6. On average, WPCG reduces ALU usage cycles by 2.95% for ROB/LSQ=64/32 and 3.87% for ROB/LSQ=128/64. It is evident from these results that WPCG creates more opportunities for clock gating compared to PFCG.

Of the presented clock gating schemes, the PFCG technique incurs negligible overhead: a one-bit register for the CEBit per 32-bit or 64-bit register. The WPCG technique incurs only moderate energy overhead because we activate the wrong-path instruction detection circuitry of Figure 3.5 only after detecting a mispredicted branch. The energy overhead of this circuitry is accounted for by implementing the circuitry of Figure 3.5 in Hspice. Note that, as mentioned earlier, the WPCG technique also reduces switching activity in the combinational logic between the clock-gated register sets, since it prevents the wrong-path instructions from being issued. Hence WPCG saves power not only on the clock pins of the stage registers but also in the combinational logic blocks.
Figure 3.7 shows the energy consumption in the stage registers and the combinational logic of the integer ALUs for the PFCG and WPCG schemes with the ROB/LSQ configuration of 128/64. On average, WPCG expends 2.43% less energy in the combinational logic of the ALUs and 2.41% less energy in the stage registers compared to PFCG.

Figure 3.7: Energy consumption in the combinational logic and stage registers of the integer ALUs (Y-axis: energy in mJ; series: Clock Pins PFCG, Clock Pins WPCG, Logic PFCG, Logic WPCG; benchmarks BZIP, GCC, GZIP, CJPEG, DJPEG, APSI, EQUAKE, MESA, WUPWISE and their average)

Since the WPCG scheme prevents the wrong-path instructions from being executed, it reduces register file read accesses, as most of the wrong-path instructions would access the register file to read their input operands. Furthermore, the cache accesses are also typically reduced, since the wrong-path instructions can include load instructions. Notice that the store accesses to the cache are not affected, since stores are executed only on commit. We used the CACTI tool [25] to obtain the per-access dynamic energy dissipation for the L1 data caches and the register files implemented in the 45nm PTM technology. Note that CACTI is equipped with detailed models of memory structures including decoders, sense amplifiers, bit lines, word lines, interconnect, etc. Figure 3.8 depicts the percentage reduction in the number of accesses made to the register files and the L1 data cache for WPCG. As shown in this figure, WPCG with the ROB/LSQ configuration of 128/64 reduces the register file accesses by 3.69% and the L1 data cache accesses by 2.60%, resulting in similar energy reductions in the register file and the L1 data cache. It was reported in [52] that wrong-path instructions may perform useful prefetches that can in turn reduce the overall execution time of the whole benchmark; however, we did not notice any such effect for our selected benchmarks.
This is likely because of the smaller issue-width, memory latency, and branch misprediction penalty values used in our simulations (in contrast to the aggressive values assumed in [52], we assumed parameter values that match today’s commercial processor implementations).

Figure 3.8: Reduction in register file and cache accesses due to WPCG (Y-axis: reduction in accesses in %; series: Register File 64/32, Register File 128/64, L1 Data Cache 64/32, L1 Data Cache 128/64; benchmarks BZIP, GCC, GZIP, CJPEG, DJPEG, APSI, EQUAKE, MESA, WUPWISE and their average)

Although WPCG incurs a cycle penalty in detecting wrong-path instructions after mispredicted branches, it has only a small effect on the overall IPC, since misprediction rates are normally very low. Table 3.2 shows that, for the simulated benchmarks, WPCG on average incurs less than 1% IPC degradation.

Table 3.2: IPC Degradation for WPCG (% change in IPC)
Benchmark    ROB/LSQ: 64/32    ROB/LSQ: 128/64
BZIP         0.07              0.13
GCC          0.61              0.66
GZIP         0.32              0.41
CJPEG        0.39              0.40
DJPEG        0.22              0.34
APSI         0.56              0.33
EQUAKE       0.58              0.14
MESA         0.87              0.74
WUPWISE      0.91              1.81
Average      0.63              0.39

Figure 3.9 shows the distribution of energy dissipation among the major on-chip components, obtained by SimpleScalar/Wattch [10] simulation for a 130nm technology. Among these components, the techniques proposed in this paper are aimed at reducing power in the clock, data cache, register file and ALU. The baseline PFCG saves, on average, 38.50% energy in the clock tree, which translates into 13.93% energy savings over all these major on-chip components. In comparison, WPCG saves an additional 2.05% in the clock tree, 2.43% in the ALUs, 3.69% in the register file and 2.60% in the data cache, which translates into 16.26% energy savings over the major on-chip components.

Figure 3.9: Energy dissipation distribution for different benchmarks (Y-axis: share of energy from 0% to 100%; components: Clock, Resultbus, ALU, D cache, I cache, Reg. File, LSQ, IQ, Rename; benchmarks BZIP, GCC, GZIP, CJPEG, DJPEG, APSI, EQUAKE, MESA, WUPWISE and their average)

3.5 Conclusions

We presented a clock gating scheme that deterministically clock gates the functional units in modern out-of-order superscalar processors to save power. The baseline clock gating scheme, PFCG, clock gates the stage registers associated with the FUs during the idle cycles of the FUs. On average, PFCG reduces energy consumption by 13.93% over the major on-chip components compared to the case with no clock gating. WPCG treats branch mispredictions as opportunities to save energy and blocks wrong-path instructions from executing after a branch has been resolved. WPCG provides more idle cycles for PFCG to exploit for clock gating and also reduces the register file and cache accesses caused by wrong-path instruction execution. Accumulating all these energy benefits, WPCG on average saves 16.26% energy over the major on-chip components.

Chapter 4: Boolean Difference Calculus Based Error Calculator

As CMOS enters the nano-scale regime, device failure mechanisms such as crosstalk, manufacturing variability, and soft errors become significant design concerns. Being probabilistic by nature, these failure sources have pushed CMOS technology toward stochastic CMOS [39]. For example, capacitive and inductive coupling between parallel adjacent wires in nano-scale CMOS Integrated Circuits (ICs) are potential sources of crosstalk between the wires. Crosstalk can indeed cause a flipping error on the victim signal [59]. In addition to probabilistic CMOS, promising nanotechnology devices such as quantum dots are used in technologies such as Quantum Cellular Automata (QCA). Most of these emerging technologies are inherently probabilistic. This has made reliability analysis an essential piece of circuit design. Reliability analysis will be even more significant in designing reliable circuits using unreliable components [30][7].
Circuit reliability will thus be an important tradeoff factor that has to be handled in the same way as traditional design tradeoff factors such as performance, area, and power. To include reliability in the design tradeoff equations, there must exist a good measure of circuit reliability, and there must exist fast and robust tools that, like timing analyzers and power estimators, are capable of estimating circuit reliability at different design levels. In [38] the authors propose a Probabilistic Transfer Matrix (PTM) method to calculate the output signal error probability of a circuit, while [1] presents a method based on Probabilistic Decision Diagrams (PDDs) to perform this task.

In this chapter we first introduce a probabilistic gate-level error propagation model, based on the concept of Boolean difference, to propagate errors from the inputs to the output of a general gate. We then apply this model to account for the error propagation in a given circuit and finally estimate the error probability at the circuit outputs. Note that in the proposed model a gate’s Boolean function is used to determine the error propagation in the gate. An error at the output of a gate is due to its input(s) and/or the gate itself being erroneous. The internal gate error in this work is modeled as an output flipping event. This means that, when a faulty gate makes an error, it flips (changes a “1” to a “0” and a “0” to a “1”) the output value that it would otherwise have generated for the given inputs; this is the von Neumann error model. In the rest of this chapter, we call our circuit error estimation technique the Boolean Difference-based Error Calculator, or BDEC for short, and we assume that a defective logic gate produces the wrong output value for every input combination. This is a more pessimistic defect model than the stuck-at-fault model.

The authors of [38] use a PTM for each gate to represent the error propagation from the input(s) to the output(s) of a gate.
They also define operations such as matrix multiplication and tensor product that use the gate PTMs to generate and propagate the error probability at different nodes in a circuit, level by level. Despite its accuracy in calculating the signal error probability, the PTM technique suffers from an extremely large number of computationally intensive tasks, namely regular and tensor matrix products. This makes the PTM technique extremely memory intensive and very slow. In particular, for larger circuits, the size of the PTM matrices grows too fast for the deeper nodes in the circuit, making PTM an inefficient or even infeasible technique for error rate calculation in a general circuit.

References [8] and [54] developed a methodology based on probabilistic model checking (PMC) to evaluate circuit reliability. The issue of the excessive memory requirement of PMC for large circuits was successfully addressed in [9]. However, the time complexity remains a problem. In fact, the authors of [9] show that the run time of their space-efficient approach is even worse than that of the original approach.

Boolean difference calculus was introduced and used by [63] and [3] to analyze single faults. It was then extended by [40] and [15] to handle multiple-fault situations; however, they only consider stuck-at faults and do not consider the case in which the logic gates themselves can be erroneous, so that a gate-induced output error may nullify the effect of errors at the gate’s input(s). In [60] the authors use Bayesian networks to calculate the output error probabilities without considering the input signal probabilities. The author of [1] uses probabilistic decision diagrams (PDDs) to calculate the error probabilities at the outputs using probabilistic gates. While PDDs are much more efficient than PTM in the average case, the worst-case complexity of both PTM-based and PDD-based error calculators is exponential in the number of inputs in the circuit.
In contrast, we will show in Section 4.4 that BDEC calculates the circuit error probability much faster than PTM while achieving results as accurate as PTM’s. We will show that BDEC requires a single pass over the circuit nodes, using a post-order (reverse DFS) traversal, to calculate the error probabilities at the output of each gate as we move from the primary inputs to the primary outputs; hence, its complexity is O(N), where N is the number of gates in the circuit and O(.) is the big O notation.

4.1 Error Propagation Using Boolean Difference Calculus

Some key concepts and notation that will be used in the remainder of this chapter are discussed next.

4.1.1 Partial Boolean difference

The partial Boolean difference of a function $f(x_1, x_2, \ldots, x_n)$ with respect to one variable or a subset of its variables [15] is defined as:

\[
\frac{\partial f}{\partial x_i} = f_{x_i} \oplus f_{\bar{x}_i}, \qquad
\frac{\partial f}{\partial (x_{i_1} x_{i_2} \cdots x_{i_k})} = f_{x_{i_1} x_{i_2} \cdots x_{i_k}} \oplus f_{\bar{x}_{i_1} \bar{x}_{i_2} \cdots \bar{x}_{i_k}} \tag{4.1}
\]

where $\oplus$ represents the XOR operator and $f_{x_i}$ and $f_{\bar{x}_i}$ are the co-factors of $f$ with respect to $x_i$, i.e.,

\[
f_{x_i} = f(x_1, \ldots, x_{i-1}, x_i = 1, x_{i+1}, \ldots, x_n), \qquad
f_{\bar{x}_i} = f(x_1, \ldots, x_{i-1}, x_i = 0, x_{i+1}, \ldots, x_n) \tag{4.2}
\]

Higher-order co-factors of $f$ can be defined similarly. The partial Boolean difference of $f$ with respect to $x_i$ expresses the condition (with respect to the other variables) under which $f$ is sensitive to a change in the input variable $x_i$. More precisely, if the logic values of $\{x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n\}$ are such that $\partial f/\partial x_i = 1$, then a change in the input value $x_i$ will change the output value of $f$. However, when $\partial f/\partial x_i = 0$, changing the logic value of $x_i$ will not affect the output value of $f$.

It is worth mentioning that the order-k partial Boolean difference defined in Equation 4.1 is different from the kth Boolean difference of function $f$ as used in [40], which is denoted by $\partial^k f / \partial x_{i_1} \cdots \partial x_{i_k}$.
For example, the 2nd Boolean difference of function $f$ with respect to $x_i$ and $x_j$ is defined as:

\[
\frac{\partial^2 f}{\partial x_i \partial x_j} = \frac{\partial}{\partial x_i}\left(\frac{\partial f}{\partial x_j}\right) = f_{x_i x_j} \oplus f_{\bar{x}_i x_j} \oplus f_{x_i \bar{x}_j} \oplus f_{\bar{x}_i \bar{x}_j} \tag{4.3}
\]

Therefore, in general, $\partial^2 f/\partial x_i \partial x_j \neq \partial f/\partial (x_i x_j)$.

4.1.2 Total Boolean difference

Similar to the partial Boolean difference, which shows the conditions under which a Boolean function is sensitive to a change in any of its input variables, we can define the total Boolean difference, which shows the condition under which the output of the Boolean function $f$ is sensitive to simultaneous changes in all the variables of a subset of the input variables. For example, the total Boolean difference of function $f$ with respect to $x_i x_j$ is defined as:

\[
\frac{\Delta f}{\Delta (x_i x_j)} = (x_i x_j + \bar{x}_i \bar{x}_j)\,\frac{\partial f}{\partial (x_i x_j)} + (x_i \bar{x}_j + \bar{x}_i x_j)\,\frac{\partial f}{\partial (x_i \bar{x}_j)} \tag{4.4}
\]

where $\Delta f/\Delta (x_i x_j)$ describes the conditions under which the output of $f$ is sensitive to a simultaneous change in $x_i$ and $x_j$; that is, the value of $f$ changes as a result of the simultaneous change. Some examples of simultaneous changes in $x_i$ and $x_j$ are transitions from $x_i = x_j = 1$ to $x_i = x_j = 0$ and vice versa, or from $x_i = 1, x_j = 0$ to $x_i = 0, x_j = 1$ and vice versa. However, transitions of the form $x_i = x_j = 1$ to $x_i = 1, x_j = 0$, or $x_i = 1, x_j = 0$ to $x_i = 0, x_j = 0$, are not simultaneous changes. Note that $\partial f/\partial (x_i x_j)$ describes the conditions under which a transition from $x_i = x_j = 1$ to $x_i = x_j = 0$, or vice versa, changes the value of function $f$.
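The definitions above can be made concrete by evaluating them point by point on a small function. The sketch below is illustrative only: the helper names are hypothetical, a Boolean function is assumed to be a Python callable over a tuple of 0/1 values, and the total Boolean difference is implemented through its stated meaning (does the output change when all the listed variables flip together?) rather than through the sum-of-products form of Equation 4.4.

```python
from itertools import product


def cofactor(f, x, forced):
    """Evaluate f at point x with the variables in `forced` (index -> bit) fixed."""
    y = list(x)
    for idx, bit in forced.items():
        y[idx] = bit
    return f(tuple(y))


def partial_diff(f, x, idxs):
    """Partial Boolean difference w.r.t. the product of the listed variables:
    co-factor with all of them at 1 XOR co-factor with all of them at 0."""
    ones = cofactor(f, x, {i: 1 for i in idxs})
    zeros = cofactor(f, x, {i: 0 for i in idxs})
    return ones ^ zeros


def second_diff(f, x, i, j):
    """2nd Boolean difference w.r.t. x_i and x_j: XOR of all four co-factors."""
    out = 0
    for bi, bj in product((0, 1), repeat=2):
        out ^= cofactor(f, x, {i: bi, j: bj})
    return out


def total_diff(f, x, idxs):
    """Total Boolean difference at x: 1 if f changes when every listed
    variable is complemented simultaneously, else 0."""
    flipped = {i: 1 - x[i] for i in idxs}
    return f(tuple(x)) ^ cofactor(f, x, flipped)


if __name__ == "__main__":
    # Hypothetical example function: f simply passes x0 through and ignores x1.
    f = lambda x: x[0]
    for point in product((0, 1), repeat=2):
        print(point,
              partial_diff(f, point, [0, 1]),
              second_diff(f, point, 0, 1),
              total_diff(f, point, [0, 1]))
```

For this pass-through function the order-2 partial difference and the 2nd Boolean difference disagree, which is the distinction drawn around Equation 4.3.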
It can be shown that the total Boolean difference in Equation 4.4 can be written in the form:

\[
\frac{\Delta f}{\Delta (x_i x_j)} = \frac{\partial f}{\partial x_i} \oplus \frac{\partial f}{\partial x_j} \oplus \frac{\partial^2 f}{\partial x_i \partial x_j} \tag{4.5}
\]

The total Boolean difference with respect to three variables is:

\[
\begin{aligned}
\frac{\Delta f}{\Delta (x_1 x_2 x_3)} ={} & (x_1 x_2 x_3 + \bar{x}_1 \bar{x}_2 \bar{x}_3)\,\frac{\partial f}{\partial (x_1 x_2 x_3)} + (x_1 x_2 \bar{x}_3 + \bar{x}_1 \bar{x}_2 x_3)\,\frac{\partial f}{\partial (x_1 x_2 \bar{x}_3)} \\
& + (x_1 \bar{x}_2 x_3 + \bar{x}_1 x_2 \bar{x}_3)\,\frac{\partial f}{\partial (x_1 \bar{x}_2 x_3)} + (x_1 \bar{x}_2 \bar{x}_3 + \bar{x}_1 x_2 x_3)\,\frac{\partial f}{\partial (x_1 \bar{x}_2 \bar{x}_3)}
\end{aligned} \tag{4.6}
\]

It is straightforward to verify that:

\[
\frac{\Delta f}{\Delta (x_1 x_2 x_3)} = \frac{\partial f}{\partial x_1} \oplus \frac{\partial f}{\partial x_2} \oplus \frac{\partial f}{\partial x_3} \oplus \frac{\partial^2 f}{\partial x_1 \partial x_2} \oplus \frac{\partial^2 f}{\partial x_2 \partial x_3} \oplus \frac{\partial^2 f}{\partial x_1 \partial x_3} \oplus \frac{\partial^3 f}{\partial x_1 \partial x_2 \partial x_3} \tag{4.7}
\]

In general, the total Boolean difference of a function $f$ with respect to an n-variable subset of its inputs can be written as:

\[
\frac{\Delta f}{\Delta (x_{i_1} x_{i_2} \cdots x_{i_n})} = \sum_{j=0}^{2^{n-1}-1} \left(m_j + m_{2^n - j - 1}\right) \frac{\partial f}{\partial m_j} \tag{4.8}
\]

where the $m_j$’s are defined as follows:

\[
\begin{aligned}
m_0 &= \bar{x}_{i_1} \bar{x}_{i_2} \cdots \bar{x}_{i_{n-1}} \bar{x}_{i_n} \\
m_1 &= \bar{x}_{i_1} \bar{x}_{i_2} \cdots \bar{x}_{i_{n-1}} x_{i_n} \\
&\;\;\vdots \\
m_{2^n - 1} &= x_{i_1} x_{i_2} \cdots x_{i_{n-1}} x_{i_n}
\end{aligned} \tag{4.9}
\]

and we have:

\[
\frac{\partial f}{\partial m_j} = \frac{\partial f}{\partial (x_1^* x_2^* \cdots x_{n-1}^* x_n^*)} \quad \text{where} \quad m_j = x_1^* x_2^* \cdots x_{n-1}^* x_n^* \tag{4.10}
\]

4.1.3 Signal and error probabilities

Signal probability is defined as the probability that a signal value is “1”. That is:

\[
p_i = \Pr\{x_i = 1\} \tag{4.11}
\]

Gate error probability is denoted by $\varepsilon_g$ and is defined as the probability that a gate generates an erroneous output, independent of its applied inputs. Such a gate is sometimes called a $(1-\varepsilon_g)$-reliable gate.

Signal error probability is defined as the probability of error on a signal line. If the signal line is the output of a gate, the error can be due either to an error at the gate input(s) or to the gate error itself. We denote the error probability on signal line $x_i$ by $\varepsilon_i$.

We are interested in determining the circuit output error rates, given the circuit input error rates, under the assumption that each gate in the circuit can fail independently with a probability of $\varepsilon_g$.
In other words, we account for the general case of multiple simultaneous gate failures.

4.2 Proposed Error Propagation Model

In this section we propose our gate error model in the Boolean difference calculus notation. The gate error model is then used to calculate the error probability and reliability at the outputs of a circuit.

4.2.1 Gate Error Model

Figure 4.1 shows a general logic gate realizing Boolean function $f$, with gate error probability $\varepsilon_g$. The signal probabilities at the inputs, i.e., the probabilities of the input signals being 1, are $p_1, p_2, \ldots, p_n$, while the input error probabilities are $\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n$. The output error probability is $\varepsilon_z$.

Figure 4.1: Gate implementing function f

First consider the error probability equation for the buffer gate shown in Figure 4.2. An error occurs at the output if (i) the input is erroneous and the gate is error free or (ii) the gate is erroneous and the input is error free. Therefore, assuming independent faults for the input and the gate, the output error probability for a buffer can be written as:

\[
\varepsilon_z = \varepsilon_{in}(1 - \varepsilon_g) + (1 - \varepsilon_{in})\,\varepsilon_g = \varepsilon_g + (1 - 2\varepsilon_g)\,\varepsilon_{in} \tag{4.12}
\]

where $\varepsilon_{in}$ is the error probability at the input of the buffer. It can be seen from this equation that the output error probability of a buffer is independent of the input signal probability. Note that Equation 4.12 can also be used to express the output error probability of an inverter gate.

Figure 4.2: A faulty buffer with erroneous input

We can model each faulty gate with erroneous inputs as an ideal (no fault) gate with the same functionality and the same inputs, in series with a faulty buffer, as shown in Figure 4.3.

Figure 4.3: The proposed model for a general faulty gate

Now consider a general two-input gate.
Using the fault model discussed above, we can write the output error probability by considering all the cases of no error, a single error, and double errors at the inputs, together with the error in the gate itself. The general equation for the error probability at the output, $\varepsilon_z$, is:

\[
\varepsilon_z = \varepsilon_g + (1 - 2\varepsilon_g)\,\underbrace{\left[\varepsilon_1 (1 - \varepsilon_2) \Pr\left\{\frac{\partial f}{\partial x_1}\right\} + (1 - \varepsilon_1)\,\varepsilon_2 \Pr\left\{\frac{\partial f}{\partial x_2}\right\} + \varepsilon_1 \varepsilon_2 \Pr\left\{\frac{\Delta f}{\Delta (x_1 x_2)}\right\}\right]}_{\varepsilon_{in}} \tag{4.13}
\]

where Pr{.} represents the signal probability function and returns the probability that its Boolean argument is “1”. The first and second terms in $\varepsilon_{in}$ account for the error at the output of the ideal gate due to a single input error at the first and the second input, respectively. Note that an error at an input of the ideal gate propagates to the output of this gate only if the other inputs are not masking it. The non-masking probability for each input error is taken into account by calculating the signal probability of the partial Boolean difference of the function $f$ with respect to the corresponding input. The first two terms in $\varepsilon_{in}$ only account for the cases with a single input error at the ideal gate; however, an error can also occur when both inputs are erroneous simultaneously. This is taken into account by multiplying the probability of having simultaneous errors at both inputs, i.e., $\varepsilon_1 \varepsilon_2$, by the probability that this error propagates to the output of the ideal gate, i.e., the signal probability of the total Boolean difference of $f$ with respect to $x_1 x_2$.
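Equation 4.13 can be evaluated numerically for any two-input gate by enumerating the four input combinations. The sketch below is only an illustration under stated assumptions: the function names are hypothetical, the gate is assumed to be a Python callable over a pair of 0/1 values, and the two input signals are assumed to be statistically independent when computing Pr{.}.

```python
from itertools import product


def signal_prob(g, p):
    """Pr{g(x) = 1} for inputs that are assumed independent,
    with Pr{x_i = 1} = p[i]."""
    total = 0.0
    for x in product((0, 1), repeat=len(p)):
        weight = 1.0
        for xi, pi in zip(x, p):
            weight *= pi if xi else 1.0 - pi
        total += weight * g(x)
    return total


def gate_error_2in(f, p, eps, eps_g):
    """Output error probability of a faulty two-input gate per Equation 4.13.

    f     : gate function over a tuple (x1, x2) of 0/1 values
    p     : (p1, p2) input signal probabilities
    eps   : (eps1, eps2) input error probabilities
    eps_g : gate error probability
    """
    e1, e2 = eps
    # Partial Boolean differences: is f sensitive to x1 (resp. x2) alone?
    d1 = lambda x: f((1, x[1])) ^ f((0, x[1]))
    d2 = lambda x: f((x[0], 1)) ^ f((x[0], 0))
    # Total Boolean difference: is f sensitive to both inputs flipping together?
    dt = lambda x: f(x) ^ f((1 - x[0], 1 - x[1]))

    eps_in = (e1 * (1 - e2) * signal_prob(d1, p)
              + (1 - e1) * e2 * signal_prob(d2, p)
              + e1 * e2 * signal_prob(dt, p))
    return eps_g + (1 - 2 * eps_g) * eps_in


if __name__ == "__main__":
    and2 = lambda x: x[0] & x[1]
    # Illustrative values only; they are not taken from the experiments.
    print(gate_error_2in(and2, p=(0.5, 0.5), eps=(0.1, 0.1), eps_g=0.05))
```

Enumeration costs only four evaluations per Boolean difference here, so the sketch favors readability over the closed-form expressions that can be derived for specific gates.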
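For the 2-input AND gate ($f = x_1 x_2$) shown in Figure 4.4 we have:

\[
\begin{aligned}
\Pr\left\{\frac{\partial f}{\partial x_1}\right\} &= \Pr\{x_2\} = p_2, \qquad \Pr\left\{\frac{\partial f}{\partial x_2}\right\} = \Pr\{x_1\} = p_1 \\
\Pr\left\{\frac{\Delta f}{\Delta (x_1 x_2)}\right\} &= \Pr\{x_1 x_2 + \bar{x}_1 \bar{x}_2\} = p_1 p_2 + (1 - p_1)(1 - p_2) = 1 - (p_1 + p_2) + 2 p_1 p_2
\end{aligned} \tag{4.14}
\]

Plugging Equation 4.14 into Equation 4.13, and after some simplification, we have:

\[
\varepsilon_{AND2} = \varepsilon_g + (1 - 2\varepsilon_g)\left(\varepsilon_1 p_2 + \varepsilon_2 p_1 + \varepsilon_1 \varepsilon_2 \left(1 - 2(p_1 + p_2) + 2 p_1 p_2\right)\right) \tag{4.15}
\]

Figure 4.4: A 2-input faulty AND gate with erroneous inputs

Similarly, the error probability for the 2-input OR gate can be calculated as:

\[
\varepsilon_{OR2} = \varepsilon_g + (1 - 2\varepsilon_g)\left(\varepsilon_1 (1 - p_2) + \varepsilon_2 (1 - p_1) + \varepsilon_1 \varepsilon_2 (2 p_1 p_2 - 1)\right) \tag{4.16}
\]

And for the 2-input XOR gate we have:

\[
\varepsilon_{XOR2} = \varepsilon_g + (1 - 2\varepsilon_g)(\varepsilon_1 + \varepsilon_2 - 2\varepsilon_1 \varepsilon_2) \tag{4.17}
\]

It is interesting to note that the error probability at the output of the XOR gate is independent of the input signal probabilities. Generally, the 2-input XOR gate exhibits a larger output error than the 2-input OR and AND gates. This is expected, since XOR gates show maximum sensitivity to input errors (XOR, like inversion, is an entropy-preserving function). The output error expression for a 3-input gate is:

\[
\begin{aligned}
\varepsilon_z = \varepsilon_g + (1 - 2\varepsilon_g)\Bigg[ & \varepsilon_1 (1 - \varepsilon_2)(1 - \varepsilon_3) \Pr\left\{\frac{\partial f}{\partial x_1}\right\} + (1 - \varepsilon_1)\,\varepsilon_2 (1 - \varepsilon_3) \Pr\left\{\frac{\partial f}{\partial x_2}\right\} + (1 - \varepsilon_1)(1 - \varepsilon_2)\,\varepsilon_3 \Pr\left\{\frac{\partial f}{\partial x_3}\right\} \\
& + \varepsilon_1 \varepsilon_2 (1 - \varepsilon_3) \Pr\left\{\frac{\Delta f}{\Delta (x_1 x_2)}\right\} + (1 - \varepsilon_1)\,\varepsilon_2 \varepsilon_3 \Pr\left\{\frac{\Delta f}{\Delta (x_2 x_3)}\right\} + \varepsilon_1 (1 - \varepsilon_2)\,\varepsilon_3 \Pr\left\{\frac{\Delta f}{\Delta (x_1 x_3)}\right\} \\
& + \varepsilon_1 \varepsilon_2 \varepsilon_3 \Pr\left\{\frac{\Delta f}{\Delta (x_1 x_2 x_3)}\right\} \Bigg]
\end{aligned} \tag{4.18}
\]

As an example of a 3-input gate, we can use Equation 4.18 to calculate the probability of error for the case of a 3-input AND gate.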
We can show that the output error probability can be calculated as:

\[
\begin{aligned}
\varepsilon_{AND3} = \varepsilon_g + (1 - 2\varepsilon_g)\Big[ & \varepsilon_1 p_2 p_3 + \varepsilon_2 p_1 p_3 + \varepsilon_3 p_1 p_2 \\
& + \varepsilon_1 \varepsilon_2\, p_3 \left(1 - 2(p_1 + p_2) + 2 p_1 p_2\right) + \varepsilon_2 \varepsilon_3\, p_1 \left(1 - 2(p_2 + p_3) + 2 p_2 p_3\right) \\
& + \varepsilon_1 \varepsilon_3\, p_2 \left(1 - 2(p_1 + p_3) + 2 p_1 p_3\right) \\
& + \varepsilon_1 \varepsilon_2 \varepsilon_3 \left(1 - 2(p_1 + p_2 + p_3) + 4(p_1 p_2 + p_2 p_3 + p_1 p_3) - 6 p_1 p_2 p_3\right) \Big]
\end{aligned} \tag{4.19}
\]

Now we give a general expression for a 4-input logic gate as: ( , ) ( , , ) ( , ) , ( , ) , ( , , ) , , 1 2 3 1 Pr 1 Pr ( ) = (1 2 ) 1 Pr ( ) i j j k j k l i j i j k i j k l i i i j k k l i j k i j k l i j i j

