ENERGY EFFICIENT DESIGN AND PROVISIONING OF HARDWARE
RESOURCES IN MODERN COMPUTING SYSTEMS
by
Kimish Patel
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL ENGINEERING)
December 2010
Copyright 2010 Kimish Patel
Dedication
To my parents and sister, Charu, who have had the patience to put up with me
through my long journey of graduate school.
Acknowledgements
It was a moment of immense pleasure when I defended and came a step closer to the
elusive end of graduate school, but I can hardly forget the encouragement of all those
who have supported me along the way.
To this end, I would first like to thank my PhD advisor, Professor Massoud Pedram.
His enthusiasm towards research and learning has inspired me throughout my PhD. His
insights into a wide spectrum of domains have not only provided me with a better
understanding of my own research but have also helped me develop an eye for detail.
I would also like to thank Professor Murali Annavaram, who has served as my co-
advisor and with whom I have interacted for the past three years. His enthusiastic and
optimistic attitude towards research and his knowledge have provided me with a perspective
that has helped me look at my own research differently in moments of doubt.
Along with my advisor and co-advisor, I would like to thank my qualification and
dissertation committee, which includes Prof. Aiichiro Nakano, Prof. Sandeep Gupta, Prof.
Michel Dubois, and Prof. Ramesh Govindan, who have provided me with useful insights
that have helped me with my research directions.
Special thanks to Dr. Wonbok Lee, with whom I collaborated on many projects
while he was a student at USC, and to our research group, which has always stimulated
interesting research discussions during our group meetings. I also thank the Electrical
Engineering staff members at USC, Annie Yu, Diane Demetras, and Tim Boston, for their
support.
My acknowledgements would not be complete without thanking all my friends who
have helped me get through tough times. To this end, special thanks to Vidushak, the
improv theater group I was part of throughout my PhD. It was to them that I took the
frustrations of those dark nights and dissolved them with improvised humor. To all
Vidushak members, thank you for all the humorous support.
And lastly, where would I be without my family? So thank you for understanding why
it took me this long. Honestly, I do not know.
Table of Contents
Dedication ........................................................................................................................... ii
Acknowledgements ............................................................................................................ iii
List of Tables ................................................................................................................... viii
List of Figures .................................................................................................................... ix
Abstract ............................................................................................................................ xiii
Chapter 1. Introduction ........................................................................................................1
1.1 Circuit Level Energy Optimization ................................................................... 3
1.1.1 Proposed Design Time Solutions ................................................................ 4
1.2 Circuit/Architecture Level Energy Optimizations ............................................ 5
1.2.1 Proposed Solution ....................................................................................... 6
1.3 System Level Energy Optimizations ................................................................ 6
1.3.1 Proposed System Level Solution ................................................................ 8
1.4 Organization ...................................................................................................... 8
Chapter 2. Circuit Level Energy Optimization: Charge Recycling ...................................10
2.1 Introduction ..................................................................................................... 10
2.2 Write Power Reduction in Register File ......................................................... 10
2.2.1 Prior Work ................................................................................................ 14
2.2.2 Conditional Charge Sharing: Concepts ..................................................... 16
2.2.2.1 RegFile Structure and the Write Operation .......................................... 16
2.2.2.2 Motivation for the Conditional Charge Sharing ................................... 17
2.2.3 Conditional Charge Sharing: Circuitry ..................................................... 19
2.2.3.1 Bit-line Flip Detector ............................................................................ 19
2.2.3.2 Charge Sharing Period Generator and the Charge-Sharing Switch ...... 20
2.2.3.3 Charging/Discharging Circuits & Delay Generator .............................. 21
2.2.4 Performance Estimation ............................................................................ 22
2.2.5 Overhead Estimation ................................................................................. 25
2.2.5.1 Area Overhead ...................................................................................... 25
2.2.5.2 Delay Overhead .................................................................................... 26
2.2.6 Experimental Results ................................................................................ 27
2.2.6.1 Impact of Charge Sharing Based Writes on Reorder Buffer ................ 32
2.3 In-order Pulsed Charge Recycling in Off-chip Data Buses ............................ 34
2.3.1 Prior Work ................................................................................................ 35
2.3.2 Pulsed Charge Recycling .......................................................................... 36
2.3.2.1 Key Concept and Method ..................................................................... 36
2.3.2.2 Energy Savings of PCR Compared to CCR .......................................... 38
2.3.2.3 Pulsed Charge Recycling Implementation ............................................ 40
2.3.2.4 Charge Donor Activation Circuit .......................................................... 41
2.3.2.5 Pulsed Charge Sharing Circuit .............................................................. 42
2.3.2.6 Charge/Discharge Completion Circuit .................................................. 43
2.3.2.7 Bus Line Grouping ................................................................................ 43
2.3.3 Experimental Results ................................................................................ 44
2.3.3.1 Energy Savings Analysis ...................................................................... 44
2.3.3.2 Off-Chip Bus Traffic Analysis.............................................................. 47
2.4 Summary ......................................................................................................... 49
Chapter 3. Circuit/Architecture Level Energy Optimization: Charge Recycling Cache ...50
3.1 Introduction ..................................................................................................... 50
3.2 Prior Work ...................................................................................................... 52
3.3 Proposed CR-Cache Architecture ................................................................... 55
3.3.1 Circuit Level Design ................................................................................. 55
3.3.1.1 Conditional Charge Sharing Block ....................................................... 56
3.3.1.2 Charge Sharing Period Generator Block ............................................... 58
3.3.1.3 Delayed Write Enable Block ................................................................ 60
3.3.1.4 Conditional Charge/Precharge Block ................................................... 61
3.3.1.5 Transistor Sizing ................................................................................... 62
3.3.1.6 Timing of Operations ............................................................................ 64
3.3.2 Architectural Modifications ...................................................................... 67
3.3.2.1 Coalescing Cache Writes ...................................................................... 68
3.3.2.2 Cache Coherency .................................................................................. 70
3.3.2.3 Clustered Writes.................................................................................... 72
3.4 Experimental Setup and Results ..................................................................... 73
3.4.1 Circuit Simulations ................................................................................... 74
3.4.1.1 Write Operation .................................................................................... 75
3.4.1.2 Impact on Read Operation .................................................................... 78
3.4.2 Cache Modeling ........................................................................................ 80
3.4.3 Architectural Simulations ......................................................................... 82
3.4.3.1 Impact on Cache Writes ........................................................................ 84
3.4.3.2 Impact on Cache Energy Dissipation .................................................... 87
3.4.3.3 Impact on Performance ......................................................................... 91
3.4.3.4 Impact of Line Size and Associativity .................................................. 93
3.5 Summary ......................................................................................................... 95
Chapter 4. System Level Energy Optimization: Resource Allocation in Hosting
Centers .............................................................................................................96
4.1 Introduction ..................................................................................................... 96
4.2 Prior Work .................................................................................................... 100
4.3 System Model Parameters and Assumptions ................................................ 104
4.4 NFRA Framework ........................................................................................ 107
4.4.1 Generalized Networks ............................................................................. 107
4.4.2 Modeling Resource Allocation in NFRA ............................................... 108
4.4.3 SLA types................................................................................................ 110
4.4.3.1 Throughput Constraint ........................................................................ 110
4.4.3.2 Average Response Time Constraint.................................................... 111
4.4.3.3 Stochastic Maximum Response Time Constraint ............................... 115
4.4.3.3.1 Energy Minimization .................................................................... 116
4.4.3.3.2 Profit Maximization ...................................................................... 118
4.4.4 Solving Min Cost Max Flow in NFRA ................................................... 126
4.5 Experimental Setup and Results ................................................................... 126
4.5.1 Hosting Center and SLA Parameters ...................................................... 126
4.5.2 Pseudo Optimal Solution ........................................................................ 130
4.5.3 Throughput Constrained Optimization ................................................... 132
4.5.4 Average Response Time Constrained Optimization ............................... 134
4.5.5 Stochastic Maximum Response Time Constrained Optimization .......... 135
4.5.5.1 Energy Optimization ........................................................................... 135
4.5.5.2 Profit Maximization ............................................................................ 136
4.5.6 Sensitivity Analysis ................................................................................ 140
4.6 Summary ....................................................................................................... 144
Chapter 5. Conclusions and Future Work .......................................................................146
5.1 Future Directions .......................................................................................... 147
Bibliography ....................................................................................................................150
List of Tables
Table 2.1: THE ENERGY SAVINGS AND DELAY PENALTY................................... 30
Table 2.2: NORMALIZED POWER DECOMPOSITION .............................................. 31
Table 3.1: CGate Truth table corresponding to BL .......................................................... 62
Table 3.2: Energy Savings of CR-cache ........................................................................... 77
Table 3.3: Cache configurations ....................................................................................... 80
Table 3.4: Processor configuration parameters ................................................................. 83
Table 4.1: Client SLA specification for Instance_1 ....................................................... 129
List of Figures
Figure 2.1: Scaling of write access power in RegFile. ......................................................11
Figure 2.2: Single ported SRAM structure. .......................................................................12
Figure 2.3: A Typical 6-T cell in RegFile with issue width of one. ..................................16
Figure 2.4: Conventional write operation in the RegFile. .................................................17
Figure 2.5: Circuit design for the conditional charge-sharing scheme. .............................18
Figure 2.6: Pictorial comparison between the conditional charge-sharing vs.
conventional techniques. ....................................................................................................22
Figure 2.7: Decomposition of a write delay in both designs. ............................................27
Figure 2.8: Electrical waveforms for various signals when writing a '11' sequence
into a cell that had initially stored a value of '0'. (a) Conditional charge-sharing
scheme (b) Conventional scheme. .....................................................................................28
Figure 2.9: Average ratio of non-flip bit-line pair per write access to RegFile and
the energy savings. .............................................................................................................32
Figure 2.10: Energy Savings in ROB as a function of average, per write, ratio of
non-flip bit-lines to total bit-width. ....................................................................................33
Figure 2.11: Comparison between CCR and PCR techniques. ..........................................37
Figure 2.12: Proposed charge sharing structure for a data bus. .........................................41
Figure 2.13: Energy savings with different numbers of rising and falling transitions. .....45
Figure 2.14: Energy savings comparison between the PCR and CCR. .............................46
Figure 2.15: Electrical waveforms for various signals. .....................................................46
Figure 2.16: Distribution of Transitions. ...........................................................................48
Figure 3.1: Circuit Level Design of Proposed Charge Recycling Cache (CR-Cache) ......56
Figure 3.2: Conditional Charge Sharing Block..................................................................57
Figure 3.3: Charge Sharing Period Generator Block .........................................................59
Figure 3.4: Delayed Write Enable Block ...........................................................................60
Figure 3.5: Conditional Charge/Precharge Block ..............................................................61
Figure 3.6: Signal timings for CCSB+CSPG blocks .........................................................65
Figure 3.7: Signal timings for DWEB+CCPB blocks .......................................................65
Figure 3.8: Bitline status and cell writing operation ..........................................................66
Figure 3.9: Conventional vs. Proposed Store Queue .........................................................69
Figure 3.10: State machine for clustered writes .................................................................72
Figure 3.11: Experimental Methodology ...........................................................................74
Figure 3.12: Write operation in baseline SRAM block .....................................................75
Figure 3.13: Write operation in CR-cache SRAM block ...................................................76
Figure 3.14: Comparison of current curves between original SRAM and CR-cache
SRAM ................................................................................................................................77
Figure 3.15: Increase in bit-line pre-charging delay ..........................................................79
Figure 3.16: Cache write operation energy ........................................................................81
Figure 3.17: Cache read and write energy comparison......................................................82
Figure 3.18: Impact of SQ size on CS enabled writes .......................................................85
Figure 3.19: Reduction in cache read accesses due to loads ..............................................86
Figure 3.20: Comparison between two retirement policies ...............................................87
Figure 3.21: Read and Write Energy Reduction ................................................................88
Figure 3.22: Net Energy Savings in L1 Data Cache ..........................................................89
Figure 3.23: Percentage of Writes Needed for Breakeven ................................................90
Figure 3.24: Performance Penalty ......................................................................................92
Figure 3.25: Comparison of performance penalty between the two policies ....................92
Figure 3.26: Impact of cache line size and associativity on CS enabled writes and
cache read accesses ............................................................................................................93
Figure 3.27: Impact of cache line size and associativity on cache energy and
performance .......................................................................................................................94
Figure 4.1: Hosting Center Architecture. ...........................................................................99
Figure 4.2: Client Request Distribution within a Server Pool. ........................................103
Figure 4.3: Generalized Network Model for NFRA. .......................................................108
Figure 4.4: Power and per Request Energy Consumption Scaling. .................................112
Figure 4.5: Net Profit per Request at Different Utilization Levels. .................................120
Figure 4.6: Utilization Range for Profit Optimization. ....................................................121
Figure 4.7: Edge Splitting for Profit Optimization. .........................................................123
Figure 4.8: Energy and Response Time Characterization. ..............................................128
Figure 4.9: Throughput Constrained Energy Optimization. ............................................133
Figure 4.10: Average Response Time Constrained Energy Optimization. ......................134
Figure 4.11: Stochastic Maximum Response Time Constrained Energy
Optimization. ...................................................................................................................136
Figure 4.12: Profit Optimization. .....................................................................................137
Figure 4.13: Impact of Large Scale Heterogeneity. .........................................................138
Figure 4.14: Impact of Edge Splitting Cardinality on Approximation. ...........................139
Figure 4.15: NFRA Runtime Scaling...............................................................................140
Figure 4.16: Arrival Rate Variation. ................................................................................141
Figure 4.17: Impact of Reboot/Wakeup Latency and Power. ..........................................143
Abstract
The importance of energy efficiency in electronic systems is ever increasing, from
embedded systems such as smart phones to large scale distributed systems such as
data centers. Modern battery-powered embedded systems are complex devices providing
various functionalities while supporting a wide range of applications, leading to complex
energy profiles and necessitating energy efficient design for longer battery life. On the
other end of the spectrum lie complex large-scale distributed systems such as data
centers. Such systems consume not only significant computing power but also cooling
power in order to remove the heat generated by the information technology equipment.
The issue of energy efficiency in such systems can be addressed at various levels of
system design, e.g., circuit/architecture level design time solutions or operating
system/application level runtime solutions. In this thesis, we present circuit and
architecture level design time solutions for modern microprocessors based on the concept
of charge sharing, a technique that is applicable to all kinds of systems independent of the
usage scenario, and system level runtime solutions based on energy-aware resource
allocation that are mostly applicable to data centers.
At the circuit level, we introduce a charge recycling based optimization approach for
1) write operation power minimization in on-chip memory structures with dedicated write
ports, such as the register file, issue queue, and reorder buffer, where charge among bit-lines
is recycled in order to reduce the voltage swing on the bit-lines, and 2) power minimization in
off-chip data buses by recycling charge in a sequential manner, where charge from bus
lines experiencing falling transitions is recycled to bus lines with rising transitions in
multiple charge sharing cycles, so as to recycle more charge than simultaneous
charge recycling techniques.
Extending the idea of charge recycling to data caches with shared read/write ports, we
describe a new cache architecture that can dynamically switch between a charge
sharing based write operation mode and the regular cache operation mode. At the architecture
level, we employ a clustered store retirement technique that delays instruction
retirement for store operations in order to group stores together and generate back-to-back
cache writes, so that the writes can take advantage of the underlying circuit support for
charge recycling to reduce the write operation power.
The aforementioned design time solutions are equally applicable independent of the
workloads the system is running. At a higher level, where the underlying system design
is fixed, we employ, for a given set of workloads and hardware resources, intelligent
resource allocation to achieve energy efficient resource assignment in large scale
distributed systems such as hosting centers. The heterogeneity present among the servers in
such large scale distributed systems, along with the non energy proportional behavior of these
servers, makes the task of resource allocation non-trivial. Using generalized networks, we
capture power and performance heterogeneity among servers while modeling utilization
dependent non energy proportional behavior. We present a generalized network flow
based resource allocation algorithm, in which nodes represent workloads and resources, that
finds a close-to-optimal solution in the presence of resource heterogeneity and non energy
proportionality, while meeting the stipulated service level agreements (SLAs).
Chapter 1. Introduction
As technology progresses and becomes more ubiquitous, the complexity of devices
increases, from popular consumer devices such as smart phones, GPS systems, and MP3
players, to high-end desktops/laptops with state-of-the-art graphics rendering capability, to
high-end servers supporting high performance computing. Most state-of-the-art
personal embedded devices such as smart phones are complex heterogeneous
systems with general purpose microprocessors as well as ASICs supporting specific
functionality such as image or video processing. Such devices implement a full software
stack and support a plethora of applications, demanding a significant power budget.
On the other end, with the advent of Software-as-a-Service (SaaS), Platform-as-a-
Service (PaaS), and other cloud services, more and more service requests are offloaded to
data centers, and the computing/storage capabilities of such centers are stretched to their
limit. In order to satisfy client demands, e.g., scientific computing
pertaining to molecular biology experiments or weather prediction, more mundane user
services such as Google Docs or Google voice recognition, e-commerce applications, and
social networking, such data centers must employ state-of-the-art high-performance servers. These data
centers are complex systems with multi-layer architectures and complex network and
routing infrastructure, consuming a significant amount of power to operate 24/7.
Designing energy efficient systems has thus become an issue of paramount importance.
Energy dissipation in CMOS digital circuits is dictated by the following equation:

P_avg = β · C · ΔV · V_dd · f_clk    (1.1)

In the above equation, C represents the total switching capacitance, ΔV represents the
voltage swing, V_dd denotes the supply voltage, f_clk is the clock frequency, and β represents
the switching activity factor. Various power minimization techniques attempt to reduce
power dissipation in digital circuits by modifying one of these parameter values, i.e.,
reducing the amount of switched capacitance (the product of C and β), reducing the supply
voltage or the voltage swing, or reducing the clock frequency.
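As a minimal illustration (not from the dissertation, with made-up parameter values), the following Python snippet evaluates eq. (1.1) and shows that halving the voltage swing ΔV, the knob targeted by the charge recycling techniques in this thesis, halves the dynamic power.

```python
# Minimal sketch of eq. (1.1): P = beta * C * dV * Vdd * f_clk.
# All parameter values are assumptions chosen purely for illustration.
def dynamic_power(beta, C, dV, Vdd, f_clk):
    """Average dynamic power (W) for the given activity factor, switched
    capacitance, voltage swing, supply voltage, and clock frequency."""
    return beta * C * dV * Vdd * f_clk

full_swing = dynamic_power(beta=0.2, C=1e-9, dV=1.0, Vdd=1.0, f_clk=2e9)
half_swing = dynamic_power(beta=0.2, C=1e-9, dV=0.5, Vdd=1.0, f_clk=2e9)
print(f"full swing: {full_swing:.2f} W, half swing: {half_swing:.2f} W")  # 0.40 W vs 0.20 W
```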
As we move to the system level, which encompasses the whole system, including peripheral
components such as DRAM and disk drives rather than just the main processor, these
peripheral components start playing an important role in the overall power dissipation.
Power dissipation in such components is not necessarily dictated by the same parameters
as the ones in eq. (1.1). For example, energy dissipation in traditional disk drives is
dominated by mechanical components such as the spindle motor and the actuator (responsible
for head movements) [29], which tends to make them energy inefficient compared to
their integrated-circuit based counterparts. Disks that employ solid state
drives (SSDs) are more energy efficient, with read access energy much lower
than that of traditional hard drives. Hence system level energy minimization approaches
must account for such system level components as main memory, disk I/O, etc.
Addressing the issue of energy minimization across these different levels of system
design, in this thesis we propose circuit level and circuit/architecture level design time
solutions based on charge recycling to reduce the voltage swing, and a system level resource
allocation based runtime solution for energy minimization and profit maximization using
generalized network flow.
1.1 Circuit Level Energy Optimization
As mentioned earlier, circuit-level approaches mainly attempt to address the
parameters C, ΔV, V_dd, β, or f_clk of eq. (1.1). These approaches mainly comprise
various transistor-level or physical-level circuit design techniques.
Such design techniques for energy minimization include: downsizing of transistors on
non-timing-critical paths ([10], [80]), which essentially reduces the total switched
capacitance (C in eq. (1.1)); voltage swing reduction in memory structures ([62], [95]),
which reduces ΔV of eq. (1.1); and pipelining the logic and subsequently reducing the voltage
([14]), which essentially assigns a smaller amount of work per pipeline stage so as to allow
more time per stage and subsequently reduce the supply voltage (V_dd in eq. (1.1)). These
design time solutions are applicable universally, independent of the use case, i.e.,
independent of the applications that are running. In contrast, runtime solutions,
which dynamically activate/deactivate the power optimization structures to provide just
the right level of performance, are strongly dependent on the application and input data.
Such runtime solutions, which employ a variety of hardware or software based
monitoring means in order to dynamically activate/deactivate various sub-circuits within
the system based on the activity profile of the system, include: dynamic voltage and
frequency scaling (DVFS) ([64], [83]), which scales the supply voltage V_dd and frequency
f_clk of eq. (1.1) to reduce energy dissipation; clock gating ([93], [20]), which dynamically
disables the clock of the registers and/or functional blocks that are not being used or
whose output is not being observed and thus reduces β of eq. (1.1); etc.
1.1.1 Proposed Design Time Solutions
The circuit level power reduction techniques proposed in this thesis are design time
solutions that make use of the idea of charge recycling. The first circuit level design time
solution is a technique for write operation power reduction in memory structures with
dedicated write ports, where charge is recycled among bit-lines to reduce the bit-line
voltage swing. We observe that, when opposite values are written into a bit-line pair in
consecutive cycles, the bit-lines swing in opposite directions. By making use of
charge recycling among such a bit-line pair, we can reduce the total voltage swing that is
needed to flip the bit-line voltage status. The design of the peripheral circuits to facilitate
the charge sharing among bit-lines is presented, along with Hspice and SimpleScalar
based evaluations.
Exploiting charge recycling once again, we describe another circuit level design
technique to reduce the voltage swing on off-chip data buses. More precisely, we employ
sequential charge recycling by connecting the bus lines that undergo rising transitions,
sequentially, in multiple charge sharing cycles, to all the bus lines that experience falling
transitions. Such sequential charge sharing, in multiple cycles, recycles more charge, as
shown later, from bus lines going through falling transitions than simultaneous
charge sharing. We describe the design and implementation of a peripheral circuit that
enables such sequential charge recycling and present our evaluation of the energy savings
and performance penalty using Hspice simulations.
1.2 Circuit/Architecture Level Energy Optimizations
The circuit-level solutions discussed earlier are largely unaware of the behavior of the
system under different usage patterns, i.e., of the different applications/programs being run
and the performance demanded by them. Therefore, many of the energy optimization
approaches make use of the information available at the architecture level to optimally
use the energy optimization support provided by the circuits.
For example, drowsy caches put inactive cache lines in either a power-off or a retentive
mode ([45], [25]) to reduce leakage power; zero compression or sign compression based
approaches compress one or more bytes of data into a single bit ([91], [50]) in order to reduce the
number of bits that need to be read out from memory; and the banked register file exploits the
usage pattern of the register file ([87]) so as to divide it into smaller banks, reducing per
access energy while minimizing the performance penalty. Other architecture level
techniques, instead of augmenting circuit level support with architectural enhancements,
propose purely architectural solutions. For example, the hierarchical banked register file
([19]) approach makes use of the observation that a very small number of registers is
needed to keep the processor at maximum throughput and proposes a multi-level banked
register file with a smaller and faster level-one register file; the banked issue queue ([12])
approach proposes smaller banks with only one pending operand stored per instruction,
based on the observation that most instructions have only one operand pending; and
delayed instruction retirement ([81]) proposes to postpone instruction retirement
based on the observation that register lifetime spans are quite small for a large fraction of the
time.
1.2.1 Proposed Solution
The data cache write power reduction technique presented in this thesis belongs to the
category of power reduction approaches that lie at the boundary of circuits and
architecture. The proposed solution extends our charge recycling based write power
reduction technique to memory structures with shared read/write ports, e.g., caches.
We propose a novel circuit architecture for single-ported SRAMs that can switch between
charge sharing based write operation mode and regular memory access mode. However,
in order to take advantage of this circuit level enhancement, we must provide support at
the architecture level. We modify the cache controller and store retirement policy so as to
delay store retirement and retire them in clusters in order to generate back-to-back cache
writes. We evaluate our proposed solution at various levels, i.e., using Hspice netlists for
circuit simulation, CACTI for cache power modeling, and SimpleScalar simulations for
the architectural modifications and evaluation.
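As a toy illustration of the clustered store retirement idea mentioned above (this is not the actual store queue or cache controller design of Chapter 3; the cluster size and timeout below are assumptions), a store queue can hold completed stores and release them in back-to-back bursts so that consecutive cache writes can stay in the charge sharing write mode:

```python
# Toy sketch of clustered store retirement: hold stores, release them in bursts
# so the cache sees back-to-back writes. CLUSTER_SIZE and MAX_WAIT are assumed
# values for illustration; the real policy and state machine are in Chapter 3.
from collections import deque

CLUSTER_SIZE = 4   # release once this many stores are pending (assumption)
MAX_WAIT = 16      # or after this many cycles, to bound retirement delay (assumption)

class ClusteredStoreQueue:
    def __init__(self):
        self.pending = deque()
        self.oldest_age = 0

    def tick(self, new_store=None):
        """Advance one cycle; return the burst of stores retired this cycle (possibly empty)."""
        if new_store is not None:
            self.pending.append(new_store)
        if self.pending:
            self.oldest_age += 1
        if len(self.pending) >= CLUSTER_SIZE or self.oldest_age >= MAX_WAIT:
            burst = list(self.pending)   # these become back-to-back cache writes
            self.pending.clear()
            self.oldest_age = 0
            return burst
        return []

sq = ClusteredStoreQueue()
for cycle, store in enumerate([0x100, None, 0x104, 0x108, None, 0x10C, None, None]):
    burst = sq.tick(store)
    if burst:
        print(f"cycle {cycle}: retire burst {[hex(a) for a in burst]}")
```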
1.3 System Level Energy Optimizations
As we move from the circuit to the architecture level, we are able to account for the general
behavior rather than the detailed implementation. Thus, at the architecture level, we were
able to view the system in terms of its usage of various components such as the
register file, reorder buffer, functional units, and caches. However, as we move to
an even higher level, i.e., the system/server level, the other components that the system is
comprised of besides the microprocessor, such as memory, disk, network I/O, etc., start
playing an important role. Such peripheral components contribute more significantly to
power in server/desktop systems than in embedded systems ([7]).
System level power management that takes a holistic view of the server, and beyond
that of a rack or of a data center as a whole, has become important. Previous works have
attempted to address the issue of energy efficiency at the system level, e.g., dynamically
shutting off DRAM modules ([7], [23], [42]) to reduce server power, where the idle time
distribution of DRAM or history based memory scheduling is exploited, and conserving disk
power ([13], [29], [99]) by exploiting disk idle time, employing multi-speed disks, or
using disk power aware cache management policies. To assist such approaches, which
attempt to bridge the gap between energy proportional processors ([2]) and non-energy
proportional memory and disk I/O, many modern server systems are equipped with ACPI
([1]), which provides an interface for OS level power management. Such interfaces can be very
useful for trading off performance against power. However, in order to exploit this trade-off,
work must be assigned to these servers accordingly. Therefore, energy aware resource
allocation has also gained considerable attention. The problem, however, is compounded
by the increasing heterogeneity that is present in today's data centers as they upgrade and/or
install new server systems to accommodate their increasing clientele. To address this
issue, for system level power management, we propose a resource allocation algorithm that
minimizes energy consumption and maximizes profit for a given set of clients and servers.
1.3.1 Proposed System Level Solution
The system level resource allocation solution presented in this thesis allocates
resources, and thereby assigns workloads, for a given set of clients. We assume a
heterogeneous data/hosting center environment where server pools with heterogeneous
performance/power profiles are available. The set of clients is given to us with their
service level agreements (SLAs) and corresponding price functions. We employ a generalized
network flow based framework, called NFRA, where nodes in the network represent both
clients and server pools and flow represents the resource allocation. While accounting for
the non-energy proportionality of today's servers, NFRA produces resource allocation
solutions for throughput constrained and response time constrained energy minimization
and for response time constrained profit maximization problems. We show that the
generalized network flow based framework provides us with the flexibility of accounting for
different types of SLAs as well as queuing models.
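For a flavor of the kind of assignment NFRA computes, the sketch below solves a plain transportation-style linear program with SciPy: client request rates are mapped onto heterogeneous server pools at minimum energy cost. This is only a simplified stand-in, not the generalized network flow formulation of Chapter 4; it ignores SLAs, queuing models, and non-energy-proportional behavior, and all numbers are invented.

```python
# Simplified stand-in for energy-aware resource allocation: a transportation LP.
# NOT the NFRA algorithm (no generalized network gains, SLAs, or queuing models);
# client demands, capacities, and energy costs are invented for illustration.
import numpy as np
from scipy.optimize import linprog

demand = np.array([300.0, 500.0])            # client request rates (req/s)
capacity = np.array([400.0, 400.0, 300.0])   # server pool capacities (req/s)
energy_per_req = np.array([1.0, 0.6, 1.4])   # energy cost per request (J), per pool

n_c, n_p = len(demand), len(capacity)
cost = np.tile(energy_per_req, n_c)          # x[i, j] flattened row-major

A_eq = np.zeros((n_c, n_c * n_p))            # each client's demand fully served
for i in range(n_c):
    A_eq[i, i * n_p:(i + 1) * n_p] = 1.0

A_ub = np.zeros((n_p, n_c * n_p))            # pool capacities not exceeded
for j in range(n_p):
    A_ub[j, j::n_p] = 1.0

res = linprog(cost, A_ub=A_ub, b_ub=capacity, A_eq=A_eq, b_eq=demand,
              bounds=(0, None))
print("energy-optimal allocation (req/s):\n", res.x.reshape(n_c, n_p))
```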
1.4 Organization
The rest of this thesis is organized as follows.
Chapter 2 covers, for circuit level energy optimization, the charge recycling based
energy minimization techniques. The details of the circuit augmentation, for facilitating
charge recycling between bit-lines of register file like memory structures, and the
corresponding impact on power and timing are presented. The latter part of this chapter
covers our sequential charge recycling technique for off-chip data buses along with the
necessary circuit augmentation. Experimental results describe the impact of our proposed
techniques on power savings and performance using SPEC benchmarks.
Extending the charge recycling idea to memory structures with shared read/write
ports, we present, in Chapter 3, for circuits/architecture level energy optimization, the
circuit level design of the CR-cache (charge recycling cache), which can dynamically switch
between the regular mode of operation and the charge recycling based write mode of operation.
Furthermore, we present architectural support for late retirement of store operations, in
order to retire stores in clusters, along with the corresponding cache controller state machine.
Experimental results extensively explore the circuit level, cache level, and architecture level
impact of the proposed technique under different cache configurations and store
retirement policies.
At a higher level of system design, in Chapter 4, we present a system level energy
optimization approach for large scale distributed systems, such as hosting centers, based on
resource allocation. The proposed framework, called NFRA (Network Flow based
Resource Allocation), which employs a generalized network for workload distribution, is
presented. Various hosting center topologies are explored, and the impact of variation in
various parameters, such as the client request arrival rate, as well as the overhead of turning
servers ON and OFF, is accounted for in the experimental section.
Summarizing the presented energy optimization approaches, working at different
levels of system design, we present conclusions in Chapter 5 along with future research
directions.
Chapter 2. Circuit Level Energy Optimization:
Charge Recycling
2.1 Introduction
Charge recycling is a phenomenon whereby charge stored in one part of the circuit is
reused in another part in order to reduce the total amount of current drawn from the power
supply. Thus, part of the work already done to charge up a node is recycled to partially
charge up another node, thereby reducing the total amount of work done. Charge
recycling is particularly helpful in interconnects, where large wire capacitances dominate
power consumption. We present, in this chapter, two circuit level approaches that make
use of charge recycling to reduce power in memory interconnects and off-chip buses. The
first approach reduces the voltage swing on the bit-lines of multi-ported memories, which form
highly capacitive interconnect. The second approach exploits sequential charge recycling
to reduce the voltage swing on off-chip buses.
2.2 Write Power Reduction in Register File
State-of-the-art microprocessors employ large physical register files (RegFile) ([65],
[24]) and register renaming in order to exploit instruction level parallelism (ILP) and
improve performance. Complexity of the RegFiles in modern out-of-order processors
with large issue widths increases since RegFiles must have multiple read and write ports
11
in order to support large issue widths. For example, in order to support issue width of 4
RegFile needs to have 8 read ports and 4 write ports. Such increase in complexity, due to
increasing issue widths, is accompanied by increased power consumption. In Figure 2.1
we show the scaling of per write access power, normalized to issue width of 1, in RegFile
as a function of issue width, generated with the help of Wattch [11] power modeling tool.
As shown in the figure, as we increase the issue width, per write access power increases
(by a factor of nearly 2 as we move to issued width of 4 from 1) since the increase in
issue width results also results in increased length of word-lines/bit-lines. Thus, with
wide issue widths and large number of registers, power dissipation and resulting
temperature in RegFile are poised to increase.
Figure 2.1: Scaling of write access power in RegFile.
The operation and structure of a RegFile are very similar to those of a single-ported SRAM
(cf. Figure 2.2), except that the RegFile has dedicated and separate read and write ports,
i.e., multiple bit-line pairs are connected to each memory cell, as shown
later in Figure 2.3. Despite these dissimilarities, the dynamic power consumption in the
RegFile and in memory can be similarly decomposed into several components: bit-line
charging/discharging, word-line decoding, and differential sense amplification (SA). It is
known that, in the conventional 6-T SRAM structure, the write operation dissipates more
dynamic power than the read operation ([49], [48]). More precisely, during the write
operation, power is dissipated by fully discharging either the bit-line or the bit-line-bar.
In contrast, during the read operation either the bit-line or the bit-line-bar is only partially
discharged, for the sense amplifier to read the value stored in the cell.
Figure 2.2: Single ported SRAM structure.
Monitoring the current state of the bit-lines and the new data value being written to
the bit-lines provides us with helpful information that can be used to save power, as detailed next. Consider the
single ported SRAM structure of Figure 2.2, where an n-bit wide memory structure with
decoder, pre-charging circuitry, and sense amplifiers is shown. Suppose that in some
cycle we write a value of '1' to a certain cell in some column j. As a result, we have
to drive the pair of bit-lines corresponding to column j to (V_dd, 0). Suppose now that, in
the next cycle, a cell in the same column j (possibly a different one) must be written with
the value '1'; we must again set the pair of bit-lines to (V_dd, 0), because between these two
write cycles the pair of bit-lines is typically pre-charged to a '1' value. It is obvious that
the bit-line pre-charging and the subsequent discharging become redundant. Next
suppose that, in the next cycle, the value that must be written to some cell in that same
column j is '0'. As a result, the bit-line pair corresponding to that column must be set to
(0, V_dd) for the correct write operation. More precisely, ignoring the pre-charging step,
the bit-line pair charge status is switched from (V_dd, 0) during the write of '1' to (0, V_dd)
during the write of '0'. Thus the bit-line pair experiences voltage swings in opposite
directions in two consecutive write cycles. This provides us with an opportunity to save
power/energy by employing charge-sharing.
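The following toy Python model (a behavioral sketch, not the Hspice-level circuit) mirrors this observation for a single column: it tracks the value currently held on the bit-line pair across a stream of writes, classifies each write as a non-flip or a flip, and reports the ideal bit-line energy saving under the assumption that non-flips cost nothing and flips cost half of a conventional write thanks to charge sharing (all overheads ignored).

```python
# Behavioral sketch of the non-flip / flip classification for one column.
# Assumption: with per-cycle precharge removed, a non-flip write needs no bit-line
# activity, and a flip write with charge sharing costs ~50% of a conventional write.
# Both figures are the idealized bounds discussed in this chapter.
import random

def classify_writes(data_stream):
    """Return (#non_flip, #flip) write cycles for a single column."""
    non_flip = flip = 0
    bl_state = 1                      # value currently driven on BL (BL-bar is its complement)
    for d in data_stream:
        if d == bl_state:
            non_flip += 1             # same data: precharge and discharge are redundant
        else:
            flip += 1                 # opposite data: flip the pair, charge sharing applies
            bl_state = d
    return non_flip, flip

random.seed(0)
writes = [random.randint(0, 1) for _ in range(10000)]
nf, fl = classify_writes(writes)
ideal_saving = (nf * 1.0 + fl * 0.5) / (nf + fl)
print(f"non-flip {nf}, flip {fl}, ideal bit-line energy saving ~ {ideal_saving:.0%}")
```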
Making use of this observation, we present a mechanism [74] which avoids the
aforementioned redundant bit-line pre-charging phase in the write operation cycles. In
addition, a charge-sharing scheme is conditionally used, depending on the data written in
previous cycle and data being written in the current cycle, to further increase the
power/energy savings. Our proposed mechanism modifies the conventional memory
structure in the RegFile by adding a low-complexity controller that consists of 1) bit-line
flip detection logic, which does away with the per-cycle pre-charging, 2) a pair of
MOSFET switches, which enables the charge-sharing, 3) circuitry that determines the
charge-sharing period, and 4) a buffer chain that delays the subsequent occurrence of
conventional bit-line charging/discharging in the write operation on each column. The
area and delay overheads of all of the additional circuitry are reported.
2.2.1 Prior Work
With the advent of increasingly fast, dense, and power hungry memory structures,
energy has become a major issue. In the half-swing (HS) scheme ([62]), a 75% power
reduction was achieved by restricting the bit-line swing to half of V_dd, combined with
charge recycling. A similar charge recycling/swing voltage reduction technique was
proposed by Yang et al. in [95]. Generally, reducing the swing voltage from the full V_dd
swing is a powerful, cell data independent method, and its power saving is proportional to
the swing voltage. However, the lower bound on the swing voltage is limited, since it is
strongly dependent on the sensitivity of the SA, and the HS scheme innately has a problem
with the read operation, since a half-V_dd bit-line pair in the read operation increases the
chance of an erroneous cell data flip.
In [43], Kanda et al. reported that 90% write power savings were achieved in SRAM
using a sense-amplifying memory cell. Similar to the above, their technique reduces the bit-
line swing by V_dd/6 but amplifies the voltage swing with the SA instead. However, their
performance is bit-width dependent: they reported achieving 90% write power
savings at a bit-width of 256, but their reduction ratio decreases when the bit-width
becomes smaller. In [17], Cheng et al. devised a single bit-line driving technique for the
write operation in SRAM to eliminate the excessive full charging of the bit-line pair.
They force a strong '0' signal onto a single side of the bit-lines while leaving the other side
floating.
In [44], Karandikar et al. introduced a hierarchical divided bit-line [94] concept in
low power SRAM design. The division of the bit-line into hierarchical sub bit-lines reduces
the bit-line capacitance and hence the dynamic power in SRAM. In this technique,
however, memory access is confined to a smaller sub-array, and the area overhead of the
extra decoding and control circuitry is not negligible. Using adiabatic circuits, Hu et al.
[40] presented a low power register file architecture that employs complementary
pass-transistor adiabatic logic for all the circuits of the RegFile except the storage cells,
achieving energy savings in the range of 65% to 85%. However, their approach supports
frequencies ranging only from 25 to 200 MHz, which is quite slow even for many
embedded/mobile processors.
Observing the higher number of read/write ports in RegFiles supporting wide issue
widths, the authors of [51] proposed reducing the peak demand for read ports using delayed
write-back and an operand prefetch buffer. However, their approach introduces the additional
hardware/energy overhead of these new queue and buffer structures, which will be
significant for any realistic RegFile size (64 to 128 entries). Making another
architecture level observation in [52], namely that many operands do not need the full
bit-width, the authors propose bit-partitioning to reduce access time and energy in RegFiles.
Their approach requires significant modifications to the pipeline.
The circuit level approach presented in this thesis for write power reduction in the
RegFile employs the idea of charge sharing between the bit-line pair of a given column. In
particular, unlike previous approaches that exploit charge sharing, we exploit the
RegFile structure, which has dedicated write ports and thus can maintain the charge status
on the bit-lines without any reads coming in between. Furthermore, making use of data
awareness on a bit-line pair alone can avoid half of the bit-line power consumption,
and the additional charge-sharing mechanism can increase this ratio up to 75%, if we
disregard the overheads. Section 2.2.4 will cover this theoretical case.
2.2.2 Conditional Charge Sharing: Concepts
2.2.2.1 RegFile Structure and the Write Operation
Figure 2.3 shows a 6-T storage cell in the conventional RegFile structure: each cell
has a pair of cross-coupled inverters and an access transistor connected on each side to
either the bit-line or the bit-line-bar. Compared with the 6-T memory cell in a conventional
SRAM, each storage cell in the RegFile is connected to multiple bit-lines to
incorporate multiple read and write ports. In Figure 2.3, for example, the cell has two
pairs of read bit-lines (RBL1/RBL1-bar and RBL2/RBL2-bar) and a pair of write bit-lines
(WBL and WBL-bar), which corresponds to an issue width of 1. Every increase of one in
issue width is accompanied by an increase of two in read ports and an increase of one in
write ports. Different word-lines, e.g., RWL1 and RWL2 (read word-lines) and WWL
(write word-line), are selectively chosen for the specific operation of the cells.
Figure 2.3: A Typical 6-T cell in RegFile with issue width of one.
Figure 2.4 shows one such write port to illustrate the basic write operation. As shown,
WBL and WBL-bar are pre-charged to V_dd in every write operation. When the new cell data
comes in, depending on its value, one of the bit-lines will be discharged by the
discharging circuitry at the bottom. This discharging circuitry is enabled by the write enable
(WEN) signal. The goal of the equalization transistor is to speed up the equalization of
WBL and WBL-bar during the pre-charge phase, by allowing the capacitance and the pull-
up transistor of the non-discharged bit-line to assist in pre-charging the discharged bit-
line. Note that each write operation is independent of the previous write operation, in the
sense that no matter what data arrives for the next write operation, we fully pre-charge
both bit-lines to V_dd and fully discharge one of them; therefore, the write operation
consumes the same amount of power every time.
Figure 2.4: Conventional write operation in the RegFile.
2.2.2.2 Motivation for the Conditional Charge Sharing
Our conditional charge-sharing idea starts from the following observation: when
new cell data comes in during the write operation, both bit-lines will either be flipped
in the opposite direction or remain the same, depending on the previously written data and
the data currently being written. If the new data is the same as the current data on the bit-lines
(which may be targeted to a different cell in the same column), then we do not need to
unnecessarily charge one of the bit-lines to V_dd and then subsequently discharge it to
GND. Only if the new data is different from the current data on the bit-lines do we need
to flip both bit-lines in the opposite direction, i.e., the bit-line that was previously
at V_dd has to be discharged to GND and the bit-line that was previously at GND has to be
charged to V_dd. (Here we are assuming that we do not pre-charge both bit-lines
to V_dd after every write operation, as will be explained later.) The latter situation
provides an opportunity to apply charge-sharing between the bit-line pair, so as to transfer
some of the charge from the bit-line that is going to be discharged to the bit-line that is
going to be charged to V_dd, as explained below.
Figure 2.5: Circuit design for the conditional charge-sharing scheme.
2.2.3 Conditional Charge Sharing: Circuitry
Based on the aforementioned observation, we augmented the traditional RegFile
design to facilitate conditional charge-sharing operation. Our circuit design is depicted in
Figure 2.5. It includes a bit-line flip detector, a charge-sharing period generator, a delayed
write-enable (WEN) generator, and a pair of charge-sharing switches. The pre-charging
and discharging circuitries are modified as well. Note that these circuit elements are
added for each bit-line pair, i.e., a column. The next subsections will explain the
operation of each circuit element in detail.
2.2.3.1 Bit-line Flip Detector
In any cycle, the bit-line pair, BL and BL-bar, holds a pair of complementary values, e.g.,
(BL, BL-bar) = (V_dd, 0). If the same data, e.g., (V_dd, 0), is being written to this column, then
neither of these bit-lines must be flipped. If, however, a different value, e.g., (0, V_dd), is to be
written to this column, then both bit-lines must be flipped, which charges one bit-line from
0 to V_dd and discharges the other bit-line from V_dd to 0. The bit-line flip detector in
Figure 2.5 detects this bit flipping situation and generates the FLIP signal, indicating
whether the bit-line flip will actually occur in this column.
The bit-line flip detector does not follow the conventional XOR gate design, which
generally needs 6 transistors. Our XOR gate is designed with only two NMOS
transistors, which reduces the area overhead. Such a 2-T XOR gate design, in turn, requires
both positive and negative signals, which are fortunately available in the memory structure of
the RegFile; hence, no additional inverters are needed for logic value complementation.
2.2.3.2 Charge Sharing Period Generator and the Charge-Sharing Switch
In our proposed design, it is crucial to allow a switching period that
ensures complete charge sharing between the bit-line pair. The charge sharing period
generator in Figure 2.5 generates a '0' at the output of the NOR gate whenever WEN is '0',
since the NAND gate produces a '1' (FLIP-bar = 1) and, hence, the NOR gate produces
a '0' (FLIP_T = 0). However, when a flip is detected and WEN is '1', the NAND gate
produces a '0' (FLIP-bar = 0). At this time, the other input of the NOR gate (the
disable_cs signal, the charge sharing disabler, which arrives through the two buffers in the delayed WEN
generator) is still '0', and thus the NOR gate momentarily produces a '1' (which turns the
switch ON). Shortly thereafter, when disable_cs becomes '1', the NOR gate produces
a '0' (which turns the switch OFF). This momentary '1' at the output of the NOR
gate enables the charge sharing switch to connect the two bit-lines. Clearly, the amount of
time FLIP_T stays at '1' depends on the two buffers in the delayed WEN generator.
We size these two buffers such that the pulse widths of FLIP_T and FLIP_T-bar are large
enough for the switches to perform close to full charge sharing, resulting in a common
voltage value on the bit-line pair (i.e., the pair reaches close to its equilibrium voltage).
Clearly, the design of the delay element, i.e., the two buffers in the delayed WEN generator,
is critical. If the delay element is designed to generate a pulse with too small a duty cycle for driving
the gates of the charge sharing transistors, full charge sharing will not take place, and
hence, the power savings will be reduced. On the other hand, if it is designed to generate
a pulse with too long a duty cycle, it can cause a timing violation by elongating the write cycle,
thereby missing the memory access clock cycle, which is typically set by the read
operation latency.
The size of the charge sharing switch also determines the time period that is spent in
carrying out the charge transfer and bringing the two bit-lines to voltage equalization. We
design the switch such that it generates charge sharing current similar to the bit-line
charging / discharging current of the conventional write operation, which basically
considers the bit-line pull-up and pull-down transistor sizes. A larger charge sharing
switch will shorten the charge sharing period but will also result in high power
consumption in its driving path.
2.2.3.3 Charging/Discharging Circuits & Delay Generator
Our conditional charge-sharing mechanism does not pre-charge the bit-line pairs on every cycle. As such, pre-charging (and likewise discharging) of the bit-lines occurs according to the actual occurrence of charge sharing. At the end of charge sharing, the BL and BL-bar voltage levels will be equalized. Then, one bit-line will be charged to Vdd (by the charging circuitry shown in Figure 2.5) while the other bit-line will be discharged to GND (by the discharging circuitry shown in Figure 2.5). In this case, full swings on BL and BL-bar are avoided because both bit-lines start from an equal voltage of Vdd/2 and move in opposite directions to the Vdd and GND levels. As a consequence, we save power, as explained in the next section.
The delayed WEN generator in Figure 2.5 is designed to provide two features: 1) produce a suitable turn-on period for the charge-sharing switch (owing to the two buffers of the delay element), and 2) delay the bit-line charging / discharging transistors such that neither bit-line is connected to Vdd or GND during the charge-sharing period, thereby avoiding a short-circuit path. Notice that the delayed WEN generator is shared among all the columns, and hence, we can reduce its area and power overhead.
Figure 2.6: Pictorial comparison between the conditional charge-sharing vs.
conventional techniques.
2.2.4 Performance Estimation
We estimate the maximum achievable power savings of our charge-sharing scheme, without accounting for the overhead, using a pictorial explanation. Figure 2.6 shows how the charge on the bit-line pair changes during consecutive write operations, both in the conventional scheme and in the charge-sharing scheme. The color of each bar-graph pair represents the charge status of the bit-lines, i.e., blue means 'charged' and white means 'discharged'. The upper bar-graphs correspond to the write operation in the conventional scheme whereas the lower bar-graphs correspond to the write operation in the charge-sharing scheme. Moreover, the transition from cycle n to n+1 corresponds to a bit-line non-flip (i.e., the same data value is being written) whereas the transition from cycle n+1 to n+2 corresponds to a bit-line flip (i.e., a different data value is being written). By showing these two cases, we estimate the average energy savings in both the flip and non-flip situations.
We assume an initial bit-line state of (BL, BL-bar) = (0, Vdd) in cycle n. In cycle n+1, in the conventional scheme, we assume the same data value, i.e., (BL, BL-bar) = (0, Vdd), comes in. Since the bit-line pair needs to be fully pre-charged before the write operation, BL has been fully charged at this time. Next, the new data discharges BL again. In cycle n+2, we assume that the data value (BL, BL-bar) = (Vdd, 0) comes in. Again, BL will have been fully pre-charged, but this time BL-bar is fully discharged by the new data. To sum up, a total of four bit-line charge/discharge operations occur. In contrast, in the charge-sharing scheme, the bit-lines have not been pre-charged at cycle n, and the same data value, (BL, BL-bar) = (0, Vdd), comes in at cycle n+1. During the write operation at cycle n+1, neither bit-line is charged or discharged since the currently discharged BL already matches the new data. When a different data value (BL, BL-bar) = (Vdd, 0) comes in at cycle n+2, the bit-line flip detector is triggered and the charge-sharing switch is turned on. As a result, the charge stored on BL-bar is transferred to BL. After the bit-line pair reaches voltage equilibrium, the (delayed) write circuitry charges BL to Vdd from Vdd/2 while discharging BL-bar to GND from Vdd/2. To sum up, the equivalent of only one bit-line charge/discharge operation occurs. Therefore, in the ideal case described above, the charge-sharing solution can save up to 75% of the bit-line power dissipation for consecutive writes of the same value (e.g., between cycles n and n+1) followed by the opposite value (e.g., between cycles n+1 and n+2) into any cell(s) on the same column of the RegFile. Our scheme eliminates energy dissipation altogether in the non-flipped bit-line pair case, while it reduces the energy consumed in the flipped bit-line pair case by a factor of 2.
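The ideal-case accounting above can also be checked with a short back-of-the-envelope calculation. The sketch below is illustrative only: the bit-line capacitance C_BL and the supply voltage are placeholder values, all supporting-circuitry overhead is ignored (exactly as in the pictorial example), and energy is counted as the charge drawn from the supply times Vdd.

    # Back-of-the-envelope estimate of the ideal bit-line energy savings described
    # above.  Energy drawn from the supply to raise a capacitance C from v1 to v2
    # is C*Vdd*(v2 - v1); discharging to ground draws no supply energy.
    # C_BL and VDD are arbitrary placeholder values.

    C_BL = 50e-15   # assumed bit-line capacitance [F]
    VDD  = 1.0      # assumed supply voltage [V]

    def supply_energy(v_from, v_to, c=C_BL, vdd=VDD):
        """Energy pulled from the supply when a node moves from v_from to v_to."""
        return c * vdd * (v_to - v_from) if v_to > v_from else 0.0

    # Conventional scheme: cycle n+1 (same data) and n+2 (opposite data).
    # Each write is preceded by a full pre-charge of the discharged bit-line.
    e_conv  = supply_energy(0.0, VDD)   # n+1: pre-charge BL to Vdd
    e_conv += 0.0                       # n+1: discharge BL again (same data)
    e_conv += supply_energy(0.0, VDD)   # n+2: pre-charge BL to Vdd
    e_conv += 0.0                       # n+2: discharge BL-bar (opposite data)

    # Conditional charge-sharing scheme: no pre-charge; on a flip the bit-lines
    # first equalize at Vdd/2 (no supply energy), then one line is charged from
    # Vdd/2 to Vdd and the other discharged from Vdd/2 to ground.
    e_cs  = 0.0                          # n+1: same data, nothing moves
    e_cs += supply_energy(VDD / 2, VDD)  # n+2: finish charging one bit-line

    savings = 1.0 - e_cs / e_conv
    print(f"ideal bit-line energy savings: {savings:.0%}")   # -> 75%

Running it reproduces the 75% ideal saving for a same-value write followed by an opposite-value write.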
While the power consumed in flipping the cell data and the additional power consumed in the supporting circuitry are not considered in this example, for simplicity of the demonstration, the RegFile and the supporting circuitry for conditional charge sharing described in the previous section were implemented in Hspice to accurately account for delay and energy overheads in the experimental results reported in Section 2.2.6. Notice that at cycles n+1 or n+2, a cell data flip may or may not occur, depending on the previously stored cell data. Indeed, the data stored inside the cell that is being written is not considered in the charge-sharing scheme, i.e., we may or may not be flipping the cell, and this does not affect the charge-sharing scheme in any way.
We point out that the conditional charge-sharing scheme may not be applied to the conventional SRAM array, which has a single bit-line pair per column (cf. Figure 2.2) shared between the read and write operations. The read operation needs unconditional pre-charging; our proposed technique, if applied to the SRAM array, would replace the pre-charging logic with conditional charging, which would in turn increase the time required for performing the read operation. The read access delay, however, sets the overall access time of the SRAM array, and hence, the conditional charge-sharing scheme would result in SRAM array performance degradation. In contrast, the RegFile has dedicated and separate bit-lines for read and write operations into a register, and therefore it is amenable to the application of the proposed charge-sharing scheme. In Chapter 3, however, we propose a novel charge-sharing based architecture for caches (SRAMs) that have shared/single read/write ports.
2.2.5 Overhead Estimation
2.2.5.1 Area Overhead
We estimate the area overhead for the conditional charge-sharing scheme by
calculating the ratio of the number of additional transistors to the transistor count of the
conventional design. The additional parts in each column are: a) bit-line flip detector
(XOR + two-input NAND), b) charge sharing switch (PMOS + NMOS), c) charge
sharing period generator (two-input NOR + inverter), d) delayed WEN generator (2
buffers and 2 inverters), e) the modified pre-charging/discharging circuitry (2 two-input
NAND gates). Hence, the number of transistors added per bit-line column (i.e., per bit-line pair) is 34
transistors: 6-T (flip detector) + 2-T (switch) + 6-T (period generator) + 12-T (delay
generator) + 8-T (modified pre-charging / discharging).
For a conventional RegFile with two read ports and one write port, which is 32 bit
wide and has an issue width of 1, the number of transistors in each column is 653: 64
(rows) x 10 (4 transistors in the cross coupled inverter pair, 4 access transistors for two
read ports and 2 access transistors for the write port) + 3-T (two pre-charge PMOS
transistors and a PMOS equalizer) + 10-T (two-input NAND which is 4-T and an NMOS
transistor which is 1-T, on each side of the bit-line). As a result, the proposed technique needs a total of 34 + 653 = 687 transistors per column, i.e., the resulting estimated overhead
is 34/653 = 5%. Note that the transistors in the delayed WEN generator can be shared
over all columns hence the actual area overhead is indeed smaller. However, we included
these transistors in the area penalty estimate to be conservative.
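For reference, the transistor-count bookkeeping above can be tallied as follows; this is a simple count using the per-block numbers quoted in the text and, like the text, it ignores transistor sizing.

    # Transistor-count estimate of the area overhead (sizes ignored).
    added_per_column = {
        "bit-line flip detector (2-T XOR + NAND)": 6,
        "charge-sharing switch":                   2,
        "charge-sharing period generator":         6,
        "delayed WEN generator":                  12,
        "modified pre-charge / discharge":         8,
    }
    rows = 64
    cell_transistors = 10          # 4 cross-coupled + 4 read-port + 2 write-port
    column_baseline = rows * cell_transistors + 3 + 10   # = 653

    added = sum(added_per_column.values())                  # = 34
    print(added, column_baseline, added / column_baseline)  # 34 653 ~0.052 (about 5%)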
Note that the above estimate ignores transistor sizes. When transistor sizes are accounted for, and assuming conservatively that the delayed WEN generator block is shared among 8 columns, the area overhead increases to 15%. This 15% area overhead applies only to the memory array, including the sense amplifiers, write circuitry and pre-charge circuitry, and excludes the decoder logic. Hence, the area overhead would be even smaller once the row decoder is accounted for.
2.2.5.2 Delay Overhead
In Figure 2.7, we show a cycle period during the write operation under the conventional and the conditional charge-sharing schemes. Unlike large SRAM arrays, the row decoding time is not critical since the number of rows in the RegFile is relatively small. Moreover, there is no need to wait for column decoding in the RegFile, i.e., discharging of the bit-lines can start at the beginning of the cycle. During a write to a conventional RegFile, either the bit-line or the bit-line-bar starts getting discharged based on the new cell data. Subsequently, the cell writing and the bit-line pre-charging take place. In contrast, during a write in the conditional charge-sharing RegFile, some amount of time is needed to turn ON the data-dependent charge-sharing switch so that the conditional charge sharing can subsequently take place. When the charge sharing is completed, the charging / discharging circuitry performs the remaining bit-line charging / discharging, starting from the equalized voltage level of Vdd/2. The cell writing occurs last.
Figure 2.7: Decomposition of a write delay in both designs.
In the conditional charge-sharing scheme, we save some amount of time by avoiding redundant bit-line pre-charging; however, we spend extra time to share the charge between the bit-line pair so as to bring these lines to an equilibrium voltage. Hence, there is a trade-off between the charge-sharing delay and the sizing of the charge-sharing switch, which in turn affects the energy saving. There is no delay increase when the new data is the same as before, i.e., neither the charge sharing nor the bit-line pre-charging is needed. On the other hand, when the data is flipped, the extra circuitry needed to facilitate the charge sharing increases the write delay. This increase in the write delay is accounted for in our implementation and is reported in the experimental results.
2.2.6 Experimental Results
For the experiments with the conditional charge-sharing scheme, we use Hspice for the power and delay calculations. The proposed technique is implemented on a 64 x 32-bit RegFile and a 128 x 32-bit RegFile. Conventional RegFiles (with the two pre-charge and one equalization PMOS transistor configuration of Figure 2.4) were also simulated for comparison purposes. For the technology file, we used the 65nm PTM from [75]. The temperature was set to 75 °C and Vdd was set to 1.0V.
Figure 2.8: Electrical waveforms for various signals when writing a '11' sequence into a cell that had initially stored a value of '0'. (a) Conditional charge-sharing scheme (b) Conventional scheme.
Figure 2.8 shows the waveforms of consecutive write operations with the conventional scheme (waveform in the lower half of the figure) and the conditional charge-sharing scheme (waveform in the upper half of the figure) in the 64 x 32-bit RegFile. In each waveform, we show two cycles: the first cycle corresponds to an actual cell data flip while the second cycle corresponds to the non-flipping case. Here the initial status of BL and BL-bar is assumed to be 0 and Vdd, respectively. During the first write cycle in the conventional scheme, the order of operation is: pre-charge BL, discharge one of the bit-lines, and perform the cell flip. During the second cycle, the pre-charge and discharge of BL still occur, which consumes redundant power. In contrast, during the first write cycle in the conditional charge-sharing scheme, the order of operation is: charge sharing between the bit-lines, charging / discharging the remaining portion of the bit-line swing, and performing the cell flip. Notice that the charge sharing itself does not consume power, and the remaining bit-line charging/discharging consumes less power than its conventional counterpart. In the second cycle, there is no bit-line status change, which ideally (ignoring bit-line leakage) does not consume any power. One important point is that bit-line flips do not necessarily result in a cell flip; the two outcomes are independent of one another. For example, in the conventional write operation curve, the same bit-line is discharged twice over the two cycles. At the same time, in the first cycle the cell itself flips, while in the second cycle it (or any other cell in the same column) does not.
Figure 2.8 also shows the current waveform drawn from the Vdd line that powers all circuitry for a single column of the RegFile, including the memory cells attached to the column, the pre-charge logic of the conventional scheme, plus the additional charge-sharing logic of the proposed scheme. Notice that the current waveforms of the conventional RegFile for the case of writing a '1' into a cell storing '0' and the case of writing a '1' into a cell storing '1' are nearly identical. This confirms the data-independent power consumption of the write operation in the conventional RegFile design. In contrast, the current waveform of the conditional charge-sharing RegFile design is strongly dependent on whether or not a bit-line pair flip occurs. During the first write cycle (bit-line flip case), we dissipate some amount of power due to switching activity inside the circuitry that is responsible for charge sharing between BL and BL-bar. This overhead reduces the theoretical energy saving in the bit-flip case from 50% to 39.2%. During the second write cycle (no flip case), there is some amount of power consumption due to activity caused by WEN in the added charge-sharing circuitry. Note that in either case, bit-line flip or non-flip, the maximum current in the charge-sharing based RegFile is roughly half the value of the maximum current in the conventional RegFile.
Table 2.1: THE ENERGY SAVINGS AND DELAY PENALTY

RegFile size | Bit-line status | Energy savings of charge-sharing design over conventional design (%) | Delay penalty in write operation (%)
64           | Flip            | 40.4 | 16.2
64           | Non-flip        | 90.3 | 16.2
128          | Flip            | 38.0 | 16.2
128          | Non-flip        | 90.1 | 16.2
Table 2.1 shows another set of experimental results for power dissipation and delay. Compared to the conventional RegFiles, the proposed technique achieves an average of 39.2% and 90.2% energy savings in the two RegFiles for the flipping and non-flipping writes, respectively. The energy savings come from 1) the reduced bit-line charge/discharge swing due to charge sharing (in the 'Flip' case), and 2) the elimination of unnecessary pre-charging (in the 'Non-flip' case). In both cases, the delay increase is 16.2%.
In Table 2.2, we decompose the overall power consumption of the conditional charge-sharing design into its constituent parts. As shown, the bit-line flip detector and the switching period generator together consume almost half of the power because 1) the switching period generator needs to drive the large charge-sharing switches in case of a bit-line flip, and 2) due to charge sharing, the portion of a write operation's power spent on bit-line charging is dramatically reduced.
Table 2.2: NORMALIZED POWER DECOMPOSITION (%)

Functional block                                     | RegFile size 64 | RegFile size 128
Bit-line flip detector & switching period generator  | 49.7            | 49.0
Delayed WEN generator                                | 3.9             | 3.8
Charge-sharing switches                              | 0.9             | 1.0
Bit-line charging                                    | 45.5            | 46.2
To estimate the net energy savings, we used SimpleScalar [37] and ran some applications from the SPEC2000INT benchmark suite [38] (with ref input files, whose names are listed along with the benchmark names) and two applications from the MediaBench suite (with custom input files). The architectural simulator was configured with an issue width of 4 and was modified to record the number of bits flipped during RegFile write operations over a complete run of each program. With this bit-flip information and the cycle-level energy saving values (for the 64 x 32-bit RegFile) from Table 2.1, we computed the energy savings reported in Figure 2.9. The figure shows, on the X-axis, the average per-write ratio of bit-line non-flips to the total bit-width of a write. Clearly, the energy saving of the proposed scheme varies linearly as a function of this average ratio of bit-line non-flips per write operation of the RegFile. Our experimental results show that, on average, we achieve 61.5% energy savings per write operation in these programs.
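The per-program savings in Figure 2.9 follow directly from weighting the cycle-level values of Table 2.1 by the measured flip statistics. A minimal sketch of that weighting is shown below; the 0.40 non-flip ratio used in the example is a made-up value rather than one of the measured benchmarks.

    # Per-write energy savings as a linear function of the bit-line non-flip ratio,
    # using the cycle-level savings measured for the 64 x 32-bit RegFile (Table 2.1).
    SAVING_FLIP    = 0.404   # energy saving when the bit-line pair flips
    SAVING_NONFLIP = 0.903   # energy saving when it does not flip

    def write_saving(nonflip_ratio):
        """Expected saving for one write given the fraction of non-flipping columns."""
        return nonflip_ratio * SAVING_NONFLIP + (1.0 - nonflip_ratio) * SAVING_FLIP

    # Example: a hypothetical program in which 40% of the bit-line pairs do not flip
    # on an average write.
    print(f"{write_saving(0.40):.1%}")   # -> 60.4%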
Figure 2.9: Energy savings as a function of the average ratio of non-flip bit-line pairs per write access to the RegFile.
2.2.6.1 Impact of Charge Sharing Based Writes on Reorder Buffer
Since the proposed charge-sharing based write power reduction technique is applicable to any SRAM structure with separate write ports, and not just the RegFile, we decided to carry out an energy savings analysis for another similar on-chip structure, the Reorder Buffer (ROB). The ROB is employed in modern out-of-order processors in order to retire, in program order, instructions that are issued in order but executed out of order. The ROB is a circular queue that stores instructions in their program order. Upon decoding an instruction, an ROB entry is allocated to it. When the instruction finishes its execution, the results are written back to the physical register file and the corresponding ROB entry is updated. Since many instructions can be decoded, written back and retired in the same cycle, the ROB needs separate read and write ports, similar to the RegFile. For processors with a separate physical register file, each entry of the ROB stores the PC (program counter) of the corresponding instruction, the architectural register id of the destination register, and exception flags [54]. We carry out the energy saving analysis for the ROB during the instruction decode/dispatch stage, when an ROB entry is allocated and the PC and the architectural register id of the destination register are written into the ROB.
Figure 2.10 shows the results for energy savings in the ROB. As shown in the figure, compared to the RegFile, the ratio of bit-line non-flips to the total bit-width is much higher, with an average of 0.84. This is due to the following reason: during instruction decode/dispatch, the PC value and the destination register's id are written into the ROB. The PC's bit-width is 32 and the register id's bit-width is 5, so the PC dominates the total bit-width of a write operation. Given that the PC exhibits a fair amount of spatial locality, very few bits flip between consecutive updates of the PC, and thus we observe the high ratio of bit-line non-flips in Figure 2.10. Correspondingly, we also achieve significant energy savings, with an average of 82.53%.
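The high non-flip ratio of the PC field can be illustrated with a small experiment on consecutive program counter values. The addresses in the sketch below are invented, and a 4-byte instruction word with straight-line execution is assumed, so the example only demonstrates the spatial-locality effect rather than reproducing the measured 0.84 average.

    # Illustration: consecutive PCs differ in only a few low-order bits, so most
    # bit positions of the 32-bit PC field do not flip between back-to-back writes.
    def nonflip_ratio(prev, curr, width=32):
        """Fraction of bit positions that keep their value between two writes."""
        flipped = bin((prev ^ curr) & ((1 << width) - 1)).count("1")
        return 1.0 - flipped / width

    pcs = [0x0040_1000 + 4 * i for i in range(8)]   # a made-up straight-line run
    ratios = [nonflip_ratio(a, b) for a, b in zip(pcs, pcs[1:])]
    print(f"average non-flip ratio: {sum(ratios) / len(ratios):.2f}")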
Figure 2.10: Energy Savings in ROB as a function of average, per write,
ratio of non-flip bit-lines to total bit-width.
2.3 In-order Pulsed Charge Recycling in Off-chip Data Buses
In the previous section we looked at charge recycling to reduce the voltage swing on memory interconnects such as bit-lines. However, the maximum amount of charge we could recycle was 50% due to the complementary bit-line structure of memories, which restricts the number of bit-lines that are ready to receive charge to be equal to the number of bit-lines that are ready to drain charge. If, instead, we have more than one interconnect line ready to receive charge from one line that is donating charge, then we can recycle more than 50% of the charge, as we will show in Section 2.3.2.1. This is not possible in memories due to the complementary bit-line structure, where at least one bit-line of a pair is at Vdd. In the case of buses, however, on-chip or off-chip, recycling more than 50% of the charge is possible, since we can have several bus lines that are at GND and ready to receive charge along with fewer bus lines that are at Vdd and ready to donate charge. Especially in the case of off-chip buses, which have large capacitance, being able to recycle more charge can yield substantial energy savings. The effect is pronounced in embedded systems with low-power processors, since these processors do not necessarily have large on-chip caches, which can result in higher off-chip memory traffic.
Many researchers have focused on off-chip bus power reduction because the capacitive loads of these buses tend to be orders of magnitude larger than those of their on-chip counterparts, and thus they contribute significantly to overall system power ([3], [53]). Some researchers have proposed charge sharing and recycling based ideas to reduce power in off-chip buses; these follow the conventional charge-sharing idea, i.e., the same amount of charge is distributed among all the capacitive loads involved. Bus encoding techniques have proven quite effective on address buses because of the spatiotemporal locality of the addresses on the bus. Unfortunately, these techniques are not as effective for data buses due to the unpredictable nature of the values that appear on them.
In this work, unlike previous approaches, we present a charge recycling technique
that exploits the basic principle of charge sharing and maximizes the recycled charge in
the off-chip data bus [73] using sequential charge recycling in multiple charge sharing
cycles, as explained later. The proposed technique does not need a priori information
about the data stream on the bus, which is a must for any encoding-based technique.
2.3.1 Prior Work
The concepts of charge sharing and charge recycling are well known, and their application to energy-efficient design of on- and off-chip bus architectures has been explored in the past. In [47], Khoo et al. reported theoretically achievable energy savings of 47% for a 32-bit data bus and proposed an efficient charge-recovery technique. In [8][60], the authors extended Khoo's work, and their results show that a simple implementation of the charge-recovery data bus is capable of reducing the average bus energy consumption by 28%. In [85], Sotiriadis et al. analyzed and implemented a charge-recycling technique for an on-chip data bus. This is the closest work to the idea proposed here; however, it does not maximize the recycled charge. Moreover, an on-chip bus is not an ideal target for charge recycling, since it tends to have repeaters, which limit the scope of charge recycling to the portion of the bus before the repeaters. The analytical comparison between Sotiriadis' work and our proposed technique, presented later, shows that more charge is recycled with our technique.
Unlike the aforementioned circuit-level charge recycling techniques for bus architectures, many researchers have proposed power saving solutions based on different types of encoding techniques. In [61], Stan et al. proposed bus-invert coding, which toggles the polarity of the signals according to the Hamming distance between two consecutive data values, using an additional bit line on the bus. In [69], Sotiriadis et al. proposed a data bus encoding scheme that takes the effects of inter-wire capacitance into account. For the address bus, instead, most researchers typically utilize the spatial locality of reference in addresses. In [67], Musoll proposed an encoding technique based on the observation that some applications favor a few working zones of their memory address space; for an address reference belonging to one of these zones, only the offset of the reference with respect to the previous reference to that zone needs to be sent over the bus, along with an identifier of the zone. In [4] and [5], Benini et al. exploited the temporal correlations on the address bus to propose a zero-transition encoding scheme and enhanced it by combining it with previous encoding schemes.
2.3.2 Pulsed Charge Recycling
2.3.2.1 Key Concept and Method
The proposed in-order pulsed charge-recycling technique (or PCR for short) attempts to maximize the recycled charge compared to conventional charge recycling techniques such as that of reference [85]. From now on we will refer to the charge-recycling scheme of [85] as the conventional charge recycling scheme (or CCR for short). Let us understand the difference between PCR and CCR with the aid of the example depicted in Figure 2.11.
Assume that we have three bus lines, 1, 2 and 3, with current (cycle n-1) data values '1' (represented by Vdd), '0' and '0', respectively. Moreover, assume that the next set of values to be written on these lines (in cycle n) is '0', '1' and '1', respectively. In the figure, the blue bar corresponds to the amount of charge present on the bus line and the red bar corresponds to the amount of charge that needs to be extracted from the supply voltage to bring the bus line to Vdd. With CCR, all three bus lines are shorted together, and thus each of the bus lines will be charged/discharged to Vdd/3, resulting in 66% of the charge stored on line 1 being recycled.
Figure 2.11: Comparison between CCR and PCR techniques.
Now consider the case where we allow charge recycling between bus lines in a sequential manner, over multiple charge-recycling cycles. In the first cycle, connect bus lines 1 and 2 and then disconnect bus line 2 from line 1; subsequently, in the second cycle, enable another charge-recycling step between bus lines 1 and 3. Notice that each of these phases is long enough to allow full charge recycling and that the two phases are non-overlapping in time. When the charge recycling takes place in the first cycle, the voltages of bus lines 1 and 2 converge to Vdd/2. When the subsequent charge recycling takes place in the second cycle, the remaining Vdd/2 on bus line 1 is shared with bus line 3, resulting in a voltage level of Vdd/4 on both lines. As a consequence, the total recycled charge is 75% of the original charge stored on line 1. Thus, PCR is able to recycle more charge than CCR by recycling charge sequentially over multiple cycles.
2.3.2.2 Energy Savings of PCR Compared to CCR
Consider an off-chip data bus with M lines, each with a total line-to-ground capacitance of C. Let us denote the data on the bus as X^{n-1} = [x_1^{n-1}, x_2^{n-1}, \dots, x_M^{n-1}] in cycle n-1 and as X^{n} = [x_1^{n}, x_2^{n}, \dots, x_M^{n}] in cycle n. Among the M lines, we denote the sets of bus lines that will experience 1→0 and 0→1 transitions as F and R, respectively. Furthermore, α = |F| and β = |R|. Since the initial status of the F lines is Vdd, the total amount of charge stored on the data bus ahead of charge sharing is:

\alpha \, C \, V_{dd}    (2.1)

In the PCR scheme, charge sharing for the R lines is done one line at a time. In this scenario, when the first R line is connected to the F lines, it receives an amount of charge equal to:

\frac{\alpha}{\alpha+1} \, C \, V_{dd}    (2.2)

The charge stored on each of the F lines drops from C \, V_{dd} to the value given in Eqn. (2.2). Next, the first R line is disconnected from the F lines and a second R line is connected to them. This second R line receives an amount of charge equal to:

\left( \frac{\alpha}{\alpha+1} \right)^{2} C \, V_{dd}    (2.3)

Continuing in this manner until all R lines have been connected, one at a time and each for a fixed period, to the F lines, the total charge transferred from the F lines to the R lines is equal to:

\sum_{k=1}^{\beta} \left( \frac{\alpha}{\alpha+1} \right)^{k} C \, V_{dd}    (2.4)

During such a transaction on an off-chip data bus without charge recycling (No Charge Recycling, or NCR for short), we would have to draw \beta \, C \, V_{dd}^{2} of energy from the supply to raise the R lines from 0 to Vdd. In contrast, with the proposed charge recycling scheme, the total energy needed to raise the R lines to Vdd is only:

\left( \beta \, C \, V_{dd} - \sum_{k=1}^{\beta} \left( \frac{\alpha}{\alpha+1} \right)^{k} C \, V_{dd} \right) V_{dd}    (2.5)

In the CCR scheme, whereby the R lines are connected to the F lines in one step (alternatively, the previously-connected R lines are not disconnected before the current R line is connected to the F lines), the total charge transferred from the F lines to the R lines is equal to:

\frac{\alpha \beta}{\alpha+\beta} \, C \, V_{dd}    (2.6)

Therefore, the total energy needed to raise the R lines to Vdd in CCR is:

\left( \beta \, C \, V_{dd} - \frac{\alpha \beta}{\alpha+\beta} \, C \, V_{dd} \right) V_{dd}    (2.7)

Notice that, in general,

\sum_{k=1}^{\beta} \left( \frac{\alpha}{\alpha+1} \right)^{k} \;\geq\; \frac{\alpha \beta}{\alpha+\beta} ,

which indicates that PCR is more effective than CCR in achieving higher energy savings through the charge-recycling idea. In particular, PCR is superior to CCR for β ≥ 2 and α ≥ 1.
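The comparison can also be carried out numerically. The sketch below evaluates the recycled-charge expressions of Eqns. (2.4) and (2.6) as reconstructed above, normalized to C·Vdd; the (α, β) pairs are example values, including the three-line case of Figure 2.11 (α = 1, β = 2), for which PCR recycles 0.75·C·Vdd versus about 0.67·C·Vdd for CCR.

    # Recycled charge (normalized to C*Vdd) transferred from the falling (F) lines
    # to the rising (R) lines under PCR (Eqn. 2.4) and CCR (Eqn. 2.6).
    def recycled_pcr(alpha, beta):
        return sum((alpha / (alpha + 1)) ** k for k in range(1, beta + 1))

    def recycled_ccr(alpha, beta):
        return alpha * beta / (alpha + beta)

    for alpha, beta in [(1, 2), (8, 8), (4, 12)]:
        q_pcr, q_ccr = recycled_pcr(alpha, beta), recycled_ccr(alpha, beta)
        # Energy still drawn from the supply to bring the beta R lines to Vdd,
        # normalized to C*Vdd^2 (Eqns. 2.5 and 2.7); NCR would need exactly beta.
        print(alpha, beta, round(q_pcr, 3), round(q_ccr, 3),
              round(beta - q_pcr, 3), round(beta - q_ccr, 3))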
2.3.2.3 Pulsed Charge Recycling Implementation
Consider an off-chip data bus where some bus lines are expected to undergo rising or falling transitions from cycle n-1 to n. The proposed charge recycling technique targets the R and F lines and consists of three steps: 1) connect all F lines to a common node, 2) connect each of the R lines to the same common node for a fixed period of time and then disconnect it, one line at a time, to enable charge sharing with the F lines, and finally, 3) resume the regular data bus transaction by enabling the tri-state buffers to complete the remaining charging (discharging) of the R (F) bus lines.
Figure 2.12: Proposed charge sharing structure for a data bus.
Figure 2.12 shows the circuit diagram for the proposed idea. For brevity, we show three lines of the data bus, one for each transition type: line 1, which experiences no transition in cycle i; line 2, which undergoes a 1→0 transition; and line M, which undergoes a 0→1 transition. In accordance with the three steps of the proposed technique, we add three functional blocks: 1) the charge donor activation circuit, 2) the pulsed charge recycling circuit, and 3) the charge/discharge completion circuit.
2.3.2.4 Charge Donor Activation Circuit
During the first step, all F lines are connected to the common node. The logic block that performs this operation is shown in blue in Figure 2.12. The circuit compares the previous data, held in the D-latch, with the current data to be written on the corresponding bus line in order to detect a falling transition. Upon such a detection, it turns ON the transmission gate (TG), which connects the F line to the common node. Note that this operation does not need to wait for the arrival of the PCR_En_1 signal. Moreover, this operation does not perform any charge sharing by itself; instead, it simply provides a path from the charge stored on the F lines to a common node from which a potential receiver can collect the charge.
2.3.2.5 Pulsed Charge Sharing Circuit
To maximize the recycled charge, each of the R lines should, in order, receive some charge from the common node. To facilitate this operation in the second phase, we perform two operations: 1) detection of a rising transition, which is done by a logic block similar to the charge donor activation circuit, called the 'charge receiver activation circuit', drawn in black for each bus line in Figure 2.12; and 2) generation of the PCR enable signals (PCR_En_i), which connect each of the R lines to the common node to enable charge recycling. This is done by a logic block named the 'PCR enable generation circuit', drawn in orange for each bus line in Figure 2.12.

To generate the enable signals for each bus line, we use a buffer chain, as depicted in Figure 2.12. The buffer chain receives the PCR_En_1 signal as its input and shifts it such that no two enable signals (PCR_En_i and PCR_En_i-1) corresponding to two different bus lines overlap one another. For a fixed TG size, notice that the amount of time required to carry out full charge sharing is variable and depends on the number of donors. To limit the complexity, we decided to use a fixed charge-sharing period, independent of the number of donors.
Notice that charge donation and reception occur only for the bus lines that undergo a transition from cycle n-1 to n. The remaining bus lines, which experience no transition from the current to the next cycle, have their charge-sharing switches (TGs) turned off, and hence the charge on these bus lines remains intact.
2.3.2.6 Charge/Discharge Completion Circuit
When the charge-recycling step is completed, to avoid shorting the bus lines, every charge-sharing switch, i.e., TG, needs to be turned OFF before the tri-state buffers are enabled. This is achieved by applying the PCR_En' signal to the clock input of the D-latches, which turns off the TGs of all F lines by overwriting the previous data stored in each D-latch with the current data. When this has been done, the delayed enable signal (En) of the tri-state buffer of each bus line activates the buffer to perform the remaining charging/discharging operation. Note that during these operations the tri-state buffers on the receiver side are OFF.
2.3.2.7 Bus Line Grouping
The design presented in the previous section enables the charge receiver circuit of each bus line one after another in some order, requiring exactly 32 charge-sharing cycles for a 32-bit wide bus. To reduce this overhead, one may group the bus lines. We experimented with groups of 8 bus lines (8-line groups) and groups of 4 bus lines (4-line groups). In the case of 8-line groups (there are 4 such groups in a 32-bit bus), we enable the charge receiver activation circuits of the bus lines in the same bit position of the 4 different groups at the same time. For example, bus line 1, bus line 9, bus line 17 and bus line 25 are enabled simultaneously, using the same PCR enable signal for all of them. (Notice, however, that only the ones that are in the R set will actually be connected to the common node.) Since each group has 8 bus lines, our buffer chain has to generate 8 such PCR enable signals, each of which drives the charge-receiver activation circuits of the corresponding 4 bus lines. Similarly, in the case of 4-line groups, the buffer chain produces 4 such PCR enable signals. Note that the notion of a group exists only for charge reception, i.e., only for the R lines. All the F bus lines are connected to the common node regardless of the group they belong to. This enables us to receive charge even from donors belonging to a different group.
2.3.3 Experimental Results
The in-order pulsed charge sharing technique was applied to a 32 bit-wide data bus
and implemented in a 0.13μm CMOS process with a 1.8V supply voltage. The power
dissipation was measured with HSpice. Each line in the off-chip data bus was modeled to
have 20pF of capacitance and 100Ω of resistance values referenced from [84]. This bus
structure modeled in HSpice was configured (i.e., its drivers were appropriately sized) to
run at 100MHz. Any delay penalty due to the PCR technique was calculated with respect
to this baseline 10ns bus transaction delay. The PCR technique implemented
corresponded to the eight 4-line group architecture explained above.
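With these bus parameters, the energy at stake per transition is easy to estimate. The short calculation below uses the 20 pF and 1.8 V figures quoted above and deliberately ignores driver losses and the line resistance, so it is only a first-order bound on the energy that PCR can partially recycle.

    # Supply energy for one rising transition of a single off-chip bus line,
    # E = C * Vdd^2 (driver losses and the 100-ohm line resistance are ignored).
    C_LINE = 20e-12    # 20 pF per line, as modeled in HSpice
    VDD    = 1.8       # supply voltage [V]

    e_line = C_LINE * VDD ** 2
    print(f"{e_line * 1e12:.1f} pJ per 0->Vdd transition")      # ~64.8 pJ
    print(f"{32 * e_line * 1e12:.0f} pJ if all 32 lines rise")  # ~2074 pJ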
2.3.3.1 Energy Savings Analysis
Figure 2.13 reports the energy savings achieved by the PCR technique compared to the bus architecture with no charge sharing (NCR). In each measurement, we used different combinations of α and β values. Note that a maximum of 32 bus lines can undergo transitions in any cycle, although most of the time the number of bit transitions is small and limited to the lower bits (as explained later). Through HSpice measurements, which fully accounted for the power dissipation of the added circuitry of the PCR architecture, we obtained an average of 17.4% energy savings of PCR over NCR. Note that higher energy savings were achieved when the two types of transitions were balanced.
Figure 2.13: Energy savings with different numbers of rising and falling transitions.
Figure 2.14 shows the comparison between the proposed and the conventional charge-sharing technique [85] for different values of α and β. As shown in the figure, the PCR technique outperforms the CCR technique for both 16 and 32 total transitions.
Figure 2.14: Energy savings comparison between the PCR and CCR.
Figure 2.15: Electrical waveforms for various signals.
Figure 2.15 shows voltage and current waveforms for a data bus transaction in both the NCR and PCR designs. In both designs, the 32 bit-wide data bus has 16 falling transitions on bus lines 0 to 15, and 16 rising transitions on bus lines 16 to 31. Note that the PCR design corresponds to the eight 4-line group charge-sharing architecture, i.e., eight of the bus lines are enabled simultaneously for charge sharing. Therefore, if G_j denotes the set of bus lines that are enabled together in the j-th charge-sharing phase, then:

G_j = \{ b_i \mid i \bmod 4 = j \}, \quad i = 0, \dots, 31, \; j = 0, \dots, 3    (2.8)

The NCR design takes 10ns while the PCR design takes 15ns to complete the same bus transaction, hence giving rise to a 50% delay penalty (CCR has a 20% delay penalty compared to NCR). The current curves demonstrate that the PCR design consumes less energy (to be precise, 35% less) than the NCR design.
2.3.3.2 Off-Chip Bus Traffic Analysis
The aforementioned analysis does not account for the characteristics of the off-chip traffic of different applications. Therefore we profiled the off-chip bus traffic of programs from the SPEC2000INT [38] (with ref input files) and MediaBench (with custom input files) benchmark suites. In Figure 2.16 we report the distribution of both transition types for an 8-bit grouping of the bus lines given by Eqn. (2.9), where G_j represents the set of bus lines that belong to group j:

G_j = \{ b_i \mid 8j \le i \le 8(j+1) - 1 \}, \quad i = 0, \dots, 31, \; j = 0, \dots, 3    (2.9)
For each group, we report the number of falling and rising transitions per group of bits for each application program. More precisely, for each group of bits, we provide two sets of bar graphs in Figure 2.16: the first set corresponds to the percentage of falling transitions in that group whereas the second set corresponds to the percentage of rising transitions in the group. From this figure, we observe that most of the transitions (around 60 to 70%) occur in the first two groups, i.e., in the lower 16 bus lines.
Figure 2.16: Distribution of Transitions.
Based on the data reported in Figure 2.16, we propose an improved PCR design in which we apply charge recycling only to the lower 16 bus lines. Furthermore, we change the grouping strategy within these 16 bits in order to reduce the delay penalty. The new charge-sharing architecture uses eight 2-line groups, i.e., bus lines 1, 3, 5, ..., 15 are connected to the common node in the first charge-sharing cycle and bus lines 2, 4, 6, ..., 16 are connected to the common node in the second charge-sharing cycle:

G_j = \{ b_i \mid i \bmod 2 = j \}, \quad i = 0, \dots, 15, \; j = 0, 1    (2.10)

The new energy savings for this case, with four falling transitions in the lower half of these bus lines and four rising transitions in the upper half, are 26.4% compared to the NCR design and 16.8% compared to the CCR design. Furthermore, the delay penalty is reduced from 4 to 2 charge-sharing cycles, resulting in only a 25% delay penalty with respect to the NCR design.
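The three receiver-side groupings of Eqns. (2.8)-(2.10) can be written out explicitly. The sketch below uses 0-indexed bus lines (matching the equations rather than the 1-indexed prose) and simply enumerates which lines share one PCR enable signal in each charge-sharing phase.

    # Receiver-side groupings used for charge reception (0-indexed bus lines).
    def groups_mod(num_lines, num_phases):
        """Lines i with i mod num_phases == j are enabled together in phase j
        (Eqn. 2.8 with num_lines=32, num_phases=4; Eqn. 2.10 with 16 and 2)."""
        return [[i for i in range(num_lines) if i % num_phases == j]
                for j in range(num_phases)]

    def groups_contiguous(num_lines, group_size):
        """Contiguous 8-bit groups of Eqn. 2.9, used for the traffic profiling."""
        return [list(range(j, j + group_size)) for j in range(0, num_lines, group_size)]

    print(groups_mod(32, 4)[0])         # phase 0 of the eight 4-line-group design
    print(groups_contiguous(32, 8)[0])  # group 0 for the transition profiling
    print(groups_mod(16, 2))            # the improved 2-phase design on the lower 16 lines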
2.4 Summary
In this chapter we looked at circuit-level energy optimization techniques based on charge recycling. We observed that in memories with dedicated write ports, the charge on the bit-lines can be preserved at the end of a write operation. This charge can be recycled during the next write operation if the value being written is different from that of the last write, in which case the bit-lines are poised to swing in opposite directions. By recycling charge between the bit-lines we can reduce the total bit-line swing necessary to flip the voltage status on the bit-lines for the new write operation.

We furthermore observed that in such memory structures the maximum charge we can recycle is 50%, due to the complementary bit-line structure, which restricts the number of bit-lines that are ready to receive charge to be equal to the number of bit-lines that are ready to drain charge. This is not the case with buses, where many bus lines can be going through rising transitions, and therefore waiting to receive charge, while fewer bus lines going through falling transitions are ready to drain charge. Making use of this unique opportunity, we proposed a sequential charge recycling technique for off-chip data buses that can recycle more than 50% of the charge by recycling it over multiple charge-sharing cycles. Furthermore, by observing that a large part of the data bus transaction activity is concentrated in the lower-order bits, we proposed a grouping strategy that offers a good trade-off between energy savings and performance penalty.
Chapter 3. Circuit/Architecture Level Energy
Optimization: Charge Recycling Cache
3.1 Introduction
In the previous chapter we looked at circuit-level energy optimization techniques, one of which employed charge recycling for on-chip memory structures with separate write ports. However, a significant fraction of the chip real estate is also dedicated to memory structures like the level 1 and level 2 caches, which contribute significantly to the power as well. It has been shown that the on-chip instruction and data caches can consume up to 45% of the total chip power in ARM processors ([82]), about 33% in the Pentium Pro ([63]), and about 15% in the Alpha 21264 ([28]).
These caches are made up of SRAM blocks similar to the RegFile or ROB, but they differ in their architecture. Caches are largely single-ported memory structures built from 6-T SRAM cells. In addition to the memory block that holds the data (the data RAM), a cache also contains a tag memory (the tag RAM) that stores part of the address. Thus a cache comprises tag RAM and data RAM blocks. Furthermore, as cache sizes keep increasing, it becomes infeasible to implement the data and/or tag RAM as a single SRAM block for power and delay reasons. Therefore most modern microprocessors employ caches with multiple banks ([33]).
The power consumption of such caches comprises various factors. At the circuit level, as in any SRAM, cache power can be decomposed into decoder power, pre-charge power, bit-line power, sense amplifier (SA) power, etc. These factors are affected by physical design parameters such as the technology, transistor sizes, bit-line/word-line capacitance, delay, etc. At the architecture level, cache power is affected largely by the cache organization. A large cache with multiple banks requires routing of addresses and data to and from the banks, which in some cases can dominate the cache power. Furthermore, cache associativity also has a significant impact on power. A highly associative cache requires a tag RAM for each way, all of which may need to be accessed for a read/write access to the cache; thus, as associativity increases, cache power increases as well. Another important factor that affects power is the cache line size. A larger line size may help satisfy a subsequent request from the line buffer instead of requiring the cache to be accessed; however, too large a line size may increase the power/delay of an access. Thus, various factors, from the circuit level to the architecture level, affect the dynamic power dissipation of a cache.
In this work, we address the issue of cache power, and in particular cache write power, at the circuit as well as the architecture level. Similar to the charge recycling scheme for the RegFile, we exploit charge recycling for single/shared ported caches by modifying our previously proposed charge recycling circuitry. Furthermore, in order to exploit this underlying circuit support for energy optimization, we modify the store retirement policy and the cache controller. We propose a late store retirement policy that retires stores in clusters in order to take advantage of back-to-back writes and charge recycling [72].
3.2 Prior Work
There has been a flurry of work on cache power optimization at different levels of system design. We categorize these previous works into circuit-level and architecture-level approaches. Circuit-level approaches exploit various circuit techniques for SRAMs that are equally applicable to caches and other SRAM-based memories, whereas architecture-level approaches exploit various architectural properties specific to caches.
Circuit level approaches: In [70] the authors proposed to conditionally invert a stored value in order to reduce the total number of bit-lines that discharge in single-ported RAMs or ROMs. In [91], observing that a large fraction of the data word bits are zeros, Villa et al. introduced a zero compression technique to reduce power dissipation in data caches. Their technique compresses zero bytes by storing a '1' in an extra bit for every byte that is zero, and thus reading only one bit instead of 8 bits for such a zero cluster. In [96] the authors similarly observe that a large fraction of the bits written are zeros and propose a zero-aware SRAM cell with an asymmetric inverter pair that reduces the power for writing a zero. Kim et al. in [46] presented a write power reduction technique similar to [95], in which they simultaneously reduce the bit-line swing and recycle the charge between bit-line pairs. However, their approach introduces significant overhead for charge recycling, while the lower bit-line swing reduces the noise margin and introduces cell stability issues, which were dealt with by using additional overhead circuitry.
Architecture level approaches: Exploiting cache compression, Kim et al. [50] proposed to use sign compression to compress the upper halves of words that contain either all '0' or all '1' bits, thus reading only the lower halves of such words and reducing power. However, their approach requires additional storage of a few bits per cache line. In [41] Huang et al. described a cache design that handles stack accesses
effectively. By keeping committed stores in the cache in order to improve the load
forwarding opportunities, authors of [68] proposed to cache the committed stores as well
as the loads in order to reduce data cache accesses and reduce power dissipation. Ghose
et al. [26] exploited multiple line-buffers, sub-banking and bit-line segmentation in order
to reduce cache energy. Exploiting the application specific behavior of embedded
systems, Zhang et al. in [97] presented a highly configurable cache that can be configured
as a direct mapped, 2-way or 4-way set associative cache, using a technique called way
concatenation. Configurability allows different sets of applications to exploit the cache configuration best suited to their behavior. Observing the trade-off between the speed of a direct mapped cache and the low miss rate of an associative cache, Zhang in [18] presented a
novel technique by extending the address decoder length to balance the accesses to cache
sets in direct mapped cache. Exploiting the locality exhibited by programs in different
regions of memory, i.e., stack, global and heap regions, Lee et al. presented in [56] region
based caching where they augment the main cache with smaller structures exclusively for
stack and global regions. They argue that these smaller structures can reduce the power
while maintaining high hit rates.
Most of these previous works on cache energy optimization do not address the impact of optimizations carried out at one level of the system design on the other levels. For example, the circuit-level optimization techniques do not analyze their impact at the architecture or system level, which would give a better idea of the overall energy savings. Similarly, many architectural optimizations underestimate, or do not analyze, the impact of the overhead required to render the approach effective.
In this part we present a charge recycling based technique to reduce the write
operation power dissipation due to bit-line swing in single/shared ported caches. Recall
that bit-lines consume significant power since they experience full voltage swing during
the write operation in conventional SRAMs; this is unlike the relatively small partial
voltage swing during the read operation. In this work, we extend our work of [74] to
caches that have shared read/write ports by proposing a new cache architecture. Having
secured support for exploiting charge recycling to reduce bit-line swing, we introduce, for
the first time, an approach to cluster stores, i.e. data cache writes, into groups by using a
late retirement policy that exploits the underlying support provided by circuit level
optimizations in order to reduce the data cache write operation power dissipation. Hence,
in the presented work, by combining circuit level energy optimization with architecture
level support we make the following contributions:
o We introduce a novel circuit for bit-line charge recycling in caches with shared read/write ports and analyze its impact on energy and performance with Hspice simulations.
o With the help of cache modeling, we analyze the impact of the proposed circuit-level optimization on cache write power reduction.
o We describe a late retirement policy for stores in order to cluster writes such that consecutive writes can benefit from the underlying circuit's support for write power reduction.
o We evaluate the proposed approach, from the circuit level to the architecture level, in terms of its power and performance implications.
3.3 Proposed CR-Cache Architecture
In order to exploit charge recycling for reducing write operation power in caches with shared read/write ports, we propose a novel cache architecture that can dynamically switch between a normal cache access mode and a charge-recycling based write access mode. At the circuit level, the proposed architecture augments the single-ported SRAM structure of the cache to enable charge recycling for back-to-back, i.e., consecutive, write operations. At the architecture level, we modify the cache controller and exploit a store queue to help us generate groups of consecutive writes that take advantage of the support provided by the underlying SRAM circuit for charge recycling.
For clarity and ease of explanation, we divide the description of the proposed cache architecture into: 1) circuit-level design and 2) architectural modifications.
3.3.1 Circuit Level Design
Figure 3.1 shows the circuit-level design of the proposed charge-recycling based cache, the CR-cache. The design shown in the figure corresponds to one column, with its bit-line (BL) and bit-line-bar (BL-bar). Each of the columns in the memory array is augmented as shown in Figure 3.1. The circuit corresponding to each column is modified by augmenting it with four blocks: 1) the Conditional Charge Sharing Block, 2) the Charge Sharing Period Generator Block, 3) the Delayed Write Enable Block, and 4) the Conditional Charge/Precharge Block. All four blocks work together to enable charge-recycling based write operations. Note that the Delayed Write Enable Block can be shared among multiple columns, which reduces the overhead. In the following subsections we explain the operation of each of these blocks.
Figure 3.1: Circuit Level Design of Proposed Charge Recycling Cache (CR-Cache)
3.3.1.1 Conditional Charge Sharing Block
The purpose of this block is to enable charge sharing between BL and BL-bar whenever the new data being written into the given column is different from the data of the previous write. For example, if the previous write operation wrote a '0' into some cell of a given column, then the status of BL and BL-bar left by that write would be '0' and '1', respectively. If the new write operation writes a '1' into this column, then we know that the bit-lines will swing in opposite directions and we can avail ourselves of the opportunity for charge sharing. If, however, the new write operation again writes a '0', then no charge sharing shall take place.
Figure 3.2: Conditional Charge Sharing Block
To facilitate such conditional charge sharing, we must detect whether the current data value being written into a given column is the same as the value written by the previous write. This function is carried out by the Conditional Charge Sharing Block (CCSB), much like the bit-line flip detector of the previous chapter. However, in order to make the XOR gate more robust, we modified the design as shown in Figure 3.2.

As shown in the figure, BL and BL-bar are connected by an NMOS charge-sharing switch (CS switch) that is controlled by the FLIP signal, which is the output of the NOR gate. If charge sharing is disabled, i.e., CS_EN = 0 and therefore CS_EN-bar = 1, then FLIP stays at '0', disabling the CS switch. However, if CS_EN = 1, then the FLIP signal is controlled by the XNOR gate, which is implemented using transmission gates to reduce overhead. The XNOR gate checks whether the current data, DATA, being written into this column is the same as the previous data, Prev. If so, the XNOR gate produces a '1', and the NOR gate therefore produces a '0' on FLIP, disabling the CS switch. However, if the new data being written is different from the previous data, then the XNOR gate produces a '0', resulting in a '0'→'1' transition on FLIP and thus enabling the CS switch. At this point BL and BL-bar are connected and share their charge.

Notice that for sufficient charge sharing to take place, we must address two issues: 1) the amount of time the CS switch is kept ON, and 2) the size of the CS switch. We address the sizing issue later, in Section 3.3.1.5. The time period for which the CS switch is kept ON is handled by another control block, the Charge Sharing Period Generator.
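Before describing the remaining blocks, it may help to summarize the intended behavior of the CCSB as a boolean function of its inputs. The sketch below is a purely behavioral model with no timing information; signal names follow Figure 3.2, and it is meant only to make the FLIP logic explicit.

    # Behavioral model of the Conditional Charge Sharing Block (no timing).
    def flip(cs_en, data, prev):
        """FLIP drives the NMOS charge-sharing switch between BL and BL-bar.
        It is asserted only when charge sharing is enabled and the new data
        differs from the value written by the previous write to this column."""
        xnor = (data == prev)          # XNOR of DATA and Prev
        return (not xnor) and cs_en    # NOR(xnor, CS_EN-bar)

    assert flip(cs_en=True,  data=1, prev=0) is True    # flip -> share charge
    assert flip(cs_en=True,  data=0, prev=0) is False   # same data -> no sharing
    assert flip(cs_en=False, data=1, prev=0) is False   # charge recycling disabled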
3.3.1.2 Charge Sharing Period Generator Block
The Charge Sharing Period Generator (CSPG) block ensures that enough time is provided for charge sharing to take place, by making sure that the CS switch is kept ON for a sufficient amount of time. Since the CS switch is controlled by a NOR gate which, when CS_EN = 1, is in turn controlled by the XNOR gate, the CSPG must ensure that the XNOR output does not change too soon. This is accomplished by updating the Prev and Prev-bar signals only after a sufficient amount of time, as explained below.
Figure 3.3: Charge Sharing Period Generator Block
As shown in Figure 3.3, the latch_update signal updates the latch that holds the
previous data value written into the given column. Whenever this latch is updated with
the data now being written, the XNOR gate produces a '1' at its output. This, in turn,
produces a '0' at the output of the NOR gate of the CCSB, thus disabling the CS switch.
Therefore, to allow sufficient time for charge sharing to take place, we must ensure that
the latch_update signal is activated with an appropriate latency. This is accomplished, in
CSPG, by delaying the WEN signal. As shown in Figure 3.3, whenever CS_EN=1, the WEN
signal arrives with a certain latency since it passes through the buffer that produces the
latch_update signal. Note that, once again, sizing plays an important role in setting the
charge sharing time period; we discuss this issue later in Section 3.3.1.5.
Note that while charge sharing is taking place, we must ensure that the write
operation is not activated. Therefore, the write must be delayed until the CS switch is
turned OFF. This is accomplished with the help of the Delayed Write Enable Block.
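The intended sequencing can be sketched, very coarsely, as a discrete-time window during which the CS switch stays closed; the delay constant below is an illustrative placeholder for the sized buffer and latch delay, not the actual circuit value.

#include <stdbool.h>

/* Coarse discrete-time sketch of the CSPG role: latch_update is a time-shifted
 * copy of WEN, and FLIP (the CS-switch control) stays high from the arrival of
 * WEN until the delayed latch_update refreshes Prev and the XNOR output rises. */
#define LATCH_UPDATE_DELAY 3          /* arbitrary time units, set by sizing */

/* Returns, through *on and *off, the interval during which the CS switch is
 * closed for a write arriving at time t_wen; returns false if no charge
 * sharing occurs (CS disabled or data unchanged). */
static bool cs_window(int t_wen, bool cs_en, bool data, bool prev,
                      int *on, int *off)
{
    if (!cs_en || data == prev)
        return false;                  /* FLIP never rises                */
    *on  = t_wen;                      /* FLIP rises with WEN             */
    *off = t_wen + LATCH_UPDATE_DELAY; /* FLIP falls once Prev is updated */
    return true;
}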
3.3.1.3 Delayed Write Enable Block
The Delayed Write Enable Block (DWEB) ensures that the WEN signal is delayed such
that the write operation is activated only after the CS switch is turned OFF, i.e., FLIP=0.
DWEB accomplishes this as shown in Figure 3.4.
As shown in the figure, when CS_EN=1, the WEN signal is delayed through a
buffer/inverter chain. Whenever WEN=0, D_charge and Dbar_charge are '0' and so is
delayed_WEN. When WEN transitions to '1', it is delayed through the chain, resulting in a
delayed '1'→'0' transition of delayed_write. At this time, if the data to be written
into the given column is '1', i.e., DATA̅=0 and DATA=1, then D_charge goes through a
'0'→'1' transition; otherwise Dbar_charge goes through a '0'→'1' transition. Furthermore,
when CS_EN=1, delayed_WEN is controlled by the buffer chain, since the other input of the
NAND gate is kept at '1' due to CS_EN̅=0. Therefore, after some delay, delayed_WEN
goes through a '0'→'1' transition.
Figure 3.4: Delayed Write Enable Block
Note, however, that if CS_EN=0 then the delayed_write signal stays at '1', resulting in
both D_charge and Dbar_charge staying at '0'. Furthermore, whenever CS_EN=0,
delayed_WEN is controlled by the output of the other NAND gate instead of the buffer
chain. Therefore, whenever CS_EN=0, WEN is propagated to delayed_WEN without any
significant delay. In addition, note that the bit-line charging must be controlled by
D_charge and Dbar_charge whenever CS_EN=1, and by Precharge (the pre-charging signal)
whenever CS_EN=0. This is accomplished by the Conditional Charge/Precharge Block.
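A behavioral sketch of the DWEB outputs, under the same caveats as before (the struct and field names are ours, and the integer delay is a stand-in for the sized buffer-chain delay), is given below.

#include <stdbool.h>

/* Behavioral sketch of the Delayed Write Enable Block (DWEB).  When CS_EN=1,
 * WEN is time-shifted through a buffer chain so the write starts only after
 * the CS switch has turned OFF; when CS_EN=0, WEN passes through essentially
 * undelayed.  D_charge / Dbar_charge select which bit-line is charged back to
 * Vdd after charge sharing. */
typedef struct {
    bool delayed_wen;   /* follows WEN, asserted after "delay" time units */
    bool d_charge;      /* pulses when writing a '1'                      */
    bool dbar_charge;   /* pulses when writing a '0'                      */
    int  delay;         /* shift applied to WEN (0 if CS_EN=0)            */
} dweb_out_t;

static dweb_out_t dweb(bool wen, bool data, bool cs_en, int chain_delay)
{
    dweb_out_t o = {0};
    o.delay       = cs_en ? chain_delay : 0;
    o.delayed_wen = wen;                    /* valid after o.delay           */
    o.d_charge    = cs_en && wen && data;   /* charge BL back to Vdd         */
    o.dbar_charge = cs_en && wen && !data;  /* charge BL-bar back to Vdd     */
    return o;
}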
3.3.1.4 Conditional Charge/Precharge Block
At the end of the charge sharing period, BL and BL̅ will have shared their charge,
thereby approaching Vdd/2, whenever 1) the data to be written is different from that of the
previous write, and 2) CS_EN=1. Therefore, at the end of the charge sharing period,
depending on the data to be written, we must bring one of the BL and BL̅ lines to Vdd
and the other to GND. The delayed_WEN signal takes care of bringing one of the bit-lines
to ground by discharging the remaining charge on that bit-line. Bringing the other bit-line
to Vdd, by providing the required remaining charge, is handled by the Conditional
Charge/Precharge Block (CCPB), as shown in Figure 3.5.
Figure 3.5: Conditional Charge/Precharge Block
Figure 3.5 shows the details of the complex gate (CGate) corresponding to BL. As
shown in the figure, CGate takes four inputs, namely Precharge, CS_EN̅, CS_EN and
D_charge. Whenever CS_EN=1, the output path controlled by the Precharge signal is
disabled and control of the output is given to the path controlled by the D_charge signal.
Therefore, if the data to be written is '1', then the D_charge signal goes through a
'0'→'1' transition, as explained earlier, resulting in a '1'→'0' transition on BL_charge.
This enables supplying the remaining charge to BL. However, if CS_EN=0, then the output
path controlled by D_charge is disabled and control is given to the output path
controlled by the Precharge signal. Hence, a regular write/read operation can take place.
Table 3.1 shows the truth table for the CGate corresponding to BL; a similar complex gate
exists for BL̅.
Table 3.1: CGate truth table corresponding to BL

Precharge   CS_EN̅   CS_EN   D_charge   BL_charge
    X         0       1        1           0
    X         0       1        0           1
    1         1       0        X           0
    0         1       0        X           1
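The truth table collapses to a simple selection between the two control signals; the following C sketch restates Table 3.1 (the function name and active-low convention noted in the comment reflect the table, while the coding style is ours).

#include <stdbool.h>

/* Sketch of the CGate logic of Table 3.1 for the BL side (a mirrored gate
 * exists for BL-bar).  BL_charge is active low: a '0' turns ON the PMOS
 * transistor that pulls the bit-line up to Vdd. */
static bool cgate_bl_charge(bool precharge, bool cs_en, bool d_charge)
{
    if (cs_en)
        return !d_charge;   /* CS mode: charge BL when D_charge pulses high */
    else
        return !precharge;  /* normal mode: follow the Precharge signal     */
}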
3.3.1.5 Transistor Sizing
Precise sizing of the transistors in a memory array design depends on many factors,
such as memory array width and depth, bit-line capacitance, access transistor sizes that
determine parasitic capacitance, timing constraints, etc. However, given an existing memory
array, we want to augment it with the support circuitry to enable charge recycling based
write operations. For this purpose the support circuitry must be sized such that it
1) generates a sufficiently long charge sharing period and lets sufficient charge be shared
between the bit-lines, and 2) delays the write operation such that charge sharing and write
operations do not overlap.
o Transistor Sizing for Charge Sharing Period
The amount of charge shared between the bit-lines depends on the size of the CS switch
as well as the amount of time for which it is turned ON. We would like to size the CS
switch such that the current flowing through it during charge sharing is similar in
magnitude to the current flowing through the pre-charging PMOS transistors during a
conventional pre-charge operation. Therefore, the CS switch could be sized similarly to the
pre-charging transistors. However, the drain-to-source voltage drops more rapidly during
charge sharing than during pre-charging, and therefore the CS switch transitions to the
linear mode of operation more quickly. Nonetheless, even though the CS switch is an NMOS
transistor, sizing it to the same width as the pre-charging PMOS transistors was sufficient
in our experiments.
Furthermore, we had to make sure that the CS switch was turned ON for a sufficiently long
period of time. Therefore, the pulse width of the FLIP signal that controls the CS switch
must be such that full charge sharing can take place. For this purpose we sized the buffer
that produces the latch_update signal, and the transistors within the latch storing the
previous data, such that the pulse width of FLIP was sufficiently long.
o Transistor Sizing for Delaying Write Enable
As mentioned earlier, we must make sure that the FLIP signal does not overlap with
delayed_WEN or with either of D_charge and Dbar_charge. For this purpose the
buffer chain was sized such that the incoming WEN pulse was shifted in time to avoid
any overlap with FLIP. Also note that the CGate blocks as well as the final
NAND gate driving delayed_WEN must be sized large enough to drive the corresponding
pull-up and pull-down transistors. Hence, the task of transistor sizing was rather complex
and involved considerable manual optimization.
3.3.1.6 Timing of Operations
A write operation in the CR-cache involves two distinct phases. In phase 1, charge sharing
between BL and BL̅ is enabled by generating the FLIP pulse. During phase 2, once the
charge sharing is completed, the write operation takes place by discharging/charging the
required remaining charge from/to BL and BL̅, depending on the data value to be
written. Figure 3.6 shows the timing of the various signals for phase 1, which involves the
CCSB and CSPG blocks. As shown in the figure, the latch_update signal is essentially a
time-shifted copy of the WEN signal. Upon arrival of latch_update, the latch storing the
previous data value is updated, albeit with a certain delay due to the transistor sizing,
which further delays the '0'→'1' transition of the XNOR gate. As the XNOR gate transitions
from '0' to '1', FLIP transitions from '1' to '0', thereby disabling the CS switch. Thus,
we generate the FLIP signal with the appropriate pulse width to facilitate charge sharing.
Figure 3.6: Signal timings for CCSB+CSPG blocks
After charge sharing is completed, we must charge/discharge the appropriate bit-line
pair for the write operation to take place. This is carried out by the other two blocks,
DWEB and CCPB, as shown in Figure 3.7. DWEB generates the delayed_WEN pulse by
appropriately time-shifting the incoming WEN pulse such that it does not overlap with the
FLIP signal. Furthermore, the CCPB block similarly generates BL_charge and BL̅_charge to
bring the appropriate bit-line (BL or BL̅) to Vdd, depending on the data value to be written.
Figure 3.7: Signal timings for DWEB+CCPB blocks
The two timing plots shown above are generated by Hspice simulation of the memory
array; the details of the memory array are given in the experimental section. The plots
correspond to a data value of '1' being written into a column in which '0' was written
previously. Figure 3.8 shows the BL and BL̅ waveforms as a function of the control signals
FLIP, delayed_WEN and BL_charge. As shown in the figure, after BL is brought up
close to Vdd and BL̅ is discharged to GND, the data is written into the cell by activating
the word-line (WL).
Note that the aforementioned circuit level support for charge sharing based write
power reduction is applicable to any SRAM based memory structure with shared
read/write ports. In this chapter we exploit this support particularly for caches at the
architecture level and show how we can take advantage of it for cache write power
reduction.
Figure 3.8: Bitline status and cell writing operation
3.3.2 Architectural Modifications
In regular caches, bit-lines must be pre-charged to Vdd at the end of each read or write
operation. However, the CR-cache design presented above does not pre-charge bit-lines
to Vdd at the end of a charge sharing enabled write operation, since under this mode we
want to preserve the charge status left on the bit-lines by the write. Thus, if the very next
cache access is again a write operation, then using charge sharing we can reduce the
bit-line swing and hence the write power. However, if the next operation is a read, then the
bit-lines must already have been pre-charged to Vdd. This implies that, for the CR-cache to
be effective in reducing write power, CS mode can be enabled (CS_EN=1) only for
consecutive write operations. Furthermore, another important parameter for the
effectiveness of the CR-cache under CS_EN=1 is the elapsed time between consecutive
writes. Since the CR-cache, under CS_EN=1, cuts off the bit-lines from Vdd and GND at the
end of a write, the bit-lines will be floating. Therefore, the charge stored on the bit-lines
at the end of a write can leak, which reduces the power savings. Besides, such charge
leakage also raises concerns about the timing of the CR-cache circuitry.
To address the aforementioned issues, i.e., consecutive writes and charge leakage, we
must cluster the write accesses going to the cache such that the write operations take
place in consecutive cycles. Note, however, that caches with separate read and write
ports, i.e., separate bit-lines for read and write operations, do not have to deal with the
issue of back-to-back writes. This is because, with separate read and write ports, the
charge status on the bit-lines can be maintained without worrying about a read coming
between two consecutive writes and destroying it. Multi-ported caches are commonplace in
out-of-order processors ([66], [39]), which can issue a load and a store in the same cycle.
For such cases, our earlier charge sharing based approach is directly applicable. However,
we still must deal with the issue of charge leakage. The proposed approach of coalescing
cache writes to address these two issues is equally applicable to single-ported and
multi-ported caches.
3.3.2.1 Coalescing Cache Writes
In order to coalesce stores for clustering writes, we make use of a store
queue. Most modern microprocessors, in-order or out-of-order, employ a store
queue (SQ) to hold speculative stores before retiring them to the cache. For example, the
ARM Cortex A8 [33], which is a dual-issue superscalar in-order processor, employs a
4-entry store buffer that acts as storage for stores that have been issued into the pipeline
but have not yet reached the commit stage. Each entry of the SQ holds the data
corresponding to the store and its address. Furthermore, the Cortex A8 also uses the SQ to
merge stores going to the same cache line. In out-of-order processors the SQ is similarly
used to hold speculatively issued stores. We modify the control logic associated with the
SQ such that, when a store reaches the commit stage, we commit the store without
evacuating it from the SQ and sending it to the cache. This is accomplished by keeping a
Commit pointer, as shown in Figure 3.9.
Figure 3.9: Conventional vs. Proposed Store Queue
As shown in the figure, a conventional SQ is indexed with two pointers, Head and
Tail, whereas the proposed SQ is indexed with an additional pointer, Commit. In the
conventional SQ, the Tail pointer keeps track of where to write a new speculative store
into the SQ, whereas the Head pointer keeps track of where to retire a store from once it
has reached the commit stage and is therefore completed. In the proposed SQ, the Head
pointer is still used to keep track of the already committed stores. However, upon commit,
the store is not evacuated from the SQ; neither is SQ_num, which keeps track of the number
of occupied entries in the SQ, updated. SQ_num is updated, and the store retired, during a
clustered write phase when the committed SQ entries are sent to the cache. For this
purpose, in the proposed SQ, we use the Commit pointer, which tracks the stores that are
already committed but not yet retired to the cache. These stores are retired from the SQ to
the cache in clusters. In order to retire stores in a cluster we propose a small
modification to the cache controller. Note that many existing cache controller state
machines are capable of performing back-to-back, i.e., bulk, writes [32]. But before
looking at the state machine for clustered writes, let us address the issue of cache
coherency.
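A simplified behavioral model of the proposed SQ indexing, loosely based on the pseudocode of Figure 3.9, is sketched below; the structure and field names are ours, and wrap-around and full/empty checks are omitted for brevity.

/* Simplified model of the proposed store queue (SQ) indexing.  Tail is
 * advanced at issue, Head at commit, and Commit only during the clustered
 * write phase when committed stores are actually retired to the cache.
 * SQ_SIZE, the field names, and the omitted full/empty checks are
 * illustrative simplifications. */
#define SQ_SIZE 8

typedef struct { unsigned addr, data; } sq_entry_t;

typedef struct {
    sq_entry_t entry[SQ_SIZE];
    int head, tail, commit;  /* circular indices into entry[]          */
    int sq_num;              /* occupied entries (freed at retirement) */
    int sq_committed;        /* committed but not yet retired stores   */
} store_queue_t;

static void sq_issue(store_queue_t *q, unsigned addr, unsigned data)
{
    q->entry[q->tail] = (sq_entry_t){ addr, data };
    q->tail = (q->tail + 1) % SQ_SIZE;
    q->sq_num++;
}

static void sq_commit(store_queue_t *q)      /* store stays in the SQ */
{
    q->head = (q->head + 1) % SQ_SIZE;
    q->sq_committed++;                       /* sq_num is NOT decremented */
}

static sq_entry_t sq_retire_one(store_queue_t *q)  /* clustered write phase */
{
    sq_entry_t e = q->entry[q->commit];
    q->commit = (q->commit + 1) % SQ_SIZE;
    q->sq_committed--;
    q->sq_num--;
    return e;
}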
3.3.2.2 Cache Coherency
The fact that we keep the committed stores in the SQ instead of retiring them to the
cache raises concerns regarding cache coherency. We address this issue by performing a
cache tag lookup for the stores being committed at the commit stage, as shown in Figure
3.9. Hence, at the commit stage the directory is updated according to the cache coherency
protocol, just as it would be at the commit stage in the conventional SQ case. Once the tag
is looked up, the appropriate way ID is stored in the SQ so that, when the store is
eventually retired during the clustered write phase, no tag lookup is required. Note that
if the coherency protocol requires update (rather than invalidate) messages to be sent,
then the store data must be forwarded to the other processors; otherwise only the
invalidate message is sent.
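The commit-stage actions described above can be summarized with the following sketch; tag_lookup, send_invalidate, send_update, and drain_committed_stores are hypothetical helpers standing in for the cache controller and coherence machinery, not functions defined in this work.

#include <stdbool.h>

/* Hypothetical helpers standing in for the cache/coherence machinery. */
bool tag_lookup(unsigned addr, int *way_id);     /* true on hit, returns way */
void send_invalidate(unsigned addr);
void send_update(unsigned addr, unsigned data);
void drain_committed_stores(void);               /* clustered write phase    */

/* Sketch of the commit-stage handling of a store under the proposed scheme:
 * the directory is updated at commit (as with the conventional SQ), the hit
 * way is remembered so no lookup is needed at retirement, and a miss forces
 * the SQ to be drained of committed stores before the line is brought in. */
void commit_store(unsigned addr, unsigned data, bool update_protocol,
                  int *way_id_out)
{
    int way = -1;
    if (!tag_lookup(addr, &way)) {
        drain_committed_stores();  /* retire older committed stores in order */
        /* ...then handle the miss: fetch the line into the cache            */
    }
    if (update_protocol)
        send_update(addr, data);   /* forward store data to other processors */
    else
        send_invalidate(addr);
    *way_id_out = way;             /* cached in the SQ entry for retirement  */
}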
If the tag lookup of a store results in a miss, then we must drain the SQ of all the
committed stores, i.e., all stores between the Commit and Head pointers. Such an action is
required for the following reason: if the store misses in the cache when the tag lookup is
performed at commit, then the corresponding line must be brought into the cache while an
invalidate message is sent to the other processors. However, we cannot bring the
corresponding line into the cache until all the earlier committed stores have been retired
to the cache, since stores must be retired in order. Stores must be retired in order
because, among other reasons, the store being committed, which has missed in the cache,
could replace a cache line occupied by an earlier store that has been committed but is
still in the SQ waiting to be retired. If the line targeted by such a pending store is
replaced, then the stale data corresponding to that store may be evacuated to the next
level of memory. Hence we drain the SQ of committed stores whenever a tag lookup results
in a miss.
A similar situation exists for a load miss. If a load misses in the cache and replaces a
line that was previously targeted by a committed-but-not-yet-retired store in the SQ, then
that cache line (the one replaced by the load miss) will be evacuated to the next level of
memory. This evacuated line would be stale, since the latest data corresponding to it is
still in the SQ waiting to be retired. To address this issue, once again, we resort to
draining the SQ of committed stores in the event of a load miss.
We must also drain the SQ of committed stores whenever an invalidate request is
received. Hence, to take care of cache coherency issues, our policy dictates that we drain
the SQ of all the committed stores in the event of a load miss, a tag lookup miss for a
store being committed, or an invalidate request. Now we can look at the modifications to
the cache controller state machine that retire stores to the cache in clusters.
3.3.2.3 Clustered Writes
As mentioned earlier, we drain the SQ of committed stores whenever a cache miss is
observed or an invalidate request is received. Furthermore, we must also drain the SQ of
committed stores whenever the SQ becomes full. Hence, upon any of these events, the
clustered write phase is activated. Figure 3.10 shows the corresponding state machine.
Figure 3.10: State machine for clustered writes
As shown in the figure, the state changes from INIT to Retire upon receiving an
invalidate, a cache miss, or an SQ_full event. Upon entering the Retire state, charge
sharing based writes are enabled by setting CS_EN=1. Note that the very first write of the
cluster is performed with CS_EN=0. This, however, is the desired behavior, since the first
write of the cluster cannot take advantage of any charge sharing anyway; there was no
previous write to begin with. We stay in the Retire state until we reach the last committed
store. Upon reaching the last committed store we disable charge sharing by setting
CS_EN=0. This takes effect only in the next cycle, as we move back to the INIT state.
Thus, in the next cycle, when we are back in the INIT state, the cache is ready for regular
read/write operation.
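A cycle-level sketch of this controller is given below; it treats CS_EN as a registered output so that a value computed in one cycle applies to the write issued in the next, which reproduces the behavior described above. The type names and the assumption that the store queue model supplies the event flags are ours.

#include <stdbool.h>

/* Sketch of the clustered-write controller of Figure 3.10.  One call models
 * one cycle.  Because cs_en is updated after the retirement action, the write
 * issued in a given RETIRE cycle uses the cs_en value from the previous
 * cycle, so the first write of a cluster still goes out with CS_EN=0 and
 * CS_EN returns to 0 once the controller is back in INIT. */
typedef enum { INIT, RETIRE } cw_state_t;

typedef struct {
    cw_state_t state;
    bool       cs_en;        /* registered CS_EN driving the CR-cache */
    int        sq_committed; /* committed-but-not-retired stores      */
} cw_ctrl_t;

static void clustered_write_cycle(cw_ctrl_t *c,
                                  bool invalidate, bool cache_miss, bool sq_full)
{
    if (c->state == INIT) {
        c->cs_en = false;
        if ((invalidate || cache_miss || sq_full) && c->sq_committed != 0)
            c->state = RETIRE;
    } else { /* RETIRE: retire SQ[Commit] to the cache using the current cs_en */
        /* retire_store(SQ[Commit]); Commit++, SQ_num--  (done in the SQ model) */
        c->cs_en = (c->sq_committed != 1); /* cleared once the last store is reached */
        c->sq_committed--;
        if (c->sq_committed == 0)
            c->state = INIT;
    }
}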
As mentioned before, most cache controllers already support such bulk writes. Hence, the
hardware complexity is increased only marginally. The additional support needed includes
the Commit pointer and the SQ_committed counter, along with a marginal increase in the
control logic of the state machine that already supports bulk writes.
3.4 Experimental Setup and Results
To assess the impact of the proposed charge recycling scheme on cache power
dissipation, we must quantify the energy savings of the proposed approach at different
levels of the system. Figure 3.11 shows the approach undertaken in this chapter. As shown
in the figure, an Hspice implementation and simulation of the proposed approach is used to
obtain the write energy savings due to bit-line swing reduction. As a next step, we use
CACTI [36] to obtain the cache power consumption of the different components for
different cache configurations and thus obtain the contribution of bit-line swing to cache
write operation power. From these we obtain the net power savings per write operation for
a given cache configuration under the charge sharing enabled mode of cache writes for the
CR-cache. This information is fed to the Simplescalar [37] simulation environment, which
implements our architectural modifications to perform late retirement of stores in
clustered writes, in order to assess the cache power reduction. As a result of this
architectural simulation, we obtain the net power savings in caches due to the proposed
CR-cache. Thus, we analyze the circuit level and architecture level implications of our
proposed approach in terms of both energy and performance.
Figure 3.11: Experimental Methodology
Accordingly, we divide the results section into circuit simulations, cache modeling,
and architectural simulations, corresponding to the three blocks shown in Figure 3.11.
3.4.1 Circuit Simulations
We implemented an Hspice netlist of the CR-cache architecture of Figure 3.1 for
accurate circuit level modeling of the delay and power overhead due to the supporting
circuitry for charge recycling. We implemented a single bank as an SRAM block of
1024x64 bits, i.e., 1024 cache lines with each line being 64 bits (8 bytes) wide. This bank
size was inspired by the cache architecture of the ARM Cortex A8 [33], which, for a 4-way
32KB cache, employs 8 banks with each bank having 1024 lines that are 4 bytes (32 bits)
wide; two such banks together form one way. The baseline SRAM block implementation
mimicked the operation of only one row, one column, and one memory cell, while the rest of
the components of the SRAM block, i.e., the decoder, sense amplifiers, etc., were
implemented. The CR-cache implementation of this SRAM block additionally included the
supporting circuitry of Figure 3.1. Thus, we had two implementations of the 1024x64-bit
SRAM block for comparison purposes: the baseline SRAM block and the CR-cache based SRAM
block.
3.4.1.1 Write Operation
We simulated the netlist with the 65nm PTM [75] models and carried out simulations for
the baseline SRAM and the CR-cache based SRAM block. For the baseline SRAM block we
measured the current dissipation due to bit-line swing, leaving the rest of the components,
i.e., decoder, word-line, and sense amplifier, untouched. For the CR-cache SRAM block we
measured the current dissipation due to bit-line swing as well as the support circuitry.
Figure 3.12: Write operation in baseline SRAM block
Figure 3.12 shows the write operation in the baseline SRAM block for writing a '1'.
Note that in the baseline SRAM block the behavior remains the same regardless of the
value being written. As shown in the figure, we pre-charge the bit-lines at the end of the
write operation.
Figure 3.13: Write operation in CR-cache SRAM block
The plot of Figure 3.8, presented earlier and reproduced in Figure 3.13, shows the
write operation in the CR-cache SRAM block where the previously written value is '0' and
the new value is '1', i.e., the bit-lines flip. This case is called Flip. Note that the
bit-line level at the beginning of the write operation is slightly below 1V. This accounts
for the fact that, in the worst case when we write different values in consecutive write
operations, the bit-line level at the end of a write operation in the CR-cache is 0.97V.
When the previous and current write values are the same, no charge sharing takes place and
the current due to bit-line swing, which is very small, is negligible. This case is called
Nonflip. Furthermore, we also analyzed the write operation when the CR-cache is used in
non charge sharing mode, i.e., CS_EN=0, which shows a negligible increase in power.
Figure 3.14 shows the current curves for the baseline (original) SRAM block and the
CR-cache SRAM block. For the baseline SRAM block, the current curve corresponds to the
current due to bit-line swing. For the CR-cache SRAM block, the current curves correspond
to the current in the bit-line swing plus the supporting circuitry. The three different
current curves of the CR-cache SRAM block correspond to the Flip, Nonflip and CS_EN=0
modes of operation. As shown in the figure, for Flip the resulting current is smaller, and
for Nonflip significantly smaller, than the current in the baseline SRAM block, while under
CS_EN=0 mode the current dissipation in the CR-cache SRAM block is marginally higher than
in the baseline SRAM block.
Figure 3.14: Comparison of current curves between original
SRAM and CR-cache SRAM
Table 3.2 summarizes the net energy savings due to bit-line swing reduction
compared to the baseline SRAM. Note that under the CS_EN=0 mode of operation the CR-cache
consumes 1.65% and 0.20% more power for the Flip and Nonflip cases, respectively.
Table 3.2: Energy Savings of CR-cache

Mode of Operation   Case      % Energy Savings
CS_EN=1             Flip      37.27
CS_EN=1             Nonflip   78.67
CS_EN=0             Flip      -1.65
CS_EN=0             Nonflip   -0.20
3.4.1.2 Impact on Read Operation
It is important to understand the impact of the proposed circuit optimization for write
operation power reduction on read operations, in terms of both energy and delay. For this
purpose we carried out Hspice simulations for read operations, similar to those for write
operations. With respect to energy consumption, we make the following observation. The
only supporting circuitry that switches during a read operation in the CR-cache is the pair
of CGate blocks; the rest of the supporting circuitry experiences switching activity only
during a write operation. These CGate blocks in the CR-cache are used to drive the
pre-charging transistors, whereas in the baseline SRAM, inverters drive the pre-charging
transistors. However, a CGate consumes more power than an inverter. Accordingly, our
Hspice simulations for read operations show that the CR-cache consumes, accounting for
this overhead, 4% more power in bit-line swing compared to the baseline SRAM. Note,
however, that in read operations the bit-line swing is limited (~150mV) compared to the
full swing in write operations. We observed that the power consumption in bit-line swing
for reads is about one order of magnitude lower than that for writes. Furthermore, the
contribution of bit-line swing power to the total read operation power is also negligible,
as shown in the next section. Thus the resulting increase in the read operation power is
less than 1%.
Because the CGate blocks of the CR-cache are more complex than a simple inverter (a
CGate has two NMOS and two PMOS transistors connected in series in its pull-down and
pull-up paths, cf. Figure 3.5), they increase the delay of turning the pre-charging
transistors ON and OFF. Although this could be addressed through transistor sizing, we
sized the transistors in the CGate such that their energy overhead was bounded to a small
value, as mentioned earlier. This resulted in an increased delay in turning the
pre-charging transistors ON and OFF. Figure 3.15 depicts a falling transition of the
BL_charge signal, in both the baseline SRAM and the CR-cache, which turns the PMOS
pre-charging transistor ON. The figure also shows the delay in pre-charging the bit-line to
0.99V in both the baseline SRAM and the CR-cache. As shown in the figure, this delay is
increased by 3.69% due to the delayed transition of the BL_charge signal. However, when
this increase was accounted for in the total read operation delay of our SRAM bank, which
includes the decode delay, the sense amplifier delay, and the delay in establishing the
differential voltage across the bit-lines, the delay overhead reduced to a mere 1%.
Figure 3.15: Increase in bit-line pre-charging delay
Thus, the proposed CR-cache does not have any substantial impact, in terms of
energy and delay, on the read operations.
3.4.2 Cache Modeling
In the previous section we presented the effects of the CR-cache on bit-line energy
consumption. However, to understand the impact of the proposed charge sharing based write
energy optimization technique, we must understand the contribution of bit-line energy
consumption to the total write power consumption. Since state-of-the-art caches today are
multi-bank modules in which data and addresses are routed to and from various banks, many
other components, such as the output load drivers and the address and data routing to the
banks, along with the bit-lines, contribute substantially to the cache energy consumption.
Therefore, in order to understand the contribution of the various components to cache
energy consumption, we made use of the cache modeling tool CACTI 6.5 [36] at the 65nm
technology node.
Table 3.3: Cache configurations

Configuration   Size (KB)   Block Size (bytes)
A               32          8
B               32          16
C               64          8
D               64          16
CACTI 6.5 models a cache as a set of banks arranged in a hierarchical fashion, with data
and addresses routed to the banks in an H-tree. Even though CACTI iterates over many
different cache configurations to find the best choice for a given set of criteria, part of
the bank topology was set manually by us. As shown in Table 3.3, we profiled the energy
consumption of four different cache configurations. For each of these configurations we
experimented with associativities of 4 and 2 and correspondingly set the number of banks to
4 and 2. Each of these banks is further divided into subbanks; for the sake of brevity we
omit those details.
Figure 3.16 shows the write energy consumption of the various components across
the four cache configurations for 4- and 2-way set associativity. As shown in the figure,
the write energy is divided into the energy consumed by four major blocks, namely the
word-lines, bit-lines, decoder and H-tree. The figure also shows, on the secondary Y-axis,
the percentage of the write energy contributed by the bit-lines. Note that the proposed
charge sharing technique saves energy on exactly this bit-line fraction of the write
energy. As shown in the figure, when the associativity is high (4), and therefore the
number of banks is high, the write energy is dominated by the H-tree and the bit-lines, in
that order, with the bit-lines contributing 33.63% on average. However, with the lower
associativity of 2, the contribution of the H-tree drops drastically and the bit-line
contribution therefore increases to an average of 85.85%.
Figure 3.16: Cache write operation energy
Figure 3.17 shows a comparison of read/write energy, with the percentage contribution of
the bit-lines to read energy on the secondary Y-axis. As shown in the figure, the bit-line
contribution to read energy is below 8% for the 4-way cache configurations and below 17%
for the 2-way cache configurations. In either case, the impact of our optimizations and the
resulting overhead on read energy (merely 4%, as mentioned in the previous section) remains
negligible, below 1%.
Figure 3.17: Cache read and write energy comparison
Since most modern microprocessors employ a 4-way L1 cache ([33], [66]) with a typical L1
cache size of 32KB, we chose configuration A for our architectural simulations.
3.4.3 Architectural Simulations
To analyze the impact of the proposed CR-cache architecture on system performance
and overall cache power, we carried out architecture level simulation using Simplescalar.
We modeled our processor architecture on the ARM Cortex A8 [33], which is a dual-issue
in-order processor. Note that out-of-order processors generally employ dual/multi-ported
caches ([66], [39]) in order to retire one load and one store per cycle. For such caches
with separate read and write ports, our charge sharing based write power reduction
architecture from the previous chapter is applicable. In this chapter we focus on caches
with shared read/write ports, as employed in processors that support one load/store per
cycle. Table 3.4 shows the configuration parameters for the architectural simulations. We
experimented with different SQ sizes. Note that the SQ size directly affects the energy
savings, since the size of the clusters, and therefore the number of back-to-back writes in
the clustered write phase, depends directly on the SQ size. However, a larger SQ also
results in a higher performance penalty.
Table 3.4: Processor configuration parameters

Processor widths     In-order, with Fetch, Decode, Issue and Commit width of 2
SQ size              4/8/16
Caches               L1 I/D cache: 32KB, 4-way, 8-byte line size, 1-cycle hit latency
Memory latency       100/10 cycles
Branch predictor     Bimodal, table size 4096
BTB                  1024 sets, 2-way
Functional units     Integer ALUs: 2; Integer multipliers/dividers: 1
Apart from simulating the processor, we also have to model the data cache according
to the configurations considered in the cache modeling section, based on the results from
CACTI. Our cache model for the various configurations followed the Cortex A8 in terms of
the number of banks employed. Hence, Simplescalar was modified to precisely model the
D-cache in terms of the banks employed. Note that it is important to model the cache as a
multi-bank module for the following reason: two consecutive writes going to the cache can
make use of the CR-cache architecture only if those two writes go to the same bank, since
charge sharing between bit-lines can take place only if the consecutive writes fall into
the same bank (a sketch of this bank check is given below). Furthermore, we also modified
Simplescalar to implement our state machine (cf. Figure 3.10) for clustered writes such
that the corresponding performance penalty was accounted for. Moreover, we used both the
PISA and ARM ISAs. Note that ARM provides only 16 registers, so more register spills are
likely to occur, resulting in more loads and stores compared to PISA binaries. As for the
benchmarks, we used a few SPEC ([38]) benchmarks (bzip, gzip, gcc, mcf, vortex) and a few
MediaBench (0) benchmarks (cjpeg, djpeg, gsm).
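The same-bank check referenced above can be modeled as follows; the line-interleaved bank mapping and the constants are illustrative assumptions, not necessarily the exact mapping of the modeled cache.

#include <stdbool.h>

/* Sketch of the same-bank check used when retiring clustered writes: a store
 * can be a CS enabled write only if it goes to the same bank as the previous
 * write of the cluster.  The line-interleaved mapping and constants below are
 * assumptions for illustration. */
#define LINE_SIZE_BYTES 8
#define NUM_BANKS       8

static unsigned bank_of(unsigned addr)
{
    return (addr / LINE_SIZE_BYTES) % NUM_BANKS;
}

/* prev_bank holds the bank of the previous write of the current cluster,
 * or -1 at the start of a cluster. */
static bool is_cs_enabled_write(unsigned addr, int prev_bank)
{
    return prev_bank >= 0 && bank_of(addr) == (unsigned)prev_bank;
}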
The aforementioned simulation environment was used to analyze the impact of the
CR-cache architecture, and of parameters such as SQ size, on cache writes, cache energy,
and performance. Correspondingly, we divide the architectural simulation results into three
main subsections, namely impact on cache writes, impact on cache energy, and impact on
performance. This analysis was performed for cache configuration A with 4-way set
associativity. We also assess the impact of cache line size and associativity on the
CR-cache in a separate subsection.
3.4.3.1 Impact on Cache Writes
The efficiency of the proposed CR-cache architecture in reducing write energy depends
directly on the number of consecutive writes to a given cache bank. This number depends on
parameters such as the SQ size, cache line size, and cache associativity. Furthermore, it
also depends on the number of committed stores in the SQ when the transition to the Retire
state (cf. Figure 3.10) takes place. As shown in Figure 3.10, apart from the cache miss and
invalidate conditions, the SQ_full signal is also used for the transition. It is possible
that only a few of the stores in the SQ are committed when SQ_full is raised. Therefore, we
propose two different policies for transitioning to the Retire state: 1) SQ_full triggered,
using the SQ_full signal along with cache miss and invalidate as shown in Figure 3.10, and
2) all_committed triggered, using an all_committed signal (all the entries in the SQ are
committed) along with cache miss and invalidate. A sketch of the two trigger conditions is
given below.
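The following sketch expresses the two triggers; the flag and field names are ours, and the all_committed condition is interpreted here as a full SQ whose entries have all committed, which is our reading of the policy description.

#include <stdbool.h>

/* Sketch of the two Retire-state trigger policies.  SQ_LEN and the flag
 * names are illustrative; the surrounding simulator supplies the events. */
#define SQ_LEN 8

typedef struct {
    int  sq_num;        /* occupied SQ entries                            */
    int  sq_committed;  /* committed, not yet retired                     */
    bool cache_miss;    /* load miss or store-commit tag-lookup miss      */
    bool invalidate;    /* incoming invalidate request                    */
} sq_status_t;

/* Policy 1: SQ_full triggered (as in Figure 3.10). */
static bool retire_trigger_sq_full(const sq_status_t *s)
{
    bool sq_full = (s->sq_num == SQ_LEN);
    return (s->invalidate || s->cache_miss || sq_full) && s->sq_committed != 0;
}

/* Policy 2: all_committed triggered, interpreted as every SQ entry committed. */
static bool retire_trigger_all_committed(const sq_status_t *s)
{
    bool all_committed = (s->sq_committed == SQ_LEN);
    return (s->invalidate || s->cache_miss || all_committed) && s->sq_committed != 0;
}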
Figure 3.18: Impact of SQ size on CS enabled writes
Figure 3.18 shows the impact of the SQ size on the number of consecutive writes, i.e.,
writes under CS_EN=1, under the SQ_full triggered policy for the L1 data cache. We call
such writes charge sharing (CS) enabled writes. The Y-axis of the plot shows the CS enabled
writes as a percentage of total writes. As shown in the figure, increasing the SQ size
increases the fraction of CS enabled writes, which results in higher energy savings. On
average we get 18.86, 41.53, and 59.02% CS enabled writes for SQ sizes of 4, 8 and 16,
respectively. However, with a larger SQ size the performance penalty also increases, as
shown later.
The lazy retirement of stores, by keeping them in the SQ until the SQ is full, also
increases the number of loads getting their data forwarded from stores. This in turn
reduces the number of cache accesses due to loads, as shown in Figure 3.19, with average
reductions of 9.33, 15.14 and 21.26% for the three SQ sizes under the SQ_full triggered
policy.
Figure 3.19: Reduction in cache read accesses due to loads
To understand the impact on CS enabled writes of further delaying the transition to the
Retire state under the all_committed triggered policy, we used an SQ size of 8. The
comparison between the two policies is shown in Figure 3.20. The figure shows that, under
all_committed triggered, the percentage of CS enabled writes increases (bar graphs), albeit
only marginally, with an average increase of 2.58%. The percentage reduction in cache read
accesses (line graphs) is lower in some cases under all_committed triggered. This can be
explained as follows: fewer loads get their data forwarded from the SQ under all_committed
triggered compared to SQ_full triggered, since under all_committed triggered we wait until
all the entries in the SQ are committed before transitioning to the Retire state.
Therefore, the all_committed triggered policy blocks the pipeline for a longer period,
preventing the issue of loads that could have had their data forwarded from the SQ. This
analysis shows that, for our system configuration, all_committed triggered does not add any
significant advantage over SQ_full triggered.
Figure 3.20: Comparison between two retirement policies
3.4.3.2 Impact on Cache Energy Dissipation
Using the cache energy model derived earlier with the help of CACTI, we analyzed
the impact of the CR-cache on L1 data cache energy dissipation under the SQ_full
triggered policy.
Figure 3.21 shows the reduction in cache write energy, as a result of bit-line swing
reduction, and in cache read energy, as a result of reduced cache read accesses, for
different SQ sizes. As shown in the figure, for the benchmarks bzip, gzip, and gcc of PISA
and bzip and mcf of ARM, the percentage reduction in read energy dominates the reduction in
write energy, whereas for cjpeg and vortex of PISA and gzip of ARM the write and read
energy reductions are comparable. For djpeg of PISA and djpeg and gsm of ARM, the
percentage write energy reduction dominates the read energy reduction. Thus the behavior,
in terms of energy reduction due to writes and due to reads, varies across these
benchmarks. On average, for SQ sizes of 4, 8 and 16, write energy is reduced by 4.03, 9.13
and 13.10%, and read energy is reduced by 9.19, 15.01, and 21.14%, respectively.
Figure 3.21: Read and Write Energy Reduction
Figure 3.22 shows the net cache energy reduction as a weighted combination of the read
and write energy reductions. On average the CR-cache saves 7.29, 12.68 and 17.53% energy
for SQ sizes of 4, 8 and 16, respectively. However, the case of djpeg stands out with very
low energy savings across all three SQ sizes, averaging about 3%. This is because 1) djpeg
has a comparatively small ratio of writes to reads going to the cache (0.3 compared to the
average of 0.6), and many of the reads can also be satisfied from the line buffers, and 2)
our late retirement of stores does not help reduce cache read accesses for djpeg (cf.
Figure 3.19 and Figure 3.21). For the other benchmarks, however, the percentage energy
savings scales well.
Figure 3.22: Net Energy Savings in L1 Data Cache
Note that the energy savings of the CR-cache architecture largely depend on the SQ size
and the resulting percentage of CS enabled writes. Given this dependency, we wanted to
answer the following question: what if the percentage of CS enabled writes is very small,
or, even worse, what if there are no CS enabled writes at all? In this scenario we are not
reaping the benefits of the CR-cache architecture, yet we still pay its overhead.
Therefore, it is important to understand the effect of such a scenario on cache energy.
Given that our overhead is quite small, we observed that, if we employ the CR-cache without
any CS enabled writes, the cache energy increases by an average of about 0.12%. There are
two reasons for such a low overhead: 1) the overhead for read operations is fairly low, as
mentioned in Section 3.4.1.2, and merely 0.15% for our cache configuration A; 2) for write
operations the overhead in the bit-lines is only 0.2% for the Nonflip case and 1.65% for
the Flip case. Overall, very few bits flip, as expected, since typically only a few lower
order bits in a word change while the rest remain unchanged.
Figure 3.23: Percentage of Writes Needed for Breakeven
This raises the question: what happens to write energy if a large number of bits in a
word flip compared to the number of bits remaining unchanged? Note that, even in the
extreme case when all the bits flip all the time, the write energy due to the bit-lines
increases only by 1.65%. Even in this extreme case, the cache energy increase due to the
CR-cache with no CS enabled writes was found to be below 1% in our simulations, with an
average of 0.28%. Furthermore, we carried out an analytical assessment of the percentage of
CS enabled writes needed to break even in bit-line energy during write operations,
accounting for the overhead presented in Table 3.2. The results are presented in Figure
3.23 for different ratios of bit flips to bit non-flips. This ratio, α, is the number of
bits flipping in a word divided by the number of bits remaining unchanged. As shown in the
figure, for α < 0.5 the required percentage of CS enabled writes to break even is below
1.5%. This percentage increases with α, but even at α=0.9 the required percentage to break
even is below 4%. Thus, even in the extreme case, the percentage of CS enabled writes
needed to break even is fairly small.
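As a rough illustration of this kind of breakeven accounting, the sketch below computes the fraction of writes that must be CS enabled for the bit-line savings to offset the CS_EN=0 overhead, using the percentages of Table 3.2; how α enters the calculation here is our assumption, and the exact model behind Figure 3.23 may differ, so the printed numbers are indicative only.

#include <stdio.h>

/* Rough breakeven sketch: f is the fraction of writes that are CS enabled.
 * CS enabled writes save energy (CS_EN=1 rows of Table 3.2), while the
 * remaining writes pay the CS_EN=0 overhead.  alpha is the ratio of flipping
 * to non-flipping bits, so p = alpha/(1+alpha) is the flipping fraction.
 * This weighting of Table 3.2 is an assumption made for illustration. */
static double breakeven_fraction(double alpha)
{
    double p        = alpha / (1.0 + alpha);
    double saving   = p * 37.27 + (1.0 - p) * 78.67; /* % saved per CS write   */
    double overhead = p * 1.65  + (1.0 - p) * 0.20;  /* % lost per other write */
    return overhead / (saving + overhead);           /* f where net change = 0 */
}

int main(void)
{
    for (int i = 1; i <= 9; i += 2) {
        double a = i / 10.0;
        printf("alpha = %.1f -> breakeven at %.2f%% CS enabled writes\n",
               a, 100.0 * breakeven_fraction(a));
    }
    return 0;
}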
3.4.3.3 Impact on Performance
The energy savings in the data cache presented earlier come at the cost of performance,
since the CR-cache architecture requires locking the data cache while the SQ is being
drained in the Retire state. This may result in a performance penalty due to processor
stalls. Note that our in-order processor has to stall if 1) a load in the pipeline cannot
get its data forwarded from the SQ, and 2) the data cache is locked to drain the SQ. Under
such conditions, the processor stalls until the SQ is drained of all the committed stores
and the controller returns to the INIT state. Note that the fraction of time the processor
is stalled depends on the SQ size and the policy used to transition to the Retire state.
Figure 3.24 shows the percentage performance penalty for different SQ sizes under the
SQ_full triggered policy. As shown in the figure, the performance penalty stays below 4, 5
and 6% for SQ sizes of 4, 8 and 16, respectively, with corresponding average penalties of
2.04, 2.52 and 3.31%. As expected, the performance penalty increases as we increase the SQ
size. Note that djpeg has a relatively large performance penalty for ARM. Since, as
mentioned earlier, not many loads in djpeg get their data forwarded from the SQ, it incurs
a larger performance penalty without yielding significant energy gains. The other
benchmarks, however, show a better trade-off between energy savings and performance
penalty.
Figure 3.24: Performance Penalty
For the sake of completeness, to understand the impact on performance of the different
policies for transitioning to the Retire state, we compared the performance penalty of the
SQ_full triggered and all_committed triggered policies. The results are reported in Figure
3.25, which corresponds to an SQ size of 8. As shown in the figure, due to its longer stall
periods, all_committed triggered results in a higher performance penalty.
Figure 3.25: Comparison of performance penalty between the two policies
3.4.3.4 Impact of Line Size and Associativity
To assess the impact of cache line size (LS) and associativity on the CR-cache, we
fixed the SQ size at 8 and varied the cache line size from 8 to 16 bytes and the cache
associativity from 4 to 2, i.e., cache configurations A and B of Table 3.3. The
corresponding results for CS enabled writes and reduction in cache read accesses are shown
in Figure 3.26. As shown in the figure, on the primary Y-axis with bar graphs, the
percentage of CS enabled writes decreases as we increase the line size from 8 (Assoc_4
LS_8) to 16 (Assoc_4 LS_16). This is because each cache line now contains 4 words instead
of 2, and hence the probability of two writes going to the same set of bit-lines, i.e., the
same word of a cache line, is reduced. Furthermore, as we move from associativity 4 to 2,
the percentage of CS enabled writes increases due to the lower associativity and hence more
writes going to the same way.
Figure 3.26: Impact of cache line size and associativity on CS enabled writes
and cache read accesses
With respect to cache read accesses, as shown on the secondary Y-axis of Figure 3.26
with line graphs, as we increase the line size to 16 the percentage reduction in cache read
accesses also drops. This is because many reads are satisfied from the line buffer itself
owing to the larger line size. Associativity, however, does not show any strong correlation
with the reduction in cache read accesses: across associativities of 4 and 2 for the same
line size, the percentage reduction in cache read accesses remains roughly the same.
Figure 3.27: Impact of cache line size and associativity on cache
energy and performance
The impact of varying the cache line size and associativity on cache energy and
performance is shown in Figure 3.27, with the percentage cache energy reduction on the
primary Y-axis and the percentage performance penalty on the secondary Y-axis. As we
increase the line size, the percentage reduction in energy decreases, as expected, since
the percentage of CS enabled writes also decreases (cf. Figure 3.26). Furthermore, with
respect to performance, as the line size increases the performance penalty decreases for
the same reason. However, as the associativity decreases the percentage reduction in energy
increases since, besides the increased percentage of CS enabled writes, the contribution of
the bit-lines to read/write energy increases, and with it the contribution of the write
energy reduction to the overall cache energy reduction.
3.5 Summary
We observed in this chapter that caches present a unique challenge in employing
charge recycling for write power reduction due to their shared read/write ports. Because of
this sharing, the charge status on the bit-lines may need to be destroyed in order for the
cache to be ready for a read operation. To address this challenge, we proposed novel
charge recycling circuitry for such caches that can dynamically switch between a charge
recycling based write mode and regular cache access mode. However, this circuitry alone
does not suffice to exploit charge recycling, since doing so requires back-to-back writes
that can recycle the charge left on the bit-lines by the previous write. Thus, to generate
back-to-back writes, we proposed architectural modifications that delay store retirement by
keeping stores in the store queue after they have been committed. Using this delayed store
retirement, we retire stores in clusters and modify the cache controller state machine so
that it activates the charge recycling based write mode and sends these back-to-back stores
to the cache. We showed, by means of an extensive experimental setup and results, that this
confluence of circuit and architectural energy minimization techniques yields substantial
benefits in cache energy reduction with a marginal performance penalty.
Chapter 4. System Level Energy Optimization:
Resource Allocation in Hosting Centers
4.1 Introduction
So far, in the previous two chapters, we focused on design time solutions for energy
minimization by modifying the circuits as well as the architecture of on-chip hardware
resources, such as memories and off-chip bus controllers. These approaches exploited low
level behavior of the system, e.g., bit-lines swinging in opposite directions during
consecutive write operations in SRAMs at the circuit level, and the use of store buffers
for coalescing writes through late store retirement at the architecture level. However,
approaches at these levels fail to exploit knowledge about the system available at higher
levels, e.g., the software stack running on a machine and/or applications from a particular
domain. Such knowledge about the system level behavior of applications is hard to
incorporate into the lower level design of general purpose processors/machines, since such
machines are supposed to support a wide range of applications efficiently rather than cater
to a selected few whose behavior is well characterized. However, the behavior of different
applications, in terms of performance and energy, can vary widely across machines that
differ in the architecture, speed, technology, and power of their processors, memory,
disks, etc. These differences in performance and energy become particularly important in
the wide scale deployment of such units, e.g., servers, in data center environments, due to
their
prohibitively increasing power consumption. To put the energy consumption of data
centers in perspective, the peak power demand from all existing data centers in the USA is
estimated at 7 Gigawatts, and the power consumption of data centers will soon match or
exceed that of many other energy-intensive industries such as air transportation. Thus, as
the presence and usage of online services, such as instant messaging, online gaming, email,
software-as-a-service (SaaS), e-commerce, and social networking, proliferate, the issue of
energy efficient resource allocation, which must account for compute power, routing power,
cooling power, the power delivery network, etc., becomes increasingly important.
However, conservatively allocating resources to meet Service Level Agreements (SLAs)
can lead to over-provisioning, and it is known that a significant fraction of datacenter
power consumption is due to resource over-provisioning. A recent EPA report predicts that
if datacenter resources are managed with state-of-the-art solutions, the power consumption
in 2011 can be reduced from 10 Gigawatts to below 5 Gigawatts [35]. But such solutions
require perfectly provisioned servers in data centers, which means allocating only the
absolute minimum resources for completing the tasks while meeting the specified SLAs.
Perfect provisioning, however, is difficult to achieve due to 1) resource heterogeneity and
2) lack of energy proportionality.
Resource Heterogeneity: Datacenter resources become heterogeneous over time, even if a
datacenter is initially provisioned with homogeneous resources. For instance, replacing
non-operational servers or adding a new rack of servers to accommodate demand typically
leads to installing new servers that reflect the current state of the art. In this research
we focus on performance heterogeneity, where servers differ in CPU speed, memory, and disk
capacity, which also leads to heterogeneity in power consumption. Heterogeneity makes
perfect provisioning at the datacenter level difficult, since different tasks may occupy
different amounts of resources across different servers with varying energy profiles.
Energy Proportionality: Energy proportionality is the notion that the energy consumed by
a resource should scale linearly with its utilization. Hence, a perfectly energy
proportional server consumes zero power at zero utilization, and its power consumption
increases linearly with utilization. However, servers today typically consume 80% of their
peak power even at 20% utilization [2]. The consequence of this lack of proportionality is
that, when a task is assigned to a server, the energy cost of completing that task depends
on the resulting server utilization. Hence, it is insufficient to consider the energy cost
of operating a datacenter based solely on the number of active servers; it is critical to
consider the energy cost as a function of server utilization.
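To make the lack of proportionality concrete, the following sketch models server power as a linear interpolation between idle and peak power; the wattage values and the idle fraction in main() are illustrative assumptions about a hypothetical server, not measured data.

#include <stdio.h>

/* Simple (non-proportional) server power model: power grows linearly from the
 * idle power at 0% utilization to the peak power at 100% utilization.  A
 * perfectly energy proportional server would have p_idle = 0.  With
 * p_idle = 0.75 * p_peak, this model consumes about 80% of peak power at 20%
 * utilization, roughly matching the behavior cited above. */
static double server_power(double utilization, double p_idle, double p_peak)
{
    return p_idle + (p_peak - p_idle) * utilization;  /* utilization in [0,1] */
}

int main(void)
{
    double p_peak = 300.0;           /* watts, illustrative              */
    double p_idle = 0.75 * p_peak;   /* typical non-proportional server  */

    /* Energy cost per unit of delivered work rises sharply at low utilization. */
    for (int u = 10; u <= 100; u += 30) {
        double p = server_power(u / 100.0, p_idle, p_peak);
        printf("util %3d%%: %.0f W total, %.0f W per unit of delivered work\n",
               u, p, p / (u / 100.0));
    }
    return 0;
}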
In order to account for these two issues, and to address energy minimization as well as
profit maximization in large scale systems such as data centers, in this work we present a
network flow based resource allocation framework. We target our research toward hosting
centers in particular, which are an important incarnation of datacenters. Hosting centers
provide compute services to clients, such as small business owners (e.g., small banks),
e-commerce services, and customer support services, that need the computing capabilities of
a datacenter but do not have the wherewithal to operate one. Hosting centers may be
organized in multiple ways; the particular organization used in this research is shown in
Figure 4.1.
As shown in the figure, requests from multiple hosting center clients are first directed
to a load distributor. The load distributor is connected to pools of servers, where all
servers in a pool are homogeneous, but the pools themselves are heterogeneous. Server pool
heterogeneity is hidden from the clients by the load distributor. Clients' requests are in
turn generated by each client's own end-user requests. For instance, an end user may
generate a browsing/purchasing transaction to an e-commerce client, which is in turn routed
to the hosting center. The hosting center operator and its clients are bound by SLAs,
whereby the hosting center guarantees a minimum level of service. Each client can define
its own quality of service requirement, such as a bound on the response time of a request,
or on request throughput. When the hosting center operator fails to meet an SLA, it may
even have to pay a penalty to the client. It is the job of the load distributor to allocate
resources to satisfy clients' tasks while making sure SLAs are met. Conservative resource
allocation to meet SLAs can lead to over-provisioning, which increases energy consumption
and thereby reduces profit. On the other hand, under-provisioning of resources may lead to
profit loss when the operator pays penalties for missing SLAs or loses customers.
Figure 4.1: Hosting Center Architecture.
The goal of this research is to provide a unified framework to maximize profit under a
wide range of SLAs by allocating resources optimally in the presence of server
heterogeneity and lack of energy proportionality. We formulate this optimization problem
using generalized networks and present a Network Flow based Resource Allocation
framework, called NFRA, where nodes in the network represent server pools and clients
and the flow from server pools to clients represents resource allocation [71].
4.2 Prior Work
Due to the increasing energy and cooling costs of hosting centers and their ensuing
environmental impact, many previous works have focused on energy management in hosting
centers. Chase et al. [15] proposed an economy based approach for energy optimization that
monitors the dynamic variation in the workload and accordingly assigns servers so as to
minimize energy while meeting SLAs. In [21] the authors proposed various independent and
coordinated workload dependent DVFS techniques to minimize energy in server clusters. In
[16] the authors proposed a queuing and control theory based, performance constrained
energy management framework for homogeneous server clusters; their approach exploits DVFS
for energy minimization while accounting for the wear-and-tear costs associated with
turning components ON and OFF. Rusu et al. [79] proposed a QoS aware technique that
dynamically reconfigures a set of heterogeneous clusters to reduce energy during periods of
reduced load. In [77] the authors propose a power aware request distribution technique that
takes into account the startup and shutdown delays of the servers; request distribution
takes place at the load balancing front end, and energy management is performed at the
servers. To account for workload variation, the authors in [78] proposed an approach that
exploits long-time-scale variation in the workload in order to reduce the resource
requirement and improve energy. In [30] Heath et al. proposed a heterogeneous server
cluster design for throughput constrained energy optimization. In another line of research
([13], [29], [99]) the authors proposed energy management techniques for disk power in
servers using multispeed disks, dynamic speed control, and storage cache management.
In more recent work, Raghavendra et al. 0 proposed a coordinated energy management
approach that integrates various independent policies working at different levels of the
data center hierarchy. They show that policy decisions made by an approach at a certain
level of the energy management hierarchy can produce results that conflict with the
decisions made by an approach at another level; their framework attempts to synchronize the
decisions made at the various levels. Exploiting a virtualized environment, the authors in
[57] proposed a power minimizing data center architecture employing comprehensive online
monitoring, live virtual machine (VM) migration, and VM placement optimization. In [27],
countering the over-provisioning of resources, Govindan et al. proposed a power management
approach that under-provisions the resources, overbooks the power needs of the hosted
workloads, and subsequently distributes the power flexibly among the hosted workloads.
Along with energy management in the hosting center, profit optimization has also been a concern addressed by many previous works. In [88] the authors proposed to profile applications in terms of their resource needs and then employ
resource overbooking to maximize the generated revenue while providing performance guarantees. Bennani et al. [6] proposed an analytical queuing model and a combinatorial search based approach to resource allocation that optimizes some utility function, e.g., throughput or response time. In [58] the authors proposed an approach to maximize the total profit considering the prices and penalties paid by the clients hosted on the hosting center, subject to SLAs. They used the fixed point iteration method to converge to the optimal solution by solving separable concave resource allocation problems, assuming an increasing, concave and differentiable price function. In another profit maximizing approach, Zhang et al. [98] proposed a tabu search based heuristic that tries to maximize the total profit while considering the operational cost of the servers. They assumed that the price paid by clients is a step function of the response time and that the operational cost of an ON server is fixed.
Most of these previous works fail to take into account both server heterogeneity and non energy proportionality. Applications' behavior can vary widely and scale in a dissimilar fashion due to server heterogeneity, while non energy proportionality can affect the energy cost of performing a task. In this work we address both of these issues. Here we highlight the research contribution of this work with respect to previous works, and particularly in comparison to two closely related previous works, namely [58] and [98]:
A.) NFRA takes into consideration server heterogeneity. In particular, it considers
how performance and energy costs scale non-uniformly for different clients across
different servers. For example, consider two heterogeneous servers, A and B, and
two clients, X and Y. For client X, server A is able to process 100
requests/second, while server B is able to process 150 requests/second. Whereas
for client Y, A is able to process 100 requests/second, while B is able to process
200 requests/second. Thus the scaling of performance across these two servers is
not the same for the two clients. Such heterogeneity exists in energy consumption
as well. In contrast, experimental results presented in [58] considered the same
service rate across different resource types for a given class of requests.
B.) NFRA accounts for non linear energy cost of operating a server at different
utilization levels. In [58] the energy cost of operating a server was not considered
as they maximize profit by only considering generated revenues. In [98] the
authors do take into account the energy cost of operating a server. However, they
use a simple cost model; when a server is ON it consumes fixed energy whereas if
it is OFF it consumes zero energy. Due to lack of energy proportionality in
servers, the energy cost of servicing a request can change dramatically at different
utilization levels which is not considered in [98].
C.) NFRA accounts for response time constraints while minimizing energy costs.
Reference [98] considers energy optimization without taking into account
response time constraints, e.g., average or maximum response time should not
exceed the stipulated response time requirement.
Figure 4.2: Client Request Distribution within a Server Pool.
4.3 System Model Parameters and Assumptions
Before looking at our generalized network flow based formulation, let us first look at the parameters we use to capture the hosting center's and clients' characteristics. Let us assume that the hosting center has m heterogeneous server pools, i = 1, 2, ..., m. All servers within a given server pool are homogeneous. (Note that we will use servers and resources synonymously in the remainder of this chapter.) Each pool is characterized by C_i, representing the number of homogeneous servers in pool i. As shown in Figure 4.2, each server within a pool is modeled as a single server queuing system for each of the clients it is serving. Thus requests of a given client are distributed across servers where they wait in a single server queue.
Assuming that the hosting center is hosting n clients, j = 1, 2, ..., n, each client is characterized by the following parameters.
o λ_j: average request arrival rate, in requests per second, for client j, accounting for any user think times.
o μ_ij: average service rate in requests per second for client j's requests on a server from server pool i at system utilization U=1. Hence, the execution time of client j's request on a server from server pool i is T_ij = 1/μ_ij. We don't make any assumptions regarding the distribution of the random variables λ_j and μ_ij.
o e_ij: average energy cost in joules per request to serve a request of client j on server pool i at server utilization U=1.
o τ_j,max: average or maximum tolerable response time per request in seconds for client j.
o β_j: stochastic upper bound on the fraction of client j's requests that can violate τ_j,max, i.e., Pr(τ_j > τ_j,max) ≤ β_j, where τ_j is the response time of a request of client j.
o R_j: per request price paid to the hosting center by client j as long as the response time is not larger than τ_j,max.
o L_j: per request penalty incurred on the hosting center by client j whenever the response time exceeds τ_j,max.
Parameters τ_j,max, β_j, R_j and L_j are specified in the SLA. Usually, the client also provides λ_j; if, however, λ_j is unavailable at the time granularity of making resource allocation decisions, then one can use a history based predictor such as the one in [90] to estimate λ_j. Computing μ_ij and e_ij is the responsibility of the hosting center. We briefly describe one approach that a hosting center may use to obtain these two values, which was also used in our experimental results section.
We assume that when the hosting center signs on a new client, it conservatively allocates servers from each server pool to this client and monitors the resulting performance in terms of number of requests served, server utilization, and power consumption. Let us assume that during such a phase the hosting center allocates L_ij servers from server pool i to client j. We observe at these L_ij servers, over a period of δ_profile seconds, that the total number of requests served for client j on some server l of pool i is μ_ij,l and the server utilization is U_ij,l. Based on these values, we obtain μ_ij by first scaling the service rate (μ_ij,l / δ_profile) using U_ij,l to obtain the service rate at U=1 on server l and then taking the average across the L_ij servers, as shown in eq. (4.1).
\mu_{ij} = \frac{1}{L_{ij}} \sum_{l} \frac{\mu_{ij,l}}{\delta_{profile}\, U_{ij,l}}; \qquad T_{ij} = \frac{1}{\mu_{ij}} \qquad (4.1)
Furthermore we also measure the average power (energy/second) consumed by server l in server pool i at utilization U_ij,l during the period of δ_profile seconds. Let us denote this power by P_ij,l. Given P_ij,l and U_ij,l we obtain P_ij,(1), the server power at U=1 for client j on pool i, as shown in eq. (4.2), by first scaling P_ij,l linearly to obtain the power at U=1 on server l and then taking the average across the L_ij servers. We assume that P_i,idle, the idle state power of a server from server pool i, is given to us. Using P_ij,(1) we obtain e_ij as well, as shown in eq. (4.2). Note also that the heterogeneity in server pools, in terms of processor speed, memory, IO etc., will be captured by and reflected in μ_ij, T_ij and e_ij. The T_ij and e_ij characterization results presented in the experimental section will help clarify this further (cf. Figure 4.8). We have assumed that any client can be mapped to any server pool. This is not a limiting assumption for the proposed framework and is made solely to simplify the presentation.
P_{ij,(1)} = \frac{1}{L_{ij}} \sum_{l} \left[ P_{i,idle} + \frac{P_{ij,l} - P_{i,idle}}{U_{ij,l}} \right]; \qquad e_{ij} = \frac{P_{ij,(1)}}{\mu_{ij}} \qquad (4.2)
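As an illustration of how eqs. (4.1) and (4.2) can be applied to profiled measurements, the following Python sketch computes μ_ij, T_ij, e_ij and P_ij,(1) from per-server samples; the helper name and all sample numbers are hypothetical and not part of the NFRA implementation.

# Sketch of the profiling-based estimation in eqs. (4.1)-(4.2).
# All input data below are hypothetical placeholders.
def estimate_service_and_energy(samples, delta_profile, p_idle):
    """samples: list of (requests_served, utilization, avg_power) tuples,
    one per profiled server l of pool i serving client j."""
    L_ij = len(samples)
    # eq. (4.1): scale each observed rate to U = 1, then average over servers
    mu_ij = sum(req / (delta_profile * u) for req, u, _ in samples) / L_ij
    T_ij = 1.0 / mu_ij
    # eq. (4.2): scale measured power linearly to U = 1, then average
    p_ij_1 = sum(p_idle + (p - p_idle) / u for _, u, p in samples) / L_ij
    e_ij = p_ij_1 / mu_ij          # joules per request at U = 1
    return mu_ij, T_ij, e_ij, p_ij_1

# Hypothetical profiling data: (requests served, utilization, avg power in W)
samples = [(45000, 0.5, 290.0), (52000, 0.6, 300.0), (61000, 0.7, 310.0)]
mu, T, e, p1 = estimate_service_and_energy(samples, delta_profile=600.0, p_idle=238.5)
print(mu, T, e, p1)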
4.4 NFRA Framework
In this section we describe a generalized Network Flow based Resource Allocation
framework for energy minimization and profit maximization in hosting centers, which we
call NFRA.
4.4.1 Generalized Networks
Generalized networks [92] are similar to regular networks except that each edge (v,
w) of the network has a gain factor, γ ( γ ≥ 0), associated with it along with the capacity, u.
If an edge (v, w) has a gain factor of γ, then one unit of flow that leaves node v becomes γ
units when it arrives at w. Generalized networks are useful to model financial systems
with interest rates, oil pipeline networks with leaks or currency exchange rate problems,
etc. In such problems the objective is to find the maximum flow, much like maximum
flow in regular networks, into some sink node t such that the return on the investment is
maximized or the maximum amount of oil is received at the sink t, etc. Furthermore,
much like the minimum cost flow in the regular networks, a natural extension would be
to add a cost, κ, to each edge (v, w) which defines the cost of sending one unit of flow
from node v to node w. Based on these parameters we can characterize each edge (v, w) of the generalized network by a triplet (γ, κ, u). Here the objective function could be to find a maximum flow at minimum cost.
The unique property of generalized network flow models is their gain factors. We use gain factors to capture the heterogeneity of server pools in a hosting center. For example,
the execution times, or service rates, for the requests generated by a client may vary
widely across different server pools due to heterogeneity. Such differences can be
encapsulated by gain factors as we will show in the next section.
4.4.2 Modeling Resource Allocation in NFRA
In order to construct a generalized network for the hosting center resource allocation problem, let us assume that each server pool i and each client j represents a node in a bi-partite graph. As shown in Figure 4.3, the left side of the bi-partite graph represents a hosting center comprising different server pools whereas the right part of the graph represents different clients. Each server pool i is connected to a client j by a directed edge (i, j). Each such edge (i, j) is characterized by a gain factor, γ_ij, which captures the amount of service one unit of server pool i (i.e., one server in that pool) provides to client j. In other words, one unit of flow that is sent from server pool i becomes γ_ij serviced requests when it arrives at client j. Note that γ_ij is the same as μ_ij only when server utilization is 1. For any other utilization (U < 1), γ_ij < μ_ij.
Figure 4.3: Generalized Network Model for NFRA.
Furthermore, each edge is characterized by the cost κ_ij of sending one unit of flow from node i to node j, which captures the cost of allocating one unit of server pool i, i.e. one server of pool i, to client j. As we will explain in Section 4.4.3, the cost parameter is a function of the SLA type; in some cases the cost is purely the energy cost of allocating one server to satisfy a client's requests and in other cases the cost could account for complex profit functions. Finally, each edge (i, j) in this bipartite graph is assumed to have capacity u_ij = ∞.
To this bi-partite graph, we add a source node s and a sink node t. Each server pool node i is connected to s by an edge (s, i). The gain factor and cost associated with this edge are set to γ_si = 1 and κ_si = 0. The capacity of such an edge (s, i) is set to u_si = C_i. Intuitively this implies that from node s, we will push flow in terms of servers, with a maximum of C_i servers, towards a server pool node i. The sink node t is connected to each client node j by an edge (j, t). The gain factor and cost associated with this edge (j, t) are also set to γ_jt = 1 and κ_jt = 0, while the capacity is set to u_jt = λ_j. Intuitively this means that from client nodes, we will push flow in terms of number of serviced requests, with a maximum of λ_j serviced requests, onto the edges towards t. Figure 4.3 shows the resulting generalized network graph where each edge (v, w) is characterized by the triplet (γ_vw, κ_vw, u_vw). Such a network is denoted by G from here on. The value of some flow F in such a network is defined by the net amount of flow going into the sink t, i.e., the total number of client requests that can be serviced by the server pools.
Having defined flow F in G as above, the maximum amount of flow in G, F_max, satisfies the following inequality: F_{max} \le \sum_{j} \lambda_j. This inequality states that the value of F_max can never be more than the sum of request arrival rates. It is however possible that F_max is less than the sum of the arrival rates, in which case no feasible solution exists in G that satisfies all the requests of every client.
Note that the solutions achieved through generalized network are fractional. However
under the assumption, which is not unrealistic, that the number of requests generated by
clients is much larger than the number of servers hosted, the fractional result does not
have any dire implications. In any case, the fractional ratio of client request distribution
across different server pools can be achieved over a sufficiently large time interval.
In order to realize various SLAs, we will modulate the gain factors γ_ij and associated costs κ_ij of the edges in G. The following subsections cover the different SLA types considered in this paper and their corresponding realizations in G.
4.4.3 SLA types
4.4.3.1 Throughput Constraint
Throughput constrained SLA is the simplest form of SLA, where a client pays a fixed
price for meeting its throughput requirement. Since the price paid is fixed the hosting
center's profit is purely a function of how much energy it consumes. Hence, the objective
of resource allocation in this case is energy minimization.
In NFRA the throughput constrained SLA is formulated as follows. The throughput requirement of client j is stipulated by λ_j. Therefore all the incoming requests must be served. Now the maximum throughput provided by a server of pool i for client j is given by μ_ij, the service rate. Therefore we simply set the gain factors γ_ij = μ_ij, thus providing the maximum possible throughput and forcing servers to operate at 100% utilization. Correspondingly the edge costs are set to κ_ij = e_ij μ_ij. This cost essentially represents the power cost of servicing μ_ij requests per second for client j on a server from pool i. Finding the min cost max flow in G thus coincides with minimizing average power while meeting the throughput requirement.
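As an illustration of how G could be assembled in code, the sketch below builds the edge triplets of Figure 4.3 for the throughput constrained SLA; the dictionary representation, helper name and all numeric values are hypothetical assumptions, not part of the NFRA implementation.

# Sketch: build the generalized network G for the throughput constrained SLA.
# Edge triplets are (gain, cost, capacity); INF marks an uncapacitated edge.
INF = float('inf')

def build_throughput_graph(C, lam, mu, e):
    """C[i]: servers in pool i; lam[j]: arrival rate of client j;
    mu[i][j]: service rate at U=1; e[i][j]: energy per request at U=1."""
    edges = {}                                  # (v, w) -> (gamma, kappa, u)
    for i, cap in enumerate(C):
        edges[('s', ('pool', i))] = (1.0, 0.0, cap)
    for j, rate in enumerate(lam):
        edges[(('client', j), 't')] = (1.0, 0.0, rate)
    for i in range(len(C)):
        for j in range(len(lam)):
            gamma = mu[i][j]                    # one server yields mu_ij requests/s
            kappa = e[i][j] * mu[i][j]          # power cost of running that server
            edges[(('pool', i), ('client', j))] = (gamma, kappa, INF)
    return edges

# Hypothetical 2-pool, 2-client instance
G = build_throughput_graph(C=[10, 10], lam=[5000.0, 3000.0],
                           mu=[[900.0, 400.0], [700.0, 600.0]],
                           e=[[0.3, 0.5], [0.4, 0.45]])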
4.4.3.2 Average Response Time Constraint
The average response time constraint SLA stipulates that the average per request response time, τ_j,avg, for requests of client j under a given arrival rate λ_j shall never exceed τ_j,max. The client pays a fixed price for meeting the average response time constraint. Here the objective of profit maximization translates into energy minimization while still honoring the response time requirement. Note that the response time is a function of system utilization and, based on queuing theory, as utilization decreases the response time decreases. However, operating servers at lower utilization has two negative impacts. The most obvious negative impact is an increased number of active servers. The second impact is the increased energy cost due to lack of energy proportionality.
In order to understand the impact of utilization on server energy let us look at how energy, for a unit of work done, scales as a function of server utilization. Assume that a server can process 100 requests/second at 100% utilization while consuming power P_(1). Per request energy consumption is given as P_(1)/100. Assuming that server performance scales linearly with utilization, the same server when operating at 80% utilization can satisfy only 80 requests/second consuming P_(0.8) power. Per request energy consumption is given as P_(0.8)/80. The ratio of energy per request at 80% utilization to energy per request at 100% utilization is defined as the Energy Increase Factor (EIF), representing the scaling of energy per unit of work at different utilizations with respect to 100% utilization. More generally, the EIF at utilization U, EIF_U, is given in eq. (4.3),
EIF_U = \frac{P_{(U)}}{P_{(1)}\, U} \qquad (4.3)
where P_(U) is the power at utilization U and P_(1) is the power at 100% utilization. The system is most energy efficient when EIF=1, which occurs at U=1. Figure 4.4 plots EIF on the primary Y-axis and the scaling of power on the secondary Y-axis vs. utilization. The data for Figure 4.4 was obtained from the SPEC website for SPECWeb2009 [86] Power benchmarks for the server configuration HP_C, detailed in the experimental results section. Note that the power value P_(U) at different U's was obtained using linear regression on the power data, resulting in P_(U) = mU + P_idle with m=102.5 and P_idle=238.5, where P_idle represents the idle state power. As shown in the figure, EIF increases super-linearly as utilization decreases. The above analysis shows that there is a tradeoff between increasing utilization to reduce the energy cost and reducing utilization to meet the response time constraint.
Figure 4.4: Power and per Request Energy Consumption Scaling.
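The sketch below evaluates eq. (4.3) under the linear power model P(U) = mU + P_idle quoted above for HP_C; treating that regression as exact is an assumption made only for illustration.

# Sketch: Energy Increase Factor of eq. (4.3) under the linear power model
# P(U) = m*U + P_idle (m = 102.5, P_idle = 238.5 for the HP_C pool above).
def power(u, m=102.5, p_idle=238.5):
    return m * u + p_idle

def eif(u, m=102.5, p_idle=238.5):
    # EIF_U = P(U) / (P(1) * U): energy per request at U relative to U = 1
    return power(u, m, p_idle) / (power(1.0, m, p_idle) * u)

for u in (1.0, 0.8, 0.5, 0.2):
    print(u, round(eif(u), 2))   # EIF grows super-linearly as U drops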
In order to account for this tradeoff let us first look at the average per request response time of client j's requests on a server from server pool i as a function of server utilization. Let us denote this average per request response time by τ_ij,avg. τ_ij,avg depends on the queuing model and the distribution of the service rate μ_ij and arrival rate λ_j. Given that we are considering a single server queuing system, shown in Figure 4.2, if arrival and service rates are Poisson distributed (M/M/1 queue), then τ_ij,avg is given as ([9]):
\tau_{ij,avg} = \frac{1}{\mu_{ij} - \lambda_{ij}} = \frac{1}{\mu_{ij}(1 - U_{ij})} \qquad (4.4)
where λ_ij denotes the effective arrival rate of client j's requests to a server of pool i, i.e., the number of client j's requests that are assigned to a server of pool i, and U_ij = λ_ij/μ_ij denotes the effective per server utilization for client j on server pool i. If we want to make sure that τ_ij,avg ≤ τ_j,max, then from equation (4.4) we can upper bound the per server utilization U_ij as follows:
U_{ij} = 1 - \frac{1}{\mu_{ij}\, \tau_{ij,avg}} \qquad (4.5)
U_{ij} \le U_{ij,max} = 1 - \frac{1}{\mu_{ij}\, \tau_{j,max}} \qquad (4.6)
Since the energy efficiency of servers reduces at lower utilization, we set the utilization U_ij to be equal to U_ij,max as derived in eq. (4.6). Obviously a reduction in utilization comes at the expense of a reduced effective service rate, μ'_ij. Thus, in order to account for the average response time constraint, we modify the service rate as shown in eq. (4.7). Note that reducing μ_ij essentially creates the illusion that server pool i is able to process fewer of client j's requests per server, although, effectively, the service rate hasn't changed. Once the service rate is updated we set the gain factors γ_ij to this new service rate. Note that, by enforcing the upper bound on utilization on every server, the average response time constraint is met on every server and is thus met overall.
\gamma'_{ij} = \mu'_{ij} = U_{ij,max}\, \mu_{ij} \qquad (4.7)
The reduction in server utilization results in an increased per request energy cost due to non energy proportionality. We initially characterized e_ij at utilization U=1. We account for the increased e'_ij as follows:
e'_{ij} = EIF_{U_{ij,max}}\, e_{ij} = \frac{P_{ij,(U_{ij,max})}}{P_{ij,(1)}\, U_{ij,max}}\, e_{ij} \qquad (4.8)
where P_ij,(U) is the per server power dissipation at utilization U, and P_ij,(1) is the per server power at U=1 for server pool i and client j. Using e'_ij we compute the value for κ_ij as κ_ij = e'_ij γ'_ij.
Now we have updated all the necessary variables of our initial generalized network flow formulation in order to account for the average response time constraints. Using this new set of variables, finding the maximum flow, F_max, into sink t at minimum cost will result in an average response time constrained, energy minimizing resource allocation.
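A minimal sketch of the per-edge computations of eqs. (4.6)-(4.8), assuming the M/M/1 bound and the linear power model, is given below; the helper name and sample inputs are hypothetical.

# Sketch: edge parameters for the average response time SLA (M/M/1 bound).
def avg_rt_edge_params(mu_ij, e_ij, tau_j_max, m, p_idle):
    """Returns (gamma, kappa) for edge (i, j) given mu_ij, e_ij at U = 1,
    the response time bound tau_j_max and the pool's linear power model."""
    u_max = 1.0 - 1.0 / (mu_ij * tau_j_max)          # eq. (4.6)
    if u_max <= 0.0:
        return None                                   # bound cannot be met on this pool
    gamma = u_max * mu_ij                             # eq. (4.7): effective service rate
    p1 = m * 1.0 + p_idle
    eif = (m * u_max + p_idle) / (p1 * u_max)
    e_prime = eif * e_ij                              # eq. (4.8): energy/request at u_max
    kappa = e_prime * gamma                           # edge cost per allocated server
    return gamma, kappa

print(avg_rt_edge_params(mu_ij=900.0, e_ij=0.38, tau_j_max=0.015, m=102.5, p_idle=238.5))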
Note that eq. (4.4) and the subsequent analysis assumed the M/M/1 model. However we can use any other distribution of service and arrival rates to obtain an upper bound on utilization. For example, under the G/G/1 model we can derive the upper bound on utilization as follows [9]:
\tau_{ij,avg} \approx \frac{1}{\mu_{ij}} \left[ 1 + \frac{C_A^2 + C_B^2}{2} \cdot \frac{U_{ij}}{1 - U_{ij}} \right]; \qquad U_{ij,max} = \left[ 1 + \frac{C_A^2 + C_B^2}{2(\mu_{ij}\, \tau_{j,max} - 1)} \right]^{-1}
where C_A and C_B are the coefficients of variation of request inter-arrival times and service times, respectively. The rest of the analysis would use this new upper bound to compute the remaining variables.
One point is worth mentioning here. Since the min cost max flow in G can yield a fractional solution, some server in server pool i might be shared among multiple clients. If each client demands a different utilization (U_ij) level from the same server, then it may seem that the computation of the energy cost e'_ij, based on a given client utilization level, may not hold. However, if the server is shared in a time multiplexed fashion, then we shall compute e'_ij at a fine time granularity, only accounting for the duration of time a client j is assigned to resource i. For example, if one particular server is equally shared among two clients, and during the first half of a second one client uses the server at 10% utilization while during the second half the other client uses the server at 70% utilization, then for each half second we can compute e'_ij for the two clients with the corresponding utilization levels. Given the assumption that the number of service requests is much larger than the available resources, such selection, on average, will not impact the end result much.
4.4.3.3 Stochastic Maximum Response Time Constraint
This is the third and the most complex SLA type we will use to demonstrate the
effectiveness of NFRA. Under this SLA type, a given client j stipulates its response time
requirement as follows: No more than a β_j fraction of requests shall violate the maximum response time τ_j,max, i.e., Pr(τ_j > τ_j,max) ≤ β_j, where τ_j is the response time of a request of client j. As can be inferred from this probability equation the client cares about the response time of every request, and therefore the distribution of request response times and not just the average response time as was the case in the previous SLA described in Section 4.4.3.2. We propose to meet this response time constraint by ensuring that the constraint is met at every server of pool i to which client j is mapped, much like the solution for the average response time constraint. This translates into the following requirement: For a given client j and server pool i, the per request response time τ_ij shall not violate the maximum tolerable response time τ_j,max for more than a β_j fraction of the requests, i.e., Pr(τ_ij > τ_j,max) ≤ β_j. Hence by imposing β_j of client j on every server pool i we ensure that the probabilistic response time guarantee is met at every pool i and hence is met overall as well. For our single server queuing model, this probability can be upper bounded as shown in eq. (4.9), for M/M/1 queues [9],
\Pr(\tau_{ij} > \tau_{j,max}) = e^{-(\mu_{ij} - \lambda_{ij})\tau_{j,max}} = e^{-\mu_{ij}(1 - U_{ij})\tau_{j,max}} \le \beta_j \qquad (4.9)
Using the upper bound from eq. (4.9) to meet the SLA, we obtain the effective utilization using eq. (4.10):
U_{ij} \le 1 + \frac{\ln \beta_j}{\mu_{ij}\, \tau_{j,max}} \qquad (4.10)
4.4.3.3.1 Energy Minimization
We will consider two variants under this SLA. In the first variant we assume that the
client pays the hosting center a fixed total (lump sum) price, as long as the hosting center
abides by this SLA. For instance, an SLA may stipulate that 95% of a client's requests shall be completed within 10 milliseconds (ms). Here, the client never pays any incentive if more than 95% of the requests are satisfied within 10ms; similarly, the hosting center pays no penalties as long as 95% of the requests are completed within 10ms. Given this scenario the hosting center strives to satisfy 95% of the requests within 10ms, no more, no less. Thus, the objective of our resource allocation problem in this scenario is energy minimization.
Based on eq. (4.10) and given that β_j ≤ 1, and hence ln β_j ≤ 0, U_ij shall always be smaller than 1 by at least (−ln β_j / μ_ij τ_j,max). Since the energy efficiency of servers reduces at lower utilization, as mentioned earlier, we set the utilization to the upper bound provided in eq. (4.10), as shown in eq. (4.11).
U_{ij,max} = 1 + \frac{\ln \beta_j}{\mu_{ij}\, \tau_{j,max}} \qquad (4.11)
Obviously a reduction in utilization comes at the expense of a reduced effective service rate. Thus, in order to account for the response time constraint, we modify the service rate, and therefore the gain factor, as before, as shown in eq. (4.7). Note that, since the probabilistic response time guarantee of eq. (4.9) is provided at every server of pool i to which client j is mapped, the same guarantee is also provided by server pool i as a whole. Furthermore, we must account for the increased per request energy cost due to reduced server utilization, as before. We account for the e_ij increase as shown in eq. (4.8) and subsequently update the edge costs κ_ij as before. In essence, once U_ij,max is computed from eq. (4.11) the rest of the variables for the network flow are computed exactly the same way we did in Section 4.4.3.2. Finding the maximum flow, F_max, into sink t at minimum
cost will result in stochastic maximum response time constrained energy minimizing
resource allocation.
Note that, in order to obtain U_ij,max we assumed an M/M/1 queuing model with exponential service times. However, for general service time distributions, eq. (4.9) can be used as an approximation when service times have a heavy tail distribution [58].
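For completeness, a one-line helper for eq. (4.11) is sketched below under the same M/M/1 assumption; the function name and sample inputs are hypothetical.

import math

# Sketch: utilization cap of eq. (4.11) for the stochastic SLA (M/M/1 bound).
def stochastic_u_max(mu_ij, tau_j_max, beta_j):
    # U_ij,max = 1 + ln(beta_j) / (mu_ij * tau_j_max); ln(beta_j) <= 0 for beta_j <= 1
    return 1.0 + math.log(beta_j) / (mu_ij * tau_j_max)

print(stochastic_u_max(mu_ij=900.0, tau_j_max=0.015, beta_j=0.05))  # ~0.78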
4.4.3.3.2 Profit Maximization
So far we looked at resource allocation that minimizes energy consumption assuming the client pays a fixed price as long as the stipulated SLAs are met. However, when the price paid by the client is dependent on the quality of service received, the problem of resource allocation becomes more complex. In this last problem formulation we assume that the client pays a fixed per request price, instead of a fixed total (lump sum) price, as long as the hosting center provides an agreed upon probabilistic response time guarantee. However, while providing the probabilistic guarantee, the hosting center pays a per request penalty to the client whenever the response time of a request violates the stipulated response time requirement. For example, an SLA may stipulate that at least 95% of all requests from a particular client shall be completed within 10ms and that whenever requests are not completed within 10ms the hosting center pays a penalty to the client. In this scenario a client pays for every request that meets the response time requirement, while the hosting center pays a penalty for every request that fails to meet the response time requirement. Thus the hosting center has an incentive to meet the maximum response time requirement of more than 95% of the requests since this will decrease the penalties the hosting center has to pay and increase the revenues. Thus allocating more resources can reduce the penalty and hence generate more revenue, but it can also adversely affect the energy cost. Therefore profit maximization must account for both generated revenue and energy cost.
Before formulating the problem as a generalized network flow, we will show the relationship between utilization and profit that captures the trade-off between generated revenue and energy cost. Let us assume that the request arrival rate λ and service rate μ are Poisson with λ ≤ μ. The ratio of λ to μ gives us the utilization, U, of the system. As defined earlier, the price paid per request is R as long as the response time τ ≤ τ_max, and when τ > τ_max the hosting center pays a penalty of L (L ≥ 0). Let σ represent the ratio of τ_max to the service time 1/μ, i.e., σ = μτ_max. Assuming the M/M/1 model, we can calculate an upper bound on the probability of failure, β, i.e., Pr(τ > τ_max), as: β = e^{-(μ-λ)τ_max} = e^{-σ(1-U)}. Given the value of β, the net price paid per request, π, is given by eq. (4.12).
\pi = (1 - \beta) R - \beta L \qquad (4.12)
The net profit per request Φ is obtained as shown in eq. (4.13), where e is the energy cost. Note that, for a given server, Φ, π and e are functions of the server utilization.
\Phi = \pi - e \qquad (4.13)
We will quantitatively demonstrate how net profit varies with utilization. Let's assume that the hosting center operator sets a 300% profit margin over the energy cost, e_1, of servicing a request at utilization U=1. Thus the price paid per request, R, is 3 times e_1. Let L=R. We obtained e at different U's using eq. (4.3) and the data of Figure 4.4. The graph of net profit per request vs. utilization for σ values of 5, 10 and 15 is shown in Figure 4.5.
Figure 4.5: Net Profit per Request at Different Utilization Levels.
Lemma 1: The profit function Φ of eq. (4.13) is a concave downwards function.
Proof: Expanding the function Φ we get eq. (4.14).
\Phi = R\,(1 - e^{-\sigma(1-U)}) - L\, e^{-\sigma(1-U)} - \frac{mU + P_{idle}}{P_{(1)}\, U} \qquad (4.14)
Without loss of generality let us assume R=L (price and penalty paid are the same). Let us denote P_idle/P_(1) = A, which is a constant for any given server. Taking the first and second order derivatives of Φ with respect to U we get eq. (4.15) and (4.16) respectively.
\frac{\partial \Phi}{\partial U} = -2\sigma R\, e^{-\sigma(1-U)} + \frac{A}{U^2} \qquad (4.15)
\frac{\partial^2 \Phi}{\partial U^2} = -2\sigma^2 R\, e^{-\sigma(1-U)} - \frac{2A}{U^3} \qquad (4.16)
Given that A ≥ 0, R ≥ 0 and σ ≥ 0 the second order derivative of Φ with respect to U is
negative. Therefore function Φ is concave downwards.
Thus when system utilization is high the energy cost per request is low but β (fraction
of SLA violations) increases, thereby causing hosting center to pay penalties to the client.
Conversely when utilization is low, β may be lower but energy costs increase. Thus net
profit is a complex interplay between system utilization, energy costs, and probability of
missing response time constraint. Under such a scenario, profit maximizing resource
allocation needs to account for utilization dependent per request profit, available
resources and response time constraint.
Figure 4.6: Utilization Range for Profit Optimization.
The response time constraint puts an upper bound on the server utilization as shown in eq. (4.10). Therefore, for profit maximization, the per server utilization of client j on server pool i must lie in [0, U_ij,max]. However, if we can shrink the search range [0, U_ij,max] by providing a better lower bound for utilization without compromising the optimality, then the efficiency of our resource allocation algorithm can be improved. Here we obtain one such lower bound and establish a theorem of optimality. For the sake of clarity and to provide example driven intuition, in Figure 4.6 we reproduce only one curve from Figure 4.5, corresponding to σ_ij = μ_ij τ_j,max = 10. Assume that the SLA stipulates β_j to be equal to 8.2%, which implies that 91.8% of the requests must satisfy the response time requirement. Correspondingly we derive U_ij,max from eq. (4.11), which gives us U_ij,max = 0.75 as shown in Figure 4.6. Using Lemma 1 we proved that Φ, the per request profit function, is a concave downwards function which achieves its maximum value somewhere between U=0 and U=1. Let us denote the utilization at which the per request profit function, Φ_ij, for a server pool and client pair (i, j), achieves its maximum value by U_ij,lopt. U_ij,lopt is server pool and client pair (i, j)'s local optimum for maximum profit. For the profit function of Figure 4.6, U_ij,lopt is 0.65. We first establish the following theorem.
Theorem 1: In the optimal solution to the resource allocation problem for maximum profit, the utilization level for any server pool and client pair (i, j) can never lie below U_ij,lopt.
Proof: Let us assume that the optimal solution uses some utilization U_ij,* for server pool and client pair (i, j) such that U_ij,* < U_ij,lopt. For this U_ij,* the corresponding net profit per request is Φ_ij,*. Due to the concave (downwards) nature of the net profit per request curve Φ (Lemma 1) we know that the following relation holds: ∂Φ_ij/∂U evaluated at U_ij,* is greater than 0. This implies that at utilizations higher than U_ij,* but lower than U_ij,lopt the net profit per request is strictly increasing. Now, if we had some fraction x_ij,* of server pool i allocated to client j to support utilization U_ij,*, then we know with certainty that a fraction smaller than x_ij,* can be used (but at higher utilization) to obtain a higher net profit per request. Note that while we use a fraction smaller than x_ij,* at higher utilization to obtain higher per request profit, all the other x_yz's, y ≠ i and z ≠ j, remain unchanged. Therefore the rest of the solution does not get affected. Hence for U_ij,* < U_ij,lopt and for the corresponding x_ij,* we can safely use a fraction smaller than x_ij,* at higher utilization to improve the solution. This contradicts our assumption that the optimal solution lies at some optimal utilization U_ij,* such that U_ij,* < U_ij,lopt.
Based on Theorem 1 we can establish that the optimal solution must lie between U_ij,lopt and U_ij,max, given that U_ij,lopt ≤ U_ij,max, for each server pool and client pair (i, j), as long as we allocate some resources of pool i to client j. Note that the utilization level for pair (i, j) introduces a new decision variable in our resource allocation problem and hence we must find not only what fraction of pool i to allocate to client j but also at what utilization level. In our problem formulation we account for decision variables corresponding to different utilization levels in the following way: For each server pool and client pair (i, j) we split the edge (i, j) into multiple edges corresponding to different utilization values that lie between U_ij,lopt and U_ij,max. Because of this discretization, the new problem formulation will only be able to provide an approximate solution.
Figure 4.7: Edge Splitting for Profit Optimization.
Figure 4.7 shows the new graph G'. The figure shows the splitting of one such edge (i, j), where we divide the range [U_ij,lopt, U_ij,max] into K levels and add K+1 nodes between node i and node j. We denote K as the edge splitting cardinality. Given that the optimal solution lies in [U_ij,lopt, U_ij,max], we are essentially providing an approximation subject to the quantization error introduced due to the edge splitting cardinality K. The granularity of sweeping the range [U_ij,lopt, U_ij,max] is only impeded by the computational complexity. Fine granularity, i.e. a bigger edge splitting cardinality, improves the solution but increases the computational time.
Let us denote each of these K+1 nodes by i_j^k for k = 1, ..., K+1 and assign them a utilization level U_ij^k. Then for each edge (i_j^k, j), shown by dotted thick edges in Figure 4.7, we set the parameters (γ, κ, u) = (1, 0, ∞), whereas for each edge (i, i_j^k), shown by solid thick edges, we set the corresponding gain factors, γ_ij^k, as shown in eq. (4.17), with capacities set to ∞.
U_{ij}^{k} = U_{ij,lopt} + \frac{(k-1)\,(U_{ij,max} - U_{ij,lopt})}{K}; \quad \gamma_{ij}^{k} = U_{ij}^{k}\, \mu_{ij}; \quad e_{ij}^{k} = \frac{P_{ij,(U_{ij}^{k})}}{P_{ij,(1)}\, U_{ij}^{k}}\, e_{ij} \qquad (4.17)
Furthermore, for each edge (i, i_j^k) with utilization U_ij^k, π_ij^k, the expected price paid per request by client j at utilization U_ij^k, changes as well. In order to determine π_ij^k let us denote by β_ij^k the fraction of client j's requests that violate the maximum response time τ_j,max on server pool i at utilization U_ij^k. Then eq. (4.18) follows. Based on eq. (4.18) we calculate π_ij^k and the net profit per request, Φ_ij^k, as shown in eq. (4.19) and (4.20) respectively. With this new formulation, from the perspective of server pools, each of the newly added nodes is considered a separate client. Moreover, to maximize profit by solving a min cost max flow problem on G', the cost on the edges, κ_ij^k, must be set such that when we minimize the cost, we maximize the profit. Hence we modify the edge costs as shown in eq. (4.21).
\beta_{ij}^{k} = \Pr(\tau_{ij} > \tau_{j,max}) = e^{-\mu_{ij}(1 - U_{ij}^{k})\tau_{j,max}} \qquad (4.18)
\pi_{ij}^{k} = (1 - \beta_{ij}^{k})\, R_j - \beta_{ij}^{k}\, L_j \qquad (4.19)
\Phi_{ij}^{k} = \pi_{ij}^{k} - e_{ij}^{k} \qquad (4.20)
\kappa_{ij}^{k} = -\Phi_{ij}^{k}\, \gamma_{ij}^{k} \qquad (4.21)
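A sketch of the per-level computations of eqs. (4.17)-(4.21) follows; the helper name, the linear power model and the sample inputs are assumptions made only for illustration.

import math

# Sketch: parameters of the K+1 split edges of Figure 4.7, following
# eqs. (4.17)-(4.21). All helper names and inputs are hypothetical.
def split_edge_params(mu_ij, e_ij, tau_j_max, R_j, L_j, u_lopt, u_max, K, m, p_idle):
    p1 = m * 1.0 + p_idle
    levels = []
    for k in range(1, K + 2):                               # k = 1 .. K+1
        u_k = u_lopt + (u_max - u_lopt) * (k - 1) / K       # eq. (4.17)
        gamma_k = u_k * mu_ij
        e_k = (m * u_k + p_idle) / (p1 * u_k) * e_ij
        beta_k = math.exp(-mu_ij * (1.0 - u_k) * tau_j_max)   # eq. (4.18)
        pi_k = (1.0 - beta_k) * R_j - beta_k * L_j            # eq. (4.19)
        phi_k = pi_k - e_k                                    # eq. (4.20)
        kappa_k = -phi_k * gamma_k                            # eq. (4.21): minimizing cost maximizes profit
        levels.append((u_k, gamma_k, kappa_k))
    return levels

levels = split_edge_params(mu_ij=900.0, e_ij=0.38, tau_j_max=0.015,
                           R_j=1.14, L_j=1.14, u_lopt=0.65, u_max=0.75,
                           K=4, m=102.5, p_idle=238.5)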
It is worthwhile mentioning that the new formulation presented by graph G' provides us greater flexibility. For example, much like the price function used in [98], if we were to model R_j as a step function of the response time, τ_ij^k, of client j on server pool i for a given utilization U_ij^k, then we can easily modify eq. (4.19) to account for the new price paid for the improved response time instead of the fixed price R_j. Note that so far we assumed that U_ij,lopt was given to us. In practice, to find U_ij,lopt we need to solve ∂Φ_ij/∂U = 0. We use Newton's method to find U_ij,lopt.
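A minimal sketch of this Newton iteration on eqs. (4.15)-(4.16) (with R = L and the per request energy expressed in units of e_1) is shown below; the starting point, tolerance and sample inputs are arbitrary illustrative choices.

import math

# Sketch: find U_lopt by Newton's method on dPhi/dU = 0, using eqs. (4.15)-(4.16).
def u_lopt(R, sigma, A, u0=0.5, tol=1e-9, iters=100):
    u = u0
    for _ in range(iters):
        d1 = -2.0 * sigma * R * math.exp(-sigma * (1.0 - u)) + A / u**2          # eq. (4.15)
        d2 = -2.0 * sigma**2 * R * math.exp(-sigma * (1.0 - u)) - 2.0 * A / u**3  # eq. (4.16)
        step = d1 / d2
        u -= step
        u = min(max(u, 1e-3), 1.0)      # keep the iterate in (0, 1]
        if abs(step) < tol:
            break
    return u

# Hypothetical instance: R = L = 3*e_1 (e_1 normalized to 1), sigma = 10, A = P_idle/P_(1)
print(u_lopt(R=3.0, sigma=10.0, A=238.5 / 341.0))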
o Iterative Edge Splitting
Note that a given fixed value of the edge splitting cardinality, K, may not be applicable to all scenarios. Hence it is advisable to iteratively increase the value of K until the incremental benefit between two consecutive iterations falls below a certain threshold. While such an iterative approach is useful, it is worthwhile mentioning that as we increase the value of K, and therefore increase the number of utilization levels being introduced between pair (i, j), they shall be chosen in a manner such that they are inclusive of the U_ij^k values from the previous iteration. This means that if for an iteration with K=K_1 we chose K_1 utilization levels between pair (i, j), then in the next iteration with K=K_2, K_2 > K_1, the K_2 utilization levels between the pair (i, j) must include the same K_1 utilization levels from the previous iteration and introduce only K_2 - K_1 new utilization levels. We call such growth in K inclusive growth of the edge splitting cardinality. The inclusive growth of K is necessary to ensure that the new decision points, in the form of new utilization levels being introduced, contain all the decision points from the previous iteration and a few new ones. Thus, we ensure that our solution always improves.
4.4.4 Solving Min Cost Max Flow in NFRA
For the min cost max flow and max flow problems on generalized networks, the fastest known polynomial time algorithm is based on interior point linear programming (LP) methods [89] (O(m^1.5 n^2 log(nB))). Recently in [92] Wayne et al. proposed the first polynomial time combinatorial algorithm for min cost circulation, with time complexity O(m^3 n^3 log B). Given the complexity of combinatorial approaches we resort to LP to find the min cost max flow solutions presented in this paper.
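As an illustration of the LP formulation (not the lpsolve-based implementation used later in the experiments), the sketch below exploits the bipartite structure of G: the source and sink edges reduce to capacity bounds, leaving one variable x_ij per (pool, client) edge. It solves the max flow first and then minimizes cost at that flow value using scipy.optimize.linprog; the helper name, the two-phase strategy and the sample numbers are assumptions made for illustration.

import numpy as np
from scipy.optimize import linprog

# Sketch: min cost max flow on the bipartite graph G via two LP solves.
# x[i, j] = number of servers of pool i allocated to client j (fractional).
def nfra_lp(C, lam, gamma, kappa):
    m, n = len(C), len(lam)
    nv = m * n
    idx = lambda i, j: i * n + j
    A_ub, b_ub = [], []
    for i in range(m):                      # pool capacity: sum_j x_ij <= C_i
        row = np.zeros(nv)
        row[[idx(i, j) for j in range(n)]] = 1.0
        A_ub.append(row); b_ub.append(C[i])
    for j in range(n):                      # serviced flow per client <= lambda_j
        row = np.zeros(nv)
        for i in range(m):
            row[idx(i, j)] = gamma[i][j]
        A_ub.append(row); b_ub.append(lam[j])
    A_ub, b_ub = np.array(A_ub), np.array(b_ub)
    g = np.array([gamma[i][j] for i in range(m) for j in range(n)])
    kap = np.array([kappa[i][j] for i in range(m) for j in range(n)])
    # Phase 1: maximize total serviced flow F = sum gamma_ij * x_ij
    r1 = linprog(-g, A_ub=A_ub, b_ub=b_ub, method='highs')
    f_max = -r1.fun
    # Phase 2: minimize cost while keeping the flow at F_max (small slack for numerics)
    A2 = np.vstack([A_ub, -g])
    b2 = np.append(b_ub, -(f_max - 1e-6))
    r2 = linprog(kap, A_ub=A2, b_ub=b2, method='highs')
    return f_max, r2.x.reshape(m, n)

# Hypothetical 2-pool, 2-client instance with throughput-SLA edge parameters
f, x = nfra_lp(C=[10, 10], lam=[5000.0, 3000.0],
               gamma=[[900.0, 400.0], [700.0, 600.0]],
               kappa=[[270.0, 200.0], [280.0, 270.0]])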
4.5 Experimental Setup and Results
4.5.1 Hosting Center and SLA Parameters
Before looking at our experimental evaluations we explain how we selected the
model parameters described in Section 4.4.2. For server heterogeneity we considered four
types of server pools (m=4), each being categorized by a three tuple {CPU frequency,
memory and disk size}. The four server pools are HP_A = {3.2GHz, 96GB, 2TB}; HP_B
= {3.2GHz, 48GB, 1.5TB Flash}; HP_C = {2.3GHz, 48GB, 1.9TB}; HP_D = {2.3GHz,
48GB, 1.1TB Flash}. HP_B and HP_D use flash drives, instead of traditional hard drives,
which consume less power per read. Instead of arbitrarily selecting the server
characteristics we selected servers used in reporting SPECWeb2009 [86] results by
various vendors in the near past.
Briefly, the SPECWeb2009 benchmarks are devised for next generation web based workloads and consist of four different workload types: Banking, E-commerce, Support and Power. The Power workload essentially characterizes server power at different utilizations for the E-commerce workload. The other three workloads operate at 100% utilization by issuing as
many requests as needed to saturate the system utilization. The request rates from these
workloads follow Poisson distribution since think times that govern request inter arrival
times are modeled as an exponential distribution. We looked up the published
SPECWeb2009 Power benchmark results to obtain the power consumption at five
different utilization levels. We then used linear interpolation to derive the EIF value at
any utilization. We treated SPECWeb2009 Banking, E-Commerce and Support
workloads as three different customer types. Using the published results we obtain the
energy per request of each workload type on each server type (e_ij) and the per request execution time (T_ij) on each server type, as discussed in Section 4.3.
Figure 4.8 shows e_ij on the primary Y-axis (dotted lines) and T_ij on the secondary Y-axis, for the three workload types (equivalent to clients in Figure 4.1) and the four server types (equivalent to server pools in Figure 4.1). Note that the T_ij values for different workload types do not necessarily scale in a similar manner across different server types. For example, for the E-commerce and Banking workloads the T_ij value reduces from pool HP_A to HP_B, while it increases for Support. Thus the heterogeneity in servers can alter workload behaviors (e_ij and T_ij) in a complex manner.
Figure 4.8: Energy and Response Time Characterization.
We created two hosting center setups using the four server types by choosing the corresponding C_i values (number of servers in each pool) as follows: config_A:{HP_A,
HP_B, HP_C, HP_D}={10, 10, 10, 10} and config_B={9, 9, 11, 11}. We also assume
that we have three Banking clients, three E-commerce clients and two Support clients, a
total of eight clients, i.e., n=8. We formulate three different instantiations using these
eight clients, called Instance_1, Instance_2 and Instance_3 workloads. Instance_1 is
banking dominated, i.e., load generated due to banking clients dominates. Similarly
Instance_2 and Instance_3 workloads are E-commerce and Support dominated,
respectively. Table 4.1 shows the SLA parameters for the eight clients for Instance_1.
The clients within a particular instance differ in the number of requests sent to the hosting
center. For example, small scale (SS) Banking client generates 70K requests per second
while large scale (LS) Banking client generates 98K requests per second, 40% more. We
also assume that E-commerce and Banking clients have more stringent response time
requirements than Support clients. Furthermore, the SLA specification for LS clients has a lower tolerance for response time violation and hence has a lower β_j. But they are also willing to pay a premium, in terms of higher profit margins for the hosting center, denoted by M_j in Table 4.1, representing the profit margin percent over the average per request energy cost across all server pools. Furthermore we assume that R_j = L_j.
Table 4.1: Client SLA specification for Instance_1
Client Type    j    λ_j    τ_j,max (msec)    β_j     M_j
Bank (LS)      1    98K    15                0.05    80
Bank (MS)      2    82K    15                0.08    60
Bank (SS)      3    70K    15                0.1     40
Ecomm (LS)     4    41K    12                0.05    80
Ecomm (MS)     5    36K    12                0.08    60
Ecomm (MS)     6    20K    12                0.1     40
Support (MS)   7    8K     120               0.6     60
Support (SS)   8    4K     120               0.8     50
(LS=Large Scale, MS=Medium Scale, SS=Small Scale)
It is worth noting that the values we selected for the various parameters shown in Table 4.1 roughly reflect the relative importance of clients. The absolute values were arbitrarily selected while preserving the relative importance. Note that client SLA parameters across different instantiations change only in terms of arrival rates, leaving the rest of the parameters, i.e., τ_j,max, β_j, M_j, unchanged. For the sake of brevity we omit the arrival rates of the other instance types. Furthermore, the experimental results presented here assume Poisson arrival and service rates. Note that the framework itself can support other distributions, as mentioned earlier. We implemented the min cost max flow using the LP package lpsolve [59].
4.5.2 Pseudo Optimal Solution
NFRA generates conservative resource assignments by ensuring that a constraint is met at every server of pool i to which client j is mapped. For instance, NFRA ensures τ_ij,avg ≤ τ_j,max rather than τ_j,avg ≤ τ_j,max for the average response time constraint SLA. In an optimal solution this more restrictive constraint is unnecessary since at some servers the average response time can violate the stipulated response time requirement while making up for it at some other servers. In order to quantify the effect of this sub-optimal utilization selection of NFRA we compare NFRA with two approaches: (1) greedy and (2) pseudo optimal. Here we give a brief overview of the pseudo optimal solution, which is applicable to all the SLA types except the throughput constrained SLA, for which NFRA already finds the optimal solution. Let us look at the pseudo optimal solution for the average response time SLA, which is equally applicable to the stochastic maximum response time SLA.
Since in the optimal solution constraints are not imposed conservatively on every server of a pool, the pseudo optimal solution must treat every server in a pool as a different resource and try to find the optimal resource allocation between every server and client pair, rather than for a server pool and client pair like NFRA. Second, an optimal solution does not need to set an upper bound on the utilization of any server, as done in NFRA. This is because it is possible that, in an optimal solution, a given client's average response time violates the stipulated maximum response time requirement on some servers while making up for it on others such that overall, across all the servers, the average response time does not violate the maximum response time requirement. In essence, in an optimal solution any server can be allocated to any client at any utilization as long as the SLA constraint is met across all the servers. Therefore each individual server and client pair must sweep the whole range of utilization in order to search for the optimal solution. In the pseudo optimal solution we divide the whole range of utilization (0, 1) into multiple levels. This discretization of the utilization level makes the solution pseudo optimal instead of truly optimal. The following LP formulation models our pseudo optimal solution for the average response time constrained SLA.
\min \; \sum_{j}\sum_{i}\sum_{k} P_{jik}(U_{jik})\, x_{jik}
\text{s.t.} \quad \sum_{i}\sum_{k} \tau_{jik,avg}(U_{jik})\, \frac{\mu_{jik}\, U_{jik}\, x_{jik}}{\lambda_j} \le \tau_{j,max} \quad \forall j
\sum_{i}\sum_{k} \mu_{jik}\, U_{jik}\, x_{jik} = \lambda_j \quad \forall j
\sum_{j}\sum_{k} x_{jik} \le 1 \quad \forall i
where x_jik is the decision variable corresponding to client j on some server i at utilization level k (U_jik), denoting what fraction of server i at utilization level k is allocated to client j. P_jik denotes the power corresponding to the same triplet (j, i, k) at utilization U_jik. Note that P_jik is a function of U_jik. Thus the summation term in the objective function represents the power/energy consumption of client j when and if mapped to server i at utilization level k, U_jik. Therefore the objective function is that of energy minimization. The first constraint is the average response time constraint, which takes the weighted average of the average response times τ_jik,avg for triplet (j, i, k) at utilization U_jik. Note that τ_jik,avg is a function of U_jik (cf. eq. (4.4)). τ_jik,avg is weighted with respect to the fraction of client j's requests that are served by server i at utilization level k, U_jik, i.e., the term μ_jik U_jik x_jik / λ_j. The second constraint is the throughput constraint for each client j and the third constraint is the capacity constraint for each server i. We refer to this solution as Pseudo-opt.
Note that we replace τ_jik,avg with β_jik for the stochastic maximum response time constraint, thus ensuring, for each client j across all servers and utilization levels, that the fraction of requests violating τ_j,max is below β_j, i.e., \sum_{i}\sum_{k} \beta_{jik}(U_{jik})\, \mu_{jik}\, U_{jik}\, x_{jik} / \lambda_j \le \beta_j, \forall j.
For profit maximization we update the objective function to \max \sum_{j}\sum_{i}\sum_{k} \Phi_{jik}(U_{jik})\, \mu_{jik}\, U_{jik}\, x_{jik}, where Φ_jik is the per request profit, which is a function of U_jik.
Two serious concerns for implementing Pseudo-opt are worth mentioning here. 1) It
treats each server in a pool individually thereby causing a huge explosion in the state
space that needs to be searched between all client-server pairs. 2) The utilization range (0,
1) is split into multiple levels and hence the search space multiplies by the number of
levels. The net result of this explosion is that Pseudo-opt even for our small hosting
center configurations runs approximately 75,000X slower than NFRA.
4.5.3 Throughput Constrained Optimization
The greedy approach implemented for throughput constrained energy minimization sorts the energy per request values, e_ij, corresponding to client j and server pool i, in non-decreasing order. From this sorted list it picks each server pool and client pair (i, j), one at a time, and allocates resources so as to meet the throughput requirement.
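A compact sketch of this greedy baseline is given below; the function name, data layout and sample numbers are hypothetical.

# Sketch: greedy baseline for throughput constrained energy minimization.
# Pairs (i, j) are visited in non-decreasing order of e_ij; servers of pool i
# are assigned to client j (at U = 1) until its arrival rate is covered.
def greedy_throughput(C, lam, mu, e):
    remaining = list(C)                     # unallocated servers per pool
    demand = list(lam)                      # unserved requests/s per client
    alloc = {}                              # (i, j) -> servers allocated (fractional)
    pairs = sorted(((e[i][j], i, j) for i in range(len(C)) for j in range(len(lam))))
    for _, i, j in pairs:
        if demand[j] <= 0 or remaining[i] <= 0:
            continue
        need = demand[j] / mu[i][j]         # servers needed to finish client j here
        used = min(need, remaining[i])
        alloc[(i, j)] = used
        remaining[i] -= used
        demand[j] -= used * mu[i][j]
    feasible = all(d <= 1e-9 for d in demand)
    return alloc, feasible

alloc, ok = greedy_throughput(C=[10, 10], lam=[5000.0, 3000.0],
                              mu=[[900.0, 400.0], [700.0, 600.0]],
                              e=[[0.3, 0.5], [0.4, 0.45]])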
Figure 4.9: Throughput Constrained Energy Optimization.
Figure 4.9 shows comparison between Greedy and NFRA. Note that Greedy is not
able to find a feasible solution corresponding to config_B for Instance_1 and Instance_2
workloads. This means that even though we had enough servers to be allocated to clients
so as to satisfy the throughput requirement, Greedy was unable to find a mapping. When
Greedy does find a feasible solution, the energy consumption of NFRA is 3.9% lower
than Greedy.
We can intuitively explain why the greedy solution, when it works, is not as bad as one would expect. Referring to the per request energy characterization of Figure 4.8, the
trend in per request energy across four different server pools remains similar for all three
workload types. Hence, the choice of which server pool to use is not very different
between Greedy and NFRA. However, as the heterogeneity of the server pools and
workloads increases Greedy would perform worse, as we will shortly demonstrate.
Second it is critical to note that NFRA always finds a feasible solution whenever such a
solution exists. Hence, when resources are provisioned for near-peak demand then NFRA
will find a feasible solution while Greedy will fail. When resources are scarce due to server downtimes, failures and/or maintenance, Greedy may have to employ admission control even though enough resources are available. Note that, due to heterogeneous server clusters, there is no closed form equation that dictates the capacity requirement for a given set of clients, and hence the existence of a feasible solution is non-trivial to check for Greedy.
4.5.4 Average Response Time Constrained Optimization
In this section we present the results for average response time constrained energy optimization and compare it with the Greedy and Pseudo-opt approaches. Greedy works similarly to the one used for throughput constrained optimization; once a pair (i, j) with the smallest e_ij is picked, the decision regarding how many requests of client j are assigned to server pool i is made using the U_ij,max (eq. (4.6)) that satisfies the average response time constraint.
Figure 4.10: Average Response Time Constrained Energy Optimization.
Figure 4.10 shows the comparison of NFRA with Greedy and Pseudo-opt. Here,
Greedy always finds a solution and is within a 5% margin of NFRA, with NFRA being, on average, 3.3% better than Greedy. The reason behind such a small difference in energy optimization between Greedy and NFRA is, once again, the trend in energy consumption mentioned in the previous section. With respect to the Pseudo-opt solution, NFRA results are within a 0.1% bound of Pseudo-opt. Hence our slightly conservative resource allocation does not negatively impact profit much. However, the difference in execution time is dramatic. NFRA on average took about 0.004 seconds to complete while Pseudo-opt takes 300 seconds, a 75,000X slowdown.
4.5.5 Stochastic Maximum Response Time Constrained Optimization
4.5.5.1 Energy Optimization
In this section we present the results for stochastic maximum response time constrained energy optimization. The Greedy approach implemented here is similar to the Greedy approach for average response time constrained energy optimization, except that U_ij,max is derived using eq. (4.11), which satisfies the stochastic maximum response time constraint.
Figure 4.11 shows the results using response time constraints specified in Table 4.1.
As shown in the figure, Greedy is about 6% worse than NFRA, if Greedy is able to find a
solution. Once again note that NFRA always finds a feasible solution while Greedy fails
in some instances. Second, the reason why Greedy is only 6% worse than NFRA is due to
similar trend in per request energy of the workloads across four server pools as
mentioned earlier. Figure 4.11 also shows the results of Pseudo-opt. As shown, once
again, NFRA results are within 0.1% bound of the Pseudo-opt. Thus NFRA achieves
results that are very close to optimal at much lower execution time overhead.
Figure 4.11: Stochastic Maximum Response Time Constrained Energy Optimization.
4.5.5.2 Profit Maximization
For profit maximization, the greedy approach sorts (i, j) pairs according to the per request profit in non-increasing order, where the profit is obtained by subtracting e_ij values from R_j. Once a pair (i, j) is picked it greedily picks the utilization value that, while satisfying the response time constraint dictated by U_ij,max, maximizes the per request profit, i.e., U_ij,lopt, and accordingly assigns client j's requests to server pool i. Figure 4.12 shows the results for profit maximization, where the results of Greedy and Pseudo-opt are normalized to the result of NFRA. Here we set the edge splitting cardinality K=8. As shown in the figure, NFRA achieves, on average, 9.4% more profit than Greedy for the Instance_3 workload, while Greedy fails to find any feasible solutions for the Instance_1 and Instance_2 workloads. Furthermore, Figure 4.12 also shows the Pseudo-opt results normalized to NFRA, which are again within a 0.1% bound.
Figure 4.12: Profit Optimization.
The superiority of NFRA over Greedy may be highlighted with more heterogeneous setups where different client requests exhibit stark execution time differences on different server pools. To demonstrate this aspect we chose five different hosting center setups with two server pools consisting of HP_A and HP_B servers. The five hosting center combinations are (# of HP_A servers, # of HP_B servers): a) (30, 10), b) (25, 15), c) (20, 20), d) (15, 25), e) (10, 30). We chose only two clients, E-commerce and Support, due to their stark difference in T_ij values across HP_A and HP_B servers. Using the same setup we also demonstrate the importance of accounting for non energy proportionality. Here we modified our profit maximizing solution such that it does not scale the per request energy, e_ij, for client j on server pool i. Hence the profit function Φ_ij^k of eq. (4.20) only accounts for the utilization dependent price, π_ij^k, while keeping the energy cost constant at e_ij, independent of utilization. Since the energy cost is constant we will no longer have a concave downward function (cf. Figure 4.5). Hence in this new, energy proportionality oblivious, profit maximizing solution we must consider the range (0, U_ij,max) of server utilization. We divided this whole range for each server pool and client pair into K=64 levels.
Figure 4.13: Impact of Large Scale Heterogeneity.
Figure 4.13 shows the results, with NFRA performing 38.8, 27.3, 17.5, and 12.9% better than Greedy for configurations a, b, c, and d respectively. For configuration e Greedy does not find a solution. With respect to the energy proportionality oblivious solution, NFRA performs 17.4, 16.9, 14.4, 10.7 and 8.2% better for configurations a, b, c, d and e respectively. The results show that in the presence of a large number of energy inefficient servers 1) NFRA outperforms Greedy, and by a much larger margin, and 2) accounting for non energy proportionality becomes important.
As mentioned earlier, the profit maximization problem that we formulated in Section 4.4.3.3.2 is an approximation. Since we split the range [U_ij,lopt, U_ij,max] into K levels, there is a quantization error. As we increase the value of K this error reduces and the results improve. In order to understand the impact of K on the approximation result we obtain the profit maximization results using different values of K for config_A. Figure 4.14 shows
139
the result with profit normalized to K=1. Across all three workloads, beyond K=4 the
improvement in the solution is below 0.1%. However, as K increases, NFRA execution
time increases super-linearly as shown on the secondary Y-axis, thereby increasing the
overhead. Note that the improvement over different K values is also a function of server
pool and client characterization. Hence the data shown in Figure 4.14 may not hold
across different server pool and/or client configurations, but the proposed iterative edge
splitting approach will be able to quickly converge to a good solution given appropriate
threshold bounds on incremental profit.
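As a simple illustration of the quantization, the sketch below splits [U_ij,lopt, U_ij,max] into K uniformly spaced levels, one candidate edge per level. The uniform spacing and the function name are assumptions made for clarity; the actual iterative edge splitting procedure refines the levels adaptively rather than on a fixed grid.

```python
# Illustrative quantization of the feasible utilization range of one
# (server pool i, client j) pair into K candidate levels, one edge per level in
# the generalized flow network. Uniform spacing is assumed for simplicity.
def utilization_levels(u_lopt, u_max, k):
    if k == 1:
        return [u_lopt]                      # single level: locally optimal point only
    step = (u_max - u_lopt) / (k - 1)
    return [u_lopt + n * step for n in range(k)]

# Hypothetical example: U_ij,lopt = 0.45, U_ij,max = 0.85, K = 4
print(utilization_levels(0.45, 0.85, 4))     # [0.45, 0.583..., 0.716..., 0.85]
```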
Figure 4.14: Impact of Edge Splitting Cardinality on Approximation.
To understand how the NFRA runtime scales to much larger setups we solved profit maximizing NFRA across four different setups, characterized by the parameters (# of pools, # of clients, # of servers across all pools) given as follows: (12, 24, 600), (24, 40, 1200), (36, 56, 1800), and (48, 72, 2400). These setups were generated by scaling config_A and the Instance_1 workload. In order to scale up the number of server pools we took the existing baseline config_A setup and generated new server pools by scaling up each of the server pools in terms of their frequency and power values. Thus from our 4 baseline server pools, HP_A, HP_B, HP_C, and HP_D, we created 12, 24, 36, and 48 server pools by scaling up the performance and power values of each server pool. Furthermore, the Instance_1 workload was similarly scaled up to generate T_ij and e_ij values for each new server pool by applying the same scaling factors that we used to scale the server pools. Note that, for the purpose of runtime scaling, what matters is the number of variables each new hosting center setup introduces rather than the precise values of T_ij and e_ij. In Figure 4.15 we show how the runtime scales on our dual-core 1.86 GHz Xeon machine with 2 GB of RAM. As shown in the figure, even for the largest setup the run time for K = 4 is about 2 seconds, while the profit is within 0.1% of that obtained with K = 16. However, with K = 16 the run time increases substantially (3X).
Figure 4.15: NFRA Runtime Scaling.
4.5.6 Sensitivity Analysis
In this section we present the impact of variation in various parameters on the optimality of the solution and on the SLA constraints. For this purpose we perform sensitivity analysis for profit maximizing NFRA by introducing a percentage variation in the arrival rates (λ_j) of all the clients. For simplicity, we introduce the variation homogeneously across all the clients, since otherwise the exploration space would explode. Our sensitivity analysis was carried out for profit maximizing NFRA for Instance_3 under config_A. First we obtain the optimal resource allocation, opt, without considering any variation, with the corresponding optimal profit denoted as profit_opt. Then we apply different amounts of variation in the arrival rate, as shown on the X-axis in Figure 4.16.
Figure 4.16: Arrival Rate Variation.
Due to this variation, the fraction of requests, denoted by β_new, that violate the response time will differ from the fraction resulting from the optimal solution, which ignored variation. The results on the primary Y-axis of Figure 4.16 show how far β_new is from the β stipulated in the SLA. We denote this distance by the percentage difference Δβ. Furthermore, we obtain a new optimal resource allocation, opt_new, considering the new arrival rates resulting from the variation, with the corresponding optimal profit denoted as profit_newopt. Note that the profit achieved by opt under the variation, denoted by profit'_opt, will be different from profit_opt. The secondary Y-axis of Figure 4.16 shows how far profit'_opt is from profit_newopt. Obviously, if the average arrival rate is less than the originally estimated λ_j value, Δβ is negative, i.e., we satisfy more requests on time than stipulated by the SLA. Furthermore, in the worst case profit'_opt is 10.6% lower than profit_newopt, with the average being 5.1%. As the average arrival rate increases, particularly beyond 4% variation, Δβ goes above zero for many clients, leading to SLA violations. For example, for client 1 at 6% variation we violate β = 0.05 (cf. Table 4.1) by 21%; probabilistically, 6.05% of the requests, instead of 5%, will violate τ_j,max. This implies that we must either make an allowance for the expected variation during resource allocation for a given time epoch, or, based on dynamic monitoring of Δβ, solve a new instance of NFRA instead of waiting until the end of the time epoch. Note that the time epoch here refers to the length of the time interval at which resource allocation decisions are made. We carried out a similar analysis for variation in μ_ij and e_ij, reflecting the changing characteristics of clients' tasks that result from code updates, new features, etc., and obtained similar results in terms of their effect on the optimality of the solutions. However, variation in these parameters takes place at a much larger time scale compared to variations in the request arrival rate.
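The mechanics of the Δβ metric can be illustrated with a small, self-contained calculation. The response-time tail below assumes, purely for illustration, an M/M/1 model with exponential service; all of the numbers are hypothetical and are not the parameters of Instance_3 or config_A.

```python
from math import exp

def violation_fraction(lam, mu, tau_max):
    """Fraction of requests whose response time exceeds tau_max (M/M/1 tail)."""
    assert lam < mu, "the server must not be overloaded"
    return exp(-(mu - lam) * tau_max)

def delta_beta_pct(lam_planned, variation_pct, mu, tau_max, beta_sla):
    """Percentage gap between the realized violating fraction and the SLA beta."""
    lam_actual = lam_planned * (1.0 + variation_pct / 100.0)
    beta_new = violation_fraction(lam_actual, mu, tau_max)
    return 100.0 * (beta_new - beta_sla) / beta_sla

# Hypothetical numbers: the allocation was sized so that ~5% of requests miss
# tau_max at the planned arrival rate; a +6% swing pushes Delta-beta above zero.
print(delta_beta_pct(85.0, 0.0, 100.0, 0.2, 0.05))   # roughly at the SLA limit
print(delta_beta_pct(85.0, 6.0, 100.0, 0.2, 0.05))   # positive: SLA violated
```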
However, if variations in the arrival rate are frequent and have large amplitude, then we must solve a new instance of NFRA in order to stay close to the optimal solution, which may require turning more servers ON or OFF. Here we focus on increases in the arrival rates, which may require turning ON some new servers, either by 1) Reboot: unused servers are originally turned off and hence require a reboot when allocated, or 2) Wakeup: unused servers are put into the system sleep state S3 and hence require a wakeup when allocated. Sleeping servers still consume power, albeit much less than an ON server, e.g., for DRAM refresh; OFF servers consume zero power for all practical purposes. To understand the impact of the frequency and amplitude of variation in λ_j on profit and on the penalties associated with turning servers ON, we carry out the following experiment: in order to capture the frequency of variation we vary the length of the time epoch from 10 minutes to 60 minutes in increments of 10 minutes, shown on the X-axis in Figure 4.17. At each such time epoch we apply a 10, 15, or 20% increase in the arrival rates to capture the amplitude of variation and solve NFRA.
Figure 4.17: Impact of Reboot/Wakeup Latency and Power.
The resource reallocation may require a few more servers to be turned ON; the latencies associated with rebooting a server and waking it up from S3 are conservatively set to 90 [16] and 15 [31] seconds, respectively. During this reboot/wakeup latency, if the increase in the arrival rate is such that the previously allocated resources are under-provisioned, then servers may run at 100% utilization. Requests allocated to such servers will be queued until served but will violate the response time. Hence we conservatively assume that we pay a penalty on all such requests. Once the servers are up, either after reboot or wakeup, we start operating at the optimal point. We consider the S3 state power to be 1/20th of the idle power [34]. The results in Figure 4.17 show, for different time epoch lengths, how far the achieved profit is from the theoretical optimum. The theoretical optimum here assumes that servers are turned ON instantly.
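The two loss components behind this gap can be sketched with a simple back-of-the-envelope model; the function below and its parameters (penalty per violated request, number of spare servers, idle power) are assumptions for illustration and not the exact accounting used to produce Figure 4.17.

```python
# Illustrative model of the two penalties incurred when extra servers must be
# brought up mid-epoch: (1) SLA penalties on requests arriving during the
# reboot/wakeup latency, all conservatively assumed to violate the response
# time, and (2) standby energy drawn by S3 servers over the whole epoch under
# the Wakeup policy (OFF servers draw essentially no power under Reboot).
def transition_loss(extra_req_rate, penalty_per_req, latency_s,
                    epoch_s, n_spare, idle_power_w, use_wakeup):
    sla_loss = extra_req_rate * latency_s * penalty_per_req       # currency units
    if use_wakeup:
        standby_power = idle_power_w / 20.0                       # S3 ~ idle/20 [34]
        standby_energy_j = n_spare * standby_power * epoch_s
    else:
        standby_energy_j = 0.0
    return sla_loss, standby_energy_j

# Hypothetical comparison for a 10-minute (600 s) epoch with 5 spare servers:
print(transition_loss(50.0, 0.01, 90.0, 600, 5, 150.0, use_wakeup=False))  # Reboot
print(transition_loss(50.0, 0.01, 15.0, 600, 5, 150.0, use_wakeup=True))   # Wakeup
```

Converting the standby energy into the same currency as the SLA penalty (via an electricity price) reproduces the crossover behavior discussed next: short epochs and large swings favor Wakeup, while long epochs favor Reboot.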
Note that at smaller time epoch lengths, e.g., 10 minutes, the latency associated with reboot is very high compared to wakeup from S3; hence we incur high losses under Reboot at high variations, i.e., 15% and 20%. Therefore, under these circumstances Wakeup is better, as shown in Figure 4.17. At low variation, 10%, however, Reboot works better, since the losses incurred during the latency period are small and for the rest of the time epoch the unused servers consume no power. At larger epoch lengths Reboot works better, since under Wakeup the servers keep consuming power in the S3 state for the rest of the time epoch. Hence, if frequent and high variations are observed, Wakeup works better; otherwise Reboot works better.
4.6 Summary
Accounting for the impact of different applications on the energy consumption of hardware resources, e.g., servers, we presented in this chapter a system level energy minimization approach based on resource allocation in large scale distributed systems such as hosting centers. We observed that the heterogeneity of the servers deployed in today's hosting centers, which offer widely varying energy/performance characteristics for different applications, makes the resource allocation problem difficult. Furthermore, the non-energy-proportional nature of today's servers, which results in an increased energy cost per unit of work done at lower utilizations, introduces an additional challenge into the already complex resource allocation problem. Addressing these challenges, in this chapter we presented a generalized network flow based resource allocation framework called NFRA.

We showed that generalized networks provide the flexibility to model heterogeneity, while non-energy proportionality was accounted for by establishing the concave downward nature of the profit function, which captures the tradeoff between energy consumption and revenue as a function of utilization. Furthermore, we accounted for various SLAs in NFRA by placing appropriate bounds on the flow using modulated gain factors on the edges of the network, where nodes represent clients and server pools and flow represents resource allocation. We showed that NFRA provides the flexibility to account for various SLAs under different queuing models. Our experiments showed the superiority of NFRA over the greedy approach and its closeness to pseudo-optimal results, while the sensitivity analysis showed the impact of variation in different parameters.
Chapter 5. Conclusions and Future Work
Taking a holistic view of energy efficiency, from the circuit level to the system level, this thesis presented solutions for energy minimization at the circuit and architecture levels, along with solutions for energy minimization and profit maximization at the system level using dynamic resource allocation.
At the circuit level we presented two charge recycling based energy minimization solutions. Our first solution proposed write power reduction for multi-ported on-chip structures such as register files, reorder buffers, etc. The proposed technique exploits the fact that on-chip memory structures with dedicated write ports can preserve the charge status on the bit-lines after a write, and recycle this charge in the next write if the bit-lines are going to swing in opposite directions due to the new write value. Our second solution, for off-chip buses, proposed a pulsed charge recycling technique that recycles more charge by sequentially, instead of simultaneously, connecting groups of bus lines going through a falling transition to a group going through a rising transition.
Extending this work to on-chip caches, we exploited knowledge available at the architecture level that is not available at the circuit level. Here we first extended our charge recycling technique to memories with shared read/write ports, and proposed and implemented supporting circuitry to dynamically switch between a charge recycling based write operation mode and the regular cache operation mode. Once such support is available at the circuit level, we modified the cache controller and retirement logic to postpone the retirement of stores so as to retire them in clusters. Retiring stores in clusters generates back-to-back writes to the cache that can take advantage of the charge recycling based architecture.
Based on the observation that the aforementioned design time energy minimization techniques do not account for the higher level behavior of the system, i.e., application level or OS level behavior, we proposed a system level dynamic runtime solution for resource allocation in large scale systems, such as data centers, that accounts for application behavior. Observing that the complexity of the problem increases with the server heterogeneity in today's datacenters and with the lack of energy proportionality in today's servers, we proposed a generalized network flow based resource allocation framework called NFRA (Network Flow based Resource Allocation). NFRA accounts for both heterogeneity and energy proportionality while supporting a range of SLAs.
5.1 Future Directions
In this section we present the scope of future work by outlining a few potential directions for extending our architecture level and system level optimizations.
At the architecture level we observe that, despite using the late store retirement approach, only 41% of the total cache writes were CS (Charge Sharing) enabled writes. This is mainly due to two reasons: 1) frequent draining of the SQ (Store Queue) despite it not being full, and 2) the placement of cache blocks among different banks. The SQ can be drained frequently because we conservatively drain it every time a cache miss is observed. Instead, if for a given miss we are certain that it does not replace an existing cache line occupied by a store that is still in the SQ, then we do not need to drain the SQ. One may thus explore techniques that can determine whether a load miss will replace a cache line currently occupied by a committed store in the SQ, as sketched below. Furthermore, the second issue brings forth the point that if stores with high temporal locality belong to cache lines that are mapped to different banks, then despite having back-to-back writes they will not benefit from the underlying charge recycling technique. Thus, to address the second concern, one may explore cache replacement policies that can assist the CR cache in write power reduction without significantly affecting performance.
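One possible shape of the drain check is sketched below; the cache geometry, the set/tag arithmetic, and the idea of keeping the addresses of committed-but-unretired stores visible to the cache controller are all assumptions made for illustration, not the implemented design.

```python
# Hypothetical check: on a load miss, drain the store queue only if the line the
# miss would evict is still the target of a committed store waiting in the SQ.
BLOCK_BYTES = 64
NUM_SETS = 128

def set_index(addr):
    return (addr // BLOCK_BYTES) % NUM_SETS

def tag_of(addr):
    return addr // (BLOCK_BYTES * NUM_SETS)

def must_drain_sq(victim_addr, store_queue_addrs):
    """victim_addr: address of the line chosen for eviction by the miss."""
    v_set, v_tag = set_index(victim_addr), tag_of(victim_addr)
    return any(set_index(a) == v_set and tag_of(a) == v_tag
               for a in store_queue_addrs)
```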
At the system level we observe that one of the issues addressed by NFRA was the non-energy proportionality prevalent in today's servers. One may explore various techniques to address this issue so as to make servers more energy proportional; one such avenue is page compression. It is known that I/O devices, e.g., disks, main memory, and other peripherals, are instrumental to the non-energy-proportional nature of today's server systems. Evidently, a large amount of power in hard drives is spent in spinning, which is required even when the drives are idle. This causes the idle and active power of hard drives to be non-energy proportional. In order to tackle this issue, one may investigate mechanisms that give disks more idle time so that they can either be shut off or spun down to reduce idle power; page compression is one such path of exploration. With the use of page compression one can free up more main memory on the server, which can act either as swap space that earlier resided on the disk drives or as a prefetch buffer for future pages. Furthermore, in large scale distributed systems such as data centers one can consolidate the disk bandwidth requirement across different servers. For example, when the memory pressure is high even after page compression, instead of swapping pages to the local hard drives, they could be offloaded to the memory of other servers or to the swap space on the hard drives of other servers, such that the idle time of a given server's hard drive is maximized while accounting for the performance penalty incurred.
Bibliography
[1] ACPI: http://www.acpi.info/
[2] Barroso, L. A., Hölzle, U., “The Case for Energy-Proportional Computing,”
Computer 40, 12, 33-37, December 2007.
[3] Basu, K., Choudhary, A., Pisharath, J., Kandemir, M., “Power protocol: reducing
power dissipation on off-chip data buses,” In Proceedings of the 35th Annual
ACM/IEEE international Symposium on Microarchitecture, IEEE Computer
Society Press, 345-355, November 2002.
[4] Benini, L., De Micheli, G., Macii, E., Sciuto, D., Silvano, C., “Address bus
encoding techniques for system-level power optimization,” In Proceedings of the
Conference on Design, Automation and Test in Europe, IEEE Computer Society,
861-867, February 1998.
[5] Benini, L., de Micheli, G., Macii, E., Sciuto, D., Silvano, C., “Asymptotic Zero-
Transition Activity Encoding for Address Busses in Low-Power Microprocessor-
Based Systems,” In Proceedings of the 7th Great Lakes Symposium on VLSI,
IEEE Computer Society, 77-82, March 1997.
[6] Bennani, M.N., Menasce, D.A., "Resource Allocation for Autonomic Data
Centers using Analytic Performance Models," Autonomic Computing,
Proceedings. Second International Conference on, 229-240, June 2005.
[7] Bianchini, R., Rajamony, R., "Power and energy management for server
systems," Computer , vol.37, no.11, 68- 76, November 2004.
[8] Bishop, B., Irwin, M.J., "Databus charge recovery: practical considerations," Low
Power Electronics and Design, Proceedings. 1999 International Symposium on,
pp. 85- 87, August 1999.
[9] Bolch, G., Greiner, S., de Meer, H., Trivedi, K. S., “Queueing Networks and
Markov Chains: Modeling and Performance Evaluation with Computer Science
Applications,” Wiley-Interscience.
[10] Borah, M., Owens, R.M., Irwin, M.J., "Transistor sizing for low power CMOS
circuits," Computer-Aided Design of Integrated Circuits and Systems, IEEE
Transactions on , vol.15, no.6, 665-671, Jun 1996.
[11] Brooks, D., Tiwari, V., Martonosi, M., “Wattch: a framework for architectural-
level power analysis and optimizations,” SIGARCH Comput. Archit. News 28, 2,
83-94, May 2000.
[12] Buyuktosunoglu, A., Albonesi, D. H., Bose, P., Cook, P. W., Schuster, S. E.,
“Tradeoffs in power-efficient issue queue design,” In Proceedings of the
international Symposium on Low Power Electronics and Design, 184-189, New
York August 2002.
[13] Carrera, E. V., Pinheiro, E., Bianchini, R., “Conserving disk energy in network
servers,” In Proceedings of the 17th Annual international Conference on
Supercomputing, 86-97, June 2003.
[14] Chandrakasan, A.P., Sheng, S., Brodersen, R.W., "Low-power CMOS digital
design," Solid-State Circuits, IEEE Journal of, vol.27, no.4, 473-484, Apr 1992.
[15] Chase, J. S., Anderson, D. C., Thakar, P. N., Vahdat, A. M., and Doyle, R. P.,
“Managing energy and server resources in hosting centers,” SIGOPS Oper. Syst.
Rev. 35, 5, 103-116, December 2001.
[16] Chen, Y., Das, A., Qin, W., Sivasubramaniam, A., Wang, Q., Gautam, N.,
“Managing server energy and operational costs in hosting centers,” SIGMETRICS
Perform. Eval. Rev. 33, 1, 303-314, June 2005.
[17] Cheng Shin-Pao, Huang Shi-Yu, "A low-power SRAM design using quiet-bitline
architecture," Memory Technology, Design, and Testing, IEEE International
Workshop on, 135-139, August 2005.
[18] Chuanjun Zhang, "Balanced Cache: Reducing Conflict Misses of Direct-Mapped
Caches," Computer Architecture, 33rd International Symposium on, 155-166,
July 2006.
[19] Cruz, J., González, A., Valero, M., Topham, N. P., “Multiple-banked register file
architectures,” SIGARCH Comput. Archit. News 28, 316-325, May 2000.
[20] Donno, M., Ivaldi, A., Benini, L., Macii, E., "Clock-tree power optimization
based on RTL clock-gating," Design Automation Conference, Proceedings, 622-
627, June 2003.
[21] Elnozahy, E. N., Kistler, M., Rajamony, R., “Energy-efficient server clusters,” In
Proceedings of the 2nd international Conference on Power-Aware Computer
Systems, 179-197, February 2002.
[22] Ernst, D., Nam Sung Kim, Das, S., Pant, S., Rao, R., Toan Pham, Ziesler, C.,
Blaauw, D., Austin, T., Flautner, K., Mudge, T., "Razor: a low-power pipeline
based on circuit-level timing speculation," Microarchitecture, Proceedings. 36th
Annual IEEE/ACM International Symposium on , 7- 18, December 2003.
[23] Fan, X., Ellis, C., Lebeck, A., “Memory controller policies for DRAM power
management,” In Proceedings of the international Symposium on Low Power
Electronics and Design, 129-134, 2001.
[24] Flachs, B., Asano, S., Dhong, S.H., Hofstee, P., Gervais, G., Kim, R., Le, T., Liu,
P., Leenstra, J., Liberty, J., Michael, B., Oh, H., Mueller, S.M., Takahashi, O.,
Hatakeyama, A., Watanabe, Y., Yano, N., "A streaming processing unit for a
CELL processor," Solid-State Circuits Conference, Digest of Technical Papers.
IEEE International , 134-135 Vol. 1, February 2005.
[25] Flautner, K., Kim, N. S., Martin, S., Blaauw, D., and Mudge, T., “Drowsy caches:
simple techniques for reducing leakage power,” SIGARCH Comput. Archit. News
30, 148-157, May 2002.
[26] Ghose, K. Kamble, M. B., “Reducing power in superscalar processor caches using
subbanking, multiple line buffers and bit-line segmentation,” In Proceedings of
the international Symposium on Low Power Electronics and Design. ACM, 70-75,
August 1999.
[27] Govindan, S., Choi, J., Urgaonkar, B., Sivasubramaniam, A., Baldini, A.,
“Statistical profiling-based techniques for effective power provisioning in data
centers,” In Proceedings of the 4th ACM European Conference on Computer
Systems, 317-330, April 2009.
[28] Gowan, M.K., Biro, L.L., Jackson, D.B., "Power considerations in the design of
the Alpha 21264 microprocessor," Design Automation Conference, Proceedings ,
726- 731, June 1998.
[29] Gurumurthi, S., Sivasubramaniam, A., Kandemir, M., Franke, H., "DRPM:
dynamic speed control for power management in server class disks," Computer
Architecture, Proceedings. 30th Annual International Symposium on, 169- 179,
June 2003.
[30] Heath, T., Diniz, B., Carrera, E. V., Meira, W., Bianchini, R., “Energy
conservation in heterogeneous server clusters,” In Proceedings of the Tenth ACM
SIGPLAN Symposium on Principles and Practice of Parallel Programming, 186-
195, June 2005.
[31] Horvath, T., Skadron, K., “Multi-mode energy management for multi-tier server
clusters,” In Proceedings of the 17th international Conference on Parallel
Architectures and Compilation Techniques, 270-279, October 2008.
[32] http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0329l/index.html
[33] http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.subset.cortexa.a8/index.html
[34] http://terranovum.com/projects/energystar/standby_v_hiber.html
[35] http://www.energystar.gov/index.cfm?c=prod_development.serverefficiency#epa. Report to Congress on server and data center energy efficiency. Retrieved October 2009.
[36] http://www.hpl.hp.com/research/cacti/
[37] http://www.simplescalar.com
[38] http://www.spec.org
[39] http://www.tomshardware.com/reviews/intel,264-5.html
[40] Hu, J., Xu, T., Li, H., “A Lower-Power Register File Based on Complementary
Pass-Transistor Adiabatic Logic,” IEICE - Trans. Inf. Syst. E88-D, 1479-1485,
July 2005.
[41] Huang, M., Renau, J., Yoo, S., Torrellas, J., “L1 data cache decomposition for
energy efficiency,” In Proceedings of the 2001 international Symposium on Low
Power Electronics and Design, 10-15, 2001.
[42] Hur, I., Lin, C., "A comprehensive approach to DRAM power management,"
High Performance Computer Architecture, IEEE 14th International Symposium
on , 305-316, February 2008.
[43] Kanda, K., Sadaaki, H., Sakurai, T., "90% write power-saving SRAM using
sense-amplifying memory cell," Solid-State Circuits, IEEE Journal of , vol.39,
no.6, 927- 933, June 2004.
[44] Karandikar, A., Parhi, K.K., "Low power SRAM design using hierarchical
divided bit-line approach," Computer Design: VLSI in Computers and Processors,
Proceedings. International Conference on, 82-88, October 1998.
[45] Kaxiras, S., Hu, Z., and Martonosi, M., “Cache decay: exploiting generational
behavior to reduce cache leakage power,” SIGARCH Comput. Archit. News 29,
240-251, May 2001.
[46] Keejong Kim, Mahmoodi, H., Roy, K., "A Low-Power SRAM Using Bit-Line
Charge-Recycling," Solid-State Circuits, IEEE Journal of , vol.43, no.2, 446-459,
February 2008.
[47] Khoo, K., Wilson, A. N., “Charge recovery on a databus,” In Proceedings of the
1995 international Symposium on Low Power Design, ACM, 185-189, April 1995.
[48] Kim J-H., Papaefthymiou M. C., “Constant-Load Energy Recovery Memory for Efficient High-Speed Operation,” Proc. of the Int'l Symposium on Low Power Electronics and Design, pp. 240-243, August 2004.
[49] Kim J-H., Ziesler C. H., “Fixed-Load Energy Recovery Memory for Low Power,”
Proc. of IEEE Computer Society Annual Symposium on VLSI Emerging Trends
in VLSI System Design, pp. 145-150, February 2004.
[50] Kim N., Austin T., Mudge T., "Low-Energy Data Cache using Sign Compression
and Cache Line Bisection", 2nd Annual Workshop on Memory Performance
Issues, May 2002.
[51] Kim, N. S., Mudge, T., “The microarchitecture of a low power register file,” In
Proceedings of the international Symposium on Low Power Electronics and
Design, 384-389, August 2003.
[52] Kondo, M., Nakamura, H., “A Small, Fast and Low-Power Register File by Bit-
Partitioning,” In Proceedings of the 11th international Symposium on High-
Performance Computer Architecture, IEEE Computer Society, 40-49, February
2005.
[53] Kretzschmar, C., Nieuwland, A.K., Muller, D., "Why transition coding for power
minimization of on-chip buses does not work," Design, Automation and Test in
Europe Conference and Exhibition, Proceedings , 512- 517 February 2004.
[54] Kucuk, G., Ponomarev, D., Ghose, K., “Low-complexity reorder buffer
architecture,” In Proceedings of the 16th international Conference on
Supercomputing, 57-66, June 2002.
[55] Lee, C., Potkonjak, M., Mangione-Smith, W. H., “MediaBench: a tool for evaluating and synthesizing multimedia and communications systems,” In Proceedings of the 30th Annual ACM/IEEE International Symposium on Microarchitecture, 330-335, December 1997.
[56] Lee, H. S. Tyson, G. S., “Region-based caching: an energy-delay efficient
memory architecture for embedded processors,” In Proceedings of the
international Conference on Compilers, Architecture, and Synthesis for
Embedded Systems, ACM, 120-127, November 2000.
[57] Liu, L., Wang, H., Liu, X., Jin, X., He, W. B., Wang, Q. B., Chen, Y.,
“GreenCloud: a new architecture for green data center,” In Proceedings of the 6th
international Conference on Autonomic Computing and Communications industry
Session, 29-38, June 2009.
[58] Liu, Z., Squillante, M. S., Wolf, J. L., “On maximizing service-level-agreement
profits,” In Proceedings of the 3rd ACM Conference on Electronic Commerce,
213-223, October 2001.
[59] lpsolve: http://lpsolve.sourceforge.net/
[60] Lyuboslavsky, V., Bishop, B., Narayanan, V., Irwin, M.J., "Design of databus
charge recovery mechanism," ASIC/SOC Conference, Proceedings. 13th Annual
IEEE International , pp.283-287, September 2000.
[61] M. R. Stan and W. P. Burleson, “Low Power Encoding for Global
Communication in CMOS VLSI,” In Proc. of IEEE Trans. on Very Large Scale
Integration Systems, Vol. 5, No. 4, pp. 444-455, December 1997.
[62] Mai, K.W., Mori, T., Amrutur, B.S., Ho, R., Wilburn, B., Horowitz, M.A.,
Fukushi, I., Izawa, T., Mitarai, S., "Low-power SRAM design using half-swing
pulse-mode techniques," Solid-State Circuits, IEEE Journal of, vol.33, no.11,
1659-1671, Nov 1998.
[63] Manne, S., Klauser, A., Grunwald, D., "Pipeline gating: speculation control for energy reduction," Computer Architecture, Proceedings. The 25th Annual International Symposium on, 132-141, July 1998.
[64] Martin, S.M., Flautner, K., Mudge, T., Blaauw, D., "Combined dynamic voltage
scaling and adaptive body biasing for lower power microprocessors under
dynamic workloads," Computer Aided Design, IEEE/ACM International
Conference on, vol., no., 721- 725, November 2002.
[65] McNairy, C., Soltis, D., "Itanium 2 processor microarchitecture," Micro, IEEE,
vol.23, no.2, 44- 55, March-April 2003.
[66] Molka, D., Hackenberg, D., Schone, R., Muller, M.S., "Memory Performance and
Cache Coherency Effects on an Intel Nehalem Multiprocessor System," Parallel
Architectures and Compilation Techniques, 18th International Conference on,
261-270, September 2009.
[67] Musoll, E., Lang, T., Cortadella, J., “Exploiting the locality of memory references
to reduce the address bus energy,” In Proceedings of the international Symposium
on Low Power Electronics and Design, ACM, 202-207, August 1997.
[68] Nicolaescu, D., Veidenbaum, A., Nicolau, A., “Reducing data cache energy
consumption via cached load/store queue,” In Proceedings of the international
Symposium on Low Power Electronics and Design, ACM, 252-257, August 2003.
[69] P. P. Sotiriadis, and A. Chandrakasan, “Low power bus coding techniques
considering inter-wire capacitances,” In Custom Integrated Circuits Conference,
Proceedings of the IEEE, pp. 507-510, May 2000.
[70] Park B. K., Chang Y. S., Kyung C. M., “Confirming inverted data store for low-power memory,” in Proc. Int. Symp. Low-Power Electronics and Design (ISLPED), 91-93, August 1999.
[71] Patel K., Annavaram M., Pedram M., "NFRA: Generalized network flow based
resource allocation for profit maximizing hosting centers," Under review.
[72] Patel K., Lee W-B., Pedram M., Annavaram M., "CR-Cache: Write Power
Reduction in Data Caches by Bit-line Charge Recycling", Submitted to IEEE
Transactions on Computers, 2010.
[73] Patel, K., Lee, W., Pedram, M., “In-order pulsed charge recycling in off-chip data
buses,” In Proceedings of the 18th ACM Great Lakes Symposium on VLSI, ACM,
371-374, May 2008.
[74] Patel, K., Lee, W., Pedram, M., “Minimizing power dissipation during write
operation to register files,” In Proceedings of the international Symposium on
Low Power Electronics and Design, 183-188, August 2007.
[75] Predictive Technology Model (PTM) at http://www.eas.asu.edu/~ptm/
[76] Raghavendra, R., Ranganathan, P., Talwar, V., Wang, Z., Zhu, X., “No "power"
struggles: coordinated multi-level power management for the data center,” In
Proceedings of the 13th international Conference on Architectural Support For
Programming Languages and Operating Systems, 48-59, March 2008.
[77] Rajamani, K., Lefurgy, C., "On evaluating request-distribution schemes for saving
energy in server clusters," Performance Analysis of Systems and Software, IEEE
International Symposium on, 111- 122, March 2003.
[78] Ranjan, S., Rolia, J., Fu, H., Knightly, E., "QoS-driven server migration for
Internet data centers," Quality of Service, Tenth IEEE International Workshop on,
3- 12, August 2002.
[79] Rusu, C., Ferreira, A., Scordino, C., Watson, A., "Energy-Efficient Real-Time
Heterogeneous Server Clusters," Real-Time and Embedded Technology and
Applications Symposium, Proceedings of the 12th IEEE , 418- 428, April 2006.
[80] Sapatnekar, S. S. and Chuang, W. Power-delay optimizations in gate sizing. ACM
Trans. Des. Autom. Electron. Syst. 5, 1 (Jan. 2000), 98-114.
[81] Savransky E., Ronen R., Gonzalez A., “Lazy Retirement: A Power Aware
Register Management Mechanism,” Proc. Workshop Complexity-Effective Design,
2002.
[82] SEGARS, S., “Low power design techniques for microprocessors,” In IEEE
International Solid-State Circuits Conference Tutorial, February 2001.
[83] Shang Li, Peh Li-Shiuan, Jha, N.K., "Dynamic voltage scaling with links for
power optimization of interconnection networks," High-Performance Computer
Architecture, Proceedings. The Ninth International Symposium on, 91- 102,
February 2003.
[84] Shim, H., Joo, Y., Choi, Y., Lee, H. G., Chang, N., “Low-energy off-chip
SDRAM memory systems for embedded applications,” ACM Trans. Embed.
Comput. Syst. 2, 1, 98-130, February 2003.
[85] Sotiriadis, P., Konstantakopoulos, T., Chandrakasan, A., “Analysis and
implementation of charge recycling for deep sub-micron buses,” In Proceedings
of the 2001 international Symposium on Low Power Electronics and Design ,
ACM, 364-369, August 2001.
[86] SPECWeb2009: http://www.spec.org/web2009
[87] Tseng, J. H. Asanović, K., “Banked multiported register files for high-frequency
superscalar microprocessors,” In Proceedings of the 30th Annual international
Symposium on Computer Architecture, 62-71, New York, June 2003.
[88] Urgaonkar, B., Shenoy, P., Roscoe, T., “Resource overbooking and application
profiling in shared hosting platforms,” SIGOPS Oper. Syst. Rev. 36, 239-254,
December 2002.
[89] Vaidya, P. M., “Speeding-up linear programming using fast matrix
multiplication,” In Proceedings of the 30th Annual Symposium on Foundations of
Computer Science, IEEE Computer Society, 332-337, October/November 1989.
[90] Verma, A., Ghosal, S., “On admission control for profit maximization of
networked service providers,” In Proceedings of the 12th international
Conference on World Wide Web, 128-137, May 2003.
[91] Villa, L., Zhang, M., and Asanović, K, “Dynamic zero compression for cache
energy reduction,” In Proceedings of the 33rd Annual ACM/IEEE international
Symposium on Microarchitecture, 214-220, New York 2000.
[92] Wayne, K. D., “Generalized Maximum Flow Algorithms,” Doctoral Thesis. UMI
Order Number: AAI9910194., Cornell University, 1999.
[93] Wu Qing, Pedram, M., Wu Xunwei, "Clock-gating and its application to low
power design of sequential circuits," Circuits and Systems I: Fundamental Theory
and Applications, IEEE Transactions on, vol.47, no.3, 415-420, March 2000.
[94] Yang, B., Kim, L, “A low-power ROM using single charge-sharing capacitor and
hierarchical bit line,” IEEE Trans. Very Large Scale Integr. Syst. 14, 313-322,
April 2006.
[95] Yang, B., Kim, L., “A low-power charge-recycling ROM architecture,” IEEE
Trans. Very Large Scale Integr. Syst. 11, 590-598, August 2003.
[96] Yen-Jen Chang, Feipei Lai, Chia-Lin Yang, "Zero-aware asymmetric SRAM cell
for reducing cache power in writing zero," Very Large Scale Integration (VLSI)
Systems, IEEE Transactions on , vol.12, no.8, 827- 836, August 2004.
[97] Zhang, C., Vahid, F., Najjar, W., "A highly configurable cache architecture for
embedded systems," Computer Architecture, Proceedings. 30th Annual
International Symposium on, 136- 146, June 2003.
[98] Zhang, L. Ardagna, D., “SLA based profit optimization in autonomic computing
systems,” In Proceedings of the 2nd international Conference on Service
Oriented Computing, 173-182, November 2004.
[99] Zhu Qingbo, David, F.M., Devaraj, C.F., Li Zhenmin, Zhou Yuanyuan, Cao Pei,
"Reducing Energy Consumption of Disk Storage Using Power-Aware Cache
Management," High Performance Computer Architecture, Proceedings. 10th
International Symposium on, 118- 129, February 2004.
Abstract
The importance of energy efficiency in electronic systems is ever increasing, from embedded systems such as smart phones to large scale distributed systems such as datacenters. Modern battery-powered embedded systems are complex devices providing various functionalities while supporting a wide range of applications, leading to a complex energy profile and necessitating energy efficient design for longer battery life. On the other end of the spectrum lie complex large-scale distributed systems such as data centers. Such systems consume not only significant computing power but also cooling power in order to remove the heat generated by the information technology equipment. The issue of energy efficiency in such systems can be addressed at various levels of system design, e.g., circuit/architecture level design time solutions or operating system/application level runtime solutions. In this thesis, we present circuit and architecture level design time solutions for modern microprocessors based on the concept of charge sharing, a technique that is applicable to all kinds of systems independent of the usage scenario, and system level runtime solutions based on energy-aware resource allocation that are mostly applicable to data centers.