Energy-efficient shutdown of circuit components and computing systems
(USC Thesis Other)
ENERGY-EFFICIENT SHUTDOWN OF CIRCUIT COMPONENTS AND COMPUTING SYSTEMS

by Ehsan Pakbaznia

A Dissertation Presented to the FACULTY OF THE USC GRADUATE SCHOOL, UNIVERSITY OF SOUTHERN CALIFORNIA, In Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (ELECTRICAL ENGINEERING)

August 2010

Copyright 2010 Ehsan Pakbaznia

Dedication

To my family for their everlasting love and support

Acknowledgements

The completion of this doctoral dissertation has been a long journey made possible through the inspiration and support of a handful of people.

First and foremost, I especially would like to thank my Ph.D. advisor and dissertation committee chairman, Professor Massoud Pedram, for his enthusiasm, mentorship, and encouragement throughout my graduate studies and academic research. His passion for scientific discoveries and his dedication in making technical contributions to the computer engineering community have been invaluable in shaping my academic and professional career.

I was delighted to interact with Dr. Farzan Fallah, who served as my Ph.D. co-advisor for a number of projects while at Fujitsu Labs of America. His insight and perspective to the practical aspects of my research proved valuable.

Special thanks go to my dissertation and qualification committees, including Prof. Jeffrey Draper, Prof. Peter A. Beerel, Prof. Sandeep K. Gupta, and Prof. Aiichiro Nakano for their time, guidance, and feedback.

I would like to thank those who have in many ways contributed to the success of my academic endeavors. They are the Electrical Engineering staff at the University of Southern California, particularly Annie Yu, Diane Demetras, and Tim Boston; the members of the System Power Optimization and Regulation Technology (SPORT) Laboratory; and my best and dearest friends.

Last but not least, my deepest gratitude goes to my family for their unconditional love, support, and devotion throughout my life and especially during my education.

Table of Contents

Dedication
Acknowledgements
List of Tables
List of Figures
Abstract
Chapter 1: Introduction
  1.1 Energy-Efficient Computing
  1.2 Circuit-Level Energy Efficient Design
    1.2.1 Total Power Saving Factor
    1.2.2 Energy-Efficient Power Gating Structures
  1.3 System-Level Energy Efficient Design
    1.3.1 Energy-Efficiency in Datacenters
  1.4 Overview of the Dissertation
Chapter 2: Charge Recycling for Power-Gated Circuits
    2.4.1 Virtual Node Voltage Values in the Sleep Mode
    2.4.2 Charge-Recycling for Mode-Transition Energy Saving
  2.5 Energy Saving Analysis for Charge Recycling Technique
  2.6 Some Considerations
    2.6.1 Effect of the Threshold Voltages of the TG
    2.6.2 Effect of the Transistor Sizes of the TG
  2.7 Leakage Current and Ground Bounce Analysis
    2.7.1 Leakage Current
    2.7.2 Ground Bounce
  2.8 Variants of the Charge-Recycling Technique
    2.8.1 Charge-Recycling Between the Same Type of Virtual Rails
    2.8.2 Charge-Recycling for Blocks with Different Power Supply Levels
    2.8.3 Charge-Recycling for Super Cut-off CMOS
  2.9 Simulation Results
Chapter 3: Multimodal Power Gating
  3.1 Introduction
  3.2 Prior Work
  3.3 Tri-Modal Switch
    3.3.1 Circuit Configuration and Switch Functionality
    3.3.2 Leakage Equations
    3.3.3 Data Retention and Noise Stability
    3.3.4 Transistor Sizing
  3.4 Data-Retentive Power Gating
    3.4.1 Proposed Architecture
    3.4.2 Placement with Row Sectioning
  3.5 Multi-Drowsy Mode Circuits
  3.6 Voltage-Scaling Using Multimodal Headers
  3.7 Simulation Results
    3.7.1 Data-Retentive Power Gating: Design Flow
    3.7.2 Data-Retentive Power Gating: Results
    3.7.3 The Effect of Technology Scaling on Data-Retentive Power Gating
    3.7.4 Multi-Drowsy Mode Circuits
    3.7.5 TPSF Measure by Way of an Example
Chapter 4: Temperature-Aware Dynamic Resource Provisioning in a Power-Optimized Datacenter
  4.1 Introduction
  4.2 Prior Work
  4.3 Preliminaries
    4.3.1 Datacenter Configuration
    4.3.2 Power Model for Blade Servers
    4.3.3 Heat Transfer Equations
  4.4 Datacenter Power Modeling
    4.4.1 Power Consumption of the CRAC Unit
    4.4.2 Total Power Consumption
  4.5 Temperature-Aware Dynamic Provisioning and Power Optimization
    4.5.1 Workload Monitor
      4.5.1.1 Calculating the Required Server Count
      4.5.1.2 Workload Prediction
    4.5.2 Power/Temperature Manager
      4.5.2.1 Server Retirement Policy
      4.5.2.2 Server Employment Policy
      4.5.2.3 Calculating the Optimum Ts Value
  4.6 Simulation Results
    4.6.1 Simulation Setup
    4.6.2 Workload Generation
    4.6.3 Power and Ts Comparison
Chapter 5: Conclusions
References

List of Tables

Table 2.1: Comparison of the Dynamic Mode-Transition Energy Dissipation of MTCMOS and CR-MTCMOS Circuits.
Table 2.2: Comparison of the Dynamic Mode-Transition Energy Dissipation of ST-MTCMOS, NP-MTCMOS and CR-MTCMOS Circuits.
Table 2.3: Comparison of the Dynamic Mode-Transition Energy Dissipation of NP-SCCMOS and CR-SCCMOS Circuits, VDD = 1.2 V.
Table 2.4: Ground Bounce Comparison between MTCMOS and CR-MTCMOS Circuits, VDD = 1.2 V, L = 5 nH, R = 5 Ω.
Table 3.1: Tri-mode switch functionality.
Table 3.2: Stability and data retention of DFF in drowsy mode.
Table 3.3: Achieving different scaled values of the supply voltage for a full-adder circuit for TSMC 0.18 um and VDD = 1.8 V.
Table 3.4: Leakage, GB and w/r latency comparisons.
Table 3.5: Delay, area and routing comparisons.
Table 3.6: Leakage comparisons for TSMC 90 nm.
Table 3.7: Ready latencies for multi-drowsy mode ISCAS85 circuits.
Table 3.8: Leakage current for different modes in multi-drowsy implementation of ISCAS85 benchmark circuits in 90 nm technology, VDD = 1.2 V.

List of Figures

Figure 2.1: Sleep transistor slows down the gate.
Figure 2.2: Coarse-grain MTCMOS.
Figure 2.3: Using multiple parallel sleep transistors to gradually wake up a circuit block.
Figure 2.4: The conventional power gating structure using an NMOS or a PMOS sleep transistor for each circuit block.
Figure 2.5: The virtual ground voltage in the sleep mode, VDD = 1.2 V.
Figure 2.6: The proposed charge-recycling configuration for power gating structures.
Figure 2.7: The proposed charge-recycling configuration with a TG realization of the charge sharing switch.
Figure 2.8: The charge recycling waveforms for an inverter chain implemented in a 70 nm CMOS technology.
Figure 2.9: Charge sharing between C1 and C2 when using a TG to realize the charge sharing switch.
Figure 2.10: Percentage of Energy Saving Ratio (ESR) versus size of the transmission gate used for 9sym benchmark circuit.
Figure 2.11: Wakeup time versus size of the transmission gate used for 9sym benchmark circuit.
Figure 2.12: The RL equivalent model of the ground used to analyze the GB effect in MTCMOS circuits.
Figure 2.13: The GB waveforms in the conventional and the CR structures for an inverter chain implemented in a 70 nm CMOS technology.
Figure 2.14: (a) Charge-recycling between two virtual grounds: VGND1 and VGND2 (b) Charge-recycling between the virtual rails of blocks with different supply levels.
Figure 2.15: Charge-recycling for SCCMOS circuits.
Figure 2.16: Percentage of energy saving versus mode-transition factor for different duty factors for 9sym circuit, fclk = 4 GHz.
Figure 3.1: Implementation of the tri-mode footer cell.
Figure 3.2: Implementation of the tri-mode header cell.
Figure 3.3: A DFF and negative noise applied on its internal node. Ground connection goes through a tri-modal footer cell.
Figure 3.4: Leakage and wakeup/ready latencies for DFF.
Figure 3.5: Different power-delay product metrics.
Figure 3.6: Application of tri-modal switch in designing multimodal pipeline structures.
Figure 3.7: Examples of (a) illegal and (b) legal placements.
Figure 3.8: Column-aligned placement: (a) before and (b) after removing illegal placements.
Figure 3.9: Outline of the proposed row sectioning placement algorithm.
Figure 3.10: Different drowsy VVSS voltage achieved by changing the threshold voltage value of sleep transistor (MS).
Figure 3.11: Different drowsy VVSS voltage achieved by changing the size of sleep transistor (MS).
Figure 3.12: Implementations of multimodal footer switch for multi-drowsy mode circuits.
Figure 3.13: Using multimodal header to perform voltage scaling.
Figure 3.14: Summarized block diagram of the design flow.
Figure 3.15: (a) Original placement for 16×16 pipelined CSM, (b) placed design after row sectioning with ni=1, (c) placed design with ni=2, (d) routed design with ni=2.
Figure 3.16: Leakage versus total latency for different mode-transition frequencies in the unit of per million cycles.
Figure 4.1: Hot-aisle/cold-aisle datacenter structure.
Figure 4.2: Datacenter power optimization architecture.
Figure 4.3: Prediction of total number of requests, r(t).
Figure 4.4: Datacenter structure used in the simulations.
Figure 4.5: Comparison of the total power consumption for GREEDY, TA-GREEDY and TA-DRP (K=1).
Figure 4.6: Comparison of supplied cold air temperature for K=1.
Figure 4.7: Comparison of the total power consumption for K=2.
Figure 4.8: Temperature distribution of a snapshot of TA-DRP.

Abstract

The increasing computing and storage capacity of electronic devices and information processing systems has increased their power consumption and energy usage dramatically. This has made the energy efficiency of circuit components and computing systems a very important concern.
Energy efficiency is desirable for portable electronics (e.g., mobile phones, laptops, tablets, etc.) because it lengthens battery lifetime. In more sophisticated computing systems that are not battery operated (e.g., web servers, datacenters, etc.), better energy efficiency reduces the total cost of ownership by reducing the cost of electricity (for both computing and cooling) and lessens the environmental impact (e.g., by reducing CO2 emissions).

This dissertation is divided into two parts. In the first part, I discuss some of the recent challenges in designing low-power and energy-efficient circuits and present some novel circuit-level techniques to reduce power consumption and improve energy efficiency. Two major techniques are discussed in this part. The first technique, charge recycling for power-gated circuits, reduces the mode-transition (sleep to active and active to sleep) energy consumption in power-gated circuits by recycling electric charge, which would otherwise be wasted, between the virtual ground and the virtual supply at the edge of the mode transition. This, in theory, reduces the mode-transition energy by up to 50%. The next circuit-level technique presented in this dissertation introduces multimodal power gating structures using a novel design of a tri-modal power-gating switch. This switch is used to implement data-retentive power gating structures and multi-drowsy mode circuits. We also show that, by using the proposed tri-modal switch, we can perform voltage scaling with the same infrastructure that is used for power gating.

The second part of this dissertation (system-level energy efficiency) is dedicated to energy efficiency in cloud computing infrastructures. We present a dynamically power-optimized datacenter that exploits correctly provisioned resources (servers and supplied cold air). The power optimization procedure comprises two major actions.
The first is to predict the right number of required servers by employing a short-term workload forecasting technique. The second is to optimally choose candidate servers that are either being retired (turned OFF) or employed (turned ON) from the available pool of servers and to determine the optimum supplied cold-air temperature value of the Air Conditioning (AC) unit while satisfying the datacenter thermal constraints. The power saving is achieved by a combination of chassis consolidation and efficient cooling.

Chapter 1
Introduction

1.1 Energy-Efficient Computing

Energy efficiency is one of the most significant challenges in designing today’s advanced electronic circuits and systems, and it can be viewed from both hardware and software perspectives. From the hardware designer’s point of view, each computing component used in the system (e.g., CPU, cache, main memory, etc.) must be energy efficient, with more energy-hungry components having a higher level of design criticality. On the other hand, even if hardware designers achieve superior energy efficiency for their designs, there is still opportunity left for energy saving at the software level. For example, how exactly an operating system manages different pieces of hardware and their low-power features (provided by the hardware vendor) in a computer affects the energy efficiency of that system.

Energy-efficient computing circuits and systems are desirable for various reasons. Small and/or portable electronics such as cell phones, laptops, Personal Digital Assistants (PDAs), MP3 players, tablets, digital cameras, digital sensors in sensor networks, etc., which mainly rely on battery power, benefit from energy efficiency by enjoying a longer lasting battery charge. In addition, a longer lasting battery charge translates into a longer total battery lifetime, which is also beneficial from the environmental point of view.
Moreover, less expensive packaging and cooling can be used for circuit components that consume less power. In more sophisticated computing systems such as electronic funds transfer, supply chain management, Internet marketing, online transaction processing, automated data collection systems, High Performance Computing (HPC) centers, E-commerce hosting centers, etc., where battery life is not a concern, power management and energy efficiency become even more significant. In such systems, more important issues such as cooling, cost of electricity, Total Cost of Ownership (TCO), CO2 emission, thermal distribution, etc., come into play.

1.2 Circuit-Level Energy Efficient Design

There are a number of techniques exploited by hardware designers to improve the energy efficiency of the different pieces of hardware used for computing. Some examples of low power/energy techniques are listed below:

A. Active mode power/energy reduction techniques
   i. Static voltage scaling
   ii. Pipelining and parallelization (trading area or latency for power)
   iii. Bus encoding and split buses
   iv. Clock gating and state encoding
   v. Adiabatic circuits and stepwise charging
B. Standby power/energy reduction techniques
   i. Power gating and Multi-Threshold CMOS (MTCMOS)
   ii. Variable-Threshold CMOS (VTCMOS)
   iii. Input vector control

Depending on the type of the circuit and the operating scenario, one may choose to apply one or more of the available low power/energy techniques. For example, for a purely combinational block, voltage scaling may be suitable to reduce power consumption in the active mode, while for a pipelined circuit with many flip-flops we may use clock gating. Both examples may benefit from a low-energy standby mode by exploiting power gating.

1.2.1 Total Power Saving Factor

Consider different power reduction techniques. Each particular technique is targeted for a certain circuit operating mode.
For example, the goal of the dynamic voltage-frequency scaling (DVFS) technique is to reduce dynamic power consumption in the active mode of circuit operation, whereas Multi-Threshold CMOS (MTCMOS) is used to reduce leakage power consumption in the standby mode of the circuit. In other words, different power reduction techniques cover different parts of the operational spectrum of their underlying circuits with different saving factors. Some power reduction techniques reduce power consumption in a certain mode while increasing power consumption in other modes. For instance, clock gating, which reduces active power, may increase leakage power by introducing additional circuitry. For each power saving technique used, we may define a Total Power Saving Factor (TPSF), which measures the quality of the power saving technique used for that circuit. The TPSF for some circuit, c, employing a power saving technique, lp, may be defined as follows:

$$\mathrm{TPSF}(c, lp) = \sum_{i} \tau_i(c)\,\alpha_i(c, lp)$$

where $\tau_i(c)$ is the fraction of time that circuit c spends in mode i (so that $\sum_i \tau_i(c) = 1$), $\alpha_i(c, lp)$ is the amount of power saving achieved in mode i by applying lp to c, with $0 \le \alpha_i < 1$, and the summation is taken over all possible modes in which circuit c operates. This coefficient can be used to compare the overall quality of different power saving techniques. A larger TPSF value indicates a higher quality low power technique for a specific circuit.

1.2.2 Energy-Efficient Power Gating Structures

As CMOS technology scales down, the supply voltage is reduced to avoid device failure due to high electric fields in the gate oxide and the conducting channel under the gate. This supply voltage scaling reduces the dynamic component of circuit power dissipation, but unfortunately also decreases the switching speed of transistors. To compensate for this performance loss, the transistor threshold voltages are decreased, which in turn causes an exponential increase in the sub-threshold leakage current.
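The exponential dependence of sub-threshold leakage on the threshold voltage can be illustrated with the standard first-order model in which leakage scales as 10^(-Vt/S), where S is the sub-threshold swing. The following Python sketch is only an illustration: the function name, the swing of 100 mV/decade, and the threshold values are assumptions chosen for the example, not figures from this dissertation.

```python
def leakage_ratio(vt_high, vt_low, swing_mv_per_decade=100.0):
    """Return I_sub(vt_low) / I_sub(vt_high), assuming I_sub ~ 10**(-Vt / S).

    vt_high, vt_low: threshold voltages in volts (assumed example values).
    swing_mv_per_decade: sub-threshold swing S in mV/decade (assumed).
    """
    delta_mv = (vt_high - vt_low) * 1000.0
    return 10.0 ** (delta_mv / swing_mv_per_decade)


if __name__ == "__main__":
    reference_vt = 0.40  # volts, hypothetical starting threshold
    for vt in (0.40, 0.30, 0.20):
        ratio = leakage_ratio(reference_vt, vt)
        print(f"Vt {reference_vt:.2f} V -> {vt:.2f} V: leakage grows {ratio:.1f}x")
```

Under these assumed numbers, every 100 mV taken off the threshold voltage multiplies the sub-threshold leakage by ten, which is why low-Vt logic needs a leakage-control mechanism in standby.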
Furthermore, to maintain the gate voltage control over the active region of the transistor, the thickness of the dielectric between the gate and the channel region is reduced, which results in an exponential increase in the gate leakage current [58].

Power gating, also known as Multi-threshold CMOS [42] or MTCMOS for short, is used to cut off the power to some functional blocks in a design. MTCMOS provides low leakage and high performance operation by utilizing high-speed, low-Vt (LVT) transistors for logic cell implementation and low-leakage, high-Vt (HVT) transistors for power gating switch implementation. The power gating switch itself is typically realized as a single (footer) nMOS or (header) pMOS transistor, which disconnects logic cells from the ground or VDD rails to reduce the leakage when the circuit is in the sleep mode.

Some of the design challenges that must be considered when using the power gating technique are: (i) placement and sizing of the sleep transistors; (ii) automatic generation of the sleep signal; (iii) sleep signal scheduling for wakeup noise reduction; (iv) mode transition energy minimization; (v) state retention; (vi) support for multiple levels of sleep. This dissertation, in part, contributes to power gating technology by introducing new circuit structures and algorithms that advance the existing knowledge and shift the current trend towards a more energy efficient one.

1.3 System-Level Energy Efficient Design

System-level energy efficiency is not necessarily achieved by simply putting together energy-efficient components. It is possible to improve the energy efficiency of a system by selectively putting different components into predesigned low-power states, where available, or by directly controlling the operating voltage and frequency of different components (DVFS).
This is generally referred to as Dynamic Power Management (DPM) [11], which is the practice of dynamically deciding how to utilize the different resources in the system such that system energy efficiency is improved. This is usually subject to satisfying some notion(s) of performance constraints such as throughput, latency (response time), Quality of Service (QoS), etc. One approach to system-level DPM is to selectively put idle components into lower power states. Different system resources can be modeled using a state-based abstraction where each state captures the existing tradeoff between power and performance of the underlying resource [26]. The system power manager controls the state transitions of different resources by monitoring the incoming requests to the system (workload) and issuing appropriate commands to those resources. The decision making policy in power management results in a constrained optimization problem. The policy itself can be defined in different ways, such as minimizing power subject to a performance constraint or maximizing performance under a power budget constraint (energy-efficient versus energy-adaptive policies).

1.3.1 Energy-Efficiency in Datacenters

Datacenters provide the supporting infrastructure for a wide range of economic activities based on digital information. As such, they are extremely important drivers of economic growth. They are also at the center of societal changes enabling new media for cyber-social interactions. The rapid increase in Internet traffic is partially due to the dramatic increase in requests to popular web sites (social networking sites, online marketplaces), the ubiquitous use of search engines, and web portals that combine media and entertainment, financial/market information, and email/chat services. This has facilitated the Information and Communication Technology (ICT) revolution, and datacenters, sitting at the heart of this revolution, are now indispensable to the functioning of financial, academic, and governmental institutions, social and entertainment networks, high performance computing centers, business administrations, online education networks, public awareness, etc. However, the continued growth of datacenters is now hindered by their unsustainable (and rising) energy needs. Apart from datacenter energy consumption and associated costs, corporations and governments are also concerned about the environmental impact of datacenters, in terms of their carbon dioxide (CO2) footprint. Motivated by the need for datacenters to be put on a more scalable and sustainable energy usage curve, this dissertation, in part, seeks to advance the technology of energy-efficient datacenters.

1.4 Overview of the Dissertation

Design of a suitable power gating (e.g., multi-threshold CMOS or super cutoff CMOS) structure is an important and challenging task in sub-90nm VLSI circuits, where leakage currents are significant. In designs where the mode transitions are frequent, a significant amount of energy is consumed to turn the power gating structure on or off. It is thus desirable to develop a power gating solution that minimizes the energy consumed during mode transitions. Part of this dissertation presents such a solution by recycling charge between the virtual ground and virtual supply rails immediately after entering the sleep mode and just before wakeup. The proposed method, in theory, can save up to 50% of the dynamic energy wasted during mode transition while maintaining the wakeup time of the original circuit. It also reduces the peak negative voltage value and the settling time of the ground bounce.

Because of the large rush-through current and the large wakeup latency of MTCMOS circuits, for short standby periods it is better to put the circuit into an intermediate power-saving mode (called the drowsy mode). The reason is that the transition latency from the drowsy to the active mode (which we shall call the ready latency) is much less than the wakeup time of the circuit when coming out of the sleep mode. Furthermore, if designed appropriately, drowsy circuits can retain the pre-standby internal state of the circuit. The downside of putting a circuit into the drowsy mode is the higher leakage current compared to the case when the circuit is put into the sleep mode. This dissertation also presents a tri-modal switch cell that enables implementation of multimodal power gating, including active, data-retentive drowsy, and deep sleep modes. A circuit realization and design methodology are presented that allow one to take advantage of the ultra-low-leakage deep sleep mode; the low-leakage, but very fast wakeup, drowsy mode; and an additional low-leakage data-retentive mode. We then extend the application of the tri-modal switch by introducing multimodal headers and footers that facilitate more interesting designs, such as multi-drowsy circuits, which enable multiple drowsy modes in addition to the sleep and active modes, and active mode voltage scaling using the same infrastructure that is used for power gating, i.e., the multimodal header. This expands the power saving operational spectrum of multimodal switches from the standby mode to active + standby modes (i.e., higher TPSF). Experimental results demonstrate the benefits of this new switch and the corresponding power gating technique.

On the system-level energy efficiency side, this dissertation focuses on the energy efficiency of datacenters.
The dramatic growth in Internet technology has been a key driver in the expansion of Information and Communication Technology (ICT), and together they have led to an unprecedented societal transformation. Datacenters, placed at the heart of this revolution, must keep pace to enable this inevitable continued growth. However, the current energy and environmental cost trends of datacenters are unsustainable. This demonstrates the urgent need for efficient datacenter power management. In this dissertation, we present a dynamically power-optimized datacenter that exploits correctly provisioned resources (servers and supplied cold air). The power optimization procedure comprises two major actions. The first is to predict the right number of required servers by employing a short-term workload forecasting technique. The second is to optimally choose, from the pool of servers, the candidate servers that are either being retired (turned OFF) or employed (turned ON), and to determine the optimum supplied cold-air temperature of the Air Conditioning (AC) unit while satisfying the datacenter thermal constraints. The power saving is achieved by a combination of chassis consolidation and efficient cooling. Simulation results show the effectiveness of the presented dynamic datacenter resource provisioning scheme.

Chapter 2 Charge Recycling for Power-Gated Circuits

2.1 Motivation

Adding a sleep transistor to a gate introduces a delay overhead when the gate is operating in the active mode. This is due to the voltage drop between the drain and source of the sleep transistor while the gate is switching. This voltage drop is denoted by V_X for an NMOS-type sleep transistor in Figure 2.1.

Figure 2.1: Sleep transistor slows down the gate.

The amount of the delay overhead depends on the size of the sleep transistor. For a fixed gate, a larger sleep transistor has a smaller ON resistance and thus a smaller delay overhead, at the cost of a larger area overhead.
Usually the sleep transistor is sized such that the delay overhead is less than a given maximum value, e.g., 5%. In fine-grain MTCMOS, each individual gate in the circuit uses its own sleep transistor [17]. However, sleep transistors can be used for a group of gates in a circuit. In that case each group of gates in the circuit uses one sleep transistor, as shown in Figure 2.2 (coarse-grain MTCMOS). The area overhead of coarse-grain MTCMOS is much less than that of the fine-grain one. This is due to the fact that not all the gates that share a sleep transistor switch simultaneously. One of the main challenges in coarse-grain MTCMOS is how to size the sleep transistor efficiently. There have been several works addressing this issue [8, 17, 28, 29, 38]. While providing a brief review of some of the drawbacks of MTCMOS technology and how to mitigate them, in this chapter we mainly focus on one issue that was never addressed before this research: the unnecessary energy consumption during the mode transition. Charge-recycling MTCMOS is presented as a solution to this problem, where we emphasize the concept and circuit design of this technique. Interested readers may refer to [45] for the Computer-Aided Design (CAD) related problems of the charge-recycling technique.

Figure 2.2: Coarse-grain MTCMOS.

2.2 Drawbacks of MTCMOS

As CMOS technology scales down, the supply voltage is reduced to avoid device failure due to high electric fields in the gate oxide and the conducting channel under the gate. Voltage scaling reduces the circuit power consumption because of the quadratic relationship between dynamic power consumption and supply voltage, but it also increases the delay of logic gates.
To compensate for the performance loss, transistor threshold voltages are decreased, which causes an exponential increase in the sub-threshold leakage current [30]. MTCMOS technology provides low-leakage and high-performance operation by utilizing high-speed, low-V_t (LVT) transistors for logic cells and low-leakage, high-V_t (HVT) devices as sleep transistors [27, 54]. Sleep transistors disconnect logic cells from the supply and/or ground to reduce the leakage in the sleep mode. In this technology, also called power gating or ground gating, the wakeup latency and power plane integrity are key issues. Assume a sleep/wakeup signal is supplied by an on-chip power management module. A key question is how to minimize energy consumption during mode transition, i.e., when switching from active to sleep mode or vice versa. Another important question is how to minimize the time required to turn on the circuit upon receiving the wakeup signal, since the length of the wakeup time can affect the overall performance of the VLSI circuit. Furthermore, the large current flowing to ground when sleep transistors are turned on can become a major source of noise in the power distribution network, which can adversely impact the performance and/or functionality of the other parts of the circuit. Hence, there is a trade-off between the noise generated by the current flowing to ground and the transition time from the sleep mode to the active mode.

Sleep transistors cause logic cells to slow down during the active mode operation of the circuit. This is due to the voltage drop across the functionally-redundant sleep transistors and the increase in the threshold voltage of logic cell transistors as a result of the body effect. The performance penalty of using a sleep transistor depends on its size and the amount of current that flows through it due to logic transitions in the active mode.
Several researchers have proposed methods for optimal sizing of sleep transistors in a given circuit to meet a performance constraint [8, 28, 32].

2.3 Prior Works

There have been several works addressing different issues with power gating, such as wakeup time reduction and minimization of the ground (or supply) bounce generated by the power gating structure [5, 6, 16, 27, 33]. Due to the large mode-transition energy overhead and large wakeup latency of the circuits, for short sleep periods it is sometimes better to put the circuit in a drowsy mode instead of the sleep mode. The reason is that the wakeup latency of the drowsy circuit is much less than that of the circuit in sleep mode. Also, if designed appropriately, drowsy circuits can retain the pre-sleep states and values of the circuit. The downside of putting a circuit in drowsy mode is the higher leakage current compared to the case when the circuit is put into sleep mode.

In [34], the authors propose a power gating structure to support an intermediate (drowsy) power-saving mode and the traditional power cut-off mode. The idea is to add a PMOS transistor in parallel with each NMOS sleep transistor. By applying zero voltage to the gate of the PMOS transistor, the circuit can be put in the intermediate power-saving mode whereby leakage reduction and data retention are both realized. Furthermore, by transitioning through this intermediate mode while changing between sleep and active modes, the magnitude of the voltage fluctuation of the power supply or ground during power-mode transitions is reduced. In the cut-off mode, the gate of the PMOS transistor is connected to V_DD. The work in [7] proposes multiple power modes for the circuit, but it needs multiple supply voltages (stable reference voltages to drive the gate terminal of the sleep transistor, which will be operating at different points of the subthreshold conduction region during the sleep mode). This is a costly proposition.
In [56] the authors propose a drowsy circuit scheme that automatically controls the degree of drowsiness of the circuit by using negative feedback implemented with a sleep inverter.

There is a large rush current during sleep-to-active mode switching in MTCMOS circuits due to the existence of floating nodes in the sleep mode. High peak rush current values in the circuit can cause Electro-Migration (EM) problems in the power/ground rails. This rush current also results in high supply/ground bounce due to the L·di/dt effect, which in turn can cause spurious transitions in a circuit. This can result in wrong values being latched in the circuit registers. In [19] the authors propose a wakeup strategy and a partitioning technique to limit the rush-through current. The authors of [5] present a few techniques for reducing the transition time from the sleep mode to the active mode for a circuit while assuring the power integrity of the rest of the system. The problem is to minimize the wakeup time while limiting the current flowing to ground during the sleep-to-active mode transition. Their basic approach comprises (i) obtaining the discharge patterns of all logic cells, and (ii) grouping the circuit into a minimum number of clusters in such a way that the total discharge current of each cluster is below a given threshold.

In [35] the authors introduce two power mode transition techniques to reduce the ground bounce while turning on the circuit. The first technique uses a single sleep transistor and turns it on gradually. Initially a voltage less than V_DD is applied to the gate terminal of the transistor to weakly turn on the sleep transistor and limit the peak voltage bounce of the virtual ground. In subsequent steps, the sleep transistor is turned on more strongly to further reduce the resistance between the virtual and actual grounds. The second method proposed in [35] is to use parallel-connected sleep transistors with increasing widths (cf.
Figure 2.3). The sleep transistors are turned on in a number of time steps, starting from the transistor with the smallest width. In the first step, since the voltage of the virtual ground is initially at its maximum, a relatively high resistance value is used to discharge it; this limits the peak current. In the subsequent time steps, the resistance of the path between the virtual and actual grounds is reduced by turning on wider sleep transistors.

Figure 2.3: Using multiple parallel sleep transistors to gradually wake up a circuit block.

None of these works attempts to minimize the power consumption during the sleep-to-active and active-to-sleep transitions, or to reduce the wakeup time and the noise generated by the power gating structure while maintaining almost the same standby leakage current. In this chapter, we apply a charge-recycling technique to minimize the power consumption during the mode transition in a power gating structure while maintaining, or sometimes even improving, the wakeup time. Through simulations, we show how the proposed technique also helps reduce the ground bounce in the sleep-to-active transition.

2.4 Charge-Recycling Technique

Figure 2.4: The conventional power gating structure using an NMOS or a PMOS sleep transistor for each circuit block.

Consider the coarse-grain (as opposed to fine-grain) MTCMOS configuration shown in Figure 2.4. There are two different blocks in the circuit; one is power-gated by an NMOS sleep transistor which connects the virtual ground (VGND), i.e., node G in the figure, to the ground, whereas the other is power-gated by a PMOS sleep transistor which connects the virtual V_DD (VV_DD), i.e., node P in the figure, to the supply.
In the active period, sleep transistors S_N and S_P are in the linear region and the voltage values of the virtual ground and virtual V_DD are equal to 0 and V_DD, respectively. In the sleep mode, sleep transistors S_N and S_P are turned off; since they are high-threshold-voltage devices, very little subthreshold leakage current flows through them. In practice (see below for precise conditions), all internal nodes of the gates in block C_1 and the virtual ground node, G, will be charged up to a voltage value very close to V_DD [46]. This happens because G is floating and leakage current causes its voltage level to rise toward V_DD. Similarly, if the sleep period is long enough, all internal nodes of C_2 and the virtual supply node, P, will be discharged to a voltage very close to 0. We discuss this in more detail in the following subsection.

2.4.1 Virtual Node Voltage Values in the Sleep Mode

Consider sub-circuit C_1 in Figure 2.4. We show that the only scenario in which the assumption of the VGND node being charged to a value close to V_DD is invalid is when the outputs of all logic cells in C_1 are logic 1 (i.e., the pull-down sections of all cells are OFF) immediately before the active-to-sleep transition occurs. However, this case rarely happens in practice, because if there is at least one cell in C_1 with its output value set to logic 0 (i.e., its pull-down section is ON) before the active-to-sleep transition, and if the sleep period is sufficiently long, then the steady-state value of the virtual ground voltage after entering the sleep mode will be close to V_DD. Considering that a sub-circuit will typically contain tens of logic cells, the probability of at least one of them having a logic 0 at its output (before entering the sleep mode) is almost 1; therefore, the voltage of the virtual ground of sub-circuit C_1 will rise and reach a value close to V_DD after sufficient time is spent in the sleep mode.
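The "almost 1" probability claim can be made concrete with a rough back-of-the-envelope model. The sketch below is only illustrative: it assumes, hypothetically, that each cell output is logic 1 with the same probability p_high and that outputs are statistically independent, neither of which is stated in the text. The function name and the default p_high = 0.5 are likewise assumptions.

```python
def prob_some_output_low(n_cells: int, p_high: float = 0.5) -> float:
    """Probability that at least one of n_cells outputs is logic 0.

    Assumption (not from the text): every output is logic 1 with
    probability p_high, independently of all other outputs.
    """
    if n_cells < 1:
        raise ValueError("n_cells must be at least 1")
    if not 0.0 <= p_high <= 1.0:
        raise ValueError("p_high must lie in [0, 1]")
    # All outputs are logic 1 with probability p_high ** n_cells;
    # the complement is "at least one output is logic 0".
    return 1.0 - p_high ** n_cells


if __name__ == "__main__":
    for n in (1, 4, 10, 20, 40):
        p = prob_some_output_low(n)
        print(f"{n:3d} cells: P(at least one logic-0 output) = {p:.6f}")
```

Under these assumptions the probability exceeds 0.999 once a block holds ten or more cells, which is in line with the qualitative argument for sub-circuits containing tens of cells. Real circuits have correlated, non-uniform output statistics, so the numbers are indicative only.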
To empirically confirm the aforementioned claim, we show the voltage waveforms of the virtual ground node for four different cases in Figure 2.5. In each case we have used an NMOS sleep transistor (the case with a PMOS sleep transistor is similar except that the corresponding output states are reversed) [45].

Figure 2.5: The virtual ground voltage in the sleep mode, V_DD = 1.2 V.

The first case is when there is only a single inverter cell in sub-circuit C_1 and the output of the inverter is logic 1 before entering the sleep mode. As the figure shows, after entering the sleep mode, the virtual ground voltage of the inverter cell rises to about 200 mV, which is much less than V_DD = 1.2 V. The next case corresponds to the same sub-circuit C_1, this time with the output of the inverter forced to logic 0. Here, the virtual ground voltage rises to 0.95 V, which is close to V_DD = 1.2 V and a suitable level for the charge-recycling purpose. The next two cases correspond to C_1 comprising four inverter cells, each driven by an input to C_1. In one case, three of the inverter outputs are 1 and only one inverter output is 0. In this case, the virtual ground voltage rises to an even higher level than in case 2, resulting in a final steady-state voltage level of 1 V, which is again suitable for the charge-recycling purpose. In the last case, two inverter outputs are set to logic 1 while the others are set to logic 0. Clearly in this case, after entering the sleep mode, the virtual ground node is expected to rise and achieve a level even closer to V_DD than before. This is confirmed by the top waveform in the figure, which shows that the virtual ground of sub-circuit C_1 reaches a voltage close to 1.2 V.
In summary, as long as there is a reasonably large number of logic cells in a sub-circuit that uses an NMOS sleep transistor, the probability that at least one of these cells will have a logic 0 output value before entering the sleep mode is close to one, so the virtual ground voltage of such a sub-circuit will gradually rise and stabilize at a voltage close to V_DD. This occurs in a relatively short period of sleep time (on the order of microseconds), which provides us with the opportunity for charge recycling between this sub-circuit and another one that uses a PMOS sleep transistor. The case in which a PMOS sleep transistor is used instead of an NMOS transistor is similar, and it can be shown that the VV_DD node is discharged to 0 during the sleep mode.

In practice, in a circuit block that uses an NMOS sleep transistor, the number and sizes of logic cells with 0 output values are sufficiently large that the virtual ground voltage of this circuit rises to a value very close to V_DD after it enters the sleep mode. The same statement holds for the virtual V_DD voltage of a circuit block that uses a PMOS sleep transistor, which drops to a value very close to the ground voltage level after the circuit enters the sleep mode. In the analytical parts of this chapter, we will assume that the virtual ground and virtual V_DD voltages of circuits using NMOS and PMOS sleep transistors change to exactly V_DD and ground levels, respectively, after entering and staying in the sleep mode for a long enough time. In the next subsection we use this observation to propose a charge-recycling technique to achieve energy savings during mode transitions.

2.4.2 Charge-Recycling for Mode-Transition Energy Saving

Figure 2.6: The proposed charge-recycling configuration for power gating structures.
When the sleep-to-active transition edge arrives at the gates of the sleep transistors in an MTCMOS circuit, the voltage of G starts to fall toward 0, whereas the voltage of P starts to rise toward V_DD. If we denote the total effective capacitance at the VGND and VV_DD nodes by C_G and C_P, respectively, we observe that during the active-to-sleep transition, C_G is charged up from 0 to V_DD, while C_P is discharged from V_DD to 0. The situation is reversed for the sleep-to-active transition, i.e., in this case C_G is discharged from V_DD to 0, while C_P is charged to V_DD from its initial value of 0. These charge and discharge events on the VGND and VV_DD nodes are wasteful from the energy dissipation point of view [44, 46]. Our goal is to reduce the energy consumed as we switch between the active and sleep modes of the circuit. More precisely, we propose to use a charge-recycling technique to reduce the switching power consumption during the active-to-sleep and sleep-to-active transitions by adding a charge sharing switch between the virtual ground and virtual supply nodes, as shown in Figure 2.6.

The proposed charge-recycling strategy works as follows. We turn on the charge sharing switch (i) immediately before turning on the sleep transistors while going from the sleep to the active mode, and (ii) just after turning off the sleep transistors while going from the active to the sleep mode. By turning on the switch at the end of the sleep mode, as the circuit is about to go from sleep to active mode, we allow charge sharing between the completely charged capacitance C_G and the completely discharged capacitance C_P. After the charge recycling is completed, the common voltage of the virtual ground and virtual supply is αV_DD, where α is a positive real number less than 1.
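The common voltage reached when the switch closes follows from charge conservation between two floating capacitors. The following is a minimal sketch of that step, assuming ideal capacitors and complete charge sharing; the function name share_charge and the example capacitance values are illustrative and not taken from the text. It also reports the energy dissipated in the switch resistance, which is the difference between the stored energies before and after sharing.

```python
def share_charge(c_g: float, c_p: float, v_g: float, v_p: float):
    """Ideal charge sharing between two floating capacitors.

    c_g, c_p: capacitances in farads; v_g, v_p: initial voltages in volts.
    Returns (v_final, e_lost): the common final voltage and the energy
    dissipated in the (assumed resistive) switch during sharing.
    """
    if c_g <= 0 or c_p <= 0:
        raise ValueError("capacitances must be positive")
    # Total charge is conserved because both nodes are floating.
    v_final = (c_g * v_g + c_p * v_p) / (c_g + c_p)
    e_before = 0.5 * c_g * v_g ** 2 + 0.5 * c_p * v_p ** 2
    e_after = 0.5 * (c_g + c_p) * v_final ** 2
    return v_final, e_before - e_after


if __name__ == "__main__":
    v_dd = 1.2  # volts, matching the supply used in Figure 2.5
    # Wakeup case: virtual ground at V_DD, virtual supply at 0 (assumed values).
    for c_g, c_p in ((1e-12, 1e-12), (2e-12, 1e-12)):
        v_f, e_lost = share_charge(c_g, c_p, v_dd, 0.0)
        print(f"C_G={c_g:.1e} F, C_P={c_p:.1e} F: "
              f"V_f={v_f:.3f} V (alpha={v_f / v_dd:.3f}), "
              f"switch loss={e_lost:.3e} J")
```

With equal capacitances the nodes settle at V_DD/2; a larger C_G pulls the common voltage, and hence α, toward 1.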
The value of α depends on the relative sizes of C_G and C_P. As a result of this step, the power consumed due to switching the sleep transistors on and off is reduced. The reason is that in this case, the voltage of the virtual ground changes from αV_DD to 0 and the voltage of the virtual supply changes from αV_DD to V_DD, whereas in the conventional MTCMOS circuit, the transitions are from V_DD to 0 and from 0 to V_DD at the virtual ground and virtual V_DD nodes, respectively. A similar analysis shows that the charge-recycling technique also helps to reduce the power consumed in the transition from the active mode to the sleep mode.

In practice, we use a transmission gate (TG) to realize the switch (cf. Figure 2.7). One may instead use other circuit realizations of a switch, such as pass transistors. Note that with a TG it is easier to achieve full charge sharing between the floating virtual ground and virtual V_DD nodes. We will use a TG in the rest of this dissertation.

As implied by the discussion in this section, the proposed charge-recycling technique is used for mode-transition energy saving in a coarse-grain MTCMOS design, where each sleep transistor is used to disconnect ground/supply from multiple logic cells. In a fine-grain MTCMOS design, by contrast, the standard cell library consists of logic cells with a sleep transistor integrated into each cell, i.e., each logic cell has its own built-in sleep transistor. Typically in such a library, the virtual ground/supply nodes are considered internal nodes, i.e., there are no separate pins attached to these nodes. This makes accessing the virtual rails, and thus applying the charge-recycling technique, very difficult. However, if we assume the fine-grain library is designed such that we have access to the virtual nodes, one can think of different ways of implementing the charge-recycling technique with this cell library.
We do not recommend applying the charge-recycling technique at the individual cell level (fine-grain), since our basic requirement for energy saving due to charge recycling, i.e., the condition that the virtual nodes change to the opposite rail values during sleep, may be frequently violated under this scenario.

Another MTCMOS design configuration is the cluster-based style, which we consider a mid-grain MTCMOS technique [8, 29]. In order to implement cluster-based (mid-grain) charge recycling, we can start by putting together groups of n logic cells which use NMOS sleep transistors and connecting their virtual ground nodes to create a single virtual ground node. The same step can be performed for groups of m logic cells that use PMOS sleep transistors to make a single virtual supply node shared among the cells. Charge recycling can subsequently be performed between the virtual ground of one group and the virtual supply of the other group. Although cell clustering is an important optimization step, it falls outside the scope of this dissertation. In the next section we analyze the energy saving achieved by applying the charge-recycling technique to a coarse-grain MTCMOS design.

2.5 Energy Saving Analysis for Charge Recycling Technique

In this section we first calculate the maximum achievable energy saving and discuss the conditions under which we can achieve this maximum saving. Next we quantitatively analyze the effect of the threshold voltages and sizes of the transistors in the transmission gate realizing the charge sharing switch.

It is worth stating at the outset that, for the purpose of analyzing energy consumption in CMOS circuits, energy is taken out of the V_DD rail only when a capacitive node is charged up through a direct connection to the V_DD rail. Energy that is dumped to the ground rail is the energy which was stored in that capacitive node and need not be accounted for again.
The charge recycling between “floating” capacitive nodes (with possibly different initial voltage levels) does not extract any energy from the V_DD rail or dump any into the ground rail; instead, some of the energy that was stored in the capacitors is consumed in the resistance of the switch that short circuits the two capacitive nodes, while the remainder of the energy is appropriately distributed between the nodes.

Figure 2.7: The proposed charge-recycling configuration with a TG realization of the charge sharing switch.

To calculate the energy saving of the charge-recycling technique, we consider two different transitions: sleep-to-active and active-to-sleep.

Case 1: Wakeup Transition

Let C_G and C_P represent the total capacitance at the virtual ground and virtual supply nodes, respectively. We assume the sleep period is long enough that C_G is charged up to a voltage close to V_DD, while C_P is completely discharged to a voltage close to 0. This is a good assumption in most circuits. Otherwise, the voltages of C_G and C_P will be a function of the length of the sleep period. As stated earlier, to go from the sleep mode to the active mode, instead of simply turning on the sleep transistors, we first allow charge recycling between C_G and C_P. This is done by closing switch M at time t = t_a0. Assuming ideal charge sharing between C_G and C_P, the common voltage value of nodes G and P after charge sharing is calculated by equating the total charge in both capacitances before and right after charge recycling:

\[ V_f = \alpha V_{DD}, \qquad \alpha = \frac{C_G}{C_G + C_P} \tag{2.1} \]

The common voltage value, V_f, of the virtual ground and virtual supply at the end of the charge sharing is αV_DD. After the charge sharing is complete, i.e., at time t = t_a1, we open switch M and turn on the S_N and S_P sleep transistors.
As a result, there will be a path from the virtual ground to the (actual) ground going through S_N, which discharges C_G to 0. There will also be a path from the virtual V_DD to the (actual) V_DD going through S_P, which charges C_P to V_DD. For now we neglect the energy consumption in the switch itself, so the total energy drawn from the power supply is due to the process of charging capacitance C_P, which can be obtained as follows:

\[ E_{sleep \to active} = C_P V_{DD}\, \Delta V = C_P V_{DD} \left( V_{DD} - V_f \right) \tag{2.2} \]

Substituting V_f from (2.1) into (2.2), we obtain the energy consumed during the sleep-to-active transition:

\[ E_{sleep \to active} = C_P V_{DD} \left( V_{DD} - \alpha V_{DD} \right) = (1 - \alpha)\, C_P V_{DD}^2 \tag{2.3} \]

Next we consider the active-to-sleep transition.

Case 2: Sleep Transition

As mentioned earlier, to go from the active mode to the sleep mode, instead of simply turning off the sleep transistors, we perform charge recycling between C_G and C_P as soon as the circuit enters the sleep mode. In other words, we close switch M at t = t_s0, which is the time when the sleep transistors are turned off. The voltage values of the VGND and VV_DD nodes at t = t_s0 are 0 and V_DD, respectively. Assuming ideal charge sharing between C_G and C_P, the common voltage value of nodes G and P after charge sharing is calculated by equating the total charge in both capacitances right before and after charge sharing:

\[ V_f = \beta V_{DD}, \qquad \beta = \frac{C_P}{C_G + C_P} \tag{2.4} \]

Based on the above equation, the common voltage value, V_f, of the virtual ground and virtual V_DD at the end of charge sharing is βV_DD. The charge recycling is complete at t = t_s1, so we open the switch. After opening the switch, there is a leakage path from the power supply to the virtual ground going through logic block C_1, which eventually causes C_G to be charged up to V_DD. There is also a leakage path from the virtual supply to the ground going through logic block C_2, which eventually causes C_P to be completely discharged to the ground.
Again, if we neglect the power consumption in the switch, the total energy consumed is due to charging up the capacitance C_G; the energy consumption can be calculated as follows:

\[ E_{active \to sleep} = C_G V_{DD}\, \Delta V = C_G V_{DD} \left( V_{DD} - V_f \right) \tag{2.5} \]

Substituting V_f from (2.4) into (2.5), we obtain:

\[ E_{active \to sleep} = C_G V_{DD} \left( V_{DD} - \beta V_{DD} \right) = (1 - \beta)\, C_G V_{DD}^2 \tag{2.6} \]

Since α + β = 1, the total energy consumption will be:

\[ E_{CR\text{-}MTCMOS} = E_{active \to sleep} + E_{sleep \to active} = \alpha\, C_G V_{DD}^2 + \beta\, C_P V_{DD}^2 \tag{2.7} \]

where E_CR-MTCMOS is the dynamic energy consumption during mode transition in the charge-recycling circuit. We can calculate the total energy consumption of the corresponding conventional MTCMOS circuit, i.e., when no charge recycling is used, using the following formula:

\[ E_{MTCMOS} = C_G V_{DD}^2 + C_P V_{DD}^2 \tag{2.8} \]

From (2.7) and (2.8), and after substituting for α and β from (2.1) and (2.4), the energy saving ratio (ESR) is:

\[ ESR(X) = \frac{E_{MTCMOS} - E_{CR\text{-}MTCMOS}}{E_{MTCMOS}} = \frac{2X}{(1 + X)^2} \tag{2.9} \]

where X = C_G/C_P is the ratio of the virtual ground capacitance to the virtual V_DD capacitance. The optimum value of X which maximizes ESR(X) is obtained by equating the derivative of ESR(X) to zero, which results in X = 1, or C_G = C_P. In other words, in order to obtain the maximum energy saving, we need to have equal capacitances at the virtual ground and virtual V_DD. The maximum energy saving is then:

\[ ESR_{max} = ESR(X)\big|_{X=1} = \frac{1}{2} \tag{2.10} \]

This means a maximum energy saving of 50% can be achieved by using the charge-recycling method. However, considering the power needed to turn the TG on and off, the total saving ratio will be less than 50%.

Figure 2.8: The charge recycling waveforms for an inverter chain implemented in a 70 nm CMOS technology.
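The step from (2.7) and (2.8) to (2.9) and (2.10) is stated compactly. A worked version, under the same ideal charge-sharing assumptions and with X = C_G/C_P, might read as follows. The intermediate expressions are a reconstruction from the definitions of α and β in (2.1) and (2.4), not lines taken from the original derivation.

```latex
\begin{align*}
E_{\mathrm{CR\text{-}MTCMOS}}
  &= \alpha\, C_G V_{DD}^2 + \beta\, C_P V_{DD}^2
   = \frac{C_G^2 + C_P^2}{C_G + C_P}\, V_{DD}^2, \\
E_{\mathrm{MTCMOS}}
  &= (C_G + C_P)\, V_{DD}^2, \\
ESR(X)
  &= 1 - \frac{C_G^2 + C_P^2}{(C_G + C_P)^2}
   = \frac{2\, C_G C_P}{(C_G + C_P)^2}
   = \frac{2X}{(1 + X)^2}, \\
\frac{d\,ESR}{dX}
  &= \frac{2(1 - X)}{(1 + X)^3} = 0
   \;\Longrightarrow\; X = 1, \qquad ESR(1) = \tfrac{1}{2}.
\end{align*}
```

The derivative is positive for X < 1 and negative for X > 1, so X = 1 is a maximum; the expression is also unchanged when X is replaced by 1/X, so it does not matter which of the two capacitances is the larger one.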
Figure 2.8 shows HSPICE waveforms when charge recycling is performed before transitioning from the sleep to the active mode for an inverter chain implemented in 70 nm CMOS technology. Note that in this circuit, C_G = C_P. The figure plots, as voltage (V) versus time (ns), the virtual ground voltage, V_G, the virtual V_DD voltage, V_P, and the charge-recycling signal, V_CR.

Now we define the virtual ground/supply capacitances, i.e., C_G and C_P. The total effective capacitance at the virtual ground (supply) comprises the following components:

(a) Diffusion capacitance (C_dif): This component is defined as the sum of the diffusion capacitances of transistors in logic gates connected to the virtual ground (supply).

(b) Interconnect capacitance (C_int): This component is defined as the total rail capacitance at the virtual ground (supply) due to interconnect.

(c) Internal node capacitance (C_in): This component is defined as the total internal node capacitance of logic gates connected to the virtual ground (supply) whose voltage values transition from V_DD to 0 or vice versa during mode transitions.

Based on the above definitions, the total virtual node capacitances can be written as:

\[ C_G = C_{dif}^{G} + C_{int}^{G} + C_{in}^{G}, \qquad C_P = C_{dif}^{P} + C_{int}^{P} + C_{in}^{P} \tag{2.11} \]

Now suppose each of blocks C_1 and C_2 in Figure 2.6 consists of a simple inverter. When charge recycling is performed after the active-to-sleep mode transition, the value of C_G depends on the state of the inverter in C_1: (i) when the input of the inverter is at logic 0, the NMOS transistor of the inverter is off, so the total capacitance, C_G, is the sum of the first two components in (2.11) (no internal node capacitance); (ii) when the input of the inverter is at logic 1, the NMOS transistor of the inverter is on, and the internal node capacitance contributes to C_G.
A similar discussion holds for the C_P capacitance and the state of the inverter in block C_2. This makes the C_G and C_P values input-pattern dependent for a general circuit: different input patterns applied to the circuit result in different logic values at the inputs of the circuit's gates, which changes the contribution of the internal node capacitances to the total rail capacitance and results in different C_G and C_P values. Fortunately, our simulations for large enough circuit blocks (e.g., more than 20 gates per block) show that the maximum change in the shared voltage value after the charge-recycling operation is less than 5% under different input patterns. This means the impact of the input vector on unbalancing the total virtual ground and virtual supply capacitance values is small and can be neglected.

Finally, we point out that the energy saving ratio is only a weak function of the ratio between C_G and C_P. From (2.9), the maximum ESR is achieved when C_G = C_P. However, if this condition is not fully satisfied, the energy saving ratio does not decrease dramatically. For example, for C_P = 2×C_G, which means X = 1/2 in (2.9), the ESR becomes 44%, and for C_P = 3×C_G, i.e., X = 1/3, the ESR becomes 38%. Therefore, even when the C_G and C_P values differ by as much as a factor of 2 or 3, the energy saving ratio is still large.

Note that all the equations derived so far are based on the assumption of ideal charge recycling between C_G and C_P. Under this scenario, we assume that no energy is consumed to switch the TG on and off. We also assume that the TG is “ON” while the charge recycling is in process. However, because of the dynamic power consumption in the TG, and also the possibility of incomplete charge sharing, this idealization does not hold exactly in practice. In the following we study the effects of the TG threshold voltage and sizing on the energy saving ratio and the wakeup time of the charge-recycling configuration.
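The sensitivity figures quoted above for unequal capacitances can be reproduced numerically. The sketch below is a minimal illustration under the ideal charge-sharing assumptions of this section (no TG switching energy, complete sharing); the function name energy_saving_ratio is hypothetical. It evaluates the mode-transition energies with and without charge recycling in units of V_DD squared and forms their relative difference.

```python
def energy_saving_ratio(c_g: float, c_p: float) -> float:
    """Ideal energy saving ratio of charge recycling for given C_G and C_P.

    Energies are expressed in units of V_DD**2, so V_DD cancels out.
    Assumes ideal, complete charge sharing and a lossless switch driver.
    """
    if c_g <= 0 or c_p <= 0:
        raise ValueError("capacitances must be positive")
    alpha = c_g / (c_g + c_p)  # shared-voltage fraction at wakeup
    beta = c_p / (c_g + c_p)   # shared-voltage fraction when entering sleep
    e_recycled = alpha * c_g + beta * c_p  # supply energy with recycling
    e_conventional = c_g + c_p             # supply energy without recycling
    return (e_conventional - e_recycled) / e_conventional


if __name__ == "__main__":
    for c_g, c_p in ((1.0, 1.0), (1.0, 2.0), (1.0, 3.0), (3.0, 1.0)):
        esr = energy_saving_ratio(c_g, c_p)
        print(f"X = C_G/C_P = {c_g / c_p:.3f}: ESR = {100 * esr:.1f}%")
```

The ratio depends only on X = C_G/C_P and is the same for X and 1/X, so a block pair with C_G three times C_P saves as much as one with C_P three times C_G.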
2.6 Some Considerations

2.6.1 Effect of the Threshold Voltages of the TG

We first discuss the effect of the threshold voltages of the NMOS and PMOS transistors of the TG on the energy saving and the delay of the circuit. Consider the charge sharing configuration shown in Figure 2.9, where V_1 and V_2 are initially set to V_DD and 0, respectively. After the TG is closed, the common node voltage is referred to as V_f. To have complete charge sharing, the TG has to stay “ON” for the whole duration of the charge sharing process. For this to hold, the absolute values of the threshold voltages of the N and P transistors of the TG have to be small enough. To guarantee this, the common final voltage value of the virtual ground and virtual supply, V_f, has to satisfy at least one of the following two inequalities:

$$ V_f \le V_{DD} - V_{t,n} \quad \text{or} \quad |V_{t,p}| \le V_f \tag{2.12} $$

where V_{t,n} and V_{t,p} denote the threshold voltages of the NMOS and PMOS transistors in the TG, accounting for the body effect. Notice that V_f can be obtained from (2.1) for the active-to-sleep case and from (2.4) for the sleep-to-active case. These inequalities guarantee that at least one of the transistors in the TG remains “ON” for the complete duration of charge sharing.

Figure 2.9: Charge sharing between C_1 and C_2 when using a TG to realize the charge sharing switch.

In the case of equal virtual node capacitances, C_G = C_P, complete charge sharing in both the active-to-sleep and sleep-to-active cases results in a common final voltage value of V_f = V_DD/2, and (2.12) translates into Min{V_{t,n}, |V_{t,p}|} ≤ V_DD/2 (see footnote 1). Now, if V_tn = |V_tp| ≤ V_DD/2, the TG may be replaced with a pass transistor while still achieving full charge sharing. Note that in current CMOS technologies this condition is satisfied for both LVT and HVT devices, since it is needed for acceptable static DC noise margins.
In future CMOS technologies that use sub-1V power supply levels, as will be discussed later in this chapter, turning on the HVT devices will be difficult, and that is why Super Cut-off CMOS (which uses voltage over- or under-drive) has been proposed [31]. Therefore, for sub-1V technologies we recommend using CR-SCCMOS instead of CR-MTCMOS (cf. 2.8.3). In this case, the transmission gate's transistors will be LVT, and V_tn, |V_tp| ≤ V_DD/2 will be automatically satisfied (otherwise, even the static CMOS logic cells, which use LVT transistors, would not meet the static noise margin requirements in the design).

Footnote 1: If Min{V_{t,n}, |V_{t,p}|} > V_DD/2, the charge-recycling will not be complete, and the Energy Saving Ratio (ESR) will be less than what we have predicted.

2.6.2 Effect of the Transistor Sizes of the TG

Sizing of the TG is another factor that affects the ESR as well as the wakeup time of the circuit. In the original configuration, without any charge-recycling, the wakeup time may be defined as the time from when the sleep transistors are turned on to when the voltage of the virtual ground (or virtual V_DD) reaches within 10% of its final value. In a circuit that uses charge-recycling, however, the wakeup time may be defined as the time from when the TG is turned on to when the virtual ground (or virtual V_DD) voltage reaches within 10% of its final value.

In the following discussion, we consider the effect of the dynamic power consumption of the TG on the ideal energy saving ratio, ESR, which we previously calculated. Consider the TG with its control signal (the complement of the control signal is produced by a CMOS inverter). Assume a total input capacitance of C_tg for the NMOS and PMOS transistors of the TG. In each active-sleep-active cycle, we need to turn on the TG twice, once before turning the sleep transistors on and once after turning them off.
Every time we turn the TG on and off, we charge and discharge C_tg. We have to turn off the TG after the charge sharing is complete. Therefore, we can calculate the dynamic energy consumption of the TG for one complete active-sleep-active cycle as follows:

$$ E_{TG} = 2\, C_{tg} V_{DD}^2 \tag{2.13} $$

The actual energy saving ratio (ESR) can then be calculated by subtracting the correction ratio E_TG/E_MTCMOS from the ideal ESR in (2.9). The correction ratio can be calculated as:

$$ \frac{E_{TG}}{E_{MTCMOS}} = \frac{2\, C_{tg} V_{DD}^2}{(C_G + C_P)\, V_{DD}^2} = \frac{2\, C_{tg}}{C_G + C_P} \tag{2.14} $$

This correction ratio is proportional to the sizes of the TG's transistors since C_tg is proportional to the size of the TG. Because many gates are usually connected to the virtual ground and the virtual V_DD, C_G + C_P is usually much larger than C_tg. Thus, the correction ratio is usually a few percent, which makes the actual ESR smaller than the ideal ESR, i.e., 50%, by only a few percentage points.

Figure 2.10 shows the ESR versus the total transistor width used in the TG. As seen, the ESR is reduced as we increase the TG size. By changing the TG size, we can change the speed of the charge sharing operation and, as a result, minimize the wakeup time; however, the charge-sharing operation only changes the virtual node voltages from their initial values to V_DD/2. The rest of the wakeup operation is performed by the sleep transistors, and its duration depends on the sizes of the sleep transistors. Clearly, increasing the TG size does not affect the speed at which the sleep transistors can change the virtual node voltages from V_DD/2 to V_DD or ground, as the case may require. Therefore, the total wakeup time of the circuit is expected to decrease when we increase the TG size, but it saturates at some point.

Figure 2.10: Percentage of Energy Saving Ratio (ESR) versus size of the transmission gate used for 9sym benchmark circuit.

Figure 2.11: Wakeup time versus size of the transmission gate used for 9sym benchmark circuit.
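The correction in (2.13) and (2.14) can be illustrated with a small numeric sketch. The capacitance values used below are hypothetical and chosen only to show the order of magnitude; the function names are likewise illustrative and not part of the original work.

```python
def tg_correction_ratio(c_tg: float, c_g: float, c_p: float) -> float:
    """Correction ratio E_TG / E_MTCMOS = 2*C_tg / (C_G + C_P), as in (2.14)."""
    return 2.0 * c_tg / (c_g + c_p)


def actual_esr(ideal_esr: float, c_tg: float, c_g: float, c_p: float) -> float:
    """Actual ESR: the ideal ESR minus the TG correction ratio."""
    return ideal_esr - tg_correction_ratio(c_tg, c_g, c_p)


if __name__ == "__main__":
    # Hypothetical values: C_G = C_P = 100 fF, C_tg = 2 fF, ideal ESR = 50%.
    c_g = c_p = 100e-15
    c_tg = 2e-15
    print(f"correction ratio = {100 * tg_correction_ratio(c_tg, c_g, c_p):.1f}%")
    print(f"actual ESR       = {100 * actual_esr(0.5, c_tg, c_g, c_p):.1f}%")
```

Because C_tg grows with the TG width, sweeping `c_tg` in this sketch mimics the downward trend of the ESR with TG size described in the text, although it does not model incomplete charge sharing.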
[Figure 2.10 plot: ESR (%) versus Transmission Gate Size (λ), 9sym. Figure 2.11 plot: Wakeup Time (ps) versus Transmission Gate Size (λ), 9sym.]

Figure 2.11 shows the circuit wakeup time versus the total transistor width used in the TG. Finally, note that although increasing the TG size reduces the wakeup time, it also increases the correction ratio given in (2.14), thereby reducing the energy saving ratio of the circuit. In other words, there is a tradeoff between the wakeup time and the energy saving ratio.

2.7 Leakage Current and Ground Bounce Analysis

We analyze two important issues for the proposed charge-recycling MTCMOS configuration, namely the leakage current and the ground bounce (GB).

2.7.1 Leakage Current

In the sequel, we derive the leakage current equations for both MTCMOS and CR-MTCMOS circuits. The leakage current of a MOS transistor can be written as follows [15]:

$$ I_{leakage} = \mu_0 \frac{\varepsilon_{ox}}{T_{ox}} \frac{W}{L}\, \nu_T^2\, e^{1.8}\, e^{\frac{V_{gs} - V_{th}}{S \nu_T}} \left( 1 - e^{-\frac{V_{ds}}{\nu_T}} \right) \tag{2.15} $$

where V_gs and V_ds are the gate-source and drain-source voltages of the transistor and W/L is the width-to-length ratio of the transistor. In the sleep mode, all sleep and charge-recycling transistors are off, i.e., they all have V_gs = 0. Here, V_ds for each sleep and charge-recycling transistor is the absolute voltage difference between the VGND and VV_DD nodes in the sleep mode, which is approximately equal to V_DD based on the discussion in 2.4.1. From (2.15), we can ignore the dependence of the transistor's subthreshold leakage current on V_ds since V_ds ≥ 75mV.

There are two leakage current components corresponding to the two leakage paths in the conventional MTCMOS circuit: the NMOS sleep transistor leakage current (I_Ln) and the PMOS sleep transistor leakage current (I_Lp).
Assuming the widths of the NMOS and PMOS sleep transistors are W_n and W_p, respectively, I_Ln and I_Lp can be written as:

$$ I_{Ln} = \mu_n \frac{\varepsilon_{ox}}{T_{ox}} \frac{W_n}{L}\, \nu_T^2\, e^{1.8}\, e^{-\frac{V_{tH}}{S \nu_T}}, \qquad I_{Lp} = \mu_p \frac{\varepsilon_{ox}}{T_{ox}} \frac{W_p}{L}\, \nu_T^2\, e^{1.8}\, e^{-\frac{V_{tH}}{S \nu_T}} \tag{2.16} $$

where V_tH is the threshold voltage of the sleep transistors. The total leakage current of the MTCMOS circuit is the sum of I_Ln and I_Lp:

$$ I_{leakage}^{MTCMOS} = \frac{\varepsilon_{ox}}{L\, T_{ox}} \left( \mu_n W_n + \mu_p W_p \right) \nu_T^2\, e^{1.8}\, e^{-\frac{V_{tH}}{S \nu_T}} \tag{2.17} $$

For the charge-recycling MTCMOS (CR-MTCMOS), however, there is an additional leakage component due to the charge-recycling transistor (I_Lcr). For the purpose of this section, assume that instead of a TG, a single NMOS transistor with width W_cr is used for charge-recycling. Using (2.15), I_Lcr can be written as:

$$ I_{Lcr} = \mu_n \frac{\varepsilon_{ox}}{T_{ox}} \frac{W_{cr}}{L}\, \nu_T^2\, e^{1.8}\, e^{-\frac{V_{tH}}{S \nu_T}} \tag{2.18} $$

Using (2.17) and (2.18), the ratio of the leakage currents of CR-MTCMOS and MTCMOS can be written as:

$$ \frac{I_{leakage}^{CR\text{-}MTCMOS}}{I_{leakage}^{MTCMOS}} = \frac{\mu_n W_n + \mu_n W_{cr} + \mu_p W_p}{\mu_n W_n + \mu_p W_p} = 1 + \frac{W_{cr}}{W_n + \frac{\mu_p}{\mu_n} W_p} \tag{2.19} $$

Assuming μ_n = 2μ_p and W_n = 0.5W_p:

$$ \frac{I_{leakage}^{CR\text{-}MTCMOS}}{I_{leakage}^{MTCMOS}} = 1 + \frac{W_{cr}}{2 W_n} \tag{2.20} $$

Since the charge-recycling transistor is usually much smaller than the sleep transistors, the leakage increase given in (2.20) is usually negligible compared to the power saving achieved by using the charge-recycling technique.

2.7.2 Ground Bounce

Ground and power line bounces are among the most important design concerns when power gating is used [35]. Ground Bounce (GB) or power bounce may occur in power gating structures at the sleep-to-active transition edge. In this section we discuss how the charge-recycling technique affects the ground bounce. Consider the circuit in Figure 2.12. A large current flows into the GND after the sleep transistor is turned on at the end of the sleep period. We adopt a simple RL model for the purpose of GB analysis.
Because of the large di/dt at the turn-on time, a large voltage, i.e., L·di/dt, appears across the inductance. We next study the effect of the proposed charge-recycling technique on the GB of the circuit.

Figure 2.12: The RL equivalent model of the ground used to analyze the GB effect in MTCMOS circuits.

Figure 2.12 shows the virtual ground capacitance, C_G, connected to the RL circuit (modeling the pin-package parasitics of the IC) via the sleep transistor, S_N. The sleep transistor is turned on at t = 0 when the initial voltage of C_G is V_0, i.e., V_G(t=0) = V_0. Based on the results of [25], the positive peak of the GB occurs during the time when S_N operates in the saturation region. Although the peak value does not depend on V_0, it is a function of R, L, C_G, V_Tn, and V_DD. Therefore, we expect that the proposed charge-recycling technique, which changes V_0 from V_DD to V_DD/2, would not change the GB's positive peak. However, both the GB's negative peak and the settling time of the GB are functions of V_0 [25]. Furthermore, both quantities decrease if V_0 is reduced. Therefore, both the negative peak value and the settling time of the GB voltage are expected to decrease for the charge-recycling MTCMOS. The amounts of improvement in the negative peak and settling time depend on the relative values of L, C_G, R, V_DD, and the sleep transistor parameters. Figure 2.13 compares GB waveforms for the conventional and the charge-recycling power gating structures used for an inverter chain implemented in 70nm CMOS technology. As expected, the positive peak value is the same in both cases; however, the negative peak value and the settling time are smaller for the charge-recycling MTCMOS structure.

Figure 2.13: The GB waveforms in the conventional and the CR structures for an inverter chain implemented in a 70nm CMOS technology.
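As a numeric illustration of the leakage analysis in Section 2.7.1, the sketch below evaluates the leakage ratio of (2.19), which reduces to (2.20) when μ_n = 2μ_p and W_n = 0.5W_p. The transistor widths in the example are hypothetical, and the function name `leakage_ratio` is an illustrative choice.

```python
def leakage_ratio(w_cr: float, w_n: float, w_p: float,
                  mobility_ratio: float = 2.0) -> float:
    """Ratio I_CR-MTCMOS / I_MTCMOS from (2.19).

    mobility_ratio is mu_n / mu_p, so the PMOS width is weighted by
    1 / mobility_ratio: ratio = 1 + W_cr / (W_n + (mu_p / mu_n) * W_p).
    """
    return 1.0 + w_cr / (w_n + w_p / mobility_ratio)


if __name__ == "__main__":
    # Hypothetical widths in units of lambda: W_n = 1000, W_p = 2000, W_cr = 100.
    ratio = leakage_ratio(w_cr=100.0, w_n=1000.0, w_p=2000.0)
    print(f"leakage increase = {100 * (ratio - 1):.1f}%")
```

With the default mobility ratio and W_p = 2W_n, the function returns 1 + W_cr/(2W_n), matching (2.20).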
2.8 Variants of the Charge-Recycling Technique

In this section we discuss three variations of the proposed charge-recycling technique for MTCMOS circuits. Previously, we presented a type of charge-recycling technique that uses both NMOS and PMOS sleep transistors. Charge-recycling was then applied between the VGND and VV_DD nodes.

2.8.1 Charge-Recycling Between the Same Type of Virtual Rails

Consider Figure 2.14.a, where two circuit blocks C_1 and C_2 use the same type of sleep transistors, e.g., NMOS transistors. Suppose C_1 and C_2 work in “orthogonal” modes, i.e., when C_1 is in the active mode, C_2 is in the sleep mode and vice versa. For example, C_1 and C_2 can be the integer and floating-point arithmetic blocks of a processor. When the integer arithmetic block is used, the floating-point block is idle, and conversely. We show that charge-recycling can be performed between the VGND nodes of blocks C_1 and C_2, denoted by VGND_1 and VGND_2, respectively. First assume C_1 is in the active mode and C_2 is in the sleep mode. The voltages of VGND_1 and VGND_2 are 0 and V_DD, respectively. When C_1 is switched to the sleep mode, C_2 is switched to the active mode and the voltages of VGND_1 and VGND_2 change to V_DD and 0, respectively. Therefore, charge-recycling can be done between the VGND_1 and VGND_2 nodes to save the mode transition energy.
The energy consumptions of the MTCMOS and CR-MTCMOS circuits in a full active-sleep-active cycle are:

$$ E_{MTCMOS} = (C_{G1} + C_{G2})\, V_{DD}^2, \qquad E_{CR\text{-}MTCMOS} = C_{G1} V_{DD}\, \Delta V_1 + C_{G2} V_{DD}\, \Delta V_2 \tag{2.21} $$

where ΔV_1 and ΔV_2 are the voltage differences between the final charge-recycling voltage value and the supply voltage values of the two blocks and are calculated as follows:

$$ \Delta V_1 = V_{DD} - \frac{C_{G2}}{C_{G1} + C_{G2}} V_{DD}, \qquad \Delta V_2 = V_{DD} - \frac{C_{G1}}{C_{G1} + C_{G2}} V_{DD} \tag{2.22} $$

Figure 2.14: (a) Charge-recycling between two virtual grounds: VGND1 and VGND2. (b) Charge-recycling between the virtual rails of blocks with different supply levels.

Substituting ΔV_1 and ΔV_2 from (2.22) into (2.21), we can calculate the energy ratio as:

$$ \frac{E_{CR\text{-}MTCMOS}}{E_{MTCMOS}} = \frac{C_{G1}^2 + C_{G2}^2}{(C_{G1} + C_{G2})^2} \tag{2.23} $$

so that the energy saving ratio is 1 − E_CR-MTCMOS/E_MTCMOS = 2C_G1·C_G2/(C_G1 + C_G2)^2, which is similar to the regular charge-recycling case. The maximum energy saving of 50% is achieved when C_G1 = C_G2. Similarly, the charge-recycling technique may be applied between the VV_DD nodes of two blocks that use PMOS sleep transistors.

2.8.2 Charge-Recycling for Blocks with Different Power Supply Levels

Consider Figure 2.14.b, where two circuit blocks C_1 and C_2 use two different power supply levels, V_DD1 and V_DD2, respectively. If C_1 and C_2 use different types of sleep transistors, for example, C_1 uses an NMOS while C_2 uses a PMOS sleep transistor, and if C_1 and C_2 are always in the same mode of operation (i.e., they are both in the sleep mode or both in the active mode), then the charge-recycling technique may be applied between the virtual ground of C_1, VGND1, and the virtual supply of C_2, VV_DD2.
In this case, the energy consumptions of the MTCMOS and CR-MTCMOS circuits can be written as follows:

$$ E_{MTCMOS} = C_{G1} V_{DD1}^2 + C_{P2} V_{DD2}^2, \qquad E_{CR\text{-}MTCMOS} = C_{G1} V_{DD1}\, \Delta V_1 + C_{P2} V_{DD2}\, \Delta V_2 \tag{2.24} $$

where ΔV_1 and ΔV_2 are the voltage differences between the final charge-recycling voltage value and the supply voltage values of the two blocks and are calculated as follows:

$$ \Delta V_1 = V_{DD1} - \frac{C_{P2}}{C_{G1} + C_{P2}} V_{DD2}, \qquad \Delta V_2 = V_{DD2} - \frac{C_{G1}}{C_{G1} + C_{P2}} V_{DD1} \tag{2.25} $$

Substituting ΔV_1 and ΔV_2 from (2.25) into (2.24), we can calculate the energy saving ratio as:

$$ ESR = \frac{E_{MTCMOS} - E_{CR\text{-}MTCMOS}}{E_{MTCMOS}} = \frac{2\, C_{G1} C_{P2}\, V_{DD1} V_{DD2}}{(C_{G1} + C_{P2}) \left( C_{G1} V_{DD1}^2 + C_{P2} V_{DD2}^2 \right)} \tag{2.26} $$

One can see from (2.26) that the energy saving ratio in this case depends not only on the capacitance values of the virtual rails, but also on both supply voltage values. Notice that if V_DD1 = V_DD2, then (2.26) reduces to (2.9).

Table 2.1 shows the energy saving results for the two variants of the charge-recycling technique discussed in parts 2.8.1 and 2.8.2 of this section. The table includes three different cases of charge-recycling between the same type of virtual rails. In each case, we have used two blocks of the same circuit, both employing NMOS sleep transistors. Table 2.1 also includes a charge-recycling case for blocks with different supply levels. In this case we used two circuit blocks, 9sym and C880, where 9sym uses a PMOS sleep transistor and a supply voltage of V_DD1 = 1.3V, while C880 uses an NMOS sleep transistor and a supply voltage of V_DD2 = 1.0V. As one can see, the energy consumption during mode transition for CR-MTCMOS is less than that for MTCMOS by an average of 32%.

Table 2.1: Comparison of the Dynamic Mode-Transition Energy Dissipation of MTCMOS and CR-MTCMOS Circuits.

| Circuit Blocks | Type | Avg. # of Cells per Block | Avg. SLP TX Width per Block (λ) | Total CR TX Width (λ) | Mode-Transition Dynamic Energy, MTCMOS (fJ) | Mode-Transition Dynamic Energy, CR-MTCMOS (fJ) | ESR (%) |
|---|---|---|---|---|---|---|---|
| 9sym/9sym | 2VGND | 276 | 1200 | 600 | 866 | 525 | 39.5 |
| C432/C432 | 2VGND | 204 | 1000 | 450 | 673 | 375 | 44.3 |
| C880/C880 | 2VGND | 432 | 1600 | 1050 | 1356 | 851 | 37.2 |
| C880/9sym | 2VDD | 354 | 2200 | 750 | 2373 | 1800 | 24 |
| Avg. | - | - | - | - | 1317 | 887.8 | 36.3 |

2.8.3 Charge-Recycling for Super Cut-off CMOS

Turning on HVT devices is difficult in sub-1V CMOS technologies [31, 39]. In 45nm technology, the best-corner V_DD is 0.9V while the standard threshold voltage, SVT, is about 0.5V. For acceptable leakage saving, the high threshold voltage must be at least 0.65V. This leaves only a 0.25V margin for the gate-source voltage (0.65V < V_GS < 0.9V) of a turned-on NMOS sleep transistor when MTCMOS is used. Therefore, high threshold voltage (HVT) sleep transistors are too slow and hard to turn on in sub-1V technologies.

Super Cut-off CMOS (SCCMOS) circuits solve this problem by using a low threshold voltage (LVT) device for cutting off ground or V_DD [31]. Instead of using HVT devices for leakage reduction, SCCMOS circuits overdrive the LVT PMOS sleep transistors by applying a positive overdrive voltage of ΔV_DD in excess of V_DD to their gate terminals. Similarly, they underdrive the LVT NMOS sleep transistors by applying a negative voltage of –ΔV_DD to their gate terminals. It has been shown that SCCMOS circuits achieve the same leakage reduction as the corresponding MTCMOS circuits, with shorter wakeup times due to the use of LVT transistors.

Similar to MTCMOS, conventional SCCMOS circuits suffer from wasteful mode-transition energy consumption. Both NMOS and PMOS sleep transistors may be used to cut off power or ground from the gates inside a circuit. During the standby mode, due to leakage, the VGND node will be charged to a value close to V_DD while the VV_DD node will be discharged to a voltage close to GND [39]. The opposite situation occurs in the active mode.
Consequently, charge-recycling may be applied to SCCMOS circuits to save the mode-transition energy in the same fashion as it was applied to MTCMOS circuits. Figure 2.15 shows the configuration of the circuit used for charge-recycling SCCMOS (CR-SCCMOS). Table 2.3 reports the results of applying the charge-recycling technique to SCCMOS circuits. In order to have a fair comparison between each MTCMOS circuit and its SCCMOS counterpart, the value of the overdrive voltage for a PMOS sleep transistor in the SCCMOS circuit, i.e., ΔV_DD, is set to the threshold voltage difference between the HVT and LVT PMOS devices in the MTCMOS circuit. Similarly, the value of the underdrive voltage for an NMOS sleep transistor in the SCCMOS circuit, –ΔV_DD, is set to the threshold voltage difference between the LVT and HVT NMOS devices in the MTCMOS circuit.

Figure 2.15: Charge-recycling for SCCMOS circuits.

2.9 Simulation Results

We used the ISCAS-85 circuit benchmark suite to generate our experimental results. All benchmark circuits are first optimized using “script.rugged” in SIS. We used a 90nm cell library to perform timing-driven technology mapping. For NMOS transistors, the LVT value is 0.25V, whereas the HVT value is 0.65V. Similarly, for PMOS transistors, the LVT value is -0.22V, whereas the HVT value is -0.62V. The supply voltage is V_DD = 1.2V. Each circuit is divided into two sub-circuits: one uses an NMOS sleep transistor and the other uses a PMOS sleep transistor to do power gating. In all cases, the sub-circuits are chosen such that the total capacitance values of the virtual nodes are approximately equal.

Starting with an optimized and technology-mapped ISCAS-85 circuit, we first generate the MTCMOS version of the same circuit as follows.
We use a single NMOS sleep transistor to cut off the ground from the virtual ground node during the sleep time. The size of this sleep transistor is set to ensure a voltage drop of no more than 5% of V_DD across its R_DS(ON) when the circuit is active. This limits the performance penalty of the power gating structure. The exact solution to this problem requires an optimization that falls outside the scope of this dissertation. Interested readers may refer to [9, 29, 52] for different ways in which the problem can be formulated and solved. In our experiments, we assumed that at most 10% of the logic gates in the circuit, i.e., N/10 where N is the total number of gates in the circuit, have a simultaneous high-to-low output transition in any given cycle, each transition contributing an average current of ΔI_avg to the total current flowing through the ON sleep transistor, and therefore,

$$ R_{ds,n}(ON) = \frac{\Delta V}{I} = \frac{0.05\, V_{DD}}{\frac{N}{10}\, \Delta I_{avg}} = \frac{V_{DD}}{2 N\, \Delta I_{avg}}, \qquad \left( \frac{W}{L} \right)_n = \frac{1}{R_{ds,n}(ON)\, \mu_n C_{ox} \left( V_{DD} - V_{tH,n} \right)} \tag{2.27} $$

This simple derivation produces reasonably good results for the size of the MTCMOS sleep transistor in our benchmark suite. In the tables of results, we use the notation ST-MTCMOS to refer to the standard MTCMOS version of the circuits.

Next, we generate a version of the circuit benchmarks that uses both NMOS and PMOS sleep transistors. In particular, we partition circuit C into two blocks C1 and C2, where C1 uses an NMOS sleep transistor, while C2 uses a PMOS one. Furthermore, the partitioning is done such that the total capacitance of the virtual ground node of C1 is equal to the total capacitance of the virtual supply node of C2. The sizing of the NMOS and PMOS sleep transistors for each circuit block is done similarly to the ST-MTCMOS case (accounting for the difference between hole and electron mobility, of course). We refer to this version as NP-MTCMOS because it uses both types of sleep transistors, yet it does not perform any charge recycling.
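The sizing rule in (2.27) for the ST-MTCMOS sleep transistor can be sketched as follows. The process and circuit numbers in the example (the average transition current and μ_n·C_ox) are hypothetical placeholders, not values from the experiments, and the function name is illustrative.

```python
def size_sleep_transistor(n_gates: int, delta_i_avg: float, v_dd: float,
                          v_th: float, mu_cox: float,
                          drop_fraction: float = 0.05,
                          switching_fraction: float = 0.1):
    """Return (R_on, W/L) for an NMOS sleep transistor following (2.27).

    The allowed drop is drop_fraction * V_DD, and the assumed peak current is
    switching_fraction * N * delta_I_avg (10% of gates switching by default).
    """
    i_total = switching_fraction * n_gates * delta_i_avg
    r_on = drop_fraction * v_dd / i_total
    w_over_l = 1.0 / (r_on * mu_cox * (v_dd - v_th))
    return r_on, w_over_l


if __name__ == "__main__":
    # Hypothetical inputs: 276 gates, 10 uA per transition, mu_n*C_ox = 300 uA/V^2.
    r_on, w_over_l = size_sleep_transistor(
        n_gates=276, delta_i_avg=10e-6, v_dd=1.2, v_th=0.65, mu_cox=300e-6)
    print(f"R_on = {r_on:.1f} ohm, W/L = {w_over_l:.1f}")
```

With the default fractions, `r_on` equals V_DD/(2·N·ΔI_avg), which is the simplified middle expression in (2.27).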
We incorporate the charge recycling technique into NP-MTCMOS by using an appropriately sized TG as the switch between the VGND of C1 and the VVDD of C2. The size of this TG is selected such that the wakeup times of the NP-MTCMOS and the CR-MTCMOS circuits are approximately equal. The optimization is performed by measuring the wakeup time of the NP-MTCMOS circuit and sweeping the TG size (using SPICE) while monitoring the wakeup time of the CR-MTCMOS circuit.

Since electrons have a higher mobility than holes, NMOS sleep transistors are more conductive than PMOS ones; thus, from the area point of view it is better to use NMOS sleep transistors. However, the sleep transistor size is not the only factor determining whether NMOS or PMOS sleep transistors are used. Other factors, such as leakage, noise on the power/ground rails, and ease of implementation in a given CMOS technology, are also important. For example, PMOS transistors have a better leakage characteristic. Indeed, since the total area overhead of the sleep transistors is relatively small (it is typically less than 5% of the total logic cell area), using NMOS vs. PMOS sleep transistors does not make a big difference in terms of the total area. In contrast, an important issue is the cost of implementing PMOS or NMOS sleep transistors in the given process technology. If NMOS sleep transistors are used, the body connections of the NMOS transistors of the logic cells have to be tied to the virtual ground node in order to minimize the body effect. On the other hand, the body connection of the NMOS sleep transistor has to be tied to the actual ground. Thus, a three-well CMOS process is required, which is more expensive than a typical two-well CMOS process. In contrast, if PMOS sleep transistors are used, the p-substrate easily separates the n-well of these transistors from the other n-wells, which contain the PMOS transistors used in the normal cells. As far as we know, many industrial realizations of power gating use PMOS sleep transistors.
We generate NP-SCCMOS circuits by taking the NP-MTCMOS circuits and scaling both the NMOS and PMOS sleep transistors by the following factor:

$$ \frac{V_{DD} - V_{tH,*}}{V_{DD} - V_{tL,*}} $$

where V_{tH,*} and V_{tL,*} denote the HVT and LVT values of the NMOS or PMOS devices. Finally, we generate CR-SCCMOS by enabling charge sharing with an appropriately sized TG. Similar to the CR-MTCMOS case, the size of this TG is determined through SPICE simulation with the goal of equating the wakeup times of NP-SCCMOS and CR-SCCMOS.

Note that the control signal for the transmission gate needs to be synchronized with the sleep signal generated by the power management unit. The pulse duration has to be long enough to enable charge sharing but not unnecessarily long, since it adds to the wakeup time. Typically 20%-30% of the total cycle time is sufficient for the charge-recycling operation to finish. For example, in 90nm technology with a clock frequency of 2.5GHz, the cycle time is 400ps. Thus, a 100ps pulse width is a good choice for the charge-recycling operation. The task of synchronizing this pulse with the clock and the power management control signal is similar to meeting other timing constraints in nanoscale CMOS designs.

Table 2.2 shows the energy saving results for various ST-MTCMOS circuits and their corresponding NP-MTCMOS and CR-MTCMOS ones.

Table 2.2: Comparison of the Dynamic Mode-Transition Energy Dissipation of ST-MTCMOS, NP-MTCMOS and CR-MTCMOS Circuits.

| Circuit | # Cells Connected to VGND | # Cells Connected to VVDD | Total SLP TX Width (λ) | Total CR TX Width (λ) | Mode-Transition Dynamic Energy, ST-MTCMOS (fJ) | Mode-Transition Dynamic Energy, NP-MTCMOS (fJ) | Mode-Transition Dynamic Energy, CR-MTCMOS (fJ) | ST-ESR (%) | NP-ESR (%) |
|---|---|---|---|---|---|---|---|---|---|
| 9sym | 145 | 131 | 1,620 | 300 | 1240 | 1600 | 930 | 25 | 39 |
| C432 | 128 | 76 | 1,120 | 240 | 890 | 1060 | 660 | 25.8 | 37.3 |
| C880 | 232 | 200 | 2,528 | 480 | 1880 | 2400 | 1470 | 21.8 | 38.7 |
| C1355 | 296 | 230 | 3,024 | 480 | 2230 | 2820 | 1700 | 23.8 | 39.5 |
| C3540 | 745 | 550 | 7,580 | 900 | 5670 | 7340 | 4290 | 24.4 | 41.6 |
| C5315 | 1,017 | 710 | 9,748 | 900 | 7210 | 9230 | 5270 | 26.8 | 42.8 |
| Avg. | - | - | - | - | 3187 | 4075 | 2387 | 24.6 | 39.8 |

As one can see, the energy consumption during mode transition for CR-MTCMOS is less than that for ST-MTCMOS and NP-MTCMOS by an average of 25% and 40%, respectively. Note that, in all reported cases, the wakeup times are equal. As a good approximation, we can say the total sleep transistor area overhead in NP/CR-MTCMOS is 50% more than that for ST-MTCMOS. Since this area overhead is only a small percentage of the total chip area (less than 5%), the actual sleep transistor area overhead due to using CR-MTCMOS compared to ST-MTCMOS is less than 2.5%.

Table 2.3 shows the energy saving results for various NP-SCCMOS circuits and the corresponding CR-SCCMOS circuits. In order to have a fair comparison between MTCMOS and SCCMOS circuits, the value of the overdrive voltage for a PMOS super cut-off switch in the SCCMOS circuit is set to the threshold voltage difference between the HVT and LVT PMOS devices in the MTCMOS circuit. Similarly, the value of the underdrive voltage for an NMOS switch in the SCCMOS circuit is set to the threshold voltage difference between the HVT and LVT NMOS devices in the MTCMOS circuit. As one can see, the energy saving of CR-SCCMOS over NP-SCCMOS is about 36% on average for the same wakeup time.

Reducing ground and power rail bounces is among the important issues in designing MTCMOS circuits. As discussed in Section 2.7, the proposed charge-recycling technique reduces the ground (power) bounce of MTCMOS circuits. Table 2.4 validates this expectation by reporting the positive and negative peaks of the ground bounce for various NP-MTCMOS circuits and the corresponding CR-MTCMOS circuits. As one can see, the negative peak ground bounce value of CR-MTCMOS has decreased by an average of 37% compared to NP-MTCMOS.

Next, we compare ST-MTCMOS and CR-MTCMOS circuits in terms of their total energy consumptions.
The total energy consumptions of the ST-MTCMOS and CR-MTCMOS circuits may be written as the sum of their corresponding active and sleep mode energy consumptions plus the energy consumption due to the mode transitions in these circuits:

$$ E_{total}^{\text{ST-MTCMOS}} = E_{active}^{\text{ST-MTCMOS}} + E_{sleep}^{\text{ST-MTCMOS}} + E_{mt}^{\text{ST-MTCMOS}}, \qquad E_{total}^{\text{CR-MTCMOS}} = E_{active}^{\text{CR-MTCMOS}} + E_{sleep}^{\text{CR-MTCMOS}} + E_{mt}^{\text{CR-MTCMOS}} \tag{2.28} $$

Table 2.3: Comparison of the Dynamic Mode-Transition Energy Dissipation of NP-SCCMOS and CR-SCCMOS Circuits, V_DD = 1.2V.

| Circuit | # Cells Connected to VGND | # Cells Connected to VVDD | Total SLP TX Width (λ) | Total CR TX Width (λ) | Mode-Transition Dynamic Energy, NP-SCCMOS (fJ) | Mode-Transition Dynamic Energy, CR-SCCMOS (fJ) | ESR (%) |
|---|---|---|---|---|---|---|---|
| 9sym | 145 | 131 | 972 | 450 | 860 | 590 | 31.4 |
| C432 | 128 | 76 | 672 | 330 | 590 | 410 | 30.5 |
| C880 | 232 | 200 | 1,517 | 600 | 1270 | 840 | 33.8 |
| C1355 | 296 | 230 | 1,815 | 480 | 1480 | 920 | 37.8 |
| C3540 | 745 | 550 | 4,548 | 900 | 3930 | 2330 | 40.7 |
| C5315 | 1,017 | 710 | 5,849 | 900 | 5020 | 2910 | 42 |
| Avg. | - | - | - | - | 2192 | 1333 | 36 |

Table 2.4: Ground Bounce Comparison between MTCMOS and CR-MTCMOS Circuits, V_DD = 1.2V, L = 5nH, R = 5Ω.

| Circuit | Positive Peak GB, MTCMOS (mV) | Positive Peak GB, CR-MTCMOS (mV) | Positive Peak GB Reduction (%) | Negative Peak GB, MTCMOS (mV) | Negative Peak GB, CR-MTCMOS (mV) | Negative Peak GB Reduction (%) |
|---|---|---|---|---|---|---|
| 9sym | 475 | 435 | 8.4 | 326 | 158 | 51.5 |
| C432 | 476 | 437 | 8.2 | 375 | 225 | 40 |
| C880 | 455 | 417 | 8.4 | 324 | 181 | 44.1 |
| C1355 | 431 | 398 | 7.7 | 311 | 151 | 51.4 |
| C3540 | 315 | 293 | 7.0 | 202 | 155 | 23.2 |
| C5315 | 228 | 202 | 11.4 | 206 | 193 | 6.3 |
| Avg. | 397 | 364 | 8.5 | 291 | 177 | 36.8 |

The active-mode energy consumption in both cases consists of two parts: a dynamic component and a static (leakage) component. Since the ON resistance of the sleep transistor in the active mode is non-zero, both active-mode energy components are slightly different in the ST-MTCMOS and CR-MTCMOS circuits; however, this is a secondary effect which we ignore in this work.
Therefore,

$$ E_{active}^{\text{ST-MTCMOS}} = E_{active}^{\text{CR-MTCMOS}} = \left( c_{sw} V_{DD}^2 f_{clk} + I_{la} V_{DD} \right) t_{active} \tag{2.29} $$

where c_sw denotes the average switched capacitance of the circuit in each clock cycle, f_clk is the clock frequency, I_la denotes the average active leakage current in the circuit, and t_active is the total time the circuit is active. Let N_clk denote the number of clock cycles over which the energy calculations are performed. We can write:

$$ t_{active} = \alpha\, N_{clk} T_{clk}, \qquad t_{sleep} = (1 - \alpha)\, N_{clk} T_{clk} \tag{2.30} $$

where T_clk = 1/f_clk is the clock period, and α denotes the duty factor, which is defined as the fraction of the total time that the circuit is active. The sleep-mode energy consumptions of the two circuits can be written as:

$$ E_{sleep}^{\text{ST-MTCMOS}} = I_{ls_n}^{ST} V_{DD}\, t_{sleep}, \qquad E_{sleep}^{\text{CR-MTCMOS}} = \left( I_{ls_n}^{CR} + I_{ls_p}^{CR} + I_{l_{cr}}^{CR} \right) V_{DD}\, t_{sleep} \tag{2.31} $$

where I_{ls_n}^{ST} is the leakage current through the sleep transistor in the ST-MTCMOS circuit during the sleep mode of operation. I_{ls_n}^{CR}, I_{ls_p}^{CR}, and I_{l_cr}^{CR} denote the leakage currents through the NMOS sleep transistors, the PMOS sleep transistors, and the charge-recycling transistors in the CR-MTCMOS circuit during the sleep mode of operation, respectively. Typically, the leakage currents through the sleep transistors in both cases are of the same order; however, since the TG is much smaller than the sleep transistors (usually smaller than 1/10th of their size), I_{l_cr}^{CR} in (2.31) is much smaller (less than 1/10th) than I_{ls_n}^{CR} + I_{ls_p}^{CR}. The mode-transition energy consumptions of the two circuits can be written as:

$$ E_{mt}^{\text{ST-MTCMOS}} = \left( c_{slp}^{st} + c_{G}^{st} \right) V_{DD}^2\, \beta N_{clk}, \qquad E_{mt}^{\text{CR-MTCMOS}} = \left( c_{slp}^{cr} + \tfrac{1}{2} \left( c_{G}^{cr} + c_{P}^{cr} \right) \right) V_{DD}^2\, \beta N_{clk} \tag{2.32} $$

where c_slp^st and c_slp^cr denote the total sleep transistor input capacitances of the two circuits, c_G^st denotes the total virtual ground capacitance in the ST-MTCMOS circuit, and c_G^cr and c_P^cr denote the total virtual ground and virtual V_DD capacitances in the CR-MTCMOS circuit, respectively.
Finally, $\beta$ is the mode-transition factor, that is, the fraction of clock cycles during which a mode transition occurs. From (2.29), the active-mode energy consumption is the same for both circuits, which means that the charge-recycling technique has no influence on the active-mode energy consumption; we therefore do not consider the active-mode component of (2.28) in the remainder of the discussion. Accordingly, (2.28) can be rewritten as:

\[
\begin{aligned}
E_{slp,mt}^{ST\text{-}MTCMOS} &= E_{sleep}^{ST\text{-}MTCMOS} + E_{mt}^{ST\text{-}MTCMOS} \\
E_{slp,mt}^{CR\text{-}MTCMOS} &= E_{sleep}^{CR\text{-}MTCMOS} + E_{mt}^{CR\text{-}MTCMOS}
\end{aligned}
\tag{2.33}
\]

Substituting (2.30), (2.31), and (2.32) into (2.33), and ignoring the terms related to the sleep transistor input capacitances, we obtain:

\[
\begin{aligned}
E_{slp,mt}^{ST\text{-}MTCMOS} &= \left[ I_{ls_n}^{ST} V_{DD} (1 - \alpha) T_{clk} + \beta\, c_{G}^{st} V_{DD}^{2} \right] N_{clk} \\
E_{slp,mt}^{CR\text{-}MTCMOS} &= \left[ \left( I_{ls_n}^{CR} + I_{ls_p}^{CR} + I_{l_{cr}}^{CR} \right) V_{DD} (1 - \alpha) T_{clk} + \tfrac{1}{2} \beta \left( c_{G}^{cr} + c_{P}^{cr} \right) V_{DD}^{2} \right] N_{clk}
\end{aligned}
\tag{2.34}
\]

Figure 2.16 shows the percentage of the total energy saving of CR-MTCMOS over ST-MTCMOS as a function of the mode-transition frequency for three different duty factor values for one of the ISCAS-85 benchmark circuits, 9sym.

Figure 2.16: Percentage of energy saving versus mode-transition factor for different duty factors for the 9sym circuit, $f_{clk}$ = 4 GHz. (Axes: mode transition factor, per million cycles, versus sleep plus mode-transition energy saving, %.)

As we increase the mode-transition factor $\beta$, the percentage of energy saving increases in each case. This is because the charge-recycling technique saves energy only during mode transitions. As we increase the duty factor $\alpha$, the total sleep time decreases and the total saving consequently increases. This can be seen in Figure 2.16 by comparing the energy saving plots for different duty factors.
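The trends described for (2.34) can be reproduced with a few lines of Python. The sketch below is only illustrative: the clock frequency (4 GHz) and supply voltage (1.2 V) are taken from the text, whereas every leakage current and capacitance value is a made-up assumption chosen so that the CR-MTCMOS circuit leaks slightly more in sleep mode and has the same total virtual-rail capacitance as the ST-MTCMOS circuit. The function names are hypothetical.

```python
def e_slp_mt_st(alpha, beta, i_ls_n, c_g, vdd, t_clk, n_clk):
    """Sleep plus mode-transition energy of ST-MTCMOS, first line of (2.34)."""
    return (i_ls_n * vdd * (1.0 - alpha) * t_clk + beta * c_g * vdd ** 2) * n_clk


def e_slp_mt_cr(alpha, beta, i_ls_n, i_ls_p, i_l_cr, c_g, c_p, vdd, t_clk, n_clk):
    """Sleep plus mode-transition energy of CR-MTCMOS, second line of (2.34)."""
    leak = (i_ls_n + i_ls_p + i_l_cr) * vdd * (1.0 - alpha) * t_clk
    return (leak + 0.5 * beta * (c_g + c_p) * vdd ** 2) * n_clk


def saving_percent(alpha, beta):
    """Energy saving of CR-MTCMOS over ST-MTCMOS in percent.

    All currents (A) and capacitances (F) below are assumed values,
    not data from the dissertation.
    """
    vdd, t_clk, n_clk = 1.2, 1.0 / 4e9, 1_000_000
    e_st = e_slp_mt_st(alpha, beta, 1e-6, 1e-12, vdd, t_clk, n_clk)
    e_cr = e_slp_mt_cr(alpha, beta, 0.5e-6, 0.5e-6, 0.05e-6,
                       0.5e-12, 0.5e-12, vdd, t_clk, n_clk)
    return 100.0 * (e_st - e_cr) / e_st


if __name__ == "__main__":
    for alpha in (0.1, 0.5, 0.9):
        row = [round(saving_percent(alpha, beta), 1) for beta in (1e-6, 1e-5, 1e-4)]
        print(alpha, row)
```

Under these assumptions the saving grows with both β and α, and for α approaching 1 it approaches the pure mode-transition saving, which is 50% here only because the assumed capacitances satisfy $c_G^{cr} + c_P^{cr} = c_G^{st}$.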
For large values of $\alpha$ (e.g., 0.9) and $\beta$, the sleep plus mode-transition energy saving ratio is approximately equal to the mode-transition energy saving ratio (as was reported in Table 2.2).

Chapter 3
Multimodal Power Gating

3.1 Introduction

MTCMOS technology provides a simple and effective power gating structure by utilizing high-speed, low-Vt (LVT) transistors for logic cells and low-leakage, high-Vt (HVT) devices as sleep transistors [55]. Sleep transistors disconnect logic cells from the supply and/or ground to reduce the leakage in the standby mode. More precisely, MTCMOS uses low-leakage NMOS (PMOS) transistors as footer (header) switches to disconnect ground (power supply) from parts of a design in the circuit standby mode. MTCMOS is a leakage power saving solution that provides high active-mode performance and low standby leakage power [30, 43].

Because of the large rush-through current and the large wakeup latency of MTCMOS circuits, for short standby periods it is better to put the circuit into an intermediate power-saving mode (called the drowsy mode). The reason is that the transition latency from the drowsy mode to the active mode (which we shall call the ready latency) is much less than the wakeup time of the circuit when coming out of the sleep mode. Furthermore, if designed appropriately, drowsy circuits can retain the pre-standby internal state of the circuit. The downside of putting a circuit into the drowsy mode is a higher leakage current compared to the case when the circuit is put into the sleep mode.

In this chapter, we first present a new tri-modal power gating switch that enables three different circuit modes:
A. Active
B. Sleep
C. Drowsy
The presented tri-modal switch benefits from both a low-leakage sleep mode and a drowsy mode with fast, low-cost mode transitions.
We then use this switch to implement some interesting circuits, such as data-retentive power-gated circuits, a power-gating structure that exploits the proposed tri-modal switch to eliminate the need for retention flip-flops in power-gated circuits [24]. We will see that such a circuit can operate in four different modes, namely Active, Drowsy, Data Retentive, and Sleep. We then extend the application of the tri-modal switch by introducing multimodal headers and footers that facilitate more interesting designs such as multi-drowsy mode circuits, which enable multiple drowsy modes in addition to the sleep and active modes. We will demonstrate how the same infrastructure that is used for power gating, i.e., the multimodal header, enables active-mode voltage scaling. This expands the power-saving operational spectrum of multimodal switches from the standby mode to the active plus standby modes, i.e., it yields a higher TPSF value.

3.2 Prior Work

In [34], the authors propose a power gating structure that supports an intermediate (drowsy) power-saving mode in addition to the traditional sleep mode. The idea is to add a clamping PMOS transistor in parallel with each NMOS sleep transistor. By applying zero voltage to the gates of the clamping PMOS and NMOS sleep transistors, the circuit can be put in the intermediate power-saving mode, whereby leakage reduction and data retention are both realized. In the deep sleep mode with no data retention, the gate of the PMOS transistor is connected to VDD while the NMOS sleep transistor is turned off. In this approach, similar to other MTCMOS techniques, the sleep signal is generated by an always-on buffer. To achieve shorter wakeup times, the sleep buffer uses LVT devices. Therefore, this approach suffers from a high drowsy leakage current due to the always-on buffers. In Section 3.3 we will see that the sleep buffer can also be power-gated during the drowsy mode, and thus its leakage may be reduced.
The work in [7] describes multiple power modes for the circuit, but it needs multiple supply voltages (stable reference voltages to drive the gate terminal of the sleep transistor, which operates at different points of the subthreshold conduction region during the sleep mode), which is costly. In [56], the authors propose a drowsy circuit scheme that automatically controls the degree of drowsiness of the circuit by using negative feedback implemented with a sleep inverter. This configuration clamps the voltage level of the virtual ground node through the negative feedback loop. The problem with this technique is that the circuit works either in the active mode or in the drowsy mode, and the sleep mode is lost. The technique works well for short standby periods, when the circuit switches back and forth between standby and active periods frequently. However, for medium to long standby periods, the technique in [56] fails to be effective due to the large amount of leakage consumed during the long standby period.

3.3 Tri-Modal Switch

In this section we present the circuit configuration and functionality of the tri-modal switch. Similar to bimodal MTCMOS switches, the tri-modal switch comes in two flavors: header-type and footer-type switches.

3.3.1 Circuit Configuration and Switch Functionality

Figure 3.1 shows the proposed footer-type tri-modal switch configuration. Both HVT and LVT transistors are used in this design. We use thick lines to draw the gate plate of HVT transistors [49]. Conventional footer sleep transistors use a single control input called SLEEP. As seen in Figure 3.1, the proposed tri-modal switch has an additional input called DROWSY. We show how this switch enables three different circuit operation modes (sleep, drowsy, or active) depending on the values of the two control signals (see Table 3.1 for the functionality of the tri-modal switch in terms of its input signals).
When SLEEP = ‘0’, MS1 is ON and the voltage level at GS (the gate of MS) is VDD. Thus, independent of the value of the DROWSY input, the MS transistor is ON, virtual ground (VVSS) is connected to actual ground (VSS), and the circuit is in the active mode. When SLEEP = ‘1’, the tri-modal switch operates in the sleep or drowsy mode depending on the value of the DROWSY signal. In particular, if DROWSY = ‘0’, MS2 and MD2 will both be ON, and the output of the sleep inverter, GS, will be ‘0’, which turns the sleep transistor MS OFF. In this case, the tri-modal switch cell puts the circuit in the sleep mode. If DROWSY = ‘1’, MS2 and MD1 will be ON, creating a negative feedback between the VVSS and GS nodes, which puts the circuit block into the drowsy mode.

Figure 3.1: Implementation of the tri-mode footer cell. (Schematic labels: SLEEP, DROWSY, sleep inverter with MS1 and MS2, MS, MD1, MD2, GS, VVSS, VDD, circuit block.)

Table 3.1: Tri-mode switch functionality

| SLEEP | DROWSY | Tri-mode Switch Function |
|---|---|---|
| 0 | X | Active |
| 1 | 0 | Sleep |
| 1 | 1 | Drowsy |

Unlike in conventional power-gating techniques, the sleep inverter in the tri-modal switch cell is power gated through the MS transistor during the sleep mode; thus it has low leakage. In addition, the drowsy signal changes only when we make a transition from sleep to drowsy or vice versa, which means that the drowsy signal need not be fast. Therefore, the always-on drowsy inverter shown in Figure 3.1 can be implemented using HVT devices to lower the leakage. The transistor-count overhead of the proposed tri-modal switch is only four: MD1, MD2, and the two transistors inside the drowsy inverter. The two transistors inside the sleep inverter, MS1 and MS2, are already used by all other power gating structures.
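The control-signal decoding of Table 3.1 is small enough to state as code. The following Python sketch is a behavioral summary only (it does not model the transistors); the names `Mode` and `tri_modal_mode` are hypothetical and introduced here purely for illustration.

```python
from enum import Enum


class Mode(Enum):
    ACTIVE = "active"
    SLEEP = "sleep"
    DROWSY = "drowsy"


def tri_modal_mode(sleep: int, drowsy: int) -> Mode:
    """Decode the SLEEP/DROWSY inputs according to Table 3.1."""
    if sleep == 0:
        # SLEEP = 0: MS is ON regardless of DROWSY (don't care)
        return Mode.ACTIVE
    # SLEEP = 1: DROWSY selects between the two standby modes
    return Mode.DROWSY if drowsy == 1 else Mode.SLEEP


if __name__ == "__main__":
    for s in (0, 1):
        for d in (0, 1):
            print(s, d, tri_modal_mode(s, d).value)
```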
In Section 3.3.4 we shall see that these additional transistors are all minimum sized, independent of the circuit block or the sleep transistor size; therefore, the actual area overhead of these additional transistors is quite small. The circuit configuration of the tri-modal header is similar to that of the footer and is presented in Figure 3.2. The switch functionality is identical to that of the footer and is thus also given by Table 3.1. So far in this section we have explained the functionality of the tri-modal header and footer cells intuitively. In the next section we support our intuitive claims using more detailed circuit equations.

3.3.2 Leakage Equations

In order to see the difference between the sleep and drowsy modes, we write separate leakage equations of the circuit for these two modes. We carry out the analysis only for the footer-type switch in this section. The analysis for the header switch can be carried out in a similar fashion.

Figure 3.2: Implementation of the tri-mode header cell.

Consider the footer switch in Figure 3.1. In both the sleep and drowsy modes there is a sneak leakage path from VVSS to VSS through MD1 and MD2. In the sleep mode, MD2 is ON and the sneak path goes through MD1, which operates in the sub-threshold region, whereas in the drowsy mode, MD1 is ON and the sneak path passes through MD2, which operates in the sub-threshold region. To minimize the leakage of these sneak paths, MD1 and MD2 must be HVT transistors. To calculate the final voltage level of the VVSS node in the sleep and drowsy modes, ignoring the gate leakage, we write a KCL equation for the sub-threshold leakage components at the VVSS node.
We use the transistor sub-threshold leakage equation [15]:

\[
I_{sub} = A\, e^{\frac{q}{nkT}\left(V_{GS} - V_{TH} + \eta V_{DS}\right)} \left(1 - e^{-\frac{qV_{DS}}{kT}}\right), \quad \text{with } A = \mu_0 C_{ox} \frac{W}{L_{eff}} \left(\frac{kT}{q}\right)^{2} e^{1.8}
\tag{3.1}
\]

In this equation $V_{GS}$, $V_{DS}$, and $V_{TH}$ denote the gate-source, drain-source, and (body-affected) threshold voltages of the transistor, respectively; $\eta$ is the DIBL (Drain Induced Barrier Lowering) coefficient representing the effect of $V_{DS}$ on the threshold voltage; $C_{ox}$ is the gate oxide capacitance per unit area; $\mu_0$ is the zero-bias carrier mobility; and $n$ denotes the sub-threshold swing coefficient of the transistor.

During the sleep mode, SLEEP = ‘1’, DROWSY = ‘0’, MS2 and MD2 are ON, and MD1 and MS are in the sub-threshold region. In this case, if we assume the voltage level of the VVSS node is $V_X$, the KCL equation at VVSS yields:

\[
I_{leak,CB}(V_X) = I_{sub,MS}(V_X) + I_{sub,MD1}(V_X)
\tag{3.2}
\]

where $I_{sub,MS}$ and $I_{sub,MD1}$ are the sub-threshold leakage currents of MS and MD1, respectively, and $I_{leak,CB}$ denotes the leakage current of the circuit block (CB). Substituting the sub-threshold leakage current from (3.1) into (3.2), we obtain:

\[
I_{leak,CB}(V_X) = A_{MS}\, e^{\frac{q}{nkT}\left(-V_{TN} + \eta V_X\right)} \left(1 - e^{-\frac{qV_X}{kT}}\right) + A_{MD1}\, e^{\frac{q}{nkT}\left(-V_{TN} + \eta V_X\right)} \left(1 - e^{-\frac{qV_X}{kT}}\right)
\tag{3.3}
\]

In the drowsy mode, SLEEP = ‘1’, DROWSY = ‘1’, MS2 and MD1 are ON, and MD2 and MS are in the sub-threshold region. In this case, if we assume the voltage level of the VVSS node is $V_X$, the KCL equation at VVSS yields:

\[
I_{leak,CB}(V_X) = I_{sub,MS}(V_X) + I_{sub,MD2}(V_X) + I_{sub,MS1}(V_X)
\tag{3.4}
\]

where $I_{sub,MS}$, $I_{sub,MD2}$, and $I_{sub,MS1}$ are the sub-threshold leakage currents of MS, MD2, and MS1, respectively.
Substituting the sub-threshold leakage current from (3.1) into (3.4), we obtain:

\[
\begin{aligned}
I_{leak,CB}(V_X) ={}& A_{MS}\, e^{\frac{q}{nkT}\left(-V_{TN} + (1+\eta) V_X\right)} \left(1 - e^{-\frac{qV_X}{kT}}\right) \\
&+ A_{MD2}\, e^{\frac{q}{nkT}\left(-V_{TN} + \eta V_X\right)} \left(1 - e^{-\frac{qV_X}{kT}}\right) \\
&+ A_{MS1}\, e^{\frac{q}{nkT}\left(-V_{TP} + \eta (V_{DD} - V_X)\right)} \left(1 - e^{-\frac{q(V_{DD} - V_X)}{kT}}\right)
\end{aligned}
\tag{3.5}
\]

Now we show that the $V_X$ value obtained for the drowsy mode is strictly smaller than that obtained for the sleep mode.

Theorem 1. Assume $W_{MD1} = W_{MD2}$. Let $V_{X1}$ and $V_{X2}$ denote the solutions of equations (3.3) and (3.5), respectively. Then $V_{X1} > V_{X2}$.

Proof (by contradiction). Suppose $V_{X2} \ge V_{X1}$. Since $W_{MD1} = W_{MD2}$, we have $A_{MD1} = A_{MD2}$. We can easily show that:

\[
A_{MS}\, e^{\frac{q}{nkT}\left(-V_{TN} + (1+\eta) V_{X2}\right)} \left(1 - e^{-\frac{qV_{X2}}{kT}}\right) \ge A_{MS}\, e^{\frac{q}{nkT}\left(-V_{TN} + \eta V_{X1}\right)} \left(1 - e^{-\frac{qV_{X1}}{kT}}\right)
\tag{3.6}
\]

The assumption $V_{X2} \ge V_{X1}$ also results in the following:

\[
A_{MD2}\, e^{\frac{q}{nkT}\left(-V_{TN} + \eta V_{X2}\right)} \left(1 - e^{-\frac{qV_{X2}}{kT}}\right) \ge A_{MD1}\, e^{\frac{q}{nkT}\left(-V_{TN} + \eta V_{X1}\right)} \left(1 - e^{-\frac{qV_{X1}}{kT}}\right)
\tag{3.7}
\]

We also have:

\[
A_{MS1}\, e^{\frac{q}{nkT}\left(-V_{TP} + \eta (V_{DD} - V_{X2})\right)} \left(1 - e^{-\frac{q(V_{DD} - V_{X2})}{kT}}\right) > 0
\tag{3.8}
\]

Adding both sides of the inequalities (3.6)-(3.8) and comparing the result with (3.3) and (3.5), we obtain $I_{leak,CB}(V_{X2}) > I_{leak,CB}(V_{X1})$. This is a contradiction, because raising the virtual ground voltage reduces the voltage across the circuit block, so $V_{X2} \ge V_{X1}$ requires $I_{leak,CB}(V_{X2}) \le I_{leak,CB}(V_{X1})$. Hence $V_{X2} < V_{X1}$. ■

Based on Theorem 1, we can argue that in the proposed tri-modal switch, the voltage level of VVSS in the drowsy mode is strictly less than that in the sleep mode. A similar analysis can be performed for the header cell.

3.3.3 Data Retention and Noise Stability

Figure 3.3 shows a master-slave D flip-flop (DFF). Initially the DFF is holding a logic ‘0’ value at the Q output. In the drowsy mode, however, this value rises to around 250mV with a VDD of 1.8V. Our simulations show that the VVSS voltage level in the drowsy mode is a weak function of the circuit block and is always around 250mV for this technology, which is TSMC 0.18um.
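The balance conditions (3.3) and (3.5) of Section 3.3.2 have no closed-form solution for $V_X$, but they are easy to solve numerically. The Python sketch below does so by bisection. It is an illustration under strong assumptions, not a reproduction of the simulations: the circuit-block leakage is modeled by a hypothetical linear function that falls to zero as VVSS reaches VDD, and all device constants (the $A$ prefactors, thresholds, $\eta$, and $n$) are made-up values rather than TSMC 0.18um parameters.

```python
import math

V_T = 0.02585  # thermal voltage kT/q near room temperature, in volts


def subthreshold(a, v_gs, v_ds, v_th, eta=0.1, n=1.5):
    """Sub-threshold current in the form of (3.1); eta and n are assumed values."""
    return (a * math.exp((v_gs - v_th + eta * v_ds) / (n * V_T))
            * (1.0 - math.exp(-v_ds / V_T)))


def block_leakage(v_x, i0=1e-7, vdd=1.8):
    """Hypothetical circuit-block leakage, decreasing linearly in V_X."""
    return i0 * (1.0 - v_x / vdd)


def sleep_rhs(v_x, a_ms, a_md, v_tn):
    """Right-hand side of (3.3): MS and MD1 with V_GS = 0, V_DS = V_X."""
    return subthreshold(a_ms, 0.0, v_x, v_tn) + subthreshold(a_md, 0.0, v_x, v_tn)


def drowsy_rhs(v_x, a_ms, a_md, a_ms1, v_tn, v_tp, vdd=1.8):
    """Right-hand side of (3.5): MS with V_GS = V_DS = V_X, MD2, and MS1."""
    return (subthreshold(a_ms, v_x, v_x, v_tn)
            + subthreshold(a_md, 0.0, v_x, v_tn)
            + subthreshold(a_ms1, 0.0, vdd - v_x, v_tp))


def solve_vx(rhs, vdd=1.8, iterations=60):
    """Bisection on rhs(V_X) - block_leakage(V_X) over [0, VDD]."""
    lo, hi = 0.0, vdd
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if rhs(mid) - block_leakage(mid, vdd=vdd) < 0.0:
            lo = mid  # block still leaks more than the switch sinks
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Assumed device constants (A in amperes, thresholds in volts)
A_MS, A_MD, A_MS1, V_TN, V_TP = 1e-5, 1e-6, 1e-5, 0.5, 0.45

v_sleep = solve_vx(lambda v: sleep_rhs(v, A_MS, A_MD, V_TN))
v_drowsy = solve_vx(lambda v: drowsy_rhs(v, A_MS, A_MD, A_MS1, V_TN, V_TP))

if __name__ == "__main__":
    print("V_X (sleep) =", v_sleep, " V_X (drowsy) =", v_drowsy)
```

With these assumed parameters the sleep-mode solution lies close to VDD while the drowsy-mode solution stays at a few hundred millivolts, in line with the ordering stated by Theorem 1. The absolute values depend entirely on the assumed constants.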
To assess the data stability of the DFF, a negative voltage perturbation is applied to the internal node, S, of the DFF when it is holding a logic ‘1’ value (Q=1). Simulations show that the data in the DFF is retained for perturbations smaller than ΔV=609mV (cf. Table 3.2).

Figure 3.3: A DFF and negative noise applied on its internal node. The ground connection goes through a tri-modal footer cell.

Table 3.2: Stability and data retention of DFF in drowsy mode

| Type of Variation | VVSS Voltage (mV) | Peak of the Max. Tolerable Noise at Node S (mV) |
|---|---|---|
| No Variation | 252 | 609 |
| \|V_th\| +15% | 237 | 662 |
| \|V_th\| −15% | 289 | 547 |
| VDD +10% | 256 | 629 |
| VDD −10% | 249 | 596 |

The maximum tolerable perturbation (noise margin) for the same flip-flop when no power gating is employed is 825mV. The VVSS voltage and the maximum tolerable perturbation vary under different circuit parameter variations, which results in different noise stability characteristics. As Table 3.2 reports, the drowsy DFF shows good noise stability characteristics even under these variations.

3.3.4 Transistor Sizing

Correct sizing of the different transistors in the tri-modal switch is an important task since it directly affects various characteristics of the circuit, including logic gate switching speeds in the active mode, leakage currents in the sleep and drowsy modes, wakeup latency, and area overhead. There are a number of design tradeoffs that impinge on transistor sizing for the tri-modal switch. For example, in the active mode when MS is ON, the delay of the circuit block in Figure 3.1 depends on the size of MS. Larger MS sizes result in higher active-mode switching speeds but also in increased sleep and drowsy leakage currents and in a lower VVSS voltage during the drowsy mode, which in turn leads to lower ground bounce and shorter wakeup delays.
Active Mode Performance: Sizing MS

Power-gated circuits suffer from active-mode performance degradation due to the lower effective VDD, which is caused by the IR drop across the sleep transistor in the active mode. The sleep transistor in the active mode operates in its linear region; thus it can be modeled as a linear resistance. Consider using an NMOS sleep transistor (gated ground). Each time there is a high-to-low switching at any node in the circuit block, current flows from the node capacitance to ground through the sleep transistor (MS in Figure 3.1). This discharging current causes a voltage drop between the drain and source of the sleep transistor, resulting in switching speed degradation for the considered transition. The amount of speed degradation depends on the size of the sleep transistor. The larger the sleep transistor is, the lower the switching speed degradation will be. Typically the maximum tolerable performance degradation in a power-gated design is set to 5-10% of the corresponding non-power-gated circuit. We set the maximum performance degradation to 5%. With this constraint, we size the sleep transistor MS in the tri-modal switch. The sizing technique, which is straightforward and follows standard sleep transistor sizing techniques, is omitted. Interested readers may refer to [28, 48, 52] for sleep transistor sizing.

Wakeup Latency and Leakage: Sizing MS1

Consider a gated-ground circuit block. During the sleep period, when the sleep transistor is OFF, if the circuit block is large enough, then the VVSS node and all internal nodes in the circuit will charge to a high voltage level [46]. This is due to the higher leakage of the circuit block compared to that of the OFF sleep transistor, which eventually charges up all the internal nodes in the circuit block including the VVSS node.
At the edge of the sleep-to-active mode transition, the sleep transistor is turned on, but the circuit block will not start working at its full speed until all extra charges are removed from the internal nodes (including VVSS) through the sleep transistor. There is a wakeup latency associated with this discharging process. The wakeup latency, $t_w$, is defined as the delay between the time when the SLEEP signal crosses the 50% VDD level as it makes a transition to the low state and the time when the VVSS node reaches 5% of the VDD level as it is discharged toward VSS. Similarly, when the circuit in Figure 3.1 is put in the drowsy mode, the VVSS node is charged to a non-zero voltage level. Even though the circuit block is still functional, it will not be working at full speed. Therefore, there is a ready latency associated with a drowsy circuit that is brought into the active mode. In this chapter, the ready latency, $t_r$, is defined as the delay between the time when the SLEEP signal crosses the 50% VDD level as it falls and the time when the VVSS node reaches 5% of the VDD level as it is discharged toward VSS.

The wakeup and ready latencies of the circuit configuration in Figure 3.1 depend on the sizes of MS and MS1 and on the voltage level of the VVSS node in the sleep/drowsy mode. The voltage level of the VVSS node in the sleep/drowsy mode is mainly determined by the size and threshold voltage of MS. Since MS is sized according to the active-mode performance criterion (cf. the preceding discussion on sizing MS), the wakeup and ready latencies are determined by sizing MS1.

Suppose that we use our tri-modal switch for power gating a DFF. Furthermore, assume that the MS transistor is already sized for 5% active performance degradation. Figure 3.4 shows the wakeup and ready latencies as well as the normalized leakage values in the sleep and drowsy modes for different values of $W_{MS1}$ for this positive-edge triggered DFF in TSMC 0.18um.
The leakage data is normalized to the active leakage of the FF when no tri-modal switch is used. As seen in the figure, the ready time is always less than the wakeup time for a fixed size of MS1. In contrast, the drowsy-mode leakage is always higher than the sleep-mode leakage. When we increase the size of MS1 above some threshold, the wakeup and ready latencies reach saturating values. For this example, the saturation occurs at $W_{MS1}$ = 3μm.

Figure 3.4: Leakage and wakeup/ready latencies for DFF. (Axes: $W_{MS1}$ (μm) versus normalized leakage (%) and latency (psec); curves: $I_{drowsy}/I_{active}$, $I_{sleep}/I_{active}$, $t_w$, $t_r$.)

Sleep and drowsy leakage currents increase linearly with $W_{MS1}$. To optimally size MS1, we must consider the wakeup/ready latencies as well as the amount of leakage current in the sleep/drowsy modes. We define four cost figures, all in the form of power-delay products (PDP):
a) $PDP_{sleep\text{-}sleep} = I_{sleep} \times VDD \times t_w$
b) $PDP_{sleep\text{-}drowsy} = I_{sleep} \times VDD \times t_r$
c) $PDP_{drowsy\text{-}drowsy} = I_{drowsy} \times VDD \times t_r$
d) $PDP_{drowsy\text{-}sleep} = I_{drowsy} \times VDD \times t_w$
where $I_{sleep}$ and $I_{drowsy}$ denote the leakage currents in the sleep and drowsy modes, respectively. Figure 3.5 illustrates all four PDPs defined above for the DFF circuit. One can confirm from the figure that in all these cases, increasing $W_{MS1}$ decreases the PDP up to the point where the PDP curve saturates at a minimum value. One may size MS1 based on any one of the PDP profiles in Figure 3.5; however, we use the $PDP_{drowsy\text{-}sleep}$ profile to perform the sizing. The reason is that the sleep-mode leakage and the drowsy-mode ready latency are already small; we thus size MS1 based on the drowsy-mode leakage and the sleep-mode wakeup latency.
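The four cost figures are simple products, and a width can be picked from a PDP profile with a short search. The Python sketch below computes the four PDPs and then applies one possible selection rule: take the smallest width whose PDP lies within a given tolerance of the minimum of the profile. The sample widths, currents, and latencies are invented for illustration and are not the data behind Figures 3.4 and 3.5; the function names and the default tolerance are likewise assumptions.

```python
def four_pdps(i_sleep, i_drowsy, t_w, t_r, vdd):
    """Return the four power-delay products a) to d) as a dictionary."""
    return {
        "sleep-sleep": i_sleep * vdd * t_w,
        "sleep-drowsy": i_sleep * vdd * t_r,
        "drowsy-drowsy": i_drowsy * vdd * t_r,
        "drowsy-sleep": i_drowsy * vdd * t_w,
    }


def smallest_width_within(widths, pdps, tolerance=0.10):
    """Smallest width (widths sorted ascending) whose PDP is at most
    (1 + tolerance) times the minimum PDP of the profile."""
    limit = min(pdps) * (1.0 + tolerance)
    for width, pdp in zip(widths, pdps):
        if pdp <= limit:
            return width
    raise ValueError("empty profile")


if __name__ == "__main__":
    vdd = 1.8
    # Hypothetical profile: widths in um, currents in A, latencies in s
    widths = [0.2, 0.7, 1.2, 1.7, 2.2, 2.7, 3.2]
    i_drowsy = [1.00e-9, 1.02e-9, 1.04e-9, 1.06e-9, 1.08e-9, 1.10e-9, 1.12e-9]
    t_wake = [300e-12, 150e-12, 100e-12, 80e-12, 72e-12, 68e-12, 65e-12]
    profile = [i * vdd * t for i, t in zip(i_drowsy, t_wake)]
    print(smallest_width_within(widths, profile))
```

Choosing the smallest width that is close to the saturated minimum avoids paying extra area and leakage for a negligible further reduction of the PDP.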
We size MS1 such that the $PDP_{drowsy\text{-}sleep}$ corresponding to this size is no more than 10% higher than the minimum (saturated) $PDP_{drowsy\text{-}sleep}$ value. In our example, this results in $W_{MS1}$ = 1.6μm.

Figure 3.5: Different power-delay product metrics. (Axes: $W_{MS1}$ (μm) versus PDP ($10^{-21}$ J); curves: $PDP_{sleep\text{-}drowsy}$, $PDP_{drowsy\text{-}sleep}$, $PDP_{sleep\text{-}sleep}$, $PDP_{drowsy\text{-}drowsy}$.)

All other transistors in the tri-modal switch cell, including MS2, MD1, MD2, and the transistors inside the DROWSY inverter, are minimum-sized transistors. The reason is that none of these transistors influences the wakeup latency. Area, sleep and drowsy leakage currents, and the energy dissipated in a mode transition are all reduced by choosing minimum transistor sizes.

3.4 Data-Retentive Power Gating

In this section we use the tri-modal switch to realize data-retentive multimodal power gating solutions. By controlling the SLEEP and DROWSY signals of different tri-modal switches in the circuit, we can selectively put various circuit elements in different modes. Let us consider a general multi-stage pipeline circuit. We perform power gating for this structure by using the proposed tri-modal switches, where we have two different types of tri-modal switches: those disconnecting the VVSS net of the flip-flops in the pipeline registers from the ground rail, and those disconnecting the VVSS net of the combinational logic cells in the design from VSS. This implies having two different VVSS nets: one for the flip-flops and another for the other logic cells.

3.4.1 Proposed Architecture

Consider a K-stage pipeline structure with K−1 pipeline registers as shown in Figure 3.6.
Suppose the design is to be implemented in a standard cell layout style. Cells fit in one of two groups: (i) sequential logic cells (FF's) belonging to the pipeline registers, and (ii) combinational logic cells belonging to the pipeline logic blocks. If the pre-standby data stored in the pipeline registers is to be retained when going to sleep, the pipeline registers must be put into the data-retentive drowsy mode while the rest of the cells in the circuit are put in the sleep mode to reduce standby leakage consumption.

To realize this architecture, the cells in the design have to be placed in such a way that the VVSS rail used for the pipeline FF's is separated from the VVSS rail used for the combinational logic cells in the circuit. This is possible by disconnecting the VVSS rail every time a FF is placed next to a logic cell, which can cause a significant number of breaks and reconnections in the VVSS rail. If the FF's are grouped together and placed contiguously in each standard cell row, then there will be only one discontinuity in the VVSS rail of that row. However, this type of placement constraint will adversely impact the quality of the placement solution and likely increase the total wire length of the placed design [49].

Figure 3.6: Application of tri-modal switch in designing multimodal pipeline structures.

To solve the aforesaid problem, we take the original placement of the design and modify it by moving the cells such that in each row there are at most a few contiguous sections of FF's and a few contiguous sections of logic cells. Figure 3.7 shows a legal and an illegal placement. Note that when we have a legal placement with a number of sections in the same row, e.g., Figure 3.7.b, the virtual ground rail has to be disconnected at the point where two adjacent sections meet.
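Whether a row satisfies such a sectioning constraint can be checked by counting maximal runs of equal cell types. The Python sketch below represents a row as a list of "FF" and "Logic" labels, which is a simplification that ignores cell widths and switch cells, and treats the allowed number of FF sections as a hypothetical parameter, since the text only requires "at most a few".

```python
from itertools import groupby


def sections(row):
    """Collapse a row into its sequence of contiguous section types."""
    return [kind for kind, _ in groupby(row)]


def count_sections(row):
    """Return (number of FF sections, number of logic sections)."""
    kinds = sections(row)
    return kinds.count("FF"), kinds.count("Logic")


def rail_breaks(row):
    """The VVSS rail is cut wherever two adjacent sections meet."""
    return max(len(sections(row)) - 1, 0)


def is_legal(row, max_ff_sections=1):
    """Row is legal if it has at most max_ff_sections FF sections (assumed limit)."""
    return count_sections(row)[0] <= max_ff_sections


if __name__ == "__main__":
    scattered = ["FF", "Logic", "FF", "FF", "FF", "Logic"]
    grouped = ["FF", "FF", "FF", "FF", "Logic", "Logic"]
    for row in (scattered, grouped):
        print(count_sections(row), rail_breaks(row), is_legal(row))
```

In this representation, grouping all FF's of a row into one section leaves a single rail break, whereas interleaved FF's and logic cells produce one break per boundary.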
Next we describe a heuristic approach to minimize the interconnection length cost associated with removing placement conflicts.

Figure 3.7: Examples of (a) illegal and (b) legal placements. (Labels in panel (b): FF partition, logic partition, rail separation.)

3.4.2 Placement with Row Sectioning

In a standard cell design, power-gating switch cells can be placed in different ways among the cells in a circuit. Typically, it is desirable to distribute the switch cells uniformly on each standard cell row in order to have a simple power/ground network routing strategy and to minimize the worst-case (resistive) parasitics of the virtual net. Figure 3.8.a shows the so-called column-aligned sleep transistor placement style. The dashed boxes represent tri-modal switch cells. All other standard cells are assumed to be placed in the blank areas between the switch cells. The True VSS (TVSS) mesh lines are also shown in the figure. They are used to connect to the TVSS pins of the various switch cells. With this placement style, there can be only one switch cell under each TVSS line in each row, which can be used to power gate either a FF section or a combinational logic section, as the case may be. We have to decide which TVSS lines are used for FF's and which are used for combinational logic cells. We present a heuristic approach which modifies the original placement (cf. Figure 3.8.a) and converts it to a legal placement while minimizing the total perturbation to the original placement by moving the FF cells in the design. Note that each row is considered separately, and cell interchange between cell rows is not allowed. Also, the number and placement of the TVSS lines are assumed to be fixed and given.
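The per-row decision stated above, namely choosing the TVSS lines around which the FF's of a row should gather so that they move as little as possible, can be sketched as a small exhaustive search. The Python code below is an illustration under simplifying assumptions and not the dissertation's implementation: each FF and each TVSS line is reduced to an x-coordinate, the perturbation of a candidate set of lines is approximated by the sum of distances from each FF to its nearest chosen line (cell widths and overlap removal are ignored), and the cap `max_sections` on the number of FF sections per row is a hypothetical parameter.

```python
from itertools import combinations


def perturbation(ff_positions, lines):
    """Simplified cost: each FF moves to its nearest selected TVSS line."""
    return sum(min(abs(x - line) for line in lines) for x in ff_positions)


def find_ff_sections(ff_positions, tvss_lines, max_sections=2):
    """Return (chosen_lines, number_of_sections, cost) for one row.

    Tries every set of 1..max_sections TVSS lines and keeps the cheapest;
    ties are resolved in favour of fewer sections.
    """
    if not ff_positions:
        return (), 0, 0.0
    best = None
    for n in range(1, min(max_sections, len(tvss_lines)) + 1):
        for lines in combinations(tvss_lines, n):
            cost = perturbation(ff_positions, lines)
            if best is None or cost < best[2]:
                best = (lines, n, cost)
    return best


if __name__ == "__main__":
    ffs = [1.0, 2.0, 9.0]      # hypothetical FF x-coordinates in one row
    tvss = [0.0, 5.0, 10.0]    # hypothetical TVSS line x-coordinates
    print(find_ff_sections(ffs, tvss, max_sections=1))
    print(find_ff_sections(ffs, tvss, max_sections=2))
```

Because the search enumerates all subsets of a given size, its cost grows quickly with the number of TVSS lines and allowed sections, so it is only practical for small values of `max_sections`.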
Figure 3.8: Column-aligned placement: (a) before and (b) after removing illegal placements.

Consider an already placed design obtained by any state-of-the-art placement tool. Suppose there are r rows and m TVSS lines in the design. Let us assume that we use at most $n_i$ TVSS lines for the FF's in the i-th row ($n_i < m$). For each row, we have to determine: (a) the number of contiguous FF sections, and (b) the TVSS lines around which these FF sections should be placed in order to minimize the total extent of FF displacements compared to the original placement solution.

For the i-th row, the heuristic starts by assuming $n_i$ = 1 (if no FF lies in the i-th row, $n_i$ = 0 and we are done). We evaluate each of the m TVSS lines in the i-th row by calculating the amount of placement perturbation with respect to that line, i.e., the increase in the total perturbation of the circuit when all FF's in the row are moved to new locations on the row so as to make a single contiguous section adjacent to that TVSS line. The FF's are sorted based on their distance from the target TVSS line and moved one after the other in that order. Cell overlaps are removed by pushing overlapping cells aside to make space for the FF's.

Algorithm Find_FF_Sections()
  if (there is no FF in the row)
    return n_i = 0
  else
    n_i = 1
    while (n_i ≤ max_n_i) do
      find the set s^(n_i) of n_i TVSS lines such that moving all the FF's in the row to lie around them results in the minimum placement perturbation
      n_i++
    endwhile
    (s_min, n_i) = find_min_perturbation(s^(1), s^(2), ..., s^(max_n_i))
  endif
  return (s_min, n_i)

Figure 3.9: Outline of the proposed row sectioning placement algorithm.

After evaluating each of the m TVSS lines, we can determine which one of them minimizes the total placement perturbation. Next we set $n_i$ = 2 and evaluate all possible pairs of TVSS lines by calculating the placement perturbation with respect to each pair, i.e., when all the FF's in the i-th row are moved to make two sections around the pair of TVSS lines. Evaluating a pair of TVSS lines starts by moving the FF that is closest to either of the TVSS lines in the pair under consideration, then the second closest FF, if it exists, and so forth. The perturbation cost is calculated as in the $n_i$ = 1 case. After evaluating all C(m,2) possible pairs of TVSS lines, the pair that results in the minimum placement perturbation is determined. We can keep increasing $n_i$ and repeat the evaluation up to m, but the algorithm complexity then becomes exponential in m. Fortunately, our results show that for a design with a relatively small number of FF's compared to the total logic cell count, which is the typical case, the cost reduction achieved by going beyond $n_i$ = 2 is negligible (see the experimental results). Figure 3.9 shows a summary of the placement algorithm with row sectioning that was explained above.

3.5 Multi-Drowsy Mode Circuits

In Section 3.4 we explained how the proposed tri-modal switch can help implement one type of multimodal circuit. In this section we introduce a different family of multimodal circuits that can be designed using the proposed tri-modal switch. Consider the tri-modal switch in Figure 3.2. The VVSS voltage in the drowsy mode depends on the threshold voltage and the width of MS. A larger width and a lower threshold voltage for MS result in a lower drowsy VVSS voltage.
Figure 3.10 shows how the VVSS voltage changes with the threshold voltage of the sleep transistor (MS). It can be seen that a 250mV range of variation in the threshold voltage of MS yields almost the same range of variation in the VVSS voltage in the drowsy mode.

Figure 3.10: Different drowsy VVSS voltage achieved by changing the threshold voltage value of sleep transistor (MS). [Plot axes: Threshold Voltage Value of MS (mV) versus VVSS Voltage in Drowsy (mV).]

Figure 3.11 shows how the VVSS voltage changes with the sleep transistor size. It can be seen that we can achieve a 15% increase in the drowsy-mode VVSS voltage by reducing the size of the sleep transistor from 3μm to 0.5μm. Figure 3.10 and Figure 3.11 suggest having multiple drowsy modes by appropriately changing the threshold voltage or the width of MS, respectively. Figure 3.12 shows a multimodal switch that is designed by using multiple sleep transistors and different SLEEP signals to turn them ON or OFF. Suppose that all the sleep transistors in Figure 3.12, i.e., MS1-MSn, are HVT. We also assume that the sum of the widths of all the sleep transistors is equal to the total sleep transistor width required to satisfy the constraint of a 5% increase in active delay (cf. Section 0).

Figure 3.11: Different drowsy VVSS voltage achieved by changing the size of sleep transistor (MS).

In the active mode, all the sleep signals have logic “0”, i.e., SLEEP_i=0 for i=1,…,n, and the value of the DROWSY signal is X (don’t care). Note that turning on any one of the sleep transistors puts the circuit block in the active mode, but may not satisfy the constraint of a 5% increase in the active delay. In the sleep mode, however, the drowsy signal has logic “0” and all the sleep signals are “1”, i.e., DROWSY=0 and SLEEP_i=1 for i=1,…,n.
In the drowsy mode, the drowsy signal has logic “1” value (DROWSY=1). In this case, turning on fewer sleep transistors is equivalent to having a smaller effective sleep transistor size, which based on Figure 3.11 results in a higher VVSS voltage and thus a lower leakage current in the drowsy mode. [Figure 3.11 plot axes: W_MS (μm) versus VVSS Voltage in Drowsy (mV).]

Figure 3.12: Implementations of multimodal footer switch for multi-drowsy mode circuits. [Diagram labels: Circuit Block, VDD, VVSS, DROWSY, MD1, MD2, MS1 to MSn, SLEEP1 to SLEEPn.]

One of the big advantages of the proposed multimodal switch is that it makes it possible to prevent a large rush-through current at the sleep-to-active transition by turning on the circuit slowly. A large rush-through current causes a large ground/supply bounce at the sleep-to-active transition edge due to the di/dt effect. Mother-daughter MTCMOS switches have been proposed to avoid this problem for conventional MTCMOS circuits. A mother-daughter footer switch includes two sleep transistors of different sizes, a larger one (mother) and a smaller one (daughter). At wakeup time, the daughter transistor is turned on first, giving the circuit block some time to slowly recover from the sleep mode by discharging a portion of its extra charge. The mother transistor is then turned on, discharging the rest of the extra charge in the circuit block and putting it into the active mode. Since the proposed multimodal switch uses more than one sleep transistor, it can be used in a similar fashion to avoid large rush-through currents by correctly sizing the sleep transistors (MS_i's) and appropriately timing them.

3.6 Voltage-Scaling Using Multimodal Headers

DC-DC converters are used to supply power in most digital systems. They are typically classified into two types: linear and switching voltage regulators [18].
Switching regulators usually achieve better power efficiency compared to linear regulators; however, linear regulators are much cheaper and generate less noise. Linear regulators are also faster and can be implemented on-chip. In this section we present an interesting use of the proposed multimodal switch in designing a special type of linear regulator that can be used to enable on-chip Dynamic Voltage Scaling (DVS) for VLSI circuits.

Consider the circuit shown in Figure 3.13, which shows a circuit block with a multimodal header switch. Suppose that the circuit is in drowsy mode, that is, DROWSY=“1” and at least one of the sleep signals (SLEEP_i's) is “1”. Similar to what we discussed in Section 3.5, we can provide different voltage levels at the VVDD node in the drowsy mode by changing the effective size (or threshold voltage) of the sleep transistor. This is done by turning ON or OFF different numbers of sleep transistors in the multimodal switch (MS_i's in Figure 3.13).

Figure 3.13: Using multimodal header to perform voltage scaling. [Diagram labels: Circuit Block, VDD, VVDD, C_VVDD, DROWSY, MD1, MD2, MS1, MS2, SLEEP1, SLEEP2.]

The capacitor, C_VVDD, in Figure 3.13 stabilizes the VVDD voltage when there are switching activities inside the circuit block. Even though more sophisticated techniques can potentially result in improved I-V characteristics, they are out of the scope of this dissertation, and we only consider a simple capacitor as the voltage stabilizer as shown in Figure 3.13.

Table 3.3 shows different scaled voltage levels achieved by using a multimodal switch with four parallel sleep transistors, MS1-MS4, for a full-adder circuit. The four sleep transistors have different widths and different threshold voltage values as follows: W_MS1=W_MS3=0.3μm, W_MS2=W_MS4=3μm, Vt_High=-0.72V, and Vt_Regular=-0.42V.

Table 3.3: Achieving different scaled values of the supply voltage for a full-adder circuit for TSMC0.18um and VDD=1.8V.
Sleep TX Used | Sleep TX's Vt | VVDD Voltage (V)
MS1 | High | 1.3
MS2 | High | 1.4
MS3 | Regular | 1.6
MS4 | Regular | 1.7

The presented approach for VDD scaling is specifically suitable for implementing local DVS where global DVS is less effective. For example, when the latencies of pipeline stages are imbalanced, the effectiveness of global DVS decreases, leaving some power saving opportunities for local DVS, where different scaling factors are used for different stages [37]. In other words, instead of constraining the pipeline to a single global voltage (as is done in global DVS) and changing that global value, local DVS supplies separate voltage values to different pipeline stages using locally adjustable voltages. Therefore, the energy demand of each pipeline stage is minimized individually. Local DVS shows better energy saving compared to global DVS, but the downside is that each stage has to have its own voltage regulator. Level converters are also required between two stages. Our presented DVS scheme can be used to implement local DVS for different stages of a pipeline using their power gating circuitry. This dramatically reduces the cost of implementing local DVS by eliminating the per-stage voltage regulators.

3.7 Simulation Results

In this section we present the simulation results for the different ideas discussed in this chapter. We start with the data-retentive power gating explained in Section 3.4. For this purpose we designed and implemented a 16×16 pipelined Carry Save Multiplier (CSM). The circuit is divided into two pipeline stages. The 46-bit output of the first stage is latched into the pipeline registers (46 FF’s). The first 16 of these 46 bits, which make up the least significant bits of the product, are passed directly to the output. The last 30 bits are passed to the second stage to produce the most significant bits of the product.
3.7.1 Data-Retentive Power Gating: Design Flow

We implemented the 16×16 pipelined CSM in structural Verilog. After verifying the functionality of the Verilog design, we synthesized the design using the Synopsys Design Compiler with the OSU standard cell library [1] in TSMC0.18um with VDD=1.8V. We performed timing analysis on the synthesized design and achieved a worst-case stage delay of 4.1ns (clock frequency of 244 MHz). After synthesizing the design, the standard delay format (sdf) file was generated, and the design was verified with sdf back-annotation. We then used Cadence Encounter to complete the placement of the design. We modified the placement using the row sectioning method described in Section 3.4.2 with n_i=2. The tri-modal switch cells were manually inserted into the design. After the placement was done, the design was routed with Cadence Encounter, timing analysis was performed, and an sdf file was generated. The design was then verified again. Finally, we extracted the netlist and performed HSPICE simulations.

Figure 3.14: Summarized block diagram of the design flow.

Figure 3.14 shows the brief design flow that is used in this chapter. Note that not all the steps are shown in this figure. Figure 3.15.a depicts the design after the original placement, where the FF’s, which are scattered in the design, are highlighted with red boxes. Figure 3.15.b and Figure 3.15.c show the same design after the row sectioning technique for FF placement is applied for n_i=1 and n_i=2, respectively. Figure 3.15.d shows the routed design of Figure 3.15.c, i.e., with n_i=2.

3.7.2 Data-Retentive Power Gating: Results

In this section we discuss the results that we achieved by implementing the 16×16 pipelined CSM explained in Section 3.7.1. Tri-modal switch cells are used to implement all the MTCMOS circuits considered in this section.
We compare the leakage current, ground bounce and wakeup/ready latencies for four different cases: a) CMOS, b) MTCMOS: deep-sleep, c) MTCMOS: drowsy, and d) MTCMOS: data-retentive. No power gating is used for the CMOS circuit and there is no constraint on the placement of the FF’s. During the active mode, all tri-modal switches are in the active state (SLEEP=“0”, DROWSY=“X”) in all versions of the MTCMOS circuit. In the standby mode, however, tri-modal switches are put in different states: in deep-sleep MTCMOS, all tri-modal switches are in the sleep mode (SLEEP=“1”, DROWSY=“0”); in drowsy MTCMOS, all tri-modal switches are in the drowsy mode (SLEEP=“1”, DROWSY=“1”); in data-retentive MTCMOS, the tri-modal switches used for combinational logic cells are in the sleep mode and the tri-modal switches used for FF’s are in the drowsy mode. We compare different aspects of these four versions of the same 16×16 pipelined CSM.

Table 3.4: Leakage, GB and w/r latency comparisons.

Circuit Type | Leakage (nA) | Ground Bounce (mV) | Wakeup/Ready Latency (ns)
CMOS | 63.0 | - | -
Deep-Sleep | 0.10 | 473 | 19.32
Drowsy | 48.0 | 143 | 4.83
Data-Retentive | 2.85 | 441 | 19.32

The second to fourth columns of Table 3.4 show, respectively, the standby leakage current, the peak value of ground bounce (GB), and the wakeup/ready (w/r) latencies for all circuit configurations explained above. It is seen from the table that the deep-sleep MTCMOS circuit has the lowest leakage among all configurations, making it the most appropriate choice for long standby periods. We note that the leakage of the drowsy MTCMOS is only 24% lower than that of the CMOS circuit, i.e., much higher than that of the deep-sleep.
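The leakage comparison above can be checked with a short calculation. The sketch below simply recomputes the leakage savings relative to the CMOS baseline from the leakage column of Table 3.4; the dictionary and function names are illustrative and not part of the original flow.

```python
# Standby leakage currents in nA, copied from Table 3.4.
LEAKAGE_NA = {
    "CMOS": 63.0,
    "Deep-Sleep": 0.10,
    "Drowsy": 48.0,
    "Data-Retentive": 2.85,
}

def leakage_saving_percent(mode, baseline="CMOS"):
    """Leakage reduction of `mode` relative to `baseline`, in percent."""
    return 100.0 * (1.0 - LEAKAGE_NA[mode] / LEAKAGE_NA[baseline])

for mode in ("Drowsy", "Data-Retentive", "Deep-Sleep"):
    print(f"{mode}: {leakage_saving_percent(mode):.1f}% lower than CMOS")
```

The drowsy figure rounds to the 24% quoted in the text, while the data-retentive configuration recovers most of the deep-sleep saving because only the FF's remain in the drowsy state.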
The ground bounce for the deep-sleep circuit is much higher than that for the drowsy circuit.

Figure 3.15: (a) Original placement for 16×16 pipelined CSM, (b) placed design after row sectioning with n_i=1, (c) placed design with n_i=2, (d) routed design with n_i=2.

We assume that the maximum tolerable ground bounce is 150mV. To keep the ground bounce below this threshold, we have resorted to a multi-cycle turn-on strategy similar to the one proposed in [35], where we turn on only some of the tri-modal switches at each clock cycle. In particular, 7/45, 9/45, 11/45, and 18/45 of the tri-modal switches are turned on during the first, second, third, and fourth consecutive clock cycles, respectively. Using this turn-on strategy, it takes four clock cycles to wake up the deep-sleep circuit while it takes only one clock cycle to wake up the drowsy circuit. Therefore, there is a three-clock-cycle penalty for waking up from the deep-sleep mode as compared to waking up from the drowsy mode. Now assume this multiplier is used in the execution stage of a five-stage pipelined processor, and has been put into the deep-sleep mode by the power-management unit since it had not been utilized recently. If a new instruction in the IF stage needs to use this multiplier, the processor has to be stalled for three clock cycles for the multiplier to be ready for operation. However, if the multiplier was in the drowsy mode, the processor could perform its regular operation without being stalled. The cycle penalty will increase as the size of the circuit increases.

Despite having a faster wakeup, the drowsy circuit suffers from higher leakage compared to the deep-sleep circuit. Therefore, for longer standby periods when the leakage energy dissipation becomes an issue, we may want to pay the wakeup cycle penalty to achieve low leakage dissipation. In that case, the deep-sleep or data-retentive modes are preferable to the drowsy mode.
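The wakeup penalty of the multi-cycle turn-on strategy described above can be reproduced with a small sketch. It assumes a clock period equal to the 4.83ns MTCMOS stage delay, which is consistent with the latencies reported in Table 3.4; the function name `wakeup_latency_ns` is illustrative.

```python
from fractions import Fraction

def wakeup_latency_ns(schedule, clock_period_ns):
    """Return (cycles, latency in ns) for a multi-cycle turn-on schedule.

    `schedule` lists the fraction of tri-modal switches turned on in each cycle.
    """
    if sum(schedule) != 1:
        raise ValueError("the schedule must turn on every switch exactly once")
    cycles = len(schedule)
    return cycles, cycles * clock_period_ns

CLOCK_NS = 4.83  # assumed clock period (MTCMOS stage delay)

deep_sleep = [Fraction(7, 45), Fraction(9, 45), Fraction(11, 45), Fraction(18, 45)]
drowsy = [Fraction(1)]  # all switches are turned on in a single cycle

ds_cycles, ds_latency = wakeup_latency_ns(deep_sleep, CLOCK_NS)
dr_cycles, dr_latency = wakeup_latency_ns(drowsy, CLOCK_NS)
stall_cycles = ds_cycles - dr_cycles  # extra cycles paid when waking from deep sleep
```

Using exact fractions makes it easy to confirm that the four portions cover all 45 switch groups, and the resulting four-cycle latency matches the deep-sleep entry of Table 3.4.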
Therefore, it is important to have a power-gating structure that supports the four power modes discussed above.

Figure 3.16 shows leakage energy versus total (wakeup) latency for the CSM circuit when it is operating for 100,000 clock cycles. We assume that 20% of the time the circuit is operating in the active mode while 80% of the time it is in the standby mode. We compare three different standby policies: (i) CMOS, (ii) MTCMOS: drowsy, and (iii) MTCMOS: deep-sleep. The energy versus total latency curves are shown for different mode transition frequency values, f_mt. The mode transition frequency is expressed per million clock cycles and is defined as the number of mode transitions that happen in one million cycles. Since we compare leakage energy and the total latency, we do not consider the data-retentive circuit in this analysis.

Figure 3.16: Leakage versus total latency for different mode-transition frequencies in the unit of per million cycles. [Plot axes: Total Latency (μs) versus Leakage Energy (pJ); curves for drowsy, deep-sleep, and CMOS at f_mt=100, 200, 500, and 1000.]

Table 3.5 compares delay, cell area and total wire length for the CMOS and MTCMOS circuits. The placement modification discussed in Section 3.4.2, i.e., row sectioning, increases the signal routing cost. The total wire length is reported for n_i=1 and n_i=2. It can be seen that we have a 5% reduction in total wire length when we use n_i=2 as compared to n_i=1; however, our experiments show that if we use n_i > 2, the additional wire length reduction is negligible compared to n_i=2. For example, for the CSM design, the total wire length reduction achieved by going from n_i=2 to n_i=4 is less than 1%. The total MTCMOS cell area increase reported in Table 3.5 is due to the area occupied by the tri-modal switches. As seen in the table, the overall area increase is only 1.8%.
Note that the sleep transistors have been sized for a maximum 5-7% active delay increase compared to the (non-power-gated) CMOS circuit. We could have achieved a lower MTCMOS active delay by upsizing the sleep transistor inside the tri-modal switch.

Table 3.5: Delay, area and routing comparisons.

Circuit Type | Stage delay (ns) | Cell area (um^2) | Wire length (um), n_i=1 | Wire length (um), n_i=2
CMOS | 4.54 | 54720 | 54402.6 | 54402.6
MTCMOS | 4.83 | 55710 | 59008.4 | 56077.2
Increase (%) | 6.4 | 1.8 | 8.5 | 3.1

3.7.3 The Effect of Technology Scaling on Data-Retentive Power Gating

We have done similar simulations for TSMC90nm technology with VDD=1.2V to show the scalability of the proposed technique. Results are summarized in Table 3.6. It can be seen from the table that the leakage current in the drowsy circuit is reduced by 77% as compared to that for the CMOS circuit. This means that the leakage saving of the drowsy circuit compared to deep sleep mode becomes relatively better with technology scaling.

Table 3.6: Leakage comparisons for TSMC90nm.

Circuit Type | Leakage (μA)
Deep-Sleep | 0.6
Drowsy | 35
Data-Retentive | 2.35
CMOS | 150

3.7.4 Multi-Drowsy Mode Circuits

Based on the discussion in Section 3.5, multimodal headers can be used to implement circuits with multiple drowsy modes. This part of the experimental results demonstrates the implementation of this idea for some benchmark circuits. For each circuit we use a multimodal header with two sleep transistors of equal size and different threshold voltages. Therefore, there are two different drowsy modes for each circuit. Counting the active and sleep modes, this adds up to four available power modes for each circuit. In the active mode both sleep transistors are ON, providing the maximum current capacity for the circuit in case of any switching event. Table 3.7 shows the ready latency values measured for the two drowsy modes for different benchmark circuits in 90nm technology.
Table 3.8 shows leakage current values and leakage savings for different modes for the same circuits as in Table 3.7. Leakage current in Table 3.8 is averaged over 1000 different input cases, where a random input vector is applied to the underlying circuit in each case. It can be seen that an average of 50%, 71%, and 91% leakage saving is achieved for the Drowsy1, Drowsy2, and Sleep circuits, respectively. By comparing the results shown in Table 3.7 and Table 3.8, we realize that Drowsy1 provides a relatively smaller leakage saving, but a much faster ready latency compared to Drowsy2, making it more convenient for shorter idle periods. Having different power modes with different characteristics available gives the designer the opportunity to come up with solutions that consume less power and show faster response time.

Table 3.7: Ready latencies for multi-drowsy mode ISCAS85 circuits.

Circuit | Ready/Wakeup Latency (ns): Drowsy1 | Drowsy2 | Sleep | Ready to Wakeup Increase (%): Drowsy1 | Drowsy2
9sym | 1.72 | 2.14 | 2.74 | 59 | 28
C432 | 2.16 | 2.79 | 2.87 | 33 | 3
C880 | 1.76 | 2.10 | 2.53 | 43 | 21
C1355 | 1.61 | 1.92 | 2.44 | 51 | 27
C3540 | 1.59 | 1.88 | 2.20 | 38 | 17
Avg. | - | - | - | 45 | 19

Table 3.8: Leakage current for different modes in multi-drowsy implementation of ISCAS85 benchmark circuits in 90nm technology, Vdd=1.2V.

Circuit | # of Cells | Total SLP TX Width (µm) | Leakage Current (µA): Standby | Drowsy1 | Drowsy2 | Sleep | Leakage Saving (%): Drowsy1 | Drowsy2 | Sleep
9sym | 276 | 99 | 3.9 | 2.5 | 1.5 | 0.7 | 37 | 61 | 83
C432 | 204 | 73.4 | 7.1 | 3.2 | 1.9 | 0.5 | 55 | 73 | 93
C880 | 432 | 155.5 | 14.9 | 6.8 | 4.1 | 1.0 | 54 | 73 | 93
C1355 | 526 | 189.4 | 17.9 | 7.8 | 4.0 | 1.3 | 57 | 78 | 93
C3540 | 1295 | 466.2 | 45.6 | 23.5 | 13.9 | 2.9 | 49 | 70 | 94
Avg. | - | - | - | - | - | - | 50 | 71 | 91

3.7.5 TSPF Measure by Way of an Example

Suppose that we have a 32-bit ripple carry adder (RCA32) built by concatenating the adder cells discussed in Section 3.6, but this time only with three operating voltage values, namely 1.8V, 1.6V and 1.3V. Furthermore, assume that the adder block is part of a processor’s ALU.
Suppose that 25% of the time the ALU works at full performance (VDD=1.8V), 35% of the time at medium performance (VDD=1.6V), 20% of the time at low performance (VDD=1.3V), and for the remaining 20% of the time, it is idle (i.e., it is in the Sleep mode). Moreover, suppose that the adder activity factor remains unchanged under different active modes, and that the clock frequency is scaled by the same factor as the supply voltage. The TSPF for the adder is calculated as follows (mm stands for multimodal): Substituting the abovementioned information, we will have: where we have assumed that because of power gating, the amount of leakage in the sleep mode is negligible. Now consider the case that the 32-bit adder employs only DVFS using conventional approaches. In this case, the adder will operate at VDD=1.3V (lowest power state) during its idle period, and we have: Finally consider the case where we use conventional (bimodal) MTCMOS to reduce leakage in the sleep mode. In this case, the RCA32 always works at the maximum supply (VDD=1.8V) and the power saving is only due to leakage reduction in the sleep mode. The TSPF is calculated as: It can be seen that the multimodal adder performs much better than the others, i.e., we have:

Chapter 4 Temperature-Aware Dynamic Resource Provisioning in a Power-Optimized Datacenter

4.1 Introduction

Large datacenters that typically serve millions of users globally and 24-7 comprise tens of thousands of servers with tens of petabytes of storage, and multiple hundreds of gigabits of bandwidth to the Internet. The continuous increase in computing and storage capacities of these datacenters is made possible by advances in the underlying manufacturing process and design technologies. A by-product of such capacity growth has been a rapid rise in the energy consumption and power density of datacenters.
The electric bill of the former (including the electricity needed for cooling and air conditioning in the datacenter) is projected to pass 7 billion US dollars in the US alone, while the latter is expected to reach 60 kW/m^2 for datacenters by 2010 [3]. The Environmental Protection Agency (EPA), in its August 2007 report to the US Congress, stated that the energy consumption of servers and datacenters has doubled in the past five years and is expected to quadruple in the next five years to more than 100 billion kWh at a cost of about $7.4 billion annually [3]. This has motivated substantial research on datacenter energy efficiency. The impacts of energy-efficient computing include economic impacts (such as reduced Total Cost of Ownership (TCO) for datacenter owners, and unproblematic energy provisioning for ICT by government, especially considering the recent energy crisis), and environmental impacts from using less fuel and thus producing less CO2. The latter makes datacenter energy efficiency an inseparable part of green computing research.

In this chapter, we present a dynamically power-optimized datacenter exploiting correctly provisioned resources (servers and supplied cold air). Two fundamental actions are taken for power optimization. The first is to predict the right number of required servers by employing a short-term workload forecasting technique. The second is to optimally choose candidate servers that are either being retired or employed from the available pool of servers and to determine the optimum supplied cold-air temperature value of the AC unit while satisfying the datacenter thermal constraints. The terms retired and employed refer to servers that are being turned OFF or ON, respectively. The power saving is achieved by a combination of chassis consolidation and efficient cooling.

4.2 Prior Work

There are a number of different techniques to reduce the energy cost and power density in datacenters.
This problem can be considered at different levels of granularity: chip level, server level, rack level, datacenter level, etc. Several recently published works present promising results in chip-level power optimization [12, 21, 22, 41]. Load balancing [13, 20, 51], which is a datacenter-level approach, can be used to distribute the total workload of the datacenter evenly among different servers in order to balance the per-server workload (and hence achieve uniform power density). Server consolidation [2], which refers to using the minimum number of active servers in the datacenter, is another approach for power reduction of datacenters.

Accounting for about 30% of the total energy cost of a datacenter (another 10-15% is due to power distribution and conversion losses in the datacenter), the cooling cost is one of the major contributors to the total electricity bill of large datacenters [53]. There have been a number of prior works on increasing the efficiency of the cooling process in datacenters by performing temperature-aware task placement [40, 57]. In [57] the authors formulate and solve a mathematical problem that maximizes the steady-state datacenter cooling efficiency by maximizing the required supplied cold-air temperature value. However, in [50] we used a combination of chassis consolidation and efficient cooling to minimize the total datacenter power consumption (server plus cooling) and showed that the maximum cooling efficiency does not necessarily result in minimum total datacenter power consumption.

Most of the abovementioned approaches try to minimize the datacenter power consumption; however, they either lack a precise mathematical formulation of the optimization problem or do not consider the datacenter dynamics. Also, when it comes to the solution, most of them lack a rigorous algorithmic solution that solves the underlying optimization problem directly.
Also, since the datacenter workload is constantly changing, the required number of ON servers also changes at runtime. This fact points to the shortcoming of the steady-state datacenter power optimization work and argues for a power management technique that allows for dynamic resource provisioning for the underlying datacenter.

4.3 Preliminaries

In this section we give an overview of the datacenter layout, arrangement of servers, the cooling system, datacenter power model, and thermodynamic equations for temperature distribution.

4.3.1 Datacenter Configuration

A datacenter is typically a (warehouse-sized) room with several rows of server cabinets. Each row comprises several racks (cabinets), each rack contains several chassis, and each chassis contains several (blade) servers. All the blade servers in a chassis share the single power unit of the chassis. A modern datacenter is designed in hot-aisle/cold-aisle style as depicted in Figure 4.1, where each row is sandwiched between a hot aisle and a cold aisle. Cold air in the cold aisles is supplied by the AC unit and comes through the perforated tiles in the floor. Servers draw the cold air coming from the cold aisle into the rack using chassis fans. The cold air cools the servers; the hot air exits the rack toward the adjacent hot aisles, and is then extracted from the room by the AC intakes on the ceiling above the hot aisles.

Figure 4.1: Hot-aisle/cold-aisle datacenter structure.

A datacenter may include different classes of servers with different power/performance characteristics designed for different purposes. In order to achieve peak performance with minimum power under a given service level requirement, we need to employ the least Energy Per Instruction (EPI) system design.
The relationship between peak power and instruction throughput is given by the equation: P_peak = EPI × IPS_peak, where P_peak is the peak power consumption, EPI is the energy consumed per instruction, and IPS_peak is the peak instruction throughput per second. For a given peak throughput, reducing EPI is the only option to reduce the peak power consumption. Achieving optimal energy per instruction under heterogeneous application behavior require [36]. The most striking EPI range is available using heterogeneous cores. For instance, the EPI of a Core i7 is 1.86 nanojoules, whereas the NVIDIA GeForce 9800 GX2 has an EPI of 0.169 nanojoules.

[Figure 4.1 diagram labels: rack rows A to E, hot air intake, CRAC unit, cold air under the raised floor.]

A datacenter (or hosting center) is a large collection of various classes of heterogeneous servers with different power/performance characteristics designed for different purposes. A server pool is the subset of all servers from the same class in the hosting center. For example, a Google search cluster contains different classes of servers: web servers, index servers, document servers, etc. [10]. In an optimally designed Google cluster, index servers, which are responsible for finding the search query in their indexed data, usually run CPU-intensive tasks and thus must have high-speed CPUs. Document servers, which are responsible for loading part of a document (the text that comes with each Google search result) from the Google storage, do not really need a high-speed CPU since the tasks that they run are not CPU intensive.

4.3.2 Power Model for Blade Servers

Assume there are K different classes of servers distributed among N chassis in the datacenter, with the i-th chassis containing M_ij type-j servers. Each chassis contains a fixed number of servers, . Let c_ij denote the number of ON type-j servers in the i-th chassis.
The power consumption of this chassis is calculated as:

$p_i = \gamma_i + \sum_{j=1}^{K} \alpha_j c_{ij}$ (4.1)

where γ_i represents the base power consumption of the i-th chassis, and accounts for the power consumption of the chassis fan and switching losses due to AC-DC conversion. α_j denotes the power consumption of a type-j server when it is ON. We define γ = [γ_i]_{N×1} and α = [α_j]_{K×1} as vectors representing the base power dissipations of all chassis and the power dissipations of the different server classes, respectively. Also, C = [c_ij]_{N×K} denotes the server state matrix, where c_ij is the number of ON type-j servers on the i-th chassis. We can write (4.1) in matrix form as: (4.2) where p = [p_i]_{N×1}, and 1_{K×1} denotes a K-dimensional column vector with all elements equal to 1. Chassis base power consumption is typically very high; hence, it is desirable to place the required number of ON servers on the minimum number of chassis so that the remaining chassis can be turned off. This is called chassis consolidation.

4.3.3 Heat Transfer Equations

The spatial granularity of temperature considered in this chapter is at the chassis level. The temperature of the cold air that is drawn into the i-th chassis is called the inlet temperature of that chassis and is denoted by . Similarly, the outlet temperature of the i-th chassis, , is defined as the temperature of the hot air that exits the chassis. The inlet temperature of a chassis depends on the supplied cold air temperature from the Computer Room Air Conditioning (CRAC) unit and the hot air that is re-circulated from the outlets of other chassis. The authors in [57] showed that the recirculation of heat in a datacenter can be described by a cross-interference matrix. This matrix is represented by and shows how much of the inlet heat (flow) rate of each chassis comes from the outlet heat rate of other chassis.
This results in [57]: (4.3) where T_in and T_s are the corresponding inlet temperature and the cold air supply vectors, respectively, and K is an N×N diagonal matrix whose entries are the thermodynamic constants of different chassis, i.e., , and . It is clear from (4.3) that the power distribution among different chassis in the datacenter directly affects the temperature distribution in the room. If we use equation (4.2) to substitute P into (4.3), we have: (4.4)

4.4 Datacenter Power Modeling

4.4.1 Power Consumption of the CRAC Unit

The efficiency of the cooling process depends on different factors such as the substance used in the chiller, the speed of the air exiting the CRAC unit, etc. The Coefficient of Performance (COP) is defined as the ratio of the amount of heat that is removed by the CRAC unit (Q) to the total amount of energy that is consumed in the CRAC unit to chill the air (E) [40]:

$COP = Q / E$ (4.5)

The COP of a CRAC unit is not constant and varies with the temperature of the cold air that it supplies to the room. In particular, the higher the supplied air temperature, the better the cooling efficiency. In this work we use the COP model of a typical water-chilled CRAC unit which has been utilized in an HP Utility Datacenter. This model is quantified in terms of the supplied cold air temperature (T_s) as follows [40]: (4.6)

4.4.2 Total Power Consumption

We define the total power consumption of a datacenter as the power consumption of all chassis and the CRAC unit, i.e., we do not consider power losses in the electrical power conversion network (UPS, AC-DC and DC-DC converters) or losses in the switch gear and conductors. The IT power consumption of a datacenter is denoted by P_IT and is the summation of power consumption over all chassis:

$P_{IT} = \sum_{i=1}^{N} p_i$ (4.7)

where p_i is the power consumption of the i-th chassis. The power cost of the CRAC unit is specified as $P_{CRAC} = P_{IT} / COP(T_s)$.
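The power model of Sections 4.3.2 and 4.4 can be illustrated with a short sketch. Two things here are assumptions rather than statements of the original text: the COP coefficients are those commonly quoted for the HP Utility Data Center water-chilled CRAC model, and the CRAC power is taken as the IT power divided by the COP, i.e., all IT power is assumed to be removed as heat in steady state. All numeric inputs in the usage lines are hypothetical.

```python
def cop(t_supply_c):
    """COP versus supplied cold-air temperature in degrees Celsius.

    ASSUMPTION: coefficients of the commonly quoted HP Utility Data Center model.
    """
    return 0.0068 * t_supply_c ** 2 + 0.0008 * t_supply_c + 0.458

def chassis_power(gamma_i, alpha, c_row):
    """Base power of one chassis plus the power of its ON servers of each class."""
    return gamma_i + sum(a * c for a, c in zip(alpha, c_row))

def it_power(gamma, alpha, C):
    """Sum of the chassis powers over all chassis."""
    return sum(chassis_power(g, alpha, row) for g, row in zip(gamma, C))

def crac_power(p_it, t_supply_c):
    """ASSUMPTION: the CRAC removes all IT power as heat, so P_CRAC = P_IT / COP."""
    return p_it / cop(t_supply_c)

# Hypothetical example: N = 2 chassis, K = 2 server classes, powers in watts.
gamma = [100.0, 100.0]   # base power per chassis
alpha = [50.0, 80.0]     # power of one ON server of each class
C = [[2, 1], [0, 3]]     # C[i][j] = number of ON type-j servers on chassis i
p_it = it_power(gamma, alpha, C)
p_total = p_it + crac_power(p_it, t_supply_c=15.0)
```

Because the assumed COP grows with the supply temperature, raising the supply temperature lowers the cooling power for the same IT load, which is the trade-off against the thermal constraints discussed in this chapter.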
The total datacenter power consumption is the sum of $P_{IT}$ and $P_{CRAC}$ and is written as:

$$P_{DC} = P_{IT} + P_{CRAC} = \left(1 + \frac{1}{\mathrm{COP}(T_s)}\right) P_{IT} \qquad (4.8)$$

Substituting the expression from (4.1) for $p_i$, we obtain:

$$P_{DC} = \left(1 + \frac{1}{\mathrm{COP}(T_s)}\right) \sum_{i=1}^{N} \left( \gamma_i + \sum_{j=1}^{K} \alpha_j c_{ij} \right)$$

4.5 Temperature-Aware Dynamic Provisioning and Power Optimization

Figure 4.2: Datacenter power optimization architecture.

Figure 4.2 shows the overall design of the proposed datacenter power optimization flow. Input requests are collected in a Global input Queue (GQ) [47]. The temperature-aware Dynamic Resource Provisioning (DRP) module consists of two sub-modules: a Workload Monitoring (WM) unit and a Power-Thermal Manager (PTM) unit. The WM performs workload analysis and prediction for the next epoch. The result is passed to the PTM in the form of the required number of ON servers of each server class for the next epoch. The PTM then uses this number, along with information about the servers' status in the datacenter, to decide which servers/chassis to employ or retire (turn ON/OFF) for the next epoch. The Request Dispatcher (RD) unit uses the server status information to assign requests to different servers. Designing a power-efficient RD unit is not the purpose of this dissertation; we use a simple Round Robin scheduling algorithm, which is widely employed in production datacenters, to implement this unit. The combination of the WM and PTM units is called the Temperature-Aware Dynamic Resource Provisioning (TA-DRP) unit, and its design is the main focus of this dissertation.

4.5.1 Workload Monitor

As mentioned earlier, the WM is responsible for providing the PTM with the required number of ON servers of each class.
We denote the required number of type-$j$ ON servers for the next epoch by $n_j(t+1)$, where the time index "$t+1$" represents the next epoch. The total number of type-$j$ servers that must be turned ON in the next epoch is $S_j = n_j(t+1) - n_j(t)$, where the time index "$t$" represents the current epoch. If $S_j > 0$, the PTM must employ $S_j$ new type-$j$ servers; if $S_j < 0$, the PTM retires $|S_j|$ type-$j$ servers; and if $S_j = 0$, the PTM takes no action for the type-$j$ servers.

In this work we estimate the required number of servers for each epoch by performing workload prediction [47]. To introduce the prediction approach, we need to define two parameters that determine the characteristics of a workload: the total number of requests being processed at any given time and the request arrival rate. We denote the total number of requests and the request arrival rate at time $t$ by $r(t)$ and $\lambda(t)$, respectively. Suppose that each request requires $n_j^{avg}$ type-$j$ servers on average. The total number of required type-$j$ servers at time $t$, and the rate at which this number changes, can then be estimated as $r(t)\times n_j^{avg}$ and $\lambda(t)\times n_j^{avg}$, respectively. This is a reasonable assumption because most cloud computing services need a relatively fixed number of servers of each type to service a request. Examples of cloud computing applications include Web services such as Web search and Web mail, connection services (e.g., Yahoo Messenger, Google Talk, and Windows Live Messenger), web crawlers and web indexing applications, and database-centric workloads such as On-Line Transaction Processing (OLTP) and Decision Support System (DSS) workloads. However, these applications demand non-uniform compute resources over time (across multiple decision epochs). Therefore, the value of $n_j^{avg}$ is updated by using a moving average.
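The bookkeeping described above can be illustrated with a short Python sketch. It assumes a simple fixed-window moving average for the per-request server demand (the text does not say which kind of moving average is used), and the names `MovingAverage`, `ptm_action`, and the window length are hypothetical.

```python
from collections import deque


class MovingAverage:
    """Simple moving average over the last `window` samples (assumed form)."""

    def __init__(self, window: int) -> None:
        self.samples = deque(maxlen=window)

    def update(self, value: float) -> float:
        """Add a new observation and return the current average."""
        self.samples.append(value)
        return sum(self.samples) / len(self.samples)


def ptm_action(required_next: int, on_now: int) -> tuple:
    """Translate S_j = n_j(t+1) - n_j(t) into an employ/retire decision."""
    s_j = required_next - on_now
    if s_j > 0:
        return ("employ", s_j)
    if s_j < 0:
        return ("retire", -s_j)
    return ("none", 0)


if __name__ == "__main__":
    n_avg = MovingAverage(window=3)  # hypothetical window length
    for observed in (2.0, 4.0, 6.0, 8.0):
        print(n_avg.update(observed))
    print(ptm_action(required_next=120, on_now=100))
```

A simple windowed average keeps the estimate responsive to slow drifts in per-request demand while smoothing epoch-to-epoch noise; an exponentially weighted average would be an equally plausible choice.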
In the remainder of this chapter, we consider internet services as the application running on the datacenter.

4.5.1.1 Calculating the Required Server Count

For a given maximum tolerable CPU (or I/O) utilization and a specific application workload, we can find a maximum tolerable load (i.e., the number of connections to each server for connection-intensive internet services) and the maximum tolerable rate at which the load of a server changes [14]. We denote the maximum tolerable load and the maximum tolerable load rate for each type-$j$ server by $R_j^{max}$ and $\Lambda_j^{max}$, respectively. Our algorithms therefore have to guarantee that the load of any type-$j$ server does not exceed $R_j^{max}$ and that the rate at which this load changes does not exceed $\Lambda_j^{max}$. The values of $R_j^{max}$ and $\Lambda_j^{max}$ depend on the type of application and the amount of bandwidth that the corresponding tasks take from each server. For our analysis in this work, we use $R_j^{max} = 35$ and $\Lambda_j^{max} = 5$. Based on what we have explained so far, we may ideally calculate the required number of type-$j$ servers in the datacenter at time $t$ as:

(4.9)

where $\lceil x \rceil$ denotes the ceiling of $x$. However, as stated in [14], there is a problem with this equation. In (4.9) we assume that a newly employed server will have $R_j^{max}$ connections right after it is turned ON. This is not a valid assumption; the load of a newly employed server rises gradually from 0 to $R_j^{max}$. This is due to two

To address the abovementioned problems, we add some extra margin to (4.9) by introducing two correction coefficients into the calculation of the total required number of servers at time $t$:

(4.10)

where $\gamma_r$ and $\gamma_\lambda$ are the correction coefficients, and $\gamma_r, \gamma_\lambda > 1$. In Section 4.5.1.2 we explain how these coefficients are chosen.

4.5.1.2 Workload Prediction

Enterprise datacenter workloads typically show a repetitive pattern with a period on the order of hours, days, or weeks.
In [23] the authors demonstrated that, for the purpose of workload forecasting, the period of workload behavior is equal to 7 days for a large variety of datacenter applications. The forecasting method we use is composed of two exponential smoothing components, one for the trend value and one for the offset value prediction.

(4.11)

The trend component predicts the periodic pattern that has been exhibited with period $T$, whereas the offset component uses the correlation between the estimated value and its immediate previous neighbors. In this work we use the same idea in the form of the following forecasting equation:

(4.12)

where $x(\cdot)$ and $s(\cdot)$ represent actual and predicted values, respectively. Our experiments show that four $p_i$ coefficients ($p_1$-$p_4$) and two $q_i$ coefficients ($q_1$-$q_2$) result in a small prediction error. The $p_i$ and $q_i$ coefficients are updated adaptively to reflect the time-varying behavior of the workload. The assumption is that the values of $x(\cdot)$ at every $T$ steps are highly correlated. The offset component reflects the correlation between the observed value differences in the recent history with respect to the predicted trend (short-term correlations).

Forecasting of $r(t)$ and $\lambda(t)$ is done through equation (4.12). The predicted values of $r(t)$ and $\lambda(t)$ obtained from (4.12) are then used to estimate the total number of ON servers using (4.10). The correction coefficients, $\gamma_r$ and $\gamma_\lambda$, are calculated from real-time prediction error measurements to avoid performance loss. We set the correction factors in (4.10) to $\gamma_r = 1 + 3\sigma_r$ and $\gamma_\lambda = 1 + 3\sigma_\lambda$, where $\sigma_r$ and $\sigma_\lambda$ are the standard deviations of the corresponding prediction errors. Figure 4.3.a illustrates the performance of the workload forecasting of the total number of requests over a one-week time span, and Figure 4.3.b shows the relative error observed in the prediction of the total number of requests versus its actual value for the same time span. We can calculate the standard deviation of the prediction error functions.
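As a rough illustration of how measured prediction errors could feed the server count, the Python sketch below derives a correction coefficient as 1 + 3σ from a list of relative errors and then sizes the server pool. The combination rule used here (the larger of a load-based and a rate-based ceiling, each scaled by its correction coefficient) is an assumption about the form of (4.10), and the function names and sample numbers are hypothetical.

```python
import math
import statistics


def correction_coefficient(relative_errors) -> float:
    """Return 1 + 3*sigma, with sigma the population std. dev. of the relative errors."""
    return 1.0 + 3.0 * statistics.pstdev(relative_errors)


def required_servers(r_pred: float, lam_pred: float, n_avg: float,
                     gamma_r: float, gamma_lam: float,
                     r_max: float = 35.0, lam_max: float = 5.0) -> int:
    """Server count with safety margins (assumed max-of-two-ceilings form)."""
    by_load = math.ceil(gamma_r * r_pred * n_avg / r_max)      # load constraint
    by_rate = math.ceil(gamma_lam * lam_pred * n_avg / lam_max)  # load-rate constraint
    return max(by_load, by_rate)


if __name__ == "__main__":
    gamma_r = 1.0 + 3.0 * 0.052  # sigma_r value reported in the text
    print(required_servers(r_pred=3500, lam_pred=100, n_avg=1.0,
                           gamma_r=gamma_r, gamma_lam=1.0))
```

The three-sigma margin means that, if the relative prediction error were roughly normally distributed, under-provisioning would occur only rarely; this interpretation is offered as motivation and is not stated in the source.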
For the data shown in Figure 4.3, the standard deviation is $\sigma_r = 0.052$.

Figure 4.3: Prediction of total number of requests, r(t). (a) Predicted data vs. actual data. (b) Prediction error.

4.5.1.3 Power/Temperature Manager

In this section we present our Power-Thermal Manager (PTM) unit, which sits at the heart of TA-DRP. As explained at the beginning of Section 4.5, the inputs to the PTM are the server status in the current epoch and the required number of ON servers (of each server class) for the next epoch. Using these inputs, the PTM decides on the location of the servers that are being employed or retired. This is sometimes referred to as the server placement problem: the goal is to assign virtual servers or applications to the physical servers in the datacenter so that the spatial distribution of ON and OFF servers results in the minimum overall power cost in the datacenter, including the power dissipations of the servers and the air conditioning unit. Optimality is defined as minimizing the total datacenter power consumption given in (4.8). The PTM unit pursues this goal by dynamically employing or retiring the number of servers requested by the WM. This is done by a combination of three means: (i) employing the right number of servers of each type for each epoch and retiring any unused servers; (ii) chassis consolidation, i.e., turning ON only the minimum number of chassis and thereby eliminating the unnecessary base power consumption of the chassis; and (iii) maximizing the supplied cold air temperature $T_s$, and thus the cooling efficiency, by optimally choosing the locations of the servers and chassis that are to be employed or retired. The outputs of the PTM unit are the supplied cold air temperature value and the exact ON/OFF status of servers/chassis for the next epoch.
Every time there is a need to employ new servers, we keep the currently ON servers ON and simply add new servers (we do not retire an ON server and employ a different server in its place). This avoids the performance and energy overheads associated with retiring a busy server, employing a new server, and transferring the retired server's jobs to the new server. The downside of this ON-server persistency policy is that it does not guarantee a power-optimal solution across all datacenter utilization levels.

To present the PTM problem statement, we first consider the cost function ($P_{DC}$) given in (4.8). For simplicity, we extract the $T_s$ dependency from the cost function. The optimum value of $T_s$ is determined by performing a linear search across all possible $T_s$ values and finding the value that results in the minimum $P_{DC}$. We note that for a fixed $T_s$ value, $\mathrm{COP}(T_s)$ becomes a constant and can be taken out of the cost function. In this case, the cost function simply becomes $P_{IT}$. We introduce a new integer variable for each chassis that takes on values from {0,1} and signals whether the chassis is ON or OFF. This variable is denoted by $x_i$ for the $i$th chassis, and is defined as:

$$x_i = \begin{cases} 1 & \text{if the } i\text{th chassis is ON} \\ 0 & \text{otherwise} \end{cases} \qquad (4.13)$$

With this new definition, it can be shown that the cost function ($P_{IT}$) becomes:

(4.14)

where $x = [x_1, x_2, \ldots, x_N]^T$, and $1_{1\times N}$ denotes an $N$-dimensional row vector with all elements equal to 1. Also, the inlet temperature vector in (4.4) will change to:

(4.15)

where $\Gamma$ is a diagonal matrix. PTM decisions depend on the current server/chassis ON/OFF status. To capture the ON/OFF status of the chassis, we define a new column vector, $x^0$, of size $N$ as $x^0 = [x_1^0, x_2^0, \ldots, x_N^0]^T$, where $x_i^0 = 1$ if the $i$th chassis is currently ON, and $x_i^0 = 0$ otherwise. Similarly, we define a new matrix to capture the ON/OFF status of each specific type of server.
Each entry of this matrix is the number of type-$j$ servers that are currently employed on the $i$th chassis. Next we explain the proposed server employment and retirement policies that we use in this chapter.

4.5.1.4 Server Retirement Policy

The purpose of the retirement policy is to minimize the number of ON servers by retiring surplus ones. Every time the WM decides to reduce the number of ON servers of a certain type, the PTM unit selects some candidate servers to retire. Unlike the turn-ON scenario (cf. Section 4.5.1.5), where the candidate servers are employed immediately, in the turn-OFF case the PTM passes the list of candidate servers to the Request Dispatcher (RD) and asks it to retire these servers simply by not assigning new requests to them. The PTM also keeps a list of retiring servers, which it updates at each epoch. If a server on this list stays idle, i.e., it does not provide service to any request, for $u$ epochs, that server is put in the halt (hibernate) mode. Note that we do not completely turn OFF the retired servers (unless the whole chassis is being turned OFF); we put them in hibernate mode instead. This is because a hibernated server consumes very little power and wakes up faster than an OFF server. Similar to the retiring server list, the PTM also maintains a list of retiring chassis, comprising ON chassis that include hibernated servers and no ON servers. This list is also updated at each epoch. If a certain chassis stays on the retiring chassis list for $v$ epochs, all its servers and the whole chassis are turned OFF.

Now we explain how the list of candidate retiring servers is determined by the PTM. The power-thermal optimization problem to determine the retiring type-$j$ servers can be formulated as the following Integer Linear Programming (ILP) problem.
(4.16)

where $T_{critical}$ is a vector of size $N$ with all entries equal to a critical inlet temperature, $T_{critical}$ (the inlet temperature of all chassis must be less than this value in order to ensure that the corresponding servers will not overheat and eventually fail). A typical value for $T_{critical}$ is 25°C. The outputs of the ILP problem in (4.16) are the $c_{ij}$ and $x_i$ values. The $x_i$ values determine which chassis are to stay ON and which are to be retired. The $c_{ij}$ values determine the number of servers of each type that are to be retired on each chassis.

4.5.1.5 Server Employment Policy

The purpose of the server employment policy is to determine the optimum $T_s$ value and the locations of the required number of ON servers, and also to turn them ON. As mentioned in Section 4.5.1.4, the PTM maintains a list of retiring servers and a list of hibernating servers. Each time the PTM is asked to make use of new servers of a certain type, it first tries to meet this request by employing servers from the retiring server list. If this is not possible, the PTM tries to satisfy the request by employing servers from the hibernating server list. Finally, if this is not possible either, the PTM employs new servers by solving an optimization problem as explained below. The turn-ON power-thermal optimization problem to employ type-$j$ servers can be formulated as the following Integer Linear Programming (ILP) problem.

(4.17)

Note that the problem statement in (4.17) is the power optimization problem for a non-idle datacenter (a non-idle datacenter is a datacenter that already contains some ON servers). In that sense it is different from the problem statement presented in [57], which is for an idle datacenter.

4.5.1.6 Calculating the Optimum T s value

Every time that we solve the retirement/employment policy, we also need to determine the optimum $T_s$ value.
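One way to organise this search is a simple sweep over candidate supply temperatures, sketched below in Python. The placement solver is passed in as a callback that returns the IT power for a given supply temperature, or `None` when the thermal constraints cannot be met; this interface, the quadratic COP coefficients (assumed from the model commonly quoted from [40]), and the toy solver are all hypothetical stand-ins and not the dissertation's implementation.

```python
def cop(t_s: float) -> float:
    """Assumed quadratic COP model (coefficients as commonly quoted from [40])."""
    return 0.0068 * t_s ** 2 + 0.0008 * t_s + 0.458


def best_supply_temperature(candidates, solve_placement):
    """Sweep candidate T_s values and keep the one with minimum total power.

    solve_placement(t_s) is a hypothetical callback: it returns the IT power
    (watts) of the best feasible placement at that T_s, or None if infeasible.
    """
    best = None
    for t_s in candidates:
        p_it = solve_placement(t_s)
        if p_it is None:
            continue  # thermal constraints cannot be met at this T_s
        p_dc = p_it * (1.0 + 1.0 / cop(t_s))  # IT power plus cooling power
        if best is None or p_dc < best[1]:
            best = (t_s, p_dc)
    return best


def toy_solver(t_s: float):
    """Stand-in for the ILP: constant IT power, feasible only up to 20 deg C."""
    return 50_000.0 if t_s <= 20 else None


if __name__ == "__main__":
    print(best_supply_temperature(range(15, 26), toy_solver))
```

With the toy solver the sweep settles on the highest feasible temperature, since the IT power is constant and the COP grows with the supply temperature; with a real solver the IT power may also change with the supply temperature, which is why every candidate is evaluated.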
We do this by iterating over the possible $T_s$ values and solving the retirement/employment problem for each of them. We then pick the solution that gives the minimum total power consumption.

4.6 Simulation Results

In this section we evaluate the power dissipation and cooling cost of our proposed technique, and we compare it with some of the common techniques in datacenter operations.

4.6.1 Simulation Setup

We use a small-scale datacenter with physical dimensions of 9.6m×8.4m×3.6m consisting of 7U blade servers. The datacenter has two rows that are put together in a hot-aisle/cold-aisle arrangement, as shown in Figure 4.4. Each row has five 42U racks. Each rack consists of five chassis, each having 20 blade servers. Therefore, there are a total of 1,000 servers in this datacenter. A CRAC unit is used to supply cold air to the room at $f = 8\ \mathrm{m^3/s}$. We may have $K=1$ or $K=2$ type(s) of server in the datacenter. The power parameters for servers and chassis are $\gamma = 820$ W, $\alpha_1 = 85$ W (uses a higher-performance core with larger EPI), and $\alpha_2 = 50$ W (uses a lower-performance core with smaller EPI). We have simulated the WM and PTM units using the algorithms explained in the previous sections. For the RD, we used a simple round robin scheduling algorithm. To solve the ILP problems for server employment and retirement, we first used the LP solver package of TOMLAB [59], and then found the closest integer solution to the continuous variable solution.

Figure 4.4: Datacenter structure used in the simulations.

To the best of our knowledge, the present work is the first that addresses temperature-aware dynamic resource provisioning in a power-optimized datacenter, so a comparison with prior work is not possible. However, we compared the proposed technique (TA-DRP) with two (reasonable) greedy heuristics, called GREEDY and TA-GREEDY.
Their difference from TA-DRP is that they use different techniques for server retirement and employment; otherwise they operate with exactly the same procedures for the WM and RD units.

Power-Aware Greedy (GREEDY)

This heuristic algorithm performs chassis consolidation using a greedy approach without considering the cooling efficiency. For server employment, it starts with the chassis that have the maximum occupancy factor (maximum number of employed servers) so that no new chassis is turned on. For retirement, on the other hand, it uses the least occupied chassis, so that this chassis has a better chance of being turned off later when the workload diminishes.

Temperature-Aware Greedy (TA-GREEDY)

In this heuristic algorithm, the chassis' inlet temperatures are given a higher weight (priority) than the chassis' occupancy factor. This is to prevent hot spots and an imbalanced temperature distribution across the datacenter. Indirectly, balancing the heat distribution in the datacenter can save power. The algorithm maintains a list of relatively hot servers whose inlet temperatures are above a threshold value, e.g., $T_{th} = 22$°C. These servers are assigned a higher priority for retirement. If there are no hot servers to retire, this heuristic picks the retiring candidates in the same fashion as the GREEDY heuristic. In the same spirit, a cold chassis with the maximum number of employed servers is given a high priority for server employment. Thus TA-GREEDY avoids, as much as possible, turning on any servers in a chassis on the hot list. Note that both GREEDY and TA-GREEDY adjust the $T_s$ value, if and when needed, so that the thermal constraints are met.

4.6.2 Workload Generation

Our datacenter-level simulations were done by using a benchmark suite where the number of existing requests and the expected request arrival rate are the input parameters.
We could thus simulate a wide range of workload scenarios corresponding to different initial occupancies for the global queue of Section 4.5 and different request arrival rates into the queue. The requests were homogeneous in terms of their CPU and memory usage.

4.6.3 Power and T s Comparison

Figure 4.5 and Figure 4.6 show the total power consumption and the $T_s$ value for the three techniques described above. Both figures are the result of running the workload for one full day, as also discussed in Sections 4.5.1.2 and 4.6.2. The workload prediction is done by the WM and the result is passed to the corresponding power/thermal management unit (the PTM in the case of TA-DRP). It is seen from Figure 4.5 that TA-GREEDY, which considers temperature in addition to consolidation, shows better results than GREEDY. More importantly, the proposed TA-DRP algorithm achieves much lower power consumption than both heuristics at all times during the full-day operation of this datacenter.

Figure 4.5: Comparison of the total power consumption for GREEDY, TA-GREEDY and TA-DRP (K=1). [Axes: time in the day (minutes) vs. total power consumption (kW).]

Figure 4.6: Comparison of supplied cold air temperature for K=1. [Axes: time in the day (minutes) vs. supplied cold air temperature $T_s$ (°C).]

Another interesting point that can be seen from Figure 4.5 is that for lower datacenter utilization (lower workload), TA-DRP achieves even higher relative power savings. This is because there are more opportunities for chassis consolidation and improved cooling efficiency under light load conditions, and these opportunities are better exploited by TA-DRP.

Figure 4.7: Comparison of the total power consumption for K=2.

Figure 4.7 shows the power consumption of the datacenter simulated for half a day.
In this case all the requests to the datacenter use two different types of servers ($K=2$). The average power consumption of TA-DRP over this time period is 25% and 18% less than those of the GREEDY and TA-GREEDY heuristics, respectively. Our future plan is to analyze the characteristics of the workload generated by standard benchmarks such as SPECpower_ssj2008 [4] and TPC-APP to demonstrate the degree of power savings achieved by TA-DRP for these workloads.

Figure 4.8: Temperature distribution of a snapshot of TA-DRP. (a) $n_1 = n_2 = 410$, number of ON chassis = 40. (b) $n_1 = n_2 = 205$, number of ON chassis = 23. [Axes: chassis number vs. temperature (°C).]

Figure 4.8 shows the temperature distribution in the room for two snapshots of the TA-DRP algorithm. In both cases we assume that all the input requests use two types of servers. Figure 4.8.a represents the case when 820 servers are ON (410 of each type). TA-DRP has used 40 chassis to provide 820 servers in this case. Figure 4.8.b shows the case when 410 servers are ON (205 of each type), and TA-DRP has used 23 chassis to provision 410 servers.

Chapter 5 Conclusions

The higher speed and computing capacity of electronic circuits and computing systems have made low power and energy efficiency a significant issue for circuit and system designers. Issues such as packaging, cooling, temperature management, battery life, and electricity cost all demand low-power and energy-efficient circuits and systems. This dissertation presented innovative circuit-level and system-level techniques to reduce power consumption and improve energy efficiency.
The first technique presented in this dissertation, charge recycling for power-gated circuits, reduces energy consumption during mode transition (sleep to active and active to sleep) in power-gated circuits. This technique, which saves up to 50% of the wasted mode-transition energy, benefits from recycling the electric charge between virtual ground and virtual supply at the edge of the transition.

The second circuit-level technique presented in this dissertation introduced multimodal and data-retentive power gating structures. The scheme uses an innovative tri-modal power-gating switch to implement data-retentive power gating structures and multi-drowsy mode circuits. Another use of this switch is to perform voltage scaling with the same infrastructure that is used for power gating.

Finally, in addition to the circuit-level techniques, this dissertation addressed some system-level energy-efficiency issues. We presented a dynamically power-optimized datacenter that exploits correctly provisioned resources (servers and supplied cold air). The presented framework decides which candidate servers to retire or employ from the pool of servers. It also determines the required value for the supplied cold-air temperature. Both of these decisions are made optimally by continuously solving the corresponding mathematical programs. The power saving is achieved by a combination of chassis consolidation and efficient cooling.

References

[1] "Oklahoma State University's Standard Cell Library," http://vcag.ecen.okstate.edu/projects/scells/.
[2] "Server Consolidation Using Quad-Core Processors," http://www.intel.com/it/pdf/consolidate-using-quadcore.pdf.
[3] "Report to congress on server and data center energy efficiency," U. S. E. P. Agency, Ed., 2007.
[4] "SPECpower_ssj2008," http://www.spec.org/power_ssj2008/, 2008.
[5] A. Abdollahi, F. Fallah, and M. Pedram, "An effective power mode transition technique in MTCMOS," in Design Automation Conference, 2005, pp. 37-42.
[6] A. Abdollahi, F. Fallah, and M. Pedram, "A robust power gating structure and power mode transition strategy for MTCMOS design," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 15, pp. 80-89, 2007.
[7] K. Agarwal, H. Deogun, D. Sylvester, and K. Nowka, "Power Gating with Multiple Sleep Modes," in International Symposium on Quality Electronic Design, 2006, pp. 633-637.
[8] M. Anis, S. Areibi, and M. Elmasry, "Design and optimization of multithreshold CMOS (MTCMOS) circuits," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 22, pp. 1324-1342, 2003.
[9] M. Anis, S. Areibi, M. Mahmoud, and M. Elmasry, "Dynamic and leakage power reduction in MTCMOS circuits using an automated efficient gate clustering technique," in Design Automation Conference, 2002, pp. 480-485.
[10] L. A. Barroso, J. Dean, and U. Hölzle, "Web Search for a Planet: The Google Cluster Architecture," IEEE Micro, vol. 23, 2003.
[11] L. Benini and G. D. Micheli, Dynamic Power Management: design techniques and CAD tools: Kluwer, 1997.
[12] A. Bogliolo, L. Benini, E. Lattanzi, and G. D. Micheli, "Specification and analysis of power-managed systems," Proceedings of IEEE, vol. 92, pp. 1308-1346, 2004.
[13] V. Cardellini, M. Colajanni, and P. S. Yu, "Dynamic load balancing on Web-server systems," IEEE Internet Computing Magazine, vol. 3, pp. 28-39, 1999.
[14] G. Chen, W. He, J. Liu, S. Nath, L. Rigas, L. Xiao, and F. Zhao, "Energy-aware server provisioning and load dispatching for connection-intensive internet services," in 5th USENIX Symposium on Networked Systems Design and Implementation, 2008, pp. 337-350.
[15] Z. Chen, L. Wei, M. Johnson, and K. Roy, "Estimation of Standby Leakage Power in CMOS Circuits Considering Accurate Modeling of Transistor Stacks," in International Symposium on Low Power Electronics and Design, 1998, pp. 239-244.
[16] D.-S. Chiou, S.-H. Chen, S.-C. Chang, and C. Yeh, "Timing driven power gating," 2006.
[17] D.-S. Chiou, D.-C. Juan, Y.-T. Chen, and S.-C. Chang, "Fine-grained sleep transistor sizing algorithm for leakage power minimization," in Design Automation Conference, 2007.
[18] Y. Choi, N. Chang, and T. Kim, "DC-DC converter-aware power management for battery-operated embedded systems," in Design Automation Conference, 2005.
[19] A. Davoodi and A. Srivastava, "Wake-up protocols for controlling current surges in MTCMOS-based technology," in Asia South Pacific Design Automation Conference, 2005, pp. 868-871.
[20] D. M. Dias, W. Kish, R. Mukherjee, and R. Tewari, "A Scalable and Highly Available Web Server," in IEEE Computer Society International Conference, 1996, pp. 85-92.
[21] M. Ghasemazar, E. Pakbaznia, and M. Pedram, "Minimizing energy consumption of a chip multiprocessor system through simultaneous core consolidation and dynamic voltage/frequency scaling," in IEEE International Symposium on Circuits and Systems, 2010.
[22] M. Ghasemazar, E. Pakbaznia, and M. Pedram, "Minimizing the power consumption of a chip multiprocessor under an average throughput constraint," in International Symposium on Quality of Electronic Design, 2010.
[23] D. Gmach, J. Rolia, L. Cherkasova, and A. Kemper, "Workload Analysis and Demand Prediction of Enterprise Data Center Applications," in International Symposium on Workload Characterization, 2007, pp. 171-180.
[24] S. Henzler, T. Nirschl, C. Pacha, P. Spindler, P. Teichmann, M. Fulde, M. Eireiner, T. Fischer, G. Georgakos, J. Berthold, and D. Schmitt-Landsiedel, "Dynamic State-Retention FlipFlop for Fine-Grained Sleep-Transistor Scheme," in ESSCIRC, 2005.
[25] P. Heydari and M. Pedram, "Ground bounce in digital VLSI circuits," IEEE Transactions on VLSI Systems, pp. 180-193, 2003.
[26] Intel, Microsoft, and Toshiba, "Advanced Configuration and Power Interface specification," 1996.
[27] H. Jiang, M. Marek-Sadowska, and S. R. Nassif, "Benefits and Costs of Power-Gating Technique," in International Conference on Computer Design, 2005.
[28] J. Kao, A. Chandrakasan, and D. Antoniadis, "Transistor Sizing Issues and Tool for Multi Threshold CMOS Technology," in Design Automation Conference, 1997, pp. 409-414.
[29] J. Kao, S. Narendra, and A. Chandrakasan, "MTCMOS hierarchical sizing based on mutual exclusive discharge patterns," in Design Automation Conference, 1998, pp. 495-500.
[30] J. Kao, S. Narendra, and A. Chandrakasan, "Subthreshold leakage modeling and reduction techniques," in International Conference on Computer Aided Design, 2002, pp. 141-148.
[31] H. Kawaguchi, K. Nose, and T. Sakurai, "A Super Cut-Off CMOS (SCCMOS) Scheme for 0.5-V Supply Voltage with Picoampere Stand-By Current," IEEE Journal of Solid-State Circuits, vol. 35, pp. 1498-1501, 2000.
[32] V. Khandelwal and A. Srivastava, "Leakage Control Through Fine-Grained Placement and Sizing of Sleep Transistors," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 26, 2007.
[33] S. Kim, C. J. Choi, D. K. Jeong, S. V. Kosonocky, and S. B. Park, "Reducing ground bounce noise and stabilizing the data-retention voltage of power-gating structures," IEEE Transactions on Electron Devices, vol. 55, pp. 97-205, 2008.
[34] S. Kim, S. V. Kosonocky, D. R. Knebel, and K. Stawiasz, "Experimental measurement of a novel power gating structure with intermediate power saving mode," in International Symposium on Low Power Electronics and Design, 2004, pp. 20-25.
[35] S. Kim, S. V. Kosonocky, and D. R. Knebel, "Understanding and minimizing ground bounce during mode transition of power gating structures," in International Symposium on Low Power Electronics and Design, 2003, pp. 22-25.
[36] R. Kumar, D. Tullsen, N. Jouppi, and P. Ranganathan, "Heterogeneous chip multiprocessors," Computer, vol. 38, pp. 32-38, 2005.
[37] S. Lee, S. Das, T. Pham, T. Austin, D. Blaauw, and T. Mudge, "Reducing pipeline energy demands with local DVS and dynamic retiming," in International Symposium on Low Power Electronics and Design, 2004.
[38] C. Long and L. He, "Distributed sleep transistor network for power reduction," IEEE Transactions on VLSI Systems, vol. 12, pp. 937-946, 2004.
[39] K. S. Min and T. Sakurai, "Zigzag Super Cut-off CMOS (ZSCCMOS) Scheme with Self-Saturated Virtual Power Lines for Subthreshold-Leakage-Suppressed Sub-1-V-VDD LSI's," in European Solid-State Circuits Conference, 2002, pp. 679-682.
[40] J. Moore, J. Chase, P. Ranganathan, and R. Sharma, "Making scheduling "cool": temperature-aware workload placement in data centers," in USENIX Annual Technical Conference, 2005.
[41] A. Mutapcic, S. Boyd, S. Murali, D. Atienza, G. D. Micheli, and R. Gupta, "Processor Speed Control With Thermal Constraints," IEEE Transactions on Circuits and Systems, vol. 56, pp. 1994-2008, 2009.
[42] S. Mutoh, T. Douseki, Y. Matsuya, T. Aoki, S. Shigematsu, and J. Yamada, "1-V Power Supply High-speed Digital Circuit Technology with Multi-threshold-Voltage CMOS," IEEE Journal of Solid-State Circuits, vol. 30, pp. 847-854, 1995.
[43] S. Mutoh, S. Shigematsu, Y. Matsuya, H. Fukada, and J. Yamada, "1-V Multi-Threshold CMOS DSP with an Efficient Power Management Technique for Mobile Phone Application," in International Solid-State Circuit Conference, 1996, pp. 168-169.
[44] E. Pakbaznia, F. Fallah, and M. Pedram, "Charge recycling in MTCMOS circuits: concept and analysis," in Design Automation Conference, 2006, pp. 97-102.
[45] E. Pakbaznia, F. Fallah, and M. Pedram, "Sizing and placement of charge recycling transistors in MTCMOS circuits," in International Conference on Computer-Aided Design, 2007.
[46] E. Pakbaznia, F. Fallah, and M. Pedram, "Charge recycling in power-gated CMOS circuits," IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, vol. 27, pp. 1798-1811, 2008.
[47] E. Pakbaznia, M. Ghasemazar, and M. Pedram, "Temperature-aware dynamic resource provisioning in a power-optimized datacenter," in Design Automation and Test in Europe, 2010.
[48] E. Pakbaznia and M. Pedram, "Coarse-Grain MTCMOS Sleep Transistor Sizing Using Delay Budgeting," in Design Automation and Test in Europe, 2008, pp. 385-390.
[49] E. Pakbaznia and M. Pedram, "Design and application of multi-modal power-gating structures," in International Symposium on Quality of Electronic Design, 2009.
[50] E. Pakbaznia and M. Pedram, "Minimizing data center cooling and server power costs," in International Symposium on Low Power Electronics and Design, 2009, pp. 145-150.
[51] E. Pinheiro, R. Bianchini, E. V. Carrera, and T. Heath, "Load Balancing and Unbalancing for Power and Performance in Cluster-Based Systems," in Workshop on Compilers and Operating Systems for Low Power, 2001.
[52] A. Ramalingam, B. Zhang, A. Devgan, and D. Pan, "Sleep transistor sizing using timing criticality and temporal currents," in Asia South Pacific Design Automation Conference, 2005.
[53] N. Rasmussen, "Calculating Total Cooling Requirements for Data Centers," American Power Conversion, white paper 25, 2007.
[54] K. Shi and D. Howard, "Challenges in sleep transistor design and implementation in low-power designs," in Design Automation Conference, 2006.
[55] S. Shigematsu, S. Mutoh, Y. Matsuya, Y. Tanabe, and J. Yamada, "A 1-V high-speed MTCMOS circuit scheme for power-down application circuits," IEEE Journal of Solid-State Circuits, vol. 32, pp. 861-869, 1997.
[56] A. Tada, H. Notani, and M. Numa, "A novel power gating scheme with charge recycling," IEICE Electronics Express, vol. 3, pp. 281-286, 2007.
[57] Q. Tang, S. K. S. Gupta, and G. Varsamopoulos, "Energy-Efficient Thermal-Aware Task Scheduling for Homogeneous High-Performance Computing Data Centers: A Cyber-Physical Approach," IEEE Transactions on Parallel and Distributed Systems, vol. 19, pp. 1458-1472, 2008.
[58] Y. Taur, "CMOS design near the limit of scaling," IBM Journal of Research and Development, vol. 46, pp. 213-222, 2002.
[59] TOMLAB, "http://tomopt.com/tomlab/," 2009.
Abstract
The growing computing and storage capacity of electronic devices and information processing systems has dramatically increased their power consumption and energy usage, making the energy efficiency of circuit components and computing systems a very important concern. Energy efficiency is desirable for portable electronics (e.g., mobile phones, laptops, and tablets) because it lengthens battery lifetime. In more sophisticated computing systems that are not battery operated (e.g., web servers and datacenters), better energy efficiency reduces the total cost of ownership by reducing the cost of electricity (for both computing and cooling) and lessens the environmental impact (e.g., by reducing CO2 emissions).
Asset Metadata
Creator
Pakbaznia, Ehsan (author)
Core Title
Energy-efficient shutdown of circuit components and computing systems
School
Viterbi School of Engineering
Degree
Doctor of Philosophy
Degree Program
Electrical Engineering
Publication Date
08/06/2010
Defense Date
12/11/2009
Publisher
University of Southern California (original), University of Southern California. Libraries (digital)
Tag
charge recycling, datacenter, energy efficiency, energy efficient, low power, MTCMOS, multimodal, OAI-PMH Harvest, power gating, resource provisioning, temperature-aware, VLSI
Language
English
Contributor
Electronically uploaded by the author (provenance)
Advisor
Pedram, Massoud (committee chair), Draper, Jeffrey T. (committee member), Nakano, Aiichiro (committee member)
Creator Email
ehsan.pakbaznia@gmail.com,pakbazni@usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-m3334
Unique identifier
UC1466758
Identifier
etd-Pakbaznia-3946 (filename), usctheses-m40 (legacy collection record id), usctheses-c127-378029 (legacy record id), usctheses-m3334 (legacy record id)
Legacy Identifier
etd-Pakbaznia-3946.pdf
Dmrecord
378029
Document Type
Dissertation
Rights
Pakbaznia, Ehsan
Type
texts
Source
University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Repository Name
Libraries, University of Southern California
Repository Location
Los Angeles, California
Repository Email
cisadmin@lib.usc.edu