VARIATION-AWARE CIRCUIT AND CHIP LEVEL POWER OPTIMIZATION IN DIGITAL VLSI SYSTEMS

by

Mohammad Ghasemazar

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL ENGINEERING)

December 2011

Copyright 2011 Mohammad Ghasemazar

DEDICATION

To my lovely wife, my parents, and my brothers.

ACKNOWLEDGEMENTS

This dissertation would not have been possible without the guidance and help of several individuals who in one way or another contributed and extended their valuable assistance in the preparation and completion of this study.

First and foremost, I am most grateful to my advisor, Professor Massoud Pedram, for inviting me to join his research group and providing invaluable support and guidance throughout my PhD studies at the University of Southern California. He has been a continuous source of motivation for me, and I sincerely thank him for all I have achieved. His multi-disciplinary approach and global vision of research problems have been instrumental in defining my professional career.

I would also like to thank my other dissertation and qualification committee members, Professor Sandeep K. Gupta, Professor Murali Annavaram, Professor Jeffrey Draper, and Professor Aiichiro Nakano, for their insightful suggestions and for their valuable time. Special thanks to Dr. Behnam Amelifard, with whom I collaborated on some projects while he was a student at USC. I truly appreciate his mentorship, priceless advice, feedback, and help. I also thank all those who supported me in any respect during my PhD: my best and dearest friends, my SPORT Lab colleagues, and the Electrical Engineering staff, particularly Annie Yu, Tim Boston, Janice Thompson, Estela Lopez, and Christina Fontenot, with special thanks to Diane Demetras for being such a helpful advisor.
I would like to thank my parents for their unconditional love and support. I would not have been able to accomplish my goals without their support and encouragement. I am much indebted to my mother and father for believing in me and encouraging me to pursue my goals. I would also like to thank my brothers, Amir and Amin, whom I love dearly. No matter how far away they may be physically, they are never far from my heart and mind.

Last but not least, words cannot express my gratitude to my beloved wife, Yasaman. Not only is she my adorable wife and closest friend, but also one of the smartest people I know, helping me with fruitful technical discussions. I would like to thank Yasaman for her constant love, support, and understanding.

TABLE OF CONTENTS

Dedication
Acknowledgements
List of Tables
List of Figures
Abstract
Chapter 1. Introduction
  1.1 Overview of Low Power Pipeline Design
  1.2 Overview of Power Management in CMPs
  1.3 Overview of Power and Thermal Management in CMPs
  1.4 Overview of Variability in Digital Circuits and Systems
  1.5 Dissertation Contributions
  1.6 Outline of this Dissertation
Chapter 2. Pipeline Power-Delay Optimization by Opportunistic Time Borrowing
  2.1 Introduction
  2.2 Prior Work
  2.3 Background
  2.4 Soft-Edge Flip-Flops (SEFF)
  2.5 Power-Delay Optimization in a Pipeline Using SEFF
  2.6 Power-Delay Optimal Soft Pipeline (OSP)
  2.7 Statistical Power-Delay Optimal Soft Pipeline (SOSP)
  2.8 Error-Tolerant Statistical Power-Delay Optimal Soft Pipeline (ESOSP)
  2.9 Bounding the Probability of Undetected Errors
  2.10 Experimental Results
  2.11 Summary
Chapter 3. Performance-Constrained Power Optimization in a Chip Multiprocessor
  3.1 Introduction
  3.2 Prior Work
  3.3 Background
  3.4 CMP Power Management Problem Statement
  3.5 Experimental Results
  3.6 Summary
Chapter 4. Performance Optimization of Chip Multiprocessors under Power and Thermal Constraints
  4.1 Introduction
  4.2 Prior Work
  4.3 Preliminaries
  4.4 Problem Formulation
  4.5 Proposed Solution
  4.6 Experimental Results
  4.7 A Real-time Power/Thermal Manager in Linux
  4.8 Summary
Chapter 5. Stochastic Dynamic Power Management for Chip Multiprocessors Subject to Variations
  5.1 Introduction
  5.2 Prior Work
  5.3 Preliminaries
  5.4 Proposed Variation Aware DPM
  5.5 Experimental Results
  5.6 Summary
Chapter 6. Conclusion
  6.1 Summary of Contributions
  6.2 Future Directions
Bibliography
Alphabetized Bibliography

LIST OF TABLES

Table 2-1. Power-delay-product improvement by OSP
Table 2-2. The optimum Tclk and window sizes obtained by OSP-FV
Table 2-3. Power-delay-product saving by SOSP
Table 2-4. ESOSP performance and comparison to baseline
Table 3-1. Configurations of the cores in the CMP system
Table 3-2. Average characteristics of benchmarks used to generate tasks
Table 3-3. Simulation parameters
Table 4-1. Voltage and frequency relationship in the AMD six-core processor OS4176OFU6DGO
Table 4-2. Voltage and frequency relationship in the Intel Pentium M processor
Table 4-3. Configurations of the cores in the CMP system
Table 4-4. Assignment of benchmarks in test1
Table 4-5. Total throughput of 8-core CMPs at different power budgets
Table 5-1. Definition of system parameters

LIST OF FIGURES

Figure 1-1. AMD's six-core Opteron processor [2].
Figure 2-1. A simple linear pipeline.
Figure 2-2. Power vs. delay relationship for delay elements.
Figure 2-3. Positive-edge triggered master-slave SEFF: a) circuit, b) timing diagram.
Figure 2-4. Negative-edge triggered HLFF: a) circuit, b) timing diagram.
Figure 2-5. Monostable-based SEFF: a) circuit, b) timing diagram.
Figure 2-6. SEFF timing characteristics (HSPICE simulations).
Figure 2-7. a) Setup time and b) hold time as functions of supply voltage and transparency window width (HSPICE simulations).
Figure 2-8. Power consumption as a function of SEFF window size.
Figure 2-9. Positive-edge SEFF with built-in error detection: a) circuit, b) timing waveform.
Figure 2-10. Timing waveforms for the SEFF.
Figure 2-11. Positive-edge SEFF with built-in error correction.
Figure 2-12. Time borrowing between two stages of a soft pipeline.
Figure 2-13. Example of slack passing.
Figure 2-14. Pseudo-code of the OSP algorithm.
Figure 2-15. Accuracy of linearly approximating the stage delay CDF.
Figure 2-16. Power-delay reduction by OSP.
Figure 2-17. Power-delay reduction by OSP.
Figure 3-1. System model with global and local queues.
Figure 3-2. Throughput-frequency relationship for a) low-CMF tasks, b) high-CMF tasks.
Figure 3-3. Block diagram of the proposed three-tiered PM.
Figure 3-4. Tier-2 task assignment scheme.
Figure 3-5. Closed-loop system representation.
Figure 3-6. a) Root locus and b) step response of the placed poles.
Figure 3-7. Power consumption of 3T-PM vs. baseline for different configurations and arrival rates.
Figure 3-8. Frequency waveforms used by 3T-PM and the baseline PM for the same throughput constraint.
Figure 3-9. Task loss rate improvement due to the task classification step in the 3T-PM solution.
Figure 4-1. Thermal model of a CMP.
Figure 4-2. Linear relationship of supply voltage and clock frequency in modern processors.
Figure 4-3. Block diagram of VPTM.
Figure 4-4. An online algorithm for VPTM.
Figure 4-5. Pseudo-code of the T1-PTM algorithm.
Figure 4-6. PI controllers of VPTM.
Figure 4-7. CMP floorplan in our thermal model, based on the Intel Xeon floorplan.
Figure 4-8. Performance of the VPTM algorithm.
Figure 4-9. Performance of the PHPL algorithm.
Figure 4-10. Total IPS under a power budget: VPTM vs. PHPL.
Figure 4-11. Comparison of VPTM and PHPL in 8-core CMPs.
Figure 4-12. Selection of frequencies in VPTM and PHPL at power budgets.
Figure 4-13. Limitation imposed by power and thermal constraints.
Figure 4-14. Limitation imposed by thermal and power constraints.
Figure 4-15. Sensitivity of frequencies in VPTM to the critical temperature.
Figure 4-16. Comparison of VPTM performance with Kθ = 0 (left) and Kθ ≠ 0 (right).
Figure 4-17. Sensitivity of VPTM performance to Kθ.
Figure 4-18. Sample sensor reading file.
Figure 4-19. Sample power measurement setup.
Figure 5-1. Structure of the variability-aware DPM.
Figure 5-2. Online algorithm for the variability-aware DPM.
Figure 5-3. Performance of the belief state estimator.
Figure 5-4. Comparison of BASE, UT-DPM, and VA-DPM: a) power consumption, b) queue occupancy, c) action ID.
Figure 5-5. Comparison of BASE, UT-DPM, and VA-DPM.

ABSTRACT

In today's IC design, one of the key challenges is the increase in the power consumption of circuits, which shortens the service time of battery-powered electronics and increases the cooling and packaging costs of server systems. At the same time, with increasing levels of variability in the characteristics of nanoscale CMOS devices and VLSI interconnects, and continued uncertainty in the operating conditions of VLSI circuits, achieving power efficiency and high performance in electronic systems under process, voltage, and temperature (PVT) variations has become a daunting, yet vital, task. This dissertation investigates power optimization techniques for CMOS VLSI circuits at both the circuit level and the chip level, while considering variations in the fabrication process and operating conditions of such circuits and systems.

First, at the circuit level, we present and solve the problem of power-delay optimal design of a linear pipeline utilizing soft-edge flip-flops, which allow opportunistic time borrowing within the pipeline. We formulate this problem using statistical delay models that characterize the effect of process variation on gate and interconnect delays. To enable further optimization, the soft-edge flip-flops are equipped with dynamic error detection (and correction) circuitry to detect and fix errors that might arise from possible over-clocking.

Second, we propose chip-level solutions to the problem of low-power design in Chip Multiprocessors (CMPs). We formulate this problem as minimizing the total power consumption of the CMP while maintaining an average system-level throughput, or as maximizing total CMP throughput subject to constraints on power dissipation or die temperatures. We then propose mathematically rigorous and robust algorithms in the form of dynamic power (and thermal) management solutions for each of these problem formulations. Our proposed algorithms are hierarchical global power management approaches that aim to minimize CMP power consumption (or maximize throughput) mainly by applying dynamic voltage and frequency scaling (DVFS), task assignment, and consolidation of processing cores. To tackle the inherent variation and uncertainty of manufacturing parameters and operating conditions in these problems, our solutions adopt a closed-loop feedback controller. Additionally, in one problem formulation, we focus primarily on the variations and uncertainty of the CMP optimization problem parameters and adopt an algorithm based on a partially observable Markov decision process (POMDP) that uses belief states to estimate unobservable system parameters and then stochastically minimizes overall CMP power consumption. Overall, simulations of our solutions demonstrate promising results for the CMP power/thermal optimization problem.

Chapter 1.
INTRODUCTION

One of the key challenges in today's IC design is the increase in power dissipation, which shortens the service time of battery-powered electronics, increases cooling and packaging costs, and reduces the long-term reliability of circuits due to temperature-induced accelerated aging of devices and interconnects. The increase in power consumption results in shorter battery lifetime for battery-operated portable devices such as laptops, cell phones, and PDAs. As a result, the primary objective of low-power design for battery-operated electronics is to extend the battery lifetime while meeting performance demands. Given the rather low rate of improvement in battery performance, unless power optimization techniques are applied at different levels of granularity, the capabilities of future portable systems will be strictly limited by the weight and size of the batteries required for an acceptable service duration [1]. In high-performance server systems, on the other hand, the packaging costs and reliability issues associated with high power consumption have also made low-power design a primary design objective.

One consequence of technology scaling is that integrated circuit densities and operating frequencies continue to rise. The result is that chips are becoming larger, faster, and more complex, and therefore consume ever larger amounts of dynamic power [1]. At the same time, CMOS scaling toward Ultra-Deep Submicron (UDSM) technologies requires very low threshold voltages and ultra-thin gate oxides to retain current drive and alleviate short-channel effects. The side effect of threshold voltage and oxide thickness scaling is an exponential increase in both subthreshold and gate tunneling leakage currents, which adds to the total power consumption of the chip.
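The two trends above (dynamic power growing with voltage and frequency, leakage growing exponentially as the threshold voltage shrinks) can be illustrated with a toy first-order model. This is a sketch under stated assumptions, not a calibrated device model; all constants below are arbitrary and chosen for illustration only.

```python
import math

def dynamic_power(alpha, c_eff, vdd, freq):
    """First-order switching power: P_dyn = alpha * C_eff * Vdd^2 * f."""
    return alpha * c_eff * vdd ** 2 * freq

def subthreshold_leakage(i0, vth, n=1.5, v_t=0.026):
    """Toy subthreshold leakage: I_leak ~ I0 * exp(-Vth / (n * vT)),
    where vT is the thermal voltage (~26 mV at room temperature)."""
    return i0 * math.exp(-vth / (n * v_t))

# Doubling the clock frequency doubles dynamic power at a fixed voltage.
p1 = dynamic_power(alpha=0.2, c_eff=1e-9, vdd=1.0, freq=1e9)
p2 = dynamic_power(alpha=0.2, c_eff=1e-9, vdd=1.0, freq=2e9)
assert abs(p2 / p1 - 2.0) < 1e-9

# Scaling Vth down from 400 mV to 300 mV raises leakage roughly 13x,
# illustrating the exponential sensitivity described above.
leak_hi_vth = subthreshold_leakage(i0=1e-6, vth=0.40)
leak_lo_vth = subthreshold_leakage(i0=1e-6, vth=0.30)
print(f"leakage ratio: {leak_lo_vth / leak_hi_vth:.1f}x")
```

The quadratic dependence on Vdd is what makes voltage scaling the dominant dynamic-power knob, while the exponential Vth term is why threshold scaling makes leakage a first-class design concern.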
Driven by the increasing demand for high-performance processors, Chip Multiprocessor (CMP) architectures, a.k.a. Multiprocessor System-on-Chip (MPSoC) or multicore architectures, have been introduced to enable continued performance scaling in spite of the slow-down of CMOS technology scaling. For decades, the performance of a CPU was improved by increasing its frequency at each technology node, an approach whose returns eventually diminished due to three primary factors:

i. The memory wall: the gap between processor and memory speeds kept widening, making memory latency and bandwidth the performance bottleneck.
ii. The ILP wall: the increasing difficulty of finding enough parallelism in a single instruction stream to keep a high-performance single-core processor busy.
iii. The power wall: power consumption grows rapidly with operating frequency, since higher frequencies also demand higher supply voltages.

CMPs have alleviated the power consumption problem of complex uniprocessor designs while maintaining the growing performance trend. However, in spite of the short-term relief in power consumption provided by CMPs, the ever increasing need for processing power has ensured the growing significance of power- and energy-efficient design of multicore processing platforms.

In order to continue delivering regular performance improvements for general-purpose processors, manufacturers such as Intel and AMD have turned to multicore designs. Figure 1-1 illustrates AMD's commercial six-core processor, which came to market in early 2011. Almost all of today's server-class processors from the main manufacturers are multicore [2][3].

Figure 1-1. AMD's six-core Opteron processor [2].

A large body of IC design research has been devoted to techniques for optimizing power and energy consumption at different levels of granularity.
Circuit-level methods can be grouped by whether they reduce the supply voltage and clock frequency [6][8][17][22], the capacitance [4][5], or the switching activity [79] of the circuit, utilize multiple threshold voltages [6][8], or apply power and/or ground gating [9][10]; these are the bases of the proposed solutions for reducing dynamic and leakage power consumption. At the chip level, the problems of low-power CMP design and CMP power optimization have also been studied in the literature. Prior studies propose power management techniques, both heuristic [52][56][57][60] and stochastic [101], that mainly utilize chip-level techniques [52][36], task scheduling heuristics [57][60][61][62], or local core-level responses such as Dynamic Voltage and Frequency Scaling (DVFS) and fetch throttling [52][58][59] to perform dynamic power/thermal management in homogeneous [51]-[54] or heterogeneous multicore architectures [56][57]. (A homogeneous multicore is composed of identical processing cores, while in a heterogeneous CMP the processing cores are not identical.)

On the other hand, a side effect of technology scaling is that the critical dimension has become so small that the atomicity of physical features and dopant levels has become significant [34]. This results in large variations in the physical and electrical characteristics of interconnects and transistors, which in turn affect the performance and power consumption of the circuit. Parameter variation manifests itself in the distributions of process tolerance; it also appears as voltage- and temperature-induced tolerance arising from the operating environment. Variability can be temporal or spatial in nature. Aging-induced variation arising from wear-out mechanisms has a negative impact on performance. Negative-bias temperature instability (NBTI) affecting p-FETs and hot-electron effects affecting n-FETs both elevate device thresholds, degrading device and circuit performance [85].
Electro-migration (EM) [86] slowly erodes interconnect admittance, becoming more severe below 65 nm because of higher interconnect current densities. The term spatial variation refers to lateral and vertical differences from intended polygon dimensions and film thicknesses [87]. Intrinsic (random) variations are caused by atomic-level differences between devices that occur even when the devices have identical layout geometry and neighboring structures. These stochastic differences appear in dopant profiles, film thickness variation, and line-edge roughness. Extrinsic (systematic) variation is due to unintentional shifts in contemporary process conditions. It is typically not associated with atomistic problems, but rather with the operating dynamics of a fabrication line [84].

With increasing levels of variability in the characteristics of nanoscale CMOS devices and VLSI interconnects, achieving power efficiency and high performance in electronic systems (including MPSoCs) under PVT variations, aging effects, and uncertain operating conditions has become a daunting, yet vital, task; this variability and uncertainty undermines the effectiveness of traditional power management approaches.

Given the importance of low-power design, this dissertation focuses on developing power optimization techniques at the circuit level and chip level in CMOS VLSI circuits, while considering variations in the fabrication process or operating conditions of such circuits and systems. We first present our circuit-level techniques for low-power design: more precisely, the design of a power-delay optimal pipeline by means of voltage scaling and appropriate flip-flop design. We mathematically formulate and solve this problem in both deterministic and probabilistic frameworks, based on the idea of utilizing soft-edge flip-flops (SEFF) for slack passing and decreasing the error rate in the pipeline stages.
Next, at the chip level, we address the problems of minimizing the total power consumption of a CMP while maintaining a CMP-level average throughput target for the tasks running on it, as well as maximizing CMP performance (throughput) under a power consumption budget and a thermal constraint.

1.1 Overview of Low Power Pipeline Design

The pipelined datapath of a modern processor is a major contributor to the processor's power consumption and, consequently, one of the main sources of heat generation on the chip [1]. Many techniques have been proposed to reduce the power consumption of a microprocessor's pipeline, such as pipeline gating [1], clock gating [15], and voltage scaling [6]. In this dissertation we present the problem of power-delay optimal design of a synchronous linear pipeline by means of voltage scaling and appropriate flip-flop design. We propose mathematical solutions to this problem in both deterministic and probabilistic frameworks. Our technique is based on the idea of utilizing soft-edge flip-flops (SEFF) for slack passing and decreasing the error rate in the pipeline stages. A linear pipeline composed of soft-edge flip-flops is called a soft pipeline. In this work, we describe a unified methodology for optimally selecting the transparency windows of the SEFFs in a linear pipeline so as to achieve the minimum power-delay product by means of opportunistic time borrowing. We also formulate the same problem for the scenario where stage delays are random variables, and find the solution with minimum power-delay product while ensuring that the probability of timing violations due to the increased operating frequency of the pipeline remains below a threshold. Traditionally, process variations have been modeled by considering the worst-case process corners in order to evaluate the performance of the design.
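As a toy illustration of the difference between worst-case and statistical treatment of stage delays, the following Monte Carlo sketch compares a 3-sigma worst-case clock period against a shorter, statistically chosen one whose timing-violation probability stays below a small threshold. The stage delay distributions and all numbers are hypothetical, chosen only to make the contrast concrete.

```python
import random

random.seed(0)

# Hypothetical 4-stage pipeline: (mean, sigma) of each stage delay in ns.
stages = [(1.0, 0.05), (0.9, 0.08), (1.1, 0.06), (0.95, 0.07)]

# Worst-case (3-sigma corner) clock period: every stage must fit.
t_worst = max(mu + 3 * sigma for mu, sigma in stages)   # 1.28 ns here

def violation_prob(t_clk, trials=100_000):
    """Monte Carlo estimate of P(some stage delay exceeds t_clk),
    treating stage delays as independent Gaussians."""
    fails = 0
    for _ in range(trials):
        if any(random.gauss(mu, sigma) > t_clk for mu, sigma in stages):
            fails += 1
    return fails / trials

# A statistically chosen, shorter clock can still keep the timing-error
# probability below a small threshold (here 1%).
t_stat = 1.25  # ns, a hypothetical choice below t_worst
assert t_stat < t_worst
assert violation_prob(t_stat) < 0.01
```

The guard-band between t_stat and t_worst is exactly the "untapped silicon performance" that statistical design recovers, at the cost of a small, bounded error probability.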
Nevertheless, designing at the worst-case process corner leads to excessive guard-banding, which wastes die resources and leaves silicon performance untapped; therefore, in recent years much research has been conducted on statistical modeling of variations [39][40][41][43][44]. We employ a statistical model of variations in this work. An elaborate study of prior related work is presented in Section 2.2.

1.2 Overview of Power Management in CMPs

As mentioned before, power dissipation and die temperature have become the main design concerns and key performance limiters in today's high-performance multicore processors. Dynamic Power Management (DPM) solutions have been proposed to manage resources in a CMP based on the power, performance, and temperature of the processor cores, in order to optimize performance, power, or both. The various approaches can be broadly classified into performance-constrained power minimization [103][104] and power-constrained performance maximization [51][105]. Typically, the problem formulations target performance optimization under a power/energy budget [51][53] or a thermal constraint [58][63][64][65], or attempt to minimize total power consumption [54][66] or energy per throughput [56][63] subject to a total throughput constraint. Dynamic power management techniques can be classified into heuristics [52][56][57][60] and stochastic approaches [101]. Heuristic approaches are usually simple and easy to implement, but they do not provide any power/performance assurances; stochastic approaches guarantee optimality under performance constraints, although they are more complex to implement.
Both heuristic algorithms and stochastic techniques utilize core-level local responses such as DVFS and fetch throttling [52][58][59], system-level core turn-on/off policies [52][54], or global task scheduling heuristics [57][60][61][62], in either homogeneous [51]-[54] or heterogeneous multicore architectures [56][57]. Li and Martinez [54] optimize a parallel workload running on a CMP by dynamically changing the number of active processors and the chip-wide DVFS setting, which reduces the flexibility and impact of the optimization. The authors of [107] suggest a linear-programming-based algorithm for application scheduling in a large CMP with process variation. The authors of [55] use the compiler to assign different DVFS settings to different processors depending on the workload. Feedback control theory is a powerful tool for dealing with variability in CMPs; Proportional-Integral-Derivative (PID) controllers have been widely used to dynamically adjust processor voltage or frequency to control system latency [113], display buffer occupancy [114], or inter-processor queue occupancy levels in MPSoCs [119].

In this dissertation, we address the problem of power management in a CMP under a total average throughput constraint, as well as its dual problem. The minimum-power solution is achieved by applying DVFS (Dynamic Voltage and Frequency Scaling), core consolidation, and task assignment in a hierarchical global power manager comprised of three tiers. The top-tier PM unit performs coarse-grain DVFS, and the low-tier PM employs a closed-loop feedback controller implementing DVFS at the core level. Toward this end, we propose a feedback control solution for accurate power management in CMPs. One major advantage of our closed-loop control logic is that it provides robustness in the presence of process and workload variations, compared to open-loop control or ad hoc heuristics.
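A minimal sketch of such a closed-loop core-level controller is shown below: a discrete PI loop nudges a core's frequency command until measured throughput matches the target, even though the core runs slower than nominal. The gains, the 0.8x slowdown, and the linear throughput-vs-frequency plant are all hypothetical assumptions for illustration, not the controller designed in Chapter 3.

```python
def pi_frequency_controller(target_tput, kp=0.4, ki=0.2, steps=100):
    """Velocity-form PI loop: adjust a core's (normalized) frequency
    command so measured throughput tracks target_tput despite an
    unknown, variation-induced slowdown of the core."""
    freq = 1.0           # normalized frequency command
    prev_error = 0.0
    measured = 0.0
    speed_factor = 0.8   # hypothetical: process variation slows this core
    for _ in range(steps):
        measured = speed_factor * freq          # toy plant: tput ∝ freq
        error = target_tput - measured
        freq += kp * (error - prev_error) + ki * error
        prev_error = error
    return freq, measured

freq, tput = pi_frequency_controller(target_tput=0.9)
assert abs(tput - 0.9) < 0.01   # throughput converges to the target
assert freq > 1.0               # the loop compensated for the slow core
```

Note that the controller never needs to know speed_factor: the feedback loop discovers and cancels it, which is exactly the robustness argument made above.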
Process variation may result in variability in the nominal clock frequency of a core, and hence the actual frequency may vary from core to core. In this case, a closed-loop feedback controller adjusts the frequency to a higher or lower value such that the core meets the throughput calculated based on the nominal frequency. At the same time, the feedback loop solves the problem of the inherent uncertainty in task characteristics, which causes dynamic decisions made based on these uncertain values to be suboptimal (statistical solutions can be suggested, but they lack the flexibility of dynamic approaches to adapt to the present working conditions). In brief, the benefit of employing a feedback control loop is that it starts with an approximate solution that reflects erroneous and uncertain estimates of parameters (such as task characteristics, arrival times, and actual core frequency) and adaptively updates the controlled variable, i.e., the frequency of the cores, to eliminate the effect of uncertainty and variability in those parameters and satisfy the requirements of the power manager. Thus, the power manager perceives the underlying system, i.e., the individual cores and the tasks running on them, as an ideal system with precisely known parameters.

1.2.1 DVFS and Power Controllers in Processors and Operating Systems

State-of-the-art processor chips often have multiple voltage and frequency levels, which are referred to as Performance States (P-states). The P-states provide the flexibility of running at maximum performance when needed, and of switching to low-power P-states to save power and energy when performance is not critical. Historically, the first general-purpose CMP to support a form of core-level DVFS was the AMD quad-core Opteron [109]. In this chip, the frequency of each core can be set independently, although all cores share the same voltage.
Currently, multiple on-chip voltages are provided by off-chip voltage regulators, which are bulky and costly. Perhaps the most sophisticated design is Intel's Foxton technology in the Itanium II processor [110]. It is a control system that maximizes performance while staying within target power and temperature. It consists of power and temperature sensors and a small on-chip hardware controller. If the power consumption is less than the target, the controller increases the core voltage, and the frequency follows. The opposite occurs if the power is over the target. Both cores in the Itanium II have the same voltage and frequency. Later, Intel introduced the SpeedStep and Enhanced Intel SpeedStep (EIST) technologies in its CPUs, which perform dynamic frequency scaling to meet the instantaneous performance needs of the operation being performed, while minimizing power draw and heat dissipation. Similarly, AMD has introduced the CPU speed throttling and power saving technologies Cool'n'Quiet for desktop and server chips, and PowerNow for mobile chips. These work by reducing the processor's voltage and clock frequency when the processor is idle, to reduce overall power consumption and lower heat generation, allowing for slower (and thus quieter) cooling fan operation. Major operating systems, including Linux, Windows, and MacOS, support these features of Intel and AMD in their kernels. For instance, Windows provides multiple power/performance configurations that utilize SpeedStep: the "Home/Office Desk" scheme disables SpeedStep, the "Portable/Laptop" power scheme enables SpeedStep, and the "Max Battery" scheme uses SpeedStep to slow the processor to minimal power levels as the battery weakens. A problem with existing operating systems is that the power management unit is decoupled from the resource manager. The thread dispatcher, the kernel subsystem responsible for scheduling threads on cores, has no notion of core power/performance states.
At the same time, the power management subsystem polls for idle cores to power-manage. Having these two subsystems decoupled leads to situations where they undermine each other's efforts: threads are inadvertently run on clocked-down cores, degrading performance, or utilization across the system remains light but is distributed to the point where no core is inactive enough to be power-managed.

1.3 Overview of Power and Thermal Management in CMPs

Besides CMP power consumption, die temperature is another important factor that limits the performance of a CMP. A Dynamic Thermal Management (DTM) unit manages resources in a CMP based on the measured power dissipation, performance, and die temperature of processing cores, in order to control the chip's operating temperature. Many architectural extensions have been proposed to reduce the impact of hot spots and/or to prevent the die from reaching a critical temperature by reducing power density. Such techniques include fetch toggling, decode throttling, and frequency and/or voltage scaling [51][65]. Reference [99] lists a number of such techniques and compares their application in thermal management of a processor. Heo et al. [97] and Stavrou and Trancoso [98] minimize power density or temperature hot spots by judiciously scheduling jobs or migrating them from core to core. Authors of [100] study several effective methods for DTM, such as temperature-tracking frequency scaling, migrating computation to spare hardware units, and a "hybrid" policy that combines fetch throttling with dynamic voltage scaling. Authors of [108] mathematically formulate the problem of speed scaling in multiprocessors under a thermal constraint, and show that it is a convex optimization problem. They model the dissipated power of a processor as a positive and strictly increasing convex function of the speed, namely a cubic function.
Their suggested approach is an optimal mathematical solution to the convex problem formulation. By Dynamic Power and Thermal Management (PTM), we refer to a range of possible hardware and software techniques that work dynamically, at run time, to simultaneously control CMP power consumption and individual cores' die temperatures. These parameters are closely correlated, yet independent: for instance, a CMP's power may be within the desired range while a core creates a hot spot due to high activity. In this dissertation, we present a mathematically rigorous and robust algorithm for power and thermal management of CMPs subject to variability and uncertainty in system parameters. We first model and formulate the problem of maximizing the throughput of a CMP subject to a power budget and a die temperature bound. Next we present our solution framework, called Variation-aware Power/Thermal Manager (VPTM), which is a hierarchical dynamic power and thermal management solution targeting heterogeneous CMP architectures. VPTM utilizes dynamic voltage and frequency scaling (DVFS) and core consolidation techniques to control the core power consumptions, which implicitly regulates the core temperatures. An efficient algorithm is presented for core consolidation and task assignment, and a convex program is formulated and solved to produce optimal DVFS settings. Finally, a feedback controller is employed to compensate for variations in key system parameters at runtime.

1.4 Overview of Variability in Digital Circuits and Systems

With the increasing levels of variability in the characteristics of nanoscale CMOS devices and VLSI interconnects, and continued uncertainty in the operating conditions of VLSI circuits, achieving power efficiency and high performance in electronic systems under process, voltage, and temperature (PVT) variations as well as current stress, device aging, and interconnect wear-out phenomena has become a daunting, yet vital, task.
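The flavor of convex program that frameworks like VPTM solve for DVFS settings can be illustrated with a toy model: maximize total frequency subject to a chip power budget, assuming (as in the cubic model of [108]) per-core power c_i·f_i³. The sketch below bisects the Lagrange multiplier of the KKT conditions; the core coefficients, budget, and frequency range are made-up normalized values, and this is not the dissertation's actual formulation:

```python
def dvfs_allocate(c, p_budget, f_min=0.5, f_max=3.0, iters=60):
    """Maximize sum(f_i) s.t. sum(c_i * f_i**3) <= p_budget and
    f_min <= f_i <= f_max, via bisection on the KKT multiplier lam,
    for which the unclamped optimum is f_i = sqrt(1 / (3*lam*c_i))."""
    def freqs(lam):
        return [min(f_max, max(f_min, (1.0 / (3.0 * lam * ci)) ** 0.5))
                for ci in c]

    def power(fs):
        return sum(ci * f ** 3 for ci, f in zip(c, fs))

    lo, hi = 1e-9, 1e9            # bracket for the multiplier
    for _ in range(iters):
        lam = (lo * hi) ** 0.5    # geometric bisection (lam spans decades)
        if power(freqs(lam)) > p_budget:
            lo = lam              # over budget -> raise lam (lower freqs)
        else:
            hi = lam
    return freqs(hi)              # hi side is always feasible

# Three cores; larger c_i means the core burns more power per GHz**3.
settings = dvfs_allocate([1.0, 2.0, 4.0], p_budget=10.0)
```

The solution gives higher frequencies to power-efficient cores (f_i proportional to 1/sqrt(c_i) when no clamp binds) and spends essentially the whole budget, which is the qualitative behavior one expects from the convex DVFS step.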
Increasing attention has been given to the problem of reducing variability in circuit design parameters. The work presented in [89] studies the impact of leakage reduction techniques on delay uncertainty. Emphasizing that leakage is critically dependent on the operating temperature and power supply, the authors of [91] present a full-chip leakage estimation technique which accurately accounts for power supply and temperature variations. It is only recently that attention has turned to the effects of variability on optimization processes and tradeoffs higher up in the design abstraction hierarchy [92]-[94]. None of these works has considered the effect of variability and uncertainty sources on system-level decision making for improved energy efficiency.

Like any nanoscale CMOS VLSI circuit, chip multiprocessors (CMPs) are manufactured in technologies subject to process variations and are operated under widely varying conditions over the lifetime of the system. Such systems are greatly affected by increasing levels of process variations, typically materializing as intrinsic (random) or systematic sources of variability and aging effects in device and interconnect characteristics, and by widely varying workloads, usually appearing as a source of uncertainty. Variations have a randomizing effect on the performance and power dissipation of a particular processor chip. At the same time, measurements made about the state of the processor and predictions about its future state tend to be imperfect, which gives rise to uncertainty about the system state. At the chip level, this variability and uncertainty is beginning to undermine the effectiveness of traditional power management approaches. As technology scales, the dimensions of individual cores become smaller, and the spatially correlated intra-die process variations result in core-to-core (C2C) power and performance variations.
Increasing process variation and device and interconnect aging effects create an urgent need to make power management variation-aware, avoiding the costly over-provisioning of worst-case-based methods. The problem of optimizing system performance (throughput) and global power consumption of a CMP with thermal considerations, in a framework subject to different sources of variation, is important in high-end servers and hosting datacenters. Process variation, particularly inter-die and intra-die variation, has become critical in CMP systems, causing nominally identical cores to exhibit different performance and power behavior, such as different maximum operating frequencies or standby leakage currents. Parameter variation manifests itself in the distributions of process tolerance; it also appears as voltage- and temperature-induced tolerance arising from the operating environment. The work presented in [90] suggests a comprehensive timing error model for microarchitectural structures, modeling the error rate in logic structures, SRAM structures, and combinations of both, and considering both systematic and random variation. Authors of [95] examine process variation in a CMP and point out the core-to-core variation in frequency. They estimate the maximum difference in core frequencies to be approximately 20%. They suggest Adaptive Body Bias (ABB) and Adaptive Supply Voltage (ASV) to reduce some of this variation, at the cost of increasing power variation. Donald and Martonosi [68] also examine process variation in a CMP and focus on the core-to-core variation in power. They suggest turning off cores that consume power in excess of a certain computed value, with the goal of maximizing the chip-wide performance/power ratio. Reference [96] studies core-to-core variations in frequency island (FI) CMPs, presents an analytical model for the throughput of such CMPs, and uses it to quantify the performance benefits of the FI design style.
It is demonstrated that per-core frequency control yields the greatest performance improvements for CMPs consisting of many small cores. Authors of [107] suggest variation-aware algorithms for application scheduling and power management. One such power management algorithm, called LinOpt, uses linear programming to maximize throughput at a given core power budget through voltage and frequency scaling in a 20-core CMP. However, this work does not consider the temperature constraint, the leakage dependence on temperature, or core consolidation to save power and reduce overheating of cores, all of which play a significant role in determining energy-efficient operation.

1.5 Dissertation Contributions

In this dissertation, we propose low-power solutions for digital VLSI systems at the circuit level and the chip level. At the circuit level, we propose a power optimization technique based on time borrowing for pipeline circuits, and at the chip level, we target designing efficient dynamic power and thermal management solutions for chip multiprocessors. Our circuit-level power optimization method presents the formulation and solution to minimize the power-delay product metric in a linear pipeline, in both deterministic and probabilistic frameworks. The key idea is utilizing soft-edge flip-flops (SEFF) to perform time borrowing between consecutive stages of the pipeline. We describe a unified methodology for optimally designing SEFFs and selecting the optimum operating voltage and frequency of a linear pipeline so as to achieve the minimum power-delay product. We formulate this problem for scenarios where stage delays are assumed to be worst-case values as well as random variables.
Our method minimizes the power-delay product while ensuring that the probability of timing violations due to the increased operating frequency of the pipeline is lower than a threshold. Also, by over-clocking the pipeline, allowing timing violations to occur, and then recovering from the errors, our proposed ESOSP algorithm exploits the trade-off between performance and power saving to further minimize the expected power-delay product of a pipeline.

Due to the ever-increasing importance of power in the design of multi-core processors, we dedicate the rest of this dissertation to designing power/energy-efficient dynamic power and thermal management strategies for CMPs. First, we propose a dynamic power management solution to minimize the power consumption of a CMP under an average throughput constraint and subject to process and workload variations. In particular, we introduce a hierarchical global power manager comprised of three tiers: core consolidation and coarse-grain DVFS at the top tier, task assignment to available cores considering server and task affinities at the mid-tier, and closed-loop feedback-based per-core DVFS at the low tier. Next, we introduce a mathematically rigorous and robust algorithm for power and thermal management of CMPs subject to variability and uncertainty in system parameters. We first model and formulate the problem of maximizing the throughput of a CMP subject to a power budget and a die temperature bound. Next we present our solution framework, called Variation-aware Power/Thermal Manager (VPTM), which is a hierarchical dynamic power and thermal management solution targeting homogeneous and heterogeneous CMP architectures. VPTM utilizes DVFS and core consolidation as well as parallel feedback controllers to manage the core power consumptions, which implicitly regulates the core temperatures.
An efficient algorithm is presented for core consolidation and task assignment, a convex program is formulated and solved to produce optimal DVFS settings, and a feedback controller is employed to compensate for variations in key system parameters at runtime.

Finally, we target the problem of chip-level dynamic power management (DPM) in chip multiprocessors, with an emphasis on process variations and aging effects in device and interconnect characteristics, and widely varying workloads usually appearing as a source of uncertainty. Variations have a randomizing effect on the performance and power dissipation of a particular processor chip. At the system level, this variability and uncertainty is beginning to undermine the effectiveness of traditional DPM approaches. We propose a stochastic power management technique that addresses the problem of variation-aware power optimization in CMPs using a Partially Observable Markov Decision Process (POMDP). Our proposed power manager interacts with an uncertain environment, with statistically changing state variables and immediate cost function, and tries to minimize the discounted cost in the limit by choosing appropriate actions.

1.6 Outline of this Dissertation

In this chapter, we introduced the problems this dissertation addresses and a summary of our contributions. The remainder of this dissertation is organized as follows. In Chapter 2, we present our work on the problem of power-delay-optimal pipeline design considering delay variations by means of voltage scaling and time borrowing using SEFF. In Chapter 3, we address the problem of minimizing the total power consumption of a CMP while maintaining a CMP-level average throughput target for tasks running on the CMP. The minimum-power solution is achieved by applying DVFS, core consolidation, and task assignment in a hierarchical global power manager comprised of three tiers.
Chapter 4 explains our approach to solving the problem of performance maximization in a CMP under power budget and thermal constraints, using a hierarchical power management unit that utilizes DVFS, core consolidation, and optimum task re-assignment. In Chapter 5, we present a stochastic power management technique that addresses the problem of variation-aware power optimization in CMPs using a Partially Observable Markov Decision Process (POMDP). Chapter 6 concludes this dissertation, summarizing the main contributions of the completed projects, followed by technical approaches that we suggest as future directions.

Chapter 2. PIPELINE POWER-DELAY OPTIMIZATION BY OPPORTUNISTIC TIME BORROWING

2.1 Introduction

With the increase in demand for battery-operated personal computing devices and wireless communication equipment, the need for power-efficient design has increased. In addition, rising levels of power dissipation and the resulting thermal problems have become key limiting factors to processor performance. Due to its high utilization, the pipelined datapath of a modern processor is a major contributor to the processor's power consumption, and hence one of the main sources of heat generation on the chip [13]. Many techniques have been proposed to reduce the power consumption of a microprocessor's pipeline, such as pipeline gating [13], clock gating [14], and voltage scaling [11]. In this chapter, we present the problem of power-delay-optimal pipeline design in a synchronous linear pipeline by means of voltage scaling and time borrowing through redesigning the flip-flops. We propose mathematical solutions to this problem in deterministic and probabilistic frameworks. Our technique is based on the idea of utilizing soft-edge flip-flops (SEFF) for slack passing and decreasing the error rate in pipeline stages. A linear pipeline composed of SEFFs is called a soft pipeline.
Soft-edge flip-flops have a small transparency window which allows time borrowing across pipeline stages. SEFFs have been used for minimizing the effect of clock skew on circuit performance [17][18] and for minimizing the effect of process variation on parametric yield [19]. In this work, SEFF is utilized to compensate for unbalanced pipeline stage delays by means of time borrowing. Such imbalance of path delays across pipeline stages is very common in pipelined circuits [20]. In this chapter, we describe a unified methodology for optimally selecting the transparency windows of SEFFs in a linear pipeline so as to achieve the minimum power-delay product for the pipeline by means of opportunistic time borrowing and voltage scaling. We take on three power-delay optimization problems, as explained next. In the first problem formulation, timing violations are avoided by respecting the worst-case path delays (calculated as deterministic values by static timing analysis) for every stage in the pipeline. Next, we formulate the same problem for a scenario where stage delays are assumed to be random variables, and find the solution with the minimum power-delay product while ensuring that the probability of timing violations in the pipeline is lower than a threshold. Thirdly, we allow timing violations to take place while implementing a mechanism to detect and fix the errors, accounting for the power and delay penalties of error correction. The remainder of this chapter is organized as follows. In section 2.2, we provide a brief review of prior related work. Section 2.3 provides background on timing constraints in a linear pipeline and on power and delay models. Section 2.4 presents SEFFs: their circuit design, characteristics, and models. A general description of the power-delay product optimization problem is presented in section 2.5, while our formulations and proposed solutions are explained in sections 2.6, 2.7, 2.8 and 2.9 for various scenarios.
Section 2.10 is dedicated to the experimental results, and section 2.11 concludes and summarizes the chapter.

2.2 Prior Work

Soft-edge flip-flops: Soft-edge flip-flops have been used for minimizing the effect of clock skew on static and dynamic circuits [18]. Recently, the authors of [19] proposed an interesting approach that utilizes SEFFs in sequential circuits in order to minimize the effect of process variation on yield. They formulated the problem of statistically aware SEFF assignment, which maximizes the gain in timing yield, as an integer linear program (ILP) and proposed a heuristic algorithm to solve it. SEFF has also been utilized to reduce a combinational circuit's Soft Error Rate (SER) [26] by leveraging the temporal masking effect introduced by the SEFF's transparency window; this is more delay- and power-efficient than circuit-redundancy-based techniques [26].

Time borrowing: Authors of [27] proposed an architectural framework, called ReCycle, which adopts clock-skew-based time borrowing to compensate for process variation in a pipeline's latching elements. It solves a linear program to determine the optimum clock skews of pipeline stages that improve the maximum attainable frequency, enabling the pipeline to tolerate process variation after fabrication. In a recent work [28], the authors optimized pipeline clock frequency by replacing flip-flops with pulsed latches to enable time borrowing, as well as by skewing the clock. Introducing clock skew to an edge-triggered flip-flop has an effect similar to circuit retiming in VLSI timing optimization, i.e., movement of flip-flops across combinational logic module boundaries [29]. Although it achieves time borrowing as SEFF does, it requires modification to the standard tools, and it is a static solution that cannot account for circuit variability and other sources of uncertainty in the environment or input.
It has been shown to be ineffective for addressing process variation and circuit imbalance [19]. Moreover, a SEFF can pass data at any time during its transparency window, while an FF with a skewed clock passes data only at the shifted clock edge. Adjusting the clock for each individual flip-flop lifts this limitation, but at the cost of a complex design effort.

Integrated error handling mechanisms: The Razor flip-flop design introduced in [16] obtains a significant power reduction by adopting a smart opportunistic voltage scaling scheme that tunes the supply voltage based on timing errors detected in the pipeline. It equips a pipeline with a delay error detection capability as well as an error correction mechanism. In a later work, the authors of [30] proposed two local tuning mechanisms in the context of Razor dynamic voltage scaling: per-stage voltage control and per-stage clock skew adjustment. Its drawbacks are that providing separate voltage supplies for each pipeline stage is rather complex in a physical implementation, plus the disadvantages of the clock skewing technique mentioned earlier. In a recent work, the Razor architecture has been revisited, and Razor II has been proposed, which provides both low-power operation and SER tolerance [31]. Its power saving is achieved by performing only error detection in the FF, while correction is performed through architectural replay; this also allows a significant reduction in the complexity and size of the FF. Our work efficiently combines the power-saving integrated error handling mechanism of Razor with the performance-enhancing time borrowing technique. Similar to Razor, the MicroFix architecture [32] takes delay errors as the indicator that a DVFS action is required, and handles errors in a prediction-based manner [32].
2.3 Background

A linear pipeline is a pipeline with the following properties: (i) processing stages are linearly connected, with no feedback loops; (ii) it performs a fixed function; and (iii) stages are separated by flip-flops which are clocked with the same clk signal. Figure 2-1 demonstrates a sample linear pipeline. We call the set of flip-flops that separate consecutive pipeline stages a FF-set; e.g., FF_0 ... FF_2 in Figure 2-1 are FF-sets.

Figure 2-1. A simple linear pipeline.

Clearly, the delays of the combinational circuit and interconnect depend on the supply voltage of the pipeline (see eq. (7) and (8)), as do the timing characteristics of the flip-flops, such as setup time, hold time, and clock-to-Q delay (and D-to-Q delay; see section 2.4). (Throughout this work, the interconnect delay is folded into the combinational logic delay; wherever we refer to combinational delay, it includes the interconnect delay.) Let us assume the pipeline is operating under voltage level v_j (any variable with subscript j in the following equations denotes its value under supply voltage j). To guarantee the correct operation of the pipeline, the following timing constraints must be satisfied in all stages of the pipeline:

t_cq,i-1,j + d_ij + t_s,ij ≤ T_clk,j,    ∀i: 1 ≤ i ≤ N    (1)

t_cq,i-1,j + δ_ij ≥ t_h,ij,    ∀i: 1 ≤ i ≤ N    (2)

where d_i and δ_i denote the maximum and minimum delays of the combinational logic in stage i, T_clk denotes the clock cycle time, t_s,i and t_h,i are the setup and hold times of flip-flops in the i-th FF-set, t_cq,i-1 denotes the clock-to-Q delay of flip-flops in the (i-1)-st FF-set, and N denotes the number of pipeline stages. Inequality (1) gives the constraint set on the maximum delays of the combinational logic and the flip-flop timing characteristics to prevent setup time violations. Conversely, inequality (2) specifies the constraint set on the minimum delay of pipeline stages in order to prevent short-path data race hazards.
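Constraints (1) and (2) can be checked mechanically for a given pipeline. The small helper below is an illustrative sketch (the dictionary keys, units, and example numbers are assumptions, not from the chapter):

```python
def pipeline_timing_ok(T_clk, stages):
    """Check the setup constraint (1) and hold constraint (2) per stage.
    Each entry of `stages` holds: d (max logic delay), delta (min logic
    delay), t_cq (clock-to-Q of the launching FF-set), and t_s / t_h
    (setup/hold of the capturing FF-set). All values in ns."""
    for s in stages:
        setup_ok = s["t_cq"] + s["d"] + s["t_s"] <= T_clk   # eq. (1)
        hold_ok = s["t_cq"] + s["delta"] >= s["t_h"]        # eq. (2)
        if not (setup_ok and hold_ok):
            return False
    return True

# One stage with 0.89 ns of launch-to-capture requirement: it closes
# timing at a 1.0 ns clock but fails setup when clocked at 0.85 ns.
stage = {"d": 0.80, "delta": 0.10, "t_cq": 0.05, "t_s": 0.04, "t_h": 0.03}
assert pipeline_timing_ok(1.00, [stage])
assert not pipeline_timing_ok(0.85, [stage])
```

The deterministic worst-case formulation of section 2.6 effectively searches for the smallest T_clk (and voltage) for which this kind of check still passes on every stage.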
Notice that to account for the effect of clock skew, t_skew, we can simply add t_skew to the left side of inequality (1) and subtract it from the left side of inequality (2).

2.3.1 Timing Constraints in a Linear Pipeline under Delay Variations

As technology scales, process, voltage, and temperature (PVT) variations are becoming critical design concerns due to their effect on logic and interconnect delay [84]. Process variations such as random dopant fluctuations and gate-oxide thickness variations modulate MOSFET characteristics and parasitic components, causing variation in the switching delays of identical gates [33][34]. The random maximum and minimum stage delays are described by probability distribution functions (PDF) and cumulative distribution functions (CDF) with corresponding mean, μ, and standard deviation, σ. In some works, e.g. [36][37], this distribution has been assumed to be Gaussian (Normal) [35]. However, precise statistical timing analysis schemes have proposed non-Gaussian distribution models due to the nonlinearity of max/min operations on the delays of gates and paths and their correlations [38][39][40]. In order to account for the random variations (Gaussian or non-Gaussian) of the path delays in equations (1)-(2), one should express the probability of violating the setup or hold conditions as a function of the delay variations. The probability of satisfying the setup time constraint in pipeline stage i with voltage v_j for a given cycle time T_clk,j, denoted by p_setup,ij, can be written as the probability of the maximum delay of the combinational logic in that stage, d_ij, being less than the available time:

p_setup,ij = Pr[ d_ij ≤ T_clk,j − t_cq,i-1,j − t_s,ij ] = F^d_ij(T_clk,j − t_cq,i-1,j − t_s,ij)    (3)

where F^d_ij denotes the CDF of the maximum delay of pipeline stage i under voltage setting j. The probability of a setup time constraint violation in pipeline stage i is thus calculated as:
q_setup,ij = Pr[ d_ij > T_clk,j − t_cq,i-1,j − t_s,ij ] = 1 − F^d_ij(T_clk,j − t_cq,i-1,j − t_s,ij) = 1 − p_setup,ij    (4)

Similarly, given the CDF of the minimum delay of stage i under voltage setting j, F^δ_ij, the probability q_hold,ij of violating the hold time constraint of stage i may be calculated as:

q_hold,ij = Pr[ δ_ij < t_h,ij − t_cq,i-1,j ] = F^δ_ij(t_h,ij − t_cq,i-1,j)    (5)

Note that we ignore the effect of variability on flip-flop timing characteristics and focus only on the effect of variability on the combinational logic delays. To a first order, the clock-to-Q and setup times of the input and output flip-flops are much smaller than the maximum delay of the combinational logic, and hence we can ignore variations of the flip-flop characteristics compared to those of the logic. This is, however, not true with respect to the hold time and the minimum delay of the logic. Therefore, we insert an adequate number of delay elements (see section 2.3.4) to eliminate hold time violations for the minimum value of the hold time of the flip-flops. The CDFs of the maximum and minimum delays of stage i under voltage setting j (denoted by F^d_ij and F^δ_ij, respectively) can be in the form of any distribution function. These functions are provided by extensive statistical timing analysis of the circuit [41], which is performed prior to our proposed algorithms. Let μ_d,ij and μ_δ,ij denote the mean values of the maximum and minimum delays of the i-th logic stage under the j-th voltage setting, respectively, while σ_d,ij and σ_δ,ij are the standard deviations of the corresponding delay distributions.

2.3.2 Pipeline Delay Model

The average pipeline delay, denoted by D, is defined as the inverse of the pipeline's effective throughput. We assume that the pipeline can process at most one data/instruction unit per cycle if it does not encounter timing violations; hence, D ≥ T_clk. This delay probabilistically accounts for the stalling overhead of correcting potential setup time problems in an over-clocked pipeline.
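Given μ_d,ij and σ_d,ij from statistical timing analysis, the setup-violation probability of eq. (4) can be evaluated numerically. The sketch below assumes, purely for illustration, a Gaussian max-delay distribution (the section notes that real post-max distributions are generally non-Gaussian); all numbers are made up:

```python
import math

def q_setup(mu_d, sigma_d, T_clk, t_cq, t_s):
    """Setup-violation probability per eq. (4), under an (illustrative)
    Gaussian max-delay model: q = 1 - Phi((T_clk - t_cq - t_s - mu_d)/sigma_d)."""
    available = T_clk - t_cq - t_s            # time the logic may consume
    z = (available - mu_d) / sigma_d
    phi = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # standard Normal CDF
    return 1.0 - phi

# A stage whose mean max delay leaves three sigmas of slack before the
# capture deadline: violations are rare (~0.13%).
q = q_setup(mu_d=0.70, sigma_d=0.05, T_clk=0.94, t_cq=0.05, t_s=0.04)
assert q < 0.002
```

Shrinking T_clk moves the deadline toward the mean and the violation probability grows smoothly, which is exactly the quantity the probabilistic formulation constrains to stay below a threshold.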
D = T_clk × (clock cycle count) / (number of valid output data)    (6)

In a pipeline that processes each data item in one cycle, the average delay is equal to the clock period, T_clk (which is determined by the slowest pipeline stage; see equation (1)). However, if the pipeline stalls or gets flushed for any reason, the average processing time per data/instruction increases. In other words, the delay is not simply the inverse of the clock frequency; rather, it also probabilistically accounts for the overhead of correcting potential setup time problems in an over-clocked pipeline.

2.3.3 Combinational Logic Block Modeling

The maximum and minimum delays of a combinational logic block change as follows with voltage scaling (according to the alpha-power law [8]):

d_ij(v_j) = B_j d_i0 (v_j / V_0) ((V_0 − V_t) / (v_j − V_t))^α    (7)

δ_ij(v_j) = B_j δ_i0 (v_j / V_0) ((V_0 − V_t) / (v_j − V_t))^α    (8)

where α is a technology parameter, around 2 for long-channel devices and 1.3 for short-channel devices, V_t denotes the magnitude of the threshold voltage of the transistors, and d_i0 and δ_i0 are the stage delays at the nominal supply voltage V_0. Coefficient B_j captures the effect of temperature increase (due to power consumption) on delay, and is defined as:

B_j = 1 + (∂d/∂θ)|_vj · Δθ(v_j)    (9)

In the above equation, Δθ(v_j) is the increase in the steady-state temperature of the circuit under voltage level v_j with respect to the temperature at V_0, and ∂d/∂θ is the voltage-dependent slope of the delay-temperature curve at voltage level v_j (which also captures the inverted temperature dependence effect [24]). We assume the only source of temperature increase is the circuit's power consumption (based on the circuit's thermal models [25]), which is itself a function of voltage as given in (10). Hence, the steady-state temperature of the circuit can be calculated for a voltage v_j.
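The voltage dependence in eq. (7) can be sketched numerically. The V_0, V_t, α, and B values below are illustrative assumptions (a short-channel process with B_j fixed at 1, i.e., ignoring the thermal correction of eq. (9)):

```python
def scaled_delay(d0, v, V0=1.1, Vt=0.3, alpha=1.3, B=1.0):
    """Alpha-power-law delay scaling in the spirit of eq. (7):
    d(v) = B * d0 * (v/V0) * ((V0 - Vt)/(v - Vt))**alpha.
    Parameter values are illustrative, not characterized data."""
    return B * d0 * (v / V0) * ((V0 - Vt) / (v - Vt)) ** alpha

d_nom = scaled_delay(1.0, 1.1)   # at V0 the model returns d0 unchanged
d_low = scaled_delay(1.0, 0.9)   # lowering Vdd slows the logic (~19% here)
```

The super-linear growth of delay as v approaches V_t is what makes aggressive voltage scaling trade so sharply against the timing constraints (1)-(2).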
Note that equations (7) and (8) are used to calculate worst-case delays under the assumption that V_t does not vary (no process variation). For the scenarios that consider V_t variations, it is more precise to use the PDFs of d_ij and δ_ij profiled at each voltage. Additionally, the total power consumption of the combinational logic, P_Comb, changes as follows due to voltage scaling²:

P_Comb(v_j, T_clk) = (v_j / V_0)² · E_dyn / T_clk + (v_j / V_0)³ · P_leak    (10)

where E_dyn and P_leak are the total dynamic energy dissipation and the leakage power consumption of the combinational logic at the nominal supply voltage V_0.

² This super-linear dependency of leakage power on supply voltage is due to the combined effect of drain-induced barrier lowering and the off-state leakage equation (V_dd × I_OFF). Its cubic form was empirically observed in SPICE simulations.

2.3.4 Delay Elements

From equation (2), one can see that increasing the window size of the i-th soft-edge FF-set puts a more stringent constraint on the hold time condition for the i-th stage of the pipeline. Therefore, if needed, delay elements may be utilized in the minimum-delay path(s) to alleviate hold time constraint violations. Insertion of a delay element with a delay magnitude of z_i changes equation (5) as follows:

q_hold,ij = Pr[δ_ij + z_i < t_h,i(w_i) − t_cq] = F^δ_ij(t_h,i(w_i) − t_cq − z_i)    (11)

Delay elements are created by utilizing a number of inverters and appropriately sizing them in order to meet the desired delay lower bound while incurring minimum power loss. The power overhead of a delay element is modeled as:

P_DE(z_i, v) = h_2(v)·z_i + h_1(v)    (12)

where z_i is the desired delay, and h_2(v) and h_1(v) are voltage-dependent parameters to be determined by HSPICE simulations. Figure 2-2 illustrates the linear model fitted to the measured data. Note that the delay elements are created by means of a buffer chain; to obtain a larger delay, more buffers or larger loads are needed.
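The coefficients h_2(v) and h_1(v) of the linear model (12) can be obtained from simulated (delay, power) pairs by ordinary least squares. A stdlib-only sketch; the sample points below are invented for illustration and are not HSPICE measurements:

```python
def fit_linear(zs, ps):
    """Least-squares fit p ~= h2*z + h1 for the delay-element power model (eq. 12)."""
    n = len(zs)
    mz = sum(zs) / n
    mp = sum(ps) / n
    h2 = sum((z - mz) * (p - mp) for z, p in zip(zs, ps)) \
         / sum((z - mz) ** 2 for z in zs)
    h1 = mp - h2 * mz
    return h2, h1

# Invented samples: delay in ps, power in nW, roughly linear with a little noise.
zs = [50.0, 100.0, 150.0, 200.0, 250.0]
ps = [30.0, 52.0, 76.0, 99.0, 121.0]
h2, h1 = fit_linear(zs, ps)
p_est = h2 * 120.0 + h1   # predicted power of a hypothetical 120 ps delay element
```

In practice one such fit would be performed per supply voltage level, giving the voltage-dependent coefficient pairs used later in the optimization.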
This causes the power dissipation to increase with increased delay (see Figure 2-2), with discontinuity points due to changes in the number of buffers.

Figure 2-2. Power vs. delay relationship for delay elements.

2.4 Soft-Edge Flip-Flops (SEFF)

Soft-edge flip-flops (SEFF) have a small transparency window right after the clock edge, which allows time borrowing across pipeline stages and is beneficial for reducing the effect of clock uncertainty [22].

2.4.1 SEFF Circuit

Some SEFF designs are derived by applying modifications to conventional hard-edge counterparts. We focus on some of the most widely used flip-flop circuits in state-of-the-art processors [45]. SEFF designs based on the master-slave FF (MSFF), the hybrid latch FF (HLFF) and the monostable-based FF (MBFF) are studied in this work.

Figure 2-3 illustrates the design and timing diagram of a master-slave SEFF, used in the IBM PowerPC 603 processor [45]. The key modification in the SEFF version is that, by delaying the clock of the master latch, both the master and slave latches are ON for the duration of the transparency window.

Figure 2-3. Positive-edge triggered master-slave SEFF a) circuit b) timing diagram.

Figure 2-3 (b) illustrates the timing diagram for the key signals of a master-slave SEFF. The dashed square highlights the transparency window, which is the overlap of clk and its delayed version, clkd. If the overlap between the edge of clk and the latching edge of clkd is larger than the delay through the master latch, the master-slave pair is transparent to the input during the window after the edge of the main clock, clk.
The delayed clock and its reverse-polarity version can be produced locally for each FF-set (or for multiple FF-sets that have equal transparency window sizes) by utilizing an inverter chain, appropriately sizing the inverters and changing the chain length in order to achieve the desired transparency window size.

The hybrid latch flip-flop [11], shown in Figure 2-4, is originally a soft-edge flip-flop; here, we seek to make the size of its transparency window adjustable as required. Figure 2-4 (b) illustrates the timing waveforms corresponding to the operation of the HLFF. In this figure, the shaded area represents the transparency window, which is created by the overlap of the clk and !clkd signals. During the time interval when both of these signals are high, both transistor stacks act as inverter gates to transfer D to S_a and then to Q. In order to increase the transparency window size in the HLFF, the delay of the delay element in Figure 2-4 (a) should be decreased by the desired amount.

The HLFF is one of the fastest SEFFs used in industrial designs, such as the AMD K6 processors [45], owing to its high performance and relatively small area. Large power consumption, glitch activity, and a somewhat complex implementation are its drawbacks [45]. Note that the transparency window of this architecture is located before the clock edge. Hence, it is suitable for backward time borrowing schemes.

Figure 2-4. Negative-edge triggered HLFF a) circuit b) timing diagram.

The monostable-based flip-flop is another industrial negative-edge flip-flop that we convert to a SEFF. The MBFF suffers from large area and high power consumption [45]. In order to modify the MBFF circuit to admit an adjustable transparency window size, a delay element is inserted in its design, as illustrated in Figure 2-5 (a). In this design, the first stage of the flip-flop generates a short pulse on node S or R to trigger the S-R latch.
The delay element essentially extends this pulse width, providing a longer time for D to arrive and get captured in the SR latch. Figure 2-5 (b) demonstrates the timing waveform of this SEFF for D=1 (for D=0, the pulse applies to R instead). The triggering pulse can be de-asserted as early as a delay of t_1 after the negative edge of clk, and is asserted exactly a delay of t_2 after the negative edge of !clkd.

Figure 2-5. Monostable-based SEFF a) circuit b) timing diagram.

Due to the practical advantages of the master-slave based SEFF, we will focus on this design for the rest of this chapter to derive equations and use in the design problems. Similar equations and discussions hold for the other SEFF designs.

2.4.2 SEFF Timing Characteristics

To optimally select the transparency windows of the SEFFs, we must accurately account for the effect of the transparency window on the SEFF's power consumption and its timing characteristics, i.e., setup time, hold time, clock-to-Q delay and D-to-Q delay. The setup time, t_s, and hold time, t_h, of a SEFF may be modeled as linear functions of the transparency window size, w, while the clock-to-Q delay, t_cq (defined as the delay between the positive edge of the clock and the time that the output is valid), and the D-to-Q delay, t_dq (defined as the input-to-output propagation delay of data while the flip-flop is transparent), are independent of the transparency window width (see Figure 2-6).

Figure 2-6. SEFF timing characteristics – HSPICE simulations.
If the supply voltage of the flip-flop can be adjusted to a new voltage level, v_j, then the coefficients of the linear models of the setup and hold times, as well as the values of t_cq and t_dq, become voltage-dependent parameters, i.e.,

t_s,i(w_i, v_j) = a_1(v_j)·w_i + a_0(v_j)
t_h,i(w_i, v_j) = b_1(v_j)·w_i + b_0(v_j)    (13)
t_cq = t_cq(v_j),  t_dq = t_dq(v_j)

The timing characteristics of the SEFF are measured by HSPICE simulations (sweeping the supply voltage) to determine the voltage-dependent values and coefficients through linear regression. Figure 2-7 shows SPICE simulations of the setup and hold times as linear functions of the transparency window size and voltage level for the master-slave SEFF.

Figure 2-7. a) Setup time and b) hold time as functions of supply voltage and transparency window width – HSPICE simulations.

2.4.3 SEFF Power Consumption Model

The power consumption of a SEFF is generally an increasing function of its window size, w. This is due to the fact that the window size is increased by resizing and/or increasing the number of inverters in the delayed clock path; both methods result in an increase in the dynamic and leakage power consumption of the SEFF. Figure 2-8 illustrates the total power consumption of a master-slave SEFF and its linear approximation versus its window size, for a fixed clock period and two different voltage values.

Figure 2-8. Power consumption as a function of window size of SEFF.

From Figure 2-8, one can conclude that the power dissipation of the SEFF may be approximated as a linear function of the transparency window width, for a fixed clock period.
To capture the effect of both the dynamic and leakage power consumption for any window size and any clock period in the SEFF circuit, its power consumption may be modeled as:

P_SEFF(w, v, T_clk) = k_3(v)·w / T_clk + k_2(v)·w + k_1(v) / T_clk + k_0(v)    (14)

where v denotes the supply voltage level, and k_0(v) through k_3(v) are voltage- and technology-dependent coefficients which can be determined through HSPICE circuit simulation. In the above equation, the two T_clk-dependent terms correspond to dynamic power consumption, while the other terms correspond to leakage power.

2.4.4 SEFF with Built-in Error Detection

As stated earlier, in non-conservative frameworks we adopt an error detection mechanism in the design of the SEFF to guarantee correct computation in the pipeline. More precisely, we have utilized a multi-sampling technique in the pipeline registers similar to the Razor FF [16] (however, Razor also integrates error correction circuitry, which increases the flip-flop delay). Usually, flip-flops with built-in error detection are intended to operate under conditions with a low error rate; this makes the amortized performance and energy overheads of micro-architectural correction negligible. In contrast, built-in correction mechanisms (see the next subsections) correct errors much faster but carry higher amortized overheads.

In a SEFF with built-in error detection, a secondary latch, called the shadow latch, is added to each conventional flip-flop. This shadow latch re-samples the input data at a later time by utilizing a phase-shifted global clock signal, clkp. Hence, the input is double-sampled at the triggering edges of the normal clock and the phase-shifted clock. If there is a setup time violation in the pipeline stage, comparing these two sampled data values detects the error.
Figure 2-9 (a) shows the internal architecture of a soft-edge master-slave flip-flop with a built-in error detection mechanism. Figure 2-9 (b) illustrates the timing waveforms and the operation of the error detection circuitry. In this figure, data unit D1 arrives early enough to get correctly latched in the FF at time t1 (and during the window preceding it). The error detection unit samples it at t2 as the correct data. On the other hand, due to delay variability or an operating frequency higher than allowed, data D2 misses the latching window (indicated by the red arrow in the figure) and cannot be latched at time t3. Instead, D1 or an invalid data value is stored. However, later at time t4, the error detection unit re-samples the data and captures D2; the result of XNORing the two sampled values indicates an error.

Figure 2-9. Positive edge SEFF with built-in error detection a) circuit and b) timing waveform.

Introduction of the phase-shifted clock signal to the design requires an additional timing constraint to avoid undetected errors or short-path violations in the following scenarios. First, if the longest path delay of the preceding logic block is so large that the signal misses the triggering edges of both the main and PS clock edges, then the error cannot be detected. Second, as shown in Figure 2-10, in which D2 is stored correctly in the main FF, if the minimum delay of the combinational logic circuit succeeding a flip-flop is too short, the new data D3 overwrites the last one, D2; thus, D3 is read at the re-sampling time at the PS clock edge and is subsequently mistakenly marked as an error. We impose the following timing constraint to address these scenarios:

t_cq + d_ij^max ≤ T_clk + PS ≤ T_clk + t_cq + δ_ij^min,    ∀i, j: 1 ≤ i ≤ N, 1 ≤ j ≤ S    (15)

where PS denotes the phase shift (delay) of the PS-Clk relative to the main clock, and d_ij^max and δ_ij^min denote the delays of the longest and shortest paths of stage i under voltage setting j.
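Constraint (15) can be checked mechanically by intersecting the per-stage bounds on the phase shift. The sketch below assumes the per-stage form t_cq + d_max − T_clk ≤ PS ≤ t_cq + δ_min (a rearrangement of (15) at a fixed voltage); all delay numbers are illustrative.

```python
def feasible_ps_range(stages, T_clk, t_cq):
    """Intersect the PS bounds implied by constraint (15) over all stages.
    `stages` is a list of (d_max, delta_min) pairs; returns (lo, hi) or None."""
    lo = max(t_cq + d_max - T_clk for d_max, _ in stages)
    hi = min(t_cq + delta_min for _, delta_min in stages)
    return (lo, hi) if lo <= hi else None

# Three illustrative stages (delays in ps) at one voltage setting.
stages = [(520.0, 90.0), (480.0, 110.0), (550.0, 100.0)]
rng = feasible_ps_range(stages, T_clk=500.0, t_cq=25.0)
```

An empty intersection (the `None` case) would indicate that either the clock period must grow or more delay elements are needed on the short paths before a legal PS exists.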
Figure 2-10. Timing waveforms for the SEFF.

2.4.5 SEFF with Built-in Error Correction

Similar to error detection, an error correction mechanism can be integrated in the flip-flop circuit (see the Razor FF [16]). In addition to the multi-sampling structure for error detection, a multiplexer is integrated in the SEFF. As illustrated in Figure 2-11, this multiplexer selects between the data sampled at the main clock edge and the one sampled at the PS clock edge, which is the corrected data in case of an error. This approach has less performance overhead than micro-architecture based mechanisms, e.g., flushing and repeating the operation. However, the power dissipation and area overheads of a SEFF with built-in error correction are higher because of the internal multiplexer gate. Usually, flip-flops with built-in error detection are intended to operate under conditions with a low error rate, which makes the amortized performance and energy overheads of micro-architectural correction negligible. The timing constraint of (15) applies to the SEFF with built-in error correction, too.

Figure 2-11. Positive edge SEFF with built-in error correction.

2.4.6 Soft-Pipeline Timing Constraints

Introduction of a transparency window to a flip-flop not only modifies the timing characteristics of a SEFF, but also changes the timing constraints imposed on the pipeline due to the implementation of time borrowing. The hold time constraint does not change in the case of time borrowing. The following inequalities establish the setup time constraints for time borrowing between stages i and i+1 [42]:

t_cq + d_i + t_s,i(w_i) ≤ T_clk    (16)

t_cq + d_i + t_dq + d_(i+1) + t_s,i+1(w_(i+1)) ≤ 2·T_clk    (17)

Figure 2-12 illustrates the setup time constraint fundamentals of a time borrowing operation among three consecutive stages, in which stage i uses the timing slack of stage i+1, and stage i+1 uses that of stage i+2.
In this figure, D_i and Q_i represent the input and output of the FF-set of stage i, respectively.

Figure 2-12. Time borrowing between two stages of a soft pipeline.

Inequality (16) is in fact the same setup time constraint as (1) for a single stage, which ensures that the delay of the i-th stage is able to meet the setup time of its destination SEFF with time borrowing enabled. Inequality (17) assumes that stage i may borrow time from stage i+1, but the accumulated delay of these two stages (plus the setup time and clock-to-Q delays of the SEFFs) should not exceed two clock periods. Note that in inequality (17), for SEFF-set i, data arrive within the transparency window and propagate to the output only after a delay of t_dq. In general, the setup time constraints corresponding to an N-stage soft pipeline can be written as:

t_cq + Σ_(k=i..i+m) d_k + m·t_dq + t_s,i+m(w_(i+m)) ≤ (m+1)·T_clk,    ∀i, m: m ≥ 0, 1 ≤ i ≤ i+m ≤ N    (18)

The above inequality set (18) covers the setup time constraints applied to single stages and to multiple stages involved in time borrowing. The parameter m denotes the depth of time borrowing in this equation. If m = 0, the inequality represents the setup time constraint within a single pipeline stage, and larger values of m produce the setup timing condition on the accumulated delays of multiple consecutive pipeline stages. Also, in the statistical framework, the setup constraint violation probability may be written as:

q_setup,i,m,j = Pr[t_cq + Σ_(k=i..i+m) d_kj + m·t_dq + t_s,i+m(w_(i+m)) > (m+1)·T_clk],    ∀i, m: m ≥ 0, 1 ≤ i ≤ i+m ≤ N    (19)

q_setup,j = 1 − Π_(i,m) (1 − q_setup,i,m,j)    (20)

As mentioned in section 2.4.2, the effect of variability on the flip-flop timing characteristics is negligible, and the random variables in (19) are the d_ij's, which are correlated [38][43]. Let ρ_ik denote the correlation between the maximum stage delays of stages i and k.
Given the CDFs of all the d_ij's and the ρ_ik's, we can estimate the CDF of a summation of d_ij's by assuming that it follows the same distribution function as any of the d_ij's, with the corresponding mean and variance calculated as:

μ_Σ = Σ_(k=i..i+m) μ_d,kj
σ_Σ² = Σ_(k=i..i+m) σ_d,kj² + 2·Σ_(i≤k<l≤i+m) ρ_kl·σ_d,kj·σ_d,lj    (21)

Note that we assume the circuits that our proposed algorithms optimize are fully synthesized and mapped circuits, and that standard SSTA timing analysis has been performed on each pipeline stage. Such tools account for various sources of variability and certainly consider the effect of spatial process variations and/or reconvergent fanout paths in their calculations.

2.5 Power-Delay Optimization in a Pipeline Using SEFF

Due to the significance of both performance and power efficiency in pipelined circuits, we chose the power-delay product as the cost metric for optimizing the design of such circuits. Note that in the power-delay product, the delay is not simply the inverse of the clock frequency; rather, as will be seen next, it is defined to also probabilistically account for the error correction timing overheads of potential setup time problems in an over-clocked pipeline. In this way, we are able to exploit the case where the increase in setup time violations and the corresponding timing overhead is compensated by the decrease in the power dissipation.

In this section, we solve the problem of power-delay optimization in a linear pipeline using SEFF. We formulate the problem for three scenarios: (i) the stage delays are captured by worst-case delay estimates; (ii) statistical timing analysis is used to model the stage delays, and no timing violation is allowed; (iii) the stage delays are still computed by statistical timing models, but timing failures are allowed to exist and are automatically detected and fixed.
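The statistical machinery of equations (4), (19) and (21) reduces to a few lines of arithmetic. The sketch below assumes Gaussian stage delays (the formulation allows any CDF) and uses invented means, sigmas, correlations and timing parameters purely for illustration.

```python
from math import erf, sqrt

def gauss_cdf(x, mu, sigma):
    """CDF of N(mu, sigma^2); stands in for a stage-delay CDF F_ij."""
    return 0.5 * (1.0 + erf((x - mu) / (sigma * sqrt(2.0))))

def sum_moments(mus, sigmas, rho):
    """Eq. (21): mean and std of a sum of correlated stage delays;
    rho[k][l] is the correlation between stages k and l."""
    mu = sum(mus)
    var = sum(s * s for s in sigmas)
    n = len(sigmas)
    for k in range(n):
        for l in range(k + 1, n):
            var += 2.0 * rho[k][l] * sigmas[k] * sigmas[l]
    return mu, sqrt(var)

# Three illustrative stages (ps) with equal pairwise correlation 0.5.
mus, sigmas = [400.0, 350.0, 380.0], [20.0, 15.0, 18.0]
rho = [[1.0, 0.5, 0.5], [0.5, 1.0, 0.5], [0.5, 0.5, 1.0]]
mu_sum, sigma_sum = sum_moments(mus, sigmas, rho)

# Eq. (19)-style violation probability of the three-stage borrowing chain:
# available time = 3*T_clk - t_cq - 2*t_dq - t_s (invented values: 450, 25, 30, 20 ps).
budget = 3.0 * 450.0 - 25.0 - 2.0 * 30.0 - 20.0
q_chain = 1.0 - gauss_cdf(budget, mu_sum, sigma_sum)
```

Note how the positive correlation inflates the variance of the sum (here 1879 ps² instead of the 949 ps² an independence assumption would give), which in turn raises the tail probability.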
In scenario (i), we deal with deterministic values of the worst-case combinational circuit delays, which are the maximum observed values of the combinational circuit's delay over all possible input combinations and under any possible operating conditions (different PVT corners). Satisfying the timing constraints of (1) and (2) for these conservative delay values results in error-free operation of the pipeline. In scenario (ii), on the other hand, we consider the path delays as random variables, use statistical timing equations, and find the optimum solution for a limited error rate. Under scenario (iii), we allow a few timing violations to occur and adopt an error detection mechanism to guarantee the correct functionality of the pipeline. In this framework, our solution considers the trade-off between aggressively scaling the pipeline frequency to improve delay, and the power and delay penalties due to error detection and correction.

The key motivation for using SEFFs in a pipeline circuit is that some positive slack may be available in one or more stages of the pipeline. Utilizing SEFFs allows passing this slack to more timing-critical stages and using it for power optimization by voltage scaling. As an example, consider the three-stage pipelined circuit of Figure 2-13 operating at a supply voltage level of V_DD. The per-stage maximum logic delays are shown in the figure. Let us assume the setup time, hold time, and clock-to-Q delay of all (hard-edge) FFs are 25ps each. From equation (1), the minimum clock period is 500ps, and no slack is available to the first stage of the pipeline. However, if FF1 is replaced with a SEFF with a transparency window of 50ps, the available slack at the second stage is passed to the first stage, providing the first stage with 50ps of borrowed time.
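The slack-passing arithmetic of this example can be checked mechanically. The sketch below uses the 450/350/400 ps stage delays and 25 ps FF parameters from the example, and models only single-stage borrowing (the window of stage i's destination FF-set relaxes its setup deadline by w_i):

```python
def min_period_hard(delays, t_cq=25.0, t_s=25.0):
    """Hard-edge pipeline, eq. (1) style: T_clk >= t_cq + d_i + t_s for every stage."""
    return max(t_cq + d + t_s for d in delays)

def min_period_soft(delays, windows, t_cq=25.0, t_s=25.0):
    """Simplified soft-edge pipeline: stage i may finish up to w_i late."""
    return max(t_cq + d + t_s - w for d, w in zip(delays, windows))

delays = [450.0, 350.0, 400.0]
T_hard = min_period_hard(delays)                    # set by the 450 ps first stage
T_soft = min_period_soft(delays, [50.0, 0.0, 0.0])  # stage 1 borrows 50 ps via FF1
```

The borrowed 50 ps must of course be repaid by real slack downstream, which inequality (17) enforces; here the 350 ps second stage has ample slack at the shorter period, so the borrow is legal.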
Now, since positive slack is available in all stages of the pipeline, the circuit can be operated at a higher clock frequency and/or a smaller voltage in order to reduce the power consumption, and possibly the power-delay metric (ideally, V_DD may be reduced by approximately 10%, resulting in roughly 19% power saving).

Figure 2-13. Example of slack passing.

2.6 Power-Delay Optimal Soft Pipeline (OSP)

The problem of power-delay optimal soft pipeline (OSP) design is defined as that of finding the optimal values of the global supply voltage level, the pipeline clock period, and the transparency windows of the individual soft-edge FF-sets in the design so as to minimize the total power-delay product of an N-stage pipeline circuit subject to setup and hold time constraints. From (10), (14) and (12), the total power consumption of the pipeline is:

P_total = P_Comb,j + Σ_(i=1..N) P_SEFF,i,j + Σ_(i=1..N) P_DE,i,j    (22)
        = P_leak,j + E_dyn,j / T_clk + Σ_(i=1..N) (k_3j·w_i / T_clk + k_2j·w_i + k_1j / T_clk + k_0j) + Σ_(i=1..N) (h_2j·z_i + h_1j)

Hence, optimizing the power-delay product of a soft pipeline (which is equivalent to its energy dissipation in this case) may be formulated as:

Minimize  P_total · D
such that:
  t_cq + Σ_(k=i..i+m) d_k + m·t_dq + t_s,i+m(w_(i+m)) ≤ (m+1)·T_clk,    ∀i, m: m ≥ 0, 1 ≤ i ≤ i+m ≤ N
  t_cq + δ_i + z_i ≥ t_h,i(w_i),    1 ≤ i ≤ N
  w_min ≤ w_i ≤ w_max,    1 ≤ i ≤ N
  v ∈ {V_1, …, V_S}    (23)

The first and second sets of inequalities in (23) are respectively the setup and hold time constraints of the pipeline stages; the third set of inequality constraints imposes an upper bound and a lower bound on the transparency windows of the flip-flops, imposed by the library or design rules (typically, w_min ≥ 0 and w_max < ½T_clk). Finally, the last statement in (23) restricts the supply voltage of the pipeline to the set of available voltages {V_1, …, V_S}, where V_0 = V_1 > … > V_S (V_0 is the nominal supply voltage).
Note that problem formulation (23) has 2N+1 optimization variables, corresponding to N−1 transparency window sizes, w_i, for the N−1 soft-edge FF-sets in the linear pipeline, N delay element values, z_i, for the N stages of the pipeline, one supply voltage setting, v, and one clock period variable, T_clk. Referring back to Figure 2-13, for the sake of consistency with the input and output environments and to avoid imposing constraints on the sender or receiver of data for the linear pipeline circuit in question, we impose the boundary condition that the first and last FF-sets in the pipeline are composed of hard-edge FFs, whereas the intervening FF-sets may be SEFFs.

To solve the problem stated in (23) efficiently, we enumerate all possible values of v, and for each fixed v we solve a quadratic program (i.e., we minimize a quadratic cost function subject to linear inequality constraints), which can be solved optimally in polynomial time. We refer to this version of the problem as OSP-FV, OSP with fixed voltage:

Minimize  P_leak,j·T_clk + E_dyn,j + Σ_(i=1..N) (h_2j·z_i·T_clk + h_1j·T_clk) + Σ_(i=1..N) (k_3j·w_i + k_2j·w_i·T_clk + k_1j + k_0j·T_clk)
such that:
  t_cq + Σ_(k=i..i+m) d_k + m·t_dq + t_s,i+m(w_(i+m)) ≤ (m+1)·T_clk,    ∀i, m: m ≥ 0, 1 ≤ i ≤ i+m ≤ N
  t_cq + δ_i + z_i ≥ t_h,i(w_i),    1 ≤ i ≤ N
  w_min ≤ w_i ≤ w_max,    1 ≤ i ≤ N    (24)

Note that in the OSP-FV problem, all the voltage-dependent coefficients, i.e., k_3 through k_0 in the P_SEFF equation and h_2, h_1 in the P_DE equation, as well as the coefficients in t_s,i, t_h,i, t_cq, and t_dq, are recalculated for the voltage under test. Also, E_dyn, P_leak, d_i and δ_i are given, window-size-independent inputs (generated by profiling or given by (8)-(10)) for each voltage.

Lemma 1: In the optimal solution of the OSP-FV design problem, the transparency window of the i-th SEFF-set is equal to the time borrowed by the combinational logic in the i-th stage.
Proof: According to the discussion in the previous sections, the power consumption of a SEFF is a monotonically increasing function of the transparency window size, while its setup time is a decreasing function of the same. Now, from the OSP-FV problem formulation of equation (23), the minimum decrease in the setup time t_s,i of the i-th SEFF-set which meets the long-path constraint in the i-th stage of the pipeline will produce the minimum increase in the power dissipation P_SEFF,i of the i-th SEFF-set. Therefore, the optimal solution is achieved by utilizing the smallest possible window sizes that prevent setup time violations. ■

Lemma 2: In the optimal solution of the OSP-FV design problem, the delay element inserted in the i-th stage of the pipeline is equal to the minimum extra time needed to meet the hold time constraint at the i-th soft-edge FF-set.

Proof: According to the discussion in section 2.3.4, the power consumption of a delay element is a monotonically increasing function of the target delay value, while the slack of the hold time constraint is an increasing function of the same. Now, from the second set of inequalities (the hold time condition) in the OSP-FV problem formulation of (23), the minimum delay value z_i added to the i-th stage of the linear pipeline which meets the short-path constraint for that stage will produce the minimum increase in the power overhead of the delay element in the i-th stage, P_DE(z_i, v). Hence, the optimal solution is achieved by utilizing the smallest possible delay elements that prevent hold time violations. ■

Theorem 1: The optimal solution to the OSP design problem is obtained by solving the OSP-FV design problem S times, once for each distinct voltage level, and selecting the voltage level v* and the corresponding w_i*, z_i* and T*_clk values that minimize the power-delay product.
Proof: This follows from the observation that the solution of the OSP-FV problem produces the optimal w_i's, z_i's and T*_clk for each possible v, and we enumerate over all v's to obtain the global optimum solution in an exhaustive manner. ■

Note that although SEFFs are custom-designed and their transparency windows are set only once at design time, implementing the exact optimal transparency windows of the SEFFs may not be practical, because, for instance, the device (transistor) sizes, and hence the delay of the window generation circuitry of the SEFF, cannot take arbitrary values. Therefore, we round off the optimal sizing solution to its closest larger implementable match. Since the realized SEFF has a minimally larger transparency window size, it will not violate any setup time constraints, while increasing the power consumption as little as possible. However, if any hold time constraints are violated by this adjustment, delay elements may be added to the violating short paths to solve the problem, with negligible impact on the power-delay metric of the pipeline.

1  Determine P_leak,j, E_dyn,j, d_ij and δ_ij, and the voltage-dependent coefficients a_1j, a_0j, b_1j, b_0j, t_cq,j, t_dq,j, k_3j, k_2j, k_1j, k_0j, h_2j, h_1j for all voltages
2  for (each v = V_j, V_j ∈ {V_1, …, V_S}) {
3      PD_j = Solution to OSP-FV(v) }
4  v* = ArgMin PD_j for 1 ≤ j ≤ S
5  Set the w_i*'s and z_i*'s as the solution of OSP-FV(v*)
6  Round off the w_i*'s and z_i*'s to the closest upper feasible match

Figure 2-14. Pseudo-code of OSP algorithm.

The pseudo-code presented in Figure 2-14 summarizes the steps of the OSP algorithm.

2.7 Statistical Power-Delay Optimal Soft Pipeline (SOSP)

In section 2.6, we followed the conventional static timing analysis framework, in which deterministic values of worst-case circuit delays are used to specify the circuit timing.
However, due to process and environmental variations in integrated circuits, the path delays may vary from one die to the next and from one operating condition to another. Consequently, the path delays may be modeled by random variables. Therefore, we replace the deterministic timing constraints with the probabilities of timing violations in the pipeline, as given by equations (4) and (5).

The problem of statistical power-delay optimal soft pipeline (SOSP) design is defined as that of finding the optimal values of the operating voltage and frequency and the transparency window sizes of the individual soft-edge FF-sets in the pipeline so as to minimize the total power-delay metric of a soft pipeline circuit with N pipeline stages and S voltage states. As mentioned earlier, SEFFs enable opportunistic time borrowing across adjacent stages of the pipeline in order to provide timing-critical stages with more time to complete their computations and, thereby, reduce the probability of timing errors at a particular frequency.

Let q_setup,ij and q_hold,ij denote the probabilities of setup time and hold time violations at stage i of the pipeline under supply voltage v_j, as given in equations (11) and (20). Assuming that the probability of encountering an error in a specific combinational circuit stage is independent of the other stages, the probability of having a timing error in the entire pipeline, q_pipeline,j, is calculated by (25). This probability should be limited to an extremely small value, ε (e.g., 10^-12), to make failure of the pipeline virtually impossible.

q_pipeline,j = 1 − Π_(i=1..N) [(1 − q_setup,ij)(1 − q_hold,ij)]    (25)

Now then, SOSP can be formulated as (26). It minimizes the power-delay product of the pipeline, subject to an upper bound on the error probability, denoted by ε:

Minimize  T_clk · [P_Comb,j + Σ_(i=1..N) P_SEFF,i,j + Σ_(i=1..N) P_DE,i,j]
such that:
  q_pipeline,j ≤ ε
  w_min ≤ w_i ≤ w_max,    1 ≤ i ≤ N
  v ∈ {V_1, …, V_S}    (26)

Note that even though the circuit delay is modeled as a random variable due to process variations, the power consumption is not. It is known that the effect of V_t or L_eff variation on the dynamic power consumption is negligible [44]. Leakage power dissipation in a pipelined design, on the other hand, is largely unchanged by our optimization process: we do not make any modifications to the combinational circuit itself (e.g., we do not perform gate sizing or logic re-synthesis). Therefore, the leakage of the combinational logic gates is not affected by our optimization and remains unchanged, so we can set these leakage values to any fixed amount; we use the maximum (worst-case) values of the leakage power consumption of the combinational circuit.

Next, we approximate q_pipeline,j, which is given by (25), with a convex function to simplify the problem statements. Expanding equation (25) yields a summation of all the q_setup,ij's and q_hold,ij's and their mutual products of second and higher order. Since all the error probabilities, i.e., the q_setup,ij's and q_hold,ij's, are relatively small values (e.g., on the order of 1e-3 or 1e-4), the product of any two (or more) of them is negligible compared to the summation of the first-order terms and can be ignored.
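This negligibility claim is easy to verify numerically: a short sketch comparing the exact expression (25) with the first-order sum of the violation probabilities (the per-stage probabilities below are illustrative values in the stated 1e-3 to 1e-4 range).

```python
def q_pipeline_exact(qs):
    """Eq. (25): 1 - prod(1 - q) over all per-stage violation probabilities."""
    p_ok = 1.0
    for q in qs:
        p_ok *= (1.0 - q)
    return 1.0 - p_ok

def q_pipeline_approx(qs):
    """First-order (union-bound style) approximation: a simple sum."""
    return sum(qs)

qs = [1e-3, 4e-4, 7e-4, 2e-4]   # illustrative per-stage setup/hold probabilities
exact = q_pipeline_exact(qs)
approx = q_pipeline_approx(qs)
err = approx - exact             # the dropped second- and higher-order terms
```

For these values the sum is 2.3e-3 while the gap to the exact expression is below 2e-6, three orders of magnitude smaller, so the first-order truncation is safe.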
The resulting equation for q_pipeline,j is then a simple summation of the q_setup,ij's and q_hold,ij's:

q_pipeline,j ≈ Σ_(i=1..N) (q_setup,ij + q_hold,ij)    (27)

Furthermore, to conveniently formulate the problems as quadratic programs, we approximate q_setup,ij and q_hold,ij as first-order polynomial functions of the SEFF characteristics and T_clk:

q_setup,ij ≈ qsT_j · T_clk + Σ_(m≥0) qsw_j · w_(i+m) + qs_j(i)    (28)

q_hold,ij ≈ qhd_j · z_i + qhw_j · w_i + qh_j(i)    (29)

where qsT_j, qsw_j, qhd_j, qhw_j are the coefficients (of T_clk, the window size, the delay element, and the window size in q_setup,ij and q_hold,ij, respectively) corresponding to voltage setting j, and qs_j(i) and qh_j(i) are voltage- and stage-delay-dependent fixed terms. As a preprocessing step, we linearize the CDF of any maximum (minimum) stage delay around its μ+3σ (μ−3σ) point, i.e., for any x within a region around such a point, F_ij(x) ≈ α_ij·x + β_ij. Hence, equations (11) and (20) can be approximated as follows, and all the coefficients q*_j can be determined accordingly:

q_setup,ij = 1 − F^d_ij(T_clk − t_cq,j − t_s,i(w_i)) ≈ −α_ij·T_clk + α_ij·a_1j·w_i + α_ij·(t_cq,j + a_0j) + 1 − β_ij    (30)

q_setup,i,m,j ≈ −α'_ij·(m+1)·T_clk + α'_ij·a_1j·w_(i+m) + α'_ij·(t_cq,j + m·t_dq,j + a_0j) + 1 − β'_ij    (31)

q_hold,ij ≈ −α''_ij·z_i + α''_ij·b_1j·w_i + α''_ij·(b_0j − t_cq,j) + β''_ij    (32)

where α'_ij and β'_ij linearize the CDF of the accumulated stage delay of equation (21), and α''_ij and β''_ij linearize F^δ_ij.

Again, using Theorem 1, we obtain a similar algorithm to solve the SOSP problem, presented in (33): we enumerate all possible values of v and solve a quadratic program for each v. We refer to this version as SOSP-FV, SOSP with fixed voltage, in which the variables are only the transparency window sizes, the pipeline clock period, and the delay elements:

Minimize  T_clk · [P_Comb,j + Σ_(i=1..N) P_SEFF,i,j + Σ_(i=1..N) P_DE,i,j]
such that:
  q_pipeline,j ≤ ε
  w_min ≤ w_i ≤ w_max,    1 ≤ i ≤ N    (33)

Theorem 2: The SOSP-FV problem is a convex problem, and the optimal solution to it (if the feasible region is not empty) minimizes the objective function.

Proof: In general, the product or ratio of two convex functions is not convex [83], and hence we used the additive approximation in (27) for q_pipeline,j instead of (25).
Therefore, the objective function of the SOSP-FV problem is a quadratic function of its variables (the transparency window sizes, delay elements, and clock period) while the constraints are linear.■ The convex optimization problem of SOSP-FV is thus efficiently solvable using any commercial mathematical optimization tool. Of course, once a solution is obtained, we must verify that the conditions for the approximations hold; this has always been the case in our experimental results.

2.8 Error-Tolerant Statistical Power-Delay Optimal Soft Pipeline (ESOSP)

The problem formulations presented in sections 2.6 and 2.7 conservatively calculate the pipeline clock period to avoid timing violations that cause pipeline errors. However, the critical path is sensitized only for some specific combinations of inputs, and therefore these formulations result in a pessimistic clock period. Instead, the error-tolerant statistical power-delay optimal soft pipeline (ESOSP) algorithm builds on the SOSP techniques and aggressively decreases the clock period to improve performance, while implementing a mechanism to capture and fix any timing violations caused by this over-clocking. The proposed algorithm explores the trade-off between the delay improvement and the increase in power, as well as the power and delay penalties caused by timing errors. An error handling mechanism is incorporated in our design to guarantee correct functionality under all conditions. Error detection and correction can be fully implemented in the SEFF circuit, as described in section 2.4.5. Alternatively, error detection is built into the SEFF circuit while the error correction mechanism is supported by the architecture, through flushing the data/instructions and replaying them under a more conservative transitory operating condition, e.g., a lower frequency (see section 2.4.4).
If the error rate is relatively low, the area and power overhead of an FF design with a built-in error detection circuit is negligible compared to that of an FF with a built-in error correction circuit. The ESOSP algorithm aggressively scales down the pipeline clock period to improve performance, while employing "SEFF with error detection" to capture any timing violations caused by this over-clocking. For simplicity, we focus on the fixed-voltage version of the ESOSP problem, and generate the solution to the original ESOSP problem by combining the solutions to multiple instances of ESOSP-FV based on Theorem 1. Let P_j denote the average total power consumption of the pipeline under supply voltage v_j, and P_p,j denote the average power overhead of encountering an error at the same voltage v_j (this overhead includes the power consumed for computing the erroneous data as well as flushing it and the data units that follow it). Also, let γ denote the average delay (in clock cycles) of error detection and correction, such as flushing. Given an error probability of q_j under some voltage v_j, the expected value of the power-delay objective function may be written as:

Φ_j = T_clk,j · [ (1 − q_j) · P_j + q_j · (P_j + P_p,j) · γ ]     (34)

In fact, the error probability q_j is a decreasing function of T_clk. This is the source of the trade-off between the power-delay metrics of error-free and erroneous pipeline operation. Decreasing T_clk reduces the power-delay of error-free operation (the first term in (34)), but increases q_j and, as a result, the error correction overhead (the second term in (34)). Implementing time borrowing across adjacent stages of the pipeline effectively reduces the probability of timing errors, q_j, and avoids the subsequent power and delay penalties of the error correction step for any T_clk. Increasing the transparency window sizes, however, increases the total power consumption.
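The trade-off embedded in (34) can be illustrated with a toy model in which the error probability decays as the clock period grows; the power values, error overhead, recovery penalty, and decay model below are all invented for illustration, not measured values:

```python
import math

P, P_p, gamma = 1.0, 5.0, 10.0            # toy power, error-overhead power, recovery cycles

def q(t_clk):
    # Toy error probability: drops quickly as the clock period gains slack.
    return min(1.0, math.exp(-8.0 * (t_clk - 1.0)))

def expected_pdp(t_clk):
    # Expected power-delay, in the spirit of (34).
    qe = q(t_clk)
    return t_clk * ((1.0 - qe) * P + qe * (P + P_p) * gamma)

grid = [1.0 + 0.05 * k for k in range(81)]     # sweep T_clk from 1.0 to 5.0
t_best = min(grid, key=expected_pdp)

# The optimum lies strictly inside the sweep: over-clocking pays the error
# penalty term, while a slow clock simply wastes delay.
assert grid[0] < t_best < grid[-1]
assert expected_pdp(t_best) < expected_pdp(grid[0])
assert expected_pdp(t_best) < expected_pdp(grid[-1])
```

Aggressive over-clocking is punished by the q_j·(P_j + P_p,j)·γ term, a slow clock by the leading T_clk factor, so the expected power-delay is minimized at an interior clock period.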
Fortunately, the power saving gained by time borrowing tends to more than compensate for this window overhead. Recall that P_j in equation (34) denotes the sum of the power consumptions of the combinational logic blocks and the SEFFs when no error is encountered. P_j is a function of the voltage, the SEFF window sizes, and the delay elements, and equation (22) can be rearranged as

P_j = A_j + B_j / T_clk + Σ_{i=1..N} ( a_j · w_i + b_j · w_i / T_clk ) + Σ_{i=1..N} ( c_j · z_i + d_j · z_i / T_clk )     (35)

with A_j and B_j representing all the terms corresponding to constant values and coefficients of 1/T_clk, respectively. For simplicity, let us assume that the power overhead of error correction is β times that of producing a data value without encountering an error, i.e., P_p,j = β · P_j (the value of the β parameter is obtained from micro-architectural and circuit simulations). Consequently, the ESOSP-FV problem is defined as finding the optimum w_i's, z_i's, and T_clk in the following formulation:

Minimize   T_clk · [ (1 − q_j) · P_j + q_j · (1 + β) · P_j · γ ]
subject to:   w_min ≤ w_i ≤ w_max,  z_min ≤ z_i ≤ z_max,  1 ≤ i ≤ N;   q_j = Σ_{i=1..N} ( q_setup,ij + q_hold,ij )     (36)

Note that with the proposed linear approximations for q_j, the objective function of (36) is a third-order polynomial, which can be solved using general convex optimization tools [46][47]. In section 2.9, we introduce another constraint, which bounds the undetected error probability and should be added to (36).

2.8.1 ESOSP for Profiled Operation

Dynamic Voltage and Frequency Scaling (DVFS) is widely used to minimize the power consumption of microprocessors. The entire pipeline should meet the timing constraints in every circuit state (also known as a DVFS setting). A circuit state is uniquely identified by a supply voltage level that is simultaneously applied to all stages of the pipeline. Changing the voltage to bring about a new circuit state affects the power consumption of the pipeline as well as the combinational path delays and the time budget of the combinational circuit.
Consider a scenario whereby, based on the system-level power management policy, it has been determined that the circuit will operate in each of its circuit states according to some probability distribution. We present another formulation to minimize the average expected power-delay product over all DVFS circuit states. More precisely, given the probabilities of being in the various circuit states during the active mode of pipeline operation, we attempt to minimize the power-delay product averaged over all such states. Let π_j denote the probability of being in circuit state s_j (characterized by a given voltage level v_j). Then the weighted cost function is defined as:

Φ_avg = Σ_{j=1..S} π_j · Φ(s_j)     (37)

The ESOSP-Profiled problem is thus formulated as:

Minimize   Σ_{j=1..S} π_j · T_clk,j · [ (1 − q_j) · P_j + q_j · (1 + β) · P_j · γ ]
subject to:   w_min ≤ w_i ≤ w_max,  z_min ≤ z_i ≤ z_max,  1 ≤ i ≤ N;   q_j = Σ_{i=1..N} ( q_setup,ij + q_hold,ij ),  1 ≤ j ≤ S     (38)

ESOSP thus minimizes the power-delay product of the pipeline by finding the optimum set of clock periods T_clk,j (j = 1, …, S), one per circuit state, a set of optimum window sizes w_i (i = 1, …, N−1), one per FF-set, and the optimum delay elements z_i (i = 1, …, N) of each stage. Hence, for S circuit states and N pipeline stages, there are S+2N−1 optimization variables; in each circuit state, we apply the calculated optimum frequency to all pipeline stages. Notice that the optimum window size of each soft-edge FF-set (recall that the first and last FF-sets always use hard-edge FFs), as well as the delay elements, are design-time decisions, and these size assignments are independent of the circuit state.

2.9 Bounding the Probability of Undetected Errors

An undetected error in the pipeline can occur due to a very long path that violates the internal timing of a SEFF. Normally, in a SEFF with built-in error handling mechanisms, the input data is re-sampled at a later time by utilizing a phase-shifted global clock signal, PS (see section 2.4.4).
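The re-sampling margin can be sized numerically; the following sketch searches for the smallest phase shift that keeps the overall undetected-error rate below a bound, assuming Gaussian max stage delays with made-up (μ, σ) values:

```python
import math

def norm_cdf(x, mu, sigma):
    # Standard Gaussian CDF via the error function.
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

stages = [(500.0, 25.0), (530.0, 30.0), (480.0, 20.0)]   # hypothetical (mu, sigma) [ps]
T_CLK = 560.0                                            # hypothetical clock period [ps]
BOUND = 1e-6                                             # undetected-error bound

def undetected_rate(ps):
    # An error goes undetected when a stage's delay exceeds T_clk + PS, so the
    # clean-pipeline probability is the product of per-stage CDFs at T_clk + PS.
    rate = 1.0
    for mu, sigma in stages:
        rate *= norm_cdf(T_CLK + ps, mu, sigma)
    return 1.0 - rate

ps = 0.0
while undetected_rate(ps) > BOUND:      # smallest PS (1 ps steps) meeting the bound
    ps += 1.0

assert undetected_rate(ps) <= BOUND
assert undetected_rate(ps) <= undetected_rate(0.0)
```

Increasing PS monotonically shrinks the undetected-error rate, at the cost of a later re-sampling point, which is why PS is best treated as an optimization variable rather than fixed a priori.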
The undetected error probability is the probability of the data arriving after T_clk + PS, which is calculated by (39); notice that we have replaced T_clk with T_clk + PS because an undetected error occurs only when the arrival time of the correct data is later than the triggering edge of the PS clock in the current cycle. Consequently, given the CDF of the max stage delays, the probability of an undetected error in pipeline stage i under supply voltage v_j is:

q_undet,ij = 1 − F_ij( T_clk + PS )     (39)

The overall rate of undetected errors over all voltage levels is:

ε_undet = 1 − Π_{j=1..S} Π_{i=1..N} ( 1 − q_undet,ij )     (40)

To impose an upper bound on the undetected-error probability, we include PS as a new optimization variable in the problem formulations with the error detection technique enabled, along with the following constraint, where ε_UpperBound is user provided (typically of the same order as ε in (33), e.g., 1e-6 to 1e-10):

ε_undet ≤ ε_UpperBound     (41)

2.10 Experimental Results

2.10.1 Simulation Setup

To extract the parameters used in the optimization problem, we performed transistor-level simulations of soft-edge flip-flops using HSPICE [48]. We used a 90nm technology model [49] with a nominal supply voltage of 1.2V. Simulations were conducted at a die temperature of 85°C. In all experiments, the set of available voltage levels is {0.8V, 0.9V, 1V, 1.1V, 1.2V}. We synthesized a number of linear pipelines, including some modified ISCAS89 benchmarks (denoted by TBx) as well as datapath and processor circuits, to construct a set of benchmarks. The SIS [50] and Synopsys Design Compiler packages were used for synthesizing the benchmarks. We then performed timing simulations and used Synopsys PrimeTime to extract the static values of the longest and shortest path delays of each pipeline stage under each voltage setting. Next, we considered the max and min stage delays of a pipeline to have probability density functions.
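A toy Monte Carlo run shows how such a max stage delay distribution can be produced from correlated parameter variation; the nominal path delays and sample count below are illustrative, while the σ/μ ratio of 5% and delay correlation of 0.5 match the assumptions of our simulation setup:

```python
import random
import statistics

random.seed(7)

NOMINAL = [480.0, 470.0, 455.0, 440.0]   # nominal delays of a stage's top critical paths [ps]
SIGMA_RATIO = 0.05                       # sigma/mu for the variation sources
RHO = 0.5                                # correlation between path delay perturbations

samples = []
for _ in range(20000):
    common = random.gauss(0.0, 1.0)      # shared (correlated) variation component
    stage_max = 0.0
    for d in NOMINAL:
        local = random.gauss(0.0, 1.0)   # independent per-path component
        x = (RHO ** 0.5) * common + ((1.0 - RHO) ** 0.5) * local
        stage_max = max(stage_max, d * (1.0 + SIGMA_RATIO * x))
    samples.append(stage_max)

mu = statistics.fmean(samples)
sd = statistics.stdev(samples)
print(round(mu, 1), round(sd, 1), round(mu + 3 * sd, 1))  # mu+3sigma linearization point
```

The empirical μ+3σ point of the sampled max-delay distribution is exactly where the CDF linearization of (30)-(32) is anchored.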
To generate these distributions, we ran Monte Carlo simulations on the fully synthesized and mapped logic circuits, monitoring how the top 100 critical paths of each stage (identified using the Synopsys PrimeTime timing analysis tool) are affected by variations. We assumed a σ/μ ratio of 5% for the sources of variation, i.e., threshold voltage and channel length, similar to [39], and applied it in the circuit simulations. We also assumed ρ = 0.5 for the correlation of stage delays. We then use the linear approximation of (30)-(32) for any stage delay distribution around its μ+3σ point (or μ−3σ for min stage delays). Finally, we formulate the different algorithms given all the required coefficients and parameters. To solve the mathematical problems developed in this chapter, MATLAB [46] and the TOMLAB toolbox [47] were used. The algorithms calculate the optimal values of the operating supply voltage and frequency and the transparency window sizes of the individual soft-edge FF-sets in the design that minimize the total power-delay of the soft pipeline circuit.

2.10.2 Linear approximation of a general stage delay CDF

Given the delay distributions of all pipeline stages, we apply the linear approximation of (30)-(32) where the error rate is below 5%. Figure 2-15 illustrates the linear and piecewise linear estimates of a sample CDF. The overall mean square relative error of the linear model was 1e-4 and that of the piecewise linear approximation was 4.5e-6. In our simulations, we used a piecewise linear approximation with two regions intersecting at the 99th percentile of the CDF; T_clk determines the region of estimation for each stage. For estimating multistage delays, we use the average of the coefficients of the linear models of the involved stage delays.
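The linearization step itself is straightforward; a sketch for a Gaussian stage-delay CDF around its μ+3σ point (μ and σ are made-up values, not benchmark data):

```python
import math

def norm_cdf(x, mu, sigma):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def norm_pdf(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

mu, sigma = 500.0, 25.0          # hypothetical max stage-delay statistics [ps]
x0 = mu + 3.0 * sigma            # linearization point

# First-order expansion of the CDF: F(x) ~ alpha*x + beta near x0.
alpha = norm_pdf(x0, mu, sigma)  # slope = density at x0
beta = norm_cdf(x0, mu, sigma) - alpha * x0

# Check the fit within a small window around mu + 3*sigma.
for x in (x0 - 5.0, x0, x0 + 5.0):
    assert abs(norm_cdf(x, mu, sigma) - (alpha * x + beta)) < 1e-3
```

Because the tail of the CDF is nearly flat at μ+3σ, a single (α, β) pair tracks it closely over the window of interest, which is what makes the affine forms of (30)-(32) accurate.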
For all testbenches, the error of this linear approximation (single-stage and multistage) remained below 2e-4 for the linear model and below 1e-5 for the piecewise linear model, which is acceptable and does not have a high impact on the results of our solutions.

Figure 2-15. Accuracy of linearly approximating the stage delay CDF.

2.10.3 OSP Simulation Results

In order to evaluate the performance of the proposed OSP algorithm, we used two conventional-FF-based approaches as baselines for comparison: Baseline implements a conventional pipeline (which contains only conventional hard-edge FFs) and always runs at the nominal voltage of 1.2V. The second method, Base+VS, adds support for voltage scaling to Baseline. Both baselines operate at the minimum clock cycle time of the pipeline circuit. This clock period was calculated for each of the test pipeline circuits listed in Table 2-1 using the standard timing equations (1) and (2) (for regular FFs), and the power dissipation of the pipeline was subsequently computed. Next, OSP was run on each circuit, exploiting time borrowing across different stages and, thus, saving power. The percentage improvements of the power-delay product achieved by OSP with respect to Baseline and Base+VS on these benchmarks are provided in Table 2-1.

Table 2-1. Power-delay-product improvement by OSP

Testbench     Stage delays at nominal voltage (max, min) [ps]             Baseline    Base+VS         OSP             %PDP Saving
                                                                          Tclk*[ps]   Vdd*   Tclk*   Vdd*   Tclk*    Base   Base+VS
tb1           (353,140) (214,112) (254,107) (217,110)                     458.5       0.8    707.7   1.0    471.5    38.2   12.4
tb2           (646,192) (670,232) (550,158) (648,192) (583,189)           786.1       0.8    1206.9  0.9    1028.5   42.4   13.9
tb3           (334,108) (280,98) (219,80)                                 397.3       0.9    534.6   1.0    467.1    44.4   17.8
tb4           (250,96) (254,96) (251,95) (253,96)                         329.4       1.0    380.8   1.0    384.9    14.9   -3.0
TROY proc.    (1270,320) (2188,429) (4759,150) (4788,315) (1279,230)      4893        0.9    6986.7  0.9    6408.5   26.7   8.7
Openrisc1200  (2172,280) (2514,359) (7738,351) (6862,436) (1739,487)      7843        1.0    9487.9  0.9    12288    28.2   11.6
Viterbi dec.  (817,175) (858,164) (926,215) (773,183)                     1055.3      0.8    1608.5  0.8    1584.1   33.6   12.1

The first entry in this table is the name of the benchmark. The specifications of the benchmark, i.e., the max and min delays of each pipeline stage at the nominal voltage, are reported in the second column. The next five columns report the optimum supply voltage (V*) and clock period (T*_clk) for Baseline (which runs at the nominal voltage), Base+VS, and OSP. The last two columns show the percentage reduction in power-delay achieved by OSP (compared to the Baseline and Base+VS algorithms), which is also depicted in Figure 2-16. As can be observed, OSP achieves an average power-delay saving of 32% compared to Baseline by applying voltage scaling and time borrowing, and a saving of 10% compared to Base+VS by time borrowing alone. In the case of tb4, the saving is negative compared to Base+VS, since it has the same logic circuit duplicated in each pipeline stage (balanced stages). As expected, there is no room for time borrowing in it; hence, the power overhead of the added circuitry causes a PDP loss. Note that by balanced, we refer to (nearly) equal stage delays.

Figure 2-16. Power-Delay reduction by OSP.

An interesting observation in the results of Table 2-1 is that the optimum clock period calculated by OSP or Base+VS is much larger than that of Baseline. This is because the objective of these two algorithms is the power-delay product (PDP), and in many cases, PDP is reduced when the supply voltage is reduced and T_clk is consequently increased.
However, if the operating frequency of the circuit is an important design criterion, a minimum frequency limit, f_min, may be imposed by adding a linear constraint of the form T_clk < 1/f_min to the OSP problem formulation (and to the other formulations). For instance, we required f_min to be higher than 85% of the Baseline frequency for tb2 and tb4. In the case of tb4, nothing changes, since the result is already in that range. In the case of tb2, however, the PDP saving of OSP (compared to Baseline) reduced to about 38% while its optimum operating voltage and clock period were found to be 1V and 914ps, respectively. Here, limiting the minimum frequency of the circuit limits the benefit of voltage scaling, but time borrowing is still useful in minimizing the clock cycle time. To provide more insight into the results, we studied how SEFFs are used in a soft pipeline by solving OSP-FV. In this set of experiments, the supply voltage of each pipeline was set to the nominal value and OSP-FV was invoked to find the minimum value of T_clk. Table 2-2 shows the optimum clock period of Base+VS and OSP along with the SEFF window sizes for each test circuit under nominal voltage. For example, in the case of tb1, the window sizes are such that only the first stage borrows time from its next stage. Note that in the soft pipelines of TROY and OR1200, some window sizes are set to the maximum allowed size (300ps in this case).

Table 2-2. The optimum Tclk and window sizes obtained by OSP-FV

Testbench   T_base [ps]   T_clk* [ps]   W* [ps]             %PDP Saving
tb1         458.5         393.2         77, 0, 0            20.5
tb2         786.1         749.8         14.1, 13.7, 0, 21   5.9
tb3         397.3         394           40.5, 55            9.4
tb4         314.9         387           0, 0, 0             -2.7
TROY        5057.9        4774          0, 0, 300, 300      5.8
OR1200      8215.6        7781          0, 0, 300, 0        5.2
Viterbi     1055.3        952.9         0, 0, 124.8         9.4

2.10.4 SOSP Simulation Results

Next, we considered the randomness and variability of the longest and shortest delays of the pipeline stages (calculated as described in section 2.10.1). We then set up SOSP as the quadratic program presented in (26), with the aforementioned linear approximation for q_pipeline,j, and solved it using the TOMLAB optimization toolbox. It calculated the optimal values of the operating supply voltage and frequency and the transparency window sizes of the individual soft-edge FF-sets in the design that minimize the total PDP. By setting ε equal to the inverse of the total number of critical paths, we avoid violating the timing constraints. For performance comparison, we used two baseline methods similar to the case of OSP, i.e., Baseline is limited to the nominal voltage while Base+VS can also change the supply voltage. The baselines determined the maximum clock frequency of the circuits based on a statistical analysis similar to SOSP, except that they utilize hard-edge FFs in the pipeline circuit. Table 2-3 reports the results of applying SOSP to the benchmarks of Table 2-1 (with statistical specifications), including the maximum frequency determined by Baseline under the nominal supply voltage, and the optimum operating voltage and frequency obtained by Base+VS and by SOSP. Table 2-3.
Power-delay-product saving by SOSP

Testbench   Base      Base+VS          SOSP             %SOSP PDP Saving
            T* [ps]   Vdd    T* [ps]   Vdd    T* [ps]   Base    Base+VS
tb1         41.7      0.8    675.5     0.8    625.3     46.7    20.0
tb2         774.4     0.8    1193.3    0.9    1012.6    40.3    10.9
tb3         402.0     0.9    411.3     0.8    644.5     52.8    22.6
tb4         371.8     0.8    575.2     0.8    587.2     28.5    -6.2
TROY        4702      1.0    5612.9    1.1    5231.9    24.3    8.3
OR1200      7792      0.9    10197     1.0    9155.0    31.8    6.8
Viterbi     1022.6    1.1    1086.4    1.1    1012.7    22.3    12.8

2.10.5 ESOSP Simulation Results

Next, we measured the error penalties of error detection and correction in a pipeline through micro-architectural simulations. We then set up and solved the ESOSP problem as formulated in (38), and compared it to the Baseline described in section 2.10.4, which calculates the optimum frequency of a conventional pipeline under nominal voltage. Since ESOSP benefits from voltage scaling, time borrowing, and error tolerance, we studied the portion of the total expected power-delay saving due to each of these techniques in the statistical framework. Table 2-4 summarizes the percentage improvement in the power-delay product of three techniques with respect to the Baseline algorithm. The first is the Base+VS algorithm, which implements only voltage scaling (denoted by VS). The second is our proposed SOSP, which combines voltage scaling and time borrowing (denoted by VS+TB). The third is ESOSP, which adds error tolerance to SOSP. Table 2-4 also gives the optimum voltage and clock periods for the testbenches as well as the optimum overall error rate of the pipeline, q_total.

Table 2-4. ESOSP performance and comparison to baseline

Testbench   %PDP saving vs. Base      ESOSP
            VS     VS+TB   ESOSP     Vdd*   T* [ps]   q_total
tb1         33.7   46.2    54.8      0.8    533.8     2.11
tb2         30.0   36.0    47.8      0.9    852.9     1.71
tb3         36.7   51.5    60.3      0.8    520.7     1.35
tb4         33.9   25.8    39.2      0.8    493.6     1.86
TROY        20.1   27.4    30.9      1.1    4658.3    1.05
OR1200      24.2   31.8    35.5      1.0    8461.9    0.95
Viterbi     7.1    21.2    30.5      1.1    844.3     2.20

This table also reports the details of the optimum operating point of the soft pipeline along with the total error rate of the pipeline. Figure 2-17 illustrates the share of each technique in the overall power-delay improvement with respect to Baseline.

Figure 2-17. Power-Delay reduction by OSP.

Finally, we compared our ESOSP algorithm to an advanced baseline, Base+CS, which adopts the useful clock skew technique on top of Baseline. In this method, the pipeline stages are balanced (to within four FO4 inverter delays) by adjusting the clock skew of each individual stage. In contrast, ESOSP reduces the imbalance of the pipeline by means of time borrowing. The results of this comparison show an average PDP saving of 38% for ESOSP over all testbenches. Compared to the 42.7% average PDP saving of ESOSP with respect to Baseline, one can conclude that the share of the PDP saving due to time borrowing drops by about 5%. The reason is that the two methods have almost the same effect on balancing the stage delays, and hence the clock period reduction gained by using SEFFs is lower with respect to Base+CS. However, using SEFFs enables dynamic (variable) time borrowing, whereas clock skew is a static (fixed) method for path delay balancing across different pipeline stages. As far as the overhead of our proposed techniques (including OSP, SOSP, and ESOSP) is concerned, the area overhead of a SEFF compared to a normal FF is only the internal delay circuitry, which is small compared to the area of the original FF.
In addition, compared to the size of the rest of the pipeline, the area overhead of the SEFFs and extra buffers is minuscule. Finally, as far as the runtimes of our proposed algorithms are concerned, for all benchmarks it takes less than two seconds on a 2.4GHz Xeon Pentium-4 PC (with 2GB of memory) to run any of these algorithms in the MATLAB/TOMLAB toolbox.

2.11 Summary

In this chapter, we presented and solved the problem of minimizing the power-delay product metric of a linear pipeline by utilizing soft-edge flip-flops to perform time borrowing between consecutive stages of the pipeline. We formulated the problem of optimally selecting the transparency window sizes of the SEFFs and the clock frequency of the pipeline so as to optimize the power-delay product of the entire pipeline, in three different scenarios that assume deterministic worst-case path delays or probabilistic random delays for the pipeline stages. Also, by over-clocking the pipeline, allowing timing violations to occur, and then recovering from the errors, our proposed ESOSP algorithm exploits the trade-off between performance and power saving to further minimize the expected power-delay product of a pipeline. Our experimental results demonstrated that the proposed techniques are quite effective in reducing the expected power-delay of a pipeline.

Chapter 3. PERFORMANCE-CONSTRAINED POWER OPTIMIZATION IN A CHIP MULTIPROCESSOR

3.1 Introduction

With the increase in demand for high-performance processors, Chip Multiprocessor (CMP) architectures have been introduced to enable continued performance scaling in spite of the slow-down of CMOS technology scaling. At the same time, the demand for higher processing power is creating the need for power- and energy-efficient design of multi-core processing platforms. As technology continues to scale to smaller feature sizes, power dissipation and die temperature have become the main design concerns and key performance limiters in processor design.
The problem of power-efficient multiprocessor design has been studied in the literature. Prior studies propose dynamic power/thermal management for homogeneous [51]-[54] or heterogeneous multicore architectures [56][57]. The real-time power management techniques include local responses at the core level [52][58][59] and global task scheduling heuristics [57][60][61][62]. Typically, the problem formulations target performance optimization under a power/energy budget [51][53] or a thermal constraint [58][63][64][65], or attempt to minimize a composite cost function in the form of energy per throughput [56][63]. Minimization of the total power consumption of a general-purpose CMP system while meeting a total throughput constraint [54][66] is an equally interesting problem, and it is the focus of the present chapter. Our solution framework solves the power management problem for such a CMP system through concurrent core consolidation, task assignment to cores, and core-level DVFS. In this chapter, we address the problem of minimizing the total power consumption of a CMP while maintaining a CMP-level average throughput target for the tasks running on the CMP. The minimum-power solution is achieved by applying DVFS, core consolidation, and task assignment in the introduced hierarchical global power manager, which comprises three tiers. The top-tier PM unit performs core consolidation and coarse-grain DVFS based on the information/prediction about the current and future tasks provided by the workload manager unit. The mid-tier PM assigns the tasks, which are assumed to be independent, to the available cores considering server and task affinities. The low-tier PM employs a closed-loop feedback DVFS technique at the core level, which senses a core's performance at periodic intervals and sets the operating frequency level of the core to enforce adherence to the known chip-level throughput requirements.
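A minimal sketch of the low-tier feedback idea, a proportional controller that nudges a core's frequency until the sensed IPS meets its target; the linear plant model, gain, and all constants below are invented for illustration and are not our framework's actual models:

```python
F_MIN, F_MAX = 0.5e9, 3.0e9        # frequency range [Hz] (hypothetical)
IPC = 0.8                          # toy plant model: IPS ~ IPC * f, no noise
TARGET_IPS = 1.6e9                 # chip-level throughput share for this core
K = 0.6                            # proportional gain (illustrative)

f = F_MIN
for _ in range(50):                # periodic control intervals
    measured_ips = IPC * f         # sensed performance during the last interval
    error = TARGET_IPS - measured_ips
    f += K * error / IPC           # move frequency toward the IPS target
    f = min(F_MAX, max(F_MIN, f))  # clamp to the feasible v-f range

# After a few dozen intervals the core throughput has converged to the target.
assert abs(IPC * f - TARGET_IPS) / TARGET_IPS < 1e-3
```

A closed loop of this shape absorbs the inaccuracy of profiled task characteristics: even when the plant model is wrong, the controller keeps correcting based on measured throughput rather than predictions.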
The novelties of this method may be summarized as follows:
- It solves the problem of CMP power optimization under an average throughput constraint by means of core consolidation and closed-loop DVFS.
- It proposes to append a workload analyzer to the CMP power management unit to perform coarse-grain DVFS depending on the type of task, e.g., memory-intensive vs. CPU-intensive.
- It uses a high-level simulation tool that simulates a CMP by emulating core consolidation, task scheduling, task queuing, and DVFS.

The remainder of this chapter is organized as follows. Section 3.2 reviews the related prior work. In section 3.3, we provide background on the CMP and throughput models used in this chapter. Section 3.4 describes the problem of minimizing the power consumption of a CMP given a throughput constraint and presents our heuristic method in detail. Section 3.5 is dedicated to the experimental results, and section 3.6 summarizes this chapter.

3.2 Prior Work

Dynamic Voltage and Frequency Scaling (DVFS) for single-processor systems is well understood and standardized [67]. However, due to key differences between single-core and multicore systems, there are a number of options in applying DVFS to CMP platforms [53][68][69]. In particular, DVFS in such systems can be applied in one of two ways: chip-wide [52][53] or per-core [51][69][73]. Moreover, DVFS may be combined with power gating (shutting down) a portion of the chip. Finally, the performance of a CMP system is strongly influenced by the task-to-core assignment, and thus DVFS should be combined with (or at least solved in light of) task assignment [60]. In [54], the authors address the problem of finding a chip-wide operating voltage-frequency (v-f) setting as well as the number of active cores that minimize the power consumption of a CMP under a performance constraint.
The proposed method uses an offline characterization of the system power and performance for the target application and a hill-climbing search to find the optimal solution, and is therefore too costly to serve as a general-purpose runtime power management technique. Reference [66] formulates the problem of minimizing the total power consumption of a multi-core system subject to a throughput constraint by means of dynamic voltage scaling and task scheduling, and proves it to be NP-hard. A heuristic is then presented for the case of queued tasks, which performs an exhaustive search of the state transition space at each task execution point. The shortcomings of this work include the high complexity of the proposed solution and the fact that it does not utilize core shutdown as a way of saving power. In [52], the authors deploy a control-theoretic (PI) controller to perform DVFS in CMPs at runtime. Similarly, the limitation of this work is that it does not consider the potential power saving of changing the number of active cores. In [51], the authors introduce the concept of a global power manager, which senses the per-core power and performance of a CMP and sets the operating power mode of each core while meeting a target power budget. One limitation of this work is that the number of active cores is fixed, independent of the workload given to the CMP. This results in sub-optimal power consumption, especially when the CMP workload is low. Also, the premise that each core is permanently dedicated to running a specific application limits the practicality of this approach.

3.2.1 Feedback Control in Power Management

Feedback control theory is a powerful tool for dealing with variability in engineered systems [111]. The feedback control technique was first employed by Stankovic et al. [112] for real-time CPU scheduling in an embedded system.
A PI (proportional-integral) controller was used in [113] to control the voltage dynamically, with a user-specified system latency in stream processing used as the set-point of the controller. By modeling a multimedia system as a soft real-time system, the authors of [114] extended the aforesaid technique and employed a feedback controller that adjusts the decoder's speed according to the difference between the actual and preset occupancy levels of the buffer between the decoder and the display device. A PID (proportional-integral-derivative) controller was employed in [115] to perform DVS on an embedded system platform, demonstrating that it outperforms existing ad hoc DVS techniques for such systems, e.g., [116][117]. Kandasamy et al. [118] presented an online control framework wherein the control actions governing the operation of the system are obtained by optimizing its behavior, as forecasted by a mathematical model, over a limited time horizon. They presented an online control algorithm to minimize the energy expenditure of a processor by varying its operating frequency while meeting the QoS requirements of a time-varying workload. The approach was developed for queuing systems. Alimonda et al. [119] developed a control-theoretic approach to feedback DVS for multi-processor system-on-chip (MPSoC) pipelined architectures. The approach aims to control the inter-processor queue occupancy levels. Wu et al. [120] proposed an analytical approach to DVS for multiple-clock-domain processors. It is based on a dynamic stochastic queuing model and a PI (proportional-integral) controller with queue occupancy as the controlled variable. In [51], the authors considered independent scaling of the voltage/frequency of each core of a CMP to enforce a chip-level power budget. Power mode assignments are re-evaluated periodically by a global power manager, based on the performance and average power consumption observed in the most recent period.
3.3 Background

In this section, we briefly describe the considered system models and provide the theoretical background for them.

3.3.1 System Model

We consider an N-way homogeneous CMP system. Such a system is composed of N homogeneous processing cores, which are independent except that they share the L2 cache and the interface to main memory [70]. Each core has a separate supply voltage and clock generation module so that the cores can run at different voltage-frequency (v-f) settings. Note that utilizing per-core DVFS [69] makes the cores of a CMP operate heterogeneously in terms of their power and performance. Application programs and/or the operating system generate requests/tasks and send them to the CMP. Similar to [62], we assume these tasks are independent of each other, and each task runs on a single core without the need for inter-core communication. The problem of optimally assigning dependent tasks in multiprocessor systems has been studied in the literature [61][71][72], but is outside the scope of the present dissertation. Note, however, that the only modification needed to address task dependencies is to utilize a task assignment policy in the task dispatching unit that handles the structure of a task graph; any performance losses due to one task waiting for the results of a predecessor task can thus be accounted for. General characteristics of the tasks, including their expected job size, s, and memory access rate (MAR), are assumed to be known. Note that MAR is not a precise micro-architectural performance metric, such as the cache miss ratio; instead, it denotes an approximation of the number of memory accesses that cause pipeline stalls and delay penalties. Roughly speaking, the MAR value indicates whether a given task is CPU-intensive or memory-intensive.
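The coarse classification that MAR enables can be sketched as follows; the threshold, the v-f table, and the task values are invented placeholders, not parameters of our framework:

```python
# Hypothetical coarse-grain DVFS hint derived from a task's MAR value:
# a memory-bound task gains little from a fast clock, so it can be run
# at a lower v-f setting with little throughput loss.
VF_TABLE = {"cpu": (1.1, 2.8e9), "mem": (0.9, 1.6e9)}   # (V, Hz), made up
MAR_THRESHOLD = 0.02        # stalls-causing accesses per instruction (illustrative)

def classify(mar):
    return "mem" if mar > MAR_THRESHOLD else "cpu"

tasks = [("t1", 0.001), ("t2", 0.05), ("t3", 0.015)]    # (name, MAR), made up
settings = {name: VF_TABLE[classify(mar)] for name, mar in tasks}

assert settings["t1"] == VF_TABLE["cpu"]   # CPU-intensive: high v-f
assert settings["t2"] == VF_TABLE["mem"]   # memory-intensive: low v-f
```

Because MAR estimates are approximate, such a classification is only a coarse-grain hint; the low-tier feedback loop then corrects any resulting throughput shortfall at runtime.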
This information can be collected from history-based profiling data of the tasks generated by certain applications (see, for example, published data about characteristics of various tasks generated by E-Commerce, Banking, or Support applications [74]) or by dynamic profiling with the aid of built-in performance monitoring units (e.g., a core's IPS value can be measured on the fly by using the retired instruction count reported by hardware performance counters during a time epoch; for instance, the MSR_PERF_FIXED_CTR0 register in Intel Xeon processors reports the number of instructions that retire execution [75]). In either method, the estimated values are prone to error, which affects the decisions made based on them; the power manager must therefore employ a technique that copes with this uncertainty and/or inaccuracy in the task characteristics. In our approach this is done through a feedback control loop (cf. Section 3.4.1).

Figure 3-1. System model with global and local queues.

Figure 3-1 shows an abstract block diagram of the CMP system considered in this dissertation. A Power Management Unit (PMU), which sets the working v-f levels of the different CMP cores and provides the Task Dispatching Unit (TDU) with the input data needed for task assignment, acts as the global controller for the system. The CMP has a single Global Queue (GQ), in which the incoming tasks are held. The TDU assigns the tasks in the GQ to the available cores periodically. Each CMP core has a Local Queue (LQ), which holds the tasks that are assigned to the core. The PMU may be implemented either as a centralized hardware unit, such as a separate embedded microcontroller, or as a piece of high-priority software executed on one of the cores.
The former realization can become a bottleneck for the system due to the PMU's limited bandwidth for collecting runtime data about the cores and the growing overhead of detailed data processing and decision making as the number of cores goes up. The latter realization helps with the scalability of the PM framework with respect to the number of cores in the system. In addition, in our proposed hierarchical framework, the top tiers of the PM perform quick (low-overhead) global data processing and decision making at the system level, whereas the bottom-tier PM makes detailed decisions at the core level.

The disparate applications are assigned to the cores by the TDU, which is a part of the OS code. Depending on the size of the CMP, i.e., the number of cores in the system, the TDU can be realized in a centralized or distributed manner; in this dissertation, we assume a centralized TDU implementation. The GQ is typically implemented in software as part of the OS kernel, while the LQs are implemented as part of the local power management code that runs on the individual cores.

3.3.2 Throughput Model

The throughput of a processor core is defined as the average number of executed instructions per second, measured in instructions per second (IPS for short). If a core running at frequency f executes task j with known characteristics, then the time t_0 needed to run I_0 instructions can be estimated by equation (42), in which the first term represents the computation time and the second term accounts for the delay of accessing higher-level caches:

t_0 = \frac{I_0}{IPC_j^{ncm} \cdot f} + t_c \cdot CMF_j \cdot I_0    (42)

where CMF_j denotes the cache miss frequency, i.e., the proportion of instructions that cause an L1 cache miss while executing task j, and t_c is a fixed parameter representing the average cache miss penalty, which captures the core's expected stall time when a cache miss occurs.
The value of t_c depends on parameters such as the pipeline implementation, cache size, cache management policy, and the speed of the L2 cache and main memory. IPC_j^{ncm} denotes the no-cache-miss instructions per cycle of the task; it is defined as the IPC value under the condition that no cache misses occur, e.g., with a very large cache that has all the application data pre-fetched.

Recall that CMF is a micro-architecture-level parameter that indicates the fraction of a task's memory accesses that miss in the L1 cache. In fact, it can be interpreted as a translation of the high-level MAR to the architecture level; in general, a CPU-intensive (low-MAR) task has a low CMF value while a memory-intensive (high-MAR) task exhibits a high CMF (although a memory-intensive task may have a low CMF due to a special memory access pattern). Here, we use CMF and MAR interchangeably to distinguish memory-intensive from CPU-intensive tasks. Also, note that CMF_j in (42) represents the average cache miss frequency due to both instruction and data cache misses (denoted by CMF_j^{inst} and CMF_j^{data}, respectively):

CMF_j = CMF_j^{inst} + r_d \cdot CMF_j^{data}    (43)

where r_d is the fraction of instructions accessing data memory, typically in (0.1, 0.6). Referring to the definition of throughput, the throughput of core i is calculated from (42) as follows:

IPS = IPC_j(f) \cdot f = \frac{IPC_j^{ncm} \cdot f}{1 + t_c \cdot IPC_j^{ncm} \cdot CMF_j \cdot f}    (44)

where IPC_j(f) denotes the actual IPC value of the task running on the core.

3.3.3 IPS Saturation Effect

Figure 3-2 shows the relationship between IPS and frequency as captured in equation (44) for different types of tasks. Figure 3-2-a corresponds to three low-CMF tasks with high, medium and low IPC^{ncm} values, while Figure 3-2-b shows three high-CMF tasks with high, medium and low IPC^{ncm} values.

Figure 3-2. Throughput-frequency relationship for a) low CMF tasks b) high CMF tasks.
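The saturation behavior seen in Figure 3-2 follows directly from equation (44); a minimal sketch of the model (the cache-miss penalty t_c and the task parameters below are illustrative values, not ones taken from this chapter):

```python
def ips(f_hz, ipc_ncm, cmf, t_c=50e-9):
    # Equation (44): IPS = IPC_ncm * f / (1 + t_c * IPC_ncm * CMF * f).
    # With CMF = 0 the model is linear in f; a positive CMF makes IPS
    # saturate, since the stall time t_c * CMF * I0 does not shrink with f.
    return ipc_ncm * f_hz / (1.0 + t_c * ipc_ncm * cmf * f_hz)

# Doubling f doubles IPS for a CPU-bound task, but not for a memory-bound one.
cpu_gain = ips(1.6e9, 1.0, 0.0) / ips(0.8e9, 1.0, 0.0)     # = 2.0
mem_gain = ips(1.6e9, 1.0, 0.08) / ips(0.8e9, 1.0, 0.08)   # < 2.0
```

The shrinking speedup of `mem_gain` is exactly the high-CMF saturation region that motivates the unit-slope frequency introduced next.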
From Figure 3-2-b, the domain of the IPS function of high-CMF tasks can be divided into two regions: a frequency region where IPS rises rapidly with an increase in f, and another where the rate of change of IPS with f is low. We define a unit-slope frequency separating these two regions:

f_{unsl} = \{ f : \partial IPS / \partial f = 1 \}    (45)

where \partial IPS / \partial f is the partial derivative of IPS with respect to frequency (normalized appropriately to produce a unity value for the ratio of the full ranges of IPS and frequency). For example, in Figure 3-2-b, f_unsl for high-CMF, high-IPC^{ncm} tasks is about 710 MHz, which is illustrated by a dashed line. For different combinations of IPC^{ncm} and CMF, the unit-slope frequency may be calculated for the corresponding task type. In practice, there is uncertainty about the predicted values of IPC^{ncm} and CMF of an incoming task, and hence f_unsl cannot be calculated accurately for a future task; a single average f_unsl is therefore assumed for all memory-intensive tasks to lower the complexity. Note that if a task has a high CMF value, the core spends much of its time waiting idle for the memory response; hence, the clock frequency can be set to a relatively low value to reduce power/energy with little or no performance loss. Therefore, to reduce the runtime of the consolidation and coarse-grain DVFS steps, we limit the clock frequency of a core running this type of task to frequencies below f_unsl (cf. Section 3.4.2).

3.4 CMP Power Management Problem Statement

Consider an N-way CMP as described in Section 3.3. The PMU seeks to minimize the total power consumption of the CMP subject to achieving a service rate at which GQ overflow does not occur. This means that, on average, the CMP service rate must be greater than or equal to the rate of the incoming tasks, which is equivalent to imposing a lower bound on the average throughput of the CMP. The problem statement can be written as follows:

\min P_{CMP} \quad \text{s.t.} \quad \mu \ge \lambda    (46)

where P_CMP denotes the CMP power (see equation (48)), λ is the rate of the incoming tasks (arrival rate of tasks into the GQ), and μ is the CMP service rate (departure rate of tasks from the GQ). To solve this problem, the power management algorithm needs to decide the optimum number of processing cores required to service the tasks, determine the v-f setting of each active core, and assign and schedule the tasks in the GQ to the different cores. Moreover, the predictive input information of the system, such as the task characteristics (described in Section 3.3), is prone to uncertainty and inaccuracy, and a mechanism must be adopted to cancel out the effect of inaccurate data. Due to the real-time nature of the problem, conventional mathematical optimization approaches do not yield a robust solution; we instead employ an efficient (lightweight) and robust algorithm.

To estimate the power consumption of the CMP, we use a power model that is the sum of the intra-core power dissipation and the CMP-level power contribution of the core. The intra-core power dissipation comprises a dynamic power that depends cubically on the core's clock frequency (assuming that the frequency f is directly proportional to the core's supply voltage level V) and a v-f-setting-dependent idle component, P_idle(f). The second component is P_common,chip (also denoted by P_C), which comprises the power consumption of the shared resources in the CMP system, most importantly the L2 cache and the I/O interface; this component is independent of the frequency of any core:

P_{core,intra}(f) = Q_D \cdot f^3 + P_{idle}(f), \qquad P_{common,chip} = P_{L2} + P_{I/O}    (47)

where Q_D in the P_{core,intra} expression is a fixed term that depends on the implementation, the CMP platform, and the average activity factor of the target task class; P_idle(f) is the idle power consumption of each core as a function of frequency, whose values can be measured offline and kept in a lookup table; and P_L2 and P_I/O are frequency-independent constant terms capturing the power dissipation of the L2 cache and the I/O interface of the CMP. We thus have:

P_{CMP} = \sum_{i=1}^{N} active(i) \cdot P_{core,intra}(f_i) + P_{common,chip}    (48)

where active(i) is a pseudo-Boolean variable set to 1 exactly if the i-th core is active. In this model, it is assumed that at least one core is active in the CMP, executing arrived tasks and the PMU application.

3.4.1 Proposed Solution: 3T-PM

We introduce an efficient strategy that solves the policy optimization problem described above. The proposed solution relies on a 3-tier hierarchical DPM approach (which we call 3T-PM) in which the original problem is broken into three optimization problems based on the significance and granularity of the decisions that must be made. A higher-level DPM sets the values of the input parameters of the lower levels. Decisions at the top level are made based on coarse-grain information about the target task set (e.g., the predicted MAR value for tasks in the GQ), whereas lower-level decisions are made based on the characteristics of individual tasks. Figure 3-3 shows the block diagram of the proposed hierarchical PMU. The PMU attempts to minimize the CMP power consumption while ensuring that the CMP throughput is higher than a minimum threshold value.
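The power model of equations (47) and (48) translates into a few lines of code. In this sketch the Q_D coefficient, the idle-power lookup table, and the shared-power term are made-up illustrative numbers:

```python
def p_core_intra(f, q_d, p_idle):
    # Equation (47): dynamic term Q_D * f^3 plus the frequency-dependent
    # idle component taken from an offline-measured lookup table.
    return q_d * f ** 3 + p_idle[f]

def p_cmp(active, freqs, q_d, p_idle, p_common):
    # Equation (48): intra-core power summed over active cores, plus the
    # frequency-independent shared component (L2 cache + I/O interface).
    return sum(p_core_intra(f, q_d, p_idle)
               for on, f in zip(active, freqs) if on) + p_common

# Hypothetical 4-core example: two cores active, at 0.8 and 1.6 GHz.
P_IDLE = {0.8: 1.0, 1.6: 2.0}   # watts per frequency level (GHz)
total = p_cmp([1, 1, 0, 0], [0.8, 1.6, 1.6, 1.6],
              q_d=2.0, p_idle=P_IDLE, p_common=3.0)
```

The cubic dynamic term dominates at the high v-f settings, which is what makes core consolidation plus per-core DVFS attractive in the tiers below.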
This is done by:
a) choosing the optimum number of cores required to maintain the required throughput and turning the rest of the cores off (tier 1 in Figure 3-3);
b) dividing the active cores into two groups, high-speed and low-speed, and setting the target working frequencies for the high-speed and low-speed cores (we call this optimization core consolidation and coarse-grain DVFS; tier 1 in Figure 3-3);
c) assigning tasks from the GQ to the LQs of the different active cores (this task assignment step is done separately for the high- and low-speed cores; tier 2 in Figure 3-3);
d) setting the target average throughput value (the so-called "set point") for each core considering the task assignments, such that the system-level throughput constraint (in the form of a task processing rate) is satisfied (tier 2 in Figure 3-3);
e) dynamically tuning the voltage-frequency level of each active core by using a local feedback control loop per core (we call this step fine-grain DVFS; tier 3 in Figure 3-3).

Figure 3-3. Block diagram of the proposed three-tiered PM.

Decisions at each tier of the PM hierarchy are made regularly, but with different frequencies. Tier-1 decisions are made at each decision epoch, T_d. Task assignment is done as part of the second-tier optimization at each allocation window T_a, where T_a < T_d.
The third-tier decision making is done with period T_s, which denotes the sampling period of the digital feedback control loop of each core. Typically T_s < T_a, such that the lower-level controller iterates for enough sampling periods to become stable within the T_a period. This means that the stability of the two tiers is independent as long as they operate according to this specification. Furthermore, note that the hierarchical structure of the solution implies that a higher-level PM makes a decision that sets the target (aspiration level) for the lower levels, and the lower-level decisions only satisfy these targets, i.e., they cannot undermine higher-level decisions as long as the target points are feasible.

3.4.2 Workload Analyzer

The task of the Workload Analyzer (WA) is to monitor the incoming tasks at the GQ in order to (i) classify them based on their IPC characteristics, and (ii) predict the future workload, both in terms of its arrival rate and its IPC characteristics. The decision about the amount of workload that needs to be processed in each decision epoch is also made at this time, such that, on average, queue overflow is avoided. The WA aims to keep the average occupancy of the GQ at a constant level, which implies that the service rate μ matches the demand rate λ. If this condition holds, the CMP supplies just enough performance to satisfy the throughput requirement of the system while saving as much power as possible.

3.4.2.1 Task Classification

As mentioned earlier, MAR indicates whether a task is CPU-intensive or memory-intensive. Two classes of tasks are defined based on their MAR values on the given cores: Intrinsically Low Speed (ILS, or l for short) and Intrinsically High Speed (IHS, or h for short) tasks. We define only two classes for simplicity of the classification mechanism and its implementation, and in view of the low accuracy of the given information about tasks.
Task classification is done based on the value of the task's MAR, i.e.,

C(task) = ILS \text{ if } MAR_{task} \ge MAR_{th}, \; IHS \text{ otherwise}    (49)

where C(task) is an enumerated type describing the class of the task and MAR_th is a threshold value used to partition the tasks. When a priori information about a task is not available, the WA assigns it to the default class ILS, which allows the task to run more power-efficiently; meanwhile, the WA monitors and records its MAR for later reference. The MAR values of tasks are recorded in a table with a least recently used (LRU) replacement policy to limit the table size.

3.4.2.2 Workload Analysis and Prediction

The WA monitors and predicts the required throughput of each task set, IPS_h and IPS_l, and the average characteristics of the tasks, e.g., IPC_avg, and provides them to the tier-one PM to drive the core consolidation and coarse-grain DVFS choices at each decision epoch. The prediction method can be a history-based technique, whereby a moving-window average of the task arrival rates and their IPC values over the last few decision epochs serves as the estimate of the task arrival rate and IPC value in the next decision epoch. Next, based on the current state of the GQ and the prediction of the task arrival rate, the WA determines the number of tasks, W, in the GQ to be dispatched to cores in each allocation window. The WA sets W such that the occupancy level of the GQ remains nearly constant at some target level, e.g., 50% (cf. [69] for a detailed analysis). This value was found to be energy-efficient for a single-processor system; the CMP, however, can be seen as a processor that is N times faster, whose incoming task rate is thus N times higher as well. The WA creates the ILS and IHS task sets to run during the decision period and calculates the required throughput of each set:

IPS_C = \frac{1}{T_d} \sum_{j \in C} s_j    (50)

where T_d denotes the duration of the decision period and s_j denotes the expected size of task j.
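A compact sketch of the WA's two calculations above, the threshold rule of equation (49) and the per-class throughput requirement of equation (50); the function names are ours, and the default MAR_th = 0.15 matches the simulation setup reported later in this chapter:

```python
def classify(mar, mar_th=0.15):
    # Equation (49): memory-intensive (high-MAR) tasks form the
    # Intrinsically Low Speed class; CPU-intensive ones form IHS.
    return "ILS" if mar >= mar_th else "IHS"

def required_ips(task_sizes, t_d):
    # Equation (50): IPS_C = (sum of expected task sizes in class C) / T_d.
    return sum(task_sizes) / t_d

# Two tasks of 1M and 2M expected instructions over a 50 ms decision epoch.
ips_c = required_ips([1e6, 2e6], t_d=0.05)   # 6e7 instructions per second
```

These per-class IPS_l and IPS_h values are exactly what the tier-one optimization consumes as its throughput constraints.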
3.4.3 Tier-One PM

The job of the Tier-One PM is first to find the optimum number of cores to run each class of tasks so as to minimize P_CMP, and then to assign a single target voltage and frequency level to all the cores assigned to one class (the v-f setting is later fine-tuned by tier three).

3.4.3.1 Core Consolidation and Coarse-Grain DVFS

Armed with the task classification, the PMU allocates the optimum number of cores to each class of tasks, and sets the coarse-grain frequency (and hence the supply voltage level) of each core. The objective is to minimize the total power consumption while satisfying the throughput constraint of the task set in each class. Let n_l and n_h denote the number of cores assigned to the ILS and IHS tasks, respectively, and N the CMP core count. The tier-one power minimization problem can be formulated as follows:

\min P_{CMP} = n_l \cdot Q_D \cdot f_l^3 + n_h \cdot Q_D \cdot f_h^3 + n_l \cdot P_{idle}(f_l) + n_h \cdot P_{idle}(f_h) + P_{common,chip}
subject to:
n_l, n_h \ge 0, \quad n_l + n_h \le N
f_{min} \le f_l \le f_{unsl}
f_{min} \le f_h \le f_{max}
n_l \cdot IPC_{l,avg} \cdot f_l \ge IPS_l
n_h \cdot IPC_{h,avg} \cdot f_h \ge IPS_h    (51)

where f_l and f_h are the coarse-grain working frequencies of the ILS and IHS cores, respectively. These two frequencies, together with the numbers of cores n_l and n_h assigned to each task class, are the optimization variables to be determined. The first constraint limits the number of cores; under low workload conditions it may be prudent, from a power-saving perspective, to turn off some cores, which is why the sum of the two core counts can be less than N. The second and third constraints bound f_l and f_h to the ranges (f_min, f_unsl) and (f_min, f_max). The last two constraints are the throughput constraints of the two task classes. Here IPC_{C,avg} denotes the average actual IPC value of all current tasks in the corresponding class C, assumed equal to the IPC measured in the recent past by hardware performance counters [76]. Notice also that IPS_l and IPS_h have already been determined by the WA from equation (50).
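Because the decision space of (51) is small, it can be swept exhaustively. A runnable sketch under illustrative assumptions: a 0.2-1.6 GHz frequency grid, unit IPC values, and a `power` callback standing in for the objective of (51) with the idle and shared terms omitted:

```python
import math

def tier_one(N, f_levels, f_unsl, ips_l, ips_h, ipc_l, ipc_h, power):
    # Enumerate the active-core count m and the ILS frequency f_l, derive
    # n_l, n_h and f_h from the constraints of (51), and keep the feasible
    # tuple with the least power.
    best = None
    for m in range(1, N + 1):
        for f_l in (f for f in f_levels if f <= f_unsl):
            n_l = math.ceil(ips_l / (ipc_l * f_l))        # ILS throughput constraint
            n_h = m - n_l                                 # core-count constraint
            if n_h <= 0:
                continue
            f_need = ips_h / (n_h * ipc_h)                # IHS throughput constraint
            f_h = min((f for f in f_levels if f >= f_need), default=None)
            if f_h is None:                               # would exceed f_max
                continue
            p = power(n_l, n_h, f_l, f_h)
            if best is None or p < best[4]:
                best = (n_l, n_h, f_l, f_h, p)
    return best

# 8 cores; throughputs in giga-instructions/s, frequencies in GHz.
grid = [0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.4, 1.6]
best = tier_one(8, grid, 0.8, ips_l=1.5, ips_h=2.0, ipc_l=1.0, ipc_h=1.0,
                power=lambda nl, nh, fl, fh: nl * fl**3 + nh * fh**3)
```

With these numbers the sweep settles on three slow ILS cores and five IHS cores, illustrating how the cubic power term favors spreading work over more, slower cores.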
This is a nonlinear integer programming problem. Fortunately, since the range of the independent variables is small (a few available frequency levels and a limited number of cores on the chip), a branch-and-bound search method, as described below, is attractive and computationally feasible. On line 12, the algorithm searches for the best variable values (n_l, n_h, f_l, f_h) that minimize the power dissipation.

1  S = {};
2  for (m = 0 to N; m++) do
3    for (f_l = f_min to f_unsl; f_l += f_step) do
4      n_l = ceil(IPS_l / (IPC_l,avg * f_l));  // from the 4th constraint
5      n_h = m - n_l;                          // from the 1st constraint
6      f_h = IPS_h / (n_h * IPC_h,avg);        // from the 5th constraint
7      calculate P_CMP from (51);
8      s = (n_l, n_h, f_l, f_h, P_CMP);
9      S = S ∪ {s};
10   end for
11 end for
12 s_min = find_min(S);
13 return s_min;

It can be shown that the complexity of the proposed algorithm is O(N*F), due to the two nested loops on lines 2 and 3, where F is the number of frequency steps between f_min and f_unsl.

3.4.4 Tier-Two PM

After the tasks have been classified into IHS and ILS, and the numbers of high- and low-speed cores and their corresponding coarse-grain v-f settings have been decided by the top-level PM, the TDU assigns the tasks to individual cores. It also determines the target throughput of the individual cores in the high-speed and low-speed categories.

3.4.4.1 Task Assignment

The task assignment scheme is shown in Figure 3-4. The tasks in the GQ are passed through a switch where ILS and IHS tasks are distinguished from each other and sent to the corresponding round-robin (RR) switches. Each RR switch assigns its input tasks to the LQ of an available core using the round-robin scheduling technique [62]. At each instant, both RR switches have a list of the currently available cores. We define an available core as an active core with a Queue Occupancy (QO) level less than a threshold value. If the next core in the RR switch's list is not available, the RR switch simply skips it and looks for the next available core.
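The availability-aware round-robin dispatch just described can be sketched as follows (the queue contents and threshold are illustrative, and a task that finds no available core is simply left unassigned in this sketch):

```python
from collections import deque

def rr_assign(tasks, local_queues, qo_th):
    # Round-robin over cores, skipping any core whose local-queue
    # occupancy has reached the threshold, as in the tier-2 RR switches.
    order = deque(range(len(local_queues)))
    for task in tasks:
        for _ in range(len(order)):
            i = order[0]
            order.rotate(-1)              # advance the round-robin pointer
            if len(local_queues[i]) < qo_th:
                local_queues[i].append(task)
                break

lqs = [[], ["x", "x", "x"], []]           # core 1 is already at the threshold
rr_assign(["a", "b", "c", "d"], lqs, qo_th=3)
```

With the full queue skipped, the four tasks alternate between cores 0 and 2 in this example, which is exactly the "ignore and look for the next available core" behavior of the RR switch.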
Figure 3-4. Tier-2 task assignment scheme.

3.4.4.2 Determining Target Throughput of Cores

Once tasks are assigned to cores, the mid-level PM calculates, based on the set of tasks assigned to each individual core, the target throughput that must be used as the set point in the feedback loop controller of that core (the tier-three PM). The target throughput of a core, IPS_target, equals the sum of the expected numbers of instructions in the assigned tasks, s_j, divided by the allocation period length. The calculation for each core uses an equation similar to (50), except that the task set is restricted to the tasks assigned to that core, and T_d is replaced by T_a. That is:

IPS_{target} = \frac{1}{T_a} \sum_{j \in \text{assigned}} s_j    (52)

Note that if the execution time of a task exceeds T_a, it is not feasible to execute the complete task in a single allocation period, and the corresponding core must continue running the task into the next period. However, in order to calculate the target throughput of the core running such a task (with a large expected execution time), the task is virtually divided into two or more subtasks to be executed in subsequent allocation periods. Therefore, only the portion of the task that is executed during each T_a period is considered in the target throughput calculation of the core for that period.

3.4.5 Tier-Three PM

To maintain a target throughput, IPS_target, for each core, we use feedback control theory [77]. More precisely, we model a processor core as a system, called G_s, whose input is the v-f setting and whose output is the resulting throughput of the core, IPS, as shown in Figure 3-5. The controller, denoted G_c in the figure, assigns a v-f setting for the core.
The system then employs this v-f setting, and the resulting throughput is measured by means of the built-in performance monitoring units (a core's IPS value can be measured on the fly using the retired instruction count reported by the hardware performance counters [75] over a time interval). If the measured throughput is less than the target throughput, the controller increases the v-f setting, which results in higher throughput; if the measured throughput is greater than the target, the controller reduces the v-f setting to match the required throughput. This technique reduces power consumption by performing DVFS to deliver only the required throughput.

Figure 3-5. Closed-loop system representation.

We can model the throughput of a core, given by (44), as a linear function of its frequency, i.e., the input-output relationship of the system G_s can be represented by a linear function. Consequently, we can apply linear control techniques, which are simple, effective, and accurate enough for our purpose. Recall that the controller needs to be embedded in the PMU, and hence complex implementations are to be avoided. There are many options for the type of linear controller; in this work, we use a proportional-integral (PI) controller [77]. A PI controller is a special case of the proportional-integral-derivative (PID) controller that is very easy to implement, and usually easy to design for a first-order system [52][59]. The derivative component of the general PID controller may amplify the effect of noise, and thus it is not used in this work. To design the PI controller, we follow well-established control theory techniques [77].
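To make the design step concrete: once the loop's characteristic polynomial takes the form z^2 + ((K_p + K_I) IPC_avg - 2) z + 1 - K_p IPC_avg, the form of equation (57) below, placing the closed-loop poles at the 0.5 +/- 0.1i locations used in this work fixes both gains for a given plant gain IPC_avg. A sketch, with the helper name and the numeric IPC_avg value being ours:

```python
def pi_gains(ipc_avg, poles=(0.5 + 0.1j, 0.5 - 0.1j)):
    # Match z^2 + ((Kp + KI)*IPC_avg - 2) z + (1 - Kp*IPC_avg) against the
    # desired polynomial (z - p1)(z - p2), coefficient by coefficient.
    p1, p2 = poles
    b = -(p1 + p2).real              # z-coefficient of the target polynomial
    c = (p1 * p2).real               # constant coefficient
    kp = (1.0 - c) / ipc_avg         # from 1 - Kp*IPC_avg = c
    ki = (b + 2.0) / ipc_avg - kp    # from (Kp + KI)*IPC_avg - 2 = b
    return kp, ki

kp, ki = pi_gains(2.0)               # e.g. IPC_avg = 2.0 -> Kp = 0.37, KI = 0.13
```

With these gains the closed-loop polynomial reduces to z^2 - z + 0.26, whose roots are the desired 0.5 +/- 0.1i pole pair inside the unit circle.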
To use linear control techniques, we first linearize the relationship between the throughput of a core and its frequency around a frequency f_0, by replacing IPC(f) in (44) with its maximum value at f_0 as a fixed value (using the maximum value guarantees stability of the closed-loop system):

IPS = IPC(f_0) \cdot f    (53)

where IPC(f_0) is defined by (54), given that the set of tasks in the local queue has an average expected IPC of IPC_{avg}^{ncm} and an average CMF of CMF_{avg}; the value of IPC(f_0) is approximated at design time for the worst case:

IPC(f_0) = \max \frac{IPC_{avg}^{ncm}}{1 + t_c \cdot IPC_{avg}^{ncm} \cdot CMF_{avg} \cdot f_0}    (54)

The transfer function of a PI controller in the z-domain is:

G_c(z) = K_p + K_I \cdot \frac{z}{z - 1}    (55)

where K_p and K_I are coefficients to be determined based on the desired characteristics of the closed-loop system. Hence the transfer function of the closed-loop control system may be written as:

G_{cl}(z) = \frac{G_c(z) G_s(z)}{1 + G_c(z) G_s(z)}    (56)

with the corresponding characteristic equation [77]:

z^2 + ((K_p + K_I) \cdot IPC_{avg} - 2) \cdot z + 1 - K_p \cdot IPC_{avg} = 0    (57)

The solutions of this equation are the closed-loop poles of the system, whose placement in the z-plane determines the main characteristics of the system, such as its steady-state error, response time, overshoot, and stability. To guarantee stability of the loop, the poles should be placed inside the unit circle of the z-plane in the root locus of the system [77]. For our problem, the best placement of the poles is found to be at 0.5 ± 0.1i, which generates a relatively fast, low-overshoot step response, as shown in Figure 3-6.

Figure 3-6. a) Root locus and b) step response for the placed poles.

3.5 Experimental Results

We have developed a real-time simulator in C++ to implement and evaluate the proposed power management technique. The simulator models an N-way CMP with a shared L2 cache. It is an event-driven simulator, in which the triggering events are task arrival, task departure, decision, allocation, and sampling points.
It emulates the execution of the tasks on the cores based on their size, IPC^{ncm} and CMF; however, PMU decisions are made only from the size and MAR of tasks, estimating the IPC_{C,avg} values online. The PMU, GQ, LQs, and TDU are implemented in software. The configuration of the cores is as described in Table 3-1, which is based on the configuration of the UltraSPARC T2 (Niagara 2) processor [78]. The dynamic and idle power consumption of the CMP are modeled as in equation (47), with the coefficients matching the target processor's power characteristics.

The proposed PM technique shown in Figure 3-3 (called 3T-PM) was implemented, as well as a baseline PM algorithm. We found no prior work that tackles exactly the same power management problem that we have solved: for example, reference [51], which is the closest to our problem, ignores core consolidation and is based on a lifetime-fixed task-to-core assignment, and references [52][53] also lack a dynamic task-to-core assignment phase. A direct comparison with a specific prior work is therefore not possible. However, to evaluate 3T-PM's performance, we compare it to a baseline PM that can be seen as a modified version of the work presented in [51]. The baseline PM does not support core consolidation, nor does it classify tasks into IHS and ILS; round robin is used for its task assignment. Also, to study the efficiency of the control-theoretic feedback loop, the baseline PM comes with per-core open-loop DVFS capability, which is different from the non-control-theoretic closed-loop DVFS of [51]. In order to realize DVFS, the baseline PM utilizes the information about tasks to determine the frequency that satisfies the system throughput, and uses a higher core frequency value as a safety margin to account for the uncertainty of those values.

3.5.1 Task Generation

Tasks are randomly generated and sent to the CMP.
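A sketch of such a generator, following the distributions described next (exponential job sizes and inter-arrival times, uniform benchmark selection, and a +/-20% uniform disturbance on the nominal characteristics); the parameter values and the benchmark tuple are illustrative:

```python
import random

def generate_tasks(n, mean_size, arrival_rate, benchmarks, seed=0):
    # Exponential expected job sizes and inter-arrival times, a uniformly
    # drawn parent benchmark, and a +/-20% uniform disturbance applied to
    # the nominal (MAR, IPC_ncm, CMF) characteristics at issue time.
    rng = random.Random(seed)
    t, tasks = 0.0, []
    for _ in range(n):
        t += rng.expovariate(arrival_rate)           # arrival instant
        name, mar, ipc_ncm, cmf = rng.choice(benchmarks)
        size = rng.expovariate(1.0 / mean_size)      # expected job size
        disturb = lambda x: x * rng.uniform(0.8, 1.2)
        tasks.append((t, size, name,
                      disturb(mar), disturb(ipc_ncm), disturb(cmf)))
    return tasks

# Hypothetical run: 1M-instruction tasks arriving at 2 tasks/ms on average.
tasks = generate_tasks(1000, 1e6, 2.0, [("art", 0.16, 0.816, 0.0488)])
```

Keeping the arrival rate at or below the CMP's maximum processing capacity, as stated below, is what keeps the global queue stable on average.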
The expected job size and the inter-arrival time of tasks are assumed to be two independent random variables with exponential distributions with mean values E(s) and 1/E(λ), respectively. Note that, in order to avoid overflow, the average task arrival rate is set to at most the maximum processing capacity of the CMP; in other words, the mean task inter-arrival time is greater than or equal to the expected execution time (the mean expected job size divided by the product of the average IPC of tasks and the maximum core frequency) divided by N, the total number of cores. Each incoming task is assumed to be generated by one of the nine applications (benchmarks) given in Table 3-2. The MAR values of tasks are assigned based on the MAR of the SPEC2000 benchmarks [154]. The detailed micro-architectural task characteristics used to emulate task execution in our simulator, i.e., IPC^{ncm} and CMF, were extracted by SimpleScalar simulations and are shown in Table 3-2. The parent application of each incoming task is chosen by a discrete uniform random variable that selects among the nine benchmarks of Table 3-2 with equal probability, p = 1/9, so the characteristic values of the incoming tasks are drawn from Table 3-2 with equal probability. To model the uncertainty of the information about the tasks, we apply a ±20% uniform disturbance to the task characteristic values at runtime, before issuing each task to the CMP. Table 3-3 summarizes the parameters used in the simulations. Time intervals, hardware parameters, threshold values, and workload specifications were chosen experimentally for the framework configuration and the task set under test.

Table 3-1.
Configurations of the cores in the CMP system
  Pipeline stages:            8 (int), 12 (fp)
  Execution units:            2 INT units, 1 FP unit
  Issue queue size:           20
  Load/Store queue:           32/32
  L1 instruction/data cache:  16KB, 8-way / 8KB, 4-way, LRU
  L2 unified cache:           N*512KB, 16-way, 64B line
  Technology node/Vdd:        65nm, 1.5V
  Frequency:                  {200, 400, ..., 1600} MHz (200 MHz steps)
  Typical dynamic power:      8.9W @ f_max
  Typical leakage power:      2.9W

Table 3-2. Average characteristics of benchmarks used to generate tasks
  Benchmark   MAR    IPC_ncm   CMF
  Art         16%    0.816     0.0488
  Bzip        18%    0.902     0.0685
  Equake       7%    1.850     0.0065
  Gcc         14%    0.876     0.0392
  Go           9%    0.773     0.0188
  Gzip        26%    0.869     0.0865
  Mcf         23%    2.221     0.0629
  Mesa        13%    1.923     0.0272
  Twolf        8%    1.205     0.0172

Table 3-3. Simulation parameters
  Number of cores:            N = {4, 8, 16}
  f_unsl:                     800 MHz
  LQ / GQ size:               8 / 40, 80, 160
  T_d / T_a / T_s:            50 / 10 / 2 ms
  MAR_th:                     15%
  Q_d:                        60%
  E(s):                       1M instructions
  E(λ):                       1.0N, 1.9N [1/ms]
  Number of simulated tasks:  5000

3.5.2 Evaluation of the Proposed 3T-PM Algorithm

We compare the proposed 3T-PM algorithm to the baseline power management algorithm described earlier in this section, which supports neither core consolidation, nor task classification, nor a control-theoretic feedback loop. Figure 3-7 shows the average power consumption of the CMP system under the baseline solution and the 3T-PM solution. The experiments were done for three CMP configurations with N = 4, N = 8, and N = 16 cores, and under two system throughput constraints, low and high, corresponding to 30% and 80% of the maximum processing capacity of the CMP, respectively. On average, 3T-PM consumes 23% less power than the baseline PM.

Figure 3-7. Power consumption of 3T-PM vs. baseline for different configurations and arrival rates.
Figure 3-8 depicts the frequency waveforms set by the DVFS method of 3T-PM and by the baseline PM for one core, to compare the effect of PI-controller-based DVFS with open-loop DVFS. The throughput constraint is the same for both systems and is shown in blue in the figure; neither PM technique violates it. It can be seen from the figure that the core frequency used by 3T-PM is, on average, around 7% lower than the frequency used by the baseline technique. Note that in this example 3T-PM is about 17% more power-efficient than the baseline system.

Figure 3-8. Frequency waveforms used by 3T-PM and the baseline PM for the same throughput constraint.

In addition to power minimization, our PM also performs better in terms of performance than an improved version of the baseline that employs closed-loop DVFS as explained in Section 3.4.5. In particular, we considered very high task arrival rates that pushed the CMP to its processing capacity limit, hence resulting in a sizable task drop rate at the global queue of the CMP system. Under this scenario, our method shows an average of 18% lower task drop rate with 7% lower power consumption (the size of the GQ was set to the same value in both cases). The reason lies in the separation of IHS and ILS tasks to run on different cores in our method, which prevents unnecessary waiting of the IHS tasks behind the ILS tasks in the GQ. Figure 3-9 shows this effect, which can also be interpreted as the higher quality of service (QoS) of the 3T-PM solution compared to the baseline with feedback, under very high task arrival rates.
Finally, in the experiments we performed for various core counts (up to 16) and workload configurations, the power and performance overheads of 3T-PM are negligible. More precisely, the 3T-PM runtime at each tier is negligible compared to the epoch length, i.e., less than 1% of it. Since the algorithm is software-based, its power consumption overhead is linearly related to the ratio of the execution time of the PMU code to that of the applications; hence, the power dissipation overhead of 3T-PM is as insignificant as its runtime overhead.

Figure 3-9. Task loss rate improvement due to the task classification step in the 3T-PM solution (task drop rate of the proposed heuristic versus the baseline method for 4, 8, and 16 cores).

3.6 Summary

We formulated the problem of minimizing the power consumption of a chip multiprocessor system under an average throughput constraint. DVFS and core consolidation, along with task assignment methods, are employed as part of our solution framework. In particular, we introduced a hierarchical global power manager comprised of three tiers: core consolidation and coarse-grain DVFS at the top tier, assignment of tasks to available cores considering server and task affinities at the mid tier, and closed-loop feedback-based per-core DVFS at the low tier. Comparison of this technique to a baseline showed a 23% power saving for our technique, which also resulted in some 18% lower task drop rate under stringent throughput constraints for the target CMP system.

Chapter 4. PERFORMANCE OPTIMIZATION OF CHIP MULTIPROCESSORS UNDER POWER AND THERMAL CONSTRAINTS

4.1 Introduction

Power dissipation and die temperature are the main design concerns and key performance limiters in today’s high-performance multi-core processors. In the previous chapter, we presented a hierarchical dynamic power management technique that solves CMP power optimization under a throughput constraint.
However, die temperature is an equally important constraint as power dissipation. While design-time approaches exist, the dynamic solution is to utilize a power and thermal management unit that takes into account the power, performance, and temperature of the processor cores and makes decisions that optimize performance (or power, or both). On the other hand, as CMOS technology scaling continues, spatially correlated intra-die process variations result in higher core-to-core (C2C) power and performance variations. These variations, along with device and interconnect aging effects, motivate the need to design and deploy robust power management solutions. It is in this context that we tackle the problem of optimizing the system performance (throughput) and power efficiency of CMPs under thermal considerations and subject to different sources of variation.

In this chapter, we consider a CMP performance optimization problem that seeks to maximize the CMP throughput under variations in the system workload and in the fabrication characteristics of the cores, while the total CMP power consumption is bounded by a given power budget and the die temperature (at the on-chip sensors located on the cores) is maintained below a critical temperature. We propose a hierarchical power and thermal management (PTM) solution for this problem, which utilizes DVFS and core consolidation and employs a feedback-loop controller. Our proposed solution, called the Variation-aware Power/Thermal Manager (VPTM), solves the core consolidation problem in a higher tier by a greedy (steepest descent) algorithm, decides the DVFS settings of the cores in a lower tier by solving a convex optimization problem, and fine-tunes the DVFS settings of the cores in a third tier (the lowest tier) through a set of parallel closed-loop feedback controllers. The main contributions of this work are as follows:

- We propose the use of a combination of core consolidation and DVFS to satisfy the power and thermal constraints of a CMP.
- We present throughput, power, and thermal models and formulate the PTM problem as a convex optimization problem, relying on core consolidation and DVFS as optimization knobs.
- We propose using a simple monitoring unit to predict workload conditions, along with a closed-loop feedback controller that compensates for prediction uncertainties and variations in key system parameters. The feedback controller dynamically chooses to control power, temperature, or performance.
- We show that core consolidation and DVFS can boost the total performance of a CMP with no impact on the maximum temperature, for a given power budget.

The remainder of this chapter is organized as follows. In section 4.2, we provide a brief review of prior power and thermal management techniques. In section 4.3, we present the preliminaries of this work, such as the CMP power and thermal models. The problem formulation is presented in section 4.4, and our proposed solution is explained in section 4.5. Section 4.6 is dedicated to the experimental results, and section 4.8 concludes and summarizes the chapter.

4.2 Prior Work

The problem of optimizing the system performance (throughput) and global power consumption of a CMP with thermal considerations, in a framework subject to different sources of variation, is an important problem in high-end servers and hosting datacenters [68][52][53][107][100]. Several power and thermal management techniques are surveyed in [68][51]. In particular, the authors of [51] present several DVFS-based techniques to maximize the throughput of a homogeneous CMP under a power budget, but they neglect thermal constraints and do not exploit core consolidation, relying purely on DVFS. In the area of dynamic thermal management for CMPs, the authors of [97] and [98] suggest techniques to minimize power density or temperature hot spots by judiciously scheduling jobs or migrating them from core to core; these fall under task scheduling techniques.
The authors of [100] study several effective methods, such as temperature-tracking frequency scaling, migrating computation to spare hardware units, and a combination of fetch throttling and DVS. All of these techniques can be utilized by heuristic solutions that optimize performance under a thermal constraint. Reference [99] provides an extensive background on thermal management fundamentals, including various triggering mechanisms as well as circuit-level (e.g., DFS and DVFS) and microarchitectural (e.g., decode throttling and cache toggling) response mechanisms to prevent or resolve thermal emergencies, and then explores the tradeoffs between these mechanisms for responding to periods of thermal trauma.

The work presented in [143] inspired a part of our work. Its authors mathematically formulate the problem of speed scaling in multiprocessors under a thermal constraint and show that it is a convex optimization problem. They model the dissipated power of a processor as a positive, strictly increasing, convex function of the speed, namely a cubic function. Their suggested approach is an optimal mathematical solution to the formulated convex problem. However, that work considers neither leakage power consumption and its temperature dependency nor, more importantly, turning off cores to help mitigate thermal issues.

Regarding variation-aware power management algorithms, two works are fairly close to ours. The authors of [68] examine process variation in a CMP and focus on the core-to-core variation of power consumption. They suggest turning off cores that consume power in excess of a certain computed value, with the goal of maximizing the chip-wide performance/power ratio. The authors of [5] suggest variation-aware algorithms for application scheduling and power management.
One such power management algorithm, called LinOpt, uses linear programming to maximize throughput at a given core power budget through voltage and frequency scaling in a 20-core CMP. However, this work does not consider the temperature constraint, the dependence of leakage on temperature, or core consolidation to save power and reduce overheating of cores, all of which play a significant role in determining energy-efficient operation. Moreover, control theory-based solutions for thermal management of CMPs have been reported, such as Model-Predictive Control based solutions, which can be effective in tackling variations in manufacturing and operating conditions [144][145].

4.3 Preliminaries

4.3.1 Throughput Model

The throughput of a CMP is defined as the total number of executed instructions per second and is denoted by IPS. The throughput of each core may be defined similarly. Clearly, each core’s throughput is a function of its operating frequency. If core i, running at frequency f_i, executes tasks with known characteristics, its throughput can be estimated as

IPS_i = IPC_i · f_i    (58)

where IPC_i denotes the instructions-per-cycle (IPC) of all tasks running on core i. The IPC of a task on core i, IPC_{i,j}, depends on the characteristics of the task, its memory access pattern, the executing processor architecture, branches, etc. IPC_{i,j} can be profiled for the target core architecture and/or adaptively monitored at runtime. We assume that there are plenty of idle cycles in memory-bound tasks with low IPC values; hence, two or more such tasks can be consolidated on a multithreaded core to maximize core utilization (i.e., minimize the idle cycle count). Therefore, the total number of committed instructions in a fixed time interval is equal to the summation of the instruction counts of the individual tasks when running alone.
In other words, given an assignment of some tasks to core i (done by a simple task assignment algorithm like round-robin or a more complex algorithm, e.g., [116]), the overall IPC of the core may be written as

IPC_i = Σ_{j ∈ tasks(i)} IPC_{i,j}    (59)

Consequently, the CMP throughput may be calculated as the summation of the throughputs of all cores:

IPS_CMP = Σ_i IPC_i · f_i    (60)

4.3.2 Thermal Model

We model the relationship between the die temperature and the power dissipation of a core using the thermal model presented in [79]. In our thermal model for the CMP, as illustrated in Figure 4-1, each node (and the corresponding on-chip temperature sensor) represents exactly one core. To improve the precision of this thermal model at the cost of increased problem complexity, one can consider multiple temperature sensor points inside each core, e.g., located at the functional units and the register file.

Figure 4-1. Thermal model of a CMP.

Let θ_i(t) denote the temperature of core i at time t, and let θ(t) = [θ_i(t)] (i = 1, …, N) denote the vector of temperature readings of all cores at time t. Let P_i denote the total power consumption of core i, and let G_ij and G_i represent the thermal conductance between cores i and j and between core i and the ambient, respectively (the ambient temperature is assumed to be constant). Using this thermal model, equation (61) calculates the temperature vector at time t+1, given the temperature vector, θ(t), and the average power consumption vector, P(t), at time t. Note that this calculation is performed periodically; hence, t+1 means one time epoch later than time t.

θ(t+1) = A · θ(t) + B · P(t)    (61)

Note that A and B are matrices containing empirical regression coefficients. In this model, the die temperature of a core depends on the die temperatures of the other cores but only on the power consumption of the core itself; hence, B is a diagonal matrix. The temperature of any core in the CMP should not go beyond a critical temperature, denoted by θ_crit, which is normally provided in the CMP datasheet.
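The epoch-by-epoch update of equation (61) can be sketched as follows. This is a minimal sketch: the matrices A and B and the temperature and power values below are small assumed numbers for illustration, not regression coefficients from the dissertation.

```python
# Sketch of the linear thermal update theta(t+1) = A*theta(t) + B*P(t), eq. (61).
# A couples cores to each other; B is diagonal because a core's power heats
# only that core directly. All coefficient values here are assumed.

def thermal_step(A, B, theta, P):
    """One epoch of the linear thermal model for N cores (plain lists)."""
    N = len(theta)
    return [sum(A[i][j] * theta[j] for j in range(N)) + B[i] * P[i]
            for i in range(N)]

# Two-core example: weak inter-core thermal coupling, diagonal B.
A = [[0.90, 0.05],
     [0.05, 0.90]]
B = [0.8, 0.8]            # degC per watt per epoch (assumed)
theta = [50.0, 55.0]      # current sensor readings [degC]
P = [10.0, 12.0]          # average core powers over the epoch [W]

theta_next = thermal_step(A, B, theta, P)   # -> [55.75, 61.6]
THETA_CRIT = 85.0
ok = all(t <= THETA_CRIT for t in theta_next)   # the check behind (62)/(63)
```

Running the same check before committing a DVFS decision is essentially what the manager does with (63): it predicts θ(t+1) from the candidate power vector and rejects settings that would exceed θ_crit.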
Equation (62) is the thermal constraint applied to each individual core, and equation (63) is its matrix representation in our thermal model; θ(t) is the vector of temperature sensor readings of the cores.

∀i: θ_i(t+1) ≤ θ_crit    (62)

A · θ(t) + B · P(t) ≤ θ_crit · 1    (63)

(All vector variables are shown in boldface fonts throughout this chapter, while scalar variables are in regular fonts.) Note that the values of the elements of matrices A and B are subject to modeling errors and process-induced variations. In spite of these inaccuracies, we will use matrices A and B to make a decision about the coarse-grain DVFS settings of the cores; a closed-loop controller will then update the DVFS settings in order to avoid any thermal violations (or power budget or task-level throughput violations) due to the aforesaid inaccuracies.

4.3.3 Voltage and Frequency Relationship

It has been shown that for VLSI circuits, the relationship between circuit delay, supply voltage, V_dd, and temperature, θ, can be expressed as [82]

t_delay ∝ (V_dd · θ^μ) / (V_dd − V_th)^α    (64)

where V_th is the threshold voltage, and α and μ are technology-dependent empirical values, obtained by SPICE simulation as α = 1.19 and μ = 1.2 for a 65nm technology. Hence, the maximum frequency, f, of a core operating at voltage v and temperature θ can be calculated as

f = k_v · (v − V_th)^1.19 / (v · θ^1.2)    (65)

where k_v is the constant of proportionality. However, in today’s high-performance processors, the manufacturer predefines the relationship between the operating voltage and frequency of the processor. For instance, Table 4-1 lists the P-states and their associated frequency and supply voltage values for a recent AMD high-performance six-core processor [149]. Similarly, Table 4-2 shows the operating points of an Intel Pentium M processor, in terms of its supply voltage level and the corresponding clock frequency, provided by its DVFS technology [146], called Enhanced Intel SpeedStep (EIST).

Table 4-1.
Voltage and frequency relationship in the AMD six-core processor OS4176OFU6DGO

P-state | Frequency [MHz] | Voltage [V] (min) | Voltage [V] (max)
P0      | 2400            | 0.9125            | 1.1875
P1      | 2100            | 0.8875            | 1.1625
P2      | 1600            | 0.8375            | 1.1
P3      | 1200            | 0.8               | 1.0625
P4      | 800             | 0.7875            | 0.825

Table 4-2. Voltage and frequency relationship in the Intel Pentium M processor

Frequency [MHz] | Voltage [V]
1600            | 1.484
1400            | 1.42
1200            | 1.276
1000            | 1.164
800             | 1.036
600             | 0.956

Figure 4-2. Linear relationship of supply voltage and clock frequency in modern processors (linear fits for the Intel Pentium M, AMD quadcore ’07, AMD six-core ’10, and AMD 12-core ’10, with R² = 0.992, 1, 0.9776, and 0.9684, respectively).

Figure 4-2 plots the data shown in the above tables, together with similar data extracted from other AMD processors’ datasheets [149]. As can be seen in this figure, one can approximate the relationship between a processor’s supply voltage and its corresponding clock frequency with a linear function, e.g., equation (66). This linear approximation results in a mean square error of at most 4% in the case of the Pentium M and the AMD multicore processors.

v = c_1 · f + c_0    (66)

4.3.4 Power Consumption Model

The power consumption of a CMP is the summation of the core power dissipations (“the core power”) plus the power dissipation of the other shared components on the chip, e.g., high-level caches, the memory controller, and other integrated controllers (“the uncore power”). The power manager controls the core power by changing the voltage and frequency settings of the cores. The power dissipation of a core is comprised of dynamic power and leakage power, as given below:

P_i = P_dyn(v, f) + P_leak(v, θ) = α · C_eff · v² · f + β · θ² · v · exp(−q · V_th / (n · k · θ))    (67)

where θ denotes the die temperature; C_eff, q, V_th, β, n, and k are technology and circuit specific parameters, which can be assumed to be constant; and α is the activity factor, which depends on the workload.
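The linear approximation of (66) can be checked directly against the Pentium M operating points of Table 4-2. The ordinary least-squares fit below (pure Python, no external libraries) reproduces the R² ≈ 0.992 reported for the Pentium M in Figure 4-2:

```python
# Ordinary least-squares fit of supply voltage vs. clock frequency for the
# Intel Pentium M operating points of Table 4-2.

freq = [1600, 1400, 1200, 1000, 800, 600]         # MHz
volt = [1.484, 1.42, 1.276, 1.164, 1.036, 0.956]  # V

n = len(freq)
f_mean = sum(freq) / n
v_mean = sum(volt) / n
sxx = sum((f - f_mean) ** 2 for f in freq)
sxy = sum((f - f_mean) * (v - v_mean) for f, v in zip(freq, volt))

slope = sxy / sxx                  # ~0.00056 V per MHz (c_1 in eq. (66))
intercept = v_mean - slope * f_mean  # ~0.61 V (c_0 in eq. (66))

ss_tot = sum((v - v_mean) ** 2 for v in volt)
ss_res = sum((v - (slope * f + intercept)) ** 2 for f, v in zip(freq, volt))
r2 = 1 - ss_res / ss_tot           # ~0.992, matching Figure 4-2
```

The positive slope and near-unity R² are what justify treating v as an affine function of f in the power model that follows.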
To reduce the complexity of the dynamic power and thermal management (DPTM) problem, we sacrifice the unnecessary accuracy of the power consumption model given by (67) and break the recursion between die temperature and power consumption: due to the dependency of leakage power on temperature, the die temperature and the core power consumption depend on each other. We neglect the interaction of voltage and temperature and assume that they affect the leakage power of a core independently, using a linear approximation in order to make the computation tractable:

P_leak(v, θ) ≈ k_θ · θ + P_leak,0(v)    (68)

On the other hand, using the linear model for frequency given in (66), the dynamic power consumption is cubically dependent on the core’s clock frequency, f. Also, we assume that the leakage power of a circuit is linearly proportional to the core’s voltage, v [22], and hence to its frequency, f. Therefore, the total power consumption of a core can be modeled as

P_i = d · f_i³ + l · f_i + k_θ · θ_i    (69)

where d, l, and k_θ are empirical coefficients for the dynamic power consumption and for the temperature-independent and temperature-dependent components of the leakage power dissipation, respectively. These coefficients depend on the CMP implementation and fabrication technology parameters, and their values can be determined from measurements. Note that coefficient d captures the switching activity of the circuit, which varies as a function of the workload running on the core. The vector form of the above equation is

P = D · f³ + L · f + K_θ · θ    (70)

in which f (an N×1 column vector) is the vector of clock frequencies of the cores, and the exponentiation in f³ is an element-wise operation that returns a column vector. D, L, and K_θ are the diagonal matrices of the coefficients d_i, l_i, and k_θ,i of each core, i.e.,

D = diag(d_1, …, d_N), L = diag(l_1, …, l_N), K_θ = diag(k_θ,1, …, k_θ,N)    (71)

Note that the diagonal elements of each matrix are identical for homogeneous CMP architectures (since all the cores are identical).
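The simplified per-core model of (69)–(70) can be sketched as below for a homogeneous CMP. The coefficient values d, l, and k_θ are assumed illustrative numbers (in a real flow they are fitted from measurements, as the text notes); frequency is expressed in GHz so the cubic term stays in a realistic watt range.

```python
# Sketch of the per-core power model P_i = d*f^3 + l*f + k_theta*theta, eq. (69).
# Coefficients are assumed illustrative values, identical across cores
# (homogeneous CMP, eq. (71)).

def core_power(f_ghz, theta, d=2.0, l=1.5, k_theta=0.05):
    """Total core power [W]: cubic dynamic term + linear leakage terms."""
    return d * f_ghz ** 3 + l * f_ghz + k_theta * theta

def cmp_power(freqs, thetas):
    """Element-wise form of eq. (70) summed over all cores."""
    return sum(core_power(f, th) for f, th in zip(freqs, thetas))

p_low = core_power(1.0, 60.0)   # 2 + 1.5 + 3   = 6.5 W
p_high = core_power(2.0, 60.0)  # 16 + 3 + 3    = 22 W
# For fixed theta the model is convex and strictly increasing in f, which is
# what makes the tier-two DVFS problem a convex program.
```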
However, if we consider the effect of process variation, which leads to different characteristics of the cores even in homogeneous architectures, then the coefficients no longer match. Using the models given in this section, we mathematically formulate the dynamic power and thermal management problem in the next section.

4.4 Problem Formulation

As mentioned earlier, the problem we target in this chapter is an optimization problem for CMPs that considers the effect of variations in circuit parameters and workload conditions. However, these effects significantly increase the complexity of the problem formulation; one would have to consider probability distribution functions to model the power consumption and maximum core frequencies, as well as the coefficients in the thermal model. Therefore, we first ignore parameter variations and present a well-formulated mixed-integer program to model our problem, called the ideal-condition formulation. Next, we handle the variations by applying improvements to our proposed efficient solution to this ideal-condition formulation.

Consider an N-way CMP system composed of N independent processing cores. Each core has a separate supply voltage and clock generation module, so that the cores can potentially run at different voltage-frequency (v-f) settings, through a supervisory process called per-core DVFS [73]. Note that coordinated voltage scaling, in which all cores on the CMP must run at the same voltage level at any given time, is the most common form of dynamic voltage scaling support in today’s CMPs. It is, however, expected that future CMPs will put each core (or cluster of cores) in its own voltage island and thus allow independent voltage scaling of the cores (or clusters of cores). Furthermore, once a core’s supply voltage level is set, the maximum frequency at which the core can run is determined.
However, we may choose to run some cores at a lower frequency than this allowed maximum value, which gives rise to cores that run at the same supply voltage level but different clock frequencies. In the following, however, we focus on a scenario whereby each core has its own supply voltage feed and voltage island, and hence can be voltage-scaled independently of the other cores. Furthermore, each core runs at the maximum clock frequency supported by its specified supply voltage level.

A power and thermal management unit (PTM) manages the core consolidation and DVFS settings of the cores, according to the measured core temperatures and powers. More precisely, the PTM seeks to determine the set of ON cores and assign their voltage and frequency levels such that the total CMP throughput is maximized while the following constraints, in order of importance, are met: 1) the temperature of no core exceeds the critical temperature (thermal constraint), 2) the total CMP power budget is met (power constraint), and 3) the minimum throughput requirement of each task is met (task-level IPS constraint). We assume that these constraints are set sensibly and can generally be met, i.e., satisfying one constraint does not necessarily require violating another. These constraints do not, however, always comply with one another. For instance, one can imagine a non-optimal solution where inefficient execution of tasks (e.g., due to a poor task-to-core assignment algorithm) wastes the power budget without satisfying all IPS constraints. Another, more common conflict may happen when the thermal constraint forces the execution of a task to stop, temporarily causing a throughput constraint violation. The inter-core temperature dependency is another source of complication in this problem, as it cannot easily be handled by simple heuristics. We formulate this problem as a mixed-integer program later in this section.
But first, we need to define one more variable to express core consolidation. We define an assignment (mapping) parameter, m_ij, a binary variable that represents the assignment of task j to core i:

m_ij = 1 if task j is assigned to core i, and 0 otherwise    (72)

∀j: Σ_i m_ij = 1    (73)

Note that a core can be turned off only if there are no tasks assigned to it, i.e.,

core i may be turned off only if Σ_j m_ij = 0    (74)

With the above definition, the total CMP throughput can be rewritten as

IPS_CMP = Σ_i Σ_j m_ij · IPC_{i,j} · f_i = f^T · (M ∘ IPC) · 1    (75)

where the symbol ∘ denotes element-wise multiplication of two matrices or vectors, f is the vector of core frequencies, IPC is the matrix of IPCs of tasks when executed on each core, and M is the matrix of task-to-core assignment variables, m_ij.

Finally, the per-task minimum throughput constraints are represented by (76); IPS_req,j denotes the minimum required IPS of task j, which is zero when no such requirement exists. When more than one task is assigned to a core, the core’s required IPS equals the summation of those of all tasks assigned to it:

IPS_req,i = Σ_j m_ij · IPS_req,j    (76)

f ∘ ((M ∘ IPC) · 1) ≥ M · IPS_req    (77)

in which IPS_req is the vector of required IPS values of the tasks. The problem statement can now be written as the mixed-integer program shown in (78), which applies to both homogeneous and heterogeneous CMP architectures. In this formulation, the objective is to maximize throughput using the model in (75), and the constraints are:

- the thermal constraint given by (63),
- the total CMP power budget, with the power model given by (70),
- the per-task throughput constraint given by (77),
- the constraint on the maximum and minimum frequency limits,
- and the constraint on the completeness of the task assignment, equivalent to (73).

maximize  IPS_CMP = f^T · (M ∘ IPC) · 1
subject to:
  A · θ(t) + B · P ≤ θ_crit · 1
  f ∘ ((M ∘ IPC) · 1) ≥ M · IPS_req
  1^T · P ≤ P_budget
  f_min ≤ f ≤ f_max
  1^T · M = 1^T
  P = D · f³ + L · f + K_θ · θ    (78)

The above problem formulation is a mixed-integer program and falls into the class of NP-hard problems. A linear-time solution to a mixed-integer program can be found only when the coefficient matrix is totally unimodular [155]. Even if we approximate the nonlinear functions with linear ones, the coefficient matrix is not totally unimodular, and hence the problem is NP-hard. In the following section, we present our solution, which converts the problem of (78) into a simpler one by handling some constraints separately and then solves the modified problem efficiently. More importantly, as mentioned earlier, due to the inherent uncertainty of the coefficients and measured parameters, even if an efficient optimal solution to the formulation of (78) could be found, it would not be directly applicable to CMPs. The inaccuracy of the parameters requires the solution to implement a feedback mechanism, so that it adaptively updates the solution according to observed inaccuracies.

Note that a task may have other resource needs, including a certain amount of level-one cache or enough bandwidth to the shared level-two cache, in order to exhibit an acceptable minimum level of performance. It is easy to add similar constraints to the problem formulation to ensure that only tasks whose resource needs can be met will be consolidated onto the same physical core. Also note that the problem of (78) can be reduced to several different problems, such as the task assignment problem or optimal DVFS assignment, by appropriately setting the constraints. For instance, by fixing the frequencies and dropping the temperature and power constraints, the problem reduces to task assignment. If we set f_min = f_max in (78), the frequency of each core becomes the fixed value f = f_min = f_max.
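For intuition, the task-assignment reduction just described (frequencies fixed, temperature and power constraints disabled) can be solved by exhaustive search on a toy instance. The IPC matrix and IPS requirements below are made-up numbers for illustration, and per-task IPS is computed as if each task ran alone, per the model's instruction-counts-add assumption:

```python
# Brute-force solution of the task-assignment reduction of (78): fixed core
# frequencies, no thermal/power constraints; maximize total IPS subject to
# per-task minimum-IPS requirements. All input numbers are illustrative.

from itertools import product

f = [2000, 1600]                    # fixed core frequencies [MHz]
ipc = [[1.2, 0.9],                  # ipc[i][j]: IPC of task j on core i
       [1.0, 1.1]]
ips_req = [1500, 1000]              # per-task minimum IPS [MIPS]

def best_assignment(f, ipc, ips_req):
    n_cores, n_tasks = len(f), len(ips_req)
    best, best_ips = None, -1.0
    # Constraint (73): each task goes to exactly one core, so enumerate
    # all n_cores ** n_tasks assignment maps.
    for assign in product(range(n_cores), repeat=n_tasks):
        ips = [ipc[assign[j]][j] * f[assign[j]] for j in range(n_tasks)]
        if all(ips[j] >= ips_req[j] for j in range(n_tasks)):
            total = sum(ips)
            if total > best_ips:
                best, best_ips = assign, total
    return best, best_ips

assign, total = best_assignment(f, ipc, ips_req)
# -> both tasks land on the fast core here, with total IPS of 4200 MIPS.
```

The exponential enumeration is exactly why the full problem (78), which adds the thermal and power couplings on top of this, is handled heuristically rather than exactly in the proposed solution.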
By setting θ_crit and P_budget to infinity (or very large numbers), we effectively disable the temperature and power constraints, respectively. The modified formulation is then equivalent to the task assignment problem, in which the objective is to find the optimal assignment of tasks to cores such that a minimum performance level is maintained for every task and every task is assigned to one and only one core.

4.5 Proposed Solution

We propose a hierarchical power and thermal management solution to the problem of (78), called the Variation-aware dynamic Power and Thermal Manager (VPTM). VPTM breaks this mixed-integer problem into a hierarchical pseudo-Boolean assignment problem and a continuous-domain optimization problem. It performs the assignment at a higher level of the hierarchy and, after eliminating the pseudo-Boolean assignment variables, solves a convex nonlinear program at a lower level of the hierarchy. Finally, a closed feedback loop tracks the target parameters of the system to ensure that the constraints are met in spite of variations in system characteristics; since the optimal solution of the problem is found for input values that are subject to uncertainty and variability, applying the resulting policies to the actual workload could otherwise cause violations of the temperature, power, or performance constraints.

Figure 4-3 illustrates the architecture of the proposed VPTM. It consists of four sub-modules: a tier-one manager (T1-PTM), a tier-two manager (T2-PTM), a proportional-integral feedback controller [77], and a workload analyzer unit (WAU). T1-PTM performs a constructive task assignment, i.e., given an initial task assignment, it improves on it by migrating tasks between cores and turning cores on or off.
More precisely, T1-PTM identifies the cores to be turned on/off in order to increase the power efficiency of the CMP (core consolidation) and resolves thermal emergencies (when the die temperature reaches the critical temperature threshold). T2-PTM sub-optimally decides the frequencies of the cores for the next time epoch; it also calculates the set point of each core’s controller for the next epoch. The proportional-integral (PI) feedback controller fine-tunes the core DVFS settings using actual measurements at runtime. The WAU analyzes the workload, predicts its characteristics (i.e., the IPCs of the tasks) for the next epoch, and provides them to T1-PTM and T2-PTM.

The coordination between the hierarchical layers of VPTM keeps the system stable and prevents unnecessary transitions of cores between active and sleep modes, which carry power and performance overheads. T1-PTM receives feedback from T2-PTM, through the WAU, which provides the expected values of the cores’ frequencies, temperatures, and IPCs based on their measured values. Before T1-PTM makes a decision, it evaluates the response of T2-PTM to a candidate decision on the predicted data (by running T2-PTM for the candidate configuration) and estimates the quality of the optimal T2-PTM result. Hence, if the variation is not extreme, the solution found by T2-PTM is consistent with T1-PTM’s estimate. Furthermore, while T2-PTM exploits frequency scaling to optimally solve the optimization problem, if it fails to meet any constraint, especially the thermal constraint, a signal is triggered in T1-PTM to resolve the thermal emergency through core consolidation and shut-down mechanisms. T1-PTM then takes this into account while re-assigning or migrating the tasks. Similarly, T2-PTM and the PI controller work in coordination; T2-PTM sets the desired type of PI controller and its target value, and the controller’s job is to maintain the target parameter. However, the sensor readings of all parameters (i.e.
core’s IPC, power, and temperature) are always fed back to T2-PTM. Based on this feedback, T2-PTM may choose another PI controller type and set its target differently, so as to make it feasible for the controller to meet all constraints in their order of importance. Note that due to the uncertainty of the parameters in the formulation of T2-PTM, we impose a tighter thermal constraint to be safe. More precisely, the thermal constraint in T2-PTM is that no core temperature may exceed a threshold temperature, denoted by θ_th, where θ_th < θ_crit.

The role of the WAU is to continuously monitor the actual IPCs of the tasks (using the cores’ performance counters [70]) and to provide a prediction of the IPCs to the other blocks of VPTM. Dynamic IPC prediction is in fact a challenging problem [151][152] and is out of the scope of this work; utilizing accurate prediction algorithms (such as the one described in [152]) improves the quality of the predicted values at the cost of increased implementation complexity and power and performance overhead. In this work, we use a moving average filter as the prediction method; it is fast, simple, and lightweight, yet it reduces the estimation error caused by workload variation by using recent history in its prediction. More precisely, the WAU continuously measures the actual IPCs of the tasks and applies a moving average filter to the past measured data to predict the IPC values for the next time epoch; this predicted IPC data is then passed on to T1-PTM and T2-PTM. We will explain the details of each tier in the following subsections.

Figure 4-3. Block diagram of VPTM.

Figure 4-4 summarizes the steps of the proposed VPTM:

1 Predict IPCs by using a moving average (MA) method
2 Tier 1: Do consolidation + thermal emergency algorithm
3 Tier 2: Do coarse-grain DVFS
3.1 Run the pre-processing feasibility verification step
3.2 Solve the convex problem of (80)
4 Determine the objectives and set points of the PI controllers
5 Tier 3: Engage a PI controller (fine-grain DVFS)

Figure 4-4.
An online algorithm for VPTM.

[Figure 4-3 block diagram: the Workload Analyzer (WAU) feeds predicted IPCs to Tier 1 (core consolidation and thermal emergency handling, which turns cores on/off and reports the IDs of ON cores) and to Tier 2 (coarse DVFS and power budgeting, which selects the P/IPS/θ objective and frequency set points for the PI controller performing fine DVFS); power, temperature, and IPC measurements are fed back, with the total power budget and required task IPS as external inputs.]

4.5.1 Tier-One Manager – Core Consolidation

In tier one of VPTM, T1-PTM, we adopt a simple heuristic to perform the core consolidation decision and avoid thermal emergencies (when a core temperature rises above θ_crit) at the beginning of each decision epoch of duration T_1. As mentioned before, T1-PTM carries out a constructive task assignment algorithm: given an initial assignment of tasks to cores, T1-PTM migrates tasks between cores and consolidates/deconsolidates them, so as to be able to turn cores on/off to enhance the CMP’s power efficiency. The initial task assignment has been extensively studied in the literature [107][150][98] and falls outside the scope of this work. We use the power-aware assignment algorithm presented in [107], called VarF&AppIPC. In this algorithm, the task with the highest IPC is mapped to the core with the highest maximum frequency, and so on. Note that the maximum frequencies of the cores differ both in heterogeneous CMPs and when we consider inter-core process variation.

The key idea of core consolidation is to group low-IPC tasks that may be running on two or more cores onto one core whenever possible, and to turn off the other cores (or put them in a deep sleep state), resulting in noticeable power savings. Assuming fast thread switching support (similar to the Sparc family and Niagara architectures [147]), the performance overhead of task consolidation is negligible.

The proposed core consolidation algorithm for T1-PTM is a greedy (steepest-ascent) hill-climbing algorithm that seeks to reach a locally optimal solution by gradually moving toward the optimum point within a neighborhood of the solution space.
More precisely, the proposed algorithm explores the neighbors (in terms of the number of ON cores) of the current system configuration, and if it finds a better solution (yielding higher CMP throughput while meeting the power budget), it chooses and enforces that solution for the next decision epoch. T1-PTM relies on the quality of the estimates of CMP throughput, power dissipation, and die temperature at the end of the current decision epoch, computed from the predicted data provided by the WAU. A neighbor of the current solution is defined as one of the following three cases: (i) a solution with the same number of active (also called ON) cores, (ii) a solution with one more active core, or (iii) a solution with one fewer active core. To perform core consolidation, T1-PTM calculates the IPC of each core as a weighted summation of the IPC’s of all tasks running on that core. Next, it sorts the active cores in ascending order of their IPC values; if the two cores with the lowest IPC values can be consolidated without violating the per-task IPS requirements of the corresponding tasks, this consolidation is a valid candidate action. For simplicity, we assume that the pre-characterization of tasks covers combined tasks as well as individual tasks running on any core; here we only consider consolidation of up to two tasks on a core. More precisely, a look-up table provides the average IPC when a specific task or a pair of tasks is executed on a specific core, i.e., a tuple of the form (task1, task2, core) is the key of this look-up table, which holds the average IPC values (note that the task2 field can be empty.) On the other hand, if a core is running more than one task and, even at its maximum frequency, the total IPS requirement is not met (with a safety margin of 10%), or a violation of the thermal constraint may occur, the core is a candidate for migrating one or more of its tasks to some other core (we call this process “de-consolidation”.)
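The candidate test at the heart of this step — consolidate the two lowest-IPC tasks onto one core only if that core, at its maximum frequency, still meets their combined IPS requirement — can be sketched as follows. Here `combined_ipc` stands in for a lookup into the pre-characterized (task1, task2, core) table; all names are illustrative.

```cpp
#include <algorithm>
#include <vector>

// One step of the T1-PTM candidate search: can the two lowest-IPC tasks be
// consolidated onto a single core without violating their combined IPS
// requirement?
struct Task { int id; double ipc; double ips_target; };

bool can_consolidate(std::vector<Task> tasks,
                     double combined_ipc,  // IPC of the two tasks sharing one core
                     double core_fmax) {   // max frequency of the target core (Hz)
    if (tasks.size() < 2) return false;
    // Sort tasks in ascending order of IPC; the first two are the
    // consolidation candidates.
    std::sort(tasks.begin(), tasks.end(),
              [](const Task& a, const Task& b) { return a.ipc < b.ipc; });
    // Valid candidate iff the shared core, at full speed, still delivers the
    // sum of the two per-task IPS requirements.
    double required_ips = tasks[0].ips_target + tasks[1].ips_target;
    return combined_ipc * core_fmax >= required_ips;
}
```

In the full algorithm this predicate is evaluated for both candidate target cores (core_S1 and core_S2), and the de-consolidation test is the mirror image with the 10% safety margin applied.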
Next, T1-PTM estimates the power and performance of the candidate configurations for the next time epoch. Note that migration of tasks between cores has latency and energy overhead, which is taken into account when considering consolidation and de-consolidation actions. The time complexity of the proposed heuristic is O(m · n log n), where m denotes the number of running tasks and n denotes the number of ON cores. The other function of T1-PTM is handling critical thermal emergencies. As described in section 4.5, tier two of VPTM, T2-PTM, finds the optimal DVFS configuration of the active cores such that all constraints, including the thermal constraint (that no core temperature rises above θ_th), are satisfied at the beginning of a timing window of size T_2 = T_1/k, where k is a small natural number, say 2 or 3. Under this hierarchical control architecture, it is possible that the tier-two DVFS is not capable of keeping a core’s temperature below the threshold value, θ_th, even when running the core at its minimum possible voltage and frequency setting; in this case, the core temperature approaches the critical temperature value, θ_crit. This is because T2-PTM cannot turn off any core, since that decision is reserved for T1-PTM, which runs once per decision epoch. This imposes limits on T_1 and T_2, since we have to ensure that, even in the worst case, the temperature of a core cannot rise from θ_th to θ_crit in time T_1 if we set its voltage and frequency levels to the minimum allowed after time T_2. Note that a core whose temperature is above θ_th and rising toward θ_crit will be turned off for the next few epochs (and the tasks running on it will be migrated to other cores) until it cools down below a second temperature value, θ_cool < θ_crit; only then may it be turned back on.
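The turn-off/turn-back-on rule just described is a classic hysteresis loop: a core trips off at the high threshold and becomes eligible to run again only after cooling past a lower one. A minimal sketch, with illustrative threshold values:

```cpp
// Hysteresis rule for thermal emergencies: a core whose temperature reaches
// the trip point is shut off, and is only eligible to be turned back on once
// it has cooled below theta_cool. Threshold values here are illustrative.
enum class CoreState { On, OffCooling };

CoreState next_state(CoreState s, double temp,
                     double theta_trip = 95.0,   // degrees C, below theta_crit
                     double theta_cool = 70.0) {
    switch (s) {
    case CoreState::On:
        // Trip before theta_crit is ever reached.
        return temp >= theta_trip ? CoreState::OffCooling : CoreState::On;
    case CoreState::OffCooling:
        // Re-enable only after cooling well below the trip point.
        return temp <= theta_cool ? CoreState::On : CoreState::OffCooling;
    }
    return s;
}
```

The gap between the two thresholds prevents a hot core from oscillating on and off across consecutive decision epochs.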
n = min(number of threads, N)
Ω ⟵ cores with no assigned threads (|Ω| = N − number of threads)
Thermal_Emergency_Check {
  for i = 1:n {
    if (θ_i ≥ θ_crit) { turn off core_i; n--; Ψ ⟵ Ψ ∪ {core_i}; if (Ω ≠ ∅) migrate_task(i); }
    if (θ_i ≤ θ_cool AND core_i ∈ Ψ) Ψ ⟵ Ψ − {core_i};
    if (task was migrated) Ω ⟵ Ω ∪ {core_i};
  }
}
S = Sort_ascending(task_IPCs)
Core_Consolidation {
  if (IPC(S1&S2) · fmax_S1 ≥ IPS_target(S1) + IPS_target(S2)) {
    Π ⟵ Π ∪ {core_S1}; Ω ⟵ Ω ∪ {core_S2}; n--;
    consolidate(S1, S2, core_S1);
  } elseif (IPC(S1&S2) · fmax_S2 ≥ IPS_target(S1) + IPS_target(S2)) {
    Π ⟵ Π ∪ {core_S2}; Ω ⟵ Ω ∪ {core_S1}; n--;
    consolidate(S1, S2, core_S2);
  }
}
Core_Unconsolidation {
  if (core_i ∈ Π AND IPC(i) · fmax_i < 1.1 · Σ_{j assigned to i} IPS_target(j))
    if (power of unconsolidated ≤ P_budget AND Ω ≠ ∅) {
      Π ⟵ Π − {core_i}; Ω ⟵ Ω − {core_k}; n++;
      unconsolidate(i, core_i, core_k);
    }
}
Figure 4-5. Pseudo-code of T1-PTM algorithm.
Figure 4-5 summarizes the pseudo-code of the T1-PTM algorithm. In this algorithm, n denotes the number of available (ON) cores and N denotes the total number of cores. S represents the set of tasks sorted in ascending order of IPC, and S1 and S2 are its first two elements. Ψ, Ω, and Π denote the set of cores turned off due to thermal emergency, the set of free cores (with no tasks assigned to them, hence in OFF/sleep mode), and the set of consolidated cores, respectively. The algorithm consists of three main functions, namely Thermal_Emergency_Check, Core_Consolidation, and Core_Unconsolidation, which run in sequence every time T1-PTM is called.
4.5.2 Tier-Two Manager – Coarse-Grain DVFS
The second tier of our proposed VPTM, T2-PTM, uses a simplified version of the mixed-integer problem (78), obtained by eliminating the pseudo-Boolean assignment variables (since they have already been determined by T1-PTM.) Hence T2-PTM solves the nonlinear program (79), which maximizes the total CMP throughput while satisfying the aforesaid constraints.
The problem of (79) is a convex optimization problem, and an optimal solution, if one exists, can be found in polynomial time. Note that in T2-PTM, core temperatures are required to stay below a threshold temperature, θ_th, which is below the critical temperature, θ_crit; this creates a safety margin for the original temperature constraint in the presence of mispredictions and uncertainties in the problem coefficients, which might otherwise cause the temperature rise to be underestimated and hence create a thermal emergency. The problem is indeed a modified version of the convex problem presented and solved online in [148], where the objective function was the summation of the cores’ frequencies. In contrast, our objective function is the actual CMP throughput, which is a weighted summation of the core frequencies. We will use the solution method presented in [148] to efficiently solve (79). Note that, to convert the problem to a convex one, the last constraint of (79) has been replaced with an inequality; however, the optimum solution is the same as with an equality (see [148] for details.)

  maximize    Σ_i IPCA_i · f_i
  subject to: IPCA_i · f_i ≥ IPS_req,i                  (per-task throughput)
              f_min,i ≤ f_i ≤ f_max,i                   (DVFS range)
              Σ_i P_i ≤ P_budget                        (CMP power budget)
              A · θ + B · P ≤ θ_th · 1                  (thermal threshold)
              P_i ≥ D_i · f_i³ + L_i + K_θ,i · θ_i      (power model)          (79)

where IPCA is defined as the vector of per-core IPCs once the assignment of tasks to cores is known; it is obtained by applying the (now fixed) task-to-core assignment to the task IPC vector. Conditions for having a feasible solution to (79) are: (i) the per-task minimum IPS requirements can be met in the best case of assigning a full physical core to the task and running that core at its maximum allowed frequency, and no violation of the critical temperature results from doing so; (ii) the total CMP power budget is large enough for the CMP to meet the minimum IPS requirements of all running tasks, i.e., the CMP power constraint is always met if each core runs at the minimum frequency that achieves the required IPS of its tasks.
Note the implicit assumption that a task runs on no more than a single core at any given time. Despite the global optimality of the solution to the convex optimization problem (79), it is possible that no solution exists inside the feasible region. This is because the values of the task-level IPCs, power consumptions, temperatures, etc. are only estimates. Furthermore, process variations make the values of the various regression coefficients subject to variability (and thus error.) To make T2-PTM suitable as an online power and thermal management technique for any instance of a CMP under any workload condition, we modify the constraints and take certain critical actions before we attempt to solve (79). More precisely, we may sacrifice the power budget constraint (and, if necessary, per-task throughput constraints) to find a feasible solution, which may not be optimal for the original problem. This constraint modification and the critical actions make a best effort to guarantee that the solution is of practical value, in the sense that the thermal emergency and per-task IPS constraints are met. To guarantee feasibility, we perform a pre-processing step that verifies that all constraints can be met; otherwise, it replaces the offending equations with ones that make the problem solvable with the least change to the constraints. This step is inspired by the characteristics of the convex problem. In a first critical scenario, if the temperature of a core exceeds the threshold temperature (which is below the critical temperature) and reducing its frequency (hence its power consumption) to the minimum does not stop the rise toward the threshold value, then the convex problem will be infeasible.
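This pre-processing can be viewed as clipping each core’s feasible frequency range to [max(f_min, f_req), min(f_max, f_therm)], where f_req is the frequency needed to meet IPS_req and f_therm is the highest thermally safe frequency; whenever the range is empty, the offending constraint is relaxed as described in the surrounding text. A sketch under these assumptions (field names are illustrative):

```cpp
#include <algorithm>

// Per-core feasibility pre-processing: clip the frequency range and, when it
// is empty, drop either the thermal or the IPS constraint with the least
// change, pinning the frequency accordingly.
struct FreqRange { double lo, hi; bool ips_dropped; };

FreqRange preprocess(double f_min, double f_max,
                     double f_req, double f_therm) {
    double lo = std::max(f_min, f_req);
    double hi = std::min(f_max, f_therm);
    if (lo <= hi) return {lo, hi, false};
    if (f_therm < f_min) {
        // Scenario one: even f_min overheats the core; drop the thermal
        // constraint here and pin f_min (T1-PTM will turn the core off at
        // the next decision epoch).
        return {f_min, f_min, false};
    }
    // Scenario two/three: IPS_req is unreachable; drop it and pin the
    // highest frequency that raises no thermal emergency.
    return {hi, hi, true};
}
```

A returned degenerate range (lo == hi) is exactly a “critical action”: the core’s frequency is forced and the controller later compensates around that point.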
To fix this issue, we simply drop the constraint on the core’s temperature and take the critical action of forcing its frequency to f_min in order to keep the core’s temperature as low as possible, until T1-PTM turns the core off at the beginning of the next decision epoch. A second critical scenario of infeasibility occurs when the IPS of a task turns out to be less than its required IPS, IPS_req, even while being executed at the maximum frequency of the core to which it has been assigned. To fix this issue, T2-PTM removes the corresponding IPS_req constraint of such a task and takes the critical action of setting the corresponding core’s frequency to its highest level. However, this highest frequency must be set such that the thermal constraint is not violated; hence T2-PTM sets the frequency to the highest level that does not give rise to a thermal emergency, denoted by f_max,no-emer, to ensure that the core delivers the maximum IPS it can for the task without invoking a thermal emergency. This condition persists until T1-PTM migrates the task from its current core to another one at the beginning of the next decision epoch. These scenarios are the only cases of conflict between the frequency limit constraints and the temperature or performance constraints. A third conflict may occur when the required IPS calls for a frequency above the maximum frequency allowed by the thermal constraint; in this case, VPTM ignores the IPS requirement and keeps the temperature of the core in the allowed range. Finally, a reasonable assumption is that the power budget is large enough for the CMP to meet the IPS requirements of all tasks, i.e., the power constraint is always met if each core runs at the minimum frequency corresponding to the required IPS of its tasks. In the aforesaid scenarios, VPTM works to generate a feasible solution, making a best effort to resolve the infeasibility while maximally satisfying the constraints. In addition, it employs a PI feedback controller (cf.
section 4.5.3) to adjust the core’s DVFS setting and keep the critical CMP parameter (die temperature or per-task throughput) within the acceptable range. Note that if there is no imminent threat of violating any of these constraints, the PI controller is in its default mode, where it tries to meet the total power constraint (normally, this constraint is active at the optimum solution, since the optimization goal is to maximize the CMP throughput.) Once the existence of an optimal solution is ensured by the described constraint relaxations, the convex problem (80), which is equivalent to (79), is solved to determine the operating point of all the cores:

  maximize    Σ_i IPCA_i · f_i
  subject to: max( f_min,i , IPS_req,i / IPCA_i ) ≤ f_i ≤ min( f_max,i , f_θ,i )
              Σ_i P_i ≤ P_budget
              P_i ≥ D_i · f_i³ + L_i + K_θ,i · θ_i                              (80)

where f_θ,i denotes the highest frequency of core i that keeps its temperature below θ_th. When the optimal DVFS settings of the cores are calculated, the frequency, expected power consumption, and estimated IPS of each core are determined. To avoid power budget violations, we use a PI controller for each core (cf. section 4.5.3) that adjusts the core’s frequency so that its power consumption equals its calculated power consumption, i.e., its per-core power budget. Moreover, in the optimal solution, a core’s frequency may be set such that it only just satisfies the minimum required IPS; due to the variability of task characteristics, the actual measured IPS might be lower after the calculated frequency is applied. Therefore, we utilize a feedback controller to guarantee that the required IPS is maintained.
4.5.3 Tier-Three Manager – Closed-loop Feedback Controller
Because the problem formulations in (79) and (80) ignore the variation and uncertainty in the characteristics of the cores and the behavior of the tasks, such as the power consumption coefficients and the IPC of tasks, a direct solution to these formulations may suffer from overestimating or underestimating temperature, power, or throughput.
VPTM utilizes a PI (proportional-integral) controller [77] for each core to dynamically adjust the frequency of the core so as to maintain its per-core power budget, die temperature, or per-core IPS throughput close to the desired values, in spite of potential changes in application behavior. This requires breaking up the total CMP power budget into target power budgets for all active cores, a step we perform by setting the per-core power target at the level required by the core’s calculated frequency and temperature in the optimal solution to (80). It also involves calculating the target IPS value for each core, which we do by setting the per-core IPS target to the summation of the per-task IPS values of all tasks assigned to the core in the optimal solution to (80). As illustrated in Figure 4-6, each core (G_s) has a power controller (GP_c), a temperature controller (GT_c), and a throughput controller (GH_c) that set its frequency. Based on the decision made in T2-PTM, one of these controllers is engaged for each core to maintain its power, throughput, or temperature at the desired level. Based on the optimal solution of (80) and the feasibility pre-processing step, we use the feedback controllers to maintain one of the following:
- temperature below θ_th, when a core’s temperature is close to the threshold temperature;
- IPS at IPS_req, when a core’s calculated IPS is equal to IPS_req;
- power at its allocated level, if none of the above occurs.
The PI controller continuously measures the actual temperature, throughput, or power of the core and, if required, changes its DVFS setting to match the set point determined by T2-PTM. The per-core power budget is calculated from the core’s frequency and temperature in the optimal solution; similarly, the target IPS of a core is set to its calculated IPS in the solution to problem (80).
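A discrete-time PI controller of the kind tier three engages per core can be sketched as follows (incremental form; the gains and limits are illustrative, not the tuned values of the dissertation):

```cpp
// Per-core PI controller: each control epoch, nudge the frequency so the
// measured quantity (power, IPS, or temperature) tracks its set point.
class PIController {
public:
    PIController(double kp, double ki, double f_min, double f_max)
        : kp_(kp), ki_(ki), f_min_(f_min), f_max_(f_max) {}

    // One control epoch: given the set point and the measured value of the
    // controlled quantity, return the next frequency command.
    double step(double set_point, double measured, double f_current) {
        double error = set_point - measured;
        // Incremental (velocity) PI form: proportional term on the error
        // change plus an integral term on the error itself.
        double df = kp_ * (error - prev_error_) + ki_ * error;
        prev_error_ = error;
        double f = f_current + df;
        if (f < f_min_) f = f_min_;   // saturate at the DVFS range
        if (f > f_max_) f = f_max_;
        return f;
    }

private:
    double kp_, ki_, f_min_, f_max_;
    double prev_error_ = 0.0;
};
```

The incremental form is convenient here because the output is naturally expressed as a delta on the current frequency, and clamping to the DVFS range provides simple anti-windup.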
The temperature of a core is used as the controller objective only when it is near the critical value; the set point is then the critical temperature (or slightly below it) to avoid overheating. Details of the PI controller design are omitted here due to space limitations; see reference [77] for a detailed discussion of this topic.
Figure 4-6. PI controllers of VPTM.
4.6 Experimental Results
4.6.1 Experimental Setup
For our experiments, we set up a tool chain consisting of an in-house MATLAB-based CMP simulator integrated with PTscalar, a cycle-accurate microarchitecture-level power and performance simulator that uses a temperature-dependent leakage model [82], and HotSpot, a thermal simulator [100]. Multiple instances of PTscalar simulate the execution of tasks on cores and calculate the power and temperature of the cores at each time epoch; these values are reported to our PTM unit (in MATLAB), which decides core consolidation and task migration moves and adjusts the DVFS settings of the cores. We simulate a heterogeneous quad-core CMP in which the cores are of two types: cores 1 and 2 are faster (3.2GHz) while cores 3 and 4 are slower (2.6GHz); in the problem formulation, each individual core can be of any arbitrary type. Both core types are similar to the Alpha 21264 architecture, with some configuration changes, as listed in Table 4-3. The ambient temperature is set to 25°C, and the critical temperature is set to 100°C.
Table 4-3. Configurations of the cores in the CMP system
Pipeline: Out-of-order
Fetch-Issue-Commit: 4-4-4 / 4-2-2
Load/Store queue: 32/32
L1 instruction/data cache: 16KB, 2-way / 8KB, 2-way / LRU
L2 unified cache: 4MB, 8-way, 64B line
Technology node/Vdd: 32nm, 1.1V/1V
Max frequency: 3.2GHz / 2.6GHz
We first used PTscalar to extract the power model parameters, i.e., the D (per task), L, and K_θ matrices.
Then, the effect of process variation was emulated by applying up to 5% random deviation to these parameters as used in the PTM solver. To determine the thermal model parameters, i.e., the A and B matrices, we considered a sample CMP floorplan similar to that of Intel Xeon processors of the Sandy Bridge family [153]. Figure 4-7 illustrates this CMP floorplan, in which the L3 cache is placed in the center of the die and the cores are placed around it. We then applied the thermal model used in HotSpot to determine the elements of the A and B matrices.
Figure 4-7. CMP floorplan in our thermal model; based on Intel Xeon floorplan [153].
For the workload, we use bundles of four different benchmarks selected from the SPEC2000 and SPEC CPU2006 benchmark suites. Note that we do not consider inter-task communication, and bundle multiple independent tasks. The task mix is assigned to the CMP and runs virtually forever. The execution of each task on each core type is pre-characterized in terms of its average IPC, D, and L values. Note that these values are treated as uncertain data, and VPTM uses a moving average (MA) predictor (of length three) and a feedback loop to manage the uncertainties. Table 4-4 shows a sample assignment of tasks to cores and the resulting average IPC of each task on the corresponding core.
Table 4-4. Assignment of benchmarks in test1
Core: 1 / 2 / 3 / 4
Benchmark: twolf / mcf / equake / bzip
Avg. IPC: 1.205 / 2.12 / 1.7 / 0.90
4.6.2 Simulation Results
Figure 4-8 demonstrates the performance of the VPTM algorithm for the benchmark set and assignment given in Table 4-4. Our baseline is a greedy algorithm called PushHiPullLo (PHPL), which is similar to the greedy algorithm presented in [52]. PHPL maximizes CMP throughput under a total power budget by repeatedly reducing the frequency of the core with the lowest IPC until the power budget is met; the thermal constraint is enforced by limiting the maximum frequency of the cores. Figure 4-9 demonstrates the performance of PHPL.
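The PHPL baseline just described can be sketched in a few lines; the cubic power model, coefficients, and step size below are toy stand-ins for illustration only.

```cpp
#include <vector>

// PushHiPullLo (PHPL) baseline: start every core at its maximum frequency,
// and while the total power exceeds the budget, step down the core running
// the lowest-IPC task. The thermal constraint is assumed to be folded into
// f_max.
struct CoreCfg { double ipc, f, f_min, f_max; };

double core_power(const CoreCfg& c) { return 8.0 * c.f * c.f * c.f; } // toy model

void phpl(std::vector<CoreCfg>& cores, double p_budget, double step = 0.2) {
    for (auto& c : cores) c.f = c.f_max;
    auto total = [&] {
        double p = 0.0;
        for (auto& c : cores) p += core_power(c);
        return p;
    };
    while (total() > p_budget) {
        // Pick the lowest-IPC core that can still be slowed down.
        CoreCfg* victim = nullptr;
        for (auto& c : cores)
            if (c.f - step >= c.f_min && (!victim || c.ipc < victim->ipc))
                victim = &c;
        if (!victim) break;   // budget unreachable even with all cores at f_min
        victim->f -= step;
    }
}
```

The contrast with VPTM is visible even in this sketch: PHPL slows only the lowest-IPC core, whereas VPTM redistributes the whole budget across cores (and can consolidate tasks to turn cores off).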
For the purpose of comparison, we disabled the core consolidation capability of tier one of VPTM, since the baseline considered here does not perform core consolidation (to the best of our knowledge, there is no algorithm in the literature that solves the same problem as VPTM by a combination of DVFS and core consolidation.) In Figure 4-8 and Figure 4-9, plot (a) shows the measured CMP throughput and power consumption. In this experiment, we applied a sequence of {110W, 80W, 100W, 80W} as the total power budget. As can be seen, the average total IPS of VPTM is higher than that of PHPL for similar power budgets (15.5 and 13.2 BIPS, respectively.) Plots (b) and (c) of these figures show the traces of the frequency and temperature (θ_crit = 100°C) of each core, respectively; the threshold temperature is always 5 degrees below the critical temperature. Also, as can be seen, VPTM follows the power budget very closely, thanks to the PI controller, which adaptively updates the DVFS settings to maintain the target core powers. Another observation is that in VPTM, core 3 executes a task with a high IPC (but not the highest) while its power consumption is the most power-proportional; hence the largest share of the power budget is allocated to it, and its frequency is mostly at its maximum. In PHPL, in contrast, core 2 has the highest IPC and mostly runs at its maximum frequency.
Figure 4-8. Performance of VPTM algorithm: (a) total throughput [BIPS] and power [W], (b) per-core frequencies [GHz], (c) per-core temperatures [°C].
Figure 4-9. Performance of PHPL algorithm: (a) total throughput [BIPS] and power [W], (b) per-core frequencies [GHz], (c) per-core temperatures [°C].
Figure 4-10 compares the average performance of VPTM (with core consolidation enabled) to that of PHPL for five different benchmark mixes, under three power budget conditions. The average throughput of VPTM is approximately 21.4% higher than that of PHPL.
An average of 13% is gained by the combination of the precise DVFS solution and the PI controllers, and the rest is due to core consolidation. As can be seen in this figure, the difference between the average IPS of the CMP under VPTM and PHPL is smaller at higher power budgets; the reason is that when the power consumption is not limited (except by the thermal constraint) and all cores can run at their highest frequency, no further optimization is possible, and therefore the two algorithms perform almost the same.
Figure 4-10. Total IPS under power budget – VPTM vs. PHPL.
Concerning the complexity of the algorithms, the runtime of VPTM is determined by the runtime of tier one (consolidation) plus the runtime of solving (80). As mentioned before, the complexity of the consolidation step is O(N log N), where N denotes the number of cores, and its decision epoch is on the order of tens of milliseconds. Solving (80), which is invoked on the order of the operating system’s 10ms time slice, takes about 50-100µs, i.e., less than 1% performance overhead. Finally, the PI controller performs a few simple arithmetic calculations every few hundred microseconds. This makes the runtime of VPTM acceptable for an online PTM. We also studied the performance of VPTM on eight-core CMPs consisting of two 3.2GHz cores, two 2.6GHz cores, and four 2.3GHz cores, all with the same architectural configuration as in the previous experiments. We compared the performance of VPTM on eight-core CMPs to PHPL, as shown in Table 4-5 and depicted in Figure 4-11.
Table 4-5.
Total throughput of 8-core CMPs at different power budgets
P_budget [W]:    150    125    100    75     50     30
PHPL [BIPS]:     34.01  31.29  28.43  23.82  17.94  15.55
VPTM [BIPS]:     34.26  33.25  30.83  27.56  22.90  17.83
% Improvement:   0.72   6.28   8.47   15.69  27.65  14.65
Figure 4-11. Comparison of VPTM and PHPL in 8-core CMPs.
As can be observed, at very high power budgets the total throughputs of VPTM and PHPL are similar, while at mid-range power budgets they differ substantially. The reason is that at high power budgets all cores run at their maximum frequency, so there is no room for optimization; at mid-range power budgets, however, VPTM optimally sets the DVFS configuration and achieves up to 18% better throughput than PHPL. Figure 4-12 compares the frequency settings chosen by VPTM and PHPL over time, for different values of the power budget; plot (a) shows the measured total CMP power and throughput, and plots (b) and (c) show VPTM’s and PHPL’s choice of frequencies, respectively.
Figure 4-12. Selection of frequencies in VPTM and PHPL at different power budgets.
To investigate the effect of the thermal constraint on the performance of VPTM, we simulated VPTM applied to the quad-core CMP described earlier in this section. For this experiment, we set the power budget to a value large enough to take the cores to temperatures near their thermal constraint. Figure 4-13 illustrates the time trace of this experiment. As shown in Figure 4-13 (a), the power budget is initially set to 90W and is then increased to 120W, so that the power budget does not limit the frequencies of the cores.
The measured power consumption is about 100W, and the throughput is less than 5% better, since the cores are already operating near their maximum frequencies. When the power budget is set to a high value, the frequencies of the cores increase to their maximum possible values, as illustrated in Figure 4-13 (b); however, due to the high power dissipation, the core temperatures reach the threshold temperature, and the frequencies become limited by the temperature constraint. The frequency of the core shown in green is below its maximum because two neighboring cores try to run at their maximum frequencies, so both of their temperatures rise; this increased temperature makes VPTM set their frequencies somewhat below the maximum. Figure 4-13 (c) shows the resulting temperatures: in the first time interval, the core temperatures follow from the best frequency setting within the power budget, while in the second time interval, the cores shown in green and blue reach the maximum temperature, and the other cores reach their maximum temperature at their maximum frequencies. Figure 4-14 shows a similar experiment, but with the core frequencies limited first by the temperature constraint and then by the power budget. As observed in this figure, the resulting frequencies are independent of the order in which these constraints apply.
Figure 4-13. Limitation imposed by power and thermal constraints: (a) total power and IPS, (b) frequencies, (c) temperatures.
Figure 4-14. Limitation imposed by thermal and power constraints: (a) total power and IPS, (b) frequencies, (c) temperatures.
Finally, we conducted an experiment on the sensitivity of VPTM’s performance to the critical (threshold) temperature.
As shown in Figure 4-15 (c), the threshold temperature is set to 60, 75, 85, and 90 degrees in consecutive time intervals; the resulting frequencies are shown in Figure 4-15 (b) for the given temperature constraints. Note that for all of these time intervals the power budget is set high, to prevent saturation of the core frequencies due to the power constraint. To investigate the importance of the dependence of leakage power on temperature, we simulated VPTM applied to the same CMP and tasks, once considering this effect and once ignoring it. Figure 4-16 illustrates the results: the graphs on the left do not take the dependence of leakage power on temperature into account (case i), while those on the right do (case ii.) In this figure, the time traces of the CMP total power and IPS, the cores’ frequencies, and the cores’ temperatures are shown in graphs (a), (b), and (c), respectively. As can be seen, the power budget is the same for both experiments; however, the total IPS of case i is 7.8% higher. Overall, the frequencies of the cores are higher in case i than in case ii, since the available power budget for dynamic power is larger: with K_θ = 0, the leakage power consumption is lower at any given temperature, and hence the cores can spend more of the power budget on dynamic power by running at higher frequencies.
Figure 4-15. Sensitivity of frequencies in VPTM to critical temperature: (a) total power and IPS, (b) frequencies, (c) temperatures.
Figure 4-16. Comparison of VPTM performance with K_θ=0 (left) and K_θ≠0 (right).
Figure 4-17 illustrates the sensitivity analysis of VPTM to K_θ.
In this figure, plots (a) show the time trace of the measured CMP throughput and power consumption, plots (b) the core frequencies, and plots (c) the core temperatures. The plots compare the results of case ii above to a situation where K_θ is twice as large (case iii). A trend similar to that of Figure 4-16 can be observed: a larger K_θ forces the core frequencies lower, resulting in lower CMP throughput under a given power budget.
Figure 4-17. Sensitivity of VPTM performance to K_θ.
4.7 A Real-time Power/Thermal Manager in Linux
We utilized currently available features of today’s server-class processors, e.g., the Intel Xeon E5410 quad-core, and developed a software program (in C++) for Linux that implements our proposed power/thermal management algorithm(s) on a hardware testbed. This program collects the required data through system files and issues DVFS commands to Linux (by writing to the corresponding files) to perform frequency scaling and core consolidation. We next explain the features that provide measurement and control of CPU frequency, temperature, IPC, and power consumption.
4.7.1 Controlling CPU frequency scaling
Linux lists the scaling frequencies that each processor core supports in the following file, where X is the core number, between 0 and n-1:
/sys/devices/system/cpu/cpuX/cpufreq/scaling_available_frequencies
For example, the file corresponding to cpu0 of a server contains the supported frequencies (in kHz). The frequency can be changed manually with the following command:
$ cpufreq-selector -f 1300000
Note that one first needs to change the permissions of the cpufreq-selector binary with the following command:
$ sudo chmod +s /usr/bin/cpufreq-selector
Moreover, the frequency governor needs to be set to “userspace”. The Linux kernel implements multiple governors that perform DVFS and determine the operating frequency of the cores according to the desired power or performance requirements, such as userspace, powersave, ondemand, conservative, and performance (userspace, for example, sets the frequency as requested by the user, while performance runs the CPU at its maximum frequency, etc.) The available governors are listed in the following file:
/sys/devices/system/cpu/cpu0/cpufreq/scaling_available_governors
4.7.2 Monitoring CPU temperature at each core
To read the CPU’s on-die temperature sensors in real time, lm-sensors [156] can be used; it is a very useful tool for measuring core temperatures and fan speeds in Linux. These temperature measurements can be dumped into a file so that they can be read later by the PTM software.
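Returning briefly to the cpufreq interface of Section 4.7.1: a PTM implementation needs the space-separated kHz values out of scaling_available_frequencies as a list it can index into when issuing DVFS commands. A small parsing helper (an illustrative sketch; in the real manager the string would be read from the sysfs path above):

```cpp
#include <sstream>
#include <string>
#include <vector>

// Parse the contents of scaling_available_frequencies: whitespace-separated
// frequency values in kHz, highest first on typical systems.
std::vector<long> parse_available_frequencies(const std::string& contents) {
    std::vector<long> freqs_khz;
    std::istringstream in(contents);
    long f;
    while (in >> f) freqs_khz.push_back(f);
    return freqs_khz;
}
```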
A sample measurement is illustrated in Figure 4-18, which reflects the actual supply voltage levels, the global core voltage, the per-core temperature sensor readings, and the fan speeds:

gl520sm-i2c-0-2d
Adapter: SMBus Via Pro adapter at 5000
+5V:      +5.13 V  (min = +0.00 V, max = +0.00 V)
+3.3V:    +3.31 V  (min = +0.00 V, max = +0.00 V)
+12V:     +12.03 V (min = +0.00 V, max = +0.00 V)
Vcore:    +2.11 V  (min = +0.00 V, max = +0.00 V)
fan1:     0 RPM    (min = 0 RPM, div = 1)
fan2:     0 RPM    (min = 0 RPM, div = 1)
temp1:    +35.0 C  (high = -130.0 C, hyst = -130.0 C)
temp2:    +35.0 C  (high = -130.0 C, hyst = -130.0 C)
cpu0_vid: +2.050 V
beep_enable: disabled

Figure 4-18. Sample sensor reading file.

4.7.3 Monitoring CPU performance

Intel processors embed performance counters that can be read to determine the IPC of a core. To do so, we periodically read the number of "Instructions Retired" and divide it by the "Non-sleep Clockticks", which are reported in the INST_RETIRED.ANY and CPU_CLK_UNHALTED.CORE registers, respectively.

Event num. | Event Mask Mnemonic  | Umask | Description
3CH        | UnHalted Core Cycles | 00H   | Unhalted core cycles
C0H        | Instruction Retired  | 00H   | Instruction retired

Linux PerfCtr [122] adds support to the Linux kernel (2.6.x) for using hardware performance-monitoring counters on x86, x86-64, PowerPC, and certain ARM processors. It contains a kernel patch (as a driver) and a user library. Version 2.7 creates a SysFs directory at /sys/class/perfctr/*. The user library accesses the hardware performance-monitoring counters by manipulating these special files through open, close, read, and ioctl calls.

4.7.4 Measuring CPU power

Figure 4-19 shows a picture of our setup in the lab. The server has two Intel Xeon E5410 quad-core processors. To accurately measure the in-system power consumption of the processor, we isolated the power consumed by the processor from the power consumed by the rest of the motherboard.
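The IPC computation described in section 4.7.3 reduces to a ratio of counter deltas over a sampling epoch; a minimal Python sketch (the function name is hypothetical):

```python
def ipc(instr_retired_delta: int, unhalted_cycles_delta: int) -> float:
    """IPC over one epoch = delta INST_RETIRED.ANY / delta CPU_CLK_UNHALTED.CORE.
    Returns 0.0 for a fully halted interval to avoid division by zero."""
    if unhalted_cycles_delta <= 0:
        return 0.0
    return instr_retired_delta / unhalted_cycles_delta
```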
Motherboards contain multiple voltage regulator modules dedicated to delivering power to the individual power rails of the processor, which satisfies this isolation requirement. However, the implementation of these voltage regulators typically does not allow direct power measurement without modifications to the motherboard and the regulators. Thus, we cut the power line between the main board and the 12 V DC power source and measure the power consumed by the processors and the onboard voltage regulators. It should be noted that this instrumentation does not change the power requirements or characteristics of the processor itself. It allows for the direct measurement of all power rails supplying the processor while the system is running. Measurements can be captured using a high-precision, high-sampling-rate digital multimeter and sent back to the PTM on the server. To further improve the quality and precision of the measurements, one can use high-speed, high-accuracy power analyzers with internal 16-bit analog-to-digital converters, 200 kHz sampling rates, and a USB connection.

Figure 4-19. Sample power measurement setup.

4.8 Summary

We presented a robust mathematical formulation of, and solution to, the problem of power and thermal management in heterogeneous CMPs by proposing a hierarchical Variation-aware Power/Thermal Manager (VPTM). VPTM maximizes the throughput of a CMP operating under the described variations and uncertainties by means of DVFS and core consolidation, subject to a given total power budget and a constraint on die temperature. A PI controller was employed to compensate for variations in key system parameters at runtime. Experimental results show that VPTM achieves more than 20% performance improvement, with no impact on the maximum temperature, for a given power budget.

Chapter 5.
STOCHASTIC DYNAMIC POWER MANAGEMENT FOR CHIP MULTIPROCESSORS SUBJECT TO VARIATIONS

5.1 Introduction

With the increasing levels of variability in the characteristics of nanoscale CMOS devices and VLSI interconnects, and continued uncertainty in the operating conditions of VLSI circuits, achieving power efficiency and high performance in electronic systems under process, voltage, and temperature (PVT) variations, as well as current stress, device aging, and interconnect wear-out phenomena, has become a daunting, yet vital, task. In this chapter, we address the problem of system-level dynamic power management (DPM) in chip multiprocessors (a.k.a. multicore systems) that are manufactured in nanoscale CMOS technologies and operated under widely varying conditions over the lifetime of the system. Such systems are greatly affected by increasing levels of process variations, typically materializing as intrinsic (random) or systematic sources of variability and aging effects in device and interconnect characteristics, and by widely varying workloads, usually appearing as a source of uncertainty. Variations have a randomizing effect on the performance and power dissipation of a particular processor chip. At the same time, measurements made about the state of the processor and predictions about its future state tend to be imperfect, which gives rise to uncertainty about the system state. At the system level, this variability and uncertainty is beginning to undermine the effectiveness of traditional DPM approaches.

Our proposed power manager interacts with an uncertain environment, statistically changing state variables, and a statistical immediate cost function, and tries to minimize the discounted cost in the limit by choosing appropriate actions. Decision making in a partially observable environment is achieved by combining aspects of Hidden Markov Models (HMMs) and Markov Decision Processes (MDPs).
More specifically, we adopt a Partially Observable MDP (POMDP) to account for the uncertainty and variability in parameter observation.

The remainder of this chapter is organized as follows. Section 5.2 reviews prior work on stochastic dynamic power management. Section 5.3 presents the preliminaries; the details of the stochastic uncertainty management framework are given in section 5.3.3. Section 5.4 presents our variation-aware dynamic power management technique. Experimental results and a chapter summary are provided in sections 5.5 and 5.6, respectively.

5.2 Prior Work

A large body of research has been devoted to optimizing DPM policies, resulting in both heuristic and stochastic approaches. While the heuristic approaches are easy to implement, they do not provide any power/performance assurances. In contrast, the stochastic approaches guarantee optimality under performance constraints, although they are more complex to implement. An approach based on discrete-time Markovian decision processes (DT-MDP) was proposed in [101]. This approach outperforms the earlier heuristic techniques because of its solid theoretical framework for system modeling and policy optimization. A power management approach based on continuous-time MDP (CT-MDP) was introduced in [123]. The policy change framework of this model is asynchronous and thus more suitable for implementation as part of a real-time operating system environment. Reference [124] improved on the modeling technique of [101] by using time-indexed semi-Markovian decision processes (SMDP). A power management technique based on non-stationary processes was introduced in [125], where the workload is modeled as a Markov-modulated stochastic process. Reference [126] presented a stochastic modeling and optimization framework using a partially observable MDP, conveniently formulated and solved as a quadratically constrained linear program.
The authors of [127] presented a POMDP-based framework to improve the accuracy of decisions in power/thermal management.

5.3 Preliminaries

5.3.1 Variability and uncertainty

Randomness is defined as a lack of pattern or regularity. Two sources of randomness are generally recognized: variability and uncertainty [84]. The first, variability, is an inherent irregularity in the phenomenon being observed; different situations produce different numerical values for a quantity. Variability in the delay and power consumption of VLSI circuits arises from scaling circuit technologies beyond the ability to control specific performance- and power-dependent parameters [84]. In multicore systems, spatially correlated intra-die process variations manifest themselves as core-to-core variations. The existence of variability implies that a single action or strategy may not emerge as optimal for every individual case.

The other source of randomness, uncertainty, is related to a generalized lack of knowledge about the processes involved. When making observations of past events or speculating about the future, imperfect knowledge is the rule rather than the exception. For example, the workload attributed to a particular functional unit is often not directly recorded. Furthermore, direct measurements may be noisy or erroneous. Uncertainty implies that we might make a non-optimal choice because we expect one outcome but a different one actually occurs.

Variations, especially those due to the manufacturing process and aging phenomena, have a randomizing effect on the performance and power dissipation of a particular processor chip. At the same time, measurements made about the state of the processor and predictions about its future state tend to be imperfect, which gives rise to uncertainty about the system state. We propose a power manager that interacts with statistically changing parameters and an uncertain environment.
5.3.2 System model

This work targets current multiprocessor architectures in single-chip (CMP, a.k.a. MPSoC) realizations supporting Dynamic Voltage and Frequency Scaling (DVFS). The availability of the DVFS knob varies by implementation, from independent voltage and frequency scaling for each core to universal (a.k.a. global) voltage and frequency scaling for all cores in a CMP system. Without loss of generality, we consider a state-of-the-art CMP system (e.g., AMD's K10 family [129] or Intel's Nehalem or Sandy Bridge [130]) in which a single voltage source supplies power to all the cores (i.e., although the voltage setting can vary over time, the same value is applied to all cores at any time instance). However, the frequency setting of each core can be varied independently (subject to not
In our problem formulation, it is more appropriate to use a high-level metric; hence, we adopt the number of requests serviced per second as the system’s performance metric. We assume a set of application programs run on the multicore system, each giving rise to a specific, but known, request arrival rate, λ. This is modeled by introducing multiple rate modes for applications [132]. In addition the requests have different levels of instruction-level parallelism and access to memory, resulting in different IPS values. We assume a distributed task assignment mechanism where incoming requests to CMP enter a ready task queue and are held there until they are picked by the next available 158 core. That is, every time a core becomes available, it fetches the request in front of the ready queue and starts executing it (or adds the request to its own local queue; however it would not change the proposed mathematical model). 5.3.3 Markov Decision Process-based DPM A (time-average) Markov decision process (MDP) model facilitates reasoning in domains where actions change the system states and a cost (or reward) is utilized to optimize the system performance. The simple MDP is directly observable in the sense that its execution hinges on the assumption that the current system state can be determined without any errors and the cost (reward) of an action can be calculated exactly. In a multicore system, the true system state is determined by a three-tuple, <workload, power, performance>. It is difficult to determine the true state of a system directly, because none of the true state variables are directly measurable at run time and furthermore, they are all affected by one or more sources of randomness. Therefore, we define an observable system state which consists of observable variables and can be used to infer the true state of the system. 
The observable system state, which is the joint global state of the request generators, the ready queue, and the cores, is given by the tuple: < λ, w, q, v, f, u>. Table 5-1 provides their definition. Table 5-1. Definition of system parameters λ is the arrival rate of requests w is the request size q is the occupancy level of the ready queue v is the chip-wide operating voltage of the cores is the operating frequency vector of the cores ã is the utilization vector of the cores 159 In partially observable environments, where states of the system cannot be identified exactly, observations made by a power manager about the state of the system are indirect and may even be noisy, and therefore, they only provide incomplete information. One way to deal with uncertainty under a wide range of operating conditions and environments is to rely on the history of previous actions and observations to disambiguate the current state. For example, a hidden Markov model (HMM) [126][133] can be adopted, where the state is not directly observable but variables influenced by the state are observable, to learn a model of the hidden states. Thus, a power manager in HMM reasons about the system state indirectly through these observed variables, which capture complex system dynamics that are not completely observable. The decision making in a partially observable environment is achieved by combining aspects of HMMs and MDPs. Specifically, we utilize a MDP, which models the decision making strategy, combined with a HMM, which considers the uncertainty in parameter observation. Definition 1: A Partially Observable Markov Decision Process, POMDP, is a tuple (S, A, O, T, C, Z) such that: S is a finite set of states. A is a finite set of actions. O is a finite set of observations. T is a transition probability function. T: S × A → Δ(S) C is a cost function. C: S × A → ℜ Z is an observation function. 
Z: S × A → Δ(O), i.e., it maps a state-action pair to a probability distribution over observations.

The state space S comprises a finite set of states, where each s ∈ S is a global (observable) state of the system. In our problem context, a system state is characterized by the tuple <λ, w, q, v, f, u>. The action space A consists of a finite set of actions a ∈ A that are issued by the power manager to change the processor state and hence cause transitions into and out of the system states. These actions include commands to turn individual cores on/off, set the chip-wide voltage for all active cores, and set each core's frequency level. They affect v and f directly, and u indirectly. The observation space O contains a finite set of observations o ∈ O, namely: the value of λ for current requests, reported by the operating system; the occupancy level of the ready queue, q; the chip-wide operating voltage of the cores, v; the operating frequency vector of the cores, f; the utilization vector of the cores, u; and the request size, w, reported by built-in sensors on the chip.

The state transition probability function, T(s^{t+1}, a^t, s^t), determines the probability of a transition from state s^t to state s^{t+1} after executing action a^t; i.e., the system transitions to state s^{t+1} at time t+1 with the probability given by (81). (In this chapter, we denote the time stamp of states, actions, and observations with a superscript; for instance, s^t denotes state s at time t.)

P(s^{t+1} | s^t, a^t) = T(s^{t+1}, a^t, s^t)    (81)

The observation function, Z(o^{t+1}, s^{t+1}, a^t), captures the relationship between the actual state and the observation; it is defined as the probability of making observation o^{t+1} after taking action a^t that has landed the system in state s^{t+1}; i.e., state s^{t+1} generates observation o^{t+1} at time t+1 with probability

P(o^{t+1} | s^{t+1}, a^t) = Z(o^{t+1}, s^{t+1}, a^t)    (82)

We consider a bounded cost function that assigns a statistical cost value to each state-action pair, whereby an immediate cost, C(s^t, a^t), is incurred when action a^t is executed in state s^t. In this work, the immediate cost is calculated as the average energy per request in state s^t during the epoch leading to time t, plus the energy cost of the transition from state s^t to s^{t+1} (this includes, for example, the energy consumed to adjust the voltage of the processor chip):

C(s^t, a^t) = (1 / H(s^t)) · [ E_epoch(s^t) + Σ_{s^{t+1} ∈ S} E_tr(s^t, s^{t+1}) · T(s^{t+1}, a^t, s^t) ]    (83)

where E_epoch(s^t) is the energy consumed by the CMP in state s^t during one decision epoch. This way of defining the immediate cost of state-action pairs is motivated by the fact that we can account for the cost of a state only after it is accurately (not probabilistically) known that the system has landed there. Hence, the cost is calculated at the next decision epoch, to account for the energy consumption of the processor in state s^t during one decision epoch, plus the transition cost of the action.

5.3.4 Belief state

Instead of making decisions based on the current perceived state of the system, the POMDP maintains a belief, i.e., a probability distribution over the possible states of the system, and makes decisions based on its current belief. The optimal POMDP solution is Markovian over the belief space, B [127]. Hence, by using the belief space, we convert the original POMDP into a completely observable, regular (albeit continuous-state-space) MDP, the so-called belief-state MDP. The belief state at time t is a |S|×1 vector of probabilities defined as [133]:

b^t = [b^t(s)]_{∀s∈S},  Σ_{s∈S} b^t(s) = 1    (84)

where b^t(s) is the posterior probability of state s at time t. Note that the |S|-dimensional belief state is continuous. Based on the belief state, an action a^t is chosen by the policy π = {π^t} that maps belief states to actions.
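Maintaining such a belief requires a Bayes update after every action-observation pair. A minimal Python sketch of this standard POMDP filtering step, over a finite state set with hypothetical probability tables:

```python
def belief_update(b, a, o, T, Z):
    """Bayes step over n states:
    b'(s) is proportional to Z[o][s][a] * sum over s' of b[s'] * T[s][a][s'],
    where T[s2][a][s1] = P(s2 | s1, a) and Z[o][s2][a] = P(o | s2, a)."""
    n = len(b)
    unnorm = [Z[o][s][a] * sum(b[sp] * T[s][a][sp] for sp in range(n))
              for s in range(n)]
    total = sum(unnorm)
    # if the observation has zero likelihood under every state, keep b
    return [x / total for x in unnorm] if total > 0 else list(b)
```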
Given a belief state b^t and an action a^t resulting in observation o^{t+1}, we can compute the successor belief state b^{t+1} using Bayes' rule:

b^{t+1}(s) = P(s | o^{t+1}, a^t, b^t) = [ Z(o^{t+1}, s, a^t) · Σ_{s'} b^t(s') · T(s, a^t, s') ] / [ Σ_{s''} Z(o^{t+1}, s'', a^t) · Σ_{s'} b^t(s') · T(s'', a^t, s') ]    (85)

The belief-state transition function, T_b(b^{t+1}, a^t, b^t), which gives the probability of a transition from the current belief state b^t to the next belief state b^{t+1} after executing action a^t, is:

T_b(b^{t+1}, a^t, b^t) = P(b^{t+1} | a^t, b^t) = Σ_o P(b^{t+1} | a^t, b^t, o) · P(o | a^t, b^t)    (86)

The cost model in (87) denotes the immediate cost incurred by action a^t issued in the current belief state b^t:

C_b(b^t, a^t) = Σ_{s∈S} b^t(s) · C(s, a^t)    (87)

We have thus transformed the problem from the POMDP model to a belief-state MDP model. The optimal policy π*(b) of the belief-state MDP representation is also optimal for the physical-state POMDP representation.

5.4 Proposed Variation-Aware DPM

As stated earlier, the proposed power manager interacts with an uncertain environment, statistically variable state variables, and a statistical immediate cost function, and tries to minimize the discounted cost in the limit by choosing appropriate actions at regularly scheduled decision epochs. Figure 5-1 illustrates the basic structure of our proposed Variation-Aware Dynamic Power Manager (VA-DPM). The actions issued by the power manager change the system state and incur a quantifiable cost. The power manager consists of two functional components. The first is the belief state estimator, which predicts the system's next belief state, b^{t+1}, based on its current state, b^t, the action taken, a^t, and the observation, o^t. The second is the policy maker, which calculates the statistical total cost and assigns optimal actions, a^{t+1}, based on a policy optimization algorithm. In online VA-DPM, the belief state estimator utilizes a Kalman filter [134], and the policy optimization block makes decisions that minimize a cost function (c.f.
section 5.4.1) by using the value-iteration method [135].

Figure 5-1. Structure of variability-aware DPM (power manager with belief state estimator (BSE) and policy selection blocks interacting with the CMP through actions a, beliefs b, and observations o).

5.4.1 Cost function of VA-DPM

Equation (83) defines the cost function associated with state-action pairs. In this equation, the energy consumption of the CMP during an epoch is the product of its power consumption and the duration of the decision epoch, D_epoch. Since the average transition energy between states is negligible compared to the energy dissipated executing requests, we drop the corresponding term. Moreover, to account for performance degradation, we incorporate into the cost function a penalty that depends on the occupancy level of the system's ready queue, q, in state s. This penalty is a weighted linear function of the difference between q and a target occupancy level, q_0 (set statically by the designer or adaptively by the DPM itself; a typical value of q_0 is 30% to 75% of the maximum queue capacity, since values above this range tend to cause large performance penalties or even request loss, while values below it cause resource underutilization and energy inefficiency due to the non-energy-proportional nature of today's servers [136]). Thus, C(s, a^t) may be rewritten as:

C(s, a^t) = E[ P(s) · D_epoch / H(s) + ρ · |q(s) − q_0| ]    (88)

where P(s) represents the power consumption in state s, D_epoch is the duration of the decision epoch, H(s) is the number of completed (serviced) requests during the last epoch, and E[·] denotes the expected (average) value.

To estimate the power consumption of the CMP in each state, we use the model presented in [138], which provides accurate power and performance models for a high-performance multicore server system. It models the total power consumption of the CMP as a function of the core utilizations u_i:

P(s) = c_1(s) · Σ_{i=1}^{N} u_i(s) + c_0(s)    (89)

where c_1(s) and c_0(s) are regression coefficients of the power macro-model in state s. These coefficients can be extracted for any particular CMP through experiments, as was done for an Intel Xeon 5400-series processor in [138]. The average number of serviced requests in any state s can be computed by dividing the sum of the useful clock cycles of all cores by the average request size in that state, w(s):

H(s) = D_epoch · Σ_i f_i(s) · u_i(s) / w(s)    (90)

where f_i and u_i denote the clock frequency and utilization of the i-th core, respectively. Note that w(s), an input to our problem, captures the size and CPI of the requests arriving in state s; it can be obtained by dynamic profiling with the aid of the processor's built-in performance monitoring units [138].

5.4.2 Policy optimization by value iteration

Finding an optimal power management policy requires a decision-making strategy that maps belief states to actions. The goal is to minimize a cumulative function of the costs, typically the infinite-horizon sum under a discount factor γ (usually just under 1), given by (91):

Φ^{π*}(b^0) = min_π E[ Σ_{t=0}^{∞} γ^t · C_b(b^t, a^t) ]    (91)

The standard family of algorithms for calculating the policy requires storage for two arrays indexed by state: the cost Φ, which contains real values, and the policy π, which contains actions. At the end of the algorithm, π contains the solution and Φ(b^0) contains the discounted sum of the costs to be accrued (on average) by following that solution. The algorithm has the following two steps, which are repeated in some order for all states until no further changes take place:

π(b) = argmin_a Σ_{b'} T_b(b', a, b) · Φ(b')    (92)

Φ(b) = C_b(b, π(b)) + γ · Σ_{b'} T_b(b', π(b), b) · Φ(b')    (93)

In the value-iteration method [135], the π array is not used; instead, the value of π(b) is calculated whenever it is needed.
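The repeated backup described above can be sketched for an already-discretized state space as a generic cost-minimizing value iteration (Python; the toy cost and transition tables are hypothetical, not the CMP model):

```python
def value_iteration(C, T, gamma=0.9, tol=1e-8):
    """Cost-minimizing value iteration over n states and m actions.
    C[s][a] is the immediate cost; T[a][s][s2] = P(s2 | s, a)."""
    n, m = len(C), len(C[0])
    V = [0.0] * n
    while True:
        # Bellman backup: V(s) = min_a [ C(s,a) + gamma * sum_s2 T(s2|s,a) V(s2) ]
        Vn = [min(C[s][a] + gamma * sum(T[a][s][s2] * V[s2] for s2 in range(n))
                  for a in range(m))
              for s in range(n)]
        if max(abs(Vn[s] - V[s]) for s in range(n)) < tol:
            V = Vn
            break
        V = Vn
    # greedy policy extracted from the converged values
    policy = [min(range(m),
                  key=lambda a, s=s: C[s][a] + gamma * sum(
                      T[a][s][s2] * V[s2] for s2 in range(n)))
              for s in range(n)]
    return V, policy
```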
Substituting the calculation of π(b) and (87) into the calculation of Φ(b) gives the combined step:

Φ(b) = min_a { Σ_{s∈S} b(s) · C(s, a) + γ · Σ_{b'} T_b(b', a, b) · Φ(b') }    (94)

Given the optimal cost function, the optimal policy is:

π(b) = argmin_a { Σ_{s∈S} b(s) · C(s, a) + γ · Σ_{b'} T_b(b', a, b) · Φ(b') }    (95)

Note that C(s, a) and T_b are known in this problem statement: the immediate cost of transitions between states can be calculated once the states are defined, and T_b is determined by estimation methods discussed later. Let b^0 denote the initial state of the system; then VA-DPM's goal is to minimize Φ(b^0) using the value-iteration method, as formulated below:

minimize Φ(b^0)
subject to:
  Φ(b) = min_a { C_b(b, a) + γ · Σ_{b'} T_b(b', π(b), b) · Φ(b') }
  π(b) = argmin_a Σ_{b'} T_b(b', a, b) · Φ(b')
  C_b(b, a) = Σ_{s∈S} b(s) · E[ P(s) · D_epoch / H(s) + ρ · |q(s) − q_0| ]    (96)

5.4.3 Belief state estimator (BSE)

We present an online, prediction-based DPM technique that is analytically and statistically tractable. First, assuming we know the distributions of the PVT variations and the observation noise, we define the state and observation models in accordance with our framework as follows:

b^{t+1} = X·b^t + Y·a^t + q^t,  q^t ~ N(0, Q^t)    (97)

o^{t+1} = Z·b^{t+1} + r^{t+1},  r^{t+1} ~ N(0, R^t)    (98)

where q^t is a state noise induced by PVT variations, normally distributed with zero mean and variance Q^t, and r^{t+1} is an observation noise, normally distributed with zero mean and variance R^t. The state transition matrix X contains the probabilities of transitioning from state b^t to state b^{t+1} when action a^t is taken. The action-input matrix Y relates the action input to the state, and the observation matrix Z contains the probabilities of making observation o^{t+1} when action a^t is taken, leading the system to state s^{t+1}. In practice, X, Y, and Z might change with each time step or measurement, but here we assume they are constant.

The estimation algorithm performs the state estimation based on a Kalman filter (KF).
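For a scalar state, one predict/update recursion of such an estimator can be sketched as follows (Python; the matrices of (97)-(98) reduce to scalar gains here, purely for illustration):

```python
def kf_step(b, a, o, X, Y, Z, P, Q, R):
    """One scalar Kalman predict/update step: predict bhat = X*b + Y*a,
    then correct toward observation o with gain K. P is the estimate
    variance, Q the state-noise variance, R the observation-noise variance."""
    # predict
    b_pred = X * b + Y * a
    P_pred = X * P * X + Q
    # update
    K = P_pred * Z / (Z * P_pred * Z + R)
    b_new = b_pred + K * (o - Z * b_pred)
    P_new = (1 - K * Z) * P_pred
    return b_new, P_new
```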
The Kalman filter produces estimates of the true values of measurements and their associated calculated values by predicting a value, estimating the uncertainty of the predicted value, and computing a weighted average of the predicted and measured values. We skip the details due to space limitations (cf. [134]).

1  make observation o^{t+1}
2  estimate the next state, b^{t+1} = BSE(o^{t+1}, a^t, b^t):
2.1  predict:  b̂^{t+1} = X·b^t + Y·a^t
2.2  update:  b^{t+1} = b̂^{t+1} + K^{t+1}·(o^{t+1} − Z·b̂^{t+1})
3  compute T_b by the maximum-likelihood method
4  find the optimal a^{t+1} through value iteration
5  return a^{t+1} = π(b^{t+1})

Figure 5-2. Online algorithm for variability-aware DPM.

Figure 5-2 illustrates one decision (at timestamp t) of the online VA-DPM algorithm that computes the optimal policy given by (94). It estimates the next belief state using the Kalman filter technique, computes the belief-state transition probabilities by deriving maximum-likelihood estimates, and then finds the optimal policy using the value-iteration algorithm explained in section 5.4.2. The immediate cost is provided in constant time, e.g., by table lookup. With n states and m actions, this algorithm takes O(n) time to find a belief state and O(m) time to find an optimal action, for a total running time of O(nm).

5.5 Experimental Results

We evaluated the effectiveness of the proposed VA-DPM in terms of energy saving and performance through a series of experiments. We implemented our optimization algorithms, as well as a real-time simulator that models the operation of a CMP and its power management, in MATLAB [46]. The simulator models a CMP (cf. section 5.3.2), the queuing and execution of requests, and the execution of commands issued by the power manager. Requests are randomly generated and sent to the CMP, with random inter-arrival times following an exponential distribution. Request sizes are based on two different workloads in the SPECWeb2005 benchmark suite [139].
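The request generator just described can be sketched as follows (Python; the seed and helper name are illustrative):

```python
import random

def gen_arrivals(lam: float, n: int, seed: int = 42):
    """Produce n arrival times whose inter-arrival gaps are exponentially
    distributed with rate lam (the arrival rate of section 5.3.2).
    Request sizes would be drawn separately from the profiled workload modes."""
    rng = random.Random(seed)
    t, times = 0.0, []
    for _ in range(n):
        t += rng.expovariate(lam)
        times.append(t)
    return times
```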
In order to model the uncertainty and variability of the parameters, we applied a Gaussian disturbance to the request characteristics at runtime. The processor in our experiments is a dual-core CMP. It has two global voltage levels, and each individual core can operate at a high frequency (2.3 GHz), a low frequency (2.0 GHz), or in sleep mode. The power-performance relationship presented in [138] is used to model the characteristics of our CMP, and, without loss of generality, we use the power coefficients reported in that work as an example; the parameter values may affect the magnitudes of our results, but not the nature of our solution. According to [138], mixing high- and low-frequency settings would not save any power; hence, we define the set of actions to be A = {L0, LL, H0, HH}, where L, H, and 0 denote the low frequency, the high frequency, and sleep mode, respectively. Note that the CMP's voltage is set automatically by the hardware to correspond to the highest employed frequency. Continuous parameters such as the arrival rate, request size, queue occupancy, and utilization are discretized into intervals; i.e., the range of utilization levels is divided into the discrete space {U_L = (0, 45%], U_M = (45%, 80%], U_H = (80%, 100%]}, and the queue occupancy space is {q_L = [0, 30%], q_M = (30%, 65%], q_H = (65%, 100%]}. Finally, λ and w are profiled per application into two modes only. We then construct the corresponding discrete state space, S, and observation space, O, as the Cartesian product of <λ, w, q, v, f, u>, in which v and f have no uncertainty. We set γ = 0.9 here; it is a user-defined constant and can be tuned after profiling across different applications.

Figure 5-3 demonstrates the performance of our belief state estimator, which estimates the belief of total utilization (the sum of the probabilities of states with the same utilization level), starting at an initial state U_L (low utilization). As illustrated, the final belief state here indicates that medium utilization (U_M) is most likely.

Figure 5-3.
Performance of belief state estimator.

Next, we define two baseline power management techniques to establish a basis of comparison for the overall evaluation of the proposed VA-DPM. The first, called BASE, is a conventional DPM that acts only on observations, without any prediction of the underlying state. It uses an offline-optimized policy to select the best action for any observation. The second, called uncertainty-tolerant DPM (UT-DPM), is a POMDP similar to VA-DPM but without the ability to capture variation-related effects: it ignores the state noise in (97) and captures only the uncertainty in observations.

Figure 5-4 shows the power consumption of the three power management algorithms, BASE, UT-DPM, and VA-DPM. In this figure, the action ID corresponds to the optimal frequency level chosen by each algorithm to minimize the cost function in (96). As can be seen in Figure 5-4, BASE frequently switches between actions: its decisions are independent of PVT variations, hence non-optimal, and it responds too quickly to uncertain input data, causing excessive power consumption for any particular throughput. In contrast, VA-DPM takes advantage of the estimated unobservable parameters and makes better decisions. The comparison of UT-DPM and VA-DPM, in turn, shows the importance of optimal action selection given an estimate of the system state. As mentioned, the only difference between these two algorithms is that UT-DPM accounts for observation noise but not for variations, while VA-DPM considers both. As Figure 5-4 shows, the actions taken by VA-DPM result in lower power consumption, on average.

Figure 5-4. Comparison of BASE, UT-DPM, and VA-DPM: (a) power consumption, (b) queue occupancy, (c) action ID.
Average power consumption and average queue occupancy of the three algorithms are shown in Figure 5-5 (q_0 was set to 40%). VA-DPM reaches an average power saving of 13% with respect to BASE by optimally handling variability and uncertainty. The difference between the power consumption of UT-DPM and VA-DPM, however, is negligible (VA-DPM consumes 3% less power than UT-DPM), because the two operate identically where system parameters have not been affected by variations. Nevertheless, VA-DPM provides good solutions and guaranteed correctness under all corner cases, and the variance of the power and q values is lowest for VA-DPM among all three methods.
Figure 5-5. Comparison of BASE, UT-DPM, and VA-DPM.
As the final experiment, in order to test the effect of the performance penalty, we applied a heavier workload, which has a 20% larger average request size, and used a 20% higher busy-queue penalty factor (cf. (88)). As a result of this change in the performance requirements, the overall power consumption of VA-DPM increased by about 11%, but the percentage of time the queue is crowded (i.e., in state q_H) dropped to about 5% (a 6% improvement), which in turn translates to a lower request loss rate.
5.6 Summary
In this chapter, we tackled the problem of system-level dynamic power management (DPM) in CMP architectures subject to process variations and widely varying environmental conditions.
Our proposed solution is based on a Partially Observable Markov Decision Process, which finds the optimal policy that stochastically minimizes the energy per request plus a performance degradation penalty. It uses well-established prediction techniques to estimate the system state and take the globally optimal action accordingly. In comparison to the baseline power management algorithms, our technique gains an average power saving of 13%.
Chapter 6. CONCLUSION
6.1 Summary of Contributions
This dissertation has contributed new problem formulations, circuit designs, and chip-level power management techniques for digital VLSI systems. In Chapter 2, we presented a novel circuit-level power optimization method to minimize the power-delay product metric of a linear pipeline, in both deterministic and probabilistic frameworks. Our method benefits from optimally designed soft-edge flip-flops (SEFFs) to perform time borrowing among the pipeline stages. We described a unified methodology for optimally designing SEFFs and selecting the optimum operating voltage and frequency of a linear pipeline so as to achieve the minimum power-delay product. We formulated this problem both for deterministic stage delays and for stage delays modeled as random variables. Our method minimizes the power-delay product while ensuring that the probability of timing violations due to the increased operating frequency of the pipeline is bounded. We also utilized error detection mechanisms to over-clock the pipeline and further minimize its expected power-delay product. Portions of this work have appeared in [21], [22], and [23]. In Chapter 3, we described our proposed hierarchical global power manager, which minimizes the total power consumption of a CMP subject to a CMP-level average throughput target by taking advantage of DVFS, core consolidation, and task assignment. Portions of this work have been published in [140] and [141].
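The Chapter 3 problem just summarized can be stated compactly as the following optimization sketch; the notation here is generic and introduced only for illustration, and does not reproduce the chapter's exact formulation:

```latex
\begin{aligned}
\min_{m,\ \{f_i\},\ \text{task assignment}} \quad
  & P_{\mathrm{CMP}} \;=\; \sum_{i=1}^{m} P_i(f_i) \;+\; P_{\mathrm{uncore}}(m) \\
\text{subject to} \quad
  & \sum_{i=1}^{m} \mathrm{IPS}_i(f_i) \;\ge\; \mathrm{IPS}_{\mathrm{target}},
\end{aligned}
```

where $m$ is the number of consolidated (active) cores, $f_i$ is the DVFS setting of core $i$, and the throughput constraint is enforced on average over the decision epoch.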
Chapter 4 explained our proposed approach to the problem of performance maximization in a CMP under power budget and thermal constraints, using a hierarchical power management unit that utilizes DVFS, core consolidation, and optimum task re-assignment. We introduced a mathematically rigorous and robust algorithm for power and thermal management of CMPs subject to variability and uncertainty in system parameters, called the Variation-aware Power/Thermal Manager (VPTM), which is a hierarchical dynamic power and thermal manager for homogeneous and heterogeneous CMP architectures. VPTM utilizes DVFS and core consolidation as well as parallel feedback controllers to manage the core power consumptions, which implicitly regulates the core temperatures. In Chapter 5, we presented a stochastic power management technique that addresses the problem of variation-aware power optimization in CMPs, using a Partially Observable Markov Decision Process (POMDP). Parts of this work have been published in [142].
6.2 Future Directions
In Chapter 3 and Chapter 4, we proposed two hierarchical dynamic power and thermal management methodologies for CMPs, based on core consolidation, coarse-grain DVFS, task assignment, and closed-loop feedback DVFS. The proposed solutions have a number of limitations, which can be summarized as follows: (i) they focus on independent tasks, ignoring both inter-task communication and parallel, multi-threaded tasks; and (ii) their reliance on centralized power management and task assignment limits their scalability. The advantage of a centralized task dispatch mechanism is the higher quality (near-optimality) of the task assignment, while a distributed task fetch mechanism scales better to a large number of cores at the cost of less control over the optimality of the task assignment. In Chapter 4, we presented VPTM, a robust dynamic power and thermal manager for heterogeneous CMPs.
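VPTM's parallel per-core feedback loops are described above only at a high level; the sketch below shows one common way such a loop can be realized, as a discrete velocity-form PI controller that steers a core's frequency so that its measured power tracks the budget allocated to it. The cubic power model, the gains, and the frequency range are illustrative assumptions, not the dissertation's calibrated values.

```python
# Illustrative per-core power-tracking loop in the spirit of VPTM's
# parallel feedback controllers. All constants below are assumptions.
F_MIN, F_MAX = 1.0, 3.0    # available DVFS range in GHz (assumed)
KP, KI = 0.005, 0.002      # controller gains (assumed, tuned for this toy model)

def core_power(freq_ghz):
    """Toy dynamic-power model: P grows roughly as f^3 when V scales with f."""
    return 4.0 * freq_ghz ** 3

def make_pi_controller(budget_w):
    prev_error = 0.0
    def step(measured_w, freq):
        nonlocal prev_error
        error = budget_w - measured_w                   # + error => power headroom
        delta = KP * (error - prev_error) + KI * error  # velocity-form PI increment
        prev_error = error
        return min(F_MAX, max(F_MIN, freq + delta))     # clamp to the DVFS range
    return step

controller = make_pi_controller(budget_w=40.0)
freq = F_MAX                      # start at the fastest setting
for _ in range(300):              # iterate the control loop to steady state
    freq = controller(core_power(freq), freq)
# freq now sits near the frequency whose power matches the 40 W budget
```

Because the controller regulates power rather than temperature directly, temperature is regulated implicitly, matching the VPTM description above; a per-core thermal sensor could be folded in the same way by shrinking the budget when a temperature threshold is approached.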
An interesting extension to this work is development of a framework where the system level performance objective is the average response time per task rather than overall instruction throughput (IPS) of tasks. This extension is highly useful for many high-end servers and hosting datacenters where the end user system cares about the latency. The new constraint will have to be handled in modeling and formulation, which requires a system level performance model that describes the average response time of a task as a function of task and CMP properties, such as CMP frequency, and should also account for the average wait time of task in system queue. In Chapter 5, we proposed a POMDP based dynamic power management (DPM) which uses well established prediction techniques to estimate system state and take the globally optimal action accordingly. This work can be extended in the estimation techniques used, in order to utilize a lighter and faster calculation algorithm for power manager. Also, a mechanism can be adopted to make better use of correlated observations to gather more reliable information, and decrease its uncertainty at the source (sensor). 178 BIBLIOGRAPHY [1] A. Abdollahi and M. Pedram, “Power minimization techniques at the RT-level and below,” in SoC: Next Generation Electronics, B. M. Al-Hashimi, Ed. New York, NY: IEE Press, 2005. [2] AMD Opteron processors, http://en.wikipedia.org/wiki/Opteron [online] [3] Intel Xeon processors, http://en.wikipedia.org/wiki/Xeon [online] [4] B. Amelifard, F. Fallah, M. Pedram, "Low-power fanout optimization using MTCMOS and multi-Vt techniques," in Proc. of International Symposium on Low Power Electronics and Design, 2006, pp. 334 -337. [5] S. S. Sapatnekar, "Power-delay optimization in gate sizing," ACM Trans. on Design Automation of Electronic Systems, vol. 5, no. 1, Jan. 2000, pp. 98-114. [6] A. Agarwal, C. Kim, S. Mukhopadhyay, and K. 
Roy, "Leakage in nano-scale technologies: mechanisms, impact and design considerations," in Proc. of Design Automation Conference, 2004, pp. 6-11.
[7] S. Iman and M. Pedram, "An approach for multi-level logic optimization targeting low power," IEEE Trans. on Computer Aided Design, vol. 15, no. 8, Aug. 1996, pp. 889-901.
[8] P. Gupta, A. B. Kahng, P. Sharma, and D. Sylvester, "Gate-length biasing for runtime leakage control," IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, vol. 25, no. 8, Aug. 2006, pp. 1475-1485.
[9] K. Shi and D. Howard, "Challenges in sleep transistor design and implementation in low-power design," in Proc. of Design Automation Conference, 2006, pp. 113-116.
[10] J. W. Tschanz, S. G. Narendra, Y. Ye, B. A. Bloechel, et al., "Dynamic sleep transistor and body bias for active leakage power control of microprocessors," IEEE Journal of Solid-State Circuits, vol. 38, 2003, pp. 1838-1845.
[11] H. Partovi, et al., "Flow-through latch and edge-triggered flip-flop hybrid elements," Proc. Solid-State Circuits Conf., 1996.
[12] J.-M. Chang and M. Pedram, "Energy minimization using multiple supply voltages," IEEE Trans. on Very Large Scale Integration (VLSI) Systems, vol. 5, no. 4, Dec. 1997, pp. 436-443.
[13] S. Manne, A. Klauser, and D. Grunwald, "Pipeline gating: speculation control for energy reduction," Proc. Int'l Symp. on Computer Architecture, 1998.
[14] H. M. Jacobson, "Improved clock-gating through transparent pipelining," Proc. Int'l Symp. on Low Power Electronics and Design, 2004.
[15] H. Jacobson, et al., "Stretching the limits of clock-gating efficiency in server-class processors," Proc. Int'l Symp. on High-Performance Computer Architecture, 2005.
[16] D. Ernst, et al., "Razor: a low-power pipeline based on circuit-level timing speculation," Proc. Int'l Symp. on Microarchitecture, 2003.
[17] K. Choi, R. Soma, and M. Pedram, "Fine-grained dynamic voltage and frequency scaling for precise energy and performance trade-off based on the ratio of off-chip access to on-chip computation times," IEEE Trans. on Computer Aided Design, vol. 24, no. 1, 2005, pp. 18-28.
[18] D. Harris and M. A. Horowitz, "Skew-tolerant domino circuits," IEEE Journal of Solid-State Circuits, 1997.
[19] V. Joshi, D. Blaauw, and D. Sylvester, "Soft-edge flip-flops for improved timing yield: design and optimization," Proc. Int'l Conference on Computer-Aided Design, 2007.
[20] S. Das, et al., "A self-tuning DVS processor using delay-error detection and correction," IEEE Journal of Solid-State Circuits, 2006.
[21] M. Ghasemazar and M. Pedram, "Minimizing energy cost of throughput in a linear pipeline by opportunistic time borrowing," Proc. Int'l Conf. on Computer Aided Design, 2008.
[22] M. Ghasemazar, B. Amelifard, and M. Pedram, "A mathematical solution to power optimal pipeline design by utilizing soft-edge flip-flops," Proc. Int'l Symp. on Low Power Electronics and Design, 2008.
[23] M. Ghasemazar and M. Pedram, "Optimizing the Power-Delay Product of a Linear Pipeline by Opportunistic Time Borrowing," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 30, no. 10, Oct. 2011, pp. 1493-1506.
[24] A. Dasdan, I. Hom, "Handling Inverted Temperature Dependence in Static Timing Analysis," ACM Trans. on Design Automation of Electronic Systems, vol. 11, no. 2, Apr. 2006.
[25] M. Pedram and S. Nazarian, "Thermal Modeling, Analysis and Management in VLSI Circuits: Principles and Methods," Proc. of IEEE, Special Issue on Thermal Analysis of ULSI, vol. 94, 2006, pp. 1487-1501.
[26] V. Joshi, R. R. Rao, D. Blaauw, D. Sylvester, "Logic SER reduction through flipflop redesign," Proc. Int'l Symp. on Quality Electronic Design, 2006.
[27] A. Tiwari, S. R. Sarangi, J. Torrellas, "ReCycle: pipeline adaptation to tolerate process variation," Proc. Int'l Symp. on Computer Architecture, 2007.
[28] H. Lee, S. Paik, Y. Shin, "Pulse width allocation with clock skew scheduling for optimizing pulsed latch-based sequential circuits," Proc. of Int'l Conf. on Computer-Aided Design, 2008.
[29] R. B. Deokar and S. S. Sapatnekar, "A fresh look at retiming via clock skew optimization," Proc. Design Automation Conference, 1995.
[30] S. Lee, et al., "Reducing pipeline energy demands with local DVS and dynamic retiming," Proc. Int'l Symp. on Low Power Electronics and Design, 2004.
[31] D. Blaauw, et al., "Razor II: in-situ error detection and correction for PVT and SER tolerance," Proc. Int'l Solid-State Circuits Conference, 2008.
[32] G. Yan, et al., "MicroFix: exploiting path-grained timing adaptability for improving power-performance efficiency," Proc. Int'l Symp. on Low Power Electronics and Design, 2009.
[33] Y. Ye, et al., "Statistical modeling and simulation of threshold variation under dopant fluctuations and line-edge roughness," Proc. Design Automation Conference, 2008.
[34] S. R. Nassif, "Modeling and analysis of manufacturing variations," Proc. IEEE Custom Integrated Circuits Conference, 2001.
[35] S. Ross, Introduction to Probability Models, 9th edition, Academic Press, USA, 2007.
[36] S. Choi, B. C. Paul, and K. Roy, "Novel sizing algorithm for yield improvement under process variation in nanometer technology," Proc. Design Automation Conference, 2004.
[37] M. Orshansky, A. Bandyopadhyay, "Fast Statistical Timing Analysis Handling Arbitrary Delay Correlations," Proc. Design Automation Conference, 2004.
[38] Y. Zhan, et al., "Correlation-aware statistical timing analysis with non-Gaussian delay distributions," in Proc. Design Automation Conference, 2005.
[39] J. Singh, S. Sapatnekar, "Statistical timing analysis with correlated non-Gaussian parameters using independent component analysis," in Proc. Design Automation Conference, 2006, pp. 155-160.
[40] L. Zhang, et al., "Statistical static timing analysis with conditional linear MAX/MIN approximation and extended canonical timing model," IEEE Trans. Computer-Aided Design Integrated Circuits Syst., vol. 25, 2006.
[41] D. Blaauw, et al., "Statistical timing analysis: from basic principles to state of the art," IEEE Trans. Computer-Aided Design, vol. 27, 2008.
[42] K. A. Sakallah, T. N. Mudge, and O. A. Olukotun, "checkTc and minTc: timing verification and optimal clocking of synchronous digital circuits," Proc. of Int'l Conf. on Computer Aided Design, November 1990.
[43] J. Le, X. Li, L. T. Pileggi, "STAC: Statistical Timing Analysis with Correlation," Proc. of Design Automation Conference, 2004, pp. 343-348.
[44] M. Mani, A. Devgan, and M. Orshansky, "An Efficient Algorithm for Statistical Minimization of Total Power under Timing Yield Constraints," Proc. of Design Automation Conference, 2005.
[45] V. G. Oklobdzija, R. K. Krishnamurthy, High-Performance Energy-Efficient Microprocessor Design (Series on Integrated Circuits and Systems), 1st ed., Springer, 2006.
[46] MATLAB Optimization, http://www.mathworks.com [online]
[47] Tomlab Optimization, http://tomopt.com/tomlab/ [online]
[48] HSPICE: gold standard for accurate circuit simulation, http://www.synopsys.com/products/mixedsignal/hspice/hspice.htm [online]
[49] Predictive Technology Model, http://ptm.asu.edu/ [online]
[50] E. M. Sentovich, et al., "SIS: A System for Sequential Circuit Synthesis," University of California, Berkeley, Report M92/41, May 1992.
[51] C. Isci, A. Buyuktosunoglu, C. Cher, P. Bose, and M. Martonosi, "An Analysis of Efficient Multi-Core Global Power Management Policies: Maximizing Performance for a Given Power Budget," Proc. IEEE/ACM International Symposium on Microarchitecture, 2006, pp. 347-358.
[52] S. Herbert, D. Marculescu, "Analysis of dynamic voltage/frequency scaling in chip-multiprocessors," Proc. of Int'l Symp. on Low Power Electronics and Design, 2007.
[53] J. Sharkey, A. Buyuktosunoglu, and P. Bose, "Evaluating Design Tradeoffs in On-Chip Power Management for CMPs," Proc. of Int'l Symp. on Low Power Electronics and Design, 2007.
[54] J. Li, J. F. Martinez, "Dynamic power-performance adaptation of parallel computation on chip multiprocessors," Proc. Int'l Symp. on High-Performance Computer Architecture, 2006.
[55] I. Kadayif, M. Kandemir, and I. Kolcu, "Exploiting processor workload heterogeneity for reducing energy consumption in chip multiprocessors," in Design, Automation and Test in Europe, February 2004.
[56] R. Kumar, D. M. Tullsen, N. P. Jouppi, P. Ranganathan, "Heterogeneous Chip Multiprocessors," IEEE Computer, vol. 38, no. 11, 2005, pp. 32-38.
[57] S. Ghiasi, T. Keller, F. Rawson, "Scheduling for heterogeneous processors in server systems," Proc. of the 2nd Conf. on Computing Frontiers, 2005.
[58] R. Rao and S. Vrudhula, "Efficient online computation of core speeds to maximize the throughput of thermally constrained multi-core processors," Proc. of Int'l Conf. on Computer-Aided Design, 2008.
[59] F. Xia, Y.-C. Tian, Y. Sun, J. Dong, "Control-Theoretic Dynamic Voltage Scaling for Embedded Controllers," IET Computers and Digital Techniques, 2008.
[60] H. Aydin, Q. Yang, "Energy-Aware Partitioning for Multiprocessor Real-Time Systems," Proc. Int'l Symp. on Parallel and Distributed Processing, 2003.
[61] Y. Xie, W. Wolf, "Allocation and scheduling of conditional task graph in hardware/software co-synthesis," Proc. Conf. on Design Automation and Test in Europe, 2001.
[62] M. Harchol-Balter, M. E. Crovella, C. Murta, "On choosing a task assignment policy for a distributed server system," IEEE Journal of Parallel and Distributed Computing, vol. 59, 1999.
[63] M. Annavaram, E. Grochowski, J. Shen, "Mitigating Amdahl's Law through EPI Throttling," Proc. of 32nd Annual Int'l Symp. on Computer Architecture, 2005.
[64] I. Yeo, C. C. Liu, E. J. Kim, "Predictive dynamic thermal management for multicore systems," Proc. of the 45th Annual Design Automation Conference, 2008.
[65] M. Gomaa, M. D. Powell, T. Vijaykumar, "Heat-and-run: leveraging SMT and CMP to manage power density through the operating system," SIGOPS Operating System Review, 2004.
[66] G. Qu, "Power Management of Multicore Multiple Voltage Embedded Systems by Task Scheduling," Proc. Int'l Conf. on Parallel Processing Workshops, 2007, pp. 78-83.
[67] K. Choi, R. Soma, and M. Pedram, "Dynamic voltage and frequency scaling based on workload decomposition," Proc. of Int'l Symp. on Low Power Electronics and Design, Aug. 2004, pp. 174-179.
[68] J. Donald and M. Martonosi, "Techniques for Multicore Thermal Management: Classification and New Exploration," SIGARCH Computer Architecture News, 2006.
[69] P. Juang, Q. Wu, L. Peh, M. Martonosi, D. W. Clark, "Coordinated, distributed, formal energy management of chip multiprocessors," Proc. of Int'l Symp. on Low Power Electronics and Design, 2005.
[70] Intel Xeon, http://www.intel.com/products/processor_number/chart/xeon.htm [online]
[71] P. Rong and M. Pedram, "Energy-aware task scheduling and dynamic voltage scaling in a real-time system," Int'l Journal of Low Power Electronics, American Scientific Publishers, vol. 4, no. 1, Apr. 2008.
[72] S. L. Hary and F. Ozguner, "Precedence-Constrained Task Allocation onto Point-to-Point Networks for Pipelined Execution," IEEE Trans. on Parallel and Distributed Systems, vol. 10, no. 8, Aug. 1999.
[73] W. Kim, M. Gupta, G. Y. Wei, D. Brooks, "System level analysis of fast, per-core DVFS using on-chip switching regulators," Proc. Int'l Symp. on High-Performance Computer Architecture, 2008.
[74] SPEC Web2009, http://www.spec.org/web2009 [online]
[75] Intel Corp., Intel® 64 and IA-32 Architectures Software Developer's Manual, 2009, http://www.intel.com/products/processor/manuals/ [online]
[76] M. Zagha, et al., "Performance analysis using the MIPS R10000 performance counters," Proc. Conf. on Supercomputing, 1996.
[77] R. C. Dorf, R. H. Bishop, Modern Control Systems, Prentice Hall, 2008.
[78] M. Shah, et al., "UltraSPARC T2: A Highly-Threaded, Power-Efficient, SPARC SOC," Proc. of Asian Solid-State Circuits Conference, Nov. 2007.
[79] Y. Han, I. Koren, and C. M. Krishna, "TILTS: A fast architectural-level transient thermal simulation method," Journal of Low Power Electronics, vol. 3, no. 1, 2007.
[80] M. A. Breuer, Design Automation of Digital Systems: Theory and Techniques, Prentice Hall, 1975.
[81] M. S. Squilante and E. D. Lazowska, "Using processor-cache affinity information in shared-memory multiprocessor scheduling," IEEE Trans. Parallel Distrib. Syst., vol. 4, Feb. 1993, pp. 131-143.
[82] W. Liao, L. He, and K. M. Lepak, "Temperature and Supply Voltage Aware Performance and Power Modeling at Microarchitecture Level," IEEE Trans. Computer-Aided Design, vol. 24, 2005, pp. 1042-1053.
[83] S. Boyd and L. Vandenberghe, Convex Optimization, Cambridge, UK: Cambridge University Press, 2003.
[84] K. Bernstein, D. J. Frank, A. E. Gattiker, W. Haensch, B. L. Ji, S. R. Nassif, E. J. Nowak, D. J. Pearson, and N. J. Rohrer, "High-performance CMOS variability in the 65-nm regime and beyond," IBM Journal of Research and Development, Aug. 2006.
[85] D. J. Frank, R. Dennard, E. Nowak, P. Solomon, Y. Taur, and H.-S. P. Wong, "Device Scaling Limits of Si MOSFETs and Their Application Dependencies," Proc. IEEE, 2001, pp. 259-288.
[86] T. Mizuno, J. Okamura, and A. Toriumi, "Experimental Study of Threshold Voltage Fluctuation Due to Statistical Variation of Channel Dopant Number in MOSFET's," IEEE Trans. Electron Devices, vol. 41, 1994.
[87] B. E. Stine, D. S. Boning, and J. E. Chung, "Analysis and Decomposition of Spatial Variation in Integrated Circuit Processes and Devices," IEEE Trans. Semiconductor Manuf., Jan. 1997, pp. 24-41.
[88] S. Nassif, "Within-Chip Variability Analysis," IEDM Tech. Digest, 1998, pp. 283-286.
[89] Y. F. Tsai, N. Vijaykrishnan, Y. Xie, M. J. Irwin, "Influence of Leakage Reduction Techniques on Delay/Leakage Uncertainty," Proc. IEEE 18th Int'l Conf. on VLSI Design, Jan. 2005, pp. 374-379.
[90] S. R. Sarangi, B. Greskamp, R. Teodorescu, J. Nakano, A. Tiwari, J. Torrellas, "VARIUS: A Model of Process Variation and Resulting Timing Errors for Microarchitects," IEEE Transactions on Semiconductor Manufacturing, vol. 21, no. 1, Feb. 2008, pp. 3-13.
[91] H. Su, F. Liu, A. Devgan, E. Acar, and S. Nassif, "Full Chip Leakage Estimation Considering Power Supply and Temperature Variations," Proc. Int'l Symp. on Low Power Electronics and Design, Aug. 2003, pp. 78-83.
[92] N. Azizi, M. M. Khellah, V. De, and F. N. Najm, "Variations-aware low-power design with voltage scaling," Proc. Design Automation Conf., 2005, pp. 529-534.
[93] D. Marculescu and E. Talpes, "Variability and energy awareness: a microarchitecture-level perspective," Proc. Design Automation Conf., 2005, pp. 11-16.
[94] N. S. Kim, T. Kgil, K. Bowman, V. De, and T. Mudge, "Total power optimal pipelining and parallel processing under process variations in nanometer technology," Proc. Int'l Conf. on Computer Aided Design, Nov. 2005, pp. 535-540.
[95] E. Humenay, D. Tarjan, and K. Skadron, "The impact of systematic process variations on symmetrical performance in chip multiprocessors," in Design, Automation and Test in Europe, April 2007.
[96] S. Herbert, D. Marculescu, "Characterizing Chip-Multiprocessor Variability-Tolerance," in Proc. of Design Automation Conference, 2008.
[97] S. Heo, K. Barr, and K. Asanovic, "Reducing power density through activity migration," in International Symposium on Low Power Electronics and Design, August 2003.
[98] K. Stavrou and P. Trancoso, "Thermal-aware scheduling: A solution for future chip multiprocessors' thermal problems," in EUROMICRO Conference on Digital System Design, 2006.
[99] D. Brooks, M. Martonosi, "Dynamic Thermal Management for High-Performance Microprocessors," Proc. International Symposium on High-Performance Computer Architecture, Jan. 2001, p. 171.
[100] K. Skadron, et al., "Temperature-Aware Microarchitecture: Modeling and Implementation," ACM Transactions on Architecture and Code Optimization, vol. 1, no. 1, March 2004, pp. 94-125.
[101] L. Benini, G. Paleologo, A. Bogliolo, and G. De Micheli, "Policy Optimization for Dynamic Power Management," IEEE Trans. on Computer Aided Design, Jun. 1999, pp. 813-833.
[102] H-S. Jung, P. Rong, and M. Pedram, "Stochastic modeling of a thermally-managed multi-core system," Proc. Design Automation Conf., Jun. 2008, pp. 728-733.
[103] S. Park, W. Jiang, Y. Zhou, and S. Adve, "Managing energy-performance tradeoffs for multithreaded applications on multiprocessor architectures," Proc. of the 2007 ACM International Conference on Measurement and Modeling of Computer Systems, 2007, pp. 169-180.
[104] M. Curtis-Maury, A. Shah, F. Blagojevic, D. S. Nikolopoulos, B. R. de Supinski, and M. Schulz, "Prediction models for multi-dimensional power-performance optimization on many cores," Proc. International Conference on Parallel Architectures and Compilation Techniques, 2008, pp. 250-259.
[105] R. Bergamaschi, G. Han, A. Buyuktosunoglu, H. Patel, I. Nair, G. Dittmann, G. Janssen, N. Dhanwada, Z. Hu, P. Bose, and J. Darringer, "Exploring power management in multi-core systems," Proc. Asia and South Pacific Design Automation Conference, 2008, pp. 708-713.
[106] H-S. Jung and M. Pedram, "Dynamic Power Management under Uncertain Information," Proc. Design Automation and Test in Europe, Apr. 2007, pp. 1060-1065.
[107] R. Teodorescu, J. Torrellas, "Variation-Aware Application Scheduling and Power Management for Chip Multiprocessors," Proc. International Symposium on Computer Architecture, June 2008, pp. 363-374.
[108] A. Mutapcic, et al., "Processor Speed Control with Thermal Constraints," IEEE Trans. on Circuits and Systems—I, vol. 56, no. 9, 2009.
[109] J. Dorsey, et al., "An integrated quadcore Opteron processor," in International Solid State Circuits Conference, February 2007.
[110] R. McGowen, et al., "Power and temperature control on a 90-nm Itanium family processor," Journal of Solid-State Circuits, January 2006.
[111] G. F. Franklin, J. D. Powell, and A. Emami-Naeini, Feedback Control of Dynamic Systems, Addison-Wesley, third edition, 1994.
[112] J. A. Stankovic, C. Lu, S. H. Son, and G. Tao, "The case for feedback control real-time scheduling," Proc. of the IEEE Euromicro Conf. on Real-Time, Jun. 1998.
[113] Z. Lu, J. Hein, M. Humphrey, M. Stan, J. Lach, and K. Skadron, "Control-theoretic dynamic frequency and voltage scaling for multimedia workloads," Proc. of the Int'l Conf. on Compilers, Architecture, and Synthesis for Embedded Systems, Oct. 2002, pp. 156-163.
[114] Z. Lu, J. Lach, M. Stan, and K. Skadron, "Reducing Multimedia Decode Power using Feedback Control," Proc. 21st Int'l Conf. on Computer Design, 2003, pp. 489-496.
[115] A. Varma, B. Ganesh, M. Sen, S. R. Choudhary, L. Srinivasan, and B. Jacob, "A control-theoretic approach to dynamic voltage scaling," Proc. Int'l Conf. on Compilers, Architectures, and Synthesis for Embedded Systems, Oct. 2003, pp. 255-266.
[116] M. Weiser, B. Welch, A. Demers, and S. Shenker, "Scheduling for reduced CPU energy," Proc. USENIX Symp. on Operating Systems Design and Implementation, Nov. 1994, pp. 13-23.
[117] T. Pering and R. Broderson, "The simulation and evaluation of dynamic voltage scaling algorithms," Proc. Int'l Symp. on Low-Power Electronics and Design, Jun. 1998.
[118] N. Kandasamy, S. Abdelwahed, G. Sharp, J. Hayes, "An Online Control Framework for Designing Self-Optimizing Computing Systems: Application to Power Management," in Self-Star Properties in Complex Information Systems, O. Babaoglu et al. (Eds.), Lecture Notes in Computer Science, Springer-Verlag, 2005, pp. 174-189.
[119] A. Alimonda, et al., "A Control Theoretic Approach to Run-Time Energy Optimization of Pipelined Processing in MPSoCs," Proc. Design, Automation and Test in Europe, 2006, pp. 876-877.
[120] Q. Wu, P. Juang, M. Martonosi, D. W. Clark, "Formal Online Methods for Voltage/Frequency Control in Multiple Clock Domain Microprocessors," Proc. ASPLOS-XI, Oct. 2004, pp. 248-259.
[121] P. Barham, et al., "Xen and the art of virtualization," SIGOPS Oper. Syst. Rev., vol. 37, 2003, pp. 164-177.
[122] Performance Counter for Linux, http://user.it.uu.se/~mikpe/linux/perfctr/ [online]
[123] Q. Qiu, Q. Wu, M. Pedram, "Stochastic Modeling of a Power-Managed System: Construction and Optimization," IEEE Trans. on Computer-Aided Design, Oct. 2001, pp. 1200-1217.
[124] T. Simunic, L. Benini, P. Glynn, G. De Micheli, "Event-driven Power Management," IEEE Trans. on Computer Aided Design of Integrated Circuits and Systems, Jul. 2001, pp. 840-857.
[125] Z. Ren, B. H. Krogh, R. Marculescu, "Hierarchical Adaptive Dynamic Power Management," IEEE Trans. on Computers, Apr. 2005.
[126] Y. Tan, Q. Qiu, "A framework of stochastic power management using hidden Markov model," Proc. Design Automation and Test in Europe, 2008.
[127] H-S. Jung, M. Pedram, "Uncertainty-aware dynamic power management in partially observable domains," IEEE Trans. VLSI Systems, Jun. 2009.
[128] F. Meyer, M. Schmitt, Space, Structure and Randomness: Contributions in Honor of Georges Matheron in the Fields of Geostatistics, Random Sets and Mathematical Morphology, Springer, 2007.
[129] AMD Corp., http://support.amd.com/us/Processor_TechDocs/40036.pdf [online]
[130] Intel Co., http://www.intel.com/technology/architecture-silicon/next-gen/ [online]
[131] S. Eyerman, L. Eeckhout, T. Karkhanis, J. E. Smith, "A performance counter architecture for computing accurate CPI components," SIGOPS Oper. Syst. Rev., vol. 40, no. 5, Oct. 2006, pp. 175-184.
[132] Q. Qiu, M. Pedram, "Dynamic power management based on continuous-time Markov decision processes," Proc. Design Automation Conf., 1999.
[133] A. R. Cassandra, L. P. Kaelbling, M. L. Littman, "Acting Optimally in Partially Observable Stochastic Domains," Proc. Conf. on Artificial Intelligence, Aug. 1996, pp. 1023-1028.
[134] G. Welch and G. Bishop, "An introduction to the Kalman filter," Technical Report TR 95-041, University of North Carolina, Department of Computer Science, 1995.
[135] J. Pineau, G. Gordon, S. Thrun, "Point-based value iteration: An anytime algorithm for POMDPs," Int'l Joint Conf. on Artificial Intelligence, 2003.
[136] L. A. Barroso, U. Hölzle, "The case for energy-proportional computing," IEEE Computer, vol. 40, 2007.
[137] M. L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming, Wiley, New York, 1994.
[138] I. Hwang, M. Pedram, "Power and performance modeling in a virtualized server system," in Proc. of Int'l Conf. on Parallel Processing Workshops, 2010.
[139] SPEC Web 2005, http://www.spec.org/web2005/ [online]
[140] M. Ghasemazar, E. Pakbaznia, M. Pedram, "Minimizing Energy Consumption of a Chip Multiprocessor through Simultaneous Core Consolidation and Dynamic Voltage/Frequency Scaling," IEEE International Symposium on Circuits and Systems (ISCAS), 2010.
[141] M. Ghasemazar, E. Pakbaznia, M. Pedram, "Minimizing the Power Consumption of a Chip Multiprocessor under an Average Throughput Constraint," International Symposium on Quality Electronic Design (ISQED), 2010.
[142] M. Ghasemazar, M. Pedram, "Variability Aware Dynamic Power Management for Chip Multiprocessor Architectures," Design Automation and Test in Europe (DATE), Mar. 2011.
[143] A. Mutapcic, S. Boyd, S. Murali, D. Atienza, G. De Micheli, and R. Gupta, "Processor Speed Control with Thermal Constraints," IEEE Trans. on Circuits and Systems—I, vol. 56, no. 9, 2009.
[144] Y. Wang, K. Ma, and X. Wang, "Temperature-Constrained Power Control for Chip Multiprocessors with Online Model Estimation," International Symposium on Computer Architecture, 2009.
[145] A.
Bartolini, M. Cacciari, A. Tilli, L. Benini, “A Distributed and Self-Calibrating Model- Predictice Controller for Energy and Thermal management of High Perfromance Multicores,” DATE, 2011. [146] Intel Co., ftp://download.intel.com/design/network/papers/30117401.pdf [online] [147] P. Kongetira, K. Aingaran, and K. Olukotun, “Niagara: A 32-Way Multithreaded SPARC Processor,” IEEE MICRO Magazine, 2005. [148] S. Murali, A. Mutapcic, D. Atienza, R. Gupta, S. Boyd, and G. De Micheli, “Temperature- Aware Processor Frequency Assignment for MPSoCs Using Convex Optimization”, IEEE/ACM int’l conf. on Hardware/software Codesign and System Synthesis, 2007. [149] Advanced Micro Devices, Family 10h AMD Opteron Processor Product Data Sheet , Revision: 3.04, http://support.amd.com/us/Processor_TechDocs/40036.pdf [online] [150] K. Srinivasan and K. S. Chatha, “Integer linear programming and heuristic techniques for system-level low power scheduling on mul-tiprocessor architectures under throughput constraints,” Integration VLSI, vol. 40, no. 3, 2007. [151] R. Sarikaya and A. Buyuktosunoglu, "A Unified Prediction Method for Predicting Program Behavior,", IEEE Transactions on Computers, vol.59, no.2, pp.272-282, Feb. 2010 [152] C. Isci, G. Contreras, and M. Martonosi, “Live, Runtime Phase Monitoring and Prediction on Real Systems with Application to Dynamic Power Management,” Proc. Int’l Symp. Microarchitecture, Dec. 2006. [153] http://www.theregister.co.uk/2011/02/25/intel_westmere_ex_sandy_bridge_ep_xeons/print.h tml [online] [154] Standard Performance Evaluation Corporation, http://www.spec.org/ [online] [155] Linear Programming, Wikipedia, http://en.wikipedia.org/wiki/Linear_programming [online] [156] LM Sensors, http://www.lm-sensors.org/wiki/Documentation [online] 189 ALPHABETIZED BIBLIOGRAPHY A. Abdollahi and M. Pedram, “Power minimization techniques at the RT-level and below,” in SoC: Next Generation Electronics, B. M. Al-Hashimi, Ed. New York, NY: IEE Press, 2005. A. 
Agarwal, C. Kim, S. Mukhopadhyay, and K. Roy, "Leakage in nano-scale technologies: mechanisms, impact and design considerations," in Proc. of Design Automation Conference, 2004, pp. 6-11.
A. Alimonda, et al., “A Control Theoretic Approach to Run-Time Energy Optimization of Pipelined Processing in MPSoCs,” Proc. Design, Automation and Test in Europe, 2006, pp. 876-877.
B. Amelifard, F. Fallah, M. Pedram, "Low-power fanout optimization using MTCMOS and multi-Vt techniques," in Proc. of International Symposium on Low Power Electronics and Design, 2006, pp. 334-337.
M. Annavaram, E. Grochowski, J. Shen, “Mitigating Amdahl's Law through EPI Throttling,” Proc. of 32nd Annual Int’l Symp. on Computer Architecture, 2005.
H. Aydin, Q. Yang, “Energy-Aware Partitioning for Multiprocessor Real-Time Systems,” Proc. Int’l Symp. on Parallel and Distributed Processing, 2003.
N. Azizi, M. M. Khellah, V. De, and F. N. Najm, “Variations-aware low-power design with voltage scaling,” Proc. Design Automation Conf., 2005, pp. 529–534.
P. Barham, et al., “Xen and the art of virtualization,” SIGOPS Oper. Syst. Rev. 37, 2003, pp. 164-177.
L. A. Barroso, U. Hölzle, “The case for energy-proportional computing,” IEEE Computer, vol. 40, 2007.
A. Bartolini, M. Cacciari, A. Tilli, L. Benini, “A Distributed and Self-Calibrating Model-Predictive Controller for Energy and Thermal Management of High Performance Multicores,” DATE, 2011.
L. Benini, G. Paleologo, A. Bogliolo, and G. De Micheli, “Policy Optimization for Dynamic Power Management,” IEEE Trans. on Computer Aided Design, Jun. 1999, pp. 813-833.
R. Bergamaschi, G. Han, A. Buyuktosunoglu, H. Patel, I. Nair, G. Dittmann, G. Janssen, N. Dhanwada, Z. Hu, P. Bose, and J. Darringer, "Exploring power management in multi-core systems," Proc. Asia and South Pacific Design Automation Conference, 2008.
K. Bernstein, D. J. Frank, A. E. Gattiker, W. Haensch, B. L. Ji, S. R. Nassif, E. J. Nowak, D. J. Pearson, and N. J.
Rohrer, “High-performance CMOS variability in the 65-nm regime and beyond,” IBM Journal of Research and Development, Aug. 2006.
D. Blaauw, et al., “Razor II: In-situ error detection and correction for PVT and SER tolerance,” Proc. Int’l Solid-State Circuits Conference, 2008.
D. Blaauw, et al., “Statistical timing analysis: from basic principles to state of the art,” IEEE Trans. Computer-Aided Design, vol. 27, 2008.
S. Boyd and L. Vandenberghe, Convex Optimization, Cambridge, UK: Cambridge University Press, 2003.
M. A. Breuer, Design Automation of Digital Systems: Theory and Techniques, Prentice Hall, 1975.
D. Brooks, M. Martonosi, “Dynamic Thermal Management for High-Performance Microprocessors,” Proceedings International Symposium on High-Performance Computer Architecture, p. 171, January 2001.
A. R. Cassandra, L. P. Kaelbling, M. L. Littman, “Acting Optimally in Partially Observable Stochastic Domains,” Proc. Conf. Artificial Intelligence, Aug. 1996, pp. 1023-1028.
J.-M. Chang and M. Pedram, "Energy minimization using multiple supply voltages," IEEE Trans. on Very Large Scale Integration (VLSI) Systems, vol. 5, no. 4, Dec. 1997, pp. 436-443.
S. Choi, B. C. Paul, and K. Roy, “Novel sizing algorithm for yield improvement under process variation in nanometer technology,” Proc. Design Automation Conference, 2004.
K. Choi, R. Soma and M. Pedram, “Dynamic voltage and frequency scaling based on workload decomposition,” Proc. of Int’l Symp. on Low Power Electronics and Design, Aug. 2004, pp. 174-179.
K. Choi, R. Soma, and M. Pedram, "Fine-grained dynamic voltage and frequency scaling for precise energy and performance trade-off based on the ratio of off-chip access to on-chip computation times," IEEE Trans. on Computer Aided Design, vol. 24, no. 1, 2005, pp. 18-28.
M. Curtis-Maury, A. Shah, F. Blagojevic, D. S. Nikolopoulos, B. R. de Supinski, and M. Schulz, "Prediction models for multi-dimensional power-performance optimization on many cores," Proc.
International Conference on Parallel Architectures and Compilation Techniques, pp. 250-259, 2008.
S. Das, et al., “A self-tuning DVS processor using delay-error detection and correction,” IEEE Journal of Solid-State Circuits, 2006.
A. Dasdan, I. Hom, “Handling Inverted Temperature Dependence in Static Timing Analysis,” ACM Trans. on Design Automation of Electronic Systems, vol. 11, no. 2, Apr. 2006.
R. B. Deokar, and S. S. Sapatnekar, “A fresh look at retiming via clock skew optimization,” Proc. Design Automation Conference, 1995.
J. Donald and M. Martonosi, “Techniques for Multicore Thermal Management: Classification and New Exploration,” SIGARCH Computer Architecture News, 2006.
R. C. Dorf, R. H. Bishop, Modern Control Systems, Prentice Hall, 2008.
J. Dorsey, et al., “An integrated quadcore Opteron processor,” in International Solid State Circuits Conference, February 2007.
D. Ernst, et al., "Razor: a low-power pipeline based on circuit-level timing speculation," Proc. Int’l Symp. on Microarchitecture, 2003.
S. Eyerman, L. Eeckhout, T. Karkhanis, J. E. Smith, “A performance counter architecture for computing accurate CPI components,” SIGOPS Oper. Syst. Rev. 40, 5, Oct. 2006, pp. 175-184.
D. J. Frank, R. Dennard, E. Nowak, P. Solomon, Y. Taur, and H.-S. P. Wong, “Device Scaling Limits of Si MOSFETs and Their Application Dependencies,” Proc. IEEE, 2001, pp. 259-288.
G. F. Franklin, J. D. Powell, and A. Emami-Naeini, Feedback Control of Dynamic Systems, Addison-Wesley, third edition, 1994.
M. Ghasemazar, B. Amelifard, and M. Pedram, "A mathematical solution to power optimal pipeline design by utilizing soft-edge flip-flops," Proc. Int’l Symp. on Low Power Electronics and Design, 2008.
M. Ghasemazar, E. Pakbaznia, M.
Pedram, “Minimizing Energy Consumption of a Chip Multiprocessor through Simultaneous Core Consolidation and Dynamic Voltage/Frequency Scaling,” IEEE International Symposium on Circuits and Systems (ISCAS), 2010.
M. Ghasemazar, E. Pakbaznia, M. Pedram, “Minimizing the Power Consumption of a Chip Multiprocessor under an Average Throughput Constraint,” International Symposium on Quality Electronic Design (ISQED), 2010.
M. Ghasemazar, and M. Pedram, “Minimizing energy cost of throughput in a linear pipeline by opportunistic time borrowing,” Proc. Int’l Conf. Computer Aided Design, 2008.
M. Ghasemazar, and M. Pedram, “Optimizing the Power-Delay Product of a Linear Pipeline by Opportunistic Time Borrowing,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 30, no. 10, Oct. 2011, pp. 1493-1506.
M. Ghasemazar, M. Pedram, “Variability Aware Dynamic Power Management for Chip Multiprocessor Architectures,” Design Automation and Test in Europe (DATE), Mar. 2011.
S. Ghiasi, T. Keller, F. Rawson, “Scheduling for heterogeneous processors in server systems,” Proc. of the 2nd Conf. on Computing Frontiers, 2005.
M. Gomaa, M. D. Powell, T. Vijaykumar, “Heat-and-run: leveraging SMT and CMP to manage power density through the operating system,” SIGOPS Operating System Review, 2004.
P. Gupta, A. B. Kahng, P. Sharma, and D. Sylvester, "Gate-length biasing for runtime-leakage control," IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, vol. 25, no. 8, Aug. 2006, pp. 1475-1485.
Y. Han, I. Koren, and C. M. Krishna, “TILTS: A fast architectural-level transient thermal simulation method,” Journal of Low Power Electronics, 3(1), 2007.
M. Harchol-Balter, M. E. Crovella, C. Murta, “On choosing a task assignment policy for distributed server system,” IEEE Journal of Parallel and Distributed Computing, vol. 59, 1999.
D. Harris and M. A. Horowitz, "Skew-tolerant domino circuits," IEEE Journal of Solid-State Circuits, 1997.
S. L.
Hary, and F. Ozguner, “Precedence-Constrained Task Allocation onto Point-to-Point Networks for Pipelined Execution,” IEEE Trans. on Parallel and Distributed Systems, vol. 10, no. 8, Aug. 1999.
S. Heo, K. Barr, and K. Asanovic, “Reducing power density through activity migration,” in International Symposium on Low Power Electronics and Design, August 2003.
S. Herbert, D. Marculescu, “Analysis of dynamic voltage/frequency scaling in chip-multiprocessors,” Proc. of Int’l Symp. on Low Power Electronics and Design, 2007.
S. Herbert, D. Marculescu, “Characterizing Chip-Multiprocessor Variability-Tolerance,” in Proc. of Design Automation Conference, 2008.
E. Humenay, D. Tarjan, and K. Skadron, “The impact of systematic process variations on symmetrical performance in chip multiprocessors,” in Design, Automation and Test in Europe, April 2007.
I. Hwang, M. Pedram, “Power and performance modeling in a virtualized server system,” in Proc. of Int’l Conf. on Parallel Processing Workshops, 2010.
S. Iman and M. Pedram, "An approach for multi-level logic optimization targeting low power," IEEE Trans. on Computer Aided Design, vol. 15, no. 8, Aug. 1996, pp. 889-901.
C. Isci, A. Buyuktosunoglu, C. Cher, P. Bose, and M. Martonosi, "An Analysis of Efficient Multi-Core Global Power Management Policies: Maximizing Performance for a Given Power Budget," Proc. IEEE/ACM International Symposium on Microarchitecture, pp. 347-358, 2006.
C. Isci, G. Contreras, and M. Martonosi, “Live, Runtime Phase Monitoring and Prediction on Real Systems with Application to Dynamic Power Management,” Proc. Int’l Symp. Microarchitecture, Dec. 2006.
H. M. Jacobson, "Improved clock-gating through transparent pipelining," Proc. Int’l Symp. on Low Power Electronics and Design, 2004.
H. Jacobson, et al., "Stretching the limits of clock-gating efficiency in server-class processors," High-Performance Computer Architecture, 2005.
V. Joshi, D. Blaauw, and D.
Sylvester, "Soft-edge flip-flops for improved timing yield: design and optimization," Proc. Int’l Conference on Computer-Aided Design, 2007.
V. Joshi, R. R. Rao, D. Blaauw, D. Sylvester, “Logic SER reduction through flipflop redesign,” Int’l Symp. Quality Electronic Design, 2006.
P. Juang, Q. Wu, L. Peh, M. Martonosi, D. W. Clark, “Coordinated, distributed, formal energy management of chip multiprocessors,” Proc. of Int’l Symp. on Low Power Electronics and Design, 2005.
H-S. Jung, M. Pedram, “Uncertainty-aware dynamic power management in partially observable domains,” IEEE Trans. VLSI Systems, Jun. 2009.
H-S. Jung and M. Pedram, “Dynamic Power Management under Uncertain Information,” Proc. Design Automation and Test in Europe, Apr. 2007, pp. 1060-1065.
H-S. Jung, P. Rong, and M. Pedram, "Stochastic modeling of a thermally-managed multi-core system," Proc. Design Automation Conf., Jun. 2008, pp. 728-733.
I. Kadayif, M. Kandemir, and I. Kolcu, “Exploiting processor workload heterogeneity for reducing energy consumption in chip multiprocessors,” in Design, Automation and Test in Europe, February 2004.
N. Kandasamy, S. Abdelwahed, G. Sharp, J. Hayes, “An Online Control Framework for Designing Self-Optimizing Computing Systems: Application to Power Management,” Self-Star Properties in Complex Information Systems, O. Babaoglu et al., (Eds.), Lecture Notes in Computer Science, Springer-Verlag, 2005, pp. 174-189.
N. S. Kim, T. Kgil, K. Bowman, V. De, and T. Mudge, “Total power optimal pipelining and parallel processing under process variations in nanometer technology,” Proc. Int’l Conf. on Computer Aided Design, Nov. 2005, pp. 535–540.
W. Kim, M. Gupta, G. Y. Wei, D. Brooks, “System level analysis of fast, per-core DVFS using on-chip switching regulators,” Proc. High-Performance Computer Architecture, 2008.
P. Kongetira, K. Aingaran, and K. Olukotun, “Niagara: A 32-Way Multithreaded SPARC Processor,” IEEE MICRO Magazine, 2005.
R. Kumar, D. M. Tullsen, N. P. Jouppi, P.
Ranganathan, “Heterogeneous Chip Multiprocessors,” IEEE Computer, 38(11):32–38, 2005.
J. Le, X. Li, L. T. Pileggi, “STAC: Statistical Timing Analysis with Correlation,” Proc. of Design Automation Conference, pp. 343-348, 2004.
H. Lee, S. Paik, Y. Shin, “Pulse width allocation with clock skew scheduling for optimizing pulsed latch-based sequential circuits,” Proc. of Int’l Conf. on Computer-Aided Design, 2008.
S. Lee, et al., “Reducing pipeline energy demands with local DVS and dynamic retiming,” Int’l Symp. on Low Power Electronics and Design, 2004.
J. Li, J. F. Martinez, “Dynamic power-performance adaptation of parallel computation on chip multiprocessors,” Proc. Int’l Symp. on High-Performance Computer Architecture, 2006.
W. Liao, L. He, and K. M. Lepak, “Temperature and Supply Voltage Aware Performance and Power Modeling at Microarchitecture Level,” IEEE Trans. Computer-Aided Design, 24:1042–1053, 2005.
Z. Lu, J. Hein, M. Humphrey, M. Stan, J. Lach, and K. Skadron, “Control-theoretic dynamic frequency and voltage scaling for multimedia workloads,” Proc. Int’l Conf. on Compilers, Architecture, and Synthesis for Embedded Systems, Oct. 2002, pp. 156-163.
Z. Lu, J. Lach, M. Stan, and K. Skadron, “Reducing Multimedia Decode Power using Feedback Control,” Proc. 21st Int. Conf. on Computer Design, 2003, pp. 489-496.
M. Mani, A. Devgan, and M. Orshansky, “An Efficient Algorithm for Statistical Minimization of Total Power under Timing Yield Constraints,” Proc. of Design Automation Conference, 2005.
S. Manne, A. Klauser, and D. Grunwald, "Pipeline gating: speculation control for energy reduction," Proc. Int’l Symp. Computer Architecture, 1998.
D. Marculescu and E. Talpes, “Variability and energy awareness: a microarchitecture-level perspective,” Proc. Design Automation Conf., 2005, pp. 11–16.
R. McGowen, et al., “Power and temperature control on a 90-nm Itanium family processor,” Journal of Solid-State Circuits, January 2006.
F. Meyer, M.
Schmitt, Space, Structure and Randomness: Contributions in Honor of Georges Matheron in the Fields of Geostatistics, Random Sets and Mathematical Morphology, Springer, 2007.
T. Mizuno, J. Okamura, and A. Toriumi, “Experimental Study of Threshold Voltage Fluctuation Due to Statistical Variation of Channel Dopant Number in MOSFET’s,” IEEE Trans. Electron Devices, vol. 41, 1994.
S. Murali, A. Mutapcic, D. Atienza, R. Gupta, S. Boyd, and G. De Micheli, “Temperature-Aware Processor Frequency Assignment for MPSoCs Using Convex Optimization,” IEEE/ACM Int’l Conf. on Hardware/Software Codesign and System Synthesis, 2007.
A. Mutapcic, S. Boyd, S. Murali, D. Atienza, G. De Micheli, and R. Gupta, “Processor Speed Control with Thermal Constraints,” IEEE Trans. on Circuits and Systems—I, vol. 56, no. 9, 2009.
S. R. Nassif, “Modeling and analysis of manufacturing variations,” Proc. IEEE Custom Integrated Circuits Conference, 2001.
S. Nassif, “Within-Chip Variability Analysis,” IEDM Tech. Digest, 1998, pp. 283-286.
V. G. Oklobdzija, R. K. Krishnamurthy, High-Performance Energy-Efficient Microprocessor Design (Series on Integrated Circuits and Systems), 1st Ed., Springer, 2006.
M. Orshansky, A. Bandyopadhyay, “Fast Statistical Timing Analysis Handling Arbitrary Delay Correlations,” Design Automation Conf., 2004.
S. Park, W. Jiang, Y. Zhou, and S. Adve, "Managing energy-performance tradeoffs for multithreaded applications on multiprocessor architectures," Proceedings of the 2007 ACM International Conference on Measurement and Modeling of Computer Systems, pp. 169-180, 2007.
H. Partovi, et al., "Flow-through latch and edge-triggered flip-flop hybrid elements," Proc. Solid-State Circuits Conf., 1996.
M. Pedram, and S. Nazarian, “Thermal Modeling, Analysis and Management in VLSI Circuits: Principles and Methods,” Proc.
of IEEE, Special Issue on Thermal Analysis of ULSI, vol. 94, 2006, pp. 1487-1501.
T. Pering and R. Broderson, “The simulation and evaluation of dynamic voltage scaling algorithms,” Proc. Int’l Symp. on Low-Power Electronics and Design, Jun. 1998.
J. Pineau, G. Gordon, S. Thrun, “Point-based value iteration: An anytime algorithm for POMDPs,” Int’l Joint Conf. Artificial Intelligence, 2003.
M. L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming, Wiley, New York, 1994.
Q. Qiu, M. Pedram, “Dynamic power management based on continuous-time Markov decision processes,” Proc. Design Automation Conf., 1999.
Q. Qiu, Q. Wu, M. Pedram, “Stochastic Modeling of a Power-Managed System – Construction and Optimization,” IEEE Trans. on Computer-Aided Design, Oct. 2001, pp. 1200-1217.
G. Qu, “Power Management of Multicore Multiple Voltage Embedded Systems by Task Scheduling,” Proc. Int’l Conf. on Parallel Processing Workshops, 2007, pp. 78-83.
R. Rao and S. Vrudhula, “Efficient online computation of core speeds to maximize the throughput of thermally constrained multi-core processors,” Proc. of Int’l Conf. on Computer-Aided Design, 2008.
Z. Ren, B. H. Krogh, R. Marculescu, “Hierarchical Adaptive Dynamic Power Management,” IEEE Trans. on Computers, Apr. 2005.
P. Rong and M. Pedram, “Energy-aware task scheduling and dynamic voltage scaling in a real-time system,” Int'l Journal of Low Power Electronics, American Scientific Publishers, vol. 4, no. 1, Apr. 2008.
S. Ross, Introduction to Probability Models, 9th edition, Academic Press, USA, 2007.
K. A. Sakallah, T. N. Mudge, and O. A. Olukotun, “Check Tc and min Tc: Timing verification and optimal clocking of synchronous digital circuits,” Proc. of Int’l Conf. on Computer Aided Design, November 1990.
S. S. Sapatnekar, "Power-delay optimization in gate sizing," ACM Trans. on Design Automation of Electronic Systems, vol. 5, no. 1, Jan. 2000, pp. 98-114.
S. R. Sarangi, B. Greskamp, R. Teodorescu, J.
Nakano, A. Tiwari, J. Torrellas, "VARIUS: A Model of Process Variation and Resulting Timing Errors for Microarchitects," IEEE Transactions on Semiconductor Manufacturing, vol. 21, no. 1, pp. 3-13, Feb. 2008.
R. Sarikaya and A. Buyuktosunoglu, "A Unified Prediction Method for Predicting Program Behavior," IEEE Transactions on Computers, vol. 59, no. 2, pp. 272-282, Feb. 2010.
E. M. Sentovich, et al., "SIS: A System for Sequential Circuit Synthesis," University of California, Berkeley, Report M92/41, May 1992.
M. Shah, et al., “UltraSPARC T2: A Highly-Threaded, Power-Efficient, SPARC SOC,” Proc. of Asian Solid-State Circuits Conference, Nov. 2007.
J. Sharkey, A. Buyuktosunoglu, and P. Bose, “Evaluating Design Tradeoffs in On-Chip Power Management for CMPs,” Proc. of Int’l Symp. on Low Power Electronics and Design, 2007.
K. Shi and D. Howard, "Challenges in sleep transistor design and implementation in low-power design," in Proc. of Design Automation Conference, 2006, pp. 113-116.
T. Simunic, L. Benini, P. Glynn, G. De Micheli, “Event-driven Power Management,” IEEE Trans. on Computer Aided Design of Integrated Circuits and Systems, Jul. 2001, pp. 840-857.
J. Singh, S. Sapatnekar, “Statistical timing analysis with correlated non-Gaussian parameters using independent component analysis,” in Proc. Design Automation Conference, 2006, pp. 155–160.
K. Skadron, et al., “Temperature-Aware Microarchitecture: Modeling and Implementation,” ACM Transactions on Architecture and Code Optimization, vol. 1, no. 1, March 2004, pp. 94–125.
M. S. Squilante and E. D. Lazowska, “Using processor-cache affinity information in shared-memory multiprocessor scheduling,” IEEE Trans. Parallel Distrib. Syst., vol. 4, pp. 131-143, Feb. 1993.
K. Srinivasan and K. S.
Chatha, “Integer linear programming and heuristic techniques for system-level low power scheduling on multiprocessor architectures under throughput constraints,” Integration VLSI, vol. 40, no. 3, 2007.
J. A. Stankovic, C. Lu, S. H. Son, and G. Tao, “The case for feedback control real-time scheduling,” Proc. IEEE Euromicro Conf. on Real-Time, Jun. 1998.
K. Stavrou and P. Trancoso, “Thermal-aware scheduling: A solution for future chip multiprocessors’ thermal problems,” in EURO MICRO Conference on Digital System Design, 2006.
B. E. Stine, D. S. Boning, and J. E. Chung, “Analysis and Decomposition of Spatial Variation in Integrated Circuit Processes and Devices,” IEEE Trans. Semiconductor Manuf., Jan. 1997, pp. 24-41.
H. Su, F. Liu, A. Devgan, E. Acar, and S. Nassif, “Full Chip Leakage Estimation Considering Power Supply and Temperature Variations,” Proc. Int’l Symp. on Low Power Electronics and Design, Aug. 2003, pp. 78-83.
Y. Tan, Q. Qiu, “A framework of stochastic power management using hidden Markov model,” Proc. Design Automation Test in Europe, 2008.
R. Teodorescu, J. Torrellas, “Variation-Aware Application Scheduling and Power Management for Chip Multiprocessors,” Proc. International Symposium on Computer Architecture, pp. 363-374, June 2008.
A. Tiwari, S. R. Sarangi, J. Torrellas, “ReCycle: pipeline adaptation to tolerate process variation,” Proc. Int’l Symp. Computer Architecture, 2007.
Y. F. Tsai, N. Vijaykrishnan, Y. Xie, M. J. Irwin, “Influence of Leakage Reduction Techniques on Delay/Leakage Uncertainty,” Proc. IEEE 18th Int’l Conf. on VLSI Design, Jan. 2005, pp. 374-379.
J. W. Tschanz, S. G. Narendra, Y. Ye, B. A. Bloechel, et al., "Dynamic sleep transistor and body bias for active leakage power control of microprocessors," IEEE Journal of Solid-State Circuits, vol. 38, 2003, pp. 1838-1845.
A. Varma, B. Ganesh, M. Sen, S. R. Choudhary, L. Srinivasan, and B. Jacob, “A control-theoretic approach to dynamic voltage scaling,” Proc. Int’l Conf.
on Compilers, Architectures, and Synthesis for Embedded Systems, Oct. 2003, pp. 255–266.
Y. Wang, K. Ma, and X. Wang, “Temperature-Constrained Power Control for Chip Multiprocessors with Online Model Estimation,” International Symposium on Computer Architecture, 2009.
M. Weiser, B. Welch, A. Demers, and S. Shenker, “Scheduling for reduced CPU energy,” Proc. USENIX Symp. on Operating Systems Design and Implementation, Nov. 1994, pp. 13–23.
G. Welch and G. Bishop, “An introduction to the Kalman filter,” Technical Report TR 95-041, University of North Carolina, Department of Computer Science, 1995.
Q. Wu, P. Juang, M. Martonosi, D. W. Clark, “Formal Online Methods for Voltage/Frequency Control in Multiple Clock Domain Microprocessors,” Proc. ASPLOS-XI, Oct. 2004, pp. 248-259.
F. Xia, Y.-C. Tian, Y. Sun, J. Dong, “Control-Theoretic Dynamic Voltage Scaling for Embedded Controllers,” IET Computers and Digital Techniques, 2008.
Y. Xie, W. Wolf, “Allocation and scheduling of conditional task graph in hardware/software co-synthesis,” Proc. Conf. on Design Automation and Test in Europe, 2001.
G. Yan, et al., “MicroFix: exploiting path-grained timing adaptability for improving power-performance efficiency,” Proc. Int’l Symp. on Low Power Electronics and Design, 2009.
Y. Ye, et al., “Statistical modeling and simulation of threshold variation under dopant fluctuations and line-edge roughness,” Proc. Design Automation Conference, 2008.
I. Yeo, C. C. Liu, E. J. Kim, “Predictive dynamic thermal management for multicore systems,” Proc. of the 45th Annual Design Automation Conference, 2008.
M. Zagha, et al., “Performance analysis using the MIPS R10000 performance counters,” Proc. Conf. on Supercomputing, 1996.
Y. Zhan, et al., “Correlation-aware statistical timing analysis with non-Gaussian delay distributions,” in Proc. Design Automation Conference, 2005.
L. Zhang, et al., “Statistical static timing analysis with conditional linear MAX/MIN approximation and extended canonical timing model,” IEEE Trans. Computer-Aided Design Integrated Circuits Syst., vol. 25, 2006.
[online] Advanced Micro Devices, Family 10h AMD Opteron Processor Product Data Sheet, Revision: 3.04, http://support.amd.com/us/Processor_TechDocs/40036.pdf
[online] HSPICE: gold standard for accurate circuit simulation, http://www.synopsys.com/products/mixedsignal/hspice/hspice.htm
[online] AMD Corp, http://support.amd.com/us/Processor_TechDocs/40036.pdf
[online] AMD Opteron processors, http://en.wikipedia.org/wiki/Opteron
[online] Intel Xeon processors, http://en.wikipedia.org/wiki/Xeon
[online] Intel Xeon, http://www.intel.com/products/processor_number/chart/xeon.htm
[online] Intel Corp, Intel® 64 and IA-32 Architectures Software Developer’s Manual, 2009, http://www.intel.com/products/processor/manuals/
[online] Intel Co., ftp://download.intel.com/design/network/papers/30117401.pdf
[online] Intel Co, http://www.intel.com/technology/architecture-silicon/next-gen/
[online] Linear Programming, Wikipedia, http://en.wikipedia.org/wiki/Linear_programming
[online] LM Sensors, http://www.lm-sensors.org/wiki/Documentation
[online] MATLAB Optimization, http://www.mathworks.com
[online] Performance Counter for Linux, http://user.it.uu.se/~mikpe/linux/perfctr/
[online] Predictive Technology Model, http://ptm.asu.edu/
[online] Standard Performance Evaluation Corporation, http://www.spec.org/
[online] SPEC Web 2005, http://www.spec.org/web2005/
[online] SPEC Web2009, http://www.spec.org/web2009
[online] http://www.theregister.co.uk/2011/02/25/intel_westmere_ex_sandy_bridge_ep_xeons/print.html
[online] Tomlab Optimization, http://tomopt.com/tomlab/
Abstract
In today’s IC design, one of the key challenges is the increasing power consumption of circuits, which shortens the battery life of portable electronics and raises the cooling and packaging costs of server systems. At the same time, with increasing levels of variability in the characteristics of nanoscale CMOS devices and VLSI interconnects, and continued uncertainty in the operating conditions of VLSI circuits, achieving power efficiency and high performance in electronic systems under process, voltage, and temperature (PVT) variations has become a daunting, yet vital, task.

This dissertation investigates power optimization techniques for CMOS VLSI circuits at both the circuit level and the chip level, while accounting for variations in the fabrication process and operating conditions of such circuits and systems. First, at the circuit level, we present and solve the problem of power-delay optimal design of a linear pipeline utilizing soft-edge flip-flops, which allow opportunistic time borrowing within the pipeline. We formulate this problem using statistical delay models that characterize the effect of process variation on gate and interconnect delays. To enable further optimization, the soft-edge flip-flops are equipped with dynamic error detection (and correction) circuitry to detect and fix the errors that might arise from possible over-clocking.

Second, we propose chip-level solutions to the problem of low-power design in Chip Multiprocessors (CMPs). We formulate this problem as minimizing the total power consumption of the CMP while maintaining an average system-level throughput, or as maximizing total CMP throughput subject to constraints on power dissipation or die temperatures.

We then propose mathematically rigorous and robust algorithms in the form of dynamic power (and thermal) management solutions to each of these problem formulations.
Our proposed algorithms are hierarchical global power management approaches that aim to minimize CMP power consumption (or maximize throughput) mainly by applying dynamic voltage and frequency scaling (DVFS), task assignment, and consolidation of processing cores. To tackle the inherent variation and uncertainty of manufacturing parameters and operating conditions in these problems, our solutions adopt a closed-loop feedback controller. Additionally, in one problem formulation, we focus primarily on the variations and uncertainty of the CMP optimization problem parameters and adopt an algorithm based on a partially observable Markov decision process (POMDP) that uses belief states to estimate unobservable system parameters and then stochastically minimizes overall CMP power consumption. Overall, simulations demonstrate that our solutions are effective for the CMP power/thermal optimization problem.
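To make the closed-loop idea concrete, the following minimal sketch (our own illustration, not code from the dissertation) shows one way a feedback controller can steer a core's DVFS frequency so that measured throughput tracks a target, running the core no faster than needed and thereby saving power. The class name, PI gains, and the toy throughput model in the usage example are all illustrative assumptions.

```python
class DvfsController:
    """Proportional-integral feedback loop for one core's DVFS setting."""

    def __init__(self, f_min, f_max, kp=0.5, ki=0.1):
        self.f_min, self.f_max = f_min, f_max  # allowed frequency range (GHz)
        self.kp, self.ki = kp, ki              # illustrative controller gains
        self.integral = 0.0                    # accumulated throughput error

    def step(self, throughput_target, throughput_measured, f_current):
        # Positive error means the core is running too slowly for the target.
        error = throughput_target - throughput_measured
        self.integral += error
        f_next = f_current + self.kp * error + self.ki * self.integral
        # Clamp to the hardware's DVFS range.
        return min(self.f_max, max(self.f_min, f_next))


if __name__ == "__main__":
    # Toy plant: throughput is proportional to frequency (unknown to the
    # controller, which is the point of using feedback under uncertainty).
    ctrl = DvfsController(f_min=0.8, f_max=3.0)
    f = 3.0  # start at maximum frequency
    for _ in range(200):
        measured = 0.5 * f
        f = ctrl.step(throughput_target=1.0, throughput_measured=measured, f=f) \
            if False else ctrl.step(1.0, measured, f)
    print(f)  # settles near 2.0 GHz, the slowest frequency meeting the target
```

The integral term is what makes the loop robust to the modeling uncertainty the abstract emphasizes: even if the true throughput-versus-frequency relation drifts with process variation or workload phase, the accumulated error drives the steady-state tracking error toward zero without requiring an accurate plant model.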
Asset Metadata
Creator: Ghasemazar, Mohammad (author)
Core Title: Variation-aware circuit and chip level power optimization in digital VLSI systems
School: Viterbi School of Engineering
Degree: Doctor of Philosophy
Degree Program: Electrical Engineering
Publication Date: 11/22/2011
Defense Date: 10/21/2011
Publisher: University of Southern California (original), University of Southern California. Libraries (digital)
Tags: Chip Multiprocessors, circuit and chip-level techniques, dynamic power and thermal management, dynamic voltage and frequency scaling, hierarchical power management, OAI-PMH Harvest, power optimization, soft pipeline, soft-edge flip flop
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisor: Pedram, Massoud (committee chair), Gupta, Sandeep K. (committee member), Nakano, Aiichiro (committee member)
Creator Email: ghasemaz@usc.edu
Permanent Link (DOI): https://doi.org/10.25549/usctheses-c3-211848
Unique identifier: UC11291441
Identifier: usctheses-c3-211848 (legacy record id)
Legacy Identifier: etd-Ghasemazar-433-0.pdf
Dmrecord: 211848
Document Type: Dissertation
Rights: Ghasemazar, Mohammad
Type: texts
Source: University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA