STOCHASTIC DYNAMIC POWER AND THERMAL MANAGEMENT TECHNIQUES FOR MULTICORE SYSTEMS

by Hwisung Jung

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL ENGINEERING)

December 2009

Copyright 2009 Hwisung Jung

DEDICATION

To my lovely wife, whose unconditional love and support made this work possible.

ACKNOWLEDGEMENTS

I am most grateful to my advisor, Professor Massoud Pedram, for providing me invaluable support and guidance throughout my Ph.D. studies at the University of Southern California. He has been a continuous source of motivation for me, and I want to sincerely thank him for all I have achieved. His multi-disciplinary approach and global vision of research problems have been instrumental in defining my professional career. I would also like to thank my other committee members, Professor Sandeep Gupta, Professor Jeff Draper, and Professor Aiichiro Nakano, for their insightful suggestions and for their valuable time.

I am sincerely grateful to Andy Hwang for his guidance in some parts of my Ph.D. research and his help and support during my internship at Broadcom Corporation. I would also like to extend my appreciation to Patrick Tsui for his support and guidance during my summer internship at Intel Corporation.

I would like to thank my parents for their unconditional love and support. I would not have been able to accomplish my goals without their support and encouragement. I am much indebted to my parents-in-law for believing in me and encouraging me to pursue my studies; their strong support and guidance have been crucial in achieving my goals.

Words cannot express my gratitude to my beloved wife, Sunkyung Choi. Not only is she my adorable wife and closest friend, but also my smartest supporter, technically helping me with fruitful discussions. I would like to thank Sunkyung for her constant love, support, and understanding.

TABLE OF CONTENTS

DEDICATION
ACKNOWLEDGEMENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF ABBREVIATIONS
ABSTRACT
Chapter 1 Introduction
    1.1 Dissertation Contributions
    1.2 Outline of the Dissertation
Chapter 2 Uncertainty-Aware Dynamic Power Management in Partially Observable Domains
    2.1 Introduction
    2.2 Related Work
    2.3 Preliminaries
        2.3.1 Effect of PVT Variation on the Performance State
        2.3.2 Temperature Calculation
    2.4 Stochastic Decision Making Framework
        2.4.1 Partially Observable Environment
        2.4.2 Sequential Decision Making under Uncertainty
        2.4.3 POSMDP Framework for Dynamic Power Management
    2.5 Policy Representation in POSMDP
        2.5.1 Conversion to Belief-state SMDP
        2.5.2 Policy Representation
        2.5.3 POSMDP-based DPM by Example
    2.6 Dynamic Power Management
        2.6.1 Offline Dynamic Power Management
        2.6.2 Online Dynamic Power Management
    2.7 Experimental Results
    2.8 Summary
Chapter 3 Machine Learning based Power Management for Multicore Processors
    3.1 Introduction
    3.2 Preliminaries
        3.2.1 Motivational Example
        3.2.2 Related Work
    3.3 Learning-based DPM Framework
        3.3.1 Background on Supervised Learning
        3.3.2 Learning-based Power Management Framework
    3.4 Extraction Strategy
        3.4.1 Extracting Input Features
        3.4.2 Extracting Output Measures
    3.5 Power Management Policy
    3.6 Experimental Results
        3.6.1 Experimental Setup
        3.6.2 Detailed Results
    3.7 Summary
Chapter 4 A Stochastic Local Hot Spot Alerting Technique
    4.1 Introduction
    4.2 Preliminaries
    4.3 Estimation under Uncertainty
        4.3.1 Rationale for Developing Uncertainty Management
        4.3.2 Temperature Estimation Framework
        4.3.3 Power Profile Estimation Framework
    4.4 Hot Spot Alerting Algorithm
        4.4.1 Estimation of Junction Temperature of the Chip
        4.4.2 Estimation of Power State of the System
        4.4.3 Hot Spot Alerting Algorithm
    4.5 Experimental Results
    4.6 Summary
Chapter 5 Stochastic Modeling of a Thermally-Managed Multi-Core System
    5.1 Introduction
    5.2 Preliminaries
    5.3 System Modeling
        5.3.1 Background
        5.3.2 Component Model
        5.3.3 Integrated Model of a TMS
    5.4 Dynamic Thermal Management
        5.4.1 Optimal DTM Policy
        5.4.2 Online DTM
    5.5 Experimental Results
    5.6 Summary
Chapter 6 Conclusion
    6.1 Summary of Contributions
    6.2 Future Work
        6.2.1 Dynamic Power Management
        6.2.2 Dynamic Thermal Management
BIBLIOGRAPHY

LIST OF TABLES

Table 2.1: Parameter values for the example problem.
Table 2.2: The distribution in percentage of power dissipation in the processor (no cache).
Table 2.3: Parameter values for a given experiment.
Table 2.4: Comparing results of our proposed approach with the corner-based results.
Table 2.5: Comparison of our DPM policies with the conventional approach.
Table 2.6: Comparison of our DPM policies with the conventional approach.
Table 3.1: Example training set for the DPM problem.
Table 3.2: Examples of decision boundaries.
Table 3.3: Classes of output measures.
Table 3.4: Different characteristics of training sets.
Table 3.5: Normalized total energy dissipation for various classifiers.
Table 3.6: Energy savings in the multicore processor.
Table 4.1: Percentage of power consumption in different modules of a MIPS-like processor.
Table 4.2: Definition of power and temperature states of the processor.
Table 4.3: PBGA package thermal performance data (T_A = 70 °C).
Table 5.1: Transition times for the CTMDP model of the processor.
Table 5.2: Power and performance comparisons between Greedy and SDTM techniques.

LIST OF FIGURES

Figure 2.1: Effect of process variations on circuit delay.
Figure 2.2: Structure of a POSMDP-based power manager.
Figure 2.3: A graphical representation of the belief state: (a) current belief (b) its one-step evolution for three different actions.
Figure 2.4: State estimation and state transition in the POSMDP-based DPM.
Figure 2.5: Example of three possible observations at time t+1 from which the belief state is calculated.
Figure 2.6: The value iteration algorithm.
Figure 2.7: A graphical representation of the belief states and value functions: (a) the current belief state and immediate values over the belief space, (b) next belief state and horizon-1 value function.
Figure 2.8: (a) A decision tree of policy tables (b) probability density function of power and delay values used to trace a path from root to a leaf node in the tree.
Figure 2.9: An offline power management technique.
Figure 2.10: The structure of online power management.
Figure 2.11: An online power management technique.
Figure 2.12: Flow of power simulation.
Figure 2.13: Trace of belief state for state estimation.
Figure 2.14: Evaluation of policy generation algorithm.
Figure 2.15: Power consumption of offline / online DPM policies.
Figure 3.1: Example of a power-managed multicore processor.
Figure 3.2: DPM approaches with DVFS: (a) the traditional approach, and (b) the proposed approach.
Figure 3.3: Concept of supervised learning.
Figure 3.4: Structure of the proposed power manager.
Figure 3.5: (a) Observations for each output measure, and (b) decision boundaries for an output measure among various input feature states.
Figure 3.6: The value iteration algorithm.
Figure 3.7: Input features during training phase.
Figure 3.8: Output measures during training phase.
Figure 3.9: Probability density functions for power dissipation.
Figure 3.10: Selection of the training set size.
Figure 3.11: Evaluation of cost functions for a given example.
Figure 3.12: Comparison of energy dissipations, where actions are commanded by a classifier based on different training sets.
Figure 3.13: Evaluation of energy dissipation for a given scenario.
Figure 3.14: Energy dissipation comparison between a greedy DPM and the Bayesian Learning based DPM.
Figure 4.1: Heat flow in the Plastic Ball Grid Array plus Heat Spreader package.
Figure 4.2: One of the IC package heat transfer paths and the corresponding thermal resistive model.
Figure 4.3: Uncertainty-aware estimation framework.
Figure 4.4: The flow of the proposed estimation technique.
Figure 4.5: The proposed hot spot alerting algorithm.
Figure 4.6: Trace of estimation for the junction temperature.
Figure 4.7: Trace of belief state for the power profile.
Figure 4.8: Evaluation of the hot spot alerting algorithm.
Figure 5.1: IPC vs. L2 cache miss rate on Intel Core Duo processor.
Figure 5.2: Abstract model of a thermal-managed MC system.
Figure 5.3: An example of CTMDP model of a processor (a) and its generator matrix (b).
Figure 5.4: An example of CTMDP model of applications (a) and its generator matrix (b).
Figure 5.5: An example of CTMDP model of a TMS (a) and its generator matrix (b).
Figure 5.6: Online thermal management technique.
Figure 5.7: The effectiveness of the proposed DTM technique.

LIST OF ABBREVIATIONS

CMOS    Complementary Metal-Oxide Semiconductor
CTMDP   Continuous Time Markovian Decision Process
CPU     Central Processing Unit
DPM     Dynamic Power Management
DLB     Dynamic Load Balancing
DTM     Dynamic Thermal Management
DTMDP   Discrete Time Markovian Decision Process
DVFS    Dynamic Voltage and Frequency Scaling
HMM     Hidden Markov Model
IC      Integrated Circuit
IPC     Instructions Per Cycle
KF      Kalman Filter
MAP     Maximum a Posteriori
MDP     Markov Decision Process
ML      Maximum Likelihood
MLE     Maximum Likelihood Estimate
PCB     Printed Circuit Board
PM      Power Manager
POMDP   Partially Observable Markov Decision Process
POSMDP  Partially Observable Semi-Markov Decision Process
PVT     Process, Voltage, and Temperature
RISC    Reduced Instruction Set Computer
SoC     System-on-a-Chip
SMDP    Semi-Markov Decision Process
SP      Service Provider
SQ      Service Queue
SR      Service Requestor
VLSI    Very Large Scale Integration

ABSTRACT

With the progress in today's semiconductor technology, chip density and operating frequency have increased rapidly, making the power consumption of digital circuits a major concern for VLSI designers. Furthermore, as nanometer technology approaches the regime of randomness, with variability in the behavior of silicon structures, improving the accuracy of power optimization techniques under increasing levels of process, voltage, and temperature (PVT) variations is becoming a critical task. In this dissertation, a stochastic dynamic power management (DPM) framework is presented to improve the accuracy of the decision-making strategy for power management under the probabilistic conditions induced by PVT variations. Subsequently, an adaptive DPM technique is presented that constructs a machine-learning based power manager to improve the quality of the decision-making strategy and to reduce the overhead of the power manager, which has to repetitively determine and assign voltage-frequency settings for each core in a multicore system. Finally, the focus of this dissertation is shifted to thermal management techniques, since thermal control is also becoming a first-order concern due to the increased power density as well as the increasing vulnerability of the system. A technique is presented to identify and report local hot spots under probabilistic conditions induced by uncertainty in the chip junction temperature and the system power state. Lastly, an abstract model of a thermally-managed system based on stochastic processes is introduced, followed by a stochastic thermal management technique.

Chapter 1 Introduction

IC designers are seeking high-performance and reliable electronic circuits and systems. As we start to design with nanometer process technology nodes, nanoscale VLSI circuits are becoming increasingly sensitive to the rising levels of process, voltage, and temperature (PVT) variations and design parameter fluctuations, and guaranteeing the quality of system-level performance optimization techniques is becoming of great concern.
The PVT variations, especially within-chip variations, pose a major challenge to the design of low-power and high-performance circuits and systems [1][11]. These variations, which arise from either environmental variations (e.g., temperature and voltage) or manufacturing variations (e.g., dopant fluctuation, oxide thickness variation, and effective channel length variation), can result in significant uncertainties in the power and delay estimates [1].

Variability refers to a known quantitative relationship between a parameter of interest and a source, whereas uncertainty refers to a relationship that we cannot precisely describe. Lack of proper modeling tools transforms variability into uncertainty [45]. Variations at the higher levels of design abstraction are often translated into uncertainty because the underlying physical realization is not available. At the same time, observations made about the performance state (e.g., delay or power) of the system tend to be imperfect and approximate, which in turn gives rise to uncertainty in observing the performance state. Thus, the strong impact of PVT variations on the performance of a VLSI circuit renders the traditional optimization techniques impractical; a move to stochastic optimization is needed, based on a mathematical framework that models variability during system-level performance optimization (e.g., power management). Indeed, an integrated uncertainty management framework makes it possible to consider the stochastic behavior of power dissipation and to treat the many sources of uncertainty in a computationally tractable way, bringing the underlying probabilistic PVT effects to the forefront of power management policy optimization. Furthermore, it is prudent to account for the different sources of variability or uncertainty early in the design process, especially when developing resource management and power control strategies for large, complex electronic systems. In recent years, much research has been conducted on reducing variability in the circuit design parameters [85][83][49].

Ongoing advances in CMOS process technologies and VLSI designs have resulted in the introduction of multicore processors. There is a demand for improved processing efficiency of a multicore processor without driving up its power dissipation and die temperature. Conventional DPM methods have not been able to take full advantage of power-saving techniques such as dynamic voltage and frequency scaling (DVFS). This is because a system-level power management routine, which continuously monitors the workloads of multiple processors, analyzes the information to make decisions, and issues DVFS commands to each processor, can give rise to considerable computational overhead and/or complicate task scheduling [51]. The higher the number of cores in the processor, the more severe these issues become. Therefore, the ability of a DPM framework to scale well on a multicore processor by eliminating this overhead is becoming a critical requirement [39]. The problem of determining a power management policy that utilizes DVFS in a multicore processor has received a lot of attention [35][91][31][56][19].

With IC process geometries shrinking below the 65nm technology node and many applications requiring higher performance under dramatic changes in on-chip power density, timely identification of hot spots on a chip has become a challenging task.
Furthermore, local hot spots, which have much higher temperatures than the average die temperature, are becoming more prevalent in VLSI circuits. If heat is not removed at a rate equal to or greater than its rate of generation, the junction temperature will rise, which reduces the mean time to failure (MTTF) of the devices. Thus, identifying and removing heat from these hot spots is a major task facing design engineers concerned with circuit reliability. As reported in [100], the problem of thermal modeling and management has received a lot of attention in the last decade.

Given the importance of low-power designs, this dissertation is focused on developing system-level power and thermal management techniques for high-performance systems.

1.1 Dissertation Contributions

In this dissertation, we target four major techniques of power and thermal management for high-performance systems:

• Power management techniques:
  - Uncertainty-aware dynamic power management technique
  - Machine-learning based power management technique
• Thermal management techniques:
  - Stochastic local hot spot alerting technique
  - Stochastic modeling of a thermally-managed multicore system

Most of the previous work on variability and uncertainty in low-power designs has focused on variability modeling, analysis, and control at the lower levels of design abstraction. Improving the accuracy and robustness of decision making by modeling and assessing the variability is an important step in guaranteeing the quality of system-level resource management algorithms, including DPM. To the best of our knowledge, there has been no reported work on dynamic power management techniques that use stochastic modeling for uncertainty management. Indeed, this is the contribution of the first part of this dissertation.

Traditional approaches for system-level DPM, which are based on models of a service requestor (SR), a service provider (SP), and a service queue (SQ), tend to work very well if the workload of the system does not change rapidly. Furthermore, the energy and delay overhead of power mode transitions can become quite significant, rendering the DPM strategy ineffective. Thus, traditional DPM techniques are unsuccessful in reducing the total chip power dissipation when the overhead of power-mode transitions is not controlled in a multicore processor, where the power manager needs to control each processor individually. Therefore, knowing (or predicting) in real time which frequency and voltage levels to use, and when to apply a new performance setting in a multicore processor, must be done with the aid of a self-improving (i.e., intelligent and autonomous) power manager. Consequently, in the second part of this dissertation we address a dynamic power management problem where a power manager continuously issues power mode transition commands to maximally exploit the power-saving opportunities.

Much of the past work related to thermal control has examined techniques for thermal modeling and management, but these techniques may be ineffective if the accuracy of identifying local hot spots is in question. This is because thermal models based on equivalent circuit models cannot adequately capture structures with complex shapes and boundary conditions, which in turn gives rise to uncertainty in identifying local hot spots. In particular, it is extremely difficult to obtain an exact solution of the heat transfer equations that arise from realistic die conditions.
Furthermore, temperature sensors have difficulty measuring the actual peak power dissipation and the resulting peak temperature, which renders the problem of identifying local hot spots stochastic. To the best of our knowledge, no research work has been conducted on estimation of local hot spots on a die that rigorously accounts for the uncertainty in temperature sensing. Therefore, in the third part of this dissertation we concentrate on a stochastic hot spot estimation technique, which alerts against thermal problems. In the last part of this dissertation, we present a stochastic model of a thermally-managed multicore system to manage the stochastic behavior of the temperature state of the system under dynamic reconfiguration of its micro-architecture, while maximizing the system performance subject to the constraint that a critical temperature threshold is not exceeded.

1.2 Outline of the Dissertation

In Chapter 2, we tackle the problem of DPM in nanoscale CMOS design technologies, which are typically affected by increasing levels of PVT variations and fluctuations due to the randomness in the behavior of silicon structures. We present a stochastic framework to improve the accuracy of decision making during DPM, while considering manufacturing process and/or design induced uncertainties. More precisely, the uncertainties are captured by a partially observable semi-Markovian decision process, and the policy optimization problem is formulated as a mathematical program based on this model. Experimental results with a RISC processor in 65nm technology demonstrate the effectiveness of the technique and show that the proposed uncertainty-aware power management technique ensures system-wide energy savings under statistical circuit parameter variations.

In Chapter 3, we present a supervised learning based DPM framework for a multicore processor, where a power manager (PM) learns to predict the system performance state from some readily available input features (such as the occupancy state of a global service queue) and then uses this predicted state to look up the optimal power management action (e.g., voltage-frequency setting) from a pre-computed policy table. The motivation for utilizing supervised learning in the form of a Bayesian classifier is to reduce the overhead of the PM, which has to repetitively determine and assign voltage-frequency settings for each processor core in the system. Experimental results demonstrate that the proposed supervised learning based DPM technique ensures system-wide energy savings under rapidly and widely varying workloads.

In Chapter 4, we address the questions of how and when to identify and issue a hot spot alert. These are important questions since temperature reports by thermal sensors may be erroneous, noisy, or arrive too late to enable effective application of thermal management mechanisms to avoid chip failure. Thus, we present a stochastic technique for identifying and reporting local hot spots under probabilistic conditions induced by uncertainty in the chip junction temperature and the system power state. More specifically, it introduces a stochastic framework for estimating the chip temperature and the power state of the system based on a combination of a Kalman Filter (KF) and a Markovian Decision Process (MDP) model.
Experimental results demonstrate the effectiveness of the framework and show that the proposed technique alerts about thermal threats accurately and in a timely fashion in spite of noisy or sometimes erroneous readings by the temperature sensor.

In Chapter 5, we present a new abstract model of a thermally-managed system, where a stochastic process model is employed to capture the system performance and thermal behavior. We formulate the problem of dynamic thermal management (DTM) as the problem of minimizing the energy cost of the system for a given level of performance under a peak temperature constraint by using a controllable MDP model. The key rationale for utilizing an MDP for solving the DTM problem is to manage the stochastic behavior of the temperature states of the system under online re-configuration of its micro-architecture and/or dynamic voltage-frequency scaling. Experimental results demonstrate the effectiveness of the modeling framework and the proposed DTM technique.

Chapter 2 Uncertainty-Aware Dynamic Power Management in Partially Observable Domains

2.1 Introduction

As nanoscale VLSI circuits are becoming sensitive to the rising levels of variability in process and design parameters, guaranteeing the quality of system-level performance optimization techniques is becoming of great concern. Within-chip variations are typically passed into the delay budget of each circuit [11]. IC designers can no longer afford to forfeit performance due to unacceptable levels of inaccuracy in their estimation/modeling techniques [1]. Thus, it is important to do rigorous modeling of variability early in the design cycle. Increasing interest has been given to the problem of modeling and reducing variability in the design parameters [11][4]. Most of the previous work has focused on variability modeling, analysis, and control at the lower levels of design abstraction, e.g., by using physical design optimization and/or logic synthesis. However, there has been no reported work on dynamic power management techniques that use stochastic modeling for uncertainty management.

In this chapter, we attempt to address uncertainty management issues in performance optimization at the system level, helping IC designers produce reliable low-power designs. We propose an uncertainty-aware power management framework that handles parameter variations during power management. Our proposed framework is based on (i) a partially observable Markov decision process [65] to model the uncertainty in parameter observation, and (ii) a semi-Markov decision process to model the decision making strategy for optimizing the total energy dissipation of the system. Markov decision process models offer a robust theoretical framework which enables one to apply strong mathematical optimization techniques to derive optimal policies. Finally, we present uncertainty-aware offline/online dynamic power management techniques to illustrate the effectiveness of the uncertainty management framework.

Conventional DPM approaches [6], which define a power manager that interacts with the system resources through a set of commands and their associated costs, tend to be less than satisfactory in the presence of variability. This is because they assume that the different variables of the system are (i) directly observable and (ii) deterministic. Our proposed DPM framework deals with such uncertainty and non-determinism [40].

The remainder of this chapter is organized as follows. Related work is discussed in section 2.2.
In section 2.3, the preliminaries of the chapter are presented. The details of the stochastic uncertainty management framework are given in section 2.4. The policy representation for the proposed framework is described in section 2.5. Section 2.6 presents uncertainty-aware dynamic power management techniques. Experimental results and a summary are given in sections 2.7 and 2.8.

2.2 Related Work

Increasing attention has been given to the problem of reducing variability in the circuit design parameters. The work in [11] studies the parameter variations in nanometer technologies and their impacts on leakage reduction techniques for a microprocessor. A full-chip leakage estimation technique under variability is presented in [83] to account for power supply and temperature variations. In [49], the impacts of threshold voltage variations on the leakage power are modeled in a probabilistic way, and these models are subsequently employed to minimize the leakage power dissipation while satisfying certain performance requirements. The authors in [4] show that interactions between voltage, frequency, and temperature significantly impact the energy-delay product of a target system. The work presented in [85] studies the impact of leakage reduction techniques on the delay uncertainty. In [11], the authors discuss process, voltage, and temperature variations and their impacts on circuits and micro-architectures beyond the 90nm technology node.

A lot of research has been devoted to optimizing DPM policies, resulting in both heuristic and stochastic approaches. While the heuristic approaches are easy to implement, they do not provide any power/performance assurances. In contrast, the stochastic approaches guarantee optimality under performance constraints, although they are more complex to implement. A number of stochastic models have been reported for DPM [66][77][69]. To overcome the limitations of heuristic "time-out"-based power management techniques, an approach based on discrete-time Markovian decision processes (DTMDP) was proposed in [66]. This approach outperforms the previous heuristic techniques because of its solid theoretical framework for system modeling and policy optimization. However, the discrete-time model requires policy evaluation at periodic time intervals and may thus dissipate a large amount of power even when no change in the system state has occurred. To surmount this shortcoming, an approach based on continuous-time Markovian decision processes (CTMDP) was proposed in [77]. The policy change under this model is asynchronous and thus more suitable for implementation as part of a real-time operating system environment. Reference [69] also improved on the modeling technique of [66] by using time-indexed semi-Markovian decision processes (SMDP). An SMDP is a stochastic process where the next state depends on the current state and on how long the current state has been active. A non-stationary process based power management technique is also introduced in [69], where the workload requests are modeled as a Markov-modulated stochastic process.

The above-mentioned stochastic power management techniques enjoy the desirable features of flexibility, global optimality, and mathematical robustness. In general, however, these models are not practical because they assume that the various variables of the system are directly observable and thus deterministic.
Our work differs from these works in that we explicitly take into account the uncertainty in determining the exact values of state variables, a phenomenon which is caused by PVT variations.

2.3 Preliminaries

2.3.1 Effect of PVT Variation on the Performance State

Although performance analysis tools provide reliable bounds on the delay of circuits, they cannot properly account for the variability inherent in the semiconductor process. For example, Figure 2.1 illustrates the effect of variations on the propagation delays of logic gates (e.g., a 2-input NAND gate driving an FO4 load) as calculated by 2-D lookup tables, which are used in conventional performance analysis tools (e.g., PrimeTime [93]) under the worst corner case (125 °C, 1.08V for Vdd) of a 65nm CMOS technology. Every point in the table represents the characterized SPICE delay of the logic gate for a particular input transition time and output capacitance pair. In this figure, the closest four characterized points in the table are interpolated to provide a desired output delay value. Thus, although these analysis tools can provide estimates of performance parameters at design time, they cannot guarantee that the expected performance prediction is accurate in manufactured designs.

Figure 2.1: Effect of process variations on circuit delay (circuit parameter variations in the output capacitance X and input transition time Y translate into a resultant variation in the delay Z).

We start by pointing out that voltage (V) and temperature (T) variations are dynamic, i.e., they occur during circuit operation, whereas process (P) variations are static and get introduced during manufacturing. The strong impact of PVT variations on the performance of a VLSI circuit renders the traditional optimization techniques ineffective. This phenomenon has resulted in a move toward stochastic optimization strategies, i.e., techniques that treat design parameters as random variables whose values are described by probability distribution functions. The task of computing power management policies that cope with uncertainty and non-determinism requires the construction of a stochastic framework with which one can predict the effect of various actions on the performance state of the system. A stochastic approach to system-level performance modeling and optimization (e.g., one based on the Markovian decision process model) enables us to apply mathematical optimization techniques to derive optimal policies for DPM.

2.3.2 Temperature Calculation

The major source of heat generation in a die is the power dissipation of transistors whose active regions are implemented in the substrate [62]. Some amount of power dissipation also results from Joule heating (or self-heating) caused by the flow of current in the interconnections; this effect is ignored in our simulations. The temperature of a VLSI chip can be calculated as follows:

    T_chip = T_J - R_θ · (P_total / A)    (2.1)

where T_chip is the temperature of the case top, T_J is the junction temperature, R_θ is the equivalent junction-to-case thermal resistance of the substrate (Si) layer plus the package (cm^2·°C/W), P_total is the total power consumption (W), and A is the chip area (cm^2).
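To make Eq. (2.1) concrete, the short Python sketch below evaluates the case-top temperature for a hypothetical operating point; the function name and all numeric values are illustrative assumptions, not data from this dissertation.

```python
# Minimal sketch of Eq. (2.1): T_chip = T_J - R_theta * (P_total / A).
# All numbers are illustrative placeholders.

def chip_temperature(t_junction, r_theta, p_total, area):
    """Case-top temperature (deg C) from the junction temperature (deg C),
    junction-to-case thermal resistance (cm^2*degC/W), total power (W),
    and chip area (cm^2)."""
    return t_junction - r_theta * (p_total / area)

# Example: a 20 W chip on a 1.0 cm^2 die with R_theta = 0.8 cm^2*degC/W
# and a junction temperature of 85 deg C.
print(chip_temperature(85.0, 0.8, 20.0, 1.0))  # -> 69.0 deg C at the case top
```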
In this chapter, it is assumed that power density can serve as a proxy for temperature variations, although a change in instantaneous power dissipation does not give rise to an immediate temperature change, due to a low-pass filtering effect in translating power variations into temperature variations [78].

2.4 Stochastic Decision Making Framework

In this section, we first present the idea of using a stochastic model for dealing with the uncertainty in observations made by a power manager, and then introduce a theoretical framework for constructing the model of a power manager operating in such an uncertain environment.

2.4.1 Partially Observable Environment

Generally speaking, at specific instances in time called decision epochs, a power manager observes some characteristic of the system, estimates the system performance state (e.g., its execution delay and power dissipation) on the basis of this observation, and issues a command (i.e., action) to force a state transition according to a power management policy that maximizes (or minimizes) a user-specified reward (or cost) function. The concept is that the actual state of the system, which is not directly observable, is estimated by observing some other system characteristic.

A Markovian decision process (MDP) model facilitates reasoning in domains where actions change the system states and where a reward (or cost) is utilized to optimize the system performance. The simple MDP is directly observable in the sense that its execution hinges on the assumption that the current system state can be determined without any errors and that the reward (cost) of an action can be calculated exactly. In partially observable environments, where the performance states of the system cannot be identified exactly, observations made by a power manager about the state of the system are indirect and may even be noisy; therefore, they only provide incomplete information. A naive strategy for dealing with this uncertainty is to ignore the problem altogether, that is, to treat the observations as if they provide accurate and complete information about the actual state of the system and act on them. This strategy can result in undesirable decisions based on erroneous readings of the current and next states of the system. A more sophisticated strategy resorts to stochastic modeling and decision making.

One way to deal with uncertainty under a wide range of operating conditions and environments is to rely on the history of previous actions and observations to disambiguate the current state. For example, we can adopt a hidden Markov model (HMM), where the state is not directly observable but variables influenced by the state are observable, to learn a model of the environment, including the hidden states. Note that in an HMM each state has a probability distribution over the possible actions, so the sequence of actions generated by the HMM gives some information about the sequence of states. Thus, a power manager in the HMM setting reasons about the state of the system indirectly through the observed variables, which captures complex system dynamics that are not completely observable.

2.4.2 Sequential Decision Making under Uncertainty

Decision making in a partially observable environment is achieved by combining aspects of HMMs and MDPs.
Specifically, we start with a semi-Markov decision process (SMDP), a generalization of MDPs, to model the decision making strategy, and then combine it with an HMM to consider the uncertainty in parameter observation. We call this combination a partially observable semi-Markov decision process (POSMDP) model. Recall that the inter-arrival times of requests in the SMDP model follow an arbitrary distribution, which is a more realistic assumption than the exponential distribution used in the conventional MDP model.

Definition 2.1: Partially Observable Semi-Markov Decision Process. A POSMDP is a tuple (S, A, O, T, Z, k) such that:
1) S is a finite set of states,
2) A is a finite set of actions,
3) O is a finite set of observations,
4) T is a transition probability function,
5) Z is an observation function, and
6) k is a cost function.

The state space S comprises a finite set of states, where s ∈ S can be defined as a performance state of the system. The action space A consists of a finite set of actions a ∈ A, e.g., dynamic voltage and frequency scaling (DVFS) values which control the performance state of the system. The observation space O contains a finite set of observations o ∈ O, e.g., on-chip temperature measurements. The state transition probability function, T(s^{t+1}, a^t, s^t), determines the probability of a transition from a state s^t to another state s^{t+1} after executing action a^t, i.e., the system transits to the state s^{t+1} at time t+1 with probability Pr(s^{t+1} | s^t, a^t) = T(s^{t+1}, a^t, s^t). An observation function, Z(o^{t+1}, s^{t+1}, a^t), which captures the relationship between the actual state and the observation, is defined as the probability of making observation o^{t+1} after taking action a^t that has landed the system in state s^{t+1}, i.e., state s^{t+1} generates observation o^{t+1} at time t+1 with probability Pr(o^{t+1} | s^{t+1}, a^t) = Z(o^{t+1}, s^{t+1}, a^t). We consider a cost function that assigns a quantitative cost value to each state and action pair, whereby an immediate cost, k(s, a), is incurred when action a is chosen in state s. Note that the costs can be set by the applications or the system developers.

Instead of making decisions based on the current perceived state of the system, the POSMDP maintains a belief, i.e., a probability distribution over the possible (nominal) states of the system, and makes decisions based on its current belief. The belief state at time t is a |S|×1 vector of probabilities defined as b^t := [b^t(s)], ∀s ∈ S, where b^t(s) is the posterior probability of state s at time t. Note that Σ_{s∈S} b^t(s) = 1. Based on the belief state, an action a^t is chosen from a set of available actions. A policy is defined as a sequence of mappings from belief states to actions, π = {π^t}.

Figure 2.2: Structure of a POSMDP-based power manager (the state estimation block turns observations o from the system into a belief state b, and the optimal decision making block maps b to an action a).

In this chapter, we consider a design scenario where actions incur a cost (i.e., energy dissipation), and the power manager's goal is to devise a policy that minimizes the total expected energy dissipation. Figure 2.2 illustrates the basic structure of a POSMDP-based power manager. The proposed power manager interacts with an uncertain environment and tries to minimize the system cost over time by choosing appropriate actions.
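As a concrete illustration of Definition 2.1, the sketch below encodes a two-state POSMDP as plain Python dictionaries. The states, actions, observations, and the cost values k(s, a) mirror the example worked out in Section 2.5.3; the transition and observation probabilities, however, are illustrative placeholders, not values from this dissertation.

```python
# Sketch of the POSMDP tuple (S, A, O, T, Z, k) from Definition 2.1.
S = ["s1", "s2"]   # performance states (delay, power)
A = ["a1", "a2"]   # DVFS actions, e.g. a1 = [1.08V / 500MHz]
O = ["o1", "o2"]   # temperature observations (low / high range)

# T[(s_next, a, s)] = Pr(s_next | s, a); placeholder probabilities.
T = {("s1", "a1", "s1"): 0.9, ("s2", "a1", "s1"): 0.1,
     ("s1", "a1", "s2"): 0.6, ("s2", "a1", "s2"): 0.4,
     ("s1", "a2", "s1"): 0.3, ("s2", "a2", "s1"): 0.7,
     ("s1", "a2", "s2"): 0.1, ("s2", "a2", "s2"): 0.9}

# Z[(o, s_next, a)] = Pr(o | s_next, a); placeholder probabilities.
Z = {("o1", "s1", "a1"): 0.8, ("o2", "s1", "a1"): 0.2,
     ("o1", "s2", "a1"): 0.3, ("o2", "s2", "a1"): 0.7,
     ("o1", "s1", "a2"): 0.7, ("o2", "s1", "a2"): 0.3,
     ("o1", "s2", "a2"): 0.2, ("o2", "s2", "a2"): 0.8}

# k[(s, a)] = immediate cost of action a in state s (values from Section 2.5.3).
k = {("s1", "a1"): 1.0, ("s2", "a1"): 0.8,
     ("s1", "a2"): 0.4, ("s2", "a2"): 1.5}
```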
The frequency-voltage level assignment 19 actions issued by the power manager change the performance state (power dissipation and speed) of the system and lead to quantifiable rewards/penalties. In our formulation of the decision-making strategy, we define state s ∈ S as the dissipated power level and largest pipeline stage delay of the circuit. Furthermore, we use an observation, i.e., a temperature measurement to help determine the system state. The power manager consists of two functional components. The first component is the belief state estimation block which computes the system’s belief state, while the second component is the decision making block which assigns optimal actions to the system based on a value-iteration policy optimization algorithm. Consider a three-state DPM problem as an example. A graphical representation of the belief state and its evolution is provided in Figure 2.3. s 1 s 2 s 3 b(1)+ b(2) + b(3)= 1 current belief state [ b(1) b(2) b(3)] s 1 s 2 s 3 a 2 a 1 a 3 s 1 s 2 s 3 b(1)+ b(2) + b(3)= 1 current belief state [ b(1) b(2) b(3)] s 1 s 2 s 3 a 2 a 1 a 3 (a) (b) Figure 2.3: A graphical representation of the belief state: (a) current belief (b) its one-step evolution for three different actions. 2.4.3 POSMDP Framework for Dynamic Power Management The rationale for developing a POSMDP framework for dynamic power management is depicted in Figure 2.4. First, since the performance state of a system cannot be directly determined by the PM, it uses temperature readings to help estimate the current system state in the form of a belief state. We assume that the 20 chip temperature at time t is one of three observations: o 1 , o 2 , and o 3 corresponding to different, but well-specified, temperature ranges. The system state at time t is defined as a combination of delay (e.g., d 1 , d 2 , or d 3 , where d 1 < d 2 < d 3 ) and power dissipation (e.g., p 1 , p 2 , or p 3 , where p 1 < p 2 < p 3 ) values. Starting from system state s t (d 2 , p 3 ) at time t, 1 the power manager issues an action, a t = (V dd1 , freq 2 ), and as a result, the system is expected to move into a new state s t+1 (d 3 , p 2 ) at time t+. Let’s assume that, due to variations, the resulting system state is actually s t+1 (d 3 , p 3 ). Since state s t+1 is not directly observable, the PM must rely on observation o t+1 at time t+1 to estimate the state that it is in. Power manager state t (delay, power) d 2 , p 3 Issue action t ( V dd1 , freq 2 ) (delay, power) d 3 , p 2 Expected (delay, power) d 3 , p 3 Determine state t+1 state t+1 Time [s] Temp [ºC] o 1 o 2 o 3 observation t+1 Actual Power manager state t (delay, power) d 2 , p 3 Issue action t ( V dd1 , freq 2 ) (delay, power) d 3 , p 2 Expected (delay, power) d 3 , p 3 Determine state t+1 state t+1 Time [s] Temp [ºC] o 1 o 2 o 3 observation t+1 Actual Figure 2.4: State estimation and state transition in the POSMDP-based DPM. Figure 2.5 illustrates yet another uncertainty effect. More precisely, the figure shows three scenarios where starting from current state s t (d 2 , p 2 ) with an action a 2 , e.g., [1.20V / 650MHz] issued at time t, the next system state may be any one of three possible states at time t+1, that is, the power manager cannot know for certain which next state will occur, although it will have some information from the 1 In this chapter, subscripts denote state information whereas superscripts denote time stamp. 21 observation, o t+1 . 
For example, in case (a), the system remains in the same active state after a2 is chosen, resulting in the same performance (i.e., s^{t+1}(d2, p2)). That is why decisions will be made based on the probability distribution vector of the belief state, b^{t+1}.

Figure 2.5: Example of three possible observations at time t+1, from which the belief state is calculated: starting from s^t(d2, p2) under action a2 = [1.20V / 650MHz], the next state may be (a) s^{t+1}(d2, p2), (b) s^{t+1}(d3, p3), or (c) s^{t+1}(d1, p1), observed through the temperature ranges o1, o2, o3.

2.5 Policy Representation in POSMDP

We provide a policy representation of the proposed power management framework by presenting a belief-state SMDP, and derive the optimal power management policy.

2.5.1 Conversion to Belief-state SMDP

In a partially observable environment, a power manager can make decisions based on the observed system state history H, since the underlying performance state of the system cannot be fully observed. Note that the system history H is a sequence of state and action pairs such as <s^0, a^0>, <s^1, a^1>, …, <s^t, a^t>. Thus, the power manager's behavior is determined by its policy, which is a mapping from the set of observable histories H to the action set A, where the power manager can only base its decisions on the history of its actions and states. This means that the complete history of system states is relevant to predicting the future state of the system, which makes this decision making process a non-Markovian process [65]. Fortunately, the power management problem may also be formulated as a Markovian process-based optimization problem, as proved in [77]. More precisely, we can convert the above-mentioned non-Markovian process into a Markovian process when formulating the power management problem as follows.

To achieve the Markovian property, we make use of the belief state, b. It has been shown that the belief state is sufficient in the sense that it completely captures the power manager's knowledge about the current state and past history [3]. Given belief state b^t and an action a^t resulting in observation o^{t+1}, we can compute the successor belief state b^{t+1} as follows:

    b^{t+1}(s') = Pr(s' | o^{t+1}, a^t, b^t)
                = Z(o^{t+1}, s', a^t) · Σ_s b^t(s) T(s', a^t, s) / Pr(o^{t+1} | a^t, b^t)    (2.2)

where

    Pr(o^{t+1} | a^t, b^t) = Σ_{s'} Σ_{s''} Z(o^{t+1}, s'', a^t) · b^t(s') T(s'', a^t, s')    (2.3)

In (2.2), the numerator is the product of the probability that observation o^{t+1} is made in state s' after action a^t is taken, and the probability that, starting from belief state b^t, we end up in state s' under action a^t. The denominator can be regarded as a normalization factor that makes the belief state probabilities sum to 1. Note that the |S|-dimensional belief state is continuous.
The belief state transition function, T_b(b^{t+1}, a^t, b^t), which provides the probability of a transition from the current belief state b^t to the next belief state b^{t+1} after executing action a^t, is given by:

    T_b(b^{t+1}, a^t, b^t) = Pr(b^{t+1} | a^t, b^t)
                           = Σ_o Pr(b^{t+1} | a^t, b^t, o) · Pr(o | a^t, b^t)    (2.4)

The probability of perceiving o, given action a^t and belief state b^t, is obtained by summing over all the actual states that may be reached (cf. (2.3)), i.e.,

    Pr(o | a^t, b^t) = Σ_{s'} Σ_s Z(o, s', a^t) · T(s', a^t, s) b^t(s)

As stated earlier, the key result is that if we maintain and update the belief state and transition probabilities according to (2.2) and (2.4), then the belief state gives us just as much information as the entire action-observation history. This shows that the optimal POSMDP solution is Markovian over the belief space. Hence, by using the belief space B, we can convert the original POSMDP into a completely observable, regular (albeit continuous state space) semi-Markov decision process (SMDP), the so-called belief-state SMDP, defined as follows.

Definition 2.2: Belief state SMDP is a tuple (B, A, T_b, C_b) such that:
1) B is the belief space,
2) A is the set of actions,
3) T_b is the belief state transition function, and
4) C_b is the cost function,

where the updated belief state after action a can be calculated from the previous belief state via (2.2), and the belief state transition function is given by (2.4). We also need a model of the system cost based on belief states:

    C_b(b^t, a^t) = Σ_s b^t(s) k(s, a^t)    (2.5)

which denotes the immediate cost incurred by action a^t issued in current belief state b^t. Here, k(s, a^t) denotes the immediate cost of action a^t in state s.

We have thus transformed the problem formulation based on the POSMDP model to one based on the belief-state SMDP model. The optimal policy π*(b) of the belief-state SMDP representation is also optimal for the physical-state POSMDP representation. Notice that the belief-state SMDP model is deterministic and fully observable because it already takes into account the uncertainty.
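Continuing the Python sketch started above, the belief update (2.2)-(2.3) and the observation probability used in (2.4) translate directly into a few lines of code over the S, T, and Z dictionaries; again, this is an illustrative sketch rather than code from the dissertation.

```python
# Belief update per Eqs. (2.2)-(2.3), using S, T, Z from the earlier sketch.

def pr_obs(o, a, b):
    """Pr(o | a, b): probability of observing o after action a in belief b."""
    return sum(Z[(o, s2, a)] * T[(s2, a, s)] * b[s] for s2 in S for s in S)

def belief_update(b, a, o):
    """Successor belief b'(s') per Eq. (2.2); returns a dict over S."""
    norm = pr_obs(o, a, b)  # normalization factor, Eq. (2.3)
    return {s2: Z[(o, s2, a)] * sum(b[s] * T[(s2, a, s)] for s in S) / norm
            for s2 in S}

b0 = {"s1": 0.7, "s2": 0.3}          # initial belief used in Section 2.5.3
b1 = belief_update(b0, "a2", "o2")   # belief after issuing a2 and observing o2
```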
This is reasonable in our problem context since the cost is defined as the energy dissipation of the system over time, which is clearly additive. From Bellman's principle of optimality [5], given the optimal cost function, we can specify the optimal policy as

    \pi^*(b) = \arg\min_{a} \left( C_b(b, a) + \gamma^{t_i} \sum_{b' \in B} T_b(b', a, b) \, V^*(b') \right)    (2.9)

Simply stated, the power manager determines the optimal action based on Eqn. (2.9) at each (e.g., time-based or interrupt-based) decision epoch. The task of casting the decision epochs to absolute time units is left to the system developer. In this chapter, we consider battery-operated systems that strive to conserve energy to extend battery life.

Given C_b(b_t, a_t) and T_b(b', a, b), one way to find an optimal policy is to find the minimum cost function. It can be determined by an iterative algorithm (cf. Figure 2.6) called value iteration, which has been shown to converge to the correct V* value. Unfortunately, it is not obvious when to stop this algorithm. A key result bounds the performance of the current greedy policy as a function of the Bellman residual of the current cost function [90]. It states that if the maximum difference between two successive cost functions is less than ε, then the cost of the greedy policy (i.e., the policy obtained by choosing, in every state, the action that minimizes the estimated discounted cost, using the current estimate of the cost function) differs from the cost function of the optimal policy by no more than 2εγ/(1−γ) at any state. This provides a stopping criterion for the algorithm.

    1: initialize V(b) arbitrarily
    2: loop until policy good enough
    3:   loop for ∀b ∈ B
    4:     loop for ∀a ∈ A
    5:       Q(b, a) = C_b(b, a) + γ^{t_i} Σ_{b'∈B} T_b(b', a, b) V(b')
    6:       V*(b) = min_a Q(b, a)
    7:     end loop
    8:   end loop
    9: end loop

    Figure 2.6: The value iteration algorithm.
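The loop of Figure 2.6 can be rendered in a few lines of Python over a discretized belief space. This is only a minimal sketch: the cost matrix, the belief-transition tensor, and the toy invocation below are hypothetical placeholders, assuming the belief space has been discretized into a finite set of points.

    import numpy as np

    def value_iteration(C, T, gamma, t_i=1.0, eps=1e-4):
        """Value iteration of Figure 2.6 on a discretized belief space.

        C[b, a]: immediate cost C_b(b, a); T[a][b_next, b]: belief transition
        probabilities T_b(b', a, b); gamma: discount factor; t_i: (average)
        sojourn-time exponent. Stops when the Bellman residual drops below
        eps, so the greedy policy is within 2*eps*gamma/(1-gamma) of optimal.
        """
        n_b, n_a = C.shape
        V = np.zeros(n_b)
        while True:
            # Q(b, a) = C(b, a) + gamma^{t_i} * sum_{b'} T(b', a, b) V(b')
            Q = np.stack([C[:, a] + gamma**t_i * (T[a].T @ V)
                          for a in range(n_a)], axis=1)
            V_new = Q.min(axis=1)
            if np.max(np.abs(V_new - V)) < eps:
                return V_new, Q.argmin(axis=1)   # value and greedy policy
            V = V_new

    # Toy invocation with random costs and uniform transitions (illustrative)
    n_b, n_a = 4, 2
    rng = np.random.default_rng(0)
    C = rng.uniform(100, 200, size=(n_b, n_a))
    T = [np.full((n_b, n_b), 1 / n_b) for _ in range(n_a)]
    V, policy = value_iteration(C, T, gamma=0.5)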
2.5.3 POSMDP-based DPM by Example

An example of value iteration for the POSMDP model is given next. The purpose of the example is to show how to find the best action by building value functions. We consider the POSMDP framework of a power manager with two system states, S = {s_1, s_2}, where s_1 denotes a low-power (low-performance) system state whereas s_2 corresponds to a high-power (high-performance) state; two actions, A = {a_1, a_2}, where a_1 commands a low-voltage, low-frequency setting whereas a_2 commands a high-voltage, high-frequency assignment to the system; and finally two temperature observations, O = {o_1, o_2}, where o_1 corresponds to a low temperature range whereas o_2 denotes a high temperature reading.

Table 2.1: Parameter values for the example problem.

    State        s_1: delay d > 8 ns,  power p ≤ 22 mW
                 s_2: delay d ≤ 8 ns,  power p > 22 mW
    Action       a_1 = [1.08V / 500MHz]
                 a_2 = [1.20V / 650MHz]
    Observation  o_1 = [50°C ≤ temp < 65°C]
                 o_2 = [65°C ≤ temp < 75°C]

The parameter values are given in Table 2.1. We also specify the immediate values of the two actions. Let action a_1 have a value of 1.0 if it is issued in state s_1 and 0.8 in state s_2. Similarly, let action a_2 have a value of 0.4 and 1.5 in states s_1 and s_2, respectively; i.e., k(s_1, a_1) = 1.0, k(s_2, a_1) = 0.8, k(s_1, a_2) = 0.4, and k(s_2, a_2) = 1.5. Referring to Figure 2.7, the two actual system states {s_1, s_2} are labeled by belief states [1, 0] on the left (i.e., the state is s_1 with probability 1) and [0, 1] on the right (i.e., the state is s_2 with probability 1). The solid line represents the value of taking action a_1, while the dashed line represents the value of taking action a_2. The actual belief state is a probability distribution over the two states s_1 and s_2. Assuming that the initial belief state is [0.7 0.3], we will show how to construct the value function from which we determine the best action (i.e., the one with the lowest value) when we consider only a sequence of two actions from any belief state (i.e., the horizon length is 2).

The first step is to find the immediate values of choosing actions. For example, by applying (2.6), the immediate value of doing action a_1 in the initial belief state b is (0.7×1.0)+(0.3×0.8) = 0.94. Similarly, the immediate value of performing action a_2 is (0.7×0.4)+(0.3×1.5) = 0.73. Figure 2.7 (a) graphically depicts the immediate values over the belief space at the current belief state. The immediate cost (horizon length 1 value) for each action defines a linear function over the belief space. We want to choose the action that gives the lowest value depending on the particular belief state. In the figure, we also show the partition of the belief space which this value function imposes. The gray region denotes all the belief states where action a_2 is the best strategy to use, while the white region contains the belief states where action a_1 is the best strategy. Since the current belief state lies in the gray region, action a_2 is the best available action for belief state b.

Figure 2.7: A graphical representation of the belief states and value functions: (a) the current belief state [0.7, 0.3] and the immediate values (1.0, 0.8 for a_1; 0.4, 1.5 for a_2) over the belief space between [1, 0] and [0, 1]; (b) the next belief state [0.3, 0.7] and the horizon-1 value function, with the two action lines crossing near [0.55, 0.45].

We next show how to compute the horizon 2 value of belief state b given an action a_2 and an observation o_2 (which corresponds to a high temperature reading). The horizon 2 value of a belief state is simply the value of the immediate action plus the value of the next action. In general, we would like to find the best possible value, which would require considering all possible sequences of two actions. However, since in this restricted problem our immediate action is fixed, the immediate value is fully determined. The only question is what the best attainable value for the initial belief state b is when we perform action a_2 and observe o_2. We assume that with this information, by using (2.2), the next belief state b' is computed as [0.3 0.7]. This new belief state is the belief state we are in when we have one more action to perform. We know what the best values are for every belief state when there is a single action left to perform; this is exactly what our horizon 1 value function tells us.
Note that from looking at where b' lies in the belief space, we immediately know that the best action to take is a_1. Therefore, the best horizon 2 value of belief state b, given action a_2 and observation o_2, is 0.73+(0.3×1.0)+(0.7×0.8) = 1.59. This value corresponds to the sequence of two actions: a_2 followed by a_1. Figure 2.7 (b) illustrates the horizon-1 value function at the next belief state for initial action a_2 and observation o_2.

Next we show how to compute the value of belief state b given only an action a_2. In our problem setup, there are two possible observations, o_1 and o_2. Even though we know the action with certainty, the observation we get is not known a priori. For the given belief state b, each observation has a certain probability associated with it. Since we know the value of the resulting belief state given the observation, to obtain the value of the belief state without knowing the observation, we simply weight each resulting value by the probability of getting that observation. Continuing with the previous example, let us assume that when we observe o_1 after action a_2, from (2.2), the next belief state b' is [0.6 0.4]. Looking at where b' lies in the belief space, we know that the best action is a_2. Therefore, the horizon 2 value of belief state b, given a_2 and o_1, is 0.73+(0.6×0.4)+(0.4×1.5) = 1.57. To summarize, starting in b and fixing the initial action to a_2, the next best action to take is a_2 if we observe o_1 and a_1 if we observe o_2.

Similarly, we can compute the optimal strategy for b given that the initial action is a_1. More precisely, assume that if we observe o_1 after action a_1, the next belief state b' will be [0.9 0.1], whereas if we observe o_2 after action a_1, the next belief state b' is [0.5 0.5]. Then the horizon 2 value of belief state b when we fix the action at a_1 and observe o_1 is 0.94+(0.9×0.4)+(0.1×1.5) = 1.45, corresponding to action a_2, whereas if we observe o_2 after a_1, the horizon 2 value is 0.94+(0.5×1.0)+(0.5×0.8) = 1.84, corresponding to action a_1. To summarize, starting in b and fixing the initial action to a_1, the next best action is a_2 if we observe o_1 and a_1 if we observe o_2.

Suppose now that the probabilities of getting observations o_1 and o_2 for the given belief state b and action a_2 are 0.45 and 0.55, respectively, and that these probabilities for the given belief state b and action a_1 are 0.75 and 0.25, respectively. Hence, the horizon 2 value of belief state b when we fix the action at a_2 is (0.45×1.59)+(0.55×1.57) = 1.58, and when we fix the action at a_1 it is (0.75×1.45)+(0.25×1.84) = 1.55. The optimal strategy for b is the one that yields the least horizon 2 value. In this case, the strategy whereby we "do a_1, and then do a_2 if o_1 and do a_1 if o_2" is the optimal strategy for b. Now if we fix the current action to be a_1 and the future action to be the same as it is at point b (i.e., o_1: a_2, o_2: a_1), we can find the value of every single belief point for that particular strategy. This is the best strategy to use for b, but it may not be the best strategy for other points in the belief space. To efficiently compute the optimal strategy for all belief points, we utilize "transformed horizon 1 value functions" for different initial actions and partition the 1-D continuous belief space into a set of segments, where one optimal strategy holds within each segment.
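The arithmetic of this example can be replayed with a short Python script. The cost values come from Table 2.1 and the text; the next-belief vectors and observation probabilities are the assumed ones stated above, not values derived here.

    import numpy as np

    # Immediate costs k(s, a) from the example: rows s1, s2; columns a1, a2
    k = np.array([[1.0, 0.4],
                  [0.8, 1.5]])

    def immediate(b, a):
        # Belief-state cost per (2.6): sum_s b(s) k(s, a)
        return b @ k[:, a]

    def horizon1(b):
        # Best (lowest) one-step value at belief b
        return min(immediate(b, 0), immediate(b, 1))

    b = np.array([0.7, 0.3])

    # Next beliefs and observation probabilities assumed in the text
    b_next = {(0, 0): np.array([0.9, 0.1]), (0, 1): np.array([0.5, 0.5]),
              (1, 0): np.array([0.6, 0.4]), (1, 1): np.array([0.3, 0.7])}
    pr_o = {0: (0.75, 0.25), 1: (0.45, 0.55)}   # (Pr(o1), Pr(o2)) per action

    for a in (0, 1):
        v2 = sum(pr_o[a][o] * (immediate(b, a) + horizon1(b_next[(a, o)]))
                 for o in (0, 1))
        print(f"horizon-2 value, initial action a{a + 1}: {v2:.2f}")
    # prints 1.55 for a1 and 1.58 for a2, matching the worked example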
2.6 Dynamic Power Management

We introduce two techniques that incorporate the proposed uncertainty management framework: offline and online DPM. The offline DPM technique finds an optimal action, assuming that the inputs to the power manager are known in advance. Our approach to offline DPM is similar to conventional offline DPM techniques [6] in the sense that the entire input trace is known before any decisions are made; the difference is that in our offline DPM framework we consider uncertainty in the reported power and delay values. The online DPM technique, on the other hand, refers to strategies that attempt to find an optimal action based on information available at runtime. The proposed online DPM utilizes a Kalman-filter-based technique for belief state estimation to reduce the computational complexity.

2.6.1 Offline Dynamic Power Management

We construct offline a collection of policies, where a policy is a list of state-action pairs, usually implemented as a hash table whose key is the state and whose value is the action. Policies are generated in advance through extensive offline simulations, as explained in section 2.5. The various policies are organized into a decision tree where each leaf node represents a policy, as illustrated in Figure 2.8 (a). Nodes in the decision tree are indexed by the parameters that characterize the performance state of the system; we use the power dissipation and execution delay values, e.g., [18mW 20mW] and [4ns 8ns]. The best policy can be found by tracing the appropriate path from the root node to a leaf node in the decision tree using the given parameter values. Once a policy is located, the belief state probability is used as the key into the policy hash table to find the optimal action.

Figure 2.8: (a) A decision tree of policy tables, indexed first by power dissipation range and then by execution time range, with each leaf node holding a policy hash table keyed by the belief state, e.g., [b(s_1) b(s_2) b(s_3)] = [0.2 0.55 0.25]; (b) probability density functions of power and delay values used to trace a path from the root to a leaf node in the tree.

In the aforementioned approach, we assume that the power dissipation and execution times are given in the form of probability density functions (e.g., normal distributions) based on state-action pairs, as shown in Figure 2.8 (b), where s_1, s_2, and s_3 are defined as <[18mW 20mW], [4ns 8ns]>, <(20mW 22mW], (8ns 12ns]>, and <(22mW 24mW], (12ns 16ns]>, respectively. By doing so, we account for uncertainty in the performance state while indexing the level of power dissipation and execution delay. For example, the device power P is assumed to be a normally distributed random variable with a mean value of P_sim and a standard deviation of ΔP induced by uncertainty, as illustrated at the top of Figure 2.8 (b). Note that in our problem setup, P_sim is the simulated power number while ΔP is the standard deviation of the power values, which is calculated by running different tasks on the system at the different process corners (e.g., fast, typical, and slow) available with the process technology.
Furthermore, we can vary the ranges of power values assigned to the states (e.g., the 2mW-wide range [18mW 20mW] can be widened to a 4mW range, resulting in [17mW 21mW]) to account for a higher standard deviation (i.e., more uncertainty). The execution times are treated in the same way. Then the belief state, which represents the probabilities of being in each of the performance states, is obtained as the key into the policy hash table. For example, referring to Figure 2.8 (b), suppose that the probabilities of being in s_1, s_2, and s_3 are 0.3, 0.6, and 0.1 in terms of the power dissipation level, and 0.1, 0.5, and 0.4 in terms of the execution delay. Then the belief state [b(s_1) b(s_2) b(s_3)] is calculated simply as [0.2 0.55 0.25] by taking the average of the two probability vectors.

    input: N(P_sim, (ΔP)^2), N(D_sim, (ΔD)^2)
    output: action
    1: index the level of power dissipation
    2: index the level of execution delay
    3: obtain the belief state based on the indexes
    4: find an appropriate action through the hash table
    5: return action

    Figure 2.9: An offline power management technique.

Figure 2.9 summarizes the offline power management technique with decision-tree-based policy selection, where the power and delay values are given as N(P_sim, (ΔP)^2) and N(D_sim, (ΔD)^2). Similar to the power values, D_sim denotes the simulated delay number while ΔD is the standard deviation of the delay values. When the power manager receives a performance state together with the knowledge of previously assigned action-state pairs, an optimal action is selected by the PM based on the policy hash table and issued to the system, which causes the system state to change.
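A compact sketch of the lookup in Figure 2.9 is shown below. The bin edges, quantization, and the single toy policy entry are hypothetical; an actual deployment would populate the tree and hash tables from the offline simulations described above.

    import bisect

    # Hypothetical index bins (mW, ns) in the style of Figure 2.8 (a)
    POWER_EDGES = [18.0, 20.0, 22.0, 24.0]
    DELAY_EDGES = [4.0, 8.0, 12.0, 16.0]

    # Each leaf policy maps a quantized belief state to a DVFS action
    policies = {(0, 0): {(0.2, 0.55, 0.25): "DVFS_1"}}   # toy single entry

    def lookup_action(p_mean, d_mean, belief):
        """Offline DPM lookup: index the power and delay levels, then use
        the belief state as the key into the leaf's policy hash table."""
        p_idx = bisect.bisect_right(POWER_EDGES, p_mean) - 1
        d_idx = bisect.bisect_right(DELAY_EDGES, d_mean) - 1
        leaf = policies[(p_idx, d_idx)]
        key = tuple(round(x, 2) for x in belief)   # quantize to table keys
        return leaf[key]

    print(lookup_action(19.0, 6.0, (0.2, 0.55, 0.25)))   # -> "DVFS_1"

Because both the tree descent and the hash-table probe are constant-time, the runtime cost of this lookup is negligible compared with re-solving the belief-state SMDP.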
2.6.2 Online Dynamic Power Management

For online power management, the belief-state transition probabilities are not given in advance. Note that the complexity of the computation required by Eqn. (2.2) for updating the belief state grows rapidly with the number of state variables, making it infeasible for real-time applications, e.g., online DPM techniques. In addition, calculating exact solutions for finite-horizon stochastic POSMDP problems is PSPACE-hard [65]. Therefore, exact solutions cannot be found for belief-state SMDPs with more than a handful of states. Indeed, solving a belief-state SMDP problem is extremely expensive because of the complexity of calculating the exact belief state. To overcome this difficulty, one is usually forced to estimate the system state by some other approach. By doing so, the overwhelming complexity of deriving a power management policy for every possible situation is avoided. The basic idea of our online power management technique is to estimate the unknown state based on a look-ahead search technique which also includes a step to predict the unknown error incurred while estimating. Hence, we interleave state estimation based on the Kalman filter (KF) technique [43] and policy optimization based on the value iteration algorithm. Details are provided below. We present a prediction-based online DPM technique which is analytically and statistically tractable.

First, assuming that we know the distributions of the PVT variation and the observation noise, we can define the state and observation models in accordance with our proposed framework as follows:

    b_{t+1} = X b_t + Y a_t + u_t,   u_t ~ N(0, Q_t)    (2.10)

    o_{t+1} = Z b_{t+1} + v_{t+1},   v_{t+1} ~ N(0, R_t)    (2.11)

where t denotes a time step, u_t is a state noise induced by PVT variation which is normally distributed with zero mean and variance Q_t, and v_{t+1} is a temperature observation noise normally distributed with zero mean and variance R_t. The state transition matrix X includes the probabilities of transitioning from state b_t to another state b_{t+1} when action a_t is taken, the action-input matrix Y relates the action input to the state, and the observation matrix Z, which maps the true state space into the observed space, contains the probabilities of making observation o_{t+1} when action a_t is taken, leading the system to enter state s_{t+1}. In practice, X, Y, and Z might change with each time step or measurement, but here we assume they are constant.

With the above-mentioned parameters, the structure of our proposed online DPM is provided in Figure 2.10. The estimation algorithm performs the state estimation based on the KF as follows.

a) Initialize: The algorithm initializes the first state b_t to b_0, and the error covariance matrix E_t, which is a measure of the estimated accuracy of the state prediction, to a diagonal matrix whose diagonal elements are set to some fixed value, signifying that the initial system state is uncertain.

b) Predict: The algorithm computes the predicted (a priori) state b⁻_{t+1} and the predicted (a priori) error covariance matrix E⁻_{t+1}. (The "−" superscript denotes that a value calculated at the prediction stage will be updated in the correction stage.)

c) Update: The algorithm first computes the optimal Kalman gain K_{t+1} and uses it to produce an updated (a posteriori) state estimate b_{t+1} as a linear combination of b⁻_{t+1} and the Kalman-gain-weighted residue between the actual observation o_{t+1} and the predicted observation Z b⁻_{t+1}. The algorithm also updates the error covariance matrix. This iterative approach is one of the appealing features of the Kalman filter.

    Initialize
      - noise variances: Q_t = Q_0, R_t = R_0
      - first state: b_t = b_0
      - error covariance: E_t = R_0
    Predict
      - next state: b⁻_{t+1} = X b_t + Y a_t
      - error covariance: E⁻_{t+1} = X E_t X^T + Q_t
    Update
      - Kalman gain: K_{t+1} = E⁻_{t+1} Z^T (Z E⁻_{t+1} Z^T + R_t)^{-1}
      - state prediction with observation: b_{t+1} = b⁻_{t+1} + K_{t+1}(o_{t+1} − Z b⁻_{t+1})
      - error covariance: E_{t+1} = (I − K_{t+1} Z) E⁻_{t+1}
      - t ← t + 1

    Figure 2.10: The structure of online power management. The state estimator, driven by observation o and action a, feeds the policy (π), which issues sleep, active, or DVFS_1-DVFS_3 commands to the system.
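The predict/update cycle of Figure 2.10 maps directly onto a few NumPy operations. This is a minimal sketch only: the matrix dimensions and the constant matrices in the toy invocation are hypothetical placeholders, not the dissertation's calibrated values.

    import numpy as np

    def kf_step(b, E, a, o, X, Y, Z, Q, R):
        """One predict/update cycle of the Figure 2.10 state estimator.

        b: belief estimate; E: error covariance; a: action input;
        o: temperature observation vector. X, Y, Z, Q, R as in (2.10)-(2.11).
        """
        # Predict the (a priori) state and error covariance
        b_pred = X @ b + Y @ a
        E_pred = X @ E @ X.T + Q
        # Update with the optimal Kalman gain
        K = E_pred @ Z.T @ np.linalg.inv(Z @ E_pred @ Z.T + R)
        b_new = b_pred + K @ (o - Z @ b_pred)
        E_new = (np.eye(len(b)) - K @ Z) @ E_pred
        return b_new, E_new

    # Toy three-state run (all matrices hypothetical)
    n = 3
    X, Y, Z = np.eye(n), 0.1 * np.eye(n), np.eye(n)
    Q, R = 1.1 * np.eye(n), 1.1 * np.eye(n)
    b, E = np.full(n, 1 / 3), R.copy()              # uncertain initial state
    b, E = kf_step(b, E, a=np.zeros(n), o=np.array([0.3, 0.5, 0.2]),
                   X=X, Y=Y, Z=Z, Q=Q, R=R)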
Simply speaking, the proposed online DPM technique estimates the next belief state based on the KF technique, and computes the belief-state transition probabilities and observation functions by deriving maximum likelihood estimates while storing the occurrence frequencies. Figure 2.11 shows the proposed online DPM technique based on the Kalman filter, where an appropriate action is issued to the system by utilizing the value iteration algorithm after estimating the belief state.

    input: o
    output: action
    1: observe temperature o
    2: estimate the next state b'
    3: compute T_b and Z by deriving maximum likelihood estimates
    4: find an appropriate action through the value iteration algorithm
    5: return action

    Figure 2.11: An online power management technique.

2.7 Experimental Results

In the experimental setup, we implemented a 32-bit RISC processor compatible with [94] in a TSMC 65nm LP library, which supports three optional operating voltages (1.08V, 1.20V, and 1.29V) and dual threshold voltages.

Figure 2.12: Flow of power simulation. The flow connects the RTL design through Specman (RTL simulation) and Synopsys (analyze, compile, rtl2saif, saif2trace, vcd2saif, read_saif) steps, exchanging forward and backward SAIF files with back-annotation to produce the power report.

To achieve accurate values for the dynamic and leakage power consumption, we first generated a forward SAIF (Switching Activity Interchange File) after synthesizing the design into a gate-level netlist. Second, we obtained a backward SAIF through back-annotated RTL simulation with the Specman function simulator [96], and then executed Power Compiler [95], where the switching activities of the netlist are incorporated so as to accurately calculate the dynamic and static power consumption (cf. Figure 2.12).

In the first experiment, we analyzed the characteristics of the designed processor in terms of power dissipation by executing SPECint2000 benchmark programs [92]; we include data for only three of the benchmark programs: gcc, gap, and gzip. Table 2.2 reports the power dissipation distribution of the processor, indicating that certain components of the processor, such as the execution units and the register units, have a very high power density.

Table 2.2: The distribution in percentage of power dissipation in the processor (no cache).

    Function block    gcc     gap     gzip
    dAreg             4.6     4.3     4.6
    dOutreg           9.2    13.1    14.4
    iAreg            15.4    15.7    13.8
    incr              4.6     4.0     4.1
    mul               4.5     4.2     2.3
    shifter           3.8     3.1     2.7
    alu              18.6    14.2    16.5
    sr                2.3     1.7     2.2
    reg              15.1    14.8    15.4
    fetch             4.6     4.2     4.1
    decode            4.6     4.2     4.1
    busCtl            8.1    12.3    11.7
    execute           4.6     4.2     4.1
The second experiment demonstrates the effectiveness of the proposed DPM under the uncertainty management framework. First, we set the parameter values for the evaluation of the proposed framework as shown in Table 2.3, where we have a set of three actions {a_1, a_2, a_3}, with a_1 = [1.08V / 500MHz], a_2 = [1.20V / 650MHz], and a_3 = [1.29V / 800MHz], and observations {o_1, o_2, o_3}.

Table 2.3: Parameter values for a given experiment.

    State         delay (ns)      power (mW)
      s_1         (9.0 12.0]      [18.0 23.5]
      s_2         (6.5 9.0]       (23.5 25.5]
      s_3         [3.5 6.5]       (25.5 30.5]

    Observation   Description [°C]
      o_1         [50 65]
      o_2         (65 75]
      o_3         (75 90]

    cost k(s, a)    a_1     a_2     a_3
      s_1           216     218     212
      s_2           212     190     165
      s_3           165     138     105

The ranges of the observations are defined by temperature thresholds based on the ACPI (Advanced Configuration and Power Interface) specification [97]. The expected cost rate is defined as the power-delay product (PDP) of the processor for each state-action pair, where we set the range of performance states {s_1, s_2, s_3} as a combination of power dissipation and execution delay values for the processor. For example, cost k(s_1, a_1) is the power-delay product of the system that stays in state s_1 when action a_1 is taken, i.e., 18mW (least power) × 12ns (highest delay) = 216pJ. Similarly, k(s_1, a_2) and k(s_1, a_3) are calculated as 20.75mW (medium power) × 10.5ns (medium delay) ≈ 218pJ, and 23.5mW (highest power) × 9ns (least delay) ≈ 212pJ, respectively. Note that we define different cost values for a given system state (e.g., s_1), since different actions (e.g., a_1, a_2, or a_3) can cause the system to transition into the same system state (i.e., the system maintains the same range of performance values) with different cost values.

We arbitrarily chose a sequence of 50 application program runs, comprising instances of the gcc, gap, and gzip benchmarks, e.g., gap_1 - gzip_2 - gap_3 - gcc_4 - ... - gap_50, where program_i is the i-th program in the sequence. The sequence of 50 application programs is executed on the processor to calculate the belief states based on the estimated temperature, which serves as the observation. Because we do not have a packaged IC equipped with a thermal sensor to report the on-chip temperature, we estimate the on-chip temperature by utilizing

    T_chip = T_A + (θ_JA − ψ_JT) · P    (2.12)

based on the parameter values extracted from the commercial data sheet for a PBGA package. Note that T_A is the ambient temperature, θ_JA is the junction-to-ambient thermal resistance, ψ_JT denotes the junction-to-top-of-package thermal characterization parameter, and P is the power dissipation. Next, the belief states are evaluated based on the actions and observations over the state space as the processor executes the sequence of programs.

Figure 2.13: Trace of belief state for state estimation.

Figure 2.13 shows the trace of the belief state for states s_1, s_2, and s_3, where we use the Kalman filter estimation technique of the proposed online DPM framework. We set the values of the PVT variation variance Q and the observation noise variance R equal to 1.1, where we obtained the probability density function for the power consumption of the processor with a mean value of 25mW and a variance of 1.1 (i.e., N(25, 1.1)).
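As a concrete instance of the package model (2.12) used above, the snippet below plugs in illustrative numbers; the actual θ_JA and ψ_JT values came from a commercial PBGA data sheet and are not reproduced here, so the constants in this example are assumptions for demonstration only.

    def chip_temp(t_ambient, theta_ja, psi_jt, power_w):
        """Estimate the on-chip temperature per Eq. (2.12):
        T_chip = T_A + (theta_JA - psi_JT) * P."""
        return t_ambient + (theta_ja - psi_jt) * power_w

    # Illustrative numbers only: 25 C ambient, theta_JA = 30 C/W,
    # psi_JT = 2 C/W, and 25 mW of average dissipation
    print(chip_temp(25.0, 30.0, 2.0, 0.025))   # -> 25.7 (degrees C)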
In our experiment, the time steps are abstractly defined and the power manager issues a command at each time step (i.e., decision epoch), where observations are made when each program in the sequence has completed. The simulations reported in Figure 2.14 show the results of the policy generation algorithm based on the information provided in Table 2.3 and Figure 2.13. We set the discount factor to 0.5 when evaluating the value function. The optimal action is chosen to minimize the value function.

Figure 2.14: Evaluation of the policy generation algorithm.

In the third experiment, we investigate how robustly the proposed approach can handle variability during the power management process by comparing it against various operating conditions (i.e., worst and best corners). The optimal DPM policy is obtained by evaluating the value function with the derived state transition probabilities. In our approach, we performed tasks while varying the operating conditions, and identified the most probable system state given noisy temperature observations. Table 2.4 summarizes these simulation results in terms of power, energy, and normalized energy-delay product (EDP) as the figures of merit. Clearly, the uncertainty-aware DPM approach cannot do any better than a conventional DPM at the best corner. The expectation, however, is that it will outperform the conventional DPM at the worst corner, while ensuring energy efficiency. It is also clearly seen that a lot of silicon performance is left untapped under the worst-corner assumption.

Table 2.4: Comparing results of our proposed approach with the corner-based results.

    Condition              Our approach    Worst corner    Best corner
    Average power (mW)         25.8            22.7            28.8
    Average energy (mJ)       163.5           197.2           155.4
    EDP (normalized)           1.27            2.04            1.00

The fourth experiment is designed to evaluate the proposed offline/online DPM techniques by capturing the energy-saving opportunities of the system. For both DPM policies, we compare the performance of the proposed technique with a conventional DPM approach, similar to that presented in [3], which can be defined simply as follows (denoted by Greedy), where DVFS_1 < DVFS_2 < DVFS_3 in terms of operating voltage and frequency values.

Greedy: Apply the following DPM strategy.
- When the workload of tasks (e.g., the arrival rate of tasks) is low, use the lowest DVFS_1 setting.
- When the workload of tasks is high, use the highest DVFS_3 setting. Otherwise, use the DVFS_2 setting.

Note that we account for the overhead of power-mode transitions during simulation, as illustrated in [3]. The simulation results for the proposed offline and online DPM techniques are shown in Figure 2.15, where the online DPM policy tends to dynamically adapt to environmental changes, which can incur a mode-transition penalty. This is because the online DPM algorithm performs a prediction-correction procedure to react to the environmental changes. Table 2.5 presents the simulation results in terms of power savings (%), including the specific power-saving result for each performance state. For example, there is a 10.5% power saving in state s_3 under the offline DPM, whereas there is a 7.4% power penalty in state s_1. The table shows that the proposed DPM policies result in power savings when the system is in states s_2 and s_3.
However, there is no significant impact for the online DPM in terms of average power savings.

Figure 2.15: Power consumption of offline / online DPM policies.

Table 2.5: Comparison of our DPM policies with the conventional approach.

    DPM policy                          power saving in each state (%)    average power
    (Proposed technique)                 s_1      s_2      s_3            saving (%)
    Greedy (avg. power: 24 mW)            -        -        -                 -
    Offline DPM policy                  -7.4      3.1     10.5               2.1
    Online DPM policy                   -9.2      1.7      8.9               0.5

Table 2.6 gives the results of the energy savings achieved by the proposed DPM techniques. In contrast to the small impact on power savings, this result demonstrates that our approaches greatly reduce the total energy dissipation, especially in states s_1 and s_2. For example, there is a 21.1% energy saving in state s_1 under the offline DPM policy, even though we incur a 7.4% power penalty by running the same DPM policy. The conventional DPM approach (which is unaware of the PVT variations), however, can slightly outperform our DPM technique in terms of energy savings only in the case that the system is in state s_3, where our online DPM technique produces an a priori estimate for the next time step which may result in energy waste.

Table 2.6: Comparison of our DPM policies with the conventional approach.

    DPM policy                          energy saving in each state (%)   average energy
    (Proposed technique)                 s_1      s_2      s_3            saving (%)
    Greedy (avg. energy: 175 mJ)          -        -        -                 -
    Offline DPM policy                  21.1     12.3     -2.5              10.3
    Online DPM policy                   16.8      8.1     -4.5               6.8

Overall, the proposed DPM techniques achieve energy savings in the presence of PVT variations of up to an average of 10.3% and 6.8% for the offline and online policies, respectively. Furthermore, it is clearly seen that if we focus on conserving energy in low-performance settings, we can achieve energy savings of up to 21.1% and 16.8% for the offline and online policies, respectively (see the energy saving in state s_1). This scenario typically occurs in applications that require low voltage-frequency settings for their operation.

2.8 Summary

We addressed the problem of system-level power management subject to uncertainty in system performance parameters and in observations made on the system. The uncertainty itself is caused by a variety of factors, including manufacturing-induced process variations and environment-induced voltage and temperature variations. In particular, we presented a system-level power management approach based on a stochastic decision-making framework, i.e., a partially observable Markovian decision process model, which is capable of coping with uncertainty in the system state and observations. This uncertainty management framework is guaranteed to find an optimal power management policy by utilizing a value iteration algorithm. We implemented both offline and online DPM techniques and reported experimental results demonstrating their effectiveness in robustly reducing total system energy dissipation when running a variety of applications.

Chapter 3
Machine Learning based Power Management for Multicore Processors

3.1 Introduction

Ongoing advances in CMOS process technologies and VLSI design have resulted in the introduction of multicore processors.
Conventional DPM methods have not been able to take full advantage of power-saving techniques such as dynamic voltage and frequency scaling (DVFS). This is because a system-level power management routine, which continuously monitors the workloads of multiple processors, analyzes the information to make decisions, and issues DVFS commands to each processor, can give rise to considerable computational overhead and/or complicate task scheduling [51]. The higher the number of cores in the processor, the more severe these issues become. Therefore, the ability of a DPM framework to scale well on a multicore processor by eliminating these overheads is becoming a critical requirement [58].

The problem of determining a power management policy that utilizes DVFS in a multicore processor has received a lot of attention [35][91][31][56][19][47]. Although these techniques perform system-level DPM or DVFS for multicore processors, little attention has been paid to improving the decision-making strategy so as to minimize the overhead of the power manager (PM), i.e., to devising a learning-based power management policy that can quickly analyze some easily available input features and accurately predict the overall system performance state, which is subsequently used to choose and issue the "optimal action". Indeed, traditional power management techniques are unsuccessful in reducing the total chip power dissipation when the overhead of power-mode transitions is not controlled in a multicore processor, where the PM must control each processor individually [19]. Therefore, knowing (or predicting) in real time which frequency and voltage levels to use, and when to apply a new performance setting in a multicore processor, must be done with the aid of a self-improving power manager.

In this chapter, we address a dynamic power management problem where a PM continuously issues power-mode transition commands to maximally exploit the power-saving opportunities. The overhead associated with the functioning of the PM, which must monitor the workload of the system and make decisions about the performance mode (voltage and frequency level) of the different cores in a multicore processing system, tends to be high. This chapter thus describes a supervised learning [16] based DPM framework for the multicore processor, which enables the PM to predict the performance state of the processor for each incoming task by inspecting some readily available input features, followed by a Bayesian classification technique. The key rationale for utilizing supervised learning for power management is to reduce the overhead of the PM. Experimental results demonstrate the effectiveness of the proposed power management framework and show that the DPM technique ensures system-wide energy savings under rapidly varying workloads.

The remainder of this chapter is organized as follows. Section 3.2 provides a motivational example and related work, while section 3.3 describes the details of the proposed supervised learning based power management framework. An extraction strategy for input features and output measures is described in section 3.4. In section 3.5, we present a stochastic policy optimization technique. Experimental results and a summary are given in sections 3.6 and 3.7.

3.2 Preliminaries

3.2.1 Motivational Example

Consider a power-managed multicore processor, where each processor core is equipped with multiple power-saving modes (i.e., different DVFS settings).
A system-level PM dynamically assigns the DVFS setting for each processor based on its workload, as shown in Figure 3.1 for a distributed shared-memory multicore processor. The figure also shows a dynamic load balancing block, which enables high-throughput and low-latency data flow for each processor, and a control unit, which ensures cache coherency. The flow queue (i.e., receive queue) interacts with the PM by providing information about a processor's workload for the purpose of controlling the performance state of the processor. The PM, which profiles and analyzes the workload characteristics, i.e., the arrival rate of tasks, by examining the flow queue, determines and executes a power management policy (i.e., one that maps workloads to power state transition commands) so as to minimize the system energy dissipation.

Figure 3.1: Example of a power-managed multicore processor. A dynamic load balancing block feeds per-processor flow queues (each with a processor interface, processor, and L1 memory) over the coherence control bus and I & D bus, while the power manager performs performance monitoring, policy calculation, and DVFS assignment.

When tasks are given to a multicore processor, the dynamic load balancing block (i.e., SR) dispatches each task into some flow queue (i.e., a local SQ). Each processor core (i.e., SP) reads the assigned tasks from its SQ. At regular time instances (or aperiodic times dictated by interrupts), called decision epochs, the PM determines the workload of the processor by checking the occupancy state of its SQ, and subsequently assigns a DVFS value to the processor. Note that the decision epochs are separated by a fixed (or some average) time interval; the shorter this time interval, the higher the delay and energy dissipation overheads of the PM. This is because the DVFS method, which utilizes a DC-DC converter with multiple regulated output voltage levels and a PLL with multiple output frequencies, incurs non-negligible mode-transition latency and energy overheads. At the same time, the shorter this interval, the more responsive the PM is to changes in the workload.

Consider a scenario (see Figure 3.2(a)) whereby T_Q denotes the queuing time (i.e., the time spent by a task in a local flow queue, SQ), and T_P is the time interval between two consecutive SQ read operations by the SP. Furthermore, assume that the optimal DVFS assignment for task j_i (where i denotes the time index) is DVFS_1. Similarly, assume that task j_{i+1} runs optimally under DVFS_2, where DVFS_1 < DVFS_2 in terms of supply voltage and operating frequency values.
Figure 3.2: DPM approaches with DVFS: (a) the traditional approach, in which, after task j_{i+1} arrives at the SQ and task j_i finishes, the PM monitors the SQ state (T_Q), determines DVFS_2 (T_π), switches the operating point (T_TRAN), and then runs the task (T_P); (b) the proposed approach, in which the PM classifies task j_{i+1} and assigns it DVFS_2 while it is still queued, eliminating T_π.

A conventional DPM procedure works as follows. When task j_{i+1} arrives in the SQ, it must wait for a time T_Q before the processing of all prior tasks, including task j_i, is completed. At that time instance, the PM reads off the occupancy number of the SQ (i.e., the state of the SQ) and determines an optimal DVFS value for the processor. This step takes time T_π, which increases in proportion to the number of SPs and depends on the clock frequency of the PM subsystem. After making the DVFS selection, the PM commands the voltage regulator and the PLL to change the supply voltage level and operating frequency of the SP. This action takes time equal to T_TRAN = max(τ_TRAN, τ_PLL), where τ_TRAN is the transition time of the voltage regulator, and τ_PLL is the PLL lock re-acquisition time (or PLL output multiplexing time). Finally, the SP takes T_P time to complete the processing of task j_{i+1}. Consequently, the total time from when task j_{i+1} is first enqueued in the SQ until it is finally retired (completed) is equal to T_Q + T_π + T_TRAN + T_P.

The shortcoming of the conventional DPM procedure described above is the following. When the workload (the occupancy number of the SQ) changes, each core has to send an interrupt to the PM to request a DVFS adjustment for the corresponding core, which significantly increases the computational overhead of the PM in a multicore system with a large number of processor cores. Alternatively, the PM on a regular basis examines the state of the SQ in front of each core in order to determine the DVFS value for that core, and subsequently schedules a sequence of DVFS assignments for every processor core. Either approach creates a significant overhead. A key contribution of our work is that an incoming task is directly labeled with an optimal DVFS value through the Bayesian classification process while it is still in the SQ, as shown in Figure 3.2 (b). In this case, the total time from when task j_{i+1} is first enqueued in the SQ until it is finally retired (completed) is equal to T_Q + T_TRAN + T_P.

3.2.2 Related Work

Dynamic power management techniques based on machine learning [54] have been the subject of a number of recent investigations [86][26][89][71]. An adaptive power management technique based on machine learning was presented in [86], where the authors described a system that learns when to turn off functional blocks of the system based on different usage patterns, e.g., the history of active applications or the CPU utilization factor. In this model-based approach, system dynamics and user patterns are captured to choose power-saving actions.
The authors in [26] described a power management technique that employs a machine learning algorithm to choose an optimal policy from a set of power management policies available to a system. The proposed algorithm evaluates the performance of the policies during each idle period and decides which policy to adopt next. An automated approach to identifying a task-specific power management policy was proposed in [89], where a reinforcement-learning based operating system automatically learns which action to take for a specific workload given to a system. The authors applied the proposed technique to hard disk power management in a mobile device, enabling the operating system to record hard disk accesses and monitor I/O-related system parameters. The authors of [71] presented a machine learning approach to perform dynamic voltage scaling (DVS) on an integrated CPU core and on-chip L2 cache. The proposed approach identifies application phases at runtime and issues appropriate DVS commands. The DVS policy itself is derived through a learning process performed on a representative workload.

All of the above-mentioned power management approaches are based on machine learning techniques, where an agent (i.e., the power manager) is trained on a number of representative workloads or user patterns in order to learn the performance state of a target system for the purpose of taking a DVS or DVFS action. Unfortunately, little attention has been paid to power management policy optimization under a cost function and to the accurate classification of the performance state of the system. Furthermore, as explained previously, the aforesaid techniques are inefficient for multicore processor architectures due to the computational overhead of deriving an optimal policy for each processor core, exacerbated by the scheduling of a series of DVFS assignments for every processor.

3.3 Learning-based DPM Framework

3.3.1 Background on Supervised Learning

Supervised learning is an effective and practical technique for discovering relations and extracting knowledge in cases where a mathematical model of the problem may be too expensive to construct, or not available at all. Alternatively, it may be used to derive a self-improving decision-making strategy instead of making decisions based only on the currently perceived state of the system. The goal of supervised learning is to learn a mapping from x ∈ X to y ∈ Y, given training sets that consist of input and output pairs. Here X = {x_1, x_2, ..., x_n} denotes a set of inputs (a.k.a. input features), and Y = {y_1, y_2, ..., y_n} is a set of outputs (a.k.a. output measures). The input feature set contains quantifiable features of the system under consideration. The output measure can be a continuous value (called regression) or a class label of the input (called classification), which thus results in a numerical or categorical measure. If the output measure is numerical (categorical), then the learning becomes a regression (classification) problem. In this chapter, each output measure is labeled with a pre-defined class (e.g., a performance level). The learning is performed on a collection of training sets. Thus, training an agent (e.g., a PM) involves finding a mapping from input features to output measures so as to enable the agent to accurately predict the class of an output measure when a new input feature is given.
Figure 3.3 shows the concept of supervised learning, where the agent predicts the classes of output measures y_k when input features x_k are given after learning with the training sets, where k = 1, ..., n.

Figure 3.3: Concept of supervised learning. Training sets (x_i, y_i) map input features to output-measure classes; after learning, a new input feature x_k yields a predicted class for the output measure y_k.

Considering algorithms for supervised learning, there are a number of methods for classification, such as rule based learners, decision tree based learners, instance based learners, probability based learners, and kernel based learners. In the following, we give a brief description of each of these learners.

i) The rule based learner, e.g., RIPPER [20], builds a rule-set by adding rules, where rules are formed for various new conditions. Once a rule-set is constructed, classification is performed while improving its fit to the training data.

ii) In the decision tree based learner, e.g., C4.5 [67], nodes of the tree correspond to attributes while links correspond to possible attribute values. Once the tree is constructed, the classification is obtained by starting at the root of the tree and following the path to the leaf node corresponding to the input feature's values.

iii) The instance based learner, e.g., Nearest Neighbor [82], is a non-parametric inductive learning method that stores training data in a memory structure. The classification is based on reuse of the stored data in memory.

iv) The probability based learner, e.g., Bayesian classification, has been gaining popularity due to its performance in the category of filtering. New input features are classified using the Bayes rule, and the posterior probability is calculated such that classification becomes a simple matter of selecting the most probable class.

v) The kernel based learner, e.g., the support vector machine [21], selects a small number of critical boundary samples from each class and builds a linear discriminant function which separates the two classes of data. If linear separation is impossible, the kernel technique is used to project the training data into a higher-dimensional space so as to learn a classifier in that space.

In our problem setup, we have found that the probability based learner (i.e., the Bayesian classifier) is more efficient than the other methods since it can efficiently classify the output features corresponding to a new input feature into a finite number of class labels. The key to the speed of the classification step is the pre-computation of the prior and conditional probabilities during a training step.

3.3.2 Learning-based Power Management Framework

It is useful to describe how supervised learning can be adapted to the power management technique. Figure 3.4 presents the top-level structure of the proposed PM, which incorporates a Bayesian learning framework.

Figure 3.4: Structure of the proposed power manager. An extraction phase (feature extraction, measure extraction, training set collection) feeds a classification phase (classification, policy generation), which produces the DVFS settings.
Essentially, we aim to use supervised learning to enable the PM to automatically discover the relations between input features and output measures and to predict the processor's performance level (power dissipation and execution time per task) by using classification. The key functions implemented inside the PM are as follows:

- Feature extraction: choose the input features (i.e., characteristics of the tasks and the state of the SQ),
- Measure extraction: choose the output measures (i.e., the power dissipation and execution time of the tasks),
- Training set generation: assemble the input features and output measures into the training sets,
- Supervised learning: map the input features to the output measures based on the training sets, and
- Classification: select the most likely class given the input feature.

The proposed supervised learning-based DPM technique mainly comprises three parts: extraction, classification, and policy generation. The procedures for extraction and classification are explained next.

3.3.2.1 Input Feature and Output Measure Extraction

The first step is the extraction phase, which extracts input features and output measures; system knowledge is required to produce well-prepared training sets. During the process of feature extraction, in the context of the power management problem, the PM gathers input features such as the type of task (e.g., high-priority or low-priority), the state of the SQ, and the arrival rate of tasks, which affect the performance level of the SP. In addition, the PM observes performance-related information (e.g., the system power dissipation and the execution time of tasks) as the output measures. The class of each output measure, considered as an attribute, is a pre-defined level or range, such as a range of system power dissipations or of time durations for task execution.

Table 3.1: Example training set for the DPM problem.

    Input features                                     Output measures
    Task type       Queue occupancy   Arrival rate     Power    Execution time
    low-priority    med               low              pow_1    exe_1
    high-priority   low               low              pow_2    exe_1
    high-priority   med               med              pow_3    exe_3
    low-priority    low               med              pow_2    exe_3
    high-priority   low               med              pow_1    exe_1
    low-priority    med               med              pow_1    exe_2
    low-priority    med               med              pow_2    exe_2
    low-priority    med               med              pow_2    exe_2
    low-priority    med               high             pow_1    exe_1

Table 3.1 shows an example of a training set which consists of selected input feature and output measure pairs. Notice that the queue occupancy and the arrival rate of tasks are assigned attributes (i.e., low, med, or high), where low = [0 33%], med = (33% 67%], and high = (67% 100%] when applied to the SQ occupancy, and low = [0 0.33], med = (0.33 0.67], and high = (0.67 1] when applied to the arrival rate. Each output measure is labeled with a specific class from the set L. In our problem setup, the class set L is defined as L_1 = {pow_1, pow_2, pow_3}, where pow_1 < pow_2 < pow_3, and L_2 = {exe_1, exe_2, exe_3}, where exe_1 < exe_2 < exe_3.
Note that each class is defined as a range of values, e.g., pow_1 = [34mW 41mW], pow_2 = (41mW 47mW], pow_3 = (47mW 54mW], exe_1 = [14.1ns 21.5ns], exe_2 = (21.5ns 28.5ns], and exe_3 = (28.5ns 35.7ns]. Beyond our input features, the power dissipation and execution time may be determined by many other factors, including the cache hit/miss ratio, the cache hierarchy, and so on. However, the extent to which these factors impact the performance level of the SP is highly dependent on the system architecture or configuration, which is outside the scope of the present work. Thus, we choose for the input features a set of factors that are correlated with the resulting output measures. The training set size affects the accuracy of classification, i.e., the variance of the predicted value increases as the training set size is reduced, resulting in an increased bias. In this chapter, the training set size is determined by calculating a conditional probability while varying the set size, as described in the experimental results section.

3.3.2.2 Classification

Having obtained the training set, the second step is the classification phase, which uses supervised learning to train an accurate classifier. The classifier's goal is to organize a new input feature {x_1, x_2, ..., x_n} into a finite number of classes l from the set L for each one of the output features in the set {y_1, y_2, ..., y_n}. Specifically, in the Bayesian classifier, the classification task is essentially the assignment of the maximum a posteriori (MAP) class given the data x = (x_1, x_2, ..., x_n) and the prior of class assignments to y_i, obtained by maximizing the posterior probability Prob(y_i = l | x_1, x_2, ..., x_n) of assigning class l to output feature y_i given the new evidence x:

    y_MAP = \arg\max_l \, Prob(y_i = l \mid x_1, x_2, \ldots, x_n) = \arg\max_l \frac{Prob(x_1, x_2, \ldots, x_n \mid y_i = l) \cdot Prob(y_i = l)}{Prob(x_1, x_2, \ldots, x_n)}    (3.1)

The denominator Prob(x_1, x_2, ..., x_n), which is the marginal probability of witnessing the new evidence x under all possible hypotheses, is irrelevant for decision making since it is the same for every class assignment. Prob(y_i = l), which is the prior (pre-evidence) probability of the hypothesis that the class of y_i is l, is easily calculated from the training set. Hence, we only need Prob(x_1, x_2, ..., x_n | y_i = l), which is the conditional probability of seeing the input feature vector x given that the class of y_i is l. The factor Prob(x_1, x_2, ..., x_n | y_i = l) / Prob(x_1, x_2, ..., x_n) represents the impact of the new evidence x on the hypothesis that y_i = l. If it is likely that the evidence will be observed when this hypothesis is true, then this factor will be large. Note that multiplying the prior probability by this factor results in a large posterior probability of the hypothesis given the evidence. The Bayes theorem thus measures how much new evidence should alter belief in some hypothesis. Now Prob(x_1, x_2, ..., x_n | y_i = l) may be expanded as Prob(x_1 | x_2, ..., x_n, y_i = l) × Prob(x_2, x_3, ..., x_n | y_i = l). The second factor can be decomposed in the same way, and so on. Furthermore, we assume that all input features are conditionally independent given the class, i.e., Prob(x_1 | x_2, ..., x_n, y_i = l) = Prob(x_1 | y_i = l).
Therefore, we obtain Prob(x1, x2, ..., xn | yi = l) = ∏j Prob(xj | yi = l), and we compute the maximum a posteriori class as follows:

$$ y_{MAP} = \arg\max_l \mathrm{Prob}(y_i = l) \cdot \prod_{j=1}^{n} \mathrm{Prob}(x_j \mid y_i = l) \qquad (3.2) $$

When used in real applications, the Bayesian classifier first partitions the training set into several subdatasets by the class label of the target output measure. Then, in each subdataset labeled by l for output measure yi, the maximum likelihood (ML) estimate Prob(xj = ajk | yi = l) is given by the frequency njkl / nl, where njkl is the number of occurrences of the event {xj = ajk} in the subdataset denoted by class label l, and nl is the number of samples in the same subdataset. An example of how to classify the input features is given next. Suppose that we have a set of three input features and a set of two output measures as shown in Table 3.1, where {x1, x2, x3} = {task type, queue occupancy, arrival rate}, and {y1, y2} = {power dissipation, execution time}. We first compute the per-input-feature conditional probabilities required for the classification task. For the example training set, we have: Prob(x1 = low | y1 = pow1) = Prob(x1 = low | y1 = pow2) = 3/4, Prob(x1 = high | y1 = pow1) = Prob(x1 = high | y1 = pow2) = 1/4, and Prob(x1 = high | y1 = pow3) = 1. There may be some cases where particular input features do not occur together with an output measure due to an insufficient number of data points in the training set. In this case, a standard way to deal with zero conditional probabilities is to eliminate them by smoothing [54] as follows:

$$ \mathrm{Prob}(x_j \mid y_i = l) = \frac{freq(x_j, y_i = l) + \lambda}{freq(y_i = l) + \lambda n_x} \qquad (3.3) $$

where λ is a smoothing constant (λ > 0), and nx is the number of different attribute values of xj that have been observed. For the example training set, using equation (3.3) with λ = 1, we have: Prob(x1 = low | y1 = pow3) = Prob(x2 = med | y1 = pow3) = 1/4. We will also need the prior probabilities for the various output measure classes, which are calculated from the training set data. In this example, Prob(y1 = pow1) = Prob(y1 = pow2) = 4/9, and Prob(y1 = pow3) = 1/9. After calculating the conditional and prior probabilities, the PM can decide the best power management policy by predicting the MAP class for a new input feature vector. Let a new input feature (x1 = low, x2 = med, x3 = med), which was not in the training set, be presented to the PM, which classifies the input feature based on equation (3.2) as follows.
i) First, for the hypothesis y1 = pow1, the posterior score is Prob(y1 = pow1) · Prob(x1 = low, x2 = med, x3 = med | y1 = pow1) = (4/9)·(3/4)·(1/2)·(1) = 1/6, because Prob(x1 = low | y1 = pow1) = 3/4, Prob(x2 = med | y1 = pow1) = 1/2, and Prob(x3 = med | y1 = pow1) = 1.
ii) Second, for the hypothesis y1 = pow2, the posterior score is Prob(y1 = pow2) · Prob(x1 = low, x2 = med, x3 = med | y1 = pow2) = (4/9)·(3/4)·(1)·(1/4) = 1/12, because Prob(x1 = low | y1 = pow2) = 3/4, Prob(x2 = med | y1 = pow2) = 1, and Prob(x3 = med | y1 = pow2) = 1/4.
iii) Last, for the hypothesis y1 = pow3, the posterior score is Prob(y1 = pow3) · Prob(x1 = low, x2 = med, x3 = med | y1 = pow3) = (1/9)·(1/4)·(1/4)·(1) = 1/144, because Prob(x1 = low | y1 = pow3) = 1/4, Prob(x2 = med | y1 = pow3) = 1/4, and Prob(x3 = med | y1 = pow3) = 1.
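The arithmetic above is easy to mechanize. The following sketch (Python) hardcodes the priors and smoothed conditionals exactly as derived in the example and reproduces the three posterior scores.

```python
# Reproduce the worked MAP example for y1 (power dissipation). Priors and
# conditional probabilities are those derived from Table 3.1 in the text
# (with lambda = 1 smoothing where needed).

priors = {"pow1": 4/9, "pow2": 4/9, "pow3": 1/9}

cond = {  # cond[class] = [P(x1=low | y1), P(x2=med | y1), P(x3=med | y1)]
    "pow1": [3/4, 1/2, 1],
    "pow2": [3/4, 1,   1/4],
    "pow3": [1/4, 1/4, 1],
}

def posterior_score(label):
    """Unnormalized posterior of equation (3.2): prior times product of conditionals."""
    score = priors[label]
    for p in cond[label]:
        score *= p
    return score

scores = {l: posterior_score(l) for l in priors}
print(scores)                       # 1/6, 1/12, and 1/144 as floats
print(max(scores, key=scores.get))  # 'pow1', the MAP class
```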
Consequently, the MAP class of the power dissipation for the new input feature vector is pow1. Similarly, computing the MAP class of the execution time results in posterior scores for the hypotheses y2 = exe1, y2 = exe2, and y2 = exe3 of 1/24, 2/9, and 1/18, respectively. Thus, the PM concludes that the MAP class of the execution time is exe2. The PM predicts the MAP performance level of the processor when a new task arrives in the SQ. Classification based on the Bayesian classifier is robust to noisy and/or extraneous input features. It is also fast because it requires only a single pass through the training data to initialize the prior and conditional probabilities, and only a few multiplications and comparisons to determine the MAP performance level of the processor at runtime.
3.3.2.3 Discriminative Bayesian Classifier
As we have seen above, a Bayesian classifier assumes conditional independence among the input features. When used for classification, the Bayesian classifier predicts a new data point as the class with the highest posterior probability by writing the classification rule in a decomposable form using the conditional independence assumption (see equation (3.2)). A key advantage of the Bayesian classifier is its ability to deal with missing information during classification (i.e., missing input features that are relevant to the identification of output measures). For example, some information such as cache miss statistics or the branch mis-prediction rate, which affects the processor performance, is treated as missing input features in our problem setup. Let X denote the input feature set {x1, x2, ..., xn}. When the values of a subset M of X are unknown or missing, the marginalized inference can be obtained immediately as follows:

$$ y_{MAP} = \arg\max_l \mathrm{Prob}(y_i = l) \cdot \prod_{j \in X - M} \mathrm{Prob}(x_j \mid y_i = l) \qquad (3.4) $$

No further computation is needed to handle this missing-information problem, because each term Prob(xj | yi = l) has already been calculated in training the Bayesian classifier. However, there are shortcomings in this simple classifier. More precisely, this approach models the joint probability in each subset separately and then applies the Bayes rule to obtain the posterior classification rule. Consequently, this construction procedure - sometimes called a generative classifier - discards some discriminative information for classification [55]. Without considering the other classes of data, this method only tries to approximate the information within each subdataset. On the other hand, a discriminative classifier, which directly estimates Prob(yi | xj), preserves inter-subdataset information well by directly constructing decision rules among all available data. Therefore, the Bayesian classifier may be extended to provide a global scheme that preserves the discriminative information among all the data.
3.4 Extraction Strategy
3.4.1 Extracting Input Features
Input feature selection plays an important role in the classification procedure, which maps input features onto output measures. Some relevant input features carry important information regarding the output measures, whereas irrelevant ones contain little such information. Finding every input feature that contains relevant information about the resulting output measure is difficult and in many cases unnecessary.
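Note that the marginalization rule of equation (3.4) drops straight into the scoring sketch shown earlier: the unobserved features simply contribute no factor. A minimal, self-contained illustration (Python) follows.

```python
# Marginalizing over missing features per equation (3.4): skip the conditional
# factors of the unobserved features. The probability tables are those of the
# worked example; 'missing' holds the indices of the unobserved x_j.

priors = {"pow1": 4/9, "pow2": 4/9, "pow3": 1/9}
cond = {"pow1": [3/4, 1/2, 1], "pow2": [3/4, 1, 1/4], "pow3": [1/4, 1/4, 1]}

def posterior_score_partial(label, missing=()):
    score = priors[label]
    for j, p in enumerate(cond[label]):
        if j not in missing:      # observed features contribute a factor;
            score *= p            # missing ones are marginalized out
    return score

# Suppose x2 (queue occupancy) could not be observed for this task:
scores = {l: posterior_score_partial(l, missing=(1,)) for l in priors}
print(max(scores, key=scores.get))  # still 'pow1' for this example
```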
The PM gathers available information on input features (e.g., types of tasks, state of the SQ, and arrival rate of tasks) as explained in the previous section. At the same time, the PM needs to watch for missing input features (e.g., the amount of cache interference) which affect the performance-related output measures as well. There are two approaches to compensating for missing input features [74]: the input feature-compensate method and the classification-compensate method. The first approach estimates the values of hidden input features by using the expectation-maximization (EM) algorithm [25] and then performs classification on the complete input features. Note that the EM algorithm is a general technique that can be used to determine the maximum likelihood estimate (MLE) of the parameters of an underlying distribution from given data when the measured data is incomplete. The second approach passes the incomplete input features directly to the classifier, which is then adjusted to operate on the incomplete input features. A brief description of each method follows.
3.4.1.1 Input Feature-Compensate Method
Let x denote the known (measured) input feature and let m denote the missing input feature. Together x and m form the complete input feature. Notice that m can be a hidden source of variation that affects the output measures. Then we have Prob(x, m | θ), the joint probability density function of the complete input features with parameters given by the vector θ (θ may, for example, correspond to the mean value and variance of a Gaussian distribution). This function can be considered the complete-data likelihood, that is, it can be thought of as a function of θ and expressed as

$$ \mathrm{Prob}(x, m \mid \theta) = \mathrm{Prob}(m \mid x, \theta) \cdot \mathrm{Prob}(x \mid \theta) \qquad (3.5) $$

by using the Bayes rule. The EM algorithm iteratively improves an initial estimate θ0 by constructing new estimates θ1, θ2, etc., where the individual re-estimation step that derives θn+1 from θn takes the following form:

$$ \theta_{n+1} = \arg\max_{\theta} Q(\theta) \qquad (3.6) $$

where Q(θ) is the expected value of the log-likelihood of the complete input feature. Since we do not know the complete data, we cannot determine the exact value of the likelihood, but given the input feature x, we can calculate a posteriori estimates of the probabilities for the various values of m. For each set of m values there is a likelihood value for θ, and we can hence calculate an expected value of the likelihood with the given values of x. Q is given by

$$ Q(\theta) = E_m\!\left[\log \mathrm{Prob}(x, m \mid \theta) \,\middle|\, x\right] \qquad (3.7) $$

where it is understood that this denotes the conditional expectation of log Prob(x, m | θ), taken with the θ used in Prob(m | x, θ) fixed at θn. In other words, θn+1 is the value that maximizes the conditional expectation of the log-likelihood of the complete input feature given the measured variables under the previous parameter values. The expectation Q(θ) may be rewritten as:

$$ Q(\theta) = \int_{-\infty}^{\infty} \mathrm{Prob}(m \mid x) \log \mathrm{Prob}(x, m \mid \theta)\, dm \qquad (3.8) $$

These two steps (Expectation and Maximization) are repeated until |θn+1 - θn| ≤ ω, where ω is some user-specified tolerance level [8]. It can be shown that the EM iteration does not decrease the measured input feature likelihood function. The EM algorithm finds the θ that maximizes the complete-input-feature likelihood, which in turn removes the effect of the hidden variables (i.e., the missing input features).
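To make the E- and M-steps concrete, the toy sketch below runs EM on a one-dimensional two-component Gaussian mixture, treating the component label as the hidden quantity m. The data, component count, and initial θ are illustrative, not taken from this work.

```python
# Toy EM: fit a 2-component 1-D Gaussian mixture, where the component label
# plays the role of the hidden variable m. theta = (weights, means, variances).
import math, random

random.seed(0)
data = [random.gauss(36, 2) for _ in range(50)] + \
       [random.gauss(50, 2) for _ in range(50)]   # synthetic, illustrative

w, mu, var = [0.5, 0.5], [30.0, 55.0], [4.0, 4.0]  # initial theta_0

def normal_pdf(x, m, v):
    return math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)

for step in range(100):
    # E-step: posterior responsibility Prob(m = k | x, theta_n) for each point.
    resp = []
    for x in data:
        p = [w[k] * normal_pdf(x, mu[k], var[k]) for k in (0, 1)]
        s = sum(p)
        resp.append([pk / s for pk in p])
    # M-step: re-estimate theta to maximize Q(theta) of equation (3.7).
    new_mu = []
    for k in (0, 1):
        nk = sum(r[k] for r in resp)
        mu_k = sum(r[k] * x for r, x in zip(resp, data)) / nk
        var[k] = sum(r[k] * (x - mu_k) ** 2 for r, x in zip(resp, data)) / nk
        w[k] = nk / len(data)
        new_mu.append(mu_k)
    if max(abs(a - b) for a, b in zip(new_mu, mu)) <= 1e-6:  # |theta_{n+1} - theta_n| <= omega
        mu = new_mu
        break
    mu = new_mu

print([round(m, 1) for m in mu])  # converges near the true means (36, 50)
```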
3.4.1.2 Classification-Compensate Method
In this method, the incomplete input features are used directly for classification. Every input feature x is assigned a probability α indicating how reliable and critical it is for the output measure. Likewise, each of the missing input features is assigned the probability (1 - α). Assuming that all measured input features and missing input features are independent, the total likelihood of each input feature simply becomes a weighted sum of the likelihoods of the input features. This can be expressed as

$$ \mathrm{Prob}(x_1, x_2, \ldots, x_n \mid y_i = l) = \prod_{j=1}^{n} \left( \alpha \cdot \mathrm{Prob}(x_j \mid y_i = l) + (1 - \alpha) \cdot \mathrm{Prob}(m_j \mid y_i = l) \right) \qquad (3.9) $$

where y is the output measure and l is the class, provided that we have the missing input features m = (m1, m2, ..., mn). In practice, we substitute (3.9) into (3.2) to compute the maximum a posteriori (MAP) class during classification.
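A compact sketch of this weighting follows (Python); the α value and the missing-feature probability terms below are illustrative placeholders, not values from this work.

```python
# Classification-compensate method of equation (3.9): each feature's factor is
# an alpha-weighted blend of the measured-feature likelihood and the
# missing-feature likelihood. Missing-feature terms here are illustrative.

alpha = 0.95

cond_x = {"pow1": [3/4, 1/2, 1], "pow2": [3/4, 1, 1/4], "pow3": [1/4, 1/4, 1]}
cond_m = {l: [1/3, 1/3, 1/3] for l in cond_x}   # e.g., uninformative missing terms
priors = {"pow1": 4/9, "pow2": 4/9, "pow3": 1/9}

def blended_likelihood(label):
    """Product over features of alpha*P(x_j | y=l) + (1-alpha)*P(m_j | y=l)."""
    score = 1.0
    for px, pm in zip(cond_x[label], cond_m[label]):
        score *= alpha * px + (1 - alpha) * pm
    return score

scores = {l: priors[l] * blended_likelihood(l) for l in priors}
print(max(scores, key=scores.get))   # MAP class under the blended likelihood
```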
3.4.2 Extracting Output Measures
Modern processors include hardware features for monitoring performance characteristics of the processor [46], which enables the PM to collect performance-related information. When an application runs by itself on a single-core processor system, the resources in that system are dedicated to its execution, so it is relatively easy to characterize and model the resulting application performance behavior. However, when multiple applications run simultaneously on a multicore processor, it is comparatively difficult to determine the resources that end up being given to each individual application, which means that the performance behavior of each application on the multicore processor may not be measured accurately. Thus, the PM is forced to observe the output measure in a probabilistic way. Let r denote an input feature state (ri, i = 1, ..., h), where state r corresponds to a particular assignment of attributes to the input features (x1, x2, ..., xn). Let o denote an observation, which corresponds to output measures (y1, y2, ..., yn) with various classes. Figure 3.5 (a) illustrates observations for each output measure given an input feature state. Note that oy1(r1) represents the observation o in y1 (output measure) given the input feature state r1. For example, the power dissipation (oy1) of a processor given an input feature state r1 (e.g., low-priority task, medium queue occupancy, and high arrival rate of tasks) is normally distributed with a mean of 38mW and a variance of 2, i.e., N(38, 2).

Figure 3.5: (a) Observations for each output measure, and (b) decision boundaries for an output measure among various input feature states.

For accurate classification, the decision boundaries of the output measure in the Bayesian classifier have to coincide with or be close to the performance specification criteria or boundaries. Figure 3.5 (b) shows an example of decision boundaries for an output measure (e.g., oy1) among various input feature states (e.g., r1, r2, and r3), where our goal is to find the distinction points δ1 and δ2. By doing so, we can define each class as a range of values, as explained before. Let fr1, fr2, and fr3 denote the probability density functions of the output measure for the input feature states r1, r2, and r3, respectively. Based on the illustration in Figure 3.5 (b), δ1 and δ2 are determined from the following:

$$ \int_{\delta_1}^{\infty} fr_1(x)\, dx = \int_{-\infty}^{\delta_1} fr_2(x)\, dx \qquad (3.10a) $$
$$ \int_{\delta_2}^{\infty} fr_2(x)\, dx = \int_{-\infty}^{\delta_2} fr_3(x)\, dx \qquad (3.10b) $$

Assuming normal distribution functions for the output measure in our problem setup, we can rewrite (3.10a) and (3.10b) as:

$$ \int_{\delta_1}^{\infty} \frac{1}{\sqrt{2\pi}\,\sigma_a} e^{-\frac{(x-\mu_a)^2}{2\sigma_a^2}}\, dx = \int_{-\infty}^{\delta_1} \frac{1}{\sqrt{2\pi}\,\sigma_b} e^{-\frac{(x-\mu_b)^2}{2\sigma_b^2}}\, dx \qquad (3.11a) $$
$$ \int_{\delta_2}^{\infty} \frac{1}{\sqrt{2\pi}\,\sigma_b} e^{-\frac{(x-\mu_b)^2}{2\sigma_b^2}}\, dx = \int_{-\infty}^{\delta_2} \frac{1}{\sqrt{2\pi}\,\sigma_c} e^{-\frac{(x-\mu_c)^2}{2\sigma_c^2}}\, dx \qquad (3.11b) $$

where µa, µb, and µc are the mean values of the output measure for the input feature states, and σa, σb, and σc are their standard deviations. Solving these integral equations, we obtain:

$$ \delta_1 = \frac{\mu_a \sigma_b + \mu_b \sigma_a}{\sigma_a + \sigma_b}, \qquad \delta_2 = \frac{\mu_b \sigma_c + \mu_c \sigma_b}{\sigma_b + \sigma_c} \qquad (3.12) $$

Table 3.2 shows examples of the decision boundaries for various probability density functions of the output measure (i.e., power dissipation) under varying standard deviations, where oy1(r1) = N(µa, σa²), oy1(r2) = N(µb, σb²), and oy1(r3) = N(µc, σc²). To simplify the comparison, we assume that the mean values of the output measure are fixed (e.g., µa = 37.5, µb = 44.0, µc = 50.5).

Table 3.2: Examples of decision boundaries.

  case  σa   σb   σc   δ1    δ2    DI1   DI2
  (a)   2.0  3.0  1.5  40.1  48.3  1.30  1.44
  (b)   3.0  1.4  3.0  41.9  46.0  1.47  1.47
  (c)   1.5  3.0  3.0  39.6  47.3  1.44  1.08

Without loss of generality, we assume µb > µa. Next we introduce the "distinction index (DI)" [27] as the performance criterion for boundary selection in the output measure:

$$ DI = \frac{\mu_b - \mu_a}{\sigma_b + \sigma_a} \qquad (3.13) $$

which indicates that the larger the value of DI, the better the distinction between the output measures. For example, in case (c), DI1, which represents the distinction between oy1(r1) and oy1(r2), is 1.44, which is greater than DI2 (between oy1(r2) and oy1(r3)). This indicates that we can achieve better classification accuracy when we are given input feature states r1 and r2 rather than r2 and r3. To ensure high accuracy in classification, the selection of distinction points has to be considered in establishing the discriminant function of the classifier.
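Equations (3.12) and (3.13) are cheap to evaluate at runtime. The sketch below (Python) reproduces case (a) of Table 3.2.

```python
# Closed-form distinction points (3.12) and distinction index (3.13).
mu_a, mu_b, mu_c = 37.5, 44.0, 50.5          # fixed means from Table 3.2

def boundary(mu1, sd1, mu2, sd2):
    """Equal-tail crossing point of two normals, equation (3.12)."""
    return (mu1 * sd2 + mu2 * sd1) / (sd1 + sd2)

def distinction_index(mu1, sd1, mu2, sd2):
    """Equation (3.13): a larger DI means better separation."""
    return (mu2 - mu1) / (sd1 + sd2)

sd_a, sd_b, sd_c = 2.0, 3.0, 1.5             # case (a)
print(round(boundary(mu_a, sd_a, mu_b, sd_b), 1),           # 40.1 (delta_1)
      round(boundary(mu_b, sd_b, mu_c, sd_c), 1),           # 48.3 (delta_2)
      round(distinction_index(mu_a, sd_a, mu_b, sd_b), 2),  # 1.30 (DI_1)
      round(distinction_index(mu_b, sd_b, mu_c, sd_c), 2))  # 1.44 (DI_2)
```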
3.5 Power Management Policy
Finding an optimal power management policy in a learning-based framework requires an autonomous decision-making strategy that maps the output classes to actions. The actions commanded by the PM change the performance state of the system and lead to quantifiable penalties (or rewards). We consider the case where an action incurs a cost (e.g., energy dissipation), and the PM's goal is to devise a policy for issuing commands that minimizes this expected cost. Assume that the target processor system has k (power-delay, or PD for short) states denoted by s1, ..., sk, where s1 < ... < sk in terms of the PD product (PDP) in the respective states. The PM can choose an action from a finite set of supply voltage-clock frequency (VF) settings A = {a1, ..., an}, where a1 < ... < an in terms of the VF values (notice that a lower V requires a correspondingly lower F for the processor while a higher V allows a higher F, hence VF pairs may be considered a single optimization variable in this setup). There is a state transition probability for transitioning from state s to another state s' after executing an action a, i.e., T(s', a, s) = Prob(s' | a, s). Furthermore, we make the common assumption that the cost function is additive (the PDP, which is the same as energy dissipation, is clearly additive). Considering the minimization of the total energy dissipation as the objective, we define the energy dissipation of the system at a given time t as follows. First, assume that the predicted classes for the output measures (i.e., power dissipation and execution time) are p and d, where p ∈ L1 and d ∈ L2 as defined in our problem setup. Note that p and d may be considered ranges of power and execution time values, i.e., p = [p− p+] and d = [d− d+]. Then the expected cost of the current state, C(s, a), where a is the action prescribed by the PM in state s = <p, d>, lies within a specific range such that

$$ C(s, a) \in \left[\, p^- \cdot d^- + e(s, a),\;\; p^+ \cdot d^+ + e(s, a) \,\right] \qquad (3.14a) $$

where e(s, a) is the expected energy dissipation to transition from state s to some next state under action a, which is in turn calculated from T(s', a, s) and the state transition energy dissipation overhead. The above expression means that the cost lies between the expected minimum and maximum costs. To obtain a scalar cost function, we define:

$$ C(s, a) = \frac{p^- \cdot d^- + p^+ \cdot d^+}{2} + e(s, a) \qquad (3.14b) $$

We develop a policy generation technique by using the well-known dynamic programming method, making use of the principles of overlapping subproblems, optimal substructure, and memoization. We speak of the minimum cost of a system state as the expected infinite discounted sum of cost that the system will accrue if it starts in that state and executes the optimal policy. Generally, using π as a decision policy, this minimum cost is written as

$$ \Psi^*(s) = \min_{\pi} E\!\left( \sum_{t=0}^{\infty} \gamma^t \cdot c(t) \right) \qquad (3.15a) $$

where γ is a discount factor with 0 ≤ γ < 1, and c(t) is the cost at time t. In our problem setup, the minimum cost function is unique and can be defined as

$$ \Psi^*(s) = \min_{a} \left( C(s, a) + \gamma \sum_{s' \in S} T(s', a, s)\, \Psi^*(s') \right), \quad \forall s \in S \qquad (3.15b) $$

which asserts that the cost of a state s is the expected instantaneous cost plus the expected discounted cost of the next state, using the best available action. From Bellman's principle of optimality [5], given the optimal cost function, we specify the optimal policy as

$$ \pi^*(s) = \arg\min_{a} \left( C(s, a) + \gamma \sum_{s' \in S} T(s', a, s)\, \Psi^*(s') \right) \qquad (3.16) $$

Simply stated, the power manager determines the optimal action based on equation (3.16) at each event occurrence (i.e., at decision epochs). The task of casting the decision epochs to absolute time units is left to the system developer. Unlike AC line-powered systems, we focus on battery-operated systems that strive to conserve energy to extend battery life. Given C(s, a) and T(s', a, s), another way to find an optimal policy is to find the minimum cost function. It can be determined by an iterative algorithm (cf. Figure 3.6) called value iteration, which can be shown to converge to the correct Ψ* values. It is not obvious when to stop this algorithm.
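As a runnable companion to the recursion in (3.15b) and the pseudocode of Figure 3.6 below, here is a minimal sketch (Python). The two-state, two-action costs and transition probabilities are hypothetical toy values, and the stopping test uses a simple residual threshold, anticipating the bound discussed next.

```python
# Minimal value iteration for equations (3.15b)/(3.16). States, actions,
# costs C[s][a], and transitions T[(s2, a, s1)] are made-up toy values.
gamma = 0.9
S, A = ["s1", "s2"], ["a1", "a2"]
C = {"s1": {"a1": 1.0, "a2": 2.0}, "s2": {"a1": 3.0, "a2": 0.5}}
T = {("s1", "a1", "s1"): 0.8, ("s2", "a1", "s1"): 0.2,   # Prob(s' | a, s)
     ("s1", "a2", "s1"): 0.3, ("s2", "a2", "s1"): 0.7,
     ("s1", "a1", "s2"): 0.5, ("s2", "a1", "s2"): 0.5,
     ("s1", "a2", "s2"): 0.1, ("s2", "a2", "s2"): 0.9}

psi = {s: 0.0 for s in S}                      # initialize arbitrarily
while True:
    q = {s: {a: C[s][a] + gamma * sum(T[(s2, a, s)] * psi[s2] for s2 in S)
             for a in A} for s in S}
    new_psi = {s: min(q[s].values()) for s in S}
    if max(abs(new_psi[s] - psi[s]) for s in S) < 1e-6:  # residual-based stop
        psi = new_psi
        break
    psi = new_psi

policy = {s: min(A, key=lambda a: q[s][a]) for s in S}   # equation (3.16)
print(psi, policy)   # converged costs and the state-to-action mapping table
```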
A key result bounds the performance of the current greedy policy as a function of the Bellman residual of the current cost function.

Figure 3.6: The value iteration algorithm.
  1: initialize Ψ(s) arbitrarily
  2: loop until a stopping criterion is met
  3:   loop for ∀s ∈ S
  4:     loop for ∀a ∈ A
  5:       Q(s, a) = C(s, a) + γ Σ_{s'∈S} T(s', a, s) Ψ(s')
  6:     end loop
  7:     Ψ(s) = min_a Q(s, a)
  8:   end loop
  9: end loop

Results of the policy generation are stored in a state-action mapping table so that the PM does not need to compute the optimal action in each system state at runtime. Instead, optimal action generation is reduced to a simple table lookup. In practice, the PM examines the input features each time a new task arrives in the SQ, estimates the most likely state of the system, and looks up and issues the corresponding "optimal" action from the mapping table.
3.6 Experimental Results
3.6.1 Experimental Setup
We apply the proposed DPM technique to a multicore network processor which includes a dynamic load balancing (DLB, a.k.a. Application Delivery Controller or ADC) block and four processing cores (cf. Figure 3.1). The DLB block, which guarantees in-order delivery of tasks, enables tasks from a single network interface to be processed in parallel on multiple cores. There are various ways to distribute incoming tasks (a.k.a. connections or requests) to cores (a.k.a. back-end service hosts or servers), including the following methods [72]:
- Least workload: assigns the task to the host with the least workload (fewest connections),
- Fastest host: assigns the task to the core that currently has the best performance,
- Observed performance: assigns the task to the core that has the highest performance rating, based on a combination of least workload and best response time,
- Predictive method: assigns the task to the core that has the highest predicted performance rating over time, and
- Dynamic ratio: determines the capabilities of the cores to create a dynamic performance ratio accounting for host affinity to a connection and the resultant cache locality; the tasks are then distributed to the cores based on this ratio.
Among these, we consider RSS (receiver-side scaling) [101], which falls in the category of dynamic ratio techniques. (In the current world of high-speed networking, where multiple processing cores reside within a single system, the ability of the networking protocol stack of the operating system to scale well on a multi-core system is inhibited because the architecture of the conventional Network Driver Interface Specification (NDIS 5.1 and earlier versions) limits receive protocol processing to a single core. Receive-Side Scaling (RSS) resolves this issue by allowing the network load from a network adapter to be balanced across multiple cores.) The RSS technique is capable of re-balancing the received processing load across multiple processor cores while maintaining in-order delivery of the data. RSS enables in-order packet delivery by ensuring that packets for a single connection are always processed by one processor. This RSS feature requires that the network adapter examine each packet header and then use a hashing function to compute a signature for the packet. To ensure that the load is balanced across the cores, the hash result is used as an index into an indirection table. Because the indirection table contains the specific core that is to run the associated deferred procedure call, and because the host protocol stack can change the contents of the indirection table at any time, the host protocol stack can dynamically balance the processing load on each core. As a typical application, we consider executing TCP/IP-related tasks (e.g., TCP segmentation and checksum offloading [102]) on a multi-core processor.
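The hashing-plus-indirection mechanism just described can be sketched in a few lines (Python). Note that the hash choice and table size below are purely illustrative; a real RSS-capable NIC uses the Toeplitz hash specified for NDIS, not the simple checksum used here.

```python
# Toy RSS-style dispatch: hash a connection 4-tuple, then use the hash as an
# index into an indirection table that names the servicing core. The host
# stack may rewrite the table at any time to re-balance the load.
import zlib

NUM_CORES = 4
indirection_table = [i % NUM_CORES for i in range(128)]

def core_for_packet(src_ip, src_port, dst_ip, dst_port):
    key = f"{src_ip}:{src_port}->{dst_ip}:{dst_port}".encode()
    signature = zlib.crc32(key)                 # per-packet signature
    return indirection_table[signature % len(indirection_table)]

# All packets of one connection hash identically, so they stay on one core,
# which preserves in-order processing for that connection.
print(core_for_packet("10.0.0.1", 5000, "10.0.0.2", 80))
print(core_for_packet("10.0.0.1", 5001, "10.0.0.2", 80))  # may pick another core
```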
For the simulation setup, we analyzed the performance characteristics of each processor core in terms of power dissipation and execution time. We relied on a detailed gate-level realization of the processor with the TSMC 65nmLP library to precisely capture the power-saving opportunities within a core. By varying the voltage and frequency values during the simulation, we obtained power and delay numbers with Power Compiler for the core after running the same tasks. Furthermore, we utilized a back-annotated SAIF (Switching Activity Interchange File), which captures switching activity factors with test patterns, based on the RTL simulation to achieve accurate power numbers. We defined a set of three actions, i.e., a1 = [1.00V, 150MHz], a2 = [1.08V, 200MHz], and a3 = [1.20V, 250MHz] for simplicity, assuming that the voltage of the core is determined based on the operating frequency.
3.6.2 Detailed Results
In the first experiment, we generated a training set by running a set of tasks on the processor core as follows. First, we considered a scenario whereby the core accepts two types of tasks: low-priority and high-priority, where a high-priority task can move ahead of all low-priority tasks waiting in the queue. Next, we defined a set of input features {type of task, occupancy state of the SQ, arrival rate of tasks} and output measures {power dissipation [mW], execution time [ns]}, similar to Table 3.1. During the training phase, voltage and frequency values are assigned to the processor core based on simple requirements such as the following (see the sketch after this list):
- The core runs faster when high-priority tasks with medium or high arrival rates arrive under low or medium queue occupancy,
- The core runs slower when low-priority tasks with low or medium arrival rates arrive under medium or high queue occupancy.
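A literal encoding of these two training-phase rules might look as follows (Python); the fallback action for cases not covered by the two stated rules is an assumption made here purely for illustration.

```python
# Training-phase VF assignment rules from the text. Cases outside the two
# stated rules fall back to the middle setting a2 (an assumption made for
# this sketch only).
ACTIONS = {"a1": (1.00, 150), "a2": (1.08, 200), "a3": (1.20, 250)}  # (V, MHz)

def assign_vf(priority, arrival, occupancy):
    if priority == "high" and arrival in ("med", "high") and occupancy in ("low", "med"):
        return "a3"   # run faster
    if priority == "low" and arrival in ("low", "med") and occupancy in ("med", "high"):
        return "a1"   # run slower
    return "a2"       # assumed default

print(assign_vf("high", "high", "low"))   # a3
print(assign_vf("low", "low", "high"))    # a1
```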
Figure 3.7 shows various input features during the training phase, whereas Figure 3.8 depicts the corresponding output measures for 100 training sets. Note that profiling an output measure (e.g., power dissipation) at runtime is feasible with the support of specific hardware such as external current sensors or internal architectural counters for each core. The supply voltage for each core is isolated from the supply voltages of the other cores by a power-gating transistor [13], which operates as a simple on/off switch. An external current sensor [103], supplied by a voltage regulator which also provides power to the corresponding core, enables online current measurement; the measured current is accumulated in a current accumulator, digitally multiplied by the voltage value, and fed into a power dissipation accumulator.

Figure 3.7: Input features during training phase.

On the other hand, internal architectural counters used to compute the power consumed by the cores count a number of relevant events and appropriately weight the counter values. For example, the total numbers of load/store instructions, arithmetic/logic instructions, floating-point operations, and retirement executions for each core are counted and summed after being multiplied by appropriate weights [80].

Figure 3.8: Output measures during training phase.

Figure 3.9: Probability density functions for power dissipation.

The decision boundaries for an output measure are obtained as follows. First, we assign various labels to the input features based on our simulation results. After running a number of simulations, we derive probability density functions for the power consumption of the core (cf. Figure 3.9) for three observations: o1 = N(35.8, 2.2), o2 = N(44.2, 3), and o3 = N(50.5, 1.8). Next, the two separation points between neighboring observations are calculated as δ1 = 39.4 and δ2 = 48.1. The minimum power (30.3mW) and maximum power (56.0mW) consumption values for the active mode of processor core operation are used as the lower and upper bounds of the power dissipation range. The decision boundaries for the execution delay are obtained in a similar manner. Consequently, the classes of output measures are defined according to Table 3.3.

Table 3.3: Classes of output measures.

  Power dissipation (mW)      Execution time (ns)
  pow1 = [30.0 39.4]          exe1 = [14.1 21.5]
  pow2 = (39.4 48.1]          exe2 = (21.5 28.5]
  pow3 = (48.1 56.0]          exe3 = (28.5 35.7]

To quantify the accuracy of classification, we define the classification error [64][68] as follows. The error in classification is calculated as

$$ ER = \int L(f(x), y)\, \mathrm{Prob}(x, y)\, dx\, dy \qquad (3.17) $$

where f(x) denotes the predicted output measure while y is the actual output measure, and L(·,·) is a general loss function. Here, we use a 0-1 loss function, i.e.,

$$ L(f(x), y) = \begin{cases} 0 & \text{if } y = f(x) \\ 1 & \text{otherwise} \end{cases} \qquad (3.18) $$

where f(x) = arg max_Y Prob(Y | X = x) in this case. The class-conditional classification accuracy is then given by 1 - ER; it is a measure of the performance of the classifier. Considering the input feature that we used as an example in section 3.3, the accuracy reaches around 88% in classification. In addition, the training set size can greatly impact the classification accuracy, so we performed simulations to determine an appropriate size by varying the set size from 50 to 3000, as shown in Figure 3.10. We have thus empirically determined that a training set size of 1000 is adequate. Note that substantial reductions in training set size may be possible if interest is focused on a single class (e.g., only power dissipation) [28].

Figure 3.10: Selection of the training set size.
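The error integral (3.17) under the 0-1 loss (3.18) is what an empirical hold-out estimate approximates. A minimal sketch follows (Python; the stand-in predictor and labeled samples are hypothetical).

```python
# Empirical estimate of the classification error (3.17) under the 0-1 loss
# (3.18): the fraction of held-out samples whose predicted class differs
# from the true class. Predictor and samples are hypothetical placeholders.

def zero_one_loss(predicted, actual):
    return 0 if predicted == actual else 1     # equation (3.18)

def classification_error(predict, samples):
    """samples: list of (input_feature, true_class) pairs."""
    return sum(zero_one_loss(predict(x), y) for x, y in samples) / len(samples)

predict = lambda x: "pow1"                               # stand-in classifier
held_out = [(("low", "med", "med"), "pow1"),
            (("high", "low", "med"), "pow1"),
            (("high", "med", "med"), "pow3")]
er = classification_error(predict, held_out)
print(1 - er)   # class-conditional accuracy, 1 - ER
```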
The second experiment was designed to demonstrate the effectiveness of the proposed learning-based DPM framework. The PM first collects the training sets, which consist of 1,000 input feature and output measure pairs. Second, based on the classes of the output measures, the PM calculates the required prior and conditional probabilities. Next, we randomly generated 100 tasks that arrive into the queue of the corresponding processor. Classification was performed for each incoming task to determine the system state, i.e., the power and delay levels, used to generate the cost function (3.14b) as shown in Figure 3.11, where we compared the expected (normalized) cost while considering the overhead of energy dissipation due to power mode transitions. Note that the required energy (i.e., overhead) for frequency scaling is negligible compared to that of voltage scaling via a voltage regulator, which may range from several clock cycles to tens of milliseconds [53]. After predicting the system state, the PM looks up the pre-characterized state-to-action mapping table to determine the optimal action to issue, which guarantees (3.15).

Figure 3.11: Evaluation of cost functions for a given example.

We also investigated the effect of missing input features in the proposed DPM framework. We applied the classification-compensate method, where we varied the probability factor (α) in (3.9) to calculate the weighted sum of the likelihoods of the input features. To simplify the simulation setup, we assume that only the power dissipation is affected by the missing input feature.

Table 3.4: Different characteristics of training sets.

  Input feature          Training set A  Training set B  Training set C
  Priority      High     50%             20%             80%
                Low      50%             80%             20%
  Arrival rate  High     20%             20%             60%
                Med      60%             20%             20%
                Low      20%             60%             20%

Figure 3.12: Comparison of energy dissipations, where actions are commanded by a classifier based on different training sets.

It is worthwhile to consider a scenario whereby the characteristics of the tasks may change over time [17][1]. If the workload characteristics change over time, the performance of the classification can degrade. This is because, having relied on biased input features during the training phase, the classifier may not be able to correctly predict the output measure class of a given input feature. For example, consider the different sets of training data shown in Table 3.4. Suppose we train three classifiers based on training set A, set B, or set C. Next, we randomly generated 100 tasks and performed classification for each incoming task, followed by an optimal action for each task based on the classification result. Figure 3.12 shows the normalized energy dissipation resulting from the actions commanded by the three aforesaid classifiers.

Figure 3.13: Evaluation of energy dissipation for a given scenario.

Table 3.5: Normalized total energy dissipation for various classifiers.

  Training sets  A, B, C  A, B   A, C   B, C   A      B      C
  Energy         108.2    113.7  115.3  114.3  122.6  125.0  127.5

To validate the above statement, we considered a scenario whereby a classifier is trained based on some expected input characteristics but is subsequently used to classify input features with different characteristics. In particular, we first trained a classifier with training set B and used it to determine the output measure class of elements in set C (modeling the case whereby the input characteristics changed over time from those of set B to those of set C). Figure 3.13 shows the comparison in energy dissipation for 100 tasks between this case and one in which a classifier ("with update") was trained based on set C and then run on data with similar characteristics to those of set C. It is clearly seen that the classifier "with update" outperforms the one "without update". Finally, notice that we could have trained a better classifier by using data from all three training sets A, B, and C.
Table 3.5 shows the normalized total energy dissipation for 100 tasks by the various classifiers, where each classifier is trained with the specific training set(s). Finally, we investigated the energy efficiency of the proposed DPM technique. For comparison purposes, we implemented a baseline power management policy (denoted Greedy), as described below. We use three VF values to simplify the experimental setup (a1 < a2 < a3 in terms of VF values).
Greedy: Apply a greedy DPM assignment strategy which
- uses the lowest a1 value when 0 < the arrival rate of tasks ≤ 0.33 (i.e., low workload), and
- likewise, uses a2 and a3 when 0.33 < the arrival rate of tasks ≤ 0.67 and 0.67 < the arrival rate of tasks < 1, respectively.
Bayesian: Apply the optimal actions, based on the Bayesian learning-based DPM method described in this chapter.
Next, we generated a number of tasks by selecting the priority and arrival rate of tasks randomly, where 0 < the arrival rate of tasks < 1, and applied the above-mentioned DPM policies to the multicore processor described earlier. The simulation results in Figure 3.14, which report the (normalized) energy dissipation of each task for one processor, show that the proposed DPM technique, i.e., Bayesian, exhibits sizeable energy savings compared to the greedy technique. Note that, considering the performance of the processor, the overhead of performing classification in Bayesian is negligible since it does not affect the execution time of the processors (i.e., the classification and table lookup are performed during the queuing period before the VF change). In addition, we set the probability factor (α) in (3.9) for the classification-compensate method to 1.0, 0.95, and 0.90.

Figure 3.14: Energy dissipation comparison between a greedy DPM and the Bayesian learning-based DPM.

Experimental results in Table 3.6, which also reports the characteristics of the workload distribution for each processor (e.g., Proc1 receives 27 high-priority tasks and 23 low-priority tasks), demonstrate that, compared to the Greedy policy, the proposed Bayesian classification-based power management policy achieves system-wide energy (normalized) savings of up to 24.1% (the normalized average over the four processors when α = 1). It is also seen that if we account for the missing input feature (e.g., α = 0.95 and α = 0.90), there is little performance degradation; this still yields energy savings of up to 22.0% and 20.7% (averaged over the four processors), respectively.

Table 3.6: Energy savings in the multicore processor.

  Processor  Number of tasks            Greedy    Prob. (α)  Bayesian  Energy saving
             High-pri  Low-pri  Total   (Energy)             (Energy)  over Greedy
  Proc1      27        23       50      64.0      1.00       46.8      26.9%
                                                  0.95       49.2      23.1%
                                                  0.90       50.8      20.6%
  Proc2      52        48       100     128.4     1.00       97.8      23.8%
                                                  0.95       98.2      23.5%
                                                  0.90       99.8      22.3%
  Proc3      67        83       150     195.8     1.00       150.6     23.1%
                                                  0.95       154.3     21.2%
                                                  0.90       156.2     20.2%
  Proc4      103       97       200     255.5     1.00       197.6     22.7%
                                                  0.95       203.5     20.4%
                                                  0.90       205.4     19.7%
3.7 Summary
In this chapter we have addressed the problem of dynamic power management, where a system-level PM continually intervenes to exploit power-saving opportunities subject to performance requirements. The overhead associated with the regular activity of the PM in monitoring the workload of a system and making decisions about the power management of different functional blocks tends to undermine the overall power savings of DPM approaches. This chapter thus described a supervised learning-based DPM framework for a multicore processor, which enables the PM to predict the performance state of the system for each incoming task by a simple and efficient analysis of some readily available input features. Experimental results have demonstrated that the proposed DPM framework results in significant energy savings for various workloads in multicore processors.

Chapter 4 A Stochastic Local Hot Spot Alerting Technique

4.1 Introduction
With IC process geometries shrinking below the 65nm technology node and many applications requiring higher performance, thermal control is becoming a first-order concern. Furthermore, local hot spots, which have much higher temperatures than the average die temperature, are becoming more prevalent in VLSI circuits. Thus, identifying and removing heat from these hot spots is a major task facing design engineers concerned with circuit reliability. As reported in [33][12][81][58][100], the problem of thermal modeling and management has received a lot of attention. The work presented in [33] relies on a compact thermal model to achieve a temperature-aware design methodology. A thermal control mechanism used to cool the microprocessor's temperature is derived in [12]. Predictive thermal management [81], which exploits certain properties of multimedia applications, is an example of an online strategy for thermal management. In [58][100], design guidelines for power and thermal management of high-performance microprocessors are provided. Much of the past work has examined techniques for thermal modeling and management, but these techniques may be ineffective if the accuracy of identifying local hot spots is in question. This is because thermal models based on equivalent circuit models cannot adequately capture structures with complex shapes and boundary conditions, which in turn gives rise to uncertainty in identifying local hot spots. In particular, it is extremely difficult to obtain exact solutions of the heat transfer equations that arise from realistic die conditions. Furthermore, temperature sensors have difficulty measuring the actual peak power dissipation and the resulting peak temperature, which renders the problem of identifying local hot spots stochastic. In this chapter, we present a stochastic hot spot estimation technique, which alerts against thermal problems. Uncertainties in the temperature measurements and power state identification are modeled by using stochastic processes. Our proposed framework is based on the Kalman filter (KF) algorithm and the Markovian decision process (MDP) model, which enable the framework to predict the thermal behavior and power state of the system under variable, and uncertain, environmental conditions.
Note that the KF provides a technique for estimating the most probable state of a continuous-state system [42], while the MDP is a theory for modeling sequential decision problems in a discrete-state system [65]. The key rationale for utilizing the MDP and the KF for hot spot estimation is to manage uncertainty by combining discrete power state and continuous thermal state estimations, respectively. The remainder of this chapter is organized as follows. Section 4.2 provides some preliminaries, while section 4.3 describes the details of the uncertainty-aware estimation framework. Section 4.4 presents a hot spot alerting algorithm. Experimental results and a summary are given in sections 4.5 and 4.6.
4.2 Preliminaries
An integrated circuit (device) is typically allowed to operate when the ambient air temperature, TA, surrounding the device package is within the range of 0 °C to 70 °C [75]. The package can be characterized thermally by a thermal resistance. The value of the thermal resistance determines the temperature rise of the junction above a reference point by θJX = (TJ - TX) / P, where θJX is the thermal resistance from the device junction to the specific environment (°C/W), TJ is the device junction temperature (°C), TX is the reference temperature for a specified environment (°C), and P is the device power dissipation (W). If the reference temperature is denoted as TA (i.e., the temperature of the ambient), TB (i.e., the temperature of the PCB board), or TC (i.e., the temperature of the case top), then the thermal resistances for junction-to-air, θJA, junction-to-board, θJB, and junction-to-case, θJC, may be calculated as

$$ \theta_{JA} = \frac{T_J - T_A}{P}, \qquad \theta_{JB} = \frac{T_J - T_B}{P}, \qquad \theta_{JC} = \frac{T_J - T_C}{P} \qquad (4.1) $$

As illustrated in Figure 4.1, heat is dissipated from the die into the ambient primarily through either the package bottom surface or its top surface, where a PBGA (Plastic Ball Grid Array) plus HS (Heat Spreader) package is used. The arrows in this figure indicate the direction of heat flow. The two heat dissipation paths are graphically plotted in Figure 4.2, where the package is thermally represented with a two-resistance model, one resistance corresponding to the heat transfer from the device junction to the package bottom surface, θJB, and the other corresponding to the heat transfer from the device junction to the package top surface, θJC. The thermal resistances external to the package include θBS, θBA, and θCA, which are determined by the thermal design of the target system. For example, if there is no heat sink attached to the package in the system, the surface-to-air thermal resistances, θBA and θCA, can be estimated from

$$ \theta_{BA} = \theta_{CA} = \frac{1}{h_s A_s} \qquad (4.2) $$

where hs and As denote the heat transfer coefficient and the exposed surface area of the package, respectively [99]. Note that θBS is the PCB board spreading thermal resistance, which is influenced by the component population on the board.

Figure 4.1: Heat flow in the Plastic Ball Grid Array plus Heat Spreader package.

Figure 4.2: One of the IC package heat transfer paths and the corresponding thermal resistive model.
Using the above-mentioned models, the package junction-to-air thermal resistance can be calculated from the following (there is an analogy to Ohm's law for electrical circuits, where temperature replaces voltage and power dissipation replaces current):

$$ \theta_{JA} = \left( \frac{1}{\theta_{JB} + \theta_{BS} + \theta_{BA}} + \frac{1}{\theta_{JC} + \theta_{CA}} \right)^{-1} \qquad (4.3) $$

Then, the junction temperature can be estimated with

$$ T_J = T_A + \theta_{JA} \cdot P \qquad (4.4) $$

where the goal of the thermal design of the package is to keep the device θJA value small enough that the junction temperature TJ does not exceed a maximum specified value during operation. It is worthwhile to note that θJA cannot be modeled directly due to the complexity of the thermal models for the package, cooling system, and board stack-up. In addition, θJA is assumed to be a single parameter under the assumption that the device power dissipation, P, is distributed uniformly across the die, which is not a realistic assumption. In practice, the package case top temperature, TT, is utilized along with temperature measurements to estimate TJ. Temperature reading can be performed by either external or internal thermal sensors. External thermal sensors, e.g., thermocouples, incur a rather large time delay in reading the temperature and tend to produce less accurate measurements of TJ. Internal thermal sensors, e.g., analog/digital CMOS sensors [70], which can be deployed in large numbers across a chip, have been widely used in pursuit of higher accuracy in measuring TJ. However, there still remain inaccuracies associated with the internal sensors.
4.3 Estimation under Uncertainty
4.3.1 Rationale for Developing Uncertainty Management
As illustrated in section 4.2, the junction temperature, TJ, is not easily estimated with Eqn. (4.4) due to the complexity of modeling θJA. To overcome this difficulty, we use an observation, i.e., the temperature reading TT of the package top obtained by a thermal sensor, to estimate TJ as follows:

$$ T_J = T_T + \psi_{JT} \cdot P \qquad (4.5) $$

where ψJT is the junction-to-top-of-package thermal characterization parameter, used as a measure of the temperature difference between the junction and the package top surface, and is estimated from JEDEC thermal tests [75]. There is, however, uncertainty in TT due to various noise sources. We overcome this problem by modeling the TT readings as a stochastic process. Power dissipation of logic devices in the substrate is the major source of heat generation. Furthermore, process, voltage, and temperature (PVT) variations result in statistical changes in the spatial distribution of power dissipation across the die, i.e., the power density and the resulting temperature profile will have different values from one part of the chip to another and from one time instance to the next at the same location on the chip. As a result, power dissipation, which is affected by PVT variation as well as time- and space-dependent gate-level switching activities in the circuit, cannot be easily characterized by the design itself. The key contribution of the proposed framework is to recognize the uncertainty in estimating the power state of the system and the resulting junction temperature of the IC, and to qualitatively manage this uncertainty by providing timely alerts about local hot spots.
4.3.2 Temperature Estimation Framework
It is useful to describe how the KF can be adapted to our proposed framework, where our goal is to estimate the junction temperature of a device.
Definition 4.1: Kalman Filter-based Temperature Estimation (KFTE) framework.
The KFTE is a tuple (s, a, o, X, Y, Z) where
- s is a state representing the junction temperature, TJ,
- a is a voltage and frequency assignment (VFA) action,
- o is a temperature observation, TT,
- X denotes a state transition matrix,
- Y denotes an action-input matrix, and
- Z denotes an observation matrix.
We assume that a power manager (e.g., the operating system) commands an appropriate action a, which changes the operational mode (i.e., power state) of the design and, hence, will result in a change in its TJ value (i.e., a change in state s). In our proposed framework, we estimate the next value of TJ (state s') by employing a prediction technique based on the KF algorithm, while analyzing (possibly noisy) TT values. The state and observation calculations are performed by using the following linear matrix equations:

$$ s_{t+1} = \mathbf{X} s_t + \mathbf{Y} a_t + u_t, \qquad u_t \sim N(0, Q_t) \qquad (4.6a) $$
$$ o_{t+1} = \mathbf{Z} s_{t+1} + v_{t+1}, \qquad v_{t+1} \sim N(0, R_t) \qquad (4.6b) $$

where t denotes a time step, ut is a temperature state noise which is normally distributed with zero mean and variance Qt, and vt+1 is a temperature observation noise normally distributed with zero mean and variance Rt. The state transition matrix X includes the probabilities of transitioning from state st to another state st+1 when action at is taken, the action-input matrix Y relates the action input to the state, and the observation matrix Z, which maps the true state space into the observed space, contains the probabilities of making observation ot+1 when action at is taken, leading the system to enter state st+1. In practice X, Y, and Z might change with each time step or measurement, but here we assume they are constant. Furthermore, we assume that the initial state and the noise vectors at each step {s0, u1, ..., ut, v1, ..., vt} are mutually independent. KFTE tries to obtain an estimate of the junction temperature from the TT data.
4.3.3 Power Profile Estimation Framework
We assume that the target system has k power states denoted by pwr1, ..., pwrk, where pwr1 < ... < pwrk in terms of the power dissipation values in the respective states. In the context of system modeling under uncertainty, a belief state b is the posterior distribution over the underlying power states given the observations and actions. Thus, we use a POMDP to formulate the power estimation framework as described below.
Definition 4.2: POMDP-based Power Profile Estimation (P3E) framework.
The P3E is a tuple (b, a, o, T, Z) such that
- b is a belief state about the power dissipation level of the system,
- a is an action input, e.g., VFA,
- o is an observation, e.g., the temperature value TT,
- T is a state transition function, and
- Z is an observation function.
The belief state bt+1, after action at and observation ot+1, may be calculated from the previous belief state bt as follows:

$$ b_{t+1}(pwr') = \frac{Z(o_{t+1}, a_t, pwr') \sum_{pwr} T(pwr', a_t, pwr)\, b_t(pwr)}{\sum_{pwr,\, pwr'} Z(o_{t+1}, a_t, pwr')\, T(pwr', a_t, pwr)\, b_t(pwr)} \qquad (4.7) $$

The estimation of the power state of the system is performed by obtaining the maximum a posteriori (MAP) estimate based on the Bayesian approach, which provides a way to include prior knowledge concerning the quantities to be estimated, as will be explained in section 4.4.
Figure 4.3: Uncertainty-aware estimation framework.

Figure 4.3 illustrates the proposed uncertainty-aware estimation framework, where the estimators are based on the KF algorithm for the junction temperature of the device and on the POMDP for the power state of the system. Assume that the operating system, based on performance requirements, can choose an action from a finite set of actions A = {a1, ..., an}, where a1 < ... < an in terms of voltage and frequency values. These actions are taken at periodic time instances (synchronous) or at interrupt-based event occurrences (asynchronous), which are called decision epochs.
4.4 Hot Spot Alerting Algorithm
In this section, we first explain the method for estimating the junction temperature of the chip as well as the power state of the system in an uncertain environment. We point out that the Kalman filter is a recursive estimator, which means that only the estimated state from the previous time step and the current measurement are needed to generate the estimate for the current state. The Kalman filter has two distinct phases: Predict and Update. The predict phase uses the state estimate from the previous time step to produce an (a priori) estimate of the state at the current time step. In the update phase, measurement information at the current time step is used to refine this prediction to arrive at a new, (hopefully) more accurate (a posteriori) state estimate, again for the current time step.
4.4.1 Estimation of Junction Temperature of the Chip
In KFTE, the framework performs the temperature estimation based on the KF as follows.
a) Initialize: The algorithm initializes the first state s_t to s_0, and the error covariance matrix E_t (a measure of the estimated accuracy of the state prediction) to a diagonal matrix whose diagonal elements are set to some fixed value, signifying that the initial system state is uncertain.
b) Predict: The algorithm computes the predicted (a priori) state s⁻_{t+1} and the predicted (a priori) error covariance matrix E⁻_{t+1}.
c) Update: The algorithm first computes the optimal Kalman gain K_{t+1} and uses it to produce an updated (a posteriori) state estimate, s_{t+1}, as a linear combination of s⁻_{t+1} and the Kalman gain-weighted residue between the actual observation o_{t+1} and the predicted observation Z s⁻_{t+1}. The algorithm also updates the error covariance matrix.
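The three steps above reduce to a handful of arithmetic operations in the scalar case. The sketch below (Python) implements the KFTE recursion of (4.6a)/(4.6b); all constants and the synthetic observation stream are illustrative toy values, not measurements from this work.

```python
# Scalar Kalman filter for the KFTE model (4.6a)/(4.6b): s' = X s + Y a + u,
# o = Z s + v. All constants (X, Y, Z, Q, R, actions, observations) are toys.
X, Y, Z = 1.0, 5.0, 1.0   # state-transition, action-input, observation gains
Q, R = 1.0, 1.0           # state and observation noise variances

s, E = 70.0, R            # a) Initialize: first state and error covariance

def kf_step(s, E, a, o):
    # b) Predict: a priori state and error covariance.
    s_pred = X * s + Y * a
    E_pred = X * E * X + Q
    # c) Update: Kalman gain, then a posteriori state and covariance.
    K = E_pred * Z / (Z * E_pred * Z + R)
    s_new = s_pred + K * (o - Z * s_pred)
    E_new = (1 - K * Z) * E_pred
    return s_new, E_new

for a, o in [(1.0, 76.2), (1.0, 80.9), (0.0, 81.4)]:  # (action, noisy T_T reading)
    s, E = kf_step(s, E, a, o)
    print(round(s, 2), round(E, 3))   # junction temperature estimate, variance
```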
probability in b MAP • Compute Table lookup (, '| ') Prob a o b 1 tt ←+ Alert hot spot • Hot spot alerting algorithm Initialize 0 t ER = • Initialize noise & error variation: • Initialize the first state: 0 t ss = Predict • Predict the next state: • Predict the error variance: 1 1T tt t EE Q + + − =+ XX 1tt t ss a + − =+ XY Update •Kalman gain: • Update the state prediction with observation: 11 1 1 1 () tt t t t ss o s ++ + + + −− =+ − KZ 11 1 11 TT () tt t t EE R ++ + +− −− =+ KZZZ • Update the error variance: 11 1 () tt t EE ++ + − =− IK Z Junction temperature estimation 0 t QQ = 0 t RR = 1 tt ←+ Initialize • Compute (',, ), (',, ') T pwr a pwr Z o a pwr • Calculate b t+1 • Calculate b MAP Predict Power state estimation • pwr t+1 = power state with max. probability in b MAP • Compute Table lookup (, '| ') Prob a o b 1 tt ←+ Alert hot spot • Hot spot alerting algorithm Figure 4.4: The flow of the proposed estimation technique. 4.4.2 Estimation of Power State of the System We consider only the task of estimating the system power state, not controlling it, where the system generates temperature observation o ∈ O given an action a. Let history, h, denote a stream of action-observation pairs which characterize the system behavior as h 0:t := (a 0 , o 1 , a 1 , o 2 , …, a t −1 , o t ). Then, according to the Bayesian formula, the probability density function of belief state b given h can be written as 97 (| ) ( ) (| ) () ⋅ = tt t t t t Prob Prob Prob Prob hb b bh h (4.8) where Prob(b t | h) is called the posterior probability density function (PPDF), Prob(h t | b t+1 ) is the likelihood function, Prob(h t ) is the prior distribution, and Prob(b t ) is precisely the probability of belief state which can be obtained from Eqn. (4.7). Once the PPDF is known, the most probable power state can be computed as follows arg max ( | ) arg max ( | ) ( ) t MAP b tt b t t bProbbh Prob h b Prob b = =⋅ (4.9) Note that as a normalizing constant, the knowledge of Prob(h t ) is not needed because we are not interested in making any decisions. Since we assume that action a is issued to the system at each decision epoch, we may consider that the current power state of the system is only affected by the previous action and observation, which results in 1 arg max ( , | ) ( ) tt MAP b tt bProbaobProbb − =⋅ (4.10) where we use a table-lookup method for obtaining Prob(a t-1 , o t | b t ) efficiently. Note that even though we the know the given action a t-1 with certainty, the observation o t is only known probabilistically. Let pwr MAP denote the power-state which has the maximum probability in the belief state, b MAP , obtained from Eqn. (4.10). 4.4.3 Hot Spot Alerting Algorithm In predicting hot spots, we combine estimation for the junction temperature of the chip (by using KF) with estimation for power dissipation of the system (by using POMDP). 98 We assume the presence of a thermal sensor which produces a stream of continuous-valued temperature readings that are noisy. This fact in turn implies that recognizing a temperature rise by using a thermal sensor may render the thermal control mechanism useless due to its slow response. Therefore, we propose a hot spot alerting algorithm based on the predictions of the junction temperature of the device and the power state of the system. 
1:  do forever
2:    predict the junction temperature, T_J^{t+1}
3:    predict the power state of the processor, pwr^{t+1}
4:    if T_J^{t+1} ≥ T_{a.H}
5:      alert red hot spot
6:    else if T_{a.L} ≤ T_J^{t+1} < T_{a.H}
7:      if pwr^{t+1} ≥ P_a
8:        alert red hot spot
9:      else
10:       alert yellow hot spot
11:   else
12:     if ∂T_J/∂t ≥ G_{j,a}
13:       alert yellow hot spot
14:   return hot spot level

Figure 4.5: The proposed hot spot alerting algorithm.

Figure 4.5 shows the proposed hot spot alerting algorithm. We define red and yellow hot spot levels in terms of the degree of thermal threat. Note that T_{a.H} and T_{a.L} are pre-defined temperature threshold values (T_{a.L} < T_{a.H}), P_a is the power threshold value, and G_{j,a} is the temperature gradient threshold value. All these parameters are set by system or package developers.

4.5 Experimental Results

The proposed technique is applied to a MIPS-compatible RISC processor with a 5-stage pipeline, instruction / data caches, and internal SRAM for code/data storage. To precisely evaluate the characteristics of the processor, we relied on the detailed Verilog RT-level description of the processor synthesized with a TSMC 65nm cell library. The power dissipation numbers were obtained through functional simulations with exact switching activity information.

We do not have a packaged IC equipped with a thermal sensor to report T_T. Hence, we estimate T_T by combining Eqns. (4.4) and (4.5), resulting in T_T = T_A + P·(θ_JA − ψ_JT), assuming that T_A = 70 °C and using the package thermal performance data of Table 4.3 for θ_JA and ψ_JT. Note that the device power P in the above equation is assumed to be a normally distributed random variable with a mean value of P_sim and a standard deviation of ∆P. Here, P_sim is the simulated power number, while ∆P is the standard deviation of the power values, which is calculated by running different tasks on the processor at different process corners (e.g., fast, typical, and slow) available with the TSMC 65nm library. We thus generate different T_T values by running various benchmark programs, regularly monitoring and recording P_sim values, but subsequently using a power value P which follows a normal distribution, N(P_sim, (∆P)²).

Table 4.1: Percentage of power consumption in different modules of a MIPS-like processor (SPECint2000).

Functional module |  gcc   |  gap   |  gzip
i-areg            |  9.2%  | 15.7%  |  2.2%
d-areg            | 15.4%  |  4.0%  | 16.4%
incr              |  4.6%  |  6.2%  |  8.1%
mul               |  7.5%  | 16.2%  |  8.4%
alu               | 21.6%  |  1.7%  | 15.1%
sr                |  2.3%  | 18.1%  | 16.6%
reg               | 17.1%  |  4.1%  | 13.1%
decode            |  9.2%  |  8.3%  | 14.4%
busctl            | 13.1%  | 17.6%  | 13.8%

We first analyzed the power consumption inside the processor by executing SPECint2000 benchmarks, as reported in Table 4.1 (without accounting for memory power). This table indicates that a non-uniform power density exists across the processor, which gives rise to local hot spots on the die.
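The synthetic T_T generation described above admits a short sketch. The formula and the package parameters come from the text and Table 4.3 (100 ft/min airflow); the power statistics P_sim and ∆P shown are hypothetical stand-ins for the simulated numbers.

    import numpy as np

    def synthetic_top_temperature(p_sim, dp, theta_ja, psi_jt, t_a=70.0, rng=None):
        # T_T = T_A + P * (theta_JA - psi_JT), with device power P drawn
        # from N(P_sim, dP^2) to model process-corner variation.
        rng = rng or np.random.default_rng()
        p = rng.normal(p_sim, dp)
        return t_a + p * (theta_ja - psi_jt)

    # Package parameters from Table 4.3; P_sim and dP are hypothetical.
    t_t = synthetic_top_temperature(p_sim=2.3, dp=0.2, theta_ja=16.12, psi_jt=0.51)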
Table 4.2: Definition of power and temperature states of the processor.

state | power range [W]  ||  state | observation range [°C]
pow1  | [0.6, 1.4]       ||  o1    | [86, 93]
pow2  | (1.4, 2.2]       ||  o2    | (93, 100]
pow3  | (2.2, 3.0]       ||  o3    | (100, 107]

Next, we set the power (W) and temperature (°C) ranges corresponding to three power states and three temperature states (see Table 4.2). These values were obtained during the active state of the processor (recall that thermal control occurs mostly during the active state) with the extracted thermal data [104] for a (31mm x 31mm) PBGA, as summarized in Table 4.3.

Table 4.3: PBGA package thermal performance data (T_A = 70 °C).

Air velocity [m/s] | [ft/min] | θ_JA [°C/W] | ψ_JT [°C/W] | T_J_max [°C] | T_T_max [°C]
0.51               | 100      | 16.12       | 0.51        | 107.9        | 106.7
1.02               | 200      | 15.62       | 0.53        | 105.3        | 104.1
2.03               | 300      | 14.21       | 0.65        | 102.7        | 101.2

Figure 4.6 shows the trace of the junction temperature estimation, where we randomly chose a sequence of 50 programs from SPECint2000, which include gcc, gap, and gzip. For example, one such sequence of programs may be gap_1-gcc_2-gcc_3- ... -gzip_49-gap_50, where program_i is the ith program in the sequence, which is executed on the processor. In this simulation setup, for simplicity but without loss of generality, we set the values of Q_t and R_t to 1. The time steps are abstractly defined. Note that the calculated temperature trace in the figure is based on the above-mentioned method (i.e., it uses the estimate of T_T).

Figure 4.6: Trace of estimation for the junction temperature.

Considering the estimation of the power state, Figure 4.7 shows the trace of the belief state for the power states pow1, pow2, and pow3, as estimated by the proposed POMDP-based technique. Here, belief state (pow1) denotes the probability that we are in the pow1 state.

Figure 4.7: Trace of belief state for the power profile.

Figure 4.8: Evaluation of the hot spot alerting algorithm.

Experimental results of the proposed hot spot alerting algorithm are shown in Figure 4.8. Here we have assumed that two processor cores inside a multicore processor execute 50 programs alternately. Hot spot levels are defined as red alert, yellow alert, and safe (i.e., there is no thermal threat), where we set the required parameter values as: T_{a.H} = 100 °C, T_{a.L} = 90 °C, G_{j,a} = 7 °C, and P_a = 2.2W. The power state of each processor is also estimated as explained above. Results in this figure demonstrate that local hot spots (see Figure 4.8, bottom) are estimated based on the power state of the processor (see Figure 4.8, top), considering the junction temperature of the device (see Figure 4.8, middle).

4.6 Summary

We have proposed a stochastic hot spot alerting technique based on estimations of the junction temperature of a device and the power state of a system. The proposed uncertainty-aware estimation framework efficiently captures the uncertain dynamics of the system behavior. Being able to handle various sources of uncertainty improves the accuracy and robustness of the estimation technique, ensuring the thermal safety of the device and thereby its quality and reliability. Experimental results demonstrate that the proposed technique alerts thermal threats under probabilistic variations.
Chapter 5
Stochastic Modeling of a Thermally-Managed Multi-Core System

5.1 Introduction

Thermal control in multi-core systems has become a first-order concern due to the increased power density and thermal vulnerability of the chip. Localized heating is a frequent occurrence in SoC designs. Power dissipation is spatially non-uniform across the chip, resulting in the emergence of hot spots and spatial temperature gradients that can cause accelerated aging, timing errors (setup time violations), or even physical damage to the chip. To address this, dynamic thermal management (DTM) techniques, which attempt to ensure thermal safety by employing runtime mechanisms to control power density and to prohibit excessive local heating, have been proposed as a class of micro-architectural solutions and control strategies that seek to enable the highest SoC performance while meeting peak temperature constraints.

Much of the past work has examined techniques for thermal modeling and management, but these techniques may be ineffective at reducing the chip temperature of multi-core (MC) systems, because the configurability of the micro-architecture depending on the target application and the uncertainty in temperature measurement (erroneous or noisy temperature reports) have often not been considered. Furthermore, thermal models based on equivalent circuit models cannot adequately model heat generation and diffusion in structures with complex shapes and boundary conditions. Indeed, it is extremely difficult to obtain the exact solution of the heat equations that arise from realistic die conditions. These difficulties render the problem of identifying hot spots stochastic.

In this chapter, we present a stochastic model of a thermally-managed MC system (which we shall call a TMS, for short) using a Markov decision process (MDP) model. The key rationale for utilizing an MDP to solve the DTM problem in MC systems is to manage the stochastic behavior of the temperature states of the system under dynamic re-configuration of its micro-architecture (which may take place in response to application program characteristics), while maximizing the system performance subject to the constraint that a critical temperature threshold is not exceeded locally or globally.

The remainder of this chapter is organized as follows. Section 5.2 provides some preliminaries, while Section 5.3 describes the details of the proposed models for a TMS. Section 5.4 presents the DTM problem formulation. Experimental results and a summary are given in Sections 5.5 and 5.6.

5.2 Preliminaries

A modern computing system, which typically utilizes multiple cores to achieve high performance, exhibits different thermal profiles under different application programs due to its re-configurable micro-architecture. For example, its cache size varies drastically depending on the characteristics of the running threads (i.e., application programs), where these adaptive caches adjust from small sizes with fast access times to higher-capacity but slower and more power-hungry configurations. As expected, larger cache configurations, which are more prominent for dual-thread workloads, produce higher power dissipation than smaller cache sizes. This in turn dynamically changes the temperature profile of the SoC during program execution.

Figure 5.1: IPC vs. L2 cache miss rate on the Intel Core Duo processor (the benchmarks shown include bzip, mgrid, gzip, mcf, parser, vpr, art, equake, galgel, and mesa).
Application programs tend to exhibit different characteristics as a function of the program phase they are in [48]. This in turn affects the computational workload of the processor and causes a new micro-architectural configuration to be employed, which in turn results in a different thermal profile on the chip. Figure 5.1 shows the IPC (Instructions per Cycle) obtained by running various application programs (e.g., SPEC CPU2000 [92]) on the Intel Core Duo processor with a typical architectural specification (cf. [9]). In this figure, the IPCs of the applications are compared to their L2 cache miss rates, where the average IPC for the CPU2000 benchmarks is measured as 0.85. It is clearly seen that a higher L2 cache miss rate accounts for a lower IPC.

An integrated circuit (device) is typically allowed to operate when the ambient air temperature, T_A, surrounding the device package is within the range of 0 °C to 70 °C. It is expedient to define the critical temperature threshold, T_crit, as the temperature above which a chip is in thermal violation, resulting in timing errors or accelerated device/interconnect aging, and a trigger temperature threshold, T_trig, as the temperature above which DTM techniques are employed. A thermal manager employs temperature reduction mechanisms when the system temperature exceeds a pre-defined temperature threshold (i.e., the trigger temperature).

5.3 System Modeling

5.3.1 Background

A CTMDP is a controllable continuous-time Markov process which satisfies the Markovian property and takes states from a set s ∈ S, where the state transition rates are controlled by actions a ∈ A. We consider a cost function which assigns a value to each state and action pair by adopting the conventional approach, i.e., when the system makes a transition from state s to another state s', it receives a cost. Given a CTMDP with n states, its generator matrix G is defined as an n×n matrix, where an entry σ_{s,s'} of G is called the transition rate from state s to another state s', which can be calculated as

σ_{s,s'} = δ(s', a) · (1/τ(s, s')),   s ≠ s'    (5.1)

where τ(s, s') is the transition time from s to s', and δ(s', a) is 1 if s' is the destination state of action a and 0 otherwise. We can calculate the limiting (steady-state) probability distribution of the CTMDP from its generator matrix. If the state transition rates are controlled by actions chosen from a finite action set A, a policy is defined as a set of state-action pairs for all the states of the CTMDP.

The exponential distribution of state transition times, a prominent property of the CTMDP model, is sometimes insufficient to model practical cases, especially when we model the first request arrival in the idle state period, where the inter-arrival times of service requests are in this case generally distributed. However, it does not hurt the quality of the present analysis to assume that the task inter-arrival times are exponentially distributed during the active state period, since thermal management is only in effect during program execution. Furthermore, the burst of program execution on a processor follows an exponential distribution.

5.3.2 Component Model

We present a CTMDP-based model of a TMS to optimally solve the DTM problem. Figure 5.2 shows an abstract model of a TMS, which comprises three components: a processor, an application program, and a thermal sensor.
In this chapter, for simplicity we assume that each application is executed by one processor and that individual thermal sensors measure the temperature of each and every processor in the MC system. A new application may cause a micro-architectural re-configuration of the corresponding processor in order to improve the overall performance. A thermal manager (TM) receives the state (phase) of the application, reads temperature data from the thermal sensor, and issues commands to the processor under its control to manage temperature rises above T_trig. There is one TM assigned to each processor. Notice that R_i, S_j, and H_k represent the state sets of the application program (i = 1, 2, ..., l), the processor (j = 1, 2, ..., m), and the temperature (k = 1, 2, ..., n), respectively, where l, m, and n are the numbers of applications, processors, and thermal sensors available within an MC system. Next, we construct the CTMDP model of a single-processor system for simplicity. The CTMDP model of an MC system can be constructed in the same manner.

Figure 5.2: Abstract model of a thermally-managed MC system (the thermal manager observes the application state set R_i and the temperature state set H_k, and controls the processor state set S_j).

5.3.2.1 Modeling the Processor State

The CTMDP model of the processor is constructed as follows. Assume that each state s ∈ S represents a combination of a micro-architectural configuration c ∈ C (e.g., register file sizing, cache sizing, or floating-point-unit disabling) and an action a ∈ A, where a micro-architectural configuration set C = {c_1, c_2, ..., c_u} and an action set A = {a_1, a_2, ..., a_v} are available to the processor. Thus, the CTMDP model of the processor includes a state set S = {s_1, s_2, ..., s_w} and a parameterized generator matrix G_proc, where w is the number of states of the processor, i.e., w = u·v.

A state transition out of some state s is controlled by either an action a ∈ A or a configuration change c ∈ C. Any state transition takes a certain amount of time to complete, where this latency overhead ranges from several clock cycles to hundreds of milliseconds. A typical micro-architecture re-configuration latency, the duration between the time a decision is made to change the micro-architectural configuration and the time of the actual configuration, takes up to tens of clock cycles [63]. Thus, a state transition in the CTMDP model of the processor takes τ(s, s') time (= max(τ_DVFS, τ_ARCH)), where τ_DVFS is the transition time of dynamic voltage and frequency scaling (DVFS) and τ_ARCH is the micro-architecture transition period, when the system transits from state s to s'.
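A minimal sketch of how an entry of the parameterized generator matrix can be assembled from Eqn. (5.1), taking the transition time as the maximum of the DVFS and re-configuration latencies; the latency tables and the action-to-destination map used below are hypothetical placeholders.

    def transition_rate(s, s2, a, tau_dvfs, tau_arch, dest_of_action):
        # sigma_{s,s'} = delta(s', a) * (1 / tau(s, s')), s != s'  (Eqn. 5.1),
        # with tau(s, s') = max(tau_DVFS, tau_ARCH) for this transition.
        if s == s2:
            raise ValueError("transition rates are defined for s != s' only")
        delta = 1.0 if dest_of_action.get(a) == s2 else 0.0
        tau = max(tau_dvfs[(s, s2)], tau_arch[(s, s2)])
        return delta / tau

    # Hypothetical example: action a2 drives the processor from s1 to s2.
    rate = transition_rate("s1", "s2", "a2",
                           tau_dvfs={("s1", "s2"): 0.8},
                           tau_arch={("s1", "s2"): 1.0},
                           dest_of_action={"a2": "s2"})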
An example of how to construct the CTMDP model of the processor is given next. For simplicity, we suppose that the processor has three micro-architectural configurations (e.g., cache resizing) denoted by c_1, c_2, and c_3, and that a voltage-frequency (VF) setting chosen from a finite set of actions A = {a_1, a_2, a_3} is applied to the processor, where a_1 < a_2 < a_3 in terms of the VF values. Then, the abstract CTMDP model of the processor can be illustrated as shown in Figure 5.3 (a), where each of the nine nodes represents a processor state s_1, ..., s_9 (each a pair (a_i, c_j)) and a directed arc represents a transition between two states, with the parameterized generator matrix G_proc given in Figure 5.3 (b). In Figure 5.3 (b), σ_{s,s'} = ∞ means the processor switches from state s to s' immediately (i.e., s = s'), and σ_{s,s'} = 0 means the processor can never switch from state s to s'.

Figure 5.3: An example of the CTMDP model of a processor (a) and its generator matrix (b).

Note that upsizing the cache configuration may require a different amount of time than downsizing the cache configuration. In addition, cache downsizing occurs at the beginning of the DVFS change because of the required cache partitioning time, whereas cache upsizing happens at the end of the DVFS change [63][50].

5.3.2.2 Modeling the Application State

Application programs can be characterized by using their architecture-dependent characteristics (such as the IPC and cache miss rate), architecture-independent characteristics (such as data and instruction temporal localities), or a combination of the two [38]. In this chapter, we focus on the micro-architecture re-configurations that affect the IPC and data cache miss rate characteristics of application programs, which subsequently result in temperature changes on the processor die. Measuring the architecture-independent characteristics may be achieved by exploiting the notion of data similarity (e.g., instruction-level parallelism, data locality).

Inspired by these observations, we construct a CTMDP model of an application program, classifying application states as sketched below. The CTMDP model consists of a state set R = {r_1, r_2, ..., r_p} and a generator matrix G_app, where p is the number of states that are present in the application. In our problem setup, an application state r is differentiated based on the values of the IPC and the cache miss rate. A state transition between different application states takes place autonomously, and may initiate a change in the state of the processor.
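As an illustration, the state of a running application can be differentiated with a simple threshold rule on its measured IPC and cache miss rate; the sketch below uses the threshold values of the four-state example that follows (IPC = 0.85, η = 0.01).

    def application_state(ipc, miss_rate, ipc_th=0.85, eta_th=0.01):
        # Classify a running application into one of the four CTMDP
        # states r1..r4 of Figure 5.4 from its IPC and L2 cache miss rate.
        if ipc <= ipc_th:
            return "r1" if miss_rate <= eta_th else "r2"
        return "r3" if miss_rate <= eta_th else "r4"

    state = application_state(ipc=0.92, miss_rate=0.004)   # -> "r3"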
An example of a four-state CTMDP model of an application, considering workload characteristics, is depicted in Figure 5.4. Here r_1, r_2, r_3, and r_4 represent combinations of IPC and cache miss rate (η) ranges of the application, e.g., r_1 = [IPC ≤ 0.85, η ≤ 0.01], r_2 = [IPC ≤ 0.85, η > 0.01], r_3 = [IPC > 0.85, η ≤ 0.01], and r_4 = [IPC > 0.85, η > 0.01]. Note that the threshold values for the IPC and η are set by the application developers. A state transition between different application states occurs within a processor. The transition rate σ_{r,r'} in G_app includes the context switch time, which is controlled by the operating system (a round-robin context switching architecture is not assumed). For example, if we make a context switch when the deadline for completing an application program is missed, then a state transition will occur with a specific transition rate.

Figure 5.4: An example of the CTMDP model of applications (a) and its generator matrix (b), with

G_app = [  ∞   0.3  0.4  0.3 ]
        [ 0.2   ∞   0.5  0.3 ]
        [ 0.4  0.2   ∞   0.5 ]
        [ 0.3  0.4  0.1   ∞  ]

5.3.2.3 Modeling the Temperature State

Temperature readings from thermal sensors are important to the DTM technique, since by knowing the temperature profile of a chip, the TMS may be triggered to respond to chip temperature changes so as to avoid thermal failure/damage of the chip or to maximize the performance of interest under temperature constraints. Conventionally, the junction temperature T_J of the IC can be estimated as

T_J = T_A + P·θ_JA    (5.2)

where T_A is the ambient temperature (°C), P is the device power dissipation (W), and θ_JA is the thermal resistance from device junction to ambient (°C/W). In general, thermal failure is avoided by keeping the device θ_JA value small enough that the junction temperature T_J does not exceed a maximum value during operation. It is worthwhile to note that θ_JA cannot be modeled directly due to the complexity of the thermal models for the package, cooling system, and board stack-up. In addition, θ_JA is treated as a single parameter under the assumption that the device power dissipation, P, is distributed uniformly across the die, which is not a realistic assumption. To overcome this difficulty, we use an observation (i.e., the temperature reading T_T of the package top obtained by a thermal sensor) as

T_J = T_T + P·ψ_JT    (5.3)

where ψ_JT is the junction-to-top-of-package thermal characterization parameter, used as a measure of the temperature difference between the junction and the package top surface. The device power P, a major source of heat generation, varies based on the micro-architectural configurations, which are also application dependent.

To construct the temperature state of the processor, we first define a set of temperatures T_0 < T_1 < ... < T_c, where T_0 = T_trig (i.e., the trigger temperature threshold) and T_c = T_crit (i.e., the critical temperature threshold). The intervening temperature thresholds are defined according to the ACPI (Advanced Configuration and Power Interface) specification. Once the temperature of the processor reaches the initial trigger temperature, the thermal manager is awakened to consider the conditions and issue a thermal management decision (i.e., a system state-changing command), ensuring that the critical temperature threshold is not exceeded. Thus, the CTMDP model of the chip temperature includes a set of temperature states H = {h_1, h_2, ..., h_c, h_{c+1}} and a generator matrix G_temp, where c+1 is the number of states that are possible with a thermal sensor. Note that h_i represents the temperature region between T_{i−1} and T_i, and h_{c+1} represents the temperature region that lies beyond T_crit. The transition rates in G_temp can be calculated as the inverse of the time it takes for the temperature of the processor to increase (decrease) from temperature state h_i to h_{i+1} (h_{i−1}), assuming that the thermal sensor receives streams of continuous-valued sensor data so that state h_i cannot evolve into state h_{i+2} or h_{i−2} directly.
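The mapping from a temperature reading to a state h_i can be sketched as follows; the threshold list passed in the example is hypothetical.

    import bisect

    def temperature_state(t_j, thresholds):
        # thresholds = [T_0, ..., T_c] with T_0 = T_trig and T_c = T_crit.
        # Returns i for state h_i (the region [T_{i-1}, T_i)); c+1 means
        # beyond T_crit, and 0 means below the trigger temperature.
        c = len(thresholds) - 1
        if t_j >= thresholds[-1]:
            return c + 1
        return bisect.bisect_right(thresholds, t_j)

    state = temperature_state(67.0, thresholds=[60, 65, 71])   # -> 2 (h_2)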
5.3.3 Integrated Model of a TMS

After constructing the CTMDP models of the processor, application, and temperature reading, we denote by X the global state set of the integrated model, defined as the Cartesian product [60] of the state sets S, R, and H, with a generator matrix G_TMS which contains the transition rate from a global state x = (s, r, h) to another x' = (s', r', h'). Note that the Cartesian product is the direct product of sets, i.e., S × R × H = {(s, r, h) | s ∈ S, r ∈ R, and h ∈ H}. The global generator matrix G_TMS is calculated as the tensor sum [23] of the generator matrices G_proc, G_app, and G_temp. Note that when two CTMDPs with generator matrices A and B are given, the generator matrix of the joint process is obtained by the tensor sum, a matrix operator, of A and B. Basically, the tensor sum C = A ⊕ B is given by C = A ⊗ I_{n2} + I_{n1} ⊗ B, where n_1 is the order of A, n_2 is the order of B, I_{ni} is the identity matrix of order n_i, and ⊗ is the tensor product [23]. The tensor product C = A ⊗ B is defined as

A ⊗ B = [ a_11·B  a_12·B ]    if  A = [ a_11  a_12 ]    (5.4)
        [ a_21·B  a_22·B ]            [ a_21  a_22 ]

where a_11, a_12, a_21, and a_22 are scalars.

An example of the integrated CTMDP model that captures the temperature evolution is provided in Figure 5.5, assuming, for simplicity, that we have two states each for the processor (s_1, s_2), the application (r_1, r_2), and the temperature (h_1, h_2). For example, if a micro-architectural configuration change from s_1 to s_2 takes place given application r_1 and temperature reading h_1, the TMS transits from x_1 to x_6 via x_5, where the temperature in the end evolves into h_2.

Figure 5.5: An example of the CTMDP model of a TMS (a) and its generator matrix (b); the eight global states x_1, ..., x_8 are the combinations (s, r, h) of the two processor, application, and temperature states.
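The tensor-sum construction of G_TMS can be sketched with numpy's Kronecker product; note that in a numerical implementation the diagonal entries of a generator matrix would hold the negated row sums rather than the ∞ shorthand used in Figures 5.3 to 5.5.

    import numpy as np

    def tensor_sum(a, b):
        # C = A (+) B = A (x) I_n2 + I_n1 (x) B, the operator used to
        # combine G_proc, G_app, and G_temp into the global G_TMS.
        n1, n2 = a.shape[0], b.shape[0]
        return np.kron(a, np.eye(n2)) + np.kron(np.eye(n1), b)

    # Joint generator of the three component models:
    # g_tms = tensor_sum(tensor_sum(g_proc, g_app), g_temp)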
5.4 Dynamic Thermal Management

In this section, a mathematical formulation of the DTM problem, which maximizes the performance metric subject to not exceeding a critical temperature threshold, is constructed.

5.4.1 Optimal DTM Policy

After determining the relevant parameters for each state x ∈ X and each arc in the CTMDP model of the TMS, we set up a mathematical programming model to solve the DTM problem as a linear program, as given below. The goal is to find an optimal state s ∈ S, which consists of an action and a micro-architectural configuration (a, c), while minimizing the energy cost of the system for a given level of performance and a given application r under tight temperature constraints. We call the tuple (a, c) a command, since the TM controls the micro-architectural configuration and the VF setting, which in turn affect the power dissipation of the processor and thereby the resulting temperature.

minimize  Σ_x Σ_{s_x} f_x^{s_x} · τ_x^{s_x} · g_x^{s_x}    (5.5)

s.t.  Σ_{s_{x'}} f_{x'}^{s_{x'}} = Σ_{x ≠ x'} Σ_{s_x} f_x^{s_x} · p_{x,x'}^{s_x},   ∀x' ∈ X    (5.6)

      Σ_x Σ_{s_x} f_x^{s_x} · τ_x^{s_x} = 1    (5.7)

      Σ_x Σ_{s_x} f_x^{s_x} · τ_x^{s_x} · δ(h(x), h_{c+1}) < Pr_crit    (5.8)

where f_x^{s_x} is the frequency with which the system enters state x with command s_x, τ_x^{s_x} is the expected duration of time that the system stays in state x when command s_x is taken, g_x^{s_x} is the energy cost of the system for a given level of performance (i.e., the energy-delay-squared product, ED²P, which captures the power-performance efficiency under voltage scaling and is independent of the clock frequency) when the system is in state x and command s_x is chosen, p_{x,x'}^{s_x} is the probability that the next system state is x' if the system is currently in state x and command s_x is taken, δ(h(x), h_{c+1}) is 1 if h(x) (i.e., the current h of state x) = h_{c+1} (i.e., the temperature is beyond T_crit) and 0 otherwise, and Pr_crit is a pre-defined threshold probability (i.e., the probability of exceeding the critical temperature threshold). Here, τ_x^{s_x} = 1 / Σ_{x' ≠ x} σ_{x,x'}^{s_x}. The ED²P metric may also be written as

ED²P = Pwr / Perf³ ≅ Pwr / IPC³    (5.9)

where Pwr denotes the processor power consumption and the processor performance is measured as the number of instructions per cycle (IPC). Pwr/IPC³ is an excellent figure of merit to capture the energy cost of a given level of processor performance. Note that we focus on AC-line-powered systems that strive to deliver maximum performance while operating under temperature constraints. Specifically, the purpose of this optimization problem is to maximize the system's power-performance efficiency while constraining the probability that the peak temperature is greater than T_crit to be less than the pre-defined probability value Pr_crit.

5.4.2 Online DTM

In many cases, we are unable to know the actual characteristics of the applications which are running on the processor in advance. Thus, we must also develop an online DTM technique by constructing a pre-characterized configuration-command mapping table, where the entries of this mapping table correspond to various combinations of application types and temperature readings. Figure 5.6 illustrates how the thermal manager interacts with the applications and the temperature readings. In this figure, the pre-characterized mapping table is obtained through extensive offline simulation during design time, considering every possible combination of states for the processor, applications, and temperature readings. It is worthwhile to note that the thermal manager is initiated only when the temperature exceeds the initial trigger temperature threshold T_trig, and it then controls the performance of the processor by limiting the critical temperature. More precisely, the thermal manager receives the states of the current application and temperature when the temperature exceeds T_trig, and issues an optimal micro-architectural configuration and action set (i.e., a command) to the processor.
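At run time, the online mechanism reduces to a table lookup; a minimal sketch follows, in the spirit of Figure 5.6. The table entries shown are hypothetical placeholders, and in practice the table would be filled offline by solving the linear program of Eqns. (5.5)-(5.8) for every combination of states.

    # Pre-characterized mapping table, computed offline during design time.
    # The concrete commands below are hypothetical placeholders.
    MAPPING_TABLE = {
        ("r1", "h1"): ("1.35V/2.3GHz", "4MB L2"),
        ("r1", "h2"): ("1.30V/2.3GHz", "2MB L2"),
        ("r2", "h2"): ("1.30V/1.8GHz", "1MB L2"),
    }

    def thermal_manager(app_state, temp_state, t_chip, t_trig):
        # Issue a command (a, c) only once the trigger temperature is exceeded.
        if t_chip < t_trig:
            return None                      # thermal manager stays dormant
        return MAPPING_TABLE.get((app_state, temp_state))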
Figure 5.6: Online thermal management technique. (Once the chip temperature crosses the trigger threshold, the thermal manager is initiated; it reads the application state R_i and the temperature state H_k, looks up the pre-characterized mapping table of architectural configurations (e.g., a 1MB, 2MB, or 4MB L2 cache) and actions (e.g., 1.30V/1.8GHz, 1.30V/2.3GHz, or 1.35V/2.3GHz) against the performance constraints, and issues a command (a, c) to the processor.)

5.5 Experimental Results

Experiments have been designed to evaluate the effectiveness of the proposed modeling technique and to assess the performance of our optimization method. We use abstract models of the Intel Core Duo processor [106], which provides a dynamic L2 cache resizing mechanism, to construct a TMS. Table 5.1 shows the (normalized) transition times for the CTMDP model of the processor, assuming that the system has S = {s_1, s_2, s_3, s_4}, where each state s = (a, c) is a combination of a_1 = [1.3V, 1.8GHz], a_2 = [1.35V, 2.3GHz], c_1 = [2MB L2 cache], and c_2 = [4MB L2 cache]. Note that due to the limitations of the simulation environment (e.g., the VTune performance analyzer [107]), we only consider a variable cache for the architectural configuration set (i.e., other configurable resources such as the register file, reorder buffer, and load/store queue are not considered). Information about the dynamic cache resizing time and the voltage and frequency control lock time is obtained from [106].

Table 5.1: Transition times for the CTMDP model of the processor.

from \ to  | (a1, c1) | (a1, c2) | (a2, c1) | (a2, c2)
(a1, c1)   |    0     |    1     |   0.8    |    1
(a1, c2)   |    5     |    0     |    5     |   0.8
(a2, c1)   |   0.8    |    1     |    0     |    1
(a2, c2)   |    5     |   0.8    |    5     |    0

To simplify the experimental setup, we consider R = {r_1, r_2, r_3, r_4}, where each r is a combination of two IPC ranges and two L2 cache miss rate (η) ranges: IPC ≤ 0.85 or IPC > 0.85, and η ≤ 0.01 or η > 0.01, based on the performance distribution for application programs shown in Figure 5.1. The initial trigger temperature threshold is defined as T_trig = 60 °C, with an ambient temperature of T_A = 40 °C, based on the thermal design guideline, where we use the thermal performance data of a 35x35mm 478-pin micro-FCPGA package [106] to obtain the temperature states. The on-chip temperature is estimated by utilizing T_chip = T_A + P·(θ_JA − ψ_JT) based on the parameter values of the package. The device power dissipation P can be assumed to be a normally distributed random variable with some known mean value and standard deviation.
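Table 5.1 admits a compact encoding, a sketch of which follows. The closed-form rule used here (0.8 for a DVFS-only change, 1 for a cache upsize, 5 for a cache downsize, all normalized) is our reading of the table entries, not an independently characterized model.

    from itertools import product

    ACTIONS = ["a1", "a2"]      # a1 = 1.30V/1.8GHz, a2 = 1.35V/2.3GHz
    CONFIGS = ["c1", "c2"]      # c1 = 2MB L2 cache, c2 = 4MB L2 cache
    STATES = list(product(ACTIONS, CONFIGS))   # S = {(a, c)}

    def transition_time(src, dst):
        # Normalized tau(s, s') consistent with Table 5.1.
        if src == dst:
            return 0.0
        (_, c_src), (_, c_dst) = src, dst
        if c_src == c_dst:
            return 0.8                         # DVFS-only transition
        return 1.0 if c_src < c_dst else 5.0   # cache upsize vs downsize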
It is clearly seen that the peak power consumption, which results in the temperaure increase, is limited by constraining the probability that the peak temperature of the system is greater than T crit to be less than P crit in our DTM policy. The time steps are abstractly defined to represent the peak power value of each program run. As expected, constraining the power dissipation causes some performance (ED 2 P) degradation. It, however, guarantees the thermal safety of the system. We investigated the performance-efficiency of the proposed DTM technique, which we call stochastic DTM, or SDTM for short. We assumed two voltage- frequency (VF) change commands are available (where a 1 < a 2 in terms of VF 121 values). For comparison purpose, we also implemented a greedy DTM policy. Greedy: Apply the following VF assignment strategy - Use a 2 at low temperatures, i.e., T trig ≤ T < (T trig + T crit )/2; - Use a 1 at high temperatures, i.e., (T trig + T crit )/2 ≤ T < T crit . SDTM: Apply the optimal DTM commands, based on the mathematical program formulation of the TMS. The Greedy policy gives considerable performance benefit, similar to clock throttling techniques which throttle clock and flush pipelines under temperature constraints. The simulation results in Table 5.2 (normalized), which varies the values of T crit , demonstrate that, compared to the Greedy policy, the SDTM policy which allows exceeding T crit to the degree of P crit achieves performance savings of up to 16.1% (average) at the cost of 3.5% (average) power penalty. However, it indicates that as we move P crit to smaller values (e.g., 0.05), we can achieve 13.9% (average) performance savings with little impact on the power metric Table 5.2: Power and performance comparisons between Greedy and SDTM techniques. T crit 71 °C 63 °C 67 °C Greedy Average Pow er Average ED 2 P P crit = 0.05 Saving (%) Pow er Perf Average Power Average ED 2 P Average Pow er Average ED 2 P P crit = 0.15 Saving (%) Pow er Perf Average Pow er Average ED 2 P P crit = 0.25 Saving (%) Pow er Perf SDTM 29.3 5.29 33.6 35.4 4.64 4.32 30.3 33.8 35.8 4.40 3.91 3.73 -3.0 -0.6 -1.1 16.8 15.7 13.6 31.9 33.9 36.2 4.31 3.90 3.70 -8.8 -0.9 -2.2 18.5 15.9 14.4 32.5 34.0 36.3 4.20 3.85 3.60 -10.9 -1.2 -2.5 20.6 17.0 16.7 Average Pow er Average ED 2 P P crit = 0.35 Saving (%) Power Perf 32.7 34.2 36.5 4.15 3.82 3.52 -11.6 -1.7 -3.1 21.5 17.7 18.5 75 °C 37.6 3.65 38.0 3.30 -1.0 9.6 38.3 3.25 -1.5 10.9 38.6 3.20 -2.6 12.3 38.9 3.01 -3.4 17.5 T crit 71 °C 63 °C 67 °C Greedy Average Pow er Average ED 2 P P crit = 0.05 Saving (%) Pow er Perf Average Power Average ED 2 P Average Pow er Average ED 2 P P crit = 0.15 Saving (%) Pow er Perf Average Pow er Average ED 2 P P crit = 0.25 Saving (%) Pow er Perf SDTM 29.3 5.29 33.6 35.4 4.64 4.32 30.3 33.8 35.8 4.40 3.91 3.73 -3.0 -0.6 -1.1 16.8 15.7 13.6 31.9 33.9 36.2 4.31 3.90 3.70 -8.8 -0.9 -2.2 18.5 15.9 14.4 32.5 34.0 36.3 4.20 3.85 3.60 -10.9 -1.2 -2.5 20.6 17.0 16.7 Average Pow er Average ED 2 P P crit = 0.35 Saving (%) Power Perf 32.7 34.2 36.5 4.15 3.82 3.52 -11.6 -1.7 -3.1 21.5 17.7 18.5 75 °C 37.6 3.65 38.0 3.30 -1.0 9.6 38.3 3.25 -1.5 10.9 38.6 3.20 -2.6 12.3 38.9 3.01 -3.4 17.5 5.6 Summary We introduced a new technique for modeling and solving the DTM problem for multi-core systems. The proposed modeling technique, based on Markov decision 122 processes captures dynamic characteristics of processor, applications, and die temperatures. 
5.6 Summary

We introduced a new technique for modeling and solving the DTM problem for multi-core systems. The proposed modeling technique, based on Markov decision processes, captures the dynamic characteristics of the processor, applications, and die temperatures. From the mathematical model, we can calculate the optimal DTM policy, which maximizes the power-performance efficiency under a peak temperature constraint.

Chapter 6
Conclusion

6.1 Summary of Contributions

This dissertation has contributed new power and thermal management techniques to address the problem of low-power design in VLSI circuits.

In Chapter 2, we tackled the problem of DPM in systems which are greatly affected by increasing levels of PVT variations in nanoscale CMOS technologies. This variability and uncertainty is beginning to undermine the effectiveness of traditional DPM approaches. It is thus critically important to develop the mathematical basis and practical applications of a variability-aware, uncertainty-reducing DPM approach. The proposed uncertainty management framework, based on stochastic processes, is guaranteed to find an optimal power management policy under dynamic variability.

In Chapter 3, we addressed the problem of traditional DPM wherein the energy and delay overheads of the power mode transitions can become quite significant. We proposed a new power management technique that predicts in real time which frequency and voltage levels to use in a multicore processor by utilizing supervised learning methods. This enables the power manager to predict the performance state of the system for each incoming task by a simple and efficient analysis of some readily available input features, which results in significant energy savings for various workloads in multicore processors.

In Chapter 4, we addressed the problem of how and when to identify and issue a hot spot alert as a thermal control method. We presented a stochastic technique for identifying and reporting local hot spots under probabilistic conditions induced by uncertainty in the chip junction temperature and the system power state. The proposed uncertainty-aware estimation framework, based on MDP and KF techniques, efficiently captures the uncertain dynamics of the system behavior. Being able to handle various sources of uncertainty improves the accuracy and robustness of the estimation technique, ensuring the thermal safety of the device and thereby its quality and reliability.

In Chapter 5, we presented a new abstract model of a thermally-managed system, where a stochastic process model is employed to capture the system performance and thermal behavior, sidestepping the difficulty of obtaining the exact solution of the heat equations that arise from realistic die conditions. The proposed modeling technique, which captures the dynamic characteristics of the processor, applications, and die temperature, manages the stochastic behavior of the temperature states of the system under dynamic re-configuration of its micro-architecture, while maximizing the system performance subject to the constraint that a critical temperature threshold is not exceeded.

6.2 Future Work

In this section we provide some directions for future research based on the material presented throughout this dissertation.

6.2.1 Dynamic Power Management

One possible extension can be developed on top of our machine-learning-based power management techniques for data center applications. Data center energy efficiency has become a public policy concern, and it is imperative that data centers implement methods to minimize their energy use. A novel power management technique can be developed by constructing computational models for learning the behavior of a data center power manager.
Typically, a self-learning power manager is highly desirable for data centers, since they exhibit different performance characteristics based on the target applications (e.g., e-commerce, financial, or scientific applications). Thus, by capturing appropriate input features and output measures based on the characteristics of the performance requirements of data centers, a machine-learning-based power management technique can strategically adjust its power management policy for dynamically varying workloads, thereby minimizing total energy use.

6.2.2 Dynamic Thermal Management

Several extensions can be developed for new DTM solutions. One interesting direction for future work is to carry out power and thermal management concurrently. By constructing multi-objective optimization models for power and thermal managers, whose goals are to minimize power consumption and maximize performance, optimal power and thermal management policies can be achieved in a way that the thermal limit is guaranteed while performance is maximized.

Another possible extension is to take predictive approaches based on constructing and utilizing stochastic processes. In a multicore system, based on a machine-learning algorithm and a performance monitor, a thermal manager can predict the thermal behavior of each core in the system by capturing current workloads and predicting incoming workloads, so that "thermal hopping" is performed in a way that balances temperature across the die. In other words, the incoming workloads are distributed to the coolest core. This extension thus produces temperature-aware management policies and techniques for ensuring that microprocessor chips operate within the allowed temperature zone.

BIBLIOGRAPHY

[1] I. Ahmad, "Easy and Efficient Disk I/O Workload Characterization in VMware ESX Server," in Proc. of International Symposium on Workload Characterization, Sep. 2007, pp. 149-158.

[2] A. Asenov, "Random dopant induced threshold voltage lowering and fluctuations in sub 50nm MOSFETs: a statistical 3D 'atomistic' simulation study," Journal of Nanotechnology, Vol. 10, pp. 153-158, 1999.

[3] K. J. Astrom, "Optimal Control of Markov Processes with Incomplete State Information," Journal of Mathematical Analysis and Applications, Vol. 10, 1965, pp. 174-205.

[4] A. Basu, S. Lin, V. Wason, A. Mehrotra, and K. Banerjee, "Simultaneous Optimization of Supply and Threshold Voltages for Low-Power and High-Performance Circuits in the Leakage Dominant Era," in Proc. of Design Automation Conference, Jun. 2004, pp. 884-887.

[5] R. E. Bellman, Dynamic Programming, Princeton University Press, Princeton, 1957.

[6] L. Benini and G. De Micheli, Dynamic Power Management: Design Techniques and CAD Tools, Kluwer Academic Publishers, 1998.

[7] L. Benini, G. Paleologo, A. Bogliolo, and G. De Micheli, "Policy Optimization for Dynamic Power Management," IEEE Trans. on Computer Aided Design of Integrated Circuits and Systems, Vol. 18, Issue 6, pp. 813-833, Jun. 1999.

[8] J. A. Bilmes, "A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models," Technical Report TR-97-021, U.C. Berkeley, 1998.

[9] S. Bird, A. Phansalkar, L. John, A. Mericas, and R. Indukuru, "Performance Characterization of SPEC CPU Benchmarks on Intel's Core Microarchitecture based processor," in Proc. of 2007 SPEC Benchmark Workshop, Jan. 2007.

[10] T. D. Burd and R. W. Brodersen, "Design Issues for Dynamic Voltage Scaling," in Proc.
of Int'l Symposium on Low-Power Electronics and Design, Aug. 2000, pp. 9-14.

[11] S. Borkar, T. Karnik, S. Narendra, J. Tschanz, A. Keshavarzi, and V. De, "Parameter Variations and Impact on Circuits and Microarchitecture," in Proc. of Design Automation Conference, Jun. 2003, pp. 338-342.

[12] D. Brooks and M. Martonosi, "Dynamic Thermal Management for High-Performance Microprocessors," in Proc. of High Performance Computer Architecture, Jan. 2001, pp. 171-182.

[13] B. Calhoun, J. Kao, and A. Chandrakasan, Leakage in Nanometer CMOS Technologies, Springer, 2006.

[14] A. R. Cassandra, L. P. Kaelbling, and M. L. Littman, "Acting Optimally in Partially Observable Stochastic Domains," in Proc. of 12th National Conference on Artificial Intelligence, Aug. 1996, pp. 1023-1028.

[15] X. Cao and X. Guo, "Partially Observable Markov Decision Processes with Reward Information," in Proc. of 43rd IEEE Conf. on Decision and Control, Dec. 2004, pp. 4398-4398.

[16] O. Chapelle, B. Scholkopf, and A. Zien, Semi-Supervised Learning, The MIT Press, 2006.

[17] S. Chaudhuri and V. Narasayya, "An Efficient Cost-driven Index Tuning Wizard for Microsoft SQL Server," in Proc. of 23rd International Conference on Very Large Databases, Sep. 1997, pp. 146-155.

[18] Y. Cheng, C. Tsai, C. Teng, and S. Kang, Electrothermal Analysis of VLSI Systems, Kluwer Academic Publishers, 2000.

[19] E. Chung, L. Benini, and G. De Micheli, "Dynamic Power Management Using Adaptive Learning Tree," in Proc. of International Conference on Computer Aided Design, Nov. 1999, pp. 274-279.

[20] W. Cohen, "Fast Effective Rule Induction," in Proc. of 12th Int'l Conference on Machine Learning, Dec. 1995, pp. 115-123.

[21] C. Cortes and V. Vapnik, "Support-Vector Networks," Journal of Machine Learning, Vol. 20, No. 3, pp. 273-297, 1995.

[22] P. Dadyar and K. Skadron, "Potential Thermal Security Risks," in Proc. of 21st IEEE Semi-Therm Symposium, Mar. 2005, pp. 229-234.

[23] M. Davio, "Kronecker products and shuffle algebra," IEEE Trans. on Computers, Vol. 30, No. 2, 1981, pp. 1099-1109.

[24] M. H. DeGroot, Optimal Statistical Decisions, McGraw-Hill, 1970.

[25] A. Dempster, N. Laird, and D. Rubin, "Maximum Likelihood from Incomplete Data via the EM Algorithm," Journal of the Royal Statistical Society, Series B, 39(1), pp. 1-38, 1977.

[26] G. Dhiman and T. S. Rosing, "Dynamic Power Management Using Machine Learning," in Proc. of ICCAD, Nov. 2006, pp. 747-754.

[27] R. A. Fisher, Statistical Methods and Scientific Inference, Macmillan, 1973.

[28] G. M. Foody, A. Mathur, C. Sanchez-Hernandez, and D. S. Boyd, "Training set size requirements for the classification of a specific class," Remote Sensing of Environment, Vol. 104, Issue 1, pp. 1-14, Sep. 2006.

[29] A. Gosavi, Simulation-Based Optimization: Parametric Optimization Techniques and Reinforcement Learning, Kluwer Academic Publishers, 2003.

[30] S. Gurumurthi, A. Sivasubramaniam, and V. Natarajan, "Disk Drive Roadmap from the Thermal Perspective: A Case for Dynamic Thermal Management," ACM SIGARCH Computer Architecture News, Vol. 22, Issue 2, May 2005, pp. 38-49.

[31] Y. Hotta, M. Sato, H. Kimura, S. Matsuoka, T. Boku, and D. Takahashi, "Profile-based Optimization of Power Performance by using Dynamic Voltage Scaling of a PC cluster," in Proc. of Parallel and Distributed Processing Symposium, Apr. 2006, pp. 8-16.

[32] W. Huang, M. R. Stan, K. Skadron, K. Sankaranarayanan, S. Ghosh, and S.
Velusamy, "Compact Thermal Modeling for Temperature-Aware Design," in Proc. of Design Automation Conference, Jun. 2004, pp. 878-883.

[33] K. Huang, Z. Xu, I. King, M. R. Lyu, and Z. Zhou, "A Novel Discriminative Naive Bayesian Network for Classification," in Bayesian Network Technologies: Applications and Graphical Models, A. Mittal, A. Kassim, and T. Tan (Eds.), Idea Group Inc., Mar. 2007, pp. 1-12.

[34] F. Incropera and D. Dewitt, Introduction to Heat Transfer, 3rd edition, Wiley, New York, 1996.

[35] A. Iyer and D. Marculescu, "Power Efficiency of Voltage Scaling in Multiple Clock, Multiple Voltage Cores," in Proc. of International Conference on Computer Aided Design, Nov. 2002, pp. 379-386.

[36] M. Janicki and A. Napieralski, "Inverse Heat Conduction Problems in Electronics with Special Considerations of Analytical Analysis Methods," in Proc. of Int'l Semiconductor Conference, Vol. 2, Issue 4, Oct. 2004, pp. 455-458.

[37] R. Jejurikar and R. Gupta, "Dynamic Voltage Scaling for System-wide Energy Minimization in Real-time Embedded Systems," in Proc. of Int'l Symposium on Low Power Electronics and Design, Aug. 2004, pp. 78-81.

[38] A. Joshi, A. Phansalkar, L. Eeckhout, and L. John, "Measuring benchmark similarity using inherent program characteristics," IEEE Trans. on Computers, Vol. 55, No. 6, Jun. 2006, pp. 769-782.

[39] H. Jung and M. Pedram, "Continuous Frequency Adjustment Technique based on Dynamic Workload Prediction," in Proc. of International Conference on VLSI Design, Jan. 2008, pp. 415-420.

[40] H. Jung and M. Pedram, "Dynamic Power Management under Uncertain Information," in Proc. of Design Automation and Test in Europe, Apr. 2007, pp. 1060-1065.

[41] A. B. Kahng, "Design Challenges at 65nm and Beyond," in Proc. of Design Automation and Test in Europe, Mar. 2007, pp. 1-2.

[42] R. E. Kalman, "A New Approach to Linear Filtering and Prediction Problems," Trans. of the ASME - Journal of Basic Engineering, Vol. 82, Series D, pp. 35-45, 1960.

[43] R. E. Kalman and R. S. Bucy, "New Results in Linear Filtering and Prediction Theory," Trans. of the ASME - Journal of Basic Engineering, Vol. 83, pp. 95-107, 1961.

[44] K. Kang, K. Kim, and K. Roy, "Variation Resilient Low-Power Circuit Design Methodology using On-Chip Phase Locked Loop," in Proc. of Design Automation Conference, Jun. 2007, pp. 934-939.

[45] K. Keutzer and M. Orshansky, "From Blind Certainty to Informed Uncertainty," in Proc. of International Workshop on Timing Issues, Dec. 2002, pp. 37-41.

[46] R. Knauerhase, P. Brett, T. Li, B. Hohlt, and S. Hahn, "Using OS Observations to Improve Performance in Multi-core Systems," IEEE Micro, Vol. 28, Issue 3, pp. 54-66, May-Jun. 2008.

[47] R. Kumar, K. Farkas, N. Jouppi, P. Ranganathan, and D. Tullsen, "Single-ISA Heterogeneous Multicore Architecture: The Potential for Processor Power Reduction," in Proc. of Symposium on Microarchitecture, Dec. 2003, pp. 81-93.

[48] D. C. Lee, P. Crowley, J. Baer, T. Anderson, and B. Bershad, "Execution characteristics of desktop applications on Windows NT," in Proc. Int. Conf. on Computer Architecture, 1998, pp. 27-38.

[49] M. Lie, W. S. Wang, and M. Orshansky, "Leakage Power Reduction by Dual-Vth Designs Under Probabilistic Analysis of Vth Variation," in Proc. of International Symposium on Low Power Electronics and Design, Aug. 2004, pp. 2-7.

[50] S. Lopez, S. Dropsho, D. Albonesi, O. Garnica, and J. Lanchares, "Dynamic Capacity-Speed Tradeoffs in SMT Processor Caches," in Proc.
of High Performance Embedded Architectures and Compilers, Jan. 2007, pp. 136-150.

[51] Y-H. Lu and G. De Micheli, "Comparing System-Level Power Management Policies," IEEE Design & Test of Computers, Vol. 18, Issue 2, pp. 10-19, Mar-Apr. 2001.

[52] F. Marc, B. Mongellza, C. Bestory, H. Levi, and Y. Danto, "Improvement of Aging Simulation of Electronic Circuits Using Behavioral Modeling," IEEE Trans. on Device and Materials Reliability, Vol. 6, No. 2, Jun. 2006, pp. 228-234.

[53] C. McNairy and R. Bhatia, "Montecito: A Dual-Core, Dual-Thread Itanium Processor," IEEE Micro, Vol. 25, Issue 2, pp. 10-20, Mar-Apr. 2005.

[54] T. Mitchell, Machine Learning, McGraw Hill, 1997.

[55] A. Mittal and A. Kassim, Bayesian Network Technologies: Applications and Graphical Models, IGI Publishing, 2007.

[56] B. Mochocki, D. Rajan, X. S. Hu, C. Poellabauer, K. Otten, and T. Chantem, "Network-Aware Dynamic Voltage and Frequency Scaling," in Proc. of Real-Time and Embedded Technology and Applications Symposium, Apr. 2007, pp. 215-224.

[57] R. Mukherjee and S. O. Memik, "Systematic Temperature Sensor Allocation and Placement for Microprocessors," in Proc. of Design Automation Conference, Jul. 2006, pp. 542-547.

[58] A. Naveh, E. Rotem, A. Mendelson, S. Gochman, R. Chabukswar, K. Krishnan, and A. Kumar, "Power and Thermal Management in the Intel Core Duo Processor," Intel Technology Journal, Vol. 10, Issue 2, pp. 109-122, May 2006.

[59] M. Nagulapally, G. V. Shankaran, and S. Z. Zhao, "Development of RC network compact models of IC packages using multigrid method," in Proc. of Thermal and Thermal-mechanical Phenomena in Electronic Systems, Jun. 2004, pp. 635-640.

[60] M. J. Osborne, A Course in Game Theory, MIT Press, 1994.

[61] S. Paquet, G. Gordon, and S. Thrun, "Point-based Value Iteration: An Anytime Algorithm for POMDPs," in Proc. of Int'l Joint Conf. on Artificial Intelligence, Aug. 2003, pp. 1025-1032.

[62] M. Pedram and S. Nazarian, "Thermal Modeling, Analysis and Management in VLSI Circuits: Principles and Methods," Proceedings of the IEEE, Vol. 95, No. 8, Aug. 2006, pp. 1489-1501.

[63] D. Ponomarev, G. Kucuk, and K. Ghose, "Dynamic Resizing of Superscalar Datapath Components for Energy Efficiency," IEEE Trans. on Computers, Vol. 55, No. 2, Feb. 2006, pp. 199-213.

[64] V. Pronk, S. V. R. Gutta, and W. F. J. Verhaegh, "Incorporating Confidence in a Naive Bayesian Classifier," Lecture Notes in Computer Science: User Modeling 2005, pp. 317-326, Aug. 2005.

[65] M. L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming, Wiley, New York, 1994.

[66] Q. Qiu, Q. Wu, and M. Pedram, "Stochastic Modeling of a Power-Managed System - Construction and Optimization," IEEE Trans. on Computer-Aided Design, Vol. 10, No. 10, pp. 1200-1217, Oct. 2001.

[67] R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, 1993.

[68] N. Ray, "Statistical Decision Theory, Bayes Classifier," Lecture Notes for CMPUT 466/551, [online] http://www.cs.ualberta.ca/~nray1/CMPUT466_551/SDTheory.ppt.

[69] Z. Ren, B. H. Krogh, and R. Marculescu, "Hierarchical Adaptive Dynamic Power Management," IEEE Trans. on Computers, Vol. 15, Issue 4, pp. 409-420, Apr. 2005.

[70] E. Rotem, J. Hermerding, C. Aviad, and C. Harel, "Temperature measurement in the Intel Core Duo Processor," in Proc. of 12th Int'l Workshop on Thermal Investigation of ICs, Sep. 2006, pp. 1-5.

[71] C. Rusu, N. AbouGhazaleh, A. Ferreira, R. Xu, B. Childers, R. Melhem, and D.
Mosse, "Integrated CPU and L2 Cache Frequency/Voltage Scaling using Supervised Learning," in Proc. of Workshop on Statistical and Machine Learning Approaches Applied to Architectures and Compilation, Jul. 2007, pp. 41-50.

[72] K. Salchow, "Load Balancing 101," white paper from F5 Inc., [online] http://www.f5.com/pdf/white-papers/evolution-adc-wp.pdf and http://www.theacademy.ca/load-balancing101-wp.pdf.

[73] A. Sardag and H. L. Akin, "Kalman based finite state controller for partially observable domains," Journal of Advanced Robotics Systems, Vol. 3, No. 4, pp. 331-342, 2006.

[74] M. L. Seltzer, B. Raj, and R. M. Stern, "A Bayesian classifier for spectrographic mask estimation for missing feature speech recognition," Journal of Speech Communication, Vol. 43, pp. 379-393, Mar. 2004.

[75] H. Shaukatullah and A. Claassen, "Effect of Thermocouple Wire Size and Attachment Method on Measurement of Thermal Characteristics of Electronic Packages," in Proc. of 19th IEEE Semi-Therm Symposium, 2003, pp. 468-473.

[76] S. Shidore and T. Lee, "A Comparative Study of the Performance of Compact Model Topologies and Their Implementation in CFD for a Plastic Ball Grid Array Package," Journal of Electronic Packaging, Vol. 123, Issue 3, pp. 232-236, Sep. 2001.

[77] T. Simunic, L. Benini, P. Glynn, and G. De Micheli, "Event-driven Power Management," IEEE Trans. on Computer Aided Design of Integrated Circuits and Systems, Vol. 20, Issue 21, pp. 840-857, Jul. 2001.

[78] K. Skadron, M. R. Stan, W. Huang, S. Velusamy, K. Sankaranarayanan, and D. Tarjan, "Temperature-Aware Microarchitecture," in Proc. of Int'l Symposium on Computer Architecture, Jun. 2003, pp. 94-125.

[79] A. Srivastava, D. Sylvester, and D. Blaauw, Statistical Analysis and Optimization for VLSI: Timing and Power, Springer, 2005.

[80] V. Srinivasan, D. Brooks, M. Gschwind, and P. Bose, "Optimizing Pipelines for Power and Performance," in Proc. of International Symposium on Microarchitecture, Nov. 2002, pp. 333-344.

[81] J. Srinivasan and S. V. Adve, "Predictive Dynamic Thermal Management for Multimedia Applications," in Proc. of ACM Int'l Conference on Supercomputing, Jun. 2003, pp. 109-120.

[82] A. Statnikov, C. F. Aliferis, I. Tsamardinos, D. P. Hardin, and S. Levy, A Comprehensive Evaluation of Multicategory Classification Methods for Microarray Gene Expression Cancer Diagnosis, Bioinformatics, 2004.

[83] H. Su, F. Liu, A. Devgan, E. Acar, and S. Nassif, "Full Chip Leakage Estimation Considering Power Supply and Temperature Variations," in Proc. of International Symposium on Low Power Electronics and Design, Aug. 2003, pp. 78-83.

[84] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, The MIT Press, Cambridge, MA, 1998.

[85] Y. F. Tsai, N. Vijaykrishnan, Y. Xie, and M. J. Irwin, "Influence of Leakage Reduction Techniques on Delay/Leakage Uncertainty," in Proc. of IEEE 18th Int'l Conference on VLSI Design, Jan. 2005, pp. 374-379.

[86] G. Theocharous, et al., "Machine Learning for Adaptive Power Management," Intel Technology Journal, Vol. 10, Issue 4, pp. 299-310, Jul. 2006.

[87] V. N. Vapnik, The Nature of Statistical Learning Theory, 2nd ed., Springer-Verlag, New York, 1999.

[88] T. Wang, Y. Lee, and C. C. Chen, "3D Thermal-ADI: An Efficient Chip-Level Transient Thermal Simulator," in Proc. of International Symposium on Physical Design, Apr. 2003, pp. 10-17.

[89] A. Weissel and F. Bellosa, "Self-Learning Hard Disk Power Management for Mobile Devices," in Proc. of Int'l Workshop on Software Support for Portable Storage, Oct.
2006, pp. 33 – 40. [90] R. Williams, and L. Baird, “Tight Performance Bounds on Greedy Policies based on Imperfect Value Functions,” Technical report NU-CCS-93-14, Northeastern University, Nov. 1993. [91] Q. Wu, P. Juang, M. Martonosi, and D.W. Clark, “Voltage and Frequency Control with Adaptive Reaction Time in Multiple-Clock Domain Processors,” in Proc. of Symposium on High-Performance Computer Architecture, Feb. 2005, pp. 178-189. [92] CPU SPECint2000 Document. [online] http://www.spec.org. [93] Synopsys PrimeTime Analyzer. [online] http://www.synopsys.com. [94] OpenRISC 1000 processor. [online] http://www.opencoreg.org [95] Synopsys Compiler Documents. [online] http://www.synopsys.com [96] Specman HDL simulator [online]. http://www.cadence.com [97] Advanced Configuration and Power Interface Specification, Rev. 3.0b, Oct. 2006. http://www.acpi.info/spec.htm. [98] Dual-Core processor Power and Thermal Design Guide, Mar., 2006. [online] http://www.amd.com. [99] JEDEC Standards.[online] http://www.jedec.org. [100] Dual-Core Processor Power and Thermal Design Guide, Mar., 2006. [online] v. http://www.amd.com. 135 [101] White paper, “Scalable Networking: Eliminating the receive processing bottleneck – Introduction RSS,” WinHEC 2004 version, Apr. 2004 http://www.microsoft.com/whdc/. [102] IEEE 802.3 Ethernet document. [online] http://www.ieee802.org. [103] White paper, “Bi-directional current/power monitor with I 2 C Interface,” Sep. 2008, [online] http://focus.ti.com. [104] Thermal data for MIPS64 processors. [online] http://www.broadcom.com. [105] CPU SPEC2000 documents.[online] http://www.spec.org. [106] Intel Core Duo processor on 65nm process: Thermal Design Guide, Feb., 2006, [online] http://www.intel.com. [107] Vtune performance analyzer. [online] http://www.intel.com/software.
Abstract
With continued progress in semiconductor technology, chip density and operating frequency have increased rapidly, making power consumption in digital circuits a major concern for VLSI designers. Furthermore, as nanometer-scale technologies enter a regime of randomness, with significant variability in the behavior of silicon structures, improving the accuracy of power optimization techniques under increasing levels of process, voltage, and temperature (PVT) variations is becoming a critical task. In this dissertation, a stochastic dynamic power management (DPM) framework is presented that improves the accuracy of the decision-making strategy for power management under the probabilistic conditions induced by PVT variations. Subsequently, an adaptive DPM technique is presented that constructs a machine learning-based power manager, both to improve the quality of the decision-making strategy and to reduce the overhead of the power manager, which must repeatedly determine and assign voltage-frequency settings for each core in a multicore system. The focus of the dissertation then shifts to thermal management, since thermal control is also becoming a first-order concern due to increased power density and the growing vulnerability of the system to temperature. A technique is presented to identify and report local hot spots under probabilistic conditions induced by uncertainty in the chip junction temperature and the system power state. Lastly, an abstract model of a thermally managed system based on stochastic processes is introduced, followed by a stochastic thermal management technique.
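The partially observable setting summarized above can be made concrete with a small numerical sketch. The following Python fragment is illustrative only: the state names, the transition and observation probabilities, and the belief_update helper are assumptions made for this example, not models taken from the dissertation. It shows the standard Bayes-filter belief update a power manager could use when the true power/thermal state is hidden and only noisy sensor readings are available.

import numpy as np

states = ["cool", "warm", "hot"]          # hidden system states
actions = ["high_vf", "low_vf"]           # candidate voltage-frequency settings
observations = ["low_temp", "high_temp"]  # noisy sensor readings

# T[a][s, s']: P(s' | s, a), one transition matrix per action (rows sum to 1).
T = {
    "high_vf": np.array([[0.6, 0.3, 0.1],
                         [0.1, 0.6, 0.3],
                         [0.0, 0.2, 0.8]]),
    "low_vf":  np.array([[0.9, 0.1, 0.0],
                         [0.4, 0.5, 0.1],
                         [0.1, 0.4, 0.5]]),
}

# O[s', o]: P(o | s'), the sensor (observation) model.
O = np.array([[0.9, 0.1],
              [0.5, 0.5],
              [0.1, 0.9]])

def belief_update(b, action, obs):
    """Bayes filter: b'(s') is proportional to P(o | s') * sum_s P(s' | s, a) * b(s)."""
    o_idx = observations.index(obs)
    b_pred = T[action].T @ b       # predict the next-state distribution
    b_new = O[:, o_idx] * b_pred   # weight by the observation likelihood
    return b_new / b_new.sum()     # renormalize to a probability distribution

# One decision epoch: start maximally uncertain, run at high V-F, read a hot sensor.
b = np.array([1.0, 1.0, 1.0]) / 3.0
b = belief_update(b, "high_vf", "high_temp")
print(dict(zip(states, np.round(b, 3))))   # belief mass shifts toward "hot"

A belief-based power manager would then select the next voltage-frequency setting from the updated belief rather than from a single (possibly wrong) state estimate, for example by switching to low_vf once the probability assigned to the "hot" state crosses a threshold.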
Asset Metadata
Creator: Jung, Hwisung (author)
Core Title: Stochastic dynamic power and thermal management techniques for multicore systems
School: Viterbi School of Engineering
Degree: Doctor of Philosophy
Degree Program: Electrical Engineering
Publication Date: 10/06/2009
Defense Date: 09/10/2009
Publisher: University of Southern California (original), University of Southern California. Libraries (digital)
Tag: machine learning, microprocessor, multicore systems, OAI-PMH Harvest, power management, thermal management
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisor: Pedram, Massoud (committee chair), Draper, Jeffrey T. (committee member), Gupta, Sandeep K. (committee member), Nakano, Aiichiro (committee member)
Creator Email: hwijung@usc.edu, hwisung@gmail.com
Permanent Link (DOI): https://doi.org/10.25549/usctheses-m2650
Unique Identifier: UC1497972
Identifier: etd-JUNG-3249 (filename), usctheses-m40 (legacy collection record id), usctheses-c127-258297 (legacy record id), usctheses-m2650 (legacy record id)
Legacy Identifier: etd-JUNG-3249.pdf
Dmrecord: 258297
Document Type: Dissertation
Rights: Jung, Hwisung
Type: texts
Source: University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Repository Name: Libraries, University of Southern California
Repository Location: Los Angeles, California
Repository Email: cisadmin@lib.usc.edu