MULTI-LEVEL AND ENERGY-AWARE RESOURCE CONSOLIDATION IN A VIRTUALIZED CLOUD COMPUTING SYSTEM

By Inkwon Hwang

A Dissertation Presented to the FACULTY OF THE USC GRADUATE SCHOOL, UNIVERSITY OF SOUTHERN CALIFORNIA, In Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (ELECTRICAL ENGINEERING)

August 2016

Copyright 2016 Inkwon Hwang

To Sooyeon, Irene, and Kayla, for their unconditional love and constant support

ACKNOWLEDGMENTS

Pursuing a Ph.D. has been one of the most important and remarkable experiences of my life. I cannot imagine how I could have completed it without the people who have constantly supported me.

First and foremost, I would like to thank my Ph.D. advisor, Professor Massoud Pedram. He kept encouraging and guiding me throughout my Ph.D., and his insights and passion have greatly inspired my research. He also gave me the chance to work at companies, which helped me understand industry well; hence, my research could be more practical and applicable to the real world. During the last couple of years of my Ph.D. I had to work remotely; with his great consideration and patience, I was able to continue the research and complete my degree.

Besides my advisor, my sincere thanks go to Professor Sandeep Gupta. He served on my screening, qualifying, and dissertation exam committees and gave me invaluable comments and encouragement. I worked with him as a teaching assistant for about four years. During that time we had many discussions, through which I developed my teaching skills and learned the fundamentals of the Very Large Scale Integration field.

Last but not least, I would like to thank my beloved family: Sooyeon, Irene, and Kayla. This achievement would have been impossible without their constant care, support, and devotion. I also would like to thank my mother-in-law, who supported me not only financially but also by taking care of my two daughters, Irene and Kayla. Finally, I gratefully thank my parents for supporting me in every step of my life.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
ABSTRACT
Chapter 1. Introduction
  1.1 Motivations
  1.2 Methodology
  1.3 Thesis Outline
Chapter 2. Hierarchical, Portfolio Theory-based Virtual Machine Consolidation in a Cloud Computing System
  2.1 Related Works
  2.2 Resource Demand Model and Portfolio Effect
    2.2.1 Resource Demand Model
    2.2.2 Multiple Resource Types
    2.2.3 Portfolio Effect
  2.3 Problem Statement – Multi-Capacity Stochastic Bin Packing Optimization
  2.4 Hierarchical Resource Management Solution
    2.4.1 Global Resource Manager
    2.4.2 Local Resource Manager
    2.4.3 Joint VM-to-Cluster and VM-to-PM Allocation Algorithm
  2.5 Online Resource Management
    2.5.1 Intra-cluster Migration
    2.5.2 Inter-cluster Migration
  2.6 Simulation Results
    2.6.1 Simulation Setup
    2.6.2 Global Resource Manager
    2.6.3 Local Resource Manager
    2.6.4 Overall Performance Comparison
    2.6.5 Online Resource Management
  2.7 Summary
Chapter 3. CPU Consolidation in a Virtualized Multi-Core Server
  3.1 Background – Power Management Techniques
    3.1.1 Processor Power States (C, CC, PC-States)
    3.1.2 Processor Performance States (P-States)
    3.1.3 Core-level Power Gating
    3.1.4 Intel® QuickPath Interconnect (QPI)
  3.2 Power, Delay and Consolidation
    3.2.1 Power Model
    3.2.2 CPU Consolidation and Power Dissipation
    3.2.3 Delay Model
  3.3 Energy Efficiency Metrics
    3.3.1 Energy per Task (E/task)
    3.3.2 Energy-Delay Product per Task (ED/task)
  3.4 Experimental Setup
    3.4.1 Hardware Test-bed and XEN
    3.4.2 Benchmarks – PARSEC and SPECWeb2009
  3.5 Experimental Results and Discussion
    3.5.1 Power Model Derivation and Verification
    3.5.2 Package-level Consolidation
    3.5.3 Consolidation Overhead – vCPU Count
    3.5.4 CPU Selection Policy
    3.5.5 Execution Time
    3.5.6 E/task and ED/task Improvements for PARSEC
    3.5.7 CPU Consolidation for SPECWeb2009 Benchmarks
    3.5.8 Online CPU Consolidation Algorithms
  3.6 Summary
Conclusion
References

LIST OF FIGURES

Fig 1. Effect of correlations between RVs on a standard deviation
Fig 2. Portfolio effect example #1
Fig 3. Portfolio effect example #2
Fig 4. A proposed hierarchical resource management
Fig 5. VM-to-cluster algorithm (preliminary version)
Fig 6. VM-to-cluster algorithm (complete version)
Fig 7. Multi-dimensional vs. multi-capacity bin packing
Fig 8. Example of nvd calculation (K=3)
Fig 9. MNVD algorithm for VM-to-PM assignment
Fig 10. JOINT algorithm for direct VM allocation
Fig 11. Intra-cluster VM migration algorithm (Phase 1)
Fig 12. Intra-cluster VM migration algorithm (Phase 2)
Fig 13. Inter-cluster VM migration algorithm – global manager
Fig 14. Inter-cluster VM migration algorithm – local manager
Fig 15. Quality comparison among the algorithms (N=500, W=50)
Fig 16. Cost and algorithm running time vs. window size (W)
Fig 17. Running time vs. W for different problem sizes (N)
Fig 18. Cost and running time vs. dimension of resource (K)
Fig 19. Quality comparison among the algorithms
Fig 20. Running time comparison among the algorithms
Fig 21. Cost comparison between RBR and the proposed algorithm
Fig 22. Ratio of inter-/intra-cluster migration
Fig 23. Quality comparison
Fig 24. Intel® QPI block diagram
Fig 25. Example of CC-state switch by consolidation
Fig 26. The server system along with a power analyzer
Fig 27. Power dissipation vs. utilization for C-state limits
Fig 28. Power estimation vs. measurements when the C-state limit is C3
Fig 29. Relationship between PC0 states of two packages of the target server when there are 4 active CPUs in exactly one of the packages
Fig 30. Percentage of the time that each package in the target server is in the PC0 state as a function of the total utilization
Fig 31. Consolidation overhead, i.e., execution time as a function of the virtualization ratio
Fig 32. Consolidation overhead, i.e., energy per task as a function of the virtualization ratio
Fig 33. CC0 and PC0 state residencies for the bodytrack program
Fig 34. CC0 and PC0 state residencies for the canneal program
Fig 35. Effect of simple CPU selection policies on energy consumption per task of various PARSEC programs
Fig 36. Execution time of PARSEC benchmark programs as a function of the average utilization per core
Fig 37. Energy per task improvement
Fig 38. Energy-delay product per task improvement
Fig 39. Response time and power dissipation
Fig 40. Frequency vs. total utilization (SPECWeb)
Fig 41. Pseudo code for min_cpu() and min_freq()
Fig 42. Four online consolidation algorithms
Fig 43. ED/packet and QoS comparisons

LIST OF TABLES

Table 1. C-state Limit and Hardware-reported Information
Table 2. Power Macro-Models for the Server System under Test
Table 3. ED/packet Comparison

ABSTRACT

Improving the energy efficiency of cloud computing systems has become an important issue because the electric energy bill for 24/7 operation of these systems can be quite large. As a way of lowering daily energy consumption, this thesis proposes two resource consolidation techniques: virtual machine consolidation and CPU consolidation.

In contrast to many existing works, which assume that the resource demands of virtual machines are given as scalar values, we treat these demands as random variables with known means and standard deviations, because in many situations the demands are not deterministic. These random variables may be correlated with one another, and several types of resources can become performance bottlenecks; therefore, both the correlations and the heterogeneity of resource types must be considered. The virtual machine consolidation problem is thus formulated as a multi-capacity stochastic bin packing problem. This problem is NP-hard, so we present heuristic methods to solve it efficiently. Simulation results show that, in spite of its simplicity and scalability, the proposed method produces high-quality solutions.

While virtual machine consolidation saves a significant amount of energy, individual server machines remain under-utilized in order to avoid service-level-agreement violations. A popular way to reduce the energy consumption of such under-utilized servers is dynamic voltage and frequency scaling (DVFS), which matches the CPU's performance and power level to the incoming workload. Another power-saving technique is CPU consolidation, which uses the minimum number of CPUs necessary to meet the service request demands and turns off the remaining unused CPUs. DVFS has already been extensively studied and its effectiveness verified; the effectiveness of CPU consolidation, on the other hand, requires further study. Key questions that must be answered are how effectively CPU consolidation improves energy efficiency and how to maximize that improvement.
These questions are addressed in this thesis. After reviewing modern power management techniques and developing an appropriate power model, the thesis presents an extensive set of hardware-based experimental results and makes suggestions about how to maximize the energy efficiency improvement achievable through CPU consolidation. In addition, the thesis presents new online CPU consolidation algorithms, which reduce the energy-delay product by up to 13% compared to the default Linux DVFS algorithm.

Chapter 1. INTRODUCTION

1.1 MOTIVATIONS

Energy efficiency of datacenters was not a major concern as of only a few years ago, but the electric energy bill for operating a typical datacenter has been increasing rapidly. Although the energy efficiency of server machines has been improving, these efficiency advances have not kept pace with the growth of cloud computing services and the concomitant increase in the number and size of datacenters. As a result, an ever-increasing amount of electrical energy is being consumed in today's datacenters, giving rise to concerns about both the carbon emission footprint of datacenters and the costs of operating them. The latter is an especially important concern from the viewpoint of datacenter owners and operators (as well as their customers and clients, who must eventually pay the bill). Therefore, improving energy efficiency has become a critically important issue.

A large datacenter comprises tens of thousands of heterogeneous server machines, each of which consumes hundreds of watts. Hence, the gross power consumption of these server machines plus the air conditioning units in a typical datacenter can easily exceed a few MW. For example, the number of servers in Google's datacenters was estimated to be 0.9 million in 2010, and the gross energy consumption is about 1.9 billion kWh (equivalent to 730 MW) [1]. One Facebook datacenter in Prineville, Oregon has a power capacity of 28 MW [2]. The power capacity of very large datacenters comes close to 100 MW, which is the amount of power consumed by 80,000 U.S. homes or 250,000 E.U. homes [3]. At 10-12 cents per kWh, the electrical energy bill for even a mid-size datacenter (say, with 10 MW average power consumption) exceeds twenty thousand dollars per day (10 MW x 24 h = 240,000 kWh, or about $24,000 at 10 cents/kWh). Because of these huge costs, there is a growing need for energy-aware resource management strategies in datacenters.

A datacenter is typically under-utilized; it is designed to provide the required performance and satisfy its service level agreements (SLAs) with clients even during peak workload hours, and hence at other times its resources are vastly under-utilized. For example, the minimum and maximum utilization of the statically provisioned capacity of Facebook's datacenter are 40% and 90%, respectively [4]. Hence, in light of the energy non-proportionality of today's server base [5], a great amount of energy cost can be saved by consolidating jobs onto as few server machines as possible and turning off the unused machines. In other words, there are potential energy cost savings if a well-designed resource management technique is applied. This technique is known as 'server consolidation'.

Server consolidation can effectively save energy costs, but there is still room for additional energy savings because of the limitations and overheads associated with it.
Due to the non-negligible overheads of task (or VM) migration, server consolidation cannot be conducted very often. This implies that tasks cannot be packed very tightly onto the minimum number of server machines without risking SLA violations. Therefore, each server machine is still under-utilized, and further energy savings can be achieved with another resource management technique: the CPU consolidation proposed in this thesis.

The thesis thus proposes two different resource management techniques: server consolidation and CPU consolidation. These two techniques are complementary to each other, and the largest energy savings are expected when both are applied.

1.2 METHODOLOGY

Considering that a typical datacenter is under-utilized (5 to 20% server utilization) much of the time [6], the energy cost can be greatly reduced by consolidating service requests and/or running applications onto as few servers as possible and shutting down the surplus servers. This technique is known as server consolidation. It enhances the energy efficiency of datacenters because of the non-energy-proportional characteristics of modern server machines [5]. Note that in a virtualized cloud system the key consolidation task is to pack the virtual machines (VMs) onto the minimum number of server machines; server consolidation for a virtualized system is called VM consolidation.

VM consolidation is not a simple task. First, it is hard to anticipate future workloads; if the workloads are under-estimated, servers can become over-utilized, which may violate service level agreements (SLAs). Second, because there are a large number of machines and VMs, a proposed solution must be efficient enough to find a high-quality solution on time. Last, VM migration is a costly operation, so VMs should be relocated onto as few servers as possible with the least number of migrations.

We propose a hierarchical VM consolidation technique. First, the technique uses a stochastic resource demand model, whereas many existing works assume resource demands are deterministic. Second, the proposed method has a hierarchical structure, so it can quickly find solutions to large problems while the solution quality remains high. Third, the online solution relocates as few VMs as possible, so it achieves the same level of energy efficiency with minimal migration overheads and costs. The thesis presents extensive simulation results evaluating the proposed technique.

Although server (VM) consolidation can greatly lower a datacenter's total energy consumption, there is still room for further energy savings due to the limitations and overheads associated with it. For one, it is difficult to conduct server consolidation very frequently because the migration of tasks or VMs causes high overheads, e.g., heavy network traffic, high network latency, and long system boot times, plus the large energy consumption required to move virtual machines and their local contexts around. Because of these overheads, there is a relatively long period between server consolidation decision times. To avoid SLA violations during each period in which VM-to-server assignments are fixed, virtual machines (or tasks) are not too tightly consolidated onto the active server set, so as to provide a safe margin of operation. Too aggressive a server consolidation strategy will result in violations of client SLAs.
The longer the period between migrations, the larger the aforesaid margin should be (i.e., more server machines should be activated, each at a lower average utilization rate). Hence, server machines are still under-utilized after datacenter-level server consolidation, which implies that there is potential for further energy savings through additional resource management techniques.

There are a number of resources in a server machine, such as computing, storage, and I/O bandwidth. This study focuses on the computing resource, i.e., the CPU, which is a major energy consumer. A well-known and popular energy-aware CPU management technique is dynamic voltage and frequency scaling (DVFS) [7-9], which attempts to match the performance of each ON server to the current workload so that energy can be saved at the workload level of each server. DVFS was introduced decades ago and has been one of the most effective power saving techniques for CPUs. The energy savings achievable by DVFS, however, are decreasing for the following reasons. First, supply voltages have already become quite low; hence, the remaining headroom for further supply voltage reduction is small and shrinking. Second, many modern servers have two or more processor chips, each chip containing multiple CPUs (cores¹) but a single on-chip power distribution network shared by all the CPUs. Because of this sharing, the CPUs on the same package must operate at the same supply voltage level and hence the same clock frequency². Unless we do 'perfect' load balancing among CPUs sharing the same power bus, the voltage level set for the most highly loaded CPU will result in energy waste, because all the other under-utilized CPUs run at a higher frequency than is actually needed. Third, in a virtualized server system it is difficult to gather sufficient information about the running applications, which is necessary to choose the optimal clock frequency and voltage level for the CPUs. This is because the virtual machine manager (hypervisor), which conducts DVFS, resides in a privileged domain, whereas the applications run in a different domain (a virtual machine domain) [10].

As DVFS becomes a less effective technique, we propose CPU consolidation as a complementary technique to reduce CPU power consumption. The idea of CPU consolidation is quite simple: turn off as many unused CPUs as possible. However, it is difficult to anticipate how CPU consolidation impacts power consumption and performance. Many existing works evaluate CPU consolidation using simulation data; we believe, however, that extensive experiments on a real server system are essential for evaluating CPU consolidation, because simulation cannot be very accurate. We set up a server system that is close to the real server machines in modern datacenters and ran extensive experiments. From the data, we report how CPU consolidation affects power and performance. In addition, we suggest guidelines that maximize the energy savings of CPU consolidation with minimum performance degradation.

¹ The terms 'CPU' and 'core' are used interchangeably in this thesis.
² Some processors are capable of independent DVFS among cores, while Intel® processors are not. Intel® Xeon® processors are used for this study.

1.3 THESIS OUTLINE

This thesis consists of three chapters. The research motivation and our methodology are introduced in Chapter 1. We propose two resource consolidation techniques which are complementary to each other.
First, a hierarchical and stochastic VM consolidation technique is proposed in Chapter 2; in that chapter we introduce a stochastic VM consolidation technique based on portfolio theory and evaluate it. Next, another technique, CPU consolidation, is proposed in Chapter 3; in that chapter we present extensive experimental results that show how to maximize the energy efficiency improvement achievable through CPU consolidation.

Chapter 2. HIERARCHICAL, PORTFOLIO THEORY-BASED VIRTUAL MACHINE CONSOLIDATION IN A CLOUD COMPUTING SYSTEM

2.1 RELATED WORKS

Much of the existing work proposes deterministic methods of consolidation, assuming that the resource demands are known precisely and given as scalar values [11]. This assumption, however, is generally invalid. First, the estimated resource demands must represent actual demands over relatively long periods of time, ranging from a few minutes to a few hours [12]. This is because VM migration, an essential part of consolidation, cannot be done too frequently due to its overhead; in fact, even live migration, a very efficient migration method, causes service downtime of a few hundred milliseconds [12]. In practice, the instantaneous resource demands vary during such a long period of time. If the VM resource demands are modeled as scalar values, then they must be set to very large values in order to account for the worst case; clearly, this is unnecessary and wasteful. Second, some types of resource demands are very bursty, with rates that vary greatly and rapidly over time; consider, e.g., the problem of packing multiple connections together on a link when each connection is bursty [13]. In such a case it has proven useful to rely on the notion of an effective bandwidth for bursty connections and to model this bandwidth as a random variable with an expected mean and standard deviation (estimated by dynamic profiling of the connection). Therefore, it can be inappropriate to characterize VM resource demands by fixed values; instead, we suggest characterizing the demands by random variables (RVs).

There are prior studies that treat resource demands as random variables [14-18]. These works formulate the energy-aware resource allocation problem as the stochastic bin packing (SBP) problem, which states that items with sizes following some probability distribution must be packed into the minimum number of bins such that the probability of violating any bin's size is below a given threshold. For example, Meng et al. [14], who focus on network bandwidth as the resource in question, start off with two assumptions: 1) items (VMs with known network bandwidth demands) are independent RVs following a normal distribution, and 2) bins (physical machines, PMs, with known network capacities) are identical. The authors subsequently present an algorithm that solves the SBP based on the notion of the equivalent size of an item, which in general depends on the other items packed into the same bin. Similarly, Breitgand and Epstein [17] consider consolidating VMs onto the minimum number of PMs where the physical network (e.g., the network interface link) is the bottleneck. In their formulation, each VM has a probabilistic guarantee (derived from its service level agreement, SLA) of being granted its required bandwidth. The problem is set up as an SBP problem in which the bandwidth demand of each VM is treated as a RV following a normal distribution.
The authors also assume the RVs are independent of one another. In contrast, Ming et al. and Xiaoqiao et al. assume neither identical bin sizes nor independence of the RVs, but consider only one resource type (i.e., the CPU) [15, 16].

The aforesaid assumptions are generally invalid. First, VMs can be strongly correlated, and therefore their resource demands can be dependent on one another. For example, a web service request (such as a Google query) may be served by many VMs simultaneously (i.e., load balancing), resulting in correlation among the resource demands of these VMs. Kahn et al. [19] collected CPU utilization traces from 3,019 VMs and found correlated behaviors across different servers. Second, it is important to consider multiple resource types (e.g., CPU, network bandwidth, disk space, and memory size), since any one of them may become a performance bottleneck (and the bottleneck can change over time). Third, a typical datacenter consists of heterogeneous PMs [20]; hence, it is inappropriate to assume that all PMs are identical (the same bin size).

Therefore, in this chapter we first formulate the consolidation problem as a multi-capacity stochastic bin packing (MCSBP) problem³, which is NP-hard. Next we present a hierarchical and highly scalable heuristic method to solve the MCSBP problem.

³ Note that the multi-capacity bin packing problem differs from the multi-dimensional bin packing problem. A detailed comparison is given in Section 2.4.2.

2.2 RESOURCE DEMAND MODEL AND PORTFOLIO EFFECT

2.2.1 Resource Demand Model

This study assumes that the resource demands of VMs are specified as RVs. These demands depend on the characteristics and computing needs of the applications running on the VMs. If the cumulative distribution function (cdf) of a RV is known, the minimum amount of resource allocation needed to meet a target quality of service (QoS) can be estimated from this cdf. The QoS may be specified as the probability that the aggregate resource demand of a VM does not exceed the resource capacity of the PM where the VM is deployed by more than a certain level, e.g., 5%, which is referred to as a 95% QoS. For example, the minimum amount of resource allocation satisfying a 95% QoS target can be calculated as cdf⁻¹(0.95).

The cdf, however, is unknown in many cases. Without knowledge of the cdf, the minimum amount of resource allocation can be estimated using Cantelli's inequality [21], the single-tailed variant of Chebyshev's inequality:

$$P\left(X \geq \mu_X + \beta\sigma_X\right) \leq \frac{1}{1+\beta^2}, \quad \beta \geq 0, \qquad (1)$$

where $\mu_X$ and $\sigma_X$ are the mean and standard deviation of the RV $X$, respectively. According to Cantelli's inequality, the minimum amount of resource needed to meet the target QoS can be estimated as

$$\mu_X + \beta\sigma_X, \quad \beta = 4.4 \text{ for a 95\% QoS target.} \qquad (2)$$

This inequality holds for any RV regardless of its distribution, so it does not give a tight bound and may cause resource overbooking. If more information about the RV is available (e.g., its cdf), less resource can be assigned while meeting the same QoS. For example, if a RV is known to be normally distributed, β can be set as low as 1.7, which is much smaller than what Cantelli's inequality gives (i.e., β = 4.4). According to the central limit theorem (CLT), the mean of a sufficiently large number of independent RVs, each with finite mean and variance, approximately follows a normal distribution [21]. The CLT holds even for weakly dependent RVs; hence, we can use the smaller β (i.e., 1.7) if a RV is the sum of a large number of weakly dependent RVs.
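To make (1) and (2) concrete, the following Python sketch computes the minimum allocation $\mu_X + \beta\sigma_X$ for a given QoS target, deriving β from Cantelli's inequality in the general case and using the thesis's β = 1.7 when the demand can be assumed approximately normal. The function name and interface are illustrative only, not part of the proposed system.

```python
import math

def min_allocation(mu, sigma, qos=0.95, normal=False):
    """Minimum resource allocation mu + beta*sigma for a QoS target.

    Cantelli's inequality (1) gives P(X >= mu + beta*sigma) <= 1/(1+beta^2)
    for any distribution, so beta = sqrt(qos/(1-qos)).  If the demand is
    (approximately) normal, a much smaller beta suffices.
    """
    if normal:
        beta = 1.7  # slightly conservative Gaussian 95th-percentile factor
    else:
        beta = math.sqrt(qos / (1.0 - qos))  # sqrt(0.95/0.05) ~ 4.4 per (2)
    return mu + beta * sigma

print(min_allocation(50, 10))                # Cantelli: 50 + 4.36*10 ~ 93.6
print(min_allocation(50, 10, normal=True))   # normal:   50 + 1.7*10  = 67.0
```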
At worst, i.e., even when we cannot assume that the CLT holds, we can still estimate the proper amount of resource needed to avoid performance degradation using Cantelli's inequality.

2.2.2 Multiple Resource Types

This study handles multiple resource types, e.g., CPU, memory, network bandwidth, and so on. If the demand for each resource type were modeled as a separate RV, there would be too many RVs, making the model overly complicated and the resulting problem hard to solve. One difficulty is that the correlation coefficient matrix becomes too big, which leads to large memory usage as well as longer running times. Instead, only the workload intensity of a VM is modeled as a RV, and the resource demands of the VM are calculated as linear functions of that RV $X_n$:

$$R_n^1 = a_n^1 X_n + b_n^1, \;\;\ldots,\;\; R_n^K = a_n^K X_n + b_n^K, \qquad (3)$$

where $X_n$ is a RV modeling the workload intensity of the $n$th VM ($VM_n$), $R_n^k$ is the demand for the $k$th resource type, and $a_n^k$ and $b_n^k$ are regression coefficients. Workload intensity is an indicator of how many jobs are assigned to a VM; it may be defined simply as the average number of jobs assigned per second. This linear model is reasonable because there is a correlation between the workload intensity and the resource demands. If a resource type is only weakly correlated with the workload intensity (e.g., memory), a relatively small value of $a_n^k$ and a large value of $b_n^k$ can be chosen.

2.2.3 Portfolio Effect

Modern portfolio theory (MPT) is a financial theory that attempts to maximize the expected return of a portfolio of assets for a given amount of risk, or equivalently to minimize the risk for a given level of expected return, by carefully choosing the proportions of the various assets. MPT models the return of an asset as a RV and uses the standard deviation of the return as a proxy for risk, which is valid if asset returns are jointly normally distributed (or otherwise elliptically distributed). MPT models a portfolio as a weighted combination of assets, so that the return of the portfolio is the weighted combination of the asset returns. By combining different assets whose returns are not perfectly positively correlated (i.e., whose correlation coefficients are less than 1), MPT seeks to reduce the total risk [22].

MPT reduces risk by investing in multiple assets (a portfolio) rather than a single asset. This is possible because diversification minimizes the volatility of the investment, which is called the portfolio effect. More formally, the portfolio effect states that the risk (standard deviation of return) of a portfolio is always less than or equal to the sum of the individual asset risks:

$$\sigma_Y^2 = \sum_i \sigma_{X_i}^2 + \sum_{i \neq j} \rho_{ij}\,\sigma_{X_i}\sigma_{X_j}, \qquad \sigma_Y \leq \sum_i \sigma_{X_i}, \qquad (4)$$

where $Y = \sum_i X_i$ and $\rho_{ij}$ is the correlation coefficient between $X_i$ and $X_j$ ($-1 \leq \rho_{ij} \leq 1$). The degree of risk reduction is a function of the correlation coefficients: the smaller $\rho_{ij}$ is, the lower the risk (cf. Fig 1). In other words, one should avoid putting highly positively correlated assets into the same portfolio.

Fig 1. Effect of correlations between RVs on a standard deviation (the normalized standard deviation of $X_1 + X_2$ as a function of the correlation coefficient $\rho_{ij}$)

As shown in (2), the minimum resource allocation is proportional to both the mean and the standard deviation of the resource demand; hence, a reduction in the standard deviation also reduces the amount of allocated resource. The proposed method exploits the portfolio effect to minimize resource allocation; consequently, the energy cost decreases because fewer PMs are utilized (we call these PMs 'active' PMs from now on).
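As a quick illustration of (4), the sketch below computes the standard deviation of an aggregate demand from per-VM standard deviations and a correlation matrix. The six σ values are taken from the example that follows; the identity matrix encodes the uncorrelated case and the all-ones matrix the perfectly correlated case.

```python
import numpy as np

def aggregate_std(sigma, rho):
    """sigma_Y from (4): sqrt( sum_i sum_j rho_ij * sigma_i * sigma_j )."""
    sigma = np.asarray(sigma, dtype=float)
    return float(np.sqrt(sigma @ rho @ sigma))

sigma = [55, 45, 35, 25, 15, 5]               # the six VMs of the example below
print(aggregate_std(sigma, np.eye(6)))        # ~84.6, far below sum(sigma) = 180
print(aggregate_std(sigma, np.ones((6, 6))))  # = 180 when all rho_ij = 1
```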
An example is provided next.

Fig 2. Portfolio effect example #1 (six VMs with means μ₁..μ₆ = 5, 15, 25, 35, 45, 55 and standard deviations σ₁..σ₆ = 55, 45, 35, 25, 15, 5)

Suppose there are six VMs whose resource demands (mean and standard deviation) are as shown in Fig 2.

Fig 3. Portfolio effect example #2 (Case 1: PM₁ hosts VM₁-VM₄ with μ = 80, σ = 83.1, and PM₂ hosts VM₅-VM₆ with μ = 100, σ = 15.8. Case 2: PM₁ hosts VM₄-VM₆ with μ = 135, σ = 29.6, and PM₂ hosts VM₁-VM₂ with μ = 20, σ = 71.1, leaving VM₃ with μ = 25, σ = 35 in need of another PM.)

Fig 3 depicts and compares two VM deployments. For the sake of simplicity, β is set to 1. Moreover, it is assumed that there are no correlations among the VMs (i.e., $\rho_{ij} = 0$ for all i and j). Suppose only two PMs are available, with capacities 170 and 120, respectively. The first deployment (Case 1) is able to place all six VMs on these two PMs. The second deployment (Case 2), on the other hand, needs an additional PM to serve all six VMs. This example illustrates that the VM deployment can affect the number of active PMs. The difference in the number of active PMs between two deployments can be even larger if the correlations among VM resource demands (RVs) are not zero, which is the more realistic situation. Therefore, a sophisticated VM deployment method has to be devised to minimize the number of active PMs and thereby reduce the overall energy cost.

2.3 PROBLEM STATEMENT – MULTI-CAPACITY STOCHASTIC BIN PACKING OPTIMIZATION

Say there are M PMs, N VMs, and K resource types. Each PM has a scalar (available) capacity limit for each resource type; the capacity of the $k$th resource of the $m$th PM ($PM_m$) is denoted $r_m^k$. The ON/OFF state of $PM_m$ is captured by a pseudo-Boolean variable $f_m$. In addition, the non-deterministic workload intensity of $VM_n$ is specified as $X_n$, whose mean $\mu_n$ and variance $\sigma_n^2$ are known⁴. Recall that the demands of a VM for each resource type are given as linear functions (3) of the VM's workload intensity. Let $\rho_{ij}$ denote the correlation coefficient between $X_i$ and $X_j$. An assignment variable $e_{nm}$ is 1 if $VM_n$ is assigned to $PM_m$, and 0 otherwise. The minimum amount of resource needed to achieve the given QoS target is obtained from (2) with $\mu_{R_n^k}$ and $\sigma_{R_n^k}$ given by

$$\mu_{R_n^k} = a_n^k \mu_n + b_n^k, \qquad \sigma_{R_n^k} = a_n^k \sigma_n. \qquad (5)$$

⁴ More precisely, the mean and variance should be denoted by $\mu_{X_n}$ and $\sigma_{X_n}^2$; for the sake of simplicity, the simpler notation is used.

The minimum resource assignment (MRA) problem can then be formulated as the following multi-capacity stochastic bin packing (MCSBP) problem:

$$\begin{aligned}
\min \;\; & \sum_{m=1}^{M} f_m \sum_{k=1}^{K} w^k r_m^k \\
\text{s.t.} \;\; & e_{nm} \in \{0,1\} \quad \forall n \in \{1,\ldots,N\},\; \forall m \in \{1,\ldots,M\} \\
& f_m \in \{0,1\} \quad \forall m \in \{1,\ldots,M\} \\
& \textstyle\sum_{m=1}^{M} e_{nm} = 1 \quad \forall n \in \{1,\ldots,N\} \\
& f_m \geq e_{nm} \quad \forall n,\; \forall m \\
& VAR_m^k = \textstyle\sum_{i=1}^{N}\sum_{j=1}^{N} e_{im}\,e_{jm}\,\rho_{ij}\,a_i^k a_j^k\,\sigma_i\sigma_j \\
& \textstyle\sum_{n=1}^{N} e_{nm}\left(a_n^k\mu_n + b_n^k\right) + \beta\sqrt{VAR_m^k} \leq r_m^k \quad \forall m \in \{1,\ldots,M\},\; \forall k \in \{1,\ldots,K\}
\end{aligned} \qquad (6)$$

It is assumed that the values of the following parameters are given: $\mu_n$, $\sigma_n$, $w^k$, $a_n^k$, $b_n^k$, $\rho_{ij}$, and $r_m^k$. The objective is to assign VMs to PMs (i.e., determine the values of the $e_{nm}$ variables) so as to minimize the total amount of allocated resource while meeting the target QoS for each VM. If all PMs are homogeneous, the objective is simply to minimize the number of active PMs, i.e., $\sum_m f_m$ where $f_m = \min\left(1, \sum_{n=1}^{N} e_{nm}\right)$.
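For illustration, the chance constraint of (6) for a single PM and a single resource type can be checked as in the following sketch. The tuple layout of the VM parameters is an assumption made here for brevity; a full implementation would repeat the check for all K resource types.

```python
import math
import numpy as np

def fits(vms, rho, capacity, beta=4.4):
    """Check the capacity constraint of (6) for one PM and one resource type.

    vms: list of (a, b, mu, sigma) per VM placed on this PM;
    rho: correlation matrix over these VMs; capacity: r_m^k of the PM.
    """
    mean = sum(a * mu + b for a, b, mu, _ in vms)        # sum of mu_{R_n^k}
    s = np.array([a * sigma for a, _, _, sigma in vms])  # per-VM a^k * sigma_n
    var = float(s @ rho @ s)                             # VAR_m^k of (6)
    return mean + beta * math.sqrt(var) <= capacity
```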
A PM is said to be 'active' if at least one VM is deployed on it. This objective is similar to the one used in classical bin packing (with identical bin sizes). In order to account for the heterogeneity of the PMs, however, a new objective function is chosen: it minimizes the sum of the total resource capacities of all active PMs (i.e., $\sum_m f_m \sum_k w^k r_m^k$). Note that the parameter $w^k$ is used to normalize the cost of the different resource types. This objective function forces the resource capacity of an active PM to be maximally utilized; the formulation thus implicitly favors a solution with as few active PMs as possible (i.e., server consolidation).

There are two important constraints to be met: 1) every VM must be deployed on a PM, and 2) the aggregate resource demands of the VMs on a PM must not exceed the resource capacity of that PM. The second constraint must hold in order to meet the SLA.

This MCSBP problem is a variation of the bin packing (BP) problem, which is known to be NP-hard [21]. The MCSBP problem is also NP-hard, as proved next.

Theorem 1. The MCSBP problem is NP-hard.

Proof: Consider a special case of the MCSBP problem in which the standard deviation of the workload intensity is zero, the capacity of the PMs (the size of the bins) is constant, and only one resource type is considered, that is:

$$\sigma_n = 0, \qquad r_m^k = \begin{cases} R & \text{if } k = 1 \\ 0 & \text{otherwise,} \end{cases} \qquad a_n^k\mu_n + b_n^k = \begin{cases} R_n^1 & \text{if } k = 1 \\ 0 & \text{otherwise.} \end{cases} \qquad (7)$$

For this special case, the MCSBP problem becomes the classical bin packing problem:

$$\begin{aligned}
\min \;\; & \sum_{m=1}^{M} f_m \\
\text{s.t.} \;\; & e_{nm} \in \{0,1\} \quad \forall n \in \{1,\ldots,N\},\; \forall m \in \{1,\ldots,M\} \\
& f_m \in \{0,1\} \quad \forall m \in \{1,\ldots,M\} \\
& \textstyle\sum_{m=1}^{M} e_{nm} = 1 \quad \forall n \in \{1,\ldots,N\} \\
& f_m \geq e_{nm} \quad \forall n,\; \forall m \\
& \textstyle\sum_{n=1}^{N} e_{nm} R_n^1 \leq R \quad \forall m \in \{1,\ldots,M\}
\end{aligned} \qquad (8)$$

The BP problem, therefore, is reducible to the MCSBP problem. Because the BP problem is known to be NP-hard, the MCSBP problem is also NP-hard. □

The fact that the MCSBP problem is NP-hard motivates the development of an efficient heuristic (non-optimal) method to solve it.

2.4 HIERARCHICAL RESOURCE MANAGEMENT SOLUTION

In this section the main idea and the algorithms of the hierarchical resource management are introduced. Modern datacenters consist of many server clusters, and the number of PMs (servers) in any cluster is bounded because the cluster has hard limits on the peak power it can consume and the peak network bandwidth it can sustain. Therefore, a larger datacenter has more clusters, but typically not bigger clusters. A common interconnect topology in today's datacenters is depicted in Fig 4; the figure shows a two-level topology, although there could easily be more than two such levels [23].

As depicted in Fig 4, the proposed solution consists of two distinct resource managers: global and local. The global manager first assigns VMs to a cluster, and the local manager then deploys the VMs to PMs within the cluster. A key advantage of the proposed solution is that it splits a large problem into a number of small problems that are independent of each other. Because of their small size and independence, we can apply more sophisticated and elaborate solution approaches to each smaller problem.

As is well known, although a hierarchical approach is typically more efficient and scalable, the quality of its result may be worse than that of a flat (non-hierarchical) method. Indeed, if the original problem is not divided into subproblems well and/or the solution-merging step is of poor quality, the overall result may be so poor that it becomes unacceptable.
Hence, a lot of care must be taken when developing a hierarchical solution.

Fig 4. A proposed hierarchical resource management (the global manager assigns VMs to clusters so as to maximize the portfolio effect; each local manager balances resource allocation across the PMs of its cluster)

The approach used by the global manager is quite different from that of the local manager. The global manager tries to maximize the portfolio effect, i.e., it deploys the least correlated VMs to the same cluster. The local manager, on the other hand, aims for balanced usage of the different resource types within a PM. A more detailed explanation of these managers is provided in the following sections.

2.4.1 Global Resource Manager

The global resource manager is responsible for assigning VMs to clusters. In our problem formulation, the objective function to be minimized is the gross resource usage of the clusters (10). The resource usage of a cluster is defined as the sum of the resource allocations of the VMs running on that cluster:

$$u_c^k = \mu_c^k + \beta\sigma_c^k, \quad \text{where } \mu_c^k = \sum_{n=1}^{N} g_{nc}\left(a_n^k\mu_n + b_n^k\right) \text{ and } \left(\sigma_c^k\right)^2 = \sum_{i=1}^{N}\sum_{j=1}^{N} g_{ic}\,g_{jc}\,\rho_{ij}\,a_i^k a_j^k\,\sigma_i\sigma_j, \qquad (9)$$

where $g_{nc} = 1$ if $VM_n$ is deployed on the $c$th cluster and 0 otherwise. The VM-to-cluster assignment problem is then

$$\min \sum_{c=1}^{C}\sum_{k=1}^{K} u_c^k = \min \sum_{c=1}^{C}\sum_{k=1}^{K}\left(\mu_c^k + \beta\sigma_c^k\right) \quad \text{s.t.} \quad u_c^k \leq \sum_{m=1}^{M} h_{mc}\,r_m^k, \qquad (10)$$

where $h_{mc} = 1$ if $PM_m$ resides in the $c$th cluster and 0 otherwise. Note that the sum of the means of all resource-type allocations over all clusters (i.e., $\sum_{c=1}^{C}\sum_{k=1}^{K}\mu_c^k$) is constant regardless of the assignment of VMs to clusters. The sum of the standard deviations (i.e., $\sum_{c=1}^{C}\sum_{k=1}^{K}\sigma_c^k$), on the other hand, is affected by the VM deployment because of the non-zero correlation coefficients between the resource demands of different VMs. Intuitively, this objective function leads to the following: first, only VMs that are least correlated with one another (i.e., with small positive, zero, or large negative correlations) are assigned to the same cluster, which in turn results in the least amount of resource being allocated (and eventually used); second, the local manager can then assume that VMs are uncorrelated, which makes the local optimization problem simpler.

Some VMs may be correlated with one another; for example, multiple VMs may be spawned by the same application, or VMs may correspond to different tiers of a multi-tiered application. We consider two different situations: 1) all VMs are uncorrelated, and 2) some VMs are correlated with one another. Because the uncorrelated case is simpler, we analyze and solve it first in order to gain some useful intuition; subsequently, we extend the algorithm to the correlated case, which is more realistic. The key idea of the algorithm is based on the following proposition [24]:

Proposition 1. Suppose we have N items and C bins. The size of every item is 1 (constant) and the cost of item n is $\sigma_n$ (n = 1, 2, ..., N). The size of the $c$th bin is $r_c$ (an integer), and the total size of all bins equals the number of items ($\sum_{c=1}^{C} r_c = N$). The items are sorted in non-increasing order of their costs ($\sigma_i \geq \sigma_j$ for $i < j$), and the bins are sorted in non-increasing order of their sizes ($r_i \geq r_j$ for $i < j$). Let $S_c$ be the set of items put into the $c$th bin. The overall cost to be minimized is defined as

$$\text{cost} := \sum_{c=1}^{C} \sqrt{\sum_{n \in S_c} \sigma_n^2}. \qquad (11)$$

This cost is minimized when the bigger bins contain the items with higher costs, that is,

$$\sigma_a \geq \sigma_b \quad \text{for } a \in S_i \text{ and } b \in S_j,\; i < j. \qquad (12)$$

Proof: We first consider the simple case where there are only two bins (i.e., C = 2); this will be generalized later.
The initial deployment of items is $S_1 = \{\sigma_1, \ldots, \sigma_{r_1}\}$ and $S_2 = \{\sigma_{r_1+1}, \ldots, \sigma_N\}$, and the cost of this initial deployment is

$$\text{cost} = \sqrt{\sum_{n=1}^{r_1} \sigma_n^2} + \sqrt{\sum_{n=r_1+1}^{N} \sigma_n^2}.$$

Now each set is split into two subsets, $S_1 = S_A \cup S_B$ and $S_2 = S_C \cup S_D$, such that the size of $S_A$ is the same as that of $S_C$. The cost can be rewritten as

$$\text{cost} = \sqrt{V_A + V_B} + \sqrt{V_C + V_D}, \quad \text{where } V_* = \sum_{n \in S_*} \sigma_n^2.$$

Note that $V_A \geq V_C$ and $V_B \geq V_D$: every member of $S_A$ is greater than or equal to every member of $S_C$, every member of $S_B$ is greater than or equal to every member of $S_D$, the size of $S_A$ equals that of $S_C$, and the size of $S_B$ is greater than or equal to that of $S_D$. If set $S_A$ is swapped with set $S_C$, the new cost is $\text{cost}' = \sqrt{V_C + V_B} + \sqrt{V_A + V_D}$. To compare the costs easily, we check whether the difference of the squared costs is positive or negative:

$$\text{cost}'^2 - \text{cost}^2 = 2\left(\sqrt{(V_C + V_B)(V_A + V_D)} - \sqrt{(V_A + V_B)(V_C + V_D)}\right) \geq 0,$$

which holds because $(V_C + V_B)(V_A + V_D) - (V_A + V_B)(V_C + V_D) = (V_A - V_C)(V_B - V_D) \geq 0$.

Hence $\text{cost}'^2 - \text{cost}^2 \geq 0$ and $\text{cost}' \geq \text{cost}$, which means the initial deployment (12) is the optimal solution in terms of minimum cost. This result can be generalized: for the case of more than two bins (i.e., C > 2), any sequence of swaps among a number of bins can be converted into a sequence of swaps between two bins. Hence, the proposition is true in the general case. □

According to the proposition, we minimize the total resource allocated to the clusters running the given set of VMs subject to the SLA requirement (10) when a VM with larger cost (i.e., the summation over all resource types of the standard deviations of its resource demands) is deployed on a cluster whose size (i.e., the total resource capacity of the PMs in the cluster) is bigger. This proposition, however, does not perfectly fit our problem: it assumes that the size of the items is constant (i.e., the same amount of resource demand for all VMs), so some modification is required, as explained below.

The modification is based on the following intuition. Assume that an item is already in a bin and will be replaced by other items with smaller sizes. For the sake of simplicity, assume the substitutes are identical (i.e., their sizes and costs are all the same) and that the ratio of the original item's size to a substitute item's size is an integer. Then it can be shown that the substitution decreases the overall cost (11) if the following inequality holds:

$$\frac{\sigma_{\mathrm{original}}^2}{size_{\mathrm{original}}} \geq \frac{\sigma_{\mathrm{substitute}}^2}{size_{\mathrm{substitute}}}. \qquad (13)$$

Hence, the cost of $VM_n$ is redefined as

$$\text{cost}_n := \frac{\left(\sum_{k=1}^{K} \sigma_{R_n^k}\right)^2}{\sum_{k=1}^{K} \mu_{R_n^k}} = \frac{\left(\sum_{k=1}^{K} a_n^k\sigma_n\right)^2}{\sum_{k=1}^{K}\left(a_n^k\mu_n + b_n^k\right)}. \qquad (14)$$

The proposed VM-to-cluster algorithm assigns VMs to clusters based on this cost.
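A small sketch of the cost (14) and the resulting assignment order follows; the dictionary layout of a VM record is hypothetical, chosen only to keep the example self-contained.

```python
def vm_cost(a, b, mu, sigma):
    """Cost (14): (sum_k a_n^k * sigma_n)^2 / sum_k (a_n^k * mu_n + b_n^k)."""
    num = sigma * sum(a)
    return num * num / sum(ak * mu + bk for ak, bk in zip(a, b))

def sort_by_cost(vms):
    """High-cost VMs first, so they land in the largest clusters
    (Proposition 1 combined with the size correction (13))."""
    return sorted(vms, reverse=True,
                  key=lambda v: vm_cost(v['a'], v['b'], v['mu'], v['sigma']))
```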
Pseudo code for the preliminary version of the VM-to-cluster algorithm (VM2C, for short) used by the global resource manager is provided in Fig 5. Its main structure is similar to that of the classical First Fit Decreasing (FFD) heuristic [25]. The algorithm starts by sorting the clusters and the VMs in non-increasing order (lines 1 and 2) of cluster size and VM cost (14), respectively. The size of a cluster is calculated by aggregating the resource capacities of all PMs in the cluster:

$$size_{\mathrm{cluster}} = \sum_{m=1}^{M}\sum_{k=1}^{K} h_{mc}\, r_m^k. \qquad (15)$$

For each cluster, the algorithm pre-assigns the VM with the largest cost (14) among all unassigned VMs (line 5). It then calculates the total amount of resource that the cluster would have to provide ($u_c^k$) and compares it with the capacity of the cluster (15). If the cluster cannot provide enough resources to the VM, the assignment is canceled (line 7). The above steps are repeated until either all VMs are assigned or all clusters are full.

VM-to-cluster Algorithm (VM2C): uncorrelated
Inputs: $r_m^k$, $h_{mc}$, $\mu_n$, $\sigma_n$, $a_n^k$, and $b_n^k$. Output: $g_{nc}$
1: sort clusters in non-increasing order of their size (15)
2: sort VMs in non-increasing order of their cost (14)
3: for each cluster C do
4:   for each unassigned VM do
5:     g_nc = 1 // assign VM_n to cluster C
6:     if ∃k: u_c^k > Σ_m h_mc r_m^k then
7:       g_nc = 0 // cancel the assignment
8:     end if
9:   end for
10: end for
Fig 5. VM-to-cluster algorithm (preliminary version)

The above algorithm must be extended to support the correlated case. Dealing with correlations requires a large amount of computation, because the complexity of the correlation calculation is quadratic in the number of VMs; finding the optimal solution takes a huge amount of time when there are many VMs. Hence, we present a heuristic algorithm. The main idea is that VMs are assigned to clusters one at a time, and the best VM is selected in a greedy manner, as explained next.

Suppose N VMs have already been assigned to some cluster and $VM_x$ is considered for assignment to that cluster next. The variance of the cluster after adding $VM_x$ is

$$\left(\sigma_c^k\right)^2 = \sum_{i=1}^{N}\sum_{j=1}^{N} g_{ic}\,g_{jc}\,\rho_{ij}\,a_i^k a_j^k\,\sigma_i\sigma_j + 2\sum_{n=1}^{N} g_{nc}\,\rho_{nx}\,a_n^k a_x^k\,\sigma_n\sigma_x + \left(a_x^k\sigma_x\right)^2. \qquad (16)$$

Therefore, the increase in variance ($\Delta_c^k$) caused by assigning $VM_x$ to this cluster is

$$\Delta_c^k = 2\sum_{n=1}^{N} g_{nc}\,\rho_{nx}\,a_n^k a_x^k\,\sigma_n\sigma_x + \left(a_x^k\sigma_x\right)^2. \qquad (17)$$

An overhead is defined as the ratio of $\Delta_c^k$ summed over all resource types to the total resource demand of $VM_x$:

$$overhead_x := \frac{\sum_{k=1}^{K} \Delta_c^k}{\sum_{k=1}^{K}\left(a_x^k\mu_x + b_x^k + \beta\,a_x^k\sigma_x\right)}. \qquad (18)$$

The complete version of the VM2C algorithm, which deals with correlations (Fig 6), is similar to the preliminary version (Fig 5) except in a few places. This algorithm considers the first W VMs as candidates for allocation and chooses the one with the lowest overhead (18) among these candidates. If there is no space available in the cluster for that VM, the algorithm abandons the first choice and tries the second-best candidate, and so on. If none of the W candidates can be assigned to the cluster, the cluster is regarded as fully assigned (line 14). In this algorithm the window size (W) must be chosen carefully: a larger W tends to produce higher-quality results, but it also requires more computing power and time. We show how to choose the window size (W) in Section 2.6.2.

VM-to-cluster Algorithm (VM2C): correlated
Inputs: $r_m^k$, $h_{mc}$, $\mu_n$, $\sigma_n$, $a_n^k$, $b_n^k$, and $\rho_{ij}$. Output: $g_{nc}$
1: sort clusters in non-increasing order of their size (15)
2: sort VMs in non-increasing order of their cost (14)
3: for each cluster C do
4:   while true do
5:     sort the first W unassigned VMs in non-decreasing order of their overhead (18) and put them in the candidate list
6:     for VM_n in the candidate list do
7:       g_nc = 1 // assign VM_n to cluster C
8:       if ∃k: u_c^k > Σ_m h_mc r_m^k then
9:         g_nc = 0 // cancel the assignment
10:      else
11:        break // break the for loop
12:      end if
13:    end for
14:    if no candidate was assigned then break // break the while loop; the cluster is full
15:  end while
16: end for
Fig 6. VM-to-cluster algorithm (complete version)
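The greedy candidate-selection step of Fig 6 hinges on the overhead (18); a minimal sketch follows. Each VM is represented here as a dict with an 'id' index into the correlation matrix, a data layout assumed for this example rather than prescribed by the algorithm.

```python
def overhead(assigned, x, rho, beta=4.4):
    """Overhead (18) of adding VM x to a cluster holding 'assigned' VMs."""
    delta = 0.0
    for k in range(len(x['a'])):            # Delta_c^k from (17), summed over k
        delta += sum(2 * rho[n['id']][x['id']] * n['a'][k] * x['a'][k]
                     * n['sigma'] * x['sigma'] for n in assigned)
        delta += (x['a'][k] * x['sigma']) ** 2
    demand = sum(ak * x['mu'] + bk + beta * ak * x['sigma']
                 for ak, bk in zip(x['a'], x['b']))
    return delta / demand

def best_candidate(unassigned, assigned, rho, W):
    """Pick the lowest-overhead VM among the first W unassigned VMs (line 5)."""
    return min(unassigned[:W], key=lambda v: overhead(assigned, v, rho))
```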
2.4.2 Local Resource Manager

After the global manager assigns VMs to a cluster, the local resource manager deploys those VMs to the PMs in the cluster. The problem at the local level is identical to the original one (6), except that the problem size is much smaller. The local resource management problem is therefore also NP-hard, so we present a heuristic algorithm. Recall that the problem can be formulated as the MCSBP problem.

Note that a multi-capacity bin packing (MCBP) problem is different from the classical multi-dimensional bin packing (MDBP) problem; the difference is depicted in Fig 7. In this example, every bin and every item has a pair of (horizontal and vertical) capacity parameters. In the MDBP problem, an item can be put into a bin only if there is enough geometric space for the item (resulting in a compact packing of geometric objects in a bin). The MCBP problem differs in that any portion of the horizontal or vertical capacity can be used by only one item (resulting in a diagonal, non-overlapping arrangement of items in a bin).

Fig 7. Multi-dimensional vs. multi-capacity bin packing ((a) an example solution to MDBP; (b) an example solution to MCBP)

Roy et al. [26] propose several heuristics for solving the MCBP problem. The heuristics are simple, but they perform well only on problem instances with a great deal of slack between the component resource requirements (the sum of the item sizes) and the total system resources (the sum of all bin capacities); on such instances the heuristics were likely to find a solution if one existed. Otherwise, the success rate of these heuristics is low, e.g., less than 35% when the slack is between 0-5% and the item sizes are between 0 and 30. William et al. [27] present the permutation pack (PP) heuristic for solving the MCBP problem. The quality of results from the PP method is quite good, but its complexity increases exponentially as the number of resource types grows. The PP method works as follows. For each bin, it finds an item whose relative ordering of resource demands best matches the relative ordering of the remaining resources in the current bin. For example, say there are four resource types (i.e., K = 4), and denote the remaining capacity of the current bin and the size of an item by $a^k$ and $r_i^k$, respectively, where k is the index of the resource type. If the order of the remaining capacities of the bin is $a^1 > a^2 > a^3 > a^4$, then the best-fit item is one whose order of resource demands is $r_i^1 > r_i^2 > r_i^3 > r_i^4$. If there is no such item, the PP method searches for an item whose order of resource demands is $r_i^1 > r_i^2 > r_i^4 > r_i^3$, and so on. In the worst case, the PP method must repeat this process 4! times to identify the "fittest" item for the bin. Note that the PP method need not consider all resource types; more precisely, it may identify only the w (w < K) most important resource types when doing the search. In addition, it can sort the resource types to give more importance to "matching" one resource type versus another. (A sketch of the matching step is shown below.)
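The following sketch covers only the exact-match case of the PP matching step just described; the full method falls back through up to K! demand orderings when no exact match exists.

```python
def ranking(v):
    """Indices of v's components, sorted by decreasing value."""
    return tuple(sorted(range(len(v)), key=lambda k: -v[k]))

def pp_exact_match(bin_remaining, items):
    """Return the first item whose ordering of resource demands matches the
    ordering of the bin's remaining capacities, or None if none matches."""
    target = ranking(bin_remaining)
    for item in items:
        if ranking(item) == target:
            return item
    return None
```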
In this thesis we propose a heuristic algorithm that is better than the PP method in terms of both the quality of results and running time. The key idea of the proposed algorithm (called Minimum Normalized Vector Difference, or MNVD) is to achieve balanced resource allocation: deploy a VM to the PM whose available resources are most 'similar' to the VM's demand. A detailed explanation of the similarity metric is presented below. First, we define a remaining resource capacity vector of $PM_m$ (denoted by $\vec{a}_m$) and a resource demand vector of $VM_x$ (denoted by $\vec{r}_x$):

$\vec{a}_m := \langle a^1, a^2, \ldots, a^K \rangle, \qquad \vec{r}_x := \langle r^1, r^2, \ldots, r^K \rangle$   (19)

where $a^k$ is the residual amount of the k-th resource in $PM_m$ and $r^k$ is the increase in k-th resource usage caused by deploying $VM_x$ on $PM_m$:

$r^k = a_x^k \mu_x + b_x^k + \sqrt{\left(\sigma_m^k\right)^2 + \Delta_m^k} - \sigma_m^k, \quad \text{where } \Delta_m^k = 2\sum_{n=1}^{N} e_{nm}\, \rho_{nx}\, a_n^k \sigma_n\, a_x^k \sigma_x + \left(a_x^k \sigma_x\right)^2$   (20)

$e_{nm}$ is a Boolean indicator, which is 1 if $VM_n$ is running on $PM_m$ and 0 otherwise. The normalized vector difference (nvd) between $\vec{a}_m$ and $\vec{r}_x$ is defined as the Euclidean distance between the two unit-sized vectors:

$nvd_{xm} := d_{xm} = \left\| \dfrac{\vec{r}_x}{\|\vec{r}_x\|} - \dfrac{\vec{a}_m}{\|\vec{a}_m\|} \right\|$   (21)

where $\|\cdot\|$ denotes the Euclidean norm. The graphical representation of the nvd is shown in Fig 8. The algorithm selects the PM with the minimum nvd, which means that the resource demand of the VM is most similar to the remaining resource capacity of the PM.

Fig 8. Example of nvd calculation (K=3)

The pseudo code of the MNVD method is provided in Fig 9. This algorithm is similar to the classical best fit decreasing (BFD) algorithm; for us, the term 'best' means the smallest nvd (21). MNVD starts by sorting the PMs in the same cluster by their size in non-increasing order (line 1). For each PM, the algorithm assigns the best VM, i.e., the VM whose normalized resource demands have the least nvd from the available resource capacities of the PM (lines 2 through 6). Notice that if a VM cannot be deployed on the PM due to insufficient residual resources, the assignment of that VM is relegated to a later iteration of the for loop (line 2). If no VM can be found, the PM is already full and the algorithm moves on to the next PM (line 8). Notice that there may be no feasible solution at the local level. If the local manager cannot deploy some VMs to any PM in the current cluster, it sends the list of unassigned VMs to the global manager, which subsequently assigns them to other clusters. The steps are repeated until either all VMs are assigned or all PMs are full.
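Complementing the pseudo code in Fig 9 below, here is a minimal Python sketch of the nvd computation (21) and the PM-selection rule; representing demand and capacity vectors as plain lists of floats is an assumption made for the sketch.

```python
import math

def nvd(demand, remaining):
    """Normalized vector difference (21): Euclidean distance between the
    unit-length demand vector and the unit-length remaining-capacity vector."""
    dn = math.sqrt(sum(x * x for x in demand))
    rn = math.sqrt(sum(x * x for x in remaining))
    return math.sqrt(sum((d / dn - r / rn) ** 2
                         for d, r in zip(demand, remaining)))

def best_pm(demand, pms):
    """Among PMs with enough residual capacity, pick the smallest-nvd one."""
    feasible = [p for p in pms if all(d <= r for d, r in zip(demand, p))]
    return min(feasible, key=lambda p: nvd(demand, p), default=None)
```

For example, a VM with demand (2, 1) prefers a PM with remaining capacity (4, 2) over one with (2, 4): the former points in the same direction as the demand, so its nvd is zero.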
Minimum Normalized Vector Difference Algorithm (MNVD)
Inputs: $r_m^k$, $h_{mc}$, $g_{nc}$, $\mu_n$, $\sigma_n$, $a_n^k$, $b_n^k$, and $\rho_{ij}$
Output: $e_{nm}$
1: sort PMs in the same cluster ($h_{mc} = 1$) in non-increasing order of their available resource size ($\sum_k a_m^k$)
2: for each $PM_m$ in the sorted list do
3:   while true do
4:     find the best $VM_n$ with the smallest normalized vector difference (21)
5:     if $VM_n$ is found then
6:       $e_{nm} = 1$ // assign $VM_n$ to $PM_m$
7:     else
8:       break // break while loop. $PM_m$ is full
9:     end if
10:  end while
11: end for
Fig 9. MNVD algorithm for VM-to-PM assignment

2.4.3 Joint VM-to-Cluster and VM-to-PM Allocation Algorithm
A hierarchical solution is highly scalable, but the quality of its result may be worse than that of a combined (joint) solution. For the purpose of comparison, we have also developed a unified version of the VM allocation algorithm (called JOINT), which conceptually merges the VM2C (global) and MNVD (local) algorithms. The JOINT algorithm starts by sorting PMs and VMs in non-increasing order of their size and cost (14), respectively (lines 1 and 2). After that, the algorithm assigns the best VM to each PM as follows. First, it builds a candidate VM list by selecting the $W_2$ VMs whose overheads (18) are the smallest among the first $W_1$ VMs in the sorted list (line 5). Next, it chooses the VM with the smallest nvd and assigns that VM to the current PM.

Joint VM Allocation Algorithm (JOINT)
Inputs: $r_m^k$, $\mu_n$, $\sigma_n$, $a_n^k$, $b_n^k$, and $\rho_{ij}$
Output: $e_{nm}$
1: sort PMs in non-increasing order of resource capacity ($\sum_{k=1}^{K} r_m^k$)
2: sort VMs in non-increasing order of their cost (14)
3: for each $PM_m$ in the sorted list do
4:   while true do
5:     find the $W_2$ candidate VMs with the smallest overhead (18) from the first $W_1$ unassigned VMs in the sorted list
6:     find the best $VM_n$ with the smallest normalized vector difference (21)
7:     if $VM_n$ is found then
8:       $e_{nm} = 1$ // assign $VM_n$ to $PM_m$
9:     else
10:      break // break while loop. $PM_m$ is full
11:    end if
12:  end while
13: end for
Fig 10. JOINT algorithm for direct VM allocation
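The two-stage filtering at the core of JOINT (overhead first, then nvd) can be sketched in a few lines of Python. The helpers `overhead` and `nvd` are the ones sketched earlier, and the PM object with `cluster`, `fits`, and `remaining` attributes is an illustrative assumption rather than the thesis code.

```python
def joint_pick(pm, unassigned, rho, W1, W2):
    """JOINT candidate selection (lines 5-6 of Fig 10): among the first W1
    unassigned VMs (already sorted by cost), keep the W2 with the lowest
    overhead (18), then return the one whose demand vector has the smallest
    nvd (21) relative to the PM's remaining capacity."""
    window = sorted(unassigned[:W1],
                    key=lambda v: overhead(v, pm.cluster, rho))[:W2]
    feasible = [v for v in window if pm.fits(v)]
    if not feasible:
        return None  # the PM is full for every candidate
    return min(feasible, key=lambda v: nvd(v.demand, pm.remaining))
```

The first stage preserves the portfolio effect (low variance increase), while the second stage preserves balanced packing, which is why JOINT tends to dominate HVMC slightly in solution quality.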
2.5 ONLINE RESOURCE MANAGEMENT
The proposed algorithms seek to find the best VM-to-PM mappings considering only the resource requirements of the VMs and the available resource capacities of the PMs. That is, they do not consider the cost of VM migration, so they are more suitable for initial VM allocation. However, the resource requirements of VMs can change, and some VMs may need to migrate to another PM to avoid SLA violations and to reduce energy costs. More precisely, if a PM is over-utilized, the VMs running on it have a high probability of violating their SLAs. In addition, the energy costs can decrease if all VMs of an under-utilized PM are migrated away and that PM is shut down (this is called server consolidation). Because the VM migration overhead is non-negligible, we have to meet the SLAs and reduce the energy cost as much as possible while minimizing the migration overhead. In this section, an online resource management algorithm is introduced. This algorithm starts by migrating VMs running on over-utilized PMs to avoid SLA violations, because this is the most important task and has the highest priority. If no PM is over-utilized, VMs running on under-utilized PMs are migrated so that as many PMs as possible can be shut down; the goal is of course to reduce the overall energy costs. The proposed online algorithm is also hierarchical; hence, it is highly scalable.

2.5.1 Intra-cluster Migration
As shown in Section 2.4.2, there is a local resource manager in each cluster, and this local manager is also responsible for online VM reallocation within that cluster; hence we call this 'intra'-cluster migration. Every local manager can work in parallel, so this solution works efficiently even for very large datacenters (i.e., it is scalable). Intra-cluster migration is preferable because of its lower migration overhead compared to inter-cluster migration. In addition, it uses a network switch within the cluster, so it does not cause any network overhead for the other clusters. A potential drawback of intra-cluster migration is that the impact of a migration (in terms of meeting SLAs or reducing energy cost) can be smaller than that of an inter-cluster migration. To develop an effective online algorithm for VM reallocation, the migration overheads have to be quantified. The only available information about the VMs is their resource requirements, but this is not enough to estimate the overheads; for example, the network bandwidth requirement of a VM is not closely related to its migration overhead. Therefore, for the sake of simplicity, we assume that the migration overhead is the same for all VMs. The only distinction considered is between intra- and inter-cluster migration costs (where the latter is clearly larger than the former). The proposed algorithm consists of two parts: phase 1 and phase 2. As stated above, the algorithm starts by moving VMs from over-utilized PMs to under-utilized PMs: phase 1 (Fig 11). First, we sort the over-utilized PMs in the same cluster by the sum of the resource demands of their assigned VMs (line 2). A PM is considered over-utilized if the last constraint in (6) is violated. The gross resource demand of a PM is calculated as follows:

$\mathit{demand}_m = \sum_{k=1}^{K}\left(\sum_{n=1}^{N} e_{nm}\left(a_n^k \mu_n + b_n^k\right) + \sqrt{VAR_m^k}\right), \quad \text{where } VAR_m^k = \sum_{i=1}^{N}\sum_{j=1}^{N} e_{im}\, e_{jm}\, \rho_{ij}\, a_i^k \sigma_i\, a_j^k \sigma_j$   (22)

If a PM is over-utilized, the QoS of all VMs on that PM may be affected; therefore, a PM with larger resource demands has higher priority, in order to maximize the QoS improvement per migration (we call the number of migrations the 'migration count'). Such a PM has a higher probability of running a larger number of VMs, so we can expect a higher QoS improvement. VMs on the over-utilized PM are migrated until that PM is no longer over-utilized. In particular, the VM with the largest resource demands (the denominator of (18)) is migrated first to minimize the VM migration count. The selected VM migrates to the 'best' PM, meaning the PM whose nvd (21) is the smallest among all 'active' and under-utilized PMs (lines 6 and 7). Only when none of the active PMs has sufficient available resources for the VM does the algorithm turn on an inactive ($f_m$=0) PM and deploy the VM there; notice that it is better to keep the number of active PMs as small as possible (i.e., lower energy costs). If the VM cannot be migrated to any PM (active or inactive) in the current cluster, the algorithm resorts to inter-cluster migration (line 8). A detailed explanation of inter-cluster migration is presented in the following section. The first phase ends when the migration count reaches a pre-specified limit (line 10) or no PM is over-utilized.

1: migration_count = 0
2: sort over-utilized local PMs in non-increasing order of total resource allocation (22)
3: for each $PM_m$ in the sorted list do
4:   sort its VMs in non-increasing order of their resource demands
5:   for each $VM_n$ in the sorted list do
6:     find the best $PM_b$ among active ($f$=1) PMs and migrate
7:     if cannot find, find the best one among inactive PMs end if
8:     if cannot find, do inter-cluster migration end if
9:     if ++migration_count > migration_limit then
10:      end algorithm
11:    end if
12:    if $PM_m$ is not over-utilized then jump to line 3 end if
13:  end for
14: end for
Fig 11. Intra-cluster VM migration algorithm (Phase 1)
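The over-utilization test used throughout phase 1 can be expressed compactly. The sketch below checks, per resource type, whether mean demand plus one standard deviation of the aggregate exceeds capacity, in the spirit of (22) and the last constraint of (6); the PM object layout (`vms`, `capacity`, `K`) and field names are assumptions made for this sketch.

```python
import math

def is_over_utilized(pm, rho):
    """A PM is over-utilized when, for some resource type k, the mean demand
    of its VMs plus one standard deviation of their correlated aggregate
    exceeds the PM's capacity for k (cf. (22))."""
    for k in range(pm.K):
        mean = sum(v.a[k] * v.mu + v.b[k] for v in pm.vms)
        var = sum(rho[i.id][j.id] * i.a[k] * i.sigma * j.a[k] * j.sigma
                  for i in pm.vms for j in pm.vms)
        if mean + math.sqrt(var) > pm.capacity[k]:
            return True
    return False
```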
The second phase of the algorithm is depicted in Fig 12. If there are no over-utilized PMs (i.e., no SLA violations exist), the algorithm consolidates PMs so as to reduce energy cost until the migration count reaches its limit. In this case, the PM with the smallest utilization (committed resource allocation) is selected first (line 1), because that PM can be turned off with the fewest migrations. Similar to FFD/BFD, the VM with the largest resource demands is selected first (line 3). Note that the algorithm finds the best PM among 'active' PMs only; that is, an inactive PM is never turned on (line 5). If a VM cannot migrate to any PM, the algorithm finishes, because the source PM cannot be turned off even if all the other VMs running on it could migrate.

1: sort under-utilized local PMs in increasing order of their resource allocation (utilization)
2: for each $PM_m$ in the sorted list do
3:   sort its VMs in non-increasing order of their resource demands
4:   for each $VM_n$ in the sorted list do
5:     find the best $PM_b$ among active ($f$=1) PMs and migrate
6:     if cannot find one then
7:       end algorithm
8:     end if
9:     if ++migration_count > migration_limit then
10:      end algorithm
11:    end if
12:  end for
13: end for
Fig 12. Intra-cluster VM migration algorithm (Phase 2)

2.5.2 Inter-cluster Migration
As shown in the previous section, only the local manager is responsible for intra-cluster migration. For inter-cluster migration, on the other hand, both the local and global managers are involved. When the global manager is asked to migrate a VM to another cluster (line 8 in Fig 11), it asks the local resource managers of the other clusters whether there is a PM on which the VM can be allocated. The key idea is that a cluster with smaller overhead (18) has higher priority to be selected (line 2 in Fig 13); that is, the algorithm selects the cluster whose variance increase would be smallest if the VM were assigned to it, in order to maximize the portfolio effect. If that cluster does not have space for the VM, the next (second-best) cluster is chosen, and so on, until the VM migrates. Once the VM migrates, the algorithm finishes (line 5).

1: VM // the VM to be migrated
2: sort clusters C in non-decreasing order of overhead (18)
3: for each cluster C do
4:   ask the local manager to deploy the VM
5:   if the VM is deployed (i.e., migrates) then break end if
6: end for
Fig 13. Inter-cluster VM migration algorithm – global manager

The local manager chooses the best (i.e., smallest-nvd (21)) PM for the VM among the 'active' PMs. If none of the active PMs can serve the VM, the algorithm checks whether there is a suitable one among the inactive PMs. If a PM is found, the algorithm deploys the VM on that PM and notifies the global manager whether the VM migrated or not.

1: find the best $PM_b$ among active PMs ($f_m$=1)
2: if no active PM can serve the VM then
3:   find the best one among inactive PMs ($f_m$=0)
4: end if
5: if a PM is found then deploy the VM to that PM end if
6: notify the global manager whether the VM is deployed or not
Fig 14. Inter-cluster VM migration algorithm – local manager
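A minimal sketch of the global-manager side of inter-cluster migration (Fig 13) is shown below; clusters are tried in order of increasing overhead (18), and each cluster's local manager decides whether it can host the VM. The `try_deploy` callback stands in for the local-manager logic of Fig 14 and is an assumed interface, not the thesis code.

```python
def inter_cluster_migrate(vm, clusters, rho):
    """Ask clusters, cheapest variance increase first, to host the VM.
    Returns the hosting cluster, or None if every local manager declines."""
    for cluster in sorted(clusters, key=lambda c: overhead(vm, c, rho)):
        if cluster.try_deploy(vm):  # local manager: best active PM, else inactive
            return cluster
    return None
```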
2.6 SIMULATION RESULTS
2.6.1 Simulation Setup
For the simulation we needed the following data: 1) the capacity vectors of the PMs, 2) the list of PMs in each cluster, 3) the resource demands of the VMs, and 4) the correlations among VMs. The data is randomly generated using the uniform distribution subject to the following conditions:
Numbers of VMs (N), PMs (M), clusters (C), and resource types (K) – set to different values to generate different simulation setups.
Workload intensity of VMs – lower and upper bounds are imposed on $\mu_n$ and $\sigma_n$.
Regression coefficients – lower and upper bounds are imposed on $a_*^k$ and $b_*^k$ (the same bounds for all VMs).
Capacity of PMs – lower and upper bounds are imposed on $r_*^k$ (the same bounds for all PMs).
It is also randomly decided which PMs are placed in which clusters (topology). Because VMs are heterogeneous, the workload intensity is randomly generated using the given bounds; likewise, the scale factors and PM capacities are randomly generated. Constructing a valid correlation coefficient matrix $[\rho_{ij}]$ is important, so we use the hypersphere decomposition method [28], which is a relatively simple way to generate a valid correlation matrix. If we focused on a very specific application, where we might find patterns of correlation, higher energy savings could be achieved. However, this thesis covers the general situation, so we build a valid correlation matrix using a random function. Application-specific algorithm tuning is left as future work.
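For reproducibility, a valid correlation matrix can be generated with a few lines of NumPy. The sketch below uses a simplified variant of the hypersphere-decomposition idea of [28] (random unit row vectors instead of an explicit angle parameterization), which likewise guarantees a symmetric, positive semi-definite matrix with a unit diagonal; it is a sketch under those assumptions, not the exact procedure of [28].

```python
import numpy as np

def random_correlation_matrix(n, seed=0):
    """Generate a valid correlation matrix: draw random unit row vectors b_i
    and set rho_ij = b_i . b_j, so the result is symmetric, PSD, and has
    ones on the diagonal."""
    rng = np.random.default_rng(seed)
    b = rng.normal(size=(n, n))
    b /= np.linalg.norm(b, axis=1, keepdims=True)  # project rows onto the unit sphere
    return b @ b.T
```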
2.6.2 Global Resource Manager
The objective of the global resource manager is to minimize the sum of the standard deviations of the clusters (10), which is called the cost in this section; a lower cost means a better-quality solution. To assess the quality of the solutions generated by the proposed algorithms, our solution is compared with some other well-known methods:
SA – based on a simulated annealing algorithm [29]. SA does not guarantee finding the global optimum, but it finds a near-optimal solution given sufficiently long time. This method may generate different solutions each time, so we run SA six times and pick the best result.
RAND – assigns VMs to PMs in different clusters in a completely random manner. If a solution is worse than the RAND solution, its quality is considered quite poor. We run RAND ten times and report the average of these runs.
FFD – uses the First Fit Decreasing algorithm, a well-known and commonly used heuristic for the bin packing problem. FFD treats the problem as a single-capacity bin packing problem by converting the various resource types into one "abstract resource type" using appropriate weighting (coin exchange) coefficients.
VM2C – based on the VM-to-cluster algorithm of Fig 6.
We first investigate the quality of the algorithms by comparing them with SA. The normalized overall cost of the algorithms is shown in Fig 15. Recall that the overall cost is calculated as follows:

$\text{overall cost} = \sum_{c=1}^{C}\sum_{k=1}^{K} \sigma_c^k, \quad \text{where } \left(\sigma_c^k\right)^2 = \sum_{i=1}^{N}\sum_{j=1}^{N} g_{ic}\, g_{jc}\, \rho_{ij}\, a_i^k \sigma_i\, a_j^k \sigma_j$   (23)

For a fair comparison, we generate eight different test cases based on the same setup and run simulations. Even for the same setup, the results can differ because the input data is randomly generated in each case (cf. Section 2.6.1). As depicted in Fig 15(a), with K = 3, VM2C produces the best result (i.e., the lowest overall cost), which is even better than SA's. The overall cost of the RAND solution is the largest, as expected. FFD is better than RAND but produces solutions whose overall costs are about 30% greater than those produced by VM2C. The same trend is observed for the case where K = 7 (Fig 15(b)).

Fig 15. Quality comparison among the algorithms (N=500, W=50)

The overall cost of the results produced by VM2C and its running time for four different data sets are presented in Fig 16. The original data set has 2000 VMs; the other three data sets are created by randomly sampling 500, 1000, and 1500 VMs from the original data, respectively. The same trend is observed for both cost and running time across the four data sets: 1) the rate of cost reduction decreases as W increases, and 2) the running time increases linearly with W.

Fig 16. Cost and algorithm running time vs. window size (W)

As mentioned before, choosing the proper window size (W) is quite important for VM2C. An unnecessarily large W is not acceptable due to the very long running time, and it is hard to find the optimal window size analytically. Instead, we propose a simple way to find a proper window size: first, make a smaller problem by randomly selecting VMs and PMs from the original problem (random sampling); next, run VM2C multiple times with different W's and find the best W. This simple method works because the relationship between cost and window size is similar for all data sets (Fig 16). As shown in Fig 17, the running time is strongly dependent on the problem size (N), so investigating a smaller problem can find a proper window size in a shorter time.

Fig 17. Running time vs. W for different problem sizes (N)
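The sampling-based calibration of W described above could look like the following Python sketch; `run_vm2c` is an assumed function returning the overall cost (23) of a VM2C run, and the sampling ratio logic is illustrative.

```python
import random

def calibrate_window(vms, pms, candidate_ws, sample_size, runs=3):
    """Pick a window size W on a small random sample of the problem,
    exploiting the observation that cost-vs-W curves look alike across
    problem sizes (Fig 16)."""
    sample_vms = random.sample(vms, sample_size)
    sample_pms = random.sample(pms, max(1, sample_size * len(pms) // len(vms)))
    best_w, best_cost = None, float("inf")
    for w in candidate_ws:
        cost = sum(run_vm2c(sample_vms, sample_pms, w)
                   for _ in range(runs)) / runs
        if cost < best_cost:
            best_w, best_cost = w, cost
    return best_w
```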
2.6.3 Local Resource Manager
The local resource manager deploys VMs to PMs. In order to assess the quality of the results from the proposed algorithm, it is compared with the heuristics below:
Permutation Pack (PP) [27] – for each bin, find an item whose ordering of resource demands best matches the ordering of residual resources in the bin. This algorithm may consider only the 'w' most important resource types; w can be set to K or any smaller natural number.
FFD – uses the First Fit Decreasing algorithm. It is the same as the FFD presented in Section 2.6.2.
MNVD – uses the MNVD algorithm.
In order to evaluate the quality of the results, the cost is defined as follows:

$\text{cost} = \sum_{m} f_m \sum_{k} w^k r_m^k$   (24)

This cost is the same as the objective function minimized in the original problem statement (6). The MNVD method generates the best quality of result (i.e., the minimum cost), as shown in Fig 18. PP is the second best, whereas both $PP_{w=3}$ and FFD are the worst. FFD is a simple algorithm and its running time is the smallest. The running time of PP increases exponentially as the resource dimension increases; Fig 18 implies that PP is not applicable when K is greater than 8 due to the huge running time. On the other hand, the proposed algorithm, MNVD, may be used for any (large) K. Note that $PP_{w=3}$ is supposed to be simpler than PP because it considers the order of only the three most important resources, yet the running time of $PP_{w=3}$ is larger than that of PP when K is less than four. This is caused by our implementation: the algorithm builds a list for comparing the orderings and revises the list only when a window size (w) is specified. Because of this additional computation, the running time of $PP_{w=3}$ is greater than that of PP for K < 4.

Fig 18. Cost and running time vs. dimension of resource (K)

2.6.4 Overall Performance Comparison
We have shown that the proposed algorithms are simple and work well for both global and local resource allocation. However, high quality at each level does not necessarily guarantee high quality of the overall solution. In this section, we compare the quality of the HVMC (Hierarchical VM Consolidation) solutions generated by different algorithms and verify that the proposed algorithms produce high-quality solutions. In addition, we run the simulation for a large number of VMs to show whether the proposed scheme is scalable. In this section the results of the heuristics below are presented:
HVMC – uses the proposed hierarchical method: VM2C (for global assignment of VMs to clusters, Fig 6) and MNVD (for local assignment of VMs to PMs, Fig 9).
PP – uses the permutation pack algorithm. This is a non-hierarchical solution method.
FFD – uses the First Fit Decreasing algorithm. It is the same as the FFD presented in Section 2.6.2 and is non-hierarchical.
JOINT – uses the joint VM-to-cluster and VM-to-PM assignment algorithm of Fig 10. $W_2$ is equal to the W of HVMC and $W_1 = 5W_2$.
In the simulation, K is set to 7; if K is greater than 7, the running time of PP becomes very large (Fig 18) even for small N. A comparison among the algorithms with different numbers of VMs (N) is reported in Fig 19. It is seen that the costs of FFD and $PP_{w=3}$ are much larger than those of all the others. For all cases, JOINT produces the best results and HVMC is the second best. The difference in cost between JOINT and HVMC is less than 4%, which implies the quality of HVMC's result is quite similar to that of JOINT. Scalability is one of the most important features of a resource management solution for cloud computing; hence, the relationship between problem size and algorithm running time is very important. The running time of the proposed algorithm is calculated as follows:

$T_{\text{running time}} = T_{\text{global}} + \max\left(T_{\text{local}}\right)$   (25)

where $T_{\text{global}}$ and $T_{\text{local}}$ are the running times of the global resource manager and the local resource managers, respectively. The local managers run in parallel; therefore, the longest execution time among them is used in the calculation.

Fig 19. Quality comparison among the algorithms

The running time comparison among the algorithms is shown in Fig 20. Both PP and JOINT tend to run for a very long time on large numbers of VMs because they are non-hierarchical methods.
Because of their large running times, neither PP nor JOINT is appropriate for solving large problem instances. On the other hand, the running times of all the other heuristics (FFD, $PP_{w=3}$, and HVMC) are acceptable. Note that HVMC's rate of running-time increase is the smallest among these algorithms; this is one of the key benefits of the hierarchical method. The quality of the result produced by FFD is quite poor, as shown in Fig 19 (30% greater than JOINT), so FFD is not acceptable either; for the same reason, $PP_{w=3}$ is not a good method. Hence, HVMC is the best method among these algorithms, and its solution cost is about 20% less than the costs of the solutions produced by FFD and $PP_{w=3}$.

Fig 20. Running time comparison among the algorithms

2.6.5 Online Resource Management
In this section, results of the proposed online algorithm are presented and compared with the 'rank-based VM remapping (RBR)' method of [30] and with a Monte-Carlo experiment result (as a pseudo-optimal solution). The RBR algorithm is similar to FFD, which is a state-of-the-art algorithm for solving various bin-packing problems. Notice that the PP method is designed for initial VM allocation and not for VM reallocation, so the proposed online algorithm cannot be compared with PP. The RBR algorithm consists of two parts: load distribution and load consolidation. If a PM is over-utilized, RBR remaps the VMs of that PM. RBR introduces a 'rank', which is defined depending on the objective function; for example, if the objective is to minimize energy consumption, the rank may be defined as energy efficiency. RBR sorts PMs in non-increasing order of rank; that is, a PM with a bigger rank has higher priority for migration. Load consolidation works in a similar way: the VMs on an under-utilized PM are migrated to another PM whose rank is as high as possible. For the comparison with the proposed algorithm, the rank is defined as the total available (remaining) resource. The detailed test setup for the comparison is as follows. First, we perform the initial resource allocation using the JOINT method (Fig 10). Next, the resource requirements of the VMs are re-generated using a random generator so that some PMs become over-utilized. By comparing the costs of the two algorithms under the same migration limit, we can determine which one is better. The cost could be defined as $\sum_m f_m q_m$ (cf. (6)), but this does not take account of the over-utilized PMs. Hence, the cost is revised to be:

$\text{cost} := \sum_{m} \left(1 + o_m\right) f_m\, q_m$   (26)

where $o_m$ is 1 if $PM_m$ is over-utilized (i.e., the last constraint of (6) is violated) and 0 otherwise. The cost comparison is shown in Fig 21. As shown in the figure, the proposed online algorithm always produces higher-quality results. The maximum cost difference between the results of the RBR algorithm and those of the proposed algorithm is nearly 15%, and a similar percentage difference persists across a wide range of migration count limits.

Fig 21. Cost comparison between RBR and the proposed algorithm
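Read as a formula, the revised cost (26) simply double-counts over-utilized PMs; a one-line Python sketch (with an assumed PM object carrying `o`, `f`, and `q` fields) makes this explicit.

```python
def online_cost(pms):
    """Revised cost (26): each active PM contributes f_m * q_m, and an
    over-utilized PM is penalized by counting its term twice via (1 + o_m)."""
    return sum((1 + pm.o) * pm.f * pm.q for pm in pms)
```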
Having the same number of VM migrations does not necessarily mean that the overheads are the same. This is primarily because the overhead of an inter-cluster migration is greater than that of an intra-cluster migration. RBR chooses the best PM from the whole set of PMs without considering whether those PMs are located in the same cluster as the target VM; therefore, more than 90% of its migrations are inter-cluster migrations (Fig 22). On the other hand, the proposed online algorithm avoids inter-cluster migration as much as possible, so only 45% of its migrations are inter-cluster migrations. This observation implies that the cost difference between the two algorithms can be greater than 15% under the same migration overhead.

Fig 22. Ratio of inter/intra cluster migration

It has been shown that the proposed algorithm produces better results than RBR, but this does not necessarily prove that the quality of the proposed algorithm is good enough. Therefore, we run a Monte-Carlo simulation, which gives a near-optimal result. For the comparison, we use the costs of RBR and the proposed algorithm when the migration limit is 400. Migration count and cost comparisons are shown in Fig 23. The migration count of Monte-Carlo does not mean the number of migrations performed during the Monte-Carlo simulation; it is obtained by comparing the initial VM-to-PM mapping with the final solution. As shown in the figure, the migration count of Monte-Carlo is much greater than the others, and most of its migrations are inter-cluster migrations, which implies that the migration overhead of the pseudo-optimal solution (which considers only maximum cost reduction) is very significant. On the other hand, the cost of the proposed algorithm ('proposed' in the plot) is only around 10% higher than Monte-Carlo, which implies that the quality of the result produced by the proposed algorithm is very high.

Fig 23. Quality comparison

To summarize, the proposed initial resource allocation method (HVMC) is scalable and provides higher-quality solutions. This method has a hierarchical structure and effectively maximizes the portfolio effect by considering correlations among VMs. Therefore, it finds higher-quality solutions while its running time remains reasonable even for very large problems. The online resource management method (a variant of HVMC) is also scalable and minimizes migration overheads by avoiding inter-cluster migration as much as possible. Even with these lower migration overheads, the quality of its result is only 10% worse than the near-optimal solution, which is acceptable.

2.7 SUMMARY
With the increasing energy costs of cloud computing systems, the necessity of energy-aware resource management techniques has been growing. This chapter presented a hierarchical resource management solution that produces high-quality solutions and is scalable in the number of resource types as well as the number of VMs. First, we proposed HVMC as an initial VM allocation algorithm. It maximizes the portfolio effect by considering correlations among VMs.
Its hierarchical structure makes the algorithm scalable; hence, its running time is reasonably small even for very large problems. For situations where the intensity and characteristics of the workload keep changing, we also proposed an online resource management solution. This method considers the network topology so as to minimize inter-cluster migration, which causes higher overhead than intra-cluster migration. In addition, when it does perform inter-cluster migration, it chooses the cluster that maximizes the portfolio effect if the VM migrates to it. As a result, it produces a high-quality solution that is just 10% worse than the near-optimal solution.

Chapter 3. CPU CONSOLIDATION IN A VIRTUALIZED MULTI-CORE SERVER

3.1 BACKGROUND – POWER MANAGEMENT TECHNIQUES
The purpose of this study is to understand how effectively CPU consolidation improves the energy efficiency of server systems, so as to maximize the improvement. Consolidation interacts with existing power management technologies, so it is helpful to understand these technologies. In this section, the processor power and performance states are briefly reviewed. In addition, we review the Intel® QuickPath Interconnect technology, which may affect the energy savings achievable by consolidation. Before starting the discussion, a few potentially confusing terms are clearly defined below:
CPU – all circuits used to perform arithmetic/logic operations, plus the L1 and L2 cache memories. The term 'core' is used interchangeably with 'CPU' in this thesis.
uncore – all components in a processor except the cores.
package – a physical unit that contains cores and the uncore. The word 'package' is used interchangeably with 'processor'.
CPU consolidation – simply called 'consolidation' unless this is confusing.
total utilization – the sum of the percentages of time the CPUs are running code. For example, when two CPUs are fully utilized, the total utilization is 200%.
average utilization – the average utilization per core, calculated by dividing the total utilization by the number of active CPUs, so it is at most 100%. This term is simply called 'utilization' or 'util'.
throughput – the number of tasks (jobs) processed per second.
delay – the total amount of time spent executing a task, including the time when the task is suspended or waiting in a queue of the CPU scheduler.

3.1.1 Processor Power States (C, CC, PC-States)
The Advanced Configuration and Power Interface (ACPI) specification was developed as an open standard for OS-directed power management, and many modern operating systems (OSs) meet this specification. First, the specification defines C-States as processor power states; when a processor is in a higher-numbered C-State, also called a 'deeper' sleep state, a larger area of internal circuitry is turned off or inactive, which reduces power dissipation. On the other hand, it also takes longer to go back to the operating state (i.e., the C0 state). The number of supported C-States is processor dependent.
For example, the Intel® Core™ i7 processor (code-named Nehalem) supports the following core states: C0 (normal operating mode – cores in this state are either executing code or in standby), C1 (autoHALT – a low-power state entered when all threads within the core execute a HLT or MWAIT instruction), C1E (autoHALT at the lowest frequency and voltage operating point), C3 (deep sleep – cores in this state flush their L1 instruction cache, L1 data cache, and L2 cache to the shared L3 cache; clocks are shut off to each core), and C6 (deep power down – cores in this state save their architectural state before removing core voltage). See the Intel® Core™ i7 datasheets for more detailed information. The C-state is also known as a logical C-State. The OS 'requests' a change in the C-State of logical cores (a logical core is identical to a physical core unless Intel® Hyper-Threading is enabled; hyper-threading was disabled in this study), but the request may be denied (e.g., auto demotion). The decision about demotion is made based on each core's immediate residency history (think of this as the breakeven time for the proposed power state transition); if the estimated future idle residency, derived from the residency history, is insufficient, then a request to transition a core into a deeper power state is ignored. In general, the entry/exit costs (latency and energy overheads) increase when the processor/core leaves a deeper sleep state; hence, auto demotion prevents unnecessary excursions into deeper power states and thereby reduces both the latency and the energy overheads of power state switches.

In addition to the logical C-states there are two more types of hardware C-States: the core C-state ($CC_n$) and the package C-state ($PC_n$). Based on the logical C-state switch requests from the OS, a Power Control Unit (PCU) decides the CC and PC-states. Each core and/or package can switch its power state independently; that is, the CC-state of one core may differ from that of other cores, and likewise the PC-state of one package may differ from that of another. The $CC_0$ state is a special state that deserves more discussion; a core is in this state when it is executing tasks or when it is standing by for the next task to arrive (but is still in the normal operating mode). Note that a core does not immediately switch from the $CC_0$ state to a deeper sleep state when it becomes idle. The PC-state of a package depends on the CC-states of the cores in the package. In particular, a package can be in state $PC_k$ only when all of its cores are in the $CC_k$ state or deeper CC-states. This is because some resources shared by the cores cannot be turned off unless all cores are inactive. For example, the Intel® i7 processor's L3 cache is shared by multiple cores, so the package must stay in an active state if any core is still active; otherwise, the active cores could not function properly.

3.1.2 Processor Performance States (P-States)
While a processor executes code, it can be in one of several performance states (P-States), which specify the clock frequency and the corresponding voltage level. At a higher frequency, performance is higher, but power dissipation is also higher. Similar to the C-States, the number of supported P-States is processor-dependent, and the frequency is higher at lower-numbered P-States; that is, P0 is the highest performance state. The OS chooses the P-State based on historical workload information.
The OS may not choose the same state for all cores, but all cores in Intel® processors run at the same clock frequency. Therefore, even if the OS sets different P-States for the cores, only one state is selected and applied. In general, the highest requested performance state is chosen, but another decision policy may be used. Because of this hardware constraint of current Intel® processors, it is recommended to distribute the workload evenly among all active cores; otherwise, the selected P-State will be appropriate only for some cores but not for the others.

3.1.3 Core-level Power Gating
Recent state-of-the-art Intel® processors are capable of core-level power gating; processors can completely shut down some of the cores (the OFF cores consume nearly zero power). Processors with the power-gating feature have an additional C-State at which power dissipation is nearly zero, but with the largest entry/exit costs. Note that the processor used in this study supports core-level power gating.

3.1.4 Intel® QuickPath Interconnect (QPI)
In the past there was only one memory controller (MC) in a system, because one MC was enough for single-core/single-processor systems. However, modern server systems, which have multiple packages (processors), need multiple MCs to achieve a target performance level. With a single MC, the performance of all packages drops as soon as a CPU in one package uses the memory controller heavily. In order to address such memory bottlenecks, Intel® introduced QPI; each package owns an integrated memory controller (IMC) and communicates with other packages through the QPI. An example of a dual-processor system is shown in Fig 24. Half of the DDR3 slots are directly connected to the first package while the other half are connected to the second package. Therefore, if a core in package #1 needs data located in a remote DDR3 module connected to package #2, the data is requested and received through QPI. Hence, the PC-state of a package may not be the deepest sleep state even when all CPUs in that package are inactive, because a CPU in another package may be requesting data from the DDR3 slots that are directly connected to the package in question. Due to this phenomenon, it is expected that the power savings achievable by package-level consolidation are negligible. This will be discussed in more detail in a later section.

Fig 24. Intel® QPI block diagram

3.2 POWER, DELAY AND CONSOLIDATION
In this section we present power and delay models for Chip Multiprocessor (CMP) server systems. Using these models, we discuss how consolidation affects energy efficiency and delay. The discussion is at an abstract level, so it may be too simplified to cover every aspect of real systems; however, we believe it is sufficient for deriving insights. The discussion of the power/latency tradeoffs will be verified by empirical results in Section 3.5. Note that thermal issues (e.g., leakage power variation as a function of chip temperature) are not considered, because consolidation is applied only when the system is under-utilized, which also implies that the temperature of the processor chips is not very high.

3.2.1 Power Model
This section presents a full platform-level power dissipation model, accounting for the power consumed by all components within a server system.
This model estimates the system power dissipation using statistical data reported by the system, i.e., the percentage of time spent in specific CC and PC-states. The processor power dissipation consists of core and uncore power dissipation. Some notations used in the model and their definitions are shown below:
$P_{active}^{core}$ – power dissipated by a core when it is active and executing code. This active power is a function of the P-State of the active cores. It also depends on the type of workload, but we do not consider this factor because consolidation does not change the characteristics of the current workload.
$P_{CC_n}^{core}$ – power dissipated by a core in the $CC_n$ state. Note that $P_{CC_0}^{core}$ is different from $P_{active}^{core}$; a core may simply be standing by while it is in $CC_0$ (i.e., not executing any tasks although it is fully on).
$P_{PC_n}^{uncore}$ – power dissipated by the uncore when the package is in the $PC_n$ state.
$T_{active}^{core_i}$ – percentage of time when a core is active and executing tasks, also called the core utilization ($util_i$).
$T_{CC_n}^{core_i}$ – percentage of time when a core is in the $CC_n$ state.
$T_{PC_n}^{uncore}$ – percentage of time when a package is in the $PC_n$ state.
The total (server platform) power dissipation is the sum of the processor power dissipation ($P_{proc}$) and the power consumed by the other system components ($P_{other}$), e.g., I/O, memory, and hard disk drive. $P_{other}$ is fixed and independent of DVFS or CPU consolidation, so it acts as a fixed offset on top of $P_{proc}$:

$P_{total} = P_{proc} + P_{other} = \sum_i P^{core_i} + \sum_j P^{uncore_j} + P_{other}$   (27)

The core power dissipation can be estimated using $P_{active}^{core}$, $P_{CC_n}^{core}$, $T_{active}^{core_i}$, and $T_{CC_n}^{core_i}$ as shown below. CC0 is a special state: a core is in the CC0 state while it is executing code, but it may also be in the CC0 state while idle. For example, a CPU stays in the CC0 state for a certain amount of time (i.e., a timeout period) before switching to a deeper power state.

$P^{core_i} = P_{active}^{core}\, T_{active}^{core_i} + P_{CC_0}^{core}\left(T_{CC_0}^{core_i} - T_{active}^{core_i}\right) + \sum_{n \geq 1} P_{CC_n}^{core}\, T_{CC_n}^{core_i}$   (28)

Similar to the core power dissipation, the uncore power dissipation is:

$P^{uncore_j} = \sum_{n} P_{PC_n}^{uncore}\, T_{PC_n}^{uncore_j}$   (29)

Note that $\forall i: \sum_n T_{CC_n}^{core_i} = 1$ and $\forall j: \sum_n T_{PC_n}^{uncore_j} = 1$.
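A direct transcription of (27)-(29) into Python follows; the dictionary-based layout (state index mapped to power or residency fraction) and parameter names are assumptions made for the sketch.

```python
def core_power(P_active, P_cc, T_active, T_cc):
    """Core power per (28): active power while executing, CC0 idle power for
    the remainder of the CC0 residency, and sleep-state power for CC1 and
    deeper. P_cc and T_cc map state index n -> power (W) / residency [0, 1]."""
    p = P_active * T_active + P_cc[0] * (T_cc[0] - T_active)
    p += sum(P_cc[n] * T_cc[n] for n in T_cc if n >= 1)
    return p

def platform_power(cores, uncores, P_other):
    """Total platform power per (27): sum of core powers, uncore powers per
    (29), plus the fixed power of the remaining components."""
    p_cores = sum(core_power(*c) for c in cores)
    p_uncores = sum(sum(P_pc[n] * T_pc[n] for n in T_pc)
                    for (P_pc, T_pc) in uncores)
    return p_cores + p_uncores + P_other
```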
3.2.2 CPU Consolidation and Power Dissipation
In this section we discuss whether consolidation reduces power dissipation; its impact on delay is discussed in the following section. The discussion here is based on the assumption that consolidation is performed correctly so that throughput remains the same; in other words, a sufficiently large number of CPUs is always active. Lower power dissipation at the same throughput means less energy is consumed for the same workload, i.e., more energy-efficient operation. The discussion in this section therefore focuses on how power dissipation changes with consolidation, from which we can see whether energy efficiency is improved. Consolidation reduces the number of active CPUs; the type and level of the workload do not change. Therefore, it is expected that $P_{other}$ is not affected by consolidation. Hence, we focus on changes in the core and uncore power dissipation:

$\Delta P_{total} = \sum_i \Delta P^{core_i} + \sum_j \Delta P^{uncore_j}$   (30)

We start with the power impact on the cores:

$\sum_i \Delta P^{core_i} = \left(P_{active}^{core} - P_{CC_0}^{core}\right) \Delta \sum_i T_{active}^{core_i} + \sum_n P_{CC_n}^{core}\, \Delta \sum_i T_{CC_n}^{core_i}$   (31)

In the above equation, the term $\sum_i T_{active}^{core_i}$ is not affected by consolidation because the workload level does not change (i.e., $\Delta\sum_i T_{active}^{core_i} = 0$). Therefore,

$\sum_i \Delta P^{core_i} = \sum_n P_{CC_n}^{core}\, \Delta \sum_i T_{CC_n}^{core_i}$   (32)

As shown in the above equation, the power savings of consolidation are a function of the changes in the sums of $T_{CC_n}^{core_i}$. Let us assume that the power state transitions are ideal: 1) the CC-state immediately switches to the deepest sleep state (CC6) without any delay when a core becomes idle (i.e., $T_{active}^{core_i} = T_{CC_0}^{core_i}$), and 2) there is no power-state switch cost, i.e., no additional delay or power consumption when the power state switches. Under these assumptions, there is a negligible change in core power dissipation due to consolidation, because all cores are in the CC6 state whenever they are idle:

$\sum_i \Delta P^{core_i} = P_{CC_6}^{core}\, \Delta \sum_i T_{CC_6}^{core_i} = P_{CC_6}^{core}\, \Delta \sum_i \left(1 - T_{active}^{core_i}\right) = 0$   (33)

However, these assumptions are not realistic. Because of non-negligible switch costs, a core may not promptly switch its power state when it becomes idle; if the core would stay in a low-power state for only a very short time, the switch costs could be greater than the power savings from the switch. Consolidation can decrease power consumption by reducing the number of switches (hence reducing the switch costs). Fig 25 depicts an example of how consolidation reduces these costs; there are two CPUs and two available CC-states, CC0 and CC6. When a task is given to a CPU, the CPU executes the task (CC0-active). When the execution is done, the CPU stays in the CC0 state (CC0-idle) for a certain amount of time before switching to CC6; for the rest of the period, the CPU is in the CC6 state. In the upper case in the figure (Fig 25-a), we see one CC6-to-CC0 switch and two CC0-to-CC6 switches. In the consolidation case (Fig 25-b), on the other hand, there is only one switch (CC0-to-CC6), and the CPUs reside in the CC0-idle state for a shorter amount of time. Therefore, in this example consolidation reduces power dissipation, which also means it improves energy efficiency. However, consolidation may increase the execution time of a task: in this example, the second task cannot be executed promptly because the only active CPU is still running the previous task (task 1). Therefore, we have to weigh the performance degradation when deciding whether or not to perform consolidation. Second, we discuss the impact of consolidation on uncore power dissipation. As discussed in Section 3.1.1, a package can switch its power state to a deeper one only when all cores in the package are idle. The uncore can stay longer in a deeper power state in the two-CPU case (Fig 25-a): the active periods of the two CPUs overlap, so both CPUs are simultaneously in the CC6 state for a longer period than in the consolidated case, where only one CPU is active. In other words, consolidation may increase uncore power. However, consolidation can also reduce the percentage of time spent in the CC0-idle state, so if this reduction is greater than the lost overlap, uncore power dissipation may still be reduced by consolidation.

Fig 25. Example of CC-state switches under consolidation: (a) two active CPUs; (b) one active CPU (consolidation)
We have discussed the impacts on power dissipation using the power model, but a real system is too complicated for the model to capture every factor that affects power dissipation. Therefore, we run experiments on a real system and quantify the power savings from consolidation.

3.2.3 Delay Model
As discussed in the previous section, consolidation may increase the delay of tasks. In this section, we discuss the impact of consolidation on delay. The proposed delay model is a function of the core utilization ($T_{active}^{core_i}$). In general, the delay increases rapidly when a CPU approaches full utilization [31]:

$D_i = \dfrac{e}{1 - T_{active}^{core_i}} + f$   (34)

$D_i$ is the delay of the i-th CPU ($core_i$). Coefficient e represents how sensitive the delay is to the core utilization; with a larger e, the delay increases more rapidly as the core utilization approaches 1. The other coefficient, f, contributes a lower bound on the delay; that is, the delay does not drop below a certain value even when the core utilization is very low ($D_i \geq e + f$). These coefficients are task-dependent (coefficients for one task might differ from those for another task) as well as hardware-dependent. The delay is affected by consolidation because $T_{active}^{core_i}$ is a function of the active CPU count. When K tasks are assigned to the system every second, the tasks are evenly distributed over the m active CPUs by the scheduler; therefore, each CPU is assigned K/m tasks every second. The core utilization $T_{active}^{core_i}$ is linearly proportional to the per-CPU workload (K/m):

$T_{active}^{core_i} = d \cdot (K/m)$   (35)

Coefficient d represents the amount of CPU resource (i.e., the number of CPU cycles) needed to execute a task; a task with higher d needs more CPU cycles than one with smaller d. Now we can model the delay as a function of the active CPU count (m) and the total number of tasks (K):

$D = \dfrac{e}{1 - d\,(K/m)} + f$   (36)

The delay increases as the core utilization increases, and the rate of increase at high utilization is greater than at low utilization. Consolidation increases the core utilization, but if we keep the core utilization below a certain level (threshold), the delay increase due to consolidation will be insignificant. Hence, it is important to find this threshold and keep the core utilization below it. The coefficients in the delay model may be application-dependent, so the threshold for one application may differ from that for another; therefore, we find thresholds for various kinds of benchmark tests. Based on our experiments, we recommend that the average core utilization be no more than 70% for CPU-bound applications. Note that for memory-bound applications with tight execution time limits, contention can occur on other shared resources (including the bus, second-level cache, and main memory), and hence a limit on the average CPU utilization alone will not be sufficient. However, as we will show later, CPU consolidation is not an effective technique for such applications anyway. Details will be presented in Section 3.5.
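As a worked reading of (35) and (36), the sketch below evaluates the delay for a given active-CPU count and derives the smallest count that respects a utilization threshold; the 70% default reflects the recommendation above, while the function names are ours.

```python
import math

def delay(m, K, d, e, f):
    """Delay model (36): per-core utilization is d*K/m, and delay grows
    rapidly as utilization approaches 1; d, e, f are task/hardware dependent."""
    util = d * K / m
    if util >= 1.0:
        return float("inf")  # the active CPUs cannot keep up with the load
    return e / (1.0 - util) + f

def min_active_cpus(K, d, threshold=0.70):
    """Smallest active-CPU count that keeps the average core utilization at
    or below the threshold (about 70% is recommended for CPU-bound tasks)."""
    return max(1, math.ceil(d * K / threshold))
```

For instance, with d = 0.01 CPU-seconds per task and K = 300 tasks per second, `min_active_cpus` returns 5, i.e., consolidation should not go below five active CPUs.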
3.3 ENERGY EFFICIENCY METRICS
In the previous discussion, the term 'energy efficiency' has been used without being defined. In order to determine whether consolidation improves energy efficiency, we must precisely define what 'energy efficiency' is; depending on how it is defined, consolidation may or may not enhance it. In this study we use two metrics for energy efficiency: energy per task (E/task) and energy-delay product per task (ED/task).

3.3.1 Energy per Task (E/task)
This metric is often used for comparing energy efficiency among different platforms. The term 'task' denotes one instance of executing a specified benchmark. This metric is calculated from the average power consumption ($P_{avg}$) and the throughput (i.e., the number of tasks processed per second):

$E/task = \dfrac{E_{gross}}{\#\ of\ tasks} = \dfrac{P_{avg} \cdot Time}{\#\ of\ tasks} = \dfrac{P_{avg}}{throughput}$   (37)

Consolidation may decrease this metric, but it can also reduce performance. If throughput is selected as the performance indicator, then this metric already includes performance information: if consolidation reduces E/task, we can say that the energy savings dominate the performance degradation (i.e., the throughput reduction). For other performance definitions, however, this metric may be insufficient. If we care about execution time as well as throughput, this metric carries no execution-time information; for example, if the execution time increases due to consolidation but the throughput does not change, the metric indicates that energy efficiency improved without any performance degradation, which may mislead us into a wrong decision. Hence, we introduce another metric in the following section.

3.3.2 Energy-Delay Product per Task (ED/task)
The 'delay' in this metric is the average execution time of tasks, including periods when a task is suspended by the CPU scheduler while the CPUs execute other tasks. The metric is calculated from the average power dissipation, the throughput, and the execution time:

$ED/task = (E/task) \cdot delay = \dfrac{P_{avg} \cdot delay}{throughput}$   (38)

Depending on the metric, a different power management technique may be determined to be the best one. Hence, we report the energy efficiency improvement under both metrics.
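Both metrics reduce to one-line computations, sketched here for clarity; the numbers in the usage note are hypothetical except for the 61.9 W dynamic power figure reported later in this chapter.

```python
def e_per_task(p_avg, throughput):
    """Energy per task (37): average power (W) over tasks per second."""
    return p_avg / throughput

def ed_per_task(p_avg, throughput, task_delay):
    """Energy-delay product per task (38): E/task times the average task
    execution time, including time spent suspended by the scheduler."""
    return e_per_task(p_avg, throughput) * task_delay
```

For instance, at 61.9 W of dynamic power and a hypothetical 100 tasks per second, E/task is about 0.62 J; if the average execution time were 0.5 s, ED/task would be about 0.31 J·s.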
3.4 EXPERIMENTAL SETUP
A goal of this chapter is to quantify the energy efficiency improvement of consolidation and to find a way to maximize that improvement. In addition, we compare consolidation with DVFS, which is the most popular technique. Because a real system is too complicated to be simulated well, all data shown in the following sections are measured from experiments (not simulations).

3.4.1 Hardware Test-bed and XEN
The server system under test has two Intel® Xeon® Westmere E5620 processor packages, and each package in turn includes four CPUs (Fig 24). As mentioned in Section 3.1.2, all CPUs in the same package run at the same clock frequency and voltage. However, the power state of a CPU can be different from that of the other CPUs in the same package. Each 64-bit CPU has its own dedicated 64KB L1 and 256KB L2 caches but shares a 12MB L3 cache with the other CPUs. The total size of the system memory is 6GBytes. This processor supports seven clock frequency levels, from 1.6GHz to 2.4GHz. This server system may appear too small to represent typical servers in datacenters. A common myth is that datacenters always consist of large servers with many processors. In fact this is not true for all datacenters; the Google datacenter consists of clusters of inexpensive desktop-class machines [32, 33]. As another example, the Facebook datacenter is comprised of dual-processor servers [34, 35]. There are a few reasons why datacenters consist of many small servers rather than fewer large servers [36]: first, resource management in many-processor servers is a complex and challenging task, so the actual performance may not be high enough. Second, the license cost of resource management software for large servers is high. Third, it is tricky to properly handle the failure of an individual component; the failure of one processor in a large server may cause the whole server system to fail, taking out a big chunk of the computing resources within a datacenter. Hence, our setup is realistic and representative of the server systems found in some datacenters.

A power analyzer tool measures the total platform (system) power dissipation, which includes the power consumed by all components, e.g., processor, HDD, DRAM, fans, and so on. None of the components other than the CPU are optimized for power savings; for example, the cooling fans run at the highest speed all the time, and high-performance HDDs are used all the time in order to avoid any risk of performance degradation. Hence, the system power dissipation is very high even when the system is idle (we call this quantity the standby power from now on). In order to compensate for the potential power inefficiency of the other system components, we calculate and report 'power dissipation' as the difference between the total system power and the standby power:

$power = power_{measured} - power_{standby}$   (39)

The reported power value thus accounts for the dynamic power consumption of all system components. The standby power of our system is 98.1W. When the system is fully loaded, the system power is about 160W; that is, we report 61.9W as the power consumption. Consolidation is needed only when the system is under-utilized. If the average core utilization is 50%, the calculated power consumption is about 30W. If consolidation then reduces power dissipation by 15W, we report 50% power savings; the savings would have to be reported as only 12% if we had used the total system power for the calculation. We believe reporting a 50% dynamic power saving is more indicative of the actual effect of consolidation than reporting a 12% saving in the total platform power. All power dissipation numbers reported in the following sections are calculated using the above equation unless stated otherwise. A photo of the system under test is shown in Fig 26.

Fig 26. The server system along with a power analyzer

We built the virtualized system using XEN (version 4.0.1), an open-source hypervisor-based virtualization product that provides APIs for changing VM configurations: the number of virtual CPUs (vCPUs), the clock frequencies, and the set of active CPUs. We change these configurations by calling the XEN built-in functions.

3.4.2 Benchmarks – PARSEC and SPECWeb2009
For this study two different benchmark suites are used: 1) the Princeton Application Repository for Shared-Memory Computers (PARSEC) [37] and 2) SPECWeb2009. PARSEC consists of 13 multithreaded, shared-memory programs, which represent next-generation programs for CMPs. All these programs are designed and developed from real applications. The characteristics of these programs are very different from one another, and they represent a wide range of applications; therefore, we can draw strong conclusions using the PARSEC benchmark. Note that although a total of 13 programs is provided, we use 11 of them.
3.4.2 Benchmarks – PARSEC and SPECWeb2009
Two different benchmark suites are used in this study: 1. the Princeton Application Repository for Shared-Memory Computers (PARSEC) [37] and 2. SPECWeb2009. PARSEC consists of 13 multithreaded, shared-memory programs that represent next-generation programs for chip multiprocessors (CMPs). All of these programs are designed and developed from real applications, their characteristics differ widely from one another, and together they cover a wide range of applications; therefore, PARSEC allows us to draw strong conclusions. Note that although 13 programs are provided, we use only 11 of them, because the ‘facesim’ and ‘ferret’ programs are very unstable and often crashed in our setup. The PARSEC suite does not provide an I/O-bound program, so SPECWeb2009 is used as the I/O-bound workload. For PARSEC we present improvements in both metrics, E/task and ED/task. For SPECWeb2009 we present only ED/task: delay, defined as the turn-around time for SPECWeb2009, is very important for web service, so E/task is not an appropriate metric for this suite.

3.5 EXPERIMENTAL RESULTS AND DISCUSSION
In this section, experimental results for PARSEC and SPECWeb2009 running on the system under test are presented and discussed. We start by presenting a detailed power model as a function of the CC- and PC-states. We also investigate the consolidation overhead (the DVFS overhead has been extensively studied in reference [38]) and show that the number of virtual CPUs (vCPUs) has to be changed dynamically to reduce this overhead. It is also important to determine which set of CPUs should be active in order to maximize energy efficiency. Next, we report the E/task and ED/task improvements for PARSEC under three techniques: 1. DVFS, 2. consolidation, and 3. both combined. Finally, we present highly effective, yet simple, online consolidation algorithms for SPECWeb2009 and report the resulting energy efficiency improvements.

3.5.1 Power Model Derivation and Verification
This section presents a full platform-level power dissipation model accounting for the power consumed by the core and uncore components of the target server system. As will be seen, this model is more detailed than the generic one described in Section 3.2.1. Our system allows limiting the deepest C-state; we can set the limit to C1, C2, or C3 by using the xenpm tool [39] of the XEN hypervisor. The hardware-reported information for each C-state limit is shown in Table 1. As the table shows, not all information is available: the percentages of time spent in the CC0, CC1, PC0, and PC1 states are not reported, so these unreported times must be estimated. Our goal is to estimate the power dissipation when all C-states are available, i.e., when the C-state limit is C3, but this is a difficult undertaking. Therefore, we start from the simplest case, a C-state limit of C1; we then treat the second case, a C-state limit of C2; finally, we derive the power equation for a C-state limit of C3.

Table 1. C-state limit and hardware-reported information
C-state limit | T_CC0, T_CC1 | T_CC3 | T_CC6 | T_PC0, T_PC1 | T_PC3 | T_PC6
C1 | available but not reported | n/a | n/a | available but not reported | n/a | n/a
C2 | available but not reported | OK  | n/a | available but not reported | OK  | n/a
C3 | available but not reported | OK  | OK  | available but not reported | OK  | OK

Power dissipation depends on the C-state limit, as shown in Fig 27; the higher the C-state limit, the lower the power dissipation. Note that the utilization and system power reported in this figure are all measurements: power is measured with the power analyzer tool, whereas the utilization is reported by xentop. The power difference among C-state limits is greater at lower utilization, because at high utilization the cores stay in the C0 state most of the time.

Fig 27. Power dissipation vs. utilization for different C-state limits

Full derivations can be found in the Appendix of a USC CENG technical report [40].
The key idea behind the derivation is to start with equations (28) and (29), and then use a combination of analytical expansion of terms, lookups from the hardware-reported information (Table 1), and regression analysis to derive the power macro-models shown in Table 2. Note that the time each core spends in its power states is almost identical across cores because the CPU scheduler distributes tasks evenly; hence, the times in Table 2 are core-independent terms.

Table 2. Power macro-models for the server system under test
C-state limit C1: P_est,total = 21.88·T_active + 141.12
C-state limit C2: P_est,total = 22.48·T_active − 5.76·T_CC3 − 31.16·T_PC3 + 140.7
C-state limit C3: P_est,total = 22.48·T_active − 5.76·T_CC3 − 8.56·T_CC6 − 31.16·T_PC3 − 42.55·T_PC6 + 140.7

The power models presented in the above table are highly accurate; Fig 28 compares measurements against model predictions for the case where the C-state limit is set to C3, and the estimates are very close to the measurements.

Fig 28. Power estimation vs. measurements when the C-state limit is C3

The first coefficient (for the active state) in Table 2 is application dependent. The main point here is not to find highly accurate parameters for the power model but to show that power dissipation can be estimated well from the CC/PC-state statistics. Therefore, we will examine how these states are changed by consolidation in order to understand how consolidation improves energy efficiency.
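The C3-limit macro-model from Table 2 is straightforward to evaluate; the sketch below hard-codes its fitted coefficients and assumes, per the core-independence remark above, that the residencies are expressed as fractions between 0 and 1.

def estimate_power_c3(t_active, t_cc3, t_cc6, t_pc3, t_pc6):
    # C3-limit macro-model from Table 2; residencies as fractions (0..1).
    return (22.48 * t_active - 5.76 * t_cc3 - 8.56 * t_cc6
            - 31.16 * t_pc3 - 42.55 * t_pc6 + 140.7)

# Sanity check against the measurements quoted earlier: a fully loaded
# system (T_active = 1, no sleep residency) comes out near 160 W.
print(estimate_power_c3(1.0, 0.0, 0.0, 0.0, 0.0))  # ~163.2 W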
3.5.2 Package-level Consolidation
As shown in equation (29), the uncore power is a function of the PC-states. If a system has more than one package, further power savings may be achievable by package-level consolidation: select CPUs from the minimum number of packages and put the other packages in the deepest power state. Package consolidation can reduce the total time spent in the active state (PC0), obtained by summing over all packages the time each package spends in its PC0 state; thus the uncore power dissipation decreases. In particular, package consolidation utilizes as few packages in a server as possible, so the amount of time during which multiple CPUs in the same package are simultaneously in the CC0 state increases; i.e., the CC0 state overlap time increases, and therefore the total time spent in the PC0 state (which equals Σ_i T_PC0,i over the packages) decreases. An extreme case makes this point more obvious. Suppose one CPU is chosen from each package to remain active. Then the total time spent in the PC0 state will be greater than or equal to the time spent in the CC0 state of each CPU, because there is no possibility of CC0 state overlap. In comparison, if the two CPUs are chosen from the same package, then only one of the packages will be active, and even then the time that package spends in PC0 is less than or equal to the sum of the CC0 times of the two CPUs, because their CC0 states overlap.

The above discussion rests on a key assumption, namely that the PC-state of a package is independent of that of the other packages. In practice, this assumption is far from the truth. Fig 29 depicts the PC0 states of the two packages and reveals the opposite. The data presented in the figure are for a case where all active CPUs are chosen from exactly one package, so all CPUs in the other package are idle. The figure shows that the PC0 states of both packages are nearly identical; the same behavior is observed for the other PC-states (PC3 and PC6), although not shown here. This implies that all packages must stay in the active state (PC0) whenever any CPU, which may in fact reside in another package, is active. The reason is that an active CPU may need data from a remote DRAM, so not only the package where the active CPU is located but also all other packages must remain active to serve the requested data (details of the Intel® QPI architecture were discussed in Section 3.1.4) and thereby avoid significant additional latency. Since the PC-states of all packages are nearly the same due to Intel® QPI, little or no energy savings are expected from package-level consolidation. On the other hand, if a task accesses memory infrequently (as CPU-bound tasks do), package consolidation may save additional energy. In other words, package-level consolidation may or may not improve energy efficiency depending on the characteristics of the tasks. We discuss later whether package-level consolidation yields any energy savings.

Fig 29. Relationship between the PC0 states of the two packages of the target server when there are 4 active CPUs in exactly one of the packages

There are 8 CPUs in the system, so both packages can switch to an inactive state (PC3 or PC6) only when all of these CPUs are inactive. Hence, we expect both packages to be active most of the time even when the total utilization is low. As depicted in Fig 30, the PC0 residency reaches 100% when the total utilization exceeds 150% out of 800%. This implies that there is very little room for uncore power reduction.

Fig 30. Percentage of the time that each package in the target server is in the PC0 state as a function of the total utilization

3.5.3 Consolidation Overhead – vCPU Count
The number of virtual CPUs, called the vCPU count, is an important parameter of a virtual machine (VM) because it limits the performance of the VM. For example, a VM with two vCPUs can utilize at most two CPUs at a time, so the maximum total CPU utilization of the VM is 200%. However, managing each vCPU incurs additional overhead; thus both performance and energy efficiency suffer if VMs have unnecessarily many vCPUs. We use the ratio of the vCPU count to the active CPU count (called the virtualization ratio) as an indicator of this overhead. Experimental results for the PARSEC benchmark programs at different virtualization ratios are reported in Fig 31 and Fig 32. The active CPU count is 4; that is, the total CPU utilization is always at most 400%. The same experiments are repeated for four different vCPU counts: 8, 16, 24, and 32 (corresponding to virtualization ratios of 2, 4, 6, and 8, respectively). Except for one program, vips, the execution time remains the same when the ratio is 6 or less, while the E/task of many programs increases noticeably even when the ratio is 4 (Fig 32). This is due to the higher overhead of vCPU management. We therefore suggest keeping the virtualization ratio at 3 or below; a sketch of this rule follows the figure captions.

Fig 31. Consolidation overhead, i.e., execution time as a function of the virtualization ratio

Fig 32. Consolidation overhead, i.e., energy per task as a function of the virtualization ratio
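A minimal sketch of the suggested rule; the helper below is purely illustrative (in our setup the resulting count would be applied through the XEN configuration APIs mentioned in Section 3.4.1).

def target_vcpu_count(desired_vcpus, active_cpus, max_ratio=3):
    # Cap the virtualization ratio (vCPUs per active CPU) at max_ratio,
    # but never give the VM fewer vCPUs than there are active CPUs.
    return max(active_cpus, min(desired_vcpus, max_ratio * active_cpus))

print(target_vcpu_count(32, 4))  # -> 12 (a ratio of 8 would be capped at 3)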
3.5.4 CPU Selection Policy
The basic idea of consolidation is to keep as few CPUs active as possible at any time. Beyond the active CPU count, the CPU selection policy can matter for multi-core/multi-processor systems; e.g., choosing CPUs from a minimum number of packages versus selecting CPUs uniformly from all packages. The system under test has two packages, so there are two possible selection policies:
i) Select all CPUs from one package first and take additional CPUs from the other package only if necessary.
ii) Select an equal number of CPUs (modulo plus/minus one) from each package.
The CC0 and PC0 states of the bodytrack program are shown in Fig 33. Each plot compares the two selection policies: all four CPUs selected from one package (4CPU-1P) versus two CPUs chosen from each package (4CPU-2P). In our experimental results, a core or package resides in either the active (CC0, PC0) or the deepest sleep (CC6, PC6) state; hence, we present statistics only for the active states. The total time spent in each state (called the state residency) is calculated as the sum, over all active cores (for CC0) or packages (for PC0), of the time spent in the corresponding state, so these values can exceed 100%. Normalized workloads are calculated as the ratio of the actual workload to the workload that results in 50% total core utilization. As shown in Fig 33, the CC0 residencies of the two policies are close to each other, which is reasonable and expected. On the other hand, the total time spent in the PC0 state under the first policy (4CPU-1P) is smaller than under the other (4CPU-2P); that is, the time spent in the PC6 state under 4CPU-1P is greater, and therefore the uncore power dissipation of 4CPU-1P is smaller. According to the discussion in Section 3.5.2, bodytrack is thus a non-memory-intensive task.

Fig 33. CC0 and PC0 state residencies for the bodytrack program

The result for another program, canneal, is shown in Fig 34. Here there is a significant difference in the CC0 state residency between the two policies. canneal finds a chip design with minimum routing cost using cache-aware simulated annealing, which creates intensive memory read/write activity. If we use all four CPUs in the same package, we use only half of the L3 cache compared with the other case (4CPU-2P). This causes many more cache misses, so both the time spent in the CC0 state and the application execution time increase. The difference in PC0 state residency is negligible. From this result, package-level consolidation is not a good idea for applications requiring extensive data transfers to and from the main memory.

Fig 34. CC0 and PC0 state residencies for the canneal program

Normalized E/task comparisons for all PARSEC programs are reported in Fig 35. Due to run-to-run variations, the average of 15 measurements is presented. The E/task difference is less than 3% for most programs, the exceptions being bodytrack, canneal, and x264.
The most significant difference (about 6%) is observed for canneal. Later in this chapter we present a more sophisticated CPU selection policy that minimizes E/task.

Fig 35. Effect of simple CPU selection policies on the energy consumption per task of various PARSEC programs

3.5.5 Execution Time
According to (36), the delay (execution time) of a task increases as the average utilization per core increases, and the marginal rate of increase at high utilization is greater than at low utilization. Hence, if we keep the average utilization below a certain threshold, the delay increase caused by consolidation can be kept small. The normalized execution times of the PARSEC benchmark programs at various average utilizations are shown in Fig 36.
Evidently, this action increases the average execution time of tasks; however, this execution time increase does not affect the E/task metric much (this is because energy consumption of the server is dominated by dynamic power and not leakage power). The maximum E/task improvement achieved by consolidation is about 10%. Another observation is that the effects are somewhat additive that is, when we apply both DVFS and consolidation (see the ‘Combined’ results in the figure), the improvement is greater than the other two cases for most programs with the exception of canneal. The maximum improvement of the ‘Combined’ technique is greater than 15% (achieved for dedup). 94 Fig 37. Energy per task improvement Surprisingly, we observe very different results for the ED/task metric, as seen in Fig 38. The ED/task is worsened by DVFS because the task execution time increases significantly as a result of reducing the CPU clock frequency. On the other hand, consolidation maintains its relative energy savings except for the case of canneal. This is because the execution time of canneal increases monotonically even when the average CPU utilization is kept below 70%. Therefore, the ED/task improvement of consolidation for canneal is much smaller than all other programs. From this result, we can conclude that consolidation is a much more effective solution for delay sensitive applications compared to DVFS (although it loses much of its advantage in memory-bound applications). 0 5 10 15 20 blackscholes bodytrack canneal dedup fluidanimate freqmine raytrace streamcluster swaptions vips x264 E/task improvement (%) DVFS Consolidation Combined 95 Fig 38. Energy delay product per task improvement 3.5.7 CPU Consolidation for SPECWeb2009 Benchmarks In the previous section, the relative effectiveness of the CPU consolidation and DVFS was studied for the PARSEC benchmark suite. In this section, results for the SPECWeb2009 are presented. This benchmark suite comprises of I/O bound application programs whose characteristics are very different from those of the PARSEC programs. SPECWeb2009 is a very well developed benchmark suite, and its main purpose is to evaluate a web server (I/O-bound application); hence, we can see how consolidation affects the delay and energy efficiency of I/O- bound applications from SPECWeb2009 results. The energy efficiency is quantified as ED/packet because delay (i.e., response time) is a critical performance metric in these applications. SPECWeb2009 requires a simultaneous user session (SUS) count as an input, which is another way of specifying the workload intensity. The SUS count specifies only the average workload intensity (the instantaneous workload intensity -40 -20 0 20 blackscholes bodytrack canneal dedup fluidanimate freqmine raytrace streamcluster swaptions vips x264 ED/task improvement (%) 96 fluctuates a lot). Hence, an online method, which dynamically finds optimal settings for consolidation, is needed. In this section, we start from analyzing characteristics of the SPECWeb2009. After that, four online consolidation algorithms are presented, and results of those algorithms are reported and analyzed. Web applications are in general not compute-intensive [41]; hence, the average response time is less dependent on CPU clock frequencies as shown in Fig 39-a. This is because the response time of web servers is closely related to the I/O processes, such as network and disk access. 
Likewise, the response time is almost independent of the active CPU count when a sufficiently large number of CPUs is active. The relationship between the power dissipation and clock frequency/active CPU count is shown in Fig 39-b. The power dissipation declines as the frequency decreases and/or the active CPU count is reduced. This result implies that both DVFS and the CPU consolidation improve the energy efficiency without any significant performance degradation. In addition, we expect higher power efficiency gains when both techniques are applied at the same time. Fig 39. Response time and power dissipation When the OS changes the CPU clock frequency, the CPU utilization also changes under the 1.5 2 2.5 1 1.2 1.4 (a) response time frequency (GHz) normalized time 3CPU 4CPU 5CPU 6CPU 1.5 2 2.5 0.9 1 1.1 1.2 1.3 (b) power frequency (GHz) normalized power 97 same workload. Therefore, before changing the CPU clock frequency, the corresponding CPU utilization must be estimated in order to prevent the undesirable situation whereby active CPUs are overloaded because the chosen frequency is too low for the given workload. The relationship between the total CPU utilization and CPU clock frequency is depicted in Fig 40. Note that utilization is the percentage of time that a CPU spends executing user and system space codes. When a task is waiting for an I/O operation to be completed, the task is suspended and CPU does nothing. Hence, this suspension time is not included in the utilization. Fig 40. Frequency vs. total utilization (SPECWeb) According to the R 2 value, a linear equation is a nearly perfect fit the data points in Fig 40. The relationship is then as follows: () uf (40) where 𝛼 = 150.4, 𝛽 = 29.9 and 0 ≤ 𝑢 ≤ 800 (i.e., there are eight CPUs). Since coefficient β is relatively small, it can be ignored to simplify the relationship. Hence, the equation may be written as follows: i i j j f u f u (41) 0.4 0.45 0.5 0.55 0.6 0.65 90 100 110 120 130 1/freqency (1/GHz) total utilization (%) u = / f + = 150.422, = 29.854 R 2 = 0.999 98 3.5.8 Online CPU consolidation algorithms As shown in the previous section, both the clock frequency and the active CPU count affect the E/task and ED/task. In this section, we present online algorithms, which perform voltage/frequency setting and consolidation simultaneously. These algorithms monitor the CPU utilization, and change the frequency setting and/or the active CPU count depending on the current workloads. The main idea of these algorithms is to utilize as few CPUs at low frequencies as possible (while meeting the performance constraints); the decision is made by considering the current CPU utilization levels. This approach is reasonable for I/O bound applications because performance degradation is not significant unless the CPU is very highly utilized [42]. To avoid energy and delay overheads associated with frequent state changes, the proposed algorithms change the system configuration conservatively, that is, if the system is overloaded, these algorithms will immediately increase the frequency and/or the number of active CPUs. If, however, the system is underutilized, they will apply a state change (reduce frequency and/or turn off some CPUs) only if the situation persists for at least some time. We achieve this goal by introducing two different thresholds with hysteresis as described below. 
3.5.8 Online CPU Consolidation Algorithms
As shown in the previous section, both the clock frequency and the active CPU count affect E/task and ED/task. In this section we present online algorithms that perform voltage/frequency setting and consolidation simultaneously. These algorithms monitor the CPU utilization and change the frequency setting and/or the active CPU count depending on the current workload. The main idea is to utilize as few CPUs, at as low a frequency, as possible while meeting the performance constraints; the decision is made from the current CPU utilization levels. This approach is reasonable for I/O-bound applications because performance degradation is not significant unless the CPUs are very highly utilized [42]. To avoid the energy and delay overheads of frequent state changes, the proposed algorithms change the system configuration conservatively: if the system is overloaded, they immediately increase the frequency and/or the number of active CPUs; if the system is under-utilized, they apply a state change (reduce the frequency and/or turn off some CPUs) only if the situation persists for some time. We achieve this by introducing two thresholds with hysteresis, as described below.

We present four algorithms whose main ideas are quite similar to one another: if the average utilization (u_i) of a CPU is greater than an upper threshold (u_high), the algorithms deploy more computing resources by increasing the clock frequency of the active CPUs and/or adding to the number of active CPUs. Conversely, if the average utilization is less than a lower threshold (u_low), they release computing resources by decreasing the CPU frequency and/or reducing the number of active CPUs. To avoid any performance degradation, the utilization under the new frequency and active CPU count must be estimated. Equation (41) does not account for the number of active CPUs (c_i), so it is modified for this situation:

c_i · f_i · u_i = c_j · f_j · u_j    (42)

Because we can change both the CPU frequency and the active CPU count (when needed), we must decide which strategy receives higher priority:
i) Change the clock frequency first and the CPU count next.
ii) Change the CPU count first and the clock frequency next.
Pseudocode for the two resulting functions is presented in Fig 41. The first function, min_cpu(), finds the minimum CPU count (x_c) that avoids performance degradation, and then determines the slowest frequency (x_f) that, with the new CPU count, would still avoid performance degradation. This function tries to bring the new CPU utilization close to u_mid, the midpoint of the high/low thresholds:

u_mid = (u_high + u_low) / 2, with u_high = 85% and u_low = 65%    (43)

The second function, min_freq(), finds the slowest frequency first and then the minimum CPU count at that frequency; again, no performance penalty is allowed. The two functions are called when the system is under-utilized (i.e., the current utilization is below u_low) or over-utilized (i.e., the current utilization is above u_high), and for each case we can choose which function is called. Therefore, there are a total of four online algorithms, whose flows are shown in Fig 42.

Function min_cpu(u_i, f_i, c_i) {
    x_c = ceil( (u_i · f_i · c_i) / (u_mid · f_max) );
    x_f = ceil( (u_i · f_i · c_i) / (u_mid · x_c) );
    return (x_c, x_f);
}
Function min_freq(u_i, f_i, c_i) {
    x_f = ceil( (u_i · f_i · c_i) / (u_mid · c_max) );
    x_c = ceil( (u_i · f_i · c_i) / (u_mid · x_f) );
    return (x_c, x_f);
}

Fig 41. Pseudocode for min_cpu() and min_freq()

The first algorithm (Type1) calls min_cpu() in both the under- and over-utilized cases. Type2 calls min_cpu() when a CPU is over-utilized and min_freq() when it is under-utilized. Type3 calls min_freq() when a CPU is over-utilized and min_cpu() when it is under-utilized. The last algorithm (Type4) calls min_freq() in both cases.

Fig 42. Four online consolidation algorithms
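For concreteness, here is a runnable Python rendering of the Fig 41 pseudocode together with a Type1-style dispatch; u_mid and the thresholds follow equation (43), while f_max, c_max, and the frequency table are placeholders for the test-bed's discrete P-states (the text gives only the 1.6 and 2.4 GHz endpoints, so the intermediate steps below are assumed).

import math

U_HIGH, U_LOW = 85.0, 65.0
U_MID = (U_HIGH + U_LOW) / 2.0   # equation (43): 75%
F_MAX, C_MAX = 2.4, 8            # test-bed maxima
FREQS = [1.6, 1.73, 1.86, 2.0, 2.13, 2.26, 2.4]  # assumed P-state table

def ceil_freq(f):
    # The ceiling in Fig 41, read as "round up to the next supported frequency".
    return next((step for step in FREQS if step >= f), F_MAX)

def min_cpu(u_i, f_i, c_i):
    # Minimize the CPU count first (assuming f_max), then slow the clock.
    x_c = math.ceil(u_i * f_i * c_i / (U_MID * F_MAX))
    x_f = ceil_freq(u_i * f_i * c_i / (U_MID * x_c))
    return x_c, x_f

def min_freq(u_i, f_i, c_i):
    # Minimize the frequency first (assuming c_max), then shed CPUs.
    x_f = ceil_freq(u_i * f_i * c_i / (U_MID * C_MAX))
    x_c = math.ceil(u_i * f_i * c_i / (U_MID * x_f))
    return x_c, x_f

def type1_step(u_i, f_i, c_i, underutil_persisted):
    # Type1 dispatch: min_cpu() on both branches; the under-utilized branch
    # acts only after the condition has persisted (the hysteresis delay).
    if u_i > U_HIGH or (u_i < U_LOW and underutil_persisted):
        return min_cpu(u_i, f_i, c_i)
    return c_i, f_i

# Example: 4 CPUs at 2.4 GHz, 90% average utilization -> grow to 5 CPUs.
print(type1_step(90.0, 2.4, 4, False))  # -> (5, 2.4)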
We run experiments for three different SUS counts and compare the ED/packet and the quality of service (QoS) of the four proposed consolidation algorithms and two reference algorithms. The QoS is the percentage of packets whose response time (latency) is below a predefined threshold, as reported by the SPECWeb2009 benchmark suite. The two reference algorithms are base and ondemand. Under the base algorithm there is no dynamic adjustment of the active CPU count or clock frequency; all CPUs are active and run at the maximum allowed clock frequency. The ondemand algorithm is the default DVFS governor in Linux™; it does not change the active CPU count but changes the CPU frequency.

Experimental results are reported in Fig 43. Regardless of the SUS count, the proposed algorithms always yield a smaller ED/packet than the base and ondemand algorithms. Among the four proposed algorithms, Type1 is the best in terms of ED/packet. As the SUS count increases, the QoS of all algorithms decreases, but it remains above 95%; hence, there are no appreciable performance degradation concerns. Note that the magnitude of the ED/packet metric also decreases as the SUS count increases, which means the system consumes less energy per packet; this is a consequence of the energy non-proportionality of existing server systems (including the one used in this study). From these results we conclude that the Type1 consolidation algorithm is the best. This implies that, at least for the system under experiment, adjusting the CPU frequency has a higher impact on the ED/packet metric than changing the CPU count. Table 3 compares the ED/packet of the ondemand and Type1 algorithms: for all three SUS settings, the ED/packet of Type1 is smaller than that of ondemand, and the difference between the two grows with the number of user sessions.

Fig 43. ED/packet and QoS comparisons

Table 3. ED/packet comparison
SUS  | ED/packet (Js), ondemand | ED/packet (Js), Type1 | ΔED/packet (%)
1000 | 0.91                     | 0.82                  | 9.44
1400 | 0.76                     | 0.67                  | 11.83
1900 | 0.51                     | 0.44                  | 13.65

3.6 SUMMARY
DVFS has been a promising method for reducing energy consumption, but its energy-saving leverage shrinks as the supply voltage level decreases with CMOS scaling. In this chapter, CPU consolidation was considered as a substitute, or better stated, a complement. The idea looks simple; however, CPU consolidation must be investigated under a realistic setup to maximize the energy efficiency. Its effectiveness was thus investigated across different configurations: types of applications, the virtual CPU count, the active CPU count, and the active CPU set. From this investigation we learn a few useful lessons. First, an unnecessarily large number of virtual CPUs causes significant performance degradation; hence, the virtual CPU count must be adjusted dynamically. Second, the CPU selection policy should be chosen according to the application. Third, DVFS outperforms consolidation in terms of E/task improvement; on the other hand, DVFS does not improve the ED/task of PARSEC while consolidation does.
Fourth, the maximum ED/task improvement for SPECWeb2009 is achieved when both DVFS and consolidation are applied; similarly, the biggest E/task improvement for PARSEC is achieved when both techniques are used.

CONCLUSION
With the increasing energy cost of cloud computing systems, the need for energy-aware resource management techniques has been growing. This thesis proposes two such techniques. The first is a hierarchical VM consolidation technique that produces high-quality solutions and is scalable in the number of resource types as well as the number of VMs. First, we proposed HVMC as an initial VM allocation algorithm; it maximizes the portfolio effect by considering correlations among VMs, and its hierarchical structure makes the algorithm scalable, so its running time remains reasonably small even for very large problem instances. For situations where the intensity and characteristics of the workload keep changing, we also propose an online resource management solution. This method considers the network topology so as to minimize inter-cluster migration, which incurs higher overhead than intra-cluster migration; moreover, when it does perform an inter-cluster migration, it chooses the cluster that maximizes the portfolio effect if the VM migrates there. As a result, it produces high-quality solutions that are within 10% of the optimal solution.

The second proposed technique is CPU consolidation, which achieves further energy savings for under-utilized server machines. DVFS has been a promising method for reducing energy consumption, but its energy-saving leverage shrinks as the supply voltage level decreases with CMOS scaling. In this thesis, CPU consolidation is considered as a substitute, or better stated, a complement. The idea looks simple; however, CPU consolidation must be investigated under a realistic setup to maximize the energy efficiency. Its effectiveness was thus investigated across different configurations: types of applications, the virtual CPU count, the active CPU count, and the active CPU set. From this investigation we learn a few useful lessons. First, an unnecessarily large number of virtual CPUs causes significant performance degradation; hence, the virtual CPU count must be adjusted dynamically. Second, the CPU selection policy should be chosen according to the application. Third, DVFS outperforms consolidation in terms of E/task improvement; on the other hand, DVFS does not improve the ED/task of PARSEC while consolidation does. Fourth, the maximum ED/task improvement for SPECWeb2009 is achieved when both DVFS and consolidation are applied; similarly, the biggest E/task improvement for PARSEC is achieved when both techniques are used.

REFERENCES
[1] J. Koomey, “Growth in data center electricity use 2005 to 2010,” A report by Analytical Press, completed at the request of The New York Times, p. 9, 2011.
[2] K. Fehrenbacher, “The era of the 100 MW data center,” http://gigaom.com/cleantech/the-era-of-the-100-mw-data-center/.
[3] G. Cook and J. Van Horn, “How dirty is your data? A look at the energy choices that power cloud computing,” Greenpeace, April 2011.
[4] O. Bilgir, M. Martonosi, and Q. Wu, “Exploring the potential of CMP core count management on data center energy savings,” in Proceedings of the 3rd Workshop on Energy Efficient Design (WEED), 2011.
[5] L. A. Barroso and U. Hölzle, “The case for energy-proportional computing,” Computer, vol. 40, no. 12, pp. 33-37, 2007.
[6] M. Armbrust, A. Fox, R. Griffith, A. D. Joseph, R. Katz, A. Konwinski, G. Lee, D. Patterson, A. Rabkin, I. Stoica, and M. Zaharia, “A view of cloud computing,” Commun. ACM, vol. 53, no. 4, pp. 50-58, 2010.
[7] G. Dhiman, G. Marchetti, and T. Rosing, “vGreen: a system for energy efficient computing in virtualized environments,” in Proceedings of the 14th ACM/IEEE International Symposium on Low Power Electronics and Design, San Francisco, CA, USA, 2009, pp. 243-248.
[8] G. von Laszewski, L. Wang, A. J. Younge, and X. He, “Power-aware scheduling of virtual machines in DVFS-enabled clusters,” in IEEE International Conference on Cluster Computing and Workshops (CLUSTER ’09), 2009, pp. 1-10.
[9] P. Pillai and K. G. Shin, “Real-time dynamic voltage scaling for low-power embedded operating systems,” in Proceedings of the 18th ACM Symposium on Operating Systems Principles, Banff, Alberta, Canada, 2001, pp. 89-102.
[10] R. Nathuji and K. Schwan, “VirtualPower: coordinated power management in virtualized enterprise systems,” in Proceedings of the 21st ACM SIGOPS Symposium on Operating Systems Principles, Stevenson, Washington, USA, 2007, pp. 265-278.
[11] S. Mehta and A. Neogi, “ReCon: A tool to recommend dynamic server consolidation in multi-cluster data centers,” in IEEE Network Operations and Management Symposium (NOMS 2008), 2008, pp. 363-370.
[12] C. Clark, K. Fraser, S. Hand, J. G. Hansen, E. Jul, C. Limpach, I. Pratt, and A. Warfield, “Live migration of virtual machines,” in Proceedings of the 2nd Symposium on Networked Systems Design & Implementation, 2005, pp. 273-286.
[13] J. Kleinberg, Y. Rabani, and É. Tardos, “Allocating bandwidth for bursty connections,” SIAM Journal on Computing, vol. 30, no. 1, pp. 191-217, 2000.
[14] M. Wang, X. Meng, and L. Zhang, “Consolidating virtual machines with dynamic bandwidth demand in data centers,” in INFOCOM 2011 Proceedings, 2011, pp. 71-75.
[15] M. Chen, H. Zhang, Y.-Y. Su, X. Wang, G. Jiang, and K. Yoshihira, “Effective VM sizing in virtualized data centers,” in IFIP/IEEE International Symposium on Integrated Network Management (IM), 2011, pp. 594-601.
[16] X. Meng, C. Isci, J. Kephart, L. Zhang, E. Bouillet, and D. Pendarakis, “Efficient resource provisioning in compute clouds via VM multiplexing,” in Proceedings of the 7th International Conference on Autonomic Computing, Washington, DC, USA, 2010, pp. 11-20.
[17] D. Breitgand and A. Epstein, “Improving consolidation of virtual machines with risk-aware bandwidth oversubscription in compute clouds,” in INFOCOM 2012 Proceedings, 2012, pp. 2861-2865.
[18] A. Verma, G. Dasgupta, T. K. Nayak, P. De, and R. Kothari, “Server workload analysis for power minimization using consolidation,” in Proceedings of the 2009 USENIX Annual Technical Conference, San Diego, California, 2009.
[19] A. Khan, X. Yan, S. Tao, and N. Anerousis, “Workload characterization and prediction in the cloud: A multiple time series approach,” in IEEE Network Operations and Management Symposium (NOMS), 2012, pp. 1287-1294.
[20] D. Li, M. Xu, H. Zhao, and X. Fu, “Building mega data center from heterogeneous containers,” in 19th IEEE International Conference on Network Protocols (ICNP), 2011, pp. 256-265.
[21] P. Billingsley, Probability and Measure, Wiley, 2012.
[22] A. Rudd and H. K. Clasing, Modern Portfolio Theory: The Principles of Investment Management, Andrew Rudd, 1988.
[23] M. Al-Fares, A. Loukissas, and A. Vahdat, “A scalable, commodity data center network architecture,” in Proceedings of the ACM SIGCOMM 2008 Conference on Data Communication, Seattle, WA, USA, 2008, pp. 63-74.
[24] I. Hwang and M. Pedram, “Portfolio theory-based resource assignment in a cloud computing system,” in IEEE 5th International Conference on Cloud Computing (CLOUD), 2012, pp. 582-589.
[25] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction to Algorithms, MIT Press, 2001.
[26] N. Roy, J. S. Kinnebrew, N. Shankaran, G. Biswas, and D. C. Schmidt, “Toward effective multi-capacity resource allocation in distributed real-time and embedded systems,” in 11th IEEE International Symposium on Object Oriented Real-Time Distributed Computing (ISORC), 2008, pp. 124-128.
[27] W. Leinberger, G. Karypis, and V. Kumar, “Multi-capacity bin packing algorithms with applications to job scheduling under multiple constraints,” in Proceedings of the 1999 International Conference on Parallel Processing, 1999, pp. 404-412.
[28] P. Jaeckel and R. Rebonato, “The most general methodology for creating a valid correlation matrix for risk management and option pricing purposes,” Journal of Risk, vol. 2, no. 2, pp. 17-28, 1999.
[29] C. C. Skiścim and B. L. Golden, “Optimization by simulated annealing: A preliminary computational study for the TSP,” in Proceedings of the 15th Conference on Winter Simulation, Arlington, Virginia, USA, 1983, pp. 523-535.
[30] S. Takeda and T. Takemura, “A rank-based VM consolidation method for power saving in datacenters,” IPSJ Online Transactions, vol. 3, pp. 88-96, 2010.
[31] N. Bobroff, A. Kochut, and K. Beaty, “Dynamic placement of virtual machines for managing SLA violations,” in 10th IFIP/IEEE International Symposium on Integrated Network Management (IM ’07), 2007, pp. 119-128.
[32] L. A. Barroso, J. Dean, and U. Hölzle, “Web search for a planet: The Google cluster architecture,” IEEE Micro, vol. 23, no. 2, pp. 22-28, 2003.
[33] S. Shankland, “Google uncloaks once-secret server,” http://www.cnet.com/news/google-uncloaks-once-secret-server-10209580/.
[34] “Open Compute Project: Server/Specs and designs,” http://www.opencompute.org/wiki/Motherboard/SpecsAndDesigns.
[35] A. Andreyev, “Introducing data center fabric, the next-generation Facebook data center network,” https://code.facebook.com/posts/360346274145943/introducing-data-center-fabric-the-next-generation-facebook-data-center-network/.
[36] L. A. Barroso, J. Clidaras, and U. Hölzle, The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines, Second Edition, Synthesis Lectures on Computer Architecture, vol. 8, no. 3, pp. 1-154, 2013.
[37] C. Bienia, “Benchmarking modern multiprocessors,” Ph.D. dissertation, Princeton University, 2011.
[38] S. Park, J. Park, D. Shin, Y. Wang, Q. Xie, M. Pedram, and N. Chang, “Accurate modeling of the delay and energy overhead of dynamic voltage and frequency scaling in modern microprocessors,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 32, no. 5, pp. 695-708, 2013.
[39] “Xen power management,” http://wiki.xen.org/wiki/Xen_power_management.
[40] I. Hwang and M. Pedram, “A comparative study of the effectiveness of CPU consolidation versus dynamic voltage and frequency scaling in a virtualized multi-core server,” USC CENG Technical Report, 2013.
[41] D. Meisner, C. M. Sadler, L. A. Barroso, W.-D. Weber, and T. F. Wenisch, “Power management of online data-intensive services,” in 38th Annual International Symposium on Computer Architecture (ISCA), 2011, pp. 319-330.
[42] M. Pedram and I. Hwang, “Power and performance modeling in a virtualized server system,” in 39th International Conference on Parallel Processing Workshops (ICPPW), 2010, pp. 520-526.
Abstract
Improving the energy efficiency of cloud computing systems has become an important issue because the electric energy bill for 24/7 operation of these systems can be quite large. As a way of lowering daily energy consumption, this thesis proposes two resource consolidation techniques: virtual machine consolidation and CPU consolidation.

In contrast to many existing works that assume the resource demands of virtual machines are given as scalar values, we treat these demands as random variables with known means and standard deviations, because in many situations the demands are not deterministic. These random variables may be correlated with one another, and several types of resources can become performance bottlenecks; therefore, both correlations and resource-type heterogeneity must be considered. The virtual machine consolidation problem is thus formulated as a multi-capacity stochastic bin packing problem. This problem is NP-hard, so we present heuristic methods to solve it efficiently. Simulation results show that, in spite of its simplicity and scalability, the proposed method produces high-quality solutions.

While virtual machine consolidation saves a significant amount of energy, individual server machines remain under-utilized in order to avoid service-level-agreement violations. A popular way to reduce the energy consumption of such under-utilized server machines is dynamic voltage and frequency scaling (DVFS), which matches the CPU’s performance and power level to the incoming workload. Another power-saving technique is CPU consolidation, which uses the minimum number of CPUs necessary to meet the service request demands and turns off the remaining unused CPUs. DVFS has already been studied extensively and its effectiveness verified; the effectiveness of CPU consolidation deserves the same scrutiny. The key questions are how effectively CPU consolidation improves energy efficiency and how to maximize that improvement. These questions are addressed in this thesis. After reviewing modern power management techniques and developing an appropriate power model, the thesis presents an extensive set of hardware-based experimental results and makes suggestions on how to maximize the energy efficiency improvement achievable through CPU consolidation. In addition, the thesis presents new online CPU consolidation algorithms that reduce the energy-delay product by up to 13% compared to the default Linux DVFS algorithm.