ENERGY PROPORTIONAL COMPUTING FOR MULTI-CORE AND MANY-CORE SERVERS

by

Daniel Wong

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL ENGINEERING)

August 2015

Copyright 2015 Daniel Wong

Dedication

To my family for their constant support
To Liz who showed me life outside these office walls
To my dogs who keep me company throughout the long nights

Acknowledgements

First, my deepest gratitude goes to my advisor, Murali Annavaram. Words cannot describe how much I have learned and how much I have grown, as a person and a researcher, under his guidance. As the saying goes, a children's book is worth a thousand words. I hope with the completion of this dissertation, I am finally worthy to have a place on your website.

I would like to thank my qualifying and defense committee members: Michel Dubois, Massoud Pedram, Viktor Prasanna, and Minlan Yu for their insightful feedback. I would also like to acknowledge Monte Ung, for his support throughout my entire undergraduate and graduate career at USC. I hope you can finally enjoy your retirement!

Special thanks goes to Mehrtash Manoochehri and Lakshmi Kumar Dabbiru, whose friendship I will truly cherish (as long as there's kebab). There was never a dull moment these past 6 years. I would also like to acknowledge my colleagues who are helping build the foundation of the USC "mafia": Lizhong Chen, Hyeran Jeon, and Yanzhi Wang. I am grateful for my office mates who helped me procrastinate: Sang Wook Do, Justin Huang, Gunjae Koo, and Jinho Suh; the latter of whom never ceases to amaze me with his ability to always find time for video games. I would like to thank my colleagues: Mohammad Abdel-Majeed who introduced me to GPGPUs, Qiumin Xu who survived Samsung with me, and Krishna Giri Narra and Kiran Matam who will conquer Everest without me. I would also like to acknowledge the rest of the research group, for their insightful discussions: Melina Demertzi, Waleed Dweik, Sabya Ghosh, Sangwon Lee, Zhifeng Lin, Abdulaziz Tabbakh, Ruisheng Wang, and Bardia Zandian.

I would like to thank the staff members of the Ming Hsieh Department of Electrical Engineering: Diane Demetras, Tim Boston, Christina Fontenot, Kristie Rueff, Ted Low, Estella Lopez, and Mayumi Thrasher, who all welcomed me into the Trojan family my first week on campus. They're the best boss and administrative/student service support team anyone can ask for.

I would like to thank my family for their constant support and reminder that there's more to life than just academics.

Last but not least, I am grateful for Elizabeth Kuo, my better half. Thanks for sticking by my side, throughout the ups and downs, and the many sacrifices over the years. I can't wait to continue exploring the world with you.

Contents

Dedication
Acknowledgements
List of Tables
List of Figures
Abstract

1 Introduction
  1.1 Low Power Modes
  1.2 Contributions

2 Overcoming the Energy Proportionality Wall with KnightShift
  2.1 Introduction
  2.2 Measuring Energy Proportionality
  2.3 Energy Proportionality Trends
  2.4 KnightShift
    2.4.1 KnightShift Implementation Options
    2.4.2 KnightShift Runtime
    2.4.3 Choice of Knights
  2.5 A Case for Server-level Heterogeneity
  2.6 Evaluation
    2.6.1 Prototype Evaluation
    2.6.2 Trace-based Evaluation
    2.6.3 Sensitivity Analysis
  2.7 TCO
  2.8 Conclusion

3 Implications of High Energy Proportional Servers on Cluster-wide Energy Proportionality
  3.1 Introduction
  3.2 Cluster-wide Energy Proportionality
    3.2.1 Measuring Idealized Best-Case Cluster-wide EP
    3.2.2 Measuring Actual Cluster-wide EP
    3.2.3 Evaluation Methodology
    3.2.4 Utilization Data Derived from Workload Traces
    3.2.5 Arrival rate inputs to the Queueing Model
    3.2.6 Service rate input for the Queueing Model
    3.2.7 Server Energy Proportionality Inputs to the Queueing Model
    3.2.8 Load balancer implementations
    3.2.9 Revisiting Effectiveness of Cluster-level Packing Techniques
    3.2.10 Challenges facing adoption of server-level low power modes
  3.3 Server-level Low Power Mode Scalability
    3.3.1 Methodology
    3.3.2 Case study with 32-core server
    3.3.3 Sensitivity to Core Count
  3.4 Minimizing realistic latency slack of server-level active low power mode
    3.4.1 Workloads
    3.4.2 KnightShift Switching Policies
    3.4.3 Switching Policy Evaluation
  3.5 Conclusion

4 A Journey to the Edge of Energy Proportionality
  4.1 Introduction
  4.2 Where are we now and how did we get here?
    4.2.1 Methodology
    4.2.2 Feature Annotation
    4.2.3 Stepwise multiple regression
    4.2.4 Stepwise Regression Output
    4.2.5 Selecting a parsimonious model
    4.2.6 Findings from Stepwise Regression
    4.2.7 How did we get here? Explaining historical trends
  4.3 Where are we going? Identifying a possible edge of energy proportionality
    4.3.1 Deriving Pareto-optimal Frontier
    4.3.2 Case Study: A Hypothetical Super Energy Proportional Server
  4.4 How can we get there?
    4.4.1 Methodology
    4.4.2 Low Power Modes
    4.4.3 Effect of Low Power Modes on Server EP
    4.4.4 Low Power Opportunities for Non-Processor Components
  4.5 Related Work
  4.6 Conclusion

5 Managing Super Energy Proportional Servers
  5.1 Introduction
  5.2 Best-case Cluster-wide Energy Proportionality Analysis
  5.3 A Case for Peak Efficiency Scheduling
  5.4 Effectiveness of Peak Efficiency Scheduling
    5.4.1 Experimental Methodology
    5.4.2 Energy Impact of Realistic Load Scheduling
    5.4.3 Efficiency Impact of Realistic Load Scheduling
    5.4.4 Tail Latency Impact of Realistic Load Scheduling
  5.5 A case for maximizing compute capacity under power capping
  5.6 Related Work
  5.7 Conclusion

6 Conclusions and Future Work

Bibliography

List of Tables

1.1 Classification of Server Low-power modes
2.1 Energy consumption and response time of Wikibench using our KnightShift prototype and simulator.
2.2 Data center trace workload characteristics
2.3 Energy savings and latency impact wrt Baseline of a 15% Capable KnightShift system
2.4 Cost breakdown of primary server and Knight based on prototype KnightShift system. Other system components include motherboard, chipset, network interface, fans, and other on-board components.
3.1 Workload thumbnail and autocorrelation
3.2 BigHouse server power model based on [74] and [31]. Power is presented as percentage of peak power.
4.1 Annotated Server Features
4.2 Stepwise regression analysis results
4.3 Server configuration
4.4 EP, LD, and DR values for various processor power management policies
5.1 Variables for Best-Case Cluster-wide Energy Proportionality Analysis.

List of Figures

2.1 Energy Proportionality (EP) curve. The dotted, dashed, and solid lines show the ideal, linear, and actual server energy proportionality curve, respectively.
2.2 Energy Proportionality Trends
2.3 Three proposed implementations of KnightShift.
2.4 Performance trends of commercial systems.
2.5 KnightShift enhanced energy proportionality curve
2.6 Effect of KnightShift on SPECpower commercial servers with 20% and 50% Knight capability
2.7 Typical Google data center utilization distribution [10]
2.8 Server Energy, EP, LD improvement with KnightShift
2.9 KnightShift Prototype setup
2.10 Coordination of KnightShift servers
2.11 Effect of Capability on Latency and Energy
2.12 Effect of Wakeup transition time
2.13 Effect of Sleep transition time
2.14 TCO breakdown across PUE and Energy Cost
2.15 TCO breakdown across servers and infrastructure
3.1 Trace-driven Queueing Model-based Simulation Methodology
3.2 Best-case cluster-wide energy proportionality curve using Packing load balancing.
3.3 Cluster-wide Energy Proportionality using (a,b,c) Packing load balancing (Autoscale), (d,e,f) Uniform load balancing, and (g,h,i) Server-level active low power technique (KnightShift).
3.4 KnightShift provides similar energy savings to idleness scheduling algorithms but with less latency slack required.
3.5 Energy savings vs core count and latency slack
3.6 Performance and energy impacts of various KnightShift mode switching policies. Mode corresponds to (1) Knight mode, (2) Wakeup, and (3) Primary mode.
3.7 Effect of switching policy on latency, energy consumption, and mode switches.
4.1 Historical Trends labeled with processor generation. Processors with ^ label support DDR2, all others support DDR3. Improvements to EP occur in spurts. Two major EP growths in mid-2009 and early-2012 are due to DR and LD improvements, respectively.
4.2 Pareto-optimal frontier derived from historical SPECpower results labeled with processor generation. Data points with a gray/white box around them are Pareto-optimal points.
4.3 Energy Proportionality curve of a high energy proportional server with proportional processor only (a), and hypothetical server with all proportional components (b).
4.4 Instrumented server details. Current sensors are spliced into the power supply cables for the CPU, memory, and disk.
4.5 Effect of power management on EP
5.1 Energy Efficiency of future high energy proportional servers
5.2 Best Case Cluster-wide Energy Proportionality Curve for various Cluster-level Load Balancing Schemes (red solid line). The blue dotted line represents ideal linear energy proportionality.
5.3 Cluster-wide Energy Efficiency Curve for various Cluster-level Load Balancing Schemes
5.4 Energy curve for varying levels of wakeup transition times
5.5 Effect of Transition time on Cluster-wide EP
5.6 Efficiency curve for varying levels of wakeup transition times
5.7 Latency curve for varying levels of wakeup transition times
5.8 Power curve for varying levels of wakeup transition times
5.9 Latency curve for varying levels of wakeup transition times
5.10 Compute Capacity for various Cluster-level Load Balancing Schemes

Abstract

Data centers, driven by the exponential growth of cloud computing, are growing at an unprecedented rate. A major obstacle to the sustained growth of cloud computing is the colossal energy consumption of data centers. Data center workloads typically operate in low-to-mid utilization regions. Conversely, data center servers operate most efficiently at high utilization. This mismatch leads to significant energy inefficiency in data centers. Because of this mismatch, there has been a push for energy proportional computing, where a server's power consumption is proportional to the server's utilization, thus enabling high efficiency across all utilization regions. This dissertation presents a comprehensive understanding of, and holistic solutions for, energy proportionality in data centers.

In this dissertation, we first present a server-level solution to realize high energy proportional servers. Next, we explore the implications of high energy proportional servers on cluster-level management techniques. We then take a retrospective look back at the energy proportionality revolution in order to predict what future energy proportional servers will look like. Finally, we explore how these super energy proportional servers will impact data center energy management and propose cluster management schemes for super energy proportional servers.

As a first step toward the goal of energy proportional computing, metrics are proposed to accurately measure energy proportionality. Using these metrics, we analyzed the historical energy proportionality trends of 291 servers and showed that server-level heterogeneity is critical for future proportionality improvements. Through the insights we gained, we proposed KnightShift, a novel server architecture that enables high energy proportionality through the addition of a server-level active low power mode. We evaluated KnightShift through a combination of prototyping and simulation. Using our Intel Xeon/Intel Atom-based KnightShift prototype, we demonstrated 34% energy savings with 19% impact to 95th percentile response time, along with a 24% improvement in energy proportionality, while running a Wikipedia-based benchmark. Through queueing model based simulations using USC's data center utilization traces, we showed that we could save up to 87.9% of energy for certain workloads. To the best of our knowledge, KnightShift is the first server-level active low power mode design that exploits low utilization periods.

As KnightShift improves the energy proportionality of a single server in a data center, the next question we tackle is how high energy proportional servers will affect server cluster management in data centers. In the past, to improve the energy proportionality of an entire cluster of servers, researchers have proposed many cluster-level power saving techniques to mask the poor energy proportionality of individual servers. One example is dynamic capacity management, which works toward the goal of minimizing the number of servers needed for a given workload utilization. For instance, Packing scheduling algorithms assign as much work as possible to each server, in order to maximize each server's utilization, and then turn off a subset of servers.
In the presence of changing workload demands, these algorithms require workload migration and routinely turning servers on and off. With the emergence of high energy proportional servers, we revisit whether cluster-level packing techniques are still the best approach to achieving high cluster energy efficiency. We found that cluster-level packing techniques may now hinder cluster-wide energy proportionality with the emergence of high energy proportional servers. As server energy proportionality improves, we conclude that it may be beneficial to shift away from cluster-level packing techniques and rely solely on server-level techniques. We found that running a cluster without any cluster-level packing techniques can actually achieve higher cluster-wide energy proportionality than with packing techniques, a finding that is a major departure from conventional wisdom.

Based on the above two studies, we note that it is possible to build servers with better than linear energy proportionality. But designing such servers requires tackling the energy proportionality of all system components, beyond CPUs. We term these servers super energy proportional servers. In order to anticipate what the energy proportionality profile of super energy proportional servers will look like, it is important to understand the driving forces behind the energy proportionality improvements of current traditional server systems. We first mine historical SPECpower results and annotate all server power consumption data records with technological features, such as memory bus types and processor family. We perform statistically rigorous multivariable regression on the labeled dataset in order to determine which technological features contributed the most to energy proportionality improvements. Based on historical SPECpower results, we derive a Pareto-optimal frontier for the dynamic range/linear deviation/energy proportionality tradeoff. Based on the derived Pareto-optimal frontiers, we conclude that there is still 35% headroom for EP improvement over an ideal linearly energy proportional server, prompting the need to redefine what an ideal energy proportional server would be. We then present a hypothetical study of a server with all components exhibiting high energy proportionality to demonstrate that such a radical system would still fall within the Pareto-optimal frontier.

Super energy proportional servers exhibit unique qualities which will have significant impact on existing energy efficiency techniques. Specifically, these super energy proportional servers achieve peak efficiency at non-peak utilization levels. Leveraging this unique property, we present a Peak Efficiency Scheduling technique which restricts server clusters to run at utilization levels that achieve the highest energy efficiency, rather than aiming for the highest utilization. Furthermore, we demonstrate the usefulness of such properties in power capping scenarios, where Peak Efficiency Scheduling can achieve significantly higher compute capacity for a given power budget level.

As part of the dissertation research we also explored emerging energy efficiency paradigms for many-core accelerators. Since these paradigms focused on energy efficiency, not energy proportionality, we did not include the full details of the work in this dissertation, but we provide a brief summary of this work. We look at emerging graphics processing units and workloads to identify opportunities for approximation.
In particular, we identify a new type of value similarity called intra-warp operand similarity, where multiple threads that execute within a warp all operate on similar data; that is, the values of all operands in a warp are within a certain distance threshold of each other. We propose to exploit intra-warp data operand similarity by enabling value-guided approximate computation in GPUs. When the operands within the warp are all similar, the current warp instruction is approximated. Instead of computing on each operand exactly in its respective lane, we take one thread from the warp, called the representative thread, and compute exactly using the representative thread's operands. The result generated from the representative thread is then forwarded to the rest of the threads, where the result may be approximate for the receiving lanes. One key observation is that programmers can provide coarse-grain annotations marking where approximation, arising from intra-warp operand value similarity, can be tolerated. Within those regions, the hardware dynamically checks whether value similarity is present. This approach provides several benefits compared to existing user-guided and compiler-guided approximation techniques, such as lower programmer burden and not requiring offline profiling of training datasets. We designed and evaluated warp approximation with minimal changes to the GPU microarchitecture and show that warp approximation reduces energy consumption by 41.1% with negligible impact on performance.

To summarize, the work presented in this dissertation enhances our ability to accurately measure energy proportionality and tackles detailed design and implementation challenges of building energy proportional servers, through prototype evaluations and analytical models. It then presents new approaches to achieve energy proportionality at the cluster level in data centers, in the presence of highly proportional individual servers. Thus this dissertation makes contributions to improving energy proportionality both at the individual server level as well as at the cluster level.

Chapter 1

Introduction

With the growing popularity of cloud computing and social networks, data centers are growing at an unprecedented rate. Data centers currently account for an estimated 2% of all electricity use in the US [44]. Unfortunately, a large part of the electricity usage in data centers goes to waste due to server inefficiency. While the energy efficiency of servers, defined as workload operations per second per watt, has been improving, most of the efficiency improvements occur at peak server utilization. Unfortunately, servers in data centers spend the majority of their time at 20-50% utilization, where servers are most energy inefficient. For instance, it was observed that servers can consume 50% of peak power even at idle [10], where the server does no useful work. Therefore, it is imperative to improve the energy efficiency of data center servers, particularly at lower utilization levels. Going forward, the need to improve the energy efficiency of data center servers will only become more important; it is expected that the power consumption of data centers will increase 6% year on year in North America [1]. To tackle energy efficiency concerns in data centers, there must be two concurrent developments in server design. One goal is to simply reduce the power consumption of a server through component-level efficiency improvements.
The second goal is to consume power that is proportional to the amount of work the server does. This dissertation is focused on the second goal, proposing server designs aimed at achieving energy proportional computing. Energy proportional computing aims to design components and systems where the server's power consumption is proportional to the server's utilization, enabling servers to run at high efficiency regardless of utilization.

Historically, energy efficiency improvements have focused only on peak and idle utilization. But in reality, most servers spend the majority of their time in the intermediate utilization regions [10]. It has been shown in many prior works [10, 22, 55] that at intermediate utilizations energy efficiency is significantly degraded. To tackle this concern, researchers have started to focus on energy proportionality, which has led to the development of many techniques that focus on improving energy efficiency in this intermediate utilization region.

In many modern servers, CPUs account for the largest fraction of power consumed [74]. Many low power solutions are targeted at the component level, such as DVFS [16, 30, 48, 62, 66, 78] and power gating for CPUs [14, 41, 50, 51]. In addition, at the cluster level, many techniques [15, 21, 25, 76] have been developed in order to mask the poor energy proportionality of the underlying servers. Techniques such as dynamic cluster resizing [25, 76] were developed to reduce the number of active servers in a cluster based on the amount of work being performed. These techniques improve energy proportionality at the cluster level by forcing fewer servers to operate at higher utilization levels, where they are more energy efficient. Both component-level and cluster-level energy efficiency techniques are widely used and adopted in production data centers today [22, 47, 56].

An intermediate granularity, between the component- and cluster-level techniques, is server-level energy efficiency. Server-level low power modes place the entire server into a low power state. For example, server standby places the server into an inoperable state, but consumes nearly zero power [55]. Combined with cluster-level techniques, these server-level low power modes provide near ideal energy proportionality at the cluster level [82]. We use the term server-level inactive low power mode to describe those low power modes where a server is unable to provide any useful computation. An alternative approach that researchers have recently explored is server-level active low power modes. These power modes, such as Somniloquy and Barely-alive servers [3, 5], allow servers to be placed into a low power state, but still operate on simple I/O requests. A large part of this dissertation explores new architectural techniques for improving the efficiency and usability of server-level active low power modes. In the next section, we frame the design space of low power modes and show the gap that this dissertation fills.

1.1 Low Power Modes

Power and energy related issues in the context of large scale data centers have become a growing concern for both commercial industries and the government. Numerous studies have examined energy efficiency approaches to servers in data center settings. These approaches can be classified along three dimensions: spatial granularity (Granularity), the period in which the low power mode is active (Period), and the ability of the low power mode to perform work (Active/Inactive).
The granularity refers to whether the low power mode works at the cluster, server, or component level. The period in which the low power mode is active refers to the region of operation that the low power mode exploits. Low power modes can exploit either idle periods (0% utilization) or low utilization periods. The ability to perform work refers to whether the low power mode allows the system to continue processing requests. For inactive low power modes, the system cannot process requests. For active low power modes, the system can still process requests, possibly with lower capability and performance. For example, if a low power mode is an active low power mode and exploits low utilization periods, it means that the low power mode is activated during low utilization periods and can still perform work. Using this three-dimensional classification, Table 1.1 bins the most relevant prior work, which we briefly describe next.

Table 1.1: Classification of Server Low-power modes
- Cluster-level -- Active: Consolidation & Dynamic Cluster Resizing [12, 13]; Inactive: Shutdown
- Server-level -- Active (idle): Somniloquy [3], Barely-alive Servers [6]; Active (low utilization): KnightShift [80]; Inactive (idle): PowerNap [55]
- Component-level -- Active: DVFS, MemScale [20], Heter. Cores [8, 26, 29, 45]; Inactive (idle): DRAM Self-refresh, Core Parking, Disk Spin down
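To make the three-dimensional classification concrete, the sketch below encodes Table 1.1 as a small Python data structure. This is an illustrative encoding, not part of the dissertation; the field names and helper function are invented here, and the period assignments for the cluster- and component-level entries reflect one reading of the table.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LowPowerMode:
    name: str
    granularity: str  # "cluster", "server", or "component"
    period: str       # "idle" or "low-utilization"
    active: bool      # True if the system can still serve requests in this mode

# A partial encoding of Table 1.1 (citation numbers follow the dissertation).
MODES = [
    LowPowerMode("Consolidation & Dynamic Cluster Resizing [12,13]", "cluster", "idle", True),
    LowPowerMode("Somniloquy [3]", "server", "idle", True),
    LowPowerMode("Barely-alive Servers [6]", "server", "idle", True),
    LowPowerMode("KnightShift [80]", "server", "low-utilization", True),
    LowPowerMode("DVFS / MemScale [20] / Heterogeneous Cores", "component", "low-utilization", True),
    LowPowerMode("PowerNap [55]", "server", "idle", False),
    LowPowerMode("DRAM self-refresh / core parking / disk spin-down", "component", "idle", False),
]

def modes_for(granularity: str, period: str, active: bool):
    """Return the techniques that fall into one cell of the taxonomy."""
    return [m.name for m in MODES
            if m.granularity == granularity and m.period == period and m.active == active]

# Example: server-level active modes that exploit low-utilization periods.
print(modes_for("server", "low-utilization", True))  # ['KnightShift [80]']
```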
Cluster-level techniques: Common techniques such as consolidation and dynamic cluster resizing [12, 13] concentrate the workload onto a group of servers to increase average server utilization and power off idle machines, improving efficiency and lowering total power usage. While these techniques primarily focus on improving cluster-wide energy efficiency, they also improve cluster-wide energy proportionality. Since they operate only a minimum required set of servers, each at higher utilization, each server itself operates in the high efficiency region. Thus the overall cluster energy proportionality always stays high, since servers are either highly utilized or turned off, and both of these regions of operation have high energy efficiency. Although beneficial, these techniques are not suitable for many emerging workloads in today's data center settings. For direct-attached storage architectures or workloads with large distributed data sets, servers must remain powered on to keep data accessible. Furthermore, due to the large temporal granularity of these techniques, they cannot respond rapidly to unanticipated load, as it could take minutes to migrate tasks with very large working sets. Under these circumstances, server consolidation is not always a viable solution. Furthermore, as we will show in chapter 3, as the energy proportionality of servers improves, it is in fact better to forego cluster-level packing techniques in favor of uniformly distributing work to all servers.

Component-level techniques: Component-level energy saving techniques for CPU, memory, and disk cover both active and inactive low-power modes. Active low-power techniques improve the energy proportionality of components by providing multiple operating efficiencies at different utilization levels. Heterogeneous cores [8, 26, 29, 45], such as Tegra 3 and ARM big.LITTLE, can switch to low-power efficient cores during low-utilization periods, while DVFS and MemScale [20] scale the frequency and power of components depending on utilization levels. Furthermore, inactive low-power techniques, such as DRAM self-refresh [11], core parking [56], and disk spin down, can improve the idle power consumption of these components. Most dynamic range improvements seen to date are driven primarily by processor energy efficiency gains. But going forward, no single component dominates overall power usage [74]. Therefore, it is important to explore other low power techniques beyond the component level.

Server-level techniques: Server-level techniques aim to put the entire server into a low-power mode. Previous techniques aimed to improve energy efficiency by increasing the dynamic range through lowering the idle power usage and extending the time a system stays idle. PowerNap [55] designs an approach for switching a server to an inactive low-power mode rapidly, and then exploits millisecond idle periods by rapidly transitioning to an inactive low-power state. DreamWeaver [58] extends PowerNap to queue requests, artificially creating and extending idle periods. Barely-alive servers [6] place the server in a low-power state, but extend idle periods by keeping memory active to process remote I/O requests. Similarly, Somniloquy [3] allows idle computers to support certain application protocols, such as download and instant messaging. As the number of processors in servers increases, idle periods will become increasingly rare [58]. Thus active low-power modes that can efficiently operate at low-utilization levels will be the only practical server-level energy saving technique in the future.

Prior to the publication of the work presented in this dissertation, existing literature lacked work that exploits low-utilization opportunities. It is critical to tackle the lack of energy efficiency during low-utilization periods. This dissertation tackles this critical gap by providing a wide range of solutions for active low-power server design and their usage within the context of data centers.

1.2 Contributions

This dissertation provides fundamental insights into the hurdles in achieving ideal energy proportionality and then uses these insights to develop new computing paradigms and hardware platforms that can operate efficiently under varying load conditions. This research spans several related areas, including measuring and understanding energy proportionality, and server-level and cluster-level low power modes. This dissertation makes the following contributions:

Overcoming the Energy Proportionality Wall with KnightShift: It first presents metrics to reason about energy proportionality. Using these metrics it analyzes historical energy proportionality trends to show that energy efficiency in the low-mid utilization region has not improved as much over time. Inspired by this observation it presents KnightShift, a novel server architecture to enable high energy proportionality through the addition of a server-level active low power mode. Details of metrics for energy proportionality and KnightShift are presented in chapter 2.

Implications of High Energy Proportional Servers on Cluster-wide Energy Proportionality: In the second part of this dissertation, we explore how the emergence of high energy proportional servers will affect cluster-wide energy proportionality. Specifically, we show that with high energy proportional servers, uniform load balancing may outperform packing load balancing, an observation that is a major departure from conventional wisdom.
Our motivation and quantitative justification of these novel observations are presented in chapter 3.

Energy Proportionality Characterization of Future Server Systems: In the third part of this dissertation, we present a characterization of future server systems by first identifying how energy proportionality has improved. We identified the major contributing factors towards energy proportionality growth, and characterized how processor low power modes affect energy proportionality. Based on historical SPECpower data, we then derive Pareto-optimal frontiers to characterize future super energy proportional servers and their implications. Details of this study are presented in chapter 4.

Management of Super Energy Proportional Servers: In the fourth part of this dissertation, we present cluster-level energy proportional management schemes using super energy proportional servers. Super energy proportional servers achieve peak efficiency at non-peak utilization levels. Leveraging this unique property, we present a Peak Efficiency Scheduling technique which enables server clusters to run at utilization levels that achieve the highest energy efficiency, rather than aiming for the highest utilization. Furthermore, we demonstrate the usefulness of such properties in power capping scenarios, where Peak Efficiency Scheduling can achieve significantly higher compute capacity for a given power budget level. Details of this study are presented in chapter 5.

Finally, in Chapter 6 we conclude with a summary of the contributions of this dissertation and highlight some open problems, which indicate possible future exploration directions in the area of energy efficient server design.

Chapter 2

Overcoming the Energy Proportionality Wall with KnightShift

2.1 Introduction

Energy consumption of data center servers is a critical concern. Server operating energy costs comprise a significant fraction of the total operating cost of data centers. However, many servers operate at low utilization and still consume significant energy due to the lack of ideal energy proportionality [10]. As described in Chapter 1, ideal energy proportional servers consume power in proportion to their utilization. This enables servers to sustain a high level of energy efficiency regardless of operating utilization.

Server consolidation [12, 13] can boost utilization on some servers while allowing idle servers to be turned off, improving energy proportionality at the data center level. Unfortunately, server shutdown is not always possible due to data availability concerns and workload migration overheads. When server shutdown is impractical, as is the case in many industrial data centers [6, 58], system-level energy proportionality approaches must be explored.

Energy proportionality improvements of various server components [20, 66], such as CPUs and memory, have fueled the improvements of overall system efficiency. The primary concern today is that energy proportionality improvements have not been uniform across different utilizations. The problem of disproportionality is particularly acute at non-zero but low server utilization. Since no single component dominates server energy usage [74], holistic system-level approaches must be developed to improve energy proportionality, particularly in low utilization regions.

Several system-level power saving approaches have focused on reducing power consumption during idle periods [55].
Researchers then focused on increasing the length of idle periods by queueing requests [58] or by shifting the I/O burden directly to disk and memory [4, 6]. However, as multi-core servers become dominant, idle periods are virtually nonexistent [58, 81]. Even as idle periods become rare, servers still spend a significant fraction of their execution time operating at low utilization levels. Thus there is a critical need to develop active low-power modes that exploit low-utilization periods to continue improving server-level energy proportionality across the entire utilization range. This chapter addresses this critical need by proposing KnightShift, a server-level energy proportionality technique. This chapter makes the following contributions:

Metrics to Identify Disproportionality (Section 2.2): We propose metrics to evaluate energy proportionality and to identify sources of disproportionality. Using data from historical SPECpower [83] results of 291 servers, we show that commonly used metrics such as dynamic range are inappropriate due to the lack of linearity in energy consumption across different utilizations. We present a metric for measuring the linearity of energy consumption across different utilizations. Using the linearity metric we show that the proportionality gap is much wider at lower utilization than at idle or higher utilization.

Energy Proportionality Trend Analysis (Section 2.3): From historical SPECpower data we show the existence of an energy proportionality wall due to the lack of improvements to the dynamic range and poor energy efficiency at low server utilization. Previous work (Section 1.1) only targeted improvements to the dynamic range by improving idle power. In order to continue improving energy proportionality, we must improve the linearity of the server's energy proportionality curve, especially at lower utilization where the majority of the proportionality gap exists.

KnightShift (Section 2.4): We present KnightShift, a server-level heterogeneous server architecture that introduces an active low power mode to exploit low-utilization periods. By fronting a high-power primary server with a low-power compute node, called the Knight, we enable two energy-efficient operating regions. We show that KnightShift effectively improves the energy proportionality and linear deviation of servers in Section 2.5. We present evaluation results of KnightShift in Section 2.6 and explore TCO impact in Section 2.7.

2.2 Measuring Energy Proportionality

[Figure 2.1: Energy Proportionality (EP) curve. The dotted, dashed, and solid lines show the ideal, linear, and actual server energy proportionality curves, respectively. (a) Superlinear EP; (b) Sublinear EP. Axes: utilization (x) vs. % of peak power (y).]

In order to understand energy proportionality trends, we must first quantify energy proportionality. Figure 2.1 illustrates the power usage, as a fraction of peak power consumed at 100% utilization, of two servers over their operating utilization, called the energy proportionality curve. The dotted line shows the ideal energy proportionality curve of a server that spends power in proportion to its utilization. For instance, at 10% utilization the server spends 10% of the peak power. The dashed line shows the linear energy
The dashed line shows the linear energy 10 proportionality curve by interpolating idle and peak power; linear energy proportional- ity assumes that the power consumption scales linearly between idle and peak power. The solid line shows the actual server energy proportionality curve. The data presented in this figure are obtained from measurements on real servers reported to SPECpower (more detailed analysis of SPECpower data is provided in section 2.3). While the figure is illustrative of energy proportionality, we show power on y-axis. The implicit assumption of all energy proportional computing is that when improving energy proportionality, one does not alter the execution time. Hence, whether a server at 10% utilization consumes 10% or 20% of the peak power, the execution time remains constant. Hence, we show power on y-axis, as opposed to energy. Dynamic Range: The dynamic range (DR) metric is commonly used as a first order approximation for energy proportionality. The dynamic range of a server is given by: DR = Power peak Power idle Power peak (2.1) wherePower peak is the peak power at 100% utilization andPower idle is the idle power at 0% utilization. In Figures 2.1a and 2.1b, the DR for both servers are the same at 60% since the two illustrative servers consume 40% of the peak power at idle. DR only accounts for peak and idle power usage and does not account for power usage variations across different utilizations. Since most servers are rarely fully utilized or fully idle, DR is a poor measurement of the server’s actual proportionality. For example, assume that both servers in Figure 2.1 consume 100W at peak power. If each server experiences utilization distribution similar to Google servers reported in [10], then Server A (on the left) would consume on average 28% more power (68.6W vs 52.6W) compared to Server B (on the right), even though they both have the same dynamic range. The reason 11 for this discrepancy is that the two servers have different power consumption profiles at intermediate utilizations. Energy Proportionality: To accurately quantify energy proportionality, we must account for intermediate utilization power usage. The energy proportionality (EP) of a server (proposed in [63] and adapted for this work) is given by: EP = 1 Area actual Area ideal Area ideal (2.2) where Area actual and Area ideal is the area under the server’s actual and ideal energy proportionality curve, respectively. Note that ifArea actual =Area linear , then EP would equal DR. Therefore, DR is a good measurement of energy proportionality only if a server is linearly energy proportional, however, this is not the case in most servers. For example, the EP of Server A and B is 53% and 74%, respectively. Although the DR of both servers is 60%, their EP values differ by over 20%. Energy proportionality is a function of the dynamic range and the linearity of the energy proportionality curve. Thus to accurately account for energy proportionality, one has to account for the amount of deviation from linearity within the server’s energy proportionality curve. Linear Deviation: We define Linear Deviation (LD) as a measure of the energy proportionality curve’s linearity. Linear Deviation is given by: LD = Area actual Area linear 1 (2.3) A server is considered linearly energy proportional if LD = 0, superlinearly energy proportional if LD > 0, and sublinearly energy proportional if LD < 0. 
Figure 2.1a and 2.1b shows a proportionality curve with superlinear and sublinear energy propor- tional system, respectively. Superlinear energy proportional servers (EP +LD ) have EP < DR , while sublinear energy proportional servers (EP LD ) have EP > DR. This 12 can be proven by equation 2.2. For superlinear, linear, and sublinear servers, we can see that the area below the actual curve (Area X ) has the following relationship: Area +LD > Area linear > Area LD . Through this relationship and equation 2.2, the EP for superlinear, linear, and sublinear servers EP X has the following relationship: EP +LD <EP linear <EP LD , whereEP linear =DR. Proportionality Gap: The Proportionality Gap (PG) is a measure of deviation be- tween the server’s actual energy proportionality and the ideal energy proportionality at individual utilization levels. PG allows us to quantify the disproportionality of servers at a finer granularity compared to EP to better pinpoint the causes of disproportionality. PG at utilization levelx% is given by: PG x% = Power actual@x% Power ideal@x% Power peak (2.4) For an ideal energy proportional server, the PG for all utilization levels is 0. For super- linearly proportional systems, like Server A, PG is very large at zero utilization and it continues to grow for some time before it starts to shrink. For sublinearly proportional systems, like Server B, PG is very large at zero utilization but it continues to decrease with utilization. 2.3 Energy Proportionality Trends To understand historical trends in energy proportionality, we analyze the submitted re- sults of SPECpower [83] for 291 servers from November 2007 to December 2011. These servers are a representative mix of server configurations in use during that time win- dow. They feature servers from various vendors, with various form factors, and proces- sors. The SPECpower benchmark evaluates the power and performance characteristics 13 0 0.2 0.4 0.6 0.8 1 Nov-07 Mar-09 Jul-10 Dec-11 Dynamic Range Time (a) Dynamic Range 0 0.2 0.4 0.6 0.8 1 Nov-07 Mar-09 Jul-10 Dec-11 Energy Proportionatliy Time "+LD" "-LD" (b) Energy Proportionality 0 0.2 0.4 0.6 0.8 1 -0.10 -0.05 0.00 0.05 0.10 0.15 Energy Proportionality Linear Deviation (c) Linear Deviation 0 0.2 0.4 0.6 0.8 0% 20% 40% 60% 80% 100% Proportionatliy Gap Utilization LOW(<50) MID(50-75) HIGH(75+) (d) Proportionality Gap 0 0.2 0.4 0.6 0.8 1 0% 20% 40% 60% 80% 100% Normalized ssj_ops/W Utilization HIGH(75+) MID(50-75) LOW(<50) (e) Energy Efficiency 0 1000 2000 3000 4000 5000 6000 7000 Nov-07 Mar-09 Jul-10 Dec-11 ssj_ops/watt Time Efficiency @ 100% Load Efficiency @ 10% Load (f) Efficiency Growth Figure 2.2: Energy Proportionality Trends of servers by measuring the performance and power consumption of servers at each 10% utilization interval. These trends are shown in Figure 2.2 and are discussed below. Dynamic Range: Figure 2.2a plots the dynamic range of servers (computed using equation 2.1 described earlier) along with the median trend line. Each data point corre- sponds to one server whose SPECpower results were posted on a given date. Overall, DR improved from about 50% to 80% from 2007 to 2009. From 2009 onward, DR stagnated at 80%. Although the best DR is 80%, half of servers in late 2011 still have DR less than 70%; and worse yet, there are still servers with DR less than 40%. 
We 14 can surmise that achieving 100% dynamic range is very difficult due to energy dispro- portional and energy inefficient components such as power supplies, voltage converters, fans, chipsets, network components, memory, and disks. In chapter 4 we did in fact measured component level energy proportionality by instrumenting a high-end server and our analysis justifies this conclusions. Energy Proportionality: Figure 2.2b shows that energy proportionality trends are similar, but not identical, to dynamic range trends. Clearly, energy proportionality has also stalled at around 80%. This energy proportionality wall is mainly due to the lack of dynamic range improvement. Each server’s energy proportionality data point is clas- sified as either superlinear (labeled as +LD in the figure) or sublinear (labeled as -LD in the figure) proportional based on their SPECpower data. It is important to draw atten- tion to a few data points where EP> 80%, although no servers have a dynamic range above 80%. Recall that the only way a server can have EP> DR is for that server to have sublinear energy proportionality (-LD). A sublinearly energy proportional server consumes less power than a linearly proportional server. Hence, it can have higher energy proportionality than dynamic range. Thus, those few servers with EP > 80% in Figure 2.2b have sublinear energy proportionality. Note that -LD does not always imply high energy proportionality. In particular, energy proportionality is affected by two components: dynamic range and the linear deviation. If dynamic range can be im- proved, then linear deviation improvements have a secondary impact on overall energy proportionality. But as dynamic range improvements hit a wall, the only way to improve energy proportionality moving forward is to improve linear deviation. Linear Deviation: Figure 2.2c shows the relationship between linear deviation and energy proportionality. Unfortunately, the data shows that the majority of servers (at least 80%) are superlinearly proportional (+LD). Hence, there lies potential to improve energy proportionality in current servers by improving their linear deviation. 15 Proportionality Gap: Figure 2.2d shows the average proportionality gap of servers at various utilization levels. The curves, from top to bottom, represent servers with low EP (<50%), medium EP (50-75%), and high EP (>75%). Irrespective of the energy proportionality level the striking feature is that all servers suffer large proportionality gap at low utilizations. Furthermore, as energy proportionality increases, it becomes clear that the majority of the proportionality gap occurs at lower utilizations. As energy proportionality improves, energy disproportionality at lower utilization will be the main obstacle to achieving perfectly energy proportional systems. Energy Efficiency: SPECpower reports ssj ops/Watt as a metric for energy effi- ciency. SPECpower models a Java-based middle-tier business logic where ssj ops stand for the number of server side Java operations per second. Figure 2.2e shows that energy efficiency is strongly correlated with the proportionality gap. The curves, from top to bottom, represents servers with high, medium, and low energy proportionality, respec- tively. Due to the large proportionality gap at low utilization, server energy efficiency is about 30% of the peak efficiency even for servers with relatively high energy propor- tionality. 
Energy Efficiency: SPECpower reports ssj_ops/Watt as a metric for energy efficiency. SPECpower models Java-based middle-tier business logic, where ssj_ops stands for the number of server-side Java operations per second. Figure 2.2e shows that energy efficiency is strongly correlated with the proportionality gap. The curves, from top to bottom, represent servers with high, medium, and low energy proportionality, respectively. Due to the large proportionality gap at low utilization, server energy efficiency is about 30% of the peak efficiency even for servers with relatively high energy proportionality. Hence, even if the overall energy proportionality of a server improves over time, the energy efficiency will still suffer at low utilizations unless the proportionality gap at low utilizations is reduced. Otherwise, even the highest energy proportional servers can run efficiently only at high utilizations. In order to improve energy efficiency, we should improve efficiency at lower utilization. Unfortunately, as Figure 2.2f shows, improvements to efficiency at higher utilization have outpaced improvements at lower utilization.

Overcoming the EP wall: In order to improve the energy efficiency of servers, we cannot rely solely on improvements to dynamic range, as has been the case in the past. Therefore, we cannot concentrate on energy efficiency improvements at peak and idle only. As dynamic range is now static, we must focus on improving the linear deviation. As shown previously, servers operate in two distinct energy proportional regions. Servers tend to be nearly perfectly proportional at high utilization (>50%), while disproportional at low utilization (<50%). Therefore, in order to gain the most benefit, we must focus our efforts on the low utilization region. Furthermore, processors are no longer the major consumer of power in servers [74]. In order to reduce energy consumption, new server-level solutions that tackle the proportionality gap at low utilizations are needed.

2.4 KnightShift

In this section we introduce KnightShift, a heterogeneous server-level architecture to reduce the proportionality gap of servers at low utilization. KnightShift fronts a high-power primary server with a low-power compute node, called the Knight node, enabling two energy-efficient operating regions. We define the Knight capability as the fraction of throughput that the Knight can provide compared to the primary server. To the best of our knowledge, KnightShift is the first server-level active low-power mode solution to exploit low-utilization periods. The fundamental issues limiting energy proportionality have been the lack of improvement to dynamic range and linear deviation. While previous techniques [3, 6, 41, 55] only targeted dynamic range, KnightShift extends previous techniques by also targeting linear deviation.
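The two operating regions can be visualized by stitching together the Knight's and the primary server's power curves. The sketch below is a simplified, hypothetical model, not taken from the dissertation: it assumes the primary server is fully powered off while in Knight mode and ignores the Knight's power draw while the primary server is active, so system load up to the Knight capability is served by the Knight at its own (scaled) utilization.

```python
def knightshift_power(load, capability, knight_power, primary_power):
    """Power drawn by a simplified KnightShift system at a given system load.

    load          : system utilization as a fraction of the primary server's peak throughput
    capability    : Knight throughput as a fraction of the primary server's (e.g. 0.2)
    knight_power  : function mapping the Knight's own utilization (0..1) to watts
    primary_power : function mapping the primary server's utilization (0..1) to watts
    """
    if load <= capability:
        # Knight mode: the Knight runs at load/capability of its own peak; primary is off.
        return knight_power(load / capability)
    # Primary mode: the primary server handles the load (Knight overhead ignored here).
    return primary_power(load)

# Hypothetical curves: a 300 W primary server with 45% idle power and a 30 W, 20%-capable Knight.
def primary(u):
    return 300.0 * (0.45 + 0.55 * u)

def knight(u):
    return 30.0 * (0.5 + 0.5 * u)

for load in [0.05, 0.15, 0.25, 0.60, 1.00]:
    print(f"load {load:4.0%}: {knightshift_power(load, 0.2, knight, primary):6.1f} W")
```

For loads below the 20% capability, the hypothetical system draws tens of watts instead of hundreds, which is exactly the low-utilization region where the proportionality gap of conventional servers is largest.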
A KnightShift system consists of three components:

1. KnightShift hardware: The KnightShift hardware consists of a low-power, low-performance compute node, called the Knight node, paired with a high-power, high-performance server. Both the Knight and the primary server can be independently powered on and off. They share a common data disk and are able to communicate with one another through a traditional network interface. In section 2.4.1, we introduce three possible implementations of KnightShift. We assume the Knight node is provisioned with less memory than the primary server. However, certain workloads require large memory-resident datasets, such as scale-out workloads [23]. These workloads can still benefit from KnightShift by alternatively using low-power mobile memory [52], therefore still benefiting from overall energy savings. Current server motherboards are typically not built to accommodate low-power mobile memory, while a Knight can use such a memory type.

2. System software: The system software enables several key functionalities required for KnightShift, such as disk sharing, network configuration, and remote wakeup of compute nodes. Most operating systems already support the required system software functionality. In section 2.4.1 we describe the specifics of the system software required to support KnightShift.

3. KnightShift runtime: The KnightShift runtime is the new software layer that is built specifically for the purpose of operating KnightShift. This runtime layer monitors utilization, makes mode switching decisions, redirects requests between the Knight and the primary server, and coordinates disk access rights to ensure data consistency. We discuss this runtime in detail in section 2.4.2 and present our prototype implementation in section 2.6.1.

2.4.1 KnightShift Implementation Options

We propose three implementations of KnightShift, as shown in Figure 2.3. The preferred choice for implementing KnightShift depends on the usage scenario and the level of integration supported by system designers.

[Figure 2.3: Three proposed implementations of KnightShift: (a) Board-level, (b) Server-level, (c) Ensemble-level (Prototype). Each diagram shows the primary server and Knight components (CPU, memory, chipset, LAN, SATA, power, disk, NIC).]

Board-level integrated KnightShift: Board-level integrated KnightShift integrates the primary server and the Knight onto the same motherboard. Both the Knight and the primary server have independent memory, CPUs, and chipsets. To allow each node to power on/off independently, the motherboard is separated into two power domains (designated by the dotted box in the figure). The Knight's power domain comprises its memory, CPU, chipset, ethernet, and disks. The Knight's power domain is always on, but the primary server's power status is controlled by the Knight. The Knight is capable of remotely waking up the primary server. Existing technology such as wake-on-lan can be used to support remote wakeup. Using wake-on-lan, when the primary server is off, the Knight can send a "magic" packet to the primary server's network interface, which in turn will wake up the primary server. All three proposed implementations use wake-on-lan for remote wakeup.
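Wake-on-lan is an existing standard, so the Knight needs no special hardware support to wake the primary server. The following sketch is illustrative only (the MAC address and port are placeholders): it builds the standard magic packet, six bytes of 0xFF followed by the target MAC address repeated 16 times, and broadcasts it over UDP.

```python
import socket

def wake_primary(mac: str, broadcast: str = "255.255.255.255", port: int = 9) -> None:
    """Send a wake-on-lan magic packet to the primary server's NIC."""
    mac_bytes = bytes.fromhex(mac.replace(":", "").replace("-", ""))
    if len(mac_bytes) != 6:
        raise ValueError("MAC address must be 6 bytes")
    # Magic packet: 6 x 0xFF, then the MAC address repeated 16 times.
    packet = b"\xff" * 6 + mac_bytes * 16
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(packet, (broadcast, port))

# Example (placeholder MAC address of the primary server's NIC):
# wake_primary("00:11:22:33:44:55")
```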
By using 19mm height drives or 2.5-inch drives, we can integrate the Knight into the unused space on the 3.5-inch mount. This approach is feasible even today as some potential Knight candidates are as small as credit cards [39]. Since the Knight remains on at all times, it is exposed to the outside world as the only server. Thus, the primary ethernet connection will be on the Knight board. The existing primary server's ethernet is then connected into the Knight board. Thus this approach requires one extra internal ethernet connection compared to board-level integration. This implementation allows us to convert any commodity server to a KnightShift-enabled server without using additional space.

Figure 2.4: Performance trends of commercial systems (Passmark CPU Mark vs. release date for Atom, Core-i3, and Xeon systems).

Ensemble-level KnightShift: The ensemble-level implementation uses only commodity parts with no changes to hardware. By using a primary server and a Knight based on commodity computers (such as nettops), a KnightShift system can be implemented. This is the prototype that we will use to evaluate KnightShift in section 2.6.1. Disk sharing is fulfilled through NFS, with the Knight acting as the NFS server and the primary server mounting the NFS drive. This allows data to persist when the primary server is off. Since the Knight acts as the NFS server, this approach requires the Knight to always be on. A router is used to network the Knight and primary server. To the outside world only the Knight's IP address is exposed. The primary server communicates to the outside world through the Knight.

The board-level implementation requires the least amount of space, but requires several modifications to the system design. The server-level implementation allows commodity servers to be KnightShift-enhanced, with minimal space requirements. The ensemble-level is the simplest to implement with commodity parts. But this solution does not fully exploit the power savings capabilities of KnightShift. However, due to its ease of implementation we use this approach to implement our prototype as described in section 2.6.1.

2.4.2 KnightShift Runtime

In the above section we presented three choices for implementing KnightShift and the basic system software needed for remote wakeup, disk and network sharing. The KnightShift functionality is implemented in a special purpose runtime layer called the KnightShift runtime. The runtime serves the following purposes: 1) Monitor server utilization, 2) Decide when to switch between the Knight and primary server using mode switching policies, 3) Ensure data consistency of shared disk data, 4) Coordinate mode switching, and 5) Redirect requests to the active node.

Monitoring server utilization and mode switching policies: An essential part of KnightShift is the ability to monitor the utilization of the primary server and Knight to make mode switching decisions. Server utilization monitoring can be carried out simply through the Linux kernel or through third-party libraries. Whenever the primary server's utilization is low, the Knight will put the primary server to sleep and handle service requests. Whenever the Knight's utilization is too high, it does a remote wakeup of the primary server, which then handles service requests. In this chapter, our primary goal is to introduce the benefits of KnightShift and thus we use a simple switching policy to determine when to switch modes. More complex mode switching policies are explored in chapter 3. In order to maximize power savings, we have to maximize the amount of time that we spend in the Knight. To do so, our simple switching policy aggressively switches into the Knight, and conservatively switches to the primary server. For example, if the Knight is 20% capable, the KnightShift runtime will switch to the Knight whenever the primary server utilization falls below 20%. KnightShift switches back to the primary server only when the Knight's utilization exceeds 100% for at least the amount of time it takes to switch between the Knight and primary server, called the transition time.
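To make the policy concrete, the following is a minimal sketch of the hysteresis just described. The names and thresholds (e.g., KNIGHT_CAPABILITY, TRANSITION_TIME_S) are illustrative stand-ins rather than values from the actual KnightShiftd scripts; a real mode switch would also trigger the disk-flush and messaging steps described in the following paragraphs.

# Illustrative sketch of the simple KnightShift mode-switching policy:
# switch to the Knight aggressively (primary utilization below the Knight's
# capability), switch back conservatively (Knight saturated for at least one
# transition time).

KNIGHT_CAPABILITY = 0.20   # Knight provides 20% of primary throughput (assumed)
TRANSITION_TIME_S = 30     # time to wake the primary server (assumed)

def decide_mode(current_mode, primary_util, knight_util, saturated_seconds):
    """Return the next mode ('knight' or 'primary') given current utilizations."""
    if current_mode == 'primary':
        # Aggressively shift down when the load fits within the Knight's capability.
        if primary_util < KNIGHT_CAPABILITY:
            return 'knight'
    else:
        # Conservatively shift up only after the Knight has been saturated
        # (100% utilization) for at least one transition time.
        if knight_util >= 1.0 and saturated_seconds >= TRANSITION_TIME_S:
            return 'primary'
    return current_mode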
By maximizing energy savings, we may negatively impact performance, as we may stay in the Knight mode during periods when the Knight cannot handle the requests, causing increased response time. In chapter 3 we evaluate different mode switching policies that rely on predicting utilization periods and show that KnightShift in fact provides a better balance between energy savings and performance.

Data consistency and coordinating mode switching: Recall that in all three implementations the Knight and primary server share the disk data needed for processing service requests. Hence, whenever mode switching is activated, the compute node that is shutting down must flush any buffered disk writes that are cached in memory back to disk and unmount the disk. This allows the complementary node to mount the disk and operate on consistent data. The KnightShift runtime enforces this consistency by coordinating the disk access sequence between the two nodes. In section 2.6.1 we detail our prototype KnightShift system, where coordination is carried out through a set of scripts communicating using message passing.

Redirecting requests: There are many ways to forward incoming requests to the active compute node. One approach is to run simple load balancer software on the Knight, which would require the Knight to remain always on. We take this approach in our prototype KnightShift implementation in section 2.6.1. It would also be possible to use a hardware component which redirects requests.

2.4.3 Choice of Knights

We originally considered three options for the Knight: ARM, Atom, and Core i3-based systems. In 2012, at the time this research was conducted, it was not feasible to use ARM based systems as a Knight because their capability level (<10%) is simply too low and does not provide ample opportunities to switch to the Knight mode. With the emergence of server-class ARM processors [37], ARM may become a viable Knight option at present.

Figure 2.4 shows the performance growth of Atom and Core i3 as potential Knights compared to a Xeon based server as the primary server. The performance data was obtained from Passmark CPU Mark [38]. Most Atom based systems have one order of magnitude lower capability than a Xeon based server, and in the best case they have 20% capability. The performance of Core i3, on the other hand, is within 50% of a Xeon based server. Thus Atom and Core i3 can provide Knight capability of up to 20% and 50%, respectively. Although Core i3 based Knights use 4x more power than Atom based Knights, Core i3 offers more opportunity for the Knight to handle requests from the primary server. In our prototype implementation we used only an Atom based Knight due to limited hardware budget.

Mixed ISA: In our current prototype, all the Knight choices run the x86 ISA.
Additionally, we ran a fully functional KnightShift prototype using x86+ARM and we did not encounter any functional difficulties. Many popular applications, such as java, apache and mysql, already have ARM binaries. As ARM becomes more powerful and prevalent, mixed ISA KnightShift systems may even become the norm. While the ARM based Knight ran perfectly well in terms of functionality, the latency overhead was too high. Hence we do not consider mixed ISA implementations in the rest of this work.

2.5 A Case for Server-level Heterogeneity

In this section we show the potential benefits of KnightShift on top of current production systems. We selected all 291 servers from the SPECpower results and studied how various energy proportionality metrics are affected if that server was enhanced with a Knight. Recall that we define Knight capability as the fraction of throughput that the Knight can provide compared to the primary server. By assuming the Knight was created with the same technology as the primary server, the peak and idle power of the Knight, with capability C, can be obtained by theoretically scaling the power using the equation Power_Knight = C^1.7 x Power_Primary [8]. For example, if the primary server operates between 100W (idle power) and 200W (peak power at 100% utilization), then a 50% capable Knight will operate from 31W to 62W. We assume the Knight is linearly proportional (LD=0) between its idle and peak power.

Figure 2.5: KnightShift enhanced energy proportionality curve ((a) 20% Capable, (b) 50% Capable).

Figure 2.6: Effect of KnightShift on SPECpower commercial servers with 20% and 50% Knight capability (proportionality gap, energy efficiency, and linear deviation).

Figure 2.5 shows the effect of KnightShift on the energy proportionality curve, from Figure 2.1, with a 20% and 50% capable Knight. To generate this data, we assume that anytime the utilization is within the Knight's capability level, the Knight will handle the requests. Otherwise, the primary server will handle the requests. Note that in KnightShift, the Knight must remain on, which increases the peak power consumption of the server. The reason for this requirement was explained in the previous section. Even with the increase in peak power consumption, we still experience significant power savings because the servers spend the majority of the time in the low utilization regions (more details will be presented in section 2.6). The primary server is shut down at low utilizations, allowing the Knight to handle all low utilization requests, significantly decreasing power consumption. Depending on the capability, energy savings vary. But in all cases, we shift the server into the sub-linearly proportional (linear deviation < 0) domain. It is interesting to note that at specific utilization levels, a KnightShift-enabled system can consume less power than an ideal energy proportional system, opening the possibility of servers operating with even better efficiency than ideal energy proportionality. For instance, in Figure 2.5b, when the server utilization is approximately between 20% and 50%, the overall power consumption is better than an ideal energy proportional server because the Knight uses less power than an ideal energy proportional server at that utilization.
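The construction of the KnightShift-enhanced power curve can be sketched as follows. This is a simplified illustration under the assumptions stated above (Knight power scaled by C^1.7 from the primary's, Knight linearly proportional, and, as a simplifying assumption of ours, the always-on Knight adding its idle power while the primary is active); it is not the exact procedure used to generate Figure 2.5.

# Sketch: construct a KnightShift-enhanced utilization-vs-power curve from a
# primary server's curve. Assumptions noted above; the example primary curve is
# a linear stand-in, not a measured SPECpower curve.

def knightshift_power_curve(primary_power, capability):
    """primary_power: dict mapping utilization (0.0-1.0) -> watts for the primary.
       capability: Knight capability C as a fraction of primary throughput."""
    scale = capability ** 1.7
    k_idle = primary_power[0.0] * scale
    k_peak = primary_power[1.0] * scale
    curve = {}
    for u, p in primary_power.items():
        if u <= capability:
            # Knight serves the load alone; it is linearly proportional over its
            # own utilization range (u / capability).
            knight_util = u / capability if capability else 0.0
            curve[u] = k_idle + (k_peak - k_idle) * knight_util
        else:
            # Primary serves the load; the always-on Knight adds its idle power.
            curve[u] = p + k_idle
    return curve

# Example: a 100W-idle / 200W-peak primary with a 50% capable Knight
primary = {u / 10: 100 + 100 * (u / 10) for u in range(11)}
print(knightshift_power_curve(primary, 0.5)[0.5])   # roughly 62W, the Knight's peak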
Figure 2.6 shows the effect of KnightShift with a 20% and 50% capable Knight on proportionality gap, energy efficiency, and linear deviation across all 291 servers we analyzed from the SPECpower dataset.

Proportionality Gap: At 20% capability, the proportionality gap of the KnightShift server is essentially eliminated at utilization below 20%. While in Knight mode, the proportionality gap is negative, meaning that the power used by the Knight at a specific utilization is lower than that of an ideal energy proportional server, as shown in the left plots of Figure 2.5. At 50% capability, the proportionality gap is greatly reduced in the 0%-25% utilization range as compared to Figure 2.2d, while the proportionality gap is eliminated in the 25%-50% utilization range. The reason for the non-zero proportionality gap at the lower range is the power consumed by the Knight itself. As long as the proportionality gap exists at low utilization, KnightShift should benefit that server.

Energy Efficiency: The energy efficiency curves for the 20% and 50% Knight capabilities are shown in the middle plots of Figure 2.6. KnightShift enhances the servers' energy efficiency and allows them to run at or better than peak efficiency (greater than 1 in the figure) even at lower utilization. The improvement is directly correlated with the reduction in proportionality gap. 20% capable Knights operate above peak efficiency in the 0%-25% utilization range. 50% capable Knights operate above peak efficiency in the 25%-50% utilization range, and just below peak efficiency below 25% utilization. This data shows that KnightShift energy efficiency is substantially higher than the baseline shown in Figure 2.2e.

Linear Deviation: KnightShift effectively shifts all servers from the super-linear range (linear deviation > 0) to the sub-linear range (linear deviation < 0). Improving LD is the only option to improve energy proportionality when dynamic range improvements are not feasible. With a 20% capable Knight, the lowest EP server amongst the 291 servers jumped from 20% to 60%. Thus KnightShift is able to improve the EP of servers by allowing commodity servers to exhibit sub-linearity. For 50% capable Knights, we even see servers with EP > 1, indicating that KnightShift effectively closed the proportionality gap.

EP and Energy Savings: To evaluate energy savings, we assume a server utilization distribution similar to Google data center servers in [10]. For completeness, figure 2.7 presents the Google data center utilization distribution reproduced from [10]. Figure 2.8 shows the average improvements to LD, EP and potential energy savings.
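To illustrate how an average energy saving can be computed from a utilization distribution together with the power curves above, here is a small sketch. All numbers in it are made up for illustration; the distribution is not the actual distribution from [10], and the power curves are simple stand-ins.

# Sketch: expected energy savings of a KnightShift-enhanced server under a
# utilization distribution. Every value below is an illustrative placeholder.

util_distribution = {0.1: 0.30, 0.2: 0.25, 0.3: 0.20, 0.5: 0.15, 0.8: 0.10}  # time fractions (assumed)

def baseline_power(u):                       # linear stand-in: 100W idle, 200W peak
    return 100 + 100 * u

def knightshift_power(u, capability=0.5):    # Knight scaled by C^1.7, always on
    k_idle, k_peak = 100 * capability ** 1.7, 200 * capability ** 1.7
    if u <= capability:
        return k_idle + (k_peak - k_idle) * (u / capability)
    return baseline_power(u) + k_idle        # primary active plus Knight idle (assumption)

def expected_power(power_fn):
    # Weight each utilization bin's power by the fraction of time spent there.
    return sum(frac * power_fn(u) for u, frac in util_distribution.items())

savings = 1 - expected_power(knightshift_power) / expected_power(baseline_power)
print(f"expected energy savings: {savings:.1%}")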
Figure 2.7: Typical Google data center utilization distribution [10].

Figure 2.8: Server Energy, EP, LD improvement with KnightShift.

For 20% Knights, we experience average EP improvements of 25%, average energy savings of 18%, and an average LD decrease of 0.175. As Knight capability increases, energy savings grow due to more opportunity to be in the Knight. For 50% capable Knights, we experience average energy savings of 51% and average EP improvement of 41%. By making KnightShift configurable, vendors may pick a KnightShift implementation that is best suited for their performance and energy budget goals.

2.6 Evaluation

In this work we evaluate KnightShift using two approaches. First, we present a KnightShift prototype and run a real-world workload, WikiBench [40], to demonstrate the feasibility and performance of KnightShift under realistic conditions. A prototype implementation, however, provides limited flexibility to change the hardware configuration parameters. Hence, we developed a queueing model based simulator that is validated against the prototype implementation. We then use the simulator to conduct a broad design space exploration with utilization traces collected from USC's production data center.

2.6.1 Prototype Evaluation

Prototype Setup

The KnightShift prototype is similar to Figure 2.3c. The exact experimental setup is shown in Figure 2.9. In this setup the Knight is a Shuttle XS35 slim PC with a 1.8GHz Intel Atom D525, 1GB of RAM, and a 500GB hard drive, with power consumption ranging from 15 Watts at idle to 16.7 Watts at 100% utilization. At idle, the CPU and memory consume 9W, with the disk and motherboard consuming 6W. The primary server is a Supermicro server with dual 2.13GHz 4-core Intel Xeon L5630, 36GB of RAM, and a 500GB hard drive, and consumes from 156W when idle to 205W at 100% utilization. Recall that we assume it is reasonable for the Knight to have less memory than the primary server, as the performance impact due to less memory is accounted for in the capacity measurement of the Knight. For our particular setup, our Knight is not capable of supporting a memory size larger than 1GB.

Figure 2.9: KnightShift Prototype setup (a client node generates the workload and logs power; the Knight runs KnightShiftd and the scheduler; the primary server runs KnightShiftd and the web application; power meters at the wall plugs report power usage to the client node).

Using SPECpower results we determined that our primary server has an EP of 24%. By enhancing the primary server with a Knight, we improved the EP to 48%. KnightShift is not meant to compete directly with servers that are already highly proportional across all utilization levels. KnightShift improves the EP of servers that have a large proportionality gap at low utilizations.
The primary server can turn on/off in 20/10 seconds, respectively. During Knight mode, the primary server is placed in hibernate mode, where the system is shut down except for the network interface. It is also possible to place the primary server in suspend mode, which is quicker than shutting down, but at the cost of higher idle power. Both nodes run Ubuntu Linux with all power-saving features enabled (DVFS, drive spin-down, etc.). We determined that our Knight is 15% capable compared to the primary server using throughput measurements from apachebench [34]. In other words, the Knight's maximum throughput is 15% of the peak throughput of the primary server. Request generation and power measurement data collection are handled by a separate client node which is not part of the prototype. The power consumption of the primary server and Knight are measured using two Watts Up? Pro [36] power meters with data logged to the client node.

KnightShift Runtime: In our prototype setup, both nodes share data through NFS, with the Knight acting as the NFS server. In order to enforce data consistency, we require coordination between both nodes. Thus, we require runtime software for KnightShift support. This software handles both utilization monitoring and coordination. While there are many options for implementation, here we only present the particular implementation used in our prototype.

Figure 2.10: Coordination of KnightShift servers (Sleep: the primary flushes memory state, sends a sleep message, and enters a low power state while the Knight begins processing requests; Wakeup: the Knight wakes the primary; Awake: the primary sends an awake message and waits for data sync; Sync: the Knight flushes memory, sends a sync message, and the primary resumes processing requests).

To support and enforce the KnightShift functionality, both nodes run a daemon, called KnightShiftd, to support utilization monitoring and coordination messages. KnightShiftd is implemented as a set of scripts and acts as the control center for the KnightShift system. KnightShiftd monitors the utilization of the node it is running on and makes mode switching decisions. To support redirection of requests, the Knight runs a scheduler, which acts as a simple load balancer to forward the requests to the active node. Communication between both nodes takes place through messages.

Figure 2.10 highlights the process of switching between nodes, which also enforces data consistency. Upon entering a low utilization period, the KnightShiftd daemon will detect the low utilization of the primary server and initiate a mode switch. The primary server will flush its memory state to ensure that the latest data is up to date on disk. When this completes, KnightShiftd will send a sleep message to the Knight and begin to power down.

                        Response Time             Energy
                      Average      95th       Consumption (kWh)
Prototype   Baseline    144ms      249ms           23.27
            KnightShift 150ms      296ms           15.35
            Improvement  -4%       -19%             34%
Simulation  Baseline    1.00       1.66            23.27
            KnightShift 1.12       2.00            15.11
            Improvement -12%       -21%             35%
            Error         8%         2%              1%

Table 2.1: Energy consumption and response time of WikiBench using our KnightShift prototype and simulator.

The KnightShiftd daemon on the Knight system receives the sleep message from the primary server, which is an indication that the Knight should begin processing requests. The Knight will process low utilization requests until it reaches a high utilization region. At this point, the daemon on the Knight will send a wakeup message (through wake-on-lan) to wake up the primary server. When the primary server has booted up, the daemon on the primary server gets ready to process requests. It will send an awake message to the Knight. At this point, the Knight will flush its data and send a sync message, indicating to the primary server that it can resume processing requests.
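The message sequence of Figure 2.10 can be summarized with a small state-machine sketch. This is an illustrative rendering of the protocol, not the actual KnightShiftd scripts; the helper functions are stand-ins for the real NFS/disk handling and message passing, and `peer` is assumed to be any object with a send() method.

# Illustrative sketch of the KnightShiftd coordination protocol (Figure 2.10).
# The helpers below are stand-ins for the real disk and power handling.
def flush_and_unmount():      print("flush buffered writes and unmount shared disk")
def mount_shared_disk():      print("mount shared disk")
def enter_low_power_state():  print("hibernate primary server")
def start_serving_requests(): print("knight serving requests")

class KnightShiftd:
    def __init__(self, node, peer):
        self.node = node          # 'primary' or 'knight'
        self.peer = peer          # handle used to send messages to the other node

    # --- primary-side transitions -------------------------------------------
    def enter_knight_mode(self):
        flush_and_unmount()                  # make on-disk state consistent
        self.peer.send('sleep')              # tell the Knight to start serving
        enter_low_power_state()

    def on_wakeup_complete(self):
        self.peer.send('awake')              # primary booted, request data sync

    # --- knight-side transitions --------------------------------------------
    def on_message(self, msg):
        if msg == 'sleep':
            mount_shared_disk()
            start_serving_requests()
        elif msg == 'awake':
            flush_and_unmount()              # hand consistent data back
            self.peer.send('sync')           # primary may resume processing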
Prototype Results

To verify the correctness of KnightShift and to evaluate KnightShift under realistic workloads, we cloned Wikipedia and benchmarked it using real-world Wikipedia request traces. Wikipedia consists of two main components: Mediawiki, the software wiki package written in PHP, and a backend mySQL database. For our clone, we used a publicly available database dump from January 2008, containing over 7 million articles. We replayed a single-day Wikipedia access trace [75], which follows a diurnal sinusoidal pattern, using WikiBench [40], a Wikipedia based web application benchmark. A detailed WikiBench workload utilization profile for this case study is presented in [79].

The first three rows of data in Table 2.1 show the energy consumption and the 95th percentile response time of our KnightShift prototype compared to the baseline primary server. Service Level Agreements (SLAs), which set per-request latency targets, are typically based on 95th percentile latency [56]. We define the baseline as a system where all requests are always handled by the primary server. KnightShift is able to achieve 34% energy savings with only 19% impact on 95th percentile response time. This latency impact is mainly due to the single-threaded performance of the Knight rather than penalties due to switching between the Knight and primary server (note that the average response time only increased by 4%). When running WikiBench only on the Atom-based Knight, we experience a 95th percentile response time of 323ms for successfully completed requests. Thus, KnightShift's 95th percentile response time is bounded by that of the Knight. By using processors with higher single-threaded performance, such as the Intel Core i3, we should expect to experience response time bounded by the response time of the Core i3.

2.6.2 Trace-based Evaluation

Trace-based Setup

While a prototype implementation provides great confidence regarding the functional viability and realistic improvement results, it also limits our ability to alter some of the critical design space parameters, such as Knight capability level, Knight performance, and Knight transition time. In order to fully explore these variables, we present KnightSim, a trace-driven KnightShift system simulator validated against our prototype system. During simulation runs, KnightSim replays the utilization traces collected from our production data center on a modeled KnightShift system. KnightShift is modeled as a G/G/k queue, where the arrival rate is time-varying based on the utilization trace, the service times are exponentially distributed with a mean of 1 second, and a varying number of servers k models the capability of the Knight and primary server. Because we do not have measured response times from our data center traces, we arbitrarily set the mean service time to 1 second and report relative performance impact.

Data center Utilization Traces: In order to rigorously evaluate KnightShift under various workload patterns, we collected minute-granularity CPU and I/O utilization traces from our production data center over 9 days.
The data center serves multiple tasks, such as e-mail store (email, msg-store1), e-mail services (msg-mmp, msg-mx), file server (scf), and student timeshare servers (aludra, nunki, girtab). Each task is assigned to a dedicated cluster, with the data spread across multiple servers. Selected servers within a cluster exhibit behavior representative of each server within that cluster.

Table 2.2 shows the properties of each server workload along with its corresponding utilization and burstiness characteristics. Some of our servers (aludra, email, girtab, nunki, scf) run at less than 20% CPU utilization for nearly 90% of their total operational time [79]. These traces reaffirm prior studies showing that CPU utilization reaches neither 100% nor 0% for extended periods of time [10, 22, 60]. We also collected second-granularity traces for a subset of these servers and found that there is a high correlation with the minute-granularity data. Thus, we use the minute-granularity data for the rest of this chapter. The burstiness of the workload is characterized by σ_utilization, the standard deviation of the workload's utilization, and Δutilization, the change in utilization from sample to sample. σ_utilization tells us how varied the utilization of the server is, while Δutilization tells us how drastically the utilization changes from sample to sample. For example, nunki has a wide operating utilization range with large variation in utilization from sample to sample. More details of our data center traces are presented in section 3.2.4.

Modeling Knight capability: Knight capability is modeled by varying the system capacity, k. For example, if we have a 10% Knight, then k = 10 in our G/G/k queueing model when operating in Knight mode. When the primary server becomes active, then k = 100.

                              Utilization          ΔUtilization
Server     Type               x̄        σ           x̄        σ
aludra     stu. timeshare     3.87     3.12        0.59     0.84
email      email store        3.26     1.74        0.78     1.20
girtab     stu. timeshare     0.83     2.42        0.73     1.94
msg-mmp    email services     32.62    13.60       2.64     2.76
msg-mx     email services     19.23    7.41        1.69     2.30
msg-store  email store        11.05    5.88        2.39     2.72
nunki      stu. timeshare     4.86     10.85       1.98     4.50
scf        file server        5.47     4.19        1.15     1.65

Table 2.2: Data center trace workload characteristics

Scaling Power Consumption: To faithfully scale the power of the Knight as its capability changes, we simply assume that the power consumption of the CPU scales quadratically with performance. The quadratic assumption is based on historical data [8] which showed that power consumption increased in proportion to performance^1.7. We consider this a reasonable assumption because even if the Knight and primary server require similar infrastructure (such as the same amount of memory), the Knight can trade off performance by using low-power components (such as low-power mobile memory); therefore, most components can scale.

Modeling Power: Our power model is based on our prototype system to allow us to compare against and validate KnightSim. Through online instrumentation, we collect utilization vs. power data for both the Knight and primary server. We use this utilization-power data in our simulations: whenever the Knight is active at a given utilization we use the power consumption data collected from our prototype Knight, and similarly, whenever the primary server is operating at a given utilization, we use the power consumption collected from the primary server in our prototype. It is also possible to generalize the power model and use a linear power model validated in [22].
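As a concrete illustration of the two power-model options just described (a measured utilization-power table vs. a linear model), here is a small sketch. The intermediate sample points are placeholders patterned on the prototype's idle/peak figures, not the actual instrumented measurements.

# Sketch of the two power-model options used in KnightSim: (1) interpolate a
# measured utilization->power table, (2) a simple linear model. Sample points
# other than idle/peak are illustrative placeholders.

measured_primary = {0: 156, 25: 172, 50: 185, 75: 196, 100: 205}   # watts (illustrative)

def table_power(util, table):
    """Piecewise-linear interpolation over measured (utilization, power) points."""
    pts = sorted(table.items())
    for (u0, p0), (u1, p1) in zip(pts, pts[1:]):
        if u0 <= util <= u1:
            return p0 + (p1 - p0) * (util - u0) / (u1 - u0)
    return pts[-1][1]

def linear_power(util, idle=156.0, peak=205.0):
    """Linear model: idle power plus utilization-proportional dynamic power."""
    return idle + (peak - idle) * util / 100.0

print(table_power(60, measured_primary), linear_power(60))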
In order to capture the energy penalty of transitioning to/from the Knight, we conservatively model the transition power as a constant power, equal to the peak transition power, drawn during the entire wakeup period. We determined empirically that the peak transition power for the primary server is 167W.

Arrival Rate and Latency Estimation: Our data center traces only contain CPU and I/O utilization per second, without individual request information. By assuming a mean service time of 1 second for each request, we can estimate a time-varying arrival rate from our utilization trace. For example, 50% utilization would correspond to an arrival rate of 50 requests per second. Through the simulated queueing model, we can obtain the relative average and 95th percentile latency of a KnightShift system compared to a baseline system.

Modeling Single-threaded Performance: We vary the queueing model's service time to model the performance difference between the Knight and primary server. We cannot infer single-threaded performance directly from processor frequency because single-threaded performance depends on both frequency and the underlying architecture. Instead, we compare the 95th percentile latency of the Knight and primary server and scale the service time accordingly. For example, our primary server has a tail latency of 249ms while our Knight has a tail latency of 323ms, as shown in section 2.6.1. As we do not have direct access to the data center servers, nor can we replicate the proprietary applications on our Knight, we cannot collect response times for the primary server and Knight for each individual workload. Therefore, in our model, we assume that all workloads experience a performance slowdown on the Knight similar to WikiBench, where the service time is increased by a factor of 1.3 compared to baseline.

Simulator Validation: We validated our trace-based emulation by collecting utilization traces from our WikiBench run and replaying them through the trace emulator. In addition, we validated our power results against our prototype system by running a CPU and I/O load generator to match the utilization of the traces.

Trace       Energy Savings    95% Latency Impact
aludra          87.9%               40%
email           85.5%               37%
girtab          87.2%               49%
msg-mmp         -6.7%                7%
msg-mx           7.2%              254%
msg-store       34.5%               53%
nunki           67.7%             5989%
scf             77.5%               46%
wikibench       35.1%               21%

Table 2.3: Energy savings and latency impact with respect to the baseline for a 15% capable KnightShift system

Table 2.1 shows the results of our validation run. The 95th percentile latency and energy consumption improvement results from KnightSim are all within 2% of our prototype system.

2.6.3 Sensitivity Analysis

In this section we explore KnightShift's sensitivity to various parameters such as workload utilization patterns, Knight capability, and transition times.

Sensitivity to Workload patterns: We used KnightSim to simulate KnightShift running a variety of workload patterns by driving the queueing model with traffic patterns from the different server types listed in Table 2.2. The energy and latency impact are shown in Table 2.3. Recall that our Atom-based Knight has a 95th percentile response time that is 30% greater than the primary server; thus we consider any latency impact above 30% to be attributable to the KnightShift mechanism overhead. For workloads with low burstiness (aludra, email, msg-mmp, wikibench), we experience relatively low response time impact (<10%). For moderately bursty workloads (girtab, msg-store, scf), we experience latency impact within 25% of the Atom-based Knight.
For these workloads, the majority of the latency impact occurs during the transition from the Knight to the primary server, when the Knight is handling requests that it cannot handle until the primary server is ready. These bursty behaviors tend to be periodic; thus it would be possible for KnightShift to learn day-to-day utilization patterns and proactively switch to the primary server to handle these high-utilization bursty periods, negating the high latency impact. This topic will be explored in chapter 3.4.

For very bursty workloads with high utilization, such as msg-mx and nunki, we experience the most latency impact, as expected. KnightShift does not handle scenarios where the workload switches quickly between very low and high utilization. In these scenarios, the workload may benefit from a higher capacity Knight.

Almost all workloads experience energy saving benefits from KnightShift, with the exception of workloads with high utilization. There are no benefits from using KnightShift for workloads that operate mostly at utilization above the capability of the Knight; such workloads do not need KnightShift support to begin with. For these cases, KnightShift may even lead to an energy penalty due to running the Knight alongside a heavily utilized primary server. For most other workloads (aludra, email, girtab, scf, wikibench), we experience an average of 75% energy savings with tail latency within 9% of the Atom-based Knight node.

Sensitivity to Knight Capability: Figure 2.11 shows the effect of Knight capability levels on energy savings and 95th percentile response time. As Knight capability increases up to 50%, so do the energy savings, due to more opportunity for the system to stay in the Knight mode. Although the Knight uses more power at higher capability levels, the increased energy savings from time spent in the Knight offset the Knight's higher power. As Knight capability increases, up to a limit of around 50%, the primary server spends more time sleeping, and the latency converges to 1.30x. Recall that the Knight is 1.30x slower than the primary server. As the capability of the Knight increases, it handles most of the incoming requests and may only rarely wake up the primary server. As such, almost all the requests suffer a latency penalty equivalent to the relative slowdown of the Knight compared to the primary server, i.e., 30% slower than running on the primary server. At low Knight capability, especially for capability less than 20%, KnightShift thrashes: the Knight cannot handle the tasks when switched to the Knight mode, and these tasks endure long latency while waiting for the primary server to wake up. Some workloads (msg-mx and nunki) experience latency penalties only when the Knight capability increases past 30%. These workloads do not experience latency impact at lower Knight capability since KnightShift rarely switches to the Knight mode and hence the primary server handles nearly all the requests due to the high utilization demands. But when the Knight capability increases, the system occasionally switches to the Knight mode, and the Knight is quickly saturated. Hence, the workload switches back to the primary server, leading to latency penalties. KnightShift is in fact unnecessary for these workloads. Thus, for certain workloads with stringent QoS bounds KnightShift may not be an ideal solution.

Sensitivity to Transition Time: Figure 2.12 shows the effect of wakeup transition time on energy savings and 95th percentile response time.
In general, as transition time increases, we experience less energy savings due to the primary server using power while not doing work, while the Knight is still handling requests it potentially cannot handle. As the Knight handles more requests, there is a corresponding increase in the 95th percentile latency as transition time increases. Figure 2.13 shows the effect of sleep transition time on energy savings and 95th percentile response time. The general trends shown with wakeup transition time are similarly observed with sleep transition time.

Figure 2.11: Effect of Capability on Latency and Energy.
Figure 2.12: Effect of Wakeup transition time.
Figure 2.13: Effect of Sleep transition time.

Sensitivity to single-threaded performance: The tail latency of KnightShift is determined by the Knight. If the SLAs demand very tight latency slack (less than 20%), then it is best to use low-power processors, such as the Core i3, as Knights instead of extremely low power Atom boards.

2.7 TCO

To study the effect of KnightShift on the TCO of an entire data center, we use a publicly available cost model [35]. The model assumes an 8MW power budget where facility and IT capital costs are amortized over 15 and 3 years, respectively. The model breaks down TCO into server, networking, power distribution and cooling, power, and other infrastructure costs. We assume that KnightShift has no impact on rack density, with the power budget as the sole limiting factor. In Table 2.4, we present the cost breakdown for our primary server and our Knight. We broke down cost into memory, storage, processor, and other system components. Other system components include the motherboard, chipset, network interface, fans, and other on-board components.

A significant portion of the energy savings derives from other system components. This is due to the fact that many of these components are energy-disproportional, such as the chipset, network interface, fans, and sensors. For example, the power consumption of motherboard components, such as the chipset and network interface, is mostly constant regardless of utilization. But with KnightShift, when we switch to the Knight node, we could use a low-power mobile chipset (such as for Atom) rather than a higher power chipset (such as for Xeon) to save significant power on these other system components. We assume our prototype implementation of KnightShift represents the worst-case TCO; an integrated version of KnightShift is expected to consume less power and have lower cost.

We present TCO on a monthly basis as Performance per TCO Dollar spent (Perf/$), an important metric in TCO-conscious data centers [49]. Here, performance is measured as ssj_ops as reported by SPECpower. Recall that in the worst case with KnightShift both the primary server and Knight are always on if the workload demands the computing power of the primary server. Thus, for our TCO analysis we make the worst case assumption that both the Knight and primary server are always on.
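A minimal sketch of how a monthly Perf/$ figure can be assembled under these assumptions follows. The amortization periods and power-budget provisioning mirror the text, but the per-server cost, power, performance, and facility-cost inputs are placeholders for illustration, not values from the cost model in [35]; networking and other infrastructure cost categories are omitted for brevity.

# Sketch: monthly Perf/$ under a fixed power budget (8MW), with IT amortized
# over 3 years and the facility over 15 years. All numeric inputs below the
# constants line are illustrative placeholders.

POWER_BUDGET_W = 8_000_000
FACILITY_COST_PER_W = 10.0          # assumed $/W of provisioned power
ELECTRICITY_PER_KWH = 0.07
PUE = 1.45
HOURS_PER_MONTH = 730

def perf_per_dollar(server_cost, server_peak_w, server_avg_w, server_ssj_ops):
    n_servers = POWER_BUDGET_W // server_peak_w               # provisioned against peak power
    it_monthly = n_servers * server_cost / (3 * 12)            # servers amortized over 3 years
    facility_monthly = POWER_BUDGET_W * FACILITY_COST_PER_W / (15 * 12)
    energy_monthly = (n_servers * server_avg_w / 1000.0) * PUE * HOURS_PER_MONTH * ELECTRICITY_PER_KWH
    return n_servers * server_ssj_ops / (it_monthly + facility_monthly + energy_monthly)

baseline    = perf_per_dollar(1830, 205, 180, 300_000)             # placeholder averages
knightshift = perf_per_dollar(1830 + 159, 205 + 33, 120, 300_000)  # fewer servers, lower average power
print(baseline, knightshift)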
                          Primary Server            Knight
                          Cost      Power (W)       Cost      Power (W)
Memory                    $248      40              $20       3
Storage                   $130      20              $70       18
Processor                 $1102     70              -         -
Other System Components   $350      75              $69       12
Total                     $1830     205             $159      33
No. Servers               37361                     34483

Table 2.4: Cost breakdown of primary server and Knight based on the prototype KnightShift system. Other system components include motherboard, chipset, network interface, fans, and other on-board components.

Figure 2.14: TCO breakdown across PUE and Energy Cost (Perf/$ impact over electricity cost per kWh and PUE).

Figure 2.14 shows the effect of PUE and electricity cost per kWh on Perf/$. Due to the increased peak power usage of KnightShift and the fixed power budget of the data center, we suffer a decrease in total data center performance. Although there is a reduction in the number of servers due to the peak power constraint, we do not always suffer a loss in Perf/$. In regions of higher electricity prices and higher PUE, it is easier to recoup the cost of the KnightShift hardware due to more monetary savings per watt. For cases with high PUE and electricity cost, we experience up to 14% improvement in Perf/$. Only at very low electricity prices do we see a negative impact on Perf/$, due to the hardware cost outweighing the potential energy savings. Note that even with a PUE of 1, KnightShift can still provide Perf/$ advantages with electricity prices above $0.07 per kWh.

Figure 2.15: TCO breakdown across servers and infrastructure, (a) baseline and (b) KnightShift (KnightShift: servers 68%, networking 9%, power distribution & cooling 16%, power 4%, other 3%).

Figure 2.15 shows the TCO breakdown across servers and infrastructure for a PUE of 1.45 and an electricity cost of $0.07. Although the total cost of servers is higher with KnightShift (68% of total cost vs 60% in the baseline), the power budget improvements (from 14% to 4%) more than make up for the difference, resulting in monthly TCO savings of 11%. Even after accounting for the lower number of servers, Perf/$/month improved by 4% compared to baseline.

2.8 Conclusion

Energy efficiency of computer systems has been increasing over the past few years. We introduce several metrics to analyze energy proportionality which shed light on why efficiency has not improved uniformly across all utilization levels. We show that servers exhibit a significant proportionality gap at low utilizations. With the pervasiveness of multi-core, servers in the future will rarely be idle and hence energy saving techniques must now tackle the proportionality gap at low server utilization levels. We introduce KnightShift, a server-level heterogeneous architecture that fronts a primary server with a low-power compute node. By operating KnightShift at two levels of efficiency, we convert any server to exhibit sublinear energy proportionality, drastically improving energy proportionality. In our prototype KnightShift implementation with a 15% capable Atom-based Knight, we achieve a 2x improvement in energy proportionality (from 24% to 48%) due to improvements to both dynamic range and proportionality linearity.
We demonstrated energy savings of 35%, with latency bounded by the latency of the Knight, using a real-world Wikipedia workload. In addition, we rigorously evaluated our prototype using various production data center traces and experience up to 75% energy savings with a tail latency increase of about 9%. Through publicly available cost models, we also showed that KnightShift can improve performance per TCO dollar spent by up to 14%. Our work hopes to motivate future work in system-level active low-power modes that exploit low-utilization periods.

Chapter 3

Implications of High Energy Proportional Servers on Cluster-wide Energy Proportionality

3.1 Introduction

In the previous chapter, we proposed KnightShift as an active low power server design where the server continues to provide service even at very low utilization. In this chapter, we will explore how KnightShift can be used within a data center cluster environment to improve cluster-wide energy proportionality. In particular, we will compare KnightShift with other orthogonal approaches to improve energy proportionality, such as dynamic cluster resizing. Historically, techniques such as dynamic cluster resizing have been used to hide the poor energy proportionality of data center servers. In the presence of changing workload demands, cluster resizing requires workload migration and frequent server wakeup/sleep actions. In this chapter we will explore the need, if any, for dynamic cluster resizing techniques, particularly in the presence of high energy proportional servers such as KnightShift. We will also address fundamental concerns regarding the scalability of KnightShift under multi-core scaling, which has been a major issue for previously proposed server-level low power modes.

Traditionally, clusters are managed with the goal of improving response time through uniform load balancing, where the workload is uniformly distributed to all servers in the cluster. This technique is simple, but can be energy inefficient, especially during low utilization periods if servers are not energy proportional. Recognizing this concern, researchers have proposed many cluster-level power saving techniques. Workload consolidation techniques [15, 71] have the goal of migrating workloads to improve the utilization of data centers and reduce the number of servers required, resulting in improved TCO. Dynamic capacity management techniques [25, 76] work towards the goal of minimizing the number of servers needed for a given workload utilization in order to turn off a subset of servers using Packing scheduling algorithms. The challenge with Packing schemes is in maintaining the optimal number of servers to keep on in order to meet QoS levels in the face of sometimes unpredictable and bursty incoming work requests, while also minimizing energy consumption. The most recent work to address this problem is AutoScale [25], which showed that conservatively turning off servers with the proper threshold can meet QoS levels while minimizing energy usage.

Cluster-level packing techniques were originally developed to mask the effect of the poor energy proportionality of individual servers. Packing schemes, however, incur significant overheads due to competing workload migration and QoS management demands. Over the past few years, server energy proportionality has improved through a combination of industrial and research designs, such as our KnightShift design presented in Chapter 2.4.
With the emergence of high energy proportional servers, we revisit whether cluster-level packing techniques and their associated management overheads are still the best approach to achieving high cluster energy efficiency. The question we would like to answer is: can server-level energy proportionality improvements alone translate into cluster-wide energy proportionality (the observed energy proportionality of the whole cluster), or do we still need cluster-level proportionality approaches?

This chapter also addresses the scalability challenges of server-level energy proportionality approaches in the presence of increasing on-chip core counts. As discussed in Chapter 2, server-level energy proportionality solutions can be classified into inactive server-level low power modes and server-level active low power modes. Commercially available inactive low power modes, such as server shutdown and sleep/hibernate, require large idle periods, on the order of minutes, to become effective. To combat the need for large idle periods, recent techniques such as PowerNap [55] were proposed. PowerNap enables fast sleep and wakeup transitions using nap state support in all hardware sub-systems. The fast wakeup and sleep transitions are then exploited to place a server in sleep mode even during sub-second idle periods. Server-level inactive low power modes depend on the presence of idle cycles, but with the emergence of multi-core processors and increasing core counts within servers, idle periods are becoming shorter and are increasingly rare [58]. Recognizing this concern, prior work [21, 58] proposed idleness scheduling techniques in order to artificially create idle periods at the cost of increased response time. On the other hand, server-level active low power modes can still perform work in a low power state. Barely-alive servers [5] and Somniloquy [3] only handle I/O requests in their low power state. KnightShift [80] can handle general computation and takes advantage of low utilization periods. Chapter 2 showed the benefits of using server-level active low power modes to improve energy proportionality without sacrificing response time significantly. It is necessary to compare the scalability of both inactive and active server-level low power modes to evaluate whether these benefits of KnightShift can be sustained when core counts within a server scale, which is the second aim of this chapter.

This chapter makes the following contributions:

We extended the energy proportionality model proposed in chapter 2 to measure energy proportionality at the cluster level. We then explored the effect of cluster-level and server-level techniques on cluster-wide energy proportionality. We found that cluster-level packing techniques effectively mask the server's energy proportionality profile. Hence, when server-level energy proportionality is poor, packing techniques achieve good cluster-wide energy proportionality.
We performed a de- tail power consumption analysis of how various server-level low power modes perform under server multi-core scaling and found that active low power modes consistently outperform inactive low power techniques using idleness scheduling techniques, while requiring significantly less latency impact in order to be effec- tive. (Section 3.3) In order to meet the best-case latency slack required for server-level low power modes to be efficient, we explore the causes of high latency in KnightShift. We propose various mode switching policies to overcome the high latency currently experienced with server-level active low power modes. (Section 3.4) 3.2 Cluster-wide Energy Proportionality In this section, we first extend the energy proportionality model presented in chapter 2 to measure cluster-wide energy proportionality. We will use our extended model to explain 48 the reasoning why prior cluster-level packing techniques can effective improve cluster- wide energy proportionality even using low energy proportionality servers. We will then reason about how improved energy proportionality at the server-level is impacting cluster-wide energy proportionality. Specifically, we make the following observations. 1) Cluster-level packing techniques are highly effective at masking individual server’s energy proportionality. 2) On the flip side, significant improvement in server energy pro- portionality seen in the past few years do not translate into cluster-level energy efficient gains due to the masking effect. 3) To take advantage of improved server energy propor- tionality it may be more favorable to forego cluster-level packing techniques entirely. Foregoing cluster-level packing techniques enable energy improvements by server-level low power techniques to translate to cluster-wide energy improvements. 3.2.1 Measuring Idealized Best-Case Cluster-wide EP We first describe how we measure cluster-wide energy proportionality using an ideal- ized Packing algorithm. Using the idealized Packing approach, the cluster-wide energy proportionality is dependent on the number of servers used in the cluster. Under an ideal scenario, cluster-level packing technique can always perfectly provision the right number of servers to meet the current utilization demand. Then the cluster-wide energy proportionality curve would resemble steps as shown in figure 3.2. In these illustrative figures, the x-axis is the cluster-level utilization and the y-axis is the cluster-wide power consumption. Figure 3.2a shows the best-case cluster-wide energy proportionality with a cluster consisting of 10 servers. The shape of each step resembles the shape of each individual server’s energy proportionality curve. In this illustration, each step represents the energy proportionality curve of a server with a relatively low energy proportionality of 0.24. The best case Packing approach represents the case where the exact number of servers are on for the current utilization level, then the next step occurs when another 49 server needs to be turned on to handle the additional load of a higher utilization. As the utilization increases, it will only load the most recently woken up server until another server needs to be awoken. Similarly, in the idealized Packing approach, if the utiliza- tion decreases then a server can be instantaneously put to low power mode. Idealized packing algorithms are also assumed to pay zero cost for migrating workloads between servers whenever a server is turned on or off. 
Using this best case Packing approach, we compute the best case cluster-wide energy proportionality (represented as BestEP) using Equation 2.2, which is also shown in the top left corner of Figure 3.2a. As the number of servers in a cluster increases, these steps become smaller until the point where the cluster-wide energy proportionality curve resembles the ideal energy proportionality curve, as shown in figure 3.2b. When increasing the cluster size from 10 servers to 100 servers, the best achievable cluster-wide energy proportionality approaches 1, improving from 0.92 to 0.98, even though each server suffers from a low energy proportionality of 0.24.

In the absence of cluster-level packing techniques, requests may be routed uniformly across all servers to balance the load. In this case, each server's utilization should track the cluster's utilization almost perfectly. When using the Uniform load balancing approach, the best-case cluster-wide energy proportionality curve would simply be that of the underlying server's energy proportionality curve.

3.2.2 Measuring Actual Cluster-wide EP

In real clusters it is not possible to achieve the ideal scenario where servers can sleep and wake up instantaneously. Cluster-wide utilization vs. power measurements are relatively noisy due to the fact that at any given cluster utilization, there could be a differing number of servers that are on. For example, consider two scenarios. In both scenarios the cluster utilization during a given time epoch is 10%. In scenario #1, a large number of servers (many more than what is needed for 10% cluster utilization) were already on during the previous epoch. The cluster power consumption can be high even at 10% utilization since it is not possible to instantaneously pack workloads and turn on/off unneeded servers for the current epoch. Consider scenario #2, where only just enough servers were turned on to meet the 10% cluster utilization demand even during the previous epoch. Then the power consumption in the current epoch with 10% utilization will be lower in scenario #2 than in scenario #1.

In order to enable the measurement of cluster-wide energy proportionality, we first need to derive the actual energy proportionality curve (figure 2.1). In order to reduce the raw data measurements into a single curve, we first take the average power of the cluster at each measured utilization and then use these points to find a 3rd degree polynomial best fit curve to create an average power curve. We then use this curve as the actual energy proportionality curve to calculate cluster-wide energy proportionality using equation 2.2.

3.2.3 Evaluation Methodology

We will use three different cluster-level techniques to see how individual server-level energy proportionality translates into cluster-wide proportionality. They are Packing (AutoScale [25]), uniform load balancing with a regular server with no energy proportionality enhancements, and uniform load balancing with KnightShift, which enhances the server's EP.

To evaluate the implications of various cluster-level and server-level techniques on cluster-wide energy proportionality, we implemented a trace-driven queueing model-based simulator shown in figure 3.1. We model a cluster with 20 servers, where each server is modeled as a G/G/k queue. We found that using a cluster size of 20 servers gave us the required resolution to measure cluster-wide proportionality, while minimizing the amount of simulation time required. Using a larger cluster would result in higher power measurement resolution, but would not change our observations.
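As a concrete illustration of how the simulator's output can be reduced to an actual cluster-wide EP curve following the procedure of Section 3.2.2, here is a small sketch. It assumes (utilization, power) samples collected at minute granularity and uses numpy's polynomial fit, which is our choice of tooling rather than anything prescribed by the text.

# Sketch: derive the actual cluster-wide power curve from noisy (utilization,
# power) samples: average power per observed utilization level, then fit a
# 3rd-degree polynomial to obtain the average power curve.
from collections import defaultdict
import numpy as np

def actual_power_curve(samples):
    """samples: iterable of (cluster_utilization_percent, cluster_power_watts)."""
    by_util = defaultdict(list)
    for util, power in samples:
        by_util[round(util)].append(power)          # bin at 1% granularity (assumed)
    utils = sorted(by_util)
    avg_power = [sum(by_util[u]) / len(by_util[u]) for u in utils]
    coeffs = np.polyfit(utils, avg_power, deg=3)    # 3rd-degree best fit
    return np.poly1d(coeffs)                        # callable: power(utilization)

# The resulting curve feeds the EP calculation of Equation 2.2 (not reproduced here).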
Using a larger cluster would result in higher power measurement resolution, but would not change our observations.

The following are the inputs to the queueing model. First, the model needs arrival rates. We derive arrival rates from real-world utilization traces obtained from the USC data center for different workload categories. Second, the model needs a service time for each request. Finally, the model needs to compute the cluster-wide power consumption at a given utilization level; for this purpose we rely on energy proportionality curves. In the following subsections we present a detailed description of how the model inputs were derived.

3.2.4 Utilization Data Derived from Workload Traces

The queueing model is driven by real-world utilization traces containing minute-granularity utilization measurements from various clusters taken over a 9-day period from an institutional data center. We use the workload traces from chapter 2. We show a thumbnail of the utilization traces in table 3.1. The x-axis represents time, with each major gridline representing a day, and the y-axis represents utilization. The utilization reported in these traces is from a single representative server in a cluster where all servers exhibit similar utilization patterns. From these thumbnails it is clear that these workloads all contain varying levels of utilization and traffic spikes.

3.2.5 Arrival Rate Inputs to the Queueing Model

Due to the absence of individual service request times in our utilization traces, we use a verified G/G/k queueing model methodology based on the concept of capability [80]. In this model, k represents the capability of a server. Each server has a capability of k = 100 (since each server can have up to 100% utilization).

[Table 3.1: Workload thumbnail and autocorrelation. For each workload (aludra, email, girtab, msg-mmp, msg-mx, msg-store, nunki, scf), the table shows a thumbnail of its utilization trace and its autocorrelation plot.]

Each entry in the data center utilization trace gives the cluster utilization at that minute epoch. The utilization traces are used as a proxy to derive a time-varying arrival rate. For instance, if the cluster has 10% utilization during a given epoch, this translates to 10% of the capability (k) being utilized. Essentially, each individual job takes up the equivalent of 1% utilization on a server. In order to submit requests equivalent to 10% cluster utilization, we would need to generate 10*N jobs, where N is the number of servers. For example, with 10% cluster utilization and 20 servers, we would generate a cumulative 200 jobs. If the cluster utilization drops to 1%, then during that epoch only 20 jobs arrive.

3.2.6 Service Rate Input for the Queueing Model

Because these data center traces do not report actual workload response times, we assume the service rate is exponential with a mean of one second, and we report the relative performance impact (99th percentile latency). This approach still enables us to assess latency impact by comparing relative performance.

3.2.7 Server Energy Proportionality Inputs to the Queueing Model

In our simulator, we evaluated three different server categories: LowEP, MidEP, and HighEP servers. The LowEP server, with an EP of 0.24, is based on a Supermicro server with dual 2.13GHz 4-core Intel Xeon L5630 processors. This is the same server configuration that was used in our KnightShift prototype implementation. For this server we empirically measured the energy proportionality curve (similar to figure 2.5).
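The arrival-rate derivation in section 3.2.5 can be sketched as follows. This is not the simulator's code; the helper names are illustrative, and spreading each epoch's jobs uniformly over the minute is an assumption (the trace only constrains per-minute totals).

import random

def jobs_for_epoch(cluster_util_pct, n_servers):
    """Each job equals 1% utilization on one server, so u% cluster utilization over
    n_servers translates to u * n_servers jobs in that one-minute epoch."""
    return int(round(cluster_util_pct * n_servers))

def arrival_times_for_epoch(cluster_util_pct, n_servers, epoch_seconds=60):
    """Spread the epoch's jobs uniformly at random across the minute."""
    n_jobs = jobs_for_epoch(cluster_util_pct, n_servers)
    return sorted(random.uniform(0, epoch_seconds) for _ in range(n_jobs))

# Example from the text: 10% cluster utilization with 20 servers -> 200 jobs.
assert jobs_for_epoch(10, 20) == 200

# Service times are drawn from an exponential distribution with a mean of 1 s (section 3.2.6).
service_time = random.expovariate(1.0)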
Recall that this server consumes 156 watts when idle and 205 watts at 100% utilization. The MidEP server is an HP ProLiant DL360 G7 with an EP of 0.73, and the HighEP server is a Huawei XH320 with an EP of 1.05. The medium and high EP servers are two representative servers selected from the SPECpower results, and their energy proportionality curves are derived from their respective SPECpower results. Our simulator uses these energy proportionality curves to determine the power consumption at any given utilization. It is assumed that all servers consume 205W at full load, similar to the empirically measured server in chapter 2.

Both Packing and KnightShift transition servers between sleep and active states. From our observations, when servers are transitioning on, they tend to be near-fully utilized. As in chapter 2, to capture transition energy penalties in these models, an empirically measured constant power of 167W is conservatively added to a transitioning server during the entire transition period.

The three load balancing schemes evaluated in this chapter, namely Packing, uniform load balancing without KnightShift, and uniform load balancing with KnightShift, use a variable number of servers to service a given arrival rate. To measure cluster-wide power at minute granularity, we simply aggregate each server's power consumption during simulation runtime.

[Figure 3.1: Trace-driven Queueing Model-based Simulation Methodology. A time-varying arrival rate derived from the utilization trace is fed through a load balancer to n servers, each modeled as a G/G/k queue.]

[Figure 3.2: Best-case cluster-wide energy proportionality curve using Packing load balancing. (a) Cluster-wide EP with 10 servers (BestEP: 0.92); (b) Cluster-wide EP with 100 servers (BestEP: 0.98).]

3.2.8 Load balancer implementations

Packing Load Balancing: To explore the impact of cluster-level packing techniques, we implemented a state-of-the-art dynamic capacity management algorithm proposed in Autoscale [25]. Each server is configured with the same settings as in [25], where servers conservatively turn off after 120 seconds of idleness and have a server wakeup time of 260 seconds. Through empirical experiments, we determined that the packing factor for our servers is 97 jobs in our simulation framework. The packing factor is server dependent and indicates the maximum number of jobs that a server can handle while still meeting the target 99th percentile latency. The Autoscale load balancing algorithm assigns incoming requests to individual servers. For instance, if only two servers are currently turned on and the new utilization trace record has 6% cluster utilization, this translates to 120 jobs that will be submitted to these two servers. The Packing algorithm first submits 97 requests to the first server and then assigns the remaining 23 to the next server. If the arrival rate exceeds the total capability of all active servers, then a new server is turned on (paying the wakeup latency) and the remaining overflowed requests join the shortest queue.

Uniform Load Balancing: To understand how well Packing load balancing improves cluster-wide energy consumption, we also implemented a basic Uniform load balancer as an alternative to Autoscale. The load balancer simply distributes work equally to all servers, and all servers in the cluster are always on.
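A minimal sketch of the Packing assignment just described follows. The constants come from the text above; the function and variable names are illustrative and are not taken from the Autoscale implementation.

PACKING_FACTOR = 97       # max jobs a server can take while meeting the 99th percentile target
WAKEUP_SECONDS = 260      # server wakeup latency from [25]
IDLE_OFF_SECONDS = 120    # a server turns off after this much idleness

def pack_jobs(active_servers, asleep_servers, n_jobs):
    """Fill active servers in order, up to the packing factor each. If the epoch's jobs
    exceed the total active capability, wake one additional server; the overflowed
    requests join the shortest active queue while that server pays WAKEUP_SECONDS."""
    assignments, remaining = [], n_jobs
    for server in active_servers:
        take = min(PACKING_FACTOR, remaining)
        if take:
            assignments.append((server, take))
        remaining -= take
    woken = None
    if remaining > 0 and asleep_servers:
        woken = asleep_servers.pop(0)   # becomes usable only after WAKEUP_SECONDS
    return assignments, remaining, woken

# Example from the text: 6% cluster utilization with 20 servers is 120 jobs; with two
# active servers, 97 jobs go to the first server and 23 to the second.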
Uniform Load Balancing with KnightShift: In addition, we also explored the effect of a server-level active low power mode, KnightShift, on cluster-wide energy proportionality. The KnightShift server is configured as described in chapter 2, with a 15% capable Knight; the Knight can handle 15% of the work of the primary server. In the queueing model, when in Knight mode, k is set to 15. The Knight consumes 15W at idle and 16.7W at full load, with a wakeup transition time of 20 seconds.

[Figure 3.3: Cluster-wide energy proportionality using (a,b,c) Uniform load balancing, (d,e,f) Packing load balancing (Autoscale), and (g,h,i) a server-level active low power technique (KnightShift with Uniform load balancing), each for LowEP, MidEP, and HighEP servers. Every subplot shows the IdealEP, BestEP, and ActualEP curves along with the raw utilization vs. power samples.]

3.2.9 Revisiting Effectiveness of Cluster-level Packing Techniques

Figure 3.3 shows the results of our study. In this section, we only ran a 1-day period of the utilization traces, as we found this sufficient to obtain the cluster-wide energy proportionality curves. We ran each workload in table 3.1 on a simulated cluster of 20 servers and collected the cluster power consumption and cluster utilization at 1-minute intervals. Each dot in the figure represents a one-minute epoch. Each plot shows the linear ideal energy proportionality curve (IdealEP), the best-case idealized energy proportionality curve for that scenario (BestEP), and the actual fitted energy proportionality curve (ActualEP) derived from the measured (raw) utilization vs. power data collected during runtime, as described in section 3.2.2. Note that for figures 3.3(d)-(f) the BestEP curve is obtained by assuming the best-case Packing algorithm described earlier in the context of figure 3.2, with no wakeup and sleep latencies. For figures 3.3(a)-(c), which do not use Packing, the cluster-wide BestEP is the energy proportionality curve of the underlying servers, since all servers are always turned on and each server has the same utilization.

Observation 1: Cluster-level packing techniques are highly effective at masking server EP.

Using a Uniform load balancing technique, where jobs are sent equally to all servers, results in the cluster-wide energy proportionality exhibiting an EP curve resembling that of the server's EP curve, as shown in figures 3.3(a)-(c). Hence, the actual cluster-wide energy proportionality is 0.24 in figure 3.3a, 0.73 in figure 3.3b, and 1.05 in figure 3.3c. Figure 3.3d shows how cluster-wide energy proportionality changes using Autoscale.
Using low energy proportional servers (EP = 0.24) as a base, the cluster-wide energy proportionality improved to 0.69, even though the individual server energy proportionality is just 0.24. This demonstrates the effectiveness of cluster-level techniques, like Autoscale, at masking the individual servers' low energy proportionality. As individual server energy proportionality improves, the cluster-wide energy proportionality also improves, but the effectiveness is reduced. For instance, MidEP servers in figure 3.3e achieve a cluster-wide energy proportionality of 0.79 with Autoscale, while the Uniform load balancer achieves a cluster-wide energy proportionality of 0.73 (an improvement of only 0.06).

Observation 2: With improving server energy proportionality, it may be more favorable to forego cluster-level packing techniques entirely.

As server energy proportionality improvements continue, servers are beginning to exhibit super-energy-proportional (EP > 1.0) behavior. Autoscale does not improve the overall cluster-wide energy proportionality when the underlying servers already exhibit high energy proportionality. Having servers with an EP of 1.05 had very little effect at the cluster level, where cluster-wide energy proportionality improved to just 0.82, compared to a cluster-wide EP of 0.79 when individual server EP is 0.73. With highly energy proportional servers appearing, it may be more desirable, from an energy proportionality standpoint, to uniformly load balance work and depend on server-level low power techniques to further improve energy proportionality.

In the past, it was clear that cluster-wide energy proportionality using cluster-level packing techniques was better than the individual server's energy proportionality (figure 3.3d vs. 3.3a and 3.3e vs. 3.3b). But we may now have reached a turning point where servers offer more energy proportionality than cluster-level packing techniques can achieve (figure 3.3f vs. 3.3c).

Observation 3: Foregoing cluster-level packing techniques enables energy improvements from server-level low power techniques to translate to cluster-wide energy improvements.

By switching to a Uniform load balancing scheme and exposing the underlying server's energy proportionality curve, it is possible to apply server-level low power modes to further improve server energy efficiency at low utilization levels. Previously, cluster-level techniques would mask the effect of these server-level techniques, rendering them ineffective. A switch to uniform load balancing can enable server-level low power modes to have more overall benefit by exposing the underlying server's energy proportionality curve.

Figures 3.3(g)-(i) show the cluster-wide energy proportionality curves using KnightShift servers with Uniform load balancing. While KnightShift is able to improve the individual server's energy proportionality, which is reflected in the cluster-wide energy proportionality, it does not provide better cluster-wide energy proportionality than cluster-level packing techniques (figures 3.3d and 3.3e) when the baseline server has relatively low energy proportionality. This was also the case with Uniform load balancing alone. Only with higher server energy proportionality does KnightShift provide better cluster-wide energy proportionality than cluster-level packing techniques: Uniform load balancing with or without KnightShift, but with highly energy proportional servers, exhibits an EP near 1, outperforming the cluster-level packing technique, which has an EP of 0.82.
Although the energy proportionality may be similar when comparing Uniform load balancing with highly energy proportional servers and KnightShift (figure 3.3c vs. 3.3i), KnightShift still offers significant energy efficiency improvements, especially in low utilization regions. By using a low power Knight mode, KnightShift enables the server to use only a fraction of the primary server's power during low utilization periods. This resulted in the average power usage falling from 890W to 505W using KnightShift. The average power usage is calculated as the average of all the raw samples during the simulation run.

3.2.10 Challenges facing adoption of server-level low power modes

In order for system-level low power modes to become more widely adopted, issues relating to their practicality must be resolved. All server-level low power modes, such as PowerNap [55], DreamWeaver [58], Barely-alive servers [5], Somniloquy [3], and KnightShift [80], trade off latency to improve energy efficiency. Each technique requires a varying amount of latency slack, the latency impact required before it becomes effective. Previous work [58] has identified that server-level low power modes require increasing latency slack as the number of processors in servers increases in order to be effective. The data in this section focused exclusively on energy efficiency improvements; many data center workloads, however, are sensitive to latency. In order to use server-level low power modes, we need to perform a careful scalability study to understand how increasing core counts impact the latency of using various server-level low power techniques.

[Figure 3.4: KnightShift provides similar energy savings to idleness scheduling algorithms but with less latency slack required. Energy savings vs. normalized 99th percentile latency for KnightShift, DreamWeaver, and Batch on (a) Apache, (b) DNS, (c) Mail, and (d) Shell.]

3.3 Server-level Low Power Mode Scalability

We now evaluate the effectiveness of server-level low power modes with increasing core count. In this section, we explore the results of both a server-level active low power mode, primarily KnightShift as described in chapter 2, and the idleness scheduling algorithms used in server-level inactive low power modes (DreamWeaver [58] and Batch [21]) on a high core count (32-core) system. We show that as core count increases, KnightShift can match the energy savings of idleness scheduling algorithms, but with significantly less latency slack required. Furthermore, we show that active low power modes can offer superior energy savings at any given latency slack compared to inactive low power modes.

Energy-Latency Tradeoffs

The scalability of server-level low power modes will be analyzed using energy-latency tradeoff curves. These curves show the available energy savings for a certain allowable latency slack. The latency slack is defined as the slack (or latency increase) allowed on the 99th percentile response time. The goal of this section is to explore what latency slack is required in order for server-level low power modes to be effective. All low power modes will incur some latency impact.
For workloads with stringent latency constraints (zero latency slack, for instance), the best design may be to not use any power management. Thus, this scalability study focuses on workloads that allow some level of latency slack. The idleness scheduling algorithms in DreamWeaver and Batch can trade off energy and latency by adjusting the level of request queueing. When requests are queued for a longer time, there are more opportunities to place the server into a low-power idle mode, which allows for longer server sleep times; but more queueing implies longer latency. KnightShift can adjust its energy-latency tradeoff by adjusting the thresholds of its switching policy. To allow increased latency in exchange for higher energy savings, we can increase the threshold to switch out of the Knight and into the primary server. This policy keeps the Knight active longer at the expense of increased latency. Similarly, to decrease latency at the cost of energy savings, we can increase the threshold to switch into the Knight, so the primary server stays on longer even if the utilization falls below the Knight's capacity.

3.3.1 Methodology

To evaluate scalability, we use the BigHouse simulator [59], a simulation infrastructure for data center systems. BigHouse is based on stochastic queueing simulation [59], a validated methodology for simulating the power-performance behavior of data center workloads. BigHouse uses synthetic arrival/service traces that are generated from empirical inter-arrival and service distributions. These synthetic arrival/service traces are fed into a discrete-event simulation of a G/G/k queueing system that models active and idle low-power modes through state-dependent service rates. Output measurements, such as 99th percentile latency and energy savings, are obtained by sampling the output of the simulation until each measurement reaches a normalized half-width 95% confidence interval of 5%. The baseline server power model used in BigHouse is shown in table 3.2. This model is based on component power breakdowns from HP [74] and Google [31].

Utilization   CPU   Memory   Disk   Other
Max           40%   35%      10%    15%
Idle          15%   25%      9%     10%
Table 3.2: BigHouse server power model based on [74] and [31]. Power is presented as a percentage of peak power.

We implemented the KnightShift server in BigHouse. Because the synthetic arrival/service traces are generated from empirical inter-arrival and service distributions, these traces have no notion of time; therefore, workload burstiness and high/low utilization periods do not exist. Since BigHouse cannot accurately capture the transition penalty due to statistical sampling, we assume an ideal KnightShift configuration with no transition delays. We explore in detail in the next section how different switching policies with realistic transition penalties affect overall energy and performance when running real-world data center utilization traces.

We evaluate four workload distributions, DNS, Mail, Apache, and Shell, provided with the BigHouse simulator. Each workload's load is scaled so that the modeled server within BigHouse operates at 30% average utilization, similar to the average utilization in [10].

3.3.2 Case study with 32-core server

We compare the energy-latency tradeoffs of the following server-level low power approaches: (1) a 30% capable KnightShift, (2) Batching [21], and (3) DreamWeaver [58]. As shown in chapter 2, the power consumption of the Knight is equal to PrimaryServerPower × KnightCapability^1.7.
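As a quick check of this scaling relation (the helper name below is hypothetical, not from the KnightShift code):

def knight_power_fraction(knight_capability, exponent=1.7):
    """Knight power as a fraction of primary server power, using the performance^1.7
    power scaling assumption from chapter 2."""
    return knight_capability ** exponent

# A 30% capable Knight: 0.30 ** 1.7 is about 0.129, i.e., roughly 13% of primary power.
print(round(knight_power_fraction(0.30), 3))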
This scaling assumption is based on historical data [8], which showed that power consumption increases roughly in proportion to performance^1.7. Thus a 30% capable Knight is expected to consume 13% of the power of the primary server.

Figure 3.4 shows the latency vs. energy savings curves for the four workloads. The latency slack shown is normalized to each workload's 99th percentile latency. The y-axis shows the energy savings possible if we are allowed to relax the 99th percentile response time by the given x-axis value.

Batching provides a nearly linear tradeoff between latency and energy, but is consistently outperformed by DreamWeaver, confirming previous results in [58]. Compared to DreamWeaver, KnightShift improves energy savings at any given latency slack. For Mail, KnightShift provides similar energy savings with less than half the latency slack required by DreamWeaver. For DNS, KnightShift provides similar energy savings with 25% less slack required. For Apache and Shell, DreamWeaver with 3x latency slack has less energy savings than KnightShift at 1.3x latency slack. In all cases, we conclude that server-level active low power modes outperform server-level inactive low power modes at every latency slack.

For workloads that can tolerate very large latency slack (3x or more), it may also be possible to consider the use of wimpy clusters [7, 61], which can allow power savings greater than any of the approaches compared here. But when very large latency slack cannot be tolerated, KnightShift offers almost all of its power savings up front, with a tighter latency slack. The power savings achievable from KnightShift saturate rather quickly, even with a small latency slack: KnightShift can take advantage of all the opportunity periods for its low power mode at a low latency slack. For idleness scheduling algorithms, the opportunity periods increase as the latency slack increases. The maximum savings of KnightShift saturate at about 1.75x latency slack in most cases. This contrasts with idleness scheduling algorithms, which ramp up energy savings slowly as latency slack increases, but never reach the maximum energy savings achievable with KnightShift. For workloads that require latency slack even tighter than what KnightShift can provide, system-level low power modes, both active and inactive, may not be the best solution, and energy saving techniques may even have to be disregarded altogether.

These results show that there must be at least a 1.5x latency slack available in order to allow server-level low power modes the opportunity to achieve a majority of their power savings potential. Essentially, in order for server-level active low power modes to be effective, the best-case latency slack required would be 1.5x that of the baseline. Currently, under realistic conditions, the KnightShift server requires at least a 2x latency slack on average. In section 3.4 we explore various mode switching policies in order to try to meet this best-case latency slack.

[Figure 3.5: Energy savings vs. core count and latency slack for (a) DreamWeaver and (b) KnightShift.]

3.3.3 Sensitivity to Core Count

For this experiment we vary the number of cores in the primary server from 4 to 128 in order to explore how increasing core count affects the effectiveness of server-level low power modes.
Figure 3.5 shows the energy savings that can be realized across different core counts and latency slacks for the Apache workload. Results for the other workloads follow similar trends. Figure 3.5a shows the possible energy savings using DreamWeaver. At low core counts, DreamWeaver can achieve significant power savings (over 40%) with relatively low latency slack (1.6x). But as core count increases, the potential energy savings quickly decreases, and the latency slack required for energy savings similar to those at lower core counts increases drastically (over 3x latency slack at 32 cores to save 40% energy!). The reason idleness scheduling algorithms become less effective with core count is that they primarily rely on idle periods to exploit power savings.

Figure 3.5b shows the energy savings with KnightShift. Similar to inactive low power modes, significant energy savings can be achieved at low core counts. But unlike inactive low power modes, active low power modes can sustain significant energy savings, independent of core count, while maintaining a constant low latency slack. KnightShift does not depend on idle periods, but on low utilization periods, which remain present even at high core counts. Therefore, as long as low utilization periods exist, KnightShift can scale to any number of cores.

3.4 Minimizing realistic latency slack of server-level active low power mode

We showed in the last section that server-level active low power modes have consistent opportunity, independent of server core count, with a low latency slack requirement. But the evaluations in section 3.3 assumed zero transition penalty between the Knight and the primary server. Under the more realistic conditions evaluated in chapter 2, non-bursty, low utilization workloads suffered a 1.5x latency increase or less, but bursty workloads, such as nunki, can suffer over a 6x latency penalty.

It is important for server-level low power modes to meet a low latency slack under realistic conditions in order to become a feasible alternative to cluster-level packing techniques. This would allow clusters to forego cluster-level packing techniques and enable continued cluster-wide energy proportionality improvements, as shown in section 3.2. In this section, we investigate in detail the causes of poor performance for certain workload categories with KnightShift. We then propose various switching policies and evaluate their effect on the energy and tail latency of KnightShift.

3.4.1 Workloads

We use the workload traces detailed in section 3.2.4. Performance penalties in KnightShift can manifest in several ways. If the current utilization is low and we encounter a high utilization spike, then there is a performance penalty while we wait to switch to the primary server to keep up with the high utilization requests. This is the case for very random, bursty workloads (nunki). For workloads with periodic utilization spikes (scf), it may be possible to predict these periodic events and anticipate high utilization periods, but such prediction approaches were not evaluated in chapter 2.

Another case of performance loss occurs when workloads have a very high level of utilization variability, especially if the variation is around the Knight's capability level. In this scenario, the KnightShift switching policy may be tricked into entering Knight mode when, in actuality, the workload is still in a high utilization phase.
This causes KnightShift to thrash between modes, incurring performance penalties during mode transitions. This is the case for workloads such as msg-mx and msg-store.

Table 3.1 also contains the autocorrelation of the workloads to show how strongly their utilization exhibits daily patterns. Workloads with low predictability, such as nunki and girtab, have low autocorrelation, as shown in their autocorrelation plots. Workloads with strong daily utilization patterns, such as msg-mmp, msg-mx, msg-store, and aludra, exhibit local maxima at each day marker in their autocorrelation plots. Previous work analyzing data center workloads has also found strong daily correlations [27].

3.4.2 KnightShift Switching Policies

We now introduce several KnightShift mode switching policies and quantify their ability to reduce the mode switching overhead of KnightShift. To facilitate understanding of the strengths and weaknesses of these policies, we illustrate their effect on a 15% capable KnightShift server in figure 3.6. Figure 3.6a shows an illustrative utilization variation over time; the y-axis is the server utilization and the x-axis is time. The horizontal bar shows the 15% utilization cutoff. Under ideal conditions, whenever the utilization is below 15%, the Knight immediately takes over the requests and the primary server instantly moves into a low power sleep/hibernate state; any time utilization goes above 15%, the primary server instantly takes over the tasks. Hence, the area of the curve under the utilization cutoff provides opportunities for energy savings. The workload utilization scenario represents a short low utilization period (between T1 and T2), followed by a higher utilization period (T2 to T3), and then eventually a longer low utilization period (T3 to T4). We highlight the regions where we should expect energy savings (green), performance loss (red), and historical highs (orange). The historical high regions represent time periods where the previous day observed a high utilization period; we make use of this information for a heuristic switching policy.

Aggressive Policy: This policy aims to maximize energy savings by instantly switching into the Knight whenever utilization drops below the 15% threshold, and conservatively switching out of the Knight only after being in a high utilization period for an amount of time equal to the transition time. Figure 3.6b illustrates the energy savings and latency penalties of the aggressive switching policy as described in chapter 2. The y-axis is the current operating mode: mode 3 represents the primary server being on, mode 1 represents the primary server being off with only the Knight on, and mode 2 represents transitioning from mode 1 to mode 3. For the given illustrative workload utilization pattern (figure 3.6a), we show the periods where energy savings occur (green bars at the bottom of the plots) and where performance penalties occur (red shaded regions). Whenever the utilization falls below the Knight's capability level, KnightShift immediately switches into Knight mode. The server does not switch back to the primary server until it experiences utilization levels beyond what the Knight can handle for a period of time equal to the transition time (illustratively, the time spent in mode 2) from the Knight to the primary server.
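A minimal sketch of this threshold-based switching logic is shown below. The class, field names, and one-second time step are illustrative, not taken from the KnightShift implementation; the conservative and balanced policies described next are the same state machine with the delay moved to the other side, or applied on both sides.

class SwitchingPolicy:
    """Switch into the Knight after utilization has stayed below the Knight's
    capability for in_delay seconds; switch out after it has stayed above capability
    for out_delay seconds.
      Aggressive:   in_delay = 0,               out_delay = transition_time
      Conservative: in_delay = transition_time, out_delay = 0
      Balanced:     in_delay = transition_time, out_delay = transition_time
    """
    def __init__(self, in_delay, out_delay, capability=15):
        self.in_delay, self.out_delay, self.capability = in_delay, out_delay, capability
        self.mode = "PRIMARY"
        self.below_for = 0   # seconds utilization has stayed at or below capability
        self.above_for = 0   # seconds utilization has stayed above capability

    def step(self, utilization, dt=1):
        below = utilization <= self.capability
        self.below_for = self.below_for + dt if below else 0
        self.above_for = self.above_for + dt if not below else 0
        if self.mode == "PRIMARY" and below and self.below_for >= self.in_delay:
            self.mode = "KNIGHT"
        elif self.mode == "KNIGHT" and (not below) and self.above_for >= self.out_delay:
            self.mode = "PRIMARY"   # in reality this switch pays the wakeup transition
        return self.mode

# Example: an aggressive policy with an assumed 260 s primary-server wakeup time.
aggressive = SwitchingPolicy(in_delay=0, out_delay=260)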
[Figure 3.6: Performance and energy impacts of various KnightShift mode switching policies. (a) Workload Utilization; (b) Aggressive Policy; (c) Conservative Policy; (d) Balanced Policy; (e) D2D Policy. Mode corresponds to (1) Knight mode, (2) Wakeup, and (3) Primary mode. Key: Energy Savings, Performance Loss, Historical High.]

As can be seen in figure 3.6b, the aggressive policy can save a significant amount of power, but at the cost of high performance penalties. During very short low utilization periods, this simple policy can be tricked into switching into the Knight and then immediately experiencing utilization greater than the Knight can handle.

Conservative Policy: This policy aims to minimize performance loss by sacrificing energy saving opportunities during short low utilization periods. A conservative policy in KnightShift switches into the Knight only when the server's utilization level has been below the Knight's capability level for a certain amount of time. We assume this threshold to be equivalent to the wakeup transition time for all policies in this section. A transition from the Knight to the primary server occurs immediately upon high utilization. Figure 3.6c shows that the policy does not switch during short low utilization periods (between T1 and T2). Since the policy only switches when there is a long enough low utilization period, the conservative policy avoids the performance penalty due to short low utilization periods; its performance penalty is therefore limited to when the server transitions from the Knight to the primary server. The conservative policy trades off power savings for performance by missing some opportunities to safely be in a low power state during the start of long low utilization periods.

Balanced Policy: The balanced policy aims to strike a balance between the aggressive and conservative policies by achieving energy savings with low performance impact. The balanced policy conservatively switches into the Knight and conservatively switches out of the Knight; both switches occur only after the utilization has crossed the threshold for a time equivalent to the transition time. Figure 3.6d shows the effect of using a balanced policy. By switching into the Knight conservatively, we avoid short low utilization periods, avoiding performance penalties due to untimely, aggressive switches into the Knight during high utilization periods. By conservatively switching out of the Knight, we extract as much energy savings as possible while in Knight mode by trying to stay in an energy saving state as long as possible. This also makes the policy stay in a low power state when faced with a utilization spike, saving energy at the cost of performance.

Day to Day (D2D) Policy: This simple heuristic policy aims to use the insight derived from the autocorrelation plots of each workload shown in table 3.1, and demonstrates the possible effect of a more sophisticated switching policy compared to the other policies in this section. By looking at binary historical utilization levels (either high or low), we can anticipate high utilization periods. To use utilization as a heuristic, we require a 1-day history of the utilization pattern. Utilization can be trivially logged through software and, depending on the need, can be recorded at second or minute granularity.
Even at second granularity, this history log only takes up 85kB. For this policy, we use the aggressive policy as the base policy and extend it with knowledge of past historical high periods. We switch into the Knight only if there was not a historical high at this time on the previous day. Similarly, if we are currently in Knight mode and a historical high is detected, we preemptively switch into the primary server state to anticipate a high utilization period, regardless of the current utilization.

In our study, we log historical high periods at minute-level granularity. Because it is unlikely for high or low utilization to recur daily at exactly the same minute, we expand the window of historical high periods by +/-15 minutes. We empirically selected 15 minutes as it provides a good balance between energy savings and performance impact. This allows nearby high utilization periods to merge into a larger high utilization window, which prevents KnightShift from switching into the Knight during this period, a behavior that normally led to performance penalties as seen in figure 3.6b. Figure 3.6e shows the past historical highs marked with an orange bar at the top of the plot. By detecting historical high periods, we avoid switching during short low utilization periods and anticipate future high utilization periods, avoiding significant performance penalties while still achieving reasonable energy savings.

[Figure 3.7: Effect of switching policy on (a) 99th percentile latency, (b) energy consumption, and (c) number of mode switches, for the Aggressive, Balanced, Conservative, and D2D policies across all workloads.]

3.4.3 Switching Policy Evaluation

In this section we use the evaluation methodology presented in section 3.2 with one server. The results in this section are compared to a baseline machine running without KnightShift.

Effect on Latency: Figure 3.7a shows the 99th percentile latency normalized to the 99th percentile latency of the baseline server. Although the aggressive policy has the highest geometric mean latency, several workloads actually perform best using this policy. In particular, workloads that tend to stay at low utilization with rare spikes (aludra, email) seem to benefit the most from this policy. The conservative policy benefits workloads (nunki, girtab) that tend to be random (low autocorrelation). Workloads with high utilization variation (msg-mx, msg-store, scf) benefit most from a balanced policy. Both msg-mmp and msg-mx have latency near 1 because the policies recognize that these workloads tend to always be in a high utilization phase, and therefore do not go into Knight mode. Workloads with daily patterns observable in their autocorrelation plots (msg-mmp, msg-mx, msg-store, scf) all benefit the most from the Day-to-Day switching policy. Using a very simple 1-day history heuristic, KnightShift is able to lower the overall geometric mean latency to about 1.5x the baseline latency. This is in line with the best-case energy-latency tradeoff curves in figure 3.4.
Therefore, it is possible to meet the best-case latency slack even under realistic conditions.

Effect on Energy: Figure 3.7b shows the energy consumption normalized to the baseline server. Note that the energy consumption values presented here for KnightShift cannot be quantitatively compared to the values presented in section 3.3 due to entirely different simulation methodologies and workloads. For all scenarios, the aggressive policy saves the most energy (79.5% geometric mean). As expected, the balanced policy's energy usage (77.8%) falls between that of the aggressive policy and the conservative policy (73.7%). Aggressive saves the most power, but at the cost of the highest latency impact. The D2D policy, meanwhile, saves the least amount of power due to cautiously staying in the primary server mode to anticipate historical high periods. For certain workloads, such as msg-mmp and msg-mx, KnightShift actually consumes more energy than the baseline server, as these are high utilization workloads that offer little opportunity for KnightShift's low power mode; the extra energy is due to the overhead of the Knight remaining on all the time. Overall, the conservative and balanced policies provide the best balance of latency and energy savings.

Effect on thrashing: Figure 3.7c shows the effect of the switching policies on the number of low power mode transitions. The aggressive policy has a geometric mean of 330 mode switches. The balanced policy has a geometric mean of 135 switches; the significant reduction comes from the balanced policy avoiding being tricked by short low utilization periods during high utilization phases (as shown in figure 3.6). The conservative policy has a geometric mean of 550 mode switches. The relatively larger number of switches occurs when there are short high utilization spikes during low utilization phases: the conservative policy immediately switches modes during these spikes, anticipating that it has hit a high utilization phase when it has not. This is most apparent in workloads that exhibit low utilization time windows interspersed with high utilization bursts (aludra, email, girtab, nunki, scf). The D2D policy has a geometric mean of 252 mode switches. Compared to the aggressive policy, the D2D policy is able to avoid switching during short low utilization periods by predicting, from the previous day's heuristics, that the server is likely to be in a high utilization phase even though the current utilization is low for a short time period.

We showed that server-level active low power modes, with appropriate switching policies, can achieve the best-case latency slack needed to achieve energy savings under realistic conditions. Server-level techniques can be competitive with cluster-level packing techniques, requiring only a small latency slack. We also showed that while server-level inactive low power modes become ineffective as core counts grow, server-level active low power modes remain effective even with increasing server core count. By overcoming these two challenges, server-level low power modes can be an attractive alternative to cluster-level packing techniques as future servers continue to improve their energy proportionality.

3.5 Conclusion

While cluster-level packing techniques provided an answer to cluster-wide energy proportionality in the past, continuing improvements to individual server energy proportionality are threatening to disrupt this convention.
Cluster-level packing techniques can now actually limit cluster-wide energy proportionality. As we near a turning point in achieving highly energy proportional servers, it may be favorable to forego cluster-level packing techniques and rely solely on server-level low power modes. By foregoing cluster-level packing techniques, we expose the underlying server's energy proportionality profile to the cluster-wide energy proportionality profile. This exposure allows server-level energy proportionality improvements, such as those made by KnightShift in the previous chapter, to bring cluster-wide energy proportionality improvements.

In order for server-level low power modes to become practical, there must be improvements to the latency slack required for them to be effective. As multi-core servers are now common, multi-core scaling can pose challenges to server-level low power modes: as server core count increases, the number of idle periods that exist across a server decreases, therefore requiring more and more latency impact before the low power mode can translate into energy savings. We have shown that server-level active low power modes can provide consistent energy savings at a low latency slack independent of server core count, unlike server-level inactive low power modes. Therefore, server-level active low power modes, like KnightShift, can provide a low power mode solution for future high core count servers.

One of the negative impacts of KnightShift is the high latency impact for certain workloads. In this chapter we have shown that with the right mode switching policies, server-level active low power modes can meet the best-case latency slack under realistic conditions. By solving these issues, we demonstrate the potential for server-level low power modes to be used in practice.

Chapter 4
A Journey to the Edge of Energy Proportionality

4.1 Introduction

Energy proportional computing has been a major research focus over the past near-decade [10, 43, 47, 53, 54, 60, 63, 65, 70, 72, 77, 80, 82, 84], and over that time a greater emphasis on intermediate-utilization efficiency has emerged. In the last two chapters we designed and evaluated KnightShift and presented heuristic mode switching policies to tackle the latency impact of using KnightShift. As KnightShift brings practical and highly effective energy proportional computing to the mainstream, we now address several questions regarding the future of energy proportional computing. As such, we go on a journey to explore the edge of energy proportionality. Along the way, we will answer various questions relating to the paths that we have taken, or will take, towards the edge of energy proportionality. We will answer the following questions:

Where are we now and how did we get here? We first take a retrospective look back to identify where past energy proportionality improvements came from, and to explain historical trends. Through statistical regression of published SPECpower results, we are able to identify and quantify the sources of past EP improvements. We quantify that processor and DRAM innovations accounted for 74.5% of the EP variation, with the processor uniquely contributing 4x more than DRAM. Despite the fact that processors account for the largest fraction of power consumption, future improvements to all other components (memory, storage, network, motherboard, etc.) will contribute as much as the processor to future server EP improvements.
This is because processors are already largely energy proportional, with little room for further improvement, while non-compute components still hold vast opportunities to improve energy proportionality. (Section 4.2)

Where are we going? The traditional definition of an ideal energy proportional server is that of an ideal linearly energy proportional server, which consumes power linearly proportional to its utilization; that is, if a server is utilized 20%, then it should consume 20% of its peak power, and so on. Energy proportionality has made great strides over the past decade, to the point where servers now exhibit energy proportionality equivalent to that of an ideal linearly energy proportional server. Based on historical SPECpower results, we derive a pareto-optimal frontier for the dynamic range / linear deviation / energy proportionality tradeoff. Based on the derived pareto-optimal frontiers, we conclude that there is still a 35% headroom for EP improvement over an ideal linearly energy proportional server, prompting the need to redefine what an ideal energy proportional server would be. We then present a hypothetical study of a server with all components exhibiting high energy proportionality to demonstrate that such a radical system would still fall within the pareto-optimal frontier. (Section 4.3)

How will we get there? In order to reach the edges of energy proportionality, future energy proportionality improvements must come equally from processors and non-compute components. As processors led the EP revolution to date, we want to explore how processors improved EP, and how we can learn from this and apply it towards non-compute EP improvements. Through detailed characterization of an instrumented server, we identify how different power management techniques drove different components of CPU EP improvement. From these observations we discuss energy savings techniques that can benefit non-compute components. (Section 4.4)

Variables    Description           Values
gen          Processor generation  NetBurst, Core, Penryn, Nehalem, Westmere, Sandy Bridge, Ivy Bridge, Haswell
SSD          Hard drive type       0 - HDD, 1 - SSD
DDR3         DRAM type             0 - DDR2, 1 - DDR3
FBDIMM       Fully Buffered DIMM   0 - No FB-DIMM, 1 - FB-DIMM
Registered   Registered memory     0 - No RDIMM, 1 - RDIMM
LowVoltage   Low voltage memory    0 - No low voltage, 1 - Low voltage DDR3
Table 4.1: Annotated Server Features

4.2 Where are we now and how did we get here?

In this section, we take a retrospective look back at the energy proportional computing revolution and identify the path that we took to get here. Through statistical regression of published SPECpower [83] results, we are able to identify and quantify the sources of past energy proportionality improvements and explain the various historical trends that we observed.

4.2.1 Methodology

For our retrospective analysis, we use reported SPECpower benchmark [83] results, which contain each server's performance and power characteristics measured incrementally at 10% utilization intervals. Our SPECpower dataset contains 398 servers from 2007 to 2014 that are a representative mix of server configurations in current production environments. Note that for the results in Chapter 2 we used SPECpower data for 291 servers published until Dec 2011; in this chapter we expanded the data set to include all the recent server results as well, giving us 398 servers. They feature servers from various vendors, with various form factors and processors, including multi-socket servers.
Other non-server form factors, such as mobile processors and heterogeneous processors, are outside the scope of this work, but are interesting topics for future analysis.

Figure 4.1a shows the historical energy proportionality trend. For this plot, we computed the EP using equation 2.2 as defined in chapter 2. While it is clear that energy proportionality has improved drastically over the past near-decade [63, 77, 80], what is not clear is which technological advances contributed to improving energy proportionality, and by how much. By identifying and understanding the driving forces behind energy proportionality, we are better able to gain insight into techniques that improve proportionality, and also to identify potential future areas for improvement. We limit our analysis to coarse-grain advances (such as processor generations, memory type, etc.) due to limitations of what we can infer from reported SPECpower server configurations. The reported server configuration gives high-level details, such as processor model, DRAM type and size, but lacks more fine-grain details, such as which on-chip low power modes are active, whether ECC is active, etc. In order to identify coarse component-level contributions to energy proportionality, we propose applying statistical stepwise regression.

4.2.2 Feature Annotation

We annotated each server result with various technology features associated with that server, as shown in table 4.1. For each server in the SPECpower dataset we identified the following: the processor generation (labelled gen in the table), whether the server used SSDs, which DRAM technology was used for the memory subsystem, whether fully buffered DIMMs or registered memory DIMMs were used, and finally whether the DRAM was low voltage DDR3 or otherwise. Due to the granularity of details in SPECpower results, we are unable to annotate with more detailed technology features; we only annotate the features that we can confidently extract from published SPECpower results. Although there exist other fine-grain processor technology features, such as the presence of P-states, C-states, and integrated memory controllers, these features are inherently part of the processor. From the SPECpower results, we can confirm that OS power management is enabled; therefore, we assume that all available processor power management techniques, such as P- and C-states, for the corresponding server configuration are available. Thus these fine-grain features are implicitly accounted for in the gen feature.

In addition to technology features, we also annotated and regressed with configuration parameters (processor frequency, number of processor cores, number of processor sockets, memory size, etc.). We found that configuration parameters are not significant when it comes to impacting EP; in fact, our resultant models and conclusions remain exactly the same. We then ran the annotated dataset through a stepwise multiple regression analysis to determine the order of importance of the various predictor variables. We performed the stepwise regression analysis using SPSS 22 [42].

4.2.3 Stepwise multiple regression

Stepwise multiple regression is a method commonly used in educational and psychological research to select a set of predictor variables, from a large pool of predictor variables, that can best explain the dependent variable.
In our case, the set of predictor variables are the variables presented in table 4.1 and the dependent variable is EP. We use the bidirectional elimination [42] approach to decide which predictor variables to include in or exclude from the regression. We use a standard inclusion criterion of p < 0.05 and exclusion criterion of p > 0.05, giving us a 95% level of confidence; a p-value of 0.05 is a widely accepted level for denoting statistical significance [18]. A comprehensive description of stepwise multiple regression is beyond the scope of this work; a good introduction to the approach is provided in [17, 28].

4.2.4 Stepwise Regression Output

Table 4.2 shows the results of the stepwise regression, which produced 4 models. The table column headers are defined below.

R: Multiple correlation coefficient; the simple bivariate correlation between the observed and predicted values of y. The closer R is to 1, the stronger the linear association.

R Squared: Coefficient of determination; indicates the proportion of variance in the dependent variable which is accounted for by the model. For example, model 1 can explain 70.8% of the variance of EP.

Adjusted R Squared: A modification of R squared that takes into account the number of explanatory terms in a model. Unlike R squared, which always increases with the introduction of a predictor variable, adjusted R squared will only increase if the new predictor variable improves the model more than would be expected by chance. This metric is useful for comparing nested models. For example, based on the adjusted R squared value, model 2 has more explanatory power than model 1.

Coefficients: Represent the change in the dependent variable given a unit change in the predictor variable. Unstandardized coefficients are in natural units. For example, in model 1, for every processor generation introduced, EP is expected to improve by 0.122.

t and Sig: t-statistics and two-tailed p-values (Sig.), used to test whether a given coefficient is significantly different from zero. A Sig. value less than 0.05 indicates significance.

Semi-partial Correlation: The proportion of variance in the dependent variable that is uniquely accounted for by a particular predictor variable and not accounted for by the other predictor variables. The squared semi-partial correlation gives the percentage of the total variance in y that is uniquely accounted for by a particular predictor variable beyond that accounted for by the other predictor variables. For example, in model 2, 14.9% of the total variance of EP is uniquely accounted for by gen and not accounted for by DDR3 (0.386^2).

Model  R     R Squared  Adj. R Squared  Variable     B      Std. Error  t        Sig.   Semi-partial
1      .841  .708       .707            (Constant)   .189   .016        11.683   .000
                                        gen          .122   .004        30.759   .000   .841
2      .863  .745       .744            (Constant)   .202   .015        13.245   .000
                                        gen          .088   .006        15.071   .000   .386
                                        DDR3         .148   .020        7.516    .000   .192
3      .868  .754       .752            (Constant)   .234   .017        13.547   .000
                                        gen          .070   .007        9.278    .000   .234
                                        DDR3         .158   .020        8.085    .000   .204
                                        LowVoltage   .057   .015        3.760    .000   .095
4      .871  .758       .756            (Constant)   .268   .021        12.468   .000
                                        gen          .069   .007        9.209    .000   .230
                                        DDR3         .128   .022        5.708    .000   .143
                                        LowVoltage   .058   .015        3.834    .000   .096
                                        FBDIMM       -.056  .021        -2.586   .010   -.065
Table 4.2: Stepwise regression analysis results (B is the unstandardized coefficient)

4.2.5 Selecting a parsimonious model

The goal here is to find a parsimonious model that can explain the individual component contributions to energy proportionality improvements.
A parsimonious model is a model that achieves a desired level of explanation or prediction with as few predictor variables as possible. While models with a greater number of variables may have a greater R squared value (indicating the variables account for more variance in the model), they may not be parsimonious if some of these variables do not sufficiently contribute to explaining the total variance (low squared semi-partial correlation).

The stepwise regression stopped after generating model 4 because no other predictor variables met the entry criteria, that is, no remaining predictor variables were significant. Therefore, according to the results of the stepwise regression, only gen, DDR3, LowVoltage, and FBDIMM are significant in predicting a server's EP. Although LowVoltage and FBDIMM are statistically significant contributors to the model, they each uniquely explain less than 1% of the total variation in EP based on their squared semi-partial correlations. Therefore, the main contributing factors are gen and DDR3. Although model 4 has the largest R squared value, it includes two predictor variables that contribute only minimally to explaining the variation in EP. Therefore, we will use model 2, which uses just gen and DDR3, because it is the more parsimonious model.

4.2.6 Findings from Stepwise Regression

The goal of this section is to answer the question, how did we get here? By identifying the path we took during historical EP growth, we are better able to understand the implications for future EP growth. The regression model showed that processor generations and the DRAM switch from DDR2 to DDR3 are the main driving forces behind historical EP growth. This conclusion should be straightforward given the large proportion of power consumed by both processors and DRAM, and it has also been observed in prior work [31], but it has lacked quantification. From the regression, we are able to quantify the effect of each component through our statistical analysis: using model 2, the switch to DDR3 is credited with improving EP by 14.8%, and every new generation of processor is expected to improve EP by 8.8%. Together, processor generations and the switch to DDR3 account for 74.5% of the EP variation.

Due to integrated memory controllers, the introduction of DDR3 and the corresponding processor generation (Nehalem) were simultaneous. Therefore, a large portion of the EP variation is explained by a combination of both gen and DDR3. While it may be obvious that processor and memory improvements have driven energy proportionality growth, what is not known is how much each of these components contributes individually. Using our correlation analysis, we show that gen can uniquely explain 14.9% of the EP variation, while DDR3 can uniquely explain 3.7%. Thus processor generation can uniquely explain 4x more variation in EP than the switch to DDR3. The remaining 55.9% is attributed to a combination of both processor and memory improvements. For example, with the integration of memory controllers into processors, the stepwise regression has difficulty separating the individual contributions. Potentially, this can be improved with larger sample sizes, as well as more detailed SPECpower server configuration reports, which are currently unavailable.
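For readers without SPSS, the bidirectional stepwise procedure described in section 4.2.3 can be sketched as follows. This uses statsmodels as an assumed substitute for SPSS, the dataframe and column names are illustrative, and gen is assumed to be encoded as an ordinal integer (0 = NetBurst, ..., 7 = Haswell).

import pandas as pd
import statsmodels.api as sm

def bidirectional_stepwise(df, target="EP", p_enter=0.05, p_remove=0.05):
    """Bidirectional elimination on p-values: add the most significant remaining
    predictor if its p < p_enter, then drop the least significant selected predictor
    if its p > p_remove, repeating until no change (section 4.2.3)."""
    candidates = [c for c in df.columns if c != target]
    selected = []
    while True:
        changed = False
        # Forward step
        pvals = {}
        for c in (c for c in candidates if c not in selected):
            model = sm.OLS(df[target], sm.add_constant(df[selected + [c]])).fit()
            pvals[c] = model.pvalues[c]
        if pvals and min(pvals.values()) < p_enter:
            selected.append(min(pvals, key=pvals.get))
            changed = True
        # Backward step
        if selected:
            model = sm.OLS(df[target], sm.add_constant(df[selected])).fit()
            worst = model.pvalues.drop("const").idxmax()
            if model.pvalues[worst] > p_remove:
                selected.remove(worst)
                changed = True
        if not changed:
            return selected

# Illustrative use on an annotated dataset with the table 4.1 features (file name assumed):
# df = pd.read_csv("specpower_annotated.csv")  # columns: EP, gen, SSD, DDR3, FBDIMM, Registered, LowVoltage
# print(bidirectional_stepwise(df))            # expected to retain gen, DDR3, LowVoltage, FBDIMM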
Explaining historical trends

Now that we have established which variables contributed most to energy proportionality changes, we can examine how these variables drove those changes. We will use the results of the stepwise regression, along with SPECpower results, to analyze historical trends in energy proportionality. Figure 4.1 shows the historical trend of energy proportionality (EP), dynamic range (DR), and linear deviation (LD) for SPECpower [83] results from 2007 to 2014. Each point represents a published SPECpower server result and is coded to a processor generation. Processor generations are also coded with DRAM type; generations marked with ^ support DDR2, while all other processors support DDR3.

(a) Historical Energy Proportionality Trends (b) Historical Dynamic Range Trends (c) Historical Linear Deviation Trends
Figure 4.1: Historical trends, labeled with processor generation (NetBurst^, Core^, Penryn^, Nehalem, Westmere, Sandy Bridge, Ivy Bridge, Haswell). Processors with the ^ label support DDR2; all others support DDR3. Improvements to EP occur in spurts. Two major EP jumps in mid-2009 and early 2012 are due to DR and LD improvements, respectively.

Over the past 7 years, there has been significant growth in energy proportionality, as seen in Figure 4.1a. Notably, improvements tend to occur in spurts. Two notable jumps occurred during mid-2009 and early 2012.

The first major EP jump, during mid-2009, was fueled mainly by a substantial jump in DR, as shown in the corresponding DR plot. The DR of servers jumped from around 50% to 80%. Since then, DR growth has stalled. The second EP jump, in early 2012, was fueled by an improvement to LD. Before 2012, the best LD was less than -0.1. During early 2012, LD dropped to -0.2, significantly driving up EP. As DR hit a wall, EP improvements were fueled by reducing LD. As we will show in section 4.3, we foresee that LD will also hit a wall.

EP growth spurts tend to correspond with the introduction of new processor microarchitectures [65]. Intel follows a "tick-tock" model, where every "tick", a shrinking of process technology, is followed by a "tock", a new microarchitecture. The "tocks" in the figure are Core, Nehalem, Sandy Bridge, and Haswell. We observe that there is a significant EP jump during the introduction of a new microarchitecture, as supported by our stepwise regression. Note that there is no initial jump during the introduction of Sandy Bridge, as the release of the server-class version (Xeon E5) was delayed by a few months after the introduction of the desktop version of the processor. In this case, the server version of Sandy Bridge coincided roughly with the release of Ivy Bridge.

The first major EP improvement, seen around mid-2009, is due to the combination of the shift from DDR2 to DDR3 and a new processor generation. The jump occurred with the introduction of the Nehalem architecture, which is coupled closely with DDR3. In this case, the introduction of DDR3 and integrated memory controllers with Nehalem was simultaneous.
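Concretely, these unique contributions follow from the squared semi-partial correlations in Table 4.2 (model 2): gen uniquely explains 0.386² ≈ 0.149, or 14.9%, of the variance in EP, and DDR3 uniquely explains 0.192² ≈ 0.037, or 3.7%. Since model 2's total R squared is 0.745, the remainder, 0.745 - 0.149 - 0.037 ≈ 0.559 (55.9%), is variance explained jointly by the two predictors and cannot be attributed to either one alone.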
87 The introduction of Nehalem demonstrates one of the reasons why our statistical analy- sis shows that processors uniquely contribute 14.9% and memory uniquely contributing 3.67%, despite both processor and memory accounting for a combined 74.5% of vari- ation. Due to the simultaneous introduction of many technological advances, which leads to multicollinearity [17,28] , it is difficult to tease out the individual contributions of each component. A possible remedy for multicollinearity is to increase the data set size, which would allow us to tease out the individual contributions of processors and memory more. Around the end of the last decade (2007-2010), it was observed that processors no longer dominate server energy power consumption [10, 31, 74] due to high energy con- sumption of DDR2 and FBDIMM DRAM. Nehalem was the first Intel processor with integrated memory controller [68], introducing support for DDR3 memory. DDR3 pro- vided significant energy improvements compared to DDR2. For example, 1GB of 60nm DDR3 consumes 35% less power than 60nm DDR2; with continued process scaling, DDR3 can consume 86% less power compared to DDR2 [64]. Once DDR3 adoption became widespread memory power consumption was lowered and then the pendulum swung back to the processor, which once again dominate server energy consumption. This is a trend that has also been observed by [9, 43]. By lower- ing memory power consumption, we essentially improve the dynamic range of servers, leading to the large improvement to DR in figure 4.1. However, while overall memory power has decreased, the energy proportionality of memory has still not improved as we will see in section 4.4. Therefore, there still exist future opportunities to improve the energy proportionality of memory. The second major EP improvement is likely due to TurboBoost. The second major EP jump, seen during early 2012, occurred with the introduction of server-class Sandy Bridge architecture. During this time period we also saw the wide-spread adoption of 88 SSDs and also low voltage DDR3. While their unique contributions captured in the stepwise regression is minimal, however, that is not to say that their development was aimless; their energy efficiency benefits and implications to data/storage systems are immense in practice. Due to the SPECpower benchmark characteristics, it may not stress the IO sufficiently, leading to limited contribution for SSDs. Low voltage DDR3 would typically lead to better dynamic range due to lower overall power usage, but due to the low fraction of memory power consumption, the impact to overall system energy proportionality improvement may be limited. Therefore, the second EP jump can be contributed solely to the introduction of server-class Sandy Bridge architecture as SSDs has minimal effect on energy propor- tionality and there was no significant new memory technology introduced at this time. This second jump in EP is mainly caused by a significant improvement to the linear deviation as shown in the corresponding LD chart. Since dynamic range has stalled, the only way to improve EP is through linear deviation improvements, that is, improving the energy efficiency of intermediate utilization levels. We speculate that improved power management, such as TurboBoost, led to the improvement to LD. Although TurboBoost was introduced with Nehalem processors, it was disabled in SPECpower results until TurboBoost v2 was introduced with Sandy Bridge. Therefore, Sandy Bridge marks the first SPECpower results with TurboBoost. 
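The selection procedure used throughout this section can be reproduced with standard statistical tooling. The sketch below shows a minimal bidirectional (stepwise) elimination loop driven by ordinary least squares p-values; it assumes a pandas DataFrame with an EP column and candidate predictor columns such as gen and DDR3. The file name, column list, and thresholds are illustrative, and this is not the exact script behind Table 4.2.

import pandas as pd
import statsmodels.api as sm

P_ENTER = 0.05   # inclusion criterion used in this analysis (p < 0.05)
P_REMOVE = 0.05  # exclusion criterion; often set slightly higher in practice to avoid cycling

def stepwise_select(df, target, candidates):
    """Bidirectional (stepwise) predictor selection using OLS p-values."""
    selected = []
    while True:
        changed = False
        # Forward step: add the remaining candidate with the lowest significant p-value.
        remaining = [c for c in candidates if c not in selected]
        best_var, best_p = None, 1.0
        for var in remaining:
            model = sm.OLS(df[target], sm.add_constant(df[selected + [var]])).fit()
            if model.pvalues[var] < best_p:
                best_var, best_p = var, model.pvalues[var]
        if best_var is not None and best_p < P_ENTER:
            selected.append(best_var)
            changed = True
        # Backward step: drop the selected variable with the worst non-significant p-value.
        if selected:
            model = sm.OLS(df[target], sm.add_constant(df[selected])).fit()
            pvals = model.pvalues.drop("const")
            worst = pvals.idxmax()
            if pvals[worst] > P_REMOVE:
                selected.remove(worst)
                changed = True
        if not changed:
            return selected

# Illustrative usage (file and column names are hypothetical):
# df = pd.read_csv("specpower_servers.csv")
# print(stepwise_select(df, "EP", ["gen", "DDR3", "LowVoltage", "FBDIMM", "SSD"]))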
4.3 Where are we going? Identifying a possible edge of energy proportionality

In the previous section, we explored the path that led to our current position on energy proportionality. We are now beginning to see many highly energy proportional servers that approach ideal linear energy proportionality (EP = 1). In this section, we answer the question: where are we going? The goal here is to identify the boundaries of energy proportionality in order to get an idea of what future energy proportional servers will look like. Specifically, the goal is to identify what the most reasonable best-case energy proportional server may look like.

Based on historical SPECpower results, we first derive a Pareto-optimal frontier for the dynamic range/linear deviation/energy proportionality tradeoff. We conclude that there is still a 35% headroom for EP improvement over an ideal linearly energy proportional server, prompting the need to redefine what an ideal energy proportional server would be. We then present a hypothetical study of a server with all components exhibiting high energy proportionality to demonstrate that such a radical system would still fall within the Pareto-optimal frontier.

4.3.1 Deriving Pareto-optimal Frontier

We make use of Pareto frontiers to capture the tradeoff between dynamic range, linear deviation, and energy proportionality. Figure 4.2a shows the results of the Pareto-optimal frontier analysis. We plot 398 servers from SPECpower results (see section 4.2 for SPECpower dataset details). It turns out that all of these data points fall on a plane in 3D space, as indicated by the wireframe plane. This plane exists because there is a mathematical relationship between DR, LD, and EP [33]. The plane is given by:

EP = 2 - (2 - DR)(LD + 1)    (4.1)

Since all points fall on this plane, our Pareto frontier simplifies to a line on the plane. Pareto-optimal server designs that fall on the Pareto frontier are indicated by an enclosing black-bordered box. These server designs were identified by iterating through all SPECpower servers, sorted by ascending DR, and identifying subsequent servers with better DR, LD, and EP than the last identified Pareto-optimal point. For simplicity in viewing and comprehension, we break down the three-dimensional plot into two two-dimensional plots with components DR(ep) and LD(ep). In DR(ep) and LD(ep), ep is the energy proportionality of a measured server in SPECpower. We then fit a quadratic polynomial through the Pareto frontier points using the least-squares regression method such that all design points are enclosed by the frontier. The Pareto frontier represents optimal design points in terms of their respective components. In other words, a design point that falls on the Pareto frontier has a component that is not dominated by another design point.

For simplicity in visualization and comprehension, we decompose the DR/LD/EP tradeoff into DR/EP and LD/EP tradeoffs. Each decomposed tradeoff represents a component of the DR/LD/EP tradeoff. Figure 4.2b shows the DR/EP server design space. The Pareto-optimal frontier for the DR/EP tradeoff is given as:

DR(ep) = -0.1025 ep² + 0.8594 ep + 0.026    (4.2)

Figure 4.2c shows the LD/EP server design space. The servers along this Pareto frontier are the same points as those in figure 4.2b. We identified LD(ep) in a similar manner as before, through a fitted quadratic polynomial.
The Pareto-optimal frontier for the LD/EP tradeoff is given as:

LD(ep) = -0.197 ep² + 0.022 ep - 0.004    (4.3)

Using these derived Pareto frontiers, we are now able to identify a "new ideal" for energy proportionality. The possible values of dynamic range are [0, 1]. Given this and DR(ep), we see that if dynamic range is 1.0, then solving for EP, we can achieve an EP of 1.35. This number is given by solving 1.0 = -0.1025 ep² + 0.8594 ep + 0.026 for ep. Through LD(ep), we can also see that LD(1.35) would give LD = -0.32. Therefore, rather than holding up an ideal linearly energy proportional server (EP = 1) as "ideal", we contend that an ideal energy proportional server should be a server with an EP of 1.35.

(a) DR/LD/EP Tradeoff Frontier (b) Dynamic Range/Energy Proportionality Frontier (c) Linear Deviation/Energy Proportionality Frontier
Figure 4.2: Pareto-optimal frontier derived from historical SPECpower results, labeled with processor generation (NetBurst through Haswell), along with the Pareto frontier and the hypothetical server. Data points with a gray-white box around them are Pareto-optimal points.

4.3.2 Case Study: A Hypothetical Super Energy Proportional Server

Our prior analyses are based on current and historical servers where processors and memory are the main drivers of energy proportionality improvements. One may wonder whether the previously identified Pareto-optimal frontiers will still hold true for radically different servers with aggressive energy proportionality profiles. We will now present a hypothetical case study of a super energy proportional server in order to give confidence in the Pareto frontiers that we identified.

Figure 4.3a shows the component-level energy proportionality curves of the highest energy proportional server reported from SPECpower. This server has an EP of 1.05 with a DR of 0.833 and an LD of -0.1888 (the upper-right-most Pareto-optimal point in figure 4.2b and the lower-right-most Pareto-optimal point in figure 4.2c). We assume that the component-level power breakdown is similar to our empirical component-level power measurements in section 4.4, interpolating the ratio of component power consumption at peak and idle utilization. Figure 4.3a shows the energy proportionality of the server along with the component-level breakdown.

Clearly, the processor accounts for the vast majority of the power consumption. Furthermore, the processor is the most energy proportional component. Both the memory and hard drive are relatively energy disproportional. The hard drive accounts for a near constant 2.6% of the server's peak power consumption. The memory consumes 4.7% at idle and 6.0% at peak. The other category measures the power difference between the wall power (whole server system) and our component power measurements. Therefore, other includes the power supply, the fans, and other motherboard components, such as the network interface.
The others category consumes 3.6% at idle 93 0 20 40 60 80 100 0% 20% 40% 60% 80% 100% % of Peak Power Utilization CPU Other Mem HD (a) EP profile of a high energy proportional server 0 20 40 60 80 100 0% 20% 40% 60% 80% 100% % of Peak Power Utilization CPU Other Mem HD (b) Hypothetical server with EP components Figure 4.3: Energy Proportionality curve of a high energy proportional server with pro- portional processor only (a), and hypothetical server with all proportional components (b). and 12.2% at peak. The relatively large dynamic range of the other category can be at- tributed to the fans due to the cubic relationship of power and fan speed. Our empirical component-level power breakdown is similar to those recently reported [43]. Figure 4.3a clearly shows that CPU is the most proportional component in the server, while other components are not. Now let’s assume that all of these non-compute com- ponents are just as energy proportional as the processor. This would result in a hypo- thetical server with an EP curve as shown in figure 4.3b. This radically different server would represent the case where if all future energy proportionality improvements are contributed by non-compute components. This hypothetical server would have an EP 94 of 1.20, DR of 0.923 and LD of -0.257. By having all components being energy pro- portional, the dynamic range has improved to the point where the server would only consume 7.7% of the peak power at idle. This hypothetical server is labeled in figure 4.2. Despite this server’s aggressive energy proportionality profile of all components, it still falls along the Pareto-optimal frontier. Therefore, we’re confident that our derived Pareto frontier will still hold true for future server platforms with aggressive energy proportionality mechanisms. We contend that an ideal energy proportional server should be a sublinear energy proportional server with an EP=1.35. We have derived Pareto frontiers based on histori- cal server design points, and demonstrated that our models still holds true for hypothet- ical aggressively proportional servers. Achieving EP=1.35 should be the next goal in the energy proportionality frontier. The question we now ask is, how can we realize this new ideal energy proportional server? Where will the improvement opportunities come from? 4.4 How can we get there? In order to understand how non-compute components can become more proportional, it may be beneficial to study how processors became so proportional in the first place. Processors contain many low power management techniques, such as idle states and frequency-voltage scaling states. Knowing how these power states impact energy pro- portionality can give us an idea of what types of techniques should be applied to other non-compute components. Energy proportionality improves in two manners, improve- ments to DR, and through improved efficiency at intermediate utilization levels. 95 Figure 4.4: Instrumented server details. Current sensors are spliced into the power supply cables for the CPU, memory, and disk. 4.4.1 Methodology We configured a high energy proportional server modeled after a Dell R520 server from SPECpower results. The configuration of the server is shown in table 4.3. In order to measure component-level energy proportionality, we added power measurement capa- bility to the power rails providing power to the memory, cpu, disk, and overall server as shown in figure 4.4. 
We instrumented each individual component by intercepting the power rails and measuring the current with LTS 25-NP current sensors. The outputs of the current sensors are sampled at 1 kHz using a DAQ and logged using LabView. To measure CPU power, we inserted a current sensor in series with the 4-pin ATX power connector. Our processor consumed 39.6W at idle and 205.9W at full load. To measure memory power, we inserted a current sensor in series with pins 10 and 12 of the 24-pin ATX power connector, which supplies power to the motherboard [73]. We verified that these pins measure memory power usage by removing individual DIMMs and observing the change in idle power consumption on these pins. At idle, the memory consumed 12.3W of power, and at high load, the memory consumed 15.7W of power. To measure the power of the hard drive, we inserted a current sensor in series with the hard drive backplane power connector. At idle, the hard drive and the backplane consumed 6.1W of power; the backplane accounted for a constant 4.7W of that power. To measure overall server power, we use a Watts Up? Pro power meter. At idle, our server consumed 68.0W of power, and at full load, it consumed 260.0W of power.

Table 4.3: Server configuration
  Processor Type: 2 x Intel Xeon E5-2470
  Processor Frequency: 2.3 GHz, up to 3.1 GHz TurboBoost
  Memory: 24 GB (6 DIMMs x 4 GB)
  Disk Drive: 1 x 100 GB SSD
  System Profile: Performance per Watt (OS)
  OS: Linux Mint 16 (Linux kernel 3.11)

We run Linux Mint 16 with the Linux 3.11 kernel. Although the vast majority of servers in the SPECpower published results run Windows Server, we decided to go with Linux due to the flexibility in controlling power management. The processor power management is controlled by the Intel P-state driver (intel_pstate), a relatively new frequency scaling driver introduced with the Linux 3.9 kernel, supporting modern Intel processors starting with Sandy Bridge. We are able to enable/disable various processor power management modes, such as TurboBoost and C-states.

In order to get the energy proportionality profile of the server, we run the SPECpower benchmark [83]. The SPECpower benchmark consists of 3 calibration phases, followed by 11 different load phases, from 100% to 0% (active idle) in 10% decrements. From these results, we are able to derive the component-level energy proportionality curve of the server as shown in figure 4.3a.

Figure 4.5: Effect of power management on EP (server power in watts versus utilization for the OS, NoTurbo, NoC1E, and NoCn policies).

4.4.2 Low Power Modes

Modern processors contain a large number of power saving measures. Common processor power management includes processor performance states (P-states) and processor operating states (C-states).

C-states: C-states are processor idle states. The C0 state is the normal operating state where code is being executed. The C1 state, often known as Halt, is a state where the processor is not executing instructions but can return instantaneously to an executing state. The processor is put into the C1 state when it receives a Halt instruction from the OS. The C1E state (Enhanced Halt State) additionally reduces clock speed and processor voltage. In the C1E state, the processor is typically at its lowest operating frequency. For states C2 and up, the processor shuts off various parts for greater power savings at the cost of higher wakeup time.

P-states: In the past, with single-core processors, simple dynamic voltage-frequency scaling provided an effective method to scale runtime performance and power.
Since there was only a single-core, a frequency can be typically mapped to a P-state. But, with the emergence of multi-core processors, runtime dynamic power management of processor power have gotten much more complicated. A frequency can no longer be simply mapped to a P-state. On current Intel processors all cores in a package share the same voltage. Furthermore, all cores share the same clock frequency, with the exception 98 of an idle core. The OS can request a separate P-state for each logical processor. The core frequency on the package will then be maximum of the individual core’s requested frequency. The P0 state is the maximum possible operating frequency. The P1 state is the guaranteed frequency that the cores can simultaneously operate in at heavy loads. For example, in our configured server system, our processors has a frequency of 2.3GHz with up to 3.1Ghz TurboBoost. Therefore, the P1 state would be 2.3GHz and the P0 state would be 3.1GHz. Any other P-states, P2 and up (Pn), are energy efficient states. The P-states P1 to Pn are all OS controllable. Any frequency in the P1 to P0 range is controllable only by the hardware. 4.4.3 Effect of Low Power Modes on Server EP In order to explore the effect of various low power modes on server energy proportion- ality, we configured our server with various configurations of processor power manage- ment policies enabled. We explored four policies: OS, NoTurbo, NoC1E, and NoCn. OS represents the default OS power management using the intel pstate drive. Under this policy, all P-states, C-states, and TurboBoost are available for the OS and hardware. The NoTurbo policy disables TurboBoost (disables P0) but all C-states and P-states (other than P0) are enabled. This configuration allows us to see the effect of enabling and dis- abling TurboBoost on server EP. The NoC1E policy disables the Enhanced Halt State, but leaves all other C1 and Cn policies enabled (P1-Pn states are also enabled). For C1 state, the processor will halt and still be running at its original clock speed and processor voltage. The enhanced C1E state is the lightest idle state that also reduces clock speed and processor voltage, therefore, during non-idle periods, the C1E state represents the deepest sleep state that most cores will achieve. The NoCn policy disables all idle states, leaving only C0 enabled. This policy represents the lack of processor idle power states. 99 Policy EP DR LD OS 0.721 0.739 0.0137 NoTurbo 0.571 0.647 0.0561 NoC1E 0.433 0.533 0.0681 NoCn 0.089 0.135 0.0247 Table 4.4: EP, LD, and DR values for various processor power management policies Figure 4.5 and table 4.4 shows the results of this study. Our configured server when using the OS profile has an EP of 0.721 with DR of 0.739 and LD of 0.0137. As ex- pected, the OS profile featured the best EP, along with the best DR and LD since the OS has the option of using the full range of power savings options at its disposal. The high DR value is mainly due to the higher peak power while using TurboBoost. Under TurboBoost, the peak power of the server reached 260W, while all other policies with- out TurboBoost reached about 197W at peak. Furthermore, OS also has the lowest LD (recall, for LD, the lower the better). We compare the OS curve with the NoTurbo curve in order to isolate the effect of TurboBoost on server EP. The NoTurbo curve follows the OS curve during lower uti- lization, and then steadily continues increasing with higher utilization levels. 
At low utilization levels, both curve are essentially the same since there is not a need to enable TurboBoost when the load is low. With TurboBoost enabled, there is a distinctive “knee” in the energy proportionality curve starting at 40% utilization. Starting around this uti- lization, the power policy begins to identify opportunity to enable TurboBoost. From ta- ble 4.4, we can see that by simply enabling TurboBoost, we see a large improvement in the LD of the server. Without TurboBoost enabled, the server has an LD of 0.0561, while with TurboBoost enabled, the LD improved significantly to 0.0137. Therefore, power management techniques that improve the intermediate utilization energy efficiency, such as TurboBoost, would improve overall server energy proportionality by lowering LD. While some of the broader observations may be readily inferred, one of our goals is to quantify this benefit to show the relative importance of these features in improving EP. 100 For the NoC1E curve, we disabled TurboBoost and the C1E state, leaving the C1 and Cn states available. By disabling C1E state, we can compare this curve with the NoTurbo curve to identify the effect of the C1E state. Recall that the C1E state is an idle state that has reduced clock speed and processor voltage that can return instantaneously to an execution state. The C1E state enables power saving opportunities to a core even when the other cores in the package are active. The gap between the NoTurbo and NoC1E curve represents the power savings that can be contributed to C1E. With C1E disabled, the peak power remains the same, but due to the greater power consumption of the C1 state (compared to C1E), the dynamic range decreased from 0.647 for NoTurbo to 0.533. By disabling C1E, the EP of the server decreased by 0.138 from 0.571 for NoTurbo to 0.433 for NoC1E. Thus it is clear that innovations, such as C1E, that enable a processor to enter lowest frequency while in idle state and still allowing it to return to active state quickly can have a huge impact on improving EP. Particularly in multi-core environments when long idle periods at the system level are few and far in-between [58, 82], enabling a few idle cores to enter C1E state can drastically improve the EP of the server without relying on long idle durations. For the NoCn curve, we disabled TurboBoost and all idle states (C1, C1E, and Cn), leaving only the active C0 state. This curve resembles that of all non-compute com- ponents in servers today. The area between the NoCn and NoC1E curve represents the effect on energy proportionality of processor idle states. This shows the power sav- ings opportunity for idle processor states. As can be expected, this gap increases as the utilization decreases due to the presence of more idle periods as utilization decreases. For low utilizations when many of the cores are idle, the cores will still remain in an active operating state, causing significant waste in energy. Without any idle states, the EP of the server is only 0.089 with DR of 0.135. Without any idle states, the processor essentially runs at peak power regardless of operating load. 101 4.4.4 Low Power Opportunities for Non-Processor Components At present, the non-compute components’ energy proportionality curve resembles that of the NoCn curve, which lacks idle states. If we are able to achieve proportionality for all components, we would have a server energy proportionality profile similar to fig- ure 4.3b. 
To realize the server in this figure, one would require both reduction to DR and LD, through idle states power reduction and improvement to intermediate utilization efficiency. In section 4.2, we showed that equal importance must be given to all server components for future energy proportionality growth, not just the processor. In this sec- tion, we demonstrated the effectiveness of idle states and aggressive DVFS techniques on energy proportionality. While prior research has mainly focused on processor power management due to its dominance, we are now at a crossing point where proportionality improvements to non-compute components can bring as much energy proportionality improvements as processors. Therefore, the low-hanging fruits for future server energy proportionality improvements will be in developing idle states and scaling techniques for memory, disk, networking, and other non-processor server components. We will highlight a few such opportunities for less commonly studied components, such as disks and network interfaces. Disks. We are beginning to see the widespread adoption of SSDs into servers. The switch from spinning hard drives to SSDs helped overall server energy proportionality by reducing the dynamic range of the server since SSDs consume less power than spin- ning disks. But by themselves, SSDs also are not very energy proportional [67]. To make matters more complicated, garbage collection tends to occur in the background, making idle periods rare for idle power states. These operations tend to be carried out by the embedded processors inside of SSDs. Therefore, an opportunity exists to throttle the embedded processor inside of SSDs during idle periods, as garbage collection speed may not be performance critical if current requests to the disk is low. 102 Network Interface. Network interface cards (NICs) have been the topic of a few energy oriented works [3, 69]. [69] proposed metrics to measure the energy efficiency of network interfaces. Relating to the energy proportionality of NICs, [69] finds that there is very little difference in the power consumption of an idle or loaded NIC, which implies poor energy proportionality. We found that sleep states improve processor DR. This approach can be applied to NICs where if number of active links is low, circuit- level techniques, such as power gating, can be incorporated to gate unnecessary link hardware. When CPUs offload network processing to NICs, these network processors can be throttled based on bandwidth usage to improve EP. In another interesting ap- proach, these network processors have been enhanced to receive offloading from the main processor [3], in order to perform certain tasks while allowing the server to sleep. Furthermore, network energy proportionality has also been gaining interest at the data center network level [2]. 4.5 Related Work EP Trends. Prior work also looked at the historical trends of energy proportionality, but only from a quantitative standpoint [32, 57, 63, 65, 77, 80]. In this work, in addition to presenting updated historical trends, we also provide insightful explanations, through statistical analysis and empirical measurements, into why energy proportionality im- proved the way it did. In addition, we also develop models to anticipate future energy proportionality trends. Super Proportional Servers. Better-than-proportional servers appeared in a few prior works. [9] presented an arbitrary illustrative better-than-proportional curve, to il- lustrate what is desirable in WSC. 
[56] discussed briefly the possibility of better-than- proportional servers due to optimistic coordinated scaling of CPU and memory. Our 103 new contribution is formalizing what a better-than-proportional server can be by iden- tifying practical limits for energy proportionality and presenting evidence supporting this newly defined ideal. We present a concrete better-than-proportional curve backed by our analysis, and show the possibility of actually realizing it, and some potential implications of these better-than-proportional servers. Non-CPU Energy Efficiency. [56] explored opportunities for CPU/Memory/HDD low power modes and CPU/Memory power-performance tradeoffs. While non- component EP is a theme our work share, we further showed that in order to continue improving EP, the “other” components (non-CPU/Mem/HD) will be equally important. From our empirical measurements, we find that these “other” components consume more than Mem and HD combined. 4.6 Conclusion Energy proportionality of current servers has improved to the point where servers are now approaching the traditional definition of “ideal” proportional servers, namely they burn power that is perfectly in proportion to the utilization. In this paper, we challenge the widely-held definition of an “ideal” energy proportionality (EP=1). Through deriv- ing Pareto-optimal frontiers and a hypothetical server case study, we found that the ideal energy proportional server would have EP=1.35. To identify challenges and opportuni- ties for realizing super energy proportional servers, we use a combination of statistical regression and empirical measurements. Despite the fact that processors account for the largest fraction of power consumption, improvements to all other components (mem- ory, storage, network, motherboard, etc.) will now contribute as much as the processor to overall server energy proportionality improvements. Specifically, advances in idle power modes and scaling techniques will contribute the greatest to future EP growth. 104 Chapter 5 Managing Super Energy Proportional Servers 5.1 Introduction In the previous chapter, we identified how potential super energy proportional servers may look like in the future. In this section, we will leverage the unique properties of these super proportional servers to identify opportunities to efficiently managing these super energy proportional servers. One unique property is that the peak energy efficiency point of the server no longer resides at peak utilization. Figure 5.1 shows the energy efficiency of two highly pro- portional server normalized to the energy efficiency of the server at 100% utilization. The energy efficiency curves are derived from two servers studied in the previous chap- ter: figure 4.3a and 4.3b. We are plotting these curves on a single graph to provide a comparison. The solid red line shows the efficiency curve for a server with EP of 1.05 (figure 4.3a) and an EP of 1.20 (figure 4.3b). What’s interesting to note is that as energy proportionality improves, the utilization at which peak energy efficiency (e.g. ssj ops/watt for SPECpower) occurs decreases. The server with EP of 1.05 has peak efficiency at 60% utilization, while the server with EP of 1.20 has peak efficiency at 50% utilization. What is most drastic is that at 50% utilization, the energy efficiency exhibited is 60% better than that when the server is 100% utilized. This contradicts the widely held assumption that peak energy efficiency occurs at peak utilization. 
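This shift in the peak-efficiency point is easy to see numerically. The short sketch below uses an assumed utilization-versus-power curve for a highly proportional server (the numbers are illustrative and are not the measured curves behind figure 5.1) and locates the utilization at which efficiency, defined as load served per watt, peaks.

import numpy as np

# Assumed utilization (%) vs. normalized power points for a highly proportional
# server; the numbers are made up for illustration, shaped so that peak
# efficiency lands near 60% utilization.
util  = np.array([0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100])
power = np.array([0.05, 0.09, 0.145, 0.20, 0.26, 0.32, 0.38, 0.50, 0.64, 0.81, 1.00])

u = np.linspace(1, 100, 1000)        # skip 0% so efficiency is well defined
p = np.interp(u, util, power)        # interpolated power curve
eff = (u / 100.0) / p                # load served per unit of (normalized) power
eff = eff / eff[-1]                  # normalize to the efficiency at 100% utilization

peak = u[np.argmax(eff)]
print(f"peak efficiency at ~{peak:.0f}% utilization, {eff.max():.2f}x the 100%-load efficiency")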
Figure 5.1: Energy efficiency of future high energy proportional servers (efficiency normalized to 100% utilization for servers with EP = 1.05 and EP = 1.20).

In this chapter, we propose techniques that leverage this property to enable peak energy efficiency in cluster management. Specifically, we will:

- Motivate the need for an efficiency-aware scheduling policy through idealized analytical models. We show there are limitations for Uniform and Packing load balancing, where no one policy can dominate across all utilizations. We propose Peak Efficiency Scheduling, which enables clusters to sustain maximum energy efficiency across all utilization levels.

- Explore how various load balancing techniques (Uniform, Packing, and Peak Efficiency Scheduling) are affected by realistic server conditions. We observe experimentally the phenomenon, described in chapter 3, where Packing scheduling can never achieve ideal energy proportionality under realistic server conditions. Furthermore, we demonstrate how Peak Efficiency Scheduling can achieve better energy efficiency than alternative load balancing techniques across all cluster utilizations and server wakeup delays.

- Finally, explore the effect of super energy proportional servers on data center power capping. Specifically, we make a case that Peak Efficiency Scheduling can be used to maximize compute capacity under power capping scenarios.

5.2 Best-case Cluster-wide Energy Proportionality Analysis

In chapter 3, we demonstrated experimentally that uniform load balancing may offer greater cluster-wide energy proportionality than cluster-level packing techniques. One limitation of uniform load balancing is that packing load balancing still provides lower power consumption in very low utilization (<30%) regions. This presents an opportunity to further improve the energy efficiency of load balancing policies. In order to motivate the need for an efficiency-aware load balancing policy, we first provide an analytical model of best-case cluster-wide energy proportionality for uniform and packing load balancing to show their limitations.

In this section, we analytically model the best-case cluster-wide energy proportionality curve, which is defined as the idealized, theoretically achievable cluster-wide energy proportionality if there are no power mode transition penalties or workload migration penalties. Since the load balancing cases covered in chapter 3 are simple, we are able to reason about the best-case cluster-wide energy proportionality curve. We define the variables in table 5.1 to aid us.

Table 5.1: Variables for best-case cluster-wide energy proportionality analysis
  f()  Server utilization vs. power curve
  g()  Cluster-wide utilization vs. power curve
  x    Cluster-wide utilization
  N    Number of servers in the cluster
  P    Power at 100% utilization, f(100)
  U    Peak-efficiency utilization

f() and g() represent the utilization vs. power curve for an individual server and the entire cluster, respectively. Recall that the energy proportionality curve uses a utilization vs. normalized power curve; therefore the server and cluster-wide energy proportionality curves are the normalizations given by f()/P and g()/(N·P), respectively, where P is the maximum power consumed by each individual server at 100% utilization, and N is the number of servers in the cluster.
For the simplest case of Uniform load balancing, all servers within the cluster operate at the same utilization as the cluster. Therefore, for Uniform load balancing, the best-case cluster-wide utilization vs. power curve is simply:

g(x) = f(x) · N    (5.1)

For the case of Packing load balancing, it is assumed that each server can be packed until peak utilization. In this scenario, we would not turn on and load a new server unless all the available servers are already completely packed. Therefore, the best-case cluster-wide utilization vs. power curve resembles steps, where each step is shaped like the individual server's utilization vs. power curve. The best-case cluster-wide utilization vs. power curve for packing load balancing is:

g(x) = ⌊(x/100)·N⌋ · P + f((x mod (100/N)) · N)    (5.2)

In this case, there are two terms; the first represents the power consumption of the complete, full steps, and the second represents the power of the current partial step. The first term captures the number of servers that are running at peak utilization within the cluster, while the second term captures the utilization of the single server that is not running at peak within the cluster. For instance, if the cluster-wide utilization is 55% and the number of servers in the cluster is 50, then the floor in the first term evaluates to 27 (27 servers are running at peak utilization) and the second term is f(50) (the last server is 50% utilized).

(a) Packing (b) Uniform (c) Peak Efficiency Scheduling
Figure 5.2: Best-case cluster-wide energy proportionality curve for various cluster-level load balancing schemes (red solid line). The blue dotted line represents ideal linear energy proportionality.

Figures 5.2a and 5.2b show the best-case cluster-wide energy proportionality curves for these two different load balancing algorithms, namely packing and uniform workload distribution. The x-axis is the cluster-wide utilization, while the y-axis is the cluster-wide power curve normalized to the power at 100% utilization, or g(x)/(N·P), which gives the cluster-wide energy proportionality curve. The solid red line represents this cluster-wide energy proportionality curve, while the dotted blue line represents the ideal linear energy proportionality curve. Using uniform load balancing (figure 5.2b), the cluster-wide EP is 1.05, equivalent to the EP of the servers that make up the cluster.

(a) Packing (b) Uniform (c) Peak Efficiency Scheduling
Figure 5.3: Cluster-wide energy efficiency curve for various cluster-level load balancing schemes (efficiency normalized to efficiency at 100% utilization).

In an idealized case of packing, only the minimum number of servers required to meet a certain load is active, with all other servers off. Therefore, the number of servers required is proportional to the utilization. As N → ∞, equation 5.2 reduces to only the first term, where g(x) = (x/100)·N·P, which represents the ideal linearly proportional curve with an EP of 1.
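To make equations 5.1 and 5.2 concrete, the sketch below evaluates both best-case curves for a small example cluster. The per-server curve f() is an assumed placeholder (any measured utilization-versus-power curve could be substituted), so the printed numbers are purely illustrative.

import math

N = 50       # servers in the cluster
P = 200.0    # per-server power at 100% utilization, f(100), in watts

def f(u):
    """Assumed per-server utilization (%) vs. power (W) curve; replace with measured data."""
    return P * (0.3 + 0.7 * (u / 100.0) ** 1.3)   # bows below the linear line (high-EP shape)

def g_uniform(x):
    """Equation 5.1: every server runs at the cluster-wide utilization x."""
    return f(x) * N

def g_packing(x):
    """Equation 5.2: fill servers to 100% before loading the next one."""
    full = math.floor(x / 100.0 * N)          # servers running at peak utilization
    partial_util = (x % (100.0 / N)) * N      # utilization of the single partial server
    return full * P + f(partial_util)

# At x = 55 and N = 50 this reproduces the example above: 27 servers at peak, one at 50%.
for x in (10, 30, 55, 80):
    print(f"{x:3d}% cluster util: uniform {g_uniform(x):7.1f} W, packing {g_packing(x):7.1f} W")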
5.3 A Case for Peak Efficiency Scheduling

The analytical model in the previous section shows that there are regions where Packing uses less power (<30% utilization) and regions where Uniform uses less power (>30% utilization). Although Uniform load balancing provides greater cluster-wide energy proportionality, it does not necessarily provide the lowest power consumption across all utilization levels. We will leverage this insight to derive a new load balancing policy for super energy proportional servers.

As shown in section 5.1, a super energy proportional server with an EP of 1.05 achieves peak efficiency at 60% utilization (U = 60), which is significantly below peak utilization. It is expected that future super energy proportional servers will have even lower peak-efficiency utilization. Rather than naively packing each server until peak utilization, or uniformly spreading out the load to all servers, it may be more efficient to pack servers only up to their peak-efficiency point. Then, once all servers are operating at their peak-efficiency point, any additional requests are issued uniformly. The intuition is to quickly get servers to the point of peak efficiency and then, once they reach that point, move away from it as slowly as possible, with the goal of staying near peak efficiency as much as possible. We call this scheduling policy Peak Efficiency Scheduling. The best-case cluster-wide utilization vs. power curve for Peak Efficiency Scheduling is given as:

g(x) = ⌊(x/U)·N⌋ · f(U) + f((x mod (U/N)) · (100N/U)),  for x < U
g(x) = f(x) · N,  for x ≥ U    (5.3)

This is a generalized version of the packing equation 5.2, where packing load balancing has U = 100. Essentially, packing load balancing provides the optimal best-case cluster-wide energy proportionality when peak efficiency coincides with peak utilization. If U = 0, then this equation reduces to the uniform load balancing equation 5.1. In this scenario, U = 0 represents the case where peak efficiency is at 0% utilization, or more concretely, a server whose efficiency remains constant at all utilizations. An example server with this property is an ideal linearly energy proportional server. Essentially, as the peak-efficiency utilization lowers, uniform load balancing becomes more favorable. Peak Efficiency Scheduling is able to capture the behavior of both packing and uniform load balancing, and therefore provides the optimal best-case cluster-wide energy proportionality for any given peak-efficiency point.

Figure 5.2c shows the best-case cluster-wide energy proportionality curve for Peak Efficiency Scheduling. This scheme achieved an EP of 1.15. Most notably, this line is completely below the ideal linear line at all times. The ideal energy proportionality line (in blue) represents the power that would be consumed if the cluster operated at the same efficiency as servers running at 100% utilization. Since Peak Efficiency Scheduling keeps servers running at the peak-efficiency utilization and achieves greater energy efficiency than at 100% utilization, it requires less power than Packing for the same cluster utilization. Therefore, the Peak Efficiency energy proportionality curve is constantly below the ideal linear curve.

Figure 5.3 shows the energy efficiency across utilization levels for all three schemes. This figure is derived from figure 5.2 by dividing utilization by power. The x-axis represents the cluster-wide utilization, while the y-axis is the cluster-wide energy efficiency normalized to the energy efficiency at 100% utilization.
The solid red line represents the efficiency of the best-case load balancing scheme, and the dotted blue line repre- sents the efficiency of an ideal linear energy proportional cluster. Most notably, the peak efficiency packing scheme is able to sustain cluster-wide efficiency that is better 112 than efficiency at peak load, across all utilization levels. This is significantly better than packing efficiency curve, which sustains the efficiency at peak utilization, and uniform, which exhibits the efficiency curve of the individual servers. 5.4 Effectiveness of Peak Efficiency Scheduling In the previous section we motivated the need for, and demonstrated the superiority of, Peak Efficiency Scheduling under idealized scenarios where servers can turn on/off instantaneously. The analytical models do not have a notion of workload requests or requests latencies, and therefore does not reflect the performance aspect of various load balancing schemes. We showed that Peak Efficiency Scheduling is promising under idealized scenarios, but we do not know if these benefits will translate to real-world scenarios. In order to explore the effectiveness of various load balancing policies under real-world scenarios, we will perform an empirical evaluation using stochastic queueing simulation in order to capture both performance and power implications. In chapter 3.2 we demonstrated the cluster-wide energy proportionality implications of uniform load balancing versus packing load balancing, but did not evaluate the performance impli- cations or explore the effect of various real-world parameters such as server transition times. In this chapter, we will supplement the results in chapter 3.2 and provide a more comprehensive evaluation. In this section, we will explore the effect of non-idealized servers, with power mode transition time, on cluster-wide energy proportionality with the goal of identifying the level of server transition time tolerable. Furthermore, we will also explore the perfor- mance implications of various load balancing schemes using a set of data center work- loads. 113 5.4.1 Experimental Methodology To evaluate the effect of server transition time on cluster-wide energy proportionality and workload latency, we will use the BigHouse simulator [57] that was also used in chapter 3.3. We model a cluster of 100 highly energy proportional servers. Each server has an EP of 1.05 and exhibits the utilization vs power curve as shown in figure 5.1. We implemented Uniform, Packing, and Peak Efficiency cluster scheduling into Big- House. For Uniform scheduling, we simply assigned incoming jobs to each server in a round robin fashion. For Packing scheduling, we direct incoming jobs to the highest utilized server with available capacity. For servers that are idle, it is placed into the off state. To handing any request spikes, we keep a single standby server on. At any one time, there should be a single idle server that is on. For Peak efficiency scheduling, we direct incoming jobs by first identifying the highest utilized server that is less than peak efficiency utilization with available capacity. If we cannot identify such a server, we then identify the lowest utilized server, with available capacity, among servers that are utilized above the peak efficiency utilization point. In Peak efficiency scheduling, we also have a single standby server to handle request spikes. We evaluate five workload distributions: Apache, DNS, Mail, Search, and Shell provided with the BigHouse simulator. 
Apache models a university depart- mental web server. DNS models a DHCP server and DNS name server as part of the Domain Name System. Mail models a university departmental email server using POP and SMTP.Shell models interactive tasks found in departmental Shell login servers. Search models the leaf node in Google’s search cluster. Initially, we will only present the results for Apache. After using Apache results to discuss our observations, we will present the full results for the remaining workloads. The general trends and obser- vations forApache also holds true for the other workloads. 114 0% 20% 40% 60% 80% 100% 120% 0% 20% 40% 60% 80% 100% % of Peak Power Utilization Uniform Packing Peak (a) 1 second 0% 20% 40% 60% 80% 100% 120% 0% 20% 40% 60% 80% 100% % of Peak Power Utilization Uniform Packing Peak (b) 10 seconds 0% 20% 40% 60% 80% 100% 120% 0% 20% 40% 60% 80% 100% % of Peak Power Utilization Uniform Packing Peak (c) 40 seconds 0% 20% 40% 60% 80% 100% 120% 0% 20% 40% 60% 80% 100% % of Peak Power Utilization Uniform Packing Peak (d) 90 seconds Figure 5.4: Energy curve for varying levels of wakeup transition times 5.4.2 Energy Impact of Realistic Load Scheduling Figure 5.4 shows the utilization vs power results of our studied load balancing schemes with the Apache workload. We explore server wakeup times of 1 second, 10 sec- onds, 40 seconds, and 90 seconds. For very low server wakeup delays (figure 5.4a), we see that the utilization vs power curve resembles that of the best-case cluster-wide energy proportionality curve shown in figure 5.2. This figure demonstrates that our various load balancing implementations in BigHouse are able to achieve the expected theoretical energy proportionality curves detailed in section 5.2 and 5.3. With Uniform load balancing, as the server wakeup transition time increases, the cluster-wide energy proportionality stays the same. This behavior is inline with what is observed in prior chapters. 115 For Packing load balance, as the server wakeup transition time increases, we begin to see the cluster-wide energy proportionality begin to decrease. With a transition time of 1 second, the observed Packing energy proportionality curve represents that of the ideal linear energy proportionality curve. With increased server transition time, the observed Packing energy proportionality curve becomes superlinear. This behavior was explained in chapter 3.2 as caused by the inability to perfectly match the number of servers needed with the cluster utilization and the need for standby servers to anticipate utilization spikes. Peak efficiency scheduling provides the best energy proportionality curve regard- less of server transition time. In addition, Peak efficiency scheduling also provides the lowest power consumption at any cluster utilization. Figure 5.5 shows the cluster-wide energy proportionality for Uniform, Packing, and Peak efficiency scheduling with var- ious server wakeup transition delays. What is most notable is that both Uniform and Packing load balancing requires 0 wakeup delays in order to achieve an EP of 1.0. On the other hand, Peak efficiency load balancing is able to achieve an EP of 1.0 even with a wakeup delay of 60 seconds. Figure 5.8 shows the utilization vs power curve for the other experimental workloads with a 10 second wakeup transition penalty. From here, we can see that the utilization vs power curve is very similar to that observed in the Apache workload. 
From our observations the utilization vs power curve is fairly workload agnostic and the trends we observed in figure 5.4 applies for all other workloads. To summarize, while Uniform and Packing load balancing requires idealized servers to achieve EP of 1.0, Peak efficiency scheduling can achieve EP of 1.0 even with servers with wakeup delay of 60 seconds. In addition, Peak efficiency scheduling provides lower power consumption at any cluster utilization. 116 0 0.2 0.4 0.6 0.8 1 1.2 0 10 20 30 40 50 60 70 80 90 Energy Proportionailty Wakeup Delay (s) Uniform Packing Peak Figure 5.5: Effect of Transition time on Cluster-wide EP 5.4.3 Efficiency Impact of Realistic Load Scheduling Figure 5.6 shows the energy efficiency curve of the Apache workload under various load scheduling policies and varying server wakeup transition times. The x-axis shows the cluster-wide utilization, while the y-axis shows the energy efficiency (defined as requests per second / Watt) normalized to the energy efficiency at 100% utilization. For a low wakeup transition penalty of 1 second, the observed energy efficiency curve is similar to the theoretical curves in figure 5.3. Under idealized conditions, we observed that Packing load balancing is able to provide a near consistent energy effi- ciency across all utilization levels. Uniform load balancing is able to provide significant higher energy efficiency at higher utilization levels (40% to 100%) compared to Packing. Surprisingly, with higher transition penalties (at least 40 seconds), Packing scheduling no longer beats Uniform scheduling at low utilization. In fact, Uniform only beats Pack- ing between 30% and 80% utilization. Peak efficiency scheduling remains superior to Uniform load balancing and Packing load balancing regardless of utilization and server wakeup transition penalty. Specifi- cally, the main advantage of Peak efficiency scheduling occurs in the lower utilization regions (0% to 60%) where the observed efficiency is significantly better than Uniform 117 0 0.2 0.4 0.6 0.8 1 1.2 1.4 0% 20% 40% 60% 80% 100% Eff. Norm. to Eff. @ 100% Utilization Uniform Packing Peak (a) 1 second 0 0.2 0.4 0.6 0.8 1 1.2 1.4 0% 20% 40% 60% 80% 100% Eff. Norm. to Eff. @ 100% Utilization Uniform Packing Peak (b) 10 seconds 0 0.2 0.4 0.6 0.8 1 1.2 1.4 0% 20% 40% 60% 80% 100% Eff. Norm. to Eff. @ 100% Utilization Uniform Packing Peak (c) 40 seconds 0 0.2 0.4 0.6 0.8 1 1.2 1.4 0% 20% 40% 60% 80% 100% Eff. Norm. to Eff. @ 100% Utilization Uniform Packing Peak (d) 90 seconds Figure 5.6: Efficiency curve for varying levels of wakeup transition times and Packing. Recall, that Peak efficiency scheduling packs requests during cluster uti- lization lower than the peak efficiency utilization, and uniformly distributes requests during cluster utilization higher than the peak efficiency utilization. Therefore, for uti- lizations higher than 60%, Peak efficiency scheduling performs similar to Uniform load balancing. Note that as server wakeup transition time increases, the achievable energy efficiency actually decreases. In Peak efficiency scheduling, with low wakeup times, servers are able to sleep and take advantage of idle periods. But with higher wakeup times, servers are less likely to sleep due to the time penalty of having to wakeup the servers. Fur- thermore, as the cluster is waiting for a server to turn on, the active servers absorb the incoming requests and would run at a higher (and less efficient) utilization. 
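The scheduling behavior discussed in these experiments comes down to a per-job server-selection rule, as described in section 5.4.1. The sketch below is a simplified, illustrative rendering of that selection logic; it is not the BigHouse implementation, and the Server class, its fields, and the omitted standby-server and wakeup handling are assumptions.

from dataclasses import dataclass

PEAK_EFF_UTIL = 60.0   # peak-efficiency utilization U, in percent

@dataclass
class Server:
    utilization: float      # current utilization, in percent
    has_capacity: bool      # True if the server can accept another job

def pick_uniform(servers, rr_counter):
    # Round-robin assignment across all servers.
    return servers[rr_counter % len(servers)]

def pick_packing(servers):
    # Highest-utilized server that still has room for the job.
    candidates = [s for s in servers if s.has_capacity]
    return max(candidates, key=lambda s: s.utilization)

def pick_peak_efficiency(servers):
    # First try the highest-utilized server still below the peak-efficiency point;
    # otherwise pick the least-utilized server among those already above it.
    below = [s for s in servers if s.has_capacity and s.utilization < PEAK_EFF_UTIL]
    if below:
        return max(below, key=lambda s: s.utilization)
    above = [s for s in servers if s.has_capacity]
    return min(above, key=lambda s: s.utilization)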
118 0 0.1 0.2 0.3 0.4 0.5 0.6 0% 20% 40% 60% 80% 100% 95%tile Latency (s) Utilization Uniform Packing Peak (a) 1 second 0 1 2 3 4 5 0% 20% 40% 60% 80% 100% 95%tile Latency (s) Utilization Uniform Packing Peak (b) 10 seconds 0 1 2 3 4 5 6 7 0% 20% 40% 60% 80% 100% 95%tile Latency (s) Utilization Uniform Packing Peak (c) 40 seconds 0 1 2 3 4 5 6 7 8 9 0% 20% 40% 60% 80% 100% 95%tile Latency (s) Utilization Uniform Packing Peak (d) 90 seconds Figure 5.7: Latency curve for varying levels of wakeup transition times 5.4.4 Tail Latency Impact of Realistic Load Scheduling Figure 5.7 shows the 95th percentile tail latency for theApache workload. The x-axis shows the cluster-wide utilization and the y-axis shows the 95th percentile tail latency in seconds. Under ideal scenarios, the tail latency remains relatively unchanged with 0.3 seconds response time from 0% to 90% utilization. As the wakeup transition time increases, we begin to see a larger 95th percentile tail latency at high utilization regions. For example, with 10 second wakeup penalty, we see a 95th percentile tail latency of 4.6 seconds; with 40 second wakeup penalty, we see a 95th percentile tail latency of 6.3 seconds. In addition to the greater 95th percentile tail latency, the utilization where the tail latency spikes also occurs earlier. The “knee” of the curve occurs at 90% with 10 119 0% 20% 40% 60% 80% 100% 120% 0% 20% 40% 60% 80% 100% % of Peak Power Utilization Uniform Packing Peak (a)DNS 0% 20% 40% 60% 80% 100% 120% 0% 20% 40% 60% 80% 100% % of Peak Power Utilization Uniform Packing Peak (b)Mail 0% 20% 40% 60% 80% 100% 120% 0% 20% 40% 60% 80% 100% % of Peak Power Utilization Uniform Packing Peak (c)Search 0% 20% 40% 60% 80% 100% 120% 0% 20% 40% 60% 80% 100% % of Peak Power Utilization Uniform Packing Peak (d)Shell Figure 5.8: Power curve for varying levels of wakeup transition times second wakeup transition time, at 80% with 40 second wakeup transition time, and at 60% with 90 second wakeup transition time. Figure 5.9 shows the 95th percentile tail latency for the other experimental work- loads with a 10 second wakeup transition penalty. Unlike the power curve, the latency curves are workload dependent. The general trends observed forApache can also be seen here. In general, workloads all tend to have longer tail latencies as wakeup transi- tion time increases, and the “knee” at where the tail latency spikes also decreases. The Search workload is the most tail latency sensitive, and therefore, experiences a tail latency spike beginning at 70% utilization. 120 0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 0% 20% 40% 60% 80% 100% 95%tile Latency (s) Utilization Uniform Packing Peak (a)DNS 0 1 2 3 4 5 6 0% 20% 40% 60% 80% 100% 95%tile Latency (s) Utilization Uniform Packing Peak (b)Mail 0 0.05 0.1 0.15 0.2 0.25 0% 20% 40% 60% 80% 100% 95%tile Latency (s) Utilization Uniform Packing Peak (c)Search 0 0.5 1 1.5 2 2.5 3 0% 20% 40% 60% 80% 100% 95%tile Latency (s) Utilization Uniform Packing Peak (d)Shell Figure 5.9: Latency curve for varying levels of wakeup transition times 5.5 A case for maximizing compute capacity under power capping In the prior section, we demonstrated that super energy proportional servers can be man- aged in a way where we can sustain high energy efficiency across all utilizations. Super energy proportional server properties can impact more than just workload scheduling schemes. In this section, we will explore the effect of super energy proportional servers on data center power capping. 
5.5 A case for maximizing compute capacity under power capping

In the prior section, we demonstrated that super energy proportional servers can be managed in a way that sustains high energy efficiency across all utilizations. Super energy proportional server properties can impact more than just workload scheduling schemes. In this section, we explore the effect of super energy proportional servers on data center power capping. Specifically, we make a case that Peak Efficiency scheduling can be used to maximize compute capacity under power capping scenarios.

Power capping is a technique that is commonly utilized in over-provisioned data centers. Many data center servers typically run in the low-to-mid utilization region and therefore rarely consume power equivalent to their nameplate power. In the past, data centers would be provisioned based on the nameplate power, which resulted in significant under-utilization of the power budget. To combat this, data centers began to over-provision servers, that is, to pack in more servers than can be supported at peak. Under common conditions, over-provisioning allows data centers to have a larger compute capacity. Under power emergency scenarios, where servers are all running at peak and could violate the data center power budget, power capping is enforced. The goal is to avoid violating the data center power budget by capping the power of data center servers, while still maintaining quality-of-service requirements.

In this section, we explore, under ideal conditions, how various load balancing techniques affect the compute capacity of the data center at various power capping levels. We model a cluster of 100 servers, with each server exhibiting the power profile shown in figure 5.1 and an EP of 1.05. We define the cluster's compute capacity as the requests per second that the cluster can handle. We sweep through a range of cluster power budgets from 0% to 100%. The power budget is normalized to the peak power consumption of all the servers in the cluster. For example, with a cluster of 100 servers each consuming 100W at 100% load, the cluster's peak power is 10,000W. With a power budget of 50%, only 5,000W of power is available to the cluster. Under Uniform load balancing, this power budget is distributed equally to each server, so each server can only consume 50% of its peak power. Under Packing load balancing, a 50% power budget translates to 50% of the servers running at maximum, while the other 50% of the servers are off.

Figure 5.10 shows the cluster's compute capacity normalized to the compute capacity when all servers are running at peak utilization with an unrestricted power budget. The x-axis shows the power budget available to the cluster, the y-axis shows the cluster utilization, and the coloring shows the compute capacity normalized to the compute capacity at peak utilization with an unrestricted power budget.

[Figure 5.10: Compute Capacity for various Cluster-level Load Balancing Schemes. Panels: (a) Packing, (b) Uniform, (c) Peak Efficiency. Axes: Power Budget vs. Utilization; shading: normalized compute capacity in 20% bands from 0% up to 140%.]
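The budget arithmetic above can be mechanized in a short sketch. This is illustrative only: the per-server power curve, idle power, and helper names are assumptions rather than the measured profile of figure 5.1, although, like a super energy proportional server, the placeholder curve reaches its best requests-per-Watt below 100% utilization. Capacity is reported in server-equivalents (number of active servers times their utilization).

```python
# Sketch of cluster compute capacity under a power budget for the three
# load scheduling schemes. The power model is a placeholder, NOT the
# measured profile of figure 5.1.
N = 100            # servers in the cluster
PEAK_W = 100.0     # per-server power at 100% utilization
IDLE_W = 10.0      # assumed idle power (placeholder)

def power(util):
    """Placeholder per-server power (W) at utilization in [0, 1]."""
    return IDLE_W + (PEAK_W - IDLE_W) * util ** 1.5   # mildly convex curve

def util_at_power(cap_w):
    """Highest utilization a server can sustain under a per-server power cap."""
    lo, hi = 0.0, 1.0
    for _ in range(50):                    # simple bisection on power()
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if power(mid) <= cap_w else (lo, mid)
    return lo

def peak_efficiency_util(steps=1000):
    """Utilization that maximizes utilization per Watt for power()."""
    return max((i / steps for i in range(1, steps + 1)),
               key=lambda u: u / power(u))

def capacity_uniform(budget_frac):
    """All N servers on, budget split evenly among them."""
    return N * util_at_power(budget_frac * PEAK_W)

def capacity_packing(budget_frac):
    """Run floor(budget / peak power) servers at 100%; the rest are off."""
    return budget_frac * N * PEAK_W // PEAK_W

def capacity_peak_efficiency(budget_frac):
    """Pack servers at the peak-efficiency point while the budget is tight;
    once every server can run at that point, spread the rest uniformly."""
    u_star = peak_efficiency_util()
    budget_w = budget_frac * N * PEAK_W
    if budget_w < N * power(u_star):
        return (budget_w // power(u_star)) * u_star
    return N * util_at_power(budget_w / N)

# 50% power budget: Packing powers 50 servers at peak; Uniform and Peak
# Efficiency do somewhat better because of the convex power curve.
for f in (capacity_packing, capacity_uniform, capacity_peak_efficiency):
    print(f.__name__, round(f(0.5), 1))
```

The three capacity functions mirror the three panels of figure 5.10; with the dissertation's measured super energy proportional profile the gaps between them would differ, but the budget-to-capacity translation is the same.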
For Packing load scheduling, the compute capacity remains the same across all cluster utilization levels at a given power budget, because all active servers run at peak (or near-peak) utilization. In this scenario, the cluster's compute capacity is determined solely by the power budget. For Uniform load scheduling, the compute capacity is determined by both the available power budget and the utilization of the cluster. Under this scheme, the optimal compute capacity occurs when servers are running at ~60%, where the server's peak efficiency occurs. Furthermore, within this operating range, the compute capacity actually exceeds the compute capacity of running at peak utilization. Unfortunately, for cluster utilization less than 40%, Packing load scheduling outperforms Uniform load scheduling.

Figure 5.10c shows the compute capacity using Peak Efficiency Scheduling. Most notably, Peak Efficiency Scheduling is able to provide greater compute capacity than Packing across all utilization levels and power budgets. In certain scenarios, Peak Efficiency Scheduling can achieve over 20% more compute capacity at the same power budget. This implies that the active servers are running at optimal efficiency, requiring fewer active servers than the Packing technique. Compared to Uniform scheduling, Peak Efficiency Scheduling provides a larger utilization "sweet spot": Uniform scheduling achieves its best compute capacity in the 50% to 80% utilization range, whereas Peak Efficiency Scheduling achieves its best compute capacity between 10% and 80%. Clearly, the varying behaviors of different cluster load scheduling schemes make a significant difference in the available compute capacity.

5.6 Related Work

Data center servers typically consume less than peak power due to low utilization. When data centers are provisioned based on the nameplate power of servers, there exists a significant gap between the actually consumed power and the available data center power. Many data centers are now over-provisioned, that is, they contain more servers than can be supported at peak utilization. This allows data centers to increase their compute capacity under typical operating conditions, when the data center has enough power to supply all of these servers; but during peak utilization periods, there may not be enough data center power for all of these servers. Power capping and power shaving are utilized to limit server power consumption during such power emergencies.

Power capping can be achieved through DVFS [19, 46], thread packing [16], power gating [51], and turning servers off [24]. Under a power cap, DVFS-based techniques [19, 46] throttle the processor in order to lower power consumption at the cost of increased latency. Thread packing techniques such as Pack & Cap [16] augment DVFS by packing workload threads onto a variable number of cores, enabling idle cores to enter low-power sleep states. PGCapping [51] leverages per-core power gating, which lowers static energy, to create energy headroom to overclock active cores, maximizing performance under a chip-level power cap. These techniques typically rely only on single-component low-power modes to meet a power cap. In contrast, our proposed technique relies on server-level energy consumption profiles.

5.7 Conclusion

In this chapter we demonstrated the opportunities made possible by super energy proportional servers. We showed through analytical modeling the limitations of Uniform and Packing load balancing. Using these insights, we proposed Peak Efficiency Scheduling, which enables data center clusters to operate, across all utilization levels, at a higher efficiency than peak-utilization efficiency. We then demonstrated that this can also be applied to power capping scenarios in order to maximize the compute capacity of the cluster.
The scenarios presented in this chapter demonstrate the promise of energy efficiency techniques that leverage super energy proportional servers.

Chapter 6 Conclusions and Future Work

This dissertation presents a holistic approach towards tackling the challenges of data center energy proportionality. As a first step it proposes a set of metrics to measure energy proportionality. It then uses these metrics to perform characterization studies on several hundred servers from the SPECpower benchmark results. Through these studies, we are able to gain critical insight into the energy proportionality of data center hardware as well as the energy proportionality of existing state-of-the-art energy efficiency techniques. Using these insights, we propose energy proportionality techniques at the server level (KnightShift) and the cluster level (Peak Efficiency scheduling) for future multi-core servers.

In chapter 2, we first presented metrics that allow us to reason about energy proportionality. Using insights derived from these metrics, we proposed the KnightShift server architecture, which enables high energy proportional servers through the introduction of a tightly-coupled Knight node. KnightShift was the first server to demonstrate that achieving ideal energy proportionality is possible.

In chapter 3, we explored the implications of emerging high energy proportional servers on existing cluster-level energy proportionality techniques. We found that as the energy proportionality of servers continues to improve, existing cluster-level energy proportionality techniques, such as dynamic cluster resizing, may actually hinder cluster-wide energy proportionality. Instead, we advocate that simply using uniform load balancing may provide better cluster-wide energy proportionality, a finding that goes against conventional wisdom. We then showed that KnightShift can be combined with uniform load balancing to provide superior cluster-wide energy proportionality.

In chapter 4, we took a retrospective look back at the energy proportional revolution to identify the major drivers of energy proportionality and how processor low-power advancements affect server energy proportionality. Through these insights, we were able to derive, using Pareto-optimal frontier analysis, what future super energy proportional servers may look like. Specifically, we found that these super energy proportional servers can achieve peak efficiency at non-peak utilization.

In chapter 5, we explored how to best manage super energy proportional servers to achieve peak cluster efficiency. We proposed a Peak Efficiency scheduling technique that enables clusters to consistently run at a greater efficiency than alternative schedulers across all cluster utilizations and realistic server wakeup delays. As an example application, we demonstrated the usefulness of Peak Efficiency Scheduling to maximize compute capacity under power capping scenarios.

Throughout the course of this dissertation, we have seen energy proportionality improve significantly, to the point where we now see commercially available servers with near ideal energy proportionality. Despite this, energy proportionality is not yet a solved issue. As we demonstrated, high energy proportional servers present challenges and opportunities. We demonstrated, under ideal conditions, a theoretically optimal solution for dealing with high energy proportional servers at the cluster level.
In the real world, there would be many challenges to achieving these ideal results, such as server migration periods, server power state transition periods, etc. Therefore, there exists significant potential for future work in realizing cluster-management policies for super energy proportional servers under realistic scenarios.

Furthermore, the state of data center hardware is evolving rapidly and growing more heterogeneous. This heterogeneous data center hardware presents challenges and opportunities for improving energy proportionality in a heterogeneous server cluster. For example, ARM-based servers are now being commercialized. At the time of our study, these servers did not exist. Many of these ARM processors are heterogeneous, with both a big and a little core. Orthogonal to ARM-based servers, many-core accelerators are making an entrance into commercial data centers. GPGPUs are the most common many-core accelerators, and it is well known that GPGPUs can consume significant power. GPGPUs and future many-core accelerators will present a new challenge for data center energy proportionality. In order to efficiently utilize GPGPUs in data centers, there still exist significant challenges in efficiently mapping enterprise workloads to GPGPUs, as well as in managing GPGPU clusters in an energy efficient manner.

Besides GPGPUs, systems on chip (SoCs) will introduce significant heterogeneous accelerators into data centers. SoCs will present unprecedented heterogeneity to data centers, and a major challenge will be how to best and most efficiently utilize this emerging server hardware.

Cloud computing is here to stay. As more and more of our lives are tied to the cloud, and more and more industries rely on the cloud, there will always be a need to continue to scale the capacity of data centers. A major impediment to the sustained growth of these data centers will be the energy consumption of the underlying hardware. This dissertation presents comprehensive solutions to provide energy efficiency during the multi-core and many-core era. As hardware continues to evolve with the emergence of SoCs and heterogeneous servers, there will be a need to develop future energy efficiency techniques to continue the scaling of cloud computing services.

Bibliography

[1] North American data center market trends 2013-2014 report. Technical report.
[2] D. Abts, M. Marty, P. Wells, P. Klausler, and H. Liu. Energy proportional datacenter networks. In Proceedings of the International Symposium on Computer Architecture, pages 338–347, 2010. ISCA '10, June 19–23, 2010, Saint-Malo, France.
[3] Y. Agarwal, S. Hodges, R. Chandra, J. Scott, P. Bahl, and R. Gupta. Somniloquy: augmenting network interfaces to reduce PC energy usage. In NSDI '09: Proceedings of the 6th USENIX Symposium on Networked Systems Design and Implementation, Apr. 2009.
[4] H. Amur and K. Schwan. Achieving power-efficiency in clusters without distributed file system complexity. In ISCA '10: Proceedings of the 2010 International Symposium on Computer Architecture, pages 222–232, June 2010.
[5] V. Anagnostopoulou, S. Biswas, H. Saadeldeen, A. Savage, R. Bianchini, T. Yang, D. Franklin, and F. T. Chong. Barely alive memory servers: Keeping data active in a low-power state. J. Emerg. Technol. Comput. Syst.
[6] V. Anagnostopoulou, S. Biswas, A. Savage, R. Bianchini, T. Yang, and F. Chong. Energy conservation in datacenters through cluster memory management and barely-alive memory servers. In WEED '09: Workshop on Energy-Efficient Design, 2009.
[7] D. G. Andersen, J. Franklin, M. Kaminsky, A. Phanishayee, L. Tan, and V. Vasudevan. FAWN: a fast array of wimpy nodes. In SOSP '09: Proceedings of the 22nd Symposium on Operating Systems Principles, Oct. 2009.
[8] M. Annavaram, E. Grochowski, and J. Shen. Mitigating Amdahl's law through EPI throttling. In ISCA '05: Proceedings of the 32nd International Symposium on Computer Architecture, pages 298–309, 2005.
[9] L. A. Barroso, J. Clidaras, and U. Hölzle. The datacenter as a computer: An introduction to the design of warehouse-scale machines, second edition. Synthesis Lectures on Computer Architecture, 8(3):1–154, 2013.
[10] L. A. Barroso and U. Hölzle. The case for energy-proportional computing. Computer, 40(12):33–37, Dec. 2007.
[11] I. Bhati, Z. Chishti, and B. Jacob. Coordinated refresh: Energy efficient techniques for DRAM refresh scheduling. In Low Power Electronics and Design (ISLPED), 2013 IEEE International Symposium on, 2013.
[12] J. S. Chase, D. C. Anderson, P. N. Thakar, A. M. Vahdat, and R. P. Doyle. Managing energy and server resources in hosting centers. In SOSP '01: Proceedings of the 18th Symposium on Operating Systems Principles, Dec. 2001.
[13] G. Chen, W. He, J. Liu, S. Nath, L. Rigas, L. Xiao, and F. Zhao. Energy-aware server provisioning and load dispatching for connection-intensive internet services. In NSDI '08: Proceedings of the 5th USENIX Symposium on Networked Systems Design and Implementation, Apr. 2008.
[14] L. Chen and T. Pinkston. NoRD: Node-router decoupling for effective power-gating of on-chip routers. In 45th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO) 2012, pages 270–281, 2012.
[15] C. Clark, K. Fraser, S. Hand, J. G. Hansen, E. Jul, C. Limpach, I. Pratt, and A. Warfield. Live migration of virtual machines. In NSDI '05: Proceedings of the 2nd Symposium on Networked Systems Design & Implementation, 2005.
[16] R. Cochran, C. Hankendi, A. K. Coskun, and S. Reda. Pack & Cap: Adaptive DVFS and Thread Packing Under Power Caps. In Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-44, 2011.
[17] J. Cohen, P. Cohen, S. G. West, and L. S. Aiken. Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences. Routledge, 3rd edition, 2002.
[18] Dallal, Gerard. Why p=0.05?, May 2012. http://www.jerrydallal.com/LHSP/p05.htm.
[19] J. D. Davis, S. Rivoire, and M. Goldszmidt. Star-cap: Cluster power management using software-only models. Technical Report MSR-TR-2012-107, Microsoft Research, October 2012.
[20] Q. Deng, D. Meisner, L. Ramos, T. F. Wenisch, and R. Bianchini. MemScale: active low-power modes for main memory. In ASPLOS '11: Proceedings of the 16th International Conference on Architectural Support for Programming Languages and Operating Systems, Mar. 2011.
[21] M. Elnozahy, M. Kistler, and R. Rajamony. Energy conservation policies for web servers. In Proceedings of the 4th USENIX Symposium on Internet Technologies and Systems - Volume 4, USITS '03, 2003.
[22] X. Fan, W.-D. Weber, and L. A. Barroso. Power provisioning for a warehouse-sized computer. In ISCA '07: Proceedings of the 34th International Symposium on Computer Architecture, pages 13–23, 2007.
[23] M. Ferdman, A. Adileh, O. Kocberber, S. Volos, M. Alisafaee, D. Jevdjic, C. Kaynak, A. D. Popescu, A. Ailamaki, and B. Falsafi. Clearing the clouds: a study of emerging scale-out workloads on modern hardware. In ASPLOS '12: Proceedings of the 17th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 37–48, Mar. 2012.
[24] A. Gandhi, M. Harchol-Balter, R. Das, J. Kephart, and C. Lefurgy. Power capping via forced idleness. In WEED '09, 2009.
[25] A. Gandhi, M. Harchol-Balter, R. Raghunathan, and M. A. Kozuch. AutoScale: Dynamic, robust capacity management for multi-tier data centers. ACM Trans. Comput. Syst., 30(4), Nov. 2012.
[26] S. Ghiasi. Aide de camp: asymmetric multi-core design for dynamic thermal management. PhD thesis, 2004.
[27] D. Gmach, J. Rolia, L. Cherkasova, and A. Kemper. Workload Analysis and Demand Prediction of Enterprise Data Center Applications. In Workload Characterization 2007, IISWC 2007, IEEE 10th International Symposium on, 2007.
[28] R. A. Gordon. Regression Analysis for the Social Sciences. Routledge, 1st edition, 2010.
[29] E. Grochowski, R. Ronen, J. Shen, and H. Wang. Best of both latency and throughput. In Proceedings of the International Conference on Computer Design, pages 236–243, 2004.
[30] S. Herbert and D. Marculescu. Analysis of dynamic voltage/frequency scaling in chip-multiprocessors. In Proceedings of the 2007 International Symposium on Low Power Electronics and Design, ISLPED '07, pages 38–43, New York, NY, USA, 2007. ACM.
[31] U. Hoelzle and L. A. Barroso. The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines. Morgan and Claypool Publishers, 1st edition, 2009.
[32] C.-H. Hsu and S. W. Poole. Power signature analysis of the SPECpower_ssj2008 benchmark. In Performance Analysis of Systems and Software (ISPASS), 2011 IEEE International Symposium on, 2011.
[33] C.-H. Hsu and S. W. Poole. Measuring Server Energy Proportionality. In Proceedings of the 6th ACM/SPEC International Conference on Performance Engineering, ICPE '15, 2015.
[34] http://httpd.apache.org/docs/2.0/programs/ab.html. ab - apache http server benchmarking tool.
[35] http://perspectives.mvdirona.com. Cost of power in large-scale data centers.
[36] https://www.wattsupmeters.com/secure/index.php. Watts up? power meter.
[37] http://www8.hp.com/us/en/products/servers/moonshot/. HP Moonshot system.
[38] http://www.cpubenchmark.net/. PassMark CPU benchmark.
[39] http://www.fit-pc.com/web/fit pc/. fit-pc2.
[40] http://www.wikibench.eu. Wikibench.
[41] Z. Hu, A. Buyuktosunoglu, V. Srinivasan, V. Zyuban, H. Jacobson, and P. Bose. Microarchitectural techniques for power gating of execution units. In Low Power Electronics and Design, 2004. ISLPED '04. Proceedings of the 2004 International Symposium on, pages 32–37, 2004.
[42] IBM SPSS Statistics. Version 22.0. IBM Corp, Armonk, NY, 2013.
[43] S. Kanev, K. Hazelwood, G.-Y. Wei, and D. Brooks. Tradeoffs between power management and tail latency in warehouse-scale applications. In 2014 IEEE International Symposium on Workload Characterization (IISWC), 2014.
[44] J. Koomey. Growth in data center electricity use 2005 to 2010. Technical report, August 2011.
[45] R. Kumar, K. I. Farkas, N. P. Jouppi, P. Ranganathan, and D. M. Tullsen. Single-ISA Heterogeneous Multi-Core Architectures: The Potential for Processor Power Reduction. In MICRO 36: Proceedings of the 36th International Symposium on Microarchitecture, pages 81–92, Dec. 2003.
[46] C. Lefurgy, X. Wang, and M. Ware. Power capping: a prelude to power shifting. Cluster Computing, 11(2):183–195, June 2008.
[47] D. Lo, L. Cheng, R. Govindaraju, L. A. Barroso, and C. Kozyrakis. Towards energy proportionality for large-scale latency-critical workloads. 2014.
[48] D. Lo and C. Kozyrakis. Dynamic management of turbomode in modern multi-core chips. In Proceedings of the 19th IEEE International Symposium on High Performance Computer Architecture, HPCA-19 '14, 2014.
[49] P. Lotfi-Kamran, B. Grot, M. Ferdman, S. Volos, O. Kocberber, J. Picorel, A. Adileh, D. Jevdjic, S. Idgunji, E. Ozer, and B. Falsafi. Scale-out processors. In ISCA '12: Proceedings of the 39th International Symposium on Computer Architecture, pages 500–511, June 2012.
[50] A. Lungu, P. Bose, A. Buyuktosunoglu, and D. J. Sorin. Dynamic power gating with quality guarantees. In Proceedings of the 14th ACM/IEEE International Symposium on Low Power Electronics and Design, ISLPED '09, 2009.
[51] K. Ma and X. Wang. PGCapping: Exploiting Power Gating for Power Capping and Core Lifetime Balancing in CMPs. In Proceedings of the 21st International Conference on Parallel Architectures and Compilation Techniques, PACT '12, 2012.
[52] K. T. Malladi, B. C. Lee, F. A. Nothaft, C. Kozyrakis, K. Periyathambi, and M. Horowitz. Towards energy-proportional datacenter memory with mobile DRAM. In ISCA '12: Proceedings of the 39th International Symposium on Computer Architecture, pages 37–48, June 2012.
[53] K. T. Malladi, B. C. Lee, F. A. Nothaft, C. Kozyrakis, K. Periyathambi, and M. Horowitz. Towards energy-proportional datacenter memory with mobile DRAM. In ISCA '12: Proceedings of the 39th International Symposium on Computer Architecture, June 2012.
[54] K. T. Malladi, I. Shaeffer, L. Gopalakrishnan, D. Lo, B. C. Lee, and M. Horowitz. Rethinking DRAM Power Modes for Energy Proportionality. In Proceedings of the 45th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-45 '12, 2012.
[55] D. Meisner, B. T. Gold, and T. F. Wenisch. PowerNap: eliminating server idle power. In ASPLOS '09: Proceedings of the 14th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 205–216, Feb. 2009.
[56] D. Meisner, C. M. Sadler, L. A. Barroso, W.-D. Weber, and T. F. Wenisch. Power management of online data-intensive services. In ISCA '11: Proceedings of the 38th International Symposium on Computer Architecture, pages 319–330, June 2011.
[57] D. Meisner and T. F. Wenisch. Does low-power design imply energy efficiency for data centers? In Low Power Electronics and Design (ISLPED), 2011 International Symposium on, pages 109–114, Aug. 2011.
[58] D. Meisner and T. F. Wenisch. DreamWeaver: architectural support for deep sleep. In ASPLOS '12: Proceedings of the 17th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 313–324, Mar. 2012.
[59] D. Meisner, J. Wu, and T. Wenisch. BigHouse: A simulation infrastructure for data center systems. In International Symposium on Performance Analysis of Systems and Software (ISPASS), pages 35–45, April 2012.
[60] P. Ranganathan, P. Leech, D. Irwin, and J. Chase. Ensemble-level Power Management for Dense Blade Servers. In ISCA '06: Proceedings of the 33rd International Symposium on Computer Architecture, pages 66–77, June 2006.
[61] V. J. Reddi, B. C. Lee, T. Chilimbi, and K. Vaid. Web search using mobile cores: quantifying and mitigating the price of efficiency. In ISCA '10: Proceedings of the 37th International Symposium on Computer Architecture, pages 314–325, June 2010.
[62] B. Rountree, D. H. Ahn, B. R. de Supinski, D. K. Lowenthal, and M. Schulz. Beyond DVFS: A First Look at Performance under a Hardware-Enforced Power Bound. In Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW), 2012 IEEE 26th International, 2012.
[63] F. Ryckbosch, S. Polfliet, and L. Eeckhout. Trends in Server Energy Proportionality. Computer, 44(9):69–72, 2011.
[64] Samsung Semiconductor. Samsung green DDR3, April 2011. http://www.samsung.com/global/business/semiconductor/file/media/green_ddr3_april_11-1.pdf.
[65] W. Saunders. Energy proportionality trends. http://communities.intel.com/community/datastack, 2012.
[66] G. Semeraro, D. H. Albonesi, S. G. Dropsho, G. Magklis, S. Dwarkadas, and M. L. Scott. Dynamic frequency and voltage control for a multiple clock domain microarchitecture. In MICRO 35: Proceedings of the 35th International Symposium on Microarchitecture, Nov. 2002.
[67] E. Seo, S. Y. Park, and B. Urgaonkar. Empirical analysis on energy efficiency of flash-based SSDs. In Proceedings of the 2008 Conference on Power Aware Computing and Systems, HotPower '08, 2008.
[68] R. Singhal. Inside Intel Core Microarchitecture (Nehalem). In Hot Chips 20: A Symposium on High Performance Chips, 2008.
[69] R. Sohan, A. Rice, A. Moore, and K. Mansley. Characterizing 10 Gbps network interface energy consumption. In Local Computer Networks (LCN), 2010 IEEE 35th Conference on, Oct. 2010.
[70] B. Subramaniam and W.-c. Feng. Towards energy-proportional computing for enterprise-class server workloads. In ICPE '13: Proceedings of the 4th ACM/SPEC International Conference on Performance Engineering, 2013.
[71] C. Subramanian, A. Vasan, and A. Sivasubramaniam. Reducing data center power with server consolidation: Approximation and evaluation. In High Performance Computing (HiPC), 2010 International Conference on, 2010.
[72] N. Tolia, Z. Wang, M. Marwah, C. Bash, P. Ranganathan, and X. Zhu. Delivering energy proportionality with non energy-proportional systems: optimizing the ensemble. In HotPower '08: Proceedings of the 2008 Conference on Power Aware Computing and Systems, Dec. 2008.
[73] V. Tseng. Measuring the power consumption in a server at component level. Technical report, Hogeschool van Amsterdam, Software Improvement Group, October 2012.
[74] D. Tsirogiannis, S. Harizopoulos, and M. A. Shah. Analyzing the energy efficiency of a database server. In SIGMOD '10: Proceedings of the 2010 International Conference on Management of Data, June 2010.
[75] G. Urdaneta, G. Pierre, and M. van Steen. Wikipedia workload analysis for decentralized hosting. Computer Networks: The International Journal of Computer and Telecommunications Networking, 53(11), July 2009.
[76] B. Urgaonkar, P. Shenoy, A. Chandra, and P. Goyal. Dynamic provisioning of multi-tier internet applications. In Autonomic Computing, 2005. ICAC 2005. Proceedings. Second International Conference on, 2005.
[77] G. Varsamopoulos and S. K. S. Gupta. Energy proportionality and the future: Metrics and directions. In ICPP Workshops, 2010.
[78] A. Vega, A. Buyuktosunoglu, H. Hanson, P. Bose, and S. Ramani. Crank it up or dial it down: coordinated multiprocessor frequency and folding control. In MICRO-46: Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture, 2013.
[79] D. Wong and M. Annavaram. Evaluating a prototype KnightShift-enabled server. In WEED '12: Workshop on Energy-Efficient Design, 2012.
[80] D. Wong and M. Annavaram. KnightShift: Scaling the energy proportionality wall through server-level heterogeneity. In Proceedings of the 45th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-45 '12, 2012.
[81] D. Wong and M. Annavaram. Scalable System-level Active Low-Power Mode with Bounded Latency. Technical Report CENG-2012-5, University of Southern California, 2012.
[82] D. Wong and M. Annavaram. Implications of high energy proportional servers on cluster-wide energy proportionality. In Proceedings of the 19th IEEE International Symposium on High Performance Computer Architecture, HPCA-19 '14, 2014.
[83] www.spec.org/power_ssj2008/. SPECpower_ssj2008.
[84] W. Zheng, A. P. Centeno, F. Chong, and R. Bianchini. LogStore: toward energy-proportional storage servers. In Proceedings of the 2012 ACM/IEEE International Symposium on Low Power Electronics and Design, ISLPED '12, 2012.