DYNAMIC VOLTAGE AND FREQUENCY SCALING FOR ENERGY-EFFICIENT SYSTEM DESIGN

by

Kihwan Choi

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL ENGINEERING)

May 2005

Copyright 2005 Kihwan Choi

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

UMI Number: 3180330. Copyright 2005 by Choi, Kihwan. All rights reserved. UMI Microform 3180330, copyright 2005 by ProQuest Information and Learning Company, 300 North Zeeb Road, P.O. Box 1346, Ann Arbor, MI 48106-1346.

Dedication

To my parents, all my family, my wife Quiyeon, and my two lovely kids Seungpil and Seungseo.

Acknowledgements

First, I thank my advisor, Professor Massoud Pedram, for his exceptional support during my PhD program at USC. It has been an invaluable opportunity for me to work with him during the past five years, and nothing could have been accomplished without his encouragement.
I also thank Professor Namgoong Won and Professor Roger Zimmermann for serving on my thesis committee. Besides my thesis committee members, I would like to thank Professor Sandeep K. Gupta and Professor Timothy M. Pinkston, who asked me good questions at my qualifying exam. I would also like to thank all my colleagues in the SPORT research group and all my friends here in Los Angeles and in Korea for their friendship. Last, I thank my company, Samsung Electronics, for giving me the opportunity to study abroad.

Kihwan Choi
USC, May 2005

Table of Contents

Dedication
Acknowledgements
List of Tables
List of Figures
Abstract

CHAPTER I  INTRODUCTION
  I.1 Motivation for energy-efficient system
  I.2 Overview of the dissertation

CHAPTER II  DYNAMIC VOLTAGE AND FREQUENCY SCALING WITH WORKLOAD DECOMPOSITION
  II.1 Dynamic voltage and frequency scaling
    II.1.1 Basic concept
    II.1.2 Implementation of DVFS functionality
    II.1.3 Workload estimation/prediction
  II.2 Overview of previous works
    II.2.1 Classification of DVFS approaches
    II.2.2 Real-time vs. non-real-time
    II.2.3 Inter-task vs. intra-task
    II.2.4 Policy determination: off-line vs. on-line
    II.2.5 Discussions
  II.3 Workload decomposition

CHAPTER III  FINE-GRAINED DYNAMIC VOLTAGE AND FREQUENCY SCALING BASED ON WORKLOAD DECOMPOSITION
  III.1 Introduction
  III.2 Related works
  III.3 Performance-energy trade-offs
    III.3.1 Workload partitioning
    III.3.2 Performance degradation and energy saving
    III.3.3 Scaling granularity
    III.3.4 Events monitored through the PMU on XScale
  III.4 Regression-based fine-grained DVFS
    III.4.1 Calculating β with a regression equation
    III.4.2 Prediction error adjustment
  III.5 Implementation
  III.6 Experimental results
  III.7 Conclusions

CHAPTER IV  OFF-CHIP LATENCY-DRIVEN DYNAMIC VOLTAGE AND FREQUENCY SCALING FOR MPEG DECODING
  IV.1 Introduction
  IV.2 Related works
  IV.3 MPEG decoding
  IV.4 Proposed DVFS policy for MPEG decoding
  IV.5 Experimental results
  IV.6 Conclusions

CHAPTER V  DYNAMIC VOLTAGE AND FREQUENCY SCALING FOR THE SYSTEM ENERGY REDUCTION
  V.1 Introduction
  V.2 Related works
  V.3 DVFS for the system energy reduction
    V.3.1 Modeling the system power consumption
    V.3.2 System energy vs. CPU frequency
  V.4 Description of the target system
    V.4.1 BitsyX platform
    V.4.2 Execution time model in BitsyX
    V.4.3 Energy consumption model for BitsyX
  V.5 Proposed DVFS policy
    V.5.1 Scaling granularity
    V.5.2 Calculating the average on-chip CPI
    V.5.3 Determining the optimal frequency setting
  V.6 Experimental results
  V.7 Conclusions

CHAPTER VI  CONCLUSIONS AND FUTURE RESEARCH DIRECTION

Bibliography

List of Tables

Table II-1: CPU energy saving with workload decomposition
Table III-1: Statistics for the β value seen for different applications at a CPU clock frequency of 733 MHz
Table III-2: Apollo Testbed II (AT2) system components
Table III-3: Intel 80200 XScale processor configuration
Table III-4: Frequency and voltage levels in the system
Table III-5: Summary of test applications
Table III-6: Calculated CPU frequency for test applications after profiling
Table IV-1: The ratio of T_VAR and T_off of each frame type in each video clip
Table IV-2: CPU energy saving comparison - OL: OL-DVFS, CON: CON-DVFS (numbers in parentheses are for (6))
Table V-1: Intel PXA255 processor configuration
Table V-2: Frequency combinations in the BitsyX system
Table V-3: Definition of used terms
Table V-4: Extracted parameters for system energy estimation

List of Figures

Figure I-1: Transistor density according to technology scaling
Figure I-2: Processor frequency trend
Figure I-3: Power consumption trend
Figure II-1: An illustration of the DVFS technique
Figure II-2: Block diagram of DVFS implementation
Figure II-3: CPU usage of MP3 and MPEG
Figure II-4: Classification of previous DVFS approaches
Figure II-5: Execution time changes according to CPU frequency
Figure II-6: More energy saving by workload decomposition
Figure III-1: DVFS with detailed knowledge of subtasks and their relative order and workload requirement (scenario I) and without this information (scenario II)
Figure III-2: Performance loss changes according to CPU frequency
Figure III-3: Contour plots of CPI_avg versus MPI_avg for different CPU clock frequencies: (a) "fgrep" and (b) "gzip"
Figure III-4: Variation in β value of applications: (a) "math" and (b) "gzip"
Figure III-5: Compensating for the error due to misprediction of β
Figure III-6: Main board with the CPU, memory, and memory controller
Figure III-7: Data acquisition system
Figure III-8: The structure of the power plane of the XScale board
Figure III-9: Software architecture of our DVFS implementation
Figure III-10: Performance loss with different target values
Figure III-11: CPU power consumption with/without DVFS
Figure III-12: CPU energy saving for various application programs
Figure III-13: Comparison of dynamic (the proposed work) vs. static approach (profiling)
Figure III-14: Energy saving comparison according to scaling granularity (1 msec, 5 msec, 20 msec, 50 msec, and OS quantum): (a) "gzip" and (b) "djpeg"
Figure IV-1: MPEG decoding sequence
Figure IV-2: Decoding time variation as a function of the CPU clock frequency
Figure IV-3: Contour plots of T_VAR versus INSTR for different CPU clock frequencies
Figure IV-4: Decoding time and power consumption at different CPU frequencies and voltage levels
Figure IV-5: CPU power consumption on the AT2 platform when running a video clip
Figure IV-6: CPU energy savings using the proposed DVFS
Figure IV-7: CPU energy savings with off-chip latency separation during T_VAR
Figure IV-8: Frame rate variation with the proposed DVFS
Figure V-1: System power breakdown
Figure V-2: System power consumption during task execution
Figure V-3: Clock distribution diagram in the PXA255 processor
Figure V-4: BitsyX system
Figure V-5: Execution time variation over different frequency combinations
Figure V-6: Execution time estimation error
Figure V-7: Energy consumption over different frequency combinations
Figure V-8: Total system power consumption during execution
Figure V-9: Accuracy of the proposed models for power consumption and execution time
Figure V-10: Contour plots of CPI_avg vs. SPI_avg for different clock frequency combinations: (a) "gzip" (b) "qsort" (c) "djpeg" (d) "math"
Figure V-11: SPI_avg extraction using DPI
Figure V-12: Actual performance: CE-DVFS and SE-DVFS
Figure V-13: System energy saving: SE-DVFS and CE-DVFS
Figure V-14: Actual power consumption of two DVFS methods
Figure V-15: System energy difference: SE-DVFS vs. CE-DVFS

Abstract

Demand for low power consumption in battery-powered computer systems has risen sharply, because extending the service lifetime of these systems by reducing their power dissipation is a key customer requirement.
Dynamic voltage and frequency scaling (DVFS) techniques have proven to be highly effective in achieving low power consumption in various computer systems. The key idea behind the DVFS technique is to adaptively scale the supply voltage level of the CPU so as to provide "just-enough" circuit speed to process the system workload while meeting total computation time and/or throughput constraints, thereby reducing the energy dissipation. This thesis presents intra-process DVFS techniques targeted toward both non-real-time and real-time applications running on embedded system platforms. To increase the CPU energy saving achieved by DVFS, a technique called "workload decomposition" is proposed, whereby the workload of a target program is decomposed into two parts: on-chip and off-chip. The on-chip workload signifies the CPU clock cycles that are required to execute instructions inside the CPU, whereas the off-chip workload captures the number of external memory access clock cycles that are required to perform external memory transactions. When combined with a DVFS technique to minimize the energy consumption, this workload decomposition method results in higher energy savings for memory-intensive applications. Note that the on-chip and off-chip workloads cause different amounts of power dissipation, because the on-chip workload is processed by the CPU whereas the off-chip workload engages other system components such as the memory. Therefore, if the task workload is decomposed into its on-chip and off-chip components, the variation of the system energy with the CPU frequency can be predicted very accurately, which enables the development of a more effective DVFS approach for system energy reduction.
The proposed techniques have been implemented on two real computing systems: (1) the XScale-based embedded system platform built at USC and (2) the PXA255-processor-based BitsyX system from ADS Inc. Energy savings with the proposed DVFS policies have been obtained by performing current measurements on real hardware.

CHAPTER I
INTRODUCTION

I.1 MOTIVATION FOR ENERGY-EFFICIENT SYSTEM

In the last four decades, computer systems, especially microprocessors, have improved exponentially in terms of productivity and performance. This has been achieved by a combination of advances in process technology and improvements in micro-architecture and design technologies. The traditional objectives of computer system design until the 1980s were to reduce silicon area and to increase system performance. As a result, the integrated circuit design focus has been on adding more functionality with higher transistor counts and improving performance by using higher clock frequencies. Technology scaling was the major force behind this trend and is still continuing. Driven by advances in the fabrication process and the minimum feature size of CMOS devices, these functionality and performance improvements have resulted in the introduction of a new technology generation every two to three years, an industry-wide historic trend commonly referred to as "Moore's Law". Each new generation has approximately doubled the logic circuit density and increased performance by about 40%. Figure I-1 shows the projected minimum gate length and transistor density based on the report from the International Technology Roadmap for Semiconductors (ITRS) [1].
As seen in this figure, a single chip with billions of transistors will emerge in the near future due to aggressive technology scaling.

[Figure I-1: Transistor density according to technology scaling]

Figure I-2 shows the evolution of the operating clock frequency for Intel microprocessors and the expected CPU clock frequencies in the future [1][70]. Microprocessors with clock frequencies higher than 10 GHz will become a reality in the near future.

[Figure I-2: Processor frequency trend (ITRS 2004 projection and Pentium-family data points, 1985-2015)]

However, pursuing higher clock frequencies and higher transistor densities inevitably increases the power consumption of microprocessors. Since the mid-1990s, power consumption has become a major concern for microprocessor as well as IC design, as portable electronic applications such as mobile phones, laptop computers, and personal digital assistants, which tend to be used mostly as battery-powered devices, have emerged as key consumer products. Low power design is a critical design consideration even in high-end computer systems, where the expensive cooling and packaging costs and the lower reliability often associated with high levels of on-chip power dissipation are important concerns. According to the ITRS report, as shown in Figure I-3, the power of a high-performance CPU is projected to
180 c 160 140 120 2002 2003 2004 2005 2006 2007 2008 2009 Year Figure 1-3 : Power consumption trend In recent years, the design trade-off of performance versus power consumption has received large attention because of: (i) the large number of systems that need to provide services with the energy provided by a battery of limited weight and size, (ii) the limitation on high-performance computation because of heat dissipation issues, and (iii) concerns about dependability of systems operating at high substrate temperatures, which are in turn caused by high power dissipation. Thus, various techniques have been developed in order to reduce on-chip power dissipations. These techniques include attempts are to minimize the circuit activity where a R ep ro d u ced with p erm issio n o f th e copyrigh t ow n er. Further reproduction prohibited w ithout p erm ission . 5 circuit function is eliminated when not needed, minimize the number of logic gates where simpler circuits are used instead of complex ones, and minimize the operating clock/voltage where power is reduced at the cost of speed degradation. Among these approaches, dynamic voltage and frequency scaling (DVFS) technique has proven to be an especially effective method of achieving low power consumption while meeting user-specified performance requirements. The key idea behind DVFS techniques is to dynamically scale the supply voltage level of the CPU so as to provide “just-enough” circuit speed to process the system workload while meeting the total compute time and/or throughput constraints, and thereby, reduce the energy dissipation. Commercial examples of such processors include the Intel’s XScale-based CPUs [47][48], Transmeta’s Crusoe [73], and AMD’s K6-2+ [15]. Judicious use of these processors in the designs can greatly reduce the energy consumption of the system. Standard DVFS techniques attempt to minimize the time the processor spends running the operating system idle loop. 
Offline profiling or compiler analysis can determine the optimal CPU frequency for an application or application phase. A programmer can insert operations to change frequency directly into application code, or the operating system can perform the operations during process scheduling. This dissertation is focused on the development of effective DVFS techniques to improve energy saving compared to the previous DVFS approaches. The proposed technique, called workload decomposition, exploits the asynchrony between the R ep ro d u ced with p erm issio n o f th e copyrigh t ow n er. Further reproduction prohibited w ithout p erm ission . 6 CPU and other peripheral devices in a computing system. DVFS techniques combined with workload decomposition could enable significantly higher energy saving in the CPU as well as the whole system, resulting in longer system lifetime in mobile applications and lower cooling cost in high performance applications. 1.2 OVERVIEW OF THE DISSERTATION This section includes an overview of the proposed DVFS techniques to reduce the CPU energy and the system energy. In Chapter II, a general overview of the DVFS techniques is provided, including a description of the basic concepts behind DVFS, discussion of issues related to implementing a DVFS capability such as the required hardware support and scaling overhead, and a classification of previous DVFS approaches. This is followed by a brief discussion of the workload decomposition approach. In Chapter III, an intra-process DVFS technique targeted toward non real-time applications running on an embedded system platform is presented. The key idea is to make use of runtime information about the external memory access statistics in order to perform CPU voltage and frequency scaling with the goal of minimizing the energy consumption while translucently controlling the performance penalty. 
CPU and other peripheral devices in a computing system. DVFS techniques combined with workload decomposition can enable significantly higher energy savings in the CPU as well as in the whole system, resulting in longer system lifetime in mobile applications and lower cooling cost in high-performance applications.

I.2 OVERVIEW OF THE DISSERTATION

This section gives an overview of the proposed DVFS techniques for reducing the CPU energy and the system energy. In Chapter II, a general overview of DVFS techniques is provided, including a description of the basic concepts behind DVFS, a discussion of issues related to implementing a DVFS capability, such as the required hardware support and the scaling overhead, and a classification of previous DVFS approaches. This is followed by a brief discussion of the workload decomposition approach. In Chapter III, an intra-process DVFS technique targeted toward non-real-time applications running on an embedded system platform is presented. The key idea is to make use of runtime information about the external memory access statistics in order to perform CPU voltage and frequency scaling with the goal of minimizing the energy consumption while transparently controlling the performance penalty. The proposed DVFS technique relies on dynamically constructed regression models that allow the CPU to calculate the expected workload and slack time for
the next time slot, and thus adjust its voltage and frequency to save energy while meeting soft timing constraints. This is in turn achieved by estimating and exploiting the ratio of the total off-chip access time to the total on-chip computation time. The proposed technique has been implemented on an XScale-based embedded system platform, and actual energy savings have been calculated through current measurements in hardware. For memory-bound programs, a CPU energy saving of more than 70% with a performance degradation of 12% was achieved. For CPU-bound programs, a 15-60% CPU energy saving was achieved at the cost of a 5-20% performance penalty.

In Chapter IV, a DVFS technique for MPEG decoding is presented that reduces the energy consumption using computational workload decomposition, where the workload needed to decode a frame is separated into on-chip and off-chip components. The execution time required for the on-chip workload is CPU frequency-dependent, whereas the off-chip workload execution time does not change regardless of the CPU frequency; the maximum energy saving is therefore obtained by setting the minimum frequency during the off-chip workload execution time without incurring any delay penalty. This workload decomposition is performed using the performance monitoring unit (PMU) in the XScale processor, which provides various statistics, such as cache hits/misses and CPU stalls due to data dependencies, at run time. The on-chip workload for an incoming frame is predicted using a frame-based history so that the processor voltage and frequency can be scaled to provide the exact amount of computing power needed to decode the frame. To satisfy the user-specified QoS constraint, a prediction error compensation method, called inter-frame compensation, is proposed in which the on-chip workload prediction error is diffused into subsequent frames such that the run-time frame rate changes smoothly. The proposed DVFS algorithm has been implemented on an XScale-based testbed. Detailed current measurements on this platform demonstrate significant CPU energy savings ranging from 50% to 80% depending on the video clip.

In Chapter V, a DVFS technique is presented that minimizes the total system energy consumption for performing a task while satisfying a given execution time constraint. We first show that, in order to guarantee minimum energy for task execution by using DVFS, it is essential to divide the system power into fixed, idle, and active power components. Next, we present a new DVFS technique, which considers not only the active power, but also the idle and fixed power components of the system. This is in sharp contrast to previous DVFS techniques, which only consider the active power component. The fixed plus idle components of the system power are measured by monitoring the system power when it is idle. The active component of the system power is estimated at run time by a technique known as workload decomposition, whereby the workload of a task is decomposed into on-chip and off-chip parts based on statistics reported by a performance monitoring unit (PMU). We have implemented the proposed DVFS technique on the BitsyX platform, an Intel PXA255-based platform manufactured by ADS Inc., and
These measurements show that, for a number of widely used software applications, a total system energy savings of up to 18 %, compared to a conventional DVFS technique that considers only variable power, is achieved while satisfying the user-specified timing constraints. The dissertation is concluded in Chapter VI with some remarks and an outline of future research directions. R ep ro d u ced with p erm issio n o f th e copyrigh t ow n er. Further reproduction prohibited w ithout p erm ission . 10 CHAPTER II DYNAMIC VOLTAGE AND FREQUENCY SCALING WITH WORKLOAD DECOMPOSITION 11.1 DYNAMIC VOLTAGE AND FREQUENCY SCALING 11.1.1 BASIC CONCEPT Many kinds of application programs, which may require real-time or non real-time operations, are executed on a general-purpose processor. In general, DVFS techniques are very effective in reducing the energy dissipation while meeting a performance constraint in real-time applications such as video decoding. The energy consumption per task in CMOS VLSI circuits is quadratically proportional to the supply voltage and is given by the following well-known equation [28]. R ep ro d u ced with p erm issio n o f th e copyrigh t ow n er. Further reproduction prohibited w ithout p erm ission . 1 1 E - C swi,ched ‘ V • f dk • T (II. 1) where V is the supply voltage level, CS W itc hed is the switched capacitance per clock cycle, fdk is the clock frequency, and T is the total execution time of the task. Therefore, reducing the supply voltage results in a large energy saving. Reducing the voltage/frequency level, however, slows the circuit down as: where TD is the circuit delay, Vtj, is the threshold voltage, and a is a velocity saturation index which is technology dependent [71]. Based on equations (II.l) and (II.2), there is trade-off between energy consumption and execution time and DVFS is a procedure to choose a voltage/frequency level such that energy consumption is minimized while satisfying a given timing constraint. 
Thus, a target CPU frequency, f_target, for a task with workload W and deadline D is calculated as:

f_target = W / D    (II.3)

Figure II-1 illustrates the basic concept of DVFS for real-time application scenarios, where a timing constraint, i.e., a deadline, is given for each task.

Figure II-1: An illustration of the DVFS technique (voltage versus time, showing tasks W1 and W2 and their respective deadlines).

In this figure, T2 and T4 denote the deadlines for tasks W1 and W2, respectively (in practice, these deadlines are related to the QoS requirements). W1 finishes at T1 if the CPU is operated with a supply voltage level of V1. The CPU will be idle during the remaining (slack) time, S1. To provide a precise quantitative example, let's assume that T2 − T0 = T4 − T2 = ΔT and T1 − T0 = ΔT/2; that the CPU clock frequency at V1 is f1 = n/ΔT for some integer n; and that the CPU is powered down or put into standby with zero power dissipation during the slack time. The total energy consumption of the CPU is E1 = C·V1²·f1·ΔT/2 = n·C·V1²/2, where C is the effective switched capacitance of the CPU per clock cycle. Alternatively, W1 may be executed on the CPU by using a voltage level of V2 = V1/2, and is thereby completed at T2. Assuming a first-order linear relationship between the supply voltage level and the CPU clock frequency, f2 = f1/2. In the second case, the total energy consumed by the CPU is E2 = C·V2²·f2·ΔT = n·C·V1²/8. Clearly, there is a 75% energy saving as a result of lowering the supply voltage (this saving is achieved in spite of the "perfect", i.e., immediate and with no overhead, power down of the CPU in the first case). This energy saving is achieved without sacrificing the QoS because the given deadline is met.
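The arithmetic above can be checked with a short script (a minimal sketch with C normalized to 1 and arbitrary n; the first-order linear voltage-frequency relationship is assumed):

```python
# Energy of running at voltage V and frequency f for time t: E = C * V^2 * f * t.
def cpu_energy(C, V, f, t):
    return C * V**2 * f * t

n = 1000           # cycles in task W1 (arbitrary)
dT = 1.0           # deadline interval, normalized
V1, f1 = 1.0, n / dT

# Case 1: full speed, finish at dT/2, then perfect (zero-power) shutdown.
E1 = cpu_energy(1.0, V1, f1, dT / 2)

# Case 2: halve voltage and frequency, finish exactly at the deadline.
E2 = cpu_energy(1.0, V1 / 2, f1 / 2, dT)

saving = 1 - E2 / E1
print(f"E1 = {E1:.0f}, E2 = {E2:.0f}, saving = {saving:.0%}")   # 75% saving
```

The ratio E2/E1 = 1/4 is independent of n and C, which is why the 75% figure holds for any task size under these assumptions.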
An energy saving of 89% is achieved when scaling V1 to V3 = V1/3 and f1 to f3 = f1/3 in the case of task W2. To implement DVFS in a computing system, we need a processor with variable frequency and voltage. In addition, an accurate estimate of the workload of the target program is required so that the energy saving can be maximized by removing all slack time before the deadline.

II.1.2 IMPLEMENTATION OF DVFS FUNCTIONALITY

To provide the DVFS functionality, two kinds of hardware support are required: a voltage scaling part and a frequency scaling part, as shown in Figure II-2. In the voltage scaling part, a D/A converter (DAC) is typically used to control the reference input voltage of a DC-DC converter that supplies a variable voltage to the CPU. Inputs to the D/A converter are generated using a programmable interface, either general-purpose input/output (GPIO) pins or a complex programmable logic device (CPLD). The CPU frequency is varied by writing different constant values to the PLL inside the CPU. The scaling operation is triggered at either the application or the OS level according to the DVFS policy.

Figure II-2: Block diagram of the DVFS implementation (the application or OS triggers a device driver; a programmable interface and DAC set the DC-DC converter's variable output voltage V_DD, while the PLL, fed by a crystal oscillator, produces the variable CPU clock f_cpu).

The whole scaling operation consists of changing the voltage and the frequency, and the two are performed sequentially; in other words, the voltage is changed first, followed by the frequency, or vice versa. When the CPU clock speed is changed, a minimum operating voltage level must be maintained at each frequency to avoid a system crash due to increased gate delays. Thus, the order of the scaling operation depends on whether the target frequency is above or below the current frequency.
If the target frequency is higher than the current frequency, the voltage must be scaled up before the frequency is changed; if it is lower, the frequency must be scaled down first. Either way, this ordering avoids a system crash. Many previous DVFS works assumed that the voltage/frequency of the processor can be changed instantaneously; in other words, the scaling overhead was ignored. In reality, however, it takes time to change the CPU frequency/voltage due to factors such as the internal phase-locked loop (PLL) locking time and the capacitances that exist in the voltage path. The reported overhead ranges from 6 μsec to 500 μsec depending on the system [7][47][48][59][63][64]. In [67], the amounts of energy saving achieved by DVFS systems with different numbers of voltage levels and different scaling overheads are compared. In general, the CPU is stopped and does not execute any useful instruction until the PLL is relocked. Thus, a large scaling overhead can outweigh the advantages of DVFS. The scaling time unit in a DVFS method must therefore be much larger than the scaling overhead so that the overhead becomes negligible.

II.1.3 WORKLOAD ESTIMATION/PREDICTION

Knowing the accurate amount of workload is indispensable for effective DVFS. As seen in Eq. (II.3), the target frequency may be calculated incorrectly due to an inaccurate workload estimate, which results in either a deadline miss (i.e., the processing speed is too slow) or a lost opportunity for further energy saving due to the resulting slack time (i.e., the processing speed is too fast). In hard real-time scenarios, the amount of workload for a task is known a priori, but in general this is not the case in soft real-time and non-real-time applications.
For example, MPEG decoding, which is a popular soft real-time application, shows very different workloads depending on the video stream. Figure II-3 shows the CPU usage measured on the SA1110-based system during each time interval (300 msec) for both MPEG and MP3 playback. As seen in this figure, the workload of MP3 is quite uniform, whereas the MPEG application shows significant variation in its workload. Thus, how to obtain an accurate estimate of the task workload has been a key issue for many DVFS algorithms.

Figure II-3: CPU usage of MP3 and MPEG (CPU usage in percent versus measurement number; the MPEG trace fluctuates widely while the MP3 trace is nearly flat).

II.2 OVERVIEW OF PREVIOUS WORK

II.2.1 CLASSIFICATION OF DVFS APPROACHES

Over the past few years there has been extensive work on DVFS, and the approaches can be classified into several groups based on three factors: constraint type, scaling granularity, and policy determination, as shown in Figure II-4.

Figure II-4: Classification of previous DVFS approaches (constraint type: hard real-time, soft real-time, or non-real-time; scaling granularity: inter-task or intra-task; policy determination: off-line or on-line).

II.2.2 REAL-TIME VS. NON-REAL-TIME

Based on the type of constraint imposed on the policy, a DVFS technique can be grouped into either real-time or non-real-time operation. Real-time operation is further divided into hard real-time and soft real-time according to the criticality of the constraint. In hard real-time systems, where tasks have stringent timing constraints, any deadline miss can cause catastrophic system failure, whereas a certain level of deadline misses is allowed in soft real-time operation. There have been many studies applying DVFS in hard real-time scenarios.
In multi-task DVFS algorithms for hard real-time operation, the timing constraints (arrival time and finish time) and the workload of each task are fixed and given to a task scheduler (e.g., the operating system (OS) scheduler). Based on this information, the scheduler assigns an optimal processing speed to each task so that all tasks can be finished without violating their deadlines [23][27][34][35][36][37][39][68][69][89]. Unlike hard real-time operation, in which the workloads of the tasks are given in advance, one major issue in both soft real-time and non-real-time scenarios, where there is no explicit information about deadlines and task workloads, is prediction of the future workload at runtime, which allows one to choose the minimum required voltage/frequency levels while satisfying key constraints on energy and QoS. One popular technique for estimating the task workload at runtime is the interval-based approach. In an interval-based method, the workload prediction and the scaling operation occur at a fixed-length time interval (for example, 100 msec). The defining characteristic of the interval-based scheduling algorithm is that uniform-length intervals are used to monitor the system utilization in the previous intervals, and the voltage level for the next interval is then set by extrapolation. This algorithm is effective for applications with predictable computational workloads, such as audio [7] or other digital-signal-processing-intensive applications [8]. Many DVFS approaches use this prediction method since it is simple and easy to implement. As proposed in [22] and [85], this interval-based scheduling algorithm can also be used in non-real-time operation; because there is no timing constraint, some performance degradation due to workload misprediction is allowed.
In [61], two interval-based scheduling methods, PAST and AVG, were proposed for a mobile computing system. While PAST assumes that the workloads of future intervals will be like the previous one, AVG takes the exponential moving average of the workloads of the previous intervals to determine the speed of the processor. The effectiveness of these algorithms has recently been evaluated on a pocket computer in [24]. An extension of the approach in [61] makes use of information such as recent processor utilization, predicted future behavior, estimates of workload provided by the individual tasks, etc., to determine the desired operating speed [62]. Reference [41] proposed various scaling policies for different types of workload and concluded that the energy savings depend on the workload and the hardness of the deadline constraints. In [18], an improved version of the interval-based algorithm was proposed in which the length of each time interval is not fixed but varies according to runtime statistics, so that DVFS can become more effective in terms of energy saving. Although the interval-based scheduling algorithm is simple and easy to implement, it often predicts the future workload incorrectly when a task's workload exhibits large variability [24][61][66]. To address this problem, techniques that average the workload, smoothing out a fluctuating workload profile either with an adaptive filter [80] or by buffering interval results [9], have been proposed. One typical example of a task with large workload variation is MPEG decoding, as mentioned in Section II.1.3. In MPEG decoding, because the computational workload varies greatly with the frame type, repeated mispredictions may result in a decrease in the frame rate, which in turn means a lower QoS.
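An AVG-style interval-based predictor of the kind described above can be sketched as follows (a minimal illustration, not the implementation from [61]; the frequency table, smoothing weight, and utilization history are hypothetical):

```python
# AVG-style interval-based DVFS: predict the next interval's utilization as an
# exponential moving average of past utilizations, then pick the lowest
# available frequency whose capacity covers the predicted workload.
FREQS_MHZ = [333, 400, 533, 666, 733]   # hypothetical available CPU frequencies
ALPHA = 0.5                              # hypothetical smoothing weight

def avg_predictor(utilizations):
    """Exponential moving average over per-interval CPU utilizations (0..1)."""
    ema = utilizations[0]
    for u in utilizations[1:]:
        ema = ALPHA * u + (1 - ALPHA) * ema
    return ema

def next_frequency(predicted_util, current_freq_mhz):
    """Lowest frequency delivering the predicted cycles per interval."""
    needed_mhz = predicted_util * current_freq_mhz
    for f in FREQS_MHZ:
        if f >= needed_mhz:
            return f
    return FREQS_MHZ[-1]

# Example: a steady MP3-like load of roughly 40% utilization at 733 MHz.
pred = avg_predictor([0.40, 0.42, 0.38, 0.41])
print(next_frequency(pred, 733))   # a low frequency suffices for this load
```

For a steady load like MP3 the prediction is accurate; for an MPEG-like load with large per-interval swings, the same predictor lags the true workload, which is exactly the misprediction problem noted above.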
To solve this problem, many DVFS approaches specific to MPEG decoding have been presented [11][14][33][42][65][81].

II.2.3 INTER-TASK VS. INTRA-TASK

DVFS-related works may also be divided into two categories based on the scaling granularity: coarse-grained and fine-grained. Coarse-grained voltage scaling, called inter-task voltage scaling, is performed at the OS or application level, whereas fine-grained voltage scaling, called intra-task voltage scaling, is performed at the level of individual blocks/segments of an application task or software program. Examples of coarse-grained scheduling policies are the DVFS methods presented for hard real-time applications with multiple tasks [23][26][27][34][35][36][37][39][68][69][89]. They focused on how to assign a proper speed to each task so that all tasks can be finished without violating their deadlines. More precisely, scheduling is performed at the task level by the OS so as to reduce energy consumption while meeting hard timing constraints for each task. In these inter-task voltage scaling approaches, it is assumed that the total number of CPU cycles needed to complete each task is fixed and known a priori. Only one supply voltage/frequency pair is assigned to each task, and it is not changed during the task's execution. In general, however, a task may finish earlier than its deadline because the workload of the task is specified by its worst-case execution, which often does not occur; the execution time of each task frequently deviates from its worst-case execution time by a large amount. As a result, slack time is generated and the opportunity for more energy saving is lost. Some approaches have addressed this problem in inter-task voltage scaling.
In [75] and [76], the variation in the execution time of the tasks has been exploited by allowing the processor (at run time) to lower the supply voltage such that the currently active job finishes at its deadline or at the release time of the next job. If the release times and deadlines are known a priori, then greater energy savings can be obtained by dynamically varying the speed of the processor to exploit the execution time variation of each task. Intra-task DVFS has been proposed as a solution to overcome the limitations of inter-task voltage scheduling. Intra-task DVFS exploits all the slack time arising from run-time variations among different execution paths; there is no slack time remaining when the scheduled program completes its execution, thus significantly improving energy efficiency. Typical examples of intra-task DVFS are approaches based on the interval-
In [3], [46], and [74], checkpoint-based algorithms are proposed in which the scaling points are identified off-line either by the compiler or by training runs before actual execution. In [14], an intra-task DVFS for multimedia application was proposed in which scaling is performed at each video frame based on the timing information given by the video server. In [88], an intra-task scheduling method for an embedded multiprocessor system was proposed whereby tasks are dynamically scheduled based on a predefined schedule set during the compilation step. R ep ro d u ced with p erm issio n o f th e copyrigh t ow n er. Further reproduction prohibited w ithout p erm ission . 23 II.2.4 POLICY DETERMINATION: OFF-LINE VS. ON-LINE Off-line DVFS algorithms determine a target CPU frequency for a task before mnning the task. Typical examples for off-line DVFS algorithms are inter-task voltage scheduling algorithms for hard real-time operations where the workload of a task is usually obtained off-line by either simulations or pre-profiling operation. They generally calculate a proper frequency for each task by formulating an integer linear or non-linear programming problem based on given timing constraint and workload information. For example, ILP formulations to assign proper frequency values for a set of tasks were proposed in [34][50] [72][83]. Many approaches with compiler supports may also be classified as off-line DVFS [29] [30] [58] [87]. In [29] and [30], compiler-assisted DVFS techniques were proposed, in which frequency is lowered in memory-bound region of a program with little performance degradation. On-line DVFS methods rely on the history during task execution to determine processor speed without prior knowledge about task to be performed. In these methods, the accuracy of predicted workload significantly affects the effectiveness of the DVFS methods. Interval-based DVFS methods discussed in section n.2.2 are one popular type of on-line algorithms. 
Another example is the class of hardware-supported DVFS approaches. In [40] and [51], microarchitecture-driven DVFS techniques were proposed in which cache misses drive the voltage scaling. Reference [86] used embedded hardware that can monitor dynamic events during task execution. In [82], an auxiliary hardware unit was used to detect loop-based memory-bound regions. There have also been DVFS efforts that use both on-line and off-line approaches. One particular type is based on checkpoints. In a checkpoint-based algorithm, a program is partitioned into scaling units, either uniform time slots or code blocks, and the worst-case execution time of each unit is determined off-line. Each unit is then assigned a CPU frequency based on this predetermined worst-case execution time. During runtime, the elapsed time in each scaling unit is monitored via inserted checkpoints, and a target frequency is recalculated when slack time occurs due to variation in the execution time. Examples in this category include the intra-task algorithms of Section II.2.3. Various ways to compute the scaling factors in checkpoint-based DVFS methods were presented in [58]. In [38] and [39], checkpoints are placed at equally spaced points in the worst-case execution time of a task. References [77] and [78] placed the checkpoints at selected control-flow graph edges of a task to capture the slack time from different execution paths. Some approaches use a profile database at runtime to determine the processor speed [3][4][31]. Having many checkpoints in a program allows better tracking of the runtime system state. However, each checkpoint introduces additional work for computing the scaling factor on-line, as well as transition costs. As a result, checkpoints are usually placed at highly selective branches, loops, and call sites to
minimize the overheads associated with them. Reference [2] focused on the optimal number of checkpoints by taking the checkpoint overheads into account.

II.2.5 DISCUSSION

There are two drawbacks to previously proposed DVFS algorithms. First, in most DVFS approaches, the workload of a task is represented by the number of CPU clock cycles required to complete the task, regardless of whether the workload consists mainly of CPU-bound or memory-bound instructions. The latter information is of course critical in determining the idle time of the CPU. The optimality property in the DVFS context [34][41][85][89], namely that "running at a constant speed to complete the task just at the deadline consumes the least amount of energy," is only valid for CPU-bound tasks and is not valid for memory-bound tasks. This is due to the fact that the bus cycles required to access external components are asynchronous with respect to the CPU frequency. In other words, the off-chip access time is governed by the bus clock frequency, not by the CPU frequency. Thus, a lower CPU frequency can be used during off-chip access times with little performance degradation, which in turn yields more CPU energy saving. There have been some approaches that tackle this problem [29][30][40][51][86]. Second, most past works consider only the energy consumed by the CPU. However, no computing system consists of a CPU only. The battery life of a system is therefore determined by the system's energy consumption, not just the CPU's energy consumption. While a task is running, many components in the system, such as memory and peripheral devices, also consume energy at the same time.
If the amount of power consumed by such components is comparable to the CPU power and the execution time is increased due to a lowered CPU frequency, the overall effect of DVFS can be an increase in energy consumption compared to the case where no DVFS is applied. In other words, in some cases the assumption that lower performance levels are more energy-efficient is not valid, and finishing the task earlier (than the deadline) saves more energy. Therefore, the energy consumption models used in past efforts are not accurate for prolonging the battery life. This phenomenon has recently been demonstrated in [6][17][52][54][57][84]. For example, Reference [54] argued that a practical DVFS algorithm must also consider other system-level effects, such as non-ideal battery capacity and memory behavior. In [52] and [84], a system-level energy consumption model was proposed. In this model, the system-level energy consumption per cycle does not scale quadratically with the CPU frequency, based on the experimental observation that some components in computer systems consume constant power, and some consume power that scales only with frequency (i.e., voltage). In this thesis, we propose a technique, called workload decomposition, to solve the above-mentioned problems. By decomposing the workload into on-chip and off-chip parts, it is possible to obtain the accurate execution time variation of a task as the CPU frequency varies, which is quite important for determining the optimal scaling factor for the task. Thus, workload decomposition enables more CPU energy saving in memory-bound applications when combined with previous DVFS policies, as shown in Chapters III and IV. This workload decomposition technique can also be used to reduce the system energy consumption.
On-chip and off-chip workloads are expected to consume different amounts of power, because the on-chip workload requires the CPU for its execution, while the off-chip workload engages the appropriate system component, such as memory. Hence, the system energy variation as a function of the CPU frequency can be predicted very accurately, which enables an effective DVFS approach for system energy reduction. We discuss this in Chapter V.

II.3 WORKLOAD DECOMPOSITION

Generally speaking, a software program consists of a stream of instructions to be executed. The execution time of the program can be represented in terms of the number of CPU clock cycles per instruction (CPI), the number of instructions being executed, and the CPU frequency as follows [25]:

T = ( Σ_{i=1..n} CPI_i ) / f_cpu    (II.4)

where n is the total number of instructions in the instruction stream, CPI_i is the number of CPU clock cycles for the i-th instruction, and f_cpu is the CPU frequency. If Eq. (II.4) held for all applications, whether CPU-bound or memory-bound (which is a general assumption in most previous DVFS policies), there would be an exact inverse-linear relationship between the execution time and the CPU frequency; this is not true for memory-bound applications. As a motivating example, Figure II-5 shows the different degrees of execution time increase for two applications as the CPU frequency varies, measured on our XScale-80200-processor-based testbed. In the case of "crc", lowering the frequency introduces significant performance losses, implying that this program is CPU-bound. On the contrary, "qsort" is seen to be memory-bound from the small increase in its execution time at lowered frequencies. This phenomenon is due to the asynchrony between the off-chip component accesses and the CPU.
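The inverse-linear relationship that Eq. (II.4) predicts for purely on-chip work can be checked numerically (a toy example; the instruction stream and its CPI values are made up for illustration):

```python
# Execution time per Eq. (II.4): T = sum(CPI_i) / f_cpu.
cpi = [1, 1, 3, 1, 2, 1, 4, 1]        # cycles per instruction (hypothetical)
total_cycles = sum(cpi)               # 14 cycles in this toy stream

for f_cpu_mhz in (733.0, 366.5):
    t_us = total_cycles / f_cpu_mhz   # time in microseconds (MHz = cycles/us)
    print(f"{f_cpu_mhz} MHz -> {t_us:.4f} us")

# Halving f_cpu exactly doubles T: the inverse-linear scaling that Eq. (II.4)
# predicts, and that breaks down once off-chip stall time enters the picture.
```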
The program execution can be viewed as the execution of each instruction included in the binary code of the program. Some instructions, such as ALU operations, are performed inside the CPU, while others require off-chip accesses to external memory or peripheral devices. During an off-chip access, which is asynchronous with respect to the CPU clock, the CPU stalls until the requested memory transactions are completed. Furthermore, the off-chip access time is solely determined by the external access clock cycle, not by the CPU clock cycle. Considering this fact, it is obvious that Eq. (II.4) does not hold for memory-intensive applications in which frequent memory accesses occur. For example, when a D-cache miss occurs, the CPU waits without executing any instruction until the requested data is available from the external SDRAM. Based on these observations, we found that, for the same timing constraint, a lower CPU clock frequency can be applied to memory-bound programs than to CPU-bound programs. This in turn results in a higher relative energy saving when DVFS is applied to memory-bound programs.

Figure II-5: Execution time changes according to CPU frequency (733 down to 333 MHz) for the memory-bound "qsort" and the CPU-bound "crc" benchmarks.

To illustrate the key point of workload decomposition for system energy reduction, we define two different types of workload: on-chip and off-chip.

Definition 1-1: On-chip workload, W_on, is the number of CPU clock cycles required to perform the set of on-chip instructions, which are executed inside the CPU only.
The execution time required to finish W_on, denoted T_on, varies depending on the CPU frequency, f_cpu, and is calculated as T_on = W_on / f_cpu.

Definition 1-2: Off-chip workload, W_off, is the number of external clock cycles needed to perform the set of off-chip accesses. Note that the CPU stalls until the external memory transactions are completed.

The execution time required to finish W_off, denoted T_off, depends on the external memory clock frequency, f_ext, and is calculated as T_off = W_off / f_ext. From these two definitions, the execution time, T, of a task is calculated as:

T = T_on + T_off    (II.5)

Notice that this breakdown of the total execution time is not exact when the target processor supports out-of-order execution, whereby instructions after the instruction that caused an off-chip access may be executed during the off-chip access. In such a case, T_on and T_off can overlap. In practice, however, the error introduced in this way is quite small, considering that the memory access time is about two orders of magnitude greater than the instruction execution time. Therefore, out-of-order execution does not cause a large error in Eq. (II.5).

When the CPU frequency is changed for executing a task, the variation in the execution time depends solely on W_on, i.e., on T_on, as in Eq. (II.5), because f_ext is independent of f_cpu and is not scaled:

ΔT / Δf_cpu = ΔT_on / Δf_cpu + ΔT_off / Δf_cpu = ΔT_on / Δf_cpu, since ΔT_off / Δf_cpu = 0    (II.6)

Then, a target CPU frequency, f_target, for a task with workload decomposition is calculated as:

f_target = W_on / (D − T_off)    (II.7)

Comparing Eq. (II.3) with Eq. (II.7), we conclude that the target CPU frequency for a task with workload decomposition is always less than in the case without decomposition, which can be proved by subtracting Eq. (II.3) from Eq. (II.7). Thus, more CPU energy saving can be achieved by applying workload decomposition.
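The gap between Eq. (II.3) and Eq. (II.7) can be illustrated with a short sketch (the workload numbers and clock rates are hypothetical; W in Eq. (II.3) is taken as the total CPU cycles the task occupies at the current frequency, stall cycles included):

```python
# Target frequency without decomposition (Eq. II.3): f = W / D
# Target frequency with decomposition    (Eq. II.7): f = W_on / (D - T_off)
W_on  = 50e6          # on-chip CPU cycles (hypothetical)
W_off = 1e6           # external-bus cycles (hypothetical)
f_ext = 100e6         # external memory clock, 100 MHz (hypothetical)
f_cur = 733e6         # current CPU clock at which W was measured
D     = 0.5           # deadline in seconds

T_off = W_off / f_ext                  # 0.01 s spent in off-chip accesses
W     = W_on + T_off * f_cur           # total CPU cycles incl. stall cycles

f_naive  = W / D                       # Eq. (II.3): every cycle counted as CPU work
f_decomp = W_on / (D - T_off)          # Eq. (II.7): off-chip time discounted

print(f"naive:      {f_naive / 1e6:.1f} MHz")
print(f"decomposed: {f_decomp / 1e6:.1f} MHz")
assert f_decomp < f_naive              # decomposition always permits a lower clock
```

The more memory-bound the task (larger T_off relative to D), the wider this gap, and hence the larger the extra energy saving from decomposition.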
Figure II-6 shows how more energy saving is possible by using workload decomposition. We consider four tasks, A, B, C, and D, each with its own amounts of on-chip and off-chip workload. Task A is the most CPU-bound among the tasks, whereas task D is the most memory-bound. The execution time of every task is 10 at the maximum CPU frequency, as shown in Figure II-6(a). Figure II-6(b) shows the execution time of each task when the CPU frequency is reduced to half the maximum frequency: only the on-chip portion of the execution time doubles, so the tasks finish at 20, 18, 15, and 12, respectively. Assuming that the deadline is 20, there is no slack for task A, whereas tasks B, C, and D have slack of 2, 5, and 8, respectively. Detailed energy and delay calculations are summarized in Table II-1. From these results, it is seen that a DVFS method becomes more effective with workload decomposition, as expected. For example, in the case of task D, 14.44% more energy saving was possible by decomposing the workload.

Figure II-6: More energy saving by workload decomposition ((a) on-chip (white) and off-chip (shaded) workloads of tasks A through D at the maximum frequency f_max; (b) the same tasks at f = f_max/2 against a deadline of 20).

Table II-1: CPU energy saving with workload decomposition

task | exec. time @f_max | exec. time @f_max/2 | delay (%) | energy saving (%) | slack time | more saving with decomposition (%)
  A  |        10         |         20          |    100    |       75.00       |     0      |   0
  B  |        10         |         18          |     80    |       77.50       |     2      |   6.70
  C  |        10         |         15          |     50    |       81.25       |     5      |  13.19
  D  |        10         |         12          |     20    |       85.00       |     8      |  14.44

The next problem to be solved is how to decompose the workload of a task.
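The "energy saving" column of Table II-1 can be reproduced with the simple first-order model E ∝ V²·f·T from Eq. (II.1), with both V and f halved (a sketch under exactly that assumption):

```python
# Energy relative to full speed, per Eq. (II.1): E/E0 = (V/V0)^2 * (f/f0) * (T/T0).
T0 = 10                                              # execution time at f_max
exec_half = {"A": 20, "B": 18, "C": 15, "D": 12}     # times at f_max/2 (Table II-1)

def saving_pct(T, T0=T0, v_ratio=0.5, f_ratio=0.5):
    """Energy saving (%) versus running at full voltage and frequency."""
    return 100 * (1 - v_ratio**2 * f_ratio * (T / T0))

for task, T in exec_half.items():
    print(f"task {task}: energy saving = {saving_pct(T):.2f}%")
# task A: 75.00%, B: 77.50%, C: 81.25%, D: 85.00% -- matching Table II-1.
```

Note how the more memory-bound tasks show larger savings even at the same halved voltage, because their execution time grows less than linearly when the CPU clock is lowered.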
In general, it is very difficult to obtain the exact W_on and W_off of a program in a static manner, e.g., at compilation time, because the on-chip and off-chip latencies are strongly affected by the dynamic behavior of the program, such as cache statistics and the different access overheads of different external devices. These unpredictable dynamic behaviors should therefore be captured at run time. This can be achieved by using a performance monitoring unit (PMU), which is often available in modern microprocessors [47][48]. For example, the PMU in Intel's XScale-80200 processor [47] supports monitoring of 20 performance events, including cache hits/misses, TLB hits/misses, and the number of executed instructions. By using these dynamic events from the PMU, the task workload can be effectively separated into W_on and W_off, as will be shown in the following chapters.

CHAPTER III

FINE-GRAINED DYNAMIC VOLTAGE AND FREQUENCY SCALING BASED ON WORKLOAD DECOMPOSITION

III.1 INTRODUCTION

We are interested in a DVFS policy for general-purpose computer systems that differentiates between CPU-bound and memory-bound instructions in the workload. The intuition behind this workload partitioning is as follows. Memory is asynchronous with the processor and often has its own clock. If the task execution time is dominated by the memory access time, then the CPU speed can be slowed down with little impact on the total execution time, which can result in potentially significant savings in energy consumption. In this chapter, we propose an intra-process DVFS technique for non-real-time operation in which a finely tunable energy-performance trade-off can be achieved. The main idea is to lower the CPU frequency during the CPU idle times,
which are, in turn, due to external memory stalls. To capture the CPU idle time at run time, several performance monitoring events, provided by the performance monitoring unit (PMU) in the XScale processor, are used. The proposed DVFS technique relies on dynamically-constructed regression models that allow the CPU to calculate the expected workload and slack time for the next time slot, and thus adjust its voltage and frequency in order to save energy while meeting soft timing constraints. This is in turn achieved by estimating and exploiting the ratio of the total off-chip access time to the total on-chip computation time. The proposed technique has been implemented on an XScale-based embedded system platform, and actual energy savings have been calculated by current measurements in hardware. For memory-bound programs, a CPU energy saving of more than 70 % with a performance degradation of 12 % was achieved. For CPU-bound programs, 15-60 % CPU energy saving was achieved at the cost of a 5-20 % performance penalty.

III.2 RELATED WORK

There are different DVFS approaches that make use of the asynchrony of memory accesses with respect to the CPU clock during task execution. In [29] and [30], compiler-assisted DVFS approaches were proposed, in which the frequency is lowered in memory-bound regions of a program with little performance degradation. There are also DVFS approaches that rely on the micro-architecture or embedded hardware, without any assistance from a compiler or a simulator, to recognize the CPU stall time. In [51], a microarchitecture-driven DVFS technique was proposed in which cache misses drive the voltage scaling. In [40], an L2-cache-miss-driven voltage scaling method was proposed. In [10], [21], and [32], the IPC (instructions per cycle) rate of a program execution was used to direct the voltage scaling.
Reference [86] proposed a voltage scaling method, called process cruise control, which uses a PMU to produce the optimal frequency and voltage levels under a given performance degradation. The PMU captures the dynamic program behavior, such as the cache hit/miss ratio and memory access counts, during the whole execution time. In particular, the authors defined optimal frequency domains in a 2-D memory-access vs. instruction-count space. This approach requires no help from off-line simulation or a compiler and relies only on dynamic event counts from the PMU. However, it is not flexible in the sense that the frequency domains are obtained through extensive experiments on micro-benchmarks for a given performance loss (set to 10 % in that work), and this performance loss is fixed for all applications. This rigid policy does not allow a precise and graceful control of the energy-performance trade-off. In this thesis, we propose a DVFS policy for non real-time applications similar to the one presented in [86]. However, in our proposed DVFS approach, we use the performance events in a different way. Furthermore, our policy enables more precise control over the energy-performance trade-off by using a regression-based method in which performance events are used to effectively recognize memory-bound regions at run time. The proposed DVFS method can easily be extended to soft real-time applications such as multimedia processing, as will be discussed in Chapter IV. In such applications, a large number of memory transactions take place, but at the same time, it is acceptable to miss the target deadline every now and then. For example, MPEG decoding, which is one of the most popular multimedia applications, requires frequent memory accesses during some of its decoding steps (e.g., dithering).
Simultaneously, it is allowed to miss the target frame rate for short periods of time (more precisely, an average, rather than the minimum, frame rate must be guaranteed). Consequently, the CPU idle time due to memory transactions can be captured by using the proposed method, and the CPU voltage and frequency scaled so as to attain a significant energy saving.

III.3 PERFORMANCE-ENERGY TRADE-OFFS

III.3.1 WORKLOAD PARTITIONING

Generally speaking, a task consists of a sequence of instructions to be performed. The execution time of a task is the sum of the latencies of all instructions in the task. Thus, the execution time is a function of the instruction mix (the sequence of unrolled instructions to be executed) and the CPI. A RISC instruction mix consists of register-type instructions, memory-type instructions, and branch-type instructions (the control instructions for supervisor mode are not considered here). After the application is compiled from the source code into the object code, the ratios between these three instruction types in the instruction mix become fixed if the control flow is known at compile time. The CPI of the instruction mix depends not only on the instruction types and the data dependencies, but also on run-time factors such as the SDRAM access latency, the PCI access latency, other running processes, etc. The instruction latencies can in turn be classified as on-chip latencies (register-type and branch-type instructions) or off-chip latencies (memory-type instructions). The on-chip latencies are caused by events that occur inside the CPU. They are synchronized to the internal clock and may be linearly reduced by increasing the CPU frequency. The off-chip latencies, on the other hand, are independent of the internal frequency and are thus not affected by changing the CPU frequency.
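The partitioning just described implies a simple first-order execution-time model: the on-chip time scales inversely with the CPU clock, while the off-chip time is fixed by the external bus clock. A minimal sketch under that model (the task parameters below are hypothetical):

```python
# Sketch of the on-chip/off-chip execution-time model described above:
# on-chip time scales with 1/f_cpu, off-chip time is fixed by the bus clock.

F_MAX = 733e6  # Hz, maximum CPU frequency on the XScale testbed

def exec_time(t_on_at_fmax, t_off, f_cpu):
    """Total execution time at frequency f_cpu (seconds)."""
    return t_on_at_fmax * (F_MAX / f_cpu) + t_off

# Hypothetical tasks: (on-chip time at f_max, off-chip time), in seconds.
workloads = [("cpu-bound", 0.9, 0.1), ("mem-bound", 0.1, 0.9)]

for name, t_on, t_off in workloads:
    t_fast = exec_time(t_on, t_off, F_MAX)
    t_slow = exec_time(t_on, t_off, F_MAX / 2)
    print(f"{name}: slowdown at f_max/2 = {100 * (t_slow - t_fast) / t_fast:.0f} %")
```

Halving the clock slows the hypothetical CPU-bound task by 90 % but the memory-bound one by only 10 %, which is exactly the asymmetry the proposed DVFS policy exploits.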
Accesses to external devices such as SDRAM and PCI peripheral devices are synchronized to the bus clock, which is independent of the CPU frequency. Using Definitions II-1 and II-2 of section II.3 in Chapter II, the amounts of on-chip (W^on) and off-chip (W^off) workload can be represented in terms of the CPI multiplied by the number of instructions being executed [25] as follows:

W^on = Σ_{i=1}^{n} CPI_i^on = n · CPI_on^avg    (III.1)

W^off = Σ_{j=1}^{m} CPI_j^off = m · CPI_off^avg    (III.2)

where n is the total number of instructions in the instruction stream, m is the number of off-chip accesses in that stream, CPI_i^on denotes the number of CPU clock cycles for the ith instruction, CPI_j^off denotes the number of memory clock cycles for the jth off-chip access, and CPI_on^avg and CPI_off^avg denote the average on-chip and off-chip CPI, respectively. The CPU frequency for a task can be calculated differently depending on the temporal distribution of W^on and W^off as well as on the values of W^on and W^off. Consider a task whose W^on comprises W_1^on and W_3^on and whose W^off comprises W_2^off and W_4^off. Furthermore, assume that the four subtasks are executed in the order shown in Figure III-1. Then, there are two different scenarios, (I) and (II), according to whether we know the complete execution sequence of W^on and W^off or not. In scenario (I), it is assumed that we know the temporal execution sequence of the subtasks inside the task, i.e., W_1^on → W_2^off → W_3^on → W_4^off, whereas this information is not available in scenario (II).
Figure III-1 : DVFS with detailed knowledge of the subtasks, their relative order, and their workload requirements (scenario I) and without this information (scenario II)

Now, the CPU frequency during W_2^off and W_4^off can be set to the minimum possible level in scenario (I), while it is not possible to assign the minimum CPU frequency during W^off in scenario (II). Thus, not surprisingly, more CPU energy can be saved in scenario (I) than in scenario (II). More precisely, the CPU clock frequencies for the two scenarios are given next:

scenario (I):  f_on^cpu = (W_1^on + W_3^on) / (D − (W_2^off + W_4^off)/f_ext),  f_off^cpu = f_min^cpu    (III.3)

scenario (II):  f_on^cpu = f_off^cpu = (W_1^on + W_3^on) / (D − (W_2^off + W_4^off)/f_ext)    (III.4)

where W_i^on (W_i^off) is the on-chip (off-chip) workload of the ith subtask, D is the deadline, f_min^cpu is the minimum CPU frequency, f_ext is the external bus clock frequency, and f_on^cpu (f_off^cpu) is the CPU frequency during the periods of time in which we are servicing on-chip (off-chip) accesses. Notice that to set the minimum frequency during off-chip accesses in scenario (I), W_2^off and W_4^off should be sufficiently large compared to the frequency and voltage scaling overhead of the actual hardware. For example, if a task results in a large number of small W^off's that are scattered over the whole execution time of the task, then the CPU frequency for such a case is calculated as in scenario (II) even when the execution sequence is known. The definition of these two scenarios is useful for MPEG decoding, as will be shown in Chapter IV, because different steps in the MPEG decoding sequence can be mapped to one of these two scenarios.
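The difference between the two scenarios can be made concrete with hypothetical workloads. In both scenarios the on-chip portion runs at the same frequency; scenario (I) additionally drops to f_min during the off-chip stalls, which is where the extra saving comes from. A sketch (workload values, and the f·V²·t energy model with V ∝ f, are assumptions for illustration):

```python
# Comparing scenarios (I) and (II) with hypothetical workloads.
# Units: cycles for workloads, Hz for clocks, seconds for the deadline.

F_EXT = 100e6    # external (memory) bus clock
F_MIN = 333e6    # minimum CPU frequency on the testbed

w_on = 200e6     # W_1^on + W_3^on, CPU cycles
w_off = 50e6     # W_2^off + W_4^off, memory-bus cycles
deadline = 1.0   # seconds

t_off = w_off / F_EXT             # off-chip time, fixed by the bus clock
f_on = w_on / (deadline - t_off)  # on-chip frequency, same in both scenarios

def cpu_energy(f, t):
    """Energy ~ f * V^2 * t with V ~ f (normalized to V = 1 at 733 MHz)."""
    v = f / 733e6
    return f * v**2 * t

e_on = cpu_energy(f_on, deadline - t_off)  # identical in both scenarios
e_off_I = cpu_energy(F_MIN, t_off)         # scenario (I): f_min during stalls
e_off_II = cpu_energy(f_on, t_off)         # scenario (II): same f throughout
print(f"f_on = {f_on/1e6:.0f} MHz")
print(f"scenario (I) energy  : {e_on + e_off_I:.3e}")
print(f"scenario (II) energy : {e_on + e_off_II:.3e}")
```

With these numbers f_on works out to 400 MHz, and scenario (I) spends less energy because 333 MHz beats 400 MHz during the 0.5 s of stalls.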
As can be seen from the above equations, Eq.(III.3) and Eq.(III.4), the target CPU frequency is closely related to the ratio of T^off to T^on of a program. Consequently, accurate calculation of T^off and T^on, i.e., of W^off and W^on, is quite important to the effectiveness of an energy reduction method using a DVFS technique.

III.3.2 PERFORMANCE DEGRADATION AND ENERGY SAVING

To perform ideal DVFS, we have to accurately predict the execution time of a task at any clock frequency. Let T, T^on, and T^off denote the total execution time of a program, the on-chip computation time, and the off-chip access time, respectively. T is obviously the sum of T^on and T^off, as shown in Eq.(II.5). The increased execution time of a program due to a lowered clock frequency represents the performance loss (PF_loss), which is defined as follows:

PF_loss = (T_fn − T_fmax) / T_fmax    (III.5)

where f_max is the maximum frequency of the CPU, f_n is a frequency lower than f_max, and T_fn and T_fmax are the total task execution times at CPU frequencies of f_n and f_max, respectively. For a given program, different ratios of T^on and T^off result in very different PF_loss values over the CPU frequencies. Figure III-2 provides energy-performance trade-offs for various applications. For example, in the case of "crc" and "djpeg", lowering
the frequency introduces significant performance loss compared to the other tasks, implying that these programs are CPU-bound (i.e., T^on >> T^off). On the contrary, "fgrep" and "qsort" are known to be memory-bound (i.e., T^on << T^off), as can be seen from the little performance degradation they suffer at lowered frequencies.

Figure III-2 : Performance loss changes according to CPU frequency ("fgrep", "qsort", "gzip", "djpeg", and "crc"; 666 MHz down to 333 MHz)

Based on these observations, we conclude that the ratio of T^on to T^off for a program is very important to the degree of energy saving and performance penalty attained by DVFS techniques. More precisely, T^on and T^off can be represented as follows using Eq.(III.1) and Eq.(III.2):

T^on = n · CPI_on^avg / f_cpu,  T^off = m · CPI_off^avg / f_ext    (III.6)

where f_cpu and f_ext denote the current clock frequency of the CPU and the clock frequency of the off-chip bus, respectively. It should be pointed out that f_ext can assume different values depending on the external device being accessed. For example, in our test system, a 100 MHz clock frequency is used for SDRAM accesses, whereas a 33 MHz clock is used for the PCI peripheral devices. Note that f_ext cannot be scaled.

Definition III-1: The β value of a program is defined as the ratio T^off/T^on for that program.

β represents the degree of potential energy saving because the larger β is, the more CPU energy saving can be achieved by a DVFS technique. Consequently, we need accurate information about β in order to sustain an effective DVFS technique. In regard to the two scenarios described in Chapter II, because it is impractical to obtain the exact temporal distribution of W^on and W^off at run time, we calculate the optimal CPU frequency as in scenario (II). From equations (III.
3) and (III.4), the optimal frequency, f_target, for a given PF_loss value is calculated as follows:

f_target = f_max / (1 + PF_loss · (1 + β))    (III.7)

As can be seen from the above equation, f_target is closely related to the β of a program. Consequently, accurate calculation of β is quite important to the effectiveness of our proposed DVFS approach.

III.3.3 SCALING GRANULARITY

The ideal DVFS can instantaneously change the voltage/frequency values. In reality, however, it takes time to change the CPU frequency/voltage due to factors such as the internal PLL (phase-locked loop) locking time and the capacitances that exist in the voltage path. For the 80200 XScale processor, the latency for switching the CPU voltage/frequency is 6 μsec at 333 MHz [47]. The quantum of time for scaling the CPU frequency/voltage must be much larger than this switching overhead so that the overhead becomes negligible compared to the scaling time unit. At the same time, we would like to minimize the overhead of the voltage/frequency scaling as far as the OS is concerned. Therefore, we use the start time of an OS quantum (approximately 50 msec in Linux), which the OS uses to schedule processes, as the DVFS decision point; that is, each time the OS invokes the scheduler to schedule processes for the next quantum, we also make a decision as to whether or not the CPU voltage/frequency should be changed and, if so, scale the voltage/frequency of the CPU. In addition to ease of implementation, the advantage of using the OS quantum as the scaling unit is that the quantum is the smallest unit of time in which a process is executed atomically under a multitasking OS environment such as Linux.
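The target-frequency formula of Eq.(III.7) can be checked numerically: under the execution-time model T(f) = T^on·(f_max/f) + T^off with T^off = β·T^on, picking f_target this way lands exactly on the requested PF_loss. A sketch (the β values are taken from Table III-1; the continuous frequency is idealized, since the real hardware quantizes to 66 MHz steps):

```python
# Numerical check of Eq.(III.7) against the T(f) = T_on*(f_max/f) + T_off model.
F_MAX = 733e6  # Hz

def f_target(beta, pf_loss):
    """Eq.(III.7): frequency meeting a requested performance loss."""
    return F_MAX / (1 + pf_loss * (1 + beta))

def pf_loss_at(f, beta, t_on=1.0):
    """Actual loss of the model, with T_off = beta * T_on (Definition III-1)."""
    t_off = beta * t_on
    t_fmax = t_on + t_off
    t_f = t_on * (F_MAX / f) + t_off
    return (t_f - t_fmax) / t_fmax          # Eq.(III.5)

for beta in (0.14, 0.49, 5.26, 10.01):      # beta values from Table III-1
    f = f_target(beta, 0.10)                # ask for a 10 % performance loss
    print(f"beta={beta}: f_target={f/1e6:.0f} MHz, "
          f"actual loss={100 * pf_loss_at(f, beta):.1f} %")
```

For every β the printed actual loss is 10 %, and the more memory-bound the program (larger β), the lower the frequency that still meets the constraint.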
III.3.4 EVENTS MONITORED THROUGH THE PMU ON XSCALE

It is very difficult to calculate the exact β of a program in a static manner, e.g., at compilation time, because the on/off-chip latencies are severely affected by dynamic behavior such as cache statistics and the different access overheads of different external devices. Hence, these unpredictable dynamic behaviors should be captured at run time. This can be achieved by using a performance monitoring unit, which is often available in modern microprocessors. In our target system, the CPU is Intel's XScale, which supports monitoring of 20 performance events, including cache hit/miss, TLB hit/miss, and the number of executed instructions. The overhead for accessing the PMU (read/write) is less than 1 μsec [86] and can be ignored. However, there is a limitation in using these events in the sense that only two events can be monitored at the same time, along with the number of clock counts in a quantum (CCNT). For our DVFS policy, we performed many experiments to figure out which events can give a valuable clue about β, and the following two events proved to be the most helpful based on the experimental results: (i) the number of instructions being executed (INSTR) and (ii) the number of external memory accesses (MEM). Using these two events, INSTR and MEM, along with CCNT, CPI_on can be extracted as shown in Figure III-3. Figure III-3 plots the combination of the three events while executing (a) "fgrep" and (b) "gzip" at different frequencies from 733 MHz down to 333 MHz in fixed steps of 66 MHz. At the start of each quantum, the PMU reports the CCNT, INSTR, and MEM. From these three parameter values, we can calculate the average number of CPU cycles per instruction (CPI^avg) for the instruction stream as the ratio of CCNT to INSTR.
Similarly, we can calculate the average number of memory accesses per instruction (MPI^avg) as the ratio of MEM to INSTR. The MPI value represents the degree to which an application is memory-bound and is quite useful when accurate counts of clock cycles per memory access are not available. For example, the same memory instruction can take different numbers of clock cycles depending on its memory access pattern, i.e., whether or not it accesses the same row in the memory. In this figure, we have plotted CPI^avg on the y-axis and MPI^avg on the x-axis. Each dot in the plot represents one PMU report.

Figure III-3 : Contour plots of CPI^avg versus MPI^avg for different CPU clock frequencies (733 MHz to 333 MHz): (a) "fgrep" and (b) "gzip"

From this figure, we can easily see that, at a fixed CPU clock frequency, CPI^avg is linearly related to MPI^avg as follows:

CPI^avg = b(f) · MPI^avg + c    (III.8)

where b(f) is a frequency-dependent slope. Notice that the intercept c is equal to the average on-chip CPI, CPI_on^avg, and is independent of the frequency f. Therefore, Eq.(III.8) can be used to provide an accurate estimation of CPI_on^avg, from which β can be determined using Eq.(III.6) and Definition III-1.

III.4 REGRESSION-BASED FINE-GRAINED DVFS

III.4.1 CALCULATING β WITH A REGRESSION EQUATION

In our proposed DVFS approach, the monitored event values are used to estimate the coefficients b and c of the regression Eq.(III.8), and this equation is then used to predict the β of a program. Voltage/frequency scaling is performed at the start of each quantum. The regression coefficients b and c are dynamically updated as explained below.
Let the linear equation for the regression be y = b·x + c, where x and y denote MPI^avg and CPI^avg, respectively. The coefficients b and c at quantum t ≥ N are calculated from the last N PMU reports as follows:

b^t = [N · Σ_{i=t−N+1}^{t} x_i·y_i − (Σ_{i=t−N+1}^{t} x_i)·(Σ_{i=t−N+1}^{t} y_i)] / [N · Σ_{i=t−N+1}^{t} x_i² − (Σ_{i=t−N+1}^{t} x_i)²]

c^t = (Σ_{i=t−N+1}^{t} y_i)/N − b^t · (Σ_{i=t−N+1}^{t} x_i)/N    (III.9)

where x_i and y_i denote the MPI^avg and CPI^avg for the ith quantum. Note that we must choose N carefully: if N is chosen to be too small, we will be too sensitive to small changes in the program behavior and we may not have enough data points to do a good regression. On the other hand, if N is too large, then we may filter out many important changes in the program behavior. The regression coefficients are updated at the start of every quantum. Recall that a regression equation is maintained for each frequency because b is different for different frequencies. The optimal frequency for the next quantum t+1 is calculated as follows. After quantum t, the β of quantum t is calculated as:

β^t = (CPI^avg,t − CPI_on^avg,t) / CPI_on^avg,t    (III.10)

where CPI_on^avg,t is given by the intercept c^t of the regression. Once β^t is obtained, the target CPU frequency for the next quantum, f^{t+1}, is calculated from Eq.(III.7) with the specified PF_loss as follows:

f^{t+1} = f_max / (1 + PF_loss · (1 + β^t))    (III.11)

III.4.2 PREDICTION ERROR ADJUSTMENT

We assumed that the parameter β for the next quantum is the same as that for the current quantum. However, in reality, β can vary significantly even within one time quantum depending on the characteristics of the target application program (the β variation tends to be higher for memory-bound applications.)
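The update rules of Eqs.(III.9)-(III.11) amount to an ordinary least-squares fit over a sliding window of PMU samples. A self-contained sketch (the PMU data here is synthetic, and the function names are illustrative, not those of the kernel module):

```python
# Sliding-window least squares for Eq.(III.9), then beta and the next
# frequency via Eqs.(III.10)-(III.11). Synthetic data stands in for PMU reports.
F_MAX = 733e6
N = 25  # window size used in the thesis

def regress(xs, ys):
    """Least-squares slope b and intercept c over the last N samples."""
    xs, ys = xs[-N:], ys[-N:]
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    b = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    c = sy / n - b * sx / n
    return b, c

def next_frequency(cpi_avg, cpi_on, pf_loss):
    beta = (cpi_avg - cpi_on) / cpi_on          # Eq.(III.10)
    return F_MAX / (1 + pf_loss * (1 + beta))   # Eq.(III.11)

# Synthetic quanta following CPI_avg = 40 * MPI_avg + 1.2 exactly.
mpi = [0.01 * i for i in range(1, 31)]
cpi = [40.0 * x + 1.2 for x in mpi]
b, c = regress(mpi, cpi)        # recovers b ~ 40, c ~ 1.2 (the on-chip CPI)
f_next = next_frequency(cpi[-1], c, pf_loss=0.10)
print(f"b={b:.1f}, c={c:.2f}, f_next={f_next/1e6:.0f} MHz")
```

On this noiseless data the fit recovers the slope and intercept exactly; the last quantum has β = 10, so a 10 % loss target maps to f_max/2.1 ≈ 349 MHz.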
The β variation is in fact due to the different off-chip latencies of the SDRAM and PCI-device accesses in our target system. Figure III-4 shows the actual distributions of β over time, measured at 5 msec intervals during the execution of (a) "math" and (b) "gzip". As expected, the β variation is larger (by nearly one order of magnitude) for "gzip" than for "math", which is a CPU-bound application.

Figure III-4 : Variation in the β value of applications at 733 MHz: (a) "math" and (b) "gzip"

The average value, β_avg, and the standard deviation, σ_β, of β are calculated as follows:

β_avg = (Σ_{i=1}^{N} T_off^i) / (Σ_{i=1}^{N} T_on^i),  σ_β = sqrt( Σ_{i=1}^{N} (β_i − β_avg)² / (N − 1) )    (III.12)

where N is the total number of quanta, β_i is the β value of the application program in the ith quantum, and T_on^i (T_off^i) is the on-chip (off-chip) execution time in the ith quantum. Note that β_avg represents the amount of potential energy saving, i.e., the higher the β_avg, the higher the potential energy saving under a given timing constraint, whereas σ_β captures the degree of difficulty in achieving finely-controlled performance-energy tradeoffs. The β_avg and σ_β values are reported in Table III-1.

Table III-1 : Statistics of the β values seen for different applications at a CPU clock frequency of 733 MHz

       | math | bf   | crc  | djpeg | gzip | qsort | fgrep
β_avg  | 0.14 | 0.2  | 0.16 | 0.49  | 5.26 | 7.82  | 10.01
σ_β    | 0.21 | 0.64 | 0.56 | 0.45  | 2.80 | 10.94 | 9.44

The severe fluctuations in the β
parameter, which tend to occur in memory-bound applications, may cause a large error when attempting to predict β for the next quantum based on β for the current quantum. This situation becomes worse when the quantum length is variable. For example, consider the case in which a process performs an I/O operation (mostly file read/write functions). In such a case, the CPU preempts the process; therefore, the length of the OS quantum is shortened compared to the "standard" quantum length of approximately 50 msec. For example, for "gzip", the actual length of the quantum ranges from 2 msec to 50 msec (with an average value of 6 msec), whereas it is nearly constant (with an average value of 50 msec) for "math". Notice that the β prediction error is especially severe when the current quantum is quite short and the next quantum is quite long. So, we modify the proposed technique in order to handle the error in predicting β for the next quantum. The modification is shown in Figure III-5, which depicts three consecutive quanta, q^{t−1}, q^t, and q^{t+1}, each with a distinct β value and quantum lengths T_act^{t−1}, T_act^t, and T_act^{t+1}. For the specified PF_loss, the expected execution times are denoted by T_exp^{t−1}, T_exp^t, and T_exp^{t+1}, respectively. Voltage/frequency scaling for q^t, q^{t+1}, and q^{t+2} is performed at t_1, t_2, and t_3, respectively.
Figure III-5 : Compensating for the error due to misprediction of β. The figure shows the quantum sequence with the execution time at f_max, the expected execution time T_exp^k = T^k · (1 + PF_loss) for k = t−1, t, t+1 (where T^k is the execution time at f_max), the actual execution time T_act^k, and the slack generation S^t = T_exp^t − T_act^t + S^{t−1}.

When a frequency is chosen for the next quantum, there may exist some (positive or negative) slack time (i.e., the difference between T_exp^t and T_act^t). These slack times come about due to the misprediction of β for the next quantum. With a positive (negative) slack, the frequency for the next quantum should be made smaller (larger) compared to the case of zero slack. For example, at time t_2, the actual execution time until t_2 is (T_act^{t−1} + T_act^t), which is less than the expected time (T_exp^{t−1} + T_exp^t), so there is a positive slack time S^t = T_exp^t − T_act^t + S^{t−1}. If S^t is added in the calculation of the frequency for the next quantum q^{t+1}, then the error that occurred in the previous quanta can be compensated for in the following quantum. Eq.(III.11) for calculating the target frequency for the next quantum is thus modified as follows:

f^{t+1} = f_max / [1 + (1 + β^t) · (PF_loss + S^t / T_act^t)]    (III.13)

Notice that for positive (negative) slack S^t, the denominator will be larger (smaller) than in the zero-slack case, and hence the target frequency f^{t+1} will be smaller (larger), which is of course the desired behavior.

III.5 IMPLEMENTATION

We implemented the proposed policy on a high-performance XScale-based testbed, which runs Linux (v2.4.17) and includes the various components summarized in Table III-2. The Intel 80200 XScale processor configuration is summarized in Table III-3. A programmable clock multiplier (PLL) in the XScale processor generates the internal CPU clock, which can be adjusted from 200 up to 733 MHz in steps of about 66 MHz, with the development-board speeds only available from 333 MHz and up.
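The slack bookkeeping can be sketched as a small control loop. Everything below is illustrative: the per-quantum traces are synthetic, and the frequency rule follows the slack-adjusted form reconstructed here (positive slack enlarges the denominator and so lowers the next frequency):

```python
# Slack-compensated frequency selection across quanta (a sketch).
F_MAX = 733e6
PF_LOSS = 0.10

def f_next(beta, slack, t_act):
    """Slack-adjusted target: positive slack lowers f, negative raises it."""
    return F_MAX / (1 + (1 + beta) * (PF_LOSS + slack / t_act))

# Synthetic per-quantum traces: (predicted beta, execution time at f_max in s).
quanta = [(5.0, 0.050), (8.0, 0.006), (4.0, 0.050)]

slack = 0.0
t_act_prev = 0.050           # actual length of the quantum before the loop (s)
for beta, t_fmax in quanta:
    f = f_next(beta, slack, t_act_prev)   # pick frequency for this quantum
    t_exp = t_fmax * (1 + PF_LOSS)        # allowed time at the chosen loss
    t_act = 0.95 * t_exp                  # pretend we ran slightly fast
    slack += t_exp - t_act                # S^t = T_exp^t - T_act^t + S^{t-1}
    t_act_prev = t_act
    print(f"f = {f/1e6:.0f} MHz, accumulated slack = {slack*1e3:.2f} ms")
```

With zero slack and β = 5 the first quantum gets f_max/1.6 ≈ 458 MHz; as positive slack accumulates, subsequent quanta are pushed to lower frequencies than the zero-slack rule would give.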
The lower bound results from a constraint on the memory bus speed, which is 100 MHz in our system. The bus speed has to be less than a third of the CPU clock speed, which yields a minimum available CPU speed of 333 MHz. Running the system at CPU speeds slower than 333 MHz causes immediate halts.

Table III-2 : Apollo testbed II (AT2) system components

Component | Description
Main module | Intel 80200 XScale microprocessor; PC100 SDRAM main memory (128 MByte, 64-bit bus); XScale microprocessor system bus interface and peripheral bus interface; SDRAM controller @100 MHz (sustained 800 MB/sec of SDRAM bandwidth); PCI host bridge; integrated FLASH memory controller and integrated UART16550 controller; DMA controller and interrupt controller
FPGA module | Xilinx VirtexE/VirtexII FPGA companion chip
LCD module | 10.4 inch 800x600 resolution color TFT LCD panel (LG-Philips LP064V1); Xilinx XC2S150 LCD controller; 16 MB SDRAM frame buffer memory
DSP module | TI TMS320C6713 floating-point DSP; 1800 MIPS/1350 MFLOPS @225 MHz
Ethernet module | 10/100 Mbps
USB 2.0 module | NEC uPD720101, supporting up to 480 Mb/s data bandwidth
PCMCIA module | PCI RICOH R5C475II, supporting the IEEE 802.11b WLAN protocol

Table III-3 : Intel 80200 XScale processor configuration

Unit | Configuration
Instruction Cache | 32 Kbytes, 32 ways
Data Cache | 32 Kbytes, 32 ways
Mini-Data Cache | 2 Kbytes, 2 ways
Branch Target Buffer | 2 Kbytes, 2 ways
Instruction Memory Management Unit | 32-entry TLB, fully associative
Data Memory Management Unit | 32-entry TLB, fully associative

A photo of the main PC board of our target system, on which the XScale processor, SDRAM module, and memory controller are mounted, is shown in Figure III-6.
The main PCB of our testbed includes an on-board variable voltage generator, which provides a suitable operating voltage at each clock frequency level. A D/A converter was used as the variable operating voltage generator to control the reference input voltage of a DC-DC converter that supplies the operating voltage to the CPU. The inputs to the D/A converter were generated using a customized CPLD (Complex Programmable Logic Device). When the CPU clock speed is changed, a minimum operating voltage level must be applied at each frequency to avoid a system crash due to the increased gate delays. In our implementation, these minimum voltages are measured and stored in a table so that they are automatically sent to the variable voltage generator when the clock speed changes. The voltage levels mapped to each frequency were obtained through extensive measurements and are summarized in Table III-4.

Figure III-6 : Main board with the CPU, memory, and memory controller

Table III-4 : Frequency and voltage levels in the system

Frequency (MHz) | Voltage (V)
333 | 0.91
400 | 0.99
466 | 1.05
533 | 1.12
600 | 1.19
666 | 1.26
733 | 1.49
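The table lookup described above can be sketched as follows. The values are those of Table III-4; the function name and the nearest-step quantization are illustrative, since the thesis does not spell out how a continuous f_target from Eq.(III.7) is snapped to a supported step:

```python
# Minimum operating voltage per frequency (Table III-4). On a clock change,
# the matching voltage is sent via the D/A converter to the DC-DC converter.
FREQ_TO_VOLT = {333: 0.91, 400: 0.99, 466: 1.05, 533: 1.12,
                600: 1.19, 666: 1.26, 733: 1.49}

def set_speed(target_mhz):
    """Quantize to the nearest supported frequency and return (MHz, volts)."""
    f = min(FREQ_TO_VOLT, key=lambda step: abs(step - target_mhz))
    return f, FREQ_TO_VOLT[f]

print(set_speed(349))   # a 349 MHz target from the policy → (333, 0.91)
```

In the real system the policy would additionally order the two writes safely, raising the voltage before an upward frequency step and lowering it afterwards for a downward one.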
For the measurements, the system has a 100K samples/second data acquisition system in which the voltage drop across a precision resistor, inserted between the external power line and the "design under test" (DUT) power line, is used to measure the power consumption, as shown in Figure III-7. Figure III-8 shows the structure of the power plane in our XScale-based system.

Figure III-7 : Data acquisition system

Figure III-8 : The structure of the power plane of the XScale board (the 3.3 V and 5 V rails feed the CPU core, PCI bridge, FLASH, SDRAM, Ethernet, and boot ROM through individual sense resistors)

On the software side, we wrote a module in which the proposed policy is implemented; this module is hooked into the scheduler so that voltage scaling can occur during every context switch. Figure III-9 shows the software architecture of the DVFS implementation.

Figure III-9 : Software architecture of our DVFS implementation (a kernel-space DVFS module, comprising PMU access, policy, and "proc" interface sub-modules, sits between the Linux scheduler and the XScale processor; an external PF_loss input, e.g., battery status or a user request, is supplied through the proc interface)

During the context switch, the PMU values for the previous process are read, and the ideal frequency calculation for the next quantum is performed as described in section III.4.1. A regression equation for each frequency is maintained for each process, which consists of no more than 5 long-type variables, resulting in little space overhead for implementing our DVFS policy. We measured the time overhead of our policy by using a benchmark from the Lmbench suite [55] and found that the time overhead was about 100 μsec. The original context switch time was also nearly 100 μsec. Although we almost doubled the context switch time, the overhead is still quite negligible in comparison to the quantum time of a few tens of milliseconds.
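The sense-resistor measurement described above is just Ohm's law: the sampled drop ΔV across the resistor gives the current, and multiplying by the DUT supply voltage gives the power. A sketch (the resistor value and the sample trace are illustrative; the board uses several different sense resistors):

```python
# Power from sense-resistor samples: I = dV / R, P = V_dut * I.
R_SENSE = 0.2        # ohms (illustrative; the board uses several values)

def power_mw(v_dut, dv):
    """Instantaneous DUT power in mW from one ADC sample."""
    return v_dut * (dv / R_SENSE) * 1e3

# Hypothetical trace: (DUT voltage, measured drop across the resistor), volts.
samples = [(1.49, 0.105), (1.49, 0.110), (1.49, 0.100)]
avg = sum(power_mw(v, dv) for v, dv in samples) / len(samples)
print(f"average power = {avg:.1f} mW")
```

Averaging such samples over a program run, and multiplying by the run time, yields the energy figures reported in section III.6.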
Our implementation supports a proc-file interface to the module such that the performance loss level and the size of the window can be specified by writing the appropriate value to this proc-file, which allows us to dynamically control the desired level of energy saving. The current values can also be read back from the proc-file interface. Another feature we implemented, to gain more accurate information at the cost of higher overhead, is to read the PMU event values at every timer interrupt (1 msec on our platform). This feature is disabled by default and is not used in the experimental results section.

III.6 EXPERIMENTAL RESULTS

Our experiments were performed on seven applications: two common UNIX utility programs ("gzip" and "fgrep") and five representative benchmark programs available on the web [90]. They are summarized in Table III-5. All measurements were performed 10 times for each benchmark, and the average performance loss and average energy saving are reported. The size of the window, N, was set to 25 through exhaustive experiments; the experimental results showed that N in the range of 20 ~ 50 gives similar behavior.

Table III-5 : Summary of test applications

  Benchmark       Description
  gzip            compressing a given input file
  fgrep           searching for a given pattern in the files residing in a directory
  math            floating-point calculations
  bf (blowfish)   a symmetric block cipher with a variable-length key from 32 to 448 bits
  crc             32-bit cyclic redundancy check on a file
  djpeg           decoding a JPEG image file
  qsort           sorting a large array of strings in ascending order

Figure III-10 shows the measured performance degradation with target performance loss ranging from 5 % to 20 % in steps of 5 %.
As seen in this figure, the actual performance loss values are very close to the target values for all programs (i.e., within 2 % of the target) except for "fgrep" and "qsort". These two programs are memory-bound, and their PF_loss saturates at about 12 %, consistent with the data in Figure III-2.

Figure III-10 : Performance loss with different target values

Figure III-11 reports the actual power consumption (including both CPU and DC-DC converter power) for two cases, (i) without DVFS and (ii) with DVFS, when running "gzip" and "fgrep". In cases (a) and (c), the two programs run at the maximum frequency (733 MHz); in cases (b) and (d), a 10 % target PF_loss is applied. With the proposed policy, 52.1 % of the CPU energy is saved at the cost of an 11.6 % performance loss for "gzip", whereas a 77.6 % CPU energy saving with a 10.3 % performance loss is achieved for "fgrep".

Figure III-11 : CPU power consumption with/without DVFS: (a) "gzip" without DVFS at the maximum frequency (789.5 mW average), (b) "gzip" with DVFS under a 10 % performance loss constraint (338.7 mW average, 1.0806 sec), (c) "fgrep" without DVFS at the maximum frequency (832.2 mW average, 0.9466 sec)
Figure III-11 (continued) : (d) "fgrep" with DVFS under a 10 % performance loss constraint (186.7 mW average, a 77.6 % energy saving, 1.044 sec, 10.3 % PF_loss)

Measured energy savings for all benchmarks appear in Figure III-12. From these measurements, we conclude that a CPU energy saving of more than 70 % is achieved for the memory-bound applications ("fgrep" and "qsort") with about a 10 % performance loss.

Figure III-12 : CPU energy saving for various application programs

The energy saving saturates after that point; that is, we cannot increase the amount of energy saving by tolerating a larger performance loss. For CPU-bound applications, the degree of energy saving is smaller, but our approach allows a finely tuned energy-performance tradeoff. For example, in the case of the "djpeg" program, we obtain a 42 % CPU energy saving under a 20 % performance loss constraint, or a 26 % energy saving under a 5 % performance loss constraint.

Next, we compared our proposed DVFS method with a static approach in which the CPU frequency for an application program is calculated from off-line data profiling in a uni-tasking system and the calculated frequency is kept for the whole execution. The percentages of T^on and T^off in the total execution time T at the maximum CPU frequency (733 MHz) were measured for three benchmarks, "djpeg", "gzip", and "qsort", and are provided in Table III-6.
Table III-6 : Calculated CPU frequency for the test applications after profiling

  Benchmark   β_avg   f_opt (calculated) [MHz]      f_app (applied) [MHz]
                      10 % PF_loss   20 % PF_loss   10 % PF_loss   20 % PF_loss
  djpeg       0.49    637.9          564.7          666            600
  gzip        5.26    450.8          325.5          466            333
  qsort       7.82    389.5          265.2          400            333

Based on the measured T^on and T^off, summarized by their ratio β_avg = T^off/T^on, the optimal CPU frequency, f_opt, for each application is calculated using Eq.(III.5) with target performance losses of 10 % and 20 %. Due to the discreteness of the available CPU frequencies in our target system, we used the closest available CPU frequency larger than f_opt, denoted f_app, and measured the CPU energy consumption. Figure III-13 shows the difference in CPU energy saving between our dynamic method and the static approach. As shown in this figure, there is little difference in the energy savings (less than 5 %) between the two approaches.

Figure III-13 : Comparison of the dynamic (proposed) vs. static (profiling) approach

As a final experiment, we compared the energy savings for different scaling granularities, whereby the PMU readings and the voltage/frequency scaling operation occur at intervals of 1 msec, 5 msec, 20 msec, and 50 msec, respectively. The results for the "gzip" (memory-bound) and "djpeg" (CPU-bound) applications are shown in Figure III-14.
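Returning to the static approach of Table III-6: its numbers can be reproduced as follows, assuming (consistent with the table's entries) that β_avg = T^off/T^on and that Eq.(III.5) requires the scaled run time T^on·(f_max/f) + T^off to equal (1 + PF_loss)·(T^on + T^off). The exact statement of Eq.(III.5) appears in an earlier section, so this is a sketch of our reading, not a quotation of it.

```python
# Static (profiling-based) frequency selection. Requiring
#   T_on*(f_max/f) + T_off = (1 + PF_loss)*(T_on + T_off),  beta = T_off/T_on,
# gives f_opt = f_max / ((1 + PF_loss)*(1 + beta) - beta), which reproduces
# the f_opt values in Table III-6.
F_MAX = 733.0                                   # maximum clock (MHz)
F_AVAIL = [333, 400, 466, 533, 600, 666, 733]   # supported clock steps (MHz)

def f_opt(beta, pf_loss):
    """Lowest (continuous) frequency meeting the performance-loss budget."""
    return F_MAX / ((1.0 + pf_loss) * (1.0 + beta) - beta)

def f_app(beta, pf_loss):
    """Closest supported frequency at or above f_opt."""
    target = f_opt(beta, pf_loss)
    return min(f for f in F_AVAIL if f >= target)
```

For example, djpeg (β_avg = 0.49) at a 10 % budget yields f_opt ≈ 637.9 MHz, rounded up to the 666 MHz step, matching Table III-6.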
Figure III-14 : Energy saving comparison according to scaling granularity (1 msec, 5 msec, 20 msec, 50 msec, and the OS quantum): (a) "gzip" and (b) "djpeg"

We can see that too frequent a scaling, i.e., a 1 msec interval, causes higher energy consumption for the 5 % and 10 % target performance loss specs in "djpeg". This phenomenon is due to the additional overhead of handling the timer interrupt (~80 μsec). In addition, in such a case, it was observed that the external memory access count, MEM, increases due to changes in the cache contents, which makes the total execution time longer, resulting in higher overall energy consumption. However, these ill effects become insignificant once the scaling interval becomes longer than 5 msec. Finally, notice that there is little difference in CPU energy saving between the OS quantum unit (which can differ from 50 msec depending on I/O interrupt generation or task completion) and the other time units.

III.7 CONCLUSIONS

In this Chapter, a regression-based DVFS policy for a finely tunable energy-performance trade-off was proposed and implemented on an XScale-based platform. In the proposed DVFS approach, a program's execution time is decomposed into two parts: on-chip computation and off-chip access latencies. The CPU voltage/frequency is scaled based on the ratio of the on-chip and off-chip latencies for each process under a given performance degradation factor. This ratio is given by a regression equation, which is dynamically updated based on runtime event monitoring data provided by an embedded performance monitoring unit.
Through actual current measurements in hardware, we demonstrated a CPU energy saving of more than 70 % for memory-bound programs with about 12 % performance degradation. For CPU-bound programs, a 15 ~ 60 % energy saving was achieved with finely tuned performance degradation ranging from 5 % to 20 %.

CHAPTER IV

OFF-CHIP LATENCY-DRIVEN DYNAMIC VOLTAGE AND FREQUENCY SCALING FOR MPEG DECODING

IV.1 INTRODUCTION

In this Chapter, we propose a DVFS technique for MPEG decoding that reduces energy consumption using computational workload decomposition. This technique decomposes the workload for decoding a frame into on-chip and off-chip workloads. The execution time of the on-chip workload is CPU frequency-dependent, whereas the off-chip workload execution time does not change regardless of the CPU frequency; the maximum energy saving is therefore obtained by setting the minimum frequency during the off-chip workload execution time, without causing any delay penalty. This workload decomposition is performed using the PMU in the XScale-80200 processor, which provides various statistics at run time, such as cache hits/misses and CPU stalls due to data dependency. The on-chip workload for an incoming frame is predicted using a frame-based history so that the processor voltage and frequency can be scaled to provide the exact amount of computing power needed to decode the frame. To guarantee a quality of service (QoS) constraint, a prediction error compensation method, called inter-frame compensation, is proposed, in which the on-chip workload prediction error is diffused into subsequent frames such that the run-time frame rate changes smoothly. The proposed DVFS algorithm has been implemented on an XScale-based testbed.
Detailed current measurements on this platform demonstrate significant CPU energy savings, ranging from 50 % to 80 % depending on the video clip.

IV.2 RELATED WORK

A number of researchers have applied DVFS to MPEG video decoding in order to achieve lower energy consumption [11][24][61][65][81]. In [24] and [61], DVFS with interval-based prediction is performed based on the ratio of the number of idle and busy cycles of the CPU while the MPEG stream is decoded. Although significant energy reduction has been reported, there is no guarantee that the deadline for each frame is met. A method using feedback control is proposed in [65], in which the decoding time is predicted based on the encoded code size of a frame. The authors assume a static (fixed) relationship, in the form of a linear equation, between the decoding time and the code size of each group of macroblocks. A macroblock corresponds to a 16 x 16 pixel area of the original image and consists of 8 x 8 blocks. By analyzing the first group of macroblocks, they obtain the code size of that group. Next, they assume that the code size of the second group of macroblocks is the same as this value and calculate the decoding time of the second group based on the decoding time and this code size. This code size prediction scheme is inaccurate, however, and may frequently miss deadlines. Furthermore, the linear prediction equation must be changed when different video image resolutions or frame pixel sizes are encountered. In [11], a frame-based workload prediction is used for DVFS in which the different steps of the decoding sequence are divided into frame-independent and frame-dependent parts.
The prediction error for the frame-dependent part is compensated during the frame-independent part, which consists of memory-intensive work whose execution time is assumed to scale with the CPU frequency. However, this approach is inapplicable to high-performance processors such as the XScale and Crusoe, in which the external memory clock is asynchronous to the CPU. In [81], the decoding time is estimated in units of a group of pictures (GOP), which in general consists of 12 or 15 frames. In this approach, the sizes and types of the frames of an incoming GOP are observed, and the time needed to decode the next GOP is estimated based on statistics of the previous GOPs. Severe QoS degradation is highly probable when the prediction is inaccurate, because the same frequency (voltage) is applied to all frames in a GOP. There is also an approach in which no decoding time prediction is needed [14]. This is accomplished by adding the execution time information of each frame to the video content itself (e.g., as part of the frame header). However, this approach adds to the computational workload of video encoding. As acknowledged in the same reference, it is only worthwhile if the encoded video stream is sent to many clients, so that the extra cost of adding decoding time information to the frame headers is compensated by energy savings on many mobile clients. In addition, this scheme requires modification of the currently used standard video stream format. In [18], an application-independent DVFS approach, Vertigo, was proposed, which uses multiple performance-setting algorithms organized into a decision hierarchy for various types of applications. This algorithm was applied to MPEG decoding.
Vertigo is an interval-based approach in which the workload in the next time interval is estimated based on the history of previous intervals, and the deadline for each interval is estimated from the workload in the previous intervals. This approach differs from [61] and [24], where the length of each time interval is fixed. The authors compared Vertigo with the LongRun policy [16] and reported that Vertigo performs better in terms of the match between the actually achieved frame rate and the target frame rate. Note, however, that the actual frame rates with Vertigo still deviate considerably from the target frame rate, i.e., the actual times are 17 % to 30 % shorter than the target times, which in turn results in lower energy saving. There have also been studies on using buffers in multimedia processing [33][44][45]. One of the most important advantages of using buffers is that no explicit frame decode time prediction is needed, and therefore missed deadlines due to prediction errors are avoided. These techniques, however, suffer from underflow/overflow of the finite buffer when the decoding time variation is high [5] or when the gain of the proportional-integral (PI) controller is improper [45]. Reference [44] used an off-line algorithm to schedule the frame decoding rate and the corresponding frequency, and did not consider multimedia streams that include B-frames. Reference [33] focused on estimating the input/output buffer sizes for the decoder, assuming that the worst-case execution time is known in advance. In [45], a feedback control scheme using a PI controller at the decoder output buffer is proposed, in which a constant frame rate is achieved by monitoring the frame buffer occupancy. However, it is difficult to tune the gain of the PI controller, and a slight mismatch in the controller gain may cause underflow/overflow at the buffer.
Also, the frequency/voltage setting in [45] is assumed to be linearly subdivided into 40 discrete levels, which is not the case in an actual system. These buffer-based techniques introduce some delay at the start of a video session due to buffer filling, and require substantial modification of the application source code itself to implement the control scheme. None of the previous works on low-power MPEG decoding consider the decomposition of the computational workload as proposed in this thesis. In this Chapter, we propose a DVFS method for MPEG decoding in which the time for memory-bound operations is accurately singled out of the whole decoding time, so that CPU energy savings can be maximized under a given frame rate by setting a lower CPU frequency during memory-bound operations. The calculation of the memory-bound operation time is performed at run time based on the dynamic events reported by the PMU, without any help from an off-line simulator or compiler.

IV.3 MPEG DECODING

The two objectives of DVFS in MPEG decoding are to maximize CPU energy savings and to guarantee a given QoS constraint, such as a given frame rate. An MPEG video stream consists of three frame types: I-frames (intra-coded), P-frames (predictive-coded), and B-frames (bi-directionally coded). I-frames can be decoded independently. P-frames have to be decoded based on the previous frame. B-frames require both the previous and the next frames in order to be decoded. Sequences of frames are grouped together to form a group of pictures (GOP). A GOP generally contains 12-15 frames, starting with an I-frame. As shown in Figure IV-1, decoding each frame takes several steps: parsing, inverse discrete cosine transformation (IDCT), reconstruction, dithering, and display [56]. Among these steps, the IDCT and reconstruction take up half of the decoding time [60].
Each frame type results in a different workload during the IDCT and reconstruction steps, meaning that the execution time of different frame types varies by a large amount, while the time for the dithering and display steps is the same for all frame types. Careful examination of what operations are performed in each step is quite helpful in partitioning the MPEG decoding workload into on-chip and off-chip parts. For example, the IDCT is a CPU-intensive operation in which iterative multiply-accumulate computations over an 8 x 8 array of integer or floating-point values are required, so the IDCT step is classified as W^on. The reconstruction, dithering, and display steps, by contrast, are memory-intensive, requiring frame-sized data movement between the processed video stream and the display frame buffer and causing frequent cache misses, and can therefore be considered as W^off.

Figure IV-1 : MPEG decoding sequence (read stream, IDCT, reconstruct, merge macroblocks, make frame, dither frame, display frame; MB: macroblock)

The situation is complex and more closely resembles scenario (II) of section III.3.1 in Chapter III. We therefore opted to divide the whole decoding time of a frame into two parts, T^CON and T^VAR, where T^CON is CPU frequency-independent and comprises the "dithering" and "display" times, while T^VAR is the elapsed time for the remaining steps, which are CPU frequency-dependent. Figure IV-2 shows actual experimental results for T^VAR and T^CON of each frame type while changing the CPU frequency from 733 MHz to 333 MHz.
Figure IV-2 : Decoding time variation as a function of the CPU clock frequency (test video (2), Siberian Tiger)

As we expected, the T^CON of every frame is independent of the CPU frequency, regardless of frame type, while T^VAR changes according to the CPU frequency. This means that we can assign the minimum frequency during T^CON, i.e., the "dithering" and "display" steps, as in scenario (I) of section III.3.1 in Chapter III. To calculate the target CPU frequency during T^VAR, we need to know the accurate ratio of the on-chip and off-chip times during T^VAR, which can, in turn, be obtained using dynamic events from the PMU.

IV.4 PROPOSED DVFS POLICY FOR MPEG DECODING

The off-chip time, T^off, can be obtained by making use of the fact that it is independent of the CPU frequency. To relate a PMU event to T^off, we plotted many combinations of PMU events against the measured T^VAR while changing the CPU frequency, and found that INSTR, the number of executed instructions, gives quite accurate information about the T^off component of T^VAR. In Figure IV-3, we plot T^VAR on the y-axis and INSTR on the x-axis at CPU frequencies of 333 MHz and 733 MHz. Each dot in the plot represents one PMU report for a B-type frame at the corresponding clock frequency.
Figure IV-3 : Contour plots of T^VAR versus INSTR for different CPU clock frequencies (B-type frames of test video (2), Siberian Tiger)

From this figure, we can see that the points for all B-frames in the test video form a line at each frequency, and that T^off can be obtained as the y-axis intercept of a linear equation as follows:

  T^{VAR} = \frac{CPI_{on}^{avg}}{f_{cpu}} \cdot INSTR + T^{off}    (IV.1)

Based on Eq.(IV.1), CPI_on^avg is calculated to be about 2.7, regardless of the CPU frequency, and T^off at both frequencies converges to 7.5 msec. T^off is different for each frame type, with B-frames having the largest T^off and I-frames the smallest. This observation can be justified by recalling that predictive frames (P- and B-frames) need macroblocks that have already been reconstructed and decoded in the previous I-frames, thereby causing more off-chip access delays due to frequent data cache misses. Consequently, the proposed DVFS method is quite effective for MPEG decoding if we consider that an MPEG video clip usually has 10 times more P- and B-frames than I-frames. Table IV-1 reports the measured ratio of T^off to T^VAR at 733 MHz for each frame type of six different video clips.

Table IV-1 : The ratio of T^off to T^VAR for each frame type in each video clip

  Test video          Frame size   I         P         B
  (1) Terminator2     352 x 240    3.49 %    11.60 %   40.58 %
  (2) Siberian Tiger  320 x 240    7.96 %    11.87 %   25.74 %
  (3) Deploy          352 x 288    15.01 %   58.01 %   47.19 %
  (4) Wg_wt           304 x 224    10.12 %   43.95 %   -
  (5) Badboy2         480 x 208    20.64 %   38.85 %   50.76 %
  (6) Final3          160 x 120    26.11 %   36.80 %   59.34 %
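The extraction of T^off via Eq.(IV.1) can be illustrated with synthetic data. We generate per-frame (INSTR, T^VAR) samples lying on the line reported above (CPI_on^avg = 2.7, f_cpu = 733 MHz, T^off = 7.5 msec) and check that a windowed least-squares fit, of the kind Eq.(IV.2) prescribes, recovers T^off as the intercept. The sample values are synthetic, chosen only to match the constants quoted in the text.

```python
# Illustration of Eq.(IV.1): T_VAR = (CPI_on/f_cpu)*INSTR + T_off.
# A least-squares fit over the last N samples recovers T_off as the intercept.
N = 25          # regression window size used in the experiments
CPI_ON = 2.7    # average on-chip cycles per instruction (from the text)
F_CPU = 733e6   # CPU clock, Hz
T_OFF = 7.5e-3  # off-chip time, s (from the text)

def fit_line(samples):
    """Least-squares (slope, intercept) for y = a*x + b over the last N samples."""
    win = samples[-N:]
    n = len(win)
    sx = sum(x for x, _ in win)
    sy = sum(y for _, y in win)
    sxx = sum(x * x for x, _ in win)
    sxy = sum(x * y for x, y in win)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return a, b

# Synthetic B-frame samples exactly on the Eq.(IV.1) line.
samples = [(i * 1e6, (CPI_ON / F_CPU) * i * 1e6 + T_OFF) for i in range(2, 40)]
slope, t_off_est = fit_line(samples)
```

With real PMU data the points scatter around the line, and the fitted intercept is the run-time estimate of T^off used by the policy.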
Let the linear regression equation be y = a·x + b, where x and y denote INSTR and T^VAR of some frame type, respectively. The coefficients a and b at frame t > N are calculated from the last N PMU reports as follows:

  a = \frac{N \sum_{i=t-N+1}^{t} x_i y_i - \sum_{i=t-N+1}^{t} x_i \sum_{i=t-N+1}^{t} y_i}{N \sum_{i=t-N+1}^{t} x_i^2 - \left( \sum_{i=t-N+1}^{t} x_i \right)^2}, \qquad b = \frac{\sum_{i=t-N+1}^{t} y_i - a \sum_{i=t-N+1}^{t} x_i}{N}    (IV.2)

The regression coefficients are updated at the end of every frame. Recall that a separate regression equation is maintained for each frame type because MEM varies for different frame types, resulting in different execution times for off-chip accesses. To track the varying T^on of each frame, we maintain a moving average of the last M INSTR values for each frame type (three averages, one per frame type). Here, M can be the same as N, the number of data points used for the regression equation. The expected decoding time for an incoming frame under a given frame rate, R, is thus determined based on the following: the moving average of INSTR and the CPI_on^avg from the regression equation for the on-chip latency, the y-axis intercept of the regressed equation for the off-chip latency, T^off_EXP, and the constant T^CON, which is easily obtained after decoding the first frame of a given video clip. The CPU frequency for the (t+1)-th frame, f^{t+1}_cpu, is then calculated as:

  f_{cpu}^{t+1} = \frac{INSTR_{EXP}^{t+1} \cdot CPI_{on}^{avg}}{D - T^{CON} - T^{off}_{EXP}}    (IV.3)

where D = 1/R is the per-frame deadline and INSTR^{t+1}_EXP is the average of INSTR (up to the t-th frame) for the frame type that matches the frame type at time t+1.

In MPEG decoding, meeting a QoS constraint such as a given frame rate is quite important. The proposed DVFS method is based on predicting the on-chip and off-chip times for a frame, and this prediction may not be perfect when frames exhibit severe variation in computational workload, in which case the target frame rate cannot be maintained.
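The frequency selection of Eq.(IV.3), together with its slack-extended form, Eq.(IV.4), can be sketched as below. The numeric values (instruction count, CPI, T^CON, T^off) are illustrative stand-ins chosen near the magnitudes quoted in this Chapter, not measurements.

```python
def next_frequency(instr_exp, cpi_on, rate_fps, t_con, t_off_exp, t_slack=0.0):
    """CPU frequency (Hz) for the next frame, per Eq.(IV.4).

    With t_slack = 0 this reduces to Eq.(IV.3): the on-chip work
    (instr_exp * cpi_on cycles) must fit in the deadline D = 1/R minus
    the frequency-independent parts T_CON and T_off_exp.
    """
    budget = 1.0 / rate_fps - t_con - t_off_exp + t_slack
    if budget <= 0.0:
        raise ValueError("frame deadline cannot be met at any frequency")
    return instr_exp * cpi_on / budget

# Illustrative values: 5e6 on-chip instructions at CPI 2.7, a 15 fps target,
# T_CON = 37 ms and T_off_exp = 7.5 ms leave a ~22.2 ms on-chip budget.
f_no_slack = next_frequency(5e6, 2.7, 15, 0.037, 0.0075)           # Eq.(IV.3)
# +5 ms of slack carried over from an over-predicted previous frame
# enlarges the budget, so the next frame can run at a lower frequency.
f_with_slack = next_frequency(5e6, 2.7, 15, 0.037, 0.0075, 0.005)  # Eq.(IV.4)
```

The computed frequency would then be snapped to the nearest supported clock step at or above it, as in Chapter III.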
Hence, a method is required that can compensate for the prediction error and effectively maintain the user-specified QoS. A commonly used technique in video rendering, called error diffusion [20], filters the quantization error of previously quantized pixels and distributes it forward to unquantized pixels in the neighborhood so that a smooth image is achieved. The same idea can be used to eliminate severe fluctuations in the frame rate due to prediction error. In the inter-frame compensation method, the amount of error is diffused over the subsequent frames, and the CPU frequencies for the following frames are calculated by considering not only their own predicted decoding times, but also the timing slack that occurred due to imperfect prediction in the previous frames. This error diffusion localizes the prediction error to a small number of neighboring frames, where it can be effectively compensated by decreasing (increasing) the CPU frequency in case of over-prediction (under-prediction), resulting in a soft and stable variation of the frame rate. Indirectly, the proposed inter-frame compensation method is analogous to considering "excess cycles" from the previous time slots in interval-based workload prediction techniques [85]. Adopting inter-frame compensation, Eq.(IV.3) is modified as follows:

  f_{cpu}^{t+1} = \frac{INSTR_{EXP}^{t+1} \cdot CPI_{on}^{avg}}{D - T^{CON} - T^{off}_{EXP} + T_{slack}^{t}}    (IV.4)

where T^t_slack is the difference between D and the time actually spent decoding the t-th frame.

IV.5 EXPERIMENTAL RESULTS

We implemented the proposed DVFS technique, called OL-DVFS (which stands for off-chip latency-driven DVFS), for MPEG decoding with on-chip vs.
off-chip workload partitioning on an XScale-based system that includes an on-board variable voltage generator producing a suitable CPU voltage at each frequency level. Details of the XScale-based system, such as the power measurement scheme and the allowed CPU clock frequencies with the corresponding minimum voltage levels, were given in section III.5 of Chapter III. The window sizes, N and M, are set to 25 through exhaustive experiments. For the actual measurements, a data acquisition system (DAQ) with a sampling rate of up to 100 KHz is used. Figure IV-4 depicts the CPU power consumption while decoding an I-frame followed by a B-frame, in which two different frequencies are set during T^CON: (a) 666 MHz and (b) 333 MHz. A frequency of 733 MHz is used for T^VAR. As mentioned in the previous section, T^CON, which contains the off-chip access latencies during "dithering" and "display", does not change with the CPU frequency, i.e., it remains at 37 msec at both frequencies. The average power consumption during T^CON is significantly reduced, from 510 mW to 186 mW (a 64 % reduction), as a result of voltage scaling. We also measured the actual CPU power consumption while playing back a test video clip (Terminator2) at a target frame rate of 14 fps on the XScale-based system with the proposed DVFS method (OL-DVFS); the result is shown in Figure IV-5. For this experiment, OL-DVFS resulted in an energy saving of about 70 %.
Figure IV-4 : Decoding time and power consumption at different CPU frequencies and voltage levels: (a) T^VAR at 733 MHz, T^CON at 666 MHz; (b) T^VAR at 733 MHz, T^CON at 333 MHz (test video (2), Siberian Tiger)

Figure IV-5 : CPU power consumption on the XScale-based platform when running a video clip (Terminator2 at 14 fps): (a) without DVFS, (b) with OL-DVFS

We compared the proposed DVFS method (OL-DVFS) with a conventional DVFS without workload partitioning (CON-DVFS) on six test video clips. The proposed inter-frame compensation is used for both OL-DVFS and CON-DVFS. CON-DVFS refers to the state of the art prior to OL-DVFS and works as follows: the computational workload (i.e., the number of CPU clock cycles needed to decode the frame) is calculated as the total elapsed decoding time multiplied by the current CPU frequency, and voltage and frequency scaling is done as a function of this calculated workload. Figure IV-6 shows the CPU energy savings of a test video for both OL-DVFS and CON-DVFS compared to no DVFS. As we can see, the OL-DVFS method enables much higher energy savings than CON-DVFS as the frame rate increases. The results for the other test videos are summarized in Table IV-2, demonstrating CPU energy savings ranging from 50 % to 80 % for various frame rates. We also compared the OL-DVFS method with a DVFS technique (called MIX-DVFS) that uses the minimum CPU clock frequency for T^CON (this is similar to OL-DVFS) and a policy similar to CON-DVFS for T^VAR. The results are reported in Figure IV-7.
Notice that in this experiment, the minimum CPU frequency is set during T^CON for both OL-DVFS and MIX-DVFS in order to clearly highlight the effect of considering T^off during T^VAR. Inter-frame compensation is not used in either case. As seen in Figure IV-7, T^off identification becomes more effective as the frame rate goes higher. In particular, with off-chip latency separation during T^VAR, a 6.5 % higher energy saving at a frame rate of 14 fps is achieved for test video (5). Finally, Figure IV-8 shows the effectiveness of the inter-frame compensation method.

Figure IV-6 : CPU energy savings using the proposed DVFS, for test video (1), Terminator2, at 12-15 fps (OL-DVFS: 79.97 %, 78.08 %, 72.38 %, 64.33 %; CON-DVFS: 79.68 %, 71.60 %, 40.45 %, 22.17 %)

Table IV-2 : CPU energy saving comparison - OL: OL-DVFS, CON: CON-DVFS (* numbers in parentheses are the frame rates for clip (6))

  fps*      (1) Terminator2       (2) Siberian Tiger    (3) Deploy
            CON        OL         CON        OL         CON        OL
  10        -          -          73.15 %    77.78 %    -          -
  11 (27)   80.46 %    80.75 %    55.49 %    71.39 %    -          -
  12 (28)   79.68 %    79.97 %    43.39 %    60.66 %    -          -
  13 (29)   71.60 %    78.08 %    25.36 %    49.54 %    -          -
  14 (30)   40.45 %    72.38 %    -          -          57.94 %    75.69 %
  15        22.17 %    64.33 %    -          -          35.53 %    64.44 %

Table IV-2 : continued

  fps*      (4) Wg_wt             (5) Badboy2           (6) Final3
            CON        OL         CON        OL         CON        OL
  10        -          -          -          -          -          -
  11 (27)   -          -          -          -          -          -
  12 (28)   -          -          79.33 %    -          -          79.33 %
  13 (29)   75.27 %    77.74 %    78.85 %    75.27 %    77.74 %    78.85 %
  14 (30)   60.59 %    73.18 %    71.34 %    60.59 %    73.18 %    71.34 %
  15        41.33 %    66.99 %    46.99 %    41.33 %    66.99 %    46.99 %
Figure IV-7 : CPU energy savings with off-chip latency separation during T_VAR (OL-DVFS vs. MIX-DVFS for clips (2) Siberian Tiger and (5) Badboy2 at several frame rates)

Figure IV-8 : Frame rate variation with the proposed DVFS (clip (1) Terminator 2, frame rate target: 13 fps, with and without inter-frame compensation)

With this compensation scheme, the run-time frame rate smoothly converges to the target frame rate (here, 13 fps). Notice that the frame rate diverges from the target rate without this compensation, resulting in wasted CPU energy. The reason that the divergent rate is higher (rather than lower) than the target frame rate is that the I- and P-frames need the maximum frequency to meet the deadline, and are unaware of positive timing slacks that are carried over from the previous B-frames.

IV.6 CONCLUSIONS

A DVFS technique for MPEG decoding was proposed and implemented on an XScale-based portable system. In this technique, the computational workload of decoding a frame is partitioned into on-chip and off-chip workload by using dynamic event statistics from the PMU, which results in significant CPU energy savings. To avoid QoS degradation due to misprediction of the on-chip and off-chip latencies, an inter-frame compensation method was proposed, in which the error occurring in a frame is diffused into a small number of subsequent frames and compensated for, with negligible fluctuation in the frame rate. On this platform, CPU energy savings ranging from 50 % to 80 % were achieved for the test video sequences at various frame rates.
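The inter-frame compensation idea described above can be illustrated with a small sketch. This is not the dissertation's implementation; the diffusion window and the frame-budget bookkeeping below are illustrative assumptions:

```python
# Sketch of inter-frame compensation: the timing error (actual minus
# budgeted decode time) of each frame is diffused into the time budgets
# of the next few frames, so mispredictions do not accumulate.

FRAME_BUDGET_MS = 1000.0 / 13          # target frame rate: 13 fps
DIFFUSE_FRAMES = 4                     # spread each error over 4 frames (assumption)

def compensated_budgets(decode_times_ms):
    """Return the per-frame time budget after diffusing past timing errors."""
    pending = [0.0] * DIFFUSE_FRAMES   # error shares owed to upcoming frames
    budgets = []
    for actual in decode_times_ms:
        budget = FRAME_BUDGET_MS - pending.pop(0)
        pending.append(0.0)
        budgets.append(budget)
        error = actual - budget        # positive: frame overran its budget
        share = error / DIFFUSE_FRAMES
        for i in range(DIFFUSE_FRAMES):
            pending[i] += share
    return budgets
```

Because each error is split across several upcoming budgets, a single slow frame slightly tightens the next few budgets instead of shifting the whole schedule, which is the smooth convergence behavior reported in Figure IV-8.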
CHAPTER V

DYNAMIC VOLTAGE AND FREQUENCY SCALING FOR THE SYSTEM ENERGY REDUCTION

V.1 INTRODUCTION

This Chapter presents a DVFS technique that minimizes the total system energy consumption of a task while satisfying a given execution time constraint. To guarantee minimum energy for task execution using DVFS, it is important to divide the system power into two parts: fixed and variable power. Fixed power represents the component of power that remains unchanged during the task execution. Examples include DC-DC converter power and PLL power as well as leakage power dissipation. Variable power captures the component of the system power consumption that changes with time. Examples include the CPU and memory power dissipations as well as I/O controller power. The variable power component is, in turn, decomposed into two subcomponents: idle and active power. As the names imply, active (idle) power is the portion of variable power that is consumed when the system is executing some (no) useful task. We also define standing power as the summation of the fixed plus idle power components of the system. Figure V-1 shows our model of the system power consumption, which can be broken down into the components mentioned above.

Figure V-1: System power breakdown (total system power = fixed + variable; variable = idle + active; standing = fixed + idle, consumed even when a component is not being used; active power is consumed when a component is used for some task)

DVFS can reduce only the active component of system power dissipation.
If this component is large compared to the standing component of system power, then lowering the CPU clock frequency and correspondingly reducing the supply voltage of the CPU (while meeting the task execution time deadline) will result in lower system energy consumption, due to the linear relation between the CPU cycle time and voltage and the quadratic relation between the CPU power consumption and voltage. On the other hand, if the active component of system power is small compared to the standing component, then slowing down the CPU may in fact increase the system energy consumption, due to the increase in the task execution time and the dominance of the standing power dissipation.

In this Chapter, we present a new DVFS technique which considers not only the active power, but also the standing power components of the system. This is in sharp contrast to previous DVFS techniques, which only consider the active power component. The standing components of the system power are measured by monitoring the system power when it is idle. The active component of the system power is estimated at run time by a technique known as workload decomposition, whereby the workload of a task is decomposed into on-chip and off-chip parts based on statistics reported by a performance monitoring unit (PMU), with which most modern processors such as the XScale-80200 [47] or PXA255 [48] come equipped.

V.2 RELATED WORKS

There have been many studies on DVFS for either real-time or non-real-time operation. However, most previous approaches focused solely on CPU energy saving, based on the two assumptions mentioned in Chapter II: an inverse relationship between execution time and operating frequency, and a cubic relationship between the system power and operating frequency.
There are different DVFS approaches that make use of the asynchrony of memory accesses with respect to the CPU clock during task execution [12][13][21][29][30][51][86]. These approaches also considered CPU energy reduction only, without accounting for the fixed portion of the system power. There are some works that consider the fixed power of subcomponents when applying DVFS techniques. References [52] and [53] have suggested that the power consumption of the memory should be taken into account. Reference [64] reported that there is a lower bound on the CPU frequency such that any further slowing down degrades the amount of computation that can be performed per battery discharge. The authors also allude to the problem that the high power cost of memory may dominate the total energy consumption of a system, such that even a DVFS scheme that is effective for CPU energy saving might be less effective in terms of system energy.

V.3 DVFS FOR THE SYSTEM ENERGY REDUCTION

V.3.1 MODELING THE SYSTEM POWER CONSUMPTION

We consider a computing system consisting of a CPU with a variable operating frequency f_n, where f_min <= f_n <= f_max. Let P_cpu,fn denote the CPU power dissipation at f_n. The system also includes N system modules. Let P_mod,i denote the power dissipation of the i-th module. Then, we can write the following:

P_cpu,fn = P^act_cpu,fn + P^std_cpu,fn,    P_mod,i = P^act_mod,i + P^std_mod,i     (V.1)

P^act_cpu,fn is the active portion of P_cpu,fn. P^std_cpu,fn is the standing portion of P_cpu,fn, which is in turn the summation of the idle portion (P^idle_cpu,fn) plus the fixed portion (P^fix_cpu). P^act_mod,i is the active portion of P_mod,i when the i-th module is being accessed, whereas P^std_mod,i denotes the standing portion of P_mod,i when the i-th module is not accessed, which is equal to the idle component of the i-th module, P^idle_mod,i.
Here, it is assumed that the idle component and the fixed component of the power dissipation of a system module are the same as one another because, generally speaking, the operating clock and voltage of the modules are not dynamically varied, but remain fixed. The required system energy to complete a task in time T with a CPU clock frequency of f_n is given by:

E_sys,fn = Integral from 0 to T of P_sys,fn(t) dt     (V.2)

where P_sys,fn(t) is the time-varying system power at f_n and is in turn calculated as:

P_sys,fn(t) = P^std_cpu,fn + Sum_{i=1..N} P^std_mod,i + P^act_cpu,fn(t) + Sum_{i=1..N} P^act_mod,i(t)     (V.3)

Here, P^std_cpu,fn + Sum_{i=1..N} P^std_mod,i denotes the standing system power consumption, P^std_sys,fn, whereas P^act_cpu,fn(t) + Sum_{i=1..N} P^act_mod,i(t) denotes the active system power consumption, P^act_sys,fn(t). Generally speaking, it is difficult to accurately calculate P^act_sys,fn(t) because the power requirement of each instruction is different. For example, considering the instructions in the dynamic trace of an application program running on the system, the CPU is used to execute the on-chip workload, whereas the memory is required to execute the off-chip workload. When the on-chip and off-chip workloads are executed in an interleaved manner during program execution, P^act_sys,fn(t) fluctuates severely, as shown in Figure V-2(a). However, once the workload of a task is decomposed into on-chip and off-chip parts, P_sys,fn(t) can be modeled as:

P_sys,fn(t) = P^std_sys,fn + P^act_cpu,fn                         during T^on_fn
P_sys,fn(t) = P^std_sys,fn + Sum_{i=1..N} P^act_mod,i             during T^off_fn     (V.4)

Figure V-2(b) shows P_sys,fn(t) after workload decomposition. Hence, E_sys,fn after workload decomposition is given as:

E_sys,fn = [P^act_cpu,fn + P^std_sys,fn] * T^on_fn + [Sum_{i=1..N} P^act_mod,i + P^std_sys,fn] * T^off_fn     (V.5)
Figure V-2 : System power consumption during task execution: (a) without workload decomposition, (b) with workload decomposition

In Eq.(V.5), P^act_cpu,fn and P^std_sys,fn can easily be obtained from simple measurements on the target system by running benchmark programs at different CPU frequencies, but it is difficult to obtain Sum_{i=1..N} P^act_mod,i because complete information about the system component usage of the target program is not available at run time. In practice, Sum_{i=1..N} P^act_mod,i is approximated by the memory power for the general applications used in this work, because the memory is the most frequently used system component and the power consumed in the memory accounts for more than half of the system power in our target system. Accordingly, we include the power consumption of all other system components in P^std_sys,fn.

V.3.2 SYSTEM ENERGY VS. CPU FREQUENCY

As shown in Eq.(V.5), E_sys,fn is a function of various parameters of the system configuration (P^act_cpu,fn, P^std_sys,fn, and P^act_mem) and of the application program (T^on_fn and T^off_fn). Depending on these parameters, an optimal CPU frequency which results in task execution with minimum system energy consumption is determined, as explained next. The system energy equation (V.5) is rewritten as:

E_sys,fn = P^act_cpu,fn * T^on_fn * [ 1 + P^std_sys,fn / P^act_cpu,fn + ((P^act_mem + P^std_sys,fn) / P^act_cpu,fn) * (T^off_fn / T^on_fn) ]     (V.6)

where P^act_mem is the memory power, used instead of Sum_{i=1..N} P^act_mod,i in Eq.(V.5) as mentioned before. The case in which P^std_sys,fn and P^act_mem are both zero corresponds to the situation assumed in previous DVFS works, where a purely CPU-intensive task is executed on a system consisting of only one CPU with no standing power. In that case, lowering the CPU frequency always results in system energy saving.
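The tradeoff expressed by Eq. (V.6) can be explored numerically. The sketch below is illustrative only: the workload size, the memory power, and the cubic CPU-power scaling are assumed values, not measurements from this work:

```python
# Numerical exploration of Eq. (V.6): system energy for one task as a
# function of CPU frequency. All power values and the workload are
# illustrative assumptions.

W_ON = 2e8            # on-chip CPU cycles for the task (assumed)
T_OFF = 0.5           # off-chip (memory) time in seconds, frequency-independent here
P_MEM_ACT = 0.8       # active memory power [W] (assumed)

def cpu_active_power(f_hz):
    # P^act_cpu ~ V^2 * f with V proportional to f  =>  cubic in f (assumed scaling)
    return 2.0 * (f_hz / 400e6) ** 3

def system_energy(f_hz, p_standing):
    t_on = W_ON / f_hz
    return (cpu_active_power(f_hz) + p_standing) * t_on \
         + (P_MEM_ACT + p_standing) * T_OFF

freqs = [100e6, 200e6, 300e6, 400e6]
best_no_standing = min(freqs, key=lambda f: system_energy(f, 0.0))
best_standing = min(freqs, key=lambda f: system_energy(f, 2.0))
```

With zero standing power the minimum-energy frequency is the lowest one, while a large standing power pushes the energy-optimal frequency upward; this is exactly the dependence on the standing-power terms that Eq. (V.6) captures.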
Assuming a linear relationship between the operating voltage and frequency (so that P^act_cpu,fn is proportional to f_n^3 and T^on_fn is proportional to 1/f_n), E_sys,fn depends on f_n in the following form:

E_sys,fn = a * f_n^2 + b / f_n + c     (V.7)

where a, b, and c are constant coefficients. In particular, b and c represent the amounts of standing power in the total system power dissipation. Subsequently, the optimal CPU frequency which gives the minimum system energy, f_opt, is calculated as (0.5 * b / a)^(1/3) by taking the derivative of Eq.(V.7). If b is zero, then f_opt is f_min, but f_opt increases as b increases.

V.4 DESCRIPTION OF THE TARGET SYSTEM

V.4.1 BITSYX PLATFORM

Our target system for DVFS is the BitsyX system from ADS Inc. [91]. BitsyX has a PXA255 microprocessor, a 32-bit RISC processor core with a 32 KB instruction cache, a 32 KB write-back data cache, a 2 KB mini-cache, a write buffer, and a memory management unit (MMU) combined in a single chip. It can operate from 100 MHz to 400 MHz, with a corresponding core supply voltage of 0.85 V to 1.3 V. The configuration of the PXA255 processor is summarized in Table V-1.

Table V-1 : Intel PXA255 processor configuration

Unit                                    Configuration
Instruction Cache                       32 Kbytes, 32 ways
Data Cache                              32 Kbytes, 32 ways
Mini-Data Cache                         2 Kbytes, 2 ways
Branch Target Buffer                    2 Kbytes, 2 ways
Instruction Memory Management Unit      32-entry TLB, fully associative
Data Memory Management Unit             32-entry TLB, fully associative

Power for the PXA255 core is provided externally through an on-board variable voltage generator. There are nine different frequency combinations, F1 to F9.
Each combination is given as a 3-tuple consisting of the processor clock frequency (f_cpu), the internal bus clock frequency (f_int), and the external bus clock frequency (f_ext). These frequency combinations and the appropriate CPU voltage levels are reported in Table V-2.

Table V-2 : Frequency combinations in the BitsyX system

Freq. set   f_cpu (MHz)   CPU Volt. (V)   f_int (MHz)   f_ext (MHz)
F1          100           0.85            50            100
F2          133           0.85            66            133
F3          200           1.0             50            100
F4          200           1.0             100           100
F5          265           1.0             133           133
F6          300           1.1             50            100
F7          300           1.1             100           100
F8          400           1.3             100           100
F9          400           1.3             200           100

The internal bus connects the core and the other functional blocks inside the CPU, such as the I/D-cache unit and the memory controller, whereas the external bus in the target system is connected to 64 MB of SDRAM, as shown in Figure V-3. A photo of the BitsyX system is shown in Figure V-4.

Figure V-3 : Clock distribution diagram in the PXA255 processor (a 3.6864 MHz oscillator feeds a PLL that generates the 100-400 MHz clock for the CPU core, memory controller, and LCD controller over the PX bus)

Figure V-4 : BitsyX system

V.4.2 EXECUTION TIME MODEL IN BITSYX

To derive a suitable execution timing model for BitsyX, five different applications were run over all frequency sets, F1 to F9, and the total execution time for each case was measured, as shown in Figure V-5.

Figure V-5 : Execution time variation over different frequency combinations ("math", "crc", "djpeg", "qsort", and "gzip")

Figure V-5 reports the execution time of each application for each frequency setting, normalized to the execution time at the maximum performance setting, i.e., setting F9.
From Figure V-5, we can easily see that "math", "crc", and "djpeg" are more CPU-intensive than the "gzip" and "qsort" applications, since lowering the CPU frequency for these applications introduces a significant execution time increase compared to the "gzip" and "qsort" cases. Comparing the execution times of settings F1, F3, and F6 (where only the CPU frequency is different, while all other clocks are the same) also validates this observation. In fact, this comparison allows us to determine that "gzip" is more memory-bound than "qsort" by looking at the execution time variation with respect to the CPU frequency only. The same observations can be made by examining settings F4, F7, and F8, which again differ from each other only in terms of the CPU clock frequency. It should be noted that when frequency scaling is performed in BitsyX, not only f_cpu is changed, but f_int and f_ext are scaled as well. Therefore, the effect of f_int and f_ext on the total program execution time should also be considered. The execution time T is the sum of T^on and T^off, as in Eq.(II-5) in Chapter II, and clearly T^off is strongly dependent on the external clock frequency. However, an important observation from the data reported in Figure V-5 is that the internal bus clock frequency also affects T^off. The relation between the internal bus clock and T^off can be understood from a closer examination of the operations performed during an external memory access. For example, a D-cache miss requires two operations: a data fetch from the external memory, and a data transfer to the CPU core where the cache line and destination register are updated. The time needed for the latter operation is obviously affected by the internal bus frequency.
Due to the lack of exact timing information about these two operations that are performed during a D-cache miss service, we have opted to model T^off as a function of both the internal bus clock frequency and the external memory access clock as follows:

T^off = T^off_int + T^off_ext = alpha * W^off / f_int + W^off / f_ext     (V.8)

where alpha is the ratio between the data transfer time (T^off_int) and the data fetch time (T^off_ext), and f_int is the internal bus clock frequency. Based on experimental results for various application programs, the average error in predicting the execution time over all applications and all frequency combinations was less than 2 % with an alpha value of 0.35, as shown in Figure V-6.

Figure V-6 : Execution time estimation error

V.4.3 ENERGY CONSUMPTION MODEL FOR BITSYX

The measured energy consumption of each application is presented in Figure V-7. The frequency combination at which the minimum energy is consumed is not F1, which has the minimum CPU frequency of 100 MHz, for any of the tested applications. On the contrary, F1 causes the largest system energy among all frequency combinations.

Figure V-7 : Energy consumption over different frequency combinations ("math", "crc", "djpeg", "qsort", and "gzip")

The energy trend across frequency sets is similar to the execution time variation in Figure V-5; i.e., less execution time results in less system energy. This result is due to the fact that there is a standing component in the total system power during task execution, so it is better to finish a program as soon as possible for less
system energy. One more important observation from Figure V-7 is that the minimal-energy frequency varies depending on the application, i.e., on whether it is CPU-intensive or memory-intensive. For example, F9 (CPU frequency of 400 MHz) gives minimum energy for "math", which is the most CPU-intensive application, whereas F5 (CPU frequency of 265 MHz) does so for "gzip", which is the most memory-intensive. The reason why F5 is best for "gzip" is that F5 provides the fastest conditions for the memory access operations: both the memory clock and the internal bus clock are 133 MHz. The energy consumption model for BitsyX is shown in Figure V-8. The terms used in this figure are defined in Table V-3.

Figure V-8 : Total system power consumption during execution (T = T^on_Fn + T^off_int,Fn + T^off_ext,Fn, with the active power levels P^on_Fn, P^off_int,Fn = k1*V^2_Fn*f^cpu_Fn + k2*f^int_Fn, and P^off_ext,Fn rising above the standing power P^std_sys,Fn)

Table V-3 : Definition of used terms

Term               Explanation
F_n                the n-th frequency setting, (f^cpu_Fn, f^int_Fn, f^ext_Fn)
T^on_Fn            on-chip computation time at F_n
T^off_int,Fn       data update time after a fetch from memory at F_n
T^off_ext,Fn       data fetch time from memory at F_n
P_sys,Fn(t)        time-varying system power at F_n
P^std_sys,Fn       standing power in P_sys,Fn(t)
P^on_Fn            active power in P_sys,Fn during T^on_Fn
P^off_int,Fn       active power in P_sys,Fn during T^off_int,Fn
P^off_ext,Fn       active power in P_sys,Fn during T^off_ext,Fn
V_Fn               CPU operating voltage at F_n
k1                 fitting coefficient for P^off_int,Fn, [nF]
k2                 fitting coefficient for P^off_int,Fn, [V^2-nF]
Here, P^off_int,Fn is represented as k1 * V^2_Fn * f^cpu_Fn + k2 * f^int_Fn, since the power consumption during T^off_int,Fn is a function of the CPU voltage/frequency (for the data update into the destination register and cache line) and of the voltage level of the internal bus clock generator (assumed to be 3.3 V) for the data transfer to the CPU. k1 and k2 are the coefficients that relate P^off_int,Fn to the CPU frequency/voltage and to the internal bus clock frequency/voltage, respectively. P^std_sys,Fn is obtained by measuring the system power in all frequency settings when the system is idle. P^on_Fn, which is the active component of the CPU power, is the difference between the measured power when a CPU-intensive task is running and P^std_sys,Fn. P^off_ext,Fn is the power consumption of accessing the memory. The main memory has a total size of 64 MB, comprising two 32 MB SDRAMs. For each 32 MB SDRAM, we used the data sheet values [93] of 446 mW when the SDRAM is being accessed and 132 mW when it is in the idle mode. Therefore, P^off_ext,Fn can be calculated as 2*(446 mW - 132 mW)/0.8 = 785 mW, where the factor of 0.8 represents the efficiency of the DC-DC converter (12 V to 3.3 V conversion). We performed a curve fitting procedure on the measured power values to obtain k1 and k2, and found them to be 0.73 and 6.2, respectively. The extracted parameters are summarized in Table V-4.

Table V-4 : Extracted parameters for system energy estimation

        P^std_sys,Fn (mW)   P^on_Fn (mW)   P^off_int,Fn (mW)   P^off_ext,Fn (mW)
F1      1665                89             363                 785
F2      1757                148            479                 785*1.33
F3      1699                218            456                 785
F4      1728                217            766                 785
F5      1836                336            1018                785*1.33
F6      1732                344            575                 785
F7      1778                378            885                 785
F8      1869                673            1113                785
F9      1963                675            1733                785

The system energy for a task at F_n, E_sys,Fn, is given as:

E_sys,Fn = P^std_sys,Fn * T + P^on_Fn * T^on_Fn + P^off_int,Fn * T^off_int,Fn + P^off_ext,Fn * T^off_ext,Fn     (V.9)

Figure V-9 shows the estimated energy consumption of "djpeg", computed using Eq.(V.9) and the extracted parameters over all frequency combinations, and compares these estimated energy values with the actually measured ones. The average error rate for "djpeg" is less than 4 %, and for the other applications the error rate is about 3 %.

Figure V-9 : Accuracy of the proposed models for power consumption and execution time (measured vs. estimated energy of "djpeg" over frequency combinations F1 to F9)

V.5 PROPOSED DVFS POLICY

V.5.1 SCALING GRANULARITY

An ideal DVFS would change the voltage/frequency values instantaneously. In reality, however, it takes time to change the CPU frequency/voltage due to factors such as the internal PLL (phase-locked loop) locking time and the capacitances that exist in the voltage path. For the PXA255 processor, the latency for switching the CPU voltage/frequency is 500 μsec [91]. In order to safely ignore this scaling overhead, the minimum quantum of time for scaling the CPU frequency/voltage must be at least two to three orders of magnitude larger than this switching latency. At the same time, we would like to minimize the overhead of the voltage/frequency scaling as far as the OS is concerned.
Correspondingly, we use the start times of the (OS) quanta (approximately 50 msec in Linux) used by the OS to schedule processes as the DVFS decision points; that is, each time the OS invokes the scheduler to schedule processes for the next quantum, we also decide whether or not the CPU voltage/frequency should be changed, and if so, we scale the voltage/frequency of the CPU.

V.5.2 CALCULATING THE AVERAGE ON-CHIP CPI

Static calculation of the W^on and W^off of a program, e.g., at compilation time, is very difficult because the on/off-chip latencies are greatly affected by dynamic behavior such as cache statistics and the different access overheads of various external devices. Therefore, these dynamic behaviors should be captured at run time. We calculate the W^on and W^off of a program at run time by using the processor's PMU. The PMU consists of a clock counter and two other counters, each of which can monitor one of 15 different events, including cache hit/miss, TLB hit/miss, and the number of executed instructions. The overhead of accessing the PMU (for both read and write operations) is less than 1 μsec [47] and can thus be ignored. However, there is a limitation in using these events, in the sense that only two events can be monitored at the same time along with the number of clock counts in a quantum (CCNT); the same limitation exists in the XScale-80200 processor. For our DVFS technique with workload decomposition, we need the on-chip CPI, CPI^avg_on, to separate the workload. Our approach is similar to the one used in Chapter III, where the number of memory bus transactions and the executed instruction count were used to accurately estimate the on-chip CPI.
Since the PMU in the PXA255 does not provide support for counting the number of memory bus transactions, we have used the following three events, chosen based on extensive experiments: (i) the number of instructions executed (INSTR), (ii) the number of stall cycles due to data dependencies (STALL), and (iii) the number of D-cache misses (DMISS). INSTR is required to obtain the CPI value, which indirectly represents the amount of off-chip workload. STALL captures the number of clock cycles during which the CPU is stalled due to data dependencies, either on-chip stalls from internal register dependencies or off-chip stalls from external memory accesses. Note that DMISS is not exactly equivalent to the off-chip access count because of the "miss-under-miss" capability provided by the "fill buffer" and "pending buffer" in the PXA255 microarchitecture [49]: when a D-cache miss requests data in the same cache line as a previous D-cache miss, an external memory access does not occur for the current D-cache miss. In spite of this complication, the D-cache miss event can be used as a clue for determining whether a task is CPU-bound or memory-bound. At the end of every quantum, the INSTR and STALL event statistics, along with the number of clock counts in the quantum (CCNT) given by the clock counter, are read from the PMU. From these values we calculate the average number of CPU clock cycles per instruction (CPI^avg) as CCNT/INSTR. Similarly, the average number of stalls per instruction (SPI^avg) is calculated. SPI^avg accounts for both the on-chip stalls (SPI^avg_on) and the off-chip stalls (SPI^avg_off). Figure V-10 plots the SPI^avg of each quantum on the x-axis and the CPI^avg on the y-axis while executing the (a) "gzip", (b) "qsort", (c) "djpeg", and (d) "math" applications under the different frequency settings of Table V-2.
From this figure, we can easily see that CPI^avg is linearly related to SPI^avg as follows:

CPI^avg = k * SPI^avg + c     (V.10)

where k is the slope (~1). Notice that the y-intercept c is equal to the average on-chip CPI without any stall cycles, CPI^min_on.

Figure V-10 : Contour plots of CPI^avg vs. SPI^avg for different clock frequency combinations: (a) "gzip" (b) "qsort" (c) "djpeg" (d) "math"

To obtain CPI^avg_on, which is obviously equal to CPI^min_on + SPI^avg_on, it is required to extract SPI^avg_on from SPI^avg. Figure V-11 shows a method of obtaining SPI^avg_on from the D-cache miss statistics.

Figure V-11 : SPI^avg_on extraction using DPI (CPI^avg_on = CPI^min_on + dpi2spi(DPI), where dpi2spi(DPI) steps down from (n-1)*DF to 0 as DPI crosses the thresholds K1 < K2 < ... < Kn, with DF = (CPI^max_on - CPI^min_on)/n)

First, we consider the y-intercept of the above line, i.e., the CPI^avg when no data stalls occur, as the lower bound on the on-chip CPI (CPI^min_on). The CPI^avg at the lowest SPI^avg value (SPI^min) is considered as the upper bound (CPI^max_on). CPI^avg_on is estimated from both CPI^min_on and CPI^max_on, together with the DMISS value of the quantum. The intuition behind using DMISS to calculate CPI^avg_on is that if the number of data cache misses is high, most of the stalls are off-chip stalls. Therefore, if the value of DMISS is high (low), then a CPI^avg value close to CPI^min_on (CPI^max_on) is
Therefore, if the value of DMISS is high (low) then a CPfvg value close to C P rin o n (CPf^on) is R ep ro d u ced with p erm issio n o f th e copyright ow n er. Further reproduction prohibited w ithout p erm ission . 122 chosen. Let DPI denotes D-cache miss count per instruction, defined as DMISS/INSTR. We equally divided the region from C P fm xo n to C P fn m „ n, into n sub-regions and each region is selected with the reported DPI value, which results in CPF*an = C P fV 8 m in + C P fv g (DPl), where C P fV 8 (DPl) is C P f V 8 value for the corresponding DPI and increases (decreases) as the DPI value decreases (increases). Our DVFS approach requires three events: INSTR, STALL and DMISS. Since PXA255’s PMU can only provide two event statistics at a time, the PMU must be read twice in every quantum: (INSTR, STALL) pair is read during the first half whereas (INSTR, DMISS) pair is read during the second half of every quantum. V.5.3 DETERMINING THE OPTIMAL FREQUENCY SETTING In the proposed DVFS policy, an optimal frequency is determined considering both the timing constraints and minimum system energy consumption. As a timing constraint for non real-time applications, we used performance loss (PFi0 S S ) which is defined as the increased execution time of a program due to lowered clock frequency and given as [12] [86]: R ep ro d u ced with p erm issio n o f th e copyrigh t ow n er. Further reproduction prohibited w ithout p erm ission . 123 (V .ll) where Fm ix is the best performance frequency combination, i.e., Fg, and TF are the total task execution time at frequency combination of Fn and Fm ix, respectively. After obtaining the C P fvg value for the current quantum i, CPFvso n ,u we calculate where A ,- is the number of executed instructions, T, and f cp u i,n are the execution time and the CPU frequency in Fn during the quantum i, respectively. 
W°ni and W°ffi are derived from the calculated values of 7°" and T°ffi and it is assumed that W°”!+ ; and W°ffi+i are equal to W°”/ and respectively. An optimal frequency set for the quantum i+1, F )p ti+ i, is determined as following: on-chip and off-chip execution times for this quantum, T"n and as follows: (V.12) J i,n R ep ro d u ced with p erm issio n o f th e copyrigh t ow n er. Further reproduction prohibited w ithout p erm ission . 1 2 4 1 . 'F={F1, ..., F9}, r ={</)}, and Em in = 0 0 2 . fo r every frequency setting Fn in W 3. i f ( T ^ < ( l + PFloss)-T ’ J 4. r = r u F n; 5. fo r every frequency setting F„ in r 6 . calculate EsysF J'rom Eq.(V-9) 7. if( E syS,Fn ~ E min ) 8 . E min = E sys,Fn > ' E>P ‘+1 ~ E n where T ‘ f+ i is the expected execution time of quantum i+1 at Fn and TF j is the execution time of quantum i at Fg. ) V.6 EXPERIM ENTAL RESULTS We implemented the proposed policy on the BitsyX platform, which runs Linux (v2.4.17). Precisely speaking, we wrote a software module implementing the proposed policy. This module is tied to the linux OS scheduler in order to allow voltage scaling to occur at every context switch. To show the effectiveness of the proposed DVFS method considering system energy (SE-DVFS), we also R ep ro d u ced with p erm issio n o f th e copyright ow n er. Further reproduction prohibited w ithout p erm ission . 125 implemented the DVFS method used in [12], which considers CPU energy only (CE-DVFS), and compared the results each other. To measure the power consumption of the system, we inserted a 0.125 ohm precision resistor between the external power source (~12 V) and the system power line. The actual power consumption at run time was measured by using a data acquisition system which operates up to 100 KHz sampling frequency by reading voltage drop across the precision resistor [92]. 
Our experiments were performed on a number of applications, including a common UNIX utility program, "gzip", and four representative benchmark programs available on the web [90]. Figure V-12 shows the measured performance degradation with target performance loss ranging from 10% to 50% in steps of 10% for both CE-DVFS and SE-DVFS. As seen in this figure, using CE-DVFS in case (a), we obtained actual performance loss values very close to the target values for all programs (i.e., actual average within 1.5% of the target), whereas the performance loss values obtained using SE-DVFS saturate at 2% and 12% for CPU-intensive and memory-intensive applications, respectively, even with a 50% target. This is due to the significant fixed power in the target system: in terms of total system energy saving, it is better to finish the task as soon as possible.

(a) conventional DVFS (CE-DVFS); (b) proposed DVFS (SE-DVFS); bars for math, crc, djpeg, qsort, and gzip at target PF_loss of 10% to 50%
Figure V-12 : Actual performance: CE-DVFS and SE-DVFS

Figure V-13 shows the achieved system energy saving with (a) CE-DVFS and (b) SE-DVFS running the benchmark programs at various performance loss values.

(a) CE-DVFS; (b) SE-DVFS; bars for math, crc, djpeg, qsort, and gzip at target PF_loss of 10% to 50%
Figure V-13 : System energy saving: SE-DVFS and CE-DVFS
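The energy numbers behind these figures come from the sense-resistor setup described above. As a rough sketch (the sample values and helper names are illustrative; only the 0.125-ohm resistor, ~12 V supply, and 100 kHz sampling rate come from the text), the DAQ samples can be converted to power and energy like this:

```python
# Convert DAQ samples of the sense resistor's voltage drop into average
# power and total energy. Hypothetical helper names; setup values from text.
R_SENSE = 0.125      # ohms, precision resistor in the supply line
V_SUPPLY = 12.0      # volts, external power source

def power_trace(v_drop_samples):
    """Instantaneous system power for each sampled resistor voltage drop."""
    # I = V_drop / R; P ~= V_supply * I (the drop across R is small vs. 12 V)
    return [V_SUPPLY * v / R_SENSE for v in v_drop_samples]

def total_energy(v_drop_samples, sample_rate_hz):
    """Total energy in joules by rectangular integration of the power trace."""
    dt = 1.0 / sample_rate_hz
    return sum(power_trace(v_drop_samples)) * dt
```

A 25 mV drop, for example, corresponds to a 0.2 A draw and thus roughly 2.4 W of system power.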
System energy saving is calculated by comparing the measured system energy when applying a DVFS method with that of the no-DVFS case, in which programs were run at F9. From this figure, it is found that the system energy increased for all applications when applying CE-DVFS, whereas there are energy savings for the "math", "crc", and "gzip" programs with SE-DVFS, and little change in the system energy is observed for "djpeg" and "qsort". These results correspond to the data in Figure V-7. For example, the frequency set for minimum energy consumption of "djpeg" has the same CPU frequency, 400 MHz, as F9. CE-DVFS does not consider the system energy; it only satisfies the timing constraint. So, as the target performance loss increases, a lower frequency set is chosen by CE-DVFS, resulting in a system energy increase. This is not the case with SE-DVFS, which maintains the minimum system energy.

Figure V-14 depicts the power consumption waveform of the BitsyX system when running "gzip" with a 30% target performance degradation factor using (a) CE-DVFS and (b) SE-DVFS. In case (a), using CE-DVFS, the average power is less than that of case (b) since the active power is reduced. However, due to the fixed energy increased by the lowered frequency, the total system energy is higher in the case of CE-DVFS. For this application, SE-DVFS requires 11.4% less system energy than CE-DVFS.

(a) CE-DVFS: average power 2619 mW; (b) SE-DVFS: average power 2720.3 mW, 5.568 sec
Figure V-14 : Actual power consumption of two DVFS methods
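The gzip waveforms make the key point: lower average power does not imply lower energy when the fixed power is significant. A toy calculation with hypothetical numbers (not the measured BitsyX values) illustrates why a slower run can cost more total energy:

```python
# Toy illustration with hypothetical numbers: total energy at full speed vs.
# half CPU frequency, with a fixed power component. Variable power is assumed
# to scale as f^3 (voltage scaled with frequency); run time is assumed to
# scale inversely with frequency for a CPU-bound task.
P_FIXED = 2.0        # watts drawn regardless of CPU speed (assumed)
P_VAR_FULL = 1.5     # watts of frequency/voltage-dependent power at full speed
T_FULL = 5.0         # seconds at full speed

def energy(speed_ratio):
    """System energy when the CPU runs at speed_ratio of full frequency."""
    t = T_FULL / speed_ratio                # CPU-bound: time scales inversely
    p_var = P_VAR_FULL * speed_ratio ** 3   # ~ C*V^2*f with V scaled with f
    return (P_FIXED + p_var) * t

e_full = energy(1.0)   # (2.0 + 1.5)    *  5 s = 17.5 J
e_half = energy(0.5)   # (2.0 + 0.1875) * 10 s = 21.875 J
```

Even though halving the frequency cuts the variable power by 8x, the doubled runtime accumulates so much fixed energy that the total rises, which is exactly why SE-DVFS refuses to slow CPU-bound tasks.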
Results for other applications with different performance target values are shown in Figure V-15; about 2% to 18% of the total system energy is saved by using SE-DVFS compared with CE-DVFS. From these measurements, we conclude that our proposed SE-DVFS technique is quite helpful in extending the overall system lifetime.

Bars for math, crc, djpeg, qsort, and gzip at target PF_loss of 10% to 50%
Figure V-15 : System energy difference: SE-DVFS vs. CE-DVFS

V.7 CONCLUSIONS

In this chapter, a DVFS policy for actual system energy reduction was proposed and implemented on a PXA255-based platform. In the proposed DVFS approach, the program execution time and the system energy required by the program are estimated quite accurately using workload decomposition, in which the execution time of the program is decomposed into on-chip computation and off-chip access latencies. The system power is likewise decomposed into variable and fixed power and is estimated very accurately using the decomposed execution time. The CPU voltage/frequency is scaled based on the ratio of the on-chip and off-chip latencies for each process such that both a given performance degradation factor and minimal energy consumption are satisfied. This ratio is given by a regression equation, which is dynamically updated based on runtime event monitoring data provided by an embedded performance monitoring unit. Through actual current measurements in hardware, we demonstrated that up to 12% less system energy was consumed with the proposed DVFS compared with previous DVFS techniques. For both CPU-bound and memory-bound programs, the given timing constraints were also satisfied.
CHAPTER VI
CONCLUSIONS AND FUTURE RESEARCH DIRECTION

In this thesis, DVFS techniques based on workload decomposition were proposed, and the effectiveness of the proposed techniques was verified through actual measurements by implementing them on real computing systems. The key idea of workload decomposition is to exploit the asynchrony between off-chip access times and the CPU frequency. Thus, a lower CPU frequency can be set during off-chip accesses with little performance degradation, which yields greater CPU energy savings. Our proposed workload decomposition is done with the aid of hardware embedded in the CPU, called the Performance Monitoring Unit (PMU), which reports various dynamic events that occur at runtime; these are in turn very useful in dividing the workload into on-chip and off-chip components.

Another advantage of using workload decomposition is that accurate estimation/prediction of the system power dissipation at runtime becomes possible, so that the optimal processing speed to maximize the lifetime of a battery-operated system may be achieved. This is due to the fact that the power consumption required for on-chip and off-chip workload execution, which tends to be quite different, is available in the specifications of the individual system components.

Future work may be pursued in two directions: one is to reduce the scaling overhead, and the other is to combine support from the OS or device drivers into the proposed DVFS policies. Currently, the transition cost of voltage/frequency scaling is at least a few microseconds, which is still large compared to the off-chip access time. (For example, the access time to 100 MHz SDRAM is about 250 nsec.) If the scaling cost can be made negligible, then the proposed DVFS policies will become much more effective in terms of energy saving.
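A back-of-the-envelope check makes the overhead argument concrete. Assuming a transition cost of roughly 10 microseconds (a hypothetical figure; only the 250 ns SDRAM access time comes from the text), scaling can only pay off over intervals containing many off-chip accesses:

```python
# Amortization check: how many back-to-back off-chip accesses an interval
# must contain before the frequency/voltage transition cost drops below a
# given percentage of the interval. Integer nanoseconds keep this exact.
T_SWITCH_NS = 10_000   # transition cost in ns (assumed, ~10 us)
T_SDRAM_NS = 250       # one off-chip SDRAM access at 100 MHz, from the text

def min_accesses_to_amortize(overhead_percent=1):
    """Number of off-chip accesses an interval must hold before the
    transition cost falls under overhead_percent % of that interval."""
    interval_ns = T_SWITCH_NS * 100 // overhead_percent
    return -(-interval_ns // T_SDRAM_NS)   # ceiling division
```

Under these assumptions, keeping the transition overhead below 1% requires an interval spanning thousands of off-chip accesses, which is why per-access scaling is impractical and quantum-based scaling is used instead.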
One possible solution is to use a multiplexing scheme in which all usable frequency levels are generated in advance and the target frequency is applied to the CPU through a multiplexer. In this method, the transition cost will be quite small compared to typical frequency generation because there is no need to wait until the PLL is relocked.

The proposed DVFS policies depend solely on dynamic events from the PMU. Our technique can be improved if additional support from the OS and/or device drivers is available. For example, there may be execution statistics that cannot be captured by the PMU, such as data about the direct memory access (DMA) operations between sub-components in the system. Without knowing each system component's state (e.g., active or idle), it is difficult to estimate the system power dissipation at runtime. However, this information can be obtained from either the OS or the device driver for the component of interest.

Bibliography

[1] The 2004 International Technology Roadmap for Semiconductors. http://public.itrs.net
[2] N. AbouGhazaleh, D. Mosse, B. Childers, and R. Melhem, “Toward the placement of power management points in real time applications,” Proc. of the Workshop on Compilers and Operating Systems for Low Power, Sep. 2001
[3] A. Azevedo, I. Issenin, R. Cornea, R. Gupta, N. Dutt, A. Veidenbaum, and A. Nicolau, “Profile-based dynamic voltage scheduling using program checkpoints in the COPPER framework,” Proc. of the Design Automation and Test in Europe Conference, March 2002, pp. 168-176
[4] A. Azevedo, R. Cornea, I. Issenin, R. Gupta, N. Dutt, A. Nicolau, and A. Veidenbaum, “Architectural and compiler strategies for dynamic power management in the COPPER project,” Proc. of the International Workshop on Innovative Architecture, Jan.
2001
[5] A. Bavier, A. Montz, and L. Peterson, “Predicting MPEG execution times,” Proc. of the International Conference on Measurement and Modeling of Computer Systems, 1998, pp. 131-140
[6] F. Bellosa, “The benefits of event-driven energy accounting in power-sensitive systems,” Proc. of the 9th ACM SIGOPS European Workshop, Sep. 2000
[7] T. Burd, T. Pering, A. Stratakos, and R. Brodersen, “A dynamic voltage scaled microprocessor system,” IEEE Journal of Solid-State Circuits, vol. 35, no. 11, Nov. 2000, pp. 1571-1580
[8] A. Chandrakasan, V. Gutnik, and T. Xanthopoulos, “Data driven signal processing: an approach for energy efficient computing,” Proc. of the International Symposium on Low Power Electronics and Design, 1996, pp. 347-352
[9] L. Chandrasena, P. Chandrasena, and M. Liebelt, “An energy efficient rate selection algorithm for quantized dynamic supply voltage scaling,” Proc. of the International Symposium on System Synthesis, Oct. 2001
[10] B. Childers, H. Tang, and R. Melhem, “Adapting processor supply voltage to instruction-level parallelism,” Proc. of the Kool Chips 2000 Workshop, Dec. 2000
[11] K. Choi, K. Dantu, W. Cheng, and M. Pedram, “Frame-based dynamic voltage and frequency scaling for an MPEG decoder,” Proc. of the International Conference on Computer Aided Design, Nov. 2002, pp. 732-737
[12] K. Choi, R. Soma, and M. Pedram, “Fine-grained dynamic voltage and frequency scaling for precise energy and performance trade-off based on the ratio of off-chip access to on-chip computation times,” Proc. of the Design, Automation and Test in Europe, 2004
[13] K. Choi, R. Soma, and M. Pedram, “Off-chip latency-driven dynamic voltage and frequency scaling for an MPEG decoding,” Proc. of the Design Automation Conference, Jun. 2004
[14] E. Chung, L. Benini, and G.
Micheli, “Contents provider-assisted dynamic voltage scaling for low energy multimedia applications,” Proc. of the International Symposium on Low Power Electronics and Design, Aug. 2002, pp. 42-47
[15] AMD Corporation, Mobile AMD-K6-2+ Processor Data Sheet, Jun. 2000. Publication #23446.
[16] Transmeta Crusoe, http://www.transmeta.com/technology/index.html
[17] X. Fan, C. Ellis, and A. Lebeck, “Interaction of power-aware memory systems and processor voltage scaling,” Proc. of the Workshop on Power-Aware Computer Systems, Dec. 2003
[18] K. Flautner and T. Mudge, “Vertigo: automatic performance-setting for Linux,” Proc. of the Symposium on Operating Systems Design and Implementation, Boston, MA, Dec. 2002
[19] K. Flautner, S. Reinhardt, and T. Mudge, “Automatic performance-setting for dynamic voltage scaling,” Proc. of the 7th Annual International Conference on Mobile Computing and Networking, Jul. 2001
[20] R. Floyd and L. Steinberg, “An adaptive algorithm for spatial grayscale,” Proc. of the Society for Information Display, 17(2), 1976, pp. 75-77
[21] S. Ghiasi, J. Casmira, and D. Grunwald, “Using IPC variation in workloads with externally specified rates to reduce power consumption,” Proc. of the Workshop on Complexity Effective Design, Jun. 2000
[22] K. Govil, E. Chan, and H. Wasserman, “Comparing algorithms for dynamic speed-setting of a low power CPU,” Proc. of the 1st ACM International Conference on Mobile Computing and Networking, Nov. 1995, pp. 13-25
[23] F. Gruian, “Hard real-time scheduling using stochastic data and DVS processors,” Proc. of the International Symposium on Low Power Electronics and Design, Aug. 2001, pp. 46-51
[24] D. Grunwald, P. Levis, K. Farkas, C. Morrey III, and M. Neufeld, “Policies for dynamic clock scheduling,” Proc. of the Symposium on Operating Systems Design and Implementation, Oct. 2000
[25] J. Hennessy and D.
Patterson, Computer Architecture: A Quantitative Approach, 2nd ed., Morgan Kaufmann Publishers, 1996
[26] I. Hong, G. Qu, M. Potkonjak, and M. B. Srivastava, “Synthesis techniques for low-power hard real-time systems on variable voltage processor,” Proc. of the 19th IEEE Real-Time Systems Symposium, 1998, pp. 178-187
[27] I. Hong, G. Qu, M. Potkonjak, and M. Srivastava, “Power optimization of variable-voltage core-based systems,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 18, no. 12, Dec. 1999, pp. 1702-1714
[28] M. Horowitz, T. Indermaur, and R. Gonzalez, “Low-power digital design,” Proc. of the IEEE Symposium on Low Power Electronics, 1994, pp. 8-11
[29] C. Hsu and U. Kremer, “Compiler-directed dynamic voltage scaling for memory-bound applications,” Technical Report DCS-TR-498, Department of Computer Science, Rutgers University, Aug. 2002
[30] C. Hsu and U. Kremer, “Single region vs. multiple regions: a comparison of different compiler-directed dynamic voltage scheduling approaches,” Proc. of the Workshop on Power-Aware Computer Systems, Feb. 2002
[31] M. Huang, J. Renau, and J. Torrellas, “Profile-based energy reduction in high-performance processors,” Proc. of the 4th Workshop on Feedback-Directed and Dynamic Optimization, Dec. 2001
[32] C. Hughes, J. Srinivasan, and S. Adve, “Saving energy with architectural and frequency adaptations for multimedia applications,” Proc. of the 34th International Symposium on Microarchitecture, Dec. 2001
[33] C. Im, H. Kim, and S. Ha, “Dynamic voltage scheduling technique for low-power multimedia applications using buffers,” Proc. of the International Symposium on Low Power Electronics and Design, Aug. 2001, pp. 34-39
[34] T. Ishihara and H. Yasuura, “Voltage scheduling problem for dynamically variable voltage processors,” Proc.
of the International Symposium on Low Power Electronics and Design, 1999, pp. 197-202
[35] W. Kim, J. Kim, and S. L. Min, “A dynamic voltage scaling algorithm for dynamic-priority hard real-time systems using slack time analysis,” Proc. of the Conference on Design, Automation, and Test in Europe, 2002, pp. 788-794
[36] W. Kim, J. Kim, and S. L. Min, “Dynamic voltage scaling algorithm for fixed-priority real-time systems using work-demand analysis,” Proc. of the International Symposium on Low Power Electronics and Design, 2003, pp. 396-401
[37] C. Krishna and Y. Lee, “Voltage-clock scaling adaptive scheduling techniques for low power in hard real-time systems,” Proc. of the 6th IEEE Real-Time Technology and Applications Symposium, Jun. 2000, pp. 156-165
[38] S. Lee and T. Sakurai, “Run-time power control scheme using software feedback loop for low-power real-time applications,” Proc. of the Asia-Pacific Design Automation Conference, 2000, pp. 381-386
[39] S. Lee and T. Sakurai, “Run-time voltage hopping for low-power real-time systems,” Proc. of the 37th Conference on Design Automation, Jun. 2000
[40] H. Li, C.-Y. Cher, T. N. Vijaykumar, and K. Roy, “VSV: L2-miss-driven variable supply-voltage scaling for low power,” Proc. of the 36th International Symposium on Microarchitecture, 2003
[41] J. Lorch and A. Smith, “Improving dynamic voltage algorithms with PACE,” Proc. of the International Conference on Measurement and Modeling of Computer Systems, Jun. 2001
[42] Z. Lu, J. Lach, M. Stan, and K. Skadron, “Reducing multimedia decode power using feedback control,” Proc. of the International Conference on Computer Design, San Jose, CA, Oct. 2003
[43] Z. Lu, J. Hein, M. Humphrey, M. Stan, J. Lach, and K. Skadron, “Control-theoretic dynamic frequency and voltage scaling for multimedia workloads,” Proc.
of the 2002 International Conference on Compilers, Architectures, and Synthesis for Embedded Systems, Oct. 2002
[44] Y. Lu, L. Benini, and G. D. Micheli, “Dynamic frequency scaling with buffer insertion for mixed workloads,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 21(11), Nov. 2002, pp. 1284-1305
[45] Z. Lu, J. Lach, M. Stan, and K. Skadron, “Reducing multimedia decode power using feedback control,” Proc. of the International Conference on Computer Design, San Jose, CA, Oct. 2003
[46] G. Magklis, M. L. Scott, G. Semeraro, D. H. Albonesi, and S. Dropsho, “Profile-based dynamic voltage and frequency scaling for a multiple clock domain microprocessor,” Proc. of the 30th International Symposium on Computer Architecture, Jun. 2003, pp. 14-25
[47] Developer’s manual: “Intel 80200 Processor Based on Intel XScale Microarchitecture,” http://developer.intel.com/design/iio/manuals/273411.htm
[48] Developer’s manual: “Intel XScale Microarchitecture for the PXA255 Processor,” http://www.intel.com/design/pca/applicationsprocessors/manuals/278693.htm
[49] User’s manual: “Intel XScale Microarchitecture for the PXA255 Processor,” http://www.intel.com/design/pca/applicationsprocessors/manuals/278796.htm
[50] A. Manzak and C. Chakrabarti, “Variable voltage task scheduling for minimizing energy or minimizing power,” Proc. of the International Conference on Acoustics, Speech, and Signal Processing, Jun. 2000
[51] D. Marculescu, “On the use of microarchitecture-driven dynamic voltage scaling,” Proc. of the Workshop on Complexity-Effective Design, Jun. 2000
[52] T. Martin, “Balancing batteries, power and performance: System issues in CPU speed-setting for mobile computing,” PhD thesis, Carnegie Mellon University, 1999
[53] T. Martin, D. Siewiorek, and J. M.
Warren, “A CPU speed-setting policy that accounts for nonideal memory and battery properties,” Proc. of the 39th Power Sources Conference, Jun. 2000, pp. 502-505
[54] T. Martin and D. Siewiorek, “Nonideal battery and main memory effects on CPU speed setting for low power,” IEEE Transactions on Very Large Scale Integration Systems, 9(1), Feb. 2001, pp. 29-34
[55] L. McVoy and C. Staelin, “lmbench: portable tools for performance analysis,” Proc. of the USENIX 1996 Technical Conference, Jan. 1996, pp. 279-294
[56] J. Mitchell, W. Pennebaker, C. Fogg, and D. LeGall, MPEG Video Compression Standard, Chapman and Hall, 1996
[57] A. Miyoshi, C. Lefurgy, E. Hensbergen, and R. Rajkumar, “Critical power slope: understanding the runtime effects of frequency scaling,” Proc. of the 16th Annual ACM International Conference on Supercomputing, Jun. 2002
[58] D. Mosse, H. Aydin, B. Childers, and R. Melhem, “Compiler-assisted dynamic power-aware scheduling for real-time applications,” Proc. of the Workshop on Compilers and Operating Systems for Low Power, Oct. 2000
[59] K. Nowka, G. Carpenter, E. MacDonald, H. Ngo, B. Brock, K. Ishii, T. Nguyen, and J. Burns, “A 0.9V to 1.95V dynamic voltage-scalable and frequency-scalable 32b PowerPC processor,” Digest of Technical Papers, IEEE International Solid-State Circuits Conference, Feb. 2002
[60] K. Patel, B. Smith, and L. Rowe, “Performance of a software MPEG video decoder,” Proc. of the 1st ACM International Conference on Multimedia, 1993, pp. 75-82
[61] T. Pering, T. Burd, and R. Brodersen, “The simulation and evaluation of dynamic voltage scaling algorithms,” Proc. of the International Symposium on Low Power Electronics and Design, Aug. 1998, pp. 76-81
[62] T. Pering, T. Burd, and R. Brodersen, “Voltage scheduling in the lpARM microprocessor system,” Proc.
of the International Symposium on Low Power Design, 2000, pp. 96-101
[63] P. Pillai and K. Shin, “Real-time dynamic voltage scaling for low-power embedded operating systems,” Proc. of the 18th Symposium on Operating Systems Principles, Oct. 2001
[64] J. Pouwelse, K. Langendoen, and H. Sips, “Dynamic voltage scaling on a low-power microprocessor,” Proc. of the 7th Annual International Conference on Mobile Computing and Networking, 2001, pp. 251-259
[65] J. Pouwelse, K. Langendoen, R. Lagendijk, and H. Sips, “Power-aware video decoding,” presented at the 22nd Picture Coding Symposium, Seoul, Korea, 2001
[66] J. Pouwelse, K. Langendoen, and H. Sips, “Energy priority scheduling for variable voltage processors,” Proc. of the International Symposium on Low Power Design, 2001, pp. 28-33
[67] G. Qu, “What is the limit of energy saving by dynamic voltage scaling?,” Proc. of the International Conference on Computer Aided Design, Nov. 2001
[68] G. Quan and X. Hu, “Minimum energy fixed-priority scheduling for variable voltage processors,” Proc. of the Design Automation and Test in Europe, Mar. 2002, pp. 782-787
[69] G. Quan and X. Hu, “Energy efficient fixed-priority scheduling for real time systems on variable voltage processors,” Proc. of the Design Automation Conference, Jun. 2001, pp. 828-833
[70] S. Rusu, “Trends and challenges in VLSI technology scaling toward 100nm,” presented at the European Solid-State Circuits Conference, 2001
[71] T. Sakurai and A. Newton, “Alpha-power law MOSFET model and its application to CMOS inverter delay and other formulas,” IEEE Journal of Solid-State Circuits, 25(2), 1990, pp. 584-594
[72] H. Saputra, M. Kandemir, N. Vijaykrishnan, M. J. Irwin, J. Hu, C.-H. Hsu, and U. Kremer, “Energy-conscious compilation based on voltage scaling,” Proc.
of the Conference on Languages, Compilers, and Tools for Embedded Systems and Software and Compilers for Embedded Systems, Jun. 2002
[73] “Crusoe SE Processor TM5800 Data Book v2.1,” http://www.transmeta.com/everywhere/products/embedded/embedded_se_family.html
[74] G. Semeraro, G. Magklis, R. Balasubramonian, D. H. Albonesi, S. Dwarkadas, and M. L. Scott, “Energy efficient processor design using multiple clock domains with dynamic voltage and frequency scaling,” Proc. of the 8th International Symposium on High-Performance Computer Architecture, Feb. 2002
[75] Y. Shin and K. Choi, “Power conscious fixed priority scheduling for hard real-time systems,” Proc. of the 36th Annual Design Automation Conference, 1999, pp. 134-139
[76] Y. Shin, K. Choi, and T. Sakurai, “Power optimization of real-time embedded systems on variable speed processors,” Proc. of the International Conference on Computer Aided Design, Nov. 2000, pp. 365-368
[77] D. Shin and J. Kim, “A profile-based energy-efficient intra-task voltage scheduling algorithm for hard real-time applications,” Proc. of the International Symposium on Low-Power Electronics and Design, Aug. 2001
[78] D. Shin, J. Kim, and S. Lee, “Low-energy intra-task voltage scheduling using static timing analysis,” Proc. of the Design Automation Conference, 2001, pp. 438-443
[79] T. Simunic, L. Benini, A. Acquaviva, P. Glynn, and G. De Micheli, “Dynamic voltage scaling for portable systems,” Proc. of the 38th Design Automation Conference, Jun. 2001
[80] A. Sinha and A. Chandrakasan, “Dynamic voltage scheduling using adaptive filtering of workload traces,” Proc. of the 14th International Conference on VLSI Design, Jan. 2001
[81] D. Son, C. Yu, and H. Kim, “Dynamic voltage scaling on MPEG decoding,” Proc. of the International Conference on Parallel and Distributed Systems, Jun. 2001
[82] P. Stanley-Marbell, M. Hsiao, and U.
Kremer, “A hardware architecture for dynamic performance and energy adaptation,” Proc. of the Workshop on Power-Aware Computer Systems, 2002
[83] V. Swaminathan and K. Chakrabarty, “Investigating the effect of voltage switching on low-energy task scheduling in hard real-time systems,” Proc. of the Asia and South Pacific Design Automation Conference, Jan./Feb. 2001
[84] J. Wang, B. Ravindran, and T. Martin, “A power aware best-effort real-time task scheduling algorithm,” Proc. of the IEEE Workshop on Software Technologies for Future Embedded Systems, May 2003, pp. 21-28
[85] M. Weiser, B. Welch, A. Demers, and S. Shenker, “Scheduling for reduced CPU energy,” Proc. of the 1st Symposium on Operating Systems Design and Implementation, 1994, pp. 13-23
[86] A. Weissel and F. Bellosa, “Process Cruise Control,” Proc. of the Compilers, Architectures and Synthesis for Embedded Systems, Oct. 2002, pp. 238-246
[87] F. Xie, M. Martonosi, and S. Malik, “Compile time dynamic voltage scaling settings: opportunities and limits,” Proc. of the ACM SIGPLAN Conference on Programming Languages Design and Implementation, Jun. 2003
[88] P. Yang, C. Wong, P. Marchal, F. Catthoor, D. Desmet, D. Verkest, and R. Lauwereins, “Energy-aware runtime scheduling for embedded multiprocessor SoCs,” IEEE Design and Test of Computers, vol. 18, no. 5, 2001, pp. 46-58
[89] F. Yao, A. Demers, and S. Shenker, “A scheduling model for reduced CPU energy,” IEEE Annual Foundations of Computer Science, 1995, pp. 374-382
[90] http://www.eecs.umich.edu/mibench
[91] http://www.applieddata.net/products_bitsyX.asp
[92] http://www.instrument.com/pci/udas.asp
[93] http://download.micron.com/pdf/datasheets/dram/sdram/256MSDRAM_G.pdf