QOS-AWARE ALGORITHM DESIGN FOR DISTRIBUTED SYSTEMS

by

Yuan Yao

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL ENGINEERING)

December 2012

Copyright 2012 Yuan Yao

To my parents and my wife Xinxin, for their love and support.

Acknowledgments

This dissertation would not have been possible without the guidance and assistance of several individuals. First and foremost, I would like to thank my Ph.D. advisor, Professor Leana Golubchik, for her inspiration, encouragement, and help. With her vision and wisdom she taught me how to conduct research. With her experience and dedication she taught me how to present our work to other researchers. I learned a lot from her. I would also like to thank Professor Ramesh Govindan, Professor Michael Neely, and Professor Shang-hua Teng for their support as my dissertation committee members. The quality of my thesis improved greatly due to their invaluable suggestions and comments. Professor Samir Khuller of the University of Maryland not only co-authored two of my dissertation works, but also inspired me with his enthusiasm for algorithmic research. In addition to the people mentioned above, I also had the chance to work with Dr. Alix L. H. Chow, Dr. Longbo Huang, Koyel Mukherjee, Dr. Abhishek Sharma, and Kai Song. They are all excellent collaborators. I am also grateful to my labmates and friends – Ying Chen, Leslie Cheung, Michael Hung, Sunghan Lin, Bochun Wang, Ranjan Pal, Yan Yang, and Xiaoming Zheng. They helped me greatly both in research and in my life, especially when I was in difficulty. Last but not least, I wish to thank my parents, Chongsong Yao and Wenshu Yuan, and my wife, Xinxin Wu, for their unconditional love and care.

Table of Contents

Dedication
Acknowledgments
List of Tables
List of Figures
List of Algorithms
Abstract
Chapter 1: Introduction
Chapter 2: Literature Review
  2.1 Percentile Reduction for Data Transmission
  2.2 Related Work
  2.3 Performance Trade-offs in Structured P2P Streaming Systems
  2.4 Anomaly Detection in Sensor Systems
Chapter 3: To send or not to send: Reducing the cost of data transmission
  3.1 Introduction
  3.2 Problem Definition
  3.3 Solution to Offline Problem
    3.3.1 Algorithm
    3.3.2 Proof of correctness
  3.4 Extending OPT to General Uniform Delay
    3.4.1 Algorithm
    3.4.2 Proof of Correctness
  3.5 Online Problem: Hardness
  3.6 A slightly different online problem
    3.6.1 An efficient online algorithm for the mini-max problem
    3.6.2 Performance lower bound for all online algorithms
  3.7 Applying OPT
    3.7.1 The HELPER approach
    3.7.2 Discussions
  3.8 Experiments
    3.8.1 Experimental settings
    3.8.2 Experimental results
  3.9 Conclusions and Future Work
Chapter 4: Power Cost Reduction for Delay Tolerant Workloads in Distributed Data Centers: a Two Time Scale Approach
  4.1 Introduction
  4.2 Problem Formulation
    4.2.1 The Workload Model
    4.2.2 The Job Distribution and Server Operation Model
    4.2.3 The Cost Model
    4.2.4 The data center Power Cost Minimization (PCM) problem
    4.2.5 Model Discussion and Practical Consideration
  4.3 The SAVE algorithm
    4.3.1 SAVE's control algorithms
    4.3.2 Properties of SAVE
    4.3.3 Implementing the algorithm
  4.4 SAVE: Design and Performance Analysis
    4.4.1 Algorithm Design
    4.4.2 Performance bounds
    4.4.3 Extension to Non-i.i.d. Arrival and Power Prices
  4.5 Experimental Setup
    4.5.1 Data sets
    4.5.2 Experimental Settings
    4.5.3 Simulated Schemes for Comparison
  4.6 Experimental Results
    4.6.1 Performance Evaluation
    4.6.2 Robustness Characteristics
    4.6.3 The Impact of Workload Intensity
    4.6.4 Non-i.i.d. Workload
    4.6.5 Power Consumption of SAVE
  4.7 Conclusions
Chapter 5: Performance Tradeoffs in Structured Peer to Peer Streaming Systems
  5.1 Introduction
  5.2 Multi-Tree Construction and Transmission
    5.2.1 Construction of Tree τ
    5.2.2 Construction of Interior Disjoint Trees
    5.2.3 Delay Analysis
    5.2.4 Not Fully Connected Network
  5.3 Node Addition and Deletion
  5.4 Simulation Results
    5.4.1 Static Scenarios
    5.4.2 Dynamic Scenarios
  5.5 Hypercube-based Scheme
    5.5.1 Hypercube Streaming for Special N
    5.5.2 Hypercube Streaming for Arbitrary N
    5.5.3 Discussion
  5.6 Conclusions
Chapter 6: Online Anomaly Detection for Sensor Systems
  6.1 Introduction
  6.2 Methodology
    6.2.1 Sensor systems for data collection
    6.2.2 Building blocks
    6.2.3 Using SSA on a tiered sensor network
    6.2.4 Hybrid approach
  6.3 Experimental setup
  6.4 Experimental Results
    6.4.1 Accuracy evaluation
    6.4.2 Sensitivity evaluation
    6.4.3 CPU and Memory Usage
    6.4.4 Robustness evaluation
  6.5 Quantitative comparison
  6.6 Conclusions
Chapter 7: Conclusions
References

List of Tables

1.1 Performance metrics considered in each work
3.1 Statistics of datasets
4.1 Statistics of power prices
4.2 Distribution of job sizes
5.1 Average Startup Delay
5.2 Comparison between multi-tree based streaming and hypercube based streaming
6.1 Anomaly categories
6.2 Hybrid method: SensorScope
6.3 Hybrid method: Jug Bay
6.4 SSA vs. Rule-based methods
6.5 Sensitivity test for T: SSA with different T values; change of shape anomaly
6.6 Robustness to parameters
6.7 PCA based method: SensorScope
6.8 Wavelet based method: SensorScope
6.9 Cluster based method: SensorScope

List of Figures

3.1 An example of arrival dataset
3.2 Average reduction with different D values
3.3 95th percentile results for OPT, HELPER and ES
3.4 Mini-max results for OPT, HELPER and ES
4.1 A model of M geographically distributed data centers
4.2 An example of different time scales T and T_1; in this example, T = 8, T_1 = 4, and T = 2T_1
4.3 Average power cost and delay of all schemes under different V and T values
4.4 Average power cost and delay of all schemes under different N_min values and robustness test results
4.5 Average power cost and delay of SAVE with different workload intensity and autocorrelations
4.6 Differences in average power usage reduction for different V values
5.1 Cluster construction with source S, D = 3, d = 4: the ovals represent clusters, where thick and thin arrows represent inter- and intra-cluster transmission, respectively
5.2 Receiving and sending schedules of node id 6, for the example in Figure 5.3
5.3 Example of interior disjoint tree construction using the two schemes with N = 15, d = 3, where G_0 = {1,2,3,4}, G_1 = {5,6,7,8}, G_2 = {9,10,11,12}, G_3 = {13,14,15}
5.4 Worst-case delay
5.5 Example of node deletion with 6 nodes; node ID 2 is leaving the system and, under our deletion algorithm, first swaps positions with node ID 6 and then leaves the system
5.6 Example of adding node ID 7 to the multi-tree; nodes in positions 3 and 5 in each of the trees (indicated by thicker circles) are swapped to make room for the new node, after which node ID 7 is attached to the new interior nodes, i.e., node IDs 5 and 6 in trees T_0 and T_1, respectively
5.7 Average Startup Delay under Different Streaming Approaches
5.8 Results with Different Degree
5.9 Example of the O(1) buffer space scheme
5.10 An illustration of the O(1) buffer space scheme with N = 7
5.11 An example of a hypercube communication pattern with 7 nodes; the nodes' communication in time slots 3n, 3n+1, and 3n+2 is depicted using dashed, dotted, and solid lines, respectively
5.12 Average playback delay
6.1 (a) Data set with long duration anomaly and (b) example of a piecewise linear model
6.2 Examples of Anomalies in Sensor Readings
6.3 Soil Moisture Readings, Constant, Change in Variance and Change in Shape anomalies
6.4 (a) Resource usage of SSA and (b) data set with faulty readings
6.5 Detection of long duration anomalies: PCA and wavelet based methods

List of Algorithms

1 Check if a given H value is feasible
2 ES: Calculate S_ES(i), i = 1, 2, ..., N
3 HELPER: Calculate S_{x+1}(i), i = 1, 2, ..., N

Abstract

With the fast development of networking technology, modern distributed systems are designed to provide a variety of services.
In this dissertation, we focus on algorithm design for four different types of distributed systems, with consideration of quality of service as well as resource constraints. We design algorithms for (i) data transmission systems, (ii) geographically distributed data centers, (iii) structured peer to peer video streaming systems, and (iv) wireless sensor systems. These systems differ from each other in several dimensions — functionality, resource availability, and workload characteristics. As illustrated in this dissertation, our approaches are built on a common theme: understanding and exploitation of resource availability and workload characteristics.

More specifically, in this document we focus on the following topics. For data transmission systems, we provide scheduling algorithms as well as performance bounds aimed at reducing the cost of bandwidth, which is accounted for based on the 95th percentile rule. For geographically distributed data centers that serve delay tolerant workloads, we design a stochastic optimization based approach for making routing and server management decisions, so as to reduce the cost of power while maintaining quality of service. For structured peer to peer streaming systems, we provide two different peer communication algorithms that achieve different levels of trade-offs in performance metrics such as playback delay and buffer size. For wireless sensor systems, we develop an online, low resource usage anomaly detection algorithm that accurately identifies interesting events in sensor readings.

Chapter 1
Introduction

Distributed systems have been developing rapidly and providing a variety of services. Nowadays, in data centers, distributed computer clusters consisting of millions of computers provide computation, storage, and communication services for enterprises and end users. Distributed wireless sensor systems are deployed in the wild to help monitor climate change by collecting measurements. The broad spectrum of functionality and resource availability of such systems offers great opportunities and challenges for algorithm design.

At the same time, the performance of these systems is of great concern. For example, data center operators are concerned with the performance provided to customers, as profit is lost if certain service level agreements are violated. On the other hand, solely optimizing quality of service may also cause loss of profit, because of potential waste of costly resources such as energy. Thus, appropriate utilization and management of such distributed systems is needed, while considering multiple performance metrics.

In this thesis, we design and analyze algorithms for different distributed systems with awareness of their performance. More specifically, we focus on several performance metrics, including response time or delay, accuracy, robustness, and resource consumption. Because of the heterogeneity in system resource availability and functionality, the appropriate performance metrics vary in the algorithm design for different systems. For instance, in distributed computer clusters that carry out massive computation, the cost of power is a vital issue, whereas in sensor systems monitoring the environment, accurate identification of interesting events becomes the primary concern. However, this is not to say that in each case only a single performance metric should be considered, and other performance aspects are not important.
As shown later in this thesis (Chapters 3, 4, 5, and 6), we consider multiple performance aspects and their trade-offs in designing these algorithms, and evaluate them through experimental studies.

This thesis consists of four main research efforts. The first two pieces of work share a similar theme of reducing resource costs under delay/response time constraints, but differ in their approaches. In the first work (referred to as "Percentile Reduction"), the performance metrics we are interested in are the cost of bandwidth and the delay in data transmission. In Chapter 3, we study the problem of minimizing the 95th percentile of the number of jobs served, based on which bandwidth costs are accounted for. We give algorithms as well as performance bounds for this problem with uniform delay constraints. In the second work (referred to as "Power Cost Reduction"), we study performance metrics including delay and power cost. In Chapter 4, we propose a general framework based on stochastic optimization. This framework allows us to reduce the cost of power for such computing infrastructures through exploitation of temporal and spatial variations in workloads and power prices. In the next research work (referred to as "Structured P2P Streaming"), we study the trade-off between different QoS metrics, such as playback delay and buffer size, in structured peer to peer streaming systems. In Chapter 5, we design and analyze algorithms that achieve different trade-off levels of these performance metrics. In the last piece of work, we shift our focus to some other performance metrics. In Chapter 6, we look into accuracy and study the problem of anomaly detection (referred to as "Sensor Anomaly Detection"). We propose an approach to anomaly detection in sensor systems that is able to detect anomalies accurately and in an online manner. We also show that this approach is robust and has a small CPU/memory footprint.

Table 1.1 summarizes the performance metrics that are considered in each work. We note that although the performance metrics of interest vary across the different research topics, the understanding of trade-offs between these metrics, explicit or underlying, serves as an important aspect of our works.

Research Work              | Delay/Response Time | Accuracy | Robustness | Resource Consumption
Percentile Reduction       | Focus               | No       | No         | Focus
Power Cost Reduction       | Yes                 | No       | Yes        | Focus
Structured P2P Streaming   | Focus               | No       | No         | Focus
Sensor Anomaly Detection   | Yes                 | Focus    | Yes        | Yes

Table 1.1: Performance metrics considered in each work

In the remainder of this chapter, we briefly introduce the four pieces of research work and summarize our contributions in each.

Frequently, ISPs charge for Internet use not based on peak bandwidth usage, but according to the 95th percentile cost model. In other words, the time slots with the top 5 percent of data transmission volume do not affect the cost of transmission; instead, we are charged based on the volume of traffic sent in the 95th percentile slot. In such an environment, by allowing a short delay in the transmission of some data, we may be able to reduce our cost considerably. In this thesis, we provide an optimal solution to the offline version of this problem (in which the job arrivals are predetermined), for any delay D > 0. We also show that there is no efficient deterministic online algorithm for this problem.
However, for a slightly different problem, where the maximum amount of data transmitted is used for cost accounting, we provide an online algorithm with a competitive ratio of (2D+1)/(D+1), as well as a lower bound that any online algorithm can possibly achieve for any D > 0. We also provide a heuristic that can be used in an online setting, based on the solution to the offline 95th percentile problem. Experimental results are used to illustrate the performance of the algorithms proposed in this work. The major contributions of this work are as follows:

• We analyze the offline problem of minimizing the 95th percentile of job service and provide OPT, a dynamic programming algorithm that obtains an optimal service schedule in O(N^2 log C) time, where N is the number of time slots and C is the link capacity (see Section 3.3).
• We extend our approach to general uniform delay requirements and prove that our algorithm indeed computes an optimum solution (see Section 3.4).
• We show that the online problem is hard and no algorithm achieves a constant competitive ratio (see Section 3.5).
• We provide a simple heuristic aimed at reducing the 95th percentile of job service based on OPT. The heuristic can be used in an online manner (see Section 3.7).
• We study a closely related problem where the maximum, instead of the 95th percentile, is minimized. We provide an efficient online algorithm that achieves close to optimal performance for this problem. We validate this by showing a lower bound that any deterministic algorithm can achieve for a general uniform delay (see Section 3.6).
• We evaluate the performance of all algorithms using synthetic workload datasets and show that our heuristic achieves up to 13% reductions of the 95th percentile of the number of jobs served (see Section 3.8).

Geographically distributed computing infrastructure offers significant potential for exploring power cost reductions. In Chapter 4 we first present a two time scale model of geographically distributed data centers. Then we design our algorithm to reduce power cost using this model. This algorithm is based on stochastic optimization and makes distributed routing and server management decisions. Our approach considers such decisions at different time scales and offers provable power cost and delay characteristics. The utility of our approach and its robustness are also illustrated through simulation-based experiments under delay tolerant workloads. The major contributions of this work are as follows:

• We propose a two time scale control algorithm aimed at reducing power cost and facilitating a power cost vs. delay trade-off in geographically distributed data centers (see Sections 4.2 and 4.3).
• By extending the traditional Lyapunov optimization approach, which operates on a single time scale, to two different time scales, we derive analytical bounds on the time average power cost and service delay achieved by our algorithm (see Section 4.4).
• Through simulations based on real-world data sets, we show that our approach can reduce the power cost by as much as 18%, even for state-of-the-art server power consumption profiles and data center designs (see Section 4.6). We also show that our approach reduces not only power cost but also the actual power usage.
• We demonstrate the robustness of our approach to errors in estimating data center workload, both analytically and through simulations (see Sections 4.6.2 and 4.6.5).

In Chapter 5 we study structured peer to peer streaming systems.
Specifically, we consider the following basic question: a source node wishes to stream an ordered sequence of packets to a collection of receivers, which are in K clusters. A node may send a packet to another node in its own cluster in one time slot, and to a node in a different cluster in T_c time steps (T_c > 1). Each cluster has two special nodes. We assume that the source and the special nodes in each cluster have a higher capacity and thus can send multiple packets at each step, while all other nodes can both send and receive a packet at each step. In these systems, playback delay and buffer size requirements are very important performance metrics. We provide two algorithms for (intra-cluster) data communications, one based on multi-trees (using a collection of d-ary interior-disjoint trees) and the other based on hypercubes. We prove bounds on the playback delay and buffer size requirements of these algorithms. The contributions of this work are as follows:

• We provide algorithms for constructing multiple streaming trees and a corresponding transmission schedule so as to maintain QoS characteristics (Section 5.2.2). We are also able to extend our schemes to dynamic scenarios while maintaining our nice tree properties (Section 5.3). We validate and evaluate our dynamic schemes through simulations (Section 5.4.2).
• We analyze the QoS of the multi-tree-based schemes and derive an upper bound on the delay required at each node before it can start playback, and use it to bound the required buffer size; we also prove a lower bound on the average playback delay (Section 5.2.3). We also evaluate these schemes through simulations (Section 5.4.1).
• Interestingly, our work establishes that it is only useful to consider degree 2 and 3 trees if one wants to minimize worst-case delay (see Section 5.2.3).
• We present algorithms for constructing a hypercube-based streaming structure and a corresponding transmission schedule for arbitrary values of N, with a provable limit on the number of neighbors with which a node needs to communicate (see Section 5.5).
• We analyze the hypercube-based schemes to give upper bounds on worst case and average case playback delay (see Section 5.5.2).
• Our main schemes are presented under the model of each cluster being a fully connected graph. We also show that deciding the existence of two interior disjoint trees in a general graph is an NP-complete problem (see Section 5.2.4).

In Chapter 6 we shift our focus to wireless sensor systems. These systems aid scientific studies by instrumenting the real world and collecting measurements. Given the large volume of measurements collected by sensor systems, one problem arises – the need for an automated approach to identifying the "interesting" parts of these data sets, or anomaly detection. A good anomaly detection methodology should be able to accurately identify many types of anomalies, be robust, require relatively few resources, and perform detection in (near) real-time. Thus, in this work we focus on an approach to online anomaly detection in measurements collected by sensor systems. The contributions of this work can be summarized as follows:

• We propose an approach to anomaly detection in sensor systems that is able to detect anomalies accurately and in an online manner (Section 6.2).
• We perform an extensive study using data sets from two real deployments, one consisting of measurements collected by 23 sensor nodes for around 43 days, and the other consisting of soil temperature, moisture, and air humidity measurements collected by 3 sensor nodes for over 5 months. This study illustrates that our approach is accurate, works well over a range of parameter choices, and has a small (CPU, memory) footprint (Sections 6.3 and 6.4).
• Through an extensive comparison study, we show that our (online) approach is more accurate than other potential (offline) techniques, which are more computationally intensive (Sections 2.4 and 6.5).

The rest of this thesis is organized as follows. In Chapter 2, we review related work. In Chapter 3, we provide algorithms to reduce the 95th percentile of data transmission. In Chapter 4, we present our framework for power cost reduction in distributed data centers. Chapter 5 discusses the video streaming algorithms in structured peer to peer systems. In Chapter 6, we describe our approach to anomaly detection in wireless sensor systems. Chapter 7 concludes the thesis.

Chapter 2
Literature Review

This document includes four topics on algorithm design for distributed systems. Therefore, in this chapter, we provide four sections of literature review – one for each topic.

2.1 Percentile Reduction for Data Transmission

In this work, our main focus is on scheduling algorithms with a uniform delay bound. Although there are quite a few works in this area, none of them focus on the 95th percentile. For example, [39, 49, 38] studied online algorithms that try to optimize the total weight of served jobs. They provide deterministic as well as randomized algorithms that achieve a constant competitive ratio on throughput, which is the total weight of jobs served. Jobs that are not served before their deadlines are dropped. In our work, we study the problem of minimizing the 95th percentile or the maximum of the total number of jobs served in a time slot, and no jobs are allowed to be dropped. Although these two problems appear to be related, it is not obvious that these techniques can be easily adapted to solve the problems we consider in this work.

In the CPU scheduling literature, specifically in the domain of minimizing the energy and power consumed by a server through speed scaling, the works of Yao et al. [78] and Bansal et al. [13] are relevant. These works target reduction of the maximum number of jobs served with general deadline constraints. Yao, Demers and Shenker [78] provide an optimal offline algorithm for minimizing the maximum energy consumed by a processor when the power consumed is a convex function of the speed. Their algorithm also minimizes the maximum speed of the processor at any point in time, subject to the constraint that all jobs are completed by their deadline. Further, Bansal, Kimbrel and Pruhs [13] gave an optimal e-competitive algorithm for the online version of the energy minimization problem. This is also e-competitive for the online problem of minimizing the maximum speed at any instant such that all jobs are completed by their deadline. In general, we have a different goal of minimizing the 95th percentile. However, in Section 3.6 we study a problem similar to theirs and show that, under the assumption of uniform delay, our approach achieves a better competitive ratio.

A few papers do study the impact of the 95th percentile; e.g., [21] studies the impact of time slot size through modeling and a measurement study.
Further, this accounting property is leveraged to reduce the cost of bulk data transfers in [47]. In [30], smart routing algorithms are designed for data streams over multiple ISPs with consideration of the 95th percentile model. In contrast, the main problems we consider are the following: minimize the 95th percentile through job level scheduling, and exploit the 95th percentile model.

2.2 Related Work

As mentioned in Section 4.1, work on power cost reduction can be classified into three broad categories – single server, single data center, and multiple data centers. For a single server, researchers have proposed scheduling algorithms to minimize power consumption subject to job deadlines [78], or to minimize average response time subject to a power constraint [61]. Wierman et al. use dynamic CPU speed scaling to minimize a weighted sum of mean response time and power consumption [77]. A survey of work on single server power cost reduction is given in [10]. For a data center, Gandhi et al. provide management policies that minimize mean response time under a total power cap [28] or minimize the product of response time and power cost [27]. Chen et al. propose solutions based on queueing models and control theory to minimize the server energy as well as data center operational costs [18]. Lin et al. design an online algorithm to minimize a convex function of power usage and delay that is 3-competitive [51]. Our approach differs from these works in three ways: (i) it leverages spatial and temporal differences in job arrivals and power prices to achieve power cost reduction; (ii) all work mentioned above, except [61] and [78], assumes closed form convex expressions for service delay or convex delay-cost functions, whereas our approach does not make these assumptions, as they may not always hold, especially for delay tolerant jobs; (iii) our approach does not rely on predictions of workload arrival processes, as [18] and [51] do.

Power cost reduction across multiple data centers is an area of active research. Qureshi et al. proposed the idea of exploiting spatial diversity in power prices to reduce cost by dynamically routing jobs to data centers with lower prices [62]. They also provide a centralized heuristic for doing so that is similar to the Low Price scheme we evaluate in Section 4.6. Rao et al. provide an approximation algorithm for minimizing the total power cost subject to an average delay constraint [66], while Liu et al. designed load balancing algorithms to minimize a metric that combines power and delay costs [56]. [52] extends [51] to the case of multiple data centers with heterogeneous delay, power price, and server characteristics. All papers mentioned here make routing and server on/off decisions based on predictions of arrival rates and closed form convex expressions for average delay. [70] makes control decisions at the server, data center, and inter-data center levels in every time slot by solving a deterministic convex optimization problem. All of these works have work conserving schedulers and only exploit the spatial differences in job arrivals and power prices. Also, they work on a single time scale. In contrast, our approach exploits both the spatial and temporal differences in job arrivals and power prices by using a non work conserving scheduler. This leads to greater power cost reductions when serving delay-tolerant workloads. Moreover, it works on two time scales to reduce the server on/off frequency.
A number of recent works introduce new aspects of improving power usage in data centers. For example, [75] takes batteries into consideration and shows that they can be used to reduce the cost of power. [22] studies cooling strategies in data centers. [57] considers server management together with cooling and the use of renewable energy. These works are orthogonal to ours, in that the approaches they use can be combined with ours to achieve lower power usage/cost for data centers.

The Lyapunov optimization technique that we use to design our approach was first proposed in [72] for network stability problems. It was generalized in [29] for network utility optimization problems. Recently, Urgaonkar et al. used this technique to design an algorithm for job admission control, routing, and resource allocation in a virtualized data center [74]. However, they consider power reduction in a single data center only. To the best of our knowledge, our work is the first to apply a novel two time scale network control methodology to workload management for geographically distributed data centers.

2.3 Performance Trade-offs in Structured P2P Streaming Systems

The original end-system multicast approach [19] explores the use of an application level multicast tree for streaming; however, it suffers from several shortcomings, including: (i) leaf nodes contribute no resources (and hence some system resources are essentially wasted), (ii) less resilience to node failures, (iii) less flexibility in the use of bandwidth (e.g., bandwidth is used in units of the needed streaming rate), and (iv) internal nodes require significantly higher upload bandwidth (than the streaming rate of a stream) in order to keep a multicast tree shallow (which is desirable, as a single deep tree results in long startup delays and large buffer size requirements). Existing literature explores two directions in trying to overcome these shortcomings. One direction (as in our work) is the use of multiple trees, e.g., as in [16]. However, most such works focus on providing different quality of service to users with different capabilities, as well as on adapting to failures and bandwidth heterogeneity, e.g., through the use of Multiple Description Coding (MDC). Our techniques can be combined with MDC as well, but we do not rely on its use. Other works, e.g., [76], address system dynamics at the cost of not maintaining nice tree properties (e.g., balance), whereas our schemes do maintain such properties. One generalization of [16] examines stability properties of the overlay [20] and, unlike this work, considers QoS in a probabilistic setting. Overall, the distinction of our effort is that we provide provable QoS guarantees (such as startup delay).

Another direction is to abandon the use of trees in favor of unstructured peer-to-peer (P2P) systems, e.g., [80, 55], which allows for greater flexibility in dealing with system dynamics and heterogeneity. However, to date most such solutions have been fairly heuristic in nature and (as a result) difficult to analyze, from the perspective of performance as well as QoS guarantees. One notable exception is the analysis of BitTorrent in [12]. This work gives a very nice analysis which explains BitTorrent's success for download applications; however, it is not clear how successfully it can be applied to streaming applications, where the data needs to be delivered in time for playback.
That is, one can view a stream as a collection of small segments and apply the analysis in [12] (or in other works, e.g., [26]) to each segment. However, given the bound in [12], keeping up with real-time constraints would require an assumption of faster packet transmission than playback, an assumption we do not make here.

A comparison study of multi-tree-based approaches and unstructured P2P schemes is given in [58]. One interesting difference, as compared to our results, is that the results in [58] suggest the use of higher degree trees, whereas our work shows the optimality of lower degree trees (refer to Section 5.2.3). However, the main metric in [58] is bandwidth utilization under heterogeneous conditions, whereas we focus on playback delay. Overall, a high level conclusion from [58] is that unstructured aspects of the system allow for better utilization of heterogeneous resources and better adaptivity to node churn, which lead to better QoS. We note that this study is (a) done for specific multi-tree and unstructured schemes, (b) assumes the use of MDC, and (c) is simulation based, i.e., no provable QoS guarantees are given. By contrast, the work here does derive provable QoS guarantees for a class of multi-tree based schemes, and thus is more useful in scenarios where QoS is of importance.

Other analytical works on the performance of streaming schemes include [25], [45], [82] and [54]. The focus of [25] is to understand the practical issues (e.g., periodic buffer exchange and lack of central scheduling) that might prevent these schemes from achieving theoretical bounds on playback delay and streaming rate. [45] provides a fluid model for peer to peer streaming systems; the main performance metric considered in that work is the maximum achievable streaming rate. In [82], a stochastic model of unstructured peer to peer streaming systems is given; there, the main goal is to understand the impact of different chunk selection policies on continuity and start-up latency. The goal of [54] is to minimize the playback delay of peer to peer streaming systems. In these works, performance metrics such as streaming rate or continuity are studied individually (i.e., without considering other metrics). That is, the main effort in these works is focused on understanding and/or optimizing a specific performance metric. In contrast, our work strives to provide an understanding of the trade-offs between several important metrics that are of interest to streaming systems.

Lastly, our hypercube scheme is based on a generalization of [24], originally designed for message broadcast. We note, however, that [24] cannot be directly used for streaming, as that scheme does not guarantee that, after a certain start-up delay, each node in the system can receive a new packet in the next time slot. Hence, it does not provide the continuity in data delivery needed for QoS provision. Moreover, in the scheme given in [24], the source needs to participate in packet exchanges, instead of sending out new packets. However, in live streaming, for instance, the source's main role is to continuously send out new packets.

2.4 Anomaly Detection in Sensor Systems

Anomaly detection is an area of active research. In this section we briefly survey the work most related to ours, organized into four categories. Moreover, we select representative techniques from these categories and quantitatively compare them against our hybrid approach, with the results of this comparison presented in Section 6.5.

Anomaly detection in sensor systems.
This is not a well-studied area. The only efforts we are aware of that focus on anomaly detection in sensor system data are [63, 64, 71]. Rajasegarar et al. present a clustering based method built on the K nearest neighbor algorithm in [63], and an SVM based approach in [64]. Briefly, these works view measurements collected by a sensor system as coming from the same (unknown) distribution and "pre-define" anomalies as outliers, with the main focus of these (offline) techniques being on minimizing the communication overhead (in transmitting data needed for anomaly detection) and the corresponding energy consumption. In contrast, we focus on an online approach, without "pre-defining" what an anomaly is. However, we do perform a quantitative comparison between our hybrid approach and the clustering based method in [63]; the details of this comparison can be found in Section 6.5.

Tartakovsky et al. use a change point detection based approach, CUSUM (Cumulative Sum Control Chart), for quickly detecting distribution changes (e.g., in mean, variance, or covariances) in sensor measurements [71]. We do not provide a quantitative comparison between CUSUM and our hybrid approach, as we lack sufficient domain knowledge – i.e., the CUSUM based approach assumes knowledge of the (time varying) probability distribution from which sensor measurements are sampled, and such information is not provided for the SensorScope and Jug Bay datasets (and would be quite difficult for us to compute accurately).

In general, change point detection can identify changes in the parameters of a stochastic process. Often, such changes are anomalous, and hence change point detection techniques have been used for change/outlier detection in time series data [32, 17]. However, such existing works (a) assume that sensor measurements are sampled from a known or easily inferred distribution (e.g., Gaussian), an assumption often not true for real world datasets, (b) target a specific subset of anomalies, and (c) do not use real-world sensor datasets for evaluation. In addition, existing change point detection based methods either require a long training phase, or are computationally intensive, or (in some cases) both – hence, they do not meet our efficiency criteria. Moreover, not all changes in a time series might be anomalous, i.e., there could be a change and no anomaly (a false positive) and vice versa (a false negative). For example, [32] flags turning points of a time series from increasing to decreasing – as it is often "normal" for a sensor data time series to have periodic behavior, this is (likely) not an anomaly. On the other hand, such an approach would only flag a few turning points, instead of an entire anomalous event, as depicted in Figure 6.1(a). We believe that change point detection could be a promising approach to anomaly detection, but that it would require significant work to devise an accurate and efficient online anomaly detection method based on such techniques.

Fault detection in sensor systems. Short duration anomalies can be viewed as instances of data faults, errors in measurements, or outliers (see [59]). Sharma et al. focus on SHORT spikes, NOISE faults, and CONSTANT readings data faults and show that these are quite prevalent in real-world sensor datasets [68].
They also evaluate the performance of several methods – Rule-based methods, a least-squares estimation based method, a Hidden Markov model (HMM) based method, and a time series analysis (ARIMA model) based method – that are targeted towards detecting transient sensor data faults. However, these methods perform poorly at detecting long duration anomalies. For example, in Section 6.4.1 we show that Rule-based methods are not sufficient for detecting long duration anomalies. In addition, we compare the ARIMA model based approach, which works best in the case of data faults affecting more than a few samples, against our hybrid approach (with details given in Section 6.5). Other approaches to sensor data fault detection include [42, 37, 23, 73]. However, these are not suited for detecting (unknown) long duration anomalies due to one or more of the following assumptions they make: (1) the anomalous data pattern is known a priori [42, 37, 23], (2) the distribution of normal sensor readings is known [23], and (3) focusing on short duration trends is enough to capture anomalies [73].

Network anomaly detection. The work in this area includes detecting anomalies such as DDoS attacks, flash crowds, worm propagation, and bulk data transfers in enterprise or ISP networks [46, 69, 14, 35]. Techniques such as Principal Component Analysis (PCA) [46] and wavelet analysis [69, 14, 35], used for detecting network traffic anomalies, are not well suited for online detection in sensor systems, primarily because they are computationally intensive and difficult to perform in an online manner. Furthermore, we show that even in an offline manner, these methods perform poorly at detecting long duration anomalies in sensor system data (as detailed in Section 6.5). Our quantitative study (given in Section 6.5) indicates that a straightforward application of such techniques, as designed for network anomaly detection, is not very effective on sensor data.

Piecewise linear time series models. Piecewise linear models have been used to model time series data in other contexts (i.e., not for anomaly detection). For example, Keogh developed an efficient technique to search for occurrences of a known pattern within a time series using a piecewise linear representation [41]. Specifically, for a time series with n points, Keogh's algorithm starts with k = ⌊n/3⌋ line segments and defines a "goodness of fit" metric for k line segments as B_k = std(e_1, ..., e_k), where e_i is the average linearization error for line segment i. It then iteratively merges two consecutive line segments until B_k cannot be reduced any further; this process is continued until a single line approximates the entire time series. This process ends with a family of piecewise linear models for a single time series – one for each value of k, 1 ≤ k ≤ ⌊n/3⌋. Each linear model has a B_k value associated with it, and the one with the smallest B_k is selected as the final representation. In contrast, our greedy linearization algorithm differs as follows: (1) we start with a single line segment and continue adding more segments, (2) we use a different "goodness of fit" criterion (maximum linearization error; see Section 6.2), and (3) we compute a single representation instead of a family of piecewise linear models. The main reason behind these choices (rather than, e.g., using Keogh's algorithm) is computational cost, as our goal is an efficient online approach. Briefly, computing a family of models to find the "best" one is wasteful for our purposes if it takes significantly greater computation. Hence, we opt for our greedy approach.
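To make the contrast concrete, the following is a minimal sketch of a top-down greedy linearization of the kind described above. It is only an illustration: the text fixes the starting point (a single segment) and the maximum-linearization-error criterion, while the specific rule of splitting a segment at its worst-fit point, and the function names, are assumptions made here.

```python
def segment_error(y, s, e):
    """Max absolute error of the line joining (s, y[s]) and (e, y[e])."""
    worst_err, worst_idx = 0.0, s
    for i in range(s + 1, e):
        line = y[s] + (y[e] - y[s]) * (i - s) / (e - s)
        err = abs(y[i] - line)
        if err > worst_err:
            worst_err, worst_idx = err, i
    return worst_err, worst_idx

def greedy_piecewise_linear(y, max_err):
    """Start with one segment and keep adding breakpoints until every
    segment's maximum linearization error is at most max_err."""
    bps = [0, len(y) - 1]
    changed = True
    while changed:
        changed = False
        new_pts = []
        for s, e in zip(bps[:-1], bps[1:]):
            err, idx = segment_error(y, s, e)
            if err > max_err and e - s >= 2:   # split this segment at its worst-fit point
                new_pts.append(idx)
                changed = True
        if new_pts:
            bps = sorted(set(bps + new_pts))
    return bps   # indices of the breakpoints of the piecewise linear model
```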
Chapter 3
To send or not to send: Reducing the cost of data transmission

Frequently, ISPs charge for Internet use not based on peak bandwidth usage, but according to the 95th percentile cost model. In other words, the time slots with the top 5 percent of data transmission volume do not affect the cost of transmission; instead, we are charged based on the volume of traffic sent in the 95th percentile slot. In such an environment, by allowing a short delay in the transmission of some data, we may be able to reduce our cost considerably. We provide an optimal solution to the offline version of this problem (in which the job arrivals are predetermined), for any delay D > 0. We also show that there is no efficient deterministic online algorithm for this problem. However, for a slightly different problem, where the maximum amount of data transmitted is used for cost accounting, we provide an online algorithm with a competitive ratio of (2D+1)/(D+1), as well as a lower bound that any online algorithm can possibly achieve for any D > 0. We also provide a heuristic that can be used in an online setting, based on the solution to the offline 95th percentile problem. Experimental results are used to illustrate the performance of the algorithms proposed in this work.

3.1 Introduction

A huge amount of data is transferred over the Internet every day. For example, Google, Amazon, and other operators of large data centers that host cloud computing applications need to provide services and synchronize processed data across multiple locations. Multimedia content providers such as Akamai need to replicate video clips over servers located in different regions for optimized content delivery. Huge data chunks are transmitted over high speed links, and cost the senders a significant amount of money on a daily basis.

Cost accounting for data transfer is performed differently from that of other utilities such as energy, where companies are billed by the total volume of utility consumed. For network bandwidth costs in large scale computing systems, the 95th percentile rule is widely used [47]. The rule can be described as follows: the average utility usage is sampled every w time units. Thus, in an accounting cycle of length L, a total of L/w samples are taken. (A typical value for w is 5 minutes and for L is 24 hours.) Then the 95th percentile of these samples is taken as the actual utility usage in that accounting cycle.
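As an illustration of this accounting rule, the short sketch below computes the billed volume from a list of per-slot samples: the top 5% of samples are treated as "free" and the largest remaining sample is billed. The function name and the rounding convention are assumptions made for illustration; the sample values in the comments anticipate the small example worked through in the next paragraphs.

```python
def percentile_charge(samples, percentile=0.95):
    """Billed usage under a percentile billing rule: the top (1 - percentile)
    fraction of samples is ignored, and the largest remaining sample is billed.
    Rounding the number of ignored samples down is an illustrative assumption."""
    free_slots = int((1.0 - percentile) * len(samples))  # e.g., 5% of slots are "free"
    ranked = sorted(samples, reverse=True)
    return ranked[free_slots]

# A 24-hour cycle sampled every 5 minutes gives 288 samples; under the 95th
# percentile rule the 14 busiest samples do not affect the bill.  For instance:
# percentile_charge([3, 3, 3, 3, 0], percentile=0.80) == 3   (second largest)
# percentile_charge([2, 4, 2, 2, 2], percentile=0.80) == 2
```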
A different number of jobs arrive at the server in each time slot. The server 20 has a capacity of serving up to5 jobs in one slot. Over an accounting period of5 time slots, the number of jobs arriving at the server is [3,3,3,3,0]. Our goal is to minimize the 80th percentile (second largest in this case) of the number of jobs served in one time slot. If all jobs are served within the time slot in which they arrive, with no delay, then the service sequence is also [3,3,3,3,0]. Note that according to the 80th percentile rule, the billed output bandwidth usage is also 3. However by delaying each job by at most one time slot, the server can instead choose the service sequence [2,4,2,2,2], thus reducing the billed bandwidth usage to 2. Note that the trade-off here is that four jobs are delayed. One is delayed from slot 1 to slot 2, one from slot 3 to slot 4, and two from slot 4 to slot 5.) Note that all jobs are being served and no job is dropped. The above example looks straightforward. But the general problem of percentile min- imization can be difficult, due to the following: 1) jobs arrive with delay requirements and 2) the job arrival patterns might not be available ahead of time. This work considers this problem, as well as a closely related problem from an algorithmic prospective. We focus on reducing the 95th percentile because it is the value based on which the bandwidth usage is currently billed; however, our approach can be applied to other percentile values 1 We provide job scheduling algorithms and analysis of the problem of minimizing the 95th percentile of job service, subject to delay constraints. We focus on uniform delay require- ments, where all jobs arrive with the same delay requirement. Our study covers both the offline problem, where all the job arrivals are known in advance, and the online problem, where no knowledge of job arrivals is available a priori. While scheduling problems have been investigated for decades [60], the metrics consid- ered in prior work have more to do with minimizing the maximum load [48], or scheduling 1 For the problem of minimizing the 100th percentile (maximum value), the complexity of our approach is lower than the complexity for other percentile values. 21 a given fraction of jobs, while minimizing maximum load [31]. Our model represents a new class of problems, and is based on the charging mechanism in use. The contributions of this work are summarized as follows: • We analyze the offline problem of minimizing the 95th percentile of job service and provide OPT, a dynamic programming algorithm that obtains an optimal service schedule inO(n 2 logC) time, whereN is the number of time slots andC is the link capacity (see Section 3.3). • We extend our approach to general uniform delay requirements and prove that our algorithm indeed computes an optimum solution. (see Section 3.4). • We show that the online problem is hard and no algorithm achieves constant compet- itive ratio (see Section 3.5). • We provide a simple heuristic aimed at reducing the 95th percentile of job service based on OPT. The heuristic can be used in an online manner (see Section 3.7). • We study a closely related problem where the maximum, instead of the 95th per- centile is minimized. We provide an efficient online algorithm that achieves close to optimal performance for this problem (see Section 3.6). We validate this by showing a lower bound that any deterministic algorithm can achieve for a general uniform delay. 
• We evaluate the performance of all algorithms using synthetic workload datasets and show that our heuristic achieves up to 13% reductions of the 95th percentile of the number of jobs served. (see Section 3.8). 22 3.2 Problem Definition Consider a system with a single server, where time is slotted. LetN be the total number of time slots in an accounting cycle. LetA(i) denote the number of jobs arriving in time slot i. Each job is of unit work. Jobs arrive at the beginning of each time slot and some number of them,S(i), are served by the end of that time slot. Jobs that are served in the time slot in which they arrived incur no delay. Unserved jobs are carried over for service in future time slots; each time a job is carried over, it incurs an additional delay of 1 time slot. The goal is to choose S(i),i = 1,2,3,...,N, such that the 95th percentile of S(i) is minimized. To make this problem realistic (and potentially interesting) the following constraints are introduced: • The server has a capacity of C. Thus, S(i)≤ C,∀i. To simplify the problem, we assume thatA(i)≤C,∀i. • Each job can be delayed by at mostD > 0 time slots. This problem has an offline and an online version, depending on whether job arrivals,A(i), are known in advance. 3.3 Solution to Offline Problem In this section we give an algorithm to solve the proposed offline problem, whereA(i),i = 1,2,...,N is known a priori. To simplify the problem, we first assume delayD = 1. We will later extend our solution to accommodate general D values. We call our approach OPT. A solution to this problem is useful in scenarios like bulk data transfer[47], where workload is pre-determined but can be scheduled with some flexibility. 23 3.3.1 Algorithm According to the 95th percentile accounting rule, the number of jobs served in the top 5% of the total time slots are not included in accounting. These time slots can be viewed as “free” because the jobs served in these time slots do not change the 95th percentile result. LetT = 5%×N be the number of “free” time slots. Assume that in “free” time slots, up to C jobs can be served, whereas in the otherN−T time slots, at mostH(H≤ C) jobs can be served. In this way, H serves as the upper bound of the 95th percentile of the number of jobs served. Our objective is to find the minimumH that satisfies the delay constraints. We make a guess forH, and then verify if our guess leads to a feasible solution. We can do a doubling search, combined with a binary search forH. GivenH, we defineOPT(i,t),{i = 1,2,3,...,N;t = 0,1,2,...,T} to be the minimum number of jobs left unfinished by the end of time sloti, whilet “free” time slots have been used. To see ifH is feasible, we calculateOPT(i,t) using Algorithm 1. It is clear that OPT(1,0) = max{A(1)−H,0} In cases wherej > 0, one option is to not pay for sloti and M C = max{(OPT(i−1,j−1)+A(i)−C),0} is the result of serving up toC jobs in time sloti, and M H = max{(OPT(i−1,j)+A(i)−H),0} is the result of serving up toH jobs in time sloti; however in this case we can still dropj slots from the firsti−1 slots. 
Since in each time slot there are only these two choices, we 24 Algorithm 1 Check if a givenH value is feasible Require: H,A(i),i = 1,2,...,N 1: OPT(1,0)← max{A(1)−H,0} 2: OPT(1,1)← 0 3: for time sloti = 2,3,...,N do 4: forj = 0,1,...,T do 5: ifj > 0 then 6: M C ← max{(OPT(i−1,j−1)+A(i)−C),0} 7: else 8: M C =∞ 9: end if 10: M H ← max{(OPT(i−1,j)+A(i)−H),0} 11: OPT(i,j)← min{M C ,M H } 12: ifOPT(i,j)>A(i) then 13: OPT(i,j)←∞ 14: end if 15: end for 16: end for 17: ifOPT(N,T) = 0 then 18: H is feasible 19: else 20: H is infeasible 21: end if have OPT(i,j) = min{M C ,M H } if min{M C ,M H }≤A(i) ∞ Otherwise This is expressed in the Algorithm 1. Here∞ indicates that the givenH is not feasible. In this case, the number of jobs carried to the next time slot is greater than the total number of job arrivals in the current time slot. This means that some jobs are being delayed for more than one time slot. To obtain theOPT(N,T), It takesO(NT) time. We can check ifOPT(N,T) = 0 to see if there is a feasible schedule for the chosenH value, or in other words, ifH is feasible. 25 To obtain the minimum feasibleH value, we use a binary search between 0 andC. In each step we check ifH is feasible using Algorithm 1. IfC is integral then the total running time isO(nT logC). 3.3.2 Proof of correctness Note that if OPT(N,T) = 0, then there exists a feasible schedule sequence{S(i),i = 1,2,...,N} for a givenH. Indeed, we can record the choice ofM C andM H at each step. These choices indicate a permutation of N−T “accounted” time slots, in which at most H jobs are served, andT “free” time slots, in which up toC jobs are served. The sequence of S(i) can be generated according to this permutation. According to our algorithm, the generatedS(i) sequence must be feasible. On the other hand, if there exists a feasible sequence for a given H value, suppose algorithm X generates such a sequence. Let U X (i,t) be the number of jobs that remain unserved for algorithm X at the end of time slot i, when t “free” slots are used. We can see thatU X (i,t)≥OPT(i,t) because when determiningOPT(i,t) we always choose the schedule that minimizes the unfinished work. Thus OPT(n,T)≤ U X (n,T) = 0. This completes the proof. 3.4 Extending OPT to General Uniform Delay Now let us assume that all arriving jobs can be delayed by at mostD≥ 1 time slots. Again, the goal is to minimize the 95th percentile of the number of jobs served. Our algorithm in Section 3.3 solves theD = 1 case. Now we extend this algorithm to the generalD case. 26 3.4.1 Algorithm WhenD > 1, the jobs left unfinished by time sloti consist of jobs that have been delayed for 1,2,...,D time slots. Intuitively, the service priority for these jobs should be different. In general, letU(i) = (U 1 (i),...,U D (i)) be the vector of unfinished jobs at the end of time sloti. That is, an elementU j (i) represents the number of jobs that have been delayed for j time slots, including time sloti. For instance,U 1 (i) is the number of jobs that arrived in time sloti but did not get served in that time slot. Similarly,U D (i) represents the number of jobs that have already been delayed for a maximum allowedD time slots; thus, they all have to be served in time slot i + 1. LetS(i) = (S 0 (i),...,S D (i)) be the service vector. That is, it represents the fact that in time slot i, out of U j (i− 1) jobs, S j (i) jobs will be served, wherej > 0; S 0 (i) represents the number of jobs served out of theA(i) jobs that arrived in intervali. 
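To make the D = 1 case of Section 3.3 concrete before generalizing it, the feasibility check of Algorithm 1 together with the binary search over H can be sketched as follows. This is a minimal Python sketch under the conventions above (unit-work jobs, T = 5% of N "free" slots); the variable names are illustrative only.

```python
def feasible(H, A, C, T):
    """Return True if some schedule with delay at most 1 keeps the billed value <= H."""
    INF = float("inf")
    # opt[t] = minimum unfinished work so far, having used t "free" slots (Algorithm 1)
    opt = [INF] * (T + 1)
    opt[0] = max(A[0] - H, 0)
    if T >= 1:
        opt[1] = 0                       # spend a free slot (capacity C >= A(1)) on slot 1
    for i in range(1, len(A)):
        new = [INF] * (T + 1)
        for t in range(T + 1):
            m_h = max(opt[t] + A[i] - H, 0)                        # pay for slot i
            m_c = max(opt[t - 1] + A[i] - C, 0) if t > 0 else INF  # use a "free" slot
            best = min(m_h, m_c)
            # carrying over more than A(i) would delay some job by more than one slot
            new[t] = best if best <= A[i] else INF
        opt = new
    return opt[T] == 0

def min_billed_H(A, C, T):
    """Binary search for the smallest feasible H in [0, C] (assumes integral capacities)."""
    lo, hi = 0, C
    while lo < hi:
        mid = (lo + hi) // 2
        lo, hi = (lo, mid) if feasible(mid, A, C, T) else (mid + 1, hi)
    return lo
```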
In this scenario,H being feasible means that P D j=0 S j (i)≤ H for an “accounted” slot and P D j=0 S j (i)≤C for a “free” slot. Moreover. 0≤S j (i)≤U j (i−1) and 0≤S 0 (i)≤A(i). We show that an algorithm similar to the one given in Section 3.3 can solve the general problem. Here we still use dynamic programming and calculate OPT(i,t). Since jobs can be delayed for more than one time slot now, our algorithm now needs to (1) provide a service schedule vector, (2) calculate the vector of unfinished jobs, and (3) decide whether a certainH value is feasible. SupposeM is aD +1 byD identity matrix, whereV×M corresponds to taking the firstD elements of the vectorV . We haveU(i) ={(A(i),U(i−1))−S(i)}×M. For example, when D = 2, supposeU(i−1) = (2,1), A(i) = 3, S(i) = (2,0,1). Then U(i) ={(3,2,1)−(2,0,1)}×M = (1,2). For a particular U(i), there are multiple feasible S(i) vectors that satisfy the above constraints. Again, when D = 2, suppose U(i−1) = (2,1), A(i) = 3 and H = 3, S(i) can be either (1,1,1), (0,2,1), or (2,0,1). In OPT, we choose S(i) in a greedy 27 manner. More specifically, jobs that arrive earlier will be served before those arriving later. This is intuitive because jobs that have been delayed longer are more urgent and should be prioritized over those that just arrived. Given this, the service schedule vector we choose for the example above is (0,2,1). To check feasibility of assigning an “accounted” time slot or “free” slot, we need to compareH (orC) withU D (i) to check if the corresponding schedule is feasible. Again we defineOPT(i,t),{i = 1,2,3,...,n;t = 0,1,2,...,T} as the minimum num- ber of total jobs left unfinished by the end of time sloti, usingt “free” time slots. Also we haveOPT(i,t) = 0 ifi≤t, as we assume thatA(i)≤C,∀i; otherwise, let M C = max{OPT(i−1,t−1)+A(i)−C,0} be the number unfinished jobs after serving up toC jobs in total in time sloti.U C (i,t) = {(A(i),U(i−1,t−1))− S C (i)}× M and S C (i,t) is generated by distributing a capacity ofC greedily fromU D (i−1,t−1) toU 1 (i−1,t−1), then toA(t). IfU D (i− 1,t−1)>C, thenM C =∞. Similarly, let M H = max{OPT(i−1,t)+A(i)−H,0} be the result of serving at mostH jobs in time sloti. U H (i,t) ={(A(i),U(i−1,t))−S(i)}×M S H (i,t) is generated by distributing a capacity ofH greedily fromU D (i−1,t) toU 1 (i− 1,t), then toA(t). IfU D (i−1,t)>H, thenM H =∞. WhenD > 1, we recordS(·) andU(·) for each time slot. The process is as follows: 1. CalculateM C andM H . 28 2. CalculateU C (i,t),U H (i,t),S C (i,t), andS H (i,t). 3. OPT(i,t) = min{M C ,M H }. If bothM C andM H are∞ , then theH value is not feasible. 4. SetU(i,t) andS(i,t) to beU C (i,t),S C (i,t), orU H (i,t),S H (i,t), according to the choice ofM C orM H . 3.4.2 Proof of Correctness As explained in Section 3.4.1, there exist multiple possible S(i) vectors per each U(i−1) vector. In our algorithm, we only choose the vector that greedily serves jobs with higher delay first. We need to show that doing this will not result in a greaterH value as compared to any other strategy. Suppose that for{A(i),i = 1,2,...,N} there exists an optimal service schedule sequence{S O (i),i = 1,2,...,N} that makes H feasible. At the same time we have our greedy service schedule sequence{S G (i),i = 1,2,...,N}. For clarity of presentation, we add superscripts G and O to sequences U(·) and S(·), to denote their correspondence to the greedy schedule and the optimal schedule, respectively. Our goal is to show that H is also feasible under{S G (i),i = 1,2,...,N}. 
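Before turning to the proof, the greedy per-slot update just described — serve the most-delayed jobs first, then newer ones, then the current arrivals, and shift the backlog vector by one position — can be sketched as follows. This is a minimal Python sketch with illustrative names; `cap` stands for H in an "accounted" slot and for C in a "free" slot.

```python
def serve_slot(A_i, U_prev, cap):
    """One greedy slot update.  U_prev = [U_1, ..., U_D] (oldest jobs last)."""
    backlog = [A_i] + list(U_prev)             # vector (A(i), U_1, ..., U_D)
    S = [0] * len(backlog)
    for j in range(len(backlog) - 1, -1, -1):  # serve U_D first, A(i) last
        S[j] = min(backlog[j], cap)
        cap -= S[j]
    if backlog[-1] - S[-1] > 0:                # U_D jobs left unserved: slot is infeasible
        return None
    U_new = [backlog[j] - S[j] for j in range(len(backlog) - 1)]  # drop last, shift by one
    return S, U_new
```

For the D = 2 example above, serve_slot(3, [2, 1], 3) returns the greedy service vector [0, 2, 1] and the new backlog vector [3, 0].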
To prove this we show that for any i, U G D (i)≤U O D (i). For ease of presentation, letSU(i) = P D j=1 U j (i) andSS(i) = P D j=0 S j (i) be the total amount of unfinished jobs and jobs served at time sloti, respectively. Indeed, at any time slot i, suppose that the vector of unfinished work for the opti- mal and greedy schedule are(A(i),U O 1 (i),...,U O D (i)) and(A(i),U G 1 (i),...,U G D (i)), respec- tively. Note that according to our algorithm we haveSU G (i)≤ SU O (i). (Again, we add superscriptsG andO to denote sequences corresponding to the schedules generated by the greedy strategy and the optimal strategy, respectively.) 29 Now we look at U O D (i +D) and U G D (i +D). If SU G (r) = 0 for any r∈{i + 1,i+ 2,...,i+D}, thenU G D (i+D) = 0≤U O D (i+D). If this is not the case, since the greedy schedule is always work conserving, we have P i+D−1 j=i SS G (j)≥ P i+D−1 j=i SS O (i). AlsoS O (i) is a feasible schedule. Thus in time slot i, at leastU O D (i) jobs have to be served. In time sloti+1, at leastU O D−1 (i) jobs has to be served, and so on. Finally, in time slotT +D−1, at leastU O 1 (i) jobs have to be served. As a result, U O D (i+D)≥A(i)−( i+D−1 X j=i SS O (j)− D X j=1 U O j (t)) ≥A(i)−( i+D−1 X j=i SS G (i)−SU O (i)) ≥A(i)−( i+D−1 X j=i SS G (i)−SU G (i)) =U G D (i+D) This completes the proof. 3.5 Online Problem: Hardness So far we have studied an offline version of the95th percentile problem, where it is assumed that the sequenceA(i),i = 1,2,...,N is known before OPT is carried out. In this section, we relax this assumption and assume thatA(i) cannot be observed until time sloti. This relaxation makes the problem more realistic because in many scenarios, it is difficult to predict the volume of future job arrivals. However, it also makes the problem much more difficult. In fact, for the 95th percentile problem, performance of any online algorithm can be quite bad, as stated in Theorem 1. 30 For simplicity, we useN X A to denote the 95th percentile of the service schedule gener- ated by algorithmX for arrival sequenceA(1) throughA(N). Theorem 1. For any online algorithm there exists an arrival series A(i),i = 1,2,...,N, such that N OL A N OPT A cannot be bounded by any constant. To prove the theorem, we show that an adversarial is able to construct an arrival sequence, such that no matter how the jobs are being served, The ratio of the optimal solution and the one achieved by the online algorithm goes to infinity. Proof. The adversarial construct arrival series as follows: LetA(1) = x, and thenA(i) = αx, where α > 1 and i = 2,3,.... Keep generate such arrival until the number of time slots with non-zero job service (S(·)) is exactly T for the online algorithm. Suppose this happens at time slotT ∗ . Clearly,T ∗ ≤ 2T . If T ∗ < 2T , then let A(T ∗ + 1) = x, A(T ∗ + 2),...,A(n) = 0. In this case, we can constructS OPT (2k−1) = 0 andS OPT (2k) = A(2k−1)+A(2k) fork = 1,2,...,T ∗ /2. SinceT ∗ + 1≤ 2T , all these non-zero services are not accounted. ThusN OPT A = 0. On the other hand,N OL A > 0. The reason is that in time slot 1 throughT ∗ +2 there are at least T +1 time slots with non-zero job service for the online algorithm. Therefore, N OL A N OPT A can be arbitrarily large. On the other hand, ifT ∗ = 2T , then it must be the case that,SS OL (1) = 0,SS OL (2) = (1+α)x,SS OL (2k) = 2αx andSS OL (2k−1) = 0 fork = 2,3,...,T . Now letA(T ∗ +1) = 4αx. A(T ∗ + 2),...,A(n) = 0, and we haveN OL A ≥ min{(1 +α)x,2αx}≥ (1 +α)x. On the other hand,N OPT A = x/2. 
The corresponding optimal job service sequence is S OPT (1) =S OPT (2) =x/2,S OPT (2k+1) =A(2k)+A(2k+1) andS OPT (2k+2) = 0 fork = 1,2,...,T . Then the competitive ratio is at least (1+α)x x/2 = 2(1+α) and cannot be bounded. 31 3.6 A slightly different online problem In Section 3.5 we show that the original online problem has no constant approximate solu- tion. And from the proof we can see that the difficulty lies in the correct identification and usage of “free” time slots. This motivates our next step on a similar problem where no “free” time slots are allowed. More specifically, we consider the problem of minimizing the maximum of number of jobs served, instead of the 95th percentile. Formally, let A(i) andS(i) be the number of job arrivals (with unit work) and the vector of jobs served in time sloti, respectively. Again, jobs can be delayed for at mostD time slots, and our goal is to minimize the maximum of the sum ofS(i), i.e., total number of jobs served in a time slot. We refer to this problem the mini-max problem. We note that the offline version of this problem is a special case of the problem solved in [78]. What we are interested is whether the online version of problem can be solved effectively. 3.6.1 An efficient online algorithm for the mini-max problem We term our approach ES as short for “Equal Split”. The algorithm is quite simple. Algorithm 2 ES: CalculateS ES (i),i = 1,2,...,N Require: constantD 1: for time sloti = 1,2,...,N do 2: ObtainA(i) 3: ifi<D then 4: S(i)← 1 D+1 (A(1),...,A(i)) 5: end if 6: ifD≤i≤N−D then 7: S(i)← 1 D+1 (A(i−D +1),A(i−D +2),...,A(i)) 8: end if 9: ifi>N−D then 10: S(i)← ( A(i−D+1) N−i+D ,..., A(i) N−i+1 ) 11: end if 12: end for 32 Consider Algorithm 2, in each time slot, the arriving jobs are split into D + 1 sets of equal size. In this time slot, as well as each of the followingD time slots, one set will be served. Note that lines 9 to 11 is just to make sure that jobs arriving in the lastD time slots are split correctly, since they cannot be split intoD +1 sets. The competitive ratio of this simple algorithm is given in Theorem 2. Theorem 2. The competitive ratio of ES is 2D+1 D+1 . Proof. LetF A (i) △ = P i+D j=i A(j) be the number of job arrivals during consecutiveD+1 time slots starting with time sloti. Additionally, letF A △ = max i F A (i). Then we have: M ES A = F A D +1 . (3.1) This is because in scheme ES, the number of jobs served in time slot i is exactly P i+D−1 j=i A(j) D+1 . However, in an optimal solution we have: M OPT A ≥ F A 2D+1 . (3.2) This is the case since all jobs arriving between time slots i and i +D have to be served between time sloti and time sloti+2D. Comparing (3.1) and (3.2) we complete our proof. 3.6.2 Performance lower bound for all online algorithms Above we showed that ES is 2D+1 D+1 competitive. In fact, this ratio is not far from what any online algorithm can possibly achieve. We now give a lower bound on the performance of any online algorithm in the following theorem. 33 Theorem 3. For the mini-max problem with maximum allowed delay D, the competi- tive ratio of any deterministic online algorithm denoted by OL is at least 2D+1 D+F(D) , where F(D) = P D+1 i=1 i D+i . In other words, M OL A M OPT A ≥ 2D +1 D +F(D) . (3.3) Again, the proof approach is to show that an adversary can force any online algorithm to achieve a competitive ratio of at least 2D+1 D+F(D) . Proof. Let α = 2D+1 (D+F(D))(D+1) . We need to show that the competitive ratio of any online algorithm is at least (D + 1)α. 
We useM OPT A (i) to denote the optimal solution to the mini-max problem whenA(i+1) throughA(N) is set to 0. Let the adversary first setA(1) = 1. It is obvious thatM OPT A (1) = 1 D+1 . If the online algorithm servesS OL (1)≥α, then the adversary stops, and the competitive ratio is at least (D +1)α. Otherwise, the carry-overU OL A (1)≥ 1−α. Now the adversary setsA(2) to be 1 also.M OPT A (2) = 2 D+2 . If the online algorithm results in S OL (2)≥ 2(D+1) D+2 α, then the adversary stops, and the competitive ratio is at least(D+1)α. Otherwise, the adversary sets A(3) = 1. ThenM OPT A (3) = 3 D+3 . If the online algorithm results inS OL (3)≥ 3(D+1) D+3 α, then the adversary stops, and the competitive ratio is at least (D + 1)α. Continuing in this manner, the adversary sets A(1) through A(D + 1) to be 1, then A(D + 2) through A(2D +2) to beD +1, and so on. At any time stepi, if the online algorithm does no less than(D+1)α of the optimum offline algorithm in the time sloti, or(D+1)αM OPT A (i), the sequence stops, and the adversary has forced a competitive ratio of (D +1)α. Otherwise, the sequence of jobs faced by the online algorithm from time slot 1 tok(D +1) is 1,D + 1,(D+1) 2 ,...,(D +1) k−1 , each repeatedD +1 times. We know that M OPT A (p(D +1)+r) = r(D+1) p D +r , 34 p = 0,1,...,k−2,r = 1,2,...,D+1 Thus, the carry-over to time slotk(D +1)+1 satisfies U OL A (k(D+1))≥ k(D+1) X i=1 [A(i)−S OL A (i)] ≥ (D +1) k−1 X i=0 (D +1) i −(D+1)α( D+1 X i=1 i D +i )( k−1 X i=0 (D +1) i ) = (D +1) (D +1) k −1 D (1−F(D)α) Now let A(k(D + 1) + 1) = (D + 1) k . In this case we haveM OPT A (k(D + 1) + 1) = (D +1) k−1 . If the online algorithm results inS OL (k(D +1)+1)≥ α(D +1) k , then the adversary stops, and the competitive ratio is at least(D +1)α. Otherwise, U OL A (k(D +1)+1)≥U OL A (k(D+1)+1)+(1−α)(D+1) k = (D +1) (D+1) k −1 D (1−F(D)α)+(1−α)(D+1) k = [ 2D +1 D −( D +1 D F(D)+1)α](D+1) k −o((D +1) k−1 ) (3.4) Note that the last term on the right side of Equation (3.4) is (D+1)(1−F(D)α) D . It is much less than(D +1) k−1 whenk is large, so we useo((D +1) k−1 ) to represent it. Next the adversary lets all subsequent number of arrivals be0. We know thatM OPT A = (D+1) k−1 . Moreover,U OL A (k(D+1)+1) has to be served in the followingD time slots. Thus M OL A ≥ U OL A (k(D+1)+1) D 35 Thus the competitive ratio M OL A M OPT A ≥ U OL A (k(D +1)+1) D(D +1) k−1 = (D+1) k [ 2D+1 D −( D+1 D F(D)+1)α]−o((D +1) k−1 ) D(D +1) k−1 (3.5) Letk→∞ in Equation (3.5) and plug inα = 2D+1 (D+F(D))(D+1) , we have M OL A M OPT A ≥ (D +1)(2D+1) D 2 −[ (D+1) 2 D 2 F(D)+ D +1 D ]α = (D +1)(2D+1) D 2 − (2D +1)((D+1)F(D)+D) D 2 (D +F(D)) = 2D +1 D +F(D) = (D +1)α This completes the proof. We not that when D = 1, the lower bound is 2D+1 D+F(D) = 18 13 . At the same time, ES achieves a ratio of 2D+1 D+1 = 3 2 , which is quite close to the lower bound. WhenD→∞, we have thatF(D)→ (1−ln2)D + 1. Thus the lower bound 2D+1 D+F(D) → 2 2−ln2 ≈ 1.53. At the same time, ES’s competitive ratio approaches 2. 3.7 Applying OPT In Section 3.3, we have shown OPT gives an offline optimal solution to the95th percentile problem. However, it relies on knowledge of future job arrivals. This limits the practi- cability of the algorithm. In this section, we provide a heuristic approach based on OPT that is more practical. We term this algorithm HEuristic for onLine PErcentile Reduction, HELPER in short. 
36 3.7.1 The HELPER approach In many cases, network traffic in real systems exhibits a pattern that repeats over the accounting cycle, which is typically 24 hours. For example, traffic volume at the same times on two consecutive weekdays days often shows strong correlations. Such correlation can be leveraged to obtain good performance in practice without having a priori knowl- edge of arrivals. Then we can use traffic information obtained today as a prediction for tomorrow. Below we propose a simple heuristic approach based on this observation and OPT. Let the job arrivals of day x be denoted by A x (1) through A x (N). We record job arrivals of day x. At the end of the day, we apply OPT and obtain S x (1),...,S x (N) and U x (1),...,U x (N). Now let r x (i) = Sx(i)−Ux(i−1) Ax(i) ,i = 1,2,...,N, be the fraction of A x (i) that are not being delayed. With r x (i),i = 1,2,...,N, the arrivals in day x + 1 can be served according to the following algorithm. Algorithm 3 HELPER: CalculateS x+1 (i),i = 1,2,...,N Require: r x (i),i = 1,2,...,N,T = 0.05N 1: U x+1 (0)← 0 2: for time sloti = 1,2,...,N do 3: ObtainA x+1 (i) 4: s←r x (i)A x+1 (i)+U x+1 (i−1) 5: m← (T +1)th largest inS x+1 (j),j = 1,2,...,i−1 6: ifs>m then 7: S x+1 (i)←s 8: else 9: S x+1 (i)← min{m,A x+1 (i)+U x+1 (i−1)} 10: end if 11: U x+1 (i)←A x+1 (i)+U x+1 (i−1)−S x+1 (i) 12: RecordA x+1 (i) 13: end for Let us considerA x+1 (i),i = 1,2,...,N. Intuitively, when A x+1 (i) = K×A x (i),i = 1,2,...,N, where K > 0 is a constant, following the service schedule of day x results in 37 optimal 95th percentile for day x + 1. Such a schedule is characterized by the sequence r x (1) through r x (N). Thus, in day x + 1, when A x+1 (i) comes in, we should serve s = r x (i)A x+1 (i)+(1−r x (i−1))A x+1 (i−1) jobs, which is stated in line 4 of Algorithm 3. On the other hand, the 95th percentile will not be increased unless more than m jobs are served, wherem is the(T+1)st largest value fromS(1) throughS(i−1), which is already computed in previous time slots. (line 5). Thus we can serve up to the maximum ofs and m, as expressed in line 5 to 10. The idea here is that ifs < m, we can still serve up tom jobs without increasing the95th percentile. At the same timeA x+1 (i) is recorded and then used to learnr x+1 (i). Then the sequence ofr x+1 (i),i = 1,2,...,N is used to schedule jobs arrive in dayx+2, and so forth. 3.7.2 Discussions In HELPER, we useA x (i) as a simple prediction ofA x+1 (i). We note that more sophisti- cated algorithms such as ARIMA[81] can be applied here to predictA x+1 (i). These pre- diction algorithms might result in better performance of HELPER, but also brings in more computation overhead, while as shown in Section 3.8, our simple model already brings in significant results. Also, it is easy to see that this heuristic approach does not come with any performance guarantee. It is not surprising that it can perform arbitrarily bad ifA x+1 (i) is not correlated withA x (i). 3.8 Experiments We now evaluate performance of our approach through simulations. First, we describe our simulation setting. 38 3.8.1 Experimental settings Workload data We use hit records of Wikipedia as the basis, on top of which we generate our workload datasets. Wikipedia records its hourly hit rate of the previous month at [9]. We use these records between August 11, 2010 and October 25th, 2010. Since the length of one time slot is usually 5 minutes, we interpolate this data so that the sample frequency is once every 5 minutes. 
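A minimal sketch of this resampling step, assuming the hourly hit counts are available as a one-dimensional array; the exact interpolation method is not specified by the data source, so linear interpolation is an assumption here.

```python
import numpy as np

def hourly_to_5min(hourly_counts):
    """Linearly interpolate hourly samples onto a 5-minute grid (12 samples per hour)."""
    hours = np.arange(len(hourly_counts), dtype=float)              # original sample times
    grid = np.arange(0.0, len(hourly_counts) - 1 + 1e-9, 5 / 60.0)  # 5-minute time points
    return np.interp(grid, hours, hourly_counts)

# Example: hourly_to_5min([100, 220]) yields 13 samples ramping linearly from 100 to 220.
```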
Based on this record, we generate 6 time series with different characteristics. The statistics of these datasets are listed in Table 3.1. Note that the "autocorr" column shows the auto-correlation at a 24 hour lag. From the table we can see that the datasets show medium to high 24 hour lag auto-correlation, and small (datasets 1 and 2) to high (datasets 5 and 6) coefficients of variation.

Figure 3.1: An example of an arrival dataset (number of jobs per time slot), with a diurnal pattern and a change in traffic volume marked.

As an example, a portion of dataset 1 is depicted in Figure 3.1. It can be seen from this figure that these workload time series include diurnal patterns, as well as some "anomalies", such as changes in traffic volume.

Table 3.1: Statistics of datasets.
dataset   max (x10^6)   min (x10^6)   mean (x10^6)   autocorr   coef of var
   1         4.35          1.04          2.01          0.93        0.34
   2         2.32          1.00          1.58          0.82        0.17
   3         3.30          0.23          1.66          0.74        0.54
   4         3.34          0.25          1.71          0.78        0.54
   5         2.35          0.00          0.72          0.71        0.89
   6         2.34          0.04          0.86          0.72        0.70

Experimental methodology
We implement OPT, HELPER and ES. In the first set of simulations, we study the effect of different delay values and target percentiles. We run OPT with different D values against all datasets. We also modify OPT so that the minimization objective varies: the 95th, 98th, and 99th percentile. (Recall that OPT can be adapted to find the optimal service schedule for any percentile; all that is needed is a change in the value of T.) We then drive all algorithms using all of the datasets and calculate the relative change in the 95th percentile and in the maximum number of jobs served for each of the 74 days, as compared to the original scenario, where the workload was served without any delay. Note that OPT and HELPER are originally designed to reduce the percentile value. We modify them so that OPT also computes the optimal solution for the mini-max problem; adopting this version of OPT, HELPER can also be used to reduce the maximum number of jobs served. For each dataset, we obtain 3 sets of results. For OPT and ES the sets have 74 elements, whereas for HELPER there are 73 elements, because no predictions are available for the first 24 hours.

Figure 3.2: Average reduction with different D values (95th, 98th, and 99th percentile objectives).

3.8.2 Experimental results
In Figure 3.2 we plot the average performance of OPT in reducing the 95th, 98th and 99th percentile with different delay values, as compared with the original scheme. Each point in the figure is the average reduction in the corresponding percentile over all 6 datasets. As D increases, OPT offers greater reduction. This is intuitive because increasing D allows greater flexibility in service; with it, OPT is able to "accumulate" more jobs in time slots that are not being accounted. Also, when the objective percentile decreases, OPT performs better since more "free" time slots are available.

Figure 3.3: 95th percentile results for OPT, HELPER and ES.

In Figure 3.3 we show the performance of all algorithms in reducing the 95th percentile for all datasets. In this set of simulations D is set to 1. As mentioned earlier, for each dataset, each algorithm generates more than 70 results. We use bars with different colors to show the mean value of these results generated by the different algorithms.
We also use the dark bars on top of each color bar to show the 95% (two-sigma) confidence interval.

From Figure 3.3 we can see that OPT achieves a significant reduction of the 95th percentile (18% to 35%) for all datasets. In the worst case, OPT provides more than a 15% reduction as compared to the no delay scheme. HELPER reduces the 95th percentile by about 5% to 13% on average for the different datasets. Moreover, the results of HELPER show a larger variation as compared to those of OPT. This is understandable because the predictions are not accurate, and this may cause HELPER to perform worse than the no delay scheme on some days, although on average HELPER results in better performance. Additionally, we can observe from Figure 3.3 that the variation in the performance of HELPER increases significantly in datasets 5 and 6, where (1) the coefficients of variation are larger than those of datasets 1 through 4, and (2) the 24 hour lag autocorrelation is smaller than that of datasets 1 through 4. This makes sense because a high coefficient of variation and low autocorrelation indicate more "randomness" in the dataset, which makes accurate prediction of future traffic more difficult. Finally, ES does not show much improvement over the no delay scheme on average. It also shows increased variation as the coefficient of variation of the dataset increases. This is because ES is designed to reduce the maximum instead of the 95th percentile. It tends to "smooth" the number of jobs served instead of "accumulating" jobs that arrived in several time slots into one "free" time slot, as OPT and HELPER do. Thus the "free" time slots may not be fully utilized by ES to serve more jobs.

To conclude, we see that for all datasets there is significant room for reduction, as demonstrated by the results of OPT. Allowing larger delay values brings in more potential reductions in percentiles. HELPER, while giving no performance guarantees, is effective in reducing the 95th percentile in all datasets. However, the performance of HELPER shows larger variance when the datasets include more "randomness". ES is not effective in reducing the 95th percentile.

Figure 3.4: Mini-max results for OPT, HELPER and ES.

Figure 3.4 depicts the reduction achieved (by the various algorithms) in the maximum number of jobs served. We observe that OPT reduces the maximum value by approximately 4% to 10% on average. HELPER does not result in very good performance: for datasets 2, 5, and 6 it results in a larger maximum value than the no delay scheme, and for the other datasets it gives at most a 3% reduction. Moreover, the variance in HELPER's performance is larger than that of both OPT and ES on all datasets. In summary, HELPER is not effective in reducing the maximum number of jobs served. Finally, ES shows a small reduction (less than 2% on average, about 5% at most) in the maximum number of jobs served, with very small variance in performance. The intuition here is that ES is designed for bounded worst case performance; there is no guarantee it would perform better than the no delay scheme in practice.

3.9 Conclusions and Future Work
In this chapter, we studied scheduling algorithms aimed at reducing the 95th percentile of the number of jobs served, which is the basis of bandwidth cost accounting. We provided an offline algorithm that optimizes the 95th percentile for uniform delay constraints. We also showed that the online problem is hard.
But for a slightly different problem, where the maximum is minimized, we provided a simple algorithm and prove that it achieves close to optimal performance. As part of future work, we would like to close the performance gap between known algorithms ( 2D+1 D+1 ) and the lower bound on all algorithms that can achieve ( 2D+1 D+F(D) ). Here we presented a simple heuristic approach based on our offline algorithm. As part of future efforts, we are also interested in designing heuristics that are more effective in reducing the 95th percentile. Additionally, we would like to extend our work to the case of non-uniform delay, where different jobs can be delayed for different number of time slots. 44 Chapter 4 Power Cost Reduction for Delay Tolerant Workloads in Distributed Data Centers: a Two Time Scale Approach Geographically distributed computing infrastructure offers significant potential for explor- ing power cost savings. In this work we focus on a stochastic optimization based approach to make distributed routing and server management decisions in the context of large-scale, geographically distributed data centers. Our approach considers such decisions at different time scales and offers provable power cost and delay characteristics. The utility of our approach and its robustness is also illustrated through simulation-based experiments under delay tolerant workloads. 4.1 Introduction Over the last few years, the demand for computing has grown significantly. This demand is being satisfied by very large scale, geographically distributed data centers, each containing a huge number of servers. While the benefits of having such infrastructure are significant, so are the corresponding energy costs. As per the latest reports, several companies own a number of data centers in different locations, each containing thousands of servers – Google (≈1 million), Microsoft (≈200K), Akamai (60-70K), INTEL (≈100K), and their 45 corresponding power costs are on the order of millions of dollars per year [4]. Given this, a reduction by even a few percent in power cost can result in a considerable monetary saving. Extensive research has been carried out to reduce power cost in data centers, e.g., [62, 56, 51, 77, 70, 27]; such efforts can (in general) be divided into the following two categories. Approaches in the first category attempt to save power cost through power effi- cient hardware design and engineering, which includes designing energy efficient chips, DC power supplies, and cooling systems. Approaches in the second category exploit three different levels of power cost reduction at existing data centers as follows. First, at the server level, power cost reduction can be achieved via power-speed scaling [77], where the idea is to save power usage by adjusting the CPU speed of a single server. Second, at the data center level, power cost reduction can be achieved through data center right siz- ing [27, 51], where the idea is to dynamically control the number of activated servers in a data center to save power. Third, at the inter-data center level, power cost reductions can be achieved by balancing workload across centers [62, 56], where the idea is to exploit the price diversity of geographically distributed data centers and route more workload to places where the power prices are lower. Our work falls under the second category, where our goal is to provide a unifying frame- work that allows one to exploit power cost reduction opportunities across all these levels. 
Moreover, the non-work-conserving nature of our framework allows us to take advantage of the temporal volatility of power prices while offering an explicit tradeoff between power cost and delay. Consider a system of M geographically distributed data centers, each consisting of a front end proxy server and a back end server cluster as shown in Figure 4.1. At different time instances, workload arrives at the front end proxy servers which have the flexibility to distribute this workload to different back end clusters. The back end clusters receive 46 the workload from front end servers and have the flexibility to choose when to serve that workload by managing the number of activated servers and the service rate of each server. Our goal then is to minimize power cost by deciding: (i) how to distribute the workload from the front end servers to the back end clusters, (ii) how many servers to activate at each back end cluster at any given time, and (iii) how to set the service rates (or CPU power levels) of the back end servers. There are two challenging aspects of this problem: 1) different control actions act as different time scales – power scaling can be performed much faster than server activation; and 2) both power prices and workload are stochastic in nature and can be hard to predict accurately. To address these challenges, we design an algorithm that makes decisions over two timescales. The algorithm uses Lyapunov optimization concepts from [72] and offers explicit performance guarantees in these stochastic settings. Our proposed solution exploits temporal and spatial variations in the workload arrival process (at the front end servers) and the power prices (at the back end clusters) to reduce power cost. It also facilitates a cost vs. delay trade-off which allows data center operators to reduce power cost at the expense of increased service delay. Hence, our work is suited for delay tolerant workloads such as massively parallel cloud computing and data intensive MapReduce jobs. Today, MapReduce based applications are used to build a wide array of web services – e.g., search, data analytics, social networking, etc. For example, it is indicated in [9] that Google often has a large number of “long duration” jobs running on back end servers. These jobs takes from several minutes to more than 20 hours, and thus are relatively delay tolerant. Our contributions can be summarized as follows: • We propose a two time scale control algorithm aimed at reducing power cost and facili- tating a power cost vs. delay trade-off in geographically distributed data centers (Section 4.2 and 4.3). 47 • By extending the traditional Lyapunov optimization approach, which operates on a single time scale, to two different time scales, we derive analytical bounds on the time average power cost and service delay achieved by our algorithm (Section 4.4). • Through simulations based on real-world power price datasets, we show that our approach can reduce the power cost by as much as 18%, even for state-of-the-art server power consumption profiles and data center designs (Section 4.6). We also show that our approach is environmentally friendly, in the sense that it not only reduces power cost but also the actual power usage (Section 4.6.5). • We demonstrate the robustness of our approach to errors in estimating data center work- load both analytically as well as through simulations (Section 4.6.2). 
• We evaluate the performance of our approach with datasets of different workload inten- sities and show that our approach performs better when workload intensity is lower. It is able to reduce power cost by close to 60% when average system load is around 26%. (Section 4.6.3). • Further evaluations show that our approach is able to reduce power cost for non-i.i.d. arrival datasets with various autocorrelation levels (Section 4.6.4). 4.2 Problem Formulation ConsiderM geographically distributed data centers, denoted byD ={D 1 ,...,D M }. The system operates in slotted time with slotst∈{0,1,2,....}. Each data centerD i consists of two components, a front end proxy server, S F i , and a back end server cluster,S B i , that hasN i homogeneous servers. Fig. 4.1 depicts our system model. 48 Data Center 1 S F 1 S B 1 Q F 1 Q B 1 Data Center 2 S F 2 S B 2 Q F 2 Q B 2 Data Center M S F M S B M Q F M Q B M Figure 4.1: A model ofM geographically distributed data centers. 4.2.1 The Workload Model In every time slot t, jobs arrive at each data center. We denote the amount of workload arriving at D i by A i (t), where A(t) = (A 1 (t),...,A M (t)) denotes the arrival vector. In our analysis, we first assume that A(t) are i.i.d. over time slots withE A(t) = λ , (λ 1 ,...,λ M ). We later discuss how our results can be extended to the case when A(t) evolves according to more general random processes. We also assume that there exists someA max such thatA i (t)≤ A max for alli andt. Note that the components inA(t) can be correlated – this is important in practice as arrivals to different data centers may be correlated. 4.2.2 The Job Distribution and Server Operation Model A job first arrives at the front end proxy server ofD i ,S F i , and is queued there. The proxy serverS F i then decides how to distribute the awaiting jobs to the back end clusters at differ- ent data centers for processing. To model this decision, we useµ ij (t) to denote the amount of workload routed fromD i toD j at timet, and useμ i (t) = (µ i1 (t),...,µ iM (t)) to denote 49 the vector of workload routing rates atS F i . We assume that in everyt,μ i (t) must be drawn from some general feasible workload distribution rate setR i , i.e.,μ i (t)∈R i for allt. We assume that each setR i is time invariant and compact. It can be either continuous or finite discrete. Also each setR i contains the constraint that P j µ ij (t)≤ µ max for some finite constantµ max . Note that this assumption is quite general. For example,R i can contain the constraints that µ ij (t) = 0 for all t to represent the constraint that jobs arriving at D i cannot be processed atD j , e.g., due to the fact thatD j does not have a data set needed for the corresponding computation. For each data centerD i , the jobs routed to its back end cluster are queued in a shared buffer. The data center then controls its back end cluster as follows. In every time slot of the form t = kT with k = 0,1,... where T ≥ 1, the data center first decides on the number of servers to activate. We denote the number of active servers at timet at D i as N i (t), whereN i (t)∈{N i min ,N i min +1,N i min +2,...,N i }, withN i being the total number of servers at the back end cluster S B i , and N i min ,0≤ N i min ≤ N i being the minimum number of servers that should be activated at all times for data center D i . If at time slot t =kT , we haveN i (t)>N i (t−T), then we activate more servers. Otherwise, we simply putN i (t−T)−N i (t) servers to sleep. 
The reasons for havingN i (t) changed only every T time slots are: (i) activating servers typically costs a non-negligible amount of time and power, and (ii) frequently switching back and forth between active and sleep states can result in reliability problems [18]. In addition to deciding on the number of active servers, Each data center also make decisions on the service rate of each active server every time slot. This can be achieved by using techniques such as power scaling [77]. For ease of presentation, below we assume that all active servers in a data center operate at the same service rate, and denote active servers’ service rate atD i byb i (t),0≤b i (t)≤b max , where b max is some finite number. We note that our model can be extended to more general scenarios. A further discussion of the above assumptions is given in Section 4.2.5. 50 4.2.3 The Cost Model We now specify the cost model. For time slot t, we use t T =⌊ t T ⌋T to denote the last time before t when the number of servers has been changed. Then at time slot t, by run- ning N i (t T ) servers at speed b i (t), D i consumes a total power of P i (N i (t T ),b i (t)). We assume that the function P i (·,·) is known to D i , and there exists some P max such that P i (N i (t T ),b i (t))≤ P max for allt andi. Such power consumption will in turn incur some monetary cost for the data centers, of the form “power×price”. To also model the fact that each data center may face a different power price at time slott, we denote the power price atD i at time slott byp i (t). We assume thatp(t) = (p 1 (t),...,p M (t)) varies everyT 1 ≥ 1 time slots, whereT = cT 1 for some integerc. We assumep(t) are i.i.d and everyT 1 time slots, eachp i (t) takes a value in some finite state spaceP i ={p 1 i ,...,p |P i | i }. We also define p max , max ij p j i as the maximum power price that any data center can experience. We use π i (p) to denote the marginal probability that p i = p. An example of these different time scales is given in Figure 4.2. Finally we usef i (t) =P i (N i (t T ),b i (t))p i (t) to denote the power cost atD i in time slot t. It is easy to see that if we definef max ,MP max p max , then P i f i (t)≤f max for allt. t 0 T 2T T T1 Figure 4.2: An example of different time scalesT andT 1 . In this example,T = 8,T 1 = 4, andT = 2T 1 . 4.2.4 The data center Power Cost Minimization (PCM) problem LetQ(t) = (Q F i (t),Q B i (t),i = 1,...,M),t = 0,1,..., be the vector denoting the workload queued at the front end servers and the back end clusters at time slott. We use the following 51 queueing dynamics: Q F i (t+1) = max Q F i (t)− P j µ ij (t),0 +A i (t), (4.1) Q B i (t+1)≤ max Q B i (t)−N i (t)b i (t),0 + P j µ ji (t). (4.2) The inequality in (4.2) is due to the fact that the front end servers may allocate a total routing rate that is more than the actual workload queued. In the following, we assume that data centers can estimate the unfinished workload in their queues accurately. The case when such estimation has errors will be discussed in Section 4.4. Throughout the paper, we use the following definition of queue stability: Q, limsup t→∞ 1 t t−1 X τ=0 M X i=1 E Q F i (τ)+Q B i (τ) <∞. (4.3) Then we define a feasible policy to be the one that choosesN i (t) everyT time slots subject toN i min ≤N i (t)≤N i , and choosesµ ij (t) andb i (t) every time slot subject to onlyμ i (t)∈ R i and0≤b i (t)≤b max . We then define the time average cost of a policyΠ to be: f Π av , limsup t→∞ 1 t t−1 X τ=0 M X i=1 E f Π i (τ) . 
(4.4) Here, f Π i (τ) denotes the power cost incurred by policy Π at time slot τ. We call every feasible policy that ensures (4.3) a stable policy, and usef ∗ av to denote the infimum average power cost over all stable policies. The objective of our problem is to find a stable policy that chooses the number of activated servers N i (t) every T time slots, and chooses the workload distribution rates µ ij (t) and the service rates b i (t) every single time slot, so as to minimize the time average power cost. We refer to this as the data center Power Cost Minimization (PCM) problem in the remainder of the paper. 52 4.2.5 Model Discussion and Practical Consideration We now discuss the assumptions made in our model. Above, we assume that in time slott all activated servers at D i have the same service rate b i (t). This is not restrictive. Indeed, let us focus on one data center in a single time slot and consider the following formulation: Let the power consumption of a server with service rateb beP server (b), a convex function ofb (see [70, 28]). If theN activated servers run at service rates b 1 , b 2 , ... , b N , then the total power consumed by them can be written as P total = P N j=1 P server (b j ). Suppose the actual power consumption in this data center, P center , has the formP center =C 1 ×P total +C 2 , whereC 1 andC 2 are constants or functions ofN only. Then we have: P center =C 1 P total +C 2 =C 1 N X i=1 P server (b i )+C 2 ≥C 1 NP server ( P N i=1 b i N )+C 2 . This indicates that, to minimize the power consumption without reducing the amount of workload served, all servers should have the same service rate. This justifies our assump- tion. We also assume that jobs can be served at more than one data center. When serving cer- tain types of jobs, such as I/O intensive ones, practical considerations such as data locality should also be considered. This scenario can easily be accommodated in our model by imposing restrictions onR i at the front end servers. For example, we can setR i such that jobs arrived at the front end server of data center D i can only be routed to a subset of all back end clusters. Moreover, service providers like Google replicate data across (at least) two data centers [2]. This provides flexibility for serving I/O intensive jobs. 53 We also assume that the data centers can observe/measure Q(t), i.e., the unfinished work of all incoming workload accurately. However, in many cases, the available informa- tion only contains the number of jobs and the number of tasks per job. With this, we can estimate the amount of workload of the incoming jobs. In addition, in Section 4.4 we show that even if the estimation is not accurate we can still prove bounds on the performance of our approach. Moreover, in Section 4.6.2 we show the robustness of our approach against workload estimation errors through experiments. Finally, we assume that a server in sleep state consumes much less power than an idle server. According to [27], a server consumes10 Watts when in sleep state, as compared to 150 Watts in idle state. This indicates that our assumption is reasonable. We also assume that servers can be activated and deactivated immediately. This assumption involves two issues: 1) regarding server wake up: we note that waking up from sleep takes around 60 seconds. During this 60 seconds, the server cannot perform any work. 
2) regarding server deactivation: We assume that running jobs on these servers need to be migrated to active servers, but this takes some time and introduces additional power cost, Thus these operations should not be ignored, if the control actions on activating servers are made frequently. However, when we chooseT , the period of such actions, to be large, potentially no less than an hour, we can assume that the delay and cost incurred by such actions is amortized over the relatively long period during which the server is active. The effect ofT is further discussed in Section 4.6, where we give experimental results. 4.3 The SAVE algorithm We solve PCM through our StochAstic power redUction schEme (SAVE). We first describe SAVE’s control algorithms and discuss corresponding insight and implementa- 54 tion related issues. SAVE’s derivation and analytical performance bounds are discussed in the following section. 4.3.1 SAVE’s control algorithms SAVE’s three control actions are: • Front end Routing: In every timet = kT , k = 0,1,..., each D i solves forµ ij (t) to maximize M X j=1 µ ij (t)[Q F i (t)−Q B j (t)], (4.5) whereμ i (t) = (µ i1 (t),...,µ iM (t))∈R i . Then in every time slotτ∈ [t,t+T−1], D i sets µ ij (τ) = µ ij (t) and route up to µ ij (τ) amount of workload to the back end cluster atD j , 1≤j≤M. • Back end Server Management: At time slott = kT ,D j choosesN j (t)∈ [N j min ,N j ] to minimize: E t+T−1 X τ=t Vf j (τ)−Q B j (t)N j (t)b j (τ) |Q(t) (4.6) ThenD j usesN j (t) servers over time[t,t+T−1]. In every time slotτ∈ [t,t+T−1], eachD j choosesb j (τ) (note thatN j (t) is determined at time slott) to minimize: Vf j (τ)−Q B j (t)N j (t)b j (τ) (4.7) • Queue Update: Update the queues using (4.1) and (4.2). Note thatV is a parameter of SAVE that controls the tradeoff between power cost and delay performance, as detailed in Section 4.4.2. SAVE works at two different time scales. 55 The front end routing and number of active servers are determined every T slots, while back end servers’ service rates are updated each slot. This two time scale mechanism is important from a practical perspective because activating servers usually takes much longer than power scaling. 4.3.2 Properties of SAVE We highlight SAVE’s two interesting properties. First, it is not work-conserving. A back end clusterS B j may choose not to serve jobs in a particular time slot, even ifQ B j > 0, due to a high power price atD j . This may introduce additional delay but can reduce the power cost as shown in Section 4.6.1. Second, SAVE can provide opportunities for bandwidth cost reductions because (i) it provides an explicit upper bound on the workload sent from S F i toS B j , and (ii) these routing decisions remain unchanged forT time slots. IfT is large, this can provide oppor- tunities for data centers to optimize data transmission ahead of time to reduce bandwidth cost. As suggested in [62], content distribution networks like Akamai can incur significant bandwidth costs. Incorporating bandwidth costs into our model is part of future work. 4.3.3 Implementing the algorithm Note that the routing decisions made at the front end servers do not require any statistical knowledge of the random arrivals and prices. All that is needed is that D i ’s back end cluster broadcastsQ B i (t) to all front end proxy servers everyT time units. This typically only requires a few bits and takes very little time to transmit. Then each data center D i computesµ ij for each j. 
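As an illustration, consider the simple case (discussed further below) where R_i allows at most one nonzero µ_ij, bounded by µ_max, and where the per-slot service rate is picked from a small discrete grid. The following is a minimal Python sketch of the routing choice (4.5) and the per-slot rate choice (4.7); the power function P and all names are assumptions for illustration.

```python
def route_frontend(QF_i, QB, mu_max):
    """Maximize sum_j mu_ij * (QF_i - QB_j) when at most one mu_ij may be nonzero."""
    j = min(range(len(QB)), key=lambda k: QB[k])   # largest backlog differential QF_i - QB_j
    return (j, mu_max) if QF_i > QB[j] else (None, 0.0)

def pick_rate(V, price, QB_j, N_j, P, b_grid):
    """Per-slot choice of b_j in (4.7): minimize V * price * P(N_j, b) - QB_j * N_j * b."""
    return min(b_grid, key=lambda b: V * price * P(N_j, b) - QB_j * N_j * b)
```

The frame-level choice of N_j(t) in (4.6) has the same form, but additionally averages over the empirical price distribution, as described next.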
The complexity of maximizing P j µ ij [Q F i (t)−Q B j (t)] depends on the structure of the setR i . For example, ifR i only allows one µ ij to take nonzero values, then we can easily find an optimal solution by comparing the weight-rate-product of each link, and finding the best one. 56 In the second step, we need to minimize (4.6). This in general requires statistical knowl- edge of the power prices. It is obtained as follows: At every t = kT , k≥ 0, we use the empirical distribution of prices over a time window of sizeL to compute the expectation. Specifically, if t ≥ L, then let n i p (t,L) be the number of times the event{p i (t) = p} appears in time interval [t−L+1,t]. Then use ˆ π i (p) △ = n i p (t,L) L as the probabilityπ i (p) for estimating the expectations. Since allp i (t) only take finitely many values, we have: lim L→∞ ˆ π i (p) = lim L→∞ n i p (t,L) L =π i (p). (4.8) Therefore, as we increase the number of samples, the estimation becomes better and better. Now we can solve for N j (t) through an optimization of (4.6). More specifically, for each candidate choice of the number of active servers ˆ N j (t)∈ [N j min ,N j ], define g( ˆ N j ) △ = X p {ˆ π j (p) t+T−1 X τ=t min ˆ b j (τ)∈[0,bmax] {Vp·P( ˆ N j , ˆ b j (τ)) −Q B j (t) ˆ N j ˆ b j (τ)}} Then chooseN j (t) from[N j min ,N j ] that minimizeg( ˆ N j ). 4.4 SAVE: Design and Performance Analysis SAVE’s focus on reducing power cost along with queue stability suggests a design approach based on the Lyapunov optimization framework [29]. This framework allows us to include power costs into the Lyapunov drift analysis, a well-known technique for designing stable control algorithms. However, the vanilla Lyapunov optimization based algorithms operate on a single time scale. We extend it to two time scales and derive ana- 57 lytical performance bounds analogous to the single time scale case. We now highlight the key steps in deriving SAVE, and then characterize its performance. 4.4.1 Algorithm Design We first define the Lyapunov function,L(t), that measures the aggregate queue backlog in the system. L(t) △ = 1 2 M X i=1 [Q F i (t)] 2 +[Q B i (t)] 2 . (4.9) Next, we define theT -slot Lyapunov drift, Δ T (t) as the expected change in the Lyapunov function overT slots. Δ T (t),E L(t+T)−L(t)|Q(t) . (4.10) Following the Lyapunov optimization approach, we add the expected power cost over T slots (i.e., a penalty function), E P t+T−1 τ=t P j f j (τ) , to (4.10) to obtain the drift-plus- penalty term. A key derivation step is to obtain an upper bound on this term. The following lemma defines such an upper bound. Lemma 1. LetV > 1 andt =kT for some nonnegative integerk. Then under any possible actionsN i (t)∈ [N i min ,N i ],μ i (t)∈R i andb i (t)∈ [0,b max ], we have: Δ T (t)+VE t+T−1 X τ=t X j f j (τ)|Q(t) ≤ B 1 T +VE t+T−1 X τ=t X j f j (τ)|Q(t) −E t+T−1 X τ=t X i Q F i (τ) X j µ ij (τ)−A i (τ) |Q(t) 58 −E t+T−1 X τ=t X j Q B j (τ) N j (t)b j (τ)− X i µ ij (τ) |Q(t) . (4.11) HereB 1 ,MA 2 max + P i N 2 i b 2 max +(M 2 +M)µ 2 max . Proof. See appendix. The main design principle in Lyapunov optimization is to choose control actions that minimize the R.H.S. of (4.11). However, for any slott, this requires prior knowledge of the future queue backlogs (Q(t)) over the time frame [t,t+T−1]. Q(τ) depends on the job arrival processesA i (τ), and SAVE’s decisionsµ ij (t),b i (t), andN i (t) (that depend on time varying power prices). Hence, minimizing the R.H.S. 
of (4.11) requires information about the random job arrival and power price processes. This information may not always be available. In SAVE we address this by approximating future queue backlog values as the current value, i.e.,Q F i (τ) =Q F i (t) andQ B j (τ) =Q B j (t) for allt<τ≤t+T−1. However, the simplification forces a “loosening” of the upper bound on the drift-plus-penalty term as shown in the following lemma. Lemma 2. Let t = kT for some nonnegaive integer k. Then under any possible actions N i (t),µ ij (t),b i (t) that can be taken, we have: Δ T (t)+VE t+T−1 X τ=t X j f j (τ)|Q(t) ≤ B 2 T +E t+T−1 X τ=t X j Q i (t)A i (τ)|Q(t) −E t+T−1 X τ=t X j µ ij (τ) Q F i (t)−Q B j (t) |Q(t) +E t+T−1 X τ=t X j Vf j (τ)−Q B j (t)N j (t)b j (τ)|Q(t) . (4.12) 59 Here B 2 = B 1 + (T− 1) P j [N 2 j b 2 max + (M 2 +M)µ 2 max ]/2 + (T− 1)MA 2 max /2 = (T +1)B 1 /2. Proof. See appendix. Comparing (4.12) with (4.5), (4.6), and (4.7), we can see that SAVE chooses N i (t),µ ij (t),b i (t) to minimize the R.H.S. of (4.12). 4.4.2 Performance bounds Theorem 5 (below) provides analytical bounds on the power cost and delay performance of SAVE. To derive these bounds, we first need to characterize the optimal time average power costf ∗ av that can be achieved by any algorithm that stabilizes the queue. Theorem 4 (below) shows that using a stationary randomized algorithm we can achieve the minimum time average power costf ∗ av possible for a given job arrival rate vectorλ = (λ 1 ,...,λ M ) where λ i =E A i (t) . We define stationary randomized algorithms as the class of algorithms that chooseN i (t),µ ij (t), andb i (t) according to a fixed probability distribution that depends on A i (t) andf j (t) but is independent ofQ(t). In Theorem 4,Λ denotes the capacity region of the system – i.e., the closure of set of rates λ for which there exists a joint workload assignment and computational capacity adaptation algorithm that ensures (4.3). Theorem 4. (Optimality over Stationary Randomized Policies) For any rate vectorλ∈Λ , there exists a stationary randomized control policy Π opt that choosesN i (t),i = 1,...,M every T slots, and chooses μ i (t) ∈ R i and b i (t) ∈ [0,b max ] every time slot purely as functions ofp i (t) andA i (t), and achieves the following for allk = 0,1,2,... kT+T−1 X τ=kT M X i=1 E f Πopt (τ) =Tf ∗ av (λ), kT+T−1 X τ=kT E X i µ Πopt ij (τ) = kT+T−1 X τ=kT E N Πopt j (kT)b Πopt j (τ) , 60 kT+T−1 X τ=kT E A i (τ) = kT+T−1 X τ=kT E X j µ Πopt ij (τ) . Proof. It can be proven using Caratheodory’s theorem in [29]. Omitted for brevity. The following theorem presents bounds on the time average power cost and queue back- logs achieved by SAVE. Theorem 5. (Performance of SAVE) Suppose there exists anǫ> 0 such thatλ+ǫ1∈Λ, then under the SAVE algorithm, we have: Q T , limsup t→∞ 1 K K−1 X k=0 M X i=1 E Q F i (kT)+Q B i (kT) ≤ B 2 +Vf max ǫ , (4.13) f SAVE av , limsup t→∞ 1 t t−1 X τ=0 M X i=1 E f(τ) ≤f ∗ av + B 2 V . (4.14) Heref ∗ av is the optimal cost defined in Section 4.2 and1 denotes the vector of all1’s. Proof. See appendix. What happens when SAVE makes its decisions based on queue backlog estimates ˆ Q(t) that differ from the actual queue backlogs? The following theorem shows that the SA VE algorithm is robust against queue backlog estimation errors. Theorem 6. (Robustness of SAVE) Suppose there exists anǫ > 0 such thatλ+ǫ1∈ Λ. 
Also suppose there exists a constantc e , such that at all timet, the estimated backlog sizes ˆ Q F i (t), ˆ Q B i (t) and the actual backlog sizesQ F i (t),Q B i (t) satisfy| ˆ Q F i (t)−Q F i (t)|≤c e and | ˆ Q B i (t)−Q B i (t)|≤c e . Then under the SAVE algorithm, we have: Q T , limsup t→∞ 1 K K−1 X k=0 M X i=1 E Q F i (kT)+Q B i (kT) 61 ≤ B 3 +Vf max ǫ , (4.15) f SAVE av , limsup t→∞ 1 t t−1 X τ=0 M X i=1 E f(τ) ≤f ∗ av + B 3 V . (4.16) Heref ∗ av is the optimal cost defined in Section 4.2, andB 3 =B 2 T +2Tc e (µ max +A max + N max b max +Mµ max ). Proof. See appendix. By comparing the inequalities (4.16) and (4.14), we can see that with inaccurate infor- mation, we need to set V to a larger value to obtain the same time average power cost as with accurate information. However, this will result in higher average queue backlogs (compare inequalities (4.15) and (4.17)). Hence, SAVE works even with inaccurate queue backlog information but its robustness is achieved at the expense of a power cost vs. delay trafe-off. We further demonstrate SAVE’s robustness using simulations in Section 4.6. 4.4.3 Extension to Non-i.i.d Arrival and Power Prices Theorem 5 and 6 are obtained with the assumption that A(t) is i.i.d. We note that they can indeed be extended to the case whenA(t) andp(t) are Markovian. More specifically, regarding the performance of SAVE, we can prove the following theorem: Theorem 7. WhenA(t) is Markovian, then under the SAVE algorithm we have Q T , limsup t→∞ 1 K K−1 X k=0 M X i=1 E Q F i (kT)+Q B i (kT) ≤O(V), (4.17) f SAVE av , limsup t→∞ 1 t t−1 X τ=0 M X i=1 E f(τ) ≤f ∗ av +O( 1 V ). (4.18) Heref ∗ av is the optimal cost defined in Section 4.2. 62 The theorem can be proven using a similar argument as in [36]. The sketch of the proof is provided as follows: We first re-define A T (k) = (A(kT),...,A((k + 1)T− 1)) and p T (k) = (p(kT),...,p((k + 1)T− 1)) as the frame arrival state and frame price state. Since bothA(t) andp(t) evolves according to a finite state Markov chain, it can be seen that z(t) , (A T (k),p T (k)) evolves according to a finite state Markov chain, and has a steady state distribution. Now choose any statez ∗ T . We first show that (4.11) holds without the expectation over a single frame. Then, assumingz T (k) = z ∗ T , we define a W -frame Lyapunov drift, Δ W T (t) =E L((k +W)T)−L(kT)|Q(kT) , (4.19) whereW is a random time such that (k +W)T is the next timez T (k) = z ∗ T . After that, we show that the right-hand-side ofE Δ W T (t) can be bounded by the minimum power costE W Tf ∗ av , by using a renewal theory argument and duality. After this, the theorem can be proven as in the proof of Theorem 5. Similarly, we can also prove a theorem that resembles Theorem 6 withA(t) andp(t) being Markovian. 4.5 Experimental Setup The goal of this experimental study is to evaluate the performance of SAVE. Our evaluation scenario consists of 7 data centers at different geographic locations. Next, we describe the main components of our simulations: the electricity prices and workloads at different data centers, the system parameters, and alternate schemes against which we compare SAVE. 63 4.5.1 Data sets Electricity prices. We downloaded the hourly electricity prices for 7 hubs at different geo- graphic locations from [3]. These hubs supply electricity to large cities such as Boston and New York, and sites like Palo Alto, CA and Ashburn, V A that host Google’s data centers. 
To fully exploit the cost savings due to temporal power price variations, we would have preferred to have prices at a time granularity that exhibits high variability, e..g. 5 minute intervals [62]. However, since we had access to only the hourly prices, we use interpolation to generate prices at5 minute intervals. Statistics of these prices are listed in Table 4.1. We also note that the cross correlations of prices vary. For example, prices from NP-15 and SP-15 have a correlation coefficient of 0.7, possibly because they all belong to the same transmission organization [1], whereas the correlation coefficient between prices from Commonwealth Edison and NP-15 is only 0.13. We include such characteristics in generating input to our simulations, as described below. Hub City Mean Std Auto Corr. Commonwealth Edison Chicago 45.22 26.35 0.76 Cinary Indianapolis 39.92 28.94 0.74 Dominion Ashburn 62.08 44.77 0.90 NEMass/Boston Boston 59.97 30.95 0.94 NP-15 Palo Alto 40.53 41.32 0.89 NYC New York City 41.05 56.55 0.96 SP-15 Los Angeles 52.07 24.11 0.90 Table 4.1: Statistics of power prices Workload. We generate synthetic workload that mimics MapReduce jobs, according to the published statistics on MapReduce usage at Facebook [79]. Each job consists of a set of independent tasks that can be executed in parallel. A job is completed when all its tasks are completed. We make the following assumptions in our experiments: (i) all tasks belonging to a job have the same processing time; tasks from different jobs can have different processing times; (ii) jobs can be served in any of the 7 data centers; and (iii) all the tasks from the 64 same job must be served at the same back end cluster. Regarding (i), in practice, tasks from the same MapReduce job (and other parallel computations), exhibit variability in process- ing times. However, techniques exist for reducing both the prevalence and the magnitude of task duration variability for MapReduce jobs [11]. Hence, (i) is not a significant oversim- plification. As explained in Section 4.2.5, we believe (ii) is also reasonable. Assumption (iii) is not required by our approach. Rather, it is motivated by the fact that, in practice, partitioning tasks from the same MapReduce job across geographically distant clusters can degrade overall performance due to network delays. We choose the execution time of a task belonging to a particular job to be uniformly dis- tributed between0 and60 seconds with the “job size” distribution (i.e., number of tasks per job) given in Table 4.5.1; these distributions (job execution time and “job size”) correspond to data reported in [79]. Table 4.2: Distribution of job sizes # Tasks 1 2 10 50 100 200 400 800 4800 % 38 16 14 8 6 6 4 4 4 We generated 5 groups of delay tolerant workloads. Each group consists of 7 different Poisson job arrival traces – one for each cluster. Group1 has “homogeneous” arrival rates – the arrival process for each cluster is Poisson with a rate of 15 jobs per time slot. The length of one time slot is 15 seconds. Groups 2 to 5 have “heterogeneous” arrival rates. The average arrival rate across all data centers is kept at 15 jobs per time slot. But as the index grows larger, the variance of arrival rates grows larger. For example, Group 2 has arrival rates ranging from 14 to 16 jobs every time slot, whereas Group 5 has arrival rates ranging from12 to18 jobs per time slot. We note that the assumption of Poisson distributed arrivals is not groundless. In fact, it is suggested by the measurements in [79]. 
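For concreteness, the following is a minimal sketch (in Python) of how one such per-cluster arrival trace could be generated from the parameters above; the function name and the use of Knuth's Poisson sampler are our own illustrative choices rather than a description of the actual simulator.

```python
import math
import random

JOB_SIZES    = [1, 2, 10, 50, 100, 200, 400, 800, 4800]  # tasks per job (Table 4.2)
SIZE_WEIGHTS = [38, 16, 14, 8, 6, 6, 4, 4, 4]             # corresponding percentages

def sample_slot_arrivals(rate=15, rng=random):
    """Jobs arriving at one front end during one 15-second slot.

    The number of jobs is Poisson(rate); each job is a pair (number of tasks,
    per-task processing time in seconds), where all tasks of a job share the
    same processing time, drawn uniformly from [0, 60] seconds.
    """
    # Knuth's method for sampling a Poisson variate (adequate for small rates).
    n, p, threshold = 0, 1.0, math.exp(-rate)
    while p > threshold:
        n, p = n + 1, p * rng.random()
    jobs = []
    for _ in range(n - 1):
        size = rng.choices(JOB_SIZES, weights=SIZE_WEIGHTS)[0]
        jobs.append((size, rng.uniform(0, 60)))
    return jobs
```

The heterogeneous groups are obtained by varying the per-cluster rate in this manner while keeping the average rate across the 7 clusters at 15 jobs per slot.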
In addition to that, we generate10 groups of workloads with Poisson arrival processes, 5 with arrival rate 5 jobs per time slot, 5 with arrival rate 10 jobs per time slot. These 65 workloads, combined with those mentioned above, are used to study performance of SAVE under different workload intensities. Finally, we generate 3 groups of non i.i.d workloads with the same job arrival rate –15 per time slot. These workloads differ in autocorrelation. The average autocorrelation (lag 1) of these 3 groups are 0.98, 0.50 and 0.29, respectively. They are used to evaluate the impact of traffic correlations on SAVE. 4.5.2 Experimental Settings Power Cost Function. SAVE can handle a wide range of power cost functions including non-convex functions. In our experiments, we model power consumptionP(N i ,b i ) as: P(N i (t),b i (t)) = N i (t) b i (t) α A +P idle ·PUE (4.20) In (4.20), A, P idle , and α are constants determined by the data center. Specifically, P idle is the average idle power consumption of a server, and b i (t) α /A +P idle gives the power consumption of a server running at rateb i (t). In our experiments we chooseα = 3,P idle = 150 Watts, andA such that the peak power consumed by a server is250 Watts. The model (4.20) and all its parameters are based on the measurements reported in [28]. The PUE term accounts for additional power usage (such as cooling). According to [5] PUE values for many of today’s data centers lie between1.3 and2. We chosePUE = 1.3 in all of our experiments. This choice is pessimistic, in a sense that SAVE achieves larger power cost reductions whenPUE is higher. System Parameters. We setN i , the number of servers in data centeri, to be 1000 for all 7 data centers. Each server can serve up to10 tasks at the same time. Since average running time of a task is 30 seconds,b i (t) can be chosen between 0 and5 tasks per time slot. With a 15 jobs per slot arrival rate, this gives an average server utility of about 78%. We set the 66 bandwidth between the front end servers and the back end clusters to a large value, based on the real world practice [2]. Hence, a large amount of workload can be sent from one data center to another within one time slot. We varyN i min across a range of values to explore its impact on SAVE’s performance. 4.5.3 Simulated Schemes for Comparison We compare SAVE to the following work-conserving schemes that either represent the current practices in data center management or are heuristics proposed by others. Local Computation. All the requests arriving atS F i are routed toS B i (the local back end); i.e.,µ ii =A i andµ ij = 0 ifj6=i. Load Balancing. The amount of workload routed fromD i toD j ,µ ij , is proportional to the service capacity ofD j (b max ·N i ), regardless ofD j ’s power prices. Intuitively, this scheme should have good delay characteristics. Low Price. This scheme is similar to the heuristic proposed in [62] that routes more jobs to data centers with lower power prices. However, no data center receives workload over95th percentile of its service capacity. Due to the discreteness of job sizes and the constraint that all tasks of one job should be served at the same back end cluster, it is difficult to ensure that the cluster with the lowest power price always runs close to, but not over, its capacity. Thus, in this scheme, the workload is routed such that the long term arrival rate at the back end cluster with the lowest average power price is close to the 95th percentile of its capacity. 
We then route the workload to the second lowest price cluster, and so on. In all these schemes, we assume that all servers are up at all times. 1 However, we assume that the service rates of the servers can be tuned every slot to optimize power cost. And we simulate the best they can do in reducing power cost. We also simulate the following scheme that deactivates idle servers: 1 About80% of data center operators do not identify idle servers on a daily basis [8]. 67 Instant On/Off. Here, the routing decisions between front end servers and back end clus- ters are exactly the same as in scheme Load Balancing. However, now not all servers are active in all time slots. In every slot, each data center is able to activate/put to sleep any number of servers with no delay or power cost, and also adjust servers’ service rates. In particular, we assume that data centerD i is able to chooses optimalb i (t) andN i (t) accord- ing to Q B i (t) in each and every time slot t. This idealized scheme represents the upper bound of power cost reductions achievable for the single data center case by any work- conserving scheme in our experimental settings. It is highly impractical because it assumes that servers can switch between active state and sleep state at the same frequency as power scaling (once every 15 seconds in our simulations). 4.6 Experimental Results 4.6.1 Performance Evaluation The performance of SAVE depends on parametersV andT . We show experimental results of all schemes on all data sets under differentV andT values (with other parameters fixed). For power cost reduction results, we use scheme Local Computation as a baseline. For all other schemes, we show the percentage of average power cost reduction as compared to scheme Local Computation. Specifically, letPC X denote the average power cost of scheme X. PC L.C. −PC X PC L.C. ×100 is used to quantify power cost reduction of schemeX. (L.C. is short for Local Computation.) For delay results, we show the schemes’ average delay (in number of time slots). Delay results of scheme On/Off are omitted as they are nearly identical to those of scheme Load Balancing — the maximum difference is≈ 0.03 time slots (0.45 second). For all comparison schemes, we show average values (power cost reduction and delay) over all arrival data sets. For SAVE, we use curves to represent average values and bars to show the corresponding ranges. 68 .01.02.05 .1 .2 .5 1 2 5 10 20 50100 0 5 10 15 20 V Value Power Cost Reduction (%) SAVE Local Load Balancing On/Off Low Price .01.02.05 .1 .2 .5 1 2 5 10 20 50100 2 4 8 16 32 64 V Value Delay (Time Slot) SAVE Local Load Balancing Low Price 30 60 90 120 180 240 360 540 7201080 0 5 10 15 T Value (Time slot) Power Cost Reduction (%) SAVE Local Load Balancing On/Off Low Price 30 60 90 120 180 240 360 540 7201080 2 4 8 16 32 64 T Value (Time slot) Delay (Time slot) SAVE Local Load Balancing Low Price (a) (b) (c) (d) Figure 4.3: Average power cost and delay of all schemes under differentV andT values We first fix T to be 240 time slots (one hour) and run experiments with different V values. The results are shown in Figure 4.3(a) and (b). From Figure 4.3(a) we can see that asV goes from0.01 to100, the power cost reduction grows from an average of around0.1% to about 18%. Scheme On/Off achieves power cost reduction of about 9.7%. If we choose V ≥ 5 then SAVE results in larger power cost reductions than scheme On/Off (which is impractical). 
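For reference, the per-cluster power model (4.20) and the cost-reduction metric used in the figures can be written down as follows; this is a sketch with our own calibration of A, under the assumption (not stated explicitly above) that the 250 W per-server peak refers to power before the PUE multiplier.

```python
# Power model of (4.20) with the experimental parameters; the calibration of A
# assumes the 250 W per-server peak excludes the PUE multiplier.
ALPHA, P_IDLE, PUE, B_MAX = 3, 150.0, 1.3, 5.0
A = B_MAX ** ALPHA / (250.0 - P_IDLE)

def power_watts(n_active, rate):
    """Power drawn by n_active servers running at `rate` tasks per slot."""
    return n_active * (rate ** ALPHA / A + P_IDLE) * PUE

def cost_reduction_pct(pc_scheme, pc_local_computation):
    """Power cost reduction relative to Local Computation, as plotted below."""
    return (pc_local_computation - pc_scheme) / pc_local_computation * 100.0

print(power_watts(1, B_MAX))   # ~325 W: a fully loaded server, including PUE overhead
print(power_watts(1, 0.0))     # ~195 W: an active but idle server
```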
This is because (i) our approach considers differences in power prices across different data centers, and (ii) our approach is not work conserving and can adjust service rates at different data centers according to power prices. We also note that scheme Low Price gives a small power cost reduction (≈ 0.5%) – i.e., sending more workload to data centers with low power prices in a greedy fashion is not very effective in reducing power cost. In Figure 4.3(b), we observe that whenV is small (< 0.1) the average delay of SAVE is quite small and close to the delay of scheme Load Balancing. Increasing V leads to 69 0 10 20 30 40 50 0 5 10 N min (%) Power Cost Reduction (%) SAVE Local Load Balancing On/Off Low Price 0 10 20 30 40 50 1 2 4 8 16 32 64 N min (%) Delay (Time Slot) SAVE Local Load Balancing Low Price .01.02.05 .1 .2 .5 1 2 5 10 20 50100 −1 −0.5 0 0.5 1 V Value Difference in Power Cost Reduction (%) .01.02.05 .1 .2 .5 1 2 5 10 20 50100 −1.5 −1 −0.5 0 0.5 1 V Value Difference in Delay (Time Slot) (a) (b) (c) (d) Figure 4.4: Average power cost and delay of all schemes under differentN min values and robustness test results larger delay as well as larger power cost reductions. In general, V in SAVE controls the trade-off between delay and power cost. We fixV to be 10 and varyT from 30 time slots (7.5 minutes) to 1080 time slots (4.5 hours), which is a sufficient range for exploring the characteristics of SAVE. (Note that servers are activated and put to sleep every10 minutes in [51] and every hour in [66]. Also, when setting T ≤ 30, servers may be powered up and down very frequently. This is not efficient, as discussed in Section 4.2.5.) Corresponding results of the different schemes are shown in Figures 4.3(c) and (d). From Figure 4.3(c) we can see that changing T has relatively small effect on power cost reductions of our SAVE approach. The average power cost reduction fluctuates between 8.7% and 13.6% when T varies from 30 to 1080 time slots. In most cases, it results in higher cost reductions than scheme On/Off. However, we note that T has a larger impact on average delay, as shown in Figure 4.3(d). In the extreme case, whenT = 1080 time slots, the average delay is close to 64 time slots. This 70 is not surprising – recall that in the bound on queue size given in Theorem 5, theB 2 term is proportional to T , i.e., the delay increases with T . However, reasonable delay values are possible with appropriate choices ofT , e.g., if we chooseT = 240 time slots (1 hour), SAVE gives an average delay of 14.8 time slots (3.7 minutes). From this set of results we can see that for delay tolerant workloads, SAVE would take infrequent actions on server activation/sleep (once in an hour or less) and still achieve significant power cost reduction. In the next set of experiments we consider the impact of N min . We keep V and T values fixed, but varyN min values from0 to50% ofN i . The results are depicted in Figures 4.4(a) and (b). Figure 4.4(b) indicates that increasing N min improves delay performance, e.g., when N min is increased to 20% of N i , the average delay decreases from≈ 72.5 to ≈ 25.9 time slots. At the same time, as shown in Figure 4.4(a), the effect ofN min on power cost reduction is relatively small. This makes intuitive sense. WhenN min increases, more servers are up regardless of job arrivals, providing more slots to serve jobs, thus reducing average delay. 
On the other hand, adding more active servers reduces the service rate of each server, which compensates for the extra power consumed by added servers. 4.6.2 Robustness Characteristics As mentioned in Section 4.2.5, SAVE needs to know the amount of workload of each job. In practice, when this is not available, we use estimated values. In this set of experiments we explore the influence of estimation errors on the performance of SAVE. To do this, for each arriving job, we add a random estimation error (±50%, uniformly distributed) to the amount of workload it contains. We run SAVE on these data sets, but let SAVE make all control decisions based on the amount of workload with estimation errors. Only when a job get served does the server know the exact amount of workload it actually contains. We run experiments for all5 groups of data sets for differentV values, and compare the results to those obtained using the original arrival data sets. In Figures 4.4(c) and (d),we use 71 the results on data sets without estimation errors as the baseline, and show the differences in power cost reduction and delay due to injected estimation errors. Figure 4.4(c) indicates that for all V values we experimented with, the difference (due to errors) in power cost reduction is between−1.0% and 0.7%. As shown in Figure 4.4(d), changes in average delay caused by estimation errors are small (between−1.2 to0.9 time slots). To conclude, SAVE is robust to workload estimation errors. 4.6.3 The Impact of Workload Intensity Here we consider workloads with different workload intensity. In this set of experiments, we keepT andN min values unchanged and run SAVE with differentV values on datasets with different arrival rates. With low, medium and high arrival rates, the average system utilization is around 26%, 52% and 78%, respectively. We group experimental results on datasets with the same arrival rate together. The results are depicted in Figures 4.5(a) and (b). For comparison, we also plot the average power cost reduction and delay of scheme On/Off. From Figure 4.5(a) and (b) we can see that as workload intensity increases, SAVE results in less power cost reduction and higher delay with the sameV value. In particular, for datasets with low workload intensity, SAVE can reduce power cost up to almost 60%. But with high workload intensity, the sameV value only brings in18% reduction in power cost. This makes sense. With small workload intensity more servers can be deactivated to save power. On the other hand, with higher workload intensity, delay increases, and there is less “room” for SAVE to delay jobs for better performance. We can also observe that if we setV to be larger than 5, SAVE outperforms scheme On/Off in power cost reduction in most datasets. 
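As a back-of-the-envelope check of the utilization figures quoted above, the following arithmetic (ours, using the workload and server parameters of Section 4.5) reproduces them approximately.

```python
# Mean tasks per job from Table 4.2, mean task time 30 s, 1000 servers per data
# center, 10 concurrent tasks per server, 15-second slots.
mean_tasks_per_job = sum(s * w for s, w in zip(
    [1, 2, 10, 50, 100, 200, 400, 800, 4800],
    [0.38, 0.16, 0.14, 0.08, 0.06, 0.06, 0.04, 0.04, 0.04]))   # ~264 tasks
capacity = 1000 * 10 * 15          # server-seconds available per cluster per slot
for jobs_per_slot in (5, 10, 15):
    demand = jobs_per_slot * mean_tasks_per_job * 30
    print(jobs_per_slot, round(100 * demand / capacity, 1))
# roughly 26%, 53%, 79%: consistent with the 26%, 52% and 78% cited above
```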
72 .01 .02 .05 .1 .2 .5 1 2 5 10 20 50 100 0 20 40 60 V Value Power Cost Reduction (%) SAVE High SAVE Medium SAVE Low ON/OFF High ON/OFF Medium ON/OFF Low .01.02.05 .1 .2 .5 1 2 5 10 20 50100 1 2 4 8 16 32 64 V Value Delay (Time Slot) SAVE High SAVE Medium SAVE Low ON/OFF High ON/OFF Medium ON/OFF Low .01.02.05 .1 .2 .5 1 2 5 10 20 50100 0 5 10 15 20 V Value Power Cost Reduction (%) SAVE .01.02.05 .1 .2 .5 1 2 5 10 20 50100 4 8 16 32 V Value Delay (Time Slot) Autocorr High Autocorr Medium Autocorr Low (a) (b) (c) (d) Figure 4.5: Average power cost and delay of SAVE with different workload intensity and autocorrelations 4.6.4 Non i.i.d Workload We use 3 groups of workloads with different levels of autocorrelation to evaluate the per- formance of SAVE on non i.i.d arrivals. In Figure 4.5(c) we show power cost reduction of SAVE for these datasets. The curve indicates that SAVE results in higher reduction as V increases. The bars show that SAVE offer similar performance for all datasets. In fact, with the sameV value, the difference in relative power cost reduction is at most 1 per cent. In Figure 4.5(d), average delay of SAVE are depicted for each of these workloads. From the figure we can observed that when V is small, higher autocorrelation workloads results in higher delay. This is because jobs arrives more in batches when autocorrelation is high, and more in spread out manner when autocorrelation is low. With increasedV , SAVE has larger flexibility of delaying jobs for better performance. Thus delay performance become similar for all three data sets. 73 In conclusion, SAVE is able to provide similar level of power cost reduction regardless of autocorrelations in arrival. But its performance in delay may vary whenV is small. 4.6.5 Power Consumption of SAVE SAVE is designed to reduce the cost of power in geographically distributed data centers, as this is one major concern for large computational facility providers. At the same time, with more attention paid to the social and ecological impact of large computational infrastruc- tures, it is also desirable to consider environmental friendly approaches, i.e., while reducing the cost of power, it is also desirable to reduce the consumption of power. To this end, we record the power consumption of all simulated schemes. In Figure 4.6 we show the percent- age of average power consumption reduction by SAVE with differentV values, relative to scheme Local Computation. Figure 4.6 illustrates that withV values ramping from0.01 to .01.02.05 .1 .2 .5 1 2 5 10 20 50100 0 5 10 V Value Power Usage Reduction (%) SAVE Local Load Balancing On/Off Low Price Figure 4.6: Differences in average power usage reduction for differentV values 100, the power consumption reduction goes from about0.1% to10.3%. WhenV = 10, the reduction is around 8.7%. This indicates that SAVE is environmentally friendly – while it reduces power cost, it also reduces power consumption significantly. As a comparison, the Low Price scheme is not environmentally friendly. Although it reduces power cost (see Figure 4.3(a)), it consumes more power than scheme Local Computation. 74 4.7 Conclusions In this paper, we propose a general framework for power cost reduction in geographically distributed data centers. Our approach incorporates routing and server management actions on individual servers, within a data center, and across multiple data centers, and works at multiple time scales. 
We show that our approach has provable performance bounds and is especially effective in reducing power cost when handling delay tolerant workloads. We also show that our approach is robust to workload estimation errors and can result in significant power consumption reductions. As mentioned earlier, the design of SAVE included several practical concerns. How- ever, it still has some impractical aspects. For example, SAVE requires simultaneous acti- vation/deactivation of multiple servers. This may potentially cause problem to the stability of the power grid. Part of our future work is to address this issue with better server man- agement. Other future work include taking more design aspects, such as renewable energy and battery, into consideration. 75 Chapter 5 Performance Tradeoffs in Structured Peer to Peer Streaming Systems In this chapter we focus on structured peer to peer streaming system. We consider the following basic question: a source node wishes to stream an ordered sequence of packets to a collection of receivers, which are inK clusters. A node may send a packet to another node in its own cluster in one time step and to a node in a different cluster in T c time steps (T c > 1). Each cluster has two special nodes. We assume that the source and the special nodes in each cluster have a higher capacity and thus can send multiple packets at each step, while all other nodes can both send and receive a packet at each step. We construct two (intra-cluster) data communication schemes, one based on multi-trees (using a collection ofd-ary interior-disjoint trees) and the other based on hypercubes. The multi- tree scheme sustains streaming within a cluster withO(dlogN) maximum playback delay and O(dlogN) size buffers, while communicating with O(d) neighbors, where N is the maximum size of any cluster. We also show that this protocol is optimal when d = 2 or 3. The hypercube scheme sustains streaming within a cluster, withO(log 2 (N)) maximum playback delay andO(1) size buffers, while communicating withO(log(N)) neighbors, for arbitrary N. In addition, we extend our multi-tree scheme to work when receivers depart and arrive over time. We also evaluate our dynamic schemes using simulations. 76 5.1 Introduction Continuous media (CM) streaming over a variety of networks is an application which pro- vides a rich source of interesting algorithmic problems. The specific problem we consider here is that of a source node streaming data to a collection of receiving nodes, where the receivers need to contribute to the delivery process, i.e., due to communication (bandwidth) resource limitations it is not possible for all nodes to receive the stream directly from the source. This is a standard motivation for use of peer-to-peer (P2P) systems or systems that involve application layer multicast. The streamed data can correspond to either a live source (i.e., the data is produced during the delivery process), or to a pre-recorded stream (i.e., all data is available at the beginning of the delivery process). In a real environment, we may have nodes that are in different geographical locations and thus communication delays may be significantly different. Assume that the nodes are divided into several clusters (e.g., based on geographic proximity). Each node can transmit one packet to any node within the same cluster in one time step. However, although packets could be sent from one node to any node in another cluster, the transmission delay across clusters is large (this is similar to the model in [43]). 
As a result, when streaming packets through the network, it is desirable to have a small number of transmissions across clusters. To be more specific, assume that there areK clusters, each containing a sufficiently large number of nodes. Let the delay to send a message between nodes in two different clusters beT c . However, within a cluster we can send a message to another node in the same cluster in one time step. Our packet distribution scheme will distribute the stream of packets using a “super-tree”,τ, on the clusters, constructed by selecting a special node from each cluster. Within each cluster a special (local) root node will be responsible for distributing the stream of packets to the members of that cluster. Details of this are given in Section 5.2.2. Since we are considering communication between peers in an overlay network, the actual routing of packets within one cluster can be viewed abstractely as a unicast commu- 77 nication model (with more details given in Section 5.2.2). Moreover, we view the cluster (logically) as a fully connected graph. That is, any nodei can send/receive packets to/from any other node j in the cluster. In a single time slot (as defined in Section 5.2.2), each nodei can transmit one packet and receive one packet; a number of works use this model [26, 12, 44, 40] as a communication abstraction. The packets can arrive at a node in any order, but they must be played back in a specific order (and at a specific rate), correspond- ing to the original recording rate of the stream. The subset of the (fully connected) graph edges used for packet delivery form a mesh. The system’s performance is a function of the algorithms used to construct and maintain this mesh as well as the algorithms used for scheduling packet delivery over the mesh. Hence, our work focuses on specific approaches to constructing such meshes within a single cluster. Our primary goal is to develop an understanding of two related quantities – playback delay and buffering requirements. As an example, one could simply chain the receiving nodes of a cluster (of sizeN) in a list, where the source streams packets to the first node in the list. Each node then simply forwards the packets to the next node in the list, and so on. While the buffering requirements are minimal in this approach, the playback delay is unacceptable for all but a few nodes, particularly since the cluster could be large. Another simple approach might be to arrange the nodes, for instance, in a binary tree, with the source being the root of that tree. Each node would then need to forward the packets to its two children. While this results in constant buffering requirements and O(logN) delay, this also requires each node to have at least twice as much upload bandwidth as bandwidth needed for downloading (i.e., streaming). (Note also that approximately half the nodes, i.e., leaves of the tree, are not contributing to the streaming process; hence the need for other nodes to make up for the lack of upload capacity in the system.) This is not a reasonable requirement as typically a node’s upload bandwidth is significantly lower than its download bandwidth. Hence, better approaches to mesh construction are needed, which 78 result in acceptable playback delay and buffering characteristics, while utilizing system resources efficiently. To this end, in this chapter we explore two approaches to mesh construction and sub- sequent streaming, one based on multi-trees and the other based on a hypercubes. 
We use these approaches to explore the resulting playback delay, buffer space, and communication requirements. (By communication requirements we mean the number of neighbors with which a node needs to communicate in a particular scheme, as detailed later.) We note that, our goal here is not to optimize playback delay or buffer space requirements. Rather, we aim to illustrate (through the proposed schemes) the existing trade-offs between playback delay, buffer space, and communication requirements. Specifically, we design the hypercube based scheme (as described in Section 5.5) with O(1) buffer space requirements andO(log 2 N) playback delay in general. In doing so, we observe that particular care must be taken in such an adaptation in order to limit the number of neighbors with which a node needs to communicate. Our motivation for limiting the number of neighbors with which a node communicates is that such communication requires protocol maintenance overhead, e.g., due to “keep alive” messages, due to nodes joining and departing (under node churn), and so on. Thus, we extend this scheme such that the number of neighbors with which each node needs to communicate is limited toO(logN). In the case of multi-tree based schemes, we perform streaming on a collection of d, d-ary interior-disjoint, trees where we show the playback delay to be at most dlog d N for all receivers. By interior-disjoint we mean that each of the receivers does not appear as an interior node in more than one of the d trees. Unlike the hypercube-based scheme described above, the multi-tree-based schemes only require each node to communicate with at most 2d nodes in its cluster. (As shown later, smaller values ofd are more desirable for playback delay and buffer space requirements, and thus should result in a smaller number of neighbors with which a node needs to communicate.) 79 Our approach has provable quality-of-service (QoS) guarantees, and we provide anal- ysis of corresponding performance characteristics. Specifically, in this work we focus on playback delay and buffer size requirements, as our metrics of QoS, and we study them as a function ofd andN. Note that in this work we focus on a more “structured” approach to mesh construction, i.e., the set of edges used for delivery of packets is fixed by our algorithms. An alternate approach might be to use an “unstructured” approach – i.e., the edges used for delivery are determined on a per packet basis (essentially on the fly when the data is needed). This allows the system to more easily adapt to node churn. However, existing unstructured approaches to streaming are essentially “best effort”, and little exists in the way of formal analysis of resulting QoS guarantees. A number of very nice works have looked at formal analysis of unstructured approaches to file downloads, e.g., as in [26, 12]. Here the file is decomposed intok chunks, and then distributed using a randomized gossip mechanism. However, since we are streaming a very large (potentially infinite) number of packets, the arrival order of packets is important, otherwise they all have to be buffered. In addition, the chunks need to be all available initially, which is not the case for live streaming. Techniques with provable QoS guarantees, such as ours, may be more suitable for scenarios where QoS is of importance. 
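As a quick numerical illustration of why small d is preferable, one can tabulate the d*log_d(N) delay bound for a fixed cluster size (our own calculation; the formal treatment appears in Section 5.2.3).

```python
from math import log

N = 1024
for d in (2, 3, 4, 5, 8):
    print(d, round(d * log(N) / log(d), 1))
# d * log_d(N): 20.0, 18.9, 20.0, 21.5, 26.7; degrees 2 and 3 are the sweet spot
```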
Independently, [53] focuses on characterizing limits of peer-assisted live streaming (for “structured” and “unstructured” systems) and gives performance bounds (on minimum source capacity and tree depth, and maximum streaming rate) using a fluid flow (rather than packet) model and under different assumptions from ours (e.g., they assume a potentially unlimited source capacity, they do not constraint trees to be interior-disjoint, etc.). The contributions of this work are as follows: • We provide algorithms for constructing multiple streaming trees and a corresponding transmission schedule so as to maintain QoS characteristics (see Section 5.2.2). We 80 are also able to extend our schemes to dynamic scenarios while maintaining our nice tree properties (see Section 5.3). We validate and evaluate our dynamic schemes through simulations (see Section 5.4.2). • We analyze the QoS of the multi-tree-based schemes and derive an upper bound on the delay required at each node before it can start playback, and use it to bound the required buffer size; we also prove a lower bound on the average playback delay (see Section 5.2.3). We also evaluate our schemes through simulations (see Section 5.4.1). • Interestingly, our work establishes that it is only useful to consider degree 2 and 3 trees, if one wants to minimize worst-case delay (see Section 5.2.3). • We present algorithms for constructing streaming structure using hypercube and cor- responding transmission schedule for arbitrary values ofN with a provable limit on the number of neighbors with which a node needs to communicate (see Section 5.5); we refer to this as a hypercube-based scheme. • We analyze the hypercube-based schemes to give upper bounds on worst case and average case playback delay (see Section 5.5.2). • Our main schemes are presented under the model of each cluster being a fully con- nected graph, i.e., where we assume that all nodes in a cluster are capable of com- municating with each other. The existence of two interior disjoint trees in a general graph is anNP -complete problem (see Section 5.2.4). 5.2 Multi-Tree Construction and Transmission LetS be the source node of the streamed data (e.g., video and/or audio), which corresponds to a (potentially infinite) sequence of data units (also referred to as “packets”) being deliv- 81 ered to a number of receivers. We assume that the network provides sufficient bandwidth, so that a packet can be delivered within a time slot. This is essential, if we are to have a solution without the use of (potentially) unbounded buffers. (If a receiver cannot receive a packet in each time slot, then it must accumulate a lot of packets before playback can start. Even with such a scheme, when the stream is infinite, eventually the receiver will starve and cause the playback to temporarily stop when the buffer is empty.) This assumption is quite reasonable. For instance, if we stream MPEG-1 video, recorded at the rate of 1.5 Mbps using 1400 byte packets (which is quite common), then each packet would play for ≈ 7.5 msec. If we stream these packets over a 10 Mbps connection, then it would take ≈ 1.1 msec to transmit one packet. If the propagation delay is also significant – e.g., a packet sent across US might experience a one-way delay on the order of 30 msec (includ- ing propagation, queueing, and processing delays) – then, we could think of transmitting a set of packets as one “large packet”, in order not to waste network resources. Given above numbers, that would be on the order of 5 packets. 
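The numbers above can be reproduced with a short calculation (a throwaway check, not part of the protocol).

```python
PACKET_BITS = 1400 * 8
playback_ms = PACKET_BITS / 1.5e6 * 1000   # ~7.5 ms of MPEG-1 video per packet
tx_ms       = PACKET_BITS / 10e6 * 1000    # ~1.1 ms to send one packet at 10 Mbps
print(playback_ms, tx_ms, 30 / playback_ms)   # ~7.47, ~1.12, ~4 packets per 30 ms "large packet"
```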
5.2.1 Construction of Treeτ We assume that there are K clusters, with cluster i containing N i ≤ N nodes. As men- tioned earlier, T c is the transmission time to send a packet from a node in one cluster to a node in another cluster. Within a cluster we assume that the nodes form a complete sub- graph, with the transmission time beingT i . For clarity of presentation (and without loss of generality), we will assume in the remainder (unless otherwise stated) thatT i is 1. The source S has D(D ≥ 3) times the capacity of a receiver node. Moreover, in each cluster i there are two “super nodes”, S i and S ′ i , where S i has the same capacity as the source, S, and S ′ i has capacity d times the capacity of a receiver node. Under these assumptions we construct the treeτ using the following algorithm: 82 Step 1: Build a tree using nodes S 1 through S K , with S being the root of this tree. The degree ofS isD; other interior nodes have degree at mostD−1. In order to keep the tree tight, at most one interior node can have degree less thanD− 1, and this node must be in the next to the last layer. Step 2: MakeS ′ i the child ofS i for alli. Step 3: Within each cluster build degreed interior-disjoint trees withS ′ i as the root of the cluster. The data is distributed (in order) fromS to all the nodesS i . Each nodeS i forwards the packet it receives from its parent on to itsD children (D−1 children in treeτ and the child S ′ i ). An example of the resulting treeτ using this construction is depicted in Figure 5.1. S 1 S’ 1 S 4 S’ 4 S 5 S’ 5 S 2 S’ 2 S 6 S’ 6 S 7 S’ 7 S 3 S’ 3 S 8 S’ 8 S 9 S’ 9 S Figure 5.1: Cluster construction with sourceS,D = 3,d = 4: the ovals represent clusters, where thick and thin arrows represent inter- and intra-cluster transmission, respectively. The main idea here is to use a set of “super nodes” as the “backbone” of our network. As a result, we have the following theorem: Theorem 8. The worst case playback delay is on the order ofT c ·log D−1 K+T i ·d(h−1), whereh is the maximum height of the interior-disjoint trees in all clusters. 83 Proof. The log D−1 K term is due to the “backbone” tree, while thed(h−1) term is given in Theorem 9. What remains is to determine and evaluate tree construction and transmission schemes for each cluster. Thus (unless otherwise stated), in the remainder of the chapter we focus our discussion on a single cluster. For clarity of presentation we use the term “sourceS” (or “rootS”) to mean the rootS ′ i of clusteri. 5.2.2 Construction of Interior Disjoint Trees Recall that the packets can be delivered out of order. However, since our data corresponds to continuous media (such as audio or video), the packets must be played in order and at the rate at which data was recorded – we assume that the playback rate is one packet per time slot. We define a time slot as the playback time of a single packet. We also assume that the receivers are homogeneous and can transmit and receive one packet per time slot from other nodes in their own cluster. Similar models are used, for instance in [26, 12]. In practice, a node may send and receive more than one packet in a time slot, although still a bounded number. (Essentially, this would correspond to a node splitting its bandwidth between multiple transmissions, each at a slower rate, with a longer time slot.) The schemes we propose here work with either model. However, not all schemes which work under the latter model would carry over to the former. 
Since our goal is to develop an understanding of playback delay and buffer requirements inherent in such schemes, we use the former model. Ideally, we would like each receiver to receive a new packet in every time slot. However, the sourceS does not have the capability to stream a new packet to each receiver in each time slot. We assume that S is powerful enough to send packets to up to d receivers in each time slot, whered≥ 1. (It is reasonable to assume that the source is a server which is slightly more powerful than the clients receiving the stream.) 84 Thus, the general scheme is then based on receivers themselves acting as senders of packets which they have just received. One approach would be to construct a single tree, with S as the root, and deliver the data along that tree, e.g., as in [19]. However, that would waste the upload capacity of the leaf nodes; e.g., in a binary tree approximately half of the upload capacity of the system is potentially wasted. This would also require internal nodes to have significantly higher upload capacity (e.g., twice the leafs’ capacity in a binary tree), whereas most technologies actually provide higher download capacity. Thus, a more resource-efficient approach (e.g., as in [16]) is to construct multiple transmission trees where receivers can obtain a different fraction of the data stream from each of the trees. This leads to efficient use of resources and reduction in playback start-up delay as well as buffer space requirements. If a node splits its upload bandwidth among its d children, then in fact a node may belong to up to d trees. To maximize efficiency of resource usage, we focus on schemes where each node does belong to d trees. Thus, we choose trees so that each node is an interior node in at most one tree, in which it has exactly d children (a few dummy receivers may be added to ensure this). In the remaining d− 1 trees it needs to be a leaf node. Constructing thed trees with this property is actually quite easy. What is surprising is that this can be done in a way that enables a schedule where in each time slot, each node will receive a packet from exactly one parent, with no collisions. Note that each node has d parents in the d trees. After receiving a packet j, each node then sends packetj to itsd children in the nextd time slots. Unlike theN receivers in the system, the sourceS distributes one packet in each time slot to a receiver in each of thed trees. Thus, below we construct d trees, each being a d-ary tree, where S acts as the root in each of the trees and allN receivers appear in each tree. The data stream is then split among thed trees (e.g., the first packet might be delivered through the first tree, the second packet might be delivered through the second tree, and so on). The tree construction and trans- 85 mission schemes we devise must then satisfy the following constraints: (1) each receiver node receives at most one packet in each time slot, (2) each receiver node transmits at most one packet in each time slot, (3) nodeS can transmit at mostd packets in each time slot, and (4) after some finite amount of time, referred to as playback delay, each receiver node should be able to start playback and continue that playback without hiccups (i.e., without reaching a situation where the next packet to be played has not arrived yet). In what follows, we refer to receiver i, 1≤ i≤ N, as a receiver with node id i, in order to distinguish its “name” from the various positions it might occupy in the d trees. 
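Constraints (1)-(3) are easy to verify mechanically for any candidate schedule; the following sketch does so for a schedule represented as (slot, sender, receiver) tuples, a representation we introduce only for illustration.

```python
from collections import Counter

def satisfies_bandwidth_constraints(events, d, source="S"):
    """Check constraints (1)-(3) for a schedule given as (slot, sender, receiver)
    tuples, one per packet transmission: every receiver sends and receives at
    most one packet per slot, and the source S sends at most d per slot."""
    received, sent = Counter(), Counter()
    for slot, sender, receiver in events:
        received[(slot, receiver)] += 1
        sent[(slot, sender)] += 1
    return (all(c <= 1 for c in received.values()) and
            all(c <= (d if s == source else 1) for (_, s), c in sent.items()))
```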
We number the positions in any tree in a breadth first order. Intuitively, we would also like to construct thesed trees in such a manner so as to reduce the playback delay. This implies that the trees should be, in some sense, balanced and that they should be interior node disjoint, i.e., that each of theN receivers should not appear as an interior node in more than one of thed trees. LetI be the number of interior nodes in a tree; then,I = N d −1. LetG 0 ={1,2,3,...,I},G 1 ={I+1,I+2,...,2I},...,G d−1 = {(d−1)I +1,(d−1)I +2,...,dI},G d ={dI+1,dI+2,...,N}. Node ids inG 0 toG d−1 correspond to those nodes which will appear as interior nodes in some tree. The node ids inG d will always be leaf nodes. We refer to thej th element ofG i asG j i . For notational convenience, we would like to assume that each internal node in each of thed trees has exactlyd children. We accomplish this by adding dummy receiver nodes. In the constructions that follow, we ensure that the dummy nodes only appear as leaf nodes in our trees (i.e., we add them intoG d ). Thus, they can simply be removed in the real system. The key point in the construction is that each node appears as thei th child in only one tree. This is what enables a transmission schedule with no collisions (see Figures 5.2 and 5.3). We describe two slightly different construction schemes. The essential properties achieved by both schemes are identical; however, the locations of the nodes themselves can be different. We now give tree construction and transmission schemes. 86 Greedy Construction Structured Construction S 6 11 12 1 1 11 S 6 2 9 4 1 11 (b) (a) t mod d = 1 Packet transmission on link when: t mod d = 0 t mod d = 2 Figure 5.2: Receiving and sending schedules of node id 6, for example in Figure 5.3. (b) Greedy Construction (a) Structured Construction S 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 S 5 6 7 8 9 10 11 12 1 2 3 4 15 13 14 S 9 10 11 12 1 2 3 4 5 6 7 8 14 15 13 S 9 10 11 12 1 2 3 4 5 6 7 8 15 13 14 S 5 6 7 8 3 1 2 9 4 11 12 10 14 15 13 13 S 1 2 3 4 5 6 7 8 9 10 11 12 14 15 real node dummy node Packet transmission on link when: t mod d = 1 t mod d = 0 t mod d = 2 k = 0 k = 1 k = 2 k = 0 k = 1 k = 2 Figure 5.3: Example of interior disjoint tree construction using the two schemes with N = 15, d = 3, where G 0 = {1,2,3,4}, G 1 = {5,6,7,8}, G 2 = {9,10,11,12}, G 3 = {13,14,15} Structured Disjoint Tree Construction Let P = d gcd(I,d) . We number the d trees T 0 ,T 1 ,...,T d−1 . We construct these trees by filling in the nodes, in breadth first order, using the current ordering of the groupsG i , where the first group in the current order always corresponds to interior nodes. Specifically, letG be the concatenation (⊕) ofd elements. Step 1: Initialization: LetG = G 0 ⊕G 1 ⊕...⊕G d−1 . Construct T 0 usingG⊕G d . Let k = 0. 87 Step 2: Construct group sequence for next tree: Let k = k + 1. Construct a newG by rotating the currentG to the left whereG i takes place ofG i−1 and the first element of the currentG becomes the last element of the newG. Ifk modP6= 0, go to Step 4. Step 3: Adjust groups (afterP rotations): Construct newG i s, 0≤ i≤ d−1, by rotating elements of eachG i to the right, i.e.,G j i takes place ofG j+1 i and the last element of the currentG i becomes the first element of the newG i . Step 4: Construct next tree: Rotate G d to the right, i.e., G j d takes place of G j+1 d and the last element of the currentG d becomes the first element of the newG d . ConstructT k usingG⊕G d . Step 5: Loop: Ifk <d−1, go to Step 2. Proof. 
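Before turning to the correctness argument, we give a compact sketch of this construction in Python; we interpret the rotation in Step 2 at the group level (G_i moves into G_{i-1}'s place and G_0 wraps around to the end), which reproduces the three trees of Figure 5.3(a).

```python
from math import gcd

def structured_trees(N, d):
    """Build d interior-disjoint d-ary trees over receivers 1..N (pad with dummy
    receivers beforehand so that d divides N).  Each tree is returned as the list
    of node ids in breadth-first position order; position p's children are
    positions d*p+1, ..., d*p+d, with the source S occupying position 0."""
    assert N % d == 0
    I = N // d - 1                                   # interior positions per tree
    groups = [list(range(i * I + 1, (i + 1) * I + 1)) for i in range(d)]
    G_d = list(range(d * I + 1, N + 1))              # the d permanent leaves
    P = d // gcd(I, d)

    order = list(range(d))                           # current ordering of G_0..G_{d-1}
    trees = [sum((groups[i] for i in order), []) + G_d]          # T_0
    for k in range(1, d):
        order = order[1:] + order[:1]                # Step 2: rotate group order left
        if k % P == 0:                               # Step 3: after P rotations,
            groups = [[g[-1]] + g[:-1] for g in groups]   # rotate each group right
        G_d = [G_d[-1]] + G_d[:-1]                   # Step 4: rotate G_d right
        trees.append(sum((groups[i] for i in order), []) + G_d)
    return trees

for T in structured_trees(15, 3):                    # reproduces Figure 5.3(a)
    print(T)
```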
Note that only nodes inG k act as the interior nodes in treeT k . Thus alld trees are indeed interior node disjoint. Hence, what remains to be proven is that no node will receive more than one packet in one time slot. According to the transmission schedule described in Section 5.2.2, the time slots during which one node, say node x, receives packets in a certain tree is determined by its position in that tree modulod. Therefore, it is sufficient to show that among all positions ofx in thed trees, no two positions are congruent modulod. If x∈ G d , then according to the construction algorithm it will be placed in positions N−d+1 throughN once without repetition. Thus no two positions are congruent modulo d. If x∈G, then let g = gcd(d,I). Thus, d = P·g and I = I 1 ·g, where I 1 and P are relatively prime. According to the algorithm, the positions of nodex in alld trees are equivalent to: T 0 −−−T P−1 :x,x−I,...,x−(P−1)I T P −−−T 2P−1 :x−PI +1,x−(P +1)I +1,...,x−(2P−1)I +1 ...... 88 T (g−1)P −−−T gP−1 :x−(g−1)I +g−1,...,x−(Pg−1)I +g−1. And, we want to show that no two of thesed numbers are congruent modulod. Note that, N is ignored in some of the above terms sinced|N. Also, PI = PgI 1 = dI 1 ; thus, d|PI. We can further refine the positions to: T 0 −−−T P−1 :x,x−I,...,x−(P−1)I T P −−−T 2P−1 :x+1,x−I +1,...,x−(P−1)I +1 ...... T (g−1)P −−−T gP−1 :x+g−1,...,x−(P−1)I +g−1. Now, assume thatx−a 1 I+b 1 ≡x−a 2 I+b 2 (modd), wherea 1 ,a 2 ∈{0,1,2,...,P−1} and b 1 ,b 2 ∈{0,1,2,...,g− 1}. We have d|(a 1 −a 2 )I + (b 2 −b 1 ), where (a 1 −a 2 )∈ {1−P,...,P−1} and(b 1 −b 2 )∈{1−g,...,g−1}. Thus,Pg|(a 1 −a 2 )gI 1 +(b 2 −b 1 ). So g|(b 2 −b 1 ), which indicates thatb 2 =b 1 . Then,P|(a 1 −a 2 )I 1 . SinceP andI 1 are relatively prime, P|(a 1 −a 2 ). As a result, a 1 = a 2 . This proves that no two distinct positions are congruent. Greedy Disjoint Tree Construction In this scheme, for ease of exposition, we assign a parity to each receiving node, where node idi has parityp i = (i−1 modd),i∈{1,2,3,...,N}. The node’s parity determines which child slot the node occupies in each of thed trees. Specifically, nodei with parityp i occupies child slot (p i −k) modd in treek, where 0≤ k≤ d−1. The scheme can then be described as follows: Step 1: Initialization: LetG =G 0 ,G 1 ,...,G d−1 and constructT 0 usingG,G d . Letk = 0. Step 2: Interior node selection for next tree Let k = k + 1. All interior nodes of tree T k are chosen from the set G k where we fill in the nodes in a breadth first manner, for 89 positions i, i = 1,2,...,I, by choosing the smallest node id j which satisfies the following conditions: (a)j∈G k , (b)j has parityi+k−1, (c)j has not been placed in treeT k yet. Step 3: Leaf node selection for next tree: LetL ={1,2,3,...,N}/G k be the set of nodes that were not yet placed in treeT k in Step 2. All leaf nodes are chosen from the set L where we fill in the nodes in a breadth first manner, for positionsi,i =I +1,I + 2,...,N, by choosing the smallest node idj which satisfies the following conditions: (a)j∈G, (b)j has parityi+k−1, (c)j has not been placed in treeT k yet. Step 4: Loop: k =k +1. Ifk <d go to Step 2. Proof. The following facts can be easily observed from the algorithm. Firstly, according to the construction, only nodes in G k act as the only interior nodes in tree T k . Thus, all d trees are indeed interior node disjoint. Secondly, suppose that node id i is in position P a and P b in trees T a and T b , respectively. 
Then according to the construction algorithm, P a ≡ a +i(mod d) and P b ≡ b +i(mod d). Thus, P a 6= P b (mod d). This shows that node id i can receive at most one packet in one time slot. Thirdly, all dummy nodes are added to group G d only. And, these nodes will be leaf nodes in all trees. Thus, deleting the dummy nodes does not affect the constructed trees. Finally, sinced|N, the number of nodes with parity j is N d , for all j. On the other hand, all d trees are filled from position 1 through position N. Thus, the number of positions in tree T k to be filled with parity j 90 nodes is exactly N d , which indicates that before all positions are filled, we can not exhaust our supply of nodes with appropriate parities. Given all of the above observations, we conclude that the construction algorithm is correct. Transmission Schedule The data streamed fromS can either be produced live (e.g., as in a broadcast of a sporting event) or it can be pre-recorded (e.g., as in the case of delivery a movie). For ease of presentation, we first assume that we are delivering pre-recorded data, i.e., all packets are available at nodeS at time 0. It is a simple extension to make our schemes work for live streams, and we explain it briefly at the end of this section. The transmission schedule can be obtained as follows. Letk∈{0,1,2,...,d−1}, and lett be a time slot, wheret = m·d+r and 0≤ r < d. In time slott,S transmits packet (k +m·d) to itsr th child in tree T k . All other interior nodes in tree T k send one packet to theirr th child in time slott. (The children are numbered, left to right, from 0 tod−1. Thus, the transmission essentially proceeds in a round-robin manner.) For example, in the multi-tree constructed in Figure 5.3, in time slot 0, S sends packet 0 to node id 1 in tree T 0 , packet 1 to node 5 in treeT 1 , and packet 2 to node 9 in treeT 2 . Then, in time slot 1,S sends packet0 to node2 in treeT 0 , packet1 to node6 in treeT 1 and packet2 to node10 in treeT 2 . After receiving packet 0 fromS in time slot0 in treeT 0 , node1 will send packet0 to node5 in time slot1, node6 in time slot2 and node4 in time slot3, etc. In the case of live streaming, it is not possible to send packet 1 in time slot 0 because it has not been generated yet. One approach to address this problem would be to, in a sense, pipeline the packets and thus modify the schedule as follows. Letr = (t+k) mod d. Then, in time slot t +k, S transmits packet (k +m·d) to its r th child in tree T k , if (k+m·d)≤t. All other interior nodes in treeT k send one packet to theirr th child in time 91 slott, as long as they have an appropriate new packet to send to that child. In this case, the transmission schedules of the different trees are not homogeneous; thus, this scheme is not easy to analyze. Another approach to address this problem is forS to delay the streaming until it accu- mulatesd packets. For ease of illustration we assume that thed packets were “pre-buffered” before time 0, and S can send out d packets at time 0. Thus, all nodes experience d units of additional delay as compared to the above described approach. However, we can then assume the same transmission scheduling procedure as in the case of “pre-recorded” streaming. 5.2.3 Delay Analysis In this section we first give an upper bound on the overall playback delay and buffer space requirements, given the construction and transmission schemes of Section 5.2.2. We then use this bound to determine an optimal value for the degree of our trees, i.e., ford. 
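To make the schedule concrete, the following sketch computes, for the pre-recorded case, the slot in which each position of a tree receives that tree's first packet, and from it the playback delays for the example of Figure 5.3(a); the recursion simply encodes the round-robin rule described above.

```python
def receive_times(tree, d):
    """Slot in which each breadth-first position of one tree receives that tree's
    first packet (pre-recorded schedule).  S sends to its c-th child in slot c; a
    node that got the packet in slot t forwards it to its c-th child in the next
    slot t' > t with t' mod d == c."""
    N = len(tree)
    slot = [None] * (N + 1)
    for c in range(d):
        slot[c + 1] = c                      # S's children occupy positions 1..d
    for p in range(1, N // d):               # interior positions 1..N/d-1
        for c in range(d):
            t = slot[p] + 1
            while t % d != c:
                t += 1
            slot[d * p + c + 1] = t
    return slot[1:]                          # entry k is the slot for position k+1

# The three trees of Figure 5.3(a); a node's playback delay is the latest slot in
# which it receives a packet over any of the d trees (cf. Section 5.2.3).
trees = [[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15],
         [5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 15, 13, 14],
         [9, 10, 11, 12, 1, 2, 3, 4, 5, 6, 7, 8, 14, 15, 13]]
delays = {i: max(receive_times(T, 3)[T.index(i)] for T in trees) for i in range(1, 16)}
print(delays[1], max(delays.values()))   # 2 and 6: node 1 can start playback after
                                         # slot 2, the worst-placed node after slot 6
```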
Finally, a lower bound on the average delay is given. We do all this in the context of pre-recorded streaming applications. Worst-case Playback Delay: We derive the upper bound on playback delay under the assumptions that: (1) a packet playback takes one time slot; (2) all d trees are complete, i.e.,d+d 2 +...+d h = N for some integerh – here (h+1) is the depth of our trees; and (3) S begins packet transmission in time slot 0. These assumptions simplify the analysis significantly. (We have also performed a simulation-based evaluation of the delay charac- teristics without the assumption of complete trees; those results are outside the scope of this work.) We now state the following theorem. Theorem 9. Worst-case playback delay,T , satisfies the following inequality: T≤h·d. 92 whereh = log d [N(1− 1 d )+1] andh+1 is the depth of the trees. Proof. We claim that one (achievable) upper bound on the playback delay is h· d time slots. That is, any node can begin playback after time slot h· d and be guaranteed not to experience hiccups due to lack of data. This claim can be proven by considering the following two observations: Observation 1: It takesh·d time slots to transmit packet 0 to node idN, i.e., the node in the last position of the first tree. So node idN cannot start playback prior to time sloth·d. Observation 2: Given our round robin transmission schedule, if one node receives packetj in time slott, then it will definitely receive packet(j+d) in time slot(t+d). Consequently, if by time slott 0 a node has received packets 1 throughd, then by time slot (t 0 +d) it will receive packets (d+1) through 2d, by time slot (t 0 +2d) it will receive packets (2d+1) through 3d, and so on. Therefore, it is safe for this node to start playback at time slot t 0 without being concerned with running out of data and experiencing hiccups. Note that, by time sloth·d each node would have received at least one packet in each of thed trees. And, after that, each node continues to received packets everyd time slots. Since nodes do not receive redundant packets,h·d is a safe value fort 0 . Thus, given our claim above, the playback delay T satisfies the following inequality: T ≤ h· d, where h = log d [N(1− 1 d )+1] corresponds to the depth of our trees. For general values ofN, these trees may not be complete; hence, it is possible forT to be strictly less thanh·d. Given this worst-case delay, a buffer of sizeh·d·(size of a packet) is sufficient at every node. Note that this is an upper bound on the buffer space requirements, i.e., not all nodes need that much buffer space. For instance, in the multi-tree system constructed in Figure 5.3, node1 will receive packets0,1, and 2 in time slots0,2, and 1, respectively. Therefore a buffer size of 3 is sufficient for node1. Tree Degree Optimization: Given that we would like to minimize worst-case delay (in Theorem 9), we can state the following. Asymptotically, we can show that for large N, 93 degree 3 trees are optimal. Moreover, for any N, either degree 2 or degree 3 trees are optimal. Specifically, let us assume that N is large. Then, a reasonable approximation for the upper bound on playback delay is T ≈ log d [N(1− 1 d )]· d. Let F(d) = log d [N(1− 1 d )]· d. Then, using natural logs and taking the derivative with respect to d, we obtain dF dd = (logd−1)[log(d−1)+logN]+ d d−1 logd (logd) 2 −1 whered must be an integer andd≥ 2. Note that, whend = 2, dF dd = (log2−1)logN (log2) 2 + 2 log2 −1≈ 1.89−0.64logN < 0. And, whend≥ 3, logd−1> 0, so dF dd > 0. 
Consequently, an optimal value ofd should always be either2 or 3. Note thatF(2) = 2(log 2 N−1) andF(3) = 3( log 2 N log 2 3 −log 3 (3/2)). Thus, for sufficiently largeN (and these values do not have to be very large), degree3 trees are optimal. We note that numerical results depicted in Figure 5.4 (obtained through simulations) indicate that for small and large values ofN, the resulting delays for degree 2 and 3 trees are quite close, and they are better than for higher degree trees. Thus, we believe that it is reasonable to used = 2 in practice. 0 5 10 15 20 25 30 0 200 400 600 800 1000 1200 1400 1600 1800 2000 Maximum Startup Delay (# Time Steps) Number of Nodes Degree 2 Degree 3 Degree 4 Degree 5 Figure 5.4: Worst-case delay Average Playback Delay: In addition to the worst-case playback delay, average playback delay, P N i=1 a(i) N , is also an important metric for evaluating the performance of our scheme; herea(i) corresponds to the playback delay of node idi. However, this is a difficult metric to derive analytically. Instead Theorem 10 gives a lower bound on this metric, under the same assumptions stated above 94 Theorem 10. The following inequality gives a lower bound on the average playback delay: P N i=1 a(i) N ≥ d h (d+1)(h−1)−d 2 (h−2)−d(d+1)/2 N(d−1) . Proof. LetA(i,k),i∈{1,2,3,...,N} andk∈{0,2,3,...,d−1} denote the delay of node idi in treeT k , e.g.,A(1,1) = 1 andA(d,1) = d. Also leta(i),i∈{1,2,3,...,N} denote the playback delay of node idi, and leta ′ (i),i∈{1,2,3,...,N−d}, denote the delay which node idi experiences as interior node only. For node idsi∈{N−d+1,...,N} (i.e., nodes which are leaves in all trees), leta ′ (i) =A(i,1). (For example, in the multi-tree constructed in Figure 5.3(b),a ′ (1) = 1 anda ′ (6) = 2.) Then, similarly to the argument we made in case of worst-case playback delay, we have: a(i) = max{A(i,0),A(i,2),A(i,3),...,A(i,d− 1)}. Note that(d−1)a(i)≥ P d j=1 A(i,j)−a ′ (i). Indeed, fori∈{1,2,3,...,N−d}, the right hand side of this inequality is the average delay of node id i when it is a leaf node. Then,(d−1) P N i=1 a(i)≥ P d i=1 P i∈L j A(i,j)−d 2 (h−2)− d(d+1) 2 , where the right hand side is the sum of the delays of all leaves in alld trees minus the delay of node idsN−d+1 throughN in treeT 0 . Thesed nodes are in positionsN−d+1,N−d+2,...,N in tree T 0 ; thus, the corresponding delay isd(h−2)+1,d(h−2)+2,...,d(h−1). Now we prove that 1 |L k | P i∈L k A(i,k) = (d+1)(h−1) 2 . For k ∈ {0,2,...,d− 1}, let L k ={1,2,3...,N}/G k . L k denotes the set of leaf nodes in tree T k . We first prove the following lemma: Lemma 3. In L k , the number of nodes with delayj is equal to the number of nodes with delay(d+1)(h−1)−j. Proof. Let X 1 ,X 2 ,...,X h−1 ∈ {1,2,3,...,d} be the delay (in number of time slots) between each layer. Then each vector (X 1 ,X 2 ,...,X h−1 ) corresponds to a unique node, sayi, inL k , andX 1 +X 2 +...+X h−1 =A(i,k). Thus, the number of nodes with delayj 95 is equal to the number of solutions of the equationX 1 +X 2 +...+X h−1 =k. Also, from symmetry, the number of solutions ofX 1 +X 2 +...+X h−1 =k is equal to the number of solutions ofX 1 +X 2 +...+X h−1 = (d+1)(h−1)−k, which is the number of nodes with delay (d+1)(h−1)−j. This indicates that the number of nodes with delayj is equal to the number of nodes with delay (d+1)(h−1)−j. According to Lemma 3, 1 |L k | P i∈L k A(i,k) = 1 |L k | |L k | 2 (d+1)(h−1) = (d+1)(h−1) 2 . Also we have|L k | =d h−1 . 
Putting it all together gives: (d−1)∑_{i=1}^{N} a(i) ≥ d·d^{h−1}·(d+1)(h−1)/2 − d^2(h−2) − d(d+1)/2. Thus, the average delay is

(1/N)∑_{i=1}^{N} a(i) ≥ [d^h(d+1)(h−1) − d^2(h−2) − d(d+1)/2] / [N(d−1)].

5.2.4 Not Fully Connected Network

In this work, we modeled each cluster as a fully connected graph and hence constructed the multiple trees within this fully connected graph. Consider now a network which is represented by an arbitrary undirected graph, G, where an edge exists between a pair of nodes if one packet can be transmitted between these nodes in a single time slot. An interesting question then is, for instance, can we construct two interior-disjoint spanning trees using G, each rooted at a node S, i.e., as before, the root is permitted to be an interior node in both trees. We refer to this problem as the Two Interior-Disjoint Tree problem. As stated in Section 5.1, this problem is NP-complete; the proof is given below.

Proof. Recall the following known NP-complete problem, the E-4 Set Splitting problem [33]: given a collection of elements V and a collection of sets R_i, such that for all i, R_i contains exactly four elements of V, is there a way of splitting the set V into V_1 and V_2 such that, for each i, R_i has at least one element in both sets? We now reduce this problem to the Two Interior-Disjoint Tree problem.

Construct a bipartite graph with a node for each element in V; call this set V′. For each R_i we have a node x_i; the collection of nodes x_i will be called X. Add a root r and add edges from r to all nodes in V′. Also connect x_i to the nodes contained in R_i.

Suppose there exist two interior-disjoint trees in this graph, T_1 and T_2. Now apply the following operation: for any i, if x_i is an interior node, connect all its children directly to the root and remove all the edges between x_i and its children. This is possible because all the children of x_i must be in V′. Note that after completing this operation for all i, the two trees are still interior disjoint (we did not add any interior nodes), and all the x_i nodes are now leaves in both trees. Taking the interior nodes of the two trees as V_1 and V_2 then yields a valid splitting: the parent of each leaf x_i in each tree is an element of R_i that is interior in that tree, so each R_i has an element in both sets. Conversely, let V_1, V_2 be a solution for the E-4 Set Splitting problem. Take the interior nodes of the two trees to be V_1 and V_2, respectively, and connect each x_i as a leaf to each tree; this is possible since each R_i has an element from each partition. This completes the proof.

5.3 Node Addition and Deletion

Thus far, our tree construction schemes were described under static conditions. In a real streaming system, there is node churn. That is, it is quite likely that some nodes will arrive and some nodes will depart after S has begun the streaming process, i.e., after the original trees are constructed and are in use for data streaming. Thus, we must also be able to add new nodes to and delete existing nodes from our trees "on-the-fly", ideally without having to reconstruct the trees from scratch. Below, we describe how this can be done for our schemes, when the node churn is due to the arrival and departure of "regular" nodes (i.e., not the "super nodes" forming the "backbone" of Figure 5.1). It is reasonable for us to assume that the "super nodes" will not exhibit significant churn (and thus to focus our attention on "regular" node churn) for the following reasons. It is common for real systems to provide some infrastructure which is static over long periods of time, and it is common in real P2P systems to adopt the use of "super nodes", e.g., as in Skype, Kazaa, and Gnutella.
In real streaming systems, "super nodes" could be provided, e.g., with the aid of content distribution companies, such as Akamai - this is an approach taken by a popular P2P streaming system, Joost.

Note that the tree construction schemes given in Section 5.2.2 have a nice property: the nodes in the set G_d are all leaf nodes (i.e., they appear as leaves in all d trees), and moreover, the nodes in G_d are always at the end of our trees (i.e., when considered in breadth-first order). Thus, it is always easy for us to find nodes which are not transmitting any data to anyone and which can therefore be used to (a) take on the role of interior nodes when interior nodes are deleted and (b) take on children when new nodes are added to the system. We now give the details of the node deletion and addition algorithms.

Deletion: Suppose our system has N (real) nodes, and node id i has decided to leave. Let node id x be the last all-leaf node in tree T_0. Then the deletion of i can be done as follows:

Step 1: Find replacement: Swap i with x in all d trees.

Step 2: Restore property: If d | (N−1) (i.e., if G_d only has one node), then let P(i) be the set of the (new) parents of i in all d trees (i.e., the nodes which became its parents after it was swapped with x); thus |P(i)| = d. In each tree T_k, swap the nodes in P(i) with the nodes in positions N−d to N−1 in tree T_k.

Step 3: Remove node: Delete i from all trees.

Note that Step 2 is executed only when x was originally the only child of its parents in all d trees. Thus, after deleting i, all nodes in P(i) will become all-leaf nodes. Hence, the purpose of Step 2 is to make sure that the nodes in P(i) end up in positions N−d through N−1, i.e., at the end of all d trees, in breadth-first order. (Another minor detail is that, for consistency of presentation, x takes on i's node id.) An example of node deletion is given in Figure 5.5.

Figure 5.5: Example of node deletion with 6 nodes (panels: before deletion, Step 1, Step 2). In this example, node id 2 is leaving the system. Given our deletion algorithm, it first swaps positions with node id 6 and then leaves the system.

Addition: Suppose our system has N nodes, and a new node id i arrives. Let N ≡ r_1 (mod d) and ⌊N/d⌋ ≡ r_2 (mod d). If d | N, then all nodes in G_0 through G_{d−1} are full, i.e., have d (real) children in some tree, and thus nodes in G_d will have to become interior nodes in some trees. Otherwise, nodes not in G_d still have vacancies, and i can simply be added in appropriate positions as a child of those nodes in all trees. The addition algorithm is as follows:

Step 1: Make room for growth: If d | N, swap the node in position ⌊N/d⌋ with the node in position N−d+(r_2−1) in each tree.

Step 2: Grow trees: Add i to position N+1 in tree T_0, position N+2 in tree T_1, ..., position N+d−r_1 in tree T_{d−r_1−1}, position N−r_1+1 in tree T_{d−r_1}, ..., position N in tree T_{d−1}.

Intuitively, the main purpose of r_1 in Step 2 above is to count the number of children of an interior node with fewer than d (but more than 0) children, i.e., such nodes have d−r_1 vacancies. Intuitively, the main purpose of r_2 is to determine the parity of the node where tree growth will occur. Note that the growth in all trees has to occur in position ⌊N/d⌋, and in Step 1 we ensure that the node in that position is an all-leaf node (from G_d) in each of the trees, before adding i as its child.
Thus, in Step 1 we swap the node in position ⌊N/d⌋ with one of the nodes in G_d; specifically, we swap it with an all-leaf node of the same parity (to ensure that it continues to receive data in that tree in appropriate time slots). An example of node addition is given in Figure 5.6.

Figure 5.6: Example of adding node id 7 to the multi-tree (panels: before addition, Step 1, Step 2). In this example, the nodes in positions 3 and 5 in each of the trees (indicated by thicker circles) need to be swapped in order to make room for accommodating this new node. After swapping, node id 7 is attached to the new interior nodes, i.e., node ids 5 and 6 in trees T_0 and T_1, respectively.

Note that, when swapping occurs, either during addition or deletion, nodes participating in the swapping process may suffer from hiccups. This may occur, for example, because they lose data which was delivered before they were moved up a tree, or perhaps because they wait longer than originally planned for some data because they were moved down a tree. In the case of deletion, if the node being deleted is an all-leaf node, then the number of resulting swaps is between 0 and d^2 (where the higher value occurs when d | (N−1)). If an interior node is deleted, then the number of resulting swaps is between d and d^2+d (where the higher value occurs when d | (N−1)). In the case of addition, the number of resulting swaps is between 0 and d (where the higher value occurs when d | N). Thus, up to d^2 nodes may suffer from hiccups. Appropriate evaluation of the resulting QoS (due to hiccups) is a complex issue; an empirical evaluation of such effects (using simulation) can be found in Section 5.4.

One inefficiency of the above algorithms is that when d | (N−1) and a deletion occurs, up to d^2 swaps have to be made to keep all the G_d nodes in appropriate positions in all trees. However, these swaps are not really necessary if the next event is an addition of a new node, i.e., the addition of a new node will force us to "undo" swaps made during the deletion process. Thus, in such situations, given this sequence of deletions/additions, it would have been better to simply replace the deleted node with the newly added one, thus saving d^2+d swaps. Given this observation, we explore "lazy" versions of the deletion and addition algorithms, where we wait until a new event occurs before deciding whether swapping is needed. The details of this approach are as follows.

Lazy Deletion:
Step 1: Restore property: If d | N, swap nodes in G_d with nodes in positions N−d to N−1 in all trees.
Step 2: Find replacement: If i is not in G_d, swap it with node x in all trees.
Step 3: Remove node: Delete i.

Lazy Addition:
Step 1: Make room for growth: If d | N, check if the nodes in position ⌊N/d⌋ in all trees are in G_d. If not, swap the node in position ⌊N/d⌋ with the node in position N−d+(r_2−1) in each tree.
Step 2: Grow trees: Add i to position N+1 in tree T_0, position N+2 in tree T_1, ..., position N+d−r_1 in tree T_{d−r_1−1}, position N−r_1+1 in tree T_{d−r_1}, ..., position N+d in tree T_{d−1}.

Note that the difference with these "lazy" schemes is that now we wait until a new event occurs before deciding whether swapping is needed in order to keep the all-leaf nodes in appropriate positions in the trees. (A small sketch of the basic swap-based deletion is given below.)
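To make the swap-based deletion concrete, here is a minimal sketch (our own illustration, not code from the dissertation). It represents each of the d trees as a list of node ids in breadth-first position order and implements Steps 1 and 3 of the basic deletion algorithm; Step 2 ("restore property") and the positional bookkeeping that keeps G_d nodes in the tail positions of every tree, as well as the lazy variants, are omitted for brevity.

def delete_node(trees, node_id):
    # trees: list of d lists, each holding node ids in breadth-first position order.
    # Step 1 (find replacement): swap node_id with x, the node in the last position
    # of tree T_0 (an all-leaf node from G_d by the construction property).
    x = trees[0][-1]
    if x != node_id:
        for tree in trees:
            i, j = tree.index(node_id), tree.index(x)
            tree[i], tree[j] = tree[j], tree[i]
    # Step 3 (remove node): node_id now occupies x's old (childless) positions,
    # so it can be dropped from every tree. Step 2 is omitted in this sketch.
    for tree in trees:
        tree.remove(node_id)
    return trees

# Example: a 6-node, degree-2 multi-tree with node id 2 leaving (position orders
# below are only illustrative, not taken from Figure 5.5).
T0 = [1, 2, 3, 4, 5, 6]
T1 = [2, 1, 4, 3, 6, 5]
print(delete_node([T0, T1], 2))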
We evaluate the difference between the original and the "lazy" versions of these algorithms empirically in Section 5.4, using simulations.

5.4 Simulation Results

In this section, we study our tree construction schemes using an event-driven simulator which implements the static and dynamic schemes described earlier in this chapter. The goal of our simulations is to illustrate how the practical performance of our multi-tree scheme compares to the theoretical results we provided.

In the static scenarios, all nodes arrive to the system before streaming begins and they stay in the system for the duration of the streaming process. In such cases, we assume that all nodes are present by time 0, and the trees are constructed before S begins streaming. The receiver nodes begin their playback after receiving d consecutive packets (as in Section 5.2.3).

In dynamic scenarios, in addition to an initial batch of nodes which are used to construct the initial trees (as in the static case), nodes also join and leave the system throughout the streaming process. Consequently, hiccups (due to node swaps) in the playback process may occur (refer to Section 5.3). A node participating in a swap does the following in our simulation: (1) it discards all old packets in the buffer, and (2) it restarts playback after receiving d consecutive packets whose playback time is after the node's last playback position before the swap occurred. More sophisticated playback restart schemes are possible (e.g., perhaps some packets in the buffer could be salvaged); however, this simple scheme is sufficient for us to illustrate the characteristics of the hiccups caused by the addition and deletion algorithms (refer to Section 5.3).

To illustrate the performance characteristics of our schemes, we use the following metrics:

1. startup delay - the time difference between the playback start time and the joining time of a node;
2. startup lag - the time difference between a node's playback startup position (i.e., packet number) and the source's sending position;
3. number of restarts per event - the number of playback restarts per dynamic event (i.e., a node joining or leaving);
4. restart delay - the time required to restart playback after swapping;
5. restart gap length - the difference between the new playback position and the last playback position before the hiccup, which measures the amount of data skipped by a node due to swaps;
6. number of losses per event - the number of packet loss bursts per event experienced by a node not involved in a swap, due to swapping of its ancestors;
7. loss burst length - the length of a packet loss burst in metric (6) above.

We computed the minimum, average, and maximum values as well as the distributions of the above metrics. Due to lack of space, we only report the average values below. The first two metrics measure a node's playback startup properties. The remaining metrics measure the hiccups (i.e., interruptions in playback) caused by the nodes' dynamics. All seven metrics are useful in the case of a dynamic system; however, only the first metric is of use in the case of a static system. (Note that in the case of a static system, the first two metrics are equivalent, as we assume that all nodes join at time 0.)
In our simulation we consider the following three streaming types: (i) pre-stored - which emulates pre-recorded data streams, i.e., all data is available at S at time 0; (ii) live - which emulates live broadcast streams, i.e., data is generated as it is being sent, and hence S cannot transmit packet t earlier than time slot t (refer to Section 5.2.2); and (iii) buffer-d-packets - which also emulates live broadcast streams, as described in Section 5.2.2. We first study all three streaming types in the context of a static scenario. Then we focus on pre-stored streams in the context of dynamic scenarios, as that is the case analyzed in Section 5.2.3.

5.4.1 Static Scenarios

Startup Delay under Different Streaming Approaches: We present average startup delay results for systems with 10 to 2000 nodes and tree degrees ranging from 2 to 5; these are illustrated in Figures 5.7(a), 5.7(b), and 5.7(c) for pre-stored, live, and buffer-d-packets streaming, respectively. (We experimented with other system configurations, e.g., larger node degrees; since the results were qualitatively similar, we omit them here.) We observe that the average startup delay scales with the number of nodes, with degree 2 or 3 trees giving the lowest average startup delay. Here pre-stored streaming corresponds to the lowest startup delay; however, it cannot stream live content directly. Buffer-d-packets streaming has a similar average startup delay as live streaming, with live streaming's delay being a bit shorter. Moreover, our experiments indicate that the distributions of startup delay appear to be binomial; this may indicate that average startup delay is a reasonably representative metric. The distribution plots are omitted here due to lack of space.

Figure 5.7: Average Startup Delay under Different Streaming Approaches - (a) Pre-stored, (b) Live, (c) Buffer-d-Packets (average startup delay, in time steps, vs. the number of nodes, for tree degrees 2 through 5).

Complete Trees Average Startup Delay: Next, we evaluate the tightness of the derived lower bound on average startup delay. We obtain the average startup delay for complete tree cases (as this was an assumption made in the derivation) from the simulations and compare it with the corresponding analytical lower bound described in Section 5.2.3. We experimented with different tree degrees and different tree depths; Table 5.1 illustrates a subset of the results we obtained. We observe that when the tree degrees are small and the tree depth is high, the analytical lower bound is close to the simulation results. We believe that it is not significant that the bound is not tight for higher tree degrees or smaller depths. This is the case since (a) as shown in Section 5.2.3, small (2 or 3) degree trees are optimal, and (b) the interesting cases are those with large numbers of nodes (which result in higher depth trees).
Tree Depth        1      2      3      4      5      6      7
d = 2 (bound)   1.50   2.83   4.36   5.90   7.44   8.96  10.48
d = 2 (sim)     2.00   3.00   4.43   5.93   7.45   8.97  10.48
d = 3 (bound)   2.00   3.88   5.92   7.96   9.98  11.99  14.00
d = 3 (sim)     3.00   4.58   6.56   8.63  10.69  12.74  14.77
d = 4 (bound)   2.50   4.90   7.45   9.98  12.49  15.00  17.50
d = 4 (sim)     4.00   6.15   8.65  11.23  13.80  16.36  18.91
d = 5 (bound)   3.00   5.92   8.97  11.99  15.00  18.00  21.00
d = 5 (sim)     5.00   7.70  10.72  13.79  16.86  19.92  22.98

Table 5.1: Average Startup Delay (analytical lower bound vs. simulation, by tree degree d and tree depth).

5.4.2 Dynamic Scenarios

Here, we start out with 500 initial nodes in the system. We consider two types of dynamic events: (i) node arrivals, where the inter-arrival times are exponentially distributed with a mean of 1000 time units, and (ii) node departures, where the inter-departure times are exponentially distributed with a mean of 1000 time units. When a node departure event occurs, we randomly choose a node to remove from the system. The simulation is run for 10000 dynamic events, which corresponds to approximately 5.8 simulated days, given a packet rate of 100 packets/sec. We experiment with tree degrees of 2 to 10, using both the original and the lazy node addition and deletion schemes (refer to Section 5.3). In order to make meaningful comparisons, all simulations use the same sequence of node arrivals and departures. We consider the metrics defined above as a function of tree degree as well as of the schemes given in Section 5.3.

Figure 5.8(a) illustrates the average startup delay and the average startup lag. Figure 5.8(b) illustrates the average number of restarts and the average number of losses per event. And, Figure 5.8(c) illustrates the average restart delay, the average restart gap length, and the average loss burst length. We observe that the average startup delay does not equal the average startup lag in dynamic scenarios, with the average startup delay being shorter. This is intuitive, since by the time new nodes join the system, the startup cost has been "paid", and the system is in a phase where d packets arrive at each node every d time slots. We also note that the average startup delay increases almost linearly with the tree degree. This makes sense since new nodes need to buffer d consecutive packets before they begin playback. The average startup delay is greater than d since the nodes initially in the system incur a delay greater than d, and some nodes may be involved in several consecutive swaps before they are able to start playback.

We also note that the average number of restarts per event and the average number of losses per event are relatively small (i.e., relative to the number of nodes in the system). In our experiments, the use of lazy schemes can reduce the number of restarts by approximately 40-45%, and it can reduce the number of losses by approximately 37-39%. Also, we see that smaller tree degrees result in shorter restart delays (and smaller restart gap lengths). This makes sense since nodes participating in a swap need to buffer d consecutive packets in order to restart playback.

In addition to the average values, we also computed the distributions of the restart delay, the restart gap length, the loss burst length, the total number of restarts over the nodes' lifetimes, and the total number of losses over the nodes' lifetimes. (The plots are omitted here due to lack of space.)
These distributions exhibit exponentially decreasing shapes, which indicates that most hiccups (interruptions in data delivery) are brief and that most nodes have either no or very few hiccups during their lifetimes. Lastly, although not implemented in the simulator, we expect that all the packet losses, i.e., metric (6), can be eliminated if we adopt a more graceful approach to node departures and swaps. Specifically, the main idea would be for the replaced or swapped-out interior nodes to remain in the system a bit longer and continue sending packets (which they have already accumulated) to their children, until their replacement node receives the appropriate packets and is able to take over.

Figure 5.8: Results with Different Degrees - (a) Startup Delay and Lag (Dynamic), (b) Number of Restarts and Number of Losses per Event, (c) Restart Delay, Restart Gap Length, and Loss Burst Length (each plotted against tree degrees 2 through 10, for the original and lazy schemes).

5.5 Hypercube-based Scheme

In this section we present another approach to mesh construction for streaming purposes; we refer to this scheme as hypercube-based streaming (for reasons made clear below). For simplicity of presentation we focus on streaming within a single cluster. However, this scheme can be easily adapted to streaming over multiple clusters, using the tree τ, as in the context of multi-trees. All assumptions remain the same as in the context of multi-trees (Section 5.2.2), except that, at first, we do not assume that the source S has a greater transmission capability than any other node. Towards the end of the section we comment on how to adjust the results if the same assumption is made as in the case of multi-trees, i.e., if the source S has d times the capacity of a receiver node.

5.5.1 Hypercube Streaming for Special N

For ease of illustration, we first present a streaming scheme under the assumption that the number of nodes, N, is one less than a power of two. That is, let N = 2^k − 1, where k is an integer. We can construct a scheme with O(1) buffer size requirement and O(k) playback delay (using a generalization of [24]) by reaching a state in which N/2^i nodes have the i-th packet (i = 1,...,k). In the next round, we can have all (remaining) nodes receive packet 1 and at the same time double the number of nodes having packet i, i = 2,...,k, while the source sends out a new packet (k+1). Packet 1 can now be consumed, and the protocol repeats. We take a somewhat different approach than in [24, 50], as we are designing our scheme for streaming rather than for message distribution. Moreover, in [24, 50], the number of messages is limited to a finite number, say m; thus after m time slots the source can aid in packet exchange without having to send new packets. However, in streaming (and particularly live streaming, which is potentially infinite) this is often not possible, as the source always has new packets to send. Figure 5.9 depicts a simple example of our scheme.
Specifically, Figure 5.9(a) illustrates the beginning of time slot X, where each parallelogram corresponds to a group of nodes which transmit the packet number indicated (by the number inside the parallelogram); the arrows indicate the direction of transmission (i.e., the sender/receiver relationship between groups of nodes). The shaded parallelogram indicates the absence of a node. Note that the scheme is depicted for N = 2^k − 1 nodes, where one node is receiving the next packet (here, packet 8) from the source. Figure 5.9(b) depicts how many nodes have packet i, 2 ≤ i ≤ 8, at the end of time slot X (each node also has packet 1, so we omit that from the figure), i.e., we have doubled the number of nodes which have packet i as compared to the beginning of the time slot.

Figure 5.9: Example of the O(1) buffer space scheme - (a) beginning of time slot X, (b) end of time slot X.

The main problem with this scheme, as described above, is that it can potentially involve an arbitrary communication pattern, where a node may actually need to communicate with a large number of nodes (e.g., half of all the other nodes in the network). An illustration of such a possible communication pattern with 7 nodes is given in Figure 5.10. Here, S is the source and each oval represents a node, N_j, with rectangles inside an oval representing the buffer occupancy of the corresponding node - the shaded rectangle depicts the packet number being consumed, and the clear rectangle depicts the packet number being transmitted to another node, where the arrow indicates to which node it is being transmitted.

Figure 5.10: An illustration of the O(1) buffer space scheme with N = 7.

To limit the number of neighbors with which a node communicates, we construct a specific communication pattern as follows. For simplicity, let the source S have node ID 0. In order to transmit packets, in each time slot we pair up the N+1 nodes (i.e., including the source) and have them exchange packets as follows. Let n = 0,1,2,3,... and j = 0,1,2,...,k−1. In time slot kn+j, we pair up nodes with IDs (xx...x0xx...x)_2 and (xx...x1xx...x)_2, where 0 and 1 appear in the (j+1)-st position from the right. (Here we use "(·)_2" to indicate a node ID written in binary form.) Then, the N+1 nodes can be viewed as vertices of a k-dimensional hypercube, where in each time slot communication between vertices is performed along the same dimension. For instance, suppose we have 7 nodes, plus a source, with node IDs 0 to 7. Then, (a) in time slot 3n we pair up nodes with IDs (xx0)_2 and (xx1)_2, i.e., we pair up node IDs 0, 2, 4, and 6 with node IDs 1, 3, 5, and 7, respectively; (b) in time slot 3n+1 we pair up node IDs (x0x)_2 and (x1x)_2, i.e., we pair up node IDs 0, 1, 4, and 5 with node IDs 2, 3, 6, and 7, respectively; (c) in time slot 3n+2 we pair up node IDs (0xx)_2 and (1xx)_2, i.e., we pair up node IDs 0, 1, 2, and 3 with node IDs 4, 5, 6, and 7, respectively, and so on. (We give a corresponding depiction in Figure 5.11; a small sketch of this pairing rule follows below.) The performance of our scheme is then described by Proposition 1.
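The pairing rule is simply a bit flip: in time slot kn+j, node i exchanges packets with node i XOR 2^j. The following small sketch (our own illustration) enumerates the pairs for the 8-vertex (k = 3) example above.

def hypercube_pairs(k, j):
    # Node-ID pairs that communicate in time slot k*n + j, i.e., pairs of IDs
    # differing only in bit j (the (j+1)-st bit from the right).
    n = 1 << k                           # N + 1 = 2^k vertices, IDs 0 .. 2^k - 1
    return [(i, i ^ (1 << j)) for i in range(n) if not (i >> j) & 1]

# k = 3: the 7-node-plus-source example (IDs 0..7).
for j in range(3):
    print("time slots 3n +", j, ":", hypercube_pairs(3, j))
# time slots 3n + 0 : [(0, 1), (2, 3), (4, 5), (6, 7)]
# time slots 3n + 1 : [(0, 2), (1, 3), (4, 6), (5, 7)]
# time slots 3n + 2 : [(0, 4), (1, 5), (2, 6), (3, 7)]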
Proposition 1. Given N = 2^k − 1, under the hypercube streaming scheme, where nodes are arranged as vertices of a k-dimensional hypercube, each node only communicates with k other nodes and can begin playback after time slot k+1. Moreover, each node is only required to store 2 packets in its buffer, i.e., this scheme has O(1) buffer space requirements.

Figure 5.11: An example of a hypercube communication pattern with 7 nodes (node IDs 0 through 7, shown in binary). The nodes' communication in time slots 3n, 3n+1, and 3n+2 is depicted using dashed, dotted, and solid lines, respectively.

5.5.2 Hypercube Streaming for Arbitrary N

We now extend our hypercube-based streaming approach to arbitrary values of N, where the basic idea is to divide the N nodes into multiple hypercubes. Let k_1 = ⌊log_2(N+1)⌋. Then, we construct a hypercube from the source S and N_1 = 2^{k_1} − 1 nodes; we refer to this hypercube as HC_1. In HC_1, in each time slot, the node receiving data from node ID 0 (i.e., the source) has nothing to send. Thus, this spare capacity can be utilized to send packets to a node in another hypercube. Specifically, let n = 0,1,2,... and j = 0,1,2,...,k_1−1; then in time slot nk_1+j, the node with ID 2^j receives a new packet from S, while consuming another packet (previously received from another node). Since this node has nothing to send to a node within HC_1 (refer to Figure 5.9), we let this node send the packet it just consumed to another hypercube. As a result, HC_1 as a whole can be viewed as a logical source for the remaining N − N_1 nodes, as it has the capacity to send one packet (in appropriate order) in every time slot. The one difference here is that this logical source begins sending packets in time slot k_1+1. Given the logical source HC_1 and N − N_1 nodes, we can repeat the above process until all nodes are assigned to some hypercube.

For example, suppose we need to stream packets from source S to 18 nodes. We first build a hypercube of 16 nodes, denoted by HC_1, with S and nodes with IDs 1 through 15. In HC_1, S sends packets to node IDs 1, 2, 4, and 8 in time slots 4k+1, 4k+2, 4k+3, and 4k+4, respectively. As a result, these nodes do not send out packets in the corresponding time slots. Moreover, they consume packets 4k−3, 4k−2, 4k−1, and 4k in these time slots. Thus, their unused output capacities are utilized to send the packets consumed in the corresponding time slots, and they are viewed as a logical source S′. It is easy to see that packets are sent from S′ in order, but with a 4 time slot lag as compared to S. Using S′ and node IDs 16 through 18, we build hypercube HC_2. At this point, all nodes are assigned: we have 15 nodes with a delay of 4 time slots and 3 nodes with a delay of 6 time slots. The performance of our scheme is described by Proposition 2.

Proposition 2. Given an arbitrary N, in the hypercube-based scheme each node communicates with at most O(log N) other nodes. The corresponding worst case playback delay is at most O(log^2 N), where only two packets need to be stored in each node's buffer.

In scenarios where we stream over multiple hypercubes, nodes start playback in different time slots. Thus, the average delay (average playback start time) is also of interest and is characterized in Theorem 11.

Theorem 11. Hypercube-based streaming over N nodes (for arbitrary values of N) results in an average delay of no more than 2 log N.

Proof. We prove Theorem 11 by induction.
Let ave(N) denote the average playback delay over N nodes. When N is small, we can verify directly that ave(N) ≤ 2 log N. When N is large, according to the scheme above, we take N_1 = 2^{k_1} − 1 nodes to form the first cube. The playback delay of all nodes in this cube is k_1. Thus we have:

ave(N) = [k_1(2^{k_1} − 1) + (N − 2^{k_1} + 1)(k_1 + ave(N − 2^{k_1} + 1))] / N
       = k_1 + (N − 2^{k_1} + 1)·ave(N − 2^{k_1} + 1) / N.

Note that 2^{k_1} − 1 ≥ N/2. Therefore (N − 2^{k_1} + 1)·ave(N − 2^{k_1} + 1) ≤ N log N. Also, k_1 < log N; thus ave(N) ≤ 2 log N, which completes the proof.

Lastly, in Section 5.2.2, we assumed that the source can send d packets in one time slot. Under this assumption, the hypercube-based scheme can be adjusted as follows. We can divide the N nodes (as evenly as possible) into d groups, and then apply hypercube-based streaming to these groups individually. Since each group has at most ⌈N/d⌉ nodes, the worst case and average playback delays would be bounded by O(log^2(N/d)) and 2 log⌈N/d⌉, respectively, with each node communicating with O(log⌈N/d⌉) other nodes.

5.5.3 Discussion

In Theorem 11, we give an upper bound on the average delay in our hypercube-based scheme. In real scenarios this bound is not always tight; in some cases, the actual average delay can be significantly smaller than the bound. To verify this point, we performed simulation experiments of our hypercube-based scheme. In these experiments we construct hypercubes with different numbers of nodes, N, ranging from 1 to 100000. For each N, we calculate the actual average delay. The results are depicted in Figure 5.12, where the dotted curve is the upper bound given by Theorem 11 and the solid curve is the simulated average delay. From Figure 5.12 we can observe that for some N the actual average delay is close to the bound, but for other values of N the actual average delay is only about half of the bound. This is because for some N the number of hypercubes is small, and thus all nodes have small playback delay. For example, when N = 2^k − 1, all nodes are in a single hypercube, and the average delay is k+1. When k is large, this value is close to log N (whereas the bound is 2 log N). We also note that the average delay does not increase monotonically with the number of nodes. For instance, when N increases by one, sometimes the number of constructed hypercubes decreases, resulting in improvements in average delay. For example, when N = 6, we construct 2 hypercubes, each with 3 nodes; the average delay is then 4.5 time slots. When N increases to 7, we can construct a single hypercube of 7 nodes, and the average delay is only 4 time slots. We also note that in this case the set of nodes with which each node needs to communicate changes. For example, the source now needs to communicate with node ID 1, node ID 2, and node ID 4, whereas originally it only needed to communicate with node ID 1 and node ID 2. As a result, all nodes in the two original hypercubes are affected by the arrival of a new node.

Figure 5.12: Average playback delay (simulated average delay and the theoretical bound of Theorem 11, as a function of the number of nodes).

(A small numerical sketch of the decomposition and of the bound in Theorem 11 is given below.)
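The following is our own minimal sketch, not part of the original analysis; it follows the per-cube accounting used in the recursion of the proof of Theorem 11 above (each cube of size 2^k − 1 contributes k slots of delay to itself and to all later cubes), and it ignores the extra start-up slot mentioned in Proposition 1.

import math

def cube_delays(n):
    # Per-node playback delays under the recursive decomposition of Section 5.5.2.
    delays, offset = [], 0
    while n > 0:
        k = int(math.log2(n + 1))        # k_1 = floor(log2(remaining + 1))
        size = 2 ** k - 1                # this hypercube holds 2^k - 1 nodes
        offset += k
        delays += [offset] * size
        n -= size
    return delays

if __name__ == "__main__":
    d = cube_delays(18)                  # the 18-node example: 15 nodes at delay 4,
    print(d.count(4), d.count(6))        # 3 nodes at delay 6  ->  prints "15 3"
    for n in (100, 1000, 100000):        # average delay stays below 2*log2(N)
        print(n, round(sum(cube_delays(n)) / n, 2), "<=", round(2 * math.log2(n), 2))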
Thus, in the hypercube-based scheme, node churn can result in service disruption (during hypercube re-organization) for a large number of nodes. This is a disadvantage of the hypercube-based scheme and is a result of its fairly rigid send/receive schedule. Moreover, this is also due to having to ensure that Theorem 11 (describing average delay guarantees) holds, i.e., in each round we have to construct a hypercube with no less than half of the remaining nodes. All this makes it difficult for the hypercube-based scheme to handle node churn. We note that our approach to handling node arrivals and departures is straightforward: every time a node arrives or departs, we re-construct the hypercubes. For example, when a node departs from a 7-node system consisting of a single hypercube, we re-construct the 6-node system and arrive at two hypercubes, each with 3 nodes. In the worst case, all N nodes can be affected by such a (re)construction process. As a result, some nodes may suffer from hiccups and others from QoS degradations. Hence, constructing effective algorithms for handling node churn, in the context of hypercube-based schemes, is an open problem.

5.6 Conclusions

In this chapter, we focused on structured streaming. We provided two stream construction and transmission schemes which allowed us to derive provable QoS guarantees for streaming applications. We summarize the comparison between hypercube-based and multi-tree-based streaming in Table 5.2.

Scheme                      Maximum Delay   Average Delay   Buffer Size   Number of Neighbors
Multi-tree                  O(log N)        O(log N)        O(log N)      O(1)
Hypercube (special N)       O(log N)        O(log N)        O(1)          O(log N)
Hypercube (arbitrary N)     O(log^2 N)      O(log N)        O(1)          O(log N)

Table 5.2: Comparison between multi-tree based streaming and hypercube based streaming.

As can be seen from this table, the multi-tree-based scheme provides better worst case playback delay but requires larger buffers. Moreover, in the multi-tree-based scheme each node communicates with a constant number of nodes, while in the hypercube-based scheme communication is needed with O(log N) other nodes. This is particularly important, given that small values of d are preferable in the multi-tree scheme.

Chapter 6

Online Anomaly Detection for Sensor Systems

Wireless sensor systems aid scientific studies by instrumenting the real world and collecting measurements. Given the large volume of measurements collected by sensor systems, one problem arises - an automated approach to identifying the "interesting" parts of these data sets, or anomaly detection. A good anomaly detection methodology should be able to accurately identify many types of anomalies, be robust, require relatively few resources, and perform detection in (near) real-time. Thus, in this chapter we focus on an approach to online anomaly detection in measurements collected by sensor systems, where our evaluation, using real-world datasets, shows that our approach is accurate (it detects over 90% of the anomalies with few false positives), works well over a range of parameter choices, and has a small (CPU, memory) footprint.

6.1 Introduction

Wireless sensor systems have significant potential for aiding scientific studies by instrumenting the real world and collecting measurements, with the aim of observing, detecting, and tracking scientific phenomena that were previously only partially observable or understood. However, one obstacle to achieving the full potential of such systems is the ability to process, in a timely and meaningful manner, the huge amounts of measurements they collect. Given such large volumes of collected measurements, one natural question might
be: Can we devise an efficient automated approach to identifying the "interesting" parts of these data sets? For instance, consider a marine biology application collecting fine-grained measurements in near real-time (e.g., temperature, light, micro-organism concentrations) - one might want to rapidly identify "abnormal" measurements that might lead to algal blooms, which can have devastating consequences. We can view identification of such "interesting" or "unexpected" measurements (or events) in collected data as anomaly detection. In the remainder of the chapter, we use the generic term "anomaly" for all interesting (typically, other-than-normal) events occurring either in the measured phenomena or on the measuring equipment. Automated online (or near real-time) anomaly detection in measurements collected by sensor systems is the focus of this chapter.

Anomalies can have a variety of lengths, magnitudes, and patterns. For instance, Figure 6.1(a) depicts a long duration, relatively gradual change in sensor readings, whereas Figure 6.2(b) includes several short duration, quite abrupt changes in sensor readings. Both scenarios correspond to anomalous events and should be accurately detected by an anomaly detection methodology. Thus, a good anomaly detection methodology should have the following properties. First, it should be able to accurately identify all types of anomalies as well as normal behavior (i.e., it should have low false negative and false positive rates). Second, it should be robust, i.e., the methodology should be relatively insensitive to parameter settings as well as to pattern changes in the data sets. Third, it should require relatively small amounts of resources, as these are typically limited in sensor systems. That is, to run on sensor systems, it should ideally have low computational complexity, occupy little memory space, and require little transmission power. Last, it is also desirable for a detection algorithm to be able to detect anomalies in real-time or near real-time. This is particularly important for sensor systems corresponding to temporary deployments (as it might not be as useful to detect anomalies once the deployment is over) and those monitoring hazardous natural
phenomena (e.g., the spread of contaminants in aquatic ecosystems), where prompt detection (and reaction) can be essential to reducing loss of life and money.

Anomaly detection, in general, has been studied in a number of systems contexts, most notably in networking, where several techniques have been proposed for detecting network traffic anomalies [46, 69, 14, 35]. While one might take the approach of adapting one (or more) of these techniques to sensor systems, we believe that they do not satisfy all the desirable properties described above, at least in their current form. In Section 6.5, we provide (a) quantitative results from applying network anomaly detection techniques to data collected by real sensor system deployments, and (b) intuition for why these techniques did not yield good results on such data. Consequently, the properties required of an effective anomaly detection method for sensor data, and our experience with applying network traffic anomaly detection techniques to sensor measurements, motivated us to explore methods different from prior work in network anomaly detection.

We also note that little exists in the literature on the topic of anomaly detection in sensor systems data. Most efforts are focused on detection of faulty sensor readings, such as those depicted in Figures 6.3(a) and 6.3(b) - these are typically short duration events, with values significantly deviating from the "normal" sensor readings [59]. Often, such sensor data faults are modeled as outliers and can be detected using simple rule-based approaches or by using statistical models to capture the pattern of normal sensor readings and flagging any significantly different samples as faulty [68]. In this work, we view faulty sensor readings as a special case of anomalies. As illustrated in Section 6.4, our approach is able to capture such faulty readings, as well as other long duration, "gradual" anomalies such as the one depicted in Figure 6.1(a).

To the best of our knowledge, the only efforts focused on anomaly detection in sensor systems data are [63, 64, 71]. Briefly, [63, 64] view measurements collected by a sensor system as coming from the same (unknown) distribution and "pre-define" anomalies as outliers. The main focus of that effort, which is an off-line approach, is on minimizing the communication overhead (in transmitting data needed for anomaly detection) and the corresponding energy consumption. In contrast, we focus on an online approach that, on-the-fly, builds an adaptive model of "normal" data and does not a priori define what an anomaly is. For instance, the approach in [63, 64] might only flag the most extreme measurement in Figure 6.1(a) as an anomaly, whereas our approach would flag the entire event (outlined by the dashed rectangle) as an anomaly. We give a more detailed description of [63, 64] and a quantitative comparison in Section 6.5. In [71], a change point detection based approach is used for detecting distribution changes (e.g., in mean, variance, covariances) in sensor measurements. However, (a) this approach assumes knowledge of the (time varying) probability distribution from which sensor measurements are sampled (information often not available in real-world deployments), and (b) such techniques do not meet (at least in their current form) our efficiency criteria (see Section 6.5).

Figure 6.1: (a) Data set with a long duration anomaly and (b) example of a piecewise linear model (temperature vs. sample number; in (b), the end points X(j), Y(j) and X(j+1), Y(j+1) of a line segment are indicated).

In this work, we formulate the problem of anomaly detection in sensor systems as an instance of the problem of identifying unusual patterns in time series data. Of course, one possible direction would then be to construct a timeseries-based approach, e.g., based on [68]. However, we also did not find this direction to be effective, as such techniques are (typically) not well-suited for detecting long duration anomalies. So, we do not pursue this direction further here, but in Section 6.5 we do illustrate quantitative results corresponding to applying a representative timeseries-based approach to data collected by real sensor system deployments and provide intuition for why such a technique did not yield good results. In contrast, the basic idea behind our approach is to compare the collected measurements against a reference time series.
But, to do this efficiently and robustly, the following challenging problems need to be solved: (1) How to define a reference time series? (2) How to compare two time series efficiently? (3) What metric to use in deciding whether two sensor data time series are similar or different? (4) How to update the reference time series, to adapt to (normal) changes in sensor data patterns?

We address these challenges by proposing and evaluating an anomaly detection algorithm, termed Segmented Sequence Analysis (SSA), that exhibits the desirable characteristics stated above. Briefly, SSA leverages temporal and spatial correlations in sensor measurements and constructs a piecewise linear model of sensor data time series. This is motivated by [41], which focused on searching for known patterns in time series (see Section 6.5). To detect anomalies, we compare the piecewise linear models of sensor data (collected during a time interval) and a reference model, with significant differences (as determined by a proposed similarity metric) flagged as anomalies. We use data from real-world deployments to evaluate our approach and demonstrate its accuracy, robustness, and efficiency. In summary, our main contributions are:

• We propose an approach to anomaly detection in sensor systems that is able to detect anomalies accurately and in an online manner (Section 6.2).

• We perform an extensive study using data sets from two real deployments, one consisting of about 30,000 environmental temperature measurements collected by 23 sensor nodes for around 43 days, and the other consisting of more than 10,000 soil temperature, moisture, and air humidity measurements collected by 3 sensor nodes for over 5 months. This study illustrates that our approach is accurate (it detects over 90% of the anomalies with few false positives), works well over a range of parameter choices, and has a small (CPU, memory) footprint (Sections 6.3 and 6.4).

• We show that our (online) SSA-based approach is more accurate than other potential (offline) techniques (most of which were designed in other contexts, but which constitute possible directions that could have been taken for sensor systems anomaly detection), which are also more computationally intensive (Sections 2.4 and 6.5).

6.2 Methodology

In this section, we first describe a tiered sensor system architecture that is representative of data collection deployments. We then formulate the problem of anomaly detection in sensor readings as an instance of the problem of identifying unusual patterns in time series data. Lastly, we describe our method for detecting anomalous sensor readings.

6.2.1 Sensor systems for data collection

We consider a typical tiered sensor system [34] consisting of two tiers: a lower tier of resource-constrained, battery-operated wireless motes with one or more attached sensors (e.g., temperature, humidity, acceleration), and an upper tier of more capable master nodes, each of which has significantly higher computation, storage, and communication capabilities than the motes. Here, we are interested in the class of data collection sensor systems, where each mote (usually) collects periodic sensor data, possibly performs some local processing on the data, and then transfers the resulting data over multiple hops. We model the measurements collected by a sensor m as a time series D_m[t], t = 1,2,.... For example, suppose a sensing system had 20 motes, each collecting data from 3 sensors.
Then, we would have a total of 60 time series (3 from each of the 20 motes), and we would represent these as a set {D_m[t], m = 1,2,...,60; t = 1,2,...}. In many data collection applications, these time series exhibit a high degree of temporal and spatial correlation due to the nature of the physical phenomenon being monitored (e.g., temperature or light conditions). We leverage such correlations to detect anomalies (interesting events) in the sensor data time series. As noted in Section 6.1, anomalies have various lengths, magnitudes, and patterns, and a good anomaly detection methodology should be robust to such variations. We first describe the building blocks of our approach, where the basis involves building (and continuously updating) a model of the "normal" and then determining how similar new sensor measurements are to the "normal". We then describe our approach to anomaly detection.

6.2.2 Building blocks

At a high level, our approach answers the following question: How similar is a time series of sensor measurements to a given "reference" time series? Suppose we are given two time series, D_new[t] and D_ref[t], where D_new[t] is the time series of new sensor data, and D_ref[t] is the reference time series. Then, an anomaly detection method can: (1) construct models corresponding to D_new[t] and D_ref[t]; (2) compare these two models using a similarity measure; and (3) if the model for D_new[t] is not sufficiently similar to the model for D_ref[t], conclude that there are anomalies in the time series D_new[t]. Thus, our method involves solving three main problems: (1) how to construct the models for D_new[t] and D_ref[t], (2) which similarity measure to use for comparing these models, and (3) how to decide whether the models for two different time series are sufficiently similar, given our similarity measure.
Piecewise linear model. We use a piecewise linear model to represent D_new[t] and D_ref[t]. Figure 6.1(b) depicts an example piecewise linear representation of sensor measurements collected by the SensorScope deployment [7]. Each line segment represents a small subset of sensor readings, determined using linear least-squares regression. The advantages of a piecewise linear representation of time series data are: (a) it is succinct, since only a few line segments are needed to represent a large amount of time series data; (b) it is representative, as essential information (e.g., significant patterns) in the data is captured; and (c) it is robust to changes in model parameters as well as to faults and noise in sensor measurements (as demonstrated in Section 6.4).

A succinct, representative, and robust piecewise linear model of sensor data time series is desirable for online anomaly detection. First, we can compute such a model in near real-time (Section 6.2.3). Second, it enables us to create a data driven reference model that is easy to update - hence, we do not need prior knowledge about the types of anomalies that sensor data might contain. Third, because it is succinct, it enables us to compare two different time series efficiently and to transmit models with low overhead. Finally, because it is representative of the sensor data patterns, it enables accurate detection of anomalous patterns.

Due to their usefulness in modeling time series data, linearization based approaches have also been used in other contexts. For example, [41] developed an efficient technique to search for occurrences of a known pattern within a time series. However, the problem of searching for a known pattern in time series data is different from anomaly detection, because often we do not have any prior information about the patterns exhibited by anomalous sensor readings.

Linearization Error. In order to compute a piecewise linear model, we need to define the linearization error between a sensor data point j and the line segment l covering it. We define this error as the perpendicular distance between the point j and the line l. Accordingly, we define the linearization error ε for a piecewise linear model representing a time series {D[t], t = 1,2,3,...,n} as the maximum linearization error across all the data points in D[t].

How many line segments to use? We also need to determine the number of line segments, k, to use. Intuitively, using a large number of line segments will result in a small linearization error - as explained below, this leads to lower computation cost but larger communication cost. (This tradeoff is explored in detail in Section 6.4.2.) We automatically determine the number of line segments in our piecewise linear model based on the maximum allowed linearization error ε, which is a (free) parameter in our approach. For a fixed choice of maximum linearization error ε, we use a greedy approach to determine the number of line segments needed to represent a time series. We start with the first two data points of the time series and fit a line segment, (say) l_1, to them. Then we consider the data points one at a time and recompute l_1 using linear least-squares regression to cover each new data point. We compute the distance of the new data point from the line l_1. If this distance is greater than ε, then we start a new line segment, l_2, such that the distance between the new data point and l_2 is at most ε. We keep repeating this process until we exhaust all data points. Note that our approach is suited for both offline and online processing: in an online setting, whenever the sensor collects a new reading, we can either recompute the current line segment to cover it or start a new line segment (depending on the linearization error). (A minimal sketch of this greedy segmentation is given below.)
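The following is a minimal sketch of the greedy segmentation described above (our own illustration; details such as how the closed segment is re-fit and how ties at the error threshold are handled are our assumptions, not taken from the dissertation).

import numpy as np

def perp_dist(a, b, x, y):
    # Perpendicular distance from point (x, y) to the line y = a*x + b.
    return abs(a * x - y + b) / np.sqrt(a * a + 1.0)

def segment(times, values, eps):
    # Greedy piecewise-linear fit: returns segments as (X_start, Y_start, X_end, Y_end).
    # A new segment is started when the newest point is farther than eps from the
    # least-squares line fitted to the current segment's points.
    segments, start = [], 0
    for i in range(1, len(times)):
        xs, ys = times[start:i + 1], values[start:i + 1]
        a, b = np.polyfit(xs, ys, 1)                 # least-squares line for current segment
        if perp_dist(a, b, times[i], values[i]) > eps:
            # close the current segment at point i-1 and start a new one at point i
            xs, ys = times[start:i], values[start:i]
            a, b = np.polyfit(xs, ys, 1)
            segments.append((xs[0], a * xs[0] + b, xs[-1], a * xs[-1] + b))
            start = i
    xs, ys = times[start:], values[start:]
    a, b = (np.polyfit(xs, ys, 1) if len(xs) > 1 else (0.0, ys[0]))
    segments.append((xs[0], a * xs[0] + b, xs[-1], a * xs[-1] + b))
    return segments

t = np.arange(20.0)
v = np.concatenate([np.linspace(5.0, 5.5, 10), np.linspace(8.0, 7.0, 10)])  # a jump at t = 10
print(segment(t, v, eps=0.3))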
We represent the k line segments that constitute a piecewise linear model of a time series using their end points {(X[i], Y[i]), i = 1,2,...,k}, where X[i] denotes a sample number (or the time at which a sample was collected). The corresponding Y[i] is one of the end points of a line segment and represents an estimate of the actual sensor reading collected at time X[i]. For example, in Figure 6.1(b), the line segments approximate actual sensor readings (shown using dots) - here we indicate two measurement collection times, X[j] and X[j+1], that correspond to two end points, Y[j] and Y[j+1], that are part of a piecewise linear model.

Similarity measure. Let {(X̂[i], Ŷ[i]), i = 1,2,...,k̂} and {(X̃[i], Ỹ[i]), i = 1,2,...,k̃} denote the piecewise linear representations of two time series D̂[t] and D̃[t], respectively. In order to define a similarity measure between any two piecewise linear representations, we need to first align them so that their X[i] values (end points on the x-axis) line up. For example, consider two representations {(X̂[i], Ŷ[i]), i = 1,2} and {(X̃[i], Ỹ[i]), i = 1,2,3} such that X̃[1] = X̂[1] and X̃[3] = X̂[2], and hence X̃[2] < X̂[2]. In order to align the two representations, we choose the X values as {X[1] = X̂[1] = X̃[1], X[2] = X̃[2], X[3] = X̂[2] = X̃[3]}. Hence, after alignment, the new representations are {(X[i], Ỹ[i]), i = 1,2,3} and {(X[i], Y[i]), i = 1,2,3}, where Y[1] = Ŷ[1], Y[3] = Ŷ[2], and the Y[2] value (corresponding to the sample at time X[2]) is computed using the equation of the line segment joining Y[1] and Y[3]. We define the difference between the (aligned) piecewise linear representations of two time series D̂[t] and D̃[t] as:

S(D̂, D̃) = (1/k) ∑_{i=1}^{k} |Y[i] − Ỹ[i]|.    (6.1)

Here, S(D̂, D̃) represents the average difference between the Y values of the piecewise linear representations of D̂[t] and D̃[t] over the k line segments. We chose this metric because it is efficient to compute, and it indirectly captures the difference between the two time series.

Threshold computation. We set the threshold γ (for deciding whether S(D̂, D̃) is sufficiently large) to the standard deviation of the initial D_ref[t]. We remove any CONSTANT anomalies (described in Section 6.3) before computing the standard deviation - intuitively, such measurements are not a good indication of variability in sensor data, as they typically correspond to faulty data, e.g., due to a low voltage supply to the sensor [59]. Intuitively, the standard deviation is a reasonable indication of the variability in the "normal" data. A multiple of the standard deviation could also be used, but our more conservative approach already results (Section 6.3) in a reasonably low false positive rate; more sophisticated (than threshold-based) approaches are part of future efforts.

Putting it all together. Given a time series of new sensor data, D_new[t], and a reference time series, D_ref[t], our Segmented Sequence Analysis (SSA) based approach to anomaly detection utilizes the following steps (all detailed above):

1. Linearization: We apply our linearization technique to obtain the two piecewise linear models {(X_new[i], Y_new[i])} and {(X_ref[i], Y_ref[i])}.
2. Alignment: We align the two linear representations so that they have the same X values.
3. Similarity computation: We compute the similarity, S(D_new, D_ref), between the reference model and the model for the new sensor data using Equation (6.1).
4. Anomaly detection: We detect an anomaly using a simple threshold-based approach. Specifically, if S(D_new, D_ref) is greater than a threshold γ, then we conclude that the sensor readings D_new[t] contain an anomaly.

We now describe in detail our SSA-based anomaly detection framework.

6.2.3 Using SSA on a tiered sensor network

We perform anomaly detection in a tiered sensor network in two stages - (1) a local step, executed at each mote, followed by (2) an aggregation step, executed at the master nodes. In the local step we exploit temporal correlations (in sensor readings), and in the aggregation step we exploit spatial correlations, as described next.

Local step. During the local phase (executed at individual motes), each mote m performs the following tasks: (1) construct or update a reference time series, D_m^ref[t], for its sensor readings, (2) collect new sensor readings {D_m^new[t], t = 1,2,...,T} over a period T, (3) construct or update linear models for D_m^new[t] and D_m^ref[t], and (4) perform anomaly detection using the SSA-based method (refer to Section 6.2.2). (A minimal sketch of the alignment, similarity computation, and thresholding steps of Section 6.2.2 is given below.)
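The following sketch is our own illustration of steps 2-4 above (the alignment and similarity computation of Eq. (6.1) and the threshold test); the toy models, the interpolation-based alignment, and the value of gamma are assumptions used only for illustration.

import numpy as np

def similarity(model_a, model_b):
    # Align two piecewise-linear models on the union of their breakpoints
    # (linear interpolation supplies the missing Y values), then average the
    # absolute differences of the aligned Y values, in the spirit of Eq. (6.1).
    xa, ya = zip(*model_a)               # a model is a list of breakpoints (X, Y)
    xb, yb = zip(*model_b)
    x = np.union1d(xa, xb)               # common X grid after alignment
    return float(np.mean(np.abs(np.interp(x, xa, ya) - np.interp(x, xb, yb))))

def is_anomalous(model_new, model_ref, gamma):
    # Threshold-based detection: flag an anomaly if S(new, ref) > gamma.
    return similarity(model_new, model_ref) > gamma

# Toy example: two small models; gamma plays the role of the std. dev. of the
# initial reference readings.
ref = [(0, 5.0), (10, 5.5), (20, 5.2)]
new = [(0, 5.1), (6, 7.9), (20, 5.3)]
gamma = 0.4
print(similarity(new, ref), is_anomalous(new, ref, gamma))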
We now describe in detail our SSA-based anomaly detection framework.

6.2.3 Using SSA on a tiered sensor network

We perform anomaly detection in a tiered sensor network in two stages – (1) a local step, executed at each mote, followed by (2) an aggregation step, executed at the master nodes. In the local step we exploit temporal correlations (in sensor readings), and in the aggregation step we exploit spatial correlations, as described next.

Local step. During the local phase (executed at individual motes), each mote m performs the following tasks: (1) construct or update a reference time series, D_ref_m[t], for its sensor readings, (2) collect new sensor readings {D_new_m[t], t = 1, 2, ..., T} over a period T, (3) construct or update linear models for D_new_m[t] and D_ref_m[t], and (4) perform anomaly detection using the SSA-based method (refer to Section 6.2.2).

Reference time series. To construct a reference time series at mote m, D_ref_m[t], we use the following approach. For physical phenomena that exhibit a diurnal pattern, such as ambient temperature or humidity variations, we initially start with a time series D[t] consisting of measurements collected over a period of 24 hours, say on day 1 of the deployment. Let D_new[t] be the new sensor readings collected by mote m over a time period T corresponding to, say, 9-9:30 a.m. on day 2 of the deployment. For these new readings, we define the data points in D[t] that were collected between 9-9:30 a.m. (on day 1) as D_ref[t]. We first look for anomalies in the new sensor readings D_new[t], and then use the data points in D_new[t] to update D_ref[t] using weighted averaging. For example, we can use exponential weighted moving averaging to (pointwise) update D_ref[t], i.e., D̃_ref[t] = (1 − α) × D_ref[t] + α × D_new[t], where D̃_ref[t] denotes the updated reference time series.

Figure 6.2(a) depicts the time series of humidity readings collected by a sensor from the Jug Bay deployment [6], along with two reference time series for it, constructed using T = 12 hours (36 data points, with one data point collected every 20 minutes). The reference time series labeled "Reference time series (including anomalous readings)" is computed using both non-anomalous as well as anomalous readings in D_new[t] to update the reference time series, while the "Reference time series (excluding anomalous readings)" excludes the anomalous readings in D_new[t]. The humidity measurements contain two anomalies – sharp changes in the sensor reading (marked by bounding rectangles in Figure 6.2(a)) which cause the humidity readings to increase sharply and then decay over time. It is important to detect these sharp changes in sensor readings.

As shown in Figure 6.2(a), excluding the anomalous readings in D_new[t] when updating the reference time series causes D_ref[t] to diverge from the sensor data time series. A diverging D_ref[t] is likely to result in an increase in false positives (due to many samples being flagged as anomalies) and a failure to "zoom in" on the samples where sharp changes in sensor readings occur. If we include the anomalous readings in D_new[t] when updating the reference time series, then the reference time series exhibits the same patterns as D_new[t] but with a time lag. Our evaluation results in Section 6.4 show that this lag is long enough for SSA to identify the anomalous readings.

There is a potential downside to using anomalous readings in updating D_ref[t]. If an anomaly affects a large number of samples, then SSA will fail to detect many of them. We discuss this in detail in Section 6.4 and show that, for long duration anomalies, SSA can identify anomalous samples that correspond to the start and end of these anomalies, which is also quite useful.

For scenarios where the "normal" pattern of sensor readings might not be known or might not exhibit any periodicity – e.g., sensors deployed for monitoring of birds' nests [34] – in the absence of any domain expertise, we assume that the sensor readings collected over a large duration (a 24 hour period in most cases) capture the normal patterns in the sensor data time series, and we start with such a time series as our reference. Clearly, the performance of our local anomaly detection step depends on the quality of the reference data. Reference data that does not capture the normal sensor readings, or that is corrupted by anomalies, can lead to false positives and/or false negatives.
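As a concrete illustration of the pointwise update described above, here is a minimal sketch; the smoothing weight alpha is our illustrative choice (the text does not prescribe a particular value).

import numpy as np

def update_reference(d_ref, d_new, alpha=0.2):
    """Pointwise exponential weighted moving average: new_ref = (1 - alpha)*ref + alpha*new."""
    d_ref = np.asarray(d_ref, float)
    d_new = np.asarray(d_new, float)
    return (1.0 - alpha) * d_ref + alpha * d_new

# Example: update the 9-9:30 a.m. slice of the reference with today's readings.
# d_ref_slot = update_reference(d_ref_slot, d_new_slot, alpha=0.2)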
In Section 6.4, using real-world sensor readings for periodic (e.g., ambient temperature) as well as aperiodic (e.g., soil moisture variations) phenomena, we show that our approach for selecting and updating D_ref_m[t] is robust and works well in practice.

Aggregation step. After performing its local step, each mote m sends its linear model, {(X_new_m[i], Y_new_m[i]), i = 1, ..., k}, for the new sensor readings, D_new_m[t], and the results of its local anomaly detection step to its master node. For each mote m, the master node performs another round of anomaly detection by comparing its linear model against the models from other motes (treating them as reference). Hence, a master node managing n slave motes performs O(n^2) model comparisons. The design of our aggregation step is based on the observation from several real-world deployments that the readings from sensors deployed at different locations are often correlated [7]. The aggregation step exploits these spatial correlations to detect additional anomalies (if any) that might not have been detected during the local step. The final set of anomalies is the union of the anomalies detected during the local and aggregation steps. In our current framework, the master node does not provide feedback to its slave motes. Hence, the anomalous readings from mote m detected only by the aggregation step are currently not leveraged to improve the accuracy of the local anomaly detection step. Incorporating a feedback mechanism between the aggregation and local steps is part of future efforts.

Online fault detection. To achieve online detection, we run the local and aggregation anomaly detection steps periodically, every T minutes. For example, if T = 30 min, we first collect new sensor readings for half an hour and then perform anomaly detection using the framework described above. The anomaly detection interval, T, controls the trade-off between real-time anomaly detection and resource consumption, as discussed in detail in Section 6.4.2.

6.2.4 Hybrid approach

As noted in Section 6.2.2, our piecewise linear representation is very succinct – in practice, a small number of line segments is sufficient to capture the essential information (diurnal patterns, trends lasting for a long duration, etc.) in a time series. However, because it is designed to capture significant trends, a piecewise linear representation will mask faults or anomalies that affect a very small number of sensor samples. The top plot in Figure 6.2(b) shows a temperature reading time series from the SensorScope datasets [7], and the bottom plot shows whether each sensor reading was identified as "normal" or "anomalous" by SSA. While SSA is able to detect instances of long duration anomalies (marked by circles), it fails to detect the three very short duration anomalies (marked by rectangles in the top plot). To improve the accuracy of SSA on short duration anomalies, next we propose a hybrid approach.

[Figure 6.2: Examples of Anomalies in Sensor Readings. (a) Reference time series (including vs. excluding anomalous readings); (b) Short anomalies (marked by rectangles).]

Combining SSA with Rule-based methods. We can view data faults in sensor readings as short duration anomalies (refer to Section 6.5).
Thus, it is reasonable to adapt techniques designed for fault detection to the identification of short duration anomalies. Specifically, [65, 68] are representative of such techniques, and they consider: SHORT anomalies (a sharp change in the measured sensor readings between two successive samples), NOISE anomalies (an increase in the variance of sensor readings), and CONSTANT or "Stuck-at" anomalies (the sensor reports a constant value). Thus, we use the Rule-based methods [68] (originally designed for fault detection) for detection of short duration anomalies in our hybrid approach by adding the following rules.

SHORT Rule: To detect SHORT anomalies in the time series {D[t], t = 1, 2, 3, ...}, we keep track of the change in sensor readings between two successive samples, |D[t] − D[t−1]|. If this value is larger than a threshold σ_s, then we flag D[t] as anomalous.

CONSTANT Rule: To detect CONSTANT anomalies we calculate moving variance statistics of the time series {D[t], t = 1, 2, 3, ...}. Let V[t] = variance({D[j], j = t−c+1, ..., t}) be the variance of c consecutive data readings prior to time t. If V[t] is less than a threshold σ_c, then we flag the set of samples {D[j], j = t−c+1, ..., t} as anomalous.

A rule-based method also exists for detecting NOISE data faults. But, as shown in Section 6.4, SSA is accurate at detecting NOISE anomalies; thus, we do not include the NOISE rule as part of our hybrid method.

To automatically determine the detection thresholds, σ_s and σ_c, we use the histogram-based approach of [68]. We plot the histogram of the change in sensor readings between two successive samples (for the SHORT rule) or of the variance of c samples (for the CONSTANT rule) and select one of the modes of the histogram as the threshold.

Thus, in scenarios where both short and long duration anomalies are expected, we propose a hybrid approach for anomaly detection. Specifically, every T minutes, we use the Rule-based methods to detect and mark short duration anomalies, and then use SSA to detect the remaining anomalies. (Footnote 2: It is possible to combine other detection methods with SSA to design variants of our hybrid approach, e.g., [68] proposes other techniques, such as HMM-based methods, for detecting sensor data faults. However, the other methods in [68] are much more computationally intensive and require a much longer training phase (than our reference model).) We evaluate our hybrid approach using real-world datasets in Section 6.4 and show that it is effective at detecting short and long duration anomalies. Our evaluation also shows that Rule-based methods boost the efficacy of SSA only in situations where we are interested in detecting short duration anomalies along with interesting long duration events or anomalies (e.g., changes in sensor reading patterns). Hence, in situations where detecting short duration anomalies is not of interest, the additional complexity of using Rule-based methods is not needed. Note that we do not need to remove short duration anomalies (or data faults) from the data – e.g., by replacing a sensor reading D[j] corrupted by a SHORT anomaly with the average of its adjacent samples D[j−1] and D[j+1] – in order for SSA to be able to detect long duration anomalies. Our evaluation results in Section 6.4 show that the presence of short duration anomalies does not impact the accuracy of SSA when it comes to detecting long duration anomalies.
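For concreteness, a minimal sketch of the two rules above (our illustration; the window size c and the thresholds stand in for the histogram-derived values σ_s and σ_c described above):

import numpy as np

def short_rule(d, sigma_s):
    """SHORT rule: flag sample t when |D[t] - D[t-1]| exceeds sigma_s."""
    d = np.asarray(d, float)
    flags = np.zeros(len(d), dtype=bool)
    flags[1:] = np.abs(np.diff(d)) > sigma_s
    return flags

def constant_rule(d, sigma_c, c=10):
    """CONSTANT rule: flag every window of c consecutive samples whose variance is below sigma_c."""
    d = np.asarray(d, float)
    flags = np.zeros(len(d), dtype=bool)
    for t in range(c - 1, len(d)):
        if np.var(d[t - c + 1:t + 1]) < sigma_c:
            flags[t - c + 1:t + 1] = True
    return flags

# In the hybrid approach, every T minutes these rules are applied to the newly collected
# window first, and SSA (steps 1-4 sketched earlier) is then run on the same window to
# catch the remaining, typically longer-duration, anomalies.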
Complexity and Overhead. Of all the steps in SSA, linearization requires the most computation, with the worst case complexity being O(n^2), where n is the number of measurements accumulated in a time slot of length T. Since we use linear least-squares regression to determine the best-fit line segment, the cost of approximating d (one dimensional) data points with a line segment is O(d). However, our greedy approach performs a least-squares regression fit every time a new sensor sample is recorded. In the worst case, we may need to perform least-squares regression n times (once for each data point), resulting in O(n^2) computational complexity for the linearization step, and hence, for SSA. In practice, SSA is quite efficient, as shown in Section 6.4 (as n is typically not very large). We note that the Rule-based methods used in our hybrid approach are simple and have O(n) computational complexity; thus, they do not increase the overall complexity of the hybrid approach.

SSA incurs a communication overhead every time a mote conveys its linear model to its master node. Note that a mote needs to convey 4 data points per line segment – two X[i] values (sample times) and the corresponding two Y[i] values. Since a mote's linear model consists of k line segments, the communication overhead is O(k). Note that this overhead is incurred every T minutes, since a mote recomputes its linear model once every T minutes.

6.3 Experimental setup

Sensor datasets. The sensor data time series used in our evaluations come from the SensorScope [7] and the Life Under Your Feet [6] projects. Both projects represent the state-of-the-art in sensor systems and collect measurements in very different environments. Hence, the two datasets allow us to evaluate SSA on representative and diverse sensor system data.

In the SensorScope project, large networks of sensors are deployed to collect environmental data such as temperature, humidity, solar radiation, etc. In this work, we use temperature readings collected from 23 sensors deployed in the Grand St. Bernard pass between Switzerland and Italy in 2007. Each sensor collected samples every 2 minutes for 43 days. Since the temperature measurements exhibit a diurnal pattern, the sensor data time series are periodic with the period being 720 data points (collected every 24 hours). In what follows, we show results for all 23 sensor data time series. We refer to these time series as SensorScope 1 through SensorScope 23.

Our second sensor data source is from the Life Under Your Feet project [6], which studies soil ecology in a number of locations. We use data sets collected at the Jug Bay Wetland Sanctuary in Anne Arundel County, Maryland between June 2007 and April 2008. In this deployment, sensors were placed in the nests of Box Turtles to study the effect of soil temperature and soil moisture on the incubation of turtle eggs. Measurements of soil temperature, soil moisture, box temperature, and box humidity are collected every 20 minutes for more than 5 months. These measurements exhibit very diverse patterns. For example, as depicted in Figure 6.3(a), the soil moisture data are non-periodic – here the soil moisture readings are close to 8% when it is not raining, but they exhibit a sharp jump followed by a gradual decay when it rains. Hence, for the soil moisture time series, instances of rainfall are the anomalies (or events) of interest that we try to detect using SSA. In contrast, the box humidity data sets are periodic with a period of 72 data points (or 24 hours).
The Jug Bay dataset consists of readings from 3 different sensors. In what follows, we show results for the soil moisture readings collected (we refer to them as Soil Moisture 1, Soil Moisture 2, and Soil Moisture 3), as well as for the box humidity time series (we refer to them as Box Humidity 1, Box Humidity 2, and Box Humidity 3).

[Figure 6.3: Soil Moisture Readings, Constant, Change in Variance and Change in Shape anomalies. (a) Soil Moisture Readings with Short anomalies; (b) Anomaly: Constant; (c) Anomaly: Change in Variance; (d) Anomaly: Change in Shape.]

Anomalies in the data sets. To the best of our knowledge, there are no publicly available data sets with anomalies already annotated. Thus, to obtain the ground truth, we visually inspected the sensor data time series from the SensorScope and the Jug Bay deployments to identify long and short duration anomalies. This is consistent with current practice for other data sets (e.g., Internet traces in [46]) that lack ground truth. To identify long duration anomalies, we used the (subjective) criterion of "what kind of patterns would a human find interesting?". The short duration anomalies that we identified resemble the sensor data fault types (SHORT, NOISE, and CONSTANT faults) described in [59, 68]. We categorize these identified anomalies into the five groups shown in Table 6.1. We note that this categorization is done for ease of results presentation (in Section 6.4) only and is in no way used in our anomaly detection approach.

Anomaly | Description | Duration
Change in mean | Anomalous readings differ significantly from the average value of normal readings (e.g., as in Figure 6.1(a)) | Long
Change in variance | Anomalous readings exhibit less or more variability than normal readings (e.g., as in Figure 6.3(c)) | Long & Short
Short spike | Equivalent to the SHORT fault data type [59] (e.g., as in Figure 6.3(a)) | Short
Constant reading | Sensor reports a constant value over a period of time (e.g., as in Figure 6.3(b)) | Long & Short
Change in shape | Anomalous readings differ in mean and/or variance from normal readings, but with shorter duration than Change in mean and Change in variance (e.g., as in Figure 6.3(d)) | Long & Short
Table 6.1: Anomaly categories

6.4 Experimental Results

We now evaluate our SSA-based approach and illustrate its goodness using the following criteria (a comparison with related literature is presented in Section 6.5).

• Accuracy: SSA alone detects most long duration anomalies (plus a significant fraction of the short duration ones), and our hybrid approach detects both long and short duration anomalies accurately.
• Sensitivity: Our results are not sensitive to SSA's parameters, namely the settings of the linearization period T and the maximum linearization error ǫ.
• Cost: SSA has low computation and memory cost, and hence it can be effectively implemented in sensor systems.
• Robustness: SSA is robust to the presence of sensor data faults in the reference time series (i.e., there is no need to "clean" the data before running SSA).

6.4.1 Accuracy evaluation

We first describe our method's accuracy, using the data sets and the ground truth identification described in Section 6.3.
We use as our metrics (1) the number of false positives (detecting non-existent anomalies), and (2) the number of false negatives (not being able to detect an anomaly). Specifically, the results in the tables below are presented as follows: an x/y entry indicates that x out of y anomalies were detected correctly (corresponding to y−x false negatives), and we also indicate the number of corresponding false positives (FP). Note that a long duration anomaly may consist of many consecutive data points. In this work, we focus on detecting these events rather than on identifying each and every anomalous data point within an event. (Thus, when 50% or more of the data points of a long duration anomaly are identified by SSA as anomalous, we consider this to be a successful detection. Footnote 3: Of course, more sophisticated approaches are possible, but as our results indicate, this simple approach already results in good accuracy and allows us to focus on the evaluation of SSA, without additional complications.)

The accuracy results of our hybrid approach on all data sets are given in Tables 6.2 and 6.3. Our hybrid method is able to detect long duration and short duration anomalies with a small number of false positives, and often without any false negatives. Most of the false positives are due to the Rule-based part of the hybrid approach rather than to the SSA part (as explained below).

Tables 6.2 and 6.3 also show that long duration anomalies – particularly the Change in Mean and Change in Shape anomalies – occur quite often in the SensorScope and the Jug Bay datasets (refer to the last row of both tables). For example, over the course of 43 days, a total of 84 instances of Change in Mean and 139 instances of Change in Shape anomalies occurred in the SensorScope datasets; on average, 2 instances of Change in Mean and 3 instances of Change in Shape anomalies per day. Previously, others have shown that short duration anomalies or data faults (Short spikes, Constant readings, Noise faults) are quite prevalent in real-world datasets [59, 68]; this work is the first to demonstrate that long duration anomalies occur quite often as well.

Under our hybrid approach, anomalies can be detected at three different stages – the Rule-based methods, the local step in SSA, and the aggregator step in SSA. For both the SensorScope and the Jug Bay datasets, we found that the aggregator step in SSA did not add significantly to the accuracy of SSA. This is because the combination of the Rule-based methods and the local step in SSA was able to detect most of the anomalies. We now focus on understanding the contribution to our hybrid approach's accuracy of SSA vs. the Rule-based methods.

The first two rows of Table 6.4 show the results of applying SSA alone (without the use of Rule-based methods) on the Soil Moisture 1 and the SensorScope 1 time series. Based on these results, we make the following observations: (1) SSA is accurate at detecting long duration anomalies such as Change in Average, Change in Variance, and Change in Shape, and (2) SSA can fail to detect short duration anomalies such as Short spikes. For example, while it is able to detect more than 70% of the Short spikes in Soil Moisture 1, it detects only about 50% of the Short spikes in SensorScope 1. This makes sense, as SSA is intended more for longer duration anomalies. The utility of the hybrid approach can be seen, e.g., by comparing the Short results for Soil Moisture 1 and SensorScope 1 in Tables 6.2 and 6.3 with those in Table 6.4.
The hybrid approach outperforms SSA on short duration anomalies as it uses Rule-based methods designed specifically for short duration anomalies like Short spikes and Constant readings. However, our hybrid approach incurred a higher false positive rate than SSA, and a detailed inspection of the samples falsely identified as anomalous revealed that this increase was due to the Rule-based methods. Prior work [68] showed that Rule-based methods can incur a high false positive rate mainly because the histogram method for determining their fault detection threshold (Section 6.2.4) does not always identify a good threshold. We also verified this by computing the histogram using the entire data set, which significantly reduced the false positive rate. However, such an approach would not be online and hence is not used here.

We also verified that the Rule-based methods alone do not provide any benefits in detecting long duration anomalies. For instance, this can be seen by comparing the results for the Soil Moisture 1 and the SensorScope 1 data in Tables 6.2 and 6.3 with those in Table 6.4, where our hybrid method performs the same as SSA w.r.t. detecting long duration anomalies like Change in Average and Change in Shape. In fact, the Rule-based methods can perform quite poorly when used for identifying long duration anomalies. The last row of Table 6.4 shows the results of applying the Rule-based methods alone on the Box Humidity 2 data – compare that to the Box Humidity 2 results in Table 6.3. As expected, the Rule-based methods detect the short duration anomalies (Short spikes and Constant readings), but fail to detect most of the long duration anomalies.

Data Set | Change in Mean | Change in Var | Change in Shape | Short | Constant | False Positives
SensorScope 1 | 3/4 | 0/0 | 6/7 | 90/90 | 2/2 | 7
SensorScope 2 | 8/8 | 0/0 | 6/7 | 86/86 | 6/6 | 6
SensorScope 3 | 7/7 | 2/2 | 9/10 | 64/64 | 5/5 | 12
SensorScope 4 | 5/5 | 2/2 | 12/13 | 220/222 | 13/13 | 27
SensorScope 5 | 6/6 | 4/4 | 9/9 | 726/819 | 34/34 | 0
SensorScope 6 | 7/7 | 0/0 | 8/10 | 206/206 | 1/1 | 7
SensorScope 7 | 8/8 | 4/4 | 12/12 | 555/567 | 54/54 | 0
SensorScope 8 | 6/6 | 0/0 | 9/10 | 243/243 | 2/2 | 4
SensorScope 9 | 6/6 | 2/2 | 11/12 | 65/65 | 23/23 | 6
SensorScope 10 | 5/5 | 0/0 | 10/12 | 46/46 | 2/2 | 3
SensorScope 11 | 7/7 | 0/0 | 8/10 | 122/122 | 1/1 | 6
SensorScope 12 | 7/7 | 0/0 | 11/13 | 84/84 | 13/13 | 7
SensorScope 13 | 8/8 | 2/2 | 13/14 | 250/250 | 15/15 | 5
SensorScope 14 | 5/5 | 4/4 | 5/7 | 595/633 | 26/26 | 19
SensorScope 15 | 6/6 | 2/2 | 8/9 | 464/475 | 24/24 | 12
SensorScope 16 | 4/4 | 2/2 | 7/7 | 120/120 | 12/12 | 25
SensorScope 17 | 5/5 | 4/4 | 6/6 | 166/166 | 17/17 | 13
SensorScope 18 | 6/7 | 2/2 | 16/18 | 56/56 | 9/9 | 12
SensorScope 19 | 3/3 | 0/0 | 4/6 | 98/98 | 1/1 | 15
SensorScope 20 | 3/3 | 0/0 | 3/4 | 77/78 | 1/1 | 9
SensorScope 21 | 2/2 | 1/1 | 3/4 | 332/337 | 26/26 | 5
SensorScope 22 | 3/3 | 0/0 | 3/5 | 88/88 | 1/1 | 11
SensorScope 23 | 3/3 | 0/0 | 4/4 | 84/84 | 2/2 | 17
Total | 121/123 | 31/31 | 183/209 | 4837/4999 | 290/290 | 238
Table 6.2: Hybrid method: SensorScope

Data Set | Change in Mean | Change in Var | Change in Shape | Short | Constant | False Positives
Soil Moisture 1 | 7/8 | 1/1 | 8/10 | 53/55 | 1/1 | 0
Soil Moisture 2 | 8/9 | 1/1 | 9/11 | 74/74 | 1/1 | 2
Soil Moisture 3 | 5/5 | 0/0 | 6/7 | 42/42 | 0/0 | 4
Box Humidity 1 | 2/2 | 4/4 | 18/19 | 15/15 | 0/0 | 2
Box Humidity 2 | 5/5 | 7/7 | 27/28 | 16/16 | 2/2 | 2
Box Humidity 3 | 3/3 | 2/2 | 1/1 | 17/17 | 2/2 | 3
Total | 30/32 | 15/15 | 69/76 | 217/219 | 6/6 | 13
Table 6.3: Hybrid method: Jug Bay

Method | Data Set | Change in Mean | Change in Var | Change in Shape | Short | Constant | FP
SSA | Soil Moisture 1 | 7/8 | 1/1 | 8/10 | 40/55 | 1/1 | 0
SSA | SensorScope 1 | 3/4 | 0/0 | 6/7 | 46/90 | 1/2 | 1
Rule based | Box Humidity 2 | 2/5 | 0/7 | 5/28 | 16/16 | 2/2 | 2
Table 6.4: SSA vs. Rule-based methods
6.4.2 Sensitivity evaluation

The linearization period T and the maximum linearization error ǫ are the two main parameters in our SSA-based approach. Next, we present an analysis of the sensitivity of our results to these parameters' settings.

Impact of T. SSA computes the similarity measure S(D̂, D̃) every T time units. The smaller the value of T, the more real-time is SSA's anomaly detection. However, if T is too small, there may be too few data points for SSA to accurately capture the pattern of a long duration anomaly. Thus, T controls the trade-off between (near) real-time anomaly detection and the accuracy of detecting long duration anomalies.

To characterize the sensitivity of SSA's accuracy to T, we ran SSA using different values of T. For the SensorScope datasets, we used T values ranging from 30 minutes (the time to collect 15 data points) to 8 hours (the time to collect 240 data points). For the Jug Bay datasets, we varied T from 2 hours (the time to collect 6 data points) to 24 hours (the time to collect 72 data points).

We found that changing T's value did not affect SSA's accuracy w.r.t. Change in Average and Change in Variance anomalies, but it did affect the accuracy w.r.t. Change in Shape anomalies. We show examples of SSA's performance in detecting instances of the Change in Shape anomaly in the SensorScope 2 and Box Humidity 1 time series (for different T values) in Tables 6.5(a) and 6.5(b), respectively. Very small T values (corresponding to few data points in a linearization period) result in a significant number of false positives. As T grows, the number of false positives improves and becomes reasonably insensitive to T. The false negative rate is quite insensitive to the value of T, with a small increase for very large values of T. Intuitively, this can be explained as follows. For a small value of T, SSA considers only a few data points at a time, and even small differences in these data points (e.g., due to measurement noise) can cause SSA to misclassify these points as an instance of the Change in Shape anomaly, resulting in an increase in the false positive rate. (Footnote 4: The Change in Average and the Change in Variance anomalies occur over longer periods of time; thus, to cause a false positive corresponding to these anomalies, (random) noise would have to affect a much greater number of samples, which is unlikely to happen.) The small increase in false negatives for large values of T is due to very short duration Change in Shape anomalies being "averaged out" (with a large T).

(a) SensorScope 2
T Value | 0.5 | 1 | 2 | 4 | 8
Detected | 7/7 | 7/7 | 6/7 | 6/7 | 6/7
FP | 8 | 4 | 0 | 0 | 0
(b) Box Humidity 1
T Value | 2 | 4 | 6 | 8 | 12 | 24
Detected | 19/19 | 19/19 | 19/19 | 19/19 | 18/19 | 18/19
FP | 17 | 10 | 6 | 4 | 2 | 2
Table 6.5: Sensitivity test for T: SSA with different T values; Change of shape anomaly

In summary, our evaluation shows that our method is not sensitive to the linearization period T, provided it is long enough to collect a reasonable number of data points. The main reason for this is that beyond a certain value of T, our similarity metric S(D_new, D_ref) does not change significantly with T, as illustrated next.

For a fixed T value, we ran SSA on SensorScope 1 and SensorScope 2 separately, and recorded the similarity values (w.r.t. the reference time series) computed every T time units. For example, for T = 1 hour, SSA computes a similarity value for new data points collected every hour using Equation (6.1).
We then computed the mean and the variance of the differences in the similarity values for different values of T for SensorScope 1 (and separately for SensorScope 2). For example, consider a set of SensorScope 1 data points collected over 2 hours. Let α_2 be the similarity value for these data points when T = 2 hours, and for T = 1 hour, let α_11 and α_12 be the similarity values for the data points collected during the first and the second hour, respectively. The mean and variance of the differences in the similarity values for T = 1 hour and T = 2 hours are computed using the values |α_2 − α_11| and |α_2 − α_12|. These values capture the difference in the similarity values associated with a set of data points for different T values.

Table 6.6(a) shows the results, where |S_q[t] − S_r[t]| is the difference in similarity values corresponding to T = q hrs and T = r hrs, and the (x, y) entry is the corresponding mean and variance of that difference. The similarity values for T ≥ 2 are close, but the similarity values for T = 1 hr are different from those for T = 2 hrs. Recall that SSA compares similarity values against a threshold to detect anomalies. Hence (for SensorScope 1 and SensorScope 2), SSA's performance with T = 1 hr differs from its performance with T = 2 hrs; but for T ≥ 2, SSA's performance is insensitive to the choice of T. We observed a similar behavior for the other SensorScope time series. For the Jug Bay dataset, we observed similar behavior for T ≥ 12 hrs. The range of T values over which SSA's performance is insensitive is different for the SensorScope and Jug Bay datasets primarily because of the differences in sampling intervals (2 minutes for SensorScope and 20 minutes for Jug Bay). So, it makes sense that it takes much longer to collect a sufficient number of samples in the Jug Bay data sets, and hence a larger T is required to achieve robust SSA performance.

(a) Similarity metric as a function of T
Data sets | SensorScope 1 | SensorScope 2
|S_1[t] − S_2[t]| | (1.35, 1.40) | (1.28, 2.05)
|S_2[t] − S_3[t]| | (0.22, 0.08) | (0.25, 0.12)
|S_3[t] − S_4[t]| | (0.25, 0.11) | (0.29, 0.17)
|S_4[t] − S_6[t]| | (0.29, 0.14) | (0.30, 0.20)
(b) Impact of ǫ
ǫ value | SensorScope 1: # Lines | SensorScope 1: Running Time | SensorScope 2: # Lines | SensorScope 2: Running Time
0.01 | 92.55 | 0.08 | 86.51 | 0.10
0.05 | 47.49 | 0.08 | 50.50 | 0.11
0.10 | 25.45 | 0.10 | 29.39 | 0.12
0.50 | 3.38 | 0.34 | 3.80 | 0.29
1.00 | 1.85 | 0.46 | 1.98 | 0.40
Table 6.6: Robustness to parameters.

Impact of ǫ. As discussed in Section 6.2.4, for n data points collected during an interval of length T, the worst case running time of our linearization algorithm is O(n^2). Such worst case scenarios arise when a small number of line segments is sufficient to model all n data points. That is, in the extreme case where a single line segment is sufficient to cover all n data points, our greedy approach will be forced to solve larger and larger instances of the least-squares regression problem in each iteration – the first iteration will have 2 samples, the second 3 samples, and the (n−1)st will have n samples, resulting in O(n^2) complexity. At the other extreme is the case where each pair of consecutive samples defines a new line segment, leading to O(n) line segments and O(n) computation.

The number of line segments, k, used to model a sensor data time series depends on the desired linearization error ǫ. Intuitively, a small value of ǫ will force us to use more line segments, which would lead to a lower computational cost. However, the communication overhead of our approach is O(k).
Thus, ǫ controls the trade-off between computational cost and communication overhead (or the size of the model). We found that in practice (e.g., in the SensorScope and Jug Bay datasets) a small value of ǫ results in each line segment covering a small number of points. Intuitively, this happens because sensor readings typically exhibit non-linear patterns (e.g., diurnal patterns or sharp increases in value when an event occurs), and approximating non-linear patterns using line segments results in only a few points being covered by a single line. Table 6.6(b) shows the average number of line segments used to model 120 data points collected when T = 4 hrs, for SensorScope 1 and SensorScope 2, for different ǫ values. As the ǫ value is reduced from 1 to 0.01, the average number of line segments increases from 1.85 (1.98) to 92.55 (86.51) for SensorScope 1 (SensorScope 2). (Note that we can use at most 119 line segments to model 120 data points.) Table 6.6(b) also shows our linearization approach's running time (on a PC with a 2.8 GHz processor and 2 GB of RAM) on the entire time series; as expected, it is smaller for smaller ǫ values.

The Table 6.6(b) results support our intuition that choosing a small value for ǫ results in faster execution time. However, the overhead of the communication between a mote and its master is O(k) (Section 6.2.4) – a small value of ǫ reduces the computation time at the expense of a larger communication overhead. In scenarios where the aggregator step does not boost the accuracy of SSA (as is the case with the SensorScope and the Jug Bay datasets; Section 6.4.1), we can either do away with the aggregator step or invoke it less frequently than after every T minutes. This can help us reduce the computational cost of the local step (by selecting a small ǫ) while not incurring a significant communication overhead.

The choice of ǫ can also determine how well a sensor time series is approximated by our piecewise linear model. Intuitively, we should choose an ǫ value that is very small compared to the threshold γ against which the similarity measure S(D_new, D_ref) is compared to detect anomalies (see Section 6.2). With ǫ << γ, it is unlikely that linearization errors will significantly impact the similarity measure, and hence the accuracy of SSA. (Footnote 5: As discussed in Section 6.2, we set γ to be equal to the standard deviation of the initial reference time series; γ for the SensorScope and Jug Bay Box Humidity datasets was within the interval [4,7] and [6,9], respectively.) In this work, we conservatively set ǫ = 0.1. We also investigated how ǫ affects SSA's accuracy by trying different values between 0.1 and 1, and found that the accuracy of SSA was the same for the different values of ǫ. As shown in Table 6.6(b), ǫ = 0.1 achieves a good computational cost vs. communication overhead trade-off – choosing a smaller value did not reduce the running time significantly but led to a large increase in the number of line segments k.

6.4.3 CPU and Memory Usage

In Section 6.2.4, we discussed the computation complexity and overhead of SSA. We also measured the running time and memory usage of SSA on a low-end netbook with an Atom 1.6 GHz processor and 1.5 GB of RAM. The processing power and available memory of this netbook are comparable to that of the emerging master class devices used in today's tiered sensor network deployments [34]. We first run a monitoring program in the background, and then run SSA over all 23 time series in the SensorScope data sets.
We record the running time for each time series. The monitoring program also records the memory usage of SSA every second. We perform two sets of experiments with different linearization periods T. In Figure 6.4(a), we show the maximum running time and memory usage of SSA over all the 23 time series. For both T = 60 and 120, SSA takes less than 5 seconds to process approximately 30,000 samples with a small memory footprint. These results show that the computation and memory requirements of SSA are small and well within the resources available on today's master class devices.

[Figure 6.4: (a) Resource usage of SSA and (b) Data set with faulty readings.]
(a) Running time and memory usage of SSA:
T (Sample Number) | Max Running Time (Sec) | Max Memory Usage (KB)
60 | 4.876 | 2048
120 | 4.796 | 2052

6.4.4 Robustness evaluation

Data faults are quite prevalent in real-world sensor system deployments [59, 68] and can be caused by bugs in hardware or software, improper sensor calibration, or motes running out of battery power [59]. Hence, in a real-world scenario, it is quite likely that the reference time series D_ref[t] used by SSA may itself contain anomalous readings. Note that, as described in Section 6.2.3, when updating D_ref[t] using new sensor data, we do not discard the anomalous readings. Even with an anomaly-free reference time series initially, as a result of updating it with these anomalous data points, the reference time series may eventually exhibit the same pattern as the anomalous readings. For example, we observed an instance of this problem in a time series from the SensorScope datasets, as described next.

Figure 6.4(b) (top plot) shows a SensorScope time series with a long duration Constant reading anomaly that lasted for 6 days. The bottom plot in Figure 6.4(b) shows the readings identified as anomalous by SSA alone and by the Rule-based methods. SSA is able to correctly identify the samples corresponding to the start and the finish of the Constant reading anomaly but misses most of the "in-between" anomalous samples. This is due to our design choice to update D_ref[t] using anomalous samples as well. We can see in Figure 6.4(b) (top plot) that after (approximately) 400 successive samples corrupted by the Constant reading anomaly, the reference time series values are quite close to the anomalous readings, and as a result, SSA stops flagging the subsequent readings as anomalous. In Section 6.2, we justified our use of anomalous readings to update D_ref[t] by demonstrating that it helps us "zoom in" on samples where the sharp changes in sensor readings happen (see Figure 6.2(a)). However, as this example shows, updating D_ref[t] using anomalous readings can cause SSA to miss a large number of samples affected by a long duration anomaly, and to identify only the beginning and end of a long duration anomalous event. This again motivates the use of our hybrid approach – i.e., for the time series in Figure 6.4(b) (bottom plot), we identify the samples missed by SSA using the CONSTANT rule.

The SensorScope time series in Figure 6.4(b) (top plot) also contains two instances of Short spikes (that occur before the long duration Constant anomaly). Even though SSA alone fails to detect them, their presence does not impair SSA's ability to detect the long duration anomaly that occurs later.
Hence, we do not need to "clean" the time series data before running SSA. We can see in Figure 6.4(b) (top plot) that the SHORT faults do not affect the reference time series significantly, i.e., SSA is robust to faults.

6.5 Quantitative comparison

We now present results from a quantitative comparison of our hybrid approach and anomaly detection techniques based on ARIMA models [68], PCA [46], wavelet decomposition [14], and K-means clustering [63]. As noted above, these techniques have proven to be effective at detecting sensor data faults and network anomalies. However, our evaluation shows that an "out-of-the-box" application of these techniques for detecting sensor system anomalies performs poorly.

Comparison with ARIMA based method. ARIMA (Autoregressive Integrated Moving Average) models are a standard tool for modeling and forecasting time series data with periodicity [15], and [68] leverages temporal correlations in sensor measurements to construct an ARIMA model of sensor data. This model is used to predict the values of future readings, with new sensor readings compared against their predicted values – if the difference between these values is above a threshold (the 95% confidence interval for the predictions), then the new data is marked as anomalous. We compare our hybrid approach against the ARIMA L-step method from [68], where the ARIMA model is used to predict the next L sensor readings.

We first trained the ARIMA model to estimate its parameters, using sensor readings (from SensorScope 1 and SensorScope 2) collected over 3 days as training data (a separate training phase was required for SensorScope 1 and SensorScope 2; [68] also uses training data from 3 days; these are more favorable conditions for ARIMA, as SSA uses a shorter period for its reference model). The ARIMA L-step method with L = 24 hrs flagged 12,135 (of the 30,356 data points in SensorScope 1) as anomalies. Our inspection revealed a total of 107 anomalies that affect more than 7,500 data points. While the ARIMA L-step method identified most anomalous samples, it also falsely identified a large number of samples as anomalous. The extremely high number of false positives resulting from the ARIMA L-step method reduces its utility.

We failed to train ARIMA on SensorScope 2 because the training data contains a Constant readings anomaly that affects almost two-thirds of the samples. This failure highlights the lack of robustness of ARIMA based methods to long duration anomalies in the training data. SSA's counterpart to the ARIMA training phase is the selection of an initial reference time series, and SSA tolerates anomalies in its initial reference time series (see Section 6.4.4).
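For illustration, a minimal sketch of an ARIMA L-step detector of the kind compared against above; the model order, the per-chunk refitting, and the use of statsmodels' ARIMA interface are our own assumptions, not details taken from [68].

import numpy as np
from statsmodels.tsa.arima.model import ARIMA

def arima_lstep_flags(train, new, L, order=(2, 1, 2)):
    """Flag each new reading that falls outside the 95% prediction interval of an
    L-step-ahead ARIMA forecast fitted on all previously seen data."""
    flags = np.zeros(len(new), dtype=bool)
    history = np.asarray(train, float)
    for start in range(0, len(new), L):
        chunk = np.asarray(new[start:start + L], float)
        res = ARIMA(history, order=order).fit()
        forecast = res.get_forecast(steps=len(chunk))
        lo, hi = forecast.conf_int(alpha=0.05).T          # 95% interval per step
        flags[start:start + L] = (chunk < lo) | (chunk > hi)
        history = np.concatenate([history, chunk])        # slide the model forward
    return flags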
Comparison with network anomaly detection techniques. Techniques such as Principal Component Analysis (PCA) [46] and wavelet analysis [14, 69, 35], used for detecting network traffic anomalies, are not well suited for online detection in sensor systems, primarily because they are computationally intensive and difficult to perform in an online manner. However, one can still ask: how accurately would those techniques perform on sensor systems data, i.e., in an offline manner? As a reasonably representative technique, we use the PCA based approach in [46], which collects periodic measurements of the traffic on all network links and splits it into "normal" and "residual" traffic using PCA. The residual traffic is then mined for anomalies using the Q-statistic test, in which the squared L2-norm ||y||^2 of the residual traffic vector y for a link is compared against a threshold. If it is greater than the threshold, then the traffic vector y is declared anomalous [46].

We ran the PCA based method on the data from SensorScope time series 6, 8, 9, 10, 11, and 12. We chose these because their start and end times are the same. The input to PCA was a data matrix with 6 columns (one for each sensor) and 29,518 rows (the total number of samples collected). Note that the PCA based method is applied in an offline fashion, using the entire time series data from 6 different sensors, whereas in our SSA-based hybrid approach, at best, the aggregator step would get access to linear models from 6 sensors during the past T = 4 hours only.

The results for the PCA based method are summarized in Table 6.7. It fails to detect most long duration anomalies (5 out of 38 Change in Mean anomalies and 4 out of 67 Change in Shape anomalies). It does better at detecting Short spikes but is still not as accurate as our hybrid approach. Thus, even under a best case scenario (offline, with access to the entire time series), the PCA based method does not perform as well as our hybrid approach. Recall that it identifies anomalies by looking at the L2-norm of the data vector in the residual space. As pointed out in [46], the PCA based method works best when the intensity of an anomaly is large compared to the variability of the normal data. This is not the case with most of the long duration anomalies present in the sensor data analyzed in Table 6.7. For instance, Figure 6.5(a) shows a Change in Mean anomaly in the SensorScope 12 time series that the PCA based method fails to detect. It also shows a Short anomaly (spike) that the PCA based method is able to detect.

Data Set | Change in Mean | Change in Var | Change in Shape | Short | Constant
SensorScope 6 | 1/7 | 0/0 | 2/10 | 45/206 | 0/1
SensorScope 8 | 1/6 | 0/0 | 1/10 | 59/243 | 0/2
SensorScope 9 | 1/6 | 1/2 | 0/12 | 47/65 | 4/23
SensorScope 10 | 0/5 | 0/0 | 0/12 | 31/46 | 0/2
SensorScope 11 | 1/7 | 0/0 | 0/10 | 33/122 | 0/1
SensorScope 12 | 1/7 | 0/0 | 1/13 | 27/84 | 6/13
Total | 5/38 | 1/2 | 4/67 | 242/766 | 22/42
Table 6.7: PCA based method: SensorScope

[Figure 6.5: Detection of long duration anomalies: PCA and wavelet based methods. (a) SensorScope 12: PCA result; (b) SensorScope 11: Wavelet result.]

To further illustrate the impact of anomalies' intensities on the accuracy of a PCA-based method, we injected anomalies (Short, Noise, and Constant) into the SensorScope 9 time series, and attempted to detect these injected faults using PCA. The data samples corrupted by Short and Noise anomalies had a higher variance as compared to the normal data, whereas, by definition, the variance of the samples corrupted by the Constant anomaly was lower than that of the normal data. We found that the PCA based method was able to detect most of the samples corrupted by SHORT and NOISE anomalies, but missed the Constant anomaly.

Apart from the intensity of an anomaly, the PCA results might also be impacted by several other factors, such as the sensitivity of PCA to its parameters and the lack of data preprocessing. For instance, [67] shows that the performance of PCA is sensitive to the number of principal components included in the normal subspace and to the threshold used for anomaly detection. We note that, in our experiments, we did vary the number of principal components in the normal subspace, and Table 6.7 depicts the best results obtained (i.e., those with only 1 principal component in the normal subspace). To our knowledge, the Q-statistic based technique that we use here is the only known method for automatically deciding the detection threshold. It is also well known that anomalies can contaminate the normal subspace and hence avoid detection by PCA [67]. One way to ameliorate this situation could be to preprocess the data to identify and remove large anomalies before applying PCA. However, in our context, defining a large anomaly would itself have required us to introduce another heuristic (with its own shortcomings). We do not pursue this (or other PCA related improvements) further, as the main goal of our PCA-based evaluation here was to illustrate that, as in the case of network anomaly detection, it is not straightforward to apply PCA to detect anomalies in sensor data. (This partly contributed to our choice of a different approach.) Of course, we do not claim that the PCA based method cannot be made more effective at detecting sensor data anomalies, but as noted, this is not the goal of our work here.
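For reference, a minimal offline sketch of the residual-subspace test evaluated above (our simplified illustration: the detection threshold is left as a free parameter rather than the Q-statistic value prescribed in [46]).

import numpy as np

def pca_residual_flags(data, n_normal, threshold):
    """data: (samples x sensors) matrix. Project each sample onto the residual subspace
    (everything outside the top n_normal principal components) and flag samples whose
    squared residual norm exceeds the threshold."""
    x = data - data.mean(axis=0)                 # center each sensor's readings
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    p = vt[:n_normal].T                          # columns span the "normal" subspace
    residual = x - x @ p @ p.T                   # remove the normal-subspace part
    spe = np.sum(residual ** 2, axis=1)          # squared prediction error per sample
    return spe > threshold

# Example: 6 SensorScope series as columns, 1 principal component in the normal subspace.
# flags = pca_residual_flags(readings, n_normal=1, threshold=q_threshold)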
To the same end, we also explore wavelet based methods for detecting sensor data anomalies. We select the method presented in [14] as a representative technique. This method first separates a time series of network data (e.g., aggregate byte counts on a link) into low, medium, and high frequency components using wavelet decomposition. To detect anomalies, first a (time varying) deviation score is computed by combining the local variance of the medium and high frequency components. Local variance is defined as the variance of the data falling within a moving time window. The deviation score is then compared against a threshold to detect anomalies.

Data Set | Change in Mean | Change in Var | Change in Shape | Short | Constant | False Positive
SensorScope 6 | 1/7 | 0/0 | 2/10 | 205/206 | 1/1 | 26
SensorScope 8 | 1/6 | 0/0 | 2/10 | 243/243 | 1/2 | 24
SensorScope 9 | 1/6 | 2/2 | 3/12 | 65/65 | 13/23 | 42
SensorScope 10 | 1/5 | 0/0 | 3/12 | 46/46 | 0/2 | 47
SensorScope 11 | 1/7 | 0/0 | 2/10 | 122/122 | 1/1 | 29
SensorScope 12 | 1/7 | 0/0 | 3/13 | 80/84 | 9/13 | 22
Total | 6/38 | 2/2 | 15/67 | 761/766 | 25/42 | 190
Table 6.8: Wavelet based method: SensorScope

We ran the wavelet based anomaly detection method described above on the same set of data that we used for evaluating the PCA based method. The results are summarized in Table 6.8. While the wavelet based method detects more anomalies than PCA, it does not perform as well as our hybrid approach at detecting long duration anomalies. In particular, it fails to detect most of the Change in Mean and Change in Shape anomalies. In our evaluation, this method also incurred a large number of false positives. One possible reason for the wavelet based method not being very effective on the SensorScope dataset could be that it looks for anomalies in the medium and high frequency components of a time series, whereas the long duration anomalies fall into the low frequency component. The top plot in Figure 6.5(b) shows a Change in Mean anomaly in the SensorScope 11 time series that is captured by the low frequency component shown in the bottom plot.
Hence, it is difficult for previously proposed wavelet based methods that mine for anomalies only in the medium and high frequency components [14, 69, 35] to detect such anomalies. These approaches could possibly be extended using heuristics that mine for anomalies in the low frequency component. We do not pursue this (or other improvements to the wavelet based approach) further because, as in the case of PCA, the main goal of our wavelet based evaluation of sensor data was to demonstrate that a straightforward application of a wavelet-based technique designed for network anomaly detection is not very effective on sensor data. Of course, we do not claim that wavelet based techniques cannot be made to work on sensor data, but as noted, that is not the goal of our work here.

Comparison with clustering based method. We compare our hybrid approach with the clustering based method in [63], which first groups the vectors of readings from several sensors into clusters of fixed width. After this, the average inter-cluster distance between a cluster and its K nearest neighboring clusters is used to determine the abnormal clusters. (Footnote 6: An SVM-based method [64] comparison is omitted as it assumes that few measurements are anomalous; thus it is unlikely to detect long duration anomalies such as the Constant anomaly in Figure 6.4.) We ran this method on SensorScope time series 6, 8, 9, 10, 11 and 12. The inputs to the clustering method are vectors of readings with the same time stamp from the 6 time series.

We found that the performance of this method depends strongly on the cluster width w, as also noted in [63]. We tried a large number of w values (from 0.03 to 8) and used the best results found to compare with our method. This method can only identify which data vectors contain anomalies, but cannot identify anomalous readings within a data vector. We determine the number of detections and false negatives as in the PCA-based method; thus, it is not possible to report false positives for individual time series. This method did incorrectly flag 18 data vectors as anomalous; the actual number of false positives is between 18 and 108. We give detailed results in Table 6.9 and note that the cluster based method performs poorly in detecting long term anomalies such as Change in Mean, Change in Shape, and Constant. Intuitively, this is because this method is designed to detect outliers and does not exploit temporal correlations within the time series. We can also see that the method has a lot of false negatives in detecting Short spikes. This makes sense, as a spike is determined by the corresponding data point's relative position to the previous and next points, but in the cluster based method all data points are considered as a whole and this temporal relationship is lost.

Data Set | Change in Mean | Change in Var | Change in Shape | Short | Constant
SensorScope 6 | 2/7 | 0/0 | 4/10 | 56/206 | 1/1
SensorScope 8 | 3/6 | 0/0 | 5/10 | 54/243 | 1/2
SensorScope 9 | 3/6 | 1/2 | 5/12 | 20/65 | 5/23
SensorScope 10 | 2/5 | 0/0 | 5/12 | 16/46 | 1/2
SensorScope 11 | 3/7 | 0/0 | 6/10 | 49/122 | 1/1
SensorScope 12 | 3/7 | 0/0 | 5/13 | 36/84 | 3/13
Total | 16/38 | 1/2 | 30/67 | 243/766 | 12/42
Table 6.9: Cluster based method: SensorScope

6.6 Conclusions

We proposed an online anomaly detection approach for sensor systems measurements.
Our approach utilized piecewise linear models of time series, which are succinct, representative, and robust, and therefore enabled us to (a) compute such models in near real-time, (b) create models without prior knowledge about the anomaly types that sensor data might contain, and (c) compare and communicate different time series efficiently. Our extensive evaluation study, using data from real sensor system deployments, illustrated that our approach is accurate, robust, and efficient. Future work includes the study of (1) dynamic setting of parameters (e.g., the linearization period and the size of the initial reference model), (2) feedback mechanisms between the local and aggregation steps of our approach, and (3) other techniques that are useful in our hybrid approach.

Chapter 7

Conclusions

In this dissertation, we provided algorithms and studied their performance in four different distributed systems: (1) network transmission systems that send data chunks to the Internet, (2) distributed data centers that provide computation services, (3) peer-to-peer systems that provide streaming to end users, and (4) wireless sensing systems that help monitor the environment.

For the network transmission system, we provided offline scheduling algorithms aimed at reducing the 95th percentile of the number of jobs served, which is the basis of bandwidth cost accounting. We also showed performance bounds on the online problem for reducing the 95th percentile, as well as the maximum.

We also designed a job scheduling and server management framework that reduces power cost in distributed data centers. Our framework establishes the trade-off between power cost and mean service delay. We showed that it can achieve a significant reduction in power cost as well as power usage.

We studied structured peer-to-peer streaming systems and presented algorithms for structure construction as well as packet scheduling. These algorithms allowed us to derive provable QoS guarantees for streaming.

Finally, in the context of wireless sensor systems, we designed an algorithm for capturing anomalies in measurements collected by the sensors. Through evaluation against real-world traces, we showed that our approach is accurate, robust, and resource efficient.

References

[1] The california independent system operator. http://www.caiso.com.
[2] Disaster recovery by google. http://googleenterprise.blogspot.com/2010/03/disaster-recovery-by-google.html.
[3] The federal energy regulatory commission. www.ferc.gov.
[4] Google server number. www.gizmodo.com/5517041/.
[5] Google's green data centers. www.google.com/corporate/green/.
[6] The life-under-your-feet project. http://www.lifeunderyourfeet.org/.
[7] Sensorscope: Sensor networks for environmental monitoring. http://sensorscope.epfl.ch/.
[8] Unused servers survey results analysis. http://www.thegreengrid.org/.
[9] Wikipedia hit records. http://dammit.lt/wikistats/.
[10] S. Albers. Energy-efficient algorithms. Commun. ACM, 53:86-96, May 2010.
[11] G. Ananthanarayanan, S. Kandula, A. Greenberg, I. Stoica, Y. Lu, B. Saha, and E. Harris. Reining in the outliers in map-reduce clusters using mantri. In OSDI, 2010.
[12] D. Arthur and R. Panigrahy. Analyzing the efficiency of bittorrent and related peer-to-peer networks. In ACM-SIAM SODA, 2006.
[13] Nikhil Bansal and Kirk Pruhs. Speed scaling to manage temperature. In STACS '05.
[14] P. Barford, J. Kline, D. Plonka, and A. Ron. A signal analysis of network traffic anomalies.
Appendix

Proof of Lemma 1

Here we prove Lemma 1.

Proof. Let $t = kT$ for some $k \in \mathbb{Z}^{+}$, and consider any $\tau \in [t, \ldots, t+T-1]$.
Squaring both sides of the queueing dynamic (4.1) and using the fact that for any $x \in \mathbb{R}$, $(\max[x,0])^2 \le x^2$, we have:
\[
[Q^F_i(\tau+1)]^2 \le [Q^F_i(\tau)]^2 + \Big[\sum_j \mu_{ij}(\tau)\Big]^2 + [A_i(\tau)]^2 - 2 Q^F_i(\tau)\Big[\sum_j \mu_{ij}(\tau) - A_i(\tau)\Big].
\]
Summing the above over $i = 1, \ldots, M$ and using the facts that $\sum_j \mu_{ij}(\tau) \le \mu_{\max}$ and $A_i(\tau) \le A_{\max}$, we have:
\[
\sum_i \Big( [Q^F_i(\tau+1)]^2 - [Q^F_i(\tau)]^2 \Big) \le M(\mu_{\max}^2 + A_{\max}^2) - \sum_{i=1}^{M} 2 Q^F_i(\tau)\Big[\sum_j \mu_{ij}(\tau) - A_i(\tau)\Big]. \tag{7.1}
\]
Repeating the above steps for the back-end backlogs, we get:
\[
\sum_j \Big( [Q^B_j(\tau+1)]^2 - [Q^B_j(\tau)]^2 \Big) \le \sum_j N_j^2 b_{\max}^2 + M^2 \mu_{\max}^2 - \sum_{j=1}^{M} 2 Q^B_j(\tau)\Big[N_j(t)\, b_j(\tau) - \sum_i \mu_{ij}(\tau)\Big]. \tag{7.2}
\]
Note that in (7.2) the number of servers is $N_j(t)$ because it is determined once every $T$ slots. Now multiply (7.1) and (7.2) by $1/2$, sum them together, and take expectations over $A_i(\tau)$ and $p_i(\tau)$, $i = 1, \ldots, M$, $\tau \in [t, \ldots, t+T-1]$, conditioning on $Q(t)$, to get:
\[
\Delta_1(\tau) \le B_1 - \mathbb{E}\Big\{ \sum_i Q^F_i(\tau)\Big[\sum_j \mu_{ij}(\tau) - A_i(\tau)\Big] \,\Big|\, Q(t) \Big\} - \mathbb{E}\Big\{ \sum_j Q^B_j(\tau)\Big[N_j(t)\, b_j(\tau) - \sum_i \mu_{ij}(\tau)\Big] \,\Big|\, Q(t) \Big\}. \tag{7.3}
\]
Here $\Delta_1(\tau)$ is the expected change in the Lyapunov function from time slot $\tau$ to $\tau+1$, and $B_1 = M A_{\max}^2 + (M^2 + M)\mu_{\max}^2 + \sum_j N_j^2 b_{\max}^2$. Summing (7.3) over $\tau = t, t+1, \ldots, t+T-1$ and using the definition of $\Delta_T(t)$, we have:
\[
\Delta_T(t) \le B_1 T - \mathbb{E}\Big\{ \sum_{\tau=t}^{t+T-1} \sum_i Q^F_i(\tau)\Big[\sum_j \mu_{ij}(\tau) - A_i(\tau)\Big] \,\Big|\, Q(t) \Big\} - \mathbb{E}\Big\{ \sum_{\tau=t}^{t+T-1} \sum_j Q^B_j(\tau)\Big[N_j(t)\, b_j(\tau) - \sum_i \mu_{ij}(\tau)\Big] \,\Big|\, Q(t) \Big\}.
\]
Now adding to both sides the power cost over the frame, i.e., the term $V \mathbb{E}\{ \sum_{\tau=t}^{t+T-1} \sum_{j=1}^{M} f_j(\tau) \mid Q(t) \}$, proves the lemma. (A numerical sanity check of the one-step bound above is sketched at the end of this appendix.)

Proof of Lemma 2

We now prove Lemma 2.

Proof. Recall that we have:
\begin{align*}
\Delta_T(t) + V \mathbb{E}\Big\{ \sum_{\tau=t}^{t+T-1} \sum_j f_j(\tau) \,\Big|\, Q(t) \Big\} \le{}& B_1 T + V \mathbb{E}\Big\{ \sum_{\tau=t}^{t+T-1} \sum_j f_j(\tau) \,\Big|\, Q(t) \Big\} \\
& - \mathbb{E}\Big\{ \sum_{\tau=t}^{t+T-1} \sum_i Q^F_i(\tau)\Big[\sum_j \mu_{ij}(\tau) - A_i(\tau)\Big] \,\Big|\, Q(t) \Big\} \\
& - \mathbb{E}\Big\{ \sum_{\tau=t}^{t+T-1} \sum_j Q^B_j(\tau)\Big[N_j(t)\, b_j(\tau) - \sum_i \mu_{ij}(\tau)\Big] \,\Big|\, Q(t) \Big\}. \tag{7.4}
\end{align*}
Now by (4.1) and (4.2), for any $\tau \in [t, t+T-1]$ we have:
\[
Q^F_i(t) - (\tau - t)\mu_{\max} \le Q^F_i(\tau) \le Q^F_i(t) + (\tau - t)A_{\max},
\]
\[
Q^B_j(t) - (\tau - t)N_j b_{\max} \le Q^B_j(\tau) \le Q^B_j(t) + (\tau - t)M \mu_{\max}.
\]
Therefore,
\begin{align*}
\sum_{\tau=t}^{t+T-1} \sum_j Q^B_j(\tau)\Big[N_j(t)\, b_j(\tau) - \sum_i \mu_{ij}(\tau)\Big]
\ge{}& \sum_{\tau=t}^{t+T-1} \sum_j \big[Q^B_j(t) - (\tau - t)N_j b_{\max}\big] N_j(t)\, b_j(\tau) \\
& - \sum_{\tau=t}^{t+T-1} \sum_j \big[Q^B_j(t) + (\tau - t)M \mu_{\max}\big] \sum_i \mu_{ij}(\tau) \\
\ge{}& \sum_{\tau=t}^{t+T-1} \sum_j Q^B_j(t)\Big[N_j(t)\, b_j(\tau) - \sum_i \mu_{ij}(\tau)\Big] - \sum_{\tau=t}^{t+T-1} \sum_j (\tau - t)\big[N_j^2 b_{\max}^2 + M^2 \mu_{\max}^2\big] \\
\ge{}& \sum_{\tau=t}^{t+T-1} \sum_j Q^B_j(t)\Big[N_j(t)\, b_j(\tau) - \sum_i \mu_{ij}(\tau)\Big] - T(T-1)\sum_j \big[N_j^2 b_{\max}^2 + M^2 \mu_{\max}^2\big]/2. \tag{7.5}
\end{align*}
Similarly, we have:
\begin{align*}
\sum_{\tau=t}^{t+T-1} \sum_i Q^F_i(\tau)\Big[\sum_j \mu_{ij}(\tau) - A_i(\tau)\Big]
\ge{}& \sum_{\tau=t}^{t+T-1} \sum_i \big[Q^F_i(t) - (\tau - t)\mu_{\max}\big]\Big[\sum_j \mu_{ij}(\tau)\Big] - \sum_{\tau=t}^{t+T-1} \sum_i \big[Q^F_i(t) + (\tau - t)A_{\max}\big] A_i(\tau) \\
\ge{}& \sum_{\tau=t}^{t+T-1} \sum_i Q^F_i(t)\Big[\sum_j \mu_{ij}(\tau) - A_i(\tau)\Big] - T(T-1)M\big[\mu_{\max}^2 + A_{\max}^2\big]/2. \tag{7.6}
\end{align*}
Therefore, defining $B_2 = B_1 + (T-1)\sum_j \big[N_j^2 b_{\max}^2 + (M^2 + M)\mu_{\max}^2\big]/2 + (T-1)M A_{\max}^2/2 = (T+1)B_1/2$ and using (7.5) and (7.6) in (7.4), we get:
\begin{align*}
\Delta_T(t) + V \mathbb{E}\Big\{ \sum_{\tau=t}^{t+T-1} \sum_j f_j(\tau) \,\Big|\, Q(t) \Big\} \le{}& B_2 T + V \mathbb{E}\Big\{ \sum_{\tau=t}^{t+T-1} \sum_j f_j(\tau) \,\Big|\, Q(t) \Big\} \\
& - \mathbb{E}\Big\{ \sum_{\tau=t}^{t+T-1} \sum_i Q^F_i(t)\Big[\sum_j \mu_{ij}(\tau) - A_i(\tau)\Big] \,\Big|\, Q(t) \Big\} \\
& - \mathbb{E}\Big\{ \sum_{\tau=t}^{t+T-1} \sum_j Q^B_j(t)\Big[N_j(t)\, b_j(\tau) - \sum_i \mu_{ij}(\tau)\Big] \,\Big|\, Q(t) \Big\}.
\end{align*}
This completes the proof.

Proof of Theorem 5

We prove Theorem 5 here.

Proof. Let $t = kT$ for some $k \in \{0, 1, \ldots\}$. Using Lemma 2 and the fact that SSA is constructed to minimize the RHS of (4.12), we have:
\begin{align*}
\Delta_T(t) + V \mathbb{E}\Big\{ \sum_{\tau=t}^{t+T-1} \sum_j f^{SSA}_j(\tau) \,\Big|\, Q(t) \Big\} \le{}& B_2 T + V \mathbb{E}\Big\{ \sum_{\tau=t}^{t+T-1} \sum_j f^{alt}_j(\tau) \,\Big|\, Q(t) \Big\} \\
& - \mathbb{E}\Big\{ \sum_{\tau=t}^{t+T-1} \sum_i Q^F_i(t)\Big[\sum_j \mu^{alt}_{ij}(\tau) - A_i(\tau)\Big] \,\Big|\, Q(t) \Big\} \\
& - \mathbb{E}\Big\{ \sum_{\tau=t}^{t+T-1} \sum_j Q^B_j(t)\Big[N^{alt}_j(t)\, b^{alt}_j(\tau) - \sum_i \mu^{alt}_{ij}(\tau)\Big] \,\Big|\, Q(t) \Big\}. \tag{7.7}
\end{align*}
Here $alt$ denotes any alternative policy that can be implemented over the frame from $t$ to $t+T-1$.
Now since $\lambda + \epsilon \mathbf{1} \in \Lambda$, it can be shown using Theorem 4 that there exists a stationary and randomized policy $\Pi'_{opt}$ that achieves the following:
\[
\sum_{\tau=kT}^{kT+T-1} \sum_{j=1}^{M} \mathbb{E}\big\{ f^{\Pi'_{opt}}_j(\tau) \big\} = T f^{*}_{av}(\lambda + \epsilon \mathbf{1}), \tag{7.8}
\]
\[
\sum_{\tau=kT}^{kT+T-1} \mathbb{E}\Big\{ \sum_i \mu^{\Pi'_{opt}}_{ij}(\tau) \Big\} = \sum_{\tau=kT}^{kT+T-1} \mathbb{E}\Big\{ N^{\Pi'_{opt}}_j(\tau)\, b^{\Pi'_{opt}}_j(\tau) \Big\} - \epsilon, \quad \forall\, j, \tag{7.9}
\]
\[
\sum_{\tau=kT}^{kT+T-1} \mathbb{E}\big\{ A_i(\tau) \big\} = \sum_{\tau=kT}^{kT+T-1} \mathbb{E}\Big\{ \sum_j \mu^{\Pi'_{opt}}_{ij}(\tau) \Big\} - \epsilon, \quad \forall\, i. \tag{7.10}
\]
Here $f^{*}_{av}(\lambda + \epsilon \mathbf{1})$ is the minimum cost corresponding to the rate vector $\lambda + \epsilon \mathbf{1}$. Plugging (7.8), (7.9) and (7.10) into (7.7), we get:
\[
\Delta_T(t) + V \mathbb{E}\Big\{ \sum_{\tau=t}^{t+T-1} \sum_j f^{SSA}_j(\tau) \,\Big|\, Q(t) \Big\} \le B_2 T + V T f^{*}_{av}(\lambda + \epsilon \mathbf{1}) - \epsilon T \sum_i Q^F_i(t) - \epsilon T \sum_j Q^B_j(t).
\]
Now we can take expectations on both sides over $Q(t)$ to get:
\[
\mathbb{E}\big\{ L(t+T) - L(t) \big\} + V \mathbb{E}\Big\{ \sum_{\tau=t}^{t+T-1} \sum_j f^{SSA}_j(\tau) \Big\} \le B_2 T + V T f^{*}_{av}(\lambda + \epsilon \mathbf{1}) - \epsilon T\, \mathbb{E}\Big\{ \sum_i Q^F_i(t) \Big\} - \epsilon T\, \mathbb{E}\Big\{ \sum_j Q^B_j(t) \Big\}. \tag{7.11}
\]
Rearranging the terms and using the fact that $0 \le f^{*}_{av}(\lambda + \epsilon \mathbf{1}) \le f_{\max}$, we get:
\[
\mathbb{E}\big\{ L(t+T) - L(t) \big\} + \epsilon T \Big( \mathbb{E}\Big\{ \sum_i Q^F_i(t) \Big\} + \mathbb{E}\Big\{ \sum_j Q^B_j(t) \Big\} \Big) \le B_2 T + V T f_{\max}.
\]
Summing the above over $t = kT$, $k = 0, 1, \ldots, K-1$, rearranging the terms, using the fact that $L(t) \ge 0$ for all $t$, and dividing both sides by $\epsilon K T$, we have:
\[
\frac{1}{K} \sum_{k=0}^{K-1} \Big( \mathbb{E}\Big\{ \sum_i Q^F_i(kT) \Big\} + \mathbb{E}\Big\{ \sum_j Q^B_j(kT) \Big\} \Big) \le \frac{B_2 + V f_{\max}}{\epsilon}.
\]
Taking a $\limsup$ as $K \to \infty$ proves (4.17). To prove (4.14), using (7.11), we have:
\[
V \mathbb{E}\Big\{ \sum_{\tau=t}^{t+T-1} \sum_j f^{SSA}_j(\tau) \Big\} \le B_2 T + V T f^{*}_{av}(\lambda + \epsilon \mathbf{1}).
\]
Summing the above over $t = kT$, $k = 0, \ldots, K-1$, and dividing both sides by $KTV$, we have:
\[
\frac{1}{KT}\, \mathbb{E}\Big\{ \sum_{\tau=0}^{KT-1} \sum_j f^{SSA}_j(\tau) \Big\} \le f^{*}_{av}(\lambda + \epsilon \mathbf{1}) + \frac{B_2}{V}.
\]
Now (4.14) follows by taking a $\limsup$ as $K \to \infty$, using Lebesgue's dominated convergence theorem, and then letting $\epsilon \to 0$.

Proof of Theorem 6

Proof. It suffices to show that, using the weights $\hat{Q}^F_i(t)$ and $\hat{Q}^B_j(t)$, we still minimize the right-hand side of (7.7) to within some additive constant. To see this, suppose $\hat{Q}^F_i(t)$ and $\hat{Q}^B_j(t)$ are used to carry out the SAVE algorithm; then we try to minimize:
\begin{align*}
\mathrm{Obj}(\hat{Q}(t)) \triangleq{}& V \mathbb{E}\Big\{ \sum_{\tau=t}^{t+T-1} \sum_j f_j(\tau) \,\Big|\, Q(t) \Big\} - \mathbb{E}\Big\{ \sum_{\tau=t}^{t+T-1} \sum_i \hat{Q}^F_i(t)\Big[\sum_j \mu_{ij}(\tau) - A_i(\tau)\Big] \,\Big|\, Q(t) \Big\} \\
& - \mathbb{E}\Big\{ \sum_{\tau=t}^{t+T-1} \sum_j \hat{Q}^B_j(t)\Big[N_j(t)\, b_j(\tau) - \sum_i \mu_{ij}(\tau)\Big] \,\Big|\, Q(t) \Big\}.
\end{align*}
Denote $e^F_i(t) = \hat{Q}^F_i(t) - Q^F_i(t)$ and $e^B_j(t) = \hat{Q}^B_j(t) - Q^B_j(t)$, and rewrite the above as:
\begin{align*}
\mathrm{Obj}(\hat{Q}(t)) ={}& V \mathbb{E}\Big\{ \sum_{\tau=t}^{t+T-1} \sum_j f_j(\tau) \,\Big|\, Q(t) \Big\} - \mathbb{E}\Big\{ \sum_{\tau=t}^{t+T-1} \sum_i Q^F_i(t)\Big[\sum_j \mu_{ij}(\tau) - A_i(\tau)\Big] \,\Big|\, Q(t) \Big\} \\
& - \mathbb{E}\Big\{ \sum_{\tau=t}^{t+T-1} \sum_j Q^B_j(t)\Big[N_j(t)\, b_j(\tau) - \sum_i \mu_{ij}(\tau)\Big] \,\Big|\, Q(t) \Big\} \\
& - \mathbb{E}\Big\{ \sum_{\tau=t}^{t+T-1} \sum_i e^F_i(t)\Big[\sum_j \mu_{ij}(\tau) - A_i(\tau)\Big] \,\Big|\, Q(t) \Big\} \\
& - \mathbb{E}\Big\{ \sum_{\tau=t}^{t+T-1} \sum_j e^B_j(t)\Big[N_j(t)\, b_j(\tau) - \sum_i \mu_{ij}(\tau)\Big] \,\Big|\, Q(t) \Big\}. \tag{7.12}
\end{align*}
Now denote the minimum value of $\mathrm{Obj}(Q(t))$ by $\mathrm{Obj}^{*}$, i.e., the minimum of the RHS of (7.7) with $Q(t)$, and the minimum value of $\mathrm{Obj}(\hat{Q}(t))$ by $\mathrm{Obj}^{\dagger}$. Then using the fact that $|e^F_i(t)|, |e^B_j(t)| \le c_e$ for all $t$, and the facts that $|\sum_j \mu_{ij}(\tau) - A_i(\tau)| \le \mu_{\max} + A_{\max}$ and $|N_j(t)\, b_j(\tau) - \sum_i \mu_{ij}(\tau)| \le N_{\max} b_{\max} + M \mu_{\max}$, we see that
\[
\mathrm{Obj}^{\dagger} \le \mathrm{Obj}^{*} - \mathbb{E}\Big\{ \sum_{\tau=t}^{t+T-1} \sum_i e^F_i(t)\Big[\sum_j \mu_{ij}(\tau) - A_i(\tau)\Big] \,\Big|\, Q(t) \Big\} - \mathbb{E}\Big\{ \sum_{\tau=t}^{t+T-1} \sum_j e^B_j(t)\Big[N_j(t)\, b_j(\tau) - \sum_i \mu_{ij}(\tau)\Big] \,\Big|\, Q(t) \Big\}.
\]
Using this and (7.12), we see that:
\begin{align*}
V \mathbb{E}\Big\{ \sum_{\tau=t}^{t+T-1} \sum_j f^{\dagger}_j(\tau) \,\Big|\, Q(t) \Big\} &- \mathbb{E}\Big\{ \sum_{\tau=t}^{t+T-1} \sum_i Q^F_i(t)\Big[\sum_j \mu^{\dagger}_{ij}(\tau) - A_i(\tau)\Big] \,\Big|\, Q(t) \Big\} \\
&- \mathbb{E}\Big\{ \sum_{\tau=t}^{t+T-1} \sum_j Q^B_j(t)\Big[N^{\dagger}_j(t)\, b^{\dagger}_j(\tau) - \sum_i \mu^{\dagger}_{ij}(\tau)\Big] \,\Big|\, Q(t) \Big\} \\
&\le \mathrm{Obj}^{*} + 2 T c_e (M \mu_{\max} + M A_{\max} + N_{\max} b_{\max} + M \mu_{\max}).
\end{align*}
Here $f^{\dagger}_j(\tau)$, $\mu^{\dagger}_{ij}(\tau)$, $N^{\dagger}_j(t)$ and $b^{\dagger}_j(\tau)$ are the actions taken by the policy based on $\hat{Q}(t)$. This shows that (7.7) holds with $Q(t)$ replaced by $\hat{Q}(t)$ and $B_2$ replaced by $B_3 = B_2 T + 2 T c_e (M \mu_{\max} + M A_{\max} + N_{\max} b_{\max} + M \mu_{\max})$. The rest of the proof follows similarly to the proof of Theorem 5, which completes the proof.
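The drift arguments above all start from a one-step bound obtained by squaring the queue update and using $(\max[x,0])^2 \le x^2$. As a minimal numerical sanity check (not part of the dissertation), the sketch below assumes the standard update $Q(\tau+1) = \max[Q(\tau) - \mu(\tau), 0] + A(\tau)$ (the exact dynamics are given by (4.1) in the main text) and verifies the per-slot inequality $[Q(\tau+1)]^2 - [Q(\tau)]^2 \le \mu(\tau)^2 + A(\tau)^2 - 2Q(\tau)[\mu(\tau) - A(\tau)]$ on randomly sampled states; all numeric parameters are illustrative.

```python
# Illustrative sanity check of the one-step drift inequality used in the
# proof of Lemma 1. Assumes the update Q(t+1) = max[Q(t) - mu(t), 0] + A(t);
# the actual system dynamics are defined by (4.1) in the main text.
import random

def one_step_bound_holds(q, mu, a, tol=1e-9):
    """Check Q(t+1)^2 - Q(t)^2 <= mu^2 + a^2 - 2*Q(t)*(mu - a)."""
    q_next = max(q - mu, 0.0) + a
    lhs = q_next ** 2 - q ** 2
    rhs = mu ** 2 + a ** 2 - 2.0 * q * (mu - a)
    return lhs <= rhs + tol

if __name__ == "__main__":
    random.seed(1)
    mu_max, a_max = 5.0, 3.0  # hypothetical service and arrival caps
    ok = all(
        one_step_bound_holds(
            random.uniform(0.0, 100.0),   # current backlog Q(t)
            random.uniform(0.0, mu_max),  # service offered in the slot
            random.uniform(0.0, a_max),   # arrivals in the slot
        )
        for _ in range(100000)
    )
    print("one-step drift bound held on all sampled states:", ok)
```

Summing such per-queue, per-slot bounds over a frame of $T$ slots and taking conditional expectations is exactly how the frame-based drift $\Delta_T(t)$ is bounded in the proof of Lemma 1 above.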