ADAPTIVE AND RESILIENT STREAM PROCESSING ON CLOUD INFRASTRUCTURE

by Alok Gautam Kumbhare

A Dissertation Presented to the FACULTY OF THE USC GRADUATE SCHOOL, UNIVERSITY OF SOUTHERN CALIFORNIA, in Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (COMPUTER SCIENCE)

May 2016

Copyright 2016 Alok Gautam Kumbhare

Dedication

To Dr. B. R. Ambedkar, who has shaped the lives of many. To my parents, for their sacrifices, support and encouragement. In memory of my beloved cousin.

Acknowledgments

I would like to express sincere gratitude to my advisor, Prof. Viktor K. Prasanna, for his continuous guidance and support through my Ph.D. years. His guidance and encouragement have helped me remain focused and motivated. I am also grateful to Prof. Cauligi Raghavendra and Prof. Aiichiro Nakano for serving on my qualifier and thesis committee and for their guidance. My sincere thanks also go to Prof. Yogesh Simmhan for mentoring me during the initial years of my Ph.D. and helping me define my research path. I wish to thank Dr. Marc Frincu for his guidance and collaboration on several research projects and publications. Their guidance has been invaluable in my research career. I would also like to thank all the members of our group at USC, especially Dr. Vickram Sorathia, Dr. Charalampos Chelmis and Charith Wickarammarachi, for stimulating discussions and their help on numerous occasions. I would like to thank my family and friends for their love and affection; their encouragement and confidence have driven me towards success. Finally, I would like to thank my wife for her unwavering support and unlimited understanding as a friend and companion during my Ph.D. years.

Contents

Dedication
Acknowledgments
List of Tables
List of Figures
Abstract

1 Introduction
  1.1 Introduction
  1.2 Data Dynamism
  1.3 Infrastructure Dynamism
  1.4 Domain Dynamism
    1.4.1 Motivating Scenario - Smart Power Grids
  1.5 Research Contributions

2 Background and Preliminaries
  2.1 Stream Processing Applications Composition Model: Continuous Dataflows
  2.2 Scalable Streaming Application Deployment
  2.3 Cloud Infrastructure Model
  2.4 Summary

3 Dynamic Dataflow Applications
  3.1 Runtime Adaptations through Application Re-composition
  3.2 Conceptual Model for Dynamic Dataflows
    3.2.1 Dynamic Continuous Dataflow
    3.2.2 Dynamic PEs
    3.2.3 Dynamic Edges
    3.2.4 Types of Guard Policies
  3.3 Dynamic Dataflow Execution Model
  3.4 Dynamic Dataflow Programming Constructs
    3.4.1 Primitive Stream Constructs
    3.4.2 Dynamic Dataflow Constructs
    3.4.3 Policy Specifications
  3.5 System Architecture and Implementation
  3.6 Evaluation
  3.7 Related Work
    3.7.1 Continuous dataflows and stream processing systems
    3.7.2 Flexible workflows and SOA
    3.7.3 Heterogeneous computing
  3.8 Summary

4 Reactive Scheduling Heuristics
  4.1 Introduction
  4.2 Application and Infrastructure Models
  4.3 Cloud Infrastructure Model
  4.4 Deployment and Adaptation Approach
  4.5 Metrics for Quality of Service
  4.6 Problem Formulation
  4.7 Genetic Algorithm-based Scheduling Heuristics
  4.8 Greedy Scheduling Heuristics
    4.8.1 Initial Deployment Heuristic
    4.8.2 Runtime Adaptation Heuristic
  4.9 Evaluation
    4.9.1 Linear Road Benchmark (LRB)
    4.9.2 Results
  4.10 Related Work
  4.11 Summary

5 Predictive Lookahead Scheduling Heuristics
  5.1 Introduction
  5.2 Problem Formulation
  5.3 Scheduling Heuristics
    5.3.1 Predictive Look-Ahead Scheduling (PLAStiCC)
    5.3.2 Averaging Models with Reactive Scheduling
  5.4 Evaluation
    5.4.1 Results
  5.5 Related Work
  5.6 Summary

6 Elasticity and Fault Tolerance
  6.1 Introduction
  6.2 Background
  6.3 Goals of Proposed System
  6.4 Proposed System Architecture
    6.4.1 Dynamic Re-Mapping of Keys
    6.4.2 Peer-Backup of Tuples
    6.4.3 Reducer State Model and Incremental Peer-Checkpointing
    6.4.4 Tuple Ordering and Eviction Policy
  6.5 Adaptive Load Balancing and Elasticity
  6.6 Fault-Tolerance
  6.7 Evaluation
    6.7.1 Empirical Results
  6.8 Related Work
  6.9 Summary
7 Conclusions
  7.1 Contributions
  7.2 Broader Impact
  7.3 Future Work

Reference List

List of Tables

4.1 Functions used in Initial Deployment Strategies
4.2 Effect of variability of infrastructure performance and input data rate on relative output throughput using different scheduling algorithms. Static LRB deployment with an average input rate of 50 msgs/sec, Ω̂ = 0.8.

List of Figures

1.1 Cumulative Distribution Function (CDF) of normalized CPU core performance, across VMs and over time.
1.2 CDF of performance variation relative to the previous measurement time-slot.
1.3 Box-and-whiskers plot of the normalized performance distribution over time (Y axis) for different Amazon AWS VMs (X axis).
1.4 CDF of variation in normalized latency measures relative to the previous measurement time-slot.
1.5 CDF of variation in normalized connect time measures relative to the previous measurement time-slot.
1.6 CDF of variation in normalized bandwidth measures relative to the previous measurement time-slot.
1.7 Reduced electricity consumption during a Demand Response (DR) event.
1.8 Simplified Demand Response optimization process in Smart Grids.
1.9 Forecast module with different implementations.
2.1 Subset of dataflow patterns supported in Floe.
2.2 (a) A continuous dataflow application with four PEs, (b) a concrete dynamic dataflow with resource requirements determined, (c) a distributed deployment of the concrete dataflow on VMs.
3.1 Dynamic dataflow execution model.
3.2 Sample generated dataflow.
3.3 Dynamic dataflow constructs.
3.4 Dynamic dataflows generated by the dataflow compiler.
3.5 System architecture.
3.6 System architecture.
3.7 CDF of the MAPE values for all methods.
4.1 Sample declarative representation of a dynamic dataflow using XML. The equivalent visual representation is in Figure 4.2(a).
4.2 (a) A sample dynamic dataflow. (b) Dataflow with selected alternates (e_1^1, e_2^2) and their initial core requirements. (c) A deployment of the dataflow onto VMs.
4.3 A sample linear function for the trade-off between cost (C) and value (Γ). The line denotes break-even, with slope σ.
4.4 Sample iteration of the GA heuristic.
4.5 Dynamic continuous dataflow for LRB. Alternates for P_2 and P_7 have value (γ), cost (c) and selectivity (s).
4.6 Effect of infrastructure and data rate variability on static deployment and runtime adaptation, as the input data rate rises.
4.7 Algorithm scalability (a, b) and advantage of alternates (c).
5.1 The (a) reactive and (b) look-ahead run-time optimization problems defined over the optimization period T.
5.2 Sample dataflow pattern used for evaluation.
5.3 Performance of scheduling for a perfect, oracle prediction model, as the prediction interval horizon (in secs) varies on the X axis.
5.4 Overall profit of PLAStiCC for different prediction error percentages, with the prediction interval horizon (in secs) on the X axis.
5.5 Performance of scheduling with realistic prediction models, with different prediction error percentages on the X axis. Plots differ in prediction bias.
6.1 Stateful streaming MapReduce word count example.
6.2 Distributed SMR execution model.
6.3 Reducer node components.
6.4 Consistent hashing examples with 4 reducers.
6.5 State representation with master state and partial state fragments.
6.6 Load balancing example.
6.7 Scaling out example.
6.8 Scaling in example.
6.9 Reducer peer ring with two fault zones: before and after failure.
6.10 Achieved peak throughput for different numbers of reducers.
6.11 Throughput and latency characteristics for load balance and scale-out.
6.12 Relative state size for incremental checkpointing.
6.13 Fault tolerance throughput and latency characteristics.

Abstract

Ubiquitous deployment of physical and virtual sensors, coupled with a tremendous increase in the number of connected devices, has led to an explosion of data, not only in terms of the volume but also the velocity at which it is generated. Fast analysis of such high-velocity data streams has become critical in several domains to derive real-time insights and take advance measures in response. Examples include real-time log analysis, real-time pattern analysis and fraud detection, trend analysis and social network modeling, as well as mission-critical systems such as smart cyber-physical systems, including smart power grids, smart transportation, and smart oilfields. These systems rely heavily on high-velocity stream processing and real-time data analytics to monitor, analyze, and control the underlying system. As a result, the need to analyze such high-velocity data in real time forms one of the core dimensions of "Big Data" analytics.

Like batch-oriented high-volume data analytics, high-velocity applications demand extremely scalable execution frameworks. In addition, however, these applications exhibit unique characteristics that demand specialized programming models and execution frameworks. First, due to the continuous, long-running nature of the applications and their low processing latency requirements, they are more prone to infrastructure performance variations and failures (infrastructure dynamism), especially on public cloud platforms. This demands adaptation to performance variations and high tolerance to failures, with fast recovery and minimum downtime. Second, unlike batch-oriented systems, the expected workload (i.e.
the data rate) is not known at deployment time and can vary significantly over the application lifetime (data dynamism), which requires not only a highly scalable framework but also the ability to elastically scale up and down at run-time based on the observed load. Finally, the most distinct characteristic, which we call domain dynamism, arises from the dynamic environment in which the applications run as well as from variations in domain requirements and in the quality of results achieved by the data analytic algorithms over time. This degrades the overall value achieved from the application and requires autonomic adaptations and dynamic re-composition to account for such domain dynamism.

To address these unique requirements, we propose a dynamic dataflow application model that inherently supports flexible stream processing applications which can adapt to the variations observed in domain requirements and the application value over time. To reduce development complexity and provide robust execution, we propose a run-time framework that decouples the data plane, which processes high-velocity data streams, from the control plane, which determines the adaptation strategies governing the application. The proposed model and execution framework provide a powerful tool to develop, deploy and execute long-running, dynamic stream processing applications with minimum overhead and downtime, and promote the notion of "value-driven execution" through continuous monitoring and adaptation to achieve the best value from the application.

Further, to address data and infrastructure dynamism, we propose several scheduling and resource mapping heuristics for deploying dynamic dataflows on public clouds that take advantage of their elasticity and pay-as-you-go cost model. The heuristics use a combination of dynamic application re-composition and run-time elastic scaling to achieve the desired quality of service (QoS) and to balance application value against resource cost. Finally, to ensure uninterrupted execution in the presence of infrastructure failures, we propose a novel integrated approach to efficient run-time elasticity, fault tolerance and load balancing for stateless and stateful processors that enables seamless elasticity and provides high fault tolerance with sub-second recovery latency, even in the presence of multiple simultaneous failures, and hence offers stronger resilience and service guarantees.

The novel dynamic dataflow application model, the scheduling and resource mapping algorithms, and the efficient fault-tolerance mechanisms together enable adaptive, resilient stream processing and hence allow users to develop high-velocity, mission-critical applications on public clouds.

Chapter 1
Introduction

In this chapter, we provide the motivation for and outline the problems that this thesis attempts to address. We briefly describe the research contributions along with the outline of this thesis.

1.1 Introduction

As the deployment of physical and virtual sensors becomes ubiquitous, the amount of data being generated and the rate at which it is generated are growing rapidly [32]. Several domains require fast analysis of these high-velocity data streams to derive real-time insights from the data, for example real-time log analysis [18], fraud detection [96, 78], trend analysis and social network modeling [46], and real-time threat detection [66].
As a result, several scalable distributed stream processing systems [81, 88, 112, 135] have emerged which enable near-real-time analysis of these high-velocity data streams by exploiting data parallelism and distributed processing, scaling to thousands of machines.

The availability of these real-time data streams and computational frameworks, coupled with advances in machine learning algorithms, has led to the emergence of information-centric fields such as smart cyber-physical infrastructures, which rely heavily on real-time data analytics [108, 20]. A common trait of such smart infrastructure is a wide deployment of sensors and actuators to monitor the physical environment, analyze the data using cyber infrastructure, and actively control the physical infrastructure. Here, streaming pipelines are used to collect and integrate data across different data stream sources; perform real-time analytics such as pattern detection and predictions using machine learning models; and trigger actions in response to the insights obtained from this analysis. Such applications perform closed-loop "observe, orient, decide, and act" cycles [45, 104], as in demand response optimizations in smart grids [108], traffic prediction and control in transportation networks [12, 36], and smart oil fields [115]. Smart infrastructure thus offers exemplar streaming applications, and various stream processing systems have been demonstrated for use in smart infrastructure, such as intelligent transportation using IBM Infosphere Streams (System S) [20] and our own work on Smart Power Grids [109], which uses the Floe [112] continuous dataflow engine.

Like batch-oriented high-volume data analytics [130, 134, 76, 34], high-velocity applications demand extremely scalable execution frameworks. Such high-velocity stream processing and mission-critical systems exhibit the following unique characteristics. First, due to the continuous, long-running nature of the applications and the low-latency (near-real-time) processing requirements, they are more prone to infrastructure performance variations and failures (infrastructure dynamism), especially on public cloud platforms. This demands application deployments that can adapt to variations in performance and requires high tolerance to failures, with minimum downtime and fast recovery. Second, unlike batch-oriented systems, the expected workload (i.e. the data rate) is not known at deployment time and can vary significantly over the application lifetime (data dynamism), which requires not only a highly scalable framework but also the ability to elastically scale up and down at run-time based on the observed load. Finally, the most distinct characteristic, which we call domain dynamism, is due to the dynamic environment in which the applications run as well as the time-varying domain requirements and variations in the quality of results achieved by the data analytic algorithms over time [42]. This leads to degradation in the overall value achieved from the application over time and requires autonomic adaptations and dynamic re-composition to account for such domain dynamism.

The combination of data, infrastructure and domain dynamism poses significant challenges for high-velocity data processing. We describe these dimensions of dynamism and their potential impact on the execution of such high-velocity applications below.
1.2 Data Dynamism

Data dynamism refers to the variations observed in the characteristics of high-velocity data streams over time, specifically in the rate at which the data is generated and processed. For example, while the average rate of tweets generated is currently around 6,000 per second, peaks of up to 25 times the average value, as high as 200,000 tweets per second, have been observed. Such changes in the data rate can cause significant changes in the workload of the data analytics application processing that data. The impact of an increase in workload is an increase in processing latency, mainly due to queuing delay. For applications with low latency requirements, such an increase may not be acceptable and hence requires mechanisms to elastically scale up the system at run-time. Further, as the data rate fluctuates with temporary spikes and valleys, it is also important to scale down and de-provision acquired resources to reduce the resource cost and energy usage when the stream data rate becomes small and no longer requires the provisioned resources to perform the data analytics.

Figure 1.1: Cumulative Distribution Function (CDF) of normalized CPU core performance, across VMs and over time.

The need for run-time elasticity raises several research challenges: first, adaptation algorithms to determine when, what and how much to scale up or down in response to variable data rates; and second, once the scaling decisions are made, how to perform the operations efficiently with minimum overhead or downtime. We address both of these issues in this thesis in chapters 4, 5, and 6.

1.3 Infrastructure Dynamism

Performance variations present in virtualized and multi-tenant infrastructures like clouds make it challenging to execute time-sensitive applications such as continuous dataflows. Several studies have observed such variations both within and across VMs [59]. Their cause is attributed to factors including multi-tenancy, evolving hardware generations in the data center, placement of VMs on the physical infrastructure, hardware and software maintenance cycles, and so on. In this section, further empirical evidence of such performance variation, on both public and private clouds, is offered that motivates the need for adaptive deployment of continuous dataflows.

Figure 1.2: CDF of performance variation relative to the previous measurement time-slot.

Two cloud providers are considered: FutureGrid, a private academic IaaS cloud running Eucalyptus 2.0 and OpenStack Grizzly clouds and supporting over 150 research projects (https://portal.futuregrid.org/projects-statistics), and Amazon EC2, a popular public commercial IaaS cloud. The experimental setup consists of running three groups of VMs: 20 Eucalyptus and 10 OpenStack VMs on FutureGrid, and 10 EC2 VMs on Amazon, over a 1-26 day period. Each VM's class was equivalent to Amazon's m1.small, with one dedicated virtual CPU core and 1024 MB of memory. Standard CPU (whetstone [124]), disk, and memory benchmarks are run on each VM every 2 minutes, and a peer-to-peer network benchmark between the VMs measures TCP connection time, packet latency and bandwidth [127]. Here, we focus on the CPU and network performance metrics. However, this can be extended to consider variations in metrics like local disk I/O, durable storage and messaging services (Amazon S3, SQS), etc., based on the application's characteristics.

Figure 1.3: Box-and-whiskers plot of the normalized performance distribution over time (Y axis) for different Amazon AWS VMs (X axis).

In our analysis we focus on two specific aspects. First, we analyze the overall average performance deviation over a period of time relative to the rated performance; second, we analyze the performance fluctuations observed between two consecutive time slots (of 2 minutes each). The former motivates the need for dynamic adaptations in general, while the latter motivates the need for proactive, look-ahead, planning-based adaptations to minimize the thrashing phenomenon (i.e. reversal of resource provisioning decisions) observed with the reactive adaptation approach (chapter 5). The observed instantaneous performance for a VM in each of these groups (i.e. Eucalyptus, OpenStack or AWS) is normalized to [0, 1] by dividing it by the best observed (rated) performance for that metric in that particular group. Although the absolute performance of the VM groups differs, we do not consider that in our analysis since we assume a single-cloud deployment for an application.

Figure 1.4: CDF of variation in normalized latency measures relative to the previous measurement time-slot.

First, we analyze the overall average performance across different VMs. Figure 1.1 shows the cumulative distribution function of the normalized average CPU core performance across all VMs in each of the three VM groups. The Y axis shows the cumulative fraction of VMs whose observed performance is less than the normalized performance on the X axis. We see that all three VM groups show significant variations in performance, with the EC2 VMs on Amazon's public cloud showing the highest variations, possibly due to greater data center diversity and high multi-tenancy and utilization of physical resources. Specifically, 23% of EC2 VMs have a normalized core performance of < 0.80, while < 10% of the FutureGrid VMs exhibit this shortfall, possibly due to more uniform physical resources. Similar variations are also seen in the network performance measures when considering all VM pairs over time. In brief, we see a high deviation in normalized packet latency, with more than 45% of the VM pairs exhibiting less than 0.70, and 22% of VMs have a normalized bandwidth of less than 0.80. These can punitively impact the exchange of small messages by streaming applications.

Figure 1.5: CDF of variation in normalized connect time measures relative to the previous measurement time-slot.

Second, we also analyze the transient and sharp fluctuations in performance (i.e. high amplitude of Δ, small duration), both in CPU and in network performance, for the same VM. Figure 1.2 shows the percentage change (unsigned) in average normalized CPU performance between two consecutive time slots.
The larger the percentage value on the X axis of Figure 1.2, the greater the variation in performance between consecutive periods and the more pronounced the performance fluctuations. For example, in 25% of the observations, the performance of an EC2 VM changes by more than 20% across consecutive time slots. We see similar performance fluctuations for the FutureGrid Eucalyptus and OpenStack deployments. Further, the extent of variation differs across VMs. Figure 1.3 shows a box-and-whiskers plot with the Y axis representing the average normalized CPU performance for each time slot and the X axis representing different EC2 VMs. The height of the box shows the difference in performance between the q_1 and q_3 quartiles, and the + marks indicate outliers (< q_1 - 1.5 × (q_3 - q_1)).

Figure 1.6: CDF of variation in normalized bandwidth measures relative to the previous measurement time-slot.

Such variations in performance mean that the quality of service (e.g. latency or relative throughput) obtained from the application varies over time, even if the incoming data rate remains constant, as factors such as bandwidth availability and CPU performance change over time. Further, short-term fluctuations and outliers mean that the immediate past performance of a resource may not be sustained in the immediate future, and assuming so will cause sub-optimal resource allocation strategies [70]. In particular, such transient performance changes can cause decisions that need to be unrolled soon after. Thus a look-ahead adaptation strategy that not only accounts for the performance variations across VMs but is also resilient to such short-term fluctuations is warranted. We posit that the impact of infrastructure dynamism, combined with the data dynamism discussed earlier, can be mitigated by intelligent run-time elasticity, and the models, algorithms, and mechanisms for such elasticity are one of the main contributions of this thesis.

1.4 Domain Dynamism

As we move from high-volume to high-velocity data analytics, the data analytics takes the form of a long-running continuous application rather than isolated analytic modules. For such long-running applications, domain dynamism plays a critical role, as the environment in which the application runs as well as the domain requirements on the application change over time. Such changes often require complex adaptations and potential application re-composition, and are another point of focus in this thesis.

We elaborate on the notion of "domain dynamism" through a use case in the Smart Grid domain, which will be used throughout the thesis as a running example to introduce and explain key concepts. The primary motivation behind this work comes from our experiences in designing demand-response (DR) applications for Smart Grids as part of the Los Angeles Smart Grid Project [109]. However, the flexible application models, abstractions and system requirements motivated by this vital domain and proposed in this thesis are more broadly applicable to applications that lie at the intersection of high-velocity data processing and run-time application dynamism. We briefly introduce a facet of the Demand Response (DR) application in Smart Power Grids to illustrate these needs.
1.4.1 Motivating Scenario - Smart Power Grids

According to the International Energy Agency, global energy demand is set to grow by 37% by 2040 (compared to 2014), of which electricity is the fastest growing final form of energy [57, 100]. Thus, meeting the increased demand for electricity is considered one of the most critical challenges facing modern societies.

Figure 1.7: Reduced electricity consumption during a Demand Response (DR) event.

Efforts initiated to meet this challenge include generating electricity from renewable energy sources, such as solar and wind; developing efficient energy storage systems; modernizing and optimizing the electric grids; and increasing consumer awareness and participation in achieving energy sustainability.

In recent years, grids worldwide are being transformed into smart grids [100] as they adopt Advanced Metering Infrastructure (AMI), or smart meters (Smart Meter deployments continue to rise, US Energy Info. Admin., 2012, www.eia.gov/todayinenergy/detail.cfm?id=8590). AMIs allow bi-directional, real-time communication between the utility and the consumer, thus permitting the utility to measure consumer power consumption in near real time, and also act as a communication gateway for the utility to interact with the home or building power control system and appliances [40]. This has resulted in unprecedented amounts of data being generated at high spatial and temporal resolutions [109] and provides unique opportunities for demand optimization in power grids.

An informatics approach to Smart Grids that uses continuous information integration and data analytics methods has been proposed for translating this fine-grained power usage data into a decision support system for utilities to manage energy, optimize demand and production, and provide better quality of service [100, 8]. For example, predictive modeling can be used by electric utilities to learn how electricity consumption patterns will change over a span of a few hours, to anticipate periods of peak demand, and to make dynamic decisions on how to address a potential mismatch between supply and demand, either by increasing generation or, preferably, by sending out signals to consumers to reduce consumption. The latter is termed demand response (DR) optimization, which focuses on demand mitigation (i.e. consumer-side optimizations) by leveraging either direct control over consumer power usage (e.g. through curtailment contracts with commercial clients) or voluntary consumer participation in return for incentives, to reduce the peak load on the utility when a potential demand-supply mismatch is anticipated (Figure 1.7). While utilities have long used day-ahead predictions and planning for demand response, we focus on the recent trend towards dynamic demand response (D²R), which requires performing consumption prediction and demand response at a much finer granularity (typically a few minutes to an hour ahead) due to dynamically changing conditions of the grid, such as intermittent generation from renewable energy sources [7, 71].

A typical (simplified) dataflow pipeline for such an application is shown in Figure 1.8. Various data sources in the form of continuous data streams (e.g. smart meters, power generation stations, and environmental sensors) flow through an information integration pipeline which integrates, normalizes and annotates the data from these sources.
It is then passed to a "Demand Forecasting" module which uses machine learning techniques to predict demand in the immediate (or later) future. To avoid exceeding generation capacity, when consumption is predicted to rise above a certain threshold, a curtailment event may be triggered for a subset of customers, based on the predicted curtailment that may be achieved by those customers, to keep consumption under the desired threshold.

Figure 1.8: Simplified Demand Response optimization process in Smart Grids (data from smart meters, power generators and environmental sensors flows through information integration into demand forecasting, curtailment prediction, customer selection, and load control/curtailment signals, with historical storage).

The demand forecasting and curtailment prediction modules form the core components of the DR application, and a number of different models have been proposed, for example based on time series models [80], regression models [39], probabilistic models [31], averaging models [2, 1, 37], and ARIMA and regression tree models [9]. However, previous studies [43, 77] have shown that the best prediction method for demand/curtailment predictions for different consumers depends on several factors, such as consumer type, sampling granularity, data availability, and consumption variations, and no method is universally better than the others in general. Further, it has been observed that the best prediction model to use for a specific customer (or a cluster of similar customers) also varies over time due to run-time variations in the above factors, and hybrid models which use a static preselected schedule [77] or dynamically select the best forecast [43] have been proposed that outperform any single prediction model. Given the long-running and continuous nature of the demand response application, integrating such dynamic model selection requires the application to seamlessly adapt to such variations in application requirements. We categorize these variations into two categories as follows:

Spatial Variations. This refers to the requirement of using different prediction models that offer the best demand or curtailment predictions for different customers. While personalized models tailored to the behavior of an individual customer would be ideal for accurate prediction, the scale at which this has to be performed inhibits such personalized models. Predictions based on clustering of similar customers have been proposed, which use the same prediction models for customers with similar behavior [113]. Existing stream processing systems [112, 88, 81] support key-based routing and streaming map-reduce patterns which can be used to route prediction requests for consumers belonging to the same cluster to the appropriate prediction models based on the aforementioned clustering of customers; this is out of the scope of this thesis.

Temporal Variations. This refers to the temporal variations in patterns observed for a specific customer (or a cluster of customers with similar behavior), which in turn requires the application to dynamically adapt and switch the prediction models based on factors such as prediction quality [43]. Thus, it becomes necessary to allow the user to specify different alternate prediction models for forecasting energy consumption over time, as well as to specify domain rules that govern which of the models should be used for prediction. Figure 1.9 shows an example of simple domain rules that can be applied for dynamic application re-composition at run-time.

Figure 1.9: Forecast module with different implementations (A and B). Example rules: if [incoming_datarate > 100 AND demand_supply_peak_alert == false] then activate A; else if [demand_supply_peak_alert == true OR foreCastError >= 0.3] then activate B; else activate any.

Another example of temporal variation that requires run-time application re-composition is when a potential demand-supply mismatch is predicted and hence requires the system to start a DR event. During this time, the system not only has to use the consumption prediction module but also the curtailment prediction module, and it has to select the best customers for sending curtailment events based on several factors. As a result, during the DR event, the application needs to dynamically deploy the curtailment and customer selection modules at run-time in response to the DR event.

A simple approach to support either of these temporal variations is to embed the control logic for activating or deactivating modules directly into the dataflow. However, such an approach raises various compositional issues. The application logic becomes more complex and difficult to maintain, since there is no separation of concern between the data processing logic and the control mechanisms, which may be owned by different people. Second, during run-time, the application has to not only read and process the data streams but also subscribe to the control events and match complex rules, which adds processing overhead. And finally, in cases where the control logic changes without any change to the data processing logic, the entire application may have to be redeployed, which obviates our original goal of continuous uptime.

These requirements motivate the need for a generic dynamic and continuous dataflow framework that not only provides easy-to-use programming abstractions to enable dynamic adaptations and policies, but also provides run-time support to monitor control events, detect business rules, and perform dynamic updates and application re-composition at run-time. This domain-driven application dynamism, coupled with infrastructure and data dynamism, poses various challenges for building scalable, adaptive stream processing systems and is the focus of our work.

1.5 Research Contributions

Existing stream processing engines offer high-level programming abstractions to build modular applications, which makes it easier for developers to compose and deploy scalable distributed stream processing applications at large scales [81, 135]. These models provide great extensibility and ease of development, but none of the existing frameworks provide the programming abstractions or the underlying infrastructure to support dynamic stream processing applications that allow dynamic application re-composition to adapt to varying domain conditions. At the same time, these systems lack the ability to elastically scale at run-time in response to changing load or infrastructure conditions, and need to either start with an over-provisioned deployment or (at least partly) redeploy at run-time, which adds significant cost or high latency respectively. We address these challenges in this dissertation; the following are the key contributions.

• We propose a novel conceptual model for flexible dataflow applications that operate on continuous data streams and allow seamless application re-composition at run-time in response to changing domain conditions.
We translate these conceptual models into a high-level programming abstraction that allows developers to rapidly compose flexible streaming applications which can execute at scale on distributed infrastructure while hiding the underlying system complexity. This introduces the key concept of "value-driven execution", which allows users to continuously monitor the domain-specific value achieved from the application over time and adapt to changing conditions. Further, we present a system architecture that decouples the data and control planes and provides low-latency application dynamism with minimal downtime (chapter 3).

• To address the challenges due to variations in infrastructure performance and system load (data rate), we propose several scheduling and resource mapping heuristics that exploit cloud elasticity and dynamic dataflows for scalable, elastic deployment of stream processing applications on public clouds. The heuristics aim to enable constraint-driven execution, e.g. in terms of latency or throughput, in a dynamic environment, while balancing the application value and resource cost on clouds. We design the heuristics with several practical considerations in mind, such as monitoring cost, centralized vs. decentralized control, and reactive vs. lookahead scheduling, which yields a plethora of different heuristics to choose from based on the deployment conditions (chapters 4 and 5).

• While the proposed heuristics help decide when and how much to scale, frequent updates to the application and migrations can significantly add to the application latency during such operations. Further, to ensure continuous execution and minimize the downtime due to system failures, which are quite common in cloud infrastructure, we propose an efficient integrated approach to elasticity, load balancing, and fault tolerance that exhibits minimal run-time overhead and causes very low (sub-second) delays during migrations due to elastic scaling or fault recovery. These efficient mechanisms allow the application to seamlessly scale up or down and continue regular operation even during partial failures, and hence minimize the downtime (chapter 6).

• We bring together the proposed dynamic dataflow model, the heuristics, and the efficient scaling and fault-tolerance techniques, develop an open source implementation of the proposed system, and evaluate it using a real-world application in the smart grid domain.

Chapter 2
Background and Preliminaries

The content presented in this chapter is aimed at providing relevant context for the discussion in the subsequent chapters. We discuss the background and concepts upon which we build our proposed model, algorithms, and system, and present general concepts and definitions used throughout the thesis.

We first describe a widely used application composition model based on dataflow task graphs for building stream processing applications. We also describe an approach for scalable deployment of such applications on shared-nothing distributed systems such as clusters and clouds. Finally, we present an infrastructure model that captures the resource management, performance, and cost aspects of an Infrastructure as a Service (IaaS) cloud system, which we use later in the thesis to discuss the proposed system.
2.1 Stream Processing Applications Composition Model: Continuous Dataflows

Scientific workflows [131], stream processing systems [3, 88, 81, 20, 135] and similar large-scale distributed programming frameworks [134, 95] have garnered renewed research focus due to the recent explosion in the amount of data, both archived and real-time, and the need for large-scale analysis of this "Big Data".

We develop the application models for dynamic applications, later in the thesis, as an extension of the continuous dataflow composition model, a task-graph based programming model for building and executing long-running continuous applications which process incoming data streams in near real time. Continuous dataflow systems have their roots in Data Stream Management Systems, which process continuous queries over tuple streams composed of well-defined operators [14]. General-purpose continuous dataflow systems such as S4 [88], Storm [81], and Spark [134], on the other hand, allow user-defined processing elements, making it necessary to find generic auto-scaling solutions such as operator scaling and data-parallel operations [138]. Here we present the definition and scalable execution model for continuous dataflows, which will be extended in chapter 3 to support dynamic dataflow applications.

Continuous dataflows often leverage the familiar Directed Acyclic Graph (DAG) based composition model that uses the Action-Port [25] model for defining processing elements. This allows users to compose loosely coupled applications from individual tasks, with the data dependencies between them defined as streaming dataflow edges. While, in practice, this model can be extended to include more complex constructs like back flows/cycles, we limit our discussion to DAGs to keep the application model simple and the optimization problem tractable.

Def. 1. A continuous dataflow G is a quadruple G = ⟨P, E, I, O⟩, where P = {P_1, P_2, ..., P_n} is the set of Processing Elements (PEs) and E = {⟨P_i, P_j⟩ | P_i, P_j ∈ P} is a set of directed dataflow edges without cycles, such that data messages flow from P_i to P_j. I ≠ ∅ ⊂ P is a set of input PEs which receive messages only from external data streams, and O ≠ ∅ ⊂ P is a set of output PEs that emit output messages only to external entities.

Figure 2.1: Subset of dataflow patterns supported in Floe: P1 single execution (push), P2 streamed execution (pull), P3 window execution (push), P4 loop, P5 synchronized merge, P6 interleaved merge, P7 duplicate split, P8 round-robin split, P9 dynamic data mapping (Map-Reduce-Reduce-...), P10 bulk synchronous parallel.

Each PE represents a long-running, user-defined task which executes continuously, accepting and consuming messages from its incoming ports and producing messages on its outgoing ports. A directed edge between two PEs connects an output port of the source PE to an input port of the sink PE, and represents a flow of messages between the two. Existing systems, including our Floe stream processing system [112], support a number of different dataflow semantics that allow users to build complex dataflow applications.

Basic Dataflow Patterns. Pellets can be composed into directed graphs where ports are wired to each other to indicate dataflow between them. Control flow constructs such as if-then-else or switch can be easily implemented through pellets with multiple output ports, where the user logic emits an output message on only one of them.
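As a minimal illustration of this multiple-output-port pattern, the sketch below shows a pellet that routes each message to exactly one of two output ports; the Pellet base class and emit() call are hypothetical stand-ins for illustration, not the actual Floe API.

# Hypothetical sketch of an if-then-else pellet that emits on one of two output ports.
# The Pellet/emit abstractions are illustrative stand-ins, not the real Floe API.

class Pellet:
    def __init__(self, out_ports):
        # one outgoing message queue per named output port (wiring to sink pellets omitted)
        self.out_ports = {name: [] for name in out_ports}

    def emit(self, port, message):
        self.out_ports[port].append(message)

class ThresholdRouter(Pellet):
    """Emits each reading on exactly one of its two output ports."""
    def __init__(self, threshold):
        super().__init__(out_ports=["high", "normal"])
        self.threshold = threshold

    def on_message(self, reading):
        port = "high" if reading["kwh"] > self.threshold else "normal"
        self.emit(port, reading)

router = ThresholdRouter(threshold=5.0)
router.on_message({"meter": "m-42", "kwh": 7.5})
print(router.out_ports["high"])    # the reading was routed only to the 'high' port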
Floe graphs can also have cycles, so that an output port can be connected to the input port of a "preceding" pellet. This allows iteration constructs like for-loops to be designed (Figure 2.1, P4).

Floe graphs support different patterns for aggregating messages from multiple pellets. A synchronized merge (Figure 2.1, P5) aligns messages that arrive on different input ports to create a single message tuple map, indexed by port name, that can be pulled by or pushed to the pellet. Alternatively, a pellet's multiple input edges can be wired to a single port as an interleaved merge (Figure 2.1, P6), where messages from any of the input edges are accessible on that input port immediately upon their arrival.

Task parallelism is achieved explicitly by wiring output ports of a pellet to different pellets. When an output port of a (source) pellet is split to multiple (sink) pellets, users can decide at graph composition time whether an output message should be duplicated to all outgoing edges (Figure 2.1, P7) or sent to only one. In the latter case, a default round-robin strategy is used to select the edge for load balancing (Figure 2.1, P8), but a more sophisticated strategy can be employed in the future (e.g. depending on the number of messages pending in the input queue of the sink pellet).

Every pellet is inherently data parallel. The Floe framework transparently creates multiple instances of a pellet to operate on messages available on a logical input port. The number of instances created is determined by our optimizations, discussed later. Pellet instances also share the same logical output port. One side effect of this is that the output messages may not be in the same sequence as the input, since the pellet instances may complete out of order. Users can explicitly set an option that forces a pellet to operate sequentially, without data parallelism, to ensure messages flow in order.

Advanced Dataflow Abstractions. The basic dataflow patterns can be composed to form complex applications and advanced constructs. For example, the Bulk Synchronous Parallel (BSP) model has seen a revival of late for large-scale graph algorithms on clouds [76]. BSP is an 's'-stage (each stage is also called a superstep) bipartite dataflow where each superstep has 'm' identical pellets that operate concurrently and emit messages to other pellets in that superstep. The message transfer itself is synchronous at superstep boundaries, and messages are made available as input to pellets once the next superstep starts. The number of supersteps is decided at run-time. We can compose a native BSP model using the basic Floe patterns, using a set of fully connected pellets with all their output ports connected to each other's input ports (Figure 2.1, P10). In addition, a manager pellet acts as a synchronization point to determine when a superstep has been completed by all pellets. So "data" messages on the input port of superstep pellets are gated by a "control" message from the manager pellet on another input port.

MapReduce is another advanced dataflow pattern, constructed as a two-stage bipartite graph from 'm' mapper tasks to 'r' reducer tasks. Since its "shuffle" between Mappers and Reducers does not naturally fit the basic patterns (a shortcoming of other dataflow frameworks), we introduce the notion of dynamic port mapping during a split pattern. This allows continuous MapReduce+ to be composed, with one Map stage and one or more Reduce stages (Figure 2.1, P9). Map and Reduce pellets are wired as a bipartite graph similar to a MapReduce dataflow, and they both emit messages as <key,value> pairs. However, rather than using a duplicate or round-robin split of messages from the output port of a Map pellet to the input ports of the Reducer pellets, the Floe framework performs a hash on the key to dynamically select the edge on which a message passes, similar to Hadoop. The hash ensures that messages from any Map pellet having the same key reach the same Reduce pellet. The Map and Reduce pellets can be used in any dataflow composition. This approach generalizes the pattern beyond just MapReduce, and even allows iterative MapReduce composition (while doing away with the often unnecessary Map stage for the second and subsequent iterations). Also, our streaming version allows Reducers to start before all Mappers complete, and allows operation over incremental datasets as they arrive rather than in batch mode. Pellets can emit user-defined "landmark" messages to indicate when a logical window of the message stream has been processed, to allow the reducer pellets to emit their results, say, when they are doing aggregation.

2.2 Scalable Streaming Application Deployment

We outline here the overall approach to scalable deployment of continuous dataflows used by a number of systems [81, 112, 88] on shared-nothing distributed systems such as clusters and clouds. The specific heuristics for determining the scaling factor and resource mapping are discussed in subsequent chapters.

Figure 2.2: (a) A continuous dataflow application with four PEs, (b) a concrete dynamic dataflow with resource requirements determined, (c) a distributed deployment of the concrete dataflow on VMs.

Figure 2.2(a) shows a simple abstract continuous dataflow with four PEs {E_1, E_2, E_3, E_4} connected using dataflow edges. E_1 and E_4 are the input and output PEs, respectively. For simplicity, we assume that all edges follow the duplicate split semantics on outgoing edges and the interleaved merge semantics on incoming edges, as described in the previous section. Messages emitted on E_1's output port are duplicated to both E_2 and E_3, while E_4 interleaves the messages from the task-parallel operations performed by E_2 and E_3. When this abstract dataflow is submitted for execution, the resource requirements are first determined, either manually [81] or through scheduling and resource mapping algorithms [70], as shown in Figure 2.2(b). This determines the number of data-parallel instances of each PE that need to be deployed given the expected data rates and infrastructure characteristics.

Each of the instances of a given PE consumes and processes the incoming messages in parallel, thus exploiting data parallelism and easily scaling to extremely high data rates [112]. While this approach works well for stateless PEs, where each incoming message is processed independently, special mechanisms need to be in place to handle stateful PEs, where a state object is associated with each PE and is accessed and updated with each incoming message, and hence depends on the entire history of incoming messages. This state can potentially be shared among the several running instances of a PE, further adding to the complexity due to shared concurrent state access. For simplicity, we initially assume all PEs to be stateless, and the application model as well as the scheduling heuristics are built on this assumption.
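The key-hash edge selection described above is easy to picture in isolation: the sketch below routes every keyed message to one of a fixed set of reducer instances so that equal keys always land on the same instance. The instance list and hashing choice are simplifying assumptions for illustration, not the framework's actual implementation.

import hashlib

# Sketch of hash-based edge selection: messages with the same key are always
# routed to the same downstream (reducer) instance, similar to Hadoop's shuffle.
# The fixed instance list below is an assumption for illustration.
reducer_instances = ["reducer-0", "reducer-1", "reducer-2", "reducer-3"]

def route(key):
    digest = int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)
    return reducer_instances[digest % len(reducer_instances)]

# Every Map instance that emits key "meter-42" sends it to the same reducer.
assert route("meter-42") == route("meter-42")
print(route("meter-42"), route("meter-7"))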
However, we later relax this assumption and propose efficient mechanism for scaling for stateful PEs with minimum overhead such that the same models and heuristics work equally well for stateful PEs. Finally, given the resource requirements and number of parallel instances for each PEs, we determine the number of resources (VMs) and the mapping of these data parallel instances onto these resources so as to minimize the total number of required resources as shown in Fig 2.2(c). At run-time, these PE instances operate in a data parallel manner, with each CPU core allocated to the PE being able to operate on independent messages available on their input port. Incoming messages to a VM instance are buffered in a local queue before execution. We assume incoming messages are load-balanced across the CPU cores. This deployment model is used by a number of existing systems and provides a highly scalable execution environment for high velocity stream processing appli- cations. However, this model does not support run-time elasticity or application adaptation to the various dimensions of dynamism discussed earlier. We build upon this model to efficiently support run-time adaptations through several adap- tive heuristics and infrastructure optimizations. 2.3 Cloud Infrastructure Model We primarily focus on Infrastructure as a Service (IaaS) cloud model an the underlying infrastructure for deploying dynamic stream processing applications due to ease of access, management and pay-as-you-go cost model. We model 26 the behavior of a virtualized commercial cloud Infrastructure as a Service (IaaS) environment here which will be used by our adaptation heuristics proposed in chapters 4.4 and 5 as well as the efficient elasticity and fault-tolerance mechanism in chapter 6. We assume that the execution framework has access only to the virtualized cloud resources: virtualized CPU cores, virtual disks within a VM, and virtual network connectivity and there is no control over or knowledge of the actual VM placement within the data center and, consequently, the network connection behavior between the VMs. This implies different network profiles for different VM instances even of the same class. We abstract and model the IaaS cloud characteristics as follows: The cloud environment consists of a set of VM resource classes C = {C 1 ,C 2 ,...,C n } that differ in the number of available CPU cores N, their rated core speed π, and their rated network bandwidth β. We assign atleast one dedi- cated core to each VM. For simplicity, we ignore memory and disk characteristics in the current model. Since CPU core speeds may vary across VM classes, we define the normalized processing powerπ i of a resource classC i ’s CPU core as the ratio of its processing power to that of a “standard” VM core, under ideal condi- tions. Naïvely, this may be the ratio of their CPU core clock speeds (e.g, 2.2 GHz 1.7 GHz ), but could also be the result of running application benchmarks on a standard VM and the VM of a particular resource class. Cloud providers such as Amazon also provide resource class ratings in the form of Elastic Compute Units (ECUs) that can be used. The processing requirements of a PE alternate is defined in terms of core-seconds (c) required to process a single message on the standard VM core (whose π = 1). Hence, the latency of a PE to process a message on a resource of classC i can be obtained by scaling asc i =c/π i . In addition, each resource class is associated with a fixed hourly usage priceζ i . 
We follow a costing model similar to existing cloud providers. The usage of a VM instance is rounded up to the nearest hourly boundary, and the user is charged for the entire hour even if the VM is shut down before the hour ends. R(t) = {r_1, r_2, ..., r_n} is the set of all VMs that have been instantiated up to time t. Each instance is described by the tuple r_i = ⟨C_j, t_start, t_off⟩, where C_j is the resource class to which the VM instance belongs, t_start is the time at which the instance was created, and t_off is the time at which the instance was turned off. t_off is set to ∞ for an active VM. The total accumulated cost for the instance r_i at time t is then calculated as

μ_i[t] = ⌈(min(t_off, t) − t_start)/60⌉ × ζ_j

where min(t_off, t) − t_start is the duration (in minutes) for which the instance has been active.

Various studies, including our own (chapter 2), have shown that the performance of cloud VM instances is volatile, over time and across instances of the same class, including their processing power and network bandwidth [59, 61, 58]. To gauge the current behavior of the virtualized cloud resources, we presume a monitoring framework that periodically and non-intrusively probes the performance of, and faults in, the VMs and their network connectivity using standard benchmarks. The normalized processing power of a VM instance r_i's CPU core monitored at time t is given by π_i(t), and the network latency and bandwidth between pairs of active VM instances r_i and r_j are λ_{i×j}(t) and β_{i×j}(t), respectively. Hence, the processing latency for a PE alternate p_i^j at time t is a function of the current normalized processing power of the set of VM instances it is allocated to at that time. Similarly, the bandwidth between two PE instances is the current bandwidth between the VM resources on which they are deployed. We assume in-memory message transfer if two PEs are collocated in the same VM instance, i.e., λ_{i×i} → 0 and β_{i×i} → ∞. In addition, during the deployment stage, we assume that the network bandwidth between two VMs is equal to the "rated values". However, during run-time adaptation, we use the actual bandwidth reported by the monitoring framework, which may change due to factors including collocation of VMs and data center traffic.

2.4 Summary

In this chapter, we discussed the required background and introduced certain concepts and notations to be used in the rest of the thesis. Specifically, we discussed the continuous dataflow model, a widely used application composition model for stream processing applications. We also discussed an approach towards scalable deployment of such dataflows on shared-nothing distributed systems such as clusters and clouds. Finally, we discussed an abstract model for representing the resource characteristics and cost models of IaaS clouds, which will be used by our proposed adaptation heuristics in subsequent chapters.

In the next chapter, we introduce the concept of Dynamic Dataflows as an extension to the continuous dataflow model that retains the extensibility and ease of development of continuous dataflows and provides both development and run-time support for dynamic adaptations and run-time application recomposition.

Chapter 3

Dynamic Dataflow Applications

We discussed the motivation for dynamic applications in chapter 1 and the basic continuous dataflow model in chapter 2.
In this chapter, we develop a conceptual model for dynamic stream processing applications that enables seamless run-time re-composition, and a scalable execution model based on the separation of data and control planes that operate in a closed-loop cycle to adapt the applications at run-time based on user-defined business rules. We further map these into accessible high-level programming abstractions for rapid application development, and evaluate the proposed system through a real-world application in the smart grid domain, demonstrating considerable improvements in the overall application value achieved by the dynamic applications.

3.1 Runtime Adaptations through Application Re-composition

We discussed, in chapter 2, the motivation and use case in the smart grid domain for the notion of domain dynamism and the need for run-time adaptations to mitigate the impact of domain dynamism. While a good mitigation for both data and infrastructure dynamism is run-time elasticity in terms of allocated resources and resource mapping, a similar approach falls short for domain dynamism, since it often requires changes in certain application modules or even the application logic. We posit that, in addition to elasticity, domain dynamism requires run-time application re-composition that allows the user to update part of the application logic at run-time. In the context of the continuous dataflow application model discussed earlier, this translates to two types of application re-composition capabilities:

• Task Level Re-composition: This refers to the ability to switch the implementation of a particular task in the application with another implementation (with the same logical output) that has different characteristics, such as performance, quality (e.g., accuracy of predictions), or data requirements. As the domain characteristics vary over time, the user can select different available implementations for a given task to maintain the desired quality of results or other domain constraints.

• Data Flow Re-composition: This refers to the ability to switch between different sub-dataflows, potentially performing different logical operations. This is critical in scenarios where certain domain triggers require specific processing pipelines. A user should thus be able to specify several alternate pipelines for data processing and switch between them at run-time based on observed triggers.

The existing continuous dataflow systems lack both the compositional and the execution support required to allow such run-time application re-composition in response to domain dynamism, and this is the focus of the rest of this chapter.

3.2 Conceptual Model for Dynamic Dataflows

A simple approach to support such run-time recomposition is to embed the control logic for switching between alternates, or selecting alternates, directly into the dataflow, similar to workflow systems that mix control and dataflow logic [101]. However, such an approach raises various compositional issues. First, the application logic becomes more complex and difficult to maintain, since there is no separation of concern between the data processing logic and the control mechanisms, which may be owned by different people. Second, during run-time, the application has to not only read and process the data streams but also subscribe to the control events and match complex rules, which adds processing overhead. And finally, in cases where the control logic changes without any change to the data processing logic, the entire application may have to be redeployed, which obviates our original goal of continuous uptime.
These requirements motivate the need for a generic dynamic and continuous dataflow framework that not only provides easy-to-use programming abstractions to specify alternates and policies, but also provides run-time support to monitor control events, detect business rules, and perform dynamic updates to the application.

In this section, we define the notion of dynamic continuous dataflow applications, followed by an execution model that allows flexible updates to the dataflow application at run-time.

3.2.1 Dynamic Continuous Dataflow

Stream processing systems [81, 88, 112] often use the Directed Acyclic Graph (DAG) based model of application composition to build continuous dataflow (streaming) applications, where the nodes represent individual processing elements (PEs) and the edges represent the flow of data between these processing elements.

Def. 2 (Continuous Dataflows). A continuous dataflow G is a quadruple G = {P, E, I, O}, where P = {P_1, P_2, ..., P_n} is the set of Processing Elements (PEs) and E = {E_k = (P_i, P_j) | P_i, P_j ∈ P} is a set of dataflow edges such that there is no cycle, where (P_i, P_j) denotes the flow of data messages from P_i to P_j. I ≠ ∅ ⊂ P is a set of input PEs where external data messages enter the dataflow continuously at variable rates, and O ≠ ∅ ⊂ P is a set of output PEs which emit data messages to be consumed by an external entity. The sets P and E are together called dataflow elements.

Each processing element represents a long-running user-defined task in the dataflow which runs continuously, accepting and consuming data messages from the incoming edges and producing messages on the outgoing edges. A PE may be stateless, where each incoming message is processed independently, or stateful, such that the output depends on the history of messages seen so far. A directed edge E_k between two PEs connects an output port of the source PE to an input port of the sink PE, and represents a flow of messages between the two. A user can select different semantics for the flow of data tuples on the edges, such as AND-split, OR-split, and key-based routing (streaming MapReduce), similar to dataflow patterns in static workflows [120], to build complex applications.

While this dataflow model has proven suitable for both small and large scale continuous stream processing applications, it does not support dynamic adaptations to temporal variations that would allow users to specify and update the dataflow application at run-time based on changing domain conditions.

We propose the notion of "Dynamic Dataflows" as an extension to the DAG-based dataflow model by incorporating the novel concept of Guarded Dynamic Elements. Guarded Dynamic Elements enable users to define multiple implementations for individual processing elements (Dynamic PE) or for a set of dataflow edges (Dynamic Edge) and allow switching between them at run-time. Each implementation (alternate) has an associated guard policy that dictates the activation strategy for that alternate. At run-time, the system subscribes to several control events and determines the active alternate at a given moment based on these guard policies. Thus, by using a combination of alternates, guard policies, and control events, the application can adapt to observed temporal variations. Next we describe the dynamic elements (Dynamic PE and Dynamic Edge) and the different types of guard policies that can be applied to these dynamic elements.
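A minimal structural sketch of a guarded dynamic element is given below (illustrative types only; the actual programming constructs are introduced in section 3.4): a dynamic PE is simply a logical operation bundled with a set of alternates, each pairing a guard policy with an implementation.

import java.util.List;

// Illustrative types only; the real framework constructs appear in section 3.4.
interface GuardPolicy { boolean isActive(); }              // tracked by the control plane
interface Implementation { Object process(Object msg); }   // one concrete alternate m_i^j

// An alternate p_i^j = {g_i^j, m_i^j}.
record Alternate(GuardPolicy guard, Implementation impl) {}

// A dynamic PE P_i with at least two alternates; the active one is chosen at run-time.
record DynamicPE(String name, List<Alternate> alternates) {
    Implementation activeImplementation() {
        return alternates.stream()
                .filter(a -> a.guard().isActive())
                .findFirst()
                .map(Alternate::impl)
                .orElseThrow(() -> new IllegalStateException("no active alternate"));
    }
}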
3.2.2 Dynamic PEs

A dynamic PE P_i is an abstraction that defines a high-level logical operation in the dataflow application and consists of a set of alternate feasible implementations P_i = {p_i^1, p_i^2, ..., p_i^j | j ≥ 2} that can be used to perform the operation. Each alternate p_i^j = {g_i^j, m_i^j} is a pair consisting of a guard policy g_i^j and an implementation m_i^j. Among the several alternate implementations, the guards determine which of those implementations can be active at a given time. The system subscribes to control events, such as domain, performance, and system events, and updates the guards at run-time, which in turn changes the active implementation used for the logical operation.

A typical scenario in which a dynamic PE may be used is when the logical operation to be performed remains the same but the specified alternate implementations offer different tradeoffs, e.g., between the quality of results achieved and the cost of operation, especially when the quality achieved is a function of several factors that vary over time. In such a scenario, the user may use time-varying guards to adapt to the variable factors and select the best alternate over time.

3.2.3 Dynamic Edges

A dynamic edge E_k is an abstraction that defines the flow of data from a given PE to the next PE in the dataflow, and hence governs the dataflow, or the "logic", behind the application. A dynamic edge consists of a set of alternate edges E_k = {e_k^1, e_k^2, ..., e_k^j | j ≥ 2}, where each alternate e_k^j = {g_k^j, f_k^j} is a pair consisting of a guard policy g_k^j and a dataflow edge f_k^j. Among the several alternate edges, the guards determine which of those edges can be active at a given time. As before, the system subscribes to several control events and updates the guards at run-time, which in turn changes the active edge over which the data flows and hence updates the application logic.

A typical scenario in which a dynamic edge may be used is when the application is built with more than one dataflow, or logic, and the logic needs to be updated based on several factors that vary over time. In such scenarios, the user may build several sub-dataflows, each implementing one of these logical dataflows, and use the dynamic edge abstraction with guard policies to determine which dataflow remains active at a given time. Note that dynamic edges differ from OR-split edges, which serve a similar purpose: with an OR-split, the edge is selected by the particular processing element that is running, and the choice is incorporated in the PE logic. Dynamic edges instead allow the user to decouple the routing logic from the processing, so that individual processing elements can be built independently of how they will be used to compose a dataflow application.

3.2.4 Types of Guard Policies

Dynamic PEs and edges provide a powerful abstraction for the user to incorporate dynamism into the dataflow application. Both dynamic elements rely on the guards that determine the configuration at a given time. We support two types of guard policies that can be used in different scenarios, as described below.

Activation Guards

Given a dynamic element D_i (either a Dynamic PE or a Dynamic Edge) with alternates {d_i^1, d_i^2, ..., d_i^n}, where d_i^j = {g_i^j, x_i^j}, the guards g_i^j are called activation guards iff

g_i^j ∈ {0, 1} and Σ_{j=1}^{n} g_i^j = 1

In other words, an activation guard gives precise control to the user to activate or deactivate specific alternates at run-time.
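A small sketch of the activation-guard invariant (our own illustration, not a framework class) follows: exactly one guard of a dynamic element is set at any time, so activating one alternate implicitly deactivates the others.

// Illustrative activation-guard bookkeeping: g_i^j ∈ {0, 1} and Σ_j g_i^j = 1.
final class ActivationGuards {
    private final boolean[] g;

    ActivationGuards(int alternates, int initiallyActive) {
        g = new boolean[alternates];
        g[initiallyActive] = true;
    }

    // Activating alternate j deactivates the previously active one,
    // preserving the "exactly one active" invariant.
    void activate(int j) {
        for (int k = 0; k < g.length; k++) g[k] = (k == j);
    }

    int active() {
        for (int k = 0; k < g.length; k++) if (g[k]) return k;
        throw new IllegalStateException("invariant violated");
    }
}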
Activation guards may be used in scenarios where the application logic depends on the activation of a particular dynamic element or a set of dynamic elements. By controlling the activation guards associated with a set of dynamic elements, the user can force application updates at run-time.

Value-based Guards

While activation guards give the user precise control to activate or deactivate specific alternates of the dynamic elements, value-based guards are geared towards scenarios where precise control over the activation is not required and any one of multiple alternates of a given dynamic element may be activated. In such scenarios, a domain-specific "value" is associated with each alternate that determines the relative advantage of using one alternate over another. Thus, given a dynamic element D_i (either a Dynamic PE or a Dynamic Edge) with alternates {d_i^1, d_i^2, ..., d_i^n}, where d_i^j = {g_i^j, x_i^j}, each value-based guard is equal to the relative value associated with that alternate, i.e.,

g_i^j = v_i^j, where v_i^j ∈ [0, 1]

The relative value for each alternate d_i^j is given by:

v_i^j = F(d_i^j) / max_k F(d_i^k)

where F is a domain-specific value function defined for the corresponding dynamic element. Given the relative "value" of each of the alternates at run-time, the system may choose to activate any one of the available alternates so as to maximize the overall value (or benefit) obtained from the PE under the given constraints (such as resource cost), if any.

As an example, in the smart grid domain, a sample value function associated with the short-term forecast dynamic PE discussed above would be F = 1 − MSE, where MSE is the mean square error associated with the predicted energy consumption values. As the energy consumption patterns vary over time, we have observed [43] that the quality of predictions given by different models also varies over time, and hence so does the value associated with them. By associating this value function with the PE, we obtain the relative value of each alternate with respect to the others at run-time by monitoring the prediction quality, and update the guards accordingly. Thus, although the selection of a particular prediction model is not necessary for the overall application logic, by taking advantage of value-based guards we can dynamically adapt the application to the variations in achieved application value over time and hence increase the overall quality of service.

The dynamic elements provide a very powerful abstraction for application composition that allows fine control over the application, not only during composition but also at run-time. This allows the execution engine to dynamically adapt the application composition in response to domain as well as system events. The adaptation capabilities of dynamic applications lie with the "guards" associated with each of the alternates of the dynamic elements. However, providing the user the ability to control these guards in a way that enables both ease of development and better run-time administration through fine grained control is a challenging problem that needs to be addressed for wide adoption and use of dynamic dataflows. We discuss an execution model to address this challenge in the next section.
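As a concrete (and purely illustrative) rendering of value-based guards, the sketch below derives the relative values v_i^j from the domain value function F = 1 − MSE for a set of forecasting alternates; the framework can then activate whichever alternate maximizes overall value under its constraints.

// Illustrative computation of value-based guards: g_i^j = v_i^j = F(d_i^j) / max_k F(d_i^k).
final class ValueGuards {
    // F = 1 − MSE for each forecasting alternate (hypothetical monitored errors).
    static double[] relativeValues(double[] meanSquaredErrors) {
        double[] f = new double[meanSquaredErrors.length];
        double max = Double.NEGATIVE_INFINITY;
        for (int j = 0; j < f.length; j++) {
            f[j] = 1.0 - meanSquaredErrors[j];
            max = Math.max(max, f[j]);
        }
        double[] v = new double[f.length];
        for (int j = 0; j < f.length; j++) v[j] = f[j] / max; // relative to the best alternate
        return v;
    }

    public static void main(String[] args) {
        // e.g., three prediction models whose observed MSE varies over time
        double[] v = relativeValues(new double[]{0.12, 0.30, 0.21});
        System.out.println(java.util.Arrays.toString(v)); // the best model gets value 1.0
    }
}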
3.3 Dynamic Dataflow Execution Model

We present an execution model to deploy and execute dynamic dataflow applications. The goal of designing the model is to support rapid development of adaptive and dynamic dataflow applications and to facilitate the design of a generic framework that allows fine control over application composition at run-time. The following characteristics guide the design of the model:

1. The dataflow is composed of independently developed PEs which have no knowledge of the application, the interactions with other PEs, or the domain requirements that occur during run-time.

2. For a dynamic dataflow, while the overall application logic and potentially the set of alternate implementations can be determined at composition time, the exact implementation to activate can be determined only at run-time and is based on several varying factors which are not known at development time.

3. On the other hand, we assume that during application composition, the administrator has no ability to update the individual PE implementations, but can choose and connect different PEs (dynamic or otherwise) to compose the dataflow application.

4. Dataflow/streaming applications are often required to run continuously for a long period of time, and there is a non-negligible cost associated with downtime.

5. Further, the control logic responsible for updating the guards may change during run-time without a change in the abstract dataflow application logic.

Characteristics (1) and (2) lead to the requirement that the decision and control logic required for updating the guards must not lie with the PE developer, and must be specified at the time of application composition rather than PE development. Characteristic (3) suggests that the control logic responsible for updating the guards needs to subscribe to several domain, application, and system events and perform complex operations to make adaptation decisions and update the guards. This would significantly increase application development complexity if the data and control logic were handled in the same dataflow application. Hence it is desirable to decouple the dataflow application logic from the adaptation and control logic. Characteristics (4) and (5) further add to this requirement, since a single application incorporating both the data and control logic would need to be updated if either the dataflow or the control logic changes, adding to the potential downtime.

To address these requirements, we separate application deployment into two planes, the application/data plane and the control plane. The dynamic dataflow deployed in the application/data plane consists of processing elements, including dynamic dataflow elements, as described earlier. Further, each of the dynamic elements consists of a number of alternates whose activation/deactivation is governed by the corresponding guards. However, we decouple the update logic for these guards from the application/data plane to reduce both design-time and run-time complexity.

The control plane takes the responsibility of monitoring various control signals, including system events generated by the execution framework as well as domain-specific events (external or application generated). It then continuously tracks each of the guard policies and informs the application/data plane whenever a guard state is updated. The application/data plane can then choose to update the active alternate for the dynamic element in response to the changes in guard states.
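The closed loop between the two planes can be captured by a narrow interface, sketched below with hypothetical names (the actual Floe interfaces differ): the control plane evaluates policies against incoming signals and only pushes guard updates across the boundary, while the data plane decides how to apply them.

// Hypothetical interface between the planes; names are ours, not the Floe API.
interface GuardUpdateListener {
    // Invoked by the control plane when a guard policy changes state.
    void onGuardChanged(String dynamicElementId, int alternateIndex, boolean active);
}

final class ControlPlane {
    private final GuardUpdateListener dataPlane;

    ControlPlane(GuardUpdateListener dataPlane) { this.dataPlane = dataPlane; }

    // Called whenever a monitored domain/system signal arrives; policy evaluation is
    // elided here and could be a CEP engine, a manual trigger, or a learned model.
    void onControlSignal(String elementId, int alternate, boolean shouldActivate) {
        dataPlane.onGuardChanged(elementId, alternate, shouldActivate);
    }
}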
The separation of concern between the data plane and the control plane provides a very flexible conceptual model for dynamic dataflow applications. It allows ease of development, low run-time overhead, and the ability to change guard policies at run-time without affecting the deployed application, hence avoiding downtime. Figure 3.1 shows a sample application described using this model, including the various interactions in the system through the data and control flows.

Figure 3.1: Dynamic Dataflow Execution Model

While this conceptual dataflow model combined with the execution model provides a powerful mechanism for the development and execution of dynamic dataflows, we must map it to easy-to-use programming constructs to make it accessible to developers for wider adoption; this is described in the next section.

3.4 Dynamic Dataflow Programming Constructs

We define a number of high-level programming constructs for continuous dataflows as well as extensions to support application dynamism. We follow an approach similar to DryadLINQ [133] and Spark [134] to allow users to build distributed dataflow applications using a high-level imperative programming language. Such an approach provides a number of advantages over a declarative model (e.g., using XML). It not only allows developers to use familiar strongly typed constructs, easing development, but also hides the system and infrastructure details from the programmer.

We provide a stream and a processor as the basic building blocks for composing applications. A stream is composed of a continuous supply of data tuples arriving at varying rates. Each item T is a strongly typed tuple with optionally named parameters. For example, a stream⟨pair⟨a as int, b as string⟩⟩ is composed of tuples containing two parameters of type integer and string, respectively. For a given tuple, the user can access the individual parameters by index or by name, e.g., T[0] or T.a. We provide a number of primitive operations on the stream object which transform it into one or more child streams.

A processor is a domain-specific user-defined operation that takes exactly one data stream as input and emits exactly one stream as output. Such custom processors can be used in conjunction with the primitive operators to transform input streams into the desired output. A dataflow application is thus composed of primitive stream operations as well as domain-specific operations connected together through data streams.

We further present novel constructs for defining dynamic elements in the dataflow, and present an overview of the policy engine and the policy specification language that govern these dynamic elements.

3.4.1 Primitive Stream Constructs

The primitive stream operators fall into three categories: input/output operators, stream transformation operators, and domain-specific operators.

We support two generic input/output operators, create_stream(input_adapter, params) and dump_stream(output_adapter, params), for creating input streams and consuming output streams from the application. Each of these operators can be extended to support a number of transport mechanisms by implementing corresponding adapters. Further, we support a range of transformation operators [47] on the stream object, such as filter, project, merge, split, and window, with the usual semantics for each operator. In addition, we support the stateful operators groupby(index) and join(join_strategy), which perform the corresponding functions on each window in the given stream. These operators are not valid if the window operator has not been applied to the stream.
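Recalling the processor building block introduced above, a user-defined processor such as the tweet cleaner used in Algorithm 1 below might look as follows. This is only a sketch: the interface shown is ours and is not the exact framework signature, and the cleaning logic is hypothetical.

// Sketch of a domain-specific processor; the real framework interface may differ.
interface Processor<I, O> {
    O process(I tuple); // one input stream in, one output stream out, tuple at a time
}

// Hypothetical "utils.TweetCleaner": strips URLs and normalizes whitespace
// before downstream filtering and classification.
final class TweetCleaner implements Processor<String, String> {
    @Override
    public String process(String tweet) {
        return tweet.replaceAll("https?://\\S+", "")
                    .replaceAll("\\s+", " ")
                    .trim();
    }
}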
In addition, to perform domain-specific operations, we allow the user to define custom processors and use them for stream transformation. The operator create_process(processor_FQDN) creates a custom processor identified by its fully qualified name, and a generic process(processor) operator applies that custom transformation to the stream object. The system allows these processors to be interleaved with the primitive operators during application composition, providing a powerful composition tool to the user.

Algorithm 1 shows a snippet of a sample application built using these abstractions. Lines 1-2 create a pair of input streams using an HttpAdapter, which are then merged to produce a single stream. It then applies a domain-specific transformation ("tweet cleaner") and a filter operation on the stream (line 3). Further, in line 4, it applies a Map/Reduce operation on the stream by using a window followed by a groupby operator. The groupby operator creates clusters of data items based on the groupby column, which are forwarded cluster-at-a-time to the custom operator "WordCountReducer". Finally, the output stream is dumped using a PushNotificationAdapter.

Algorithm 1 Sample Dataflow Application
1: s1 ← create_stream⟨Pair⟨int, str⟩⟩(HttpAdapter, params1)
2: s2 ← create_stream⟨Pair⟨int, str⟩⟩(HttpAdapter, params2)
3: tc ← s1.merge(simple, s2).project(1).process(create_process("utils.TweetCleaner")).filter("#awesome")
4: cs ← tc.process(create_process("WordCountMap")).window(10, 0).groupby(0).process(create_process("WordCountReducer"))
5: cs.dump_stream(PushNotificationAdapter, param)

Figure 3.2 shows the corresponding abstract dataflow generated by the dataflow compiler.

Figure 3.2: Sample Generated Dataflow

3.4.2 Dynamic Dataflow Constructs

We define additional constructs to support seamless integration of dynamic dataflow elements. As described in the conceptual model, we take a two-layered approach to enable this. First, we define high-level programming constructs that allow users to develop guarded dynamic elements in their dataflow applications. Second, we develop a policy engine and a policy specification language to govern the element guards at run-time, and hence in turn govern dynamic application re-composition. The following extensions support both dynamic tasks and dynamic paths in the dataflow, as shown in Figure 3.3.

Figure 3.3: Dynamic dataflow constructs: (a) Dynamic task, (b) Dynamic path

The create_dynamic_process(Values[], GuardPolicy[], Processor[]) construct allows the creation of dynamic PEs by combining two or more user-defined processors. Each alternate processor has an associated "relative value" and a "guard policy". The former determines a domain-perceived value of that alternate relative to the "best" available alternate for that dynamic PE. This relative value can be used by the execution framework during scheduling and optimization to select one of the alternates if more than one guard is activated at a time.
The latter allows the user to specify complex business rules, using system as well as domain control signals, that determine the activation and deactivation strategy for each guard.

The dynamic_split(Values[], GuardPolicy[]) construct provides an even more powerful abstraction for building dynamic dataflows. With dynamic_split, users can specify alternates at the subgraph level, allowing them to define alternate pipelines to perform a complex operation. The dynamic_split construct is applied to a data stream and splits the stream into a number of child dynamic streams (dynamic edges). As with a dynamic task, each dynamic edge has an associated relative value and a corresponding guard policy that determines which one of the pipelines is to be activated at a given time.

3.4.3 Policy Specifications

Users must specify guard policies for each of the alternates associated with the dynamic elements. The goal of these policies is to allow users to construct complex rules to govern the activation and deactivation of the guards at run-time.

Since the dataflows under focus are long-running continuous applications executed in an ever-changing environment, the events generated by the system and the application must be monitored continuously, and the policies must be evaluated periodically to update the application composition accordingly. Complex Event Processing (CEP) systems provide these abilities out of the box [117]. CEP systems support complex event pattern specification and real-time event matching. We thus propose to use a CEP system as a component of the policy engine. This provides a highly flexible policy engine, and we can leverage the event pattern specification language provided by the CEP system to specify guard policies for the alternates.

The CEP system allows users to define arbitrarily complex event patterns, using a declarative language similar to SQL, by combining simple events from various event streams. For example, a guard defined as "[incoming_datarate ≤ 10 AND demand_supply_peak_alert == true OR foreCastError ≥ 0.3]" consists of a system event, "incoming_datarate", and two domain-specific events, "demand_supply_peak_alert" and "foreCastError". The former is generated by the execution engine, while the latter are generated by specialized diagnostic processes defined by the user in the application/data layer, which are routed back to the control plane.

Algorithm 2 shows a snippet of a sample dynamic dataflow application that uses the create_dynamic_process construct to define a dynamic task together with a CEP-based policy specification language.

Algorithm 2 Sample Dynamic Dataflow Application
1: s1 ← create_stream⟨Pair⟨str, double⟩⟩(TcpAdapter, params1)
2: tc ← s1.project(1)
3: dp ← create_dynamic_process({1.0, 0.75}, {["alert == true OR foreCastError ≥ 0.3"], ["datarate ≥ 100 AND alert == false"]}, {ARIMAPredictor, RegressionTreePredictor})
4: fc ← tc.process(dp)
5: tc.window(30, 0).join(fc, 1).process(ForeCastErrorCalculator)

Figure 3.4 shows the corresponding dataflow generated by the dataflow compiler. The compiler also generates concrete CEP rules and hooks them up to the corresponding guards to be evaluated at run-time.
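To make the role of these compiled rules concrete, the sketch below uses plain Java as a stand-in for the CEP engine (the real system compiles guard policies to Siddhi patterns; the class here is entirely ours) and evaluates the second guard from Algorithm 2, "datarate ≥ 100 AND alert == false", against the most recently monitored values.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Stand-in for the CEP-based policy engine; evaluates one rule over the latest events.
final class SimplePolicyEngine {
    private final Map<String, Object> latest = new ConcurrentHashMap<>();

    void onEvent(String name, Object value) { latest.put(name, value); }

    // Guard for the RegressionTreePredictor alternate in Algorithm 2:
    // [datarate >= 100 AND alert == false]
    boolean regressionTreeGuard() {
        double rate = ((Number) latest.getOrDefault("datarate", 0)).doubleValue();
        boolean alert = (Boolean) latest.getOrDefault("alert", false);
        return rate >= 100 && !alert;
    }
}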
The proposed guard based dynamic application 46 createStream project Detect forecast error Forecast join window A B [alert == true OR foreCastError < 0.3] [datarate > 100 AND alert == false] Guard Policies Figure 3.4: Dynamic dataflows generated by dataflow compiler modelallowsarbitrarycomplexpolicyspecificationenginesincludingmanualadap- tations, output from another floe application, or machine learning algorithms to be coupled with the data plane as we demonstrate in the evaluation section. 3.5 System Architecture and Implementation Figure 3.5 shows the overall system architecture and the interactions between different components at “compile-time” (deployment time) as well as at run-time. Following the conceptual model, we divide the run-time system into two layers, the application layer and the control layer. The application layer is responsible for executing stream operators as well as user-defined processors, communication between the processing elements in the dataflow, and monitoring QoS metrics for the application as well as the infrastructure. The control layer is responsible for activities such as resource management, dataflow scheduling and managing 47 dataflow updates. In addition, it is responsible for monitoring various system and domain control events, detect policy matches and initiate dynamic re-composition. We delegate some of these functions to the Floe stream processing engine [112]. Floe provides a number of run-time features such as distributed deployment, auto- scaling, and adaptation to infrastructure variability required for fast-data applica- tions. Floe is then responsible for activities such as dataflow scheduling, resource management and managing dataflow updates [125]. The proposed abstractions provide high level language integration for building dynamic applications which is compiled by the “Dataflow Compiler” to a Floe application and handed off to the Floe run-time for execution. Dataflow Compiler Dynamic Dataflow Application User defined processor library Abstract Floe Dataflow Container 1 Container 2 Container X … Coordinator Resource Manager Health Monitor Execution Planner (Scheduler) Policy Manager Domain Events System Events Policies (CEP Patterns) Floe Proposed Extensions Control Channel Existing Floe/CEP Components Siddhi CEP Engine Event Reservoir Figure 3.5: System Architecture We further develop a policy engine based on Siddhi Complex Event Processing (CEP) [117] system that provides ability to monitor a number of events and mine for specified events over a period of time. The guard policies specified by the user are compiled to CEP patterns and are handed off to the Siddhi CEP engine. The policy manager is responsible for making appropriate dataflow re-composition decisions by continuously monitoring the events from the event reservoir including both system and domain control events. 48 The end-to-end operations of Floe and the proposed extensions are summarized below. • Step 1: User develops a dynamic continuous dataflow using a high level programming language using constructs defined in sections 3.4.1 and 3.4.2. • Step 2: The program is compiled by the dataflow compiler on the local machine which generates an abstract dataflow, a set of dynamism policies (as discussed in section 3.4.3) and a library of user-defined processors. 
• Step 3: The abstract dataflow and the user library are submitted to the Floe coordinator's execution planner, which in turn talks to the Resource Manager to acquire a number of resources (containers) and deploys the application onto the acquired resources in a distributed environment.

• Step 4: The dynamism policies (which are generated as a set of CEP patterns) are submitted to the policy manager. The compiler generates two CEP patterns per alternate, one for activation and another for deactivation of the guard, for each dynamic element in the graph.

• Step 5: During run-time, the policy manager subscribes to events from the event reservoir and continuously mines for the defined event patterns. Whenever such a pattern is observed, the corresponding guard is either activated or deactivated and a notification is sent to the Execution Planner. The execution planner then re-schedules the dataflow, choosing the appropriate alternate to deploy based on the guard values.

• Step 6: Once the re-scheduling decisions are made by the execution planner, it talks to the Floe coordinator and the resource manager to initiate an "update" cycle for dynamic application re-composition.

We discuss these system issues in subsequent chapters to support dynamic application re-composition. If more than one guard is enabled, or if we are using value-based guards for a dynamic task or a dynamic path, it indicates the availability of more than one alternate at that time from the domain perspective. This presents a choice to the underlying execution framework and allows the system to choose any of the available alternates to optimize the overall application utility by balancing the value, resource cost, and latency/throughput as defined by the user [70]. We discuss algorithms to select the active alternate in the next chapter.

Another major issue while updating the live processors in a streaming dataflow is that the impact in terms of downtime or loss of state is non-trivial [125]. The simplest way to perform such updates would be to pause all the incoming streams (or to keep buffering, if pausing is not allowed), process pending items and flush out the entire workflow, then update the dataflow, allocating new resources if required, and finally resume the incoming streams. Such an update is considered totally consistent, since it is guaranteed that all data items that arrive after the update was initiated will be processed only by the new version of the dataflow. However, such updates involve very high update latency, which might be unacceptable in some scenarios. At the other extreme, an update can be performed by preallocating the required resources, creating a copy of the "updated" processors, and finally just re-wiring the neighbors. Such updates, while very fast, produce inconsistencies, since a data item (more precisely, a wave of data items) may be processed by old versions of some processors and new versions of others. Charith et al. propose several such algorithms with tradeoffs between update latency and consistency guarantees [125], and we leverage those to perform updates as decided by the execution planner.
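A minimal sketch of the fully consistent update strategy described above (an illustrative structure of our own, not the algorithms of [125]) makes the latency cost explicit: every step blocks the stream until the re-wired dataflow is back in place.

// Illustrative "pause-drain-update-resume" update; the faster, weaker-consistency
// variants of [125] interleave these steps instead of serializing them.
final class ConsistentUpdater {
    interface Dataflow {
        void pauseSources();          // stop (or buffer) all incoming streams
        void drainPendingMessages();  // flush in-flight tuples through the old version
        void swapAlternates();        // re-wire to the newly selected alternates
        void resumeSources();         // release buffered input to the new version
    }

    // All messages arriving after the update began are processed only by the
    // new dataflow version, at the price of a long update latency.
    static void update(Dataflow dataflow) {
        dataflow.pauseSources();
        dataflow.drainPendingMessages();
        dataflow.swapAlternates();
        dataflow.resumeSources();
    }
}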
3.6 Evaluation

Here we evaluate the proposed abstractions through the demand response application in smart power grids discussed in chapter 2, and demonstrate the concept of the dynamic dataflow model and the advantage of value-driven programming and execution.

As discussed in section 1.4.1, demand response optimization is a critical component of smart power grids, designed to perform demand-side power management through curtailment by providing incentives to the users. As shown in Fig. 1.8, the demand response pipeline is a streaming application consisting of multiple processing elements, including information integration, demand forecasting, curtailment forecasting, and customer selection. The demand and curtailment forecasting modules form the core components of DR applications and are often the most resource-intensive modules of the application. Here we focus on the demand forecasting module, but similar arguments apply to the curtailment forecasting component as well.

Various models have been proposed for demand forecasting in different scenarios, for example, time series models [80], regression models [39], Bayesian models [31], averaging models [2, 1, 37], and ARIMA and regression tree models [9]. However, previous studies [43, 77] have shown that the best prediction method for demand/curtailment predictions for different consumers depends on several factors, such as consumer type, sampling granularity, data availability, and consumption variations, and no method is universally better than the others in general. Further, it has been observed that the best prediction model for a specific customer (or a cluster of similar customers) also varies over time, due to run-time variations in the above factors (especially changes in variance over time), and hybrid models, which use a static preselected schedule [77] or dynamically select the best forecasting model [43], have been proposed that outperform any single prediction model. Here we demonstrate capturing this temporal variability using the proposed dynamic dataflows, and evaluate the advantage of an adaptive application over a static deployment with a fixed prediction model. The model selection approach presented here is based on the artificial neural network based method proposed by Frincu et al. [43].

Fig. 3.6 shows the demand response application implemented using the proposed dynamic dataflow abstractions and the separation of data and control planes. The information integration module reads the fine-grained power readings for each customer, preprocesses them, and forwards them to the consumption prediction module, which is responsible for forecasting power usage for individual customers. As discussed earlier, a single prediction algorithm cannot be selected for all customers at all times, and an intelligent adaptation mechanism is required to select the best alternate at run-time. This makes it a good candidate for a dynamic PE. As shown in Fig. 3.6, the consumption prediction module consists of several alternate implementations (which can be specified at deployment time and can also be updated at run-time as required). Further, as mentioned earlier, an important factor that determines the best model to choose is the variance of the data stream observed in the near past. We use this observation and train a neural network, giving it as input an array of variance values observed over the last one hour and using an oracle to find the best prediction model for the next prediction interval. For training we use data from the past 3 weeks and a prediction interval of one hour into the future. We implemented our backpropagation network in R using the neuralnet package. The number of hidden neurons was set to 70 and the threshold value for the stopping criterion was set to 0.01.
At run-time, we monitor and periodically evaluate the observed accuracy of the currently selected model and, if it is found to be below a threshold, we signal the control plane to re-evaluate the best model for the given customer using the neural network and the array of standard deviations observed for that customer over the past one hour. The output of the neural network gives the best model and is connected to the guards on the dynamic PE, which in turn activate and deactivate the corresponding alternate.

Figure 3.6: System Architecture. The data plane contains the information integration PE, the consumption prediction dynamic PE (alternates: ARIMA, RANDOM, NYISO, CASCE, CAISO, SLIDING1, SLIDING2) and the curtailment prediction PE; the control plane contains an artificial neural network (input, hidden, and output layers) that maps the vector of consumption standard deviations to the estimated "best" forecasting method.

The performance metric we used to quantify accuracy in the task of electricity consumption prediction is the Mean Absolute Percentage Error (MAPE):

MAPE = (1/n) × Σ_{i=1}^{n} |(a_i − f_i)/a_i| × 100,    (3.1)

where a_i and f_i represent the actual and predicted values respectively, and n is the number of predictions. MAPE is a widely used metric in smart grids for determining prediction accuracy.

Figure 3.7 shows the CDF of the MAPE values for all methods. Besides the Oracle method, which always produces the best results, we also plot a random selection method according to which the "best" method is selected at random. Clearly, the neural network, which employs a combination of forecasting methods, achieves the best results, i.e., closest to the Oracle. Specifically, the neural network is able to predict the best method for each customer with an accuracy ranging from 0.8412 to 0.9448 across the three folds. The high accuracy is corroborated by the overall deviation from the best MAPE value, which we calculated to be -0.0318. We define the deviation from the best MAPE value as follows:

(MAPE_predictedMethod − MAPE_bestMethod) / MAPE_bestMethod    (3.2)

This simple example thus motivates and demonstrates the advantage of using value-driven programming and execution. The separation of the data and control planes maintains the simplicity of application development and makes it easy to port existing applications to the dynamic dataflow model. Further, the simplicity of control provided by the guards (both activation and value-based guards) enables arbitrary control plane implementations and also allows us to update or switch the control plane at run-time without the need to restart the data plane application. We thus encourage developers to incorporate the value-based execution approach in their development to improve the value achieved from continuously running stream processing applications.

Figure 3.7: CDF of the MAPE values for all methods.

3.7 Related Work

Business and scientific workflows [6, 17, 16] and continuous dataflow systems [87, 28, 10] have garnered renewed research focus due to the recent explosion in the amount of data, both archived and real-time, and the need for large scale data analysis. Our work is closely related to stream processing and continuous dataflow systems that allow a task-graph based programming model to execute long-running continuous applications which process incoming data streams in near-real time. Other related work includes flexible workflows, service oriented architecture, and the notion of "alternates" in the heterogeneous computing paradigm. We discuss the state-of-the-art in each of these research areas in relation to our proposed system.
3.7.1 Continuous dataflows and stream processing systems

Continuous dataflow and stream processing frameworks allow users to build large scale dataflow applications. Such systems have their roots in Data Stream Management Systems (DSMS), the primary focus of which is to allow continuous queries over data streams, composed using a set of well-defined operators [14]. Scientific and general purpose dataflow systems such as S4 [88], Storm [81], and Spark Streaming [134], on the other hand, allow arbitrary processing elements, enabling much more powerful modelling languages to build highly complex, large scale dataflow as well as streaming applications.

However, to the best of our knowledge, none of the existing systems supports dynamic updates to the application composition. This limits the adaptation capabilities of the dataflow applications, especially for mission critical applications that demand high flexibility and adaptation to a changing environment. We extend the notion of continuous dataflows and propose conceptual models and programming abstractions to support the development and execution of dynamic applications that allow run-time adaptations to the changing execution environment.

3.7.2 Flexible workflows and SOA

The idea of flexible workflows [92, 102, 121] and of service selection in Service Oriented Architecture (SOA) [132, 82, 52] is to allow flexibility in workflow composition by allowing transformations on the structure of the workflow to be applied at run-time. This provides a powerful compositional tool to the developer to define business-rule based generic workflows that can be specialized at run-time based on the environment characteristics. The notion of "alternates" we propose is similar in that it allows greater flexibility to the developer and a choice of execution at run-time. However, unlike flexible workflows, where the decision about task specialization is made exactly once based on certain deterministic parameters, in continuous dataflows this decision has to be re-evaluated regularly due to the continuous and dynamic nature of the dataflow. The proposed abstractions allow users to easily specify complex rules to govern application dynamism, which are continuously monitored and re-evaluated by the system architecture.

3.7.3 Heterogeneous computing

Heterogeneous computing systems (HCS) are (usually distributed) complex systems consisting of a number of different types of resources (e.g., CPUs, GPUs) with different architectures and programming models. To exploit a heterogeneous computing environment, an application task may be composed of several subtasks that have different architectural requirements and performance characteristics. Various programming models [13, 5] and dynamic as well as static task matching and scheduling techniques have been proposed for such scenarios [75, 122, 119]. However, heterogeneous computing systems focus either on the execution of independent heterogeneous tasks or on simple workflow-like task dependencies. The concept of alternates in dynamic dataflows is similar to these, however, with a focus on continuous dataflow applications and a homogeneous programming model.

This chapter thus advances the state-of-the-art in terms of programming abstractions for continuous dataflows and streaming applications, and proposes a generic framework to support application adaptation through dynamic re-composition of the application graph at run-time.

3.8 Summary

In this chapter we introduced the notion of dynamic fast-data applications.
We further presented a conceptual model that captures both the design-time and run-time characteristics of dynamic dataflow applications. The proposed programming abstractions provide easy-to-use, high-level language constructs that allow rapid development of such applications. We also proposed a generic framework and the system components required to support dynamic applications. We motivated the need for, and presented, a CEP-based policy engine that allows users to specify complex policies that govern domain-driven dynamism and application re-composition at run-time.

We believe that the proposed novel concepts and abstractions provide very powerful and highly flexible composition tools for application developers that can be leveraged to develop large scale fast-data dynamic applications.

However, such run-time dynamic adaptations to the application also require correspondingly efficient resource management and scheduling to maximize the utilization of the available resources. Further, application dynamism coupled with infrastructure and data dynamism presents numerous challenges in terms of elastic resource management, especially under strict QoS constraints. We study these challenges and propose scheduling and resource management heuristics in the next chapter.

Chapter 4

Reactive Scheduling Heuristics

In the previous chapter we presented a novel dynamic dataflow application model that allows users to build adaptive applications capable of seamless run-time application recomposition. We also described the notion of "value-driven" programming and execution, which entails fine grained control over the application to monitor and achieve higher value from the application over its lifetime.

In this chapter we explore the resource management and scheduling heuristics required for large scale deployment of such dynamic applications on cloud infrastructure. We look at deployment issues and requirements in the presence of not only application dynamism, but also infrastructure and data dynamism. We leverage the concept of "dynamic dataflows", which utilize alternate tasks as additional control over the dataflow's cost and QoS. Further, we formalize an optimization problem to represent deployment and run-time resource provisioning that allows us to balance the application's QoS, value, and resource cost.

Contemporary stream processing systems use simple techniques to scale on elastic cloud resources to handle variable data rates. However, application QoS is also impacted by the variability in resource performance exhibited by clouds, which necessitates autonomic methods of provisioning elastic resources to support such applications on cloud infrastructure.

We propose two greedy heuristics, centralized and sharded, based on the variable-sized bin packing algorithm, and compare them against a Genetic Algorithm (GA) based heuristic that gives a near-optimal solution. A large-scale simulation study, using the Linear Road Benchmark and VM performance traces from the AWS public cloud, shows that while the GA-based heuristic provides a better quality schedule, the greedy heuristics are more practical and can intelligently utilize cloud elasticity to mitigate the effect of variability, both in input data rates and cloud resource performance, to meet the QoS of fast data applications.
4.1 Introduction

While run-time scalability and seamless fault-tolerance together are the key requirements for handling high velocity, variable-rate data streams, in this chapter we emphasize the run-time scalability of stream processing applications with increasing data velocity and their adaptation to fluctuating data rates. We deal with fault-tolerance and recovery in chapter 6.

As we discussed in chapter 2, existing stream processing (or continuous dataflow) systems (SPSs) [88, 81, 20, 135] allow users to compose applications as task graphs that consume and process continuous data, and execute on distributed commodity clusters and clouds. These systems support scalability with respect to high input data rates over static resource deployments, assuming the input rates are stable. When the input rates change, their static resource allocation causes over- or under-provisioning, resulting in wasted resources during low data rate periods and high processing latency during high data rate periods. Storm's rebalance [81] function allows users to monitor the incoming data rates and redeploy the application on-demand across a different set of resources, but requires the application to be paused. This can cause message loss or processing delays during the redeployment. As a result, such systems offer limited self-manageability to changing data rates, which we address in this and the subsequent chapter.

Recent SPSs such as Esc [103] and StreamCloud [53] have harnessed the cloud's elasticity to dynamically acquire and release resources based on application load. However, several works [59, 72, 85] have shown that the performance of public and private clouds themselves varies: for different resources, across the data center, and over time. Such variability impacts low latency streaming applications, and causes adaptation algorithms that assume reliable resource performance to fail. Hence, another gap we address here is autonomic self-optimization in response to cloud performance variability for streaming applications. In addition, the pay-per-use cost model of commercial clouds requires intelligent resource management to minimize the real cost while satisfying the streaming application's quality of service (QoS) needs.

In this chapter we push towards autonomic provisioning of continuous dataflow applications to enable scalable execution on clouds, by leveraging dynamic dataflows and cloud elasticity and addressing the following issues:

1. Autonomous run-time adaptations in response to fluctuations in both input data rates and cloud resource performance.

2. Offering flexible trade-offs to balance the monetary cost of cloud resources against the users' perceived application value.

This chapter covers the following aspects:

• Application Model and Optimization Problem: We leverage the application model for dynamic dataflows (section 4.2) as well as the infrastructure model to represent IaaS cloud characteristics, and propose an optimization problem for resource provisioning that balances the resource cost, application throughput, and the domain value based on user-defined constraints.

• Scheduling Heuristics: We present a Genetic Algorithm (GA)-based heuristic for the deployment and run-time adaptation of continuous dataflows to solve the optimization problem. We also propose efficient greedy heuristics (centralized and sharded variants) that sacrifice optimality for efficiency, which is critical for low latency streaming applications.
• Evaluation: We extend the Linear Road Benchmark (LRB) [11] as a dynamic dataflow application, incorporating dynamic processing elements, to evaluate the reactive heuristics through large-scale simulations of LRB, scaling up to 8,000 msgs/sec, using VM and network performance traces from the Amazon AWS cloud service provider. Finally, we offer a comparative analysis of the greedy heuristics against the GA in terms of scalability, profit, and QoS.

4.2 Application and Infrastructure Models

We leverage the familiar Directed Acyclic Graph (DAG) model to define Continuous Dataflows (Def. 1). We briefly summarize the application model and infrastructure model before presenting the formal scheduling problem and heuristics.

Dynamic Dataflows (Def. 3) are an extension to continuous dataflows and incorporate the concept of dynamic PEs [70] and Dynamic Edges. A dynamic PE consists of one or more user-defined alternative implementations (alternates) for the given PE, any one of which may be selected as the active alternate at run-time. Each alternate may possess different performance characteristics, resource requirements, and domain-perceived functional quality (value). Heterogeneous computing [75] and (batch processing) workflows [121] incorporate a similar notion, where the active alternates are decided once at deployment time but thereafter remain fixed during execution. We extend this to continuous dataflows, where alternate selection is an on-going process at run-time. This allows the execution framework to perform autonomic adaptations by dynamically altering the active alternates for an application to meet its QoS needs based on current conditions. For simplicity, in this chapter we focus on Dynamic PEs only, but the extension of the problem formulation and heuristics to Dynamic Edges should be straightforward. We use the following restricted definition of dynamic dataflows:

Def. 3 (Simplified Dynamic Dataflow). A Dynamic Dataflow D = ⟨P, E, I, O⟩ is a continuous dataflow where each PE P_i ∈ P has a set of alternates P_i = {p_i^1, p_i^2, ..., p_i^j | j ≥ 1}, where p_i^j = ⟨γ_i^j, c_i^j, s_i^j⟩. γ_i^j, c_i^j, and s_i^j denote the relative value, the processing cost per message, and the selectivity of the alternate p_i^j of PE P_i, respectively.

Selectivity, s_i^j, is the ratio of the number of output messages produced to the number of input messages consumed by the alternate p_i^j to complete a logical unit of operation. It helps determine the outgoing data rate of a PE relative to its input data rate, and thereby its cascading impact on downstream PEs in the dataflow.

Each alternate has associated cost and value functions to assist with alternate selection and resource provisioning decisions. The relative value, 0 < γ_i^j ≤ 1, for an alternate p_i^j is:

γ_i^j = f(p_i^j) / max_j {f(p_i^j)}    (4.1)

where f : P_i → R is a user-defined value function for the alternates. It quantifies the relative domain benefit to the user of picking that alternate. For example, a Classification PE that classifies its input tuples into different classes may use the F_1 score¹ as the quality of that algorithm to the domain, and F_1 can be used to calculate the relative value for the alternates of the PE. Finally, the processing cost per message, c_i^j, is the time (in seconds) required to process one message on a "reference" CPU core for the alternate p_i^j. The processing needs of an alternate determine the resources required to process incoming data streams at a desired rate.

¹ F_1 = 2 × (precision × recall) / (precision + recall) is a measure of the classifier's labeling accuracy.

The concept of dynamic PEs and alternates provides a powerful abstraction and an additional point of control to the user. A sample dynamic dataflow is shown in Figure 4.1 using a generic XML representation, with a visual equivalent shown in Figure 4.2(a). Any existing dataflow representation, such as declarative [94], functional [81], or imperative [135], may also be used.

<dataflow>
  <PE id="parser">
    <in id="tweet" type="string" />
    <out id="cleaned" type="string" />
    <alternate impl="Parser.TweetParser" />
  </PE>
  <PE id="classifier">
    <in id="cleaned_twt" type="string" />
    <out id="tpc_twt" type="string" />
    <alternate impl="Classifier.Bayes" value="0.65" />
    <alternate impl="Classifier.LDA" value="0.70" />
    <alternate impl="Classifier.MWE" value="1.00" />
  </PE>
  <edge source="parser:cleaned" sink="classifier:cleaned_twt" />
</dataflow>

Figure 4.1: Sample declarative representation of a Dynamic Dataflow using XML. The equivalent visual representation is in Figure 4.2(a).

Figure 4.2: (a) A sample dynamic dataflow. (b) Dataflow with selected alternates (e_1^1, e_2^2) and their initial core requirements. (c) A deployment of the dataflow onto VMs.

The sample dataflow continuously parses incoming tweets and classifies them into different topics. It consists of two PEs, parser and classifier, connected using a dataflow edge. While the parser PE has only one implementation, the classifier PE has three alternates, using the Bayes, Latent Dirichlet Allocation (LDA), and Multi-Word Enhancement (MWE) to LDA algorithms, respectively. Each alternate varies in classification accuracy and hence has a different value to the domain; these are normalized relative to the best among the three. The three alternates are available for dynamic selection at run-time. For brevity, we omit a more detailed discussion of the dynamic dataflow programming abstraction.

The execution and scalability of a dynamic dataflow application depend on the capabilities of the underlying infrastructure. Hence, we develop an infrastructure model to abstract the characteristics that impact application execution, and use it to define the resource provisioning optimization problem.
As a result, we ignore memory and disk characteristics and only use the VM’s CPU and network behavior in the infrastructure performance model used by our adaptation heuristics. As CPU core speeds may vary across VM classes, we define the normalized processing power π i of a resource classC i ’s CPU core as the ratio of its processing power to that of a reference VM core. Naïvely, this may be the ratio of their clock speeds, but could also be obtained by running application benchmarks on different sets of VMs and comparing them against a defined reference VM, or use Cloud-providers’ “ratings” such as Amazon’s Elastic Compute Units (ECUs). The set of VM resources acquired till timet is denoted byR(t) ={r 1 ,r 2 ,...,r n }. Each VM is described by r i =hC i j ,t i start ,t i stop i where C i j is the resource class to which the VM belongs, and t i start and t i stop are the times at which the VM was acquired and released, respectively. t stop =∞ for an active VM. The peer-to-peer network characteristic between pairs of VMs, r i and r j , is given by λ i×j and β i×j , where λ i×j is the network latency between VM r i and r j and β i×j is their available bandwidth. 66 VMs are typically charged at whole VM-hours by current cloud providers. The user is billed for the entire hour even if a VM is released before an hour boundary. The total accumulated monetary cost for the VMr i at timet is then calculated as: μ i (t) =d min(t stop ,t) − t start 60 e×cost per VM hour (4.2) where min(t stop ,t)−t start is the duration in minutes for which the VM has been active. We gauge the on-going performance of virtualized cloud resources, and the vari- ability relative to their rated capability, using a presumed monitoring framework. This periodically probes the compute and network performance of VMs using stan- dard benchmarks. The normalized processing power of a VM r i observed at time t is given by π i (t), and the network latency and bandwidth between pairs of active VMr i andr j areλ i×j (t) andβ i×j (t), respectively. To minimize overhead, we only monitor the network characteristics between VMs that host neighboring PEs in the DAG to assess their impact on dataflow throughput. We assume that rated network performance as defined by the provider is maintained for other VM pairs. Two PEs collocated in the same VM are assumed to transfer messages in-memory, i.e., λ i×i → 0 and β i×i →∞. 4.4 Deployment and Adaptation Approach Based on the dynamic dataflow and cloud infrastructure models, we propose a deployment and autonomic run-time adaptation approach that attempts to bal- ance simplicity, realistic cloud characteristics (e.g., billing model, elasticity), and user flexibility (e.g., dynamic PEs). Later, we formally define a meaningful yet 67 tractableoptimizationproblemforthedeploymentandrun-timeadaptationstrate- gies (chapter 4.6). We make several practical assumptions on the continuous dataflow processing framework, reflecting features available in existing systems [112, 81]: 1. The dataflow application is deployed on distributed machines, with a set of instances of each PE (ϕ(t) at timet) running in parallel across them. Incom- ing messages are actively load-balanced across the different PE instances based on the processing power of the CPU cores they run on and the length of the pending input queue. This allows us to easily scale the application by increasing the number of PE instances if the incoming load increases. 2. 
Within a single multi-core machine, multiple instances of different PEs run in parallel, isolated on separate cores. A core is exclusively allocated to one PE instance. We assume that there is minimal interference from system processes. 3. Running n data-parallel instances of a PE on n CPU cores, each with pro- cessing power π = 1, is equivalent to running 1 instance of the PE on 1 CPU core with π =n. 4. The active alternate for a PE is not dependent on the active alternate of any other PE, since each alternate for a given PE follows the same input/out- put format. This allows us to independently switch the active alternate for different PEs during run-time. 5. The framework can spin up or shutdown cloud VMs on-demand (with an associated startup/shutdown latency) and can periodically monitor the VM characteristics, such as CPU performance and network bandwidth. 68 Given these assumptions we define the following deployment model. When a dynamic dataflow, such as Figure 4.2(a), is submitted for execution, the scheduler for the stream processing framework needs to make several decisions: alternate selection for each dynamic PE, acquisition of VMs, mapping of these PEs to the acquired VMs, and deciding the number of data parallel instances per PE. These activities are divided into two phases: deployment time and run-time strategies. Deployment time strategies select the initial active alternate for each PE, and determine their CPU core requirements (relative to the “reference” core) based on estimated initial message data rates and rated VM performance. Figure 4.2(b) showstheoutcomeofselectingalternates, pickinge 1 1 ande 2 2 forPEsE 1 andE 2 with their respective core requirements. Further, it determines the VMs of particular resource classes that are instantiated, and the mappingM from the data-parallel instances of the each PE (ϕ(t)) to the active VMs (R(t)), following which the dataflow execution starts. Figure 4.2(c) shows multiple data-parallel instances of these PEs deployed on a set VMs of different types. Note that the number of PE instances in Figure 4.2(c) –ϕ(t) is 2 and 5 fore 1 1 ande 2 2 – is not equal to the core requirements in Figure 4.2(b) – 4 and 9 cores – since some instances are run on faster CPU cores (π> 1). Runtime strategies, on the other hand, are responsible for periodic adaptations to the application deployment in response to the variability in the input data rates and the VM performance obtained from the monitoring framework. These active decisions are determined by the run-time heuristics that can decide to switch the active alternate for a PE, or change the resources allocated to a PE within or across VMs. The acquisition and release of VMs are also tied to these decisions as they determine the actual cost paid to the cloud service provider. A formal 69 definitionoftheoptimizationproblemandthesecontrolstrategieswillbepresented in section 4.6. 4.5 Metrics for Quality of Service In chapter 4.2, we captured the value, and processing requirements for indi- vidual PEs and their alternates using the metrics: relative value (γ j i ), alternate processing cost (c j i ), and selectivity (s j i ). In this section, we expand these QoS metrics to the entire dataflow application. We define an optimization period T for which the dataflow is executed. This optimization period is divided into time intervals T ={t 0 ,t 1 ,...,t n }. We assume these interval lengths are constant,4t =t i+1 −t i . 
For brevity we omit the suffix i while referring to the time interval t_i, unless necessary for disambiguation.
A dataflow is initially deployed with a particular configuration of alternates which can later be switched during run-time to meet the application's QoS. To keep the problem tractable and avoid repetitive switches, these changes are only made at the start of each interval t_i. This means that during a time interval t, only a specific alternate for a PE P_i is active. The value of the PE P_i during the time interval t is thus:

\Gamma_i(t) = \sum_{p_i^j \in P_i} A_i^j(t) \cdot \gamma_i^j, \quad \text{where } A_i^j(t) = \begin{cases} 1, & \text{if alternate } p_i^j \text{ is active at time } t \\ 0, & \text{otherwise} \end{cases}

Since value can be perceived as an additive property [62] over the dataflow DAG, we aggregate the individual values of active alternates to obtain the value for the entire dynamic dataflow during the time interval t.

Def. 4 (Normalized Application Value). The normalized application value, 0 < \Gamma(t) \le 1, for a dynamic dataflow D during the time interval t is:

\Gamma(t) = \frac{\sum_{P_i \in P} \Gamma_i(t)}{|P|} \quad (4.3)

where |P| is the number of PEs in the dataflow.

The application's value thus obtained gives an indication of its overall quality from the domain's perspective and can be considered as one QoS dimension. Another QoS criterion, particularly in the context of continuous dataflows, is the observed application throughput. However, raw application throughput is not meaningful because it is a function of the input data rates during that time interval. Instead we define the relative application throughput, \Omega, built up from the relative throughput 0 < \Omega_i(t) \le 1 of individual PEs P_i during the interval t. These are defined as the ratio of the PE's current output data rate (absolute throughput) o_i(t) to the maximum achievable output data rate o_i^{max}(t):

\Omega_i(t) = \frac{o_i(t)}{o_i^{max}(t)}

The output data rate for a PE depends on the selectivity of the active alternate, and is bound by the total resources available to the PE to process the inputs given the processing cost of the active alternate. The actual output data rate during the interval t is:

o_i(t) = \min\left( q_i(t) + i_i(t) \cdot \Delta t, \; \frac{\phi_i \cdot \Delta t}{c_i^j} \right) \times \frac{s_i^j}{\Delta t} \quad (4.4)

where q_i(t) is the number of pending input messages in the queue for PE P_i, i_i(t) is the input data rate in the time interval t for the PE, \phi_i is the total core allocation for the PE (\phi_i = \sum_k \pi_k), c_i^j is the processing cost per message, and s_i^j is the selectivity for the active alternate p_i^j.

The maximum output throughput is achieved when there are enough resources for P_i to process all incoming data messages, including the messages pending in the queue at the start of the interval t. This is given by o_i^{max}(t) = \frac{(q_i(t) + i_i(t) \times \Delta t) \times s_i^j}{\Delta t}.

While the input data rate for the source PEs is determined externally, the input rate for other PEs can be characterized as follows. The flow of messages between consecutive PEs is limited by the bandwidth, during the interval t, between the VMs on which the PEs are deployed. We define the flow f_{i,j}(t) from P_i to P_j as:

f_{i,j}(t) = \begin{cases} \min\left( o_i, \; \frac{\beta_{i,j}(t) \cdot \Delta t}{m} \right), & \langle P_i, P_j \rangle \in E \\ 0, & \text{otherwise} \end{cases} \quad (4.5)

where \beta_{i,j}(t) is the available cumulative bandwidth between all instances of P_i and P_j at time t and m is the average output message size. Given the multi-merge semantics for incoming edges, the input data rate for P_k during time t is:

i_k(t) = \sum_j f_{j,k}(t) \quad (4.6)

Unlike the application's value, its relative throughput is not additive as it depends on the critical processing path of the PEs during that interval.
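To make these per-PE quantities concrete, the short Python sketch below evaluates Eqns. 4.4 and 4.5 and the per-PE relative throughput for a single interval. It only illustrates the formulas; all function and parameter names are ours rather than part of the framework, and the edge flow is expressed as a rate by dividing Eqn. 4.5 by \Delta t.

def output_rate(q, in_rate, dt, phi, cost, selectivity):
    # Eqn. 4.4: q pending msgs, in_rate i_i(t) in msgs/sec, dt interval length (sec),
    # phi total normalized cores, cost c_i^j in core-seconds/msg, selectivity s_i^j.
    consumable = q + in_rate * dt        # messages available during the interval
    processable = (phi * dt) / cost      # messages the allocated cores can process
    return min(consumable, processable) * selectivity / dt

def max_output_rate(q, in_rate, dt, selectivity):
    # o_i^max(t): all available messages (queued plus arriving) are processed.
    return (q + in_rate * dt) * selectivity / dt

def edge_flow(out_rate, bandwidth, msg_size):
    # Eqn. 4.5 as a rate: at most bandwidth/msg_size msgs/sec cross the edge.
    return min(out_rate, bandwidth / msg_size)

def pe_relative_throughput(q, in_rate, dt, phi, cost, selectivity):
    o_max = max_output_rate(q, in_rate, dt, selectivity)
    return output_rate(q, in_rate, dt, phi, cost, selectivity) / o_max if o_max > 0 else 1.0

# Example: 2 reference cores, a 10 ms/msg alternate with selectivity 0.5,
# 100 queued msgs and 400 msgs/sec arriving over a 300 s interval gives Omega_i of about 0.5:
# pe_relative_throughput(q=100, in_rate=400, dt=300, phi=2.0, cost=0.01, selectivity=0.5)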
The relative throughput for the entire application is the ratio of the observed cumulative outgoing data rate from the output PEs, O = {O_i}, to the maximum achievable output rate for those output PEs, for the current input data rate.

Def. 5 (Relative Application Throughput). The relative application throughput, 0 < \Omega(t) \le 1, for a dynamic dataflow D = \langle P, E, I, O \rangle during the time interval t is:

\Omega(t) = \frac{\sum_{P_i \in O} \Omega_i(t)}{|O|} \quad (4.7)

The output data rate for the output PEs is obtained by calculating the output data rate of individual PEs (Eqn. 4.4) followed by the flow f_{i,j} between consecutive PEs (Eqn. 4.5), repeatedly, in a breadth-first scan of the DAG starting from its input PEs, till the input and output data rates for the output PEs are obtained.

Normalized Application Value, \Gamma(t), and Relative Application Throughput, \Omega(t), together provide complementary QoS metrics to assess the overall application execution, which we use to define an optimization problem that balances these QoS metrics based on user-defined constraints in the next section.

4.6 Problem Formulation
We formulate the optimization problem as a constrained utility maximization problem during the period T for which the dataflow is executed. The constraint ensures that the expected relative application throughput meets a threshold, \overline{\Omega} \ge \widehat{\Omega}; the utility to be maximized is a function of the normalized application value, \overline{\Gamma}, and the cost for cloud resources, \overline{\mu}, during the optimization period T.

For a dynamic dataflow D = \langle P, E, I, O \rangle, the estimated input data rates, I(t_0) = {i_j(t_0)}, at each input PE, P_j \in I, at the initial time, t_0, are given. During each subsequent time interval, t_i, based on the observations of the monitoring framework during t_{i-1}, we have the following: the observed input data rates, I(t) = {i_j(t)}; the set of active VMs, R(t) = {r_1, r_2, ..., r_m}; the normalized processing power per core for each VM r_j, \pi(t) = {\pi_j(t)}; and the network latency and bandwidth between pairs of VMs r_i, r_j \in R(t) hosting neighboring PEs, \lambda(t) = {\lambda_{i \times j}(t)} and \beta(t) = {\beta_{i \times j}(t)}, respectively.

[Figure 4.3: A sample linear function for trade-off between cost (C) and value (\Gamma). The line denotes break-even, with slope \sigma; costs below the line are profit, costs above are loss.]

At any time interval t, we can calculate the relative application throughput \Omega(t) (Eqn. 4.7), the normalized application value \Gamma(t) (Eqn. 4.3), and the cumulative monetary cost \mu(t) till time t (Eqn. 4.2). The average relative application throughput (\overline{\Omega}), the average relative application value (\overline{\Gamma}), and the total resource cost (\overline{\mu}) for the entire optimization period T = {t_0, t_1, ..., t_n} are:

\overline{\Omega} = \frac{\sum_{t \in T} \Omega(t)}{|T|} \qquad \overline{\Gamma} = \frac{\sum_{t \in T} \Gamma(t)}{|T|} \qquad \overline{\mu} = \mu(t_n)

We define the combined utility as a function of both the total resource cost (\overline{\mu}) and the average application value (\overline{\Gamma}). To help users trade off between cost and value, we allow them to define the expected maximum cost at which they break even for the two extremes of application value, i.e., the values obtained by selecting the best alternates for all PEs, on one end, and by selecting the worst alternates, on the other. For simplicity, we assume a linear function to derive the expected maximum resource cost at an intermediate application value, as shown in Figure 4.3.
If the actual resource cost lies below this break-even line we consider it a profit, while if it lies above, we consider it a loss. This can be captured using the following objective function, \Theta, which is to be maximized over the optimization period under the constraint \overline{\Omega} \ge \widehat{\Omega}:

\Theta = \overline{\Gamma} - \sigma \cdot (\overline{\mu} - C_{max}^{\Gamma_{max}}) + 1 \quad (4.8)

where \sigma is an equivalence coefficient between cost and value given by the slope:

\sigma = \frac{\Gamma_{max} - \Gamma_{min}}{C_{max}^{\Gamma_{max}} - C_{max}^{\Gamma_{min}}} \quad (4.9)

\Gamma_{max} and \Gamma_{min} are the maximum and minimum possible relative application values when picking the alternates with the best and worst values for each PE, respectively, while C_{max}^{\Gamma_{max}} and C_{max}^{\Gamma_{min}} are the user-defined break-even resource costs at \Gamma_{max} and \Gamma_{min}.

Given the deployment approach (chapter 4.4), the above objective function \Theta can be maximized by choosing appropriate values for the following control parameters at the start of each interval t_i during the optimization period:
• A_i^j(t), the active alternate j for the PE P_i;
• R(t) = {r_j(t)}, the set of active VMs;
• \varphi(t) = {\varphi_j(t)}, the set of data-parallel instances for each PE P_j; and
• M(t) = {\varphi_j(t) \to M_{j \times k}(t)}, the mapping of the data-parallel instances \varphi_j for PE P_j to the actual VM r_k.

Optimally solving the objective function \Theta with the \Omega constraint is NP-Hard. While techniques like integer programming and branch-and-bound have been used to optimally solve some NP-hard problems [126], these do not adequately translate to low-latency solutions for continuous adaptation decisions. The dynamic nature of the application and the infrastructure, as well as the tightly-bound decision making interval, means that fast heuristics performed repeatedly are better than slow optimal solutions. We thus propose simplified heuristics to provide an approximate solution to the objective function.

Other approximate procedures such as gradient descent are not directly applicable to the problem at hand since the optimization problem presents a non-differentiable, non-continuous function. However, nature-inspired search algorithms such as Genetic Algorithms (GAs), ant-colony optimization, and particle-swarm optimization, which follow a guided randomized search, are sometimes more effective than traditional heuristics in solving similar single and multi-objective workflow scheduling problems [21, 44, 129, 136]. In this chapter, we also explore GAs for finding an approximate solution to the optimization problem. While GAs are usually slow and depend on the population size and the complexity of the operators used, they provide a good baseline to compare against our greedy heuristics. The GA-based approach is discussed in the next section, followed by the greedy deployment and adaptation heuristics in section 4.8.

4.7 Genetic Algorithm-based Scheduling Heuristics
A GA [55] is a meta-heuristic used in optimization and combinatorial problems. It facilitates the exploration of a large search space by iteratively evolving a number of candidate solutions towards the global optimum. The GA meta-heuristic abstracts out the structure of the solution for a given problem in terms of a chromosome made up of several genes. It then explores the solution space by evolving a set of chromosomes (potential solutions) over a number of generations. The GA initially generates a random population of chromosomes which act as the seed for the search. The algorithm then performs genetic operations such as crossover and mutations to iteratively obtain successive generations of these chromosomes.
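As a point of reference, the generic GA loop just described can be summarized in a few lines of Python. This is only the textbook skeleton with illustrative names; the chromosome encoding, fitness function and operators specific to our scheduling problem are defined below.

import random

def genetic_search(random_chromosome, fitness, crossover, mutate,
                   population_size=50, generations=100, elite_frac=0.05):
    # Seed the search with a random population of candidate solutions.
    population = [random_chromosome() for _ in range(population_size)]
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        elite = ranked[:max(1, int(elite_frac * population_size))]  # retain the fittest
        offspring = []
        while len(offspring) < population_size - len(elite):
            # Bias parent selection toward the fitter half, then recombine and mutate.
            a, b = random.sample(ranked[:population_size // 2], 2)
            offspring.append(mutate(crossover(a, b)))
        population = elite + offspring
    return max(population, key=fitness)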
The crossover operator takes a pair of parent chromosomes and generates an offspring chromosome by crossing over individual genes from each parent. This helps potentially combine partial solutions from the parents into a single offspring. Further, the mutation operator is used to randomly alter some parts of a given chromosomes and advance the search by possibly avoiding getting stuck in a local optimum. The algorithm then applies a selection operator which picks the best chromosomes from the entire population based on their fitness values and eliminates the rest. This process is repeated until a stopping criterion, such as a certain number of iterations or the convergence of a fitness value, is met. We adapt GA to our optimization problem by defining these domain-specific data structures and operators. Chromosome: The solution to the optimization problem (Eqn. 4.8) requires (1) determining the best alternate to activate for each PE, and (2) the type and number of VMs, and the mapping of data-parallel PE instances to these VMs. We capture both of these aspects of the deployment configuration using a double stranded chromosome (Figure 4.4). Each chromosome represents one possible deployment configuration and the goal is to find the chromosome with the optimal configuration. The first strand (S1) represents the PEs and their active alternates, with each gene in the strand representing a PEP i in the dataflow The value of the i th gene holds the indexj of the active alternatep j i for the corresponding PE. The 77 Population 0 S1 S2 0 E 1 Population 4. Survive Fittest {0,1,1} {1,1} 0 S1 S2 2 E 2 {0,0,1} {1} 0 S1 S2 1 1 0 Bit mask 0 S1 S2 2 {0,0,1} {1,1} OF 1 2. Cross Over 1. Select Parent Pairs 3. Mutation 0 S1 S2 1 {0,0,1} {0,1,1} OF 1 0 S1 S2 1 E 2m-1 {0,0} {1,1} 0 S1 S2 0 E 2m {0,1} {1} … … … A ct i v e A lte r na te : : PE Inst anc es : : 0 1 2 1 Parent Chromosome Figure 4.4: Sample iteration for the GA heuristic. second strand (S2) contains genes that represent the list of available VMs with the list of PE instances running on them. The value of the i th gene in strand S2 holds the list of index values j for PEsP j mapped to the VM r i . The chromosome size (i.e. number of genes) is fixed based on the available resource budget. For example, in Fig 4.4, the chromosome E 1 represents a deployment where strand S1 has two PEsP 0 andP 1 with active alternates p 0 0 and p 0 1 , respectively. Strand S2 identifies two VMs, r 0 and r 1 , with r 0 running one instance of PEP 0 and two instances of PEP 1 on it, whiler 1 has two instances of PEP 1 running on it. Fitness Function: The objective function Θ acts as the fitness function for the chromosome, anddependsonboththestrandsofthechromosome. Thethroughput constraint Ω≥ b Ω is incorporated using a weighted penalty function that penalizes 78 the fitness value if a constraint is violated. This ensures that while the chromo- some is penalized, it is not disregarded immediately and given a chance to recover through mutations. In addition, during run-time, the penalty function also con- siders the number of changes induced on the deployment to reduce the overhead of frequent changes in the system. 
penalty = −α·| Ω− b Ω|·it , deployment −α·| Ω− b Ω|·it +α 0 ·v·it , run-time where, α and α 0 are constants, v is the number of deployment changes observed in the chromosome as compared to the previous deployment andit is the iteration count for GA.it ensures that the penalty is increased as the chromosome survives over multiple iterations and hence allows removal of unfit chromosomes. Crossover: Each parent chromosome is first selected from the population using a probabilistic ranked model (similar to the Roulette wheel) [50] while also retaining the top 5% of the chromosomes with best fitness values. Next the parent chromo- somes are paired randomly and a random bit mask is generated to choose the a gene from either parent to produce the offspring chromosome. For e.g., in Figure 4.4, parentsE 1 (white) andE 2 (gray) are selected for crossover, and a random bit mask is used to decide if the gene from the first parent (bit is 0) or the second parent (bit is 1) is retained in the offspring OF 1 . Mutation: We allow independent mutation for the two chromosome strands. For the first strand with PE alternates the mutation involves randomly switching the active alternate. For the second strand of VM instances and mapping, we probabilistically decide whether to remove or add a PE instance for each VM. For example, in Figure 4.4, the offspring OF 1 undergoes mutation by switching the 79 active alternate forP 1 fromp 2 1 →p 1 1 , and by adding an instance ofP 0 to the second VM. Mutated genes are shown with a dotted pattern. While GAs explore a wide range of solutions and tend to give near-optimal solutions, their convergence performance becomes a bottleneck during run-time adaptation. Hence we design sub-optimal greedy heuristics that trade optimality for speed, making them better suited for streaming applications. 4.8 Greedy Scheduling Heuristics In this section, we propose greedy heuristics to find an approximate solution to the optimization problem. As before, the algorithm is divided into the initial deployment phase and the run-time adaptation phase. For the proposed heuristic, we provide sharded (SH) and centralized (CE) vari- ants that differ in the quanta of information needed and the execution pattern of the scheduler. The sharded version uses one scheduler per PE, and all data-parallel instances of a PE communicate with their scheduler, potentially across VMs. How- ever, schedulers for different PEs do not communicate. Hence each scheduler only has access to its PE instances. In the centralized version, a single scheduler gathers information about the entire dataflow and hence has a global view of the execution of all PEs. As we show later, while the SH scheduler is inherently decentralized and reduces the transfer of monitoring data during execution, the CE variant, due to its global view, is more responsive to changes in the execution environment. 4.8.1 Initial Deployment Heuristic The initial deployment algorithm (Alg. 3) is divided into two stages: Alternate selection (lines 2–11) and Resource allocation (lines 12–25). These algorithms 80 Algorithm 3 Initial Deployment Heuristic Algorithm 1: procedure InitialDeployment(Dataflow D) . Alternate Selection Stage 2: for PE P∈D do 3: for Alternate A∈P do 4: c A P ← GetCostOfAlternate(A) 5: γ←A.Value 6: if γ/c A P ≥best then 7: best←γ/c A P 8: selected←A 9: end if 10: end for 11: end for . 
Resource Allocation Stage 12: while Ω≤ b Ω do 13: if (VM.isAvailable =false) then 14: VM← InitializeVM(LargestVMClass) 15: end if 16: P← GetNextPE 17: CurrentAlloc←AllocateNewCore(P,VM) 18: Ω← GetEstimatedThruput(D,CurrentAlloc) 19: end while 20: for PE P∈D do 21: if IsOverProvisioned(P) then 22: RepackPEInstances(P). Move PE instances to a VM with lower core capacity 23: end if 24: end for 25: RepackFreeVMs . Repack PEs in VMs with free cores to VMs with less number of cores 26: end procedure are identical for both SH and CE schedulers; however, their costing functions differ (Table 4.1). The alternate selection stage ranks each PE alternate based on the ratio of its value to estimated cost (line 4), and chooses the one with the highest ratio. Since we do not know the actual cost for the alternates until resource allocation, the heuristic uses the estimated processing requirements (c A P ) as an approxima- tion. The GetCostOfAlternate function varies between the SH and CE ver- sions. The SH strategy calculates an alternate’s cost based on only its processing requirements, while the CE strategy calculates the cost of the alternate as the sum 81 of both its own processing needs and that of its downstream PEs – intuitively, if an upstream PE has more resources allocated, its output message rate increases and this has a cascading impact on the input rate of the succeeding PEs. Also, a higher selectivity upstream PE will further impact the successors’ cost since they will have to process more messages. This cost is calculated using a dynamic pro- gramming algorithm by traversing the dataflow graph in reverse BFS order rooted at the output PEs. This is followed by a resource selection stage (lines 12–25) which operates sim- ilar to the variable-sized bin packing (VBP) problem [64]. For the initial deploy- ment, in the absence of running VMs, we assume that each VM from a resource class behaves ideally as per its rated performance. The algorithm picks PEs (objects) in an order given by GetNextPE, and puts an instance of each PE in the largest VM (bin) (line 17), creating a new VM (bin) if required. It then calculates the estimated relative throughput for the application given the current allocation (line 18) using Eqn. 4.7 and repeats the procedure if the application constraint is not met. It should be noted that theGetEstimatedThroughput functionconsidersboththeallocatedcoresandtheavailablebandwidthtocalculate the relative throughput and hence scales out when either becomes a bottleneck. The intuition behind GetNextPE is to choose PEs in an order that not only increases VM utilization but also limits the message transfer latency between the PEs by collocating neighboring PEs in the dataflow within the same VM. We order the PEs using a forward DFS traversal, rooted at the input PEs, and allocate resourcestotheminthatordersoastoincreasetheprobabilityofcollocatingneigh- boring PEs. It should be noted that the CPU cores required for the individual PEs are not known in advance as the resource requirements depend on the current load which in turn depends on the resource requirements of the preceding PE. 
Hence, after assigning at least one CPU core to each PE (IncrementAllocation), the deployment algorithm chooses PEs in the order of the largest bottlenecks in the dataflow, i.e., the lowest relative PE throughput (\Omega_i). This ensures that PEs needing more resources are chosen first for allocation. This in turn may increase the input rate (and processing load) on the successive PEs, making them the bottlenecks. As a result, we end up with an iterative approach that incrementally allocates CPU cores to PEs using the VBP heuristic until the throughput constraint is met. Since the resource allocation only impacts downstream PEs, this algorithm is bound to converge. We leave a theoretical proof to future work.

At the end, the algorithm performs two levels of repacking. After a solution is obtained using VMs from just the largest resource class, we first move one instance of each over-provisioned PE to the smallest resource class large enough to accommodate that PE instance (best fit, using RepackPEInstances). This may free up capacity on the VMs, and hence, we again use iterative repacking [64] (RepackFreeVMs) to repack all the VMs with spare capacity to minimize wasted cores. During this process, we may sacrifice instance collocation in favor of reduced resource cost. Our evaluation however shows that this is an acceptable trade-off toward maximizing the objective function. Note that these algorithms are all performed off-line, and the actual deployment is carried out only after these decisions are finalized.

Both the order in which PEs are chosen and the repacking strategy affect the quality of the heuristic. While the sharded strategy SH uses a local approach and does not perform any repacking, the centralized strategy CE repacks individual PEs and VMs, as shown in Table 4.1.

Table 4.1: Functions used in Initial Deployment Strategies
Function           | Sharded (SH) | Centralized (CE)
GetCostOfAlternate | A.cost       | A.cost + s_i × Σ successor.cost
GetNextPE          | (same for both) if all PEs are assigned, return argmin_{P_j ∈ P}(Ω_t^j); else return the next PE in DFS order
RepackPEInstances  | N/A          | Move the PE instance to the smallest VM big enough for the required core-secs
RepackFreeVMs      | N/A          | Iterative repacking [64]

4.8.2 Runtime Adaptation Heuristic
The run-time adaptation kicks in periodically over the lifetime of the application execution. Alg. 4 considers the current state of the dataflow and cloud resources – available through monitoring – in adapting the alternate and resource selection. The monitoring gives a more accurate estimate of data rates, and hence of the resource requirements and their cost. As before, the algorithm is divided into two stages: Alternate selection and Resource allocation. However, unlike the deployment heuristic, we do not run both stages in the same time interval. Instead, the alternates are selected every m intervals and the resources reallocated every n intervals. The former tries to switch alternates to achieve the throughput constraint given the existing allocated resources, while the latter tries to balance the resources (i.e., provision new VMs or shut down existing ones) given the alternates that are active at that time.

Separating these stages serves two goals. First, it makes the algorithm for each stage more deterministic and faster since one of the parameters is fixed. Second, it reduces the number of retractions of deployment decisions occurring in the system.
For e.g., if a decision to add a new VM leads to over-provisioning at 84 a later time (but before the hourly boundary), instead of shutting down the VM, the alternate selection stage can potentially switch to an alternate with higher value, thus utilizing the extra resources available and in the process increase the application’s value. During the alternate selection stage, given the current data rate and resource performance, we first calculate the resources needed for each PE alternate (line 6). We then create a list of “feasible” alternates for a given PE, based on whether the current relative throughput is lesser or greater than the expected throughput b Ω. Finally, we sort the feasible alternates in decreasing order of the ratio between value to cost, and select the first alternate which can be accommodated using the existing resource allocation. After this phase the overall value either increases or decreases depending on whether the application was over-provisioned or under- provisioned to begin with, respectively. TheresourceRedeployprocedureisusedtoallocateorde-allocateresources to maintain the required relative throughput while minimizing the overall cost. If the Ω≤ b Ω−, the algorithm proceeds similar to the initial deployment algorithm. It incrementally allocates additional resources to the bottlenecks observed in the system and repacks the VMs. However, if Ω> b Ω +, the system must scale in to avoid resource wastage and has two decisions to make: first, which PE needs to be scaled in and second, which instance of the PE is to be removed, thus freeing the CPU cores. The over-provisioned PE selected for scale in is the one with the maximum relative throughput (Ω). Once the over-provisioned PE is determined, to determine which instance of that PE should be terminated, we get the list of VMs on which these instances are running and then weigh these VMs using the following “weight” function (eqn 4.10). Finally, a PE instance which is executing on the least weighted VM is selected for removal. 85 VM Weight(r i ) = T c (r i )× FreeCores(r i ) TotalCores(r i ) ×(1− ϕr i ϕ )× TotalCores(r i )×π Cost Per VM Hour (4.10) where T c is the time remaining in the current cost cycle (i.e. time till the next hourly boundary), ϕ r i is the number of PE instances for the over-provisioned PE on the VM r i , and ϕ is the total number of instances for that PE across all VMs. The VM Weight is lower for VMs with less time left in their hourly cycle, and thus preferred for removal. This increases temporal utilization. Similarly, VMs with fewer cores used are prioritized for removal. Further, VMs with higher cost per normalized core have a lower weight so that they are selected first for shutdown. Hence the VM Weight metric helps us pick the PE instances in a manner that can help free up costlier, under-utilized VMs that can be shutdown at the end of their cost cycle to effectively reduce the resource cost. 4.9 Evaluation We evaluate the proposed heuristics through a simulation study. To emulate real-world cloud characteristics, we extend CloudSim [24] simulator to IaaSSim, that incorporates temporal and spatial performance variability using VM and net- work performance traces collected from IaaS Cloud VMs 2 . Further, FloeSim simulates the execution of the dynamic dataflows [112], on top of IaaSSim, with support for dataflows, alternates, distributed deployment, run-time scaling, and plugins for different schedulers. 
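To tie the run-time pieces together, the sketch below shows how a scheduler plugged into such a framework could be driven at each timestep, with alternates reconsidered every m intervals and resources every n intervals as described in chapter 4.8.2. All class and method names here are hypothetical; this is not FloeSim's actual plugin API, only an illustration of the control flow.

class AdaptationDriver:
    # Illustrative per-timestep driver for the two-stage run-time heuristic.
    def __init__(self, scheduler, m, n):
        self.scheduler = scheduler   # assumed to expose the two redeploy stages
        self.m = m                   # alternate-selection period, in timesteps
        self.n = n                   # resource-reallocation period, in timesteps
        self.step = 0

    def on_timestep(self, monitor):
        # Observations from the monitoring framework for the previous interval.
        omega = monitor.relative_throughput()   # observed Omega(t)
        rates = monitor.input_rates()           # i_j(t) for the input PEs
        vm_perf = monitor.vm_performance()      # pi_j(t), lambda(t), beta(t)
        # The two stages run on separate cadences so that each optimizes with the
        # other dimension (alternates or resources) held fixed.
        if self.step % self.m == 0:
            self.scheduler.alternate_redeploy(omega, rates, vm_perf)
        if self.step % self.n == 0:
            self.scheduler.resource_redeploy(omega, rates, vm_perf)
        self.step += 1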
To simulate data rate variations, the given data rate is considered as an average value and the instantaneous data rate is obtained using a random walk between±50% of that value. However, to enable comparisons 2 IaaSSimandperformancetracesareavailableathttp://github.com/usc-cloud/IaaSSimulator 86 Algorithm 4 Runtime Adaptation Heuristic Algorithm 1: procedure AlternateReDeploy(DataflowD, Ω t ) . Ω t is the observed relative throughput 2: for PE P∈D do . Alternate selection phase 3: alloc← CurrAllocatedRes(P) . Gets the current allocated resources (accounting for Infra. variability) 4: requiredC← RequiredRes(P.activeAlternate) . Gets the required resources for the selected alternate 5: for AlternateA∈P do 6: requiredA← ActualResRequirements(A) 7: if Ω t ≤ b Ω− then 8: if requiredA≤requiredC then . Select alternate with lower requirements 9: feasible.Add(A) 10: end if 11: else if Ω t ≥ b Ω + then 12: if requiredA≥requiredC then . Select alternate with higher requirements 13: feasible.add(A) 14: end if 15: end if 16: end for 17: Sort(feasible) . Decreasing order of value/cost 18: for feasible alternate A do 19: if requiredA<alloc then 20: SwitchAlternate(A) 21: done 22: end if 23: end for 24: end for 25: end procedure 1: procedure ResourceReDeploy(DataflowD, Ω t ) 2: if Ω t ≤ b Ω− then 3: Same procedure as initial deployment 4: else if Ω t ≥ b Ω + then 5: while Ω≥ b Ω do 6: PE←overProvisionedPE 7: instance← SelectInstanceToKill(PE) 8: newAlloc← RemovePEInstance(instance) 9: Ω← GetEstimatedThruput(D,newAlloc) 10: end while 11: repackFreeVMs() . Repack PEs in the VMs with free cores onto smaller VMs with collocation 12: end if 13: end procedure between different simulation runs, we generate this data trace once and use the same across all simulation runs. 87 For each experiment, we deploy the Linear Road Benchmark (LRB) [11] as a dynamic dataflow using FloeSim, run it for 12 simulated hours (T = 12 hrs) on simulated VMs whose performance traces were obtained from real Amazon EC2 VMs. Each experiment is repeated at least three times and average values are reported. We use a timestep duration of t = 5 mins at the start of which adaptation decisions are made. 4.9.1 Linear Road Benchmark (LRB) The Linear Road Benchmark [11] is used to compare the performance of data stream management systems and has been adopted to general purpose stream processing systems [26]. Using this as a base, we develop a dynamic continuous dataflow application to evaluate the proposed scheduling heuristics. LRB models a road toll network within a confined area (e.g., 100 sq. miles), in which the toll depends on several factors including time of the day, current traffic congestion levels and proximity to accidents. It continuously ingests “position reports” from different vehicles on the road and is responsible for (i) detecting average speed and traffic congestion for a section, (ii) detecting accidents, (iii) providing toll notifications to vehicles whenever they enter a new section, (iv) answering account balance queries and toll assessments, and (v) estimating travel times between two sections. The goal is to support the highest number of expressways while satisfying the desired latency constraints. To simulate realistic road conditions, the data rate varies from around 10 msgs/sec to around 2,000 msgs/sec per expressway. Figure 4.5 shows the LRB benchmark implemented as a dynamic continuous dataflow. The Weather Updates (P 0 ) and Report Parse (P 1 ) PEs act as the input PEs for the dataflow. 
While the former receives low frequency weather updates, the latter receives extremely high frequency position reports from individual vehicles (each car sends a position report every 30 secs) and exhibits variable data rates based on the current traffic conditions. The Congestion Estimation PE (P_2) estimates current as well as near-future traffic conditions for different sections of all the monitored expressways. This PE may have several alternates using different machine learning techniques that predict traffic with different accuracy and future horizons. We simulate three alternates with different value (γ), cost (c) and selectivity (s) values as shown in the tables in Figure 4.5. The Accident Detector PE (P_3) detects accidents based on the position reports, which are forwarded to the Toll Calculator (P_4) and Travel Time Estimator (P_7) PEs. The former notifies toll values (P_5) and account balances (P_6) to the vehicles periodically. The latter (P_7) provides travel time estimates, and has several alternates based on different forecasting models. For the simulations, we use two alternates, e.g., (1) a decision/regression tree model which takes several historical factors into account, and (2) a time series model which predicts using only the recent past traffic conditions.

[Figure 4.5: Dynamic Continuous Dataflow for LRB. Alternates for P_2 and P_7 have value (γ), cost (c) and selectivity (s): P_2 alternates (1.0, 1.2, 1.0), (0.8, 0.8, 1.0), (0.5, 0.8, 0.5); P_7 alternates (0.6, 0.3, 1.0), (1.0, 0.5, 1.0).]

4.9.2 Results
We compare the proposed centralized and sharded heuristics (CE and SH) and the GA algorithm with a brute force approach (BR) that explores the search tree but uses intelligent pruning to avoid searching sub-optimal or redundant sub-trees. We evaluate their overall profit achieved, overall relative throughput, and the monetary cost of execution over the optimization period of T = 12 hrs. An algorithm is better than another if it meets the necessary relative application throughput constraint, \overline{\Omega} \ge \widehat{\Omega} - \epsilon, and has a higher value for the objective function \Theta (Eqn. 4.8). Note that the necessary constraint for \Omega must be met, but higher values beyond the constraint do not indicate a better algorithm. Similarly, an algorithm with a higher \Theta value is not better unless it also meets the \Omega constraint.

For all the experiments, we define the relative throughput threshold as \widehat{\Omega} = 0.8 with a tolerance of \epsilon = 0.05. We calculate \sigma for the LRB dataflow using Eqn. 4.9 by setting C_{max}^{\Gamma_{min}} = (0.5 \times T \times DataRate)/10 and C_{max}^{\Gamma_{max}} = (1.0 \times T \times DataRate)/10. We empirically arrive at these bounds by observing the actual break-even cost for executing the workflow using a brute force static deployment model.

1) Effect of Variability: Table 4.2 shows the overall profit and the relative throughput for a static deployment of LRB using different scheduling algorithms with an average input data rate of 50 msgs/sec.

Table 4.2: Effect of variability of infrastructure performance and input data rate on relative output throughput using different scheduling algorithms. Static LRB deployment with an average 50 msgs/sec input rate, \widehat{\Omega} = 0.8.
                    Relative Application Throughput (Ω)
Algo. | Profit (Θ) | Neither | Infra. Perf. | Data Rate | Both
BR    | 0.67       | 0.80    | 0.68         | 0.59      | 0.44
GA    | 0.65       | 0.79    | 0.67         | 0.48      | 0.37
SH    | 0.45       | 0.81    | 0.60         | 0.40      | 0.29
CE    | 0.58       | 0.81    | 0.66         | 0.42      | 0.31

[Figure 4.6: Effect of infrastructure and data rate variability on Static Deployment and Runtime Adaptation, as input data rate rises: (a) Ω with Static Deployment, (b) Ω with Runtime Adaptation, (c) Θ with Runtime Adaptation.]

The overall profit, which is a function of application value and resource cost, remains constant due to the static deployment (without run-time adaptation). However, the relative throughput varies as we introduce infrastructure and data variability. In the absence of any variability, the brute force (BR) approach gives the best overall profit and also meets the throughput constraint (Ω ≥ 0.8). Further, GA approaches a near-optimal solution with \Theta_{GA} → \Theta_{BR} when neither infrastructure nor input data rates vary, and is within the tolerance limit (\widehat{\Omega} - \epsilon < Ω = 0.79 < \widehat{\Omega} + \epsilon). The SH and CE heuristics meet the throughput constraint but give a lower profit when there is no variability.

However, when running simulations with infrastructure and/or data variability, none of the approaches meet the desired throughput constraints, with Ω values ranging between 0.29 and 0.68, which is much less than the goal of \widehat{\Omega} = 0.8. Since these experiments do not change the deployment at run-time, the initial deployment is based on assuming constant data rates and infrastructure performance. Even the two best static deployment strategies in the absence of variability, BR and GA, rapidly degrade in output throughput when both infrastructure and data variability are present. This is more so as the average input data rate increases from 10 msgs/sec to 2,000 msgs/sec in Figure 4.6(a). Due to the explosion in state space, we could not run the BR algorithm beyond a 50 msgs/sec input rate. This analysis motivates the need for autonomic adaptations to the application deployment.

2) Improvements with Runtime Adaptation: Figs. 4.6(b) and 4.6(c) show the improvements in relative throughput and overall application profit by utilizing the different run-time adaptation techniques in the presence of both infrastructure as well as data variability. We use GA as the upper bound (instead of BR) since BR is prohibitively slow for run-time adaptations and, as shown in Table 4.2, GA approaches the optimal in many scenarios. However, as discussed later (Figure 4.7(a)), even GA becomes prohibitive for data rates ≥ 500 msg/sec, and hence the missing entries in Figs. 4.6(b) and 4.6(c).

We observe that with dynamic adaptation both GA and the greedy heuristics (SH and CE) achieve the desired throughput constraint (Ω ≥ \widehat{\Omega} = 0.8) for all the input data rates tested. This allows us to compare their achieved application profit (Figure 4.6(c)) and make several key observations. First, the profit from GA is consistently more than that of the SH and CE greedy heuristics. Second, CE achieves a better profit than SH and reaches between 60% and 80% of GA's profit. In fact, SH gives negative profit (loss) in some cases.
Understandably, the CE scheduler having a global view performs significantly better than SH that performs local optimizations. However, CE has a higher overhead due to centralized collection of monitoring data (the exact overhead is not available from simulations). Lastly, comparing the static and dynamic deployment from Table 4.2 and Figure 4.6(c), for a data rate of 50 msg/sec under variable conditions, we do see a drop in profit using run-time adaptation as it tries to achieve the Ω constraint. For GA and CE, the profits for adaptive deployment drop from 0.65→ 0.34 and 0.58→ 0.29, respectively, but are still well above the break-even point of 0. But the static deployments violate the throughput constraint by a large margin which makes their higher profits meaningless. 3) Scalability of Algorithms: Figs. 4.7(a) and 4.7(b) show algorithm scal- ability with respect to algorithm run-time and the number of cores for the initial deployment algorithm with increase in the incoming data rates. While the BR and the GA algorithms provide (near) optimal solutions for smaller data rates, their overhead is prohibitively large for higher data rates (Figure 4.7(a)) (> 10, 000 secs for BR at 50 msg/sec, and > 1, 000 secs for GA at 8,000 msg/sec). This is due to the search space explosion with the increase in the number of required VMs as 93 shown in fig 4.7(b). On the other hand, both CE and SH greedy heuristics take just∼ 2.5 secs to compute at 8,000 msg/sec, and scale linearly (O(|ϕ| +|R|)) with the number of PE instances (|ϕ|) and number of virtual machines(|R|). Further we see that SH algorithm leads to higher resource wastage (more cores) with increase in the data rates, while CE and GA show a linear relation to the data rates in Figure 4.7(b). Similar results are seen for the adaptation stage for SH and CE algorithms but the plots are omitted due to space constraints. 4) Benefit of using Alternates: We study the reduction in monetary cost to run the continuous dataflow due to the use of alternates, as opposed to a dataflow deployed with only a single implementation for the PEs; we choose the imple- mentation with the highest value (Γ = 1). Figure 4.7(c) shows the cost (US$) of resources required to execute the LRB dataflow for the optimization interval T = 12 hr using the greedy heuristics with run-time adaptation, in the presence of both infrastructure and data variability. We use AWS’s EC2 prices using m1.* generation of VMsfor calculating the monetary cost. We see that the use of alternates by run-time adaptation leads to a reduction in total monetary cost by 6.9% to 27.5%, relative to the non-alternate dataflow; alternates with different processing requirements provide an extra dimension of control to the scheduler. In addition, the benefit of alternates increases with high data rate – as the input data rate and hence the resource requirement increases, even small fluctuations in data rate or infrastructure performance causes new VMs to be acquired to meet the Ω constraint, and acquiring VMs has a higher over- head (e.g., the hourly cost cycle and startup overheads) than switching between alternates. 94 0.001 0.1 10 1000 100000 10 100 1000 10000 Time (sec) log scale Avg. Data Rates (msg/sec) log scale BR GA CE SH (a) Algorithm time as a function of datarate. 0 1000 2000 3000 4000 5000 6000 0 2000 4000 6000 8000 # cores Avg. Data Rates (msg/sec) BR GA CE SH (b) # Cores as a function of datarate 0 1000 2000 3000 4000 10 20 50 100 500 1000 2000 Monetary Cost ($) Avg. 
Data Rates (msg/sec) CE With Alternates CE No Alternates SH With Alternates SH No Alternates (c) Monetary cost benefit of using alternates Figure 4.7: Algorithm scalability (a,b) and advantage of alternates (c) 95 4.10 Related Work Scientific workflows [131], continuous dataflow systems [88, 81, 20, 135] and similar large-scale distributed programming frameworks [134, 95] have garnered a renewed research focus due to the recent explosion in the amount of data, both archived and real time, and the need for large-scale data analysis on this “Big Data”. Our work is based upon the stream processing and continuous dataflow systems that allow a task-graph based programming model to execute long run- ning continuous applications which process incoming data streams in near-real time. Other related work includes flexible workflows and heterogeneous comput- ing, service oriented architecture (SOA) and autonomic provisioning and resource management in clouds. We discuss the state-of-the-art in each of these research areas. Continuous dataflow systems have their root in Data Stream Management Sys- tems, which process continuous queries over tuple streams composed of of well- defined operators [14]. This allows operator-specific query optimizations such as operator split and merge to be performed [106]. General-purpose continuous dataflow systems such as S4 [88], Storm [81], and Spark [134], on the other hand, allow user-defined processing elements, making it necessary to find generic auto- scaling solutions such as operator scaling and data-parallel operations [138]. Sev- eral solutions, including one leveraging cloud elasticity to support auto-scaling, have been proposed [53]. However, these systems [103, 118], only consider data variability as a factor for auto-scaling decisions and assume that the underlying infrastructure offers the same performance over time. Our work shows that this assumption does not hold in virtualized clouds. Autonomicprovisioningforworkloadresourcemanagementoncloudshavebeen proposed. These use performance monitoring and model-based approach [99, 56]. 96 Weuseasimilarapproachandproposeheuristicsfordynamiccontinuousdataflows that handle not only data rate variations but also changes in the underlying infras- tructure performance. Recent work [26] integrates elastic scale out and fault tol- erance for stateful stream processing but adopts a local only policy based on CPU utilization for scaling. In this chapter, we assume stateless PEs, and fault tolerance for stateful PEs is discussed in chapter 6. Our results do show that using local scale-out strategies that ignore the dataflow structure under-perform, and hence motivates heuristics with a global view. Flexible workflows [92, 121] and service selection in SOA [52] allow workflow compositions to be transformed at run-time. This provides a powerful composi- tional tool to the developer to define business-rule based generic workflows that can be specialized at run-time depending on the environmental characteristics. The notion of “alternates” we propose is similar in that it offers flexibility to the developer and a choice of execution implementations at run-time. However, unlike flexible workflows where the decision about task specialization is made exactly once based on certain deterministic parameters, in continuous dataflows, this deci- sion has to be re-evaluated regularly due to their continuous execution model and dynamic nature of the data streams. 
To exploit a heterogeneous computing environment, an application task may be composed of several subtasks that have different requirements and performance characteristics. Various dynamic and static task matching and scheduling tech- niques have been proposed for such scenarios [75, 122, 119]. The concept of alter- nates in dynamic dataflow is similar to these. However, currently, we do not allow heterogeneous computing requirements for these alternates, though they may vary in processing requirements. Even with this restriction, the concept of alternates provides a powerful programming abstraction that allows us to switch between 97 them at run-time to maximize the overall utility of the system in response to changing data rates or infrastructure performance. Several studies have compared the performance of the virtualized environment against the barebones hardware to show their average performances are within an acceptabletolerancelimitofeachother. Howeverthesestudiesfocusedontheaver- age performance characteristics and not on the variations in performance. Recent analysis of public cloud infrastructure [59, 58, 61, 85, 93] demonstrate high fluctua- tions in various cloud services, including cloud storage, VM startup and shutdown time as well as virtual machines core performance and virtual networking. How- ever, thedegreeofperformancefluctuationsvaryacrossprivateanddifferentpublic cloud providers [72]. Reasons for this include multi-tenancy of VMs on the same physical host, use of commodity hardware, collocation of faults, and roll out of software patches to the cloud fabric. Our own studies confirm these. On this basis, we develop an abstraction of the IaaS cloud that incorporates infrastructure variability and also include it in our IaaS Simulator. Meta-heuristics have been widely used to address the task scheduling problem [21, 44, 129, 136]. Most of the approaches are nature inspired and rely on GA [55], ant colony optimization [30], particle swarm optimization [67] or simulated annealing [83] techniques to search for sub-optimal solutions. A number of studies to analyze the efficiency of meta-heuristics [21] show that in certain scenarios GAs can over perform greedy heuristics. More generally, Zamfirache et. al. [136] show that population based GA meta-heuristics of classical greedy approaches provide, through mutations, solutions that are better compared to their classic versions. Recently, a comparison of ant colony optimization and particle swarm optimization with a GA for scheduling DAGs on clouds was proposed [129]. None of these task scheduling algorithms or meta-heuristics based solutions take into 98 account a dynamic list of task instances. Our own prior work [44] uses a GA to elastically adapt the number of task instances in a workflow to incoming web traffic but does not consider alternates or performance variability. 4.11 Summary In this chapter, we have motivated the need for online monitoring and adap- tation of continuous dataflow applications to meet their QoS constraints in the presence of data and infrastructure variability. To this end we introduce the notion of dynamic dataflows, with support for alternate implementations for dataflow tasks. This not only gives users the flexibility in terms of application composition, but also provides an additional dimension of control for the scheduler to meet the application constraints while maximizing its value. 
Our experimental results show that the continuous adaptation heuristics, which make use of application dynamism, can reduce the execution cost by up to 27.5% on clouds while also meeting the QoS constraints. We have also studied the feasibility of a GA-based approach for optimizing the execution of dynamic dataflows and shown that, although it gives near-optimal solutions, its time complexity is proportional to the input data rate, making it unsuitable for high-velocity applications. A hybrid approach which uses the GA for initial deployment and the CE greedy heuristic for run-time adaptation may be more suitable; this is to be investigated as future work.

Chapter 5
Predictive Lookahead Scheduling Heuristics

In the previous chapter we presented a formal problem formulation and reactive heuristics for the elastic deployment of dynamic dataflows on clouds that balance resource cost, application value, and application throughput (QoS). In this chapter we propose PLAStiCC, an adaptive scheduling algorithm using a prediction-based look-ahead approach. In addition, we also propose several simpler static scheduling heuristics that operate in the absence of an accurate performance prediction model. These static and adaptive heuristics are evaluated through extensive simulations using performance traces obtained from the Amazon AWS IaaS public cloud. Our results show an improvement of up to 20% in the overall profit as compared to the reactive adaptation algorithm.

5.1 Introduction

As discussed in chapter 2, cloud infrastructure exhibits performance variability for virtualized resources (e.g. CPU, disk, network) as well as services (e.g. NoSQL stores, message queues), both over time and space. These variations may be caused by factors including shared resources and multi-tenancy, changing workloads in the data center, as well as the diversity and placement of commodity hardware [59]. For time-sensitive continuous dataflow applications, it becomes imperative to address such fluctuations in performance to ensure that the desired QoS is maintained. Strategies may range from simple over-provisioning and replication to dynamic application re-composition [70] and pre-emptive migration [29].

Proactive management of the application and resource mapping is possible with predictive models that can forecast resource behavior. Several performance models have been proposed for computational Grid infrastructure, which also exhibits similar performance variability [74]. These models use the network topology and workload characterization to estimate the behavior of specific Grid resources [15, 116]. Similar models have also been developed for short and medium term predictions of performance metrics for virtual machines, with varying degrees of prediction confidence, using techniques such as discrete-time Markov chains [51].

Given the ability to elastically control Cloud resources and the existence of resource performance prediction models, the question arises: How can we re-plan the allocation of elastic resources for dynamic, continuous applications, at run-time, to mitigate the impact of resource and workload variability?

In the previous chapter, we proposed adaptive scheduling heuristics for online task re-composition of dynamic and continuous dataflows as well as allocation and scaling of cloud resources. However, these strategies reacted to changes in workloads and resource variations to ensure QoS targets were met.
In this chapter, we leverage the knowledge of future resource and workload behavior to pro-actively plan resource allocation, and thus limit costly and frequent run-time adaptation. We propose a new Predictive Look-Ahead Scheduling algorithm for Continuous dataflows on Clouds (PLAStiCC) that uses short term workload and infrastructure performance predictions to actively manage the resource mapping for continuous dataflows to meet QoS goals while limiting resource costs. We also offer, as alternatives, simpler scheduling heuristics that enhance the reactive strategy from the previous chapter. This chapter covers the following aspects:

1. We build upon our prior continuous dataflow abstraction and approach for reactive scheduling (chapter 4) to formalize look-ahead scheduling on Clouds as an optimization problem.
2. We propose several scheduling heuristics for elastic resources, including PLAStiCC, that leverage short and medium term performance predictions.
3. The proposed heuristics are evaluated through extensive simulations that use real performance traces from AWS public cloud VMs for a representative synthetic continuous dataflow with different workloads.

5.2 Problem Formulation

Reactive scheduling can be mapped to a constrained utility maximization problem applied to continuous dataflows [70]. We define the pro-active look-ahead optimization as a refinement of this by including predictions for the incoming data rates and the performance of the underlying infrastructure. In contrast, the reactive problem only has access to the current snapshot of the behavior.

We define a fixed optimization period T over which the application schedule is to be optimized while satisfying the QoS constraint. T may be the duration of the application lifetime or some large time duration. T is divided into a number of fixed-length timesteps, $t_0, t_1, \ldots, t_n$. The initial deployment decision, based on the rated infrastructure performance and user-estimated data rates, is made at the start of timestep $t_0$; the system is monitored during $t_i$ and run-time decisions are performed at the start of timestep $t_{i+1}$.

Figure 5.1(a) shows the reactive optimization problem defined over an optimization period T. The profit (i.e. utility) function to maximize, Θ, is defined over the normalized application value and the total resource cost over the entire period T. Similarly, the application throughput constraint $\Omega \geq \hat{\Omega}$ is defined over the average throughput observed over T. Consequently, the instantaneous Ω at a given time may not satisfy the given constraint even though the average is satisfied by the end of T.

This problem is modified for pro-active look-ahead optimization as follows. Since prediction algorithms are often accurate only over short horizons (i.e. the near future), and to reduce the time complexity of the problem, we reduce the above global optimization to a set of smaller local optimization problems (Figure 5.1(b)). First, the optimization period T is divided into prediction intervals $[P_0, \ldots, P_k]$, each of size equal to the length of the prediction horizon given by the prediction model ("oracle"). As before, we divide each interval $P_j$ into timesteps $t^j_0, t^j_1, \ldots, t^j_k$, at the start of which the adaptation decisions are made. Next, a stricter version of the application constraint is defined, requiring it to be satisfied for each prediction interval $P_j$, i.e. $\forall P_j: \Omega_{P_j} \geq \hat{\Omega}$.
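Stated compactly with the notation above (a restatement only; Γ, μ and Ω are as defined in chapter 4, and $\Theta^{p}_{\leq j}$ is the cumulative "so far" profit introduced next), the two problems are:

\[ \text{Reactive (over } T\text{):} \qquad \max\ \Theta = \Gamma - \sigma\,\mu \qquad \text{s.t.} \qquad \Omega \geq \hat{\Omega} \]
\[ \text{Look-ahead (per interval):} \qquad \max\ \Theta^{p}_{\leq j} \ \text{ at each } P_j \qquad \text{s.t.} \qquad \Omega_{P_j} \geq \hat{\Omega} \quad \forall P_j \]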
It is evident that if this constraint is satisfied for each interval, it will be satisfied over the optimization period; however, the converse may not hold. Further, we cannot define Θ for each interval since it is cumulative over the resource cost, and the VM provisioning may span multiple prediction intervals. Rather, we optimize the profit calculated "so far", given as the cumulative average over each of the previous intervals, i.e.,
$$\Theta^{p}_{\leq j} = \frac{\sum_{\forall i \leq j} \Gamma_{P_i}}{\sum_{\forall i \leq j} P_i} - \sigma \times \mu_{\leq j},$$
where $\mu_{\leq j}$ is the cumulative cost till prediction interval $P_j$. While conciseness precludes a formal proof, it can be shown by contradiction that the global profit function Θ defined earlier is equivalent to $\Theta^{p}_{\leq k}$, where $P_k$ is the last prediction interval in the optimization period T.

Figure 5.1: The (a) reactive and (b) look-ahead run-time optimization problems defined over the optimization period T.

The look-ahead optimization problem described above assumes a prediction "oracle" that can predict over the entire prediction interval. With a perfect oracle, the predicted values for an interval remain fixed for that interval. However, given the limitations of state-of-the-art multivariate time series models [27], in practice the errors in the predicted values will increase with the horizon of prediction. Hence, we will get better prediction values as we move further into a prediction interval. To utilize this, we allow new predictions to be made at each timestep $t_i$ rather than only at the start of each prediction interval. Further, anytime a new (improved) prediction is available, we slide the prediction interval window to that point and re-plan actions for all its timesteps, retracting the prior planned actions. This sliding window of replanning does not change the optimization problem, except that the optimization decisions planned in one interval may be altered in the succeeding interval if improved predictions are available. However, note that this is different from the reversal of decisions in the reactive version, since the decisions there are enacted immediately for the next timestep, whereas in the look-ahead version the replanning may just change a future planned action and hence not incur any resource penalty.

5.3 Scheduling Heuristics

As with reactive scheduling (chapter 4.8), look-ahead scheduling consists of two phases: the initial deployment phase, which maps the application to VMs in the Cloud assuming rated performance and an estimated incoming data rate; and the run-time adaptation phase, which alters the resource allocation and alternates to account for the observed and predicted behavior. In this chapter, we retain our earlier initial deployment algorithm [70] and focus on the latter. As such, the impact of the initial deployment is amortized by the run-time adaptation for long-running dataflows. We propose a look-ahead algorithm, PLAStiCC, which uses the predicted resource performance and data rates to generate the adaptation plan for the near future. Further, we propose several heuristics that use simpler averaging models to estimate performance and data rates. These, while out-performing reactive scheduling in some cases, offer different trade-offs than PLAStiCC (chapter 5.4).

5.3.1 Predictive Look-Ahead Scheduling (PLAStiCC)

The PLAStiCC algorithm is run at the start of each prediction interval.
It plans a set of adaptation actions for the timesteps in that interval to maximize the cumulative profit while meeting the constraint for that interval. The key intuition is to identify the farthest timestep in the current interval for which the existing deployment meets the QoS constraint for the messages expected in that interval, given the performance and data rate predictions. This cross-over timestep indicates the future timestep beyond which the current deployment fails to meet the QoS. Once identified, we replan the deployment starting at the cross-over timestep and recursively repeat the process till the last timestep of the interval. The result of this algorithm is zero or more cross-over timesteps identified in the interval, where the QoS constraint fails, together with the planned adaptation action to be taken at each such timestep.

Algorithm 5 shows the PLAStiCC run-time adaptation algorithm. It takes as input the continuous dataflow DAG D, the predicted data rate $PR^i_{t..t+n}$ for each input PE $P_i$ as a time series over the timesteps $[t..t+n]$ in the interval, and the predicted VM performance time series $PV^j_{t..t+n}$ for each $VM_j$, which includes CPU and inter-VM network behavior. The algorithm outputs a plan of actions X to be taken at timesteps in the prediction interval $[t, t+n]$.

PLAStiCC is called as and when new predictions are made. It is run before the first timestep, t, of the prediction interval and tries to identify the farthest non-cross-over timestep. The algorithm maintains two pointers: ot, the last-known cross-over timestep in the interval, initially blank, and ft, the last timestep of the interval, set to ft = t+n. It tries to locate the last timestep at which the aggregate constraint is met (lines 5–13). Starting at the last timestep, it calculates the aggregated throughput (Ω) by computing the processing capacity available to each PE (Alloc), the cumulative incoming data messages at each PE using the predicted data rates and the network behavior (M), and the aggregated relative throughput resulting from these values. If this aggregate throughput is not within $\hat{\Omega} \pm \epsilon$, we try the previous timestep (line 12). If it is met, we have found the cross-over boundary between the constraint being satisfied and not (line 10). If the constraint is met in the first iteration, i.e. at the interval's end, no adaptation is needed (line 15).

Once the cross-over is identified, ot is set to this timestep, and ft is reset to t+n for the next iteration. We then decide the action to be taken at the cross-over to meet the constraint. For this replanning, we get the aggregated values for the available resources, the incoming messages and the relative throughput for the rest of the interval, (ot, t+n] (lines 18–20). Since the expected throughput deviates for this period, we either scale up ($\Omega_{ot..t+n} < \hat{\Omega} - \epsilon$) or scale down ($\Omega_{ot..t+n} > \hat{\Omega} + \epsilon$), incrementally (lines 21–27).

Both the ScaleUp and the ScaleDown functions try to balance the overall application value Γ (Sec. 4.2) and the total application cost μ. The ScaleUp function first identifies the "bottleneck" PE with minimum relative throughput, lying on the critical path from the input to the output PEs [70]. It can then either switch to an alternate for this PE with lower processing needs (and potentially decrease the application value) or increase the number of instances (and possibly increase the resource cost) for the bottleneck PE.
This choice is based on the local optimization of the profit function $\Theta_t$, which uses the user-specified σ coefficient that determines the "acceptable" value:cost ratio. Specifically, the cost of increasing an instance is zero if a VM has spare cores, or else is equal to the cost of a new VM. Then, the change in application value from switching to the alternate with the next lower processing needs is calculated. The change in $\Theta_{ot-ft}$ from either of these actions is calculated, and the adaptation that results in the higher profit is selected. Similarly, the ScaleDown function identifies an over-provisioned PE and makes a choice between switching to an alternate with higher value or decreasing the number of instances, whichever leads to higher profit.

Once such an incremental action is planned for the cross-over timestep, we resume evaluating the planned deployment to ensure it satisfies the constraint for the rest of the interval, (ot, t+n] (lines 5–13). The replanning is done incrementally till all cross-over timesteps and adaptations are identified for the interval, and the resulting plan satisfies the constraint for the entire interval. Finally, the algorithm performs a repacking of VMs to collocate neighboring PEs as well as shuts down those that are unused (line 29).

Algorithm 5 PLAStiCC Runtime Adaptation Heuristic
 1: procedure PLAStiCC(Dataflow D, MinThroughput $\hat{\Omega}_t$, DataRatePredictions $PR_{t..t+n}$, VMPerfPredictions $PV_{t..t+n}$)
 2:   ot ← ∅, X ← ∅                ▷ Init cross-over list to null
 3:   while ot ≤ t+n do
 4:     ft ← t+n
 5:     while ft ≥ t do            ▷ Work back from last timestep.
 6:       Alloc ← AvailableResourceAlloc($PV_{ot..ft}$)
 7:       M ← CumulativeIncomingMsgs($PR_{ot..ft}$)
 8:       Ω ← CumulativeThroughput(D, M, Alloc)
 9:       if $\hat{\Omega} - \epsilon$ < Ω < $\hat{\Omega} + \epsilon$ then
10:         Break                  ▷ QoS constraint met at this timestep.
11:       end if
12:       ft ← ft − 1              ▷ QoS not met. Work backwards.
13:     end while
14:     if ft = t+n then           ▷ QoS met at end of interval.
15:       Break                    ▷ No more replanning. Done.
16:     end if                     ▷ Replan actions for the remaining duration (ft, t+n]
17:     ot ← ft                    ▷ Update last-known cross-over.
18:     Alloc ← AvailableResourceAlloc($PV_{ot..t+n}$)
19:     M ← CumulativeIncomingMsgs($PR_{ot..t+n}$)
20:     Ω ← CumulativeThroughput(D, M, Alloc)
21:     if Ω < $\hat{\Omega}$ then
22:       bpe ← Bottleneck(D, CA)
23:       X ← X ∪ ScaleUpOrDecAppValue(bpe, ot)
24:     else if Ω > $\hat{\Omega}$ then
25:       ope ← OverProvisioned(D, CA)
26:       X ← X ∪ ScaleDownOrIncAppValue(ope, ot)
27:     end if
28:   end while
29:   X ← X ∪ RepackAndShutdownVMs( )
30:   return X                     ▷ Return cross-over timesteps and action plan.
31: end procedure

5.3.2 Averaging Models with Reactive Scheduling

Rather than use fine-grained predictions (PLAStiCC) or assume that the most recently observed values will persist (reactive), we develop several heuristics that use simple but intuitive estimates of future behavior and use them as representative values in the reactive algorithm. This also addresses PLAStiCC's susceptibility to prediction errors, since it over-fits the schedule to the predictions. Similar to the reactive scheduler, these heuristics are executed only when the application QoS deviates from the constraint.

Current After-the-fact (CAF): This chooses the observed value in the most recent timestep as the estimated behavior for the remaining timesteps. This is the same as the classic reactive scheduler [70]. It reacts after-the-fact to the observed resource and data rate changes.

Windowed Average After-the-fact (WAAF): This uses the average seen over several recent timesteps as the estimated behavior. This smooths out spikes while accounting for recently observed trends that might last into the near future.
Windowed Predictive Average (WPA): This averages the values predicted for all timesteps by the advanced prediction model used by PLAStiCC, and uses the single averaged value for all timesteps in the interval. This mitigates the impact of large prediction errors in a few timesteps, to which PLAStiCC is susceptible.

Windowed Predictive Optimistic (WPO): This too uses the advanced prediction model of PLAStiCC but picks the most favorable infrastructure performance (highest) and data rate (lowest) from among all the timesteps predicted for, and uses that single value for all timesteps. This offsets the impact of prediction algorithms with an under-prediction bias [68]. This heuristic usually leads to low relative throughput since it tends to under-provision resources.

Windowed Predictive Pessimistic (WPP): This is similar to WPO, except that it picks the least favorable performance and data rate prediction as the single value. This handles prediction algorithms with an over-prediction bias [68]. This heuristic usually gives the best relative throughput since it tends to over-provision resources.

5.4 Evaluation

Simulation Setup: We evaluate PLAStiCC and the other heuristics through a simulation study. We extend the CloudSim [24] data center simulator to IaaSSim (http://github.com/usc-cloud/IaasSimulator), which incorporates temporal and spatial performance variability using real traces collected from IaaS Cloud VMs. Further, FloeSim simulates the execution of the Floe stream processing engine [112] on top of IaaSSim, with support for continuous dataflows, alternates, distributed deployment, dynamic instance scaling, and plugins for different schedulers.

For each simulation experiment, we deploy a dataflow to the Cloud using FloeSim, run it for 12 hours (simulated) using Amazon EC2's performance traces – which is also the optimization period (T = 12 hrs) – and repeat it at least three times to get an average. We use a timestep duration of t = 5 mins and prediction intervals between P = 10 and 30 mins.

Synthetic Dataflows and Streams: We evaluate the adaptation algorithms using the synthetic dataflow in Figure 5.2, with 10 stages of 3 PEs each, and 3 input and output data streams. The dataflow is randomly perturbed to generate between 1–3 alternates for each PE, with different cost, value and selectivity. This gives us a total of 30 PEs and ∼45 alternates. Each of the three input data streams is independently generated based on observations in domains like Smart Power Grids [109]. Starting from a base stream with sinusoidal data rates, whose peak and trough are 75 and 2 messages respectively and whose wavelength is 24 hrs, we add random noise every 1 min that varies the rate by up to ±20%.

Figure 5.2: Sample dataflow pattern used for evaluation (10 stages of 3 PEs each; 1–3 alternates generated for each PE with different behavior; 30 PEs and ∼45 alternates in all; 3 input streams with sinusoidal stream rates and periodic noise).

Evaluation Metrics: We evaluate the heuristics using the following metrics: the overall application relative throughput (Ω), the overall application profit (Θ), and the heuristic stability (χ). Specifically, we verify whether the heuristic satisfies the QoS constraint $\Omega \geq \hat{\Omega}$. A higher value of Ω beyond $\hat{\Omega}$ is not necessarily good unless it also gives a higher overall profit. Further, given two heuristics which satisfy the QoS constraint, the one with the higher overall profit is considered better. Note that profit is unit-less and can only be compared relatively.
χ is the frequency of deviation of the instantaneous relative throughput $\Omega_t$ from $\hat{\Omega} \pm \epsilon$ during the optimization period; a smaller value indicates a more stable heuristic.

Figure 5.3: Performance of scheduling for a perfect, oracle prediction model, as the prediction interval horizon (in secs) varies on the X axis. (a) Overall relative throughput (Ω); (b) total profit; (c) heuristic stability (χ).

Prediction Models: The proposed heuristics are evaluated under different prediction oracle models. First, we evaluate the different heuristics under a perfect oracle model that gives accurate predictions over the prediction interval. We also run experiments where the predicted performance and data rate values include bounded errors, $\hat{e}$, to simulate the shortcomings of real-world prediction models. The predictions have smaller errors for near timesteps and larger ones for farther timesteps. First, we linearly scale the error from zero for the current timestep ($e_0$) to $2 \times \hat{e}$ for the last timestep ($e_n$) in the prediction interval, to give a mean close to $\hat{e}$. Then, we sample the actual error for a timestep i from a normal distribution with mean 0 and variance equal to $e_i/2$, and use this as the actual error for that timestep to obtain the predicted value. We can also optionally specify a "bias" of ±5% that is added to get a positively or negatively biased model. We make predictions every 10 mins for all models. Further, we evaluate them with prediction intervals ranging from 10 mins (i.e. two 5-min timesteps per interval) up to 30 mins and study the effect of the length of the prediction interval under different oracles.

5.4.1 Results

Perfect Oracle Model: We first present the results obtained for the perfect oracle model, which signifies the best-case results for the PLAStiCC heuristic as compared to the various reactive models. Figure 5.3 compares the PLAStiCC algorithm to the reactive models for the different metrics (Ω, Θ, and χ) with different prediction interval lengths for which the oracle predictions are available. Figure 5.3(a) shows the overall relative throughput for the various algorithms observed at the end of the optimization duration. We observe that all the adaptation algorithms successfully satisfy the application constraint (Ω ≥ 0.8) over the optimization duration for all prediction interval lengths.

Higher values of relative throughput do not necessarily mean a better algorithm. We compare this with Θ and χ. Figure 5.3(b) shows the overall profit (Θ = Γ − σμ) obtained for the different algorithms. First, we observe that the PLAStiCC algorithm performs consistently better than any of the reactive algorithms across different prediction interval lengths. Further, we also observe that the profit value for the PLAStiCC algorithm increases as the length of the prediction interval increases. This is because, with shorter prediction intervals, the planned actions for that interval may already have been executed before the plan for the future is obtained, which leads to a reversal of the actual action and not just the planned action, the former of which incurs additional cost.
As the length of the interval increases, fluctuations further in the future are known in advance, and hence the planned activities may be reversed before they are enacted, lowering the resource cost. In addition, we observe that the heuristics based on the current (CAF) and historical average (WAAF) performance perform fairly similarly. However, the WPP algorithm leads to much lower profit since it always tries to allocate resources for the worst-case scenario, hence incurring a higher resource cost. Further, Figure 5.3(c) shows the stability of the different heuristics (lower values are better). We observe that the PLAStiCC algorithm is more stable than any of the other heuristics and hence provides more predictable performance over the optimization duration. This is mainly because the algorithm pro-actively takes actions (such as launching VMs in advance to account for the VM launch time when initiating a new instance at the desired time), whereas in the reactive versions the delay incurred by such actions causes the constraint to be missed until the action is performed.

This simulation study thus shows that the look-ahead PLAStiCC algorithm out-performs all the reactive algorithms in the presence of data and infrastructure variability. Further, it shows that the PLAStiCC algorithm is dependent on the length of the prediction interval and performs better as the interval length increases.

Predictions with Errors and Biases: Next, we analyze PLAStiCC under more realistic prediction model conditions, with prediction errors. We compare PLAStiCC against WPO, WPA, and WPP to evaluate their relative strengths. Figure 5.4 shows the performance of PLAStiCC for models with prediction errors but no biasing. As before, we observe that the profit of the algorithm for a perfect oracle model (0% error) increases from 0.6 to 0.7 as the prediction interval increases from 10 to 60 mins. However, we see degrading performance as the error rate increases to 20%. We make two observations. First, for a given prediction interval length the performance decreases, as expected, as the prediction error increases. Second, with high error values (> 10%), the performance of the algorithm, for a given error rate, initially increases with the prediction interval length, and then falls with further increase in the interval length (an inverse bell curve). The reason for this behavior is that as we predict farther into the future, the error increases. But as we move into the future, more accurate results are obtained and hence they lead us to reverse actions. While most of these are planned actions, some enacted ones may be reversed too. As the error increases, only predictions very close to the current time prove accurate, and hence PLAStiCC degenerates to the reactive version.
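The bounded-error oracle described above (linear error growth over the horizon, normally distributed per-timestep errors, and an optional bias) can be sketched in a few lines. This is only an illustration of the error model, not the FloeSim/IaaSSim code; the function and variable names are hypothetical, and the error is interpreted here as a relative (percentage) deviation.

    import math, random

    def noisy_predictions(true_values, avg_error, bias=0.0):
        # Perturb a per-timestep series to mimic an imperfect oracle:
        # the error bound grows linearly from 0 at the current timestep to
        # 2*avg_error at the end of the horizon (mean ~ avg_error); each
        # timestep's error is drawn from N(0, variance = e_i / 2), and an
        # optional bias (e.g. +/-0.05) skews all predictions up or down.
        n = len(true_values)
        predicted = []
        for i, value in enumerate(true_values):
            e_i = 2.0 * avg_error * i / max(n - 1, 1)
            err = random.gauss(0.0, math.sqrt(e_i / 2.0)) if e_i > 0 else 0.0
            predicted.append(value * (1.0 + err + bias))
        return predicted

    # Example: a 6-timestep horizon with 20% average error and a -5% bias.
    print(noisy_predictions([100, 110, 120, 115, 105, 95], avg_error=0.20, bias=-0.05))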
Figure 5.4: Overall profit of PLAStiCC for different prediction error %, with the prediction interval horizon (in secs) on the X axis.

Figure 5.5: Performance of scheduling with realistic prediction models, having different prediction error % on the X axis. Plots differ in prediction bias: (a) overall profit of the different algorithms for an unbiased predictor (0%); (b) for a low-biased predictor (-5%); (c) for a high-biased predictor (+5%).

Finally, we compare PLAStiCC against the reactive versions which aggregate predicted values and are hence potentially affected by the prediction error. CAF and WAAF are unaffected by prediction errors and hence skipped. Figure 5.5 shows the overall profit for WPA, WPO, WPP and PLAStiCC for different error rates and prediction biases (P = 20 min). As seen in Figure 5.5, PLAStiCC out-performs the others for smaller errors, irrespective of the prediction bias, while its performance degrades as the error increases, since it tries to over-fit to the predicted values. WPA, on the other hand, performs consistently across different error rates and for different biases. The WPO algorithm, which assumes optimistic infrastructure behavior, performs poorly when the prediction errors are unbiased. However, it performs better than WPA with an under-biased predictor: with lower predicted values, the best performance values over the interval tend to be closer to the real performance values than the average. On the other hand, WPP performs poorly under all scenarios because it assumes pessimistic infrastructure behavior, and hence a single dip in the predicted performance causes it to over-provision resources.

5.5 Related Work

Continuous dataflows have emerged as an extension of Data Stream Management Systems, which focused on continuous queries over data streams. These have evolved into a general-purpose abstraction that provides strong composition tools for building large-scale stream processing applications. Several such stream processing systems have been proposed recently, such as S4 [88], IBM InfoSphere Streams [20], D-Streams [135], Storm [73], and TimeStream [98]. While these systems are built for large-scale stream processing applications and provide distributed deployment and run-times, most of them (with the exception of TimeStream) do not provide automatic, dynamic adaptation to changing data rates or to performance variability of the underlying infrastructure over the application's lifetime.

The notion of "alternates" is similar to flexible workflows [89]. It also draws inspiration from dynamic tasks defined in heterogeneous computing [75], where a task is composed of one or more alternatives with different characteristics, such as resource requirements and dependencies.
However, unlike these systems, where the alternate to be executed is selected in advance, in our dynamic continuous dataflow abstraction such decisions are made at periodic intervals based on the changing execution characteristics. TimeStream [98] supports a similar dynamic reconfiguration, called resilient substitution, which allows replacing a sub-graph with another in response to the changing load. This requires dependency tracking and introduces inefficiencies. In contrast, we restrict dynamic adaptation to single processing elements, making the process of dynamic reconfiguration independent for each PE and hence allowing more flexibility in dynamic adaptation, while restricting the application composition model. Our earlier work has discussed consistency models for enacting such run-time updates [125].

The predictive scheduling approach proposed here follows the generic "scheduler, monitor, comparator, resolver" approach proposed by Nof et al. [90]. Several such predictive scheduling techniques have been studied on computational grids [91, 15], however in the context of batch task scheduling rather than continuous dataflows. Further, several performance prediction models for the Grid have been proposed to profile workloads as well as infrastructure variations [128]. Although we do not provide models for performance prediction in the Clouds, we envision that similar performance models may be applicable. PRESS [51] is one such system, which uses "signature driven" prediction for variations with periodic patterns and a discrete-time Markov chain based model for short-term predictions for VMs. They achieve high prediction accuracy, with less than 5% over-estimation error and close to zero under-prediction error. We presume the existence of such monitoring and prediction algorithms as the basis for our look-ahead scheduling algorithm.

5.6 Summary

In this chapter, we proposed PLAStiCC, a predictive look-ahead scheduling heuristic for dynamic adaptation of continuous dataflows that responds to fluctuations in stream data rates as well as variations in Cloud VM performance. Through simulations, we showed that the proposed PLAStiCC heuristic results in a higher overall profit, by up to 20%, as compared to the reactive version of the scheduling heuristic, which we proposed earlier and improved here, and which performs adaptation actions only after observing the effects of these variations on the application. We also studied the effect of realistic prediction models with different error bounds and biases on the proposed heuristic, and identified scenarios where the look-ahead algorithm falls short of the reactive one.

Chapter 6
Elasticity and Fault Tolerance

In chapters 4 and 5, we studied several resource management and scheduling algorithms that make run-time elasticity and adaptation decisions in response to changing domain characteristics, infrastructure performance or data rates. However, in each of those algorithms we assumed that the individual processing elements are stateless, and hence did not consider migration and fault-recovery costs as part of the algorithms. In reality, a number of processing elements will fall into one of two categories, partitioned stateful PEs or shared stateful PEs, and hence incur a non-negligible cost of migration or fault recovery. In this chapter we study these stateful PEs and propose efficient techniques for checkpointing, migration and recovery to achieve seamless elasticity and load balancing as well as fast recovery from failures.
While the techniques proposed can be extended to shared stateful PEs, here we focus on partitioned stateful PEs only.

Partitioned stateful PEs provide a powerful abstraction similar to the MapReduce programming model used in batch high-volume processing. The MapReduce programming model, due to its simplicity and scalability, has become an essential tool for processing large data volumes in distributed environments. Recent Stream Processing Systems (SPS) extend this model to provide low-latency analysis of high-velocity continuous data streams. However, integrating MapReduce with streaming poses challenges: first, the run-time variations in data characteristics such as data rates and key distribution cause resource overload, which in turn leads to fluctuations in the Quality of Service (QoS); and second, the stateful reducers, whose state depends on the complete tuple history, necessitate efficient fault-recovery mechanisms to maintain the desired QoS in the presence of resource failures. In this chapter, we propose an integrated streaming MapReduce architecture leveraging the concept of consistent hashing to support run-time elasticity, along with locality-aware data and state replication to provide efficient load-balancing with low-overhead fault-tolerance and parallel fault-recovery from multiple simultaneous failures. Our evaluation on a private cloud shows up to 2.8× improvement in peak throughput compared to the Apache Storm SPS, and a low recovery latency of 700–1500 ms from multiple failures.

6.1 Introduction

The MapReduce (MR) programming model [130] and its execution frameworks have been central to building "Big Data" [60, 134] applications that analyze huge volumes of data. Recently, Big Data applications that process high-velocity data and provide rapid results – on the order of seconds or milliseconds – are gaining importance. These applications, such as fraud detection using real-time financial activity [96, 78], trend analysis and social network modeling [46], and online event processing to detect abnormalities in complex systems, operate on a large and diversified pool of streaming data sources. As a result, distributed Stream Processing Systems (SPS) [88] have been developed to scale with high-velocity data streams by exploiting data parallelism and distributed processing. They allow applications to be composed of continuous operators, called Processing Elements (PEs), that perform specific operations on each incoming tuple from the input streams and produce tuples on the output streams to be consumed by subsequent PEs in the application.

The stateful Streaming MapReduce (SMR) [22] combines the simplicity and familiarity of the MR pattern with the non-blocking, continuous model of SPSs. SMR consists of mappers and reducers connected using dataflow edges with key-based routing, such that tuples with a specific key are always routed to and processed by the same reducer instance. However, unlike batch MR, which guarantees that the reducers start processing only after all mapper instances finish execution and the intermediate data transfer is completed, in SMR a reducer instance continuously receives tuples from the mappers, with different keys interleaved with each other. Hence the reducers need to store and access some state associated with an individual key whenever a tuple with that key needs to be processed.
When SMR with stateful reducers is used to process high-velocity, variable-rate data streams with low latency, a couple of challenges arise:

(1) Unlike batch MR, the system load is not known in advance and the data stream characteristics can vary over time, in terms of the data rate as well as the reducer key distribution across tuples. This can cause computational imbalances across the cluster, with overloaded machines that induce high stream processing latency. Intelligent load-balancing and elastic run-time scaling techniques are required to account for such variations and maintain the desired QoS.

(2) The distributed execution of such applications on large commodity clusters (or clouds) is prone to random failures [48]. Techniques used for fault-recovery in batch MR, such as persistent storage and replication using HDFS with re-execution of failed tasks [107], incur a high latency cost of seconds to minutes. Other solutions [54, 123, 137] that rely on process replication are unsuitable for SMR due to their high resource cost and need for tight synchronization. To meet the QoS of streaming applications, SMR must support fault-tolerance mechanisms with minimal run-time overhead during normal operations while also providing low-latency, parallel recovery from one or more concurrent resource failures.

In this chapter we address these two challenges through an integrated architecture for SMR on commodity clusters and clouds that provides: (1) adaptive load-balancing to redistribute keys among the reducers at run-time, (2) run-time elasticity for reducers to maintain QoS and reduce costs in the presence of load variations, and (3) tolerance to random fail-stop failures at the node or the network level. While existing systems for batch and continuous stream processing partially offer these solutions (chapter 6.8), our unique contribution is to build an integrated system for streaming MapReduce that supports the above features by extending well-established concepts such as Consistent Hashing [65], Distributed Hash Tables (DHT) [63], and incremental checkpointing [4] to the streaming context, while maintaining the latency and throughput requirements in such scenarios. Specifically, the contributions of this chapter are:

(1) We propose a dynamic key-to-reducer (re)mapping scheme based on consistent hashing [65]. This minimizes key re-shuffling and state migration during auto-scaling and failure-recovery without needing an explicit routing table (chapter 6.4.1).

(2) We propose a decentralized monitoring and coordination mechanism that builds a peer-ring among the reducer nodes. Further, a locality-aware tuple peer-backup (chapter 6.4.2) and incremental peer-checkpointing (chapter 6.4.3) enable low-overhead load-balancing and auto-scaling as well as fault-tolerance and recovery. Specifically, it can tolerate and efficiently recover from $r \leq x \leq \frac{nr}{r+1}$ faults, where r is the replication factor and n is the number of nodes in the peer-ring (chapter 6.6).

(3) Finally, we implement these features into the Floe SPS and evaluate its low-overhead operations on a private cloud, comparing its throughput with the Apache Storm SPS, which uses upstream backup and external state for fault-recovery (chapter 6.7). We also analyze the latency characteristics during load-balancing and during fault-recovery, exhibiting a constant recovery time from one or more concurrent faults.

6.2 Background

Streaming MapReduce.
SMR [22, 41] extends the batch MR model by leveraging stateful operators in an SPS and using a key-based routing strategy for sending tuples from mappers to reducers. The mapper is a stateless operator that transforms the input tuples and produces one or more output tuples of the form $t = \langle k_i, v \rangle$. Unlike batch MR, SMR relaxes the "strict phasing" restriction between map and reduce. Hence, the system does not need to wait for all mappers to complete their execution (i.e., produce all tuples for a given key) before starting the reducers. Instead, tuples are routed to the matching reducer as they are emitted, and the reduce function is executed on each incoming tuple, producing continuous results. Unlike batch MR, where reducers are stateless since they can access all the tuples for a given key during execution, in SMR reducers must be stateful. As tuples with different keys may arrive interleaved at a reducer, a single reducer will operate on different keys while maintaining an independent state for each key. Specifically, the reducer function takes a tuple and the state associated with the given key, performs a stateful transformation, and produces a new state with an optional output key-value pair, i.e. $R: \langle k_i, v \rangle, s^{k_i}_j \rightarrow [\langle \hat{k}_i, \hat{v} \rangle,]\ s^{k_i}_{j+1}$.

Figure 6.1 shows the streaming word frequency count application using the stateful streaming MapReduce (SMR) programming model. The Map function is executed for each incoming tuple (e.g. once per tweet) and emits a tuple as a key-value pair ⟨Word, 1⟩. Each tuple is then mapped to a reducer instance based on the specific key, similar to the batch MapReduce version. However, unlike batch MR, the reducer function is executed for each tuple it receives. Hence the reducer does not have access to all tuples for the corresponding key; instead the SMR framework maintains a local state associated with each "key" processed by the reducer (chapter 6.4.3), which is passed along with the tuple to the reducer function. The reducer thus keeps a running count of all the words seen so far. It can choose to emit the current count based on a certain condition, such as an external signal.

 1: procedure Map(Tuple, Emitter)
 2:   ▷ Tuple: incoming data tuple
 3:   ▷ Emitter: emits tuples to the reducers
 4:   for Word in Tuple.value do
 5:     Out ← ⟨Word, 1⟩
 6:     Emitter.Write(Out)
 7:   end for
 8: end procedure

 1: procedure Reduce(Tuple, State, Emitter)
 2:   ▷ Tuple ⟨k_i, v⟩: incoming tuple with key k_i
 3:   ▷ State: state associated with the key k_i
 4:   ▷ Emitter: used to emit output tuples
 5:   Word ← Tuple.key
 6:   ct ← State.get("count")
 7:   ct ← ct + Tuple.value
 8:   State.Update("count", ct)
 9:   if Condition then
10:     out ← ⟨Word, ct⟩
11:     Emitter.write(out)
12:   end if
13: end procedure

Figure 6.1: Stateful Streaming MapReduce Word Count Example.

Execution Model. The SMR execution model is based on existing well-established SPSs such as Storm, S4 [88], and Floe [112]. Figure 6.2 shows an example of an SMR application on five hosts (physical or virtual). The mapper and reducer instances are distributed across the available hosts to exploit inter-node parallelism. Each host executes multiple instances in parallel threads, exploiting intra-node parallelism. Although mappers and reducers may be collocated (preferred), for simplicity we assume that they are isolated on different hosts. We thus refer to a node hosting a number of mappers or reducers as a Mapper node or Reducer node, respectively.
Each mapper consumes incoming tuples in parallel, processes them, and emits tuples of the form $\langle k_i, v \rangle$, where $k_i$ is the key used for routing the tuple to a specific reducer. Given the stateless and independent nature of the mappers, simple load-balancing and elastic scaling mechanisms [70, 69] are adequate and are not discussed here. Instead, we focus on load-balancing, elasticity, and fault-tolerance only for the reducer nodes.

The tuples are reliably transmitted to the reducers with assured FIFO ordering between a pair of mapper and reducer. However, given that several mappers may process and produce tuples in parallel, the total ordering among the tuples generated by them, even for a single key, is undefined. Hence, the reducer logic should be agnostic to the order in which the tuples arrive. This constraint is similar to that of traditional MR, as well as other stream processing systems. For applications that require a specific ordering, techniques such as those used by Flux [105] or Virtual Synchrony may be used, but these are out of scope.

Figure 6.2: Distributed SMR execution model (mapper and reducer threads distributed across virtual/physical machines, with intermediate streams following key-based routing semantics, per-node data buffers, and per-reducer state).

Fault Model. We only consider and guard against fail-stop failures. These occur due to hardware failures, critical software bugs, or planned upgrades and reboots in the cluster or cloud environment. In such cases we assume that the acquired resource becomes unavailable and we lose all the buffered data and state associated with that host. Further, individual network link failures are treated as host failures and the fault-tolerance techniques are reused. Recovering the lost data and state by replaying tuples from the beginning is not possible, as it would cause an unacceptable latency; this requires efficient state-management and low-latency fault-tolerance techniques to recover the lost state and continue processing. Simple techniques that store the state in an external distributed, shared, persistent memory [88, 81] (DHT, HDFS, databases) would also suffer from high processing and recovery latency, since these systems lack a notion of locality, and state updates and recovery require expensive read/write operations on the external system.

6.3 Goals of Proposed System

Performance and Availability. In SPSs, QoS is defined not only in terms of overall system availability and its capability to eventually process all incoming tuples, but is also measured in terms of the average response time – the latency between when a tuple is generated by the mappers and when it is processed by the reducer, including the corresponding state update. Brito et al. [22] identified several classes of applications with expected response times ranging from minutes or hours for traditional batch data analysis, to less than 5 ms per incoming tuple for real-time applications. Here we focus on fast streaming applications with average response
Such load transfer should incur low overhead and not interfere with the regular system operations. Application performance should be maintained during failure and fault-recovery procedure while reducing recovery overhead. These low-latency requirements dictate the design of our system and preclude the use of persistent storage for fault-tolerance. Replicated in-memory distributed hash tables (DHTs) can persist state for failure-recovery but are costly during normal operations for accessing and updating the distributed state since they are not sensitive to the locality of the state on reducer(s). Deterministic Operations. We assume that the reducer is “an order agnos- tic, deterministic operator”, similar to batch MR, i.e., it does not require tuples to arrive at a certain time or in a particular order to produce results, as long as each tuple is processed exactly once by the reducer. This implies that the application can be made deterministic if we ensure the following: (D1) the state associated with a key at each reducer is maintained between successive executions, i.e., the state is not lost, even during failures; (D2) none of the tuples produced by the mappers are lost; and (D3) none of the tuples produced by the mappers are pro- cessed more than once for two different state updates. The last condition might occur during failure recovery or when shifting the load to another reducer (chapter 6.5). A tuple is non-deterministic if used for multiple different state transforma- tions, a case which may occur if: (1) the state, or part of it, is lost during recovery 128 (which should never occur as it violates D1), or (2) the reducer instance is unable to determine if the replayed tuple was processed earlier and applies the update again on the recovered state. Achieving strict determinism that satisfies D1, D2, and D3 is expensive and requires complex ordering and synchronization [105]. We relax these and allow “atleast-once” tuple semantics, instead of “exactly-once”, i.e., we ensure that D1 and D2 are strictly enforced, but some of the tuples may be processed more than once by the reducers, thus violating D3. Although this limits the class of applica- tions supported by our system, this is an acceptable compromise for a large class of big data applications and can be mitigated at the application level using simple techniques such as last seen timestamp. 6.4 Proposed System Architecture To achieve the dual low-latency goals of response time performance and fault- recovery under variable data streams, we need to support adaptive load-balancing and elastic run-time scaling of reducers to handle the changing load on the system. Further, we need to support low-latency fault-tolerance mechanism that has min- imal overhead during regular operations and supports fast, parallel fault-recovery. We achieve adaptive run-time load-balancing and elastic scaling of stateful reducers by efficiently re-mapping a subset of keys assigned to each reducer, at run-time, from overloaded reducer nodes to less loaded ones. However, the reducer semantics dictate that all the tuples corresponding to a particular key should be processed by a single reducer which conflicts with this requirement. 
Hence, to transparently meet these semantics, we efficiently manage the state associated with the overloaded reducer and migrate the partial state corresponding to the re-mapped keys. In addition, to support efficient fault-tolerance and recovery, we back up tuples with minimal network and memory overhead, and perform asynchronous incremental checkpointing and backup such that the tuple and state backups corresponding to a specific key are collocated on a single node. The asynchronous checkpointing ensures that the performance of regular operations is not affected, and the state and tuple collocation ensures that recovery from failures is fast, without requiring any state or tuple transfer over the network.

The proposed system design integrates these two approaches by building a reducer peer-ring, as shown in Figure 6.3. We overlay a decentralized monitoring framework where each reducer node is identical and acts both as a primary node, responsible for processing incoming tuples, and as a secondary node, responsible for backing up its neighbor's tuples and state as well as monitoring it for failures. A reducer node consists of several components (Figure 6.3): a state manager, responsible for maintaining the local partitioned state for each key processed by the reducer; a backup manager, responsible for receiving and storing state checkpoints from its peers as well as for performing asynchronous incremental checkpointing of the local state; a load-balancer/scaler, responsible for monitoring the load on the reducer and performing appropriate load-balancing or scaling operations; and a peer monitor, responsible for monitoring its peer for failures and performing recovery actions. We discuss these techniques in detail and describe the advantages offered by the peer-ring design in meeting our system goals.

Figure 6.3: Reducer node components (state manager, backup manager, load balance/scale manager, peer monitor and tuple buffer; each node receives incremental state backups from its primary peer and sends asynchronous incremental backups to its secondary peers on the reducer peer-ring).

6.4.1 Dynamic Re-Mapping of Keys

The shuffle phase of batch MR uses a mapping function F = hash(k) mod n, where k is the key and n is the number of reducers, to determine where a tuple should be routed. The shuffle phase is performed in two stages. First, the mapper applies F to each outgoing tuple and sends the aggregated results to the designated reducer. Once all mappers are done, each reducer aggregates all the tuples received for each unique key from the several mappers, and then executes the reducer function for each key over its complete list of values. Scalable SMR is similar, except that the tuples are routed immediately to the corresponding reducer, which continuously processes the incoming tuples and updates its state (chapter 6.2).

However, when performing elastic scaling, the number of reducers can change at run-time. The above mapping function F then becomes infeasible, since adding or removing even a single reducer would cause a shuffle and re-map of many of the existing keys to different reducers, leading to a large number of state migrations between the reducers. This O(|K|) remapping, where |K| is the number of unique keys, will introduce a high processing latency. Existing SPSs, such as Storm, S4, SEEP and Granules, are limited in their run-time scale in/out due to the use of such a hash function.
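To make the cost of the naive mapping concrete, the following small sketch (with a hypothetical key set and Python's built-in hash) counts the fraction of keys whose reducer changes when a single reducer is added under F = hash(k) mod n; going from 4 to 5 reducers remaps roughly 80% of the keys, which is exactly the O(|K|) reshuffle described above.

    # Fraction of keys that are remapped when scaling from n_old to n_new reducers
    # under the naive mapping F = hash(k) mod n.
    keys = ["key-%d" % i for i in range(10000)]   # hypothetical key set

    def moved_fraction(n_old, n_new):
        moved = sum(1 for k in keys if hash(k) % n_old != hash(k) % n_new)
        return moved / len(keys)

    print(moved_fraction(4, 5))   # ~0.8: most keys change reducers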
Figure 6.4: Consistent hashing examples with 4 reducers: (a) keyspace-to-reducer mapping with reducer tokens at 0.1, 0.35, 0.7 and 0.88; (b) peer-backup for r = 2, where a key with F(k_i) = 0.2 has one primary and two secondary replicas.

To overcome this, we rely on consistent hashing [65] for the key-to-reducer mapping, whose monotonicity property ensures that adding or removing reducers only affects a small portion, O(|K|/n), of the key set, requiring fewer state migrations and incurring a smaller latency. The idea is to assign each of the reducers a unique "token" ∈ [0, 1] to form a ring (Figure 6.4(a)). The keys are always mapped to [0, 1] using a well-distributed hash function. We then select the closest reducer in the counter-clockwise direction on the ring from that point. The complexity of mapping a key to a reducer is O(1).

Whenever a reducer is added, it is assigned a new unique token which puts it between two existing consecutive reducers, dividing the key space mapped to the original reducer into two subsets, one of which is mapped to the newly added reducer without affecting the key mapping for any of the other existing reducers. Similarly, whenever a reducer is removed (e.g., due to a fault), its keys are dynamically mapped to the next reducer on the ring in the counter-clockwise direction.

The basic consistent hashing algorithm assigns a random position to each of the nodes on the ring, possibly leading to a non-uniform load distribution. It is also vulnerable to changes in the system's load due to variations in data rates and key distribution over time. A virtual nodes approach [35] addresses only the initial non-uniform load distribution issue. Instead, we use a dynamic approach, similar to Cassandra, that allows a node to move along the ring at run-time in response to variations in the system load.

6.4.2 Peer-Backup of Tuples

To achieve efficient backup of the incoming tuples to a reducer, we again use a variation of consistent hashing that assigns each key to r + 1 contiguous buckets on the ring onto which the tuple is backed up, where r is the tuple replication factor. Figure 6.4(b) shows a sample configuration with 4 buckets and r = 2. A key $k_i$ is hashed to the interval [0, 1] and the primary replica is selected by finding the nearest neighbor on the ring in the counter-clockwise direction, as before. In addition, r = 2 neighboring buckets are chosen as secondary replicas by traversing the ring in the clockwise direction. The mapper then sends the tuple to all 3 of the nodes (1 primary, 2 secondary), using a reliable multi-cast protocol (e.g., PGM) to minimize network traffic.

On receiving a tuple, each node checks whether it is the primary by verifying that it appears first in the counter-clockwise direction from the tuple's position on the ring. If so, the tuple is dispatched to the appropriate reducer thread on the machine for processing. Else, the node is a secondary and the tuple is backed up in-memory, to be used later for load-balancing or fault-recovery. Note that each node can determine whether it is a primary in O(1) time, and needs to navigate to at most r clockwise neighbors. Note also that it is important to clear out tuple replicas which have been successfully processed, to reduce the memory footprint. This eviction policy is discussed in 6.4.4.
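As a rough illustration of the key-to-reducer mapping and the r + 1 replica selection in 6.4.1 and 6.4.2 (not the Floe implementation; the hash normalization, token values and ring orientation are assumptions made for the example), a minimal peer-ring could look as follows:

    import bisect
    import hashlib

    class PeerRing:
        # Minimal consistent-hashing ring: each reducer node owns a token in [0, 1);
        # a key is served by the nearest node counter-clockwise from its hash value,
        # and the next r nodes clockwise hold its tuple and state backups.
        def __init__(self, tokens):
            # tokens: node_id -> position in [0, 1), e.g. {"n0": 0.10, "n1": 0.35, ...}
            self.nodes = sorted(tokens.items(), key=lambda kv: kv[1])
            self.positions = [pos for _, pos in self.nodes]

        @staticmethod
        def _hash(key):
            # Map the key (roughly) uniformly into [0, 1) with a well-distributed hash
            return int(hashlib.md5(key.encode()).hexdigest(), 16) / float(2 ** 128)

        def replicas(self, key, r=2):
            # Returns [primary, r secondaries]; O(log n) here, O(1) with a fixed
            # bucket table as in the text.
            pos = self._hash(key)
            i = bisect.bisect_right(self.positions, pos) - 1   # nearest counter-clockwise token
            return [self.nodes[(i + j) % len(self.nodes)][0] for j in range(r + 1)]

    ring = PeerRing({"n0": 0.10, "n1": 0.35, "n2": 0.70, "n3": 0.88})
    print(ring.replicas("user-42", r=2))   # primary node followed by its two clockwise backups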
6.4.3 Reducer State Model and Incremental Peer-Checkpointing

The state model must support (1) partial state migration, i.e., migrating the state associated with only a subset of keys at a reducer, and (2) reliable, incremental and asynchronous checkpointing [97, 4, 49], i.e., the state must be deterministic, not concurrently updated by the reducer during checkpointing, and without the need to pause the reducers during this process.

Partial states can be managed by decoupling the state from the reducer instance, and partitioning it based on the individual keys being processed by the reducer (Figure 6.5). This allows incremental backup only for the keys processed and updated during the last checkpointing period. We can further reduce the size of the incremental backup by restricting the state representation to a set of key-value pairs (Master State in Figure 6.5). Given this two-level state representation, an incremental backup can be obtained by keeping track of the reducer keys updated during the latest checkpoint and the set of key-value pairs updated for each of the reducer keys.

Figure 6.5: State representation with master state and partial state fragments.

We divide the state associated with each key into two parts, a master state and a state fragment. The former represents a stable state, i.e., a state that has been checkpointed and backed up onto a neighbor. The latter represents the incrementally updated state which has to be checkpointed at the end of this checkpoint interval. After the fragment is checkpointed, it can be merged into the master state and cleared for the next checkpoint interval. This allows efficient incremental checkpointing, but can lead to unreliable checkpoints as the state fragment may be updated by the reducers during the checkpointing process. We could avoid this by pausing the reducers during checkpointing, but that incurs high latency. Instead, we propose an asynchronous incremental checkpointing process.

Here, we divide the state fragment into two mutually exclusive sub-fragments: active and inactive. The active fragment is used by the reducers to make state updates, i.e., the key-value pairs are added/updated only in the active fragment, while the inactive fragment is used only during checkpointing, as follows. At the end of a checkpoint interval, the active fragment contains all the state updates that took place during that checkpoint interval, while the inactive fragment is empty. To start the checkpointing process, the state manager atomically swaps the pointers to the active and inactive fragments and the corresponding timestamp is recorded. The reducers continue processing as usual and update the state in the active fragment, while the inactive fragment contains a reliable snapshot of the updates that occurred in the previous interval. The inactive fragment is then asynchronously serialized, checkpointed, and transferred to the backup nodes using a multi-cast protocol. After completion, the inactive fragment is merged with the master state and is cleared for the next checkpoint cycle.

With respect to freshness, the active fragment contains the most recent updates, followed by the inactive fragment and finally the master state.
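The following Python sketch illustrates this two-level layout and the atomic swap of the active and inactive fragments at a checkpoint boundary; the class and method names are hypothetical, and serialization, multi-cast transfer and failure handling are omitted.

import threading, time

class KeyState:
    def __init__(self):
        self.master = {}     # stable, already checkpointed key-value pairs
        self.active = {}     # updates made during the current checkpoint interval
        self.inactive = {}   # snapshot being checkpointed (not updated by reducers)

class StateManager:
    def __init__(self):
        self.states = {}                 # reducer key -> KeyState
        self.lock = threading.Lock()     # guards only the pointer swap (simplified)

    def write(self, key, field, value):
        st = self.states.setdefault(key, KeyState())
        st.active[field] = value         # reducers always update the active fragment

    def read(self, key, field):
        st = self.states.get(key)
        if st is None:
            return None
        # Freshness order: active fragment, then inactive fragment, then master state.
        for frag in (st.active, st.inactive, st.master):
            if field in frag:
                return frag[field]
        return None

    def checkpoint(self, send_to_backups):
        with self.lock:
            ts = time.time()                          # checkpoint timestamp T_s
            for st in self.states.values():           # atomically swap fragment pointers
                st.active, st.inactive = st.inactive, st.active
        snapshot = {k: dict(st.inactive) for k, st in self.states.items() if st.inactive}
        send_to_backups(snapshot, ts)                  # asynchronous transfer in practice
        for st in self.states.values():                # merge into master and clear
            st.master.update(st.inactive)
            st.inactive.clear()

Only the updated fragments appear in the snapshot, which is what keeps the incremental checkpoint small, and the reducers keep writing into the new active fragment while the transfer proceeds.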
A "state read" request for a key-value pair from the reducer is thus sent first to the active fragment, then the inactive fragment, and finally the master state, until the corresponding key-value pair is found (Figure 6.5). While this involves three seeks, it can be mitigated with O(1) data structures like hash tables.

The combination of partitioned state and active/inactive state fragments allows the reducer thread to continue processing incoming tuples and update the state in the active fragment during the checkpointing process without any conflicts or interruptions, thus minimizing the overhead (chapter 6.7). As with tuple-backup, we follow an optimistic checkpointing process where the primary host does not wait for an acknowledgment from any of the backup hosts, letting reducers execute uninterrupted. This optimistic replication works as long as at least one backup node is available. We rely on the peer ring to determine the hosts to be used for backup to ensure tuple and state collocation for fast recovery.

6.4.4 Tuple Ordering and Eviction Policy

The tuple eviction policy determines which tuples can be safely removed from the backup nodes such that the state associated with the failed host can be recovered by replaying the remaining tuples and updating the checkpointed state. Tuple eviction allows us to keep a low memory footprint.

The proposed eviction policy allows a small subset of tuples to be processed more than once. We assume that all the mappers are loosely time synchronized and that the maximum time skew between any two mappers is less than the maximum network latency (L_n) to transfer tuples from mappers to reducers. Such assumptions have been used in systems such as Flux [105] to identify missing messages, and are practically justified since NTP can achieve an error bound within a few microseconds while network latency is typically observed in the millisecond range. Each mapper emits tuples with an associated timestamp. We assume a reliable multi-cast protocol to ensure that emitted tuples are delivered in FIFO order. The primary host then processes the tuples in the order they arrive from different mappers and marks the state update with the latest timestamp of the tuples that affect the current state. Given the time skew between mappers and the maximum network latency L_n, a tuple with a lower timestamp may be received and processed at a later point; in this case the timestamp mark on the state is not updated.

The backup hosts also receive tuples from different mappers for backup. These hosts store the backup tuples ordered by their timestamps. Maintaining this order is not costly since the tuples are usually received in increasing order of timestamp (except for a few, due to clock skew and network latency). Whenever a checkpoint is received from the primary, the secondary host retrieves the associated timestamp T_s and evicts all tuples with a timestamp less than T_s − 2L_n (instead of tuples with a timestamp less than T_s), since tuples with an earlier timestamp may arrive later and may not have been processed yet by the primary node. This leaves some potential tuples in the backup which may have already been processed and reflected in the checkpointed state. If the primary host fails, the backup node tries to recover the state by replaying the backup tuples and updating the checkpointed state. This may lead to some tuples being processed twice, violating condition D3. Certain measures can be implemented at the application level to handle this scenario, but they are out of our scope.
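A small Python sketch of such a backup buffer and its eviction rule is given below; the 2·L_n slack mirrors the policy above, while the class name, parameters and the choice of ascending timestamp order are illustrative assumptions.

import bisect, itertools

class BackupBuffer:
    """In-memory tuple backup held by a secondary node, ordered by mapper timestamp."""
    def __init__(self, max_net_latency):
        self.l_n = max_net_latency          # L_n: assumed max mapper-to-reducer latency
        self._seq = itertools.count()       # tie-breaker for equal timestamps
        self.tuples = []                    # (timestamp, seq, tuple), kept sorted

    def backup(self, ts, tup):
        # Near-append in the common case, since tuples mostly arrive in timestamp order.
        bisect.insort(self.tuples, (ts, next(self._seq), tup))

    def on_checkpoint(self, ts_checkpoint):
        # Evict tuples older than T_s - 2*L_n; later ones may still be unprocessed
        # at the primary and are retained for possible replay.
        cutoff = ts_checkpoint - 2 * self.l_n
        self.tuples = self.tuples[bisect.bisect_left(self.tuples, (cutoff,)):]

    def replay(self):
        # On primary failure: replay the retained tuples (oldest first) over the
        # recovered checkpoint; a few may be applied twice (at-least-once semantics).
        return [tup for _, _, tup in self.tuples]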
As long as the time skew is bounded by the network latency L_n, there will be no tuple loss during failure, hence condition D2 will be met. Note that the issue of variability in network latency over time can be mitigated in one of two ways. First, by setting a much higher value of the threshold L_n than the observed maximum network latency (10-20x), such that any latency higher than the threshold may be considered a network failure; a side effect of a high threshold, however, is that it significantly increases the number of messages that may be replayed during fault recovery or a scaling-in scenario. Second, we can monitor the network latency over time and update the threshold (2 × L_n) to reflect the variations in network latency. Note that this can be achieved efficiently in a decentralized manner since each of the nodes in the ring can monitor the latency and use a gossip protocol to propagate the changes to the other nodes in the ring. We use the first approach for our system evaluation.

6.5 Adaptive Load Balancing and Elasticity

Because of fluctuations in the data streams, two scenarios can occur. First, the number of keys mapped to a particular interval (i.e., on a reducer node) may fluctuate at run-time, and second, the rate of generated tuples associated with a specific key may vary at run-time. Both scenarios can lead to processing load imbalances among the reducer nodes.

The proposed architecture supports adaptive load balancing by leveraging the fact that the neighboring nodes (secondary instances) associated with a key already possess a recently cached state (peer-checkpointing) as well as the unprocessed tuples (peer-backup) associated with that key. We use a simple load-balancing strategy where each host monitors its buffer queue length (q_L) and the overall CPU usage (c) while processing the incoming tuples. A host is said to be overloaded if q_L ≥ τ_q^high AND c ≥ τ_c^high. While this strategy is prone to oscillation in resource allocation if frequent variations are observed, we have previously proposed robust run-time adaptation algorithms that not only consider the current system load but also observe data patterns and use predictive techniques to avoid such oscillations as well as to minimize resource cost [70, 69], which can be applied here.

An overloaded node (A) negotiates with its clockwise neighbor (B) (i.e., the first backup host) to see if it can share some of the load (i.e., q_L ≤ τ_q^low AND c ≤ τ_c^low for that node). If so, it requests B to expand the primary interval associated with that node by moving in the counter-clockwise direction on the ring. Figure 6.6 shows a sample peer-ring configuration consisting of 4 reducer nodes. Assuming that node A is overloaded, it negotiates with its neighbor B to check if it has spare cycles (i.e., q_L ≤ τ_q^low AND c ≤ τ_c^low). If so, it requests B to take over some of its load (i.e., share the key space). Node B is then moved along the circle in the counter-clockwise direction, its key space is extended, and hence it takes over some load from node A. Note that during this process no explicit state transfer is required, since B acted as a secondary before load-balancing and hence holds a recent checkpoint of the state associated with the given keys. However, for simplicity, we assume that the backup node always moves by half the distance between the two nodes at a time.

Figure 6.6: Load balancing example: the backup node (B) moves counter-clockwise along the ring (from token 0.35 to 0.23), extending its key space to relieve the overloaded primary (A).
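The overload test and the neighbor negotiation can be summarized by the short sketch below; the threshold values, field names and the halving of the token gap are illustrative placeholders rather than the system's actual interfaces.

from dataclasses import dataclass

# Hypothetical thresholds; in practice these would be tuned per deployment.
TAU_Q_HIGH, TAU_C_HIGH = 1000, 0.85    # overload: queue length and CPU usage
TAU_Q_LOW, TAU_C_LOW = 200, 0.50       # spare capacity

@dataclass
class NodeStats:
    token: float      # position on the peer ring
    queue_len: int    # pending tuple queue length (q_L)
    cpu: float        # CPU utilization (c)

def is_overloaded(n):
    return n.queue_len >= TAU_Q_HIGH and n.cpu >= TAU_C_HIGH

def has_spare_cycles(n):
    return n.queue_len <= TAU_Q_LOW and n.cpu <= TAU_C_LOW

def negotiate(primary, backup):
    """If the primary is overloaded and its clockwise neighbor has spare cycles,
    move the neighbor halfway toward the primary's token, expanding the
    neighbor's interval; no state transfer is needed since the neighbor already
    holds the checkpointed state and backup tuples for the transferred keys."""
    if is_overloaded(primary) and has_spare_cycles(backup):
        backup.token = (primary.token + backup.token) / 2.0
        return True    # mappers are then notified of the new token via multi-cast
    return False       # otherwise the backup cascades the request, or a scale-out is triggered

# Example: A at 0.10 overloaded, B at 0.35 idle -> B moves to 0.225 (cf. Figure 6.6).
a = NodeStats(token=0.10, queue_len=5000, cpu=0.95)
b = NodeStats(token=0.35, queue_len=50, cpu=0.20)
negotiate(a, b)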
Node B then trivially recovers the state associated with the transferred keys that was checkpointed earlier by A. It then replays and processes the tuples buffered in its backup (i.e., those tuples not reflected in the currently checkpointed state) and starts processing the new tuples associated with that interval. The backup node now becomes the primary for the given interval and relieves the original primary (A) by taking over some of its load. It also notifies the mappers about the change using multi-cast. Note that the backup node need not wait for this notification to be delivered to the mappers, since, being the secondary node, it already receives all the relevant tuples. On receiving the notification, the mappers update their ring, stop sending the tuples to A and start sending them to B, and the load-balancing is completed.

The adaptive load-balancing technique allows an overloaded node to negotiate with its immediate neighbor to offload some of its work. However, if the neighboring host does not have spare cycles and is working at its capacity, it can in turn request its neighbor to offload before accepting the load from the primary. If none of the hosts can accept additional load, the primary host will elastically scale out as follows.

Scaling Out. Scaling out involves provisioning a new host, putting the new host on the ring such that it offloads some of the work from the overloaded primary node, and transferring the corresponding state onto it. To minimize the overhead, we start transferring, in the background, only the state corresponding to the keys that will be offloaded. We also start sending (duplicating) the tuples associated with those keys to the new host for backup (using the multi-cast protocol, hence without an additional hop). During this process, the primary node continues to process the incoming tuples as before (albeit at a slower rate due to overload) and to update the active state fragment as before. Once the master state transfer is completed, the previously described checkpointing process is performed on the backup nodes as well as on the newly acquired node. This step ensures that the new host has the latest state and that the processed tuples are evicted from its backup buffer. Finally, the new host is put on the peer-ring mid-way between the overloaded node and its neighbor, and its r neighbors are informed about the change. The new host thus takes over some work from the overloaded node.

Figure 6.7 shows a similar transition for scaling out. An overloaded node (A) first sends a load-balance request to its neighbor (B). However, unlike before, the neighbor (B) does not have enough free processing cycles and hence in turn tries to offload some load to its neighbor, and so on. If it is observed that all the reducers are processing at their capacity, a new node C is provisioned and placed between A and B, which takes over some of the load from node A. However, note that in this case node C does not have access to the state associated with the keys. Hence we perform a delayed transition (i.e., we start an asynchronous state transfer to node C) while A continues to process the incoming tuples, and the load is transferred only after the state transfer is completed.

Figure 6.7: Scaling out example: a new node (C) is provisioned and placed on the ring between the overloaded primary (A) and its secondary (B).
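The delayed scale-out transition can be outlined as in the sketch below; the helper functions, the token dictionary and the node names are hypothetical stand-ins for the corresponding system components, and the background transfer and checkpoint steps are collapsed into comments.

def clockwise_neighbor(tokens, node):
    # Next larger token on the ring, wrapping around.
    order = sorted(tokens, key=tokens.get)
    return order[(order.index(node) + 1) % len(order)]

def scale_out(tokens, overloaded, new_node, transfer_master_state, notify_ring):
    """Provision a new node, prime it in the background, and only then place it
    on the ring mid-way between the overloaded primary and its clockwise neighbor."""
    neighbor = clockwise_neighbor(tokens, overloaded)
    # 1. Background transfer of the master state for only the keys to be offloaded,
    #    while the overloaded primary keeps processing and updating its active fragment.
    transfer_master_state(src=overloaded, dst=new_node)
    # 2. Duplicate tuples for those keys to the new node (multi-cast), then run one
    #    checkpoint cycle so the new node holds the latest state and evicts processed tuples.
    # 3. Place the new node mid-way and inform its r neighbors and the mappers.
    tokens[new_node] = (tokens[overloaded] + tokens[neighbor]) / 2.0
    notify_ring(new_node, tokens[new_node])
    return tokens

# Example (cf. Figure 6.7): A at 0.10 overloaded, B at 0.35 busy -> C placed at ~0.23.
tokens = {"A": 0.10, "B": 0.35, "D": 0.70, "E": 0.88}
scale_out(tokens, "A", "C",
          transfer_master_state=lambda src, dst: None,
          notify_ring=lambda node, tok: None)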
Scaling In. A primary host can be scaled in if that host, along with its clockwise neighbor, is lightly loaded (i.e., q_L ≤ τ_q^low AND c ≤ τ_c^low for both hosts). The primary host can offload all of its load to the neighbor and return to the resource pool, becoming available for other requirements, or be shut down to conserve cost and energy. The process is similar to the load-balancing process, including the checkpoint, tuple replay, and state recovery, except that the neighbor's interval is now expanded to cover the primary's entire interval by setting its token equal to the primary's token and removing the primary from the ring.

Figure 6.8 shows the transition for scaling in resources. In this case, since node A is lightly loaded, it checks with its neighbor node B whether it has spare cycles to take over all of its load. If so, node A can be removed from the peer-ring and node B is moved into its position. As before, no state or tuple transfer is required to complete the scaling-in transition. Note that even though node B now processes a larger key space compared to the other nodes, the low tuple density allows it to process the tuples without getting overloaded.

Figure 6.8: Scaling in example: the lightly loaded primary (A) leaves the ring and its lightly loaded secondary (B) takes over its token and interval.

6.6 Fault-Tolerance

A system is said to be r-fault-tolerant if it can tolerate r arbitrary faults (6.2) and can resume its operations (i.e., satisfy D1, D2) without having to restart or re-deploy the entire application. The proposed system can tolerate at most r failures among consecutive neighbors on the ring, since the state and the tuples are backed up on r + 1 hosts. However, it can tolerate more than r failures if the failures do not occur in consecutive neighbors on the ring. Specifically, in the best case scenario, it can tolerate up to x = nr/(r + 1) node failures as long as the faults do not occur in more than r consecutive neighbors (where r is the replication factor and n is the number of nodes). For example, with r = 1, the system can still be functional (i.e., no state or tuple loss) even if every alternate node on the ring (i.e., n/2 nodes) fails simultaneously.

Recent large scale studies [33] have shown that a significant portion of the failures observed in a datacenter are spatially collocated (e.g., overheating, rack failures, cluster switch failures) and that the datacenter can be divided into several "fault zones" such that the probability of simultaneous failures for machines in different fault zones is lower than that for machines within a single fault zone. This is also the basis for the "fault domains" or "availability zones" feature provided by Microsoft Azure and Amazon AWS, which provide at least 99.95% availability guarantees if VMs are placed in distinct fault zones. We can exploit this property and place neighbors on the ring in distinct fault zones to achieve higher fault-tolerance. Figure 6.9 shows a peer-ring (r = 1) distributed across two fault zones. Given the property of fault zones that the probability of simultaneous failures across two zones is much lower than that within a single zone, the configuration with consecutive neighbors on the ring lying in distinct fault zones, as shown in Figure 6.9, gives an optimal fault-tolerance level. In the best case scenario, the system will still be operational without any state or tuple loss even if simultaneous failures are observed in all the nodes in a single fault zone (i.e., even if n·1/(1+1) = n/2 of the nodes fail at the same time).
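As an illustration of this bound, the standalone sketch below checks whether a given set of failed ring positions is recoverable (no run of more than r consecutive failed neighbors) and computes the best-case number of tolerable simultaneous failures x = n·r/(r+1); it is an illustrative example under these assumptions, not part of the system's code.

def max_best_case_failures(n, r):
    # Best case: failures never hit more than r consecutive ring neighbors,
    # so up to n*r/(r+1) of the n nodes can fail simultaneously.
    return (n * r) // (r + 1)

def is_recoverable(n, r, failed):
    """True if no run of consecutive failed neighbors on the ring exceeds r,
    i.e., every failed node still has a live backup within r clockwise hops."""
    failed = set(failed)
    if len(failed) == n:
        return False
    for i in range(n):
        run, j = 0, i
        while j in failed:            # count consecutive failures starting at position i
            run += 1
            if run > r:
                return False
            j = (j + 1) % n
    return True

print(max_best_case_failures(12, 1))     # 6: every alternate node may fail
print(is_recoverable(6, 1, {0, 2, 4}))   # True: alternate nodes failed
print(is_recoverable(6, 1, {0, 1}))      # False: two consecutive failures with r = 1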
Figure 6.9: Reducer peer ring with two fault zones, before and after the failure of zone 1.

Fault-tolerance involves two activities: fault-detection and recovery. To achieve the former in a decentralized manner, we overlay a monitoring ring on top of the peer-ring where each host monitors its immediate counter-clockwise neighbor (i.e., a secondary node which holds the tuple and state backup for a subset of the key space monitors the primary node). Whenever a fault is detected, to initiate fault recovery, it contacts the one-hop neighbor of the failed node to see if that node is still alive, and continues this process until r hops or a live host is found. Note that at this point the backup host will have the checkpointed state and tuples for all the failed nodes. Hence, it can employ a procedure similar to the scale-in process to take over the load from one or more (≤ r) failed nodes. As before, no state transfer or additional tuple transfer is required. The fault-recovery process by itself does not provision additional resources but uses the backup nodes for recovery so as to minimize downtime. However, during or after the takeover, if the backup node becomes overloaded, the adaptive load-balance or scale-out process will be initiated to offload some of its work without interrupting the regular operations.

6.7 Evaluation

We implemented a prototype of our proposed architecture on top of the Floe SPS [112]. It provides a distributed execution environment and programming abstractions similar to Storm and S4. In addition, it has built-in support for elastic operations and allows the user to add or remove parallel instances of (stateless) PEs at run-time using existing or acquired resources. We extend it to support elasticity as well as fault-tolerance for stateful reducers via the proposed enhancements. The goal of the experiments is not to study the scalability of the system on hundreds of nodes, but to evaluate the dynamic nature of the proposed system and the overhead it incurs. The setup consists of a private Eucalyptus cloud with up to 20 VMs (4 cores, 4GB RAM) connected using gigabit Ethernet. Each map/reduce node holds at most 4 corresponding instances.

Figure 6.10: Achieved peak throughput for different numbers of reducers.

We use a streaming version of the word frequency count application that keeps track of the overall word frequency for each unique word seen in the incoming stream, as well as counts over different sliding windows (e.g., the past 15 mins, 1 hr, 12 hrs, etc.) which represent recent trends observed in the stream. Such an application may be used in the analysis of social data streams to detect the top "k" trending topics by ranking them based on their relative counts observed during a recent window. In such applications, the exact count for each of the topics is not required since the relative ranking among the topics is sufficient. As a result, the system's at-least-once semantics for message delivery is acceptable for the application. We emulate the data streams by replaying text extracted from the corpus of the Gutenberg project. Each mapper randomly selects a text file and emits a stream of words which is routed to the reducers.
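For concreteness, a much simplified sketch of such a streaming word-count map and reduce pair is shown below; the function signatures are illustrative and do not correspond to Floe's actual API, and the windowing is reduced to a single sliding window.

import re
import time
from collections import deque

def map_words(line):
    # Mapper: emit (word, timestamp) tuples, keyed by the word.
    for word in re.findall(r"[a-z']+", line.lower()):
        yield word, time.time()

class WordCountReducer:
    """Stateful reducer for one key (word): overall count plus a sliding-window count."""
    def __init__(self, window_secs=900):            # e.g., a 15-minute window
        self.total = 0
        self.window = deque()                       # timestamps of occurrences in the window
        self.window_secs = window_secs

    def reduce(self, word, ts):
        self.total += 1
        self.window.append(ts)
        while self.window and self.window[0] < ts - self.window_secs:
            self.window.popleft()                   # expire old occurrences
        return {"word": word, "total": self.total, "recent": len(self.window)}

reducers = {}
for word, ts in map_words("to be or not to be"):
    out = reducers.setdefault(word, WordCountReducer()).reduce(word, ts)
print(out)  # last update, e.g. {'word': 'be', 'total': 2, 'recent': 2}

Since only the relative ranking of words matters, the occasional duplicate replays permitted by the at-least-once semantics perturb these counts only slightly.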
To demonstrate various characteristics, we emulate: (1) variations in the overall data rate, by dynamically scaling up/down the number of mappers, and (2) variations in the data rate for a subset of keys (load imbalance), by using streams that repeatedly emit a small subset of words. We synchronize the VMs using NTP and obtain a loose synchronization bound within a few microseconds. Further, we determine the maximum network latency (L_n) to be around 1ms by executing a number of ping requests between the VMs. Nonetheless, we use a conservative estimate of 15ms for our experiments and a value of 2 × L_n = 30ms as the bound for evicting tuples from the backup buffer, to account for variations in the network latency and potential time drift observed over time.

6.7.1 Empirical Results

We first evaluate the system under static configurations to determine the overhead due to the checkpointing and backup mechanisms. We fix the number of VMs and reducers at deployment time and progressively increase the data rates to determine the maximum achievable cumulative throughput (processing rate of the reducers). We examine the system under different tolerance levels r = 0, 1, 2, where 0 indicates that no fault-tolerance mechanisms are in place. We compare our system against two variations of the application deployed using the Storm SPS. The first uses Storm's upstream-backup feature with explicit acknowledgments, ensuring that no tuples are lost during failure; however, the state is stored locally and may be lost if the corresponding node fails. The second relies on an external distributed reliable in-memory storage (Cassandra) to store the state associated with each key. This version provides fault-tolerance and recovery as well as protection against tuple loss similar to the proposed system, but incurs significant overhead.

Figure 6.10 shows the peak throughput achieved by these systems as a function of the number of reducers. The following key observations can be made from Figure 6.10: (1) The peak throughput achieved by Floe at r = 0 is consistently higher than the others due to minimal overhead, and it drops by around 15% as we increase the tolerance level. This is expected since higher r values require additional tuple and state transfer and add to the load on secondary nodes. (2) Floe achieves higher throughput than both versions of Storm, giving around 2.8x improvement for r = 1 compared to Storm with state backup using Cassandra, due to the high latency incurred during state access. (3) Finally, we observe that Floe scales (almost) linearly as we increase the number of resources, while Storm's peak throughput flattens out after a certain point due to the bottleneck caused by the external state management system.

Figure 6.11: Throughput and latency characteristics for load balance and scale-out: (a) load balancing example (fixed resources); (b) throughput characteristics as a function of data rate and scale-out; (c) backup node latency.

Next we study the throughput and latency characteristics of the proposed load-balancing and elasticity mechanisms. Figure 6.11(a) shows an example load-balancing scenario with fixed resources.
It shows the last 1-min average data rate per node for a deployment with 3 reducer nodes, and the average queue length for one of the nodes. The system is initially imbalanced (due to the random placement of a small number of reducer nodes) but stable (i.e., q_L ≤ τ_q^high for all nodes). At around 500s, we repeatedly emit a small subset of words, causing further imbalance. The pending queue length for reducer 1 starts increasing beyond the threshold, indicating that the incoming data rate is beyond its processing capacity, and it initiates the load-balancing process with its neighbor, reducer 3, which in turn transfers some of its load to reducer 2 and reaches a stable state around 700s. Note that the system may not reach a stable state if the increase in data rate is beyond the cumulative capacity of the cluster, in which case a "scale out" operation will be performed. Figure 6.11(b) shows the system response and throughput as a function of increasing data rate. We observe that for a given set of resources, the achieved throughput initially increases with the incoming data rate. However, as the system reaches its capacity, the observed throughput flattens out (as indicated in Figure 6.11(b)). As a result, the pending queue length for the resources goes beyond the threshold and a load-balance or scale-out process is initiated on an idle or acquired resource, which allows the system to catch up with the incoming data rate. We observe the reaction time for scale out to be around 1.5–3 seconds, which includes both the detection latency as well as message replay and state restoration. While the former is a function of the monitoring interval, the latter depends on both the checkpointing interval and the data rate (at the overloaded reducer), which we study next.

Figure 6.11(c) shows the latency characteristics for the load-balancing scenario as a function of the data rate and the checkpointing interval. We observe that the absolute value of the latency is very low (10 - 500ms) for moderate data rates and checkpointing intervals of up to 10s. However, the latency increases as the checkpointing interval and the data rate increase, since this causes the number of tuples backed up by the node to grow, along with the number of tuples that need to be replayed. Further, it adds significant memory overhead, which contributes to the performance degradation. Thus smaller checkpointing intervals are preferred. Another benefit of our approach is due to the proposed state representation and the use of state fragments, which allows us to asynchronously and consistently checkpoint a part of the state without pausing the regular operations, eliminating the effect of the frequent write operations caused by small checkpointing intervals. As shown in Figure 6.12, the proposed incremental checkpointing significantly reduces the size of the checkpoint when using smaller intervals (the checkpoint size for a 2s interval stabilizes around 7% of the size of the entire state stored by the reducer node), which further supports our argument for a smaller checkpointing interval.

Figure 6.12: Relative state size (ratio of checkpoint delta to full state size) for incremental checkpointing with intervals of 1s to 20s.
Figure 6.13: Fault tolerance throughput and latency characteristics: (a) fault-recovery example; (b) recovery latency as a function of the number of simultaneous node failures (n = 12, r = 1, data rate = 15,000 msgs/sec).

Since load balancing, elasticity, and fault-tolerance use the same backup mechanism, Figure 6.11(c) is a good indicator of the overall performance of our process. Finally, we demonstrate the fault-tolerance and recovery process. Figure 6.13(a) shows a snapshot of an execution with 3 reducer nodes (12 reducers) at a fixed data rate. It also shows the size of the tuple backup for one of the reducers (reducer 3). We induce a fault in the system by manually stopping reducer 1 at around 500s. Reducer 3 stops receiving checkpoint data from the failed reducer (which is also treated as a heartbeat), leading to a large tuple backup during the recovery process. After detecting the fault, it decides to take over the execution from the failed reducer, replays all the backed-up tuples to recover the state, and finally moves itself on the ring so that the tuples originally destined for the failed node are now transmitted to reducer 3 (as is evident from the increasing data rate). Note that the latency characteristics of fault-recovery due to a single node fault are similar to those of the load-balancing process (Figure 6.11(c)) and hence are omitted for brevity.

We further study the recovery latency of the system under multiple concurrent failures. Since the recovery is performed in parallel, the overall recovery latency is the maximum of the latencies to recover all the failed nodes in parallel. Figure 6.13(b) shows the average recovery latency observed for multiple (m) simultaneous failures such that no two consecutive neighbors fail simultaneously. We observe that the recovery latency does not increase linearly and stabilizes around 1,200ms to 1,500ms for 3 to 6 simultaneous node failures. Note that the recovery latency for multiple simultaneous failures is measured as the maximum recovery latency incurred by any of the backup nodes in the ring. The observed variations in the recovery time (Figure 6.13(b)) are due to imbalances in the load (which lead to varying recovery latency for different failed nodes) and not due to the increase in the number of simultaneous failures.

6.8 Related Work

MR and Batch Processing Systems. MR introduced a simplified programming abstraction for distributed large-scale processing of high-volume data. A number of MR variations that provide high-level programming and querying abstractions, such as PIG, HIVE, Dryad [60], Apache Spark [134], and Orleans [23], as well as extensions such as iterative [38] and incremental MR [19], have been proposed for scalable high-volume data processing. However, these do not consider run-time elasticity of mappers and reducers, as the workload, and hence the required resources, can be estimated and acquired at deployment time. Further, simple fault-tolerance and recovery techniques such as replicated persistent storage and re-execution of failed tasks suffice, since the overall batch run-time outweighs the recovery cost. In contrast, SMR requires run-time load-balancing and elasticity, as the stream data behavior varies over time, and low-latency fault-tolerance and recovery, to maintain the desired QoS. These are the focus of this chapter.

Scalable SPS and SMR.
SPSs (Apache Storm [81], Apache S4 [88], Granules [95], and TimeStream [98]) enable loosely coupled applications, often modeled as task graphs, to process high-velocity data streams on distributed clusters with support for SMR operators. Here, mappers continuously emit data tuples routed to a fixed number of reducers based on a given key. Their main drawback is the limited or lack of support for run-time elasticity in the number of mappers and reducers to account for variations in data rates or resource performance. Storm supports limited load balancing by allowing the user to add new machines at run- time and redistribute the load across the cluster, but requires the application to be temporarilysuspended, leadingtoatransientspikeinlatency. Othersystems[103], including our previous work [69], support elasticity and dynamic load-balancing but assume stateless operators or involve costly processes and state migration, and hence unsuitable for SMR. Other SMR approaches (Spark Streaming [135] and StreamMapReduce [22]) convert the incoming stream into small batches (windows) of tuples and perform batch MR within each window; the reducer state is embedded in the batch out- put. Runtime elasticity can be achieved at the window boundaries by varying the number of machines based on the load observed in the previous window. However, 153 this has two downsides. First, depending on the window size, the queuing latency of the tuples can grow to tens of seconds [135]. And second, since these systems use a simple hash-based key-to-reducer mapping function, scaling in/out causes a complete reshuffle of the key mappings, incurring a high overhead due to the large state transfer required. Fault-tolerance and state management in SPS. Traditional SPSs support fault-tolerance using techniques such as active replication [105] which rely on two or more copies of data and processes. Recent systems such as S4 [88] and Gran- ules [95] provides partial fault-tolerance by using a centralized coordination system such as Zookeeper. They offer automatic fail-over by launching new instances of the failed processes on a new or standby machine. They also perform periodic, non-incremental checkpointing and backup for individual processes, including the tuplebuffer, byusingsynchronous(de)serializationwhichrequirespausingthepro- cess during checkpointing. Further, they use an external in-memory or persistent storage for backup. These approaches add considerable overhead and lead to high processing latency during checkpointing as well as recovery. In addition, they do not guarantee against tuple loss as any in-flight and uncheckpointed tuples may be lost during the failure. Systems such as Storm [81], Timestream [98] guarantee “atleast once” tuple semantics using a combination of upstream-backup and an explicit acknowledg- ment tree, even in the presence of multiple node failures. Trident, an abstraction over Storm improves that to “exactly once” semantics using an external, persis- tent coordination system (zookeeper). However, neither of these support state recovery, and any local state associated with the failed processes is lost. A user may implement their own state management system using persistent distributed cache systems (DHT, Zookeeper) but this increases the complexity and processing 154 overhead per tuple. SEEP [84, 26] integrates elastic scale-out and fault-tolerance for general purpose SPSs using a distributed, partitioned state and explicit rout- ing table. 
Their solution, while applicable to SMR, incurs higher overhead during load-balancing, scaling and fault-recovery as it fails to take advantage of the key grouping and state locality property of the reducers. This causes significant reshuf- fling of the key-to-reducer mapping. Martin et. al. [79] proposed a streaming map reduce system with low-overhead fault tolerance similar to our proposed system. The key distinguishing factor is our support for run-time elasticity and load-balancing to handle variability in data streams observed at run-time. Further, their system enables deterministic execu- tion (exactly-once semantics) and relies on the virtual synchrony [86] method for synchronization and synchronous checkpointing at the end of each epoch (check- point interval) which increases overall latency and further requires total ordering of messages from different mappers, which is difficult to achieve. Ontheotherhand, ourapproachprovidesefficientstatemanagementandfault- recovery, and guarantees “atleast once” tuple semantics with no tuple loss during failure. Itcombinesasynchronous,incrementalpeer-checkpointingforreducerstate and peer tuple backup with intelligent collocation to reduce the recovery overhead by minimizing state and tuple transfer during recovery. We also employ a decen- tralized monitoring and recovery mechanism which isolates the fault to a small subset of reducers while the rest can continue processing without interruptions. 6.9 Summary Ashigh-velocitydatabecomesincreasinglycommon, usingthefamiliarMapRe- duce model to process it is valuable. In this chapter, we presented an integrated 155 approach to support fault-tolerance, load-balancing and elasticity for the stream- ing MapReduce model. Our novel approach extends the concept of consistent hashing, and provides locality-aware tuple peer-backup and peer-checkpointing. These allows low-latency dynamic updates to the system, including adaptive load- balance and elasticity, as well as offer low-latency fault recovery by eliminating the need for explicit state or tuple transfer during such operations. Our decentral- ized coordination mechanism helps make autonomic decisions based on the local information and eliminates a single point of failure. Our experiments show up to 2× improvement in throughput compared to Storm SPS and demonstrated low- latency recovery of 10−1500ms from multiple concurrent failures. As future work, we will evaluation the system on large scale, real-world application. Further, we will extend the idea to a general purpose stream processing system with shared stateful processing elements for wider adoption and applicability. 156 Chapter 7 Conclusions In this thesis we studied various characteristics of high velocity stream process- ing applications with focus on impact of the three dimensions of dynamism viz. Domain dynamism, Infrastructure dynamism, and Data dynamism on the appli- cation QoS and overall value. We developed programming model, and underlying platform that allow users to develop Dynamic Dataflow Applications and enables value-driven, cost-efficient execution on public clouds. 7.1 Contributions The main contributions of this thesis are summarized below: • We motivate the need for Dynamic Applications in chapter 1 through our experiences in the Smart Grid domain, specifically, the dynamic demand response application which relies heavily on real-time data analytics and predictions that enables dynamic decision making in smart grids. 
• Chapter 2 develops an abstract infrastructure model for IaaS clouds and ana- lyzes the performance characteristics of various private, academic and public cloud deployments. We observe variations in several characteristics such as CPU core coefficient, network latency, and bandwidth due to multi-tenancy and shared resources as well as hardware heterogeneity. This combined with data dynamism further motivates the need for dynamic applications and 157 adaptive execution platform that work together to mitigate the effects of such variations. • We introduce the novel concept of Dynamic Dataflows in chapter 3 that uses Guarded Dynamic Elements viz. Dynamic PE and Dynamic Edges to enable dynamic run-time re-composition in response to changing guards. We develop an execution model that decouples the guard policies from the data plane which enables ease of development as well as efficient deployment of dynamic dataflows. Finally, we demonstrate the usability and advantage of using dynamic dataflows through the demand response application in smart grid. The dynamic dataflows coupled with the guard policies provide a very powerful abstraction to develop adaptive applications and also provide addi- tional control (especially with value-based guards) to the infrastructure to balance the resource cost and achieved application value. • Given the dynamic dataflow model, the IaaS cloud infrastructure model and the data dynamism, chapter 4 discusses various metrics to measure the qual- ity of service, including Profit which combines the resource cost and the achieved application value into a single metric that allows users to tradeoff between the two based on the budget and application requirements at the time. We further discuss the optimization problem that attempts to maxi- mize the profit under given throughput or latency constraints and propose several reactive (chapter 4) and predictive look-ahead (chapter 5) heuris- tics based on several practical considerations such as monitoring and logging cost, and centralized vs. decentralized coordination. The proposed heuris- tics achieve significant improvement in terms of profit as well as resource cost 158 compared to static over or under provisioned deployment and also demon- strate the advantage of value-based guards in reducing the resource cost. • Finally, chapter 6 addresses the issue of achieving efficient elasticity and fault tolerance to system and network failures observed in the clouds. While the previous chapters focused on the heuristics to decide what, when and how much to scale, in this chapter we focused on efficient mechanisms to achieve seamless scaling at run-time, especially for stateful operators that require efficient management, migration and recovery of associated state. We proposed an integrated approach for load-balancing, elasticity and fault- tolerance that exhibits minimum run-time overhead and provides sub-second latency in operations such as elastic scale up/down and recovery from one or more simultaneous failures which is orders of magnitude faster that the state-of-the-art. We have developed a comprehensive framework for development and execution of Dynamic dataflows on clouds including programming abstractions, scheduling and resource management heuristics as well as efficient mechanisms for elasticity and fault tolerance. We have developedF`oε an open-source distributed, elastic and adaptive framework for dynamic streaming applications that brings together the above features. 
We have demonstrated the viability of dynamic dataflows through the dynamic demand response application in smart grid and evaluated our heuristics and system through various simulation and real deployments. 7.2 Broader Impact The notion of Dynamic Applications has implications beyond the dynamic dataflow abstractions presented in this thesis. We envision an extension in the 159 Service Oriented Architecture system, especially with the development of micro services that are in principle similar to the processing elements discussed in this thesis. The proposed heuristics for efficient deployment for dynamic dataflows are equally applicable for current generation of streaming applications and would sig- nificantly improve the resource cost by taking advantage of continuous monitoring, cloud elasticity and locality of placement and hence is applicable to the vast array of applications currently in use across several domains. Similarly, the fault toler- ance mechanisms we proposed go beyond dynamic dataflows and provide generic model for in-memory fault-tolerance that can significantly reduce the overhead and recovery latency compared to disk-based mechanisms currently used by many systems. Finally, the proposed elastic and adaptive framework forms a basis for develop- ment of various high level frameworks for specialized applications. For example, we have developed frameworks for batch and streaming graph processing framework on top ofF`oε that inherently take advantage of the elasticity and fault tolerance of the system. This allows the framework developer to focus on specialized features and delegate the responsibilities for tasks such as elasticity and fault-tolerance to the lower layer thus significantly lowering the development cost. In addition, our work on graph analytics [110, 111, 114] built on top of F`oε shows that the proposed elastic and fault tolerant system forms a basis for developing high level abstractions such as graph processing systems, distributed machine learning libraries etc. that inherit elasticity and fault tolerance properties of the underlying system with minimum development and run-time overhead. This will further allow development of a large class of applications in addition to the stream processing applications considered here. 160 7.3 Future Work Our work introduces the notion of dynamic applications and focuses on the specific dynamic dataflow model. As future work we will explore the possibility of extending these models beyond continuous dataflows to a more generic setting to address batch as well as interactive systems. Similary, the optimization problem and proposed heuristics were developed with specific goals and tradeoffs observed in the cyber physical domain. While these models and heuristics are widely appli- cable, we will further explore different relevant domains, especially in the con- text of Internet of Things (IoT) to understand the spectrum of requirements and refine both the optimization problem and the corresponding heuristics to meet the requirements. Finally, we will investigate integration of dynamic dataflows with query based models such as Continuous Query Language, CEP and SQL which will allow developers to write a single application and run it across the breadth of execution contexts including batch, streaming and interactive. 161 Reference List [1] CAISO Demand Response User Guide, Guide to Participation in MRTU Release 1, Version 3.0. Technical report, California Independent System Operator (CAISO), 2007. 
[2] Emergency Demand Response Program Manual, Sec 5.2: Calculation of Cus- tomer Baseline Load (CBL). Technical report, New York Independent Sys- tem Operator (NYISO), 2010. [3] D. J. Abadi, D. Carney, U. Çetintemel, M. Cherniack, C. Convey, S. Lee, M. Stonebraker, N. Tatbul, and S. Zdonik. Aurora: a new model and archi- tecture for data stream management. The VLDB Journal, 12, August 2003. [4] S. Agarwal, R. Garg, M. S. Gupta, and J. E. Moreira. Adaptive incremental checkpointingformassivelyparallelsystems. InSupercomputing.ACM,2004. [5] G. Agha and R. Panwar. An actor-based framework for heterogeneous com- puting systems. In Heterogeneous Processing, 1992. Proceedings. Workshop on, pages 35–42, 1992. [6] I. Altintas, C. Berkley, E. Jaeger, M. Jones, B. Ludascher, and S. Mock. Kepler: an extensible system for design and execution of scientific workflows. In Scientific and Statistical Database Management, 2004. Proceedings. 16th International Conference on, pages 423–424. IEEE, 2004. [7] S. Aman, M. Frincu, C. Chelmis, M. Noor, Y. Simmhan, and V. Prasanna. Prediction models for dynamic demand response: Requirements, challenges, and insights. In IEEE International Conference on Smart Grid Communi- cations, 2015. [8] S. Aman, Y. Simmhan, and V. Prasanna. Holistic measures for evaluating prediction models in smart grids. IEEE Transactions in Knowledge and Data Engineering, 27(2), 2015. [9] S. Aman, Y. Simmhan, and V. K. Prasanna. Improving energy use forecast for campus micro-grids using indirect indicators. In IEEE ICDM Workshop on Domain Driven Data Mining (DDDM), 2011. 162 [10] L. Amini, H. Andrade, R. Bhagwan, F. Eskesen, R. King, P. Selo, Y. Park, and C. Venkatramani. Spc: A distributed, scalable platform for data mining. In Proceedings of the 4th international workshop on Data mining standards, services and platforms. ACM, 2006. [11] A. Arasu, M. Cherniack, E. Galvez, D. Maier, A. S. Maskey, E. Ryvkina, M. Stonebraker, and R. Tibbetts. Linear road: a stream data management benchmark. In international conference on Very Large Databases, pages 480– 491. VLDB Endowment, 2004. [12] A. Artikis, C. Baber, P. Bizarro, C. Canudas-de Wit, O. Etzion, F. Fournier, P. Goulart, A. Howes, J. Lygeros, G. Paliouras, et al. Scalable proactive event-driven decision making. Technology and Society Magazine, IEEE, 33(3):35–41, 2014. [13] J. Auerbach, D. F. Bacon, I. Burcea, P. Cheng, S. J. Fink, R. Rabbah, and S. Shukla. A compiler and runtime for heterogeneous computing. In Proceedings of the 49th Annual Design Automation Conference, DAC ’12, pages 271–276, New York, NY, USA, 2012. ACM. [14] S. Babu and J. Widom. Continuous queries over data streams. ACM Sigmod Record, 30(3):109–120, 2001. [15] R. M. Badia, F. Escale, E. Gabriel, J. Gimenez, R. Keller, J. Labarta, and M. S. Müller. Performance prediction in a grid environment. In Grid Com- puting, LNCS, 2004. [16] R. Barga, J. Jackson, N. Araujo, D. Guo, N. Gautam, and Y. Simmhan. The trident scientific workflow workbench. In eScience, 2008. eScience’08. IEEE Fourth International Conference on, 2008. [17] A. Barker and J. Van Hemert. Scientific workflow: a survey and research directions. In Parallel Processing and Applied Mathematics, pages 746–753. 2008. [18] S. M. Beitzel, E. C. Jensen, A. Chowdhury, D. Grossman, and O. Frieder. Hourly analysis of a very large topically categorized web query log. In Pro- ceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval, pages 321–328. ACM, 2004. [19] P. 
Bhatotia, A. Wieder, R. Rodrigues, U. A. Acar, and R. Pasquin. Incoop: Mapreduce for incremental computations. In Symposium on Cloud Comput- ing, page 7. ACM, 2011. 163 [20] A. Biem, E. Bouillet, H. Feng, A. Ranganathan, A. Riabov, O. Verscheure, H. Koutsopoulos, and C. Moran. Ibm infosphere streams for scalable, real- time, intelligent transportation services. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of data, 2010. [21] T. D. Braun, H. J. Siegel, N. Beck, L. L. Bölöni, M. Maheswaran, A. I. Reuther, J. P. Robertson, M. D. Theys, B. Yao, D. Hensgen, and R. F. Freund. A comparison of eleven static heuristics for mapping a class of independent tasks onto heterogeneous distributed computing systems. In Journal of Parallel and Distributed Computing, volume 61, pages 810–837. Elsevier, June 2001. [22] A. Brito, A. Martin, T. Knauth, S. Creutz, D. Becker, S. Weigert, and C. Fetzer. Scalable and low-latency data processing with stream mapreduce. In CloudCom, pages 48–58. IEEE, 2011. [23] S.Bykov, A.Geller, G.Kliot, J.R.Larus, R.Pandya, andJ.Thelin. Orleans: cloud computing for everyone. In Symposium on Cloud Computing, page 16. ACM, 2011. [24] R. N. Calheiros, R. Ranjan, A. Beloglazov, C. A. De Rose, and R. Buyya. Cloudsim: a toolkit for modeling and simulation of cloud computing environ- ments and evaluation of resource provisioning algorithms. Software: Practice and Experience, 41(1):23–50, 2011. [25] S. Carlsen. Action port model: A mixed paradigm conceptual workflow modeling language. In Cooperative Information Systems, 1998. Proceedings. 3rd IFCIS International Conference on, pages 300–309. IEEE, 1998. [26] R. Castro Fernandez, M. Migliavacca, E. Kalyvianaki, and P. Pietzuch. Inte- gratingscaleoutandfaulttoleranceinstreamprocessingusingoperatorstate management. In SIGMOD, pages 725–736. ACM, 2013. [27] K. Chakraborty, K. Mehrotra, C. K. Mohan, and S. Ranka. Forecasting the behavior of multivariate time series using neural networks. Neural networks, 5(6), 1992. [28] S. Chandrasekaran, O. Cooper, A. Deshpande, M. Franklin, J. Hellerstein, W.Hong, S.Krishnamurthy, S.Madden, F.Reiss, andM.Shah. Telegraphcq: continuous dataflow processing. In Proceedings of the 2003 ACM SIGMOD international conference on Management of data. ACM, 2003. [29] H.-Y. Chu and Y. Simmhan. Resource allocation strategies on hybrid cloud for resilient jobs. Technical report, USC, 2013. 164 [30] A. Colorni, M. Dorigo, and V. Maniezzo. Distributed Optimization by Ant Colonies. In European Conference on Artificial Life, pages 134–142, 1991. [31] R. Cottet and M. Smith. Bayesian modeling and forecasting of intraday electricityload. Journal of the American Statistical Association, 98(464):839– 849, 2003. [32] K. Cukier and V. Mayer-Schoenberger. Rise of big data: How it’s changing the way we think about the world, the. Foreign Aff., 92:28, 2013. [33] J. Dean. Designs, lessons and advice from building large distributed systems. Keynote from LADIS, 2009. [34] J. Dean and S. Ghemawat. Mapreduce: simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008. [35] G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall, and W. Vogels. Dynamo: ama- zon’shighlyavailablekey-valuestore. In SIGOPS Operating Systems Review, volume 41, pages 205–220. ACM, 2007. [36] J. Dunkel, A. Fernández, R. Ortiz, and S. Ossowski. Event-driven archi- tecture for decision support in traffic management systems. 