RESOURCE SCHEDULING IN GEO-DISTRIBUTED COMPUTING

by

Chien-Chun Hung

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)

December 2017

Copyright 2017 Chien-Chun Hung

Acknowledgments

First and foremost, I would like to thank my adviser, Professor Leana Golubchik, for her continuous guidance throughout my PhD study. Professor Golubchik not only helped develop me into a researcher, but also taught me countless lessons in becoming a responsible person. I would also like to thank my mentors, Doctor Ganesh Ananthanarayanan (a researcher at Microsoft Research), Doctor Peter Bodik (a researcher at Microsoft Research), and Professor Minlan Yu, for their guidance that helped develop my research skills down to low-level details. I would also like to thank my labmates, especially Sung-Han Lin, and other friends, especially the Taiwanese friends with whom I played basketball almost every Friday at the gym. Their support in many aspects was crucial to my mentality throughout my PhD study, starting from my first day at USC. Last but not least, I would like to thank my family, and especially my wife Jessy, for their endless support throughout my life. I could not have completed this degree without their support, which allowed me to focus on my study without worrying about other things in life. I am truly grateful for the support of all the people around me; this thesis could not have been finished without any of them. Thank you all!

Table of Contents

Acknowledgments
List of Tables
List of Figures
Abstract
Chapter 1: Introduction
Chapter 2: Related Work
  2.1 Geo-distributed Job Execution
  2.2 Intra-datacenter Job Execution
    2.2.1 Job Scheduling and Task Placement
    2.2.2 Co-flow Scheduling
    2.2.3 Multi-resource Scheduling
    2.2.4 Summary of Intra-datacenter Job Execution
  2.3 Distributed Systems and Computing
    2.3.1 Placement of VMs and Jobs across Distributed Machines
    2.3.2 Grid Computing
    2.3.3 Video Analytics Systems
    2.3.4 Database Systems
    2.3.5 Sensor Networks
    2.3.6 Computation Offloading or Eliminating in Mobile Computing Systems
  2.4 Scheduling Algorithms for Minimizing Job Response Time
  2.5 Summary
Chapter 3: Scheduling Jobs across Geo-distributed Datacenters
  3.0.1 Introduction
  3.1 Background and Motivation
    3.1.1 Job Scheduling across Geo-distributed Datacenters
    3.1.2 Motivating Example
    3.1.3 SRPT-based Extensions
  3.2 Reordering-based Approach
  3.3 Workload-aware Approach
    3.3.1 SWAG Design Principles
    3.3.2 SWAG Algorithm
  3.4 Prototype and System Extensions
  3.5 Performance Evaluation
    3.5.1 Experiment Settings
    3.5.2 Scheduling Performance Results
    3.5.3 Overhead Evaluation
    3.5.4 Performance Sensitivity Analysis
  3.6 Conclusions
Chapter 4: Multi-resource Scheduling across Heterogeneous Geo-distributed Clusters
  4.1 Introduction
  4.2 Motivation
    4.2.1 Geo-distributed data analytics
    4.2.2 Illustrative Examples
  4.3 Compute/Network-Aware Task Placement
    4.3.1 Map-Task Placement
    4.3.2 Reduce-Task Placement
    4.3.3 Task Ordering
    4.3.4 Mismatch between Map and Reduce
  4.4 Job Scheduling
    4.4.1 Minimizing Average Job Response Time
    4.4.2 Dealing with Resource Dynamics
    4.4.3 Considering WAN Usage
    4.4.4 Incorporating Fairness
  4.5 Prototype Implementation
  4.6 Evaluation
    4.6.1 Settings
    4.6.2 Evaluation with EC2 Deployment
    4.6.3 Evaluation with Trace-driven Simulations
    4.6.4 Distribution of The Performance Gains
  4.7 Conclusions
Chapter 5: Video Stream Analytics over Hierarchical Cluster
  5.1 Introduction
  5.2 Video Processing Architecture
  5.3 Challenges and Desirable Features
  5.4 Resource-Accuracy Profiles
  5.5 Video Query Planning & Placement
    5.5.1 Problem Formulation
    5.5.2 Solution Overview
    5.5.3 Resource Cost
    5.5.4 Greedy Heuristic
    5.5.5 Pareto Band
    5.5.6 Merging Peer Queries
  5.6 System Design
    5.6.1 Resource-Accuracy Profiler
    5.6.2 Implementation
  5.7 Evaluation
    5.7.1 Setup
    5.7.2 Improvement in Accuracy
    5.7.3 Gains by Different Query Types
    5.7.4 Static Placements
    5.7.5 Gains with Merging
    5.7.6 Scalability with Pareto Band
    5.7.7 Cost and Efficiency Metrics
    5.7.8 Performance Sensitivity
  5.8 Conclusions
Chapter 6: Summary
References

List of Tables

3.1 Settings of The Example: Job Set, Arrival Sequence and Task Assignment
3.2 Job Traces
4.1 Definition of Notations
5.1 Notations for query i

List of Figures

1.1 Architecture of geo-distributed computing environment.
1.2 Structure of a job. A job is composed of tasks that form a Directed Acyclic Graph (DAG).
3.1 System Architecture of Distributed Job Execution
3.2 Results of The Example: Job Orders and Finish Instants Computed by Different Scheduling Algorithms
3.3 Reordering Algorithm
3.4 Motivating Example for Third Principle
3.5 Workload-Aware Greedy Scheduling (SWAG)
3.6 Performance with Facebook Trace
3.7 Performance with Google Trace
3.8 Performance with Exponential Trace
3.9 Fairness with Facebook Trace
3.10 Fairness with Google Trace
3.11 Fairness with Exponential Trace
3.12 Scheduling Algorithm Running Time
3.13 Communication Overhead
3.14 Various Task Assignment Scenarios
3.15 Various Number of Datacenters
3.16 Various Estimation Accuracy
4.1 Global data analytics across geo-distributed cluster. Analytics jobs are submitted to the global manager, and may require data stored across geo-distributed sites which have various capacities in compute slots and bandwidth.
4.2 Heterogeneity in compute resources.
4.3 Heterogeneity in network resources.
4.4 Bandwidth, compute capacities and input data for our three-site example setup.
4.5 Task placement result: Iridium
4.6 Task placement result: better approach
4.7 Reduction in Average Response Time.
4.8 Reduction in Average Slowdown.
4.9 Average Response Time
4.10 CDF of Response Time
4.11 Gains in Response Time Under Different Combinations of Task Ordering Strategies (Baseline: In-Place)
4.12 Gains in Response Time Under Different Resource Dynamics Scenarios (Baseline: In-Place)
4.13 Balancing Response Time And WAN Usage; baseline: In-Place.
4.14 Balancing Response Time And WAN Usage; baseline: Centralized.
4.15 Balance Response Time And Fairness
4.16 Distribution of The Gains under Various Ratios of Intermediate/Input Size
4.17 Histogram of Gains under Various Job Sizes
4.18 Histogram of Gains under Various Input Data Skew
4.19 Histogram of Gains under Various Intermediate Data Skew
4.20 Histogram of Gains under Various Task Estimation Errors
5.1 Hierarchical Video Analytics Architecture.
5.2 Object Tracker Pipeline
5.3 Hierarchical Setup
5.4 Query plans for Q1 and Q2
5.5 Utilization at private cluster for best plans
5.6 Merging the detector and associator of the "car counter" and "jay walker" queries on the same camera.
5.7 Accuracy vs. Cores
5.8 CPU demands of components
5.9 Accuracy vs. Data rates
5.10 Data rates of components
5.11 Network and CPU demands of the camera and the two modules in the tracker pipeline for three different plans, all with accuracy 0.73-0.75. Y-axis truncated at 2.5.
5.12 Profiling DNN recognizers - object, scene, face - on server-class and mobile GPUs.
5.13 Pseudocode for Cascade's heuristic.
5.14 Illustration of Pareto band (shaded) for a single query. Note that for each accuracy (plan), there is a horizontal stripe of placement options with different costs.
5.15 Comparing Cascade's accuracy to baselines. Accuracies are normalized by Optimal Allocation.
5.16 Distribution of accuracies across queries.
5.17 Distribution of costs of the query configurations.
5.18 Choice of placements for components.
5.19 Comparison between Cascade and a placement-restricted Cascade.
5.20 Cascade's accuracy under various query types normalized by Optimal Allocation.
5.21 Cascade's gains in accuracy under various query types compared to Fair Allocation.
5.22 Cascade's normalized accuracy under various mix of tracking and license plate reader queries.
5.23 Cascade's gains over restricted placements.
5.24 Gains in merging common query components.
5.25 Accuracy achieved and running time with Pareto band width, relative to using all the configurations.
5.26 Algorithms' Running Time.
5.27 Cascade's performance sensitivity compared to Optimal Allocation.

Abstract

Due to the growing needs in computing and the increasing volume of data, cloud service providers deploy multiple datacenters around the world in order to provide fast computing response. Many applications utilize such geo-distributed deployments, including web search, user behavior analysis, machine learning applications, and live camera feed processing. Depending on the characteristics of the applications, their data may be generated, stored, and processed across the geo-distributed sites. Hence, efficient processing of the data across the geo-distributed sites is critical to the applications' performance.

Existing solutions first aggregate all the required data at one location and execute the computation within that site. Such solutions incur large amounts of data transfer across the WAN and lead to prolonged response times for the applications due to significant network delays. An emerging trend is to instead distribute the computation across the sites based on data distribution, and aggregate only the results afterwards. Recent works have shown that such an approach can result in significant improvement in response time as well as reduction in WAN bandwidth usage.

However, the performance of geo-distributed jobs highly depends on how the resources are scheduled, which raises new challenges, as trivial extensions of state-of-the-art scheduling solutions lead to sub-optimal performance. In this thesis, we first improve the performance of geo-distributed jobs from the perspective of computation resources. We provide insights into how conventional Shortest Remaining Processing Time (SRPT) falls short due to the lack of scheduling coordination among the sites, and propose a light-weight heuristic that significantly improves the jobs' response time. We also design a new job scheduling heuristic that coordinates the workload demands and the resource availability among the sites and greedily schedules the jobs that can finish quickly. Trace-driven simulation studies show that our proposed scheduling heuristics effectively reduce the response time of geo-distributed jobs by up to 50%.

Next, we address geo-distributed jobs' performance from the perspectives of both the computation and the network resources. Specifically, we address the scheduling challenges of heterogeneous resource availability across the sites and the mismatch with the data distribution across the geo-distributed sites. We formulate the task placement decisions using a Linear Programming optimization model, and allocate the resources greedily to the job that can finish quickly.
In addition to response time, our design can also easily incorporate other performance goals, e.g., fairness and WAN usage, with simple control knobs. The EC2-based deployment of our prototype and large-scale trace-driven simulations showed that our solutions can improve the response time of a baseline in-place scheduling approach by up to 77%, and improve on the state-of-the-art geo-distributed analytics solution by up to 55%.

Finally, we expand to a more general setting in which each job has multiple configuration options, and its quality depends on the configuration it utilizes. We motivate this problem by the scenario of processing live camera feeds across hierarchical clusters. In this setting we focus on the scheduling problem of jointly determining job configuration and placement for concurrent jobs, and design an efficient heuristic to maximize the overall quality with the available resources across the geo-distributed sites. Our evaluation based on an Azure deployment of our prototype showed that the proposed solution outperforms the state-of-the-art video analytics scheduler by up to 2.3X and the widely deployed Fair Scheduler by up to 15.7X, in terms of the average quality of the concurrent jobs.

Chapter 1
Introduction

Over the past decade, as the volume of data continues to expand at a significant rate and computing needs grow enormously, many cloud companies are deploying more and more datacenters around the world in order to provide low-latency access for their users as well as to meet their computing needs. For example, Amazon's datacenters span 10 deployment regions and tens of availability zones [7]; Google, Facebook and Microsoft each have a few tens of datacenters across the continents, and these companies are still expanding their datacenters [114]. Each datacenter is composed of thousands to tens of thousands of servers, depending on deployment constraints and capacity needs. In addition to the powerful datacenters, there are also edge clusters deployed all over the world, each composed of tens to hundreds of servers. While their computing capacity is less powerful compared to the giant datacenters, edge clusters can be deployed at more locations and even closer to the users, as they are less costly. Moreover, cloud providers may deploy edge clusters not only to reduce data transfer overhead, but also to keep computation on sensitive data on-premises. Overall, the datacenters and the edge clusters form a geo-distributed environment for storing data and serving computing needs. Common applications utilizing such environments include machine learning workloads, user behavior analysis, ads recommendation, traffic monitoring and control, and troubleshooting for system performance.

In order to run such workloads, users usually allocate a certain number of servers within one of the deployment sites and form a computing environment. The computing environment may contain a significant number of servers connected by network links, and each server contains a certain amount of CPU cores, memory, and storage space. The compute resources (i.e., CPU cores and memory) are usually allocated in terms of multiple compute slots, where each compute slot contains one or more CPU cores and some amount of memory.
Compute slots are the abstraction and the unit of workload execution. The number of compute slots in a computing environment is constrained by the total amount of CPU cores and memory. The bandwidth between the compute slots is constrained by the network link capacities. Within the computing environment, there is a component called the resource scheduler, which is an abstraction of one or multiple processes in charge of managing and monitoring resource usage among the computing workloads. Specifically, compute slots and network bandwidth are the fundamental resource types managed by the resource scheduler in each computing environment. Figure 1.1 depicts an example architecture of a geo-distributed computing environment.

[Figure 1.1: Architecture of geo-distributed computing environment.]

The applications' workloads are submitted to the computing environment as independent jobs. A job is generally composed of tasks that form a Directed Acyclic Graph (DAG). Each node in a DAG includes one or multiple tasks and represents certain function processing, while each edge represents the dataflow between the nodes and specifies the input and output for the tasks. More specifically, each DAG node task reads input from its predecessor tasks or directly from data storage, performs certain functions on the data, and then passes the output to its successor tasks or returns results. In some jobs, each node in a job's DAG can be further divided into numerous tasks that each process a partition of disjoint input data and can run in parallel. A job could have many configurations: each node in a job DAG may have multiple implementation choices or control knobs, while some jobs also have different DAG structures. Each job configuration has distinctive demands in resources and quality of results. A job is considered complete when all of its tasks are finished; once a job completes, it departs from the computing environment. Figure 1.2 shows an example structure of a job. A job's response time is defined as the interval between the job's arrival and its departure. A job's resulting quality is dictated by the configuration it utilizes.

[Figure 1.2: Structure of a job. A job is composed of tasks that form a Directed Acyclic Graph (DAG).]

To serve the computing workloads, the resource scheduler has to make the following decisions in scheduling the resources among the jobs:

Task Placement: To which compute slot (and at which of the geo-distributed sites) should each of a job's tasks be assigned to run?

Job Configuration: Which configuration should each job utilize?

Job Scheduling: How should the resources (of multiple types) be allocated among the concurrent jobs that compete for the shared resources?
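To make the job model and the three decisions above concrete, the following is a minimal illustrative sketch (not code from the thesis; all names are hypothetical): a job is a DAG of tasks, its response time runs from arrival to the finish of its last task, and the scheduler's decisions appear as method stubs.

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    task_id: str
    duration: float                              # estimated run time on one compute slot
    inputs: list = field(default_factory=list)   # upstream task ids (the DAG edges)

@dataclass
class Job:
    job_id: str
    arrival: float
    configuration: str                           # chosen implementation choices / knob values
    tasks: dict = field(default_factory=dict)    # task_id -> Task

    def response_time(self, finish_times: dict) -> float:
        # a job departs only when its last task finishes
        return max(finish_times[t] for t in self.tasks) - self.arrival

class ResourceScheduler:
    """Stubs for the three decisions listed above."""
    def place_tasks(self, job: Job, sites: dict) -> dict:
        """Task placement: map each task to a (site, slot)."""
        raise NotImplementedError
    def choose_configuration(self, job: Job, profiles: dict) -> str:
        """Job configuration: pick implementation choices and knob values."""
        raise NotImplementedError
    def order_jobs(self, pending_jobs: list) -> list:
        """Job scheduling: decide the order in which competing jobs get resources."""
        raise NotImplementedError

# Example: a two-map, one-reduce job that arrived at t=0 and finished its last task at t=9.
job = Job("j1", arrival=0.0, configuration="default")
job.tasks = {"m1": Task("m1", 4.0), "m2": Task("m2", 6.0),
             "r1": Task("r1", 3.0, inputs=["m1", "m2"])}
print(job.response_time({"m1": 4.0, "m2": 6.0, "r1": 9.0}))   # -> 9.0
```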
The resource scheduling decisions are critical to the jobs' performance. From the user's perspective, it is desirable that a job finishes as quickly as possible, i.e., fast response time, while achieving high-quality results. Fast response time is critical for supporting real-time decisions and reducing user-perceived latency. The resource scheduler can improve a job's response time by assigning its tasks to the machines that have the input data, and/or by allocating a sufficient number of slots so that all the job's tasks can be served as long as there are available resources. Another important performance metric is the quality of a job's result, as high-quality results are essential for making accurate decisions and providing users with a higher quality experience. A job's quality of results can be improved by allocating sufficient resources so as to utilize the high-quality configurations.

The resource scheduler's decisions depend not only on the jobs' resource usage and resource availability in the computing environment, but also on the distribution of the data required by the jobs across the geo-distributed sites. As the data can be generated at any site within the geo-distributed computing environment and the data volume continues to grow significantly, it is challenging to store all data within a single site. Although it is common to replicate popular data items and store them across multiple sites for low-latency access as well as reliability, it is not cost-effective to fully replicate all data items across all the geo-distributed sites. Therefore, a job may need to read its input data from multiple sites within the geo-distributed computing environment.

The traditional approach in geo-distributed computing is to aggregate all the data required by a job at one location with the most abundant available resources, and run the job at that location. We refer to this approach as intra-datacenter job execution, as the job is served within only one of the datacenters. Despite being simple, this approach of moving data for computation may result in performance degradation and critical problems. Since it aggregates all the required data at one site before the job starts, it often incurs a substantial amount of data transfer across the global network, which consumes significant bandwidth and increases cost. In addition, aggregating all the required data upfront may also increase the response time of the job due to the significant network transfer delay. Finally, some sensitive data may not be allowed to be transferred out of certain sites due to data sovereignty.

Recently, an emerging trend is to move jobs' workloads towards their data instead of the other way around, e.g., the jobs' tasks are delegated to the sites that have their required data. We refer to this approach as inter-datacenter job execution, as the job may be served across multiple datacenters in the geo-distributed computing environment. There have been a few recent efforts showing the benefits of inter-datacenter job execution: [125] shows that such an approach could improve jobs' response times by 3-19X, and [155] shows that the bandwidth savings due to inter-datacenter job execution could be as high as 250X.

Although the gains of inter-datacenter job execution suggest a promising direction for geo-distributed computing, there remain many critical problems to be addressed. First, none of the previous works in geo-distributed computing addressed the problem of how to perform job scheduling, i.e., jointly allocating resources among competing jobs to optimize the performance metrics of interest. Second, the resource scheduler has many decision factors that interact with each other in a geo-distributed computing environment, which further complicates the resource scheduling decisions. Finally, how the different types of resources should be jointly allocated has not been addressed in the context of geo-distributed environments.

Solving the above-mentioned problems is challenging due to the following:

Mismatch in resource usage and availability: The heterogeneity in multi-resource capacity and usage across the geo-distributed sites together result in different resource scheduling decisions.
For example, user session logs are usually generated at the edge clusters, which have limited storage and computation power. The datacenters have greater capacity, but incur network delay when transferring data from the edge to the datacenters. As a result, task durations vary depending on their placement. In addition, a job's task placement causes variation in workloads across the sites. Therefore, coordination of multiple resources across all the sites is required to minimize jobs' response time.

Multi-resource allocation among competing jobs: As concurrent jobs compete for the shared resources in the computing environment, it is critical to carefully allocate the shared resources among the competing jobs in order to optimize the performance metrics. Moreover, since each job and task may have distinctive demands for each resource type, it is also important to jointly consider allocating the multiple resource types so that the resources can be efficiently utilized to optimize the performance metrics.

Inter-dependency in job scheduling and task placement: The task placement decisions potentially result in various workloads across the sites, which affects a job's response time, as it is determined by the last completed task, and further affects the service order of the jobs. The job scheduling decisions produce the service order of the jobs, which determines the resource usage, and further affects how the tasks should be placed for faster job completion. As a result, the two decisions (job scheduling and task placement) should be jointly considered based on resource usage and availability in order to optimize the performance of geo-distributed computing.

Inter-dependency in job configuration and task placement: Given that the available resources are constrained across the sites and utilized by all concurrent jobs, it may be infeasible to utilize the most expensive configuration for each of the jobs in order to maximize accuracy. Specifically, what configuration a job can use depends on where its tasks are placed across the sites and how many resources are available at the sites where its tasks are placed. Where to place a job's tasks depends on the resource availability and on what configuration a job utilizes; a job's configuration determines its resource demands, which constrains the placement options for its tasks. Therefore, job configuration and task placement need to be jointly considered to optimize a job's accuracy.

In the first work of my thesis (Chapter 3), I study the problem of allocating compute slots to jobs across multiple datacenters, with the assumption that a task is placed at the site storing its data. The performance goal is to minimize the average job response time. Even under such a simplified setting, this problem is still challenging to solve and is shown to be strongly NP-hard [131] or APX-hard [60]. Traditional job scheduling solutions like Shortest Remaining Processing Time (SRPT) provide optimal average job response time in the single-server, single-queue setting by finishing the small jobs before the large jobs. However, SRPT-based extensions lead to sub-optimal response times in a geo-distributed computing environment due to lack of coordination, leaving significant room for improvement. By recognizing the importance of coordinating job scheduling across the geo-distributed sites, this work proposes two effective heuristics that improve average job response time.
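To see why uncoordinated per-site SRPT can hurt, consider the following toy example (constructed here for illustration; it is not taken from the thesis). Two sites each process their queue serially, and a job finishes only when its parts at both sites finish. Ordering each site by its local remaining work puts the two jobs in opposite orders at the two sites, while agreeing on one global order lowers the average response time:

```python
from collections import defaultdict

def job_completion_times(site_queues):
    """site_queues: {site: [(job_id, task_duration), ...]}, processed serially in order.
    A job completes only when its last task across all sites finishes."""
    finish = defaultdict(float)
    for site, queue in site_queues.items():
        clock = 0.0
        for job_id, duration in queue:
            clock += duration
            finish[job_id] = max(finish[job_id], clock)
    return dict(finish)

# Job A: 1 unit of work at DC1 and 3 units at DC2; Job B: 3 units at DC1 and 1 unit at DC2.
# Per-site SRPT on *local* remaining work orders the jobs differently at each site.
local_srpt = {
    "DC1": [("A", 1.0), ("B", 3.0)],   # A looks smaller locally
    "DC2": [("B", 1.0), ("A", 3.0)],   # B looks smaller locally
}
# A coordinated schedule agrees on one global order (A before B) at both sites.
coordinated = {
    "DC1": [("A", 1.0), ("B", 3.0)],
    "DC2": [("A", 3.0), ("B", 1.0)],
}

for name, queues in [("local SRPT", local_srpt), ("coordinated", coordinated)]:
    times = job_completion_times(queues)
    print(name, times, "average:", sum(times.values()) / len(times))
# local SRPT  {'A': 4.0, 'B': 4.0} average: 4.0
# coordinated {'A': 3.0, 'B': 4.0} average: 3.5
```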
Specifically, the contributions of this work are as follows:

This work investigates why natural SRPT-based extensions leave significant room for performance improvements, which provides insights for better approaches.

A light-weight "add-on", termed Reordering, is proposed, which can be easily added to any scheduling algorithm to improve its performance by delaying parts of certain jobs without degrading their response times, while providing opportunities for other jobs to finish faster. This work proves that executing Reordering after any scheduling algorithm results in performance that is no worse than that achieved without Reordering.

Three principles are derived for designing a job scheduling algorithm aimed at reducing the average job completion time. Armed with these design principles, this work develops Workload-Aware Greedy Scheduling (SWAG), which greedily serves the job that finishes the fastest by taking the existing workload at the local queues into consideration.

As a proof of concept, a prototype using the proposed algorithms is implemented with Spark [163] while addressing several system implementation issues. This work also conducts extensive large-scale simulation-based experiments using realistic job traces under a variety of settings. The results show that SWAG and Reordering achieve as high as 50% and 27% improvements, respectively, in average job response time as compared to the SRPT-based extensions. The results also show that the proposed techniques achieve response times within 5% of an optimal solution (as obtained through brute force for comparison purposes), while requiring reasonable communication and computation overhead.

The details of this work are described in Chapter 3.

The second work in my thesis (Chapter 4) further addresses multi-resource allocation through job scheduling and task placement. As heterogeneous capacity of both compute and network resources is one of the key characteristics of geo-distributed computing, the bottleneck of a job's response time could come from either compute delay or network delay, or both. However, existing works [79, 125, 127, 153, 155] focus only on minimizing the network delay, which leaves significant room for improvement. Instead, this work focuses on reducing jobs' average response time by jointly considering compute and network delay. In addition, we address both the task placement and job scheduling problems, and their inter-dependency, in order to reduce geo-distributed jobs' response times. To our knowledge, this is the first work in the research literature that addresses the problem of multi-resource allocation for task placement and job scheduling. The contributions of this work are as follows:

This work provides several insights into where and how existing solutions fall short for resource scheduling in heterogeneous geo-distributed clusters. Both network and computation resources should be taken into consideration in placing tasks across geo-distributed sites. In addition, as job scheduling and task placement are mutually dependent, they should also be jointly considered. These insights lead to the design of our solutions.

The problem of task placement is formulated as a Linear Programming optimization model that optimizes the total duration of network transfer delay and computation delay. By optimizing the total duration for all the tasks of a job, both network and computation resources are jointly taken into consideration when optimizing the placement of the tasks (an illustrative sketch of such a formulation is given after this list of contributions).

A greedy heuristic is proposed to jointly determine job scheduling and task placement for geo-distributed clusters. The design of the heuristic aims to reduce the average job response time, while it can also incorporate other performance metrics of interest (e.g., fairness, WAN bandwidth usage) by providing control knobs to trade off between the different performance goals.

This work conducts an extensive performance evaluation to understand the gains, using a geo-distributed EC2 cluster deployment based on a Spark prototype as well as a large-scale trace-driven simulation. Evaluation results show that the proposed solutions improve the average response time of existing site-locality and network-centric approaches by up to 77% and 55%, respectively.

The details of this work are described in Chapter 4.
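As a rough illustration of what a joint compute-and-network placement formulation can look like, the following deliberately simplified LP (not the thesis' actual model; the three-site parameters are made up) chooses what fraction of a job's tasks to run at each site so as to minimize the slowest site's combined compute and data-fetch time:

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical three-site setup (all numbers invented for illustration):
slots      = np.array([10.0,  4.0,  2.0])     # compute slots per site
bandwidth  = np.array([100.0, 50.0, 20.0])    # MB/s into each site
input_data = np.array([200.0, 500.0, 300.0])  # MB of the job's input stored at each site

total_work = 60.0                   # total task-seconds of compute in the job
total_data = input_data.sum()
n = len(slots)

# Variables: x_0..x_{n-1} = fraction of the job's tasks run at each site, plus z = makespan bound.
# If fraction x_s runs at site s, its time is roughly compute (x_s*total_work/slots_s)
# plus fetching its share of remote input (x_s*(total_data - input_data_s)/bandwidth_s).
c = np.zeros(n + 1)
c[-1] = 1.0                                      # minimize z
A_ub = np.zeros((n, n + 1))
b_ub = np.zeros(n)
for s in range(n):
    A_ub[s, s] = total_work / slots[s] + (total_data - input_data[s]) / bandwidth[s]
    A_ub[s, -1] = -1.0                           # per-site time <= z
A_eq = np.ones((1, n + 1))
A_eq[0, -1] = 0.0                                # the fractions must sum to 1
res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
              bounds=[(0, 1)] * n + [(0, None)])
print("work fraction per site:", res.x[:n].round(3))
print("estimated completion time bound:", round(res.x[-1], 2))
```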
The third work of my thesis (Chapter 5) addresses the more general problem of jointly deciding job scheduling, configuration, and task placement for geo-distributed jobs. My thesis motivates this problem by the application of processing live video streams in hierarchical clusters, where a cluster is composed of several on-premises clusters and a public cloud. Each job is composed of a chain of tasks, in which each task has many implementation choices that provide the same abstraction; in addition, each task also has many control knobs. Job configuration determines the implementation choice and the knob value for each of the job's tasks, and each configuration has its distinctive resource demands and results in different query accuracy. In addition to determining each job's configuration, another important decision is the placement of each of a job's tasks across the sites. Both job configuration and task placement need to consider the resource availability at the sites as well as the concurrent jobs competing for the resources. Moreover, the scheduling decisions should be made jointly for all the concurrent jobs, as they compete for the same pool of resources. Finally, job scheduling, configuration, and task placement are mutually dependent, and therefore need to be jointly determined in order to optimize performance.

The goal of this work is to determine the configuration for each job and place its tasks across the hierarchy of the cluster in order to maximize the overall quality of the concurrent jobs. My solution, Cascade, first identifies promising job configurations by filtering out inaccurate and expensive ones based on the concept of a Pareto boundary, which significantly reduces the search space. Second, a heuristic is designed to greedily select the configuration that can maximally improve the overall quality with minimal resource costs. Cascade leverages the dominant resource utilization across all resource types and all sites to represent the resource cost of a job configuration, which avoids draining critical resources. Finally, Cascade also identifies and merges peer jobs that process the same video feeds and have common task prefixes, which further reduces resource consumption and allows other jobs to run with higher configurations. The contributions of this work are as follows:

This work formulates the problem of joint job configuration and task placement for live video analytics in hierarchical clusters.

This work proposes an efficient heuristic that jointly determines job configuration and task placement for all concurrent jobs in the cluster.
The heuristic greedily determines the job configuration and its placement that provide the maximal improvement in overall quality at minimal cost. To reduce the search space for the heuristic, this work filters out non-promising job configurations based on the concept of a Pareto boundary, and only keeps the cost-effective job configurations (a small illustrative sketch follows this list). By doing so, the search time is reduced by 80% while achieving 90% of the overall (without filtering) quality.

This work further improves the overall quality of the jobs by merging peer jobs that process the same video feeds and have common task prefixes. By merging peer jobs and allowing other jobs to use better configurations, the overall quality is further improved by 1.8X.

This work evaluates the performance using an Azure deployment and real-world video queries. Evaluation results show that the proposed solution outperforms a commonly deployed fair scheduler by 15.7X, while being within 6% of the optimal quality.

This work also develops an efficient mechanism that obtains jobs' resource-quality profiles for scheduling decisions. By intelligently caching intermediate results during the profiling process, the proposed profiling mechanism is able to reduce the overall profiling time by 70% and reduce the CPU cycles used by 99% with a reasonable caching budget.

The details of this work are described in Chapter 5.
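The Pareto-based filtering idea can be sketched as follows (an illustrative toy example; the configurations, numbers, and the band parameter are invented and this is not Cascade's implementation): a configuration is dropped if some other configuration reaches at least the same accuracy at a sufficiently lower cost, and a non-zero band keeps near-optimal options around the boundary.

```python
def pareto_configs(configs, band=0.0):
    """Keep a configuration unless another one reaches at least the same accuracy at a
    cost lower by more than `band` (band=0.0 is the strict Pareto boundary; a positive
    band also keeps near-optimal configurations)."""
    kept = []
    for name, acc, cost in configs:
        dominated = any(other_acc >= acc and other_cost < cost - band
                        for _, other_acc, other_cost in configs)
        if not dominated:
            kept.append((name, acc, cost))
    return kept

# Hypothetical accuracy/cost profile of one query's configurations (numbers invented).
configs = [
    ("1 fps, cheap detector",  0.55,  1.0),
    ("5 fps, cheap detector",  0.70,  3.0),
    ("5 fps, heavy detector",  0.68, 12.0),   # dominated: less accurate and far costlier
    ("30 fps, heavy detector", 0.90, 20.0),
]
for cfg in pareto_configs(configs):
    print(cfg)
# ('1 fps, cheap detector', 0.55, 1.0)
# ('5 fps, cheap detector', 0.70, 3.0)
# ('30 fps, heavy detector', 0.90, 20.0)
```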
In the remainder of my thesis, the three works are described in Chapters 3, 4 and 5, respectively. Chapter 2 discusses related work in geo-distributed computing and the differences between our contributions and previous efforts. Finally, Chapter 6 concludes this thesis.

Chapter 2
Related Work

In this chapter, we discuss the work related to this thesis, and the differences between that related work and this thesis. We start from the emerging topic of geo-distributed job execution (Chapter 2.1), which is the area most relevant to this thesis. We move on to the popular area of intra-datacenter job execution (Chapter 2.2), which involves several research topics (e.g., job scheduling, task placement, co-flow scheduling, and multi-resource scheduling) that overlap with part of the problems that this thesis addresses. Next, we extend our discussion to the broader area of distributed systems (Chapter 2.3), in which we explore many traditional topics that address problem settings similar to this thesis, such as grid computing, sensor networks, database systems, mobile computing and video analytics. We also explore the related work in traditional scheduling algorithms (Chapter 2.4) for minimizing job response time, and discuss the limitations in applying those solutions to schedule geo-distributed computing jobs. We summarize the related work in Chapter 2.5.

2.1 Geo-distributed Job Execution

This thesis mainly focuses on executing jobs across geo-distributed sites. Geo-distributed job execution is a recent direction in big data applications, with relatively little work existing in the research literature.

A major distinction of geo-distributed computing from traditional intra-datacenter computing is that its data are stored across geographically distributed sites, and transferring data between the sites incurs significant cost in WAN bandwidth usage as well as delay in response time. Instead of aggregating data at one location and executing the computing workloads there, recent works in geo-distributed computing take the opposite approach and move computation workloads towards the data [71, 79, 125, 127, 153-155].

JetStream [127] focuses on the scenario in which applications aggregate data across wide-area networks, and deals with insufficient backhaul bandwidth by applying pre-processing at each source site before transferring all data to a central location. Wanalytics and Geode [154, 155] propose to push the analytical queries to the locations where the data are hosted, and also optimize their execution plans accordingly. Iridium [125] focuses on reducing the network delay throughout geo-distributed job execution by smartly placing jobs' data and computation across datacenters based on bandwidth constraints. Clarinet [153] selects the best configuration, e.g., the order of joining tables, for a geo-distributed job according to bandwidth capacities and data cardinality statistics. Gaia [79] aggregates parameter updates in machine learning model training across the geo-distributed sites, so that cross-site transfer overhead is reduced and the overall training time is improved, while the quality of training results is guaranteed to be bounded by a certain threshold.

These works improve the performance of geo-distributed job execution in various aspects, i.e., reduction in response time and/or wide-area bandwidth usage. However, they all focus only on the network aspect of improving a geo-distributed job's performance, while a geo-distributed job's execution could have a bottleneck in either network or computation resources due to heterogeneous resource constraints across the sites. In contrast, this thesis proposes holistic solutions that improve geo-distributed jobs' performance by considering both network and compute aspects. Furthermore, this thesis also addresses the job scheduling problem, and, most importantly, the inter-dependency between job scheduling and task placement, in the solution design. The above-mentioned works, except for Clarinet, do not deal with job scheduling problems. Clarinet, on the other hand, allocates the resources for geo-distributed jobs by decoupling the joint problem of job scheduling and task placement, which results in sub-optimal performance. To the best of our knowledge, this thesis is the first work that addresses the inter-dependency between job scheduling and task placement in geo-distributed job execution, both of which have significant impacts on a geo-distributed job's response time.

Dealer [71] dynamically redistributes poorly performing tasks of a single job to other datacenters while it is running, in an attempt to reduce user-perceived latency. However, Dealer only distributes a job's tasks to one datacenter at a time, which is a rigid assumption for geo-distributed job execution, as typical jobs have tasks spanning multiple datacenters. This thesis instead addresses the more general setting that allows distributing tasks of a job among multiple datacenters, based on data distribution and resource availability, which leads to greater performance improvement opportunities.

2.2 Intra-datacenter Job Execution

Speeding up computation-intensive workloads by using a cluster of machines has been an active area in High Performance Computing (HPC) over the past decades [8, 54, 80, 84, 86, 135, 163, 165]. Those HPC clusters contain tens to thousands of machines inter-connected by network links within a single location (e.g., a datacenter), and make efficient use of the abundant resources by running data-parallel jobs.
There exist many works for improving the performance of such intra-datacenter jobs through effective resource scheduling, and their proposed techniques are highly related to the core problems (e.g., job scheduling, task placement, multi-resource allocation, and job configuration) addressed in this thesis. The following subsections discuss the representative works on several topics in this research literature and identify the limitations of adapting their proposed techniques in the context of geo-distributed computing.

2.2.1 Job Scheduling and Task Placement

Placing tasks to meet data locality within clusters has become a common approach for data analytics [85, 144, 164], as it significantly improves job response time by avoiding data transfer across the network. The promising gains brought by data-locality task placement do not trivially apply to a geo-distributed computing environment, as the heterogeneity of data distribution and resource availability could result in server load imbalance among the sites. Therefore, mitigating only the network delay does not necessarily improve the overall job response time and could even degrade performance. This thesis focuses on improving geo-distributed jobs' performance by addressing such heterogeneity.

Another set of previous works [23, 25, 27] focuses on reducing the response time of a job by mitigating its outlier tasks, i.e., stragglers. The imbalance among tasks' durations leads to the existence of straggler tasks, which can delay the job's response time. While the reasons behind such imbalance may include many factors, and investigating those reasons is still ongoing research, this thesis aims at improving jobs' performance by addressing one critical reason for stragglers: the imbalance in tasks' durations caused by resource scheduling decisions, e.g., task placement. For the other factors, e.g., machine failure, the techniques proposed for straggler mitigation [23, 25, 27] are orthogonal to those in this thesis, and can be integrated into the proposed solutions in this thesis.

In addition to placing the tasks of a job, recent works [88, 130] push further by planning for both job scheduling and task placement within a cluster, either from the perspective of enhancing data locality (Corral [88]) or mitigating stragglers (Hopper [130]). The major limitation in applying Corral to a geo-distributed environment is that Corral does not explicitly deal with a heterogeneous input data distribution, and always assumes data-locality placement for the first stage of tasks in a job, e.g., the map stage. As geo-distributed jobs could have significant imbalance in data distribution across the sites, Corral's data-locality task placement could lead to scheduling results that are far from optimal. Hopper's approach of dealing with stragglers via task placement and job scheduling by introducing virtual tasks into each job is orthogonal to the proposed solution in this thesis, and hence can be directly integrated into the techniques described in this thesis.

2.2.2 Co-flow Scheduling

A co-flow is a group of data transfer flows that belong to the same job; a co-flow is considered complete only when all of its flows finish transferring. Therefore, co-flow scheduling addresses a problem similar to job scheduling, in which each job is considered complete when all of its tasks finish. There are several works on scheduling co-flows [46-48, 53] for minimizing the average co-flow completion time.
Despite sharing the goal of average completion time reduction, there exist several major differences between the problem settings of this thesis and the co-flow scheduling problem. First, co-flow scheduling focuses only on allocating the network resources, i.e., bandwidth, to the concurrent data transfer flows. In this thesis, a job could have a bottleneck in network or compute resources; focusing on allocating only one resource type may not affect the job's performance significantly, which makes resource allocation trickier and hence challenging. Second, compared to compute resources, the bandwidth resource is infinitely divisible. That is, a resource scheduler can make multiple flows simultaneously share the same link, while each of them receives a small share of the bandwidth but can still use it to transmit data. On the other hand, compute resources are abstracted into compute slots, where each slot can be allocated to only one task at a time. This thesis addresses the scheduling problem of allocating both divisible network resources and non-divisible compute resources, which is more challenging than allocating only one of the two. Finally, the co-flow scheduling results depend on the sending and receiving rates at the two ends of a flow, while there is no such constraint in the problem setting of this thesis.

2.2.3 Multi-resource Scheduling

Jointly allocating multiple resources among concurrent jobs within a datacenter has been an active research topic. Previous works [61, 63] focus on achieving fairness among jobs by allocating resources like disk, CPU and memory. While these works jointly allocate multiple resource types to efficiently schedule the jobs, they do not consider network resources, which significantly influence the performance of geo-distributed jobs. Another set of previous works [65, 66] improves job performance by efficiently packing multiple resources (CPU, memory, bandwidth) to match tasks' resource requirements, and mitigates resource fragmentation by utilizing dynamically sized resources. However, in heterogeneous geo-distributed data analytics, fragmentation is less of an issue given that the number of sites is relatively small compared to the number of machines within a datacenter. Moreover, the network transfer time during a job's duration depends on how the tasks are distributed across sites, as well as on the heterogeneous resource bottlenecks across the sites. Thus, the bandwidth usage of each task cannot be pre-determined for geo-distributed jobs, and the techniques proposed by these works [65, 66] cannot be directly applied to geo-distributed computing.

2.2.4 Summary of Intra-datacenter Job Execution

To sum up, the proposed solutions of the above-mentioned works do not naturally inherit the performance gains they enjoy in the intra-cluster setting. Specifically (as noted above), resource heterogeneity and data skew across multiple sites, combined with the need for multi-resource allocation, present key challenges in performance optimization in geo-distributed computing. To the best of our knowledge, this thesis is the first to jointly address these challenges.

2.3 Distributed Systems and Computing

This section discusses several research topics in distributed computing systems that are related to the geo-distributed computing addressed by this thesis: job/VM placement, grid computing, video analytics systems, database systems, sensor networks and mobile computing systems.
2.3.1 Placement of VMs and Jobs across Distributed Machines

A large body of work addresses the placement of VMs (e.g., Oktopus [35], FairCloud [124]) or of tasks of big data jobs (e.g., Yarn [29], Mesos [78], Apollo [38], Borg [152]) within a single cluster of machines, in which a set of VMs or tasks make exact resource requests and are placed in the cluster to maximize resource utilization of the cluster of machines. For example, Oktopus [35] designs an interface that accounts for virtual overlays and the underlying network connections in shared cloud environments, which allows cloud tenants to specify their requests as a graph of inter-connected VMs, and allows cloud providers to allocate physical resources to maximally serve the tenant requests. FairCloud [124] proposes fair allocation of network resources among cloud tenants based on the network requirements of the VMs owned by different tenants and placed at the same server. Mesos [78] proposes a two-level resource management mechanism, in which the scheduler provides resource offers to each application currently sharing the computing environment, while each application determines whether or not to accept the offer and which tasks to place at the slots included in the offer, based on pre-determined resource requirements.

In a geo-distributed computing environment, however, the resource requirements of the tasks depend on how they are placed across the sites. Therefore, placing the tasks of geo-distributed jobs based on a pre-determined resource demand, as in the existing works for a single cluster, may lead to performance that is far from optimal.

2.3.2 Grid Computing

Grid computing became an active research topic during the mid-1990s due to the increasing need for sharing computing resources among different organizations [37, 57, 58, 161]. By contributing some of its resources, an organization gains access to a larger resource pool under certain sharing policies. Grid computing systems coordinate resources (compute, storage, etc.) from different administrative organizations that are potentially geographically distributed, and provide a unified interface for job submission and execution. Grid computing shares a common feature with the geo-distributed computing environment considered in this thesis: the resources utilized by the computing environment come from geographically distributed sites, and the capacity of each resource type may be heterogeneous across the sites. Therefore, scheduling heterogeneous resources across the sites is an important factor in both grid computing and geo-distributed computing environments. The work in [159] applies the well-known ant algorithm to schedule jobs in grid computing, such that a job's consumption rate of different resource types across the geo-distributed sites can be balanced and hence the resources are better utilized. The work in [74] proposes a QoS-guided min-min scheduling heuristic to achieve high system throughput as well as to match the applications' needs in computation resources. In [162], the authors design a cost-based scheduling algorithm for grid computing applications, such that the execution cost of all the jobs is minimized while meeting the jobs' deadlines.

Despite dealing with geo-distributed and heterogeneous resources, there exist several major distinctions between the problem focus of the geo-distributed computing studied in this thesis and grid computing, which are summarized as follows:

Grid computing allocates its resources in a batched manner.
Each job submitted to a grid computing environment has to specify the amount of required resources as well as the usage duration, and the scheduler of the grid computing environment holds the job in the queue until all the required resources are available for the duration of the job. The problem focus of this thesis, on the other hand, allows some of the jobs to start running in the form of tasks once a certain amount of resources, not all, becomes available. For example, in a geo-distributed computing environment, as soon as a compute slot becomes available, the resource scheduler allocates it to a task of the current jobs.

Moreover, all the available resources in a geo-distributed computing environment are shared by all the running jobs; they are not reserved for a specific job throughout the job's entire duration. For example, a slot running job A's task, once finished, may be allocated to job B's task, even though job A still has tasks that are not completed yet. In a grid computing environment, on the other hand, once the resources are allocated to a job, they will not be released until this job is completely finished.

The data storage in a grid computing environment relies on virtual and shared file systems, instead of the partitioned chunks used in today's data analytics systems [115, 163]. As a result, data stored in a file cannot be efficiently processed in parallel, which could create significant skew in workloads at certain sites. In a geo-distributed computing environment, data are partitioned and stored as small chunks. Hence, a job with large input data can utilize many tasks that run in parallel, and can even run the tasks across sites by utilizing the available resources at multiple sites and transferring the data among them.

Given the previous point, this thesis deals with the problem of heterogeneity in both the resource availability and the data distribution across sites. In a grid computing environment, on the other hand, the data required by the jobs are usually stored in a single location under the virtual file system, or are provided by the user along with the job submission.

Based on the above-mentioned points, resource scheduling in a geo-distributed computing environment is different from, and more challenging than, resource scheduling in a grid computing environment, although the two share some common characteristics.

2.3.3 Video Analytics Systems

Improving the performance of video analytics systems, as addressed in part of this thesis (Chapter 5), has been growing in popularity in recent years. Apart from the large amount of existing research focusing on improving the accuracy of video analytics with enhanced vision algorithms, recent works [72, 104, 169] have started addressing the problem of improving video analytics performance with enhanced system design, e.g., better resource allocation or job configuration. MCDNN [72] uses different versions of DNNs (i.e., configurations) to trade off resource usage and accuracy, but does not consider placement across sites within a geo-distributed environment. Optasia [104] writes video queries in SQL and applies conventional SQL optimization techniques for improving query performance; however, it ignores the resource-accuracy trade-off among different configurations for vision queries, and does not consider placements either. VideoStorm [169] optimizes query knobs and resource allocation to improve both query accuracy and delay, but considers only the CPU resource in a single cluster.
As a result, its solutions cannot be directly applied to a geo-distributed computing environment, in which both CPU and network resources are critical for geo-distributed computing jobs and should be jointly allocated.

Part of this thesis (Chapter 5) aims at improving the performance of video analytics by jointly determining job configuration, placement of jobs' tasks, and resource allocation among concurrent jobs. To the best of our knowledge, this is the first work that addresses such a joint problem for video analytics, as well as in geo-distributed computing.

2.3.4 Database Systems

Database systems have been an active research area for several decades. Query optimization (i.e., selection of query plans) is a core research topic for improving the performance of database systems, with corresponding techniques extensively studied over more than 30 years of research. Over the past decade, other techniques such as placement of queries across multiple machines and even merging of common components among queries have also gained attention due to the wide adoption of distributed database systems and the applications of big data analytics using clusters of machines. These techniques are highly relevant to this thesis, especially to the video analytics system described in Chapter 5. The following categorizes the works in database research into four directions of high relevance to this thesis.

First, numerous works address query optimization, i.e., selecting the best execution plan for a query, in single-server database systems [43, 64, 89]. In general, query optimization techniques develop heuristics according to cost-based optimization in order to minimize cost during query execution. Depending on the performance goal and the system characteristics, the cost is defined to reflect the amount of resource consumption needed to access a single byte of data in the database system, in which the volume of input and intermediate data during query execution is estimated based on cardinality statistics of individual and cross-table data [105, 109, 116]. Based on the cost definition, the query optimization heuristics explore all the possible query execution plans, and select the one with the minimum cost. Common cost-based optimization techniques, such as predicate push-down [70, 76, 97, 122] and partition pruning [77, 103, 112, 120], typically eliminate heavy operations up front instead of passing a significant volume of redundant data along the query execution.

Second, in addition to query optimization for each query individually, there are also proposals for multi-query optimization in database systems [111, 132, 137, 138], which optimize the execution plans for a batch of queries that share the same, or at least partially the same, data access patterns, i.e., common sub-expressions. For example, the work in [132] identifies common sub-expressions within each query or across multiple queries when executing a batch of queries, and stores intermediate views of cross-table data that are cost-efficient to reuse. The work in [111] takes a step further to manage the materialization of the shared views across multiple queries by categorizing them into transient or permanent reuse characteristics, or establishes shared views that can be efficiently recomputed in an incremental manner.

Third, apart from single-server database systems, executing queries across multiple machines has become the trend due to the increasing volume of data and the advances in distributed computing [30, 52, 123, 128, 158].
For example, Track Join [123] proposes a distributed join method that minimizes network traffic with an optimal transfer schedule for each join key. The technique proposed in [128] exploits data locality with an optimal partition assignment to reduce network communication, and schedules transfer flows to better utilize bandwidth. The work in [158] utilizes the visibility and control provided by the Software-Defined Networking (SDN) infrastructure to (1) keep track of bandwidth availability before query execution, (2) reserve bandwidth for queries, and (3) prioritize network transfers for differential query service. These works are generally developed for database systems whose servers are within a single location, which essentially assumes a homogeneous network environment that simplifies the network constraints in the cost-based optimization models.

Fourth, executing streaming analytics queries over distributed database systems has recently become a popular topic in the database community, in which placement of query operators and merging of query sub-expressions are the main focus. The work in [142] aims at placing query operators across servers to minimize the cost of computation and the volume of data transmission. SBON [121] determines the placement of query operators with the view of virtual overlay networks and a multi-dimensional cost space based on network conditions, which helps to minimize the data transmission cost between servers during placement decisions. CACQ [107] adaptively updates query execution plans based on the dynamics of workloads, data streaming rates and system performance; it provides state sharing among queries during join computation to merge parts of operations across streaming queries. These works assume that the performance of streaming analytics queries over distributed database systems is dominated by the network overhead, and therefore focus only on optimizing network cost.

Despite the fact that many existing works in database research address similar challenges (e.g., configuration, placement, and merging), the problem addressed in this thesis is uniquely challenging in several ways, such that those works cannot be directly applied. Specifically, the lack of joint consideration of compute and network resources during query planning and placement in the existing database literature is the key factor: apart from the large body of works on query optimization for single-server database systems, other existing works on distributed database systems either assume homogeneous network models, or assume that the performance is dominated by the network overhead and hence optimize only for that. Moreover, the joint decision of query planning, placement, and merging along with multi-resource considerations, as this thesis mainly addresses, has not been studied in the existing database literature. However, such joint decisions are not unique to this thesis; they can also be applied to database systems in a similar manner.

2.3.5 Sensor Networks

Sensor networks have been an active research area for several decades due to the wide variety of applications: underwater surveillance [22, 75], measurements in the wild [108, 170], health monitoring in buildings and mechanical devices [32, 45, 118], and, recently, Internet of Things (IoT) applications [31, 68].
In these sensor network applications, sensors measure and forward streaming data to certain sinks for further processing, which is similar to live video analytics in part of this thesis (Chapter 5).

Many works have been proposed to optimize the performance of sensor network applications; the following summarizes three representative categories of existing works in sensor network research that are relevant to this thesis.

First, some works focus on the design of efficient routing protocols, so that either the data perceived by the sensors can reach the sinks as soon as possible [59, 160, 168], or the energy consumption during forwarding can be minimized and the lifetime of the sensor network can be maximized [41, 42, 82]. For example, the work in [82] applies the concept of opportunistic routing to balance between transmission range and energy cost according to packet loss patterns under different transmission schemes. By considering the energy consumption against the remaining energy at the sensors as well as the geographical advancement toward the sink, an energy-efficient route (a directed sequence of sensors) is created between the source sensor and the destination sink, which improves the duration of the sensor network's connectivity. In [168], the authors select the routing path based on the wireless transmission scheme, and trade off between transmission range and packet loss. By considering both the transmission range and the packet loss probability, the data originating from the source sensor are guaranteed to be delivered to the sink within the least number of relays. The work in [59] utilizes multi-path routing to collect data from the sensors efficiently and achieves failure resilience.

Second, another set of works aims at conducting data pre-processing or aggregation before forwarding to the sinks, so that the volume of the data forwarded across the sensor network can be reduced, which substantially reduces energy consumption [56, 95, 106]. For example, the work in [106] provides a query interface for users to request data collection from the sensor network, and processes aggregation by computing over the measured data as they flow through the network.

Third, another set of works further optimizes across multiple queries run over the sensor network to reduce redundant execution or the volume of data forwarded across the network. For example, TTMQO [157] proposes a two-tier query optimization framework, where multiple queries are first merged offline in a predetermined manner and their execution is further optimized in the network by minimizing the number of messages sent across the network.

The problem focus of live video analytics in part of this thesis (Chapter 5) differs from sensor networks in several distinctive ways. First and foremost, each query in video analytics has multiple configurations, both in terms of different query implementations and parameter settings, which makes query optimization more challenging compared to optimizing over a pre-determined implementation with several settings as in sensor networks. Moreover, the selection of a configuration for each query in this thesis has to be jointly determined with placement decisions (e.g., routing as in sensor networks) and across multiple queries. Finally, merging among queries along with the joint decision of configuration and placement further distinguishes this thesis from the existing solutions in sensor networks.
So far, no existing work in the sensor network literature addresses the joint decision of configuration, placement, and merging across multiple queries. We believe that sensor network systems can also benefit from such joint decisions to improve their performance.

2.3.6 Computation Offloading or Eliminating in Mobile Computing Systems

Offloading or eliminating expensive operations from a resource-constrained mobile device to the cloud has been a popular research area [33, 34, 67, 98]. Maui [51] and Odessa [126] automatically provide a runtime based on programming reflection to offload methods and adjust execution parallelism to improve responsiveness and accuracy. Maui continuously runs a profiler to measure the CPU and network cost of each method and dynamically determines what is offloaded based on the current state of the system. Odessa is an adaptive runtime for mobile perception applications that performs offloading together with adjusting execution parallelism to jointly improve application responsiveness and accuracy. Starfish [99] eliminates redundant computation among concurrent vision applications on a mobile device.

Compared to offloading, part of this thesis (Chapter 5) considers a large number of queries together and optimizes over a significant search space of query configurations (plans and placements) to resolve conflicts. Specifically, Chapter 5 of this thesis differs from those works in the following aspects: (a) this thesis jointly schedules resources for many queries (jobs) instead of allocating resources for each of them independently, (b) each query has many plans that can be used interchangeably, (c) there are many site options at which a query's component can be placed, (d) plans and placements are jointly considered in a geo-distributed environment, and (e) this thesis resolves any conflicts among the queries by merging shared components along with planning and placement.

2.4 Scheduling Algorithms for Minimizing Job Response Time

Minimizing the average response time of the geo-distributed jobs is the performance goal in part of this thesis (Chapter 3 and Chapter 4). Shortest Remaining Processing Time (SRPT) is a well-known scheduling algorithm that achieves optimal average job response time for preemptive job scheduling in a single-server-single-queue environment [36, 133, 134]; it has been extensively studied and applied to many problem domains [73, 100, 101, 136, 156]. The focus of this thesis, specifically job scheduling in geo-distributed computing scenarios, differs in that: (a) jobs are composed of tasks that can run in parallel, and (b) tasks of the same job potentially span multiple sites, each with a number of compute slots controlled by a local scheduler. As this thesis will show in Section 3.1.3, SRPT-based extensions do not work well in this context, mainly due to the imbalance caused by data distribution and heterogeneous resource availability.

Several efforts, in more idealized (theoretical) settings, include concurrent open shop problems [60, 110, 131], in which each job has certain operations to be processed at each machine, and the goal is to minimize the weighted average job response time. The major difference between this thesis and those works is that this thesis focuses on a more general setting in which the workload at each site, i.e., the task placement, is not a fixed constraint given upfront as in the concurrent open shop problem; instead, it is part of the decisions to be made to improve the jobs' performance.
Beyond that, other distinctions between this thesis and concurrent open shop include the following: (a) this thesis addresses a more general scheduling problem, as each datacenter (or machine, as termed in concurrent open shop) has multiple compute slots that can run the tasks of the same job in parallel, (b) this thesis develops online scheduling mechanisms rather than the offline deterministic scheduling analysis proposed by previous efforts on concurrent open shop, and (c) this thesis conducts real-deployment evaluation and simulation-based experiments to evaluate the performance of scheduling solutions under more realistic settings and with realistic workloads.

Heterogeneous Earliest Finish Time (HEFT) [145] schedules a DAG of tasks (in a job) on heterogeneous processors, inter-connected by network links, to minimize the job's completion time. Given the information of both task computation durations and data transfer durations between the tasks, HEFT greedily schedules the task with the earliest finish time according to its critical path, and assigns it to the processor that achieves the minimum finish time for it. In a geo-distributed computing environment, however, tasks' data transfer durations between sites significantly depend on how the tasks are placed across sites and cannot be pre-determined. In addition, part of this thesis, e.g., Chapter 4, addresses a more general setting in which task placement needs to be jointly determined with job scheduling so as to optimize the average response time of the concurrent jobs. As a result, HEFT is not suitable for the problem addressed by this thesis, due to its assumptions about known information and its lack of a joint decision of job scheduling and task placement.

2.5 Summary

While there exist many works in the related areas tackling some of the similar issues (to various extents) covered by this thesis, the problem addressed by this thesis is unique and challenging as it deals with the joint decision of job scheduling, job configuration, and task placement in a geo-distributed computing environment where heterogeneity of both resource availability and data distribution complicates the resource allocation decisions. Solving such a complex problem as a whole, which has not yet been addressed in the literature, is the main contribution of this thesis, while some of the principles and insights derived from this thesis can be applied to the related fields.

Chapter 3
Scheduling Jobs across Geo-distributed Datacenters

3.0.1 Introduction

Data-intensive jobs run by cluster computing systems (e.g., Hadoop [8], Spark [163], Dryad [84]) generate significant workloads for datacenters, providing services such as web search, consumer advertisements and product recommendations, user behavior analysis and business intelligence. These jobs are composed of numerous tasks. Each task reads a partition of the input data and runs on available computing slots in parallel; the job is finished upon the completion of all of its tasks [23, 25, 27]. To serve the increasing demands of various data analytics applications, major cloud providers like Amazon [7], Microsoft [13] and Google [87] each deploy from tens to hundreds of geo-distributed datacenters; AT&T has thousands of datacenters at their PoP locations.

Conventional approaches perform centralized job execution, with each job running within a single datacenter.
In such a case, when a job needs data from multiple datacenters, a typical approach is to first collect all the required data from the multiple datacenters at a single location, and then run the computation at that datacenter [49, 69, 83, 94, 121]. However, as data volumes continue to grow in an unprecedented manner, such an approach results in substantial network traffic [127, 154, 155] and increased job completion times [71]. Moreover, it is becoming increasingly impractical to replicate a large data set across multiple datacenters [114]. Finally, some data are restricted to certain datacenters due to security and privacy constraints (e.g., they must be kept within a particular nation [154, 155]), and therefore cannot be moved.

Consequently, instead of data aggregation at a single datacenter, a recent trend is to conduct distributed job execution, i.e., running a job's tasks at the datacenters where the needed data are stored, and only aggregating the results at job completion time. Recent research efforts show that distributed job execution achieves 250× bandwidth savings [154, 155] and reduces the 90th-percentile job completion time by a factor of 3 [71]; moreover, 3×-19× query speed-ups and 15-64% reductions in bandwidth costs can be achieved [125].

Although promising, distributed job execution poses new challenges for job scheduling. Since a job's completion time is determined by its last completed task across the datacenters, finishing a portion of the job quickly at one datacenter does not necessarily result in a faster overall job completion time. In addition, potential skews in the number of tasks per job processed at a particular datacenter (as determined by the data stored there) further complicate matters. Hence, prioritizing a job's tasks at one datacenter when its counterparts at other datacenters dominate the overall job completion is "wasteful" (in the sense that prioritizing a different job may have led to a better overall average completion time).

Consequently, unlike in the single-server-single-queue scenario, classical Shortest Remaining Processing Time (SRPT) scheduling [36, 133, 134] fails to optimize the average job completion time in the case of multiple datacenters with parallel task execution. To provide insight into the sub-optimal behavior of SRPT (and its natural extensions to the multiple-datacenter scenario), we present motivating examples in Section 3.1, and then show in Section 3.4 that SRPT-type techniques' scheduling of jobs based only on their sizes results in even worse behavior under heterogeneous datacenters.

To address the challenges outlined above, in this Chapter we focus on job scheduling algorithms designed for the multi-datacenter parallel task execution scenario. Even restricted versions of this scheduling problem (with a single server and a single queue per datacenter) have been shown to be strongly NP-hard [131] or APX-hard [60]. Thus, our efforts are focused on principled heuristic solutions that can be (experimentally) shown to provide near-optimal performance. Specifically, our contributions can be summarized as follows.

First, we illustrate why natural SRPT-based extensions leave significant room for performance improvements, which provides insights for better approaches (Section 3.1).

Second, we propose a light-weight "add-on", termed Reordering, that can be easily added to any scheduling algorithm to improve its performance by delaying parts of certain jobs without degrading their response times, while providing opportunities for other jobs to finish faster.
We prove that executing Reordering after any scheduling algorithm results in performance that is no worse than that without Reordering (Section 3.2).

Third, we construct three principles for designing a job scheduling algorithm aimed at reducing the average job completion time in our setting. Armed with these design principles, we develop Workload-Aware Greedy Scheduling (SWAG), which greedily serves the job that finishes the fastest by taking the existing workload at the local queues into consideration (Section 3.3).

Fourth, as a proof of concept, we implement a prototype using our proposed algorithms under Spark [163] while addressing several system implementation issues (Section 3.4). We also conduct extensive large-scale simulation-based experiments using realistic job traces under a variety of settings (Section 3.5). Our results show that SWAG and Reordering achieve as high as 50% and 27% improvements, respectively, in average job completion time as compared to the SRPT-based extensions. The results also show that the proposed techniques achieve completion times within 2% of an optimal solution (as obtained through brute force for comparison purposes), while requiring reasonable communication and computation overhead.

3.1 Background and Motivation

In this section, we first present an overview of the distributed job execution framework in a geo-distributed datacenter system. Next, we provide a motivating example to illustrate the need for better scheduling approaches.

3.1.1 Job Scheduling across Geo-distributed Datacenters

Figure 3.1 depicts the general framework for distributed job execution in geo-distributed datacenters. Our system consists of a central controller and a set of datacenters $D$ spanning geographical regions, and the system serves jobs whose input data are stored across the geo-distributed datacenters. Each job (arriving at the central controller) is composed of small tasks that process independent input partitions and run in parallel [23, 25, 27].

The main focus of this Chapter is the development of an effective job scheduling mechanism for geo-distributed datacenters. In our system, job scheduling decisions are made (and potentially re-evaluated) at job arrival and departure instants (we illustrate later in Section 3.4 that this is sufficient), and involve two levels of schedulers. (1) The global scheduler, residing in the central controller, makes job-level scheduling decisions for all jobs in the system (in some cases the global scheduler delegates the job-level scheduling to the local schedulers, as discussed later), and assigns a job's tasks to the datacenters that host the input data (some local jobs may go directly to the datacenter where all of their required data is located; we assume that each datacenter reports information about local jobs to the central controller as the jobs arrive). (2) The local scheduler at each datacenter has a queue $q_d$ that stores the tasks assigned by the global scheduler, and launches the tasks at the next available computing slot based on the job order determined by the global scheduler (or the local scheduler itself). In addition, all datacenters report their progress to the central controller, in support of global job scheduling decisions. The job-level scheduling decisions are therefore made through the coordination of the global and local schedulers (depending on the scheduling technique, as described later) and are a function of the set of current jobs $J$, their tasks, and the local queue information reported by the datacenters. A job is considered completed only after all of its tasks are finished; therefore the job completion time is determined by its last completed task. Our goal is to reduce the average job completion time.
Fully replicating data across all datacenters in today's systems is quite costly, both in terms of storage space and in the overhead of maintaining consistency among the copies [114]. Instead, recent systems [114] opt for a single primary copy plus multiple partial copies based on coding techniques and replication policies. In our system, each task is assigned to the datacenter that holds the primary copy of its input data.

We refer to the subset of a job's tasks assigned to the same datacenter as the job's sub-job at that datacenter. Let $v_{j,d}$ denote the sub-job composed of job $j$'s tasks that are assigned to datacenter $d$. The order in which these sub-jobs are served at each datacenter is determined by the job-level scheduling decisions, where the local scheduler continues launching the tasks of the first sub-job in the queue whenever a computing slot becomes available, unless the order of sub-jobs is updated. When such modifications occur, we assume no preemption of a task execution while it is running (non-preemptive task execution is common in conventional cluster computing systems [8, 163], as the tasks are typically of short duration and hence the switching cost is relatively large), but a job (or sub-job) execution can be preempted, i.e., the tasks of other jobs (or sub-jobs) can be scheduled to run before the non-running tasks of the currently running job (or sub-job).

[Figure 3.1: System Architecture of Distributed Job Execution]

To facilitate global scheduling decisions, each datacenter reports its current snapshot (including the progress of the sub-jobs in service and those in the queue) to the central controller. For simplicity of presentation and evaluation, we assume that this information is guaranteed to be delivered in time and to be accurate. In addition, we assume that our system primarily serves jobs with single-stage tasks.

Job ID | Arrival Sequence | Remaining Tasks in DC1 | Remaining Tasks in DC2 | Remaining Tasks in DC3 | Total Remaining Tasks
A | 1 | 1 | 10 | 1 | 12
B | 2 | 3 | 8 | 0 | 11
C | 3 | 7 | 0 | 6 | 13
Table 3.1: Settings of the Example: Job Set, Arrival Sequence and Task Assignment
[Figure 3.2: Results of the Example: Job Orders and Finish Instants Computed by Different Scheduling Algorithms. (a) FCFS: Job A 10, Job B 18, Job C 11, average 13. (b) Global-SRPT: Job A 18, Job B 8, Job C 11, average 12.3. (c) Independent-SRPT: Job A 18, Job B 8, Job C 11, average 12.3. (d) Workload-Aware Greedy Scheduling (SWAG): Job A 18, Job B 10, Job C 7, average 11.7. (e) Global-SRPT w/Reordering: Job A 18, Job B 8, Job C 10, average 12. (f) Independent-SRPT w/Reordering: Job A 18, Job B 8, Job C 10, average 12.]

3.1.2 Motivating Example

We now present a simple example to illustrate how the various scheduling techniques work and the differences in their scheduling results. Table 3.1 describes the example settings (job arrival order, number of tasks per job and their distribution among the datacenters); Figures 3.2a, 3.2b, 3.2e, 3.2c, 3.2f and 3.2d provide the scheduling results obtained by the various scheduling techniques described in this Chapter. In this example, there are three jobs arriving to the system at different times, with Job A followed by Job B, followed by Job C. At the time the scheduler makes the scheduling decision, these three jobs all have some tasks that are not yet launched. The jobs' remaining sizes (here, a job's remaining size is its remaining number of tasks that are not yet launched) in each datacenter are also given in Table 3.1. In this example, each datacenter has a single compute slot, i.e., the datacenter serves one task at a time.

Let the completion time of job $i$ be $r_i = f_i - a_i$, where $f_i$ and $a_i$ are the time instants of finishing job $i$ (its finish time) and of job $i$'s arrival, respectively. Then, the average job completion time of $n$ jobs is $\frac{1}{n}\sum_{i=1}^{n} r_i = \frac{1}{n}\sum_{i=1}^{n}(f_i - a_i) = \frac{1}{n}\left\{\sum_{i=1}^{n} f_i - \sum_{i=1}^{n} a_i\right\}$. We can view reducing the average job completion time as reducing the sum of the finish times, $\sum_{i=1}^{n} f_i$ (or equivalently, $\frac{1}{n}\sum_{i=1}^{n} f_i$), as $\sum_{i=1}^{n} a_i$ is constant. For simplicity of exposition, we discuss the remainder of the example in terms of reducing the average finish time (rather than the average completion time).

We further define a sub-job's finish instant $i_{j,d}$ as the queue index at which sub-job $v_{j,d}$ ends, which is computed as $i_{z,d} + |v_{j,d}|$, where $v_{z,d}$ is the sub-job immediately ahead of $v_{j,d}$ in the queue, and $|v_{j,d}|$ is the size (remaining number of tasks) of sub-job $v_{j,d}$. The sub-job's finish instant is a relative measure and a monotonic indicator of its finish time (we discuss the assumptions that make a job's finish instant equal to its finish time in Section 3.2, and how our system addresses those assumptions in Section 3.4); specifically, given that $i_{a,d} < i_{b,d}$ for any $a, b \in J$, sub-job $v_{a,d}$ finishes no later than sub-job $v_{b,d}$ does. In addition, a job's finish instant is the maximum finish instant over all its sub-jobs, i.e., $\max_{d \in D} i_{j,d}$. In this example, if we were to use a First Come First Serve (FCFS) scheduling approach, the finish instants of Jobs A, B, and C would be 10, 18 and 11, respectively, which results in an average job finish instant of 13.
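To make these definitions concrete, the short Python sketch below (an illustration we add here; the data layout and function name are ours and not part of the prototype) computes the sub-job and job finish instants from the per-datacenter queue orders. Applied to the FCFS order of Table 3.1, it reproduces the finish instants 10, 18 and 11 (average 13) of Figure 3.2a.

```python
def job_finish_instants(queues):
    """queues: {datacenter: ordered list of (job, remaining_tasks)}.
    Returns each job's finish instant, i.e., the maximum over datacenters of
    the cumulative queue index at which the job's sub-job ends."""
    finish = {}
    for dc, subjobs in queues.items():
        position = 0
        for job, size in subjobs:
            position += size                      # i_{j,d} = i_{z,d} + |v_{j,d}|
            finish[job] = max(finish.get(job, 0), position)
    return finish

# FCFS order (A -> B -> C) with the task assignment of Table 3.1.
fcfs_queues = {
    "DC1": [("A", 1), ("B", 3), ("C", 7)],
    "DC2": [("A", 10), ("B", 8)],
    "DC3": [("A", 1), ("C", 6)],
}
print(job_finish_instants(fcfs_queues))   # {'A': 10, 'B': 18, 'C': 11}; average 13
```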
3.1.3 SRPT-based Extensions

In the single-datacenter scenario, or more specifically the single-server-single-queue scenario with job preemption, it has been shown that Shortest-Remaining-Processing-Time (SRPT) minimizes the average job completion time [36, 133, 134] by selecting the job with the smallest remaining size first.

To the best of our knowledge, the problem of scheduling jobs across multiple datacenters has not been solved nor extensively studied. It is natural to consider SRPT-based extensions to the multi-datacenter environment, as we present next. However, we illustrate later in this section their shortcomings as the motivation for better approaches.

Global-SRPT

The first heuristic is to run SRPT in a coordinated manner: it performs SRPT and computes the jobs' priorities based on the jobs' total remaining sizes across all the datacenters. We call this heuristic Global-SRPT. Global-SRPT runs at the central controller, as it requires the global state of the current jobs' remaining tasks across all the datacenters. The central controller then passes the job order computed by Global-SRPT to all the datacenters, where each datacenter scheduler updates its sub-job order in the queue based on the new job order.

In our motivating example, the total remaining tasks for Jobs A, B and C are 12, 11 and 13, respectively, so the job order computed by Global-SRPT is B → A → C, which is enforced by each datacenter as shown in Figure 3.2b. Since Global-SRPT gives higher priority to the jobs with fewer tasks and finishes them as quickly as possible, it avoids cases in which small jobs are blocked behind large jobs and spend a lot of time waiting. As a result, Global-SRPT achieves a better average job finish instant (37/3 in the example) compared to that of the default FCFS scheduling (13 in the example).

Independent-SRPT

Since SRPT is designed for a single-scheduler scenario, our second heuristic is to enable each datacenter scheduler to perform SRPT on its own, with the hope that each datacenter reduces the average completion time of its sub-jobs. We call this Independent-SRPT, as each datacenter prioritizes its sub-jobs based on their sizes and updates its queue order independently of the information of other datacenters.

In the example, according to the jobs' remaining number of tasks in each sub-job, their priorities at each datacenter may not be the same. In datacenter 1, the priority is A → B → C, while the priorities in datacenter 2 and datacenter 3 are B → A and A → C, respectively (as shown in Figure 3.2c). By reducing the finish instants of the sub-jobs in each datacenter, Independent-SRPT achieves 37/3 for the average job finish instant in the motivating example, which is better than FCFS (13).

Shortcomings of SRPT-based Extensions

Both Global-SRPT and Independent-SRPT improve the average job completion time by favoring small jobs. However, since each job may have multiple sub-jobs across the datacenters, the imbalance of the sizes among the sub-jobs causes problems for SRPT-based scheduling.
Take Global-SRPT for example: in Figure 3.2b, we see that job A's sub-jobs in datacenters 1 and 3 finish even before its sub-job at datacenter 2 starts. Since the job's completion time is determined by the last completed sub-job across all datacenters, we can actually defer $v_{A,1}$ and $v_{A,3}$ a bit without hurting job A's finish instant, while yielding the compute resources to the tasks of other sub-jobs, say those of job C in this example. The same observation is also valid for Independent-SRPT in the example, in which $v_{A,1}$ can yield to $v_{B,1}$ and $v_{C,1}$ in datacenter 1, and $v_{A,3}$ can yield to $v_{C,3}$ in datacenter 3, without delaying job A's finish instant, as depicted in Figure 3.2f.

As illustrated in the above example, both Independent-SRPT and Global-SRPT leave significant room for improvement, as they waste resources serving some sub-jobs while their counterparts at other datacenters are delayed due to imbalanced job execution. Next, we first propose a mechanism in Section 3.2 to improve the result of scheduling by eliminating the waste of resources in imbalanced job execution. Then we develop a new scheduling solution in Section 3.3 that leads to further improved scheduling results.

3.2 Reordering-based Approach

Recall that one insight into why the SRPT-based heuristics do not result in better performance is that they fail to consider the competition for resources faced by each job's component sub-jobs, as only the "slowest" sub-job determines the response time of the job. Consequently, there is no gain from lowering the response time of a sub-job at datacenter $d$ if it has a counterpart at another datacenter with a higher completion time. In that case, we might as well delay this sub-job, in favor of other sub-jobs at datacenter $d$ which have "faster" counterparts at other datacenters. This brings us to the notion of reordering the sub-jobs of the jobs, in a coordinated manner, based on how the sub-jobs of a job are progressing at the various datacenters.

Specifically, we develop Reordering as an auxiliary mechanism to reduce the "imbalance" (in terms of their positions in the local queues) of a job's sub-jobs. Reordering can work as an "add-on" to any scheduling solution. The basic idea behind Reordering is to keep moving sub-jobs later in a local queue, as long as delaying them does not increase the overall completion time of the job to which they belong; this, in turn, gives other jobs an opportunity for a shorter completion time.
The final job order computed by Reordering is the reverse order ofN (Step 10). In our example in Figure 3.2, Reordering improves both Global-SRPT and Independent-SRPT by delaying v A;1 and v A;3 until the end of their associated queues after identifying that DC2 has the longest queue length and sub-jobv A;2 is the last one in its queue. The delay ofv A;1 andv A;3 does not degrade JobA’s finish instant as it is determined by v A;2 . This procedure continues by selecting Job C, and finally Job B, which results in N = A!C!B. Thus, Reordering returns B!C!A, with a mean job finish instant of 12 for both Global-SRPT with Reordering and Independent- SRPT with Reordering, as opposed to that of 37 3 without Reordering. 43 Note that in the Reordering algorithm, we use a job’s finish instant to approximate its job finish time. Moreover, the job finish instant is exactly the job finish time under the following assumptions: (1) homogeneous task service times, i.e., all tasks of all jobs have the same duration; (2) homogeneous service rates, i.e., all servers in all datacenters serve tasks at the same rate; and (3) homogeneous data centers, i.e., all datacenters have an equal numbers of computing slots with the same configurations. Under the above stated assumptions, Reordering would, at the very least, not result in degradation in completion time. Theorem 1: Reordering provides non-decreasing performance improvement for any scheduling algorithm. Let f x be job x’s finish instant represented by the queue position; that is, f x = max y2D i x;y . LetO to be any scheduling algorithm applied to the datacenters andh O be the resulting overall job finish instant; that is,h O = 1 jJj P x2J f x . LetR denote the Reordering algorithm andh O;R to be the overall job finish instant of executing algorithm O and algorithm R sequentially. Theorem 1 states that h O;R h O no matter what scheduling algorithmO is. Proof: We provide an intuitive proof based on Mathematical Induction on the num- ber of jobs. When n = 1, the theorem obviously holds. Assume the theorem holds whenn = k. We defineh(k) as the overall job finish instant when the number of jobs is k. So, h O;R (k) h O (k). When n = k + 1, suppose we first process job a, since it is identified from the data-center with the longest queue, after being processed, its finish timef 0 a is the same asf a , which is joba’s finish instant before applying Reorder- ing. For the other jobs, based on step 3 we know that h O;R (k) h O (k). Therefore, h O;R (k+1) = kh O;R (k)+f 0 a k+1 kh O (k)+fa k+1 =h O (k+1): The above Theorem proves that Reordering improves, or does no harm at least, the average job finish instant for any job scheduling algorithm. With the assumption that 44 job finish instant can estimate the job finish time, Reordering improves the average job finish time, and the average job response time as the result. In Section 3.4 we discuss how we address these assumptions for a system prototype, and evaluate it in Section 3.5. In summary, we emphasize that Reordering is an add-on mechanism that can be easily used with any scheduling approach to improve (or at the very least not harm) overall average job completion time. 3.3 Workload-aware Approach Given the “do no harm” property of Reordering as described above, it is naturally a conservative approach (to modifying the original scheduling decisions), with results depending significantly on the original scheduling algorithm to which the reordering process is applied. 
3.3 Workload-aware Approach

Given the "do no harm" property of Reordering as described above, it is naturally a conservative approach (to modifying the original scheduling decisions), with results depending significantly on the original scheduling algorithm to which the reordering process is applied. However, Reordering still leaves room for improvement. In the motivating example in Section 3.1, both Global-SRPT (Figure 3.2e) and Independent-SRPT (Figure 3.2f) came up with the job order B → C → A. We observe that the scheduling performance would be improved if we switched the order of job B and job C, resulting in the new job order C → B → A. Doing so would bring a performance improvement for job C while hurting the completion time of job B, which is against the principle of Reordering, yet the net effect is an overall performance improvement, as shown in Figure 3.2d. This observation motivates us to develop a more aggressive approach than Reordering, termed Workload-Aware Greedy Scheduling (SWAG), which schedules the jobs greedily based on their estimated finish times. We first discuss the design principles behind SWAG, and then present its algorithm details.

3.3.1 SWAG Design Principles

Recall that a job's completion time is composed of its waiting time as well as its service time, and that traditional SRPT results in the shortest total waiting time over all jobs by greedily scheduling the job with the shortest remaining processing time ahead of the long ones. Therefore, SRPT optimizes the average job completion time, since the jobs' service times are fixed. (Note that in traditional scheduling problems a job is an atomic processing unit, as opposed to our problem where a job is composed of small tasks that can be executed in parallel.) This insight is common to all job scheduling aimed at reducing the average job completion time, and it sets the ground for our first design principle for SWAG.

First Principle: In order to reduce the total waiting time and thereby reduce the response time, jobs that can finish quickly should be scheduled before the other jobs.

However, as shown in Section 3.1.3, following the first principle by favoring only the small jobs is sub-optimal in the multiple-scheduler-multiple-queue scenario, due to the imbalance between the sizes of the sub-jobs across the datacenters and the fact that the finish time of a job depends only on its last completed sub-job. In fact, a small job with a large sub-job may not finish as quickly as a large job with many small sub-jobs. This leads us to the second design principle.

Second Principle: Since small jobs are not guaranteed to finish quickly (as is the case in the single-scheduler-single-queue scenario), we should schedule jobs more as a function of sub-job sizes rather than the size of the overall job.

The first two principles guide us to select the job that would finish the quickest under the condition that it occupies the entire system. However, each datacenter has a different workload at the scheduling decision instant, which also impacts the waiting time that each sub-job suffers. This gives us the final design principle for SWAG.

Third Principle: Since the sub-jobs of a job experience different delays at different datacenters, we should also consider the local queue sizes in assessing the finish times of sub-jobs.
[Figure 3.4: Motivating Example for Third Principle. (a) SRPT-based Approach: finish instants Job A: 4, Job B: 5, average 4.5. (b) Better Approach: finish instants Job A: 5, Job B: 3, average 4.]

Figure 3.4 presents a simple example to illustrate the third principle, in which there are two jobs A and B to be scheduled over 3 datacenters, and there are two tasks already queued at the first datacenter. Note that both Global-SRPT and Independent-SRPT would produce the scheduling result shown in Figure 3.4a, as they both prioritize the jobs or sub-jobs based on their sizes only. Also note that executing Reordering after Global-/Independent-SRPT does not improve their performance, because the dominating jobs and sub-jobs are already placed at the end of the queue.

In conclusion, all three principles are essential for reducing the average job completion time. Next, we present how we construct SWAG based on these principles.

3.3.2 SWAG Algorithm

In our design, the central controller runs SWAG whenever a new job arrives or departs. The new order of all jobs is computed from scratch based on the estimated job finish times.

1: Input: $J$; $v_{j,d}$, $\forall j \in J, d \in D$
2: $N \leftarrow \emptyset$   (an ordered list)
3: $q_d \leftarrow 0$, $\forall d \in D$
4: while $|N| \neq |J|$ do
5:   $m_j \leftarrow \max_d (q_d + |v_{j,d}|)$, $\forall j \in J, d \in D$
6:   $targetJob \leftarrow \arg\min_j m_j$, $\forall j \in J$
7:   $N$.push_back($targetJob$)
8:   $q_d \leftarrow q_d + |v_{targetJob,d}|$, $\forall d \in D$
9: return $N$
Figure 3.5: Workload-Aware Greedy Scheduling (SWAG)

Let $q_d$ denote the current queue length at datacenter $d$, and $|v_{j,d}|$ denote the size of job $j$'s sub-job at datacenter $d$; $v_{j,d} = 0$ if none of job $j$'s tasks is assigned to datacenter $d$. In addition, we define the makespan $m_j$ of job $j$ as:

$m_j = \max_{d \in D} (q_d + |v_{j,d}|)$.   (3.1)

Then SWAG, as detailed in Algorithm 3.5, greedily prioritizes jobs by computing their estimated finish times based on the current queue lengths (accumulated number of tasks to be served) as well as the jobs' remaining sizes (number of remaining tasks). Initially, SWAG computes the makespan for each job based on Equation 3.1 (Step 5). Then SWAG selects the job with the minimal makespan (Step 6), appends it to the job order (Step 7) and updates the queue lengths based on the selected job's sub-job sizes (Step 8). If there is more than one job with the minimal makespan, SWAG picks the one with the smallest total remaining size as a tie-breaker. SWAG continues to greedily add the next job with the smallest makespan, with respect to the current queue lengths, until all the current jobs in the system have been added.

In our example presented in Figure 3.2, SWAG first selects Job C as it has the smallest makespan of 7, compared to 10 for Job A and 8 for Job B. After that, the queue lengths for datacenter 1 and datacenter 3 are updated to 7 and 6, respectively, according to Job C's sub-job sizes. At this point, Jobs A and B both have the same makespan of 10 with respect to the new queue lengths. Since Job B has a smaller remaining size than Job A, it is added after Job C, followed by Job A. The final job order as computed by SWAG is C → B → A, and the resulting average job finish instant is 35/3, which is better than that of the SRPT-based solutions.
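Analogously, the following is a minimal Python sketch of SWAG as described by Algorithm 3.5 (again an illustration we add here rather than the prototype code; the input layout is an assumption). With the sub-job sizes of Table 3.1 and initially empty queues, it produces the order C → B → A, with the tie between Jobs A and B broken by the smaller total remaining size, matching the discussion above.

```python
def swag(jobs, datacenters):
    """jobs: {job: {datacenter: remaining_tasks}} giving the sub-job sizes |v_{j,d}|.
    Returns the job order computed by SWAG (Algorithm 3.5)."""
    q = {d: 0 for d in datacenters}          # Step 3: accumulated queue lengths
    order = []                               # the ordered list N
    remaining = dict(jobs)
    while remaining:                         # Step 4
        def makespan(j):
            # Equation 3.1: m_j = max_d (q_d + |v_{j,d}|)
            return max(q[d] + remaining[j].get(d, 0) for d in q)
        # Step 6: minimum makespan; ties broken by smaller total remaining size.
        target = min(remaining,
                     key=lambda j: (makespan(j), sum(remaining[j].values())))
        order.append(target)                 # Step 7
        for d in q:                          # Step 8: update the queue lengths
            q[d] += remaining[target].get(d, 0)
        del remaining[target]
    return order

# Sub-job sizes from Table 3.1, with initially empty queues.
jobs = {"A": {"DC1": 1, "DC2": 10, "DC3": 1},
        "B": {"DC1": 3, "DC2": 8},
        "C": {"DC1": 7, "DC3": 6}}
print(swag(jobs, ["DC1", "DC2", "DC3"]))     # ['C', 'B', 'A']
```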
3.4 Prototype and System Extensions

In this section we describe our prototype implementation and how we address several system issues.

Prototype: We implemented a system prototype with Spark [163]. The two main components in our system prototype are the global controller and the local controller. The global controller is primarily in charge of computing the job orders, by running the Reordering or SWAG module, based on the information (e.g., the number of remaining tasks of each job at each datacenter) collected from each local controller. The global controller passes the resulting job orders to each local controller through socket communication. In addition, whenever a new job arrives, it divides the job into sub-jobs and sends the metadata (e.g., the application program ID, the number of tasks) to each local controller. The local controller is in charge of feeding the computed job orders to the local cluster as well as reporting the jobs' progress to the global controller. Based on the updated job order, each cluster scheduler assigns the next available computing slots to the tasks of the job with the highest priority until all of its tasks are launched. In addition to passing new job orders to the cluster, the local controller sends the global controller updates on the jobs' progress (e.g., the number of finished tasks for each job), upon receiving requests from the global controller, by reading the logs produced by the Spark cluster.

Heterogeneous Datacenter Capacity. In previous sections we assumed all datacenters to be homogeneous in that they have the same number of computing slots for serving the tasks. In reality, datacenters may have different capacities in terms of the number of computing slots. Recall that both Reordering and SWAG rely on the queue length as the estimate of job finish time (e.g., Step 5 in Algorithm 3.3 and Step 5 in Algorithm 3.5), while the same queue length would result in different job finish times for datacenters equipped with different numbers of computing slots. Reordering and SWAG can easily adapt to heterogeneous datacenter capacity by normalizing the queue length of each datacenter by its number of computing slots. For example, Step 5 in Algorithm 3.3 can be updated as $targetDC \leftarrow \arg\max_d \left[\frac{|q_d|}{c_d}\right]$, $\forall d \in D$, and Step 5 in Algorithm 3.5 can be updated as $m_j \leftarrow \max_d \left[\frac{q_d + |v_{j,d}|}{c_d}\right]$, $\forall j \in J, d \in D$, where $c_d$ represents the number of computing slots in datacenter $d$. The intuition is that datacenters with more computing slots take less time to serve the same workload than datacenters with fewer computing slots.
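As a small illustration of this normalization (a sketch we add here, reusing the dictionary layout of the SWAG sketch above; the function name and the example values are not from the thesis), the capacity-aware makespan expresses each queue in units of scheduling rounds by dividing by the slot count $c_d$:

```python
def capacity_aware_makespan(subjobs, q, c):
    """subjobs: {datacenter: remaining_tasks} for one job; q: {datacenter: queued
    tasks}; c: {datacenter: number of compute slots}. Computes
    m_j = max_d (q_d + |v_{j,d}|) / c_d, the normalization described above."""
    return max((q[d] + subjobs.get(d, 0)) / c[d] for d in q)

# The same 8-task sub-job looks "shorter" at a datacenter with 4 slots than at
# one with a single slot.
print(capacity_aware_makespan({"DC1": 8}, {"DC1": 0, "DC2": 0}, {"DC1": 4, "DC2": 1}))  # 2.0
print(capacity_aware_makespan({"DC2": 8}, {"DC1": 0, "DC2": 0}, {"DC1": 4, "DC2": 1}))  # 8.0
```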
Heterogeneous Task Durations. In the above presentation we assumed that all tasks across all jobs are of the same duration. However, previous works [23, 25, 27] show that task durations can be heterogeneous within and across jobs in a real system, due to various reasons. We address this by having the local scheduler of each datacenter select the not-yet-launched task with the longest expected duration from the sub-job with the highest priority as determined by the job scheduling. The rationale behind this method is to start the larger tasks earlier in order to reduce the makespan across all tasks of a particular sub-job.

Inaccuracies in Task Duration Estimation. The way we address heterogeneous task durations (task-level scheduling by the local schedulers) relies on reasonably accurate estimation of task durations. Unfortunately, there is no guarantee that the estimates at the scheduler are accurate, because task durations are subject to many dynamic factors [23, 25, 27], including I/O congestion and performance interference among concurrent tasks. The typical approach to this problem is to use the durations of finished tasks to estimate the durations of the remaining tasks of the same job [23, 25, 27]; it is reported that the estimation accuracy of such approaches reaches 80% as the jobs get closer to completion [27]. Here, we do not assume a specific estimation mechanism for task durations, but rather (in Section 3.5) evaluate the sensitivity of our system's performance to the estimation accuracy.

Scheduling Decision Points. The heterogeneous nature of task durations and the (potential) lack of accuracy in their estimation indicate that in a real system we should consider (re)evaluating scheduling decisions at task departure points (in addition to job arrival and departure points). However, our simulation study indicates that the heterogeneous nature of task durations and the inaccuracies in their estimation have only a marginal impact on the scheduling results. Since the frequency of task departures can be a few orders of magnitude higher than that of job arrivals and departures, running job-level scheduling at such a high frequency would incur substantial overhead, particularly as job-level scheduling is performed by the central controller. Consequently, we conclude that in a real system it is sufficient to consider scheduling decisions upon job arrivals and departures.

3.5 Performance Evaluation

In this section we conduct an extensive simulation study, with realistic job traces, of the proposed scheduling approaches (SWAG and Reordering) compared to the traditional solutions (FCFS and SRPT extensions) with regard to performance improvement and fairness (Section 3.5.2), overhead evaluation (Section 3.5.3) and sensitivity analysis (Section 3.5.4). Our results show that SWAG and Reordering improve on SRPT-based approaches by 50% and 27%, respectively, over a wide range of settings.

Trace Type | Average Job Size (num. tasks) | Variance | Small Jobs (1-150 tasks) | Medium Jobs (151-500 tasks) | Large Jobs (501+ tasks)
Facebook [23, 25-27] | 364.6 | high | 89% | 8% | 3%
Google [10, 129, 139] | 86.9 | small | 96% | 2% | 2%
Exponential | 800 | medium | 18% | 29% | 53%
Table 3.2: Job Traces

3.5.1 Experiment Settings

The main performance metric we focus on is the average job completion time, which is defined as the average elapsed duration from a job's arrival time to the time instant at which the job has all its tasks completed and can depart from the system. Average job completion time is a common metric for data analytics systems; it is a reasonable metric when focusing on customer quality-of-service. In addition, we also evaluate the jobs' slowdown, which is defined as the job completion time divided by the job service time. We use slowdown as a metric for evaluating fairness among jobs of different sizes, as detailed in Section 3.5.2. All performance results are presented with confidence intervals of 95% ± 5%.

We compare the performance of FCFS, Global-SRPT, Independent-SRPT, Global-SRPT followed by Reordering, Independent-SRPT followed by Reordering, and SWAG. We also show the results generated by Optimal Scheduling, which are obtained through an offline brute-force search, i.e., with full knowledge of future job arrivals and actual task durations. We use the results from Optimal Scheduling as an upper bound on the response time improvement that can be achieved through better scheduling, to investigate how much room for improvement is left. We run FCFS as our baseline scheduling approach, for comparison purposes only.
For clarity of exposition, we present our results as the normalized average job completion time of each algorithm, i.e., normalized by the average job completion time achieved by the FCFS approach for the same setting.

Workload: We use synthetic workloads in our experiments with job size distributions obtained from Facebook's production Hadoop cluster [23, 25-27] and the Google cluster workload trace [10, 129, 139], as well as from Exponential distributions; we refer to these as the Facebook trace, Google trace and Exponential trace, respectively. Table 3.2 summarizes the job traces we use in our simulation experiments. We adjust the jobs' inter-arrival times for the workloads based on a Poisson process in order to make the workloads consistent in terms of system utilization. The default setting for the average job size is 800 tasks, and we tune the job inter-arrival time to obtain a workload with a given system utilization.

Task Durations: The task durations in our simulations are modeled by a Pareto distribution with shape parameter 1.259, according to the Facebook workload information described in [27], and with an average task duration of 2 seconds. In our simulation experiments, we investigate the impact of inaccurate estimation of task durations in Section 3.5.4.

Task Assignment: To evaluate the impact of imbalance due to task assignment, we use a Zipf distribution to model the skewness of task assignment among the datacenters. The higher the Zipf skew parameter is, the more skewed the task assignment is (i.e., the more it is constrained to fewer datacenters). We also consider two extreme cases where the tasks of each job are: (i) distributed uniformly across all datacenters, or (ii) assigned to only one datacenter. The default setting for the skew parameter is 2, and we investigate how the skew of task assignment affects the performance in Section 3.5.4.

System Utilization: We define the percentage of occupied computing slots as our system utilization. Multiple factors contribute to the system's utilization: job inter-arrival time, job size, task duration, and task assignment.

Other Default Settings: In our experiments the default number of datacenters is 30, with 300 computing slots per datacenter. These default system settings result in 78% system utilization, which allows us to explore how the system performance behaves at reasonably high utilization.
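To illustrate these settings, the following Python sketch (using NumPy) draws Pareto-distributed task durations with shape 1.259 and a 2-second mean, and assigns tasks to 30 datacenters with a Zipf-like skew of 2; the exact generators used in our simulator are not spelled out in the text, so the sampling code below is only an assumption consistent with the stated parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Task durations: Pareto with shape a = 1.259 and mean 2 s.  For a Pareto
# distribution with scale x_m and shape a > 1, mean = a * x_m / (a - 1),
# hence x_m = mean * (a - 1) / a.
a, mean_duration = 1.259, 2.0
x_m = mean_duration * (a - 1) / a
durations = (rng.pareto(a, size=1000) + 1.0) * x_m       # seconds, heavy-tailed

# Task assignment: datacenter of rank r chosen with probability proportional to
# 1 / r**s with s = 2; a larger s concentrates tasks on fewer datacenters.
num_dcs, s = 30, 2.0
weights = 1.0 / np.arange(1, num_dcs + 1) ** s
assignment = rng.choice(num_dcs, size=1000, p=weights / weights.sum())

print(durations.mean())                                   # fluctuates around 2.0
print(np.bincount(assignment, minlength=num_dcs)[:5])     # heavily skewed to DC 0
```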
3.5.2 Scheduling Performance Results

Figures 3.6, 3.7 and 3.8 depict the average job completion time (normalized by that of FCFS) using the Facebook trace, Google trace and Exponential trace, respectively. We vary the average job inter-arrival times and observe how the performance characteristics react to different system utilizations.

Figure 3.6: Performance with Facebook Trace (normalized average job completion time vs. system utilization).

Performance Improvements of Reordering: Our experiment results first confirm that Reordering does result in a reduction of average completion time for SRPT-based heuristics, as stated by Theorem 1. The performance improvements for SRPT-based heuristics due to Reordering reach as high as 27% under highly utilized settings, and up to 17% under lower utilization. Finally, the results also show that Reordering is more beneficial to Independent-SRPT than to Global-SRPT. This is intuitive, as Independent-SRPT does not coordinate between the sub-jobs of a job and thus results in a higher imbalance between the sub-jobs; this creates more opportunities for Reordering to improve performance.

Figure 3.7: Performance with Google Trace (normalized average job completion time vs. system utilization).

Figure 3.8: Performance with Exponential Trace (normalized average job completion time vs. system utilization).

Without Reordering, Global-SRPT performs better than Independent-SRPT in the Facebook trace, while the Google trace and the Exponential trace display the opposite trend. Under higher utilization, Global-SRPT outperforms Independent-SRPT by 27% in the Facebook trace, while in the Exponential trace, Independent-SRPT outperforms Global-SRPT by 32%. This is a result of the fact that the variance of job sizes in the Facebook trace is significantly higher than that of the Google trace and the Exponential trace, so Global-SRPT benefits more from favoring small jobs by considering the total job size across all datacenters, while Independent-SRPT performs worse by considering only the individual sub-job sizes. In the Google trace, however, the gap between Global-SRPT and Independent-SRPT is not obvious. Most of the jobs in the Google trace are small, and so is the variance in job sizes. With such characteristics, the skews among the sub-job sizes tend to be smaller compared to the other two job traces, and, therefore, Global-SRPT and Independent-SRPT make similar job scheduling decisions.

With Reordering, Independent-SRPT performs better than Global-SRPT in all traces, because Independent-SRPT benefits significantly more from Reordering than Global-SRPT does, as mentioned above. The gap between them becomes significant (10% or more) starting at lower utilization (39%) in the Exponential trace, and reaches 40% under higher utilization. In the Facebook trace, however, the gap is only significant under higher utilization (68% and 78%). This is because Global-SRPT, unlike Independent-SRPT without Reordering, performs reasonably well in the Facebook trace; thus, Global-SRPT with Reordering also performs well as compared to its performance in the Exponential trace. These results also show that the performance of Reordering depends on the original scheduling algorithm.

Performance Improvements of SWAG: Compared to SRPT-based heuristics, SWAG's performance improvements under higher utilization are up to 50%, 29% and 35% in the Facebook, Google and Exponential traces, respectively, with at least 12% improvement under lower utilization. The differences in performance improvements are attributable to the fact that job traces with higher variance in job sizes tend to have more large jobs, which potentially results in more severe skews among the sub-jobs.
Thus, a high-variance job trace like the Facebook trace presents more opportunities for SWAG to achieve higher improvement by selecting jobs that can finish quickly, in accordance with its design principles. In addition, SWAG outperforms SRPT-based heuristics with Reordering by up to 10%, under various utilizations and in all job traces. Finally, SWAG achieves near-optimal performance throughout our experiments: the performance gap between SWAG and Optimal is within only 2%.

Fairness among Job Types: Figures 3.9, 3.10 and 3.11 present the slowdown results for the Facebook, Google and Exponential traces, respectively. We further present the slowdown for different job types by classifying the jobs based on their sizes (number of tasks): small jobs (1-150 tasks), medium jobs (151-500 tasks) and large jobs (501 or more tasks). The slowdown for FCFS is omitted as it is significantly larger than the rest (more than 15 in all cases). Also, Global-SRPT and Independent-SRPT have similar results; thus, we only include the results for one of them.

Figure 3.9: Fairness with Facebook Trace (slowdown for overall, small, medium and large jobs).

We note that all scheduling approaches show the same trend, i.e., small jobs have the smallest slowdown while large jobs have the largest slowdown. As expected, this is due to the fact that all the algorithms essentially favor smaller jobs in order to reduce the average job completion time. In addition, the major differences in slowdown between the scheduling solutions appear for large jobs.

Figure 3.10: Fairness with Google Trace (slowdown for overall, small, medium and large jobs).

Figure 3.11: Fairness with Exponential Trace (slowdown for overall, small, medium and large jobs).

In the Facebook and Exponential traces, the slowdown of large jobs under Independent-SRPT is 40% more than its overall slowdown, while the gap is no more than 30% for Independent-SRPT with Reordering and no more than 25% for SWAG. The Google trace displays a significant slowdown gap between large jobs and the overall average. This is because most of the jobs in the Google trace are small jobs; therefore the few large jobs are often queued for a long time while the system serves many small jobs, as determined by the scheduling solutions. However, Independent-SRPT with Reordering and SWAG still maintain relatively low slowdown compared to Independent-SRPT. Hence, we conclude that Reordering and SWAG improve performance without significantly sacrificing the performance of large jobs.

Figure 3.12: Scheduling Algorithm Running Time (running time in msec vs. system utilization).

We also observe that Reordering improves the original scheduling approach mainly by improving the performance of large jobs. This is because small jobs get served earlier than the others even after Reordering is performed, while Reordering provides the opportunity for some large jobs to be served earlier by delaying some other sub-jobs. We use the Exponential trace for the following overhead and sensitivity evaluation as it displays moderate characteristics compared to the other two.

3.5.3 Overhead Evaluation

We evaluate our system overhead on the following aspects.

Computational Overhead.
We obtain this by measuring the execution time of the scheduling algorithms at each scheduling decision point.

Communication Overhead. This is defined as the additional messages required by the global scheduler, which need to be transferred from each local datacenter to the central controller. Note that this does not include the fundamental and necessary information needed by the system, e.g., the metadata of the jobs and the tasks, or the task program binaries. Instead, it includes information such as the set of current job IDs as well as the remaining number of tasks associated with each datacenter.

Figure 3.13: Communication Overhead (amount of data in KBytes vs. system utilization).

Figure 3.12 depicts the scheduling running time under various system utilizations. The results for FCFS, Global-SRPT and Independent-SRPT are omitted as they are negligible compared to the rest. These results suggest that even under higher utilization (78%), the scheduling running time of SWAG (4.5 ms) is relatively small compared to the average task duration (2 s). In addition to the scheduling running time, our prototype confirms that the control message passing between the global scheduler and the local schedulers required by Reordering and SWAG takes no more than a few hundred milliseconds. As a result, the delays due to scheduling running time and message passing do not significantly degrade the completion time of the jobs. Note that although SWAG has a higher (worst-case) computational complexity than Reordering ($O(n^2 m)$ for SWAG and $O(nm)$ for Reordering, where $n$ is the number of current jobs and $m$ is the number of datacenters), the actual difference in computational overhead between SWAG and SRPT-based heuristics with Reordering is not significant, because SWAG is able to keep the number of current jobs (i.e., $n$) in the system small by scheduling jobs that can finish quickly.

Figure 3.13 depicts the communication overhead incurred by each scheduling algorithm. Note that FCFS and Independent-SRPT do not require any additional information from local schedulers, so their overhead is zero. The communication overhead essentially depends on the number of current jobs in the system. Since SWAG succeeds in keeping the number of current jobs small, it achieves the smallest communication overhead. The overhead analysis confirms that the performance gains from the proposed Reordering and SWAG techniques come with acceptable computation and communication overhead.

Figure 3.14: Various Task Assignment Scenarios (normalized average job completion time vs. the skew parameter of the Zipf distribution for task assignment).

3.5.4 Performance Sensitivity Analysis

Impact of Task Assignment

In this experiment we study the sensitivity of the scheduling algorithms to the skew in task assignments. In Figure 3.14, the X-axis represents the skewness of task assignment, with the Uniform distribution being the least skewed and One-DC assignment being the most skewed. Between Uniform and One-DC are the results under different Zipf skew parameters. The general trend in Figure 3.14 is that as the skewness increases, the performance of the scheduling algorithms first increases and then decreases.
There is not much room for improvement when all tasks are uniformly distributed across datacenters. The performance improvement becomes more significant as the imbalance in task assignments requires greater coordination of job scheduling across the datacenters to reduce the jobs' completion times. Beyond a certain skewness level, the imbalance of task assignment becomes so substantial that most of the tasks of the same job span only a few datacenters, in which case not much can be done. As expected, when all the tasks of a job are assigned to a single datacenter, the execution of Global-SRPT and Independent-SRPT is essentially the same, as both are equivalent to performing SRPT at the local datacenters exclusively. In this case, there is no room for Reordering to improve the SRPT-based approaches either.

Among the scheduling algorithms, SWAG and Independent-SRPT are more sensitive to changes in the skewness of task assignment than Global-SRPT. This is because their scheduling decisions are subject to how the sub-jobs of the other jobs are ordered at each datacenter, which is directly impacted by the extent of the skews among the sub-jobs of the same job. On the other hand, Global-SRPT considers only the global view of the job sizes across all the datacenters, and is therefore less sensitive to how the skewness varies.

Impact of the Number of Datacenters

In this experiment we investigate how the number of datacenters affects performance, by varying the number of datacenters while keeping the total number of computing slots constant. In Figure 3.15, the performance improvements by Reordering and SWAG generally increase as the number of datacenters increases, because more datacenters provide greater opportunities for coordination of sub-jobs across the datacenters.

Figure 3.15: Various Number of Datacenters (normalized average job completion time vs. number of datacenters).

Figure 3.16: Various Estimation Accuracy (normalized average job completion time vs. estimation accuracy for task duration).

Accuracy of Task Duration Estimation

In this experiment we study how the error in task duration estimation affects the scheduling algorithms' results. Estimation error occurs because task execution is subject to unpredictable factors like I/O congestion and interference, as discussed in Section 3.4, and it has an impact on how local schedulers schedule the tasks, because the scheduling decisions are based on estimates of task durations. We introduce estimation error in our experiments based on a uniform distribution with the original task duration as the average. For example, to investigate 75% estimation accuracy, we set the estimated task duration to be drawn uniformly from the range $[0.75, 1.25] \times$ the actual task duration, so that the estimation error is at most 25% of the actual task duration. Figure 3.16 shows that the performance improves only marginally as the estimation accuracy for task duration increases. This is because there is often high variance in task durations due to stragglers [23, 25, 27], and the estimation error is not significant enough to affect the order of task scheduling much. Therefore, our Reordering and SWAG algorithms are robust to estimation errors.
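A minimal sketch of how the estimation error described above can be injected (e.g., 75% accuracy draws the estimate uniformly from [0.75, 1.25] × the actual duration); the function name is illustrative.

import random

def estimated_duration(actual, accuracy):
    """Draw an estimate uniformly from [accuracy, 2 - accuracy] * actual,
    e.g., accuracy = 0.75 gives the [0.75, 1.25] range used in Figure 3.16."""
    return actual * random.uniform(accuracy, 2.0 - accuracy)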
3.6 Conclusions

In the big data era, as data volumes keep increasing at dramatic rates, running jobs across geo-distributed datacenters is emerging as a promising trend. In this setting, we propose two solutions for job scheduling across datacenters: Reordering, which improves existing scheduling algorithms by efficiently adjusting their job order with low computational overhead; and SWAG, a workload-aware greedy scheduling algorithm that further improves the average job completion time and achieves near-optimal performance. Our simulations with realistic job traces and extensive scenarios show that the average job completion time improvements from Reordering and SWAG are up to 27% and 50%, respectively, as compared to SRPT-based extensions, while incurring reasonable computational and communication overhead.

Chapter 4

Multi-resource Scheduling across Heterogeneous Geo-distributed Clusters

4.1 Introduction

Large online service providers like Microsoft, Google, Amazon and Facebook are deploying tens of datacenters and many hundreds of smaller "edge" clusters globally to provide their users with low-latency access to their services [7, 13, 87]. These geo-distributed sites continuously generate data both about the services deployed on them (like end-user session logs) and about server health (like performance monitor logs). Collectively analyzing this geo-distributed data is crucial for many operational and management tasks. Examples of such tasks include analyzing server logs to maintain system health dashboards, analyzing user logs to make advertisement choices, and picking relay servers for online services using network performance logs [90, 102].

As the results produced by these analytics queries are used for making critical decisions by data analysts and real-time applications, minimizing their response times is a key requirement. Recent efforts in geo-distributed analytics have demonstrated that centrally aggregating all the data at a single site and then analyzing it can seriously limit the timeliness of the analytics for user applications. In addition, this leads to wasteful use of WAN bandwidth [125, 154]. It has emerged that executing the queries in a geo-distributed manner, leaving the data in place at the sites, can lead to faster query completion [81, 125, 154, 155].

An important characteristic of geo-distributed clusters is heterogeneity, in compute as well as network resources. The bandwidth capacities of the sites may vary by an order of magnitude [39, 125]. Compute capacities are also highly diverse. Conversations with one of the largest online service providers (OSP) [19] reveal that the heterogeneity in compute capacity (cores and memory) among its datacenters varies by two orders of magnitude. Further, the availability of network and compute resources also varies depending on their utilization. An additional source of heterogeneity is the non-uniform distribution of a job's input data across sites; e.g., when analyzing user session logs of Bing within the recent two hours, more user data is likely to be present on sites where it is working hours for their nearby users than on sites where it is night time.

Recent works on geo-distributed data analytics address only the network heterogeneity. These solutions minimize network transfers alone by placing mapper tasks at the sites where their input is located and placing reducer tasks to minimize shuffle time [125, 154, 155].
Their design assumes that the sites are indistinguishable in their compute capacities and have effectively infinite capacity to run the tasks. As described above, the former assumption is invalid in real deployments [19]. Further, prior work on data analytics, even within a single DC, has documented "multi-waved" execution of analytics jobs even within a single stage (e.g., the reduce stage) due to constraints on compute capacity (i.e., only a fraction of the job's tasks execute simultaneously per "wave") [26, 27, 117]. Placing reduce tasks based only on the network bandwidth [125, 154, 155] can lead to more tasks being scheduled at a site than the available compute slots, leading to multiple waves of execution even when slots are available at other sites. Not only does this inflate job response time, it also introduces inaccuracy into the model of network bandwidth when the reduce tasks of subsequent waves execute. This is because the network traffic occurring at the time of the later waves is not accounted for. This error in bandwidth cascades and degrades further scheduling decisions.

In this chapter, we take a first step towards allocating multiple resources – compute slots and network capacity – to data analytics jobs with parallel tasks across heterogeneous geo-distributed clusters. The problem of multi-resource allocation is relatively easier in intra-DC analytics due to the near homogeneity between machines (or racks) in compute and network capacities. While at first blush the problem appears to be one of multi-dimensional bin packing of balls (tasks), there are key differences which make theoretical packing heuristics hard to adapt. The resource demands of tasks (balls) in our setup are not static but depend on the site (bin) where the task is placed. For instance, the WAN usage of a map or reduce task depends on the amount of its input data present on its own local site.

Even a recent systems solution for intra-DC packing, Tetris [65], falls short in a geo-distributed setting because it models remote network accesses of tasks (e.g., reading data from a remote site) using only a simple and fixed "penalty". While such a fixed penalty works in a homogeneous setting, this is a crucial omission in a heterogeneous geo-distributed setup.

Further, the heterogeneity among the geo-distributed sites means that we have to jointly make decisions on all the tasks of a job so that all of them face similar resource contention and finish together. The duration of a job with many parallel tasks is dependent on the last task's completion. Note that this scheduling requirement is orthogonal to straggler mitigation via speculation [23, 27, 167] and to addressing data skew [93, 96]. Finally, it is also important to optimize the scheduling across jobs since they compete for the same set of slots and bandwidth. In doing so, we optimize for the same metrics of interest as intra-DC analytics – response time and fairness.
We believe that ours is one of the first works to model multiple waves of tasks in scheduling decisions in big data analytics (including intra-DC schedulers). We use different formulations for the map (input) stages and reduce (intermediate) stages, based on their communication patterns across the sites.

To schedule multiple geo-distributed jobs, we integrate the above LP with the Shortest Remaining Processing Time (SRPT) heuristic. The LP helps us accurately and efficiently identify the jobs with the least remaining time, instead of relying on proxies like the remaining number of tasks. It also re-evaluates task placement based on the resources consumed by the tasks of other concurrent jobs at each scheduling instance. In doing so, it balances the goals of response time and fairness.

Our solution is careful about the usage of WAN bandwidth, i.e., bytes transferred across sites. As WAN bandwidth is a critical resource [154, 155], our solution incorporates a WAN usage budget and stays within the budget in its scheduling decisions. A simple knob trades off between optimizing job response times and WAN usage.

We have built Tetrium, a system for multi-resource allocation in geo-distributed clusters, inside Apache Spark [163]. We evaluate Tetrium using (a) an Amazon EC2 deployment across several geographically distributed regions [7] running the TPC-DS [14] and Big Data [9] benchmarks, as well as (b) large-scale simulation experiments driven by production traces. Tetrium improves the average job response time by up to 78% compared to existing locality-based techniques [163, 164], by up to 55% compared to Iridium [125], and by 33% compared to Tetris [65].

Figure 4.1: Global data analytics across geo-distributed clusters. Analytics jobs are submitted to the global manager, and may require data stored across geo-distributed sites which have various capacities in compute slots and bandwidth.

4.2 Motivation

We first motivate the heterogeneity in geo-distributed clusters for analytics jobs (§4.2.1). Next, we use an example to show the key challenges in scheduling jobs in such heterogeneous settings (§4.2.2).

4.2.1 Geo-distributed data analytics

Architecture

Figure 4.1 shows the architecture of a geo-distributed analytics framework that logically spans multiple sites (datacenters of different sizes). Each site contains multiple compute slots (corresponding to some amount of memory and cores), and diverse uplink and downlink bandwidth. We assume all sites are connected using a congestion-free network in the core, as in prior work [125], which is validated by measurement studies [11]. Data can be generated at any site and the input data required by a job may be located across different sites. A centralized global manager accepts analytics jobs and translates them into a DAG of stages with multiple parallel tasks.

Geo-distributed analytics jobs include analyzing server logs to maintain system health dashboards, analyzing user logs to make advertisement choices, and picking relay servers for online services using network performance logs [90, 102]. Minimizing job response times is critical for these analytics jobs because their output is often used for making critical decisions by data analysts and real-time applications.
Heterogeneity across Clusters

Aggregating all the data to a central location is wasteful since we do not know beforehand which datasets will get accessed; most data (82%, as per our analysis of a large OSP's big data cluster) is never accessed or is accessed only a few times (Figure 1 in [24] reports that 80% of data is accessed at most 2 times over a five-day period in Microsoft Cosmos). Prior work has demonstrated that aggregating data after the query arrives seriously limits the timeliness of the analytics [125, 154]. The better approach to geo-distributed data analytics is leaving the data "in place" and distributing the tasks of a job across the different clusters. The key challenge in this approach is heterogeneity, in both resource and data distribution, across the clusters, as we elaborate next.

(1) Heterogeneous compute capacities: Figure 4.2 shows the compute capacities of clusters of one of the largest online service providers (OSP). We see that compute capacities differ by up to two orders of magnitude across hundreds of sites. This is because clusters are built at different times with different levels of investment and different capacity requirements, and are constrained by site size, energy, cooling, etc. The impending trend of edge computing [5, 17] will only increase the heterogeneity, since edge clusters are envisioned to be just a few racks of servers, as opposed to large datacenters.

Figure 4.2: Heterogeneity in compute resources (CDF of normalized compute capacity across sites).

Figure 4.3: Heterogeneity in network resources (CDF of normalized bandwidth across sites).

The available capacities tend to vary even more [39, 125]: heavy usage of one or a few clusters by non-analytics jobs leads to limited resource availability. Clusters are often not provisioned exclusively for analytics but share resources, e.g., with client-facing services. When the client load on these services increases, the resources available for data analytics shrink, which contributes to the heterogeneity.

(2) Heterogeneous data sizes: The data generated at the sites also varies. For a globally deployed service across many sites, the size of the user session logs at a site depends on the number of sessions served by the site, which naturally has significant variation. Our analysis of Skype logs on call performance generated at over 100 different Azure sites (relays) shows considerable variation: relative to the site with the minimum data generated, the median, 90th percentile and maximum values are 8×, 15× and 22× larger. These logs are constantly analyzed for dashboards and system health.

Table 4.1: Definition of Notations.
  $S_x$ — number of compute slots at site $x$
  $B^{up}_x$, $B^{down}_x$ — uplink/downlink bandwidth at site $x$
  $I^{input}_x$, $I^{shufl}_x$ — input/intermediate data at site $x$
  $M_x$, $R_x$ — number of map-/reduce-tasks placed at site $x$
  $t_{map}$, $t_{red}$ — computation time of a map-/reduce-task
  $T_{aggr}$ — network transfer duration of the map-stage
  $T_{map}$ — computation duration of the map-stage
  $T_{shufl}$ — network transfer duration of the reduce-stage
  $T_{red}$ — computation duration of the reduce-stage

It is difficult to provision the sites with compute capacity proportional to the data generated, for the following two reasons. First, the distribution of data sizes varies over time and is not constant. Second, the distribution of data across sites for a given job might be vastly different from the overall distribution of data sizes.
As a result, a job may have an imbalance between the computing slots it needs and the computing capacity available at a site.

(3) Heterogeneous network bandwidths: Figure 4.3 measures the skew in network bandwidth across different inter-site links of the large OSP. Note the 18× variation in bandwidth normalized to the least value. In fact, recent work has shown similar heterogeneity across Amazon EC2 sites, of over 25× in their bandwidth [153]. Reduce tasks, which involve an all-to-all shuffle across the sites, are directly impacted by the network heterogeneity. Equal spreading of reduce tasks across sites, as recommended by prior approaches [23, 150], will bottleneck sites with lower bandwidth. Therefore, we should also consider network resources for geo-distributed jobs.

Figure 4.4: Bandwidth, compute capacities and input data for our three-site example setup.
  Site-1: 40 compute slots, 5 GB/s uplink, 5 GB/s downlink, 20 GB input data.
  Site-2: 10 compute slots, 1 GB/s uplink, 1 GB/s downlink, 30 GB input data.
  Site-3: 20 compute slots, 2 GB/s uplink, 5 GB/s downlink, 50 GB input data.

4.2.2 Illustrative Examples

We now motivate the need for jointly scheduling compute and network resources through illustrative examples, and show how state-of-the-art solutions fall short. We consider a geo-distributed setup that has a varying number of compute slots ($S_x$) at each site $x$ as well as uplink/downlink bandwidth capacities ($B^{up}_x$ and $B^{down}_x$); the volume of data $I_x$ stored at each site might be unevenly distributed. Table 4.1 summarizes our notation.

Joint Compute- & Network-Aware Task Placement:

Figure 4.4 specifies a 3-site example setup, in which site-1 is the most powerful in terms of the number of compute slots and bandwidth, while it has the least amount of input data compared to the other 2 sites. We consider an analytics job with one map and one reduce stage. Each map task processes 100 MB of input data and takes 2s to finish; after the map stage, the intermediate data is only half the size of the input data. There are 500 tasks in the reduce stage, and each reduce task takes 1s to finish. The state-of-the-art solution for assigning tasks across the sites is Iridium [125], which processes all the map tasks locally and then decides on the placement of reduce tasks that minimizes network resource usage. In the map stage, the computation bottleneck is at site-2: $2s \times \lceil 300/10 \rceil = 60s$. In the reduce stage, Iridium places the reduce tasks so that the shuffle time is minimized. The shuffle bottleneck in the reduce stage is at site-2, whose downloading takes $(10+25)\,\mathrm{GB} \times 0.3 / 1\,\mathrm{GB/s} = 10.5s$, and whose uploading also takes 10.5s. The computation bottleneck in the reduce stage is at site-3: $1s \times \lceil 350/20 \rceil = 18s$. The end-to-end job completion time for this job is therefore 88.5 seconds. Figure 4.5 shows the placement result as determined by Iridium. (Footnote 1: Here we calculate the total duration based on the worst-case estimate that there is no overlap between the network transfer time and the computation time in each stage. In practice, a task starts computation once its data is gathered, without waiting for the network transfers of the other tasks.)

Figure 4.5: Task placement result: Iridium. Map tasks run locally (200, 300, 500 tasks at sites 1–3) and reduce tasks are placed as (0, 150, 350). Per-site map computation times are (10, 60, 50)s, reduce network transfer times are (2, 10.5, 3.75)s, and reduce computation times are (0, 15, 18)s, for a combined duration of 0 + 60 + 10.5 + 18 = 88.5s.
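As a back-of-the-envelope check of the numbers above, the following sketch recomputes the stage durations of the Iridium-style placement from the Figure 4.4 setup (worst-case, non-overlapped stages as in Footnote 1); it is purely illustrative and not part of the system.

import math

S       = [40, 10, 20]          # slots per site
B_up    = [5.0, 1.0, 2.0]       # GB/s
B_down  = [5.0, 1.0, 5.0]       # GB/s
I_shufl = [10.0, 15.0, 25.0]    # intermediate data per site (half of the input)

def stage_time(tasks, t_task):
    # Per-site compute time in waves, bottlenecked by the slowest site.
    return max(t_task * math.ceil(n / s) for n, s in zip(tasks, S))

def shuffle_time(r):
    # r[x]: fraction of reduce tasks at site x; all-to-all shuffle.
    total = sum(I_shufl)
    up   = max(I_shufl[x] * (1 - r[x]) / B_up[x] for x in range(3))
    down = max((total - I_shufl[x]) * r[x] / B_down[x] for x in range(3))
    return max(up, down)

# Iridium-style placement: map tasks local, reduce tasks (0, 150, 350) of 500.
t_map  = stage_time([200, 300, 500], 2.0)     # -> 60 s   (site-2 bottleneck)
t_shfl = shuffle_time([0.0, 0.3, 0.7])        # -> 10.5 s (site-2 bottleneck)
t_red  = stage_time([0, 150, 350], 1.0)       # -> 18 s   (site-3 bottleneck)
print(0 + t_map + t_shfl + t_red)             # -> 88.5 s end-to-end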
A better placement in this example, considering both network and computation capacity, would transfer some input data from site-2 (15.7 GB) and site-3 (21.4 GB) to site-1, as site-1 has the most powerful computation capacity. Despite the modest cost (15.7s) of transferring input data to site-1, the better assignment balances the computation workloads among the sites so that the computation duration in the map stage drops from 60s to 30s. In the reduce stage, it again reduces the computation duration from 18s to 8s by considering both network and computation capacity. Note that although Iridium optimizes shuffle time, the better assignment in this example yields a better intermediate data distribution, which leads to a faster shuffle time (6.13s) compared to that of Iridium (10.5s). The end-to-end job completion time for this placement is 59.83 seconds, which is only 68% of Iridium's completion time. Figure 4.6 specifies the placement result as determined by the better approach.

Figure 4.6: Task placement result: better approach. Input data (15.7 GB from site-2 and 21.4 GB from site-3) is moved to site-1, yielding intermediate data of (28.55, 7.15, 14.3) GB; per-site map network transfer and computation times are (7.42, 15.7, 10.7)s and (30, 30, 30)s, and reduce transfer and computation times are (2.45, 6.13, 5.11)s and (8, 8, 8)s, for a combined duration of 15.7 + 30 + 6.13 + 8 = 59.83s.

There are two key insights from comparing the two solutions. First, in the map phase, it is sometimes beneficial to shift some workload away from the compute-bottlenecked site (i.e., site-2) to other sites in exchange for a small increase in network transfer. Note that the opposite solution of aggregating all input data at the most powerful site, referred to as the Central approach, is far from optimal as well. The Central approach would aggregate all input data at site-1, and then run all the computation there without having to transfer data across the sites again. The end-to-end job completion time using the Central approach is 93 seconds, which is 1.55× that of the better assignment mentioned above. Second, in the reduce stage, although Iridium aims at minimizing the network transfer time by avoiding data transfer in the map stage and then optimizing the shuffle time for the reduce stage, it does not consider the computation capacity at each site (especially site-3). This example highlights the importance of coordinating the heterogeneous capacities across the sites, as well as jointly considering network and compute resources, to optimize job completion time.

Joint Job Scheduling & Task Placement:

Task placement under multiple resource constraints becomes more intricate when multiple simultaneous jobs compete for resources. Placing the tasks of the jobs according to each job's optimal placement may no longer be the best option. We explain this using an example with 3 sites, each with 3 compute slots and 10 GB/s upload/download bandwidth. There are two jobs (job-1 and job-2) in this example with 3 and 12 tasks, respectively. For simplicity, both jobs contain only a map-stage, and each task runs for 1s. Job-1's input data requires 0, 1 and 2 tasks across the 3 sites, while job-2 requires 2, 4, and 6 tasks.
The optimal placement for each job in isolation would run all the tasks locally without any data transfer, i.e., $(M_1, M_2, M_3) = (0, 1, 2)$ for job-1 and $(2, 4, 6)$ for job-2, which leads to a response time of 1s for job-1 and 2s for job-2 (since job-2 requires 2 waves of computation). When the two jobs are jointly scheduled, their best placements depend on the order of the jobs. Scheduling job-1's tasks prior to job-2's will not change job-1's task placement, but the best placement for job-2 becomes $(6, 4, 2)$ with response time 2.4s; the average response time for the two jobs is 1.7s. The opposite ordering – job-2's tasks followed by job-1's tasks – leads to the optimal placement for job-2, while the best placement for job-1 becomes $(3, 0, 0)$ with response time 3.3s; this leads to a worse average response time of 2.65s. Two takeaways from this example are: (1) an optimal schedule (in this example) was obtained without either of the jobs achieving its individual optimal task placement, due to the inter-job contention for resources; (2) furthermore, an optimal scheduling order of the jobs is a complex interaction between the available slots, available network bandwidth, and data distribution.

4.3 Compute/Network-Aware Task Placement

We first describe our compute- and network-aware task placement, then integrate it with job scheduling in §4.4. Task placement decisions essentially determine the following: (1) at which site a task should be placed, and (2) from which site a task should read data. Moreover, the tasks in a stage often run across multiple waves, as the compute slots are insufficient for launching all tasks at once. Therefore, the decisions should also include the ordering of the tasks. Considering task placement and ordering together is, however, challenging even for a single stage. The formulation is a mixed integer linear program with $O(mn)$ variables, where $m$ is the number of tasks and $n$ is the number of sites, and is inefficient to solve at the time scales (of seconds) required for cluster schedulers. Our approach is to solve task placement first, based on a linear program under heterogeneous resource constraints, for each stage independently – map-stage (§4.3.1) and reduce-stage (§4.3.2). We then address task ordering within each stage in §4.3.3, and address sub-optimalities due to solving each stage independently (§4.3.4).

4.3.1 Map-Task Placement

In placing the map-tasks, we can view our problem as determining what fraction of the job's tasks ($m_{x,y}$) should run at site $y$ with corresponding data residing at site $x$. $\sum_{y\neq x} m_{x,y}$ denotes the fraction of tasks that are not placed at site $x$ but need to read data from site $x$. The amount of data to be transferred out of site $x$ is then $I^{input}(\sum_{y\neq x} m_{x,y})$, where $I^{input} = \sum_x I^{input}_x$ is the total volume of input data. Therefore, the upload transfer time at site $x$ is $I^{input}(\sum_{y\neq x} m_{x,y}) / B^{up}_x$, given site $x$'s upload bandwidth $B^{up}_x$. Similarly, the fraction of map-tasks that are placed at site $x$ but need to read data from other sites $y \neq x$ is $\sum_{y\neq x} m_{y,x}$, so the download transfer time at site $x$ is $I^{input}(\sum_{y\neq x} m_{y,x}) / B^{down}_x$. The number of map-tasks at site $x$ can be denoted by $n_{map}\sum_y m_{y,x}$, where $n_{map}$ is the total number of map-tasks. Given that site $x$ has $S_x$ slots, it takes $n_{map}\sum_y m_{y,x} / S_x$ waves to finish all the map-tasks at site $x$. Hence, the computation time at site $x$ is $t_{map}(n_{map}\sum_y m_{y,x} / S_x)$, assuming for ease of presentation that each map-task's duration is $t_{map}$; we deal with variance in task durations in §4.5.
Based on the principled guidelines provided above, we can then formulate the map-task placement problem as the following Linear Program (LP), minimizing the job's remaining processing time in the map-stage.

LP: map-task placement
$\min\; T_{aggr} + T_{map}$  (4.1)
s.t. $T_{aggr} \geq I^{input}(\sum_{y\neq x} m_{x,y}) / B^{up}_x, \;\forall x$  (4.2)
$T_{aggr} \geq I^{input}(\sum_{y\neq x} m_{y,x}) / B^{down}_x, \;\forall x$  (4.3)
$T_{map} \geq t_{map}\,(n_{map}\sum_y m_{y,x} / S_x), \;\forall x$  (4.4)
$m_{x,y} \geq 0;\;\; \sum_y m_{x,y} = I^{input}_x / I^{input};\;\; \sum_x\sum_y m_{x,y} = 1, \;\forall x, y$  (4.5)

Here, our goal (Eq. 4.1) is to minimize the map-stage's total processing time, which consists of both the time it takes to move input data across sites ($T_{aggr}$) and the computation time of all map tasks ($T_{map}$). (Footnote 2: In data analytics frameworks, the aggregation and map stages do not overlap because it is hard to track when the map data is ready.) The constraints in Eqs. 4.2 and 4.3 reflect the aggregation time, i.e., the time to transfer the input data to where the map tasks are placed. Since the network transfer time is dominated by the bottleneck site, $T_{aggr}$ is at least as large as the upload and download duration at each site. Eq. 4.4 reflects that the map-stage's computation time $T_{map}$ is dominated by the maximum computation time across all the sites.

Note that, to obtain an LP formulation (rather than a MILP), we focus on the fraction (rather than the number) of tasks to be run at each site $x$. Of course, the number of tasks at each site $x$ needs to be integral; hence, we round the solution. With a sufficiently large number of tasks per job, this approximation should not significantly affect performance.

Solving the above LP gives us the following: (a) the minimum remaining processing time ($T = T_{aggr} + T_{map}$) for the map-stage; (b) the fraction of map-tasks to place at each site $y$ that need input data from $x$ ($m_{x,y}$) – in essence, this provides the list of tasks at each site; and (c) the needed slot allocation ($D = \{d_x = \min(S_x, \sum_y m_{y,x}\, n_{map}), \forall x\}$), which is used in the job scheduling decisions (§4.4). Note that this formulation is deterministic in nature, i.e., it relies on averages, and as such does not reflect the variance in task duration that may occur due to data skew and the availability of network capacity (when reading from remote sites). We discuss how to handle this variance in §4.5.
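A minimal sketch of the map-task placement LP (Eqs. 4.1–4.5), here expressed with scipy's LP solver rather than the Gurobi-based implementation used in our prototype; the variable layout and helper names are illustrative assumptions.

import numpy as np
from scipy.optimize import linprog

def place_map_tasks(I_in, B_up, B_down, S, n_map, t_map):
    """I_in: input data (GB) per site; B_up/B_down: GB/s; S: slots per site."""
    n = len(I_in)                        # number of sites
    I_total = float(sum(I_in))
    nvar = n * n + 2                     # m_{x,y} flattened row-major, then T_aggr, T_map
    iT_aggr, iT_map = n * n, n * n + 1
    c = np.zeros(nvar); c[iT_aggr] = 1.0; c[iT_map] = 1.0    # minimize T_aggr + T_map

    A_ub, b_ub = [], []
    for x in range(n):
        up = np.zeros(nvar)              # Eq. 4.2: upload time at site x <= T_aggr
        down = np.zeros(nvar)            # Eq. 4.3: download time at site x <= T_aggr
        comp = np.zeros(nvar)            # Eq. 4.4: compute time at site x <= T_map
        for y in range(n):
            if y != x:
                up[x * n + y] = I_total / B_up[x]        # data leaving site x
                down[y * n + x] = I_total / B_down[x]    # data arriving at site x
            comp[y * n + x] = t_map * n_map / S[x]       # tasks placed at site x
        up[iT_aggr] = down[iT_aggr] = -1.0
        comp[iT_map] = -1.0
        A_ub += [up, down, comp]; b_ub += [0.0, 0.0, 0.0]

    A_eq, b_eq = [], []
    for x in range(n):                   # Eq. 4.5: all of site x's input is read by some site
        row = np.zeros(nvar)
        row[x * n: x * n + n] = 1.0
        A_eq.append(row); b_eq.append(I_in[x] / I_total)

    res = linprog(c, A_ub=np.array(A_ub), b_ub=b_ub,
                  A_eq=np.array(A_eq), b_eq=b_eq, bounds=[(0, None)] * nvar)
    m = res.x[:n * n].reshape(n, n)      # m[x][y]: fraction of tasks at y reading from x
    return res.x[iT_aggr] + res.x[iT_map], m

# Example: the three-site setup of Figure 4.4 (1000 map tasks of 2s each).
T, m = place_map_tasks(I_in=[20, 30, 50], B_up=[5, 1, 2], B_down=[5, 1, 5],
                       S=[40, 10, 20], n_map=1000, t_map=2.0)
print(T, np.round(m, 3))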
4.3.2 Reduce-Task Placement

Different from the map-stage, each reduce-task reads data from the output of all map-tasks. Hence, in placing reduce-tasks, we only need to identify the fraction of reduce-tasks $r_x$ to be placed at each site $x$. Since site $x$ has an $r_x$ fraction of the reduce-tasks, it needs to process an $r_x$ fraction of the total intermediate data in the reduce-stage: the volume of data to be transferred out of site $x$ is $I^{shufl}_x(1-r_x)$, and the volume of data to be transferred to site $x$ is $(\sum_{y\neq x} I^{shufl}_y)\, r_x$. Therefore the upload and download durations at site $x$ are $I^{shufl}_x(1-r_x)/B^{up}_x$ and $(\sum_{y\neq x} I^{shufl}_y)\, r_x / B^{down}_x$, respectively. The number of reduce-tasks at site $x$ is $n_{red}\, r_x$, where $n_{red}$ is the number of reduce-tasks, and it takes $n_{red}\, r_x / S_x$ waves to finish all the reduce-tasks. Therefore, the computation time at site $x$ is $t_{red}(n_{red}\, r_x / S_x)$; similarly to §4.3.1, we assume a constant task duration $t_{red}$.

LP: reduce-task placement
$\min\; T_{shufl} + T_{red}$  (4.6)
s.t. $T_{shufl} \geq I^{shufl}_x (1-r_x) / B^{up}_x, \;\forall x$  (4.7)
$T_{shufl} \geq (\sum_{y\neq x} I^{shufl}_y)\, r_x / B^{down}_x, \;\forall x$  (4.8)
$T_{red} \geq t_{red}\,(n_{red}\, r_x / S_x), \;\forall x$  (4.9)
$r_x \geq 0;\;\; \sum_x r_x = 1, \;\forall x$  (4.10)

Our goal (Eq. 4.6) is to minimize the job's remaining processing time in the reduce-stage, i.e., the sum of the network shuffle time ($T_{shufl}$) and the reduce computation time ($T_{red}$). Eq. 4.7 and Eq. 4.8 bound the shuffle time $T_{shufl}$ by the network transfer duration at each site: $T_{shufl}$ is dominated by the maximum of the upload and download durations across all the sites. Eq. 4.9 reflects that the computation time of the reduce-stage is dominated by the maximum computation time across all the sites. Note that our formulation for the reduce stage is similar to the model proposed in [125]. The key difference is that we extend the model to jointly minimize the time spent in network transfer and in computation.

The outcome of solving this LP gives us the following: (a) the optimized remaining processing time ($T = T_{shufl} + T_{red}$) for the reduce-stage; (b) the fraction of reduce-tasks to place at each site $x$ ($r_x$); and (c) the needed slot allocation ($D = \{d_x = \min(S_x, r_x\, n_{red}), \forall x\}$). As in the case of map tasks, the LP formulation produces a fraction of reduce tasks to place at each site, which (using similar rationale) is rounded to obtain an integral number of reduce tasks at each site $x$.

4.3.3 Task Ordering

When the compute slots are constrained, it may take several waves to finish all of a job's tasks. Therefore, selecting the set of tasks to run at each scheduling instance is critical for completing the associated jobs quickly. With tasks of varying durations, the key principle for minimizing the associated job's response time is to start the tasks with long durations first, to avoid having long-duration tasks delay the job's response time [92]. For map-stages, since tasks that read data from remote sites take significantly longer to complete than tasks that run locally with their data, we start the remote tasks before the local tasks. Specifically, we first select the tasks that take the longest time to fetch their input data, i.e., the tasks that read data from the site with the most constrained upload bandwidth; since each map-task generally processes an input partition of the same size, the only factor that differentiates input fetching times is the bandwidth between the two sites. We further reduce network contention by spreading the launches of remote tasks across different sites, as opposed to launching all the remote tasks that read data from the most constrained site at once.

For reduce-stages, we also start with the longest-duration tasks based on their network transfer times. Note that the LP formulation in §4.3.2 assumes each reduce-task processes the same volume of intermediate data; in practice, each reduce task may have a different amount of data because the intermediate data may not be equally partitioned across the keys. Therefore, we order the tasks based on the size of their input data: the larger the task's input data, the earlier it gets launched. We verify in §4.6 that our task ordering design does reduce a job's response time; we further describe how we adapt this design to deal with dynamic slot arrivals in §4.5.

4.3.4 Mismatch between Map and Reduce

The intermediate data distribution depends on the task placement decisions made for the map-stage. Tetrium's stage-by-stage approach falls short of addressing such a dependency, as we place map-tasks by optimizing the map-stage's duration ($T_{aggr}+T_{map}$) without considering the consequences for the reduce-tasks. This could (potentially) result in a longer reduce-stage ($T_{shufl}+T_{red}$).
A better approach could be to decide the placement of the tasks of the map and reduce stages jointly. Without specifying the details, such an alternative leads to a job duration of 44.875s in our previous example in §4.2, as opposed to 50.88s with Tetrium. Scheduling a job's end-to-end tasks based on its full DAG description has been an open (and hard) problem in the literature, with previous efforts often resorting to a stage-by-stage approach [125, 130]. Here we investigate to what extent Tetrium's particular stage-by-stage approach is handicapped by not considering a full DAG-based formulation. We use an unrealistic alternative that assumes full information about all tasks (map and reduce) upfront, and design a heuristic that produces a more favorable (to the reduce-stage) intermediate data distribution. Instead of Tetrium's approach that starts with map-stage placement (termed forward here), the alternative (termed reverse) starts with the reduce-stage as follows: (i) assign reduce-tasks to each site in proportion to the slot distribution ($r_x = S_x / \sum_x S_x$); (ii) using this placement, solve the reduce-task placement LP (from §4.3.2), but with the intermediate data fraction at each site now being our decision variables, giving us a desired intermediate data distribution ($I^{shufl}_x$) at each site $x$; (iii) solve the map-task placement LP (from §4.3.1) but with an additional constraint on the intermediate data distribution, namely: $\sum_x m_{x,y} = I^{shufl}_y / \sum_y I^{shufl}_y$. In our evaluations (§4.6) we compute both the forward and reverse solutions and choose the better of the two. The results show that we can obtain occasional benefits from this approach, but the overall improvements are marginal compared to Tetrium. Given the small loss in performance and the fact that the forward approach is easier to implement – it does not require upfront information about all stages and potentially incurs less overhead – our prototype implementation and simulations focus on the originally proposed solutions (§4.3.1 and §4.3.2).

4.4 Job Scheduling

Resource allocation among jobs is critical for reducing response time, as illustrated in §4.2; however, the problem of scheduling jobs with parallel tasks to minimize response time is NP-hard [55, 147]. We develop an efficient heuristic for reducing the average job response time (§4.4.1). We also provide flexibility to incorporate other important metrics (WAN usage in §4.4.3, fairness in §4.4.4) using simple knobs.

4.4.1 Minimizing Average Job Response Time

When it comes to reducing average job response time, it is intuitive to apply the Shortest Remaining Processing Time (SRPT) philosophy due to its well-studied behavior. A key component of SRPT is the estimation of the remaining time of each job, and previous works often resort to the number of remaining tasks as an approximation [130]. In geo-distributed clusters, however, the response time of a job is determined not only by its remaining tasks, but also by how the tasks are distributed across the sites based on resource availability, as highlighted by our illustrative examples (§4.2). We propose the following heuristic for estimating the remaining time of a geo-distributed job.

Remaining duration for jobs with stage dependency. Conventional jobs can be modeled as a DAG of tasks, in which there is stage dependency.
Note that previous works [125, 130] often handle the DAG by treating each stage separately; yet this has the undesirable property of mistakenly allocating slots to quickly finish a stage with many subsequent stages, while there may be stages from other jobs that are close to their overall completion. Ideally, we should schedule the jobs based on their remaining time across all the stages. However, estimating such information across all the stages is inefficient to compute, because we would need to sequentially estimate each stage's processing time (by invoking the optimizer from §4.3) based on the output of the parent stages and sum them up. Our simple heuristic first uses the job's remaining number of stages ($G_j$) as a proxy for job $j$'s remaining workload, then uses the current stage's remaining processing time ($T_j$) to break ties if multiple jobs have the same DAG progress. Once we identify the job $k$ with the shortest remaining time based on $G_j$ and $T_j$, we allocate slots $D^k = \{d^k_x\}$ to job $k$ based on the task placement described in §4.3 and schedule it to run. The algorithm continues the above steps until there are no remaining slots to be allocated, or all the jobs have been scheduled, whichever comes first.
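A rough sketch of the resulting scheduling loop is shown below; solve_placement() stands in for the per-stage LP of §4.3 (returning the stage's estimated remaining time and per-site slot demand) and is an assumed interface, not the actual Tetrium code.

def schedule_jobs(jobs, free_slots, solve_placement):
    """jobs: list of dicts with 'remaining_stages' plus current-stage info;
    free_slots: dict site -> currently available slots."""
    scheduled = []
    pending = list(jobs)
    while pending and any(v > 0 for v in free_slots.values()):
        # Pick the job with the fewest remaining stages; break ties with the
        # current stage's estimated remaining processing time from the LP.
        for job in pending:
            job['T'], job['demand'] = solve_placement(job, free_slots)
        pending.sort(key=lambda j: (j['remaining_stages'], j['T']))
        job = pending.pop(0)
        # Allocate the slots the placement asked for, capped by availability.
        alloc = {site: min(d, free_slots.get(site, 0))
                 for site, d in job['demand'].items()}
        for site, d in alloc.items():
            free_slots[site] -= d
        scheduled.append((job, alloc))
    return scheduled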
4.4.2 Dealing with Resource Dynamics

The resource capacity at a site may suddenly drop for various reasons: the compute slots at a site may be allocated to other non-analytics jobs with higher priority, and the available bandwidth between sites could degrade due to temporary link failures in the WAN. The reduced resources at a site result in longer finish times for its assigned tasks, which can prolong the job's response time. Therefore, the global manager should adjust the workload assignment, e.g., by offloading workload from the site with the resource drop, and update all the site managers. However, updating the assignment at all of the sites incurs significant communication overhead. It is hence desirable to update only a subset of the sites, while still optimizing for job response time.

We extend our solution to address such resource dynamics. Let $f_i$ denote the number of tasks of a job assigned to site $i$ according to the last scheduling decision. When the global manager is notified by a site manager of a significant resource drop at a site, it triggers the scheduling computation based on the new resource conditions and obtains a new assignment $f^*_i$ for all sites. Assume the global manager is set to update only $k$ sites; when $k$ equals the total number of sites, it obtains the optimal assignment $f^*_i$ at every site. Say that after re-assignment each site has $f'_i$ tasks; we calculate a distance metric $Q$ between this assignment and the ideal assignment: $Q = \sqrt{\sum_{\forall i}(f'_i - f^*_i)^2}$. The objective is to adjust the assignment at only $k$ sites and find the new assignment with the minimum $Q$ value. We design a heuristic towards this goal. We first focus on the sites that want to offload some of their tasks to other sites, i.e., $f^*_z - f_z < 0$ at site $z$, and sort the sites based on $|f^*_z - f_z|$ in descending order. We start moving tasks out of the first site to the other sites until $f'_z = f^*_z$. We exhaustively search through all possible assignments and update based on the one with the minimum $Q$ value. In §4.6.3 we evaluate the performance of this approach under different $k$ values.

4.4.3 Considering WAN Usage

WAN usage across the sites is a critical operational cost in global data analytics [155], and is often charged based on the volume of data transferred over the WAN [7]. Limiting the amount of data sent across the sites, however, may increase a job's response time, because it restricts the job from transferring data to a site with high resource capacity that could process the data faster. Tetrium offers a knob, $\delta$, that balances budgeting WAN usage against reducing job response time. At each scheduling instance, Tetrium calculates a WAN budget $W^j = W^j_{min} + \delta\,(W^j_{max} - W^j_{min})$ for each job, where $W^j_{max}$ and $W^j_{min}$ are the maximum and minimum possible WAN usage for the job, respectively. When $\delta \rightarrow 1$, a job has the maximum WAN budget, and Tetrium is completely geared toward reducing the job's response time. On the other hand, as $\delta \rightarrow 0$, WAN usage is minimized for each job.

We set $W^j_{max}$ to be the sum of the input data of the job's current stage, as the amount of data this job could send across the WAN is no more than its input data. The value of $W^j_{min}$ depends on the stage type: in a map stage, $W^j_{min} = 0$, as a job achieves zero data transfer when it leaves all input data in place, while in a reduce stage, $W^j_{min}$ can be calculated by the following LP model, in which the sum of uploaded data is constrained by $W^j_{min}$:

$\min\; W^j_{min}$  (4.11)
s.t. $W^j_{min} = \sum_x I^{shufl}_x (1 - r_x)$  (4.12)
$r_x \geq 0;\;\; \sum_x r_x = 1, \;\forall x$  (4.13)

(Footnote 3: The sum of downloaded data is equivalent to the sum of uploaded data, so specifying one of them is sufficient.)

Given the WAN budget calculated above, an additional constraint on data upload/download is added to the task placement model described in §4.3. In a map stage, this constraint is written as $\sum_x I^{input}(\sum_{y\neq x} m_{x,y}) \leq W^j$; in a reduce stage, it is $\sum_x I^{shufl}_x(1-r_x) \leq W^j$. The revised models limit the amount of a job's data sent across sites when placing its tasks in each scheduling decision, and the rest follows Tetrium's original design.

4.4.4 Incorporating Fairness

Reducing average job response time based on SRPT may starve large jobs. Here we define fairness as follows: each job is allocated a number of slots in proportion to the number of its remaining tasks, i.e., job $i$ with $f_i$ remaining tasks gets $S\,(f_i / \sum_i f_i)$ slots, where there are $S$ available slots in total. We provide some flexibility between reducing average job response time and achieving fairness: our system achieves $\epsilon$-fairness if each job receives at least $(1-\epsilon)\,S\,(f_i / \sum_i f_i)$ slots at each scheduling instance. The system achieves complete fairness as $\epsilon \rightarrow 0$, while it reverts to the original design (geared towards performance) as $\epsilon \rightarrow 1$. Specifically, we first calculate the minimum number of slots that should be reserved for each job $i$ as $p_i = (1-\epsilon)\,S\,(f_i / \sum_i f_i)$. Next, instead of allocating the selected job $k$ all of its desired slots $D^k$ based on task placement, we limit the number of slots allowed for job $k$ to $q_k = \sum_x S_x - \sum_{i\in J, i\neq k} p_i$. We scale down job $k$'s slot allocation to $d^k_x \cdot q_k / \sum_x d^k_x$ if $q_k < \sum_x d^k_x$. After capping job $k$'s slots based on $\min(q_k, \sum_x d^k_x)$, the rest follows our original design.
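The $\epsilon$-fairness cap above can be sketched as follows; the $(1-\epsilon)$ reservation follows the $\epsilon$-fairness definition given earlier, and all names are illustrative rather than the prototype's actual interfaces.

def cap_allocation(epsilon, remaining_tasks, total_slots, k, demand_k):
    """remaining_tasks: dict job_id -> f_i; demand_k: dict site -> d_x^k,
    the LP's desired per-site slots for the selected job k."""
    f_total = sum(remaining_tasks.values())
    # Minimum slots reserved for each job i: p_i = (1 - eps) * S * f_i / sum(f)
    p = {i: (1.0 - epsilon) * total_slots * f / f_total
         for i, f in remaining_tasks.items()}
    # Slots job k may use after reserving the other jobs' minimum shares.
    q_k = total_slots - sum(v for i, v in p.items() if i != k)
    d_total = sum(demand_k.values())
    if q_k < d_total:                 # scale the allocation down proportionally
        return {site: d * q_k / d_total for site, d in demand_k.items()}
    return demand_k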
4.5 Prototype Implementation

We implement Tetrium on top of Spark [163] by overriding Spark's cluster scheduler to make our scheduling decisions, comprising 950 LoC in Scala. In addition, our Spark implementation calls out to the Gurobi solver [12] to solve the models described in §4.3; the optimization model solving part comprises roughly 300 LoC in Python. We discuss several implementation details as follows.

Batching of Slots: Since we make task placement and scheduling decisions based on the slots available at a scheduling instance, the scheduling quality depends on whether the current set of available slots is sufficiently representative of future slot arrivals. We can make better placement decisions if we delay a bit, so that more slots become available and more placement options are covered, yet we do not want to waste resources by leaving slots idle for too long. In our implementation, we batch the slots according to the average duration of the recently finished tasks, which provides the system with a rich set of slots across sites so that the scheduler does not make biased decisions based on one (or a few) available slots.

Handling Dynamic Slot Arrivals: The slot distribution varies across scheduling instances in practice (even when batching is employed), either due to variance in the last wave's task durations, or due to some slots being allocated to other jobs with higher priority. With such dynamic slot availability, our task ordering solution (§4.3.3), which schedules remote tasks before local tasks, may ultimately force local tasks to run on remote sites when local slots become insufficient, which not only prolongs those tasks' durations but also decreases slot utilization. To address dynamic slot availability in practice, Tetrium reserves a small fraction of the current slots for running local tasks, while using the remaining portion of the slots in accordance with the original task ordering method.

Estimation of Available Bandwidth: Given that EC2 bandwidth is relatively stable at the scale of minutes, similar to [125] we measure the available bandwidth at each site every few minutes. We do not apply a specific bandwidth reservation method, and assume that the available bandwidth is fairly shared among all concurrent flows at a site.

Estimation of Task Duration: Our implementation estimates task durations based on the finished tasks in the same stage. Previous work [27] showed that this approach is effective because tasks in the same stage perform the same functions over similar amounts of data.

4.6 Evaluation

We evaluate Tetrium with a geo-distributed EC2 deployment (§4.6.2) and extended trace-driven simulations (§4.6.3).

4.6.1 Settings

EC2 Deployment: We deploy Tetrium on EC2 using geo-distributed instances from 8 regions: Oregon (US), Virginia (US), Sao Paulo, Frankfurt, Ireland, Tokyo, Sydney and Singapore [7]. The bandwidth at these 8 sites ranges from 100 Mbps to 1 Gbps (among the geo-distributed regions), and we vary the number of slots at each site to set up a cluster with heterogeneous compute capacities across the sites: the maximum number of slots is 16 and the minimum is 4. We also mimic a larger deployment of 30 sites within one region, and set the bandwidth by applying the same bandwidth distribution as in the 8-site scenario.

Trace-driven Simulator: We also evaluate Tetrium using trace-driven simulations of large-scale settings. The trace is derived from a production cluster and includes information on job arrivals, the jobs' numbers of tasks and corresponding DAGs, input/output data sizes for each task, the distribution of input data, straggler tasks and fail-overs [23, 25]. We simulate a 50-site setting: the number of slots at each site ranges from 25 to 5000, for a mix of powerful datacenters and small edge clusters, with bandwidth ranging from 100 Mbps to 2 Gbps.
Baselines: We use two baselines: (a) In-Place Approach: the default Spark approach [163] that runs tasks locally along with their input data (site-locality). It applies fair scheduling among the jobs and delay scheduling [164] for launching tasks within a job. (b) Iridium [125]: a recent work that improves Spark through shuffle-optimized reduce-task placement for geo-distributed jobs. Additional baselines are included as appropriate.

Performance Metrics: The primary performance metric in our evaluation is average job response time, and we report the results as the reduction in response time relative to the different baselines. In some cases we also report the reduction in slowdown. Slowdown is defined as a job's response time divided by its service time when running in isolation, and indicates how jobs are prioritized relative to their size.

Figure 4.7: Reduction in Average Response Time.
Figure 4.8: Reduction in Average Slowdown.

4.6.2 Evaluation with EC2 Deployment

Workloads: We run two workloads to evaluate Tetrium in the EC2 deployment. (a) TPC-DS [14]: a set of SQL queries for evaluating decision support solutions, including Big Data systems. The queries in this benchmark are characterized by high CPU and I/O workloads, and typically contain long sequences of dependent stages (6–16). (b) Big Data Benchmark [9]: a mix of scan, join, and aggregation queries over the data provided in [119]; the queries in this benchmark have shorter sequences of stages (2–5).

Performance Gains: Figure 4.7 shows that Tetrium improves average job response time, as compared to In-Place and Iridium, by up to 78% and 55%, respectively, as a result of efficient task placement that jointly considers both network and compute resources. Tetrium's gains are also due to its ability to schedule the jobs with shorter remaining time first, so that they are not substantially delayed by longer jobs.

The performance gains are more significant under the TPC-DS workload than under the Big Data Benchmark, and we believe this is due to the different job characteristics in these benchmarks: TPC-DS jobs have substantially more stages, providing greater opportunity for efficient task-placement decisions that result in larger gains. Gains in the 30-site setting are more significant than in the 8-site setting because Tetrium benefits more from the flexibility in placement options. Yet, Tetrium's gains over Iridium in "TPC-DS, 30-site" are smaller than in the 8-site setting, because Iridium's network-centric approach also gains some benefit from the increased placement flexibility, especially under the TPC-DS workload, which incurs many intermediate data shuffles during job execution.

Figure 4.8 shows that Tetrium achieves up to 45% and 16% reduction in slowdown compared to In-Place and Iridium, respectively. Tetrium has the best slowdown, as it not only makes task placement decisions based on the joint consideration of bandwidth and compute resources, but also prioritizes small jobs (to finish quickly) without their being delayed by large jobs. As expected, In-Place has the worst slowdown, as it essentially allocates slots based on fair sharing among jobs.
Tetrium's gains over Iridium are smaller here, as Iridium already achieves a better slowdown than In-Place by reducing response time through better (network-centric) task placement.

Scheduling Overhead: Tetrium's scheduling decisions complete within 950ms for 50 concurrent jobs and 2s for 100 jobs. The majority of the scheduling time is spent on task placement decisions, which take an average of 100ms per decision for a job. Although the number of jobs in the system might be large, Tetrium effectively selects a few high-priority jobs (by focusing on the jobs with fewer remaining stages) and eliminates the need to solve optimization problems for lower-priority jobs.

4.6.3 Evaluation with Trace-driven Simulations

Our simulations employ an additional baseline (Centralized Approach) that aggregates all input data upfront at a powerful datacenter, to quantify the benefits of running jobs' tasks across the sites.

Figure 4.9: Average Response Time
Figure 4.10: CDF of Response Time

Performance Gains and Design Choices

Reducing Response Time: In Figure 4.9, Tetrium achieves 42% and 50% improvements as compared to In-Place and Centralized, respectively. The gains are higher relative to Centralized because most jobs have less intermediate data than input data, which undermines the benefit of pre-aggregating input data upfront. Next, we tease apart Tetrium's gains by quantifying the contributions of the task placement and job scheduling strategies using: (a) Tetrium with Fair Scheduler (Tetrium+FS), which replaces Tetrium's job scheduling method with fair scheduling; (b) Tetrium with Iridium's task placement (+I-task), which replaces Tetrium's task placement method with Iridium's; (c) Tetrium with Iridium's data placement (+I-data), which applies Iridium's data placement method to the original Tetrium.

Tetrium+FS also provides significant gains (of 26% and 35% as compared to the In-Place and Centralized approaches, respectively), which verifies that Tetrium's task placement solution is effective even without the aid of the job scheduling solution. Also note that moving data in advance (Tetrium+I-data) does not help Tetrium, as it is difficult to predict resource availability in future scheduling instances. Although not included in the figure, Tetrium also improves over Tetris [65] by 33% (on average) and 47% (at the 90th percentile). Figure 4.10 plots the CDF of the reduction in jobs' response times against In-Place and Centralized. Tetrium does not slow down any jobs as compared to either baseline despite its greedy heuristic.

Figure 4.11: Gains in Response Time Under Different Combinations of Task Ordering Strategies (Baseline: In-Place)

Task Ordering Strategy: We evaluate different task ordering strategies. For the map stage: (a) Remote-First (Spread): launching remote tasks first while spreading them across different sites to reduce network contention, as proposed in Section 4.3.3, and (b) Local-First: launching local tasks first, i.e., those that read data from the site corresponding to the available slot. For the reduce stage: (a) Longest-First: first launching the reduce task with the longest network transfer time, as proposed in Section 4.3.3, and (b) Random: arbitrarily selecting a reduce task to run.
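One possible rendering of these two ordering rules in Python is sketched below. It is not Tetrium's implementation: interpreting "spreading" as a round-robin over the remote input sites, as well as the helper names and the (task, site) input format, are our assumptions for illustration.

```python
from collections import defaultdict, deque

def order_map_tasks(tasks, slot_site):
    """Remote-First (Spread): run tasks whose input lives on another site
    before local ones, interleaving remote sites so no single link is
    hammered first; `tasks` is a list of (task_id, input_site) pairs."""
    remote, local = defaultdict(deque), []
    for tid, site in tasks:
        (local if site == slot_site else remote[site]).append(tid)
    ordered = []
    while remote:                       # round-robin over remote input sites
        for site in list(remote):
            ordered.append(remote[site].popleft())
            if not remote[site]:
                del remote[site]
    return ordered + local              # local tasks run last

def order_reduce_tasks(tasks):
    """Longest-First: run the reduce task with the largest estimated
    network transfer time first; `tasks` is a list of (task_id, xfer_time)."""
    return [tid for tid, _ in sorted(tasks, key=lambda t: t[1], reverse=True)]
```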
Figure 4.11 presents the average gains for the 4 combinations of these strategies, as compared to the In-Place baseline. The results verify that our proposed task ordering methods form the best combination, with most of the gains attributed to the map-task ordering method.

Resource Drop (%)    k=3    k=5    k=7    k=10    k=20    k=50
10                    22     35     37     38      39      39
20                    18     29     33     34      35      35
30                    16     25     26     26      32      34
40                     9     19     23     25      26      30
50                     6     15     18     20      21      24
Figure 4.12: Gains in Response Time (%) Under Different Resource Dynamics Scenarios (Baseline: In-Place)

Stage-By-Stage Approach: Here we investigate the effectiveness of Tetrium's solution for addressing the mismatch between map- and reduce-task placement; see Section 4.3.4. We quantify this limitation by comparing Tetrium's (forward) stage-by-stage approach against a method that selects the better of forward and reverse, where the latter is guaranteed to be at least as good as Tetrium's approach. Our results show that Tetrium achieves a 42% improvement over the In-Place baseline, while the mixed method achieves 45%. Because (i) the difference is marginal and (ii) forward is more practical to implement in most systems and incurs less overhead (e.g., it does not require upfront information about all stages), we adopt the forward stage-by-stage approach in Tetrium.

Addressing Resource Dynamics

We next evaluate our proposed method for addressing resource dynamics (see Section 4.4.2). In this experiment there are 50 sites, and as the jobs run we degrade both the compute and network resources of 5 randomly selected sites by a certain percentage. Figure 4.12 presents the gains as compared to the In-Place baseline under various resource drops (%) and numbers of sites allowed for updates (k). For example, when the resources drop by 30% and the system is allowed to update 7 sites, the proposed method achieves a 26% reduction in job response time as compared to the In-Place baseline.

Figure 4.13: Balancing Response Time And WAN Usage; baseline: In-Place.
Figure 4.14: Balancing Response Time And WAN Usage; baseline: Centralized.

Gains increase when the system is allowed to update more sites. On the other hand, when the resources drop by a larger percentage, the gains decrease. In a setting of 50 sites, setting k = 5 or 7 provides most of the gains, while setting k beyond 10 does not provide further improvement.

Incorporating WAN Usage Budget

This subsection presents the reduction in response time and WAN usage, compared to the In-Place (Figure 4.13) and Centralized (Figure 4.14) baselines, under various knob ($\rho$) values. As $\rho$ increases, a greater WAN budget is allowed, and hence the reduction in response time increases significantly. Tetrium reduces more WAN usage relative to Centralized than relative to In-Place. This is because the In-Place baseline does reduce some WAN usage by not transferring any data in the map stage (although it still incurs network transfer in intermediate stages, e.g., reduce), while the Centralized baseline aggregates all input data upfront.

Balancing Response Time and Fairness

Figure 4.15 shows the gains under various fairness knob ($\epsilon$) values.
As the knob is turned towards $\epsilon = 1$, the response time gains increase quickly because – in addition to providing each job some minimum number of slots – Tetrium allocates slots in proportion to each job's desired distribution, thereby achieving a good match between available resources and demand. Under complete fairness ($\epsilon = 0$), Tetrium achieves almost the same performance as In-Place, which adopts fair sharing among the jobs.

Figure 4.15: Balance Response Time And Fairness

4.6.4 Distribution of The Performance Gains

We further study Tetrium's gains (as compared to the In-Place baseline) by characterizing the distribution of the gains.

Intermediate-Input Ratio: Figure 4.16 shows the gains based on the intermediate-to-input data ratio of jobs: the higher the ratio, the greater the improvement (up to 50%) Tetrium achieves. At the high end (reduce-heavy), more intermediate data is generated than input data, which incurs more data shuffle across the network; therefore Tetrium benefits more from effective resource allocation. At the low end (map-heavy), Tetrium still improves over the baseline by at least 31%; this is attributed to effective map-task placement, and also highlights how site-locality falls short.

Figure 4.16: Distribution of The Gains under Various Ratios of Intermediate/Input Size

Job Size: Figure 4.17 shows Tetrium's gains under various job sizes (number of tasks). When a job has more tasks, a smart placement solution makes a greater difference, as there are more options to consider. Tetrium achieves 50% gains for large jobs, while still providing significant gains (36%) for small jobs, as its job scheduling design is able to identify and finish small jobs quickly.

Figure 4.17: Histogram of Gains under Various Job Sizes

Data Skew: Tetrium's gains depend on the skew in data distribution – measured by the Coefficient of Variation (CV) – across sites. For CV ≤ 2.0 in Figure 4.18, settings with higher input data skew benefit greatly from Tetrium's better decisions in balancing computation and network resources. However, when the input data is extremely skewed (CV > 2.0), it resides on only a few sites (one site at the extreme), and In-Place benefits from reduced network transfer time in later stages through site-locality. Figure 4.19 depicts the effects of intermediate data skew, where gains are highest (as high as 56%) at the most skewed intermediate data distributions.

Figure 4.18: Histogram of Gains under Various Input Data Skew

Task Estimation Error: Figure 4.20 shows that Tetrium achieves the highest gains with accurate estimation of task durations. Although the gains drop as the estimation error increases, only a small percentage of the tasks have high estimation error in our implementation.
Figure 4.19: Histogram of Gains under Various Intermediate Data Skew
Figure 4.20: Histogram of Gains under Various Task Estimation Errors

Heterogeneity of Resources: We evaluate the impact of skew in the number of slots and in bandwidth by setting the slot numbers based on a Zipf distribution: the higher the exponent value e, the more skewed the resource distribution across sites. Settings with more skewed resources yield greater gains, as such scenarios call for better task placement that balances the workload according to resource availability. The results suggest that compute slot skew has a greater impact than bandwidth skew: from slot skew e = 0 to e = 1.6, the gain increases by 51%, while from bandwidth skew e = 0 to e = 1.6, the gain increases by 37%. This is because the number of slots directly determines the compute duration once the placement is set; bandwidth capacity, on the other hand, has a relatively indirect impact because each reduce task requires a data shuffle across all the sites, which involves many links associated with other sites no matter where the task is placed.

4.7 Conclusions

As cloud providers deploy datacenters around the world, support for fast geo-distributed data analytics is becoming critical. We design and build Tetrium, a system for multi-resource allocation in geo-distributed clusters, that takes a first stab at jointly scheduling the network and computation resources to improve geo-distributed jobs' response times, while incorporating other metrics (e.g., WAN usage, fairness) through simple knobs. In our evaluations with a geo-distributed EC2 deployment and large-scale trace-driven simulations, Tetrium greatly improves the average job response time as compared to common practices and recent approaches.

Chapter 5
Video Stream Analytics over Hierarchical Cluster

5.1 Introduction

Major cities like London, New York, and Beijing are deploying tens of thousands of cameras. Analyzing live video streams is of considerable importance to many organizations. Traffic departments analyze video feeds from intersection cameras for traffic control, and police departments analyze city-wide cameras for surveillance.

Organizations typically deploy a hierarchy of clusters to analyze their video streams. An organization, e.g., a city's traffic department, runs a private cluster to pull in the video feeds from its cameras (with dedicated bandwidth). The private cluster contains compute capacity for analytics while also tapping into public clouds (like Azure or Amazon EC2) for overflow computation. The uplink bandwidth between the private cluster and the public cloud, however, is not sufficient to stream all the camera feeds to the cloud for analytics. Some newer cameras also have limited compute capacity on them for video analytics. Our conversations with major video analytics providers like Genetec [6] and Avigilon [3] reveal that the hierarchical architecture – camera, private cluster, cloud – is common to live video analytics.

Video analytics queries are pipelines of computer vision components. For example, the object tracking query consists of a "decoder" component that converts video to frames, followed by a "detector" that identifies the objects in each frame, and an "associator" that matches objects across frames, thereby tracking them over time.
Video query components have many implementation choices that provide the same abstraction. For example, object detectors take a frame and output a list of detected objects. Detectors can use background subtraction to identify moving objects against a static background, or a deep neural network (DNN) to detect objects based on visual features. Background subtraction requires fewer resources than a DNN but is also less accurate because it misses stationary objects. Components can also have many knobs that further impact query accuracy and resource demand. Frame resolution is one such knob; higher resolution improves detection but requires more resources. Video queries have thousands of different combinations of implementations and knob values. We define query planning as selecting the best combination of implementations and knob values for a query, so as to maximize the query's accuracy under a resource budget.

In addition to planning, the components of queries have to be placed across the hierarchy of clusters. Placement dictates the multiple resource demands (network, compute) at each cluster. For example, placing the tracker query's detector on the camera and its associator in the private cluster uses compute and network of the camera and the private cluster, but not the uplink out of the private cluster or any resources in the cloud. While a query plan has a single accuracy value, it can have multiple placement options, each with its corresponding resource demands. Careful component placement achieves better resource utilization, which allows queries to use higher-accuracy plans.

Finally, multiple queries analyzing video from the same camera often have common components – e.g., car counter and pedestrian monitor queries both need an object detector and associator. The common components are typically the core vision building blocks. Merging common components has the potential to significantly reduce overall resource usage, but queries can only be merged if they use the same plan and are placed at the same location.

Our objective is to determine the query plan for each video query, place its components across the hierarchy of clusters, and merge common components across queries, with the goal of maximizing the average query accuracy.

Current video analytics solutions [3, 6] make static query planning and placement decisions. These decisions are often conservative on resource demands and result in low accuracies while leaving resources underutilized. At the same time, running all the queries at the highest accuracy is often infeasible because the private cluster does not have enough compute resources to run them locally, or enough bandwidth to push all the streams to the cloud. Production stream processing systems [2, 40, 166] commonly employ fair sharing among queries [29, 62, 78]. We could extend fairness to multiple clusters in the hierarchy, and pick query plans and placements such that each query's demand stays within its fair allocation. But fair sharing is a poor choice for our objective because its decisions are agnostic to the resource-accuracy relationship of queries.

To maximize accuracy, we need to jointly plan, place, and merge all queries. If we determine a query plan separately, we might not have enough resources to place it. Even if we plan and place queries together, we might not be able to merge common components because they end up using different plans or placements.
However, optimizing over the plans and placements of all queries while also considering merging leads to an exponentially large search space. A key challenge we address is efficiently navigating the space of query plans and placements.

Our solution, Cascade, identifies the most "promising" configurations – combinations of a query plan and a placement – and filters out those that are inaccurate yet have large resource (network, compute) demand. We call these promising configurations the Pareto band of configurations, since we extend the classic economic concept of Pareto efficiency [149]. This dramatically reduces the configuration search with little impact on accuracy.

Our heuristic greedily searches through the configurations within the Pareto band and prefers configurations with higher accuracy relative to their resource demand. Comparing resource demand vectors consisting of multiple resources across clusters, however, is non-trivial. For every configuration's demand vector, we define a resource cost as the dominant utilization: the maximum ratio of demand to capacity across all resources and clusters in the hierarchy. Using the dominant utilization avoids a lopsided drain of any single resource at any cluster.

Cascade also merges common components of queries by carefully considering the aggregate accuracy and demand of different merging options. In doing so, it resolves potential merging conflicts – e.g., a DNN-based detector is better for pedestrian monitoring while a background subtractor is better for car counting.

The streaming database literature [20, 21, 28, 44, 113, 151] considered the resource-accuracy tradeoff but did not deal with multiple knobs (only sampling rate), multiple resources (only memory), or a hierarchy of clusters. Networked streaming systems [127, 140, 146, 157] consider a hierarchy but also tweak only the sampling rate based on bandwidth availability. Recent work on video processing [169] considers multiple knobs but assumes that all the video is streamed into a single cluster, and hence does no placement.

Our contributions are as follows:
1. We formulate the problem of planning, placement, and merging for video analytics in hierarchical clusters.
2. We efficiently search only in a Pareto band of promising query configurations, and compare configurations with multiple resource demands across the hierarchy by defining a dominant utilization metric.
3. We study the resource-accuracy profiles of multiple real-world video analytics queries. As part of Cascade, we build an efficient profiler that generates the resource-accuracy profiles using 100× fewer CPU cycles than an exhaustive exploration.

Evaluation of Cascade using realistic video queries shows that we outperform a fair allocation of resources by 15.7× in accuracy and are within 6% of optimal.

5.2 Video Processing Architecture

Figure 5.1: Hierarchical Video Analytics Architecture.

Organizations with large deployments of cameras – e.g., cities, police departments, retail stores – typically use a hierarchy of clusters (or locations, interchangeably) to process video streams [3, 6, 18]. Figure 5.1 shows that each organization (e.g., NYPD in Figure 5.1) runs one or more private clusters into which the videos from their cameras are pulled. When there are multiple private clusters, cameras are pre-configured to stream into one of them. Connectivity between the cameras and the private cluster is via dedicated (wired or wireless) links.
The private clusters also store the videos, though storage is not our focus.

Network: The network connecting the cameras, private clusters, and the public cloud is a crucial resource. The network bandwidth required to support a single camera ranges from hundreds of Kb/s for wireless cameras, to a few Mb/s for high-resolution video, to above 10 Mb/s for multi-megapixel cameras. The bitrate of the video stream can be controlled by configuring the frame resolution and frame rate on the camera. Typically, the uplink from the private cluster to the public cloud is a few tens of Mb/s. As per our conversations with Avigilon [3] and Genetec [6], leaders in video analytics solutions, the typical provisioned uplink bandwidth from a private cluster does not support streaming a large number of videos to the cloud at high resolution.

Compute: Each private cluster also contains compute capacity (cores) to process video queries; we explain queries shortly, but for now it suffices to know that they are pipelines of vision components. Compute capacities at private clusters vary significantly, from just a handful of cores (in the case of smaller cities) to hundreds of cores (as in the case of New York City [18]). Organizations may also tap into public clouds like Amazon EC2 and Microsoft Azure when required. Newer cameras themselves contain compute capacity [4], which can also be used for video analytics queries.

Video queries have thousands of functionally equivalent configurations (see Section 5.4) from which to choose, and these configurations control their compute and network demand; yet all major video analytics providers typically use hard-coded configurations. Placement of the components of each query – on camera, private cluster, or cloud – is also static, thereby precluding dynamic splitting of a query across multiple clusters, joint decisions across queries, or merging common components of multiple queries.

5.3 Challenges and Desirable Features

In this section, we motivate the need for careful query planning, placement, and merging using an illustrative example. We define query planning for a video query as choosing the most suitable implementation for each of the query's components along with setting the relevant knobs. Query placement determines how we place the individual query components across the available clusters. Query merging eliminates common components among queries that analyze the same camera stream. The objective we consider is to maximize the average accuracy of queries given the available resources.

Figure 5.2: Object Tracker Pipeline
Figure 5.3: Hierarchical Setup

Query Plan    B_D (Mb/s)    B_A (Mb/s)    C_D (cores)    C_A (cores)    Accuracy
Q_1080p       3             1.5           3              3              0.9
Q_480p        1             1             2              2              0.6
Q_240p        0.5           0.5           0.5            0.5            0.2
Figure 5.4: Query plans for Q_1 and Q_2

Query and Cluster Setup

Consider the "object tracking" query consisting of two components: the first component – detector D – detects objects in the video, while the second component – associator A – associates newly detected objects with existing object tracks or starts new tracks (see Figure 5.2). The components have CPU demands ($C_D$ and $C_A$) and data rates into them ($B_D$ and $B_A$). Detectors and associators are foundational to computer vision and key building blocks for video queries.

Figure 5.3 presents the cluster setup with two tracking queries ($Q_1$ and $Q_2$) running on two different camera streams that are pulled into the private cluster.
The private 107 C D 1080p B D 480p B A 1080p 3 Mb/s 3 cores CPU Network 1.5 1 Figure 5.5: Utilization at private cluster for best plans cluster has 3 cores while the public cloud has 40 cores (practically unlimited, for this example); the camera has no compute resources for ease of illustration. The private cluster has a 3 Mb/s link to the public cloud and each camera has a dedicated 3 Mb/s link to the private cluster. For simplicity, we assume that the only knob we control in the query plans is the frame resolution and that bothQ 1 andQ 2 have the same choice of plans (see Figure 5.4). We profile the tracker in detail inx5.4. With a resolution of 1080p, the trackers work well off both camera streams (Q 1;1080p andQ 2;1080p ) producing outputs of accuracies0:9. The Corresponding CPU demands of the detector and associator (C D andC A ) are also high (3 cores). 1 Accuracy of the trackers drops at lower resolutions because the detector cannot find small objects and the associator cannot correlate objects between frames. However, the data rates of the streams (B D ), outputs of the components (B A ), and the CPU demand (C D andC A ) all drop too. Choosing the query plans and placements Each query has three query plan options (1080p, 480p, or 240p) and three placement options: (a) both components in the private cluster, (b) detector in the private cluster 1 CPU demand is the cores needed to analyze at frame rate. 108 and associator in the cloud, (c) both components in the cloud. 2 Hence, in our example, each query has9 configurations. Picking the 1080p resolution is the best for accuracy (Q 1;1080p andQ 2;1080p ), but we cannot pick it for both queries simultaneously because it is infeasible given the resource availability. Placing all components in the cloud requires a bandwidth of6 Mb/s (B D + B D ) against the available 3 Mb/s. If one query’s detector (needing 3 cores) is placed in the private cluster (whose capacity is 3 cores), the network link of 3 Mb/s between the private cluster and the cloud is still insufficient to support the aggregate data rate of 4:5 Mb/s (B D +B A for1080p). At the same time, the compute resources of the private cluster is insufficient to support all the components locally. Hence, query plans should be determined jointly across queries, unlike current database optimizers [1, 30], while considering multiple resources. PickingQ 1;480p andQ 2;1080p (orQ 1;1080p andQ 2;480p ) leads to the best average accu- racy of 1:5=2 = 0:75. However, this combination of query plans is feasible only if we place the detector of Q 2;1080p in the private cluster and its associator in the cloud, while forwardingQ 1;480p ’s camera stream up to the cloud for executing both its detec- tor and associator. Figure 5.5 shows the resulting utilizations at the private cluster. All other placements are infeasible. Hence, query components must be placed jointly when selecting their plans. Merging Common Components Consider two new queries Q 3 and Q 4 running on the same camera that both use the detector and associator modules. Q 3 uses object trajectories to count cars andQ 4 uses people trajectories to identify jaywalkers. We can merge the detector and associator of 2 Placing the detector in the cloud and associator in the private cluster is clearly wasteful and we do not consider this placement option. 
Merging Common Components

Consider two new queries, $Q_3$ and $Q_4$, running on the same camera that both use the detector and associator modules. $Q_3$ uses object trajectories to count cars and $Q_4$ uses people trajectories to identify jaywalkers. We can merge the detector and associator of both queries and run only one copy of each. Such merging saves network and compute resources: we avoid redundant streaming from the camera and redundant execution of the detector and associator (see Figure 5.6). Queries with common components are quite prevalent in traffic analytics and surveillance, based on our discussions with municipalities in the USA.

Figure 5.6: Merging the detector and associator of the "car counter" and "jay walker" queries on the same camera.

Despite the obvious resource benefits of merging, the decision is non-trivial. This is because we need to select the same plan for the merged components. However, a high-accuracy plan for $Q_3$ might result in a low accuracy for $Q_4$. In our example, car counting might work better with background subtraction, while we might need a DNN detector for pedestrians. Hence, merging has to consider conflicts in accuracies, and whether the merged plan with maximum accuracy is too resource-intensive.

Desirable Features

To summarize, these are the desirable properties of a video query planner whose goal is to maximize query accuracy: (i) jointly plan for multiple queries using their resource-accuracy profiles, (ii) consider component placement when selecting query plans to identify resource constraints, (iii) account for multiple resources at the hierarchy of locations, and (iv) merge common components across queries that process the same video stream. Achieving these properties is computationally complex owing to the combinatorial number of options.

5.4 Resource-Accuracy Profiles

We now profile the query plans of the canonical object tracking video query: a detector followed by an associator (see Figure 5.2). Object detectors and associators form the basis for a large swathe of video queries in traffic monitoring, indoor surveillance, and retail analytics.

There are many different implementations of video processing components. A common approach to detect objects is to continuously model the background and subtract it to get the foreground objects. There are also other approaches based on scene positioning and deep neural networks (DNNs). Likewise, objects across frames can be associated with each other using different metrics such as distance moved (DIST), color histogram similarity (HIST), or SIFT [15] and SURF [16] features. Note that the different implementations of the detector and of the associator are equivalent in their functionality and abstraction (inputs and outputs). However, they result in widely varying accuracy and resource demands. Computer vision research does not have a well-established ordering among the implementations, and its focus is by and large only on improving accuracy (not resource utilization).

There are also many knobs that affect query performance. Reducing the resolution drops the resource demand at the expense of also dropping the corresponding accuracy. Sampling the frames, e.g., dropping every third frame, makes the detector's model of the background inaccurate. The farther apart the frames are spaced in time (due to sampling), the harder it is to associate objects across frames. However, not processing all the frames saves resources. Queries typically have many other knobs too, some of which are specific to the implementation choices (e.g., the model size of object detection DNNs), all of which affect accuracy and resource demand.
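Since a query plan is just one implementation choice per component plus one value per knob, the plan space is simply the cross product of those choices. The small sketch below illustrates this; the particular sampling values, and treating resolution and sampling as the only knobs, are assumptions for the example rather than the profiled configuration set.

```python
from itertools import product

# Illustrative (not exhaustive) choices drawn from the text.
DETECTORS   = ["background-subtraction", "dnn"]
ASSOCIATORS = ["DIST", "HIST", "SIFT", "SURF"]
RESOLUTIONS = ["1080p", "480p", "240p"]
SAMPLING    = [1.0, 0.5, 0.33]          # fraction of frames processed (assumed values)

def enumerate_plans():
    """Enumerate query plans as the cross product of implementation and knob choices."""
    return [
        {"detector": d, "associator": a, "resolution": r, "sampling": s}
        for d, a, r, s in product(DETECTORS, ASSOCIATORS, RESOLUTIONS, SAMPLING)
    ]

plans = enumerate_plans()
print(len(plans))   # 2 * 4 * 3 * 3 = 72 plans even for this small choice set
```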
Next we quantify the impact that the query plans – decisions on the implementations and knobs – have on the accuracy and resource demands of the tracker query. We plot the results for a representative traffic camera stream in Bellevue (in Washington state) with an original bit-rate of 3 MB/s at 30 frames per second. We obtain the ground truth of the tracks – cars, pedestrians, bikes – via crowd-sourcing, and compare our tracks against this ground truth.

Figure 5.7: Accuracy vs. Cores
Figure 5.8: CPU demands of components

Accuracy: An object's track is a time-ordered sequence of boxes across frames, and in each frame we calculate the F1 score ∈ [0, 1] (the harmonic mean of precision and recall [148]) between the box in the ground truth and the box in the track generated by the tracker. We compute the accuracy of a track as the average of the F1 scores across all the track's frames. We see from Figures 5.7 and 5.9 that the 300 query plans produce outputs of widely-varying accuracies.

CPU Demand: Query plans have vastly varying CPU demands – the sum of the detector's and associator's CPU demands – going from under a core all the way to multiple cores (truncated at 3; see Figure 5.7). Queries with background-subtraction-based detectors are less CPU-intensive than those with DNN-based detectors. Further, when components do not maintain state across frames (e.g., DNN detectors), different frames can be processed in parallel across many cores to keep up with the video's frame rate.

Figure 5.9: Accuracy vs. Data rates
Figure 5.10: Data rates of components

As Figure 5.8 shows, the individual components – detector and associator – themselves exhibit considerable variability in their CPU demand over the range of implementations and knobs; accuracy is defined only for the entire query and not for individual components.

Network Demand: Query plans also differ in their data rates; Figure 5.9 shows the sum of the output data rates of the detector and associator (excluding the input stream rate), and Figure 5.10 shows the CDF of the output rates of the components. Data rates vary across different implementations and knob values from a few hundred Kb/s to 5 Mb/s. In some query plans, when the components fail to detect and track the objects, they output much less data. Modern DNNs that detect and describe regions in a frame in natural language (e.g., DenseCap [91]) produce rich outputs whose data rates are comparable to the input video.

Figure 5.11: Network and CPU demands of the camera and the two modules in the tracker pipeline for three different plans, all with accuracy 0.73–0.75. Y-axis truncated at 2.5.
Figure 5.12: Profiling DNN recognizers – object, scene, face – on server-class and mobile GPUs.

Figure 5.11 shows the computation and data rates for three different plans with very similar accuracy. A and B are based on a DNN detector, while C uses background subtraction.
A's CPU demand is much higher than C's, but A's detector output is much lower; depending on the constrained resource, both plans might thus be useful when optimizing for accuracy, while B is too expensive. Note that C's detector output rate is larger than the input video rate (because it sends images of the moving objects).

Actual network demands depend on component placement. If we place both components in the same cluster, the detector's output stays inside that cluster. Hence the need to jointly plan and place, as called out in Section 5.3. Our work only considers network bandwidth across clusters.

Resource-accuracy profiles are integral to video queries including license plate readers, DNN recognizers, etc. Figure 5.12 plots the resource-accuracy profiles of different DNN recognizer implementations on GPUs: scene (AlexNet/MITPlaces [171]), face (DeepFace [143]), and object (the VGG16 model [141]) recognizers. We ran each model on a server-class GPU (NVIDIA K20) as well as a mobile GPU (NVIDIA Tegra K1), the kind likely to be available on cameras. We generate less accurate but faster models using lossy techniques as in [72].

In contrast to the rich body of database literature on identifying the relationship between sampling rate and accuracy, video queries have many more dimensions (knobs and implementations) that affect accuracy and resource demand. Crucially, SQL queries use analytical models to estimate the relationship between query demand and accuracy [50]. Such models do not exist for video queries, and the relationship depends on the specific camera feeds. Exhaustively exploring the space of query plans is expensive, and we design an efficient profiler that is 100× cheaper for obtaining the resource-accuracy profile (see Section 5.6).

5.5 Video Query Planning & Placement

The goal of our solution, Cascade, is to jointly optimize all queries to maximize the average query accuracy within the available resources. Specifically, we plan for each query (pick its implementations and knobs), place the components of the query across the available clusters, and merge identical components across queries that process the same stream to save resources.

5.5.1 Problem Formulation

We begin by formulating our problem as an optimization to highlight its computational intractability and the need for an efficient heuristic solution (see Sections 5.5.3–5.5.6).

Notation and Definitions

Let $N_i$ represent the set of all plans of query i, i.e., all combinations of possible knob values and component implementations. As illustrated in Section 5.4, examples of this include setting the "frame resolution" knob and picking the implementation for the object detector component. Let $A_{i,j}$ represent the accuracy of plan j for query i. Our profiler provides the accuracy and resource demands of each plan (covered in Section 5.6.1), both of which are independent of where we place the query's components. Let $L_i$ represent the set of all possible placements of the components of query i; if the query has $n_c$ components and we can place each component in one of $n_s$ clusters, there is a total of $n_s^{n_c}$ placement choices for the query. Table 5.1 lists the relevant notation.

We model each cluster (e.g., the private cluster in Figure 5.1) as an aggregate bin of resources and only consider placement of query components across the clusters. Each cluster has compute cores along with network uplink and downlink bandwidths.
Let $C_l$ be the capacity of resource $l$, and let $D^l_{i,j,k}$ be the demand on resource $l$ from query i when running with plan j and placement k. We refer to each combination of resource type (e.g., uplink) and cluster (e.g., the camera) as a "resource" $l$. For the example in Section 5.3 (Figure 5.3), placing the detector at the private cluster and the associator in the cloud uses the following resources: the uplink of the camera and the downlink of the private cluster (for the video), the cores and uplink of the private cluster (running the detector and shipping its output), and the downlink and cores of the cloud (ingesting the detector's output and running the associator). (Downlink bandwidths usually far exceed uplink bandwidths.)

$N_i$           set of all plans of query i
$L_i$           set of all placements of query i
$A_{i,j}$       accuracy of plan j of query i
$S_{i,j,k}$     cost of query i when using plan j and placement k
$D^l_{i,j,k}$   demand on resource l of query i when using plan j and placement k
$C_l$           capacity of resource l
$x_{i,j,k}$     binary variable equal to 1 iff query i uses plan j and placement k
$c_i$           configuration of query i (its plan and placement); $c_i = (j,k)$ iff $x_{i,j,k} = 1$
Table 5.1: Notations for query i.

Binary Integer Program

Our problem can be formulated as the following Binary Integer Program (BIP):

max $\sum_{i,j,k} A_{i,j}\, x_{i,j,k}$   (5.1)
s.t. $\forall l: \sum_{i,j,k} D^l_{i,j,k}\, x_{i,j,k} \leq C_l$   (5.2)
$\forall i: \sum_{j,k} x_{i,j,k} = 1$   (5.3)
$x_{i,j,k} \in \{0, 1\}$   (5.4)

where $x_{i,j,k}$ is a binary variable equal to 1 iff query i runs using plan j and placement k. The optimization maximizes the sum (equivalently, the average) of query accuracies (Eq. 5.1), while meeting the capacity constraint for every resource $l$ (Eq. 5.2). As in Section 5.3, plans and placements that do not fit within the available resources are deemed infeasible. Eq. 5.3 requires that exactly one query plan and placement be selected for each query. We refer to each (plan, placement) pair of a query as a configuration.

Unfortunately, this optimization has to explore a large, combinatorial space, which makes it inefficient. With $n_s$ clusters, $n_c$ components per query, $n_p$ query plans per query, and $n_q$ queries, the size of the optimization space is $(n_s^{n_c} \cdot n_p)^{n_q}$. For example, with 1000 queries, each with 5 components and 100 plans, and 3 clusters, we would have to consider approximately $10^{4000}$ options.

We can extend the formulation to handle query merging as follows. Only queries that process video from the same camera can merge; we thus logically group all queries on the same camera into super-queries and formulate the same program as above, but at the level of the super-queries. The accuracy and demand of a super-query are aggregated across all its contained queries.

5.5.2 Solution Overview

The cornerstone of Cascade is efficiently navigating the large space of configurations – potential query plans and placements of components – and reducing the combinatorial complexity. We build our solution as follows.
1. We define a resource cost for a configuration, a scalar metric that aggregates multiple resource demands across many clusters. Cost lets us easily compare different query configurations (see Section 5.5.3).
2. Starting from the configurations with the lowest costs, our heuristic greedily switches to configurations that have high efficiency, i.e., that improve the query accuracy the most at low additional cost (see Section 5.5.4).
3. We optimize the running time of the heuristic by identifying a smaller subset of promising query configurations, its Pareto band (see Section 5.5.5).
4. We merge queries containing common components that process the same camera streams (see Section 5.5.6).
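As a point of reference for the heuristic that follows, the BIP in Eqs. (5.1)–(5.4) can be written down almost verbatim with an off-the-shelf integer-programming library and solved exactly for small instances. The sketch below uses PuLP purely for illustration (the chapter does not prescribe a solver), and the data layout of `acc`, `dem`, and `cap` is our own choice.

```python
import pulp

def solve_bip(acc, dem, cap):
    """acc[i][j] = A_{i,j}; dem[i][j][k] = {resource l: D^l_{i,j,k}}; cap[l] = C_l."""
    prob = pulp.LpProblem("plan_and_place", pulp.LpMaximize)
    x = {(i, j, k): pulp.LpVariable(f"x_{i}_{j}_{k}", cat="Binary")
         for i in acc for j in acc[i] for k in dem[i][j]}
    # (5.1) maximize the sum of accuracies of the chosen configurations
    prob += pulp.lpSum(acc[i][j] * x[i, j, k] for (i, j, k) in x)
    # (5.2) capacity constraint for every resource l
    for l in cap:
        prob += pulp.lpSum(dem[i][j][k].get(l, 0) * x[i, j, k]
                           for (i, j, k) in x) <= cap[l]
    # (5.3) exactly one (plan, placement) per query
    for i in acc:
        prob += pulp.lpSum(x[i, j, k] for (ii, j, k) in x if ii == i) == 1
    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return {i: (j, k) for (i, j, k) in x if x[i, j, k].value() == 1}
```

The exponential growth described above shows up as an explosion in the number of binary variables and the solver's running time, which is precisely what the heuristic below avoids.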
Resource demands of the configurations, $D^l_{i,j,k}$, and accuracies of the plans, $A_{i,j}$, are estimated by our profiler before the query is submitted to the scheduler (see Section 5.6.1).

5.5.3 Resource Cost

To decide between two configurations $c_0$ and $c_1$, we need to compare their accuracies and resource demands. Is $c_1$ improving the accuracy enough for the amount of additional resources it consumes? However, because a query runs across multiple clusters with several types of resources (see Section 5.5.1), it is not straightforward to compare resource demands. Therefore, we define a resource cost that aggregates the demand for multiple resources into a single value.

We use the following simple definition: the cost of a placement k of a query plan j for query i is its dominant resource utilization, $S_{i,j,k} = \max_l D^l_{i,j,k} / C_l$. $S$ is a scalar that measures the highest fraction of any resource $l$ needed by the query across resource types (CPU, uplink, downlink) and clusters (camera, private cluster, cloud).

A nice property of the dominant utilization metric is that, by normalizing the demand $D$ relative to the capacity $C$ of the cluster, it avoids a lopsided drain of any single resource at any cluster. Also, by being dimensionless, it easily extends to multiple resources, akin to DRF [62]. While we also considered defining $S_{i,j,k}$ using the sum of the resource utilizations ($\sum_l$ instead of $\max_l$) or just the absolute resource demand, these performed worse.

5.5.4 Greedy Heuristic

In order to maximize average accuracy, it is crucial to efficiently utilize the limited resources. We employ the intuitive guiding principle of allocating more resources to queries that can achieve higher accuracy per unit of resource allocated, compared to other queries. To achieve this goal, we use an efficiency metric that relates the achieved accuracy to the cost of the query.

Our heuristic starts by assigning the configuration with the lowest cost to each query, and greedily considers incremental improvements to all the queries to improve the overall accuracy. When considering switching query i from its current plan j and placement k to another plan j' and placement k', we define the efficiency of this change as the improvement in accuracy normalized by the additional cost required. Specifically:

$E_i(j', k') = \dfrac{A_{i,j'} - A_{i,j}}{S_{i,j',k'} - S_{i,j,k}}$

Defining $E_i(j', k')$ in terms of the "delta" in both accuracy and cost turns out to be best suited to our gradient-based search heuristic. It outperformed alternative definitions that use only the new values, e.g., only $A_{i,j'}$ and/or $S_{i,j',k'}$, as we show in Section 5.7. Figure 5.13 shows the pseudocode.

1:  U ← set of all (i, j, k) tuples over all queries i and their available plans j and placements k
2:  p_i ← plan assigned to query i
3:  t_i ← placement assigned to query i
4:  for all queries i do
5:      (p_i, t_i) ← argmin_(j,k) S_{i,j,k}            ▷ cost S, Section 5.5.3
6:      for each resource l: update R_l
7:  while U ≠ ∅ do
8:      U' ← U \ {(i, j, k) : ∃ l, R_l < D^l_{i,j,k}}
9:      remove (i, j, k) from U if A_{i,j} ≤ A_{i,p_i}
10:     (i*, j*, k*) ← argmax_{(i,j,k) ∈ U'} E_i(j, k)
11:     p_{i*} ← j*
12:     t_{i*} ← k*
13:     for each resource l: update R_l based on D^l_{i*, p_{i*}, t_{i*}}
Figure 5.13: Pseudocode for Cascade's heuristic.
U represents the set of all (i, j, k) tuples over all queries i, the available plans j, and the placements k. The objective is to assign to each query i a plan $p_i$ and a placement $t_i$ (lines 1–3). The heuristic first assigns each query i the plan j and placement k with the lowest cost $S_{i,j,k}$ (lines 4–5). After that, it iteratively searches across all plans and placements of all queries and selects the query $i^*$ (and the corresponding plan $j^*$ and placement $k^*$) with the highest efficiency (lines 7–13). It switches query $i^*$ to its new plan $j^*$ and placement $k^*$, and repeats until no query can be upgraded any more (either because of insufficient remaining resources, or because there are no more plans with higher accuracy).

In each iteration, we only consider configurations that fit within the remaining resources $R_l$, by constructing U' (line 8). Note that we cannot remove such infeasible configurations from U completely, because they might become feasible later as the heuristic moves components across clusters by changing the configurations of queries.

A subtle aspect is that in each iteration we remove from U those options that reduce a query's accuracy relative to its currently assigned plan and placement (line 9). Such an explicit removal helps because, even though the change in accuracy of the removed options would be negative, those options may also have a negative difference in dominant utilization ($S_{i,j,k}$), thus making the efficiency positive and potentially high. Omitting this check drastically lowers the eventual accuracy and increases the running time of our heuristic.

While our heuristic is iterative, we clarify that we do not apply the query plans and placements at the end of each iteration, but only when the heuristic fully completes. Note that our heuristic offers a simpler solution than solving the BIP optimization in Section 5.5.1 by focusing on a single query at a time without worrying about the other queries: it greedily updates the configuration of a single query in each iteration and makes further updates in future iterations based on the current ones. In Section 5.7.6, we show that our heuristic achieves a significantly lower running time than solving the BIP optimization.

5.5.5 Pareto Band

Figure 5.14: Illustration of the Pareto band (shaded) for a single query. Note that for each accuracy (plan), there is a horizontal stripe of placement options with different costs.

To speed up the heuristic, we significantly reduce the size of the exponentially-large set U by explicitly filtering out query configurations that have low accuracy and high resource demand. For example, the configurations in the bottom-right corners of the tracker query's plots in Figures 5.7 and 5.9 are unlikely to be selected by the heuristic.

We build upon the classic economic concept of Pareto efficiency [149] to first identify the Pareto boundary of query configurations. Figure 5.14 plots an illustrative accuracy-cost space for a query, with the left line being the Pareto boundary. For a particular query i, a configuration c is on the Pareto boundary if there does not exist any other configuration with a lower cost and a higher accuracy. For every point not on the Pareto boundary, there is at least one point on the boundary that beats it in both accuracy (higher) and cost (lower). However, limiting our search to only the configurations on the Pareto boundary can be problematic when optimizing for multiple queries.
Note that the cost S in Section 5.5.3 is defined in terms of the resource capacities, not the resource availabilities. As a result, when our greedy heuristic makes its decisions iteratively and gets to deciding for a query, all the placement options on the query's Pareto boundary may turn out to be infeasible with the available resources, because earlier-assigned queries may have used up the capacities (line 8 in Figure 5.13). Therefore, to reduce the size of the set U without restricting the heuristic too much, we define a "band" relative to the boundary, which we refer to as the Pareto band. We first define a δ-boundary that consists of the points (δ·c, a) for all points (c, a) on the Pareto boundary. The Pareto band then consists of the configurations between the Pareto boundary and the δ-boundary. (See an illustration with δ = 2 in Figure 5.14; the right line is the δ-Pareto boundary.) Making the band's width relative to the Pareto boundary provides a cluster of placements with comparable cost. We only search among the query configurations within the band using our heuristic (the set U in Section 5.5.4).

5.5.6 Merging Peer Queries

When there are multiple queries processing the same camera feed with a common prefix in their pipelines, we have the opportunity to eliminate running redundant components. We refer to such queries as a peer set. As explained in Section 5.3, a key question in merging peer queries is deciding on the implementation and knobs of the merged components. In addition, the decision to merge is not just about the peer queries involved, but should also consider the aggregate quality of all queries in the system, as the planning and placement of the other queries would also be affected. Finally, the possible merging combinations grow exponentially for a peer set of N queries (each pair of queries in a peer set can be merged).

Our heuristic in Section 5.5.4 is efficient because it considers the Pareto band of configurations for each query independently. However, searching for good merged configurations could make it considerably more expensive. To reduce the search space, we make the following two simplifying decisions when considering merging two queries. (a) First, we either merge all their common components or nothing at all. For the example in Figure 5.6, we either merge both the detector and the associator or neither of them; we do not consider the option of merging only the detector or only the associator. (b) Second, we avoid searching through all possible implementation and knob values for the components that are not common (the "counter" and "jay walker" modules in Figure 5.6). Note that the heuristic already comes up with values for these distinct components, and we only consider those values when evaluating the overall value of merging.

We make the following change to lines 11–12 of Figure 5.13. When considering switching to configuration $(p^*_i, t^*_i)$ of query $i^*$, we also consider merging this query with all subsets of its peer queries. Let R be one of the subsets of $i^*$'s peer queries. We merge all queries in R with $i^*$ and apply the $(p^*_i, t^*_i)$ configuration to all components in $i^*$. Any remaining components in the merged query (those not in $i^*$) remain in their current plan and placement. For each such merged query, we compute the efficiency metric E relative to all peer queries of $i^*$, i.e., the ratio of the aggregate increase in accuracy to the aggregate increase in resource cost.
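Before turning to the system design, the Pareto-band filter of Section 5.5.5 can be summarized in a few lines. The band test used below (keep a configuration if its cost is at most δ times the cost of the cheapest configuration reaching at least its accuracy) is one straightforward reading of the definition above, not necessarily the system's exact implementation; the function name and example numbers are invented.

```python
def pareto_band(configs, delta=2.0):
    """Keep only configurations inside the Pareto band.

    `configs` is a list of (cost, accuracy) pairs for one query; a
    configuration is kept if its cost is at most `delta` times the cost of
    the cheapest configuration that reaches at least its accuracy.
    """
    kept = []
    for cost, acc in configs:
        # Cheapest way to achieve accuracy >= acc: the Pareto-boundary point
        # directly "to the left" of this configuration.
        frontier_cost = min(c for c, a in configs if a >= acc)
        if cost <= delta * frontier_cost:
            kept.append((cost, acc))
    return kept

# Example: the expensive low-accuracy configuration (0.8, 0.3) is filtered out.
print(pareto_band([(0.1, 0.2), (0.3, 0.6), (0.9, 0.9), (0.8, 0.3)], delta=2.0))
```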
5.6 System Design

We now describe the resource-accuracy profiler in Cascade and the relevant implementation details.

5.6.1 Resource-Accuracy Profiler

As we saw in Section 5.4, each query has multiple query plans, each with different resource demands and output accuracy. Before we can allocate resources to optimize the queries (i.e., make planning and placement decisions), we have to profile them: for each query plan (i.e., each combination of component implementations and knobs), measure the CPU and bandwidth requirements of all query components and the query's accuracy. Note that the profiler does not consider component placement, since the component resource demands and query accuracy are fully determined by the query plan. Our scheduler in Section 5.5 optimizes over both the plans (based on the profiles) and the placements in the hierarchy of clusters to find the maximum-accuracy schedule.

The profiler estimates the query accuracy by running the query on a labeled dataset obtained via crowd-sourcing, or by labeling the dataset using a "golden" query plan that might be resource-intensive but is known to produce highly accurate outputs. When a user submits a new query, we start profiling it while submitting it to the scheduler with the default query plan. We assume a dedicated set of machines for profiling. Once the profiles become available, the scheduler is triggered to recompute the query optimization based on the new profiles. Our system re-runs the profiling process when: (1) the users submit new sets of labeled data to calibrate the accuracy of the queries, (2) a period of time pre-configured by the users expires (e.g., a few hours or a few days, depending on the query characteristics), or (3) the resource capacity in the system changes.

Since a query can have thousands of plans, which we have to execute on the labeled videos, the main goal in profiling is to minimize the CPU demand of the profiler. (Note that minimizing the CPU demand of the profiler is different from query optimization: the former aims at efficiently obtaining the queries' profiles, while the latter aims at optimally scheduling the queries according to those profiles.) We use two simple tricks: (1) eliminating common sub-expressions by merging multiple query plans, and (2) caching the intermediate results of query components.

To illustrate the merging technique we use for profiling, assume that both components in the tracking query D → A have two implementations: $D_1$, $D_2$ and $A_1$, $A_2$. We thus have to profile four query plans: $D_1 A_1$, $D_1 A_2$, $D_2 A_1$, and $D_2 A_2$. If we run each plan separately, implementations $D_1$ and $D_2$ would each run twice on the same video data. If we merge the execution of plans $D_1 A_1$ and $D_1 A_2$, we can avoid the redundant runs. We merge recursively, and the idea is similar to the merging of queries in Section 5.5.6.

While we could merge all the plans into one, executing this would require a large number of concurrent compute slots that might not be available. In such cases, we resort to caching intermediate results. Note that while caching alone would eliminate redundant executions, it has dramatically high storage requirements: in profiling the tracker on a 5-minute traffic video, the storage space for caching is 78× the size of the original video. Hence, we assign a caching budget per query. We preferentially cache the outputs of those components that take longer to generate. In addition, we also cache the outputs of those components that will be used more frequently; these are components with many downstream components, each with many implementation and knob choices. Naturally, caching their outputs is a better use of the caching budget.
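The plan-merging trick described above amounts to sharing common prefixes across plans so that each distinct component configuration runs only once during profiling. A tiny sketch of that grouping is shown below; the tree layout and function name are ours, not the profiler's actual data structures.

```python
def plan_tree(plans):
    """Group plans by shared prefixes so each distinct component
    configuration is executed only once during profiling.
    A plan is a tuple of per-component choices, e.g. ("D1", "A2")."""
    tree = {}
    for plan in plans:
        node = tree
        for choice in plan:
            node = node.setdefault(choice, {})
    return tree

# The four tracker plans share the two detector executions:
print(plan_tree([("D1", "A1"), ("D1", "A2"), ("D2", "A1"), ("D2", "A2")]))
# {'D1': {'A1': {}, 'A2': {}}, 'D2': {'A1': {}, 'A2': {}}}
```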
Benefit: Compared to a baseline of exhaustive exploration for the tracking query, our proposed profiling method based on the caching and merging techniques (with a cache budget of 900 MB per query per machine) consumes 100× fewer CPU cycles. We believe such a budget is practical on modern machines. Note that, while reducing the CPU cycles spent on profiling, our method does not sacrifice the quality of the profiling results: it obtains the CPU and bandwidth requirements as well as the output accuracy for all the query plans, just as exhaustive exploration does, but at significantly lower CPU cost.

5.6.2 Implementation

There are two key aspects to running Cascade.

Query Specification: Queries are submitted as a pipeline of components, specified in JSON. Each component takes a time-ordered sequence of events (e.g., frames) and produces outputs (e.g., objects). The JSON lists the knobs as well as the implementation options. (An illustrative sketch of such a specification appears at the end of this subsection.)

Control Hierarchy: In Cascade's implementation, each organization runs a global manager. The global manager picks the appropriate query plans and places the different components of the organization's queries at the cameras, the private cluster, or the public cloud, with the objective of maximizing query accuracies (Section 5.5). Each location (e.g., the private cluster) has a local manager that monitors the components running locally and reports resource usage to the global manager.
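As a purely illustrative sketch of the query specification mentioned above, the fragment below constructs a hypothetical tracking pipeline (detector followed by associator) and serializes it to JSON. All field names, implementation names, and knob values are assumptions; Cascade's actual JSON schema is not reproduced here.

    import json

    # Hypothetical specification of a D -> A tracking pipeline.
    # Field names ("pipeline", "implementations", "knobs", ...) are illustrative.
    tracking_query = {
        "query": "tracker",
        "camera": "cam-012",
        "pipeline": [
            {
                "component": "detector",
                "implementations": ["background_subtraction", "dnn"],
                "knobs": {"frame_rate": [1, 5, 30], "resolution": ["480p", "720p", "1080p"]},
            },
            {
                "component": "associator",
                "implementations": ["dnn", "sift", "color_histogram"],
                "knobs": {"window_frames": [5, 10]},
            },
        ],
    }

    print(json.dumps(tracking_query, indent=2))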
5.7 Evaluation

We evaluate Cascade with an Azure deployment emulating a hierarchy of clusters using representative video queries, and complement it with large-scale simulations. Our key results are:

1. Cascade outperforms fair allocation of resources by up to 15.7× in average accuracy, while being within 6% of the optimal accuracy.
2. Merging queries with common components improves the gains to 27.2× better accuracy.
3. Searching only the configurations in the Pareto band drops the heuristic's running time by 80% while still achieving 90% of the original accuracy.

5.7.1 Setup

Azure Deployment. We use a 24-node Azure cluster to emulate a hierarchical setup; each node is a D3v2 instance with 4 cores and 14 GB of memory. Ten of the nodes are the "camera compute nodes", with two cameras per node. The 20 cameras "play" feeds from 20 recorded streams from many cities in the USA at their original resolution and frame rate. Two nodes act as a private cluster. Each camera has a 600 Kb/s link to the private cluster, resembling the bandwidths available today. The cloud consists of 12 nodes with a 5 Mb/s uplink from the private cluster. We analyze sensitivity to these parameters in Section 5.7.8. We also use trace-based simulations to evaluate large-scale settings under various resource capacities.

Video Queries. We profile and evaluate using the following queries: tracker, DNN object classifier, car counter, and license plate reader. The queries have 300, 20, 10, and 30 query plans, respectively, from different implementation and knob choices. Each query has two components, and among the three clusters in the hierarchy there are six placement options per query: both components in the same cluster, or each in a different cluster (we avoid placements that go "down" from the cloud). We use 200 5-minute video clips from many locations and times of day, and hence have 200 profiles.

Baselines. We compare against four approaches. (1) Fair scheduling is our primary baseline, since it is widely used in production clusters [29, 78]. Specifically, we extend the definition of fairness used within a cluster – 1/n of the resources in a cluster given n queries – to multiple clusters by allocating 1/n of each resource in each cluster. Within this fair allocation, each query picks the configuration (query plan and placement) that achieves the highest accuracy. (2) We compare Cascade against the optimal planning and placement (Section 5.5.1). (3) We run Cascade with a static placement of all components in either the camera, the private cluster, or the cloud; this is often the approach used today. (4) We compare against recent work on video analytics, VideoStorm [169], a single-cluster, single-resource video query planner. We use the aggregate CPU resources of all the clusters for its planning. Being a single-cluster solution, it does not include placements, so we assume it places its components randomly in the hierarchy; we ensure that it avoids infeasible placements (corresponding to insufficient network resources).

Performance Metric. Our performance metric is the average accuracy across all queries. We report the relative improvement over the baseline approaches; given that the accuracies of Cascade and a baseline are A_c and A_b, we report A_c / A_b.

5.7.2 Improvement in Accuracy

We first describe results from our Azure deployment without merging of queries; we show results with merging in Section 5.7.5. In the deployment, we run the tracking query and vary the average number of queries per camera (we have 20 cameras in our setup) from 1 to 5, assigning them uniformly to the cameras. Each camera runs a randomly chosen video from the 200 options.

Figure 5.15: Comparing Cascade's accuracy to baselines. Accuracies are normalized by Optimal Allocation.

Figure 5.16: Distribution of accuracies across queries.

Figure 5.15 shows the average accuracy with Cascade relative to the optimal accuracy obtained by solving the BIP optimization (Section 5.5.1); the absolute values of the accuracies range between 0.6 and 0.7. Even as the number of queries in the system increases, Cascade's accuracy is within 6% of the optimal. In fact, with 5 queries executing on every camera stream, Cascade's accuracy is 15.7× better than the fair scheduler. Figure 5.16 presents the distribution of the absolute query accuracies with the "2-queries-per-camera" setting.
Cascade's CDF closely matches the optimal, which shows that its greedy heuristic search in the Pareto band is near-optimal even at a per-query level, not just in aggregate. VideoStorm's accuracy is better than that of fair allocation but far lower than that of Cascade and the optimal allocation. It suffers because it does not consider the cluster hierarchy and the network bottlenecks in its decisions.

Figure 5.17: Distribution of costs of the query configurations.

Resource utilization. The key to Cascade's performance is effective utilization of resources by smartly picking plans and placements. The CPU and network utilizations are all above 85%, showing the effectiveness of Cascade in balancing the load across clusters and avoiding bottlenecks. This is mainly due to our cost metric of dominant utilization (Section 5.5.3), which prevents any single resource from being disproportionately utilized. This is supported by Figure 5.17, which shows that the query costs achieved by Cascade are significantly higher than with fair allocation and VideoStorm, and similar to the optimal, thus leading to higher utilizations and accuracies.

Figure 5.18: Choice of placements for components.

Figure 5.18 presents the distribution of the six options for placing the detector and the associator. Cascade chooses to place both components of a query at the same location, i.e., "Camera-Camera", "Cluster-Cluster", or "Cloud-Cloud", for 93% of the queries. As a result, the intermediate data between the components does not use the network resources between clusters, thus avoiding contention. With VideoStorm's random placement strategy, 23% of the queries have components placed across clusters, which adversely impacts its query plans and accuracies.

Although these results show that Cascade places a significant fraction of the queries' components at the same location to reduce inter-site bandwidth consumption, placing all components at the same location for all the queries does not lead to optimal results.

Figure 5.19: Comparison between Cascade and a placement-restricted Cascade.

Figure 5.19 compares Cascade with a placement-restricted version of Cascade, termed Cascade-SL, that places each query's components at the same location (i.e., either at the camera, the private cluster, or the public cloud, according to Cascade's heuristic) for all the queries. In this figure, the left two bars represent the normalized accuracy of Cascade and Cascade-SL, while the third bar represents the percentage of queries whose components Cascade places entirely at the same location. The X-axis represents the bandwidth capacity of the evaluation setting, specified as a multiplier of the default bandwidth capacity.
For example, when the multiplier is 2, the bandwidth capacity is 2× the capacity when the multiplier is 1. The results show that, when the bandwidth resource is scarce, Cascade and Cascade-SL achieve similar accuracy, as saving inter-site bandwidth is critical to utilizing higher-accuracy configurations; 94% of the queries in Cascade have same-location placement when the bandwidth multiplier is 1. As the bandwidth capacity increases, fewer of the queries placed by Cascade have same-location placement, and the gap between Cascade and Cascade-SL grows: when the multiplier is 16, Cascade outperforms Cascade-SL by 20% in accuracy, while placing only 83% of the queries' components at the same location. This is because, when bandwidth resources are abundant, Cascade is able to utilize the compute resources across the sites without worrying about draining inter-site bandwidth, resulting in higher accuracy through better planning and placement. The results verify that, although same-location placement helps save inter-site bandwidth, rigidly placing all components at the same location for all the queries may lead to worse accuracy; more prudent decisions in planning and placement are required.

5.7.3 Gains by Different Query Types

We also evaluate a wider mix of query types – object tracker, DNN classifier, car counter, license plate reader – in our simulator. We profile these queries using our profiler (see Section 5.6.1) and feed the profiles to the simulator. The cluster settings are similar to the deployment.

Figure 5.20: Cascade's accuracy under various query types, normalized by Optimal Allocation (LPR: 0.87, Classification: 1.02, Counting: 0.87, Tracking: 1.02).

Figure 5.20 shows that the license plate reader and the car counter are farther from the optimal (87% of optimal), unlike the other two query types, which are near-optimal. This is explained by the resource-accuracy profiles of the queries. The license plate reader and car counter queries, beyond a certain accuracy, have inefficient profiles (see Section 5.5.4): the additional resources needed to improve their accuracy are higher compared to the object tracker and DNN queries, which have many more efficient configurations. As a result, our heuristic favors the object tracker and DNN classifier queries over the license plate reader and car counter queries and assigns the former two more resources, which results in higher accuracy for those query types as compared to Optimal Allocation.

Figure 5.21: Cascade's gains in accuracy under various query types compared to Fair Allocation (LPR: 6.47×, Classification: 5.10×, Counting: 6.53×, Tracking: 13.33×).

Figure 5.21 shows Cascade's gains in accuracy compared to Fair Allocation. Cascade's gains on DNN classification queries are smaller because this query type has a small variance in its accuracy-to-resource ratio; as a result, Fair Allocation performs relatively well on it compared to the other query types.

Figure 5.22 shows Cascade's normalized accuracy under various mixes of tracking and license plate reader queries.
Figure 5.22: Cascade's normalized accuracy under various mixes of tracking and license plate reader queries.

    Query mix                   Overall   Tracking Queries   LPR Queries
    100% LPR                    0.91      N/A                0.91
    20% Tracking + 80% LPR      0.92      1.05               0.89
    40% Tracking + 60% LPR      0.93      1.01               0.88
    60% Tracking + 40% LPR      0.95      0.98               0.91
    80% Tracking + 20% LPR      0.97      0.98               0.93
    100% Tracking               0.98      0.98               N/A

As the portion of tracking queries increases, Cascade achieves better overall accuracy. This is because, as mentioned earlier, tracking queries have relatively efficient profiles compared to license plate reader queries; having more tracking queries in the mix gives Cascade greater opportunity to make query planning decisions that are closer to those made by Optimal Allocation.

5.7.4 Static Placements

Figure 5.23: Cascade's gains over restricted placements.

Next, we evaluate Cascade with query placement constrained, as in many production video analytics deployments, to one level of the hierarchy. All the queries run (i) on their corresponding camera, (ii) in the private cluster, or (iii) in the cloud. We still consider the network constraints between the clusters, i.e., the output of the query must be transferred to the cloud no matter where the components are placed. Figure 5.23 shows the ratio of Cascade's accuracies (without merging) over each of the three constrained approaches. As the number of queries (per camera) increases, the compute or network resources become saturated and cannot support the queries at high accuracies. For example, with just one query per camera, we can stream all video to the cloud and process it there; however, as the load increases, this becomes harder, and Cascade can run the queries with up to 3× higher accuracy. We achieve the largest gains against the camera-only constraint because the cameras have the least compute resources available. These results highlight the value of utilizing all the available resources across the hierarchy of clusters for video analytics.

5.7.5 Gains with Merging

Our results thus far have been without merging of common components in queries; we now measure the gains with merging (see Section 5.5.6). Despite the queries having similar pipelines, recall that merging can lead to "conflicts" between queries when they have different resource-accuracy profiles, thus requiring us to weigh the resource gains from merging against any loss in accuracy.

Figure 5.24: Gains in merging common query components.

Figure 5.24 shows the gains of Cascade with merging as the average number of queries per camera ranges from 1 to 5. With five queries per camera, Cascade with merging achieves 27.2× gains in accuracy compared to fair allocation, which is 1.8× higher than our gains without merging. Notably, blindly merging whenever possible, without considering conflicts ("Merge Everything"), undercuts our gains. This underscores the importance of the careful decisions in Section 5.5.6.

5.7.6 Scalability with Pareto Band
Figure 5.25: Accuracy achieved and running time as a function of the Pareto band width, relative to using all the configurations.

Recall from Section 5.5 that we narrow down our search space by using a Pareto band of promising configurations. The smaller the width of the band (δ), the faster the running time of the heuristic, at the expense of lower accuracy. Figure 5.25 presents the impact of δ on the accuracy and running time, normalized by the approach that considers all the configurations (i.e., δ → ∞). Even with δ = 1, i.e., restricting the search to the Pareto boundary itself, the relative accuracy is 80% of that obtained by searching through the entire space. With δ = 2, we achieve a relative accuracy of over 90% in under a fifth of the time. Thus, we use the Pareto band with δ = 2 in our system.

Figure 5.26: Algorithms' running time.

We also evaluate the running time of the heuristic. Figure 5.26 compares Cascade's running time to that of solving the BIP optimization (see Section 5.5.1); the left Y-axis shows the BIP's time while the right Y-axis shows Cascade's relative running time. The BIP optimization's complexity grows exponentially with nm, whereas our heuristic's complexity is O(n²m²) for n queries each with m configurations, making it considerably faster. Cascade takes only 0.09%-3.7% of the time required to solve the BIP optimization.

5.7.7 Cost and Efficiency Metrics

In our heuristic, we made two important choices. (a) For the cost metric in Section 5.5.3, instead of the dominant utilization (S_{i,j,k} = max_l D^l_{i,j,k} / C_l), if we had used the sum of utilizations (Σ_l D^l_{i,j,k} / C_l), our achieved accuracy would drop by 11% in our experiments. This is significant, as improvements of this magnitude are the focus of many computer vision research techniques. (b) In Section 5.5.4, we compute the efficiency of a new query configuration (j', k') when switching from (j, k) using the difference in accuracy as well as the difference in cost. Our evaluation suggests that using only the new accuracy and the new cost, or any combination that does not use the difference in both, lowers the achieved accuracy by nearly 10 percentage points.
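The following small sketch illustrates the cost-metric choice in (a) above, computing both the dominant utilization used by our heuristic and the sum-of-utilizations alternative for a hypothetical placement; the resource names and numbers are illustrative.

    def dominant_utilization(demands, capacities):
        """Cost metric used by the heuristic: the utilization of the most
        heavily used resource, S = max_l D_l / C_l."""
        return max(demands[r] / capacities[r] for r in demands)

    def sum_utilization(demands, capacities):
        """Alternative considered in Section 5.7.7: sum of utilizations."""
        return sum(demands[r] / capacities[r] for r in demands)

    # Example: a placement needing 2 cores and 4 Mb/s of uplink in a cluster
    # with 8 cores and 5 Mb/s of uplink (illustrative numbers).
    demands = {"cpu_cores": 2.0, "uplink_mbps": 4.0}
    capacities = {"cpu_cores": 8.0, "uplink_mbps": 5.0}
    print(dominant_utilization(demands, capacities))  # 0.8, the network dominates
    print(sum_utilization(demands, capacities))       # 1.05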
5.7.8 Performance Sensitivity

Finally, we evaluate the sensitivity of Cascade's performance to different resource capacities.

Figure 5.27: Cascade's performance sensitivity compared to Optimal Allocation.

Figure 5.27 shows Cascade's normalized accuracy compared to Optimal Allocation under various sensitivity settings, scaling one type of resource at a time. Under most settings Cascade achieves a steady normalized accuracy (around 94%), except when the cameras' CPU and bandwidth are scaled up (around 80%). This is because Cascade applies a greedy heuristic that places the queries' components towards the cloud as a tie-breaker, which results in larger differences from Optimal Allocation when there are more resources at the cameras.

5.8 Conclusions

Analyzing live video streams over hierarchical clusters is a problem of considerable importance. Video analytics queries have multiple implementations and knobs that determine their accuracy and resource demand. We devise Cascade to determine these choices, place the queries across the hierarchy, and merge queries with common processing components. To navigate the exponentially large search space, we identify the most promising options in a "Pareto band" and search only within the band. We also devise a multi-resource, multi-cluster resource cost metric to compare configurations within the band. Our deployment and simulations with real-world video queries show promising results, within 6% of optimal planning. We do note that our merging heuristic is simplified for efficiency and can be improved.

Chapter 6

Summary

In this thesis, I address the problem of resource scheduling for geo-distributed computing.

In my first work (Chapter 3), I focus on scheduling compute resources for geo-distributed jobs across multiple datacenters. Specifically, I propose two heuristics, Reordering and SWAG, for reducing the average job response time. Reordering is a light-weight add-on that is guaranteed to improve the overall job response time by delaying non-urgent tasks without degrading any individual job's response time, while SWAG provides more aggressive improvements by favoring the jobs that can finish quickly, which leads to less waiting time for the remaining jobs.

In my second work (Chapter 4), I focus on scheduling both computation and network resources, and address both job scheduling and task placement in geo-distributed computing. I formulate the task placement problem as a linear programming optimization that considers both network and computation resources in order to optimize a job's total duration. I also develop a heuristic that jointly determines job scheduling and task placement for reducing the average job response time by greedily finishing the jobs with the minimum remaining workloads. The proposed heuristic can also incorporate other performance metrics through simple control knobs.

In my third work (Chapter 5), I focus on jointly determining job configuration and placement for concurrent jobs through multi-resource allocation in order to maximize the average quality of geo-distributed jobs. I propose an efficient heuristic that greedily improves accuracy for each job and reduces the large decision space by using the concept of a Pareto boundary to prune non-promising configurations.

In summary, the main contribution of this thesis is to formulate and solve the resource scheduling problem in geo-distributed computing environments, which requires joint decisions on job scheduling, job configuration, and task placement. The thesis starts with a simplified version of the problem that addresses job scheduling only (Chapter 3), then adds one more sub-problem at a time (task placement in Chapter 4, and job configuration in Chapter 5). In addition, the thesis covers two types of computing workloads: batch analytics (Chapters 3 and 4) and streaming analytics (Chapter 5). The problem studied in this thesis had not previously been addressed even in related fields such as database systems, sensor networks, grid computing, and stream analytics; I therefore believe that the principles and insights derived from this thesis can also be applied to these related areas of distributed systems. This thesis addresses the resource scheduling problem in a computing environment characterized by heterogeneity in resource availability and data distribution.
Future directions extending from this thesis include optimizing resource allocation for application-specific scenarios, such as those involving different data types (numeric, text, audio, video, or other multimedia formats), different performance requirements (e.g., resource utilization, power efficiency, QoS-aware metrics, or a mix of multiple requirements), or general-purpose computing infrastructure that simultaneously supports different types of applications (e.g., running machine learning training jobs and a query engine on the same cluster). These directions add further dimensions of heterogeneity to the problem, which we believe is challenging to solve yet provides significant value for both current and future computing requirements.

References

[1] Apache Calcite - a dynamic data management framework. http://calcite.incubator.apache.org. Accessed 04-27-2015.
[2] Apache Storm. https://storm.apache.org/.
[3] Avigilon. http://avigilon.com/products/.
[4] AXIS camera application platform. https://goo.gl/tqmBEy. Accessed 01-25-2016.
[5] Cloud vs edge in an IoT world. https://iotworldnews.com/2016/04/cloud-vs-edge-in-an-iot-world/.
[6] Genetec. https://www.genetec.com/.
[7] http://aws.amazon.com/about-aws/global-infrastructure/. Amazon Global Infrastructure.
[8] http://hadoop.apache.org/. Hadoop Cluster Computing System and Distributed File System.
[9] https://amplab.cs.berkeley.edu/benchmark/. AMPLab Big Data Benchmark.
[10] https://code.google.com/p/googleclusterdata/. Google Cluster Workload Traces.
[11] https://ipp.mit.edu/sites/default/files/documents/congestion-handout-final.pdf, 2014. Measuring Internet Congestion: A preliminary report.
[12] http://www.gurobi.com/. Gurobi Optimization.
[13] http://www.microsoft.com/en-us/server-cloud/cloud-os/global-datacenters.aspx. Microsoft Cloud Platform.
[14] http://www.tpc.org/tpcds/. TPC Benchmark Standard for Decision Support Solutions Including Big Data.
[15] Introduction to SIFT (Scale-Invariant Feature Transform). http://docs.opencv.org/3.1.0/da/df5/tutorial_py_sift_intro.html.
[16] Introduction to SURF (Speeded-Up Robust Features). http://docs.opencv.org/3.0-beta/doc/py_tutorials/py_feature2d/py_surf_intro/py_surf_intro.html.
[17] Key Azure StorSimple features. https://www.microsoft.com/en-us/server-cloud/products/storsimple/Features.aspx.
[18] NYPD expands surveillance net to fight crime as well as terrorism. https://goo.gl/Y9OKh0. Accessed 01-25-2016.
[19] Private conversation with datacenter operators of one of the largest public cloud providers; anonymized. 2016.
[20] Daniel J. Abadi, Yanif Ahmad, Magdalena Balazinska, Ugur Cetintemel, Mitch Cherniack, Jeong-Hyon Hwang, Wolfgang Lindner, Anurag Maskey, Alex Rasin, Esther Ryvkina, et al. The design of the Borealis stream processing engine. In CIDR, volume 5, pages 277-289, 2005.
[21] Sameer Agarwal, Barzan Mozafari, Aurojit Panda, Henry Milner, Samuel Madden, and Ion Stoica. BlinkDB: Queries with bounded errors and bounded response times on very large data. In Proceedings of the 8th ACM European Conference on Computer Systems, pages 29-42. ACM, 2013.
[22] Ian F. Akyildiz, Dario Pompili, and Tommaso Melodia. Underwater acoustic sensor networks: research challenges. Ad Hoc Networks, 3(3):257-279, 2005.
[23] G. Ananthanarayanan, S. Kandula, A. Greenberg, I. Stoica, Y. Lu, B. Saha, and E. Harris. Reining in the Outliers in Map-Reduce Clusters using Mantri. In USENIX OSDI, 2010.
[24] Ganesh Ananthanarayanan, Sameer Agarwal, Srikanth Kandula, Albert Green- berg, Ion Stoica, Duke Harlan, and Ed Harris. Scarlett: Coping with skewed content popularity in mapreduce clusters. In Proceedings of the Sixth Conference on Computer Systems (EuroSys), 2011. [25] Ganesh Ananthanarayanan, Ali Ghodsi, Scott Shenker, and Ion Stoica. Why let resources idle? aggressive cloning of jobs with dolly. In Proceedings of the 4th USENIX conference on Hot Topics in Cloud Ccomputing, HotCloud’12, Boston, MA, 2012. 145 [26] Ganesh Ananthanarayanan, Ali Ghodsi, Andrew Wang, Dhruba Borthakur, Srikanth Kandula, Scott Shenker, and Ion Stoica. Pacman: Coordinated mem- ory caching for parallel jobs. In Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation (NSDI), 2012. [27] Ganesh Ananthanarayanan, Chien-Chun Hung, Xiaoqi Ren, Ion Stoica, Adam Wierman, and Minlan Yu. Grass: trimming stragglers in approximation analytics. In USENIX NSDI, 2014. [28] Ganesh Ananthanarayanan, Michael Chien-Chun Hung, Xiaoqi Ren, Ion Stoica, Adam Wierman, and Minlan Yu. GRASS: Trimming Stragglers in Approxima- tion Analytics. USENIX NSDI, 2014. [29] Apache Hadoop NextGen MapReduce (YARN). Retrieved 9/24/2013, URL: http://hadoop.apache.org/docs/current/hadoop-yarn/ hadoop-yarn-site/YARN.html. [30] Michael Armbrust, Reynold S. Xin, Cheng Lian, Yin Huai, Davies Liu, Joseph K. Bradley, Xiangrui Meng, Tomer Kaftan, Michael J. Franklin, Ali Ghodsi, and Matei Zaharia. Spark SQL: Relational data processing in Spark. In SIGMOD, 2015. [31] Luigi Atzori, Antonio Iera, and Giacomo Morabito. The internet of things: A survey. Computer networks, 54(15):2787–2805, 2010. [32] Haowei Bai, Mohammed Atiquzzaman, and David Lilja. Wireless sensor network for aircraft health monitoring. In Broadband Networks, 2004. BroadNets 2004. Proceedings. First International Conference on, pages 748–750. IEEE, 2004. [33] Rajesh Krishna Balan, Darren Gergle, Mahadev Satyanarayanan, and James Herbsleb. Simplifying cyber foraging for mobile devices. In Proceedings of the 5th International Conference on Mobile Systems, Applications and Services, MobiSys ’07, pages 272–285, New York, NY , USA, 2007. ACM. [34] Rajesh Krishna Balan, Mahadev Satyanarayanan, So Young Park, and Tadashi Okoshi. Tactics-based remote execution for mobile computing. In Proceedings of the 1st International Conference on Mobile Systems, Applications and Services, MobiSys ’03, pages 273–286, New York, NY , USA, 2003. ACM. [35] Hitesh Ballani, Paolo Costa, Thomas Karagiannis, and Ant Rowstron. Towards predictable datacenter networks. In ACM SIGCOMM Computer Communication Review, volume 41, pages 242–253. ACM, 2011. [36] Nikhil Bansal and Mor Harchol-Balter. Analysis of SRPT scheduling: Investigat- ing unfairness. ACM, 2001. 146 [37] Fran Berman, Geoffrey Fox, and Anthony JG Hey. Grid computing: making the global infrastructure a reality, volume 2. John Wiley and sons, 2003. [38] Eric Boutin, Jaliya Ekanayake, Wei Lin, Bing Shi, Jingren Zhou, Zhengping Qian, Ming Wu, and Lidong Zhou. Apollo: Scalable and coordinated scheduling for cloud-scale computing. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), pages 285–300, Broomfield, CO, 2014. USENIX Association. [39] Matt Calder, Xun Fan, Zi Hu, Ethan Katz-Bassett, John Heidemann, and Ramesh Govindan. Mapping the expansion of google’s serving infrastructure. In Pro- ceedings of the 2013 Conference on Internet Measurement Conference (IMC), 2013. 
[40] Badrish Chandramouli, Jonathan Goldstein, Mike Barnett, Robert DeLine, Danyel Fisher, John C. Platt, James F. Terwilliger, and John Wernsing. Trill: A High-Performance Incremental Query Processor for Diverse Analytics. In VLDB, 2014. [41] Jae-Hwan Chang and Leandros Tassiulas. Energy conserving routing in wireless ad-hoc networks. In INFOCOM 2000. Nineteenth Annual Joint Conference of the IEEE Computer and Communications Societies. Proceedings. IEEE, volume 1, pages 22–31. IEEE, 2000. [42] Jae-Hwan Chang and Leandros Tassiulas. Maximum lifetime routing in wireless sensor networks. IEEE/ACM Transactions on networking, 12(4):609–619, 2004. [43] Surajit Chaudhuri. An overview of query optimization in relational systems. In Proceedings of the Seventeenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, PODS ’98, pages 34–43, New York, NY , USA, 1998. ACM. [44] Mitch Cherniack, Hari Balakrishnan, Magdalena Balazinska, Donald Carney, Ugur Cetintemel, Ying Xing, and Stan Zdonik. Scalable Distributed Stream Pro- cessing. In CIDR 2003 - First Biennial Conference on Innovative Data Systems Research, Asilomar, CA, January 2003. [45] Krishna Chintalapudi, Tat Fu, Jeongyeup Paek, Nupur Kothari, Sumit Rangwala, John Caffrey, Ramesh Govindan, Erik Johnson, and Sami Masri. Monitoring civil structures with a wireless sensor network. IEEE Internet Computing, 10(2):26– 34, 2006. [46] Mosharaf Chowdhury and Ion Stoica. Efficient coflow scheduling without prior knowledge. In Proceedings of the 2015 ACM Conference on SIGCOMM, 2015. 147 [47] Mosharaf Chowdhury, Matei Zaharia, Justin Ma, Michael I. Jordan, and Ion Sto- ica. Managing data transfers in computer clusters with orchestra. In Proceedings of the ACM SIGCOMM 2011 Conference, 2011. [48] Mosharaf Chowdhury, Yuan Zhong, and Ion Stoica. Efficient coflow scheduling with varys. In Proceedings of the 2014 ACM Conference on SIGCOMM, 2014. [49] James C Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, JJ Furman, Sanjay Ghemawat, Andrey Gubarev, Christopher Heiser, Peter Hochschild, et al. Spanner: Googles globally distributed database. ACM Trans- actions on Computer Systems (TOCS), 2013. [50] Graham Cormode, Minos N. Garofalakis, Peter J. Haas, and Chris Jermaine. Syn- opses for massive data: Samples, histograms, wavelets, sketches. Foundations and Trends in Databases, 4(1-3):1–294, 2012. [51] Eduardo Cuervo, Aruna Balasubramanian, Dae-ki Cho, Alec Wolman, Stefan Saroiu, Ranveer Chandra, and Paramvir Bahl. Maui: making smartphones last longer with code offload. In Proceedings of the 8th international conference on Mobile systems, applications, and services, pages 49–62. ACM, 2010. [52] David DeWitt and Jim Gray. Parallel database systems: the future of high perfor- mance database systems. Communications of the ACM, 35(6):85–98, 1992. [53] Fahad R. Dogar, Thomas Karagiannis, Hitesh Ballani, and Antony Rowstron. Decentralized task-aware scheduling for data center networks. In Proceedings of the 2014 ACM Conference on SIGCOMM, 2014. [54] Jack Dongarra, Thomas Sterling, Horst Simon, and Erich Strohmaier. High- performance computing: clusters, constellations, mpps, and future directions. Computing in Science & Engineering, 7(2):51–59, 2005. [55] J. Du and J. Y .-T. Leung. Complexity of scheduling parallel task systems. SIAM J. Discret. Math., 1989. [56] E. Fasolo, M. Rossi, J. Widmer, and M. Zorzi. In-network aggregation tech- niques for wireless sensor networks: a survey. 
IEEE Wireless Communications, 14(2):70–87, April 2007. [57] Ian Foster and Carl Kesselman. The Grid 2: Blueprint for a new computing infrastructure. Elsevier, 2003. [58] Ian Foster, Yong Zhao, Ioan Raicu, and Shiyong Lu. Cloud computing and grid computing 360-degree compared. In Grid Computing Environments Workshop, 2008. GCE’08, pages 1–10. Ieee, 2008. 148 [59] Deepak Ganesan, Ramesh Govindan, Scott Shenker, and Deborah Estrin. Highly- resilient, energy-efficient multipath routing in wireless sensor networks. ACM SIGMOBILE Mobile Computing and Communications Review, 5(4):11–25, 2001. [60] Naveen Garg, Amit Kumar, and Vinayaka Pandit. Order scheduling models: hardness and algorithms. In FSTTCS 2007: Foundations of Software Technol- ogy and Theoretical Computer Science. Springer, 2007. [61] Ali Ghodsi, Matei Zaharia, Benjamin Hindman, Andy Konwinski, Scott Shenker, and Ion Stoica. Dominant resource fairness: Fair allocation of multiple resource types. In Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation (NSDI), 2011. [62] Ali Ghodsi, Matei Zaharia, Benjamin Hindman, Andy Konwinski, Scott Shenker, and Ion Stoica. Dominant resource fairness: Fair allocation of multiple resource types. In USENIX NSDI, 2011. [63] Ali Ghodsi, Matei Zaharia, Scott Shenker, and Ion Stoica. Choosy: Max-min fair sharing for datacenter jobs with constraints. In Proceedings of the 8th ACM European Conference on Computer Systems (EuroSys), 2013. [64] Goetz Graefe. Query evaluation techniques for large databases. ACM Computing Surveys (CSUR), 25(2):73–169, 1993. [65] Robert Grandl, Ganesh Ananthanarayanan, Srikanth Kandula, Sriram Rao, and Aditya Akella. Multi-resource packing for cluster schedulers. In Proceedings of the 2014 ACM Conference on SIGCOMM, 2014. [66] Robert Grandl, Srikanth Kandula, Sriram Rao, Aditya Akella, and Janardhan Kulkarni. GRAPHENE: Packing and Dependency-Aware Scheduling for Data- Parallel Clusters. In OSDI, 2016. [67] Xiaohui Gu, Klara Nahrstedt, Alan Messer, Ira Greenberg, and Dejan Miloji- cic. Adaptive offloading for pervasive computing. IEEE Pervasive Computing, 3(3):66–73, 2004. [68] Jayavardhana Gubbi, Rajkumar Buyya, Slaven Marusic, and Marimuthu Palaniswami. Internet of things (iot): A vision, architectural elements, and future directions. Future generation computer systems, 29(7):1645–1660, 2013. [69] Ashish Gupta, Fan Yang, Jason Govig, Adam Kirsch, Kelvin Chan, Kevin Lai, Shuo Wu, Sandeep Govind Dhoot, Abhilash Rajesh Kumar, Ankur Agiwal, et al. Mesa: Geo-replicated, near real-time, scalable data warehousing. In Proceedings of the VLDB Endowment, 2014. 149 [70] Laura M Haas, Johann Christoph Freytag, Guy M Lohman, and Hamid Pirahesh. Extensible query processing in Starburst, volume 18. ACM, 1989. [71] Mohammad Hajjat, David Maltz, Sanjay Rao, Kunwadee Sripanidkulchai, et al. Dealer: application-aware request splitting for interactive cloud applications. In ACM CoNEXT, 2012. [72] Seungyeop Han, Haichen Shen, Matthai Philipose, Sharad Agarwal, Alec Wol- man, and Arvind Krishnamurthy. Mcdnn: An approximation-based execution framework for deep stream processing under resource constraints. In Proceed- ings of the 14th Annual International Conference on Mobile Systems, Applica- tions, and Services, MobiSys ’16, 2016. [73] Mor Harchol-Balter, Bianca Schroeder, Nikhil Bansal, and Mukesh Agrawal. Size-based scheduling to improve web performance. ACM Transactions on Com- puter Systems (TOCS), 2003. [74] Xiaoshan He, X Sun, and G Laszewski. 
A qos guided scheduling algorithm for grid computing. In Proc. of the Intl Workshop on Grid and Cooperative Comput- ing, 2002. [75] John Heidemann, Milica Stojanovic, and Michele Zorzi. Underwater sen- sor networks: applications, advances and challenges. Phil. Trans. R. Soc. A, 370(1958):158–175, 2012. [76] Joseph M Hellerstein and Michael Stonebraker. Predicate migration: Optimizing queries with expensive predicates, volume 22. ACM, 1993. [77] Herodotos Herodotou, Nedyalko Borisov, and Shivnath Babu. Query optimiza- tion techniques for partitioned tables. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of data, pages 49–60. ACM, 2011. [78] Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony D Joseph, Randy Katz, Scott Shenker, and Ion Stoica. Mesos: A platform for fine- grained resource sharing in the data center. In NSDI, 2011. [79] Kevin Hsieh, Aaron Harlap, Nandita Vijaykumar, Dimitris Konomis, Gregory R. Ganger, Phillip B. Gibbons, and Onur Mutlu. Gaia: Geo-Distributed Machine Learning Approaching LAN Speeds. In NSDI, 2017. [80] Wei Huang, Jiuxing Liu, Bulent Abali, and Dhabaleswar K Panda. A case for high performance computing with virtual machines. In Proceedings of the 20th annual international conference on Supercomputing, pages 125–134. ACM, 2006. 150 [81] Chien-Chun Hung, Leana Golubchik, and Minlan Yu. Scheduling jobs across geo-distributed datacenters. In Proceedings of the Sixth ACM Symposium on Cloud Computing (SoCC), 2015. [82] Michael Chien-Chun Hung, Kate Ching-Ju Lin, Chih-Cheng Hsu, Cheng-Fu Chou, and Chang-Jen Tu. On enhancing network-lifetime using opportunistic routing in wireless sensor networks. In ICCCN, pages 1–6, 2010. [83] Jeong-Hyon Hwang, Ugur Cetintemel, and Stan Zdonik. Fast and reliable stream processing over wide area networks. In IEEE ICDE Workshop, 2007. [84] Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly. Dryad: distributed data-parallel programs from sequential building blocks. In ACM SIGOPS Operating Systems Review, 2007. [85] Michael Isard, Vijayan Prabhakaran, Jon Currey, Udi Wieder, Kunal Talwar, and Andrew Goldberg. Quincy: Fair scheduling for distributed computing clusters. In ACM SOSP, 2009. [86] Keith R Jackson, Lavanya Ramakrishnan, Krishna Muriki, Shane Canon, Shreyas Cholia, John Shalf, Harvey J Wasserman, and Nicholas J Wright. Performance analysis of high performance computing applications on the amazon web services cloud. In Cloud Computing Technology and Science (CloudCom), 2010 IEEE Second International Conference on, pages 159–168. IEEE, 2010. [87] Sushant Jain, Alok Kumar, Subhasree Mandal, Joon Ong, Leon Poutievski, Arjun Singh, Subbaiah Venkata, Jim Wanderer, Junlan Zhou, Min Zhu, Jon Zolla, Urs H¨ olzle, Stephen Stuart, and Amin Vahdat. B4: Experience with a globally- deployed software defined wan. In Proceedings of the ACM SIGCOMM 2013 Conference on SIGCOMM, SIGCOMM ’13, pages 3–14, New York, NY , USA, 2013. ACM. [88] Virajith Jalaparti, Peter Bodik, Ishai Menache, Sriram Rao, Konstantin Makarychev, and Matthew Caesar. Network-aware scheduling for data-parallel jobs: Plan when you can. In Proceedings of the 2015 ACM Conference on Special Interest Group on Data Communication, 2015. [89] Matthias Jarke and Jurgen Koch. Query optimization in database systems. ACM Comput. Surv., 16(2):111–152, June 1984. [90] Junchen Jiang, Rajdeep Das, Ganesh Ananthanarayanan, Philip A. 
Chou, Venkata Padmanabhan, Vyas Sekar, Esbjorn Dominique, Marcin Goliszewski, Dalibor Kukoleca, Renat Vafin, and Hui Zhang. Via: Improving internet telephony call quality using predictive relay selection. In Proceedings of the ACM Conference on Special Interest Group on Data Communication (SIGCOMM), 2016. 151 [91] Justin Johnson, Andrej Karpathy, and Li Fei-Fei. DenseCap: Fully Convolutional Localization Networks for Dense Captioning. In CVPR, 2016. [92] Jon Kleinberg and Eva Tardos. Algorithm Design. Addison-Wesley Longman Publishing Co., Inc., 2005. [93] Paraschos Koutris and Dan Suciu. Parallel evaluation of conjunctive queries. In Proceedings of ACM Symposium on Principles of Database Systems (PODS), 2011. [94] Tim Kraska, Gene Pang, Michael J Franklin, Samuel Madden, and Alan Fekete. Mdcc: Multi-data center consistency. In ACM EuroSys, 2013. [95] L. Krishnamachari, D. Estrin, and S. Wicker. The impact of data aggregation in wireless sensor networks. In Proceedings 22nd International Conference on Distributed Computing Systems Workshops, pages 575–578, 2002. [96] YongChul Kwon, Magdalena Balazinska, Bill Howe, and Jerome Rolia. Skew- tune: mitigating skew in mapreduce applications. In Proceedings of the ACM International Conference on Management of Data (SIGMOD), 2012. [97] Alon Y Levy, Inderpal Singh Mumick, and Yehoshua Sagiv. Query optimization by predicate move-around. In VLDB, pages 96–107, 1994. [98] Zhiyuan Li, Cheng Wang, and Rong Xu. Task allocation for distributed mul- timedia processing on wirelessly networked handheld devices. In Parallel and Distributed Processing Symposium., Proceedings International, IPDPS 2002, Abstracts and CD-ROM, pages 6–pp. IEEE, 2001. [99] Robert LiKamWa and Lin Zhong. Starfish: Efficient concurrency support for computer vision applications. In Proceedings of the 13th Annual International Conference on Mobile Systems, Applications, and Services, pages 213–226. ACM, 2015. [100] Minghong Lin, Adam Wierman, and Bert Zwart. The average response time in a heavy-traffic srpt queue. ACM SIGMETRICS Performance Evaluation Review, 2010. [101] Minghong Lin, Adam Wierman, and Bert Zwart. Heavy-traffic analysis of mean response time under shortest remaining processing time. Performance Evalua- tion, 2011. [102] Hongqiang Harry Liu, Raajay Viswanathan, Matt Calder, Aditya Akella, Ratul Mahajan, Jitendra Padhye, and Ming Zhang. Efficiently delivering online services over integrated infrastructure. In USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2016. 152 [103] Guy Maring Lohman, Eugene Jon Shekita, David E Simmen, and Mon- ica Sachiye Urata. Relational database query optimization to perform query eval- uation plan, pruning based on the partition properties, July 18 2000. US Patent 6,092,062. [104] Yao Lu, Aakanksha Chowdhery, and Srikanth Kandula. Optasia: A relational platform for efficient large-scale video analytics. In Proceedings of the Seventh ACM Symposium on Cloud Computing, pages 57–70. ACM, 2016. [105] Gang Luo, Jeffrey F Naughton, Curt J Ellmann, and Michael W Watzke. Toward a progress indicator for database queries. In Proceedings of the 2004 ACM SIG- MOD international conference on Management of data, pages 791–802. ACM, 2004. [106] Samuel Madden, Michael J. Franklin, Joseph M. Hellerstein, and Wei Hong. Tag: A tiny aggregation service for ad-hoc sensor networks. SIGOPS Oper. Syst. Rev., 36(SI):131–146, December 2002. [107] Samuel Madden, Mehul Shah, Joseph M Hellerstein, and Vijayshankar Raman. 
Continuously adaptive continuous queries over streams. In Proceedings of the 2002 ACM SIGMOD international conference on Management of data, pages 49–60. ACM, 2002. [108] Alan Mainwaring, David Culler, Joseph Polastre, Robert Szewczyk, and John Anderson. Wireless sensor networks for habitat monitoring. In Proceedings of the 1st ACM international workshop on Wireless sensor networks and applications, pages 88–97. Acm, 2002. [109] Michael V Mannino, Paicheng Chu, and Thomas Sager. Statistical profile esti- mation in database systems. ACM Computing Surveys (CSUR), 20(3):191–221, 1988. [110] Monaldo Mastrolilli, Maurice Queyranne, Andreas S. Schulz, Ola Svensson, and Nelson A. Uhan. Minimizing the sum of weighted completion times in a concur- rent open shop. Operation Research Letter, 2010. [111] Hoshi Mistry, Prasan Roy, S. Sudarshan, and Krithi Ramamritham. Materialized view selection and maintenance using multi-query optimization. In Proceedings of the 2001 ACM SIGMOD International Conference on Management of Data, SIGMOD ’01, pages 307–318, New York, NY , USA, 2001. ACM. [112] Hyun J Moon, Carlo A Curino, Alin Deutsch, Chien-Yi Hou, and Carlo Zaniolo. Managing and querying transaction-time databases under schema evolution. Pro- ceedings of the VLDB Endowment, 1(1):882–895, 2008. 153 [113] Rajeev Motwani, Jennifer Widom, Arvind Arasu, Brian Babcock, Shivnath Babu, Mayur Datar, Gurmeet Manku, Chris Olston, Justin Rosenstein, and Rohit Varma. Query processing, resource management, and approximation in a data stream management system. CIDR, 2003. [114] Subramanian Muralidhar, Wyatt Lloyd, Sabyasachi Roy, Cory Hill, Ernest Lin, Weiwen Liu, Satadru Pan, Shiva Shankar, Viswanath Sivakumar, Linpeng Tang, et al. f4: Facebook warm blob storage system. In USENIX OSDI, 2014. [115] Arun C Murthy, Chris Douglas, Mahadev Konar, Owen OMalley, Sanjay Radia, Sharad Agarwal, and KV Vinod. Architecture of Next Generation Apache Hadoop MapReduce framework. Technical report, Apache Hadoop community, 2011. [116] Thomas Neumann and Guido Moerkotte. Characteristic sets: Accurate cardinal- ity estimation for rdf queries with multiple joins. In Data Engineering (ICDE), 2011 IEEE 27th International Conference on, pages 984–994. IEEE, 2011. [117] Kay Ousterhout, Aurojit Panda, Joshua Rosen, Shivaram Venkataraman, Reynold Xin, Sylvia Ratnasamy, Scott Shenker, and Ion Stoica. The case for tiny tasks in compute clusters. In Presented as part of the 14th Workshop on Hot Topics in Operating Systems (HotOS), 2013. [118] Jeongyeup Paek, Krishna Chintalapudi, Ramesh Govindan, John Caffrey, and Sami Masri. A wireless sensor network for structural health monitoring: Perfor- mance and experience. In Embedded Networked Sensors, 2005. EmNetS-II. The Second IEEE Workshop on, pages 1–9. IEEE, 2005. [119] Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel J. Abadi, David J. DeWitt, Samuel Madden, and Michael Stonebraker. A comparison of approaches to large- scale data analysis. In Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data, 2009. [120] Jian Pei, Bin Jiang, Xuemin Lin, and Yidong Yuan. Probabilistic skylines on uncertain data. In Proceedings of the 33rd international conference on Very large data bases, pages 15–26. VLDB Endowment, 2007. [121] Peter Pietzuch, Jonathan Ledlie, Jeffrey Shneidman, Mema Roussopoulos, Matt Welsh, and Margo Seltzer. Network-aware operator placement for stream- processing systems. In Data Engineering, 2006. ICDE’06. 
Proceedings of the 22nd International Conference on, pages 49–49. IEEE, 2006. [122] Hamid Pirahesh, Joseph M Hellerstein, and Waqar Hasan. Extensible/rule based query rewrite optimization in starburst. In ACM Sigmod Record, volume 21, pages 39–48. ACM, 1992. 154 [123] Orestis Polychroniou, Rajkumar Sen, and Kenneth A. Ross. Track join: Dis- tributed joins with minimal network traffic. In Proceedings of the 2014 ACM SIG- MOD International Conference on Management of Data, SIGMOD ’14, pages 1483–1494, New York, NY , USA, 2014. ACM. [124] Lucian Popa, Gautam Kumar, Mosharaf Chowdhury, Arvind Krishnamurthy, Sylvia Ratnasamy, and Ion Stoica. Faircloud: Sharing the network in cloud com- puting. In Proceedings of the ACM SIGCOMM 2012 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication, SIG- COMM ’12, pages 187–198, New York, NY , USA, 2012. ACM. [125] Qifan Pu, Ganesh Ananthanarayanan, Peter Bodik, Srikanth Kandula, Aditya Akella, Paramvir Bahl, and Ion Stoica. Low latency geo-distributed data ana- lytics. In ACM SIGCOMM, 2015. [126] Moo-Ryong Ra, Anmol Sheth, Lily Mummert, Padmanabhan Pillai, David Wetherall, and Ramesh Govindan. Odessa: enabling interactive perception appli- cations on mobile devices. In Proceedings of the 9th international conference on Mobile systems, applications, and services, pages 43–56. ACM, 2011. [127] Ariel Rabkin, Matvey Arye, Siddhartha Sen, Vivek S Pai, and Michael J Freed- man. Aggregation and degradation in jetstream: Streaming analytics in the wide area. In USENIX NSDI, 2014. [128] W. Rdiger, T. Mhlbauer, P. Unterbrunner, A. Reiser, A. Kemper, and T. Neu- mann. Locality-sensitive operators for parallel main-memory database clusters. In 2014 IEEE 30th International Conference on Data Engineering, pages 592– 603, March 2014. [129] Charles Reiss, Alexey Tumanov, Gregory R Ganger, Randy H Katz, and Michael A Kozuch. Heterogeneity and dynamicity of clouds at scale: Google trace analysis. In Proceedings of the Third ACM Symposium on Cloud Comput- ing, 2012. [130] Xiaoqi Ren, Ganesh Ananthanarayanan, Adam Wierman, and Minlan Yu. Hop- per: Decentralized speculation-aware cluster scheduling at scale. In Proceedings of the 2015 ACM Conference on Special Interest Group on Data Communication, 2015. [131] Thomas A Roemer. A note on the complexity of the concurrent open shop prob- lem. Springer, 2006. [132] Prasan Roy, Srinivasan Seshadri, S Sudarshan, and Siddhesh Bhobe. Efficient and extensible algorithms for multi query optimization. In ACM SIGMOD Record, volume 29, pages 249–260. ACM, 2000. 155 [133] Linus Schrage. A proof of the optimality of the shortest remaining processing time discipline. Operations Research, 1968. [134] Linus E Schrage and Louis W Miller. The queue m/g/1 with the shortest remain- ing processing time discipline. Operations Research, 1966. [135] Bianca Schroeder and Garth Gibson. A large-scale study of failures in high- performance computing systems. IEEE Transactions on Dependable and Secure Computing, 7(4):337–350, 2010. [136] Bianca Schroeder and Mor Harchol-Balter. Web servers under overload: How scheduling can help. ACM Transactions on Internet Technology (TOIT), 2006. [137] Timos Sellis and Subrata Ghosh. On the multiple-query optimization problem. IEEE Transactions on Knowledge and Data Engineering, 2(2):262–266, 1990. [138] Timos K. Sellis. Multiple-query optimization. ACM Trans. Database Syst., 13(1):23–52, March 1988. 
[139] Bikash Sharma, Victor Chudnovsky, Joseph L Hellerstein, Rasekh Rifaat, and Chita R Das. Modeling and synthesizing task placement constraints in google compute clusters. In Proceedings of the 2nd ACM Symposium on Cloud Comput- ing, 2011. [140] Adam Silberstein and Jun Yang. Many-to-many aggregation for sensor networks. In Data Engineering, 2007. ICDE 2007. IEEE 23rd International Conference on, pages 986–995. IEEE, 2007. [141] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In Proceedings of the International Conference on Learning Representations (ICLR), 2015. [142] Utkarsh Srivastava, Kamesh Munagala, and Jennifer Widom. Operator placement for in-network stream query processing. In Proceedings of the twenty-fourth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, pages 250–258. ACM, 2005. [143] Yaniv Taigman, Ming Yang, Marc’Aurelio Ranzato, and Lior Wolf. DeepFace: Closing the Gap to Human-Level Performance in Face Verification. In Proceed- ings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014. [144] Jian Tan, Xiaoqiao Meng, and Li Zhang. Delay tails in mapreduce scheduling. ACM SIGMETRICS Performance Evaluation Review, 2012. 156 [145] Haluk Topcuoglu, Salim Hariri, and Min-You Wu. Performance-effective and low-complexity task scheduling for heterogeneous computing. IEEE Transac- tions on Parallel and Distributed Systems, 13:260–274, 2002. [146] Niki Trigoni, Yong Yao, Alan Demers, Johannes Gehrke, and Rajmohan Rajara- man. Multi-query optimization for sensor networks. In International Conference on Distributed Computing in Sensor Systems, pages 307–321. Springer, 2005. [147] John Turek, Joel L. Wolf, and Philip S. Yu. Approximate algorithms scheduling parallelizable tasks. In Proceedings of the Fourth Annual ACM Symposium on Parallel Algorithms and Architectures, 1992. [148] C. J Van Rijsbergen. Information Retrieval. Butterworth, 2nd edition, 1979. [149] Hal Varian. Equity, envy, and efficiency. In Journal of Economic Theory, vol- ume 9, pages 63–91, 1974. [150] S. Venkataraman, A. Panda, G. Ananthanarayanan, M. Franklin, and I. Stoica. The Power of Choice in Data-Aware Cluster Scheduling. In USENIX OSDI, 2014. [151] Shivaram Venkataraman, Aurojit Panda, Ganesh Ananthanarayanan, Michael J. Franklin, and Ion Stoica. The power of choice in data-aware cluster scheduling. In Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation, OSDI’14, pages 301–316. USENIX Association, 2014. [152] Abhishek Verma, Luis Pedrosa, Madhukar R. Korupolu, David Oppenheimer, Eric Tune, and John Wilkes. Large-scale cluster management at Google with Borg. In Proceedings of the European Conference on Computer Systems (EuroSys), Bordeaux, France, 2015. [153] Raajay Viswanathan, Ganesh Ananthanarayanan, and Aditya Akella. Clarinet: Wan-aware optimization for analytics queries. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 435–450. USENIX Association, 2016. [154] Ashish Vulimiri, Carlo Curino, Brighten Godfrey, Konstantinos Karanasos, and George Varghese. Wanalytics: Analytics for a geo-distributed data-intensive world. In CIDR, 2015. [155] Ashish Vulimiri, Carlo Curino, P. Brighten Godfrey, Thomas Jungblut, Jitu Pad- hye, and George Varghese. Global analytics in the face of bandwidth and regula- tory constraints. In Proceedings of the 12th USENIX Conference on Networked Systems Design and Implementation (NSDI), 2015. 
157 [156] Adam Wierman and Mor Harchol-Balter. Classifying scheduling policies with respect to unfairness in an m/gi/1. In ACM SIGMETRICS Performance Evalua- tion Review, 2003. [157] Shili Xiang, Hock Beng Lim, Kian-Lee Tan, and Yongluan Zhou. Two-tier mul- tiple query optimization for sensor networks. In Distributed Computing Systems, 2007. ICDCS’07. 27th International Conference on, pages 39–39. IEEE, 2007. [158] Pengcheng Xiong, Hakan Hacigumus, and Jeffrey F. Naughton. A software- defined networking based approach for performance management of analytical queries on distributed data stores. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, SIGMOD ’14, pages 955– 966, New York, NY , USA, 2014. ACM. [159] Hui Yan, Xue-Qin Shen, Xing Li, and Ming-Hui Wu. An improved ant algorithm for job scheduling in grid computing. In Machine Learning and Cybernetics, 2005. Proceedings of 2005 International Conference on, volume 5, pages 2957– 2961. IEEE, 2005. [160] Yang Yang, Hui-Hai Wu, and Hsiao-Hwa Chen. Short: shortest hop routing tree for wireless sensor networks. International Journal of Sensor Networks, 2(5- 6):368–374, 2007. [161] Jia Yu and Rajkumar Buyya. A taxonomy of workflow management systems for grid computing. Journal of Grid Computing, 3(3-4):171–200, 2005. [162] Jia Yu, Rajkumar Buyya, and Chen Khong Tham. Cost-based scheduling of sci- entific workflow applications on utility grids. In e-Science and Grid Computing, 2005. First International Conference on, pages 8–pp. Ieee, 2005. [163] M. Zaharia, M. Chowdhury, M.J. Franklin, S. Shenker, and I. Stoica. Spark: Cluster Computing with Working Sets. In USENIX HotCloud, 2010. [164] Matei Zaharia, Dhruba Borthakur, Joydeep Sen Sarma, Khaled Elmeleegy, Scott Shenker, and Ion Stoica. Delay scheduling: A simple technique for achieving locality and fairness in cluster scheduling. In ACM EuroSys, 2010. [165] Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J Franklin, Scott Shenker, and Ion Stoica. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster comput- ing. In USENIX NSDI, 2012. [166] Matei Zaharia, Tathagatha Das, Haoyuan Li, Tim Hunter, Scott Shenker, and Ion Stoica. Discretized Streams: Fault-Tolerant Streaming Computation at Scale. In ACM SOSP, 2013. 158 [167] Matei Zaharia, Andy Konwinski, Anthony D Joseph, Randy H Katz, and Ion Stoica. Improving mapreduce performance in heterogeneous environments. In USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2008. [168] Kai Zeng, Wenjing Lou, and Hongqiang Zhai. On end-to-end throughput of opportunistic routing in multirate and multihop wireless networks. In INFO- COM 2008. The 27th Conference on Computer Communications. IEEE, pages 816–824. IEEE, 2008. [169] Haoyu Zhang, Ganesh Ananthanarayanan, Peter Bodik, Matthai Philipose, Victor Bahl, and Michael J. Freedman. Live Video Analytics at Scale with Approximate and Delay-Tolerant Processing. In NSDI, 2017. [170] Jizhong Zhao, Wei Xi, Yuan He, Yunhao Liu, Xiang-Yang Li, Lufeng Mo, and Zheng Yang. Localization of wireless sensor networks in the wild: Pursuit of ranging quality. IEEE/ACM Transactions on Networking, 21(1):311–323, 2013. [171] Bolei Zhou, Agata Lapedriza, Jianxiong Xiao, Antonio Torralba, and Aude Oliva. Learning deep features for scene recognition using places database. In Proceed- ings of the Twenty-eighth Annual Conference on Neural Information Processing Systems (NIPS), 2014. 159