DISTRIBUTED RESOURCE MANAGEMENT FOR QOS-AWARE SERVICE PROVISION

by

Sung-Han Lin

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)

December 2017

Copyright 2017 Sung-Han Lin

Acknowledgments

This thesis represents not only my work at the keyboard; it is also the result of the many experiences, both in the scientific arena and on a personal level, that I encountered at USC through dozens of remarkable individuals whom I wish to acknowledge. This thesis would not have been possible without their support.

First and foremost, I would like to express my sincere gratitude to my advisor, Professor Leana Golubchik, for her continuous support since the days I began working in the QED research group. Her guidance and motivation throughout my PhD study not only helped me grow as a researcher, but also helped me build my career as an engineer. Besides my advisor, I would like to thank Professor Konstantinos Psounis and Professor Fei Sha for serving as my committee members, and for their insightful comments and suggestions.

I would like to thank my labmates, especially Dr. Bo-Chun Wang, Dr. Ranjan Pal, and Dr. Marco Paolieri, for the creative discussions, and for their support and hard work in helping me edit the papers. I would also like to thank my other friends, especially the Taiwanese friends who pursued their PhD degrees at the same time as me, for their support in many aspects throughout my PhD study, starting with my first day at USC.

Last but not least, I would like to thank my family. I could not have completed this degree without their support, which allowed me to focus only on my studies without worrying about other things in life.

Table of Contents

Acknowledgments
List of Tables
List of Figures
List of Algorithms
Abstract
Chapter 1: Introduction
  1.1 Video Streaming as a Service
  1.2 Computation as a Service
Chapter 2: Related Work
  2.1 Peer-to-Peer Video Streaming as a Service
    2.1.1 Modifying Sharing Policies
    2.1.2 Providing Incentives
  2.2 Computation as a Service
    2.2.1 Parallel Job Scheduling
    2.2.2 Hybrid Clouds
    2.2.3 Design of Cooperative Clouds
Chapter 3: On Market-Driven Hybrid-P2P Video Streaming
  3.1 Introduction
  3.2 Overview of ASPECT
    3.2.1 Architecture of ASPECT
    3.2.2 Peers in ASPECT
    3.2.3 Content Providers in ASPECT
    3.2.4 Ad Providers in ASPECT
    3.2.5 Trading Download Capacity with Advertisements
  3.3 Market for P2P Video Streaming
    3.3.1 The Market Environment
    3.3.2 Non-cooperative Game among Peers
    3.3.3 Non-cooperative Game among Content Providers
  3.4 Sharing Mechanisms in ASPECT
    3.4.1 BitTorrent-like Video Streaming Systems
    3.4.2 Peer Selection Mechanism
    3.4.3 Modified Peer Request Mechanism
    3.4.4 Advertisements Reward Function
  3.5 Evaluation
    3.5.1 Performance of Modified BitTorrent-based System
    3.5.2 Numerical Experiments with a Monopoly Market
    3.5.3 Numerical Experiments with Oligopolistic Markets
    3.5.4 Overhead and Complexity
  3.6 Conclusions
Chapter 4: Dynamic Resource Management for Distributed Machine Learning Workloads
  4.1 Introduction
  4.2 Background
    4.2.1 Distributed Stochastic Gradient Descent
    4.2.2 TensorFlow
  4.3 Throughput Estimation
    4.3.1 Throughput Measurements of Distributed SGD
    4.3.2 Queueing Model
    4.3.3 On the Effects of Short TCP Transmissions
  4.4 Scheduling Mechanisms
    4.4.1 The Dilemma of Assigning Workers
    4.4.2 Malleable Job Scheduling
    4.4.3 Moldable Job Scheduling
    4.4.4 Extension for Early Termination
  4.5 Evaluation and Validation
    4.5.1 Throughput Estimation Validation
    4.5.2 Scheduling Evaluation
  4.6 Conclusion
Chapter 5: SC-Share: Performance Driven Resource Sharing Markets for the Small Cloud
  5.1 Introduction
  5.2 System Description
    5.2.1 Architecture Description
    5.2.2 Cost Metric Description
    5.2.3 Cost Metric Evaluation Framework
  5.3 Performance Model
    5.3.1 SC without Sharing Resources
    5.3.2 Detailed Model for SC Federation
    5.3.3 Approximate Model for SC Federation
  5.4 Market-based Model
    5.4.1 SC Utilities
    5.4.2 Non-Cooperative Game among SCs
  5.5 Evaluation and Validation
    5.5.1 Performance Model Validation
    5.5.2 Market-based Model Evaluation
    5.5.3 Computational Overhead
  5.6 Discussion and Future Work
  5.7 Conclusions
Chapter 6: Conclusion
Appendix A: Mathematical Assumptions in Existing Theorems

List of Tables

3.1 Summary of notation
3.2 The average download rates (kbps) experienced by peers of different classes
3.3 Parameters used in experiments
3.4 The distribution of upload bandwidth
3.5 Game results for homogeneous oligopolistic content providers with different preferences
3.6 Game results for oligopolistic content providers with different numbers of video content and video upload supply
4.1 Summary of notation
4.2 Synthetic workload collected by [71] from previous literature
5.1 State transitions for the detailed model M

List of Figures

3.1 Overview of ASPECT
3.2 On-demand videos are interleaved with advertisements in ASPECT
3.3 State diagram of the game between peers
3.4 State diagram of the game between content providers
3.5 The evolution of (a) the download rate and (b) the reputation score
3.6 Video pauses of each class with different percentages of inconsistent capacity peers
3.7 Video pauses of each class with 10% inconsistent capacity peers and different parameter values
3.8 Video pauses after applying our peer request mechanism
3.9 Before and after applying our peer request mechanism
3.10 Video pauses with different inconsistent capacity peers
3.11 The duration of ads for each class after our mechanisms
3.12 Utility results from valid reward functions of advertisements
3.13 The real download rates of different peer classes in three valid equilibrium points
3.14 The durations of ads viewed by different peer classes in three valid points
3.15 Average number of iterations for various parameter values
3.16 Average number of iterations for various numbers of CPs
4.1 Parameter server architecture
4.2 Measured training throughput of TensorFlow and Python implementations (mini-batch size is 50 examples)
4.3 Queueing model of a distributed machine learning application with a parameter server architecture
4.4 The results of our exact MVA model
4.5 Outstanding window sizes for two async workers
4.6 The illustration of the linear speedup phenomenon in TensorFlow
4.7 The results of our approximate MVA with FCFS
4.8 The results of our throughput estimation model
4.9 The benefit of using our throughput estimation model
4.10 The performance of malleable job scheduling
4.11 The performance of moldable job scheduling
4.12 Applying the extension for early termination with different parameter values
5.1 System overview
5.2 Feedback between two models
5.3 A Markov model for forwarding
5.4 Example of allocation constraints for a state (q_i, s_i, o_i, a_i) in M_i
5.5 Comparing the result of forwarding estimation with 10 and 100 VMs, with QoS = 0.2 and 0.5
5.6 Validating the approximate performance model (2 SCs and 10 SCs)
5.7 Market results in 3-SC scenarios: (a-b) are results where the 3 SCs have parameter values 0.58, 0.73, and 0.84; (c) is the result where the 3 SCs have parameter values 0.73, 0.79, and 0.84; (d) are the results where the 3 SCs have parameter values 0.49, 0.58, and 0.66
5.8 Time complexity of the performance model and the game model

List of Algorithms

1 Reward mechanism for CP_i
2 Modified peer request mechanism at time t
3 Penalty algorithm
4 Malleable KNEE job scheduling
5 Malleable HELL job scheduling
6 Proposed repeated game among SCs

Abstract

Provision of quality of service (QoS) is of significant importance to service providers, where QoS is a function of resource availability. When resources are insufficient at a particular service provider, two approaches to mitigating this problem can be considered by that service provider: (a) limit the amount of resources allocated to its users, and (b) cooperate with other resource holders and find a reasonable way to share resources. For instance, a private cloud could reject its customers' requests or forward some requests to a public cloud (e.g., Amazon) to achieve satisfactory QoS. To this end, in addition to designing resource allocation approaches, service providers should also consider how to maximize their utilities when cooperating with other resource holders.

Motivated by cooperation among resource holders and related resource allocation problems, in this thesis we focus on several services and study how to allocate resources efficiently while maximizing all participants' benefits. For P2P video streaming, where the resource is the download rate for video playback, we eliminate the problem of playback pauses by adopting reduced advertisement viewing duration as a positive incentive for peers to contribute their unused download rates. For provision of on-demand compute capacity as a cloud service, where virtual machines are the main resources, we study the incentives motivating small-scale clouds to share their virtual machines in a cooperative manner in order to achieve profitable service while maintaining customer service-level agreements. For co-located machine learning training jobs, where the resource is the CPU core or GPU, we investigate the throughput improvement of a distributed training job when optimizing its resource allocation by integrating our throughput estimation technique with scheduling mechanisms.
Chapter 1: Introduction

Provision of quality of service (QoS) is of significant importance to service providers, where QoS is a function of resource availability. Traditionally, when resources are insufficient at a particular service provider, the service provider has to (i) invest in additional infrastructure in order to satisfy customer demand and/or (ii) reject some workloads to maintain QoS for accepted customers. However, these mechanisms might not be cost-effective for service providers. For instance, if a cloud provider invests in additional servers to satisfy peak load demand, the maintenance cost, such as power and cooling, results in loss of revenue due to under-utilization of resources during off-peak hours. Moreover, rejecting incoming requests might lead to losing customers. Thus, to mitigate the problem of insufficient resources, two approaches can be considered: (a) limiting the amount of resources allocated to each user without violating the QoS requirements, and (b) cooperating with other resource holders to share resources in a reasonable way. For instance, a private cloud could forward some requests to a public cloud (e.g., Amazon Web Services) to achieve satisfactory QoS when its own resources are insufficient. Consequently, an important problem in this domain is how to allocate resources efficiently while maximizing the benefit of all participants in a distributed environment.

Motivated by cooperation among resource holders and related resource allocation problems, in this thesis we focus on studying resource allocation problems in two types of services. We first consider the video streaming service and study resource allocation in a peer-to-peer (P2P) network, within a single service provider, in Chapter 3. Since a peer is also a resource provider, we study an incentive motivating peers to re-allocate their resources in order to maximize the benefit of all peers, the service provider, and ad providers. We also extend the environment to include more than one service provider and study how a service provider can maximize its revenue when competing with other service providers. We then consider computation as a service. In Chapter 4, we start by considering the execution of machine learning jobs in a small cloud environment and determine how to properly distribute limited resources to each of the submitted jobs without significantly degrading the users' QoS. Following that, in Chapter 5, we study the scenario where a single cloud provider cannot satisfy the QoS requirements through resource allocation mechanisms alone. Thus, we study the problem of a cloud provider sharing resources within a cloud federation in order to minimize the cost of providing satisfactory QoS to all customers. In the remainder of this chapter, we briefly introduce the three research problems and summarize our contributions.

1.1 Video Streaming as a Service

In Chapter 3 of this thesis, we consider video streaming services in a P2P network, where both service providers and peers are resource holders.

P2P-based architectures have been widely used to solve scalability problems that exist in client-server-based architectures; that is, the system capacity increases when more peers join the system. At a high level, peers contribute upload capacity to the system, and then receive some (proportional) amount of download capacity in return, which determines their QoS.
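As a toy illustration of this proportional-return principle (the policy, peer names, and rates below are illustrative assumptions, not the mechanism of any specific system), note that a peer's download rate ends up tracking its own upload contribution, so a peer uploading below the playback rate stalls even when the swarm's aggregate capacity would suffice for everyone:

    PLAYBACK_KBPS = 800  # assumed video playback rate

    def proportional_download(uploads_kbps):
        """Split the total swarm upload supply in proportion to each peer's
        contribution; under this policy every peer effectively downloads at
        its own upload rate."""
        supply = sum(uploads_kbps.values())
        return {peer: supply * (up / supply) for peer, up in uploads_kbps.items()}

    peers = {"high": 2400, "mid": 900, "low": 300}
    for peer, rate in proportional_download(peers).items():
        print(peer, f"{rate:.0f} kbps", "ok" if rate >= PLAYBACK_KBPS else "pauses")
    # Aggregate supply (3600 kbps) exceeds aggregate demand (3 x 800 kbps), yet
    # the low-capacity peer still pauses under a purely proportional policy.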
The works in [63, 80] have shown that the number of peers with high upload capacities significantly affects the performance of P2P-based streaming systems, particularly when free-riders exist in the system. Thus, one important problem in providing good QoS in streaming systems is to motivate high capacity peers to continue staying in the system with higher upload bandwidth contributions.

To address such an incentive problem in P2P video streaming systems, many P2P mechanisms [88, 97, 111] have been proposed, focusing on using download rates as an incentive to motivate peers to provide higher upload rates. However, studies have also shown that this results in high capacity peers largely exchanging content with other high capacity peers (as they have little to gain from low capacity peers). This leaves low capacity peers to exchange content with other low capacity peers, resulting in their poor download rates. Another approach is to use video quality as an incentive; however, the problem is that high capacity peers then have no incentive to contribute upload rates beyond the point which provides them with uninterrupted streaming. Thus, if there are no other incentives beyond video quality, high capacity peers only need to contribute sufficient upload capacity to achieve the download rates needed for satisfactory video quality, resulting in degraded overall system performance.

To this end, in Chapter 3, we investigate the proper use of advertisements (and corresponding challenges) as incentives for mitigating the QoS problem in streaming P2P systems. Our contribution can be summarized as follows.

- We propose a mechanism, implemented within our Ad-driven Streaming P2P ECosysTem (ASPECT), which allows peers to trade their capacities and ad durations. This mechanism increases opportunities for peers to obtain sufficient download rates so as to significantly reduce video pauses.

- We design ASPECT as a market-based model, in both monopoly and oligopoly competition settings, that consists of one or many content providers, a set of ad providers, and a set of peers, and show that ASPECT is able to achieve market success. Our model facilitates the study of properly designed incentives needed to encourage continued contributions from peers at market equilibrium. Our results enable the content provider to achieve its desired profit by providing sufficient incentives for its peer customers to stay in the system and contribute to greater revenues of the content providers via ad viewing, while respecting the ad duration contracts with the ad providers (i.e., ensuring that a pre-specified minimal duration of ads is viewed by all peers).

1.2 Computation as a Service

In Chapter 4 and Chapter 5 of this thesis, we shift our focus to the environment where cloud providers are the resource holders and users submit requests to allocate resources from the cloud providers for completing computation jobs. We start by considering resource allocation within a single cloud in Chapter 4.

Machine learning, in particular deep learning [81], has recently achieved breakthrough results in several domains, including computer vision, speech recognition, natural language processing, and robot control. However, the use of deep neural networks (DNNs) requires very large amounts of data and compute power to discover internal representations directly from input data.
To speed up training and provide quick turnaround to users submitting these types of jobs, it is important to take advantage of distributed training, which uses multiple machines in parallel. In a parameter server architecture [43], the dataset is split among several worker nodes that perform training in parallel, sending parameter updates to a parameter server and receiving the most recent model version, which includes updates from other workers.

As shown through experimental measurements [15], when more worker or server nodes are assigned to a job, its throughput (the number of training examples processed per second) increases only sub-linearly. In some cases, when a shared resource (e.g., the network) is congested, adding more nodes can reduce the throughput, thus increasing the overall job service time. Thus, the first problem that we tackle is the definition and validation of a performance model for the throughput prediction of a training job as the number of assigned workers increases. Then, we leverage this performance model to address the problem of parallel job scheduling. We consider the case of a compute cluster that receives Poisson streams of heterogeneous machine learning jobs. We explore several algorithms that achieve different tradeoffs between system efficiency, speedup of job response time (i.e., the time spent in the system by a job, waiting or in service), and mean time to complete intermediate results (e.g., 25% of the job).

Our contribution in Chapter 4 can be summarized as follows.

- To use system resources efficiently (i.e., with high speedup per worker node), we need to estimate the performance of a distributed training job as a function of the number of workers allocated to it. We develop a performance model based on approximate mean value analysis (MVA) [109], which accounts for the effects of the TCP protocol. This model provides a sufficiently accurate estimate of training throughput for the job scheduler.

- We propose preemptive parallel scheduling algorithms that address the cases of moldable jobs (KELL) and malleable ones (HELL and KNEE). We also propose an extension to speed up the early part of each job in order to provide quick feedback for hyper-parameter tuning. Then, we compare, through extensive experimental evaluations, the tradeoff between shorter response time and parallel service of more jobs. We show that KELL performs not much worse than both malleable mechanisms, and that, with proper tuning, our extension can significantly reduce the time for obtaining intermediate results without degrading the overall response time too much.

However, during peak hours, limiting the amount of resources allocated to each job might still not be able to satisfy the QoS requirement. Thus, in Chapter 5, we consider a corresponding problem of cooperation among small-scale clouds.

Cloud service providers (Amazon AWS [1], Google Compute Engine [6], Microsoft Azure [8]) allow customers to quickly deploy their services, at a price lower than that of maintaining their own infrastructure. Such large-scale public cloud providers invest in large-scale data centers that are typically over-provisioned in order to be able to respond to bursty workloads during peak hours. However, there are also non-trivial concerns in obtaining services from large-scale public clouds, e.g., loss of data privacy, and the cost and complexity of building services [5]. Thus, many smaller-scale clouds are available, partly as a solution to privacy concerns and partly as a solution to cost concerns [42].
However, smaller-scale clouds are likely to suffer from resource under-provisioning during peak demand periods, which can lead to inability to satisfy service level agreements (SLAs) and consequently loss of customers. Since idle resources are wasted in smaller-scale clouds during their off-peak hours, one approach is for the smaller-scale clouds to share their resources in some cooperative manner [18, 113, 136, 148], thus (effectively) increasing their individual capacities (when needed) without having to significantly invest in more resources. Many of these efforts assume the existence of the cloud federation and largely focus on designing sharing policies in order to maximize the profit of individual smaller-scale clouds [59, 127, 137, 148]. Moreover, most previous efforts focus on the performance of the federation as a whole, rather than on the potential benefits (in terms of profit) and costs (in terms of performance degradation) to individual smaller-scale clouds, which are significant contributing factors to incentivizing smaller-scale clouds to participate in a federation.

To this end, in Chapter 5, we focus on developing a framework for cooperating smaller-scale clouds, referred to here as SC-Share, that can lead to appropriate incentives for individual small-scale clouds to participate in the federation while making sure that each is profitable and is able to meet its SLAs.

Since the amount of shared resources in the federation directly affects how much workload SC-Share is able to handle, which in turn affects the profit each smaller-scale cloud is able to achieve, we propose a market-based model for determining how much each smaller-scale cloud should share. However, formulation of the market-based model requires some notion of performance characteristics. For instance, we would need to know how likely a smaller-scale cloud is to use the shared resources in the federation, which is a function of how often it is in danger of not meeting SLAs; how often a large-scale public cloud's resources would be needed as a backup; etc. Hence, we also develop a performance model that is able to estimate the parameters needed by the market-based model. Such a performance model also needs to take the sharing policy into consideration. Consequently, the performance model needs input from the solution of the market-based model, and the market-based model needs input from the solution of the performance model. We address this in an iterative manner.

Our contribution in Chapter 5 can be summarized as follows.

- We design an approximate performance model with an efficient solution that is able to provide accurate (i.e., useful to the market-based model) estimates of the measures of interest, with linear complexity in the number of smaller-scale clouds, and which also allows smaller-scale clouds to keep their SLAs and capacity information private.

- We design a market-based model that determines the price charged within the federation for the use of shared resources, which (a) properly incentivizes the smaller-scale clouds to participate in the federation and (b) achieves market success.

- We perform an extensive evaluation study of the performance and market-based models, which validates the accuracy (with low complexity) of the proposed approximate performance model and its utility for the market-based model, as well as illustrates the benefits of our proposed framework.

The remainder of this thesis is organized as follows.
Chapter 2 gives an overview of related work, and differentiates our contributions from previous efforts. The above three problems and their solutions are described in Chapters 3, 4, and 5. We then conclude in Chapter 6.

Chapter 2: Related Work

In this chapter, we give an overview of literature related to this thesis, first in the context of P2P video streaming as a service, and then in the context of computation as a service.

2.1 Peer-to-Peer Video Streaming as a Service

Peer-to-Peer based approaches have been developed and deployed in order to address scalability problems that exist in client-server based streaming architectures. However, the quality of service (QoS) of P2P-based approaches is highly dependent on the resources available from peers. Thus, previous efforts focus on two aspects to improve the QoS of P2P-based systems: (a) modifying sharing policies to redistribute resources among peers, and (b) providing incentives for peers to contribute more.

2.1.1 Modifying Sharing Policies

This type of work assumes that the amount of resources is sufficient but a proper sharing policy is lacking. Thus, for a P2P-based video streaming system, the works in [33, 133] suggest that block selection strategies should favor blocks that are closer to the current playback point. (The works on block selection strategies are orthogonal to ours, and we believe our mechanisms can be integrated with those proposed in [33, 133].) The analysis in [106] indicates that a TFT-based strategy may not be suitable for Video-on-Demand (VoD) applications, because younger peers may not have sufficient data to share with older peers, resulting in overloading of older peers. Consequently, in [141], algorithms are proposed for load balancing requests between peers. In [40], peers increase the number of neighbors chosen during random selection when such peers already have high QoS, thus increasing the chance for peers with low QoS to receive more resources. Another effort [138] designs a semi-distributed algorithm to optimize fairness among peers in P2P live video systems, where low upload capacity peers contribute all of their capacities, and high upload capacity peers upload at the same higher rate. However, these works do not provide incentives for peers to do so, and run the risk of de-motivating the contributions of high capacity peers.

2.1.2 Providing Incentives

As shown in [63, 80], the number of peers with high upload rates significantly affects the performance of P2P-based systems, particularly when free-riders exist in the system. Thus, providing sufficient incentives for peers to contribute their resources is very important.

Some efforts use game-theoretic approaches to design incentive mechanisms in P2P file-sharing applications [26, 51, 89]. By analyzing a generalized Prisoner's Dilemma model, [51] has shown that the adoption of shared history and discriminating server selection techniques enables strategic users to reach nearly optimal levels of cooperation. In [89], a resource distribution mechanism and a generalized incentive mechanism are proposed to provide service differentiation based on the amount of information provided by each peer. The work in [26] studies an incentive mechanism for resources that are shared among peers without a direct relationship, and differentiates the service level based on each peer's previous behavior.
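As a concrete illustration of such contribution-based service differentiation, consider the following minimal sketch (peer names, scores, and the slot count are hypothetical; this is a generic rendering of the score-and-rank idea, e.g., as in [66] below, not any specific system's implementation):

    import heapq

    def choose_unchoked(requesting_peers, scores, slots=4):
        """Grant upload slots to the `slots` requesters with the highest
        contribution scores; unknown peers default to a score of zero."""
        return heapq.nlargest(slots, requesting_peers,
                              key=lambda peer: scores.get(peer, 0.0))

    scores = {"p1": 9.5, "p2": 3.1, "p3": 7.2, "p4": 0.4, "p5": 5.0}
    print(choose_unchoked(["p1", "p2", "p3", "p4", "p5"], scores, slots=3))
    # -> ['p1', 'p3', 'p5']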
P2P-based video streaming systems further rely on contributions from a small set of high upload capacity peers in order to provide reasonable QoS. Thus, providing incentives for peers to contribute their capacities is the focus of many existing efforts that include a variety of approaches [31, 66, 86–88, 108, 135]. For instance, the work in [88] focuses on coding/MDC schemes in the context of TFT-type strategies, where peers contributing higher upload rates are rewarded with higher video quality. In [66], the authors propose a score-based incentive mechanism that converts users' contributions into scores and maps scores into ranks, used by a peer selection mechanism for choosing which neighbors to upload to. The works in [103] and [107] demonstrate that a pure TFT strategy may not work well in streaming systems due to round-trip delays. Consequently, [103] proposes an incentive scheme based on the Iterated Prisoner's Dilemma, while [107] focuses on the effects of a local pairwise incentive-based peer selection policy by measuring resource availability and path quality. Moreover, [31] adopts an evolutionary game approach in P2P video systems to study the cooperation among geographically close peers, in order to improve video quality. [86] also proposes a game-theoretic framework to model peers' behavior and design incentive mechanisms that achieve cheat-proof and attack-resistant cooperation in P2P live streaming social networks. The work in [87] proposes a substreaming framework, which modifies a TFT mechanism and a partner selection scheme to be applied to a variety of video coding schemes. The authors of [135] propose a system using advertisements as an incentive, which uses a token-based scheme for trading data between peers; peers contributing greater upload resources can obtain tokens to reduce advertisement viewing time. However, these works do not guarantee an improvement in the QoS of low upload capacity peers. Moreover, all of these efforts only consider the interaction among peers. None of them consider the strategies of the other entities which enable the whole system to continue providing services, e.g., streaming videos.

2.2 Computation as a Service

To allow scientists and engineers to solve compute-intensive problems in a pay-as-you-go manner, cloud computing and high performance computing (HPC) providers offer computation as a service via their large-scale infrastructure with very high compute power, large amounts of memory, and fast networks. However, for a smaller-scale provider, how to provide satisfactory QoS to its customers without investing in extra infrastructure is a major concern. Thus, previous efforts focus on three aspects to maintain QoS for customers while minimizing cost: (a) changing the level of parallelism to better utilize the resources, (b) adopting a hybrid cloud architecture to outsource requests to large-scale providers, and (c) cooperating with other smaller-scale providers to share resources.

2.2.1 Parallel Job Scheduling

The problem of scheduling multiprocessor jobs is widely studied in the context of HPC and Grid computing. Compared to divisible job scheduling [23], where a job can be divided into multiple disjoint tasks and each task can be executed independently, multiprocessor job scheduling must allocate multiple processors to a job simultaneously.
Three variants of multiprocessor jobs are typically considered: (i) rigid, (ii) moldable, and (iii) malleable; these have been widely discussed in the literature with the goal of minimizing the makespan of a set of tasks (i.e., the time to complete all tasks) while satisfying deadline requirements.

A rigid job requires a fixed number of processors, which is specified by the user when submitting the job. It has been shown to perform badly in a busy shared cluster [35], because a rigid job might have to wait for a long time before executing, due to asking for too many processors, and does not allow other jobs to run simultaneously, due to occupying most of the available processors. Due to this inflexibility, finding the optimal solution for rigid jobs is also difficult. Thus, the non-preemptive version of rigid job scheduling has been shown to be an NP-hard problem; the problem is strongly NP-hard when jobs with precedence constraints can use more than one processor, or when independent jobs without precedence constraints can be scheduled on more than four processors [46]. When rigid jobs are preemptable, the scheduling problem can be solved in polynomial time using a linear programming formulation when all rigid jobs require the same fixed number of processors [22]; however, the problem becomes NP-hard when the number of required processors is arbitrary (also an input to the system) [45]. Since special cases of this problem reduce to 2D bin-packing, there exists no known efficient and optimal algorithm; common heuristics sort the jobs according to many factors (age, size, priority) and schedule them greedily in order, with backfilling of unused slots [117].

A moldable job allows the scheduler to determine the number of processors right before execution, and this decision cannot be changed during the course of the execution. Even though this type of job is more flexible, the scheduling problem has been shown to be NP-hard in both the non-preemptive and preemptive cases [46]. The scheduling problem is even strongly NP-hard when the required number of processors is more than four in the non-preemptive case, or arbitrary in the preemptive case [46]. Thus, several efforts based on linear programming tackled off-line versions of this problem. The authors of [72, 73] (whose definition of malleable jobs corresponds to the typical definition of moldable jobs) adopted a linear programming formulation to compute both preemptive and non-preemptive schedules with minimum makespan, and showed that the running time of the algorithm depends polynomially on the required number of processors and only linearly on the number of jobs. Thus, with a fixed number of processors, their mechanism can compute schedules with makespan at most (1 + ε) times the optimum in the non-preemptive scenario, with running time polynomial in the number of jobs, the number of processors used, and 1/ε. Moreover, with a fixed number of processors, their mechanism can compute the optimal solution in the preemptive scenario with running time linear in the number of jobs. In [47], the authors use experimental results to show that the well-known Largest Task First (LTF) algorithm is the best at minimizing the makespan in moldable job scheduling.
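For intuition, the following is a minimal sketch of the LTF idea in a simplified setting where each job runs on a single processor (greedy list scheduling in decreasing order of work); the moldable variant evaluated in [47] additionally chooses each job's processor allotment, which this sketch omits:

    import heapq

    def ltf_makespan(job_works, num_processors):
        """Greedy Largest-Task-First list scheduling: place each job, in
        decreasing order of work, on the currently least-loaded processor."""
        loads = [0.0] * num_processors
        heapq.heapify(loads)
        for work in sorted(job_works, reverse=True):
            lightest = heapq.heappop(loads)
            heapq.heappush(loads, lightest + work)
        return max(loads)

    print(ltf_makespan([7, 5, 4, 3, 3, 2], num_processors=3))  # -> 9.0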
The approach in [34] outperforms the LTF-based approach by solving the makespan minimization problem via successive linear programming approximations of a sparse model, on smaller instances with fewer than a hundred jobs.

A malleable job is more flexible than a moldable job and allows the scheduler to dynamically adjust the number of processors. Since it allows a different number of processors during its execution, this type of job is always considered preemptable. As with moldable jobs, the problem of scheduling malleable jobs is also strongly NP-hard [46]. However, the authors of [24] show that there is a linear time mechanism to compute the optimal solution if the processing speedup function is convex, and a polylogarithmic time mechanism if the processing speedup function is concave. The authors of [114] further reduced the computation time by adopting a binary search mechanism for the case of a concave processing speedup function. The authors of [37] study the feasibility of satisfying the deadlines of malleable jobs and propose a scheduling algorithm that meets all deadline requirements when the minimum number of processors is specified for each task. In [91], the authors propose a (2 + ε)-approximation algorithm for solving the problem of scheduling malleable jobs with precedence constraints when concave processing speedup functions are used. The authors of [64] consider the problem of scheduling both malleable and non-malleable jobs on two parallel identical machines and use a dynamic programming algorithm to achieve the optimal mean computation time. Two
Another example is [118] which proposes a threshold-based scheduling algorithm to minimize the system cost of forwarding workload to large-scale clouds. 15 However, these works do not consider the possible benefits of sharing resources in the federation of smaller-scale clouds, which is the focus of our proposed effort. 2.2.3 Design of Cooperative Clouds Earlier efforts also study the competition and cooperation within a federated cloud. For instance, authors in [57, 59] characterize a cloud federation to help cloud providers maximize their profits via dynamic pricing models. Earlier efforts [30, 128] also study the competition and cooperation among cloud providers, but assume that each cloud provider has sufficient resources to serve all user requests, while [30] incorporates a penalty function to address a service delay penalty. Authors in [101] propose a hierar- chical cooperative game theoretic model for better resource integration and for achieving a higher profit in a federation. Similarly to our work, [93] studies a federation forma- tion game but assumes that cloud providers share everything with others, while [62] adopts cooperative game theoretic approaches to model a cloud federation and study the motivation for cloud providers to participate in a federation. Another line of work focuses on designing sharing policies in a federation to obtain a higher profit. For instance, [136] proposes a decentralized cloud platform SpotCloud [10], a real-world system allowing customers or smaller-scale clouds to sell idle com- pute resources at specified prices, and presents a resource pricing scheme (resulting from a repeated seller game) plus an optimal resource provisioning algorithm. [148] employs various cooperation strategies under varying workloads, to reduce the request rejection rate (i.e., the efficiency metric in [148]). Another effort [127] combines resource out- sourcing and rejection of less profitable requests in order to increase resource utilization and profit. [137] proposes to efficiently deploy distributed applications on federated clouds by considering security requirements, the cost of compute power, data storage and inter-cloud communication. [18] designs a decentralized cloud resource sharing 16 based computing platform to group resources of various smaller-scale clouds into com- putational units, in order to serve customers’ requests. [113] proposes to incorporate both historical and expected future revenue into virtual machine sharing decisions in order to maximize a smaller-scale cloud’s profit. However, most of these efforts do not study the potential performance degradation resulting at each smaller-scale cloud due to participating in the federation, which is a significant factor for a smaller-scale cloud to determine whether participating in the federation or not; instead, these work study the performance characteristics of the federation as a whole. 17 Chapter 3 On Market-Driven Hybrid-P2P Video Streaming 3.1 Introduction Peer-to-Peer (P2P) based video streaming systems have been developed and deployed in order to address (in an economical manner) scalability problems that exist in client- server based streaming architectures (e.g. Hulu[70], YouTube [11], Netflix [99], etc.), where content providers (CPs) have to keep investing in bandwidth to provide satis- factory quality of service (QoS) to customers. 
However, providing QoS in P2P based video streaming systems is still challenging since the achieved performance is highly dependent on resources contributed by the peers. For instance, in a BitTorrent-like system (which is the focus of this work), a peer experiences poor QoS (where video playback proceeds with frequent pauses) when data blocks are missing from the buffer at the time they are needed for display, due to (1) a poor choice of blocks requested, on the part of the block selection algorithm and/or (2) insufficient download rates (due to not receiving data from neighbors). To address (1), a number of efforts have devel- oped block selection algorithms that can make appropriate block selection choices (see Section 2.1), with some efforts adopting network coding mechanisms to increase the diversity of exchanged blocks [52, 68, 90]. To address (2), recent efforts are focusing on building hybrid systems through integration with Content Delivery Networks (CDNs) or self-hosting video servers [96, 142, 143]. Those efforts tackle the problem of resource 18 planning and allocation of CDN/server resources to achieve satisfactory QoS in hybrid- P2P systems. However, this may not always result in cost efficient solutions, as a given level of satisfactory QoS can also be achieved by instead improving peers’ sharing mech- anisms that in turn will require fewer CDN/server resources and hence reduce costs of profit-minded CPs. In this work, we focus on solving the problem of insufficient download rates by improving peers’ sharing mechanism, to tackle the problem of achieving satisfactory QoS in a cost efficient manner. We achieve this goal by using a game-theory driven economic approach. Such an approach is promising because, in practice, cost-effective resource planning and allocation in hybrid-P2P systems is often jointly a function of (i) the competitive interaction of multiple similar profit-minded CPs in operation and (ii) the competitive interaction of QoS-minded peers in a given CP’s swarm. (In fact, other stakeholders may also participate in such hybrid-P2P systems and can be accommodated by our game-theoretic economic framework, as detailed in the remainder of the chapter.) Game theory driven economics is a widely used technique for holistically analyzing systems where entities have different (often conflicting) interests. Unlike our effort, existing works [96, 142, 143] do not focus on such competitive markets. To improve peers’ sharing mechanisms, we first need to understand the cause of insufficient download rates. In this work, we focus on the BitTorrent protocol because it is the most popular P2P protocol and provides the general foundation of a number of widely used P2P systems today; e.g., CoolStreaming [145], the widely used block- driven P2P streaming protocol, and Popcorn Time [9] both adopt a BitTorrent-like proto- col. (Another widely used P2P system is PPLive [67]; however, the details of its design are not publicly available.) In such systems, an important reason for low capacity peers having insufficient download rates is that a significant amount of the bandwidth is con- tributed by high capacity peers, who in return receive most of the download rates due 19 to incentive-based sharing mechanisms, leaving the low capacity peers with less than sufficient download rates. 
For instance, a Tit-for-Tat (TFT) type strategy is often used in BitTorrent file-sharing systems, where receiving higher download rates by peers is used as an incentive to encourage them to contribute their upload resources, and where lack of contribution consequently results in longer download time for peers. However, in BitTorrent streaming systems, TFT type strategies result in poor QoS for lower capac- ity peers that experience relatively frequent video pauses [104, 139, 140], while higher capacity peers have more than needed download rates. Moreover, significant variations in peers’ upload capacities could also result in video pauses for higher capacity peers, particularly during transient network congestion or transient periods of poor wireless connectivity. This is due to the fact that traditional TFT-based mechanisms reduce the download rate of a peer with current low upload capacity, regardless of how significant its upload contributions were earlier. The reduced rate is sufficient to cause video pauses. As a practical example, when multiple hosts access the same wireless network simultaneously, the throughput of all hosts is deter- mined by the lowest transmission rate among all wireless hosts. This leads to a problem known as the performance anomaly of 802:11 and degrades upload capacities of wire- less peers [65]. It is during these periods of poor connectivity (which could last longer than 10 minutes [19]) that even higher capacity wireless peers suffer from low download rates under a TFT-type strategy. To solve such video pause problems, earlier efforts [88, 97, 111] have focused on using differentiated video quality to incentivize peers to increase overall upload band- width capacity through layered coding mechanisms, such as multiple description coding (MDC) and scalable video coding (SVC). These mechanisms are designed to: (i) allow a peer to view videos at a quality corresponding to its download rate, subsequently reduc- ing its chances of experiencing video pauses, and (ii) incentivize peers to contribute 20 more upload resources (if they want to view higher quality videos) to help low capac- ity peers who need more download bandwidth for satisfactory QoS. With respect to (i), viewing poorest quality of videos still does not prevent low capacity peers from expe- riencing video pauses. For instance, a previous effort in [110] showed that more than 9% of peers cannot completely download video blocks for the base layer in time for playback even when all peers have homogeneous bandwidth capacities. With respect to (ii), increasing the overall upload supply does not necessarily increase the download rates of low capacity peers since high capacity peers would be rewarded with higher download rates due to the TFT-type sharing mechanism in BitTorrent [83]. Thus, differ- entiated video quality is not always a promising solution since the received download rates depend on the sharing mechanisms in P2P-based video systems. Therefore, ide- ally, we need a mechanism that allows higher capacity peers to obtain sufficiently high download rates so that they can first experience streaming (nearly) without pauses, and then (after achieving high QoS) if possible, “release” whatever additional download rates they might have to peers with lower download rates. To this end, our solution is to provide proper incentives, which motivate peers to reallocate more than sufficient download rates from high capacity peers to peers with insufficient download rates (see Section 3.4). 
Previous efforts [86, 125] proposed credit- based mechanisms, where peers earn credits by distributing video blocks to others and pay credits for receiving blocks from others. However, in these types of mechanisms, where incentive is based on upload capacities, if low capacity peers cannot earn suffi- cient credits (due to their low upload capacities) to pay for receiving the required down- load rates for smooth playback, they will still experience video pauses even when some peers are incentivized to contribute more upload bandwidth. Thus, to solve the video pause problem, our approach is to base incentives on the amount of download rates that high capacity peers contribute to helping low capacity peers. Since high capacity peers 21 will increase their risk of experiencing video pauses when reallocating download rates to low capacity peers, our incentive mechanism should also require low capacity peers to “pay” proportionally to what they have received from high capacity peers. Follow- ing this design goal, such incentives could take several forms. For instance, credits and points can still be the incentive. In our work, we propose an Ad-driven Streaming P2P ECosysTem (ASPECT) that aims to eliminate the problem of playback pauses by adopting reduced advertisement (ad) viewing durations as a positive incentive for peers to provide high upload rates because: (i) peers are generally used to viewing ads for streaming shows; some service providers (such as YouTube and Hulu [70]) offer free on-line video delivery services but force customers to view fixed duration ads (i.e., of the same duration for all users) at the beginning or in the middle of a video in order to sell ad periods to ad providers in return for revenue, (ii) peers can immediately observe a reduction in the duration of viewed ads after they increase their bandwidth contributions, and (iii) the business of commercials is a complete ecosystem. When enabling CPs to utilize P2P networks through using ad durations as incentives, an important challenge is as follows. How can a CP determine allocation of CDN/server resources in order to compete with other CPs, while determining appropriate ad dura- tions that will incentivize peers to continue contributing as well as be satisfied with their received QoS? To this end, with peers, ad providers, and CPs as stakeholders involved in the ASPECT ecosystem, we mould ASPECT into a market-based model with the goal to satisfy all stakeholders. In this market, the CPs play a non-cooperative game amongst themselves through combining results from the peers’ game to maximize their utility, which is a function that is increasing in (i) the number of peers staying in their systems and (ii) the minimal ad durations viewed by all peers, and decreasing in the investment for video server capacities. Moreover, the peers play a non-cooperative game amongst 22 themselves, each being selfish and wanting to maximize their utility, which is a function that increases with the received download rates and decreases with the length of ads they have to view. Our main contributions can be summarized as follows. We design ASPECT as a market-based model that consists of one or a set of CPs, a set of ad providers, and a set of peers, and show that ASPECT is able to achieve market success in both monopoly and oligopoly scenarios. 
Conditioned on the existence of an equilibrium point in the peers’ game, ASPECT provides sufficient incentives for high capacity peers to “release” their download rates in return for viewing shorter duration ads. At the same time, ASPECT allows low capacity peers to improve their QoS without significantly increasing their ad durations. Overall, our approach achieves market success, where (i) the CPs are able to make their desired profit while providing sufficient incentives for their peer customers to stay in the system and contribute to greater revenues of the CPs via ad viewing, and (ii) the ad provider ad duration contracts are respected (see Section 3.3). We show that video pause problems can be experienced by every peer, whether of low or high capacity. Thus, within ASPECT, we propose to use ad duration as a new incentive and introduce new sharing mechanisms to allow peers to trade their capacities and ad durations, thereby nearly eliminating video pauses by increasing download rates for all peers (see Section 3.4). All proposed mechanisms work in a completely decentralized manner, without the need for additional support from CPs. 23 Video Servers Ad Servers X Release X Peers Content Provider Reward for viewing fewer ads View more ads due to getting help ... Advertisement Providers Swarm 1 ... Swarm 2 Swarm N ... Choose one Content Provider ... Put Ads on one or more Content Providers Other Peers Figure 3.1: Overview of ASPECT time On-Demand Video Advertisement The minimal requirement The default skip point L m L D L M Figure 3.2: On-demand videos are interleaved with advertisements in ASPECT 3.2 Overview of ASPECT In this section, we present an overview of our ASPECT system. We also state our pro- posed peer reward mechanism and discuss the importance and challenges of designing such a mechanism that could jointly satisfy all CPs, peers, and ad providers. An illus- tration of ASPECT is given in Fig. 3.1. 3.2.1 Architecture of ASPECT As in traditional hybrid P2P systems, CPs invest in video servers (private video servers and CDNs) to guarantee QoS to customers and make deals with video providers for 24 broadcasting video content. Since a CP delivers video streaming services for revenue, it might charge its customers a monthly fee or have its customers to view ads (or both). Peers subscribing to CPs obtain some initial blocks from the CP’s video servers and then exchange blocks with other peers. Here, we assume that a CP also uses servers to deliver ads through a P2P-based mechanism. Ad blocks are shared and consumed similarly to video content blocks, at the same playback rate. Thus, video pauses might also occur during ad viewing. As in TV commercials, we use fixed-length ads in the middle of videos, as shown in Fig. 3.2. However, like YouTube, we allow viewers to skip ads after a specified skip point. Ad providers have agreements with CPs for the minimal duration of ads (L m ) viewed by all peers; thus, the default skip point (L D ) should be beyond the agreed-upon length,L m . The fixed-length ads should also not exceed a common upper bound,L M , in order to prevent peers from leaving the system due to frequent interruptions. For instance, the length of current ads on TV is 31% of real content [115]. Thus, a CP has a fixed ad duration interval at its disposal within which to operate. 3.2.2 Peers in ASPECT In general, peers have heterogeneous capacities (both upload and download), which results in receiving different download rates, as illustrated in Fig. 3.1. 
ASPECT aims to provide sufficient download rates to all peers by creating the following incentive principle: a CP rewards peers that "release" some of the download capacity ("due to them") to the system with shorter ad durations (see Section 3.4). Peers that "acquire" this "released" download capacity from the system are asked to view longer ad durations. However, even though we have a mechanism for peers to release their download capacity in order to reduce their viewed ad lengths, they still need appropriate incentives to do so. For instance, if peers cannot significantly reduce the duration of ads, they will not continue to release their download capacities. On the other hand, if peers have to view significantly longer ads for only a small improvement in QoS, i.e., fewer playback pauses, peers might not want to stay in the system at all. Therefore, it is important to strike an appropriate balance between ad durations and QoS improvement.

3.2.3 Content Providers in ASPECT

In ASPECT, a P2P-based ecosystem, a CP could attract more high capacity peers to contribute their upload capacities by showing them shorter ad durations, thereby eventually reducing the CP's investment in video servers. However, the minimal ad duration could directly affect the revenue of CPs. As illustrated in Fig. 3.2, $L_m$ is the minimal ad duration, which should be viewed by all peers. If $L_m$ is too short, fewer ad providers might be interested in buying this ad period. On the other hand, if a CP sets a long $L_m$ to attract more ad providers, high capacity peers might switch to other CPs that offer shorter ad durations. Therefore, one important question here is to determine an appropriate value of $L_m$ for each CP that strikes a good balance between attracting ad providers to a CP and maintaining a peer customer base for that CP, given competition with other CPs in the market.

3.2.4 Ad Providers in ASPECT

An ad provider is a product manufacturer that is interested in selling its products. Typically, ad providers play their roles after CPs have deployed their platforms and locked in customers. Such ad providers would ideally wish to buy ad viewing periods from CPs that (a) have potentially more customers that buy the advertised products, and (b) can make users view longer ads. Consequently, profit-minded CPs compete with one another on (a) and (b). The competition can be modeled as an auction-based game, where each ad provider maximizes its utility by competing for CPs (potentially those with a high number of peers and sufficiently long ad durations). The results of the competition would determine which ad providers can show ads on which CPs. However, it is complicated to model the peers' preference for ads, and this preference seems to be a second-order factor affecting the peers' engagement in the swarm (as compared to the peers' preference for ad lengths and reward mechanisms, which are first-order factors determining the number of peers in a swarm). Thus, we simplify the model by not directly considering the game among ad providers, and instead combine the utility of CPs with the utility of ad providers, since both share the same goal of having more peers and longer ad durations. Even though in ASPECT we assume that ad providers compete for CPs on minimal ad durations, our payment model is like the one adopted by YouTube Advertise [12], where ad providers do not pay a CP if viewers skip the ad.
Thus, in ASPECT, the real payment is based on actual peer engagement (i.e., the overall ad duration peers have viewed). This also motivates CPs to show longer ad durations and supplements the existing rationale for attracting ad providers.

3.2.5 Trading Download Capacity with Advertisements

For the purpose of trading download capacities for ad viewing, we define a reward mechanism that provides satisfaction to all peers. The reward mechanism consists of a function for properly calculating ad durations based on peers' contributions. Thus, if peer $n$ releases an amount $\Delta D_n > 0$ of unneeded download rate, it is rewarded with a shorter ad duration. On the other hand, if a peer obtains a better QoS by receiving an amount $|\Delta D_n|$ of released download rate (i.e., $\Delta D_n \le 0$), it has to view a longer ad in return. Based on this function, if the default ad duration before the skip point is $L_D$, then the actual skip point assigned to peer $n$, $L_n$, is calculated as:

\[
L_n =
\begin{cases}
\min(L_D - \gamma \Delta D_n,\; L_M) & \text{if } \Delta D_n \le 0,\\
\max(L_D - \gamma \Delta D_n,\; L_m) & \text{if } \Delta D_n > 0,
\end{cases}
\tag{3.1}
\]

where $\gamma > 0$ is the parameter used for translating download rate into ad length. For simplicity, in this chapter we use a linear function for $L_n$; this assumption can easily be extended to a number of non-linear functions (e.g., a convex function).

In order to provide sufficient incentives for peers to pursue the change in ad durations, we need to find a proper combination of $L_D$ and $\gamma$ that will result in peers experiencing a sufficient QoS improvement if they view longer duration ads, or receiving sufficient reductions in ad durations if they release download rates. With a simple reward mechanism, we can eventually find some combinations of $(L_m, L_M, L_D, \gamma)$ that enable all peers to have sufficient download rates and differentiated ad durations. However, it is difficult to tell which combination of $(L_m, L_M, L_D, \gamma)$ can make all peers satisfied with their download rates and ad durations.
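To make the reward mapping concrete, the following minimal Python sketch evaluates Eq. (3.1); the function and parameter names, as well as the example values (a 60-second default skip point and 0.1 seconds of ad time per kbps traded), are illustrative assumptions rather than values prescribed by the model.

def skip_point(L_D, gamma, delta_D, L_m, L_M):
    # delta_D > 0: the peer released download rate (reward: shorter ads);
    # delta_D <= 0: the peer received released rate (penalty: longer ads).
    if delta_D > 0:
        # A releasing peer's ad is shortened, but never below the contracted minimum L_m.
        return max(L_D - gamma * delta_D, L_m)
    # A receiving peer's ad is lengthened, but never beyond the upper bound L_M.
    return min(L_D - gamma * delta_D, L_M)

print(skip_point(L_D=60, gamma=0.1, delta_D=200, L_m=30, L_M=120))   # 40.0: released 200 kbps
print(skip_point(L_D=60, gamma=0.1, delta_D=-200, L_m=30, L_M=120))  # 80.0: received 200 kbps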
Moreover, the combination of $(L_m, L_M, L_D, \gamma)$ could also affect the revenue of CPs and ad providers. Thus, for the ecosystem to exist, it is insufficient for a reward mechanism to focus on satisfying only a subset of the stakeholders, e.g., the peers, rather than the entire set. For instance, if CPs only focus on peers, their ad periods might not be attractive to ad providers. Moreover, only focusing on selling ad periods would result in peers viewing intolerably long ads. To this end, we first model ASPECT as a game-based market. We then realize ASPECT in the context of a real protocol, namely a BitTorrent-like video streaming system, with modifications that help us achieve the desired incentives (see Section 3.4). In order to address this challenge, we resort to dynamic game theory (see Section 3.3) for arriving at the ideal parameter settings for the reward mechanism, i.e., to determine appropriate $L_D$ and $\gamma$ values that jointly satisfy all ASPECT entities, including peers, CPs, and ad providers. A summary of the notation used throughout this chapter is given in Table 3.1. Our extensive simulation-based study of ASPECT's performance is described in Section 3.5.

Table 3.1: Summary of notation

$L^i_D$: default ad duration at $CP_i$
$\gamma_i$: ratio for translating rate to ads at $CP_i$
$C_v$: constant video bit rate
$O_{C_i}$, $O_{P_n}$: upload capacity of $CP_i$ and peer $n$
$D_n$: average download rate received by peer $n$
$D^t_{k,n}$: download rate from peer $k$ to peer $n$ during $[t-1, t)$
$A^t_{k,n}$: average download rate from peer $k$ to peer $n$ during $[0, t)$
$R^t_n$: number of neighbors released by peer $n$ at time $t$
$Q^t_n$: number of continuous blocks in peer $n$'s buffer at the start of time slot $t$
$T$: length of requesting period/time slot
$X$: number of blocks for 1 second of video playback
$Z$: size of one block (Kb)
$W^t_n$, $P^t_n$: reward and penalty for peer $n$ at time $t$
$L_{n,j}$: actual ad duration viewed by peer $n$ in the $j$-th period

3.3 Market for P2P Video Streaming

In this section, we design market games to determine proper parameter values (for the reward mechanism in Eq. (3.1)) that provide sufficient incentives for peers to participate in the system and, at the same time, jointly satisfy the interests of content and ad providers, thereby ensuring market success. This section is structured as follows. We first describe the market environment. We then formulate the utility functions of the players/stakeholders in the market. Finally, we describe the details of our proposed games, appropriate to specific market types, and explore the notion of market efficiency.

3.3.1 The Market Environment

In our market setting, we consider multiple CPs, multiple ad providers, and a set of $N$ peers interested in on-demand videos. Those videos can be streamed as single-layer videos with constant bit rate $C_v$ or as multi-layer videos with minimal required bit rate $C_v$ (which is the bit rate of the base layer in Scalable Video Coding (SVC) techniques). Here, we use constant bit rate (CBR) videos because it has been shown that using variable bit rate (VBR) videos does not significantly improve the performance of BitTorrent systems, due to the use of fixed-size blocks in the BitTorrent protocol [58, 146]. Since our goal in this chapter is to solve the video pause issue that is due to insufficient download rates, we use CBR to focus on the sharing strategies in BitTorrent systems. (Few results exist on efficient transmission of VBR chunks in BitTorrent; solving this problem is beyond the scope of this chapter.) Thus, as long as a peer can obtain a download rate greater than $C_v$, that peer experiences no video pauses. (In real systems, peers might still experience video pauses due to an inappropriate order of block downloads, which is not the focus of this work.)

To provide service, $CP_i$ invests in servers (private video servers and CDNs) that provide a total upload capacity of $O_{C_i}$. Moreover, $CP_i$ chooses a default skip point, $L^i_D$, for peers in its swarm, and provides a reward mechanism to encourage peers to increase their contributions, thereby facilitating a decrease of $L^i_D$. Based on the reward mechanism and the contributions from peers, there exist a minimal ad duration, $L^i_m$, and a maximal ad duration, $L^i_M$, where $L^i_m \le L^i_D \le L^i_M$, for peers in the swarm of $CP_i$, as illustrated in Fig. 3.2.
$CP_i$ uses its utility function to tune the parameter combination, $(L^i_m, L^i_M, L^i_D, \gamma_i)$, to attract more contributing peers to stay in its swarm, thereby reducing its investment in video servers, while at the same time improving its revenue by selling ad periods to ad providers. Thus, an important focus of our work is to ensure that CPs keep more peers in their system under longer minimal ad durations, in order to attract greater ad provider revenues. (The competition among ad providers is not the focus of this work.) This in turn also reduces CPs' infrastructure costs, i.e., by relying on peers to deliver streaming video content (via P2P technology) to other peers. However, longer ad durations might repel peers and encourage them to move to other swarms in oligopoly scenarios. Thus, we design the utility functions of CPs to align with those of the ad providers (since they share the same goals), and subsequently make the swarm size and the ad duration length the salient parameters of the utility function of CPs (as discussed in Section 3.3.3).

A peer is a rational, strategic player that wants to maximize its utility (as discussed in Section 3.3.2). Based on the received download rate, $D_n$, peer $n$ uses its utility function to choose (i) the amount of download rate, $\Delta D_n$, it wants to release/receive (benefit), and (ii) the length of ads it has to view (cost). If a peer experiences video pauses or is unsatisfied with the received ad duration, it might switch to another CP.

3.3.2 Non-cooperative Game among Peers

Here, we first introduce the peers' utility functions, and then describe the game setting when all peers have decided in which swarm to participate. (In the oligopoly setting, peers can change their swarm in every round.)

Peers' Utility Functions

Peers' utility should be based on the amount of video pauses and the length of ads they need to view. However, for peers with different capacities, one second of video pause might be due to a different number of missing video blocks. Thus, we need a better metric to distinguish between video pauses. Since, in ASPECT, we only consider video pauses arising due to insufficient download rates, we can relate the amount of required but not received download rate (which subsequently triggers video pauses) to our metric: if a peer needs a higher additional amount of download rate to meet the required video bit rate, then this peer will experience more frequent video pauses. To be more general, since some peers might prefer to obtain higher download rates, we extend this metric to the received download rate. To this end, we define the utility of peer $n$, $U^P_n$, to be a linearly separable function of its achieved download rate and its ad duration. We first model peer $n$'s benefit from its download rate as a monotonically increasing concave function (i.e., a Cobb-Douglas function of one variable [36]; the Cobb-Douglas function is a standard in the economics literature for modeling consumers' utilities with respect to resources [92]), $U^{rate}_n$:

\[
U^{rate}_n(b) =
\begin{cases}
\frac{1}{1-\rho}\, b^{\,1-\rho} & \text{if } 0 < \rho < 1,\\
\log b & \text{if } \rho = 1,
\end{cases}
\tag{3.2}
\]

where $\rho$ represents a peer's preference. Here, $b$ is the normalized received download rate:

\[
b = (D_n - \Delta D_n)/f(C_v),
\tag{3.3}
\]

where $f(C_v)$ is a normalization function based on the minimal required video bit rate $C_v$; $f(C_v)$ is used to guarantee that all utilities are in the same range.

Peers are happier if they view shorter ad durations, and are unhappy if they have to view longer duration ads. Without loss of generality, we can also model the cost (in a sense, a negative utility) to peer $n$ of viewing ads as a concave function, $U^{ad}_n(a)$, as in Eq. (3.2),
where $a$ is the normalized length of an ad. Based on the amount of download rate released (received), peers receive a decrease (increase) in their ad viewing durations. Thus, $a$ is defined as:

\[
a = \min(\max(L_D - \gamma \Delta D_n,\; 0),\; L_M)/f(C_v),
\tag{3.4}
\]

where the parameter $\gamma$ "translates" download rates into ad lengths. Based on the benefit and cost functions, $U^{rate}_n(\cdot)$ and $U^{ad}_n(\cdot)$, the utility function of peer $n$, $U^P_n$, is defined as:

\[
U^P_n = \omega\, U^{rate}_n(b) - \psi\, U^{ad}_n(a) + c,
\tag{3.5}
\]

where $\omega$, $\psi$, and $c$ are constants used to guarantee that $U^P_n \ge 0$. $\omega$ and $\psi$ are chosen such that

\[
\lim_{x \to 0} \frac{\omega \left[ U^{rate}_n\!\left(\tfrac{C_v}{f(C_v)}\right) - U^{rate}_n\!\left(\tfrac{C_v - x}{f(C_v)}\right) \right]}{x}
\;\simeq\;
\lim_{y \to 0} \frac{\psi \left[ U^{ad}_n\!\left(\tfrac{L_M/2}{f(C_v)}\right) - U^{ad}_n\!\left(\tfrac{L_M/2 - y}{f(C_v)}\right) \right]}{y},
\tag{3.6}
\]

where Eq. (3.6) provides us with the condition under which a low capacity peer would be willing to increase its ad viewing duration in order to increase its download rate. Failure to satisfy the above condition for a low capacity peer would result in it preferring video pauses over an increase in its download rate. (Note that peers can be diverse, and not every peer might care to mitigate video pauses.)

We define the peer utility in this way because it is very difficult to enforce certain conditions in a CRS (constant returns to scale) Cobb-Douglas function, as used by the CPs (see Section 3.3.3). For instance, there is a specified download rate requirement, $C_v$, for peers not to experience video pauses. Some low capacity peers should have higher utilities when they can increase their ad viewing durations by no more than a certain amount ($L_M/2$ in our setting) in exchange for receiving $C_v$ amount of download rate. However, it is difficult to express such a preference in the CRS model. Thus, in our work, we achieve this preference by using Eq. (3.6) to find the coefficients for properly combining two Cobb-Douglas functions of one variable, even though the resulting function might not satisfy the quasi-concavity assumption.

For simulation purposes, we choose $\omega$ and $\psi$ to satisfy Eq. (3.6) when $b = a = 0.5$, such that at least half (due to $b = a = 0.5$) of the peers (without loss of generality) will satisfy Eq. (3.6), i.e., half of the population of peers would be of the mindset to prefer a short ad duration even when they experience video pauses. However, there are still many pairs of $(\omega, \psi)$ satisfying Eq. (3.6) with $b = a = 0.5$. Here, we randomly pick one pair of $(\omega, \psi)$ with the minimum value of $\omega/\psi$. (When we set the value of $\omega/\psi$, different pairs of $(\omega, \psi)$ are simply scaled.) Finally, we choose $c$ to guarantee that all peers' utilities are greater than or equal to zero. In practice, to maximize peers' utilities, our proposed P2P mechanisms enable peers to adjust their contributions ($\Delta D_n$) according to their preferences (see Section 3.4).
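The peer-side objective can be summarized in a few lines of Python. This is a minimal sketch under illustrative assumptions: we take $f(C_v) = C_v$ as the normalization function, and the values of rho, omega, psi, and c are placeholders (in the model they are fitted via Eq. (3.6)).

import math

def crra(x, rho):
    # One-variable Cobb-Douglas (CRRA-style) utility of Eq. (3.2);
    # concave and increasing for x >= 0 when 0 < rho < 1.
    return math.log(x) if rho == 1 else x ** (1 - rho) / (1 - rho)

def peer_utility(D_n, delta_D, L_D, gamma, L_M, C_v,
                 rho=0.5, omega=1.0, psi=1.0, c=0.0):
    # Eq. (3.5): weighted benefit from the achieved rate minus weighted ad cost.
    b = (D_n - delta_D) / C_v                             # Eq. (3.3), with f(C_v) = C_v
    a = min(max(L_D - gamma * delta_D, 0.0), L_M) / C_v   # Eq. (3.4)
    return omega * crra(b, rho) - psi * crra(a, rho) + c

A peer maximizing this function over its feasible $\Delta D_n$ values (bounded by the common pool described next) trades a slightly lower download rate for a shorter ad, or vice versa, exactly as Eq. (3.1) intends.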
Game Setting among Peers

We consider a game setting where the peers in a swarm play a dynamic multi-round non-cooperative game amongst themselves, where the goal is to find a subgame perfect Nash equilibrium (SPNE) [54] vector of download rates for the individual peers (as depicted in Fig. 3.3). An SPNE is the final Nash equilibrium (NE) outcome of a dynamic multi-round game, where each round has a NE. The reason for modeling a dynamic multi-round game is that the equilibrium obtained from a single game round may not be sustained over time, because players take into account the history of their game play in every round to improve their utilities.

Figure 3.3: State diagram of the game between peers

In this game, the strategy parameter for the peers is the download capacity they want to release to (or receive from) the system in order to satisfy their QoS, where the download rate of a peer is a function of its upload contribution. They make this decision according to their originally received download capacities, which depend only on their upload capacities. For simplicity of the model, we assume that the originally received download rate of peer $n$ in round 1 of the game is proportional to the ratio of its upload capacity to the overall capacity, and we represent it as

\[
D_n = \frac{O_{P_n}}{\sum_{k=1}^{N} O_{P_k}} \left( \sum_{k=1}^{N} O_{P_k} + O_{C_i} \right),
\]

where $O_{P_n}$ is peer $n$'s upload capacity, and $O_{C_i}$ is the upload supply of $CP_i$. (The function for real download rates is discussed in Section 3.4.1.)

To keep track of the spare upload capacity from all peers, we define a logical common pool to maintain the "free" download capacity released by the peers in every round. (This logical common pool is only used for accounting of the download capacity in the system; we use it to make sure that the extra download capacity received by peers does not exceed the available amount in the system.) Peers cannot obtain download capacities beyond what this common pool can offer. Thus, due to the unavailability of sufficient "free" download capacity in the common pool, peers can only adjust their download capacity by a limited amount in a given round. In order not to favor a small set of peers, each peer has an opportunity to be the first to release (or receive) download capacities that maximize its utility in different rounds of the game. Each round of the game results in a NE and takes into account the equilibrium peer strategies from the previous round to arrive at the current NE. Rounds of the game are repeated until the game converges to the SPNE.
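As a small illustration of the proportional round-1 allocation described above, the following Python fragment (with made-up capacities) splits the total supply, peer uploads plus the CP's server capacity, in proportion to each peer's own upload capacity.

def initial_rates(peer_uploads, server_upload):
    # D_n = O_Pn / sum_k(O_Pk) * (sum_k(O_Pk) + O_Ci)
    total_peer = sum(peer_uploads)
    supply = total_peer + server_upload
    return [o / total_peer * supply for o in peer_uploads]

# Three peers (kbps) plus a 1000 kbps server; rates stay proportional to uploads.
print(initial_rates([256, 512, 1024], 1000))  # approximately [398.9, 797.7, 1595.4]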
3.3.3 Non-cooperative Game among Content Providers

In this section, we focus on a scenario that consists of a set $I$ of competing CPs, each $i \in I$ hosting video servers with maximum upload capacity $O_{C_i}$ and competing with the others to attract peers. We first introduce the CPs' utility functions, and then describe the game setting modeling the competition among CPs.

Content Providers' Utility Functions

Unlike the parameters used in the peer utility function, which can be directly controlled by the peers alone, all parameters used in the CP utility function are determined by the reactions of peers to the decisions made by CPs. We define the utility of $CP_i$ as $U_{C_i}$, which is a function of the number of peers in its swarm, the minimal ad duration, and the upload bandwidth supplied by its servers. Here, we again use the Cobb-Douglas function, with three variables, to model the CP utility as a monotonically increasing concave function of its variables, i.e., for $CP_i$:

\[
U_{C_i}(P, O, L) = P^{\alpha} O^{\beta} L^{\theta}, \qquad \alpha + \beta + \theta = 1,
\tag{3.7}
\]

where $P$ is the normalized number of peers, $L$ is the normalized ad duration, and $O$ is the normalized upload bandwidth supply. $\alpha$, $\beta$, and $\theta$ are the output elasticity parameters of the respective CP utility function variables; each denotes the percentage change of the CP utility divided by the percentage change of the respective variable. For example, if $\alpha = 0.4$, a 1% increase in $P$ would result in a 0.4% increase in $U_{C_i}$. In this work, we have performed extensive experiments to evaluate and study the effects on market efficiency of all parameter combinations that satisfy this relation (discretized at intervals of 0.1 for purposes of simulations). Due to lack of space, in Section 3.5.3, we only discuss a couple of representative scenarios, i.e., parameter combinations, that represent the entire discretized parameter space.

We define $P$, the normalized number of peers, as

\[
P = P_i / P_G,
\]

where $P_i$ is the number of peers joining the swarm of $CP_i$ and $P_G$ is the total number of peers in the market environment. Many factors influence peers' decisions to stay in a swarm or to switch to another one, including the video bit rate, the variety of video content, the service fee, the frequency of video pauses, and the ad duration. However, the variety of video content provided by a CP is determined by the nature of the policy and cooperation among video producers, something that is not easy for a CP to change dynamically. Therefore, we mainly focus on the frequency of video pauses and the ad duration as the factors contributing to a peer's decision to stay in a swarm, and use the variety of video content as an input argument that a peer uses to decide when it wants to switch swarms.

As we have discussed before, if more high capacity peers, attracted by short ad durations, join the system, a CP could reduce its investment in servers, in turn increasing its net revenue. So, we define $O$ as the normalized upload bandwidth supply:

\[
O = 1 + \Delta O_{C_i} / O_{C_i},
\]

where $O_{C_i}$ is the maximum bandwidth supply of $CP_i$, and $\Delta O_{C_i}$ is the amount of unused bandwidth. We set the range of $O$ to lie in $[1, 2]$ instead of in $[0, 1]$ to discourage CPs from eagerly pursuing decreases in bandwidth investment, and rather to have them utilize all of their upload supply (if needed) to provide a sufficient level of QoS for their customers to stay in the system, thus earning revenue. However, if the minimal ad duration is too short, the CP will have reduced opportunity to increase net revenue through payments from ad providers, because only a small set of these providers would bargain for the ad period. Thus, we define $L$ to be the normalized ad duration

\[
L = L^i_m / L^i_M,
\]

where $L^i_m$ is the achieved minimal ad duration and $L^i_M$ is the maximal ad duration at $CP_i$. Like peer utility functions exhibiting different preferences, i.e., different values of $\rho$, CP utility functions also exhibit different preferences, i.e., different values of the $(\alpha, \beta, \theta)$ tuple.
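A minimal Python rendering of Eq. (3.7), together with the three normalizations above, is given below; the elasticity values are placeholders (the chapter explores the full discretized (alpha, beta, theta) space).

def cp_utility(P_i, P_G, unused_bw, O_Ci, L_m, L_M,
               alpha=0.4, beta=0.3, theta=0.3):
    # Cobb-Douglas CP utility of Eq. (3.7); the elasticities must sum to 1.
    assert abs(alpha + beta + theta - 1.0) < 1e-9
    P = P_i / P_G                 # normalized swarm size, in [0, 1]
    O = 1.0 + unused_bw / O_Ci    # normalized upload supply, in [1, 2]
    L = L_m / L_M                 # normalized minimal ad duration, in [0, 1]
    return P ** alpha * O ** beta * L ** theta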
Game Setting

Figure 3.4: State diagram of the game between content providers

In this model, we have a multi-round dynamic game setting (as illustrated in Fig. 3.4), where a peer can strategically maximize its utility by changing its download capacity as well as by choosing the best-fitted CP. Here, we adapt the concept of fictitious play, i.e., a time-averaged technique [29] from the theory of learning in games, and assume that each CP does not know the utility functions of the other CPs and of the peers. A CP can strategically maximize its utility by changing its number of peers, upload bandwidth supply, and ad duration, based on the reward mechanisms announced by other CPs and the peers' contributions in the previous round of the game.

Here, a peer $n$ evaluates a potential $CP_i$ with a score $E^i_n$:

\[
E^i_n =
\begin{cases}
U^{ad}_n(L^i_n) / U^{content}_n(v_i) & \text{if } D_n - \Delta D_n \ge C_v,\\
U^{ad}_n(L^i_n) & \text{otherwise},
\end{cases}
\tag{3.8}
\]

where $U^{ad}_n(L^i_n)$ is the peer utility due to the ad duration, $L^i_n$, viewed by peer $n$ at $CP_i$ (here, lower is better; see Section 3.3.2), and $U^{content}_n(v_i)$, a concave function as in Eq. (3.2), is the peer's valuation of the amount of video content ($v_i$) provided by $CP_i$ (here, higher is better). It is evident that the lower the score $E^i_n$ (shorter ad durations and more video content) for $CP_i$, the greater is the inclination of peer $n$ to move to $CP_i$'s swarm. Here, we only consider the number of videos, $v_i$, to be the sole argument of $U^{content}_n(\cdot)$, due to the difficulty of modeling other factors (such as the categories of video content) quantitatively. In our evaluation, we allow peer $n$ to jointly consider the number of videos hosted by $CP_i$ as well as its provided ad duration when this peer has received acceptable QoS in a former $CP_k$'s ($k \ne i$) swarm; otherwise, the peer will only take the ad duration of the potential new CP into consideration, because the amount of content is less important when peers cannot view videos smoothly. Hence, we assume that each peer behaves in a homogeneous manner in different CP swarms, and that other CPs act similarly to the peer's current CP. Thus, peer $n$ will use the value of the upload capacity released in the current swarm to compute the estimated ad duration, $L_n$, in other swarms. (Here, we assume a CP announces its reward mechanism to all peers.) A peer continues evaluating all CPs and keeps switching from one CP to another, from round to round, if the evaluation score for a new CP is (i) $\tau_h\%$ less than that of the current CP under sufficient QoS, and (ii) $\tau_l\%$ less than that of the current CP when this peer does not have sufficient QoS. (We use threshold values here to avoid oscillations between rounds.) We set the threshold higher ($\tau_h \ge \tau_l$) for peers to switch swarms when they have sufficient QoS, because peers are likely to stick with a service that already provides them with satisfactory QoS.

As discussed above, CPs not only strategically compete with each other to attract peers, but also intend to increase their revenue by decreasing their video server capacity and by selling ad periods to more ad providers. To this end, a CP has to search for a proper reward mechanism to maximize its utility. We assume that each CP knows the other CPs' reward mechanisms from the last round of the multi-round game between them. Based on the peers' contributions in the present game round, a CP adjusts its reward mechanism, assuming that peers which do not have video pauses do not leave the system in the next round, while peers that experience video pauses only leave the system if other CPs can provide ad durations that are more than $\tau_l\%$ (this comes from the peers' rule for switching swarms) shorter than what it provides. Thus, the CP maximizes its utility by trading off the number of peers ($P_i$), the minimal ad duration ($L^i_m$), and the amount of unused bandwidth in the video servers ($\Delta O_{C_i}$). The algorithm for a CP to maximize its overall utility is given in Algorithm 1.

ALGORITHM 1: Reward mechanism for $CP_i$
  Goal: find the reward mechanism producing the highest utility within the search range defined by the parameter $0 < \epsilon \le 1$
  At the beginning of each round, each $CP_j$, $j \ne i$, announces its reward mechanism, $L^j_D$ and $\gamma_j$, to the other CPs
  $CP_i$ only knows its utility parameters $P_i$, $\Delta O_{C_i}$, and $L^i_m$
  Assume $\mathcal{P}_i$ is the set of peers subscribing to $CP_i$, with $|\mathcal{P}_i| = P_i$
  From the ad duration, $L_n$, reported by each peer in $\mathcal{P}_i$, $CP_i$ knows $\Delta D_n$ for each peer $n$ in $\mathcal{P}_i$
  Let $L^{i\prime}_D \in [\max(0, L^i_D(1-\epsilon)),\; \min(L_M, L^i_D(1+\epsilon))]$
  Let $\gamma'_i \in [\max(0, \gamma_i(1-\epsilon)),\; \min(1, \gamma_i(1+\epsilon))]$
  for each candidate pair $(L^{i\prime}_D, \gamma'_i)$ do
    $\mathcal{P}'_i \leftarrow \emptyset$;  $\Delta O'_{C_i} \leftarrow \Delta O_{C_i}$;  $L^{i\prime}_m \leftarrow \infty$
    for each peer $n$ in $\mathcal{P}_i$ having no video pauses do
      if $\Delta D_n \ge 0$ and $\exists\, d \le \Delta D_n$ such that $\max(0, L^{i\prime}_D - \gamma'_i d) \le L_n$ (and no pauses) then
        $\mathcal{P}'_i \leftarrow \mathcal{P}'_i \cup \{n\}$;  $\Delta O'_{C_i} \leftarrow \Delta O'_{C_i} - (\Delta D_n - d)$
        $L^{i\prime}_m \leftarrow \min(L^{i\prime}_m,\; \max(0, L^{i\prime}_D - \gamma'_i d))$
      else if $\Delta D_n < 0$ and $\exists\, d \le \Delta D_n + \Delta O'_{C_i}$ such that $\min(L_M, L^{i\prime}_D + \gamma'_i d) \le L_n$ (and no pauses) then
        $\mathcal{P}'_i \leftarrow \mathcal{P}'_i \cup \{n\}$;  $\Delta O'_{C_i} \leftarrow \Delta O'_{C_i} - (\Delta D_n + \Delta O'_{C_i} - d)$
      end
    end
    for each peer $n$ in $\mathcal{P}_i$ still having video pauses do
      if $\exists\, d \le \Delta D_n + \Delta O'_{C_i}$ and no $CP_j$, $j \ne i$, offers an ad more than $\tau_l\%$ shorter then
        $\mathcal{P}'_i \leftarrow \mathcal{P}'_i \cup \{n\}$;  $\Delta O'_{C_i} \leftarrow \Delta O'_{C_i} - (\Delta D_n + \Delta O'_{C_i} - d)$
      end
    end
    Obtain $U'_{C_i}$ based on $\mathcal{P}'_i$, $\Delta O'_{C_i}$, and $L^{i\prime}_m$
  end
  Return the pair $(L^{i\prime}_D, \gamma'_i)$ that generates the largest $U'_{C_i}$
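Complementing the CP-side search, the peer-side rule of Eq. (3.8), together with the hysteresis thresholds, can be sketched in Python as follows; tau_h and tau_l are illustrative placeholder values, not values prescribed by the model.

def cp_score(u_ad, u_content, sufficient_qos):
    # Evaluation score E_n^i of Eq. (3.8); lower is better for peer n.
    return u_ad / u_content if sufficient_qos else u_ad

def should_switch(score_candidate, score_current, sufficient_qos,
                  tau_h=0.2, tau_l=0.1):
    # Switch only if the candidate CP scores at least tau_h (tau_l) percent
    # lower than the current CP under sufficient (insufficient) QoS.
    tau = tau_h if sufficient_qos else tau_l
    return score_candidate < (1.0 - tau) * score_current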
In each round of the game, each peer determines its CP and plays a non-cooperative game with the other peers in the swarm, using the reward mechanism determined by the CP in the previous round. Subsequently, each CP maximizes its utility by adjusting its reward mechanism based on the result of the peers' game in an equilibrium state and on the equilibrium reward functions of its competitors in the previous round. This multi-round game is repeated until it converges to an SPNE of reward functions for the CPs and download capacities for the peers.

Reaching Equilibrium and Market Efficiency

An equilibrium point of the game here represents the point where no peer tries to change its swarm and capacity, and no CP tries to change its reward mechanism. Since, in this game, we maximize the sum of utilities of all players in the market, this equilibrium point also corresponds to market efficiency [92]. In this work, we are focusing on practical modeling of ad-driven streaming P2P ecosystems, so we resort to simulation-based experiments in order to (i) deal with pure strategy Nash equilibria, and (ii) work with arbitrary utility functions, both of which are practically desirable for a diverse set of applications. We do not provide a mathematical proof of the existence of a NE, because our peer utility function does not necessarily satisfy the quasi-concavity assumption, which in turn is necessary for guaranteeing a pure and/or mixed strategy NE in theory [41, 44, 49, 55, 98]. (A mathematical proof of the existence of a NE under non-quasi-concave utility functions is still a difficult open problem.)

3.4 Sharing Mechanisms in ASPECT

In Section 3.3, we gave an overview of our market-based model, which determines parameters for the reward function in Eq. (3.1) that provide sufficient incentives for peers to trade download rates and ad durations. Here, we focus on the mechanisms that give peers in ASPECT the ability to trade download rates and ad durations (according to their preferences) in the context of a BitTorrent-based streaming system. As mentioned in Sections 3.1 and 3.3, we only consider CBR single-layer video streaming, so as to highlight (in a clear manner) the benefits of our proposed mechanisms. Moreover, we propose a decentralized method to quantify peers' contributions in order to provide differentiated ad durations.

3.4.1 BitTorrent-like Video Streaming Systems

In a hybrid BitTorrent-like system, peers obtain video blocks from the content providers' servers as well as exchange blocks with other peers, i.e., their neighbors in an overlay network. Peers have a strategy, to which we refer as the "Peer Request" mechanism in the remainder of the chapter, for selecting from which neighbors to request blocks. At the same time, every peer receives messages from its neighbors requesting blocks and determines a subset of them to respond to, since each peer only has a limited upload capacity. A neighbor is "unchoked" when it is selected to receive data (in response to its request), and we refer to this decision process as the "Peer Selection" mechanism in the remainder of the chapter. In order to encourage peers to contribute resources, BitTorrent-like systems typically adopt a TFT-type strategy in their Peer Selection mechanism, i.e., bigger contributors are rewarded with larger amounts of resources. For instance, a peer unchokes several (typically 4 or 5) neighbors, those that provide the highest download rates to it, to send blocks to.
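For reference, this classic TFT-style peer selection admits a very short sketch; the following Python fragment (with an illustrative neighbor set and rates) is a simplified rendering, and the 10-second re-evaluation and the extra optimistic unchoke slot it includes are discussed next.

import random

def select_unchoked(download_rates, k=4, n_optimistic=1):
    # Unchoke the k neighbors that provided the highest download rates,
    # plus a few randomly chosen (optimistic) unchokes among the rest.
    ranked = sorted(download_rates, key=download_rates.get, reverse=True)
    unchoked = ranked[:k]
    rest = ranked[k:]
    unchoked += random.sample(rest, min(n_optimistic, len(rest)))
    return unchoked

rates = {"a": 900, "b": 250, "c": 600, "d": 120, "e": 480, "f": 750}  # kbps
print(select_unchoked(rates))  # e.g., ['a', 'f', 'c', 'e', 'b']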
In the BitTorrent protocol, a peer can only choke and unchoke peers once every 10 seconds, to avoid oscillations [13]. Additional peers (typically 1) are also unchoked for uploading (i.e., in response to their requests for data) in a random manner, in order to explore newly arrived neighbors; this is referred to as optimistic unchoking. Once a neighbor is unchoked, it needs a block selection algorithm to determine which specific blocks to request. Because the traditional rarest-first mechanism is not suitable for video streaming, other block selection algorithms [33, 133, 147], better suited for streaming, have been designed; these take the order in which blocks are streamed into consideration and thus put higher priority on the selection of blocks that are needed in the near future. All these choices are re-evaluated periodically.

As noted earlier, although a number of block selection algorithms can make appropriate block selection choices, it is still difficult to guarantee that the blocks needed in the near future will be downloaded in time. If none of the peers unchoking peer $n$ have the block required by peer $n$ in the near future, then no block selection algorithm can prevent the video pauses experienced by peer $n$. To this end, several deadline-aware approaches have been proposed [141]. Under such approaches, rather than receiving an unchoking request, a peer receives a request for a specific block with a deadline corresponding to the time of expected use of this block. However, such deadline-driven approaches have disadvantages as compared to the BitTorrent design. For instance, deadline-driven approaches can cause significant oscillations in peers' downloading (i.e., a peer frequently chokes and unchokes its neighbors), making it difficult to accurately measure neighbors' contributions. Moreover, there is the additional overhead of including deadline information in requests, as well as the need to send multiple copies of requests for the same block: after a block is downloaded, a peer would need to cancel all other requests for the same block. Such overheads are avoided in the original BitTorrent design; exploring these tradeoffs is beyond the scope of this chapter.

Consequently, in this chapter, we begin with the traditional design of a BitTorrent system, and focus on the problem of not being able to obtain a sufficiently high download rate from neighbors. (We also note that a high download rate can aid with an occasional poor choice of block selection.) Specifically, we focus on the Peer Selection and Peer Request mechanisms, which are particularly responsible for download rates, and use a block selection algorithm based on the principles described in [133], as detailed in Section 3.5. Traditional Peer Selection and Peer Request mechanisms, enabling high capacity peers to maximize their download rates, often result in low download rates for low capacity peers. Consequently, low capacity peers may experience significant video pauses (due to slowly filling buffers) and may even leave the system (almost regardless of which block selection algorithm is used). Moreover, most efforts in the literature assume that an individual peer's upload capacity remains relatively stable. However, a number of factors, such as network congestion and poor wireless connectivity, affect peers' available upload capacities.
When a peer has transiently poor connectivity, resulting in low upload contributions, this peer is likely to experience severe video pauses, because its neighbors (using a traditional TFT-type strategy) are likely to choke this peer, essentially regardless of how significant its upload capacity contributions were in the past. In such situations, simply downgrading the video quality may not be a satisfactory or sufficient solution. To improve the quality of service, our work focuses on reducing video pauses by enabling peers to acquire download rates sufficient for video playback. To achieve this goal, we modify the Peer Selection and Peer Request mechanisms so as to allow high capacity peers opportunities to "release" resources (that are not needed by them) to low capacity peers.

3.4.2 Peer Selection Mechanism

Before proposing our modified mechanisms, we first describe our abstraction of the BitTorrent-like video streaming system. We view the system as operating in slotted time (with a time slot of length $T$). At the beginning of each time slot, peers determine to whom they should send requests and which of their neighbors' requests to grant. Given the asymmetric nature of upload/download capacities in users' connectivity, we assume (as is typically done) that the available download capacity of peers is not the bottleneck, i.e., the download rates acquired by peers are determined by the available upload capacity. (However, if wireless peers experience bandwidth losses caused by the performance anomaly of 802.11, the download rates of these wireless peers are determined by the wireless transmission rates of the access points (APs).) Specifically, during time slot $t$, peer $n$ has download rates $D^t_{1,n}, \ldots, D^t_{m_n,n}$ from its $m_n$ neighbors. These download rates are a function of the neighbors' upload capacities and of the peer selection algorithm (see below). For instance, a peer receives high download rates if it has a relatively higher upload capacity than the peers nearby. Given videos with a constant bit rate ($C_v$), for one time slot of playback, peer $n$'s total download rate should be greater than or equal to the video bit rate, i.e., $\sum_{k=1}^{m_n} D^t_{k,n}\, T \ge T\, C_v$; otherwise, peer $n$ suffers from video pauses.

In a traditional TFT-based strategy, a peer ranks neighbors according to the download rates they provide and then unchokes several of them, those providing the greatest download rates. This strategy works reasonably well in file-sharing applications, since the effect on low capacity peers is simply longer download times. However, in streaming applications, this is likely to result in video pauses, i.e., a much more significant degradation in QoS. When a peer experiences transient upload capability drops, using the download rate from the previous time slot as the indicator for selecting neighbors to unchoke may significantly decrease the download rate of that peer. Therefore, to reduce the degradation of video quality (in the form of pauses) caused by transient upload capacity degradation, our approach ranks neighbors according to their reputation scores, which are calculated from a combination of the neighbors' historical contributions and the current download rates experienced by this peer. The use of historical contributions in the reputation scores is needed to absorb the effects of a transient capacity drop.
We define $A^t_{k,n}$ as the average download rate (i.e., historical contribution) obtained by $n$ from neighbor $k$ (up to but not including slot $t$); here, we only consider download rates greater than zero when calculating this average, as we would like it to reflect the average upload capacity of neighbor $k$. (A download rate of zero might just mean that the two peers did not share blocks during a particular time slot.) We use $I^t_{k,n} \in \{0, 1\}$ to indicate whether or not peer $n$ obtains a non-zero download rate from neighbor $k$ during time slot $t$, i.e., $I^t_{k,n} = 1$ indicates that the download rate is greater than zero. Then, the average download rate obtained by $n$ from neighbor $k$ is defined as:

\[
A^t_{k,n} = \sum_{j=1}^{t} D^j_{k,n} \Big/ \sum_{j=1}^{t} I^j_{k,n}.
\tag{3.9}
\]

We then define the reputation score $S^t_{k,n}$ of neighbor $k$ as:

\[
S^t_{k,n} = \beta^t_{k,n} A^{t-1}_{k,n} + (1 - \beta^t_{k,n}) D^t_{k,n}, \qquad 0 \le \beta^t_{k,n} \le 1,
\tag{3.10}
\]

where $\beta^t_{k,n}$ is a weight that determines how much to account for the historical contribution versus the recent one. We use a non-zero value of $\beta^t_{k,n}$ when a neighbor actually needs help, and a zero value otherwise. Taking into account historical contributions is helpful to neighbors that are experiencing a down-turn in upload capacity, not to those who are experiencing the return of their upload capacity or those whose capacity has been relatively stable. Thus, we set $0 < \beta^t_{k,n} = \beta \le 1$ only when $A^{t-1}_{k,n} \ge D^{t-1}_{k,n}$, for each $k$; otherwise, $\beta^t_{k,n} = 0$. (In Section 3.5, we explore the sensitivity of our mechanisms to different values of $\beta$.)

We use a simple example, as depicted in Fig. 3.5 (where peer $k$ experiences a temporary drop in its bandwidth), to illustrate our approach. Here, Fig. 3.5a illustrates the evolution of $D^t_{k,n}$, i.e., as peer $n$ observes its neighbor $k$, and Fig. 3.5b illustrates the corresponding evolution of neighbor $k$'s reputation score (where the dotted line corresponds to the original TFT-based approach and the solid line represents our proposed modification). That is, in response to the drop in $k$'s upload rate, the original TFT-based approach simply tracks $k$'s upload rate, whereas our approach decreases $k$'s reputation score gradually. As a result, $k$'s chances of obtaining blocks from peer $n$ (during its transient bandwidth losses) are higher with our approach; this results in better overall performance, as detailed in Section 3.5. Once $k$'s capacity increases again, both approaches quickly track this improvement.

Figure 3.5: The evolution of (a) the download rate ($D^t_{k,n}$) and (b) the reputation score ($S^t_{k,n}$)

We note that if peer $n$'s perception of node $k$'s bandwidth loss is due to $k$ becoming a free-rider (i.e., this loss is not transient), then eventually $k$'s reputation with $n$ will drop as well. We also note that our mechanism requires only local information and hence works in a decentralized manner, just as the original TFT-based approach.
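A compact Python sketch of the bookkeeping behind Eqs. (3.9) and (3.10) is shown below; the state layout, class name, and default beta are illustrative choices, not part of the protocol specification.

class NeighborHistory:
    # Tracks the non-zero-slot running average A_{k,n} of Eq. (3.9) and
    # computes the reputation score S_{k,n}^t of Eq. (3.10).
    def __init__(self, beta=0.5):
        self.beta = beta       # weight on history when the neighbor needs help
        self.total = 0.0       # sum of non-zero per-slot download rates
        self.slots = 0         # number of slots with a non-zero rate
        self.last_rate = 0.0   # D^{t-1}: rate observed in the previous slot

    def score(self, rate_t):
        avg = self.total / self.slots if self.slots else 0.0  # A^{t-1}
        # History is weighted in only for a neighbor in a down-turn,
        # i.e., when A^{t-1} >= D^{t-1}; otherwise beta^t = 0.
        b = self.beta if self.slots and avg >= self.last_rate else 0.0
        s = b * avg + (1.0 - b) * rate_t                      # Eq. (3.10)
        if rate_t > 0:          # Eq. (3.9) skips zero-rate slots
            self.total += rate_t
            self.slots += 1
        self.last_rate = rate_t
        return s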
3.4.3 Modified Peer Request Mechanism

Traditional Peer Request mechanisms allow every peer to maximize its download rate by requesting data from all of its neighbors. However, with a TFT-type strategy, peers only unchoke a fixed number of neighbors (those with higher contributions) to whom they upload data. Thus, high capacity peers are likely to have more blocks, which in turn means that they are likely to get more requests from all other peers. In contrast, low capacity peers are unlikely to be selected (for receiving blocks) over a high capacity peer; they also have a smaller probability of receiving requests from high capacity peers, due to owning fewer blocks. Hence, peers with similar capacities are likely to form a cluster and mostly exchange blocks within this cluster [83]. As a result, low capacity peers may experience significant video pauses (due to slowly filling buffers) and may even leave the system (almost regardless of which block selection algorithm is used). Table 3.2 (obtained through the simulation-based experiments of Section 3.5) illustrates this situation, where the needed download rate (for smooth playback) is 500 kbps; here (under the original approach), high capacity peers end up with higher than needed download rates, whereas low capacity peers end up with lower than required download rates. Thus, we have an opportunity to "shift" the unneeded download rates from high capacity peers to low capacity peers, as long as we can do this without hurting the QoS experienced by high capacity peers and with appropriate incentives.

In order to shift unneeded download rates to low capacity peers, we take the following approach. When excess download rate (i.e., more than the required rate) is perceived by a high capacity peer $n$, given appropriate incentives (see Section 3.3.2), $n$ forgoes requesting blocks from some of its high capacity neighbors (i.e., those that provide $n$ with high download rates). We refer to this as "releasing" a neighbor, and attempt to release as many neighbors as possible without affecting the quality of $n$'s video playback. Specifically, we do this in an adaptive manner, where peer $n$ increases the number of released neighbors only if its download rate satisfies the video playback requirement (i.e., no video pauses). Moreover, in order not to experience video pauses caused by insufficient download rates due to the over-releasing of neighbors, peer $n$ releases an additional neighbor only if it has already cached more than sufficient blocks in its buffer. Let $Q^t_n$ be the number of continuous blocks in peer $n$'s buffer at the beginning of time slot $t$; this is the set of blocks, starting from the next video frame, that forms a continuous sequence without any missing blocks. Then, if the block size is $Z$, we consider the number of buffered blocks sufficient when $Q^t_n \ge 2(C_v/Z)T$ (based on the results of Experiment 3.5.1, as detailed below). However, when peer $n$ finds that the current number of released neighbors leads to insufficient download rates, and the number of continuous blocks in the buffer is only sufficient to ensure one (next) period of smooth playback, i.e., $2(C_v/Z)T > Q^t_n \ge (C_v/Z)T$, peer $n$ decreases the number of released neighbors. Lastly, all releases are voided when peer $n$ determines that it does not have sufficient blocks for even a single playback period, i.e., $Q^t_n < (C_v/Z)T$.

After determining the number of released neighbors, $R^t_n$, our mechanism sorts peer $n$'s neighbors based on their average download rates to $n$, $A^t_{k,n}$, and releases the $R^t_n$ peers with the highest download rates. We choose to start releasing the peers with the highest download rates because their high contributions provide greater benefits to other peers. In contrast, if we released the peers with the lowest download rates, it is less likely that those donations would be useful to low capacity peers, as such released peers have lower upload capacities and fewer useful blocks.
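Before giving the full pseudocode (Algorithm 2 below), we note that the adaptive rule above reduces to a few comparisons against the per-slot block demand; the following Python sketch, with variable names of our own choosing, captures it under the stated thresholds.

def update_released(R_prev, total_rate, Q_blocks, C_v, Z, T):
    # Adapt the number of released neighbors R_n^t (Section 3.4.3).
    # total_rate: download rate achieved in the previous slot (kbps)
    # Q_blocks:   continuous blocks buffered at the start of this slot
    # C_v, Z, T:  video bit rate (kbps), block size (Kb), slot length (s)
    need = (C_v / Z) * T                     # blocks consumed per slot
    if total_rate >= C_v and Q_blocks >= 2 * need:
        return R_prev + 1                    # comfortable: release one more
    if Q_blocks < need:
        return 0                             # danger: void all releases
    if total_rate < C_v and Q_blocks < 2 * need:
        return max(R_prev - 1, 0)            # tightening: reclaim one neighbor
    return R_prev

The released neighbors are then the top update_released(...) entries of the neighbor list sorted by $A^t_{k,n}$ from high to low.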
The details of our modified mechanism are given in Algorithm 2, where $B$ denotes the number of blocks needed per second of playback (i.e., $B = X = C_v/Z$).

ALGORITHM 2: Modified peer request mechanism at time $t$
  Peer $n$ has $m_n$ neighbors, sorted by their average download rates $A^t_{i,n}$ from high to low; the order is $1 \ldots m_n$
  The number of released neighbors at time $t-1$ is $R^{t-1}_n \ge 0$
  if $\sum_{i=1}^{m_n} D^{t-1}_{i,n}\, T \ge BZT$ and $Q^t_n \ge 2BT$ then
    $R^t_n \leftarrow R^{t-1}_n + 1$
  else if $\sum_{i=1}^{m_n} D^{t-1}_{i,n}\, T < BZT$ and $2BT > Q^t_n \ge BT$ then
    $R^t_n \leftarrow R^{t-1}_n - 1$
  else if $Q^t_n < BT$ then
    $R^t_n \leftarrow 0$
  end
  Choose the peers to request from among the remaining neighbors $R^t_n + 1 \ldots m_n$

The last row of Table 3.2 also demonstrates the resulting "shift" in download capacity allocation (with our modified approach, from one of our experiments). As a result of our approach (see details in Section 3.5), no peer obtains an (unnecessarily) high download rate, but many peers have an opportunity to obtain download rates sufficient for playback.

Table 3.2: The average download rates (kbps) experienced by peers of different classes

  Upload rate (kbps):             256   384   512   768   1024   2048
  Original download rate (kbps):  440   452   485   659   803    852
  New download rate (kbps):       525   526   528   529   546    551

3.4.4 Advertisement Reward Function

As discussed in Section 3.3, an appropriate duration of ads is a good incentive for motivating peers to continue contributing upload capacity while releasing unneeded download capacity. Specifically, in our approach, the length of a peer's ads can be reduced as a reward for helping its neighbors, or increased as a penalty for being helped by its neighbors. The mechanism used to compute rewards and penalties is detailed next.

Rewards. As described in Sections 3.4.2 and 3.4.3, a peer helps its neighbors by (1) maintaining the neighbors' reputation scores at higher levels (as compared to their recent upload contributions) when these neighbors temporarily experience poor connectivity, and (2) donating its download rates by releasing neighbors (which should result in higher download rates for others, as described in Section 3.4.3), thus allowing neighbors to obtain higher download rates and maintain QoS. A peer's reward is computed based on the amount of help provided through these two mechanisms to its neighbors.
For the first type of help, we define the difference between the reputation score $S^t_{k,n}$ and the download rate $D^t_{k,n}$ as the amount of help provided by peer $n$ to its neighbor $k$. However, such help is not really useful unless it results in peer $k$ being selected to receive data blocks. If we let $J^t_{k,n} \in \{0, 1\}$ indicate whether neighbor $k$ receives blocks from peer $n$ during time slot $t$, then the first component of $n$'s reward $W^t_n$ at time $t$ is $\sum_{k=1}^{m_n} (S^t_{k,n} - D^t_{k,n}) J^t_{k,n}$. For the second type of help, say peer $n$ releases $R^t_n$ neighbors in time slot $t$. If peer $n$ had not released neighbor $k$, the expected download rate from $k$ can be estimated as $A^t_{k,n}$, the historical average download rate from $k$. Thus, when peer $n$ stops requesting from peers based on their (historical) average download rates (from high to low) and releases the first $R^t_n$ neighbors, the donation from peer $n$ at time $t$ is estimated from those released neighbors as $\sum_{k=1}^{R^t_n} A^t_{k,n}$. Consequently, the total reward that peer $n$ can obtain is computed as:

\[
W^t_n = \sum_{k=1}^{m_n} (S^t_{k,n} - D^t_{k,n}) J^t_{k,n} + \sum_{k=1}^{R^t_n} A^t_{k,n},
\tag{3.11}
\]

where $m_n$ is the number of $n$'s neighbors.
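Eq. (3.11) is straightforward to compute from local state only; a minimal Python sketch follows (the argument names and data layout are our own illustrative choices).

def reward(scores, rates, unchoked, avg_rates_desc, R_t):
    # W_n^t of Eq. (3.11).
    # scores[k], rates[k]: S_{k,n}^t and D_{k,n}^t for neighbor k
    # unchoked[k]:         J_{k,n}^t, 1 if k received blocks from n in slot t
    # avg_rates_desc:      the A_{k,n}^t values sorted from high to low
    # R_t:                 number of neighbors n released in slot t
    propped_up = sum((scores[k] - rates[k]) * unchoked[k] for k in scores)
    donated = sum(avg_rates_desc[:R_t])
    return propped_up + donated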
Penalties. A peer's penalty corresponds to the download rates the peer obtains as a result of its neighbors' "donations". Given the decentralized nature of our system, it is difficult to determine how much a neighbor's donation eventually increases a peer's download rate. Therefore, we use a peer's local information to estimate this benefit, based on an estimate of how much download capacity the peer would not have obtained without our approach, as detailed next.

As discussed in Section 3.4.3, peers form a cluster with neighbors that have similar capacities, and consequently seldom exchange blocks with peers from other clusters, other than through optimistic unchoking [105]. Thus, we treat the download rates obtained by peer $n$ from peers in higher capacity clusters, other than those obtained through optimistic unchoking, as being obtained due to "donations" made by higher capacity peers. Consequently, to determine the amount of download rate obtained through "donations", we need to determine how much is obtained through optimistic unchoking. To this end, we approximate the probability of a peer being chosen (to receive data) by neighbor $k$ through optimistic unchoking as $\frac{1}{m_k - 4} \approx \frac{1}{m_k}$, where $m_k$ is the number of peer $k$'s neighbors. It has been shown that most peers soon learn about and build connections to the maximum number of peers allowed by the system, typically 80 neighbors in BitTorrent systems, when the system enables peers to periodically exchange their neighbor lists with others [61]. Based on this, we can make the assumption that all peers have a similar number of neighbors; that is, in Algorithm 3, we set $m_k = m_n$, $\forall k$. Thus, if peer $n$ obtains a download rate of $D^t_{k,n}$ from a higher capacity neighbor $k$ in time slot $t$, then we estimate the "donated" download rate (from $k$ to $n$) as $D^t_{k,n} (1 - \frac{1}{m_n})$.

What remains (before we can characterize the overall penalty) is to determine an appropriate cluster for each peer. We do this based on historical data, i.e., the average download rate, $A^t_{k,n}$, of neighbor $k$. Specifically, given that peer $n$'s average upload rate to a neighbor is $O'_n$, we consider neighbor $k$ to be in the same cluster as peer $n$ if $A^t_{k,n} \in [O'_n/\eta,\; \eta\, O'_n]$, where $\eta$ is a scaling parameter. Since a peer only receives "donations" from neighbors in clusters with higher capacities, we only account for a "donation" when it comes from a peer with a download rate higher than $\eta\, O'_n$. Thus, the total average download rate from higher capacity peers is $P^t_n$, as detailed in Algorithm 3. In our experiments, we set $\eta = 1.5$, as the high upload capacities are at least 1.5 times higher than the low upload capacities, as listed in Table 3.4. (In real systems, service providers can adjust this value according to their users' bandwidth capabilities, as typical users are unlikely to change their upload bandwidth frequently, except for transient bandwidth drops.)

ALGORITHM 3: Penalty algorithm
  Peer $n$ has $m_n$ neighbors, with average upload rate to one peer $O'_n$
  The average download rates of the neighbors are $A^t_{1,n}, \ldots, A^t_{m_n,n}$
  The download rates in time slot $t$ are $D^t_{1,n}, \ldots, D^t_{m_n,n}$
  for neighbor $i = 1$ to $m_n$ do
    if $D^t_{i,n} > 0$ and $A^t_{i,n} > \eta\, O'_n$ then
      $P^t_n \mathrel{+}= D^t_{i,n} (1 - 1/m_n)$
    end
  end

The duration of ads. Once we determine the reward and the penalty, we combine them to compute the duration of ads a peer should view. Given an ad period to be viewed after every $I$ time slots of content, the CP determines the default total ad duration, $L_D$, and divides it equally among the ad periods. According to the contract, a CP has a lower bound, $L_m$, and an upper bound, $L_M$, on the total ad duration to be viewed by a peer. Thus, if there are $\ell$ ad periods, the actual ad duration in the $j$-th period to be viewed by peer $n$, $L_{n,j}$, is determined as follows:

\[
L_{n,j} = \frac{\min\left\{ \max\left\{ L_D - \gamma \sum_{k=t_j - I}^{t_j} (W^k_n - P^k_n),\; L_m \right\},\; L_M \right\}}{\ell},
\tag{3.12}
\]

where $t_j$ is the start time of the $j$-th ad period. We note that our mechanisms for computing rewards and penalties require only local information, and thus do not require the use of central servers or information exchange between peers, which are needed in [135].
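For completeness, an illustrative Python sketch of the penalty estimate (Algorithm 3) and of the per-period ad duration of Eq. (3.12) is given below; as throughout, the variable names and the way the reward/penalty samples are passed in are our own assumptions, not a prescribed interface.

def penalty(rates, avg_rates, own_avg_upload, m_n, eta=1.5):
    # P_n^t (Algorithm 3): rate attributed to donations from higher-capacity
    # clusters, discounted by the optimistic-unchoking share of about 1/m_n.
    return sum(d * (1.0 - 1.0 / m_n)
               for d, a in zip(rates, avg_rates)
               if d > 0 and a > eta * own_avg_upload)

def ad_duration(L_D, gamma, rewards, penalties, L_m, L_M, num_periods):
    # L_{n,j} of Eq. (3.12): the net help since the last ad period shifts the
    # total ad duration within [L_m, L_M], split equally across the periods.
    net = sum(w - p for w, p in zip(rewards, penalties))
    total = min(max(L_D - gamma * net, L_m), L_M)
    return total / num_periods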
3.5 Evaluation

In this section, we perform simulation-based experiments, in a controlled environment, in order to demonstrate the characteristics of our mechanisms and to gain insight into and understanding of the corresponding system. We first show the achieved QoS and ad durations of our proposed sharing mechanisms in a BitTorrent-based system. Then, using experiments on both monopoly and oligopoly markets, we show how CPs can design their ad policies (through the reward function in Eq. (3.1)) to provide greater incentives for peers to contribute to the system.

3.5.1 Performance of the Modified BitTorrent-based System

In Section 3.2, we described a BitTorrent-like video streaming system, which is used as a baseline approach in the following experiments for comparison purposes. We now describe how we implement the baseline approach (i.e., the original BitTorrent-like video streaming system discussed in Section 3.4.1) in our simulation environment, as well as how we integrate our modified mechanisms (as described in Sections 3.4.2 and 3.4.3) into this baseline approach. Briefly, we implement all peer client and server functionalities, where simulated peers actually exchange information and data (as real peers do). However, since this is a simulated environment, peers do not actually play the videos, but only check that the required content is in the buffer when it is needed. A video pause is detected whenever a required block is not in the buffer. A summary of the default parameter settings used here is given in Table 3.3.

In the simulations, after joining the system, a peer re-registers itself and retrieves a set of random peers (at most 50 in our system) every 200 seconds from the tracker. In the BitTorrent protocol, a peer can only choke and unchoke peers once every 10 seconds, to avoid oscillations; therefore, we adopt 10 seconds as the interval $T$ for peers' actions. To update block information, peers exchange bit-maps of the blocks they have with their neighbors at the beginning of every interval. Then, peer $n$ sends request messages to its neighbors that have blocks peer $n$ does not have. Peer $n$ unchokes the 4 peers that have provided the highest download rates to it, for the 10 seconds prior to the next round of bit-map exchanges. For optimistic unchoking, peer $n$ also re-selects the (randomly) unchoked neighbors every 10 seconds. After peer $n$ is unchoked, it chooses missing blocks to download. Recall that our focus is on reducing video pauses caused by insufficient download rates. Hence, we adopt an existing block selection algorithm, as described in [133]. In this basic block selection algorithm, peers first request blocks to be used in the near future, and then use the rarest-first approach for those blocks that will not be needed in the near future. For instance, if the video playback time of peer $n$ is at $t_n$ seconds, then peer $n$ gives blocks in the $[t_n, t_n + T_2]$ interval higher priority. If no block in the $[t_n, t_n + T_2]$ period can be selected for download, peer $n$ uses the rarest-first algorithm to fetch blocks beyond time $t_n + T_2$. The central video server acts like another peer, with the difference being that it can unchoke a greater number of peers due to having a higher upload capacity. (We provide details below.) All these mechanisms continue/repeat until the end of a simulation run.

Table 3.3: Parameters used in experiments

  Number of peers in the system ($N$): 500
  Length of an action interval ($T$): 10 seconds
  Recorded simulation time ($T_s$): 1800 seconds
  Maximum advertisement : real content: 0.31 : 1
  Video bit rate: 500 kbps
  Blocks per second: 4
  Number of peers unchoked by the server : by a peer: 20 : 5
  $\eta$ (the value used to distinguish peers' groups): 1.5

Table 3.4: The distribution of upload bandwidth

  Upload rate (kbps):  256   384   512   768   1024   2048
  Popularity:          12%   40%   31%   4%    7%     6%

Our modifications follow a similar design. However, the modified Peer Selection algorithm ranks the neighbors to unchoke according to a combination of historical contributions and current upload capacity, as described in Section 3.4.2. Moreover, the modified Peer Request algorithm ensures that high capacity peers stop sending block request messages to some of the other high capacity peers. The specifics of how (and how many) high capacity peers are released are described in Section 3.4.3. Finally, having peers "donate" to others (as described earlier) introduces different durations of viewed ads, as detailed below.

Environment Settings

To make sure that video pauses are not due to a lack of sufficient overall resources, but rather due to an inappropriate allocation of those resources, we only focus on experimental settings where there is sufficient total upload bandwidth to satisfy the total download demand. In our experiments, we set the total number of peers to 500, based on traces from the PPLive Project [134]. Upload capacities of peers are drawn from the distribution given in Table 3.4. We consider single-layer video with CBR encoding, where the video bit rate is set to 500 kbps, as measured in [63]; however, we do address heterogeneous streaming rates in Section 3.5.1. According to [78], the average video viewing time per user visit is more than 22 minutes.
According to [78], the average video viewing time per user visit is more than 22 minutes. Thus, for simplicity of exposition, we assume that there is no peer churn in our 30-minute simulation period. However, peers in our experiments experience upload bandwidth losses due to, e.g., anomalies in wireless connections and network congestion (see Section 3.4.2 for details). To simulate such losses, we choose a percentage of peers in our experiments that experience capacity losses (as detailed below). According to [19], most wireless sessions are shorter than 10 minutes, and inter-arrival times of wireless users are highly varied. Moreover, as measured in [65], wireless transmission rates decrease exponentially as more users join the same wireless local area network. Therefore, for each peer experiencing capacity losses, we generate multiple durations of losses from a uniform distribution, each of which is no more than 10 minutes. We have no free-riders in our experiments (as that is not the focus of this work). Given these settings, the total upload capacity is sufficient for all peers to view video playback smoothly, if the upload resources are allocated properly.

According to [115], the length of current TV ads is 31% of real content. For instance, in our 30-minute simulation period, there would typically be only 23 minutes of real content with 7 minutes of ads. So, we set the maximal ad duration (L_M) to 7 minutes. The minimal ad duration (L_m), the default ad duration (L_D), and λ are obtained from our market-based model.

The primary evaluation metric, used in the remainder of this section, is the percentage of time a streamed video is paused, defined as follows. Given N peers in the system viewing video over time T_s, if the total length of video pauses experienced by peer i during this experiment is V_i, then the percentage of video pauses over all N peers in the system is computed as:

    \% \text{ of pause time} = \frac{\sum_{i=1}^{N} V_i}{N \, T_s} \times 100\%.    (3.13)

In order not to have initial "warm-up period" results skew the outcome, we run each simulation for 2400 seconds, and only record the results from the last 1800 seconds, by which point each peer has already connected to around 80 neighbors and peers have already formed their clusters. Moreover, most simulation results presented in this section are obtained with 95% ± 5% confidence intervals.[7] Due to lack of space, we only show a subset of our results in what follows; however, results for other settings are qualitatively similar.

[7] Results for the 2048 kbps class are obtained with lower confidence: video pauses rarely occur for that class, and hence it is difficult to obtain simulation results with tight confidence intervals.
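Concretely, the metric of Eq. (3.13) amounts to a one-line aggregation; a minimal sketch (function and argument names are ours):

    def percent_pause_time(pause_lengths, recorded_seconds):
        # pause_lengths[i] is V_i, the total pause length (in seconds) of
        # peer i over the recorded window T_s = recorded_seconds.
        n = len(pause_lengths)
        return sum(pause_lengths) / (n * recorded_seconds) * 100.0

    # Example: three peers pausing for 30 s, 0 s, and 90 s over a 1800 s window.
    print(percent_pause_time([30, 0, 90], 1800))  # -> about 2.22%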
Buffer Starvation

In this experiment, we show (a) how severe the video buffer starvation problem can be when peers' upload bandwidth fluctuates, and, as a result, that (b) peers with lower capacity experience frequent video pauses. We refer to a peer that is experiencing transient bandwidth losses as an "inconsistent capacity peer" in the remainder of the chapter. We simulate the baseline approach, i.e., the original BitTorrent-like video streaming system, and vary the percentage of inconsistent capacity peers, where the number of inconsistent capacity peers in each upload bandwidth class is drawn from the distribution given in Table 3.4. We record the percentage of video pauses for the different classes, i.e., as in Equation (3.13) but on a per-class basis, as shown in Fig. 3.6, which illustrates that the video buffer starvation problem can occur even when there are no inconsistent capacity peers; this happens due to peer clustering (as described in Section 3.4.3).

[Figure 3.6: Video pauses of each class (x-axis: peer class in kbps) with different percentages (0%-40%) of inconsistent capacity peers.]

For instance, low capacity peers experience more than 3 minutes of video pauses in a 30-minute video (such as a typical TV show); in particular, peers in the 256 kbps class experience almost 4 minutes of video pauses. On the other hand, high capacity peers seldom experience a video pause. When the number of inconsistent capacity peers increases, all but the highest capacity peers experience more video pauses due to their decreased upload capacities, which in turn decrease their opportunities to obtain data. For instance, peers in the 512 and 768 kbps classes experience only a few video pauses when no inconsistent capacity peers are present, but experience twice as many video pauses when more than 40% of the peers in the system are inconsistent capacity peers. This degradation of QoS is due not only to the decrease in peers' download capacities but also to the decrease in download rate caused by the TFT-type mechanism behavior. This problem is particularly severe for low capacity peers; e.g., peers in the 256 kbps class experience 5 minutes of video pauses (in a 30-minute video) when 40% of the peers in the system are inconsistent capacity peers. High capacity peers, however, experience very few video pauses even when 40% of the peers in the system are inconsistent capacity peers, because their high download rates enable them to cache blocks, allowing toleration of transient bandwidth losses. These experiments demonstrate that buffer starvation (and the resulting video pauses) is potentially a significant problem in P2P-based streaming systems.

Peer Selection Modification

The results of Section 3.5.1 illustrate that video pauses increase when the number of inconsistent capacity peers increases. In this experiment, we demonstrate that our Peer Selection mechanism can reduce video pauses due to bandwidth losses. Specifically, we depict the percentage of video pauses experienced by peers in different classes using our proposed mechanism (as compared to the original BitTorrent-like system) in Fig. 3.7.

[Figure 3.7: Video pauses of each class with 10% inconsistent capacity peers, for the original TFT and for β = 0, 0.1, 0.5, 0.9.]

In order to explore the sensitivity of our results to different settings of β, we show the performance under different β values (0.1, 0.5, and 0.9). As shown in Fig. 3.7, our mechanism significantly reduces video pauses for all peers, even when giving historical information a small weight (β = 0.1). As expected, higher values of β degrade peers' reputation scores more slowly, resulting in better playback quality, i.e., significant reductions in video pauses under longer durations of poor wireless connectivity (particularly with β = 0.9).
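While the exact reputation update is given in Section 3.4.2, the role of β can be sketched as an exponentially weighted moving average of observed contributions; this specific form is our assumption, used only to illustrate why larger β damps transient drops.

    def update_reputation(old_score, observed_rate, beta):
        # Larger beta weights history more, so the score decays more slowly
        # during transient bandwidth drops.
        return beta * old_score + (1.0 - beta) * observed_rate

    score = 512.0                  # a neighbor's long-run upload rate (kbps)
    for rate in [512, 0, 0, 512]:  # a two-slot transient bandwidth drop
        score = update_reputation(score, rate, beta=0.9)
    print(round(score, 1))         # 424.4: the drop is heavily damped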
However, our mechanism cannot ensure that the video pauses of high capacity peers are eliminated, due to the decrease in download transmission rate, which may fall below the required download rate. On the other hand, peers with very low capacity are not able to obtain much benefit from our mechanism: even though our mechanism does help increase low capacity peers' download rates, it, of course, cannot increase the download rate beyond what they had to begin with (i.e., before experiencing transient bandwidth losses). Hence, low capacity peers cannot increase their reputation scores above those of high capacity peers; as a result, they are unlikely to be unchoked. As noted earlier, Fig. 3.7 shows that higher values of β result in better QoS. Higher values of β result in a slower decrease of the reputation score of an inconsistent peer, and hence a diminished effect of bandwidth drops. However, if the value of β is too high, it delays appropriate communication of a bandwidth drop to a peer's neighbors. For instance, if β = 1, then any decrease in the reputation score is delayed by an entire time slot. Consequently, we use β = 0.9 as our default setting in the remainder of the chapter.

Peer Request Modification

As shown in Section 3.5.1, the video playback quality of low capacity peers suffers from a high frequency of video pauses. In this experiment, we demonstrate that our proposed Peer Request mechanism can significantly reduce the video pauses experienced by low capacity peers. We compare the percentage of video pauses experienced by the different classes of peers under the original Peer Request mechanism with those under our modified version; the results are depicted in Fig. 3.8.

[Figure 3.8: Video pauses after applying our peer request mechanism, for thresholds Q = BT, 2BT, 3BT, and 4BT.]
[Figure 3.9: Download rates of each class before and after applying our peer request mechanism.]

As noted in Section 3.4.3, a peer starts to release its neighbors after the number of continuous blocks cached in its buffer, Q, exceeds a threshold. Therefore, in Fig. 3.8, we also show the performance of our modified version under different thresholds. We observe that our mechanism reduces video pauses under all thresholds used in our experiments. Generally, our mechanism can significantly reduce the video pauses of peers in the 256 kbps class (eliminating more than half of the pauses experienced under the original BitTorrent-like approach). Moreover, the video pauses of peers in the 384 kbps class are nearly eliminated (only around 1% of video pauses remain). However, the performance of our mechanism depends strongly on the threshold, particularly when the threshold is very small. As expected, a smaller threshold increases peers' probability of releasing neighbors and increases the download rate of low capacity peers; however, it makes peers more sensitive to download rate changes, resulting in a higher probability of experiencing video pauses. On the other hand, a higher threshold enables peers to absorb changes in download rates, but results in low capacity peers being less likely to obtain donations from high capacity peers. Therefore, in our work, we choose a reasonably conservative threshold (of twice the required blocks per time period) as our default.
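The release rule itself reduces to a simple comparison; a sketch under our default setting (the constant BT below, the blocks required per action interval, is our reading of Table 3.3: 4 blocks/s over a 10-second interval):

    BT = 4 * 10  # blocks required per action interval (Table 3.3)

    def should_release(q, multiplier=2):
        # A peer releases some of its high capacity neighbors once the number
        # of continuous blocks cached in its buffer, q, exceeds the threshold.
        return q > multiplier * BT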
To demonstrate that the video pauses are reduced through the "donation" of download capacity by high capacity peers, in Fig. 3.9 we depict the download rates obtained by peers in different classes. This figure shows that low capacity peers can increase their download rates once some of the capacity is "released" by high capacity peers. For instance, peers in the 256 kbps class obtain nearly sufficient download rates for video playback, thus resulting in significantly fewer video pauses. (Recall that the video bit rate is 500 kbps.)

Combining Mechanisms

So far, we showed that each of our modifications (the Peer Selection and Peer Request mechanisms) can improve video playback quality on its own. Here, we combine the two modifications and show that the combined approach can further reduce video pauses as compared to the original BitTorrent-like system, as illustrated in Fig. 3.10.

[Figure 3.10: Video pauses with different percentages of inconsistent capacity peers, for the original mechanisms, our Peer Request only, and our Request and Selection combined.]

In summary, our Peer Request mechanism significantly reduces video pauses for peers that were not able to obtain sufficiently high download rates to begin with. Moreover, our Peer Selection mechanism reduces video pauses for peers that suffer such pauses due to bandwidth losses (typically higher capacity peers, whose probability of being unchoked by their neighbors would otherwise dramatically decrease due to bandwidth losses).

Although only single-layer video is used in our experiments, our approach can easily be extended to systems with heterogeneous streaming rates through the use of layered video coding. The main difference in systems with heterogeneous streaming rates is that high capacity peers may prefer to use their "unneeded" capacity to view higher rate (and hence quality) videos rather than release it in order to reduce the duration of ad viewing. However, as measured in [69], the highest bit rate used in most video streaming systems is smaller than the rate of many cable Internet connections. Hence, as long as high capacity peers are interested in viewing shorter duration ads, low capacity peers can still obtain increased download rates from high capacity peers through our mechanism.

Duration of Advertisements

In this experiment, we set the default ad duration (L_D) to 6 minutes and λ = 0.134 (based on the empirical results obtained in Section 3.5.2), to adjust the duration of ads (see Section 3.4.4). We record the duration of ads viewed by different classes of peers, with 10% inconsistent capacity peers.

[Figure 3.11: The duration of ads (in minutes) for each class after applying our mechanisms.]

As shown in Fig. 3.11, lower capacity peers view more than the default length of 6 minutes, while higher capacity peers view significantly shorter ad durations. Moreover, since the increase/decrease in the duration of ads is proportional to the expected download rate change (refer to Eq. (3.11) and Fig. 3), peers exhibit significant differences in their ad durations (e.g., peers in the 2048 kbps class always release the peers with the highest capacities, resulting in significantly decreased ad durations).
Therefore, with ASPECT, a CP can still satisfy requirements from ad providers (i.e., delivering the minimal duration of ads), while incentivizing peers to contribute resources by offering differentiated ad durations to peers, based on their resource contributions as well as the amount of resources they receive.

3.5.2 Numerical Experiments with a Monopoly Market

To show how the market efficiency point helps a CP design reward mechanisms in order to motivate peers to contribute their resources, we start with a monopoly market with only one CP. In the game, we set the number of peers to 500, all of which watch the same 500 kbps on-demand video. The minimal total duration of ads, L_m, is 1.5 minutes, and the maximal duration should not exceed 7 minutes. Peers have different upload capacities, which are drawn from the distribution given in Table 3.4. Moreover, peers have different utility functions, whose parameters in U_rate and U_ad are selected uniformly from all possible values. We iterate over combinations of the parameters (L_D and λ) according to the game setting in Section 3.3.2 and find the equilibrium point for each combination.

[Figure 3.12: Utility resulting from valid reward functions of advertisements (axes: L_D and 1/λ).]
[Figure 3.13: The real download rates of different peer classes at the three valid equilibrium points.]
[Figure 3.14: The durations of ads viewed by different peer classes at the three valid points.]

Consequently, we observe through simulation experiments that only three (L_D, λ) combinations ((L_D = 5.5, λ = 0.134), (L_D = 5.5, λ = 0.142), and (L_D = 6, λ = 0.134)) result in market efficiency (see Fig. 3.12). However, out of these three efficient points, we are interested in the one that makes the monopoly CP most satisfied. We observe this point to be (L_D = 6, λ = 0.134): it yields the most similar (among peers) received download rates (see Fig. 3.13), which should lead to similarly good QoS, as well as the most differentiated ad durations (see Fig. 3.14).
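The search over reward-function parameters is a plain sweep; the schematic below abstracts the peer-side game of Section 3.3.2 and the CP utility behind the placeholder callables find_peer_equilibrium and cp_utility (both are assumptions, not actual implementations):

    import itertools

    def sweep(l_d_values, lam_values, find_peer_equilibrium, cp_utility):
        best, best_utility = None, float("-inf")
        for l_d, lam in itertools.product(l_d_values, lam_values):
            eq = find_peer_equilibrium(l_d, lam)  # peers' equilibrium, if any
            if eq is None:                        # not a valid/efficient point
                continue
            u = cp_utility(l_d, lam, eq)
            if u > best_utility:
                best, best_utility = (l_d, lam), u
        return best, best_utility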
3.5.3 Numerical Experiments with Oligopolistic Markets

Unlike a monopoly market, if there are many CPs in the market, each CP has to attract peers to stay in its swarm. Thus, we show how the market efficiency point helps CPs design reward mechanisms within an oligopolistic market competition. Due to space limitations, we only show the results from a triopoly market, where three CPs compete for 500 peers. Moreover, we use h = 20 and l = 10, the thresholds for peers to switch swarms (see Section 3.3.3), in the following experiments. (We tried many threshold combinations in our experiments; since all of them show similar results, we only show one combination here.) The results for other market settings, such as a duopoly market or oligopoly markets with more than three CPs, are similar to the triopoly case.

Preferences of Content Providers

In this experiment, we show (a) how different preference settings in the CPs' utility functions affect the policies for showing ads, and (b) how different reward policies affect the number of peers staying in the swarms. To this end, we consider a homogeneous situation, where all three CPs have equal video server capacities and the same amount of video content, but have different preferences over the factors of their utilities (Eq. (3.7)). In this case, we have each CP put the most emphasis on a different factor, with results shown in Table 3.5. As we can see from the table, the different preference parameter combinations result in similar numbers of peers attracted, with small differences between CPs arising due to their heterogeneous preferences. For instance, CP_3 focuses more on the number of peers compared to CP_1 and CP_2, so its strategy attracts the most peers, relative to the other two. CP_1, however, focuses on minimizing its video server upload supply, so it uses a longer default ad duration to encourage its peers to release unneeded download rates. Therefore, the gap between L_D and L_m is biggest for CP_1. It is important to note that CP_2 and CP_3 have minimal ad duration values close to the default ad duration, due to their weaker preference for reducing video server capacities. Lastly, it is not surprising that CP_2 has the largest L_m, since its preference is to maximize ad durations. Consequently, since the goal of our work is to encourage peers to contribute resources, we make the preference of minimizing video server upload supply (as in CP_1) our default setting.

Importance of Video Content and Upload Supply

As discussed in Section 3.3, peers choose CPs according to the amount of video content and the length of the ad durations provided. Since video content is not an easily adjustable variable, in this experiment we show how CPs can utilize their ad durations to attract peers. Thus, in this case, all CPs have the same default preference settings, but both CP_1 and CP_2 have 40000 videos, while CP_3 has only 20000 videos. Moreover, CP_1 has a video streaming capacity of 5120 kbps, whereas CP_2 and CP_3 have a streaming capacity of 10240 kbps. As shown in Table 3.6, the results depend significantly on the amount of video content, i.e., CP_1 and CP_2 attract more peers than CP_3. However, even though CP_1 and CP_2 have the same amount of video content, the difference in video upload supply makes the two CPs choose different policies, resulting in a significant difference in the number of peers. CP_1 has a lower video streaming capacity than CP_2, so it is motivated to reduce its ad durations to attract more peers and make them contribute upload bandwidth. An interesting observation here is that CP_3 has the longest ad duration. This shows that a small CP with less video content can still survive in the market by increasing its minimal ad duration, thereby attracting more ad providers. Therefore, the amount of content is an important factor that dominates the resulting competition between CPs; but, at the same time, increasing video server capacity (to decrease the ad duration for low capacity peers) enables a CP with less content to survive in the market.
Table 3.5: Game results for homogeneous oligopolistic content providers with different preferences
  CP (capacity)     # content   preference weights   (L_D, λ)      L_m (secs)   # peers
  CP_1 (10 Mbps)    20000       (0.2, 0.5, 0.3)      (304, 0.01)   293          147
  CP_2 (10 Mbps)    20000       (0.3, 0.2, 0.5)      (310, 0.01)   309.99       166
  CP_3 (10 Mbps)    20000       (0.5, 0.3, 0.2)      (279, 0.01)   278          187

Table 3.6: Game results for oligopolistic content providers with different amounts of video content and video upload supply
  CP (capacity)     # content   preference weights   (L_D, λ)      L_m (secs)   # peers
  CP_1 (5 Mbps)     40000       (0.2, 0.5, 0.3)      (368, 0.04)   336.96       370
  CP_2 (10 Mbps)    40000       (0.2, 0.5, 0.3)      (384, 0.02)   353          111
  CP_3 (10 Mbps)    20000       (0.2, 0.5, 0.3)      (401, 0.02)   376          19

3.5.4 Overhead and Complexity

Here, we discuss the empirical overhead and complexity of our sharing mechanisms and of the market-based model.

Modified sharing mechanisms

Our modified peer selection mechanism has little overhead, due to recording historical information and computing the new scores (instead of using current information directly), while our modified peer request mechanism does not increase the time complexity (for releasing high contributing peers), since the TFT mechanism already makes a peer continuously rank its neighbors based on their contributions in order to decide whom to unchoke.

Market-based model

As described in Section 3.3.3, CPs use the algorithm in Fig. 1 to repeatedly adjust their reward mechanisms, L_D^j and λ^j, in each round of the game, in order to maximize their utilities until reaching an equilibrium state; thus, the time complexity of our market-based model depends significantly on the search range parameter δ, the number of peers, and the number of CPs. We first consider a scenario, illustrated in Fig. 3.15, where 2 CPs are competing for a varying number of peers under various values of δ.

[Figure 3.15: Average number of iterations to converge versus the number of peers, for various values of δ (0.1-0.4).]
[Figure 3.16: Average number of iterations to converge versus the number of peers, for various numbers of CPs (2-6).]

With respect to δ, the number of iterations needed to reach an equilibrium state decreases significantly when we extend the search range by increasing δ from 0.1 to 0.2, with diminishing returns once δ ≥ 0.3. Moreover, the number of iterations decreases as the number of peers increases. This is because CPs have no unused bandwidth when too many peers are in the system, leaving CPs fewer options (unused bandwidth to leverage) for maximizing their utilities. To support this observation, our next experiment considers scenarios where different numbers of CPs compete for the same number of peers with δ = 0.2. As shown in Fig. 3.16, as the number of CPs increases, each CP has more unused bandwidth (since fewer peers subscribe to each CP), and thus needs more iterations to reach an equilibrium state. However, we observe that the increase in iterations with 2 CPs is due to peers oscillating between CPs when fewer peers (fewer than 1000) are in the system; this is because each CP adjusts its reward mechanism slowly when it has only one competitor. Empirically, we observe that our market-based model converges quickly to a market equilibrium, typically needing around 6 iterations when the search range parameter δ = 0.2.
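The repeated adjustment can be sketched as a best-response loop; cp.best_response and cp.policy below are placeholders for the actual algorithm of Fig. 1, and the convergence criterion mirrors the equilibrium condition above.

    def market_game(cps, delta=0.2, max_rounds=100):
        for round_no in range(1, max_rounds + 1):
            changed = False
            for cp in cps:
                # Search candidate (L_D, lambda) policies within +/- delta.
                new_policy = cp.best_response(cps, delta)
                if new_policy != cp.policy:
                    cp.policy = new_policy
                    changed = True
            if not changed:      # no CP wants to deviate: equilibrium reached
                return round_no  # iterations needed to converge
        return max_rounds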
3.6 Conclusions

We proposed an Ad-Driven Streaming P2P Ecosystem (ASPECT), in the context of P2P video streaming systems, in which peers donate their (unneeded) download capacity in order to improve overall QoS in the system. To support such "re-allocation" of resources, we proposed a modified Peer Request mechanism that facilitates donation (by peers) of potentially available download capacity. To provide appropriate incentives, we modeled the P2P-based video streaming system as a dynamic market that encourages peers to release their download rates in exchange for shorter ad durations. Our simulation-based experiments demonstrated that ASPECT can significantly reduce video pauses, thus increasing QoS. Moreover, ASPECT enables the content provider to achieve its desired profit by providing sufficient incentives for all peers to stay in the system, without violating agreements with the ad providers (i.e., while ensuring that a pre-specified minimal duration of ads is viewed by all peers). In this chapter, we focused on the block exchange process; thus, we only considered a single channel (i.e., users in a swarm sharing the same video) and used a simple block selection mechanism designed for video streaming. However, our work can easily be combined with other block selection mechanisms, as well as extended to a multi-channel scenario.

Chapter 4

Dynamic Resource Management for Distributed Machine Learning Workloads

4.1 Introduction

Machine learning is one of today's most rapidly growing fields in computer science. Its combination of artificial intelligence and data science has led to the development of practical technologies currently in use for many applications. Deep learning [81] is a novel area of machine learning that recently achieved breakthrough results in several domains, including computer vision, speech recognition, natural language processing, and robot control. Its distinctive trait is the use of deep neural networks (DNNs) to discover, directly from input data, internal representations suitable for classification tasks, without the need for manual feature engineering.

To be effective, this approach requires very large amounts of data and compute power. For example, DNNs for image classification include tens of layers and millions of weights (parameters that combine the outputs of one layer to produce the inputs of the next layer), and they are trained using datasets of millions of images [123]. Training can take weeks and must often be repeated for multiple values of the hyperparameters of the training algorithms, such as the learning rate or momentum coefficients in stochastic gradient descent (SGD). To speed up training and provide quick turnaround to users submitting these types of jobs, it is important to take advantage of hardware acceleration (e.g., by using GPUs that implement DNN primitives) and distributed training, which uses multiple machines in parallel. Machine learning frameworks, such as TensorFlow [14], Caffe [74], Theano [126] and Torch [38], provide a high-level abstraction of training algorithms that allows the user to easily run them on GPUs with hardware acceleration and in parallel on multiple machines.
Using a parameter server architecture [43], the dataset is split among several worker nodes that perform training in parallel, sending parameter updates to a parameter server and receiving the most recent model version, which includes updates from other workers. As shown by experimental measurements [15], when more worker or server nodes are assigned to a job, its throughput (number of training examples processed per second) increases only sub-linearly. In some cases, when a shared resource (e.g., the network) is congested, adding more nodes can even reduce the throughput, thus increasing the overall job service time.

The first problem that we tackle is the definition and validation of a performance model for the throughput prediction of a training job as the number of assigned workers increases. Our model is a queueing network [77] in which different stations model the worker nodes, the parameter server, and the incoming and outgoing network links of the parameter server. For a given job, we estimate the mean service time at each station of the model from quick profiling that uses a single node. In contrast to black-box models [132], this approach allows us to compute utilization and throughput at each station, and thus to select an optimal operating point that avoids bottlenecks or congestion of critical resources.

Then, we leverage this performance model to address the problem of parallel job scheduling. We consider the case of a computing cluster that receives Poisson streams
In general, transmission overhead is affected by the transmission 73 protocol (e.g., TCP or RDMA) and by the patterns of processing time at workers and server nodes. Parallel job scheduling is the subject of an extensive literature that addresses distinct scenarios and applications. Several efforts based on linear programming or approxima- tion algorithms tackled off-line versions of this problem with the goal of minimizing the makespan of a set of tasks (i.e., the time to complete all tasks) while satisfying dead- line requirements [34, 73, 100]. Polynomial-time parallel job scheduling algorithms have been proposed to minimize the makespan when the job speedup function is either convex or concave [25]. In high-performance computing (HPC), a static set of jobs is considered, with the goal of minimizing mean response time when jobs are assumed to be rigid, i.e., the number of nodes and job sizes are fixed and specified by the user [46, 50]. Since special cases of this problem reduce to 2D bin-packing, there exists no known efficient algorithm; common heuristics sort the jobs according to many factors (age, size, priority) and schedule them in order, with backfilling of unused slots. Optimal algorithms exist for moldable or malleable jobs with linear speedups [116]. If the number of nodes can be selected by the scheduler (i.e., for moldable jobs), assign- ing all nodes to the job with the least mount of expected work (i.e., service time on one node) minimizes mean response time. If the scheduler can vary (over time) the number of nodes assigned to a job (i.e., for malleable jobs), assigning all nodes to the job with least remaining work minimizes mean response time. Since speedups are sub-linear for machine learning jobs, assigning all nodes to the job with the least remaining amount of work results in inefficient allocations where similar service times would be achieved with fewer nodes. Conversely, assigning one node to each job prevents the use of idle nodes at low loads. Contributions. In this work, we make two main contributions. 74 1) Performance Model of Asynchronous SGD: To use system resources efficiently (i.e., with high speedup per worker node), we need to estimate the performance of a dis- tributed training job for different number of workers. In Sect. 4.3, we develop a per- formance model based on approximate mean value analysis (MV A), which accounts for the effects of the TCP protocol. This model provides a sufficiently accurate estimate of training throughput for the job scheduler. 2) Parallel Job Scheduling: In Sect. 4.4, we propose preemptive parallel scheduling algorithms that address the cases of moldable jobs (KELL) and malleable jobs (HELL and KNEE). We also propose an extension for speeding up the early part of each job in order to provide quick feedback for hyper-parameter tuning. In Sect. 4.5, we com- pare, through extensive experimental evaluations, the tradeoff between shorter response times and parallel service of more jobs. We show that KELL performs not much worse than both mechanisms designed for malleable jobs, and, with proper tuning, our exten- sion reduces the time for obtaining intermediate results significantly without substantial degradation in the overall response time. 4.2 Background 4.2.1 Distributed Stochastic Gradient Descent Stochastic Gradient Descent (SGD) [27] is the most widely used algorithm for DNN training. 
For a given set of layers, connections, and activation functions, a DNN is a parametric function f computing the output y = f(x; θ) (e.g., an image classification) from the inputs x (e.g., the pixel values of each RGB channel), where the parameters θ = (θ_1, …, θ_n) are real-valued weights connecting neurons of different layers. SGD starts with a random initialization of θ, which is iteratively improved to minimize the error on a training dataset of labeled examples D = {(x_i, y_i)}, as measured by a loss function L, i.e., to solve the optimization problem min_θ J(θ), where J(θ) = \frac{1}{|D|} \sum_{(x,y) \in D} L(f(x; θ), y). At each iteration t, the algorithm updates the weight vector in the direction opposite to the gradient of the error on a mini-batch of examples B ⊆ D, i.e., θ^{(t+1)} = θ^{(t)} − η g^{(t)}, where

    g_i^{(t)} = \frac{1}{|B|} \sum_{(x,y) \in B} \frac{\partial L(f(x; θ^{(t)}), y)}{\partial θ_i}   for i = 1, …, n

and η is the learning rate parameter. This step is repeated for several epochs, i.e., full passes over D.

Distributed SGD [32, 43, 84] adopts a parameter server architecture to run on a cluster of distributed nodes. As illustrated in Fig. 4.1, the training dataset is partitioned among multiple worker nodes that compute gradients in parallel, on separate mini-batches of examples (data parallelism).

[Figure 4.1: Parameter server architecture: worker nodes hold training data shards and send gradients g to the parameter server, which applies the update θ' = θ + ηg and returns the weights θ.]

To synchronize their execution, worker nodes send gradients g^{(t)} to a parameter server that holds the most up-to-date version of the weights (possibly on multiple nodes, for load balancing). The parameter server applies the gradients and sends the weights back to the workers. In asynchronous SGD, weights are sent back to a worker immediately after applying its gradient; in synchronous SGD, the parameter server sends the weights only after receiving and applying gradients from all the workers, so that gradients in the next iteration are computed from up-to-date weights. In this chapter, we focus on asynchronous SGD with many worker nodes and a single parameter server.

4.2.2 TensorFlow

TensorFlow [14] is a machine learning framework with special support for DNNs and large-scale computations using heterogeneous hardware. Computations are specified by a dataflow graph, where nodes represent operations (e.g., matrix multiplications or convolutions of the inputs of a DNN layer) and intermediate results flow along edges as tensors, i.e., multidimensional arrays of floating-point, integer, or string elements. The dataflow graph makes communication between subcomputations explicit and allows the framework to execute independent computations in parallel, across multiple GPUs or nodes of a cluster. Each operation (e.g., gradient computation on a mini-batch of examples) and state variable (e.g., the weights) is assigned to a specific device (e.g., a CPU core or a GPU). Specialized implementations of abstract operations allow devices to use hardware acceleration, and the framework transparently handles the transmission of data among devices, e.g., among nodes in a network or among multiple GPUs. The dataflow graph makes it easy to implement a parameter server architecture. In addition, TensorFlow supports fault tolerance through user-level checkpointing: the chief node of the cluster periodically saves the current version of θ to disk; when a client restarts, it automatically attempts to restore from the last checkpoint. Through checkpoints, it is possible for workers or parameter servers to recover from faults, and to suspend and resume training, effectively enabling preemptive scheduling with limited loss of completed work.
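A minimal sketch of this user-level checkpointing in TensorFlow 1.x (the toy variable and paths are illustrative):

    import tensorflow as tf

    w = tf.Variable(tf.zeros([10]), name="weights")
    saver = tf.train.Saver()

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        # ... run some training steps ...
        saver.save(sess, "/tmp/model.ckpt")      # chief periodically saves theta

    with tf.Session() as sess:
        saver.restore(sess, "/tmp/model.ckpt")   # on restart, resume training
        # ... training continues from the restored weights ...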
4.3 Throughput Estimation

In this section, we propose a performance model to estimate the throughput of a training job (examples processed per second) as the number of assigned worker nodes increases.

4.3.1 Throughput Measurements of Distributed SGD

To measure the throughput of distributed TensorFlow and collect service time information, we built our own testbed cluster, which includes 11 servers with quad-core 2.30 GHz AMD Opteron 2376 CPUs and 16 GB of RAM, connected by a 1 Gbps switch running in full-duplex mode. Each server runs TensorFlow 1.0.1 on Debian 8, using Python 3.

To evaluate the accuracy of the performance model (i.e., the predicted training throughput) under many scenarios, we build synthetic DNN models of arbitrary size by changing the number of neurons in each layer. Given that we are not interested in evaluating the classification accuracy of DNNs, during training we can simply sample random examples matching the input layer of the synthetic models. Later, we validate our model on a real-world DNN (Google's Inception). To measure the amount of data exchanged between nodes and the transmission times, we use tcpdump to collect TCP packets. This is the most reliable approach, since TensorFlow uses the binary protobuf format to serialize data before transmission using the gRPC protocol over TCP. From the times between completed transmissions, we estimate the processing time spent at the worker nodes and at the parameter server. We run each experiment for 40 minutes and collect traces only during the last 30 minutes, to remove initial warm-up effects.

[Figure 4.2: Measured training throughput (mini-batches/s) of the TensorFlow and Python implementations versus the number of workers, for 8 MB and 16 MB models (mini-batch size is 50 examples).]

Fig. 4.2 reports measurements of training throughput in a cluster with up to 10 workers for fully-connected DNN models with slightly different structures; the size in bytes of the weights is similar (8 MB and 16 MB), but training throughput exhibits different trends: it saturates smoothly for 16 MB, but shows a non-monotonic trend for 8 MB. To show that these results are not due to specific implementation details of TensorFlow, we developed a client-server Python program mimicking the operations of TensorFlow. Instead of processing real data, our program issues sleep system calls for amounts of time equal to the processing times measured from TensorFlow; the exchanged data has the same size as measured from the TCP dumps. The results obtained are similar; this suggests that a performance model should account for the different processing times at each node, and for the interaction between network access and TCP connections.
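A much-simplified sketch of such a mimic is shown below; the actual program is more elaborate, and the sizes and sleep times here are placeholders for the values measured from TensorFlow.

    import socket, threading, time

    UPDATE_BYTES, SERVER_MS, WORKER_MS = 8 * 2**20, 18, 29

    def server(port):
        s = socket.socket()
        s.bind(("", port)); s.listen(16)
        def handle(conn):
            while conn.recv(UPDATE_BYTES, socket.MSG_WAITALL):  # gradients in
                time.sleep(SERVER_MS / 1000)       # mimic gradient application
                conn.sendall(b"w" * UPDATE_BYTES)  # weights out
        while True:
            conn, _ = s.accept()
            threading.Thread(target=handle, args=(conn,), daemon=True).start()

    def worker(port):
        c = socket.socket()
        c.connect(("localhost", port))
        while True:
            time.sleep(WORKER_MS / 1000)               # mimic mini-batch compute
            c.sendall(b"g" * UPDATE_BYTES)             # send "gradients"
            c.recv(UPDATE_BYTES, socket.MSG_WAITALL)   # receive "weights"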
This is in contrast with performance models for synchronous SGD [130], where successive SGD iterations do not influence each other (each iteration restarts identically when the parameter server sends the weights to the workers). As highlighted in Sect. 4.3.3, the congestion control mechanisms of TCP can create, across iterations of asynchronous SGD, a pattern where worker nodes transmit at different times.

4.3.2 Queueing Model

We model asynchronous SGD training as the closed queueing system illustrated in Fig. 4.3.

[Figure 4.3: Queueing model of a distributed machine learning application with a parameter server architecture: worker nodes (IS stations), uplink and downlink queues (processor sharing), and the parameter server (FCFS).]

Worker nodes are modeled by K infinite server (IS) stations, each receiving tasks from one of K different routing chains (classes). After being processed at a worker, its task (modeling a single distributed SGD update) goes to the uplink station (a processor sharing, PS, station), then to the parameter server (a first-come first-served, FCFS, station), then to the downlink station (another processor sharing station), and finally back to the same worker. Although the connection of each worker with the parameter server can cross one or many network switches, the model abstracts the network fabric as one non-blocking switch [16], and only focuses on its ingress and egress ports (e.g., NICs), while the PS policy approximates the behavior of packet-switching networks. The FCFS policy of the parameter server models the sequential application of gradients from different workers; after a task leaves this station, it goes to the downlink station, modeling weights waiting to be sent back to the worker. Note that, by default, TensorFlow allows the parameter server to update weights concurrently using gradients received from many workers; however, in this case we observed from TCP traces that workers exchange additional data and communication overhead increases, so that a FCFS policy is accurate.

Our goal is to estimate the sum of the throughputs of all worker stations, which represents the number of mini-batches processed per time unit in the cluster. To this end, we use mean value analysis (MVA), a recursive algorithm to compute steady-state queue sizes, waiting times, and throughputs in product-form queueing networks [109]. The requirements for product-form queues impose exponential distributions for the service times at FCFS stations (in our case, the parameter server), while other stations can have general service times. Table 4.1 summarizes our notation.

Table 4.1: Summary of notation
  D_c, U_c             service demand of class c at the Downlink and Uplink stations
  W_c, P_c             service demand of class c at the Worker and the Parameter server
  t_c^D(k), t_c^U(k)   total expected time spent by class c at the Downlink and Uplink stations
  t^W(k), t^P(k)       total expected time spent by class c at the Worker and the Parameter server
  N^D(k), N^U(k)       mean steady-state number of tasks transmitting in the Downlink and Uplink queues when k jobs are in the system
  N^P(k)               mean steady-state number of tasks at the parameter server (including the task in service) when k jobs are in the system
  ρ^D(k), ρ^U(k)       mean steady-state utilization of the Downlink and Uplink stations when the system has k jobs
  ρ^P(k)               mean steady-state utilization of the parameter server when the system has k jobs
  X(k)                 mean steady-state task throughput when the system has k jobs

Let k = 1, …, K represent the customer classes, and \vec{n} = (n_1, …, n_K) their populations (i.e., the number of customers per class). For each class k with nonzero population in \vec{n} (i.e., n_k > 0) in a closed system, MVA computes the response time t_k^x(\vec{n}) at a station x based on the Arrival Theorem [77]: if s_k^x is the mean service time of class k at station x, the queue size N_k^x at the moment of an arrival is equal to its mean value when one customer of class k is removed, so that

    t_k^x(\vec{n}) = s_k^x \, [1 + N_k^x(\vec{n} - \vec{1}_k)],    (4.1)

where x = D, U, P (the downlink, uplink, and parameter server stations, which are PS or FCFS stations), while t_k^W(\vec{n}) = W_k (worker nodes are IS stations). From the residence times at all stations, we obtain the throughput of class k,

    X_k(\vec{n}) = \frac{n_k}{t_k^D(\vec{n}) + t_k^U(\vec{n}) + t_k^P(\vec{n}) + t_k^W(\vec{n})},

and, in turn, we can calculate the steady-state number of tasks at each station as N_k^x(\vec{n}) = t_k^x(\vec{n}) \, X_k(\vec{n}) for x = D, U, P.
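For intuition, the sketch below implements the MVA recursion for a single-chain version of this network, assuming all workers are statistically identical (which matches our experiments; the full model keeps one chain per worker):

    def mva_throughput(service, is_delay, n_customers):
        # service[x]: mean service time at station x; is_delay[x]: True for the
        # IS (worker) station, which contributes no queueing delay.
        queue = [0.0] * len(service)  # steady-state queue lengths N_x
        x = 0.0
        for n in range(1, n_customers + 1):
            resp = [s if d else s * (1.0 + q)      # Arrival Theorem, Eq. (4.1)
                    for s, d, q in zip(service, is_delay, queue)]
            x = n / sum(resp)                      # throughput with n customers
            queue = [x * r for r in resp]          # Little's law per station
        return x

    # Downlink, uplink, parameter server, worker (seconds; an 8 MB model at
    # 1 Gbps gives roughly 0.067 s transmission time):
    print(mva_throughput([0.067, 0.067, 0.018, 0.029],
                         [False, False, False, True], n_customers=10))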
The results are compared in Fig. 4.4 with measurements from our TensorFlow experiments, for a parameter server using 1 to 10 workers, a model of size 8 MB (transmitted over 1 Gbps Ethernet), and processing times at the worker nodes and at the parameter server equal to 29 ms and 18 ms, respectively.

[Figure 4.4: The results of our exact MVA model versus TensorFlow measurements (mini-batches/s as a function of the number of workers).]

As shown in the figure, the real throughput of TensorFlow is higher than predicted by our model when there are 2 to 4 workers in the system: the throughput scales almost linearly in this range, while predictions are accurate with more than 4 workers. We address this phenomenon in the next section.

4.3.3 On the Effects of Short TCP Transmissions

To investigate the linear speedup of TensorFlow with 2 to 4 workers, we first analyze TCP packet traces of asynchronous SGD with 2 workers. Fig. 4.5 illustrates the TCP outstanding window size (data sent but not yet acknowledged), which shows that the workers send data during different time slots, without competing for network bandwidth.

[Figure 4.5: Outstanding window sizes for two asynchronous workers over time.]

This phenomenon is due to the fact that the data exchanged between the parameter server and the worker nodes is of similar size for all workers, proportional to the number of weights. If one worker can finish its uplink (or downlink) transmission before the other node, it will have a higher chance of starting the next transmission first (after processing at the parameter server or worker nodes); in case of transmission overlaps, the TCP congestion
Under such dynamics, the uplink and downlink stations operate essentially with FCFS policy and deterministic service times (equal to transmission times for fixed weights/gradient sizes). Since exact MV A requires exponential service times in FCFS stations, we adopt the approximate MV A solution of [109], which gives downlink/uplink response times as t x k (~ n) = x k + x k (~ n ~ 1 k ) x k 2 + [N x k (~ n ~ 1 k ) x k (~ n ~ 1 k )] x k (4.2) forx = D;U. The results, illustrated in Fig. 4.7, show very good accuracy for fewer than 4 workers, but the model overestimates throughput when the number of workers increases. 84 5 10 15 1 2 3 4 5 6 7 8 9 10 Mini-batches/s # Workers T ensorFlow Approx MVA Figure 4.7: The results approximate MV A with FCFS As mentioned before, this asynchronous training scenario allows up to a certain num- ber of workers to transmit without competing for network bandwidth. However, if we keep adding workers to the system, each worker starts to share network bandwidth with some number of workers, which depends on the time spent at each component. Thus, to capture the idea of a certain level of overlapping (not transmitting simultaneously like in Eq. (4.1)), we define the downlink/uplink response times beyond a certain level of system load as t x k (~ n) = (1g( x k ))t x;FCFS k (~ n) +g( x k )t x;PS k (~ n) (4.3) forx =D;U, wheret x;PS k (~ n) is defined in Eq. (4.1),t x;FCFS k (~ n) is defined in Eq. (4.2), and 0g( x k ) 1 is a function of the utilization of stationx, used to combine the two types of time estimates. The time estimation should be closer to the time of transmitting all updates simultaneously when the system load is heavy, while the time estimation 85 should be closer to the time of sharing no bandwidth with others when the system load is light. In this work, we use the following definition: g( x k ) = 8 > > < > > : x k 0:8 0:2 if x k 0:8; 0 otherwise: (4.4) In particular, as defined in Eq. (4.4), our new time estimation begins to consider the transmission overlapping with others when the link utilization is more than 0:8. However, this 0:8 utilization is determined specifically for our topology setting, where the throughput exhibits diminishing return beyond this point. For other hierarchical environment settings (like the ones in data centers), it will be better to perform test runs in order to determine the “crossover” point for the two estimates. 4.4 Scheduling Mechanisms The performance model presented in the previous section allows us to estimate the throughput of a job when different number of machines are assigned by the sched- uler. Nonetheless, even with exact knowledge of throughput, no optimal algorithm is known for response time minimization of a Poisson stream of jobs with general job size distributions [116]. In this section, we describe heuristics for this problem when preemption is allowed and jobs are moldable or malleable. We also consider a mechanism that allows users to check intermediate results early, in order to determine whether or not to terminate the job early.. This can benefit the scenario of machine learning jobs, where users submit jobs with different combinations of hyper-parameters (e.g., learning rate, number of lay- ers, neurons in each layer), monitor the training progress, and terminate jobs exhibiting unsatisfactory results in terms of their learning potential. 86 Problem Definition. 
4.4 Scheduling Mechanisms

The performance model presented in the previous section allows us to estimate the throughput of a job when different numbers of machines are assigned by the scheduler. Nonetheless, even with exact knowledge of throughput, no optimal algorithm is known for response time minimization of a Poisson stream of jobs with general job size distributions [116]. In this section, we describe heuristics for this problem when preemption is allowed and jobs are moldable or malleable. We also consider a mechanism that allows users to check intermediate results early, in order to determine whether or not to terminate the job. This benefits the scenario of machine learning jobs, where users submit jobs with different combinations of hyper-parameters (e.g., learning rate, number of layers, neurons in each layer), monitor the training progress, and terminate jobs exhibiting unsatisfactory results in terms of their learning potential.

Problem Definition. Jobs arrive to the system as a Poisson stream at a fixed rate λ; for each job, the scheduler knows the job size (the total number of training examples that must be processed) and the throughput function X_i(w), which estimates the throughput (examples processed per second) of job i using w machines (w − 1 workers, since we only consider one parameter server here). The cluster has W homogeneous machines, each of which can be assigned to at most one job at a time. If M jobs with remaining sizes J_1^t, …, J_M^t are present at time t, the remaining processing time of job i with w_i^t workers is T_i^t(w_i^t) = J_i^t / X_i(w_i^t). Our goal is to determine a proper machine allocation w_i^t ≥ 0 for all i, t so as to minimize the mean response time, i.e., the mean time spent in the system by a job (waiting or in service).

4.4.1 The Dilemma of Assigning Workers

When the scheduler needs to determine how to assign multiple machines to multiple jobs, there are two extreme mechanisms to consider: assigning as many machines as possible to one job, or executing as many jobs as possible. Shortest Remaining Time First (SRTF) is a representative of the first kind: it allocates all machines to one job, and it has been shown to minimize the mean response time when throughput increases linearly with the number of workers [116]. However, SRTF-type mechanisms may end up wasting compute cycles, due to the sub-linear and non-monotonic dependence of throughput on the number of workers in distributed machine learning jobs, as shown in Section 4.3. The other extreme is to execute as many jobs as possible. For instance, [112] proposes an iterative mechanism that ensures all jobs receive a minimum number of machines; additional machines are assigned iteratively, in a greedy fashion, to the job that can achieve the greatest reduction in response time, while also reducing the mean response time of queued jobs. However, this type of mechanism does not exploit the scalability of a machine learning job well, and may result in longer response times. To this end, we consider modifications of these two extreme directions to determine proper machine allocations:

KNEE mechanism. As mentioned before, assigning all machines to one job may waste system resources for a small throughput improvement. Thus, to better utilize the resources, our first approach stops giving machines to a job once the throughput improvement from an extra machine falls below a certain threshold. Specifically, given a threshold ε > 0, the number of machines w_i^KNEE allocated to a job i should satisfy X_i(w_i^KNEE − 1)(1 + ε) ≤ X_i(w_i^KNEE) and X_i(w_i^KNEE)(1 + ε) > X_i(w_i^KNEE + 1), as long as w_i^KNEE + 1 ≤ W. We refer to this mechanism as the KNEE mechanism, since it stops allocating resources at the point where the slope of the throughput curve turns flat.
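When the throughput curve is unimodal, the KNEE allocation can be located with a greedy scan; a sketch, where throughput(w) is the estimate from Section 4.3:

    def knee_allocation(throughput, max_machines, eps):
        # Keep adding machines while each extra machine still buys at least a
        # (1 + eps) relative throughput gain.
        w = 1
        while w + 1 <= max_machines and throughput(w + 1) > (1 + eps) * throughput(w):
            w += 1
        return w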
High Efficiency, Low Latency (HELL) mechanism. The second approach searches for better machine allocations starting from the other extreme. Compared to the iterative-type mechanism of [112], which runs as many jobs as possible simultaneously by giving a minimal number of machines to each job, this approach still tries to run more jobs simultaneously, but allows each executing job to be scaled efficiently. To quantify the effectiveness of resource usage, [48] proposes the following metrics,

    Speedup: S_i(w) = \frac{T_i(1)}{T_i(w)};   Efficiency: E_i(w) = \frac{S_i(w)}{w},

and suggests a static allocation algorithm aiming to achieve short response times with high efficiency. Accordingly, we refer to our approach as the High Efficiency, Low Latency (HELL) mechanism, and allocate w_i^HELL machines so as to minimize the ratio T_i(w_i^HELL) / E_i(w_i^HELL).

Generally, if the threshold ε of the KNEE mechanism is small, the KNEE mechanism will allocate more machines than the HELL mechanism to a job (i.e., w_i^HELL ≤ w_i^KNEE). In this work, we study malleable and moldable job scheduling algorithms using these two mechanisms.

4.4.2 Malleable Job Scheduling

In the malleable job scheduling case, the scheduler can dynamically update the number of machines allocated to a job at execution time. This functionality is supported by TensorFlow through checkpointing: the weights of the model are saved to disk periodically, and restored when the job is restarted. Thus, we can reassign resources at arrivals or departures, when the number of jobs in the system changes. Because of this reassignment, it is acceptable for a job to receive fewer machines than its best allocation. This inspires our malleable KNEE and HELL job scheduling algorithms:

Malleable KNEE Job Scheduling. In this approach, as detailed in Algorithm 4, the scheduler initially selects the job i with the shortest remaining time T_i(a_i) using at most the number of available workers a_i = min(W', w_i^KNEE), where W' is the number of currently idle machines (W' = W when there is no assigned job), and assigns all a_i machines to job i. The same step is repeated to assign the remaining machines, until all machines are busy or all jobs are scheduled for execution. This algorithm may not give every job the allocation its KNEE rule prescribes, but it is a work conserving scheduler.

Malleable HELL Job Scheduling. In this approach, as detailed in Algorithm 5, the scheduler initially selects the job i with the smallest ratio T_i(a_i) / E_i(a_i) using at most the number of available workers a_i = min(W', w_i^HELL), where W' is the current number of available machines, and assigns all a_i machines to job i. The same step is repeated to assign the remaining machines, until all machines are busy or all jobs are scheduled for execution. This is also a work conserving scheduler.

Algorithm 4: Malleable KNEE job scheduling
  Input: T_1^t(w), …, T_M^t(w); W
  Output: {w_1^t, …, w_M^t}
  Let A = {}; w_i^t = 0 for all i
  while Σ_i w_i^t < W and |A| < M do
      find the job k ∉ A with allocation a_k = min(W − Σ_l w_l^t, w_k^KNEE)
          such that T_k^t(a_k) < T_j^t(a_j) for all j ∉ A, j ≠ k,
          where a_j = min(W − Σ_l w_l^t, w_j^KNEE)
      w_k^t ← a_k; A ← A ∪ {k}
  end

Algorithm 5: Malleable HELL job scheduling
  Input: T_1^t(w), …, T_M^t(w); E_1^t(w), …, E_M^t(w); W
  Output: {w_1^t, …, w_M^t}
  Let A = {}; w_i^t = 0 for all i
  while Σ_i w_i^t < W and |A| < M do
      find the job k ∉ A with allocation a_k = min(W − Σ_l w_l^t, w_k^HELL)
          such that T_k^t(a_k)/E_k^t(a_k) < T_j^t(a_j)/E_j^t(a_j) for all j ∉ A, j ≠ k,
          where a_j = min(W − Σ_l w_l^t, w_j^HELL)
      w_k^t ← a_k; A ← A ∪ {k}
  end
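Algorithms 4 and 5 share the same greedy skeleton, differing only in the ranking being minimized; a sketch, where cap(i) returns w_i^KNEE or w_i^HELL and rank(i, w) returns T_i^t(w) or T_i^t(w)/E_i^t(w):

    def malleable_assign(jobs, total_machines, cap, rank):
        alloc, free, pending = {}, total_machines, list(jobs)
        while free > 0 and pending:
            # Pick the unscheduled job minimizing the ranking under its cap.
            best = min(pending, key=lambda i: rank(i, min(free, cap(i))))
            alloc[best] = min(free, cap(best))
            free -= alloc[best]
            pending.remove(best)
        return alloc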
To utilize those idle machines, we propose a filling mechanism that allocates additional machines to jobs, since neither algorithm initially assigns the largest useful number of machines (i.e., the number of machines that generates the highest possible throughput) to each job. In this filling mechanism, the smallest number of idle machines is assigned to the job that can produce the shortest remaining time with these extra machines. This procedure is repeated until all machines are assigned or no job can improve its throughput with additional idle machines.

4.4.3 Moldable Job Scheduling

Malleable job scheduling can better utilize system resources, since allocations can be reassigned when the number of jobs in the system changes. However, if the number of workers of a machine learning job is changed, the complexity of repartitioning and redistributing the job may be significant. Thus, in this work we also consider a moldable job scheduling case, where the number of workers does not change after a job's execution begins. In this type of job scheduling, poor initial allocation choices can have significant repercussions, because the job has to continue using that number of workers even when the system has idle machines.

One strategy is to let a job wait for an ideal number of machines rather than starting earlier with fewer machines. However, if the ideal number is too large, the job has to wait for a long time before getting served, and can block other jobs from executing after it starts, potentially with only a small marginal gain in its running time. If the allocated number of machines is small, the job can be executed immediately, but it will run for a long time due to reduced parallelism. Thus, it is reasonable to believe that the ideal number of machines for job $i$ should lie between $w_i^{HELL}$ and $w_i^{KNEE}$.

Hybrid of KNEE and HELL (KELL). This ideal number of machines should depend on the system load: each job can have a larger number of machines when the system load is low, but is only allowed a smaller number when the system is busy. Recall that $w_i^{KNEE}$ is usually larger than $w_i^{HELL}$; thus, we use the system utilization $\rho$ ($0 \le \rho \le 1$) as a knob to combine $w_i^{HELL}$ and $w_i^{KNEE}$:

$w_i^{KELL} = (1 - f(\rho))\, w_i^{KNEE} + f(\rho)\, w_i^{HELL}$,

where $f(\rho) = \rho^{\delta}$, $0 \le f(\rho) \le 1$, and $\delta \ge 1$. The reason for using a polynomial-type function is that we want the number of allocated machines to decrease slowly while the system utilization is still low.

In this algorithm, the scheduler first selects the job $i$ with the shortest remaining time $T_i(w_i^{KELL})$ using the ideal number of machines $w_i^{KELL}$, and assigns all $w_i^{KELL}$ machines to job $i$. The same step is repeated to assign the remaining machines, until no more jobs can be scheduled for execution. However, since a job waits in the queue if the number of remaining idle machines is smaller than its ideal number, this is not a work conserving scheduler.
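A minimal sketch of the KELL combination (ours; the rounding to an integer allocation is our assumption, as the text does not specify it):

```python
def kell_allocation(w_knee, w_hell, utilization, delta=2):
    """Blend the KNEE and HELL targets by system utilization rho:
    at low load w_KELL approaches w_KNEE (more machines per job),
    at high load it approaches w_HELL (more jobs run in parallel)."""
    f = utilization ** delta              # f(rho) = rho^delta, in [0, 1]
    return round((1 - f) * w_knee + f * w_hell)
```

For example, with $w^{KNEE} = 8$, $w^{HELL} = 2$, $\rho = 0.5$, and $\delta = 2$, we get $f(\rho) = 0.25$ and a blended target of $0.75 \cdot 8 + 0.25 \cdot 2 = 6.5$, i.e., about 6 machines.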
4.4.4 Extension for Early Termination

As noted earlier, one special property of machine learning jobs is that users keep submitting jobs with different hyper-parameter settings, monitor the progress of each job, and terminate the jobs with unsatisfactory results before they complete their execution. To reduce the mean waiting time for jobs that will be terminated, one approach is to embed the remaining progress of a job into the job scheduling mechanism. The remaining progress for job $i$ at time $t$ can be defined as $p_i^t = J_i^t / J_i$, where $J_i$ is the job size at submission time and $J_i^t$ is the remaining job size at time $t$. A job that has not started, with $p_i^t = 1$, should have a higher priority than jobs that have been executing for a while. However, a job should also have a higher priority if it only needs a very short period of time to complete. Thus, to consider both the remaining time and the remaining progress at the same time, we propose to use a weighted remaining time for scheduling:

$\tilde{T}_i^t(w) = \frac{J_i^t}{X_i(w)} \left(\frac{1}{p_i^t}\right)^{\alpha}$,

where $w$ is the number of workers and $\alpha \ge 0$. This can be incorporated into the KNEE, HELL, and KELL mechanisms without changing the algorithms.

However, everything comes at a price. A larger value of $\alpha$ gives higher priority to jobs that have not started, but significantly penalizes jobs that have been executing for a long time, increasing their response time. The appropriate value of $\alpha$ therefore depends on user behavior in the cluster. We give a detailed discussion of the outcomes for different values of $\alpha$, and suggest how to determine this value, in Section 4.5.2.
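In code, the weighting is a one-line modification of the remaining-time estimate (a sketch; the argument names are ours):

```python
def weighted_remaining_time(J_t, X, w, p_t, alpha):
    """Weighted remaining time of Section 4.4.4: J_t is the remaining job
    size, X the throughput function, w the worker count, and p_t = J_t / J
    the remaining progress (1 for a job that has not started). With
    alpha > 0, jobs that have already run for a long time (small p_t) look
    longer than they are, so fresh and nearly finished jobs are favored."""
    return (J_t / X(w)) * (1.0 / p_t) ** alpha
```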
4.5 Evaluation and Validation

In this section, we first validate the accuracy of our throughput prediction model by comparing against results collected with TensorFlow in our testbed cluster, as detailed in Sect. 4.3.1, on different workloads. We then investigate the performance of the proposed scheduling mechanisms that utilize the throughput estimation model. All results reported in this section are obtained with 95% confidence intervals (within ±5%).

4.5.1 Throughput Estimation Validation

To demonstrate that our throughput estimation model is robust under different settings, we first evaluate the results on a 16 MB DNN model with three convolutional layers and two fully connected layers. In this DNN model, we vary the batch size from 8 examples per update to 32 examples per update; the results are illustrated in Fig. 4.8a, Fig. 4.8b, and Fig. 4.8c. From these figures, we see that our model can accurately estimate the throughput increase and identify the number of workers with highly efficient worker utilization.

[Figure 4.8: The results of our throughput estimation model. Each panel plots batches/second versus the number of workers (1–10), comparing TensorFlow measurements with our estimation: (a) a model with 3 conv layers, batch = 8; (b) the same model, batch = 16; (c) the same model, batch = 32; (d) the Google Inception model.]

Even though our model underestimates more when the time spent at the workers is longer (larger batch size), our estimate of the point of diminishing returns still allows the scheduler to allocate workers efficiently. Next, we validate our throughput estimation by running Google's Inception model [124] in TensorFlow using only CPU cores. The results, reported in Fig. 4.8d, show that our approach slightly overestimates the throughput with fewer workers; with more workers, there is almost no difference between our estimate and the measurements. This inaccuracy is likely due to the Inception model having more dynamic update patterns, while our model assumes that the update patterns are similar. However, our estimation still allows the scheduler to use resources efficiently, without allocating unnecessary workers to a job.

4.5.2 Scheduling Evaluation

Next, we perform experiments to investigate the performance (mean response time) of the scheduling mechanisms. Unlike the validation of throughput estimation, which uses our experimental testbed, we use a simulator to evaluate the benefits of the scheduling mechanisms in a larger cluster. We simulate a cluster with 100 physical machines, each with one GPU and one 1 Gbps port used to connect with other machines. (In this environment, we assume that physical machines are interconnected by a high-speed Top-of-Rack switching architecture, so that the transmission bottleneck of each machine is the speed and buffer size of its 1 Gbps port.) To demonstrate that our mechanism is suitable for general machine learning jobs, we consider workloads for four widely-used DNN architectures, listed in Table 4.2. For estimating the throughput of each DNN architecture, we use workload characteristics, such as model size and TFLOPS per batch, as given in the existing literature [71]. Since we simulate model training, we use the throughput of a modern GPU to calculate the service time at the worker and at the parameter server; here, we assume that each machine is equipped with one NVIDIA Grid K520 GPU (1.229 TFLOPS [60]). When using the workload information to generate our synthetic workload, we pick a mini-batch size that follows a Gaussian distribution with a mean of 1024 examples per mini-batch, in order to simulate jobs of varying worker processing times.

Table 4.2: Synthetic workload collected by [71] from previous literature

Name            | Total examples |D|               | Model size | Fwd+bwd TFLOPS/batch (per 1024 examples) | Epochs
NiN [85]        | 50,000 (CIFAR-10, CIFAR-100)     | 30 MB      | 6.7                                      | 200
GoogLeNet [123] | 1.2 million (ILSVRC 2014)        | 54 MB      | 9.7                                      | N/A
AlexNet [79]    | 1.2 million (ILSVRC 2012)        | 249 MB     | 7.0                                      | 90
VGG19 [119]     | 1.2 million (ILSVRC 2012, 2014)  | 575 MB     | 120                                      | 74

Performance Baselines. We use the two extreme mechanisms mentioned in Section 4.4.1 as our baselines for comparison. For the scenario of allocating all machines to one job, we use the Shortest Remaining Time First (SRTF) job scheduling mechanism as the representative. For the scenario of executing as many jobs as possible, we adopt a modified version of the iterative mechanism proposed in [112], which uses our filling mechanism to improve resource utilization when the throughput function does not increase monotonically with the number of machines; we refer to it as the Waterfill mechanism in the remainder of this chapter.

Performance Metrics. The primary performance metric in our evaluation is the mean job response time, and we compare the mean response time of each algorithm under the same system load. However, even with the same arrival rate, different algorithms result in different system utilizations, due to the different numbers of machines they allocate to each job. Thus, to compare different algorithms fairly, we estimate the system load as if each job could only use one machine:

system load $= \frac{\lambda\, E[S(1)]}{W}$,

where $\lambda$ is the arrival rate and $E[S(1)]$ is the expected service time when only one machine is assigned to a job.
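As an illustration of how the synthetic jobs and the normalized load are derived (our sketch; the figures follow Table 4.2 and the GPU specification above, while the standard deviation of the mini-batch size and the helper names are our assumptions):

```python
import random

# (model size in MB, fwd+bwd TFLOPS per batch of 1024 examples, dataset size)
ARCHITECTURES = {
    "NiN":       (30,  6.7,   50_000),
    "GoogLeNet": (54,  9.7,   1_200_000),
    "AlexNet":   (249, 7.0,   1_200_000),
    "VGG19":     (575, 120.0, 1_200_000),
}
GPU_TFLOPS = 1.229   # NVIDIA Grid K520, single-GPU peak

def sample_job():
    """Draw one synthetic job: an architecture plus a Gaussian mini-batch
    size, returning the single-worker time to process the dataset once."""
    name = random.choice(list(ARCHITECTURES))
    _, tflops_per_1024, examples = ARCHITECTURES[name]
    batch = max(1, round(random.gauss(1024, 256)))       # std dev assumed
    sec_per_batch = tflops_per_1024 * (batch / 1024) / GPU_TFLOPS
    return name, batch, sec_per_batch * (examples / batch)

def system_load(arrival_rate, mean_service_time_one_worker, W=100):
    """Normalized load of Section 4.5.2: lambda * E[S(1)] / W."""
    return arrival_rate * mean_service_time_one_worker / W
```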
Benefit of Accurate Throughput Estimation. As noted in Section 4.3.1, our approximate MVA model better captures the throughput evolution by considering the TCP effect. To show that this accurate estimation also helps the scheduling algorithm allocate machines properly, we first compare the results of using approximate MVA with those of using exact MVA and the measured throughputs directly.

[Figure 4.9: The benefit of using our throughput estimation model. Mean response time versus system load for (a) the KNEE and (b) the HELL mechanism on different throughput functions (real trace, exact MVA, approximate MVA).]

Fig. 4.9a illustrates the results of applying the malleable KNEE scheduling algorithm with different throughput estimation functions. Since the KNEE mechanism needs accurate values for determining the throughput improvement, using our approximate MVA produces almost the same results as using the real measured throughputs, while using exact MVA results in significantly longer mean response times. However, unlike malleable KNEE scheduling, malleable HELL scheduling does not improve significantly with our approximate MVA, as shown in Fig. 4.9b. This is because exact MVA already produces a close estimate of the slope of the throughput function; thus, a better throughput estimation does not change the results. Since our KELL mechanism relies on the value computed by the KNEE mechanism, a better throughput estimation also helps improve the results of moldable scheduling.

Without Early Termination. We start by comparing the performance of our malleable and moldable job scheduling mechanisms without the extension for early termination. Here, we use a synthetic workload, where each job is drawn from the four DNN architectures in Table 4.2, and compare the results with the baseline mechanisms (SRTF and Waterfill).

[Figure 4.10: The performance of malleable job scheduling. Mean response time (×10^5 sec) versus system load for SRTF, Waterfill, HELL-FILL, and KNEE-FILL: (a) mixing four DNNs; (b) mixing the two large DNNs.]

We first compare the results of the KNEE and HELL mechanisms without the filling mechanism. As illustrated in Fig. 4.11b, the HELL mechanism has a longer response time due to leaving some machines idle when the system load is low; the KNEE mechanism, however, has fewer chances to leave machines idle. After adopting the filling mechanism, as shown in Fig. 4.10, both the KNEE and HELL mechanisms improve their response times, although the HELL mechanism produces a similar response time because it allocates a similar number of machines to each job when the system load is low. Thus, when comparing the performance of malleable mechanisms in what follows, we only consider the mechanisms with filling enabled.

We then compare the malleable KNEE and HELL mechanisms with the two baseline mechanisms, as illustrated in Fig. 4.10. Fig. 4.10a shows the results with jobs uniformly drawn from all four DNN architectures, and Fig. 4.10b illustrates the results with jobs only from the larger DNNs (AlexNet and VGG19). It is not surprising to observe that SRTF produces the worst performance, since it wastes resources for small marginal gains.
Waterfill is the second worst mechanism, because it executes too many jobs simultaneously without obtaining much benefit from parallelism. The KNEE and HELL mechanisms have similar performance when the system load is low, but the HELL mechanism can accept more arrivals. Thus, the HELL mechanism with filling performs better when the system load is high, no matter what types of jobs are running in the system.

[Figure 4.11: The performance of moldable job scheduling. Mean response time (×10^5 sec) versus system load: (a) KELL with different $\delta$ ($f(\rho) = \rho, \rho^2, \rho^3$); (b) KELL, HELL, and KNEE when mixing four DNNs.]

Finally, we compare the moldable KNEE and HELL mechanisms with the KELL mechanism. Before doing the comparison, we first examine the results of using different values of $\delta$, which affects the number of machines allocated by KELL (as detailed in Section 4.4.3), under the scenario of a mix of four DNNs. Fig. 4.11a shows that $\delta = 2$ produces better results: even though larger values of $\delta$ reduce job response time and allow the system to accept more arrivals, they increase job waiting time at medium system load because fewer jobs execute simultaneously. Thus, we compare the moldable KNEE and HELL mechanisms with the KELL mechanism ($\delta = 2$). We do not include the two baselines here, because we have already shown that our mechanisms do much better than both. As shown in Fig. 4.11b, KELL is able to combine the benefits of both the KNEE and HELL mechanisms, but it still performs a bit worse than our malleable mechanisms in Fig. 4.10a.

Extension for Early Termination. Here, we demonstrate how our early termination extension can help users check intermediate results quickly.

[Figure 4.12: Applying the extension for early termination with different values of $\alpha$: CDF of the mean time to complete a given portion of a job, for $\alpha \in \{0.0, 0.5, 1.0, 1.05, 1.5, 2.0\}$.]

Fig. 4.12 illustrates the time for completing a certain portion of a job under the malleable HELL mechanism, for different values of $\alpha$. As shown in the figure, our extension starts to speed up the early part of jobs when $\alpha > 1$. For instance, without the extension, the mean time to complete 40% of a job is around 41 hours, while our extension with $\alpha = 1.05$ only needs around 26.8 hours, reducing the time by 35%. However, this improvement comes at the cost of increasing the mean response time for completing the whole job. Thus, the usefulness of this extension depends on what types of machine learning jobs are running in the system. If users want to quickly find proper hyper-parameters and do not care as much about the time to obtain the final results, the scheduler can use a large value of $\alpha$; otherwise, choosing a value of $\alpha$ between 1 and 1.05 results in a better trade-off between the benefits and the penalties.

Summary. In order to reduce response time, it is important to allocate a proper number of workers to each job. If this number can be modified during execution, the HELL mechanism with filling achieves better response times in all situations covered by our experiments. However, the complexity of changing the scale of parallelism might not be negligible. Thus, we proposed the KELL mechanism, which combines the benefits of both KNEE and HELL with respect to response time, as illustrated in Fig. 4.11b, and provides sufficiently quick feedback to users.
4.6 Conclusion

We focused on reducing the response time of machine learning jobs in a shared distributed compute environment. To this end, we developed a performance model for estimating the throughput of a distributed training job as a function of the number of workers allocated to it. Based on this throughput estimation, we proposed and evaluated scheduling mechanisms that utilize resources efficiently in order to reduce the mean job response time and provide quick feedback to users, as early job termination is a desirable feature for machine learning model training.

Chapter 5

SC-Share: Performance Driven Resource Sharing Markets for the Small Cloud

5.1 Introduction

Infrastructure-as-a-Service is quickly becoming a ubiquitous model for providing elastic compute capacity to customers, who can access resources in a pay-as-you-go manner without long-term commitments and with rapid scaling (up or down) as needed [17]. Cloud service providers (Amazon AWS [1], Google Compute Engine [6], and Microsoft Azure [8]) allow customers to quickly deploy their services without a large initial infrastructure investment.

Proliferation of Smaller-scale Clouds. However, there are some non-trivial concerns in obtaining service from large-scale public clouds, including cost and complexity. Massive cloud environments can be costly and inefficient for some customers (e.g., Blippex [5]), resulting in more and more customers building their own smaller-scale clouds (SCs) [4] for better control of resource usage; for example, it is hard to guarantee network performance in large-scale public clouds due to their multi-tenant environments [95]. Moreover, smaller-scale providers exhibit greater flexibility in customizing services for their users, while large-scale public providers minimize their management overhead by simplifying their services; e.g., Linode [7] distinguishes itself by providing clients with easier and more flexible service customization. The use of SCs is one approach to solving these cost and complexity issues.

Despite the potential of SCs, they are likely to suffer from resource under-provisioning during peak demand, which can lead to an inability to satisfy service level agreements (SLAs) and a consequent loss of customers. SLAs come in many forms, such as the average or maximum waiting time before being served, the probability of requests being rejected, and the amount of resources that each request can obtain. In order not to resort, like large-scale providers, to resource over-provisioning, with all its disadvantages, one approach to realizing the benefits of SCs is to adopt hybrid architectures [118, 144] that allow private clouds (or small cloud providers) to outsource their requests to larger-scale public providers. However, the use of public clouds can potentially be costly for the small-scale provider.

Motivation. An emerging approach to the under-provisioning problem is for SCs to share their resources in a federated cloud environment [18, 30, 57, 59, 62, 93, 113, 127, 136, 137, 148], thus (effectively) increasing their individual capacities (when needed) without having to significantly invest in more resources; e.g., this can be helpful when the SCs do not experience peak workloads at the same time. Earlier efforts [57, 62] characterize the benefits of cloud federations, while [113] also demonstrates that the uncertainty in meeting SLAs can be an incentive enabling sharing of resources among clouds.
Moreover, the use of multiple SCs can avoid single points of failure: when one SC suffers an outage, others can be accessed to rent VMs. For instance, on February 28th, 2017, AWS suffered a five-hour outage in the US, causing an estimated damage of $150 million to S&P 500 companies [120].

However, many of these efforts assume the existence of the cloud federation and largely focus on designing sharing policies that maximize the profit of individual SCs [59, 127, 137, 148]. For example, [127] proposes a strategy to terminate less profitable spot instances in order to accommodate more profitable on-demand VM requests. Moreover, most works do not consider the trade-off between economic benefits (in terms of profit) and performance degradation for individual SCs, which is a significant factor in incentivizing SCs to participate in the cloud federation. Without an analysis of the performance degradation due to resource sharing, the feasibility of a federation can be questioned. While [93] studies a federation formation game among cloud providers based on revenue, it only considers a special scenario where all cloud providers share all of their resources with others. In contrast, our work focuses on the fundamental, unanswered question of how each SC should share its resources to be profitable, satisfy customer SLAs, and also motivate other SCs to join the federation.

Problem Description. We consider an environment with multiple SCs providing on-demand VM instances; an example with 3 SCs is depicted in Fig. 5.1. In this work, we also refer to SCs sharing resources with each other as a federation.

[Figure 5.1: System overview. Small-scale clouds queue customer requests, share VMs with each other, and forward requests to a large-scale public cloud of virtualized physical servers.]

Each SC has its own SLAs with customers: the maximum waiting time before service of a request is initiated. To satisfy SLAs, SCs use public clouds as a "backup," i.e., they buy additional resources on-demand from large-scale public clouds when in danger of not being able to meet SLAs. If such SCs form a federation, then when an SC exhausts its own resources, it can use resources shared by other SCs at a price lower than that of public clouds. The amount of shared resources directly affects how much workload the federation is able to handle, which in turn affects the profit that each SC is able to achieve.

In this sharing scenario, an important question is: Should SCs participate in the federation? If so, how many resources should each SC share? If an SC is too generous (i.e., shares too many of its resources), then it may be in danger of not being able to serve its own workload, resulting in more requests being forwarded to public clouds and thereby reducing profit margins. As a result, an SC should determine the amount of shared resources based on the price of selling and buying resources, i.e., the net profit, compared with the cost of using public clouds. However, if an SC is too selfish, i.e., shares few of its resources for higher profit, then it may be removed from the federation for not being a useful contributor, or the federation may fall apart if most or all SCs tend towards selfish behavior. Thus, another critical question that needs to be addressed is: What prices can make each SC share a reasonable amount of resources, so that all SCs will participate in the federation?

Challenges and Contributions.
To answer these questions, we make the following contributions:

1. Performance-dependent cost function: The operating costs of an SC depend on the SLA with its customers and on the performance achieved inside the federation; in particular, we need to compute how frequently the SC will need to allocate external resources to satisfy SLAs (e.g., maximum waiting time), and whether it will be able to use resources of other SCs, or only those of public clouds. In Sect. 5.3.2, we develop a detailed performance model to compute such performance metrics for each SC. In turn, these metrics allow us to compute the operating cost of SCs (as defined in Sect. 5.2.2). To address the high computational complexity of the detailed performance model (due to its large state space, which grows exponentially with the number of SCs), we develop an approximate performance model (Sect. 5.3.3). This model provides accurate estimates of the measures of interest, with linear complexity in the number of SCs, and it allows SCs to keep their SLAs and capacity information private.

2. Sharing market design: The sharing mechanism should motivate SCs to participate without significant oversight or management, i.e., they should find an economic benefit in contributing resources to the federation. We design a market-based model to determine the price charged within the federation for the use of shared resources. The model is based on a non-cooperative, repeated game among SCs, each being selfish and trying to maximize its utility; as in real-world scenarios, SCs do not know the utility of other SCs, but they can compute (using our approximate performance model) the operating cost that they would incur for each possible sharing decision. We determine market equilibrium conditions under which the federation is successful and market efficiency is achieved (Sect. 5.4).

3. Experimental evaluation: In Sect. 5.5, we perform an extensive experimental evaluation to validate the accuracy of our approximate performance model with respect to simulation, and to verify the existence of market equilibria. Results highlight errors lower than 10% for the performance metrics of interest; the proposed pricing model achieves market equilibria and good economic efficiency, successfully incentivizing SCs to stay in the federation.

To the best of our knowledge, ours is the first work that models small-cloud federations as a holistic performance-driven market, integrating engineering aspects (from a performance model) with economic ones (from a market model).

5.2 System Description

In this section, we first describe the architecture of the SC federation, illustrated in Fig. 5.1. We then introduce a definition of the operating costs of SCs. Finally, we describe our sharing framework, which we call SC-Share.

5.2.1 Architecture Description

Each SC has a number of physical servers: through virtualization technology, the physical resources (CPU, memory, storage) of SC $i$ are packed into $N_i$ homogeneous virtual machines (VMs), which are the resource unit adopted in this work. Customers request the allocation of individual VMs from SCs; the arrival process of VM requests at each SC $i$ is modeled as a Poisson process with rate $\lambda_i$. The service time of each request at SC $i$ (including the time elapsed from the start of VM preparation until its release by the user) is modeled as an exponential random variable with rate $\mu_i$. Each SC processes VM requests in FCFS order.
If physical servers do not have sufficient resources for a new VM, an SC can reject the request, queue it until more resources are available, or forward it to a public cloud (in a hybrid-cloud model). In Sect. 5.6, we discuss these assumptions in detail.

In a federation with $K$ SCs (Fig. 5.1 depicts the case $K = 3$), we consider the following general scenario: when all VMs at an SC are fully occupied, its new VM requests are queued and can be served either by waiting for local resources to become available, by purchasing resources from other SCs in the federation, or by purchasing them from a public cloud. In order to participate in the federation, SC $i$ must determine the maximum number of VMs $S_i$ to share with other SCs (at a given price) when idle VMs are available; i.e., at any time instant, the number of VMs shared by SC $i$ is $I_i^{S_i} \le S_i$. When all its VMs are occupied, SC $i$ cannot terminate VMs serving requests of other SCs; it can only stop accepting such requests until it is able to clear its own queue. Each SC $i$ is required to maintain SLAs with its customers; we assume that this corresponds to a bound on the waiting time, i.e., a VM must be provided by SC $i$ within $Q_i$ time units from its request. If SC $i$ determines that it is not able to satisfy this SLA using resources of the federation, it forwards the request to a public cloud (e.g., Amazon AWS).

5.2.2 Cost Metric Description

SCs usually make large up-front investments in infrastructure, and continue to pay for maintenance costs (e.g., power supply and cooling costs). In addition, SCs need to account for the costs of forwarding requests to public clouds or of using resources in the federation, in order to satisfy customer SLAs. We define a cost metric that combines these costs with the revenue generated by VM requests from other SCs in the federation, and computes the net operating cost.

Let $I_i^{S_i}$ be a random variable representing the number of SC $i$'s VMs per second used by other SCs when SC $i$ shares up to $S_i$ VMs with the federation. Let $O_i^{S_i}$ and $P_i^{S_i}$ be random variables representing the number of VMs per second used by SC $i$ from the federation and from a public cloud, respectively, to satisfy its SLAs. The net cost for SC $i$ is then

$C_i^{S_i} = \bar{P}_i^{S_i} C_i^P + (\bar{O}_i^{S_i} - \bar{I}_i^{S_i})\, C_i^G \quad \forall i, \qquad (5.1)$

where $C_i^P$ and $C_i^G$ represent the cost of using a single VM from a public cloud and from other SCs, respectively, and $\bar{P}_i^{S_i}$, $\bar{O}_i^{S_i}$, and $\bar{I}_i^{S_i}$ are the mean numbers of VMs per second used by SC $i$ from a public cloud, used by SC $i$ from other SCs, and used by other SCs from SC $i$, respectively. Here, $\bar{P}_i^{S_i} C_i^P$ is the cost (penalty) for not serving requests locally, which drives SCs to participate in the federation and determines proper sharing decisions, since we assume that $C_i^P > C_i^G$.
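As a concrete reading of Eq. (5.1) (a sketch; the argument names are ours, and the means would come from the performance model of Sect. 5.3):

```python
def net_cost(P_mean, O_mean, I_mean, C_P, C_G):
    """Net operating cost of Eq. (5.1): VMs bought from the public cloud at
    price C_P, plus VMs bought from the federation at price C_G, minus
    revenue from the SC's own VMs sold to the federation at that price."""
    return P_mean * C_P + (O_mean - I_mean) * C_G
```

For instance, an SC that sells more federation VMs than it buys (I_mean > O_mean) earns revenue that offsets its public-cloud penalty: net_cost(0.5, 1.0, 2.0, 0.10, 0.04) = 0.05 - 0.04 = 0.01.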
5.6). Another incentive for participating in the federation is reducing power cost by for- warding VMs to other SCs when they offer VMs at cheaper prices than the cost of instantiating VMs in SC’s own environment. For instance, previous efforts [57, 76] study the sharing mechanisms for cloud providers to minimize their costs. However, in this work, we only focus on the cost of additional resources required to satisfy cus- tomers’ SLAs. Extending the cost function to incorporate power consumption is a future direction of our work. 5.2.3 Cost Metric Evaluation Framework In order to help SCs determine whether it is beneficial for them to participate in the federation and share their resources, we design the framework SC-Share that allows each SCi to determine the best value ofS i , in order to meet its SLAQ i and minimize the expected operating costC S i i . 109 The essence of SC cooperation in such a federation is the mutual agreement among individual SCs to share their idle resources with other SCs experiencing peak work- loads. 1 However, the amount of resources S i that each selfish but honest SC wants to share represents its strategic property that subsequently affects the cost metricC S i i . Thus, in SC-Share, we develop a market-based model to capture SC interactions in the federation via a market consisting ofK selfish SCs that interact strategically, and repeat- edly over time, via a non-cooperative game to converge upon stable parameter values. However, a feedback loop exists between the performance model and the market model: sharing decisions S i 8i are used by the performance model to compute P S i i , O S i i , andI S i i and evaluate the cost metricsC S i i of Eq. (5.1), which, in turn, determine the SC utility functions of the market model governing sharing decisions. Therefore, in SC-Share, we propose an iterative solution approach, as illustrated in Fig. 5.2, involving these two models and their mutual feedback, to converge upon stable sharing decisions. 5.3 Performance Model In this section, we propose a performance model for SC-Share that is used to compute performance parameters required by the cost function of Eq. (5.1). 5.3.1 SC without Sharing Resources We start with a degenerate case, where an SC does not participate in the federation and shares no VMs. Based on SLA requirements, the SC will forward a request to public clouds if service cannot be started withinQ i time units after its reception. To compute the cost, we need to estimate the mean number of requests forwarded per second by SCi,P 0 i (we denote it with “0” since no VMs are shared). 1 The issue of enforcing the agreement is beyond our scope here. 110 N i -1 N i N i +1 N i +2 λλP NF (N i ,N i ,Q i ) λ N i μ N i μ N i μ N i μ (N i -1) μ λP NF (N i +1,N i ,Q i ) λP NF (N i +2,N i ,Q i ) Figure 5.3: A Markov model for forwarding To computeP 0 i , we use a Markovian model, where the state represents the number of requests at SC i, as illustrated in Fig. 5.3. In this example, we assume that SC i has N i VMs and SLA Q i with its customers. When at least one VM is idle, a new request can be served immediately. However, when all VMs are busy, the probability that the new request is added to the queue of SC i (rather than forwarded to a public cloud) is equal to the probability that service will start in Q i time units, based on the current number of queued requests. Let q i be the number of customers in SC i (i.e., max(0;q i N i ) customers are waiting in its queue) at the time of the request arrival. 
Then, given exponential service times with rate and the FCFS service policy, the probability of queueing the request (instead of forwarding to a public cloud) is P NF (q i ;N i ;Q i ) = 8 > > < > > : 1 q i N i P j=0 e N i Q i(N i Q i ) j j! ifq i N i , 1 ifq i <N i . In particular,P NF (q i ;N i ;Q i ) is less than one if the request cannot be served immediately upon arrival (i.e.,q i N i ). At the steady state, the expected probability of forwarding a new request to public clouds is then P F = P 1 k=N i (1P NF (k;N i ;Q i )) k , where k is the steady-state probability of having k requests in the system. Then, the expected rate at which VM requests are forwarded to public clouds isP 0 i =P F , which can be used in Eq. (5.1) to compute the cost for SCs not sharing resources, i.e., withO 0 i =I 0 i = 0. 111 5.3.2 Detailed Model for SC Federation The model of a federation with sharing is complex. Given a federation ofK SCs, each of which will share a maximum of S i VMs for i = 1;:::;K, our goal is to estimate the performance parameters P S i i , O S i i , and I S i i for each SC i. To accurately estimate these parameters, we need to consider the interaction among SCs in the federation. One approach is to build a continuous-time Markov chain (CTMC),M, with the following state spaceS: S =f(q 1 ;s 1;1 ;:::;s 1;K ; :::; q K ;s K;1 ;:::;s K;K )jq i 0; 0s i;j S j ; s i;i = X j6=i s j;i S i ; fori = 1;:::;Kg; whereq i is the number of requests from SCi’s customers that are either queued or in service at SCi,s i;i is the number of VMs at SCi serving requests from other SCs, and s i;j ;i6=j is the number of VMs at SCj being used by SCi. Transition rates between states ofM can be assigned so as to implement the prob- abilistic forwarding mechanism of the model for new arrivals, and service of queued requests. Table 5.1 reports the transition structure for the detailed modelM introduced in Section 5.3.2. Transitions are given for SCi from a generic state (q 1 ;s 1;1 ;:::;s 1;K ;:::;q i ;s i;1 ;:::;s i;K ;:::;q K ;s K;1 ;:::;s K;K ): The transition rates include P F (V i ;n i ;Q i ), which is the probability that a request is forwarded to a public cloud when n i requests are queued at SC i, all of its V i N i available VMs are currently busy, and the maximum allowed waiting time by the SLA is Q i (see Section 5.3.1 for a detailed definition). 
We also assume a load balancing mechanism in the model: SC i determines with which SC j to share an idle VM by 112 Next State Rate Condition for Transition (q 1 ;s 1;1 ;:::;s 1;K ;:::;q i + 1;s i;1 ;:::;s i;K ; P (q i ;s i;i ) i (q i +s i;i <N i )_ (q j +s j;j N j ;8j6=i) :::;q K ;s K;1 ;:::;s K;K ) (q 1 ;s 1;1 ;:::;s 1;K ;:::; i jKj (q i +s i;i N i )^ q i ;s i;1 ;:::;s i;j + 1;:::;s i;K ;:::; (L =f(q l ;s l;l )jq l +s l;l < N l ;s l;l <S l g;8l6=i)^ q j ;s j;1 ;:::;s j;j + 1;:::;s j;K ;:::; (K =f(q k ;s k;k )jq k +s k;k = min L (q l +s l;l )g) q K ;s K;1 ;:::;s K;K ) ^(q j ;s j;j )2K (q 1 ;s 1;1 ;:::;s 1;K ;:::;q i 1;s i;1 ;:::;s i;K ;:::; min((N i s i;i );q i ) (q i +s i;i >N i )_ (q j +s j;j N j ;8j6=i) q j ;s j;1 ;:::;s j;j ;:::;s j;K ; :::;q K ;s K;1 ;:::;s K;K ) (q 1 ;s 1;1 ;:::;s 1;K ;:::; min((N i s i;i );q i ) jKj (q i +s i;i N i )^ (s i;i <S i )^ q i 1;s i;1 ;:::;s i;j + 1;:::;s i;K ;:::; (L =f(q l ;s l;l )jq l +s l;l > N l g;8l6=i)^ q j 1;s j;1 ;:::;s j;j ;:::;s j;K ;:::; (K =f(q k ;s k;k )jq k +s k;k = max L (q l +s l;l )g) q K ;s K;1 ;:::;s K;K ) ^(q j ;s j;j )2K (q 1 ;s 1;1 ;:::;s 1;K ;:::;q i ;s i;1 ;:::;s i;j 1;:::;s i;K ; s i;j (q j +s j;j >N i )_ (q k +s k;k N k ;8k6=j) :::;q j ;s j;1 ;:::;s j;j 1;:::;s j;K ;:::q K ;s K;1 ;:::;s K;K ) (q 1 ;s 1;1 ;:::;s 1;K ;:::;q i ;s i;1 ;:::;s i;j 1;:::;s i;K ;:::; s i;j jKj (q j +s j;j N j )^ q j ;s j;1 ;:::;s j;j ;:::;s j;K ;:::; (L =f(q l ;s l;l )jq l +s l;l > N l g;8l6=i)^ q m ;s m;1 ;:::;s m;j + 1;:::;s m;K ;:::; (K =f(q k ;s k;k )jq k +s k;k = max L (q l +s l;l )g) q K ;s K;1 ;:::;s K;K ) ^(q m ;s m;m )2K Table 5.1: State transitions for detail modelM choosing (uniformly at random) among those SCs with the highest number of queued requests. Although solvingM could give us an accurate prediction of all performance char- acteristics required in Eq. (5.1), the corresponding state spaceS grows exponentially withK. Since re-computation of sharing decisions is needed when significant changes in workload or resource availability occur, a model with a more efficient solution is desirable. Moreover, solving forM requires obtaining detailed SC information (such as the arrival rate, the number of overall VMs, and the SLA) that SCs might not want to 113 release. Thus, each SC should be able to compute the model in a decentralized manner and release as little information as possible. 5.3.3 Approximate Model for SC Federation In this section, we focus on an approximate model that can be solved quickly (as system conditions, such as workload, change) and in a decentralized manner (without releas- ing too much information to other SCs), but also yields sufficiently accurate results, in order to produce appropriate sharing decisions. By analyzing the detailed modelM, we realize that usingM allows estimation of performance parameters for all SCs in the federation simultaneously; however, in realistic scenarios, each SC computes its own performance parameters to estimate its cost assuming that other SCs’ sharing decisions are fixed; thus, there is no need for the performance model to simultaneously output results for all SCs. Moreover, since we assume that the same cost is charged by all SCs for shared VMs, an SC does not need to distinguish the source or destination of shared VMs. Therefore, we propose a hierarchical approximate model that computes performance parameters iteratively. Given a federation ofK SCs, we consider each SCi = 1;:::;K in sequence, where SC K is the SC of interest, which we refer to as target SC in the rest of the chapter. 
At each step, we build and analyze a Markovian modelM i where only SCsf1;:::;ig can access shared resources of the federation. The modelM i takes into account the solution ofM i1 and refines it to include also SC i. For example, in modelM 1 , the first SC has exclusive access to all shared resources of the federation; inM 2 , only SC 1 and SC 2 utilize shared resources from all SCs, but VM allocations inM 1 are taken into account. We repeat this process until reaching the target SC. In this approach, since SCi only needs the solution ofM i1 to buildM i , we allow SCs not to leak sensitive 114 information on capacity and SLAs. In the following, we give a detailed description of M i ; 1iK and of its solution. State SpaceS i forM i . The state spaceS i ofM i is S i =f(q i ;s i ;o i ;a i )jq i 0; 0s i S i ; 0o i +a i B i g; whereq i is the total number of requests at SCi (queued or in service),s i is the number of VMs of SCi currently used to serve requests from SCsf1;:::;i 1g,o i is the number of VMs from other SCs currently used by SCi, anda i is the number of shared VMs used by SCs inM i1 . Given that there are at mostN i VMs in SCi, max(0;q i (N i s i )) requests are waiting at SCi; moreover, s i is bounded byS i , the maximum number of VMs shared by SCi. SinceM i includes SCsf1;:::;ig and SCi is the target SC inM i , we useo i to record the number of shared VMs (not from SCi) used by SCi, and we usea i to record the number of shared VMs (not from SCi) used by SCsf1;:::;i 1g; thus, o i +a i is bounded by B i = P j6=i S j , the maximum number of VMs shared by SCsf1;:::;K 1g. State Transitions. VM allocations inM i1 affect the results of new states inM i after state transitions. Each state transition happens in the period of time between two events (referred to as inter-event period in the rest of chapter), each of which can be a request arrival or a service completion instance. During an inter-event period, each state inM i can increase the number of VMs shared by SCi due to SCs inM i1 allocating VMs in SC i; similarly, the number of requests queued at SCi can decrease due to service completions inM i1 , which allow SCi to utilize shared VMs. Thus, the probability of going to any destination state from a state ofM i depends on the probability of being at a specific state inM i1 . Here, we define three interaction probability vectors representing the probability of moving from each state (q i ;s i ;o i ;a i ) ofM i to any other state ofM i when an event happens, based on the interaction probabilities computed forM i1 : 115 P A (q i ;s i ;o i ;a i ) for an inter-event period preceding an arrival instance; P D loc (q i ;s i ;o i ;a i ) for an inter-event period preceding a local departure instance; P D rem (q i ;s i ;o i ;a i ) for an inter-event period preceding the remote departure instance of a VM allocated at other SCs by SCi. The detailed computation of these interaction probability vectors is described below. Let a loc represent the number of VMs shared by SC i and allocated by SCsf1;:::;i 1g inM i1 , and leta rem represent the number of VMs shared by all other SCs (except SCi) and allocated by SCsf1;:::;i1g inM i1 , respectively. 
Then, given a state in $\mathcal{M}_{i-1}$ that produces the pair $(a_{loc}, a_{rem})$, the entries $P^A(q_i, s_i, o_i, a_i)_{(a_{loc}, a_{rem})}$, $P^{D_{loc}}(q_i, s_i, o_i, a_i)_{(a_{loc}, a_{rem})}$, and $P^{D_{rem}}(q_i, s_i, o_i, a_i)_{(a_{loc}, a_{rem})}$ represent the probability of the VM allocation $(a_{loc}, a_{rem})$ in the vectors $P^A(q_i, s_i, o_i, a_i)$, $P^{D_{loc}}(q_i, s_i, o_i, a_i)$, and $P^{D_{rem}}(q_i, s_i, o_i, a_i)$, respectively, after an event in state $(q_i, s_i, o_i, a_i)$ of $\mathcal{M}_i$. The legal combinations of the pairs $(a_{loc}, a_{rem})$ are determined by the current state $(q_i, s_i, o_i, a_i)$ of $\mathcal{M}_i$, as described below. For simplicity, in the rest of the chapter we use $P^A_{(a_{loc}, a_{rem})}$, $P^{D_{loc}}_{(a_{loc}, a_{rem})}$, and $P^{D_{rem}}_{(a_{loc}, a_{rem})}$ to denote the probability of the VM allocation $(a_{loc}, a_{rem})$ in $\mathcal{M}_{i-1}$, given the state $(q_i, s_i, o_i, a_i)$ of $\mathcal{M}_i$.

Transitions for $\mathcal{M}_1$. In $\mathcal{M}_1$, there is only one SC and no lower model affecting the transitions; thus, $s_1 = a_1 = 0$, and

$(q_1, 0, o_1, 0) \xrightarrow{\lambda_1} (q_1 + 1, 0, o_1, 0)$ if $q_1 < N_1$;
$(q_1, 0, o_1, 0) \xrightarrow{\lambda_1} (q_1, 0, o_1 + 1, 0)$ if $q_1 \ge N_1 \land o_1 < B_1$;
$(q_1, 0, o_1, 0) \xrightarrow{\lambda_1 P_{NF}(q_1, N_1, Q_1)} (q_1 + 1, 0, o_1, 0)$ if $q_1 \ge N_1 \land o_1 = B_1$;
$(q_1, 0, o_1, 0) \xrightarrow{\min(q_1, N_1)\,\mu} (q_1 - 1, 0, o_1, 0)$ if $q_1 > 0$;
$(q_1, 0, o_1, 0) \xrightarrow{o_1 \mu} (q_1, 0, o_1 - 1, 0)$ if $o_1 > 0$.

Transitions for $\mathcal{M}_i$. Any transition in $\mathcal{M}_i$ with $i > 1$ depends on the interaction probability vectors for $\mathcal{M}_{i-1}$. Given any pair $(a_{loc}, a_{rem})$ from states in $\mathcal{M}_{i-1}$, the transitions corresponding to a request arrival instance at state $(q_i, s_i, o_i, a_i)$ in $\mathcal{M}_i$ fall into one of the following cases:

$C_1$: The new request can use a VM at SC $i$, when there is at least one free VM at SC $i$ even after accounting for $a_{loc}$ and $a_{rem}$ from $\mathcal{M}_{i-1}$ during the arrival period:

$(q_i, s_i, o_i, a_i) \xrightarrow{\lambda_i\, P^A_{(a_{loc}, a_{rem})}} (q_i + 1, a_{loc}, o_i, a_{rem})$

for all $q_i + a_{loc} < N_i$ such that $(a_{loc} \le S_i) \land (o_i + a_{rem} \le B_i)$.

$C_2$: The new request uses a VM from other SCs. This situation arises when SC $i$ has no idle VMs prior to this arrival instance, but other SCs can provide at least one VM during the preceding inter-event period:

$(q_i, s_i, o_i, a_i) \xrightarrow{\lambda_i\, P^A_{(a_{loc}, a_{rem})}} (q_i, a_{loc}, o_i + 1, a_{rem})$

for all $q_i + a_{loc} \ge N_i$ and $o_i + a_{rem} + 1 \le B_i$.

$C_3$: The new request must be queued or forwarded to a public cloud, because no shared VMs are available in the federation: all VMs have been occupied, during the previous or current inter-event period, by requests from other SCs:

$(q_i, s_i, o_i, a_i) \xrightarrow{\lambda_i\, P^A_{(a_{loc}, a_{rem})}\, P_{NF}(q_i, V_i, Q_i)} (q_i + 1, a_{loc}, o_i, a_{rem})$

for all $q_i + a_{loc} \ge N_i$ and $o_i + a_{rem} = B_i$, where $V_i = N_i - s_i + o_i$ is the number of VMs in the federation currently used by SC $i$.

Given any pair $(a_{loc}, a_{rem})$ for states in $\mathcal{M}_{i-1}$, the transitions corresponding to a service completion instance at SC $i$ for its own customers fall into one of the following cases:

$C_4$: The departure is from a VM of SC $i$ used by SC $i$ itself. If there is at least one job queued at SC $i$, the freed VM is used by SC $i$ directly:

$(q_i, s_i, o_i, a_i) \xrightarrow{L_i \mu\, P^{D_{loc}}_{(a_{loc}, a_{rem})}} (q_i - 1, a_{loc}, o_i, a_{rem})$,

where $L_i = \min(q_i, N_i - s_i)$ is the number of VMs of SC $i$ used by SC $i$, for all $q_i + a_{loc} > N_i$. However, if there are no queued requests at SC $i$, the freed VM is assigned to other SCs with queued jobs:

$(q_i, s_i, o_i, a_i) \xrightarrow{L_i \mu\, P^{D_{loc}}_{(a_{loc}, a_{rem})}} (q_i - 1, a_{loc} + 1, o_i, a_{rem})$

for all $q_i + a_{loc} \le N_i$. If the other SCs do not have queued requests, the transition has the same form as in the previous case for queued requests at SC $i$.

$C_5$: The departure is from a VM of another SC allocated to SC $i$.
If there are no queued jobs at any SC, the freed VM is returned directly:

$(q_i, s_i, o_i, a_i) \xrightarrow{o_i \mu\, P^{D_{rem}}_{(a_{loc}, a_{rem})}} (q_i, a_{loc}, o_i - 1, a_{rem})$

for all $q_i + a_{loc} \le N_i$. If at least one request is queued at SCs $\{1, \ldots, i-1\}$, SC $i$ must share the VM:

$(q_i, s_i, o_i, a_i) \xrightarrow{o_i \mu\, P^{D_{rem}}_{(a_{loc}, a_{rem})}} (q_i, a_{loc}, o_i - 1, a_{rem} + 1)$.

However, if the above conditions are not satisfied and there is at least one job queued at SC $i$, the VM is still assigned to SC $i$, for all $q_i + a_{loc} > N_i$:

$(q_i, s_i, o_i, a_i) \xrightarrow{o_i \mu\, P^{D_{rem}}_{(a_{loc}, a_{rem})}} (q_i - 1, a_{loc}, o_i, a_{rem})$.

Interaction Probabilities. As mentioned above, the interaction probabilities describe the probability of different VM allocations by the SCs in $\mathcal{M}_{i-1}$ during an inter-event period of $\mathcal{M}_i$. To compute transient probabilities, which describe transient changes in the number of VM allocations in the CTMC $\mathcal{M}_{i-1}$ over inter-event periods at SC $i$, we use the method of uniformization [121] to transform the CTMC into a discrete-time Markov chain (DTMC) and a Poisson process, as follows: given the infinitesimal generator $Q^{i-1}$, the rate of the Poisson process is $\Lambda = \max_j |q_{jj}^{i-1}|$, and the transition matrix of the DTMC is $P^{i-1} = I + \frac{1}{\Lambda} Q^{i-1}$. Then, the transient probability vector $p^{i-1}(t)$ for $\mathcal{M}_{i-1}$ can be computed for all $t \ge 0$ as $p^{i-1}(t) = p_0\, P^{i-1}(t)$, where $P^{i-1}(t) = \sum_{k=0}^{\infty} e^{-\Lambda t} \frac{(\Lambda t)^k}{k!} (P^{i-1})^k$ is the matrix of transition probabilities of the CTMC (for a given precision $\varepsilon$, the summation can be truncated using the Fox and Glynn method [53]). By letting $p_0$ be the state distribution at any time instant, we can compute the transient state changes of $\mathcal{M}_{i-1}$.

[Figure 5.4: Example of allocation constraints for a state $(q_i, s_i, o_i, a_i)$ in $\mathcal{M}_i$, with $S_i = 3$ VMs shared by SC $i$ and $B_i = 5$ VMs shared by the remaining SCs; shared VMs already in use in $\mathcal{M}_i$ limit those available for allocation in $\mathcal{M}_{i-1}$.]

Initial State Distribution. The initial distribution of $\mathcal{M}_{i-1}$ depends on the VM allocations in the current state of $\mathcal{M}_i$. For instance, as illustrated in Fig. 5.4, when $S_i = 3$ and SC $i$ uses 2 of its shared VMs in state $(q_i, s_i, o_i, a_i)$ of $\mathcal{M}_i$, at most 1 of SC $i$'s remaining shared VMs can be allocated to others in $\mathcal{M}_{i-1}$. We represent the initial distribution of $\mathcal{M}_{i-1}$ over its state space $\mathcal{S}_{i-1}$ as $X_{[(q_i, s_i, o_i, a_i)]}$, where $[(q_i, s_i, o_i, a_i)]$ is the subset of states in $\mathcal{S}_{i-1}$ that satisfy the VM allocation constraints for a given state $(q_i, s_i, o_i, a_i)$ of $\mathcal{M}_i$. The initial state distribution $X_{[(q_i, s_i, o_i, a_i)]}$ of $\mathcal{M}_{i-1}$ is computed from its steady-state probabilities by considering only the states $[(q_i, s_i, o_i, a_i)]$ and renormalizing their probability masses, i.e., through the concept of conditional probability distributions [21]. Then, the interaction probability vectors for $\mathcal{M}_{i-1}$ and $\mathcal{M}_i$ are given by the product of the initial state distribution and the transient state change over the average inter-arrival or departure time:

$P^A(q_i, s_i, o_i, a_i) = X_{[(q_i, s_i, o_i, a_i)]}\, P\!\left(\tfrac{1}{\lambda_i}\right)$;  $P^{D_{loc}}(q_i, s_i, o_i, a_i) = X_{[(q_i, s_i, o_i, a_i)]}\, P\!\left(\tfrac{1}{L_i \mu}\right)$;  $P^{D_{rem}}(q_i, s_i, o_i, a_i) = X_{[(q_i, s_i, o_i, a_i)]}\, P\!\left(\tfrac{1}{o_i \mu}\right)$,

where $L_i$ is the number of local busy VMs in $(q_i, s_i, o_i, a_i)$.
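A sketch of the uniformization step (ours, numpy-based; for brevity the Poisson sum is truncated by a simple weight tolerance instead of the Fox and Glynn bounds used in the text, which is adequate for moderate values of $\Lambda t$):

```python
import numpy as np

def transient_distribution(Q, p0, t, tol=1e-12):
    """Compute p(t) = p0 * exp(Q t) for a CTMC with generator Q by
    uniformization: a DTMC P = I + Q / Lam subordinated to a Poisson
    process of rate Lam >= max_j |q_jj|."""
    Lam = max(-Q.diagonal())
    if Lam == 0.0:                       # no transitions possible
        return p0.copy()
    P = np.eye(Q.shape[0]) + Q / Lam
    term = p0.copy()                     # p0 * P^k, starting at k = 0
    weight = np.exp(-Lam * t)            # Poisson(Lam * t) pmf at k = 0
    pt = weight * term
    k = 1
    while weight > tol or k < Lam * t:   # run past the Poisson mode
        term = term @ P
        weight *= Lam * t / k
        pt += weight * term
        k += 1
    return pt
```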
Performance Parameters. Given that $\pi^i$ represents the steady-state probabilities of $\mathcal{M}_i$, the performance parameters can be computed as follows:

$\bar{I}_i^{S_i} = \sum s_i\, \pi^i_{(q_i, s_i, o_i, a_i)}$;  $\bar{O}_i^{S_i} = \sum o_i\, \pi^i_{(q_i, s_i, o_i, a_i)}$;  $\bar{P}_i^{S_i} = \lambda_i \sum \left(1 - P_{NF}(q_i', V_i, Q_i)\right) \pi^i_{(q_i, s_i, o_i, a_i)}$,

where the sums range over the states of $\mathcal{S}_i$, $q_i' = q_i - (N_i - s_i)$, and $V_i = N_i - s_i + o_i$.

5.4 Market-based Model

Next, we develop the empirical market-based model of SC-Share, used to determine appropriate sharing decisions for each SC. We first formulate SC utility functions that take the performance characteristics (as computed above) into consideration. We then focus on the details of the game and on the notion of market efficiency.

5.4.1 SC Utilities

As discussed before, SCs participate in the federation in order to obtain resources and satisfy SLAs at prices cheaper than public clouds, and to sell idle resources to other SCs for profit, similarly to spot instances sold by Amazon AWS [3]. To this end, we define SC $i$'s utility $U_i^{S_i}$ (see Eq. (5.2) below) from the ratio between (a) the change in the net cost of an SC when it participates in the federation versus when it does not, and (b) the change in the utilization of an SC when it participates in the federation versus when it does not:

$U_i^{S_i} = \frac{\left(\max(C_i^0 - C_i^{S_i},\, 0)\right)^2}{\left(\rho_i^{S_i} - \rho_i^0\right)^{\gamma}}, \quad 0 \le \gamma \le 1, \qquad (5.2)$

where $C_i^0$ is the cost for SC $i$ when it does not participate in the federation, $C_i^{S_i}$ is the cost for SC $i$ when it shares a maximum of $S_i$ VMs, $\rho_i^0$ is the system utilization (i.e., the fraction of time that SC $i$'s VMs are busy) when not participating in the federation, and $\rho_i^{S_i}$ is the utilization of SC $i$ when it shares a maximum of $S_i$ VMs. It is evident that an SC tries to minimize its cost of satisfying SLAs; thus, we take the cost reduction as the numerator of Eq. (5.2). We consider the increment in SC utilization (the denominator of Eq. (5.2)) because SCs want to keep their resources utilized at a certain level (the system utilization of an SC always increases when it participates in the federation, since it has to share resources with others). For instance, an SC would want to increase the amount of shared VMs (i.e., increase its system utilization) to obtain higher profit from the cooperation, but would want to decrease the amount of shared VMs whenever its high system utilization makes it forward more requests to a public cloud (i.e., when the rate of cost reduction starts to decrease).
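In code, the utility of Eq. (5.2) is (a sketch; it presumes $\rho_i^{S_i} > \rho_i^0$, which holds whenever a participating SC actually shares VMs):

```python
def utility(C_0, C_S, rho_0, rho_S, gamma):
    """SC utility of Eq. (5.2): squared cost reduction, discounted by the
    utilization increase raised to gamma (gamma = 0 ignores utilization,
    i.e., UF0; gamma = 1 weighs it fully, i.e., UF1)."""
    cost_reduction = max(C_0 - C_S, 0.0)
    return cost_reduction ** 2 / (rho_S - rho_0) ** gamma
```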
In the experiments, we assume that all SCs in the federation have the same value for the parameter, as different values would produce different scales of utility. 5.4.2 Non-Cooperative Game among SCs Game Setting. We implement a finite repeated non-cooperative game, where the strat- egy parameterS i of each SCi is the maximum number of VMs shared with other SCs 122 at any given time. Here, we adapt the concept of fictitious play [29], and assume that each SC does not need to know the utility functions of others. SCi determinesS i based on the performance characteristics achieved through sharing with others in the previous round of the game, resulting in a corresponding cost of maintaining the required SLAs. Algorithm 6 describes the details of our non-cooperative repeated game. In the initial round (without knowledge of other SCs’ behavior), each SC makes an initial sharing decision arbitrarily, and begins sharing VMs with other SCs. Given the solution of the performance model (which takesfS (0) 1 ;:::;S (0) K g as input), each SC maximizes its utility, to determineS (1) i , its sharing decision for the next round. Using its new sharing decision and those from other SCs (S (1) j ;8j6= i) from the previous round, SCi maximizes its utility again, to determine a new sharing decisionS (2) i . This continues until the game converges to an equilibrium point, as explained next. Analyzing Market Equilibria. A Nash equilibrium point of our proposed repeated game represents the game state in which no SC has any incentive to improve its sharing decision [54]. In our work, we are primarily interested in pure strategy Nash Equilib- ria (NE) [54] as it is more practical to implement and realize for a detailed reasoning. More importantly, we have designed utility functions for the SCs that take as argu- ments, parameters that are practically relevant to our problem, and are expressions that best reflect SC satisfaction levels. However, in the process, we could not strictly pre- serve salient mathematical properties related to the utility functions that allow us to derive closed form results about Market Equilibria (ME) from existing seminal works in micro-economic theory, forcing us to take an experimental stance to characterize equilibria. Below, we briefly rationalize our stance in the light of the inapplicability of seminal game-theory theorems in characterizing ME in our work. A detailed explana- tion of our rationale (along with a description of the salient mathematical properties) is in the Appendix A. 123 First, deriving closed form results for our work via the seminal result by Nash is not possible due to us (a) dealing with only pure strategy NE, and (b) the utility for an SC might not be quasi-concave [28] in general cases. Second, deriving closed form results for our work via the seminal result by Debreu, Fan, and Glicksberg (derived indepen- dently) [44, 49, 55] in relation to pure strategy NE is not possible due to (a’) the quasi- concavity assumption might not always be satisfied (for the peer utility function), which in turn might not guarantee pure strategy NE (violating theorem assumptions), and (b’) strategy sets in many applications (including specialized versions of our application set- ting, i.e., the number of shared VMs is discrete in nature) might not be continuous and infinite [54], in which case, we would have to go back to using Nash’s theorem to guar- antee mixed strategy NE (which we do not aim to achieve). 
Finally, deriving closed-form results for our work via the strong seminal result by Dasgupta and Maskin [41] (which also accounts for discontinuous utility functions) is not possible for the same reasons as (a) and (b) above.

Despite these barriers to closed-form analysis, we observe through simulation results (see below) the existence of pure strategy NE for infinite strategy spaces (simulated in a discrete manner, thereby becoming a finite game in simulation) and for non-quasi-concave SC utility functions. Thus, at least from the experimental results, we observe that for our work, (i) it is not necessary (via the theorem of Nash) for quasi-concavity to hold for a pure strategy Nash equilibrium to exist (also discounting the guarantee of only a mixed strategy via Nash's theorem), and (ii) it is not necessary (via the theorem of Debreu et al.) for quasi-concavity to hold for a pure strategy Nash equilibrium to exist (also discounting the infinite strategy space assumption of the theorem by Debreu et al., as the simulation is discrete in nature).

Reaching Market Equilibria. As discussed above, since we were not able to obtain a mathematical proof, in this work we simulate the game in Algorithm 6 and determine the equilibrium point empirically for a specific price setting (C_i^P and C_i^G). A traditional heuristic for finding such an equilibrium point is the numerical tâtonnement process [131], which is based on the principle of gradient descent. In our work, due to the discrete nature of the SC strategy elements (e.g., the number of VMs to share), we would need a discrete version of the tâtonnement process to reach an equilibrium point. However, the design and analysis of such a process has been shown to be quite challenging [75]; moreover, to the best of our knowledge, no discrete tâtonnement process exists. Thus, in our market-based model, we use the non-gradient-based Tabu Search heuristic [56] to search for an equilibrium value of S_i^(r), reaching the global optimum in most cases (by starting from different initial points).

Fairness among SCs. A joint social end goal, serving as a benchmark of how well selfish non-cooperative SCs participate in the federation w.r.t. their sharing behavior, is to (a) reach a certain level of fairness (see below for details) among SCs in terms of their utilities, and (b) maximize their individual utilities at ME. It is important to note that if we only compare the fairness of allocations among SCs, the scenario where all SCs share nothing can also be a most fair allocation, but it results in sub-optimal individual utilities (at times an individual utility of zero for the SCs) at ME (see Sect. 5.5.2). To achieve our joint social end goal, we need to find a specific price setting (the ratio of C_i^G to C_i^P) that enables all SCs to maximize their utilities through sharing VMs while at the same time maintaining an appropriate level of fairness. In regard to adopting an appropriate fairness measure, we consider in our work the widely popular notion of weighted α-fairness [94] to combine the individual SC utilities U_i^{S_i} through the function

$$W\big(\alpha, \vec{S}, \vec{U}\big) = \begin{cases} \displaystyle\sum_{k=1}^{K} S_k \, \frac{\big(U_k^{S_k}\big)^{1-\alpha}}{1-\alpha}, & \alpha \ge 0,\ \alpha \ne 1, \\[6pt] \displaystyle\sum_{k=1}^{K} S_k \log U_k^{S_k}, & \alpha = 1. \end{cases} \qquad (5.3)$$

Here S_k, the maximum number of VMs shared by SC k, is the weight used to combine the α-fairness metric of each SC, while the parameter α controls the fairness of utility allocations among SCs.
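As a small, self-contained illustration of Eq. (5.3), the sketch below computes the weighted α-fair W value from given weights and utilities; the example values at the end are illustrative only.

```python
import math

def weighted_alpha_fair_W(S, U, alpha):
    """Weighted alpha-fair welfare, Eq. (5.3).
    S[k]: SC k's maximum number of shared VMs (the weight);
    U[k]: SC k's utility (assumed positive);
    alpha = 0 gives the utilitarian function, alpha = 1 gives
    proportional fairness (the log branch), and alpha -> infinity
    approaches max-min fairness."""
    if alpha == 1:
        return sum(s * math.log(u) for s, u in zip(S, U))
    return sum(s * u ** (1 - alpha) / (1 - alpha) for s, u in zip(S, U))

# Illustrative example: three SCs with equal weights, proportional fairness
print(weighted_alpha_fair_W([1, 1, 1], [0.5, 0.8, 0.9], alpha=1))
```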
In this work, we evaluate three popular α-fair utility functions, achieving different trade-offs between fairness and economic efficiency: (i) α = 0, which gives the utilitarian function [92] (denoting minimum fairness), (ii) α → ∞, which results in max-min fairness, and (iii) α = 1, which gives proportional fairness. For each fairness function defined by α, our goal is to find the best price setting that motivates SCs, based on their system loads, to participate in the federation and share more of their VMs, thereby achieving higher values of the α-fair functions. We assume that SCs always report their true decisions and utilities without releasing detailed information. (The design of an economic mechanism to enforce truthful communication between SCs is beyond the scope of this work.)

5.5 Evaluation and Validation

We first validate the accuracy of our performance model, the results of which are needed as input parameters to the market-based model. To this end, we compute the solution of our approximate model (in Sect. 5.3) numerically and compare it to the solution of the exact model (computed through a C++-based simulator). We then use our market-based model to investigate how the price of shared VMs from other SCs affects the weighted utilities.

[Figure 5.5: Comparing the results of forwarding estimation for 10 and 100 VMs with QoS = 0.2 and 0.5; panels (a) 10 VMs and (b) 100 VMs plot the forward probability against system utilization for the exact and approximate solutions.]

5.5.1 Performance Model Validation

SC without Sharing Resources. We start with the accuracy evaluation of our forward probability estimation in Sect. 5.3.1, since this is a measure used by all other models. Moreover, to demonstrate that SCs have better incentives to participate in the federation, we compare the results of two clouds, which have 10 and 100 VMs respectively, with SLAs of Q_i = 0.2 and Q_i = 0.5 under various Poisson arrival rates; each request has an exponential service time with rate μ = 1. In order to compare the two clouds fairly, Fig. 5.5 shows the estimated forward probability under different system utilizations (obtained by increasing the arrival rate). As shown in the figure, for both clouds, the probability of forwarding is higher for smaller QoS values, and our estimation properly predicts the forward probability under different settings. It is easy to see that the cloud with fewer VMs has a higher forwarding probability at the same system utilization. Thus, if an SC does not want to increase its investments in infrastructure, it needs some mechanism to decrease its forwarding probability in order to reduce the cost of satisfying SLAs. In the following experiments, each SC in the federation has 10 VMs by default, with exponential service time with rate μ = 1 and QoS Q_i = 0.2.
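For intuition about this kind of validation, below is a minimal Monte Carlo sketch of a single cloud with N exponential servers and Poisson arrivals. The forwarding rule used here (forward an arrival when the queue it would join implies an expected wait above the QoS target Q) is only a plausible stand-in for the actual rule P_NF(q', V, Q) of Sect. 5.3.1, which is not reproduced here; still, it exhibits the same qualitative trends, namely more forwarding for smaller Q and for fewer VMs at equal utilization.

```python
import random

def forward_probability(lam, mu, N, Q, horizon=20000, seed=1):
    """Estimate the forwarding probability of a single cloud with N VMs,
    Poisson(lam) arrivals and exp(mu) service, under an illustrative rule:
    an arrival that finds all VMs busy is forwarded to the public cloud
    if its expected wait, (queued + 1) / (N * mu), would exceed Q."""
    rng = random.Random(seed)
    busy, queued = 0, 0
    arrivals = forwarded = 0
    busy_time, t = 0.0, 0.0
    while t < horizon:
        rate = lam + busy * mu                 # total CTMC event rate
        dt = rng.expovariate(rate)
        busy_time += busy * dt
        t += dt
        if rng.random() < lam / rate:          # arrival
            arrivals += 1
            if busy < N:
                busy += 1
            elif (queued + 1) / (N * mu) > Q:  # would violate QoS target
                forwarded += 1
            else:
                queued += 1
        elif queued > 0:                       # departure; admit from queue
            queued -= 1
        else:                                  # departure; a VM goes idle
            busy -= 1
    return forwarded / arrivals, busy_time / (t * N)  # (P_forward, utilization)

# Example: 10 VMs vs. 100 VMs at comparable utilization, Q = 0.2
print(forward_probability(lam=8.0, mu=1.0, N=10, Q=0.2))
print(forward_probability(lam=80.0, mu=1.0, N=100, Q=0.2))
```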
Approximate Model. In this section, we perform extensive experiments to validate the accuracy of the approximate model presented in Sect. 5.3.3. Here, we investigate how well our approximate model performs as a function of the number of shared VMs and of system utilization. We begin with a 2-SC federation scenario: we fix the arrival rate of one SC to 7 and its number of shared VMs to 5 (out of 10 total VMs), and vary the number of shared VMs and the system load (by changing the arrival rate) of the other SC, referred to as the target SC. Figures 5.6a and 5.6b illustrate the performance metrics of interest when the target SC shares 1 and 9 VM(s) under different system loads. (Due to lack of space, we omit P_i^{S_i}, as its estimation remains accurate.) As shown in the figures, the exact and approximate I_i^{S_i} and O_i^{S_i} are nearly the same when the target SC shares very few VMs. The accuracy of our approximate model decreases when the target SC shares 9 VMs (as compared to the scenario with 1 shared VM), but the difference between I_i^{S_i} and O_i^{S_i} (see Eq. (5.1)) remains within 10% of the exact solution.

We now illustrate how the approximation error increases in larger systems. Firstly, we consider a 10-SC federation scenario (each SC with a total of 10 VMs) and fix, for 9 of the SCs, the numbers of shared VMs to (3, 3, 3, 2, 2, 2, 1, 1, 1) and the arrival rates to (7, 7, 7, 8, 8, 8, 9, 9, 9), respectively. Figures 5.6c and 5.6d illustrate the performance metrics of interest when the target SC shares 1 and 5 VM(s) under different system loads. We still observe that the difference between the exact and approximate I_i^{S_i} and O_i^{S_i} remains small (within 10% of the exact solution) when the system utilization is lower than 0.8 (within 20% when the system utilization is lower than 0.9). Generally, the approximated I_i^{S_i} is under-estimated when the system has very high utilization, because our approximate model breaks the direct relationship between the target SC and all other SCs (we only consider the connection between SC i and SC i−1); thus, the target SC might under-estimate the number of queued requests at all other SCs. For the same reason, the approximated O_i^{S_i} is over-estimated. However, the difference between I_i^{S_i} and O_i^{S_i} remains accurate (within 20% of the exact solution) when the system utilization is lower than 0.9. Secondly, we consider again a 2-SC federation scenario, with 100 VMs per SC. We fix the number of shared VMs at 10 for both SCs, and vary the system load of both. Figures 5.6e and 5.6f illustrate the performance metrics of interest when one SC has a system utilization of 0.8 and 0.9, under different system loads of the target SC. We still observe that the difference between I_i^{S_i} and O_i^{S_i} remains accurate (within 20% of the exact solution) when the system utilization of the target SC is lower than 0.9.

[Figure 5.6: Validating the approximate performance model (2 SCs and 10 SCs). Panels plot the average number of VMs against system utilization for the exact and approximate solutions: (a)–(b) I_i^{S_i} and O_i^{S_i} for 2 SCs with 10 VMs each (sharing 1 and 9 VMs); (c)–(d) for 10 SCs with 10 VMs each (sharing 1 and 5 VMs); (e)–(f) for 2 SCs with 100 VMs each (ρ = 0.8 and 0.9).]

Summary. Our extensive experiments indicate that our approximate model estimates I_i^{S_i} and O_i^{S_i} within 20% of the exact solution under a variety of scenarios, while saving significant computation time (as reported in Sect. 5.5.3). More importantly, the accuracy of the difference between I_i^{S_i} and O_i^{S_i}, and of P_i^{S_i}, which are the parameters needed by the market-based model, is within 10% of the exact solution for reasonable system utilizations. Thus, we believe that our approximate model is useful for estimating the performance characteristics of the federation, as needed by the market-based model.

5.5.2 Market-based Model Evaluation

Here, we perform experiments to investigate how C_i^G/C_i^P, U_i^{S_i}, and the W metric of Eq. (5.3) affect the criteria for SCs to participate in the federation. Due to lack of space, we focus on evaluating 3-SC scenarios (in Fig. 5.7), where each SC has 10 VMs (as a representative example), to better explain the effects of system utilizations on the game model; results for other SC scenarios are qualitatively similar. We display the ratio of the achieved value of the W metric (see Sect. 5.4.2) to the (empirical) market-efficient value of the W metric, as a measure of federation efficiency, for a given mixture of SC utility functions. If no SCs are willing to participate in the federation, we depict this as zero federation efficiency (since the value of the W metric is always greater than zero).
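A rough sketch of how this federation-efficiency ratio can be computed for a small federation appears below. It reuses `utility()` and `repeated_game()` from the earlier sketch, `perf_model` remains the hypothetical performance-model callback, and the brute-force search for the market-efficient W value is feasible only at these small scales; all of this is illustrative rather than the chapter's actual implementation.

```python
import math
from itertools import product

def W_value(shares, perf_model, gamma, alpha):
    """Weighted alpha-fair W metric (Eq. (5.3)) of a sharing vector,
    using each SC's number of shared VMs as its weight; SC utilities
    are assumed positive (utility() comes from the earlier sketch)."""
    total = 0.0
    for s_k, (C0, CS, rho0, rhoS) in zip(shares, perf_model(shares)):
        u = utility(C0, CS, rho0, rhoS, gamma)
        total += s_k * (math.log(u) if alpha == 1
                        else u ** (1 - alpha) / (1 - alpha))
    return total

def federation_efficiency(N, perf_model, gamma, alpha):
    """Ratio of the W value achieved at the Algorithm 6 equilibrium to
    the best W value over all sharing vectors (the empirically
    market-efficient value), found here by brute force."""
    equilibrium = repeated_game(N, perf_model, gamma)
    achieved = W_value(equilibrium, perf_model, gamma, alpha)
    best = max(W_value(list(v), perf_model, gamma, alpha)
               for v in product(*(range(n + 1) for n in N)))
    return achieved / best if best > 0 else 0.0
```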
[Figure 5.7: Market results in 3-SC scenarios; each panel plots federation efficiency against C_i^G/C_i^P for the utilitarian, proportional-fair, and max-min-fair W metrics: (a)–(b) 3 SCs with ρ_i = 0.58, 0.73, 0.84, all choosing UF_0 and UF_1, respectively; (c) 3 SCs with ρ_i = 0.73, 0.79, 0.84; (d) 3 SCs with ρ_i = 0.49, 0.58, 0.66.]

We first consider scenarios where the 3 SCs have significantly different system loads (ρ_i = 0.58, 0.73, 0.84). Fig. 5.7a illustrates the case where all SCs choose UF_0 (γ = 0) as their utility function; Fig. 5.7b illustrates the case where all SCs choose UF_1 (γ = 1). As shown in the figures, if all SCs choose UF_0, the utilitarian W metric increases as C_i^G/C_i^P increases (except when C_i^G/C_i^P nears 1), since SCs choosing UF_0 as their utility are incentivized to share more VMs to reduce their net cost. When C_i^G/C_i^P nears 1, the federation cannot be formed, because SCs with high utilizations do not reduce cost by using shared VMs (compared to resorting to a public cloud), and low-utilization SCs do not generate enough demand to keep the high-utilization SCs profitable. If all SCs used UF_1, they would only share 1 VM with others even as C_i^G/C_i^P increases because, in our setting, the increase in marginal cost reduction with the number of shared VMs is not sufficient to encourage SCs to contribute more VMs. Moreover, since all SCs share only 1 VM when they use UF_1, both the proportional W metric and the max-min W metric achieve the same maximum state (due to the same weight for all SCs in Eq. (5.3)), as shown in Fig. 5.7b. In other cases, the results of the proportional W metric depend on the behavior of the lower-utilization SCs.
If these SCs choose UF_0, their cost reductions with an increasing number of shared VMs are greater than those of the high-utilization SCs; thus, the maximum proportional W metric can only occur when all SCs share few VMs.

In Fig. 5.7c, we consider scenarios where the 3 SCs have similarly high system loads (ρ_i = 0.73, 0.79, 0.84) and all of them consider UF_0. In this scenario, the results are similar to the cases in Fig. 5.7a; however, unlike the scenario where SCs with significantly different utilizations are not incentivized to join the federation when C_i^G/C_i^P = 1, SCs with similarly high utilizations are incentivized to cooperate even when C_i^G/C_i^P = 1. This is because the high-utilization SCs share a similar number of VMs with each other, canceling out the cost of using shared VMs. In Fig. 5.7d, we consider scenarios where the 3 SCs have similarly medium system loads (ρ_i = 0.49, 0.58, 0.66) and all of them consider UF_1. The results in these scenarios are similar to what we have discussed above; however, we observe that the federation cannot be formed when C_i^G/C_i^P is beyond 0.8. This is because the low-utilization SCs do not generate enough revenue from the incoming VM demand of other SCs to balance their own costs of using shared VMs from other SCs.

Summary. Our extensive experimental evaluation indicates three C_i^G/C_i^P regions of operation for maximizing the various W metrics. When maximizing the proportional-fairness-based W metric is the goal of the federation, the value of C_i^G/C_i^P should be set in the lower range (between 0 and 0.3 in our example setting). When maximizing the max-min-fairness-based W metric is the goal, the value of C_i^G/C_i^P should be set in the middle range (between 0.3 and 0.7 in our example setting). Finally, when maximizing the utilitarian W metric is the goal, the value of C_i^G/C_i^P should be set in the high range (between 0.7 and 1 in our example setting). However, the utilitarian setting also runs the risk of breaking the federation at a certain high value of C_i^G/C_i^P, at which no SC would be willing to cooperate.

5.5.3 Computational Overhead

In this section, we discuss the cost of computing our performance model and market-based model.

Performance Model. Our approximate model can significantly reduce the state space size of the detailed Markov model (see Sect. 5.3.3). For instance, in a 10-SC scenario with each SC sharing 5 VMs, the detailed model has 9 billion states, whose generation and solution require a substantial amount of space and computation time. Our approximate model, in contrast, only needs to build ten Markov models with 1 million states each and compute their steady-state probabilities. Fig. 5.8a illustrates the computation time of the approximate model with 2–10 SCs, each with 10 VMs and sharing 2 VMs. We observe that the computation time increases with the number of SCs due to generating and solving larger linear systems. Our approximate model significantly reduces the state space size, estimating the results faster and with less memory.
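To make "compute their steady-state probabilities" concrete, below is a minimal sketch of the underlying linear-algebra step on a toy birth-death chain; the generator used here is illustrative only and is far smaller than the chapter's actual models M_i.

```python
import numpy as np

def ctmc_steady_state(Q):
    """Steady-state distribution pi of an irreducible CTMC with generator
    matrix Q (rows sum to zero): solve pi Q = 0 together with sum(pi) = 1
    by replacing one redundant balance equation with normalization."""
    n = Q.shape[0]
    A = np.vstack([Q.T[:-1], np.ones(n)])   # drop one balance equation
    b = np.zeros(n)
    b[-1] = 1.0                             # normalization constraint
    return np.linalg.solve(A, b)

# Toy example: an M/M/2 queue truncated at 4 jobs (lam = 1.5, mu = 1);
# the chapter's actual models have up to millions of states per SC.
lam, mu, servers, cap = 1.5, 1.0, 2, 4
Q = np.zeros((cap + 1, cap + 1))
for k in range(cap + 1):
    if k < cap:
        Q[k, k + 1] = lam                   # arrival
    if k > 0:
        Q[k, k - 1] = min(k, servers) * mu  # service completion
    Q[k, k] = -Q[k].sum()
print(ctmc_steady_state(Q))
```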
Market-based Model. SCs use Algorithm 6 to repeatedly adjust their sharing decisions, S_i, at each round of the game in order to maximize their utilities, until reaching an equilibrium state (see Sect. 5.4.2); thus, the market-based model's computation time depends on the Tabu Search distance and on the number of SCs. We consider scenarios with 2–8 SCs in the federation, each with 100 VMs. The number of iterations required decreases as more SCs participate (see Fig. 5.8b). This occurs because any decision change produces bigger effects in a smaller federation; similarly, the influence of a larger search distance is bigger in a smaller federation. For example, our proposed market-based model needs 5 iterations to reach equilibrium when only 2 SCs are in the federation.

[Figure 5.8: Time complexity of the performance model and the game model: (a) computation time (hours) of the approximate performance model vs. the number of SCs; (b) average number of iterations to converge vs. Tabu Search step size, for 2–8 SCs.]

5.6 Discussion and Future Work

We made a number of assumptions in our models; here, we discuss the rationale behind the main ones.

Homogeneous VMs. In practice, each cloud provider offers heterogeneous VM profiles (e.g., memory-optimized, CPU-optimized, or GPU-enabled), which reserve hardware resources on pre-specified machine pools shared by multiple VMs [2]. However, many cloud providers, such as Amazon LightSail, DigitalOcean, and Linode, offer VM configurations with very similar specifications (e.g., $10/month instances from Linode, DigitalOcean, and Amazon Lightsail currently provide 1 CPU core, 30 GB SSD, 2 TB data transfer/month, and 1 or 2 GB of RAM). We believe it is very likely that SCs would negotiate sharing policies for each VM profile separately, given that these profiles correspond to different prices and capacities at each SC. In this case, our model of homogeneous resources can be applied repeatedly to each VM profile. Sharing policies for hardware resources (rather than VM profiles) would require the introduction of scheduling and packing algorithms within our performance model, which is beyond the scope of this work.

I.I.D. Exponential Service Times. Depending on the target application, requests can require two or more VMs to complete a job, and the service times of different requests likely have different distributions. In these cases, our Markov model can address non-exponential service times by introducing phase-type distributions that fit the moments of service-time distributions from real-world traces [102] (a two-moment fit of this kind is sketched below). Similarly, batch arrivals can be introduced with batch Markovian arrival processes (BMAPs). Unfortunately, both approaches result in larger state spaces, with the effect of increasing the computation cost of analyzing our performance model. In this chapter, we motivated the formation of a federation using exponentially-distributed service times and single-VM requests to reduce the computational cost. To relax these assumptions, one of our future goals is to leverage symbolic analysis methods for Markov chains, e.g., methods based on multi-terminal binary decision diagrams (MTBDDs), or lumping of Markov processes, to further cope with the state-space explosion.

Stable System Parameters. This work focuses on establishing a long-term relationship within the federation: in reality, unlike spot clouds, where decisions must be made in a very short period of time, each SC would collect sufficient historical traces over a longer period of time before joining the federation, and would update its sharing decisions after observing a long-term change in system parameters. Our approximate model is designed to deliver results for these kinds of updates.
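The two-moment phase-type fit mentioned above can be as simple as the classic balanced-means two-phase hyperexponential fit; the sketch below is one standard version of it (applicable when the squared coefficient of variation exceeds 1), not the specific closed-form mapping of [102].

```python
import math

def fit_h2_balanced(m1, m2):
    """Fit a two-phase hyperexponential (H2) distribution with balanced
    means to the first two moments m1 = E[X] and m2 = E[X^2] of a
    service-time sample; valid when the squared coefficient of variation
    c2 = m2/m1**2 - 1 exceeds 1. Returns (p, mu1, mu2): with probability
    p the service is exp(mu1), otherwise exp(mu2); the fit matches both
    moments exactly."""
    c2 = m2 / m1**2 - 1.0
    if c2 <= 1.0:
        raise ValueError("H2 fit needs c2 > 1; use Erlang phases otherwise")
    p = 0.5 * (1.0 + math.sqrt((c2 - 1.0) / (c2 + 1.0)))
    return p, 2.0 * p / m1, 2.0 * (1.0 - p) / m1

# Example: mean 1, second moment 6 (so c2 = 5)
print(fit_h2_balanced(1.0, 6.0))
```

The resulting two phases would replace the single exponential service rate in the Markov model, at the price of a larger state space, as noted above.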
Participating in a single federation. In the real world, an SC can participate in multiple cloud federations simultaneously, and its sharing decisions for the different federations might depend on many factors, such as the cost of using shared VMs in each federation. However, this chapter focuses on studying how the price of using shared resources affects the motivation to participate in a federation; profit maximization through the use of resources from multiple federations is outside the scope of this work, but could be addressed in future work.

The feasibility of the tâtonnement process. According to [39], when mixed strategies are considered, the results of the tâtonnement process might be unstable, which does not happen with pure strategies. This is one of the reasons for using pure strategies in practical settings. However, not all games have pure strategy equilibria, although they always have mixed strategy equilibria. In such situations, a tâtonnement process searching for a pure strategy equilibrium will not terminate, indicating the possible non-existence of a pure strategy NE. Currently, we do not have a good solution to this problem; in all of our settings, however, we do reach a pure strategy NE. Given the assumption that we only deal with pure strategies in our game, the results of the tâtonnement process depend on the initial point, particularly when the game has multiple equilibria. Thus, in our game, we tried different initial points and picked the equilibrium that produced a better fairness level among SCs. Throughout our experiments, we could always find an equilibrium in the game; however, as discussed above, the existence of an equilibrium depends significantly on the utilities of the SCs.

SCs follow the sequence of actions. We stress that it is in the rational interest of users to follow the sequence/order specified by the game. Even if some users deviate from this prescription, it is very unlikely that all users would do so at the same time; as long as even a few players follow the sequence/order specified by the game, we end up with a better outcome than no sharing. On an individual level, some players might end up with a worse outcome than in the no-sharing scenario if they do not make new decisions (e.g., leave the federation) when they have to sustain higher costs than when not participating in the federation. However, each SC can achieve a better utility than in the no-sharing scenario as long as its decision reduces the cost of serving customers, even if the SC does not constantly update its decision.

No collusion among SCs. It is possible in practice for certain SCs to collude among themselves to 'game' the cloud sharing system so that they benefit more, in terms of resource availability at cheaper costs, than the others. It is here that the benefits of a federation should come into play in two possible ways: (a) enforcing a strict set of rules prohibiting collusion, together with strong punishments (e.g., exclusion from the federation) for SCs found to collude, and (b) designing economic mechanisms (via the use of mechanism design models) to incentivize SCs not to collude. However, the goal of this work is to study how the price of using shared resources affects the decisions of SCs participating in the federation, not to model collusion; we leave the latter for future work.
The same family of cost and utility functions. In practice, different SCs might have cost and utility functions coming from different mathematical families. However, our design choice (to assume functions from the same family) is motivated by two practical insights stemming from our work.

One major element of our work is the study of how the cost of resource usage affects sharing strategies under different environments. In this regard, we use simple linear cost functions as a representative example, which rationalizes realistic system designs (see Sect. 5.2.2); in so doing, we reduce the complexity of our analysis. However, without loss of generality, other types of cost functions (even those coming from different mathematical families), designed with the same rationale as our performance parameters, will show the same trends (albeit different values) when SCs change their sharing strategies: the reason is that different functions will exhibit similar mathematical properties of monotonicity, continuity, and differentiability.

A second important element of our work is the focus on studying fairness in sharing resources among SCs through the reduction of cost (i.e., C_i^0 − C_i^{S_i} in Eq. (5.1)). To this end, it is imperative that utility comparisons are done within a normalized interval range (e.g., [0,1]), even in the worst case where SCs have utility functions from different families. This requires a formal normalization step which is outside the scope of our work. For simplicity, we assume that each SC utility is already normalized over a given fixed range: we implement this step in the experimental evaluation by fixing the value of γ to be the same for each SC and varying it between 0 and 1. SC cost and utility functions from different mathematical families would, after normalization, produce similar trends and practical insights in the fairness analysis, when compared to our experimental study.

With respect to cost functions, we acknowledge that the cost of using shared VMs from different SCs might differ within the federation. However, our focus is on studying how the cost of using shared resources affects the motivation to participate in the cloud federation. If the cost of using shared resources is not homogeneous, the decision is also affected by the resource allocation strategy (i.e., which SC to request the resources from). We do not introduce resource allocation strategies in this work, and thus assume that prices are homogeneous. In future work, we plan to study how resource allocation strategies affect the sharing decisions made by individual SCs. We also plan to incorporate different factors into our cost functions, such as trust among SCs, and to propose a mechanism to evaluate multi-dimensional fairness for utility functions that belong to different mathematical families.

Future Work. SC-Share evaluates resource-sharing benefits among SCs by accounting only for the cost of using VMs. However, there are other parameters that SC-Share could account for in evaluating resource-sharing benefits: (i) privacy concerns/risks of sharing/forwarding resources within cloud entities, (ii) data transmission costs for forwarding VM requests among cloud entities, and (iii) power consumption costs of running the physical servers hosting VMs. We plan to incorporate these parameters into the SC-Share framework as part of future work.
5.7 Conclusions

In this chapter, we proposed SC-Share to enable small-scale clouds to share their resources in a profitable manner while satisfying customer SLAs. Our framework is based on two interacting models: (i) an approximate performance model with an efficient solution that is able to produce sufficiently accurate estimates of the performance characteristics of interest; and (ii) a market-based model that results in sharing policies which properly incentivize SCs to participate in the federation while achieving market success. SC-Share can suggest different price settings for different federations in order to achieve sufficient market efficiency. Moreover, SC-Share shows that even when the price of shared VMs is equal to the price of using a public cloud, a federation can still be formed under certain criteria.

Chapter 6

Conclusion

In this work, we study resource management problems in two types of services. In order to solve the resource under-provisioning problem, which results in degraded QoS received by users, we propose cooperation among resource holders and a reasonable way to share those resources efficiently.

The first service we study is P2P video streaming as a service. In order to improve the QoS of low-capacity peers, we propose a mechanism, implemented within our Ad-driven Streaming P2p ECosysTem (ASPECT), that allows peers to trade their capacities and ad durations. This mechanism increases opportunities for peers to obtain sufficient download rates so as to significantly reduce video pauses. Moreover, ASPECT allows content providers to achieve their desired profit by providing sufficient incentives for their peer customers to stay in the system and contribute to greater content-provider revenues via ad viewing, while respecting the ad duration contracts with the ad provider (i.e., ensuring that a pre-specified minimal duration of ads is viewed by all peers).

The second service we study is computation as a service. We start by considering resource allocation for machine learning jobs within a single cloud. In order to better utilize resources, we propose a performance model to estimate the throughput of a training job as the number of assigned machines increases, and we leverage this performance model to help the scheduler maximize the benefits of assigning machines to each job. However, providing satisfactory QoS to all customers is sometimes not possible via the use of a single cloud. Thus, our next work focuses on developing a framework for cooperating SCs that can lead to appropriate incentives for individual small-scale clouds to participate in a federation, while making sure that each is profitable and able to meet its SLAs. To this end, we propose a market-based model for determining how much each SC should share. Moreover, we develop a performance model that is able to estimate the parameters needed by the market-based model.

The main contributions of this thesis focus on solving the resource management problem in video streaming as a service and computation as a service. However, the principles and insights derived from this thesis can also be applied to related research fields in distributed systems, such as database systems, the electrical grid, the Internet of Things (IoT), and crowdsourcing.

References

[1] Amazon AWS. http://aws.amazon.com.

[2] Amazon EC2 Instance Types. https://aws.amazon.com/ec2/instance-types/.

[3] Amazon EC2 Spot Instances. https://aws.amazon.com/ec2/spot/.
[4] As Private Cloud Grows, Rackspace Expands Options Inside Equinix. https://blog.equinix.com/blog/2016/05/16/as-private-cloud-grows-rackspace-expands-options-inside-equinix/.

[5] Blippex - Why we moved away from AWS. http://blippex.github.io/updates/2013/09/23/why-we-moved-away-from-aws.html.

[6] Google Compute Engine. https://cloud.google.com/compute.

[7] It's clear to Linode: There's a market to bring cloud services to small companies. http://www.njbiz.com/article/20131104/NJBIZ01/311019994/It.

[8] Microsoft Azure. https://azure.microsoft.com.

[9] Popcorn time.

[10] SpotCloud. http://www.spotcloud.com/.

[11] Youtube. http://www.youtube.com.

[12] Youtube advertise policy.

[13] The bittorrent protocol specifications, 2008.

[14] Martín Abadi et al. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.

[15] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. Tensorflow: A system for large-scale machine learning. In OSDI, volume 16, pages 265–283, 2016.

[16] Mohammad Alizadeh, Shuang Yang, Milad Sharif, Sachin Katti, Nick McKeown, Balaji Prabhakar, and Scott Shenker. pfabric: Minimal near-optimal datacenter transport. In ACM SIGCOMM Comp. Comm. Review, volume 43, 2013.

[17] Michael Armbrust, Armando Fox, Rean Griffith, Anthony D Joseph, Randy Katz, Andy Konwinski, Gunho Lee, David Patterson, Ariel Rabkin, Ion Stoica, et al. A view of cloud computing. Communications of the ACM, 2010.

[18] Ozalp Babaoglu, Moreno Marzolla, and Michele Tamburini. Design and implementation of a P2P Cloud system. In Proceedings of the 27th Annual ACM Symposium on Applied Computing. ACM, 2012.

[19] Anand Balachandran, Geoffrey M Voelker, Paramvir Bahl, and P Venkat Rangan. Characterizing user behavior and network performance in a public wireless lan. In ACM SIGMETRICS Performance Evaluation Review. ACM, 2002.

[20] Lawrence Barsanti and Angela C Sodan. Adaptive job scheduling via predictive job resource allocation. In Workshop on Job Scheduling Strategies for Parallel Processing, pages 115–140. Springer, 2006.

[21] Patrick Billingsley. Probability and measure. John Wiley & Sons, 2008.

[22] Jacek Blazewicz, Mieczyslaw Drabowski, and Jan Weglarz. Scheduling multiprocessor tasks to minimize schedule length. IEEE Transactions on Computers, 35(5):389–393, 1986.

[23] Jacek Błażewicz, Maciej Drozdowski, and Mariusz Markiewicz. Divisible task scheduling–concept and verification. Parallel Computing, 25(1):87–98, 1999.

[24] Jacek Blazewicz, Mikhail Y Kovalyov, Maciej Machowiak, Denis Trystram, and Jan Weglarz. Preemptable malleable task scheduling problem. IEEE Transactions on Computers, 55(4):486–490, 2006.

[25] Jacek Błażewicz, Maciej Machowiak, Jan Weglarz, Mikhail Y Kovalyov, and Denis Trystram. Scheduling malleable tasks on parallel processors to minimize the makespan. Annals of Operations Research, 2004.

[26] Thomas Bocek, Michael Shann, David Hausheer, and Burkhard Stiller. Game theoretical analysis of incentives for large-scale, fully decentralized collaboration networks. In IPDPS. IEEE, 2008.

[27] Léon Bottou. Stochastic gradient descent tricks. In Neural Networks: Tricks of the Trade - Second Edition, pages 421–436. 2012.

[28] Stephen Boyd and Lieven Vandenberghe. Convex optimization. Cambridge University Press, 2004.

[29] George W Brown. Iterative solution of games by fictitious play.
Activity analysis of production and allocation, 1951.

[30] Haipeng Chen, Bo An, Dusit Niyato, Yengchai Soh, and Chunyan Miao. Workload factoring and resource sharing via joint vertical and horizontal cloud federation networks. IEEE Journal on Selected Areas in Communications, 2017.

[31] Yan Chen, Beibei Wang, W Sabrina Lin, Yongle Wu, and KJ Liu. Cooperative peer-to-peer streaming: An evolutionary game-theoretic approach. Circuits and Systems for Video Technology, IEEE Transactions on, 2010.

[32] Trishul M Chilimbi, Yutaka Suzue, Johnson Apacible, and Karthik Kalyanaraman. Project adam: Building an efficient and scalable deep learning training system. In OSDI, volume 14, pages 571–582, 2014.

[33] Yung Ryn Choe, Derek L Schuff, Jagadeesh M Dyaberi, and Vijay S Pai. Improving vod server efficiency with bittorrent. In Multimedia. ACM, 2007.

[34] Stephane Chretien, Jean-Marc Nicod, Laurent Philippe, Veronika Rehn-Sonigo, and Lamiel Toch. Job scheduling using successive linear programming approximations of a sparse model. In Euro-Par, volume 12, pages 116–127. Springer, 2012.

[35] Walfredo Cirne and Francine Berman. Using moldability to improve the performance of supercomputer jobs. Journal of Parallel and Distributed Computing, 62(10):1571–1601, 2002.

[36] Charles W Cobb and Paul H Douglas. A theory of production. The American Economic Review, 1928.

[37] Sébastien Collette, Liliana Cucu, and Joël Goossens. Integrating job parallelism in real-time scheduling theory. Information Processing Letters, 106(5):180–187, 2008.

[38] Ronan Collobert, Koray Kavukcuoglu, and Clément Farabet. Torch7: A matlab-like environment for machine learning. In BigLearn, NIPS Workshop, 2011.

[39] V. Crawford. Essays in Economic Theory (Routledge Revivals). 2004.

[40] L. D'Acunto, N. Andrade, J. Pouwelse, and H. Sips. Peer selection strategies for improved qos in heterogeneous bittorrent-like vod systems. In Multimedia (ISM). IEEE, 2010.

[41] Partha Dasgupta and Eric Maskin. The existence of equilibrium in discontinuous economic games, i: Theory. The Review of Economic Studies, 1986.

[42] Primavera De Filippi and Smari McCarthy. Cloud computing: Legal issues in centralized architectures. In VII International Conference on Internet, Law and Politics, 2011.

[43] Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Andrew Senior, Paul Tucker, Ke Yang, Quoc V Le, et al. Large scale distributed deep networks. In Advances in Neural Information Processing Systems, 2012.

[44] Gerard Debreu. A social equilibrium existence theorem. Proceedings of the National Academy of Sciences, 1952.

[45] Maciej Drozdowski. On the complexity of multiprocessor task scheduling. Bulletin of the Polish Academy of Sciences Technical Sciences, 43(3), 1995.

[46] Jianzhong Du and Joseph Y-T Leung. Complexity of scheduling parallel task systems. SIAM Journal on Discrete Mathematics, 1989.

[47] Pierre-François Dutot, Marco AS Netto, Alfredo Goldman, and Fabio Kon. Scheduling moldable bsp tasks. In JSSPP, pages 157–172. Springer, 2005.

[48] Derek L Eager, John Zahorjan, and Edward D Lazowska. Speedup versus efficiency in parallel systems. IEEE Transactions on Computers, 38(3):408–423, 1989.

[49] Ky Fan. Fixed-point and minimax theorems in locally convex topological linear spaces. Proceedings of the National Academy of Sciences of the United States of America, 1952.

[50] Dror G Feitelson. Job scheduling in multiprogrammed parallel systems. 1997.
[51] Michal Feldman, Kevin Lai, Ion Stoica, and John Chuang. Robust incentive techniques for peer-to-peer networks. In Proceedings of the 5th ACM Conference on Electronic Commerce. ACM, 2004.

[52] Attilio Fiandrotti, Valerio Bioglio, Marco Grangetto, Rossano Gaeta, and Enrico Magli. Band codes for energy-efficient network coding with application to p2p mobile streaming. Multimedia, IEEE Transactions on, 2014.

[53] Bennett L. Fox and Peter W. Glynn. Computing Poisson Probabilities. Commun. ACM.

[54] Drew Fudenberg and Jean Tirole. Game theory. 1991. Cambridge, Massachusetts, 393, 1991.

[55] Irving L Glicksberg. A further generalization of the kakutani fixed point theorem, with application to nash equilibrium points. Proceedings of the American Mathematical Society, 1952.

[56] Fred Glover. Tabu search-part I. ORSA Journal on Computing, 1989.

[57] Íñigo Goiri, Jordi Guitart, and Jordi Torres. Economic model of a cloud provider operating in a federated cloud. Information Systems Frontiers, 2012.

[58] C Goktug Gurler, S Sedef Savas, and A Murat Tekalp. Variable chunk size and adaptive scheduling window for p2p streaming of scalable video. In 19th International Conference on Image Processing. IEEE, 2012.

[59] Makhlouf Hadji and Djamal Zeghlache. Mathematical programming approach for revenue maximization in cloud federations. IEEE Transactions on Cloud Computing, 2015.

[60] Stefan Hadjis, Ce Zhang, Ioannis Mitliagkas, Dan Iter, and Christopher Ré. Omnivore: An optimizer for multi-device deep learning on cpus and gpus. arXiv preprint arXiv:1606.04487, 2016.

[61] Anwar Al Hamra, Arnaud Legout, and Chadi Barakat. Understanding the properties of the bittorrent overlay. arXiv preprint arXiv:0707.1820, 2007.

[62] Mohammad Mehedi Hassan, M Shamim Hossain, AM Jehad Sarkar, and Eui-Nam Huh. Cooperative game-based distributed resource allocation in horizontal dynamic cloud federation platform. Information Systems Frontiers, 2014.

[63] X. Hei, C. Liang, J. Liang, Y. Liu, and K.W. Ross. A measurement study of a large-scale p2p iptv system. Transactions on Multimedia, 2007.

[64] Yann Hendel, Wieslaw Kubiak, and Denis Trystram. Scheduling semi-malleable jobs to minimize mean flow time. Journal of Scheduling, 18(4):335, 2015.

[65] Martin Heusse, Franck Rousseau, Gilles Berger-Sabbatel, and Andrzej Duda. Performance anomaly of 802.11b. In INFOCOM. IEEE, 2003.

[66] P.K. Hoong and H. Matsuo. Push-pull incentive-based p2p live media streaming system. WSEAS Transactions on Communications, 2008.

[67] G. Huang. Pplive: A practical p2p live system with huge amount of users. In Proceedings of the ACM SIGCOMM Workshop on Peer-to-Peer Streaming and IPTV Workshop, 2007.

[68] Shenglan Huang, Ebroul Izquierdo, and Pengwei Hao. Bandwidth-efficient packet scheduling for live streaming with network coding. IEEE Transactions on Multimedia, 2016.

[69] Te-Yuan Huang, Nikhil Handigol, Brandon Heller, Nick McKeown, and Ramesh Johari. Confused, timid, and unstable: picking a video streaming rate is hard. In Internet Measurement Conference. ACM, 2012.

[70] Hulu. Hulu. http://www.hulu.com, 2014.

[71] Forrest N Iandola, Matthew W Moskewicz, Khalid Ashraf, and Kurt Keutzer. Firecaffe: near-linear acceleration of deep neural network training on compute clusters. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2592–2600, 2016.

[72] Klaus Jansen and Lorant Porkolab. Linear-time approximation schemes for scheduling malleable parallel tasks. Algorithmica, 32(3):507–520, 2002.
[73] Klaus Jansen and Lorant Porkolab. Computing optimal preemptive schedules for parallel tasks: linear programming approaches. Mathematical Programming, 95(3):617–630, 2003.

[74] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.

[75] Taisei Kaizoji. Multiple equilibria and chaos in a discrete tâtonnement process. Journal of Economic Behavior & Organization, 2010.

[76] Yacine Kessaci, Nouredine Melab, and El-Ghazali Talbi. A Pareto-based metaheuristic for scheduling HPC applications on a geographically distributed cloud federation. Cluster Computing, 16(3):451–468, 2013.

[77] Leonard Kleinrock. Queueing Systems. Wiley, 1975.

[78] S Shunmuga Krishnan and Ramesh K Sitaraman. Video stream quality impacts viewer behavior: inferring causality using quasi-experimental designs. In Internet Measurement Conference. ACM, 2012.

[79] Alex Krizhevsky. One weird trick for parallelizing convolutional neural networks. arXiv preprint arXiv:1404.5997, 2014.

[80] R. Kumar, Y. Liu, and K. Ross. Stochastic fluid theory for p2p streaming systems, 2007.

[81] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, May 2015.

[82] Wan Yeon Lee, Sung Je Hong, and Jong Kim. On-line scheduling of scalable real-time tasks on multiprocessor systems. Journal of Parallel and Distributed Computing, 63(12):1315–1324, 2003.

[83] Arnaud Legout, Nikitas Liogkas, Eddie Kohler, and Lixia Zhang. Clustering and sharing incentives in bittorrent systems. In SIGMETRICS Performance Evaluation Review. ACM, 2007.

[84] Mu Li, David G Andersen, Jun Woo Park, Alexander J Smola, Amr Ahmed, Vanja Josifovski, James Long, Eugene J Shekita, and Bor-Yiing Su. Scaling distributed machine learning with the parameter server. In OSDI, 2014.

[85] Min Lin, Qiang Chen, and Shuicheng Yan. Network in network. arXiv preprint arXiv:1312.4400, 2013.

[86] W Sabrina Lin, H Vicky Zhao, and KJ Ray Liu. Incentive cooperation strategies for peer-to-peer live multimedia streaming social networks. Transactions on Multimedia, 2009.

[87] Z. Liu, Y. Shen, K.W. Ross, S.S. Panwar, and Y. Wang. Substream trading: Towards an open p2p live streaming system. In ICNP, 2008.

[88] Zhengye Liu, Yanming Shen, Shivendra S Panwar, Keith W Ross, and Yao Wang. Using layered video to provide incentives in p2p live streaming. In Proceedings of the 2007 Workshop on Peer-to-Peer Streaming and IP-TV. ACM, 2007.

[89] Richard TB Ma, Sam Lee, John Lui, and David KY Yau. Incentive and service differentiation in p2p networks: a game theoretic approach. IEEE/ACM Transactions on Networking (TON), 2006.

[90] Enrico Magli, Mea Wang, Pascal Frossard, and Athina Markopoulou. Network coding meets multimedia: A review. Multimedia, IEEE Transactions on, 2013.

[91] Konstantin Makarychev and Debmalya Panigrahi. Precedence-constrained scheduling of malleable jobs with preemption. arXiv preprint arXiv:1404.6850, 2014.

[92] Andreu Mas-Colell, Michael Dennis Whinston, Jerry R Green, et al. Microeconomic theory. Oxford University Press, New York, 1995.

[93] Lena Mashayekhy, Mahyar Movahed Nejad, and Daniel Grosu. Cloud federations in the sky: Formation game and mechanism. IEEE Transactions on Cloud Computing, 2015.

[94] Jeonghoon Mo and Jean Walrand. Fair end-to-end window-based congestion control. IEEE/ACM Transactions on Networking (ToN), 2000.
[95] Jeffrey C Mogul and Lucian Popa. What we talk about when we talk about cloud network performance. ACM SIGCOMM Computer Communication Review, 2012.

[96] Kianoosh Mokhtarian and Mohamed Hefeeda. Capacity management of seed servers in peer-to-peer streaming systems with scalable video streams. Multimedia, IEEE Transactions on, 2013.

[97] J.J.D. Mol, D.H.J. Epema, and HJ Sips. The orchard algorithm: P2p multicasting without free-riding. In Peer-to-Peer Computing. IEEE, 2006.

[98] John F Nash et al. Equilibrium points in n-person games. Proc. Nat. Acad. Sci. USA, 1950.

[99] Netflix. Netflix. http://www.netflix.com, 2014.

[100] Nhan-Quy Nguyen, Farouk Yalaoui, Lionel Amodeo, Hicham Chehade, and Pascal Toggenburger. Solving a malleable jobs scheduling problem to minimize total weighted completion times by mixed integer linear programming models. In ACIIDS'16, pages 286–295. Springer.

[101] Dusit Niyato, Athanasios V Vasilakos, and Zhu Kun. Resource and revenue sharing with coalition formation of cloud providers: Game theoretic approach. In Proceedings of the 11th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, pages 215–224. IEEE Computer Society, 2011.

[102] Takayuki Osogami and Mor Harchol-Balter. Closed form solutions for mapping general distributions to quasi-minimal ph distributions. Performance Evaluation, 63(6):524–552, 2006.

[103] V. Pai and A.E. Mohr. Improving robustness of peer-to-peer streaming with incentives. 1st NetEcon, 2006.

[104] A. ParandehGheibi, M. Médard, A. Ozdaglar, and S. Shakkottai. Avoiding interruptions - a qoe reliability function for streaming media applications. IEEE JSAC, 2011.

[105] Hyunggon Park, Rafit Izhak Ratzin, and Mihaela van der Schaar. Peer-to-peer networks - protocols, cooperation and competition. Streaming Media Architectures, Techniques, and Applications: Recent Advances, pages 262–294, 2010.

[106] N. Parvez, C. Williamson, A. Mahanti, and N. Carlsson. Analysis of bittorrent-like protocols for on-demand stored media streaming. In ACM SIGMETRICS Performance Evaluation Review. ACM, 2008.

[107] F. Pianese and D. Perino. Resource and locality awareness in an incentive-based p2p live streaming system. In Proceedings of the 2007 Workshop on Peer-to-Peer Streaming and IP-TV. ACM, 2007.

[108] T. Qiu, I. Nikolaidis, and F. Li. On the design of incentive-aware p2p streaming. Journal of Internet Engineering, 2007.

[109] Martin Reiser. A Queueing Network Analysis of Computer Communication Networks with Window Flow Control. IEEE Transactions on Communications, 27(8):1199–1209, Aug 1979.

[110] Pedro L Rodrigues and Jânio M Monteiro. Bittorrent based transmission of real-time scalable video over p2p networks. In Information Systems and Technologies (CISTI), 2012 7th Iberian Conference on. IEEE, 2012.

[111] J. Rückert, O. Abboud, T. Zinner, R. Steinmetz, and D. Hausheer. Quality adaptation in p2p video streaming based on objective qoe metrics. NETWORKING, 2012.

[112] Gerald Sabin, Matthew Lang, and P Sadayappan. Moldable parallel job scheduling using job efficiency: An iterative approach. In Workshop on Job Scheduling Strategies for Parallel Processing. Springer, 2006.

[113] Nancy Samaan. A novel economic sharing model in a federation of selfish cloud providers. IEEE Transactions on Parallel and Distributed Systems, 2014.

[114] Peter Sanders and Jochen Speck. Efficient parallel scheduling of malleable tasks. In Parallel & Distributed Processing Symposium (IPDPS), 2011 IEEE International, pages 1156–1166.
IEEE, 2011.

[115] Wayne Schmidt. How Much TV Commercial Length has Grown over the Years. http://www.waynesthisandthat.com/commerciallength.htm, 2014. [Online; accessed 20-June-2014].

[116] Kenneth C Sevcik. Application scheduling and processor allocation in multiprogrammed parallel processing systems. Performance Evaluation, 19(2-3):107–140, 1994.

[117] Syed Munir Hussain Shah, Kalim Qureshi, and Haroon Rasheed. Optimal job packing, a backfill scheduling optimization for a cluster of workstations. The Journal of Supercomputing, 54(3):381–399, 2010.

[118] Mark Shifrin, Rami Atar, and Israel Cidon. Optimal scheduling in the hybrid-cloud. In Integrated Network Management (IM 2013). IEEE, 2013.

[119] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

[120] Laura Stevens. Amazon finds the cause of its aws outage: A typo. Wall Street Journal, March 7th, 2017.

[121] W. J. Stewart. Introduction to the Numerical Solution of Markov Chains. Princeton University Press, 1995.

[122] S Sudha, K Savitha, and P Sadayappan. A robust scheduling strategy for moldable scheduling of parallel jobs. In Fifth IEEE International Conference on Cluster Computing, 2003.

[123] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In CVPR'15, pages 1–9.

[124] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In CVPR'16.

[125] Guang Tan and Stephen A Jarvis. A payment-based incentive and service differentiation scheme for peer-to-peer streaming broadcast. Transactions on Parallel and Distributed Systems, 2008.

[126] Theano Development Team. Theano: A Python framework for fast computation of mathematical expressions. arXiv e-prints, 2016.

[127] Adel Nadjaran Toosi, Rodrigo N Calheiros, Ruppa K Thulasiram, and Rajkumar Buyya. Resource provisioning policies to increase iaas provider's profit in a federated cloud environment. In IEEE 13th International Conference on High Performance Computing and Communications (HPCC). IEEE, 2011.

[128] Tram Truong-Huu and Chen-Khong Tham. A novel model for competition and cooperation among cloud providers. IEEE Transactions on Cloud Computing, 2014.

[129] John Turek, Walter Ludwig, Joel L Wolf, Lisa Fleischer, Prasoon Tiwari, Jason Glasgow, Uwe Schwiegelshohn, and Philip S Yu. Scheduling parallelizable tasks to minimize average response time. In Proceedings of the Sixth Annual ACM Symposium on Parallel Algorithms and Architectures, pages 200–209. ACM, 1994.

[130] Alexander Ulanov, Andrey Simanovsky, and Manish Marwah. Modeling scalability of distributed machine learning. In ICDE'17, pages 1249–1254. IEEE.

[131] Hal R Varian. Intermediate Microeconomics: A Modern Approach: Ninth International Student Edition. WW Norton & Company, 2014.

[132] Shivaram Venkataraman, Zongheng Yang, Michael J Franklin, Benjamin Recht, and Ion Stoica. Ernest: Efficient performance prediction for large-scale advanced analytics. In NSDI, pages 363–378, 2016.

[133] Aggelos Vlavianos, Marios Iliofotou, and Michalis Faloutsos. Bitos: Enhancing bittorrent for supporting streaming applications. In INFOCOM, 2006.

[134] Long Vu. Pplive project, 2008.

[135] B.C. Wang, A.L.H. Chow, and L. Golubchik. P2p streaming: use of advertisements as incentives. In Multimedia Systems. ACM, 2012.
[136] Haiyang Wang, Feng Wang, Jiangchuan Liu, Dan Wang, and Justin Groen. Enabling Customer-Provided Resources for Cloud Computing: Potentials, Challenges, and Implementation. IEEE Transactions on Parallel & Distributed Systems, 2014.

[137] Zhenyu Wen, Jacek Cala, Paul Watson, and Alexander Romanovsky. Cost effective, reliable and secure workflow deployment over federated clouds. IEEE Transactions on Services Computing, 2016.

[138] Di Wu, Yi Liang, Jian He, and Xiaojun Hei. Balancing performance and fairness in p2p live video systems. Circuits and Systems for Video Technology, IEEE Transactions on, 2013.

[139] Y. Xu, E. Altman, R. El-Azouzi, S. Elayoubi, and M. Haddad. Qoe analysis of media streaming in wireless data networks. NETWORKING, 2012.

[140] Y. Xu, E. Altman, R. El-Azouzi, M. Haddad, S. Elayoubi, and T. Jimenez. Probabilistic analysis of buffer starvation in markovian queues. In INFOCOM, 2012.

[141] Y. Yang, A. Chow, L. Golubchik, and D. Bragg. Improving qos in bittorrent-like vod systems. In INFOCOM. IEEE, 2010.

[142] Hao Yin, Xuening Liu, Tongyu Zhan, Vyas Sekar, Feng Qiu, Chuang Lin, Hui Zhang, and Bo Li. Design and deployment of a hybrid cdn-p2p system for live video streaming: experiences with livesky. In Proceedings of the 17th ACM International Conference on Multimedia. ACM, 2009.

[143] Ge Zhang, Wei Liu, Xiaojun Hei, and Wenqing Cheng. Unreeling xunlei kankan: understanding hybrid cdn-p2p video-on-demand streaming. Multimedia, IEEE Transactions on, 2015.

[144] Hui Zhang, Guofei Jiang, Kenji Yoshihira, and Haifeng Chen. Proactive workload management in hybrid cloud computing. IEEE Transactions on Network and Service Management, 2014.

[145] Xinyan Zhang, Jiangchuan Liu, Bo Li, and Tak-Shing Peter Yum. Coolstreaming/donet: a data-driven overlay network for peer-to-peer live media streaming. In INFOCOM. IEEE, 2005.

[146] Marat Zhanikeev. A method for extremely scalable and low demand live p2p streaming based on variable bitrate. In First International Symposium on Computing and Networking. IEEE, 2013.

[147] Bridge Qiao Zhao, J Lui, and Dah-Ming Chiu. Exploring the optimal chunk selection policy for data-driven p2p streaming systems. In P2P, 2009.

[148] Hao Zhuang, Raziur Rahman, and Karl Aberer. Decentralizing the cloud: How can small data centers cooperate? In 14th IEEE International Conference on Peer-to-Peer Computing (P2P), 2014.

Appendix A

Mathematical Assumptions in Existing Theorems

Here, we describe in detail why existing seminal micro-economics theorems cannot be used to derive closed-form results for market equilibria in our work.

The seminal results by Nash provide a formal proof, for both finite (strategy spaces need not be continuous) and infinite (strategy space is continuous) games, that the existence of an equilibrium is guaranteed [98] only when (i) the strategy set is compact (i.e., closed and bounded), convex, and non-empty, and (ii) the utility functions are quasi-concave (or satisfy stronger forms of concavity) in a player's (mixed strategy) action, and continuous in the vector of players' actions. In addition, Nash's theorem is valid only for the guaranteed existence of a mixed strategy Nash equilibrium. However, in our work we are only interested in the guarantee of pure strategy Nash equilibria (see the reason below), as they are more practical to implement and realize. Thus, we design our SC utility function as a quasi-concave function.
However, if SCs' self-defined utility functions are non-quasi-concave, a mathematical proof of the existence of Nash equilibria for non-quasi-concave utility functions is still a difficult open problem based on Nash's theorem.

In regard to guaranteeing pure strategy Nash equilibria, consider another seminal theorem by Debreu, Fan, and Glicksberg (derived independently) [44, 49, 55]: infinite games, under the assumptions of (i) quasi-concavity of the utility functions, (ii) continuity of the utility functions in the vector of players' actions, and (iii) convex and compact strategy sets, are guaranteed to have a pure strategy Nash equilibrium. However, for many practical settings including ours, the quasi-concavity assumption might not always be satisfied (for arbitrary SC utility functions), which in turn does not guarantee a pure strategy Nash equilibrium (violating the theorem's assumptions). Thus, a mathematical proof of the existence of Nash equilibria for non-quasi-concave utility functions is still a difficult open problem based on the theorem by Glicksberg et al. In addition, strategy sets in many applications (including specialized versions of our application setting, i.e., the number of shared VMs is discrete in nature) might not be continuous, in which case we would have to fall back on Nash's theorem, which only guarantees mixed strategy Nash equilibria.

An even stronger seminal theoretical result was proposed by Dasgupta and Maskin [41], stating that games under the assumptions of (i) quasi-concavity of the utility functions, (ii) possibly discontinuous utility functions (if we are dealing with arbitrary utility functions, as SCs might demand) in the vector of players' actions, and (iii) convex and compact strategy sets, are guaranteed to have a mixed strategy Nash equilibrium. However, in our work we are only interested in pure strategy Nash equilibria (see the reason below). Thus, a mathematical proof of the existence of Nash equilibria for non-quasi-concave utility functions is still a difficult open problem based on the theorem by Dasgupta and Maskin.

Thus, we observe that a practical model of a system might not always fit the theoretical assumptions required to mathematically prove the existence of pure strategy Nash equilibria. Therefore, we resort to a simulation-based evaluation to search for the existence of Nash equilibria. Through simulation results, we do observe the existence of pure strategy Nash equilibria for infinite strategy spaces (simulated in a discrete manner, thereby becoming a finite game in simulation), and for non-quasi-concave utility functions. Thus, at least from the experimental results, we observe that for our work, (i) it is not necessary (via the theorem of Nash) for quasi-concavity to hold for a pure strategy Nash equilibrium to exist (also discounting the guarantee of only a mixed strategy via Nash's theorem), and (ii) it is not necessary (via the theorem of Debreu et al.) for quasi-concavity to hold for a pure strategy Nash equilibrium to exist (also discounting the infinite strategy space assumption of the theorem by Debreu et al., as the simulation is discrete in nature).
Given all of the above-mentioned difficulties in fitting the assumptions required to prove the existence of a Nash equilibrium in theory, together with the information structure of our setting, we adopt the typical approach of fictitious play, a time-averaged technique [29] from the theory of learning in games, which allows us to converge to a Nash equilibrium, provided one exists (a minimal sketch of fictitious play is given at the end of this appendix). However, we cannot guarantee convergence to an equilibrium if the SCs’ self-defined utility functions are not quasi-concave.

Finally, we report a historical perspective on the rationale for studying pure strategy Nash equilibria:

During the 1980s, the concept of mixed strategies came under heavy fire for being intuitively “problematic.” Randomization, central in mixed strategies, lacks behavioral support. Seldom do people make their choices following a lottery. This behavioral problem is compounded by the cognitive difficulty that people are unable to generate random outcomes without the aid of a random or pseudo-random generator. In 1991, game theorist Ariel Rubinstein described alternative ways of understanding the concept. The first, due to Harsanyi (1973), is called purification, and supposes that the mixed strategies interpretation merely reflects our lack of knowledge of the players’ information and decision-making process. Apparently random choices are then seen as consequences of non-specified, payoff-irrelevant exogenous factors. However, it is unsatisfying to have results that hang on unspecified factors. Later, Aumann and Brandenburger (1995) [29] re-interpreted Nash equilibrium as an equilibrium in beliefs, rather than actions. For instance, in the “rock-paper-scissors” game an equilibrium in beliefs would have each player believing the other was equally likely to play each strategy. This interpretation weakens the predictive power of Nash equilibrium, however, since it is possible in such an equilibrium for each player to actually play a pure strategy of Rock. Ever since, game theorists’ attitude towards mixed-strategy-based results has been ambivalent. Mixed strategies are still widely used for their capacity to provide Nash equilibria in games where no equilibrium in pure strategies exists, but the model does not specify why and how players randomize their decisions.

From: Strategy (game theory), Wikipedia
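For concreteness, the following is the minimal sketch of fictitious play referenced above; it is our own illustration under simplifying assumptions, not the thesis implementation, and the coordination-game payoffs are hypothetical. Each player best-responds to the empirical frequency of the opponent’s past actions; when the empirical play converges, it converges to a Nash equilibrium.

# Minimal sketch of fictitious play for a finite two-player game
# (hypothetical illustration only; not the thesis implementation).
import numpy as np

def fictitious_play(u1, u2, rounds=2000):
    # u1, u2: n-by-m payoff matrices for players 1 and 2.
    n, m = u1.shape
    counts1, counts2 = np.ones(n), np.ones(m)  # uniform prior over actions
    for _ in range(rounds):
        belief2 = counts2 / counts2.sum()  # player 1's belief about player 2
        belief1 = counts1 / counts1.sum()  # player 2's belief about player 1
        counts1[int(np.argmax(u1 @ belief2))] += 1  # best response of player 1
        counts2[int(np.argmax(belief1 @ u2))] += 1  # best response of player 2
    return counts1 / counts1.sum(), counts2 / counts2.sum()

# Hypothetical coordination game: both players prefer to match on action 0.
u1 = np.array([[2.0, 0.0],
               [0.0, 1.0]])
u2 = u1.copy()
print(fictitious_play(u1, u2))  # frequencies concentrate on the pure equilibrium (0, 0)

In our setting, an action would correspond to a discretized sharing decision (e.g., a number of shared VMs), with the empirical frequencies playing the role of the time-averaged beliefs described above.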