AN END-TO-END FRAMEWORK FOR PROVISIONING BASED RESOURCE AND APPLICATION MANAGEMENT IN GRIDS

by

Gurmeet Singh

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)

May 2008

Copyright 2008 Gurmeet Singh

Dedication

To My Family

Acknowledgements

I would like to thank my advisor, Dr. Carl Kesselman, for his unwavering support and excellent guidance during my graduate studies. Despite his worldwide stature and renown, he has always been very kind and encouraging to me, despite the mistakes that I might have made during the course of my Ph.D. His critical feedback has helped me develop the research skills necessary for doing the research described in this dissertation. He has also helped me develop the writing skills necessary to write good quality papers. I will forever remain grateful to him.

I would also like to thank my mentor and supervisor, Dr. Ewa Deelman, for her support during my graduate studies. She has been a great counselor in difficult times. Apart from the technical guidance, she has given me great moral support to persevere in this long journey. I was very fortunate to have her guidance and will forever remain thankful to her. I am also thankful to Dr. Ann Chervenak, who has also been a great source of support throughout my graduate studies. I am also thankful to my dear friends Karan Vahi, Gaurang Mehta, Shishir Bharathi, and others at the Center for Grid Technologies for the good times shared with them.

This long journey would not have been possible without the support of my family. My wife always kept encouraging me and kept me positive despite difficult times. Her unwavering support even in tough times has made this dissertation possible. My parents, brothers, and sister have also extended their love and support to me. They have kept me true to the cause and kept me focused on my studies. I owe this dissertation to them.

Table of Contents

Acknowledgements
List of Tables
List of Figures
Abstract
Chapter 1: Introduction
    1.1 Problem Statement
    1.2 Approach
    1.3 Thesis
    1.4 Research Contributions
    1.5 Outline
Chapter 2: Related Work
    2.1 Grid Resource Management
        2.1.1 Resource management of parallel computers
        2.1.2 Advance Reservations
        2.1.3 Economics based resource allocation
    2.2 Application Scheduling
    2.3 Agreement Framework and Protocols
Chapter 3: Non-delaying Scheduler with Advertisement Model
    3.1 Evaluation
Chapter 4: Application Provisioning using the Slot Advertisement Model
    4.1 Application Model
    4.2 Resource Provisioning Problem
    4.3 Pareto Optimality
    4.4 Two approaches to multi-objective optimization
    4.5 Multi-Objective Genetic Algorithm (MOGA)
    4.6 Using High-level Information
    4.7 Evaluation
    4.8 Seismic Hazard Application
    4.9 Multi-site scheduling
Chapter 5: Adaptive Pricing Scheduler
    5.1 Pricing Algorithm
    5.2 Interaction models
    5.3 Experiments
Chapter 6: Comparison of Virtual and Advanced Reservation System
    6.1 Virtual advanced reservation (VAR)
        6.1.1 Allocation cost of VAR
    6.2 Experiments
    6.3 Results
        6.3.1 32 node experiments
        6.3.2 64 node experiments
    6.4 Combining the VAR and the adaptive pricing scheme
    6.5 Conclusion
Chapter 7: Application Provisioning with the Adaptive Pricing Model
    7.1 Algorithm
    7.2 Evaluation
    7.3 Multi-Site scheduling
Chapter 8: Conclusion
    8.1 Future Work
Bibliography

List of Tables

Table 1. Cost of possible subslot allocation.
Table 2. Module details of the Seismic Hazard Analysis workflow.
Table 3. Reservation costs at times from t1 to t6.

List of Figures

Figure 1. Schedule of tasks and resulting slots.
Figure 2. Scheduling a task on a set of slots.
Figure 3. Left over slots after scheduling a task on one or more slots.
Figure 4. Frequency distribution of the number of available slots.
Figure 5. Correlation of advertised slots and number of tasks in the system.
Figure 6. Distribution of start and runtime of slots.
Figure 7. Frequency distribution of the processor count of slots.
Figure 8. Distribution of start time and runtime of slots with inaccurate estimates.
Figure 9. The solutions â1, â2, and â3 dominate the solutions â4 and â5.
Figure 10. Two approaches to multi-objective optimization.
Figure 11. Working of MOGA.
Figure 12. Workflow used for evaluation.
Figure 13. Makespan and allocation cost of the best effort and the provisioned approach for different trade-off factors.
Figure 14. Impact of the workflow size.
Figure 15. Makespan and allocation cost for the provisioned approach for varying task sizes.
Figure 16. Normalized makespan and allocation cost of the provisioned approach with increasing resource utilization.
Figure 17. A Seismic Hazard Analysis Workflow.
Figure 18. Makespan and allocation cost of the seismic hazard analysis workflow.
Figure 19. Comparison of provisioned and best effort with increasing number of resources.
Figure 20. A schedule of 4 tasks with conservative backfilling.
Figure 21. Resource schedule without (left) and with (right) the reservation in place.
Figure 22. Slot Pricing Algorithm.
Figure 23. Average price corresponding to different trade-off factors.
Figure 24. The average wait time of reservations and best effort jobs.
Figure 25. Reservation pricing with increasing resource load.
Figure 26. Normalized reservation wait times with increasing resource load.
Figure 27. Prices with increasing proportion of reservations.
Figure 28. The average wait time of reservations and best effort jobs.
Figure 29. Impact of conservative vs aggressive backfilling on pricing.
Figure 30. Comparison of slot pricing with accurate and inaccurate estimates.
Figure 31. Slot and best effort wait times with inaccurate runtime estimates.
Figure 32. Reservation start time and duration.
Figure 33. Determining the submission and run time of task in VAR.
Figure 34. Virtual reservation scenario.
Figure 35. Queue time prediction service on the NCSA TeraGrid Cluster.
Figure 36. Working of the VAR algorithm for two requests.
Figure 37. Jobs running and queued at the NCSA TeraGrid cluster.
Figure 38. Allocation cost of the VAR and the adaptive pricing scheme.
Figure 39. Success rate of the VAR and the adaptive pricing system.
Figure 40. Percentage of cases when VAR or adaptive pricing is cheaper than the other.
Figure 41. Allocation cost of VAR and the adaptive pricing scheme for the 64 node reservations.
Figure 42. A VAR Scenario.
Figure 43. Success rate of VAR and adaptive pricing scheme with 64 node reservations.
Figure 44. Percentage of cases when VAR or adaptive pricing is cheaper than the other.
Figure 45. The algorithm combining VAR and the adaptive pricing reservation.
Figure 46. Allocation cost of combined scheme for 32 node reservations.
Figure 47. Success rate of the combined scheme for the 32 node reservations.
Figure 48. Allocation cost of the combined scheme for the 64 node reservations.
Figure 49. Success rate of the combined scheme for 64 node reservations.
Figure 50. Resource schedule with two running and two queued tasks.
Figure 51. Greedy Provisioning Algorithm.
Figure 52. Makespan and allocation cost of the best effort and the provisioned approach for different trade-off factors.
Figure 53. Impact of workflow size.
Figure 54. Makespan and allocation cost for varying task sizes.
Figure 55. Effect of resource utilization.
Figure 56. Comparison of provisioning and best effort in a multi resource setting.

Abstract

Resources in distributed infrastructure such as the Grid are typically autonomously managed and shared across a distributed set of end users.
These characteristics result in a fundamental conflict: resource providers optimize for throughput and utilization, which, coupled with a stochastic multi-user workload, results in non-deterministic best effort service for any one application. This conflicts with the user, who wants to optimize end-to-end application performance but is constrained by the best effort service offering. Resource provisioning can be used to obtain a deterministic quality of service, but it is generally not allowed due to the perceived impact on the other users and on overall resource utilization. Without a deterministic quality of service, it is not possible to co-allocate resources from multiple providers in a scheduled manner and thus realize the true potential of Grid computing. In this thesis, we examine two strategies for integrating reservations within the resource management fabric that address these concerns by either minimizing the adverse impact of a reservation on the other users or enabling a resource provider to recoup losses through a differentiated pricing mechanism. Correspondingly, we also present algorithms for optimizing the application performance when resources provide automated reservations using the previously developed strategies. These algorithms use a cost based model to identify the set of reservations to be made for the application in order to optimize performance while minimizing the cost of the reservations. The cost based model allows the users to make a trade-off between the application performance and the resulting resource costs. Using trace-based simulations and task graph structured applications, we compare the application performance and resource cost when the application is executed using reservations to that when only best effort service is available. We show that the approach incorporating reservations can provide superior performance for the application at a price that the user can predetermine. Also, the benefits of using the reservation based approach become more pronounced when the resources are under high utilization and/or the applications have significant resource requirements. The work in this thesis complements the work done on developing frameworks for service level agreements in Grids.

Chapter 1
Introduction

Applications in many scientific domains such as high energy physics, astronomy, earthquake engineering, and weather prediction often require scheduled access to resources in order to produce meaningful results. However, due to their large scale nature, the resource demand of these applications cannot be met by the resources available at a single institute or organization. At the same time, developments in information technology have made feasible the concurrent use of resources spread across multiple organizations for executing these applications. Thus we have seen the growth of shared distributed infrastructure such as the TeraGrid [15] and the Open Science Grid [4] that provide the execution platform for these applications. The resources in this infrastructure are often geographically distributed and owned by different organizations or entities. Access to these resources by users, both local and remote, is provided under the resource owner's policies. There has been significant progress in the design and development of tools for facilitating access to heterogeneous autonomous distributed systems. These systems are also often known as Grids. These Grids have three distinguishing features [27]. First, they represent resource co-ordination with decentralized control.
Secondly, sharing is achieved using open, general-purpose protocols and interfaces. Lastly, these Grids are intended to deliver non-trivial qualities of service. There has been significant progress in resource co-ordination through the development of Virtual Organizations (VOs), and in open, general-purpose protocols and interfaces through the development of Globus [26] and other related projects. However, the realization of non-trivial qualities of service is still a distant goal, and it is the one we have tried to address in this thesis. The qualities of service refer to the achievement of desired throughput, response time, co-allocation of multiple resources to meet application requirements, etc.

Due to the decentralized nature of the execution environment, each entity tries to optimize its behavior based on its own goal. Resource providers, for instance, want to maximize the utilization or throughput of their resources. The allocation of resources to applications is done by a local resource management system (LRM) based on the resource owner's policies. This has led to a queuing based resource management model where users submit application tasks to the resource queues, and these tasks are allocated resources at a later time based on the workload of the resource and some fairness or utilization criteria specified by the resource owner's policy. Thus the users do not have any control over the allocation of resources to their applications and hence are not able to optimize for performance. This quality of service offered to the user is often called best effort.

On the other hand, the best effort service allows considerable flexibility to the users in the sense that they can submit resource requests without tight estimates of the runtime parameters, and tasks can be cancelled at any time, even after resources have been allocated to them, without any penalty. This further reinforces the non-deterministic nature of the best effort service, since the resource providers cannot plan the future use of their resources due to the lack of information and commitment from the users towards the requested resource allocation.

Due to this non-deterministic nature of the best effort quality of service, it is difficult if not impossible for the users to achieve desired performance goals such as response time, particularly when co-allocation of resources across multiple providers is required. Thus prediction services such as the Network Weather Service [60], queue wait time estimators [11, 20], etc. are used to make scheduling decisions for the applications [9, 14, 49]. However, the resulting application performance is highly dependent on the quality of the predictions. Moreover, frequent adaptation is required for countering the dynamic nature of resource availability.

Agreement based resource management has been proposed as an alternative. Under this model, the users and the resource providers enter into a binding agreement about when resources would be made available to the users, for how long, and at what price. The users can then execute their applications without worrying about external factors such as the allocation policy of the resource provider, the workload of the resource, etc. The resource providers, on the other hand, can plan ahead and optimize the return on their investment in the resources. There has been a recent focus on the frameworks [1, 17] and protocols for agreement based resource management [8, 29, 40].
These provide the underlying plumbing required to instantiate a set of resource agreements for an application and complement the work presented in this thesis. However, there has been little work on the reasoning process for selecting the set of agreements to be created for workflow structured applications, from the user's perspective, and on the pricing of these agreements, from the resource provider's perspective.

1.1 Problem Statement

Developments in information technology have made collaborative use of distributed resources managed by different autonomous administrative domains feasible for executing large scale applications. Some applications, like severe weather modeling, are deadline driven and need scheduled access to resources. Other science applications, such as earthquake impact modeling, the search for gravitational waves, and astronomical image processing, have large resource requirements and often need co-allocation of different resource types. However, the resources available to these applications are shared between multiple users, and the allocation policy is oriented towards optimizing the throughput or utilization of the resource. As a result, the allocation of resources is not under the explicit control of the user. Thus the performance of the application cannot be predicted in advance, much less optimized. For applications that are structured as task graphs with precedence constraints, additional delays during execution are encountered since the current execution model of the resources is task based. Users are allowed to submit tasks to the resource queue, which then wait in the queue until resources are allocated to them by the local resource management system. Since the start time of a submitted task cannot be predicted in advance, a task from the application cannot be submitted to a resource for execution until all the precedence constraints of the task have been satisfied. Thus the completion time of the application as a whole is affected by this late submission of tasks and the associated wait time in the resource queues.

The resource providers, on the other hand, target utilization or throughput as the metric to optimize. This metric does not consider the fact that the users of the resources have diverse preferences on the timing of the resource allocation and on the utility that they derive from this allocation. Thus more sophisticated methods are required that would allow the resource providers to differentiate between the users, using pricing as the mechanism, and deliver multiple qualities of service.

Performance optimization for applications in such a distributed execution environment is a difficult problem. Sometimes, offline methods such as manual negotiation between resource providers and users are used to provide deterministic resource availability to the application and thus achieve specific performance goals. However, these non-automated methods are not scalable. Application level scheduling [9, 14, 49] in Grids has traditionally relied on prediction services such as the Network Weather Service [60], queue wait time estimators [11, 20], etc. in order to optimize performance. These services use past behavior as an indicator of future performance. However, there is still no explicit control on the part of the user, and the resulting performance depends on the quality of the predictions. Another approach is to use redundant task submissions in order to decrease the time to allocate a resource or to simulate deterministic resource availability.
However, these might lead to under-utilization of resources, resulting in higher costs to the user.

1.2 Approach

The basic approach taken in this thesis is to create an end-to-end framework for provisioning based resource and application management. On the resource side, we investigate heuristics that allow the resources to provide deterministic quality of service to the users. Since the resources are shared between multiple users, it is likely that these users would have diverse preferences on the quality of service desired from the resource. We use economics as the means to provide differentiated quality of service to users. Traditionally, the quality of service has been measured along a single dimension, i.e. the response time from the resource. By introducing pricing as another dimension, we allow the users a choice of different response times based on the prices they are willing to pay.

On the user side, we separate resource allocation from application scheduling. This is different from prior work in this regard, where allocation is a by-product of scheduling [37, 42, 57]. The current application execution model consists of submitting tasks to the resource queue. The allocation of resources to the task is done at a later convenient time, when the local resource manager decides to schedule it. Thus the allocation of resources for each task in the application is done separately. Moreover, this allocation and scheduling is done by the resource manager instead of the user. The approach that we have taken in this thesis is to separate allocation from scheduling. Resource allocation is done explicitly by the user as a first step, by interacting with the resource provider and creating a set of allocation agreements that specify what resources would be allocated to the user, when they would be allocated, and for how long. This allocation step then allows the user to create an application schedule over the allocated resources independent of the resource provider. Thus allocation and scheduling are two separate operations instead of being tightly coupled as in the present case.

As a result of this separation, the expected application performance can be determined with certainty due to the deterministic nature of resource allocation. The users can now select an allocation that optimizes the application performance while minimizing the cost of the allocated resources. Better performance for the application can be achieved since multiple tasks can be executed on the same allocated resource without incurring any additional resource acquisition overhead. Co-allocation of resources from multiple providers or of different resource types can also be achieved without any out-of-band effort. The interaction between the user and the resource provider is now defined by the set of allocation agreements that both sides mutually agree upon. The actual enactment of the application schedule on the allocated resources can be done without any intervention from the resource provider. This also allows the resource providers to increase the utilization of their resources through better planning, without getting into the nitty-gritty of individual task execution with its associated uncertainties.

Given this separation between allocation and scheduling, the question that now arises is what resources should be allocated by the user, when they should be allocated, for how long, and at what price, in order to optimize the application performance and minimize the resource costs.
Given that there might be a multitude of alternative resource providers, the space of allocation agreements that can feasibly execute the application can be extremely large. Thus identifying a suitable set of allocation agreements is a difficult problem. This is the problem that we attempt to address in this thesis.

1.3 Thesis

In this thesis we have developed a model for application level resource provisioning in distributed systems that provides a framework for automated negotiation of resource agreements between the users and the resource providers. The model allows the resource providers to advertise their resource availability with associated costs. Alternatively, the users can query the providers to discover the availability information. The users can then provision resources with the specific characteristics that provide the greatest utility for the application. We have also developed algorithms that operate using this model for application performance optimization. Furthermore, we have also explored resource level algorithms that support this model.

1.4 Research Contributions

The main contribution of this thesis is a framework for resource provisioning in distributed autonomous systems. Specifically:

• We have developed a two phase approach that separates resource allocation from application scheduling.

• We have developed a cost based model for resource provisioning that allows the resource owners to advertise resources and the users to express a trade-off between multiple objectives.

• We have developed algorithms that use pricing as the means to provide differentiated quality of service to users. Moreover, instead of supporting only the provisioned or the best effort quality of service, we show how both can be supported and how they leverage each other to provide differentiated services.

• We have developed user-level algorithms for resource selection that try to optimize the application performance while minimizing the resource costs.

• Using a set of trace-based simulations, we compare the application performance with the provisioned and best effort approaches. We show that the provisioned approach can provide better performance for the application at little or no extra cost.

1.5 Outline

The remainder of this thesis is organized as follows. Related work is presented in Chapter 2. In Chapter 3 we present the slot abstraction and the design of a resource scheduler that generates slots so as to minimally impact the best effort service and advertises these slots. In Chapter 4 we describe a GA based heuristic for searching the advertised set of slots to find a suitable subset to reserve, and evaluate its performance with regard to best effort service. In Chapter 5 we present the design of a resource scheduler that can generate slots based on user queries and that prices these slots based on their impact on the best effort workload. In Chapter 6 we compare the performance of our pricing algorithm with that of a probabilistic virtual reservation system. The corresponding application level provisioning algorithm that can operate with the query model is described and evaluated in Chapter 7. Chapter 8 presents the conclusions.

Chapter 2
Related Work

In this chapter, I summarize related work on Grid resource management, advance reservations, application scheduling, and frameworks for provisioning. I also present comparisons with my work where appropriate.
2.1 Grid Resource Management

This section describes the related work in resource management of large scale parallel and supercomputer type computational resources.

2.1.1 Resource management of parallel computers

With the rise of shared multiprocessing systems, there has been a lot of focus on scheduling such resources in order to maximize the utilization or minimize the turnaround time of the resources. Much of the research has been presented in the proceedings of the annual workshops on job scheduling strategies for parallel processing [7]. A good introduction to the field is presented in [23, 24]. These resources are generally massively parallel supercomputers and operate in a non-interactive environment where tasks are submitted by users to a central queue. Each task specifies the number of required processors and the maximum runtime of the task. The resource management system picks up tasks from this queue and schedules them on the available resources as and when they become available. A first come first served (FCFS) based strategy is used for scheduling tasks from the queue. However, since pure FCFS would result in poor utilization of resources, variants such as aggressive backfilling [41] and conservative backfilling [46] are used to execute tasks out of order in order to increase utilization. This out of order execution of tasks, coupled with inaccurate runtime estimates, leads to a best effort quality of service where the start time of the submitted tasks is not known to the user in advance. Many open source and commercial resource management systems such as Maui [3, 35], PBS [5], LSF [2], and SGE [6] support this model.

The alternative to the queuing based systems described above are planning based systems [33]. These systems plan out the schedule in advance with deterministic start times for tasks, instead of making decisions only when resources become available as in queuing based systems. CCS [33] is an example of such a system. Some systems such as Maui [3] can be configured to operate either as a queuing based system or a planning based system. In this dissertation we use planning as the management methodology for resources. However, we also integrate pricing into the resource availability model and allow resource discovery through advertisements (Chapter 3) or queries (Chapter 5).

2.1.2 Advance Reservations

Advance reservations have been widely proposed for provisioning resources for performance predictability, meeting resource requirements, and providing guaranteed quality of service to applications. A resource model similar to ours is adopted in [52], where a client can probe the possible start times of a task on a resource. However, the reservation is for a single job instead of an application, and a single resource slot has to be reserved. This has been extended in [51] to support co-reservations. However, the reservations are made based on an abstract resource description from the user. In our case, we create the set of reservations based on the resource availability in the Grid and a given application. Additionally, the approach in [51] creates all combinations of the resource slots to find a feasible candidate. This might not be a feasible strategy if there are large numbers of available resource slots.

2.1.3 Economics based resource allocation

Pricing of services in priority based systems received a significant amount of attention during the seventies and eighties [30, 43, 44, 59] and is the motivation for the adaptive pricing model of Chapter 5.
In these studies, a delay cost is associated with each user or job, where the delay cost is defined as the amount of money the user would be willing to spend to get the job finished one time unit earlier. Thus each user or job incurs an implied waiting cost, which is the product of the waiting time for service and the delay cost. Price discrimination has been shown to provide an efficient allocation of resources, minimizing the total expected waiting costs in the system, when demand exceeds capacity and users differ in their delay costs. This is achieved by creating a number of priority classes and setting the price of each priority class based on the expected increase in the waiting costs of lower priority classes due to this class. Each user or job selects a class for service that minimizes the sum of the price paid and the expected waiting cost using that class. In an M/M/1 system, this pricing scheme has also been shown to be incentive compatible in the sense that it makes it optimal for the users to make decisions based on their true delay costs [45].

Delay cost pricing as summarized in the previous paragraph has been the motivation for the adaptive pricing scheme developed in this thesis. However, there are important differences. First, we don't explicitly elicit the delay cost of the best effort users whose jobs get delayed. Instead, the size (number of processors) is used as an implicit delay cost in the pricing algorithm described in Chapter 5. Thus there is a greater penalty for delaying larger jobs than smaller ones. Getting users to specify their actual perceived delay costs in the current framework would require a pricing scheme that is provably incentive compatible and is the subject of future work. Second, the analysis of pricing schemes for dynamic systems [43, 45] is generally performed under steady state conditions, using expected values of arrival and service rates with an assumed distribution. The resulting prices are constant for a priority class. We have done the pricing analysis under a static condition by taking a snapshot of the system and using the currently queued jobs and their characteristics. The reason is that it is difficult to assign a distribution and its parameters that accurately model the job and reservation arrivals and their characteristics. Instead, our pricing algorithm is more responsive to the current state of the resource. For example, the user does not have to pay a premium when the system is empty, encouraging use of an underutilized system, and the premium is significantly high when the resource is overloaded, causing demand to shift to less loaded resources or later start times. Moreover, our pricing scheme does not assume any particular job scheduling algorithm and allows the resource provider to use any scheduling algorithm for the best effort jobs.

2.2 Application Scheduling

There is a large body of research on application scheduling on dedicated systems [42, 56]. The resource provisioning is implicit in that the entire resource is considered provisioned. In Grid computing, due to the non-deterministic nature of the resource availability, prediction services such as the Network Weather Service [60], queue wait time estimators [10], etc. are used to make scheduling decisions for the applications [49]. However, the resulting application performance is highly dependent on the quality of the predictions. Moreover, frequent adaptation is required for countering the dynamic nature of resource availability.
In [62], the authors create a reservation plan for a workflow by making reservations for each task individually. This may not be a feasible approach for large scale workflows containing thousands of fine-grained tasks. The focus in [62] is on increasing the reliability of execution of the workflow, and contention for resources is not considered. A similar strategy is adopted in [54, 58], where the reservation or allocation for each task or activity in the application is done separately by negotiating with the resource providers. The cost of allocation is not considered in [58, 62]. In [54], the authors employ a cost aware resource model similar to ours. However, the goal there is to concurrently maximize the resource and application utility using a centralized resource allocator. In our case, we use a distributed approach where the resource providers and users maximize their own utility.

2.3 Agreement Framework and Protocols

There has been a recent focus on the frameworks [17] and protocols for agreement-based resource management [29]. These provide the underlying plumbing required to instantiate a set of resource agreements for an application and complement the work presented in this thesis. However, there has been little work on the reasoning process for selecting the set of agreements to be created for workflow structured applications.

Chapter 3
Non-delaying Scheduler with Advertisement Model

The basic abstraction used in this dissertation is that of a slot, having a start time, a duration, a number of processors, and a cost. A slot is represented as

    <s, n, d, c, f>                                                  (1)

where s is the start time of the slot, n is the number of processors in the slot, d is the duration of the slot, c is the per unit cost of the slot, f is the fixed cost of the slot, and the total cost of the slot is n x d x c + f.

The cost of the slot is expressed using a multiplicative factor c and a fixed portion f. c represents the per unit charge for the resources on offer, while f represents the fixed overhead or the transactional cost for the creation and allocation of the slot. While we could have consolidated the cost into a single factor, having them separately provides a better semantic interpretation, as will become evident later. We assume without loss of generality that the processors in the slot are available for exclusive use, since the same approach can also be used for fractional portions of CPU [28]. Thus the resource allocation decisions are focused on determining the number of required processors and the timeframe, and need not consider the impact of other users concurrently using the processors in the slot. For the sake of simplicity, we leave out other details from the slot abstraction such as the make and processing speed of the processors, the amount of memory available, etc. In a real implementation, these attributes would also be included. In this dissertation we focus on compute resources such as processors, while the same approach can be used for other resource types such as CPU cycles [28] or portions of network bandwidth [32]. We model a resource as 1..N homogeneous processors. This is the model used in distributed infrastructure projects such as the TeraGrid [15] and the Open Science Grid [4]. The applications can determine the available slots either by having the resource advertise the slots or by querying the resources to determine the availability of specific desired slots and their cost attributes.
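As an illustration only, the following sketch shows one way the slot abstraction of equation (1) could be represented in code, together with its total cost computation. The class and field names are our own and are not part of any scheduler implementation described in this dissertation.

```python
from dataclasses import dataclass

@dataclass
class Slot:
    """A resource slot as in equation (1): <s, n, d, c, f>."""
    start: float       # s: start time of the slot (e.g. hours from now)
    procs: int         # n: number of processors in the slot
    duration: float    # d: duration of the slot (hours)
    unit_cost: float   # c: per unit cost (per processor-hour)
    fixed_cost: float  # f: fixed overhead / transactional cost

    def total_cost(self) -> float:
        # total cost of the slot = n x d x c + f
        return self.procs * self.duration * self.unit_cost + self.fixed_cost

# Example: 4 processors for 2 hours at unit cost 1 with no fixed charge costs
# 4 x 2 x 1 + 0 = 8, the same as a best effort task of the same dimensions.
if __name__ == "__main__":
    print(Slot(start=3.0, procs=4, duration=2.0, unit_cost=1.0, fixed_cost=0.0).total_cost())
```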
In this chapter we present the design of a system where resources advertise the available slots with uniform pricing; in later chapters a query model is used with a variable pricing scheme. The basis for price differentiation between the advertisement and the query model is the impact on the best effort service provided by the resource.

Typical space-shared, batch queuing based resource management systems use algorithms such as First Come First Serve (FCFS) or its variants to schedule the tasks from the resource queue when resources become available. These algorithms are able to chart out the resource schedule based on the running and queued tasks, akin to a Gantt chart, using the number of processors requested and the runtime estimates of these tasks. One scheduling algorithm, called conservative backfilling [46], works by moving a task ahead in the schedule if doing so does not delay the start time of any task that arrived earlier than the task being moved. Aggressive backfilling [41], on the other hand, will move a task ahead if it does not delay the first task in the queue. These charts are created whenever a scheduling decision has to be made, either periodically or dynamically when a task is submitted or finishes execution. Most resource schedulers such as Maui [36], LSF [41], CCS [39], and others [33] support backfilling.

Another way of looking at backfilling is that the scheduler keeps track of the unused resource fragments or holes where an arriving task can be scheduled, potentially out of the FCFS order. These holes can be the leftovers from scheduling all the tasks in the resource queue, as in conservative backfilling, or only the first one, as in aggressive backfilling. These holes are referred to as backfill windows [36] in Maui and gaps in CCS [39]. We propose to advertise the holes where backfilling would normally happen as slots. When conservative backfilling is used, these slots do not impact the existing tasks by definition, and advertising the slots is equivalent to allowing the user to do the backfilling instead of doing it within the scheduler. Thus, when a user provisions a slot after it has been advertised, it is no different to the scheduler than backfilling a task onto the hole that was advertised as the slot. The scheduler gains by increased utilization since the users are provisioning slots that might have remained unutilized. The users gain by being able to create a deterministic application schedule with a defined performance (execution time) and, moreover, by being able to optimize the application performance and cost by choosing suitable slots (more on this in the next section).

In order to illustrate the slot generation mechanism, consider a 5 processor cluster with 4 queued tasks: task A requiring 2 processors for 3 hours, task B requiring 4 processors for 1 hour, task C requiring 1 processor for 2 hours, and task D requiring 3 processors for 2 hours. Figure 1 shows the schedule of the resource using conservative backfilling and the resulting slots that show the aggregate resource availability of the resource from the current time to 8 hours in the future. The figure shows 5 slots, S1 to S5, that can be advertised. Note that there might exist slots that are already provisioned by users (reservations) in the resource schedule; however, they are no different from other scheduled tasks as far as generating the free slot information is concerned.
The lifetime of the advertised slots is determined by events such as the submission of a new task to the resource queue or the completion of a task earlier than its requested runtime, which will cause the resource schedule to change and hence the resulting slots to change too. Thus the slots should be considered advisory by the users.

Figure 1. Schedule of tasks and resulting slots.

The set of slots advertised by the resource provider, periodically or on demand, can be modeled as E, where

    E = {<s_1, n_1, d_1, c_1, f_1, b_1, e_1>, ..., <s_k, n_k, d_k, c_k, f_k, b_k, e_k>}          (2)

<s_i, n_i, d_i, c_i, f_i, b_i, e_i> represents the availability of n_i processors for duration d_i starting at time s_i. c_i and f_i represent the per unit and the fixed cost respectively, and the total cost of the slot is (n_i x d_i x c_i + f_i). We added two more boolean attributes to the slot model defined at the beginning of this chapter: the divisibility attribute b_i and the extensibility attribute e_i. b_i = true implies that the user can provision a part of the slot instead of the whole slot, and b_i = false implies that the slot has to be provisioned in its entirety. The divisibility attribute can be used to control the fragmentation of the resources. e_i, when true, indicates that the duration of the slot can be extended past the specified end time (s_i + d_i), i.e. in case we want to provision resources for tasks with longer runtimes than the specified slot duration. The extensibility attribute can be used to model slots whose end time is not constrained by the start time of a task, e.g. slots S1, S4, and S5 in Figure 1. This attribute can be used to allow provisioning of resources past the delineation time of the slots, e.g. 8 hours from the current time in Figure 1.

The total cost of the slot depends on the amount of capacity provisioned, as allowed by the divisibility and extensibility attributes. Suppose the user or the broker wants to provision n_k processors for duration d_k starting at time s_k when the offered slot is <s_i, n_i, d_i, c_i, f_i, b_i, e_i>. Then Table 1 shows the various possibilities, with associated costs, based on the values of b_i and e_i.

Table 1. Cost of possible subslot allocation.

b_i     e_i     restrictions                                        cost
false   false   n_k = n_i, s_k = s_i, d_k = d_i                     n_i x d_i x c_i + f_i
false   true    n_k = n_i, s_k = s_i, d_k ≥ d_i                     n_i x d_k x c_i + f_i
true    false   n_k ≤ n_i, s_i ≤ s_k, s_k + d_k ≤ s_i + d_i         n_k x d_k x c_i + f_i
true    true    n_k ≤ n_i, s_i ≤ s_k                                n_k x d_k x c_i + f_i
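As a hypothetical illustration of the rules in Table 1, the sketch below checks whether a requested sub-slot is permitted by the divisibility and extensibility attributes of an offered slot and, if so, computes its cost. The function name and the dictionary representation of the offer are our own; the restrictions and cost expressions follow the four rows of the table.

```python
def subslot_cost(offer, n_k, s_k, d_k):
    """Cost of provisioning n_k processors for duration d_k starting at time s_k
    from an offered slot <s_i, n_i, d_i, c_i, f_i, b_i, e_i>, following Table 1.
    `offer` is assumed to be a dict with keys s, n, d, c, f, divisible, extensible.
    Returns None if the request violates the restrictions for the offer."""
    s_i, n_i, d_i = offer["s"], offer["n"], offer["d"]
    c_i, f_i = offer["c"], offer["f"]
    b_i, e_i = offer["divisible"], offer["extensible"]

    if not b_i and not e_i:
        # row 1: the whole slot, exactly as offered
        if n_k == n_i and s_k == s_i and d_k == d_i:
            return n_i * d_i * c_i + f_i
    elif not b_i and e_i:
        # row 2: the whole slot, possibly extended past its end time
        if n_k == n_i and s_k == s_i and d_k >= d_i:
            return n_i * d_k * c_i + f_i
    elif b_i and not e_i:
        # row 3: any sub-rectangle that fits inside the offered slot
        if n_k <= n_i and s_i <= s_k and s_k + d_k <= s_i + d_i:
            return n_k * d_k * c_i + f_i
    else:
        # row 4: divisible and extensible, may also run past the slot's end time
        if n_k <= n_i and s_i <= s_k:
            return n_k * d_k * c_i + f_i
    return None
```

For example, for an offered slot of 4 processors for 2 hours at unit cost 1 and fixed cost 0 that is divisible but not extensible, requesting 2 processors for 1 hour within the slot's window would fall under the third row and cost 2 x 1 x 1 + 0 = 2.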
The set of slots is dynamic and changes based on the arrival of new tasks and the completion of running tasks earlier than their requested runtime. For example, an arriving task requiring 2 processors for 2 hours can be backfilled into slots S1 and S3 in Figure 1, thus modifying these slots, and task A completing after just 1 hour instead of the requested 3 would create another slot having 2 processors for 2 hours in addition to the ones shown in Figure 1. Thus the best approach is to create the slot information dynamically whenever it is required, e.g. periodically in the advertisement mode or on demand from a user in the query mode.

Also, the assertion that the slots do not impact the best effort jobs is only strictly true when the best effort jobs request their actual runtime. Otherwise slots might prevent some jobs from starting at their earliest possible start time. For example, suppose slots S1 and S3 in Figure 1 are provisioned by users and are now deemed reservations. Suppose thereafter task A completes in 2 hours instead of the requested 3. Now task B has to wait until time 3 to get started. If the slots were not provisioned, task B could have started at time 2 when task A completed. However, this is no different than what would have happened if two tasks submitted to the resource queue after task D had been backfilled into the slots S1 and S3; task B would have been delayed in that case too. This is an accepted drawback in order to reap the benefits of backfilling, and moreover it has been reported that only a small percentage of jobs are truly delayed [36].

If we assume that the users provide actual runtime estimates instead of conservative ones, then an efficient implementation is possible that significantly reduces the cost of generating the slot information. The idea is to always maintain the set of free slots and then operate the resource schedule using this set of slots. Whenever a task is submitted to the resource queue, it is backfilled on the current set of free slots using the algorithm described next, and the set of free slots is updated in the process. Thus there is no cost involved in creating the slot information; it is always present in the resource scheduler. The algorithm is shown formally in Figure 2. The algorithm returns the start time of the new task and during its operation modifies the set of slots. The algorithm analysis is as follows.

Operation

The set of slots is sorted by start time in ascending order. The algorithm goes through the slots. There can be two scenarios. In the first scenario, the algorithm encounters a slot that has enough processors and duration to run the task. If the slot has the required number of processors but not the duration and is extensible, we extend the duration of the slot 5 minutes past the required runtime of the task to be scheduled. The task is then scheduled on the slot, and left over slots might be created if the number of processors required by the task is less than the number of processors in the slot and/or the duration of the slot is more than the duration of the task. The runtime of the slot is extended by 5 minutes so that all the processors represented by this slot would be represented by the left over slots. For example, in Figure 3(a) a task is scheduled on a slot and Figure 3(b) shows the left over slots created. The slots and tasks shown in Figure 3 are in two dimensions, where the X dimension represents time and the Y dimension represents processors. The algorithm then terminates.

In the second scenario, the algorithm encounters a slot that has the required duration to run the task or is extensible, but does not have the required number of processors. In this case, it is possible to combine resources with other slots to run this task. The algorithm then goes through the set of slots in reverse order, from the current one to the first one in the list, to find a combination of slots that can run the task. The reason is that we want to schedule the task no later than the start time of the slot under consideration, and hence we consider slots earlier in the list that start earlier than the current slot. If no such combination can be found, the algorithm continues with the next slot in the list of slots. If a combination is found, the task is scheduled on the combined set of slots (after extending them if required), left over slots are created, and the algorithm terminates.
In Figure 3(c), a task is scheduled on two slots, c1 and c2, since none of them alone has the required number of processors to run the task. The left over slots created are shown in Figure 3(d). Note that the algorithm is basically stacking the slots vertically to get the required number of processors, i.e. the processors are spatially distributed but are temporally congruent. This is because applications are designed using standards like MPI to utilize spatially distributed processors. A temporal distribution, on the other hand, similar to horizontal stacking of slots, might not work because the processors constituting the slots might be different, and so it would be difficult to resume a running task on a different set of processors unless checkpointing is supported.

Termination

The algorithm always terminates with a schedule for the task at the earliest possible time. First, in order to guarantee that the task always gets scheduled, we deem those slots whose finish time is not constrained by the start time of a task as extensible, as described earlier in this chapter. Their duration can be extended by the algorithm in Figure 2 if required to match the runtime requirements of the task being scheduled. Now consider the maximum finish time of any task in the current schedule, for example, time 6 in Figure 1. At this time all the processors must be represented by extensible slots. The task can certainly be scheduled at this time, since all the processors are available at that time and the duration of the task is inconsequential due to the extensible nature of the slots. If it were possible to schedule the task any earlier on one or more slots, the algorithm would have examined that possibility already. Thus the algorithm always returns with a schedule for the task, and this schedule provides the earliest possible start time for the task. Another fact to notice is that the task being scheduled does not delay any preexisting task, because preexisting tasks are already scheduled and their start times cannot be affected by the current task. This, along with the previous claim, ensures that the algorithm in Figure 2 effectively does conservative backfilling for the resource.

Figure 2. Scheduling a task on a set of slots.

Input: A task T requiring T_n processors for T_d duration.
Input: A set of slots E = {S_1, ..., S_k} sorted by start times.
Output: Start time of task T, modified set of slots E'.

For each slot S_i in E do
    IF S_i has at least T_n processors AND (S_i has T_d duration OR is extensible)
        Extend the duration of S_i to slightly past T_d if required
        Delete slot S_i from E and create slots from the left over portions of S_i, if any, and add them to E in the proper place
        Return start time of S_i
    ELSE IF S_i is extensible OR has duration ≥ T_d
        starttime = start time of slot S_i
        Let Δ = difference between the processor count of S_i and T_n
        Z = {S_i}
        For each slot S_j in E from the predecessor of S_i back to the first slot in E
            IF S_j is extensible OR active from starttime to (starttime + T_d)
                Reduce Δ by the processor count of S_j
                Add S_j to the set Z
                IF Δ has become zero or less
                    Extend the slots in Z if required so that they are active from starttime to 5 minutes past (starttime + T_d)
                    Delete the slots in Z from E, create left over slots after removing T_n processors for T_d duration at starttime from the combined resource pool of the slots in Z, and add these left over slots in the proper place in E
                    Return starttime
                EndIF
            EndIF
        EndFor
    EndIF
EndFor

In comparison to this algorithm, the algorithm for backfilling tasks in the Maui scheduler operates slightly differently [36]. There, multiple slots might represent the same underlying resource capacity and thus are not independent. As a result, a task can be scheduled on only one slot. However, both algorithms would achieve the same effect.

Figure 3. Left over slots after scheduling a task on one or more slots.

Complexity

The complexity of the algorithm is O(S^2), where S is the number of slots in the input list. This can be understood from the fact that we go through the list of slots once, and for each slot we might do a backward traversal of the list. The overhead of creating left over slots is linear in the number of slots S, since for each slot there can be at most 3 left over slots, as shown for slot c1 in Figure 3(c) and (d). Also, this overhead is incurred only once, when the final combination of slots has been found. Additionally, the algorithm does the scheduling and maintains the list of slots at the same time. The operation of finding the current set of slots costs nothing, since it just returns the set of current slots.

An important point to note is that this approach works because we assume that the tasks state their actual runtimes, and hence once a task is submitted and scheduled, that schedule will not change because of some other task finishing earlier than its requested duration. Any planning based system that tries to provide deterministic service to its users without sacrificing utilization, such as CCS [39], requires such tight estimates of the task runtimes.

The multiplicative cost c_i of the slots generated by the scheduler is set to 1 and the fixed cost f_i is set to 0. The reason for this pricing is that the slots do not delay any currently scheduled tasks and hence should be charged at the same rate as the tasks submitted to the resource queue. The best effort charge for a task requiring n processors for duration d is assumed to be (n x d), implying a multiplicative cost factor of one and no other fixed cost. Thus a task and a slot with the same dimensions should cost the same. The resource provider gains by being able to increase the utilization of the resource by selling these slots, which represent the unused holes in the current schedule.
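Returning to the algorithm of Figure 2, the sketch below renders only its first scenario, in which a single slot can host the task by itself: the task is placed at the slot's start time, the slot is extended slightly if it is extensible and too short, and the unused portions become left over slots. It is an illustrative simplification under our own data representation; it omits the second scenario, in which slots are stacked vertically to obtain enough processors, as well as the cost, divisibility, and extensibility bookkeeping of the full model.

```python
from dataclasses import dataclass
from typing import List, Optional

EXTRA = 5.0 / 60.0  # the 5 minute extension used by the algorithm, in hours

@dataclass
class FreeSlot:
    start: float      # start time (hours)
    procs: int        # number of processors
    duration: float   # duration (hours)
    extensible: bool  # whether the slot may be extended past its end time

def schedule_on_single_slot(slots: List[FreeSlot], t_n: int, t_d: float) -> Optional[float]:
    """Scenario 1 of Figure 2: find the earliest slot that alone can host a task
    needing t_n processors for t_d hours, replace it with its left over pieces,
    and return the task's start time. Returns None if no single slot suffices
    (the full algorithm would then try to combine slots)."""
    slots.sort(key=lambda s: s.start)
    for i, s in enumerate(slots):
        if s.procs < t_n:
            continue
        if s.duration < t_d and not s.extensible:
            continue
        # extend an extensible slot slightly past the task runtime if required
        dur = s.duration if s.duration >= t_d else t_d + EXTRA
        leftovers = []
        if s.procs > t_n:
            # processors not used by the task stay free for the whole slot
            leftovers.append(FreeSlot(s.start, s.procs - t_n, dur, s.extensible))
        if dur > t_d:
            # the task's processors become free again once the task finishes
            leftovers.append(FreeSlot(s.start + t_d, t_n, dur - t_d, s.extensible))
        slots[i:i + 1] = leftovers  # replace the used slot by its left over pieces
        slots.sort(key=lambda s: s.start)
        return s.start
    return None
```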
Moreover, the slots whose finish time is constrained by the start time of a task are categorized as indivisible (b_i = false) and non-extensible (e_i = false) by the resource. The reason is that these slots represent constrained holes and any sub-allocation of these slots would lead to higher fragmentation. The extensible slots, as mentioned earlier, are categorized as divisible (b_i = true) to allow the users flexibility in planning their resource usage, and naturally these slots are also termed extensible (e_i = true).

3.1 Evaluation

In order to evaluate the empirical performance of the algorithm in Figure 2 and the characterization of the generated slots, we simulated two Grid sites representing a 430 node and a 128 node cluster. The workload simulated on these clusters was based on logs from the 512 node IBM SP2 at the Cornell Theory Center (CTC) (430 nodes operational) and the 128 node IBM SP2 at the San Diego Supercomputer Center (SDSC), obtained from the Parallel Workloads Archive [22], containing 79285 and 54036 jobs respectively. The simulator was a modified version of the GridSim simulator [12] with extensions for parallel job scheduling and conservative backfilling. The entire simulation for the CTC and SDSC clusters took 75 and 47 seconds respectively on a dual processor 2 GHz Pentium II machine, and an individual run of the scheduling algorithm never took more than a millisecond.

We instrumented the simulation so that the resources periodically advertise the set of available slots. When the slots were advertised, we recorded their characteristics and other information. Figure 4 shows the frequency distribution of the number of slots in the advertised set. The X axis is divided into ranges such as 0-1, 2-4, 5-7, 8-10, etc. For each range, the Y axis shows the percentage of times when the number of advertised slots fell into the range. The average number of advertised slots for the CTC cluster was 27 whereas for the SDSC cluster it was 14. The small number of advertised slots implies that the complexity of the scheduling algorithm is minimal and the slot selection problem for the users is manageable.

Figure 4. Frequency distribution of the number of available slots.

Figure 5 shows how the number of advertised slots varies with the workload of the resource. For this, we also observed the number of tasks in the system (running and queued) whenever the number of advertised slots was recorded. As the figure shows, the number of advertised slots appears to be positively correlated with the number of tasks in the system. This can be understood since the slots are the leftovers after scheduling the tasks in the system.

Figure 5. Correlation of advertised slots and number of tasks in the system.

Figure 6 shows the distribution of the start time and the duration of the slots. The start time is relative to the time when the slots are advertised. The start times and durations of the slots from the SDSC cluster have more variation than the slots from the CTC cluster. This can be explained by the lower utilization of the CTC cluster (63%) as compared to the SDSC cluster (70%).
Since higher utilization implies longer wait times for the tasks, the start times of the slots are also later, since slots and tasks are complementary. Figure 6 also shows the existence of a particular set of slots whose start time and duration add up to 48 hours. This set of slots gives the appearance of a line in the graph. This is due to the fact that the resource advertises its resource availability from the current time to 48 hours in the future, or to the maximum finish time of any task in the current schedule plus 5 minutes, whichever is larger. So, for all the extensible slots, e.g. S1, S4, and S5 in Figure 1, the sum of the start time and duration is 48 hours. These edge slots create the appearance of the line. Due to the extensibility and divisibility of these edge slots, it does not matter for how long into the future the slots are advertised. If the user wants to provision a slot past the advertised time period, the extensibility of the edge slots would allow that.

Figure 6. Distribution of start and runtime of slots.

Figure 7 shows the frequency distribution of the processor count of the slots. The ranges on the X axis are 1, 2-5, 6-9, 10-13, etc. As the figure shows, a large percentage of slots have only a few processors (1-10), which is the result of fragmentation from task scheduling. The processor count has a multi-modal distribution with peaks at powers of 2, i.e. 2, 8, 16, 32, 64, etc. This is due to the fact that the jobs in the workload of the resources often have power-of-2 processor counts, which induces a similar distribution on the slot processor counts.

Figure 7. Frequency distribution of the processor count of slots.

We also experimented with using the inaccurate runtime estimates as recorded in the workload logs. Previously, the set of slots would only change at the arrival of a new task. With inaccurate runtime estimates, in addition, the set would also change at the completion of a task. This is because rescheduling might need to be done in order to utilize the resources made available by a task completing before its requested run time. As the schedule changes, the set of slots changes too. In addition to making the set of free slots more dynamic, inaccurate runtime estimates also quantitatively affect them. Figure 8 shows the distribution of the slot start times and runtimes. While the slot runtimes are not significantly affected, the start times of the slots are now more widely distributed. The reason is that the resource schedule is more widely spread out, as it is based on the runtime estimates of the tasks, which can be very conservative. According to a study [16], around half of the jobs use less than 20% of their estimated runtime. Thus the slots start later as compared to the case where tasks report their accurate runtime estimates.

Figure 8. Distribution of start time and runtime of slots with inaccurate estimates.
The values of the multiplicative and fixed cost factors of slots, i.e. c_i and f_i, used for the experiments in this section are just an example. The model only specifies these parameters but does not prescribe what their values should be. The resource provider is free to use any pricing algorithm internally to determine the value of these parameters. For example, daytime slots can be more expensive than nighttime slots. Similarly, we have used only two possible slot types from those shown in Table 1. It is entirely possible that the resource provider may choose to decide the type of slots, i.e. whether divisible or not, etc., based on internal policy considerations. It would be difficult to experimentally evaluate all possible combinations. Instead, we have tried to present an example of how an operational Grid resource can use the slot model presented in this and the previous section.

In this chapter, we described how queuing based batch scheduled resource management systems can advertise available resources to the users in the form of slots while minimally affecting the execution of tasks already submitted to the resource queue. In the next chapter, we describe how the users can select among the advertised slots in order to optimize their application performance while minimizing the allocation costs.

Chapter 4
Application Provisioning using the Slot Advertisement Model

In the previous chapter we discussed resource management techniques that can advertise free slots to the users. Users can provision these slots at the specified prices, subject to any admission control policy at the resource side and slot lifetime issues. In this chapter, we focus attention on how task graph structured applications can use the slot advertisement model of the previous chapter to create a resource plan for the application in order to optimize performance while minimizing costs.

4.1 Application Model

In this dissertation, we assume that an application can be represented as a set of tasks with dependencies, also known as a workflow or a DAG (directed acyclic graph). We note that our techniques can be applied to other application models as well; however, since many applications can be structured as task graphs [19, 38, 57], we use this model. Moreover, due to the presence of dependencies between the application tasks, there are timing requirements regarding when resources need to be available for the tasks. Thus, such applications can benefit most from having access to deterministic resources. However, it also makes the task of identifying the suitable slots for the application more challenging. Within this chapter, we use the terms application and workflow interchangeably.

The task runtimes on each of the computational resources are assumed to be known. The runtimes could be estimated analytically using component performance modeling [42] or empirically using historical information. The amount of data transferred between each parent and child task in the application is assumed to be known. If the parent and child tasks are executed on the same resource, the data transfer time is assumed to be zero. Otherwise, the data transfer time can be computed using bandwidth and latency information. Without any loss of generality we can also assume a single entry task, i.e. a task without any parent, and a single exit task, i.e. a task without any child. The entry task is denoted as n_entry and the exit task is denoted as n_exit.
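To make the assumed application model concrete, the following is a minimal sketch of the task-graph data it requires; the field and function names are illustrative and not taken from the dissertation.

    from dataclasses import dataclass, field
    from typing import Dict, List

    @dataclass
    class Task:
        name: str
        runtime: Dict[str, float]                       # known runtime on each resource
        parents: List[str] = field(default_factory=list)

    def transfer_time(data_gb: float, bandwidth_gbps: float,
                      latency_s: float, same_resource: bool) -> float:
        """Parent-to-child data transfer time in seconds; zero when co-located."""
        if same_resource:
            return 0.0
        return data_gb / bandwidth_gbps + latency_s

A workflow is then simply a set of such tasks plus the data volumes on each parent-child edge, with n_entry and n_exit added if the graph has multiple sources or sinks.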
4.2 Resource Provisioning Problem

We assume the system to be composed of r = 1,…,R resource providers. The set of slots advertised by resource provider r is denoted by E_r, as defined in (2). The global set of available slots in the system is denoted by Â, defined as

Â = ∪_{r=1,…,R} E_r   (3)

The goal of resource provisioning is to identify a set of slots â, derived from the global set Â, such that the application makespan and the total cost of the slots in â are minimized. This set â is called the resource plan. If all the slots in Â are non-divisible and non-extensible, then â is a subset of Â. However, since we can divide a slot in Â or extend it (when allowed by the slot attributes described in the previous chapter) before including it in â, we use the terminology that â is derived from Â, denoted by â ⊑ Â.

The total cost of the slots in â is called the allocation cost, denoted by AC(â), and is defined as

AC(â) = Σ_{<s_i, n_i, d_i, c_i, f_i> ∈ â} (n_i × c_i × d_i + f_i)   (4)

Minimizing the allocation cost encourages the efficient utilization of provisioned resources since any unused capacity represents an unnecessary addition to the allocation cost. The application makespan on â is termed the scheduling cost of the plan, SC(â). The scheduling cost is designed to reflect the application performance on the resource plan being considered. When dealing with application models other than workflows, or with performance metrics other than makespan, only the evaluation of the scheduling cost needs to be altered for our application provisioning approach to be applicable.

SC(â) = FinishTime(n_exit)   (5)

In order to determine the makespan of the application over the resource plan â, a scheduling algorithm is required. Since the resource availability and the application task runtimes are known deterministically, the schedule of the application tasks over â can be computed using any DAG scheduling heuristic. We use Heterogeneous Earliest Finish Time (HEFT) [56] based scheduling, which has been shown to be a good scheduling heuristic [56, 57]. A resource plan is called infeasible if the application cannot be completely scheduled onto it, and its scheduling cost is then considered infinite. The resource provisioning problem faced by the resource broker or the user is the following multi-objective optimization problem:

min_{â ⊑ Â} [ AC(â), SC(â) ]   (6)

where both the allocation cost AC(â) and the scheduling cost SC(â) are sought to be minimized, subject to any budget and/or deadline constraints. However, it is generally not possible to simultaneously minimize both the allocation and the scheduling cost since they are conflicting in nature. For example, consider the resource plan where â = {}, i.e. it does not include any slots. Obviously, it has the minimum allocation cost, 0, but there are no resources to schedule the application on and hence the scheduling cost is infinite. When considering only feasible resource plans, it may be possible to find one that minimizes both costs depending on the problem instance; however, this is not the general case. Next, we describe the heuristic for obtaining a solution to (6) using the concept of Pareto-optimality.
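The allocation cost in (4) is a direct sum over the slot tuples in the plan. A minimal sketch, assuming each slot is represented as a (s, n, d, c, f) tuple:

    def allocation_cost(plan):
        """AC(a) per (4): sum over slots <s_i, n_i, d_i, c_i, f_i> of n_i*c_i*d_i + f_i."""
        return sum(n * c * d + f for (s, n, d, c, f) in plan)

    # Example: a 4-processor, 2-hour slot at unit cost plus a 1-processor,
    # 3-hour slot carrying a fixed surcharge of 5 cost units.
    print(allocation_cost([(0.0, 4, 2.0, 1.0, 0.0), (2.0, 1, 3.0, 1.0, 5.0)]))  # 16.0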
4.3 Pareto Optimality

While two different solutions can be directly compared to each other based on the value of a single objective, it is not possible to do so in the case of multiple objectives. For example, a particular resource plan â_1 may have a lower allocation cost and a higher scheduling cost than another resource plan â_2. Thus â_1 and â_2 cannot be directly compared with each other. However, if â_1 or â_2 has a lower allocation cost and a lower scheduling cost than another resource plan â_3, then both â_1 and â_2 are superior to â_3. We use the concept of domination from [18] in order to compare two resource plans in the context of our optimization problem. A resource plan â_i is said to dominate another resource plan â_j if both conditions 1 and 2 are true:

1. The allocation cost and scheduling cost of â_i are no worse than those of â_j, i.e. AC(â_i) ≤ AC(â_j) and SC(â_i) ≤ SC(â_j).
2. The solution â_i is strictly better than â_j in at least one objective, i.e. either AC(â_i) < AC(â_j) or SC(â_i) < SC(â_j), or both.

If any of the above conditions is violated, â_i does not dominate â_j. Figure 9 shows the allocation cost and scheduling cost of five hypothetical solutions, each representing a different resource plan. As can be seen, none of the solutions is optimal with respect to both objectives. Solution â_4 is dominated by â_2 and solution â_5 is dominated by both â_2 and â_3. Solutions â_1, â_2, and â_3 do not dominate each other and are not dominated by any other solution. Given the entire set of solutions to a multi-objective problem, the set of solutions that are not dominated by any other solution in the entire set is known as the Pareto-optimal set. For example, in Figure 9, if the entire solution set were composed of the given five solutions, then â_1, â_2, and â_3 would form the Pareto-optimal set. Thus in a multi-objective optimization problem, when there are conflicting objectives, there is no single optimal solution. Rather, there are multiple optimal solutions. Without any further information, no one solution from the Pareto-optimal set can be said to be better than any other.

Figure 9. The solutions â_1, â_2, and â_3 dominate the solutions â_4 and â_5.
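The domination test defined by conditions 1 and 2 translates directly into a small helper; a sketch, assuming the two plans' costs have already been evaluated:

    def dominates(ac_i: float, sc_i: float, ac_j: float, sc_j: float) -> bool:
        """True if plan i dominates plan j: no worse in both objectives (condition 1)
        and strictly better in at least one objective (condition 2)."""
        no_worse = ac_i <= ac_j and sc_i <= sc_j
        strictly_better = ac_i < ac_j or sc_i < sc_j
        return no_worse and strictly_better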
4.4 Two approaches to multi-objective optimization

Figure 10 shows schematically two alternate approaches to solving a multi-objective optimization problem such as ours.

Figure 10. Two approaches to multi-objective optimization: (a) the ideal case, in which a multi-objective optimizer produces the Pareto-optimal set and higher-level information is used to choose one solution; (b) single-objective optimization, in which higher-level information in the form of preference weights w_1, w_2 turns the problem into min w_1 AC(â) + w_2 SC(â), which a single-objective optimizer solves for one optimal solution.

In the first case, called ideal, in Figure 10(a), an attempt is first made to find all the optimal solutions by considering all the objectives to be equally important. This set is the Pareto-optimal set as described in the previous section. Then high-level information can be used to select a particular solution from the set. This high-level information is based on the user preferences, deadline, budget and other concerns. While being quantitative in nature, this information will also be based on user experience. The second approach, based on single-objective optimization, in Figure 10(b), first creates a composite objective function as the weighted sum of the objectives, where the weights are selected by the user to reflect his preferences for the objectives. This converts the problem into a single-objective optimization problem which can be solved using any search based heuristic.

In this chapter, we use the ideal method over the single-objective optimization method. While the latter is simpler, it is more subjective than the ideal method: finding a relative weight vector not only requires knowledge about the range of values of the objectives, but also other qualitative, non-technical and experience-driven information. Without any information about the solution space, the range of values is impossible to predetermine. Additionally, without any knowledge of the trade-offs that can exist between the various solutions, it is very difficult to form a reliable and accurate weight vector. While the ideal method also ultimately uses higher-level information in the final stage, there is a fundamental difference in the use of this information in the two approaches. In the ideal case, it is used to evaluate and compare the obtained solutions. This requires less subjectivity than using the high-level information in order to search for a new solution.

4.5 Multi-Objective Genetic Algorithm (MOGA)

There has been a tremendous amount of work in the last decade on using Genetic Algorithms (GA) for multi-objective optimization where the goal is to find the Pareto-optimal set [18]. The advantage of using a GA is that it operates on a population of solutions and hence is uniquely suited to finding the Pareto-optimal set of solutions, in contrast to other methods that operate on a single solution at a time. In this dissertation, we use a multi-objective GA (MOGA) [25] because of its simplicity and the fact that it operates in the objective variable space rather than the decision variable space (explained below). The detailed working of MOGA is described in [18] and a high-level description is shown in Figure 11.

MOGA, like any other GA, operates using a population of a certain number of solutions. Each solution is a resource plan. The initial population is created by randomly picking numbers between 0 and (2^n - 1) and creating resource plans based on the binary representations of the numbers. The allocation and scheduling cost of each solution in the population is evaluated and the non-dominated members of this initial population are added to a Pareto-optimal set. MOGA goes through a specified number of iterations, where new populations are generated using the normal operators of selection, crossover and mutation, and the Pareto-optimal set is updated.

Figure 11. Working of MOGA.

Solution Representation

A resource plan in MOGA is encoded as an n bit binary number where n = |Â|. Each bit in the number represents a slot in Â. If the bit is 1, the slot is included in the plan, otherwise it is not.
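This bit-string encoding is easy to illustrate; the sketch below (names are illustrative, not from the dissertation) creates a random individual and decodes it back into a set of slots:

    import random

    def random_individual(num_slots: int) -> list:
        """One bit per slot in the global set; equivalent to picking a random
        number in [0, 2^n - 1] and taking its binary representation."""
        return [random.randint(0, 1) for _ in range(num_slots)]

    def decode(bits: list, global_slots: list) -> list:
        """Bit i equal to 1 means slot i of the global set is part of the plan."""
        return [slot for bit, slot in zip(bits, global_slots) if bit == 1]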
Fitness Evaluation

The fact that MOGA operates in the objective variable space implies that it uses the allocation cost and scheduling cost values of resource plans for determining their fitness, rather than the absolute value of the binary number representing the plan, which does not represent any meaningful quantity. The scheduling cost of a resource plan is calculated as the makespan of the application as determined by creating a schedule of the application on the slots in the resource plan. The scheduling is done by first assigning a rank to each task in the workflow by doing a bottom-up traversal. The child tasks are assigned ranks before the parent tasks using (7):

Rank(v_i) = w_i + max_{v_j ∈ succ(v_i)} (c_ij + Rank(v_j))   (7)

where w_i is the average runtime of the task v_i on all the resources in the system and c_ij is the average data transfer time between tasks v_i and v_j using the average bandwidth and latency between the resources in the system. This task ranking system is derived from the Heterogeneous Earliest Finish Time (HEFT) [56] algorithm. The tasks are then sorted by rank in a non-increasing order, which also ensures a topological sort. The slots from different resources are put in separate bins. The tasks are considered one by one in the sorted order. The algorithm in Figure 2 is used to calculate the earliest possible start time of a task in each bin. The only minor change to the algorithm is that the tasks now have an earliest feasible start time based on the scheduled finish times of the parent tasks and any data transfer requirements. The task is scheduled on the bin that provides the earliest possible start time and the slots in the bin are modified accordingly. When the entire workflow is scheduled, the finish time of the last task (n_exit) in the workflow defines the scheduling cost, and the allocation cost is derived by considering the difference between the initial and final set of slots. If a non-divisible slot that was part of the initial set is no longer part of the final set, it implies that one or more tasks were scheduled on the slot and hence the total cost of the slot is included in the allocation cost. For divisible slots, the accounting is done based only on the used parts of the slot.
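The upward rank in (7) is a simple memoized recursion over the task graph. A minimal sketch, under the assumption that avg_runtime holds the averaged w_i values, avg_transfer the averaged c_ij values, and succ maps each task to its children:

    from functools import lru_cache

    def compute_ranks(succ: dict, avg_runtime: dict, avg_transfer: dict) -> dict:
        """Upward ranks per (7): Rank(v) = w_v + max over children of (c_vj + Rank(child))."""
        @lru_cache(maxsize=None)
        def rank(v):
            children = succ.get(v, [])
            if not children:                      # exit task: rank is its own runtime
                return avg_runtime[v]
            return avg_runtime[v] + max(avg_transfer[(v, j)] + rank(j) for j in children)

        return {v: rank(v) for v in avg_runtime}

Sorting the tasks by this rank in non-increasing order yields the HEFT-style scheduling order used above.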
The scheduling cost and the allocation cost by themselves are not enough to determine the fitness of a solution in the population since this is a multi-objective problem. We need to compare the solution with other solutions in the current population using the concept of domination in order to determine its fitness. A solution i is assigned a rank equal to one plus the number of solutions n_i in the current population that dominate i:

r_i = 1 + n_i   (8)

Thus the non-dominated solutions are assigned a rank equal to 1. The count of the number of solutions with rank r_i is incremented by one, i.e.

μ(r_i) = μ(r_i) + 1   (9)

Ranks are assigned to all the solutions in the population. Thereafter, each solution i is assigned a fitness F_i:

F_i = N − Σ_{k=1}^{r_i − 1} μ(k) − 0.5 (μ(r_i) − 1)   (10)

where N is the population size. We also want to find a well distributed set of solutions that reflects a wide range of trade-offs between the objectives. Hence aggregation of solutions in niches is discouraged. For each solution in the population, a niche count nc_i with other solutions of the same rank in the population is computed:

nc_i = Σ_{j=1}^{μ(r_i)} Sh(d_ij)   (11)

where Sh(d_ij) = 1 − d_ij / σ_share if d_ij ≤ σ_share, and 0 otherwise, and

d_ij = sqrt( ((ac_i − ac_j) / (ac_max − ac_min))^2 + ((sc_i − sc_j) / (sc_max − sc_min))^2 )

Here σ_share is a parameter of MOGA; we have used a σ_share value of 0.041. ac_i and sc_i refer to the allocation and scheduling cost of solution i respectively, and ac_max (ac_min) and sc_max (sc_min) refer to the maximum (minimum) values of the allocation and scheduling costs in the population. Then a shared fitness

F'_i = F_i / nc_i   (12)

is assigned to the solution. To preserve the same average fitness, the shared fitness is scaled:

F'_i ← F'_i × F_i × μ(r_i) / Σ_{k=1}^{μ(r_i)} F'_k   (13)

The goal of this fitness calculation is to find a diverse spread of solutions. By reducing the fitness of solutions that are close together, we place more importance on the underrepresented areas of the Pareto set. More details on the MOGA algorithm, along with a worked out example, can be found in [18].

Selection

After the fitness is assigned to each member of the population, a proportionate selection method is applied to create a mating pool, the size of which is the same as the size of the population. N draws are done to select members of the mating pool. In each draw, a member of the population is copied to the mating pool. In a draw, the probability p_i of selecting the i-th solution with fitness F_i is

p_i = F_i / Σ_{j=1}^{N} F_j   (14)

Thus solutions with high fitness are likely to have more copies in the mating pool.

Crossover and Mutation

After the mating pool is created, two solutions are randomly selected and removed from the pool and a two-point crossover operation is applied between the solutions to create two offspring. Mutation is then applied to the new offspring with a mutation probability of 1/n, where n is the length of the binary string representation of the solution. The offspring are then added to the new population. The process is repeated till the mating pool is empty.

Updating the Pareto-optimal set

Once the new population is created, each member of the population is compared with the current Pareto-optimal set. If the member is not dominated by any member of the Pareto set, the member is added to the Pareto set. Additionally, the members in the current Pareto set that are dominated by this new addition are removed from the set. Note that due to the transitive nature of domination, it is not possible for a member of the new population to dominate some member of the Pareto set and in turn be dominated by some other member of the set.

Termination

The process is repeated for a specified number of iterations, after which the current Pareto-optimal set is returned as the solution. Note that the Pareto-optimal solutions created by MOGA are only an approximation to the real set of Pareto-optimal solutions since MOGA, like any other GA, is a stochastic algorithm and cannot guarantee optimality.
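The Pareto-set update used in each MOGA iteration, admitting a new member only if no archive member dominates it and evicting anything it dominates, can be sketched in a few lines, reusing the dominates helper from the earlier example:

    def update_pareto_set(pareto: list, candidate: tuple) -> list:
        """pareto holds (allocation_cost, scheduling_cost, plan) tuples; candidate likewise."""
        ac_c, sc_c, _ = candidate
        if any(dominates(ac, sc, ac_c, sc_c) for ac, sc, _ in pareto):
            return pareto                       # candidate is dominated; archive unchanged
        kept = [m for m in pareto if not dominates(ac_c, sc_c, m[0], m[1])]
        kept.append(candidate)                  # evict dominated members, then add candidate
        return kept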
4.6 Using High-level Information

The Pareto-optimal set (PO) generated by MOGA contains multiple solutions. However, since we only require a single resource plan, the approach that we have taken in this dissertation is to use high-level information provided by the user in the form of a trade-off factor α ∈ [0,1], as discussed in Section 4.4. This trade-off factor is used in the following normalized composite objective function to select a solution from the set:

min_{â ∈ PO} { α × AC_nr(â) + (1 − α) × SC_nr(â) }   (15)

where

AC_nr(â) = (AC(â) − AC_min) / (AC_max − AC_min),   SC_nr(â) = (SC(â) − SC_min) / (SC_max − SC_min)

AC_min = min_{â ∈ PO} AC(â),   AC_max = max_{â ∈ PO} AC(â),   SC_min = min_{â ∈ PO} SC(â),   SC_max = max_{â ∈ PO} SC(â)

The trade-off factor α represents the user's preference between the allocation and the scheduling cost. Once the Pareto-optimal set has been created, it can be used to select different solutions corresponding to different trade-off factors without re-computing the set. A linear combination of the costs as in (15) is not the only possible way of identifying the best resource plan. However, the advantage of using (15) is that the costs are normalized using the minimum and maximum values of the allocation and scheduling costs from the Pareto-optimal set. Normalization is important because we do not know the range of values that the allocation and scheduling cost can take for a given problem instance, and without normalization one cost can dominate the other. Moreover, any solution to the problem that seeks to minimize both the allocation and scheduling costs must be a Pareto-optimal solution. The reason is that if the selected solution is not part of the Pareto-optimal set, we can find a solution from the Pareto-optimal set that dominates it and provides a better minimization of the allocation and/or scheduling cost. Thus, there is great value in finding the Pareto-optimal set. Selection of a particular solution from the set can be done using (15) or any other formulation.
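Selecting a plan with (15) is a direct translation of the normalization above. A minimal sketch, assuming the Pareto set holds (allocation cost, scheduling cost, plan) tuples:

    def select_plan(pareto: list, alpha: float):
        """Return the Pareto-set member minimizing alpha*AC_nr + (1-alpha)*SC_nr."""
        acs = [ac for ac, _, _ in pareto]
        scs = [sc for _, sc, _ in pareto]

        def normalized(value: float, lo: float, hi: float) -> float:
            return 0.0 if hi == lo else (value - lo) / (hi - lo)

        def score(entry) -> float:
            ac, sc, _ = entry
            return (alpha * normalized(ac, min(acs), max(acs))
                    + (1 - alpha) * normalized(sc, min(scs), max(scs)))

        return min(pareto, key=score)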
4.7 Evaluation

In this section, we compare the application performance (makespan) and allocation cost when the application is executed using resource provisioning to when it is executed using the best effort service. We use trace based simulations of the two resource clusters (the 430 node CTC cluster and the 128 node SDSC cluster) as in the previous chapter. The resources operate under the same assumption that submitted best effort tasks request their actual running time, and the resource scheduling policy is conservative backfilling. Each resource was simulated separately. The resources operate using the algorithm in Figure 2 (Chapter 3) and advertise the set of available slots. The slots are re-advertised whenever a new task is submitted to the resource and consequently changes the set of available slots. The prices and other attributes of the slots are as mentioned in Chapter 3.

For the experiments, a synthetic workflow is generated using a parameterized task graph generator. The application (Figure 12) consists of 100 tasks with an average runtime of 1000 seconds. The workflow consists of 10 levels (the depth of the workflow) with 10 tasks at each level on the average. The data transfer time between the tasks is zero since all tasks are executed on the same resource. The average number of processors per task is 22 for the CTC cluster and 6 for the SDSC cluster, ensuring that the maximum parallelism of the workflow is about the same as the maximum number of processors on the resource. However, the critical path length of the workflow, i.e. the optimal minimal completion time, is the same for both clusters since it is independent of the task sizes.

Figure 12. Workflow used for evaluation.

Executing an application using best effort scheduling was done by simulating a just-in-time scheduling policy. First, rank values are assigned to each task in the application in the HEFT order using (7). Then tasks are sorted by rank in non-increasing order and tasks from this list are submitted one by one in this order to the resource queue when they become ready for execution, i.e. when all their parent tasks have finished execution. The time difference between the time when the first task in the application was submitted to the resource queue and the time when the last task finished execution is taken as the best effort makespan of the application. The allocation cost of best effort execution (AC_best-effort) is the sum of the runtime of each task multiplied by the number of processors required by the task, over all the tasks in the application:

AC_best-effort = Σ_{v_i ∈ V} (v_it × v_ip)   (16)

where V is the set of all tasks in the application (v_1,…,v_n), v_it is the runtime of task v_i, and v_ip is the number of processors required by task v_i. This allocation cost reflects the fact that in best effort execution, as per current practice, one only pays for the resources used by the task during its execution. Thus the allocation cost for an application is constant for best effort execution and is a lower bound on the allocation cost of the provisioned approach.

When the application is submitted, a resource plan is generated from the current set of advertised slots from the resource using MOGA and a given trade-off factor α. The population size in MOGA was 100 while the number of iterations was limited to 10. We compute and record the makespan (scheduling cost) and the allocation cost of this resource plan. The application is then executed using best effort. The allocation cost of the best effort approach is always constant for an application, as mentioned earlier. This process is repeated 50 times by submitting the application at different times during the simulation run, and the average and other metrics are calculated. The makespan and allocation cost are measured in hours and service units (SU) respectively, with 1 SU = 1 processor-hour.

Impact of trade-off factor

Figure 13 shows the makespan and the allocation cost of the best effort and the provisioned approach on the SDSC and CTC clusters. The results of the provisioned approach are shown for different values of the trade-off factor α. For both the SDSC and CTC clusters, the provisioned makespan is lower than the best effort makespan for all values of α, at the cost of a slightly increased allocation cost. The provisioned makespan is lower at the CTC cluster as compared to the SDSC cluster. This can be explained by the lower utilization of the CTC cluster (64%) as compared to the SDSC cluster (74%), providing more resources to provision for the application. While the best effort allocation cost remains a lower bound on the provisioned allocation cost, we are able to approach that lower bound using higher values of α. Particularly, when α = 1 the allocation cost of the provisioned approach is almost the same as that of best effort while the provisioned makespan is lower than the best effort makespan.

The provisioned makespan is lower than the best effort one because we provision resources for the entire application a priori, while in the best effort case, because of the associated uncertainty, tasks are not submitted to the resource queue until it is safe to do so, i.e. until all their parent tasks have finished execution. While the tasks wait for their parent tasks to finish, jobs from other users enter the resource queue and take up resources that would otherwise have been available for the application. Thus the precedence constraints in the application, coupled with the shared nature of the resources, increase the best effort makespan of the application as compared to the provisioned one. The provisioned approach, however, leads to higher allocation cost than the best effort approach due to the existence of non-divisible slots. When these slots are provisioned, they seldom provide an exact fit for the tasks scheduled on them and hence create the increased allocation cost of the provisioned approach.
Figure 13. Makespan and allocation cost of the best effort and the provisioned approach for different trade-off factors.

Figure 13 also shows the range of values (error bars) that fall within one standard deviation of the average of the makespan and the allocation cost. The standard deviation of the makespan using best effort is generally higher than that of the provisioned approach, except when α = 1 for the SDSC cluster. Note that when α = 1, we are not concerned about makespan at all and the entire emphasis is on minimizing the allocation cost; the variability of the provisioned makespan in this case can therefore be understood. The provisioned approach is only affected by the resource workload once, when the resource plan is created. The best effort makespan, however, is affected by the resource workload during the entire execution period of the workflow, leading to its greater variability. Thus the provisioned approach provides better insulation for the application against dynamic changes in the resource workload. As the value of α increases, the standard deviation of the provisioned makespan increases due to the increased emphasis on reducing the allocation cost. Similarly, when the value of α decreases, the standard deviation of the provisioned allocation cost increases due to the increased emphasis on reducing the makespan. The allocation cost of best effort is a constant and hence its deviation is zero.

Impact of workflow size

We also performed experiments to determine the impact of the workflow size, in terms of the number of tasks in the workflow, on the performance of the provisioned and the best effort approach. Figure 14 shows the impact of the workflow size on the makespan and allocation cost. The number of tasks in the workflow is varied from 100 to 400. The depth of the workflow is √n, where n is the total number of tasks in the workflow. This ensures that the workflow is well balanced. The rest of the workflow characteristics, such as the average task runtimes and the number of processors per task, are the same.

Figure 14. Impact of the workflow size.

Figure 14 shows the provisioned makespan and allocation cost using trade-off factors of 0 and 1, which represent the range of values achievable through provisioning. The figure shows that the application performance in terms of makespan improves with provisioning as compared to best effort as the workflow size increases. The reason is that the impact of queue wait times is magnified due to the increase in the number of levels in the workflow. The allocation cost of the provisioned approach is not significantly impacted and in fact decreases in comparison to the best effort allocation cost with increasing workflow size.
The reason is that with increased size, most of the workflow tasks are scheduled on the divisible extensible edge slots later in the schedule, which decreases the allocation cost. This shows that the provisioned approach is well suited to large workflows.

Impact of task size

We also performed experiments to determine the effect of the task size on the makespan and the allocation cost. The task size is varied by changing the average number of processors per task in the workflow, and the workflow contains 100 tasks with 10 levels. Figure 15 shows the effect of the task size on the makespan and the allocation cost of the provisioned approach as compared to best effort on the CTC and SDSC clusters. The performance of the provisioned approach is calculated using trade-off factors of zero and one, which represent the lower and upper bounds for makespan and the upper and lower bounds for allocation cost respectively. The difference between the bounds is the range of values that the makespan and allocation cost can take for values of the trade-off factor between 0 and 1.

Figure 15. Makespan and allocation cost for the provisioned approach for varying task sizes.

For the CTC cluster, when the average task size is 3, the provisioned makespan varies from 0.87 (α=0) to 0.92 (α=1) times the best effort makespan and the provisioned allocation cost varies from 1.5 (α=0) to 1 (α=1) times the best effort allocation cost. When the average task size increases to 30, the makespan varies from 0.577 to 0.585 times and the provisioned allocation cost varies from 1.003 to 1 times the best effort values. For both clusters, as the task size increases, the application makespan of the provisioned approach relative to best effort improves. With larger task sizes, the chances of the best effort scheduler being able to backfill them earlier in the schedule decrease and their queue wait times increase. As the task size increases, the overhead in the allocation cost of the provisioned approach decreases, since most of the tasks are scheduled on the divisible slots or their combinations due to the large task size. This shows that the provisioned approach is well suited for workflows with highly parallel tasks.

Impact of resource load

Production Grid sites such as TeraGrid [15] often experience high to very high utilization levels [34]. We performed experiments to ascertain the effect of the resource utilization on the performance of the provisioned approach. For increasing the load, we made two copies of the workload trace, with a week's difference between their start times, using a method adopted in [53]. Simulating only the first trace gave a resource utilization of 63% for the CTC cluster and 70% for the SDSC cluster. For increasing the load, both traces were fed simultaneously to the simulator. While all tasks in the first trace were submitted to the resource as per their recorded submission times, the tasks in the second trace were submitted to the resource only with a certain probability.
Increasing this probability allowed us to increase the load while not changing the task characteristics in the workload. Figure 16 shows the makespan and allocation cost of the provisioned approach normalized with respect to best effort. In this case, a single trade-off factor of 0.5 was used for generating the provisioned results. The results are shown for different task sizes and different resource utilization levels. Generally, the provisioned makespan becomes smaller as compared to best effort as the resource utilization increases. The reason is that due to the increased load, the queue wait times of the best effort tasks increase, leading to an increased makespan. This effect becomes more pronounced when the task sizes are large and it becomes difficult for the resource scheduler to successfully backfill the task earlier in the schedule. The allocation cost of the provisioned approach is based on the slots selected and their match with the application tasks and is not significantly affected by the resource load. This shows that the provisioned approach is more attractive when the resources are highly loaded and the applications have significant resource requirements.

Figure 16. Normalized makespan and allocation cost of the provisioned approach with increasing resource utilization.

4.8 Seismic Hazard Application

In the previous sections, we compared the performance of the provisioned and the best effort approaches using artificially-generated applications. In this section, we use a seismic hazard analysis application taken from the earthquake engineering community, called CyberShake [19]. The workflow has 5 levels with 8039 tasks, and the structure of the workflow along with the module names is shown in Figure 17. Table 2 shows the average runtime and number of processors required for each module. Levels four and five in the workflow contain 4017 tasks each with the same module (synthSGT and peakValCal respectively) operating on different datasets. Since the parallel tasks at levels 2 and 3 in the workflow (shown in gray) would not be able to execute on the SDSC cluster, we simulated the execution of this workflow on the CTC cluster only. The runtimes of the tasks in the workflow were acquired from the provenance records of previous runs of the workflow on the HPCC (High Performance Computing & Communications) cluster at the University of Southern California (USC). We simulated different background loads on the CTC cluster by making two copies of the workload and probabilistically adding portions of the second copy to the simulated workload, as described in the previous section. As a result, we were able to vary the resource utilization of the CTC cluster from 63% to 88%. For each utilization level, we simulated the workflow execution using both best effort and the provisioned approach at different points during the simulation run and averaged the makespan and allocation cost metrics. We used a trade-off factor of 0.5 for creating the resource plan.
Table 2: Module details of the Seismic Hazard Analysis workflow. Module name # of tasks Avg runtime (seconds) # of processors fd_grid_xyz 1 1 1 preSGT 1 300 1 fd_grid_cvm 1 2100 288 pvml_chk1 1 86400 288 pmvl_chk2 1 86400 288 synthSGT 4017 519 1 peakValCal 4017 1 1 Figure 18 shows the makespan and the allocation cost of the workflow execution using the best effort and the provisioned approach. At 63% utilization, the best effort makespan is 476811 seconds (5.5 days) which is similar in magnitude to the 65 observed makespan of the workflow on the HPCC cluster. The provisioned makespan at this utilization level is 23% lower than the best effort value. The best effort makespan sharply increases as the resource utilization exceeds 80%. The growth in the provisioned makespan is less pronounced. At 88% utilization level, the provisioned makespan is 50% lower than the best effort makespan. Figure 18 also shows the standard deviation of the makespan and allocation cost. The standard deviation of the best effort makespan is more than the provisioned makespan and increases with the utilization of the resource. The standard deviation of the provisioned allocation cost is very small at the highest utilization level and negligible at lower levels. 0 100 200 300 400 500 600 63% 76% 88% Resource utilization Makespan (hrs) Provisioned Best Effort 0 5 10 15 20 63% 76% 88% Thousands Resource utilization Allocation cost (SU) Provisioned Best Effort Figure 18. Makespan and allocation cost of the seismic hazard analysis workflow. The allocation cost of the best effort and provisioned approach does not differ significantly as the resource utilization level is changed. The reason is that with the provisioned approach, due to the large size of the parallel tasks, these and the later parts of the workflow had to be scheduled on the divisible/extensible slots at the end of the resource schedule which are very cost effective. In practice, special 66 reservations had to be set up for these tasks on the HPCC cluster since the best effort queues were not configured for such large tasks. 4.9 Multi-site scheduling In the previous subsections, each application was executed on a single resource only whether by provisioning or best effort. This allowed us to compare the performance of provisioning and best effort in the absence of any side effects the resources to submit tasks to for the best effort approach. Even realistically, data intensive applications are likely to execute Multiple resources were simulated using the CTC and SDSC clusters running at various utilization levels. We varied the number of resources from 2 to 6. The first two resources were the CTC and SDSC clusters at their 88% and 84% utilization respectively. The third simulated resource was the CTC cluster at 76% utilization, the fourth resource was the SDSC cluster at 78% utilization, the fifth resource was caused by heuristics used to select over a single resource due the costs associated with data duplication and movement. However in a general setting the user would have the choice to select between multiple resources. Thus we extend the experiments to the case where applications can use resources from multiple providers. the CTC cluster at normal default utilization of 63% and the final resource was the SDSC cluster at its normal default utilization of 70%. The method used to increase utilization was the same as in the previous subsection. 
Thus the various resources were heterogeneous in terms of the workload and the resulting resource availability. We intentionally progressively added less loaded resources to the system in order to evaluate the ability of the provisioned and best effort approaches to take advantage of better resources in the system. The provisioning approach needed no modification to work in a multi-resource setting. The global set includes slots from multiple resources and the selection algorithm (MOGA) operates in the usual manner to create a resource plan from this global set. For the best effort approach, the heuristic used for resource selection was to select the resource with the least number of tasks (running and queued). Whenever an application task became ready for execution, this heuristic was used to select the resource queue to which it should be submitted. Most resource management systems allow the users to examine the number of running and queued tasks in the system.

The application was the 100 task workflow with a depth of 10 as previously used, with an average of 14 processors per task (the average of the 6 and 22 previously used for the SDSC and CTC clusters). The resources are considered homogeneous in processing capacity and the average runtime of the application tasks is 1000 seconds. Each task has two parent tasks (except for the top level tasks) and the average amount of data transferred between the tasks is 1000 Gb. The average bandwidth between the resources is 10 Gbps and the average latency is 150 seconds. Thus the average data transfer time is (1000/10 + 150) = 250 seconds, which makes the application slightly compute intensive since the average task runtime is more than the average data transfer time. We do not consider contention for network resources in computing the data transfer times between tasks scheduled on different resources.

The application was executed using provisioning and best effort 10 times at different points during the simulation and the average makespan and allocation cost were computed. Provisioning was done using a trade-off factor of 0.5. Figure 19 shows the comparison between the application makespan and the allocation cost of the provisioned and best effort approach as the number of resources is increased.

Figure 19. Comparison of provisioned and best effort with increasing number of resources.

As the number of resources increases, both the provisioned and best effort makespans decrease due to the addition of less loaded resources to the mix. The best effort approach selects these less loaded resources for submitting the application tasks since they have fewer queued and running tasks. The decrease in makespan is more pronounced for best effort than for the provisioned approach. The reason is that the provisioned approach is relatively more effective at high utilization levels. The provisioned makespan is about 30% of the best effort makespan when only 2 highly loaded resources are present in the system and increases to about 45% of the best effort makespan when all 6 resources are present. The allocation cost of the provisioned approach is not significantly affected by the number of resources in the system.
The provisioned allocation cost is more than the best effort one because the slots selected for the application are not generally an exact fit for the tasks scheduled on them. When the resources have a higher workload, they have more holes in the resource schedule and produce more non-divisible, non-extensible slots. These slots, when provisioned, increase the allocation cost of the provisioned approach. With the addition of less loaded resources, divisible slots that are generally found later in the resource schedule become available offering the same performance for the application, leading to a slight reduction in the allocation cost of the provisioned approach.

We have so far focused on the application performance and not discussed how provisioning affects the resource performance metrics such as utilization. However, the discussion about the application allocation cost was also implicitly about the resource utilization. The increased allocation cost of the provisioned approach directly translates into increased utilization for the resource. This should compensate the providers for any reduced utilization on account of allowing reservations along with best effort [13, 31, 55].

The approach described in this chapter can also be used when multiple applications need to be executed. Using the DAG model that we use for application modeling in this chapter, several DAGs can be combined together into a single DAG by adding a superficial entry and exit task. All the top level tasks of the individual DAGs can be made children of the superficial entry task and all the bottom level tasks can be made parents of the superficial exit task. As mentioned earlier, our approach is not limited to the DAG model only. Any application model can be used as long as a scheduling cost for the application can be computed as a function of the resources provisioned for the application.

Chapter 5
Adaptive Pricing Scheduler

In Chapter 3, the resources advertised as slots were constrained so as not to delay the best effort tasks submitted to the resource. While this allows the user to make a deterministic resource plan based on the application requirements, it can overly restrict the choice of resources available to the user. The resources on offer through the advertised slots may not be sufficient to meet the requirements of deadline driven applications such as severe weather modeling [21], while at the same time an application that could have been delayed without any repercussion might be getting executed with the best effort service. Moreover, it might become difficult to co-allocate resources from multiple providers for a desired timeframe since this depends on the slots on offer from these resources. While the slot advertisement model of Chapter 3 is an easy first step for the resources to provide deterministic services to the users while preventing them from using provisioning to gain an unfair advantage over best effort, and it can potentially improve the performance of applications as shown in Chapter 4, additional work is required in order to provide more options to the users.

In this chapter, we remove this limitation and allow slots that represent resources that would have been otherwise allocated to the best effort tasks to be provisioned by the users. For example, Figure 20 (reproduced from Chapter 3) shows a Gantt chart of 4 queued tasks, A to D, created using conservative backfilling.
In the non-delaying model of Chapter 3, only the resources represented by slots S1 to S5 could be provisioned by the user. We now allow provisioning of any slot containing any number of processors, from 1 to 5, for any timeframe, assuming none of the tasks has started execution yet. However, this creates a dilemma: what prevents the user who submitted task D from provisioning a slot containing 3 processors for 2 hours at time 0 in order to get his task completed at time 2, instead of submitting the task to the resource queue where it would complete at time 6 with the best effort service, as shown in Figure 20?

Figure 20. A schedule of 4 tasks with conservative backfilling.

Thus, without any regulatory mechanism or some criteria for differentiating the provisioned and best effort service, users might exploit the system by provisioning slots in order to get their tasks completed earlier than using the best effort service. Thus most resource providers disable user-level reservations even though the underlying resource management systems such as Maui [3], LSF [63], and CCS [39] do provide features for user-level reservations. Any provisioning policy that solves this problem should account for the externalities caused to other users sharing the system and discourage provisioning unless truly required. These objectives can be achieved by a suitable pricing policy.

In this section, we suggest an adaptive pricing scheme for slots. Slot provisioning is allowed as long as it does not require any currently running task to be preempted or conflict with existing reservations. The queued jobs are ignored in determining the admissibility of a slot. A slot can take up resources that would have been otherwise allocated to the queued-up best-effort jobs. Thus the presence of a slot might cause them to start later than they would have otherwise. The delay caused to the best effort jobs due to the allocation of a slot is referred to as the externality caused by the slot. We use the slot price model of Chapter 3 consisting of two components. The first component, c, is a per unit charge that is greater than or equal to its best-effort counterpart. The second component, f, is designed to account for the externality caused by the slot and is based on the estimated increase in completion time of the queued best-effort jobs that get delayed due to the slot. This gives a nice semantic interpretation to the fixed cost portion (f) of the slot. Since this fixed component is based on the current workload of the resource, we call this pricing scheme adaptive. The slot advertisement model of Chapter 3 is a special case of this pricing model where we allow only slots that do not delay any tasks and thus have f = 0. This pricing scheme is motivated by prior work in externality based pricing [30, 43, 44, 59], where different priority classes exist and the admission price of a priority class is based on the expected externality it causes on the lower priority classes.
5.1 Pricing Algorithm

Before we describe the interaction between the user and the resource provider, we describe the algorithm for determining the price of a slot with a start time s, number of processors required n and a duration d. The cost for the slot consists of a multiplicative cost component c and an additive component f as in chapter 3. The implied total cost of the slot is (n x c x d + f). The multiplicative component c represents the basic unit charge, e.g. one service unit (SU) per processor hour for the resource, and might be the same as the charge for the best effort service. Since we allow a slot to be provisioned while delaying the queued best effort jobs, the additional charge, f, is used to account for the delay caused to the best effort jobs, if any. The pricing algorithm described in this subsection determines this additional component (f), while the multiplicative part is assumed to be a constant across slots for the resource provider.

In order to determine the delay caused to any job, the algorithm computes the schedule of the queued best effort jobs with and without the slot in place at its designated start time using the resource scheduling policy. By comparing the start times of the jobs in the two schedules, we can find which jobs would get delayed and by how much. The additional charge (f) is then set equal to the sum of the delays of the delayed jobs multiplied by the number of processors requested by these jobs. Thus if the slot does not delay any jobs according to the computed schedules, then the additional charge is zero.

In order to illustrate the working of the algorithm, a resource with 5 processors is shown in Figure 21. Tasks A and B are currently running, C and D are queued tasks, and R is a provisioned slot. A slot X for 2 processors and 3 time units is desired, starting at time 2. Figure 21 shows the resource schedule without (left) and with (right) the slot in place. As a result of putting the slot in, task C gets delayed by 1 time unit and task D gets delayed by 2 time units. Thus the charge for the slot X would be (1x2) + (2x2) = 6 since both task C and task D require two processors.

Figure 21. Resource schedule without (left) and with (right) the reservation in place.

The algorithm is more formally listed in Figure 22. The asymptotic complexity of the algorithm is O(S + n) where S is the runtime complexity of the resource scheduling algorithm and n is the number of queued up jobs. However, since the scheduling complexity, S, would be at least linear in the number of queued up jobs, the complexity of the algorithm in Figure 22 is the same as that of the resource scheduling algorithm, O(S).

Figure 22. Slot Pricing Algorithm.
Input: slot start time s, processors n, duration d
Input: set of queued tasks Q, scheduling policy π
Input: existing slots and running tasks
Output: a price for the slot, f
1. Create a schedule of the tasks in Q using π. Let v_q = start time of task q in Q as per this schedule.
2. Add the slot at time s.
3. If the slot cannot be added at s due to conflicts with running tasks and provisioned slots, return ∞.
4. Recreate a schedule of the tasks in Q using π. Let v'_q = start time of task q in Q as per this schedule.
5. Return f = Σ_{q ∈ Q} (v'_q - v_q)+ x n_q as the price for the slot, where n_q is the number of processors required by task q and (v'_q - v_q)+ equals (v'_q - v_q) if v'_q ≥ v_q, and 0 otherwise.
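To make the pricing step concrete, the following sketch (not part of the thesis implementation) shows one way the algorithm of Figure 22 could be written in Python. It assumes a caller-supplied schedule() function that applies the resource scheduling policy π to the queued jobs, optionally with the candidate slot in place, and returns each job's start time, as well as a feasible() test for conflicts with running tasks and existing reservations; these names and the job representation are illustrative assumptions.

    def price_slot(slot, queued_jobs, schedule, feasible):
        # slot: (start, processors, duration); queued_jobs: objects with .id and .processors
        # schedule(queued_jobs, extra_slots): applies the scheduling policy and returns
        #     {job_id: start_time}; feasible(slot): False if the slot conflicts with
        #     running tasks or existing reservations.
        if not feasible(slot):
            return None                                    # stands in for the infinite price in Figure 22

        before = schedule(queued_jobs, extra_slots=[])     # start times without the slot
        after = schedule(queued_jobs, extra_slots=[slot])  # start times with the slot in place

        f = 0
        for job in queued_jobs:
            delay = max(0, after[job.id] - before[job.id])  # (v'_q - v_q)+
            f += delay * job.processors                     # weight each delay by the job's size
        return f

The two calls to schedule() mirror steps 1 and 4 of Figure 22; the overall cost is therefore dominated by the scheduling policy itself, consistent with the O(S) complexity argument above.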
5.2 Interaction models

The basic model is that the users can query for slots of a specific dimension and the resource will provide the availability information of the slots with their cost attributes. Based on what information the user provides and its semantics, the following query options are possible, all of which can be simultaneously supported by the resource.

E(s, n, d) = <p, f>    (17)

E(n, d) = {..., <s_i, c_i, f_i>, ...}    (18)

E(s, n, d) = {<s_1, c_1, f_1>, ..., <s_k, c_k, f_k>}    (19)

In the first case, (17), the user provides the fixed start time of the slot, the number of requested processors and the duration, and the resource responds with the cost attributes for the slot. It is straightforward to implement and the cost can be calculated using the algorithm in Figure 22. The unavailability of resources for the slot at the desired time would be indicated by some predefined cost value such as infinite (∞).

In the second case, (18), the user is flexible regarding the start time of the slot and only provides the number of required processors and the duration. The resource sends back a potential set of start times for the slot and the prices at these start times. Thus the user can examine the different provided start times for a desired slot in order to optimize the price to be paid and/or the utility derived from the reservation.

The approach we use for generating the list of possible start times is to create a schedule of all the running and queued jobs and the existing slot reservations. We examine this schedule from the current time to the end of the schedule and whenever there is a discontinuity in the resource availability, e.g. as a result of tasks or slots starting or finishing, we include that time instant in the list of possible start times. The intuition for picking these start times is that the availability of the resource changes at these time instants and hence the price and admissibility of the slot would likely be affected too. Another option would have been to examine start times at regular intervals starting from the current time. However, it might miss interesting time instants, for example the earliest time when the slot can be provisioned or the earliest time when it can be provisioned without delaying any queued jobs. The current time is also included in the list since it is possible that the requested slot might be able to start right away if enough resources are available. Then the price for each start time is computed using the algorithm in Figure 22 and the list of start times, along with the price for each of them, is sent back to the user. If a start time is infeasible for the slot due to lack of resources at that time, it is not included in the list. Since the earliest possible time for the slot is always either the current time or the end time of a job or slot, this list also contains this earliest start time for the slot. Additionally, the list also contains at least one time when the additive price (f) is zero, for example at the maximum finish time of any job or slot. At this time all the processors are available and the slot reservation can be done without delaying any jobs. By the same reasoning, the list is also always non-empty. The size of the list is of the order of the number of tasks and reservations in the system and hence the computational complexity of creating the list of start times with prices is O(nS), where n is the number of tasks (queued and running) and reservations in the system, using the O(S) complexity of the pricing algorithm.
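As an illustration of how a resource might implement the response to a query of form (18), the sketch below enumerates candidate start times at the schedule discontinuities and prices each one with the pricing algorithm. The helper functions feasible() and price_slot(), as well as the event_times list (the start and finish times of running tasks, queued tasks and existing slots), are assumptions made for this example rather than a published interface.

    def candidate_start_times(now, event_times, n, d, feasible, price_slot):
        # Candidate start times are the current time plus every future point where
        # the schedule changes (a task or slot starts or finishes).
        starts = sorted(set([now] + [t for t in event_times if t >= now]))
        response = []
        for s in starts:
            slot = (s, n, d)
            if not feasible(slot):
                continue                  # infeasible start times are left out of the response
            response.append((s, price_slot(slot)))
        return response

Because the number of schedule discontinuities is proportional to the number of tasks and reservations, this directly reflects the O(nS) cost of building the response discussed above.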
As an illustration, based on the resource schedule shown in Figure 21 (left), the potential start times for the requested slot X would include {t_0, t_1, t_2, t_3, t_4, t_5, t_6}. t_0 is an infeasible start time and the prices for the other start times are shown in Table 3 based on the delays to jobs C and D. These start times and prices would be sent to the user. The table shows that an earlier start time does not always mean a larger cost, e.g. t_3 is costlier than t_2, which is costlier than t_1. The slot can be provisioned at any time from t_4 to t_6 without delaying any queued task and thus has a price of 0. Thus the resource response can be very useful for the user to determine a desirable start time for the slot. However the user is not constrained by these start times and can always self-select a potential start time for the slot, as in (17), when allowed.

Table 3. Reservation costs at times from t_1 to t_6.

Time instant    Delay for task C    Delay for task D    Additive cost (f_i)
t_1             1                   2                   1x2 + 2x2 = 6
t_2             1                   3                   1x2 + 3x2 = 8
t_3             1                   4                   1x2 + 4x2 = 10
t_4             0                   0                   0
t_5             0                   0                   0
t_6             0                   0                   0

In the last case, (19), the user provides the earliest feasible start time for the slot and its dimensions, and the resource responds with a set of start times no earlier than the user supplied one and the prices at these times. This implies that the user is flexible regarding the start time but does not require the resources before a certain time. The set should also contain the price at the user supplied earliest start time or the nearest possible one. It is straightforward to implement (19) using the implementation of (18). We use the same mechanism for generating the possible start times but ignore those that occur before the user provided earliest feasible start time. We also add a tuple corresponding to the user supplied earliest feasible start time if resources are available at that time. As discussed previously, the response is always non-empty and also contains at least one start time where f = 0.

5.3 Experiments

In order to evaluate the performance of the pricing algorithm, we did trace-based simulations using the same resources (430 node CTC and 128 node SDSC cluster) and workload that were used for the experiments in the previous section. These workloads have been used extensively for scheduling studies [46] and to characterize the effect of reservations on best effort jobs [31, 55]. We use conservative backfilling as the resource scheduling policy for the best effort jobs in the simulations. The runtime estimates and the actual runtimes of the jobs are used as recorded in the workload traces. Each resource was simulated separately.

We randomly selected ten percent of the jobs in the workload trace file and executed them using slot reservations instead of just submitting them to the resource queue. For each such job a query was sent to the resource with the number of required processors and the duration, based on the formulation of (18). The resource sends back a set of possible start times with corresponding prices. One particular start time for the job was then selected from the list such that the following metric was minimized

min over <s_i, f_i> of { α x (f_i - f_min) / (f_max - f_min) + (1 - α) x (s_i - s_min) / (s_max - s_min) }    (20)

where f_min (f_max) is the minimum (maximum) price in the set and s_min (s_max) is the minimum (maximum) start time in the set.

The value of the trade-off factor (α) is a number between 0 and 1 and is kept constant for the simulation. Different simulation runs were done with different values of α that model different sensitivities of the user towards price and start time. Note that when α = 1, any start time with f = 0 will minimize (20), e.g. t_4 to t_6 in Table 3. In that case, the earliest one is selected when more than one exists.
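A small sketch of the start-time selection metric (20) is shown below, assuming the resource response is a list of (start time, additive price) pairs as returned for a query of form (18). Ties are broken in favour of the earliest start time, as described in the text; the function name is illustrative.

    def select_start_time(response, alpha):
        # response: list of (start_time, additive_price) tuples.
        starts = [s for s, f in response]
        prices = [f for s, f in response]
        s_min, s_max = min(starts), max(starts)
        f_min, f_max = min(prices), max(prices)

        def norm(x, lo, hi):
            return 0.0 if hi == lo else (x - lo) / (hi - lo)

        def metric(entry):
            s, f = entry
            return alpha * norm(f, f_min, f_max) + (1 - alpha) * norm(s, s_min, s_max)

        # The secondary key breaks ties in favour of the earliest start time.
        return min(response, key=lambda e: (metric(e), e[0]))

With alpha = 0 the earliest start time always wins regardless of price, and with alpha = 1 the cheapest (typically zero-priced) start time wins, which corresponds to the two extremes discussed in the results below.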
Due to the computational complexity of creating the list of start times and prices, the simulation was limited to the first 30 days in the workload trace. Figure 23 shows the average price for the tasks that were executed using slots in terms of service units (SU), where 1 SU = 1 processor hour. The price shown in the figure is the additive price (f_i) and not the total price for the slot. When the value of the trade-off factor (α) is zero, the earliest possible time is selected for the job irrespective of the price and hence it leads to the highest average price for the job. At the other extreme, when the trade-off factor is one, the earliest time from the list is selected such that the additive price is zero and hence the average price is also zero. For the same trade-off factor, the price at SDSC is higher than at CTC, likely due to the higher utilization and smaller processor count of the SDSC cluster.

Figure 23. Average price corresponding to different trade-off factors.

Figure 24 shows the comparison between the average wait time for jobs that were executed using provisioning (slots) and the other jobs (best effort) for the CTC and the SDSC clusters. When the focus is on wait time minimization (low trade-off factor), the wait time of the provisioned jobs is much less than that of the best effort jobs. For higher trade-off factors, the wait time of the provisioned service is similar to that of the best effort traffic.

Figure 24. The average wait time of reservations and best effort jobs.

Impact of resource utilization

In order to examine the effect of the resource load on the prices and wait times of the reservations, we increase the resource utilization by 20% and 40% by duplicating tasks in the simulated workload. As a result the utilization of the CTC system rises from 58.7% to 68.4% and 78.8% respectively, and that of the SDSC SP2 system rises from 63.4% to 77.8% and 87.4% respectively. The percentage of reservations in the final workload remains 10%.

Figure 25 shows how the price for the slots changes with increasing resource utilization. For all trade-off factors, the price increases with increasing utilization. The reason is that with increased load, more tasks get delayed in order to provide preferred service to the slots, leading to their higher cost. This reflects the adaptivity of the pricing scheme to the status of the resource and suggests that the adaptive pricing model can lead to load balancing in the Grid, as users will shift from highly loaded resources to less loaded ones due to the price differential.

Figure 25. Reservation pricing with increasing resource load.

Figure 26 shows the average reservation wait time normalized with respect to the average best effort wait time. With increasing resource load, the reservation wait times become smaller as compared to the best effort wait times. The reason for the decrease is the increased best effort waiting time due to the increased resource load. However, the lower wait times come at the cost of an increased price as shown in Figure 25.
Thus the pricing scheme provides a way for the users to get the desired performance for their applications at a cost that the users can predetermine.

Figure 26. Normalized reservation wait times with increasing resource load.

Impact of proportion of reservations

We also experimented with increasing the percentage of jobs that are executed using slot reservations. Figure 27 shows the average price for the slots when the percentage of jobs in the system that execute using slots increases. The slots are reserved at the earliest possible time (α = 0). As the figure shows, as the percentage of reservations increases, their price drops. This is due to the corresponding decline in the percentage of best effort traffic that is the basis for the price in the first place. When the entire workload is composed of reservations, the price would be zero since there would be no best effort traffic. With an increasing proportion of reservations, their average wait time also increases due to the same reason (not shown).

Figure 27. Prices with increasing proportion of reservations.

We also did experiments to compare the wait time of the reservations and the best effort traffic after increasing the percentage of reservations to 50% (Figure 28) from 10% earlier (Figure 24). As the value of alpha increases, the wait time of the best effort tasks decreases since the reservations go nearer to the end of the schedule. This effect is not pronounced in Figure 24 because of the small percentage (10%) of reservations used in that experiment.

Figure 28. The average wait time of reservations and best effort jobs with 50% reservations.

Impact of resource scheduling policy

The resource scheduling policy for the best effort jobs would also affect the reservation price since it determines the job placements and hence the resulting delay due to a reservation. We also implemented aggressive backfilling (AB) and compared the resulting reservation prices to those of the conservative backfilling (CB) that has been used for the preceding experiments. Conservative backfilling allows a job to be executed out of order only when it would not delay any job ahead of it in the queue, while aggressive backfilling allows out of order execution only when it would not delay the first job in the queue. Thus, AB provides more freedom than CB in scheduling jobs from the queue. In these experiments, the value of α for determining the start time of a reservation was randomly chosen between 0 and 1.
Figure 29 shows the average reservation price using aggressive and conservative backfilling for the CTC and the SDSC SP2 systems with different utilization levels. The average reservation price under aggressive backfilling is generally more than that under conservative backfilling. The reason is that aggressive backfilling packs the schedule more tightly than conservative backfilling due to the additional flexibility in the placement of jobs, and thus a reservation under aggressive backfilling causes more perturbance than under conservative backfilling. Moreover, this difference in price gets accentuated as the resource utilization increases.

Figure 29. Impact of conservative vs aggressive backfilling on pricing.

The pricing also seems to be dependent on the profile of the resource. For example, Figure 29 shows that the average price at the CTC cluster is more than the average price at the SDSC cluster when the resource utilization is the same (compare the price at 78.8% utilization on the CTC cluster with 77.8% at the SDSC cluster). The reason is that the CTC cluster has more processors (430) in comparison to the SDSC cluster (128). Thus the average job size at the CTC cluster is larger than at the SDSC cluster. Since the size of the jobs that get delayed also factors into the price, the price becomes higher when the delayed jobs are of larger size. Thus smaller jobs would tend to choose resources that are of similar size, implying that the larger resources would be targeted by jobs that need more resources. This is one of the goals of the supercomputing centers, i.e. to attract highly parallel jobs that are a natural match for the resources at the center.

Impact of inaccurate runtime estimates

Till now, we have assumed that the tasks executing using the best effort service request their actual running time. However, tasks generally request a running time that is often much more than their actual runtime, because users are not sure about the runtime of their tasks and tasks that do not finish by their requested running time are terminated. Thus, users generally play safe and provide a conservative estimate of the task runtime. These estimates are recorded in the workload logs along with the actual runtime of the tasks. We performed experiments using the requested runtime estimates of the tasks in the best effort workload. Since the resource schedule is constructed using the runtime estimates provided by the users and is used by the pricing algorithm in Figure 22, the slot prices are also affected by the inaccurate runtime estimates. Figure 30 compares the slot prices generated using accurate and inaccurate runtime estimates. As the figure shows, the prices are generally higher with inaccurate estimates. The reason is that using the conservative estimates, the resource schedule is more spread out in time and any perturbation due to the insertion of a slot causes longer delays to the queued jobs.

Figure 30. Comparison of slot pricing with accurate and inaccurate estimates.

However, the most pronounced impact of the inaccurate runtime estimates is on the wait times for the slots. Figure 31 shows the slot and best effort wait times using the inaccurate estimates. A comparison with Figure 24, which was generated using accurate runtime estimates, shows that both the slot and best effort wait times increase with inaccurate estimates. This can be understood by the loss of optimality in the schedule due to the inaccurate estimates. However, more importantly, the slot wait times are not necessarily less than the best effort wait times.
For higher values of the trade-off factor the best effort wait time is less than the slot wait times. The reason is that best effort jobs take advantage of backfilling opportunities created when tasks finish before their specified wall clock time to jump ahead and finish earlier than otherwise predicted. However, the slots have a fixed start time and cannot move forward to take advantage of these opportunities. Moreover, when the slots do cost zero using a trade-off factor of 1, i.e. the price for the slot and best effort is the same, the average wait time of the slots is more than the average wait time of the best effort traffic. Thus the provisioned service is not superior to the best effort service; it just provides more options to the users to optimize the performance of their jobs at a specified price.

Figure 31. Slot and best effort wait times with inaccurate runtime estimates.

In chapter 7, we describe how the users can create a resource plan for their applications using the adaptive pricing model in order to optimize their application performance while minimizing the allocation costs.

Chapter 6
Comparison of Virtual and Advanced Reservation Systems

In this chapter we compare the cost of creating reservations through the adaptive pricing scheme of the previous chapter and a recently proposed scheme for creating probabilistic reservations. This scheme is called virtual advanced reservations (VAR) [48] and uses prediction services and the best effort queue to simulate an advanced reservation.

6.1 Virtual advanced reservation (VAR)

The virtual advanced reservation (VAR) [48] system is an attempt to provide users with probabilistic advanced reservations when the underlying system is best effort and does not support reservations. VAR is implemented by submitting jobs to the underlying resource scheduler that are similar to the jobs submitted by other users. The resource scheduler is not involved in providing the reservation. However, the VAR system assumes that it is possible to observe the queue wait times of the jobs serviced by the resource scheduler. By observing the history of queue wait times of jobs and using statistical techniques, VAR assumes that the probability of a specific job starting before a given deadline can be predicted. This functionality is already provided by a service called QBETS (Batch queue job delay prediction service). This service is currently installed on the TeraGrid system. Specifically, the prediction done by QBETS takes the following form

prob = QBETS(m, q, nodes, wallTime, startDeadline)    (21)

where
m = name of the machine
q = name of the queue on the machine
nodes = number of processors required for the task
wallTime = required execution time of the task
startDeadline = the deadline before which the task should start execution

VAR implements reservations by determining when a job should be submitted to a batch queue so as to ensure that it will be running at a particular point in the future. For example, consider the case where we want a reservation for a task starting at time S and having a duration of D, as shown in Figure 32.

Figure 32. Reservation start time and duration.

The VAR system divides the time from the current time to the desired task start time into small uniform intervals. Then it simulates the flow of time from the current time to the desired task start time.
At the beginning of each interval, it answers the following question: what is the probability that if the task is submitted now, it would start execution before the desired start time? Also, the task may start execution anytime between the time it is submitted and the desired start time. In order to ensure that the task keeps running till the desired end time, its runtime is increased to cover the duration between the current time and the desired start time of the task. The time when the probability is maximum is taken as the submission time for the task.

Figure 33 shows the pseudo code for the VAR system. It is a slight modification of the algorithm published in [48]. The input to the algorithm is the node count of the task, the run time of the task (wallTime) and the desired start time of the task. The output of the algorithm is the time when the task should be submitted to the machine queue and the run time that should be requested for the task. The algorithm internally uses the QBETS prediction service described earlier.

Figure 33. Determining the submission and run time of a task in VAR.
Input(machine, queue, nodes, wallTime, startTime)
Output(submitTime, runTime)
currT = current time
maxProb = 0
submitTime = 0
runTime = 0
Increment = 30 // time between candidate submission times (seconds)
While (currT < startTime)
    advWallTime = wallTime + (startTime - currT)
    currProb = QBETS(machine, queue, nodes, advWallTime, startTime)
    if (currProb > maxProb)
        maxProb = currProb
        submitTime = currT
        runTime = advWallTime
    endif
    currT = currT + Increment
EndWhile

6.1.1 Allocation cost of VAR

The allocation cost of VAR arises from the fact that the requested runtime of the task that is actually submitted to the resource queue is more than the duration of the reservation. For example, consider the scenario shown in Figure 32, reproduced in Figure 34, where the current time is T, the desired start time of the reservation is S and the duration is D. Suppose now that the virtual advanced reservation system computes that the task should be submitted at time T_1 to the resource queue and should have a resulting duration of (D + (S - T_1)). Assume that the task starts running at time T_2 (≤ S). Then it will not be doing any work till S, since the reservation is required only at time S. Thus the allocation cost of the VAR system would be (S - T_2) x node count. Note that a tight upper bound on the allocation cost would be (S - T_1) x node count, since T_1 ≤ T_2, i.e. a task cannot begin execution before it is submitted to the resource queue. If the submitted task does not begin execution by time S, then the VAR is a failure in that case and the question of allocation cost does not arise. Note that the allocation cost of the VAR system cannot be determined until the submitted job starts execution. Thus it is not possible to predetermine the cost of allocation as in the adaptive pricing model of the previous chapter. However, as discussed earlier, an upper bound on the cost can be determined.

Figure 34. Virtual reservation scenario.
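The following Python sketch mirrors the loop of Figure 33. The qbets_probability() argument is a stand-in for a call to the QBETS prediction service of (21); its actual web service interface is not reproduced here, and all other names are illustrative assumptions.

    def var_submission(machine, queue, nodes, wall_time, start_time, now,
                       qbets_probability, increment=30):
        # Search candidate submission times between now and the desired start time
        # and keep the one with the highest predicted probability of starting in time.
        best_prob, submit_time, run_time = 0.0, None, None
        t = now
        while t < start_time:
            adv_wall_time = wall_time + (start_time - t)   # pad so the job is still running at start_time
            prob = qbets_probability(machine, queue, nodes, adv_wall_time, start_time)
            if prob > best_prob:
                best_prob, submit_time, run_time = prob, t, adv_wall_time
            t += increment
        return submit_time, run_time

If the returned submit_time is used and the job actually starts at some T_2 ≤ S, the resulting allocation overhead is (S - T_2) x nodes, as described in section 6.1.1.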
6.2 Experiments

We performed experiments to compare the allocation cost of the adaptive pricing reservation system of the previous chapter and the virtual advanced reservation system (VAR). These experiments were done using the QBETS prediction service installed on the NCSA TeraGrid cluster and the task queue of the same cluster, which can be observed using the http://tg-monitor.ncsa.teragrid.org web site. The following reservations were considered:

1. reservation for 32 nodes for 1 hour starting in 1 hour
2. reservation for 32 nodes for 1 hour starting in 4 hours
3. reservation for 32 nodes for 1 hour starting in 8 hours
4. reservation for 32 nodes for 4 hours starting in 1 hour
5. reservation for 32 nodes for 4 hours starting in 4 hours
6. reservation for 32 nodes for 4 hours starting in 8 hours
7. reservation for 64 nodes for 1 hour starting in 1 hour
8. reservation for 64 nodes for 1 hour starting in 4 hours
9. reservation for 64 nodes for 1 hour starting in 8 hours
10. reservation for 64 nodes for 4 hours starting in 1 hour
11. reservation for 64 nodes for 4 hours starting in 4 hours
12. reservation for 64 nodes for 4 hours starting in 8 hours

As an example of the prediction service, Figure 35 shows that the probability of a 32 node task with a 4 hour runtime starting within 4 hours is 38% according to the QBETS service of the NCSA TeraGrid cluster. The same service is also available in a programmatic form as a web service at [50].

Figure 35. Queue time prediction service on the NCSA TeraGrid cluster.

We used the web service version of the QBETS service and the algorithm shown in Figure 33 to determine the submission time and runtime of the VAR tasks for each of these reservations. The time when each task actually started execution was recorded and thus the VAR allocation cost of each of these reservations was determined.

In order to illustrate the working of the VAR algorithm, Figure 36 plots how the probability changes as different submission times are considered from the current time to the reservation start time (240 minutes) for a 32 node request that has a duration of 1 hour (left) and another 32 node request that has a duration of 4 hours (right). For the request with 1 hour duration (left), the probability is maximum (0.772) at 110 minutes from the current time. So the VAR in this case would decide to submit a task to the resource queue 110 minutes from now with a duration of (60 + (240 - 110)) = 190 minutes, requesting 32 nodes. For the request with 4 hour duration (right), the maximum probability is at time 0 (0.747). So the VAR in this case would decide to submit a task to the resource queue right away with a duration of (240 + (240 - 0)) = 480 minutes, requesting 32 nodes.

Figure 36. Working of the VAR algorithm for two requests.

At the same time as the VAR experiments were done, we also computed the allocation cost of these reservations using the adaptive pricing strategy. In order to do so we needed to know the currently running and queued jobs. This was done by logging on to the NCSA TeraGrid cluster and running the "qstat -a" command. The output of the command is similar to that shown in Figure 37, which is also available from the http://tg-monitor.ncsa.teragrid.org web site.

Figure 37. Jobs running and queued at the NCSA TeraGrid cluster.

After getting the information about the running and the queued tasks, we use the adaptive pricing algorithm of chapter 5 (Figure 22) to compute the prices of these reservations.
We assume conservative backfilling as the resource scheduling algorithm. We repeated the experiments for 30 days, where each day we would determine the allocation cost of the 12 reservations mentioned before using the VAR and the adaptive pricing methodology at some particular time of the day. The results shown in the next section are based on the average over the 30 days.

6.3 Results

6.3.1 32 node experiments

Figure 38 shows the allocation cost of the VAR system and the adaptive pricing scheme for the set of reservations requesting 32 processors. The allocation cost is normalized with respect to the base charge for the reservation, i.e. the node count multiplied by the duration of the required reservation. As the figure shows, the allocation cost of the VAR scheme is less than that of the adaptive pricing scheme. The difference between the two is most pronounced when the start deadline is 1 hour. This is because the allocation cost of the VAR system is upper bounded by the start deadline multiplied by the node count.

Figure 38. Allocation cost of the VAR and the adaptive pricing scheme.

The allocation cost of the adaptive pricing system, however, is not upper bounded and hence there is a significant difference between the allocation costs of the VAR and the adaptive pricing system. When the required duration is 1 hour and the start deadline is 8 hours, there is little difference between the allocation costs of the VAR and the adaptive pricing system.

We also measured the success rate of the VAR system and the adaptive pricing system. In the case of VAR, success happens if the submitted task starts execution before the reservation start time, else it is regarded as a failure. In the case of the adaptive pricing system, a failure happens when it is not possible to grant a reservation. This would happen when the reservation would conflict with the currently running tasks. Figure 39 shows the success rate of the VAR and the adaptive pricing system. As the figure shows, the adaptive pricing system has a higher success rate than the probabilistic VAR system. The success rate is higher when the start deadline is 1 hour, at the cost of an increased allocation cost as shown in Figure 38. Thus the adaptive pricing and the VAR systems seem to provide two different qualities of service, one with lower cost and lower probability of success and the other with higher cost and higher probability of success. Another difference between the two services is that in the case of the adaptive pricing scheme, the user can predetermine the success or failure and the cost of allocation, whereas in the case of the VAR system, the allocation cost and success cannot be predetermined until the submitted task has started execution.

Figure 39. Success rate of the VAR and the adaptive pricing system.

Even though the average allocation cost of the adaptive pricing scheme is higher than that of the VAR system as shown in Figure 38, in a non-negligible percentage of cases the allocation cost of the adaptive pricing scheme was lower than that of the VAR system.
Figure 40 shows the percentage of cases when the VAR or the adaptive pricing scheme was cheaper than the other. When the required reservation duration is 1 hour, the adaptive pricing scheme is more cost effective than the VAR scheme in a majority of cases. When the required reservation duration is 4 hours, the VAR scheme is more cost effective than the adaptive pricing scheme in a majority of cases. Thus while the average allocation cost of the adaptive pricing scheme is more than that of the VAR scheme, it is competitive with the VAR scheme in terms of the number of cases in which each of them is more cost effective than the other.

Figure 40. Percentage of cases when VAR or adaptive pricing is cheaper than the other.

6.3.2 64 node experiments

We also performed experiments using the 64 node requests mentioned in the reservation list. The goal of using a higher node count was to study the resulting effect on the allocation cost of the VAR and the adaptive pricing scheme.

Figure 41. Allocation cost of VAR and the adaptive pricing scheme for the 64 node reservations.

The results of the 64 node reservations are similar to those of the 32 node reservations, with the average allocation cost of the adaptive pricing scheme being higher than that of the VAR scheme. Also, the average adaptive price is generally higher than that for the 32 node reservations, particularly when the reservation duration is 1 hour. This is likely due to the higher node count of the reservation, which leads to more displacement among the jobs in the resource schedule and results in a higher price. For the VAR scheme, however, the upper bound on the allocation cost remains the same irrespective of the node count. In order to understand this, consider Figure 42, which shows a typical VAR scenario.

Figure 42. A VAR scenario.

In Figure 42, the upper bound on the allocation cost of VAR is

upper bound = (S - T) x node count
normalized upper bound = upper bound / base charge for the reservation = ((S - T) x node count) / (D x node count) = (S - T) / D    (22)

Thus the normalized upper bound is independent of the node count and hence the VAR allocation costs are similar between the 32 node and the 64 node reservation requests. Also, the allocation cost of the adaptive pricing scheme increases with the start time of the reservation. However, if we were to further increase the start time of the reservation to be beyond the maximum finish time of any scheduled task, the adaptive price would be zero in that case and it would be cheaper than the VAR. It was not possible to perform these experiments on the NCSA TeraGrid cluster because of the limit on the runtime of the tasks submitted to the system. When the start time of the reservation is further away in the future, the runtime of the task generated by the VAR system would also potentially be long and hence would be rejected by the system.
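As a quick check of (22), the normalized VAR upper bound depends only on the start deadline (S - T) and the duration D; the small snippet below tabulates it for the deadlines (1, 4 and 8 hours) and durations (1 and 4 hours) used in the experiments. The helper name is illustrative and the values follow directly from the formula.

    def normalized_var_bound(start_deadline_hours, duration_hours):
        # Normalized upper bound on the VAR allocation cost from (22): (S - T) / D.
        # The node count cancels out, so 32 and 64 node requests share the same bound.
        return start_deadline_hours / duration_hours

    for deadline in (1, 4, 8):
        for duration in (1, 4):
            print(deadline, duration, normalized_var_bound(deadline, duration))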
Figure 43 shows the success rate of the VAR and the adaptive pricing scheme. As with the 32 node reservation case, the success rate of the adaptive pricing scheme is higher than that of the VAR scheme at the expense of an increased allocation cost. We also observed that the failures of the VAR and the adaptive pricing scheme are correlated, i.e. when the adaptive pricing scheme failed, it was more likely that the VAR request would fail as well. This can be understood since the reason for failure of both schemes is a higher load on the resource.

Figure 43. Success rate of VAR and adaptive pricing scheme with 64 node reservations.

Figure 44 shows the percentage of cases when the VAR or the adaptive pricing scheme was cheaper than the other. Unlike the case with the 32 node reservations, the VAR system is cheaper than the adaptive pricing scheme in a majority of cases. This shows that for the set of experiments performed, the adaptive pricing scheme fared better with a lower node count than with a higher node count when compared with the VAR scheme.

Figure 44. Percentage of cases when VAR or adaptive pricing is cheaper than the other.

To summarize the experimental results shown in this section, VAR and the adaptive pricing scheme provide two complementary classes of service, one with lower allocation cost and lower success rate and the other with higher allocation cost and higher success rate.

6.4 Combining the VAR and the adaptive pricing scheme

We can also use the VAR and the adaptive pricing scheme concurrently to get the benefits of the lower costs of the VAR model and the higher success rate of the adaptive pricing model. In order to do so, a user would make a reservation using the adaptive pricing model and also let the VAR submit a task to the queue for the same reservation. When the VAR task begins execution, the user can make a decision on whether to cancel the adaptive pricing reservation or to cancel the VAR task that had just begun execution, based on the costs of these two options.
The user creates the reservation and also invokes VAR to get the same reservation. VAR determines that a task should be submitted to the resource queue at time T 1 with a duration of (S - T 1 ) + D to get the same reservation. The VAR task starts execution at time T 2 . Assuming that T 2 happens before S, we need to decide between keeping the adaptive pricing reservation or VAR. If the sum of the allocation cost of VAR and the penalty for cancelling the adaptive pricing reservation is less than the allocation cost of the adaptive pricing reservation then cancel the adaptive pricing reservation. Otherwise cancel the VAR task and keep the adaptive pricing reservation. If the VAR task does not begin execution by the time S, then we use the adaptive pricing reservation and cancel the VAR task in the resource queue. Thus we improve the success rate by using both th concurrently. The combined schem fail when both the VAR and the e VAR and the adaptive pricing system e would only adaptive pricing fails to create the required reservation. The costs are lower than that of the adaptive pricing system since we never pay more than the adaptive reservation pricing for the reservation. The heuristic is shown formally in Figure 45. 109 n. In some cases, it might be unnecessary to consider the other alternative. For example, if the allocation cost of the adaptive pricing scheme is zero, then there is no need to consider the VAR request since its allocation cost cannot be lower than zero. Similarly when the adaptive pricing scheme fails to grant a reservation due to conflicts with currently running tasks and existing reservations, then the VAR system is the only alternative available. Figure 45. The algorithm combining VAR and the adaptive pricing reservatio Input, nodes(n), duration(d), start time(s) of the required reservation t = current time f = adaptive price for the reservation p = penalty factor for cancelling the reservation if (f == 0) { use the adaptive price reservation; return; } submit the task to the resource queue at time t1 if (VAR task does not begin execution by time s) { cancel the VAR task and use the adaptive pricing reservation; d use the adaptive pricing reservation; } t1 = submission time of the task determined by the VAR system return; } t2 = time when VAR task begins execution ( ≤ s) if ( (s - t2) x n + (p x f) ≤ f ){ cancel the adaptive pricing reservation and use the running VAR task for the reservation; } else { cancel the VAR task an In order to evaluate the performance of this combined scheme, we simulated the combined scheme using the experimental data gathered in the experiments in the 110 previous section and a penalty factor of 0.1 (i.e. 10%). It was possible to do the simulation since we had all the data for each experiment i.e. number of reservation nodes, the start time of reservation and its duration, the submission time of the VAR task for the reservation, the time when the VAR task started execution and the adaptive price for the reservation. Figure 46 shows the average allocation cost of the combined scheme as compared to the VAR and the adaptive pricing scheme for the 32 node reservations. As the figure shows, the allocation cost of the combined scheme is significantly less than that of the adaptive pricing scheme. For the case where the runtime is 1 hour the start time is 8 hours, the combined cost is less than either of the VAR and the adaptive pricing system. 
This can be understood since the combined scheme chooses the least costly of the VAR and the adaptive pricing scheme. However, it also has to take the penalty cost of cancellation into consideration which is the reason why the combined cost is more than the VAR cost in rest of the cases. and Figure 46. Allocation cost of combined scheme for 32 node reservations. 32 nodes, 1 hour runtime 0 2 6 8 14 8 A ocati n co 4 Start time (hrs) ll o st VR Combined AR 32 nodes, 4 hour runtime 0 1 2 3 4 14 8 A catio cos Start time (hrs) llo n t VR Combined AR 111 Figure 47 shows the success rate of the combined scheme as compared to that of the VAR and the adaptive pricing scheme for the 32 node reservations. As the figure shows, the success rate of the combined scheme is equal to or more than the success rate of the VAR or the adaptive pricing scheme. The reason is that the combined scheme only fails when both the VAR and the adaptive pricing scheme fails. Thus the combined scheme is pareto superior to the adaptive pricing scheme since it has a lower allocation cost and an equal or higher success rate. The combined scheme is not pareto superior to the VAR scheme since the VAR can still provide a lower allocation cost due to the penalty charges associated with Figure 47. Success rate of the combined scheme for the 32 node reservations. Figure 48 and Figure 49 show the allocation cost and the success rate of the combined scheme for the 64 node reservations. As the figures show, the combined scheme has lower allocation costs than the adaptive pricing scheme while having equal or higher success rate than either of the VAR or the adaptive pricing scheme. using the combined scheme. 32 nodes, 1 hour runtime 0 50 100 150 14 8 Start time (hrs) Succ s R te es a VR Combined AR 32 nodes, 4 hour runtime 0 50 100 150 14 8 Start time (hrs) Succ s R te es a VR Combined AR 112 64 nodes, 1 hour runtime F reservations. In this chapter, we have compared the allocation costs and the success rate of the using externally observable characteristics of an operational Grid infrastructure the running and queued jobs of the cluster at some time during the day, and used igure 48. Allocation cost of the combined scheme for the 64 node Figure 49. Success rate of the combined scheme for 64 node reservations. 6.5 Conclusion virtual reservation system and the adaptive pricing system. We did the simulation such as the NCSA TeraGrid cluster. Everyday, for 30 days, we took observation of the adaptive pricing algorithm and the VAR mechanism to figure out the allocation costs of the 12 reservations considered in this chapter. For the adaptive pricing 0 All 2 o 4 cati 6 on 8 10 cos 14 8 Start time (hrs) t VR Combined AR 64 nodes, 4 hour runtime 0 1 2 catio 3 14 8 Start time (hrs) Allo n co 4 st VR Combined AR 64 nodes, 1 hour runtime 0 14 50 cces 100 s R 150 8 Start time (hrs) Su e at VR Combined AR 64 nodes, 4 hour runtime 40 100 14 8 Start time (hrs) es ate VR Combined AR 60 80 s R 0 20 Succ 113 algorithm, we assumed that all the tasks currently in the system were part of a single queue for simplicity. Moreover, during the 20 minutes or so each day it took us to perform the experiment, the resource queue would have possibly changed. However, we ignored such changes as the QBETS prediction service is based on the observation that the percentile predictions are relatively stable often over many days [48]. 
This is another reason for choosing a maximum start deadline of 8 hours since inherently the VAR algorithm assumes that the QBETS predictions would not change during this interval of 8 hours. Increasing the deadline would have made the assumption less tenable. cing scheme would be more cost effective than the VAR scheme. There is also a qualitative difference between the allocation costs of the VAR scheme and the adaptive pricing scheme. In the case of VAR, the allocation cost represents wasted resources since the VAR task essentially is idling when it starts execution till the resource start time. This in turn reduces the effective utilizatio of th by notifying users when their tasks are idling. The allocation cost of the adaptive pricing scheme however, does not represent any lost utilization. It is just a From the set of experiments performed, we have observed that the VAR and the adaptive pricing scheme produces two complementary classes of service. The VAR has lower allocation costs while the adaptive pricing scheme has a higher rate of success. However, we have only experiments with start deadline of 1 to 8 hours. When the deadlines are very far away, we expect that the adaptive pri n e resource. Thus some resource providers discourage such practices 114 mechanism to differentiate between the requirements of different users as measured by their willingness to pay and thus provide multiple qualities of service to the users. 115 Chapter 7 Application Provisioning with the Adaptive P In this chapter, we consider resource planning for task graph structured applications when using the adaptive pricing model of the previous chapter. The resources in this case do not advertise a set of available slots, instead, users can query for slots, as in (17) - (19). Given r = 1,..,R number of resource providers in the system, one can still conceptually use multiple queries of the form (18), to crea (24) where, ricing Model te the global resource availability set, Â, where A ˆ = UU U R rNnD d r rr d n E .. 11 0 ) , ( =≤≤≤ < N r = max processors on site r D r = max allowed duration on site r E r (n,d) = as defined in (18) advertisement model, i.e. to derive a resource plan â from  so as to minimize the we can find slots of every dimension in Â, and thus we can restrict â to be a subset The provisioning problem remains the same as in (6) in chapter 4 using the slot allocation and the scheduling cost. Divisibility of slots is not a concern now since of Â. 116 However, the slots in  may no longer be independent of each other because the response from the resource to multiple queries might be based on the same underlying resource capacity and hence it might not be possible to provision all the slots in any given subset of Â. For example, consider the resource schedule in Figure 50 recreated from Figure 21 in the previous chapter. For a single query with n = 2, and d = 3 done on the resource with the schedule Figure 50. Resource schedule with two running and two queued tasks. shown in the figure, the resource responds with slots {<t 1 ,1,6>, <t 2 ,1,8>, <t 3 ,1,10>, <t 4 ,1,0>, <t 5 ,1,0>, <t 6 ,1,0>} with the prices computed in Table 3. This doesn’t imply that all these slots are independent and can be provisioned concurrently. The semantics of the response is that these are the possible alternatives for the slot queried. 
However, the slots in Â may no longer be independent of each other, because the responses from the resource to multiple queries might be based on the same underlying resource capacity and hence it might not be possible to provision all the slots in any given subset of Â. For example, consider the resource schedule in Figure 50, recreated from Figure 21 in the previous chapter. For a single query with n = 2 and d = 3 done on the resource with the schedule shown in the figure, the resource responds with the slots {<t_1,1,6>, <t_2,1,8>, <t_3,1,10>, <t_4,1,0>, <t_5,1,0>, <t_6,1,0>} with the prices computed in Table 3. This doesn't imply that all these slots are independent and can be provisioned concurrently. The semantics of the response is that these are the possible alternatives for the slot queried.

Figure 50. Resource schedule with two running and two queued tasks.

If we used the approach of chapter 4 and the final resource plan contained the slots starting at t_1, t_2 and t_3, it could not be provisioned because there aren't enough resources to reserve these 3 slots concurrently. When we combine the results of queries done with various values of n and d, there is significant overlap between the slots. Thus for every resource plan considered, MOGA would have to check whether it is provisionable in addition to being feasible for the application. Moreover, the number of slots in Â might become very large if the range of values that n and d can take in (24) is not properly restricted. Alternatively, the resource plan could be restricted to contain only a single slot and hence no possibility of overlap with other selected slots. However, that might lead to poor utilization since the application structure might not be a good fit for a single slot. Hence, in this chapter, we use a greedy algorithm for creating a resource plan for the application and compare its performance to the best effort approach.

7.1 Algorithm

The resource provisioning algorithm works by first ranking the tasks in the application using the HEFT order as in chapter 4. Then the tasks are sorted based on their ranks in non-increasing order. This leads to a topological sorting of the application. Next, tasks are considered in this sorted order. For a particular task under consideration, we first compute the earliest possible start time of the task for each resource in the system. This start time is based on the schedule of the parent tasks and the data transfer requirements. Then a query is sent to each resource to provide a list of start times and prices based on (19). The responses from all the resources are then combined. One <start time, price> tuple is then selected from this combined set that minimizes a linear normalized objective function of the finish time of the task and its cost based on a user supplied trade-off factor α in [0,1]. The task is then scheduled on the resource that provides the selected tuple at the start time listed in the tuple. Once the slot reservation of the task is done, the next highest ranked unscheduled task is considered. The algorithm is formally listed in Figure 51.
Figure 51. Greedy Provisioning Algorithm.
Compute ranks of all application tasks (bottom-up):
    Rank(v_i) = w_i + max over v_j in succ(v_i) of (c_i,j + Rank(v_j))
    w_i = average runtime of task i
    c_i,j = average data transfer time between tasks i and j
Sort the tasks in a list by non-increasing rank values
While there are unscheduled tasks in the list do
    Select the first unscheduled task, v_i, from the list
    Let n_i = number of processors required by task v_i
    For each resource r in the system
        Let d_ir = runtime of task v_i on resource r
        Let s_ir = earliest feasible start time of task v_i on resource r based on the scheduled finish times of its predecessors and the data transfer times
            s_ir = max over v_j in pred(v_i) of (finishTime(v_j) + data(v_j, v_i))
            data(v_j, v_i) = data transfer time between v_j and v_i (= 0 if v_j is scheduled on resource r)
    EndFor
    Â = ∪ over r = 1..R of E_r(s_ir, n_i, d_ir)
    For each tuple <s, c, f> in Â, define
        mks = s + d, where d = runtime of v_i on the resource that generated <s, c, f>
        cost = n_i x c x d + f
    mks_min (mks_max) = minimum (maximum) mks over Â
    cost_min (cost_max) = minimum (maximum) cost over Â
    Let <ŝ, ĉ, f̂> = the tuple <s, c, f> in Â that minimizes
        α x (cost - cost_min)/(cost_max - cost_min) + (1 - α) x (mks - mks_min)/(mks_max - mks_min)
    Schedule v_i on the resource r that generated <ŝ, ĉ, f̂>
    The start time of v_i becomes ŝ and the finish time of v_i becomes ŝ + d_ir
EndWhile
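A compact sketch of the greedy loop of Figure 51 is given below, assuming task objects that carry their HEFT rank, parents, processor requirement and per-resource runtimes, and resource objects exposing the query of form (19) and a reserve() call; all of these interfaces, including transfer_time(), are illustrative assumptions rather than the thesis implementation.

    def greedy_provision(tasks, resources, alpha):
        # Consider tasks in non-increasing HEFT rank order (a topological order).
        for task in sorted(tasks, key=lambda t: t.rank, reverse=True):
            options = []  # combined response set, one entry per candidate slot
            for r in resources:
                d = task.runtime_on(r)                        # d_ir
                earliest = max([0] + [p.finish + task.transfer_time(p, r)
                                      for p in task.parents])  # s_ir
                for (s, c, f) in r.query(task.processors, d, earliest):  # query of form (19)
                    cost = task.processors * c * d + f        # allocation cost of the slot
                    options.append((r, s, s + d, cost))
            finishes = [o[2] for o in options]
            costs = [o[3] for o in options]

            def norm(x, lo, hi):
                return 0.0 if hi == lo else (x - lo) / (hi - lo)

            # Linear normalized objective: alpha weighs cost, (1 - alpha) weighs finish time.
            r, s, finish, _ = min(options, key=lambda o:
                                  alpha * norm(o[3], min(costs), max(costs)) +
                                  (1 - alpha) * norm(o[2], min(finishes), max(finishes)))
            r.reserve(task, s)                                # provision the selected slot
            task.start, task.finish, task.resource = s, finish, r

The per-task query keeps the candidate set small (one response list per resource per task) instead of materializing the full set Â of (24), which is the design choice motivating the greedy approach.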
7.2 Evaluation

We use the same trace-based simulation setup of previous chapters to evaluate the performance of resource provisioning as compared to best effort execution in the context of the adaptive pricing model. The resources operate under the same assumption that submitted best effort tasks request their actual running time and that the resource scheduling policy is conservative backfilling. Each resource was simulated separately. Users can query for slots of a specific dimension with a start time no earlier than a specified start time as in (19), and the resource uses the heuristics described in the previous chapter to determine a set of feasible start times and prices at these start times and sends the response back to the user. The per unit charge (c) for the slots is always equal to one.

Executing an application using best effort scheduling was done by simulating a just-in-time scheduling policy as in chapter 4. First the application tasks are ranked in HEFT order and then they are submitted to the resource queue in this order when they become ready for execution, i.e., when all their parent tasks finish execution. We use the same task ordering for the best effort and the provisioned approach to eliminate any task ordering related impact on the performance of these approaches. The time difference between the time when the first task in the application was submitted to the resource queue and the time when the last task finished execution is taken as the best effort makespan of the application. The allocation cost of best effort execution is the same as defined in (16).

We use the same 100 task application used for evaluation in chapter 4 and shown in Figure 12. When the application is submitted, the algorithm in Figure 51 is used to provision resources for the application using a given trade-off factor α. We compute and record the makespan and the allocation cost. The allocation cost is the sum of the total costs of all the provisioned slots. The application is then executed using best effort. The allocation cost of the best effort approach is always constant for an application, as given by (16).

This process is repeated 50 times by submitting the application at different times during the simulation run, and the average is taken. The makespan and allocation cost are measured in hours and service units (SU) respectively, with 1 SU = 1 processor hour.

Impact of trade-off factor

Figure 52 shows the makespan and the allocation cost of the provisioned and the best effort approach for the CTC and the SDSC clusters. The makespan with the provisioned approach is consistently better than the best effort makespan for different values of the trade-off factor (α), at the cost of some additional allocation cost. When the trade-off factor α = 1 is used, the makespan is less than the best effort makespan but the allocation cost is the same as the best effort allocation cost. The reason is that when α = 1, the slots chosen for the tasks always have the additional cost, f, equal to zero, and hence they cost the same as their best effort counterpart. Thus the provisioned approach is Pareto-superior to the best effort approach under the assumption of actual runtime estimates of best effort tasks.

The provisioned allocation cost is higher than the best effort allocation cost particularly when using low values of α. This can be understood since many best effort tasks in the resource queue can get delayed in order to provision slots early, and these delays contribute to a higher allocation cost. The allocation cost is higher for the SDSC cluster (as compared to the best effort allocation cost) than for the CTC cluster. This is due to the higher utilization of the SDSC cluster (70%) than the CTC cluster (63%), which implies that more tasks are likely to get affected at the SDSC cluster than at the CTC cluster. Also, the provisioned allocation costs in Figure 52 are generally higher than the corresponding allocation costs in Figure 13. In the latter case, we made sure not to delay any best effort task and the extra allocation overhead was only due to the misfit between application tasks and the advertised slots. Thus the allocation cost was not directly dependent on the resource workload, whereas now it is. However, we now have more opportunities to optimize the application makespan than before, and thus the provisioned makespan in Figure 52 is generally less than the provisioned makespan in Figure 13.

Figure 52. Makespan and allocation cost of the best effort and the provisioned approach for different trade-off factors.

Figure 52 also shows error bars that indicate the range of values that fall within one standard deviation of the mean. The variability of the makespan with provisioning is considerably less than that of best effort, providing better insulation for the application from the dynamic workload of the resource. The variability of the provisioned allocation cost, however, can be significant, particularly with low values of α. The reason is the workload dependency of the allocation cost and the increased emphasis on makespan minimization at these α values.
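For reference, the best effort baseline used throughout this comparison reduces to two measurements over the simulated task records: the time from the first submission to the last completion, and the processor-hours consumed. The sketch below is illustrative only; the TaskRecord fields are hypothetical, and expressing the constant best effort allocation cost as processors x runtime at a per unit charge of one is our reading, since equation (16) is not reproduced in this chapter.

```python
from dataclasses import dataclass

@dataclass
class TaskRecord:
    processors: int    # processors used by the task
    runtime: float     # task runtime in hours
    submit: float      # time the task was submitted to the resource queue
    finish: float      # time the task finished execution

def best_effort_makespan(records):
    """Time from the first submission to the last task completion (hours)."""
    return max(r.finish for r in records) - min(r.submit for r in records)

def best_effort_cost(records, charge=1.0):
    """Allocation cost in SU, assuming 1 SU = 1 processor hour and a per unit charge of one."""
    return sum(r.processors * r.runtime * charge for r in records)

# Example with two tasks of a toy workflow.
records = [TaskRecord(4, 2.0, submit=0.0, finish=5.0),
           TaskRecord(2, 1.0, submit=5.0, finish=7.5)]
print(best_effort_makespan(records))  # 7.5 hours
print(best_effort_cost(records))      # 10.0 SU
```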
Impact of workflow size

We also performed experiments to determine the impact of the workflow size, in terms of the number of tasks in the workflow, on the performance of the provisioned and the best effort approach. Figure 53 shows the impact of the workflow size on the makespan and allocation cost. The number of tasks in the workflow is varied from 100 to 400. The depth of the workflow is √n, where n is the total number of tasks in the workflow. This ensures that the workflow is well balanced. The rest of the workflow characteristics, such as the average task runtimes and the number of processors per task, are the same.
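The following sketch shows one way such a balanced synthetic workflow could be generated: n tasks arranged in √n levels, with edges between consecutive levels. The level-to-level dependency pattern and the runtime distribution are assumptions introduced here for illustration, not the dissertation's actual workload generator.

```python
import math
import random

def balanced_workflow(n, avg_runtime=1.0, avg_procs=3):
    """Generate a synthetic workflow of n tasks arranged in sqrt(n) levels,
    so that depth = sqrt(n) and the workflow is well balanced."""
    depth = int(round(math.sqrt(n)))
    width = n // depth
    tasks, edges, prev_level = {}, [], []
    tid = 0
    for level in range(depth):
        # Put any leftover tasks in the last level so exactly n tasks are created.
        count = width if level < depth - 1 else n - width * (depth - 1)
        current = []
        for _ in range(count):
            tasks[tid] = {"runtime": random.expovariate(1.0 / avg_runtime),
                          "processors": avg_procs}
            # Assumed dependency pattern: each task depends on every task
            # of the previous level.
            edges.extend((p, tid) for p in prev_level)
            current.append(tid)
            tid += 1
        prev_level = current
    return tasks, edges

tasks, edges = balanced_workflow(100)
print(len(tasks), "tasks in", int(round(math.sqrt(100))), "levels")
```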
Figure 53. Impact of workflow size.

Figure 53 shows the provisioned makespan and allocation cost using trade-off factors of 0 and 1, which represent the range of values achievable through provisioning. The figure shows that the application performance in terms of makespan is better than best effort for all workflow sizes. The reduction in makespan actually increases with the workflow size in the case of the SDSC cluster. The reason is that the impact of queue wait times on the best effort makespan is magnified due to the increase in the number of levels in the workflow. There is no significant difference in the makespan between doing provisioning at the earliest possible time (α = 0) and at the earliest time when no best effort jobs are delayed (α = 1). However, the allocation cost between the two differs significantly. Thus the latter approach seems preferable, where the user pays the same as the best effort cost but the performance achieved is significantly better than best effort.

Impact of task size

Figure 54 shows the range of values that the makespan and allocation cost of the provisioned approach can take based on different trade-off factors, using the 100 task workflow. The applications considered differ in the average number of processors per task. The performance of the provisioned approach is shown using trade-off factors of zero and one, which represent the lower and upper bounds for the makespan and the upper and lower bounds for the allocation cost of the provisioned approach respectively. The difference between the bounds represents the range of values that the makespan and allocation cost can take for different values of the trade-off factor. For example, when the trade-off factor of 0 was used for the CTC cluster with an average of 3 processors per task in the application, the provisioned makespan was 0.85 times the best effort makespan and the provisioned allocation cost was 2.2 times the best effort allocation cost. When the trade-off factor was 1, the provisioned makespan was 0.89 times the best effort makespan and the provisioned allocation cost was the same as the best effort allocation cost. These values constitute the extremes of the bars shown in the figure. For trade-off factors between 0 and 1, the values would fall in the range between these extremes. When the trade-off factor is 1, represented by the upper ends of the makespan bars and the lower ends of the allocation cost bars, the allocation cost is always equal to its best effort value while the makespan is better than the best effort makespan, implying that provisioning is empirically Pareto-superior to best effort.

Figure 54. Makespan and allocation cost for varying task sizes.

As Figure 54 shows, the provisioned makespan is always less than the best effort one in all cases. The provisioned allocation cost varies from being equal to the best effort cost to being 2 times the cost in the CTC cluster and about 6 times the cost in the SDSC cluster for applications with small sized tasks. As the average task size increases, both the provisioned makespan and allocation cost improve as compared to best effort. The provisioned makespan improves because, as the task sizes increase, conservative backfilling cannot find effective backfilling opportunities for the best effort case and hence the best effort makespan increases while the provisioned makespan is unaffected. The provisioned allocation cost improves (relative to the best effort cost) because the extra allocation cost for the provisioned approach depends on the delay caused to the best effort tasks by the provisioned resources. This delay depends more on the duration of the provisioned resource than on its size. Thus the extra cost is not significantly affected while the absolute magnitude of the best effort allocation cost increases because of the increase in task size. Thus, when normalized, the extra cost decreases in magnitude when task sizes increase. When compared to Figure 15, makespans are lower and allocation costs are higher in Figure 54. The reasons are the additional flexibility in slot placement and the workload dependency of the allocation costs, as mentioned earlier.

Impact of resource utilization

We also investigated the effect of increased resource utilization. Figure 55 shows the makespan and allocation cost of the provisioned approach normalized with respect to best effort for different resource utilization levels, with workflows differing in the average number of processors required per task. The method used to increase the resource utilization is the same as used in chapter 4, and a single trade-off factor of 0.5 is used for generating the provisioned results. Generally, the provisioned makespan becomes smaller as compared to the best effort makespan as the resource utilization increases and/or the task sizes increase. The reason is that, due to the increased load and task sizes, the queue wait times of the best effort tasks increase because of decreased backfilling opportunities, leading to an increased makespan. However, the provisioned makespan is not much affected by the resource workload.

Figure 55. Effect of resource utilization.
The allocation cost of the provisioned approach is higher than the best effort allocation cost. This is due to the fact that we use a trade-off factor of 0.5 that tries to optimize both makespan and allocation cost. Any optimization of the makespan over the best effort case involves delaying tasks from the resource queue, and this results in an increased allocation cost. The allocation cost is also higher as compared to Figure 16 due to its workload dependency.

7.3 Multi-Site scheduling

We also performed experiments with multiple simulated resources in the system using the exact setup of chapter 4, section 4.9. Figure 56 shows the makespan and allocation cost of the provisioned and best effort approach. With increasing resources, the best effort makespan decreases due to the presence of less loaded resources in the system. However, the provisioned makespan does not seem to be affected much. The reason is that even when resources are highly loaded, as in the case with only 2 resources in the system, provisioning can negotiate an early starting time for the tasks at the expense of an increased allocation cost. Thus, the high workload of the resources only affects the price and not the performance of provisioning with the adaptive pricing model.

Figure 56. Comparison of provisioning and best effort in a multi resource setting.

The provisioned makespan is also lower than that of Figure 19, showing that the adaptive pricing scheme allows better optimization of the application performance than the slot advertisement model of chapter 3. However, the allocation cost of the adaptive pricing model (Figure 56) is higher than that of the fixed price model (Figure 19). When the system consists of only two highly loaded resources, the provisioned allocation cost is about 43% more than the best effort allocation cost. This can be understood since, when the workload is high, more tasks are delayed in order to provide an earlier start time for the application tasks. However, as Figure 56 shows, the greedy provisioning algorithm of Figure 51 is able to take advantage of less loaded resources in the system in order to decrease the allocation cost while providing similar or better application performance.

Chapter 8: Conclusion

In this dissertation, we have presented an end-to-end framework for providing deterministic quality of service and utilizing that quality of service for application level performance optimization while addressing fairness issues with regard to best effort services. The main contributions of this thesis are as follows:

• Development of the slot abstraction as the fundamental unit of resource capacity that the user can provision. We also developed an algorithm for determining slot availability in a batch queued resource management system. The algorithm can co-exist with any internal scheduling policy of the resource management system such as aggressive or conservative backfilling or other priority based schemes. Furthermore, when the tasks provide tight estimates of their runtime, we designed a fast scheduling algorithm for the resource that can generate the set of available slots in constant time.
• We also developed heuristics to allow more flexibility to the user in determining the set of slots or reservations required by the user. We designed a corresponding pricing policy that had multiple objectives:
  - Ensure fairness between the provisioned and the best effort users. This was achieved by having price differentiation between the provisioned and best effort resources. Using trace based simulations and logs from supercomputing centers, we showed that there is no incentive for the user to make reservations over using the best effort queue unless truly required, as measured by the user's willingness to pay.
  - Allow more degrees of freedom to the user. Since the best effort workload is not considered during admission control of reservations, there is more flexibility in creating reservations.
  - Be adaptive based on the resource load. Since the prices increase or decrease exponentially based on the resource load, this leads to load balancing in the system and higher utilization of least loaded resources and lower utilization of highly loaded resources.

• We compared the cost of our reservation pricing scheme to that of a probabilistic virtual reservation system. By using traces from a currently operational supercomputing center, we showed that they provide two different classes of service, one with lower cost and the other with higher success rate. Moreover, we showed that we can combine the two to come up with a unified scheme that has a lower cost than our reservation scheme with the same or higher success rate.

• We developed a model for representing the various costs faced by the users when executing their applications using the slot abstraction. The costs can be used to capture the resource costs incurred by the user and the performance obtained from those resources. We formulate the resource provisioning problem as a multi-objective optimization problem using these costs.

• We used the concept of Pareto optimality for solving the optimization problem. The Pareto set was evolved using a Multi Objective Genetic Algorithm (MOGA). In order to do so, we developed a suitable encoding of the solution of the provisioning problem that could be used by MOGA. Furthermore, we allowed human interaction in the form of a user supplied trade-off factor that represents the user's sensitivity between the resource costs and the application performance. Additional constraints such as deadlines and budgets can be trivially included in the solution procedure.

• The performance of the provisioned approach was compared to best effort using trace based simulation of large scale systems. Using artificial and real applications, we validated our main thesis that the provisioning approach can provide better performance to the users than the best effort quality of service. We used a real seismic hazard analysis application and showed that provisioning can reduce the workflow completion time by 50% under high load conditions. Using a variety of artificial applications that differ in size and task parallelism, we showed that our provisioning approach produces better results when the resources are under high utilization and/or the applications have large resource requirements. Both of these characteristics are visible at the current operational Grid sites such as the TeraGrid. These sites target large-scale applications with highly parallel tasks which are a natural match for these resources. Moreover, due to the shared nature of these resources, the large number of users, and their production nature, these sites generally show high to very high utilization levels (60-80+%) [34].
• Finally, we extended the GridSim simulator [12] in order to perform the simulation experiments described in this dissertation. We extended the simulator to enable parallel job scheduling; the simulator initially was only able to schedule uniprocessor tasks. We also implemented backfilling based scheduling algorithms such as aggressive and conservative backfilling for resource management; the simulator originally had support only for the First Come First Serve scheduling discipline.

In this dissertation, we have restricted our attention to provisioning of compute resources. However, this framework can be extended to consider other types of resources as well. In our experiments, provisioning of network resources was not considered since each application was executed on a single cluster. Provisioning of network resources would become important when co-allocating resources from multiple providers. The design of a generic slot based resource manager that can be used to manage network resources is discussed in [28]. The idea is to manage allocations using a reservation table similar to the site scheduler in Chapter 3 while doing the actual reservation using the RSVP [61] protocol. Hard guarantees can be obtained if the resource manager does admission control as well.

8.1 Future Work

We have shown that the adaptive pricing scheme of chapter 5 has several nice properties using trace based simulations. A future extension of this work would be a more formal treatment of the subject, showing mathematical properties of the system using game theoretic concepts. For example, it would be good to show that truth telling by all parties leads to a kind of Nash equilibrium in the system. This has been shown to be the case in very simple formulations of the problem [47]. It would be desirable to extend this work to more realistic systems.

Another possible direction for future work would be to consider the effect of uncertainties in the task runtime estimates. In the current model, we have assumed that the application task runtimes are known accurately. However, it may be the case that the task runtimes are stochastic. This adds another variable to the problem. Ideally we would like to provision more resources, since if the slot expires before the task completes, we lose all the work. On the other hand, allocating more resources would obviously cost more, adding to the resource costs of the application and to the underutilization of the resources. We could relinquish the resources when the task is completed, but this would likely incur a cancellation penalty from the resource. Thus the right amount of resources to provision is an optimization problem in itself.

Bibliography

[1] "The Grid Resource Allocation and Agreement Protocol Working Group," https://forge.gridforum.org/projects/graap-wg.
[2] "Load Sharing Facility (LSF)," http://wwwpdp.web.cern.ch/wwwpdp/bis/services/lsf/.
[3] "Maui Cluster Scheduler," http://www.clusterresources.com/pages/products/maui-cluster-scheduler.php.
[4] "The Open Science Grid Consortium," http://www.opensciencegrid.org.
[5] "PBSPro," http://www.pbspro.com.
[6] "Sun Grid Engine," http://www.sun.com/software/gridware.
[7] "Workshops on Job Scheduling Strategies for Parallel Processing," http://www.cs.huji.ac.il/~feit/parsched/.
[8] S. Andreozzi, et al., "Agreement-Based Workload and Resource Management," presented at First International Conference on e-Science and Grid Computing, 2005.
[9] F. Berman, et al., "Adaptive computing on the Grid using AppLeS," Parallel and Distributed Systems, IEEE Transactions on, vol. 14, pp. 369-382, 2003.
[10] J. Brevik, et al., "Predicting Bounds on Queueing Delay in Space-Shared Computing Environments," presented at IEEE International Symposium on Workload Characterization, 2006.
[11] J. Brevik, et al., "Predicting bounds on queuing delay for batch-scheduled parallel machines," presented at ACM Symposium on Principles and Practice of Parallel Programming, New York, 2006.
[12] R. Buyya and M. Murshed, "GridSim: A Toolkit for the Modeling and Simulation of Distributed Resource Management and Scheduling for Grid Computing," Concurrency and Computation: Practice and Experience, vol. 14, pp. 1175-1220, 2002.
[13] J. Cao and F. Zimmermann, "Queue scheduling and advance reservations with COSY," presented at Parallel and Distributed Processing Symposium, 2004. Proceedings. 18th International, 2004.
[14] H. Casanova, et al., "Heuristics for scheduling parameter sweep applications in grid environments," presented at Heterogeneous Computing Workshop (HCW), 2000.
[15] C. Catlett, "The philosophy of TeraGrid: building an open, extensible, distributed TeraScale facility," presented at Cluster Computing and the Grid, 2nd IEEE/ACM International Symposium CCGRID2002, 2002.
[16] W. Cirne and F. Berman, "A comprehensive model of the supercomputer workload," presented at IEEE International Workshop on Workload Characterization, WWC-4, 2001.
[17] K. Czajkowski, et al., "Agreement-based resource management," Proceedings of the IEEE, vol. 93, pp. 631-643, 2005.
[18] K. Deb, Multi-Objective Optimization using Evolutionary Algorithms: John Wiley & Sons, 2001.
[19] E. Deelman, et al., "Managing Large-Scale Workflow Execution from Resource Provisioning to Provenance Tracking: The CyberShake Example," presented at Second IEEE International Conference on e-Science and Grid Computing, 2006.
[20] A. B. Downey, "Predicting queue times on space-sharing parallel computers," presented at Proceedings of the 11th International Parallel Processing Symposium, 1997.
[21] K. K. Droegemeier, et al., "Linked Environments for Atmospheric Discovery (LEAD): A CyberInfrastructure for Mesoscale Meteorology Research and Education," presented at 20th Conference on Interactive Information Processing Systems for Meteorology, Oceanography, and Hydrology, Seattle, WA, 2004.
[22] D. G. Feitelson, "Logs of real parallel workloads from production systems," http://www.cs.huji.ac.il/labs/parallel/workload.
[23] D. G. Feitelson and L. Rudolph, "Parallel Job Scheduling: Issues and Approaches," in Job Scheduling Strategies for Parallel Processing, D. G. Feitelson and L. Rudolph, Eds.: Springer-Verlag, 1995, pp. 1-18.
[24] D. G. Feitelson, et al., "Theory and Practice in Parallel Job Scheduling," in Proceedings of the Job Scheduling Strategies for Parallel Processing, Springer-Verlag, 1997, pp. 1-34.
[25] C. M. Fonseca and P. J. Fleming, "Genetic algorithms for multiobjective optimization: Formulation, discussion, and generalization," presented at Proceedings of the Fifth International Conference on Genetic Algorithms, 1993.
[26] I. Foster, "Globus Toolkit Version 4: Software for Service-Oriented Systems," Journal of Computer Science and Technology, vol. 21(4), pp. 513-520, 2006.
[27] I. Foster, "What is the Grid? A Three Point Checklist," in GRIDToday, July 20, 2002.
[28] I. Foster, et al., "A Distributed Resource Management Architecture that Supports Advance Reservations and Co-Allocation," presented at Proc. International Workshop on Quality of Service, 1999.
[29] M. H. Haji, et al., "A SNAP-Based Community Resource Broker Using a Three-Phase Commit Protocol: A Performance Study," The Computer Journal, vol. 48, pp. 333-346, 2005.
[30] M. Harris and A. Raviv, "A Theory of Monopoly Pricing Schemes with Demand Uncertainty," American Economic Review, vol. 71, pp. 347-365, 1981.
[31] F. Heine, et al., "On the Impact of Reservations from the Grid on Planning-Based Resource Management," in Computational Science - ICCS, vol. 3516, Lecture Notes in Computer Science: Springer Berlin, 2005, pp. 155-162.
[32] G. Hoo, et al., "QoS as middleware: bandwidth reservation system design," presented at High Performance Distributed Computing, 1999. Proceedings. The Eighth International Symposium on, 1999.
[33] M. Hovestadt, et al., "Scheduling in HPC Resource Management Systems: Queueing vs. Planning," in Job Scheduling Strategies for Parallel Processing, D. G. Feitelson, et al., Eds.: Springer Verlag, 2003, pp. 1-20.
[34] A. Iosup, et al., "How are Real Grids Used? The Analysis of Four Grid Traces and its Implications," presented at 7th IEEE/ACM International Conference on Grid Computing, Barcelona, Spain, 2006.
[35] D. Jackson, et al., "Core Algorithms of the Maui Scheduler," in Job Scheduling Strategies for Parallel Processing, D. G. Feitelson and L. Rudolph, Eds.: Springer Verlag, 2001, pp. 87-102.
[36] D. B. Jackson, et al., "Core Algorithms of the Maui Scheduler," in Revised Papers from the 7th International Workshop on Job Scheduling Strategies for Parallel Processing, Springer-Verlag, 2001, pp. 87-102.
[37] J. Blythe, S. Jain, E. Deelman, Y. Gil, K. Vahi, A. Mandal, and K. Kennedy, "Task Scheduling Strategies for Workflow-based Applications in Grids," presented at IEEE International Symposium on Cluster Computing and the Grid, Cardiff, UK, 2005.
[38] D. S. Katz, et al., "A Comparison of Two Methods for Building Astronomical Image Mosaics on a Grid," presented at Parallel Processing, 2005. ICPP 2005 Workshops. International Conference Workshops on, 2005.
[39] A. Keller and A. Reinefeld, "CCS resource management in networked HPC systems," presented at Heterogeneous Computing Workshop, 1998 (HCW 98), Proceedings, Seventh, 1998.
[40] D. Kuo and M. Mckeown, "Advance Reservation and Co-Allocation Protocol for Grid Computing," presented at First International Conference on e-Science and Grid Computing, 2005.
[41] D. Lifka, "The ANL/IBM SP Scheduling System," in Job Scheduling Strategies for Parallel Processing, D. G. Feitelson and L. Rudolph, Eds.: Springer-Verlag, 1995, pp. 295-303.
[42] A. Mandal, et al., "Scheduling Strategies for Mapping Application Workflows onto the Grid," presented at The 14th IEEE International Symposium on High Performance Distributed Computing (HPDC-14), 2005.
[43] M. G. Marchand, "Priority Pricing," Management Science, vol. 20, pp. 1131-1140, 1974.
[44] H. Mendelson, "Pricing Computer Services: Queueing Effects," Communications of the ACM, vol. 28, pp. 312-321, 1985.
[45] H. Mendelson and S. Whang, "Optimal Incentive-Compatible Priority Pricing for the M/M/1 Queue," Operations Research, vol. 38, pp. 870-883, 1990.
[46] A. W. Mu'alem and D. G. Feitelson, "Utilization, predictability, workloads, and user runtime estimates in scheduling the IBM SP2 with backfilling," Parallel and Distributed Systems, IEEE Transactions on, vol. 12(6), pp. 529-543, 2001.
[47] A. Mutz, et al., "Eliciting honest value information in a batch-queue environment," presented at 8th IEEE/ACM International Conference on Grid Computing, 2007.
[48] D. Nurmi, et al., "VARQ: Implementing Probabilistic Advanced Reservations for Batch-scheduled Parallel Machines," Technical Report, available at http://www.cs.ucsb.edu/research/tech_reports/reports/2007-09.pdf, 2007.
[49] D. Nurmi, et al., "Evaluation of a Workflow Scheduler Using Integrated Performance Modelling and Batch Queue Wait Time Prediction," presented at SuperComputing Conference, Tampa, Florida, 2006.
[50] QBETS web service, http://nws.cs.ucsb.edu/ewiki/nws.php?id=QBETS+Web+Service.
[51] T. Roblitz and A. Reinefeld, "Co-reservation with the concept of virtual resources," presented at IEEE International Symposium on Cluster Computing and the Grid, 2005.
[52] T. Roblitz, et al., "Elastic Grid Reservations with User-Defined Optimization Policies," presented at Proceedings of the Workshop on Adaptive Grid Middleware, 2004.
[53] G. Sabin, et al., "Assessment and enhancement of meta-schedulers for multi-site job sharing," presented at High Performance Distributed Computing, 2005. HPDC-14. Proceedings. 14th IEEE International Symposium on, 2005.
[54] M. Siddiqui, et al., "Grid Capacity Planning with Negotiation-based Advance Reservation for Optimized QoS," presented at SuperComputing Conference, Tampa, Florida, 2006.
[55] W. Smith, et al., "Scheduling with advanced reservations," presented at Parallel and Distributed Processing Symposium, 2000. IPDPS 2000. Proceedings. 14th International, 2000.
[56] H. Topcuoglu, et al., "Performance-Effective and Low-Complexity Task Scheduling for Heterogeneous Computing," IEEE Transactions on Parallel and Distributed Systems, vol. 13(3), pp. 260-274, 2002.
[57] M. Wieczorek, et al., "Scheduling of scientific workflows in the ASKALON grid environment," SIGMOD Rec., vol. 34, pp. 56-62, 2005.
[58] M. Wieczorek, et al., "Applying Advance Reservation to Increase Predictability of Workflow Execution on the Grid," presented at Second IEEE International Conference on e-Science and Grid Computing, Amsterdam, 2006.
[59] R. Wilson, "Efficient and Competitive Rationing," Econometrica, vol. 57, pp. 1-40, 1989.
[60] R. Wolski, et al., "The Network Weather Service: A Distributed Resource Performance Forecasting Service for Metacomputing," Future Generation Computer Systems, vol. 15, pp. 757-768, 1999.
[61] L. Zhang, et al., "RSVP: a new resource ReSerVation Protocol," Network, IEEE, vol. 7, pp. 8-18, 1993.
[62] H. Zhao and R. Sakellariou, "Advance Reservation Policies for Workflows," presented at 12th Workshop on Job Scheduling Strategies for Parallel Processing (JSSPP), Saint-Malo, France, 2006.
[63] S. Zhou, et al., "Utopia: a load sharing facility for large, heterogeneous distributed computer systems," Softw. Pract. Exper., vol. 23, pp. 1305-1336, 1993.
Abstract
Resources in distributed infrastructure such as the Grid are typically autonomously managed and shared across a distributed set of end users. These characteristics result in a fundamental conflict: resource providers optimize for throughput and utilization which coupled with a stochastic multi-user workload results in non-deterministic best effort service for any one application. This conflicts with the user who wants to optimize end-to-end application performance but is constrained by the best effort service offering. Resource provisioning can be used to obtain a deterministic quality of service but it is generally not allowed due to the perceived impact on the other users and overall resource utilization. Without a deterministic quality of service, it is not possible to co-allocate resources from multiple providers in a scheduled manner and thus realize the true potential of Grid Computing.