RESOURCE MANAGEMENT FOR SCIENTIFIC WORKFLOWS

by

Gideon M. Juve

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)

May 2012

Copyright 2012 Gideon M. Juve

For my parents, Mark and Jenny Juve, who encouraged me to work hard and always have fun, and my wife, Beth Juve, for her love and support

Acknowledgments

Many people have helped me with my research and in the production of this thesis. I would like to express my deep gratitude to all of the family, friends, faculty, students and staff who have given me their support, encouragement, and assistance.

First and foremost, I would like to thank my parents, Mark and Jenny Juve. I am grateful that I have parents who have always encouraged me to challenge myself, and who passed on to me the strong work ethic that is the primary reason for all my successes. I would also like to thank my wife, Beth Juve, for being a supportive and patient partner, for not complaining even once about how long I took to finish, and for being my best friend.

I have been very fortunate to have a superb advisor, Ewa Deelman. She provided me with excellent guidance, allowed me to investigate my own ideas, gave me many opportunities to learn, and helped me write many papers, book chapters, articles, presentations, and theses.

I am grateful to my committee members, Tom Jordan, Ann Chervenak and Aiichiro Nakano, who have all given me excellent guidance and advice. Tom sparked my interest in computational science when I was an intern at SCEC in 2003, and I am grateful for all the opportunities and advice he has given me. Ann co-authored the research presented in Chapter 5, and provided helpful input on my research and drafts of my thesis. Aiichiro is the best teacher I have had at USC and it has been my privilege to have taken three of his classes.

I owe a lot to Sue Perry, who was my supervisor when I was an intern at SCEC and who encouraged me to return to USC for graduate school. I definitely would not be where I am now without her help and guidance. I would like to thank Phillip Maechling for being a mentor during my time at SCEC, for helping me with the Broadband and CyberShake workflows, and for contributing many helpful comments and suggestions to the papers we have co-authored together.

I am fortunate to have many good friends and colleagues at ISI. Karan Vahi, Gaurang Mehta, Mats Rynge, Fabio Silva and Jens Vöckler have all been wise counselors and great sources of ideas and discussion. Karan provided a great deal of support in learning and using Pegasus, discussed many ideas with me, and implemented changes in Pegasus to support the research in Chapter 4. Gaurang helped me with everything from planning conference trips to system administration. He also helped me get binaries and data for the Montage, Broadband and Epigenome workflows that have been used throughout this thesis. Mats and I worked together on Periodograms and the mpidag tool, and he was a collaborator on several research projects that contributed to this thesis. Fabio has been a great friend and colleague on the Stampede project and discussed many ideas with me. Jens has been a source of many ideas, has helped me with his deep UNIX knowledge, and has been a collaborator on several papers related to the research presented in Chapters 3 and 4.
I am grateful to my office mates Scott Callaghan, Shishir Bharathi and Weiwei Chen for many productive conversations and ideas. One of Shishir's papers was the foundation for the research presented in Chapter 5, and he wrote the code used to generate most of the synthetic workflows that were used in Chapter 6. Scott was my office mate and colleague at SCEC, where we had many fruitful discussions while I was working on the topics in Chapter 2. He also helped me with the CyberShake and Broadband workflows used throughout this thesis. Weiwei has been an excellent sounding board for new ideas and a great help in dissecting technical problems.

I am very happy to have worked with Bruce Berriman on the Montage and Periodograms applications. I have been fortunate to be a co-author with Bruce on many publications, several of which make up the bulk of Chapter 4.

I would like to thank Maciek Malawski for all of his work on the research described in Chapter 6. Maciek developed the code for the DPDS and WA-DPDS algorithms, wrote a significant portion of the paper on which Chapter 6 is based, and was instrumental in developing the experiments, analyses, and plots.

Finally, I would like to thank everyone at ISI, SCEC, the Viterbi School of Engineering, the Department of Computer Science, and the Department of Earth Sciences who have helped me reach this point: Carl Kesselman, Larry Godinez, Kathy Thompson, Liszl DeLeon, Steve Schrader, Flor Martinez, Shirley Chan, John McRaney, Tran Huynh, Mark Benthien, John Yu, Maria Liukis, Patrick Small, Kevin Milner, David Okaya, David Meyers, Nitin Gupta, Vipin Gupta, Cynthia Waite, Vardui Ter-Simonian, Muhammad Ali Amer, Gurmeet Singh, Raphael Bolze, Rubing Duan, Mei-Hui Su, Ken Johnson, Marc Romero, Tom Wisniewski, Prasanth Thomas, and Rajiv Mayani.

Table of Contents

Dedication
Acknowledgments
List of Tables
List of Figures
Abstract
Chapter 1: Introduction
Chapter 2: Resource Provisioning in the Grid
    2.1 Problems with the traditional model
    2.2 Alternative strategies
    2.3 Benefits of resource provisioning
    2.4 Resource provisioning on the grid
    2.5 Related Work
    2.6 System Requirements
    2.7 Architecture and Implementation
        2.7.1 Operation
        2.7.2 Features
    2.8 Evaluation
        2.8.1 Resource Provisioning Overhead
        2.8.2 Job Execution Delays
        2.8.3 Provisioning for Workflows
        2.8.4 Example Application: CyberShake
    2.9 Summary
Chapter 3: Resource Provisioning in the Cloud
    3.1 The benefits of cloud computing for workflows
    3.2 Virtual Clusters
    3.3 Related Work
    3.4 System Requirements
    3.5 Architecture and Implementation
        3.5.1 Specifying Deployments
        3.5.2 Deployment Process
        3.5.3 Plugins
        3.5.4 Dependencies and Groups
        3.5.5 Security
    3.6 Evaluation
        3.6.1 Base VM provisioning time
        3.6.2 Deployment with no plugins
        3.6.3 Deployment for workflow applications
        3.6.4 Multi-cloud virtual cluster
    3.7 Example Applications
        3.7.1 Data Storage Study
        3.7.2 Periodograms
    3.8 Summary
Chapter 4: Evaluating the Performance and Cost of Workflows in the Cloud
    4.1 Related Work
    4.2 Workflow Applications
    4.3 Single Node Study
        4.3.1 Resources
        4.3.2 Storage
        4.3.3 Performance Results
        4.3.4 Cost Results
        4.3.5 Cost-Performance Analysis
        4.3.6 Discussion
    4.4 Data Storage Study
        4.4.1 Storage Systems
        4.4.2 Performance Results
        4.4.3 Cost Results
        4.4.4 Submit Host Placement
        4.4.5 Cluster Compute Comparison
        4.4.6 Discussion
    4.5 Astronomy in the Cloud
        4.5.1 Mosaic Service
        4.5.2 Periodograms
    4.6 Summary
Chapter 5: Workflow Profiling and Characterization
    5.1 Workflow traces
    5.2 Workflow profiles
    5.3 Related Work
    5.4 Approach
    5.5 Workflow profiler
    5.6 Application Profiles
        5.6.1 Montage
        5.6.2 CyberShake
        5.6.3 Broadband
        5.6.4 Epigenomics
        5.6.5 LIGO Inspiral Analysis
        5.6.6 SIPHT
    5.7 Summary
Chapter 6: Provisioning for Workflow Ensembles
    6.1 Related Work
    6.2 Problem Description
        6.2.1 Resource Model
        6.2.2 Application Model
    6.3 Algorithms
        6.3.1 Static Provisioning Dynamic Scheduling (SPDS)
        6.3.2 Dynamic Provisioning Dynamic Scheduling (DPDS)
        6.3.3 Workflow-Aware DPDS (WA-DPDS)
        6.3.4 Static Provisioning Static Scheduling (SPSS)
    6.4 Evaluation
        6.4.1 Simulator
        6.4.2 Workflow Ensembles
        6.4.3 Performance Metric
        6.4.4 Experimental Parameters
        6.4.5 Relative Performance of Algorithms
        6.4.6 Task Granularity
        6.4.7 Inaccurate Task Runtime Estimates
        6.4.8 Provisioning Delays
        6.4.9 SPSS Planning Time
    6.5 Summary
Chapter 7: Conclusion
    7.1 Findings and Contributions
    7.2 Future Work
Bibliography
Appendix: List of Publications

List of Tables

Table 2.1: Average resource provisioning overheads
Table 2.2: Average no-op job runtime
Table 3.1: Single VM provisioning time
Table 3.2: Mean provisioning time for a simple deployment with no plugins
Table 3.3: Provisioning time for a deployment used to execute workflow applications
Table 3.4: Provisioning time for virtual clusters of 3, 6, 12, and 24 nodes across 3 cloud providers
Table 4.1: Comparison of application resource usage
Table 4.2: Resource types used
Table 4.3: Monthly storage cost
Table 4.4: Per-workflow transfer costs
Table 4.5: Comparison for submit host outside or inside the cloud
Table 4.6: Comparison for cc1.4xlarge and c1.xlarge
Table 4.7: Cost per mosaic of a Montage image mosaic service hosted at IPAC
Table 4.8: Cost per mosaic of a Montage image mosaic service hosted on Amazon EC2
Table 4.9: Summary of Periodograms runs on Amazon EC2 and NCSA Abe
Table 5.1: Experimental conditions for profiling and normal executions
Table 5.2: Summary of workflow profiles
Table 5.3: Montage execution profile
Table 5.4: CyberShake execution profile
Table 5.5: Broadband execution profile
Table 5.6: Epigenomics execution profile
Table 5.7: LIGO Inspiral Analysis execution profile
Table 5.8: SIPHT execution profile

List of Figures

Figure 1.1: Architecture of a typical infrastructure cloud management system
Figure 2.1: Resource provisioning using pilot jobs
Figure 2.2: Corral architecture
Figure 2.3: Corral components
Figure 2.4: Example meta-workflow containing resource provisioning jobs
Figure 2.5: Resource provisioning phases
Figure 2.6: Example Broadband workflow containing 66 tasks
Figure 2.7: Runtime with and without provisioning for a Broadband workflow
Figure 2.8: Example Montage workflow with 67 tasks
Figure 2.9: Runtimes with and without provisioning for two Montage workflows
Figure 2.10: Small Epigenome workflow with 64 tasks
Figure 2.11: Comparison of Epigenome workflow runtime when limited by site policies
Figure 2.12: Comparison of Epigenome workflow runtime on different numbers of processors
Figure 2.13: CyberShake resource usage for the period between April 28, 2009 and Jun 11, 2009
Figure 3.1: Architecture of a typical infrastructure cloud management system
Figure 3.2: Virtual cluster for the Pegasus workflow management system that uses NFS and Condor
Figure 3.3: Wrangler architecture
Figure 3.4: Example request for 4 node virtual cluster with a shared NFS file system
Figure 3.5: Example plugin used for Condor workers
Figure 3.6: Deployment used for workflow applications
Figure 3.7: Deployment used in the data storage study
Figure 3.8: Deployment used to execute Periodograms workflows
Figure 4.1: Execution environments on EC2 and Abe
Figure 4.2: Single node runtime comparison
Figure 4.3: Single node resource cost comparison
Figure 4.4: Cost-performance comparison of different instance types for Montage, Broadband and Epigenome
Figure 4.5: Performance of Montage using different storage systems
Figure 4.6: Performance of Epigenome using different storage systems
Figure 4.7: Performance of Broadband using different storage systems
Figure 4.8: Rounded and actual cost using different storage systems
Figure 4.9: Comparison of runtime and actual cost for submit node inside and outside the cloud
Figure 4.10: Comparison of runtime and actual cost for cc1.4xlarge vs c1.xlarge
Figure 5.1: Workflow profiles are created by processing trace data collected by wrappers running on compute nodes
Figure 5.2: Example Montage workflow
Figure 5.3: Example CyberShake workflow
Figure 5.4: Example Broadband workflow
Figure 5.5: Example Epigenomics workflow
Figure 5.6: Example LIGO Inspiral Analysis workflow
Figure 5.7: Example SIPHT workflow
Figure 6.1: Example workflow illustrating resource usage bottleneck
Figure 6.2: CyberShake run showing how the resource requirements of a workflow can change over time
Figure 6.3: Example schedule generated by the SPDS algorithm
Figure 6.4: Example schedule generated by the SPSS algorithm
Figure 6.5: Histogram of workflow sizes in Pareto ensembles
Figure 6.6: Percentage of high scores achieved by each algorithm on different ensemble types for all five applications
Figure 6.7: Percentage of high scores achieved by each algorithm for Montage and CyberShake ensembles when task runtime is stretched
Figure 6.8: Boxplots for budget and deadline ratios when runtime estimates are inaccurate
Figure 6.9: Percentage of high scores achieved by each algorithm on Montage with uniform unsorted ensembles when runtime estimate error varies from 0% to 50%
Figure 6.10: Boxplots for budget and deadline ratios when provisioning delays occur
Figure 6.11: Percentage of high scores achieved by each algorithm on Montage with uniform unsorted ensembles when provisioning delay varies from 0 seconds to 15 minutes
Figure 6.12: Planning time of SPSS algorithm for different ensemble and workflow sizes

Abstract

Scientific workflows are a parallel computing technique used to orchestrate large, complex, multi-stage computations for data analysis and simulation in many academic domains.
Resource management is a key problem in the execution of workflows because they often involve large computations and data that must be distributed across many resources in order to complete in a reasonable time. Traditionally, resources in distributed computing systems such as clusters and grids were allocated to workflow tasks through the process of batch scheduling: the tasks were submitted to a batch queue and matched to available resources just prior to execution. Recently, due to performance and quality of service considerations on the grid, and the development of cloud computing, it has become advantageous and, in the case of cloud computing, necessary for workflow applications to explicitly provision resources ahead of execution. This trend toward resource provisioning has created many new problems and opportunities in the management of scientific workflows. This thesis explores several of these resource management issues and describes some potential solutions.

This thesis makes the following contributions:

1. It describes several problems associated with resource provisioning in cluster and grid environments, and presents a new provisioning approach based on pilot jobs that has many benefits for both resource owners and application users in terms of performance, quality of service, and efficiency. It also describes the design and implementation of a system based on pilot jobs that enables applications to bypass restrictive grid scheduling policies and is shown to reduce the makespan of several workflow applications by 32%-48% on average.

2. It describes the challenges of provisioning resources for workflows and other distributed applications in Infrastructure as a Service (IaaS) clouds and presents a new technique for modeling complex, distributed applications that is based on directed acyclic graphs. This model is used to develop a system for automatically deploying and managing distributed applications in infrastructure clouds. The system has been used to provision hundreds of virtual clusters for executing scientific workflows in the cloud.

3. It describes the challenges and benefits of running workflow applications in infrastructure clouds and presents the results of several studies investigating the cost and performance of running workflow applications on Amazon EC2 using a variety of different resource types and storage systems. These studies compared the performance of workflows in grids and clouds, characterized the virtualization overhead of workflow applications in the cloud, compared the cost and performance of using different storage systems with workflows in the cloud, and evaluated the long-term costs of hosting workflow applications in the cloud.

4. It investigates the issue of predicting the resource needs of workflow applications using historical data, and describes a technique for collecting detailed resource usage records for workflow applications that is applied to several real applications. In addition to estimating resource requirements, this data can also be used as input for simulations of scheduling algorithms and workflow management systems, and for identifying problems and optimization opportunities in workflows. This technique is used to collect and analyze the resource usage of six different workflow applications, and the resulting profiles are used to identify potential bugs and opportunities for optimizing the workflows.
5. It investigates issues related to dynamic provisioning of resources for workflow ensembles and describes three different algorithms (1 offline and 2 online) that were developed for provisioning and scheduling workflow ensembles under deadline and budget constraints. The relative performance of these algorithms is evaluated using several different applications under a variety of realistic conditions, including resource provisioning delays and task estimation errors. It shows that the offline algorithm is able to achieve higher performance given perfect conditions, but the online algorithms are better able to adapt to errors and delays without exceeding the constraints.

Chapter 1
Introduction

Many important problems in computational science can be expressed as a scientific workflow consisting of a series of linked processing steps, where each step takes in some input data, performs a computation, and produces some output data that is passed on to subsequent processing steps. Workflows are used to orchestrate scientific computations in many different academic disciplines, including physics [43], astronomy [94], biology [105], earthquake science [41], and many others.

There are many different types of workflows and, consequently, there are many different workflow management systems. These systems use different workflow representations, have different execution models and semantics, and focus on different application domains [42, 58, 193]. In this thesis the focus is on workflows that can be modeled as a directed acyclic graph (DAG), where the vertices in the DAG represent tasks and the edges represent dependencies, which can be either data-flow dependencies or control-flow dependencies. This model is used by the Pegasus Workflow Management System [40, 45], for example, which is the workflow management system used throughout this thesis. Figure 1.1 shows an example DAG-structured workflow with four tasks. Each task produces one output file that is read by subsequent tasks. These data-flow dependencies are shown as directed edges between files and tasks.

[Figure 1.1: Simple DAG with four tasks (A, B, C, D) and five data files (A.in, A.out, B.out, C.out, D.out). The edges represent data-flow dependencies between files and tasks.]

There are many benefits to using workflows to express scientific computations. Workflows enable researchers to develop scientific codes independently, and then easily link them together to perform large computations. They also enable scientists to capture the structure and process of scientific computations in a shareable, reusable format. This fosters both collaboration and reproducibility in e-Science [39]. Some workflow systems have graphical user interfaces that make it easy to compose complex data analysis pipelines [132, 169], and others support the planning and optimization of large-scale computations [45]. Finally, and most importantly for this thesis, workflows provide a model of computation that enables problems to be easily parallelized for execution on distributed resources.

Scientific workflows commonly involve large-scale computations. Individual workflows may contain anywhere from a few to several million compute tasks, and require thousands of CPU hours of computation and many terabytes of storage space. In order to complete these large-scale computations in a reasonable amount of time it is necessary to use parallel and distributed computing resources.
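To make the DAG model of Figure 1.1 concrete, the short Python sketch below encodes the four tasks and their files and derives a valid execution order. The exact edge structure is assumed here (a simple diamond in which B and C both read A.out), and the representation is purely illustrative; it is not the workflow format used by Pegasus.

```python
# Illustrative sketch of the DAG in Figure 1.1; not the Pegasus workflow representation.
# Each task lists the files it consumes and produces; task dependencies follow from shared files.
from graphlib import TopologicalSorter

tasks = {
    "A": {"inputs": ["A.in"],           "outputs": ["A.out"]},
    "B": {"inputs": ["A.out"],          "outputs": ["B.out"]},
    "C": {"inputs": ["A.out"],          "outputs": ["C.out"]},
    "D": {"inputs": ["B.out", "C.out"], "outputs": ["D.out"]},
}

# Map each file to the task that produces it, then derive task-level data-flow dependencies.
producer = {f: t for t, spec in tasks.items() for f in spec["outputs"]}
deps = {t: {producer[f] for f in spec["inputs"] if f in producer} for t, spec in tasks.items()}

# Print one execution order that respects the dependencies, e.g. ['A', 'B', 'C', 'D'].
print(list(TopologicalSorter(deps).static_order()))
```

A workflow management system performs essentially this dependency analysis, but it must also handle data staging, scheduling, and failure recovery at much larger scales.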
The need for parallel and distributed resources is particularly important as science becomes inundated with data and the computations required to extract meaning from that data increase [76].

Fortunately, the number of computational resources available to workflow developers has been increasing along with the scale of their problems. In the past, high-performance computing systems were available only to researchers at large universities and government labs. These systems were typically owned, operated, and used by a single organization and were not easily shared with researchers outside the organization. In recent years two new distributed computing paradigms have made access to distributed computing resources more widespread. The development of grid computing [56] in the last 15 years has enabled virtual organizations to securely share the computational infrastructure of their member institutions. This made it possible for government labs and universities to give outside users access to their high-performance computing resources. More recently, the development of cloud computing [10, 192] has made it possible for anyone with a credit card to easily lease large-scale computing infrastructure from commercial resource providers. Computational scientists have been increasingly using this capability to build distributed computing environments for their research. As a result of these developments, the availability of computational resources that can be used to execute workflows and other science applications has increased dramatically. Now the primary challenges for workflow applications are how to efficiently plan for, allocate, and manage the large number of resources that are available.

Along with the proliferation of resources, the way in which those resources are assigned to computations has been changing. Traditionally, resource allocation in grids, clusters and supercomputers has been based on a best-effort queuing model where resources are assigned to computations as a side effect of scheduling. In this model, application tasks are submitted as batch jobs to a shared queue associated with each resource pool, and a local scheduler assigns the jobs to resources as they become available. This model was designed, and works well, for large, tightly-coupled applications such as MPI codes. Unfortunately, because of the scheduling policies and overheads encountered on the grid, this model has been found to lead to poor performance for workflows and other loosely-coupled applications [86, 151, 161]. As a result, researchers have developed techniques to enable resource provisioning on the grid. In the resource provisioning model, resources are reserved for a single application or user for a fixed period of time, and application tasks are mapped to the resources by an application-level scheduler. By decoupling scheduling and resource allocation, this approach eliminates many of the performance issues encountered by workflows on the grid and gives workflow applications more control over scheduling. It also presents new challenges in estimating the resource needs of workflows, and in making informed provisioning decisions that guarantee performance while reducing cost and waste.
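The decoupling of resource allocation from scheduling can be illustrated with a small sketch: the loop below plays the role of an application-level scheduler, dispatching ready tasks onto a fixed pool of workers that stands in for resources provisioned once and reused for many tasks. The task graph, runtimes, and pool size are assumed for illustration; this is not the scheduler of any system discussed in this thesis.

```python
# Illustrative sketch: an application-level scheduler running DAG tasks on a fixed pool of
# provisioned workers, instead of submitting each task to a shared batch queue.
# The task graph and runtimes below are assumed for illustration.
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED
import time

deps = {"A": set(), "B": {"A"}, "C": {"A"}, "D": {"B", "C"}}
runtime = {"A": 0.1, "B": 0.2, "C": 0.2, "D": 0.1}  # seconds, assumed

def run(task):
    time.sleep(runtime[task])  # stand-in for the real computation
    return task

done, running = set(), {}
# The pool of 2 workers stands in for resources provisioned once and reused for many tasks.
with ThreadPoolExecutor(max_workers=2) as pool:
    while len(done) < len(deps):
        # Dispatch every task whose dependencies are satisfied.
        for t in deps:
            if t not in done and t not in running and deps[t] <= done:
                running[t] = pool.submit(run, t)
        finished, _ = wait(running.values(), return_when=FIRST_COMPLETED)
        for f in finished:
            t = f.result()
            done.add(t)
            del running[t]
print("completed:", done)
```

In the traditional model, each of these tasks would instead be submitted as a separate batch job and would wait in the shared queue before running.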
At the same time that resource provisioning is becoming popular in grids, infrastructure clouds are emerging as a new platform for scientific applications. The advantages of cloud computing are a result of 1) virtualization, which enables applications to be easily packaged and deployed in the cloud, and 2) on-demand provisioning, which enables applications to quickly change their resource usage according to their current needs. Unlike grid computing, which requires special tools to enable provisioning, the resource allocation model in cloud computing is based on provisioning. The use of provisioning in both grids and clouds enables similar application-level scheduling techniques to be used for applications on both platforms.

These developments have introduced many new problems for the management of scientific workflows. This thesis describes some of these problems and develops solutions for them. The remainder of the thesis will address the following research questions:

How can resource provisioning be enabled for batch-oriented computing systems? Grids and clusters provide high-performance computing resources at large scale. Unfortunately, the best-effort model supported by these batch systems has a detrimental effect on the performance of loosely-coupled applications such as workflows [86, 151, 161]. One solution to this problem is to use task clustering, which amortizes scheduling and queuing overheads over many tasks [163]. Another solution is to use resource provisioning [160]. Provisioning places control over resources in the hands of the application, which can significantly reduce overheads and improve the performance of loosely-coupled applications. Provisioning on the grid can be enabled using advance reservations, but reservations are rarely supported by resource providers [160]. Another solution is to use pilot jobs that act as placeholders to reserve resources and start up application-level schedulers that can be used by the application to bypass queues and eliminate scheduling overheads [89, 159, 184, 151]. This thesis investigates the benefits of the pilot job approach to provisioning and describes the development and evaluation of a resource provisioning system for grids and clusters based on Condor glideins [59, 34].

How can virtual environments be constructed in the cloud for workflow applications? Previous research has established techniques that use pilot jobs [11, 30, 111, 125, 151, 159] and advance reservations [129, 187, 196] to enable resource provisioning on the grid. In the cloud, however, provisioning is already a first-class operation. The challenge in the cloud is not to enable provisioning, but rather to dynamically construct execution environments suitable for workflow applications [96, 87, 18]. This involves not only provisioning the resources, but also setting up resource managers to execute the workflow, file systems to store workflow data, and file transfer services to enable data movement. Existing cloud management systems enable resource provisioning, but have little or no support for deploying the software and services required by distributed applications. This thesis investigates the benefits of cloud computing for workflow applications and describes the development and evaluation of new tools for deploying virtual environments in infrastructure clouds for complex distributed applications such as workflows.

What is the cost and performance of workflows in the cloud? Two of the key characteristics of cloud platforms are their use of commodity hardware and virtualization technology.
These characteristics make clouds more cost-effective and improve usability; however, their impact on the performance of workflow applications has not yet been adequately measured. If the performance impact is significant, then clouds are unlikely to become a useful platform for workflow applications, because performance is one of the primary motivators for workflow technologies. In addition, many cloud providers are commercial organizations that charge fees for access to resources. As such, it is important to determine if it is economically feasible to run workflows in the cloud. Recent research has investigated the cost and performance of science applications in the cloud [4, 51, 74, 84, 85, 121, 137, 153, 183]. Many of these studies have focused on tightly-coupled applications, and relatively little research has been done on the performance of loosely-coupled applications such as workflows. This thesis addresses this gap by measuring the cost and performance of several real workflows deployed in different resource configurations on infrastructure clouds.

How can the resource requirements of workflow applications be estimated? One of the primary challenges of resource provisioning is knowing how many of what kind of resources to provision, and for how long. In order to develop an effective resource provisioning strategy it is critical to know what resources (I/O, memory, and CPU) the workflow requires. There have been many previous research efforts to predict the performance of science applications using statistical modeling and simulation techniques (e.g. [3, 16, 52, 127, 158, 179]). Many of these approaches rely on historical traces, which have been shown to be a good predictor of future behavior in computational workloads [47, 63, 165]. Most workflow systems include some capability to collect information about the execution of workflows, but few systems support the collection of detailed traces that capture not only runtime, but also other forms of resource consumption, such as I/O and memory usage. The few existing approaches are either coarse-grained [182] or do not capture all resource usage [48]. This thesis investigates ways to support the prediction of workflow resource requirements by enabling the collection and analysis of fine-grained traces of resource usage.

How can resources be provisioned efficiently considering the dynamic nature of workflows and their execution environments? The dynamic nature of resources and applications often makes it necessary to adjust provisioning decisions at runtime. Jobs and resources can fail, performance estimates can be inaccurate, and the resource needs of an application can change over time. When these things happen, the provisioning system needs to be able to adapt dynamically. Previous efforts to solve this problem have used relatively simplistic approaches [151, 159, 139]. Many of these approaches only consider the current state of the queue when making provisioning decisions and do not account for the future needs of the application, which can be estimated for structured applications such as workflows. The problem is further compounded by the fact that many workflow applications consist of an ensemble of workflows, which is a set of related workflows. When executing an ensemble, the solution to the dynamic provisioning problem may be constrained by the amount of time available to complete the work and the budget available to execute tasks. No existing approaches have been developed to solve problems of this nature.
In order to address this need, this thesis develops several different dynamic provisioning algorithms for workflow ensembles with budget and deadline constraints, and evaluates them by simulating ensembles of realistic, synthetic workflows.

Chapter 2
Resource Provisioning in the Grid

Grid computing is a term used to describe a distributed computing infrastructure that facilitates the coordinated sharing of hardware, software and storage resources among groups of users and institutions called virtual organizations (VOs) [57]. A typical grid is divided into several grid sites that are connected over a wide area network (WAN). Each grid site is administered independently and consists of a collection of compute and storage resources connected via a local area network (LAN). Access to grid sites is facilitated by grid middleware [19, 50, 55, 65], which allows users to securely store and access data, transfer data between sites, submit jobs to remote compute resources, and query and discover new resources.

Until recently, workflows were typically scheduled on the grid using best-effort queuing. In this model, workflow tasks are submitted as jobs to batch queues at each grid site via grid middleware. The jobs wait in the batch queue alongside jobs from other users until they reach the top of the queue. At that point they are scheduled by a local resource manager (LRM) onto resources as they become available. In this model resources are completely controlled by the grid site, and users have little input into how their jobs are scheduled. The scheduling policies used by the LRM determine, among other things, the order in which jobs are scheduled and the relative priority of different users' jobs. They also establish limitations on the number of jobs that each user can have queued and running at any given time.

2.1 Problems with the traditional model

From the perspective of workflows and other loosely-coupled applications there are several problems with this model [86]. First, grid sites typically give higher priority to large, monolithic applications such as MPI (Message Passing Interface) codes. Because it is more difficult for the LRM to schedule monolithic applications, and because they are often considered to be "more important" than loosely-coupled applications, they are frequently given higher priority relative to the serial jobs that are commonly generated by workflows. This often makes it difficult for workflows to compete for resources and leads to long queue delays for workflow jobs.

Second, some common scheduling policies are harmful to the performance of workflow applications. Many batch queues limit the number of jobs that can be queued in order to prevent the system from becoming overloaded and to prevent individual users from monopolizing the resources and gaining an unfair share of the resource capacity. This is not an issue for monolithic applications that request many resources per job, but it severely limits the scalability of workflows that contain serial jobs. For example, at USC's HPCC center [78], the maximum number of running jobs allowed per user is 10. This has the practical consequence that a workflow containing serial jobs can use no more than 10 processors on HPCC's cluster at any one time, which makes it difficult to complete large-scale workflows in a reasonable amount of time.

Finally, the traditional approach adds overheads and inefficiencies that reduce the performance of loosely-coupled applications. Many workflows consist of a large number of short-running tasks. Executing these tasks through a batch queue involves many layers of middleware, queuing, and scheduling that introduce delays and overheads representing a significant fraction of the total turnaround time of a workflow task. On occasion, the overheads can even add up to more than the processing time of the task [128, 163]. This leads to poor resource utilization and low throughput for workflow applications.
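A back-of-the-envelope calculation illustrates how quickly these overheads dominate. The per-job overhead, task runtime, task count, and number of provisioning requests below are assumptions chosen for illustration, not measurements reported in this thesis.

```python
# Illustrative only: how per-job overhead affects short workflow tasks. All numbers are assumed.
overhead_per_job = 120.0   # seconds of queue and middleware delay per submitted job (assumed)
task_runtime = 30.0        # seconds of useful computation per task (assumed)
num_tasks = 1000

# Traditional model: every task is a separate batch job and pays the overhead individually.
batch_cpu_seconds = num_tasks * (overhead_per_job + task_runtime)
utilization = task_runtime / (overhead_per_job + task_runtime)

# Provisioned model: the overhead is paid once per provisioning request (e.g. a pilot job,
# discussed below) rather than once per task, and the allocated slots are reused for many tasks.
num_requests = 50
provisioned_cpu_seconds = num_requests * overhead_per_job + num_tasks * task_runtime

print(f"per-task utilization under batch scheduling: {utilization:.0%}")          # 20%
print(f"total CPU-seconds, traditional model:  {batch_cpu_seconds:,.0f}")         # 150,000
print(f"total CPU-seconds, provisioned model:  {provisioned_cpu_seconds:,.0f}")   # 36,000
```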
2.2 Alternative strategies

As grids have become popular targets for large-scale scientific workflows and other loosely-coupled applications, several solutions to these problems have been developed. One solution is to use task clustering [163, 195]. In this approach, several tasks are combined into a single batch job. The advantage of this approach is that it reduces the number of jobs submitted to the grid site and increases their runtime, which reduces scheduling overhead, improves throughput, and increases resource utilization. However, clustering involves structural changes to the workflow, which often result in reduced parallelism, and it must be done carefully to ensure that existing dependencies are maintained and no false dependencies are introduced. In addition, clustering often leads to rework: if a single task in a clustered job fails, then the entire cluster must be re-executed. This places a practical limit on the number of tasks that can be reasonably clustered together.

An alternate, arguably better, solution is to use resource provisioning. With provisioning, resources are allocated for the exclusive use of a single user or application for a given period of time. In this model the resources are temporarily managed by the application, and scheduling is performed by an application-level scheduler. In contrast to the traditional model, where resource allocation and scheduling are combined, the provisioning model allows resources to be allocated once and used to execute multiple jobs. In other words, provisioning turns the problem of matching tasks to resources around: instead of the application submitting tasks to the resource provider, the application asks the resource provider for direct access to the resources.

2.3 Benefits of resource provisioning

There are many advantages to using resource provisioning in the grid when compared with the traditional approach:

Resource provisioning improves application performance by reducing overheads [89, 151, 161]. It reduces the layers of middleware that add delays in job submission, and it eliminates the competition for resources that adds queue delays. This makes it possible to efficiently submit short-running tasks to grid sites that would otherwise be infeasible.

Provisioning gives applications control over scheduling. It allows the default scheduling policies that are implemented at many grid sites, and which make it difficult for workflows to run efficiently, to be replaced by application-specific policies that result in improved performance. This makes it easier for applications to use, for example, more efficient DAG scheduling heuristics (e.g. [103, 172]).

Provisioning enables applications to optimize for performance without considering the external workload of the system. Once the resources have been provisioned, the application is shielded from interference from other users of the system.
This enables workflows to submit, for example, short-running tasks as individual jobs or fine-grained clusters without the risk that these serial jobs will be given a very low priority relative to other users' jobs.

Provisioning enables co-allocation of resources and co-scheduling of tasks across multiple administrative domains. In addition, provisioning enables novel cross-domain functionality. For example, pilot jobs can be submitted to several grid sites and used to start tasks across those sites simultaneously, regardless of when the pilot jobs themselves actually start running. This is similar to the approach VARQ uses to implement virtual advance reservations [129].

Provisioning helps resource providers by offloading some functionality to the application that would otherwise be provided by the system. It is often thought that pilot jobs are "cheating" and that system administrators dislike them, but that is not the case. Pilot jobs have many characteristics that are helpful to system administrators. They reduce the load on remote resource management systems by reducing the number of jobs that would otherwise be submitted. They also improve overall resource utilization by decreasing the amount of time spent in scheduling (however, care must be taken to ensure that the application terminates pilot jobs that are no longer needed). Finally, they enable new system use cases that can be cited by system owners as success stories and used as exemplars for other applications.

Provisioning insulates the application from malfunctioning resources. After provisioning, the application can perform basic sanity checks on nodes to prevent tasks from being scheduled on nodes that are misconfigured or faulty [30]. This reduces much of the complexity and expense of identifying and avoiding bad nodes, which is typically accomplished by observing a series of failures on a particular node.

Provisioning masks the application from the heterogeneity of the underlying resources. Resources provisioned from multiple providers can be combined into a single pool with a uniform interface. This makes it easier to submit tasks to local resources, grid sites, and clouds at the same time [184].

Provisioning can be used to implement meta-scheduling services for a virtual organization. Many LHC (Large Hadron Collider) projects, for example, have developed workload management systems that use resource provisioning techniques. These systems allow VO members to submit work to a central queue without considering where on the grid it will run [118, 111, 11, 175, 159]. This approach makes it easy to implement VO-level scheduling policies that make the best use of VO resources, considering organizational priorities and goals.

2.4 Resource provisioning on the grid

There are two ways to achieve resource provisioning on the grid. The first is to use advance reservations, in which the LRM is configured to run only a single user's jobs on a given set of resources for a limited amount of time. In this approach the jobs are submitted to the grid site's queue as normal, but do not compete with other users' jobs for access to the reserved resources. This approach has been shown to be very effective for improving the performance of workflows on the grid [187, 196].
The problem with advance reservations is that, although many LRMs support them, many grid sites do not allow users to take advantage of the feature, and those sites that do allow reservations require administrators to approve each reservation, which can be a burden to users [160]. In addition, research has shown that advance reservations may have a negative impact on the quality of service of regular, non-reservation jobs due to the fragmentation of scheduling slots caused by reservations and the existence of underutilized reservations [164].

Another way to do resource provisioning on the grid is to use pilot jobs. In this approach, which is also known as multi-level scheduling [151] or placeholder scheduling [67, 145], resources are provisioned by submitting ordinary jobs (called, not surprisingly, "pilot jobs") to the grid site. Instead of running an application task, however, the pilot jobs start a guest node/resource manager on the remote machine. The guest node manager is configured to fetch application tasks from an application-level scheduler hosted outside the grid site. This process, illustrated in Figure 2.1, creates a personal cluster [98, 184] composed of remote resources that can be used by the application to execute tasks. The application-level scheduler temporarily controls the resources in the personal cluster. As such, the application is given full control over all scheduling decisions. This enables the application developer to define application-specific scheduling policies that result in the best performance for their application.

[Figure 2.1: Resource provisioning can be implemented on the grid using pilot jobs, which are regular grid jobs that are used to create personal clusters from grid resources.]

2.5 Related Work

Several systems based on pilot jobs have been created to provision resources on grids in order to overcome job startup overheads and uncertainties. Condor glidein [34] is a command-line tool that can be used to add grid resources to an existing Condor [108] pool using the glidein technique [59]. Condor glidein is simple to use, but it does not support the ability to provision multiple resources at once. Instead, it assumes that each glidein job is mapped to exactly one worker node, which significantly limits its scalability.

GlideinWMS [159] is a workload management system that is also based on Condor. It supports dynamic provisioning by polling Condor for queued application jobs and automatically provisioning grid resources to execute them. GlideinWMS does not provide a direct interface for provisioning resources. Instead, it automatically requests and releases resources in response to changes in demand observed in the Condor queue. This makes it more suitable for use as a workload manager for an entire VO than as a resource provisioning tool for individual applications where more control over the provisioning process is required.

MyCluster [184] creates personal clusters using several different resource managers, including Condor. It can automatically maintain fixed-size pools by resubmitting resource requests as they expire, and it allows users to control the granularity of resource requests.
It uses a virtual network overlay and a user-space network file system to avoid pre-staging executables. Like GlideinWMS, MyCluster does not have any programmatic interfaces that can be used by applications to control the provisioning process.

Falkon [151] is a multi-level scheduling system designed for applications requiring very high task throughput, such as very fine-grained workflows and bags of tasks on very large clusters. It consists of a web service that accepts job requests, a provisioner that allocates resources from remote sites, and a custom node manager that executes application jobs on provisioned resources. Although Falkon achieves very high throughput, it does so by omitting many of the features provided by off-the-shelf resource managers, such as resource matching and job prioritization.

The VGES system includes a Java API that is used to create personal clusters on the grid [98]. The API uses a custom version of the Torque resource manager [173] that has been modified to run in user mode. It creates personal clusters by starting Torque daemons on host clusters using grid protocols. Access to these personal clusters is provided through a user-level Globus gatekeeper that is started on the host cluster's head node. The system assumes that both Torque and Globus are installed and configured on the remote site and does not pre-stage executables. This makes it difficult to deploy in a grid environment that consists of many different sites, such as the TeraGrid [170].

DIANE [118] is a master-worker framework based on multi-level scheduling. It consists of a master process and several agent processes. The agents are started on grid worker nodes and contact the master to request tasks. The master distributes the tasks and merges the results. Like Falkon, DIANE relies on a custom scheduling system and does not support generic, off-the-shelf resource managers. In addition, DIANE requires applications to be developed using a custom API that ties the application strongly to the DIANE framework.

2.6 System Requirements

The following list of system requirements is based on an analysis of the existing multi-level scheduling systems and the requirements of workflow applications:

Automate environment setup.
Using pilot jobs on a remote grid site requires that the resource manager software be installed and configured on the site. Many existing pilot job systems assume that the resource manager is pre-installed on the site by the user. This imposes additional burdens on the user to identify the characteristics of the remote system (OS, architecture, etc.), to identify the correct software version, and to install and configure the resource manager properly. This process makes using existing systems complicated and error-prone. Instead of relying on the user to choose and install the correct software on the remote site, a pilot job system should automate the setup process as much as possible while allowing the user to control details of the configuration where necessary or desirable.

Minimize overheads.
When running workflows on the grid, the main performance metric is the time to solution, or makespan, of the workflow. This makes the time required to acquire resources an important design criterion in the development of provisioning systems. Several of the existing pilot job systems transfer large executables for each provisioning request, which introduces overheads that delay the provisioning of resources.
Instead of transferring executables for each request, a pilot job system should minimize these delays as much as possible by caching the executables on the remote site.

Provide multiple control interfaces.
Complex software systems benefit from having multiple interfaces tailored to suit the needs of diverse clients. Alternative interfaces provide different access mechanisms and abstractions that cannot be offered through a single interface. Many existing pilot job systems provide only a single, command-line interface, and some provide no external interfaces at all. A pilot job system should provide a powerful programmatic interface that can be used by third-party dynamic provisioning tools, and a scriptable, easy-to-use command-line interface for users and administrators.

Recover gracefully from failures.
Component failures are common occurrences in distributed systems. This is especially true of grid environments that span administrative boundaries, support hundreds of simultaneous users, and operate on heterogeneous systems across wide-area networks. Any service that operates in such an environment should be able to recover from routine failures of system components. A pilot job system should be able to recover the state of its resources and jobs after server failures.

2.7 Architecture and Implementation

This section describes the design and implementation of a resource provisioning system called Corral. Corral creates personal clusters by provisioning resources from grid sites using pilot jobs. Once these pilot jobs are running on the remote resources, they start up daemons that contact the personal cluster manager and make themselves available to run application tasks.

Corral uses the Condor glidein technique [59] to provision resources by starting Condor worker daemons on remote cluster nodes. Upon startup, the worker daemons join a Condor pool administered by the user (i.e. a personal cluster), where they are used to execute application jobs. The use of this technique has been shown to significantly reduce the runtime of several large-scale workflow applications [160, 163].

Corral was developed using a service-oriented architecture, as shown in Figure 2.2. Clients send provisioning requests to the Corral web service, which communicates with grid sites to allocate resources that fulfill the requests. Each grid site consists of a head node, several worker nodes, and a shared file system that can be accessed by all nodes.

[Figure 2.2: Corral architecture]

The components of the system and the functional relationships between them are shown in Figure 2.3. A description of the purpose and responsibilities of each of these components follows.

Glidein Service
The Glidein Service is the central component of the system. It is a web service that accepts requests from clients, sets up the execution environment on the grid site, provisions resources using glideins, and cleans up files and directories created by the system. This service was implemented as a RESTful web service [53].

Condor
Condor [108] is used to schedule both pilot and application jobs, and to manage glidein workers. Condor submits pilot jobs to the grid site using Condor-G [59], glidein workers contact the Condor central manager to join the user's personal cluster, and application jobs are submitted to the Condor queue, where they are matched to glidein workers for execution. The Condor schedulers used for pilot jobs and application jobs can be separated, but in practice the same scheduler is used for both sets of jobs.
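To give a flavor of the Condor-G submission path described above, the sketch below writes a minimal pilot-job submit description and hands it to condor_submit. The gatekeeper contact, file names, and arguments are placeholders, and the sketch illustrates the general technique rather than the submit descriptions Corral actually generates.

```python
# Illustrative sketch of submitting a pilot job to a grid site with Condor-G.
# The gatekeeper contact, paths, and arguments are placeholders, not Corral's actual values.
import subprocess
import textwrap

submit_description = textwrap.dedent("""\
    universe      = grid
    grid_resource = gt2 gatekeeper.example.edu/jobmanager-pbs
    executable    = glidein_start.sh
    arguments     = --condor-host submit.example.edu --wall-time 7200
    output        = pilot.$(Cluster).out
    error         = pilot.$(Cluster).err
    log           = pilot.log
    queue
""")

with open("pilot.sub", "w") as f:
    f.write(submit_description)

# condor_submit hands the request to the local scheduler, which forwards it to the remote
# gatekeeper; once the pilot starts, it joins the user's personal cluster.
subprocess.run(["condor_submit", "pilot.sub"], check=True)
```

In Corral this submission is triggered by the web service rather than by the user directly, but the same Condor-G mechanism carries the pilot job to the site.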
The Condor schedulers used for pilot jobs and application jobs can be separated, but in practice the same scheduler is used for both sets of jobs.

Staging Servers The Glidein Service installs Condor on the remote grid site from bundles of executables and configuration files called packages. Each package contains a set of Condor worker node daemons for a given Condor version, system architecture and operating system. The file servers used to host these packages are called staging servers. Any file server that supports the HTTP, FTP, or GridFTP protocols may be used as a staging server.

Replica Location Service (RLS) The Replica Location Service (RLS) [31, 32] is a Globus GT4 grid service that maps logical file names to physical locations. It is used by the Glidein Service to map package names to staging servers. This enables a single package to be replicated across many staging servers for reliability and scalability, for the locations of packages to be easily changed, and for a single set of staging servers to be used by all users of Corral.

Figure 2.3: Corral components

2.7.1 Operation

The process used by the service to provision resources is divided into three phases: setup, provisioning, and cleanup. These phases correspond to jobs that are submitted by the service to the grid site.

Setup Job The Setup Job is submitted during the setup phase to prepare the site for glideins. The setup job runs an installer which determines the appropriate package to use for the site, looks up the package in RLS to determine which staging servers have it, and downloads the package from the first available staging server. It then creates Condor installation and working directories on the site's shared file system and copies the Condor binaries into the correct location.

Pilot Job The pilot job (also known as the glidein job) is submitted during the provisioning phase to allocate worker nodes for the user's personal cluster. Pilot jobs generate a Condor configuration file and launch Condor daemons on each allocated worker node. The Condor daemons register themselves with the central manager and are matched with application jobs for execution. The daemons are monitored by a dedicated process and killed when the user's request is canceled or expires.

Cleanup Job The cleanup job is submitted during the cleanup phase to remove the working directories used by the glideins. It runs an uninstaller, which removes all log files, configuration files, executables and directories created by the setup and pilot jobs.

This three-step process enables Condor executables to be staged once during the setup phase and reused for multiple requests during the provisioning phase. This avoids transferring binaries for each provisioning request and thereby reduces the provisioning overhead of the system.

2.7.2 Features

Interfaces Users can interact with Corral using a simple command-line interface.
In addition to providing functionality for interactive provisioning requests, the command-line interface also supports scripting by providing outputs that are easy to parse and operations that block until resources have been allocated. This allows the command-line interface to be used in shell scripts and workflows to automate provisioning. This capability could be used, for example, to create a meta-workflow that automates the planning and execution of other workflows as shown in Figure 2.4. In addition, the glidein service has a simple RESTful interface that can be invoked directly, and there is a simple client API that can be used in Java applications.

Figure 2.4: Example meta-workflow containing resource provisioning jobs

Automatic Resubmission Many resource providers limit the maximum amount of time that can be requested for an individual job. This means that pilot jobs used to provision resources can only run for a limited amount of time before they expire. Often, however, users would like to provision resources for a longer time to accommodate long-running applications. This can be accomplished by resubmitting provisioning requests as they expire. Corral supports this capability by automatically submitting new pilot jobs to replace old jobs that have terminated. When creating a new pilot request the user can specify that the request should be resubmitted indefinitely, until a specific date and time, or a fixed number of times. When resubmitting, if the last request failed, or if the user's credential has expired, the request will not be resubmitted. Unlike other provisioning techniques, such as advance reservations, this resubmission approach is not unfair to other users because resubmitted pilot jobs are not given special priority and must wait in the remote queue alongside other users' jobs.

Firewall Negotiation Multi-level scheduling systems function well when worker nodes have public IP addresses and are free to communicate with clients outside their network. However, many resource providers conserve IP addresses by using private networks and isolate their worker nodes behind firewalls for security. This prevents application-specific schedulers outside the resource provider's network from communicating directly with worker nodes. Solutions to this problem in Condor include Generic Connection Brokering (GCB) [61] and Condor Connection Brokering (CCB) [33]. These techniques make use of a broker that is accessible to both the worker nodes and the application-specific scheduler to facilitate connections between them. The broker allows the application-specific scheduler and the worker nodes to communicate without requiring any direct connections into the private network. Corral supports GCB and CCB by automatically configuring worker nodes to communicate with a CCB or GCB broker.

2.8 Evaluation

In this section the Corral service is evaluated in four ways:

1. By measuring the overhead of starting up pilot jobs to provision resources on several grid sites.

2. By measuring the overhead of running application jobs on resources provisioned by Corral and comparing it to the overhead of running the same jobs using traditional grid interfaces.

3. By measuring the end-to-end performance of three real-world workflow applications executing on resources provisioned by Corral and comparing that runtime with a lower bound calculated using a well-known DAG scheduling algorithm (HEFT).
4. By relating the experience of running a number of large-scale earthquake science workflows on the TeraGrid in a production mode.

2.8.1 Resource Provisioning Overhead

In order to quantify the overhead of using Corral to provision resources, the amount of time required to complete the phases of the resource provisioning process was measured. These phases correspond to the lifecycle events of the setup, pilot, and cleanup jobs used by Corral, as shown in Figure 2.5. The provisioning phase is composed of two sub-phases, allocation and runtime, that correspond to the scheduling/queuing delay and the execution time of the pilot job.

The setup time, allocation time and cleanup time were measured for three typical TeraGrid [170] sites. Since the runtime is specified by the user it is not considered for comparison. All allocation measurements were taken when the sites had sufficient free resources in order to minimize the impact of queuing delays. As such, these figures represent lower bounds on the allocation time. The average times are shown in Table 2.1.

The overheads for setup and cleanup were approximately 30 seconds and 15 seconds, respectively. The uniformity of the results is a result of using the Globus fork jobmanager to run setup and cleanup jobs, which imposes a small overhead that is more-or-less uniform across all sites.

Figure 2.5: Resource provisioning phases

Table 2.1: Average resource provisioning overheads (in seconds) of Corral on three grid sites

Grid Site       Setup Time    Allocation Time    Cleanup Time
NCSA Mercury    29.5          52.5               15.0
NCSA Abe        28.4          35.3               15.7
SDSC IA-64      28.4          97.0               15.2

In comparison, the pilot jobs were submitted to a batch jobmanager (i.e. jobmanager-pbs), which is reflected in the variation in allocation time between the sites. This variation is due to scheduling overheads, which depend on site policies and configuration.

2.8.2 Job Execution Delays

In order to determine the benefits of running application jobs using glideins, the amount of time required to run a no-op job using Globus version 2, Globus version 4, and Corral was measured. The time to provision the resource (measured in the previous section) is not included in the Corral result. As such, the Corral time represents the runtime of a no-op job after the resource has been provisioned (i.e. the glidein job is running). The average runtimes for three TeraGrid sites are shown in Table 2.2.

Table 2.2: Average no-op job runtime (in seconds) for Globus GRAM versions 2 and 4 vs glideins provisioned with Corral for three grid sites

Grid Site       GT2      GT4      Corral
NCSA Mercury    61.1     237.9    2.2
NCSA Abe        35.8     220.7    1.6
SDSC IA-64      263.3    N/A      2.0

On all three sites, the runtime of the jobs using Corral (approximately 2 seconds) was significantly shorter than the runtime using Globus (approximately 35-260 seconds). This improvement is attributed to two factors: reduced software overhead, and reduced scheduling delay. The reduction in software overhead is a result of Condor requiring fewer software layers to dispatch jobs than Globus. The reduction in scheduling delay results from the ability to configure Condor to immediately execute jobs if there are available resources.
In comparison, Globus is limited by the scheduling policies of each site's LRM, which are typically configured to schedule jobs periodically on intervals of up to several minutes. These measurements clearly show the benefit of a pilot job approach on Grid-based systems, especially when the overheads shown are incurred by every job in the application workflow.

2.8.3 Provisioning for Workflows

The following sections quantify the benefits of Corral using three real-world workflow applications, running on three different types of execution environments (small and large clusters). The makespan of the workflows was measured, not including the provisioning step, which adds 100 seconds to the overall workflow as shown in Table 2.1. The applications were selected from three different disciplines: earthquake science, astronomy, and epigenomics.

In addition, for the large workflows (astronomy and epigenomics), the impact of task clustering on overall workflow performance with and without resource provisioning ahead of the execution was measured.

Figure 2.6: Example Broadband workflow containing 66 tasks.

Earthquake Science Workflow

The first application tested is an earthquake science application, Broadband, developed by the Southern California Earthquake Center (SCEC) [166]. The objective of Broadband is to integrate a collection of motion simulation codes and calculations to produce research results of value to earthquake engineers. Broadband workflows combine these codes to simulate the impact of various earthquake scenarios on several recording stations. Researchers use the Broadband platform to combine low frequency (less than 1.0Hz) deterministic seismograms with high frequency (approximately 10Hz) stochastic seismograms and calculate various ground motion intensity measures (spectral acceleration, peak ground acceleration and peak ground velocity) for their building design procedures.

The first experiment compares the performance of Broadband using the provisioning approach to that using the traditional approach. The workflow used in this experiment simulates 6 earthquake scenarios for 8 different recording stations and contains 768 tasks. A smaller example is shown in Figure 2.6 to illustrate the structure of a Broadband workflow.

The 768-task workflow was executed on WIND, a small cluster at ISI. WIND nodes have two 2.13GHz dual-core Intel Xeon CPUs (4 cores total) and 4GB of memory. The local scheduler on WIND is Condor 7.1.3. For runs using the traditional approach, jobs were submitted using Globus GT2 GRAM and WIND was configured to only use four nodes (16 cores). For runs using pilot jobs, Corral was used to provision 16 cores.
The Pegasus Workflow Management System [45] was used to plan and execute the workflow.

Figure 2.7 shows the results of running Broadband with and without provisioning. Using provisioning (Corral), the application runs 40% faster than with the traditional approach where jobs are submitted directly to the job manager on the resource (Globus). A modified version of the HEFT scheduling heuristic [172] was used to compute a lower bound on the runtime of the workflow for comparison. This version of HEFT assumes no scheduling or communication overheads and uniform resources. Since these overheads are not included, the HEFT runtime is a reasonable lower bound on the runtime that could be achieved in a real execution environment where the overheads are present. The workflow runtime with Corral is very close to the HEFT lower bound; however, it is still slower because provisioning cannot eliminate all the overheads, such as the waiting of a workflow task in a queue in the workflow management system, and the delay in sending the task to the computational resource (in this case over a Local Area Network).

Figure 2.7: Runtime (in minutes) of Broadband on a Local Cluster with and without provisioning. HEFT is a lower-bound on the runtime.

Astronomy Workflow

The second application tested was an astronomy application, Montage [94]. This workflow was run using both the traditional grid approach and multi-level scheduling provided by Corral. The input to Montage is the region of the sky for which a mosaic is desired, the size of the mosaic, and other parameters such as the image archive to be used, etc. The input images are first re-projected to the coordinate space of the output mosaic, the re-projected images are then background rectified and finally co-added to create the output mosaic. Montage is a data-intensive application because the input images, the intermediate files produced during the execution of the workflow, and the output mosaic are of considerable size and require significant storage resources. Montage tasks have short runtimes of at most a few minutes.

The size of a workflow (number of tasks) is based on the area of the sky in square degrees covered by the mosaic. Workflows of two different sizes were used for this evaluation: a 1-degree workflow with 206 tasks and a 6-degree workflow with 6062 tasks. Figure 2.8 shows a smaller, 0.5-degree Montage workflow to give the reader an idea about the shape of the larger workflows.

Figure 2.8: Example Montage workflow with 67 tasks.

The medium-sized Skynet cluster at ISI was used for all Montage experiments. The nodes on Skynet have 800MHz Pentium III processors and 1GB of memory.
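For reference, the kind of lower-bound estimate used in these HEFT comparisons (here and in the Montage and Epigenome results below) can be approximated with a simple list-scheduling computation over the workflow DAG. The Python sketch below is a simplified reading of the modified heuristic described above, assuming uniform processors and zero scheduling and communication overheads. It is illustrative only; it is not the implementation used in the experiments, and the toy task names and runtimes at the end are invented.

import heapq
from collections import defaultdict

def makespan_lower_bound(tasks, num_procs):
    """Approximate makespan lower bound for a workflow DAG.

    tasks: dict mapping task id -> (runtime, [parent ids]).
    Assumes uniform processors and zero scheduling/communication
    overheads, in the spirit of the modified HEFT described in the text.
    """
    children = defaultdict(list)
    indegree = {t: len(parents) for t, (_, parents) in tasks.items()}
    for t, (_, parents) in tasks.items():
        for p in parents:
            children[p].append(t)

    ready_time = {t: 0.0 for t in tasks}   # earliest time a task's inputs exist
    procs = [0.0] * num_procs              # min-heap of processor-free times
    heapq.heapify(procs)

    # Greedily place tasks in topological order onto the earliest-free processor.
    ready = [(0.0, t) for t in tasks if indegree[t] == 0]
    heapq.heapify(ready)
    makespan = 0.0
    while ready:
        available, task = heapq.heappop(ready)
        runtime, _ = tasks[task]
        start = max(available, heapq.heappop(procs))
        finish = start + runtime
        heapq.heappush(procs, finish)
        makespan = max(makespan, finish)
        for child in children[task]:
            ready_time[child] = max(ready_time[child], finish)
            indegree[child] -= 1
            if indegree[child] == 0:
                heapq.heappush(ready, (ready_time[child], child))
    return makespan

# Illustrative only: a toy fork-join workflow on 16 processors.
toy = {"setup": (60, []), "sim1": (300, ["setup"]),
       "sim2": (300, ["setup"]), "merge": (120, ["sim1", "sim2"])}
print(makespan_lower_bound(toy, 16))   # 480.0

Because all overheads are ignored, the resulting value is treated as a practical lower bound when interpreting the comparisons in Figures 2.7, 2.9, 2.11 and 2.12.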
To isolate the effects of scheduling overhead, which can be prominent in large-scale workflows with short-duration tasks, experiments were performed using both un-clustered and clustered versions of the workflows. For the clustered configurations, tasks from each level of the workflow were grouped into N jobs, where N equals the number of available processors. The level of a task is defined to be the length of the longest path to that task from any root task (task with no parents). Workflows that are automatically clustered in this manner produce the minimum scheduling overhead achievable without reducing workflow parallelism [163].

Figure 2.9: Runtimes with and without provisioning for a 1-degree Montage workflow (a) and a 6-degree Montage workflow (b). HEFT is a lower-bound on the runtime.

The results of these experiments are shown in Figure 2.9. For un-clustered experiments, the runtime of the workflows using Corral was 45% less on average than the runtime using Globus (up to 78% in the best case). The clustered experiments showed a more modest improvement of 11% on average (23% best case). This was primarily due to a decrease in scheduling overheads for the clustered experiments that result from having fewer jobs to schedule. It is also interesting to note the difference between the fine- and coarse-grained workflows for the clustered experiments. Since the same number of clusters were generated for both workflow sizes, and the larger workflows have more total tasks, there were more tasks per cluster in the larger workflows and the resulting runtimes were larger. This difference in cluster granularity had an impact on the relative scheduling overheads. Comparing with the HEFT runtimes it can be observed that, for the fine-grained workflows (1-degree), the scheduling overheads dominate the execution, and for the coarse-grained workflows (6-degree), performance close to optimal can be achieved. In addition, the fine-grained workflows have enough scheduling overhead that there is not much benefit from increasing the number of processors.

Bioinformatics Workflow

To illustrate the benefits of using Corral for larger workflows and sites with restrictive policies, a set of experiments was performed using an Epigenome mapping application [177]. The application consists of 2057 tasks that reformat, filter, map and merge DNA sequences. The majority (more than 90%) of the runtime of this application is consumed by 512 tasks that each require approximately 2.5 hours to run.

Figure 2.10: Small Epigenome workflow with 64 tasks.

Figure 2.10 shows a picture of a much smaller Epigenome workflow for illustration.
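As an aside, the level-based clustering used for the Montage runs above, and revisited below when the Epigenome tasks are clustered to fit site policies, can be outlined in a few lines. The sketch follows the definition given above (a task's level is the length of the longest path to it from any root task), but it is only an illustration; it is not the clustering implementation used by the workflow planner in these experiments, and the toy workflow at the end is invented.

from collections import defaultdict

def cluster_by_level(parents, n_jobs):
    """Group workflow tasks into n_jobs clusters per level.

    parents: dict mapping task id -> list of parent task ids.
    Returns a dict mapping (level, cluster index) -> list of task ids.
    Illustrative sketch only.
    """
    level = {}

    def task_level(t):
        # Longest path from any root task (memoized).
        if t not in level:
            ps = parents[t]
            level[t] = 0 if not ps else 1 + max(task_level(p) for p in ps)
        return level[t]

    for t in parents:
        task_level(t)

    by_level = defaultdict(list)
    for t in parents:
        by_level[level[t]].append(t)

    # Partition each level's tasks round-robin into n_jobs clusters.
    clusters = defaultdict(list)
    for lvl, members in by_level.items():
        for i, t in enumerate(sorted(members)):
            clusters[(lvl, i % n_jobs)].append(t)
    return clusters

# Illustrative only: a toy two-level workflow clustered for 2 processors.
toy = {"root": [], "a": ["root"], "b": ["root"], "c": ["root"], "d": ["root"]}
for key, members in sorted(cluster_by_level(toy, 2).items()):
    print(key, members)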
This application scales well to large numbers of cores, so it was possible to run scaling experiments on the 10GB cluster at USC's High-Performance Computing Center (HPCC). The HPCC nodes that were used had 2.3GHz AMD Opteron processors and 16GB of memory.

Figure 2.11: Comparison of Epigenome workflow runtime when limited by site policies

HPCC's cluster has two scheduling policies that affect the runtime of the Epigenome application: max_user_run and resources_max.walltime. The max_user_run policy prevents any single user from running more than 30 jobs concurrently. When using Globus to submit jobs, this policy prevents workflows with serial tasks from using more than 30 processors at a time. The resources_max.walltime policy prevents any single job from running for more than 24 hours. One result of this policy is that workflows with long-running jobs cannot be clustered to match the allowable resources. For the Epigenome workflow, clustering tasks into 30 jobs per level (to match max_user_run) would result in individual job runtimes of 45 hours. Consequently, the maximum clustering possible is 60 jobs per level, which results in runtimes of 22.5 hours. These clustered jobs must be executed in two separate batches of 30 jobs each.

Figure 2.11 compares the runtime of the Epigenome workflow when limited to 30 processors. The runtime using Corral was 32% less than Globus in the unclustered case, and 10% less in the clustered case. Using Corral, the clustered workflow actually took longer than the unclustered workflow. This is a result of the interaction between provisioning and load imbalance. Due to the limitation on the runtime of the jobs, and to make a closer comparison with the plain Globus solution, the resources are provisioned in two 24-hour blocks, with the second block being requested when all the jobs scheduled for the first block have completed. However, not all the jobs take the same time to complete, and the jobs in the second block must wait for the slowest jobs in the first block before they can proceed. The problem occurs for both clustered and unclustered workflows; however, the shorter jobs in the unclustered workflow compensate for the imbalance and make the gap less severe. This gap could be eliminated completely by using a more sophisticated provisioning scheme (e.g. by overlapping the blocks or using dynamic provisioning). However, since the overhead in the non-clustered case with Corral was only about 3% compared to HEFT, leaving the workflow unclustered is a reasonable solution.

It is important to note that although workflows using Globus are limited to 30 processors by HPCC's max_user_run policy, workflows using Corral are not. Corral can use a single parallel job to provision any number of processors. This enables Corral to allocate more processors to run the workflow than is possible using Globus.

Figure 2.12: Comparison of Epigenome workflow runtime on different numbers of processors

Figure 2.12 shows the runtime of the Epigenome workflow using 128, 256 and 512 processors. For this workflow, 512 processors is the maximum parallelism achievable due to the structure of the workflow. Using Corral to provision 512 processors resulted in an order of magnitude lower runtime (211 minutes) compared with the best runtime using Globus (2349 minutes).
In addition, although the number of processors increases, the scheduling overhead when compared with HEFT remains relatively constant at around 5-10%.

2.8.4 Example Application: CyberShake

This section describes the experience of using Corral to provision resources for production runs of a large-scale workflow application on the TeraGrid [170]. The application, CyberShake [69, 26, 41], is a probabilistic seismic hazard analysis (PSHA) tool developed by the Southern California Earthquake Center (SCEC) [166] to study the long-term risk associated with earthquakes. It consists of two parts: a parallel simulation that computes how a given geographic location, or site, responds to earthquakes, and a workflow that uses many scenario earthquakes to determine the future probability of different levels of shaking at the site.

The parallel simulation consists of an MPI code and several serial programs that generate a large 3D mesh of site-response vectors called Strain Green Tensors (SGTs). In this part of the computation there are only a few, large tasks, and thus it can make use of standard grid scheduling techniques. The workflow part of the computation is a 3-stage pipeline consisting of tasks that 1) extract site response vectors, 2) compute synthetic seismograms, and 3) measure peak ground motions. The workflow contains 840,000 tasks for each site. Due to the large number of tasks it is not feasible to use normal grid scheduling techniques because the overall workflow performance would be poor. Instead, Corral was used to provision resources and Condor was used to schedule the workflow tasks.

In the period between April 28, 2009 and June 11, 2009 the CyberShake team computed the seismic hazard curves for 220 sites located in Southern California. This work involved the execution of over 197 million workflow tasks. Corral was used to provision resources from the Ranger cluster at TACC [171]. A timeline showing the resource usage for that period is shown in Figure 2.13.

Figure 2.13: CyberShake resource usage for the period between April 28, 2009 and June 11, 2009

The application made 64 provisioning requests for a total of 1.19 million CPU hours. The times where no resources were provisioned (white spaces in Figure 2.13) correspond to the times when Ranger was down for maintenance or no new workflows were being scheduled and thus no provisioning requests were made. The figure also shows that 2,400 cores were being provisioned most of the time, but up to 4,500 resources were provisioned for brief periods.

An open issue is how to optimize the resource utilization while running workflows on provisioned resources. It is possible for some fraction of the resources to sit idle while waiting to be matched with workflow tasks. In the 2009 CyberShake runs there were periods when the resource utilization dropped below 50% due to scalability issues on the submit host and a lack of sufficient work to keep all cores busy. For large-scale workflows such as CyberShake, resource utilization can be improved by tuning the task scheduling system to release a sufficient number of tasks to the resources. For small workflows or workflows with a small number of parallel tasks, a solution could be to run multiple, independent workflows or tune the number of resources provisioned over time.

2.9 Summary

Scientists in many fields are developing large-scale workflow applications for complex, data-intensive scientific analyses [169].
These applications require the use of large numbers of low-latency computational resources in order to produce results in a reasonable amount of time. Although the grid provides access to ample resources, the traditional approach to accessing these resources introduces many overheads and delays that make the grid an inefficient platform for executing workflows.

Section 2.7 presented the design and implementation of a resource provisioning system called Corral. Although Corral is a general-purpose resource provisioning system, it can greatly benefit the performance of workflow applications executing on clusters and the grid. The system is based on the concept of multi-level scheduling with pilot jobs. This approach eliminates queuing delays by reserving resources, reduces overheads by streamlining resource management, and improves parallelism by allowing the user to specify application-specific scheduling policies.

Section 2.8 showed how the use of Corral can improve the runtime of three real workflow applications. The system was shown to reduce the runtime of an astronomy application by 45% on average without clustering and 11% on average with clustering. These results indicate that a combination of provisioning to reduce queue delays and clustering to amortize scheduling delays provides the best improvement in runtime. In addition, the Epigenome application was used to show how the system can be used to bypass restrictive site scheduling policies that, e.g., limit the number of processors that can be used concurrently. This enabled an order of magnitude reduction in the runtime of the Epigenome application. Finally, Section 2.8.4 showed that the system is being used today to enable the execution of scientifically meaningful workflows, such as those being run by earthquake scientists. The 2009 CyberShake runs showed that Corral can be used to run workflows at a very large scale over a long period of time in a production environment.

Chapter 3
Resource Provisioning in the Cloud

Recently there has been a surge of interest in the field of cloud computing. The term “cloud computing” encompasses many different architectures, technologies and services, but, in general, cloud computing refers to the delivery of software, hardware, or storage resources, as a service, over the Internet [10]. As such, cloud computing is a realization of the long-held goal of utility computing, where users pay for access to computing resources according to their usage, and can provision resources according to their requirements from large, centrally-operated data centers. This delivery model takes advantage of economies of scale in human resources, hardware procurement, and energy consumption, which enables computational infrastructure to be delivered to users with high efficiency and low cost.

Many different types of clouds have been developed to deliver different services and applications at different levels of abstraction. The various types of clouds can be categorized according to the service that they are designed to deliver. These categories are typically named according to the schema “XaaS”, where “X” represents some resource type and “aaS” means “as a service” [192]. The common categories are:

Software as a Service (SaaS) clouds provide hosted web applications, such as web-based email clients, online document editors, and customer relationship management systems. Salesforce.com and Google Documents are examples of SaaS clouds.
Platform as a Service (PaaS) clouds provide higher-level services that are useful for implementing large-scale applications, such as web hosting services, load balancing services, caching services, and database services. PaaS clouds are meant to be used as platforms for deploying applications rather than applications themselves. Google AppEngine [68] and Heroku [75] are examples of PaaS.

Data as a Service (DaaS) clouds provide services for storing data. DaaS clouds are designed for high availability and performance, and provide guarantees as to the reliability and consistency of data. Amazon S3 [9] is an example of a DaaS cloud.

Hardware as a Service (HaaS) clouds provide access to bare metal hardware. Typically they allow users to provision servers and install operating software without the use of virtualization. Grid5000 [72] is an example of an HaaS cloud.

Infrastructure as a Service (IaaS) clouds provide access to basic compute infrastructure in the form of virtual machines. The canonical example of an infrastructure cloud is the Amazon Elastic Compute Cloud (EC2) [8].

These different types of clouds may be offered independently, or, as is the case with Amazon Web Services [5], as a suite of interconnected services. In addition to commercial cloud offerings, many organizations are developing so-called private clouds, which offer services to different business units and users within an organization, and science clouds [95], which are operated as a shared resource for research communities, much like grid sites and high performance computing centers.

The primary focus of this work is IaaS clouds, which are also called infrastructure clouds (we will use the terms “IaaS”, “infrastructure cloud” and “cloud” interchangeably from this point onward). Although all of the various types of clouds can be used to provide some services for workflow applications, infrastructure clouds are the most useful because they provide all of the basic services required by workflow applications, including processors, storage, and network access. In addition, infrastructure clouds are the most common, feature-rich, and mature cloud technologies currently available. The benefits of infrastructure clouds for workflows will be discussed in Section 3.1. Future SaaS and PaaS clouds may be designed specifically for workflow applications, but these systems do not currently exist.

The key enabling technology of infrastructure clouds is virtualization. In particular, infrastructure clouds rely on OS-level virtualization such as what is provided by Xen [12], KVM [102], and similar virtualization technologies. Infrastructure clouds use virtualization to enable the deployment of virtual machines (VMs) on top of clusters of physical server hosts. Each VM provides a virtual hardware environment that, from the perspective of the OS, looks and behaves the same as it would if the OS were operating on unshared physical hardware. Virtual machines boot and run user-specified virtual machine images (VM images). These images are large archive files that contain a complete file system for the virtual machine, including an operating system, libraries, application codes, configuration files, and any other data desired by the user.

The architecture of a typical infrastructure cloud management system is shown in Figure 3.1.
The system consists of a web service that handles requests from users, a resource manager that keeps track of resource availability and schedules user requests, a VM image manager that stores disk images, and a node manager that controls each physical machine and the virtual machines that run on top of it. When a request comes in from a user it is passed from the web service to the resource manager. The resource manager checks to see if there are any physical machines with sufficient resources to execute the request. If there are, then it passes the request to the appropriate node manager(s). The node manager downloads the required VM image from the VM image manager and uses it to initialize a new virtual machine. Once the virtual machine is running, it continues to run until the user terminates it. Users can also upload new VM images through the web service.

In the following sections we explain the benefits of using infrastructure clouds like this one to deploy workflow applications, and describe a system we have developed to deploy complex distributed applications such as workflows on infrastructure clouds.

Figure 3.1: Architecture of a typical infrastructure cloud management system.

3.1 The benefits of cloud computing for workflows

In general, there are many benefits to using infrastructure clouds for all kinds of distributed applications. Rather than enumerating all the various benefits we will focus on those aspects of infrastructure clouds that are particularly useful to scientific workflow applications [87].

Resource provisioning Clouds are fundamentally based on resource provisioning, and infrastructure clouds support provisioning natively. This is in contrast to grids, which, as we saw in Chapter 2, are based on best-effort batch queuing and need to be coerced into supporting provisioning using techniques such as advance reservations and pilot jobs. As such, many of the same benefits that result from resource provisioning in grids also apply to clouds. These benefits include: reduced overheads, application control over scheduling, co-allocation and co-scheduling, support for heterogeneous platforms, and others.

On-demand Infrastructure clouds enable users to allocate resources on-demand. Cloud users can request, and expect to obtain, sufficient resources for their needs at any time. This capability is a result of the massive scale of many commercial infrastructure clouds and is actually an illusion that stems from the fact that most users require only a small fraction of the entire cloud. This feature of clouds has been called the “illusion of infinite resources” [10]. In comparison, other distributed computing technologies used in scientific computing, such as clusters and grids, assume that a single user may need to allocate a large part of the total computing capacity or even the full capacity of the computing resource. The drawback of the on-demand approach is that, unlike the best-effort queuing model used in clusters and grids, it does not provide an opportunity for large requests to queue up waiting for available resources. If sufficient resources are not available to service a request immediately, then the request fails. However, the benefit of on-demand provisioning is that it allows workflows to be more opportunistic in their choice of resources.
Unlike tightly-coupled applications, which need all their resources up-front and would prefer to wait in a queue to ensure priority and fairness, a workflow application can start making progress with only a portion of the total resources desired. The minimum usable resource pool for workflows containing only serial tasks, for example, is one processor. With on-demand provisioning a workflow can allocate as many resources as possible and start making progress immediately.

Elasticity In addition to provisioning resources on-demand, infrastructure clouds also allow users to de-provision resources on-demand. This ability to acquire and release resources as needed is called elasticity [10]. It is a very useful feature for workflow applications because it enables workflow systems to easily grow and shrink the available resource pool as the needs of the workflow change over time. Common workflow structures such as data distribution and data aggregation (also known as fork-join parallelism) can significantly change the amount of parallelism in a workflow over time [15]. These changes lead naturally to situations in which it may be profitable to acquire or release resources to more closely match the needs of the application and ensure that resources are being fully utilized.

Legacy Applications Workflow applications frequently consist of a collection of heterogeneous software components developed at different times for different uses by different people. Part of the job of a workflow management system is to weave these components into a single, coherent application without changing the components themselves. Often the scientists developing a workflow application do not want to modify codes that may have been designed and tested many years earlier for fear of introducing bugs that may affect the scientific validity of outputs. This requirement can be challenging depending on the environment for which the components were developed and the assumptions made by the developer. Such components are often brittle and require a specific software environment to execute successfully. The use of OS-level virtualization in infrastructure clouds enables the environment to be customized to suit the application. Specific operating systems, libraries, and software packages can be installed; directory structures required by the application can be created; input data can be copied into specific locations; and complex configurations can be constructed. The resulting environment can be bundled up as a virtual machine image and easily deployed on a cloud to run the workflow.

Reproducibility Reproducibility is one of the cornerstones of science. In order for a scientific experiment to be validated it is important for multiple scientists to reproduce and confirm it over time. This has been true of physical experiments for a long time and is more recently being applied to computational science and in silico experiments. For example, CSEP (Collaboratory for the Study of Earthquake Predictability [36]), which is attempting to develop a virtual laboratory for testing earthquake predictions, requires that study participants are able to repeat the exact calculations that were used to produce every prediction. Virtualization enhances reproducibility for workflows and other scientific applications such as CSEP by capturing the exact environment that was used to execute the application.
The VM images used to run the application can be stored and redeployed to create the exact same environment used for a prior experiment, with the exact same operating system, configuration, and versions of libraries and application codes. This capability ensures that future researchers will be able to reproduce exactly the environment that was used to perform a computation.

Provenance The term provenance refers to the origin and history of digital objects such as data sets or images [117]. In computational science, provenance is the metadata about a computation that can be used to answer questions about the origins and derivation of data produced by the computation. It is closely related to reproducibility in that it helps scientists to understand and reconstruct an experiment. If a workflow is executed in the cloud using a virtual machine, then the VM image can be stored along with the provenance of the workflow [73]. This enables the scientist to answer many important questions about the results of a workflow run, such as: What version of the simulation code was used to produce the data? Which library was used? How was the software installed and configured?

3.2 Virtual Clusters

Users of scientific workflows would like to deploy workflow applications in the cloud to take advantage of all the benefits outlined above. However, existing cloud management systems do not have good support for the types of environments required by workflows. Although infrastructure clouds provide web services for deploying virtual machine images, they do not provide services for installing, configuring and managing the software required by workflows. Traditional HPC systems, such as clusters and grids, are pre-configured with many of the computing services that are necessary to run workflow applications. These services include resource management systems (e.g. batch schedulers) for distributing application tasks among a group of worker nodes, shared file systems that enable worker nodes to access input and output data for the workflow, and data transfer servers that enable workflow inputs and outputs to be transferred to/from the execution site. In order to deploy workflow applications in the cloud it is necessary to construct clusters of virtual machines (called virtual clusters [54]) that emulate these traditional environments. It is important to keep in mind that, although there are some clouds that offer services that could be used to replace workflow management systems (e.g. queue services), our goal is not to build a new workflow management system, but rather to enable existing workflow systems to be deployed in the cloud.

To motivate the need for virtual clusters, consider an example virtual cluster for the Pegasus workflow management system as shown in Figure 3.2. This configuration uses Condor as a resource management system, and NFS as a shared file system (other services may be required in practice, but we limit the discussion to these two for simplicity). The virtual cluster consists of a master node that runs the Condor central manager and exports an NFS file system, and N worker nodes that run Condor worker daemons and mount the NFS file system. The master also has Pegasus and DAGMan installed to plan the workflow and manage its execution. The steps involved in deploying this virtual cluster are:

1. Provision N+1 virtual machines for the master node and N workers
2. Export an NFS file system on the master node
3. Install and configure Condor on the master node
4. Start the Condor master
5. Mount the NFS file system on all N worker nodes
6. Install and configure Condor on all N worker nodes
7. Start Condor on all N worker nodes

Figure 3.2: Virtual cluster for the Pegasus workflow management system that uses NFS and Condor.

Ideally this procedure would be automated. If N is small, then the setup could easily be performed manually, but if N is large, then it is not feasible to manually configure all of the nodes. Doing so would be far too time-consuming and error-prone. An alternative solution would be to write scripts to perform these steps automatically. In this approach, each node would be configured to run a script when it boots that installs, configures and starts the required services. The problem with the scripting approach is that there are some steps that are difficult to script. For example, steps 5 and 6 require all worker nodes to know the IP address of the master node in order to generate configuration files for Condor and mount the NFS file system. Unfortunately the master's IP address is assigned dynamically and there is no simple way for a script on the worker to discover that information without using an external information source. In addition, there are ordering constraints that must be followed. For example, step 5 needs to happen after step 2, but there is no way for the worker node to know when the master has finished exporting the file system. Finally, scripts can be easily written to deploy static configurations, or configurations that change slowly, but the approach does not scale when there are several overlapping configurations that must be maintained. For example, there may be configurations where NFS is used with different resource managers, such as the Portable Batch System (PBS) [134, 173], Sun Grid Engine (SGE) [62] or Load Sharing Facility (LSF) [146]. In that case, it would be nice to have a more modular approach that allows the user to select which combination of software roles to deploy on each node.

The process of deploying a virtual cluster involves provisioning resources in the form of virtual machines and configuring them with the appropriate software and services required to run distributed applications. Our goal is to develop tools that enable users to easily and efficiently deploy such virtual clusters.

3.3 Related Work

The problem of provisioning virtual clusters in the cloud is similar to the problem of provisioning personal clusters in the grid. As we saw in Chapter 2, many different pilot job systems have been developed to enable resource provisioning in the grid [89, 118, 151, 159, 175, 184]. For the most part, these systems are designed to support a specific application or virtual organization. As such, they have not been generalized to support arbitrary software configurations.

The problem of deploying virtual clusters is also similar to the datacenter automation problem. Several systems have been developed to address this issue in UNIX systems, including Cfengine [22, 23], Puppet [92], Chef [135], and others. The primary function of these systems is to ensure that nodes in a datacenter adopt and maintain the configuration specified by the user. They periodically audit nodes to ensure that the user's policies are not violated, and take specific actions (e.g. update a file, restart a service, etc.)
to bring any nodes in violation of the policy back into compliance. These systems can be used to manage cloud resources, but because they treat nodes independently they are not well-suited to the problem of deploying virtual clusters. In particular, they do not support the notion of dependencies between nodes and automatic provisioning of resources.

Configuring compute clusters is a well-known systems administration problem. In the past many cluster management systems have been developed to enable system administrators to easily install and maintain high-performance computing clusters [20, 81, 144, 116, 178, 197]. Of these, Rocks [142] is perhaps the best-known example. These systems assume that the cluster is deployed on physical machines that are owned and controlled by the user. Often they rely on techniques using DHCP and PXE network boot that are not available in the cloud, and do not support virtual machines provisioned from cloud providers. As a result, they are not well-suited to the problem of provisioning virtual clusters in cloud environments where the hardware is not controlled by the user.

Constructing clusters on top of virtual machines has been explored by several previous research efforts. These include VMPlants [101], StarCluster [115], and others [120, 126]. These systems typically assume a fixed architecture that consists of a head node and N worker nodes. They also typically support only a single type of cluster software, such as SGE, Condor, or Globus. In comparison, a general-purpose tool for deploying virtual clusters should support complex application architectures consisting of many interdependent nodes and custom, user-defined plugins.

The Nimbus Context Broker (NCB) [96] supports the deployment of virtual clusters in clouds using the Nimbus cloud management system [95]. NCB supports arbitrary, user-defined configurations, but they must be pre-installed in the VM image, which makes the system little better than a scripting approach. In addition, since NCB works primarily with the Nimbus cloud management system, it is not possible to deploy virtual clusters across heterogeneous clouds. Finally, NCB does not allow nodes to be added or removed from a virtual cluster once it has been deployed. This makes dynamic provisioning for workflows impossible.

Recently, other groups have recognized the need for virtual cluster deployment services, and are developing solutions to the problem. One example is cloudinit.d [18], which enables users to deploy and monitor interdependent services in the cloud. Cloudinit.d only allows one service to be deployed on each node in a virtual cluster, which is unfortunate because many virtual cluster configurations that are useful to workflow applications require each node to implement multiple roles, for example, to behave as both a worker and a file system node [90].

3.4 System Requirements

Based on our experience running workflow applications in the cloud [91, 90, 181], and our experience using the Nimbus Context Broker [96], we have developed the following requirements for a service to deploy virtual clusters for workflow applications in the cloud:

Automatic deployment of distributed applications. Distributed applications used in science and engineering research often require resources for short periods in order to complete a complex simulation, to analyze a large dataset, or to complete an experiment. This
Unfortunately, distributed applications often require complex envi- ronments in which to run. Setting up these environments involves many steps that must be repeated each time the application is deployed. In order to minimize errors and save time, it is important that these steps are automated. A deployment service should enable a user to describe the nodes and services they require, and then automatically provision, and configure the application on-demand. This process should be simple and repeatable, using modular components that can be recombined in different ways to create new environments. User-defined configuration. It is likely that each application will require a different con- figuration. It should be possible for users to create custom deployment configurations. These configurations should be easy to create and should be automatically deployed to the nodes. Users should also be able to select generic virtual machine images and have their configurations deployed from scratch. It should not be necessary to create a new VM image for every application or configuration. Information distribution. When configuring a virtual cluster it may be necessary for some nodes to discover information about other nodes. In many cases, this information is not known before resources are provisioned. For example, a client node may require the IP address or host name of a server node, which is assigned to the server dynamically when it is provisioned. It should be possible for nodes to publish information about themselves to a central broker where it can be discovered by other nodes for use in configuration processes. Complex dependencies. Distributed systems often consist of many services deployed across a collection of hosts. These services include batch schedulers, file systems, databases, web servers, caches, and others. Often, the services in a distributed applica- tion depend on one another for configuration values, such as IP addresses, host names, and port numbers. In order to deploy such an application, the nodes and services must be configured in the correct order according to their dependencies. These dependencies 46 can be expressed as a DAG (directed acyclic graph) where the nodes in the graph repre- sent VMs, and the edges represent dependencies. Some previous systems for constructing virtual clusters have assumed a fixed architecture consisting of a head node and a collec- tion of worker nodes [101, 120, 126, 115]. This static model severely limits the type of applications that can be deployed. A virtual cluster provisioning system should support complex applications with arbitrary dependencies, and enable nodes to publish metadata (e.g. the host and port of a service) that can be queried and used to configure other nodes. Dynamic provisioning. The resource requirements of distributed applications often change over time. For example, a workflow application may require many worker nodes during the initial stages of a computation, but only a few nodes during the later stages. Similarly, an e-commerce application may require more web servers during daylight hours, but fewer web servers at night. A deployment service for distributed applications should support dynamic provisioning by enabling the user to add and remove nodes from a deployment at runtime. This should be possible as long as the deployment’s dependencies remain valid when the node is added or removed. This capability could be used along with elastic provisioning algorithms (e.g. 
[114]) to easily adapt deployments to the needs of an application at runtime.

Multiple cloud providers. In the event that a single cloud provider is not able to supply sufficient resources for an application, or reliability concerns demand that an application is deployed across independent data centers, it may become necessary to provision resources from several cloud providers at the same time. This capability is known as federated cloud computing or sky computing [97]. A deployment service should support multiple resource providers with different provisioning interfaces, and should allow a single application to be deployed across multiple clouds.

Heterogeneous cloud platforms. Many different infrastructure clouds are being developed in both commercial and academic settings. Each of these systems has a different interface for specifying and provisioning virtual machines. Rather than requiring application developers to implement the same functionality for different clouds, the provisioning system should provide an abstraction that exposes common functionality through a single interface. Using this interface it should be possible to specify a virtual cluster description that can be deployed on multiple clouds with little or no changes.

Monitoring. Long-running services may encounter problems that require user intervention. In order to detect these issues, it is important to continuously monitor the state of a deployment in order to check for problems. A deployment service should make it easy for users to specify tests that can be used to verify that a node is functioning properly. It should also automatically run these tests and update the node's status when errors occur so that the user is aware of the problem.

In addition to these functional requirements, the system should exhibit other characteristics important to distributed systems, such as scalability, reliability, and usability.

3.5 Architecture and Implementation

We have developed a system called Wrangler to support the requirements outlined above. The components of the system are shown in Figure 3.3. They include: clients, a coordinator, agents, and plugins.

Clients run on each user's machine and send requests to the coordinator to launch, query, and terminate deployments. Clients have the option of using a command-line tool or XML-RPC to interact with the coordinator.

The Coordinator is a web service that manages application deployments. It accepts requests from clients, provisions nodes from cloud providers, collects information about the state of a deployment, and acts as an information broker to aid application configuration. The coordinator stores information about its deployments in an SQLite database.

Figure 3.3: Wrangler architecture

Agents run on each of the provisioned nodes to manage their configuration and monitor their health. The agent is responsible for collecting information about the node (such as its IP addresses and hostnames), reporting the state of the node to the coordinator, configuring the node with the software and services specified by the user, and monitoring the node for failures.

Plugins are user-defined scripts that implement the behavior of a node. They are invoked by the agent to configure and monitor a node. Each node in a deployment can be configured with multiple plugins.
3.5.1 Specifying Deployments

Users specify their deployment using a simple XML format. Each XML request document describes a deployment consisting of several nodes, which correspond to virtual machines. Each node has a provider that specifies the cloud resource provider to use for the node, and defines the characteristics of the virtual machine to be provisioned, including the VM image to use and the hardware resource type, as well as the authentication credentials required by the provider. Each node has one or more plugins, which define the behaviors, services and functionality that should be implemented by the node. Plugins can have multiple parameters, which enable the user to configure the plugin, and are passed to the script when it is executed on the node. Nodes may be members of a named group, and each node may depend on zero or more other nodes or groups.

An example deployment is shown in Figure 3.4. The example describes a cluster of 4 nodes: 1 NFS server node, and 3 NFS client nodes. The clients, which are identical, are specified as a single <node> with a count of three. All nodes are to be provisioned from Amazon EC2, and different images and instance types are specified for the server and the clients. The server is configured with an nfs_server.sh plugin, which starts the required NFS services and exports the /mnt directory. The clients are configured with an nfs_client.sh plugin, which starts NFS services and mounts the server's /mnt directory as /nfs/data. The SERVER parameter of the nfs_client.sh plugin contains a <ref> tag. This parameter is replaced with the IP address of the server node at runtime and used by the clients to mount the NFS file system. The clients are part of a clients group, and depend on the server node, which ensures that the NFS file system exported by the server will be available for the clients to mount when they are configured.

<deployment>
  <node name="server">
    <provider name="amazon">
      <image>ami-912837</image>
      <instance-type>c1.xlarge</instance-type>
      ...
    </provider>
    <plugin script="nfs_server.sh">
      <param name="EXPORT">/mnt</param>
    </plugin>
  </node>
  <node name="client" count="3" group="clients">
    <provider name="amazon">
      <image>ami-901873</image>
      <instance-type>m1.small</instance-type>
      ...
    </provider>
    <plugin script="nfs_client.sh">
      <param name="SERVER">
        <ref node="server" attribute="local-ipv4"/>
      </param>
      <param name="PATH">/mnt</param>
      <param name="MOUNT">/nfs/data</param>
    </plugin>
    <depends node="server"/>
  </node>
</deployment>

Figure 3.4: Example request for a 4-node virtual cluster with a shared NFS file system

3.5.2 Deployment Process

Here we describe the process that Wrangler goes through to deploy an application, from the initial request to termination.

Request. The client sends a request to the coordinator that includes the XML descriptions of all the nodes to be launched, as well as any plugins used. The request can create a new deployment, or add nodes to an existing deployment.

Provisioning. Upon receiving a request from a client, the coordinator first validates the request to ensure that there are no errors. It checks that the request is well formed, that all dependencies can be resolved, and that no dependency cycles exist. Then it contacts the resource providers specified in the request and provisions the appropriate type and quantity of virtual machines. In the event that network timeouts and other transient errors occur during provisioning, the coordinator automatically retries the request.
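The dependency validation mentioned above amounts to checking that the nodes and their depends relationships form a directed acyclic graph. The following Python sketch shows one way such a check could be written; it is illustrative only, not Wrangler's actual implementation.

# Illustrative dependency-cycle check using Kahn's algorithm.
# deps maps each node name to the set of node names it depends on;
# every name mentioned must also appear as a key.
from collections import deque

def has_cycle(deps):
    indegree = {name: len(parents) for name, parents in deps.items()}
    dependents = {name: [] for name in deps}
    for name, parents in deps.items():
        for parent in parents:
            dependents[parent].append(name)
    ready = deque(name for name, count in indegree.items() if count == 0)
    visited = 0
    while ready:
        name = ready.popleft()
        visited += 1
        for child in dependents[name]:
            indegree[child] -= 1
            if indegree[child] == 0:
                ready.append(child)
    return visited != len(deps)  # a node that never became ready implies a cycle

# The NFS example from Figure 3.4: the clients depend on the server.
print(has_cycle({"server": set(), "client": {"server"}}))  # False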
The coordinator is designed to support many different cloud providers. It currently supports Amazon EC2 [8], Eucalyptus [131], and OpenNebula [133]. Adding additional providers is designed to be relatively simple. The only functionalities that a cloud interface must provide are the ability to launch and terminate VMs, and the ability to pass custom contextualization data to a VM. The system does not assume anything about the network connectivity between nodes, so that an application can be deployed across many clouds. The only requirement is that the coordinator can communicate with agents and vice versa.

Startup and Registration. When the VM boots up, it starts the agent process. This requires the agent software to be pre-installed in the VM image. The advantage of this approach is that it offloads the majority of the configuration and monitoring tasks from the coordinator to the agent, which enables the coordinator to manage a larger set of nodes. The disadvantage is that it requires users to re-bundle images to include the agent software, which is not a simple task for many users and makes it more difficult to use off-the-shelf images. In the future we plan to investigate ways to install the agent at runtime to avoid this issue.

When the agent starts, it uses a provider-specific adapter to retrieve contextualization data passed by the coordinator, and to collect attributes about the node and its environment. The attributes collected include the public and private hostnames and IP addresses of the node, as well as any other relevant information available from the metadata service, such as the availability zone. The contextualization data includes the host and port where the coordinator can be contacted, the ID assigned to the node by the coordinator, and the node's security credentials. Once the agent has retrieved this information, it is sent to the coordinator as part of a registration message, and the node's status is set to registered. At that point, the node is ready to be configured.

Configuration. When the coordinator receives a registration message from a node it checks to see if the node has any dependencies. If all the node's dependencies have already been configured, the coordinator sends a request to the agent to configure the node. If they have not, then the coordinator waits until all dependencies have been configured before proceeding.

After the agent receives a command from the coordinator to configure the node, it contacts the coordinator to retrieve the list of plugins for the node. For each plugin, the agent downloads and invokes the associated plugin script with the user-specified parameters, resolving any <ref> parameters that may be present. If the plugin fails with a non-zero exit code, then the agent aborts the configuration process and reports the failure to the coordinator, at which point the user must intervene to correct the problem. If all plugins were successfully started, then the agent reports the node's status as configured to the coordinator.

Upon receiving a message that the node has been configured, the coordinator checks to see if there are any nodes that depend on the newly configured node. If there are, then the coordinator attempts to configure them as well. It makes sure that they have registered, and that all dependencies have been configured. The configuration process is complete when all agents report to the coordinator that they are configured.
Monitoring. After a node has been configured, the agent periodically monitors the node by invoking all of the node's plugins with the status command. After checking all the plugins, a message is sent to the coordinator with updated attributes for the node. If any of the plugins report errors, then the error messages are sent to the coordinator and the node's status is set to failed.

Elastic Provisioning. After all the nodes in a deployment are configured it is possible for the user to both add more nodes and remove existing nodes. Nodes can be added at any time as long as the nodes that they are to depend on exist and have not failed. Nodes can be removed as long as there are no other configured nodes that depend on them.

Termination. When the user is ready to terminate one or more nodes, they send a request to the coordinator. The request can specify a single node, several nodes, or an entire deployment. Upon receiving this request, the coordinator sends messages to the agents on all nodes to be terminated, and the agents send stop commands to all of their plugins. Once the plugins are stopped, the agents report their status to the coordinator, and the coordinator contacts the cloud provider to terminate the node(s).

3.5.3 Plugins

Plugins are user-defined scripts that implement the application-specific behaviors of each node. There are many different types of plugins that can be created. These include service plugins that start daemon processes, application plugins that install software used by the application, configuration plugins that apply application-specific settings, data plugins that download and install application data, and monitoring plugins that validate the state of the node.

Plugins are the modular components of a deployment. Several plugins can be combined to specify the behavior of a node, and well-designed plugins can be reused for many different applications. For example, NFS server and NFS client plugins can be combined with plugins for different batch schedulers, such as Condor [108], PBS [134], or Sun Grid Engine [62], to deploy many different types of compute clusters. We envision that there could be a central repository where users share useful plugins with each other.

Plugins are implemented as simple scripts that run on the nodes to perform all of the actions required by the application. They are transferred from the client (or, potentially, a central repository) to the coordinator when a node is provisioned, and from the coordinator to the agent when a node is configured. This enables users to easily define, modify, and reuse custom plugins.

Plugins are typically written as shell, Perl, Python, or Ruby scripts, but could be any executable program that conforms to the required interface. This interface defines the interactions between the agent and the plugin, and involves two components: parameters and commands. Parameters are the configuration variables that can be used to customize the behavior of the plugin. They are specified in the XML request document described above (for example, the EXPORT parameter of the nfs_server.sh plugin in Figure 3.4). The agent passes these parameters to the plugin as environment variables when the plugin is invoked. Commands are specific actions that must be performed by the plugin to implement the plugin lifecycle. The agent passes commands to the plugin executable as arguments. There are three commands that tell the plugin what to do: start, stop, and status.
The start command tells the plugin to perform the behavior requested by the user. It is invoked when the node is being configured. Most plugins will implement this command. The stop command tells the plugin to stop any running services and clean up. This command is invoked before the node is terminated. Only plugins that must be shut down gracefully need to implement this command. The status command tells the plugin to check the state of the node for errors. This command can be used, for example, to verify that a service started by the plugin is running. Only plugins that need to monitor the state of the node or long-running services need to implement this command.

If at any time the plugin exits with a non-zero exit code, then the node's status is set to failed. Upon failure, the output of the plugin is collected and sent to the coordinator to simplify debugging and error diagnosis.

The plugin can advertise node attributes by writing key=value pairs to a file specified by the agent in an environment variable. These attributes are merged with the node's existing attributes and can be queried by other nodes in the virtual cluster using <ref> tags or a command-line tool. For example, an NFS server node can advertise the address and path of an exported file system that NFS client nodes can use to mount the file system. The status command can be used to periodically update the attributes advertised by the node, or to query and respond to attributes updated by other nodes.

A basic plugin for Condor worker nodes is shown in Figure 3.5. This plugin generates a configuration file and starts the condor_master process when it receives the start command, kills the condor_master process when it receives the stop command, and checks to make sure that the condor_master process is running when it receives the status command.

#!/bin/bash -e
PIDFILE=/var/run/condor/master.pid
SBIN=/usr/local/condor/sbin

if [ "$1" == "start" ]; then
    # Got start command
    # CONDOR_HOST is a parameter that should be set
    # in the XML description for this plugin.
    if [ -z "$CONDOR_HOST" ]; then
        echo "CONDOR_HOST not specified"
        exit 1
    fi
    # Generate Condor configuration file
    cat > /etc/condor/condor_config.local <<END
CONDOR_HOST = $CONDOR_HOST
END
    # Start Condor
    $SBIN/condor_master -pidfile $PIDFILE
elif [ "$1" == "stop" ]; then
    # Got stop command
    # Kill condor
    kill -QUIT $(cat $PIDFILE)
elif [ "$1" == "status" ]; then
    # Got status command
    # Make sure condor is running
    kill -0 $(cat $PIDFILE)
fi

Figure 3.5: Example plugin used for Condor workers.

3.5.4 Dependencies and Groups

Dependencies ensure that nodes are configured in the correct order so that services and attributes published by one node can be used by another node. When a dependency exists between two nodes, the dependent node will not be configured until after its dependency has been configured. Dependencies are valid as long as they do not form a cycle that would prevent the application from being deployed.

Applications that deploy sets of nodes to perform a collective service, such as parallel file systems and distributed caches, can be configured using named groups. Groups are used for two purposes. First, a node can depend on several nodes at once by specifying that it depends on the group. This is simpler than specifying dependencies between the node and each member of the group. These types of groups are useful for services such as Memcached clusters where the clients need to know the addresses of each of the Memcached nodes.
Second, groups that depend on themselves form co-dependent groups. Co-dependent groups enable a limited form of cyclic dependencies and are useful for deploying some peer-to-peer systems and parallel file systems that require each node implementing the service to be aware of all the others. Nodes that depend on a group are not configured until all of the nodes in the group have been configured. Nodes in a co-dependent group are not configured until all members of the group have registered. This ensures that the basic attributes of the nodes that are collected during registration, such as IP addresses, are available to all group members during configuration, and breaks the deadlock that would otherwise occur with a cyclic dependency.

3.5.5 Security

Wrangler uses SSL for secure communications between all components of the system. Authentication of clients is accomplished using a username and password. Authentication of agents is done using a random key that is generated by the coordinator for each node. This authentication mechanism assumes that the cloud provider's provisioning service provides the capability to securely transmit the agent's key to each VM during provisioning.

3.6 Evaluation

The performance of Wrangler is primarily a function of the time it takes for the underlying cloud management system to start the VMs. Wrangler adds to this a relatively small amount of time for nodes to register and be configured in the correct order. With that in mind, we conducted a few basic experiments to determine the overhead of deploying applications using Wrangler.

We conducted experiments on three separate clouds: Amazon EC2, NERSC's Magellan cloud [124], and FutureGrid's Sierra cloud [60]. EC2 uses a proprietary cloud management system, while Magellan and Sierra both use the Eucalyptus cloud management system [131]. We used identical CentOS 5.5 VM images, and the m1.large instance type, on all three clouds.

3.6.1 Base VM provisioning time

The first thing we measured is the base provisioning time of a single VM without Wrangler. This includes the time required to allocate hardware for the VM, distribute the VM image, and boot up an OS on the VM. The base provisioning time will serve as a benchmark to compare with the measurements of our system. The times we observed are shown in Table 3.1.

Table 3.1: Single VM provisioning time.

Site        Min        Max        Mean       Std. Dev.
Amazon      44.93 s    64.53 s    55.40 s    4.82 s
Magellan    96.06 s    130.74 s   104.88 s   10.22 s
Sierra      337.55 s   584.92 s   428.70 s   88.12 s

There was a large difference between the fastest cloud (Amazon) and the slowest (Sierra). The superior performance of Amazon EC2 is explained by the use of EBS for storing VM images on EC2. EBS eliminates the need to copy images to each VM across the network, and has the ability to make O(1) copies of an image. In comparison, Magellan and Sierra copy and transfer the entire 3 GB image across the network to each VM. Magellan accomplishes this faster than Sierra because Magellan uses a 10GigE network while Sierra uses a 1GigE network.

In addition to observing a large difference between cloud providers, we also observe a relatively large range of times for a given provider. This is important because a deployment in Wrangler is not complete until all nodes have been configured, which, as the size of the cluster increases, will result in provisioning times that tend toward the maximum shown in Table 3.1.
3.6.2 Deployment with no plugins

The next experiment we performed was provisioning a simple vanilla cluster with no plugins. This experiment measures the time required to provision N nodes from a single provider, and for all nodes to register with the coordinator. The results of this experiment are shown in Table 3.2.

Table 3.2: Mean provisioning time for a simple deployment with no plugins.

Site        2 Nodes    4 Nodes    8 Nodes    16 Nodes
Amazon      55.8 s     55.6 s     69.9 s     112.7 s
Magellan    101.6 s    102.1 s    131.6 s    206.3 s
Sierra      371.0 s    455.7 s    500.9 s    FAIL

In most cases we observe that the provisioning time for a virtual cluster is comparable to the time required to provision a single VM. For larger clusters we observe that the provisioning time is up to twice the maximum observed for one VM. This is a result of two factors. First, nodes for each cluster were provisioned in serial, which added 1-2 seconds onto the total provisioning time for each node. In the future we plan to investigate ways to provision VMs in parallel to reduce this overhead. Second, on Magellan and Sierra there were several outlier VMs that took much longer than expected to start, possibly due to the increased load on the provider's network and services caused by the larger number of simultaneous requests. Note that we were not able to collect data for Sierra with 16 nodes because the failure rate on Sierra while running these experiments was about 8%, which virtually guaranteed that at least 1 out of every 16 VMs failed.

3.6.3 Deployment for workflow applications

In the next experiment we again launch a virtual cluster deployment using Wrangler, but this time we add plugins for the Pegasus workflow management system [45], DAGMan [38], Condor [108], and NFS to create an environment that is similar to what we have used for executing real workflow applications in the cloud [90]. The deployment consists of a master node that manages the workflow and stores data, and N worker nodes that execute workflow tasks, as shown in Figure 3.6.

Figure 3.6: Deployment used for workflow applications (a master node running Pegasus, DAGMan, the Condor master, and an NFS server, and N worker nodes each running a Condor worker and an NFS client).

The results of this experiment are shown in Table 3.3.

Table 3.3: Provisioning time for a deployment used to execute workflow applications.

Site        2 Nodes    4 Nodes    8 Nodes    16 Nodes
Amazon      101.2 s    111.2 s    98.5 s     112.5 s
Magellan    173.9 s    175.1 s    185.3 s    349.8 s
Sierra      447.5 s    433.0 s    508.5 s    FAIL

By comparing Table 3.2 and Table 3.3 we can see that it takes on the order of 1-2 minutes for Wrangler to run all the plugins once the nodes have registered, depending on the target cloud and the number of nodes. The majority of this time is spent downloading and installing software, and waiting for all the NFS clients to successfully mount the shared file system.

3.6.4 Multi-cloud virtual cluster

To illustrate Wrangler's federated cloud capability we measured the performance of provisioning virtual clusters across several clouds. For this experiment we use a master node outside the cloud to host the Condor pool, and N worker nodes in each cloud to execute tasks. This setup is similar to the one shown in Figure 3.6, with the exception that we elected not to deploy NFS across the WAN, but instead rely on Condor to transfer files for the workflow.
The provisioning time for setting up 3, 6, 12, and 24 worker nodes is shown in Table 3.4. In this experiment we expect, and observe, that the provisioning time is limited by the slowest provider, which in this case is Sierra.

Table 3.4: Provisioning time for virtual clusters of 3, 6, 12, and 24 nodes across 3 cloud providers.

Amazon     Magellan   Sierra     Time
1 Node     1 Node     1 Node     398.38 s
2 Nodes    2 Nodes    2 Nodes    394.49 s
4 Nodes    4 Nodes    4 Nodes    436.77 s
8 Nodes    8 Nodes    8 Nodes    581.20 s

3.7 Example Applications

In this section we describe our experience using Wrangler to deploy scientific workflow applications. Although these applications are scientific workflows, other applications, such as web applications, peer-to-peer systems, and distributed databases, could be deployed as well.

3.7.1 Data Storage Study

Many workflow applications require shared storage systems in order to communicate data products among nodes in a compute cluster. Recently we conducted a study [90] that evaluated several different storage configurations that can be used to share data for workflows on Amazon EC2. This study required us to deploy workflows using four parallel storage systems (Amazon S3, NFS, GlusterFS, and PVFS) in six different configurations using three different applications and four virtual cluster sizes, for a total of 72 different combinations. Due to the large number of experiments required, and the complexity of the configurations, it was not possible to deploy the environments manually. Using Wrangler we were able to create automatic, repeatable deployments by composing plugins in different combinations to complete the study.

The deployments used in the study were similar to the one illustrated in Figure 3.7. This deployment sets up a Condor pool with a shared GlusterFS file system and installs application binaries on each worker node. The deployment consists of three tiers: a master node using a Condor Master plugin, N worker nodes with Condor Worker, file system client, and application-specific plugins, and N file system nodes with a file system peer plugin. The file system nodes form a group so that worker nodes will be configured after the file system is ready. This example illustrates how Wrangler can be used to set up experiments for distributed systems research.

Figure 3.7: Deployment used in the data storage study (a submit host running Condor schedd, Pegasus, and DAGMan; worker nodes running Condor startd, application binaries, and the GlusterFS client; and a file system group of nodes running GlusterFS peers).

3.7.2 Periodograms

Kepler [122] is a NASA satellite that uses high-precision photometry to detect planets outside our solar system. The Kepler mission periodically releases time-series datasets of star brightness called light curves. Analyzing these light curves to find new planets requires the calculation of periodograms, which identify the periodic dimming caused by a planet as it orbits its star. Generating periodograms for the hundreds of thousands of light curves that have been released by the Kepler mission is a computationally intensive job that demands high-throughput distributed computing. In order to manage these computations, a workflow application was developed using the Pegasus workflow management system [45]. This application will be discussed in more detail in Section 4.5.2.
Figure 3.8: Deployment used to execute Periodograms workflows (a master node at ISI running Pegasus and the Condor master, with Condor worker nodes running the Periodograms code on Amazon EC2, NERSC Magellan, and FutureGrid Sierra).

Wrangler was used to deploy the Periodograms application across the Amazon EC2, FutureGrid Sierra, and NERSC Magellan clouds. The deployment configuration is illustrated in Figure 3.8. In this deployment, a master node running outside the cloud manages the workflow, and worker nodes running in the three cloud sites execute workflow tasks. The deployment used several different plugins to set up and configure the software on the worker nodes, including a Condor Worker plugin to deploy and configure Condor, and a Periodograms plugin to install application binaries, among others. This application successfully demonstrated Wrangler's ability to deploy complex applications across multiple cloud providers.

3.8 Summary

The rapidly-developing field of cloud computing offers new opportunities for distributed applications. The unique features of cloud computing, such as on-demand provisioning, virtualization, and elasticity, as well as the emergence of commercial cloud providers, are changing the way we think about deploying and executing distributed applications.

There is still much work to be done in investigating the best way to manage cloud environments, however. Existing infrastructure clouds support the deployment of isolated virtual machines, but do not provide functionality to deploy and configure software, monitor running VMs, or detect and respond to failures. In order to take advantage of cloud resources, new provisioning tools need to be developed to assist users with these tasks.

In this chapter we presented the design and implementation of a system used for automatically deploying distributed applications on infrastructure clouds. The system interfaces with several different cloud resource providers to provision virtual machines, coordinates the configuration and initiation of services to support distributed applications, and monitors applications over time.

We have been using Wrangler since May 2010 to provision virtual clusters for scientific workflow applications on Amazon EC2, the Magellan cloud at NERSC, the Sierra and India clouds on the FutureGrid, and the Skynet cloud at ISI. We have used these virtual clusters to run several hundred workflows for applications in astronomy, bioinformatics and earth science.

So far we have found that Wrangler makes deploying complex, distributed applications in the cloud easy, but we have encountered some issues in using it that we plan to address in the future. Currently, Wrangler assumes that users can respond to failures manually. In practice this has been a problem because users often leave virtual clusters running unattended for long periods. In the future we plan to investigate solutions for automatically handling failures by re-provisioning failed nodes, and by implementing mechanisms to fail gracefully or provide degraded service when re-provisioning is not possible. We also plan to develop techniques for re-configuring deployments, and for dynamically scaling deployments in response to application demand.

Chapter 4

Evaluating the Performance and Cost of Workflows in the Cloud

One of the primary reasons to use distributed computing technologies such as workflows is to improve the performance of large-scale computing tasks.
Today's science and engineering applications often require the analysis of large quantities of data or resource-intensive simulations of complex physical phenomena. Often, executing these applications on a single computer is not practical because of the amount of time required. Workflows and other parallel computing approaches were developed to solve this problem. They allow large applications to be easily parallelized and distributed across a cluster of computers in order to generate meaningful results in a reasonable amount of time.

As a result, performance is one of the primary metrics by which distributed applications are evaluated. Workflow performance is typically determined by measuring the makespan of the workflow, which is the amount of time elapsed from when the first task in the workflow is submitted until the last task finishes. Makespan is also called the "runtime", or "wall time", of the workflow. The smaller the makespan of a workflow, the better the performance.

Given that performance is such a critical aspect of workflows, one of the primary questions with regard to executing workflows in the cloud is: how well do workflows perform in the cloud? As we saw in Chapter 3, many infrastructure clouds are based on commodity resources and virtualization technologies. Considering that commodity in IT is almost synonymous with lower performance, and that virtualization is known to add overhead to low-level operating system functions, one would expect that workflows would perform worse in the cloud than on traditional HPC systems such as grids and clusters. The question is: How much worse? If the use of commodity hardware and virtualization significantly decreases performance, then it is difficult to justify using clouds for performance-sensitive applications such as workflows.

Another important question with regard to workflows in the cloud is: How much does it cost to run workflows in the cloud? Users of traditional HPC systems are typically not concerned about what computations actually cost in terms of dollars. The economic model of most existing high-performance computing facilities is based on service units, which are a kind of virtual currency used in research computing. In this model users apply for service units through an allocation request process, and resource providers grant service units to them based on the quality of their proposal and the amount of computation required by their project. The resource providers then have a formula that translates CPU hours on their computing systems into service units (often 1:1, but some usage, including advance reservations, is charged at a higher rate). Some providers also have formulas for translating storage into service units.

In contrast to this model, many clouds are operated by commercial enterprises that charge real money for access to resources. Although some private clouds (also called science clouds [95]) are being developed, currently these systems are research prototypes and are not ready for production workloads. In the future, however, we may see HPC centers deploying science clouds based on the service unit economic model. Until that happens, commercial providers are the only option for reliable, large-scale cloud infrastructure.

Cost is an important metric for two reasons. First, it is important to know the cost of running workflows in the cloud so that we can budget for cloud resources. In the existing service unit-based system, researchers do not include computation costs in their funding proposals.
If researchers are going to use commercial clouds for their applications, then they need to be able to estimate the cost of those applications in order to prepare accurate budgets. Second, it is important to know whether clouds are economically feasible platforms for workflows. If the cost of running an application on cloud resources is high, then it is difficult to justify spending money on those resources when alternatives are available.

This chapter discusses the results of two studies that were conducted to evaluate the performance and cost of workflow applications in the cloud. The first study compared the performance of running the workflows on cloud resources to grid resources, and measured the resource, storage, and transfer costs of the workflows in the cloud. The second study investigated the use of parallel workflow environments in the cloud by comparing the cost and performance of workflows on several different distributed storage systems that were used to share data within a virtual cluster.

4.1 Related Work

In the last few years many studies have been done to evaluate the performance of science applications in the cloud [4, 51, 74, 84, 85, 121, 137, 153, 183]. Most of these studies have focused on either parallel benchmarks or traditional, tightly-coupled HPC applications such as MPI codes. In comparison, relatively few studies have been done on loosely-coupled applications such as workflows. The one exception is an early study by Hoffa et al. [77], which compared the performance of workflows on a local machine, a virtual machine, a virtual cluster, and a local cluster. This study showed the counterintuitive result that the virtual cluster performed better than the local cluster; however, it is not clear whether the underlying hardware was comparable. In addition, due to storage limitations on the virtual machines, the study only considered a single, small application that only took a few minutes to run.

The cost of science applications in the cloud has received relatively less attention than performance. Some research has been done that considers the cost of science applications at a high level without considering the detailed cost of individual applications [10, 100]. Other studies have focused on the cost of individual resources such as storage [140] and computing [188]. One study by Jackson et al. evaluated the performance and cost of an astronomy application [85], but that work did not consider how their observations would impact other applications. There is only one existing study that focuses specifically on the cost of workflows in the cloud [44]. This study estimated the cost of a workflow application on Amazon EC2 considering computation, storage, and network transfer. Although the study had some interesting results, it only considered a single application, and was done using a simulation that appears to overestimate the actual costs of the application.

4.2 Workflow Applications

Both studies in this chapter used three different workflow applications: an astronomy application (Montage), a seismology application (Broadband), and a bioinformatics application (Epigenomics). These three applications were chosen because they cover a wide range of application domains and a wide range of resource requirements.

Table 4.1: Comparison of application resource usage.

Application   I/O      Memory   CPU
Montage       High     Low      Low
Broadband     Medium   High     Medium
Epigenome     Low      Medium   High
Table 4.1 shows the relative resource usage of these applications in three different categories: I/O, memory, and CPU. In general, applications with high I/O usage are I/O-bound, applications with high memory usage are memory-limited, and applications with high CPU usage are CPU-bound.

Montage [13] creates science-grade astronomical image mosaics using data collected from telescopes. The size of a Montage workflow depends upon the area of the sky (in square degrees) covered by the output mosaic. In the experiments a Montage workflow was created to generate an 8-degree mosaic. The workflow contains 10,429 tasks, reads 4.2 GB of input data, and produces 7.9 GB of output data. Montage is considered to be I/O-bound because it spends more than 95% of its time waiting on I/O operations.

Broadband [70] generates and compares seismograms from several high- and low-frequency earthquake simulation codes. Each workflow generates seismograms for several sources (scenario earthquakes) and sites (geographic locations). For each (source, site) combination the workflow runs several high- and low-frequency earthquake simulations and computes intensity measures of the resulting seismograms. In the Broadband experiments 4 sources and 5 sites were used to create a workflow containing 320 tasks that reads 6 GB of input data and writes 160 MB of output data. Broadband is considered to be memory-limited because more than 75% of its runtime is consumed by tasks requiring more than 1 GB of physical memory.

Epigenome [177] maps short DNA segments collected using high-throughput gene sequencing machines to a previously constructed reference genome using the MAQ software [113]. The workflow splits several input segment files into small chunks, reformats and converts the chunks, maps the chunks to a reference genome, merges the mapped sequences into a single output map, and computes the sequence density for each location of interest in the reference genome. The Epigenome workflow used in the experiments maps human DNA sequences to a reference chromosome 21. The workflow contains 81 tasks, reads 1.8 GB of input data, and produces 300 MB of output data. Epigenomics is considered to be CPU-bound because it spends 99% of its runtime in the CPU and only 1% on I/O and other activities.

4.3 Single Node Study

The first study evaluated the cost and performance of workflows in the cloud on single nodes. The goal was to evaluate and compare the different VM resource types available in the cloud with typical resource types available from HPC systems such as grids and clusters. The three workflow applications were run on a commercial cloud, Amazon EC2 [8], and a comparable grid system, NCSA's Abe cluster [123]. This study was restricted to single nodes in order to make it easier to compare the different resource types available without considering the impact of data movement on workflow performance.

One of the advantages of HPC systems over currently deployed commercial clouds is the availability of high-performance I/O devices. HPC systems commonly provide high-speed networks and parallel file systems, while most commercial clouds use commodity networking and storage devices. These high-performance devices increase workflow performance on HPC systems by making inter-task communication more efficient. In order to have an unbiased comparison of the performance of workflows on EC2 and Abe, the experiments performed for this study attempted to account for these differences by a) running all experiments on single nodes, and b) running experiments using the local disk on both EC2 and Abe, and the parallel file system on Abe.
In order to have an unbiased compari- son of the performance of workflows on EC2 and Abe, the experiments performed for this study 69 Submit Host Amazon Cloud (EC2) VM Instance Condor Master Pegasus DAGMan EBS Volume Condor Worker (a) EC2 NCSA Grid (Abe) Submit Host Worker Node Condor Worker Condor Master Pegasus DAGMan Lustre Corral Head Node Globus PBS (b) Abe Figure 4.1: Execution environments on EC2 and Abe attempted to account for these differences by a) running all experiments on single nodes and b) running experiments using the local disk on both EC2 and Abe, and the parallel file system on Abe. This allowed a direct comparison of the resources of both systems and enabled the performance advantage of the parallel file system on Abe to be quantified. The experiments were deployed as shown in Figure 4.1. In both cases, a submit host running outside the cloud was used to coordinate the workflows. For the EC2 experiments, a worker VM was started inside the cloud to execute workflow tasks. For the Abe experiments, Globus [54] and Corral [89] were used to deploy Condor glideins [59]. The glideins started Condor daemons on the Abe worker node, which contacted the submit host and were used to execute workflow tasks. This approach creates an execution environment on Abe that is equivalent to the EC2 environment. 70 4.3.1 Resources Table 4.2 compares the resource types used for the single node experiments. It lists 6 resource types from EC2 (m1.*, c1.*, and cc1.*) and 2 resource types from Abe (abe.local and abe.lustre). There are several noteworthy details about the resources shown. First, although there is actually only one type of Abe node, there are two types listed in the table: abe.local and abe.lustre. The actual hardware used for these types is the same. The difference is in how I/O was handled. The abe.local experiments used a local partition for I/O, and the abe.lustre experiments used a Lustre partition for I/O. Using the two different names is simply a notational convenience. Second, in terms of computational capacity, the c1.xlarge resource type is roughly equivalent to the abe.local resource type with the exception that abe.local has slightly more memory. This fact is used to estimate the virtualization overhead for the test applications on EC2. Third, in rare cases EC2 assigns Xeon processors for m1.* instances, but for all of the experiments reported here the m1.* instances used were equipped with Opteron processors. The only significant difference is that the Xeon processors have better floating-point performance than Opteron processors (4 FLOP/cycle vs. 2 FLOP/cycle). Fourth, the m1.small instance type is shown having 12 core. This is possible because of virtualization—EC2 nodes are configured to give m1.small instances access to the processor only 50% of the time. This allows a single processor core to be shared equally between two separate m1.small instances. Finally, the cc1.4xlarge instance type, also known as the “cluster compute” type, has a faster processor and network than the other types, and is also fully virtualized (as opposed to paravirtualized) using Xen HVM [93]. HVM improves application performance by reducing virtualization overhead compared to paravirtualization, and enables the nodes to deploy custom kernels, which is not possible with the other EC2 instance types. 4.3.2 Storage To run workflows storage needs to be allocated for 1) application executables, 2) input data, and 3) intermediate and output data. 
In a typical workflow application, executables are pre-installed on the execution site, input data is copied from an archive to the execution site, and output data is copied from the execution site to an archive. For these experiments, executables and input data were pre-staged to the execution site, and output data were not transferred from the execution site.

For the EC2 experiments, executables were installed in the VM images, intermediate and output data was written to a local partition, and input data was stored on EBS volumes. The Elastic Block Store (EBS) [7] is a SAN-like, replicated, block-based storage service that can be used with EC2 instances. EBS volumes can be created in any size between 1 GB and 1 TB and appear as standard, unformatted block devices when attached to an EC2 instance. As such, EBS volumes can be formatted with standard UNIX file systems and used like an ordinary disk, but they cannot be shared between multiple instances.

EBS was chosen to store input data for a number of reasons. First, storing inputs in the cloud obviates the need to transfer input data repeatedly. This saves both time and money because transfers cost more than storage. Second, using EBS avoids the 10 GB limit on VM images, which is too small to include the input data for all the applications tested. Third, EBS does not significantly decrease the I/O performance of the application compared to the local disk. A simple experiment using the disk copy utility dd showed similar performance reading from EBS volumes and the local disk (74.6 MB/s for local, and 74.2 MB/s for EBS). Finally, using EBS simplifies the setup by allowing multiple experiments to reuse the same EBS volume: when changing instance types, the volume can be detached from one instance and re-attached to another.

For Abe, all application executables and input files were stored in the Lustre file system. For abe.local experiments the input data was copied to a local partition (/tmp) before running the workflow, and all intermediate and output data was written to the same local partition. For abe.lustre, all intermediate and output data was written to the Lustre file system.

4.3.3 Performance Results

The critical performance metric for these experiments is the runtime of the workflow (also known as the makespan), which is the total amount of wall clock time from the moment the first workflow task is submitted until the last task completes. The runtimes reported for EC2 do not include the time required to install and boot the VM, which typically averages between 70 and 90 seconds, and the runtimes reported for Abe do not include the queue time of the pilot jobs used to provision resources, which is highly dependent on the current system load. Also, the runtimes do not include the time required to transfer input and output data. It is assumed that this time will be variable depending on WAN conditions.
In these experiments the observed bandwidth between EC2 and the submit host in Marina del Rey, CA was typically on the order of 500-1000 KB/s.

Figure 4.2 shows the runtime of the selected applications using the resource types shown in Table 4.2. In all cases the m1.small resource type had the worst runtime by a large margin. This is not surprising given its relatively low capabilities.

For I/O-intensive workflows like Montage, EC2 is at a significant disadvantage because of the lack of high-performance parallel file systems. While such a file system could conceivably be constructed from the raw components available in EC2, the cost of deploying such a system would be prohibitive (we explore this issue more in Section 4.4). In addition, because most EC2 instance types use commodity 1-Gbps networking equipment, it is unlikely that there would be a significant advantage in shifting I/O from a local partition to a parallel file system across the network, because the bottleneck would simply shift from the disk to the network interface. In order to compete performance-wise with Abe for I/O-intensive applications, an Amazon user would need to deploy a parallel file system using the cc1.4xlarge instance type, which has a 10-Gbps network.

Figure 4.2: Single node runtime comparison (runtime in hours of Montage, Broadband, and Epigenome on each of the resource types in Table 4.2).

For memory-intensive applications like Broadband, EC2 can achieve nearly the same performance as Abe as long as there is more than 1 GB of memory per core. If there is less, then some cores must sit idle to prevent the system from running out of memory or swapping. This is not strictly an EC2 problem; the same issue affects Abe as well.

For CPU-intensive applications like Epigenome, EC2 can deliver comparable performance given equivalent resources.

The virtualization overhead for an application can be estimated by comparing its runtime on c1.xlarge with that of abe.local. Because c1.xlarge and abe.local are very similar hardware-wise, measuring the difference in runtime between these two resource types should provide a good estimate of the performance cost of virtualization. Based on this, the virtualization overhead for all three applications on c1.xlarge is less than 10%. This is consistent with previous studies that show similar overheads [12, 64, 192]. Although there is no comparable grid resource for the cc1.4xlarge type, because it uses hardware-assisted virtualization it is likely that the virtualization overhead for cc1.4xlarge is even smaller than that of c1.xlarge. Based on these results, virtualization does not seem, by itself, to be a significant performance problem for workflows in the cloud. And, as virtualization technologies improve, it is likely that what little overhead there is will be further reduced or eliminated in the future.

4.3.4 Cost Results

When running workflows in the cloud there are three cost categories that should be considered: resource cost, storage cost, and transfer cost. Resource cost includes charges for the use of VM instances in EC2. Storage cost includes charges for keeping VM images in S3 and input data in EBS. And transfer cost includes charges for moving input data, output data and log files between the submit host and EC2.

Resource Cost

In order to better illustrate the real costs of the various instance types, two different methods to calculate the resource cost of a workflow were used: rounded cost and actual cost.
Rounded cost is what Amazon would charge for the usage in reality, which includes rounding usage up to the next hour. Actual cost is what the experiments would cost if Amazon charged for only the time that was used and did not do any rounding. The rounded cost is useful for comparing the real cost of executing a single workflow, while the actual cost is the better choice when considering the price/performance ratios of the different instance types, or when amortizing the resource cost of a workflow over many back-to-back runs.

Figure 4.3 shows the per-workflow resource cost for the applications tested. Although it did not perform the best in any of the experiments, the most cost-effective instance type was c1.medium, which had the lowest rounded cost for all three applications and the lowest actual cost for two of the three applications.

Figure 4.3: Single node resource cost comparison ((a) rounded cost and (b) actual cost, in dollars, for Montage, Broadband, and Epigenome on each EC2 instance type).

Storage Cost

Storage cost consists of a) the cost to store VM images in S3, and b) the cost of storing input data in EBS. Both S3 and EBS use fixed monthly charges for the storage of data, and variable usage charges for accessing the data. The fixed charges are $0.15 per GB-month for S3, and $0.10 per GB-month for EBS. The variable charges are $0.01 per 1,000 PUT operations and $0.01 per 10,000 GET operations for S3, and $0.10 per million I/O operations for EBS. The interesting quantities reported are the fixed cost per month, and the total variable cost for all experiments performed.

Two VM images, one 32-bit and one 64-bit, were used for the experiments in this study. The size of the 32-bit image was 773 MB (compressed) and the size of the 64-bit image was 729 MB (compressed). This results in a total fixed cost of $0.22 per month to store these images on S3. In addition, there were 4,616 GET operations and 2,560 PUT operations to store and retrieve these images from S3 for the experiments. This results in a total variable cost of approximately $0.03. If all of the experiments in this study were performed every day of the month, then the total VM image cost would still be only approximately $1.12 per month. This result suggests that the cost of storing and accessing VM images would be negligible for most applications.

The fixed monthly cost of storing input data for the three applications on EBS is shown in Table 4.3. In addition, there were 3.18 million I/O operations, for a total variable cost of $0.30. The amount of input data for these applications results in relatively low cost; however, this analysis only considers one set of inputs. Although the actual inputs to an individual workflow are small, the universe of possible inputs may be very large. For example, a Montage workflow may access a subset of images from the Two Micron All Sky Survey (2MASS) [176], which contains 2.2 terabytes of data. Storing the entire 2MASS dataset on EBS would cost around $225 per month. This suggests that storing the inputs for an application may be quite expensive depending on the size of the data and the amount of time it needs to be stored. A more detailed study of this problem from the point of view of a Montage image mosaic service is presented in Section 4.5.1.

Table 4.3: Monthly storage cost.

Application   Volume Size   Monthly Cost
Montage       5 GB          $0.66
Broadband     5 GB          $0.60
Epigenome     2 GB          $0.26
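As a rough illustration of how these charges combine, the sketch below computes the rounded and actual resource cost of a run, plus the fixed monthly storage charge, using the hourly rates from Table 4.2 and the S3/EBS rates quoted above. The function names and the example run are ours and purely illustrative; variable I/O and request charges are ignored.

# Rough sketch of the cost model described above (not a billing calculator).
# Rates are the ones quoted in the text; the example run is hypothetical.
import math

def resource_cost(runtime_hours, hourly_rate):
    """Return (actual, rounded) resource cost for one workflow run."""
    actual = runtime_hours * hourly_rate
    rounded = math.ceil(runtime_hours) * hourly_rate  # EC2 bills whole hours
    return actual, rounded

def monthly_storage_cost(ebs_gb=0.0, s3_gb=0.0):
    """Fixed monthly storage charges; variable I/O charges are ignored."""
    return ebs_gb * 0.10 + s3_gb * 0.15  # $/GB-month for EBS and S3

# Example: a 2.5-hour run on c1.medium ($0.17/hour) with 5 GB of input on EBS
actual, rounded = resource_cost(2.5, 0.17)
print(actual, rounded)                    # ~ $0.425 actual, $0.51 rounded
print(monthly_storage_cost(ebs_gb=5.0))   # ~ $0.50 per month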
Transfer Cost

Table 4.4 shows the per-workflow transfer costs for the three applications studied. Amazon charges $0.10 per GB for transfer into, and $0.17 per GB for transfer out of, the EC2 cloud. Input is the amount of input data to the workflow, output is the amount of output data, and logs is the amount of logging data that is recorded for workflow tasks and transferred back to the submit host. The cost of the protocol used by Condor to communicate between the submit host and the workers is not included, but it is estimated to be much less than $0.01 per workflow.

Table 4.4: Per-workflow transfer costs.

Application   Input Size   Input Cost   Output Size   Output Cost   Log Size   Log Cost   Total Cost
Montage       4291 MB      $0.42        7970 MB       $1.32         40 MB      < $0.01    $1.75
Broadband     4109 MB      $0.40        159 MB        $0.03         5.5 MB     < $0.01    $0.43
Epigenome     1843 MB      $0.18        299 MB        $0.05         3.3 MB     < $0.01    $0.23

One interesting thing to note about this data is that for Montage, the I/O-intensive application, the cost of transferring the data ($1.75) is significantly more than the cost of computation ($0.55 on c1.medium). This suggests that it may be particularly beneficial for I/O-intensive applications to pursue optimization strategies (such as data compression or caching, for example) in order to reduce the amount and cost of data transfers.

4.3.5 Cost-Performance Analysis

The first thing to consider when provisioning resources on a commercial cloud is the tradeoff between performance and cost. In general, EC2 resources obey the aphorism "you get what you pay for": resources that cost more perform better than resources that cost less. For the applications tested, c1.medium was the most cost-effective resource type even though it did not have the lowest hourly rate, because the type with the lowest rate (m1.small) performed so badly. In order to better understand this tradeoff, this section presents a cost-performance analysis for the applications and instance types described above.

Figure 4.4 shows the cost-performance plots for the three example applications.

Figure 4.4: Cost-performance comparison of different instance types for Montage, Broadband and Epigenome (execution cost in dollars versus runtime in hours for (a) Montage, (b) Broadband, and (c) Epigenome).

Although, in most cases, these plots do not indicate a single "best" instance type to choose, they can highlight instance types that should not be used. For example, for Montage (Figure 4.4(a)) the m1.small type can be eliminated from consideration because c1.medium is both faster and cheaper. Similar arguments can be made for m1.large, m1.xlarge, and c1.xlarge. The other two instance types, cc1.4xlarge and c1.medium, are both optimal solutions in the sense that there are no other instance types that are both faster and cheaper. The cc1.4xlarge type is the fastest, and c1.medium is the cheapest. Formally, these two instance types comprise the Pareto set of instance types for Montage, and are called Pareto optimal solutions. The Pareto set for Broadband contains only one solution: cc1.4xlarge. That means that cc1.4xlarge is the best instance type for Broadband considering cost and performance. On the other hand, because the Pareto sets for Montage and Epigenome contain more than one instance type, there is no single best choice for these applications. In these cases, choosing an instance type from the Pareto set still involves a cost-performance tradeoff based on the user's requirements.
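The Pareto set described above can be computed mechanically from measured (runtime, cost) pairs. The following sketch shows one way to do so; the numbers in the example are placeholders, not the measured values behind Figure 4.4.

# Sketch: compute the Pareto-optimal instance types from (runtime, cost) pairs.
# A type is Pareto-optimal if no other type is at least as fast AND at least
# as cheap (and strictly better in at least one of the two).
def pareto_set(measurements):
    optimal = []
    for name, (runtime, cost) in measurements.items():
        dominated = any(
            other != name
            and other_rt <= runtime and other_cost <= cost
            and (other_rt, other_cost) != (runtime, cost)
            for other, (other_rt, other_cost) in measurements.items()
        )
        if not dominated:
            optimal.append(name)
    return optimal

# Placeholder measurements for illustration only
print(pareto_set({
    "m1.small":    (8.0, 0.70),
    "c1.medium":   (3.0, 0.55),
    "cc1.4xlarge": (1.5, 1.40),
}))  # -> ['c1.medium', 'cc1.4xlarge']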
4.3.6 Discussion

Based on these experiments it appears that the performance of workflows on EC2 is reasonable given the resources that can be provisioned. Although the EC2 performance in most cases (cc1.4xlarge excluded) was not as good as the performance on the Abe cluster, most of the resources provided by EC2 are also less powerful. When the resources are similar, as was the case for c1.xlarge, the performance was found to be comparable. In the case of cc1.4xlarge, the performance was actually better than Abe, but it is not clear if this is a result of cc1.4xlarge having slightly more powerful processors, or if it is because cc1.4xlarge uses full virtualization, which has less overhead than paravirtualization.

The primary advantages of Abe were found to be the availability of a high-speed interconnect and a parallel file system, which significantly improved the performance of the I/O-intensive application. Factoring out these advantages by running additional Abe tests using the local disk shows that, given equivalent resources, EC2 is capable of I/O performance close to that of Abe.

One important thing to consider when using EC2 is the tradeoff between price and performance. Occasionally there is one instance type that has the best cost/performance ratio for an application. That was the case for Broadband and cc1.4xlarge, for example. More often, however, there will be several instance types that have either lower cost or better performance than the other types, but do not dominate them in both metrics. For Montage and Epigenome, for example, there were several Pareto-optimal instance types. In general, application developers should be aware of the various tradeoffs between different instance types, and benchmark their applications to decide which type meets the requirements of their application, rather than blindly choosing the type with the most resources or the lowest hourly rate.

Another important thing to consider when using EC2 is the tradeoff between storage cost and transfer cost. Users have the option of either a) transferring input data for each workflow separately, or b) transferring input data once, storing it in the cloud, and using the stored data for multiple workflow runs. The choice of which approach to employ will depend on how many times the data will be used, how long the data will be stored, and how frequently the data will be accessed. In general, storage is more cost-effective for input data that is reused often and accessed frequently, and transfer is more cost-effective if data will be used only once. For the applications tested in this chapter, the monthly cost to store input data is only slightly more than the cost to transfer it once. Therefore, for these applications, it is usually more cost-effective to store the input data rather than transfer the data for each workflow.
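As a back-of-the-envelope illustration of this tradeoff, the sketch below compares transferring an input data set for every run against transferring it once and keeping it in EBS for the month, using the transfer-in and EBS rates quoted earlier; the run count is a made-up example.

# Back-of-the-envelope storage-vs-transfer comparison using the rates quoted
# in the text. The run count below is illustrative only.
TRANSFER_IN_PER_GB = 0.10   # $/GB transferred into EC2
EBS_PER_GB_MONTH   = 0.10   # $/GB-month of EBS storage

def monthly_input_cost(input_gb, runs_per_month, store):
    if store:
        # transfer the input once, then keep it on an EBS volume all month
        return input_gb * (TRANSFER_IN_PER_GB + EBS_PER_GB_MONTH)
    # transfer the input separately for every workflow run
    return input_gb * TRANSFER_IN_PER_GB * runs_per_month

# 4.2 GB of Montage input, 10 runs in a month
print(monthly_input_cost(4.2, 10, store=True))    # ~ $0.84
print(monthly_input_cost(4.2, 10, store=False))   # ~ $4.20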
Although the cost of transferring input data can be easily amortized by storing it in the cloud, the cost of transferring output data may be more difficult to reduce. For many applications the output data is much smaller than the input data, so the cost of transferring it out may not be significant. This is the case for Broadband and Epigenome, for example. For other applications the large size of the output data may be cost-prohibitive. In Montage, for example, the output is actually larger than the input and costs more to transfer than it does to compute. For these applications it may be possible to leave the output in the cloud and perform additional analyses there rather than transfer it back to the submit host.

4.4 Data Storage Study

The second study on workflow cost and performance in the cloud evaluated different options for data storage. Unlike tightly-coupled applications, such as MPI jobs, in which tasks communicate directly via the network, workflow tasks typically communicate through the use of files. Each task in a workflow produces one or more output files that become input files to other tasks. When tasks are run on different computational nodes, these files are either saved in a storage system that can be accessed by all the nodes, or transferred as needed from one node to another by the workflow management system. This study assumed the former configuration, and evaluated different options for providing a shared storage system for workflows in the cloud.

4.4.1 Storage Systems

There are many existing storage systems that can be deployed in the cloud. These include various network and parallel file systems, object-based storage systems, and databases. One of the advantages of cloud computing and virtualization is that the user has control over what software is deployed, and how it is configured. However, this flexibility also imposes a burden on the user to determine what system software is appropriate for their application.

This section describes the storage services used for this study and any special configuration or handling that was required to get them to work with the Pegasus workflow management system. A number of different storage systems that span a wide range of storage options were selected. Given the large number of network storage systems available, it is not possible to examine them all. In addition, it is not possible to run some file systems on EC2 because Amazon does not allow kernel modifications (Amazon does allow loadable kernel modules, but many file systems require core kernel patches as well). This is the case for Lustre [136] and Ceph [185], for example. Also, in order to work with workflow application tasks (as they are provided by the domain scientists), the file system either needs to be POSIX-compliant (i.e. it must be possible to mount it as a virtual disk and it must support POSIX operations and semantics), or additional tools need to be used to copy files to/from the local file system, which can result in reduced performance.

It is important to note that the goal of this work is not to evaluate the raw performance of these storage systems in the cloud, but rather to examine application performance in the context of scientific workflows. This involves exploring various options for sharing data in the cloud for workflow applications to determine, in general, how the performance and cost of a workflow is affected by the choice of storage system.
Where possible each storage system has been tuned to deliver the best performance, but there is no way of knowing what combination of parameter values will give the best results for all applications without an exhaustive search. Instead, some simple benchmarks were executed for each storage system to verify that the storage system functions correctly and to determine if there are any obvious parameters that should be changed. As such, the configurations used may not be the best of all possible configurations for the applications considered, but rather represent a typical setup.

Amazon S3

Amazon S3 [9] is a distributed, object-based storage system. It stores untyped binary objects (e.g. files) up to 5 GB in size. It is accessed through a web service that supports both SOAP and a REST-like protocol. Objects in S3 are stored in directory-like structures called buckets. Each bucket is owned by a single user and must have a globally unique name. Objects within a bucket are named by keys. The key namespace is flat, but path-like keys are allowed (e.g. "a/b/c" is a valid key).

Because S3 does not have a POSIX interface, it was necessary to make some modifications to the workflow management system in order to use it. The primary change was adding support for an S3 client, which copies input files from S3 to the local file system before a job starts, and copies output files from the local file system back to S3 after the job completes. The workflow management system was modified to wrap each job with the necessary GET and PUT operations.

Transferring data for each job individually increases the amount of data that must be moved and, as a result, has the potential to reduce the performance of the workflow. With S3, each file must be written twice when it is generated (program to disk, disk to S3) and read twice each time it is used (S3 to disk, disk to program). In comparison, network file systems enable the file to be written once, and read once each time it is used. In addition, network file systems support partial reads of input files and fine-grained overlapping of computation and communication.

In order to reduce the number of transfers required when using S3, a simple whole-file caching mechanism was implemented. Caching is possible because all the workflow applications studied obey a strict write-once file access pattern in which no files are ever opened for updates. This simple caching scheme ensures that each file is transferred from S3 to a given node only once, and saves output files generated on a node so that they can be reused as input for future jobs that may run on the node.

The scheduler that was used to execute workflow jobs does not consider data locality or parent-child affinity when scheduling jobs, and does not have access to information about the contents of each node's cache. Because of this, if a file is cached on one node, a job that accesses the file could end up being scheduled on a different node. A more data-aware scheduler could potentially improve workflow performance by increasing cache hits and further reducing transfers.
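The behavior of this job wrapper can be sketched as follows. This is a simplified illustration of the GET/PUT wrapping and whole-file cache, not the actual implementation used in the experiments; the boto3 client, bucket name, and cache directory are assumptions made for the example.

import os
import shutil
import subprocess
import boto3

s3 = boto3.client("s3")
BUCKET = "workflow-data"          # assumed bucket name, for illustration only
CACHE = "/tmp/s3-cache"           # per-node whole-file cache directory (assumed)

def fetch(key, workdir):
    """Copy an input file from S3 (or the local cache) into the job's working directory."""
    cached = os.path.join(CACHE, key.replace("/", "_"))
    if not os.path.exists(cached):
        s3.download_file(BUCKET, key, cached)    # S3 -> local disk, once per node
    shutil.copy(cached, os.path.join(workdir, os.path.basename(key)))

def store(path, key):
    """Save an output file to the cache and publish it to S3 for downstream jobs."""
    shutil.copy(path, os.path.join(CACHE, key.replace("/", "_")))
    s3.upload_file(path, BUCKET, key)

def run_job(command, inputs, outputs, workdir):
    os.makedirs(CACHE, exist_ok=True)
    for key in inputs:
        fetch(key, workdir)                      # GET phase
    subprocess.check_call(command, cwd=workdir)  # run the wrapped task
    for path, key in outputs:
        store(os.path.join(workdir, path), key)  # PUT phase

Because the applications are write-once, the cache never needs to be invalidated: a file found in the cache can always be reused safely.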
NFS

NFS [157] may be the most commonly-used network file system. Unlike the other storage systems used in this study, NFS is a centralized system with one node that acts as the file server for a group of machines. This puts it at a distinct disadvantage in terms of scalability compared with the other storage systems. For the workflow experiments we provisioned a dedicated node in EC2 to host the NFS file system. The m1.xlarge instance type provided the best NFS performance in benchmarks of all the resource types available at the time of these experiments. This is attributed to the fact that m1.xlarge has a comparatively large amount of memory (16 GB), which facilitates good cache performance. NFS clients were configured to use the async option, which allows calls to NFS to return before the data has been flushed to disk, and atime updates were disabled. NFS was deployed in two configurations: one with a dedicated NFS server node, and one where the NFS server is co-located on a worker node that executes tasks. The latter is called "NFS (shared)" in the results.

GlusterFS

GlusterFS [66] is a distributed file system that supports many different configurations. It has a modular architecture based on components called translators that can be composed to create novel file system configurations. All translators support a common API and can be stacked on top of each other in layers. The translator at each layer can decide to service the call, or pass it to a lower-level translator. This modular design enables translators to be composed into many unique configurations. The available translators include a server translator, a client translator, a storage translator, and several performance translators for caching, threading, pre-fetching, and so on. As a result, there are many ways to deploy a GlusterFS file system.

Two configurations were used in this study: NUFA (non-uniform file access) and distribute. In both configurations, the nodes act as both clients and servers. Each node exports a local volume that is merged with the local volumes of all other nodes to provide a single, virtual volume. In the NUFA configuration, all writes to new files are performed on the local disk, while reads and writes to existing files are either performed across the network or locally depending on where the file was created. Because files in the workflows we tested are never updated, the NUFA configuration results in all writes being directed to the local disk. In the distribute configuration, GlusterFS uses hashing to distribute files among nodes. This configuration results in a more uniform distribution of reads and writes across the virtual cluster compared to the NUFA configuration.

PVFS

PVFS [29] is a parallel file system for Linux clusters. It distributes file data via striping across a number of I/O nodes. In our configuration the same set of nodes was used for both I/O and computation. In other words, each node was configured as both a client and a server. In addition, PVFS was configured to distribute metadata across all nodes instead of having a central metadata server.

Although the latest version of PVFS was 2.8.2 at the time these experiments were conducted, the 2.8 series releases did not run reliably on EC2 without crashes and loss of data. Instead, an older version, 2.6.3, was used and a patch was applied for the Linux kernel used on EC2 (2.6.21). This version ran without crashing, but does not include some of the changes made in later releases to improve support and performance for small files.

4.4.2 Performance Results

This section compares the performance of the selected storage options for workflows on Amazon EC2. Just as in the previous study, the critical performance metric is the total runtime, or makespan, of the workflow.
As before, the runtimes reported do not include the time required to boot and configure the VM, nor do they include the time required to transfer input and output data. Because the sizes of the input files are constant, and the resources are all provisioned at the same time, the file transfer and provisioning overheads are assumed to be independent of the storage system chosen.

In discussing the results for the various storage systems it is useful to consider the I/O workload generated by the applications tested. Each application generates a large number (thousands) of relatively small files (on the order of 1 MB to 10 MB). The write pattern is sequential and strictly write-once (no file is updated after it has been created). The read pattern is primarily sequential, with a few tasks performing random accesses. Because many workflow jobs run concurrently, many files will be accessed at the same time. Some files are read concurrently, but no file is ever read and written at the same time. These characteristics will help to explain the observed performance differences between the storage systems in the following sections.

Note that the GlusterFS and PVFS configurations used require at least two nodes to construct a valid file system, so results with one worker are reported only for S3 and NFS. In addition to the storage systems described above, the comparison included performance results from the previous study for experiments run on a single c1.xlarge instance using the local disk. Performance using the local disk is shown as a single point in the plots labelled "Local".

Montage

The performance results for Montage are shown in Figure 4.5. The characteristic of Montage that seems to have the most significant impact on its performance is the large number (approximately 29,000) of relatively small (a few MB) files it accesses. GlusterFS seems to handle this workload well, with both the NUFA and distribute modes producing significantly better performance than the other storage systems. NFS performs relatively well for Montage, and, surprisingly, the NFS (shared) configuration performs better than the NFS configuration with a dedicated server node. This may be because the tasks executed on the NFS server in the shared configuration were able to perform better by accessing data from a local disk.

The relatively poor performance of S3 and PVFS may be a result of Montage accessing a large number of small files. As was indicated in Section 4.4.1, the version of PVFS used for these experiments does not contain the small-file optimizations added in later releases. S3 performs worse than the other systems on small files because of the relatively large overhead of fetching and storing files in S3, which is on the order of half a second per file. In addition, the Montage workflow does not contain much file reuse, which makes the S3 client cache less effective. One surprising result for Montage is that the single-node, local disk configuration does almost as well as the multiple-node configurations (at a much lower cost).

Epigenome

The performance results for Epigenome are shown in Figure 4.6. Epigenome is mostly CPU-bound, and performs relatively little I/O compared to Montage and Broadband. As a result, the choice of storage system has less of an impact on the performance of Epigenome compared to the other applications. In general, the performance was almost the same for all storage systems, with performance increasing more-or-less uniformly as nodes are added. S3 and PVFS performed slightly worse than NFS and GlusterFS.
The local disk configuration on a single node performed relatively well compared to NFS and S3 because of the lower overhead of accessing the disk directly compared with accessing files over the network.

Figure 4.5: Performance of Montage using different storage systems. (Runtime in seconds versus number of nodes/cores, 1/8 through 8/64, for NFS, GlusterFS NUFA, GlusterFS distribute, PVFS2, S3, Local, and NFS shared.)

Broadband

The performance results for Broadband are shown in Figure 4.7. In contrast to the other applications, the best overall performance for Broadband was achieved using Amazon S3 and not GlusterFS. This is likely due to the fact that Broadband reuses many input files, which improves the effectiveness of the S3 client cache. Many of the transformations in Broadband consist of several executables that are run in sequence like a mini workflow. This would explain why GlusterFS (NUFA) results in better performance than GlusterFS (distribute). In the NUFA case all the outputs of a task are stored on the local disk, which results in much better locality for Broadband's multi-step tasks. An additional Broadband experiment was run using a different NFS server (m2.4xlarge, 64 GB memory, 8 cores) to see if a more powerful server would significantly improve the performance of the dedicated NFS configuration. The result was better than the smaller server for the 4-node case (4368 seconds vs. 5363 seconds), but was still significantly worse than GlusterFS and S3 (less than 3000 seconds in all cases). The decrease in performance using NFS between 2 and 4 nodes was consistent across repeated experiments and was not affected by changing the NFS server configuration. Similar to Montage, Broadband appears to have relatively poor performance on PVFS, possibly because of the large number of small files it generates (more than 5,000).

Figure 4.6: Performance of Epigenome using different storage systems. (Runtime in seconds versus number of nodes/cores for the same set of storage systems.)

Figure 4.7: Performance of Broadband using different storage systems. (Runtime in seconds versus number of nodes/cores for the same set of storage systems.)

4.4.3 Cost Results

This section analyzes the cost of running workflow applications using the selected storage systems. Just as in the previous study, three different cost categories are considered: resource cost, storage cost, and transfer cost.

One important issue to consider when evaluating the cost of running a workflow on multiple nodes is the granularity at which the provider charges for the nodes. In the case of EC2, Amazon charges for nodes by the hour, and any partial hours are rounded up. One important result of this is that there is no cost benefit to adding nodes for workflows that run for less than an hour, even though doing so may improve the makespan.

It should be noted that the storage systems do not have uniform cost profiles. The dedicated NFS configuration is at a disadvantage in terms of cost because of the extra node that was used to host the file system, which results in an extra cost of $0.68 per hour for all applications. The shared NFS configuration was included for comparison to understand this tradeoff better. S3 is also at a disadvantage compared to the other systems because Amazon charges a fee to store data in S3.
This fee is $0.01 per 1,000 PUT operations, $0.01 per 10,000 GET operations, and $0.15 per GB-month of storage (data transfers are free between EC2 and S3). For Montage this results in an extra cost of $0.28 for S3, for Epigenome the extra cost is $0.01, and for Broadband the extra cost is $0.02. Note that the S3 cost is somewhat reduced by caching in the S3 client, and that the monthly storage cost is insignificant for the applications tested (much less than $0.01).

The actual cost and the rounded cost for Montage, Epigenome and Broadband, including the extra charges for NFS and S3, are shown in Figure 4.8. Again, the local disk results from the previous study are included for comparison. The most interesting result is that the single-node, local disk configuration had the lowest cost for all applications, for both rounded cost and actual cost. In all cases the cost only increased as nodes were added because the performance gains were not enough to offset the increased cost of adding a node. This suggests that there is no cost advantage to running a workflow on multiple nodes.

4.4.4 Submit Host Placement

In the previous experiments the worker nodes were deployed in the cloud, but a submit host outside the cloud was used to manage the workflows. This configuration was used for a number of reasons: it is the simplest way to deploy the workflows, it made experiment management and setup easier, and it provided a stable and permanent base from which identical experiments could be executed over time. It also required one less node, which reduced the total cost. The drawback of this approach is that, although the amount of data transferred between the submit host and the workers is small, the overhead of communication over the WAN could impact the performance of the application.

In order to quantify the impact of submit node placement on performance and cost, a set of experiments was performed in which the submit node was provisioned either inside or outside the cloud. Once again the experiments used the c1.xlarge instance type for worker nodes, and GlusterFS in the NUFA configuration for shared storage. For the experiments where the submit host was provisioned inside the cloud, the m1.xlarge instance type was used for the submit host, which costs $0.68 per hour. For the experiments where the submit host was outside the cloud, the total cost of the data transfer required was computed in Section 4.3.4 and found to be much less than $0.01 per workflow instance.

The results of these experiments are shown in Figure 4.9. Overall, the location of the submit host does have an impact on the performance of the workflow, with the submit host inside the cloud resulting in somewhat better performance in almost every case. Of all the applications, Epigenome has the least benefit, most likely because there are fewer jobs in the Epigenome workflow and therefore less traffic between the submit host and the worker nodes. Montage has the greatest benefit, which is a result of having the most jobs and the most traffic between the submit host and the workers.
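The cost side of this comparison follows directly from EC2's rounded hourly billing described in Section 4.4.3. The sketch below is illustrative only: it assumes the $0.68 per hour rates quoted above for both the c1.xlarge workers and the m1.xlarge submit host, and the example makespans are hypothetical.

import math

def run_cost(makespan_hours, workers, submit_in_cloud,
             worker_rate=0.68, submit_rate=0.68, wan_transfer=0.01):
    """Rounded per-run instance cost for one workflow execution."""
    hours = math.ceil(makespan_hours)        # partial hours are rounded up
    cost = hours * workers * worker_rate
    if submit_in_cloud:
        cost += hours * submit_rate          # extra node for the in-cloud submit host
    else:
        cost += wan_transfer                 # small WAN transfer charge instead
    return cost

# Hypothetical makespans: the in-cloud submit host finishes the run slightly faster.
print(run_cost(2.1, 4, submit_in_cloud=False))   # 3 billed hours -> $8.17
print(run_cost(1.9, 4, submit_in_cloud=True))    # 2 billed hours -> $6.80

Because every node is billed for the rounded-up makespan, a modest reduction in runtime that avoids crossing an hour boundary can more than pay for the extra submit node.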
Interestingly, the cost of running with the submit host in the cloud is, in some cases, less than the cost of running with the submit host outside, despite the fact that using a submit host inside requires an extra node. This is a result of the decrease in total runtime across all nodes, which offset the cost of the additional node.

Table 4.5: Performance and cost comparison when switching from a submit host outside the cloud to a submit host inside the cloud (R/C is the ratio of the magnitude of the runtime change to the magnitude of the cost change).

                       2 Nodes                        4 Nodes                        8 Nodes
Application    Runtime    Cost      R/C       Runtime    Cost      R/C       Runtime    Cost      R/C
Montage        -38.55%    -8.49%    4.54      -22.66%    -3.44%    6.58      -10.52%    0.66%     15.91
Broadband      -7.00%     39.50%    0.18      -15.42%    5.73%     2.69      -16.94%    -6.56%    2.58
Epigenome      -11.40%    32.90%    0.35      -8.42%     14.47%    0.58      1.81%      14.54%    0.12

Table 4.5 shows the performance and cost comparison when switching from a submit host outside the cloud to a submit host inside the cloud. In 5 out of 9 cases the change is beneficial, and in 3 of 9 cases the cost actually decreases.

Aside from performance and cost, however, developers of workflow applications should consider application requirements and convenience when deciding upon the location of the submit host. Having a submit host outside the cloud provides a permanent base for storing workflow descriptions, execution logs, data, and metadata that will not be lost when the workflow completes. If a submit host is provisioned in the cloud, then additional measures must be taken to ensure that this information is transferred to a permanent storage location before the submit host is deprovisioned. Other issues, such as the availability of an existing local submit host, the use of multiple clouds, and the combination of local resources with cloud resources, would also affect the decision. On the other hand, provisioning submit hosts in the cloud has the advantage of being able to support multiple, large-scale workflows at the same time by enabling the developer to provision a separate submit host for each workflow instance.

4.4.5 Cluster Compute Comparison

Recently Amazon introduced two new instance types that are designed specifically for high-performance computing. The new cluster compute (cc1.4xlarge) and cluster GPU (cg1.4xlarge) nodes use 10 Gbps Ethernet in comparison with the 1 Gbps Ethernet used by the other instance types, and it is possible to ensure that cluster nodes are provisioned "close" to each other (in terms of network locality) in order to minimize network latency. These features provide significantly better network performance for parallel applications using the new instance types. The new types are also fully virtualized using Xen HVM [93] instead of using paravirtualization. HVM improves application performance by reducing virtualization overhead, and enables the nodes to deploy custom kernels, which is not possible with the older instance types. Both new cluster types have higher-performance, dual, quad-core 2.93 GHz Intel Xeon Nehalem processors. In comparison, most of the c1.xlarge instance types have older, dual, quad-core 2.13 or 2.33 GHz Intel Xeon processors. The cluster GPU types also include two NVIDIA Tesla Fermi GPUs.

The new instance types come at a significantly higher cost than the older instance types. Cluster compute nodes cost $1.30 per hour, while cluster GPU nodes cost $2.10 per hour. In comparison, the c1.xlarge instance type used in the last section cost $0.68 per hour, 48-68% less than the new instance types.
Assuming that cost and performance are weighted equally by the user, the cluster compute nodes would have to deliver more than twice the performance of the c1.xlarge type to offset the increased cost. To compare the new, high-performance instance types to the older instance types we ran a set of experiments using the three example workflow applications. Since these applications cannot take advantage of the GPUs provided by the cluster GPU nodes without being modified (which would require significant changes to the application code), the comparison is restricted to the cluster compute (cc1.4xlarge) nodes. The performance of these new nodes is compared to the c1.xlarge results obtained in the previous experiments. Since GlusterFS in the NUFA configuration performed well for all three applications in the previous experiments, the comparison is restricted to that storage system.

The performance and actual cost comparison are shown in Figure 4.10. In all cases, the use of cc1.4xlarge nodes resulted in significantly better performance compared with the c1.xlarge nodes. This improvement decreases as the number of nodes increases due to the limited scalability of the applications. Also, the cost curves diverge as a result of weaker performance improvements as the number of nodes increases, so that the cost increases with cc1.4xlarge at a faster rate than with c1.xlarge.

Table 4.6: Performance and actual cost comparison when switching from c1.xlarge to cc1.4xlarge.

                       2 Nodes                       4 Nodes                       8 Nodes
Application    Runtime    Cost     R/C       Runtime    Cost     R/C       Runtime    Cost     R/C
Montage        -46.3%     2.6%     17.59     -41.7%     11.5%    3.64      -19.4%     54.0%    0.36
Broadband      -28.7%     36.3%    0.79      -31.6%     30.8%    1.03      -36.1%     22.2%    1.62
Epigenome      -38.1%     18.4%    2.07      -31.1%     31.7%    0.98      -15.9%     60.8%    0.26

In a few cases, the performance benefit of switching from c1.xlarge to cc1.4xlarge might be worth the increased cost. Table 4.6 compares the percent change in runtime with the percent change in cost. Assuming that cost and performance are weighted equally by the user, if the ratio of performance improvement to cost increase is greater than 1, then the performance improvement is worth the cost. In about half of the cases the ratio is greater than 1. For Montage, the benefit is greater on fewer nodes. This may be due to the fact that Montage (I/O-bound) is benefiting primarily from the high-speed network on cc1.4xlarge, and the benefit decreases as the number of nodes increases due to a smaller amount of data being transferred between each pair of nodes. For Broadband the benefit is more pronounced as nodes are added, which suggests that Broadband is benefiting from the increased memory and CPU power of cc1.4xlarge. In the case of Epigenome (CPU-bound), only 1 of 3 cases shows a benefit, suggesting that the increased CPU power of cc1.4xlarge alone is not enough to offset the cost.

4.4.6 Discussion

Based on these results it is clear that the choice of storage system has a significant impact on workflow runtime. In general, GlusterFS delivered good performance for all the applications tested and seemed to perform well with both a large number of small files and a large number of clients. S3 produced good performance for one application, Broadband, possibly due to the use of caching in the S3 client. NFS performed surprisingly well in cases where there were either few clients or when the I/O requirements of the application were low.
Both PVFS and S3 performed poorly on workflows with a large number of small files, although the version of PVFS used did not contain the optimizations for small files that were included in subsequent releases.

These results also indicate that cost closely follows performance. In general the storage systems that produced the best workflow runtimes resulted in the lowest cost. NFS was at a disadvantage compared to the other systems because it used an extra, dedicated node to host the file system for the rest of the cluster. Similarly, S3 is at a disadvantage, especially for workflows with many files, because Amazon charges a fee per S3 transaction. For two of the applications (Montage, I/O-intensive; Epigenome, CPU-intensive) the lowest cost was achieved with GlusterFS, and for the other application (Broadband, memory-intensive) the lowest cost was achieved with S3.

The most important cost-related result, however, is the effect of adding resources to run workflows in parallel. Assuming that resources have uniform cost and performance, in order for the cost of a workflow to decrease when resources are added, the speedup of the application must be super-linear. Since this is rarely the case in any parallel application, it is unlikely that there will ever be a cost benefit for adding resources, even though there may still be a performance benefit. In these experiments adding resources reduced the cost of a workflow for a given storage system in only 2 cases: going from 1 node to 2 nodes using a dedicated NFS configuration for both Epigenome and Broadband. In both of those cases, the improvement was a result of the non-uniform cost of resources due to the extra node that was used for NFS. In all other cases the cost of the workflows only increased when resources were added. Assuming that cost is the only consideration and that resources are uniform, the best strategy is to either provision only one node for a workflow, or to use the fewest resources possible to achieve the required performance.

In most of the experiments the submit host used to coordinate the workflows was located outside the Amazon cloud. In most cases this configuration results in lower cost and slightly worse performance. In general it seems that having the submit host outside the cloud is more convenient, but workflows with many tasks, such as Montage, may benefit from having the submit host in the cloud.

With regard to the new cluster compute instance types, the experiments indicate that using cc1.4xlarge can significantly improve the performance of workflow applications, but the improvement comes with a significant increase in cost. As was suggested earlier, it is best for application developers to do a cost-performance analysis for their application and either choose the most cost-effective instance type, or choose the one that best fits the requirements of their application.

4.5 Astronomy in the Cloud

This section describes several cloud workflow use-cases that were studied for the Infrared Processing and Analysis Center (IPAC) [82]. IPAC is interested in using cloud computing for astronomy applications, but they are concerned about the cost and feasibility of running data-intensive applications in the cloud. Investigating these use-cases helped them identify beneficial uses of cloud computing for their astronomy applications.

4.5.1 Mosaic Service

IPAC hosts an on-demand image mosaic service that uses the Montage application (described above) [147].
In order to determine the potential usefulness of cloud computing with regard to this service, IPAC wanted to answer the question: Is it cheaper to host the image mosaic service locally or on Amazon EC2? The costs described here are current as of October 2010. The calculations presented assume that the service processes requests for 36,000 mosaics of 2MASS images (total size 10 TB) of size 4 square degrees over a period of three years. This workload is typical of the requests made to the existing image mosaic service.

Table 4.7 summarizes the costs of the local service, using hardware choices typical of those used at IPAC. The costs for power, cooling and administration are estimates provided by IPAC system management. Table 4.8 gives similar calculations for EC2. The costs for EC2 include the cost of data transfer, I/O, VM instances, etc.

Table 4.7: Cost per mosaic of a Montage image mosaic service hosted at IPAC.

Item                                                      Cost
12 TB RAID 5 disk farm and enclosure (3 yr support)       $12,000
Dell 2650 Xeon quadcore processor, 1 TB staging area      $5,000
Power, cooling and administration                         $6,000
Total 3-year Cost                                         $23,000
Cost per mosaic                                           $0.64

Table 4.8: Cost per mosaic of a Montage image mosaic service hosted on Amazon EC2.

Item                                        Cost
Network Transfer In                         $1,000
Data Storage on Elastic Block Storage       $36,000
Processor Cost (c1.medium)                  $4,500
I/O operations                              $7,000
Network Transfer Out                        $4,200
Total 3-year Cost                           $52,700
Cost per mosaic                             $1.46

Comparing the two tables it is clear that the local service is the least expensive choice by a wide margin. In fact, the cost per mosaic of the local solution is less than half that of the EC2 solution. The high cost of data storage in EC2, and the high cost of data transfer and I/O in the case of an I/O-bound application like Montage, make EC2 much less attractive than a local service. Based on this analysis, IPAC is inclined to keep their current configuration with the service deployed locally.

4.5.2 Periodograms

The Kepler satellite [122], launched on 6 March 2009, is a NASA mission that uses high-precision photometry to search for transiting exoplanets around main sequence stars. The French mission Convection Rotation and Planetary Transits (CoRoT) [35], launched in late 2006, has similar goals. Kepler's primary mission is to determine the frequency of Earth-sized planets around other stars. In May 2009, it began a photometric transit survey of 170,000 stars in a 105 square degree area in Cygnus. The photometric transit survey has a nominal mission lifetime of 3.5 years.
The atlas is to be served through the NASA Star and Exoplanet Database (NStED) [148]. End-users will be able to browse periodograms and phased light curves, extract the highest-probability periodicities from the atlas, identify stars for further study, and refine the periodogram calculations as needed. In 2010 the Kepler mission released light curves for 210,664 stars. These light curves contain measurements made over 229 days, with between 500 to 50,000 epochs per light curve. IPAC and ISI planned to compute an atlas for these light curves on Amazon EC2. There were two good reasons for choosing EC2 over a local cluster. First, the processing would interfere with operational services on the local machines accessible to IPAC, which are being used for other applications such as the mosaic service described above. Second, the periodogram workflow has the characteristics that make it attractive for cloud processing. It is strongly CPU-bound, as it spends 90% of the runtime processing data, and the input data sets are small, so the transfer and storage costs are not excessive. It is an example of bulk processing where the processors can be provisioned as needed and then released. Table 4.9 summarizes the results of three production runs on Amazon EC2 and NCSA’s Abe cluster. Each run processed all 210,664 public light curves using all three Periodograms algorithms (LS, BLS, Plavchan). The first two runs were modest in size and were executed on Amazon EC2. The third run used parameters that resulted in much longer runtimes than the first 97 Table 4.9: Summary of Periodograms runs on Amazon EC2 and NCSA Abe Run 1 (Amazon) Run 2 (Amazon) Run 3 (NCSA) Resources Type c1.xlarge c1.xlarge abe.lustre Nodes 16 16 16 Cores 128 128 128 Runtimes No. Tasks 631992 631992 631992 Mean Task Runtime 7.44 sec 6.34 sec 285 sec No. Jobs 25401 25401 25401 Mean Job Runtime 3.08 min 2.62 min 118 min Total CPU Time 1304 1113 50019 Total Wall Time 16.5 hr 26.8 hr 448 hr Inputs Input Files 210664 210664 210664 Mean Input Size 0.084 MB 0.084 MB 0.084 MB Total Input Size 17.3 GB 17.3 GB 17.3 GB Outputs Output Files 1263984 1263984 1263984 Mean Output Size 0.171 MB 0.124 MB 5.019 MB Total Output Size 105.3 76.52 3097.87 Cost Compute Cost $110.84 $94.61 $4,251.62 (est) Output Cost $15.80 $11.48 $464.68 (est) Total Cost $126.64 $106.08 $4,716.30 (est) two runs. The estimated cost to execute the third run on Amazon EC2 was approximately $4,700 (cost values marked “est” in Table 4.9 are estimates of the cost on EC2). Instead of spending this relatively large sum, IPAC decided to use an allocation on NCSA’s Abe cluster instead. The results showed that cloud computing is a powerful, cost-effective tool for bulk process- ing. On-demand provisioning is especially powerful and is a major advantage over grid facilities, where latency in scheduling jobs can increase the processing time dramatically. However, the cost of running large computations on Amazon was a limiting factor. IPAC was comfortable executing more modestly-sized runs on EC2 that cost on the order of a few hundred dollars, but were disinclined to execute a larger run on EC2 that would cost a few thousand dollars, espe- cially considering that there was an available TeraGrid allocation with sufficient service units to complete the run. 98 4.6 Summary This chapter presented the results of several studies on the cost and performance of workflows in the cloud. 
The first study evaluated the performance of workflows using single VMs on Amazon EC2 with different resource configurations and compared it to the performance of a typical HPC cluster. The second study evaluated the performance of different storage systems that could be used for sharing workflow data in a virtual cluster deployed on Amazon EC2. And the last two studies analyzed the cost of hosting two different astronomy applications in Amazon EC2.

Based on the experiments done in the single-node study it appears that the performance of workflows on EC2 is reasonable given the resources that can be provisioned. Although the EC2 performance in most cases was not as good as the performance on a traditional cluster, most of the resources provided by EC2 are also less powerful. When the resources were similar, the performance was found to be comparable.

One of the concerns of running high-performance computing applications on clouds was the impact of virtualization on performance. The single-node study found that the virtualization overhead on EC2 is generally small, on the order of 10%, and that the overhead was most evident for CPU-bound applications. Further advances in virtualization technology, such as the full virtualization technology used on Amazon's cluster compute instances, will likely eliminate the overhead entirely in the future.

The tradeoff between price and performance is an important issue when deploying workflows in the cloud. The single-node study found that occasionally there is one instance type that has the best cost/performance ratio for an application. More often, however, there were several instance types that had either lower cost or better performance than the other types, but did not dominate the others in both metrics. In general, application developers should be aware of this tradeoff, and benchmark their applications to decide which type meets the requirements of their application, rather than blindly choosing the type with the most resources or the lowest hourly rate.

The single-node study also determined that the cost of transferring data can be prohibitive. For some applications the cost of transferring the output data is larger than the cost of executing the workflow. Keeping in mind that storing 1 GB of data on EC2 for one month costs the same as transferring it outside the cloud once, for these applications it may be more cost-effective to leave the output in the cloud and perform additional analyses there rather than transfer it to local storage for analysis. A similar argument can be made for input data. If the input data will be reused for multiple analyses it may be better to store it in the cloud rather than have it transferred again in the future.

Based on the results from the storage system study it is clear that the choice of storage system has a significant impact on workflow runtime. The main issue seemed to be how well the file system performed with many small files. File systems such as GlusterFS, which excelled in handling small files, seemed to perform better than others, such as PVFS, that were designed for accessing large files in parallel.

The storage system study also showed that, in order for the cost of a workflow to decrease when resources are added, the speedup of the application must be super-linear. Since this is rarely the case in any parallel application, it is unlikely that there will ever be a cost benefit for adding resources, even though there may still be a performance benefit.
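A quick calculation makes the super-linear requirement explicit. The sketch below is illustrative only: it assumes uniform per-node pricing (using the $0.68 per hour c1.xlarge rate as an example), ignores the rounding of partial hours, and uses hypothetical runtimes.

def workflow_cost(nodes, runtime_hours, hourly_rate=0.68):
    # Uniform resources: every node is billed for the entire makespan.
    return nodes * hourly_rate * runtime_hours

# Hypothetical runtimes for a workflow with good, but sub-linear, speedup:
t1 = 4.0
for nodes, runtime in [(1, 4.0), (2, 2.2), (4, 1.3)]:
    speedup = t1 / runtime
    print("%d node(s): speedup %.2f, cost $%.2f"
          % (nodes, speedup, workflow_cost(nodes, runtime)))
# Cost only decreases relative to one node when speedup > nodes, i.e. T(n) < T(1)/n.

Because every node is billed for the full makespan, the total cost is n times the per-node rate times T(n), which falls below the single-node cost only when the speedup T(1)/T(n) exceeds n.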
Assuming that cost is the only consideration and that resources are uniform, the best strategy is to either provision only one node for a workflow, or to use the fewest resources possible to achieve the required performance.

One question that arose during the storage system study is: Is it better to place the submit host inside the cloud, or outside? The experiments to answer that question suggest that the choice of where to place the submit host is a non-issue performance-wise. Some workflows performed better with a submit host provisioned in the cloud, but not significantly. The correct placement choice appears to be largely a matter of convenience rather than performance.

A similar experiment was done to test whether the new cluster compute instance types were a significantly better choice in terms of performance and cost than the standard instance types. The experiments to test this issue found that the cluster compute types can significantly improve the performance of workflow applications, but the improvement comes with a significant increase in cost. In some cases this increased cost may be justified, but it depends on the application and the user's requirements.

The analysis done for IPAC to compare the cost of hosting a Montage image mosaic service locally on owned hardware versus in the cloud on Amazon EC2 showed that the EC2 solution was more than twice as expensive as the local solution. This was primarily a result of the high cost of storing data long-term in Amazon compared to using a local disk array. On the other hand, Amazon does eliminate local maintenance and energy costs, and does offer reliable, redundant storage, which is difficult to replicate locally.

Finally, EC2 was determined to be an excellent solution for running the Periodograms workflow. This is because 1) the Periodograms application is CPU-intensive rather than data-intensive, so the time and cost of transferring data is not significant, and 2) the computational capacity was only required for a short duration (namely, just long enough to process Kepler data releases), which made it easier to justify renting capacity from EC2 rather than buying new hardware or oversubscribing existing hardware.

Figure 4.8: Rounded cost (left) and actual cost (right) using different storage systems. (Plots of cost per workflow in dollars versus number of nodes/cores for NFS, GlusterFS NUFA, GlusterFS distribute, PVFS2, S3, Local, and NFS shared; panels: (a) Montage, (b) Broadband, (c) Epigenome.)
Figure 4.9: Comparison of runtime and actual cost for a submit node inside and outside the cloud. (Plots of runtime in seconds and total instance cost versus number of nodes/cores for (a) Montage, (b) Broadband, and (c) Epigenome.)

Figure 4.10: Comparison of runtime and actual cost for cc1.4xlarge vs. c1.xlarge. (Plots of runtime in seconds and total instance cost versus number of nodes/cores for (a) Montage, (b) Broadband, and (c) Epigenome.)

Chapter 5

Workflow Profiling and Characterization

In order to enable resource provisioning for workflow applications in grids and clouds, it is important to understand the resource requirements of a workflow. Knowing what the requirements of a workflow are makes it easier to provision resources for the workflow. If the requirements are not known, then it is possible that the resource provisioning system will either over-provision or under-provision, leading to either low resource utilization or poor application performance.

Our approach to this problem is to collect fine-grained traces of workflows that capture the resource usage of individual tasks, and then process this detailed information to create a profile of the workflow that can be used to predict the resource requirements of future tasks. This approach enables us to build up an estimate of the resource requirements of an entire workflow based on the predicted resource usage of individual tasks.

In addition to enabling the resource requirements of a workflow to be estimated, workflow traces and profiles have many other potential uses. These include:

Monitoring. Real-time trace data could be used to provide feedback to the user and to the workflow management system concerning the progress of the workflow. For example, trace data could be used to monitor the amount of storage and computation being used, and to assess the load being generated by the workflow.

Performance analysis. Workflow traces could be mined for information about the performance of the workflow, workflow management system, and target resources. Trace data could be analyzed to identify bottlenecks and poor performance. It could also be used to guide and evaluate performance optimizations.

Anomaly and bug detection. Bugs and anomalies in workflow execution may show up in execution traces.
It may be possible to develop online algorithms to process trace data to detect bugs and anomalies as the workflow runs. This could be used to develop early warning systems that respond to failures and prevent expensive and time-consuming rework.

Simulation of workflows. Workflow traces could be used to drive simulations of workflows for use in the development of new workflow systems, scheduling algorithms, and resource management tools. Using traces would enable researchers to model workflow applications and systems at a high level of detail.

Testing. Traces could be used to develop realistic mock workflows for testing workflow management systems. Developing mock workflows using trace data would lead to more realistic testing scenarios than alternative methods without requiring the actual applications to be deployed.

5.1 Workflow traces

Workflow traces are logs that record the fine-grained resource usage of workflow tasks as they run. The types of resource usage recorded in a trace include:

Memory. The amount of memory used by each process invoked by the task. This includes both peak physical memory usage (known as peak RSS, the actual amount of memory used) and peak virtual memory usage (the amount of memory allocated).

I/O. The amount of I/O performed by each process. This includes bytes read and written from/to each file descriptor (which includes files, pipes, sockets, and stdout/stderr), and the number of I/O operations performed (e.g. the number of read, write and seek calls). It also includes the names of the files that were accessed and their sizes.
The easiest way is to cluster tasks according to an obvious attribute, such as transformation type, and use simple statistical techniques such as mean and standard deviation to aggregate the measurements. 107 More complicated approaches could be used as well. Machine learning techniques could be used to automatically cluster tasks, and regression analysis could be used to characterize the target attributes as a function of the other attributes. This could be accomplished using, for example, classification and regression trees [17]. This thesis focuses on the simpler process, but the data collected could easily be used for more sophisticated techniques. 5.3 Related Work In the area of characterizing workloads in distributed and grid environments, the Parallel Work- loads Archive [143] and the Grid Workloads Archive [71] provide workloads from parallel and grid execution environments that can be used in simulations. These workloads focus on the per- formance and utilization of computational resources, and provide data at the granularity of jobs, but not fine-grained resource usage characteristics. Iosup et. al. [83] describe the system-wide, virtual organization-wide, and per-user characteristics of traces obtained from four grids. These analyses provide insight into how grid environments are used and allow users to model such environments at a high level. Van der Aalst and ter Hofstede started The Workflow Patterns Initiative to identify basic workflow components [180, 191]. They categorize perspectives such as control flow, data flow, resource and exception handling that are supported by workflow description and business process modeling languages. They also maintain descriptions of common patterns for each perspective, such as the sequence, parallel split, and synchronization patterns for control flow. This effort is focused on identifying common workflow structures and does not consider performance metrics or resource utilization. There have been relatively few efforts to collect and publish traces and performance statis- tics for scientific workflows. The Pegasus team recently published traces for a few workflows executed using Pegasus [189] and have developed synthetic workflows based on statistics from real workflows for use in simulations [190]. Similarly, Ramakrishnan and Gannon [152] and Bharathi et. al. [15] have conducted surveys to characterize workflows at a high level. These 108 surveys describe the use, composition and structure of many workflows and provide statistical characterizations of the runtimes and input/output file sizes of their tasks. The purpose of these studies is to aid in the development of workflow management systems, and they do a good job of describing the range and scale of existing applications, but they have not resulted in general- purpose techniques and tools that can be used to profile workflows. A few systems have been developed to collect profiling data for scientific workflows. Many workflow systems have been integrated with performance monitoring and analysis tools [174, 138, 37]. These systems typically collect only coarse-grained information, such as task runtime and data size. Similarly, Kickstart [182] is a tool that is used to monitor task execution in scientific workflows. It captures task information such as runtime, exit status, and stdout/stderr, as well as execution host information, such as hostname, CPU type, and load. Kickstart does not, however, collect fine-grained profiling data. 
The ParaTrac system [48], on the other hand, does collect fine-grained profiling data using a FUSE-based file system to monitor and record the I/O usage of tasks and the Linux taskstats interface to record memory, CPU, and runtime data. The main drawbacks of ParaTrac’s approach are that FUSE is not supported on many of the computing systems used to run workflows, and it only captures I/O performed through the FUSE mount point, which excludes a potentially large amount of the I/O performed by workflow tasks. 5.4 Approach The challenge of tracing workflows is that they involve a large number of tasks scattered across many nodes in a distributed computing environment. As a result of this, it is usually not possible to develop an interactive tracing tool that can be invoked by a user manually to collect informa- tion about individual tasks. Instead, the workflow management system must be enlisted to help manage the tracing process and gather the trace data. Our approach is to wrap each workflow task with a program that records trace data as the task executes. 109 Worker Node Wrapper T Workflow Manager Worker Node Wrapper T Worker Node Wrapper T Worker Node Wrapper T 1. Workflow Manager distributes tasks to worker nodes 3. Wrappers send trace data back to submit host 2. Wrappers collect data as jobs run Workflow Profile 4. Trace data is mined to create a profile of the workflow Task Trace Task Trace Task Traces Figure 5.1: Workflow profiles are created by processing trace data collected by wrappers run- ning on compute nodes. The process for collecting trace data to build a workflow profile is illustrated in Figure 5.1. In this process, the workflow management system distributes workflow tasks to a set of worker nodes for execution. The workers invoke each tasks with a wrapper program that monitors all of the processes and sub-processes forked by the task. The wrapper program records trace data for these processes into a log file. When the task finishes, the trace files are transferred back to the submit host along with the other outputs of the task. Once the workflow is complete, additional processing is done on the traces to generate a profile of the workflow. 5.5 Workflow profiler We have developed a set of workflow tracing and profiling tools called wfprof. The tracing portion of wfprof is composed of two tools–ioprof and pprof–that wrap the tasks in a workflow 110 and record data about them as they run. Ioprof collects data about I/O activity, and pprof collects data about process runtimes, memory usage, and CPU utilization. Ioprof uses the strace utility [167] to intercept system calls made by a task. The system calls are mined for data about operations performed on file descriptors. The calls of interest to ioprof include open(), close(), read(), write(), seek() and others. The arguments and return values from these system calls are analyzed to determine which files were accessed by the job, how they were accessed, and how much I/O was performed. This process involves some overhead (measured in microbenchmarks to be between 10–15%), but the statistics collected, namely I/O statistics, should not be impacted by this overhead. In addition to file I/O, this method also captures I/O performed on other objects accessed through file descriptors, includ- ing standard streams (stdin, stdout, stderr), pipes, and sockets. Also, unlike the FUSE-based approach used by ParaTrac, ioprof is guaranteed to capture 100% of the file I/O performed by a task regardless of where it terminates. 
Pprof uses the Linux ptrace() API [150] to analyze lifecycle events for the processes invoked by a job. Unlike ioprof, it does not intercept system calls, so the overhead is minimal (measured to be on the order of 1 ms per task in microbenchmarks). The kernel notifies pprof whenever a process starts or finishes, and pprof records data about the process such as: the start and stop times, the peak RSS (resident set size, or physical memory usage), the time spent in user mode (utime), and the time spent in kernel mode (stime). All of these values, with the exception of the start and stop times, are maintained by the kernel for every process.

To create a profile for a workflow, the workflow is executed twice: once to collect I/O data using ioprof, and once to collect process data using pprof. The two separate runs are required because the I/O tracing done by ioprof affects the statistics gathered by pprof. These runs produce traces that are mined to produce summary statistics for each transformation in the workflow. These profiles report the mean and standard deviation for five statistics: I/O read, I/O write, runtime, peak physical memory usage, and CPU utilization. These statistics are calculated for each task as follows:

I/O read and write are calculated by adding up all the I/O performed by all processes invoked by the task.

Runtime is calculated by subtracting the start time of the first process invoked by the task from the stop time of the last process.

Peak memory usage is calculated from the total peak memory usage of all processes invoked by the job. Since a single job may invoke many different processes concurrently, care is taken to ensure that the peak memory measurements of concurrent processes are added together and the maximum value over the lifetime of the job is reported. Because the peak memory usage of a process may not be maintained for the entire lifetime of the process, there may be some error when adding these values together. As a result, the peak memory reported represents an upper bound on memory usage.

CPU utilization is calculated by dividing the amount of time each process spends scheduled in user mode (utime) and kernel mode (stime) by the total runtime of the process. In the case where a job consists of several processes running concurrently, the total utime+stime of all processes is divided by the wall time of the job. This computation results in the percentage of time the process spends running on the CPU when it is not blocked waiting for I/O operations to complete. Note that for CPU-intensive concurrent processes running on multicore nodes, the CPU utilization may be greater than 100%.

There are four tools that generate profile data from the outputs of ioprof and pprof: iostats, which provides statistics about I/O; memstats, which provides statistics about memory; utilstats, which provides information about CPU utilization; and runstats, which provides information about runtimes. These tools parse and analyze the outputs of ioprof and pprof to generate workflow profiles containing summary statistics (count, min, max, mean, standard deviation, sum) for each transformation and executable invoked by the workflow.
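The per-task statistics described above can be computed from pprof-style process records as in the following minimal sketch. The field names (start, stop, peak_rss_mb, utime, stime) are illustrative, not pprof's actual output format, and peak memory is approximated by summing the peak RSS of processes that are alive at the same time, which, as noted above, yields an upper bound.

```python
def task_statistics(procs):
    """Summarize one task from a list of per-process records (illustrative)."""
    # Runtime: stop of the last process minus start of the first process.
    start = min(p["start"] for p in procs)
    stop = max(p["stop"] for p in procs)
    runtime = stop - start

    # CPU utilization: total time on the CPU divided by task wall time.
    # Can exceed 100% when processes run concurrently on multiple cores.
    cpu_time = sum(p["utime"] + p["stime"] for p in procs)
    utilization = 100.0 * cpu_time / runtime if runtime > 0 else 0.0

    # Peak memory: sweep over start/stop events, summing the peak RSS of
    # all concurrently running processes and keeping the maximum.
    events = []
    for p in procs:
        events.append((p["start"], +p["peak_rss_mb"]))
        events.append((p["stop"], -p["peak_rss_mb"]))
    events.sort()
    current = peak = 0.0
    for _, delta in events:
        current += delta
        peak = max(peak, current)

    return {"runtime_s": runtime, "cpu_util_pct": utilization, "peak_mem_mb": peak}

# Example: two overlapping processes belonging to a single task.
procs = [
    {"start": 0.0, "stop": 10.0, "peak_rss_mb": 120.0, "utime": 8.0, "stime": 1.0},
    {"start": 2.0, "stop": 9.0, "peak_rss_mb": 80.0, "utime": 6.5, "stime": 0.5},
]
print(task_statistics(procs))  # runtime 10 s, utilization 160%, peak 200 MB
```

Per-transformation profiles then follow by grouping the per-task results by transformation name and computing the count, min, max, mean, standard deviation, and sum of each metric.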
5.6 Application Profiles

We have used wfprof to generate profiles for six workflows. These profiles were obtained from executions of typical instances of the workflows on the grid. Some of these applications have been used for the research presented in Chapters 2, 3, and 4, and others are new. The applications profiled include:

Montage: Montage [13] was created by IPAC [82] as an open source toolkit that can be used to generate custom mosaics of the sky using input images in the Flexible Image Transport System (FITS) format. During the production of the final mosaic, the geometry of the output image is calculated from the input images. The inputs are then re-projected to have the same spatial scale and rotation, the background emissions in the images are corrected to have a uniform level, and the re-projected, corrected images are co-added to form the output mosaic. The Montage application has been represented as a workflow that can be executed in grid environments such as the TeraGrid [170].

CyberShake: CyberShake [69, 26, 41, 110] was developed by the Southern California Earthquake Center (SCEC) [166] to characterize earthquake hazards using the Probabilistic Seismic Hazard Analysis (PSHA) technique. Given a region of interest, an MPI-based finite difference simulation is performed to generate Strain Green Tensors (SGTs). From the SGT data, synthetic seismograms are calculated for each of a series of predicted ruptures. Once this is done, spectral acceleration and probabilistic hazard curves are calculated from the seismograms to characterize the seismic hazard. CyberShake workflows composed of more than 800,000 jobs have been executed using the Pegasus workflow management system on the TeraGrid [170, 41].

Broadband: Broadband is another application developed by the Southern California Earthquake Center [166]. Broadband is a computational platform designed to integrate a collection of motion simulation codes and calculations to produce research results of value to earthquake engineers. These codes are composed into a workflow that simulates the impact of one or more earthquakes on one of several recording stations. Researchers can use the Broadband platform to combine low-frequency (less than 1.0 Hz) deterministic seismograms with high-frequency (10 Hz) stochastic seismograms and calculate various ground motion intensity measures (spectral acceleration, peak ground acceleration and peak ground velocity) for building design analysis.

Epigenomics: The USC Epigenome Center [177] is currently involved in mapping the epigenetic state of human cells on a genome-wide scale. The Epigenomics workflow is essentially a data processing pipeline that uses the Pegasus workflow management system to automate the execution of the various genome sequencing operations. DNA sequence data generated by the Illumina-Solexa [80] Genetic Analyzer system is split into several chunks that can be operated on in parallel. The data in each chunk is converted into a file format that can be used by the MAQ software that maps short DNA sequencing reads [107, 113]. From there, the sequences are filtered to remove noisy and contaminating segments and mapped into the correct location in a reference genome. Finally, a global map of the aligned sequences is generated and the sequence density at each position in the genome is calculated. This workflow is being used by the Epigenome Center in the processing of production DNA methylation and histone modification data.

LIGO Inspiral Analysis: The Laser Interferometer Gravitational Wave Observatory (LIGO) [1, 104] is attempting to detect gravitational waves produced by various events in the universe as predicted by Einstein's theory of general relativity.
The LIGO Inspiral Analysis workflow [21, 28] is used to analyze the data obtained from the coalescing of compact binary systems such as binary neutron stars and black holes. The time-frequency data from each of the three LIGO detectors is split into smaller blocks for analysis. For each block, the workflow generates a subset of waveforms belonging to the parameter space and computes the matched filter output. If a true inspiral has been detected, a trigger is generated that can be checked against triggers from the other detectors. Several additional consistency tests may also be added to the workflow.

SIPHT: The bioinformatics project at Harvard University is conducting a wide search for small, untranslated RNAs (sRNAs) that regulate processes such as secretion and virulence in bacteria. The sRNA Identification Protocol using High-throughput Technology (SIPHT) program [109] uses a workflow to automate the search for sRNA-encoding genes for all bacterial replicons in the National Center for Biotechnology Information (NCBI) database. The kingdom-wide prediction and annotation of sRNA-encoding genes involves a variety of programs that are executed in the proper order using Condor DAGMan's [38] capabilities. These involve the prediction of Rho-independent transcriptional terminators, BLAST (Basic Local Alignment Search Tool) comparisons of the inter-genetic regions of different replicons, and the annotation of any sRNAs that are found.

For each of these workflows, Table 5.1 lists the details of the two profiling runs (ioprof and pprof) and a third run without any profiling enabled (normal) for comparison. The elapsed times are shown in units of hours:minutes:seconds. The differences between the elapsed times of the profiling runs and the normal run are small and fairly consistent. This difference is primarily due to 1) the extra data transfers required to retrieve the trace data from the remote system, 2) the added overhead of ioprof, and 3) a natural variation in the runtime of the workflow. Recall from the previous section that pprof is used for runtime characteristics and has an overhead on the order of 1 ms per task, and ioprof is used only for I/O characteristics and has an overhead of approximately 10–15%. The extra ioprof overhead does not affect the runtime data, and I/O statistics do not vary from run to run or platform to platform. In addition, some workflow characteristics, such as the number of jobs, I/O read, I/O write, and peak memory, are independent of the execution environment and the profiler overhead. Runtime and CPU utilization depend on the quantity and performance of resources used to execute the workflow and will vary from platform to platform.

Table 5.1 also shows the execution environments that were used to collect the data. Several of the workflows (Epigenome, Montage, Broadband) were run on Amazon EC2. The other workflows (SIPHT, CyberShake, LIGO) were run on the clusters used by the developers of the application.
Table 5.1: Experimental conditions for profiling and normal executions

Workflow | Site | Nodes | Cores | CPU | Memory | Storage | ioprof | pprof | Normal
Montage | Amazon EC2 | 1 | 8 | Xeon @ 2.33GHz | 7.5 GB | ext3 | 1:11:20 | 0:57:36 | 0:55:28
CyberShake | TACC Ranger | 36 / 15 | 16 | Opteron @ 2.3GHz | 32 GB | Lustre | 16:35:00 | 12:20:00 | N/A
Broadband | Amazon EC2 | 1 | 8 | Xeon @ 2.33GHz | 7.5 GB | ext3 | 1:31:24 | 1:21:16 | 1:20:03
Epigenome | Amazon EC2 | 1 | 8 | Xeon @ 2.33GHz | 7.5 GB | ext3 | 1:11:11 | 1:08:01 | 1:07:40
LIGO | Syracuse SUGAR | 80 | 4 | Xeon @ 2.50GHz | 7.5 GB | NFS | 1:48:11 | 1:47:38 | 1:41:13
SIPHT | UW Newbio | 13 | 8 | Xeon @ 3.16GHz | 1 GB | ext3 | 1:33:24 | 1:19:37 | 1:10:53

The workflows run on EC2 all used a local disk with the ext3 file system for sharing data between jobs (this setup was similar to that used in Section 4.3). The EC2 runs used only a single node, so no inter-node transfers were required. SIPHT also used a local file system; however, the input and output files were transferred to the worker nodes for each job. The remaining workflows used network file systems to share data among workers (Lustre for CyberShake and NFS for LIGO). The "normal" runtime is not shown for the CyberShake workflow because the data available to us was obtained from a workflow execution with several failures and down times and thus is not representative.

A summary of the execution profiles of all the workflows is shown in Table 5.2. The data shows a wide range of workflow characteristics. It includes small workflows (SIPHT), CPU-bound workflows (Epigenome), I/O-bound workflows (Montage), data-intensive workflows (CyberShake, Broadband), workflows with large memory requirements (CyberShake, Broadband, LIGO), and workflows with large resource requirements (CyberShake, LIGO). The structure, operation, and profiles of each of these applications will be examined in more detail in the coming sections.

Table 5.2: Summary of workflow profiles

Workflow | Jobs | CPU Hours | I/O Read (GB) | I/O Write (GB) | Peak Memory (MB) | CPU Utilization (%)
Montage | 10429 | 4.93 | 146.01 | 49.93 | 16.77 | 31.04
CyberShake | 815823 | 9192.45 | 217369.37 | 920.93 | 1870.38 | 90.89
Broadband | 770 | 9.48 | 1171.73 | 175.41 | 942.32 | 88.85
Epigenome | 529 | 7.45 | 24.14 | 5.36 | 197.47 | 95.91
LIGO | 2041 | 51.17 | 209.36 | 0.05 | 968.68 | 90.50
SIPHT | 31 | 1.23 | 1.70 | 1.48 | 116.38 | 93.37

5.6.1 Montage

Figure 5.2 shows a very small Montage astronomy workflow for the purpose of illustrating the workflow's structure. The size of a Montage workflow depends on the number of images used in constructing the desired mosaic of the sky. The structure of the workflow changes to accommodate increases in the number of inputs, which corresponds to an increase in the number of computational jobs.

At the top level of the workflow shown in Figure 5.2 are mProjectPP jobs, which reproject input images. The number of mProjectPP jobs is equal to the number of Flexible Image Transport System (FITS) input images processed. The outputs of these jobs are the reprojected image and an "area" image that consists of the fraction of the image that belongs in the final mosaic. These are then processed together in subsequent steps. At the next level of the workflow, mDiffFit jobs compute a difference for each pair of overlapping images. The number of mDiffFit jobs in the workflow therefore depends on how the input images overlap. The difference images are then fitted using a least squares algorithm by the mConcatFit job. The mConcatFit job is a computationally intensive data aggregation job. Next, the mBgModel job computes a correction to be applied to each image to obtain a good global fit.
This background correction is applied to each individual image at the next level of the workflow by the mBackground jobs. The mConcatFit and mBgModel jobs are data aggregation and data partitioning jobs, respectively, but together they can be considered as a single data redistribution point. Note that there is not a lot of data being partitioned in this case. Rather, the same background correction is applied to all images. The mImgTbl job aggregates metadata from all the images and creates a table that may be used by other jobs in the workflow. The mAdd job combines all the reprojected images to generate the final mosaic in FITS format as well as an area image that may be used in further computation. The mAdd job is the most computationally intensive job in the workflow. Multiple levels of mImgTbl and mAdd jobs may be used in large workflows. Finally, the mShrink job reduces the size of the FITS image by averaging blocks of pixels, and the shrunken image is converted to JPEG format by mJPEG.

Figure 5.2: Example Montage workflow

Table 5.3 shows a transformation-level profile from the execution of an 8-degree Montage workflow on the grid. Note that in this large workflow there are several parallel mImgTbl-mAdd-mShrink pipelines followed by a final mImgTbl-mAdd-mJPEG pipeline.

Table 5.3: Montage execution profile

Job Type | Count | Runtime (s) Mean / Std. Dev. | I/O Read (MB) Mean / Std. Dev. | I/O Write (MB) Mean / Std. Dev. | Peak Memory (MB) Mean / Std. Dev. | CPU Util. (%) Mean / Std. Dev.
mProjectPP | 2102 | 1.73 / 0.09 | 2.05 / 0.07 | 8.09 / 0.31 | 11.81 / 0.32 | 86.96 / 0.03
mDiffFit | 6172 | 0.66 / 0.56 | 16.56 / 0.53 | 0.64 / 0.46 | 5.76 / 0.67 | 28.39 / 0.16
mConcatFit | 1 | 143.26 / 0.00 | 1.95 / 0.00 | 1.22 / 0.00 | 8.13 / 0.00 | 53.17 / 0.00
mBgModel | 1 | 384.49 / 0.00 | 1.56 / 0.00 | 0.10 / 0.00 | 13.64 / 0.00 | 99.89 / 0.00
mBackground | 2102 | 1.72 / 0.65 | 8.36 / 0.34 | 8.09 / 0.31 | 16.19 / 0.32 | 8.46 / 0.10
mImgtbl | 17 | 2.78 / 1.37 | 1.55 / 0.38 | 0.12 / 0.03 | 8.06 / 0.34 | 3.48 / 0.03
mAdd | 17 | 282.37 / 137.93 | 1102.57 / 302.84 | 775.45 / 196.44 | 16.04 / 1.75 | 8.48 / 0.11
mShrink | 16 | 66.10 / 46.37 | 411.50 / 7.09 | 0.49 / 0.01 | 4.62 / 0.03 | 2.30 / 0.03
mJPEG | 1 | 0.64 / 0.00 | 25.33 / 0.00 | 0.39 / 0.00 | 3.96 / 0.00 | 77.14 / 0.00

Table 5.3 shows that the majority of runtime was spent in the 17 instances of the mAdd job, which performed the most I/O on average (1 GB of data read and 775 MB of data written per job) but had low CPU utilization. The table shows low CPU utilization for several job types in the workflow (mBackground, mImgtbl, mAdd, mShrink). This is because Montage jobs spend much of their time on I/O operations, as illustrated by their I/O and runtime measurements. For example, the mShrink jobs have a relatively short runtime, require a large amount of read I/O, and have low CPU utilization. The single mBgModel job had the longest mean job runtime, the highest CPU utilization (99.89%) and low I/O activity. The table also shows that mAdd has a high standard deviation for all metrics. This is because the same mAdd executable is used in two different ways in the workflow. The workflow has a two-level reduce structure that uses mAdd, where the second level has much different runtime characteristics than the first level. These two levels could be treated as two different job types to clarify the statistics.

5.6.2 CyberShake

Figure 5.3 shows a small CyberShake workflow. While relatively simple in structure, this workflow can be used to perform significant amounts of computation on extremely large datasets.
The ExtractSGT jobs in the workflow extract data from large SGT files that pertains to a given scenario earthquake rupture. The ExtractSGT jobs may therefore be considered data partitioning jobs. Synthetic seismograms are generated for each variation of a rupture by the SeismogramSynthesis jobs. Peak intensity values, in particular the spectral acceleration, are calculated by the PeakValCalcOkaya jobs for each synthetic seismogram. The resulting synthetic seismograms and peak intensities are collected and compressed by the ZipSeismograms and ZipPeakSA jobs to be staged out and archived. These jobs may be considered as simple data aggregation jobs.

Figure 5.3: Example CyberShake workflow

Of the computational jobs, SeismogramSynthesis jobs are the most computationally intensive. However, when the workflow is executed on the grid, due to the large sizes of the SGT files, ExtractSGT jobs may also consume a lot of time on compute resources. Additionally, since a lot of data may be generated by the workflow, the ZipSeismograms and ZipPeakSA jobs may also consume a large amount of time in generating the compressed files to be staged out.

Table 5.4 shows profiling data from the execution of a large CyberShake workflow on the Ranger cluster at TACC. Note that in order to fit into the table, SeismogramSynthesis, PeakValCalcOkaya and ZipSeismograms jobs are called "SeisSynth", "PVCOkaya" and "ZipSeis", respectively.

Table 5.4: CyberShake execution profile

Job | Count | Runtime (s) Mean / Std. Dev. | I/O Read (MB) Mean / Std. Dev. | I/O Write (MB) Mean / Std. Dev. | Peak Memory (MB) Mean / Std. Dev. | CPU Util. (%) Mean / Std. Dev.
ExtractSGT | 5939 | 110.58 / 141.20 | 175.40 / 177.70 | 155.86 / 176.48 | 20.64 / 0.64 | 65.82 / 0.28
SeisSynth | 404864 | 79.47 / 70.86 | 547.16 / 324.92 | 0.02 / 0.00 | 817.59 / 483.51 | 92.01 / 0.08
ZipSeis | 78 | 265.73 / 275.04 | 118.95 / 7.88 | 101.05 / 14.36 | 6.25 / 0.16 | 6.83 / 0.01
PVCOkaya | 404864 | 0.55 / 2.48 | 0.02 / 0.00 | 0.00 / 0.00 | 3.11 / 0.01 | 16.89 / 0.04
ZipPeakSA | 78 | 195.80 / 237.99 | 1.07 / 0.07 | 2.26 / 0.15 | 6.16 / 0.16 | 2.89 / 0.01

The vast majority of the runtime (more than 97%) of CyberShake is spent in the SeismogramSynthesis jobs. This suggests that SeismogramSynthesis would be a good place to focus any code-level optimization efforts. SeismogramSynthesis reads, on average, 547 MB per job; however, the total size of all the input files for SeismogramSynthesis is known to be on the order of 150–250 MB per job. That means some of the input data is being read multiple times, which suggests another opportunity for code-level optimization. Finally, SeismogramSynthesis also has very high memory requirements. On average it requires 817 MB of memory, but the profiling data shows that some instances require up to 1870 MB, which was not previously known. This has an impact on the scheduling of SeismogramSynthesis jobs because, on many of the nodes available on the grid, the amount of memory available per CPU core is only about 1 GB. Thus, if one SeismogramSynthesis job is scheduled on each core, the node could potentially run out of memory, causing the jobs to fail or slow down significantly if swapping occurs. This may explain some of the failures that occurred when the application was run on a different cluster that was relatively memory-poor (NCSA Abe). Care must be taken if the application is moved to another cluster in the future to configure the scheduler such that it never schedules more SeismogramSynthesis jobs on a node than the node can support.
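To illustrate the scheduling constraint implied by these memory measurements, the following is a small hypothetical helper (not part of wfprof or the CyberShake workflow) that estimates how many memory-hungry jobs could safely run concurrently on a node, given the node's core count and memory and the peak per-job memory reported by the profile.

```python
def max_concurrent_jobs(node_cores, node_memory_mb, peak_job_memory_mb,
                        reserved_mb=512):
    """Jobs per node limited by both cores and memory (illustrative).

    reserved_mb approximates memory kept aside for the OS and system
    processes; the exact value is an assumption, not a measured quantity.
    """
    usable_mb = max(node_memory_mb - reserved_mb, 0)
    by_memory = int(usable_mb // peak_job_memory_mb)
    return max(min(node_cores, by_memory), 0)

# A 16-core, 32 GB node with the observed 1870 MB peak: all cores usable.
print(max_concurrent_jobs(16, 32 * 1024, 1870))   # -> 16
# A node with 8 cores and only 1 GB per core: memory-limited to 4 jobs.
print(max_concurrent_jobs(8, 8 * 1024, 1870))     # -> 4
```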
5.6.3 Broadband

Figure 5.4 shows a small Broadband workflow, which integrates earthquake motion simulation codes and calculations. The workflow can be divided into four stages based on the types of computations performed in each stage. The first stage consists of the rupture generation jobs ucsb_createSRF and urs_createSRF. As mentioned earlier, one of the goals of the Broadband platform is to consider different codes that perform similar calculations (based on the underlying models) and compare and verify these calculations. In the example workflow shown in Figure 5.4, both types of rupture generation jobs take as input the same simple earthquake description (location and magnitude) and generate time series data that represent slip time histories for the earthquake at the same recording station.

Figure 5.4: Example Broadband workflow

The second stage includes the urs_lp_seisgen jobs that calculate deterministic low frequency (up to 1 Hz) seismograms based on the time series data generated in the previous stage. The third stage includes urs_hf_seisgen and sdsu_hf_seisgen, which add stochastic high frequency seismograms to the low frequency seismograms generated by urs_lp_seisgen. The ucsb_seisgen job may be considered to be part of both the second and third stages, as it calculates and merges both low frequency and high frequency seismograms. The fourth stage consists of rspectra jobs that extract parameters from the seismograms that are of interest to earthquake engineers. These parameters may include peak acceleration, peak ground velocity, and peak spectral acceleration.

Table 5.5 shows profile data for each type of job in the Broadband workflow. These statistics were collected from a workflow that simulated the ground motion at 8 stations for 6 earthquakes and a single velocity model.

Table 5.5: Broadband execution profile

Job | Count | Runtime (s) Mean / Std. Dev. | I/O Read (MB) Mean / Std. Dev. | I/O Write (MB) Mean / Std. Dev. | Peak Memory (MB) Mean / Std. Dev. | CPU Util. (%) Mean / Std. Dev.
ucsb_createSRF | 48 | 0.57 / 0.13 | 4.55 / 0.05 | 7.45 / 0.17 | 6.97 / 0.01 | 61.93 / 0.11
urs_createSRF | 48 | 0.43 / 0.09 | 0.00 / 0.00 | 1.96 / 0.01 | 7.45 / 0.01 | 62.98 / 0.14
urs_lp_seisgen | 96 | 84.92 / 6.69 | 43.65 / 18.30 | 0.63 / 0.00 | 10.13 / 0.07 | 91.73 / 0.06
urs_hf_seisgen | 96 | 3.06 / 0.63 | 5.54 / 0.06 | 3.31 / 0.00 | 12.71 / 0.09 | 70.84 / 0.07
sdsu_hf_seisgen | 96 | 196.40 / 11.14 | 938.28 / 0.01 | 1855.82 / 0.01 | 942.30 / 0.01 | 87.72 / 0.08
ucsb_seisgen | 98 | 68.32 / 12.43 | 11271.46 / 1292.00 | 5.11 / 0.15 | 53.68 / 4.90 | 90.57 / 0.08
rspectra | 288 | 0.34 / 0.21 | 0.81 / 0.00 | 0.46 / 0.00 | 7.92 / 0.01 | 16.20 / 0.06

The jobs from the second and third stages (urs_lp_seisgen, sdsu_hf_seisgen, ucsb_seisgen) that generate low and high frequency seismograms had the largest runtimes, with the runtime of the sdsu_hf_seisgen jobs being the largest overall. These jobs operate on relatively simple earthquake descriptions. The sdsu_hf_seisgen job is similar to the SeismogramSynthesis job in CyberShake. More than half of the CPU time of Broadband (about 55%) is used by sdsu_hf_seisgen, making it a good target for optimization. The sdsu_hf_seisgen jobs also generate the most output I/O, on average 1.8 GB per job. In addition, each sdsu_hf_seisgen job uses 942 MB of memory, which makes it difficult to schedule on nodes with only 1 GB per core (keeping in mind that the OS needs some memory for system processes and the kernel). The ucsb_seisgen job consumes the most input (11.2 GB per job).
The Broadband workflow is a good example of how workflow traces and profiles can be used to diagnose unusual and erroneous behavior. For example, while Table 5.5 shows that the I/O read for ucsb_seisgen is 11.2 GB, it is known that the input file for ucsb_seisgen is only about 3 GB. Therefore, at least some of the data is being read multiple times. This suggests that ucsb_seisgen could be optimized to reduce this duplication, which might improve performance. The file system cache may be able to absorb many of these reads; however, if the file system being used does not support client-side caching, as is the case in some parallel file systems, these duplicate reads may greatly reduce performance. Another interesting thing to notice about ucsb_seisgen is its memory usage. The peak memory actually used was 53 MB, but the peak memory allocated was 1847 MB (not shown in the table, but reported by pprof). This large gap suggests that there is either a bug in ucsb_seisgen, or that the author intended to cache more of the input data in memory, but the feature was only partially implemented.

5.6.4 Epigenomics

Figure 5.5 shows an example Epigenomics workflow, which maps the epigenetic state of human DNA. The Epigenomics workflow is a highly pipelined application with multiple pipelines operating on independent chunks of data in parallel. The input to the workflow is DNA sequence data obtained for multiple "lanes" from a genetic sequencing machine. The information from each lane is split into multiple chunks by the fastqSplit jobs. The number of splits generated depends on a parameter, chunk size, used on the input data. The filterContams jobs then filter out noisy and contaminated data from each of the chunks. The data in each chunk is then converted to a format understood by the Maq DNA sequence mapping software by the sol2sanger utility. For faster processing and reduced disk space usage, the data are then converted to the binary fastQ format by fastq2bfq. Next, the remaining sequences are aligned with the reference genome by the map utility. The results of individual map processes are combined using one or more stages of mapMerge jobs. After merging, the maqIndex utility operates on the merged alignment file and retrieves reads about a specific region (e.g. chromosome 21). Finally, the pileup utility reformats the data so that it can be displayed by a GUI.

Figure 5.5: Example Epigenomics workflow

The map jobs that align sequences with the reference genome are the most computationally intensive, followed by the pileup and maqIndex jobs that work on the entire aligned output. The performance of other jobs in the pipeline mainly depends on the amount of data in each of the individual chunks.

Table 5.6 lists profiling data collected during executions of a sample workflow that was used to align approximately 13 million sequences. For most Epigenome jobs, the CPU utilization is high, so we classify the workflow as CPU-bound. The few low utilization jobs are simple data conversion jobs: fastqSplit just splits an input file into multiple output files, and sol2sanger just converts its input file into another format. The pileup job has a mean utilization of 153%, which seems strange, but remember that it is possible for a job to have a CPU utilization of more than 100% if it makes use of multiple cores. In this case, pileup is a shell script that pipes the output of Maq to the Awk data extraction utility.
Since Maq and Awk are independent processes, they can be scheduled on different cores concurrently, provided that one is not always blocked waiting for the output of the other, which appears to be the case here. In addition, knowing that both processes run at the same time is useful from a workflow scheduling perspective because it suggests that perhaps pileup should be treated as a parallel job requiring two cores. However, for this application there is only one pileup job per workflow, so special treatment is not likely to have a significant impact on overall workflow performance. The map jobs represent the vast majority of the runtime of the Epigenome workflow (about 96%), which suggests that any performance optimizations would be best directed towards improving the mapping code.

Table 5.6: Epigenomics execution profile

Job | Count | Runtime (s) Mean / Std. Dev. | I/O Read (MB) Mean / Std. Dev. | I/O Write (MB) Mean / Std. Dev. | Peak Memory (MB) Mean / Std. Dev. | CPU Util. (%) Mean / Std. Dev.
fast2bfq | 128 | 1.40 / 0.24 | 10.09 / 1.73 | 2.22 / 0.40 | 4.05 / 0.01 | 88.42 / 0.10
pileup | 1 | 55.95 / 0.00 | 151.82 / 0.00 | 83.95 / 0.00 | 148.26 / 0.00 | 153.43 / 0.00
mapMerge | 8 | 11.01 / 12.18 | 27.68 / 32.49 | 26.71 / 32.89 | 5.00 / 0.39 | 95.08 / 0.06
map | 128 | 201.89 / 21.91 | 138.76 / 0.80 | 0.90 / 0.22 | 196.04 / 5.50 | 96.69 / 0.01
sol2sanger | 128 | 0.48 / 0.14 | 13.15 / 2.25 | 10.09 / 1.73 | 3.79 / 0.00 | 65.17 / 0.12
filterContams | 128 | 2.47 / 0.43 | 13.25 / 2.26 | 13.25 / 2.26 | 2.97 / 0.06 | 88.54 / 0.09
mapIndex | 1 | 43.57 / 0.00 | 214.10 / 0.00 | 107.53 / 0.00 | 6.17 / 0.00 | 99.50 / 0.00
fastqSplit | 7 | 34.32 / 8.94 | 242.29 / 82.60 | 242.29 / 82.60 | 2.80 / 0.00 | 22.41 / 0.03

5.6.5 LIGO Inspiral Analysis

The LIGO Inspiral Analysis workflow analyzes data from the coalescing of compact binary systems such as binary neutron stars and black holes. This workflow is very complex and is composed of several sub-workflows. A simplified representation of the workflow is shown in Figure 5.6, with sub-workflows indicated by dashed lines. The actual workflow that was profiled contains 25 sub-workflows with more than 2000 total jobs. Other instances of the workflow contain up to 100 sub-workflows and 200,000 jobs.

The TmpltBank jobs, which identify the continuous family of waveforms that belong to the parameter space for each block of the data, can all be executed in parallel once the input data from the LIGO detectors have been split into multiple blocks of 2048 seconds each. The output of a TmpltBank job is a bank of waveform parameters that are used by the matched filtering code in an Inspiral job. The triggers produced by multiple Inspiral jobs are tested for consistency by inspiral coincidence analysis jobs, which are denoted by Thinca in the example. Since these jobs operate on data obtained from multiple jobs, they can be regarded as data aggregation jobs. The outputs of the Thinca jobs are inputs to the TrigBank jobs that generate template banks out of the triggers. These template banks are then used by the second set of Inspiral jobs, followed by another set of Thinca jobs. More information about the details of the LIGO workflow can be found in Capano's PhD thesis [28].

The profiling data for the LIGO Inspiral workflow is shown in Table 5.7. Inspiral jobs that execute the matched filter to generate triggers are the most computationally intensive jobs in the workflow. The 358 instances of the Inspiral job collectively consume around 92% of the total runtime, around 92% of the I/O (197 GB), and have relatively high CPU utilization (around 90%).
The 29 TmpltBank jobs are also computationally intensive (average runtime 500 seconds and average CPU utilization of 99%) and read significant input data (16 GB). The remaining jobs in the workflow have short runtimes, low I/O consumption and low CPU utilization. An interesting aspect of the LIGO Inspiral workflow's I/O usage is that it reads a large amount of data (213 GB), but writes very little (50 MB). The other workflows characterized have significantly more output data than input data.

Figure 5.6: Example LIGO Inspiral Analysis workflow (dotted lines indicate sub-workflow boundaries)

Table 5.7: LIGO Inspiral Analysis execution profile

Job | Count | Runtime (s) Mean / Std. Dev. | I/O Read (MB) Mean / Std. Dev. | I/O Write (MB) Mean / Std. Dev. | Peak Memory (MB) Mean / Std. Dev. | CPU Util. (%) Mean / Std. Dev.
TmpltBank | 29 | 497.23 / 45.78 | 552.66 / 6.04 | 0.04 / 0.00 | 404.54 / 0.02 | 98.94 / 0.00
Inspiral | 358 | 472.89 / 453.18 | 553.16 / 5.44 | 0.05 / 0.07 | 533.17 / 116.28 | 89.96 / 0.11
Thinca | 290 | 0.64 / 1.06 | 0.53 / 0.92 | 0.00 / 0.01 | 2.63 / 0.83 | 43.90 / 0.28
Inca | 20 | 0.35 / 0.27 | 0.26 / 0.18 | 0.13 / 0.09 | 2.30 / 0.35 | 37.93 / 0.20
Data_Find | 6 | 0.99 / 0.38 | 0.03 / 0.00 | 0.01 / 0.01 | 10.06 / 0.01 | 55.55 / 0.05
Inspinj | 4 | 0.30 / 0.28 | 0.00 / 0.00 | 0.08 / 0.00 | 1.99 / 0.01 | 8.32 / 0.08
TrigBank | 200 | 0.13 / 0.14 | 0.03 / 0.05 | 0.00 / 0.00 | 2.04 / 0.14 | 17.44 / 0.14
Sire | 748 | 0.25 / 0.21 | 0.17 / 0.34 | 0.02 / 0.10 | 1.93 / 0.15 | 14.15 / 0.18
Coire | 386 | 0.21 / 0.16 | 0.07 / 0.08 | 0.02 / 0.03 | 1.91 / 0.06 | 8.00 / 0.07

5.6.6 SIPHT

A small SIPHT workflow that searches for small untranslated RNAs (sRNAs) is shown in Figure 5.7. All SIPHT workflows have almost identical structure, and large workflows can be composed of smaller, independent workflows. The only difference between two workflow instances is the number of Patser jobs, which scan sequences with position-specific scoring matrices that return matching positions. The number of Patser jobs depends on inputs describing transcription factor binding sites (TFBSs). The results of these Patser jobs are concatenated by the Patser_Concate job. There are several BLAST jobs in the workflow that compare different combinations of sequences. The Blast job and the Blast_QRNA job operate on all possible partner inter-genetic regions (IGRs) from other suitable replicons. Even though it is not apparent in Figure 5.7, these BLAST jobs operate on hundreds of data files. There are three jobs in the workflow that search for transcription terminators: FindTerm, RNAMotif and Transterm. The sRNA prediction is performed by the SRNA job that operates on the outputs of the above jobs as well as the output of the BLAST job. The output of this job is used by the other BLAST jobs. The SRNA_annotate job annotates the candidate sRNA loci that were found for multiple features, such as their conservation in other bacterial strains, their association with putative TFBSs, and their homology to other previously identified sRNAs [109].

Figure 5.7: Example SIPHT workflow

Table 5.8 shows the execution profile statistics collected from an execution of the SIPHT workflow. Like the other workflows, SIPHT has a single job type (BLAST) that accounts for most of its runtime (about 74%). Findterm and Blast_QRNA also contribute a significant amount to the runtime. Also, similar to Epigenome, SIPHT is primarily a CPU-bound workflow. Most of the jobs have high CPU utilization and relatively low I/O.
Only Patser_Concate, which simply concatenates the output of the Patser jobs, has low CPU utilization. The BLAST and Blast_QRNA jobs read the most input data (about 800 MB each) and write the most output data (about 565 MB each). The Findterm job also writes significant output data (379 MB).

Table 5.8: SIPHT execution profile

Job | Count | Runtime (s) Mean / Std. Dev. | I/O Read (MB) Mean / Std. Dev. | I/O Write (MB) Mean / Std. Dev. | Peak Memory (MB) Mean / Std. Dev. | CPU Util. (%) Mean / Std. Dev.
Patser | 19 | 0.96 / 0.08 | 2.70 / 0.00 | 0.00 / 0.00 | 4.48 / 0.00 | 83.48 / 0.04
Patser_concate | 1 | 0.03 / 0.00 | 0.14 / 0.00 | 0.14 / 0.00 | 2.29 / 0.00 | 18.89 / 0.00
Transterm | 1 | 32.41 / 0.00 | 2.93 / 0.00 | 0.00 / 0.00 | 16.03 / 0.00 | 94.79 / 0.00
Findterm | 1 | 594.94 / 0.00 | 15.14 / 0.00 | 379.01 / 0.00 | 58.21 / 0.00 | 95.20 / 0.00
RNAMotif | 1 | 25.69 / 0.00 | 2.91 / 0.00 | 0.04 / 0.00 | 4.38 / 0.00 | 95.05 / 0.00
Blast | 1 | 3311.12 / 0.00 | 808.17 / 0.00 | 565.06 / 0.00 | 116.38 / 0.00 | 93.87 / 0.00
SRNA | 1 | 12.44 / 0.00 | 47.98 / 0.00 | 1.32 / 0.00 | 5.65 / 0.00 | 93.48 / 0.00
FFN_parse | 1 | 0.73 / 0.00 | 5.03 / 0.00 | 2.51 / 0.00 | 5.00 / 0.00 | 81.09 / 0.00
Blast_synteny | 1 | 3.37 / 0.00 | 1.76 / 0.00 | 0.42 / 0.00 | 14.18 / 0.00 | 61.01 / 0.00
Blast_candidate | 1 | 0.60 / 0.00 | 0.27 / 0.00 | 0.06 / 0.00 | 13.28 / 0.00 | 43.61 / 0.00
Blast_QRNA | 1 | 440.88 / 0.00 | 804.60 / 0.00 | 567.01 / 0.00 | 115.21 / 0.00 | 87.80 / 0.00
Blast_paralogues | 1 | 0.68 / 0.00 | 0.12 / 0.00 | 0.03 / 0.00 | 13.34 / 0.00 | 44.30 / 0.00
SRNA_annotate | 1 | 0.14 / 0.00 | 0.40 / 0.00 | 0.03 / 0.00 | 6.95 / 0.00 | 55.96 / 0.00

5.7 Summary

This chapter described a profiling strategy for scientific workflows that consists of 1) running workflow tasks using both I/O and system call trace utilities to collect detailed logs of resource usage, and 2) analyzing the resulting traces to develop a descriptive profile for the workflow. These tools were used to profile the execution of six scientific workflows that span a range of characteristics, from small to large and from compute-intensive to data-intensive, with both moderate and large resource requirements. The workflows were executed at typical scale in grid environments, and the profiling technique was used to characterize their runtimes, I/O requirements, memory usage and CPU utilization.

While these workflows are diverse, all of them had one job type that accounted for most of the workflow's runtime. There were also several workflows in which the same data was re-read multiple times by a single workflow task. Both observations suggest opportunities for optimizing the performance of these workflows. In general, workflow profiles can be used to detect errors and unexpected behavior and to identify the best candidates for optimizing workflow execution.

This work is intended to provide to the research community a detailed overview of the types of scientific analyses that are being run using workflow management systems and their typical resource requirements, with the goal of improving the design and evaluation of algorithms used for resource provisioning and job scheduling in workflow systems. In the future, these workflow profiles could be used to create workflow simulations, to construct realistic synthetic workflows that can be used to evaluate different workflow engines and scheduling algorithms, to debug and optimize workflow performance, and to create a suite of benchmarks for scientific workflows.

Chapter 6

Provisioning for Workflow Ensembles

One of the primary challenges of resource provisioning is ensuring that the resources allocated match the work to be performed. Provisioning too many resources leaves some resources underutilized and increases the cost of a computation.
Provisioning too few resources causes the execution time to increase. The problem is complicated by the fact that the ideal number of resources can change over time due to the dynamic nature of both the application and the environment. When jobs and resources fail, any partially completed work needs to be re-executed, which increases the demand for resources. Furthermore, the resource needs of an application can change over time as a result of the structure of the application or its data. This latter case is particularly true of workflow applications.

Consider the workflow shown in Figure 6.1, which contains serial tasks with runtimes as shown. This workflow has a data redistribution structure [15] that results in an execution bottleneck. If this workflow were to be run on a distributed system, there are many ways that it could be scheduled. A single processor could be provisioned to execute all of the tasks in 11 hours, but this would result in a runtime much greater than the minimum achievable runtime of 3 hours. Another option would be to provision 5 processors for 3 hours, but that would overprovision by (3 × 5) − 11 = 4 CPU hours, or 36% of the 11 CPU hours of actual work. The ideal solution is to provision a block of 5 processors for 1 hour, followed by 1 processor for 1 hour, then another block of 5 processors for 1 hour. This provisioning plan results in the minimum application runtime with no waste.

Figure 6.1: Example workflow containing a data redistribution structure that results in a bottleneck and changes the resource needs of the application over time.

In practice, however, things are rarely as simple as that. Often the tasks do not have uniform runtimes, the workflow structure is much more complex, there are scheduling delays, task throttling is required to prevent the system from being overloaded, and task and resource failures lead to rework. A more realistic example is shown in Figure 6.2. This chart shows the number of running and idle tasks during a 4-hour run of the CyberShake application [41]. The height of the curve is equal to the number of running and idle tasks (the number of tasks in the application queue). At the time this data was collected there were 800 processors provisioned for the application, but the number of tasks available to be executed varied widely from 0 all the way up to around 1300. There were significant periods of time where the number of available tasks was either much less or much more than the 800 processors available. The average resource utilization over the period of time that the resources were provisioned was approximately 55%, meaning that nearly half of the resources allocated for the application were unused.

Figure 6.2: CyberShake run showing how the resource requirements of a workflow can change over time.

The problem is more difficult when multiple workflows are considered. Workflow applications for large computational problems often consist of several related workflows grouped into ensembles. All the workflows in an ensemble typically have a similar structure, but they differ in their input data, number of tasks, and individual task sizes. The CyberShake application [27] is a good example of an application that consists of workflow ensembles. CyberShake computes seismic hazard curves for a given geographical location. In order to produce a hazard map for a region such as Southern California, many hazard curves need to be computed for a large number of geographical sites.
Each hazard curve is computed by a single workflow, so generating a hazard map requires an ensemble of many related workflows. The hazard map generated by the CyberShake group in 2009, for example, required an ensemble of 239 workflows.

There are many other examples of workflow ensembles. In astronomy, for example, a user might wish to use Montage [94] to generate a set of image mosaics that cover a given area of the sky in different wavelengths, which requires multiple workflows with different parameters. Alternatively, a user might want to use an ensemble to generate the parts of a very large mosaic. The Pegasus team is currently working on a project called Galactic Plane that is focused on generating a mosaic of the entire sky. The Galactic Plane ensemble consists of 17 workflows, each of which contains 900 sub-workflows and computes one "tile" of the full mosaic.

Another ensemble example is the Periodograms application described in Chapter 3. Periodograms are used to detect the periodic dips in light intensity that occur when an extrasolar planet transits its host star. Due to the large scale of the input data for this application, it is often split up into multiple batches, each of which is processed by a different workflow. Additional workflows are created to run the analysis using different parameters. A recent analysis of data from the Kepler satellite required three ensembles of 15 workflows [14].

Ensemble workflows may differ not only in their parameters, but also in their priority. For example, in CyberShake some sites may be in heavily populated areas or in strategic locations such as power plants, while others may be less important. Scientists typically prioritize the workflows in an ensemble such that important workflows are finished first. This enables them to see critical results early, and helps them to choose the most important workflows when the time and financial resources available for running the computations are limited.

Once an ensemble is constructed it needs to be executed on distributed resources. This can be done either using cluster and grid resources, as described in Chapter 2, or using an infrastructure cloud. As was mentioned in Chapter 3, infrastructure clouds offer the ability to provision computational resources on-demand according to a pay-per-use model. In contrast to clusters and grids, which typically offer best-effort quality of service, clouds give users more flexibility in creating a controlled and managed computing environment. Most significantly, clouds provide the ability to adjust resource capacity according to the dynamically changing computing demands of the application. This resource provisioning capability opens up many new opportunities for optimizing the cost and performance of workflow ensembles.

Running workflow ensembles in infrastructure clouds requires the development of new methods for task scheduling and automatic resource provisioning. The resource management decisions required in ensemble scenarios not only have to take into account traditional performance-related metrics such as workflow makespan or resource utilization, but must also consider budget constraints, since the resources from commercial cloud providers usually have monetary costs associated with them [49], and deadline constraints, since users often do not have unlimited time in which to complete their analyses.

In this chapter we aim to gain insight into resource management challenges when executing scientific workflow ensembles on clouds.
In particular, we address the new and important problem of maximizing the number of completed workflows from a prioritized ensemble under both budget and deadline constraints. The motivation for this work is to answer the fundamental question of concern to a researcher: How much computation can be completed given the limited budget and timeframe of a research project? The goals of this work are to discuss and assess possible static and dynamic strategies for both task scheduling and resource provisioning. We develop three strategies that rely on information about the workflow structure (critical paths and dependencies) and estimates of task runtimes. These strategies include: a static approach that plans out all provisioning and scheduling decisions ahead of execution, an online algorithm that dynamically adjusts resource usage based on utilization, and another, similar online algorithm that also takes into account workflow structure information. We evaluate these algorithms using a simulator we have developed using CloudSim [25] that models the infrastructure, middleware, and the application. The algorithms were evaluated on a set of synthetic workflow ensembles that were generated based on several different real workflow applications using a broad range of budget and deadline parameters.

6.1 Related Work

General policy and rule-based approaches to dynamic provisioning (e.g. Amazon Auto Scaling [6] and RightScale [154]) allow adjusting the size of the resource pool based on metrics related to the infrastructure and the application. A typical infrastructure-specific metric is system load, whereas application-specific metrics include response time and the length of a task or request queue. It is possible to set thresholds and limits to tune the behavior of these autoscaling systems, but no support for complex applications is provided.

Policy-based approaches for scientific workloads (e.g. [114, 99]) also enable users to scale the requested cloud resource pool or to extend the capabilities of clusters using cloud-burst techniques. The proposed approach is different in that it considers workflows, while previous policy-based approaches have only considered bags of independent tasks or unpredictable batch workloads. The use of workflows enables the system to take advantage of workflow-aware scheduling heuristics that cannot be applied to independent tasks.

Several different systems have been developed that implement dynamic provisioning for workload management on the grid. The system developed by Pinchak et al. [145] uses a simple recursive scheme where pilot jobs (called placeholders) run a single application task and then resubmit themselves if there is additional work in the queue. Falkon [151] has provisioning policies for both allocating and releasing resources. The allocation policy analyzes the queue periodically to determine how many resources to provision, and for how long. The release policy is based on idle time, such that when the resource has been idle for a configurable amount of time it releases itself automatically. glideinWMS [159] also uses an idle time policy for releasing resources, but the allocation policy is designed to continue provisioning resources as long as the provider has free resources and the application queue has idle tasks. All of these approaches use relatively simplistic policies that consider only queued jobs. They have not been extensively evaluated to determine how well they perform on a variety of applications, such as workflows.
Although they can be used to enable dynamic provisioning for workflows, they are not optimized to include information about workflow structure that could be used to predict future load.

There has been much recent interest in the dynamic provisioning of cloud resources for batch computing. Rodriguez et al. [155] propose an architecture for dynamically adding cloud resources to a cluster running Sun Grid Engine. Similarly, the Elastic Site system [114] has been used to dynamically add virtual machines to a PBS cluster, and Murphy and Goasguen [120, 119] developed a system to dynamically add cloud resources to a Condor pool. With the exception of Elastic Site, which uses several different provisioning policies, these approaches all use greedy algorithms that provision a single VM for each job. This approach, although simple, is not efficient for applications that contain short-running jobs, such as workflows. In addition, the policies used by Elastic Site have parameters that seem to have been chosen using an educated guess, without any justification. Presumably this was done because it was difficult and time-consuming for the authors to evaluate different values to determine what results in the best performance. This suggests that more work needs to be done to evaluate dynamic provisioning policies.

A few techniques have been developed that specifically address provisioning for workflow applications. Singh et al. [161, 162] have developed a multi-objective genetic algorithm that minimizes the cost and makespan of a workflow by choosing resources to provision from a list of available time slots. Huang et al. [79] have developed a performance model for workflow applications that is used to generate static resource specifications that minimize makespan. Byun et al. [24] have developed an approach based on simulated scheduling that estimates the number of resources required to finish a workflow by a user-specified deadline. All of these approaches rely on workflow structure information and runtime estimates to develop static provisioning plans ahead of workflow execution. As such, they are all sensitive to the uncertainties in workflow execution that dynamic provisioning is meant to address.

Some previous research has considered adapting workflows for execution in dynamic environments. The approach developed by Lee et al. [106] enables workflows to adapt to changing resource availability by re-planning portions of a workflow. The problem considered here is the inverse: in their model, the workflows must adapt to changing resource availability rather than the resource availability adapting to changing workflows.

This work is related to the strategies for deadline-constrained cost-minimization workflow scheduling developed for utility grid systems. However, this problem is different from [194, 2] in that it considers ensembles of workflows in IaaS clouds, which allow one to provision resources on a per-hour billing model, rather than utility grids, which allow one to choose from a pool of existing resources with a per-job billing model. This work is also different from the cloud-targeted autoscaling solution [112] in that it considers ensembles of workflows rather than unpredictable workloads containing workflows. It also considers budget constraints rather than cost minimization as a goal. In other words, it assumes that there is more work to be done than the available budget allows, so some work must be rejected.
Therefore cost is not an objective (something to optimize), but rather a constraint.

This work is also related to bi-criteria and multi-criteria scheduling of workflows [186, 149, 46]. These approaches are similar in that this problem has two scheduling criteria: cost and makespan. The challenge in multi-criteria scheduling is to derive an objective function that takes into account all of the criteria. In this case one objective, the amount of work completed, is subject to optimization, while time and cost are treated as constraints. Other approaches [168, 141] use metaheuristics that usually run for a long time before producing good results for a single workflow, which makes them less useful in the scenarios considered here, where ensembles contain many workflows. This work can also be regarded as an extension of budget-constrained workflow scheduling [156], in the sense that this problem adds ensembles and a deadline constraint to budget-constrained scheduling.

6.2 Problem Description

6.2.1 Resource Model

We assume a resource model similar to Amazon's Elastic Compute Cloud (EC2). In this model, virtual machine (VM) instances are requested on-demand. The VMs are billed by the hour, with partial hours being rounded up. Although there are multiple VM types with different amounts of CPU, memory, disk space, and I/O, for this thesis we focus on a single VM type because we assume that for most applications there will typically be only one or two VM types with the best price/performance ratio for the application [91]. We also assume that a submitted task has exclusive access to a VM instance and that there is no preemption. We assume that there is a delay between the time that a new VM instance is requested and when it becomes available to execute tasks. In practice this delay is typically on the order of tens of seconds to a few minutes, and is highly dependent upon the cloud platform and the VM image size [130].

6.2.2 Application Model

The target applications are ensembles of scientific workflows that can be modeled as Directed Acyclic Graphs (DAGs), where the nodes in the graph represent computational tasks, and the edges represent data- or control-flow dependencies between the tasks. We assume that there are runtime estimates for each task in the workflow based on either a performance model of the application, or historical data that can be mined. Additionally, we assume that the runtime estimates for individual workflow tasks are not perfect, but may vary based on a uniform distribution of p%.

This study uses synthetic workflows that were generated using historical data from real applications [15]. The applications come from a wide variety of domains, including bioinformatics (Epigenomics, SIPHT: sRNA identification protocol using high-throughput technology), astronomy (Montage), earthquake science (CyberShake), and physics (LIGO). The runtimes used to generate the workflows are based on distributions gathered using the technique described in Chapter 5. The workflows were generated using code developed by Bharathi et al. [190].

Although scientific workflows are often data-intensive, the algorithms described in the next section do not currently consider the size of input and output data when scheduling tasks. Instead, it is assumed that all workflow data is stored in a shared cloud storage system, such as Amazon S3, and that intermediate data transfer times are included in task runtime estimates.
It is also assumed that data transfer times between the shared storage and the VMs are equal for different VMs, so that task placement decisions do not have an impact on the runtime of the tasks.

This work focuses on scheduling and provisioning for workflow ensembles. A workflow ensemble consists of several related workflows that a user wishes to run. Each workflow in an ensemble is given a numeric priority that indicates how important the workflow is to the user. As such, the priorities indicate the utility function of the user. These priorities are absolute in the sense that completing a workflow with a given priority is more valuable (gives the user more utility) than completing all other workflows in the ensemble with lower priorities combined.

The goal of the workflow ensemble scheduling and cloud provisioning problem is to complete as many high-priority workflows as possible given a fixed budget and deadline. Only workflows for which all tasks are finished by the deadline are considered to be complete; partial results are not usable in this model.

6.3 Algorithms

This section describes several algorithms that were developed to schedule and provision resources for ensembles of workflows on the cloud under budget and deadline constraints.

6.3.1 Static Provisioning Dynamic Scheduling (SPDS)

The simplest strategy for executing ensembles of workflows in the cloud is to provision resources statically and schedule workflow tasks dynamically. Given a budget in dollars b, a deadline in hours d, and the hourly price of a VM in dollars p, it is easy to calculate the number of VMs, N_VM, to provision so that the entire budget is consumed before the deadline:

N_VM = ⌈b / (d · p)⌉    (6.1)

The SPDS algorithm statically provisions N_VM VMs at the start of the ensemble execution and keeps them running until the deadline is reached, or the budget runs out. This provisioning plan has the advantage that it minimizes the number of provisioning and de-provisioning requests.

Once the VMs are provisioned, the tasks are mapped onto idle VMs using the dynamic priority-based scheduling procedure shown in Algorithm 1. Initially, the ready tasks from all workflows in the ensemble are added to a priority queue based on the priority of the workflow to which they belong. If there are idle VMs available, and the priority queue is not empty, the next task from the priority queue is submitted to an arbitrarily chosen idle VM. The process is repeated until there are no idle VMs or the priority queue is empty. The scheduler then waits for a task to finish, adds its ready children to the priority queue, marks the VM as idle, and the entire process repeats until the deadline is reached.

Algorithm 1 Priority-based scheduling algorithm for SPDS
1:  procedure SCHEDULE
2:    P ← empty priority queue
3:    IdleVMs ← set of idle VMs
4:    for root task t in all workflows do
5:      INSERT(t, P)
6:    end for
7:    while deadline not reached do
8:      while IdleVMs ≠ ∅ and P ≠ ∅ do
9:        v ← SELECTRANDOM(IdleVMs)
10:       t ← POP(P)
11:       SUBMIT(t, v)
12:     end while
13:     Wait for task t to finish on VM v
14:     Update P with ready children of t
15:     INSERT(v, IdleVMs)
16:   end while
17: end procedure

Figure 6.3: Example schedule generated by the SPDS algorithm. Each row is a different VM. Tasks are represented as boxes and are colored by workflow.
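As a small numerical illustration of Equation 6.1 and the static plan it implies, the following sketch computes N_VM for a hypothetical budget, deadline, and price (the numbers are illustrative, not taken from the experiments in this thesis).

```python
import math

def static_vm_count(budget, deadline_hours, price_per_hour):
    """Number of VMs to provision statically (Equation 6.1)."""
    return math.ceil(budget / (deadline_hours * price_per_hour))

# Hypothetical example: a $100 budget, a 10-hour deadline, and $1.00/hour VMs.
budget, deadline, price = 100.0, 10.0, 1.00
n_vm = static_vm_count(budget, deadline, price)
# Cost if all N_VM VMs run until the deadline; because of the ceiling,
# SPDS may instead stop VMs early when the budget runs out.
planned_cost = n_vm * deadline * price
print(n_vm, planned_cost)   # -> 10 VMs, $100 consumed by the deadline
```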
However, because there is no preemption, long-running low-priority tasks may delay the execution of higher-priority tasks. In addition, tasks from low-priority workflows may be executed even though there is no chance that those workflows will be completed within the current budget and deadline. Figure 6.3 shows an example schedule generated using the SPDS algorithm; it illustrates how tasks from lower-priority workflows backfill idle VMs when tasks from higher-priority workflows are not available.

6.3.2 Dynamic Provisioning Dynamic Scheduling (DPDS)

The static provisioning approach of SPDS may perform poorly when resource demand changes during ensemble execution. In those cases, SPDS may either over-provision, wasting the budget on idle VMs, or under-provision, using up the deadline while starving the application of usable VMs. One of the advantages of elasticity in cloud computing is that the number of resources can be changed to meet the demands of an application. In policy-based provisioning (or auto-scaling) systems, metrics such as resource utilization or task queue length are monitored to estimate application demand. This is the approach used by the Dynamic Provisioning Dynamic Scheduling (DPDS) algorithm to determine when to provision and de-provision VMs. The algorithm periodically computes resource utilization as the percentage of idle VMs over time, and adjusts the number of VMs if the utilization is above or below given thresholds. Because it is assumed that VMs are billed by the hour, DPDS only considers VMs that are approaching their hourly billing cycle when deciding which VMs to terminate. The DPDS algorithm is shown in Algorithm 2.

Algorithm 2 Dynamic provisioning algorithm for DPDS
Require: c: consumed budget; b: total budget; d: deadline; p: price; t: current time; u_h: upper utilization threshold; u_l: lower utilization threshold; v_max: maximum number of VMs
 1: procedure PROVISION
 2:   V_R ← set of running VMs
 3:   V_C ← set of VMs completing billing cycle
 4:   V_T ← ∅                ▷ set of VMs to terminate
 5:   n_T ← 0                ▷ number of VMs to terminate
 6:   if b − c < |V_C| · p or t > d then
 7:     n_T ← |V_R| − ⌊(b − c)/p⌋
 8:     V_T ← select n_T VMs to terminate from V_C
 9:     TERMINATE(V_T)
10:   else
11:     u ← current VM utilization
12:     if u > u_h and |V_R| < v_max · N_VM then
13:       START(new VM)
14:     else if u < u_l then
15:       V_I ← set of idle VMs
16:       n_T ← ⌈|V_I| / 2⌉
17:       V_T ← select n_T VMs to terminate from V_I
18:       TERMINATE(V_T)
19:     end if
20:   end if
21: end procedure

The set of VMs completing their billing cycle is determined by considering both the provisioner interval and the termination delay of the provider. This guarantees that VMs can be terminated before they start the next billing cycle and prevents the budget from being overrun. The VMs terminated in line 9 of Algorithm 2 are the ones that would overrun the budget if not terminated in the current provisioning cycle. The VMs terminated in line 18 are chosen to increase the resource utilization to the desired threshold. In order to prevent instances from being terminated too quickly, potentially wasting resources that have already been paid for but could be used later, no more than half of the idle resources are terminated during each provisioning cycle. To avoid an uncontrolled increase in the number of instances, which may happen in the case of highly parallel workflows, the provisioner will not start a new VM if the number of running VMs is greater than the product of N_VM (from Equation 6.1) and an autoscaling parameter, v_max. Unless otherwise specified, v_max is assumed to be 1. A compact sketch of this provisioning policy is given below.
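The following Python sketch captures the decision logic of Algorithm 2; it returns an action rather than calling a cloud API, and the parameter names (for example u_high, u_low, and max_vms) are illustrative rather than taken from the simulator.

    def dpds_provision(budget, consumed, price, now, deadline,
                       running_vms, idle_vms, completing_cycle,
                       utilization, u_high, u_low, max_vms):
        """Return ('terminate', [vms...]), ('start', 1), or ('noop', []).

        completing_cycle is the subset of running VMs approaching the end of
        their current billing hour; utilization is the measured VM utilization
        (derived from the fraction of idle VMs over time).
        """
        if budget - consumed < len(completing_cycle) * price or now > deadline:
            # Not enough budget to renew every VM about to start a new billing
            # cycle (or the deadline has passed): shed VMs from that set.
            n = len(running_vms) - int((budget - consumed) // price)
            return ('terminate', completing_cycle[:max(n, 0)])
        if utilization > u_high and len(running_vms) < max_vms:
            return ('start', 1)
        if utilization < u_low:
            # Terminate at most half of the idle VMs per provisioning cycle.
            n = -(-len(idle_vms) // 2)  # ceiling division
            return ('terminate', idle_vms[:n])
        return ('noop', [])

In the terms used above, max_vms corresponds to the product of v_max and N_VM from Equation 6.1.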
6.3.3 Workflow-Aware DPDS (WA-DPDS)

The DPDS algorithm does not use any information about the structure of the workflows in the ensemble when scheduling tasks: it looks only at the priorities of the ready tasks when deciding which task to schedule next. It does not consider whether a lower-priority task belongs to a workflow that will never be able to complete given the current budget and deadline. As a result, DPDS may start lower-priority tasks just to keep VMs busy, and those tasks may end up delaying higher-priority tasks later on, making it less likely that higher-priority workflows will be able to finish.

In order to address this issue, the Workflow-Aware DPDS (WA-DPDS) algorithm extends DPDS by introducing a workflow admission procedure. The admission procedure is invoked whenever WA-DPDS sees the first task of a new workflow at the head of the priority queue (i.e., when no other tasks from that workflow have been scheduled yet). The admission procedure, shown in Algorithm 3, estimates whether there is enough budget remaining to admit the new workflow; if there is not, then the workflow is rejected and its tasks are removed from the queue. The algorithm compares the current cost (consumed budget) and the remaining budget, taking into account the cost of currently running VMs and the cost of workflows that have already been admitted. In addition, it adds a small safety margin of $0.10 to avoid going over the budget.

This admission procedure relies only on the total estimated resource consumption and compares it to the remaining budget. We found that this estimation is useful not only to prevent low-priority workflows from delaying high-priority ones, but also to reject large and costly workflows that would overrun the budget while admitting smaller workflows that can efficiently utilize idle resources in ensembles containing workflows of non-uniform sizes. It would also be possible to extend this admission procedure to check other constraints, such as whether the estimated critical path of the new workflow exceeds the time remaining until the deadline.

Algorithm 3 Workflow admission algorithm for WA-DPDS
Require: w: workflow; b: budget; c: current cost
 1: procedure ADMIT(w, b, c)
 2:   r_n ← b − c                  ▷ budget remaining for new VMs
 3:   r_c ← cost committed to VMs that are running
 4:   r_a ← cost to complete workflows previously admitted
 5:   r_m ← 0.1                    ▷ safety margin
 6:   r_b ← r_n + r_c − r_a − r_m  ▷ budget remaining
 7:   c_w ← ESTIMATECOST(w)
 8:   if c_w < r_b then
 9:     return TRUE
10:   else
11:     return FALSE
12:   end if
13: end procedure
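A minimal Python rendering of this admission check follows; the signs of the r_c and r_a terms follow Algorithm 3 as reconstructed above, and the argument names are illustrative.

    def admit(workflow_cost, budget, consumed,
              committed_vm_cost, admitted_cost, margin=0.10):
        """WA-DPDS admission check (sketch).

        Admit the workflow only if its estimated cost fits in the budget that
        remains after accounting for running VMs, previously admitted
        workflows, and a small safety margin.
        """
        remaining = (budget - consumed) + committed_vm_cost - admitted_cost - margin
        return workflow_cost < remaining

    # Example: $30 left of a $100 budget, $5 committed to running VMs, and
    # $12 still needed to finish previously admitted workflows.
    print(admit(workflow_cost=10.0, budget=100.0, consumed=70.0,
                committed_vm_cost=5.0, admitted_cost=12.0))  # -> True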
6.3.4 Static Provisioning Static Scheduling (SPSS)

The previous algorithms are all online algorithms that make provisioning and scheduling decisions at runtime. In comparison, the SPSS algorithm creates a provisioning and scheduling plan before running any workflow tasks. This enables SPSS to start only those workflows that it knows can be completed given the deadline and budget constraints, and eliminates any waste that may be allowed by the dynamic algorithms. The approach used by SPSS is to plan each workflow in the ensemble in priority order, rejecting any workflow that cannot be completed by the deadline or that causes the cost of the plan to exceed the budget. Once the plan is complete, the VMs are provisioned and tasks are executed according to the schedules given by the plan.

The disadvantage of the static planning approach used by SPSS is that it is sensitive to dynamic changes in the environment and the application that may disrupt its carefully constructed plan. For example, if there are provisioning delays, or if the runtime estimates for the tasks are inaccurate, then workflow execution may diverge from the plan. This issue is discussed further in Sections 6.4.7 and 6.4.8.

Algorithm 4 shows how ensembles are planned in SPSS. The algorithm starts with an empty plan containing no VMs and no scheduled tasks. Workflows from the ensemble are considered in priority order. For each workflow, SPSS attempts to build on top of the current plan by provisioning VMs to schedule the tasks of the workflow so that it finishes before the deadline with the least possible cost. If the cost of the new plan is less than the budget, then the new plan is accepted and the workflow is admitted. If not, then the new plan is rejected and the process continues with the next workflow in the ensemble. The idea behind this algorithm is that, if each workflow can be completed by the deadline with the lowest possible cost, then the number of workflows that can be completed within the given budget will be maximized.

Algorithm 4 Ensemble planning algorithm for SPSS
Require: W: workflow ensemble; b: budget; d: deadline
Ensure: Schedule as much of W as possible given b and d
 1: procedure PLANENSEMBLE(W, b, d)
 2:   P ← ∅               ▷ current plan
 3:   A ← ∅               ▷ set of admitted DAGs
 4:   for w in W do
 5:     P′ ← PLANWORKFLOW(w, P, d)
 6:     if Cost(P′) ≤ b then
 7:       P ← P′          ▷ accept new plan
 8:       A ← A + w       ▷ admit w
 9:     end if
10:   end for
11:   return P, A
12: end procedure

In order to plan a workflow, the SPSS algorithm assigns sub-deadlines to each individual task in the workflow, and then schedules each task so as to minimize its cost while still meeting its assigned sub-deadline. The idea is that if each task can be completed by its deadline in the least expensive manner possible, then the cost of the entire workflow can be minimized without exceeding the ensemble deadline. SPSS assigns sub-deadlines to each task based on the slack time of the workflow, which is defined as the amount of extra time by which a workflow can extend its critical path and still be completed by the ensemble deadline. For a workflow w, the slack time of w is:

    ST(w) = d − CP(w)

where d is the ensemble deadline and CP(w) is the critical path of w. We assume that CP(w) ≤ d; otherwise the workflow cannot be completed by the deadline and must be rejected. For large ensembles we expect the critical path of any individual workflow to be much less than the deadline.

A task's level is the length of the longest path between the task and an entry task of the workflow:

    Level(t) = 0                                      if Pred(t) = ∅
    Level(t) = max over p ∈ Pred(t) of Level(p) + 1   otherwise
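The following Python sketch computes the critical path CP(w), the slack time ST(w), and the level of each task as defined above, assuming a simple dictionary-based DAG representation rather than the data structures used in the simulator.

    def levels_and_critical_path(tasks, runtime, parents):
        """tasks: task ids in topological order; runtime[t]: runtime of t;
        parents[t]: list of t's predecessors (absent for entry tasks)."""
        level, finish = {}, {}
        for t in tasks:
            preds = parents.get(t, [])
            level[t] = 0 if not preds else max(level[p] for p in preds) + 1
            longest = max((finish[p] for p in preds), default=0.0)
            finish[t] = longest + runtime[t]   # longest path ending at t
        return level, max(finish.values())     # levels and CP(w)

    def slack_time(critical_path, deadline):
        # ST(w) = d - CP(w); reject the workflow if CP(w) > d
        return deadline - critical_path

    # Example: diamond-shaped workflow a -> {b, c} -> d
    rt = {"a": 2.0, "b": 4.0, "c": 1.0, "d": 3.0}
    par = {"b": ["a"], "c": ["a"], "d": ["b", "c"]}
    lvl, cp = levels_and_critical_path(["a", "b", "c", "d"], rt, par)
    print(lvl, cp, slack_time(cp, deadline=12.0))  # CP = 9.0, ST = 3.0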
SPSS distributes the slack time of the workflow by level, so that each level of the workflow gets a portion of the workflow's slack time proportional to the number of tasks in the level and the total runtime of the tasks in the level. The idea is that levels containing many tasks and large runtimes should be given a larger portion of the slack time so that the tasks in those levels may be serialized. Otherwise, many resources would need to be allocated to run all of the tasks in parallel, which may be more costly. The slack time of a level l in workflow w is given by:

    ST(l) = ST(w) · [ α · N(l)/N(w) + (1 − α) · R(l)/R(w) ]

where N(w) is the number of tasks in the workflow, N(l) is the number of tasks in level l, R(w) is the total runtime of all tasks in the workflow, R(l) is the total runtime of all tasks in level l, and α is a parameter between 0 and 1 that causes more slack time to be given to levels with more tasks (large α) or more runtime (small α). The deadline of a task t is then:

    DL(t) = LST(t) + RT(t) + ST(Level(t))

where Level(t) is the level of t, RT(t) is the runtime of t, and LST(t) is the latest start time of t, determined by:

    LST(t) = 0                                  if Pred(t) = ∅
    LST(t) = max over p ∈ Pred(t) of DL(p)      otherwise

Algorithm 5 shows how SPSS creates low-cost plans for each workflow. The PLANWORKFLOW procedure first calls DEADLINEDISTRIBUTION to assign sub-deadlines to tasks according to the procedure described above. Then the PLANWORKFLOW procedure schedules tasks onto VMs, allocating new VMs when necessary.

Algorithm 5 Workflow planning algorithm for SPSS
Require: w: workflow; P: current plan; d: deadline
Ensure: Create a plan for w that minimizes cost and meets deadline d
 1: procedure PLANWORKFLOW(w, P, d)
 2:   P′ ← copy of P
 3:   DEADLINEDISTRIBUTION(w, d)
 4:   for t in w sorted by DL(t) do
 5:     v ← VM that minimizes cost and start time of t
 6:     if FinishTime(t, v) < DL(t) then
 7:       Schedule(t, v)
 8:     else
 9:       Provision a new VM v
10:       Schedule(t, v)
11:     end if
12:   end for
13:   return P′
14: end procedure

For each task in the workflow, the least expensive slot is chosen to schedule the task so that it can be completed by its deadline. VMs are allocated in blocks of one billing cycle (one hour) regardless of the size of the task. When computing the cost of scheduling a task on a given VM, the algorithm considers idle slots in blocks that were allocated for previous tasks to be free, while slots in new blocks cost the full price of a billing cycle. For example, if a task has a runtime of 10 minutes and the price of a block is $1, then the algorithm will either schedule the task on an existing VM that has an idle slot larger than 10 minutes for a cost of $0, or it will allocate a new block on an existing VM, or provision a new VM, for a cost of $1. If the cost of slots on two different VMs is equal, then the slot with the earliest start time is chosen. To prevent too many VMs from being provisioned, the algorithm always prefers to extend the runtime of existing VMs before allocating new VMs. The result is that the algorithm will only allocate a new block if there are no idle slots on existing blocks large enough or early enough to complete the task by its deadline, and it will only allocate a new VM if it cannot add a block to an existing VM to complete the task by its deadline.

Figure 6.4: Example schedule generated by the SPSS algorithm. Each row is a different VM. Tasks are represented as boxes and are colored by workflow.

An example schedule generated by SPSS is shown in Figure 6.4. This example shows how SPSS tends to start many workflows in parallel, running each workflow over a longer period of time on only a few VMs to minimize cost. In comparison, the dynamic algorithms tend to run one workflow at a time across many VMs in parallel. A sketch of the deadline distribution step is given below.
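The following Python sketch implements the deadline distribution described above (ST(l) and DL(t)); it assumes the level and slack values from the previous sketch and uses α = 0.7 as in the experiments, but is otherwise illustrative rather than the simulator's implementation.

    def distribute_deadlines(tasks, runtime, parents, level, slack, alpha=0.7):
        """Assign a sub-deadline DL(t) to every task (tasks in topological order)."""
        n_w = len(tasks)
        r_w = sum(runtime[t] for t in tasks)
        n_l, r_l = {}, {}                       # per-level task count and runtime
        for t in tasks:
            n_l[level[t]] = n_l.get(level[t], 0) + 1
            r_l[level[t]] = r_l.get(level[t], 0.0) + runtime[t]
        # ST(l) = ST(w) * [alpha * N(l)/N(w) + (1 - alpha) * R(l)/R(w)]
        st_l = {l: slack * (alpha * n_l[l] / n_w + (1 - alpha) * r_l[l] / r_w)
                for l in n_l}
        dl = {}
        for t in tasks:
            lst = max((dl[p] for p in parents.get(t, [])), default=0.0)  # LST(t)
            dl[t] = lst + runtime[t] + st_l[level[t]]                    # DL(t)
        return dl

    # Diamond DAG from the previous sketch, with CP(w) = 9.0 and ST(w) = 3.0:
    rt = {"a": 2.0, "b": 4.0, "c": 1.0, "d": 3.0}
    par = {"b": ["a"], "c": ["a"], "d": ["b", "c"]}
    lvl = {"a": 0, "b": 1, "c": 1, "d": 2}
    print(distribute_deadlines(["a", "b", "c", "d"], rt, par, lvl, slack=3.0))
    # The sink task's sub-deadline is 12.0, i.e., CP(w) + ST(w) = the deadline.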
6.4 Evaluation

This section describes a simulation study of the relative performance of three algorithms: DPDS, WA-DPDS, and SPSS. SPDS is not evaluated because, being a simplified version of DPDS and WA-DPDS, it will always produce results worse than the other algorithms.

6.4.1 Simulator

To evaluate and compare the algorithms, we developed a cloud workflow simulator based on CloudSim [25]. Our simulation model consists of Cloud, VM, and WorkflowEngine entities. The Cloud entity is responsible for starting and terminating VM entities using an API similar to that used by Amazon EC2. VM entities simulate the execution of individual tasks, including randomized variations in runtime. The WorkflowEngine entity manages the scheduling of tasks and the provisioning of VMs according to the rules established by the different algorithms. We assume that the VMs have a single core and execute tasks sequentially. Although CloudSim provides a more advanced infrastructure model that includes time-sharing and space-sharing policies, we do not use these features since we are interested mainly in the execution of tasks on VMs and in high-level workflow scheduling and provisioning. The simulator reads workflow description files in a modified version of the DAX format used by the Pegasus Workflow Management System [45]. The modified file format was created for the synthetic workflow generator developed by Bharathi, et al. [15].

6.4.2 Workflow Ensembles

In order to evaluate the algorithms on a standard set of workflows, we created randomized ensembles using workflows available from the workflow generator gallery published online by Bharathi, et al. [190]. The gallery contains synthetic workflows modeled using structures and parameters that were taken from real applications. Ensembles were created using synthetic workflows from five real applications: SIPHT, a bioinformatics application; LIGO, a physics application; Epigenomics, another bioinformatics application; Montage, an astronomy application; and CyberShake, an earthquake science application. For each of these applications, workflows with 50, 100, 200, 300, 400, 500, 600, 700, 800, 900 and 1000 tasks were generated. For each workflow size, 20 different workflow instances were generated using parameters and task runtime distributions from profiles similar to those described in Chapter 5. The total collection of synthetic workflows contains 5 applications, 11 different workflow sizes, and 20 workflow instances, for a total of 1100 synthetic workflows.

Using this collection of workflows, we constructed five different ensemble types: constant, uniform sorted, uniform unsorted, Pareto sorted, and Pareto unsorted.

In the unsorted ensembles, workflows of different sizes are mixed together, and the priorities are assigned randomly so that there is no relationship between priority and workflow size. For many applications, however, large workflows are more important to users than small workflows because they represent more significant computations. To model this, the sorted ensembles are sorted by size, so that the largest workflows are given the highest priority.

Constant ensembles contain workflows that all have the same number of tasks. For each constant ensemble, the number of tasks is chosen randomly from the set of possible workflow sizes. Once the size is determined, N workflows of that size are chosen randomly for the ensemble from the set of synthetic workflows for the given application.

Uniform ensembles contain workflows with sizes that are uniformly distributed among the set of possible sizes. Each workflow in a uniform ensemble is selected by first randomly choosing the size of the workflow according to a uniform distribution, and then randomly choosing a workflow of that size from the set of synthetic workflows for the given application.

Pareto ensembles contain a small number of larger workflows and a large number of smaller workflows.
The sizes of the workflows in a Pareto ensemble are chosen according to a Pareto distribution. The distribution was modified so that the number of large workflows (of size 900) is increased by a small amount to produce a "heavy tail". This causes Pareto ensembles to have a slightly larger number of large workflows, which reflects behavior commonly observed in many computational workloads. An example of the distribution of workflow sizes that occurs in a Pareto ensemble is shown in Figure 6.5.

Figure 6.5: Histogram of workflow sizes in Pareto ensembles. Workflow size is measured in number of tasks.

The number of workflows in an ensemble depends on the particular application, but we assume that ensemble sizes are on the order of between 10 and 100 workflows. This range is motivated by several reasons. First, such sizes are typical of the real applications we have examined; for example, the number of geographical sites of interest to the users of the CyberShake application in the past has been on the order of 100. Second, smaller ensembles consisting of just a few workflows can be aggregated into a single workflow, so there is no need to treat them as an ensemble. Similarly, when the number of workflows grows, and each workflow has a large number of tasks, either the deadline and budget constraints are low enough to prevent many of the workflows from running, or the problem of efficiently allocating them to the resources becomes similar to a bag-of-tasks problem, which is easier to solve efficiently.

6.4.3 Performance Metric

In order to assess the relative performance of our algorithms it is necessary to define a metric that can be used to score the performance of the different algorithms on a given problem (ensemble, budget, and deadline). The simplest approach is to count the number of workflows in the ensemble that each algorithm is able to complete within the budget before the deadline. The problem with this approach is that it does not account for the priority-based utility function specified by the user: using the counting approach, a less efficient algorithm may be able to complete a large number of low-priority workflows by executing the smallest workflows first. In order to account for the user's priorities, we used an exponential scoring approach where the score for an ensemble e is:

    Score(e) = Σ over w ∈ Completed(e) of 2^(−Priority(w))

where Completed(e) is the set of workflows in ensemble e that were completed by the algorithm, and Priority(w) is the priority of workflow w, such that the highest-priority workflow has Priority(w) = 0, the next highest workflow has Priority(w) = 1, and so on. This exponential scoring function gives the highest-priority workflow a score that is higher than that of all the lower-priority workflows combined:

    2^(−p) > Σ over i = p+1, p+2, ... of 2^(−i)

This scoring is consistent with our definition of the problem, which is to complete as many workflows as possible, according to their priorities, given a set budget and deadline.
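A short Python illustration of this scoring function (with a hypothetical helper name) makes the "highest priority beats all lower priorities combined" property easy to check:

    def ensemble_score(completed_priorities):
        """Exponential score: a completed workflow of priority p contributes
        2**(-p), where priority 0 is the most important workflow."""
        return sum(2.0 ** (-p) for p in completed_priorities)

    # Completing only the highest-priority workflow scores higher than
    # completing every other workflow in a 50-workflow ensemble:
    print(ensemble_score([0]) > ensemble_score(range(1, 50)))  # -> True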
6.4.4 Experimental Parameters

For each application, we selected ranges of constraints (deadline and budget) to cover a broad parameter space: from tight constraints, where only a small number of workflows can be completed, to more liberal constraints where all, or almost all, of the workflows can be completed. These parameters vary among the applications and among the ensemble types and sizes. The budget constraints for each ensemble are calculated by identifying the smallest budget required to execute one of the workflows in the ensemble (MinBudget), and the smallest budget required to execute all workflows in the ensemble (MaxBudget):

    MinBudget = min over w ∈ e of Cost(w)
    MaxBudget = Σ over w ∈ e of Cost(w)

This range, [MinBudget, MaxBudget], is then divided into equal intervals to determine the budgets to use in each experiment. Similarly, the deadline constraints for each ensemble are calculated by identifying the smallest amount of time required to execute a single workflow in the ensemble (MinDeadline), which is the length of the critical path of the workflow with the shortest critical path, and the smallest amount of time required to execute all workflows (MaxDeadline), which is the sum of the critical paths of all the workflows:

    MinDeadline = min over w ∈ e of CriticalPath(w)
    MaxDeadline = Σ over w ∈ e of CriticalPath(w)

This range, [MinDeadline, MaxDeadline], is then divided into equal intervals to determine the deadlines to use in each experiment. By computing the budget and deadline constraints in this way we ensure that the experiments for each ensemble cover the most interesting area of the parameter space for the ensemble.

In all the experiments we assumed that the VMs have a price of $1 per VM-hour. This price was chosen to simplify interpretation of the results and should not affect the relative performance of the different algorithms. In this study we do not take into account the heterogeneity of the infrastructure, since we assume that it is always possible to select a VM type that has the best price to performance ratio for a given application.

All the experiments were run with the maximum autoscaling factor (v_max) set to 1.0 for DPDS and WA-DPDS. After experimenting with DPDS and WA-DPDS we found that, due to the high parallelism of the workflows used, the resource utilization remains high enough without adjusting the autoscaling rate. Based on experiments with the target applications, we set the SPSS deadline distribution parameter α to 0.7, which allocates slightly more time to levels with many tasks.
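The following Python sketch shows one way to generate the budget and deadline values used in each experiment; whether the range endpoints themselves are included is an assumption, as the text only states that each range is divided into equal intervals.

    def constraint_ranges(costs, critical_paths, n_points=10):
        """Budgets and deadlines spanning [min, sum] of per-workflow values."""
        def spread(lo, hi, n):
            step = (hi - lo) / (n - 1)
            return [lo + i * step for i in range(n)]
        budgets = spread(min(costs), sum(costs), n_points)    # [MinBudget, MaxBudget]
        deadlines = spread(min(critical_paths), sum(critical_paths), n_points)
        return budgets, deadlines

    # Toy ensemble of three workflows:
    b, d = constraint_ranges([5.0, 12.0, 20.0], [2.0, 3.0, 6.0], n_points=5)
    print(b)  # [5.0, 13.0, 21.0, 29.0, 37.0]
    print(d)  # [2.0, 4.25, 6.5, 8.75, 11.0]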
6.4.5 Relative Performance of Algorithms

The goal of the first experiment is to characterize the relative performance of the DPDS, WA-DPDS, and SPSS algorithms. This was done by simulating the algorithms on many different ensembles and comparing the scores computed using the technique described in Section 6.4.3. The algorithm that achieves the largest number of high scores is judged to have the best performance.

Figure 6.6 shows the percentage of simulations for which each algorithm achieved the highest score for a given ensemble type. This experiment was conducted using all five applications, with all five types of ensembles. For each application, ensembles of 50 workflows were created using 10 different random seeds. Each ensemble was simulated with all three algorithms using 10 budgets and 10 deadlines, so the total number of simulations behind each bar in the figure is 10 × 10 × 10 = 1000. The best-scores percentage is computed by counting the number of times that a given algorithm achieved the highest score and dividing by 1000. Note that it is possible for multiple algorithms to get the same high score (to tie), so the numbers do not necessarily add up to 100%. The sum is much higher than 100% in cases where the dynamic algorithms perform relatively well, because DPDS and WA-DPDS, which are very similar algorithms, often get the same high score.

There are several interesting things to notice about Figure 6.6. The first is that, in most cases, SPSS significantly outperforms the dynamic algorithms (DPDS and WA-DPDS). This is attributed to the fact that SPSS is able to make more intelligent scheduling and provisioning decisions because it has the opportunity to compare different decisions and choose the one that results in the best outcome. In comparison, the dynamic algorithms are online algorithms and are not able to project into the future to weigh the outcomes of their choices.

Figure 6.6: Percentage of high scores achieved by each algorithm on different ensemble types for all five applications. C = Constant ensembles, PS = Pareto Sorted ensembles, PU = Pareto Unsorted ensembles, US = Uniform Sorted ensembles, and UU = Uniform Unsorted ensembles.

The second thing to notice about the figure is that, for constant ensembles, the dynamic algorithms perform significantly better relative to SPSS than for other ensemble types. This is a result of the fact that, since all of the workflows are of approximately the same size and shape, the choice of which workflow to execute next is arbitrary.

Another thing to notice about these results is that the workflow-aware algorithms (WA-DPDS and SPSS) both perform better in most cases than the simple online algorithm that uses resource utilization alone to make provisioning decisions. This suggests that there is significant value in having information about the structure and estimated runtime of a workflow when making scheduling and provisioning decisions.

Finally, it is interesting to see that, for Montage and CyberShake, the relative performance of SPSS compared to the dynamic algorithms is significantly lower than it is for other applications. We attribute this to the structure of Montage and CyberShake. The workflows for both applications are very wide relative to their height, and both have very short-running tasks. As a result of these two characteristics, the Montage and CyberShake applications have very short critical paths and look more like bag-of-tasks applications, which are easier to execute than more structured applications. DPDS and WA-DPDS are able to pack more of the tasks into the available budget and deadline because there are a) more choices for where to place the tasks, and b) the different choices have a smaller impact on the algorithms' ability to execute the workflow within the constraints. In addition, the short critical paths put SPSS at a disadvantage: because of the way SPSS assigns deadlines to individual tasks, it is prevented from starting workflows late, which prevents it from packing tasks into the idle VM slots at the end of the schedule.

6.4.6 Task Granularity

In the last experiment we noted that, for Montage and CyberShake, the short runtimes of their tasks made it possible for the dynamic algorithms to perform better relative to SPSS when compared with other applications. In order to test this theory, we adjusted the granularity of the tasks in several Montage and CyberShake ensembles to see how this would affect the relative performance of the algorithms.
The granularity adjustment was achieved by multiplying the runtime of each task by a fixed scaling factor. Figure 6.7 shows the relative performance of the algorithms as the scaling factor is increased from 1 to 16. Each data point represents the result of 500 simulations (5 random seeds for 10 budgets and 10 deadlines). The best-scores percentage was calculated in the same way as for Figure 6.6.

The figure clearly shows that, as the scaling factor increases, the relative performance of SPSS increases as well. In all cases, the performance of SPSS is superior to the dynamic algorithms for scaling factors greater than about 8. This suggests that, in general, the dynamic algorithms will produce better scores for fine-grained workflows, and SPSS will produce better scores for coarse-grained workflows.

Figure 6.7: Percentage of high scores achieved by each algorithm for Montage and CyberShake ensembles when task runtime is stretched: (a) CyberShake; (b) Montage. A stretching factor of x means that the runtime of tasks in each workflow was multiplied by x.

6.4.7 Inaccurate Task Runtime Estimates

Both of the workflow-aware algorithms rely on estimates of task runtimes to make better scheduling and provisioning decisions. However, these estimates can be inaccurate. It is typically not possible to predict exactly how long a task will take to execute on a given machine at a given time: the actual runtime depends on many factors that are beyond the control of the application, such as network or disk contention, background processing, etc.

Given that task runtime estimates are bound to be inaccurate, the question is: how does the error in the task runtime estimates impact the performance of our scheduling and provisioning algorithms? To test this, we created an experiment to measure how introducing uniform errors in the runtime estimates affects the ability of the algorithms to meet the budget and deadline constraints. This experiment used a single application, Montage, and a single ensemble type, uniform unsorted. This application and ensemble type were chosen arbitrarily for simplicity; the application and ensemble type are expected to have little impact on how well the algorithms are able to stay within the constraints.

The estimation errors are generated by randomly sampling a uniform distribution between −p% and +p% of the runtime of a task. For example, if the value of p is 10%, then the actual runtime of a task, r_a, is set in the simulation to r_a = r_e · (1.0 + q), where r_e is the estimated runtime of the task and q is a uniform random number between −0.1 and +0.1. Since the sampling is done uniformly, we expect to get just as many overestimates as underestimates in any given simulation.
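A minimal Python sketch of this error model follows; the function name is illustrative and the random generator is the standard library's, not necessarily the one used by the simulator.

    import random

    def actual_runtime(estimate, p, rng=random):
        """Perturb a runtime estimate by a uniform error of +/- p percent:
        r_a = r_e * (1 + q), with q drawn uniformly from [-p/100, +p/100]."""
        q = rng.uniform(-p / 100.0, p / 100.0)
        return estimate * (1.0 + q)

    # A 100-second estimate with p = 10 yields a value in [90, 110].
    random.seed(42)
    print(actual_runtime(100.0, p=10))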
Note that the algorithms have not been changed to account for inaccurate runtime estimates. It is likely that better performance could be achieved if the algorithms were given some hint as to the accuracy of the task runtimes, but investigating that optimization is left for future work. The goal here is to determine how well the current algorithms are able to stay within the constraints given inaccurate task runtime estimates.

Figure 6.8 shows the results of simulations for errors ranging from ±0% to ±50%. These simulations were done for ensembles of 50 workflows, using 10 random seeds, 10 budgets, and 10 deadlines for each error value and algorithm (3000 simulations per error value). Each plot in the figure shows box plots of the ratio of actual cost to budget, or of actual makespan to deadline. The actual cost and makespan are the real cost and makespan calculated after the simulation completed. The ratio of actual values to constraints indicates whether the actual value exceeded the constraint: values equal to 1 mean that the actual value was equal to the constraint, values less than 1 indicate that the actual value was within the constraint, and values greater than 1 mean that the actual value exceeded the constraint.

Figure 6.8.a shows the ratio of actual cost to budget. These results show two important characteristics of algorithm behavior. First, both dynamic algorithms very rarely exceeded the budget, even with very large errors. This indicates that the dynamic algorithms are easily able to adapt at runtime to ensure good results without exceeding the constraints, even when the quality of the information available to them is low. Second, unlike the dynamic algorithms, the static algorithm frequently exceeded the budget constraint when the quality of the runtime estimates was low. For small errors the budget was exceeded infrequently and by small amounts, while for large errors (more than 10%) the budget was exceeded frequently and by large amounts. This is a result of the fact that the static algorithm makes all of its decisions before execution and is not able to adapt to changing circumstances. The SPSS plan describes which tasks to execute on which VMs, but not when to execute them. At runtime, the workflow engine is forced to spend extra money to extend the runtime of some VMs so that all of the tasks assigned to those VMs can be completed. At the same time, other VMs can be terminated early because the tasks they were assigned finished earlier than expected. However, because of dependencies in the workflow, the latter case is less likely to happen, which causes gaps in the schedule, so the net result is that the overall cost is increased.

Figure 6.8.b shows the ratio of actual makespan to deadline. The interesting thing to notice about this figure is that, for all three algorithms, the deadline constraint is rarely exceeded, even in cases with very poor quality estimates. For the dynamic algorithms this is because they can adapt at runtime to changing circumstances to avoid exceeding the deadline. For the static algorithm it is a result of the way that SPSS schedules workflows, and not of a particularly clever optimization. The SPSS algorithm tends to schedule workflows early, using up the budget long before the deadline is reached. This is a consequence of the deadline distribution function in SPSS, which prevents workflows from starting late.
This is shown by the Gantt chart in Figure 6.4, which illustrates how SPSS tends to pile up workflows at the beginning of the timeline. In comparison, the dynamic algorithms tend to spread workflow starts over the entire duration, as shown in Figure 6.3. The consequence of this behavior is that SPSS plans tend to use up the budget but leave plenty of time before the deadline. So when the runtime of the plan is increased by introducing errors, the SPSS plan has some room to expand without exceeding the deadline.

Figure 6.9 shows the relative performance, as the percentage of high scores achieved by each algorithm, as the error increases. The figure shows that, for this experiment, the scores remain the same as the error increases. This is a result of the way the error is applied. Since the error is selected uniformly from −p% to +p%, it is equally likely that the error will be positive as negative; in other words, it is equally likely that a task will take longer as it is that the task will be shorter. Because the application used in this experiment, Montage, has a low task runtime variance, applying the error leaves the total runtime of the tasks in the workflow essentially unchanged: the total number of CPU hours in each workflow remains the same regardless of the error. The end result is that the dynamic algorithms are able to achieve the same score without exceeding the constraints by using a different schedule that finishes the same total amount of work within the same budget and deadline. If we had applied a skewed error, for example an error that was more likely to be positive than negative, then the scores for the dynamic algorithms would decrease relative to the static algorithm as the error increased.

The behavior of these algorithms indicates that, when the quality of runtime estimates is low, the dynamic algorithms do a much better job of adapting to the changing circumstances to meet their obligations. In comparison, the static algorithm performs better with good information, but fails to meet the constraints when the quality of information is poor. It is possible that the static algorithm could be modified to add more breathing room to the schedule to account for situations where the runtime estimates may be inaccurate. Such a modification would result in slightly worse scores when the estimates are good, but would improve scores when the estimates are bad. Alternatively, the SPSS algorithm could be modified to generate a plan that specifies when to provision and deprovision each VM. In that case the workflow engine could be prevented from exceeding the budget and deadline constraints, but some workflows would be prevented from finishing, which may significantly decrease the score. These two modifications are subjects for future work.

6.4.8 Provisioning Delays

One important issue to consider when provisioning resources in the cloud is the amount of time between when a resource is requested and when it becomes available to the application. Typically these provisioning delays are on the order of a few minutes, and are highly dependent upon the architecture of the cloud system and/or the size of the user's VM image [130]. Recall that it is assumed that resources are billed from the minute that they are requested until they are terminated, and not from the moment that they become available. As a result, provisioning delays have an impact on both the cost (budget constraint) and the makespan (deadline constraint) of the ensemble.
Figure 6.10 shows the ratios of actual values to constraints when the provisioning delay is increased from 0 seconds up to 15 minutes. The simulations were done using uniform unsorted ensembles of 50 workflows from the Montage application. Each plot in the figure contains three boxes, one for each algorithm. Each box summarizes the results of 10 budgets, 10 deadlines, and 10 random seeds, for a total of 1000 simulations per algorithm and provisioning delay (3000 simulations per plot). As in Figure 6.8, the y-axis in each plot represents the ratio of the actual simulated value to the constraint value for the budget constraint (actual cost) or the deadline constraint (actual makespan). These ratios indicate how frequently, and by how much, the actual simulated values exceeded (or did not exceed) the constraints.

Figure 6.8: Boxplots for budget and deadline ratios when runtime estimate error varies from ±0% to ±50% for all three algorithms: (a) ratio of actual cost to budget; (b) ratio of actual makespan to deadline. A value of 1 indicates that the actual cost is equal to the budget, or the actual makespan is equal to the deadline, respectively. Values greater than 1 indicate that the budget/deadline has been exceeded.

Figure 6.9: Percentage of high scores achieved by each algorithm on Montage with uniform unsorted ensembles when runtime estimate error varies from ±0% to ±50%.

The effect of provisioning delays on workflow performance is similar to that of inaccurate runtime estimates: when the delays are small, all algorithms are able to produce results within the constraints, but when the delays are larger, the dynamic algorithms are able to adapt to avoid exceeding the constraints while the static algorithm is not. The explanation for this behavior is also similar. The difference is that a very small error in runtime estimates has little impact on the budget, while a very small provisioning delay has a large impact on the budget. Figure 6.10.a shows that even for an impossibly short provisioning delay of 30 seconds (usually it takes more than 30 seconds just to boot a VM), the SPSS algorithm exceeds the budget by as much as a factor of 2, and has a distribution with an upper quartile of 1.5 times the budget. For delays of more than 5 minutes, which are more typical of what has been observed on academic infrastructure clouds [88] such as Magellan [124] and FutureGrid [60], fully three quarters of the simulations exceeded the budget.
With the exception of a few outliers for the 10- and 15-minute delays, the dynamic algorithms consistently met the budget and deadline constraints.

Figure 6.10: Boxplots for budget and deadline ratios when provisioning delay varies from 0 seconds to 15 minutes for all three algorithms: (a) ratio of actual cost to budget; (b) ratio of actual makespan to deadline. A value of 1 indicates that the actual cost is equal to the budget, or the actual makespan is equal to the deadline, respectively. Values greater than 1 indicate that the budget/deadline has been exceeded.

Figure 6.11 shows the relative performance, as the percentage of high scores achieved by each algorithm, as the provisioning delay increases. The figure shows that, as the delay increases, the relative performance of the static algorithm increases as well. This behavior is a result of the fact that the static algorithm is, in essence, cheating by using more time and money than the dynamic algorithms. In the previous section we showed how the relative performance of the algorithms remains the same when a uniform error is applied to the runtime estimates. In that case, the dynamic algorithms adapted to the error by rearranging the schedule to accomplish the same amount of work given the same deadline and budget. In this case, the dynamic algorithms adapt by performing less work (executing fewer workflows) to remain within the constraints while accounting for the delays. In both cases, the static algorithm completed the same amount of work, but did so by exceeding the constraints.
Figure 6.11: Percentage of high scores achieved by each algorithm on Montage with uniform unsorted ensembles when provisioning delay varies from 0 seconds to 15 minutes.

This behavior suggests that SPSS is far too sensitive to provisioning delays in its current form to be of practical use in real systems. It is likely, as was the case for inaccurate runtime estimates, that modifying the SPSS algorithm would improve its performance under provisioning delays. In fact, since provisioning operations are infrequent (because all the algorithms tend to provision resources for a long time), it is likely that the performance of SPSS could be improved significantly by simply adding an estimate of the provisioning delay to its scheduling function. Such an estimate may not have to be particularly accurate to get good results, and developing an estimate from historical data should be relatively simple. Testing this idea is left for future work.

6.4.9 SPSS Planning Time

Unlike the dynamic algorithms, which execute their scheduling and provisioning logic at runtime, the SPSS algorithm plans out its provisioning and scheduling decisions ahead of workflow execution. Because SPSS involves more complicated logic than the dynamic algorithms, it is important to understand what impact this has on execution time.

Figure 6.12 shows the SPSS planning time for three different ensemble sizes (50, 100, 200) and five different workflow sizes (50, 200, 400, 600, 800). The ensembles were generated using a constant distribution equal to the desired workflow size. Two different applications were used: SIPHT and CyberShake. Each box summarizes the results of 2 applications, 10 random seeds, 10 budgets, and 10 deadlines, or 2000 simulations per box. For ensemble size 200, the results for workflow size 800 are omitted because the simulations took too long to get a representative sample.

Figure 6.12 shows that, for small ensembles and small workflows, the SPSS planning time is reasonable, taking on the order of tens of seconds to a few minutes. For larger ensembles of large workflows, however, the SPSS planning time can easily reach 10-15 minutes. Considering that the largest workflows used in this experiment are rather small (a maximum of 800 tasks), and that real workflows are often much larger (workflows with tens of thousands of tasks are common, and workflows with hundreds of thousands or millions of tasks are rare but do exist), it is unlikely that SPSS will be practical for ensembles of very large workflows.

Figure 6.12: Planning time of SPSS algorithm for different ensemble and workflow sizes.

The algorithmic complexity of SPSS is O(n^2), where n is the number of tasks in the ensemble. This derives from the fact that, for each task, SPSS considers scheduling the task on the cheapest available slot, which involves scanning all of the available slots on all the VMs. Since the number of available slots is, in the worst case, proportional to the number of tasks scheduled
(because scheduling a task splits an existing slot into at most two slots), the complexity of SPSS is O(n^2). In comparison, the dynamic algorithms all have a more scalable complexity of O(n): DPDS examines each task in the workflow only once, when it is scheduled, and WA-DPDS examines each task twice, once in the admission algorithm and once when it is scheduled. This makes the dynamic algorithms a better fit for larger workflows and ensembles, even though they may not produce results that are as good as those of SPSS.

It may be possible to optimize SPSS to reduce its runtime by, for example, clustering the workflow to increase task granularity, which would decrease the ratio of planning time to ensemble makespan. It may also be possible to reduce the complexity of SPSS by employing more sophisticated data structures to store the available slots. Investigating these topics is left for future work.

6.5 Summary

This chapter addressed the interesting and important new problem of scheduling and resource provisioning for scientific workflow ensembles on IaaS clouds. The problem is different from previous work on grid and utility grid scheduling in that cloud infrastructures provide more control over the resources, so the resource provisioning plan can be adjusted according to the requirements of the application. The problem space therefore becomes larger: it requires not only an efficient mapping of tasks to the available resources, but also the selection of the best resource provisioning plan. Formulating the problem as a maximization of the number of prioritized workflows completed from the ensemble is also novel, and requires workflows to be admitted or rejected based on their estimated resource demands. We believe that this bi-constrained problem is highly relevant because such constraints are commonly imposed on many real-world projects. The approach is also directly applicable to grid environments that provide resource reservations and charge service units for resource use.

The DPDS, WA-DPDS and SPSS algorithms were developed based on both static and dynamic scheduling approaches. The algorithms were evaluated on ensembles of synthetic workflows, which were generated based on real scientific applications. For the purposes of evaluation we developed a simulator that models the cloud infrastructure and a workflow engine with tightly coupled scheduler and provisioner modules. The results of our simulation studies indicate that the two algorithms that take into account information about the structure of the workflow and task runtime estimates (WA-DPDS and SPSS) yield better results than the simple priority-based scheduling strategy (DPDS), which makes provisioning decisions purely based on resource utilization.

In cases where there are no provisioning delays and task runtime estimates are good, we found that the static algorithm performs significantly better than the dynamic algorithms. However, when runtime estimates are poor, or delays are present, the plans produced by the static algorithm are disrupted and it frequently exceeds the budget and deadline constraints. In comparison, the dynamic algorithms are able to quickly adapt to even very large delays and poor estimates to prevent the ensemble from exceeding the constraints. In addition, we found that SPSS tends to perform better on coarse-grained workflows than the dynamic algorithms.
For wide, fine-grained workflows, such as CyberShake and Montage, however, the dynamic algorithms frequently produce performance that is as good as or better than that of SPSS, because they are better able to pack the smaller tasks onto idle VMs close to the deadline than SPSS, which distributes deadlines to tasks in a way that prevents them from starting late.

For very large workflows and ensembles, we found that the planning time of the SPSS algorithm is prohibitive because the algorithmic complexity of SPSS is O(n²). Our simulation results showed that ensembles of 200 workflows with 800 tasks each took 10 to 15 minutes to plan. This suggests that ensembles of workflows with tens of thousands of tasks, which are commonly encountered in real workflow applications, would take many hours to plan. In comparison, the O(n) complexity of the dynamic algorithms makes them much more attractive for large-scale problems.

This study suggests many areas for future work. It would be interesting to extend the application and infrastructure model to explicitly include the various data storage options available on clouds (currently data access is modeled as part of the task execution time and data costs are not considered). The experimental study described in Chapter 4 suggests that the data demands of scientific workflows have a large impact not only on the execution time, but also on the cost of workflows in commercial clouds. Finally, it would be interesting to investigate heterogeneous environments that include multiple VM types and cloud providers, including private and community clouds, which will make the problem even more complex and challenging [114, 181, 90].

The results of this study can also be applied to develop tools that assist researchers in planning their large-scale computational experiments. The estimates of cost, runtime, and number of workflows completed that can be obtained from both the static algorithm and from the simulation runs constitute valuable hints for planning ensembles and evaluating the associated trade-offs.

Chapter 7
Conclusion

The focus of this thesis has been on the development of tools and techniques related to resource management for large-scale scientific workflows. In this chapter we summarize the main findings and contributions of our work and outline topics for future research. A list of publications derived from the research presented in this thesis is given in Appendix 3.

7.1 Findings and Contributions

In Chapter 2 we described some of the challenges of executing scientific workflows on batch-scheduled platforms such as grids and clusters. On these systems, loosely-coupled applications such as workflows suffer from long queue delays, large scheduling overheads, and restrictive scheduling policies that penalize fine-grained applications. We described how the techniques that have been used to address these issues in the past, such as task clustering and advance reservations, introduce their own set of problems, and how new approaches based on resource provisioning can help. We developed a new system, Corral, that enables resource provisioning on grids using pilot jobs. We showed how this system is able to reduce overheads in batch systems and improve the performance of real workflow applications. In our experiments with the Montage application, the runtime of a 6-degree workflow was reduced by 45% on average when using Corral compared to Globus GRAM. In addition, we showed how pilot-job provisioning techniques could be used to bypass restrictive scheduling policies to improve throughput for workflows in the grid. This system is currently being used by researchers to enable the execution of scientifically meaningful workflows. For example, the Southern California Earthquake Center has used Corral to provision resources for its CyberShake workflows.
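As a rough illustration of the pilot-job pattern that Corral builds on (this is a generic sketch, not Corral's interface or Condor's glidein code; the queue and task objects are invented for the example), the pilot is an ordinary batch job that, once running, pulls workflow tasks from an application-level queue so that individual tasks never wait in the site's batch queue:

# Generic pilot-job pattern (illustrative only; not Corral's actual API).
# The pilot is submitted to the site's batch system like any other job; once
# it starts, it pulls tasks from an application-level queue until the work
# runs out or its allocation is about to expire.

import time
from collections import deque

class ToyTaskQueue:
    """Stand-in for the application-level task server."""
    def __init__(self, commands):
        self._tasks = deque(commands)
    def try_pull(self):
        return self._tasks.popleft() if self._tasks else None
    def report_done(self, task):
        print("finished:", task)

def run_pilot(queue, walltime_limit, idle_timeout=5):
    start = time.time()
    idle_since = None
    while time.time() - start < walltime_limit:
        task = queue.try_pull()
        if task is None:
            idle_since = idle_since or time.time()
            if time.time() - idle_since > idle_timeout:
                break                    # no work left: release the node early
            time.sleep(1)
            continue
        idle_since = None
        print("running:", task)          # a real pilot would fork/exec the task here
        queue.report_done(task)

run_pilot(ToyTaskQueue(["task_001", "task_002"]), walltime_limit=3600)

Because a single pilot can run many short tasks back to back, the batch system's per-job queue delay and scheduling overhead are paid once per pilot rather than once per task, which is the effect measured in the Montage experiments described above.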
Chapter 3 described how the development of cloud computing is providing new opportunities and challenges for the execution of workflow applications. We explained how the unique features of infrastructure clouds, such as on-demand provisioning, virtualization, and elasticity, make them an ideal platform for deploying workflows. We also identified the need for new systems and techniques for automating the deployment of complex, distributed applications such as workflows on infrastructure clouds. In order to address this need, we developed a new system, Wrangler, that automatically provisions and configures cloud resources based on a declarative description of the requirements and dependencies of an application. The system interfaces with several different cloud providers to provision virtual machines, coordinates the configuration and initiation of software services, and monitors those services over time to detect failures. This system has been used to provision virtual clusters to execute hundreds of workflows on both commercial and academic clouds. In addition, the system can be used to support systems research by enabling researchers to easily deploy and evaluate various configurations on top of infrastructure clouds.

Chapter 4 presented the results of our studies into the cost and performance of workflows in the cloud. These studies used several real workflow applications to evaluate the feasibility of executing large-scale workflows on commercial clouds. The first study evaluated the cost and performance of different virtual machine resources on Amazon EC2 and compared them with typical grid resources. This study concluded that:

1. Virtualization decreased the performance of CPU-bound workflows, but did not have a significant impact on I/O-bound workflows. For the CPU-intensive application we tested (Epigenomics), we found the virtualization overhead to be around 10%.

2. The performance of comparable grid and cloud resources is similar, but grid resources have a significant advantage in terms of I/O performance. In our experiments, resources from a TeraGrid cluster (Abe) were found to have similar performance to the Amazon EC2 c1.xlarge virtual machine type, but for I/O-intensive applications the Lustre parallel file system gave Abe a significant performance advantage over EC2.

3. The different virtual machine types available provide a wide range of cost/performance tradeoffs that must be considered by application developers. For the applications tested, we found that there was no resource type that provided the best cost/performance ratio for any of the applications.

4. The cost of data transfer can be a significant fraction of the total cost of executing data-intensive workflows in the cloud. For the Montage application, the cost of transferring output data was found to be greater than the cost of executing the computations, as the back-of-the-envelope sketch below illustrates.
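The following sketch makes the comparison in point 4 concrete. All of the prices and data volumes are hypothetical placeholders chosen for the example, not the values measured in Chapter 4:

# Back-of-the-envelope comparison of compute cost versus data-transfer cost
# for a data-intensive workflow. All numbers here are hypothetical
# placeholders, not the measurements reported in Chapter 4.

instance_price_per_hour = 0.68   # assumed on-demand rate for one VM
hours_used = 8                   # assumed total VM hours for the workflow
output_data_gb = 100             # assumed output volume transferred out
transfer_price_per_gb = 0.12     # assumed egress rate

compute_cost = instance_price_per_hour * hours_used      # $5.44
transfer_cost = output_data_gb * transfer_price_per_gb   # $12.00

print(f"compute cost:  ${compute_cost:.2f}")
print(f"transfer cost: ${transfer_cost:.2f}")
# With numbers of this order, moving the results out of the cloud costs more
# than producing them, which is the effect observed for Montage.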
The second study evaluated the performance of different storage systems that could be deployed in the cloud for storing workflow data. This study found that:

1. The choice of storage system has a large impact on cost and performance for I/O-intensive applications. There was a large gap in both cost and performance between the best-performing storage system and the worst-performing storage system. For CPU-bound applications, however, the choice of storage system does not have a significant impact on cost and performance.

2. Storage systems that were optimized for handling lots of small files performed significantly better than storage systems designed for parallel I/O. GlusterFS was found to perform well on most applications, but Amazon S3, which has a large per-file overhead, performed poorly in most cases.

3. It is rarely possible to decrease the cost of running workflows in the cloud by deploying them in parallel because the speedup achieved does not offset the increased cost.

The last two studies analyzed the cost of deploying two different astronomy workflows on Amazon EC2. These studies concluded that:

1. For long-term usage, the cost of an Amazon EC2-based solution is more than twice that of a locally-hosted solution. This is primarily a result of the large cost of storing data for long periods of time in Amazon S3.

2. For short-term usage, the EC2 cost is justifiable for non-data-intensive applications, but only when computations are small or grid resources are not available. It was found to be considerably less expensive to temporarily lease capacity from Amazon than to acquire new hardware, but the availability of grid resources made expenditures of more than a few hundred dollars undesirable.

In Chapter 5 we discussed the need for tools and techniques for profiling workflows. One of the primary challenges of resource provisioning is knowing how many of what kind of resources to provision for how long. In order to develop an effective resource provisioning strategy it is critical to know what resources (I/O, memory, and CPU) the workflow requires. We developed a set of tools, wfprof, that supports the prediction of workflow resource requirements by enabling the collection and analysis of fine-grained traces of resource usage. These tools use low-level system tracing functions to collect logs of workflow task runtime, I/O, and CPU utilization. We showed how these logs can be collected and analyzed to create workflow profiles. These profiles provide detailed insight into the behavior of workflows and enable analyses that were not previously possible. We provided detailed profiles for six different workflow applications taken from a wide variety of scientific domains. Our characterization of these workflows identified executables with potential bugs, opportunities for optimization, and many interesting statistics that were not previously known. For example, in the Broadband application we found that the ucsb seisgen code allocates far more memory than it uses, which suggests that it may contain a memory allocation bug. We also found that ucsb seisgen tasks were reading far more data than the total size of their input files, which suggests that data is being accessed inefficiently. In the CyberShake application we identified a similar discrepancy between the size of input data and the total amount of data read by the SeismogramSynthesis code. We also discovered that SeismogramSynthesis requires up to 1,800 MB of memory for some inputs, which was not previously known and has a significant impact on the way SeismogramSynthesis tasks can be scheduled.
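To give a sense of the raw data that this kind of low-level tracing produces, the short sketch below totals the bytes read and written by a traced task from an strace-style system-call log. It is a hypothetical post-processing script written for this discussion, not part of wfprof:

# Hypothetical post-processing sketch (not part of wfprof): total the bytes
# read and written by a task from an strace-style log produced with
# something like: strace -f -e trace=read,write -o task.trace ./task
#
# strace prints one line per call, ending with "= <return value>", and the
# return value of read()/write() is the number of bytes transferred.

import re
import sys

CALL = re.compile(r'\b(read|write)\(.*\)\s*=\s*(\d+)\s*$')

def summarize(trace_path):
    totals = {"read": 0, "write": 0}
    with open(trace_path) as trace:
        for line in trace:
            match = CALL.search(line)
            if match:
                totals[match.group(1)] += int(match.group(2))
    return totals

if __name__ == "__main__":
    totals = summarize(sys.argv[1])
    print(f"bytes read:    {totals['read']}")
    print(f"bytes written: {totals['write']}")

Aggregating totals like these per executable, together with runtime and peak memory, is what makes profile-level observations such as the ucsb seisgen and SeismogramSynthesis findings above possible.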
In addition to being useful for identifying problems and opportunities for optimization in workflow codes, the data collected by the profiling tools we developed can be used for many other purposes, including estimating the resource requirements of future workflows, or generating realistic synthetic workflows for use in simulation studies to test new workflow systems and new workflow scheduling and provisioning algorithms.

In Chapter 6 we defined the new problem of scheduling and provisioning for workflow ensembles under budget and deadline constraints. The goal of this research is to help workflow developers maximize the amount of useful computation they can complete given a fixed amount of time and money. To solve this problem, we developed several new scheduling and provisioning algorithms: one static algorithm, SPSS, which plans out provisioning and scheduling decisions ahead of workflow execution, and two dynamic algorithms, DPDS and WA-DPDS, which are both online algorithms that make provisioning and scheduling decisions at runtime. Both WA-DPDS and SPSS make use of knowledge about workflow structure and task runtime estimates to make more informed decisions. We evaluated these algorithms using synthetic workflows that were generated based on real scientific applications. The synthetic workflows were grouped into ensembles with a variety of workflow size distributions. To conduct the evaluation we developed a simulator that models an infrastructure cloud, virtual machines, and a workflow management system. The results of our simulation studies indicate that:

1. The two algorithms that take into account information about the structure of the workflow and task runtime estimates (WA-DPDS and SPSS) yield better results than the simple online priority-based scheduling strategy (DPDS).

2. The dynamic algorithms (DPDS and WA-DPDS) perform better on fine-grained workflows relative to SPSS, but when the granularity of those same workflows is artificially increased by a factor of more than 8, the SPSS algorithm performs better.

3. The static algorithm is not able to honor the budget and deadline constraints when either the runtime estimates of workflow tasks are uncertain or there are provisioning delays, but both dynamic algorithms are able to stay within the constraints when runtime estimates are poor and provisioning delays are large. In our experiments, SPSS exceeded the budget for more than 50% of ensembles when the runtime error was more than 5%, and it exceeded the budget for more than 75% of ensembles when there was a provisioning delay of more than 30 seconds.

4. The planning time required for the SPSS algorithm may be prohibitive for ensembles of very large workflows. In our experiments, the ensembles with 200 workflows and 800 tasks took more than 18 minutes to plan. Extending this to larger workflows that contain between 10,000 and 1 million tasks suggests that SPSS may not be feasible for very large workflows and ensembles.

This line of research opens up a number of interesting avenues for further investigation, such as combining static and dynamic techniques, adding additional optimizations to account for uncertainties, and identifying ways to apply the algorithms in real systems.

7.2 Future Work

Resource provisioning in the grid has advanced significantly in recent years due to the widespread use of pilot jobs.
There are currently several systems, including glideinWMS [159], PanDA [111], DIRAC [175], and others, that use pilot jobs to schedule VO workloads on grids. These systems are very effective at managing work for large communities of researchers, but they are far too complex for the average researcher to install and use. In order for a broad spectrum of users to benefit from pilot jobs, new systems will need to be developed that make it easy for individual researchers to deploy pilot jobs on the grid.

As a result of their inherent support for resource provisioning, on-demand access, virtualization, and elasticity, we have identified many advantages to using infrastructure clouds for workflows. At the same time, however, our experience has shown that workflow applications deployed in existing infrastructure clouds suffer from a myriad of frequent and complex failures. In order for these platforms to be used reliably for long-running workflows and workflow ensembles, new techniques will need to be developed to automatically respond to failures. In many cases, responding to failures will require resources to be re-provisioned, and applications and resources to be re-configured. This will require the systems used to deploy the applications to be built for autonomous operation.

In Chapter 5 we showed how workflow profiling tools can be used to collect data on the resource usage of workflow applications. This data can be used for a number of different purposes, including the development of the resource usage estimates for future workflows that are required for dynamic provisioning and scheduling. In order to develop these estimates, however, new techniques need to be developed to automatically mine workflow profiles and build a workflow performance model that can be used for estimation. There are many existing machine learning techniques, such as classification and regression trees [17], that could be applied to solve this problem in the future.

In our work on resource provisioning we investigated a relatively simple infrastructure model that does not include heterogeneous resources, task failures, or data movement. In real cloud environments, these issues may have a significant impact on the cost and performance of workflow applications. In the future, it will be important to investigate new algorithms for scheduling and provisioning that consider more complex infrastructure models.

Finally, current workflow management systems are not well positioned to support sophisticated resource provisioning techniques. In order to apply new workflow provisioning and scheduling algorithms in practice, much work needs to be done to add the tools required for provisioning to existing workflow management systems. This includes the addition of workflow profiling tools, online performance analysis and prediction techniques, support for workflow ensembles, interfaces to resource provisioning systems, and new online and offline algorithms for resource provisioning and task scheduling.

Bibliography

[1] A. Abramovici, W. E. Althouse, R. W. P. Drever, Y. Grsel, S. Kawamura, F. J. Raab, D. Shoemaker, L. Sievers, R. E. Spero, K. S. Thorne, R. E. Vogt, R. Weiss, S. E. Whitcomb, and M. E. Zucker. LIGO: The Laser Interferometer Gravitational-Wave Observatory. Science, 256(5055):325-333, Apr. 1992.

[2] S. Abrishami, M. Naghibzadeh, and D. Epema. Cost-driven scheduling of grid workflows using partial critical paths.
In Proceedings of the 11th IEEE/ACM International Conference on Grid Computing, 2010. [3] V . Adve, R. Bagrodia, J. Browne, E. Deelman, A. Dube, E. Houstis, J. Rice, R. Sakel- lariou, D. Sundaram-Stukel, P. Teller, and M. Vernon. POEMS: end-to-end performance design of large parallel adaptive computational systems. IEEE Transactions on Software Engineering, 26(11):1027–1048, 2000. [4] M. Alef and I. Gable. HEP specific benchmarks of virtual machines on multi-core CPU architectures. Journal of Physics: Conference Series, 219(5):052015, Apr. 2010. [5] Amazon.com, Inc. Amazon Web Services. http://aws.amazon.com. [6] Amazon.com, Inc. Auto Scaling. http://aws.amazon.com/autoscaling. [7] Amazon.com, Inc. Elastic Block Store (EBS). http://aws.amazon.com/ebs. [8] Amazon.com, Inc. Elastic Compute Cloud (EC2). http://aws.amazon.com/ec2. [9] Amazon.com, Inc. Simple Storage Service (S3). http://aws.amazon.com/s3. [10] M. Armbrust, A. Fox, R. Griffith, A. D. Joseph, R. Katz, A. Konwinski, G. Lee, D. Pat- terson, A. Rabkin, I. Stoica, and M. Zaharia. Above the clouds: A berkeley view of cloud computing. Technical Report EECS-2009-28, EECS Department, University of Califor- nia, Berkeley, 2009. [11] S. Bagnasco, L. Betev, P. Buncic, F. Carminati, C. Cirstoiu, C. Grigoras, A. Hayrapetyan, A. Harutyunyan, A. J. Peters, and P. Saiz. AliEn: ALICE environment on the GRID. Journal of Physics: Conference Series, 119(6):062012, July 2008. [12] P. Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, R. Neugebauer, I. Pratt, and A. Warfield. Xen and the art of virtualization. In 19th ACM Symposium on Operating Systems Principles (SOSP 03), 2003. [13] G. B. Berriman, E. Deelman, J. C. Good, J. C. Jacob, D. S. Katz, C. Kesselman, A. C. Laity, T. A. Prince, G. Singh, and M. Su. Montage: a grid-enabled engine for deliv- ering custom science-grade mosaics on demand. In SPIE Conference on Astronomical Telescopes and Instrumentation, June 2004. 177 [14] G. B. Berriman, G. Juve, E. Deelman, M. Regelson, and P. Plavchan. The application of cloud computing to astronomy: A study of cost and performance. In Workshop on e-Science challenges in Astronomy and Astrophysics, 2010. [15] S. Bharathi, A. Chervenak, E. Deelman, G. Mehta, M. Su, and K. Vahi. Characterization of scientific workflows. In 3rd Workshop on Workflows in Support of Large Scale Science (WORKS 08), 2008. [16] J. Brehm, M. Madhukar, E. Smirni, and L. Dowdy. PerPreT - a performance prediction tool for massively parallel systems. In Proceedings of the Joint Conference on Perfor- mance Tools / MMB, 1995. [17] L. Breiman, J. Friedman, C. J. Stone, and R. Olshen. Classification and Regression Trees. Chapman and Hall/CRC, 1984. [18] J. Bresnahan, T. Freeman, D. LaBissoniere, and K. Keahey. Managing appliance launches in infrastructure clouds. In Teragrid Conference, July 2011. [19] D. Breuer, D. Erwin, D. Mallmann, R. Menday, M. Romberg, V . Sander, B. Schuller, and P. Wieder. Scientific computing with UNICORE. In Proceedings of the NIC Symposium, 2004. [20] M. J. Brim, T. G. Mattson, and S. L. Scott. OSCAR: open source cluster application resources. In Proceedings of the Ottowa Linux Symposium, 2001. [21] D. A. Brown, P. R. Brady, A. Dietz, J. Cao, B. Johnson, and J. McNabb. A case study on the use of workflow technologies for scientific analysis: Gravitational wave data analysis. In Workflows for e-Science, pages 39–59. Springer-Verlag, 2007. [22] M. Burgess. A site configuration engine. USENIX Computing Systems, 8(3):309–337, 1995. [23] M. Burgess and R. 
Ralston. Distributed resource administration using cfengine. Software: Practice and Experience, 27(9):1083–1101, Sept. 1997. [24] E. Byun, Y . Kee, E. Deelman, K. Vahi, G. Mehta, and J. Kim. Estimating resource needs for Time-Constrained workflows. In Proceedings of the 4th IEEE International Confer- ence on e-Science (e-Science ’08), 2008. [25] R. N. Calheiros, R. Ranjan, A. Beloglazov, C. A. F. De Rose, and R. Buyya. CloudSim: a toolkit for modeling and simulation of cloud computing environments and evaluation of resource provisioning algorithms. Software: Practice and Experience, 41(1):23–50, Jan. 2011. [26] S. Callaghan, P. Maechling, E. Deelman, K. Vahi, G. Mehta, G. Juve, K. Milner, R. Graves, E. Field, D. Okaya, and T. Jordan. Reducing Time-to-Solution Using Dis- tributed High-Throughput Mega-Workflows: Experiences from SCEC CyberShake. In Proceedings of the 4th IEEE International Conference on e-Science (e-Science ’08), 2008. 178 [27] S. Callaghan, P. Maechling, P. Small, K. Milner, G. Juve, T. Jordan, E. Deelman, G. Mehta, K. Vahi, D. Gunter, K. Beattie, and C. X. Brooks. Metrics for heterogeneous scientific workflows: A case study of an earthquake science application. International Journal of High Performance Computing Applications, 25(3):274–285, 2011. [28] C. D. Capano. Searching for Gravitational Waves from Compact Binary Coalescence Using LIGO and Virgo Data. PhD thesis, Syracuse University, 2011. [29] P. Carns, W. Ligon, R. Ross, and R. Thakur. PVFS: A Parallel File System for Linux Clusters. In 4th Annual Linux Showcase and Conference, 2000. [30] A. Casajus, R. Graciani, S. Paterson, A. Tsaregorodtsev, and The LHCb Dirac Team. DIRAC Pilot Framework and the DIRAC Workload Management System. Journal of Physics: Conference Series, 219(6), Apr. 2010. [31] A. Chervenak, N. Palavalli, S. Bharathi, C. Kesselman, and R. Schwartzkopf. Perfor- mance and scalability of a replica location service. In 13th IEEE International Symposium on High-Performance Distributed Computing (HPDC ’04), 2004. [32] A. Chervenak, R. Schuler, M. Ripeanu, M. A. Amer, S. Bharathi, I. Foster, A. Iamnitchi, and C. Kesselman. The globus replica location service: Design and experience. IEEE Transactions on Parallel and Distributed Systems, 20(9):1260–1272, 2005. [33] Condor connection brokering (CCB). http://www.cs.wisc. edu/condor/manual/v7.3/3_7Networking_includes.html# SECTION00473000000000000000. [34] condor glidein. http://www.cs.wisc.edu/condor/glidein. [35] Convection Rotation and Planetary Transits. http://www.esa.int/esaMI/ COROT/index.html. [36] Collaboratory for the Study of Earthquake Predictability (CSEP). http://www. cseptesting.org/. [37] S. M. S. da Cruz, F. N. da Silva, M. C. R. Cavalcanti, M. Mattoso, and L. M. R. Gadelha. A lightweight middleware monitor for distributed scientific workflows. In IEEE Interna- tional Symposium on Cluster Computing and the Grid, 2008. [38] DAGMan: Directed Acyclic Graph Manager. http://cs.wisc.edu/condor/ dagman. [39] D. De Roure, C. A. Goble, and R. Stevens. Designing the myExperiment virtual research environment for the social sharing of workflows. In Proceedings of the IEEE International Conference on e-Science and Grid Computing (e-Science ’07), 2007. [40] E. Deelman, J. Blythe, Y . Gil, C. Kesselman, G. Mehta, S. Patil, M. Su, K. Vahi, and M. Livny. Pegasus: Mapping scientific workflows onto the grid. In Across Grid Confer- ence, 2004. 179 [41] E. Deelman, S. Callaghan, E. Field, H. Francoeur, R. Graves, N. Gupta, V . Gupta, T. H. Jordan, C. Kesselman, P. 
Maechling, J. Mehringer, G. Mehta, D. Okaya, K. Vahi, and L. Zhao. Managing Large-Scale workflow execution from resource provisioning to prove- nance tracking: The CyberShake example. In Proceedings of the 2nd IEEE International Conference on e-Science and Grid Computing (e-Science 06), Dec. 2006. [42] E. Deelman, D. Gannon, M. Shields, and I. Taylor. Workflows and e-Science: an overview of workflow system features and capabilities. Future Generation Computer Systems, 25(5):528–540, May 2009. [43] E. Deelman, C. Kesselman, G. Mehta, L. Meshkat, L. Pearlman, K. Blackburn, P. Ehrens, A. Lazzarini, R. Williams, and S. Koranda. GriPhyN and LIGO: building a virtual data grid for gravitational wave scientists. In 11th IEEE International Symposium on High Performance Distributed Computing (HPDC ’02), 2002. [44] E. Deelman, G. Singh, M. Livny, B. Berriman, and J. Good. The cost of doing science on the cloud: The montage example. In Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, 2008. [45] E. Deelman, G. Singh, M. Su, J. Blythe, Y . Gil, C. Kesselman, G. Mehta, K. Vahi, G. B. Berriman, J. Good, A. Laity, J. C. Jacob, and D. S. Katz. Pegasus: A framework for mapping complex scientific workflows onto distributed systems. Scientific Programming, 13(3):219–237, 2005. [46] J. J. Dongarra, E. Jeannot, E. Saule, and Z. Shi. Bi-objective Scheduling Algorithms for Optimizing Makespan and Reliability on Heterogeneous Systems. In Proceedings of the 19th Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA ’07), 2007. [47] A. Downey. Predicting queue times on space-sharing parallel computers. In Proceedings of the 11th International Parallel Processing Symposium, 1997. [48] N. Dun, K. Taura, and A. Yonezawa. ParaTrac: a fine-grained profiler for data-intensive workflows. In Proceedings of the 19th ACM International Symposium on High Perfor- mance Distributed Computing (HPDC 2010), 2010. [49] D. Durkee. Why cloud computing will never be free. Communications of the ACM, 53(5):62–69, May 2010. [50] M. Ellert, M. Grønager, A. Konstantinov, B. K´ onya, J. Lindemann, I. Livenson, J. Nielsen, M. Niinim¨ aki, O. Smirnova, and A. W¨ a¨ an¨ anen. Advanced resource connector middleware for lightweight computational grids. Future Generation Computer Systems, 23(2):219– 240, Feb. 2007. [51] C. Evangelinos and C. N. Hill. Cloud computing for parallel scientific HPC applications: Feasibility of running coupled Atmosphere-Ocean climate models on amazon’s EC2. In Cloud Computing and Its Applications (CCA 2008), 2008. 180 [52] M. Faerman, A. Su, R. Wolski, and F. Berman. Adaptive performance prediction for distributed Data-Intensive applications. In IEEE/ACM Conference on Supercomputing, 1999. [53] R. Fielding. Architectural styles and the design of network-based software architectures. PhD thesis, University of California, Irvine, 2000. [54] I. Foster, T. Freeman, K. Keahey, D. Scheftner, B. Sotomayer, and X. Zhang. Virtual clus- ters for grid communities. In Sixth IEEE International Symposium on Cluster Computing and the Grid (CCGRID ’06), pages 513–520, 2006. [55] I. Foster and C. Kesselman. Globus: A metacomputing infrastructure toolkit. Interna- tional Journal of Supercomputer Applications, 11(2):115–128, 1997. [56] I. Foster and C. Kesselman. The Grid: Blueprint for a New Computing Infrastructure. Morgan Kaufmann, 2004. [57] I. Foster, C. Kesselman, and S. Tuecke. The anatomy of the grid: Enabling scalable vir- tual organizations. 
International Journal of High Performance Computing Applications, 15(3):200–222, 2001. [58] G. C. Fox and D. Gannon. Workflow in grid systems. Concurrency and Computation: Practice and Experience, 18(10):1009–1019, 2006. [59] J. Frey, T. Tannenbaum, M. Livny, I. Foster, and S. Tuecke. Condor-G: a computation management agent for Multi-Institutional grids. Cluster Computing, 5(3):237–246, 2002. [60] FutureGrid. http://futuregrid.org/. [61] Generic connection brokering (GCB). http://cs.wisc.edu/condor/gcb. [62] W. Gentzsch. Sun grid engine: towards creating a compute power grid. In 1st IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGrid ’01), 2001. [63] R. Gibbons. A historical application profiler for use by parallel schedulers. In Job Scheduling Strategies for Parallel Processing, volume 1291 of Lecture Notes in Computer Science. 1997. [64] L. Gilbert, J. Tseng, R. Newman, S. Iqbal, R. Pepper, O. Celebioglu, J. Hsieh, and M. Cob- ban. Performance implications of virtualization and Hyper-Threading on high energy physics applications in a grid environment. In 19th IEEE International Parallel and Dis- tributed Processing Symposium, 2005. [65] gLite Grid Middleware. http://glite.web.cern.ch. [66] Gluster, Inc. GlusterFS. http://www.gluster.org. [67] M. Goldenberg, P. Lu, and J. Schaeffer. TrellisDAG: A System for Structured DAG Scheduling. In Job Scheduling Strategies for Parallel Processing, volume 2862 of Lecture Notes in Computer Science, pages 21–43. 2003. 181 [68] Google, Inc. AppEngine. http://appengine.google.com. [69] R. Graves, T. Jordan, S. Callaghan, E. Deelman, E. Field, G. Juve, C. Kesselman, P. Maechling, G. Mehta, K. Milner, D. Okaya, P. Small, and K. Vahi. CyberShake: A Physics-Based Seismic Hazard Model for Southern California. Pure and Applied Geo- physics, 168(3-4):367–381, May 2010. [70] R. W. Graves, B. T. Aagaard, K. W. Hudnut, L. M. Star, J. P. Stewart, and T. H. Jordan. Broadband simulations for mw 7.8 southern san andreas earthquakes: Ground motion sensitivity to rupture speed. Geophysical Research Letters, 35(22), 2008. [71] Grid Workloads Archive. http://gwa.ewi.tudelft.nl/pmwiki. [72] Grid5000. http://www.grid5000.fr. [73] P. Groth, E. Deelman, G. Juve, G. Mehta, and B. Berriman. Pipeline-centric provenance model. In Proceedings of the 4th Workshop on Workflows in Support of Large-Scale Science, 2009. [74] S. Hazelhurst. Scientific computing using virtual high-performance computing: a case study using the amazon elastic computing cloud. In 2008 annual research conference of the South African Institute of Computer Scientists and Information Technologists on IT research in developing countries: riding the wave of technology (SAICSIT ’08), 2008. [75] Heroku Cloud Application Platform. http://www.heroku.com/. [76] T. Hey, S. Tansley, and K. Tolle. The Fourth Paradigm: Data-Intensive Scientific Discov- ery. Microsoft Research, 2009. [77] C. Hoffa, G. Mehta, T. Freeman, E. Deelman, K. Keahey, B. Berriman, and J. Good. On the use of cloud computing for scientific workflows. In 3rd International Workshop on Scientific Workflows and Business Workflow Standards in e-Science (SWBES ’08), 2008. [78] USC High Performance Computing Center (HPCC). http://www.usc.edu/hpcc. [79] R. Huang, A. Chien, and H. Casanova. Automatic resource specification generation for resource selection. In Proceedings of the 2007 ACM/IEEE conference on Supercomputing, 2007. [80] Illumina. http://www.illumina.com. [81] Infiniscale, Inc. Perceus/Warewulf. http://www.perceus.org/. 
[82] Infrared processing and analysis center (ipac). http://ipac.caltech.edu. [83] A. Iosup, C. Dumitrescu, D. Epema, H. Li, and L. Wolters. How are Real Grids Used? The Analysis of Four Grid Traces and Its Implications. In 7th IEEE/ACM International Conference on Grid Computing, 2006. 182 [84] K. R. Jackson, L. Ramakrishnan, K. Muriki, S. Canon, S. Cholia, J. Shalf, H. J. Wasser- man, and N. J. Wright. Performance analysis of high performance computing applications on the amazon web services cloud. In Proceedings of Cloudcomp 2010, 2010. [85] K. R. Jackson, L. Ramakrishnan, K. J. Runge, and R. C. Thomas. Seeking supernovae in the clouds: a performance study. In Proceedings of the 19th ACM International Sympo- sium on High Performance Distributed Computing, 2010. [86] G. Juve and E. Deelman. Resource provisioning options for Large-Scale scientific work- flows. In 4th IEEE International Conference on eScience (eScience ’08), Dec. 2008. [87] G. Juve and E. Deelman. Scientific Workflows in the Cloud. In M. Cafaro and G. Aloisio, editors, Grids, Clouds and Virtualization, pages 71–91. Springer, 2010. [88] G. Juve and E. Deelman. Automating Application Deployment in Infrastructure Clouds. In 3rd IEEE International Conference on Cloud Computing Technology and Science (CloudCom 2011), 2011. [89] G. Juve, E. Deelman, K. Vahi, and G. Mehta. Experiences with resource provisioning for scientific workflows using corral. Scientific Programming, 18(2), Apr. 2010. [90] G. Juve, E. Deelman, K. Vahi, G. Mehta, B. P. Berman, B. Berriman, and P. Maech- ling. Data sharing options for scientific workflows on amazon EC2. In 2010 ACM/IEEE conference on Supercomputing (SC 10), 2010. [91] G. Juve, E. Deelman, K. Vahi, G. Mehta, B. Berriman, B. P. Berman, and P. Maech- ling. Scientific workflow applications on amazon EC2. In 2009 5th IEEE International Conference on E-Science Workshops, Dec. 2009. [92] L. Kanies. Puppet: Next generation configuration management. Login, 31(1):19–25, Feb. 2006. [93] P. K¨ arkk¨ ainen and L. Kurth. XenOverview - Xen Wiki. http://wiki.xensource. com/xenwiki/XenOverview. [94] D. S. Katz, J. C. Jacob, E. Deelman, C. Kesselman, S. Gurmeet, S. Mei-Hui, G. B. Berri- man, J. Good, A. C. Laity, and T. A. Prince. A comparison of two methods for building astronomical image mosaics on a grid. In 34th International Conference on Parallel Pro- cessing Workshops (ICPP ’05), 2005. [95] K. Keahey, R. J. Figueiredo, J. Fortes, T. Freeman, and M. Tsugawa. Science clouds: Early experiences in cloud computing for scientific applications. In Cloud Computing and Its Applications (CCA ’08)), 2008. [96] K. Keahey and T. Freeman. Contextualization: Providing One-Click virtual clusters. In 4th International Conference on e-Science (e-Science 08), 2008. [97] K. Keahey, M. Tsugawa, A. Matsunaga, and J. Fortes. Sky computing. IEEE Internet Computing, 13(5):43–51, 2009. 183 [98] Y . Kee, C. Kesselman, D. Nurmi, and R. Wolski. Enabling personal clusters on demand for batch resources using commodity software. In International Heterogeneity Computing Workshop (HCW 08), 2008. [99] H. Kim, Y . el-Khamra, I. Rodero, S. Jha, and M. Parashar. Autonomic management of application workflows on hybrid computing infrastructure. Sci. Program., 19:75–89, April 2011. [100] D. Kondo, B. Javadi, P. Malecot, F. Cappello, and D. P. Anderson. Cost-benefit analy- sis of cloud computing versus desktop grids. In Proceedings of the IEEE International Symposium on Parallel & Distributed Processing, 2009. [101] I. Krsul, A. Ganguly, J. 
Zhang, J. A. B. Fortes, and R. J. Figueiredo. VMPlants: provid- ing and managing virtual machine execution environments for grid computing. In 2004 ACM/IEEE conference on Supercomputing (SC 04), 2004. [102] Kernel-based Virtual Machine (KVM). http://www.linux-kvm.org/. [103] Y . Kwok and I. Ahmad. Static scheduling algorithms for allocating directed task graphs to multiprocessors. ACM Computing Surveys (CSUR), 31(4):406–471, 1999. [104] Laser Interferometer Gravitational Wave Observatory (LIGO). http://www.ligo. caltech.edu. [105] A. Lathers, M. Su, A. Kulungowski, A. Lin, G. Mehta, S. Peltier, E. Deelman, and M. Ellisman. Enabling parallel scientific applications with workflow tools. In Challenges of Large Applications in Distributed Environments (CLADE 2006), 2006. [106] K. Lee, N. W. Paton, R. Sakellariou, E. Deelman, A. A. A. Fernandes, and G. Mehta. Adaptive workflow processing and execution in pegasus. In Proceedings of the 3rd Inter- national Conference on Grid and Pervasive Computing Workshops, 2008. [107] H. Li, J. Ruan, and R. Durbin. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Research, 18(11):1851–1858, 2008. [108] M. J. Litzkow, M. Livny, and M. W. Mutka. Condor: A hunter of idle workstations. In 8th International Conference of Distributed Computing Systems, 1988. [109] J. Livny, H. Teonadi, M. Livny, and M. K. Waldor. High-Throughput, Kingdom-Wide prediction and annotation of bacterial Non-Coding RNAs. PLoS ONE, 3(9):e3197, 2008. [110] P. Maechling, E. Deelman, L. Zhao, R. Graves, G. Mehta, N. Gupta, J. Mehringer, C. Kesselman, S. Callaghan, D. Okaya, H. Francoeur, V . Gupta, Y . Cui, K. Vahi, T. Jor- dan, and E. Field. SCEC CyberShake WorkflowsAutomating probabilistic seismic hazard analysis calculations. In I. Taylor, E. Deelman, D. Gannon, and M. Shields, editors, Work- flows for e-Science, pages 143–163. Springer, 2007. [111] T. Maeno. PanDA: distributed production and distributed analysis system for ATLAS. Journal of Physics: Conference Series, 119(6):062036, 2008. 184 [112] M. Mao and M. Humphrey. Auto-scaling to minimize cost and meet application deadlines in cloud workflows. In Proceedings of Supercomputing 2011 (SC ’11), 2011. [113] MAQ: Mapping and assembly with qualities. http://www.maq.sourceforge. net. [114] P. Marshall, K. Keahey, and T. Freeman. Elastic site: Using clouds to elastically extend site resources. In 10th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid 2010), May 2010. [115] Massachussetts Institute of Technology. StarCluster. http://web.mit.edu/ stardev/cluster/. [116] T. Mattson. High performance computing at intel: the OSCAR software solution stack for cluster computing. In Proceedings of the 1st IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGrid 2001), 2001. [117] S. Miles, P. Groth, E. Deelman, K. Vahi, G. Mehta, and L. Moreau. Provenance: The bridge between experiments and data. Computing in Science and Engineering, 10(3):38– 46, 2008. [118] J. Moscicki. DIANE - distributed analysis environment for GRID-enabled simulation and analysis of physics data. In IEEE Nuclear Science Symposium Conference Record, Oct. 2003. [119] M. A. Murphy and S. Goasguen. Virtual organization clusters: Self-provisioned clouds on the grid. Future Generation Computer Systems, 26:12711281, Oct. 2010. [120] M. A. Murphy, B. Kagey, M. Fenn, and S. Goasguen. Dynamic provisioning of virtual organization clusters. 
In 9th IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGrid 09), 2009. [121] J. Napper and P. Bientinesi. Can cloud computing reach the top500? In Proceedings of the Workshop on UnConventional High Performance Computing, 2009. [122] NASA. Kepler Mission. http://kepler.nasa.gov/. [123] National Center for Supercomputing Applications (NCSA). Intel 64 Cluster Abe. http://www.ncsa.illinois.edu/UserInfo/Resources/Hardware/ Intel64Cluster/. [124] National Energy Research Scientific Computing Center (NERSC). Magellan. http: //magellan.nersc.gov. [125] P. Nilsson. Experience from a pilot based system for ATLAS. Journal of Physics: Con- ference Series, 119(6):062038, 2008. [126] H. Nishimura, N. Maruyama, and S. Matsuoka. Virtual clusters on the fly - fast, scalable, and flexible installation. In 7th IEEE International Symposium on Cluster Computing and the Grid (CCGrid 07), 2007. 185 [127] G. R. Nudd, D. J. Kerbyson, E. Papaefstathiou, S. C. Perry, J. S. Harper, and D. V . Wilcox. PaceA toolset for the performance prediction of parallel and distributed systems. Interna- tional Journal of High Performance Computing Applications, 14(3):228 –251, 2000. [128] D. Nurmi, A. Mandal, J. Brevik, C. Koelbel, R. Wolski, and K. Kennedy. Evaluation of a workflow scheduler using integrated performance modelling and batch queue wait time prediction. In IEEE/ACM Conference on Supercomputing, 2006. [129] D. Nurmi, R. Wolski, and J. Brevik. V ARQ: virtual advance reservations for queues. In 17th International Symposium on High Performance Distributed Computing (HPDC ’08), 2008. [130] D. Nurmi, R. Wolski, C. Grzegorczyk, G. Obertelli, S. Soman, L. Youseff, and D. Zagorodnov. Eucalyptus: A technical report on an elastic utility computing archi- tecture linking your programs to useful systems. UCSB Computer Science Technical Report 2008-10, 2008. [131] D. Nurmi, R. Wolski, C. Grzegorczyk, G. Obertelli, S. Soman, L. Youseff, and D. Zagorodnov. The eucalyptus open-source cloud-computing system. In 9th IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGrid 09), 2009. [132] T. Oinn, M. Addis, J. Ferris, D. Marvin, M. Senger, M. Greenwood, T. Carver, K. Glover, M. R. Pocock, A. Wipat, and P. Li. Taverna: a tool for the composition and enactment of bioinformatics workflows. Bioinformatics, 20(17):3045–3054, Nov. 2004. [133] OpenNebula. http://www.opennebula.org. [134] OpenPBS. http://www.openpbs.org. [135] Opscode, Inc. Chef. http://www.opscode.com/chef. [136] Oracle Corporation. Lustre Parallel Filesystem. http://www.lustre.org. [137] S. Ostermann, A. Iosup, N. Yigitbasi, R. Prodan, T. Fahringer, and D. Epema. A perfor- mance analysis of EC2 cloud computing services for scientific computing. In Proceedings of Cloudcomp 2009, 2009. [138] S. Ostermann, K. Plankensteiner, R. Prodan, T. Fahringer, and A. Iosup. Workflow moni- toring and analysis tool for ASKALON. In Grid and Services Evolution. 2009. [139] S. Ostermann, R. Prodan, and T. Fahringer. Dynamic cloud provisioning for scientific grid workflows. In Grid Computing (GRID 2010), 2010. [140] M. R. Palankar, A. Iamnitchi, M. Ripeanu, and S. Garfinkel. Amazon s3 for science grids: a viable solution? In Proceedings of the 2008 international workshop on Data-aware distributed computing (DADC 08), 2008. [141] S. Pandey, L. Wu, S. M. Guru, and R. Buyya. A Particle Swarm Optimization-Based Heuristic for Scheduling Workflow Applications in Cloud Computing Environments. 
In International Conference on Advanced Information Networking and Applications, 2010. 186 [142] P. M. Papadopoulos, M. J. Katz, and G. Bruno. NPACI rocks: tools and techniques for easily deploying manageable linux clusters. Concurrency and Computation: Practice and Experience, 15(7-8):707–725, June 2003. [143] Parallel Workloads Archive. http://www.cs.huji.ac.il/labs/parallel/ workload. [144] Penguin Computing, Inc. Scyld ClusterWare. http://www.penguincomputing. com/software/scyld_clusterware. [145] C. Pinchak, P. Lu, and M. Goldenberg. Practical heterogeneous placeholder scheduling in overlay metacomputers: Early experiences. In Job Scheduling Strategies for Parallel Processing, pages 205–228. 2002. [146] Platform Computing, Inc. Platform LSF. http://www.platform.com/ workload-management/high-performance-computing. [147] I. Processing and A. C. (IPAC). Montage Image Mosaic Service. http://hachi. ipac.caltech.edu:8080/montage/. [148] I. Processing and A. C. (IPAC). NASA Star and Exoplanet Database (NStED). http: //nsted.ipac.caltech.edu. [149] R. Prodan and M. Wieczorek. Bi-Criteria Scheduling of Scientific Grid Workflows. IEEE Transactions on Automation Science and Engineering, 7(2):364–376, 2010. [150] ptrace - process trace. Linux Programmer’s Manual. [151] I. Raicu, Y . Zhao, C. Dumitrescu, I. Foster, and M. Wilde. Falkon: a fast and light-weight tasK executiON framework. In IEEE/ACM Conference on Supercomputing, 2007. [152] L. Ramakrishnan and D. Gannon. A survey of distributed workflow characteristics and resource requirements. Technical Report TR671, Indiana University, 2008. [153] J. J. Rehr, F. D. Vila, J. P. Gardner, L. Svec, and M. Prange. Scientific computing in the cloud. Computing in Science and Engineering, 12(3):34–43, 2010. [154] RightScale, Inc. RightScale. http://www.rightscale.com. [155] M. Rodrguez, D. Tapiador, J. Fontn, E. Huedo, R. S. Montero, and I. M. Llorente. Dynamic provisioning of virtual clusters for grid computing. In Euro-Par 2008 Work- shops - Parallel Processing, 2009. [156] R. Sakellariou, H. Zhao, E. Tsiakkouri, and M. D. Dikaiakos. Scheduling workflows with budget constraints. In S. Gorlatch and M. Danelutto, editors, Integrated Research in GRID Computing, COREGrid Series. Springer-Verlag, 2007. [157] R. Sandberg, D. Golgberg, S. Kleiman, D. Walsh, and B. Lyon. Design and implementa- tion of the sun network filesystem. In USENIX Conference Proceedings, 1985. 187 [158] J. Schopf and F. Berman. Performance prediction in production environments. In Proceed- ings of the First Merged International Parallel Processing Symposium and Symposium on Parallel and Distributed Processing (IPPS/SPDP 98), 1998. [159] I. Sfiligoi. glideinWMS—a generic pilot-based workload management system. Journal of Physics: Conference Series, 119(6):062044, 2008. [160] G. Singh, C. Kesselman, and E. Deelman. Performance impact of resource provisioning on workflows. Technical Report 05-850, University of Southern California, 2005. [161] G. Singh, C. Kesselman, and E. Deelman. Application-Level resource provisioning on the grid. In Proceedings of the 2nd IEEE International Conference on e-Science and Grid Computing, 2006. [162] G. Singh, C. Kesselman, and E. Deelman. A provisioning model and its comparison with Best-Effort for Performance-Cost optimization in grids. In Proceedings of the 16th international symposium on High performance distributed computing (HPDC ’07), 2007. [163] G. Singh, M. Su, K. Vahi, E. Deelman, B. Berriman, J. Good, D. S. Katz, and G. Mehta. 
Workflow task clustering for best effort systems with pegasus. In 15th ACM Mardi Gras Conference, 2008. [164] W. Smith, I. Foster, and V . Taylor. Scheduling with advanced reservations. In 14th Inter- national Parallel and Distributed Processing Symposium (IPDPS 2000), page 127, 2000. [165] W. Smith, I. Foster, and V . Taylor. Predicting application run times with historical infor- mation. Journal of Parallel and Distributed Computing, 64(9):1007–1016, Sept. 2004. [166] Southern California Earthquake Center (SCEC). http://www.scec.org. [167] strace system call tracing utility. http://sourceforge.net/projects/ strace/. [168] A. K. M. K. A. Talukder, M. Kirley, and R. Buyya. Multiobjective differential evolu- tion for scheduling workflow applications on global Grids. Concurrency Computation: Practice and Experience, 21(13):1742–1756, 2009. [169] I. Taylor, M. Shields, I. Wang, and A. Harrison. Visual grid workflow in triana. Journal of Grid Computing, 3(3-4):153–169, Jan. 2006. [170] The TeraGrid Project. http://www.teragrid.org. [171] Texas Advanced Computing Center (TACC). Ranger user guide. http://www.tacc. utexas.edu/user-services/user-guides/ranger-user-guide. [172] H. Topcuoglu, S. Hariri, and W. Min-You. Performance-effective and low-complexity task scheduling for heterogeneous computing. IEEE Transactions on Parallel and Distributed Systems, 13(3):260–274, 2002. [173] Torque. http://supercluster.org/torque. 188 [174] H. Truong and T. Fahringer. SCALEA-G: a unified monitoring and performance analysis system for the grid. Scientific Programming, 12(4):225–237, 2004. [175] A. Tsaregorodtsev, M. Bargiotti, N. Brook, A. C. Ramo, G. Castellani, P. Charpentier, C. Cioffi, J. Closier, R. G. Diaz, G. Kuznetsov, Y . Y . Li, R. Nandakumar, S. Paterson, R. Santinelli, A. C. Smith, M. S. Miguelez, and S. G. Jimenez. DIRAC: a community grid solution. Journal of Physics: Conference Series, 119(6):062048, 2008. [176] Two Micron All Sky Survey (2MASS). http://pegasus.phast.umass.edu. [177] USC Epigenome Center. http://epigenome.usc.edu. [178] P. Uthayopas, S. Paisitbenchapol, T. Angskun, and J. Maneesilp. System management framework and tools for beowulf cluster. In Fourth International Conference/Exhibition on High Performance Computing in the Asia-Pacific Region, 2000. [179] M. Uysal, T. M. Kurc, A. Sussman, and J. H. Saltz. A performance prediction framework for data intensive applications on large scale parallel machines. In 4th International Work- shop on Languages, Compilers, and Run-Time Systems for Scalable Computers, 1998. [180] W. van der Aalst, A. ter Hofstede, B. Kiepuszewski, and A. Barros. Workflow patterns. Distributed and Parallel Databases, 14(1):5–51, 2003. [181] J. V¨ ockler, G. Juve, E. Deelman, M. Rynge, and G. B. Berriman. Experiences using cloud computing for a scientific workflow application. In 2nd Workshop on Scientific Cloud Computing (ScienceCloud ’11), 2011. [182] J. V¨ ockler, G. Mehta, Y . Zhao, E. Deelman, and M. Wilde. Kickstarting Remote Appli- cations. In Proceedings of the 2nd Workshop on Grid Computing Environments (OGCE 06), 2006. [183] E. Walker. Benchmarking amazon EC2 for High-Performance scientific computing. Login, 33(5):18–23, 2008. [184] E. Walker, J. P. Gardner, V . Litvin, and E. Turner. Creating personal adaptive clusters for managing scientific jobs in a distributed computing environment. In IEEE International Workshop on Challenges of Large Applications in Distributed Environments (CLADE 06), 2006. [185] S. Weil, S. Brandt, E. Miller, D. Long, and C. 
Maltzahn. Ceph: A scalable, high- performance distributed file system. In 7th Symposium on Operating Systems Design and Implementation (OSDI 06), 2006. [186] M. Wieczorek, A. Hoheisel, and R. Prodan. Towards a general model of the multi-criteria workflow scheduling on the grid. Future Generation Computer Systems, 25(3):237–256, Mar. 2009. [187] M. Wieczorek, M. Siddiqui, A. Villazon, R. Prodan, and T. Fahringer. Applying advance reservation to increase predictability of workflow execution on the grid. In Second IEEE International Conference on e-Science and Grid Computing (e-Science 2006), 2006. 189 [188] J. Wilkening, A. Wilke, N. Desai, and F. Meyer. Using clouds for metagenomics: A case study. In IEEE Cluster, 2009. [189] Workflow gallery. http://pegasus.isi.edu/workflow_gallery/index. php. [190] Workflow Generator. https://confluence.pegasus.isi.edu/display/ pegasus/WorkflowGenerator. [191] The Workflow Patterns Initiative. http://www.workflowpatterns.com. [192] L. Youseff, M. Butrico, and D. Da Silva. Toward a unified ontology of cloud computing. In Grid Computing Environments Workshop (GCE ’08), 2008. [193] J. Yu and R. Buyya. A taxonomy of workflow management systems for grid computing. Journal of Grid Computing, 3(3–4), 2005. [194] J. Yu, R. Buyya, and C. Tham. Cost-Based scheduling of scientific workflow application on utility grids. In Proceedings of the First International Conference on e-Science and Grid Computing, 2005. [195] Y . Zhang, C. Koelbel, and K. Cooper. Batch queue resource scheduling for workflow applications. In IEEE International Conference on Cluster Computing (Cluster 09), 2009. [196] H. Zhao and R. Sakellariou. Advance reservation policies for workflows. In Job Schedul- ing Strategies for Parallel Processing, pages 47–67. 2007. [197] Z. Zhi-Hong, M. Dan, Z. Jian-Feng, W. Lei, W. Lin-ping, and H. Wei. Easy and reliable cluster management: the self-management experience of fire phoenix. In 20th Interna- tional Parallel and Distributed Processing Symposium (IPDPS 06), 2006. 190 Appendix List of Publications The Application of Cloud Computing to Scientific Workflows: A Study of Cost and Per- formance, G. Bruce Berriman, Gideon Juve, Jens-S. V¨ ockler, Ewa Deelman, Mats Rynge, Proceedings of the Royal Society A, to appear. An Evaluation of the Cost and Performance of Scientific Workflows on Amazon EC2, Gideon Juve, Ewa Deelman, G. Bruce Berriman, Benjamin P. Berman, Philip Maech- ling, Journal of Grid Computing, Special Issue on Data-Intensive Science in the Cloud, to appear. Using Clouds for Science, Is it Just Kicking the Can Down The Road?, Ewa Deelman, G. Bruce Berriman, Gideon Juve, 2nd International Conference on Cloud Computing and Services Science (CLOSER 2012), 2012. Automating Application Deployment in Infrastructure Clouds, Gideon Juve, Ewa Deel- man, 3rd IEEE International Conference on Cloud Computing Technology and Science (CloudCom 2011), 2011. Experiences Using GlideinWMS and the Corral Frontend Across Cyberinfrastructures, Mats Rynge, Gideon Juve, Gaurang Mehta, Ewa Deelman, Krista Larson, Burt Holzman, Igor Sfiligoi, Frank Wrthwein, G. Bruce Berriman, Scott Callaghan, Proceedings of the 7th IEEE International Conference on e-Science (e-Science 2011), 2011. 191 Online Workflow Management and Performance Analysis with STAMPEDE, Dan Gunter, Christopher H. 
Brooks, Ewa Deelman, Monte Good, Gideon Juve, Gaurang Mehta, Priscilla Moaes, Taghrid Samak, Fabio Silva, Martin Swany, Karan Vahi, Proceedings of the 7th International Conference on Network and Service Management (CNSM 2011), 2011. Online Fault and Anomaly Detection for Large-Scale Scientific Workflows, Taghrid Samak, Dan Gunter, Ewa Deelman, Gideon Juve, Gaurang Mehta, Fabio Silva, Karan Vahi, 13th IEEE International Conference on High Performance Computing and Commu- nications (HPCC 2011), 2011. Metrics for Heterogeneous Scientific Workflows: A Case Study of an Earthquake Sci- ence Application, Scott Callaghan, Philip Maechling, Patrick Small, Kevin Milner, Gideon Juve, Thomas H. Jordan, Ewa Deelman, Gaurang Mehta, Karan Vahi, Dan Gunter, Keith Beattie. International Journal of High Performance Computing Applications, 25:3, pp. 274-285, 2011. Wrangler: Virtual Cluster Provisioning for the Cloud, Gideon Juve and Ewa Deelman, short paper, Proceedings of the 20th International Symposium on High Performance Dis- tributed Computing (HPDC 2011), 2011. Experiences Using Cloud Computing for A Scientific Workflow Application, Jens-S. V¨ ockler, Gideon Juve, Ewa Deelman, Mats Rynge, G. Bruce Berriman, Proceedings of 2nd Workshop on Scientific Cloud Computing (ScienceCloud 2011), 2011. Scientific Workflows in the Cloud, Gideon Juve and Ewa Deelman, in Grids, Clouds and Virtualization, M. Cafaro and G. Aloisio, Eds. Springer, pp. 71-91, 2010. The Application of Cloud Computing to Astronomy: A Study of Cost and Performance, G. Bruce Berriman, Gideon Juve, Ewa Deelman, Moira Regelson, Peter Plavchan, workshop on e-Science challenges in Astronomy and Astrophysics in conjunction with the 6th IEEE International Conference on e-Science (e-Science 2010), 2010. 192 CyberShake: A Physics-Based Seismic Hazard Model for Southern California, Robert Graves, Thomas Jordan, Scott Callaghan, Ewa Deelman, Edward Field, Gideon Juve, Carl Kesselman, Philip Maechling, Gaurang Mehta, Kevin Milner, David Okaya, Patrick Small, Karan Vahi, Pure and Applied Geophysics, 168:3-4, pp 367-381, 2010. Experiences with Resource Provisioning for Scientific Workflows Using Corral, Gideon Juve, Ewa Deelman, Karan Vahi, Gaurang Mehta, Scientific Programming, 18:2, pp. 77- 92, 2010. The Application of Cloud Computing to the Creation of Image Mosaics and Management of Their Provenance, G. Bruce Berriman, Ewa Deelman, Paul Groth, and Gideon Juve, SPIE Conference 7740: Software and Cyberinfrastructure for Astronomy, 2010. Data Sharing Options for Scientific Workflows on Amazon EC2, Gideon Juve, Ewa Deel- man, Karan Vahi, Gaurang Mehta, Bruce Berriman, Benjamin P. Berman, Phil Maechling, 22nd IEEE/ACM Conference on Supercomputing (SC10), 2010. Scaling up Workflow-based Applications, Scott Callaghan, Ewa Deelman, Dan Gunter, Gideon Juve, Philip Maechling, Christopher Brooks, Karan Vahi, Kevin Milner, Robert Graves, Edward Field, David Okaya, Thomas Jordan, Journal of Computer and System Sciences, 76:6, pp. 428-446, 2010. Scientific Workflows and Clouds, Gideon Juve, Ewa Deelman, ACM Crossroads, 16:3, pp. 14-18, Spring 2010. Scientific Workflow Applications on Amazon EC2, Gideon Juve, Ewa Deelman, Karan Vahi, Gaurang Mehta, Bruce Berriman, Benjamin P. Berman and Phil Maechling, Work- shop on Cloud-based Services and Applications in conjunction with 5th IEEE Interna- tional Conference on e-Science (e-Science 2009), 2009. 
193 Pipeline-Centric Provenance Model, Paul Groth, Ewa Deelman, Gideon Juve, Gaurang Mehta, Bruce Berriman, 4th Workshop on Workflows in Support of Large-Scale Science (WORKS 09), 2009. The ShakeOut earthquake scenario: Verification of three simulation sets, Jacobo Bielak, Robert Graves, Kim Olsen, Ricardo Taborda, L. Ram´ ırez-Guzm´ an, S.M. Day, Geoffrey Ely, D. Roten, Thomas Jordan, Philip Maechling, J. Urbanic, Yifeng Cui, Gideon Juve, Geophysical Journal International, 108, pp 375-404, 2009. Resource Provisioning Options for Large-Scale Scientific Workflows, Gideon Juve, Ewa Deelman, Third International Workshop on Scientific Workflows and Business Workflow Standards in e-Science (SWBES 08), 2008. Reducing Time-to-Solution Using Distributed High-Throughput Mega-Workflows: Expe- riences from SCEC CyberShake, Scott Callaghan, Philip Maechling, Ewa Deelman, Karan Vahi, Gaurang Mehta, Gideon Juve, Kevin Milner, Robert Graves, Edward Field, David Okaya, Dan Gunter, Keith Beattie, Thomas Jordan, Fourth IEEE International Conference on e-Science (eScience 08), 2008. A Resource Provisioning System for Scientific Workflow Applications, Gideon Juve, Mas- ter’s Thesis, Computer Science Department, University of Southern California, December 2008. 194
Abstract
Scientific workflows are a parallel computing technique used to orchestrate large, complex, multi-stage computations for data analysis and simulation in many academic domains. Resource management is a key problem in the execution of workflows because they often involve large computations and data sets that must be distributed across many resources in order to complete in a reasonable time. Traditionally, resources in distributed computing systems such as clusters and grids were allocated to workflow tasks through batch scheduling: tasks were submitted to a batch queue and matched to available resources just prior to execution. Recently, due to performance and quality-of-service considerations on the grid, and the development of cloud computing, it has become advantageous, and in the case of cloud computing necessary, for workflow applications to explicitly provision resources ahead of execution. This trend toward resource provisioning has created many new problems and opportunities in the management of scientific workflows. This thesis explores several of these resource management issues and describes potential solutions.

This thesis makes the following contributions:

1. It describes several problems associated with resource provisioning in cluster and grid environments, and presents a new provisioning approach based on pilot jobs that has many benefits for both resource owners and application users in terms of performance, quality of service, and efficiency. It also describes the design and implementation of a system based on pilot jobs that enables applications to bypass restrictive grid scheduling policies and is shown to reduce the makespan of several workflow applications by 32%-48% on average.

2. It describes the challenges of provisioning resources for workflows and other distributed applications in Infrastructure as a Service (IaaS) clouds and presents a new technique for modeling complex, distributed applications that is based on directed acyclic graphs. This model is used to develop a system for automatically deploying and managing distributed applications in infrastructure clouds, which has been used to provision hundreds of virtual clusters for executing scientific workflows in the cloud (a minimal illustrative sketch of this idea follows the list of contributions).

3. It describes the challenges and benefits of running workflow applications in infrastructure clouds and presents the results of several studies investigating the cost and performance of running workflow applications on Amazon EC2 using a variety of resource types and storage systems. These studies compared the performance of workflows in grids and clouds, characterized the virtualization overhead of workflow applications in the cloud, compared the cost and performance of different storage systems for workflows in the cloud, and evaluated the long-term costs of hosting workflow applications in the cloud.

4. It investigates the problem of predicting the resource needs of workflow applications using historical data, and describes a technique for collecting detailed resource usage records for workflow applications. In addition to estimating resource requirements, this data can be used as input for simulations of scheduling algorithms and workflow management systems, and for identifying problems and optimization opportunities in workflows. The technique is applied to six different workflow applications, and the resulting data is analyzed to identify potential bugs and opportunities for optimizing the workflows.

5. It investigates issues related to dynamic provisioning of resources for workflow ensembles and describes three algorithms (one offline and two online) for provisioning and scheduling workflow ensembles under deadline and budget constraints. The relative performance of these algorithms is evaluated using several applications under a variety of realistic conditions, including resource provisioning delays and task estimation errors. The evaluation shows that the offline algorithm achieves higher performance under ideal conditions, but the online algorithms adapt better to errors and delays without exceeding the constraints.
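The following is a minimal, illustrative sketch (in Python) of the deployment idea summarized in contribution 2: a virtual cluster is described as a directed acyclic graph of nodes, and each node is provisioned only after the nodes it depends on are running. The node names, the depends_on field, and the launch_node function are hypothetical stand-ins for illustration, not the interface of the system developed in this thesis.

# Minimal sketch, under assumed names: a virtual cluster described as a DAG
# of nodes, deployed in dependency order. Not the actual system's API.
from collections import deque

# Deployment description: node name -> role and dependencies.
cluster = {
    "nfs_server": {"role": "shared storage", "depends_on": []},
    "master":     {"role": "batch scheduler", "depends_on": ["nfs_server"]},
    "worker_1":   {"role": "compute worker", "depends_on": ["master"]},
    "worker_2":   {"role": "compute worker", "depends_on": ["master"]},
}

def launch_node(name, role):
    # Stand-in for an IaaS provisioning call (start a VM, run its
    # configuration scripts). Here it only reports what would be done.
    print(f"launching {name} ({role})")

def deploy(cluster):
    # Kahn's algorithm: a node becomes ready when all of its dependencies
    # have been launched, so the graph is deployed in dependency order.
    indegree = {name: len(spec["depends_on"]) for name, spec in cluster.items()}
    ready = deque(name for name, deg in indegree.items() if deg == 0)
    launched = 0
    while ready:
        name = ready.popleft()
        launch_node(name, cluster[name]["role"])
        launched += 1
        for other, spec in cluster.items():
            if name in spec["depends_on"]:
                indegree[other] -= 1
                if indegree[other] == 0:
                    ready.append(other)
    if launched != len(cluster):
        raise ValueError("deployment description contains a cycle")

deploy(cluster)

In a real deployment the launch step would issue IaaS requests and run per-node configuration; the sketch only prints the order in which nodes would be started (storage first, then the scheduler, then the workers).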
Asset Metadata
Creator: Juve, Gideon M. (author)
Core Title: Resource management for scientific workflows
School: Viterbi School of Engineering
Degree: Doctor of Philosophy
Degree Program: Computer Science
Publication Date: 04/16/2012
Defense Date: 03/27/2012
Publisher: Los Angeles, California (original), University of Southern California (original), University of Southern California. Libraries (digital)
Tag: OAI-PMH Harvest, provisioning, resource management, scheduling, scientific workflows
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisor: Deelman, Ewa (committee chair), Chervenak, Ann (committee member), Jordan, Thomas H. (committee member), Nakano, Aiichiro (committee member)
Creator Email: gideonjuve@gmail.com, juve@usc.edu
Permanent Link (DOI): https://doi.org/10.25549/usctheses-c3-6886
Unique identifier: UC1111047
Identifier: usctheses-c3-6886 (legacy record id)
Legacy Identifier: etd-JuveGideon-605.pdf
Dmrecord: 6886
Document Type: Dissertation
Rights: Juve, Gideon M.
Type: texts
Source: University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA