POLICY BASED DATA PLACEMENT IN DISTRIBUTED SYSTEMS
by
Muhammad Ali Amer
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)
December 2012
Copyright 2012 Muhammad Ali Amer
Dedication
I dedicate this dissertation to my mother for her love, support and her con-
tinuous stream of prayers. To my teachers for giving me knowledge. To my
friends for adding color and fun to my life. To the Army for instilling the
values of a lifetime in me. And finally, and most dearly to my wife, Lubna,
for her unwavering support through the ups and downs during the course of
my Ph.D, for taking the helm in bringing up our three wonderful children,
Aaiza, Ahmad and Aadil.
I thank Allah Almighty for his blessings. All that I achieve is ultimately
granted by Him.
Table of Contents
Dedication ii
List of Tables vi
List of Figures vii
Abstract ix
Chapter 1: Introduction 1
1.1 Data Management . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Grid Computing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Scientific Workflows . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.4 Workflow Management Systems . . . . . . . . . . . . . . . . . . . . . 5
1.5 Research Questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.6 Intellectual Contributions . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.7 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
Chapter 2: Background 11
2.1 Workflow Management Systems . . . . . . . . . . . . . . . . . . . . . 11
2.2 Condor-DAGMan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.3 Data Placement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.4 Policies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.5 Data Management Policies . . . . . . . . . . . . . . . . . . . . . . . . 19
2.5.1 Data Placement . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.5.2 Scope of Data Placement Policies . . . . . . . . . . . . . . . . 19
2.5.3 Policy Enforcement Times . . . . . . . . . . . . . . . . . . . . 22
2.5.4 Data Placement Policies . . . . . . . . . . . . . . . . . . . . . 22
2.5.5 Data Placement Policies for Scientific Workflows . . . . . . . . 24
2.6 Non-deterministic Data Placement . . . . . . . . . . . . . . . . . . . . 26
2.7 Related work on Policy Based Data Management . . . . . . . . . . . . 27
Chapter 3: Policy Based Data Placement 34
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.2 PDPS Design and Architecture . . . . . . . . . . . . . . . . . . . . . . 36
3.2.1 PDPS Architecture . . . . . . . . . . . . . . . . . . . . . . . . 37
3.2.2 PDPS Pegasus Framework . . . . . . . . . . . . . . . . . . . . 39
3.2.3 Policy Authoring . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.3 Policies Evaluated . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.4 PDPS Standalone Performance Evaluation . . . . . . . . . . . . . . . . 42
3.4.1 Tier Based Dissemination . . . . . . . . . . . . . . . . . . . . 43
3.4.2 n-copy Replication . . . . . . . . . . . . . . . . . . . . . 45
3.4.3 Policy engine performance . . . . . . . . . . . . . . . . . . . . 47
3.5 Interfacing Leaf Jobs with PDPS . . . . . . . . . . . . . . . . . . . . . 48
3.5.1 Integrating Pegasus and PDPS - Leaf jobs . . . . . . . . . . . . 50
3.5.2 Workflows Evaluated . . . . . . . . . . . . . . . . . . . . . . . 53
3.5.3 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . 55
3.6 Interfacing Non-Leaf Jobs with PDPS . . . . . . . . . . . . . . . . . . 66
3.6.1 Integrating Pegasus and PDPS - Non-Leaf jobs . . . . . . . . . 66
3.6.2 Stage-in Policies . . . . . . . . . . . . . . . . . . . . . . . . . 68
3.6.3 Workflows Evaluated . . . . . . . . . . . . . . . . . . . . . . . 70
3.6.4 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . 73
Chapter 4: SDAG 82
4.1 Well Formed Workflows (WFW) . . . . . . . . . . . . . . . . . . . 84
4.2 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
4.2.1 Workflow Systems . . . . . . . . . . . . . . . . . . . . . . . . 85
4.2.2 Design coupling between applications and WMS tools . . . . . 87
4.3 SDAG . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
4.3.1 SDAG Design Criteria . . . . . . . . . . . . . . . . . . . . . . 91
4.3.2 SDAG Architecture . . . . . . . . . . . . . . . . . . . . . . . . 92
4.4 Statistical Properties of SDAG . . . . . . . . . . . . . . . . . . . . . . 96
4.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
Chapter 5: Evaluating WMS tools with SDAG 99
5.0.1 Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . . . . 100
5.0.2 PDPS Policies . . . . . . . . . . . . . . . . . . . . . . . . . . 100
5.0.3 Experimental setup . . . . . . . . . . . . . . . . . . . . . . . . 101
5.1 Results and Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
5.1.1 Control WFW set . . . . . . . . . . . . . . . . . . . . . . 104
5.1.2 Weighted WFW set . . . . . . . . . . . . . . . . . . . . . 110
5.2 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
5.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
Chapter 6: Conclusions 118
Bibliography 121
List of Tables
3.1 Number of Data Placement Jobs for all Workflows . . . . . . . . . . . . 55
3.2 I/O Statistics for Montage workflows . . . . . . . . . . . . . . . . . . . 58
3.3 Epigenomics runtime comparison . . . . . . . . . . . . . . . . . . . . 61
3.4 Broadband runtime comparison . . . . . . . . . . . . . . . . . . . . . . 64
5.1 Results for Pairwise T-Test, Control set, Overall Completion Times . . . 105
5.2 Results for Pairwise T-Tests . . . . . . . . . . . . . . . . . . . . . . . . 106
5.3 Median Overall Runtimes (in seconds) for Weighted WFWs . . . . . . 112
List of Figures
1.1 Pegasus WMS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1 Policy Composition . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.1 PDPS Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.2 Overall Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.3 Example Policy Expression . . . . . . . . . . . . . . . . . . . . . . . . 41
3.4 Tiered model of data distribution from the high energy physics domain . 42
3.5 Tier-based dissemination 1 MB files . . . . . . . . . . . . . . . . . . . 44
3.6 Tier-based dissemination 1 GB files . . . . . . . . . . . . . . . . . . . 45
3.7 3-copy replication - 1 MB files . . . . . . . . . . . . . . . . . . . . . 47
3.8 3-copy replication - 1 GB files . . . . . . . . . . . . . . . . . . . . . 47
3.9 Performance of Drools policy engine . . . . . . . . . . . . . . . . . . . 48
3.10 Overall Control Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.11 Data Placement Service . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.12 Overall workflow runtime - Montage . . . . . . . . . . . . . . . . . . . 57
3.13 8-Degree Montage stage-out jobs . . . . . . . . . . . . . . . . . . 58
3.14 8-Degree Montage clean-up jobs . . . . . . . . . . . . . . . . . . 59
3.15 Average clean-up job Montage . . . . . . . . . . . . . . . . . . . . . . 59
3.16 Average stage-out job Montage . . . . . . . . . . . . . . . . . . . . . 60
3.17 Epigenomics stage-out jobs . . . . . . . . . . . . . . . . . . . . . . . 61
3.18 Number of files in each stage-out job . . . . . . . . . . . . . . . . . . . 62
3.19 Epigenomics clean-up jobs . . . . . . . . . . . . . . . . . . . . . . . . 63
3.20 Broadband stage-out jobs . . . . . . . . . . . . . . . . . . . . . . . . 64
3.21 Broadband clean-up jobs . . . . . . . . . . . . . . . . . . . . . . . . . 65
3.22 Partial View of Executable Montage Workflow . . . . . . . . . . . . . . 72
3.23 Experimental Setup - Physical Placing . . . . . . . . . . . . . . . . . . 74
3.24 8-Degree, Montage Overall Runtime with Stage-in Jobs Handled by PDPS 76
3.25 Synthetic Workflow - 4.2 GB Dataset . . . . . . . . . . . . . . . . . . . 78
3.26 Partial Synthetic Workflow DAG . . . . . . . . . . . . . . . . . . . . . 79
3.27 Synthetic workflow (26.16GB dataset) . . . . . . . . . . . . . . . . . . 80
4.1 Workflow Search Space . . . . . . . . . . . . . . . . . . . . . . . . . . 91
4.2 SDAG Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
4.3 SDAG Statistics (a) Levels/WFW (b) Jobs/level (c) Job Runtimes (d)
Bandwidth / job . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
4.4 SDAG Statistics (Dependencies / Job) . . . . . . . . . . . . . . . . . . 98
5.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
5.2 Control set - Overall Execution Time . . . . . . . . . . . . . . . . . . . 104
5.3 Control set (a) Input Data / Level (b) Dependencies / Level . . . . . . . 107
5.4 Control set (a) Overall time difference with default (b) Overall time dif-
ference between Data and Dependency . . . . . . . . . . . . . . . . . . 110
5.5 Weighted set - Overall completion time . . . . . . . . . . . . . . . . . . 111
5.6 Weighted set - (a) Dependencies/Level (b) Input Data / Level . . . . . . 113
5.7 Weighted set (a) Overall time difference with default (b) Overall time
difference between Data and Dependency . . . . . . . . . . . . . . . . 114
Abstract
Scientific domains are increasingly adopting workflow systems to automate and manage
large distributed applications. Workflow Management Systems (WMS) manage overall
scheduling and monitoring of both compute and data placement jobs for such applica-
tions. The management of data placement jobs in WMS provides the overall context for
the problems addressed in this thesis. This thesis starts by automating data placement
for scientific applications based on user provided data-placement policies and proceeds
to interface a WMS with a policy based data placement service (PDPS). It provides
a solution for the lack of testing data in workflow science by developing a Synthetic
Directed Acyclic Graph Generator (SDAG) and using synthetic workflows generated by
it in a case study.
This thesis relies on actual software development and experimental analysis for
both major research contributions. Experimental results using existing workflows prove
the immediate benefit of PDPS for mid-sized virtual organizations. Results for SDAG
demonstrate its usefulness in the design and development of future WMS.
Chapter 1
Introduction
Over the past three decades, the practice of using multiple computers to solve problems
that are either too big or too complex for a single computer has expanded rapidly and
spawned various independent fields of scientific research. The idea of using multiple
computers simply by itself poses questions on how these computers would be shared,
what constitutes a shared resource, who can share what, and for how long? Answers
to these and other such questions have led to multiple mechanisms called middleware
[Pro05b] [Nie] [JRA06] [BFK+00] [Zha]. Middleware are collections of software com-
ponents that work in unison to form collaborations, or in other words, groups of shared
resources. For example, a computer science department may have a number of comput-
ers that it shares with other departments, or even other institutions. The actual sharing of
those resources, outside the local administrative domain of the computer science depart-
ment, constitutes a collaboration.
Big collaborations are now omnipresent in almost all fields of scientific research
[BBD+07] [DAB+03] [BDG+04b] [MKJ+07]. Such collaborations are often composed
of many investigators that are part of different physical organizations from potentially
disparate fields of science. The resources shared between such collaborations are typi-
cally geographically distributed. Resources that are shared include, but are not limited
to, scientific instruments, computing power, and storage capacity. While most sharing
occurs asynchronously, there are many cases, such as sensor data feeds, where sharing
happens in realtime [DGR+05].
As the most common type of sharing is that of raw or processed data across col-
laborations, data management in large collaborations requires special attention. Many
data management issues like data scheduling, data transfer protocols, and storage
space utilization have been at the forefront of recent Big Data research initiatives
[BDG+04b] [AAA+08]. Before discussing data management in distributed systems in
further detail, we familiarize readers with some common concepts that are extensively
used in this thesis. That said, there are instances where workflow management systems
(WMS) specific terminology is used prior to being defined. Relevant forward references
will be provided in such cases.
Today, scientific communities are faced with an increasing number of prob-
lems that require High Throughput Computing (HTC) to solve, analyze,
or visualize. Fields that make use of these technologies include medi-
cal research, bioinformatics, physics, astronomy, chemistry and earthquake science
[CMG+10] [JDV+09] [KDG+07] [BDG+04b]. A common factor in all such situations
is the requirement of handling data management tasks intelligently. Every such appli-
cation requires input data and produces output data. Typically, the input data is orders
of magnitude larger than the output or results data. It is common for large applica-
tions’ input data to be multiple terabytes, or in some cases petabytes [AAA+08]. As the
compute nodes used to run these large applications generally do not possess such large
storage capacities locally, data is usually broken up into multiple files. Moving and
managing such a large number of files manually is a tedious, repetitive, time-consuming,
and extremely error-prone process. Consequently, it is best done automatically.
1.1 Data Management
Data management in distributed systems entails moving and manipulating data in ways
that ensure its availability at the place it is required, at the time it is required, and in the
form that it is required. In the context of HTC, data management pertains to maintain-
ing input data at storage sites such that it is reliable and highly available for geograph-
ically distributed resources. A number of important data management characteristics
such as data storage, metadata storage, data policies, and naming and location services
contribute to the overall success of distributed data management tasks. In particular,
the owners of storage or compute resources will likely enforce policies that stipulate
resource usage parameters in detail that can potentially limit a workflow’s ideal execu-
tion conditions.
1.2 Grid Computing
Grid computing, or Scientific Grids, focuses on bringing many independent collabo-
rators or institutions together to share resources for combined use in joint scientific
research. Such a grouping is termed a Virtual Organization, or VO. A VO provides
resources to individual organizations that are usually beyond their local capacity. Scien-
tific collaborations over grids typically use resources that are geographically distributed
in wide area networks within VOs. Workflow management systems (section 1.4) are
commonly used to assign and monitor these resources. As these resources are shared,
local resource usage policies take precedence over the globally accepted VO level poli-
cies. When an organization commits to a VO, it also agrees to what resources will be
shared, who is allowed to share them, how, when, and for how long the resources may be
shared. These agreements are implemented in the shape of local and global policies. The
actual implementation of grid computing is manifested in different middleware toolk-
its like the Globus Toolkit [All05b]. Middleware environments enable grid computing
by allowing participating organizations to join or leave VOs, trust one another, publish
shared resources, broker the use of those resources, use resources and provide a host of
application specific services. There are currently multiple middleware platforms avail-
able to the scientific community [DBGK02] [CGH+06] [OLK+07b] [WPF05].
1.3 Scientific Workflows
Scientific workflows are instances of computational and data management jobs grouped
together in a predetermined order with data or control dependencies between jobs. One
or more tasks are grouped together to form jobs. A large, repeatable, experiment that
contains multiple steps can thus be described as a workflow. It can be repeated under
similar, or if required, different conditions, and results compared for analysis. A scien-
tific workflow is authored as an abstract workflow that specifies the input data require-
ments, computational tasks and related dependency information. The abstract repre-
sentation is in the form of a Directed Acyclic Graph (DAG). A DAG is presented to
a WMS along with information on available storage and execution sites. Additional
parameters such as movement of resulting output data to permanent storage, cleanup of
non-essential files or requirement of moving executable files to the computational site
are also made available during the planning process. Once the workflow is planned, the
resulting DAG contains the references to additional tasks such as data stage-in, stage-
out, clean-up and directory management tasks. This representation of the workflow is
termed as the executable workflow, or DAX. The additional data management tasks in
the DAX are concrete. By this we mean that each of the data management tasks con-
tains the complete information required to independently execute a data transfer job or
a compute job. This may include the source-destination tuple, authentication require-
ments, local and global environment variables, etc. Figure 1.1 shows the conversion
of an abstract workflow into an executable workflow for the Pegasus WMS. A more
detailed explanation of the figure is provided in following chapters.
[Figure 1.1: Pegasus WMS. An abstract workflow is planned, using resource locations and site information, into an executable workflow that DAGMan and Condor execute on grid or cloud based execution sites.]
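To make the planning step sketched in Figure 1.1 concrete, the following is a minimal, hypothetical Python sketch (not the Pegasus interface; all job, file, and URL names are invented for illustration) of how a planner might wrap a two-job abstract workflow with concrete stage-in, stage-out, and clean-up tasks. Dependency edges between jobs are omitted here for brevity.

```python
# Hypothetical sketch: an abstract workflow names only compute jobs and their
# data; "planning" adds stage-in, stage-out and clean-up tasks with full
# source/destination information, yielding an executable task list.

abstract_wf = {  # job name -> (input files, output files)
    "project_image": (["raw.fits"], ["proj.fits"]),
    "add_images":    (["proj.fits"], ["mosaic.fits"]),
}

def plan(wf, storage_url, scratch_dir):
    produced = {f for _, outs in wf.values() for f in outs}
    consumed = {f for ins, _ in wf.values() for f in ins}
    tasks = []
    for job, (ins, outs) in wf.items():
        for f in ins:                      # stage in inputs no upstream job produces
            if f not in produced:
                tasks.append(("stage_in", f"{storage_url}/{f}", f"{scratch_dir}/{f}"))
        tasks.append(("compute", job))
        for f in outs:                     # stage out final products only
            if f not in consumed:
                tasks.append(("stage_out", f"{scratch_dir}/{f}",
                              f"{storage_url}/results/{f}"))
    tasks.append(("clean_up", scratch_dir))  # remove non-essential files afterwards
    return tasks

for t in plan(abstract_wf, "gsiftp://storage.example.org/data", "/scratch/run01"):
    print(t)
```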
1.4 Workflow Management Systems
Lately, scientific communities have started to adopt WMS. These are systems designed
to orchestrate the precise execution of scientific workflows. Typically, these systems
offer capabilities to create or author workflows. Thus, scientists are able to port existing
distributed, parallel, or sequential applications into new workflow definitions. WMS
then plan an abstract workflow to produce an executable workflow. WMS are presented
with resource and I/O information for where to execute the workflow, where to look
for and where to place related data. Additionally, security issues such as authentication
and authorization are also managed within WMS. Finally, the WMS are designed to
monitor workflow progress and if required take appropriate actions in case failures are
encountered.
Individually, these systems have adapted to specific niche domains and have excelled
in providing the kinds of services deemed necessary for those areas. For example, the
Kepler workflow management system is predominantly used for bioinformatics work-
flows [PHP+07]. Thus applications from the biology domain influence the new capabil-
ities being developed for Kepler. Consequently, there may be an earthquake simulation
[CMG+10] application that may not run as efficiently on Kepler as it would using Pega-
sus WMS [DMS+07].
1.5 Research Questions
This research tackles two questions. Having considered the first question, the second
question is a natural progression of new research that emanates from the first. We begin
with a discussion on policies and their automated enforcement in distributed systems.
1. Is the enforcement of policies a feasible way of automating data placement jobs
in distributed systems? If so, can such policies be enforced during the creation
of data placement jobs in WMS?
The amount of data movement that takes place in distributed systems for any sin-
gle application can vary from a few kilobytes to petabytes. We are interested in
applications that tend to fall in the latter category. When the number of data man-
agement jobs surpasses hundreds, human-administered execution of these
jobs not only becomes extremely difficult, but also extremely error-prone. For
such large applications, data management is best done automatically. However,
the level of automation is still a variable. On the one hand, data placement jobs
can be automated at the level of the application, i.e. each application manages
its own data placement jobs. With this approach, cross application coordination
of resource utilization becomes a major issue. On the other hand, data man-
agement operations can be centralized, thus ensuring consistency across appli-
cations. However, as with all centralized models, the management scalability of
such approaches is limited. If such a centralized model offers a medium-scale,
VO-level centralized management service, then it stands out as a viable option for
policy enforcement within individual VOs.
In either case, the significance of automated data placement is obvious. The addi-
tional perspective that we bring to this scenario is to enable this automation itself to be
triggered automatically. By this we mean that policies are allowed to invoke the automated
data management jobs. Such policies are the technical manifestation of user Ser-
vice Level Agreements (SLAs). As a result, changes to data placement schedules
are decided based on policies that can be changed without affecting the regular
operations of a distributed system.
If data placement jobs are scheduled based on policies at their inception (or cre-
ation), then the reactive or time-epoch-based policy enforcement can be replaced
with a proactive policy-based data placement service that enforces data placement
policies even before the data placement job is first executed. This reduces the
time taken in enforcing a data placement policy along with reducing the overall
execution time of the corresponding workflow. We discuss this issue in chapter 3.
2. Is there enough data available to confidently measure the generic design of
workflow management system tools and components?
While testing and evaluating the software for the above issue, we established that
there is a dearth of workflows with particular characteristics that can be used in
the evaluation of WMS related software tools and components. As a result, WMS
components are tested and evaluated based on a small set of real workflow appli-
cations.
Workflow management systems are typically born from a unique set of require-
ments that conform to a set of real world applications. The growth of WMS capa-
bilities is also based on requirements of real workflow applications. Such WMS
designs are sub-optimal for applications whose functional requirements conflict
those of the WMS design. If the WMS was designed to cater to a larger variety
of applications at the early stages of development, then design trade-offs that are
posed to new applications would be reduced. There is a lack of test data for design
and development of WMS components. If a broad enough variety of test data is
made available to the WMS during design and evaluation stages, then, such lim-
itations would be largely overcome. However, to construct such a data set is not
a trivial task and due thought needs to be given to designing and validating such
datasets. We address this issue in detail in chapter 4.
1.6 Intellectual Contributions
Exploring answers to the questions in section 1.5 has led us into interesting research
issues and highlighted underlying and peripheral problems associated with them. In
exploring these issues, this thesis makes the following main contributions:
Studies the impact of offloading data placement jobs from Workflow Manage-
ment Systems to a Data Placement Service on the overall efficiency of executable
workflows and resource utilization.
Identifies the lack of synthetic data in the form of workflow applications in test-
ing and evaluating WMS components. Identifies the defining characteristics for
workflow applications when viewed as formally written DAGs, or in other words
as abstract workflows, and presents a synthetic workflow generator that can be
tuned to sample bounded parameter space around a reference workflow uniformly
at random.
1.7 Thesis Outline
The rest of this thesis is organized as follows. In chapter 2 we discuss the background
and related work for our problem setting in general. In chapter 3, we present the design
and evaluation of the Policy Based Data Placement Service (PDPS). Chapter 3 also
deals with the interfacing of PDPS with the Pegasus WMS and evaluation of differ-
ent data placement policies and their impact on three real workflows. In chapter 4 we
present SDAG, a synthetic well formed workflow generator, and analyze the statistical
properties of workflows generated by it. In chapter 5 we present a case study that eval-
uates PDPS based on workflows produced by SDAG. We finally conclude in chapter 6.
We touch upon different possible directions that could be explored as possible solutions
in each of the core research based chapters. We also provide possible future research
direction in each case.
Chapter 2
Background
This thesis builds on existing work in several areas of distributed computing: data place-
ment, scientific workflows, policies and data scheduling. In this chapter, we will talk in
general about these areas and their related work. Typically, workflows are a set of applica-
tion level abstractions that represent a number of jobs in an acyclic graph that accounts
for all dependencies within them. Our work on data placement refers to workflows and
how data placement can contribute to their efficiency. It is therefore important to have
an understanding about workflows in general before going on to data placement and
policies.
In this chapter we start by discussing the Pegasus WMS. We briefly describe
the constituent components of Pegasus. Next, we talk about policies in general and data
management policies in particular. We list related work on policy based data manage-
ment at the end of this chapter.
2.1 Workflow Management Systems
Given a workflow and a set of resources that are to be used in the workflow, Pegasus
(Planning for Execution in Grids) [DMS+07] is a workflow management system that
creates executable directed acyclic graphs (DAGs) that can be run in any target environ-
ment. The process of taking an abstract workflow and assigning related resources to
it, thus creating an executable workflow, relieves scientists of constantly monitoring
resources and scheduling their jobs around resource availability.
A similar open source engine, Taverna [OLK+07a], provides the ability to compose,
execute, and manage workflows in heterogeneous environments with a minimal set
of standards. Taverna does not define standards for I/O, data representation or annota-
tions; everything is represented as strings. Originally designed for the Bioinformatics
community, it has now expanded into many other scientific domains. It provides clients a
GUI-based workbench for ease of use, service discovery via web services, provenance
tracking and result visualization. Other workflow management tools include
Triana [TSWH07b, CGH+06], Kepler [LAB+06] and Askalon [WPF05]; they are explained
in more detail in section 4.2.1.
2.2 Condor-DAGMan
Pegasus relies heavily on Condor-DAGMan for post planning, execution, and monitor-
ing, of workflows. Once a workflow is planned and ready for execution, Pegasus hands
over the executable workflow to DAGMan. DAGMan understands the acyclic nature
of job interdependencies. To start with, DAGMan looks for root jobs and releases them
to Condor for submission. Later, as and when it receives successful completion noti-
fications for already submitted jobs, it releases newer jobs whose parent jobs have all
successfully terminated. In this way, DAGMan manages the job releases to Condor and
ensures that only those jobs that have met all dependency requirements are released.
Condor is a meta-scheduler that runs various daemons on both the execution sites as
well as the submit hosts. Condor monitors the resources that are available to a particu-
lar universe. Although it may receive jobs ready for execution from DAGMan, Condor
queues these jobs and waits for the correct resource to become available before submit-
ting the jobs to that resource. In this way, the tandem of Condor and DAGMan aptly
manages job submissions across highly distributed resources in a manageable and fault-
tolerant manner.
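The release logic described above can be sketched roughly as follows. This is an illustration of the idea only, not DAGMan's actual implementation, and the submit callback is a hypothetical stand-in for handing a job to Condor.

```python
# Sketch of dependency-driven job release: root jobs go first, and a job is
# released only once every one of its parents has completed successfully.

def run_dag(jobs, parents, submit):
    """jobs: list of job names; parents: job -> set of parent job names;
    submit(job) executes the job and returns True on success."""
    done, failed = set(), set()
    pending = set(jobs)
    while pending:
        ready = [j for j in pending if parents.get(j, set()) <= done]
        if not ready:          # remaining jobs are blocked by a failed ancestor
            break
        for job in ready:
            pending.remove(job)
            (done if submit(job) else failed).add(job)
    return done, failed

# Example: a diamond-shaped DAG; A is the root, D depends on both B and C.
parents = {"B": {"A"}, "C": {"A"}, "D": {"B", "C"}}
done, failed = run_dag(["A", "B", "C", "D"], parents,
                       submit=lambda job: print("submitting", job) or True)
```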
2.3 Data Placement
Data placement can be thought of as a superset of data transfer. Placement operations
invoke a set of data transfer tasks to achieve a desired state in the distributed system.
Data placement algorithms may have several goals. Often, these algorithms attempt
to improve the performance of applications by staging input data sets (consumed data)
from storage sites to the compute nodes where execution takes place, or by staging out-
put data sets (produced data) to storage systems after execution completes. Placement
algorithms may also attempt to provide greater reliability and availability for data sets
by replicating or moving data across storage sites. Data Placement Services (DPS) are
high-level facilities that use existing services to accomplish complex data management
operations.
Although data dependencies can be trivial for many small-scale applications, they
can be very complex for large scientific applications, making efficient data placement a
challenging task. In grid collaborations, a workflow planner enforces these data depen-
dencies and finalizes the eventual flow of jobs in the DAG by creating an executable
workflow, or DAX. Hereafter, the earmarked resources must be available for the DAG to
complete execution. The executable workflow, being acyclic, will schedule data move-
ment jobs before the dependent compute jobs within the workflow. This may cause
inherent delay for the dependent computational jobs as they wait for data movement
jobs to complete. Such dependencies can be separated from the workflow in a number
of ways. Bharathi, et. al. [BC09b], characterize the interaction between Data Placement
Services (DPS) and workflow execution managers in three varying degrees: de-coupled,
loosely coupled, and tightly coupled.
Asynchronous data placement for large scientific computations provides significant
gains in the efficiency of the workflows involved [CDL+07a]. Computation can be
significantly slowed down if data stage-in / stage-out jobs are executed as part of the
scientific workflow. Computational jobs have to wait for data stage-in placement jobs
to complete before commencing. Such jobs may be stalled unnecessarily if there is an
unforeseen hold up in the parent stage-in job. Work by Chervenak, et. al. [CDL+07a],
studies the relationship between data placement and workflow management systems using
the Montage large scale scientific application [BDG+04b]. The work highlighted the
requirement of data placement policies that contributed to reliability and availability of
data while increasing the efficiency of the workflow.
In large scientific applications, data is not only used as input to and output from
entire applications, but a large amount of intermediate data is also produced. This inter-
mediate data is termed as transient data and is typically in the form of output data from
one job in the DAG that later becomes the input data to the following job in the DAG.
Typically such data is not treated as results, but rather as a required input to a follow-
ing job and is therefore not required to be stored for the long term. Persistent data
on the other hand is useful analytical data, which can be used as results. Persistent
data warrants long term storage. While the output data from applications are usually
persistent, not all input data may be persistent. For example, image data produced by
genome sequencing machines is processed to produce alignments that are orders of mag-
nitude smaller than the raw image data. In such cases, the persistent data would be the
alignment data. Raw image data is discarded over time to make space for new image
data. This shows that the categorization of data into persistent or transient is application
dependent.
Data placement services are concerned with persistent data. Transient data may be
subject only to policies affecting storage space, network bandwidth or CPU usage, while
persistent data is subject to additional VO level policies affecting data reliability and
availability. As highlighted in the example above, the final decision on imposing data
policies on either type will be dependent on the current hardware / software environment
and the application requirements.
To better understand upcoming concepts, we discuss a few earlier works before
explaining data placement policies in detail. The Replica Location Service (RLS)
[CSR+09] is a grid service that maps logical file names to physical file locations in a
distributed collaboration. The Globus Data Replication Service (DRS) [All06][AC05] is
a DPS that uses the Globus RLS and the Reliable File Transfer (RFT) [MHA02] services
as underlying components. The RFT is a Web Services Resource Framework (WSRF)
[htt] compliant service that acts like a job scheduler for data movement jobs. The Laser
Interferometer Gravitational-Wave Observatory (LIGO) Data Replicator (LDR) [Pro04]
is a lightweight tool used for replicating data sets to VO members. As in the DRS,
LDR uses Globus GridFTP and RLS as underlying components and is an example of a
DPS. PhEDEx [BMM
+
05], short for Physics Experiment Data Export is a data place-
ment and file transfer service that uses intelligent software agents to manage tiered data
placement.
Bharathi, et. al. [BC09a], define data placement for large-scale scientific applications
on distributed systems in three distinct stages. First, a workflow planner such as Pega-
sus [DMS+07] is used to plan the entire workflow for the application. This planning
encompasses data placement as well as computational jobs. Second, the workflow is
submitted to a Grid for execution. Here, a workflow execution manager like Condor
DAGMan [Fre02] ensures that the dependencies between all jobs in the workflow are
maintained and failures handled within the available resources. Finally, meta-schedulers
like Condor-G [Fre02] manage the job submission and handle failures. During the
first stage, the workflow planner creates a DAG of the constituent jobs for the entire
application. This DAG can either contain data placement jobs intertwined within the
computational workflow or it can delegate the data placement to a separate Data Place-
ment Service (DPS) [CDL+07a]. In the latter case, there may or may not be a need to
exchange state between the workflow managers and the data placement service to run
the workflow efficiently [BC09a]. Our research investigates improvements to the latter
option where data placement jobs are delegated by the planner to a DPS.
Chervenak, et. al. [CS07], discuss a data placement service that allows for imple-
mentation of VO policies during data placement jobs. Their policy based placement
service attaches as an abstract layer over the placement service such that the decisions
of the placement service are suggested by the policy service. These placement decisions
are asynchronous with respect to the relevant computational workflow. They demon-
strate that asynchronous prefetching of required input data before workflow execution
significantly improves performance of data intensive scientific workflows.
2.4 Policies
Computational systems management in large distributed systems such as Grids requires
efficient automated services to address both the scale of hardware involved as well as
the variety of software systems that comprise it. Human management by direct manipu-
lation becomes unfeasible at such scales. A requirement to automate such management
led to the creation of distributed policy management systems such as the Grid Security
Infrastructure [TCF+03].
In general, a policy is a plan of action that leads to a desired outcome. Policy
logic defines system conditions that must be monitored. If met, these conditions trigger
enforcement actions. While a policy is a first order logic statement of the form ‘if (con-
dition) then (action)’, such logic may be arbitrarily complex. Complexity is achieved
by making simple first order logic rules and then grouping these rules together into sets,
where each set represents a policy. This concept is illustrated in Figure 2.1. In our work,
a policy is a set of one or more rules, and its enforcement entails triggering actions in
the rules set. The effect of policy-based management is the automation of tasks based
on rules that may be dynamic. Using such a model enables a system to act/react effi-
ciently to changes in its state. The logic for policies is external to the policy management
system. This means that the policy management system can be in operation while an
existing policy may be changed or a new one added.
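As a small illustration of this rule-grouping model (a hypothetical Python sketch, not the rule language used by the actual policy engine later in this thesis; the state attributes shown are invented), a rule pairs a condition with an action, and a policy is simply a named set of rules evaluated against the observed system state:

```python
# Hypothetical sketch of the rule/policy model: a policy is a set of
# first-order "if (condition) then (action)" rules evaluated over system state.

class Rule:
    def __init__(self, condition, action):
        self.condition = condition   # function: state -> bool
        self.action = action         # function: state -> None (the enforcement step)

class Policy:
    """A policy is a named set of rules; enforcing it fires every rule
    whose condition holds in the observed system state."""
    def __init__(self, name, rules):
        self.name, self.rules = name, rules

    def enforce(self, state):
        for rule in self.rules:
            if rule.condition(state):
                rule.action(state)

# Example: a storage policy composed of two simple condition/action rules.
storage_policy = Policy("scratch-space", [
    Rule(lambda s: s["free_gb"] < 50,
         lambda s: print("trigger clean-up of transient files")),
    Rule(lambda s: s["oldest_file_days"] > 3,
         lambda s: print("trigger archival to long-term storage")),
])
storage_policy.enforce({"free_gb": 20, "oldest_file_days": 5})
```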
With large distributed collaborations, implementing policies becomes an efficient
way for effective resource management. While the current Grid Security Infrastructure
(GSI) model allows for four types of decision semantics for resource access (permit,
deny, unknown, not enough information), policies encompassing areas other than secu-
rity require a wider range of decision semantics. In the case of data management, policy
enforcement may require the maintenance of a certain state in the distributed collabora-
tion. For example, the maintenance of at leastn copies of a file in the system requires
a decision semantic that attempts to achieve and then maintainn copies of a file in the
system. This departure from reacting to client queries and authorizing or denying access
to resources to a proactive system state maintenance is a prime area of discussion and
analysis of this thesis.
[Figure 2.1: Policy Composition. A policy is composed of a set of rules, each pairing conditions with actions.]
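The proactive, state-maintaining semantics discussed above, such as keeping at least n copies of a file, can be sketched as an action that drives the system back toward the desired state rather than merely permitting or denying a request. The replica catalog and transfer-scheduling calls below are hypothetical placeholders, not a real service interface:

```python
# Hypothetical sketch of proactive n-copy maintenance: compare the observed
# state with the desired state and schedule data placement jobs to close the gap.

def maintain_n_copies(filename, n, catalog, available_sites, schedule_copy):
    """Ensure at least n replicas of `filename` exist; returns the copy jobs scheduled.
    Assumes the catalog already records at least one replica to copy from."""
    replicas = catalog.get(filename, [])
    missing = n - len(replicas)
    scheduled = []
    for site in available_sites:
        if missing <= 0:
            break
        if site not in replicas:                      # place new copies on new sites only
            scheduled.append(schedule_copy(filename, source=replicas[0], dest=site))
            missing -= 1
    return scheduled

# Toy invocation: one existing replica, desired state is three copies.
catalog = {"calib.dat": ["siteA"]}
jobs = maintain_n_copies("calib.dat", n=3, catalog=catalog,
                         available_sites=["siteA", "siteB", "siteC"],
                         schedule_copy=lambda f, source, dest: (f, source, dest))
print(jobs)   # [('calib.dat', 'siteA', 'siteB'), ('calib.dat', 'siteA', 'siteC')]
```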
In distributed scientific collaborations, policies may be defined by the VO, by a
VO member institution, or by an application that is designed to manage the usage of
resources. In Grids, a policy is an understanding or a requirement set forth by an entity
that regulates the usage of its resources. These requirements may be subject to con-
straints of certain state to be maintained or security measures to be met for the resources
contributed towards the Grid.
Policy based management has been incorporated in almost all large-scale systems
that involve distributed resource usage. While the emphasis of the current state is on
security policies in distributed systems [Slo94] [LFS+06] [LKS03] [SC01] [VSC+02]
[WBKS05], most relevant to this proposal are works that deal directly with data storage
and movement. In particular, we are interested in Data Placement Policies that can be
employed for Scientific Workflows.
2.5 Data Management Policies
2.5.1 Data Placement
Data placement decisions are made by local administrators as well as by VO level
administrators for various reasons. A data placement policy must have one of the fol-
lowing two actions executed in its action set: Copy a file to a specified location or delete
a file from a specified location. These actions may be conducted outside of the
policy engine’s purview or they may be carried out within the purview of the policy
engine. However, it is up to the administrator to categorize a policy as a data placement
policy or otherwise. There are several justifications as to why a data placement policy
may be put into effect. Although the semantics of the policy do not display the intention
behind the policy, the core motivations behind deploying data placement policies are
scalability, reliability, high availability, and fault tolerance.
2.5.2 Scope of Data Placement Policies
It is important to segregate data placement policies from generic policies. Administra-
tors often combine two different policies and create a larger more complex policy. If
some portion of this policy makes data placement decisions and then executes the cor-
responding action set, should the overall policy be termed as a data placement policy?
One may argue that in the case of reliability, a file may be checked for data integrity
before making a placement decision. Thus actions that perform integrity checks seem
to be part of data placement policies. However, it is important to distinguish that these
actions of performing data integrity checks on files do not fall under the category of
data placement policies. A data placement policy would entail checking if the integrity
check has passed or failed, and then invoking the corresponding action set. Typically
that would be to create as many new copies of the corrupt file from its clean uncorrupted
source as there are corrupted files. The placement policy dictates how many copies must
exist, not how or when to perform the check. Thus we say that this is a policy group that
consists of a general policy and a data placement policy.
Such separation of data placement policies from general policies is important as
administrators may place blanket requirement covers on all data placement policies that
should not affect other generic policies. An example of such a cover is that local data
placement policies will always have priority over VO level placement policies. Data
placement is usually constrained by system conditions. These conditions are those that
eventually return true / false results to a policy decision point that leads to a policy being
executed. Some of the typical placement constraints that are often used for placement
policy decisions include data count checks, data location checks, storage space checks,
data size checks, and data access frequency checks. Each of the following policy cate-
gories contains multiple types of policies that can be used as data management policies.
Policies may overlap between these categories. However, we try to keep this discussion
focused on isolated policies and their individual groupings.
1. Security policies : These include user authentication policies, trust policies, audit
trails policies and user management policies.
2. Storage domain policies: Policies in this category may include the following
(a) Time based storage: These policies include conditions that may be based on
the setting of the start time (when the data item in question can start existing
on a resource) or the end time (the time after which a data item may not exist
on a resource). These policies may also provide time-based storage in terms
of time windows. For example, if a data item is stored on a resource, it may
continue using storage for at most 3 days.
(b) Size based storage: Storage based constraints may include the amount of
storage space allocated to a data item.
(c) Storage organization: These policies include organizing data in manage-
able hierarchies where the number of data items in a collection is large. For
example, for immediate data, a faster but smaller storage resource may be
allocated while for archival results, a slower but larger storage may be allo-
cated.
3. Data monitoring: Policies that fall into this category may include conditions con-
cerning metadata creation or metadata storage, checking data integrity or checking
for data validation.
4. Accounting policies: Accounting policies are concerned with creating actions
such as replica registration, monitoring space utilization for payment or other
accounting purposes.
5. Data replication policies: Replication policies aim at achieving single or multiple
effects by creating copies of existing data.
6. Data archival policies: Data archival policies deal with managing the entire data
lifecycle. These include policies concerned with data versioning; maintaining the
correct ownership and rights of digital data over different versions of the underlying
OS and hardware; maintaining a chain of custody for the data; and maintaining the software
required to access the digital archive. These policies are also used to classify
data based on data type.
2.5.3 Policy Enforcement Times
There are four distinct times when data management policies can be enforced: immedi-
ately; deferred to some time in the future; iteratively or periodically; and finally, interactively,
or on demand.
2.5.4 Data Placement Policies
Data placement policies are a subset of data management policies. These policies are
primarily concerned with the movement and storage of data. We can classify them into
two groups, movement policies and storage and manipulation policies.
1. Data Movement Policies. These policies are primarily concerned with copying
of data items from one location to another. There are a number of factors that can
be considered in creating these policies.
Order of fetching data. The priority scheme through which data transfers
are scheduled often makes an impact on the performance of the requesting
processes.
Triggers for initiating transfers of data need to be carefully considered as
these may impact the amount of data being moved.
Pre-fetching. This type of data transfer operation may or may not be invoked
based on a policy condition.
Speculation. Another technique that can be employed as a policy is that of
speculative pre-fetching. In such cases, a speculation metric based on spa-
tio/temporal locality, data set membership, etc., can be used to pre-fetch data
that may be used by the processes in question. While such techniques are
well documented in cache and RAM design, they are also applicable in data
placement strategies.
Synchronization. Among the type of transfers, a choice between syn-
chronous or asynchronous transfers may influence the creation of the policy.
Transfers that are time bound and need to be synchronized with calling pro-
cesses or workflow jobs may require more strict handling in policy creation
as compared to lazy transfers of an asynchronous nature.
2. Data Storage Policies. These policies are placed specifically to manage the stor-
age resource and the manner in which data is stored on them. The factors that
need to be considered for such policies are:
Storage space. The most often used policy is that of space constraints. Data
must conform to free space constraints before being placed on a storage
resource.
Distance from origin. An important aspect to consider while planning any
data transfer operations is the overall bandwidth and latency of the link
between the source and destination. Higher latencies may result in choosing
a different source or a different destination for the data in question.
Number of replicas. This is a simple policy with far greater effects than
most policies. Increasing the number of replicas for a data product results in
achieving many of the above stated effects such as fault tolerance, reduced
network latency, increased workflow efficiency and high availability etc.
Hardware based. Local administrators design the layout of the local network
based on physical constraints. However, there is always a logical topology
that follows the network setup. Both physical and logical topologies of the
local network may contribute to the decision on a storage policy being for-
mulated.
Based on Hardware. These are factors that we consider based on the selec-
tion of hardware being used for storage. There may be a condition where
tier-based hardware storage is in place. For example, the selection between
tape, magnetic disk, or flash-based storage can be made based on the age of the data.
Data lifetime policies. These are policies that decide how long a data item
can exist on a particular resource. For example, on an execution site, the
storage resource may only hold data for a particular length of time. Based
on that, a workflow may be required to run soon or it may be required to run
later.
2.5.5 Data Placement Policies for Scientific Workflows
The goal of data placement policies for scientific workflows is to benefit the overall
execution of scientific workflows through well thought out data placement decisions.
Such policies are a subset of data placement policies. We would like to clearly define
the bounds for policies that fall under this category. While these policies will inevitably
consider conditions such as storage space on the execution host, etc., the primary objec-
tive is not of space monitoring. We start by listing the conditions that can be monitored
and actions that can be taken as a result of an invocation of a data placement policy for
scientific workflows.
Conditions Monitored
1. Meta data: Data attributes that are available to the policy decision point may be
used for monitoring conditions. For example, file name, file size, file timestamp,
or any other metadata that can be associated with the data and used by the policy
decision point for decisions.
2. Workflow attributes: The structure of the abstract workflow can provide attributes
such as the dependencies between jobs that stage in data, stage out data, the total
numbers of levels in a DAX that require data placement. In cases where multiple
workflows are running concurrently, the number of jobs that require data place-
ment operations can contribute to computing data placement schedules. However,
such a choice would obviously be made between multiple workflows, and not
within a single workflow.
3. Storage resource attributes: This typically refers to the size of storage space avail-
able. In the case of the execution site, the free storage space is generally an issue
of concern. The size of stage-in data may not exceed a storage limit set
by the local admin at the execution site. More urgently, data being staged-in
may not exceed the available free space. These attributes are usually
valid for the execution site and not the storage site. The storage site space may
only be of concern if there is a time condition attached with the storage. For exam-
ple, if the storage will serve a particular data set up to a certain time, then a policy
must decide on moving the required data set to the execution sites before the time
window expires.
4. Virtual / Physical machine. Data placement policies may be tailored to use static
or dynamically generated destination addresses. One of the most visible differ-
ences in running scientific workflows on physical resources as opposed to virtual
resources is the level of I/O contention for the execution site working directory
file system. Based on whether the data placement is destined for a virtual or a
physical execution node, the transfer operations may be throttled to reduce the
I/O contention at the execution host. In the case of Virtual nodes, the effect is
very pronounced and large number of concurrent transfers can actually impede
workflow execution instead of making it more efficient.
Policy Actions
Once a data placement policy for scientific workflows is created, the action taken
by the enforcement of the policy typically involves replicating, registering or
deleting data. In addition, pre- or post-job scripts may also be added to work-
flow compute jobs. While data access and authentication are typically an inherent
part of these actions, their enforcement does not fall under the purview of data
placement policies for scientific workflows.
A data placement policy can now be created by combining conditions monitored with
actions enforced in arbitrary levels of complexity.
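As a hedged sketch of such a combination (the attribute names and the transfer call are hypothetical, not a PDPS interface), the example below evaluates a storage-space condition and a virtual-versus-physical condition before enforcing a stage-in action with an appropriate transfer concurrency:

```python
# Hypothetical sketch of a data placement policy for scientific workflows:
# monitored conditions (storage-resource and execution-node attributes) are
# checked, then the action set stages inputs in, throttling transfers on
# virtualized execution nodes to limit I/O contention.

def stage_in_policy(files, site, schedule_transfer):
    """files: list of dicts with 'name' and 'size_gb'; site: dict describing the
    execution site; schedule_transfer(name, dest, max_streams) is a placeholder
    for the real transfer service."""
    total = sum(f["size_gb"] for f in files)
    # Condition: staged-in data must fit within the free space left by the local admin.
    if total > site["free_gb"]:
        raise RuntimeError("policy violation: stage-in exceeds available free space")
    # Condition: throttle concurrent transfers for virtual execution nodes.
    max_streams = 2 if site["virtual"] else 8
    # Action set: replicate the inputs into the execution site's working directory.
    return [schedule_transfer(f["name"], site["work_dir"], max_streams) for f in files]

jobs = stage_in_policy(
    files=[{"name": "raw_1.fits", "size_gb": 1.2},
           {"name": "raw_2.fits", "size_gb": 0.8}],
    site={"free_gb": 10, "virtual": True, "work_dir": "/scratch/run01"},
    schedule_transfer=lambda name, dest, streams: (name, dest, streams))
print(jobs)
```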
2.6 Non-deterministic Data Placement
So far we have discussed data placement in a context where location information for the
data is precise. By this we mean that every file can be independently addressed and the
mapping to physical location can be established. This is an important aspect for the over-
all efficiency of workflows. If placement decisions are precise, data availability is deter-
ministic. However, there is still the case where data locality is non-deterministic. The
Google File System (GFS) [GGL03] and the Hadoop Distributed File System (HDFS)
[SHRC] are two systems where data locality is approximate.
2.7 Related work on Policy Based Data Management
The remainder of this chapter provides a brief survey of related work in policy based
data placement systems. We also differentiate this work from that which is presented in
this thesis.
Constandache, et. al. [COS07], present PEERTRUST, a trust model and imple-
mentation for negotiating policies between different administrative domains in a grid
in order to initiate interaction between them. Their emphasis is on the determination
of peer policies that affect the anticipated interaction between two individual entities
within the administrative domains resulting in the permission or denial of interaction.
Their work is however directed towards the security aspect of permitting or denying
access to resources and the challenges its scalability presents for local administrators in
maintaining up-to-date remote accounts in their grid mapping files.
MyPolyMan by Feng, et. al., [FCWH08][JWH07], is a general policy management
system that permits local administrators to publish and retrieve resource usage poli-
cies. They highlight the absence of resource usage policies in grids and point out that
security policies dominate resource access in VOs. Their work focuses on identifying
representative grid policies covering service availability, data reliability, quota and net-
work consumption. They implement MyPolMan as a system for data movement. While
the implementation of a policy-directed data movement system is an essential part of
data placement services, its focus is on the permitting or denying of the movement and
not on maintaining an optimal workflow while placing data in line with VO policies.
Extensions to Heimdall, a history based policy engine, are presented by Gama, et. al.
[GRF06]. They highlight the shortcomings of earlier work in history based policy sys-
tems such as managing the amount of meta-policy history stored, and its implications on
the scalability of the system. They define the concept of custom event sets that comprise
the event history and of purging meta-policy tags. Access to a resource may be allowed
for a limited amount of time, and thereafter policy may dictate revocation of access priv-
ileges. Such information is essential in implementing a complex history based policy
system. Gama, et. al., measure performance of the policy system with their extensions
for history purging and event-set optimization. Their experiments show significant
improvement of the Heimdall policy engine with the addition of their extensions. While
their work overlaps data placement to the extent of data storage and movement, it still
focuses on the end result of permitting or denying resource usage.
Wasson, et. al., [WH03], identify two policies that represent the operations of VOs
along with the expected behavior of their resources and users. They identify an equal
load distribution policy that attempts to distribute work across all participating members
of the VO equally and tries to achieve a 1/n load distribution for n participants. The
second policy they present is one that attempts to give credit to members commensu-
rate to their contributed resources. They term it as the ygwyg (you-get-what-you-give)
policy. Both these policies bring to light the issue of measuring and reporting resource
utilization. They present a prototype implementation of the system. The emphasis of
their work is on the fair use of VO resources.
While much work on scheduling has been focused on process and resource schedul-
ing, we will specifically be covering work on data scheduling. When looking at typical
large scientific collaborations, we have seen that data placement constitutes a significant
portion of overall execution time. Careful scheduling of data placement jobs has been
shown to increase large data-intensive workflow efficiency [CDL+07b]. Ranganathan,
et. al. [RF03], present a three level data scheduling architecture that addresses the
issues of considering local policies, dynamic resources and scalability while making
data transfer scheduling decisions. Their work presents an extensible scheduling archi-
tecture based on combinations of three layers of schedulers. They define data grid exe-
cution (DGE) as a sequence of jobs and the related resource usage. The problem in data
grids is to produce good DGEs with respect to dynamic metrics that may include geo-
graphical distance of data and computation, size of data, and availability of resources
from the start of a DGE to the actual execution of a particular job. They define external,
local, and data schedulers each of which schedule jobs based on a variety of parameters.
They highlight the importance of simulating scheduling scenarios using all variables at
all levels to achieve realistic analysis for data grid scheduling. They identify the impor-
tance of data locality when making scheduling decisions and its impact on various
aspects of efficiency. They do this particularly by replicating popular data to various
sites, thus reducing hotspots for data access. Their work specifically considers dynamic
data replication as a fundamental part of the overall job scheduling problem in grid
environments.
Further work by Ranganathan, et. al., [RSZ+07], looks at the optimal usage of stor-
age resources during large data intensive computations. They show that adding data
cleanup jobs after data is no longer required reduces storage constraints on resources
while improving overall workflow efficiency in resource constrained environments.
They achieve this by combining two approaches. First, they inject cleanup jobs for
each job that completes a computation on data, and second, by scheduling jobs while
keeping the resource requirement and resource availability in view. In both these steps,
storage resources are the primary resource considered. Data cleanup during workflow
execution does not take up significant compute time, but it does free up storage resources
for subsequent jobs. Their work is more in tune with the requirements of data stage-in jobs
where data placement is carried out in view of the popularity and location of data in
light of the related computation. They prove that placing data in locations that are pop-
ular for data queries is beneficial both for computational efficiency of workflows as well
as in providing fault tolerance and higher availability for data sets in higher demand.
Their approach uses a three-level scheduling strategy where each level schedules data
placement in light of information made available at that stage. Their work is based
on providing a data placement scheduler that works with job schedulers to enable data
availability at compute sites.
Singh, et. al., [SVR+07], discuss the critical factors in determining workflow com-
pletion time. They examine workflow performance with various factors involved, such
as scheduling interval, job submission rate, and the rate at which schedulers start jobs
on a cluster. By changing the system parameters that influence these factors and restruc-
turing the workflow, they show improvements in workflow completion time.
Task Clustering [SSV+08] is an optimization tool associated with Pegasus that
groups small and loosely coupled tasks into one single large task. Therefore, the number
of overall tasks in a workflow decreases. This method reduces the overall completion
time by reducing the queue wait time during the job release and submission. Reducing
the overall load on the head node of the remote clusters and decreasing the accounting
costs further enhance this improvement. [SSV+08] focuses on the ability to achieve
higher levels of job concurrency by clustering jobs and ultimately increasing workflow
efficiency. In contrast, our work focuses on increasing workflow efficiency by achieving
concurrency outside the execution of the workflow; i.e., jobs being executed by the DPS run
concurrently with the compute jobs inside the workflow.
In earlier work, Chervenak, et. al., measured the impact of data pre-staging on
workflows running under Pegasus [DSS+05b]. They pre-staged input data sets near
expected computation sites for the Montage workflow and measured the performance
improvement when the workflow accessed this pre-staged data. They also categorized
placement operations and explored the ability of the workflow systems to provide hints
to the DPS regarding where to place data.
Ramakrishnan, et. al., present an algorithm that finds feasible solutions for task
assignment in storage-constrained environments [RSZ+07]. The authors show that the
overall space required for a gravitational-wave physics application is reduced by 57%
when using their task assignment algorithm along with performing clean-up operations
on intermediate data.
Policy based data placement has previously been addressed and implemented in
iRODS [RMH+10]. iRODS enforces data policies over large distributed systems. It
works with distributed client components that provide the detailed knowledge iRODS
requires to perform policy based data management. However, iRODS is a complex,
multi-site deployment and an individually configured amalgamation of services; the
setup and maintenance effort it requires is too great for mid-sized collaborations. PDPS
bridges this gap by providing a single-point deployment with minimal configuration.
DAG scheduling heuristics provide another approach to optimize runtime perfor-
mance. Cannon, et.al. [CJSZ08], analyze the performance of 20 static, makespan-
centric, DAG scheduling heuristics. DAG scheduling algorithms usually consider
the computation cost and communication cost together so as to reduce the runtime
makespan, or the overall execution time.
Stork, developed by Kosar, et. al., [KL04a], is a specialized data placement sched-
uler that provides the ability to schedule, queue, monitor and manage data placement
jobs in a fault tolerant manner. More recently, this group proposed a new data-intensive
scheduling paradigm [KB09] that schedules data placement jobs independently of com-
pute jobs in scientific workflows. These different job types are managed by Stork and
DAGMan, respectively. After DAGMan receives an executable workflow from Pegasus,
it divides the workflow execution between Condor and Stork, with Stork handling all
data placement jobs. The authors evaluate different algorithms to schedule data place-
ment with Stork.
Yuan et al. [YYLC10] propose a matrix based k-means clustering strategy for data
placement in scientific workflows. They assume that datasets are located in different
data centers. They use two strategies that group the existing datasets into k data centers
and dynamically cluster newly generated datasets to their optimal destinations based on
dependencies during runtime.
Recently, cloud based systems have gained popularity in the workflow community.
Juve, et. al., studied the cost of data management using different file systems in cloud
environments [JDVM10]. They used the Amazon EC2 cloud [JDV+10] and Amazon S3
storage services [JDV+09] and presented the performance of six different file systems
along with related costs incurred for running the Montage, Broadband and Epigenomics
workflows. While the ability to provision computing resources dynamically is preferred
by many scientists, data co-location with compute resources is a concern for data inten-
sive workflows in cloud environments.
Catalyurek, et. al., [ÇKU11] use a hypergraph partitioning scheme to reduce inter-site
file transfers while running scientific workflows in the cloud. They pre-assign weights
to compute sites based on their storage and CPU capacity. The total size of data trans-
ferred between sites is minimized so that each site is assigned at most its pre-assigned
combination of data and related computation.
Liu et al. [LD11] present a framework where data transfers between data centers
are reduced by considering the current and past capabilities of storage and compute
resources. They achieve this in two stages. First, the input data set is broken up into
smaller data items, is pre-clustered and placed in data centers whose current compute
and storage capabilities are best suited for the computation. Second, the decision to
move intermediate data generated during the runtime execution of the workflow is taken
in view of the initial parameters as well as the historical performance of the data centers.
Agarwal, et. al., present Volley [ADJ+10], which optimizes data placement across
globally distributed cloud services based on heuristics acquired from data center logs.
Volley places user and application data so that a cloud application can serve users from
the most feasible data center, where a feasible data center has enough storage capacity to
accommodate the user data and is located close enough geographically that the user does
not experience excessive latency. In experimental evaluation, Volley reduces inter-data
center transfer, perceived user latency and inter-data center capacity skew significantly
in comparison to other heuristics.
The emphasis of our work is on integrating data placement policies with workflows
and trying to improve workflow execution. Usage and access are two separate areas
of resource utilization. Our work is concerned more with usage over extended timelines
and not so much with access tied to authorization. The enforcement of such policies
differs from the typically employed policy model in that it is proactive and aims
at maintaining a stable system state, as opposed to being reactive to client queries and
authorizing or denying access to resources. Additionally, we enable the use of policies
that are specific to data management tasks in scientific workflows by presenting the
Policy Based Data Placement Service (PDPS) in the following chapters.
Chapter 3
Policy Based Data Placement
In this chapter we discuss the development of the Policy Based Data Placement Ser-
vice (PDPS) as a standalone data placement service and its evolution into a modular
tool that is interfaced with workflow planning and execution engines for managing data
placement tasks. This integration gives PDPS the knowledge not only to improve scien-
tific workflow performance, but also to remain actively aware of the changing system state
when enforcing VO-level data placement policies. We start by discussing the motiva-
tion behind the development of PDPS and follow that up with the design of PDPS, its
utility in general for distributed systems, and its integration with Pegasus. We evalu-
ate the impact of using PDPS as a policy enforcement service in a distributed system
setting and the impact of using Pegasus with PDPS for three scientific workflows from
the astronomy, earthquake science and epigenomics domains. Based on this evaluation,
we discuss how policy enforcement is achieved in distributed systems. We also analyze
the performance improvement in workflows using PDPS with the Pegasus WMS. We
establish that the impact on workflow efficiency depends on the size of data for all data
management tasks as well as the percentage of time taken by those tasks within each
workflow.
3.1 Introduction
Scientists in a variety of application domains use workflow technologies to manage the
complexity of sophisticated analysis and simulations. These analyses may be composed
of thousands or millions of interdependent tasks that must be executed in a specified
order, and each task may require the management of significant amounts of data. Data
management operations performed by the workflow system may include staging input
data to the execution site where a compute job will be executed; staging results of com-
putation to more permanent storage, if necessary; and removing or “cleaning up” data
products that are no longer needed from execution sites. The analysis and visualiza-
tion of large data sets places an additional level of complexity on workflow execution
and management. For such workflows, the management of data placement tasks places
a significant amount of strain on resources such as storage space and network bandwidth,
and adds the I/O time consumed by data placement tasks to the overall workflow
execution time. Additionally, independent members of VOs enforce local data
placement policies that may conflict with each other. Therefore specialized code or
applications are developed to enforce such distributed data management policies across
collaborations [RCB+06].
Typically, workflow managers create data management jobs such as stage-in, stage-
out and clean-up jobs to supplement an abstract workflow and create an executable
workflow that can be run as a single entity by the WMS. The WMS scheduling pro-
cess primarily focuses on the compute jobs and attempts to optimize resource utilization
based on these. Data management jobs are added where required and may not be used in
scheduling decisions. While there has been significant work in data scheduling for sci-
entific workflows, so far this work has treated data placement independent of workflow
scheduling [KL04b]. For example, in Pegasus WMS, when stage-in jobs are created,
the decision on which stage-in jobs to schedule earlier is not addressed by the Pegasus
planning stage.
PDPS, a Policy Based Data Placement Service was developed to complement Pega-
sus WMS by enforcing distributed data-placement policies in mid-sized collaborations.
The framework enables Pegasus WMS to hand off data placement jobs to PDPS. PDPS
is responsible for performing data management operations, thus freeing Pegasus WMS
to perform other tasks, such as executing, monitoring and managing dependencies
among computational tasks. Stage-out and clean-up jobs are handed off to PDPS during
the execution of the workflow, whereas during the workflow planning process, stage-in
jobs along with policy information are handed off to PDPS. In the latter case, priority
information for data placement jobs is computed based on a variety of conditions by
PDPS. This information is also used to compute the priorities for the child compute jobs
and passed back to the Pegasus WMS for further processing and assignment to compute
jobs. Different priorities can be assigned based on different conditions being evaluated.
The effect of policy-based management is the automation of tasks based on rules
that are highly dynamic and are stored externally from the system. For example, the
computation of priorities may be based on the structure of the workflow’s input DAG,
or it may be based on the physical characteristics of the data involved in the workflow.
Both these examples have conditions that can be stored outside the WMS and used to
feed priorities or schedules to the WMS based on the policy selected for enforcement.
We start by describing the overall architecture of PDPS and follow that up with the
experimental evaluation sections of this chapter.
3.2 PDPS Design and Architecture
In this section, we present an abstract overview of the design and architecture of PDPS
with the modifications we have made to it to complete interfacing with Pegasus.
3.2.1 PDPS Architecture
PDPS receives two types of calls from Pegasus: synchronous calls for the policy-based
data placement of stage-in jobs, and asynchronous requests from Pegasus through Con-
dor for stage-out and clean-up jobs. In both cases, PDPS executes the jobs independent
of the state of the parent workflow. As depicted in Figure 3.1, PDPS has now evolved
to consist of six major components: the open source Drools policy engine [Dro11] that
monitors and executes the policies in the Policy Information Base (PIB), the Listening
Interface, the Job Handler, the Job Thread Manager and the Status Database.
Figure 3.1: PDPS Architecture (diagram: incoming data placement requests arrive at the System State and Listening Interface; the Policy Engine with its Policy Information Base drives the Job Handler and Functions Library; the Job Thread Manager runs Transfer/Delete instances that invoke GridFTP transfers; a Status Database and the Replica Catalog track state.)
PDPS receives incoming requests from Pegasus on the listening interface. On receiv-
ing a call, PDPS inserts all received information as facts into the working memory of
the Drools policy engine. Next, the job handler decides if the request is one for stage-in,
stage-out, clean-up, or other data placement operations. The policy to be enforced is
selected based on received information. The policy is enforced by the job handler by
selecting and executing the relevant functions from the functions library. Typically, the
result of the policy enforcement is the execution of data transfer tasks. These tasks are
handed off to the multi threaded job manager that places the jobs in a local queue and
executes them as scheduled by the policy enforced. In the case of stage-out and clean-up
operations, the thread manager launches a new thread using GridFTP [All12, ABK+05]
client calls for delete operations and globus-url-copy [All05a] for stage-out operations.
In the default case for Pegasus managed workflows, data placement jobs are not
throttled based on the number of GridFTP connections. Stage-in or stage-out jobs are
released based on Condor matchmaking and DAGMan scheduling requirements. As a
result, the server at the execution site may become overwhelmed with transfer requests
and transfer performance may be adversely affected. PDPS controls the number of active
transfer threads through a global properties file. We included this throttling feature to
avoid spawning too many GridFTP connections to the execution site.
For stage-in requests, in addition to the thread manager invoking transfer operations,
PDPS uses a status database to track the status of individual file transfers. The status is
set to ‘IN PROGRESS’ when the transfers are queued up for execution; it is updated to
either ‘SUCCESS’ or ‘ERROR’ depending on the result of the transfer.
For each individual transfer request, PDPS starts a timer and places the transfer
request in a slot on the transfer job thread pool. For all transfer tasks, the transfer
instance first parses the destination path and checks to see if the destination directories
exist. If so, then a third party transfer is invoked as a separate thread. If the destination
path does not exist, the transfer instance recursively creates all directories up to the
parent directory of the destination file and then proceeds with the transfer. If a source
file does not exist for a transfer job, PDPS makes five additional attempts to transfer the
file before giving up and reporting the error. A file transfer may time out based on a
preset threshold set in PDPS. Once the transfer completes successfully, PDPS adds the
relevant entry in the Replica Location Service (RLS) [CSR+09], a catalog that registers
the locations of file replicas at multiple storage sites. For the current set of experiments,
PDPS limits the number of concurrent transfers to five. Additionally, PDPS is capable
of polling the system state independent of Pegasus, creating a snapshot of the current
state of the system, and enforcing any policies that may be in conflict.
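To make the transfer path described above concrete, the following sketch outlines a PDPS-style transfer worker in Java: transfers are throttled through a fixed-size thread pool, each transfer updates a status store ('IN PROGRESS', 'SUCCESS' or 'ERROR'), and a bounded number of attempts is made before an error is reported. This is a minimal sketch under stated assumptions; the class, interface and method names are illustrative, and the GridFTP third-party copy and the status database are stubbed behind simple interfaces rather than the actual PDPS code.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class TransferWorker implements Runnable {

    enum Status { IN_PROGRESS, SUCCESS, ERROR }

    // Stand-ins for the GridFTP client and the status database used by PDPS.
    interface TransferBackend { boolean thirdPartyCopy(String source, String destination) throws Exception; }
    interface StatusStore { void set(String file, Status status); }

    private static final int MAX_ATTEMPTS = 5;   // attempts before reporting an error

    private final String source, destination;
    private final TransferBackend backend;
    private final StatusStore statusDb;

    TransferWorker(String source, String destination, TransferBackend backend, StatusStore statusDb) {
        this.source = source;
        this.destination = destination;
        this.backend = backend;
        this.statusDb = statusDb;
    }

    public void run() {
        statusDb.set(destination, Status.IN_PROGRESS);           // transfer queued / started
        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
            try {
                if (backend.thirdPartyCopy(source, destination)) {
                    // The real service also registers the new replica in the RLS here.
                    statusDb.set(destination, Status.SUCCESS);
                    return;
                }
            } catch (Exception e) {
                // fall through and retry
            }
        }
        statusDb.set(destination, Status.ERROR);                  // give up after the retry budget
    }

    public static void main(String[] args) {
        // A fixed-size pool throttles concurrent transfers (five in these experiments).
        ExecutorService pool = Executors.newFixedThreadPool(5);
        StatusStore db = (file, status) -> System.out.println(file + " -> " + status);
        TransferBackend stub = (src, dst) -> true;                // stand-in for a GridFTP third-party copy
        pool.submit(new TransferWorker("gsiftp://source-host/data/f1",
                                       "gsiftp://dest-host/data/f1", stub, db));
        pool.shutdown();
    }
}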
3.2.2 PDPS Pegasus Framework
Figure 3.2 shows the overall flow of control and data in our framework. Pegasus reads
in a DAG that represents an abstract workflow along with input parameters that include
the type of policy to be enforced from file. During the workflow planning process, it
sends transfer job, file attributes, job dependency information and the requested policy
type to PDPS. PDPS uses that information to create a prioritized schedule for the data
placement jobs and at the same time computes a compute job priority list to send back
to Pegasus. On receiving the compute job priorities, Pegasus optionally inserts them as
Condor priorities into the compute jobs. Workflow planning continues and on comple-
tion, the executable workflow is executed with the help of DAGMan and Condor. On
receiving the first call from Pegasus, PDPS enforces the requested policy by scheduling
transfer jobs based on the priorities calculated by the data placement policy. It then sub-
mits the transfer jobs to its own transfer scheduler. As these are clustered or bulk transfer
calls, we limit the total concurrent transfers initiated by PDPS to five. For stage-in jobs,
the transfer scheduler first updates the status database to reflect the state of stage-in jobs.
Every stage-in job may be ‘in progress’, ‘successful’ or ‘in error’. Dependent compute
jobs continuously poll for the successful transfers of required input files before starting
to execute. Stage-out and clean-up jobs are submitted to PDPS during the actual exe-
cution of the workflow. PDPS handles stage-out and clean-up jobs independent of the
state of the executing workflow.
Figure 3.2: Overall Framework (diagram: the submit host runs Pegasus and the Condor Manager; Pegasus reads the abstract workflow, sends stage-in jobs, file attributes, job dependencies and policy information to PDPS, and receives compute job priorities in return; compute jobs run on the cluster head node with shared NFS and on the compute nodes, which poll for the status of stage-in jobs; PDPS performs third-party transfers between the storage site and the cluster GridFTP server and receives stage-out and clean-up jobs during execution.)
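The division of responsibilities just described can be summarized by a remote interface of roughly the following shape. This is a hypothetical sketch: PDPS listens over Java RMI, but the method names, parameters and return types shown here are illustrative assumptions rather than the actual PDPS API.

import java.rmi.Remote;
import java.rmi.RemoteException;
import java.util.List;
import java.util.Map;

public interface PlacementService extends Remote {

    // Synchronous call made during workflow planning: stage-in jobs, file attributes,
    // job dependencies and the requested policy go in; a map from compute-job IDs to
    // recommended priorities comes back to Pegasus.
    Map<String, Integer> planStageIns(List<String> stageInJobs,
                                      Map<String, Long> fileSizesInBytes,
                                      Map<String, List<String>> jobDependencies,
                                      String policyName) throws RemoteException;

    // Asynchronous requests issued during workflow execution; PDPS handles them
    // independently of the state of the running workflow.
    void submitStageOut(List<String> sourceUrls, List<String> destinationUrls) throws RemoteException;
    void submitCleanup(List<String> fileUrls) throws RemoteException;

    // Status lookup used by the pre-compute-job scripts that poll for the completion
    // of required stage-in transfers ("IN PROGRESS", "SUCCESS" or "ERROR").
    String queryTransferStatus(String fileUrl) throws RemoteException;
}

In this arrangement the planning call blocks Pegasus only for the time needed to compute priorities, while the data movement itself proceeds asynchronously alongside workflow execution.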
Next, we present results of experiments conducted that measure the performance of
distributed data placement policy enforcement by PDPS in the wide area without the
interface to Pegasus WMS. We start by describing the policies that we enforced in the
distributed system, followed by the performance evaluation.
3.2.3 Policy Authoring
In PDPS, policies are inserted into the policy information base and compiled into Drools
packages. When a policy condition in any of these packages is met, the relevant rule is
‘fired’ and the related functions from the functions library are invoked. An example
policy expression is shown in Figure 3.3.
package edu.isi.policy
rule "Replication"
when
# s is a storage element such that s has space left
s : StorageElement ( h : hostname, spaceLeft > 0 )
then
# make file transfer
FileTransferService.copy(source, hostname);
end
Figure 3.3: Example Policy Expression
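To make the life cycle of such a rule concrete, the sketch below shows how a package like the one in Figure 3.3 might be compiled and fired against facts using the Drools 5-era API that was current when PDPS was built. The rule file name, the StorageElement fact class and the session handling are illustrative assumptions rather than the exact PDPS code.

import org.drools.KnowledgeBase;
import org.drools.KnowledgeBaseFactory;
import org.drools.builder.KnowledgeBuilder;
import org.drools.builder.KnowledgeBuilderFactory;
import org.drools.builder.ResourceType;
import org.drools.io.ResourceFactory;
import org.drools.runtime.StatefulKnowledgeSession;

public class PolicyLoader {

    // Simple fact class matched by the rule condition in Figure 3.3.
    public static class StorageElement {
        private final String hostname;
        private final long spaceLeft;
        public StorageElement(String hostname, long spaceLeft) {
            this.hostname = hostname;
            this.spaceLeft = spaceLeft;
        }
        public String getHostname() { return hostname; }
        public long getSpaceLeft() { return spaceLeft; }
    }

    public static void main(String[] args) {
        // Compile the DRL file (assumed to be on the classpath) into knowledge packages.
        KnowledgeBuilder builder = KnowledgeBuilderFactory.newKnowledgeBuilder();
        builder.add(ResourceFactory.newClassPathResource("replication.drl"), ResourceType.DRL);
        if (builder.hasErrors()) {
            throw new IllegalStateException(builder.getErrors().toString());
        }

        KnowledgeBase kbase = KnowledgeBaseFactory.newKnowledgeBase();
        kbase.addKnowledgePackages(builder.getKnowledgePackages());

        // Insert facts into working memory; matching rules fire and invoke actions
        // (in PDPS, functions from the functions library such as a file transfer).
        StatefulKnowledgeSession session = kbase.newStatefulKnowledgeSession();
        session.insert(new StorageElement("site1.example.org", 500_000_000L));
        session.fireAllRules();
        session.dispose();
    }
}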
3.3 Policies Evaluated
For the first set of experiments, we tested two policies that are representative of data
management in distributed scientific collaborations. The first policy is based on the
tiered dissemination of files used in high energy physics experiments such as CMS
[Pro05a]. This policy attempts to ensure replication or dissemination of all files that
are published at an experimental site called tier-0 (for example, CERN in the CMS
experiment) to other storage sites arranged in levels, or tiers. Typically, each succes-
sive tier is comprised of an increasing number of storage sites, each of which has less
storage capacity than sites on the tier above. Thus, storage sites on successive tiers
store decreasing subsets of the complete data set. Our policy is illustrated in Figure 3.4.
PDPS attempts to ensure that every file at the root (tier-0) site is replicated to each tier
at least once and that an equal amount of data is disseminated to each site in a tier. In
our experiments, we partition the data sets equally and deterministically among the sites
on each tier.
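As an illustration of the equal, deterministic partitioning used in our experiments, the following sketch assigns files to the sites on one tier in round-robin fashion. The class and method names are hypothetical; in PDPS the corresponding decisions are made by Drools rules and functions from the functions library.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TierPartitioner {

    // Assign files to sites round-robin so that every site on the tier
    // receives an (almost) equal share of the data set.
    public static Map<String, List<String>> partition(List<String> files, List<String> tierSites) {
        Map<String, List<String>> assignment = new HashMap<>();
        for (String site : tierSites) {
            assignment.put(site, new ArrayList<>());
        }
        for (int i = 0; i < files.size(); i++) {
            assignment.get(tierSites.get(i % tierSites.size())).add(files.get(i));
        }
        return assignment;
    }

    public static void main(String[] args) {
        // For example, 10,000 tier-0 files over 10 tier-1 sites gives 1000 files per site.
        List<String> files = new ArrayList<>();
        for (int i = 0; i < 10000; i++) files.add("file-" + i);
        List<String> tier1Sites = new ArrayList<>();
        for (int i = 1; i <= 10; i++) tier1Sites.add("tier1-site-" + i);
        partition(files, tier1Sites).forEach((site, assigned) ->
                System.out.println(site + " receives " + assigned.size() + " files"));
    }
}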
The second policy that we evaluate, called n-copy, attempts to maintain a high
level of data availability and fault tolerance in the distributed system by creating multiple
copies of each file on distinct storage sites. In particular, we run experiments where there
is initially a set of files on a storage site. PDPS enforces a policy that there should be at
least n copies of each file stored on distinct storage sites in the distributed system, where
n = 3 for our experiments.

Figure 3.4: Tiered model of data distribution from the high energy physics domain (diagram: a tier-0 root site such as CERN, tier-1 regional centers, and tier-n sites, with the storage space per site decreasing at each successive tier.)

During enforcement of this 3-copy policy, PDPS replicates
each file twice to randomly selected storage sites if the available space remaining on
those storage sites is greater than the file size and the sites do not currently store a copy
of the file. The latter is determined by the current fact state in working memory. After
enforcement, at least 3 copies of each file exist in the system on distinct storage sites,
and no storage site's usage exceeds its maximum available space. While these policies
could be applied on any unit of data, for our experiments we chose a single file as the
unit of enforcement.
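The selection step of the n-copy policy can be illustrated with the following sketch: for a file with fewer than n replicas, random candidate sites are drawn, and a site is chosen only if it has enough space left and does not already hold a copy. The names below are hypothetical; in PDPS this logic is expressed as Drools rules acting on the facts in working memory.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

public class NCopyPlanner {

    static class Site {
        final String host;
        long spaceLeft;                          // bytes available
        final Set<String> files = new HashSet<>();
        Site(String host, long spaceLeft) { this.host = host; this.spaceLeft = spaceLeft; }
    }

    private final Random random = new Random();

    // Returns the hosts chosen to receive new replicas of the file so that it ends up
    // with at least n copies, or fewer if no eligible site remains.
    List<String> planReplicas(String file, long fileSize, int currentCopies, int n, List<Site> sites) {
        List<String> targets = new ArrayList<>();
        List<Site> candidates = new ArrayList<>(sites);
        while (currentCopies + targets.size() < n && !candidates.isEmpty()) {
            Site s = candidates.remove(random.nextInt(candidates.size()));
            if (s.spaceLeft >= fileSize && !s.files.contains(file)) {
                s.files.add(file);
                s.spaceLeft -= fileSize;         // a full site drops out of consideration
                targets.add(s.host);
            }
        }
        return targets;
    }

    public static void main(String[] args) {
        List<Site> sites = Arrays.asList(new Site("node-a", 4_000_000_000L),
                                         new Site("node-b", 4_000_000_000L),
                                         new Site("node-c", 1_000_000L));
        NCopyPlanner planner = new NCopyPlanner();
        // One existing copy on the source site; plan two more replicas to reach n = 3.
        System.out.println(planner.planReplicas("f1.dat", 1_000_000_000L, 1, 3, sites));
    }
}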
3.4 PDPS Standalone Performance Evaluation
In order to evaluate a data placement policy, we needed to establish measurable metrics
that can be used to compare different policies under similar circumstances. For this set of
experiments we measured the time to completely enforce a single policy. This entailed
the transfer or copying of all files that need replication depending on the policy being
enforced. In order to clarify the impact of each policy, we conducted the experiments in
independent sets for each policy. However, in general, multiple policies may be enforced
by a PDPS instance concurrently.
We conducted two sets of experiments for PDPS for each of these policies: for the
first set, we used a dataset of 10,000 files of size one MB each, while in the second set of
experiments, we used 30 files of size one GB each. All experiments were conducted on
30 nodes randomly selected from the Planet-Lab research network [pla12]. These nodes
are dispersed over the continental United States. Each node runs a custom version of the
2.6.22 Linux kernel with 4 GB of available storage space. The storage space available
on the Planet-Lab nodes constrains the size and number of files used in our experiments.
For our experiments, one workstation in our local environment was the initial location
of all the data files for both policies that we evaluate. This “tier 0” machine was
an 8 core 3.0 GHz Intel Xeon workstation with 24 GB of RAM running Mac OS X
10.5. This workstation ran the PDPS instance. It also ran the Replica Location Service
(RLS) [CSR+09] and Meta Data Service (MDS) [SDM+05] information services used
to gather current state information for PDPS and the GridFTP server that performed
third party data transfers initiated by PDPS.
3.4.1 Tier Based Dissemination
For the tier-based dissemination experiments, we replicate data to 30 storage sites
divided into 10 tier-1 sites and 20 tier-2 sites. We report the performance of
PDPS when disseminating a large number of small files and a smaller number of larger
files. For the first experiment, before policy enforcement, we publish 10,000 data files of
size 1 MB on the tier-0 workstation and register these files in the RLS. The MDS pro-
vides information to PDPS regarding available storage sites, available space and the tier
to which each site belongs. During policy enforcement, PDPS first disseminates 1000
files per site to 10 tier-1 sites and then disseminates 500 files per site to 20 tier-2
sites.
!"
#$!!"
%!!!"
&$!!"
!" $!!!" #!!!!" #$!!!" '!!!!"
!"#$%"&%'$('%
)*#+$,%-.%/0$'%1"''$#"&23$1%
!"$,%42'$1%5"'$#"&26-&7%894%/0$'%%
!"$,%:7%8::::%/0$'%%%
!"$,%87%8:::%/0$';&-1$%%%
!"$,%<7%=::%/0$';&-1$%
Figure 3.5: Tier-based dissemination 1 MB files
Figure 3.5 shows the time taken to enforce the dissemination policy for 10,000 files.
The average data throughput for the first 15,000 files is 5.89 MB/s, while the overall
throughput is 4.96 MB/s. The initial transfers originate from the relatively fast tier-0
workstation that has a gigabit network connection available to it. The drop in throughput
for the last 5,000 file transfers occurs because the transfers take place between relatively
slow Planet-Lab nodes. For this experiment, 19,966 out of 20,000 transfers completed
successfully. PDPS reported timeout failures for 34 transfers destined for two Planet-
Lab nodes. These last two destination nodes were unresponsive to network activity mid-
operation and were thus reported as failures by GridFTP. At the time these measurements
were taken, the current implementation of PDPS did not recover from these failures. We
later extended the functionality of PDPS to include a timeout mechanism. If a GridFTP
transfer fails, PDPS will continue policy enforcement using retries and, if necessary,
will initiate transfers to create replicas on different nodes in the system.
Figure 3.6 shows dissemination performance for a smaller number of larger files.
Initially, we publish 30 files of size 1 GB on the tier-0 workstation and register them
in the RLS. We disseminate 3 files each to 10 tier-1 nodes and up to 2 files to each of
20 tier-2 nodes. Out of a total of 60 files that were to be disseminated, 58 transfers
took 8781 seconds, averaging a data throughput rate of 6.6 MB/sec. The last two files
were transferred to a single slow node and took an additional 9116 seconds to complete.
The overall data throughput for the 60 files is 3.35 MB/sec. These results reinforce the
need to identify slow nodes in the system and extend the policy engine's logic to select
nodes with better performance as targets for replication operations.
!"
#!!!"
$!!!!"
$#!!!"
%!!!!"
!" $!" %!" &!" '!" #!" (!"
!"#$%"&%'$('%
)*#+$,%-.%/0$'%1"''$#"&23$1%
!"$,%42'$1%5"'$#"&26-&7%894%/0$'%%%
!"$,%:7%;:%/0$'%%%
!"$,%87%;%/0$'<&-1$%%%
!"$,%=7%=%/0$'<&-1$%
Figure 3.6: Tier-based dissemination 1 GB files
3.4.2 n-copy Replication
Next, we show performance results for a policy that provides high availability by main-
taining at least n copies of every file in the distributed system, where n = 3 for our
experiments. As for the tier-based dissemination experiments, we initially published
either 10,000 files of size 1 MB or 30 files of size 1 GB on the tier-0 workstation and
registered these files in the RLS. We replicate these files over 30 Planet-Lab sites.
During policy enforcement, PDPS checks whether each file is stored on fewer than
three sites in the distributed system. If so, then PDPS randomly selects a storage site on
which that file does not exist and where the site has sufficient available storage. It issues
a transfer request to the transfer queue to create a new replica at that site. When a storage
site runs out of available space, PDPS removes the site from its working memory and no
longer considers that site for replication operations. Figure 3.7 shows the performance
of PDPS in enforcing 3-copy replication for 10,000 files of size 1 MB, and Figure
3.8 shows the performance of enforcing the same policy with 30 files of size 1 GB in
the Planet-Lab network. For the first experiment, we replicate a total of 20,000 files to
create three copies each of the initial 10,000 files. We observe a trend similar to that
of Figure 3.5, where GridFTP transfers originating from the tier-0 workstation have
higher throughput than those between slower Planet-Lab nodes. The first 15,000 file
transfers average a data throughput of 5.42 MB/sec, and the overall throughput is 3.54
MB/sec. Out of 20,000 GridFTP transfers initiated by PDPS, 19,984 transfers complete
successfully, while 16 transfers failed. These 16 GridFTP transfers originate from two
Planet-Lab nodes.
For the last wide area experiment, we measure the performance of 3-copy replica-
tion with 30 files of size 1 GB that are initially published on the tier-0 workstation and
registered in the RLS. Out of 60 file transfers initiated by PDPS, 56 transfers complete
successfully and 4 fail. Figure 3.8 shows the time for the 56 transfers to complete. The
data transfer rate achieved for the first 55 transfers is 6.89 MB/sec. The last transfer took
917 seconds, and the overall data rate is 6.29 MB/sec.
!"
#!!!"
$!!!"
%!!!"
!" &!!!" '!!!!" '&!!!" #!!!!"
!"#$%"&%'$('%
)*#+$,%-.%/0$'%,$10"(23$4%
567-18%9$10"(2:-&%;<=<<<%/0$'%%%
;%>?%$2(@%-A$,%5<%&-4$'%
Figure 3.7: 3copy replication -1 MB files
!"
#!!!"
$!!!"
%!!!"
&!!!"
'!!!!"
!" '!" #!" (!" $!" )!" %!"
!"#$%"&%'$()%
*+#,$-%./%01$)%(.2"$3%
4%5.26%7$21"(89.&%4:%01$)%;%<=>%$8(?%.@$-%
4:%&.3$)%
Figure 3.8: 3copy replication -1 GB files
3.4.3 Policy engine performance
Next, we measure the operation of the Drools policy engine to distinguish between
the time taken by the policy engine in evaluating conditions and firing actions and the
time taken by file transfers. We place 1000 rules in the policy information base and
insert from 1000 to 1,000,000 facts into the policy engine's working memory. (These facts
correspond to files and storage sites in our policies.) When a policy condition matches
a fact, the Drools engine increments a global counter that counts the total number of
rules fired. We insert facts so that approximately 10% of facts match rule conditions.
The time to evaluate 1,000,000 facts is less than 7 seconds, as shown in Figure 3.9. This
overhead corresponds to the time incurred by the Drools engine each time it enforces
policy on the working memory state.
Figure 3.9: Performance of the Drools policy engine with 1000 rules (plot: time in seconds versus number of facts inserted into working memory, from 1,000 to 1,000,000.)
Our policy evaluation experiments each initially run with a single rule and 10,012
facts (one fact each for 10,000 files and for 12 storage sites). The average time for Drools
to evaluate conditions and fire actions for these experiments is less than two seconds.
This overhead is small compared to data transfer times in our experiments.
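The measurement itself can be driven by a small harness of the following shape, reusing the Drools 5 session setup from the earlier sketch; the fact class and the way matching facts are generated (every tenth fact satisfies the rule condition) are illustrative assumptions.

import org.drools.runtime.StatefulKnowledgeSession;

public class EngineBenchmark {

    // Simple fact; roughly 10% of inserted facts are constructed so that they
    // match the rule conditions in the policy information base.
    public static class FileFact {
        private final String name;
        private final boolean needsReplication;
        public FileFact(String name, boolean needsReplication) {
            this.name = name;
            this.needsReplication = needsReplication;
        }
        public String getName() { return name; }
        public boolean isNeedsReplication() { return needsReplication; }
    }

    // Inserts the requested number of facts, fires all rules and returns the elapsed
    // time in milliseconds; fired rules increment a counter in the DRL itself.
    public static long timeEvaluation(StatefulKnowledgeSession session, int factCount) {
        long start = System.nanoTime();
        for (int i = 0; i < factCount; i++) {
            session.insert(new FileFact("file-" + i, i % 10 == 0));
        }
        session.fireAllRules();
        return (System.nanoTime() - start) / 1_000_000;
    }
}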
In the following section, we change gears to the evaluation of PDPS with the Pegasus
WMS. We start by discussing the handling of stage-out and clean-up jobs by PDPS and
follow that with discussion of stage-in jobs.
3.5 Interfacing Leaf Jobs with PDPS
Earlier work proposed separating data transfer operations from the execution of scien-
tific applications [CDL+07a] [KL04a]. In this section, we implement and evaluate such
a separation of concerns for two types of data management operations performed by
scientific workflows: stage-out of files to permanent storage and clean-up of files that
are no longer needed at the execution sites. (Data stage-in jobs are being evaluated
separately, and the results are presented in the following sections.)
Our earlier work on data staging used a “decoupled” approach, where all stage-in
data needed by a workflow were transferred to the appropriate storage system before
workflow execution began [CDL+07a]. By contrast, the work described in this section
takes a “loosely coupled” approach, where data stage-out and clean-up jobs are ini-
tiated by the workflow management system and executed by the data management service
at workflow runtime. We integrated PDPS with the Pegasus [DSS+05a] Workflow Man-
agement System. PDPS is responsible for performing the stage-out and clean-up oper-
ations, thus freeing the workflow management system to perform other tasks, such as
task scheduling and managing dependencies among computational tasks.
We begin with an overview of the Pegasus Workflow Management System and
describe its default behavior in executing stage-out and clean-up operations. We then
describe modifications made to Pegasus to send those requests to PDPS. Next, we
describe the design and operation of PDPS. We give an overview of the three work-
flows used in our evaluation. We describe the experimental setup for the performance
evaluation and the metrics that we use in evaluating the impact of PDPS. We show exe-
cution performance for each workflow running with and without PDPS. In each case,
we show the impact on the performance of individual data management operations and
on the overall workflow execution time. We conclude with a discussion of related work
and plans for future work.
3.5.1 Integrating Pegasus and PDPS - Leaf jobs
As discussed in chapter 2, the Pegasus Workflow Management System is a workflow-
mapping and execution engine that is used to map complex, large-scale scientific work-
flows with thousands of tasks processing terabytes of data onto distributed resources.
Pegasus enables users to represent workflows at an abstract level without worrying
about particulars of target execution systems. Pegasus includes DAGMan [Fre02] as
a workflow execution engine and relies on Condor [Pro03] for scheduling and
monitoring the execution of jobs. Condor is a specialized workload management system
developed at the University of Wisconsin that provides pools of potentially distributed
resources on which we execute workflows. Condor places serial or parallel jobs submit-
ted from users into a queue and matches them with resources based on job and resource
requirements and monitors the jobs until they are done. DAGMan is a tool developed by
the Condor team that accepts a Directed Acyclic Graph (DAG) that describes depen-
dencies among computational tasks and manages those dependencies during execution.
Pegasus produces an executable workflow that is managed by DAGMan, which releases
jobs to Condor for execution in the order dictated by dependencies among the jobs.
Condor then delegates those jobs to physical resources.
Example workflows that use Pegasus WMS include applications in gravitational
wave physics (LIGO) [Pro04], astronomy (Montage) [BDG+04a], earthquake sciences
(Broadband) [20111], as well as biological science (Epigenomics) [epi12]. The appli-
cations often involve the processing of large data sets in many discrete steps such as
calibration of the raw data, various data transformations, visualization, etc.
We restrict the explanation of the default behavior of Pegasus to its handling of
stage-out and clean-up jobs. Clean-up jobs remove files from remote execution
sites when these files are no longer needed by the workflow. By default, Pegasus per-
forms clean-up by moving files to a temporary directory (also on the remote site) and
then deleting this directory using Linux commands. For stage-out jobs, Pegasus invokes
the GridFTP [All12] data transfer service to transfer files from execution sites to archival
storage.
We modified Pegasus to call PDPS for both stage-out and clean-up jobs. The mod-
ified Pegasus creates requests for PDPS using a list of dependencies that are generated
by Pegasus for each compute job; this list includes the stage-out jobs necessary for the
successful completion of a compute job. Pegasus submits the clean-up and stage-out
calls to PDPS using non-blocking calls. The PDPS server then performs the clean-up and
stage-out operations. From the perspective of the modified Pegasus, the execution time
for a clean-up or stage-out job only includes the time to create the input parameters for
these jobs and the time to submit those parameters to PDPS.
Figure 3.10 shows the overall flow of control in the new framework. To assure a
stable execution environment, we provision resources in advance of the workflow exe-
cution using a tool called Corral [JDVM10]. Corral enables provisioning of resources
on shared infrastructure, providing a dedicated resource to the user. Corral uses Condor
“glidein” jobs to provision cluster resources and make them available to a user-level
Condor pool. Glideins are placeholder jobs that allow the temporary addition of a grid
resource to a local Condor pool by installing Condor daemons on the remote site. Users
can submit compute jobs directly to the remote resources without going through addi-
tional scheduling queues. After the resources are acquired by Corral, workflows are
submitted to Pegasus. Pegasus maps the workflow onto the provisioned Condor pool
and submits jobs (via Condor) to these resources. In the default case, both computa-
tional and data placement jobs are submitted to the pool. However, when using PDPS,
Condor is directed by Pegasus to send stage-out and clean-up jobs to PDPS.
Figure 3.10: Overall Control Flow (diagram: the submit host runs Pegasus and the Condor Manager; the Corral server provisions Condor glideins on the Linux cluster through Globus; compute jobs go to the glideins on the cluster nodes, while stage-out and clean-up jobs are sent to the DPS on the head node, which uses the GridFTP server.)
PDPS receives asynchronous requests from the workflow system and executes the
stage-out and clean-up jobs independent of the state of the parent workflow. PDPS
has four active components: the listening interface, the clean-up component, the stage-
out component, and the task thread-pool. PDPS receives incoming requests from the
workflow system on a listening interface based on Java RMI. On receiving a call, PDPS
invokes the appropriate component. For clean-up calls, PDPS parses the list of files to
be deleted and launches threads that invoke GridFTP [ABB+03] file delete calls. For
stage-out jobs, PDPS first parses the destination path and checks to see if the destination
directories exist. If so, then a third party transfer is invoked as a separate thread. If
Figure 3.11: Data Placement Service (diagram: the Pegasus interface receives clean-up and stage-out requests and feeds a job queue of transfer/delete instances that invoke GridFTP.)
the destination directory does not exist, PDPS recursively creates all directories up to
the parent directory of the destination file and then proceeds with the transfer. For both
types of calls, PDPS manages one central thread-pool. In the current implementation,
the number of concurrent threads is limited to 50. In all remote calls, PDPS uses the
GridFTP client API. The relevant architecture of PDPS is shown in Figure 3.11. The
current implementation does not use GSI security.
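The clean-up path can be sketched as follows: the list of files received from Pegasus is fanned out as individual delete tasks over a shared pool of fifty worker threads. As before, the class and method names are illustrative and the GridFTP delete call is stubbed behind an interface; this is a minimal sketch, not the actual PDPS implementation.

import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class CleanupDispatcher {

    // Stand-in for the GridFTP client API delete call used by PDPS.
    interface DeleteClient { void delete(String url) throws Exception; }

    private final ExecutorService pool = Executors.newFixedThreadPool(50);  // central thread pool
    private final DeleteClient client;

    public CleanupDispatcher(DeleteClient client) { this.client = client; }

    // Each file in a clean-up request becomes one delete task on the shared pool;
    // the GridFTP client API deletes files individually.
    public void submitCleanup(List<String> fileUrls) {
        for (String url : fileUrls) {
            pool.submit(() -> {
                try {
                    client.delete(url);
                } catch (Exception e) {
                    System.err.println("delete failed for " + url + ": " + e.getMessage());
                }
            });
        }
    }

    public void shutdown() { pool.shutdown(); }

    public static void main(String[] args) {
        CleanupDispatcher dispatcher = new CleanupDispatcher(url -> System.out.println("deleted " + url));
        dispatcher.submitCleanup(Arrays.asList("gsiftp://exec-site/scratch/run01/tmp1.dat",
                                               "gsiftp://exec-site/scratch/run01/tmp2.dat"));
        dispatcher.shutdown();
    }
}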
3.5.2 Workflows Evaluated
Next, we briefly describe the three scientific workflows used in our experiments. Table
3.1 shows the number of data placement jobs for all workflows in this chapter.
Montage
Montage [BDG+04b] is an astronomy application that is used to construct large image
mosaics of the sky. Input images are re-projected onto a sphere and overlap is calculated
for each input image. The application re-projects input images to the correct orientation
while keeping background emission level constant in all images. The images are added
by rectifying them into a common flux scale and background level. Finally the re-
projected images are co-added into a final mosaic. The resulting mosaic image can
provide a much deeper and more detailed understanding of the portion of the sky in
question.
Epigenomics
The Epigenomics workflow is a pipeline workflow [BCD+08]. Initial data are acquired
from an Illumina-Solexa Genetic Analyzer in the form of DNA sequence lanes. Each
Solexa machine can generate multiple lanes of DNA sequences. These data are con-
verted into a format that can be used by sequence mapping software. The mapping soft-
ware can do one of two major tasks. It either maps short DNA reads from the sequence
data onto a reference genome, or it takes all the short reads, treats them as small pieces in
a puzzle and then tries to assemble an entire genome. In our experiments, the workflow
maps DNA sequences to the correct locations in a reference Genome. This generates
a map that displays the sequence density showing how many times a certain sequence
expresses itself on a particular location on the reference genome. Scientists draw conclu-
sions from the density of the acquired sequences on the reference genome. Epigenomics
is a CPU-intensive application.
Broadband
The Broadband workflow provides earthquake scientists a platform to combine long
period deterministic seismograms with deterministic low frequency and stochastic high
frequency simulations. The Broadband platform assists users in selecting one or more
of the approaches it provides to simulate an earthquake with data from multiple sites.
Table 3.1: Number of Data Placement Jobs for all Workflows

Job Type     2-degree Montage   4-degree Montage   8-degree Montage   Epigenomics   Broadband
Clean-up            89                 100                124              73            94
Stage-out            9                  11                 11               9            16
Compute            126                 137                173              97           128
The goal is to improve ground motion attenuation models by generating and analyzing
seismograms for given sites and ruptures, including high and low frequency data gen-
erated by Broadband. These seismograms provide building engineers and earthquake
scientists predictions of future ground motions in different areas. In this chapter, we use a
Broadband workflow example that has two sources, four stations and one velocity file.
The Broadband workflow has a large number of relatively small files to stage-out and
clean-up.
3.5.3 Performance Evaluation
Experimental Setup
PDPS ran on an 8 core Mac Pro running OS X 10.6 with 24 GB RAM (Machine 'A').
Pegasus ran on a 4 core, 2.66 GHz Linux machine with 8 GB of RAM (Machine 'B').
The workflow tasks ran on a twenty-node Linux based cluster, with each node running
a Linux 2.6.23 kernel on a 4 core Intel Xeon 2.33 GHz CPU with 8 GB RAM. We provi-
sioned a virtual Condor pool on the execution resources to provide us with a controlled
and dedicated execution environment. A Condor manager is installed on the submit host
(machine B). Prior to workflow submission, Corral provisions 20 resources on the Linux
cluster to form the virtual pool.
Performance Metrics
Our experimental evaluation uses two sets of metrics.
1. Overall performance improvement for workflows.
The aim of our work is to improve the workflow execution time. To evaluate the
impact of the DPS on overall performance, we provide three performance mea-
surements for each workflow. First, we present a base case for each workflow
that does not perform any stage-out or clean-up operations but instead performs
No Operations (NOOPs). This measurement provides a lower bound for compar-
ison with standard Pegasus and modified Pegasus with PDPS. Next, we measure
the performance of the workflow using standard Pegasus to perform stage-out and
clean-up jobs. Finally, we measure the performance of the workflow when Pega-
sus requests PDPS to perform stage-out and clean-up operations. For each mea-
surement, we run the workflow five times and record the average runtime along
with the standard deviation in the measurements.
2. Improvement for clean-up and stage-out jobs
The other metric that we report is the improvement in the average execution time
of clean-up and stage-out jobs using PDPS. For some workflows, a significant
reduction in the execution time of these two types of jobs may not be reflected
in a significant reduction in overall workflow execution time. This is because
stage-out and clean-up jobs may not form a significant part of the total workflow
execution time.
Montage Results
Three different sizes of the Montage workflow were run: 2, 4 and 8 degrees square,
each run at least five times with and without the integration of PDPS. As the size of
the mosaic increases, so does the size of the workflow. The 2-degree square workflow
contains 675 tasks, the 4-degree square workflow 2860 tasks, and the 8-degree
square workflow 10,429 tasks. Since Pegasus uses task clustering for short-running tasks
[17], the executable workflows have 227, 252 and 327 Condor jobs, respectively.
For the first set of experiments, we ran all three sizes of the Montage workflow with
NOOP (no data placement operations submitted) as stage-out and clean-up jobs. We
assume these results serve as the lower bound in the overall workflow execution time.
Figure 3.12 shows the time taken for the workflows to complete. All the experiments
use 20 cluster resources.
Figure 3.12: Overall workflow runtime - Montage
Figure 3.12 also compares overall workflow execution time when running with and
without PDPS. We measure an average improvement of 12.6% in the case of the 8-degree
Montage workflow when using PDPS compared to the same workflow running
without PDPS. For the 2-degree workflow, the improvement is 13.2%. However, the 4-degree
workflow shows a performance improvement of only 0.8%. Table 3.2 shows the
I/O statistics for these workflows.
The number of data aggregation jobs in the 4-degree workflow is such that, in com-
parison with the 2- and 8-degree workflows, Condor is able to schedule the maximum number
Table 3.2: I/O Statistics for Montage workflows

                     2 Degree   4 Degree   8 Degree
Total staged out      3.3 GB     14 GB      51 GB
Total I/O             7.3 GB     20.9 GB    67.2 GB
Average file size     1.8 MB     13 MB      1.8 MB
of jobs and occupy the largest number of remote compute nodes simultaneously. As the
level of parallelism achieved during this workflow is high, the use of PDPS does not
significantly improve the overall execution time of this particular case.
Next, we present a profile of the individual stage-out and clean-up jobs for the 8-degree
Montage workflow with and without PDPS. Figure 3.13 shows the improvement
achieved for individual stage-out jobs with PDPS. We note that four data points stand
out in the log scale above the 100 seconds mark. These are stage-out jobs that contain
a large number of files. The time for these long-running stage-out jobs is significantly
reduced with the use of PDPS.
!"#$
#$
#!$
#!!$
#!!!$
#!!!!$
!$ %$ &$ '$ ($ #!$ #%$
!"#$%&'()&*+#,-'''
'.+/')*0.&'
1+2'34'
)*+,-.+$/01$
)*+,$/01$
Figure 3.13: 8Degree Montage stage-out jobs
Figure 3.14 shows the improvement for individual clean-up jobs using PDPS. The
average improvement is over 65%. We note that job IDs between 60 and 120 take signif-
icantly longer than the rest of the jobs; these jobs have a large number of files that need
to be cleaned up. We also note some data points where the execution time with PDPS is
longer than without it, which is due to PDPS overheads and the limit of 50 threads in
PDPS.
!"#$
#$
#!$
#!!$
!$ %!$ &!$ '!$ (!$ #!!$ #%!$ #&!$
!"#$%&'()&*+#,-''.+/')*0.&'
1+2'34'
)*+,-.+$/01$
)*+,$/01$
Figure 3.14: 8Degree Montage clean-up jobs
Figure 3.15: Average clean-up job Montage
Next, in Figure 3.15 we show the average improvement of clean-up jobs for Montage
workflows of 2, 4 and 8 degrees. We calculate the average execution time of clean-
up jobs for each workflow and then further average these measurements over at least
five runs of the workflow. Compared to Pegasus executing clean-up jobs itself, the
submission of clean-up jobs to PDPS reduces the execution time of clean-up jobs from
Pegasus's perspective by 86.4% to 88.3%.
Figure 3.16: Average stage-out job Montage
Figure 3.16 shows that the average performance improvement for stage-out jobs is
between 96% and 98%. This improvement occurs because, when using PDPS, Pega-
sus makes simple calls to hand off the stage-out jobs to PDPS rather than performing
the transfer jobs itself. The largest improvement is for the 8 degree workflow, where
the average stage-out time drops from 321.77 seconds to 10.53 seconds for 29,124 files
staged out.
Epigenomics Results
Table 3.3 shows results for the epigenomics workflow. We observe that the overall
workflow runtime improves by 5.47%, while the improvement for individual clean-up
jobs is over 82% and for the individual stage-out jobs is 69.44%.
Table 3.3: Epigenomics runtime comparison

Epigenomics              With PDPS (secs)   Without PDPS (secs)   Percent Improvement
Overall runtime               1879                1987.8                 5.47%
Overall runtime stdev         32.89               26.56
Clean-up (runtime)             0.4                 2.25                 82.22%
Stage-out (runtime)            0.44                1.44                 69.44%
NOOP Runtime: 1619.20 secs
The relatively small improvement in overall execution time reflects the relative pro-
portion of clean-up and stage-out jobs in the workflow. Most of the time consumed by
the workflow is spent on the computational jobs that map the DNA sequences to the
reference genome. We also observe relatively low standard deviation readings for each
data point measured. The average file size for the workflow was 7 MB. The clean-up
and stage-out jobs handled a total of 5.4 GB of data constituting 779 files. The total I/O
in the Epigenomics workflow was 9.3 GB.
Figure 3.17: Epigenomics stage-out jobs
Figure 3.17 shows the profile of individual stage-out jobs for one of the experimental
runs presented in Table 3.3. When the workflow is run without PDPS, we observe that
job IDs 190 through 250 take longer than the remaining jobs. During this part of the
workflow, there exist multiple file transfer tasks within each stage-out job. Figure 3.18
shows the corresponding number of files per stage-out job for the workflow shown in
Figure 3.17.
Figure 3.18: Number of files in each stage-out job
We also note some data points in Figure 3.17 that show relatively long execution
time for stage-out jobs with PDPS. This is because the compute node to which these jobs
were submitted was under high load. Resource contention at the compute node leads to
delays in execution and results in longer execution time recorded by Pegasus for those
jobs.
Figure 3.19 shows a profile of the individual clean-up jobs presented in Table 3.3.
The execution time of clean-up jobs with PDPS is mainly influenced by the current load
and resource contention on the executing compute node, because their main work is to
send requests to PDPS server. Clean-up jobs without PDPS delete associated files. The
files are first moved to a temporary directory, the expected free space is calculated and
reported to Pegasus for logging provenance, and finally the directory is removed with a
recursive directory remove command (rm -r). Thus, the execution time of clean-up jobs
without PDPS is mainly determined by the amount of data being deleted.
!"
#"
$"
%"
&"
'"
!" #!!" $!!" %!!" &!!" '!!" (!!"
!"#$%&'(#')&*+#,)'
-+.'/0'
)*+,-.+"/0/1"
)*+,"/0/1"
Figure 3.19: Epigenomics clean-up jobs
While the total I/O of the Epigenomics workflow was 9.3 GB, the data placement
tasks delegated to PDPS handled only 5.4 GB of data.
Broadband Results
The Cybershake Broadband workflow measurements are shown in Table 3.4. The
overall runtime of the workflow improves by 6.07% using PDPS as compared to Pegasus
executing clean-up and stage-out jobs. The remaining measurements show that clean-
up jobs perform 39.9% worse using PDPS, while stage-out jobs perform 97.9% better.
Clean-up jobs remove 57 MB of data spread across 295 files. Stage-out jobs move 208
files consisting of 51 MB of data.
For stage-out jobs, both PDPS and Pegasus issue third party transfer calls for each
placement job. Stage-out jobs handled by PDPS are faster from the perspective of Pega-
sus, since Pegasus simply submits the call to PDPS. Figure 3.20 shows the profile of
individual stage-out jobs for the Broadband workflow. Submitting stage-out jobs to
PDPS consistently takes under 1 second. Without PDPS, the stage-out job execution
times vary from under one second to over 120 seconds.
Table 3.4: Broadband runtime comparison

Broadband                With DPS (secs)   Without DPS (secs)   Percent Improvement
Overall runtime              1129.2             1202.2                6.07%
Overall runtime stdev         37.23             112.79
Clean-up                       4.31               3.08               -39.9%
Clean-up stdev                 0.431              0.33
Stage-out                      0.316             15.386               97.9%
Stage-out stdev                0.038             20.5
NOOP Runtime: 1098.2 secs
!"#$
#$
#!$
#!!$
!$ %$ #!$ #%$ &!$
!"#$%&'()&*+#,)-''
'.+/')*0.&'
1+2'34'
'()*+,)$-./$
'()*$-./$
Figure 3.20: Broadband stage-out jobs
The performance of clean-up jobs for a run of the Broadband workflow is shown in
Figure 3.21. The reason that clean-up jobs using PDPS perform 39.9% worse than when
Pegasus executes those jobs is because of a flaw in our integration of PDPS and Pegasus.
Unlike the case of stage-out jobs, where the submit host issues those jobs directly to
PDPS, our implementation submits clean-up jobs to the remote execution host, which
then invokes the clean-up calls to PDPS. As a result, these clean-up jobs incur overhead
while waiting in queues to be released to remote resources and on the execution node's
local scheduler. In addition, because the Broadband compute jobs are I/O intensive, they
generate a high load on the Network File System (NFS) shared by the remote execution
nodes; this can delay the execution of clean-up jobs on the remote execution host, since
these jobs invoke a java binary loaded from NFS that submits the clean-up job to PDPS
server. This flaw has since been corrected in the implementation of clean-up jobs.
!"#$
#$
#!$
#!!$
!$ %!$ &!$ '!$ (!$
!"#$%&'()&*+#,-''
.+/')*0.&'
1+2'34'
)*+,-.+$/01$
)*+,$/01$
Figure 3.21: Broadband clean-up jobs
Summary of experimental results
We have shown that delegating clean-up and stage-out jobs to a data placement service
can reduce the execution time of scientific workflows. We evaluated the performance of
three workflows with and without PDPS. The greatest improvement is seen for I/O-intensive
workflows such as Montage. For workflows that clean up a large number of small files, the time
taken by Pegasus to execute the clean-up jobs may be less than the time taken by PDPS,
because the GridFTP client API forces PDPS to delete files individually.
In the following section, we evaluate the use of the Policy Based Data Placement
Service to manage data stage-in jobs. This work requires Pegasus to poll PDPS to
determine when stage-in operations are complete so that job execution can proceed.
3.6 Interfacing Non-Leaf Jobs with PDPS
In this section, we present the interface between the Policy Based Data Placement Service
and the Pegasus WMS. We present comparative results of different placement policies when
enforced on three types of workflows. We demonstrate the impact that PDPS has on
overall workflow execution time. We show that asynchronous data placement for work-
flows like Montage has a significant impact in reducing the overall runtime of those
workflows.
3.6.1 Integrating Pegasus and PDPS - Non-Leaf jobs
By default, Pegasus plans workflows based on the compute jobs it reads in from an
abstract workflow. Pegasus creates and adds data management jobs such as stage-in,
stage-out, and data clean-up jobs to the workflow in order to produce an executable
workflow. The dependencies between the data management and compute jobs are cre-
ated by Pegasus itself. It provides the capability of running these new jobs both locally
on the Pegasus host as well as remotely on compute resources. In either case, the
Pegasus WMS schedules and monitors these jobs, and their execution contributes to
the overall runtime of the workflow. In our framework, Pegasus hands over these jobs
to PDPS for execution, thus relieving the Pegasus WMS of executing and monitoring them.
PDPS may use existing transfer agents such as FTP, scp [Ope11], Globus GridFTP
[All12, AMP05], Fast Data Transfer [fas12], etc. For our current set of experiments, we
utilize the Globus GridFTP client [ABK+05, All05a] to execute these data staging jobs. We
now discuss the default Pegasus behavior and the changes that we have incorporated in
it to communicate with PDPS.
Stage-in jobs
By default, Pegasus creates all stage-in jobs globally at the first level of the workflow.
After the creation of the workflow running directory at the remote site, these are the
first set of jobs released by Pegasus WMS for execution. In the default case, Pegasus
creates transfer jobs for stage-in operations and invokes globus-url-copy [All05a] to exe-
cute them either directly or in third party mode, depending on the choice of running
transfer jobs locally or remotely. The running time of these jobs is dependent on the
size of data being staged in, the available bandwidth and the current load on the source
and destination hosts.
We modified Pegasus to call PDPS for executing stage-in jobs. Pegasus is provided
the type of policy to be enforced as an input parameter to the workflow planning process.
Pegasus creates a list of all stage-in jobs with their related compute jobs and the physical
attributes of files and hosts involved in the stage-in job. Pegasus sends this list to PDPS
along with the choice of the data placement policy to be enforced. In return, Pegasus
receives a list of the compute jobs and their recommended priorities for scheduling. As
data placement is handed off to PDPS, Pegasus also changes all stage-in jobs to NOOP
(no operation) jobs. In addition, Pegasus appends a pre-job script to all compute
jobs that depend on stage-in jobs. Because there is no guarantee that the required
stage-in jobs have completed, the affected compute jobs employ a pre-
compute-job script that polls a status database that contains the current state of each
transfer (in progress, successfully completed, or failed). The pre-compute-job continues
to poll until all the input files for the specific compute job are successfully transferred
to the correct execution host; then the pre-job returns successfully and the compute
job executes immediately. This reduces or completely eliminates the queue wait time
between the completion of a stage-in job and the start of a compute job. In addition,
Condor now schedules compute jobs based on the priority assigned to them by Pegasus.
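As an illustration, the polling logic of the pre-compute-job script can be sketched as follows. This is a minimal sketch in Python under stated assumptions, not the actual implementation: the query_state helper, the state strings, and the polling interval are placeholders for the status-database lookup described above.

import time

POLL_INTERVAL = 10  # seconds between polls; an assumed value


def wait_for_inputs(input_files, query_state):
    """Block until every input file is marked as transferred in the status DB.

    query_state(file_name) is a hypothetical helper that returns the current
    transfer state ('in progress', 'successfully completed', or 'failed') for
    one file, e.g. via a SELECT against the MySQL status database.
    """
    pending = set(input_files)
    while pending:
        for file_name in list(pending):
            state = query_state(file_name)
            if state == "successfully completed":
                pending.discard(file_name)
            elif state == "failed":
                raise RuntimeError("stage-in failed for %s" % file_name)
        if pending:
            time.sleep(POLL_INTERVAL)
    # Returning lets the pre-job exit with status 0, so the dependent compute
    # job starts immediately after its last input file arrives.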
Stage-out and Clean-up jobs
Pegasus submits the clean-up and stage-out jobs to PDPS for execution. These job
definitions are modified such that when executed, the jobs send input parameters to
PDPS in a non-blocking call and exit successfully. PDPS then executes the clean-up and
stage-out operations asynchronously. From the perspective of the modified Pegasus, the
execution time for a clean-up or stage-out job only includes the time to create the input
parameters for these jobs and the time to submit those parameters to PDPS.
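The hand-off for these jobs can be pictured with a minimal sketch. The request format and the submit_to_pdps callable are hypothetical stand-ins for the actual non-blocking PDPS call; only the fire-and-forget behavior corresponds to the description above.

def submit_placement_job(job_type, file_list, policy, submit_to_pdps):
    """Send a clean-up or stage-out request to PDPS and return immediately.

    submit_to_pdps(request) is a hypothetical non-blocking client call; PDPS
    performs the actual transfer or deletion asynchronously, so from the
    perspective of Pegasus this job finishes as soon as the parameters have
    been handed over.
    """
    request = {
        "type": job_type,    # "clean_up" or "stage_out"
        "files": file_list,  # physical file names involved
        "policy": policy,    # placement policy to enforce
    }
    submit_to_pdps(request)  # returns without waiting for completion
    return 0                 # the job exits successfully right away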
3.6.2 Stage-in Policies
We compare two basic data placement policies to show their impact on the overall runtime
of the workflows. The policies we test are effective for scheduling stage-in jobs through
PDPS, and they show that allocated remote resources do not have to wait idle for
stage-in jobs any longer than necessary. While more complex policies can be implemented
and enforced, the scope of this study is simply to demonstrate the advantage of policy
based data placement. Although the following two sections are not a comprehensive list
of possible placement policies, we choose these two to show that placement policies may
be based on various criteria such as data or workflow characteristics.
Job Dependency Based Priority
This policy takes advantage of underlying information from the structure of the abstract
workflow. Typically, compute jobs require input data and produce output data. Depend-
ing on how the workflow is constructed, input data may come from stage-in jobs or be
made available as output from a parent compute job. There may be multiple stage-in
jobs that provide data for a single compute job. This condition is the result of the round-
robin selection and population of transfer jobs into clusters. While the data transfers are
optimized in this process, the stage-in jobs remain disconnected from any knowledge of
the dependent compute jobs. The job dependency priority policy takes this information
into account and prioritizes stage-in jobs such that stage-in jobs with a higher number
of dependent compute jobs are given a higher priority. Thus, when transfer jobs are
scheduled by PDPS, the stage-in job with the largest fan-out dependency is the one that gets
scheduled first. We also test the opposite policy, where the lowest priority is given to
stage-in jobs with the highest number of dependent compute jobs.
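A minimal sketch of this prioritization is given below. The job and dependency structures are hypothetical simplifications of the information Pegasus sends to PDPS; only the ordering logic corresponds to the policy described above.

def prioritize_by_fanout(stage_in_jobs, dependents, descending=True):
    """Order stage-in jobs by the number of dependent compute jobs.

    stage_in_jobs : list of stage-in job identifiers
    dependents    : dict mapping a stage-in job id to the set of compute jobs
                    that consume the files it transfers
    descending    : True schedules the largest fan-out first (the policy
                    described above); False tests the opposite policy.
    """
    ranked = sorted(stage_in_jobs,
                    key=lambda job: len(dependents.get(job, ())),
                    reverse=descending)
    # PDPS can then assign scheduling priorities in rank order, highest first.
    return {job: rank for rank, job in enumerate(ranked)}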
File Size Based Priority
The second policy we test is based on the size of the individual files in all the stage-in
jobs. This policy looks at the file size information and prioritizes the transfers by size,
in either ascending or descending order. Because bundled stage-in jobs are populated with
file transfers by Pegasus in round-robin fashion, this policy places file transfers from different
stage-in jobs into a priority queue that often violates the bundling done by Pegasus. Thus
it is possible for a single stage-in job to contribute files towards both the highest-priority
as well as the lowest-priority transfer groups.
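The corresponding ordering can be sketched in the same hypothetical terms; note how it flattens the Pegasus bundles into a single per-file queue, as discussed above.

import heapq


def file_size_priority_queue(stage_in_jobs, ascending=True):
    """Build a per-file transfer queue ordered by file size.

    stage_in_jobs : dict mapping a stage-in job id to a list of
                    (file_name, size_in_bytes) tuples
    The heap ignores the original job bundling, so files from different
    stage-in jobs may end up in the same priority group.
    """
    heap = []
    for job_id, files in stage_in_jobs.items():
        for file_name, size in files:
            key = size if ascending else -size
            heapq.heappush(heap, (key, file_name, job_id))
    return heap  # popping in order gives the transfer schedule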
3.6.3 Workflows Evaluated
In this section we briefly describe the workflows used in our experiments. We considered
three real workflows for this study. These were Montage, Epigenomics and Broadband.
However, after running some initial experiments, we established that Epigenomics and
Broadband did not delegate enough data management to Pegasus to be able to gain any
benefit from PDPS during the stage-in operations. While the total I/O was significant
for these workflows, the actual stage-in or stage-out jobs handed over to Pegasus were
a much smaller fraction. The partition of Broadband that we ran used 17 GB of disk I/O,
of which only 112 MB was exposed to the Pegasus WMS for creating data placement jobs.
In the case of Epigenomics, 95% of the time spent by the workflow is computation time.
That leaves only 5% for data management operations, within which any improvement
does not appear significant. Bharathi et al. [BC09b] categorize Montage as a data
intensive workflow. As our experiments focus on data placement for workflows, we
select Montage as the base workflow in our experiments.
Montage Workflow
As described in earlier sections, Montage [BDG+04b] is an astronomy application that
is used to construct large image mosaics of the sky. Input images are reprojected onto a
sphere and overlap is calculated for each input image. The application re-projects input
images to the correct orientation while keeping the background emission level constant in
all images. The images are added by rectifying them to a common flux scale and
background level. Finally, the reprojected images are co-added into a final mosaic. The
resulting mosaic image can provide a much deeper and more detailed understanding of the
portion of the sky in question. Montage can be categorized as a data distribution and
aggregation workflow [BCD+08]. A Montage workflow of 2, 4, or 8 degrees indicates the
size squared of the mosaic of the sky being constructed.
Synthetic Workflows
The Montage workflow is I/O intensive. An 8 degree Montage workflow receives 4.2
GB of data as input and then creates up to 67.2 GB of data during execution. In
our experiments, this data is read and written on a shared NFS file system. Compute
jobs running on different Condor execution slots engage in read/write access contention
during the workflow. This behavior is pronounced when the workflow has a significant
number of similar jobs on the same level that are trying to access data from the shared
NFS disk. In these situations, disk I/O is the largest contributing factor to the time
spent by each job in execution. For example, mProjectPP jobs, which execute at the third
level of the executable workflow, have an average runtime of over four seconds while
executing in an 8 degree Montage workflow. However, if an individual mProjectPP
job is run locally on the execution host without any NFS contention, then the average
runtime is less than one second. The effect of this contention is twofold. First, the
overall workflow execution time is significantly increased. And second, the percentage
contribution by data stage-in and stage-out operations to the overall workflow execution
time is reduced.
In order to clearly demonstrate the effect of policy based data placement, we create
two synthetic workflows that are modeled on the original Montage workflow. In the
first synthetic workflow, for each job in the original Montage workflow, the original job
names and job IDs still remain in effect. However, during execution, all jobs are directed
to execute the keg [Peg12] binary (or executable file) instead of the respective Montage
binaries (or executable files). We also set the input parameters of the workflow so that
disk I/O during the workflow execution is reduced. This is achieved by restricting keg to
read in only 10 MB of input per execution invocation. The reduction of disk I/O during
workflow execution and the resulting reduction of the compute times in the synthetic
workflow magnify the effect of policy based data placement and provide us a better
understanding of the effect of data placement operations in the overall runtime of the
workflow.
The second synthetic workflow also follows the Montage workflow, but has 26.16
GB of input data that is distributed on three levels of the workflow. The first level of the
workflow has 16.16 GB of input files, the fifth level has 8 GB, and the ninth level has 2
GB of input. However, when the workflow is planned by Pegasus, all stage-in jobs are
grouped at the start of the workflow.
Employing Policies for Stage-in Jobs
In order to segregate the effect of policy based data placement for stage-in jobs and data
placement in general for stage-out and clean-up jobs, we briefly explain the reasoning
behind why an executable workflow is not affected if PDPS enforces data placement
policies on stage-out and clean-up jobs. Figure 3.22 shows a partial view of an
executable Montage workflow.
Figure 3.22: Partial View of Executable Montage Workflow
After Pegasus constructs the executable workflow, first level
jobs from the abstract workflow now lie at the third level of the executable workflow.
The first level is now occupied by a directory manager job that creates workflow run
directories on the remote resources. The second level is occupied by all stage-in jobs.
Pegasus stage-in jobs are created globally, i.e., regardless of a compute job's level in
the executable workflow, its data will be staged in at the second level of the executable
workflow. Depending on the time taken by the stage-in jobs to complete, compute jobs
may start relatively earlier or later. On the other hand, stage-out and clean-up jobs are
created and scheduled at every level of the workflow. Unlike the case of stage-in jobs,
no further job is dependent on the actual completion time of stage-out or clean-up jobs.
Regardless of the placement policy enforced, at each level, the jobs will exit after hand-
ing off requests to PDPS. Placement policies enforced on these jobs do not influence
the overall runtime of the workflow. Therefore, in this section, we present the policy
enforcement results of ‘only stage-in’ workflow runs.
3.6.4 Performance Evaluation
Experimental Setup
To ensure a stable execution environment, we dedicate resources to a single Condor pool.
The Condor central manager and Pegasus run on the submit host. Condor execute nodes
are allocated from five Linux servers varying from 2 GHz CPUs with 1 GB of RAM per
CPU to 3 GHz CPU cores with 2 GB of RAM per CPU. All Linux servers run the 64-bit
2.6.4 version of the Linux kernel. The submit host is an x86 Mac Pro with 8 cores at 3 GHz
and 24 GB of RAM running OS X 10.6. PDPS comprises two machines: the status
DB is a MySQL [MM07] server instance running on a quad-core Intel Xeon based Linux
server running Linux 2.6.4, and the PDPS engine runs on an 8-core x86 Mac Xserve with
40 GB of RAM. The storage is housed on a dedicated 4-core, 2 GHz, Intel Xeon based Mac
Xserve running OS X 10.5.8. Figure 3.23 shows the relative placement of our experimental
hardware between ISI at Marina del Rey and USC in Los Angeles.
Figure 3.23: Experimental Setup - Physical Placing
For all our experiments we assign a cluster factor of 32 to stage-in jobs. Pegasus
creates 32 stage-in jobs in all cases and populates these clusters with all the available
individual stage-in jobs. The total number of stage-in jobs remains constant at 32. We
cluster compute jobs into groups of 16 jobs each. Here, Pegasus places at most 16
compute jobs in each clustered job. The total number of compute jobs will depend on
the number of clustered jobs available. Clusters are created based on workflow levels.
Therefore, some jobs such as mAdd will not be clustered as there is only one job in the
entire level. In the case of an 8 degree Montage workflow, 13275 individual compute
jobs are grouped into 836 clustered jobs.
We conducted two sets of experiments. In the first set, the impact of policy based
data placement on stage-in jobs for the Montage workflow is recorded. In this case,
Pegasus creates regular stage-in jobs for the workflow but does not create any stage-out
or clean-up jobs. Data placement tasks such as stage-out and clean-up take a consider-
able amount of time during workflow execution. In order to clarify the effect of policy
enforcement on stage-in jobs, we restrict the workflows to only creating stage-in jobs.
This provides us a measure of the time it takes for the workflow to complete with vari-
able scheduling of stage-in jobs. We record the runtimes for a default Pegasus run, and
one each for all policies tested. We run each case at least three times and present aver-
aged numbers in reported data points. The second set of experiments performs the same
measurements as the first set, but on the synthetic workflow.
In both cases we compare the runtimes against the lower bound measurement of
workflow runs where data is pre-staged and the workflow does not execute any clean-up
or stage-out jobs. We refer to this case as the pre-staged case in our text. We also draw
comparison against the default Pegasus case where Pegasus handles all data manage-
ment operations by itself and PDPS is not involved. The pre-staged case provides us a
realistic lower bound for the operations whereas the default Pegasus case provides us
relative comparison of the different policies we enforce for data placement.
Montage Workflow
Figure 3.24 shows overall runtimes when stage-in jobs are handled by PDPS, as well as
the default and pre-staged cases. The average difference between the two cases is 290
seconds. The average total time taken by stage-in jobs in the default case is 265 seconds.
Ideally, these two numbers should be very close to each other. Stage-in operations for
8 degree Montage constitute an average of 5% of the workflow execution time. We make
note of this, as the impact of data placement policies will reside within this 5% of the
overall workflow execution time. We compare four policies with the default Pegasus case
and the pre-staged Pegasus case. We discuss the 8-degree Montage workflow, as that is
significantly data intensive.
In Figure 3.24, for the pre-staged case, Pegasus creates a single symlink job in place
of data stage-in jobs. The average execution time of the symlink job is 1.6 seconds.
In the default case, Pegasus creates and executes the stage-in operations itself and no
interactions with PDPS are invoked.
We observe that the File Size (Ascending) policy shows an average runtime of 4793
seconds, which is 0.8% less than the average default case. This is because when PDPS
receives the request from Pegasus, Pegasus is still in the process of planning the work-
flow. Pegasus has to complete planning, submit the workflow for execution and then
wait for DAGMan to release jobs to Condor for scheduling. While these actions are
being executed in Pegasus, the file size ascending policy is invoked in PDPS. PDPS
starts transfer jobs before Pegasus starts the workflow execution. By the time Pegasus
execution reaches the first compute job (mProjectPP), all small files (which are header
files) are already transferred in. When sets of data files that are required for a particular
compute job are successfully transferred, the dependent compute job can start execution.
Figure 3.24: 8-Degree, Montage Overall Runtime with Stage-in Jobs Handled by PDPS
In the case of the File Size (Descending) policy, PDPS transferred the largest files first,
and thus the smaller files (header files) were transferred last. As header files are required
by all mProjectPP jobs, these jobs had to wait for all the data files as well as header
files to be successfully transferred before starting execution. During the waiting period,
mProjectPP jobs continue polling the status DB for the successful completion of the
parent stage-in jobs. As a result, notwithstanding the early head start in transfer jobs by
PDPS, this policy forces the workflow to take longer than the default case to finish.
When comparing the policies where priorities are set based on the number of child
compute jobs of the stage-in jobs (Fanout), we observe that if higher priority is given
to stage-in jobs with a larger number of dependent compute jobs, the workflow runtime
is longer, while in the case where the job with the least number of child nodes gets the
highest priority, the runtime is relatively shorter. This result is counter-intuitive and specific to
the Montage workflow. It depends on the way the abstract workflow is constructed. If
a certain file is common input to a number of compute jobs, intuitively, it should be
transferred early. However, if the remaining files required for the first set of mProjectPP
jobs, that have been scheduled early by DAGMan, are still in queue because their stage-
in jobs did not have enough child nodes, then even though the most popular stage-in
jobs have been successfully executed, the mProjectPP job will continue to wait for its
remaining parent stage-in jobs to complete. We observe this effect while comparing
the fan-out ascending case with the fan-out descending case in Figure 3.24.
Synthetic workflow (4.2 GB dataset)
While the maximum overall improvement seen in the Montage workflow between the
default Pegasus case and the PDPS based cases was less than 1%, we show that PDPS
is of significant benefit for workflows that have significant stage-in and stage-out data
and spend a sizable portion of the overall workflow execution time in conducting data
staging operations. As explained earlier, these workflows are based on the Montage
abstract workflow with the Montage executable binaries being replaced by keg.
The workflow is run using the original data files from the Montage workflow. We
draw comparison between the results of the synthetic workflow and the real Montage
Figure 3.25: Synthetic Workflow - 4.2 GB Dataset
workflow. Figure 3.25 shows the overall runtimes for the Synthetic workflow where the
input data set is that of the Montage workflow. The overall runtime difference between
the pre-staged case and the default Pegasus case is 239 seconds. This is the time taken
by the workflow to successfully complete the stage-in jobs and represents about 14% of
the overall workflow time in the default case. We note an overall runtime improvement
of up to 7.8% in the fan-out ascending case when compared to the Pegasus default case.
Additionally, the fan-out descending case performs 0.8% worse than the default Pega-
sus case. For the size-based policies, size ascending performs 0.9% better while size
descending performs 2.2% worse than the default case. As the time spent by the Syn-
thetic workflow on stage-in operations constitutes a greater portion of the overall workflow
execution time as compared to Montage, the resulting benefit and loss is proportionally
greater. Gaining a 7.8% improvement in the overall workflow execution time by
only altering the stage-in operations is significant. In ongoing work, we have extended
the synthetic workflows to cover a much greater variety of workflow types and expect
to establish relations between the data placement policies and classes of the executable
workflows for each of the Synthetic workflows.
Synthetic workflow (26.16 GB dataset)
The advantage or disadvantage of using a particular data placement policy becomes
more obvious when we use a workflow that does not follow one precise type of DAG
such as Montage. In order to demonstrate this, we create a new synthetic workflow. This
workflow resembles Montage in its abstract characteristics [BCD+08], but possesses
different data staging characteristics. In the first level, the workflow consists of 16 jobs.
16.16 GB of data, consisting of 1 GB and 10 MB files, is required as input at this level. 8 GB
of data is required for the 5th level of the workflow; this consists of 16 files of 500 MB
each. Finally, four jobs on the 9th level of the workflow have an input of a 500 MB file
each. Although the data input requirement of the workflow is distributed over the depth
of the workflow, the executable workflow is planned with the default Pegasus behavior.
This results in grouping the entire 26.16 GB data input into stage-in jobs at the top of the
executable workflow. Figure 3.26 shows the same abstract workflow with only 8 jobs at
the first level. Jobs shown in red represent those that require input data to be staged in
from storage sites. The number of jobs in the remaining workflow is a function of the
number of first level jobs. In our experiments, there are 16 jobs at the first level and the
workflow comprises a total of 60 compute jobs.
Figure 3.26: Partial Synthetic Workflow DAG
In Figure 3.27 we observe that both size-based policies have no substantial influence
on the overall workflow performance. The workflow is so constructed that there are
multiple sizes of files required as input by the first level of the workflow. Both the 1
GB input and the 10 MB input files are required by the first level jobs. Thus, regardless
of the size priority, the executable workflow has to wait for the entire input data to be
completely staged in before it can start to execute. Another factor that affects the time
taken by PDPS to complete transfers is that the number of GridFTP transfer connections
is throttled to a maximum of five during these experiments. On the other hand, when the
same data placement jobs are executed by Pegasus WMS in the default case, there is no
throttling of the number of GridFTP transfer invocations at the remote site.
Figure 3.27: Synthetic workflow (26.16 GB dataset), overall runtime in seconds
In the case of the fan-out or job dependency based policy, we observe a large difference
between the ascending and descending cases. When stage-in jobs with the least
number of dependent compute jobs are scheduled early, then the stage-in jobs at the first
level, which have at most two dependent compute jobs, are scheduled early. On the other
hand, when fan-out descending is used, the stage-in jobs at the 9th level are scheduled
early, followed by those at level five, and finally the level-1 stage-in jobs are executed. This behavior
demonstrates the overall behavior that can result when certain data placement policies
are applied to particular types of executable workflows. While the fan-out ascending
policy performs better than the default case, the fan-out descending policy performs
worse than any other policy tested. The maximum improvement of the overall runtime
of 8:7% was achieved by using the fan-out ascending policy in PDPS. The maximum
degradation in performance was 18:6% and was observed in the fan-out descending
policy.
Summary of Experimental Results
We have run three different categories of workflows and presented results for each. We
have successfully shown that asynchronous data placement for workflows can signifi-
cantly improve or degrade performance for data intensive workflows. We have demon-
strated that the impact of scheduling stage-in jobs based on policies enforced by PDPS
will depend on the workflow's data management characteristics. In the case of the Montage
workflow, off-loading stage-in jobs showed an improvement of less than 1%, whereas
a 7.8% improvement was shown by the synthetic workflow that had similar data but dif-
ferent execution characteristics from Montage. The second synthetic workflow showed
an improvement of 8.7% for the fan-out ascending policy. The size based policies did
not show any notable effect on the workflow because of the workflow's data characteris-
tics. The fan-out based policy showed both improvement and degradation based on the
flavor of the policy invoked.
Chapter 4
SDAG
After analyzing the experiments listed in the previous chapters, we conclude that the
default manner in which Pegasus handles data transfers is fairly efficient and attempts to
keep the network bandwidth fully occupied during the data stage-in processes. The
amount of improvement we achieved over the default data management by Pegasus
WMS was related to the amount of queuing delay saved for stage-in jobs and some
additional time that PDPS got as head-start for stage-in transfers over the Pegasus default
case. The evaluation was done on a set of real world workflows that presented us with
a precise set of workflow characteristics. The improvements that we reported were
therefore instance specific. In other words, the reported results were specific to only the
actual workflows that were evaluated. Although the examined workflows represent a
diverse set of workflow characteristics, they are still too specific to allow reasonably
generic inferences about the WMS tool being tested.
Over the past decade, WMSs have evolved into sophisticated applications them-
selves that leverage the capabilities of different independent tools. Additional capabili-
ties or modules are continuously being developed and being made available to WMS for
integration. These additional capabilities stem from requirements of actual real world
workflows. As a result, the design tradeoffs that guide the evolution of a WMS reflect only
a small set of existing applications. The paucity of these applications could lead to sub-
optimal WMS design. We are concerned with these influences and investigate solutions
to circumvent them in the design and development of workflow tools.
Deriving concepts for workflow technologies based on real workflows leads to devel-
opment of tools that are tailored for specific real world workflows. Thus, opportunities
to further improve the overall process are not even presented to the designers of the
WMS. For example, Pegasus was born out of the GriPhyN [DKM01] project. Although
data management in Pegasus was designed to be as generic as possible, GriPhyN influ-
enced the design. The current default for Pegasus is to group all data stage-in tasks at
the start of executable instances of workflows by default. Such a path leads to design
peculiarities that are influenced by a set of known types of workflows. It may be the
case that a particular workflow application benefits from distributing stage-in jobs along
the depth of the workflow or a combination of speculative staging-in whereas the WMS
being employed defaults to grouping all stage-in jobs at the start of the workflow. This
ultimately leads domain scientists who use these WMS to author their workflows to
match the WMS’s design parameters.
If a set of generic workflows that did not follow particular patterns in data or con-
trol were to be made available to WMS design engineers, then influences from any one
particular real world application would diminish, resulting in design criteria that are
more generic and able to cater to a large variety of existing and future work-
flows efficiently. A requirement for a set of workflows that do not follow any precise
data or control pattern is thus identified. To be more specific, a tool for generating syn-
thetic, well-formed workflow descriptions that can produce error-free executable synthetic
workflows is now a requirement.
In this chapter, we identify the lack of evaluation data for WMS design and present a
tool, SDAG, that enables scientists to overcome that deficiency. We evaluate SDAG and
present an analysis of the statistical properties of WFWs produced by it. The remain-
der of this chapter is organized as follows: in section 4.2, we present the motivation behind
the questions we address. Section 4.3 presents SDAG design criteria and architecture.
We analyze SDAG in section 5.1.
4.1 Well Formed Workflows (WFW)
A synthetically generated well-formed workflow, or WFW, is an error-free representa-
tion of an abstract workflow [DBG+03] or directed acyclic graph (DAG) that can be
presented to a workflow planner and converted into an executable workflow [DBG+03].
A WFW contains all required information such as job descriptions, input/output file
names and attributes, parent/child dependencies, and any other related attributes such as
file sizes. In section 4.3.1, we briefly discuss parameters that characterize a work-
flow and how these may be used to generate a synthetic WFW in order to search the
space around a particular real world workflow while keeping within reasonable param-
eter bounds.
4.2 Motivation
In this section, we start by briefly describing a few WMSs and highlighting design dif-
ferences between them. We highlight the unique functional capabilities of these WMSs
and argue that these capabilities stem from functional requirements of actual workflow
applications. Finally, we argue that current WMS designs influence design of future
workflow applications.
4.2.1 Workflow Systems
There are multiple WMS projects that aim to serve different segments of the scientific
community. While all these systems have the common goal of running many jobs for a
user in an automated manner, each of these systems possesses unique capabilities that
are tailored to serve a particular niche in the scientific community. For example, while
Kepler [ABJ+04] as a workflow system is aware of the executables used in its work-
flows, Pegasus requires no domain knowledge of the executables in the workflow during
either the planning or the execution stage. Kepler workflows are modeled as actors that
represent actions and connections that represent data flows. On the other hand, abstract
Pegasus workflows specify only compute jobs and their interdependencies, with data
management jobs being added to the abstract representation during the planning pro-
cess. Similarly, while Kepler is a Graphical User Interface (GUI) driven platform, Pegasus
remains command line interface (CLI) driven. Other systems like Taverna [OLK+07a],
Triana [TSWH07b], and ASKALON [FJP+05] all have their unique functional peculiarities,
which lean towards the scientific domains that they serve.
Taverna [OLK+07a] workflows are composed of processors, software compo-
nents that contain input and output ports. Data dependencies are created using links
between output and input ports of processors. A processor executes an activity that can
be a software invocation or an independent workflow invocation. This feature allows
for dynamic workflow specifications where the binding of activities can be done during
the execution of the workflow. However, in Pegasus, where an executable workflow was
previously (very appropriately) referred to as a concrete workflow, the concept of late bind-
ing can be achieved only through a workflow of workflows and not in single independent
workflows.
ASKALON is a Grid Application Development and Computing Environment
[FJP+05]. It provides a UML-based GUI for modeling (using Teuta [PQF12]), an abstract
Grid Workflow Language (AGWL), and scheduling, execution, and resource management with
monitoring and analysis services in one complete environment. The goal of ASKALON
is to simplify the development and optimization of distributed scientific applications. The
underlying use of Globus tools is abstracted away by the overlaying environment.
We highlight the ability of ASKALON to provide a workflow development tool,
as opposed to Pegasus, where developing a workflow is a highly technical task possibly
requiring collaboration between domain scientists and workflow engineers to produce
a workflow definition. Again, this is not a drawback in Pegasus, as the granularity of
control required by users may not be available in ASKALON but is possible to achieve in
Pegasus.
Triana [TSWH07b], yet another example workflow environment, was born out of
the GEO 600 Project for detecting gravitational waves [TS98]. The emphasis of Tri-
ana is to use multiple middleware types and services. This enables users to author and
execute workflows across multiple environments that are exposed by the respective mid-
dleware [TSWH07a]. Triana sets itself apart from those mentioned above by trying to
enable the use of multiple types of execution environments for its users.
All the above WMS are projects in progress that are constantly evolving into bigger
and better products. The typical testing and evaluation for the enhanced capabilities
are unit and stress testing. Border and error cases are created by hand and individually
tested. The impact that such testing and evaluation has on the future of the product is
discussed in some depth in the following section.
4.2.2 Design coupling between applications and WMS tools
Although we draw comparisons with Pegasus, in general, WMS are presented with
numerous choices which place them into categories where the absence or presence of
features/capabilities are a manifestation of the overall goal of that WMS. The plethora of
choices present many important questions for both workflow system designers as well as
scientists who are moving their work into workflow environments. Addressing all such
questions is out of the scope of this paper and readers are referred to work by Deelman
[Dee07] for a terse summary of challenges faced by workflow systems.
We choose to address one aspect of these multiple choices: How can we evaluate
workflow performance for any particular domain, across different workflow environ-
ments, using different workflow tools? Can this evaluation lead to better platforms in
the future?
To answer these questions, we turn to the current development and evaluation strate-
gies being used. Tools and components for WMS are continuously being developed and
added to produce enhanced capabilities. The testing and development of these is gen-
erally limited to the use of real workflows run on the respective WMS. By this we
mean that when a workflow component or tool is under development, its testing is based
on real workflows that have previously been executed on that WMS. This leads to devel-
opment of tools and components that are directly influenced by those real workflows. As
a result, new workflow applications are influenced by the existing design of the WMS.
For example, in earlier versions of Pegasus (2.4), applications were forced to stage in all
input data at the start of the workflow regardless of the DAG level to which the data input
dependency was tied. This was a design criterion tied into Pegasus from the GriPhyN
project [DKM01]. However, the current release of Pegasus does provide the capability
of stage-in jobs being distributed along the depth of the executable workflow.
Although recent work emphasizes the importance of a catalog of abstract workflow
descriptions that can be used across workflow systems [GG11] [KKK+11], we argue
that the number and variety of real workflows is not large enough to confidently test
that the number and variety of real workflows is not large enough to confidently test
and evaluate the design and development of WMS tools and components. In fact, we
created two synthetic workflows by hand to evaluate PDPS, as discussed in the previous
chapter. WMS designers need richer datasets to evaluate design tradeoffs that a smaller
number of real workflows may introduce. These rich datasets can provide a platform
for WMS designers for analyzing and comparing various WMS capabilities confidently.
Our concern is that of providing a large enough dataset (of workflows) to confidently
evaluate workflow tools and components that are in the design and development phase.
The ability to compare the performance of workflow tools and components across a
large spectrum of WFWs enables developers to look into possible future workflow
manifestations, and design tools and components that serve them equally as well. Given
such datasets, the testing and evaluation of workflow tools and components can be made
independent of real workflow applications.
Although random workflow generation has been alluded to in many projects, to the
best of our knowledge, a concentrated effort in characterizing parameters for such a
generator and concerted efforts into designing a workflow generator that can be used
to address the above issues have been lacking in the workflow community so far. This brings
to light the requirement for a tool that is able to provide workflow scientists with a generic
set of synthetic WFWs. Such a requirement is not unique to the design of WMSs.
It is typical for scientific experiments such as High Energy Physics (HEP) to use 3X or
more simulated data versus observed data for analysis to reduce statistical errors [RL12].
This work highlights the importance of using synthetic workflows for generating such
simulated data.
4.3 SDAG
There has been a considerable amount of earlier work where random DAGs or synthetic
workflows have been used in order to evaluate different works both in workflow science,
as well as in the area of scheduling. While we discuss just a few of these works to
make a point, numerous other works have embedded the use of such tools as a necessary
part of testing and evaluation. Where other works utilize synthetically generated
workflows or DAGs for various purposes, we simply highlight these few and describe
what sets SDAG apart from them.
Rahman et al. evaluate a critical path algorithm by making use of a workflow
generator [RVB07]. They characterize the generated workflows as parallel, join-fork
and random. Each of these types represents the shape the workflow DAG acquires after
generation. Amalarethinam and Muthulakshmi [AM12] describe DAGITIZER, a tool
to generate random DAGs. DAGITIZER handles the cost of communication between nodes
and also assigns weights to individual nodes. On the whole, DAGITIZER appears to
be a smaller-scale derivative of [RVB07]. Zang and Mao present a scheduling
strategy for data-intensive workflows in [ZeM10], where they use a workflow generator
that produces three types of workflows: random, fork-join, and parallel.
Naseri and Ludwig mention using a random workflow generator for evaluating work-
flow trust using Hidden Markov Modeling [NS11]. While the authors simply allude to
the use of a generator, we highlight this work simply as an example of the many areas
where requirements for synthetic workflows exist. Saha et al. use randomly gener-
ated workflows for eliminating redundancies in workflows [SSS09]. Yan and Chapman
randomly generate jobs in evaluating a workflow job scheduler [YC07]. Schwerdel et
al. generate random workflows with type-matching edges between inter-dependent jobs
for evaluating evolutionary algorithms in solving the functional composition problem
[SRM10].
An alternative to random or synthetic workflows would be a large body of real work-
flows. Work on providing such a large repository of workflows that can be used as start-
ing templates for new workflows has been addressed in SHIWA [KKK+11]. Generating
a large enough repository of real workflows can potentially provide the same benefit as
SDAG. However, we argue that the number of workflows that can be deposited into the
repository will still be finite, that such a repository will require administrative maintenance,
and that it may be difficult to isolate, from such a repository, a large enough data set of
workflows classified to suit the particular requirements of testing or evaluating any
individual WMS tool.
On the other hand, SDAG provides for the generation of such a dataset at a fraction
of the cost (basically some computational cost in generating the WFWs) with no long
term storage costs. Therefore, we assert that SDAG is a more manageable and user
configurable alternative to a very large real workflow repository. It is also important to
note that the goal of the SHIWA repository is not to provide the scientific community
with a testing and evaluation dataset, but instead it is to provide starting templates in the
form of existing workflows for authoring new workflows. Moreover, as the goal of these
repositories is to ease authoring of new workflows by reusing existing workflows, core
functional requirements of existing workflows are propagated into the new workflows
and new or unseen requirements are not explored very often.
Although we establish the requirement of synthetically generated WFWs, we have
not yet defined the bounds on the random space that a WFW generator should explore.
In order to avoid generating WFWs that have extremely unrealistic workflow proper-
ties, the generator should be bounded by some parameters that extend from a central
reference point. In this case, the reference point is a real workflow, and the random space
to be explored forms the basis of the design criteria for a Synthetic DAG Generator, or
SDAG. In other words, given a real workflow, we want to generate synthetic workflows
whose parameter bounds are reasonably close to or far from the real workflow (graphically
depicted in figure 4.1).
Figure 4.1: Workflow Search Space
4.3.1 SDAG Design Criteria
There has been significant work in the area of generating random DAGs for evaluat-
ing scheduling solutions [Sin93][JK91][CMP+10]. However, previous work is primarily
directed at the evaluation of algorithms in theoretical computer science, such as scheduling,
graph traversal, or search algorithms. We are interested in generating WFWs. In particular, we
are interested in having some level of control over the following parameters of a WFW.
1. Total number of jobs in the DAG.
2. The average number of dependents per job in the DAG.
3. A parameter to control the parallelism of the DAG. For example, a DAG consist-
ing of 100 jobs can be authored as a ladder DAG (a pipeline) with one job on each
level, creating a total of 100 levels with no parallelism. At the other extreme,
the same 100 jobs can be placed on the same level of the DAG, thus creating a
DAG with only one level and achieving maximum parallelism.
4. The bandwidth decides how far across levels a dependency will reach. For exam-
ple, a bandwidth of 1 will only allow children from the next level to be assigned
to a job. Whereas, a bandwidth of 2 would allow a job to choose a child from all
jobs in the following two levels of the DAG.
5. Given an ideal number of jobs on every level of the DAG, regularity controls the
actual number of jobs per level in the DAG.
6. A parameter that controls the runtimes of jobs in the DAG. An upper bound and a
lower bound are given. A number is selected between these uniformly at random
for the assignment of running time to every job grouped by levels in the DAG.
7. The size of input data required for each job in the WFW.
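As an illustration, these seven controls can be captured in a small parameter structure such as the one below. The field names and example values are our own and do not reflect SDAG's actual interface; they simply mirror the list above.

from dataclasses import dataclass


@dataclass
class WFWParams:
    """Hypothetical container for the SDAG input parameters listed above."""
    num_jobs: int = 128        # total number of jobs in the DAG
    dependents: float = 2.0    # average number of dependents per job
    parallelism: float = 0.2   # controls the ideal number of jobs per level
    bandwidth: int = 2         # how many levels down a dependency may reach
    regularity: float = 0.8    # variation around the ideal jobs per level
    runtime_min: int = 1       # lower bound on job runtime (seconds)
    runtime_max: int = 50      # upper bound on job runtime (seconds)
    input_size_mb: int = 10    # input data size per job (example value)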
4.3.2 SDAG Architecture
Figure 4.2 shows the major components of SDAG. SDAG reads the number of jobs as
an input parameter. Next, the job generator creates job nodes in the DAG for every job
without any dependency edges between them. A running time value and input and output
files are assigned for each job. For simplicity, our current implementation
restricts the input file for each job to at most one. SDAG calculates the ideal number
of jobs per level of the workflow from the parallelism parameter. Based on the ideal
number of jobs per level, SDAG then calculates the actual number of jobs per level
using the regularity parameter. For a regularity value of 1, SDAG will place exactly the
ideal number of jobs per level; otherwise, for values less than 1, SDAG will choose a
percentage variation (positive or negative) using the ideal number of jobs as the mean. The
assignment is an integer value selected uniformly at random between the two ends of the
calculated percentage range. This parameter allows the resulting DAG to take the shape of
a ladder, a platter, or an hourglass, etc. For example, if there are 100 total jobs and the
parallelism parameter is 0.2, then the ideal number of jobs per level is 20. Next, for this
example, if the regularity parameter is 1, then SDAG assigns 20 jobs per level. However,
if the regularity parameter is 0.8, then, for each level in the WFW, SDAG will choose a
number between 16 and 24 jobs. This choice is made by a uniform distribution between
min(16) and max(24).
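The worked example above can be expressed as a short sketch. This is our reading of the described behavior rather than SDAG's actual code; the function name and signature are assumptions.

import math
import random


def jobs_per_level(num_jobs, parallelism, regularity, rng=random):
    """Assign the actual number of jobs to each level of the DAG.

    The ideal level width is num_jobs * parallelism; a regularity below 1
    allows a uniform variation of +/- (1 - regularity) around that ideal.
    For 100 jobs, parallelism 0.2, and regularity 0.8 this yields levels of
    between 16 and 24 jobs each, as in the example above.
    """
    ideal = max(1, int(round(num_jobs * parallelism)))
    low = max(1, int(math.floor(ideal * regularity)))
    high = int(math.ceil(ideal * (2 - regularity)))
    levels, remaining = [], num_jobs
    while remaining > 0:
        count = min(remaining, rng.randint(low, high))
        levels.append(count)
        remaining -= count
    return levels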
Figure 4.2: SDAG Architecture (task generator, dependency generator, forest consolidator, and OPM/DAG/DOT output generator)
With the number of tasks per level assigned, SDAG moves control from the job
generator to the dependency generator. The dependency generator uses the dependents
parameter to compute the number of children that every node in the DAG will have. This
value is drawn from a zero-truncated Poisson distribution with the dependents parameter as its
parameter. For each node, the dependency generator looks at the bandwidth parameter to
decide how far up or down to jump in the DAG levels in order to choose a child for the
node. A coin flip decides if the node will have a child from its immediate lower level or
if it will have a choice of child jobs from more than one level down. Finally, a child is
selected uniformly at random between the min and max child job indexes.
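A minimal sketch of this child-selection step is given below. The zero-truncated Poisson draw and the level-based indexing are our interpretation of the description above rather than SDAG's actual implementation.

import math
import random


def zero_truncated_poisson(lam, rng=random):
    """Draw from a Poisson(lam) distribution, rejecting zeros (Knuth's method)."""
    threshold = math.exp(-lam)
    while True:
        k, p = 0, 1.0
        while p > threshold:
            k += 1
            p *= rng.random()
        if k - 1 > 0:
            return k - 1


def pick_children(node_level, levels, bandwidth, dependents, rng=random):
    """Choose child jobs for one node, limited by the bandwidth parameter.

    levels maps a level number to the list of job ids on that level. A coin
    flip decides whether to look only at the next level or up to bandwidth
    levels down; children are then picked uniformly at random.
    """
    n_children = zero_truncated_poisson(dependents, rng)
    reach = 1 if rng.random() < 0.5 else bandwidth
    last_level = max(levels)
    candidates = []
    for lvl in range(node_level + 1, min(node_level + reach, last_level) + 1):
        candidates.extend(levels.get(lvl, []))
    if not candidates:
        return []
    return rng.sample(candidates, min(n_children, len(candidates)))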
As the assembly of the DAG is primarily done by the dependency generator, and
because it works based on a scaled, uniformly distributed, random selection to create an
edge in the tree, there may be job nodes that never get selected in this random allocation.
This results in multiple trees being formed by the dependency generator. In order to cor-
rect this and have a single tree that represents a DAG, the Forest Consolidator traverses
every root node to see if that root is connected to other roots in the tree, or if it is an
independent tree. In the latter case, the forest consolidator attaches the independent tree
to the largest tree in the forest by generating a dependency between the two and finally
stops when there is only one tree in the forest.
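The consolidation step can be sketched as follows, assuming a hypothetical tree representation with a root, a size, and an add_edge helper; the attachment rule (always attach to the largest tree) follows the description above.

def consolidate_forest(trees):
    """Merge independent trees into a single DAG rooted at the largest tree.

    trees is a list of hypothetical Tree objects exposing root, size, and
    add_edge(parent, child). Each remaining independent tree is attached to
    the largest tree by adding a dependency from the largest tree's root to
    the smaller tree's root.
    """
    trees = sorted(trees, key=lambda t: t.size, reverse=True)
    main = trees[0]
    for tree in trees[1:]:
        # This new edge makes the smaller tree's root a child of the main
        # tree's root, leaving a single connected DAG.
        main.add_edge(main.root, tree.root)
        main.size += tree.size
    return main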
Finally, the DAG is printed out both as a visualizable .dot file as well as an abstract
DAG file that can be submitted to a workflow planning engine. In our current implemen-
tation, an input DAG to the Pegasus WMS is generated. Much of the design of SDAG
has been influenced by Suter [Sut12].
In the future, a platform translator will live between the forest consolidator and the
output generator, which will produce workflow definitions in the standard Open Prove-
nance Model (OPM) [MCF+11] format as well as formats accepted by multiple work-
flow systems. Additionally, in ongoing work, an architectural overhaul of SDAG is
underway in which the Markov Chain Monte Carlo method will be used in the
generation of the initial DAG.
A few points that need to be addressed before proceeding to the discussion of statis-
tical properties of SDAG are:
1. Why were all SDAG parameter distributions uniformly random?
So far, there is no theoretical work that suggests any particular distribution for
any one dimension of any DAG representing a workflow. The short answer to the
above question is that we acknowledge that different statistical distributions will
be required for SDAG parameters in order for it to be beneficial to a large scientific
audience. However, in order to keep the scope of this research manageable and
focused in identifying a need and presenting ‘one’ solution, we only considered
uniformly random distributions. There will certainly be instances where other sta-
tistical distributions will be a much better choice for some SDAG parameter. This
thesis does not discuss those options any further. In future work, this functionality
can be provided as an option selected at the start of creating a set of WFWs.
2. Why DAG? Why not branching and loops?
The majority of workflow management systems are DAG based. Yes, there are
existing WMS that allow for conditional edges and looping in the abstract defini-
tions of workflows, e.g. Kepler, but as that does not represent the major portion of
the workflow management systems that exist in the academic and research fields,
we chose to produce only DAGs. In future work, conditional decisions and loop-
ing can be addressed in the workflow generator. The current choice of DAG based
workflows gives the largest coverage of scientific workflow management systems.
And last, but not least, such support cannot trivially be added as additional functionality
into SDAG.
3. Why DAG? Why not forests?
It is known that there are many instances of workflows being forests or bags of tasks
instead of a single DAG. For example, the Broadband Platform is essentially a for-
est. The addition of housekeeping jobs by Pegasus renders it as a DAG. However,
for the scope of this thesis, we address DAGs only. The ability to create forests
can be incorporated into the current version of SDAG by optionally not invoking
the forest consolidator. However, the statistical properties of the resulting forest
will need to be assessed in detail before making any assertions on the performance
of SDAG in producing the required forest with the required distributions. At this
time, we defer that as future work.
4.4 Statistical Properties of SDAG
In this section, we evaluate the performance of SDAG. We are interested in compar-
ing the median values of the parameters of a generated set of WFWs with those that
were provided as input parameters to SDAG. Ideally, these values should be close to each
other. However, in our results, we observe some statistical biases and provide an explanation
of why these biases have been introduced into the synthetic dataset.
We generate 10,000 workflows with 128 jobs each. The parallelism is set to 0.2,
which means that, on average, a WFW should contain 5 distinct levels. As expected,
figure 4.3 (a) shows the histogram of total levels in a WFW across the set of 10,000
WFWs with a mean of 5. Next, regularity is set to 0.01, which means that the number
of jobs/level can vary between 1 and 128. Figure 4.3 (b) shows the histogram of this
parameter. We observe a bias in the histogram towards lower numbers of jobs/level, with
the value spiking at 1 job/level. This bias is introduced into the set due to the manner in
which the forest consolidator works. When the dependency generator assigns child jobs
uniformly at random, there are jobs that are never assigned as a child. Thus these jobs
become independent roots. When the forest consolidator attaches these to the largest
tree, new root or leaf nodes are formed, thereby creating a level with only one job on
it. We acknowledge this statistical bias and have addressed it in ongoing work as stated
at the end of section 4.3.2. Next, the job runtime min is set to 1 and max is set to 50.
Figure 4.3 (c) shows the histogram of job runtimes. As expected, runtimes are uniformly
distributed.
Figure 4.3: SDAG Statistics (a) Levels/WFW (b) Jobs/level (c) Job Runtimes (d) Bandwidth/job
The bandwidth of the graph is set to 2. While it is known that calculation
of a graph's bandwidth is an NP-complete problem [DPPS01], a DAG level-based
naming convention followed in SDAG allows us to precisely compute this parameter for
every edge in a WFW. Figure 4.3 (d) shows that the majority of the jobs had children at
a distance of a single level, whereas there exist dependencies that span two levels and a
small number of dependencies that span more than two levels. The average number of
dependents per job is set to 2. Figure 4.4 shows single child dependency as the highest
followed by two dependents per job. The frequency of dependencies decreases
exponentially as the number of dependencies increases.
Figure 4.4: SDAG Statistics (Dependencies / Job)
4.5 Summary
In this chapter we have presented a tool for generating WFWs. We presented the sta-
tistical properties of WFWs produced by SDAG. The observed statistical distributions
are close to the expected distributions. This close mapping is critical to how we evaluate
SDAG. If the distributions of various parameters were skewed, then it would be difficult
to isolate each individual WFW control parameter and make an assertion on how a
particular variance will affect the resulting evaluation of tools.
Chapter 5
Evaluating WMS tools with SDAG
In this chapter, we demonstrate the capabilities of SDAG in generating sets of WFWs that
are set apart based on statistical values, but that enable users to fine-tune the evaluation of
particular capabilities of a selected workflow component. Our goal in testing a particular
component is to see if, when presented with a much larger and more varied dataset of
WFWs, it produces results similar to those that were produced while testing with real
workflows. As stated in chapter 4, we choose a real workflow as a reference point and
then generate WFWs around it.
For this study, we select a data management tool capable of enforcing different poli-
cies on data placement tasks in distributed and grid environments. The PDPS, introduced
in previous chapters, is a tool designed to off-load data placement tasks from Pegasus
WMS and execute them in tandem with compute tasks that are being handled by Pegasus
WMS. It does so by allowing Pegasus to submit all data management tasks to PDPS and
all compute jobs to Condor / DAGMan. As a result, the workflows being run by Pegasus
should execute faster, utilize allocated resources more effectively and result in less time
spent by the workflow in performing data placement tasks. Ultimately, this results in
improved workflow performance and is reflected in the overall completion times for the
workflow. Work on PDPS was presented in [Ame11] and [IEE12].
In chapter 3 and [IEE12], we presented results that demonstrated the impact of PDPS
when used with Pegasus for data management tasks. In this section we study the impact
of similar policies used by PDPS in [IEE12] on the performance of workflow runtimes
while using WFWs. The goal now is to assess PDPS as a generic tool that can
deliver results similar to [IEE12] when presented with a broad range of workflows. Since a
significant number of real-world scientific applications use the Pegasus WMS, only the
Pegasus WMS was evaluated within the scope of this work. However, in the future, other
WMS can be set up for testing different constituent components using WFWs.
5.0.1 Evaluation Metrics
We use overall workflow execution times and data sizes as evaluation metrics. There has
been a large amount of work on the comparative ratio of compute:communication costs
and compute:data costs. Although we have selected uniform distributions so that our observed
results remain clearly separable, in practice compute times and data sizes
follow distributions of varying shapes. We intend to incorporate the selection of different
distributions for both compute times and data sizes in future releases of SDAG. It is also
important to note that SDAG's WFW-based testing is typically done based on a single
parameter. Mixing and matching parameters of metrics would potentially contaminate
one or more of the other testing dimensions. We therefore restrict our evaluation to the
metrics of time and data size. An additional reason for selecting these measurement
metrics was continuity of research: we presented PDPS in earlier chapters
and wanted to extend its testing and evaluation as a generic WMS tool.
5.0.2 PDPS Policies
We test two policies on two sets of WFWs. We choose these policies for the sake of
consistency across the experimental evaluation between [IEE12] and this study.
First, we pick the smallest file first heuristic as the policy. PDPS schedules the
smallest file first for all data placement operations. For the remaining part of this chapter,
we refer to this policy as Data.
The second policy uses the heuristic ‘largest number of dependents first’. PDPS
picks those files that are required by the job with the largest number of dependent jobs
and schedules those files early for data placement operations. For the remaining part of
this chapter, we refer to this policy as Dependency.
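As a rough illustration, the two heuristics can be expressed as orderings over a queue of pending stage-in requests, as in the Python sketch below; the record fields (size_bytes, dependent_count) are hypothetical and do not reflect PDPS's internal data structures.

    # Sketch of the two orderings over pending stage-in requests.
    pending = [
        {"file": "f1.dat", "size_bytes": 16 * 2**20, "dependent_count": 1},
        {"file": "f2.dat", "size_bytes": 128 * 2**20, "dependent_count": 4},
        {"file": "f3.dat", "size_bytes": 128 * 2**20, "dependent_count": 2},
    ]

    # 'Data' policy: smallest file first.
    data_order = sorted(pending, key=lambda r: r["size_bytes"])

    # 'Dependency' policy: files needed by the job with the most dependents first.
    dependency_order = sorted(pending, key=lambda r: -r["dependent_count"])

    print([r["file"] for r in data_order])        # ['f1.dat', 'f2.dat', 'f3.dat']
    print([r["file"] for r in dependency_order])  # ['f2.dat', 'f3.dat', 'f1.dat']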
5.0.3 Experimental setup
We generate two sets of workflows. The reference workflow used in the generation of
the WFWs is the Broadband Platform (BBP) [CMG+10]. The total number of jobs is
set to 128 and the density of the DAG is set to 0.2, meaning that on average a workflow
would have 5 levels, which is similar to BBP. We set the regularity of the workflow at
0.01, which means that the number of jobs/level of the workflow may vary from a single
job/level up to 128 jobs/level. Considering the number of dependent compute jobs
per node in the original BBP workflow, we set the number of dependents value
to 1.
In the first 100 WFWs generated, every job is assigned a single input data file of
size 128 MB. We refer to this as the Control WFW set. Following this, we generate a
second set of 100 WFWs with exactly the same parameters as the control set, with one
exception. We assign a weight function to the random selection of the input file size for
every job. Two input file sizes, 16 MB and 128 MB, are made available to SDAG. An
input file size is selected at random for each level of the WFWs. We inject a bias into
this selection so that the 16 MB file size is preferred for the first three levels of each WFW.
We refer to this set as the Weighted WFW set.
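One way the biased per-level selection could be realized is sketched below in Python; the 80/20 weighting toward the 16 MB file on the first three levels is an illustrative assumption, since the exact weight function given to SDAG is not reproduced here.

    import random

    SIZES_MB = [16, 128]

    def input_size_for_level(level: int, rng: random.Random) -> int:
        """Pick an input file size for a level, favoring 16 MB on levels 1-3."""
        if level <= 3:
            return rng.choices(SIZES_MB, weights=[0.8, 0.2])[0]
        return rng.choice(SIZES_MB)

    rng = random.Random(42)
    print([input_size_for_level(lvl, rng) for lvl in range(1, 6)])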
We now have 200 WFWs that search the parameter space around the BBP work-
flow. Note that this is just one search space and additional WFWs may be explored by
assigning a different workflow as reference and/or by relaxing the bounds or parameters
of the search space.
For every workflow, Pegasus plans the workflow, creates data management and com-
pute jobs and submits the workflow to Condor/DAGMan for execution and monitoring.
With each of the 200 synthetic workflows, we run four instances on the Pegasus-PDPS
platform. In the first instance, no data placement operations are created by the Pega-
sus planning process. In this case, data is pre-staged at the execution site and only a
symlink job is created by the Pegasus planning process for making input data available
at the execution site. Next, a default Pegasus run is initiated. In this run, no workflow
planning parameters are changed and Pegasus uses its default parameters to generate
the executable DAG. In this case, all data placement jobs are created by Pegasus and
handled by Condor and DAGMan.
In the third and fourth case, Pegasus-PDPS is used. In both these cases, PDPS
handles all data management tasks while Condor and DAGMan simply handle compute
jobs. In the third case, the smallest file first or ‘data’ policy is enforced by PDPS while in
the fourth case, PDPS enforces the largest number of dependents first or ‘dependency’
policy.
The experimental setup is shown in Figure 5.1. The submit host and storage site are
Intel-based Mac Pros running OS X 10.7, and PDPS runs on an Intel-based Xserve running
OS X 10.5. A Condor pool comprising 96 virtual nodes runs on two 8-core
Linux machines running the 2.6.x Linux kernel. All experiments are conducted within a
local area network.
[Figure 5.1: Experimental Setup - Pegasus submit host, storage site, PDPS, and a 96-node Condor pool; data placement information flows to PDPS, which initiates third-party GridFTP transfers]
Workflows are generated and submitted for planning and execution on the submit
host. Each synthetic workflow is submitted to a modified version of Pegasus for planning
and execution. The modified Pegasus contains an interface to PDPS and is capable of
planning data management jobs that are not just grouped at the start of the workflow but
distributed along the entire depth of the workflow. For the workflow execution instances
where PDPS is invoked, the planning process in Pegasus creates stage-in jobs that are
distributed along the depth at each level of the workflow.
In all experiments, the clustering factor for stage-in jobs is set to 32. This means
that Pegasus creates 32 stage-in jobs for every workflow by default. Thereafter, Pegasus
populates each of these 32 jobs with files from the entire stage-in file set using a round-
robin allocation algorithm.
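The allocation itself is a simple round-robin assignment, sketched below; this is a simplified stand-in for the actual Pegasus clustering code.

    # Sketch: round-robin clustering of stage-in files into 32 transfer jobs.
    def cluster_round_robin(files, num_jobs=32):
        jobs = [[] for _ in range(num_jobs)]
        for i, f in enumerate(files):
            jobs[i % num_jobs].append(f)
        return jobs

    files = [f"input_{i}.dat" for i in range(128)]
    jobs = cluster_round_robin(files)
    print(len(jobs), len(jobs[0]))  # 32 clustered jobs, 4 files each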
5.1 Results and Analysis
In this section, we analyze the results from execution, with and without PDPS, of the 200
WFWs generated with SDAG.
5.1.1 Control WFW set
We first present results from the Control set. Figure 5.2 shows the key
statistical information on the overall completion times for this set as a box plot. The box
in the plot covers the data points between the 25th and 75th percentiles, while the error bars
or whiskers show the sample min and the sample max value in the result set. The line
in the center of the box shows the median value of the sample. Data points represented
outside these bounds are treated as outliers and are not considered in the calculation of
the displayed box plot. We present results for the four cases discussed earlier.
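For reference, the summary statistics reported by each box plot can be computed as in the short sketch below; the completion times used here are made-up values for illustration only.

    import statistics

    times = [612, 640, 655, 660, 671, 690, 702, 715, 730, 810]

    # 25th, 50th and 75th percentiles define the box; the whiskers are the
    # sample extremes.
    q1, med, q3 = statistics.quantiles(times, n=4)
    print(f"box: {q1:.1f}-{q3:.1f}, median: {med:.1f}, "
          f"whiskers: {min(times)}-{max(times)}")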
[Figure 5.2: Control set - Overall Execution Time]
For comparison to the reference workflow, we highlight the overall completion times
for actual BBP executions in figure 5.2. We do not observe any significant difference in
the BBP values. This is so primarily because of the very small amount of data (only 8
KB) that BBP exposes to stage-in jobs. The transfer time difference between the default
and pre-staged case is only 11 seconds, while the variation when using PDPS is similar to the
default case. We emphasize the size of the input data to drive home our goal of providing a
broader spectrum of DAGs that map the defining characteristics of the BBP workflow while
still being able to use a large set of data to test the impact of PDPS on similar workflows.
Table 5.1: Results for Pairwise T-Test, Control set, Overall Completion Times

  Test                   p-value     Mean of differences
  Default : Pre-staged   < 2.2e-16   161.98
  Default : Data         0.027       51.36458
  Default : Dependency   < 2.2e-16   122.27
  Data : Dependency      0.001030    71.94792
In order to understand the pairwise trends in figure 5.2, we conduct a number of
Student's t-tests [PBG42] on the sample result set. Table 5.1 shows pairwise results of
the t-tests. The pairwise comparison between ‘default’ and ‘pre-staged’ shows that both
these sets are drawn from independent samples. Moreover, for each pair compared, the
difference between the overall completion times is calculated, and the mean of those
differences is 161.98 seconds. This shows that, in the ‘default’ case, on average, data
placement operations took 161.98 seconds. Therefore, if PDPS is to improve on any data
placement time, that improvement has to lie inside the 161.98 second time window.
The remaining time is spent by each individual workflow processing compute jobs and
is thus immutable for our study. We note that the 161.98 second time window is only an
average value and that individual workflows may vary significantly in the time window
available for data placement operations.
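The form of these pairwise comparisons is sketched below using SciPy's paired t-test; the completion times are illustrative placeholders, not the measured data.

    from scipy import stats

    # Per-workflow completion times for the same workflows under two
    # configurations (illustrative values).
    default_times   = [701, 645, 688, 730, 662, 709, 654, 698]
    prestaged_times = [540, 488, 520, 561, 499, 542, 501, 533]

    t_stat, p_value = stats.ttest_rel(default_times, prestaged_times)
    mean_diff = sum(d - p for d, p in zip(default_times, prestaged_times)) / len(default_times)
    print(f"p-value: {p_value:.3g}, mean of differences: {mean_diff:.2f} s")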
Based on the observed p-values and the associated ‘Mean of differences’, we also
establish that for each individual WFW, the pre-staged case reported the earliest over-
all completion time, followed respectively by ‘Dependency’ and ‘Data’, with the ‘Default’
case finishing with the highest reported overall completion time. In addition to this,
we note that for each WFW, on average, PDPS performed 71.9 seconds better while
enforcing the ‘Dependency’ policy when compared with the ‘Data’ policy.
In order to understand the pairwise trend of overall completion time values in figure
5.2, descending from Default to Data to Dependency to Prestaged, we look
at the distribution of relevant parameters in the sample set. First, in figure 5.3 (a), we
observe the distribution of data on a level-by-level basis in each workflow.
Table 5.2: Results for Pairwise T-Tests

  Control set, Input data / Levels
  T-test (between levels)   p-value        Mean of differences
  1 : 2                     8.466e-08      -1924682220
  2 : 3                     0.001898       948919337
  3 : 4                     4.878e-09      1418681385
  4 : 5                     1.880365e-16   1607928381

  Control set, Dependencies / Job
  T-test (between levels)   p-value        Mean of differences
  1 : 2                     2.957599e-50   0.8897666
  2 : 3                     1.893237e-65   0.7328546
  3 : 4                     3.37998e-31    0.2897666
Although every job in the ‘control’ WFW set possesses a single input file of size
128 MB, the distribution of the number of jobs/level alters the total size of data that each
individual workflow requires at each level. This is because SDAG assigns a different
number of jobs to each level in the WFW based on the regularity parameter. Thus,
with a unit data size of 128 MB, the measure of input data/level is also a measure of
jobs/level.
[Figure 5.3: Control set (a) Input Data / Level (b) Dependencies / Level]
Therefore, Figure 5.3 (a) not only shows the distribution of data/level, but
also gives us a clear idea of the distribution of the number of jobs/level.
We are now able to establish that if data placement operations are conducted by
PDPS while enforcing the ‘data’ policy, then every job is scheduled at the same priority.
Therefore, the original round-robin grouping of stage-in jobs created by default in Pega-
sus is maintained by PDPS and no changes are induced into data scheduling for these
jobs. Consequently, individual paths in the WFW that have met their data input require-
ments may start execution. Spread over 100 instances of individual WFWs, the ‘data’
policy is expected to perform better than the default case, but worse than the pre-staged
case.
Table 5.2 shows the p-values between consecutive levels. As the num-
ber of data points tapers off to a value low enough to be negligible in terms of
data placement time differences, we only present paired t-tests up to level 5 for data val-
ues and up to level 4 for dependency values in each workflow. The mean of differences
correlates with figure 5.3 (a).
In figure 5.3 (b), we observe the reverse of this trend in the first and second levels.
Here, jobs from the first level have a median value of 2 dependent compute jobs, whereas
that number tapers down as the number of levels increases. In table 5.2, we present
p-values from pairwise t-tests between the number of dependencies/level for consecutive
levels up to the fourth level. Observing the mean of differences, we see that the first
and the second levels have a difference of 0.89. In other words, the first level jobs, on
average, have more dependents than the second level jobs. The same holds true between
the second and the third level jobs. This implies that while enforcing the ‘dependency’ policy,
PDPS will try to schedule level-1 and level-2 jobs at a higher priority.
For comparison purposes, we also show the actual BBP workflow's average number of
dependencies per level in figure 5.3 (b). In the case of the BBP workflow, the
first level consists of 16 jobs, each having 4 dependent compute jobs. However, for the
remaining levels, the average is 1 dependent job per parent compute job. Our choice of
setting the mean of the zero-truncated, Poisson-distributed choice of dependents to 1
originated from taking the median value of the number of dependent jobs per parent job
in the BBP workflow.
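A minimal sketch of drawing dependent counts from a zero-truncated Poisson by rejection is shown below; it illustrates only the sampling step and takes the underlying Poisson rate as 1, which is not necessarily SDAG's exact parameterization (truncating the zeros raises the realized mean slightly above the rate).

    import numpy as np

    def zero_truncated_poisson(rate, size, rng):
        """Draw Poisson(rate) samples and redraw any zeros."""
        draws = rng.poisson(rate, size)
        while np.any(draws == 0):
            zeros = draws == 0
            draws[zeros] = rng.poisson(rate, zeros.sum())
        return draws

    rng = np.random.default_rng(0)
    dependents = zero_truncated_poisson(1.0, 10_000, rng)
    print(dependents.mean())  # close to 1 / (1 - exp(-1)), i.e. about 1.58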
After this additional perspective on the distribution of data and the number of depen-
dencies per compute job, we are better acquainted with the results we observed in Figure
5.2. As is evident from the values in table 5.2, the ‘dependency’ policy is expected to perform
better than the ‘data’ policy given the distributions of these two parameters in the control
sample set.
The last property that we want to establish in the control workflow set is the pair-
wise trend of overall completion times of each individual workflow. We are particularly
interested in the difference of overall completion times between PDPS invocations that
enforce the ‘data’ policy and the ‘dependency’ policy. Based on the above analyses of
results and data point distributions, ideally, the pairwise difference between ‘data’ and
‘dependency’ would always be a positive number. A positive number in the difference
means that for a particular WFW, the ‘dependency’ policy based invocation finished
executing earlier, while a negative difference indicates that the ‘data’ policy based
invocation finished executing earlier. Figure 5.4 (a) shows the scatter plot of the difference
for each category of workflow run-type against the default Pegasus run. As expected, the
comparison between default and pre-staged runs yields the highest difference in comple-
tion times. Given this difference, we know that the comparison of data and dependency
runs against default will be upper bounded by the default vs. pre-staged values. Fig-
ure 5.4 (a) also shows that individual workflow runs in both ‘data’ and ‘dependency’
perform better than ‘default’ runs. Additionally, we observe the trend of ‘dependency’
performing slightly better than ‘data’ overall. In order to confirm this assertion, we look at
the pairwise comparison of overall finish times between ‘data’ and ‘dependency’ runs
in Figure 5.4 (b). With this we confirm that our control set of experiments has resulted
in the ‘dependency’ policy performing better than the ‘data’ policy. The median value of
the differences plotted in the figure is 56 seconds, which is the median difference
between the ‘data’ and ‘dependency’ runs of each individual workflow.
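The per-workflow comparison plotted in Figure 5.4 (b) amounts to the simple computation sketched below, where a positive entry means the ‘dependency’ run of that workflow finished earlier; the completion times are made-up values for illustration.

    import statistics

    data_runs       = {1: 655, 2: 702, 3: 618, 4: 690}
    dependency_runs = {1: 601, 2: 640, 3: 625, 4: 622}

    diffs = {wf: data_runs[wf] - dependency_runs[wf] for wf in data_runs}
    print(diffs)                              # {1: 54, 2: 62, 3: -7, 4: 68}
    print(statistics.median(diffs.values()))  # 58.0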
[Figure 5.4: Control set (a) Overall time difference with default (b) Overall time difference between Data and Dependency]
5.1.2 Weighted WFW set
In this section, we analyze the Weighted WFW set. In figure 5.5 we observe a stark
difference from the results observed in figure 5.2. While the median difference between
the overall completion times of the ‘default’ and ‘pre-staged’ cases is 142 seconds, the differ-
ence between the median values of the ‘data’ and ‘dependency’ cases is only 6 seconds. Addi-
tionally, the improvement shown by the ‘dependency’ policy within the data placement time
window of 142 seconds is 96.4%, while that shown by the ‘data’ policy is 92.25%. The dif-
ference in placement time improvement between ‘dependency’ and ‘data’ type runs is a
mere 4%, whereas the same difference was measured at 43.88% for the ‘Control’ WFW
set.
[Figure 5.5: Weighted set - Overall completion time]
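These percentages can be reproduced from the median overall completion times listed in Table 5.3 below: the data placement window is the difference between the default and pre-staged medians, and each policy’s saving is expressed as a fraction of that window, as in the following sketch.

    # Reproducing the within-window improvement figures from the medians
    # reported in Table 5.3 (values in seconds).
    medians = {"default": 530, "data": 399, "dependency": 393, "prestaged": 388}

    window = medians["default"] - medians["prestaged"]   # 142 s
    for policy in ("data", "dependency"):
        saved = medians["default"] - medians[policy]
        print(f"{policy}: {100 * saved / window:.2f}% of the window recovered")
    # data: 92.25%   dependency: 96.48%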
Such a small difference is a clear indication that the ‘data’ policy has performed better
than in the case presented in the ‘Control’ WFW set. Figures 5.6 (a) and 5.6 (b) confirm these
results. In figure 5.6 (b) we observe the first three levels of the workflow requiring small
amounts of data, with the requirement spiking in the fourth and fifth levels. In this case,
when PDPS enforced the ‘data’ policy, it scheduled data stage-in jobs for the first three
levels of each WFW at a higher priority than the data stage-in jobs for the fourth
and fifth levels. Consequently, compute jobs at the first three levels started executing
while PDPS transferred input data for the fourth and fifth levels concurrently. As a
result, the data placement time was reduced, which ultimately resulted in a significant
reduction in the overall workflow runtime.

Table 5.3: Median Overall Runtimes (in seconds) for Weighted WFWs

  Run type      Median
  Default       530
  Data          399
  Dependency    393
  Pre-staged    388
Figure 5.6 (a) shows the number of dependents/job for the weighted WFW set.
We observe that the difference in the median number of dependencies/job is less than
1 between contiguous levels in each WFW. For the purpose of altering the scheduling
priorities of jobs in PDPS, this difference was not large enough to have an impact larger
than that of data/level.
[Figure 5.6: Weighted set - (a) Dependencies/Level (b) Input Data / Level]
While the results in figure 5.5, observed alone, do not tell us much in terms of our goal of testing PDPS,
when compared with similar results from the ‘control’ WFW
set, we can confidently say that the ‘data’ policy has performed better in the weighted WFW
set. Figure 5.7 (a) shows that ‘pre-staged’, ‘data’ and ‘dependency’ performed much
better than default, with only a single data point showing a negative difference. Figure 5.7
(b) shows that ‘data’ and ‘dependency’ performed very close to each other, with 66%
positive differences and 33% negative differences in the overall completion time comparison,
and a median value of 19 seconds. Thus, ‘dependency’ performed better than ‘data’ overall.
Table 5.3 shows the median values of overall completion times for the weighted WFW
set. T-tests between the ‘default:pre-staged’, ‘default:data’, ‘default:dependency’ and
‘data:dependency’ pairs all yield p-values less than 0.05.
[Figure 5.7: Weighted set (a) Overall time difference with default (b) Overall time difference between Data and Dependency]
5.2 Discussion
The goal here was to articulate the need for, and demonstrate the applicability of,
SDAG as an essential evaluation tool for the testing and development of workflow manage-
ment systems and their constituent software components. We evaluated SDAG in two
ways.
First, we evaluated SDAG to see if the WFWs generated by it conform to the search
space set forth by the statistical parameters and bounds given as input to it. We used the
BBP workflow as a reference and generated 10,000 WFWs with SDAG. We observed
the statistical properties of these WFWs and compared them with the actual parame-
ter values from our reference workflow. SDAG was able to search the parameter space
around BBP effectively and within reasonable bounds. We observed some minor sta-
tistical bias in the WFWs generated in that test. We concluded that the bias towards
having a lower number of jobs/level for the first and second levels of the WFWs was
because of the way SDAG consolidates independent trees into a single tree to produce a
valid DAG or WFW. This bias is only introduced when the optional forest consolidator
component in SDAG is invoked. We note that many workflows are not single DAGs but
are collections of DAGs. If we want to model such a forest of DAGs in the form
of a WFW, then the resulting WFW set would not reflect this statistical bias.
Second, we put the WFWs generated by SDAG to use. We tested the performance
of a WMS tool, PDPS. On analyzing the results, we observed that SDAG was able to pro-
duce different sets of WFWs that presented different data placement characteristics for
PDPS to test. The first set of WFWs (control) showed that when PDPS was used,
enforcing the ‘dependency’ policy resulted in an average 43.88% improvement over the ‘data’
policy. On the other hand, using the second (weighted) WFW set reduced that
improvement to only 4%. As ‘dependency’ policy enforcement performed equally well in
both sets, we concluded that ‘data’ policy enforcement had performed better when the
weighted set of WFWs was used.
This is a significant observation, as it shows that SDAG is capable of producing syn-
thetic workflows (WFWs) whose statistical properties can be controlled by users and
used to evaluate workflow tools at various granularities. The ability of SDAG to control
the statistical bounds of the different parameters that express the formulation
of a workflow allows the user to create a broad range of WFWs. By selecting the input
parameters for a set of WFWs, scientists are able to control precise dimensions of a
workflow’s description. Thus, they are able to tune and test for various dimensions of
workflows based on the WMS tool being evaluated. For example, schedulers can be
evaluated based on job runtimes or on job dependencies independently. Similarly, data
management tools such as PDPS can be evaluated based on data sizes or on the relative posi-
tioning of data with respect to computation jobs. The ability to add modular attributes
to job descriptions further enhances the variety of WFWs that SDAG can generate
for different evaluation criteria. This provides WMS users and developers the ability
to create and examine issues beyond those that arise due to the individual structure or
composition of the real workflows being tested.
5.3 Summary
In this chapter, we presented SDAG, a synthetic WFW generator. Workflows generated
by SDAG can be controlled by placing bounds on the parameters defining a workflow.
SDAG is capable of producing sets of WFWs that conform to statistical properties cor-
responding to a reference workflow. A statistical bias in one of the properties has been
observed and is being addressed in ongoing work. We used SDAG to evaluate a work-
flow tool called PDPS with two sets of WFWs generated by SDAG. PDPS enforced
two different data placement policies on each WFW set. Based on the statistical prop-
erties set as reference for each set, PDPS produced results that corresponded directly to
the statistical difference between the two WFW sets.
The applicability of SDAG to other scientific research fields remains open. Many areas
stand to benefit from using SDAG; the area of scheduling in particular
could gain substantial benefit, and niche areas such as geo-spatial
routing could also benefit. This thesis does not cover those external applications
in depth; they are deferred as future work. In ongoing work, SDAG support for IWIR is
underway. Additionally, we are using MCMC [KM12] techniques in the generation of
the DAG structure in SDAG in order to eliminate statistical biases in the current version.
Chapter 6
Conclusions
This thesis addresses two issues that arise in large distributed scientific collaborations.
The first is the issue of data placement and resulting workflow efficiency in the context of
user-defined policies. The second is that of developing synthetic workflows that provide
a large and varied enough dataset for evaluating WMS tools and components.
The automation of data management tasks has always been amongst the top pri-
orities of large scientific collaborations. However, the complexity of existing systems
[RCB+06][RMH+10] is a major concern for naive users. Typically, a non-computer-
science domain expert would like an automation system to be immediately usable
without the complexity of ongoing maintenance or configuration. The learning curves of
existing services are steep for such domain scientists, so these services are typically passed over in favor of
easier alternatives like scripting or writing small amounts of manageable automation
code.
As presented in chapter 3, PDPS bridges this gap. It is simple enough
to be used by mid-sized collaborations and versatile enough to encompass the required func-
tionality. Its provision of policy authoring and enforcement gives users the flexibility
to tailor any instance of PDPS to their individual requirements. Therefore, PDPS is
immediately beneficial to a significant body of scientific collaborations. In order to
enforce data placement policies at the point of origin, where data management jobs
are created, PDPS was interfaced with the Pegasus WMS as presented in chapter 3.
Experimental results demonstrated the benefit of off-loading data management jobs from
the WMS to PDPS. For leaf nodes, or jobs with no further dependencies, using PDPS
resulted in savings of up to 97% in data management job runtimes. How-
ever, that was just one step towards separating data placement jobs from computational
jobs in scientific workflows. Much more work is still required in that direction as treat-
ing these two types of jobs as separate entities is intrinsically embedded into the future
of workflow science.
While gathering experimental results for PDPS in chapter 3, a dearth of test data
for the design and development of WMS tools and components was observed. In the
long-term, the lack of testing data results in design tradeoffs in WMS development
that may be sub-optimal for future workflow applications. Consequently, chapter 4
presents SDAG. It addresses this issue by providing the capability of producing synthetic
WFWs. Although there are multiple areas of research where random DAG generation
has been used, the generation of synthetic well-formed workflows, which are also DAGs,
involves entirely different design criteria and ultimately serves a different purpose. That
said, SDAG is modular enough to be utilized by many different fields of research as a
tool for generating random DAGs that can be used as testing and evaluation data. In
chapter 4, a careful study of how workflows can be characterized is conducted,
and those parameters are then used to generate sets of workflows with different statistical
properties.
Using SDAG, in chapter 5 we present a case study where PDPS is evaluated with
two different data placement policies being enforced on two statistically different sets
of WFWs. The difference between the WFW sets was based on file sizes. The dataset
which favored the ascending file size based policy performed better in overall execution
times as compared to the other dataset, thus demonstrating that PDPS was able to deliver
similar results with similar policies across a broad range of workflows. The case study
was made possible by the capability of generating 200 WFWs using SDAG. We note
that owing to SDAG, just this single study used more unique workflows than the entire
public repository available at SHIWA.
Going forward, PDPS will be extended to contain generic interfaces for additional
WMS such as Kepler, Triana, and Taverna. Moreover, PDPS’s functions library will be
extended to perform data validation tasks such as CRC checks. Additional policies
will be added to PDPS and its functions library expanded to support them. In partic-
ular, a subscribe-publish policy, where sites can subscribe to certain datasets and PDPS
propagates any new copies of those datasets to those sites, will be added. Policies dealing
with data integrity or validity will also be added to the functions library of PDPS.
A redesign of SDAG that also uses Markov Chain Monte Carlo methods to
generate WFWs is underway. SDAG will be supplemented with capabilities for gen-
erating workflows with looping and conditional branches. The enhanced capabilities of
SDAG will enable its use in additional domains of scientific research, such as generating
geo-spatial routing test datasets. We are also working on enabling SDAG to gener-
ate WFWs in multiple formats such as IWIR, so that additional WMS can benefit from
them.
Bibliography
[20111] Broadband platform, 2011.
[AAA+08] G. Aad, E. Abat, J. Abdallah, AA Abdelalim, A. Abdesselam, O. Abdi-
nov, BA Abi, M. Abolins, H. Abramowicz, E. Acerbi, et al. The atlas
experiment at the cern large hadron collider. Journal of Instrumentation,
3:S08003, 2008.
[ABB+03] W. Allcock, J. Bester, J. Bresnahan, A. Chervenak, L. Liming, and
S. Tuecke. Gridftp: Protocol extensions to ftp for the grid. Global Grid
ForumGFD-RP, 20, 2003.
[ABJ+04] I. Altintas, C. Berkley, E. Jaeger, M. Jones, B. Ludascher, and S. Mock.
Kepler: an extensible system for design and execution of scientific work-
flows. In Scientific and Statistical Database Management, 2004. Proceed-
ings. 16th International Conference on, pages 423 – 424, june 2004.
[ABK+05] W. Allcock, J. Bresnahan, R. Kettimuthu, M. Link, C. Dumitrescu,
I. Raicu, and I. Foster. The globus striped gridftp framework and server.
In Proceedings of the 2005 ACM/IEEE conference on Supercomputing,
page 54. IEEE Computer Society, 2005.
[AC05] Ann Chervenak and Rob Schuler. Globus data replication service, 2005.
[ADJ+10] Sharad Agarwal, John Dunagan, Navendu Jain, Stefan Saroiu, Alec Wol-
man, and Harbinder Bhogan. Volley: automated data placement for geo-
distributed cloud services, 2010.
[All05a] Globus Alliance. globus-url-copy command documentation,
http://www.globus.org/toolkit/docs/4.0/data/gridftp/rn01re01.html,
2005.
[All05b] Globus Alliance. Gt 4.0 gridftp,
http://www.globus.org/toolkit/docs/4.0/data/gridftp/, 2005 2005.
[All06] Globus Alliance. Data replication service (drs), http://www-
unix.globus.org/toolkit/docs/4.0/techpreview/datarep/, 2006.
[All12] Globus Alliance. Gridftp, http://www.globus.org/toolkit/data/gridftp/,
2012.
[AM12] D. Amalarethinam and P. Muthulakshmi. Dagitizer – a tool to generate
directed acyclic graph through randomizer to model scheduling in grid
computing. In David C. Wyld, Jan Zizka, and Dhinaharan Nagamalai,
editors, Advances in Computer Science, Engineering and Applications,
volume 167 of Advances in Intelligent and Soft Computing, pages 969–
978. Springer Berlin / Heidelberg, 2012. 10.1007/978-3-642-30111-793.
[Ame11] Muhammad Ali Amer. Policy based data placement in high performance
scientific computing, 2011.
[AMP05] W. Allcock, I. Mandrichenko, and T. Perelmutov. Gridftp v2
protocol description (global grid forum recommendation gfd.47),
http://www.ggf.org/documents/gfd.47.pdf, May 4, 2005 2005.
[BBD+07] Duncan A. Brown, Patrick R. Brady, Alexander Dietz, Junwei Cao, Ben
Johnson, and John McNabb. A case study on the use of workflow tech-
nologies for scientific analysis: Gravitational wave data analysis. Work-
flows for e-Science, pages 39–59, 2007.
[BC09a] Shishir Bharathi and Ann Chervenak. Data staging strategies and their
impact on the execution of scientific workflows. In Proceedings of the sec-
ond international workshop on Data-aware distributed computing, DADC
’09, New York, NY , USA, 2009. ACM.
[BC09b] Shishir Bharathi and Ann Chervenak. Data staging strategies and their
impact on the execution of scientific workflows. Proceedings of the Inter-
national Workshop on Data-Aware Distributed Computing (DADC’09) in
conjunction with the 18th International Symposium on High Performance
Distributed Computing (HPDC-18), June 2009.
[BCD+08] S. Bharathi, A. Chervenak, E. Deelman, G. Mehta, M.H. Su, and K. Vahi.
Characterization of scientific workflows. In Workflows in Support of
Large-Scale Science, 2008. WORKS 2008. Third Workshop on, pages 1–
10. IEEE, 2008.
[BDG+04a] G. B. Berriman, E. Deelman, J. C. Good, J. C. Jacob, D. S. Katz,
C. Kesselman, A. C. Laity, T. A. Prince, G. Singh, and M. H. Su. Mon-
tage: a grid-enabled engine for delivering custom science-grade mosaics
on demand, 2004.
[BDG+04b] G.B. Berriman, E. Deelman, J.C. Good, J.C. Jacob, D.S. Katz, C. Kessel-
man, A.C. Laity, T.A. Prince, G. Singh, and M.H. Su. Montage: a grid-
enabled engine for delivering custom science-grade mosaics on demand.
In Proceedings of SPIE, volume 5493, pages 221–232, 2004.
[BFK+00] M. Beynon, R. Ferreira, T. Kurc, A. Sussman, and J. Saltz. Datacutter:
Middleware for filtering very large scientific datasets on archival storage
systems. Proc. 8th Goddard Conference on Mass Storage Systems and
Technologies/17th IEEE Symposium on Mass Storage Systems, pages 119–
133, 2000.
[BMM+05] T. A. Barrass, O. Maroney, S. Metson, D. Newbold, W. Jank, P. Garcia-
Abia, J. M. Hernández, A. Afaq, M. Ernst, and I. Fisk. Software agents
in data and workflow management. Proceedings of CHEP , Interlaken
Switzerland, 2005.
[CDL+07a] A. Chervenak, E. Deelman, M. Livny, M.H. Su, R. Schuler, S. Bharathi,
G. Mehta, and K. Vahi. Data placement for scientific applications in dis-
tributed environments. In Grid Computing, 2007 8th IEEE/ACM Interna-
tional Conference on, pages 267–274. IEEE, 2007.
[CDL+07b] Ann L. Chervenak, E Deelman, M Livny, Su Mei-Hui, R Schuler,
S Bharathi, G Mehta, and K Vahi. Data placement for scientific appli-
cations in distributed environments, 2007.
[CGH+06] D. Churches, G. Gombas, A. Harrison, J. Maassen, C. Robinson,
M. Shields, I. Taylor, and I. Wang. Programming scientific and distributed
workflow with triana services. Concurrency and Computation: Practice
and Experience, 18(10):1021–1037, 2006.
[CJSZ08] Louis-Claude Canon, Emmanuel Jeannot, Rizos Sakellariou, and Wei
Zheng. Comparative evaluation of the robustness of dag scheduling heuris-
tics. Grid Computing: Achievements and Prospects, 2008.
[ÇKU11] Ü. V. Çatalyürek, K. Kaya, and B. Uçar. Integrated data placement and
task assignment for scientific workflows in clouds. In Proceedings of the
fourth international workshop on Data-intensive distributed computing,
pages 45–54. ACM, 2011.
[CMG+10] S. Callaghan, PJ Maechling, RW Graves, PG Somerville, N. Collins,
KB Olsen, W. Imperatori, M. Jones, RJ Archuleta, J. Schmedes, et al.
Running on-demand strong ground motion simulations with the second-
generation broadband platform. In AGU Fall Meeting Abstracts, volume 1,
page 2007, 2010.
[CMP+10] Daniel Cordeiro, Grégory Mounié, Swann Perarnau, Denis Trystram,
Jean-Marc Vincent, and Frédéric Wagner. Random graph generation for
scheduling simulations. In Proceedings of the 3rd International ICST Con-
ference on Simulation Tools and Techniques, SIMUTools ’10, pages 60:1–
60:10, ICST, Brussels, Belgium, Belgium, 2010. ICST (Institute for Com-
puter Sciences, Social-Informatics and Telecommunications Engineering).
[COS07] I. Constandache, D. Olmedilla, and F. Siebenlist. Policy-driven negotia-
tion for authorization in the grid, 2007.
[CS07] Ann L. Chervenak and Robert Schuler. A data placement service for petas-
cale applications, 2007.
[CSR+09] A. Chervenak, R. Schuler, M. Ripeanu, M. Amer, S. Bharathi, I. Fos-
ter, A. Iamnitchi, and C. Kesselman. The globus replica location service:
Design and experience. Parallel and Distributed Systems, IEEE Transac-
tions on, Accepted for publication, 2009.
[DAB+03] Ewa Deelman, Amit Agarwal, Shishir Bharathi, Kent Blackburn, James
Blythe, Phil Ehrens, Yolanda Gil, Carl Kesselman, Scott Koranda, Albert
Lazzarini, Gaurang Mehta, Greg Mendell, Maria Alessandra Papa, Sonal
Patil, Srividya Rao, Peter Shawhan, Gurmeet Singh, Alicia Sintes, Marcus
Thiebaux, and Karan Vahi. From metadata to execution, the ligo pulsar
search. Technical report, Technical Report 2003-1, GriPhyN (Grid Physics
Network) Project, 2003.
[DBG+03] E. Deelman, J. Blythe, Y. Gil, C. Kesselman, G. Mehta, K. Vahi, K. Black-
burn, A. Lazzarini, A. Arbree, and R. Cavanaugh. Mapping abstract
complex workflows onto grid environments. Journal of Grid Computing,
1(1):25–39, 2003.
[DBGK02] E. Deelman, J. Blythe, Y . Gil, and C. Kesselman. Pegasus: Planning for
execution in grids. GriPhyN, 20:2002, 2002.
[Dee07] Ewa Deelman. Looking into the future of workflows: The challenges
ahead. In Ian J. Taylor, Ewa Deelman, Dennis B. Gannon, and Matthew
Shields, editors, Workflows for e-Science, pages 475–481. Springer Lon-
don, 2007. 10.1007/978-1-84628-757-2-28.
[DGR+05] K.K. Droegemeier, D. Gannon, D. Reed, B. Plale, J. Alameda, T. Baltzer,
K. Brewster, R. Clark, B. Domenico, S. Graves, E. Joseph, D. Mur-
ray, R. Ramachandran, M. Ramamurthy, L. Ramakrishnan, J.A. Rush-
ing, D. Weber, R. Wilhelmson, A. Wilson, M. Xue, and S. Yalda.
Service-oriented environments for dynamically interacting with mesoscale
weather. Computing in Science Engineering, 7(6):12 – 29, nov.-dec. 2005.
[DKM01] E. Deelman, C. Kesselman, and G. Mehta. Transformation catalog design
for griphyn. Technical report, Technical report griphyn-2001-17, 2001.
[DMS+07] Ewa Deelman, Gaurang Mehta, Gurmeet Singh, Mei-Hui Su, and Karan
Vahi. Pegasus: Mapping large-scale workflows to distributed resources.
Workflows for e-Science, pages 376–394, 2007.
[DPPS01] J. Diaz, M.D. Penrose, J. Petit, and M. Serna. Approximating layout prob-
lems on random geometric graphs. Journal of Algorithms, 39(1):78–116,
2001.
[Dro11] Drools. http://www.jboss.org/drools, 2011.
[DSS+05a] E. Deelman, G. Singh, M.H. Su, J. Blythe, Y. Gil, C. Kesselman,
G. Mehta, K. Vahi, G.B. Berriman, J. Good, et al. Pegasus: A frame-
work for mapping complex scientific workflows onto distributed systems.
Scientific Programming, 13(3):219–237, 2005.
[DSS+05b] Ewa Deelman, Gurmeet Singh, Mei-Hui Su, James Blythe, Yolanda Gil,
Carl Kesselman, Gaurang Mehta, Karan Vahi, G. Bruce Berriman, John
Good, Anastasia Laity, Joseph C. Jacob, and Daniel S. Katz. Pegasus:
A framework for mapping complex scientific workflows onto distributed
systems, 2005.
[epi12] Usc epigenomics center, 2012.
[fas12] Fast data transfer, 2012.
[FCWH08] J. Feng, L. Cui, G. Wasson, and M. Humphrey. Policy-directed data move-
ment in grids. In Parallel and Distributed Systems, 2006. ICPADS 2006.
12th International Conference on, volume 1, pages 8–pp. IEEE, 2008.
[FJP+05] Thomas Fahringer, Alexandru Jugravu, Sabri Pllana, Radu Prodan, Clovis
Seragiotto, Jr., and Hong-Linh Truong. Askalon: a tool set for cluster
and grid computing: Research articles. Concurr. Comput. : Pract. Exper.,
17(2-4):143–169, February 2005.
[Fre02] J. Frey. Condor dagman: Handling inter-job dependencies, 2002.
[GG11] D. Garijo and Y . Gil. A new approach for publishing workflows: abstrac-
tions, standards, and linked data. In Proceedings of the 6th workshop on
Workflows in support of large-scale science, pages 47–56. ACM, 2011.
[GGL03] Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. The google file
system. In Proceedings of the nineteenth ACM symposium on Operating
systems principles, SOSP ’03, pages 29–43, New York, NY , USA, 2003.
ACM.
[GRF06] P. Gama, C. Ribeiro, and P. Ferreira. A scalable history-based policy
engine. Seventh IEEE International Workshop on Policies for Distributed
Systems and Networks, 2006.
[htt] http://www.globus.org/wsrf/. Ws-resource framework.
[IEE12] IEEE International Symposium on Policies for Distributed Systems and
Networks. Improving Scientific Workflow Performance using Policy Based
Data Placement. IEEE Computer Society, July 2012.
[JDV+09] G. Juve, E. Deelman, K. Vahi, G. Mehta, B. Berriman, B.P. Berman, and
P. Maechling. Scientific workflow applications on amazon ec2. In E-
Science Workshops, 2009 5th IEEE International Conference on, pages
59 –66, dec. 2009.
[JDV+10] Gideon Juve, Ewa Deelman, Karan Vahi, Gaurang Mehta, Benjamin P.
Berman, Bruce Berriman, and Phil Maechling. Data sharing options for
scientific workflows on amazon ec2, 2010.
[JDVM10] Gideon Juve, Ewa Deelman, Karan Vahi, and Gaurang Mehta. Expe-
riences with resource provisioning for scientific workflows using corral.
Scientific Programming, 18(2):77–92, 01 2010.
[JK91] Richard Johnsonbaugh and Martin Kalin. A graph generation software
package. In Proceedings of the twenty-second SIGCSE technical sympo-
sium on Computer science education, SIGCSE ’91, pages 151–154, New
York, NY , USA, 1991. ACM.
[JRA06] EGEE Middleware Activity JRA1. Djra1.4 - egee middleware architec-
ture and planning (release 2), https://edms.cern.ch/document/594698/1.0/,
2006.
[JWH07] Feng Jun, G. Wasson, and M. Humphrey. Resource usage policy expres-
sion and enforcement in grid computing. Grid Computing, 2007 8th
IEEE/ACM International Conference on, pages 66–73, 2007.
[KB09] Tevfik Kosar and Mehmet Balman. A new paradigm: Data-aware schedul-
ing in grid computing. Future Generation Computer Systems, 25(4):406–
413, 2009.
[KDG+07] Jihie Kim, Ewa Deelman, Yolanda Gil, Gaurang Mehta, and Varun Rat-
nakar. Provenance trails in the wings/pegasus workflow system. Concur-
rency and Computation: Practice and Experience, Special Issue on the
First Provenance Challenge, 2007.
[KKK+11] Vladimir Korkhov, Dagmar Krefting, Tamas Kukla, Gabor Z. Terstyan-
szky, Matthan Caan, and Silvia D. Olabarriaga. Exploring workflow inter-
operability tools for neuroimaging data analysis. In Proceedings of the
6th workshop on Workflows in support of large-scale science, WORKS
’11, pages 87–96, New York, NY , USA, 2011. ACM.
[KL04a] T. Kosar and M. Livny. Stork: Making data placement a first class citizen
in the grid, 2004.
[KL04b] T. Kosar and M. Livny. Stork: Making data placement a first class citizen
in the grid, 2004.
[KM12] J. Kuipers and G. Moffa. Uniform generation of random acyclic digraphs.
Arxiv preprint arXiv:1202.6590, 2012.
[LAB+06] B. Ludäscher, I. Altintas, C. Berkley, D. Higgins, E. Jaeger, M. Jones,
E.A. Lee, J. Tao, and Y. Zhao. Scientific workflow management and the
kepler system. CONCURRENCY AND COMPUTATION, 18(10):1039–
1065, 2006.
[LD11] Xin Liu and Anwitaman Datta. Towards intelligent data placement for
scientific workflows in collaborative cloud environment, May 2011 2011.
[LFS+06] B. Lang, I. Foster, F. Siebenlist, R. Ananthakrishnan, and T. Freeman. A
multipolicy authorization framework for grid security. pages 269–272,
2006.
[LKS03] M. Lorch, D. Kafura, and S. Shah. An xacml-based policy management
and authorization service for globus resources. Proc. of the 4th Intl Work-
shop on Grid Comp, 2003.
[MCF+11] Luc Moreau, Ben Clifford, Juliana Freire, Joe Futrelle, Yolanda Gil, Paul
Groth, Natalia Kwasnikowska, Simon Miles, Paolo Missier, Jim Myers,
Beth Plale, Yogesh Simmhan, Eric Stephan, and Jan Van den Bussche.
The open provenance model core specification (v1.1). Future Generation
Computer Systems, 27(6):743 – 756, 2011.
[MHA02] R. K. Madduri, C. S. Hood, and W. E. Allcock. Reliable file transfer in
grid environments, 2002.
[MKJ+07] T. S. Mikkelsen, M. Ku, D. B. Jaffe, B. Issac, E. Lieberman, G. Gian-
noukos, P. Alvarez, W. Brockman, T. K. Kim, and R. P. Koche. Genome-
wide maps of chromatin state in pluripotent and lineage-committed cells.
Nature, 448:553–560, 2007.
[MM07] AB MySQL and AB MySQL. The world’s most popular open source
database. MySQL AB, 2007.
[Nie] J.L. Nielsen et al. Experiences with data indexing services supported by
the nordugrid middleware. In Computing in High Energy and Nuclear
Physics (CHEP) 2004.
[NS11] M. Naseri and A. Simone. Evaluating workflow trust using hidden markov
modeling and provenance data. Data Provenance and Data Management
for eScience , Studies in Computational Intelligence series, 2011.
[OLK+07a] T. Oinn, P. Li, D.B. Kell, C. Goble, A. Goderis, M. Greenwood, D. Hull,
R. Stevens, D. Turi, and J. Zhao. Taverna/my grid: Aligning a workflow
system with the life sciences community. Workflows for e-Science, pages
300–319, 2007.
[OLK+07b] Tom Oinn, Peter Li, Douglas B. Kell, Carole Goble, Antoon Goderis,
Mark Greenwood, Duncan Hull, Robert Stevens, Daniele Turi, and Jun
Zhao. Taverna grid: Aligning a workflow system with the life sciences
community. Workflows for e-Science, pages 300–319, 2007.
[Ope11] OpenSSH. Secure copy: scp, 2011.
[PBG42] ES (Egon Sharpe) Pearson, George A.(George Alfred) Barnard, and W.S.
Gosset. ’Student’. Wiley Online Library, 1942.
[Peg12] PegasusTeam. keg: A kanonical executable for pegasus, 2012.
[PHP+07] D.D. Pennington, D. Higgins, A.T. Peterson, M.B. Jones, B. Ludäscher,
and S. Bowers. Ecological niche modeling using the kepler workflow
system. Workflows for e-Science, pages 91–108, 2007.
[pla12] Planet-lab, 2012.
[PQF12] Sabri Pllana, Jun Qin, and Thomas Fahringer. Teuta: A tool for uml based
composition of scientific grid workflows. In in 1st Austrian Grid Sympo-
sium. Schloss Hagenberg. Springler Verlag, 2012.
[Pro03] Condor Project. Condor: High throughput computing,
http://www.cs.wisc.edu/condor, 2003 2003.
[Pro04] LIGO Project. Ligo - laser interferometer gravitational wave observa-
tory, http://www.ligo.caltech.edu/, 2004. LIGO (Laser Interferometer
Gravitational-wave Observatory) will detect the gravitational waves of
pulsars, supernovae and in-spiraling binary stars.
[Pro05a] CMS Project. The compact muon solenoid, an experiment for the large
hadron collider at cern, http://cms.cern.ch/, 2005 2005.
[Pro05b] EGEE Project. glite lightweight middleware for grid computing,
http://glite.web.cern.ch/glite/, 2005 2005.
[RCB+06] J. Rehn, S. Cern, T. Barrass, D. Bonacorsi, I. Infn-Cnaf, J. Hernandez,
S. Ciemat, I. Semeniouk, F. In2P, and L. Tuura. Phedex high-throughput
data transfer management system. Computing in High Energy and Nuclear
Physics (CHEP) 2006, 2006.
[RF03] K. Ranganathan and I. Foster. Simulation studies of computation and data
scheduling algorithms for data grids. Journal of Grid Computing, 1(1),
2003.
[RL12] Robert Roser and Robert Lucas. Transforming geant4 for the future. 2012.
[RMH+10] A. Rajasekar, R. Moore, C. Y. Hou, C. A. Lee, R. Marciano, A. de Torcy,
M. Wan, W. Schroeder, S. Y . Chen, L. Gilbert, et al. irods primer: Inte-
grated rule-oriented data system. Synthesis Lectures on Information Con-
cepts, Retrieval, and Services, 2(1):1–143, 2010.
[RSZ+07] A. Ramakrishnan, G. Singh, H. Zhao, E. Deelman, R. Sakellariou,
K. Vahi, K. Blackburn, D. Meyers, and M. Samidi. Scheduling data-
intensive workflows onto storage-constrained distributed resources, 2007.
[RVB07] M. Rahman, S. Venugopal, and R. Buyya. A dynamic critical path algo-
rithm for scheduling scientific workflow applications on global grids. In
e-Science and Grid Computing, IEEE International Conference on, pages
35–42. IEEE, 2007.
[SC01] Babu Sundaram and Barbara Chapman. Policy engine: A framework for
authorization, accounting policy specification and evaluation in grids. In
GRID, pages 145–153, 2001.
[SDM+05] J.M. Schopf, M. D’Arcy, N. Miller, L. Pearlman, I. Foster, and C. Kessel-
man. Monitoring and discovery in a web services framework: Function-
ality and performance of the globus toolkit’s mds4, argonne national lab-
oratory tech report anl. Preprint ANL/MCS-P1248-0405, 2005.
[SHRC] K. Shvachko, Kuang Hairong, S. Radia, and R. Chansler. The hadoop dis-
tributed file system. In Mass Storage Systems and Technologies (MSST),
2010 IEEE 26th Symposium on, pages 1–10.
[Sin93] A. Sinclair. Algorithms for random generation and counting: a Markov
chain approach, volume 7. Birkhauser, 1993.
[Slo94] Morris Sloman. Policy driven management for distributed systems. Jour-
nal of Network and Systems Management, 2(4):333–360, 1994.
[SRM10] D. Schwerdel, B. Reuther, and P. Muller. On using evolutionary algorithms
for solving the functional composition problem. 2010.
[SSS09] Dhrubajyoti Saha, Abhishek Samanta, and Smruti R. Sarangi. Theoretical
framework for eliminating redundancy in workflows. Services Computing,
IEEE International Conference on, 0:41–48, 2009.
[SSV+08] Gurmeet Singh, Mei-Hui Su, Karan Vahi, Ewa Deelman, Bruce Berriman,
John Good, Daniel S. Katz, and Gaurang Mehta. Workflow task clustering
for best effort systems with pegasus, 2008.
[Sut12] Frédéric Suter. Synthetic dag generation, 2012.
[SVR+07] Gurmeet Singh, Karan Vahi, Arun Ramakrishnan, Gaurang Mehta, Ewa
Deelman, Henan Zhao, Rizos Sakellariou, Kent Blackburn, Duncan
Brown, Stephen Fairhurst, David Meyers, G. Bruce Berriman, John Good,
and Daniel S. Katz. Optimizing workflow data footprint. Sci. Program.,
15(4):249–268, 2007.
[TCF+03] Steven Tuecke, Karl Czajkowski, Ian Foster, Jeffrey Frey, Steven Gra-
ham, Carl Kesselman, Tom Maguire, Thomas Sandholm, D. Snelling, and
P. Vanderbilt. Open grid services infrastructure (ogsi) version 1.0. 2003.
[TS98] I.J. Taylor and B.F. Schutz. Triana - A Quicklook Data Analysis System
for Gravitational Wave Detectors. In Second Workshop on Gravitational
Wave Data Analysis, pages 229–237. Editions Frontières, 1998.
[TSWH07a] Ian Taylor, Matthew Shields, Ian Wang, and Andrew Harrison. The Triana
Workflow Environment: Architecture and Applications. In Ian Taylor,
Ewa Deelman, Dennis Gannon, and Matthew Shields, editors, Workflows
for e-Science, pages 320–339. Springer, New York, Secaucus, NJ, USA,
2007.
[TSWH07b] Ian Taylor, Matthew Shields, Ian Wang, and Andrew Harrison. The triana
workflow environment: Architecture and applications. Workflows for e-
Science, pages 320–339, 2007.
[VSC+02] D. Verma, S. Sahu, S. Calo, M. Beigi, and I. Chang. A policy service
for grid computing. LECTURE NOTES IN COMPUTER SCIENCE, pages
243–255, 2002.
[WBKS05] V. Welch, T. Barton, K. Keahey, and F. Siebenlist. Attributes, anonymity,
and access: Shibboleth and globus integration to facilitate grid collabora-
tion. 4th Annual PKI R&D Workshop, 2005.
[WH03] G. Wasson and M. Humphrey. Policy and enforcement in virtual organi-
zations. Grid Computing, 2003. Proceedings. Fourth International Work-
shop on, pages 125–132, 2003.
[WPF05] M. Wieczorek, R. Prodan, and T. Fahringer. Scheduling of scientific work-
flows in the askalon grid environment. ACM SIGMOD Record, 34(3):56–
62, 2005.
[YC07] Yonghong Yan and B. Chapman. Scientific workflow scheduling in com-
putational grids - planning, reservation, and data/network-awareness. In
Grid Computing, 2007 8th IEEE/ACM International Conference on, pages
18 –25, sept. 2007.
[YYLC10] Dong Yuan, Yun Yang, Xiao Liu, and Jinjun Chen. A data placement strat-
egy in scientific cloud workflows. Future Generation Computer Systems,
26(8):1200–1214, 2010.
[ZeM10] Yifei Zhang and Yan e Mao. A scp based critical path scheduling strategy
for data-intensive workflows. In Fuzzy Systems and Knowledge Discovery
(FSKD), 2010 Seventh International Conference on, volume 4, pages 1735
–1739, aug. 2010.
[Zha] Xuehai Zhang and Jennifer M. Schopf. Performance analysis of the globus
toolkit monitoring and discovery service, mds2. In Proceedings of the
International Workshop on Middleware Performance (MP 2004).
Abstract
Scientific domains are increasingly adopting workflow systems to automate and manage large distributed applications. Workflow Management Systems (WMS) manage the overall scheduling and monitoring of both compute and data placement jobs for such applications. The management of data placement jobs in WMS provides the overall context for the problems addressed in this thesis.
This thesis starts by automating data placement for scientific applications based on user-provided data placement policies and proceeds to interface a WMS with a policy based data placement service (PDPS). It provides a solution for the lack of testing data in workflow science by developing a Synthetic Directed Acyclic Graph Generator (SDAG) and using synthetic workflows generated by it in a case study.
This thesis relies on actual software development and experimental analysis for both major research contributions. Experimental results using existing workflows prove the immediate benefit of PDPS for mid-sized virtual organizations. Results for SDAG demonstrate its usefulness in the design and development of future WMS.