PLANNING WITH CONTINUOUS RESOURCES IN AGENT SYSTEMS
by
Janusz Marecki
A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)
August 2008
Copyright 2008 Janusz Marecki
Acknowledgments
First and foremost, I would like to thank my advisor, Milind Tambe, for the support and men-
toring he has always been so enthusiastic to provide. I also want to thank the members of my
thesis committee, Victor Lesser, Jonathan Gratch, Fernando Ordóñez and Rajiv Maheswaran for
helping me to advance my research in the right direction and to think beyond it. Furthermore, I
would like to thank Sven Koenig, Makoto Yokoo, Paul Scerri, Sarit Kraus and David Kempe for
valuable insights on the material in this thesis and useful career guidance. Also, special thanks to
Honeywell Inc. and the CREATE research center at USC for the support of this work.
I also want to thank my colleagues at USC: I thank Nathan Schurr for the phenomenal job
he did in introducing me to the American culture, Zvi Topol for our fabulous movie making
endeavors, Jonathan Pearce for enthusiastically answering my mac-related questions, Pradeep
Varakantham for showing me the concept of indoor cricket, Tapana Gupta and Manish Jain for
teaching me Hindi, Praveen Paruchuri and Ranjit Nair for showing me how not to worry, Emma
Bowring for revealing to me the secret of sweet-life-goals, William Yeoh for showing me how to
be generous, Xiaosun Sun and Xiaoming Zheng for discussing with me the world issues, Chris
Portway for his willingness to tolerate my accent and James Pita for motivating me to do the
workout once in a while.
And finally, I would like to thank my family for never losing hope in me during this graduate
school adventure. Especially, I thank my wife Dorota for being there for me during this time.
Table of Contents
Acknowledgments
List Of Figures
Abstract
Chapter 1: Introduction
  1.1 Assumptions
  1.2 Problem Statement
  1.3 Thesis Contributions
  1.4 Overview of Thesis
Chapter 2: Motivating Domains and Formal Models
  2.1 Motivating Domains
    2.1.1 Planetary Exploration: Domain for Single Agent Algorithms
    2.1.2 Adjustable Autonomy: Domain for Single Agent Algorithms
    2.1.3 Civilian Rescue: Domain for Multiagent Algorithms
      2.1.3.1 Example 1: Fully Ordered Domain
      2.1.3.2 Example 2: Partially Ordered Domain
    2.1.4 General Description of Multiagent Domains
  2.2 Formal Models
    2.2.1 Markov Decision Process
    2.2.2 Continuous Resource MDP
    2.2.3 Continuous Resource, Decentralized MDP
Chapter 3: Single Agent Solutions
  3.1 Value Iteration Approach: The CPH Algorithm
    3.1.1 CPH Step 1: Phase Type Approximation
    3.1.2 CPH Step 2: Piecewise Gamma Value Functions
    3.1.3 CPH Step 3: Functional Value Iteration
      3.1.3.1 Simplified Example
      3.1.3.2 The General Case
    3.1.4 Error Control
  3.2 Forward Search Approach: The DPFP Algorithm
    3.2.1 DPFP at a Glance
    3.2.2 Dual Problem
    3.2.3 Solving the Dual Problem
    3.2.4 Taming the Algorithm Complexity
    3.2.5 Error Control
Chapter 4: Experiments with Single Agent Algorithms
  4.1 CPH Feasibility Experiments
  4.2 CPH and DPFP Scalability Experiments
  4.3 DPFP-CPH Hybrid Experiments
  4.4 DEFACTO Experiments
Chapter 5: Multiagent Solutions
  5.1 Locally Optimal Solution: The VFP Algorithm
    5.1.1 Policy Iteration Approach
    5.1.2 Functional Representation
    5.1.3 Opportunity Cost Function Propagation Phase
    5.1.4 Probability Function Propagation Phase
    5.1.5 Splitting the Opportunity Cost Functions
    5.1.6 Implementation of Function Operations
  5.2 Globally Optimal Solution: The M-DPFP Algorithm
    5.2.1 Arriving at M-DPFP
    5.2.2 The M-DPFP Algorithm
    5.2.3 Representative States
      5.2.3.1 Representative State Selection
      5.2.3.2 Policy for non-Representative States
      5.2.3.3 Error Bound
    5.2.4 Lookahead Function
      5.2.4.1 Dual Problem Formulation
      5.2.4.2 Dual Problem and the Lookahead Function
      5.2.4.3 Solving the Dual Problem
Chapter 6: Experiments with Multiagent Algorithms
  6.1 VFP Experiments
    6.1.1 Evaluation of the Opportunity Cost Splitting Heuristics
    6.1.2 Comparison of VFP and OC-DEC-MDP Efficiency
  6.2 M-DPFP Experiments
    6.2.1 Efficiency of the M-DPFP Lookahead Function
    6.2.2 Efficiency of the M-DPFP algorithm
Chapter 7: Related Work
  7.1 Single Agent Systems
    7.1.1 Constrained MDPs
    7.1.2 Semi MDPs
    7.1.3 Continuous state MDPs
  7.2 Multiagent Systems
    7.2.1 Globally Optimal Solutions
    7.2.2 Locally Optimal Solutions
Chapter 8: Conclusions
  8.1 Summary
  8.2 Future Work
Reference List
Appendix A: Phase Type Distribution
Appendix B: Classes of Phase Type Distributions
Appendix C: Fitting to Phase-Type Distributions
Appendix D: Uniformization of Phase Type Distributions
List Of Figures
2.1 Simplified planetary exploration domain
2.2 DEFACTO disaster simulation system
2.3 Civilian rescue domain and a mission plan
2.4 Partially ordered multiagent domain
2.5 Agent interacting with the environment in a Markov Decision Process
3.1 Phase type approximation example
3.2 CPH Value function
3.3 Proof by induction of CPH value iteration
3.4 Forward search of DPFP
4.1 Empirical evaluation of CPH for exponential distributions
4.2 Empirical evaluation of CPH for Weibull distributions
4.3 Planning horizon of CPH
4.4 Domains for CPH and DPFP scalability experiments
4.5 Scalability experiments for DPFP, CPH and Lazy Approximation
4.6 DPFP+CPH hybrid: Fully ordered domain
4.7 DPFP+CPH hybrid: Unordered domain
4.8 DPFP+CPH hybrid: Partially ordered domain
4.9 RIAACT: Adjustable autonomy component of the DEFACTO system
4.10 RIAACT results
5.1 Agent policy in a fully-ordered CR-DEC-MDP
5.2 Method time window and execution window
5.3 Propagation of opportunity cost functions and probability functions
5.4 Visualization of the operation performed by Equation 5.3
5.5 Splitting the opportunity cost functions
5.6 M-DPFP algorithm: Generation of the representative states
5.7 Example representative states
5.8 Method sequences graph
5.9 Demonstration of the H_UP heuristic
5.10 Greedy policy for the non-representative states
5.11 Greedy policy and the error bound
5.12 The range of the lookahead function
6.1 Visualization of the opportunity costs splitting heuristics
6.2 Mission plan configurations
6.3 VFP performance in the civilian rescue domain
6.4 Scalability experiments for OC-DEC-MDP and VFP
6.5 Runtime and quality of the lookahead function
6.6 Error bound of the Lookahead function
6.7 Runtime of the M-DPFP algorithm
6.8 Quality loss of the M-DPFP algorithm
Abstract
My research concentrates on developing reasoning techniques for intelligent, autonomous agent
systems. In particular, I focus on planning techniques for both single and multi-agent systems
acting in uncertain domains. In modeling these domains, I consider two types of uncertainty: (i)
the outcomes of agent actions are uncertain and (ii) the amount of resources consumed by agent
actions is uncertain and only characterized by continuous probability density functions. Such
rich domains, which range from Mars rover exploration to unmanned aerial surveillance to
automated disaster rescue operations, are commonly modeled as continuous resource Markov
decision processes (MDPs) that can then be solved in order to construct policies for agents acting
in these domains.
This thesis addresses two major unresolved problems in continuous resource MDPs. First,
they are very difficult to solve and existing algorithms are either fast but make additional restrictive
assumptions about the model, or do not introduce these assumptions but are very inefficient.
Second, the continuous resource MDP framework is not directly applicable to multi-agent systems,
and current approaches all discretize resource levels or assume deterministic resource consumption,
which automatically invalidates the formal solution quality guarantees. The goal of my thesis
is to fundamentally alter this landscape in three contributions:
I first introduce CPH, a fast analytic algorithm for solving continuous resource MDPs. CPH
solves the planning problems at hand by first approximating with a desired accuracy the proba-
bility distributions over the resource consumptions with phase-type distributions, which use ex-
ponential distributions as building blocks. It then uses value iteration to solve the resulting MDPs
more efficiently than its closest competitor, and allows for a systematic trade-off of solution
quality for speed.
Second, to improve the anytime performance of CPH and other continuous resource MDP
solvers I introduce the DPFP algorithm. Rather than using value iteration to solve the problem
at hand, DPFP performs a forward search in the corresponding dual space of cumulative distri-
bution functions. In doing so, DPFP discriminates in its policy generation effort, providing only
approximate policies for regions of the state-space reachable with low probability, yet it bounds
the error that such approximation entails.
Third, I introduce CR-DEC-MDP, a framework for planning with continuous resources in
multi-agent systems and propose two algorithms for solving CR-DEC-MDPs: The first algorithm
(VFP) emphasizes scalability. It performs a series of policy iterations in order to quickly find a
locally optimal policy. In contrast, the second algorithm (M-DPFP) stresses optimality; it allows
for a systematic trade-off of solution quality for speed by using the concept of DPFP in a multi-
agent setting.
My results show up to three orders of magnitude speedups in solving single agent planning
problems and up to one order of magnitude speedup in solving multi-agent planning problems.
Furthermore, I demonstrate the practical use of one of my algorithms in a large-scale disaster
simulation where it allows for a more efficient rescue operation.
Chapter 1: Introduction
Recent years have seen an unprecedented rise in interest in the deployment of aerial [Beard
and McLain, 2003], underwater [Blidberg, 2001] and terrestrial [Bresina et al., 2002; Thrun
et al., 2006] unmanned vehicles to perform a variety of tasks in environments that are often
hazardous and inaccessible to humans. Whereas some of these vehicles are tele-operated [Beard
and McLain, 2003], vehicles such as the Mars rovers must often act autonomously due to inherent
communication limitations with their human operators [Bresina et al., 2002]. As a result, auto-
mated planning techniques for these vehicles have recently received a lot of attention [Pedersen
et al., 2005].
Of particular importance are planning techniques that take into account the uncertain nature
of agent environments, to construct robust plans for all possible agent action outcomes, including
action failures. For example, current Mars rovers receive only high-level instructions such as
to “move from site A to site B” and plan the optimal traversal between these sites themselves.
In doing so, they currently do not plan for action failures. As a result, it has been estimated
[Mausam et al., 2005] that the 1997 Mars Pathfinder rover spent between 40% and 75% of its time
doing nothing because plans did not execute as expected.
Furthermore, planning techniques for autonomous agents must often take into account the
energy requirements of these agents, or more generally, resource requirements for resources such
as time, energy and storage, whose consumption is uncertain and can only be characterized by contin-
uous probability density functions. For example, with the data collected during the 1997 Mars
Pathfinder mission, NASA researchers [Bresina et al., 2002] are now in possession of statistical
distributions of energy and time required to perform various rover activities. That abundance
of data has in turn encouraged the Artificial Intelligence community to develop more expres-
sive planning frameworks and more efficient algorithms for planning in continuous, dynamic and
uncertain environments [Pedersen et al., 2005].
1.1 Assumptions
Planning problems for autonomous agents are often characterized by different assumptions. In
particular, the planning problems that this thesis considers share the following characteristics:
Agents can fully observe their local states and know the rewards for their actions, which are
assumed to be positive and obtained upon transitioning to new states.
The execution of agent actions is related to resource availability: Initial resource levels are
known, actions cause resource levels to decrease (according to a given continuous proba-
bility density function) and can be executed only if required resources are available. This
thesis considers a case where agents have to deal with a single type of resource.
The goal of the agents is to maximize reward yields.
1.2 Problem Statement
Continuous resource MDPs have emerged as a powerful planning technique to address such do-
mains with uncertain resource consumption and resource limits. Unfortunately, continuous re-
source MDPs are very difficult to solve, both for single and multi-agent settings.
For single agent settings, existing algorithms either (i) make additional restrictive assumptions
about the model or (ii) do not add new assumptions, but run very slow. The first limitation applies
to Constrained MDPs [Altman, 1999] [Dolgov and Durfee, 2005] that do not allow for stochastic
resource consumption or Semi MDPs and Generalized Semi MDPs [Puterman, 1994] [Younes
and Simmons, 2004] that do not allow policies to be indexed by the amount of resources left. On
the other hand, the second limitation applies to existing value iteration algorithms [Boyan and
Littman, 2000], [Feng et al., 2004], [Liu and Koenig, 2005] and [Li and Littman, 2005] or policy
iteration algorithms [Lagoudakis and Parr, 2003], [Hauskrecht and Kveton, 2004], [Nikovski
and Brand, 2003], [Dolgov and Durfee, 2006], [Petrik, 2007] [Mahadevan and Maggioni, 2007]
which may exhibit poor performance with the scale-up in their state spaces. To remedy that,
[Guestrin et al., 2004] developed a technique to factor the underlying MDP thus reducing its
complexity and [Mausam et al., 2005] developed a technique that prunes out the unreachable
states and considers the remaining states in the order of their importance. Unfortunately, even
with these improvements, current algorithms for solving continuous resource MDPs may still run
slow in domains where solution quality guarantees are required [Pedersen et al., 2005].
For multi-agent settings, the situation is even worse, as continuous resource MDPs have re-
ceived little attention, despite the increasing popularity of multi-agent domains with continuous
characteristics [Raja and Lesser, 2003], [Becker et al., 2003], [Lesser et al., 2004], [Musliner
et al., 2006]. In fact, the proposed globally optimal algorithms [Becker et al., 2003], [Becker
et al., 2004], [Nair et al., 2005], [Varakantham et al., 2007] as well as locally optimal algorithms
[Nair et al., 2003], [Beynier and Mouaddib, 2005], [Beynier and Mouaddib, 2006] all discretize
the continuous resources and encode them as separate features of the state. As a consequence,
the planning horizons of those algorithms are limited to only a few time ticks. On top of that,
the discretization process invalidates the formal solution quality guarantees that these algorithms
provide.
Thus, there are two unresolved problems to be addressed: The first is to design new efficient
algorithms for continuous resource MDPs that are faster than existing algorithms, while providing
error bounds. The second is to design a framework and algorithms for planning with continuous
resources in multi-agent systems.
1.3 Thesis Contributions
In this context I introduce the three major contributions of my thesis: First, I develop a new
method for solving MDPs with continuous resources called CPH (Continuous resource MDPs
through PHase-type distributions) which finds solutions with quality guarantees, but is several
orders of magnitude faster than its closest competitor, the Lazy Approximation algorithm [Li and
Littman, 2005]. CPH approximates the probability distributions with phase-type distributions,
which use exponential probability distributions as building blocks [Neuts, 1981]. It then uses
the analytical value iteration technique to solve the resulting MDPs by exploiting the properties
of exponential probability distributions to derive the necessary convolutions efficiently and pre-
cisely. Furthermore, I demonstrate the successful integration of CPH with RIAACT (Reasoning
in Adjustable Autonomy in Continuous Time) [Schurr et al., 2008], the adjustable autonomy
component of the DEFACTO system [Schurr et al., 2005] for training the incident commanders
of the Los Angeles Fire Department.
Second, I introduce a DPFP algorithm (Dynamic Probability Function Propagation), to im-
prove the anytime performance of CPH and other techniques for solving MDPs with continuous
resources. Rather than performing value iteration to solve the problem at hand, DPFP performs
a forward search in the corresponding space of cumulative distribution functions. In doing so,
DPFP discriminates in its policy generation effort, providing only approximate policies for re-
gions of the state-space reachable with low probability yet it is able to bound the error that such
approximation entails. When run alone, DPFP outperforms other algorithms in terms of its any-
time performance, whereas when run as a hybrid, it allows for a significant speedup of CPH.
The third contribution in my thesis provides techniques for planning with continuous re-
sources in a multi-agent setting. To this end I introduce Continuous Resource, Decentralized
MDP (CR-DEC-MDP), a first framework for planning with continuous resources in multi-agent
systems and propose two algorithms for solving problems modeled as CR-DEC-MDP: Value
Function Propagation (VFP) and Multiagent DPFP (M-DPFP). VFP emphasizes scalability; it
leverages the value function reasoning to a multi-agent setting, that is, it maintains and manipu-
lates a value function over time for each state rather than a separate value for each pair of method
and time interval. Such representation allows VFP to group the time points for which the value
function changes at the same rate which gives a one order of magnitude speedup over VFP’s clos-
est competitor, the OC-DEC-MDP algorithm [Beynier and Mouaddib, 2005]. In addition, VFP
identifies and corrects critical overestimations of the expected value of a policy that OC-DEC-
MDP fails to address. On the other hand, in order to solve CR-DEC-MDPs optimally, I propose
the M-DPFP algorithm. M-DPFP finds policies with solution quality guarantees by employing
the concept of forward search in the space of cumulative distribution functions in a multi-agent
setting.
1.4 Overview of Thesis
Having outlined the major theme of my thesis, the rest of my thesis document is organized as
follows: Chapter 2 provides the motivating domains for my work and introduces the formal
frameworks for planning with continuous resources: (i) The Continuous Resource MDP model
for single agent systems and (ii) the Continuous Resource, Decentralized MDP model for mul-
tiagent systems. In Chapter 3, which focuses on the single agent case, I develop two algorithms for
solving problems modeled as continuous resource MDPs: A value iteration algorithm (CPH) that
attains high quality solutions and a forward search algorithm (DPFP) that exhibits superior any-
time performance. The experimental evaluation of CPH, DPFP as well as a DPFP-CPH hybrid
algorithm that combines the strengths of the two algorithms is presented in Chapter 4. Further-
more, Chapter 4 also reports on the successful integration of CPH with the DEFACTO disaster
rescue simulation system. I then move on to multiagent planning problems. In Chapter 5 I de-
velop two algorithms for solving problems modeled as CR-DEC-MDPs: A fast, locally optimal
algorithm (VFP) and a globally optimal algorithm (M-DPFP). The experimental study of these
algorithms is presented in Chapter 6. I discuss the related work in Chapter 7. The summary of
my dissertation and future research directions are outlined in Chapter 8.
Chapter 2: Motivating Domains and Formal Models
In this chapter I first introduce three different domains that motivate this research: (i) A planetary
exploration domain for single agent algorithms, (ii) a disaster rescue domain for single agent
algorithms and (iii) a civilian rescue domain for multiagent algorithms. Next, I provide a general
description of the class of multiagent domains that this thesis focuses on. Finally, for modeling
the planning problems of interest, I introduce two formal frameworks: (i) Continuous Resource
MDP, for single agent planning problems and (ii) Continuous Resource, Decentralized MDP, for
multiagent planning problems.
2.1 Motivating Domains
This section presents three different domains that motivate this research: Section 2.1.1 introduces
a planetary exploration domain for single agent algorithms, Section 2.1.2 introduces a disaster
rescue domain for single agent algorithms and Section 2.1.3 introduces a civilian rescue
domain for multiagent algorithms. Finally, Section 2.1.4 provides a general description of the
class of multiagent domains that this thesis focuses on.
2.1.1 Planetary Exploration: Domain for Single Agent Algorithms
I consider time as an example continuous resource and focus on a simplified version of the plan-
etary exploration problem originally introduced by [Bresina et al., 2002] and later extended by
[Pedersen et al., 2005] as a motivation for my work. Following the success of the first NASA
Mars exploration rover Sojourner, there are currently two other rovers rolling on the surface of
Mars: Spirit and Opportunity. The mission of each rover usually consists of visiting various ex-
ploration sites, and transferring the collected data back to the lander base that can later relay it
back to Earth. There are two major challenges that rover exploration presents: (i) since it takes
20 minutes for a control signal from Earth to reach Mars, rovers must act autonomously, and
(ii) because of the uncertain nature of the terrain the rovers traverse, their actions are not deter-
ministic. For example, it has been estimated that Sojourner spent between 40% and 75% of its
time doing nothing because the execution of its actions resulted in situations that had not been
planned for, and consequently the rover had to wait for the arrival of new commands from the
mission operation center on Earth. Since rovers have a limited life-span, it is crucial to collect the
important data as soon as possible while ensuring that a rover does not get lost.
Exploration rovers rely on solar power, and can only operate during the day. At any point
during its operation hours, one can estimate the remaining time of rover operation for that day.
For simplicity, this remaining time of operation for a day will be referred to as time-to-deadline.
Figure 2.1 illustrates a simple version of the problem outlined by [Bresina et al., 2002]. A rover
has to maximize its expected total reward with time-to-deadline = 4. There are five locations
that the rover can be in: Rover start location, site 1, site 2, site 3 and rover base station. The
rover can perform two actions with deterministic action outcomes outside of its base station: (1)
It can move to the next site and collect a rock probe from there. It receives rewards 4, 2 and 1
for collecting rock probes from sites 1, 2 and 3, respectively. (2) It can also skip all remaining
sites and move back to its base station to perform a communication task, which drains its energy
completely and thus makes it impossible for the rover to perform additional actions. It receives
reward 6 for performing the communication task. Finally, action durations are not deterministic;
they are sampled from either exponential, normal or Weibull distributions. For a domain in Figure
2.1 the ordering in which the rover has to visit different sites is already given. However, depending
on the amount of time left, the rover does not know whether it should (i) move to the next site or
(ii) return to base. Hence, for any point in time and any site of interest, the rover must be able to
efficiently determine the total expected utilities of these two different actions.
Figure 2.1: Simplified planetary exploration domain
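To make the structure of this example concrete, the following sketch encodes the domain of Figure 2.1 as a small simulator. It is a minimal illustration only: the Python dictionaries, the function names and the unit-rate exponential durations are assumptions of this sketch, not the representation used by the algorithms developed later in this thesis.

```python
import random

# Hypothetical encoding of the simplified planetary exploration domain (Figure 2.1).
# Deterministic discrete transitions and rewards come from the text; the unit-rate
# exponential action durations are an assumption made for this sketch.
DEADLINE = 4.0
ACTIONS = {
    "start": {"move-to-next-site": ("site1", 4.0), "return-to-base": ("base", 6.0)},
    "site1": {"move-to-next-site": ("site2", 2.0), "return-to-base": ("base", 6.0)},
    "site2": {"move-to-next-site": ("site3", 1.0), "return-to-base": ("base", 6.0)},
    "site3": {"return-to-base": ("base", 6.0)},
    "base": {},   # the communication task drains the energy: no further actions
}

def simulate(policy, rng=random.Random(0)):
    """Run one episode; policy maps (state, time_to_deadline) -> action name."""
    state, t, total = "start", DEADLINE, 0.0
    while ACTIONS[state] and t > 0.0:
        next_state, reward = ACTIONS[state][policy(state, t)]
        t -= rng.expovariate(1.0)      # stochastic action duration
        if t <= 0.0:                    # deadline reached during the action
            break
        total += reward                 # reward is earned only if the action finishes in time
        state = next_state
    return total
```

For instance, simulate(lambda s, t: "return-to-base") evaluates one run of the policy that immediately heads back to base, while a smarter policy would compare, at every site, the expected utility of pressing on against returning — the computation formalized in Section 2.2.2.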
2.1.2 Adjustable Autonomy: Domain for Single Agent Algorithms
Another important domain that motivates this thesis is the adjustable autonomy problem for a
disaster rescue domain shown in Figure 2.2. In this domain a large scale disaster such as an
earthquake hits a city, resulting in multiple buildings going on fire at the same time. With little
or no time to prepare for such an eventuality, a disaster rescue operation has to be performed in
a coordinated fashion. In particular, one can envision an incident commander (a person in Figure
2.2) who makes strategic decisions about which fires to fight first. For example, an incident
commander might be aware of the fact that the overall damage in the city can be reduced if
some areas of the city are sacrificed in favor of other areas. The incident commander can then
summon all the available resources (fire engines) to perform a particular task rather than having
the resources be used inefficiently in a non-coordinated fashion on multiple other tasks.
Figure 2.2: DEFACTO disaster simulation system
However, the human incident commander can quickly become overwhelmed by the number
of role allocation requests from the fire engines participating in the disaster rescue operation
[Marecki et al., 2005]. To alleviate that, some of the role allocation burden needs to be shifted
from the human incident commander to a fire engine agent [Schurr et al., 2005]. In essence, the
agent must reason over time as a continuous resource to determine whether to take advice from
the human or not, and, when advice is obtained and the agent disagrees with it, whether to continue
negotiating with the human or not.
2.1.3 Civilian Rescue: Domain for Multiagent Algorithms
Planning under uncertainty for agent teams is more challenging than for single agent systems,
because of the lack of global state knowledge in multiagent systems. In essence, even though
an agent has incomplete knowledge of other agents’ execution status, it must still choose optimal
actions. The restriction about the lack of global state knowledge can in theory be eliminated by
assuming that the agents can directly communicate. However, such an assumption introduces
two problematic issues:
If communication fails yet the agents are still executing policies that assume reliable
communication, the quality of such policies is questionable.
Communication does not reduce the complexity of the problem. Indeed, as has been shown
in [Nair et al., 2004], communication requires computationally intensive offline planning
about all possible communication messages, which often prevents the underlying algorithms
from scaling up to big domains.
This thesis considers a communication mode that mitigates the above-mentioned issues, yet
still allows the agents to exchange the information during the execution phase. What makes it
possible is the use of the environment to transmit the information [Becker et al., 2004].
2.1.3.1 Example 1: Fully Ordered Domain
For the multi-agent domains in this thesis it is assumed that time is a continuous resource. In
that context, agents must coordinate their plans over time, despite uncertainty in plan execution
duration and outcome. In particular, this section introduces a fully-ordered domain where agents
know the activities that they have to perform, yet do not know when these activities need to be
performed. One example domain is a large-scale disaster, like a fire in a skyscraper. Because
there can be hundreds of civilians scattered across numerous floors, multiple rescue teams have
to be dispatched, and radio communication channels can quickly get saturated and useless. In
particular, firefighters must be sent on separate missions to rescue the civilians trapped in dozens
of different locations.
Picture a small mission plan from Figure 2.3, where three firefighters have been assigned a
task to rescue the civilians trapped at site B, accessed from site A. One example of such situation
might be when site B is a conference room and site A is a hall. General fire fighting procedures
involve both: (i) putting out the flames, and (ii) ventilating the site to let the toxic, high temper-
ature gases escape, with the restriction that ventilation should not be performed too fast in order
to prevent the fire from spreading. The team starts its mission at time EST = 0 (Earliest Starting
Time) and estimates that at LET = 20 (Last Execution Time) the fire at site B will become un-
bearable for the civilians. In addition, the team knows that the fire at site A has to be put out in
order to open the access to site B. Furthermore, each team member knows the actions (methods)
that it has to execute (white boxes inside gray rectangles in Figure 2.3) and the ordering of its
actions (dotted arrows in Figure 2.3).
As has happened in the past in large scale disasters, communication often breaks down and
hence, the civilian rescue domain introduced here assumes that there is no explicit communication
between firefighters 1,2 and 3 (denoted as FF1, FF2 and FF3). In particular, FF2 does not know
if it is already safe to ventilate site A, FF1 does not know if site A has already been ventilated
and it is safe to start fighting fire at site B, etc. However, agents can communicate through the
environment, e.g., if agent FF2 successfully ventilates site A it can infer without asking agents
FF1 and FF3 that the fire at site A has been put out.
Figure 2.3: Civilian rescue domain and a mission plan
One can clearly see the dilemma that FF2 faces: It can only estimate the durations of the
“Fight fire at site A” methods executed by FF1 and FF3 and at the same time, FF2 knows that
time is running out for civilians. If FF2 ventilates site A too early (before the fire at site A is put
out by agents FF1 and FF3), the fire will spread out of control. On the other hand, if FF2 waits
with the ventilation of site A for too long, fire at site B will become unbearable for the civilians.
In general, agents have to perform a sequence of such difficult decisions. For example, a decision
process of FF2 involves first choosing when to start ventilating site A, and then (depending on
the time it took to ventilate site A), choosing when to start evacuating the civilians from site B.
Such a sequence of decisions constitutes the policy of an agent, and it must be found fast because
time is running out.
2.1.3.2 Example 2: Partially Ordered Domain
This section illustrates a multiagent domain where the activities of the agents are only partially
ordered. Consider a planning problem in Figure 2.4 where a team of agents has to perform a set
of activities in a continuous time interval [Δ_s, Δ_e]. The agent team consists of two agents, Agent 1
and Agent 2, on a mission to perform methods m_1, m_2, m_3 and m_4, m_5, m_6 respectively, in any order.
Furthermore, (i) each agent can be executing only one method at a time, (ii) the execution of methods
m_4 and m_2 must be preceded by the successful completion of methods m_1 and m_5 respectively and
(iii) method execution durations are sampled from a normal distribution with a mean of 4 and
variance of 2. The goal of each agent is to maximize the team reward earnings for rewards
r_1 = 1, r_2 = 5, r_3 = 2, r_4 = 5, r_5 = 1, r_6 = 2 earned upon the successful execution of methods
m_1, m_2, m_3, m_4, m_5, m_6 respectively.

Figure 2.4: Partially ordered multiagent domain

Consider the situation where, at mission start time Δ_s, Agent 1 and Agent 2 are choosing
which methods they should start executing first. Since r_2 and r_4 are much greater than the rewards
for the other methods, Agent 1 might be tempted to first execute method m_1 in order to allow Agent
2 to successfully start the execution of its method m_4 early enough, so that the execution of
method m_4 can finish before the mission deadline. On the other hand, Agent 2 might be tempted
to execute method m_5 first, such that Agent 1 has enough time to start and finish the execution
of its method m_2 before the mission deadline. Clearly, this inconsistency has to be resolved, as
there might not be enough time (before the mission deadline Δ_e) to execute both the execution
sequence m_1 followed by m_4 and the execution sequence m_5 followed by m_2.

Suppose that Agent 1 decides that, together with Agent 2, they will attempt to execute the
sequence m_1 followed by m_4. The plan is that Agent 1 will start executing m_1 while Agent 2 will
wait until time t_1, after which it will start executing its method m_4. When the execution of m_4
ends, e.g. at time t_2, Agent 2 will make a decision whether to start executing m_5 or m_6. If there is
enough time left, i.e., if Δ_e − t_2 is sufficiently big, Agent 2 will consider that there is still enough
time for the sequence of methods m_5 followed by m_2 to be executed successfully. Otherwise, it
will simply decide to go after method m_6, which is worth more than method m_5 alone. Note a
complication here: t_2 is not known before the mission starts and thus, to evaluate the utility of the
sequence m_1 followed by m_4, one would have to consider an infinite number of values t_2, which
is impossible.

Notice how the environment is used for communication in this domain: If agent 2 executes
method m_4 within its time window but this execution is unsuccessful, agent 2 can immediately
infer that agent 1 has not executed its method m_1 successfully. Similarly, if agent 1 executes
method m_2 within its time window and this execution is successful, agent 1 can immediately infer
that agent 2 has successfully executed method m_5 before agent 1 started executing method
m_2. Thus, similarly to [Becker et al., 2004], special purpose methods can be added to play a role
in exchanging information between the agents.
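To make Agent 2's choice at time t_2 concrete, the sketch below estimates, by plain Monte Carlo simulation, the expected reward of starting m_5 (hoping m_2 can still follow) versus starting m_6. The reward values and the normal duration distribution (mean 4, variance 2) come from the example above, while the sequential success model and all function names are simplifying assumptions of this sketch.

```python
import random

REWARDS = {"m2": 5, "m5": 1, "m6": 2}          # rewards from the example above
MEAN, VAR = 4.0, 2.0                            # durations ~ Normal(mean 4, variance 2)

def duration(rng):
    return max(0.0, rng.gauss(MEAN, VAR ** 0.5))

def value_of_m5_then_m2(t2, deadline, rng):
    """Agent 2 starts m5 at t2; if it finishes in time, m2 is executed next (simplified model)."""
    reward, t = 0.0, t2
    t += duration(rng)
    if t <= deadline:
        reward += REWARDS["m5"]
        t += duration(rng)                      # m2 can only start after m5 succeeds
        if t <= deadline:
            reward += REWARDS["m2"]
    return reward

def value_of_m6(t2, deadline, rng):
    return REWARDS["m6"] if t2 + duration(rng) <= deadline else 0.0

def compare(t2, deadline, n=10000, seed=0):
    rng = random.Random(seed)
    v56 = sum(value_of_m5_then_m2(t2, deadline, rng) for _ in range(n)) / n
    v6 = sum(value_of_m6(t2, deadline, rng) for _ in range(n)) / n
    return v56, v6        # start m5 when v56 > v6, i.e. when enough time remains
```

Running compare(t2, deadline) for increasing values of t2 reveals the switch-over point past which m_6 alone becomes the better choice — the kind of time-indexed decision that the multiagent algorithms of Chapter 5 must reason about without enumerating every possible value of t_2.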
2.1.4 General Description of Multiagent Domains
This section provides a general description of the multiagent planning problems analyzed in this
thesis. It assumes a team of agents deployed on a mission to perform a given set of tasks,
referred to as methods. Such a formulation has received a lot of attention in the literature. In
particular, it is inspired by the DARPA Coordinators effort [Musliner et al., 2006], which in turn
is based on the popular GPGP paradigm [Decker and Lesser, 1995].

The formal description of the multiagent planning problems considered in this thesis is as
follows: A team of N agents is deployed on a mission to perform a set of methods from the set
M = {m_1, ..., m_K}. Each agent n is assigned a set M_n of methods, such that {M_n}_{n=1}^N is a
partitioning of {m_k}_{k=1}^K. Furthermore, each agent can be executing only one method at a time,
each method can be executed only once and each method execution consumes a certain amount of
agent resource — it is assumed that there is only one type of resource. Also, the probability density
function p_k of the amount of resource consumed by the execution of method m_k is assumed to be
known, for all k = 1, ..., K. For example, every agent knows that method m_k will consume t
amount of resource with probability p_k(t).

Furthermore, the planning problems of interest are specified by two types of constraints,
resource precedence constraints and resource limit constraints, with the following meaning (a
concrete sketch of both checks is given at the end of this section):

Resource precedence constraints, grouped in set C_≺, impose restrictions on method execution
based on resource levels of other methods. Precisely, a resource precedence constraint
⟨i, j⟩ ∈ C_≺ is a binary relation that links methods m_i, m_j ∈ M that belong to possibly different
agents. The constraint imposes two necessary (but not sufficient) conditions on the successful
execution of method m_j. These are: (i) method m_i must be executed successfully and (ii) if the
execution of method m_i finishes with l amount of resource left, the execution of method m_j must
start with less than l resource left. For example, if time-to-deadline is a resource, condition (ii)
states that method m_j must be started after method m_i is finished.

Method m_j is said to be enabled at resource level l if all the methods m_i such that ⟨i, j⟩ ∈ C_≺
have been successfully finished with more than l resources left. For example, if time is a resource
then method m_j is enabled at time t if all the methods m_i such that ⟨i, j⟩ ∈ C_≺ have been
successfully finished before time t. To illustrate this concept consider Figure 2.3. Here, method
“Ventilate site A” is enabled by two methods: “Fight fire at site A” of Firefighter 1 and “Fight fire
at site A” of Firefighter 3.

Checking whether, for a given joint policy, a resource precedence constraint is met is trivial
if methods m_i and m_j belong to the same agent. However, it is more problematic if methods
m_i and m_j belong to different agents, because it is assumed that agents cannot directly
communicate. Indeed, it is by observing whether method m_j has been executed successfully or
unsuccessfully that an agent can learn whether method m_i has been executed successfully or
unsuccessfully. Hence, checking the satisfiability of resource precedence constraints post factum
plays a key role in inter-agent communication (see also Section 2.1.3.2).

Resource limit constraints, grouped in set C_[], impose restrictions on internal agent resource
levels during method execution. Precisely, a resource limit constraint ⟨i, l_1, l_2⟩ ∈ C_[] for
a method m_i ∈ M imposes that the resource levels of an agent executing method m_i must
stay within the range [l_1, l_2]. For example, if absolute time is a resource, method m_i should
be started after time l_1 and finished before time l_2. It follows from the resource limit
constraints in C_[] that the maximum resource level to be considered when constructing an
optimal plan is Δ = max{l_2 : ⟨i, l_1, l_2⟩ ∈ C_[]}. Finally, if absolute time is a resource, Δ is
referred to as the mission deadline.

The goal of the agents is to maximize the joint reward earned by the agent team. The individual
reward r_i for executing method m_i is earned when: (i) the resource limit constraints for method
m_i are met and (ii) method m_i is enabled when its execution starts. If condition (i) is met but
condition (ii) is not met, the execution of method m_i proceeds normally (as if m_i was enabled when
it was started), but the execution of m_i is unsuccessful and reward r_i is not earned. Also, when
during the execution of method m_i condition (i) ceases to hold (the agent resource level drops too
low), the execution of method m_i is interrupted, method m_i is considered to have been executed
unsuccessfully and reward r_i is not earned.
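As a concrete reading of the two constraint types above, the sketch below checks whether a single method execution earns its reward, given the resource levels at which it started and finished and the finish levels of the methods that enable it. The dictionary layout and function name are illustrative assumptions, not the formalism introduced in Section 2.2.3.

```python
# Illustrative check of the two constraint types for one executed method m_j.
# resource_limits[j] = (l1, l2): the admissible resource range for m_j (treated order-agnostically here).
# predecessors[j]    = list of methods i with <i, j> in the precedence constraint set.
# finish_levels[i]   = resource level left when m_i finished successfully (absent if it failed).

def earns_reward(j, start_level, end_level, resource_limits, predecessors, finish_levels):
    # Resource limit constraint: the level stays within the admissible range while m_j runs
    # (levels only decrease, so checking the start and end levels suffices).
    lo, hi = sorted(resource_limits[j])
    limit_ok = lo <= end_level <= start_level <= hi

    # Resource precedence constraint: m_j is enabled at the level at which it starts, i.e.
    # every enabling method finished successfully with at least that much resource left.
    enabled = all(i in finish_levels and finish_levels[i] >= start_level
                  for i in predecessors.get(j, []))

    return limit_ok and enabled   # reward r_j is earned only if both conditions hold
```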
2.2 Formal Models
For modeling the planning problems such as the ones introduced in Section 2.1, in this section
I introduce two formal frameworks: (i) Continuous Resource MDP — for single agent planning
problems and (ii) Continuous Resource, Decentralized MDP — for multiagent planning prob-
lems. To this end, I first provide the formal definition of the Markov Decision Process.
2.2.1 Markov Decision Process
Markov Decision Process (MDP) [Puterman, 1994] is a mathematical model for decision making
where the outcomes of actions are stochastic. An MDP assumes that there is a synchronous dialog
between some environment and some decision making entity. At each step of this dialog, the
entity makes a decision which affects the environment and then receives an observation which
uniquely determines the new state of the environment (Figure 2.5). From now on, this entity will be
referred to as an agent.
Figure 2.5: Agent interacting with the environment in a Markov Decision Process
Formally, an MDP is a tuple ⟨S, A, P, R⟩ where:

S is the set of states of the environment;

A is the set of agent actions and A(s) ⊆ A is the set of actions that can be executed from
state s ∈ S.

P : S × A × S → [0, 1] is the transition function. For example, P(s, a, s') is the probability
that the agent will transition to state s' ∈ S if it executes action a ∈ A(s). For all states
s ∈ S and actions a ∈ A(s) it holds that Σ_{s'∈S} P(s, a, s') = 1.

R : S × A × S → ℝ is the reward function. For example, R(s, a, s') is the reward that the
agent receives when it transitions to state s' ∈ S after executing an action a ∈ A(s).
A deterministic, stationary policy for an MDP ⟨S, A, P, R⟩ is a mapping π : S → A. For
example, π(s) = a postulates that an action a ∈ A will be executed from state s ∈ S. For a policy π
and some state s ∈ S, the total expected utility V_π(s) of policy π started in state s can be derived
using the following recursive equation:

    V_π(s) = Σ_{s'∈S} P(s, π(s), s') [ R(s, π(s), s') + V_π(s') ]        (2.1)

And the policy π* is said to be optimal if for a given starting state s_0 ∈ S it holds that
V_{π*}(s_0) ≥ V_π(s_0) for any policy π ≠ π*.
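For readers who prefer code, here is a minimal sketch of evaluating Equation 2.1 by iterating the recursion to a fixed point. The dictionary encoding, the tolerance, and the assumption that the policy eventually reaches zero-reward absorbing states (so that the undiscounted iteration converges) are all assumptions of this sketch, not part of the model above.

```python
# Minimal policy evaluation for Equation 2.1 (assumed dictionary encoding):
#   P[s][a] = list of (s_next, prob), R[(s, a, s_next)] = reward, policy[s] = a.
# Terminal states are simply omitted from `policy` and keep value 0.

def evaluate_policy(states, P, R, policy, tol=1e-9, max_iters=10_000):
    V = {s: 0.0 for s in states}
    for _ in range(max_iters):
        delta = 0.0
        for s in states:
            a = policy.get(s)
            if a is None:            # terminal state: no action, value stays 0
                continue
            v = sum(prob * (R[(s, a, s2)] + V[s2]) for s2, prob in P[s][a])
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < tol:
            break
    return V
```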
2.2.2 Continuous Resource MDP
In order to model decision processes that require reasoning with continuous resources, the stan-
dard MDP model has to be extended. In particular, the extended model must capture that: (i)
resources are continuous, i.e., resource levels take real values, (ii) resources are always depleting,
(iii) resource consumption is stochastic and (iv) resources have both lower and upper limits. This
extended model is referred to as a continuous resource MDP. First, the most general version of
the continuous resource MDP model is defined. Next, a narrowed down version is outlined, to
model the planning problems of interest from Sections 2.1.1 and 2.1.2.
Formally, a continuous resource MDP is a tuple ⟨S × X, A, P, D, R⟩, where:

S is a set of discrete states and X = ×_{1≤i≤k} X_i is a k-dimensional continuous variable whose
values identify current resource levels. For example, the MDP can occupy discrete state
s ∈ S with current resource levels (x_i)_{1≤i≤k} ∈ X, where x_i is the current level of resource i.
Furthermore, each continuous resource has lower and upper limits, i.e., X_i = [a_i, b_i] ⊂ ℝ.
For example, in the planetary exploration domain from Section 2.1.1, S = {start, site1, site2,
site3, base} and X = [0, Δ] with time-to-deadline being a continuous resource.

A is a set of actions and A(s) ⊆ A is the set of actions that can be executed from the
discrete state s ∈ S. For example, in the planetary exploration domain from Section 2.1.1,
A = {move-to-next-site, return-to-base}.

P : S × A × S → [0, 1] is the discrete state transition function. For example, P(s, a, s') is the
probability that the agent will transition to discrete state s' ∈ S if it executes action a ∈ A(s).
For all discrete states s ∈ S and actions a ∈ A(s) it holds that Σ_{s'∈S} P(s, a, s') = 1. In the
planetary exploration domain from Section 2.1.1, all discrete transitions are deterministic,
e.g. P(site1, move-to-next-site, site2) = 1.

D : S × A → (X_i → ℝ)_{1≤i≤k} is the table of resource consumption distributions. For
example, for each action a ∈ A(s) executed from discrete state s ∈ S there exists a set
D(s, a) = (p^i_{s,a})_{1≤i≤k} of resource consumption probability distribution functions p^i_{s,a} —
one distribution function per resource i. In particular, p^i_{s,a}(y) is the probability that
action a ∈ A(s) executed from discrete state s will consume y amount of resource i. In the
planetary exploration domain from Section 2.1.1, for each state and action the resource
consumption distribution is the same, e.g. D(site1, move-to-next-site) = f where f(t) = e^{−t}.

R : S × A × S × X → ℝ is the reward function. For example, R(s, a, s', x) is the reward that
the agent receives when it executes action a ∈ A(s) from discrete state s ∈ S and transitions
to a discrete state s' ∈ S with x ∈ X amounts of resources left. In particular, for the
planetary exploration domain from Section 2.1.1, the rover receives reward
R(site1, return-to-base, base, t) = 6 provided that t > 0.
For modeling the planning problems from Sections 2.1.1 and 2.1.2, the most general version
of the continuous resource MDP model (defined above) is narrowed down so that it models one
continuous resource, namely, time-to-deadline. For notational convenience, each discrete state
s ∈ S will be referred to as just a state, whereas the current resource level will be referred to as the
current time-to-deadline.

For such MDPs, X = [0, Δ], where Δ is the maximum time-to-deadline at the beginning of
execution. The agent transitions and the rewards it earns are then the following: Assume that
an agent is in discrete state s ∈ S with a deadline being t > 0 time units away (i.e., with time-
to-deadline t), after which execution stops. It executes an action a ∈ A(s) of its choice. The
execution of the action cannot be interrupted, and the action duration t' is distributed according
to a given probability distribution p_{s,a}(t') that depends on both the state s and the action a. If
t' ≥ t, then the time-to-deadline t − t' ≤ 0 after the action execution is non-positive, which
means that the deadline is reached (the resource is depleted) and execution stops. Otherwise, with
probability P(s, a, s'), the agent obtains reward R(s, a, s') ≥ 0 and transitions to state s' ∈ S with
time-to-deadline t − t' and repeats the process.
The objective of the agent is to maximize its expected total reward until execution stops.
Precisely, let V*(s)(t) denote the largest expected total reward that the agent can obtain until
execution stops if it starts in state s ∈ S with time-to-deadline 0 ≤ t ≤ Δ. The agent can
maximize its expected total reward by executing the action

    π*(s)(t) ∈ arg max_{a∈A(s)} { Σ_{s'∈S} P(s, a, s') ∫_0^t p_{s,a}(t') ( R(s, a, s') + V*(s')(t − t') ) dt' }

in state s ∈ S with time-to-deadline 0 ≤ t ≤ Δ, which can be explained as follows: When it
executes action a ∈ A(s) in state s ∈ S, it incurs action duration t' with probability p_{s,a}(t'). If
0 ≤ t' ≤ t, then it transitions to state s' ∈ S with probability P(s, a, s') and obtains an expected
total future reward of R(s, a, s') + V*(s')(t − t'). The function π* is the optimal agent policy that
needs to be found.
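As an illustration of what one backup of this equation involves, the sketch below approximates its right-hand side for a single state and time-to-deadline by numerical integration on a uniform time grid. The grid representation, the Riemann sum and all names are assumptions of this sketch — it is not the CPH or DPFP machinery developed in Chapter 3, which avoids exactly this kind of discretization.

```python
# One Bellman backup of the time-dependent value function on a uniform time grid.
# trans[s][a] = list of (s_next, prob, reward);  p_dur[(s, a)] = duration density function.
# V[s_next][j] approximates V*(s_next)(j * dt) on the grid t = 0, dt, 2*dt, ...

def backup(s, i, dt, trans, p_dur, V):
    """Approximate the bracketed maximization above at time-to-deadline t = i * dt."""
    best = 0.0
    for a, outcomes in trans[s].items():
        total = 0.0
        for s_next, prob, reward in outcomes:
            integral = 0.0
            for j in range(i):                    # left Riemann sum over t' in [0, t)
                t_prime = j * dt
                integral += p_dur[(s, a)](t_prime) * (reward + V[s_next][i - j]) * dt
            total += prob * integral
        best = max(best, total)
    return best
```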
As will be discussed in Chapter 7, there are serious shortcomings in previous work in deter-
mining π*. In essence, current algorithms either (i) make additional restrictive assumptions about
the model or (ii) do not add new assumptions, but run very slow (see Section 1.2). This thesis
addresses both of these shortcomings in two contributions:
It introduces CPH, a fast analytic algorithm for solving continuous resource MDPs. CPH
solves the planning problems at hand by first approximating with a desired accuracy the
probability distributions over the resource consumptions with phase-type distributions,
which use exponential distributions as building blocks. It then uses value iteration to solve
the resulting MDPs up to three orders of magnitude faster than its closest competitor, and
allows for a systematic trade-off of solution quality for speed.
Second, to improve the anytime performance of CPH and other continuous resource MDP
solvers this thesis introduces the DPFP algorithm. Rather than using the principle of value
iteration to solve the planning problems at hand, DPFP performs a forward search in the
corresponding dual space of cumulative distribution functions. In doing so, DPFP discrim-
inates in its policy generation effort, providing only approximate policies for regions of the
state-space reachable with low probability, yet it bounds the error that such approximation
entails.
2.2.3 Continuous Resource, Decentralized MDP
This section introduces a formal framework for planning problems outlined in Section 2.1.4. To
this end, the concepts of execution events and histories of execution events must be defined:
Definition 1. An execution event stores all the relevant information about the execution of a
method. Precisely, an execution event e is a tuple ⟨i, l_1, l_2, q⟩ where:

i is the index of method m_i ∈ M_n executed by some agent n;

l_1 is agent n's resource level when it started executing method m_i;

l_2 is agent n's resource level when it finished executing method m_i;

q ∈ {0, 1} is the result of the execution of method m_i. If q = 1, method m_i has been executed
successfully, whereas if q = 0, method m_i has been executed unsuccessfully.

Furthermore, a spoof execution event ⟨0, l_1, l_2, 1⟩ is used to specify an event when the agent
remained idle (executed a spoof method m_0) and its resource level dropped from l_1 to l_2.

Definition 2. An execution history stores the complete history of execution events of an agent.
Precisely, the execution history h of an agent n is a vector (e_1, e_2, ..., e_k) where e_1, e_2, ..., e_k are the
execution events of agent n.
Observe that agent n knows exactly its own execution history. However, because it is assumed
that agents cannot directly communicate their execution status, agent n might not know exactly
the execution histories of other agents. Indeed, agent n only knows the probability distribution
over the execution histories of other agents. This distribution is affected by both: (i) the joint policy
of all the agents that has been found during the planning phase and (ii) the observations that agent
n received during the execution phase, where an observation is an outcome of the execution of a
method involved in a resource precedence constraint (explained in Section 5.2.2).
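A small data-structure sketch of Definitions 1 and 2 follows; the class and field names are illustrative, and the thesis does not prescribe an implementation.

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class ExecutionEvent:
    """One execution event <i, l1, l2, q> from Definition 1."""
    method: int          # index i of the executed method (0 denotes the spoof "idle" method)
    start_level: float   # l1: resource level when execution started
    end_level: float     # l2: resource level when execution finished
    success: bool        # q: True iff the method was executed successfully

# Definition 2: an execution history is simply the ordered list of an agent's events.
ExecutionHistory = List[ExecutionEvent]

def idle_event(l1: float, l2: float) -> ExecutionEvent:
    """Spoof event <0, l1, l2, 1>: the agent stayed idle while its level dropped from l1 to l2."""
    return ExecutionEvent(method=0, start_level=l1, end_level=l2, success=True)
```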
The formal definition of the Continuous Resource, Decentralized MDP (CR-DEC-MDP)
model is then the following:
Definition 3. For a team of N agents, a CR-DEC-MDP is a collection of N continuous resource
Markov decision processesMDP
n
=hS
n
;A
n
;P
n
;R
n
i where:
S
n
is the set of states of agent n where each state s2 S
n
is the agent execution history,
e.g. s = (e
1
;:::; e
k
). EachMDP
n
has a distinguished starting state s
n;0
2S
n
encoded as
s
n;0
= (hn; l
n;0
; l
n;0
; 1i)2 S
n
where m
n
is a unique spoof method of agent n which by
default is completed in the starting state with l
n;0
resource left — the only purpose of the
starting state is to encode the initial resource level of the agent. The current state s2 S
n
of agent n allows the agent to estimate the probability distribution over current states of
MDPs of other agents (explained in Section 5.2.2).
A
n
is the set of actions of agent n. Furthermore, A(s)A
n
is used to denote a set of actions
that agent n can execute in state s2 S
n
. Because methods are allowed to be executed only
once, A(s) = M
n
nfm
i
:hi; l
1
; l
2
; qi2 sg. Alternatively, an agent can choose to remain idle
in which case its resource level drops by the controllable amount.
25
$P_n$ is the state transition function of agent $n$, which can depend on the current state of $MDP_{n'}$ of any agent $n'$. When the agent is in state $s = (\langle i_1, l_{1,1}, l_{1,2}, q_1 \rangle, \ldots, \langle i_k, l_{k,1}, l_{k,2}, q_k \rangle)$ it can either choose to remain idle (deliberately lose a certain amount of resource) or execute a method $m_j \in A(s)$. If the agent chooses to remain idle, it loses a controllable amount $\delta$ of resource and transitions to a state $s' = (\langle i_1, l_{1,1}, l_{1,2}, q_1 \rangle, \ldots, \langle i_k, l_{k,1}, l_{k,2}, q_k \rangle, \langle 0, l_{k,2}, l_{k,2} - \delta, 1 \rangle)$. Else, if the agent decides to execute a method $m_j \in A(s)$, it transitions with probability $p_j(x)$ to state $s' = (\langle i_1, l_{1,1}, l_{1,2}, q_1 \rangle, \ldots, \langle i_k, l_{k,1}, l_{k,2}, q_k \rangle, \langle j, l_{k,2}, l_{k,2} - x, q_j \rangle)$ where $x$ is the amount of resource consumed during the execution of method $m_j$. The result $q_j$ of the execution of method $m_j$ is 1 (method $m_j$ has been executed successfully) if and only if both the resource limit constraint and the resource precedence constraint are satisfied. Precisely:

– The resource limit constraint is met when the agent's resource level remains within some admissible range during the execution of method $m_j$, i.e., if there exists a resource limit constraint $\langle j, l_1, l_2 \rangle \in C_{[]}$ such that $l_1 \geq l_{k,2} \geq l_{k,2} - x \geq l_2$.

– The resource precedence constraint is met if method $m_j$ is enabled when it is started, i.e., if the execution of all methods $m_i \in M$ such that $\langle i, j \rangle \in C_{\prec}$ has finished successfully with at least $l_{k,2}$ resources left. In other words, for each method $m_i \in M_{n'}$ such that $\langle i, j \rangle \in C_{\prec}$, the current state of $MDP_{n'}$ must contain an event $e = \langle i, l_1, l_2, 1 \rangle$ where $l_2 \geq l_{k,2}$.

Finally, since it is assumed that the execution of method $m_j$ is automatically interrupted when the agent's resource level drops below the value $l_2$ of some admissible resource range constraint $\langle j, l_1, l_2 \rangle \in C_{[]}$, the transition function must be modified accordingly. To this end, the probability that the execution of method $m_j$ consumes exactly $l_{k,2} - l_2$ of resource must be $1 - \int_0^{l_{k,2} - l_2} p_j(x)\,dx$.
$R_n$ is the reward function of agent $n$. If the agent executes a method $m_j \in A(s)$ from state $s = (e_1, \ldots, e_k)$ and transitions to state $s' = (e_1, \ldots, e_k, \langle j, l_{k,2}, l_{k,2} - x, q_j \rangle)$, it earns reward $q_j \cdot r_j$.
Given local policies $\pi_n$ of agents $n = 1, \ldots, N$, the joint policy is a vector $\pi = (\pi_1, \ldots, \pi_N)$. The value of the joint policy is then defined as follows:

Definition 4. The value of the joint policy $\pi = (\pi_1, \ldots, \pi_N)$ is given by $V(\pi) = \sum_{n=1}^{N} V_{\pi_n}(s_{n,0})$ where $V_{\pi_n}(s_{n,0})$ is the total expected reward of policy $\pi_n$ (of agent $n$) executed from its starting state $s_{n,0}$. The optimal policy $\pi^* = (\pi_1^*, \ldots, \pi_N^*)$ is the one that maximizes $V(\pi)$.
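Definition 4 also suggests a direct way to estimate $V(\pi)$ when a simulator of the CR-DEC-MDP is available: run joint episodes from the starting states and sum the per-agent returns. The sketch below assumes a hypothetical team-simulator interface (not part of the model) purely for illustration:

```python
def estimate_joint_value(team_simulator, joint_policy, episodes=1000):
    """Monte-Carlo estimate of V(pi) = sum_n V_{pi_n}(s_{n,0}) (Definition 4).

    team_simulator(joint_policy) is assumed to run one joint episode from the
    starting states (s_{1,0}, ..., s_{N,0}) and return a list with the total
    reward earned by each agent -- a hypothetical interface for this sketch.
    """
    sums = None
    for _ in range(episodes):
        rewards = team_simulator(joint_policy)
        if sums is None:
            sums = [0.0] * len(rewards)
        for i, r in enumerate(rewards):
            sums[i] += r
    per_agent_values = [s / episodes for s in sums]  # estimates of V_{pi_n}(s_{n,0})
    return sum(per_agent_values)
```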
As will be discussed in Chapter 7, there are serious shortcomings in previous work on determining $\pi^*$. First, current algorithms are only applicable to CR-DEC-MDPs where the resource levels are discretized and second, current algorithms may still exhibit poor performance with the scale-up in their state-spaces. To remedy these shortcomings, this thesis proposes two different algorithms for solving CR-DEC-MDPs (see Chapter 5):

– The first algorithm (VFP) considers a special case where CR-DEC-MDPs are fully ordered (explained in Section 5.1). For such CR-DEC-MDPs, VFP finds locally optimal policies one order of magnitude faster than its competitors. In addition, VFP implements a set of heuristics aimed at improving the quality of these locally optimal joint policies.

– The second proposed algorithm (M-DPFP) for solving CR-DEC-MDPs operates on arbitrary CR-DEC-MDPs. It finds joint policies that are guaranteed to be within an arbitrarily small $\epsilon$ of the optimal joint policies.

Both VFP and M-DPFP allow for continuous resource levels and continuous resource consumption probability density functions.
Chapter 3: Single Agent Solutions
Two algorithms for solving planning problems modeled as continuous resource MDPs are pro-
posed in this chapter: (i) The value iteration algorithm CPH and (ii) the forward search approach
DPFP. The advantages and disadvantages of CPH and DPFP are demonstrated in Chapter 4.
3.1 Value Iteration Approach: The CPH Algorithm
Recall (see Equation 2.2) that the optimal policy $\pi^*$ is calculated from the values $V^*(s)(t)$, $s \in S$, $t \in [0, \Delta]$, where $V^*(s)(t)$ is the largest expected total reward that the agent can obtain until execution stops if it starts in state $s \in S$ with time-to-deadline $0 \leq t \leq \Delta$. To calculate the values $V^*(s)(t)$, $s \in S$, $t \in [0, \Delta]$, necessary to determine $\pi^*$, value iteration first calculates the values $V^n(s)(t)$ using the following Bellman updates for all states $s \in S$, times-to-deadline $0 \leq t \leq \Delta$, and iterations $n \geq 0$:
\[
V^0(s)(t) := 0
\]
\[
V^{n+1}(s)(t) :=
\begin{cases}
0 & \text{if } t \leq 0 \\
\max_{a \in A(s)} \Big\{ \sum_{s' \in S} P(s,a,s') \int_0^t p_{s,a}(t')\,\big(R(s,a,s') + V^n(s')(t-t')\big)\, dt' \Big\} & \text{otherwise.}
\end{cases}
\]

It then holds that $\lim_{n \to \infty} V^n(s)(t) = V^*(s)(t)$ for all states $s \in S$ and times-to-deadline $0 \leq t \leq \Delta$.
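For intuition, a naive way to approximate this continuous-time Bellman update is to discretize the time-to-deadline into small steps and replace the integral with a sum. The sketch below does exactly that; it is illustrative only, since this is precisely the discretization that CPH is designed to avoid, and the data-structure names are assumptions:

```python
import numpy as np

def bellman_backup(V, P, R, p_dur, dt):
    """One discretized Bellman update over a time-to-deadline grid of width dt.

    V[s]        : np.array over grid points, current value function of state s
    P[s][a]     : dict mapping s' -> transition probability P(s, a, s')
    R[s][a][s'] : immediate reward R(s, a, s')
    p_dur[s][a] : np.array over grid points, duration density p_{s,a} on the grid
    """
    T = len(next(iter(V.values())))
    V_new = {}
    for s in P:
        q_values = []
        for a in P[s]:
            q = np.zeros(T)
            for s2, prob in P[s][a].items():
                # integral_0^t p_{s,a}(t') (R(s,a,s') + V(s')(t - t')) dt', discretized
                for ti in range(T):
                    acc = 0.0
                    for tj in range(ti + 1):
                        acc += p_dur[s][a][tj] * (R[s][a][s2] + V[s2][ti - tj]) * dt
                    q[ti] += prob * acc
            q_values.append(q)
        V_new[s] = np.max(q_values, axis=0) if q_values else np.zeros(T)
    return V_new
```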
Unfortunately, value iteration cannot be implemented as stated since the number of values $V^n(s)(t)$ is infinite for each iteration $n$, because the time-to-deadline $t$ is a real-valued variable. This thesis remedies this situation by viewing the $V^n(s)$ as value functions that map times-to-deadline $t$ to the corresponding values $V^n(s)(t)$. As a result, CPH can maintain each $V^n(s)$ with a finite number of parameterized, continuous functions. In this context, I make the following contributions:

– First, I show that once the starting MDP is approximated with an MDP with only exponential action duration distributions, it is possible to find an $n$ such that running value iteration for $n$ iterations produces a solution with an approximation error $\max_{s \in S,\, 0 \leq t \leq \Delta} |V^*(s)(t) - V^n(s)(t)|$ no larger than a given constant $\epsilon > 0$.

– Second, I show that each value function $V^n(s)$ can be represented exactly with a vector of a small number of real values.

– Finally, I show how the Bellman updates can efficiently transform the vectors of the value functions $V^n(s)$ into the vectors of the value functions $V^{n+1}(s)$.

These three contributions are presented in the following three steps of the CPH algorithm (Continuous resource MDP through PHase-type approximation).
3.1.1 CPH Step 1: Phase Type Approximation
CPH first approximates the probability distributions $p_{s,a}$ of action durations that are not exponential with phase-type distributions (see Appendix A), resulting in chains of exponential distributions $E(\lambda) = \lambda e^{-\lambda t}$ with potentially different exit rate parameters $\lambda > 0$ [Neuts, 1981]. CPH then uses uniformization [Puterman, 1994] to make all constants $\lambda$ identical without changing the underlying stochastic process. The detailed description of these two steps is provided in Appendix C and Appendix D — this section only shows an example of the use of phase-type approximation and uniformization in the context of CPH. Furthermore, this section reports on the implications of the phase-type approximation for CPH's planning horizon.

The actual implementation of CPH used for the experimental evaluation of CPH (see Chapter 4) employed the Coxian family of distributions (see Appendix B) for the phase-type approximation. The uniformization process (see Appendix D) was then used to uniformize the exit rate parameters of the exponential distributions of the underlying Coxian distributions. An example of this process is shown in Figure 3.1. Here, an action duration $p_{s,a}$ follows a Normal distribution with mean $\mu = 2$ and standard deviation $\sigma = 1$. This normal distribution is first approximated with a 3-phase Coxian distribution and then uniformized.

Let $a$ = return-to-base, $s$ = start, $s'$ = base and $P(s, a, s') = 1$ as shown in Figure 3.1. First, the Expectation-Maximization algorithm [Dempster et al., 1977] is used to approximate the normal distribution with a 3-phase Coxian distribution. The phase-type approximation introduces 2 auxiliary states $Ph_2$ and $Ph_3$ (also referred to as phases). The two other phases of the phase-type distribution, $Ph_1$ and $Ph_4$, are the original states of the continuous resource MDP: $Ph_1$ corresponds to state start and $Ph_4$ (the absorbing phase) corresponds to state base. Observe that all phase transition duration times follow exponential distributions — in particular, for the Coxian distribution shown, the exit rates of these exponential distributions and the discrete phase-to-phase transition probabilities are shown in Figure 3.1.
[Figure 3.1: Phase type approximation example. Top panel (initial probability distribution): start transitions to base with duration $\sim N(2,1)$ and $p_{start,base} = 1$. Middle panel (after transformation to a phase-type distribution): $Ph_1 \to Ph_2$ with $\sim E(1.45)$, $p_{1,2} = 1$; $Ph_2 \to Ph_3$ with $\sim E(1.42)$, $p_{2,3} = 0.97$; $Ph_2 \to Ph_4$ with $\sim E(1.42)$, $p_{2,4} = 0.03$; $Ph_3 \to Ph_4$ with $\sim E(1.43)$, $p_{3,4} = 1$. Bottom panel (after uniformization): all transitions $\sim E(1.45)$, with $p_{1,2} = 1$, $p_{2,2} = 0.02$, $p_{2,3} = 0.95$, $p_{2,4} = 0.03$, $p_{3,3} = 0.01$, $p_{3,4} = 0.99$.]
A phase-type distribution $f$ specifies the probabilities of transitioning to the absorbing phase (to state base in the example above) over time. Formally, a phase-type distribution $f$ is given by:¹

\[
f(\vec{\alpha}, Q)(t) = \vec{\alpha}\, e^{Qt}\, (-Q\,\vec{1}\,) \qquad (3.1)
\]

¹ For a detailed description of phase-type distributions, refer to Appendix A.

where $Q$ is the infinitesimal generator matrix, $\vec{\alpha}$ is a vector of the starting distribution over the Markov states and $\vec{1}$ is the unit column vector. In particular, for the example shown, $\vec{\alpha} = (1, 0, 0)$ (the Markov process starts in phase $Ph_1$) and $Q$ is given by²:

\[
Q = \begin{pmatrix}
-1.45 & 1 \cdot 1.45 & 0 \\
0 & -1.42 & 0.97 \cdot 1.42 \\
0 & 0 & -1.43
\end{pmatrix} \qquad (3.2)
\]
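A minimal numerical check of Equation (3.1) for this example, assuming SciPy's matrix exponential is available (the variable names are illustrative; $\vec{\alpha}$ and $Q$ are the values above):

```python
import numpy as np
from scipy.linalg import expm

alpha = np.array([1.0, 0.0, 0.0])             # starting distribution over phases
Q = np.array([[-1.45, 1.00 * 1.45, 0.0],
              [ 0.00, -1.42, 0.97 * 1.42],
              [ 0.00,  0.00, -1.43]])
exit_rates = -Q @ np.ones(3)                  # rate of absorption from each phase

def phase_type_pdf(t: float) -> float:
    """Density f(alpha, Q)(t) of Equation (3.1): alpha * exp(Qt) * (-Q * 1)."""
    return float(alpha @ expm(Q * t) @ exit_rates)

# The density should roughly track the N(2, 1) duration it approximates.
for t in (1.0, 2.0, 3.0):
    print(t, phase_type_pdf(t))
```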
A common measure of the fitness of the phase-type approximation is the KL-divergence [Kullback and Leibler, 1951]. In particular, the KL-divergence between the original function (the normal distribution with mean $\mu = 2$ and variance $\sigma = 1$) and the obtained phase-type distribution is 1.370521. For details on how to measure the KL-divergence, refer to Section C.

To allow CPH to compute the underlying convolution operations analytically (see Equation 3.1), all phase-type distributions must consist of exponential distributions that have the same exit rate parameter $\lambda$. In order to achieve that, each phase-type distribution is uniformized (for details on the uniformization process, refer to Section D). Uniformization does not alter the generator matrix $Q$ and as such, it does not affect the absorption time of the phase-type distribution. However, uniformization changes the exit rates of the phase-to-phase transition durations as well as the phase-to-phase transition probabilities (intuitively, the exit rates are increased but so are the probabilities of self-transitions). For the example above, the new exit rate ($\lambda = 1.45$) and the new phase-to-phase transition probabilities are shown in Figure 3.1.

² For details on how to determine the values of $\vec{\alpha}$ and $Q$, refer to Appendix A and Appendix B.
From now on, CPH can therefore assume without loss of generality that the action durations $t'$ of all actions are distributed according to exponential distributions $p_{s,a}(t') = p(t') \sim E(\lambda)$ with the same constant $\lambda > 0$. Note, however, that the uniformization process can introduce self-transitions, and consequently, CPH's value iteration cannot determine the value functions $V^*(s)$ with a finite number of iterations. To remedy that, Theorem 1 from Section 3.1.4 proves that the approximation error is no larger than a desired $\epsilon > 0$ if the value iteration runs for at least $n$ iterations where:

\[
n \geq \log_{\frac{e^{\lambda\Delta}-1}{e^{\lambda\Delta}}} \left( \frac{\epsilon}{R_{max}\,(e^{\lambda\Delta}-1)} \right) \qquad (3.3)
\]

and $R_{max} := \max_{s \in S,\, a \in A(s),\, s' \in S} R(s, a, s')$.
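A small helper that evaluates the bound in Equation (3.3); it assumes the uniformized exit rate $\lambda$, the deadline $\Delta$, the maximum reward and the desired error are known, and is purely illustrative:

```python
import math

def cph_planning_horizon(lam: float, delta: float, r_max: float, eps: float) -> int:
    """Smallest number of iterations n satisfying Equation (3.3)."""
    base = (math.exp(lam * delta) - 1.0) / math.exp(lam * delta)   # in (0, 1)
    arg = eps / (r_max * (math.exp(lam * delta) - 1.0))
    return math.ceil(math.log(arg, base))

# Example: lambda = 1.45, deadline 4, R_max = 10, epsilon = 0.1
print(cph_planning_horizon(1.45, 4.0, 10.0, 0.1))
```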
3.1.2 CPH Step 2: Piecewise Gamma Value Functions
In this section it is explained how CPH breaks up each value function $V^n(s)(t)$ into multiple sub-functions $V^n_i(s)(t)$ where each sub-function is associated with a particular time interval. Informally, for different time intervals (separated by special times-to-deadline referred to as breakpoints), the value function $V^n(s)(t)$ will be characterized by a different parametric function. Furthermore, each parametric function will have an associated action that an agent should start executing if the current time-to-deadline is inside the time interval of that parametric function.
Formally, it follows directly from the considerations in the next section that there exist times-to-deadline $0 = t_{s,0} < t_{s,1} < \ldots < t_{s,m_s+1} = \Delta$ such that the value function $V^n(s)$ is made of $m_s + 1$ parametric functions $V^n_i(s)$ such that $V^n(s)(t) = V^n_i(s)(t)$ for all $t \in [t_{s,i}, t_{s,i+1})$, where

\[
V^n_i(s)(t) = c_{s,i,1} - e^{-\lambda t} \Big( c_{s,i,2} + \ldots + c_{s,i,n+1} \frac{(\lambda t)^{n-1}}{(n-1)!} \Big) \qquad (3.4)
\]

for the parameters $c_{s,i,j}$ for all $s \in S$, $0 \leq i \leq m_s$ and $1 \leq j \leq n+1$. These times-to-deadline $t_{s,i}$ are referred to as breakpoints, the $[t_{s,i}, t_{s,i+1})$ as intervals, the expressions for the parametric functions $V^n_i(s)$ as gamma functions (which is a simplification since each function $V^n_i(s)$ is actually a linear combination of Euler's incomplete gamma functions), and the expressions for the value functions $V^n(s)$ as piecewise gamma functions. Each gamma function $V^n_i(s)$ is represented as a vector $[c_{s,i,1}, \ldots, c_{s,i,n+1}] = [c_{s,i,j}]_{j=1}^{n+1}$ and each piecewise gamma function $V^n(s)$ as a vector of vectors $[t_{s,i} : V^n_i(s)]_{i=0}^{m_s} = [t_{s,i} : [c_{s,i,j}]_{j=1}^{n+1}]_{i=0}^{m_s}$.
Figure 3.2: CPH value function $V^*(\text{start})$ consists of four gamma functions: $V^*(\text{start}) = [0 : V^*_0(\text{start});\ 0.8 : V^*_1(\text{start});\ 1.8 : V^*_2(\text{start});\ 3.2 : V^*_3(\text{start})] = [0 : [6, 6];\ 0.8 : [10, 10, 6];\ 1.8 : [12, 8.73, 8, 6];\ 3.2 : [13, 27.1, -1.92, 7, 6]]$.
Figure 3.2 illustrates the different concepts introduced in this section. The figure shows the optimal value function $V^*(\text{start})$ for state start of the planetary exploration domain introduced in Section 2.1.1. Observe that $V^*(\text{start})$ is represented by four parametric gamma functions $V^*_0(\text{start})$, $V^*_1(\text{start})$, $V^*_2(\text{start})$, $V^*_3(\text{start})$ associated with four different time intervals $[0, 0.8]$, $[0.8, 1.8]$, $[1.8, 3.2]$, $[3.2, 4.0]$ respectively — times-to-deadline 0.8, 1.8, 3.2 are the breakpoints of the value function $V^*(\text{start})$. Although not shown explicitly in Figure 3.2, the actions associated with the gamma functions are:

– Return-to-base if the time-to-deadline is between 0 and 0.8 (for function $V^*_0(\text{start})$);
– Move-to-next-site if the time-to-deadline is between 0.8 and 1.8 (for function $V^*_1(\text{start})$);
– Move-to-next-site if the time-to-deadline is between 1.8 and 3.2 (for function $V^*_2(\text{start})$);
– Move-to-next-site if the time-to-deadline is between 3.2 and 4 (for function $V^*_3(\text{start})$).
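The vector representation above is easy to operate on in code. Below is a small sketch (illustrative function names; the exit rate $\lambda = 1$ is assumed for the example, which may differ from the domain's actual rate) that evaluates a gamma vector $[c_1, \ldots, c_{n+1}]$ and a piecewise gamma function $[t_i : \text{vector}_i]$ at a given time-to-deadline, following Equation (3.4):

```python
import math
from bisect import bisect_right

def eval_gamma_vector(c, t, lam):
    """Equation (3.4): c[0] - e^{-lam t} (c[1] + c[2](lam t) + ... + c[n](lam t)^{n-1}/(n-1)!)."""
    tail = sum(ck * (lam * t) ** k / math.factorial(k) for k, ck in enumerate(c[1:]))
    return c[0] - math.exp(-lam * t) * tail

def eval_piecewise_gamma(breakpoints, vectors, t, lam):
    """breakpoints[i] is t_{s,i}; vectors[i] applies on [t_{s,i}, t_{s,i+1})."""
    i = bisect_right(breakpoints, t) - 1
    return eval_gamma_vector(vectors[i], t, lam)

# V*(start) from Figure 3.2:
breakpoints = [0.0, 0.8, 1.8, 3.2]
vectors = [[6, 6], [10, 10, 6], [12, 8.73, 8, 6], [13, 27.1, -1.92, 7, 6]]
print(eval_piecewise_gamma(breakpoints, vectors, 2.5, lam=1.0))
```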
3.1.3 CPH Step 3: Functional Value Iteration
3.1.3.1 Simplified Example
It is first shown how value iteration can calculate the vectors of the value functions for a simplified example where an agent executes an unconditional sequence of deterministic actions $a_1, a_2, \ldots, a_{N+1}$ in its initial state $s_1$ with time-to-deadline $\Delta$ and then execution stops. Thus, $a_2$ is executed after $a_1$, $a_3$ is executed after $a_2$, ..., $a_{N+1}$ is executed after $a_N$. Furthermore, assume that the execution of action $a_n$ in state $s_n$ results (with probability one) in state $s_{n+1}$ for all $1 \leq n \leq N$. Then, the Bellman updates that are important (necessary to derive the value functions) are:

\[
V^0(s_{N+1})(t) = 0
\]
\[
V^n(s_{N+1-n})(t) = \int_0^t p(t') \big( R(s_{N+1-n}, a_{N+1-n}, s_{N+2-n}) + V^{n-1}(s_{N+2-n})(t - t') \big)\, dt'
\]

for all $0 \leq t \leq \Delta$ and $1 \leq n \leq N$. The integral in the second equation is an example of the convolution operation $\int_0^t p(t')\, f(t - t')\, dt'$, commonly denoted as $(p * f)(t)$. This convolution operation is now used recursively to transform the Bellman updates to the vector notation introduced in Section 3.1.2:
\begin{align*}
V^0(s_{N+1}) &= 0 \\
V^n(s_{N+1-n}) &= p * \big( R(s_{N+1-n}, a_{N+1-n}, s_{N+2-n}) + V^{n-1}(s_{N+2-n}) \big) \\
&= p * R(s_{N+1-n}, a_{N+1-n}, s_{N+2-n}) + p * V^{n-1}(s_{N+2-n}) \quad \text{(since convolution is distributive)} \\
&= p * R(s_{N+1-n}, a_{N+1-n}, s_{N+2-n}) \\
&\quad + p * p * R(s_{N+2-n}, a_{N+2-n}, s_{N+3-n}) \\
&\quad + \ldots \\
&\quad + \underbrace{p * \ldots * p}_{n-1} * R(s_{N-1}, a_{N-1}, s_N) \\
&\quad + \underbrace{p * \ldots * p}_{n} * R(s_N, a_N, s_{N+1}) \qquad \text{(the recursion has been ``unrolled'')}
\end{align*}

Now, for $p(t)$ sampled from an exponential distribution $E(\lambda) = \lambda e^{-\lambda t}$ and a constant $R$ it can easily be derived that $\underbrace{p * \ldots * p}_{n} * R = R - e^{-\lambda t}\big(R + \ldots + R \frac{(\lambda t)^{n-1}}{(n-1)!}\big)$. Furthermore, this expression can be stored as a vector $[\underbrace{R, \ldots, R}_{n+1}] = [R]_{i=1}^{n+1}$ in accordance with the vector notation introduced in Section 3.1.2. Hence, the derivation of $V^n(s_{N+1-n})$ continues:
\begin{align*}
&= [R(s_{N+1-n}, a_{N+1-n}, s_{N+2-n})]_{j=1}^{2} \\
&\quad + [R(s_{N+2-n}, a_{N+2-n}, s_{N+3-n})]_{j=1}^{3} \\
&\quad + \ldots \\
&\quad + [R(s_{N-1}, a_{N-1}, s_N)]_{j=1}^{n} \\
&\quad + [R(s_N, a_N, s_{N+1})]_{j=1}^{n+1}
\end{align*}

Finally, the above vectors (of different sizes) are summed into a vector $[c_1, c_2, \ldots, c_{n+1}]$ using the following method: $c_n$ is the sum of the $n$-th elements of the above vectors, for vectors that have at least $n$ elements. Thus:
\[
V^n(s_{N+1-n}) = \Big[ \sum_{j=N+1-n}^{N} R(s_j, a_j, s_{j+1}),\ \sum_{j=N+1-n}^{N} R(s_j, a_j, s_{j+1}),\ \sum_{j=N+2-n}^{N} R(s_j, a_j, s_{j+1}),\ \ldots,\ \sum_{j=N}^{N} R(s_j, a_j, s_{j+1}) \Big]
\]

for all $1 \leq n \leq N$. Having transformed the Bellman updates to vector notation, this notation for $n = N$ can be used to express the value function for state $s_1$:

\[
V^N(s_1) = \Big[ \sum_{j=1}^{N} R(s_j, a_j, s_{j+1}),\ \sum_{j=1}^{N} R(s_j, a_j, s_{j+1}),\ \sum_{j=2}^{N} R(s_j, a_j, s_{j+1}),\ \ldots,\ \sum_{j=N}^{N} R(s_j, a_j, s_{j+1}) \Big].
\]
Example: For the planetary exploration domain, recall the optimal value function $V^*(\text{start})$ from Figure 3.2. According to this function, the optimal action for the time interval $[0, 0.8]$ is to return-to-base and the expected utility of this action over time is given by $V^*_0(\text{start})(t) = 6 \cdot (1 - e^{-\lambda t})$.³ This result is now verified: when the rover executes the "return-to-base" action unconditionally at state "start" and then execution stops, it receives reward 6 for completing the action. Thus, from Equation 3.5, $V^1(\text{start})(t) = \lambda e^{-\lambda t} * 6 = 6 \cdot (1 - e^{-\lambda t}) = [6, 6]$.⁴ Similarly, when the rover executes the "return-to-base" action unconditionally at state $site_3$ and then execution stops, it holds that $V^1(site_3)(t) = \lambda e^{-\lambda t} * 6 = 6 \cdot (1 - e^{-\lambda t}) = [6, 6]$.

³ The lower index in $V^*_0$ represents the index of the gamma function.
⁴ The upper index in $V^1$ represents the Bellman update number, not the gamma function index. Indeed, the value functions for an unconditional sequence of actions considered in this section are always represented by a single gamma function.
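These symbolic updates can be sanity-checked numerically. The sketch below compares the vector result of convolving an exponential density with a gamma vector against a brute-force numerical convolution; $\lambda = 1$ is assumed for illustration and the helper names are made up for this sketch:

```python
import math
import numpy as np

LAM, DT = 1.0, 0.001
ts = np.arange(0.0, 4.0, DT)
p = LAM * np.exp(-LAM * ts)                      # exponential duration density E(LAM)

def gamma_eval(c, t):
    """Evaluate the gamma vector c = [c1, ..., c_{n+1}] at time t (Equation 3.4)."""
    tail = sum(ck * (LAM * t) ** k / math.factorial(k) for k, ck in enumerate(c[1:]))
    return c[0] - np.exp(-LAM * t) * tail

# Symbolic update: [6, 6] = p * 6, hence p * [6, 6] = p * p * 6 = [6, 6, 6].
symbolic = gamma_eval([6, 6, 6], ts)

# Numerical update: (p * V)(t) = integral_0^t p(t') V(t - t') dt', on the grid.
V = gamma_eval([6, 6], ts)
numerical = np.array([np.sum(p[:i + 1] * V[i::-1]) * DT for i in range(len(ts))])

print(np.max(np.abs(symbolic - numerical)))      # should be close to 0
```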
3.1.3.2 The General Case
In the previous section, only for demonstration purposes, it has been assumed that an agent executes an unconditional sequence of deterministic actions. Solving the planning problems, however, is more complicated, since it might not be optimal to, say, always execute action $a_2$ after action $a_1$ no matter how long it takes to execute action $a_1$ or which state results from its execution. This is the reason why, for different time intervals, the value function is represented by a different gamma function (instead of being represented by a single gamma function for all $0 \leq t \leq \Delta$).

It is now shown how to generalize the key idea to allow value iteration to solve the planning problems analytically. For $n = 0$, the value functions $V^n(s) = 0$ satisfy $V^n(s) = [0 : [0]]$ and thus are (piecewise) gamma. It is shown by induction that all value functions $V^{n+1}(s)$ are piecewise gamma if all value functions $V^n(s)$ are piecewise gamma. It is also shown how the Bellman updates can efficiently transform the vectors of the value functions $V^n(s)$ into the vectors of the value functions $V^{n+1}(s)$. Recall Equation 3.1 which performs the value iteration:
\[
V^{n+1}(s)(t) := \max_{a \in A(s)} \sum_{s' \in S} P(s, a, s') \int_0^t p(t') \big( R(s, a, s') + V^n(s')(t - t') \big)\, dt'.
\]

This calculation breaks down into four stages:

– First, $\bar{V}^n(s')(t - t') := R(s, a, s') + V^n(s')(t - t')$.
– Second, $\tilde{V}^n(s')(t) := \int_0^t p(t')\, \bar{V}^n(s')(t - t')\, dt'$.
– Third, $\hat{V}^n(s, a)(t) := \sum_{s' \in S} P(s, a, s')\, \tilde{V}^n(s')(t)$.
– Finally, $V^{n+1}(s)(t) := \max_{a \in A(s)} \hat{V}^n(s, a)(t)$.⁵

⁵ One should really use $\bar{V}^n(s, a, s')(t - t')$ and $\tilde{V}^n(s, a, s')(t)$ instead of $\bar{V}^n(s')(t - t')$ and $\tilde{V}^n(s')(t)$, respectively, but this would make the notation rather cumbersome.
Stage 1: Calculate $\bar{V}^n(s')(t - t') := R(s, a, s') + V^n(s')(t - t')$. The induction assumption is that all value functions $V^n(s')$ are piecewise gamma, i.e., $V^n(s') = [t_{s',i} : [c_{s',i,j}]_{j=1}^{n+1}]_{i=0}^{m_{s'}}$. In Stage 1, CPH calculates $\bar{V}^n(s')(t - t') := R(s, a, s') + V^n(s')(t - t')$, which is the same as calculating $\bar{V}^n(s')(t) := R(s, a, s') + V^n(s')(t)$ since $R(s, a, s')$ is constant. Then,

\begin{align*}
\bar{V}^n(s') &= R(s, a, s') + V^n(s') \\
&= R(s, a, s') + [t_{s',i} : [c_{s',i,j}]_{j=1}^{n+1}]_{i=0}^{m_{s'}} \\
&= [t_{s',i} : [c'_{s',i,j}]_{j=1}^{n+1}]_{i=0}^{m_{s'}},
\end{align*}

where $c'_{s',i,1} = R(s, a, s') + c_{s',i,1}$ and $c'_{s',i,j} = c_{s',i,j}$ for all $0 \leq i \leq m_{s'}$ and $2 \leq j \leq n+1$.

Example (continued): In the previous example for the planetary exploration domain it has been found that $V^1(site_3) = [6, 6]$. This example is now carried on in order to show how CPH derives $V^2(site_2)$. If in state $site_2$ the rover executes the "move to site 3" action, then the value function after Step 1 is $\bar{V}^1(site_3) = [1 + 6, 6] = [7, 6]$ because $V^1(site_3) = [6, 6]$.
Stage 2: Calculate $\tilde{V}^n(s')(t) := \int_0^t p(t')\, \bar{V}^n(s')(t - t')\, dt'$, which is a convolution of $p$ and $\bar{V}^n(s')$, denoted as $p * \bar{V}^n(s')$. Consider the value functions $\bar{V}^n(s')$ defined in Stage 1. It is now shown by induction that

\[
\tilde{V}^n_i(s')(t) = (p * \bar{V}^n_i(s'))(t) - e^{-\lambda t} \Big( \sum_{i'=1}^{i} e^{\lambda t_{s',i'}} (p * \bar{V}^n_{i'}(s'))(t_{s',i'}) - \sum_{i'=0}^{i-1} e^{\lambda t_{s',i'+1}} (p * \bar{V}^n_{i'}(s'))(t_{s',i'+1}) \Big) \qquad (3.5)
\]

for all $t \in [t_{s',i}, t_{s',i+1})$. Equation (3.5) holds for $i = 0$ since $\tilde{V}^n(s')(t) = (p * \bar{V}^n_0(s'))(t)$ for all $t \in [t_{s',0}, t_{s',1})$. Assume now that Equation (3.5) holds for some $i$. It then also holds for $i + 1$, as shown in Figure 3.3. It has consequently been shown that the transformation performed at Stage 2 results in a value function $\tilde{V}^n(s')(t)$ that is piecewise, and that each piece $\tilde{V}^n_i(s')(t)$ is a function represented by Equation (3.5). In order to complete Stage 2, it must also be shown that $\tilde{V}^n(s')(t)$ is in piecewise gamma form, i.e., it must be shown that each piece $\tilde{V}^n_i(s')$ is a gamma function:
Lemma 1. Let $p(t)$ follow the exponential probability density function with exit rate $\lambda$, i.e., $p(t) \sim E(\lambda)$. Then, $p * [k_1, k_2, \ldots, k_n] = [k_1, k_1, k_2, \ldots, k_n]$.

Proof. By symbolic integration, $p * [k]_{j=1}^{n} = [k]_{j=1}^{n+1}$ for any constant $k$. Then,
\[
p * [k_1, \ldots, k_n] = p * \Big( \sum_{i=1}^{n-1} [k_i - k_{i+1}]_{j=1}^{i} + [k_n]_{j=1}^{n} \Big) = \sum_{i=1}^{n-1} \big( p * [k_i - k_{i+1}]_{j=1}^{i} \big) + p * [k_n]_{j=1}^{n} = \sum_{i=1}^{n-1} [k_i - k_{i+1}]_{j=1}^{i+1} + [k_n]_{j=1}^{n+1} = [k_1, k_1, \ldots, k_n]. \qquad \square
\]
The recursion will be proven for $i + 1$. Since convolution is commutative it holds that
\[
\tilde{V}^n_{i+1}(s')(t) = \int_0^t p(t')\, \bar{V}^n(s')(t - t')\, dt' = \int_0^t p(t - t')\, \bar{V}^n(s')(t')\, dt'.
\]
The integration range $[0, t]$ is now split into $[0, t_{s',i+1}]$ and $(t_{s',i+1}, t]$, and since $\bar{V}^n(s')(t') = \bar{V}^n_{i+1}(s')(t')$ for all $t'$ with $t_{s',i+1} \leq t' \leq t < t_{s',i+2}$:
\[
= \int_{t_{s',i+1}}^{t} p(t - t')\, \bar{V}^n_{i+1}(s')(t')\, dt' + \int_0^{t_{s',i+1}} p(t - t')\, \bar{V}^n(s')(t')\, dt'.
\]
By adding and then subtracting $\int_0^{t_{s',i+1}} p(t - t')\, \bar{V}^n_{i+1}(s')(t')\, dt'$, and since $p(t - t') = \lambda e^{-\lambda(t - t')} = e^{-\lambda(t - t_{s',i+1})}\, p(t_{s',i+1} - t')$, the definition of $\tilde{V}^n_i(s')$ gives
\[
= (p * \bar{V}^n_{i+1}(s'))(t) - e^{-\lambda(t - t_{s',i+1})}\,(p * \bar{V}^n_{i+1}(s'))(t_{s',i+1}) + e^{-\lambda(t - t_{s',i+1})}\,\tilde{V}^n_i(s')(t_{s',i+1}).
\]
Finally, $\tilde{V}^n_i(s')(t_{s',i+1})$ is unrolled using the induction assumption (with argument $t = t_{s',i+1}$), and since all the terms except $(p * \bar{V}^n_{i+1}(s'))(t)$ are multiples of $e^{-\lambda t}$:
\[
= (p * \bar{V}^n_{i+1}(s'))(t) - e^{-\lambda t}\Big( \sum_{i'=1}^{i+1} e^{\lambda t_{s',i'}}\,(p * \bar{V}^n_{i'}(s'))(t_{s',i'}) - \sum_{i'=0}^{i} e^{\lambda t_{s',i'+1}}\,(p * \bar{V}^n_{i'}(s'))(t_{s',i'+1}) \Big),
\]
which is Equation (3.5) for $i + 1$.

Figure 3.3: Proof by induction of Equation (3.5).
Using the result from the lemma, Equation (3.5) can now be transformed to vector notation: $\tilde{V}^n_i(s') = [c''_{s',i,j}]_{j=1}^{n+2}$ where
\[
c''_{s',i,1} := c'_{s',i,1}, \qquad c''_{s',i,2} := c'_{s',i,1} + z_{s',i}, \qquad c''_{s',i,j+1} := c'_{s',i,j} \quad \forall\, j = 2, 3, \ldots, n+1,
\]
with
\[
z_{s',i} := \sum_{i'=1}^{i} e^{\lambda t_{s',i'}}\,(p * \bar{V}^n_{i'}(s'))(t_{s',i'}) - \sum_{i'=0}^{i-1} e^{\lambda t_{s',i'+1}}\,(p * \bar{V}^n_{i'}(s'))(t_{s',i'+1}).
\]
Observe that $z_{s',i}$ is added to $c''_{s',i,2}$ because $z_{s',i}$ is multiplied by $e^{-\lambda t}$ in Equation (3.5). Consequently, $\tilde{V}^n(s')$ is in piecewise gamma form: $\tilde{V}^n(s') = [t_{s',i} : [c''_{s',i,j}]_{j=1}^{n+2}]_{i=0}^{m_{s'}}$. That result completes all the necessary calculations performed at Stage 2.

Note that one could also calculate $\tilde{V}^n_{i+1}(s')$ recursively based on $\tilde{V}^n_i(s')$, following the intermediate expression in Figure 3.3 (the step obtained just before the induction assumption is unrolled). It then holds that:
\begin{align*}
\tilde{V}^n_{i+1}(s')(t) &= (p * \bar{V}^n_{i+1}(s'))(t) - e^{-\lambda(t - t_{s',i+1})}\,(p * \bar{V}^n_{i+1}(s'))(t_{s',i+1}) + e^{-\lambda(t - t_{s',i+1})}\,\tilde{V}^n_i(s')(t_{s',i+1}) \\
&= (p * \bar{V}^n_{i+1}(s'))(t) - e^{-\lambda t}\, e^{\lambda t_{s',i+1}}\,\big( (p * \bar{V}^n_{i+1}(s'))(t_{s',i+1}) - \tilde{V}^n_i(s')(t_{s',i+1}) \big)
\end{align*}
and in vector notation:
\[
\tilde{V}^n_{i+1}(s') = [c'_{s',i+1,1},\ c'_{s',i+1,1} + z'_{s',i+1},\ c'_{s',i+1,2},\ \ldots,\ c'_{s',i+1,n+1}]
\]
for
\[
z'_{s',i+1} := e^{\lambda t_{s',i+1}}\,\big( (p * \bar{V}^n_{i+1}(s'))(t_{s',i+1}) - \tilde{V}^n_i(s')(t_{s',i+1}) \big).
\]
Example (continued): In the previous example it has been found that if in state $site_2$ the rover executes the "move to site 3" action, then the value function after Step 1 is $\bar{V}^1(site_3) = [7, 6]$. Since $\bar{V}^1(site_3) \equiv \bar{V}^1_1(site_3)$ (because $\bar{V}^1(site_3)$ has no breakpoints), it can easily be computed that after Stage 2, $\tilde{V}^1(site_3) = p * \bar{V}^1(site_3) = p * [7, 6] = [7, 7, 6]$.
Stage 3: Calculate $\hat{V}^n(s, a)(t) := \sum_{s' \in S} P(s, a, s')\, \tilde{V}^n(s')(t)$. Consider the value functions $\tilde{V}^n(s')$ defined in Stage 2. Since they might have different breakpoints $\{t_{s',i}\}_{i=0}^{m_{s'}}$ for different states $s' \in S$ with $P(s, a, s') > 0$, CPH introduces the common breakpoints
\[
\{t_{s,a,i}\}_{i=0}^{m_{s,a}} = \bigcup_{s' \in S\, :\, P(s,a,s') > 0} \{t_{s',i}\}_{i=0}^{m_{s'}}
\]
without changing the value functions $\tilde{V}^n(s')$. Afterwards, $\tilde{V}^n(s') = [t_{s,a,i} : [c'''_{s',i,j}]_{j=1}^{n+2}]_{i=0}^{m_{s,a}}$ where $c'''_{s',i,j} = c''_{s',i',j}$ for the unique $i'$ with $[t_{s,a,i}, t_{s,a,i+1}) \subseteq [t_{s',i'}, t_{s',i'+1})$. Then,
\begin{align*}
\hat{V}^n(s, a) &= \sum_{s' \in S} P(s, a, s')\, \tilde{V}^n(s') \\
&= \sum_{s' \in S} P(s, a, s')\, [t_{s,a,i} : [c'''_{s',i,j}]_{j=1}^{n+2}]_{i=0}^{m_{s,a}} \\
&= [t_{s,a,i} : [\textstyle\sum_{s' \in S} P(s, a, s')\, c'''_{s',i,j}]_{j=1}^{n+2}]_{i=0}^{m_{s,a}} \\
&= [t_{s,a,i} : [c''''_{s,a,i,j}]_{j=1}^{n+2}]_{i=0}^{m_{s,a}}.
\end{align*}
44
Example (continued): In the previous example it has been found that if in state site
2
the rover
executes the “move to site 3” action, then the value function after Step 2 is
e
V
1
(site
3
) = [7; 7; 6].
Since the “move to site 3” action is deterministic, i.e., P(site
2
; move-to-site-3; site
3
) = 1, after
stage 3 it holds that
b
V
1
(site
2
; move-to-site-3) = 1 [7; 7; 6] = [7; 6; 6].
Stage 4: Calculate $V^{n+1}(s)(t) := \max_{a \in A(s)} \hat{V}^n(s, a)(t)$. After Stage 3, for each action $a \in A(s)$ one has a value function $\hat{V}^n(s, a)$ that has a set $\{t_{s,a,i}\}_{i=0}^{m_{s,a}}$ of breakpoints associated with it. Since for different actions $a \in A(s)$ the sets $\{t_{s,a,i}\}_{i=0}^{m_{s,a}}$ can be different, the first thing CPH does in Stage 4 is to introduce the common breakpoints $\bigcup_{a \in A(s)} \{t_{s,a,i}\}_{i=0}^{m_{s,a}}$ without changing the value functions $\hat{V}^n(s, a)$. CPH then introduces additional dominance breakpoints at the intersections of the value functions $\hat{V}^n(s, a)$, $\forall a \in A(s)$, to ensure that, over each interval, one of the value functions dominates the other ones. Let $\{t'_{s,i}\}_{i=0}^{m'_s}$ be the set of breakpoints afterwards. Then, $V^{n+1}(s) = [t'_{s,i} : [c'''''_{s,i,j}]_{j=1}^{n+2}]_{i=0}^{m'_s}$ where $c'''''_{s,i,j} = c''''_{s,a_{s,i},i',j}$ for actions $a_{s,i} \in A(s)$ and the unique $i'$ with $[t'_{s,i}, t'_{s,i+1}) \subseteq [t_{s,a_{s,i},i'}, t_{s,a_{s,i},i'+1})$ and, for all $t \in [t'_{s,i}, t'_{s,i+1})$, $\hat{V}^n(s, a)(t) \leq \hat{V}^n(s, a_{s,i})(t)$. Furthermore, action $a_{s,i}$ should be executed according to the value function $V^{n+1}(s)$ if the current state is $s$ and the time-to-deadline is $t \in [t'_{s,i}, t'_{s,i+1})$.
Example (continued): It has already been shown that when the rover executes the "move to site 3" action, after Stage 3 it holds that $\hat{V}^1(site_2, \text{move-to-site-3}) = [7, 7, 6]$. One can calculate in a similar fashion that when the rover executes the "move to base" action from $site_2$, after Stage 3: $\hat{V}^1(site_2, \text{move-to-base}) = [6, 6, 0]$. Stage 4 calculates the maximum of $\hat{V}^1(site_2, \text{move-to-site-3})$ and $\hat{V}^1(site_2, \text{move-to-base})$. The two functions intersect at 3.06 and the maximum is $V^2(site_2) = [0 : [6, 6, 0];\ 3.06 : [7, 7, 6]]$. Now, to show how to calculate $V^3(site_1)$: If the rover executes the "move to base" action, then (as before) after Stage 1 one obtains $[6 + 0, 0, 0] = [6, 0, 0]$ and after Stage 2 one obtains $[6, 6, 0, 0]$. If the rover executes the "move to site 2" action, then one obtains after Stage 1: $[0 : [2 + 6, 6, 0];\ 3.06 : [2 + 7, 7, 6]] = [0 : [8, 6, 0];\ 3.06 : [9, 7, 6]]$ and after Stage 2: $[0 : [8, 8, 6, 0];\ 3.06 : [9, -2.09, 7, 6]]$. Stage 4 calculates the maximum of $[6, 6, 0, 0]$ and $[0 : [8, 8, 6, 0];\ 3.06 : [9, -2.09, 7, 6]]$. The two functions intersect at 1.87 and the maximum is $V^3(site_1) = [0 : [6, 6, 0, 0];\ 1.87 : [8, 8, 6, 0];\ 3.06 : [9, -2.09, 7, 6]]$.
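Stage 4's dominance breakpoints are intersections of two gamma-vector value functions, which CPH locates numerically with a bisection method. A small sketch of that step follows; it assumes the functions cross exactly once in the bracketing interval and takes $\lambda = 1$ for illustration (so the printed crossing point need not equal the 3.06 reported above, which corresponds to the domain's actual exit rate):

```python
import math

def gamma_value(c, t, lam):
    """Evaluate a gamma vector [c1, ..., c_{n+1}] at time-to-deadline t (Equation 3.4)."""
    tail = sum(ck * (lam * t) ** k / math.factorial(k) for k, ck in enumerate(c[1:]))
    return c[0] - math.exp(-lam * t) * tail

def dominance_breakpoint(c_a, c_b, lo, hi, lam, tol=1e-6):
    """Bisection for a time in [lo, hi] at which the two value functions intersect."""
    diff = lambda t: gamma_value(c_a, t, lam) - gamma_value(c_b, t, lam)
    assert diff(lo) * diff(hi) <= 0, "no sign change: the functions do not cross here"
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if diff(lo) * diff(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

# Stage 4 example: intersection of [6, 6, 0] and [7, 7, 6] on (0.5, 4.0].
print(dominance_breakpoint([6, 6, 0], [7, 7, 6], 0.5, 4.0, lam=1.0))
```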
To summarize, the value functions $V^{n+1}(s)$ are piecewise gamma, and the vectors of the value functions $V^n(s)$ can be transformed automatically into the vectors of the value functions $V^{n+1}(s)$. The lengths of the vectors increase linearly in the number of iterations and, although the number of breakpoints (that are placed automatically during the transformations) can increase exponentially, in practice it stays small since one can merge small intervals after each iteration of value iteration to reduce the number of breakpoints. The transformations require only simple vector manipulations and a numerical method that determines the dominance breakpoints approximately, for which CPH uses a bisection method. In Chapter 4 it is shown experimentally that these transformations of CPH are both efficient and accurate.

3.1.4 Error Control

The following theorem assumes that all action durations follow phase-type distributions. It does not take into account the error introduced by approximating non-phase-type distributions with phase-type distributions:
Theorem 1. Let $\epsilon > 0$ be any positive constant, $R_{max} := \max_{s \in S,\, a \in A(s),\, s' \in S} R(s, a, s')$ and
\[
n \geq \log_{\frac{e^{\lambda\Delta}-1}{e^{\lambda\Delta}}} \left( \frac{\epsilon}{R_{max}\,(e^{\lambda\Delta}-1)} \right).
\]
It then holds that:
\[
\max_{s \in S,\, 0 \leq t \leq \Delta} |V^*(s)(t) - V^n(s)(t)| \leq \epsilon.
\]
Proof. Let $\beta := \lambda\Delta > 0$ and $b_i := \sum_{j=i}^{\infty} \frac{\beta^j}{j!}$ for all $i \geq 0$. It holds that $b_0 = e^{\beta}$ and $b_1 = b_0 - 1 = e^{\beta} - 1$. First, a bound is provided on the probability $p_i(t)$ that the sum of the action durations of a sequence of $i \geq 1$ actions is no more than $0 \leq t \leq \Delta$:
\[
p_i(t) \leq p_i(\Delta) = \int_0^{\Delta} (\underbrace{p * p * \ldots * p}_{i})(t')\, dt' = \int_0^{\Delta} \frac{\lambda e^{-\lambda t'} (\lambda t')^{i-1}}{(i-1)!}\, dt'
= \frac{1}{e^{\beta}} \Big( e^{\beta} - \sum_{j=0}^{i-1} \frac{\beta^j}{j!} \Big)
= \frac{1}{e^{\beta}} \Big( \sum_{j=0}^{\infty} \frac{\beta^j}{j!} - \sum_{j=0}^{i-1} \frac{\beta^j}{j!} \Big)
= \frac{1}{e^{\beta}} \sum_{j=i}^{\infty} \frac{\beta^j}{j!} = \frac{b_i}{e^{\beta}}.
\]
The values
\[
\frac{b_{i+1}}{b_i} = \frac{b_{i+1}}{\frac{\beta^i}{i!} + b_{i+1}} = \frac{1}{\frac{\beta^i}{i!\, b_{i+1}} + 1}
\]
decrease strictly monotonically in $i$ because the values
\[
\frac{\beta^i}{i!\, b_{i+1}} = \frac{\beta^i}{i! \sum_{j=i+1}^{\infty} \frac{\beta^j}{j!}} = \frac{1}{\frac{\beta}{i+1} + \frac{\beta^2}{(i+2)(i+1)} + \frac{\beta^3}{(i+3)(i+2)(i+1)} + \ldots}
\]
increase strictly monotonically in $i$. Consequently,
\[
1 > \frac{e^{\beta}-1}{e^{\beta}} = \frac{b_1}{b_0} > \frac{b_2}{b_1} > \frac{b_3}{b_2} > \ldots > 0.
\]
Thus
\[
b_i < \frac{b_{i-1}}{b_{i-2}}\, b_{i-1} < \frac{b_1}{b_0}\, b_{i-1} < \frac{b_1}{b_0} \cdot \frac{b_{i-2}}{b_{i-3}}\, b_{i-2} < \Big(\frac{b_1}{b_0}\Big)^2 b_{i-2} < \ldots < \Big(\frac{b_1}{b_0}\Big)^{i-1} b_1.
\]
These results are now used to bound $|V^*(s)(t) - V^n(s)(t)|$ for all $s \in S$ and $0 \leq t \leq \Delta$. Assume that the agent starts in state $s \in S$ with time-to-deadline $0 \leq t \leq \Delta$. Value iteration with $n$ iterations determines the highest expected total reward $V^n(s)(t)$ under the restriction that execution stops when the deadline is reached or $n$ actions have been executed. The largest expected total reward $V^*(s)(t)$ does not have the second restriction and can thus be larger than $V^n(s)(t)$. In particular, only $V^*(s)(t)$ takes into account that the agent can execute the $(n+1)$st action with probability $p_{n+1}(t)$, the $(n+2)$nd action with probability $p_{n+2}(t)$, and so on, receiving a reward of at most $R_{max}$ for each action execution. Additionally, $V^n(s)(t)$ is locally greedy, i.e., the reward for its first $n$ actions exceeds the reward for the first $n$ actions of $V^*(s)(t)$. Thus,
\[
0 \leq V^*(s)(t) - V^n(s)(t) \leq \sum_{i=n+1}^{\infty} R_{max}\, p_i(t)
\leq \frac{R_{max}}{e^{\beta}} \sum_{i=n+1}^{\infty} b_i
< \frac{R_{max}}{e^{\beta}} \sum_{i=n+1}^{\infty} \Big(\frac{b_1}{b_0}\Big)^{i-1} b_1
= \frac{R_{max}\, b_1}{e^{\beta}} \sum_{i=0}^{\infty} \Big(\frac{b_1}{b_0}\Big)^{i+n}
= \frac{R_{max}\, b_1}{e^{\beta}} \Big(\frac{b_1}{b_0}\Big)^{n} \frac{1}{1 - \frac{b_1}{b_0}}.
\]
The goal of this theorem is to bound this expression by $\epsilon$. If the value functions have breakpoints, the chosen error $\epsilon$ must first be reduced by the error introduced by the breakpoints (this error is small and can easily be bounded by the product of the bisection method error, the maximum number of breakpoints and the planning horizon $n$). Let $\epsilon'$ be $\epsilon$ minus this breakpoint error, $\epsilon' > 0$; then:
\[
\frac{R_{max}\, b_1}{e^{\beta}} \Big(\frac{b_1}{b_0}\Big)^{n} \frac{1}{1 - \frac{b_1}{b_0}} \leq \epsilon'
\;\Longleftrightarrow\;
\Big(\frac{b_1}{b_0}\Big)^{n} \leq \frac{\epsilon' \big(1 - \frac{b_1}{b_0}\big)\, e^{\beta}}{R_{max}\, b_1}
\;\Longleftrightarrow\;
n \geq \log_{\frac{b_1}{b_0}} \left( \frac{\epsilon' \big(1 - \frac{b_1}{b_0}\big)\, e^{\beta}}{R_{max}\, b_1} \right)
\;\Longleftrightarrow\;
n \geq \log_{\frac{e^{\beta}-1}{e^{\beta}}} \left( \frac{\epsilon'}{R_{max}\,(e^{\beta}-1)} \right). \qquad \square
\]
3.2 Forward Search Approach: The DPFP Algorithm
In this section the Dynamic Probability Function Propagation (DPFP) approach for solving continuous resource MDPs is introduced. DPFP alleviates the major weakness of CPH, that is, CPH's poor anytime performance (see Chapter 4). The DPFP approach is a novel combination of three key ideas:

– It introduces the concept of forward search to a continuous resource MDP setting;
– Its forward search is performed in a dual space of cumulative distribution functions, which allows it to vary the planning effort based on the likelihood of reaching different regions of the state-space;
– It bounds the error that such approximation entails.

These three features allow for a superior anytime performance of DPFP when compared to CPH or other algorithms for solving continuous resource MDPs. Furthermore, they allow DPFP to be run in a hybrid mode with CPH or other continuous resource MDP solvers, to speed up the search for high quality solutions.
3.2.1 DPFP at a Glance
Before the DPFP algorithm is described formally, the informal intuition behind this approach is provided. The ability of CPH and other value iteration algorithms to find $\pi^*$ comes at a high price. Indeed, value iteration propagates values backwards, and thus, in order to find $\pi^*(s_0)(0)$, it must first find $\pi^*(s)(t)$ for all states $s \in S$ reachable from $s_0$ and all $t \in [0, \Delta]$ — no matter how likely it is that state $s$ is visited at time $t$ (in the Mars rover domain with 10 sites of interest, value iteration must plan for all $2^{10}$ states and for all $t \in [0, \Delta]$). In fact, value iteration does not even know the probabilities of transitioning to a state $s$ at time $t$ prior to finding $\pi^*$.

DPFP, on the other hand, as a forward search algorithm, can determine the probabilities of transitioning to a state $s$ at time $t$ prior to finding $\pi^*$. Hence, DPFP can discriminate in its policy generation effort, providing only approximate policies for pairs $(s, t)$ encountered with low probability. Unfortunately, if an MDP contains cycles or action duration distributions are continuous, forward search cannot be carried out in the standard way, as it would have to consider an infinite number of candidate policies.

To remedy that, DPFP exploits two insights. First, since each action consumes a certain minimum amount of time, only a finite number of actions can be performed before the deadline [Mausam et al., 2005] and thus, the action horizon of DPFP's forward search can be finite. Second, to avoid having to consider an infinite number of policies when action duration distributions are continuous, DPFP operates on a different search space referred to as the dual space of cumulative distribution functions. In that dual space, DPFP only finds approximate solutions, yet it can express the error of its approximations in terms of an arbitrarily small parameter. This process will now be explained in detail.
3.2.2 Dual Problem
There exists an alternative technique for determining $\pi^*$ that does not use Equation 2.2, and thus does not calculate the values $V^*(s)(t)$ for all $s \in S$ and $t \in [0, \Delta]$. For notational convenience, from now on and until the end of this chapter, absolute time (not time-to-deadline) is considered to be the continuous resource. In other words, the process starts at time $t = 0$ and terminates at time $t = \Delta$, and each action increases the current absolute time. Note that replacing time-to-deadline with absolute time does not affect the underlying planning problems, because the time-to-deadline is simply $\Delta$ minus the absolute time. Consequently, the policy for absolute times maps directly to the policy for times-to-deadline and vice versa. In the following, the absolute time is simply referred to as time.

Let $\phi = (s_0, \ldots, s)$ be an execution path that starts in state $s_0$ at time $t_0$ and finishes in state $s$. $\Phi(s_0)$ is the set of all paths reachable from state $s_0$. Also, let $F^*(\phi)(t)$ be the probability of completing the traversal of path $\phi$ before time $t$ when following the optimal policy $\pi^*$, and $F^*(\phi, a)(t)$ be the probability of completing the traversal of path $\phi$ and starting the execution of action $a \in A(s)$ before time $t$ when following policy $\pi^*$ — both $F^*(\phi)$ and $F^*(\phi, a)$ are cumulative distribution functions over $t \in [0, \Delta]$. In this context, the optimal deterministic policy $\pi^*(s)$ for state $s$ can be calculated as follows:

\[
\pi^*(s)(t) = \arg\max_{a \in A(s)} \lim_{\delta \to 0} \big( F^*(\phi, a)(t + \delta) - F^*(\phi)(t) \big) \qquad (3.6)
\]
This is well-defined since the derivative of $F^*(\phi, a)$ with respect to time is positive at time $t$ for only one action $a \in A(s)$. The set $F^* := \{F^*(\phi), F^*(\phi, a) \text{ for all } \phi = (s_0, \ldots, s) \in \Phi(s_0) \text{ and } a \in A(s)\}$ is referred to as the solution to the dual problem.

It is now shown how to find $F^*$. For simplicity (but without loss of generality) assume that rewards depend only on the state that the process transitions to, i.e., $R(s)$ is the reward for executing any action from any state and transitioning to state $s \in S$. Since rewards $R(s)$ are earned upon entering states $s \in S$ before time $\Delta$, the expected utility $V_{\pi}(s_0)(0)$ of a policy $\pi$ is given by:

\[
V_{\pi}(s_0)(0) = \sum_{\phi = (s_0, \ldots, s) \in \Phi(s_0)} F_{\pi}(\phi)(\Delta) \cdot R(s)
\]

where $F_{\pi}$ differs from $F^*$ in that $F_{\pi}$ is associated with policy $\pi$ rather than $\pi^*$. Since the solution $F^*$ must yield $V^*(s_0)(0)$, it has to satisfy:

\[
V^*(s_0)(0) = \max_{\pi} V_{\pi}(s_0)(0) = \max_{\pi} \sum_{\phi = (s_0, \ldots, s) \in \Phi(s_0)} F_{\pi}(\phi)(\Delta) \cdot R(s) = \sum_{\phi = (s_0, \ldots, s) \in \Phi(s_0)} F^*(\phi)(\Delta) \cdot R(s)
\]
In addition, $F^* \in X = \{F : (3.7), (3.8), (3.9)\}$ where:

\[
F((s_0))(t) = 1 \qquad (3.7)
\]
\[
F((s_0, \ldots, s))(t) = \sum_{a \in A(s)} F((s_0, \ldots, s), a)(t) \qquad (3.8)
\]
\[
F((s_0, \ldots, s, s'))(t) = \sum_{a \in A(s)} P(s, a, s') \int_0^t F((s_0, \ldots, s), a)(t')\, p_{s,a}(t - t')\, dt' \qquad (3.9)
\]

In the above set, constraint (3.7) ensures that the process starts in state $s_0$ at time 0. Constraint (3.8) can be interpreted as the conservation of probability mass flowing through path $(s_0, \ldots, s)$; applicable only if $|A(s)| > 0$, it ensures that the cumulative distribution function $F((s_0, \ldots, s))$ is split into the cumulative distribution functions $F((s_0, \ldots, s), a)$ for $a \in A(s)$. Finally, constraint (3.9) ensures the correct propagation of probability mass $F((s_0, \ldots, s, s'))$ from path $(s_0, \ldots, s)$ to path $(s_0, \ldots, s, s')$. It ensures that path $(s_0, \ldots, s, s')$ is traversed at time $t$ if path $(s_0, \ldots, s)$ is traversed at time $t' \in [0, t]$ and then action $a \in A(s)$ takes time $t - t'$ to transition to state $s'$. The dual problem is then stated as:

\[
\max \Big\{ \sum_{\phi = (s_0, \ldots, s) \in \Phi(s_0)} F(\phi)(\Delta) \cdot R(s) \;\Big|\; F \in X \Big\}
\]
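Constraint (3.9) is just a convolution of a path's per-action CDF with the action's duration density. A small numerical sketch of that propagation step on a time grid follows; the function names and the example density are illustrative only:

```python
import numpy as np

DT = 0.01
ts = np.arange(0.0, 10.0, DT)                 # time grid up to the deadline

def propagate(F_path_a, p_duration, prob_transition):
    """Constraint (3.9) for a single (a, s') pair:
    F((s0,...,s,s'))(t) = P(s,a,s') * integral_0^t F((s0,...,s),a)(t') p_sa(t - t') dt'."""
    F_next = np.empty_like(ts)
    for i in range(len(ts)):
        F_next[i] = prob_transition * np.sum(F_path_a[:i + 1] * p_duration[i::-1]) * DT
    return F_next

# Illustration: action a is started the moment the path is completed, the path is
# completed by time 1 with certainty, and the action duration is Exponential(2).
F_path_a = (ts >= 1.0).astype(float)
p_duration = 2.0 * np.exp(-2.0 * ts)
F_child = propagate(F_path_a, p_duration, prob_transition=1.0)
print(F_child[-1])                            # probability of reaching s' before the deadline
```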
3.2.3 Solving the Dual Problem
In general, the dual problem is extremely difficult to solve optimally because, when action duration distributions are continuous or the MDP has cycles, the set $X$ in which $F^*$ is to be found is infinite. Yet, it is now shown that even if action duration distributions are continuous and the MDP has cycles, the dual problem can be solved near-optimally with guarantees on solution quality. The idea of the proposed algorithm is to restrict the search for $F^*$ to a finite number of elements in $X$ by pruning from $X$ the elements $F$ that correspond to reaching regions of the state-space with very low probability. In essence, when the probability of reaching certain regions of the state-space is below a given threshold, the expected quality loss for executing suboptimal actions in these regions can be bounded, and this quality loss can be traded off for speed.

More specifically, the algorithm searches for $F^*$ in a set $\widehat{X} \subseteq X$ where $\widehat{X}$ differs from $X$ in that the values of the functions $F$ in $\widehat{X}$ are restricted to integer multiples of a given $\kappa \in \mathbb{R}^+$. Informally, $\kappa$ creates a step function approximation of $F$. Formally, $\widehat{X} = \{F : (3.7), (3.8), (3.10), (3.11), (3.12)\}$ where

\[
F'((s_0, \ldots, s, s'))(t) = \sum_{a \in A(s)} P(s, a, s') \int_0^t F((s_0, \ldots, s), a)(t')\, p_{s,a}(t - t')\, dt' \qquad (3.10)
\]
\[
F((s_0, \ldots, s, s'))(t) = \lfloor F'((s_0, \ldots, s, s'))(t) / \kappa \rfloor\, \kappa \qquad (3.11)
\]
\[
F((s_0, \ldots, s), a)(t) = n\kappa \quad \text{where } n \in \mathbb{N} \qquad (3.12)
\]
The restricted dual problem is then stated as:

\[
\max \Big\{ \sum_{\phi = (s_0, \ldots, s) \in \Phi(s_0)} F(\phi)(\Delta) \cdot R(s) \;\Big|\; F \in \widehat{X} \Big\}
\]

Note that since $\widehat{X}$ is finite, the restricted dual problem can be solved optimally by iterating over all elements of $\widehat{X}$. In the following, an algorithm that carries out this iteration is shown; the algorithm returns a policy $\widehat{\pi}^*$ that is guaranteed to be at most $\epsilon$ away from $\pi^*$, where $\epsilon$ can be expressed in terms of $\kappa$. The algorithm is first shown on an example. Then, the pseudo-code of the algorithm is provided. Finally, the error bound of the algorithm is established.
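The quantization in constraint (3.11) is a one-liner on a numerical grid. A minimal sketch, with $\kappa = 0.2$ as in the example that follows (values and names are illustrative):

```python
import numpy as np

KAPPA = 0.2

def quantize(F_prime):
    """Constraint (3.11): snap F'(phi) down to integer multiples of KAPPA."""
    return np.floor(np.asarray(F_prime) / KAPPA) * KAPPA

# A smooth CDF sampled on a small grid and its step-function approximation:
F_prime = np.array([0.0, 0.15, 0.34, 0.58, 0.79, 0.86])
print(quantize(F_prime))   # approximately [0.0, 0.0, 0.2, 0.4, 0.6, 0.8]
```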
Figure 3.4 shows the algorithm in action. Assume $A(s_0) = \{a_1\}$, $A(s_1) = A(s_2) = \{a_1, a_2\}$, and $A(s_3)$, $A(s_4)$, $A(s_5)$ arbitrary. Also, $P(s_0, a_1, s_1) = P(s_1, a_1, s_2) = P(s_1, a_2, s_3) = P(s_2, a_1, s_4) = P(s_2, a_2, s_5) = 1$ and $\kappa = 0.2$. The algorithm iterates over all elements in $\widehat{X}$. It starts with $F((s_0))$, which is given by constraint (3.7), then uses constraints (3.8), (3.10) to derive $F'((s_0, s_1))$ (solid gray line for state $s_1$), and finally uses constraint (3.11) to approximate $F'((s_0, s_1))$ with a step function $F((s_0, s_1))$ (solid black line for state $s_1$).
[Figure 3.4: Forward search for an optimal probabilistic policy in an approximate space of cumulative distribution functions. For each state $s_0, \ldots, s_5$ (recursion levels 0–3), the figure plots over time $t \in [0, \Delta]$ the derived function $F'$ (gray), its step approximation $F$ in increments of $\kappa$ (black), and, for states $s_1$ and $s_2$, one particular splitting of $F$ into the per-action functions $F(\cdot, a_1)$ and $F(\cdot, a_2)$ (dotted); for states $s_4$ and $s_5$ the approximated functions are $F = 0$.]
At this point the algorithm knows the probability $F((s_0, s_1))(t)$ that $s_1$ will be visited before time $t$, but does not know the probabilities $F((s_0, s_1), a_1)(t)$ and $F((s_0, s_1), a_2)(t)$ that actions $a_1$ or $a_2$ will be started from $s_1$ before time $t$ (dotted black lines for state $s_1$). Thus, to iterate over all elements in $\widehat{X}$, it must iterate over all $|A(s_1)|^{F((s_0,s_1))(\Delta)/\kappa} = 16$ different sets of functions $\{F((s_0, s_1), a_1), F((s_0, s_1), a_2)\}$ (also called splittings of $F((s_0, s_1))$). A splitting determines the policy (see Equation 3.6). In particular, for the specific splitting shown, action $a_1$ is started at times $t_1, t_2, t_4$ whereas action $a_2$ is started at time $t_3$ (it is shown later how to extrapolate this policy onto $[0, \Delta]$).
At this point, the algorithm calls itself recursively. It now knows $F((s_0, s_1, s_2))$ and $F((s_0, s_1, s_3))$ (derived from $\{F((s_0, s_1), a_1), F((s_0, s_1), a_2)\}$ using constraints (3.10), (3.11)) but does not know the probabilities $F((s_0, s_1, s_2), a_1)(t)$ and $F((s_0, s_1, s_2), a_2)(t)$ that actions $a_1$ or $a_2$ will be started from $s_2$ before time $t$. Thus, to iterate over all elements in $\widehat{X}$, it iterates over all sets $\{F((s_0, s_1, s_2), a_1), F((s_0, s_1, s_2), a_2)\}$ (splittings of $F((s_0, s_1, s_2))$). In this case, for the specific splitting shown, $F((s_0, s_1, s_2, s_4))(\Delta) < \kappa$ and thus no splittings of $F((s_0, s_1, s_2, s_4))$ are possible (similarly for $F((s_0, s_1, s_2, s_5))$). In such a case, the algorithm stops iterating over policies for states following $s_4$, because the maximum reward loss for not planning for these states, bounded by $\kappa \cdot R$ where $R = \sum_{\phi = (s_0, s_1, s_2, s_4, \ldots, s)} R(s)$, can be made arbitrarily small by choosing a sufficiently small $\kappa$ (to be shown later). Thus, the algorithm evaluates the current splitting of $F((s_0, s_1, s_2))$ and continues iterating over the remaining splittings of $F((s_0, s_1, s_2))$, after which it backtracks and picks another splitting of $F((s_0, s_1))$, etc.
In general, the algorithm is started by calling DPFP($(s_0)$, 1) for a globally defined and arbitrarily small $\kappa$. When called for some $\phi = (s_0, \ldots, s)$ and $F'(\phi)$, the DPFP function first derives $F(\phi)$ from $F'(\phi)$ using constraint (3.11). It then iterates over all sets of functions $\{F(\phi, a) : a \in A(s)\}$
Algorithm 1 DPFP($\phi = (s_0, \ldots, s)$, $F'(\phi)$)
1: $F(\phi)(t) \leftarrow \lfloor F'(\phi)(t) / \kappa \rfloor\, \kappa$
2: $u^* \leftarrow 0$
3: for all sets $\{F(\phi, a) : a \in A(s)\}$ s.t. $F(\phi)(t) = \sum_{a \in A(s)} F(\phi, a)(t)$ and $F(\phi, a)(t) = n\kappa$, $n \in \mathbb{N}$, do
4:   $u \leftarrow 0$
5:   for all $s' \in S$ do
6:     $F' \leftarrow \sum_{a \in A(s)} P(s, a, s') \int_0^t F(\phi, a)(t')\, p_{s,a}(t - t')\, dt'$
7:     $u \leftarrow u\, +$ DPFP($(s_0, \ldots, s, s')$, $F'$)
8:   if $u > u^*$ then
9:     BestSplitting $\leftarrow \{F(\phi, a) : a \in A(s)\}$
10:    $u^* \leftarrow u$
11: for all $F(\phi, a) \in$ BestSplitting and all $t \in [0, \Delta]$ do
12:   if $\lim_{\delta \to 0} F(\phi, a)(t + \delta) - F(\phi, a)(t) > 0$ then
13:     $\widehat{\pi}^*(s)(t) \leftarrow a$
14: return $u^* + F(\phi)(\Delta) \cdot R(s)$
in order to find the best splitting of $F(\phi)$ (lines 3–10). For a particular splitting, the DPFP function first makes sure that this splitting satisfies constraints (3.8) and (3.12) (line 3), upon which it calculates the total expected utility $u$ of this splitting (lines 4–7). To this end, for all paths $(s_0, \ldots, s, s')$, it uses constraint (3.10) to create the functions $F' = F'((s_0, \ldots, s, s'))$ (line 6) and then calls itself recursively for each pair $((s_0, \ldots, s, s'), F')$ (line 7). Finally, if $u$ is greater than the total expected utility $u^*$ of the best splitting analyzed so far, DPFP updates the BestSplitting (lines 8–10).

Upon finding the BestSplitting, the DPFP function uses Equation (3.6) to extract the best deterministic policy $\widehat{\pi}^*$ from it (lines 11–13) and terminates, returning $u^*$ plus the expected reward for entering $s$ before time $\Delta$ (computed in line 14 by multiplying the immediate reward $R(s)$ by the probability $F((s_0, \ldots, s))(\Delta)$ of entering $s$ before time $\Delta$). As soon as DPFP($(s_0)$, 1) terminates, the algorithm extrapolates its already known point-based policies onto the time interval $[0, \Delta]$ using the following method: If $\widehat{\pi}^*(s)(t_1) = a_1$, $\widehat{\pi}^*(s)(t_2) = a_2$, and $\widehat{\pi}^*(s)(t)$ is not defined for $t \in (t_1, t_2)$, the algorithm puts $\widehat{\pi}^*(s)(t) = a_1$ for all $t \in (t_1, t_2)$. For example, if the splitting in Figure 3.4 is optimal, $\widehat{\pi}^*(s_1)(t) = a_1$ for $t \in [0, t_1) \cup [t_1, t_2) \cup [t_2, t_3) \cup [t_4, \Delta)$ and $\widehat{\pi}^*(s_1)(t) = a_2$ for $t \in [t_3, t_4)$.
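The combinatorial core of Algorithm 1 is line 3: enumerating every way of splitting a step function $F(\phi)$, whose values are multiples of $\kappa$, into per-action step functions. A small illustrative sketch of that enumeration for step functions represented by their values on a time grid (the representation and helper names are assumptions of this sketch):

```python
from itertools import product

KAPPA = 0.2

def splittings(F_steps, actions):
    """Enumerate splittings of F(phi) into per-action step functions {F(phi, a)}.

    F_steps: non-decreasing list of integers, F(phi)(t_i) / KAPPA at the grid points.
    Each kappa-sized increment of F(phi) is assigned to exactly one action, which
    keeps every F(phi, a) a non-decreasing multiple of KAPPA.  Duplicates may occur
    if several increments share a grid point.
    """
    increments = [F_steps[0]] + [b - a for a, b in zip(F_steps, F_steps[1:])]
    slots = [i for i, inc in enumerate(increments) for _ in range(inc)]
    for assignment in product(actions, repeat=len(slots)):
        split = {a: [0] * len(F_steps) for a in actions}
        for slot, act in zip(slots, assignment):
            for i in range(slot, len(F_steps)):
                split[act][i] += 1
        yield {a: [v * KAPPA for v in vals] for a, vals in split.items()}

# F(phi) rises by one kappa-step at grid points 1, 2, 3 and 4 (total mass 0.8):
count = sum(1 for _ in splittings([0, 1, 2, 3, 4], actions=("a1", "a2")))
print(count)   # 2 ** 4 = 16 candidate splittings, as in the Figure 3.4 walkthrough
```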
3.2.4 Taming the Algorithm Complexity
As stated, the DPFP algorithm can appear to be inefficient since it operates on a large number of paths (exponential in the length of the longest path) and a large number of splittings per path (exponential in $\lfloor 1/\kappa \rfloor$). However, this exponential complexity is alleviated thanks to the following features of DPFP:

– Varying policy expressivity for different states: The smaller the probability of traversing a path $\phi = (s_0, \ldots, s)$ before the deadline, the less expressive the policy for state $s$ has to be (fewer ways in which $F(\phi)$ can be split into $\{F(\phi, a) : a \in A(s)\}$). For example, state $s_2$ in Figure 3.4 is less likely to be visited than state $s_1$ and therefore, DPFP allows for higher policy expressivity for state $s_1$ ($2^4$ policies) than for state $s_2$ ($2^2$ policies). Sparing the policy generation effort in states that are less likely to be visited enables faster policy generation.

– Varying policy expressivity for different time intervals: The smaller the probability of transitioning to a state inside a time interval, the less expressive the policy for this state and interval has to be. In Figure 3.4 it is more likely to transition to state $s_1$ at time $t \in [t_1, t_3]$ (with probability $2\kappa$) than at time $t \in [t_3, t_4]$ (with probability $\kappa$) and thus, DPFP considers $2^2$ policies for time interval $[t_1, t_3]$ and only $2^1$ policies for time interval $[t_3, t_4]$.

– Path independence: When the function $F(\phi)$ for a sequence $\phi = (s_0, \ldots, s)$ is split into functions $\{F(\phi, a) : a \in A(s)\}$, the functions $\{F(\phi, a) : a \in A(s), P(s, a, s') = 0\}$ have no impact on $F((s_0, \ldots, s, s'))$. Thus, fewer splittings of $F(\phi)$ have to be considered to determine all possible functions $F((s_0, \ldots, s, s'))$. For example, in Figure 3.4, $F((s_0, s_1, s_2))$ is only affected by $F((s_0, s_1), a_1)$. Consequently, as long as $F((s_0, s_1), a_1)$ remains unaltered when iterating over different splittings of $F((s_0, s_1))$, the best splittings of $F((s_0, s_1, s_2))$ do not have to be recomputed.

– Path equivalence: For different paths $\phi = (s_0, \ldots, s)$ and $\phi' = (s_0, \ldots, s)$ that coalesce in state $s$, the best splitting of $F(\phi)$ can be reused to split $F(\phi')$ provided that $\max_{t \in [0, \Delta]} |F'(\phi)(t) - F'(\phi')(t)| \leq \kappa$.
3.2.5 Error Control
Recall that $\widehat{F}^*$ is the optimal solution to the restricted dual problem returned by DPFP. It will now be proven that the reward error of a policy identified by $\widehat{F}^*$ can be expressed in terms of $\kappa$. To this end, it will first be proven that the maximum loss of probability mass for one sequence is bounded. Then, the error bound of the DPFP algorithm will be expressed as the sum of rewards collected on all possible execution sequences multiplied by the maximum loss of probability mass for one sequence.

Hence, it is first proven that for all paths $\phi \in \Phi(s_0)$:

\[
\max_{t \in [0, \Delta]} |F^*(\phi)(t) - \widehat{F}^*(\phi)(t)| \leq |\phi|\, \kappa \qquad (3.13)
\]

Statement (3.13) is proven by induction on the length of $\phi$.

Induction base: For any sequence $\phi \in \Phi$ such that $|\phi| = 1$ it holds that $\max_{t \in [0, \Delta]} |F^*((s_0))(t) - \widehat{F}^*((s_0))(t)| = \max_{t \in [0, \Delta]} |1 - 1| = 0 < \kappa$.

Induction step: Assume now that statement (3.13) holds for a sequence $\phi = (s_0, \ldots, s_{n-1})$ of length $n$. Statement (3.13) then also holds for all sequences $\phi' = (s_0, \ldots, s_{n-1}, s_n)$ of length $n + 1$ because
\begin{align*}
|F^*(\phi')(t) - \widehat{F}^*(\phi')(t)| &\leq |F^*(\phi')(t) - \widehat{F}'^*(\phi')(t)| + \kappa \\
\intertext{where $\widehat{F}^*(\phi')$ is derived from $\widehat{F}'^*(\phi')$ using constraint (3.11),}
&= \sum_{a \in A(s)} P(s, a, s')\, \Big| \int_0^t F^*(\phi, a)(t')\, p_{s,a}(t - t')\, dt' - \int_0^t \widehat{F}^*(\phi, a)(t')\, p_{s,a}(t - t')\, dt' \Big| + \kappa \\
&\leq \max_{a \in A(s)} \int_0^t |F^*(\phi, a)(t') - \widehat{F}^*(\phi, a)(t')|\, p_{s,a}(t - t')\, dt' + \kappa \\
&\leq \max_{a \in A(s)} \int_0^t |F^*(\phi)(t') - \widehat{F}^*(\phi)(t')|\, p_{s,a}(t - t')\, dt' + \kappa \\
\intertext{and, from the induction assumption,}
&\leq \int_0^t n\kappa\, p_{s,a}(t - t')\, dt' + \kappa \leq n\kappa + \kappa \leq |\phi'|\, \kappa
\end{align*}
holds for $t \in [0, \Delta]$.

Consequently, statement (3.13) holds for any sequence $\phi \in \Phi$. The error of the DPFP algorithm can then be expressed by multiplying the maximum loss of probability mass for one sequence by the sum of rewards collected on all possible execution sequences. Precisely, the error $\epsilon$ can be expressed in terms of $\kappa$ because:
\[
\epsilon = R_{max} \sum_{\phi \in \Phi(s_0)} \max_{t \in [0, \Delta]} |F^*(\phi)(t) - \widehat{F}^*(\phi)(t)|
\leq R_{max} \sum_{\phi \in \Phi(s_0)} |\phi|\, \kappa
\leq R_{max}\, H\, |A|^H\, \kappa
\]
where $R_{max} = \max_{s \in S} R(s)$ and $H$ is the action horizon (if the minimal action duration $\delta_{min}$ is known then $H \leq \lfloor \Delta / \delta_{min} \rfloor$). Hence, by decreasing $\kappa$, DPFP can trade off speed for optimality.
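A tiny helper for choosing $\kappa$ from a target error by inverting the bound $\epsilon \leq R_{max}\, H\, |A|^H\, \kappa$ derived above (all arguments are problem parameters; purely illustrative):

```python
def dpfp_kappa_for_error(target_eps, r_max, horizon, num_actions):
    """Largest step height kappa whose worst-case error bound stays below target_eps."""
    return target_eps / (r_max * horizon * num_actions ** horizon)

def dpfp_error_bound(kappa, r_max, horizon, num_actions):
    """Worst-case reward loss of DPFP for a given kappa."""
    return r_max * horizon * num_actions ** horizon * kappa

print(dpfp_kappa_for_error(target_eps=1.0, r_max=10, horizon=5, num_actions=2))
```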
Chapter 4: Experiments with Single Agent Algorithms
This chapter reports on the experimental evaluation of the algorithms for solving continuous resource MDPs. Section 4.1 presents a feasibility study of the CPH algorithm introduced in Section 3.1. CPH is compared with the Lazy Approximation algorithm [Li and Littman, 2005], currently the fastest algorithm for solving continuous resource MDPs, on various configurations of the domain from Figure 2.1. Then, Section 4.2 reports on a comparison of the efficiency of CPH and Lazy Approximation on a family of computationally intensive planetary exploration domains — these two algorithms are joined by the DPFP algorithm in this comparison. Next, Section 4.3 reports on the empirical evaluation of the DPFP-CPH hybrid algorithm that combines the strengths of CPH and DPFP. Finally, Section 4.4 reports on the successful integration of CPH with RIAACT [Schurr et al., 2008], the adjustable autonomy system for coordinating a human incident commander and an agent team in the event of a simulated large-scale disaster.
4.1 CPH Feasibility Experiments
This section reports on the small-scale experiments involving CPH and Lazy Approximation [Li and Littman, 2005], currently the leading algorithm for solving continuous resource MDPs. Lazy Approximation approximates the probability distributions and value functions with piecewise constant functions. The Bellman updates then transform the piecewise constant value functions into piecewise linear value functions, which then need to be approximated again with piecewise constant value functions. These repeated approximations result in large runtimes and approximation errors, as demonstrated below. Furthermore, the number of intervals needed to approximate the value functions with piecewise constant functions is large, which results in even larger runtimes.

CPH, on the other hand, approximates the probability distributions with phase-type distributions, resulting in exponential probability distributions that it then uniformizes. One of its advantages is that the value functions then remain piecewise gamma functions, so it does not need to approximate the value functions at all, with only one small exception, which involves finding the intersections of value functions, for which it uses a numerical method. Another one of its advantages is that the number of intervals of the piecewise gamma value functions tends to be smaller than the number of intervals of the piecewise linear value functions. Both of these advantages result in significant computational savings, as can be seen below.
The first set of experiments compares the efficiency of CPH and Lazy Approximation (referred to as LA) for the planetary exploration domain introduced in Section 2.1.1. The planetary exploration domain might appear small with only 5 discrete states. However, if one uses the standard MDP framework where the time-to-deadline is discretized into 1/100 time units (which for a 4-hour period of rover operation corresponds to a decision point every 36 seconds), then there are already $5 \cdot 400 = 2000$ distinct MDP states. For the three experiments (Figures 4.1, 4.2 and ??), the error $\max_{0 \leq t \leq \Delta} |V^*(\text{start})(t) - V(\text{start})(t)|$ of the calculated value function $V(\text{start})$ is always plotted on the x-axis and the corresponding runtime (measured in milliseconds) in logarithmic scale on the y-axis.
Experiment 1 determines how CPH and Lazy Approximation trade o between runtime and
error for the planetary exploration domain from Section 2.1.1. Since the action durations in
the Mars rover domain are already distributed exponentially and thus phase-type with one phase,
there are no errors introduced by approximating the probability distributions over the action dura-
tions with phase-type distributions. Also, since these distributions are equal, uniformization will
not introduce self-transitions and value iteration can run for a finite number of iterations. Thus,
for CPH, the accuracy of the bisection method (that determines the dominance breakpoints) was
varied whereas for Lazy Approximation, the accuracy of the piecewise constant approximations
of the probability distributions over the action durations and value functions was varied. The
results show that CPH is faster than Lazy Approximation with the same error, by three orders
of magnitude for small errors. For example, CPH needed 2ms and Lazy Approximation needed
1000ms to compute a policy that is less than 1% o optimal, which corresponds to an error of
0.13 in the Mars rover domain.
Experiments 2 and 3 determine how CPH and Lazy Approximation trade off between runtime and error when all action durations in the planetary exploration domain from Section 2.1.1 are characterized by either Weibull distributions Weibull(α = 1, β = 2) (Experiment 2) or Normal distributions N(μ = 2, σ = 1) (Experiment 3). Since the action durations are no longer exponential, phase-type approximation was used for CPH. Here, the accuracy of the bisection method used by CPH has been fixed, but the number of phases used by the Coxian distribution (see Section B) approximating the initial duration distribution was varied. As can be seen, CPH is still faster than Lazy Approximation with the same error, by more than one order of magnitude for small errors. For example, CPH needed 149ms (with five phases) and Lazy Approximation needed 4471ms to compute a policy that is less than 1% off optimal for the Weibull distributions.
Figure 4.1: Empirical evaluation of CPH for exponential distributions.
In addition, a study of the tightness of the error bound for CPH, calculated in Section 3.1.4, has been conducted. To this end, CPH was run with a varying number of phases (2, 3 and 5) used to approximate the probability distributions, which in turn affected both the error and the unified exit rate parameter for the exponential distributions. Those numbers were then plugged into the formula from Theorem 1 in order to calculate the theoretical planning horizon for CPH. As can be seen (Figure 4.3), CPH converged much faster than the theoretical planning horizon calculated from Theorem 1. This encouraging result suggests that tightening the error bound for CPH can be a worthy topic of future investigation.
Figure 4.2: Empirical evaluation of CPH for Weibull distributions.
Figure 4.3: Planning horizon of CPH.
4.2 CPH and DPFP Scalability Experiments
This section reports on the scalability experiments with CPH, DPFP and the Lazy Approximation algorithms. The domains on which all three algorithms were run, the fully ordered domain, the unordered domain and the partially ordered domain, are presented in Figures 4.4a, 4.4b, 4.4c. In the fully ordered domain the agent executes an action $a' \in A(s_0) = \{a_1, a_2, a_3\}$, transitions to state $s_{a'}$, executes $a'' \in A(s_{a'}) = \{a_1, a_2, a_3\}$, transitions to state $s_{a',a''}$, and repeats this scheme up to $H = 8$ times, for a total of $3^8 = 6561$ states. In the unordered domain (which resembles the classical Traveling Salesman Problem) the agent visits up to 8 sites in an arbitrary order and hence the number of states is $2^8 = 256$. Finally, the partially ordered domain is a combination of the fully ordered and unordered domains; here, the agent can visit up to 10 sites in a partial order, that is, site $m+1$ can be visited only after site $m$ has already been visited, for $m = 1, 3, 5, 7, 9$. For all the domains, the mission deadline is $\Delta = 10$. Also, action rewards are drawn uniformly from the set $\{1, 2, \ldots, 10\}$ and action durations are sampled from one of the following probability distribution functions (chosen at random): Normal(μ = 2, σ = 1), Weibull(α = 2, β = 1), Exponential(λ = 2) and Uniform(a = 0, b = 4).

Figure 4.4: Domains for CPH and DPFP scalability experiments: $m_i$ denote sites of interest
The scalability experiments that have been conducted determine how the three algorithms
trade off between runtime and error. The following parameters were varied:
For DPFP, the height of the step function that approximates the probability functions;
For CPH, the number of phases of the Coxian distribution used to approximate the initial
action duration distributions.
For Lazy Approximation, the tolerance threshold used when approximating piecewise lin-
ear functions with piecewise constant functions in Bellman updates.
The results of the scalability experiments are shown in Figures 4.5a, 4.5b, 4.5c where runtime (in seconds) is on the x-axis (notice the logarithmic scale) and the solution quality (% of the optimal solution) is on the y-axis. In particular, to obtain the benchmark optimal solution quality, Lazy Approximation was run sufficiently long so that the error margin of its solution was below 10% of the optimal solution (for the detailed control of this error, refer to [Li and Littman, 2005]). The results across all the domains show that CPH or DPFP outperform Lazy Approximation in terms of the runtimes necessary to find high quality solutions by up to three orders of magnitude. For example, to find a solution that is less than 10% off the optimal policy for the unordered domain, CPH needs 14.5s whereas Lazy Approximation requires 2026.1s for the same task.

Furthermore, as can be seen, DPFP opens up an entirely new area of the solution-quality vs. time trade-off space that was inaccessible to previous algorithms. In particular, DPFP dominates Lazy Approximation in this trade-off, providing higher quality in lower time. DPFP also provides very high quality solutions an order of magnitude faster than CPH, e.g. in Figure 4.5a, for solutions with quality higher than 70%, DPFP will provide an answer in 0.46 seconds, while CPH will take 28.1s for the same task. Finally, DPFP exhibits superior anytime performance, e.g. in Figure 4.5c, run with its approximation parameter set to 0.3, 0.25 and 0.2, it attains solution qualities of 42%, 61% and 72% in just 0.5s, 1.1s and 3.9s respectively.
4.3 DPFP-CPH Hybrid Experiments
Figure 4.5: Scalability experiments for DPFP, CPH and Lazy Approximation

Figure 4.6: DPFP+CPH hybrid: Fully ordered domain

Encouraged by DPFP's anytime performance and CPH's superior quality results, a DPFP-CPH hybrid algorithm was developed and its efficiency was contrasted with the efficiency of stand-alone CPH. The idea of the DPFP-CPH hybrid is to first use DPFP to quickly find an action to be executed from the starting state $s_0$ and then use this action to narrow down CPH's search for a high quality policy. For example, the hybrid that was implemented uses DPFP (with its approximation parameter set to 0.2) to suggest to CPH which action should be executed in $s_0$ at time 0, and then runs CPH in the narrowed state-space.
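The following is a minimal Python sketch of this hybrid control flow. The routines `dpfp_best_first_action`, `cph_solve` and `restrict_first_action` are hypothetical stand-ins for the DPFP and CPH machinery described earlier, not their actual interfaces.

```python
# Minimal sketch of the DPFP-CPH hybrid, under assumed (hypothetical) solver interfaces.

def hybrid_policy(model, s0, dpfp_best_first_action, cph_solve, theta=0.2):
    # Step 1: run DPFP coarsely (approximation parameter theta) only to pick
    # the action to be executed from the starting state s0 at time 0.
    first_action = dpfp_best_first_action(model, s0, t=0.0, theta=theta)

    # Step 2: restrict the state space to executions that begin with that action,
    # then let CPH compute a high-quality, time-dependent policy on the
    # narrowed state space.
    narrowed_model = model.restrict_first_action(s0, first_action)
    cph_value_functions = cph_solve(narrowed_model)

    return first_action, cph_value_functions
```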
The empirical evaluation of the DPFP-CPH hybrid is shown in Figures 4.6, 4.7 and 4.8. Across all the figures, the accuracy of CPH is varied on the x-axis (using more phases translates into higher accuracy of CPH) whereas the y-axis is used to plot the algorithms' runtime (in Figures 4.6a, 4.7a, 4.8a) and the corresponding solution quality (in Figures 4.6b, 4.7b, 4.8b). The results across all the domains show that the DPFP-CPH hybrid attains the same quality as stand-alone CPH, yet requires significantly less runtime (over 3 times less). For example, when CPH accuracy is fixed at 5 phases, the DPFP-CPH hybrid needs only 51s to find a solution whereas stand-alone CPH needs 169.5s for the same task (Figure 4.7a).
Figure 4.7: DPFP+CPH hybrid: Unordered domain
Figure 4.8: DPFP+CPH hybrid: Partially ordered domain
4.4 DEFACTO Experiments
In the final set of experiments with single agent algorithms, CPH was integrated with RIAACT [Schurr et al., 2008], the adjustable autonomy module of the DEFACTO system [Schurr et al., 2005] for training the incident commanders of the Los Angeles Fire Department. The main purpose of RIAACT is to decide who should perform the role allocation in the event of a simulated city-wide disaster: either (i) the human incident commander or (ii) the agent team. In essence, whenever a new role appears, e.g. whenever a new building catches fire, either the human incident commander or the agent team must decide which fire engine should fight that particular fire. In this context, it has been demonstrated [Marecki et al., 2005] that preventing decision conflicts between the human incident commander and the agent team is a fundamental problem, as it has direct implications on the outcome of the disaster rescue operation.
At a basic level, RIAACT is an instantiation of the continuous resource MDP model. RIAACT makes the following assumptions: (i) time is a continuous resource; (ii) the time at which a building is completely burnt is the resource limit; and (iii) the action durations are uncertain. Precisely, RIAACT's states are $S = \{Aa, Ha, Adi, Adc, Hdi, Hdc, Finish\}$ with the following meaning (see Figure 4.9):
Aa: The agent team is responsible for deciding who should execute the current role;
Ha: The human incident commander is responsible for deciding who should execute the current role;
Adi: The agent team has decided who should execute the current role, but this decision is inconsistent with the human incident commander's preferences;
Adc: The agent team has decided who should execute the current role and this decision is consistent with the human incident commander's preferences;
Hdi: The human incident commander has decided who should execute the current role, but this decision is inconsistent with the preferences of the agent team;
Hdc: The human incident commander has decided who should execute the current role and this decision is consistent with the preferences of the agent team;
Finish: The role allocation has been performed.
RIAACT's actions $A = \{TransferAutonomy, Decide, Resolve, Execute\}$ are as follows:

TransferAutonomy: At any point in time, the agent team can transfer the autonomy (role allocation request) to the human incident commander (similarly, the human incident commander can transfer the autonomy to the agent team). Although the TransferAutonomy action does not yield any reward, it consumes a certain amount of time which is sampled from a given probability distribution.
Decide: At any point in time, when the role allocation request is at the human incident commander (in state Ha), the human incident commander can make a role allocation decision. Depending on whether this decision is consistent or not with the agent team, the MDP transitions to state Hdi or state Hdc (similarly, when the agent team makes a decision, the MDP transitions from state Aa to either state Adi or state Adc). The duration of the Decide action is uncertain and follows a given probability distribution. RIAACT assumes that the human incident commander takes longer than the agent team to make a decision. The Decide action itself yields no reward.

Resolve: When the role allocation decision is inconsistent (the process transitioned to an inconsistent state under the Decide action), a Resolve action can be performed to improve the decision so that it becomes consistent. The Resolve action does not come for free: it consumes a certain amount of time which is sampled from a given probability distribution.
Although there is no reward for the Resolve action, consistent role allocations yield more reward when the Execute action is performed (see below).

Figure 4.9: RIAACT: Adjustable autonomy component of the DEFACTO system, [Schurr et al., 2008]
Execute: When the Execute action is performed, the current role is executed by some entity. Depending on whether the process occupies state Adi, Adc, Hdi or Hdc when the Execute action is performed, different rewards can be earned. Typically, the highest reward is earned if the Execute action is started from state Hdc (the human incident commander decides who should execute a role, and this decision is consistent with the agent team) and the lowest reward is earned if the Execute action is started from state Adi (the agent team decides who should execute a role, but this decision is inconsistent with the intention of the human incident commander). The duration of the Execute action is uncertain; it is affected by the intensity of the fire and the proximity of the fire engines. A minimal sketch of this model as data is given below.
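To make the structure of the RIAACT model concrete, the sketch below writes it down as plain data. The duration distributions and reward numbers are purely illustrative placeholders, not the values used in the experiments; only the state and action structure follows the description above.

```python
import random

# Purely illustrative encoding of the RIAACT continuous-resource MDP structure.
# Duration parameters and rewards are made-up placeholders.

STATES = ["Aa", "Ha", "Adi", "Adc", "Hdi", "Hdc", "Finish"]

ACTIONS = {
    "Aa": ["TransferAutonomy", "Decide"],
    "Ha": ["TransferAutonomy", "Decide"],
    "Adi": ["Resolve", "Execute"],
    "Adc": ["Execute"],
    "Hdi": ["Resolve", "Execute"],
    "Hdc": ["Execute"],
}

# Each action consumes an uncertain amount of time, sampled from a distribution.
DURATIONS = {
    "TransferAutonomy": lambda: random.expovariate(2.0),
    "Decide": lambda: random.gauss(2.0, 0.5),
    "Resolve": lambda: random.gauss(3.0, 1.0),
    "Execute": lambda: random.gauss(5.0, 2.0),
}

# Rewards are earned only when Execute is started; consistent decisions pay more.
EXECUTE_REWARD = {"Hdc": 10.0, "Adc": 8.0, "Hdi": 4.0, "Adi": 2.0}
```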
To solve the RIAACT model shown in Figure 4.9, the CPH algorithm was used. CPH employed 3-phase Coxian distributions for the purpose of the underlying phase-type approximation. The CPH solver was run for $n = 80$ Bellman update iterations, to ensure that the increase in value functions in further iterations was marginal. After 27.2 seconds, CPH returned a time-dependent policy, both for the human incident commander as well as the agent team, for any state present in the RIAACT model. As a result, depending on the time remaining before a building is completely burnt, the human incident commander and the agent team knew exactly whether it is more profitable to: (i) make a role allocation decision and then try to resolve a potential conflict or (ii) choose to transfer the autonomy. The RIAACT approach was compared with two competing approaches: (i) the "Always Reject" approach in which the agent team would always perform a no-return transfer of autonomy to the human incident commander and (ii) the "Always Accept" approach in which the agent team would always accept a role allocation request and never bother the human incident commander.
The experimental results of DEFACTO equipped with RIAACT are shown in Figure 4.10. Here, the number of buildings saved is plotted on the y-axis (averaged over 50 runs) whereas the duration of the Resolve action (which is sampled from a Normal distribution with different mean and variance values) is varied on the x-axis. As can be seen, the RIAACT approach outperforms the "Always Accept" and "Always Reject" approaches for every tested distribution of the Resolve action duration. In particular, when the Resolve action duration follows a Normal distribution with a mean of 3s and a variance of 1s², DEFACTO with RIAACT makes it possible to save up to 25% more buildings in a simulated disaster rescue operation. Furthermore, the duration of the Resolve action does not have an impact on the number of buildings saved by the "Always Accept" and "Always Reject" approaches; this is not surprising because these strategies do not use RIAACT's inconsistency resolution mechanism.

Figure 4.10: RIAACT results (buildings saved vs. Resolve action duration distribution), [Schurr et al., 2008]
Chapter 5: Multiagent Solutions
This chapter focuses on techniques for solving planning problems modeled as Continuous Resource, Decentralized MDPs (CR-DEC-MDPs). Two different algorithms for solving CR-DEC-MDPs are proposed. The first algorithm (VFP) considers a special case where CR-DEC-MDPs are fully ordered (explained later). For such CR-DEC-MDPs, VFP performs a series of policy iterations to quickly find locally optimal joint policies. In addition, VFP implements a set of heuristics aimed at improving the quality of these locally optimal joint policies. The second proposed algorithm (M-DPFP) for solving CR-DEC-MDPs operates on arbitrary CR-DEC-MDPs. It finds joint policies that are guaranteed to be within an arbitrarily small $\epsilon$ of the optimal joint policies by extending the concept of probability function propagation to a multi-agent setting. Both algorithms consider AbsoluteTime as a continuous resource whose values monotonically increase from 0 to some mission deadline $\Delta$.¹

¹ Such a formulation of the continuous resource does not affect the planning problems because $\Delta - AbsoluteTime$ is then simply a continuous resource whose values monotonically decrease.
5.1 Locally Optimal Solution: The VFP Algorithm
Because CR-DEC-MDPs are hard to solve optimally, this section introduces a locally optimal algorithm (VFP) that operates on CR-DEC-MDPs that are fully-ordered. The approach of simplifying the problem by restricting the search for policies to fully ordered Decentralized MDPs was first proposed in [Beynier and Mouaddib, 2005] and then elaborated on in [Beynier and Mouaddib, 2006]. In these works, the authors developed a locally optimal algorithm that has been shown to scale up to domains with double digit action horizons. The VFP algorithm proposed in this section builds on the idea of performing a search for locally optimal policies for fully-ordered CR-DEC-MDPs.
Precisely, a fully-ordered CR-DEC-MDP assumes that methods are arranged in chains, i.e., for any agent n and its set of methods $M_n = \{m_1, \ldots, m_k\}$ there exist resource precedence constraints $\langle i, i+1 \rangle \in C_{\prec}$ for all $i = 1, \ldots, k-1$ which impose a chain ordering of the methods from $M_n$. Two immediate facts result from imposing such restrictions on CR-DEC-MDPs:
When agent n executes method $m_i$ successfully, it can either start executing method $m_{i+1}$ (this action is denoted as E) or choose to remain idle for a certain amount of time and then make the decision again (this action is denoted as W);
When agent n executes method $m_i$ unsuccessfully, it can no longer execute any other method, because none of the methods $m_{i+1}, m_{i+2}, \ldots, m_k$ that method $m_i$ directly or indirectly precedes will ever be enabled. Hence, the execution of agent n stops. The execution of other agents can continue, provided that all their methods have been executed successfully thus far.
When operating on fully-ordered CR-DEC-MDPs, the agent policy for a method $m_{i+1}$ is directly related to the expected utility of starting the execution of $m_{i+1}$, as shown in Figure 5.1. Here, the agent is about to start executing a method $m_i$ which can be executed in one of two time windows (resource ranges): $\langle i, l_1, l_2 \rangle \in C_{[]}$ or $\langle i, l_3, l_4 \rangle \in C_{[]}$. The shaded area represents the total expected utility for starting the execution of method $m_i$ over time (referred to as the value function). As can be seen in Figure 5.1, at any time $t \in [0, \Delta]$, the Execute action is better than the Wait action if it is less profitable to perform the Execute action in the future (after time t).
Figure 5.1: Agent policy in a fully-ordered CR-DEC-MDP
Since in a fully-ordered CR-DEC-MDP each agent n always knows which method should be executed next, agent decision making is reduced to choosing the correct method execution starting time. However, since the model allows for many resource limit constraints for a method $m_i$ (multiple execution time windows in Figure 5.1), the policy $\pi_n$ of agent n cannot be stored as a set of pairs $(m_k, t_k)$ where $t_k$ is a point in time when the agent switches from action W to action E. Instead, the policy $\pi_n$ of agent n must be a function $\pi_n : M_n \times [0, \Delta] \mapsto \{W, E\}$ where "W" represents the Wait action and "E" represents the Execute-next-method action. For example, $\pi_n(\langle i, t \rangle) = E$ means that, at time t, when the agent's next method to be executed is $m_i$, it will start executing it. Else, if $\pi_n(\langle i, t \rangle) = W$, the agent will wait for an infinitesimal amount of time and then make another decision. In the following, the expression $\pi_n(\langle i, t \rangle)$ is referred to as the policy of agent n for method $m_i$ at time t.
5.1.1 Policy Iteration Approach
A popular approach to finding the optimal joint policy $\pi^*$ is to use the value iteration principle, which in the context of a CR-DEC-MDP could work as follows: In order to determine the optimal policy for method $m_i$ one can propagate backwards (in the inverse direction of the resource precedence constraints relation) the expected utilities of executing methods $m_j$ where $\langle i, j \rangle \in C_{\prec}$. Unfortunately, for the CR-DEC-MDP model, the optimal policy for method $m_i$ also depends on the policies for methods $m_j$ where $\langle j, i \rangle \in C_{\prec}$. This bi-directional dependency results from the fact that the expected reward for starting the execution of method $m_i$ at time t also depends on the probability that method $m_i$ will be enabled before time t. Consequently, as shown in [Bernstein et al., 2000], the complexity of the optimal algorithms for a CR-DEC-MDP model with discrete resource levels is NEXP-complete.

Following the limited applicability of globally optimal algorithms, locally optimal algorithms have recently gained a lot of attention. Of particular importance is the Opportunity Cost, Decentralized MDP (OC-DEC-MDP) algorithm [Beynier and Mouaddib, 2005], [Beynier and Mouaddib, 2006] that has been shown to scale up to problems involving double digit action horizons. The idea of the OC-DEC-MDP algorithm, which also operates on fully ordered MDPs, is to perform a series of policy iterations to converge on a locally optimal solution. The algorithm starts with the earliest starting time policy, according to which an agent starts executing a method $m_i$ as soon as possible. It then improves this policy by performing an opportunity cost propagation phase, evaluating the new policy and then, if the new policy is significantly better than the old policy, the probability propagation phase which prepares the algorithm for its next iteration.
The opportunity cost and probability propagation phases of the OC-DEC-MDP algorithm consider AbsoluteTime as a resource, but a discretized one. The algorithm then operates on discrete time intervals: $F_i[t_1, t_2]$ is the probability that method $m_i$ will be executed in time interval $[t_1, t_2]$, and $O_i[t_1, t_2]$ is the expected utility for executing method $m_i$ in time interval $[t_1, t_2]$ assuming that method $m_i$ is enabled, referred to as the Opportunity Cost of method $m_i$ in time interval $[t_1, t_2]$. At each iteration the OC-DEC-MDP algorithm knows the old policy $\pi$ and the probabilities $F_i[t_1, t_2]$ for $t_1, t_2 \in \mathbb{N} \cap [0, \Delta]$ uniquely identified by $\pi$. The algorithm then searches for a new policy $\pi'$ that improves the old policy in the following two phases:

Opportunity cost propagation: It calculates the opportunity costs $O_i[t_1, t_2]$ for $t_1, t_2 \in \mathbb{N} \cap [0, \Delta]$ and $m_i \in M$ by backward propagation, starting from sink methods (methods that do not enable any other methods) and ending on source methods (methods not enabled by any other method). The opportunity cost propagation phase utilizes the probabilities $F_i[t_1, t_2]$ uniquely identified by the old policy $\pi$.

Probability propagation: Using the opportunity costs $O_i[t_1, t_2]$ calculated in the opportunity cost propagation phase, the algorithm identifies the most profitable method execution intervals which are then used to determine the new policy $\pi'$. The algorithm then calculates the probabilities $F_i[t_1, t_2]$ for $t_1, t_2 \in \mathbb{N} \cap [0, \Delta]$ and $m_i \in M$ associated with the new policy $\pi'$. To this end, it propagates the probabilities $F_i[t_1, t_2]$ forward (from source methods to sink methods). If the new policy $\pi'$ does not improve the old policy by more than some margin $\epsilon$, the OC-DEC-MDP algorithm terminates. Otherwise, the algorithm repeats the two above-mentioned steps.
5.1.2 Functional Representation
Unfortunately, the OC-DEC-MDP algorithm has three major shortcomings: (i) it is only applicable to solving problems that assume discrete resource levels; (ii) it does not exploit the functional representation of the underlying opportunity cost functions and probability functions and can thus run slowly; and (iii) it fails to address a critical problem of double-counting when the opportunity costs $O_i[t_1, t_2]$ are being derived from the opportunity costs $O_j[t_1, t_2]$ for $\langle i, j \rangle \in C_{\prec}$.
To remedy these shortcomings, the VFP algorithm for solving fully-ordered CR-DEC-MDPs is proposed. VFP borrows from OC-DEC-MDP the idea of performing a series of policy improvement phases to find a locally optimal solution. Yet, VFP addresses the above-mentioned shortcomings of the OC-DEC-MDP algorithm: First, VFP maintains and manipulates opportunity cost functions and probability functions over time for each method, rather than discrete opportunity costs and probabilities for each pair of method and time interval. Such a representation allows VFP to group the time points for which the opportunity cost function changes at the same rate, which translates into significant speedups of the policy improvement phases. Second, VFP's functional representation preserves the structure of the underlying opportunity cost functions, which allows VFP to identify and correct the critical overestimations of the expected utilities of method execution.

The general scheme of the VFP algorithm is identical to that of the OC-DEC-MDP algorithm: both algorithms perform a series of policy improvement iterations. However, instead of propagating opportunity costs and probabilities in each iteration, VFP propagates opportunity cost functions and probability functions. These two phases are therefore referred to as the opportunity cost function propagation and the probability function propagation. In order to implement these phases using the functional representation, for each method $m_i \in M$, the VFP algorithm employs three different functions (related to each other in Equation 5.4):
Opportunity Cost Function $O_i(t)$ maps time $t \in [0, \Delta]$ to the expected utility for starting the execution of method $m_i$ at time t, assuming that $m_i$ is enabled (i.e., assuming that the execution of the methods $m_j$ such that $\langle j, i \rangle \in C_{\prec}$ has been finished successfully before time t).
Probability Function $F_i(t)$ maps time $t \in [0, \Delta]$ to the probability that method $m_i$ will be successfully executed before time t.
Value Function $V_i(t)$ maps time $t \in [0, \Delta]$ to the expected utility for starting the execution of method $m_i$ at time t.
The unique role of the value function $V_i(t)$ is to extract the current policy. Precisely, the policy $\pi_n(\langle i, t \rangle)$ of agent n for method $m_i$ at time t is calculated as follows (see Figure 5.1):

$$\pi_n(\langle i, t \rangle) = \begin{cases} W & \text{if there exists } t' > t \text{ such that } V_i(t') > V_i(t) \\ E & \text{otherwise} \end{cases}$$

where W is the wait action and E is the execute-method action (informally, the agent will Wait if it is more profitable to start the execution of method $m_i$ later).
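As a concrete illustration of this rule, the sketch below extracts a Wait/Execute policy from a value function sampled on a time grid. The sampled representation and the function name are illustrative assumptions, not the actual VFP data structures.

```python
# Illustrative sketch: extract the W/E policy for one method from a sampled
# value function V_i. The dense time grid is an assumption made for clarity;
# VFP itself works with piecewise-defined functions.

def extract_policy(times, values):
    """Return 'E' at time t if no later time has a strictly higher value, else 'W'."""
    policy = [None] * len(times)
    best_future = float("-inf")
    # Scan backwards so that best_future holds the maximum over all strictly later points.
    for idx in range(len(times) - 1, -1, -1):
        policy[idx] = "W" if best_future > values[idx] else "E"
        best_future = max(best_future, values[idx])
    return policy
```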
The next two sections demonstrate how VFP performs a single policy iteration. To this end,
it is first shown how the analytical operations update the opportunity cost functions. It is then
shown how the opportunity cost functions are used to determine the value functions and how to
determine the policy using the value functions. Finally, it is shown how the analytical operations
update the probability functions — a step necessary to prepare the VFP algorithm for its next
policy iteration.
5.1.3 Opportunity Cost Function Propagation Phase
In analogy to [Beynier and Mouaddib, 2005], [Beynier and Mouaddib, 2006], the opportunity cost function propagation phase consists in propagating the opportunity cost functions through the graph $G = (M, C_{\prec})$ of methods M linked by resource precedence constraints $C_{\prec}$, starting from the sink methods and ending at the source methods. In particular, for each method $m_{i_0}$ encountered during this phase (refer to Figure 5.3), the opportunity cost function $O_{i_0}$ is calculated from the opportunity cost functions $O_{j_n}$ of the methods $m_{j_n}$ that follow method $m_{i_0}$ in graph G.
Let $n = 0, 1, \ldots, N$ as shown in Figure 5.3. The opportunity cost function $O_{i_0}(t)$ for time $t \in [0, \Delta]$ is then derived as follows (refer to Equation 5.1): If there is no time window $\langle i_0, t_1, t_2 \rangle \in C_{[]}$ for time t such that $t_1 \le t \le t_2$, the agent will not start the execution of method $m_{i_0}$ because there is no chance that the execution of method $m_{i_0}$ will be successful and hence $O_{i_0}(t) = 0$. Otherwise, the execution of method $m_{i_0}$ can be started, because there is a non-zero chance that this execution will be successful (completed before time $t_2$). In this case, with probability $p_{i_0}(t')$, the execution of method $m_{i_0}$ will take time $t'$ to complete and, as long as $t + t' \le t_2$ (see Figure 5.2), method $m_{i_0}$ will be executed successfully (provided that method $m_{i_0}$ was enabled at time t). Here, the agent will earn the immediate reward $r_{i_0}$ plus $\sum_{n=0}^{N} O_{j_n, i_0}(t + t')$, where $O_{j_n, i_0}(t + t')$ is the opportunity cost that method $m_{i_0}$ will receive for enabling the execution of method $m_{j_n}$ at time $t + t'$. The functions $O_{j_0, i_k}(t + t')$ for $k = 0, 1, \ldots, K$ are called splittings of the opportunity cost function $O_{j_0}$ and it is temporarily assumed here that $O_{j_n, i_0} := O_{j_n}$; this assumption is waived in Section 5.1.5. The above discussion justifies why the opportunity cost $O_{i_0}(t)$ is calculated by:²

$$O_{i_0}(t) = \begin{cases} \int_{0}^{t_2 - t} p_{i_0}(t') \big( r_{i_0} + \sum_{n=0}^{N} O_{j_n, i_0}(t + t') \big)\, dt' & \text{if } \exists \langle i_0, t_1, t_2 \rangle \in C_{[]} \text{ s.t. } t \in [t_1, t_2] \\ 0 & \text{otherwise} \end{cases} \quad (5.1)$$
Figure 5.2: Method execution occurring within an admissible time window
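To make the integral in Equation 5.1 concrete, the following sketch evaluates it numerically on a discretized time axis. The grid-based representation is an assumption made only for illustration; CPH and VFP manipulate these quantities analytically.

```python
import numpy as np

# Illustrative numerical evaluation of Equation 5.1 on a uniform time grid
# (grid step dt). p_dur[d] approximates the duration density of m_i0 at d*dt,
# and future[t] approximates r_i0 + sum_n O_{j_n,i_0}(t*dt).
# A single admissible window [t1, t2] is assumed; outside it O_{i_0}(t) = 0.

def opportunity_cost_grid(p_dur, future, t1, t2, dt):
    n = len(future)
    O = np.zeros(n)
    for t in range(n):
        if not (t1 <= t * dt <= t2):
            continue
        # integrate over durations t' with t + t' <= t2 (successful completion)
        max_d = min(int(t2 / dt) - t, len(p_dur) - 1, n - 1 - t)
        for d in range(max_d + 1):
            O[t] += p_dur[d] * future[t + d] * dt
    return O
```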
It is now shown how to derive $O_{j_0, i_0}$ (the derivation of $O_{j_n, i_0}$ for $n \ne 0$ follows the same scheme). Let $\widehat{O}_{j_0, i_0}(t)$ be the opportunity cost of starting the execution of method $m_{j_0}$ at time t, assuming that method $m_{i_0}$ has been completed. It can be derived by multiplying $O_{j_0}$ by the probability functions of all the methods other than $m_{i_0}$ that enable $m_{j_0}$, that is:

$$\widehat{O}_{j_0, i_0}(t) = O_{j_0}(t) \prod_{k=1}^{K} F_{i_k}(t). \quad (5.2)$$

Note that this derivation is only approximate because, in general, the probability functions $\{F_{i_k}\}_{k=1}^{K}$ do not have to represent independent random variables (a similar approximation is used in [Beynier and Mouaddib, 2005], [Beynier and Mouaddib, 2006]).

² Note that the calculation of $O_{i_0}(t)$ is equivalent to a convolution operation, because for $h(t) := r_{i_0} + \sum_{n=0}^{N} O_{j_n, i_0}(t_2 - t)$ it holds that $O_{i_0}(t) = (p_{i_0} * h)(t_2 - t)$ where $*$ is the convolution operator.
Figure 5.3: Propagation of opportunity cost functions and probability functions through one method: For the opportunity cost function propagation through method $m_{i_0}$, the opportunity cost functions $O_{j_n}$ of methods $m_{j_n}$ for $n = 0, 1, \ldots, N$ are known, and the opportunity cost function $O_{i_0}$ must be calculated. For the probability function propagation through method $m_{j_0}$, the probability functions $F_{i_k}$ of methods $m_{i_k}$ for $k = 1, 2, \ldots, K$ are known, and the probability function $F_{j_0}$ must be calculated.
Now, the opportunity cost $O_{j_0, i_0}(t)$ that method $m_{i_0}$ receives for enabling the execution of method $m_{j_0}$ at time t (used in Equation 5.1) is given by:

$$O_{j_0, i_0}(t) = \max_{t' \ge t} \widehat{O}_{j_0, i_0}(t') \quad (5.3)$$

where $O_{j_0, i_0}$ can be greater than $\widehat{O}_{j_0, i_0}$ since it can be more profitable to delay the execution of method $m_{j_0}$ (as illustrated in Figure 5.4).
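Equation 5.3 is simply a running maximum taken over all future time points; a small sketch of that operation on sampled values is shown below (again, the sampled representation is only an illustration).

```python
# Illustrative sketch of Equation 5.3: O_{j0,i0}(t) = max over t' >= t of Ohat_{j0,i0}(t').
# Operates on values sampled on an increasing time grid.

def running_future_max(values):
    out = list(values)
    for idx in range(len(out) - 2, -1, -1):
        out[idx] = max(out[idx], out[idx + 1])
    return out

# Example: a split opportunity cost that dips and then recovers.
# running_future_max([5.0, 2.0, 4.0, 1.0]) == [5.0, 4.0, 4.0, 1.0]
```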
It has consequently been shown how to propagate the opportunity cost functions: knowing $\{O_{j_n}\}_{n=0}^{N}$ of methods $\{m_{j_n}\}_{n=0}^{N}$ one can derive $O_{i_0}$ of method $m_{i_0}$ by following Equations 5.2, 5.3 and then 5.1. In general, the opportunity cost function propagation phase starts with the sink methods. It then at each time visits a method m such that all the methods that m enables have already been marked as visited. The opportunity cost function propagation phase terminates when all the source methods have been marked as visited.
Figure 5.4: Visualization of the operation performed by Equation 5.3
What remains to be shown is how to calculate the value function $V_{i_0}$, used in Equation 5.1 to identify the policy $\pi_n(\langle i, t \rangle)$. Recall that the value function $V_{i_0}$ differs from the opportunity cost function $O_{i_0}$ in that the opportunity cost function $O_{i_0}$ assumes that method $m_{i_0}$ is enabled (the methods $m_j$ that enable $m_{i_0}$ have been executed successfully) whereas the value function $V_{i_0}$ does not make this assumption. When agent n is about to start the execution of method $m_{i_0}$ it knows that its own method $m_{l_0} \in M_n$ such that $\langle l_0, i_0 \rangle \in C_{\prec}$ has been successfully executed. However, agent n is still unsure whether the other agents have completed their methods $\{m_{l_k}\}_{k=1}^{K}$ such that $\langle l_k, i_0 \rangle \in C_{\prec}$. Therefore, the value function $V_{i_0}$ of method $m_{i_0}$ can be calculated as follows:

$$V_{i_0}(t) = O_{i_0}(t) \cdot Pb\big(F_{l_0}(t) = 1 \wedge F_{l_1}(t) = 1 \wedge \ldots \wedge F_{l_K}(t) = 1\big) = O_{i_0}(t) \cdot Pb\big(F_{l_1}(t) = 1 \wedge \ldots \wedge F_{l_K}(t) = 1\big) = O_{i_0}(t) \prod_{k=1}^{K} F_{l_k}(t) \quad (5.4)$$

where the dependency of the probability functions $\{F_{l_k}\}_{k=1}^{K}$ has been ignored (a similar approximation is used in [Beynier and Mouaddib, 2005], [Beynier and Mouaddib, 2006]).
Finally, in order to determine the policy of agent n for method $m_{i_0}$ one must identify the set $Z_{i_0}$ of intervals of method $m_{i_0}$'s activity, that is, intervals $[z, z'] \subset [0, \Delta]$ such that for all $t \in [z, z']$ it holds that $\pi_n(\langle i_0, t \rangle) = E$. As can be seen in Figure 5.1, these intervals of activity in $Z_{i_0}$ are easily identifiable.
5.1.4 Probability Function Propagation Phase
Recall the meaning of the following two functions: $p_i(t)$ is the probability that the execution duration of method $m_i$ will consume time t and $F_i(t)$ is the probability that the execution of method $m_i$ will be completed successfully before time t. Assume now that the opportunity cost function propagation phase has been completed and that the opportunity cost functions $O_j$ and sets $Z_j$ for all methods $m_j \in M$ have been calculated. Since the opportunity cost function propagation phase was using the probability functions $F_i$ found at the previous algorithm iteration (for an old policy), the new probability functions $F_i$ (for the new policy identified by the current sets $Z_j$) now have to be found, to prepare the algorithm for the next policy iteration.
The general case of the probability function propagation is shown in Figure 5.3 where the probability functions $\{F_{i_k}\}_{k=0}^{K}$ of methods $\{m_{i_k}\}_{k=0}^{K}$ are known, and the probability function $F_{j_0}$ of method $m_{j_0}$ is to be derived. From the probability function $F_{i_0}$ and the set $Z_{j_0}$ of intervals of activity one can calculate the probability $F'_{j_0}(t)$ that method $m_{j_0}$ will be started before time t:

$$F'_{j_0}(t) = \begin{cases} F_{i_0}(t) & \text{if } \exists [t_1, t_2] \in Z_{j_0} \text{ such that } t \in [t_1, t_2] \\ F_{i_0}(t_2) & \text{otherwise} \end{cases}$$

where $t_2$ is such that $[t_1, t_2] \in Z_{j_0}$ is the latest interval of activity before time t. Intuitively, between time $t_2$ and t the function $F'_{j_0}$ cannot increase because the method will not be started in the time interval $[t_2, t]$. However, for method $m_{j_0}$ to be executed successfully, method $m_{j_0}$ has to be enabled, i.e., methods $m_{i_1}, \ldots, m_{i_K}$ must be successfully finished before method $m_{j_0}$ starts. Hence, this equation must be modified to account for the probabilities of enabling method $m_{j_0}$ before time t. Thus, one can calculate the probability $F''_{j_0}(t)$ that method $m_{j_0}$ will be successfully started before time t as (ignoring the dependency of $\{F_{i_k}\}_{k=0}^{K}$):
$$F''_{j_0}(t) = \begin{cases} \prod_{k=0}^{K} F_{i_k}(t) & \text{if } \exists [t_1, t_2] \in Z_{j_0} \text{ such that } t \in [t_1, t_2] \\ \prod_{k=0}^{K} F_{i_k}(t_2) & \text{otherwise} \end{cases} \quad (5.5)$$
where $t_2$ is such that $[t_1, t_2] \in Z_{j_0}$ is the latest interval of activity before time t. The probability function $F_{j_0}$ is then derived as follows: For notational convenience, let $f''_{j_0} := \frac{d}{dt} F''_{j_0}$ and $f_{j_0} := \frac{d}{dt} F_{j_0}$. Also, let $Z_{j_0}(t)$ be the set of time intervals $[z_1, z_2] \in Z_{j_0}$ such that $z_2 \le t$, augmented with an additional interval $[z_1, t]$ if $z_1 < t < z_2$. It then holds that:
$$F_{j_0}(t) = \int_0^t f_{j_0}(y)\, dy = \sum_{[t_1, t_2] \in Z_{j_0}(t)} \int_{t_1}^{t_2} f_{j_0}(y)\, dy$$
(because $f_{j_0}(y)$ is equal to 0 outside the intervals of activity of $Z_{j_0}$)
$$= \sum_{[t_1, t_2] \in Z_{j_0}(t)} \int_{t_1}^{t_2} \int_{t_1}^{y} f''_{j_0}(x)\, p_{j_0}(y - x)\, dx\, dy$$
(the execution of method $m_{j_0}$ will finish at time y if it starts at time $x \in [t_1, y]$ and lasts for $y - x$ time units)
$$= \sum_{[t_1, t_2] \in Z_{j_0}(t)} \int_{t_1}^{t_2} \int_{t_1}^{y} f''_{j_0}(y - x)\, p_{j_0}(x)\, dx\, dy$$
(because the arguments x and $y - x$ in the nested integral are interchangeable)
$$= \sum_{[t_1, t_2] \in Z_{j_0}(t)} \int_{t_1}^{t_2} \int_{x}^{t_2} f''_{j_0}(y - x)\, p_{j_0}(x)\, dy\, dx$$
(the order of integration has been reversed and the integration ranges updated accordingly)
$$= \sum_{[t_1, t_2] \in Z_{j_0}(t)} \int_{t_1}^{t_2} p_{j_0}(x) \int_{x}^{t_2} f''_{j_0}(y - x)\, dy\, dx = \sum_{[t_1, t_2] \in Z_{j_0}(t)} \int_{t_1}^{t_2} p_{j_0}(x) \int_{0}^{t_2 - x} f''_{j_0}(u)\, du\, dx$$
($y - x$ has been substituted with u)
$$= \sum_{[t_1, t_2] \in Z_{j_0}(t)} \int_{t_1}^{t_2} p_{j_0}(x)\, F''_{j_0}(t_2 - x)\, dx = \sum_{[t_1, t_2] \in Z_{j_0}(t)} \big[ (p_{j_0} * F''_{j_0})(t_2) - (p_{j_0} * F''_{j_0})(t_1) \big] \quad (5.6)$$

where $F''_{j_0}$ is given by Equation 5.5.
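Under the same grid-based illustration used above, Equations 5.5 and 5.6 can be evaluated as follows. The helper names, the discretization and the convention of starting at probability 0 before the first interval of activity are assumptions of this sketch, not part of VFP's actual piecewise-linear machinery.

```python
import numpy as np

# Illustrative sketch of Equations 5.5 and 5.6 on a uniform time grid (step dt).
# enabling[k][t] approximates F_{i_k}(t*dt), activity is the set Z_{j0} given as a
# list of grid-index intervals (a, b), and p_dur[d] approximates p_{j0} at duration d*dt.

def successfully_started(enabling, activity):
    prod = np.prod(np.asarray(enabling), axis=0)  # product of the enabling probabilities
    F2, last = np.zeros(len(prod)), 0.0
    for t in range(len(prod)):
        if any(a <= t <= b for (a, b) in activity):
            last = prod[t]          # inside an interval of activity (Equation 5.5, first case)
        F2[t] = last                # otherwise frozen at its last in-interval value (second case)
    return F2

def finished_before(F2, p_dur, activity, t, dt):
    def conv(u):                    # grid approximation of (p_{j0} * F''_{j0})(u*dt)
        return sum(p_dur[d] * F2[u - d] * dt for d in range(min(u, len(p_dur) - 1) + 1))
    total = 0.0
    for (a, b) in activity:         # Equation 5.6: sum over the intervals in Z_{j0}(t)
        if a > t:
            continue
        total += conv(min(b, t)) - conv(a)
    return total
```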
It has consequently been shown how to propagate the probability functions $\{F_{i_k}\}_{k=0}^{K}$ of methods $\{m_{i_k}\}_{k=0}^{K}$ to obtain the probability function $F_{j_0}$ of method $m_{j_0}$. In general, the probability function propagation phase starts with the source methods $m_s \in M$ for which $F''_s$ is calculated from Equation 5.5 using the known sets $Z_s$. The algorithm then at each time visits a method m such that all the methods that enable m have already been marked as visited. The probability function propagation phase terminates when all the sink methods have been marked as visited.
To summarize, the VFP algorithm starts with the earliest starting time policy $\pi_0$. The algorithm then iterates. At each iteration it: (i) propagates backwards the opportunity cost functions $O_i$ using the probability functions $F_i$ from the previous algorithm iteration and establishes the new sets $Z_i$ of method activity intervals and then (ii) propagates forwards the new probability functions $F_i$ using the newly established sets $Z_i$. These new probability functions $F_i$ are then used in the next iteration of the algorithm. Similarly to the OC-DEC-MDP algorithm, the VFP algorithm terminates if a new policy does not improve the policy from the previous algorithm iteration by more than a small value $\epsilon$.
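A schematic sketch of this iteration loop is given below. The propagation routines and the evaluation function are placeholders for the opportunity cost function and probability function propagation phases described above; they are not the actual implementation.

```python
# Schematic sketch of VFP's policy iteration loop. propagate_oc, propagate_prob and
# evaluate_policy stand in for the phases of Sections 5.1.3 and 5.1.4.

def vfp(methods, epsilon, propagate_oc, propagate_prob, evaluate_policy,
        earliest_start_policy, max_iters=100):
    policy = earliest_start_policy(methods)      # start with the earliest starting time policy
    F = propagate_prob(methods, policy)          # probability functions for the initial policy
    best_quality = evaluate_policy(methods, policy)
    for _ in range(max_iters):
        oc_functions, activity_intervals = propagate_oc(methods, F)  # backward pass
        new_policy = activity_intervals          # the activity intervals define the new policy
        quality = evaluate_policy(methods, new_policy)
        if quality - best_quality <= epsilon:    # stop when the improvement is marginal
            break
        policy, best_quality = new_policy, quality
        F = propagate_prob(methods, policy)      # forward pass: new probability functions
    return policy
```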
5.1.5 Splitting the Opportunity Cost Functions
In Section 5.1.3 it was left unresolved how the opportunity cost function $O_{j_0}$ of method $m_{j_0}$ is split into the opportunity cost functions $\{O_{j_0, i_k}\}_{k=0}^{K}$ sent back to the methods $\{m_{i_k}\}_{k=0}^{K}$ that enable method $m_{j_0}$. So far, an approach similar to [Beynier and Mouaddib, 2005] and [Beynier and Mouaddib, 2006] was taken, referred to as the $H_{\langle 1,1 \rangle}$ heuristic.
Figure 5.5: Splitting the opportunity cost functions
Formally, the $H_{\langle 1,1 \rangle}$ heuristic postulates that, for each method $m_{i_k}$ that enables method $m_{j_0}$, the opportunity cost $O_{j_0, i_k}(t)$ is given by:

$$O_{j_0, i_k}(t) := \max_{t' > t} \widehat{O}_{j_0, i_k}(t') = \max_{t' > t} \Big( O_{j_0} \prod_{\substack{k' \in \{0,\ldots,K\} \\ k' \ne k}} F_{i_{k'}} \Big)(t'). \quad (5.7)$$
It will be proven shortly that this heuristic overestimates the opportunity cost.
Three problems might arise when splitting the opportunity cost functions: (i) overestimation, (ii) underestimation or (iii) starvation. Consider the situation in Figure 5.5 where methods $\{m_{i_k}\}_{k=0}^{K}$ enable method $m_{j_0}$. When the opportunity cost function is propagated through methods $\{m_{i_k}\}_{k=0}^{K}$, for each $k = 0, \ldots, K$, Equation (5.1) determines the opportunity cost function $O_{i_k}$ from the immediate reward $r_k$ and the opportunity cost function $O_{j_0, i_k}$. If method $m_0$ is the only method that enables method $m_{i_k}$, the opportunity cost function $O_{i_k, 0} = O_{i_k}$ is propagated only to method $m_0$, and consequently, the opportunity cost for completing the execution of method $m_0$ at time t is equal to $\sum_{k=0}^{K} O_{i_k, 0}(t)$.
If the value of $\sum_{k=0}^{K} O_{i_k, 0}(t)$ is overestimated, agent n that is about to start the execution of method $m_0$ will have too much incentive to finish the execution of method $m_0$ at time t. Consequently, even when the probability $F(t)$ that method $m_0$ will be enabled by other agents before time t is low, agent n might still find the expected utility of starting the execution of $m_0$ at time t higher than the expected utility of doing it later (e.g. in the next time window of method $m_0$). Hence, at time t the agent will choose to start executing method $m_0$ rather than waiting, which can have undesirable consequences.

Conversely, if the value of $\sum_{k=0}^{K} O_{i_k, 0}(t)$ is underestimated, agent n might lose interest in enabling the execution of the future methods $\{m_{i_k}\}_{k=0}^{K}$ and just focus on maximizing the likelihood of earning the immediate reward $r_0$ for the successful execution of method $m_0$. Since this likelihood can increase when the agent waits, the agent might decide that at time t it is more profitable to wait rather than to start the execution of method $m_0$,
which can also have undesirable consequences.

Finally, one can easily avoid the overestimation and underestimation of $\sum_{k=0}^{K} O_{i_k, 0}(t)$ if the opportunity cost function $O_{j_0}$ is split such that $O_{j_0, i_{k_0}} = 1$ and $O_{j_0, i_k} = 0$ for $k \in \{0, 1, \ldots, K\} \setminus \{k_0\}$. However, in such a case, agents executing methods $m_{i_k}$ can underestimate the opportunity cost of enabling method $m_{j_0}$; this problem is referred to as the starvation of method $m_{i_k}$. This short discussion shows the importance of devising heuristics that split the opportunity cost function $O_{j_0}$ in a way that remedies the overestimation, underestimation and starvation problems. In the following it is shown how to arrive at such heuristics.
Theorem 2. Heuristic $H_{\langle 1,1 \rangle}$ overestimates the opportunity cost.
Proof. To prove the theorem it is sufficient to construct a case where the overestimation occurs. Consider the situation in Figure 5.5 when the $H_{\langle 1,1 \rangle}$ heuristic splits the opportunity cost function $O_{j_0}$ into the opportunity cost functions $O_{j_0, i_k} = O_{j_0} \prod_{k' \in \{0,\ldots,K\},\, k' \ne k} F_{i_{k'}}$ sent to methods $m_{i_k}$ for all $k = 0, 1, \ldots, K$. Let the method rewards be $r_{i_k} = 0$ and the method execution time windows be $\langle i_k, 0, \Delta \rangle \in C_{[]}$ for $k = 0, \ldots, K$. In order to prove that the overestimation of the opportunity cost occurs, a time $t_0 \in [0, \Delta]$ must be found for which the opportunity cost $\sum_{k=0}^{K} O_{i_k}(t_0)$ is greater than the opportunity cost $O_{j_0}(t_0)$. For the domain described above, Equation (5.1) states that:

$$O_{i_k}(t) = \int_0^{\Delta - t} p_{i_k}(t')\, O_{j_0, i_k}(t + t')\, dt'$$

Summing over all the methods $m_{i_k}$ that enable method $m_{j_0}$ it holds that:

$$\sum_{k=0}^{K} O_{i_k}(t) = \sum_{k=0}^{K} \int_0^{\Delta - t} p_{i_k}(t')\, O_{j_0, i_k}(t + t')\, dt' \quad (5.8)$$
$$= \sum_{k=0}^{K} \int_0^{\Delta - t} p_{i_k}(t') \max_{t'' \ge t + t'} \widehat{O}_{j_0, i_k}(t'')\, dt' \ge \sum_{k=0}^{K} \int_0^{\Delta - t} p_{i_k}(t')\, \widehat{O}_{j_0, i_k}(t + t')\, dt'$$
$$= \sum_{k=0}^{K} \int_0^{\Delta - t} p_{i_k}(t')\, O_{j_0}(t + t') \prod_{\substack{k' \in \{0,\ldots,K\} \\ k' \ne k}} F_{i_{k'}}(t + t')\, dt'$$

Now, let $c \in (0, 1]$ and $t_0 \in [0, \Delta]$ be such that for all $t > t_0$ and $k \in \{0, \ldots, K\}$ it holds that $\prod_{k' \in \{0,\ldots,K\},\, k' \ne k} F_{i_{k'}}(t) > c$. The lower bound of $\sum_{k=0}^{K} O_{i_k}(t_0)$ for the argument $t_0$ is then as follows:

$$\sum_{k=0}^{K} O_{i_k}(t_0) > \sum_{k=0}^{K} \int_0^{\Delta - t_0} p_{i_k}(t')\, O_{j_0}(t_0 + t')\, c\, dt'$$

because $F_{i_{k'}}$ is non-decreasing. Now, suppose that there exists a time point $t_1 \in (t_0, \Delta]$ such that $\sum_{k=0}^{K} \int_0^{t_1 - t_0} p_{i_k}(t')\, dt' > O_{j_0}(t_0) / (c\, O_{j_0}(t_1))$. Since decreasing the upper limit of the integration from $\Delta - t_0$ to $t_1 - t_0$ decreases the value of the integral (the function under the integral is positive), it holds that:

$$\sum_{k=0}^{K} O_{i_k}(t_0) > c \sum_{k=0}^{K} \int_0^{t_1 - t_0} p_{i_k}(t')\, O_{j_0}(t_0 + t')\, dt'$$

And since $O_{j_0}(t_0 + t')$ is non-increasing:

$$\sum_{k=0}^{K} O_{i_k}(t_0) > c\, O_{j_0}(t_1) \sum_{k=0}^{K} \int_0^{t_1 - t_0} p_{i_k}(t')\, dt' > c\, O_{j_0}(t_1)\, \frac{O_{j_0}(t_0)}{c\, O_{j_0}(t_1)} = O_{j_0}(t_0)$$

Consequently, the opportunity cost $\sum_{k=0}^{K} O_{i_k}(t_0)$ for starting the execution of methods $\{m_{i_k}\}_{k=0}^{K}$ at time $t_0 \in [0, \Delta]$ is greater than the opportunity cost $O_{j_0}(t_0)$, which proves the theorem. The overestimation of the opportunity cost caused by using the $H_{\langle 1,1 \rangle}$ heuristic is visualized in Figure 6.1.1.
In the following, three heuristics that remedy the opportunity cost overestimation problem are proposed (a sketch of the normalized variant is given after the list):

Heuristic $H_{\langle 1,0 \rangle}$: Only method $m_{i_k}$ gets the reward for enabling method $m_{j_0}$, i.e., $O_{j_0, i_k}(t) = \big( O_{j_0} \prod_{k' \in \{0,\ldots,K\},\, k' \ne k} F_{i_{k'}} \big)(t)$ and $O_{j_0, i_{k'}}(t) = 0$ for $k' \in \{0, \ldots, K\} \setminus \{k\}$.

Heuristic $H_{\langle 1/2, 1/2 \rangle}$: Each method $m_{i_k}$ for $k = 0, 1, \ldots, K$ gets the full opportunity cost for enabling method $m_{j_0}$ divided by the number K of methods enabling method $m_{j_0}$, i.e., $O_{j_0, i_k}(t) = \frac{1}{K} \big( O_{j_0} \prod_{k' \in \{0,\ldots,K\},\, k' \ne k} F_{i_{k'}} \big)(t)$ for $k \in \{0, \ldots, K\}$.
Heuristic $\widehat{H}_{\langle 1,1 \rangle}$: This is a normalized version of the $H_{\langle 1,1 \rangle}$ heuristic in that each method $m_{i_k}$ for $k = 0, 1, \ldots, K$ initially (for small t) gets the full opportunity cost for enabling method $m_{j_0}$. In order to prevent the opportunity cost overestimation, the functions $\{O_{j_0, i_k}\}_{k=0}^{K}$ are normalized when their sum exceeds the opportunity cost function to be split. Formally:

$$O_{j_0, i_k}(t) = \begin{cases} O^{H_{\langle 1,1 \rangle}}_{j_0, i_k}(t) & \text{if } \sum_{k=0}^{K} O^{H_{\langle 1,1 \rangle}}_{j_0, i_k}(t) < O_{j_0}(t) \\[4pt] O_{j_0}(t) \cdot \dfrac{O^{H_{\langle 1,1 \rangle}}_{j_0, i_k}(t)}{\sum_{k=0}^{K} O^{H_{\langle 1,1 \rangle}}_{j_0, i_k}(t)} & \text{otherwise} \end{cases}$$

where $O^{H_{\langle 1,1 \rangle}}_{j_0, i_k}(t) := \big( O_{j_0} \prod_{k' \in \{0,\ldots,K\},\, k' \ne k} F_{i_{k'}} \big)(t)$.
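The normalization step of $\widehat{H}_{\langle 1,1 \rangle}$ can be illustrated with the following sketch operating on sampled function values (an illustrative discretization, not the actual implementation):

```python
import numpy as np

# Illustrative sketch of the normalized splitting heuristic: the candidate splits
# (the H_{1,1} splits O_{j0} * prod_{k' != k} F_{i_k'}) are rescaled wherever their
# sum would exceed the opportunity cost function O_{j0} being split.

def normalized_split(O_j0, candidate_splits):
    candidates = np.asarray(candidate_splits, dtype=float)   # shape: (K+1, num_time_points)
    total = candidates.sum(axis=0)
    scale = np.ones_like(total)
    over = total > O_j0                                      # overestimation would occur here
    scale[over] = O_j0[over] / total[over]
    return candidates * scale                                # the rescaled sum never exceeds O_j0
```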
For the new heuristics it then holds that:

Theorem 3. Heuristics $H_{\langle 1,0 \rangle}$, $H_{\langle 1/2,1/2 \rangle}$ and $\widehat{H}_{\langle 1,1 \rangle}$ do not overestimate the opportunity cost.
Proof. When the $H_{\langle 1,0 \rangle}$ heuristic is used to split the opportunity cost function $O_{j_0}$, only one method (e.g. $m_{i_k}$) gets the opportunity cost for enabling method $m_{j_0}$. Thus:

$$\sum_{k'=0}^{K} O_{i_{k'}}(t) = \int_0^{\Delta - t} p_{i_k}(t')\, O_{j_0, i_k}(t + t')\, dt' \quad (5.9)$$

and, since $O_{j_0}$ is non-increasing,

$$\int_0^{\Delta - t} p_{i_k}(t')\, O_{j_0}(t + t') \prod_{\substack{k' \in \{0,\ldots,K\} \\ k' \ne k}} F_{i_{k'}}(t + t')\, dt' \le \int_0^{\Delta - t} p_{i_k}(t')\, O_{j_0}(t + t')\, dt' \le O_{j_0}(t)$$

The last inequality holds because the opportunity cost function $O_{j_0}$ is non-increasing.

For the $H_{\langle 1/2,1/2 \rangle}$ heuristic the proof is similar:

$$\sum_{k=0}^{K} O_{i_k}(t) \le \sum_{k=0}^{K} \int_0^{\Delta - t} p_{i_k}(t')\, \frac{1}{K}\, O_{j_0}(t + t') \prod_{\substack{k' \in \{0,\ldots,K\} \\ k' \ne k}} F_{i_{k'}}(t + t')\, dt' \le \frac{1}{K} \sum_{k=0}^{K} \int_0^{\Delta - t} p_{i_k}(t')\, O_{j_0}(t + t')\, dt' \le \frac{1}{K}\, K\, O_{j_0}(t) = O_{j_0}(t).$$

For the $\widehat{H}_{\langle 1,1 \rangle}$ heuristic the opportunity cost function $O_{j_0}$ is by definition split such that $\sum_{k=0}^{K} O_{i_k}(t) \le O_{j_0}(t)$. Thus, it has been proven that the new heuristics $H_{\langle 1,0 \rangle}$, $H_{\langle 1/2,1/2 \rangle}$ and $\widehat{H}_{\langle 1,1 \rangle}$ prevent the overestimation of the opportunity cost.
The reason why all three heuristics have been introduced is the following: Since the $H_{\langle 1,1 \rangle}$ heuristic overestimates the opportunity cost, one has to choose which method $m_{i_k}$ should receive the reward for enabling the method $m_{j_0}$, which is exactly what the $H_{\langle 1,0 \rangle}$ heuristic postulates. However, the $H_{\langle 1,0 \rangle}$ heuristic does not reward the other methods for enabling method $m_{j_0}$, which introduces the starvation problem. Starvation can be avoided if the opportunity cost functions are split using the $H_{\langle 1/2,1/2 \rangle}$ heuristic, which provides reward to all enabling methods. However, the sum of the split opportunity cost functions for the $H_{\langle 1/2,1/2 \rangle}$ heuristic can be smaller than the non-zero opportunity cost function for the $H_{\langle 1,0 \rangle}$ heuristic, which is clearly undesirable. This is why the $\widehat{H}_{\langle 1,1 \rangle}$ heuristic, which by definition avoids the overestimation, underestimation and starvation problems, has been introduced.
5.1.6 Implementation of Function Operations
In the previous sections all the function derivations have been carried out without choosing a particular representation of the underlying continuous functions. In general, one can choose from various function approximation techniques such as the piecewise linear [Boyan and Littman, 2000], piecewise constant [Li and Littman, 2005], or piecewise gamma [Marecki et al., 2007] approximation. In the actual implementation of VFP evaluated in Section 6.1, the piecewise linear function representation has been chosen.

When VFP propagates the opportunity cost functions and probability functions it carries out the operations represented by Equations (5.1) and (5.6), which are equivalent to computing the convolution $f(t) = (p * h)(t)$. If time is discretized, the functions $p(t)$ and $h(t)$ are discrete; however, $h(t)$ can be approximated with a piecewise linear function $\widehat{h}(t)$, which is exactly what the VFP algorithm does. As a result, instead of performing $O(\Delta^2)$ multiplications to compute $f(t)$ (as in the case of the OC-DEC-MDP algorithm), VFP only needs to perform $O(k\Delta)$ multiplications to compute $f(t)$, where k is the number of linear segments of $\widehat{h}(t)$ and $k < \Delta$. Since the values of $F_i$ are in the range $[0, 1]$ and the values of $O_i$ are in the range $[0, \sum_{m_i \in M} r_i]$, the opportunity cost functions $O_i$ are chosen to be approximated within the error $\epsilon_O$ and the probability functions $F_i$ within the error $\epsilon_F$. The experimental results in Section 6.1 demonstrate the trade-off between the runtime and solution quality of VFP for different values of $\epsilon_O$ and $\epsilon_F$.
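The runtime claim above can be illustrated with a sketch that convolves a discretized duration density with a piecewise linear function segment by segment. With prefix sums, each linear segment contributes to each output point in constant time, which is the source of the $O(k\Delta)$ cost. The representation and segment encoding are illustrative assumptions, not VFP's actual implementation.

```python
import numpy as np

# Illustrative sketch: f(t) = sum_d p[d] * h(t + d) * dt, where h is piecewise linear.
# Segments are tuples (a, b, c, m) meaning h(x) = c + m * x on grid indices a..b (inclusive).
# Prefix sums of p[d] and d*p[d] make each segment's contribution O(1) per output point,
# so the total cost is O(k * Delta) instead of O(Delta^2).

def convolve_with_piecewise_linear(p, segments, n, dt):
    p = np.asarray(p, dtype=float)
    P = np.concatenate(([0.0], np.cumsum(p)))                     # prefix sums of p[d]
    Q = np.concatenate(([0.0], np.cumsum(p * np.arange(len(p))))) # prefix sums of d * p[d]
    f = np.zeros(n)
    for t in range(n):
        for (a, b, c, m) in segments:
            lo, hi = max(a - t, 0), min(b - t, len(p) - 1)        # durations d with t+d in [a, b]
            if lo > hi:
                continue
            mass = P[hi + 1] - P[lo]                              # sum of p[d] over the range
            first_moment = Q[hi + 1] - Q[lo]                      # sum of d * p[d] over the range
            f[t] += ((c + m * t) * mass + m * first_moment) * dt  # sum of p[d]*(c + m*(t+d))*dt
    return f
```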
5.2 Globally Optimal Solution: The M-DPFP Algorithm
This section introduces an algorithm (M-DPFP³) that finds solutions to CR-DEC-MDPs whose quality is guaranteed to be within an arbitrarily small $\epsilon$ of the quality of the optimal solution.

The idea of M-DPFP is the following: Recall the definition of the CR-DEC-MDP model from Section 2.2.3 where $S_n$ is the set of states of agent n. Determining the optimal actions for all states $s \in S_n$ is impossible since each state s is specified by continuous variables (method starting and finishing times) and consequently, the set $S_n$ contains an infinite number of elements. Yet, it is possible to leverage the DPFP algorithm from Section 3.2 to find $\epsilon$-optimal actions for a finite number of states in $S_n$, referred to as the representative states. One can then prescribe a greedy policy to the non-representative states such that the error of M-DPFP is still expressible in terms of an arbitrarily small parameter $\epsilon$. These ideas are first shown on an example (Section 5.2.1) and later generalized to an arbitrary CR-DEC-MDP (Section 5.2.2).

³ The M-DPFP algorithm exploits the idea of performing a forward search in the space of cumulative distribution functions, introduced by the DPFP algorithm from Section 3.2, hence the name M-DPFP.
5.2.1 Arriving at M-DPFP
Consider the search for an optimal policy for the planning problem introduced in Section 2.1.3.2. Here, both agents know exactly the starting states $s_{1,0} \in S_1$ and $s_{2,0} \in S_2$. However, since no explicit communication is allowed after the planning phase, each agent can only estimate the current state of the other agent during the execution phase. For example, if agent 2 finishes the execution of its method $m_4$ successfully, it can infer that agent 1 has successfully finished the execution of its method $m_1$ (see Figures 5.7 and 5.6).
Figure 5.6: M-DPFP algorithm: Generation of the representative states. Solid arrows represent time axes whereas dotted lines passing through representative states illustrate example execution histories. Each time point $t_i$ has a corresponding state $s_i$ (see Figure 5.7).
Adding Representative States
At a basic level, the M-DPFP algorithm intertwines the generation of representative states with the search for the $\epsilon$-optimal actions to be executed from these states. When the algorithm starts, it generates the representative states $s_{1,0} = (\langle 1, t_0, t_0, 1 \rangle)$ and $s_{2,0} = (\langle 2, t_0, t_0, 1 \rangle)$ which are simply the starting states of agents 1 and 2 (see the description of states in the CR-DEC-MDP model in Section 2.2.3). The algorithm then searches for the best actions to be executed from $s_{1,0}$ and $s_{2,0}$ using a fast lookahead function (similar to the DPFP algorithm from Section 3.2).

Suppose that the lookahead function found that the best action to be executed from state $s_{1,0}$ is $m_1 \in A(s_{1,0})$ and the best action to be executed from state $s_{2,0}$ is $m_6 \in A(s_{2,0})$ (as illustrated in Figure 5.6).
$s_{1,0} = (\langle 1, t_0, t_0, 1\rangle)$
$s_1 = (\langle 1, t_0, t_0, 1\rangle, \langle 1, t_0, t_1, 1\rangle)$
$s_3 = (\langle 1, t_0, t_0, 1\rangle, \langle 1, t_0, t_1, 1\rangle, \langle 2, t_1, t_3, 1\rangle)$
$s_6 = (\langle 1, t_0, t_0, 1\rangle, \langle 1, t_0, t_1, 1\rangle, \langle 2, t_1, t_3, 1\rangle, \langle 3, t_3, t_6, 1\rangle)$
$s_4 = (\langle 1, t_0, t_0, 1\rangle, \langle 1, t_0, t_1, 1\rangle, \langle 2, t_1, t_4, 1\rangle)$
$s_7 = (\langle 1, t_0, t_0, 1\rangle, \langle 1, t_0, t_1, 1\rangle, \langle 2, t_1, t_4, 1\rangle, \langle 3, t_4, t_7, 1\rangle)$
$s_8 = (\langle 1, t_0, t_0, 1\rangle, \langle 1, t_0, t_1, 1\rangle, \langle 2, t_1, t_4, 1\rangle, \langle 3, t_4, t_8, 1\rangle)$
$s_9 = (\langle 1, t_0, t_0, 1\rangle, \langle 1, t_0, t_1, 1\rangle, \langle 2, t_1, t_4, 1\rangle, \langle 3, t_4, t_9, 1\rangle)$
$s_2 = (\langle 1, t_0, t_0, 1\rangle, \langle 1, t_0, t_2, 1\rangle)$
$s_5 = (\langle 1, t_0, t_0, 1\rangle, \langle 1, t_0, t_2, 1\rangle, \langle 2, t_2, t_5, 1\rangle)$
$s_{10} = (\langle 1, t_0, t_0, 1\rangle, \langle 1, t_0, t_2, 1\rangle, \langle 2, t_2, t_5, 1\rangle, \langle 3, t_5, t_{10}, 1\rangle)$
$s_{2,0} = (\langle 2, t_0, t_0, 1\rangle)$
$s_{11} = (\langle 2, t_0, t_0, 1\rangle, \langle 6, t_0, t_{11}, 1\rangle)$
$s_{14} = (\langle 2, t_0, t_0, 1\rangle, \langle 6, t_0, t_{11}, 1\rangle, \langle 5, t_{11}, t_{14}, 1\rangle)$
$s_{18} = (\langle 2, t_0, t_0, 1\rangle, \langle 6, t_0, t_{11}, 1\rangle, \langle 5, t_{11}, t_{14}, 1\rangle, \langle 4, t_{14}, t_{18}, 1\rangle)$
$s_{19} = (\langle 2, t_0, t_0, 1\rangle, \langle 6, t_0, t_{11}, 1\rangle, \langle 5, t_{11}, t_{14}, 1\rangle, \langle 4, t_{14}, t_{19}, 1\rangle)$
$s_{15} = (\langle 2, t_0, t_0, 1\rangle, \langle 6, t_0, t_{11}, 1\rangle, \langle 5, t_{11}, t_{15}, 1\rangle)$
$s_{20} = (\langle 2, t_0, t_0, 1\rangle, \langle 6, t_0, t_{11}, 1\rangle, \langle 5, t_{11}, t_{15}, 1\rangle, \langle 4, t_{15}, t_{20}, 1\rangle)$
$s_{12} = (\langle 2, t_0, t_0, 1\rangle, \langle 6, t_0, t_{12}, 1\rangle)$
$s_{13} = (\langle 2, t_0, t_0, 1\rangle, \langle 6, t_0, t_{12}, 1\rangle, \langle 4, t_{12}, t_{13}, 1\rangle)$
$s_{16} = (\langle 2, t_0, t_0, 1\rangle, \langle 6, t_0, t_{12}, 1\rangle, \langle 4, t_{12}, t_{13}, 1\rangle, \langle 5, t_{13}, t_{16}, 1\rangle)$
$s_{17} = (\langle 2, t_0, t_0, 1\rangle, \langle 6, t_0, t_{12}, 1\rangle, \langle 4, t_{12}, t_{13}, 1\rangle, \langle 5, t_{13}, t_{17}, 1\rangle)$
(5.10)

Figure 5.7: Representative states for the search problem from Figure 5.6. Note that only the representative states for the successful execution of methods are shown.
The algorithm now switches to generating new representative states. For agent 1, the execution of method $m_1$ can finish at any point in time (inside the execution time window of method $m_1$), yet the algorithm can only consider a finite set of representative states that agent 1 can transition to having completed the execution of method $m_1$. Figure 5.6 illustrates a situation where the algorithm considers just two representative states: $s_1 = (\langle 1, t_0, t_0, 1\rangle, \langle 1, t_0, t_1, 1\rangle)$ and $s_2 = (\langle 1, t_0, t_0, 1\rangle, \langle 1, t_0, t_2, 1\rangle)$. In this particular case both $s_1$ and $s_2$ correspond to a situation where agent 1 has executed method $m_1$ successfully (with $q = 1$)⁴, yet in general both successful and unsuccessful cases are considered. Furthermore, $s_1$ corresponds to a situation where the execution of method $m_1$ finished at time $t_1$ and $s_2$ corresponds to a situation where the execution of method $m_1$ finished at time $t_2$. While it is explained later how the algorithm selects these finishing times, assume for now that they are arbitrary.
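The event-sequence view of states used in this example can be captured by a small data structure, sketched below purely for illustration (the field names and numeric values are assumptions):

```python
from dataclasses import dataclass
from typing import Tuple

# Illustrative encoding of M-DPFP's representative states as sequences of events
# <i, l1, l2, q>: method index, start time, finish time, and execution result.

@dataclass(frozen=True)
class Event:
    method: int
    start: float
    finish: float
    result: int          # 1 = successful execution

State = Tuple[Event, ...]

# The two representative states discussed above for agent 1, with illustrative
# numeric values standing in for t0, t1 and t2:
t0, t1, t2 = 0.0, 1.4, 2.6
s1: State = (Event(1, t0, t0, 1), Event(1, t0, t1, 1))
s2: State = (Event(1, t0, t0, 1), Event(1, t0, t2, 1))
```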
The algorithm then switches back to the search for the $\epsilon$-optimal actions to be executed from states $s_1$ and $s_2$ (using the lookahead function), which turn out to be the same, i.e., to execute method $m_2$ (as illustrated in Figure 5.6). Next, the algorithm switches back to the generation of new representative states, for a situation where the execution of method $m_2$ starts in state $s_1$ as well as for a situation where the execution of method $m_2$ starts in state $s_2$. In the example from Figure 5.6, the algorithm has chosen to generate two new representative states $s_3 = (\langle 1, t_0, t_0, 1\rangle, \langle 1, t_0, t_1, 1\rangle, \langle 2, t_1, t_3, 1\rangle)$ and $s_4 = (\langle 1, t_0, t_0, 1\rangle, \langle 1, t_0, t_1, 1\rangle, \langle 2, t_1, t_4, 1\rangle)$ for the execution starting in $s_1$ and just one new representative state $s_5 = (\langle 1, t_0, t_0, 1\rangle, \langle 1, t_0, t_2, 1\rangle, \langle 2, t_2, t_5, 1\rangle)$ for the execution starting in $s_2$. The algorithm then continues to intertwine the generation of representative states with the search for the $\epsilon$-optimal actions to be executed from these states using the technique just discussed. Note that this approach allows M-DPFP to prune many possible policies. For example, in Figure 5.6 agent 1 will never choose to execute method $m_3$ after finishing the execution of method $m_1$.
⁴ Recall that states are sequences of events $e = \langle i, l_1, l_2, q \rangle$ where q is the result of the execution of method $m_i$.
Using the Lookahead Function
The lookahead function mentioned above (explained in detail in Section 5.2.4) not only finds the $\epsilon$-optimal action to be executed from a representative state but also returns the probability distribution of entering future states (after that representative state) when following the $\epsilon$-optimal policy from that representative state. For example, when the lookahead function is called to find the $\epsilon$-optimal action to be executed from a representative state $s_1 = (\langle 1, t_0, t_0, 1\rangle, \langle 1, t_0, t_1, 1\rangle)$ it also returns the probability of entering future states $s = (\langle 1, t_0, t_0, 1\rangle, \langle 1, t_0, t_1, 1\rangle, \ldots)$. Note, however, that these probabilities are not only affected by agent 1's actions, but also by agent 2's actions, e.g., the probability of transitioning to state $s = (\langle 1, t_0, t_0, 1\rangle, \langle 1, t_0, t_1, 1\rangle, \langle 2, t_1, t_3, 1\rangle)$ is affected by the probability that agent 2 will successfully complete the execution of its method $m_5$ before time $t_3$. Consequently, for the lookahead function to correctly determine the probabilities of entering future states $s = (\langle 1, t_0, t_0, 1\rangle, \langle 1, t_0, t_1, 1\rangle, \ldots)$ it must know agent 2's policy first.
To this end it is required that the lookahead function considers the representative states in a specific order, explained in depth in Section 5.2.2. In essence, this specific order ensures that when the lookahead function is called for a representative state s, the lookahead function can correctly estimate (given the information provided in the description of state s) the probability distribution over the current states of all the agents which may have an impact on the policy executed from s. For example, in order to estimate the expected utility of starting the execution of method $m_2$ of agent 1 from state $s_1$ (at time $t_1$) we need to know the probability with which method $m_5$ of agent 2 may have finished before time $t_1$ (because method $m_5$ enables method $m_2$). In the example in Figure 5.6 the $\epsilon$-optimal action for agent 2 in state $s_{2,0}$ is to choose to execute method $m_6$, not $m_5$. So, considering only the $\epsilon$-optimal action from state $s_{2,0}$ is not enough, as it still does not tell us the probability of method $m_5$'s ending before time $t_1$. (This does not yet provide us enough information to compute the expected utility of starting to execute method $m_2$ from state $s_1$, because we still cannot tell the likelihood of finishing the execution of method $m_5$ before time $t_1$.) So now we must consider the $\epsilon$-optimal actions that agent 2 will execute after completing the execution of method $m_6$; in particular, the $\epsilon$-optimal actions from the representative states $s_{11}$ and $s_{12}$ must be found. For state $s_{12}$ the $\epsilon$-optimal action is still not to execute $m_5$ and we must continue to consider state $s_{13}$ before the likelihood of finishing the execution of method $m_5$ before time $t_1$ can be estimated.
To further illustrate the specific ordering in which the lookahead function can be called, consider the following two examples (refer to Figure 5.6). Initially, the lookahead function can be called for the starting state s_{1,0} or the starting state s_{2,0} in any order because either way, the lookahead function knows that the probability distribution over the current states of agent 1 is Pb(s_{1,0}) = 1 and that the probability distribution over the states of agent 2 is Pb(s_{2,0}) = 1. On the other hand, having been called only for the representative state s_{1,0}, the lookahead function cannot be immediately called for the representative state s_1 since at this point the probability distribution over the current states of agent 2 is unknown (because the lookahead function has not been called for state s_{2,0} yet). In the next section the specific ordering in which the lookahead function considers the representative states is formalized.
5.2.2 The M-DPFP Algorithm

In order to formalize the M-DPFP algorithm it is necessary to properly define the ordering in which M-DPFP adds representative states, that is, to define the concepts of a method sequence and a method sequences graph. The formal definition of these two concepts is followed by a concrete example (refer to Figure 5.8).
Definition 5. A method sequence of agent n is a vector φ = (m_{−n}, m_{i_1}, ..., m_{i_k}) of executed methods where m_{−n} is a spoof method completed by agent n before the execution starts and m_{i_1}, ..., m_{i_k} ∈ M_n. Furthermore, Φ_n is the set of all possible sequences of agent n and Φ = ∪_{n=1}^{N} Φ_n is the set of all possible method sequences of all agents. Also, for notational convenience, (φ, m) denotes a method sequence that is a concatenation of vector φ with vector (m) whereas (..., m) denotes a method sequence that ends with a method m. Finally, the set S_φ of states associated with method sequence φ is defined as:

S_φ = { (⟨i_1, l_{i_1,1}, l_{i_1,2}, q_{i_1}⟩, ..., ⟨i_k, l_{i_k,1}, l_{i_k,2}, q_{i_k}⟩) ∈ S  such that  (m_{i_1}, ..., m_{i_k}) = φ }.
For example, in Figure 5.8 method sequences of agent 1 can be (m_{−1}, m_1), (m_{−1}, m_1, m_2), (m_{−1}, m_1, m_3), etc. As mentioned in Section 5.2.1, in order to take advantage of the lookahead function, the representative states must be added in a specific order, i.e., the lookahead function must be able to estimate the probability distribution over the current states of all the agents (given the information provided in the description of the current representative state). To this end, M-DPFP adds representative states one method sequence at a time, considering subsequent method sequences from a list L defined as follows:
Definition 6. A method sequences graph is a directed graph G = (Φ, C) where Φ is the set of nodes and C is the set of arcs. The set of arcs C is constructed as follows: for any two nodes φ_i = (..., m_i), φ_j = (..., m_j) ∈ Φ an arc (φ_i, φ_j) is added to C if φ_j = (φ_i, m) or ⟨i, j⟩ ∈ C_≺, where C_≺ is the set of precedence constraints of the underlying CR-DEC-MDP.

The list L of method sequences is then a topological sorting of the elements of Φ, that is, a method sequence φ_j is added at the end of list L only if all the nodes φ_i that precede node φ_j in graph G (all the nodes φ_i such that there exists an arc (φ_i, φ_j) ∈ C) have already been added to list L.
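Any standard topological sort of the method sequences graph yields such a list L. The sketch below is a minimal illustration (not code from the thesis); node labels are hypothetical method-index tuples in which −1 and −2 stand for the spoof methods.

from collections import defaultdict, deque

def topological_order(nodes, arcs):
    """Return a list L in which every method sequence appears only after
    all of its predecessors in the method sequences graph."""
    indegree = {n: 0 for n in nodes}
    successors = defaultdict(list)
    for phi_i, phi_j in arcs:            # arc (phi_i, phi_j) in C
        successors[phi_i].append(phi_j)
        indegree[phi_j] += 1
    queue = deque(n for n in nodes if indegree[n] == 0)
    L = []
    while queue:
        phi = queue.popleft()
        L.append(phi)
        for nxt in successors[phi]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                queue.append(nxt)
    return L

# Tiny fragment of the Figure 5.8 graph:
nodes = [(-1,), (-1, 1), (-1, 1, 2), (-2,), (-2, 5)]
arcs = [((-1,), (-1, 1)), ((-1, 1), (-1, 1, 2)),
        ((-2,), (-2, 5)), ((-2, 5), (-1, 1, 2))]     # m_5 enables m_2
print(topological_order(nodes, arcs))
# e.g. [(-1,), (-2,), (-1, 1), (-2, 5), (-1, 1, 2)]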
To illustrate the concepts of a method sequence, the method sequences graph and the list L, consider the example in Figure 5.8. Here, each node label is a method sequence, e.g., node −2564 corresponds to method sequence (m_{−2}, m_5, m_6, m_4) of agent 2. Also, arcs represent precedence constraints. For example, node −2564 (which represents the execution of method m_4 after the execution of methods m_5 and m_6) is preceded by node −256 (which represents the execution of method m_6 after the execution of method m_5). Furthermore, node −2564 is preceded by nodes −121, −1231, −11, −131, −1321 because ⟨1, 4⟩ ∈ C_≺ (the execution of method m_4 must be preceded by the successful execution of method m_1). The list L for the graph in Figure 5.8, constructed from the topological sorting of the graph nodes, can therefore be: L = ((m_{−1}), (m_{−2}), (m_{−1}, m_1), (m_{−1}, m_1, m_3), (m_{−1}, m_3), (m_{−1}, m_3, m_1), (m_{−2}, m_5), (m_{−2}, m_5, m_6), (m_{−2}, m_6), (m_{−2}, m_6, m_5), (m_{−1}, m_2), (m_{−1}, m_2, m_1), (m_{−1}, m_2, m_1, m_3), (m_{−1}, m_2, m_3), (m_{−1}, m_2, m_3, m_1), (m_{−2}, m_4), (m_{−2}, m_4, m_6), (m_{−2}, m_4, m_6, m_5), (m_{−2}, m_4, m_5), (m_{−2}, m_4, m_5, m_6), (m_{−1}, m_3, m_2), (m_{−1}, m_3, m_2, m_1), (m_{−2}, m_5, m_4), (m_{−2}, m_5, m_4, m_6), (m_{−2}, m_5, m_6, m_4), (m_{−2}, m_6, m_4), (m_{−2}, m_6, m_4, m_5), (m_{−1}, m_3, m_1, m_2), (m_{−1}, m_1, m_2), (m_{−1}, m_1, m_2, m_3), (m_{−1}, m_1, m_3, m_2), (m_{−2}, m_6, m_5, m_4)).
At a basic level, the M-DPFP algorithm iterates over all the method sequences from L. In particular, for a method sequence φ, it performs the following three steps:

- Constructs a finite set S̃_φ ⊆ S_φ of representative states.

- For each representative state s ∈ S̃_φ, calls the lookahead function to determine the ε-optimal action to be executed from s.

- Prunes from the list L all the method sequences that cannot be encountered while executing the ε-optimal policy.

Figure 5.8: Method sequences graph
Precisely, the algorithm starts by calling the ArrangeSequences() function to arrange the method sequences into a list L (a topological sort of the method sequences in Φ). The algorithm then calls the lookahead function OptimalMethod(s_{i,0}) for the starting state s_{i,0} of each agent in order to find the ε-optimal actions to be executed from these starting states as well as to prune from list L some of the method sequences that are not reachable via the optimal policy. The algorithm then iterates over the remaining method sequences in L. In particular, for a sequence φ ∈ L it first identifies the agent i that this sequence belongs to. The algorithm then incrementally builds a set S̃_φ of representative states and a set Z_φ of optimal actions in these states. Initially, both S̃_φ and Z_φ are empty.

If the error ComputeError(S̃_φ) of the current policy for all states s ∈ S_φ is greater than the tolerable error maxError, the algorithm tries to reduce the error ComputeError(S̃_φ) by expanding the set S̃_φ of representative states with a new representative state s ∈ S_φ (functions NewRepresentativeState(S_φ, S̃_φ) and ComputeError(S̃_φ) are described in Section 5.2.3). To this end, the algorithm calls the lookahead function OptimalMethod(s) to find the optimal method m to be executed from state s of agent i (described in Section 5.2.4), assigns action m to the optimal policy π(s) for state s and updates sets Z_φ and S̃_φ respectively.

As soon as the error ComputeError(S̃_φ) is within the tolerable error maxError, the algorithm stops expanding the set S̃_φ with new representative states. It then employs the set Z_φ to remove from the list L the method sequences that are not reachable via the optimal policy. The M-DPFP algorithm terminates when all method sequences from the pruned list L have been analyzed. At this point, the algorithm returns the optimal policy π defined for representative states. The error of the greedy policy for the non-representative states (described in Section 5.2.3.3) is then guaranteed to be less than maxError.

In the following three sections it is explained how to:

- Select new representative states (the NewRepresentativeState(S_φ, S̃_φ) function);

- Bound the error of the algorithm (the ComputeError(S̃_φ) function);

- Find the optimal action to be executed from a representative state (the OptimalAction(s) function).
Algorithm 2 M-DPFP(maxError)
1:  L ← ArrangeSequences()
2:  for all agents i = 1, 2, ..., N do
3:      π(s_{i,0}) ← OptimalAction(s_{i,0})
4:      L ← L \ {(m_{−i}, m, ...) ∈ Φ_i such that m ≠ π(s_{i,0})}
5:  for all φ ∈ L do
6:      i ← n such that φ ∈ Φ_n
7:      Z_φ ← ∅
8:      S̃_φ ← ∅
9:      while ComputeError(S̃_φ) > maxError do
10:         s ← NewRepresentativeState(S_φ, S̃_φ)
11:         m ← OptimalAction(s)
12:         π(s) ← m
13:         Z_φ ← Z_φ ∪ {m}
14:         S̃_φ ← S̃_φ ∪ {s}
15:     L ← L \ {(φ, m', ...) ∈ Φ_i such that m' ∉ Z_φ}
16: return π
5.2.3 Representative States

Two heuristics according to which new representative states are selected are introduced in this section. Then, the greedy policy for non-representative states is explained. Finally, the formula for calculating the error of the M-DPFP algorithm is provided.

5.2.3.1 Representative State Selection

Consider the situation when the M-DPFP algorithm calls the NewRepresentativeState(S_φ, S̃_φ) function to pick a new representative state s (later added to the set S̃_φ of representative states associated with the method sequence φ ∈ L). Suppose that φ = (φ_{−1}, m_k) for some method sequence φ_{−1} ∈ L and method m_k ∈ M. For example, in Figure 5.6, φ = (m_{−1}, m_1, m_2), φ_{−1} = (m_{−1}, m_1) and m_k = m_2.
Upon considering a method sequence φ, the M-DPFP algorithm has already considered the method sequence φ_{−1} because, by definition, φ_{−1} is before φ in list L. Furthermore, there exists at least one representative state s_{−1} ∈ S̃_{φ−1} for which the optimal action is to start executing method m_k because otherwise, φ would have been pruned from L (it would not have been visitable via the optimal policy). For each such representative state s_{−1} the execution of method m_k can finish with an outcome q ∈ {0, 1}. Hence, the set S̃_φ ⊆ S_φ of representative states for sequence φ is given by:

S̃_φ = ∪_{s_{−1} ∈ S̃_{φ−1}: π(s_{−1}) = m_k,  q ∈ {0,1}}  S̃_{φ,s_{−1},q}

where S̃_{φ,s_{−1},q} is the set of representative states that the agent can transition to after it finished executing method m_k from state s_{−1} with an outcome q. (The value |S̃_φ| / |S̃_{φ−1}| is referred to as the branching factor for representative states.)
For example, in Figure 5.6, where only successful executions of methods have been considered (for explanatory purposes), we have s_{−1} ∈ S̃_{(m_{−1},m_1)} = {s_1, s_2} and hence:

S̃_{(m_{−1},m_1,m_2)} = S̃_{(m_{−1},m_1,m_2),s_1,0} ∪ S̃_{(m_{−1},m_1,m_2),s_1,1} ∪ S̃_{(m_{−1},m_1,m_2),s_2,0} ∪ S̃_{(m_{−1},m_1,m_2),s_2,1} = {s_3, s_4} ∪ {s_5} = {s_3, s_4, s_5}

because the sets S̃_{(m_{−1},m_1,m_2),s_1,0} and S̃_{(m_{−1},m_1,m_2),s_2,0} are empty (for explanatory purposes).
When the NewRepresentativeState(S_φ, S̃_φ) function is called, it first selects a state s_{−1} ∈ S̃_{φ−1} and an outcome q ∈ {0, 1}. (The selection mechanism considers all possible combinations of s_{−1} ∈ S̃_{φ−1} and q ∈ {0, 1}.) The NewRepresentativeState function then attempts to pick a new representative state s ∈ S_φ and add it to the set S̃_{φ,s_{−1},q}. Let s_{−1} = (e_1, ..., e_m) and recall that method m_k has been executed in state s_{−1} with an outcome q. Initially, when the set S̃_{φ,s_{−1},q} = ∅, the NewRepresentativeState function adds two delimiting representative states to it: the first representative state is associated with the earliest completion time of method m_k, i.e., it is given by (e_1, ..., e_m, ⟨k, t_1, t_1 + d⁻_k, q⟩) where d⁻_k is the minimum duration of the execution of method m_k. On the other hand, the second representative state is associated with the latest completion time of method m_k, i.e., it is given by (e_1, ..., e_m, ⟨k, t_1, min{t_1 + d⁺_k, τ_k}, q⟩) where d⁺_k is the maximum duration of the execution of method m_k and τ_k is the latest admissible execution time of method m_k.
In general, the NewRepresentativeState function encounters a situation where the current set S̃_{φ,s_{−1},q} of representative states already contains some representative states (including the two delimiting representative states), i.e., S̃_{φ,s_{−1},q} = {(e_1, ..., e_m, ⟨k, t_1, t_i, q⟩)}_{i=1}^{n}. Here, the NewRepresentativeState function adds to S̃_{φ,s_{−1},q} a new representative state s = (e_1, ..., e_m, ⟨k, t_1, t', q⟩) where t' is determined using one of the heuristics below:
Uniform Time Heuristic (H_UT) chooses representative states such that they uniformly cover the time range [t_1 + d⁻_k, min{t_1 + d⁺_k, τ_k}], i.e., such that the maximum distance between two adjacent representative states is minimized. Formally, for time points t_0, ..., t_n sorted in the increasing order of their values, H_UT chooses t' to be:

t' := (t_j + t_{j+1}) / 2

where j is such that the distance t_{j+1} − t_j between time points t_{j+1} and t_j is maximized, that is:

j = arg max_{i=0,...,n−1} (t_{i+1} − t_i).
For example, suppose that the set of representative states already contains three representative states with their corresponding times t_0 = 5, t_1 = 10, and t_2 = 20. Here, because t_2 − t_1 = 10, which is greater than t_1 − t_0 = 5, the H_UT heuristic will pick j = 1, which will result in adding a new representative state associated with time t' = (10 + 20)/2 = 15.
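A minimal sketch of the H_UT selection rule (not code from the thesis): find the widest gap between adjacent times and split it in half.

def h_ut(times):
    """Uniform Time Heuristic: return the midpoint of the widest gap
    between adjacent representative-state times."""
    ts = sorted(times)
    j = max(range(len(ts) - 1), key=lambda i: ts[i + 1] - ts[i])
    return (ts[j] + ts[j + 1]) / 2

print(h_ut([5, 10, 20]))  # 15.0, as in the worked example above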
The main advantage of the H_UT heuristic is that it is easy to implement. However, the H_UT heuristic can in practice be very ineffective because it ignores the likelihood of transitioning to a new representative state s. As a result, the representative states in the set S̃_{φ,s_{−1},q} can be situated far away from the states the process transitions to during the execution phase.
Uniform Probability Heuristic (H_UP) improves on the H_UT heuristic in that it uses the likelihood of transitioning to a representative state as a metric for picking a new representative state. The H_UP heuristic first estimates^5 the probabilities F_{φ,s_{−1},q}(t) that the execution of method m_k started in state s_{−1} will be completed before time t with outcome q. For notational convenience let F := F_{φ,s_{−1},q}. The H_UP heuristic attempts to uniformly cover the probability range [0, 1], i.e., it attempts to minimize the maximum distance between two adjacent probability values F(t). Formally, for time points t_0, ..., t_n and their corresponding probability values F(t_0), ..., F(t_n) sorted in the increasing order, H_UP chooses t' to be:

t' := F⁻¹( (F(t_j) + F(t_{j+1})) / 2 )

where F⁻¹ is the inverse function of F, i.e., F⁻¹(F(t)) = t for all t, whereas j is found using the formula below:

j = arg max_{i=0,...,n−1} (F(t_{i+1}) − F(t_i)).

^5 This estimation uses the Monte-Carlo sampling explained in Section 5.2.4.
Figure 5.9: Demonstration of the H_UP heuristic
To understand how H_UP works in practice and why it is more efficient than the H_UT heuristic, consider the situation in Figure 5.9. Here f := dF/dt is the probability density function of the time when the execution of method m_k will terminate. Assume that function f is given by the multi-uniform distribution below (see Figure 5.9):

f(t) = 1/2  if 0 ≤ t < 1,   and   f(t) = 1/8  if 1 ≤ t < 5.
Suppose that the H_UP heuristic is used to add 9 representative states. The first two representative states are the delimiting representative states that correspond to t = 0 and t = 5. Now, the heuristic aims to cover the probability range [0, 1] uniformly. First, the representative state at t = 1 is added since it splits the probability range [0, 1] into two ranges [0, 1/2] and [1/2, 1] of equal length (such splitting minimizes the length of the longest probability range). After adding 6 additional representative states (also in a way that minimizes the length of the longest probability range) all the probability ranges will have the length of 1/8 and the set of representative states will contain elements that correspond to times 0, 1/4, 1/2, 3/4, 1, 2, 3, 4, 5, as shown in Figure 5.9. Notice an important phenomenon: since it is 4 times more likely that the execution of method m_k will be finished in the time interval [0, 1) than in the time interval [1, 2), the H_UP heuristic introduces 4 times more representative states for the time interval [0, 1) than for the time interval [1, 2). As a result, the representative states in the set S̃_{φ,s_{−1},q} are more likely to be situated close to the states the process transitions to during the execution phase.
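A minimal sketch of the H_UP selection rule (again, not code from the thesis): split the widest gap in CDF values and map the midpoint back to a time through the inverse CDF. With the multi-uniform density from Figure 5.9 it reproduces the worked example above.

import bisect

def h_up(times, F, F_inv):
    """Uniform Probability Heuristic: split the widest gap in CDF values
    and map the midpoint back to a time through the inverse CDF."""
    ts = sorted(times)
    Fs = [F(t) for t in ts]
    j = max(range(len(ts) - 1), key=lambda i: Fs[i + 1] - Fs[i])
    return F_inv((Fs[j] + Fs[j + 1]) / 2)

# CDF (and its inverse) of the multi-uniform density from Figure 5.9.
def F(t):
    return 0.5 * t if t < 1 else 0.5 + (t - 1) / 8.0

def F_inv(p):
    return p / 0.5 if p <= 0.5 else 1 + (p - 0.5) * 8.0

times = [0.0, 5.0]                 # the two delimiting representative states
for _ in range(7):                 # add 7 more, for 9 in total
    bisect.insort(times, h_up(times, F, F_inv))
print(times)   # [0.0, 0.25, 0.5, 0.75, 1.0, 2.0, 3.0, 4.0, 5.0]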
5.2.3.2 Policy for non-Representative States

Prior to developing the formula for the error bound of M-DPFP, it is necessary to explain the greedy policy that agents follow when they transition to non-representative states. Intuitively, the greedy policy for a non-representative state s is to execute the action from a representative state s̃ which is before s and as close as possible to s. Formally, recall that method sequences φ and φ_{−1} are such that φ = (φ_{−1}, m_k). The greedy policy is then defined as follows (refer to Figure 5.10):

Figure 5.10: Greedy policy for the non-representative states
Definition 7. The greedy policy postulates that if an agent:

- was in a non-representative state s_{−1} ∈ S_{φ−1},

- started in state s_{−1} the execution of action π(s̃_{−1}) = m_k where s̃_{−1} ∈ S̃_{φ−1},

- transitioned with an outcome q to a non-representative state s ∈ S_φ,

then the agent will start in state s the execution of an action π(s̃), where:

- s̃ is a representative state from the set S̃_{φ,s̃_{−1},q};

- s̃ is before s, i.e., if s̃ = (..., ⟨k, t_1, t̃, q⟩) and s = (..., ⟨k, t_1, t, q⟩) then t̃ ≤ t;

- s̃ is as close as possible to s, i.e., there exists no s̃' ∈ S̃_{φ,s̃_{−1},q} such that s̃ is before s̃' and s̃' is before s.
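In other words, the agent acts as it would in the latest representative state whose completion time does not exceed its own. A minimal sketch, with made-up data structures rather than ones defined in the thesis:

def greedy_action(t, rep_times, policy):
    """Greedy policy for a non-representative state whose last method
    finished at time t: act as in the closest representative state of the
    same (phi, s_tilde_{-1}, q) set that is not after t (Definition 7).
    `rep_times` maps completion time -> representative state id."""
    closest = max(rt for rt in rep_times if rt <= t)
    return policy[rep_times[closest]]

# Toy usage with hypothetical representative states and actions:
reps = {0.0: "s3", 2.5: "s4", 4.0: "s5"}
pi = {"s3": "m_2", "s4": "m_2", "s5": "m_3"}
print(greedy_action(3.1, reps, pi))   # acts as in s4 -> "m_2"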
5.2.3.3 Error Bound

In order to bound the error of the M-DPFP algorithm it is necessary to combine the two following errors:

- the error produced by the lookahead function (described in Section 5.2.4);

- the error produced by executing the greedy policy.

Consider the situation in Figure 5.11 where, according to the greedy policy, action π(s̃_i) = m_g should be executed in a non-representative state s. The error of executing in state s a greedy action m_g instead of an optimal action m_o can be bounded as follows: let s̃_i, s̃_{i+1} ∈ S̃_φ be two adjacent representative states such that state s ∈ S_φ is located between s̃_i and s̃_{i+1}. The times at which the agent transitions to states s̃_i, s, s̃_{i+1} are t_i, t_s, t_{i+1} respectively. Furthermore, t_{i+1} − t_i = δ as shown in Figure 5.11. Also, because states s̃_i, s, s̃_{i+1} correspond to the same execution sequence, the actions applicable in states s̃_i, s, s̃_{i+1} are the same, i.e., A(s) = A(s̃_i) = A(s̃_{i+1}) = {m_j}_{j∈J}.

The lookahead function called for a representative state s̃_i returns (within the accuracy ε) the expected utilities u_{i,j} of executing in state s̃_i actions m_j ∈ A(s̃_i) (explained in Section 5.2.4).
Figure 5.11: Greedy policy and the error bound: the optimal action in state s̃_i is to execute method m_g (which yields the utility u_{i,g}) whereas the optimal action in state s is to execute method m_o (which yields the utility u_{s,o}). Here, the greedy policy for an agent postulates that method m_g should be executed in state s (which yields the utility u_{s,g}). The error of the greedy policy for state s is bounded by |u_{s,o} − u_{s,g}|.
Observe now that the expected utilities u_{s,j} of executing actions m_j in state s can be the same, smaller or greater than the corresponding utilities u_{i,j}. In particular (refer to Figure 5.11):

- If the likelihood of method m_o being enabled before time t_i is smaller than the likelihood of method m_o being enabled before time t_s > t_i, one can have u_{i,o} < u_{s,o}.

- Since the likelihood of finishing the execution of method m_g within a time window is greater if the execution of m_g is started in state s̃_i rather than in state s, one can have u_{i,g} > u_{s,g}.

Consequently, the error of the greedy policy for state s is at most u_{s,o} − u_{s,g}. To bound this expression, we first find the minimum value of u_{s,g} and then the maximum value of u_{s,o}. Notice that since we are interested in finding the minimum value of u_{s,g}, it can be assumed without the loss of generality that method m_g is enabled in state s̃_i with probability 1.^6

^6 If method m_g was enabled in state s̃_i with probability p_1 < 1 then method m_g would be enabled in state s with probability p_2 ≥ p_1 and consequently, the difference u_{i,g} − u_{s,g} would be smaller, which would translate into a smaller error value.
Recall the probability F_{(φ,m_g),s̃_i,q}(t) that the execution of method m_g started in state s̃_i will finish before time t with result q; the value of F_{(φ,m_g),s̃_i,q}(t) can be easily calculated using the Monte-Carlo sampling approach explained in Section 5.2.4.2. Now, the later the execution of method m_g starts, the smaller the likelihood of its completion within an admissible time window. Thus:

F_{(φ,m_g),s̃_i,q}(t)  >  F_{(φ,m_g),s,q}(t)  >  F_{(φ,m_g),s̃_i,q}(t − δ)

And since the expected utility u_{s,g} is directly affected by F_{(φ,m_g),s,q}(t), it holds that:

u_{s,g}  >  u_{i,g} (1 − (F_{(φ,m_g),s̃_i,q}(t) − F_{(φ,m_g),s̃_i,q}(t − δ)))

Conversely, it is now shown how to establish the maximum value of u_{s,o}. Notice that since we are interested in finding the maximum value of u_{s,o}, it can be assumed without the loss of generality that method m_o will be completed within its admissible time window with probability 1, no matter whether it is started in state s̃_i or s.^7 Then, using the Monte-Carlo sampling approach explained in Section 5.2.4.2, one can calculate the probabilities F^e_{i,o}(t) that method m_o will be enabled before time t when the agent is in state s̃_i (F^e_{i,j} is a cumulative distribution function). Finally, since the later the execution of method m_o starts, the higher the probability that it will be enabled, it holds that:

F^e_{i,o}(t)  <  F^e_{s,o}(t)  <  F^e_{i,o}(t + δ)

^7 If method m_o is completed within its admissible time window with probability p_1 < 1 when it is started in state s̃_i then this probability can only get smaller if the execution of m_o is started in state s and consequently, the value of u_{s,o} and the error can only get smaller.
And since the expected utility u_{s,o} is directly affected by F^e_{s,o}(t), it holds that:

u_{s,o}  <  u_{i,o} (1 + (F^e_{i,o}(t + δ) − F^e_{i,o}(t)))

Consequently, the error of executing a greedy action m_g instead of an optimal action m_o in state s is bounded by:

u_{s,o} − u_{s,g}  <  u_{i,o} (1 + (F^e_{i,o}(t + δ) − F^e_{i,o}(t))) − u_{i,g} (1 − (F_{(φ,m_g),s̃_i,q}(t) − F_{(φ,m_g),s̃_i,q}(t − δ)))
                 <  u_{i,g} (F^e_{i,o}(t + δ) − F^e_{i,o}(t) + F_{(φ,m_g),s̃_i,q}(t) − F_{(φ,m_g),s̃_i,q}(t − δ))

Incorporating the error ε of the lookahead function in determining u_{s,o} and u_{s,g}:

u_{s,o} − u_{s,g}  <  u_{i,g} (F^e_{i,o}(t + δ) − F^e_{i,o}(t) + F_{(φ,m_g),s̃_i,q}(t) − F_{(φ,m_g),s̃_i,q}(t − δ)) + 2ε        (5.11)

And since m_o is not known, the error ε_i produced by following a greedy policy from a state between the representative states s̃_i and s̃_{i+1} that are δ apart is given by^8:

ε_i  <  max_{m_o ∈ A(s̃_i)}  u_{i,g} (F^e_{i,o}(t + δ) − F^e_{i,o}(t) + F_{(φ,m_g),s̃_i,q}(t) − F_{(φ,m_g),s̃_i,q}(t − δ)) + 2ε        (5.12)
Finally, for a given set S̃_φ of representative states for the method sequence φ = (φ_{−1}, m_k), the error of the M-DPFP algorithm for that sequence, denoted as ComputeError(S̃_φ), is bounded by:

ComputeError(S̃_φ)  <  max_{s_{−1} ∈ S̃_{φ−1}: π(s_{−1}) = m_k,  q ∈ {0,1}}   max_{s̃_i ∈ S̃_{φ,s_{−1},q}}   ε_i.

^8 Assuming that representative states s̃_i and s̃_{i+1} exist.
To summarize, the M-DPFP algorithm (Algorithm 2 in Section 5.2.2) is called with a desirably small parameter maxError. The algorithm then keeps adding representative states until ComputeError(S̃_φ) is smaller than maxError for any sequence φ that can be encountered while executing the best policy. Finally, since the total number of sequences that can be encountered by the agent team while executing a policy is bounded by the number K of methods, the cumulative error of the M-DPFP algorithm is bounded by K · maxError.

This section demonstrated how to select new representative states, what actions agents should execute in non-representative states and how to bound the error of the M-DPFP algorithm. The next section shows how the lookahead function determines which action should be executed from a representative state.
5.2.4 Lookahead Function

This section describes the lookahead function OptimalMethod(s) called by Algorithm 2 in Section 5.2.2. The lookahead function takes as an input the representative state s and returns the expected utilities u_{s,i} of executing actions m_i ∈ A(s) from state s. In order to calculate u_{s,i} the lookahead function solves a dual problem, i.e., it tries to identify an optimal probability mass flow through the states (of all agents) starting from state s that maximizes the total expected utility u_{s,i}.

As stated, there are clear similarities between the lookahead function of M-DPFP and the DPFP algorithm from Section 3.2; however, there is one key difference: whereas DPFP could perform the forward search in the space associated with the future states (methods) of a single agent, M-DPFP's search must consider the future methods of possibly all agents. In other words, M-DPFP's lookahead function called for state s of agent n must consider future methods of agent n and future methods of agents n' ≠ n that are affected by the execution of future methods of agent n. Hence, M-DPFP's lookahead function must know the probabilities that agents n' ≠ n will start the execution of their future methods, and in order to estimate these probabilities, the policies of agents n' ≠ n must be found first. M-DPFP achieves that by adding the necessary representative states (and finding policies for these representative states) for agents n' ≠ n prior to considering state s of agent n. This process is explained in detail in the next three sections.
5.2.4.1 Dual Problem Formulation

Before the dual problem formulation is used in the context of the lookahead function, it is necessary to develop the dual formulation in the general case. Let φ = (m_{i_1}, ..., m_{i_k}) denote a sequence of k methods that an agent has executed and Q = (q_{i_1}, ..., q_{i_k}) be the vector of k outcomes of method execution where q_i is the outcome of the execution of method m_i. For notational convenience, let A(φ) be the set of methods that an agent can execute after it completed (successfully or unsuccessfully) the execution of methods from sequence φ. Furthermore, let F_{φ,Q}(t) be the probability that the agent has completed before time t the execution of methods from sequence φ with outcomes Q when following a policy π, and F_{φ,Q}(m_l)(t) be the probability that the agent has completed the execution of methods from sequence φ with outcomes Q and started the execution of method m_l before time t when following a policy π; observe that F_{φ,Q} and F_{φ,Q}(m_l) are cumulative distribution functions over t ∈ [0, Δ]. The expected joint reward that the agents receive for following the policy π from their starting states s_{0,n}, n = 1, ..., N is then given by:

V^π(s_{0,1}, ..., s_{0,N}) = Σ_{n=1,...,N}  Σ_{φ=(m_{i_1},...,m_{i_k}) ∈ Φ_n,  Q=(q_{i_1},...,q_{i_k}) ∈ {0,1}^k}  F_{φ,Q}(Δ) · q_{i_k} · r_{i_k}
                        = Σ_{φ=(m_{i_1},...,m_{i_k}) ∈ ∪_n Φ_n,  Q=(q_{i_1},...,q_{i_k}) ∈ {0,1}^k}  F_{φ,Q}(Δ) · q_{i_k} · r_{i_k}        (5.13)
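To make Equation (5.13) concrete, consider the following small sketch; the distribution functions F_{φ,Q} are represented simply by made-up values at the deadline Δ, and the rewards are illustrative only.

# Each entry: (method sequence, outcome vector) -> F_{phi,Q}(Delta), i.e. the
# probability that the whole sequence is completed with those outcomes by the
# deadline. All numbers below are illustrative placeholders.
F_at_deadline = {
    (("m_-1", "m_1"), (1, 1)): 0.9,
    (("m_-1", "m_1", "m_2"), (1, 1, 1)): 0.6,
    (("m_-1", "m_1", "m_2"), (1, 1, 0)): 0.3,
    (("m_-2", "m_5"), (1, 1)): 0.8,
}
reward = {"m_1": 5, "m_2": 10, "m_5": 2}   # r_i for the last method of a sequence

def expected_joint_reward(F_at_deadline, reward):
    """V^pi = sum over (phi, Q) of F_{phi,Q}(Delta) * q_{i_k} * r_{i_k}."""
    total = 0.0
    for (phi, Q), prob in F_at_deadline.items():
        last_method, last_outcome = phi[-1], Q[-1]
        total += prob * last_outcome * reward[last_method]
    return total

print(expected_joint_reward(F_at_deadline, reward))   # 0.9*5 + 0.6*10 + 0.8*2 = 12.1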
In particular, for an optimal policy π* the joint expected reward V^{π*}(s_{0,1}, ..., s_{0,N}) (denoted for notational convenience as V*(s_{0,1}, ..., s_{0,N})) is maximized:

V*(s_{0,1}, ..., s_{0,N}) = max_π V^π(s_{0,1}, ..., s_{0,N}) = Σ_{φ=(m_{i_1},...,m_{i_k}) ∈ ∪_n Φ_n,  Q=(q_{i_1},...,q_{i_k}) ∈ {0,1}^k}  F*_{φ,Q}(Δ) · q_{i_k} · r_{i_k}        (5.14)
Let F^π := {F_{φ,Q} such that φ = (m_{i_1}, ..., m_{i_k}) ∈ Φ and Q ∈ {0,1}^k} be the set of cumulative distribution functions associated with a policy π, and F* := {F*_{φ,Q} such that φ = (m_{i_1}, ..., m_{i_k}) ∈ Φ and Q ∈ {0,1}^k} be the set of optimal cumulative distribution functions associated with the optimal policy π*. In order to find F*, observe that the cumulative distribution functions in F* must maximize the right-hand side of Equation (5.13). Furthermore, F* must be an element of the set X = {F : (5.15), (5.16), (5.17)} where constraints (5.15), (5.16), (5.17) are defined as follows:^9
9
F
(m
n
);(1)
(t) = 1 (5.15)
F
;Q
(t) =
X
m
l
2A()
F
;Q
(m
l
)(t) + F
;Q
(m
0
)(t) (5.16)
F
(;m
l
);(Q;q)
(t) =
Z
t
0
Pb(m
l
; t
0
; q) F
;Q
(m
l
)(t
0
)p
l
(t t
0
)dt
0
(5.17)
for all agents n = 1;:::; N, sequences 2
n
and outcomes Q 2 f0; 1g
jj
. Here, (; m
l
) is a
concatenation of and (m
l
), (Q; q) is a concatenation of Q and (q) whereas Pb(m
l
; t
0
; q) (explained
later) is the probability that method m
l
is enabled before time t
0
(in case q = 1) or not enabled
before time t
0
(in case q = 0). Constraints (5.15), (5.16), (5.17) are explained as follows:
- Constraint (5.15) ensures that each agent n = 1, ..., N starts the execution from its starting state s_{n,0}. To observe that, recall that the starting state s_{n,0} of agent n is encoded in the CR-DEC-MDP framework as s_{n,0} = (⟨−n, l_{n,0}, l_{n,0}, 1⟩), that is, a spoof method m_{−n} is by default completed successfully (with q = 1) at time l_{n,0} = 0. Hence the probability F_{(m_{−n}),(1)}(t) that method m_{−n} will be completed successfully before time t must be 1 for all t ∈ [0, Δ].

- Constraint (5.16) can be interpreted as the conservation of probability mass flow through a method sequence. Applicable only if |A(φ)| > 0, it ensures that the cumulative distribution function F_{φ,Q} is split into cumulative distribution functions F_{φ,Q}(m_l) for methods m_l ∈ A(φ) plus the cumulative distribution function F_{φ,Q}(m_0); this latter function represents the probability that the agent will wait (do nothing) upon completing the execution of methods from sequence φ with outcomes Q.

- Constraint (5.17) ensures the correct propagation of probability mass from a method sequence φ to a method sequence (φ, m_l) upon executing method m_l with an outcome q. For q = 1, the constraint ensures that the execution of method m_l will be finished successfully at time t if this execution: (i) is started at time t', (ii) takes time t − t' to complete and (iii) was started after method m_l had been enabled. In contrast, for q = 0, constraint (5.17) ensures that the execution of method m_l will be finished unsuccessfully at time t if this execution: (i) is started at time t', (ii) takes time t − t' to complete and (iii) was started when method m_l was not enabled. The correct implementation of constraint (5.17) requires that the probabilities Pb(m_l, t', q) are known; in Section 5.2.4.2 it is explained how to estimate these probabilities using a Monte-Carlo sampling approach.

^9 Constraints (5.15), (5.16), (5.17) are defined for a single method execution time window [0, Δ]. Extension to multiple time windows is shown in Equation (5.6).
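The following sketch illustrates constraint (5.17) numerically on a discretized time grid; every distribution in it (start-time CDF, enablement probability, duration density) is a made-up stand-in rather than a quantity taken from the thesis.

import numpy as np

dt = 0.01
t = np.arange(0.0, 10.0, dt)

F_start = np.clip(t / 4.0, 0.0, 1.0)        # CDF of starting m_l before t'
Pb_enabled = np.clip(t / 2.0, 0.0, 1.0)     # Pb(m_l, t', q=1): enabled before t'
p_dur = np.exp(-((t - 2.0) ** 2) / 0.5)     # duration density p_l of m_l
p_dur /= p_dur.sum() * dt                   # normalize to integrate to 1

def F_completed(k):
    """Approximate F_{(phi,m_l),(Q,1)}(t_k) =
    integral_0^{t_k} Pb(m_l, t', 1) * F_{phi,Q}(m_l)(t') * p_l(t_k - t') dt'."""
    return float(np.sum(Pb_enabled[:k + 1] * F_start[:k + 1] * p_dur[k::-1]) * dt)

print(F_completed(len(t) - 1))   # probability of successful completion by the deadline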
The dual problem is then stated as:

max  Σ_{φ=(m_{i_1},...,m_{i_k}) ∈ ∪_n Φ_n,  Q=(q_{i_1},...,q_{i_k}) ∈ {0,1}^k}  F_{φ,Q}(Δ) · q_{i_k} · r_{i_k}     subject to  F ∈ X        (5.18)

And the solution to the dual problem is the set F* which yields the total expected utility V*(s_{0,1}, ..., s_{0,N}).
5.2.4.2 Dual Problem and the Lookahead Function

It is now shown how the lookahead function takes advantage of the dual problem formulation. Recall that the lookahead function takes as an input the representative state s and returns the expected utilities u_{s,k} of executing actions m_k ∈ A(s) from state s. It is first shown how the lookahead function uses the dual problem formulation in a simple case, when s is one of the starting states of the CR-DEC-MDP model. This result is then generalized to an arbitrary representative state s.

Consider a situation when s = s_{0,n}, i.e., s is a starting state and A(s_{0,n}) = {m_1, ..., m_K}, as shown in Figure 5.12a. In order to determine which method agent n should start executing from state s_{0,n}, one must consider K separate cases. In particular, for case k, the dual problem (5.18) is solved with an additional constraint F_{(m_{−n}),(1)}(m_k)(t) = 1 for t ∈ [0, Δ] which ensures that method m_k is going to be executed from state s. The solution to the new problem then determines the expected utility u_{s,k} of executing method m_k in state s. Upon considering all cases k = 1, ..., K the lookahead function returns method m_{k*} where k* = arg max_{k=1,...,K} u_{s,k}.
Consider now a more general case (refer to Figure 5.12b) when the lookahead function is called to find the method to be executed from a representative state s ∈ S_φ of agent n. Assume that the methods from the sequence φ have been executed with outcomes Q and A(s) = A(φ) = {m_1, ..., m_K}. Furthermore, let D(φ) be the set of method sequences φ' ∈ ∪_{n=1,...,N} Φ_n (of any agent) that are preceded by the method execution sequence φ, that is, that are reachable from node φ of the method sequences graph G (see Section 5.2.2). For example, in Figure 5.12b we have D((m_{−2}, m_4, m_6)) = {(m_{−2}, m_4, m_6, m_5), (m_{−1}, m_1, m_2), (m_{−1}, m_1, m_2, m_3), (m_{−1}, m_1, m_3, m_2), (m_{−1}, m_3, m_2, m_1)}.
Figure 5.12: The range of the lookahead function: (a) Gray rectangles encompass all method execution sequences of one agent; the dotted triangle encompasses the range of the lookahead function, i.e., all the sequences that are affected by the decision of agent n in the initial state s_{0,n}. (b) When the lookahead function is called for state s ∈ S_φ of agent n, D(φ) is the set of sequences φ' affected by the agent's decision in state s. The execution of methods in sequences φ''' affects the execution of methods in sequences φ' ∈ D(φ).
Observe that the action that agent n executes in state s (for sequence φ) will have an effect on the execution outcomes of other methods (that possibly belong to different agents). Precisely, the action in state s affects all the methods m ∈ ∪_{n=1,...,N} M_n such that there is a sequence φ' = (..., m) ∈ D(φ). In other words, the action that the agent executes in state s will impact all the functions F_{φ',Q'} for φ' ∈ D(φ). Thus, the objective function of the agent in state s is the following:

max  Σ_{φ'=(m_{i_1},...,m_{i_k}) ∈ D(φ),  Q'=(q_{i_1},...,q_{i_k}) ∈ {0,1}^k}  F_{φ',Q'}(Δ) · r_{i_k} · q_{i_k}        (5.19)
Furthermore, agent n must consider the impact of methods from external sequences (from outside of D(φ)) when maximizing the value of expression (5.19). Precisely, the fact that the agent has reached state s imposes additional constraints on the dual problem formulation (5.18), i.e., the set X specified by constraints (5.15), (5.16), (5.17) is now further constrained by:

- Local agent history constraints: Because the state s is fully observable to the agent, the agent knows exactly the starting times, finishing times and outcomes of the execution of all its methods of sequence φ. For example, if agent 1 is in state s = (⟨−1, t_0, t_0, 1⟩, ⟨1, t_0, t_1, 1⟩, ⟨2, t_1, t_2, 0⟩) then it can infer that F_{(m_{−1}),(1)}(t) = 1 for t ∈ [0, Δ]; F_{(m_{−1},m_1),(1,1)}(t) = 0 if t ∈ [0, t_1] and 1 if t ∈ [t_1, Δ]; F_{(m_{−1},m_1,m_2),(1,1,1)}(t) = 0 for t ∈ [0, Δ]; and finally F_{(m_{−1},m_1,m_2),(1,1,0)}(t) = 0 if t ∈ [0, t_2] and 1 if t ∈ [t_2, Δ]. In general, when methods from a sequence φ are completed, the functions F_{φ'',Q''} where φ'' is a prefix of φ can be inferred. These functions F_{φ'',Q''} specify new constraints to be added to further restrict set X.
- Other agents' histories constraints: Because the information encoded in state s does not allow the agent to determine precisely the current states of other agents, the agent can only estimate the starting times, finishing times and outcomes of the execution of methods of other agents. Of particular importance for solving the dual problem of the agent (when the agent is about to choose an action to be executed in state s) are the method execution sequences φ''' of other agents that can affect the execution of methods from sequences φ' ∈ D(φ) and thus influence the decision of the agent in state s. Precisely, in state s the agent must be able to estimate the status of execution of methods from sequences φ''' that precede method sequences φ' ∈ D(φ), i.e., sequences φ''' of other agents such that there exists a path (φ''', ..., φ') for some φ' ∈ D(φ) in the method sequences graph G.

In order to determine F_{φ''',Q'''}(t) for t ∈ [0, Δ] for all such sequences, a Monte-Carlo sampling approach is used [MacKay, 1998]. In essence, the CR-DEC-MDP model assumes given starting states s_{0,1}, ..., s_{0,N} of the agents. Moreover, thanks to the specific ordering L in which the lookahead function analyses the method execution sequences, when the M-DPFP algorithm calls the lookahead function for state s ∈ S̃_φ, the policies for states s''' ∈ S_{φ'''} are known (if s''' is not a representative state, the greedy policies are used; refer to Section 5.2.3.2). Thus, a Monte-Carlo algorithm initiated with a given number of samples (varied in Chapter 6) can simulate the concurrent execution of the agents' policies from their starting states. By counting the number of samples that traverse a particular sequence φ''' with outcomes Q''' in a given time interval, one can estimate the probabilities F_{φ''',Q'''}(t) for all t ∈ [0, Δ]. The functions F_{φ''',Q'''} are then used to specify new constraints to be added to further restrict set X.
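A minimal sketch of this sampling step follows; sample_trajectory is a hypothetical simulator of the agents' policies standing in for the concurrent policy simulation, not an interface defined in the thesis.

import random
from collections import defaultdict

def estimate_F(sample_trajectory, num_samples, grid):
    """Monte-Carlo estimate of F_{phi''',Q'''}(t): the fraction of simulated
    executions that complete sequence phi''' with outcomes Q''' by time t.
    `sample_trajectory()` returns a dict mapping (phi, Q) -> completion time."""
    counts = defaultdict(lambda: [0] * len(grid))
    for _ in range(num_samples):
        for (phi, Q), finish in sample_trajectory().items():
            for i, t in enumerate(grid):
                if finish <= t:
                    counts[(phi, Q)][i] += 1
    return {key: [c / num_samples for c in cs] for key, cs in counts.items()}

# Toy usage: one agent, one method with duration ~ U(1, 3), always successful.
def sample_trajectory():
    d = random.uniform(1.0, 3.0)
    return {(("m_-1",), (1,)): 0.0, (("m_-1", "m_1"), (1, 1)): d}

grid = [0.5 * i for i in range(11)]   # t = 0.0, 0.5, ..., 5.0
F = estimate_F(sample_trajectory, 1000, grid)
print(F[(("m_-1", "m_1"), (1, 1))])   # approximates the CDF of U(1, 3)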
To summarize, in order to determine which method agent n should start executing from state s = (..., ⟨m_l, t_1, t_2, q⟩) ∈ S̃_φ, the lookahead function considers K separate cases. In particular, for case k the lookahead function instantiates a new dual problem with the objective function (5.19) and constraints specified by (5.15), (5.16), (5.17), the local agent history constraints, the other agents' histories constraints and an additional constraint F_{φ,Q}(m_k)(t) = 0 for t ∈ [0, t_2] and F_{φ,Q}(m_k)(t) = 1 for t ∈ [t_2, Δ] which ensures that method m_k is going to be executed from state s. The solution to the dual problem then determines the expected utility u_{s,k} of executing method m_k in state s. Upon considering all cases k = 1, ..., K the lookahead function returns method m_{k*} where k* = arg max_{k=1,...,K} u_{s,k}. Finally, if the maximum utility of executing an action in a representative state s = (..., ⟨m_l, t_1, t_2, q⟩) is smaller than the maximum utility of executing an action in a later representative state, i.e., s' = (..., ⟨m_l, t_1, t_2 + t', q⟩), then the optimal policy in the representative state s is to wait.
5.2.4.3 Solving the Dual Problem

In order to solve the dual problem above, the same approach as in the DPFP algorithm in Section 3.2.3 can be used. Informally, each cumulative distribution function F_{φ,Q} is approximated with a step function of a step height θ. With this approximation technique the feasible set X of the dual problem is narrowed to a restricted feasible set X̂ ⊆ X which contains a finite number of elements. The search for the optimal solution F̂* to the restricted dual problem can then be implemented as an exhaustive iteration over all the elements of X̂, to find an element that maximizes the objective function (5.19).
At a basic level, the exhaustive iteration starts with a step function F_{φ,Q} which is known to the agent (when the agent is in state s) and then considers all possible ways in which F_{φ,Q} can be split into functions F_{φ,Q}(m_i) for m_i ∈ A(φ) ∪ {m_0}; the spoof method m_0 is used to implement the waiting action of the agent. The algorithm then enters a recursive loop: it fixes the functions {F_{φ,Q}(m_i)}_{m_i ∈ A(φ)} and considers a method sequence φ' such that φ' ∈ D(φ) and φ' is the next element on the list L after φ. For sequence φ', the algorithm considers all possible ways in which the function F_{φ',Q'} (derived using the Monte-Carlo sampling technique discussed above) can be split into functions F_{φ',Q'}(m_j) for m_j ∈ A(φ') ∪ {m_0}. Here, the algorithm enters another recursive loop (2nd level of recursion): it fixes the functions {F_{φ',Q'}(m_j)}_{m_j ∈ A(φ')} and considers a method sequence φ'' ∈ D(φ') such that φ'' is the next element on the list L after φ', etc. The recursion stops after all possible elements of X̂ have been considered and evaluated using the objective function (5.19). At this point, the optimal solution F̂* to the restricted dual problem is known.
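As an illustration of the first step only, the sketch below quantizes a CDF into a θ-step function and enumerates its possible splits among the available methods (the waiting method m_0 implicitly absorbs the remainder); the full recursive search over X̂ is omitted and all numbers are made up.

from itertools import product

def to_step_function(F, theta, grid):
    """Quantize a CDF F on a time grid to multiples of the step height theta."""
    return [theta * round(F(t) / theta) for t in grid]

def enumerate_splits(step_values, num_methods, theta):
    """Yield every split of a step function among `num_methods` methods, in
    increments of theta, keeping each per-method function non-decreasing and
    their sum at most the original (the waiting method m_0 takes the rest)."""
    levels = [int(round(v / theta)) for v in step_values]
    def rec(i, prev):
        if i == len(levels):
            yield []
            return
        for alloc in product(range(levels[i] + 1), repeat=num_methods):
            if sum(alloc) <= levels[i] and all(a >= p for a, p in zip(alloc, prev)):
                for rest in rec(i + 1, alloc):
                    yield [[a * theta for a in alloc]] + rest
    return rec(0, (0,) * num_methods)

grid = [0, 1, 2, 3]
steps = to_step_function(lambda t: min(1.0, t / 3.0), 0.5, grid)
print(steps)                                             # [0.0, 0.5, 0.5, 1.0]
print(sum(1 for _ in enumerate_splits(steps, 2, 0.5)))   # number of candidate splits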
It is guaranteed that the reward error ε of a policy identified by F̂* can be expressed in terms of θ. Indeed, the error ε of the lookahead function can be bounded in exactly the same way as the error of the DPFP algorithm (refer to Section 3.2.5):

ε = R_max · Σ_{φ' ∈ D(φ)} max_{t ∈ [0,Δ]} |F*_{φ',Q}(t) − F̂*_{φ',Q}(t)|  ≤  R_max · Σ_{φ' ∈ D(φ)} |φ'| · θ        (5.20)

where R_max = max_{m_i ∈ M} r_i. Hence, by decreasing θ, the lookahead function can trade off speed for optimality.
Chapter 6: Experiments with Multiagent Algorithms
This chapter reports on the empirical evaluation of the algorithms for solving CR-DEC-MDPs. The chapter is composed of two sections: first, in Section 6.1 the evaluation of the Value Function Propagation (VFP) algorithm introduced in Section 5.1 is presented. VFP is compared with its closest competitor, the Opportunity Cost Decentralized MDP (OC-DEC-MDP) algorithm [Beynier and Mouaddib, 2006]. Then, in Section 6.2 the evaluation of the Multiagent Dynamic Probability Function Propagation (M-DPFP) algorithm is conducted. M-DPFP is shown to successfully trade off optimality for speed when solving problems modeled as CR-DEC-MDPs.
6.1 VFP Experiments
First, the experimental evaluation of the VFP algorithm is provided. The VFP algorithm has been developed to provide two orthogonal improvements over the OC-DEC-MDP algorithm [Beynier and Mouaddib, 2006]: (i) VFP uses a functional representation to speed up the search for policies carried out by OC-DEC-MDP and (ii) VFP implements a family of opportunity cost splitting heuristics that allow it to find solutions of higher quality. Both algorithms were run to find locally optimal solutions to a special case of CR-DEC-MDPs which assumes that methods are fully ordered (see Section 5.1). Furthermore, method execution durations were assumed to follow discrete probability density functions because the OC-DEC-MDP algorithm does not allow for continuous probability density functions of method execution durations.

Hence, the experimental evaluation of VFP consists of two parts: Section 6.1.1 provides a comparison of the solution quality found by either OC-DEC-MDP or VFP when running with different opportunity cost splitting heuristics. Then, Section 6.1.2 reports on the evaluation of the efficiency of both algorithms for a variety of CR-DEC-MDP configurations.
6.1.1 Evaluation of the Opportunity Cost Splitting Heuristics

The VFP algorithm was first run on a generic mission plan configuration from Figure 5.5 in Section 5.1.5 where only methods m_{j_0}, m_{i_1}, m_{i_2} and m_0 were present. Time windows of all methods were set to 400, the duration p_{j_0} of method m_{j_0} was uniform, i.e., p_{j_0}(t) = 1/400, and the durations p_{i_1}, p_{i_2} of methods m_{i_1}, m_{i_2} were normal distributions, i.e., p_{i_1} = N(μ = 250, σ = 20) and p_{i_2} = N(μ = 200, σ = 100). Furthermore, only method m_{j_0} provided reward, i.e., r_{j_0} = 10 was the reward for finishing the execution of method m_{j_0} before time t = 400. The results are shown in Figure 6.1, where the x-axis of each of the graphs represents time whereas the y-axis represents the opportunity cost. The first graph confirms that when the opportunity cost function O_{j_0} was split into opportunity cost functions O_{i_1} and O_{i_2} using the H_{⟨1,1⟩} heuristic, the function O_{i_1} + O_{i_2} was not always below the O_{j_0} function (the opportunity cost function O_{j_0} was overestimated). In particular, O_{i_1}(280) + O_{i_2}(280) exceeded O_{j_0}(280) by 69%. When heuristics H_{⟨1,0⟩}, H_{⟨1/2,1/2⟩} and Ĥ_{⟨1,1⟩} were used (graphs 2, 3 and 4), the function O_{i_1} + O_{i_2} was always below O_{j_0}.

Figure 6.1: Visualization of the opportunity cost splitting heuristics

Then, experiments in the civilian rescue domain introduced in Chapter 2 were conducted. For a CR-DEC-MDP configuration from Figure 2.3 all method execution durations were sampled from the normal distribution N(μ = 5, σ = 2). To obtain a baseline for the heuristic performance, a globally optimal solver has been implemented; it found the true expected total reward for this domain (Figure 6.3a). This reward was then compared with the expected total reward found by a locally optimal solver guided by each of the discussed heuristics. Figure 6.3a, which plots on the y-axis the expected total reward of a policy, complements the previous results: the H_{⟨1,1⟩} heuristic overestimated the expected total reward by 280% whereas the other heuristics were able to guide the locally optimal solver close to the true expected total reward.

Figure 6.2: Mission plan configurations
6.1.2 Comparison of VFP and OC-DEC-MDP Efficiency

The efficiency of VFP and OC-DEC-MDP was then compared when the algorithms were run on the configurations shown in Figure 6.2. Both algorithms were using the H_{⟨1,1⟩} heuristic to split the opportunity cost functions. Using the performance of the OC-DEC-MDP algorithm as a benchmark, the VFP scalability experiments were started on a configuration from Figure 6.2a for which the method execution durations were extended to normal distributions N(μ = 30, σ = 5) and the mission deadline was extended to Δ = 200.

The efficiency of VFP when it was running with three different levels of accuracy was then tested. Different approximation parameters ε_P and ε_V were chosen, such that the total error of the solution found by VFP stayed within 1%, 5% and 10% of the solution found by OC-DEC-MDP. Both algorithms were run for a total of 100 policy improvement iterations.

Figure 6.3b shows the performance of VFP in the civilian rescue domain (the y-axis shows the runtime in milliseconds). As can be seen, for this small domain, VFP runs 15% faster than OC-DEC-MDP and finds a policy with an error of less than 1%. For comparison, the globally optimal solver did not terminate within the first three hours of its runtime, which shows the potential of approximate algorithms like OC-DEC-MDP and VFP.

Figure 6.3: VFP performance in the civilian rescue domain.

Next, VFP was evaluated in a more challenging domain, i.e., with methods forming a long chain (Figure 6.2b). Experiments with chains of 10, 20 and 30 methods were conducted, with the method execution time windows extended to 350, 700 and 1050 time ticks^1, to ensure that later methods are reachable. The results are shown in Figure 6.4a, where the number of methods is varied on the x-axis and the algorithm runtime is marked on the y-axis (notice the logarithmic scale). As can be observed, the increase of the size of the domain reveals the high performance of VFP: within 1% error, it runs up to 6 times faster than OC-DEC-MDP.

Then, the VFP scalability experiments with methods arranged in a tree have been conducted (Figure 6.2c). In particular, trees with a branching factor of 3 have been considered, with tree depths of 2, 3 and 4. To ensure that methods have a chance to be executed, the time horizon was increased from 200 to 300, and then to 400 time ticks. The results are shown in Figure 6.4b. Although the speedups are smaller than in the case of a chain, VFP still runs up to 4 times faster than OC-DEC-MDP when computing the policy with an error of less than 1%.

Finally, VFP was tested on a domain with methods arranged in an n × n mesh, i.e., C_≺ = {⟨m_{i,j}, m_{k,j+1}⟩} for i = 1, ..., n; k = 1, ..., n; j = 1, ..., n − 1. In particular, meshes of 3 × 3, 4 × 4, and 5 × 5 methods have been considered. For such configurations the time horizons were increased from 3000 to 4000, and then to 5000, to ensure that all methods have a chance to be executed. The final set of results is shown in Figure 6.4c. As can be seen, especially for larger meshes, VFP runs up to one order of magnitude faster than OC-DEC-MDP while finding a policy that is within less than 1% of the policy found by OC-DEC-MDP.

Figure 6.4: Scalability experiments for OC-DEC-MDP and VFP for different CR-DEC-MDP configurations.

^1 Recall that OC-DEC-MDP requires time to be discretized.
6.2 M-DPFP Experiments

This section reports on the empirical evaluation of the M-DPFP algorithm, developed to provide error-bounded solutions to problems modeled as CR-DEC-MDPs. M-DPFP has been implemented and tested on the domain described in Section 2.1.3.2. All method execution durations were sampled from the normal distribution N(μ = 2, σ = 1); the method execution time windows and method rewards were [0, 10], [0, 7], [0, 8], [0, 10], [0, 8], [0, 10] and 5, 10, 3, 6, 2, 2 for methods m_1, m_2, m_3, m_4, m_5, m_6 respectively. The experiments in this section first evaluate the M-DPFP lookahead function and then the complete M-DPFP algorithm.

6.2.1 Efficiency of the M-DPFP Lookahead Function

The first set of experiments involved running the lookahead function for a starting state s_{0,1} in order to determine the total expected utility of an optimal policy for the agents at time t = 0. To implement the lookahead function, Monte-Carlo sampling was used with sample set sizes between 50 and 400 samples. The first experiment (Figure 6.5) demonstrates how the lookahead function allows for a systematic trade-off of solution quality for speed. In Figure 6.5 the runtime of the lookahead function measured in seconds is plotted on the x-axis (notice the logarithmic scale) and the computed total expected utility of an optimal policy from state s_{0,1} is plotted on the y-axis. Because the expected utility returned by the lookahead function always underestimates the true expected utility, the y-axis is simply labelled as Solution quality. Data points for each size of the Monte-Carlo sample set correspond (from left to right) to θ = 0.3, 0.25, 0.2, 0.15, 0.1, while each data point is an average of 5 algorithm runs.

Figure 6.5: Runtime and quality of the lookahead function

First, observe that the solution quality converges as θ decreases and that a linear decrease of θ corresponds to an exponential increase of runtime. Also, an exponential increase of the size of the Monte-Carlo sample set surprisingly translates to only a linear increase of the algorithm runtime whereas the solution quality is comparable, e.g., for θ = 0.15 the solution quality for all sizes of sample sets differs by less than 8%. This last result is encouraging, as it allows the lookahead function to find accurate solutions even for small sets of Monte-Carlo samples.

The second experiment shows that the lookahead function finds solutions with quality guarantees. In Figure 6.6 the algorithm runtime is marked on the x-axis and the solution quality is marked on the y-axis. For θ ∈ {0.3, 0.25, 0.2, 0.15, 0.1} the error bound formula (5.20) is used to determine the hypothetical maximal quality of the solution; this quality is then contrasted with the solution quality found by the lookahead function for a fixed size of the Monte-Carlo sample set. Observe that the lookahead function finds solutions with quality guarantees, e.g., if θ = 0.1, it finds a solution that is guaranteed to be less than 50% away from the optimal solution. Also, the lookahead function can trade off speed (by decreasing θ) for optimality.

Figure 6.6: Error bound of the lookahead function
6.2.2 Efficiency of the M-DPFP Algorithm

The second set of experiments reports on the efficiency of the complete M-DPFP algorithm. Here, the Monte-Carlo sample set size is fixed to 50, but the number of representative states is varied, i.e., representative state branching factors of 20, 40 and 80 are used (for the definition of the branching factor, refer to Section 5.2.3.1).

The experiment in Figure 6.7 reports on the total runtime of M-DPFP; the state branching factor is varied on the x-axis and the total runtime of M-DPFP is plotted on the y-axis, where each data point is an average of 5 algorithm runs. As can be seen, an exponential increase of the representative state branching factor translates into an exponential increase of M-DPFP runtime, which is not surprising. More interesting is the proportion of time that M-DPFP spends finding the optimal first decision (the joint action from joint state (s_{0,1}, s_{0,2})), the second decision (optimal actions after one method has been executed) and the third decision (optimal actions after two methods have been executed). When the branching factor for representative states is 20, over 92% of the total runtime of M-DPFP is spent on finding the optimal first decision, i.e., the lookahead function for state (s_{0,1}, s_{0,2}) is the major bottleneck of the algorithm. However, as the branching factor increases, the time spent on finding the optimal later decisions becomes the new bottleneck of the algorithm.

Figure 6.7: Runtime of the M-DPFP algorithm

Hence, representative states must be generated wisely, as their number directly affects the runtime of M-DPFP. The final experiment shows how different representative state selection heuristics affect the maximum quality loss of M-DPFP. In Figure 6.8 the x-axis represents the branching factor for representative states and the y-axis is the maximum quality loss of M-DPFP calculated using formula (5.12). The first thing to notice is that the increase in the branching factor for representative states translates into a decrease of the maximum quality loss of M-DPFP for both the H_UT and H_UP heuristics. Furthermore, when M-DPFP uses the H_UT heuristic, the maximum quality loss is generally smaller than the quality loss when M-DPFP uses the H_UP heuristic. This result is in accordance with the intuition that the uniform distribution of representative states over time, generated by the H_UT heuristic, guarantees that all the states on the time dimension are well represented. In contrast, H_UP's discrimination in choosing where the representative states should be added, favoring higher-likelihood regions of the time dimension, translates into higher maximum quality loss values. Note however that in practice, M-DPFP running with the H_UP heuristic might outperform M-DPFP running with the H_UT heuristic, as H_UP's representative states are more likely to be visited than H_UT's representative states and thus, M-DPFP running with the H_UP heuristic is less likely to resort to the greedy policy described in Section 5.2.3.2. Indeed, the study of probabilistic error bounds [Kearns et al., 1999] for both heuristics is a topic worthy of future investigation.

Figure 6.8: Quality loss of the M-DPFP algorithm for different representative state selection heuristics
Chapter 7: Related Work
Planning with continuous resources has been a very active area of research, with many efficient algorithms proposed both for single and multi-agent systems.
7.1 Single Agent Systems
Three classes of algorithms have emerged for planning with continuous resources in single agent systems: (i) Constrained MDPs, (ii) Semi MDPs and (iii) continuous state MDPs.
7.1.1 Constrained MDPs
For single agent systems, a popular approach to planning with continuous resources is to use the Constrained MDP framework proposed by [Altman, 1999]. Constrained MDPs, which make it possible to model continuous resource consumption and resource limits, can be formulated as either primal or dual linear programs and solved optimally using existing linear programming solvers. The optimal randomized policy [Paruchuri et al., 2004] can then be found in polynomial time, whereas for an optimal deterministic policy, one can introduce additional binary constraints [Dolgov and Durfee, 2005] and then solve the optimization problem using efficient mixed integer linear programming tools.

Unfortunately, constrained MDPs have two serious restrictions: first, they do not incorporate the resource levels in the state description, and as such, they model resource limits as upper bounds on the total expected discounted costs. As a result, the optimal policy for a constrained MDP is not dependent on actual resource levels, which can lead to suboptimal results during policy execution. Second, constrained MDPs assume that each action consumes a deterministic amount of resource, and thus, they are not applicable for modeling our domains of interest.
7.1.2 Semi MDPs
In order to model non-deterministic resource consumption, a popular approach is to use the Semi MDP framework [Howard, 1960]. Semi MDPs, which allow resource consumption to follow arbitrary continuous probability density functions, can be solved efficiently using either value iteration [Bellman, 1957] or policy iteration [Howard, 1960] principles. However, similarly to constrained MDPs, the optimal policies for Semi MDPs are invariant to actual resource limits. Even worse, they cannot handle resource limits at all, and instead resort to discount factors to model the fact that the MDP process whose actions consume resources will run indefinitely. The same problem pertains to the continuous time MDP model (CTMDP) [Howard, 1971] and the Generalized Semi MDP model (GSMDP) [Younes and Simmons, 2004; Younes, 2005], variants of Semi MDPs introduced to handle action concurrency and exogenous events. For both models polynomial time algorithms exist, yet these algorithms only return policies that are not indexed by the amount of resources left.
7.1.3 Continuous state MDPs
To eliminate the shortcomings of Constrained MDPs and Semi MDPs, that is, to allow for continuous resource consumption and policies dependent on resource availability, the common approach is to encapsulate the amount of resources left in the MDP state description. As a result, each MDP state becomes a hybrid state that consists of discrete and continuous components; this model is referred to as a continuous state MDP, or a continuous resource MDP (if continuous component values always increase/decrease). One can then use either value or policy iteration algorithms to solve the underlying planning problems.
Many value iteration algorithms for solving continuous resource MDPs have been proposed. The easiest (but the least efficient) method is to discretize the continuous resources, after which the continuous resource MDP becomes a standard MDP with only discrete states. Alternatively, instead of discretizing the continuous resources, one can assume discrete probability distributions, which then discretizes the continuous resources automatically [Boyan and Littman, 2000]. Unfortunately, either way, the discretization invalidates solution quality guarantees and results in a combinatorial explosion of the number of states as it becomes more and more fine-grained.
To remedy the problem of the combinatorial explosion of the number of states caused by discretizing the continuous resources, [Feng et al., 2004], [Liu and Koenig, 2005] and [Li and Littman, 2005] have suggested exploiting the structure of the problem. In particular, [Feng et al., 2004] exploits the structure of the transition and reward functions to dynamically partition the state space into regions where the value function is constant or changes at the same rate. [Li and Littman, 2005] employs a similar principle but, unlike [Feng et al., 2004], allows for continuous transition functions. On the other hand, [Liu and Koenig, 2005] considers discrete MDPs, but allows for continuous, one-switch utility functions to model risk-sensitive planning problems. Unfortunately, all these techniques can still be inefficient when the discrete component of states is large, causing a combinatorial explosion of the state-space.
Several promising techniques have been proposed to alleviate the combinatorial explosion caused by the discrete component of states. In particular, [Boutilier et al., 1995, 2000; Guestrin et al., 2004; Dolgov and Durfee, 2006] suggest using dynamic Bayesian Networks to factor the discrete component of the states when discrete state variables depend on each other. Modified versions of dynamic programming that operate directly on the factored representation can then be used to solve the underlying planning problems. More recently, [Mausam et al., 2005] have developed a Hybrid AO* (HAO*) algorithm that significantly improves the efficiency of continuous resource MDP solvers. The idea in HAO* is to prioritize node expansion order based on a heuristic estimate of the node value. Furthermore, HAO* is particularly useful when the starting state, minimal resource consumption and resource constraints are given: HAO* can then prune infeasible trajectories and reduce the number of states to be considered to find an optimal policy.
Since both DPFP and HAO* are algorithms for speeding up existing continuous resource MDP solvers, there are some apparent similarities between them. However, DPFP differs from HAO* in significant ways. The most important difference is that DPFP's forward search is conducted in a dual space of cumulative distribution functions; in particular, DPFP's key novelty is its search of the different splittings of the cumulative distribution functions entering a particular state (e.g., see the splitting of $F((s_0, s_1))$ in Figure 3.4). This difference leads to a novel approach in DPFP's allocation of effort in determining a policy: less effort is spent on regions of the state space reachable with lower likelihood (e.g., in Figure 3.4 more effort is spent on time interval $[t_1, t_3]$ than on time interval $[t_3, t_4]$). While this effort allocation idea in DPFP differs from HAO*'s reachability analysis, comparing the runtime efficiency of DPFP and HAO* remains an exciting issue for future work. Such a comparison may potentially lead to the creation of a hybrid algorithm combining these two approaches.
On the other hand, DPFP's idea to perform a forward search in continuous state-spaces is similar to [Ng et al., 1999] and [Varakantham et al., 2006], but there are important differences. In particular, [Ng et al., 1999] exploits approximate probability density propagation in the context of gradient descent algorithms for searching a space of MDP and POMDP stochastic controllers. Furthermore, for POMDPs and Distributed POMDPs, [Varakantham et al., 2005, 2006] use Lagrangian methods to efficiently calculate the admissible probability ranges, to speed up the search for policies. In contrast, DPFP's tailoring to continuous resource MDPs allows it to exploit the underlying dual space of cumulative distribution functions.
Finally, to address the problem of large runtimes common to value iteration algorithms, a popular approach is to perform policy iteration in the space of basis functions approximating the underlying value functions. This approach assumes that the value function has a shape that can be closely approximated with a linear combination of parametrized basis functions [Lagoudakis and Parr, 2003], [Hauskrecht and Kveton, 2004]. Unfortunately, the selection of the correct family of basis functions is problematic; for example, [Nikovski and Brand, 2003] uses Gaussian basis functions to allow for efficient convolution operations, [Mahadevan and Maggioni, 2007] uses spectral analysis to construct proto-value functions that exhibit good characteristics in spatial graphs (assumed to be known), whereas [Petrik, 2007] uses Krylov basis functions that provide a good approximation of the underlying discounted reward-to-go matrices. However, all these algorithms choose their family of basis functions before the policy iteration starts, which can lead to low-quality solutions.
7.2 Multiagent Systems
Planning with resources in multiagent systems has received a lot of attention in recent years, due to the increasing popularity of multi-agent domains that require agent coordination under resource constraints [Raja and Lesser, 2003], [Becker et al., 2003], [Lesser et al., 2004], [Musliner et al., 2006]. For the purpose of planning with continuous resources in such domains, one could encode the amount of resource left in the description of the state using the Decentralized Markov Decision Processes model (DEC-MDPs) [Goldman and Zilberstein, 2003], the Multiagent Team Decision Problem with Communication model (COM-MTDP) [Pynadath and Tambe, 2002] or the Partially Observable Stochastic Games model (POSGs) [Hansen et al., 2004].
7.2.1 Globally Optimal Solutions
Unfortunately, the problem of solving DEC-MDPs, COM-MTDPs or POSGs optimally has been proven to be NEXP-complete [Bernstein et al., 2000] and hence, more tractable subclasses of these models have been studied extensively. One example of such a subclass is the Networked Distributed, Partially Observable MDP model (ND-POMDP) [Nair et al., 2005], in which the agents are transition and observation independent, but share a joint reward structure that gives them an incentive to act in a coordinated fashion. In order to solve ND-POMDPs optimally, [Varakantham et al., 2007] suggested an efficient branch and bound technique combined with an MDP heuristic that provides an upper bound on the value of a joint policy of a subgroup of agents. More recently, [Marecki et al., 2008] proposed an algorithm (FANS) which builds on the algorithm of [Varakantham et al., 2007]. FANS' use of finite state machines for policy representation allows for varying policy expressivity for different agents, according to their needs. That in turn allows FANS to scale up to domains involving double digit agent numbers and double digit time horizons.
Another important model is the Transition Independent, Decentralized MDP (TI-DEC-MDP) framework [Becker et al., 2003], a subclass of DEC-MDPs. In order to solve TI-DEC-MDPs optimally, [Becker et al., 2003] proposed an algorithm (CSA) that exploits a geometrical representation of the underlying reward structure. For two agent TI-DEC-MDPs, an even more efficient algorithm has been identified [Petrik and Zilberstein, 2007]; the proposed algorithm casts the planning problems as bilinear programs and then uses standard MILP toolkits to solve them optimally. Finally, the assumption about transition independence of TI-DEC-MDPs has been relaxed in the Decentralized MDP with Event Driven Interactions model [Becker et al., 2004], in which the agents are allowed to interact with each other at a fixed number of time points. Although [Becker et al., 2004] proved that the new model can be solved optimally with a modified version of the CSA algorithm, the runtime of the proposed algorithm has been shown to be double exponential in the number of time points at which the agents are allowed to interact. In contrast, the CR-DEC-MDP model in this thesis allows the agents to interact at any point in time from a continuous time interval.
7.2.2 Locally Optimal Solutions
Due to the difficulty that many globally optimal algorithms face in scaling up to large domains, recent years have seen an increasing popularity of locally optimal algorithms. In particular, for solving ND-POMDPs, [Nair et al., 2005] proposed a synergistic approach that combines the strengths of existing distributed constraint optimization algorithms [Modi et al., 2003] with the advantages of dynamic programming techniques, to scale up to domains with hundreds of agents. On the other hand, for solving Decentralized POMDPs, [Seuken and Zilberstein, 2007] proposed to combine the standard bottom-up, value iteration approach with a forward search guided by a portfolio of heuristics, to scale up to domains with large time horizons.
In the context of Decentralized MDPs, [Musliner et al., 2006] suggested using a fast heuristic to estimate the state value, to focus the search on the most promising parts of the state-space. A similar idea was used in the OC-DEC-MDP algorithm [Beynier and Mouaddib, 2005, 2006], which uses the opportunity cost to estimate the state value. The OC-DEC-MDP algorithm is particularly notable, as it has been shown to scale up to domains with hundreds of tasks and double digit time horizons. Additionally, OC-DEC-MDP is unique in its ability to address both temporal constraints and uncertain method execution durations, which is an important factor for real-world domains. However, OC-DEC-MDP is still slow because, similarly to the previously mentioned models, it discretizes the resource levels.
To summarize, research in developing algorithms for planning under uncertainty in multi-agent systems has progressed rapidly in recent years. However, the vast majority of proposed algorithms resort to discretization when dealing with continuous resources. That approach in turn has one major drawback: the discretization process invalidates the solution quality guarantees established for these algorithms. To date, only [Benazera, 2007] has proposed a multi-agent framework (TI-DEC-HMDP) that allows for optimal planning with continuous resources in a multiagent setting. However, TI-DEC-HMDP's assumption that agents are transition independent can be problematic in many important multiagent domains.
Chapter 8: Conclusions
8.1 Summary
Recent advances in robotics have made aerial, underwater and terrestrial unmanned autonomous
vehicles possible. Many domains for which such unmanned vehicles are constructed are uncertain
and exhibit inherent continuous characteristics, such as time required to act or other continuous
resources at the disposal of the vehicles, e.g. battery power. Therefore, fast construction of
efficient plans for agents acting in such domains characterized by constrained and continuous resources has been a major challenge for AI research. Such rich domains can be modeled as Markov decision processes (MDPs) with continuous resources, which can then be solved in order
to construct optimal policies for agents acting in these domains.
This thesis addressed two major unresolved problems in continuous resource MDPs. First, they are very difficult to solve: existing algorithms are either fast, but make additional restrictive assumptions about the model, or avoid these assumptions but are very inefficient. Second, the continuous resource MDP framework is not directly applicable to multi-agent systems, and current approaches commonly discretize resource levels or assume deterministic resource consumption, which automatically invalidates the formal solution quality guarantees. My thesis addressed these unresolved problems in three key contributions:
To speed up the search for policies for continuous resource MDPs, I developed an algorithm called CPH. CPH solves the planning problems at hand by first approximating, with a desired accuracy, the probability distributions over resource consumptions with phase-type distributions, which use exponential distributions as building blocks. It then uses value iteration to solve the resulting MDPs more efficiently than its closest competitors.

To improve the anytime performance of CPH and other continuous resource MDP solvers, I developed DPFP, an algorithm that solves the planning problems at hand by performing a forward search in the corresponding dual space of cumulative distribution functions. In doing so, DPFP discriminates in its policy generation effort, providing only approximate policies for regions of the state-space reachable with low probability, yet it bounds the error that such approximation entails.

For planning with continuous resources in multi-agent systems, I proposed a novel framework called CR-DEC-MDP and developed two algorithms for solving CR-DEC-MDPs: The first algorithm (VFP) emphasizes scalability. It performs a series of policy iterations in order to quickly find a locally optimal policy. In contrast, the second algorithm (M-DPFP) stresses optimality; it allows for a systematic trade-off of solution quality for speed by using the idea of a forward search in a dual space of cumulative distribution functions in a multi-agent setting.
The empirical evaluation of my algorithms revealed a speedup of up to three orders of magnitude when solving single agent planning problems and up to one order of magnitude when solving multi-agent planning problems. Furthermore, I demonstrated the practical use of the CPH algorithm in the large-scale disaster simulation DEFACTO. In this context, CPH has contributed to a significant improvement in the efficiency of a simulated disaster rescue operation.
8.2 Future Work
In the future one can imagine that agents will be seamlessly integrated with our society: in offices, in supply chains, in electronic commerce, in the gaming industry and in all our activities. In fact, many breathtaking agent applications are just around the corner: Robotic cars could soon drive without human intervention; sensor networks could instantly identify and respond to phenomena in the natural environment; and software agents could help to fully automate first responders' missions.

These exciting future single and multiagent applications emphasize the research challenge of planning with uncertainty and continuous resources, but in a very large-scale environment. In addressing these challenges I envision continuing to build on some of the basic insights developed in my thesis: exploiting agent interaction structure and tailoring the computation to continuous variables, to fundamentally alter the complexity landscape in addressing such large-scale domains.
In the short run, my goal remains to develop new single and multiagent frameworks for reasoning under uncertainty that will (i) scale up the current state of the art by one order of magnitude, e.g. from the 10s of agents in agent networks to 100s of such agents, and (ii) allow for multiple depleting or replenishing continuous resources. Such reasoning is important for some of the domains mentioned above, including mobile or stationary sensor networks.
In the longer run, it will be crucial to develop adaptive algorithms that tailor themselves to domain structure, or exploit it, autonomously, rather than relying on human developers' insights. And as I pursue this research trajectory, just as I have done in my thesis work, I will strive to balance the dual goals of developing fundamental research and grounding these ideas in concrete domains.
Bibliography
E. Altman. Constrained Markov Decision Processes. Chapman and Hall, 1999.
S. Asmussen, O. Nerman, and M. Olsson. Fitting phase-type distributions via the EM algorithm.
Scandinavian Journal of Statistics, 23:419–441, 1996.
R. Beard and T. McLain. Multiple UAV cooperative search under collision avoidance and limited
range communication constraints. In IEEE CDC, pages 25–30, 2003.
R. Becker, S. Zilberstein, V. Lesser, and C. V. Goldman. Transition-Independent Decentralized
Markov Decision Processes. In AAMAS, pages 41–48, 2003.
R. Becker, V. Lesser, and S. Zilberstein. Decentralized MDPs with Event-Driven Interactions. In
AAMAS, pages 302–309, 2004.
R.E. Bellman. Dynamic Programming. Princeton University Press, Princeton, 1957.
E. Benazera. Solving decentralized continuous Markov decision problems with structured reward.
In Künstliche Intelligenz, 2007.
D. S. Bernstein, S. Zilberstein, and N. Immerman. The complexity of decentralized control of
Markov decision processes. In UAI, pages 32–37, 2000.
A. Beynier and A. Mouaddib. A polynomial algorithm for decentralized Markov decision pro-
cesses with temporal constraints. In AAMAS, pages 963–969, 2005.
A. Beynier and A. Mouaddib. An iterative algorithm for solving constrained decentralized
Markov decision processes. In AAAI, pages 1089–1094, 2006.
D. Blidberg. The development of autonomous underwater vehicles (AUVs); a brief summary. In
ICRA, 2001.
A. Bobbio and A. Cumani. ML estimation of the parameters of a PH distribution in triangular
canonical form. Computer Performance Evaluation: Modelling Techniques and Tools, pages
33–46, 1992.
C. Boutilier, R. Dearden, and M. Goldszmidt. Exploiting structure in policy construction.
In Chris Mellish, editor, Proceedings of the Fourteenth International Joint Conference on
Artificial Intelligence, pages 1104–1111, San Francisco, 1995. Morgan Kaufmann. URL
citeseer.ist.psu.edu/boutilier95exploiting.html.
C. Boutilier, R. Dearden, and M. Goldszmidt. Stochastic dynamic programming
with factored representations. Artificial Intelligence, 121(1-2):49–107, 2000. URL
citeseer.ist.psu.edu/boutilier99stochastic.html.
J. Boyan and M. Littman. Exact solutions to time-dependent MDPs. In NIPS, pages 1026–1032,
2000.
J. Bresina, R. Dearden, N. Meuleau, D. Smith, and R. Washington. Planning under continuous
time and resource uncertainty: A challenge for AI. In UAI, 2002.
D.R. Cox. A use of complex probabilities in the theory of stochastic processes. Proceedings of
the Cambridge Philosophical Society 51, 2:313–319, 1955.
K. Decker and V. Lesser. Designing a Family of Coordination Algorithms. ICMAS-95, January
1995. URL http://mas.cs.umass.edu/paper/30.
A.P. Dempster, N.M. Laird, and D.B. Rubin. Maximum likelihood from incomplete data via the
em algorithm. Journal of the Royal Statistical Society. Series B (Methodological) 39, 1:1–38,
1977.
D. A. Dolgov and E. H. Durfee. Stationary deterministic policies for constrained mdps with
multiple rewards, costs, and discount factors. In Proceedings of the Nineteenth International
Joint Conference on Artificial Intelligence (IJCAI-05), pages 1326–1332, Edinburgh, Scotland,
August 2005.
D. A. Dolgov and E. H. Durfee. Symmetric primal-dual approximate linear programming for fac-
tored MDPs. In Proceedings of the Ninth International Symposiums on Artificial Intelligence
and Mathematics (AI&M 2006), Florida, January 2006.
A.K. Erlang. Solution of some problems in the theory of probabilities of significance in automatic
telephone exchanges. The Post Office Electrical Engineers Journal, 10:189–197, 1917.
Z. Feng, R. Dearden, N. Meuleau, and R. Washington. Dynamic programming for structured
continuous MDPs. In UAI, 2004.
C. Goldman and S. Zilberstein. Optimizing information exchange in cooperative multi-agent
systems, 2003.
C. Guestrin, M. Hauskrecht, and B. Kveton. Solving factored mdps with continuous and discrete
variables, 2004.
E. Hansen, D. Bernstein, and S. Zilberstein. Dynamic programming for par-
tially observable stochastic games. In Proceedings of the Nineteenth National
Conference on Artificial Intelligence (AAAI-04), pp. 709–715., 2004. URL
citeseer.ist.psu.edu/hansen04dynamic.html.
M. Hauskrecht and B. Kveton. Linear program approximations for factored continuous-state
MDPs. In Advances in Neural Information Processing Systems 16. MIT Press, 2004.
R.A. Howard. Dynamic Programming and Markov Processes. John Wiley and Sons, New York,
1960.
R.A. Howard. Dynamic Probabilistic Systems, volume 2. John Wiley and Sons, New York, 1971.
M.A. Johnson. Matching moments to phase distributions: Nonlinear programming approaches.
Communications in Statistics - Stochastic Models 6, 2:259–281, 1990.
M. J. Kearns, Y. Mansour, and A. Y. Ng. A sparse sampling algorithm for near-optimal
planning in large Markov decision processes. In IJCAI, pages 1324–1331, 1999. URL
citeseer.ist.psu.edu/kearns01sparse.html.
S. Kullback and R.A. Leibler. On information and sufficiency. Annals of Mathematical Statistics
22, 1:79–86, 1951.
M. Lagoudakis and R. Parr. Least-squares policy iteration, 2003.
V. Lesser, K. Decker, T. Wagner, N. Carver, A. Garvey, B. Horling, D. Neiman, R. Podor-
ozhny, M. NagendraPrasad, A. Raja, R. Vincent, P. Xuan, and X.Q. Zhang. Evolution of
the GPGP/TAEMS Domain-Independent Coordination Framework. Autonomous Agents and
Multi-Agent Systems, 9(1):87–143, July 2004.
L. Li and M. Littman. Lazy approximation for solving continuous finite-horizon MDPs. In AAAI,
pages 1175–1180, 2005.
Y. Liu and S. Koenig. Risk-sensitive planning with one-switch utility functions: Value iteration.
In AAAI, pages 993–999, 2005.
D. J. C. MacKay. Introduction to Monte Carlo methods. In M. I. Jordan, editor, Learning in
Graphical Models, NATO Science Series, pages 175–204. Kluwer Academic Press, 1998.
S. Mahadevan and M. Maggioni. Proto-value functions: A Laplacian framework for learning rep-
resentation and control in Markov decision processes. Journal of Machine Learning Research,
8:2169–2231, 2007.
J. Marecki, N. Schurr, M. Tambe, and P. Scerri. Dangers in multiagent rescue using defacto.
In Proceedings of 2nd International Workshop on Safety and Security in Multiagent Systems
held at the 4th International Joint Conference on Autonomous Agents and Multiagent Systems,
2005.
J. Marecki, S. Koenig, and M. Tambe. A fast analytical algorithm for solving Markov decision
processes with real-valued resources. In IJCAI, 2007. To Appear.
J. Marecki, T. Gupta, P. Varakantham, M. Tambe, and M. Yokoo. Not all agents are equal: Scaling
up distributed pomdps for agent networks. In Proceedings of the Seventh International Joint
Conference on Autonomous Agents and Multi-agent Systems (AAMAS-08), 2008.
R. Marie. Calculating equilibrium probabilities for λ(n)/C_k/1/N queues. In Proceedings of the 7th
IFIP W.G.7.3 International Symposium on Computer Performance Modelling, Measurement
and Evaluation, pages 117–125. ACM SIGMETRICS, 1980.
Mausam, E. Benazera, R. I. Brafman, N. Meuleau, and E. A. Hansen. Planning with continuous
resources in stochastic domains. In IJCAI, pages 1244–1251, 2005.
P. Modi, W. Shen, M. Tambe, and M. Yokoo. An asynchronous complete
method for distributed constraint optimization. In AAMAS, 2003. URL
citeseer.comp.nus.edu.sg/549918.html.
D. Musliner, E. Durfee, J. Wu, D. Dolgov, R. Goldman, and M. Boddy. Coordinated plan man-
agement using multiagent MDPs. In AAAI Spring Symposium, 2006.
R. Nair, M. Tambe, M. Yokoo, D. Pynadath, and S. Marsella. Taming decentralized POMDPs:
Towards efficient policy computation for multiagent settings. In IJCAI, pages 705–711, 2003.
R. Nair, M. Roth, M. Yokoo, and M. Tambe. Communication for improving policy com-
putation in distributed pomdps. In Proceedings of the Third International Joint Con-
ference on Agents and Multiagent Systems (AAMAS-04), pages 1098–1105, 2004. URL
citeseer.ist.psu.edu/nair04communication.html.
R. Nair, P. Varakantham, M. Tambe, and M. Yokoo. Networked distributed POMDPs: A synergy
of distributed constraint optimization and POMDPs. In IJCAI, pages 1758–1760, 2005.
M. Neuts. Matrix-Geometric Solutions in Stochastic Models. John Hopkins University Press,
1981.
A. Y. Ng, D. Harada, and S. Russell. Policy invariance under reward transformations:
theory and application to reward shaping. In Proc. 16th International Conf. on Ma-
chine Learning, pages 278–287. Morgan Kaufmann, San Francisco, CA, 1999. URL
citeseer.ist.psu.edu/article/ng99policy.html.
D. Nikovski and M. Brand. Non-linear stochastic control in continuous state spaces by exact
integration in Bellman’s equations. In ICAPS, 2003.
T. Osogami and M. Harchol-Balter. A closed-form solution for mapping general distributions
to minimal PH distributions. In Proceedings of the 13th International Conference on Modelling
Techniques and Tools for Computer Performance Evaluation, pages 200–217. Springer, 2003.
P. Paruchuri, M. Tambe, F. Ordonez, and S. Kraus. Towards a for-
malization of teamwork with resource constraints, 2004. URL
citeseer.ist.psu.edu/article/paruchuri04towards.html.
L. Pedersen, D.E. Smith, M. Deans, R. Sargent, C. Kunz, D. Lees, and S. Rajagopalan. Mis-
sion planning and target tracking for autonomous instrument placement. In IEEE Aerospace
Conference, pages 34–51, 2005.
M. Petrik. An analysis of laplacian methods for value function approximation in mdps. In IJCAI,
pages 2574–2579, 2007.
M. Petrik and S. Zilberstein. Anytime coordination using separable bilinear programs. In AAAI,
2007.
M. Puterman. Markov decision processes. John Wiley and Sons, New York, 1994.
D. V. Pynadath and M. Tambe. The communicative multiagent team decision problem: Analyzing
teamwork theories and models. In JAIR, volume 16, pages 389–432, 2002.
A. Raja and V. Lesser. Efficient meta-level control in bounded-rational agents. In AAMAS '03:
Proceedings of the second international joint conference on Autonomous agents and multiagent
systems, pages 1104–1105, New York, NY, USA, 2003. ACM Press.
C.H. Sauer and K.M. Chandy. Approximate analysis of central server models. IBM Journal of
Research and Development 19, 3:301–313, 1975.
N. Schurr, J. Marecki, P. Scerri, J. Lewi, and M. Tambe. The defacto system: Coordinating
human-agent teams for the future of disaster response, 2005.
N. Schurr, J. Marecki, and M. Tambe. Riaact: A robust approach to adjustable autonomy for
human-multiagent teams. In Short Paper in Proceedings of the Seventh International Joint
Conference on Autonomous Agents and Multi-agent Systems (AAMAS-08), 2008.
S. Seuken and S. Zilberstein. Memory-bounded dynamic programming for dec-pomdps. In
IJCAI, 2007.
M. Telek and A. Heindl. Matching moments for acyclic discrete and continuous phase-type distri-
butions of second order. International Journal of Simulation Systems, Science and Technology
3, 3–4:47–57, 2002.
S. Thrun, M. Montemerlo, H. Dahlkamp, D. Stavens, A. Aron, J. Diebel, P. Fong, J. Gale,
M. Halpenny, G. Hoffmann, K. Lau, C. Oakley, M. Palatucci, V. Pratt, P. Stang, S. Strohband,
C. Dupont, L.-E. Jendrossek, C. Koelen, C. Markey, C. Rummel, J. van Niekerk, E. Jensen,
P. Alessandrini, G. Bradski, B. Davies, S. Ettinger, A. Kaehler, A. Nefian, and P. Mahoney.
Winning the DARPA Grand Challenge. Journal of Field Robotics, 2006. Accepted for publication.
P. Varakantham, R. Maheswaran, and M. Tambe. Exploiting belief bounds: Practical POMDPs
for personal assistant agents. In Proceedings of AAMAS-05, pages 978–985, 2005. URL
citeseer.ist.psu.edu/varakantham05exploiting.html.
P. Varakantham, R. Nair, M. Tambe, and M. Yokoo. Winning back the cup for distributed
POMDPs: Planning over continuous belief spaces. In Proceedings of AAMAS-06, pages 289–
296, May 2006.
P. Varakantham, J. Marecki, M. Tambe, and M. Yokoo. Letting loose a spider on a network of
pomdps: Generating quality guaranteed policies. In Proceedings of AAMAS-07, May 2007.
H. Younes. Planning and execution with phase transitions. In AAAI, 2005.
H. Younes and R. Simmons. Solving generalized semi-MDPs using continuous phase-type dis-
tributions. In AAAI, pages 742–747, 2004.
Appendix A: Phase Type Distribution
This appendix introduces the formal definition of a phase-type distribution. Let $E(\lambda)$ denote an exponential probability density function given by the formula $E(\lambda)(t) = \lambda e^{-\lambda t}$, where $\lambda > 0$ is the exit rate parameter. Also, let $s_1, \ldots, s_m$ denote the transitory states (also called phases) and $s_{m+1}$ denote the absorbing state of a Markovian process; the transitory states are not the real states of an MDP, and they are only introduced for the purposes of phase-type approximation. Furthermore, let $\vec{\tau} = (p_1, \ldots, p_m)$ with $\sum_{i=1}^{m} p_i = 1$ specify the initial discrete distribution over transitory states, i.e., a Markovian process starts in transitory state $s_i$ with probability $p_i$. Finally, let $p_{i,j}$ be the probability that the process will transition from state $s_i$ to state $s_j$, for a transition whose duration follows $E(\lambda_i)$, where $\lambda_i$ is the exit rate associated with state $s_i$; for all $i = 1, \ldots, m$ we have $\sum_{j=1}^{m} p_{i,j} = 1 - q_i$, where $q_i$ is the probability of transitioning to the absorbing state $s_{m+1}$ from phase $s_i$. In order to write down the formula for a phase-type distribution, the probabilities $p_{i,j}$ and exit rates $\lambda_i$ are first encoded in an infinitesimal generator matrix $Q$ given by:
$$
Q =
\begin{pmatrix}
-\lambda_1(1 - p_{1,1}) & \lambda_1 p_{1,2} & \lambda_1 p_{1,3} & \cdots & \lambda_1 p_{1,m-1} & \lambda_1 p_{1,m} \\
\lambda_2 p_{2,1} & -\lambda_2(1 - p_{2,2}) & \lambda_2 p_{2,3} & \cdots & \lambda_2 p_{2,m-1} & \lambda_2 p_{2,m} \\
\lambda_3 p_{3,1} & \lambda_3 p_{3,2} & -\lambda_3(1 - p_{3,3}) & \ddots & \lambda_3 p_{3,m-1} & \lambda_3 p_{3,m} \\
\vdots & \vdots & \ddots & \ddots & \vdots & \vdots \\
\lambda_{m-1} p_{m-1,1} & \lambda_{m-1} p_{m-1,2} & \lambda_{m-1} p_{m-1,3} & \cdots & -\lambda_{m-1}(1 - p_{m-1,m-1}) & \lambda_{m-1} p_{m-1,m} \\
\lambda_m p_{m,1} & \lambda_m p_{m,2} & \lambda_m p_{m,3} & \cdots & \lambda_m p_{m,m-1} & -\lambda_m(1 - p_{m,m})
\end{pmatrix}
\tag{A.1}
$$
A phase-type distribution $f(\vec{\tau}, Q)$, which specifies the probabilities of entering the absorbing state $s_{m+1}$ over time, is then given by:

$$
f(\vec{\tau}, Q)(t) = \vec{\tau}\, e^{Qt} (-Q \vec{1}) \tag{A.2}
$$

where $\vec{1}$ is the unit column vector of size $m$. The corresponding cumulative distribution function is then given by:

$$
\int_{-\infty}^{t} f(\vec{\tau}, Q)(t')\, dt' = 1 - \vec{\tau}\, e^{tQ}\, \vec{1} \tag{A.3}
$$
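As a numerical illustration of Equations A.2 and A.3, the following is a minimal sketch, not part of the thesis implementation: it assumes numpy and scipy are available, and the 2-phase example values (tau and Q below) are hypothetical. It evaluates a phase-type density and its cumulative distribution function via the matrix exponential.

```python
import numpy as np
from scipy.linalg import expm

def phase_type_pdf(tau, Q, t):
    """f(tau, Q)(t) = tau * exp(Q t) * (-Q 1), as in Equation A.2."""
    ones = np.ones(Q.shape[0])
    return float(tau @ expm(Q * t) @ (-Q @ ones))

def phase_type_cdf(tau, Q, t):
    """F(t) = 1 - tau * exp(t Q) * 1, as in Equation A.3."""
    ones = np.ones(Q.shape[0])
    return float(1.0 - tau @ expm(Q * t) @ ones)

# Hypothetical 2-phase example: phase 1 (rate 1.0) always continues to phase 2 (rate 2.0),
# from which the process is absorbed.
tau = np.array([1.0, 0.0])
Q = np.array([[-1.0, 1.0],
              [0.0, -2.0]])
print(phase_type_pdf(tau, Q, 1.5), phase_type_cdf(tau, Q, 1.5))
```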
Appendix B: Classes of Phase Type Distributions
Thanks to their analytical properties, some classes of phase-type distributions have been given
particular attention. We now illustrate some of the most important classes and show their gener-
ator matrices Q.
Exponential distribution: If $m = 1$, $p_1 = 1$ and $p_{1,1} = 0$ then the phase-type distribution $f(\vec{\tau}, Q) = \lambda_1 e^{-\lambda_1 t} = E(\lambda_1)$ is an exponential distribution. Here, the generator matrix $Q$ is simply $Q = (-\lambda_1)$ for a given exit rate parameter $\lambda_1$.
Erlang distribution [Erlang, 1917] is a phase-type distribution determined by two parameters: the number of phases $m$ and the common exit rate $\lambda$. Intuitively, an m-phase Erlang distribution is a deterministic sequence of $m$ exponential distributions, each sampled once from $E(\lambda)$. Formally, $p_1 = 1$, $p_i = 0$ for $i = 2, \ldots, m$, $p_{i,i+1} = 1$ and $\lambda_i = \lambda$ for all $i = 1, \ldots, m$. The generator matrix $Q$ for the m-phase Erlang distribution is then the following:
$$
Q =
\begin{pmatrix}
-\lambda & \lambda & 0 & \cdots & 0 & 0 \\
0 & -\lambda & \lambda & \cdots & 0 & 0 \\
0 & 0 & -\lambda & \ddots & 0 & 0 \\
\vdots & \vdots & & \ddots & \ddots & \vdots \\
0 & 0 & 0 & \cdots & -\lambda & \lambda \\
0 & 0 & 0 & \cdots & 0 & -\lambda
\end{pmatrix}
\tag{B.1}
$$
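For concreteness, here is a short sketch (assuming numpy; not taken from the thesis code) that constructs the m-phase Erlang generator matrix of Equation B.1.

```python
import numpy as np

def erlang_generator(m, lam):
    Q = -lam * np.eye(m)          # every phase has exit rate lambda
    for i in range(m - 1):
        Q[i, i + 1] = lam         # phase i always continues to phase i+1
    return Q                      # from the last phase the process is absorbed

print(erlang_generator(4, 2.0))
```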
Generalized Erlang distribution is a generalization of the Erlang distribution that allows the process to transition from state $s_1$ directly to the absorbing state $s_{m+1}$, that is, $q_1 \geq 0$. Formally, $p_1 = 1$, $p_i = 0$ for $i = 2, \ldots, m$, $\lambda_i = \lambda$ for all $i = 1, \ldots, m$ and $p_{i,i+1} = 1$ for all $i = 2, \ldots, m$. However, $p_{1,2} = p$ and $p_{1,j} = 0$ for $j = 3, \ldots, m$, for a given probability $p \geq 0$. The generator matrix $Q$ for the m-phase generalized Erlang distribution is then the following:
$$
Q =
\begin{pmatrix}
-\lambda & p\lambda & 0 & \cdots & 0 & 0 \\
0 & -\lambda & \lambda & \cdots & 0 & 0 \\
0 & 0 & -\lambda & \ddots & 0 & 0 \\
\vdots & \vdots & & \ddots & \ddots & \vdots \\
0 & 0 & 0 & \cdots & -\lambda & \lambda \\
0 & 0 & 0 & \cdots & 0 & -\lambda
\end{pmatrix}
\tag{B.2}
$$
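A similarly hedged sketch (assuming numpy) of the generalized Erlang generator of Equation B.2: it differs from the Erlang generator only in the (1,2) entry, so that with probability 1 - p the process is absorbed directly from the first phase; with p = 1 it reduces to Equation B.1.

```python
import numpy as np

def generalized_erlang_generator(m, lam, p):
    Q = -lam * np.eye(m)
    for i in range(1, m - 1):
        Q[i, i + 1] = lam         # phases 2..m-1 always continue to the next phase
    if m > 1:
        Q[0, 1] = p * lam         # phase 1 continues with probability p, else it is absorbed
    return Q
```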
Coxian distribution [Cox, 1955] is a further generalization of the generalized Erlang distribution that allows the process to transition from each state $s_i$ directly to the absorbing state $s_{m+1}$, that is, $q_i \geq 0$ for $i = 1, \ldots, m$. Also, the transition duration distributions are allowed to be different exponential distributions. Formally, an m-phase Coxian distribution is specified by $2m - 1$ parameters: exit rates $\lambda_1, \ldots, \lambda_m$ and probabilities $p_{i,i+1}$ for $i = 1, \ldots, m-1$. It assumes that $p_1 = 1$ and $p_{i,j} = 0$ for all $i = 1, \ldots, m$ and $j \in \{1, \ldots, m\} \setminus \{i+1\}$. The generator matrix $Q$ for the m-phase Coxian distribution is then the following:
$$
Q =
\begin{pmatrix}
-\lambda_1 & p_{1,2}\lambda_1 & 0 & \cdots & 0 & 0 \\
0 & -\lambda_2 & p_{2,3}\lambda_2 & \cdots & 0 & 0 \\
0 & 0 & -\lambda_3 & \ddots & 0 & 0 \\
\vdots & \vdots & & \ddots & \ddots & \vdots \\
0 & 0 & 0 & \cdots & -\lambda_{m-1} & p_{m-1,m}\lambda_{m-1} \\
0 & 0 & 0 & \cdots & 0 & -\lambda_m
\end{pmatrix}
\tag{B.3}
$$
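And a sketch (again assuming numpy, with purely illustrative parameter values) of the Coxian generator of Equation B.3, parametrized by per-phase exit rates and continuation probabilities.

```python
import numpy as np

def coxian_generator(lams, probs):
    """lams: m exit rates; probs: m-1 probabilities of continuing to the next phase."""
    m = len(lams)
    assert len(probs) == m - 1
    Q = np.diag([-l for l in lams])
    for i in range(m - 1):
        Q[i, i + 1] = probs[i] * lams[i]   # with probability 1 - p_{i,i+1} the process is absorbed
    return Q

print(coxian_generator([2.0, 1.0, 0.5], [0.7, 0.4]))
```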
Appendix C: Fitting to Phase-Type Distributions
There are two leading approaches for determining the coefficients of the generator matrix $Q$ when fitting to phase-type distributions: the method of moments and the method of shapes. The method of moments aims to approximate the first few moments of the approximated distribution with a phase-type distribution. The $n$th moment of a real-valued probability density function $f(x)$ of a random variable $X$ is defined as:

$$
\mu_n = E[X^n] = \int_{-\infty}^{\infty} x^n f(x)\, dx. \tag{C.1}
$$
In particular, the mean of a distribution is the first moment of this distribution, whereas the variance $\sigma^2$ of a distribution can be derived with the help of the first two moments:

$$
\sigma^2 = E[(X - \mu_1)^2] = E[X^2] - 2\mu_1 E[X] + \mu_1^2 = \mu_2 - 2\mu_1^2 + \mu_1^2 = \mu_2 - \mu_1^2. \tag{C.2}
$$
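As a small numerical illustration of Equations C.1 and C.2, the following sketch (assuming numpy/scipy; the Weibull target density is only an example, not a distribution used in the thesis) computes the first two moments by quadrature and recovers the variance as mu_2 - mu_1^2.

```python
import numpy as np
from scipy import integrate, stats

f = stats.weibull_min(c=1.5).pdf                              # example target density
mu1, _ = integrate.quad(lambda x: x * f(x), 0, np.inf)        # first moment (mean)
mu2, _ = integrate.quad(lambda x: x**2 * f(x), 0, np.inf)     # second moment
variance = mu2 - mu1**2                                       # Equation C.2
print(mu1, mu2, variance)
```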
Although the first moment of a distribution is easily matched by the mean of this distribution, matching subsequent moments is a non-trivial task. In particular, for a squared coefficient of variation defined as $cv^2 := \sigma^2/\mu_1^2$, we match the first two moments of a distribution depending on the value of $cv^2$. If $cv^2 < 1$ we use the $n$-phase generalized Erlang distribution with $n = \lceil 1/cv^2 \rceil$, $\lambda = (1 - p + np)/\mu_1$ and $p = 1 - \bigl(2n\,cv^2 + n - 2 - \sqrt{n^2 + 4 - 4n\,cv^2}\bigr)/\bigl(2(n-1)(cv^2 + 1)\bigr)$ [Sauer and Chandy, 1975; Marie, 1980]. If, on the other hand, $cv^2 \geq 1/2$ we use a 2-phase Coxian distribution with $\lambda_1 = 2/\mu_1$, $\lambda_2 = 1/(\mu_1\,cv^2)$ and $p = 1/(2\,cv^2)$ [Marie, 1980]. Depending on the values of $cv^2$ and $\mu_3$, there are also several approaches for matching the first three moments of a distribution [Johnson, 1990; Telek and Heindl, 2002; Osogami and Harchol-Balter, 2003].
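The two-moment matching rule just described can be sketched as follows. This is a non-authoritative transcription of the formulas above attributed to [Sauer and Chandy, 1975; Marie, 1980]; when both cases apply (1/2 <= cv^2 < 1), this sketch arbitrarily prefers the generalized Erlang branch.

```python
import math

def match_two_moments(mu1, sigma2):
    cv2 = sigma2 / mu1**2
    if cv2 < 1.0:
        # n-phase generalized Erlang: n = ceil(1/cv^2), then p and lambda as above
        n = math.ceil(1.0 / cv2)
        p = 1.0 - (2*n*cv2 + n - 2 - math.sqrt(n**2 + 4 - 4*n*cv2)) / (2*(n - 1)*(cv2 + 1))
        lam = (1.0 - p + n*p) / mu1
        return ('generalized-erlang', n, lam, p)
    else:
        # 2-phase Coxian: lambda_1 = 2/mu_1, lambda_2 = 1/(mu_1 cv^2), p = 1/(2 cv^2)
        lam1 = 2.0 / mu1
        lam2 = 1.0 / (mu1 * cv2)
        p = 1.0 / (2.0 * cv2)
        return ('coxian-2', lam1, lam2, p)

print(match_two_moments(mu1=2.0, sigma2=1.0))   # cv^2 = 0.25 -> 4-phase Erlang with rate 2
print(match_two_moments(mu1=1.0, sigma2=4.0))   # cv^2 = 4    -> 2-phase Coxian
```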
Matching the first few moments of an initial distribution does not necessarily lead to a good approximation. For example, if a distribution is multi-modal and we match its first two moments, i.e., its mean and standard deviation, then the obtained approximation is still single-modal. In situations like these, it is better to approximate directly the shape of the initial distribution, by minimizing the distance between the initial distribution and its approximation. A common metric for determining the distance between two distributions $f$ and $g$ is the Kullback-Leibler divergence (KL-divergence) proposed in [Kullback and Leibler, 1951]. For two probability density functions $f$ and $g$, their KL-divergence $D_{KL}(f \| g)$ is defined as:

$$
D_{KL}(f \| g) = \int_{-\infty}^{\infty} f(x) \log \frac{f(x)}{g(x)}\, dx. \tag{C.3}
$$

The leading algorithms that approximate the initial distribution by minimizing the KL divergence are the EM (Expectation-Maximization) algorithm [Dempster et al., 1977] and the maximum likelihood estimation algorithm [Bobbio and Cumani, 1992]. Both algorithms take as input the desired number of phases used by the phase-type distribution as well as the desired structure of the phase-type distribution (e.g. Erlang, Coxian, etc.). In my implementation of CPH I have used the EM algorithm, thanks to its superior convergence properties [Asmussen et al., 1996].
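A minimal numerical sketch of the KL-divergence of Equation C.3 between a target density and a candidate approximation, computed by quadrature (assuming numpy/scipy; both densities below are illustrative placeholders, not the distributions used in CPH).

```python
import numpy as np
from scipy import integrate, stats

f = stats.weibull_min(c=1.5).pdf          # density being approximated (example target)
g = stats.gamma(a=2.0, scale=0.5).pdf     # candidate approximation (an Erlang-2 density)

# Integrand is taken to be 0 wherever f(x) = 0, by the usual convention 0*log(0) = 0.
kl, _ = integrate.quad(lambda x: f(x) * np.log(f(x) / g(x)) if f(x) > 0 else 0.0, 0, np.inf)
print(kl)
```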
Appendix D: Uniformization of Phase Type Distributions
As shown in Equation A.3, a phase-type distribution is uniquely specified by its generator matrix $Q$. This matrix, in its most general form, does not make any assumption about the exit rate parameters $\lambda_i$, which may be arbitrary. However, for many analytical operations, such as the ones performed by CPH, it is required that these exit rates are uniform, that is, $\lambda_1 = \ldots = \lambda_m = \lambda$. We now show a technique, called uniformization [Puterman, 1994], which transforms an arbitrary phase-type distribution into a phase-type distribution with uniform exit rates, that is, $\lambda_1 = \ldots = \lambda_m = \max_{i=1,\ldots,m} \lambda_i = \lambda$.

For each row $j$ of the generator matrix $Q$, let $k_j := (\max_{i=1,\ldots,m} \lambda_i)/\lambda_j$. Observe that the generator matrix $Q$ shown in Equation A.1 can also be written as:
$$
Q =
\begin{pmatrix}
-(\lambda_1 k_1)\left(1 - \tfrac{k_1 - 1 + p_{1,1}}{k_1}\right) & (\lambda_1 k_1)\tfrac{p_{1,2}}{k_1} & \cdots & (\lambda_1 k_1)\tfrac{p_{1,m}}{k_1} \\
(\lambda_2 k_2)\tfrac{p_{2,1}}{k_2} & -(\lambda_2 k_2)\left(1 - \tfrac{k_2 - 1 + p_{2,2}}{k_2}\right) & \cdots & (\lambda_2 k_2)\tfrac{p_{2,m}}{k_2} \\
\vdots & \vdots & \ddots & \vdots \\
(\lambda_m k_m)\tfrac{p_{m,1}}{k_m} & (\lambda_m k_m)\tfrac{p_{m,2}}{k_m} & \cdots & -(\lambda_m k_m)\left(1 - \tfrac{k_m - 1 + p_{m,m}}{k_m}\right)
\end{pmatrix}
\tag{D.1}
$$
Or simply as:
$$
Q =
\begin{pmatrix}
-\lambda'_1(1 - p'_{1,1}) & \lambda'_1 p'_{1,2} & \cdots & \lambda'_1 p'_{1,m} \\
\lambda'_2 p'_{2,1} & -\lambda'_2(1 - p'_{2,2}) & \cdots & \lambda'_2 p'_{2,m} \\
\vdots & \vdots & \ddots & \vdots \\
\lambda'_m p'_{m,1} & \lambda'_m p'_{m,2} & \cdots & -\lambda'_m(1 - p'_{m,m})
\end{pmatrix}
\tag{D.2}
$$
where $\lambda'_i := \lambda_i k_i$, $p'_{i,j} := p_{i,j}/k_i$ for $i \neq j$, and $p'_{i,i} := (k_i - 1 + p_{i,i})/k_i$ are the exit rates and state-to-state transition probabilities of a new phase-type distribution $f(\vec{\tau}, Q)$. Note that the new phase-type distribution has $\lambda'_i = \max_{i=1,\ldots,m} \lambda_i = \lambda$ for all $i = 1, \ldots, m$ and thus, the new phase-type distribution is uniformized.
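A sketch of this uniformization step (assuming numpy; the 2-phase example values are hypothetical): given per-phase exit rates and transition probabilities, it returns uniform exit rates and the adjusted probabilities, and checks that the generator matrix of Equation A.1 is unchanged.

```python
import numpy as np

def uniformize(lams, P):
    """lams: exit rates lambda_1..lambda_m; P: m x m matrix of p_{i,j} (rows may sum to < 1)."""
    lams = np.asarray(lams, dtype=float)
    P = np.asarray(P, dtype=float)
    lam = lams.max()
    k = lam / lams                                   # k_i = (max_j lambda_j) / lambda_i
    P_new = P / k[:, None]                           # p'_{i,j} = p_{i,j} / k_i for i != j
    diag = (k - 1.0 + np.diag(P)) / k                # p'_{i,i} = (k_i - 1 + p_{i,i}) / k_i
    np.fill_diagonal(P_new, diag)
    return np.full_like(lams, lam), P_new            # lambda'_i = lambda for all i

def generator(lams, P):
    """Q_{ii} = -lambda_i (1 - p_{i,i}), Q_{ij} = lambda_i p_{i,j}, as in Equation A.1."""
    Q = np.asarray(P, dtype=float) * np.asarray(lams, dtype=float)[:, None]
    np.fill_diagonal(Q, [-l * (1.0 - p) for l, p in zip(lams, np.diag(P))])
    return Q

# Hypothetical 2-phase example: the generator is identical before and after uniformization.
lams, P = [1.0, 3.0], [[0.0, 0.8], [0.1, 0.0]]
lams_u, P_u = uniformize(lams, P)
print(np.allclose(generator(lams, P), generator(lams_u, P_u)))   # True
```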
Abstract
My research concentrates on developing reasoning techniques for intelligent, autonomous agent systems. In particular, I focus on planning techniques for both single and multi-agent systems acting in uncertain domains. In modeling these domains, I consider two types of uncertainty: (i) the outcomes of agent actions are uncertain and (ii) the amount of resources consumed by agent actions is uncertain and only characterized by continuous probability density functions. Such rich domains, that range from the Mars rover exploration to the unmanned aerial surveillance to the automated disaster rescue operations are commonly modeled as continuous resource Markov decision processes (MDPs) that can then be solved in order to construct policies for agents acting in these domains.