TOWARDS EFFICIENT PLANNING FOR REAL WORLD PARTIALLY OBSERVABLE DOMAINS

by

Pradeep Varakantham

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)

May 2007

Copyright 2007 Pradeep Varakantham

Dedication

This dissertation is dedicated to my parents and my brother.

Acknowledgements

I would like to thank all the people who have helped me complete my thesis.

First and foremost, I would like to thank my advisor, Milind Tambe, for his attention, guidance, insight and support at every step of this thesis. Beyond the thesis itself, his advice has contributed to my development as an individual and an academician.

I wish to thank Manuela Veloso, Sven Koenig, Stacy Marsella and Fernando Ordonez for being on my thesis committee. Their valuable comments were instrumental in structuring my dissertation. Manuela Veloso, through her insightful comments, helped me understand the pragmatic issues with the contributions. Sven Koenig always asked the right questions and was constructive in his criticism. Stacy Marsella provided valuable feedback and, in my discussions with him, pointed me to similar contributions. Fernando Ordonez provided an outsider's view on the contributions made in this thesis.

I am thankful to Makoto Yokoo for being an excellent collaborator, who was involved in building a significant chunk of this thesis, and also for providing crucial feedback that aided me in shaping the thesis. I am grateful to Rajiv Maheswaran for guiding me during the early phases of my PhD and for the numerous stimulating discussions that have helped me significantly in this thesis. I sincerely thank Ranjit Nair for being an excellent colleague and co-author, and also for providing sound advice.

I am grateful to all the members of the TEAMCORE research group for being an amiable bunch of friends and collaborators. Praveen Paruchuri, Nathan Schurr, Jonathan Pearce, Emma Bowring, Janusz Marecki, Tapana Gupta and Zvi Topol have always been very helpful and supportive.

Lastly and most importantly, I would like to express my gratitude to my family. In particular, I would like to thank my parents and brother for believing in me and pushing me to get a doctorate degree.

Table of Contents

Dedication
Acknowledgements
List of Tables
List of Figures
List of Algorithms
Abstract
Chapter 1: Introduction
Chapter 2: Background
  2.1 Domains
    2.1.1 Personal Assistant Agents (PAA)
    2.1.2 Distributed Sensor Network
    2.1.3 Illustrative Domain: Tiger Problem
    2.1.4 Others
  2.2 Models
    2.2.1 Single Agent POMDPs
    2.2.2 Distributed POMDPs: MTDP
  2.3 Existing Algorithms
    2.3.1 Exact Algorithms for POMDPs
    2.3.2 Approximate Algorithms for POMDPs
    2.3.3 JESP Algorithm for Distributed POMDPs
Chapter 3: Exploiting structure in dynamics
  3.1 Dynamic Belief Supports
    3.1.1 Dynamic States (DS)
    3.1.2 Dynamic Beliefs (DB)
    3.1.3 Dynamic Disjunctive Beliefs (DDB)
Chapter 4: Direct value approximation for POMDPs
  4.1 EVA Algorithm
Chapter 5: Results for DS, DB, DDB and EVA
Chapter 6: Exploiting interaction structure in Distributed POMDPs
  6.1 ND-POMDPs
  6.2 Locally Optimal Policy Generation, LID-JESP
    6.2.1 Finding Best Response
    6.2.2 Correctness Results
  6.3 Stochastic LID-JESP (SLID-JESP)
  6.4 Hyper-link-based Decomposition (HLD)
  6.5 Complexity Results
  6.6 Locally Interacting-Global Optimal Algorithm (GOA)
  6.7 Experimental Results
Chapter 7: Direct value approximation and exploiting interaction structure (Distributed POMDPs)
  7.1 Search for Policies In Distributed EnviRonments (SPIDER)
    7.1.1 Outline of SPIDER
    7.1.2 MDP based heuristic function
    7.1.3 Abstraction
    7.1.4 Value ApproXimation (VAX)
    7.1.5 Percentage ApproXimation (PAX)
    7.1.6 Theoretical Results
  7.2 Experimental Results
Chapter 8: Exploiting structure in dynamics for Distributed POMDPs
  8.1 Continuous Space JESP (CS-JESP)
    8.1.1 Illustrative Example
    8.1.2 Key Ideas
    8.1.3 Algorithm for n agents
    8.1.4 Theoretical Results
  8.2 Experimental Results
Chapter 9: Related Work
  9.1 Related work in POMDPs
  9.2 Related work in Software Personal Assistants
  9.3 Related work on Distributed POMDPs
Chapter 10: Conclusion
Bibliography

List of Tables

5.1 Comparison of expected value for PBVI and EVA
6.1 Reasons for speed up. C: no. of cycles, G: no. of GETVALUE calls, W: no. of winners per cycle, for T=2.
8.1 Comparison of runtimes (in ms) for JESP and CS-JESP

List of Figures

2.1 Partial Sample Policy for a TMP
2.2 Sensor net scenario: If present, target1 is in Loc1-1, Loc1-2 or Loc1-3, and target2 is in Loc2-1 or Loc2-2.
2.3 Two steps of value iteration in GIP and RBIP
2.4 Trace of tiger scenario in JESP
3.1 Comparison of GIP and DB with respect to belief bounds
3.2 Partition Procedure for Solving Belief Maximization Lagrangian
3.3 Illustration of DDB vs DB
3.4 Illustration of pruning in DB and DDB when compared against GIP
4.1 EVA: An example of an ε-parsimonious set
5.1 Comparison of performance of EVA+DS, EVA+DB, and EVA+DDB for ε=0.01
5.2 Comparison of performance of EVA+DS, EVA+DB, and EVA+DDB for ε=0.02
5.3 Comparison of performance of EVA+DS, EVA+DB, and EVA+DDB for ε=0.03
5.4 Runtime comparison of EVA and PBVI
6.1 Sample execution trace of LID-JESP for a 3-agent chain
6.2 Runtimes (a, b, c), and value (d)
6.3 Different sensor net configurations
6.4 Runtime (ms) for (a) 1x3, (b) cross, (c) 5-P and (d) 2x3
6.5 Value for (a) 1x3, (b) cross, (c) 5-P and (d) 2x3
7.1 Execution of SPIDER, an example
7.2 Example of abstraction for (a) HBA (Horizon Based Abstraction) and (b) NBA (Node Based Abstraction)
7.3 Sensor network configurations
7.4 Comparison of GOA, SPIDER, SPIDER-Abs and VAX for T=3 on (a) Runtime and (b) Solution quality; (c) Time to solution for PAX with varying percentage to optimal for T=4; (d) Time to solution for VAX with varying epsilon for T=4
8.1 Trace of the algorithm for T=2 in the multiagent tiger example with a specific starting joint policy
8.2 Comparison of (a) CSJESP+GIP and CSJESP+DB for reward structure 1; (b) CSJESP+DB and CSJESP+DBM for reward structure 1; (c) CSJESP+GIP and CSJESP+DB for reward structure 2; (d) CSJESP+DB and CSJESP+DBM for reward structure 2
8.3 Comparison of the number of belief regions created in CS-JESP+DB and CS-JESP+DBM for reward structures 1 and 2
8.4 Comparison of the expected values obtained with JESP for specific belief points and CS-JESP

List of Algorithms

1 CALCULATEEPSILON()
2 JESP()
3 DB-GIP()
4 LP-DOMINATE(w, U, b_t^max, b_t^min, ε)
5 LID-JESP(i, ND-POMDP)
6 EVALUATE(i, s_i^t, s_u^t, s_Ni^t, π_i, π_Ni, ω̄_i^t, ω̄_Ni^t, t, T)
7 GETVALUE(i, B_i^t, π_Ni, t, T)
8 GETVALUEACTION(i, B_i^t, a_i, π_Ni, t, T)
9 UPDATE(i, B_i^t, a_i, ω_i^{t+1}, π_Ni)
10 FINDPOLICY(i, B_i^t, ω̄_i^t, π_Ni, t, T)
11 SLID-JESP(i, ND-POMDP, p)
12 LID-JESP-HLD(i, ND-POMDP)
13 EVALUATE-HLD(l, s_l^t, s_u^t, π_l, ω̄_l^t, t, T)
14 GETVALUE-HLD(i, B_i^t, π_Ni, t, T)
15 GETVALUEACTION-HLD(i, B_i^t, a_i, π_Ni, t, T)
16 UPDATE-HLD(i, l, B_il^t, a_i, ω_i^{t+1}, π_{l−{i}})
17 FINDPOLICY-HLD(i, B_i^t, ω̄_i^t, π_Ni, t, T)
18 GO-JOINTPOLICY(i, π_j, terminate)
19 SPIDER(i, π_{i−}, threshold)
20 UPPER-BOUND-SORT(i, Π_i, π_{i−})
21 UPPER-BOUND(j, π_{j−})
22 UPPER-BOUND-TIME(s_l^t, j, π_{l1}, ω̄_{l1}^t)
23 SPIDER-ABS(i, π_{i−}, threshold)
24 MAXIMUMBELIEF(s_j, v, V, B_min, B_max)
25 CS-JESP()
26 UPDATEPARTITION(i, Π)
27 FINDNEWPARTITION(partition, br)
28 OPTIMALBESTRESPONSE(i, Π_0, br)
29 MERGEBELIEFREGIONS(Π_i)

Abstract

My research goal is to build large-scale intelligent systems (both single- and multi-agent) that reason with uncertainty in complex, real-world environments. I foresee an integration of such systems into many critical facets of human life, ranging from intelligent assistants in hospitals and offices, to rescue agents in large scale disaster response, to sensor agents tracking weather phenomena in earth observing sensor webs, and others. In my thesis, I have taken steps towards achieving this goal in the context of systems that operate in partially observable domains that also have transitional uncertainty (non-deterministic outcomes to actions). Given this uncertainty, Partially Observable Markov Decision Problems (POMDPs) and Distributed POMDPs present themselves as natural choices for modeling these domains.

Unfortunately, the significant computational complexity involved in solving POMDPs (PSPACE-Complete) and Distributed POMDPs (NEXP-Complete) is a key obstacle. Due to this complexity, existing approaches that provide exact solutions do not scale, while approximate solutions do not provide any usable guarantees on quality. My thesis addresses these issues using the following key ideas. The first is exploiting structure in the domain: utilizing the structure present in the dynamics of the domain or in the interactions between the agents allows improved efficiency without sacrificing solution quality. The second is direct approximation in the value space. This allows for calculated approximations at each step of the algorithm, which in turn allows us to provide usable quality guarantees; such quality guarantees may be specified in advance.
In contrast, existing approaches approximate in the belief space, leading to an approximation in the value space (an indirect approximation in value space), making it difficult to compute functional bounds on the approximations. These key ideas allow the efficient computation of optimal and quality-bounded solutions to complex, large-scale problems that were not in the purview of existing algorithms.

Chapter 1: Introduction

Recent years have seen an exciting growth of applications (deployed and emerging) of agents and multiagent systems in many facets of our daily lives. These applications mandate that agents act in complex, uncertain domains, and they range from intelligent assistants in hospitals and offices [Scerri et al., 2002; Leong and Cao, 1998; Magni et al., 1998], to rescue agents in large scale disaster response [Kitano et al., 1999], to sensor agents tracking weather phenomena in earth observing sensor webs [Lesser et al., 2003], and others. However, for a successful transition of these applications to real world domains, the underlying uncertainty has to be taken into account.

Partially Observable Markov Decision Problems (POMDPs) and Distributed Partially Observable Markov Decision Problems (Distributed POMDPs) are becoming popular approaches for modeling decision problems for agents and teams of agents operating in real world uncertain environments [Pollack et al., 2003a; Simmons and Koenig, 1995; Bowling and Veloso, 2002; Roth et al., 2005; Varakantham et al., 2005; Nair et al., 2003c, 2005]. This is owing to the ability of these models to capture the uncertainty present in real world environments: unknown initial configuration of the domain, non-deterministic outcomes of actions and noise in the sensory perception. Furthermore, these models can also capture the utilities associated with different outcomes, due to their ability to reason with costs and rewards.

Unfortunately, the computational cost of optimal policy generation in POMDPs (PSPACE-Complete) and distributed POMDPs (NEXP-Complete) [Bernstein et al., 2000] is prohibitive, requiring increasingly efficient algorithms to solve decision problems in large-scale domains. Furthermore, many domains [Kitano et al., 1999; Pollack et al., 2003a; Scerri et al., 2002; Leong and Cao, 1998; Magni et al., 1998] require that the efficiency gains do not cause significant losses in the optimality of the policy generated; indeed, it is important for the algorithms to bound any loss in quality. Thus the key challenge is to provide efficiency gains in POMDPs and Distributed POMDPs with bounded quality loss.

In single agent POMDPs, significant progress has been made with respect to efficiency, using two types of solution techniques: exact [Feng and Zilberstein, 2005; Cassandra et al., 1997b] and approximate [Pineau et al., 2003; Smith and Simmons, 2005]. Exact techniques provide optimal solutions, avoiding any issue of quality bounds, but suffer from considerable computational inefficiency. Approximate techniques, on the other hand, scale to larger problems but at the expense of quality bounds. Turning now to distributed POMDPs, researchers have pursued two different approaches here as well: exact [Nair et al., 2003a; Hansen et al., 2004b] and approximate [Becker et al., 2003; Nair et al., 2003a; Peshkin et al., 2000a; Becker et al., 2004]. Unfortunately, the exact approaches have so far been limited to two agents, with comparatively little attention focused on them.
On the other hand, approximate approaches either limit agent interactions (transition independence) [Becker et al., 2003], or approximate observability of the local state [Becker et al., 2004], or find locally optimal solutions [Nair et al., 2003a; Peshkin et al., 2000a]. Though these approaches for distributed POMDPs provide improvements in performance, they still suffer from similar drawbacks: (a) computational inefficiency given large numbers of agents; (b) lack of bounds on solution quality.

My thesis takes steps to address these problems of efficiency, while providing guarantees on solution quality. To that end, I have proposed two key solution mechanisms:

1. Exploiting structure inherent in the domain: I have investigated two types of structure that often arise in real world domains:

(a) Physical limitations / progress structure in the process being modeled (structure in dynamics): These techniques restrict policy computation to the belief space polytope that remains reachable given the physical limitations of a domain. One example of a physical limitation comes from a personal assistant domain where an agent assists a user: if the user is at a location, it is highly improbable for him/her to be 5 miles away in the next 5 seconds. I introduce new techniques, particularly one based on applying Lagrangian methods to compute a bounded belief space support in polynomial time. These techniques are complementary to many existing exact and approximate POMDP policy generation algorithms. In fact, these exact techniques provide an order of magnitude speedup over the fastest existing exact solvers.

(b) Structure in the interactions of agents: Techniques for distributed POMDPs have traditionally considered agents in a multi-agent environment with full interactivity, i.e., all agents interact with all other agents. However, in domains like sensor networks, each node (agent) interacts only with the nodes that are adjacent to it in the network. Distributed Constraint Optimization (DCOP) is a model for coordination whose solution techniques rely on exploiting these kinds of limited interaction structures [Modi et al., 2003a; Petcu and Faltings, 2005; Maheswaran et al., 2004], but it is unable to handle uncertainty. On the other hand, distributed POMDP techniques handle uncertainty without exploiting structure in the interactions. I have combined these two approaches to propose a new model, Networked Distributed POMDPs (ND-POMDPs). In this thesis, I have provided solution techniques for these distributed POMDPs that build over exact and locally optimal DCOP approaches, namely DPOP (Distributed Pseudo-tree OPtimization), DBA (Distributed Breakout Algorithm), and DSA (Distributed Stochastic Algorithm). Furthermore, I have also provided a heuristic search technique called SPIDER that exploits the interaction structure. All these algorithms provide a significant improvement in the performance of policy computation for a team of agents. Furthermore, SPIDER provides this efficiency while providing quality guarantees on the solution.

2. Direct approximation in the value space: Existing approaches [Pineau et al., 2003; Zhou and Hansen, 2001; Montemerlo et al., 2004] to approximation in POMDPs and Distributed POMDPs have focused on sampling the belief space and approximating the optimal value function with the value computed for the sampled belief space (indirect value approximation). The key novelty in my technique is to directly approximate in the value space, so that every approximation phase has a bounded (pre-computable) quality loss. I have illustrated the utility of this technique in the context of both POMDPs and Distributed POMDPs.
In single agent POMDPs, this idea translates to efficiently computing policies that are at most ε (an approximation parameter) away from the optimal value function. In distributed POMDPs, the idea translates to computing policies that are at most ε (an approximation parameter) away from a (tight) upper bound on the optimal value function, computed by approximating the Distributed POMDP as a centralized Markov Decision Problem (MDP). Both these techniques were shown to be faster than the best known existing solvers, while providing guarantees on solution quality that were missing in previous work.

The rest of this document is organized as follows: Chapter 2 contains a background on the domains, models and algorithms used in this thesis. Chapter 3 explains the exploitation of structure in the dynamics of a domain in single agent POMDPs. Direct value approximation for POMDPs is presented in Chapter 4, while Chapter 5 provides the experimental results for the single agent POMDP techniques. Chapter 6 elucidates the exploitation of network structure, while Chapter 7 contains an exposition of the direct value approximation technique for distributed POMDPs. Chapter 8 describes the exploitation of structure in the dynamics for distributed POMDPs. Related work is presented in detail in Chapter 9, while the conclusion is presented in Chapter 10.

Chapter 2: Background

This chapter provides a brief background on the experimental domains, the models employed, and existing algorithms to solve the models.

2.1 Domains

To illustrate the applicability of my techniques, I have considered different types of domains. These are personal assistant agents (Section 2.1.1), sensor networks (Section 2.1.2), an illustrative tiger problem (Section 2.1.3) and other problems from the literature.

2.1.1 Personal Assistant Agents (PAA)

Recent research has focused on individual agents or agent teams that assist humans in offices, at home, in medical care and in many other spheres of daily activities [Schreckenghost et al., 2002; Pollack et al., 2003b; htt, 2003; Scerri et al., 2002; Leong and Cao, 1998; Magni et al., 1998]. Such agents must often monitor the evolution of a process or state over time (including that of the human they are deployed to assist) and make periodic decisions based on such monitoring. For example, in office environments, agent assistants may monitor the location of users in transit and make decisions such as delaying or canceling meetings, or asking users for more information [Scerri et al., 2002]. Similarly, in assisting with care for the elderly [Pollack et al., 2003b] and therapy planning [Leong and Cao, 1998; Magni et al., 1998], agents may monitor users' states/plans and make periodic decisions such as sending reminders. Henceforth in this document, I refer to such agents as PAAs. Owing to the great promise of PAAs, addressing decision making in these agents represents a critical problem.

Unfortunately, PAAs must monitor and make decisions despite significant uncertainty in their observations (as the true state of the world may not be known explicitly) and actions (outcomes of agents' actions may be non-deterministic). Furthermore, actions have costs, e.g., delaying a meeting has repercussions on attendees. Researchers have turned to decision-theoretic frameworks to reason about costs and benefits under uncertainty. However, this research has mostly focused on Markov decision processes (MDPs) [Scerri et al., 2002; Leong and Cao, 1998; Magni et al., 1998], ignoring the observational uncertainty in these domains, and thus potentially degrading agent performance significantly and/or requiring unrealistic assumptions about PAAs' observational abilities.
POMDPs address such uncertainty, but the long run-times for generating optimal POMDP policies remain a significant hurdle to their use in PAAs.

A key PAA domain that we present here is the task management problem (TMP). This is a key problem within CALO (Cognitive Assistant that Learns and Organizes), a software personal assistant project [htt, 2003]. In this domain, a set of dependent tasks is to be performed by a group of users before a deadline, e.g., a group of users working on getting a paper done before the deadline. Agents monitor the progress of their users and help in finishing the tasks before the deadline by doing reallocations at certain points in time. Furthermore, agents also decide to whom to reallocate a task, and thus must monitor the status of other users who are capable of doing it.

This problem is complicated because the agents need to reason about reallocation in the presence of transitional and observational uncertainty. Transitional uncertainty arises because there is non-determinism in the way users make progress. For example, a user might finish two units of progress in one time unit, or might not do anything in one time unit. Observational uncertainty, on the other hand, is present for two reasons:

1. Acquiring the exact progress made on a task is difficult.
2. Knowing whether other (capable) users are free or not is difficult.

Agents can ask their users about the progress made when there is a lot of uncertainty in the state. When the user responds, the agent knows the exact progress of the user on the task. This, however, comes at the cost of disturbing the user, and occurs only with a certain probability, as users may or may not respond to the agent's request. Thus each agent needs to find a strategy that guides its operation at each time step until the deadline. This strategy consists of executing either "wait", "ask user", or "reallocate task to other users" at each time step. The reallocation points are when a user is not making sufficient progress on the tasks. Agents decide when and to whom to reallocate based on the observations they obtain about the progress made by the user on the task. These observations, however, are noisy, because it is difficult to acquire the exact progress made by the user on a task.

POMDPs provide a framework to analyze and obtain policies in TMP-type domains. In a TMP, a POMDP policy can take into account the possibly uneven progress of different users, e.g., some users may make most of their progress well before the deadline, while others do the bulk of their work closer to the deadline. In contrast, an instantaneous decision-maker cannot take into account such dynamics of progress. For instance, consider a TMP scenario where there are five levels of task progress $x \in \{0.00, 0.25, 0.50, 0.75, 1.00\}$ and five decision points before the deadline $t \in \{1,2,3,4,5\}$. Observations are five levels of task progress $\{0.00, 0.25, 0.50, 0.75, 1.00\}$ and time moves forward in single steps, i.e., $T([x,t],a,[\tilde{x},\tilde{t}]) = 0$ if $\tilde{t} \neq t+1$. While transition uncertainty implies irregular task progress, observation uncertainty implies the agent may observe progress $x$ as, for instance, $x$ or $x+0.25$ (unless $x = 1.00$). Despite this uncertainty in observing task progress, a PAA needs to choose among waiting (W), asking the user for information (A), or reallocating the task to other users (R). A POMDP policy tree that takes into account both the uncertainty in observations and the future costs of decisions, and that maps observations to actions, is shown for this scenario in Figure 2.1 (nodes = actions, links = observations). In more complex domains with additional actions such as delaying deadlines, the cascading effects of actions will require even more careful planning of the kind afforded by POMDP policy generation.
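The TMP scenario just described can be made concrete with a small sketch. The code below is illustrative only: the thesis does not fix numerical transition or observation probabilities for this example, so the probability values, helper names and dictionary-free encoding used here are assumptions chosen for readability. It builds the coupled state space [x, t], a transition function that forbids jumps of more than one progress level per step, and a noisy observation function that reports x or x + 0.25.

```python
import itertools

PROGRESS = [0.00, 0.25, 0.50, 0.75, 1.00]
HORIZON = [1, 2, 3, 4, 5]
ACTIONS = ["W", "A", "R"]          # wait, ask user, reallocate

# States couple task progress with the decision stage: [x, t].
STATES = [(x, t) for x in PROGRESS for t in HORIZON]

def transition(state, action, next_state):
    """P([x',t'] | [x,t], a): time advances by one step and progress can
    advance by at most one level per step (the physical limitation)."""
    (x, t), (nx, nt) = state, next_state
    if nt != t + 1 or nx < x or nx - x > 0.25:
        return 0.0
    if x == 1.00:                       # task already finished
        return 1.0 if nx == x else 0.0
    # Assumed numbers: the user advances one level with some probability.
    p_advance = 0.6 if action == "W" else 0.5
    return p_advance if nx > x else 1.0 - p_advance

def observation(next_state, action, obs):
    """P(o | [x',t'], a): noisy progress report, either x' or x' + 0.25."""
    (nx, _) = next_state
    if nx == 1.00:
        return 1.0 if obs == 1.00 else 0.0
    if action == "A":                   # asking yields a sharper reading
        return 0.9 if obs == nx else (0.1 if obs == nx + 0.25 else 0.0)
    return 0.7 if obs == nx else (0.3 if obs == nx + 0.25 else 0.0)

# Sanity check: outgoing transition probabilities sum to 1 for non-final stages.
for s, a in itertools.product(STATES, ACTIONS):
    if s[1] < 5:
        assert abs(sum(transition(s, a, s2) for s2 in STATES) - 1.0) < 1e-9
```

A policy tree such as the one in Figure 2.1 maps the observations produced by this observation function to the actions W, A and R at each decision point.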
One other key characteristic of TMP is that the human can restrict the usage of certain actions in certain states, thus associating a reward of negative infinity with those actions. Additionally, a POMDP algorithm solving TMP problems needs to have the following characteristics: (a) a plan for a pre-specified quality guarantee; (b) a quality bound valid for all possible starting belief points; (c) a policy that can be computed efficiently.

[Figure 2.1: Partial Sample Policy for a TMP. Nodes are actions (W, A, R); links are observations, i.e., observed progress levels 0.00-1.00.]

2.1.2 Distributed Sensor Network

In this section, I provide an illustrative problem within the distributed sensor net domain, motivated by the real-world challenge in [Lesser et al., 2003].[1] This is an important problem because of the wide applicability of sensor networks [Chintalapudi et al., 2005; S. Funiak and Sukthankar, 2006] to many real world problems. One key example is the tracking of weather phenomena in earth observing sensor webs. This is thus a pre-existing domain, one that has been attacked by other multiagent researchers. One key aspect of this domain is the locality of interactions among multiple agents, and hence DCOP is a good formalism to model it. Owing to the ability of DCOP algorithms to exploit locality in interactions, the algorithms developed here build on DCOP algorithms.

[1] For simplicity, this scenario focuses on binary interactions. However, the algorithms introduced in this thesis allow n-ary interactions.

Here, each sensor node can scan in one of four directions: North, South, East or West (see Figure 2.2). To track a target and obtain the associated reward, two sensors with overlapping scanning areas must coordinate by scanning the same area simultaneously. Thus, the target position constitutes a world state, and each sensor has four actions: scan-north, scan-south, scan-east, scan-west. We assume that there are two independent targets and that each target's movement is uncertain and unaffected by the sensor agents. Based on the area it is scanning, each sensor receives observations that can have false positives and false negatives. Each agent incurs a cost for scanning whether the target is present or not, but no cost if it turns off.

As seen in this domain, each sensor interacts with only a limited number of neighboring sensors. For instance, sensors 1 and 3's scanning areas do not overlap, and they cannot affect each other except indirectly via sensor 2. The sensors' observations and transitions are independent of each other's actions. Existing distributed POMDP algorithms are inefficient for such a domain because they are not geared to exploit locality of interaction. Thus, they have to consider all possible action choices of even non-interacting agents in trying to solve the distributed POMDP. Distributed constraint satisfaction (DisCSP) [Mailler and Lesser, 2004b; Modi et al., 2001] and distributed constraint optimization (DCOP) [Mailler and Lesser, 2004a] have been applied to sensor nets, but they cannot capture the uncertainty in the domain.

[Figure 2.2: Sensor net scenario: If present, target1 is in Loc1-1, Loc1-2 or Loc1-3, and target2 is in Loc2-1 or Loc2-2.]

2.1.3 Illustrative Domain: Tiger Problem

The multiagent tiger problem from [Nair et al., 2003a] is an illustrative problem from the literature. Two agents are in a corridor facing two doors, "left" and "right". Behind one door lies a hungry tiger, and behind the other lies a reward. The set of states, S, is {SL, SR}, where SL indicates the tiger is behind the left door, and SR indicates the tiger is behind the right door. The agents can jointly or individually open either door.
In addition, the agents can independently listen for the presence of the tiger. Thus the sets of actions are $A_1 = A_2 = \{$'OpenLeft', 'OpenRight', 'Listen'$\}$. The transition function $P$ specifies that the problem is reset whenever an agent opens one of the doors. However, if both agents listen, the state remains unchanged. After every action each agent receives an observation about the new state. The observation functions are identical and return either TL or TR with different probabilities, depending on the joint action taken and the resulting world state. For example, if both agents listen and the tiger is behind the left door (state SL), each agent independently receives the observation TL with probability 0.85 and TR with probability 0.15. For more details on this domain, refer to [Nair et al., 2003a].

2.1.4 Others

For single agent POMDPs, I have used the following domains: Tiger-grid, Hallway, Hallway2, Aircraft, Tag and Scotland Yard. Of these, Tiger-grid, Hallway, Hallway2, Aircraft and Tag are benchmark problems from the literature [Pineau et al., 2003; Smith and Simmons, 2005]; Hallway, Hallway2, Aircraft and Tag are path planning problems from robotics. Scotland Yard is a problem derived from the Scotland Yard board game, with 216 states, 16 actions and 6 observations (see http://en.wikipedia.org/wiki/Scotland Yard (board game)).

2.2 Models

I will assume readers are familiar with POMDPs and Distributed POMDPs; however, I will briefly describe both to introduce my terminology and notation.

2.2.1 Single Agent POMDPs

A POMDP can be represented using the tuple $\{S, A, T, O, \Omega, R\}$, where $S$ is a finite set of states; $A$ is a finite set of actions; $\Omega$ is a finite set of observations; $T(s,a,s')$ provides the probability of transitioning from state $s$ to $s'$ when taking action $a$; $O(s',a,o)$ is the probability of observing $o$ after taking action $a$ and reaching $s'$; and $R(s,a)$ is the reward function. A belief state $b$ is a probability distribution over the set of states $S$. The value function over a belief state is defined as:
$$V(b) = \max_{a\in A}\Big\{R(b,a) + \beta \sum_{b'\in B} T(b,a,b')\,V(b')\Big\}.$$

2.2.2 Distributed POMDPs: MTDP

The distributed POMDP model that we base our work on is the MTDP [Pynadath and Tambe, 2002]; however, other models [Bernstein et al., 2000] could also be used. These distributed POMDP models are more than just several single agent POMDPs working independently. In particular, given a team of $n$ agents, an MTDP [Pynadath and Tambe, 2002] is defined as a tuple $\langle S, A, P, \Omega, O, R\rangle$. $S$ is a finite set of world states $\{s_1,\ldots,s_m\}$. $A = \times_{1\le i\le n} A_i$, where $A_1,\ldots,A_n$ are the sets of actions for agents 1 to $n$; a joint action is represented as $\langle a_1,\ldots,a_n\rangle$. $P(s_i,\langle a_1,\ldots,a_n\rangle, s_f)$, the transition function, represents the probability that the current state is $s_f$ if the previous state is $s_i$ and the previous joint action is $\langle a_1,\ldots,a_n\rangle$. $\Omega = \times_{1\le i\le n}\Omega_i$ is the set of joint observations, where $\Omega_i$ is the set of observations for agent $i$. $O(s,\langle a_1,\ldots,a_n\rangle,\omega)$, the observation function, represents the probability of joint observation $\omega\in\Omega$ if the current state is $s$ and the previous joint action is $\langle a_1,\ldots,a_n\rangle$. We assume that each agent's observations are independent of the other agents' observations. Given the world state and joint action, the observation function can thus be expressed as $O(s,\langle a_1,\ldots,a_n\rangle,\omega) = O_1(s,\langle a_1,\ldots,a_n\rangle,\omega_1)\cdot\ldots\cdot O_n(s,\langle a_1,\ldots,a_n\rangle,\omega_n)$. The agents receive a single immediate joint reward $R(s,\langle a_1,\ldots,a_n\rangle)$, which is shared equally. Each agent $i$ chooses its actions based on its local policy $\pi_i$, which is a mapping of its observation history to actions.
Thus, at time $t$, agent $i$ will perform the action $\pi_i(\vec{\omega}^t_i)$, where $\vec{\omega}^t_i = \omega^1_i,\ldots,\omega^t_i$. $\pi = \langle\pi_1,\ldots,\pi_n\rangle$ refers to the joint policy of the team of agents. In this model, execution is distributed but planning is centralized.

2.3 Existing Algorithms

In this section, I present some of the existing algorithms for solving POMDPs and Distributed POMDPs that are referred to in detail in later parts of this document.

2.3.1 Exact Algorithms for POMDPs

Currently, the most efficient exact algorithms for POMDPs are value iteration algorithms, specifically GIP [Cassandra et al., 1997b] and RBIP [Feng and Zilberstein, 2004b, 2005]. These are dynamic programming algorithms which perform two steps at each iteration: (a) generating all potential policies, and (b) pruning dominated policies to obtain a minimal set of dominant policies called the parsimonious set. Figure 2.3 provides a pictorial depiction of these two steps, where each line (in the graphs) represents the value vector corresponding to a policy. The first graph shows the dominated policies at the bottom (circled); these dominated policies are identified using linear programming.

[Figure 2.3: Two steps of value iteration in GIP and RBIP. Both panels plot expected value against belief probability on [0,1]: the left panel shows the generated policies with the dominated policies circled, and the right panel shows the set remaining after pruning.]

Given a parsimonious set (represented as value vectors corresponding to policies) at time $t$, $V_t$, we generate the parsimonious set at time $t-1$, $V_{t-1}$, as follows (notation similar to that used in [Cassandra et al., 1997b] and [Feng and Zilberstein, 2004b]):

1. $\hat{V}^{a,o}_{t-1} := \big\{ v^{a,o,i}_{t-1}(s) = r(s,a)/|\Omega| + \beta \sum_{s'\in S}\Pr(o,s'|s,a)\,v^i_t(s') \big\}$, where $v^i_t \in V_t$.
2. $V^{a,o}_{t-1} = \mathrm{PRUNE}(\hat{V}^{a,o}_{t-1})$
3. $V^{a}_{t-1} = \mathrm{PRUNE}(\cdots\mathrm{PRUNE}(\mathrm{PRUNE}(V^{a,o_1}_{t-1}\oplus V^{a,o_2}_{t-1})\oplus V^{a,o_3}_{t-1})\cdots\oplus V^{a,o_{|\Omega|}}_{t-1})$
4. $V_{t-1} = \mathrm{PRUNE}(\bigcup_{a\in A} V^{a}_{t-1})$

Each PRUNE call executes a linear program (LP), which is recognized as a computationally expensive phase in the generation of parsimonious sets [Cassandra et al., 1997b; Feng and Zilberstein, 2004b]. Our approach effectively translates into obtaining speedups by reducing the number of these calls.

2.3.2 Approximate Algorithms for POMDPs

Here we concentrate on two of the most efficient approximation algorithms that provide quality bounds, Point-Based Value Iteration (PBVI) [Pineau et al., 2003] and Heuristic Search Value Iteration (HSVI) [Smith and Simmons, 2005]. In these algorithms, a policy computed for a sampled set of belief points is extrapolated to the entire belief space. PBVI and HSVI are anytime algorithms, where the set of belief points being planned for is expanded over time. The expansion ensures that the belief points are uniformly distributed over the entire belief space. The heuristics used to accomplish this belief set expansion differentiate PBVI and HSVI. However, to obtain this set of belief points, both algorithms require the specification of a starting belief point.

Since our approach for solving POMDPs approximately focuses on quality bounds, we will discuss the quality bounds of PBVI/HSVI. For PBVI, this bound is given by
$$(R_{\max} - R_{\min})\cdot \epsilon_b / (1-\gamma)^2,$$
where $R_{\max}$ and $R_{\min}$ represent the maximum and minimum possible reward for any action in any state, and $\epsilon_b = \max_{b'\in\Delta}\min_{b\in B}\|b - b'\|_1$, where $\Delta$ is the entire belief space and $B$ is the set of belief points. Computing $\epsilon_b$ requires solving a non-linear program, NLP (shown in Algorithm 1). Although HSVI has a slightly different error bound, it still requires the same NLP to be solved.
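As a concrete illustration of how this bound behaves, the sketch below computes it for a given belief set. It is illustrative only: the Monte-Carlo estimate of ε_b used here is an assumption made for brevity, only lower-bounds the exact ε_b returned by the NLP of Algorithm 1 (shown next), and the function names and example numbers are not from the thesis.

```python
import random

def l1_distance_to_set(b_prime, belief_set):
    """min_{b in B} ||b - b'||_1 for one candidate belief b'."""
    return min(sum(abs(x - y) for x, y in zip(b, b_prime)) for b in belief_set)

def estimate_epsilon_b(belief_set, num_states, samples=100000, seed=0):
    """Monte-Carlo estimate of eps_b = max_{b' in simplex} min_{b in B} ||b - b'||_1.
    The exact value requires the NLP of Algorithm 1; random search only gives a
    lower bound on eps_b, and hence an optimistic estimate of the quality bound."""
    rng = random.Random(seed)
    best = 0.0
    for _ in range(samples):
        cuts = sorted(rng.random() for _ in range(num_states - 1))
        b_prime = [hi - lo for lo, hi in zip([0.0] + cuts, cuts + [1.0])]  # uniform on the simplex
        best = max(best, l1_distance_to_set(b_prime, belief_set))
    return best

def pbvi_error_bound(r_max, r_min, gamma, eps_b):
    """PBVI quality bound: (R_max - R_min) * eps_b / (1 - gamma)^2."""
    return (r_max - r_min) * eps_b / (1.0 - gamma) ** 2

# Assumed two-state example with two sampled belief points and a penalty-heavy reward
# range: the resulting bound is very loose, which is exactly drawback (a) noted in Chapter 4.
B = [[0.5, 0.5], [0.9, 0.1]]
eps_b = estimate_epsilon_b(B, num_states=2)
print(pbvi_error_bound(r_max=10.0, r_min=-100.0, gamma=0.95, eps_b=eps_b))
```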
Algorithm 1: CALCULATEEPSILON()
  Maximize $\epsilon_b$ subject to the constraints:
    $\sum_{1\le i\le|S|} b[i] = 1$
    $b[i]\ge 0$ and $b[i]\le 1$, $\forall i\in\{1,\ldots,|S|\}$
    $\epsilon_b < \sum_{1\le i\le|S|} |b[i] - b_k[i]|$, $\forall b_k\in B$

2.3.3 JESP Algorithm for Distributed POMDPs

Given the NEXP-Complete complexity of generating globally optimal policies for distributed POMDPs [Bernstein et al., 2000], locally optimal approaches [Peshkin et al., 2000b; Chadès et al., 2002; Nair et al., 2003a] have emerged as viable solutions. Since the CS-JESP algorithm (presented later) builds on the JESP (Joint Equilibrium-Based Search for Policies) algorithm [Nair et al., 2003a], JESP is outlined below (Algorithm 2). The key idea is to find the policy that maximizes the joint expected reward for one agent at a time, keeping the policies of the other $n-1$ agents fixed. This process is repeated until an equilibrium is reached (a local optimum is found). Multiple local optima are not encountered since planning is centralized. The key innovation in JESP is based on the realization that if the policies of all other $n-1$ agents are fixed, then the remaining agent faces a normal single-agent POMDP, but with an extended state space. Thus, in line 4, given a known starting belief state, we use dynamic programming over belief states of this newer, more complex single-agent POMDP to compute agent 1's optimal response to the fixed policies of the remaining $n-1$ agents.

Algorithm 2: JESP()
1: Π_0 ← randomly selected joint policy, prevVal ← value of Π_0, conv ← 0, Π ← Π_0
2: while conv ≠ n do
3:   for i ← 1 to n do
4:     val, Π_i ← OPTIMALBESTRESPONSE(b, Π_0, T)
5:     if val = prevVal then
6:       conv ← conv + 1
7:     else
8:       Π_0^i ← Π_i, prevVal ← val, conv ← 1
9:     end if
10:    if conv = n then break
11:  end for
12: end while
13: return Π

[Figure 2.4: Trace of tiger scenario in JESP]

The key is then to define the extended state in JESP. For a two-agent case, for each time $t$, the extended state of agent 1 is defined as a tuple $e^t_1 = \langle s^t, \vec{\omega}^t_2\rangle$, where $\vec{\omega}^t_2$ is the observation history of the other agent. By treating $e^t_1$ as the state of agent 1 at time $t$, we can define the transition function and observation function for the resulting single-agent POMDP for agent 1 as follows:
$$P'(e^t_1, a^t_1, e^{t+1}_1) = \Pr(e^{t+1}_1 \mid e^t_1, a^t_1) = P(s^t, (a^t_1, \pi_2(\vec{\omega}^t_2)), s^{t+1})\cdot O_2(s^{t+1}, (a^t_1, \pi_2(\vec{\omega}^t_2)), \omega^{t+1}_2) \quad (2.1)$$
$$O'(e^{t+1}_1, a^t_1, \omega^{t+1}_1) = \Pr(\omega^{t+1}_1 \mid e^{t+1}_1, a^t_1) = O_1(s^{t+1}, (a^t_1, \pi_2(\vec{\omega}^t_2)), \omega^{t+1}_1) \quad (2.2)$$

In other words, when computing agent 1's best-response policy via dynamic programming given the fixed policy of its teammate, we maintain a distribution over the extended states $e^t_1$ rather than over the world states $s^t$. Figure 2.4 shows a trace of the belief state evolution for the multi-agent tiger domain described in Section 2.1.3; e.g., $e^2_1$ of SL(TR) indicates an extended state where the tiger is behind the left door and agent 2 has observed TR. However, as noted above, the main shortcoming of this technique is that it computes a locally optimal policy assuming a fixed starting belief state, and this assumption is embedded in its dynamic programming as shown in line 4 of Algorithm 2: it does not generate policies over continuous belief spaces.

Chapter 3: Exploiting structure in dynamics

This thesis aims to practically apply POMDPs to real world domains by introducing novel speedup techniques that are particularly suitable for such settings. The key insight is that in some dynamic domains where processes evolve over time, large but shifting parts of the belief space in POMDPs (i.e., regions of uncertainty) remain unreachable. Thus, we can focus policy computation on this reachable belief-space polytope, which changes dynamically. For instance, consider a PAA monitoring a user driving to a meeting.
Given knowledge of the user's current location, the reachable belief region is bounded by the maximum probability of the user being in different locations at the next time step, as defined by the transition function. Current POMDP algorithms typically fail to exploit such belief region reachability properties; POMDP algorithms that do restrict belief regions fail to do so dynamically [Roy and Gordon, 2002; Hauskrecht and Fraser, 2000].

Our techniques for exploiting belief region reachability exploit three key domain characteristics: (i) not all states are reachable at each decision epoch, because of limitations of physical processes or the progression of time; (ii) not all observations are obtainable, because not all states are reachable; (iii) the maximum probability of reaching specific states can be tightly bounded. We introduce polynomial time techniques based on Lagrangian analysis to compute tight bounds on belief state probabilities. These techniques are complementary to most existing exact and approximate POMDP algorithms. We enhance two state-of-the-art exact POMDP algorithms [Cassandra et al., 1997a; Feng and Zilberstein, 2004a], delivering over an order of magnitude speedup for a PAA domain.

3.1 Dynamic Belief Supports

Our approach consists of three key techniques: (i) dynamic state spaces (DS); (ii) dynamic beliefs (DB); (iii) dynamic disjunctive beliefs (DDB).[1] These ideas may be used to enhance existing POMDP algorithms such as GIP and RBIP. The key intuition is that for domains such as PAA, progress implies that a dynamically changing polytope (of belief states) remains reachable through time, and policy computation can be sped up by computing the parsimonious set over just this polytope. The speedups are due to the elimination of policies dominant only in regions outside this polytope. DS provides an initial bound on the polytope, while DB (which subsumes DS) and DDB provide tighter bounds on reachable belief states through a polynomial-time technique obtained from Lagrangian analysis.

These techniques do not alter the relevant parsimonious set w.r.t. reachable belief states and thus yield an optimal solution over the reachable belief states. The resulting algorithms (DS, DB, DDB) applied to enhance GIP are shown in Algorithm 3, where the functions GET-BOUND and DB-GIP are the main additions, with significant updates in the other GIP functions (otherwise, the GIP description follows [1, 3]). We discuss our key enhancements to Algorithm 3 at the end of each subsection below. Our enhancements have currently been applied only to finite horizon problems; their applicability to infinite horizon problems remains an issue for future work.
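Before the full listing, the core change that these techniques make to GIP's pruning step can be previewed in a few lines: the linear program used in LP-DOMINATE simply gains per-state bounds on the belief variables. The sketch below is illustrative only; it uses scipy.optimize.linprog, and the function and variable names are assumptions rather than the thesis implementation.

```python
import numpy as np
from scipy.optimize import linprog

def lp_dominate(w, U, b_min, b_max):
    """Belief-bounded version of GIP's LP-DOMINATE (illustrative sketch).

    Looks for a belief b with b_min <= b <= b_max and sum(b) = 1 at which the
    value vector w beats every vector in U by some margin d >= 0. Returns the
    witness belief, or None if w is dominated on the bounded belief polytope
    (and can therefore be pruned)."""
    w = np.asarray(w, dtype=float)
    n = w.size
    # Decision variables: [b_1, ..., b_n, d]; maximize d  <=>  minimize -d.
    c = np.zeros(n + 1)
    c[-1] = -1.0
    # For every u in U:  b.(w - u) >= d   <=>   -(w - u).b + d <= 0
    A_ub = np.array([np.append(-(w - np.asarray(u, dtype=float)), 1.0) for u in U])
    b_ub = np.zeros(len(U))
    # The belief must be a distribution over the (dynamic) state set.
    A_eq = np.append(np.ones(n), 0.0).reshape(1, -1)
    b_eq = np.array([1.0])
    # The DS/DB/DDB bounds enter only here, as box constraints on b.
    bounds = [(lo, hi) for lo, hi in zip(b_min, b_max)] + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    if res.success and -res.fun >= 0.0:
        return res.x[:n]
    return None

# Example: with the full simplex w is undominated, but with tight DB bounds it is pruned.
w, U = [1.0, 0.0], [[0.0, 1.0]]
print(lp_dominate(w, U, b_min=[0.0, 0.0], b_max=[1.0, 1.0]))  # returns a witness belief
print(lp_dominate(w, U, b_min=[0.0, 0.6], b_max=[0.4, 1.0]))  # None: prune w
```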
[1] We also have a technique called Dynamic Observations (DO) that was presented in an earlier paper [Varakantham et al., 2005].

Algorithm 3: DB-GIP()

Func POMDP-SOLVE(L, S, A, T, Ω, O, R)
1: ({S_t}, {Ω_t}, {b^max_t}) = DB-GIP(L, S, A, T, Ω, O, R)
2: t ← L; V_t ← 0
3: for t = L to 1 do
4:   V_{t-1} = DP-UPDATE(V_t, t)
5: end for

Func DP-UPDATE(V, t)
1: for all a ∈ A do
2:   V^a_{t-1} ← φ
3:   for all ω_t ∈ Ω_t do
4:     for all v^i_t ∈ V do
5:       for all s_{t-1} ∈ S_{t-1} do
6:         v^{a,ω_t,i}_{t-1}(s_{t-1}) = r_{t-1}(s_{t-1}, a)/|Ω_t| + γ Σ_{s_t ∈ S_t} Pr(ω_t, s_t | s_{t-1}, a) v^i_t(s_t)
7:       end for
8:     end for
9:     V^{a,ω_t}_{t-1} ← PRUNE({v^{a,ω_t,i}_{t-1}}, t)
10:  end for
11:  V^a_{t-1} ← PRUNE(V^a_{t-1} ⊕ V^{a,ω_t}_{t-1}, t)
12: end for
13: V_{t-1} ← PRUNE(∪_{a∈A} V^a_{t-1}, t)
14: return V_{t-1}

Func LP-DOMINATE(w, U, t)
1: LP vars: d, b(s_t) [∀ s_t ∈ S_t]
2: LP: max d subject to:
3:   b · (w − u) ≥ d, ∀ u ∈ U
4:   Σ_{s_t ∈ S_t} b(s_t) = 1
5:   0 ≤ b(s_t) ≤ b^max_t(s_t)
6: if d ≥ 0 return b else return nil

Func BEST(b, U)
1: max ← −Inf
2: for all u ∈ U do
3:   if (b · u > max) or ((b · u = max) and (u <_lex w)) then
4:     w ← u; max ← b · u
5:   end if
6: end for
7: return w

Func PRUNE(U, t)
1: W ← φ
2: while U ≠ φ do
3:   u ← any element in U
4:   if POINT-DOMINATE(u, W, t) = true then
5:     U ← U − u
6:   else
7:     b ← LP-DOMINATE(u, W, t)
8:     if b = nil then U ← U − u
9:     else w ← BEST(b, U); W ← W ∪ {w}; U ← U − w
10:  end if
11: return W

Func POINT-DOMINATE(w, U, t)
1: for all u ∈ U do
2:   if w(s_t) ≤ u(s_t), ∀ s_t ∈ S_t then return true
3: end for
4: return false

Func DB-GIP(L, S, A, T, Ω, O, R)
1: t ← 1; S_t = set of starting states
2: for all s_t ∈ S_t do
3:   b^max_t(s_t) = 1
4: end for
5: for t = 1 to L−1 do
6:   for all s ∈ S_t do
7:     ADD-TO(S_{t+1}, REACHABLE-STATES(s, T))
8:     Ω_{t+1} = GET-RELEVANT-OBS(S_{t+1}, O)
9:     C = GET-CONSTRAINTS(s_t)
10:    b^max_{t+1}(s_{t+1}) = MAX_{c∈C}(GET-BOUND(s_{t+1}, c))
11:  end for
12: end for
13: return ({S_t}, {Ω_t}, {b^max_t})

Func GET-BOUND(s_t, constraint)
1: y_min = MIN_{s∈S_{t-1}}(constraint.c[s]/constraint.d[s])
2: y_max = MAX_{s∈S_{t-1}}(constraint.c[s]/constraint.d[s])
3: INT = GET-INTERSECT-SORTED(constraint, y_min, y_max)
4: for all i ∈ INT do
5:   Z = SORT(((i + ε) ∗ constraint.d[s] − constraint.c[s]), ∀ s ∈ S_{t-1})
6:   sumBound = 1, numer = 0, denom = 0
7:   /* in ascending order */
8:   for all z ∈ Z do
9:     s_{t-1} = FIND-CORRESPONDING-STATE(z)
10:    if sumBound − bound[s_{t-1}] > 0 then
11:      sumBound −= bound[s_{t-1}]
12:      numer += bound[s_{t-1}] ∗ constraint.c[s_{t-1}]
13:      denom += bound[s_{t-1}] ∗ constraint.d[s_{t-1}]
14:    end if
15:    if sumBound − bound[s_{t-1}] <= 0 then
16:      numer += sumBound ∗ constraint.c[s_{t-1}]
17:      denom += sumBound ∗ constraint.d[s_{t-1}]
18:      break
19:    end if
20:  end for
21:  if numer/denom > i and numer/denom < max then
22:    return numer/denom
23:  end if
24: end for

3.1.1 Dynamic States (DS)

We first provide an intuitive explanation of DS using the example PAA domain. A natural method for PAAs to represent a user's state (such as in TMP) is with a state consisting of a spatial element (in a TMP, capturing the progress of each task) and a temporal element, capturing the stage of the decision. The transition matrix is then a static function of the state. This approach is used in [Scerri et al., 2002] for an adjustable autonomy problem addressed with MDPs. We note that in these kinds of domains, one cannot reach all states from a given state. For example, in the TMP scenario presented in Section 2.1.1, if there are limits on how tasks progress (one cannot advance more than one progress level in one time step, i.e., $T([x,t],a,[\tilde{x},t+1]) = 0$ if $\tilde{x} - x > 0.25$) and we know that at $t = 1$ we are at either $x = 0.00$ or $x = 0.25$, then we know that at $t = 2$, $x \notin \{0.75, 1.00\}$ and at $t = 3$, $x \neq 1.00$. Given this example, we now introduce the general concept of DS.
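Before the general construction (given below as Equation (3.1)), the reachable-state propagation just illustrated can be sketched concretely. The sketch is illustrative only: the one-level-per-step transition rule matches the example in the text, but the function names and encoding are assumptions, not the thesis code.

```python
# Sketch of the DS propagation S_{t+1} = {s' | exists a, s in S_t with T(s,a,s') > 0},
# specialized to the TMP progress dimension.

def reachable(x, action):
    """Progress levels reachable in one step from level x: the same level,
    or one level up (the assumed physical limitation)."""
    nxt = {x}
    if x < 1.00:
        nxt.add(round(x + 0.25, 2))
    return nxt

def dynamic_state_spaces(initial_levels, actions, horizon):
    """Compute the dynamic state sets S_1, ..., S_L inductively."""
    spaces = [sorted(initial_levels)]
    for _ in range(horizon - 1):
        current = spaces[-1]
        spaces.append(sorted({x2 for x in current for a in actions for x2 in reachable(x, a)}))
    return spaces

for t, S_t in enumerate(dynamic_state_spaces({0.00, 0.25}, ["W", "A", "R"], horizon=5), start=1):
    print(f"t={t}: reachable progress levels {S_t}")
# Output matches the example: at t=2 the levels 0.75 and 1.00 are unreachable,
# and 1.00 is still unreachable at t=3.
```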
The key insight is that the state space at each point in time can be represented more compactly in a dynamic fashion. This requires the transition matrix and reward function to be dynamic themselves. Given knowledge about the initial belief space (e.g., the possible beginning levels of task progress), we show how we can obtain dynamic state spaces, and also that this representation does not affect the optimality of the POMDP solution. Let $L$ be the length of a finite horizon decision process. Let $S$ be the set of all possible states that can be occupied during the process. At time $t$, let $S_t \subset S$ denote the set of all possible states that could occur at that time. Thus, for any reachable belief state, we have $\sum_{s_t\in S_t} b_t(s_t) = 1$. We can then obtain $S_t$ for $t\in 1,\ldots,L$ inductively if we know the set $S_0 \subset S$ for which $s\notin S_0 \Rightarrow b_0(s) = 0$, as follows:
$$S_{t+1} = \{s'\in S : \exists a\in A,\, s\in S_t \text{ s.t. } T_t(s,a,s') > 0\} \quad (3.1)$$

The belief probability for a particular state $\tilde{s}$ at time $t+1$, given a starting belief vector $b_t$ at time $t$, action $a$ and observation $\omega$, can be expressed as follows:
$$b_{t+1}(\tilde{s}) := \frac{O_t(\tilde{s},a,\omega)\sum_{s_t\in S_t} T_t(s_t,a,\tilde{s})\, b_t(s_t)}{\sum_{s_{t+1}\in S_{t+1}} O_t(s_{t+1},a,\omega)\sum_{s_t\in S_t} T_t(s_t,a,s_{t+1})\, b_t(s_t)}$$

This implies that the belief vector $b_{t+1}$ will have support only on $S_{t+1}$, i.e., $\tilde{s}\notin S_{t+1} \Rightarrow b_{t+1}(\tilde{s}) = 0$, if $b_t$ only has support in $S_t$ and $S_{t+1}$ is generated as in (3.1). Thus, we can model a process that migrates among dynamic state spaces $\{S_t\}_{t=1}^{L}$, indexed by time or, more accurately, by the stage of the decision process, as opposed to one transitioning within a static global state set $S$.

Proposition 1. Given $S_0$, we can replace a static state space $S$ with dynamic state spaces $\{S_t\}$ generated by (3.1), dynamic transition matrices and dynamic reward functions in a finite horizon POMDP without affecting the optimality of the solution obtained using value function methods.

Proof. If we let $P_t$ denote the set of policies available at time $t$, $V^p_t$ denote the value of policy $p$ at time $t$, and $V^*_t$ denote the value of the optimal policy at time $t$, we have $V^*_L(b_L) = \max_{p\in P_L} b_L\cdot\alpha^p_L$ where $\alpha^p_L = [V^p_L(s_1)\cdots V^p_L(s_{|S|})]$ for $s_i\in S$. When $t = L$, we have $V^p_L(s) = R_L(s,a(p))$, where $R_L$ is the reward function at time $L$ and $a(p)$ is the action prescribed by the policy $p$. Since $b_L(s) = 0$ if $s\notin S_L$, then $V^*_L(b_L) = \max_{p\in P_L}\tilde{b}_L\cdot\tilde{\alpha}^p_L$ where $|\tilde{b}_L| = |\tilde{\alpha}^p_L| = |S_L|$ and $\tilde{\alpha}^p_L = [V^p_L(\tilde{s}_1)\cdots V^p_L(\tilde{s}_{|S_L|})]$ for $\tilde{s}_i\in S_L$.

Calculating the value function at time $L-1$, we have $V^*_{L-1}(b_{L-1}) = \max_{p\in P_{L-1}} b_{L-1}\cdot\alpha^p_{L-1}$ where $\alpha^p_{L-1} = [V^p_{L-1}(s_1)\cdots V^p_{L-1}(s_{|S|})]$ for $s_i\in S$. When $t = L-1$, we have
$$V^p_{L-1}(s) = R_{L-1}(s,a(p)) + \gamma\sum_{s'\in S} T_{L-1}(s,a(p),s')\sum_{\omega\in\Omega} O(s',a,\omega)\, V^{p_\omega}_L(s'),$$
where $p_\omega\in P_L$ is the policy subtree of the policy tree $p\in P_{L-1}$ when observing $\omega$ after the initial action. Since $b_{L-1}(s) = 0$ if $s\notin S_{L-1}$, then $V_{L-1}(b_{L-1}) = \max_{p\in P_{L-1}}\tilde{b}_{L-1}\cdot\tilde{\alpha}^p_{L-1}$ where $|\tilde{b}_{L-1}| = |\tilde{\alpha}^p_{L-1}| = |S_{L-1}|$ and $\tilde{\alpha}^p_{L-1} = [V^p_{L-1}(\tilde{s}_1)\cdots V^p_{L-1}(\tilde{s}_{|S_{L-1}|})]$ for $\tilde{s}_i\in S_{L-1}$.

Applying this reasoning inductively, we can show that we only need $V^p_t(s_t)$ for $s_t\in S_t$. Furthermore, if $s_t\in S_t$, then
$$V^p_t(s_t) = R_t(s_t,a(p)) + \gamma\sum_{s_{t+1}\in S_{t+1}} T_t(s_t,a(p),s_{t+1})\sum_{\omega\in\Omega} O(s_{t+1},a,\omega)\, V^{p_\omega}_{t+1}(s_{t+1}). \quad (3.2)$$
Thus, we only need $\{V^{p_\omega}_{t+1}(s_{t+1}) : s_{t+1}\in S_{t+1}\}$.

The value functions expressed for beliefs over the dynamic state spaces $S_t$ have expected rewards identical to those obtained using $S$. The advantage of this method is that, in generating the set of value vectors which are dominant at some underlying belief point (i.e.,
the parsimonious set) at a particular iteration, we eliminate vectors that are dominant only over belief supports that are not reachable. This reduces the set of possible policies that need to be considered at the next iteration. Line 6 of the DB-GIP function and the DP-UPDATE function of Algorithm 3 provide the algorithm for finding the dynamic states.

3.1.2 Dynamic Beliefs (DB)

By introducing dynamic state spaces, we are attempting to model more accurately the support on which reachable beliefs will occur. We can make this process more precise by using information about the initial belief distribution and the transition and observation probabilities to bound the belief dimensions with positive support. For example, if we know that our initial belief regarding task progress can have at most 0.10 probability of being at 0.25, with the rest of the probability mass on being at 0.00, we can find the maximum probability of being at 0.25 or 0.50 at the next stage, given a dynamic transition matrix. Below we outline a polynomial-time procedure by which we can obtain such bounds on belief support.

[Figure 3.1: Comparison of GIP and DB with respect to belief bounds. Left: with GIP, the maximum belief probability shown in every state is 1.0 at every iteration. Right: with DB, the per-state maximum belief probabilities shrink across iterations (e.g., values such as 0.5, 0.4, 0.6, 0.3 and 0.2, with 0.0 for unreachable states).]

Figure 3.1 provides an example comparison of the belief bounds obtained using DB and GIP. In the figure, each rectangular box represents a state, and the states in one column represent the states of the POMDP. Each column represents an iteration of dynamic programming, and the arrows between the boxes represent the transitions between states. The number inside each state represents the maximum possible belief probability for that state at that iteration. With GIP, this number remains 1 for all the states at all the iterations, while with DB, given the dynamics of the domain, it is possible to obtain a configuration such as the one shown on the right side of the figure.

Let $B_t \subset [0\;1]^{|S_t|}$ be a space such that $P(b_t \notin B_t) = 0$. That is, there exists no initial belief vector and action/observation sequence of length $t-1$ such that, by applying the standard belief update rule, one would get a belief vector $b_t$ not captured in the set $B_t$. Then we have
$$b_{t+1}(s_{t+1}) \ge \min_{a\in A,\, \omega\in \Omega_t,\, b_t\in B_t} F(s_{t+1},a,\omega,b_t) =: b^{\min}_{t+1}(s_{t+1})$$
$$b_{t+1}(s_{t+1}) \le \max_{a\in A,\, \omega\in \Omega_t,\, b_t\in B_t} F(s_{t+1},a,\omega,b_t) =: b^{\max}_{t+1}(s_{t+1})$$
where
$$F(s_{t+1},a,\omega,b_t) := \frac{O_t(s_{t+1},a,\omega)\sum_{s_t\in S_t} T_t(s_t,a,s_{t+1})\, b_t(s_t)}{\sum_{\tilde{s}_{t+1}\in S_{t+1}} O_t(\tilde{s}_{t+1},a,\omega)\sum_{s_t\in S_t} T_t(s_t,a,\tilde{s}_{t+1})\, b_t(s_t)}$$
Thus, if
$$B_{t+1} = [b^{\min}_{t+1}(s_1)\; b^{\max}_{t+1}(s_1)]\times\cdots\times[b^{\min}_{t+1}(s_{|S_{t+1}|})\; b^{\max}_{t+1}(s_{|S_{t+1}|})],$$
then we have $P(b_{t+1}\notin B_{t+1}) = 0$.

We now show how $b^{\max}_{t+1}(s_{t+1})$ (and similarly $b^{\min}_{t+1}(s_{t+1})$) can be generated through a polynomial-time procedure deduced from Lagrangian methods. Given an action $a$ and observation $\omega$, we can express the problem as
$$\max_{b_t\in B_t}\; b^{a,\omega}_{t+1}(s_{t+1}) \quad \text{s.t.}\quad b^{a,\omega}_{t+1}(s_{t+1}) = c^T b_t / d^T b_t$$
where $c(s) = O_t(s_{t+1},a,\omega)\, T_t(s_t,a,s_{t+1})$ and $d(s) = \sum_{s_{t+1}\in S_{t+1}} O_t(s_{t+1},a,\omega)\, T_t(s_t,a,s_{t+1})$. We rewrite the problem in terms of new variables as follows:
$$\min_x\; -c^T x / d^T x \quad \text{s.t.}\quad \sum_i x_i = 1,\quad 0\le x_i\le b^{\max}_t(s_i) =: \bar{x}_i \qquad (3)$$
where $\sum_i b^{\max}_t(s_i)\ge 1$ to ensure the existence of a feasible solution.
Expressing this problem as a Lagrangian, we have
$$L = -\frac{c^T x}{d^T x} + \lambda\Big(1 - \sum_i x_i\Big) + \sum_i \bar{\mu}_i (x_i - \bar{x}_i) - \sum_i \mu_i x_i$$
from which the KKT conditions imply
$$x_k = \bar{x}_k:\quad \lambda = \big[(c^T x)\, d_k - (d^T x)\, c_k\big]/(d^T x)^2 + \bar{\mu}_k$$
$$0 < x_k < \bar{x}_k:\quad \lambda = \big[(c^T x)\, d_k - (d^T x)\, c_k\big]/(d^T x)^2$$
$$x_k = 0:\quad \lambda = \big[(c^T x)\, d_k - (d^T x)\, c_k\big]/(d^T x)^2 - \mu_k.$$

Because $\lambda$ is identical in all three conditions and $\bar{\mu}_k$ and $\mu_k$ are non-negative for all $k$, the component with the lowest value of $(d^T x)\lambda = [(c^T x)/(d^T x)]d_k - c_k$ must receive a maximal allocation (assuming $\bar{x}_k < 1$), or the entire allocation otherwise. For example, suppose the state space has size 3 and the values of the expression $[(c^T x)/(d^T x)]d_k - c_k$ for the different values of $k$ are 5, 6 and 7. Since all the $\lambda$s (over all $x_k$) are identical, these values need to be made equal by deciding on the allocations for each of the $x_k$. Since $\sum_k x_k = 1$, it cannot be the case that all of these values are reduced, since reduction happens only in the third equation for $\lambda$, where $x_k = 0$ (because a non-negative variable $\mu_k$ is subtracted). Thus, it is imperative that the smaller of these values increase. As can be observed from the equations for $\lambda$, values can be increased only in the case $x_k = \bar{x}_k$ (because a non-negative variable $\bar{\mu}_k$ appears in that equation), and hence full allocation goes to the smaller values of $(d^T x)\lambda = [(c^T x)/(d^T x)]d_k - c_k$.

Using this reasoning recursively, we see that if $x^*$ is an extremal point (i.e., a candidate solution), then the values of its components $\{x_k\}$ must be constructed by giving as much weight as possible to components in the order prescribed by $z_k = y\, d_k - c_k$, where $y = (c^T x^*)/(d^T x^*)$. Given a value of $y$, one can construct a solution by iteratively giving as much weight as possible (without violating the equality constraint) to the component not already at its bound with the lowest $z_k$.

The question then becomes finding the maximum value of $y$ which yields a consistent solution. We note that $y$ is the value we are attempting to maximize, which we can bound with $y_{\max} = \max_i c_i/d_i$ and $y_{\min} = \min_i c_i/d_i$. We also note that for each component $k$, $z_k$ describes a line over the support $[y_{\min}, y_{\max}]$. We can then find the set of all points where the lines described by $\{z_k\}$ intersect. There can be at most $(N-1)N/2$ intersection points. We can then partition the support $[y_{\min}, y_{\max}]$ into disjoint intervals using these intersection points, yielding at most $(N-1)N/2 + 1$ regions. In each region, there is a consistent ordering of $\{z_k\}$, which can be obtained in polynomial time. An illustration of this can be seen in Figure 3.2. Beginning with the region furthest to the right on the real line, we create the candidate solution implied by the ordering of $\{z_k\}$ in that region and then calculate the value of $y$ for that candidate solution. If the obtained value of $y$ does not fall within the region, then the solution is inconsistent and we move to the region immediately to the left. If the obtained value of $y$ does fall within the region, then we have the candidate extremal point which yields the highest possible value of $y$, which is the solution to the problem. By using this technique we can dynamically propagate forward bounds on the feasible belief states. Lines 12 and 13 of the DB-GIP function in Algorithm 3 provide the procedure for DB. The GET-CONSTRAINTS function on line 12 gives the set of $c^T$ and $d^T$ vectors for each state at time $t$, for each action and observation.
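A compact sketch of this bound computation is given below. It is illustrative only: the function names, the numbers in the example, and the details of the region scan (candidate y values taken from the pairwise intersection points plus the endpoints, with a numerical fallback) are assumptions made for brevity rather than the thesis's GET-BOUND implementation.

```python
from itertools import combinations

def greedy_allocation(order, upper):
    """Give as much weight as possible, in the given order, without exceeding sum = 1."""
    x = [0.0] * len(upper)
    remaining = 1.0
    for k in order:
        x[k] = min(upper[k], remaining)
        remaining -= x[k]
        if remaining <= 1e-12:
            break
    return x

def get_bound(c, d, upper):
    """Maximize (c.x)/(d.x) s.t. sum(x) = 1, 0 <= x_k <= upper_k (assumes d_k > 0 and
    sum(upper) >= 1). Candidate extreme points come from the orderings of
    z_k(y) = y*d_k - c_k, which can only change at pairwise intersection points of
    the lines z_k; the consistent candidate in the rightmost region is returned."""
    n = len(c)
    y_lo = min(ck / dk for ck, dk in zip(c, d))
    y_hi = max(ck / dk for ck, dk in zip(c, d))
    cuts = {y_lo, y_hi}
    for j, k in combinations(range(n), 2):
        if abs(d[j] - d[k]) > 1e-12:
            y = (c[j] - c[k]) / (d[j] - d[k])
            if y_lo <= y <= y_hi:
                cuts.add(y)
    points = sorted(cuts)
    best = y_lo
    # Scan regions from right to left; within a region the ordering of z_k is fixed.
    for lo, hi in reversed(list(zip(points[:-1], points[1:]))):
        y_mid = 0.5 * (lo + hi)
        order = sorted(range(n), key=lambda k: y_mid * d[k] - c[k])   # ascending z_k
        x = greedy_allocation(order, upper)
        num = sum(ck * xk for ck, xk in zip(c, x))
        den = sum(dk * xk for dk, xk in zip(d, x))
        y_cand = num / den
        best = max(best, y_cand)
        if lo - 1e-9 <= y_cand <= hi + 1e-9:
            return y_cand          # consistent candidate: the maximum of (3)
    return best                     # numerical fallback

# Assumed coefficients c, d and previous-stage belief bounds; returns the exact maximum
# of the fractional objective over the bounded simplex.
print(get_bound(c=[0.6, 0.3, 0.7], d=[0.8, 0.5, 0.9], upper=[0.8, 0.6, 0.5]))
```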
y z z z z y y min max z 1 2 3 4 Figure3.2: PartitionProcedureforSolvingBeliefMaximizationLagrangian Inthebeliefmaximizationequationof(3),ifb max t (s i )isequalto1forallstatess i ,thenitcan beeasilyprovedthatthemaximumvalueisequaltomax k c k /d k . Thusthisspecialcasedoesn’t evenrequirethecomplexityofthelagrangianmethod,andcanbesolvedinO(|S|log(|S|)). How- ever, if the maximum possible value of belief probability in the previous stage is not equal to 1, max k c k /d k can serve only as a bound and not the exact maximum. A simple improvement to theabovemethodisassigningx k stheirmaximumvalue(untilthesumis1)basedontheorderof c k /d k . However,ascanbeseenintheexamplebelow,thismethoddoesn’tyieldthemaximum. max((0.6x 1 +0.3x 2 +0.7x 3 )/(0.8x 1 +0.5x 2 +0.9x 3 )) s.t. 0<x 1 < 0.8,0<x 2 < 0.6,0<x 3 < 0.5, X i x i = 1 30 B a 2 ,ω 2 t B a 1 ,ω 1 t B a 1 ,ω 2 t B a 2 ,ω 1 t B DB t b(s i ) b(s j ) Figure3.3: IllustrationofDDBvsDB Byusingdynamicbeliefs,weincreasethecostsofpruningbyaddingsomeconstraints. However, there is an overall gain because we are looking for dominant vectors over a smaller support and this reduces the cardinality of the parsimonious set, leaving fewer vectors to consider at the next iteration. 3.1.3 DynamicDisjunctiveBeliefs(DDB) ThekeyinsightinDBisthatgivenaboundedsetofbeliefsatthebeginningoftheproblem,there aremanybeliefsthatarenotpossibleatlaterstages. Byeliminatingreasoningaboutpoliciesthat are optimal at unreachable beliefs, run-time can be improved significantly without sacrificing the quality of the solution. This is accomplished by performing the pruning operation over the reachable belief polytope rather than the entire simplex. In DB, we outlined the procedure to obtainthemaximumbelieftobeassignedtoaparticularstateataparticularepochforaparticular action and observation. Let us denote this asb a,ω,max t+1 (s), which is the output of the constrained optimization problem solved with Lagrangian techniques. We can similarly find the minimum possiblebelief. 31 InDB,thedynamicbeliefpolytopeforanepochtiscreatedasfollows: 1. Find the maximum and minimum possible belief for each state over all actions and obser- vations: b max t+1 (s) = max a,ω b a,ω,max t+1 (s),b min t+1 (s) = max a,ω b a,ω,min t+1 (s). 2. Createabeliefpolytopethatcombinestheseboundsoverallstates: B t+1 = [b min t+1 (s 1 )b max t+1 (s 1 )]×···×[b min t+1 (s |S| )b min t+1 (s |S| )]. While this is an appropriate bound in that any belief outside this is not possible given the initialbeliefs,transitionprobabilitiesandobservationprobabilities,itispossibletomakeaneven tighter expression of the feasible beliefs. We refer to this new method for reducing feasible beliefs as Dynamic Disjunctive Belief (DDB) bounds. The disjunctions are due to the fact that future beliefs depend on particular action-observation pairs ( max is overa,ω in (1) above). By eliminatingtheconditioningonactionsandobservations,wemaybeincludinginfeasiblebeliefs. ThisisillustratedinFigure3.3,foratwo-observation,two-actionsystemoveratwo-dimensional support. We see thatB DB t , while smaller than the entire space of potential beliefs, is larger than necessaryasitisnotpossibletobelieveanythingoutsideof∪ i=1,2 B a i ,ω i t . Thus,weintroducethe DDBmethodforcomputingfeasiblebeliefspaces: 1. Obtainb a,ω,max t+1 (s)andb a,ω,min t+1 (s),∀a∈A,ω∈ Ω 2. 
Createmultiplebeliefpolytopes,oneforeachaction-observationpair,asfollows: B a i ,ω i t+1 = [b a i ,ω i ,min t+1 (s 1 ) b a i ,ω i ,max t+1 (s 1 )] × ··· × [b a i ,ω i ,min t+1 (s |S| ) b a i ,ω i ,max t+1 (s |S| )]. 32 ThefeasiblebeliefspaceisthenB DDB t =∪ a,w B a,w t . However,thisisdisjunctiveandcannot be expressed in the LP that is used for pruning. Instead, we prune over each B a,w t and take the union of dominant vectors for these supports. This increases the number of LP calls for a fixed epoch but the LPs cover smaller spaces and will yield fewer vectors at the end. Figure 3.1.3 provides an instance from the example in Figure 2.3.1, where DB and DDB provide improved performancewhencomparedagainstGIP.WithGIPtheparsimonioussetconsistedofsixpolicies. However, given a belief bound of 0.2-0.8 with DB, the parsimonious set only consists of four policies as opposed to six. Similarly with DDB, if we assume there were two small regions (correspondingtotheregion0.2-0.8)0.2-0.3and0.7-0.8(say). Inthisinstance,theparsimonious set obtained with DDB would only consist of two policies. This reduction in the size of the parsimonious set provides improvement in performance because of the cascade effect it has on thesizesoftheparsimonioussetatfutureiterations. Iwillpresenttheexperimentalresultsforthesetechniques(DS,DB,DDB)inchapter5. 33 B e l i e f P r o b a b i l i t y E x p e c t e d V a l u e 0 1 P r u n i n g i n G I P 0 B e l i e f P r o b a b i l i t y E x p e c t e d V a l u e 0 1 P r u n i n g w i t h D B 0 0 . 2 0 . 8 B e l i e f P r o b a b i l i t y E x p e c t e d V a l u e 0 1 P r u n i n g w i t h D D B 0 0 . 2 0 . 8 0 . 3 0 . 7 S i x p o l i c i e s i n p a r s i m o n i o u s s e t F o u r p o l i c i e s i n p a r s i m o n i o u s s e t T w o p o l i c i e s i n p a r s i m o n i o u s s e t Figure3.4: IllustrationofpruninginDBandDDBwhencomparedagainstGIP 34 Chapter4: DirectvalueapproximationforPOMDPs Approximate algorithms, a currently popular approach, address the significant computational complexity in POMDP policy generation by sacrificing solution quality for speed [Pineau et al., 2003; Smith and Simmons, 2005; Zhou and Hansen, 2001; Hauskrecht, 2000a]. Furthermore, most of the approximate algorithms do not provide any guarantees on the quality loss and the ones (point-based approaches [Pineau et al., 2003; Smith and Simmons, 2005]) which provide expressionsforerrorboundshave the followingdrawbacks: (a) The boundsin thesepoint-based approximation algorithms are based on maximum and minimum possible single-stage reward, which can take extreme values, leading to a very loose bound which is not useful in many do- mains, especially in those that have penalties (e.g., R min 0). (b) The computation of the boundforaparticularpolicyisitselfapotentiallysignificantexpense(e.g.,requiringanon-linear program). (c) The algorithms cannot guarantee that they will yield a policy that can achieve a pre-specifiederrorbound. This earlier work in approximately solving POMDPs has focused primarily on sampling the belief space, and finding policies corresponding to a sampled set of belief points. These policies are then extrapolated to the entire belief space [Pineau et al., 2003; Smith and Simmons, 2005; ZhouandHansen,2001;Hauskrecht,2000a]. Onthecontrary,weproposeanapproach,Expected 35 ValueApproximation(EVA),thatapproximatespoliciesdirectlybasedonthetheirexpectedval- ues. Thus, this approach to approximation is beneficial in domains that require tight bounds on solutionquality. 
Furthermore,EVAprovidesaboundthatdoesnotdependonthemaximumand minimumpossiblerewards(R max andR min ). The value function in a POMDP is piecewise linear and can be expressed by a set of vectors (representative of policies). Approximate algorithms generate smaller vector sets (and thus, a reduced policy space) than the optimal sets. Existing approaches generate these vector sets by sampling the possible beliefs and finding the vectors that dominate over this space. However in EVA, we approximate directly in the value space by representing the optimal set of vectors with a set of vectors whose expected reward will be within a desired bound of the optimal reward. In a multi-stage decision process, solved using dynamic programming techniques, the reduction of vectorsatonestagecanresultinfewervectorsgettinggeneratedatfuturestages. Thiscanresult in theimprovementofoverallperformance, because pruning (the most expensive step in solving aPOMDP)requiresfewernumberofstepsforthereducedset. 4.1 EVAAlgorithm ThevaluefunctioninaPOMDPispiecewiselinearandcanbeexpressedbyasetofvectors. Ap- proximatealgorithmsgeneratefewervectorsetsthantheoptimalalgorithms. Existingapproaches generate these reduced vector sets by sampling the belief space and finding the vectors that ap- ply only at these points. In our approach, Expected Value Approximation (EVA), we choose a reducedsetofvectorsbyapproximatingthevaluespacewithasubsetofvectorswhoseexpected rewardwillbewithinadesiredboundoftheoptimalreward. 36 Usinganapproximation(subset)oftheoptimalparsimonioussetwillleadtolowerexpected quality at some set of belief points. Let denote the maximum loss in quality we will allow at any belief point. We henceforth refer to any vector set that is at most away from the optimal value at all points in the belief space (as illustated in Fig 4.1) as an-parsimonious set. The key probleminEVAistodeterminethis-parsimonioussetefficiently. To that end, we employ a heuristic that extends the pruning strategies presented in GIP. In GIP,aparsimonioussetV correspondingtoasetofvectors,U isobtainedinthreesteps: 1. InitializetheparsimonioussetV withthedominantvectorsatthesimplexpoints. 2. For some chosen vectoru∈U, execute a LP to compute the belief pointb whereu domi- natesthecurrentparsimonioussetV. 3. Computethevectoru 0 withhighestexpectedvalueinthesetU atthebeliefpoint,b;remove vectoru 0 fromU andaddittoV. EVAmodifiesthefirsttwosteps,toobtainthe-parsimoniousset: 1. Sinceweareinterestedinrepresentingtheoptimalparsimonioussetwithasfewvectorsas possible, the initialization process only selects one vector over the beliefs in the simplex extrema. Wechooseavectorwiththehighestexpectedvalueatthemostnumberofsimplex beliefpoints,choosingrandomlytobreakties. 2. The LP is modified to check for -dominance, i.e., dominating all other vectors by at somebeliefpoint. Algorithm4providesamodifiedLPwithb max t andb min t . The key difference between the LP used in GIP and the one in Algorithm 4 is the in RHS ofline4whichchecksforexpectedvaluedominanceofthegivenvectorw overavectoru∈ U. 37 Algorithm4 LP-D OMINATE(w,U,b max t ,b min t ,) 1: variables: d,b(s t )∀s t ∈S t 2: maximized 3: subjecttotheconstraints 4: b·(w−u)≥d+,∀u∈U 5: Σ st∈St b(s t ) = 1,b(s t )≤b max t (s t ),b(s t )≥b min t (s t ) 6: ifd≥ 0returnbelsereturnnil IncludingaspartoftheRHSconstrainsw todominateothervectorsbyatleast. Inthefollow- ingpropositions,weprovethecorrectnessoftheEVAalgorithmandtheerrorboundprovidedby EVA.LetV andV ∗ denotethe-parsimoniousandoptimalparsimoniousset,respectively. 
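Before turning to the correctness propositions, the ε-dominance test of Algorithm 4 can be sketched in a few lines. This is only an illustrative re-expression using scipy.optimize.linprog, not the solver used for the reported experiments, and the function and variable names are mine.

import numpy as np
from scipy.optimize import linprog

def lp_dominate_eps(w, U, b_max, b_min, eps):
    # Return a belief at which w epsilon-dominates every u in U, or None.
    # Decision vector is [b_1, ..., b_n, delta]; we maximize delta subject to
    #   b.(w - u) >= delta + eps  for all u in U,  sum(b) = 1,  b_min <= b <= b_max.
    # U is assumed non-empty.
    w = np.asarray(w, float)
    n = len(w)
    obj = np.zeros(n + 1)
    obj[-1] = -1.0                                             # maximize delta
    A_ub = [np.concatenate([np.asarray(u, float) - w, [1.0]]) for u in U]
    b_ub = [-eps] * len(U)
    A_eq = [np.concatenate([np.ones(n), [0.0]])]
    res = linprog(obj, A_ub=np.array(A_ub), b_ub=b_ub,
                  A_eq=np.array(A_eq), b_eq=[1.0],
                  bounds=list(zip(b_min, b_max)) + [(None, None)])
    if res.success and res.x[-1] >= 0:
        return res.x[:-1]                                      # witness belief point
    return None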
Proposition2 ∀b∈ Δ,theentirebeliefspace,ifv b = argmax v ∈V v ·bandv ∗ b = argmax v ∗ ∈V ∗v ∗ · b,thenv b ·b+≥v ∗ b ·b. Proof. Weprovethisbycontradiction. Assume∃b∈ Δsuchthatv b ·b+<v ∗ b ·b. Thisimplies v b 6=v ∗ b ,andv ∗ b / ∈V . Wenowconsiderthesituation(s)whenv ∗ b isconsideredbyEVA.Atthese instants, there will be a current parsimonious setV and a set of vectors still to be consideredU. Let ˆ b = argmax ˜ b∈Δ {min v∈V (v ∗ b · ˜ b−v· ˜ b)} bethebeliefpointatwhichv ∗ b isbestw.r.t.V. Let ˆ v ˆ b = argmax v∈V v· ˆ b bethevectorinV whichisbestat ˆ b. Let ˆ u ˆ b = argmax u∈U u· ˆ b 38 bethevectorinU whichisbestat ˆ b. Therearethreepossibilities: 1. v ∗ b · ˆ b< ˆ v ˆ b · ˆ b+: Thisimpliesv ∗ b · ˆ b−ˆ v ˆ b · ˆ b<.Bythedefinitionof ˆ b,wehave v ∗ b ·b−ˆ v b ·b<v ∗ b · ˆ b−ˆ v ˆ b · ˆ b< where ˆ v b = argmax v∈V v·b.Thisimplies v ∗ b ·b< ˆ v b ·b+≤v b ·b+, whichisacontradiction. 2. v ∗ b · ˆ b≥ ˆ v ˆ b · ˆ b + and v ∗ b · ˆ b≥ ˆ u· ˆ b: This means v ∗ b would have been included in the -parsimoniousset, v ∗ b ∈V ,whichisacontradiction. 3. v ∗ b · ˆ b≥ ˆ v ˆ b · ˆ b+ andv ∗ b · ˆ b < ˆ u· ˆ b: ˆ u will be included inV andv ∗ b is returned toU to be consideredagainuntiloneofprevioustwoterminalconditionsoccur. V ∗ (b) ε V(b) b VECTOR OF OPTIMAL SET VECTOR OF APPROXIMATE SET V ε (b) Figure4.1: EVA:Anexampleofan-parsimoniousset Proposition3 The error introduced by EVA at each stage of the policy computation, is bounded by2|Ω|forGIP-typecross-sumpruning. 39 Proof. The EVA algorithm introduces an error of in a parsimonious set whenever a pruning operation (PRUNE) is performed, due to Proposition 2. In GIP, there are three pruning steps at eachstageofpolicycomputation. 1.V a,o =PRUNE(V a,o,i ): Afterthisstep,eachV a,o isatmostawayfromoptimal∀a,∀o. 2.V a = PRUNE(···(PRUNE(V a,o 1 ⊕V a,o 2 )···⊕V a,o |Ω| ): WebeginwithV a,o 1 which is away from optimal by at most. Each pruning operation adds2 to the bound ( for the new termV a,o i and for the PRUNE). There are|Ω|−1 prune operations. Thus, each V a,o isawayfromtheoptimalbyatmost2(|Ω|−1)+. 3.V 0 = PRUNE( S a∈A V a ): The error of S a∈A V a is bounded by the error ofV a . The PRUNE adds,leadingtoatotalone-stageerrorboundof 2|Ω|. Proposition4 The total error introduced by EVA (for GIP-type cross-sum pruning) is bounded by2|Ω|T foraT-horizonproblem. 40 Proof. LetV t (b) andV ∗ t (b) denote the EVA-policy and optimal value function, respectively, at timet. IfA t ⊆A,isthesetofactionsattherootsofallpolicy-treesassociatedwith V t ,theEVA vectorsetfortimetande t = max b∈B {V ∗ t (b)−V t (b)},then,V t−1 (b) = =max a∈A t {R(b,a)+Σ b 0 ∈B P(b 0 |b,a)V t (b 0 )} ≥max a∈A t {R(b,a)+Σ b 0 ∈B P(b 0 |b,a)V ∗ t (b 0 )} −Σ b 0 ∈B P(b 0 |b,a)e t =max a∈A t {R(b,a)+Σ b 0 ∈B P(b 0 |b,a)V ∗ t (b 0 )}−e t ≥max a∈A {R(b,a)+Σ b 0 ∈B P(b 0 |b,a)V ∗ t (b 0 )}−2|Ω|−e t =V ∗ t−1 (b)−2|Ω|−e t The last inequality is due to Proposition 2, and the other relations are by definition. The above impliesthate t−1 =e t +2|Ω|andwithe T = 2|Ω|,wehavetheerrorfortheEVA-policyatthe beginningofexecution,e 1 = 2|Ω|T (thetotalerrorboundforEVA). Similarly, it can be proved that for γ-discounted infinite horizon problems, the total error boundis 2|Ω| 1−γ . I present the experimental results for EVA combined with the belief bound techniques in chapter5. 41 Chapter5: ResultsforDS,DB,DDBandEVA This chapter focuses on experimental results with the techniques introduced in chapters 3 and 4. 
While these techniques could be used in conjunction with different exact algorithms, including both GIP and RBIP, in this chapter we will focus on enhancing the GIP algorithm. All our enhancements were implemented over GIP 1 We implemented over GIP, as our enhancements over GIP performed better than over RBIP [Varakantham et al., 2005]. Thus in the following paragraphs, DS refers to GIP+DS, DB to GIP+DB, DDB to GIP+DDB and EVA to EVA+GIP. All the experiments 2 compare the performance (run-time) of GIP, RBIP and our enhancements overGIP.Theexperimentalsetupconsistedof10TMPproblems. Eachproblemhadpre-specified run-timeupperlimitof20000seconds. We conducted two sets of experiments with regards to the enhancements presented in Chap- ter 3 and Chapter 4. The first set of experiments focused on the Task Management Problem (TMP) [Varakantham et al., 2005] in software personal assistant domains. In comparing with other algorithms, it is useful to recall from chapter 2, that this domain has a reward of negative infinity associated with certain actions. As suggested earlier in chapter 2, a POMDP algorithm solvingTMPproblemsneedstohavethefollowingcharacteristics: (a)Computepolicyforapre- specifiedqualitybound;(b)Thisboundmustholdforallpossiblestartingbeliefpoints–because 1 Our enhancements were implemented over Anthony Cassandra’s POMDP solver “http://pomdp.org/pomdp/code/index.shtml. 2 Machinespecsforallexperiments: IntelXeon2.8GHZprocessor,2GBRAM,LinuxRedhat8.1 42 we may start the problem in any possible starting belief states; (c) Efficiency of policy computa- tion is of the essence because if agents require significant amounts of computation prior to each taskallocation,thenthatcouldhinderhumantaskperformance. Existing approaches to solving POMDPs have limited applicability in this domain. Approx- imate approaches provide trivial bounds owing to the presence of the negative infinity reward (R min ). Furthermore, algorithms like PBVI (and HSVI) can provide a guarantee on solution qualitycorrespondingtoafixedstartingbeliefpointonly,thusfailingin(b)above. Thoughexact algorithmsprovidequalityguarantees,theydosoatthecostofcomputationalcomplexity,losing out on (c). Our belief bound techniques along with EVA, though limited in their applicability individuallyintheseproblems,canincombinationhandletheconstraintsmentionedabove. Figure5.1: ComparisonofperformanceofEVA+DS,EVA+DB,andEVA+DDBfor=0.01 None of GIP, RBIP, EVA, DS, DB, DDB terminated within the prespecified limit of 20,000 seconds for either of the problems. EVA was run with a low error bound to illustrate the utility of DS, DB and DDB techniques. We combined our exact enhancements (DS, DB, DDB) with 43 Figure5.2: ComparisonofperformanceofEVA+DS,EVA+DB,andEVA+DDBfor=0.02 Figure5.3: ComparisonofperformanceofEVA+DS,EVA+DB,andEVA+DDBfor=0.03 EVA. Figures 5.1, 5.2 and 5.3 provides the comparison of performances of these combined techniques for varying values in EVA. In Figures 5.1, 5.2 and 5.3, x-axis indicates the TMP problems, while the y-axis indicates the time to solution in seconds. The three bars, shown 44 for each problem, indicate the run-times of EVA+DS, EVA+DB, and EVA+DDB. All the three figures clearly illustrate the dominance of DDB over DB and DS. For instance in Figure 5.1, EVA+DDB provides 66.9-fold speedup over EVA+DS and 33.4-fold speedup for TMP problem 8. 
Thekeytonoteisthatnoneoftheoriginaltechniquesworkedwithinthe20000secondslimit, andthusthatEVA+DDBrunsinlessthan100secondsandoftenisnotvisibleonthechartshows thesignificantspeedupsobtainedbyEVA+DDB. Second set of experiments illustrate the utility of EVA in other kinds of problems that do not have all the constraints of TMP. We considered problems that did not have any rewards of negativeinfinityandwherequalityboundwasdesiredforagivenstartingbeliefpoint. Forthis,we providecomparisonswithapproximateapproachesthatprovidequalitybounds. Weexperimented withfiveproblems: Tiger-grid,Hallway,Hallway2,Aircraft(fromAnthonyCassandra’swebsite) and Scotland yard 3 . Figure 5.4 provides comparisons against PBVI 4 . In Figure 5.4, the x-axis indicates the problem, and the y-axis indicates the time taken to solution on a log scale. For each problem, the first bar is the time for solving the NLP(computing error bound) in PBVI; the second and third bars are run-times of PBVI and EVA for a fixed error bound; the fourth bar is therun-timeforEVAforatighterbound(halfoftheboundusedforbars2and3). Thetimetaken byPBVIforeachproblemisthesumoffirstandsecondbars. The first aspect of comparison is the time overhead in computing the error bounds. In EVA, error bound computation is negligible as it requires only a multiplication and hence is not pre- sented in the figure. However for PBVI, it can be noted from first bars of Figure 5.4 that this takes a non-trivial amount of time and in some cases is comparable to the time taken by PBVI. 3 Probleminspiredfromscotlandyardgame(216s,16a,6o) 4 We do not have results with HSVI2, an approach that is shown to better than PBVI. However, it should be noted thattherun-timeandqualityresultspresentedhereareonlytoindicatethatEVAcanprovidecomparableperformance toexistingapproacheswhileprovidingqualityboundadvantages 45 ForinstanceinHallway2,theNLP(firstbar)takes143seconds,whichis1/4ofPBVI’srun-time (second bar). More importantly, the error bound computation time is comparable to the time takenbyEVA. The second point of the study is the run-time performance of the actual algorithms (not in- cluding the time taken for error bound computation). Due to the dependence of point-based algorithms (PBVI and HSVI) on the starting belief point, all results for PBVI are averaged over ten randomly generated starting beliefs. Furthermore, to avoid punishing PBVI (in terms of run- time) for planning multiple times, we removed the anytime nature of PBVI, i.e., made it to plan forasetofbeliefpoints(computedaccordingtothebeliefpointselectionheuristicfrom[Pineau etal.,2003])thatprovidesthegivenerrorbound. Figure5.4: RuntimecomparisonofEVAandPBVI Second and third bars for each problem in Figure 5.4 provide this run-time comparison of EVA and PBVI for a given quality bound on the solution. It shows that EVA outperforms PBVI. Forinstance,thespeedupobtainedwithEVAis59.6-foldforHallway. Wealsoexperimentedwith the“Tag”problemfrom[Pineauetal.,2003]. PBVIcouldnotfinishwithinthepre-specifiedlimit of2000seconds,whileEVAterminatedwithin700secondswithaqualityof-9.19(approximately equaltothevaluereportedin[Pineauetal.,2003]). 46 PBVI EVA EVA (samebound) (sametime) Tiger-Grid -1.692 -1.62 -1.420 Hallway 0.122 0.118 0.267 Hallway2 0.038 0.027 0.08 Aircraft 7.416 7.416 7.416 ScotlandYard 0.073 -0.377 0.214 Table5.1: ComparisonofexpectedvalueforPBVIandEVA In PBVI, there is no clear dominance of one belief-point-selection heuristic over the others [Pineauetal.,2003;PineauandGordon,2005]foralltheproblems. 
However,thereisdominance of certain heuristics on some problems in terms of both quality and time to solution. To account forthis,weprovideacomparisonofPBVI,assumingthatitprovidedamuchtighterbound(half oftheactualbound). Inotherwords,giventhataheuristicmayimprovePBVIsolutionqualityby 100lookathalfoftheactualboundforacomparison. Valuesinthefourthbar(foreachproblem) of Figure 5.4 indicate that EVA outperforms PBVI even in that case. For instance, there is still a 27.3-foldspeedupforhallway. Table 5.1 presents the third aspect of comparison between PBVI and EVA: actual solution quality. For the same error bound on solution quality, PBVI (column 2) performs better for Hallway, Hallway2 and Scotland Yard, while EVA (column 3) performs better for Tiger-Grid. However, if the restriction is on time to solution, EVA (column 4) obtains higher quality than PBVI in all five problems. For instance, PBVI obtains a quality of 0.122, while EVA obtains a qualityof0.267fortheHallwayproblem. 47 Chapter6: ExploitinginteractionstructureinDistributedPOMDPs In this chapter, I continue the theme of exploiting structure for efficiency, but now in distributed POMDPs. The structure exploited is in the interactions between the agents of a distributed POMDP. Earlier, researchers have attempted two different approaches to address the complexity of distributed POMDPs. First type of algorithms sacrifice global optimality and instead concen- trateonlocaloptimality [Nairetal.,2003a;Peshkinetal.,2000a]. Ontheotherhand,thesecond kindofapproacheshavefocusedonrestrictedtypesofdomains,e.g. withtransitionindependence orcollectiveobservability [Beckeretal.,2003,2004]. Whiletheseapproacheshaveledtouseful advances, the complexity of the distributed POMDP problem has limited most experiments to a centralpolicygeneratorplanningforjusttwoagents. WeintroduceathirdcomplementaryapproachcalledNetworkedDistributedPOMDPs(ND- POMDPs), that is motivated by domains such as distributed sensor nets [Lesser et al., 2003], distributedUAVteamsanddistributedsatellites,whereanagentteammustcoordinateunderun- certainty,butwithagentshavingstronglocalityintheirinteractions. Forexample,withinalarge distributedsensornet, smallsubsetsof sensoragentsmustcoordinatetotracktargets. Toexploit such local interactions, ND-POMDPs combine the planning under uncertainty of POMDPs with the local agent interactions of distributed constraint optimization (DCOP) [Modi et al., 2003b; 48 Maheswaran et al., 2004; Yokoo and Hirayama, 1996]. DCOPs have successfully exploited lim- itedagentinteractionsinmultiagentsystems,withoveradecadeofalgorithmdevelopment. Dis- tributedPOMDPsbenefitbybuildinguponsuchalgorithmsthatenabledistributedplanning,and provide algorithmic guarantees. DCOPs benefit by enabling (distributed) planning under uncer- tainty—akeyDCOPdeficiencyinpracticalapplicationssuchassensornets[Lesseretal.,2003]. Taking inspiration from DCOP algorithms, we provide two algorithms for ND-POMDPs, a locally optimal algorithm, LID-JESP and a global optimal algorithm, GOA. First, within LID- JESPwepresenttwowaysofexploitingthelocalityofinteraction,namelyDBA/DSAandHLD. DBA/DSA exploits the external interaction structure, by combining the existing JESP algorithm of Nair et al. [Nair et al., 2003a] and DCOP algorithms, Distributed Breakout Algorithm (DBA) anditsstochasticvariant,DistributedStochasticAlgorithm(DSA) [YokooandHirayama,1996]. 
ThisapproachthuscombinesthedynamicprogrammingofJESPwiththeinnovationthatituses distributedpolicygenerationinsteadofJESP’scentralizedpolicygeneration. Ontheotherhand, hyper-link-baseddecomposition (HLD)exploitsthestructureintroducedinsideanagent,because of the interaction graph. Concretely, this method works by decomposing each agent’s local op- timization problem into loosely-coupled optimization problems for each hyper-link. This allows us to further exploit the locality of interaction, resulting in faster run times for both DBA and DSAwithoutanylossinsolutionquality. Finally, by empirically comparing the performance of the algorithm with benchmark algo- rithmsthatdonotexploitnetworkstructure,weillustratethegainsinefficiencymadepossibleby exploiting network structure in ND-POMDPs. Through detailed experiments, we show how this canresultinspeedupswithoutsacrificingonsolutionquality. Wealsopresentdetailedcomplexity resultsthatindicatethedifferenceintroducedbecauseofexploitinginteractionstructure. 49 6.1 ND-POMDPs We define an ND-POMDP to be a specialization of MTDP as follows. In particular, we define ND-POMDPasagroup Ag ofnagentsasatuplehS,A,P,Ω,O,R,bi,whereS =× 1≤i≤n S i × S u is the set of world states. S i refers to the set of local states of agent i and S u is the set of unaffectablestates. Unaffectablestatereferstothatpartoftheworldstatethatcannotbeaffected by the agents’ actions, e.g. environmental factors like target locations that no agent can control. A =× 1≤i≤n A i isthesetofjointactions,whereA i isthesetofactionforagenti. WeassumeatransitionindependentdistributedPOMDPmodel,wherethetransitionfunction is defined asP(s,a,s 0 ) = P u (s u ,s 0 u )· Q 1≤i≤n P i (s i ,s u ,a i ,s 0 i ), wherea =ha 1 ,...,a n i is the joint action performed in state s = hs 1 ,...,s n ,s u i and s 0 = hs 0 1 ,...,s 0 n ,s 0 u iis the resulting state. Agent i’s transition function is defined as P i (s i ,s u ,a i ,s 0 i ) = Pr(s 0 i |s i ,s u ,a i ) and the unaffectable transition function is defined as P u (s u ,s 0 u ) = Pr(s 0 u |s u ). Becker et al. [Becker et al., 2004] also relied on transition independence, and Goldman and Zilberstein [Goldman and Zilberstein, 2004] introduced the possibility of uncontrollable state features. In both works, the authorsassumedthatthestateiscollectivelyobservable,anassumptionthatdoesnotholdforour domainsofinterest. Ω =× 1≤i≤n Ω i isthesetofjointobservationswhereΩ i isthesetofobservationsforagentsi. Wemakeanassumptionofobservationalindependence,i.e.,wedefinethejointobservationfunc- tion as O(s,a,ω) = Q 1≤i≤n O i (s i ,s u ,a i ,ω i ), where s =hs 1 ,...,s n ,s u i, a =ha 1 ,...,a n i, ω =hω 1 ,...,ω n i,andO i (s i ,s u ,a i ,ω i ) = Pr(ω i |s i ,s u ,a i ). Therewardfunction,R,isdefinedasR(s,a) = P l R l (s l1 ,...,s lk ,s u ,ha l1 ,...,a lk i),where each l could refer to any sub-group of agents and k = |l|. In the sensor grid example, the 50 reward function is expressed as the sum of rewards between sensor agents that have overlap- ping areas (k = 2) and the reward functions for an individual agent’s cost for sensing (k = 1). Based on the reward function, we construct an interaction hypergraph where a hyper-link, l, exists between a subset of agents for all R l that comprise R. Interaction hypergraph is defined as G = (Ag,E), where the agents, Ag, are the vertices and E = {l|l ⊆ Ag∧ R l isacomponentofR} are the edges. Neighborhood of i is defined as N i ={j ∈ Ag|j 6= i∧(∃l∈ E, i∈ l∧j∈ l)}. S N i =× j∈N i S j refers to thestates ofi’s neighborhood. 
Similarly we define A N i =× j∈N i A j , Ω N i =× j∈N i Ω j , P N i (s N i ,a N i ,s 0 N i ) = Q j∈N i P j (s j ,a j ,s 0 j ), and O N i (s N i ,a N i ,ω N i ) = Q j∈N i O j (s j ,a j ,ω j ). b, the distribution over the initial state, is defined asb(s) = b u (s u )· Q 1≤i≤n b i (s i ) whereb u andb i refertothedistributionsoverinitialunaffectablestateandi’sinitialstate,respectively. We defineb N i = Q j∈N i b j (s j ). Weassumethatbisavailabletoallagents(althoughitispossibleto refineourmodeltomakeavailabletoagentionlyb u ,b i andb N i ). ThegoalinND-POMDPisto compute joint policy π =hπ 1 ,...,π n i that maximizes the team’s expected reward over a finite horizonT startingfromb. π i referstotheindividualpolicyofagentiandisamappingfromthe set of observation histories of i to A i . π N i and π l refer to the joint policies of the agents in N i andhyper-link l respectively. ND-POMDP can be thought of as an n-ary DCOP where the variable at each node is an individual agent’s policy. The reward component R l where|l| = 1 can be thought of as a local constraint while the reward component R l where l > 1 corresponds to a non-local constraint in the constraint graph. In the next section, we push this analogy further by taking inspiration from the DBA algorithm [Yokoo and Hirayama, 1996], an algorithm for distributed constraint satisfaction,todevelopanalgorithmforsolvingND-POMDPs. 51 The following proposition shows that given a factored reward function and the assumptions of transitional and observational independence, the resulting value function can be factored as wellintovaluefunctionsforeachoftheedgesintheinteractionhypergraph. Proposition5 Given transitional and observational independence andR(s,a) = P l∈E R l (s l1 , ...,s lk ,s u ,ha l1 ,...,a lk i), V t π (s t ,~ ω t ) = X l∈E V t π l (s t l1 ,...,s t lk ,s t u ,~ ω t l1 ,...~ ω t lk ) (6.1) where V t π (s t ,~ ω) is the expected reward from the state s t and joint observation history ~ ω t for executing policyπ, andV t π l (s t l1 ,...,s t lk ,s t u ,~ ω t l1 ,...~ ω t lk ) is the expected reward for executingπ l accruingfromthecomponentR l . Proof: Proof is by mathematical induction. Proposition holds fort = T−1 (no future reward). Assumeitholdsfort =τ where1≤τ <T−1. Thus, V τ π (s τ ,~ ω τ ) = X l∈E V τ π l (s τ l1 ,...,s τ lk ,s τ u ,~ ω τ l1 ,...~ ω τ lk ) Weintroducethefollowingabbreviations: p t i 4 =P i (s t i ,s t u ,π i (~ ω t i ),s t+1 i )·O i (s t+1 i ,s t+1 u ,π i (~ ω t i ),ω t+1 i ) p t u 4 =P i (s t u ,s t+1 u ) r t l 4 =R l (s t l1 ,...,s t lk ,s t u ,π l1 (~ ω t l1 ),...,π lk (~ ω t lk )) v t l 4 =V t π l (s t l1 ,...,s t lk ,s t u ,~ ω t l1 ,...~ ω t lk ) 52 Weshowthatpropositionholdsfort =τ−1, V τ−1 π (s τ−1 ,~ ω τ−1 ) = X l∈E r τ−1 l + X s τ ,ω τ p τ−1 u p τ−1 1 ...p τ−1 n X l∈E v τ l = X l∈E (r τ−1 l + X s τ l1 ,...,s τ lk ,s τ u ,ω τ l1 ,...,ω τ lk p τ−1 l1 ...p τ−1 lk p τ−1 u v τ l ) = X l∈E v τ−1 l We define local neighborhood utility of agent i as the expected reward for executing joint policyπ accruingduetothehyper-linksthatcontainagent i: V π [N i ] = X s i ,s N i ,su b u (s u )·b N i (s N i )·b i (s i )· X l∈E s.t. i∈l V 0 π l (s l1 ,...,s lk ,s u ,hi,...,hi) (6.2) Proposition6 Localityofinteraction: Thelocalneighborhoodutilitiesofagentiforjointpoli- ciesπ andπ 0 areequal(V π [N i ] =V π 0[N i ])ifπ i =π 0 i andπ N i =π 0 N i . Proofsketch: Equation6.2sumsoverl∈E suchthati∈l,andhenceanychangeofthepolicy of an agent j / ∈ i∪ N i cannot affect V π [N i ]. 
Thus, any such policy assignment, π 0 that has differentpoliciesforonlynon-neighborhoodagents,hasequalvalueas V π [N i ]. Thus, increasing the local neighborhood utility of agent i cannot reduce the local neighbor- hood utility of agent j if j / ∈ N i . Hence, while trying to find best policy for agent i given its neighbors’ policies, we do not need to consider non-neighbors’ policies. This is the property of localityofinteractionthatisusedinlatersections. 53 6.2 LocallyOptimalPolicyGeneration,LID-JESP ThelocallyoptimalpolicygenerationalgorithmcalledLID-JESP(Locallyinteractingdistributed joint equilibrium search for policies) is based on the DBA algorithm [Yokoo and Hirayama, 1996] and JESP [Nair et al., 2003a]. In this algorithm (see Algorithm 5), each agent tries to improveitspolicywithrespecttoitsneighbors’policiesinadistributedmannersimilartoDBA. Initially each agent i starts with a random policy and exchanges its policies with its neighbors (lines 3-4). It then computes its local neighborhood utility (see Equation 6.2) with respect to its current policy and its neighbors’ policies. Agent i then tries to improve upon its current policy by calling function GETVALUE (see Algorithm 7), which returns the local neighborhood utility of agenti’s best response to its neighbors’ policies. This algorithm is described in detail below. Agent i then computes the gain (always≥ 0 because at worst GETVALUE will return the same value asprevVal) that it can make to its local neighborhood utility, and exchanges its gain with itsneighbors(lines8-11). If i’sgainisgreaterthananyofitsneighbors’gain 1 ,ichangesitspolicy (FINDPOLICY) and sends its new policy to all its neighbors. This process of trying to improve the local neighborhood utility is continued until termination. Termination detection is based on usingaterminationcountertocountthenumberofcycleswheregain i remains= 0. Ifitsgainis greaterthanzerotheterminationcounterisreset. Agentithenexchangesitsterminationcounter with its neighbors and set its counter to the minimum of its counter and its neighbors’ counters. Agentiwillterminateifitsterminationcounterbecomesequaltothediameteroftheinteraction hypergraph. 1 Thefunctionargmax j disambiguatesbetweenmultiple j correspondingtothesamemaxvaluebyreturningthe lowest j. 54 Figure 6.2 provides an execution of the LID-JESP algorithm for a small example of three sensor agents connected in a chain. The execution begins with agents A1, A2, and A3 taking policies(randomly)p1,p2andp3respectively. 
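The coordination skeleton just described (gain exchange with neighbors, lowest-index tie-breaking among winners, and the diameter-based termination counter) can be sketched independently of the POMDP machinery before the full pseudocode of Algorithm 5. The fragment below simulates one synchronous LID-JESP cycle centrally, purely for illustration; the gains are assumed to have already been computed by GETVALUE, and none of the names come from the thesis code.

def lid_jesp_cycle(agents, neighbors, gains, term_ctr, diameter):
    # agents: list of ids; neighbors[i]: set of i's neighbors; gains[i]: gain_i this cycle;
    # term_ctr[i]: termination counter carried over from the previous cycle.
    # Step 1: each agent resets or advances its own counter based on its gain.
    own_ctr = {i: 0 if gains[i] > 0 else term_ctr[i] + 1 for i in agents}
    # Step 2: counters are exchanged and lowered to the neighborhood minimum.
    new_ctr = {i: min(own_ctr[j] for j in neighbors[i] | {i}) for i in agents}
    # Step 3: gains are exchanged; in each neighborhood the agent with the largest
    # gain (lowest id on ties) wins and switches to its best-response policy.
    winners = set()
    for i in agents:
        group = neighbors[i] | {i}
        max_gain = max(gains[j] for j in group)
        if max_gain > 0 and i == min(j for j in group if gains[j] == max_gain):
            winners.add(i)
    done = all(new_ctr[i] >= diameter for i in agents)   # every agent may terminate
    return winners, new_ctr, done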
Algorithm5 LID-J ESP(i,ND-POMDP ) 1: ComputeinteractionhypergraphandN i 2: d←diameterofhypergraph,terminationCtr i ← 0 3: π i ←randomlyselectedpolicy,prevVal← 0 4: Exchangeπ i withN i 5: whileterminationCtr i <ddo 6: foralls i ,s Ni ,s u do 7: B 0 i (hs u ,s i ,s Ni ,hii)←b u (s u )·b i (s i )·b Ni (s Ni ) 8: prevVal + ←B 0 i (hs u ,s i ,s Ni ,hii)·EVALUATE(i,s i ,s u ,s Ni ,π i ,π Ni ,hi,hi,0,T) 9: endfor 10: gain i ← GETVALUE(i,B 0 i ,π Ni ,0,T)−prevVal 11: ifgain i > 0thenterminationCtr i ← 0 12: elseterminationCtr i + ← 1 13: Exchangegain i ,terminationCtr i withN i 14: terminationCtr i ←min j∈Ni∪{i} terminationCtr j 15: maxGain← max j∈Ni∪{i} gain j 16: winner← argmax j∈Ni∪{i} gain j 17: ifmaxGain> 0andi =winner then 18: FINDPOLICY(i,b,hi,π Ni ,0,T) 19: Communicateπ i withN i 20: elseifmaxGain> 0then 21: Receiveπ winner fromwinner andupdateπ Ni 22: endif 23: endwhile 24: returnπ i 6.2.1 FindingBestResponse The algorithm, GETVALUE, for computing the best response is a dynamic-programming ap- proach similar to that used in JESP. Here, we define an episode of agent i at time t as e t i = 55 A 1 A 2 A 3 A 1 A 2 A 3 p 1 p 2 p 3 S e n d p o l i c i e s t o n e i g h b o r s C o m p u t e l o c a l u t i l i t i e s A 1 A 2 A 3 C o m p u t e b e s t r e s p o n s e a n d g a i n c u r r = 5 c u r r = 3 c u r r = 7 A 1 A 2 A 3 A g e n t s w i t h g r e a t e r g a i n s t h a n n e i g h b o r s u p d a t e p o l i c i e s c u r r = 5 c u r r = 3 c u r r = 7 b e s t = 9 g a i n = 4 b e s t p o l i c y = p 4 b e s t = 6 g a i n = 3 b e s t p o l i c y = p 5 b e s t = 8 g a i n = 1 b e s t p o l i c y = p 6 A 1 A 2 A 3 p 4 p 2 p 3 . . . S e n d p o l i c i e s t o n e i g h b o r s R a n d o m i n i t i a l i z a t i o n o f p o l i c i e s t o a g e n t s Figure6.1: SampleexecutiontraceofLID-JESPfora3-agentchain 56 Algorithm6 EVALUATE(i,s t i ,s t u ,s t N i ,π i ,π N i ,~ ω t i ,~ ω t N i ,t,T) 1: a i ←π i (~ ω t i ),a Ni ←π Ni (~ ω t Ni ) 2: val← P l∈E R l (s t l1 ,...,s t lk ,s t u ,a l1 ,...,a lk ) 3: ift<T−1then 4: foralls t+1 i ,s t+1 Ni ,s t+1 u do 5: forallω t+1 i ,ω t+1 Ni do 6: val + ← P u (s t u ,s t+1 u ) · P i (s t i ,s t u ,a i ,s t+1 i ) · P Ni (s t Ni ,s t u ,a Ni ,s t+1 Ni ) · O i (s t+1 i ,s t+1 u ,a i ,ω t+1 i ) · O Ni (s t+1 Ni ,s t+1 u ,a Ni ,ω t+1 Ni ) · EVALUATE(i,s t+1 i ,s t+1 u , s t+1 Ni ,π i ,π Ni , ~ ω t i ,ω t+1 i , ~ ω t Ni ,ω t+1 Ni ,t+1,T) 7: endfor 8: endfor 9: endif 10: returnval s t u ,s t i ,s t N i ,~ ω t N i . 
Treating episode as the state, results in a single agent POMDP, where the transitionfunctionandobservationfunctioncanbedefinedas: P 0 (e t i ,a t i ,e t+1 i ) =P u (s t u ,s t+1 u )·P i (s t i ,s t u ,a t i ,s t+1 i )·P N i (s t N i , s t u ,a t N i ,s t+1 N i )·O N i (s t+1 N i ,s t+1 u ,a t N i ,ω t+1 N i ) O 0 (e t+1 i ,a t i ,ω t+1 i ) =O i (s t+1 i ,s t+1 u ,a t i ,ω t+1 i ) Amultiagentbeliefstateforanagentigiventhedistributionovertheinitialstate,b(s)isdefined as: B t i (e t i ) = Pr(s t u ,s t i ,s t N i ,~ ω t N i |~ ω t i ,~ a t−1 i ,b) Theinitialmultiagentbeliefstateforagenti,B 0 i ,canbecomputedfrombasfollows: B 0 i (hs u ,s i ,s N i ,hii)←b u (s u )·b i (s i )·b N i (s N i ) 57 WecannowcomputethevalueofthebestresponsepolicyviaGETVALUEusingthefollowing equation(seeAlgorithm7): V t i (B t i ) = max a i ∈A i V a i ,t i (B t i ) (6.3) Algorithm7 GETVALUE(i,B t i ,π N i ,t,T) 1: ift≥T thenreturn0 2: ifV t i (B t i )isalreadyrecordedthenreturnV t i (B t i ) 3: best←−∞ 4: foralla i ∈A i do 5: value← GETVALUEACTION(i,B t i ,a i ,π Ni ,t,T) 6: recordvalueasV ai,t i (B t i ) 7: ifvalue>bestthenbest←value 8: endfor 9: recordbestasV t i (B t i ) 10: returnbest Thefunction,V a i ,t i ,canbecomputedusingGETVALUEACTION(seeAlgorithm8)asfollows: V a i ,t i (B t i ) = X e t i B t i (e t i ) X l∈E s.t. i∈l R l (s l1 ,...,s lk ,s u ,ha l1 ,...,a lk i) + X ω t+1 i ∈Ω 1 Pr(ω t+1 i |B t i ,a i )·V t+1 i B t+1 i (6.4) B t+1 i is the belief state updated after performing action a i and observing ω t+1 i and is com- putedusingthefunctionUPDATE(seeAlgorithm9). Agenti’spolicyisdeterminedfromitsvalue functionV a i ,t i usingthefunction FINDPOLICY (seeAlgorithm10). 6.2.2 CorrectnessResults Proposition7 When applying LID-JESP, the global utility is strictly increasing until local opti- mumisreached. 58 Algorithm8 GETVALUEACTION(i,B t i ,a i ,π N i ,t,T) 1: value← 0 2: foralle t i = s t u ,s t i ,s t Ni ,~ ω t Ni s.t. B t i (e t i )> 0do 3: a Ni ←π Ni (~ ω t Ni ) 4: reward← P l∈E R l (s t l1 ,...,s t lk ,s t u ,a l1 ,...,a lk ) 5: value + ←B t i (e t i )·reward 6: endfor 7: ift<T−1then 8: forallω t+1 i ∈ Ω i do 9: B t+1 i ← UPDATE(i,B t i ,a i ,ω t+1 i ,π Ni ) 10: prob← 0 11: foralls t u ,s t i ,s t Ni do 12: foralle t+1 i = s t+1 u ,s t+1 i ,s t+1 Ni , ~ ω t Ni ,ω t+1 Ni s.t. B t+1 i (e t+1 i )> 0do 13: a Ni ←π Ni (~ ω t Ni ) 14: prob + ← B t i (e t i ) · P u (s t u ,s t+1 u ) · P i (s t i ,s t u ,a i ,s t+1 i ) · P Ni (s t Ni ,s t u ,a Ni ,s t+1 Ni ) · O i (s t+1 i ,s t+1 u ,a i ,ω t+1 i )·O Ni (s t+1 Ni ,s t+1 u ,a Ni ,ω t+1 Ni ) 15: endfor 16: endfor 17: value + ←prob·GETVALUE(i,B t+1 i ,π Ni ,t+1,T) 18: endfor 19: endif 20: returnvalue Algorithm9 UPDATE(i,B t i ,a i ,ω t+1 i ,π N i ) 1: foralle t+1 i = s t+1 u ,s t+1 i ,s t+1 Ni , ~ ω t Ni ,ω t+1 Ni do 2: B t+1 i (e t+1 i )← 0,a Ni ←π Ni (~ ω t Ni ) 3: foralls t u ,s t i ,s t Ni do 4: B t+1 i (e t+1 i ) + ← B t i (e t i ) · P u (s t u ,s t+1 u ) · P i (s t i ,s t u ,a i ,s t+1 i ) · P Ni (s t Ni ,s t u ,a Ni ,s t+1 Ni ) · O i (s t+1 i ,s t+1 u ,a i ,ω t+1 i )·O Ni (s t+1 Ni ,s t+1 u ,a Ni ,ω t+1 Ni ) 5: endfor 6: endfor 7: normalizeB t+1 i 8: returnB t+1 i Algorithm10 FINDPOLICY(i,B t i , ~ ω i t ,π N i ,t,T) 1: a ∗ i ← argmax ai V ai,t i (B t i ),π i (~ ω i t )←a ∗ i 2: ift<T−1then 3: forallω t+1 i ∈ Ω i do 4: B t+1 i ← UPDATE(i,B t i ,a ∗ i ,ω t+1 i ,π Ni ) 5: FINDPOLICY(i,B t+1 i , ~ ω i t ,ω t+1 i ,π Ni ,t+1,T) 6: endfor 7: endif 8: return 59 ProofsketchByconstruction,onlynon-neighboringagentscanmodifytheirpoliciesinthesame cycle. 
Agenti chooses to change its policy if it can improve upon its local neighborhood utility V π [N i ]. From Equation 6.2, increasingV π [N i ] results in an increase in global utility. By locality ofinteraction,ifanagentj / ∈i∪N i changesitspolicytoimproveitslocalneighborhoodutility, it will not affect V π [N i ] but will increase global utility. Thus with each cycle global utility is strictlyincreasinguntillocaloptimumisreached. Proposition8 LID-JESP will terminate within d (= diameter) cycles iff agent are in a local optimum. Proof: Assume that in cycle c, agent i terminates (terminationCtr i = d) but agents are not in a local optimum. In cycle c− d, there must be at least one agent j who can improve, i.e., gain j > 0 (otherwise, agents are in a local optimum in cycle c−d and no agent can improve later). Letd ij refertotheshortestpathdistancebetweenagentsiandj. Then,incyclec−d+d ij (≤ c), terminationCtr i must have been set to 0. However, terminationCtr i increases by at most one in each cycle. Thus, in cycle c, terminationCtr i ≤ d− d ij . If d ij ≥ 1, in cycle c, terminationCtr i < d. Also, if d ij = 0, i.e., in cycle c− d, gain i > 0, then in cycle c− d + 1, terminationCtr i = 0, thus, in cycle c, terminationCtr i < d. In either case, terminationCtr i 6= d. Bycontradiction, if LID-JESP terminates then agents must be in a local optimum. In the reverse direction, if agents reach a local optimum, gain i = 0 henceforth. Thus, terminationCtr i is never reset to 0 and is incremented by 1 in every cycle. Hence, after d cycles,terminationCtr i =dandagentsterminate. 60 Proposition 7 shows that the agents will eventually reach a local optimum and Proposition 8 shows that the LID-JESP will terminate if and only if agents are in a local optimum. Thus, LID-JESPwillcorrectlyfindalocallyoptimumandwillterminate. 6.3 StochasticLID-JESP(SLID-JESP) One of the criticisms of LID-JESP is that if an agent is the winner (maximum reward among its neighbors), then its precludes its neighbors from changing their policies too in that cycle. In addition, it will sometimes prevent its neighbor’s neighbors (and may be their neighbors and so on) from changing their policies in that cycle even if though they are actually independent. For example, consider the execution trace from Figure 6.2, where gain A1 > gain A2 > gain A3 . In this situation, only A1 changes its policy in that cycle. However, A3 should have been able to changed its policy too because it does not depend on A1. This realization that LID-JESP allows limited parallelism led us to come up with a stochastic version of LID-JESP, SLID-JESP (Algorithm11). The key difference between LID-JESP and SLID-JESP is that in SLID-JESP is that if an agenti can improve its local neighborhood utility (i.e. gain i > 0), it will do so with probability p, a predefined threshold probability (see lines 14-17). Note, that unlike LID-JESP, an agent’s decisiontochangeitspolicydoesnotdependonitsneighbors’gainmessages. However,westill agentscontinuetocommunicatetheirgainmessagestotheirneighborstodeterminewhetherthe algorithmhasterminated. Since there has been no change to the termination detection approach and the way gain is computed,thefollowingpropositionsfromLID-JESPholdforSLID-JESPaswell. 61 Proposition9 When applying SLID-JESP, the global utility is strictly increasing until local op- timumisreached. Proposition10 SLID-JESP will terminate within d (= diameter) cycles iff agent are in a local optimum. 
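As noted above, the termination detection and gain computation are unchanged in SLID-JESP; the only mechanical difference, switching with a fixed threshold probability p whenever the local gain is positive, can be sketched by adapting the earlier LID-JESP cycle fragment. This is again an illustrative sketch under the same assumptions, not the thesis implementation; p = 0.9 is the threshold value used later in the experiments.

import random

def slid_jesp_cycle(agents, neighbors, gains, term_ctr, diameter, p=0.9):
    # Termination bookkeeping is unchanged from the LID-JESP sketch.
    own_ctr = {i: 0 if gains[i] > 0 else term_ctr[i] + 1 for i in agents}
    new_ctr = {i: min(own_ctr[j] for j in neighbors[i] | {i}) for i in agents}
    # Key difference: any agent with positive gain may switch, each independently
    # with probability p, regardless of its neighbors' gains.
    movers = {i for i in agents if gains[i] > 0 and random.random() < p}
    done = all(new_ctr[i] >= diameter for i in agents)
    return movers, new_ctr, done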
Proposition 7 shows that the agents will eventually reach a local optimum and Proposition 8 shows that the SLID-JESP will terminate if and only if agents are in a local optimum. Thus, SLID-JESPwillcorrectlyfindalocallyoptimumandwillterminate. Algorithm11 SLID-J ESP(i,ND-POMDP ,p) 0:{lines1-4sameaLID-JESP } 5: whileterminationCtr i <ddo{lines6-13sameasLID-JESP } 14: if RANDOM()<pandgain i > 0then 15: FINDPOLICY(i,b,hi,π Ni ,0,T) 16: Communicateπ i withN i 17: endif 18: Receiveπ j fromallj∈N i thatchangedtheirpolicies 19: endwhile 20: returnπ i 6.4 Hyper-link-basedDecomposition(HLD) Proposition5andEquation6.2indicatethatthevaluefunctionandthelocalneighborhoodutility function can both be decomposed into components for each hyper-link in the interaction hy- pergraph. We developed the Hyper-link-based Decomposition (HLD) technique as a means to exploitthisdecomposability,inordertospeedupthealgorithms EVALUATE and GETVALUE. We introduce the following definitions to ease the description of hyper-link-based decompo- sition. Let E i ={l|l∈ E∧i∈ l} be the subset of hyper-links that contain agent i. Note that N i =∪ l∈E i l−{i}, i.e. the neighborhood ofi contains all the agents inE i except agenti itself. 62 WedefineS l =× j∈l S j referstothestatesofagentsinlinkl. SimilarlywedefineA l =× j∈l A j , Ω l =× j∈l Ω l ,P l (s l ,a l ,s 0 l ) = Q j∈l P j (s j ,a j ,s 0 j ),andO l (s l ,a l ,ω l ) = Q j∈l O j (s j ,a j ,ω j ). Fur- ther,wedefineb l = Q j∈l b j (s j ),whereb j isthedistributionoveragentj’sinitialstate. Usingtheabovedefinitions,wecanrewriteEquation6.2as V π [N i ] = X l∈E i X s l ,su b u (s u )·b l (s l )·V 0 π l (s l ,s u ,hi,...,hi) (6.5) EVALUATE-HLD(Algorithm13)isusedtocomputethelocalneighborhoodutilityofahyperlink l(innerloopofEquation6.8). Whenthejointpolicyiscompletelyspecified,theexpectedreward from each hyper-link can be computed independently (as in E VALUATE-HLD). However, when trying to find the optimal best response, we cannot optimize on each hyper-link separately since inanybeliefstate,anagentcanperformonlyoneaction. Theoptimalbestresponseinanybelief stateistheactionthatmaximizesthesumoftheexpectedrewardsoneachofitshyper-links. The algorithm, GETVALUE-HLD, for computing the best response is a modification of the GETVALUE function that attempts to exploit the decomposability of the value function with- out violating the constraint that the same action must be applied to all the hyper-links in a particular belief state. Here, we define an episode of agent i for a hyper-link l at time t as e t il = D s t u ,s t l ,~ ω t l−{i} E . Treating episode as the state, the transition function and observation functioncanbedefinedas: 63 P 0 il (e t il ,a t i ,e t+1 il ) =P u (s t u ,s t+1 u )·P l (s t l ,s t u ,a t l ,s t+1 l ) ·O l−{i} (s t+1 l−{i} ,s t+1 u ,a t l−{i} ,ω t+1 l−{i} ) O 0 il (e t+1 i ,a t i ,ω t+1 i ) =O i (s t+1 i ,s t+1 u ,a t i ,ω t+1 i ) wherea t l−{i} =π l−{i} (~ ω t l−{i} ). Wecannowdefinethemultiagentbeliefstateforanagentiwith respecttohyper-link l∈E i as: B t il (e t il ) = Pr(s t u ,s t l ,~ ω t l−{i} |~ ω t i ,~ a t−1 i ,b) Weredefinethemultiagentbeliefstateofagentias: B t i (e t i ) ={B t il (e t il )|l∈E i } Wecannowcomputethevalueofthebestresponsepolicyusingthefollowingequation: V t i (B t i ) = max a i ∈A i X l∈E i V a i ,t il (B t i ) (6.6) Thevalueofthebestresponsepolicyforthelinkl canbecomputedasfollows: V t il (B t i ) =V a ∗ i ,t il (B t i ) (6.7) 64 where a ∗ i = argmax a i ∈A i P l∈E i V a i ,t il (B t i ) . 
The function GETVALUE-HLD (see Algo- rithm14)computesthetermV t il (B t i foralllinksl∈E i . Thefunction,V a i ,t il ,canbecomputedasfollows: V a i ,t il (B t i ) = X e t il B t il (e t il )·R l (s l ,s u ,a l ) + X ω t+1 i ∈Ω 1 Pr(ω t+1 i |B t i ,a i )·V t+1 il B t+1 i (6.8) ThefunctionGETVALUEACTION-HLD(seeAlgorithm15)computestheabovevalueforalllinks l. B t+1 i isthebeliefstateupdatedafterperformingactiona i andobservingω t+1 i andiscomputed using the function UPDATE (see Algorithm 16). Agent i’s policy is determined from its value functionV a i ,t i usingthefunction FINDPOLICY (seeAlgorithm17). ThereasonwhyHLDwillreducetheruntimeforfindingthebestresponseisthattheoptimal value function is computed for each linkly separately. This reduction in runtime is borne out by ourcomplexityanalysisandexperimentalresultsaswell. 6.5 ComplexityResults The complexity of the finding the optimal best response for agenti for JESP (using the dynamic programming[Nairetal.,2003a])isO(|S| 2 ·|A i | T · Q j∈{1...n} |Ω j | T ). Notethatthecomplexity depends on the number world states|S| and the number of possible observation histories of all theagents. Incontrast,thecomplexityoffindingtheoptimalbestresponseforiforLID-JESP(andSLID- JESP)isO( Q l∈E i [|S u ×S l | 2 ·|A i | T ·|Ω l | T ]). Itshouldbenotedthatinthiscase,thecomplexity 65 Algorithm12 LID-J ESP-H LD(i,ND-POMDP ) 0:{lines1-4sameaLID-JESP } 5: whileterminationCtr i <ddo 6: foralls u do 7: foralll∈E i do 8: foralls l ∈S l do 9: B 0 il (hs u ,s l ,hii)←b u (s u )·b l (s l ) 10: prevVal + ←B 0 il (hs u ,s l ,hii)·EVALUATE-HLD (l,s l ,s u ,π l ,hi,0,T) 11: endfor 12: endfor 13: endfor 14: gain i ← GETVALUE-HLD (i,B 0 i ,π Ni ,0,T)−prevVal 15: ifgain i > 0thenterminationCtr i ← 0 16: elseterminationCtr i + ← 1 17: Exchangegain i ,terminationCtr i withN i 18: terminationCtr i ←min j∈Ni∪{i} terminationCtr j 19: maxGain← max j∈Ni∪{i} gain j 20: winner← argmax j∈Ni∪{i} gain j 21: ifmaxGain> 0andi =winner then 22: FINDPOLICY-HLD (i,B 0 i ,hi,π Ni ,0,T) 23: Communicateπ i withN i 24: elseifmaxGain> 0then 25: Receiveπ winner fromwinner andupdateπ Ni 26: endif 27: endwhile 28: returnπ i Algorithm13 EVALUATE-HLD (l,s t l ,s t u ,π l ,~ ω t l ,t,T) 1: a l ←π l (~ ω t l ) 2: val←R l (s t l ,s t u ,a l ) 3: ift<T−1then 4: foralls t+1 l ,s t+1 u do 5: forallω t+1 l do 6: val + ← P u (s t u ,s t+1 u ) · P l (s t l ,s t u ,a l ,s t+1 l ) · O l (s t+1 l ,s t+1 u ,a l ,ω t+1 l ) · EVALUATE-HLD l,s t+1 l ,s t+1 u ,π l , ~ ω t l ,ω t+1 l ,t+1,T 7: endfor 8: endfor 9: endif 10: returnval 66 Algorithm14 GETVALUE-HLD (i,B t i ,π N i ,t,T) 1: ift≥T thenreturn0 2: ifV t il (B t i )isalreadyrecorded∀l∈E i thenreturn[V t il (B t i )] l∈Ei 3: bestSum←−∞ 4: foralla i ∈A i do 5: value← GETVALUEACTION-HLD (i,B t i ,a i ,π Ni ,t,T) 6: valueSum← P l∈Ei value[l] 7: recordvalueSumasV ai,t i (B t i ) 8: ifvalueSum>bestSumthenbest←value,bestSum←valueSum 9: endfor 10: foralll∈E i do 11: recordbest[l]asV t il (B t i ) 12: endfor 13: returnbest Algorithm15 GETVALUEACTION-HLD (i,B t i ,a i ,π N i ,t,T) 1: foralll∈E i do 2: value[l]← 0 3: foralle t il = D s t u ,s t l ,~ ω t l−{i} E s.t. B t il (e t il )> 0do 4: a l−{i} ←π l−{i} (~ ω t l−{i} ) 5: value[l] + ←B t il (e t il )·R l (s t l ,s t u ,a l ) 6: endfor 7: endfor 8: ift<T−1then 9: forallω t+1 i ∈ Ω i do 10: foralll∈E i do 11: B t+1 il ← UPDATE-HLD (i,l,B t il ,a i ,ω t+1 i ,π l−{i} ) 12: prob[l]← 0 13: foralls t u ,s t l do 14: foralle t+1 il = D s t+1 u ,s t+1 l , D ~ ω t l−{i} ,ω t+1 l−{i} EE s.t. 
B t+1 il (e t+1 il )> 0do 15: a l−{i} ←π l−{i} (~ ω t l−{i} ) 16: prob[l] + ←B t il (e t il )·P u (s t u ,s t+1 u )·P l (s t l ,s t u ,a l ,s t+1 l )·O l (s t+1 l ,s t+1 u ,a l ,ω t+1 l ) 17: endfor 18: endfor 19: endfor 20: futureValue←GETVALUE-HLD (i,B t+1 i ,π Ni ,t+1,T) 21: foralll∈E i do 22: value[l] + ←prob[l]·futureValue[l] 23: endfor 24: endfor 25: endif 26: returnvalue 67 Algorithm16 UPDATE-HLD (i,l,B t il ,a i ,ω t+1 i ,π l−{i} ) 1: foralle t+1 il = D s t+1 u ,s t+1 l , D ~ ω t l−{i} ,ω t+1 l−{i} EE do 2: B t+1 il (e t+1 il )← 0 3: a l−{i} ←π l−{i} (~ ω t l−{i} ) 4: foralls t u ,s t l do 5: B t+1 il (e t+1 il ) + ←B t il (e t il )·P u (s t u ,s t+1 u )·P l (s t l ,s t u ,a l ,s t+1 l )·O l (s t+1 l ,s t+1 u ,a l ,ω t+1 l ) 6: endfor 7: endfor 8: normalizeB t+1 il 9: returnB t+1 il Algorithm17 FINDPOLICY-HLD (i,B t i , ~ ω i t ,π N i ,t,T) 1: a ∗ i ← argmax ai V ai,t i (B t i ) 2: π i (~ ω i t )←a ∗ i 3: ift<T−1then 4: forallω t+1 i ∈ Ω i do 5: foralll∈E i do 6: B t+1 il ← UPDATE-HLD (i,l,B t il ,a ∗ i ,ω t+1 i ,π l−{i} ) 7: endfor 8: FINDPOLICY-HLD (i,B t+1 i , ~ ω i t ,ω t+1 i ,π Ni ,t+1,T) 9: endfor 10: endif 11: return 68 depends on the number of states|S u |,|S i | and|S N i | and not on the number of states of any non-neighboring agent. Similarly, the complexity depends on only the number of observation historiesofianditsneighborsandnotthoseofalltheagents. Thishighlightsthereasonforwhy LID-JESPandSLID-JESParesuperiortoJESPforproblemswherelocalityofinteractioncanbe exploited. ThecomplexityforcomputingoptimalbestresponseforiinLID-JESPwithHLD(andSLID- JESPwithHLD)isO(Σ l∈E i [|S u ×S l | 2 ·|A i | T ·|Ω l | T ]). Keydifferenceofnotecomparedtothe complexity expression for LID-JESP, is the replacement of product, Q with a sum,Σ. Thus, as numberofneighborsincreases,differencebetweenthetwoapproachesincreases. SinceJESPisacentralizedalgorithm,thebestresponsefunctionisperformedforeachagent serially. LID-JESP and SLID-JESP (with and without HLD), in contrast, are distributed algo- rithms, where each agent can be run in parallel on a different processor, further alleviating the largecomplexityoffindingtheoptimalbestresponse. 6.6 LocallyInteracting-GlobalOptimalAlgorithm(GOA) GOAliketheabovealgorithmsalsoborrowsfromaDCOPalgorithm. Asopposedtothelocally optimal DCOP algorithms used in LID-JESP and SLID-JESP, GOA borrows from an exact al- gorithm,DPOP(DistributedPseudotreeOptimizationProcedure)andatpresentworksonlywith binary interactions, i.e. edges linking two nodes. We start with a description of GOA applied to tree-structuredinteractiongraphs,andthendiscussitsapplicationtographswithcycles. DPOP dictates the functioning of message passing between the agents. The first phase is the UTILpropagation,wheretheutilitymessages,inthiscasevaluesofpolicies,arepassedupfrom 69 theleavestotheroot. Valueforapolicyatanagentisdefinedasthesumofbestresponsevalues from its children and the joint policy reward associated with the parent policy. Thus, given a fixed policy for a parent node, GOA requires an agent to iterate through all its policies, finding thebestresponsepolicyandreturningthevaluetotheparent—wheretofindthebestpolicy,an agentrequiresitschildrentoreturntheirbestresponsestoeachofitspolicies. Anagentstoresthe sum of best response values from its children, to avoid recalculation at the children. This UTIL propagationprocessisrepeatedateachlevelinthetree,untiltherootexhaustsallitspolicies. 
In the second phase of VALUE propagation, where the optimal policies are passed down from the roottilltheleaves. GOAtakesadvantageofthelocalinteractionsintheinteractiongraph,bypruningoutunnec- essary joint policy evaluations (associated with nodes not connected directly in the tree). Since the interaction graph captures all the reward interactions among agents and as this algorithm it- erates through all the joint policy evaluations possible with the interaction graph, this algorithm yieldsanoptimalsolution. Algorithm 18 provides the pseudo code for the global optimal algorithm at each agent. This algorithmisinvokedwiththeprocedurecall GO-J OINTPOLICY(root,hi,no). Lines8-21represent the UTIL propagation, while Lines 1-4 and 22-24 represent the VALUE propagation phase of DPOP.Line8iteratesthroughallthepossiblepolicies,whereaslines20-21worktowardscalcu- lating the best policy over this entire set of policies using the value of the policies calculated in Lines9-19. Line21storesthevaluesofbestresponsepoliciesobtainedfromthechildren. Lines 22-24startstheterminationofthealgorithmafterallthepoliciesareexhaustedattheroot. Lines 1-4propagatetheterminationmessagetolowerlevelsinthetree,whilerecordingthebestpolicy, π ∗ i . 70 Algorithm18 GO-J OINTPOLICY(i,π j ,terminate) 1: ifterminate =yesthen 2: π ∗ i ←bestResponse{π j } 3: forallk∈children i do 4: GO-J OINTPOLICY(k,π ∗ i ,yes) 5: endfor 6: return 7: endif 8: Π i ←enumerateallpossiblepolicies 9: bestPolicyVal←-∞,j← parent(i) 10: forallπ i ∈ Π i do 11: jointPolicyVal← 0,childVal← 0 12: ifi6= rootthen 13: foralls i ,s j ,s u do 14: jointPolicyVal + ←b i (s i )·b Ni (s Ni )·b u (s u )·EVALUATE(i,s i ,s u ,s j ,π i ,π j ,hi,hi,0,T) 15: endfor 16: endif 17: ifbestChildValMap{π i }6=null then 18: jointPolicyVal + ←bestChildValMap{π i } 19: else 20: forallk∈children i do 21: childVal + ← GO-J OINTPOLICY(k,π i ,no) 22: endfor 23: bestChildValMap{π i }←childVal 24: jointPolicyVal + ←childVal 25: endif 26: ifjointPolicyVal >bestPolicyValthen 27: bestPolicyVal←jointPolicyVal,π ∗ i ←π i 28: endif 29: endfor 30: ifi =rootthen 31: forallk∈children i do 32: GO-J OINTPOLICY(k,π ∗ i ,yes) 33: endfor 34: endif 35: ifi6=rootthenbestResponse{π j } =π ∗ i 36: returnbestPolicyVal 71 By using cycle-cutset algorithms [Dechter, 2003], GOA can be applied to interaction graphs containing cycles. These algorithms are used to identify a cycle-cutset, i.e., a subset of agents, whose deletion makes the remaining interaction graph acyclic. After identifying the cutset, joint policiesforthecutsetagentsareenumerated,andthenforeachofthem,wefindthebestpolicies ofremainingagentsusingGOA. 6.7 ExperimentalResults In this section we provide two sets of experiments. The first set of experiments provide perfor- mance comparisons of the locally optimal algorithm, LID-JESP to globally optimal algorithm, GOA and other benchmark algorithms (JESP, LID-JESP no network). Second set of experi- ments provides comparisons of LID-JESP and its enhancements (SLID-JESP, LID-JESP+HLD, SLID-JESP+HLD).AlltheexperimentswereperformedonthesensordomainexplainedinSec- tion2.1.2. In the first set of experiments, we consider three different sensor network configurations of increasingcomplexity. Inthefollowingtext,Loc1-1,Loc2-1andLoc2-2arethesameregionsas inFigure2.2. Thefirstconfigurationisachainwith3agents(sensors1-3). Heretarget1iseither absent or in Loc1-1 and target2 is either absent or in Loc2-1 (4 unaffectable states). Each agent can perform either turnOff, scanEast or scanWest. 
Agents receive an observation, targetPresent or targetAbsent, based on the unaffectable state and its last action. The second configuration is a 4-agent chain (sensors 1-4). Here, target2 has an additional possible location, Loc2-2, giving rise to 6 unaffectable states. The number of individual actions and observations is unchanged. The third configuration is the 5-agent P-configuration (named for the P shape of the sensor net) and is identical to Figure 2.2. Here, target1 can have two additional locations, Loc1-2 and Loc1-3, giving rise to 12 unaffectable states. We add a new action called scanVert for each agent to scan North and South. For each of these scenarios, we ran the LID-JESP algorithm. Our first benchmark, JESP, uses a centralized policy generator to find a locally optimal joint policy and does not consider the network structure of the interaction, while our second benchmark (LID-JESP-no-nw) is LID-JESP with a fully connected interaction graph. For the 3- and 4-agent chains, we also ran the GOA algorithm.

Figure 6.2 compares the performance of the various algorithms for the 3- and 4-agent chains and the 5-agent P-configuration. Graphs (a), (b) and (c) show the run time in seconds on a log scale on the Y-axis for increasing finite horizon T on the X-axis. Run times for LID-JESP, JESP and LID-JESP-no-nw are averaged over 5 runs, each run with a different randomly chosen starting policy. For a particular run, all algorithms use the same starting policies. All three locally optimal algorithms show significant improvement over GOA in terms of run time, with LID-JESP outperforming LID-JESP-no-nw and JESP by an order of magnitude (for high T) by exploiting locality of interaction. In graph (d), the values obtained using GOA for the 3- and 4-agent cases (T = 3) are compared to the ones obtained using LID-JESP over 5 runs (each with a different starting policy) for T = 3. In this bar graph, the first bar represents the value obtained using GOA, while the other bars correspond to LID-JESP. This graph emphasizes the fact that with random restarts, LID-JESP converges to a higher local optimum — such restarts are afforded given that GOA is orders of magnitude slower compared to LID-JESP.

Table 6.1 helps to better explain the reasons for the speedup of LID-JESP over JESP and LID-JESP-no-nw. LID-JESP allows more than one (non-neighboring) agent to change its policy within a cycle (W), LID-JESP-no-nw allows exactly one agent to change its policy in a cycle, and in JESP there are several cycles where no agent changes its policy. This allows LID-JESP to converge in fewer cycles (C) than LID-JESP-no-nw. Although LID-JESP takes fewer cycles than JESP to converge, it required more calls to GETVALUE (G). However, each such call is cheaper owing to the locality of interaction. LID-JESP will outperform JESP even more on multi-processor machines owing to its distributedness.

[Figure 6.2: Run times (a, b, c) and value (d): (a) 3-agent chain, (b) 4-agent chain, (c) 5-agent P-configuration.]

In the second set of experiments, we performed a comparison of LID-JESP with the enhancements – SLID-JESP, LID-JESP with HLD and SLID-JESP with HLD – in terms of value and runtime, for some complex network structures (2x3 and cross) as well. We used four different topologies of sensors, shown in Figure 6.3, each with a different target movement scenario. With two targets moving in the environment, possible positions of targets are increased as the network

Config.   Algorithm        C     G     W
4-chain   LID-JESP         3.4   13.6  1.412
          LID-JESP-no-nw   4.8   19.2  1
          JESP             7.8   7.8   0.436
5-P       LID-JESP         4.2   21    1.19
          LID-JESP-no-nw   5.8   29    1
          JESP             10.6  10.6  0.472

Table 6.1: Reasons for speedup. C: no. of cycles, G: no.
of GETVALUE calls,W:no. ofwinners percycle,forT=2. grows and the number of unaffected states are increased accordingly. Figure 6.3(a) shows the example where there are 3 sensors arranged in a chain and the number of possible positions for each target is 1. In the cross topology, as in Figure 6.3(b), we considered 5 sensors with one sensor in the center surrounded by 4 sensors and 2 locations are possible for each target. In the example in Figure 6.3(c) with 5 sensors arranged in P shape, target1 and target2 can be at 2 and 3 locations respectively, thus leading to a total of 12 states. There are total 20 states for six sen- sors in example of Figure 6.3(d) with 4 and 3 locations for target1 and target2, respectively. As we assumed earlier, each target is independent of each other. Thus, total number of unaffected states are ( Q targets (numberofpossiblepositionsofeachtarget+1)). Due to the exponentially increasing runtime, the size of the network and time horizon is limited but is still significantly larger than those which have previously been demonstrated in distributed POMDPs. All exper- iments are started at random initial policies and averaged over five runs for each algorithm. We chose 0.9 as the threshold probability (p) for SLID-JESP which empirically gave a good result formostcases. Figure 6.4 shows performance improvement of SLID-JESP and HLD in terms of runtime. In Figure 6.4, X-axis shows the time horizon T, while Y-axis shows the runtime in milliseconds on a logarithmic scale. In all cases of Figure 6.4, the line of SLID-JESP is lower than that of 75 LID-JESP with and without HLD where the difference of two grows as the network grows. As in Figure 6.4(c) and Figure 6.4(d) the difference in runtime between LID-JESP and SLID-JESP is bigger than that in smaller network examples. The result that SLID-JESP always takes less timethanLID-JESPisbecauseinSLID-JESP,moreagentschangetheirpolicyinonecycle,and hence SLID-JESP tends to converge to a local optimum quickly. As for HLD, all the graphs shows that the use of Hyper-link-based decomposition clearly improved LID-JESP and SLID- JESP in terms of runtime. The improvement is more visible when the number of neighbors increaseswhereHLDtakesadvantageofdecomposition. Forexample,inFigure6.4(b),byusing HLD the runtime reduced by more than an order of magnitude for T = 4. In cross topology, thecomputationfortheagentin thecenter whichhas 4neighbors isa mainbottleneck andHLD significantlyreducesthecomputationbydecomposition. Figure 6.5 shows the values of each algorithm for different topologies. In Figure 6.5, X-axis showsthetimehorizonT,whileY-axisshowsthevalueofteamreward. Thereareonlytwolines in each graph because the values of the algorithm with HLD and without HLD are always the same because HLD only exploits independence between neighbors and doesn’t affect the value oftheresultingjointpolicy. TherewardofLID-JESPislargerthanthatofSLID-JESPinthreeout ofthefourtopologiesthatwetried. ThissuggestsSLID-JESP’sgreedyapproachtochangingthe joint policy causes it to converge to lower local optima than LID-JESP in some cases. However, note that in Figure 6.5(a) SLID-JESP converges to a higher local optima than LID-JESP. This suggeststhatnetworktopologygreatlyimpactsthechoiceofwhethertouseLID-JESPorSLID- JESP. Furthermore, the results of SLID-JESP vary in value for different threshold probabilities. However, there is a consistent trend that the result is better when the threshold probability (p) is 76 large. 
This trend means that in our domain, it is generally better to change policy if there is a visiblegain. (a) 1x3 (b) Cross (c) 5-P (d) 2x3 Figure6.3: Differentsensornetconfigurations. 77 Figure6.4: Runtime(ms)for(a)1x3,(b)cross,(c)5-Pand(d)2x3. Figure6.5: Valuefor(a)1x3,(b)cross,(c)5-Pand(d)2x3. 78 Chapter7: Directvalueapproximationandexploitinginteraction structure(DistributedPOMDPs) Whereaspreviouschapterillustratedexploitationofstructureforefficientcomputationofapprox- imate solutions, this chapter exploits structure for exact algorithms. In addition, I also present a direct value approximation enhancement for Distributed POMDPs. Thus, the technique in- troduced in this chapter not only provides guarantees on solution quality, but also exploits the networkstructuretocomputesolutionsefficientlyforanetworkofagents. Inparticular,thischapterintroducestheexactalgorithmSPIDER(SearchforPoliciesInDis- tributedEnviRonments), beforepresenting the approximation technique algorithm. SPIDERisa branch and bound heuristic search technique that uses a MDP-based heuristic function to search for an optimal joint policy. This MDP-based heuristic approximates the distributed POMDP as a single agent centralized MDP and computes the value corresponding to the optimal policy of this MDP. In a similar vein to the structure exploitation presented in Section 6.6, SPIDER also exploits network structure of agents by organizing agents into a DFS tree (Depth First Search) or pseudo tree [Petcu and Faltings, 2005] and exploiting independence in the different branches of the tree (while constructing joint policies). Furthermore, the MDP-based heuristic function is alsocomputedefficientlybyutilizingtheinteractionstructure. 79 I then provide three enhancements to improve the efficiency of the basic SPIDER algorithm whileprovidingguaranteesonthequalityofthesolution. Thefirstenhancementisanexactone, basedontheideaofinitiallyperformingbranchandboundsearchonabstractpolicies(represent- ing a group of complete policies) and then extending to the complete policies. Second enhance- ment bounds the search approximately given a parameter that provides the tolerable expected value difference from the optimal solution. The third enhancement is again based on bounding thesearchapproximately,howeverwithatoleranceparameterthatisprovidedasapercentageof optimal. We experimented with the sensor network domain presented in Section 2.1.2, while the model used to represent the domain is the Network Distributed POMDP model (presented in Section6.1). Inourexperimentalresults,weshowthatSPIDERdominatesanexistingglobalop- timalapproach,GOApresentedinSection6.6. GOAistheonlyknownglobaloptimalalgorithm that works with more than two agents. Furthermore, we demonstrate that the idea of abstraction improves the performance of SPIDER significantly while providing optimal solutions and also thatbyutilizingtheapproximationenhancements,SPIDERprovidessignificantimprovementsin run-timeperformancewhilenotlosingsignificantlyonquality. 7.1 SearchforPoliciesInDistributedEnviRonments(SPIDER) As mentioned in Section 6.1, an ND-POMDP can be treated as a DCOP, where the goal is to compute a joint policy that maximizes the overall joint reward. The bruteforce technique for computinganoptimalpolicywouldbetoexaminetheexpectedvaluesforallpossiblejointpoli- cies. The key idea in SPIDER is to avoid computation of expected values for the entire space 80 of joint policies, by utilizing upperbounds on the expected values of policies and the interaction structureoftheagents. 
Akin to some of the algorithms for DCOP [Modi et al., 2003a; Petcu and Faltings, 2005], SPIDER has a pre-processing step that constructs a DFS tree corresponding to the given interaction structure. We employ the Maximum Constrained Node (MCN) heuristic used in ADOPT [Modi et al., 2003a]; however, other heuristics (such as the MLSP heuristic from [Maheswaran et al., 2004]) can also be employed. The MCN heuristic tries to place agents with a larger number of constraints at the top of the tree. This tree governs how the search for the optimal joint policy proceeds in SPIDER. The algorithms presented in this paper are easily extendable to hyper-trees; however, for expository purposes, we assume a binary tree.

SPIDER is an algorithm for centralized planning and distributed execution in distributed POMDPs. Though the explanation is presented from the perspective of individual agents, the algorithm is centralized. In this paper, we employ the following notation to denote policies and expected values of joint policies:

Ancestors(i) ⇒ agents on the path from i to the root (not including i).
Tree(i) ⇒ agents in the sub-tree (not including i) for which i is the root.
π_root+ ⇒ joint policy of all agents.
π_i+ ⇒ joint policy of all agents in the sub-tree for which i is the root.
π_i− ⇒ joint policy of agents that are ancestors to agents in the sub-tree for which i is the root.
π_i ⇒ policy of the ith agent.
v̂[π_i, π_i−] ⇒ upper bound on the expected value for π_i+ given π_i and policies of ancestor agents, i.e. π_i−.
v̂_j[π_i, π_i−] ⇒ upper bound on the expected value for π_i+ from the jth child.
v[π_i, π_i−] ⇒ expected value for π_i given policies of ancestor agents, i.e. π_i−.
v[π_i+, π_i−] ⇒ expected value for π_i+ given policies of ancestor agents π_i−.
v_j[π_i+, π_i−] ⇒ expected value for π_i+ from the jth child.

[Figure 7.1: Execution of SPIDER, an example. Shown are the agent tree for the example sensor network and the SPIDER search tree (levels 1-3), whose nodes hold horizon-2 policy trees over the actions East, West and Off, annotated with heuristic or actual expected values (e.g., ∞, 250, 232, 234); nodes with upper bounds below 234 are marked as pruned.]

7.1.1 Outline of SPIDER

SPIDER is based on the idea of branch and bound search, where the nodes in the search tree represent the joint policies, π_root+. Figure 7.1 shows an example search tree for the SPIDER algorithm, using an example of the three-agent chain. We create a tree from this chain, with the middle agent as the root of the tree. Note that in our example figure each agent is assigned a policy with T=2. Each rounded rectangle (search tree node) indicates a partial/complete joint policy and a rectangle indicates an agent. The heuristic or actual expected value for a joint policy is indicated in the top right corner of the rounded rectangle. If the number is italicized and underlined, it implies that the actual expected value of the joint policy is provided. SPIDER begins with no policy assigned to any of the agents (shown in level 1 of the search tree). Level 2 of the search tree indicates that the joint policies are sorted based on upper bounds computed for the root agent's policies. Level 3 contains a node with a complete joint policy (a policy assigned to each of the agents). The expected value for this joint policy is used to prune out the nodes in level 2 (the ones with upper bounds below 234).

When creating policies for each non-leaf agent i, SPIDER potentially performs two steps:

1. Obtaining upper bounds and sorting: In this step, agent i computes upper bounds on the expected values, v̂[π_i, π_i−], of the joint policies π_i+ corresponding to each of its policies π_i and fixed ancestor policies. An MDP-based heuristic is used to compute these upper bounds on the expected values.
Detailed description about this MDP heuristic and other possible heuristics is provided in Section 7.1.2. All policies of agenti, Π i are then sorted based on these upper bounds (also referred to as heuristic values henceforth) in descending order. Exploration of these policies (in step 2 below) are performed in this descending order. As indicatedinthelevel2ofthesearchtreeofFigure7.1,allthejointpoliciesaresortedbased on the heuristic values, indicated in the top right corner of each joint policy. The intuition behindsortingandthenexploringpoliciesindescendingorderofupperbounds,isthatthe policieswithhigherupperboundscouldyieldjointpolicieswithhigherexpectedvalues. 2. Exploration and Pruning: Exploration here implies computing the best response joint pol- icy π i+,∗ corresponding to fixed ancestor policies of agent i, π i− . This is performed by iterating through all policies of agent i i.e. Π i and then for each policy, computing and summing two quantities: (i) compute the best response for each of i’s children (obtained 83 by performing steps 1 and 2 at each of the child nodes); (ii) compute the expected value obtained byi for fixed policies of ancestors. Thus, exploration of a policyπ i yields actual expectedvalueofajointpolicy,π i+ representedasv[π i+ ,π i− ]. Thepolicywiththehighest expectedvalueisthebestresponsepolicy. Pruningreferstotheprocessofavoidingexploringpolicies(orcomputingexpectedvalues) at agent i by using the maximum expected value, v max [π i+ ,π i− ] encountered until this juncture. Henceforth, this v max [π i+ ,π i− ] will be referred to as threshold. A policy, π i neednotbeexplorediftheupperboundforthatpolicy, ˆ v[π i ,π i− ]islessthanthethreshold. This is because the best joint policy that can be obtained from that policy will have an expectedvaluethatislessthantheexpectedvalueofthecurrentbestjointpolicy. Ontheotherhand,whenconsideringaleafagent,SPIDERcomputesthebestresponsepolicy (and consequently its expected value) corresponding to fixed policies of its ancestors,π i− . This is accomplished by computing expected values for each of the policies (corresponding to fixed policies of ancestors) and selecting the policy with the highest expected value. Going back to Figure7.1,SPIDERassignsbestresponsepoliciestoleafagentsatlevel3. Thepolicyfortheleft leaf agent is to perform action East at each time step in the policy, while the policy for the right leaf agent is to perform ”Off” at each time step. This best response policies from the leaf agents yieldanactualexpectedvalueof234forthecompletejointpolicy. Algorithm 19 provides the pseudo code for SPIDER. This algorithm outputs the best joint policy, π i+,∗ (with an expected value greater than threshold) for the agents in the sub-tree with agent i as the root. Lines 3-8 compute the best response policy of a leaf agent i by iterating throughallthepolicies(line4)andfindingthepolicywiththehighestexpectedvalue(lines5-8). 
84 Algorithm19 SPIDER(i,π i− ,threshold) 1: π i+,∗ ←null 2: Π i ← GET-A LL-P OLICIES (horizon,A i ,Ω i ) 3: if IS-L EAF(i)then 4: forallπ i ∈ Π i do 5: v[π i ,π i− ]← JOINT-R EWARD (π i ,π i− ) 6: ifv[π i ,π i− ]>thresholdthen 7: π i+,∗ ←π i 8: threshold←v[π i ,π i− ] 9: endif 10: endfor 11: else 12: children← CHILDREN (i) 13: ˆ Π i ← UPPER-B OUND-S ORT(i,Π i ,π i− ) 14: forallπ i ∈ ˆ Π i do 15: ˜ π i+ ←π i 16: if ˆ v[π i ,π i− ]<thresholdthen 17: Gotoline12 18: endif 19: forallj∈childrendo 20: jThres←threshold−v[π i ,π i− ]−Σ k∈children,k6=j ˆ v k [π i ,π i− ] 21: π j+,∗ ← SPIDER(j,π i kπ i− ,jThres) 22: ˜ π i+ ← ˜ π i+ kπ j+,∗ 23: ˆ v j [π i ,π i− ]←v[π j+,∗ ,π i kπ i− ] 24: endfor 25: ifv[˜ π i+ ,π i− ]>thresholdthen 26: threshold←v[˜ π i+ ,π i− ] 27: π i+,∗ ← ˜ π i+ 28: endif 29: endfor 30: endif 31: returnπ i+,∗ Algorithm20 UPPER-B OUND-S ORT(i,Π i ,π i− ) 1: children← CHILDREN (i) 2: ˆ Π i ←null /*Storesthesortedlist*/ 3: forallπ i ∈ Π i do 4: ˆ v[π i ,π i− ]← JOINT-R EWARD (π i ,π i− ) 5: forallj∈childrendo 6: ˆ v j [π i ,π i− ]← UPPER-B OUND(j,π i kπ i− ) 7: ˆ v[π i ,π i− ] + ← ˆ v j [π i ,π i− ] 8: endfor 9: ˆ Π i ← INSERT-I NTO-S ORTED (π i , ˆ Π i ) 10: endfor 11: return ˆ Π i 85 Lines 9-23 computes the best response joint policy for agents in the sub-tree with i as the root. Sortingofpolicies(indescendingorder)basedonheuristicpoliciesisdoneonline11. Exploration of a policy i.e. computing best response joint policy corresponding to fixed an- cestor policies is done in lines 12-23. This includes computation of best joint policies for each of the child sub-trees (lines 16-23). This computation in turn involves distributing the threshold (line 17), recursively calling the SPIDER algorithm (line 18) for each of the children and main- taining the best expected value, joint policy (lines 21-23). Pruning of policies is performed in lines14-15bycomparingtheupperboundontheexpectedvalueagainstthe threshold. Algorithm 20 provides the algorithm for sorting policies based on the upper bounds on the expected values of joint policies. Expected value for an agent i consists of two parts: value obtainedfromancestorsandvalueobtainedfromitschildren. Line4computesthevalueobtained from (fixed policies of) ancestors of the agent (by using the JOINT-R EWARD function), while lines 5-7 compute the heuristic value (upper-bounds) from the children. Thus the sum of these two parts yields an upper bound on the expected value for agenti, and line 8 of the algorithm is usedforsortingthepoliciesbasedontheseupperbounds. 7.1.2 MDPbasedheuristicfunction The job of the heuristic function is to quickly provide an upper bound on the expected value ob- tainablefromthesub-treeforwhich iistheroot. Thesub-treeofagentsisadistributedPOMDP in itself and the idea here is to construct a centralized MDP corresponding to the (sub-tree) dis- tributedPOMDPandobtaintheexpectedvalueoftheoptimalpolicyforthiscentralizedMDP.To reiteratethisintermsoftheagentsinDFStreeinteractionstructure,weassumefullobservability 86 for the agents in theTree(i) and for fixed policies of the agents in the set{Ancestors(i)∪i}, wecomputethejointvalue ˆ v[π i+ ,π i− ]. Weusethefollowingnotationforpresentingtheequationsforcomputingupperbounds/heuristic values(foragentsiandk): LetE i− denotethesetoflinksbetweenagentsinAncestors(i)andTree(i)∪iandE i+ denote the set of links between agents in Tree(i)∪i. Also, if l∈ E i− , then l 1 denotes the agent in Ancestors(i)andl 2 denotestheagentinTree(i). 
o^t_k ≜ O_k(s^{t+1}_k, s^{t+1}_u, π_k(ω̃^t_k), ω^{t+1}_k)    (7.1)

p^t_k ≜ P_k(s^t_k, s^t_u, π_k(ω̃^t_k), s^{t+1}_k) · o^t_k

p̂^t_k ≜ p^t_k, if k ∈ {Ancestors(i) ∪ i}
p̂^t_k ≜ P_k(s^t_k, s^t_u, π_k(ω̃^t_k), s^{t+1}_k), if k ∈ Tree(i)    (7.2)

p^t_u ≜ P(s^t_u, s^{t+1}_u)

s^t_l = ⟨s^t_{l1}, s^t_{l2}, s^t_u⟩,    ω^t_l = ⟨ω^t_{l1}, ω^t_{l2}⟩

r^t_l ≜ R_l(s^t_l, π_{l1}(ω̃^t_{l1}), π_{l2}(ω̃^t_{l2}))

If l ∈ E_i−:  r̂^t_l ≜ max_{a_{l2}} R_l(s^t_l, π_{l1}(ω̃^t_{l1}), a_{l2})
If l ∈ E_i+:  r̂^t_l ≜ max_{a_{l1}, a_{l2}} R_l(s^t_l, a_{l1}, a_{l2})

v^t_l ≜ V^t_{π_l}(s^t_l, s^t_u, ω̃^t_{l1}, ω̃^t_{l2})

The value function for an agent i executing the joint policy π_i+ at time η−1 is provided by the equation:

V^{η−1}_{π_i+}(s^{η−1}, ω̃^{η−1}) = Σ_{l ∈ E_i−} v^{η−1}_l + Σ_{l ∈ E_i+} v^{η−1}_l,
where v^{η−1}_l = r^{η−1}_l + Σ_{ω^η_l, s^η} p^{η−1}_{l1} p^{η−1}_{l2} p^{η−1}_u v^η_l    (7.3)

Algorithm 21 UPPER-BOUND(j, π_j−)
1: val ← 0
2: for all s^0_l do
3:   val +← startingBelief[s^0_l] · UPPER-BOUND-TIME(s^0_l, j, {}, ⟨⟩)
4: end for
5: return val

Algorithm 22 UPPER-BOUND-TIME(s^t_l, j, π_l1, ω̃^t_l1)
1: val ← GET-REWARD(s^t_l, a_l1, a_l2)
2: if t < π_i.horizon − 1 then
3:   for all s^{t+1}_l, ω^{t+1}_l1 do
4:     futVal ← p^t_u · p̂^t_l1 · p̂^t_l2
5:     futVal ∗← UPPER-BOUND-TIME(s^{t+1}_l, j, π_l1, ω̃^t_l1 ∥ ω^{t+1}_l1)
6:     val +← futVal
7:   end for
8: end if
9: return val

The upper bound on the expected value for a link is computed by modifying equation 7.3 to reflect the full observability assumption. This involves removing the observational probability term for agents in Tree(i) and maximizing the future value v̂^η_l over the actions of those agents (in Tree(i)). Thus, the equations for the computation of the upper bound are as follows:

If l ∈ E_i−:  v̂^{η−1}_l = r̂^{η−1}_l + max_{a_{l2}} Σ_{ω^η_{l1}, s^η_l} p̂^{η−1}_{l1} p̂^{η−1}_{l2} p^{η−1}_u v̂^η_l
If l ∈ E_i+:  v̂^{η−1}_l = r̂^{η−1}_l + max_{a_{l1}, a_{l2}} Σ_{s^η_l} p̂^{η−1}_{l1} p̂^{η−1}_{l2} p^{η−1}_u v̂^η_l

Algorithm 21 and Algorithm 22 provide the algorithm for computing this upper bound for child j of agent i using the equations above. Algorithm 21 maximizes over all possible combinations of actions for agents in Tree(j) ∪ j; the value for a combination iterates over all the links associated with an agent, while Algorithm 22 computes the upper bound on a link, l.

[Figure 7.2: Example of abstraction for (a) HBA (Horizon Based Abstraction) and (b) NBA (Node Based Abstraction). The figure shows policy trees over the actions East (scan east), West (scan west) and Off (switch off) and the observations TP (Target Present) and TA (Target Absent), arranged from abstraction level 1 (and level 2) down to complete policies.]

7.1.3 Abstraction

In SPIDER, the exploration/pruning phase can only begin after the heuristic (or upper bound) computation and sorting for the policies has finished. We provide an approach for interleaving the exploration/pruning phase with the heuristic computation and sorting phase. This possibly circumvents the exploration of a group of policies based on the heuristic computation for one abstract policy.
The type of abstraction used dictates the amount of interleaving of exploration/pruning phase with heuristic computation phase. The important steps in this technique are defining the abstract policy and how heuristic values are computated for the abstract policies. In this paper, weproposetwotypesofabstraction: 89 Algorithm23 SPIDER-A BS(i,π i− ,threshold) 1: π i+,∗ ←null 2: Π i ← GET-P OLICIES (<>,1) 3: if IS-L EAF(i)then 4: forallπ i ∈ Π i do 5: absHeuristic← GET-A BS-H EURISTIC (π i ,π i− ) 6: absHeuristic ∗ ← (timeHorizon−π i .horizon) 7: ifπ i .horizon =timeHorizonandπ i .absNodes = 0then 8: v[π i ,π i− ]← JOINT-R EWARD (π i ,π i− ) 9: ifv[π i ,π i− ]>thresholdthen 10: π i+,∗ ←π i ;threshold←v[π i ,π i− ] 11: endif 12: elseifv[π i ,π i− ]+absHeuristic>thresholdthen 13: absNodes←π i .absNodes+1 14: ˆ Π i ← GET-P OLICIES (π i ,π i .horizon+1,absNodes) 15: /*InsertpoliciesinthebeginningofΠ i insortedorder*/ 16: Π i + ← INSERT-S ORTED-P OLICIES ( ˆ Π i ) 17: endif 18: REMOVE(π i ) 19: endfor 20: else 21: children← CHILDREN (i) 22: Π i ← UPPER-B OUND-S ORT(i,Π i ,π i− ) 23: forallπ i ∈ Π i do 24: ˜ π i+ ←π i 25: absHeuristic← GET-A BS-H EURISTIC (π i ,π i− ) 26: absHeuristic ∗ ← (timeHorizon−π i .horizon) 27: ifπ i .horizon =timeHorizonandπ i .absNodes = 0then 28: if ˆ v[π i ,π i− ]<thresholdandπ i .absNodes = 0then 29: Gotoline19 30: endif 31: forallj∈childrendo 32: jThres←threshold−v[π i ,π i− ]−Σ k∈children,k6=j ˆ v k [π i ,π i− ] 33: π j+,∗ ← SPIDER(j,π i kπ i− ,jThres) 34: ˜ π i+ ← ˜ π i+ kπ j+,∗ ; ˆ v j [π i ,π i− ]←v[π j+,∗ ,π i kπ i− ] 35: endfor 36: ifv[˜ π i+ ,π i− ]>thresholdthen 37: threshold←v[˜ π i+ ,π i− ];π i+,∗ ← ˜ π i+ 38: endif 39: elseif ˆ v[π i+ ,π i− ]+absHeuristic>thresholdthen 40: absNodes←π i .absNodes+1 41: ˆ Π i ← GET-P OLICIES (π i ,π i .horizon,absNodes) 42: /*InsertpoliciesinthebeginningofΠ i insortedorder*/ 43: Π i + ← INSERT-S ORTED-P OLICIES ( ˆ Π i ) 44: endif 45: endfor 46: REMOVE(π i ) 47: endif 48: returnπ i+,∗ 90 1. Horizon Based Abstraction (HBA): In this type of abstraction, the abstract policy is de- fined as a shorter horizon policy. It represents a group of longer horizon policies that have the same actions as the abstract policy for times less than or equal to the horizon of the abstractpolicy. ThisisillustratedinFigure7.2(a). ForHBA,therearetwopartstoheuristiccomputation: (a) Computing the upper bound for the horizon of the abstract policy. This is same as the heuristic computation defined by the GET-H EURISTIC() algorithm for SPIDER, howeverwithashortertimehorizon(horizonoftheabstractpolicy). (b) Computing the maximum possible reward that can be accumulated in one time step and multiplying it by the number of time steps to time horizon. This maximum pos- sible reward in turn is obtained by iterating through all the actions of all the agents involved(agentsinthesub-treewith iastheroot)andcomputingthemaximumjoint rewardforanyjointaction. Thesumof(a)and(b)aboveistheheuristicvalueforaHBAabstractpolicy. 2. Node Based Abstraction (NBA): Abstraction of this type is performed by not associat- ing actions to certain nodes of the policy tree, i.e. incomplete policies. Unlike abstraction (a) above, this implies multiple levels of abstraction. This is illustrated in Figure 7.2(b), where there is a T=1 policy that is an abstract policy for T=2 policies that do not contain an action for the case where TP is observed. These incomplete T=2 policies are further abstractionsforT=2completepolicies. 
Increasedlevelsofabstractionleadstofastercom- putation of a complete joint policy, π root+ and also to shorter heuristic computation and 91 exploration/pruningphases. ForNBA,theheuristiccomputationissimilartothatofanor- mal policy except in cases where there is no action associated with certain policy nodes. Incaseswheresuch nodes are encountered, the immediate reward is taken asR max (max- imumrewardpossibleforanyaction). We combine both the abstraction techniques mentioned above into one technique, SPIDER- ABS.Algorithm23providesthealgorithmforthisabstractiontechnique. Forcomputingoptimal joint policy with SPIDER-ABS, a non-leaf agent i initially examines all T=1 policies and sorts them based on abstract policy heuristic computations. This is performed on lines 2, 19 of Al- gorithm 23. These T=1 policies are then explored in descending order of heuristic values and ones that have heuristic values less than the threshold are pruned (lines 25-26). Exploration in SPIDER-ABS has the same definition as in SPIDER if the policy being explored has a horizon ofpolicycomputationwhichisequaltotheactualtimehorizonandifallthenodesofthepolicy haveanactionassociatedwiththem(lines27-30). However,ifthoseconditionsarenotmet,then itissubstitutedbyagroupofpoliciesthatitrepresents(referredtoasextensionhenceforth)(lines 33-35). Before substituting the abstract policy, this group of policies are again sorted based on theheuristicvalues(line37). Atthisjuncture,ifallthesubstitutedpolicieshavehorizonofpolicy computationequaltothetimehorizonandallthenodesofthesepolicieshaveactionsassociated with them, then the exploration/pruning phase akin to the one in SPIDER ensues (line 24). In case of partial policies, further extension of policies occurs. Similar type of abstraction based computationofbestresponseisadoptedatleafagentsinSPIDER-ABS(lines3-16). 92 7.1.4 ValueApproXimation(VAX) Inthissection,wepresentanapproximateenhancementtoSPIDERcalledVAX.Theinputtothis techniqueisanapproximationparameter,whichdeterminesthedifferencebetweentheoptimal solution and the approximate solution. This approximation parameter is used at each agent for pruning out joint policies. The pruning mechanism in SPIDER and SPIDER-Abs dictates that a joint policy be pruned only if the threshold is exactly greater than the heuristic value. However, the idea in this technique is to prune out joint policies even if threshold plus the approximation parameter,isgreaterthantheheuristicvalue. In the example of Figure 7.1, if the heuristic value for the second joint policy (or second searchtreenode)inlevel2were238insteadof232,thenthatpolicycouldnotbebeprunedusing SPIDER or SPIDER-Abs. However, in VAX with an approximation parameter of 5, the joint policyinconsiderationwouldalsobepruned. Thisisbecausethethreshold(234)atthatjuncture plus the approximation parameter (5), i.e. 239 would have been greater than the heuristic value for that joint policy (238). It can be noted from the example (just discussed) that this kind of pruning can lead to fewer explorations and hence lead to an improvement in the overall run- time performance. However, this can entail a sacrifice in the quality of the solution because this technique can prune out a candidate optimal solution. A bound on the error introduced by this approximatealgorithmasafunctionof,isprovidedbyProposition13. 7.1.5 PercentageApproXimation(PAX) In this section, we present the second approximation enhancement over SPIDER called PAX. 
Input to this technique is a parameter, δ that represents the percentage of the optimal solution qualitythatistolerable. Outputofthistechniqueisapolicywithanexpectedvaluethatisatleast 93 δ 100 oftheoptimalsolutionquality. AswithVAX,thisparameterisalsousedateachagentinthe pruning phase. A policy is pruned if δ 100 of its heuristic value is not greater than the threshold. Again in Figure 7.1, if the heuristic value for the second search tree node in level 2 were 238 instead of 232, then PAX with an input parameter of 98% would be able to prune that search tree node (since 98 100 ∗ 238 < 234). Like in VAX, this leads to fewer explorations and hence an improvement in run-time performance, while potentially leading to a loss in quality of the solution. As shown in Proposition 14, this loss is again bounded and the bound is δ% of the optimalsolutionquality. 7.1.6 TheoreticalResults Proposition11 HeuristicprovidedusingthecentralizedMDPheuristicisadmissible. Proof. Forthevalueprovidedbytheheuristictobeadmissible,itshouldbeanoverestimate oftheexpectedvalueforajointpolicy. Thus,weneedtoshowthat: Forl∈E i+ ∪E i− : ˆ v t l ≥v t l . Weusemathematicalinductiononttoprovethis. Base case: t = T− 1. Irrespective of whether l ∈ E i− or l ∈ E i+ , ˆ r t l is computed by maximizing over all actions of the agents in the sub-tree for which i is the root, while r t l is computedforfixedpoliciesofthesameagents. Hence, ˆ r t l ≥r t l andalso ˆ v t l ≥v t l . Assumption: Proposition holds fort = η, where1≤ η < T−1. Thus, ˆ v η l ≥ v η l , forl∈ E i− orl∈E i+ . Wenowhavetoprovethatthepropositionholdsfort =η−1i.e. ˆ v η−1 l ≥v η−1 l . 94 We initially prove that the above holds for l∈ E i− and similar reasoning can be adopted to prove for l∈ E i+ . The heuristic value function for l∈ E i− is provided by the following equation: ˆ v η−1 l =ˆ r η−1 l +max a l 2 X ω η l 1 ,s η l ˆ p η−1 l1 ˆ p η−1 l2 p η−1 u ˆ v η l RewritingtheRHSandusingEqn7.2 =ˆ r η−1 l +max a l 2 X ω η l 1 ,s η l p η−1 u p η−1 l1 ˆ p η−1 l2 ˆ v η l =ˆ r η−1 l + X ω η l 1 ,s η l p η−1 u p η−1 l1 max a l 2 ˆ p η−1 l2 ˆ v η l Sincemax a l 2 ˆ p η−1 l2 ˆ v η l ≥ P ω l 2 o η−1 l2 ˆ p η−1 l2 ˆ v η l andp η−1 l2 =o η−1 l2 ˆ p η−1 l2 ≥ˆ r η−1 l + X ω η l 1 ,s η l p η−1 u p η−1 l1 X ω l 2 p η−1 l2 ˆ v η l Since ˆ v η l ≥v η l (fromtheassumption) ≥ˆ r η−1 l + X ω η l 1 ,s η l p η−1 u p η−1 l1 X ω l 2 p η−1 l2 v η l ≥ˆ r η−1 l + X (ω η l 1 ,s η l ) X ω l 2 p η−1 u p η−1 l1 p η−1 l2 v η l ≥r η−1 l + X (ω η l ,s η l ) p η−1 u p η−1 l1 p η−1 l2 v η l ≥v η−1 l 95 Proposition12 SPIDERprovidesanoptimalsolution. Proof. SPIDER examines all possible joint policies given the interaction structure of the agents. The only exception being when a joint policy is pruned based on the heuristic value. Thus,aslongasacandidateoptimalpolicyisnotpruned,SPIDERwillreturnanoptimalpolicy. As proved in Proposition 11, the expected value for a joint policy is always an upper bound. Hencewhenajointpolicyispruned,itcannotbeanoptimalsolution. Proposition13 Error bound on the solution quality for VAX (implemented over SPIDER-ABS) with an approximation parameter of is given byρ, whereρ indicates the number of leaf nodes intheDFSagenttree. Proof. WeprovethispropositionusingmathematicalinductiononthedepthoftheDFStree. Base case: depth = 1 (i.e. one node). Best response is computed by iterating through all policies, Π k . A policy,π k is pruned if ˆ v[π k ,π k− ] < threshold + . Thus the best response policy computed by VAX would be at most away from the optimal best response. 
Hence the proposition holds for the base case.

Assumption: The proposition holds for a tree of depth d, where 1 ≤ depth ≤ d. We now have to prove that the proposition holds for a tree of depth d+1.

Without loss of generality, let us assume that the root node of this tree has k children. Each of these children is of depth ≤ d, and hence from the assumption above the error introduced in the kth child is ρ_k ε, where ρ_k is the number of leaf nodes in the kth child of the root. Therefore, ρ = Σ_k ρ_k, where ρ is the number of leaf nodes in the tree.

Hence, with VAX the pruning condition at the root agent will be v̂[π_k, π_k−] < (threshold − Σ_k ρ_k ε) + ε. However, with SPIDER-ABS the pruning condition would have been v̂[π_k, π_k−] < threshold. As long as Σ_k ρ_k ≥ 1, the root agent in VAX does not prune a policy that was not pruned in SPIDER-ABS. Hence the root agent does not introduce any error in the solution. All the error is thus introduced by the children of the root agent, which is Σ_k ρ_k ε = (Σ_k ρ_k) ε = ρε. Hence proved.

Proposition 14 For PAX (implemented over SPIDER-ABS) with an input parameter of δ, the solution quality is at least (δ/100)·v[π_root+,*], where v[π_root+,*] denotes the optimal solution quality.

Proof. We prove this proposition using mathematical induction on the depth of the DFS tree.

Base case: depth = 1 (i.e. one node). The best response is computed by iterating through all policies, Π_k. A policy π_k is pruned if (δ/100)·v̂[π_k, π_k−] < threshold. Thus the best response policy computed by PAX would be at least δ/100 times the optimal best response. Hence the proposition holds for the base case.

Assumption: The proposition holds for a tree of depth d, where 1 ≤ depth ≤ d. We now have to prove that the proposition holds for a tree of depth d+1.

Without loss of generality, let us assume that the root node of this tree has k children. Each of these children is of depth ≤ d, and hence from the assumption above the solution quality in the kth child is at least (δ/100)·v[π_k+,*, π_k−] for PAX.

With SPIDER-ABS the pruning condition would have been: v̂[π_root, π_root−] < Σ_k v[π_k+,*, π_k−]. With PAX, the pruning condition at the root agent will be (δ/100)·v̂[π_root, π_root−] < Σ_k (δ/100)·v[π_k+,*, π_k−], i.e. v̂[π_root, π_root−] < Σ_k v[π_k+,*, π_k−]. Since the pruning condition at the root agent in PAX is the same as the one in SPIDER-ABS, a joint policy that is not pruned in SPIDER-ABS will not be pruned in PAX. Hence there is no error introduced at the root agent and all the error is introduced in the children. Thus the overall solution quality is at least δ/100 of the optimal solution. Hence proved.

[Figure 7.3: Sensor network configurations: the 3-chain, 4-chain, 4-star and 5-star topologies, with the target locations Loc1-1, Loc1-2, Loc2-1 and Loc2-2 marked.]

7.2 Experimental Results

[Figure 7.4: Comparison of GOA, SPIDER, SPIDER-Abs and VAX for T = 3 on (a) runtime and (b) solution quality; (c) time to solution for PAX with varying percentage to optimal for T = 4; (d) time to solution for VAX with varying epsilon for T = 4.]

All our experiments were conducted on the sensor network domain provided in Section 2.1.2. The network configurations presented in Figure 7.3 were used in these experiments. The algorithms that we experimented with as part of this paper include GOA, SPIDER, SPIDER-ABS, PAX and VAX. We compare against GOA because it is the only global optimal algorithm that exploits network structure and considers more than two agents. We performed two sets of experiments: (i) first, we compared the run-time performance of the algorithms mentioned above, and (ii) second, we further experimented with PAX and VAX to study the tradeoff between run-time and solution quality.
Experimentswereterminatediftheyexceededthetimelimitof10000seconds 1 . Figure 7.4(a) provides the run-time comparisons between the optimal algorithms GOA, SPI- DER, SPIDER-Abs and the approximate algorithm, VAX with varying epsilons. X-axis denotes the type of sensor network configuration used, while Y-axis indicates the amount of time taken (on a log scale) to compute the optimal solution. The time horizon of policy computation for all the configurations was 3. For each configuration (3-chain, 4-chain, 4-star and 5-star), there are five bars indicating the time taken by GOA, SPIDER, SPIDER-Abs and VAX with 2 differ- ent epsilons. GOA did not terminate within the time limit for 4-star and 5-star configurations. SPIDER-Abs dominated the other two optimal algorithms for all the configurations. For in- stance, for the 3-chain configuration, SPIDER-ABS provides 230-fold speedup over GOA and 2-fold speedup over SPIDER and for the 4-chain configuration it provides 58-fold speedup over GOAand2-foldspeedupoverSPIDER.Thetwoapproximationapproaches,VAX(with of10) and PAX (withδ of 80) provided a further improvement in performance over SPIDER-Abs. For instance,forthe5-starconfigurationVAXprovidesa15-foldspeedupandPAXprovidesa8-fold speedupoverSPIDER-Abs. 1 Machinespecsforallexperiments: IntelXeon3.6GHZprocessor,2GBRAM 99 Figures7.4(b)providesacomparisonofthesolutionqualityobtainedusingthedifferentalgo- rithms for the problems tested in Figure 7.4(a). X-axis denotes the sensor network configuration whileY-axisindicatesthesolutionquality. SinceGOA,SPIDER,andSPIDER-Absareallglobal optimal algorithms, the solution quality is the same for all those algorithms. With both the ap- proximations, we obtained a solution quality that was close to the optimal solution quality. In 3-chain and 4-star configurations, it is remarkable that both PAX and VAX obtained almost the samequalityastheglobaloptimalalgorithms. Forotherconfigurationsaswell,thelossinquality waslessthan15%oftheoptimalsolutionquality. Figure 7.4(c) provides the time to solution with PAX (for varying epsilons). X-axis denotes theapproximationparameter,δ (percentagetooptimal)used,whileY-axisdenotesthetimetaken to compute the solution (on a log-scale). The time horizon for all the configurations was 4. As δ was decreased from 70 to 30, the time to solution decreased drastically. For instance, in the 3- chaincasetherewasatotalspeedupof170-foldwhenthe δwaschangedfrom70to30. However, thevarianceinactualsolutionqualitywaszero. Figure 7.4(d) provides the time to solution for all the configurations with VAX (for varying epsilons). X-axis denotes the approximation parameter, used, while Y-axis denotes the time taken to compute the solution (on a log-scale). The time horizon for all the configurations was 4. As was increased, the time to solution decreased drastically. For instance, in the 4-star case there was a total speedup of 73-fold when the was changed from 60 to 140. Again, the actual solutionqualitydidnotchangewithvaryingepsilon. 100 Chapter8: ExploitingstructureindynamicsforDistributed POMDPs In this chapter, I propose a novel technique that exploits structure in dynamics for distributed POMDPs, while planning over a continuous initial belief space. 
This algorithm builds on the technique proposed in Chapter 3 for exploiting structure in single agent POMDPs and on the “Joint Equilibrium-based Search for Policies” (JESP) algorithm [Nair et al., 2003a] which finds locally optimal policies from an unrestricted set of possible policies, with a finite planning hori- zon. Not only does this technique exploits structure to improve efficiency, it also addresses a majorshortcominginexistingresearchinDistributedPOMDPs: planningforacontinuousstart- ingbeliefregion. Inparticular,whereastheoriginalJESPperformediterativebest-responsecomputationsfrom a single starting belief state, the combined algorithm exploits the single-agent POMDP tech- niques to perform best-response computations over continuous regions of the belief space. The new algorithm, CS-JESP (Continuous Space JESP) allows for generation of a piece-wise linear and convex value function over continuous belief spaces for the optimal policy of one agent in thedistributedPOMDP,givenfixedpoliciesofotheragents—thefamiliarcup-likeshapeofthis 101 value function [Kaelbling et al., 1998]. The cup-shape implies that when dealing with a contin- uousstartingbeliefspace,agentsusuallyhavemorethanonepolicy,eachofwhichdominatesin adifferentregionofthebeliefspace. Thisregion-wisedominancehighlightsthethreeimportantchallengesaddressedinCS-JESP. First, CS-JESP requires computation of best response policies for one agent, given that different policies dominate over different regions of the belief space for the second agent. To efficiently computebestresponsepoliciesperbeliefregion,itiscriticaltoemploytechniquesthatpruneout unreachable future belief states. To that end, we illustrate application of the belief bound tech- niques[Varakanthametal.,2005] for improved efficiency. Second, owing to these bestresponse calculations for different belief regions, often the policies for contiguous belief regions can be identical. To address this inefficiency, we implement a merging method that combines such ad- jacent regions with equivalent policies. Third, to improve the performance of the algorithm, we implement region-based convergence, i.e. once policies have converged for a region, these are notconsideredforsubsequentbestresponsecomputations. 8.1 ContinuousSpaceJESP(CS-JESP) OneofthekeyinsightsinCS-JESPisthesynergisticinteractionbetweentheJESPalgorithmfor distributed POMDPs and the DB-GIP technique of single agent POMDPs. We illustrate these interactions with a two-agent example in Section 8.1.1, and present key ideas in Section 8.1.2. Further, we describe the algorithm for n agents in Section 8.1.3 and some theoretical guarantees inSection8.1.4. 102 Unlikepreviouswork,ourworkfocusesoncontinuousstartingbeliefspacesandthusrequires modificationsforpolicyrepresentationthatistraditionallyusedindistributedPOMDPliterature. In particular, because different policies may be dominant over different regions in the belief space,weintroducethenotionofa general policy. A general policy,Π i foranagentiisdefined as a mapping from belief regions to policies. Π i is represented as the set{(B 1 0 ,π 1 i ),...(B m 0 ,π m i )}, where B 1 0 ,..,B m 0 are belief regions in the starting belief space B 0 and π 1 i ,..π m i are the policies thatwillbeexecutedstartingfromthoseregions. Henceforthwereferπ k i asspecializedpolicies. Thus, given a starting belief point b k 0 ∈ B k 0 , agent i on receiving observations ω 1 i ,...,ω t i will performtheactionπ k i (~ ω t i )where~ ω t i = ω 1 i ,...,ω t i . 
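Concretely, a general policy can be stored as a list of (belief region, specialized policy) pairs, where a belief region is a box given by a minimum and maximum probability per belief dimension. The sketch below is illustrative only; the class and field names are not from the thesis.

from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

@dataclass
class PolicyTree:
    """A specialized policy: an action, and one subtree per possible observation."""
    action: str
    children: Dict[str, "PolicyTree"] = field(default_factory=dict)

@dataclass
class GeneralPolicy:
    """Pi_i = {(B_0^1, pi_i^1), ..., (B_0^m, pi_i^m)}: belief regions paired with policies."""
    entries: List[Tuple[List[Tuple[float, float]], PolicyTree]]  # ([(min, max) per dimension], policy)

    def policy_for(self, b0: List[float]) -> Optional[PolicyTree]:
        """Return the specialized policy whose region contains the starting belief point b0."""
        for bounds, pi in self.entries:
            if all(lo <= p <= hi for p, (lo, hi) in zip(b0, bounds)):
                return pi
        return None

Execution then follows the semantics above: starting from b0, the agent looks up its specialized policy once and thereafter simply walks down the policy tree along the observations it receives.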
Π =hΠ 1 ,...,Π n irefers to thejoint general policyoftheteamofagents. 8.1.1 IllustrativeExample Foreaseofexplanation,initiallythealgorithmisexplainedwithtwoagents,Agent1andAgent2. However, as we will show in Section 8.1.3, this algorithm is easily extendable to n agents. Ini- tially, each agent selects a random general policy, Π i , which will be a singleton set,{(B 0 ,π i )} , i.e. a single specialized policy,π i , over the entire starting belief space,B 0 . While for expository purposesthisexampledescribespolicycomputationsbyindividualagents,inrealityinCS-JESP these computations are performed by a centralized policy generator. CS-JESP begins when one agent,sayAgent2,fixesitsgeneralpolicyΠ 2 ,andotheragent,Agent1,findsthebestresponsefor Agent2’s general policy. Fixing Agent2’s specialized policy, π 2 , Agent1 creates a single agent POMDPwithanextendedstatespace,asexplainedinSection2.3.3. Agent1solvesthisPOMDP using DB-GIP technique, explained in Section 3, with starting belief space as B 0 , and obtains a new general policy Π 1 , containing a set{(B 1 0 ,π 1 1 ),...(B m 0 ,π m 1 )}. Each B k 0 ∈ B 0 is a belief 103 regionandisrepresentedbyaminimumandmaximumvalueforeachofthe|S|−1dimensions that represent the belief space. Now, Agent1 freezes its general policy, Π 1 , and Agent2 solves a POMDP for each π j 1 ∈ Π 1 , with the starting belief region as B j . Thus, Agent2 solves m POMDPs, and obtains a new general policyΠ 2 . At this point, bordering regions inΠ 2 that have identical policies are merged. This process continues until the solutions converge, and a local optimalisreached,i.e. noagentcanimproveitsvaluevectorsinanybeliefregion. Figure 8.1 illustrates the working of the algorithm with the multiagent tiger scenario (Sec- tion 8.1.1) for time horizon, T = 2. Each tree in Figure 8.1 represents a specialized policy. All the trees on the left side of the figure are part of the general policies of Agent1, and trees on the rightarepartofthegeneralpoliciesofAgent2. Forinstance,attheendofiteration3,bothagents containtwospecializedpoliciesintheirgeneralpolicy. Withineachtree(specializedpolicy),the letterinsideeachnodeindicatestheaction,andedgesindicatetheobservationreceived. Thus,for the highlighted tree in the top left corner, the root node indicates the Listen(L) action, and upon eitherobservingTLorTR,thespecializedpolicyrequirestheagenttotakeaListen(L)action. In this example, belief region over which a specialized policy dominates, consists of two numbers, namely the minimum and maximum belief probability of the state SL. These belief regions are indicated below each specialized policy in the figure. For instance, for the highlighted tree it is [0,1],butforothertrees,regionssuchas[0.18,0.85]areshown. The algorithm begins with both agents randomly selecting a specialized policy for the entire belief space [0,1]. In iteration 1, Agent2 fixes its general policy, and Agent1 comes up with its bestresponsegeneralpolicy. Forcalculatingthebestresponse,theAgent1solvesaPOMDPwith thestartingbeliefrangeas[0,1],sinceAgent2’sgeneralpolicyisdefinedoverthisrange. Afterthe firstiteration,Agent1containsthreespecializedpoliciesaspartofitsgeneralpolicy,dominating 104 over ranges [0,0.15], [0.15,0.85], [0.85,1]. In iteration 2, Agent1 fixes its general policy, and Agent2 begins its best response calculation with region [0,0.15]. For this range [0,0.15], Agent2 has only one dominant specialized policy and same is the case for [0.85,1]. 
However, for the range[0.15,0.85],Agent2hastwodominantspecializedpolicies,onethatdominatesintherange [0.15,0.5],andtheotherthatdominatesintherange[0.5,0.85]. Thusafteriteration2,Agent2has fourspecializedpoliciesaspartofitsbestresponsegeneralpolicy. However,regionshighlighted (with dotted rectangular boxes) have identical policies and thus after merging we are left with only two specialized policies. This algorithm continues with Agent2 fixing its general policy at iteration 3. Finally at convergence, each agent contains two specialized policies as part of their generalpolicies. 8.1.2 KeyIdeas In this section, we explain in detail the key ideas in the CS-JESP algorithm, namely: (a) JESP and DB-GIP synergy; (b) Calculation of dominanant belief regions for specialized policies; (c) Region-based convergence; and (d) Merging of adjacent regions with identical specialized poli- cies. JESP and DB-GIP synergy : Both the DS and DB techniques of DB-GIP can provide sig- nificant performance improvements in CS-JESP. First, with respect to DS, JESP’s state space is dynamic, where the set of states reachable at time t, e t i differ from the set of states at t + 1, e t+1 i . DS can exploit this dynamism by computing dominant policies at time t over the belief spacegeneratedbythestatesine t i thusreducingthedimensionalityofthestatespaceconsidered. For instance, in Figure 2.4, we have two initial statese 1 1 =SL orSR, while there are four states 105 Figure 8.1: Trace of the algorithm for T=2 in Multi Agent tiger example with a specific starting jointpolicy e 2 1 , e.g.SL(TL),SL(TR) etc. Given a time horizon of T=2, instead of constructing a belief space over (2+4=) 6 dimensions, DS will lead to constructing a belief space over two states at thefirsttimestepandfourstatesover thesecondtimestep. Suchdimensionalityreductionleads to significant speedups in CS-JESP. Second, with respect to DB, each agent solves a POMDP over the belief regions in the general policies of the other agents. DB is able to exploit for- ward projections of such starting belief regions to bound the maximum probabilities over states, and thus again restrict the belief space over which dominant policies are planned per belief re- gion,obtainingadditionalspeedups. ForinstanceinFigure8.1,atiteration2,Agent2solvesthree 106 POMDPs — these POMDPs are defined over extended states given three separate fixed policies of Agent1— one with the starting belief region as [0,0.15], another with [0.15,0.85], and a third with [0.85,1]. Thus, in solving the POMDP starting with the belief range [0,0.15], DB helps prune all the unreachable portions of the belief space given that the starting range is [0,0.15]. In allthreePOMDPs,thebeliefregionisnarrowercomparedto[0,1]. Region-based convergence : Given continuous initial belief space, we obtain value vectors (vector containing values for all the states) for all the belief regions in the general policy. Thus, convergence is attained when for all agents the value vectors at the current iteration for all the belief regions are equal to those in the previous iteration. For instance, in Figure 8.1, the con- vergence is attained in the fourth iteration, with the general policy of Agent2 containing the two exact same specialized policies from iteration 3. However, once one region has converged — the value vectors for all agents do not change from one iteration to the next for that region — CS-JESPwillnottestthatregionfurtherforconvergence,butonlycontinuechangingpoliciesin regionsthathavefailedtoconverge. 
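A simplified sketch of this region-based bookkeeping is shown below; purely for illustration, it assumes the belief partition did not change between the two iterations being compared, and Algorithm 25 later enforces the stricter criterion that a region's joint policy must remain unchanged for n consecutive best responses before the region is retired.

import numpy as np

def regions_to_replan(prev_values, curr_values, converged, tol=1e-9):
    """prev_values / curr_values: one value vector (np.ndarray over states) per belief region,
    in the same order for both iterations; converged: set of region indices already retired.
    Marks regions whose value vector stopped changing and returns the indices still active."""
    for k, (v_old, v_new) in enumerate(zip(prev_values, curr_values)):
        if k not in converged and np.allclose(v_old, v_new, atol=tol):
            converged.add(k)                 # retired regions are never re-examined
    return [k for k in range(len(curr_values)) if k not in converged]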
Mergingofadjacentregionswithidentical specialized policies: Mergingsuchregionscan be important as the other agent would have to solve fewer number of POMDPs in the next iter- ation. For instance, in the general policy of Agent2 before merging at iteration 2, belief regions [0,0.15] and [0.15,0.5] have identical specialized policies. Similarly, regions [0.5, 0.85] and [0.85,1] have identical specialized policies. Thus Agent2 has only two specialized policies after merging (instead of four before merging) and this leads to agent1 solving two instead of four POMDPsatiteration3. 107 Mergingrequiresidentifyingregionsadjacenttoeachother. IntheTigerdomain,thisisdone by doing adjacency check for regions along one dimension. However, finding bordering regions ina|S|dimensionalstatespacerequirescomparisonsalong|S|−1dimensionalspace. Calculation of dominant belief regions for specialized policies: One standard way of rep- resenting solutions in single agent POMDPs is through value vectors. In this representation, the best policy for a belief point, b, is computed by testing for a vector that provides the maximum expectedvalueforthatbeliefpoint. π ∗ 1 ←argmax π∈{π 1 } v π ·b. However, in CS-JESP, one agent uses the belief regions of the other agent to calculate the best responses over each of those belief regions. We develop a linear program to address the dominantbeliefregioncomputationforeachpolicy. Algorithm24computesthemaximumbelief probability of a state, s j , where a policy or value vector,v dominates all the other policies or value vectors,V−v in the final policy. Constraint 1 in Algorithm 24, computes points wherev dominates all the other vectors inV. Objective function of the algorithm is a maximization over b(s j ), thus finding highest possible belief probability for state s j amongst all those dominating points. In a similar way, the minimum for s j can be found by doing a minimize, instead of maximize, in line 1 of the LP. The belief region is calculated by solving these max, min LPs for each states j ∈ S. Thus, requiring 2∗|V|∗|S| number of LPs to be solved for the computation ofanentirebeliefregion. 108 Algorithm24 MAXIMUMBELIEF(s j ,v,V,B min ,B max ) Maximizeb(s j ) subjecttoconstraints 1.b.(v−v 0 )> 0,∀v 0 ∈V−v 2.Σ s∈S b(s) = 1 3.B min (s)<b(s)<B max (s),∀s∈S 8.1.3 Algorithmfornagents Inthissection,wepresenttheCS-J ESPalgorithm(Algorithm25)fornagents. Intheinitialization stage (lines 1-4), each agent i has only one belief region that corresponds to its entire belief space (Π 0 i .beliefPartition). Also, each agent has a single randomly selected specialized policy, Π 0 i .π[h[0,1],...,[0,1]i] (i.e. π is the specialized policy), for the entire belief space (line 3). Every general policy has “count” for each belief region, to track the convergence of policies in that belief region (region-based convergence) — if the count reaches n then the region has converged, because no agent will change any further. The flag “converged” monitors if joint generalpoliciesinalltheregionshaveconverged. In each iteration (one execution of lines 6-23) of Algorithm 25, we choose an agent i and find its optimal response to the fixed general policies of the remaining agents by calling OPTI- MALBESTRESPONSE(). This is repeated until no agent acting alone can improve upon the joint expectedrewardbychangingitsowngeneralpolicy. Although each agent i starts off with the same belief set partition, Π 0 i .beliefPartition, this will not be true after calling OPTIMALBESTRESPONSE() as seen in Figure 8.1. 
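The belief-region LP of Algorithm 24 above can be handed to an off-the-shelf solver almost verbatim. The sketch below uses scipy.optimize.linprog; the strict dominance inequalities are relaxed with a small margin and the open bounds B_min < b(s) < B_max are treated as closed intervals, both being assumptions made here for numerical convenience rather than details specified in the thesis.

import numpy as np
from scipy.optimize import linprog

def maximum_belief(j, v, other_vectors, b_min, b_max, margin=1e-9):
    """Maximum belief probability of state j over the region where value vector v dominates
    every vector in other_vectors, restricted to the box [b_min, b_max] (cf. Algorithm 24).
    Returns None if v dominates nowhere in the box."""
    v = np.asarray(v, dtype=float)
    n = v.size
    c = np.zeros(n)
    c[j] = -1.0                                    # maximize b[j] == minimize -b[j]
    others = [np.asarray(vp, dtype=float) for vp in other_vectors]
    A_ub = np.array([vp - v for vp in others]) if others else None   # (v - v')·b >= margin
    b_ub = np.full(len(others), -margin) if others else None
    A_eq = np.ones((1, n))                         # belief probabilities sum to one
    b_eq = np.array([1.0])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=list(zip(b_min, b_max)), method="highs")
    return res.x[j] if res.success else None

The minimum for state j is obtained by flipping the sign of c[j]; solving both LPs for every state and every vector yields the 2·|V|·|S| solver calls mentioned above.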
The function UPDATEPARTITION() (Algorithm 26) is responsible for creating a new belief set partition for an agenti, depending on the belief regions of the othern−1 agents. This new belief set partition is obtained by splitting the overlapping belief regions of the n−1 agents, in a way that no two 109 resulting belief regions, which now belong to this partition, overlap. Furthermore, this function computestheΠ i .countforallthenewregions,fromthecountvaluesfortheregionsinΠ 0 j ,where j wasthefreeagentinthelastiteration(i.etheagentwhocomputedthebestresponseinthelast iteration). FINDNEWPARTITION() (Algorithm 27) takes two arguments, (i) partition and (ii) a belief region, br, and it generates all feasible partitions from the two arguments. To illustrate the working of this function, we provide an example with three states{s 1 ,s 2 ,s 3 }. Belief regions in the corresponding belief space can be represented with minimum and maximum belief prob- abilities for just s 1 and s 2 , i.e. {(b min [s 1 ],b max [s 1 ]), (b min [s 2 ],b max [s 2 ])}. For example, let partition ={h[0,0.8],[0.5,0.9]i} (has only one region) and br ={[0.4,0.9],[0.3,0.6]}. In the first step (line 3), partitions are found for each state, s i separately. Thus, for the first state, s 1 , [0,0.8]and[0.4,0.9]yieldspartitions,[0,0.4],[0.4,0.8],[0.8,0.9]. Similarlyforthesecondstate, s 2 ,thepartitionsfoundare[0.3,0.5],[0.5,0.6],[0.6,0.9]. Inthesecondstep(line4),wecompute the cross product of these individual dimension partitions. This gives rise to nine belief regions, viz. {[0,0.4],[0.3,0.5]}, ...,{[0.8,0.9], [0.4,0.8]},{[0.8,0.9],[0.8,0.9]}. Finally, in the third step(line5),wepruneregionswhichdonotcontainanyvalidpoints,i.e. P s≤|S|−1 b min [s]> 1. For instance, the region{[0.8,0.9],[0.4,0.8]} can be pruned, because a belief point in this region hasprobabilityofatleast1.2(= 0.8+0.4). The function OPTIMALBESTRESPONSE() (Algorithm 28) is then called separately for each belief region in agent i’s belief set partition. It returns a new partitioniong of the initial belief space and the optimal policy for each belief region in this partition. CONSTRUCTEXTENDED- POMDP() constructs a POMDP with extended state space, as explained in Section 2.3.3, while 110 the function CALCULATEBELIEFREGION() computes the belief regions where each vector v (∈V)dominates. After computing best responses, CS-J ESP() ensures that the number of belief partitions ob- tained are finite (in lines 14-16) and that Π.count is updated correctly for each belief region (in lines18-21). It is possible that an agent’s best response in adjacent belief regions is the same policy. The function MERGEBELIEFREGIONS() (Algorithm 8.1.3) is responsible for merging such kind of regions(lines4-7). Further,oncethepoliciesinabeliefregionhaveconverged,thatregionisnot consideredforsubsequentmergingphases(firstpartoftheconditiononline4). 8.1.4 TheoreticalResults Inthefollowingproofs,weuse“iteration”tomeanoneexecutionofthe“while”loop(lines5-23) of Algorithm 25, n for the number of agents, and “free agent” to denote the i th agent for that iteration. Proposition15 In CS-JESP, the joint expected reward for all starting belief points is monotoni- callyincreasingwitheachiteration. ProofSketch. Ineveryiteration,eachstartingbeliefpointmustbelongtooneoftheregionsinthe beliefpartitionofthefreeagent. Eachsuchbeliefregioncorrespondstooneofthevaluevectors, calculatedbyacalltoOPTIMALBESTRESPONSE(Algorithm28). 
Since DB-GIP is optimal, these vectors either equal or dominate the vectors of the previous iteration, in all belief regions.

Proposition 16 CS-JESP will terminate iff the joint policy has converged in all the free agent's belief regions.

Proof Sketch. By construction, CS-JESP (Algorithm 25) terminates iff converged = true, which happens iff Π_i.count[br] ≥ n for all belief regions br of the free agent i. Π_i.count[br] ≥ n iff the joint policy for the region br remains constant for n iterations. For the joint general policy to remain constant for n iterations, OPTIMALBESTRESPONSE() should return specialized policies identical to those of the previous iteration, for all the belief regions, for n − 1 free agents. This happens when no single agent can improve the global value by altering its general policy, i.e., when a local optimum is attained. Furthermore, we round off each dimension of a belief region to "precision" decimal places, and hence the number of possible belief regions cannot grow indefinitely.

From Propositions 15 and 16, we can conclude that CS-JESP will always terminate. At termination, the joint policy will be locally optimal as long as none of the belief regions returned by OPTIMALBESTRESPONSE were eliminated by the ROUNDOFF procedure.

Algorithm 25 CS-JESP()
1: for i ← 1 to n do
2:   Π′_i.beliefPartition ← {⟨[0,1], ..., [0,1]⟩}
3:   Π′_i.π[⟨[0,1], ..., [0,1]⟩] ← random specialized policy
4:   Π′_i.count[⟨[0,1], ..., [0,1]⟩] ← 0
5: end for
6: converged ← false; i ← n
7: while converged = false do
8:   i ← (i MOD n) + 1; converged ← true
9:   UPDATEPARTITION(i, Π_i, Π′)
10:   for all br in Π_i.beliefPartition do
11:     if Π_i.count[br] < n then
12:       converged ← false
13:       {Π_i, regions} ← OPTIMALBESTRESPONSE(i, Π′, br)
14:       for all br_1 in regions do
15:         π ← Π_i.π[br_1]; REMOVE(Π_i, br_1)
16:         for dim ← 1 to |S| − 1 do
17:           br_1[dim] ← ROUNDOFF(br_1[dim], precision)
18:         end for
19:         if VOLUME(br_1) > 0 then
20:           ADD(Π_i.beliefPartition, br_1, π)
21:           if Π_i.π[br_1] = Π′_i.π[br] then
22:             Π_i.count[br_1] ← Π′_i.count[br] + 1
23:           else
24:             Π_i.count[br_1] ← 1
25:           end if
26:         end if
27:       end for
28:     end if
29:   end for
30:   MERGEBELIEFREGIONS(Π_i)
31:   Π′_i ← Π_i
32: end while
33: return Π

Algorithm 26 UPDATEPARTITION(i, Π)
1: Π_i ← Π_{(i MOD n)+1}
2: for all j in {1, ..., n} − {i, (i MOD n)+1} do
3:   for all br_1 in Π_j.beliefPartition do
4:     if Π_i.count[br_1] < n then
5:       Π_i.beliefPartition ← FINDNEWPARTITION(Π_i.beliefPartition, br_1)
6:     end if
7:   end for
8: end for
9: if i = 1 then j ← n else j ← i − 1
10: for all br_2 in Π_i.beliefPartition do
11:   br_3 ← OVERLAPPINGREGION(Π_j.beliefPartition, br_2)
12:   Π_i.count[br_2] ← Π_j.count[br_3]
13: end for
14: return

Algorithm 27 FINDNEWPARTITION(partition, br)
1: newPartition ← ∅
2: for dim ← 1 to |S| − 1 do
3:   1DPartition ← SPLITDIMENSION(dim, br, partition)
4:   newPartition ← CROSSPRODUCT(newPartition, 1DPartition)
5:   newPartition ← PRUNE(newPartition)
6: end for
7: return newPartition

Algorithm 28 OPTIMALBESTRESPONSE(i, Π′, br)
1: k ← 0
2: extendedPOMDP ← CONSTRUCTEXTENDEDPOMDP(i, Π′, br)
3: {V, π_new} ← DB-GIP(extendedPOMDP, br)
4: for j ← 1 to V.size do
5:   v ← V[j]; V′ ← V − v
6:   beliefPartition[k] ← CALCULATEBELIEFREGION(v, V′, br)
7:   Π_i.beliefPartition[k] ← beliefPartition[k]
8:   Π_i.π[beliefPartition[k]] ← π_new[j]; k ← k + 1
9: end for
10: return {Π_i, beliefPartition}

Algorithm 29 MERGEBELIEFREGIONS(Π_i)
1: for each b_1 in Π_i.beliefPartition do
2:   if Π_i.count[b_1] < n then
3:     for each b_2 in Π_i.beliefPartition do
4:       if Π_i.count[b_2] < n ∧ Π_i.π[b_1] = Π_i.π[b_2] then
5:         if ISADJACENT(b_1, b_2) then
6:           b ← MERGEREGIONS(b_1, b_2)
7:           ADD(Π_i.beliefPartition, b, Π_i.π[b_1])
8:           Π_i.count[b] ← min(Π_i.count[b_1], Π_i.count[b_2])
9:           REMOVE(Π_i, b_1); REMOVE(Π_i, b_2)
10:         end if
11:       end if
12:     end for
13:   end if
14: end for
15: return Π_i
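To make the region-splitting step of FINDNEWPARTITION (Algorithm 27) concrete, the following is a minimal sketch of the idea behind it, reduced to the two-region case of the three-state example in Section 8.1.3. It is an illustration only: it splits one existing region against one incoming region br, it prunes once at the end rather than inside the dimension loop, and the helper names are invented here rather than taken from the thesis implementation.

```python
# Illustrative sketch of the splitting idea in FINDNEWPARTITION (Algorithm 27),
# simplified to one existing region versus one new region br.
from itertools import product

def split_dimension(a, b):
    """Split two overlapping intervals at the union of their endpoints."""
    points = sorted({a[0], a[1], b[0], b[1]})
    return list(zip(points, points[1:]))

def find_new_partition(region, br):
    # Step 1 (line 3): split each of the first |S|-1 dimensions separately.
    dims = [split_dimension(region[d], br[d]) for d in range(len(region))]
    # Step 2 (line 4): cross product of the per-dimension splits.
    boxes = [list(combo) for combo in product(*dims)]
    # Step 3 (line 5): prune boxes with no valid belief point, i.e. boxes
    # whose minimum probabilities over the first |S|-1 states sum past 1.
    return [box for box in boxes if sum(lo for lo, _ in box) <= 1]

# Three-state example from Section 8.1.3: region <[0,0.8],[0.5,0.9]>
# split against br = <[0.4,0.9],[0.3,0.6]>.
parts = find_new_partition([(0.0, 0.8), (0.5, 0.9)], [(0.4, 0.9), (0.3, 0.6)])
```

On the worked example, this yields the per-dimension splits listed earlier, nine boxes from the cross product, and six surviving regions after pruning the three boxes whose minimum probabilities already sum to more than 1.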
8.2 Experimental Results

Figure 8.2: Comparison of (a) CS-JESP+GIP and CS-JESP+DB for reward structure 1, (b) CS-JESP+DB and CS-JESP+DBM for reward structure 1, (c) CS-JESP+GIP and CS-JESP+DB for reward structure 2, (d) CS-JESP+DB and CS-JESP+DBM for reward structure 2.

This section provides four evaluations of CS-JESP using the multiagent tiger domain [Nair et al., 2003a]. The first experiment focuses on run-time evaluations. We compare three techniques: (i) CS-JESP+GIP, the basic combination of JESP and GIP, the value iteration algorithm for single agent POMDPs; (ii) CS-JESP+DB, JESP combined with DB-GIP; and (iii) CS-JESP+DBM, CS-JESP+DB with the merging enhancement. Results of this experiment are shown in Figure 8.2. We experiment with two separate reward structures (presented in [Nair et al., 2003a]). Figures 8.2(a) and 8.2(b) focus on reward structure 1, while Figures 8.2(c) and 8.2(d) focus on reward structure 2. In Figure 8.2(a), the x-axis plots varying time horizon, while the y-axis plots run-time in milliseconds on a log scale (machine specifications for all experiments: Intel Xeon 3.6 GHz processor, 2 GB RAM). In Figure 8.2(b), the x-axis again plots time horizon, but the y-axis plots run-time in milliseconds without a log scale. The time limit for the problems was set at 7,500,000 ms, after which they were terminated. Figures 8.2(a) and 8.2(c) compare CS-JESP+GIP and CS-JESP+DB, while Figures 8.2(b) and 8.2(d) compare CS-JESP+DB and CS-JESP+DBM, for the two reward structures.

Figure 8.3: Comparison of the number of belief regions created in CS-JESP+DB and CS-JESP+DBM for reward structures 1 and 2.

Figure 8.2(a) shows that CS-JESP+GIP did not terminate within the specified time limit beyond T=4. However, CS-JESP+DB converged to the solution even for T=7, within the specified time limit. Even in cases where CS-JESP+GIP terminates, CS-JESP+DB provides significant speedups. For instance, in Figure 8.2(a), at T=4, CS-JESP+GIP takes 83,717.8 ms while CS-JESP+DB takes only 7,345.2 ms, an 11.4-fold speedup. Similar conclusions can be drawn from Figure 8.2(c). These results illustrate the synergy of JESP and DB-GIP, and the suitability of CS-JESP for taking advantage of DB-GIP.

Figure 8.2(b) shows that CS-JESP+DBM provides further speedups over CS-JESP+DB as the time horizon increases. For instance, at T=7 in Figure 8.2(b), merging in CS-JESP+DBM provided a 1.66-fold speedup over CS-JESP+DB. Similar results are obtained with reward structure 2 in Figure 8.2(d), thus establishing the utility of merging contiguous regions with identical policies. In Figure 8.2(d), T=7 (post merging) shows a faster execution than T=6 (post merging). This occurs because the number of CS-JESP iterations required for convergence at T=7 is lower (6) than at T=6 (11).

Figure 8.4: Comparison of the expected values obtained with JESP for specific belief points and CS-JESP.

Our second evaluation, in Figure 8.3, focuses on understanding the speedups due to merging in CS-JESP+DBM. The number of belief regions present in the final solution is an indicator of the number of single agent POMDPs solved at each iteration. The x-axis in the figures represents the time horizon, while the y-axis is the number of belief regions.
Thus in Figure 8.3, for a time horizon of 7, using CS-JESP+DB led to 31 belief regions, whereas using CS-JESP+DBM led to 13 belief regions, a 2.39-fold reduction in the number of belief regions considered. Furthermore, we see that increasing the time horizon leads to an increasing reduction in the number of belief regions with CS-JESP+DBM compared to CS-JESP+DB. The effect of the number of belief regions on the time taken grows with the time horizon, because the single agent POMDPs expand in size with the time horizon. This explains the timing results for CS-JESP+DBM in Figures 8.2(b) and 8.2(d).

Our third evaluation focused on illustrating that CS-JESP achieves what it set out to do: generating policies over a continuous initial belief space, as shown in Figure 8.4. The belief space (in this domain, the belief probability of SL) is denoted on the x-axis, while the expected value of the policy is depicted on the y-axis. CS-JESP provides a general policy whose expected value traces a "cup" shape. There are five different policies represented in the cup, each dominant over a single belief region. The figure also indicates that if we were to approximate this entire general policy with a single policy over a single starting belief state, e.g., with JESP, then the results may be arbitrarily worse. For instance, with JESP(0.3, 0.7), the value at (1, 0) is -27, while the value generated with CS-JESP is 18, a difference of 45. With JESP(0.5, 0.5), the value at (1, 0) is -4, where CS-JESP attains a value of 18, a difference of 22.

Of course, we may sample several belief points with JESP and then, for a new belief point, provide a policy from the nearest sample. Such a heuristic approach naturally leads to our fourth evaluation, comparing the runtime of CS-JESP to that of an approach that samples the belief space. This evaluation is not meant to be a precise comparison of JESP and CS-JESP; instead, the aim is to show that the run-time results for sampled JESP would be comparable to the runtimes of CS-JESP. In Table 8.1, we show the run times of JESP and CS-JESP for T=6 and T=7 for reward structure 1. To replicate the policy obtained with CS-JESP, JESP would have to sample at least as many times as the number of belief regions in the final policy of CS-JESP. For instance, for T=7, the number of samples required for JESP would be thirteen (from Figure 8.3). Table 8.1 shows an estimate of such a sampled JESP technique, given runtime results from [Nair et al., 2004]. We see that the CS-JESP run-times are comparable, yet CS-JESP provides guarantees on these results that are unavailable with sampling.

Table 8.1: Comparison of runtimes (in ms) for JESP and CS-JESP

       CS-JESP    JESP     Sampled Regions    Sampled JESP
T=6    160,336    15,000   11                 165,000
T=7    470,398    73,000   13                 949,000

Chapter 9: Related Work

There are three major areas of related work. The first is speeding up POMDP policy computation, and in particular value iteration algorithms. The second is work where agents are deployed to monitor and assist humans, and must plan in the presence of uncertainty to assist individual humans or teams. The third area of related work is distributed POMDPs.

9.1 Related work in POMDPs

There is a wide variety of techniques for generating policies for POMDPs. These techniques can be categorized into off-line and on-line techniques. Whereas we consider off-line approaches as planning for any belief state within a given range (without knowledge of an agent's current belief state), on-line approaches focus on exploring reachable belief states starting only from an agent's current belief state.
Off-line techniques can be further categorized into exact and approximate algorithms, although some approximate techniques may also be converted into on-line techniques. We first focus on off-line exact algorithms, then on approximate algorithms, and finally discuss on-line algorithms.

Generalized Incremental Pruning (GIP) [Cassandra et al., 1997a] has been one of the efficient exact baseline algorithms, and was experimentally shown to be superior to other exact algorithms [Kaelbling et al., 1998]. We have already presented GIP in detail in the background section. Recent enhancements to the GIP algorithm, particularly Region Based Incremental Pruning (RBIP) [Feng and Zilberstein, 2004a, 2005], provide significant speedups. The key idea in RBIP is the use of witness regions (the earlier idea of witness was presented in [Cassandra et al., 1997b]) for cross sums. While these exact algorithms have improved the basic value iteration algorithm considerably, as discussed earlier, they are unable to scale to the problems of interest in key domains. Indeed, as shown in our experimental analysis, we could not generate policies with GIP within our cutoff for most of our problems. This problem stems in part from the fact that these algorithms plan for unreachable parts of the belief space. Our work complements these existing algorithms by providing "wrapping" techniques, thus complementing the strengths of current approaches. Indeed, the advantages of our "wrappers" (DS, DB, DDB) can be combined with these existing algorithms, as we illustrated by adding our techniques to both the GIP and RBIP algorithms. Our approximation technique can also be used to enhance these algorithms.

Other exact and approximate algorithms have also attempted to exploit properties of the domain to speed up POMDPs; e.g., [Boutilier and Poole, 1996] focus on compactly representing the dynamics of a domain. These compact representations, however, do not seem to provide advantages in terms of speedups [Kaelbling et al., 1998]. A hybrid framework that combines MDP-POMDP problem solving techniques to take advantage of perfectly and partially observable components of the model, with subsequent value function decomposition, was proposed by [Hauskrecht and Fraser, 2000]. This method of separating perfectly and partially observable components of a state does perform reachability analysis on belief states. However: (i) their analysis does not capture dynamic changes in belief space reachability; (ii) their analysis is limited to factored POMDPs; (iii) no speedup measurements are shown. This contrasts with our work, which focuses on dynamic changes in belief space reachability and its application to both flat and factored state POMDPs. [Feng and Hansen, 2004] provide approaches to reduce the dimensionality of the α-vectors based on the equality of values of states. This method does not provide speedups in the TMP domain, as there are very few instances with alpha vectors in which states have equal values.

Because of the slowness of exact algorithms at solving even small problems, a significant amount of research in POMDPs has focused on approximate algorithms. While there is an entire space of algorithms to report in this arena, point-based [Smith and Simmons, 2005; Pineau et al., 2003], policy search [Braziunas and Boutilier, 2004; Poupart and Boutilier, 2004; Meuleau et al., 1999], and grid [Hauskrecht, 2000b; Zhou and Hansen, 2001] approaches dominate other algorithms. Since a discussion of point-based approaches has already been presented in Section 2, here we concentrate on the other approaches.
Policy-search approaches typically employ a finite-state controller to represent the policy, which is updated until convergence to a stable controller. By restricting the size of these finite-state controllers, performance improvements are obtained in these algorithms. Grid-based methods are similar to point-based approaches, with the difference that they maintain "values" at belief points, as opposed to the "value gradients" of point-based techniques. Though these approaches can solve larger problems, many of them provide loose (or no) quality guarantees on the solution, which is a critical weakness in the domains of interest in our work. For example, quality guarantees are important for agent assistants to gain the trust of a human user.

Another approximate approach that attacks scalability is the dimensionality reduction technique, which fundamentally alters the belief space itself [Roy and Gordon, 2002]. This work applies E-PCA (an improvement to Principal Component Analysis) to a set of belief vectors, to obtain a low dimensional representation of the original state space. Though this work provides a large reduction in dimension (state space), it does not provide any guarantees on the quality of solutions. A more crucial issue is the dynamic evolution of the reachable regions of the belief space. E-PCA does not capture this dynamic evolution, while our work focuses on and captures such evolution in the reachable regions of the belief space.

Turning now to on-line algorithms for POMDPs, algorithms such as Real-time Belief Space Search (RTBSS) [Paquet et al., 2005] are offered as on-line approaches for solving POMDPs, which explore reachable belief states starting only from an agent's current belief state. On-line approaches clearly save effort by avoiding computation of policies for every possible situation an agent could encounter. For instance, starting with an initial belief state, the RTBSS algorithm does a branch-and-bound search over belief states, finding the best action at each cycle. However, in order to cut down the time to find an action online, RTBSS must cut down the depth of its search: the deeper the search over belief states, the more expensive it is online. Unfortunately, such shallow search leads to lower quality solutions, while deeper searches consume precious time on-line. In domains such as disaster rescue (including disaster rescue simulation domains), it would appear that such on-line planning may not provide an appropriate tradeoff. In particular, since quality may be related to crucial aspects of the domain, such as saving civilians, obtaining lower quality solutions just to avoid off-line computation may not be appropriate. Furthermore, spending time on-line may waste critical moments, particularly when civilians are injured and time is of the essence in saving them. Also, by definition, these on-line techniques require knowledge of the current belief state. Indeed, in such domains, there may be sufficient time available off-line to generate a policy of high enough quality.

9.2 Related work in Software Personal Assistants

Several recent research projects have focused on deploying personal assistant agents that monitor and assist humans, and that must plan in the presence of uncertainty to assist individual humans or teams [Scerri et al., 2002; Magni et al., 1998; Leong and Cao, 1998]. For instance, [Scerri et al., 2002] have focused on software assistants that help humans in offices, in rescheduling meetings or deciding presenters for research meetings. [Magni et al., 1998] focus on therapy planning, considering the dynamic evolution of a therapy for patients.
However, these research efforts have often used MDPs rather than POMDPs, thus assuming away observational uncertainty, which is a key factor in realistic domains.

Among software personal assistants that have relied on POMDPs, [Hauskrecht and Fraser, 2000] apply POMDPs to medical therapy planning for patients with heart disease. They note that MDPs fail to capture the situation in their domain where the underlying disease is hidden and can only be observed indirectly via a series of imperfect observations. POMDPs provide a useful tool to overcome this difficulty by enabling us to model the observational uncertainty, but at a high computational cost. To overcome this challenge, [Hauskrecht and Fraser, 2000] rely on several approximation techniques to reduce the computational cost of solving these POMDPs. We have discussed the relationship of our work to this approximation technique in the previous section.

Similarly, [Pollack et al., 2003a] apply POMDPs in a mobile robotic assistant developed to assist elderly individuals. The high-level control architecture of the robotic assistant is modeled as a POMDP. Once again the authors, via experiments, illustrate the need to take observational uncertainty into account during planning, and hence the need for POMDPs; e.g., an MDP controller in similar circumstances leads to more errors. However, given the large state space encountered, exact algorithms for the POMDP are ruled out. Instead, a hierarchical version of the POMDP is used to generate an approximation to the optimal policy. The techniques introduced in this thesis are in essence complementary to the research reported there, providing techniques to speed up POMDP policy computation, potentially even in the hierarchical context.

9.3 Related work on Distributed POMDPs

Here we have two categories of related work.

Related work for generating policies for distributed POMDPs given a single initial belief point: As mentioned earlier, our work is related to key DCOP and distributed POMDP algorithms, i.e., we synthesize new algorithms by exploiting their synergies. Here we discuss some other recent algorithms for locally and globally optimal policy generation for distributed POMDPs. For instance, [Hansen et al., 2004a] present an exact algorithm for partially observable stochastic games (POSGs) based on dynamic programming and iterated elimination of dominated policies. [Montemerlo et al., 2004] approximate POSGs as a series of one-step Bayesian games, using heuristics to find the future discounted value of actions. We have earlier discussed [Nair et al., 2003a]'s JESP algorithm, which uses dynamic programming to reach a local optimum. Another technique that computes locally optimal policies in distributed POMDPs is Paruchuri et al. [2006]'s Rolling Down Randomization (RDR) algorithm. However, Paruchuri et al. have studied this in the context of generating randomized policies.

In addition, [Becker et al., 2004]'s work on transition-independent distributed MDPs is related to our assumptions about transition and observability independence in ND-POMDPs. These are all centralized policy generation algorithms that could benefit from the key idea in ND-POMDPs: exploiting the local interaction structure among agents to (i) enable distributed policy generation and (ii) limit policy generation complexity by considering only interactions with "neighboring" agents. [Guestrin et al., 2002] present "coordination graphs", which have similarities to constraint graphs. The key difference in their approach is that the "coordination graph" is obtained from the value function, which is computed in a centralized manner.
The agents then use a distributed procedure for online action selection based on the coordination graph. In our approach, the value function is computed in a distributed manner. [Dolgov and Durfee, 2004] exploit network structure in multiagent MDPs (not POMDPs) but assume that each agent tries to optimize its individual utility instead of the team's utility.

Related work for a continuous initial belief space: [Becker et al., 2003] present an exact globally optimal algorithm, the coverage set algorithm, for transition-independent distributed MDPs. However, unlike CS-JESP, this algorithm starts from a particular known initial state distribution. Hansen et al. [2004b] and Szer et al. [2005] present techniques that compute optimal solutions without making any assumptions about the domain. Hansen et al. present an algorithm for solving partially observable stochastic games (POSGs) based on dynamic programming and iterated elimination of dominated policies. Though this technique provides a set of equilibrium strategies in the context of POSGs, it is shown to provide exact optimal solutions for decentralized POMDPs. Szer et al. [2005] provide an optimal heuristic search method for solving decentralized POMDPs with a finite horizon (given a starting belief point). This algorithm is based on a combination of the classical heuristic search algorithm A* and decentralized control theory. The heuristic functions (upper bounds) required in A* are obtained by approximating a decentralized POMDP as a single agent POMDP and computing the value function quickly. This algorithm is important from a theoretical standpoint, but because of the inherent complexity of finding an exact solution for general distributed POMDPs, it does not scale well. Another approach that computes globally optimal solutions is presented in Nair et al. [2003b]. However, this approach computes the optimal policy in the context of a given BDI (Belief Desire Intention) team plan.

Among locally optimal approaches, [Peshkin et al., 2000b] use gradient descent search to find locally optimal finite-state controllers with bounded memory. Their algorithm finds locally optimal policies from a limited subset of policies, with an infinite planning horizon. Their work does not consider a continuous belief space and starts from a fixed belief point. We have earlier discussed [Nair et al., 2003a]'s JESP algorithm, which uses dynamic programming to reach a local optimum. [Bernstein et al., 2005] present a locally optimal bounded policy iteration algorithm for infinite-horizon distributed POMDPs. This algorithm has been theoretically shown to work in a continuous belief space from an unknown initial belief distribution. While this is an important contribution, the use of finite-state controllers restricts the policy representation. Also, their experimental results are for a single initial belief. Further, unlike our algorithm, they use a correlation device in order to ensure coordination among the various agents.

Among other models related to distributed POMDPs, there is an interesting model called the Interactive POMDP (I-POMDP), introduced by Gmytrasiewicz and Doshi [2005]. This model extends the POMDP model to multi-agent settings by incorporating the notion of agent models into the state space. Agents maintain beliefs over the physical states of the environment and over models of other agents, and they use Bayesian update to maintain their beliefs over time. In I-POMDPs, an agent is primarily concerned with its own welfare, while in distributed POMDPs, an agent is concerned with the team's welfare.
Chapter 10: Conclusion

This thesis presents techniques to build agents and teams of agents that make sequences of decisions while operating in real world uncertain environments. A need for such systems has been shown in many facets of human life, such as software personal assistants, therapy planning, space mission planning, sensor webs for monitoring weather phenomena, and others. However, for such systems to become a reality, these agents need to handle the uncertainty arising at various levels in these domains: unknown initial state, non-deterministic outcomes of actions, and noisy observations. While Partially Observable Markov Decision Processes (POMDPs) and Distributed POMDPs provide powerful models to address uncertainties in real-world domains, solving these models is computationally expensive. Due to the significant computational complexity of these models, existing approaches that provide exact solutions do not scale, while approximate solutions do not provide any usable guarantees on quality.

Towards addressing the above challenges, the following key ideas have been proposed in this thesis: (a) Exploiting structure to improve the efficiency of POMDPs and Distributed POMDPs. The first technique exploits structure in the dynamics to solve POMDPs faster, while the second exploits the interaction structure of agents to solve distributed POMDPs. (b) An approximation technique for POMDPs and Distributed POMDPs that approximates directly in the value space. This technique provides quality bounds that are easily computable and operationalizable, while providing performance comparable to the fastest existing solvers.

For agents and multiagent systems to finally break out into the real world, in a very fundamental sense, they must conquer uncertainty. In the future, I would like to build upon the work in my thesis towards understanding the reasoning process in ever more realistic environments.

- Environments with cooperation and competition: Previous work in distributed POMDPs and multiagent systems in general has categorized agents as either fully adversarial or completely collaborative. However, in many real-world applications, such stark categorization may not be appropriate; agents' motivations thus themselves become sources of uncertainty. Modeling such uncertainties in Distributed Markov Decision Problems and Distributed POMDPs is an open question in the field.

- Unknown environments: These are domains where no model is available or there is uncertainty about the model itself, thus requiring a learning phase to reduce the uncertainty about the model.

- Bounded resource environments: These domains are constrained by the limited availability of resources. Uncertainty is introduced in such domains because actions result in non-deterministic consumption of resources. The decision process in such domains becomes complicated due to this underlying uncertainty and the constraints imposed by resource availability.

I believe that understanding the process of decision making in these critical settings and utilizing this knowledge towards building intelligent agent and multi-agent systems will result in a smooth transition of intelligent systems into our daily life.

Bibliography

R. Becker, S. Zilberstein, V. Lesser, and C. V. Goldman. Transition-independent decentralized Markov decision processes. In AAMAS, 2003.

R. Becker, S. Zilberstein, V. Lesser, and C. V. Goldman. Solving transition independent decentralized Markov decision processes. JAIR, 22:423–455, 2004.

D. S. Bernstein, S. Zilberstein, and N. Immerman. The complexity of decentralized control of MDPs. In UAI, 2000.
C. Boutilier and D. Poole. Computing optimal policies for partially observable decision processes using compact representations. In AAAI, 1996.

M. Bowling and M. Veloso. Multiagent learning using a variable learning rate. AIJ, 2002.

D. Braziunas and C. Boutilier. Stochastic local search for POMDP controllers. In AAAI, 2004.

A. R. Cassandra, M. L. Littman, and N. L. Zhang. Incremental pruning: A simple, fast, exact method for partially observable Markov decision processes. In UAI, 1997a.

A. R. Cassandra, M. L. Littman, and N. L. Zhang. Incremental pruning: A simple, fast, exact method for partially observable Markov decision processes. In UAI, 1997b.

I. Chadès, B. Scherrer, and F. Charpillet. A heuristic approach for solving decentralized POMDPs: Assessment on the pursuit problem. In SAC, 2002.

K. Chintalapudi, E. A. Johnson, and R. Govindan. Structural damage detection using wireless sensor-actuator networks. In Proceedings of the Thirteenth Mediterranean Conference on Control and Automation, 2005.

R. Dechter. Constraint Processing. Morgan Kaufmann, 2003.

D. Dolgov and E. Durfee. Graphical models in local, asymmetric multi-agent Markov decision processes. In AAMAS, 2004.

Z. Feng and E. Hansen. An approach to state aggregation for POMDPs. In AAAI-04 Workshop on Learning and Planning in Markov Processes – Advances and Challenges, 2004.

Z. Feng and S. Zilberstein. Efficient maximization in solving POMDPs. In AAAI, 2005.

Z. Feng and S. Zilberstein. Region based incremental pruning for POMDPs. In UAI, 2004a.

Z. Feng and S. Zilberstein. Region based incremental pruning for POMDPs. In UAI, 2004b.

P. Gmytrasiewicz and P. Doshi. A framework for sequential planning in multiagent settings. Journal of Artificial Intelligence Research, 24:49–79, 2005.

C. V. Goldman and S. Zilberstein. Decentralized control of cooperative systems: Categorization and complexity analysis. JAIR, 22:143–174, 2004.

C. Guestrin, S. Venkataraman, and D. Koller. Context specific multiagent coordination and planning with factored MDPs. In AAAI, 2002.

D. Bernstein, E. Hansen, and S. Zilberstein. Bounded policy iteration for decentralized POMDPs. In IJCAI, 2005.

E. A. Hansen, D. S. Bernstein, and S. Zilberstein. Dynamic programming for partially observable stochastic games. In AAAI, 2004a.

E. A. Hansen, D. S. Bernstein, and S. Zilberstein. Dynamic programming for partially observable stochastic games. In AAAI, 2004b.

M. Hauskrecht. Value-function approximations for POMDPs. JAIR, 13:33–94, 2000a.

M. Hauskrecht. Value-function approximations for POMDPs. JAIR, 13:33–94, 2000b.

M. Hauskrecht and H. Fraser. Planning treatment of ischemic heart disease with partially observable Markov decision processes. AI in Medicine, 18:221–244, 2000.

M. Hauskrecht and H. Fraser. Planning treatment of ischemic heart disease with partially observable Markov decision processes. Artificial Intelligence in Medicine.

CALO: Cognitive Agent that Learns and Organizes. http://www.ai.sri.com/project/CALO, http://calo.sri.com, 2003.

L. P. Kaelbling, M. L. Littman, and A. R. Cassandra. Planning and acting in partially observable stochastic domains. AI Journal, 1998.

H. Kitano, S. Tadokoro, I. Noda, H. Matsubara, T. Takahashi, A. Shinjoh, and S. Shimada. RoboCup-Rescue: Search and rescue for large scale disasters as a domain for multiagent research. In IEEE SMC, 1999.

T. Y. Leong and C. Cao. Modeling medical decisions in DynaMoL: A new general framework of dynamic decision analysis. In World Congress on Medical Informatics (MEDINFO), pages 483–487, 1998.

V. Lesser, C. Ortiz, and M. Tambe. Distributed sensor nets: A multiagent perspective. Kluwer, 2003.

P. Magni, R. Bellazzi, and F. Locatelli. Using uncertainty management techniques in medical therapy planning: A decision-theoretic approach. In Applications of Uncertainty Formalisms, pages 38–57, 1998.
R. Maheswaran, M. Tambe, E. Bowring, J. Pearce, and P. Varakantham. Taking DCOP to the real world: Efficient complete solutions for distributed event scheduling. In AAMAS, 2004.

R. Mailler and V. Lesser. Solving distributed constraint optimization problems using cooperative mediation. In AAMAS, 2004a.

R. Mailler and V. Lesser. Using cooperative mediation to solve distributed constraint satisfaction problems. In AAMAS, 2004b.

N. Meuleau, K. E. Kim, L. P. Kaelbling, and A. R. Cassandra. Solving POMDPs by searching the space of finite policies. In UAI, 1999.

P. J. Modi, H. Jung, M. Tambe, W. Shen, and S. Kulkarni. A dynamic distributed constraint satisfaction approach to resource allocation. In CP, 2001.

P. J. Modi, W. Shen, M. Tambe, and M. Yokoo. An asynchronous complete method for distributed constraint optimization. In AAMAS, 2003a.

P. J. Modi, W. Shen, M. Tambe, and M. Yokoo. An asynchronous complete method for distributed constraint optimization. In AAMAS, 2003b.

R. E. Montemerlo, G. Gordon, J. Schneider, and S. Thrun. Approximate solutions for partially observable stochastic games with common payoffs. In AAMAS, 2004.

R. Nair, D. Pynadath, M. Yokoo, M. Tambe, and S. Marsella. Taming decentralized POMDPs: Towards efficient policy computation for multiagent settings. In IJCAI, 2003a.

R. Nair, M. Tambe, and S. Marsella. Role allocation and reallocation in multiagent teams: Towards a practical analysis. In Proceedings of the Second International Joint Conference on Autonomous Agents and Multi-agent Systems (AAMAS-03), pages 552–559, 2003b.

R. Nair, M. Tambe, and S. Marsella. Role allocation and reallocation in multiagent teams: Towards a practical analysis. In AAMAS, 2003c.

R. Nair, M. Roth, M. Yokoo, and M. Tambe. Communication for improving policy computation in distributed POMDPs. In AAMAS, 2004.

R. Nair, P. Varakantham, M. Tambe, and M. Yokoo. Networked distributed POMDPs: A synthesis of distributed constraint optimization and POMDPs. In AAAI, 2005.

S. Paquet, B. Chaib-draa, and S. Ross. RTBSS: An online POMDP algorithm for complex environments. In AAMAS, 2005.

P. Paruchuri, M. Tambe, F. Ordonez, and S. Kraus. Security in multiagent systems by policy randomization. In Proceedings of the Fifth International Joint Conference on Autonomous Agents and Multi-agent Systems (AAMAS-06), 2006.

L. Peshkin, N. Meuleau, K.-E. Kim, and L. Kaelbling. Learning to cooperate via policy search. In UAI, 2000a.

L. Peshkin, N. Meuleau, K.-E. Kim, and L. Kaelbling. Learning to cooperate via policy search. In UAI, 2000b.

A. Petcu and B. Faltings. A scalable method for multiagent constraint optimization. In IJCAI, 2005.

J. Pineau and G. Gordon. POMDP planning for robust robot control. In ISRR, 2005.

J. Pineau, G. Gordon, and S. Thrun. PBVI: An anytime algorithm for POMDPs. In IJCAI, 2003.

M. E. Pollack, L. Brown, D. Colbry, C. E. McCarthy, C. Orosz, B. Peintner, S. Ramakrishnan, and I. Tsamardinos. Autominder: An intelligent cognitive orthotic system for people with memory impairment. Robotics and Autonomous Systems, 44:273–282, 2003a.

M. E. Pollack, L. Brown, D. Colbry, C. E. McCarthy, C. Orosz, B. Peintner, S. Ramakrishnan, and I. Tsamardinos. Autominder: An intelligent cognitive orthotic system for people with memory impairment. Robotics and Autonomous Systems, 44:273–282, 2003b.

P. Poupart and C. Boutilier. VDCBPI: An approximate scalable algorithm for large scale POMDPs. In NIPS, 2004.

D. V. Pynadath and M. Tambe. The communicative multiagent team decision problem: Analyzing teamwork theories and models. JAIR, 16:389–423, 2002.

M. Roth, R. G. Simmons, and M. M. Veloso. Reasoning about joint beliefs for execution-time communication decisions. In AAMAS, 2005.

N. Roy and G. Gordon. Exponential family PCA for belief compression in POMDPs. In NIPS, 2002.
S. Funiak, C. Guestrin, M. Paskin, and R. Sukthankar. Distributed localization of networked cameras. In Proceedings of the Fifth International Conference on Information Processing in Sensor Networks (IPSN-06), 2006.

P. Scerri, D. Pynadath, and M. Tambe. Towards adjustable autonomy for the real-world. JAIR, 17:171–228, 2002.

D. Schreckenghost, C. Martin, P. Bonasso, D. Kortenkamp, T. Milam, and C. Thronesbery. Supporting group interaction among humans and autonomous agents. In AAAI, 2002.

R. Simmons and S. Koenig. Probabilistic robot navigation in partially observable environments. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), 1995.

T. Smith and R. Simmons. Point-based POMDP algorithms: Improved analysis and implementation. In UAI, 2005.

D. Szer, F. Charpillet, and S. Zilberstein. MAA*: A heuristic search algorithm for solving decentralized POMDPs. In Proceedings of the Twentieth International Joint Conference on Artificial Intelligence (IJCAI-05), 2005.

P. Varakantham, R. Maheswaran, and M. Tambe. Exploiting belief bounds: Practical POMDPs for personal assistant agents. In AAMAS, 2005.

M. Yokoo and K. Hirayama. Distributed breakout algorithm for solving distributed constraint satisfaction problems. In ICMAS, 1996.

R. Zhou and E. Hansen. An improved grid-based approximation algorithm for POMDPs. In IJCAI, 2001.
Abstract
My research goal is to build large-scale intelligent systems (both single- and multi-agent) that reason with uncertainty in complex, real-world environments. I foresee an integration of such systems in many critical facets of human life ranging from intelligent assistants in hospitals to offices, from rescue agents in large scale disaster response to sensor agents tracking weather phenomena in earth observing sensor webs, and others. In my thesis, I have taken steps towards achieving this goal in the context of systems that operate in partially observable domains that also have transitional (non-deterministic outcomes to actions) uncertainty. Given this uncertainty, Partially Observable Markov Decision Problems (POMDPs) and Distributed POMDPs present themselves as natural choices for modeling these domains.