KEEP THE ADVERSARY GUESSING: AGENT SECURITY BY POLICY RANDOMIZATION by Praveen Paruchuri A Dissertation Presented to the FACULTY OF THE GRADUATE SCHOOL UNIVERSITY OF SOUTHERN CALIFORNIA In Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (COMPUTER SCIENCE) August 2007 Copyright 2007 Praveen Paruchuri Dedication This dissertation is dedicated to my parents and my sister. ii Acknowledgements I would like to thank all the people who have helped me complete my thesis. First and foremost, I would like to thank my advisor, Milind Tambe for his attention, guid- ance, and support at every step of this thesis. He served as an excellent role model and source of valuable advice. I would also like to thank Sarit Kraus and Fernando Ordonez for being my col- laborators since the beginning of my PhD. Their valuable guidance and support made my journey through the PhD program very memorable. I wish to thank Gaurav S. Sukhatme, Stacy Marsella and Leana Golubchik for being on my thesis committee. Their valuable insights and comments were instrumental in structuring my dissertation. I am grateful to all the members of the TEAMCORE research group for being my friends and collaborators. Gal Kaminka, David Pynadath, Paul Scerri, Hyuckchul Jung, Jay Modi, Ranjit Nair, Rajiv Maheswaran, Nathan Schurr, Pradeep Varakantham, Jonathan Pearce, Emma Bowring, Janusz Marecki, Tapana Gupta and Zvi Topol have always been very helpful and supportive. I would like to thank all my collaborators for their numerous stimulating discussions that helped in shaping my thesis. I would also like to thank all my friends and roommates who have made my stay at USC a memorable experience. Lastly and most importantly, I would like to express my gratitude to my iii family. In particular, I would like to thank my parents and sister for believing in me and pushing me to get a doctorate degree. iv Table of Contents Dedication ii Acknowledgements iii List Of Figures vii List Of Algorithms viii List Of Tables ix Abstract x Chapter 1 Introduction 1 Chapter 2 Domains 6 2.1 Single agent UA V Patrolling Example . . . . . . . . . . . . . . . . . . . . . . . 6 2.2 Multi-agent UA V Patrolling example . . . . . . . . . . . . . . . . . . . . . . . . 9 2.3 The Police Patrolling Domain . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 Chapter 3 Single Agent Security Problem 13 3.1 Markov Decision Problems(MDP) . . . . . . . . . . . . . . . . . . . . . . . . . 17 3.1.1 Basic Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 3.2 Randomness of a policy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 3.2.1 Maximal entropy solution . . . . . . . . . . . . . . . . . . . . . . . . . 20 3.3 Efficient single agent randomization . . . . . . . . . . . . . . . . . . . . . . . . 22 3.4 Incorporating models of the adversary . . . . . . . . . . . . . . . . . . . . . . . 27 3.5 Applying to POMDPs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 Chapter 4 Multi Agent Security Problem 29 4.1 From Single Agent to Agent Teams . . . . . . . . . . . . . . . . . . . . . . . . 29 4.2 Multiagent Randomization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 4.3 RDR Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 Chapter 5 Experimental Results 38 5.1 Single Agent Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 5.2 Multi Agent Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
41 v 5.3 Entropy Increases Security: An Experimental Evaluation . . . . . . . . . . . . . 44 Chapter 6 Partial Adversary Model: Limited Randomization Procedure 49 6.1 Bayesian Games . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 6.1.1 Harsanyi Transformation . . . . . . . . . . . . . . . . . . . . . . . . . . 54 6.1.2 Existing Procedure: Finding an Optimal Strategy . . . . . . . . . . . . . 55 6.2 Limited Randomization Approach . . . . . . . . . . . . . . . . . . . . . . . . . 57 6.2.1 Mixed-Integer Quadratic Program . . . . . . . . . . . . . . . . . . . . . 60 6.2.2 Mixed-Integer Linear Program . . . . . . . . . . . . . . . . . . . . . . . 63 6.3 Decomposition for Multiple Adversaries . . . . . . . . . . . . . . . . . . . . . . 65 6.3.1 Decomposed MIQP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 6.3.2 Decomposed MILP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 6.4 Experimental results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 Chapter 7 Partial Adversary Model: Exact Solution 75 7.1 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 7.1.1 Mixed-Integer Quadratic Program . . . . . . . . . . . . . . . . . . . . . 76 7.1.2 Decomposed MIQP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 7.1.3 Decomposed MILP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83 7.2 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84 Chapter 8 Related Work 89 8.1 Randomized policies for MDPs/POMDPs . . . . . . . . . . . . . . . . . . . . . 89 8.2 Related work in game theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91 8.3 Randomization for Privacy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94 8.4 Randomization and Patrolling . . . . . . . . . . . . . . . . . . . . . . . . . . . 95 8.5 Randomized algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97 Chapter 9 Conclusion 98 Bibliography 102 vi List Of Figures 2.1 Single Agent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 3.1 Markov Decision Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 3.2 Partially Observable Markov Decision Process . . . . . . . . . . . . . . . . . . . 28 4.1 Belief Tree for UA V team domain generated by RDR . . . . . . . . . . . . . . . 34 5.1 Comparison of Single Agent Algorithms . . . . . . . . . . . . . . . . . . . . . . 40 5.2 Results for RDR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 5.3 Improved security via randomization . . . . . . . . . . . . . . . . . . . . . . . . 45 6.1 Runtimes for various algorithms on problems of 3 and 4 houses. . . . . . . . . . 70 6.2 Reward for various algorithms on problems of 3 and 4 houses. . . . . . . . . . . 72 6.3 Reward for ASAP using multisets of 10, 30, and 80 elements . . . . . . . . . . . 74 7.1 DOBSS vs. ASAP and multiple LP methods . . . . . . . . . . . . . . . . . . . . 86 7.2 DOBSS vs. ASAP for larger strategy spaces . . . . . . . . . . . . . . . . . . . . 88 7.3 Effect of additional followers on leader’s strategy . . . . . . . . . . . . . . . . . 88 vii List of Algorithms 1 MAX-ENTROPY(E min ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 2 CRLP(E min ,¯ x) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 3 BRLP(E min ,¯ x) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 4 RDR(d,percentdec,¯ x) . . . . . . . . . . . . . 
. . . . . . . . . . . . . . . . . . 36 viii List Of Tables 3.1 Maximum expected rewards(entropies) for variousβ . . . . . . . . . . . . . . . 26 5.1 RDR: Avg. run-time in sec and (Entropy),T = 2 . . . . . . . . . . . . . . . . . 41 6.1 Payoff table for example normal form game. . . . . . . . . . . . . . . . . . . . . 50 6.2 Payoff tables: Security Agent vs Robbersa andb . . . . . . . . . . . . . . . . . 55 6.3 Harsanyi Transformed Payoff Table . . . . . . . . . . . . . . . . . . . . . . . . 55 ix Abstract Recent advances in the field of agent/multiagent systems brings us closer to agents acting in real world domains, which can be uncertain and many times adversarial. Security, commonly defined as the ability to deal with intentional threats from other agents is a major challenge for agents or agent-teams deployed in these adversarial domains. Such adversarial scenarios arise in a wide variety of situations that are becoming increasingly important such as agents patrolling to provide perimeter security around critical infrastructure or performing routine security checks. These domains have the following characteristics: (a) The agent or agent-team needs to commit to a security policy while the adversaries may observe and exploit the policy committed to. (b) The agent/agent-team potentially faces different types of adversaries and has varying information available about the adversaries (thus limiting the agents’ ability to model its adversaries). To address security in such domains, I developed two types of algorithms. First, when the agent has no model of its adversaries, my key idea is to randomize agent’s policies to minimize the information gained by adversaries. To that end, I developed algorithms for policy randomiza- tion for both the Markov Decision Processes (MDPs) and the Decentralized-Partially Observable MDPs (Dec POMDPs). Since arbitrary randomization can violate quality constraints (for exam- ple, the resource usage should be below a certain threshold or key areas must be patrolled with a certain frequency), my algorithms guarantee quality constraints on the randomized policies x generated. For efficiency, I provide a novel linear program for randomized policy generation in MDPs, and then build on this program for a heuristic solution for Dec-POMDPs. Second, when the agent has partial model of the adversaries, I model the security domain as a Bayesian Stack- elberg game where the agent’s model of the adversary includes a probability distribution over possible adversary types. While the optimal policy selection for a Bayesian Stackelberg game is known to be NP-hard, my solution approach based on an efficient Mixed Integer Linear Program (MILP) provides significant speedups over existing approaches while obtaining the optimal so- lution. The resulting policy randomizes the agent’s possible strategies, while taking into account the probability distribution over adversary types. Finally, I provide experimental results for all my algorithms, illustrating the new techniques developed have enabled us to find optimal secure policies efficiently for an increasingly important class of security domains. xi Chapter 1 Introduction Security, commonly defined as the ability to deal with intentional threats from other agents is a major challenge for agents deployed in adversarial environments [Sasemas, 2005]. My thesis focuses on adversarial domains where the agents have limited information about the adversaries. 
Such adversarial scenarios arise in a wide variety of situations that are becoming increasingly important such as patrol agents providing security for a group of houses or regions [Carroll et al., 2005; Billante, 2003; Lewis et al., 2005], UA V’s monitoring a humanitarian mission [Beard and McLain, 2003], agents assisting in routine security checks at airports [Poole and Passantino, 2003], agents providing privacy in sensor network routing [Ozturk et al., 2004], realistic game playing bots such as in unreal tournament [Smith et al., 2007] or agents maintaining anonymity in peer to peer networks [Borisov and Waddle, 2005]. In my thesis, I address the problem of planning for agents acting in such uncertain environ- ments while facing security challenges. The common assumption in these security domains is that the agent commits to a plan or policy first while the adversary can observe the agent’s actions and hence knows its plan/policy. The adversary can then exploit the plan or policy the agent com- mitted to. In addition, the agent might have incomplete information about the adversaries. For 1 example, in a typical security domain such as the patrol agents example, agents provide security for a group of houses or regions via patrolling. The patrol agents commit to a plan or policy while the adversaries can observe the patrol routes, learn the patrolling pattern and exploit it to their advantage. Alternatively, via eavesdropping or other means, the adversaries may come to know an agent’s patrolling plan. Furthermore, the agents might not know which adversaries ex- ist and even if they know, they still have uncertain information about which adversary strikes at what time or region. To solve this problem with incomplete information about the adversaries, I provide efficient security algorithms broadly considering two realistic situations: Firstly, when the agents have no model of their adversaries, we wish to minimize the information the adversary has about the agent by randomizing the agent’s policies. Secondly, when the agents have a partial model of their adversaries we maximize the agent’s expected rewards while accounting for the uncertainty in the adversary information. When the agents have no model of their adversaries, I provide efficient algorithms for gener- ating randomized plans or policies for the agents that minimize the information that can be gained by adversaries. In the rest of my thesis I will refer to such randomized policies that attempt to minimize the opponent’s information gain as secure policies. However, arbitrary randomization can violate quality constraints such as: resource usage should be below a certain threshold or key areas must be patrolled with a certain frequency. To that end, I developed algorithms for efficient policy randomization with quality guarantees. Markovian models such as the Markov Decision Problems (MDPs) [Puterman, 1994], Partially Observable Markov Decision Problems (POMDPs) [Cassandra et al., 1994] and Decentralized POMDPs (Dec-POMDPs) [Pynadath and Tambe, 2002; Nair et al., 2003; Emery-Montemerlo et al., 2004] provide general frameworks for reasoning about the environmental uncertainty while enabling development of mathematical 2 frameworks to quantify the randomization of policies. My contributions for the development of efficient randomized policy generation algorithms using these frameworks are as follows: • I provide novel techniques that enable policy randomization in single agents, while attain- ing a certain expected reward threshold. 
I measure randomization via an entropy-based metric (although my techniques are independent of that metric). In particular, I illustrate that simply maximizing entropy-based metrics introduces a non-linear program that has non-polynomial run-time. Hence, I introduce my CRLP (Convex combination for Random- ization) and BRLP (Binary search for Randomization) linear programming (LP) techniques that randomize policies in polynomial time with different tradeoffs as explained later. • I provide a new algorithm, RDR (Rolling Down Randomization), for generating random- ized policies for decentralized POMDPs without communication, given a threshold on the expected team reward loss. RDR starts with a joint deterministic policy for decentralized POMDPs, then iterates, randomizing policies for agents turn-by-turn, keeping policies of all other agents fixed. A key insight in RDR is that given fixed randomized policies for other agents, I can generate a randomized policy via the CRLP or BRLP methods, i.e., my efficient single-agent methods. When the agents have partial model of their adversaries, I model the security domain as a Bayesian Stackelberg game [Conitzer and Sandholm, 2006]. For example, in some patrol do- mains, the patrol agents may need to provide security taking into account the partial information available about the adversary. Here, the agents know the adversary’s actions and payoffs but does not know which adversary is active at a given time. A common approach for choosing policy of the agents in such scenarios is to model them as Bayesian games [Fudenberg and Tirole, 1991]. 3 A Bayesian game is a game in which agents may belong to one or more types; the type of an agent determines its possible actions and payoffs. Usually these games are analyzed according to the concept of Bayes-Nash equilibrium, an extension of Nash equilibrium for Bayesian games in which it is assumed that all the agents choose their strategies simultaneously. However, the main feature of the security games we consider is that one player must commit to a strategy before the other players choose their strategies. In the patrol domain, the patrol agent commits to a strategy first while the adversaries get to observe the agent’s strategy and decide on their choice of action. These scenarios are known as Stackelberg games [Fudenberg and Tirole, 1991] and are com- monly used to model attacker-defender scenarios in security domains [Brown et al., 2006]. Thus, given that multiple agent types exist and the notion of a security agent committing first before the adversaries, I model my security domains as Bayesian Stackelberg games. The solution concept for these games is that the security agent has to pick the optimal strategy considering the actions, payoffs and probability distribution over the adversaries. My contributions for the development of efficient solution techniques for Bayesian Stackelberg games are as follows: • First, I introduced an efficient technique for generating optimal leader strategies with lim- ited randomization for Bayesian Stackelberg games, known as ASAP (Agent Security via Approximate Policies). The advantage of this procedure is that the policies generated are simple and easy to implement in real scenarios since the amount of randomization is con- trolled. I sometimes refer to this in my thesis as a heuristic to differentiate it from the exact procedure described next. 
• Second, I developed an efficient exact procedure, known as DOBSS (Decomposed Optimal Bayesian Stackelberg Solver) using the mathematical framework developed for ASAP. 4 I developed an efficient Mixed Integer Linear Program (MILP) implementation for both DOBSS and ASAP, along with experimental results illustrating significant speedups and higher rewards over other approaches. Having outlined the motivation for my work and key ideas in this section, the rest of my thesis is organized as follows. In the next chapter I describe motivating domains for my work both for single and the multiagent cases. Chapter 3 is devoted to the single agent case. I first introduce the Markov Decision Problem and a LP solution to solve it. I then develop a non-linear program and two approximate linear programming alternatives called the CRLP and the BRLP algorithms for efficient randomized policy generation. Lastly, I provide experimental results illus- trating the runtime-reward tradeoffs of the various approaches developed. Chapter 4 introduces the decentralized POMDP framework to model the multi-agent team domains. It then introduces a new iterative solution called the RDR algorithm for generating randomized policies for agent teams. Finally, it provides an experimental validation of the RDR algorithm and also provides an evaluation of the policies obtained earlier for both the single and multiagent cases against a spe- cific adversary. Chapter 6 introduces the ASAP procedure for generating strategies with limited randomization, first for non-Bayesian games, for clarity; then shows how it can be adapted for Bayesian games with uncertain adversaries. Chapter 7 provides an efficient exact optimal policy calculation procedure for Bayesian Stackelberg games based on the ideas used to develop ASAP in the previous chapter. It then provides experimental evaluation of the ASAP and the DOBSS methods against previously existing procedures. Chapter 8 gives an overview of the work done previously related to my own work and chapter 9 provides a brief summary of my work described in this thesis. 5 Chapter 2 Domains The three sections in this section describe in detail the UA V and the police patrolling domains introduced in the previous chapter. In the first section, I describe the UA V domain for the single agent case. The next section describes the multi-UA V version of the same problem. In both these cases I assume that no model of the adversary is available and hence randomization of polices plays a key role in providing security. The third section describes the police patrolling domain with the assumption that the police have partial model of the robbers. Hence, the police compute their optimal patrolling policy using the available robber information. 2.1 Single agent UA V Patrolling Example Figure 2.1 shows the UA V patrolling example for the single agent case. Consider a UA V agent that is monitoring a humanitarian mission [Twist, 2005]. A typical humanitarian mission can have various activities going on simultaneously such as providing shelter for refugees, providing food for the refugees, transportation of food to the various camps in the mission and many other such activities. Unfortunately these refugee camps become targets of various harmful activities mainly because they are vulnerable. Some of these harmful activities by an adversary/adversaries can 6 Figure 2.1: Single Agent include looting food supplies, stealing equipment, harming refugees themselves and other such activities which are unpredictable. 
Our worst-case assumptions about the adversary are as stated in the Introduction: (a) the adversary has access to UAV policies due to learning or espionage; (b) the adversary eavesdrops on or estimates the UAV observations.

One way of reducing the vulnerability of such humanitarian missions is to have a continuous monitoring activity that deters the adversaries from performing such criminal acts. In practice, such monitoring activities can be handled by using UAVs [Twist, 2005]. To start with, we assume a single UAV is monitoring such a humanitarian mission. For expository purposes, we further assume that the mission is divided into n regions, say region 1, region 2, ..., region n. If the surveillance policy of the UAV is deterministic, e.g., every day at 9am the UAV takes the action survey region 1, at 10am survey region 2, and so on, it is quite easy for the adversaries to know exactly where the UAV will be at a given time without actually seeing the UAV, allowing the adversary ample time to plan and carry out sabotage. On the other hand, if the UAV patrolling policy is randomized, e.g., the UAV surveys region 1 with 60% probability, region 2 with 40% probability, and so on, it becomes difficult for the adversary to predict the UAV's action at a particular time without actually seeing the UAV at that instant. The effect of randomization is that a minimal amount of information is available to the adversary about the agent's actions even though the exact policy the agent follows is known. This uncertainty in the agent's actions deters adversarial actions and, in effect, increases the security of the humanitarian mission. Thus, if the policy is not randomized, the adversary may exploit UAV action predictability in some unknown way, such as jamming UAV sensors, shooting down UAVs or attacking the food convoys. Since little is known about the adversary's ability to cause sabotage, the UAV must maximize the adversary's uncertainty via policy randomization, while ensuring that quality constraints such as resource or frequency constraints are met; these constraints are expressed as a reward function in our domains.

Different regions in the humanitarian mission can have different activities going on. Some of these activities could be critical, such as saving human lives, whereas others might be less important. Hence, it is necessary for the UAV to visit different regions with different frequencies. We achieve this in our domain by assigning a specific weight, or reward function, to each patrol action. This means that the UAV gets a higher reward for patrolling some regions than others. The problem for the UAV is then to find a patrolling policy that maximizes policy randomization while ensuring that a threshold reward is maintained. This threshold reward is set based on the amount of reward the UAV can sacrifice to increase the security of the humanitarian mission.

2.2 Multi-agent UAV Patrolling example

For the multiagent case, we extend the single-agent case to two UAVs. In particular, we introduce a simple UAV team domain that is analogous to the illustrative multiagent tiger domain [Nair et al., 2003] except for an adversary; indeed, to enable replicable experiments, the rewards, transition and observation probabilities from [Nair et al., 2003] are used, the details of which we provide in the experiments section. We assume the adversary is just like the one introduced in the single-agent case.
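As a toy illustration of why randomization helps in the single-UAV setting above, the short Python sketch below (made-up numbers, not an experiment from this thesis) computes the Shannon entropy of a deterministic, a 60/40, and a uniform patrol distribution over two regions; the higher the entropy, the more observations the adversary needs, on average, to pin down the UAV's next action.

```python
import math

def shannon_entropy(probs):
    """Entropy (in bits) of a discrete distribution over patrol actions."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Three example patrol policies over two regions (illustrative numbers only):
deterministic = [1.0, 0.0]   # always survey region 1
skewed        = [0.6, 0.4]   # the 60%/40% policy discussed above
uniform       = [0.5, 0.5]   # maximally random

for name, dist in [("deterministic", deterministic),
                   ("60/40 randomized", skewed),
                   ("uniform", uniform)]:
    print(f"{name:>18}: entropy = {shannon_entropy(dist):.3f} bits")
# deterministic: 0.000 bits, 60/40: 0.971 bits, uniform: 1.000 bits
```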
Consider a region in a humanitarian crisis, where two UA Vs execute daily patrols to monitor safe food convoys. However, these food convoys may be disrupted by landmines placed in their route. The convoys pass over two regions: Left and Right. For simplicity, we assume that only one such landmine may be placed at any point in time, and it may be placed in any of the two regions with equal probability. The UA Vs must destroy the landmine to get a high positive reward whereas trying to destroy a region without a landmine disrupts transportation and creates a high negative reward; but the UA V team is unaware of which region has the landmine. The UA Vs can perform three actions Shoot-left, Sense and Shoot-right but they cannot communicate with each other. We assume that both UA Vs are observed with equal probability by some unknown adversary with unknown capabilities, who wishes to cause sabotage. We make the following worst case assumptions about the adversary (as stated earlier): (a) the adversary has access to 9 UA V policies due to learning or espionage (b) the adversary eavesdrop or estimates the UA V observations. When an individual UA V takes action Sense, it leaves the state unchanged, but provides a noisy observation OR or OL, to indicate whether the landmine is to the left or right. The Shoot- left and Shoot-right actions are used to destroy the landmine, but the landmine is destroyed only if both UA Vs simultaneously take either Shoot-left or Shoot-right actions. Unfortunately, if agents miscoordinate and one takes a Shoot-left and the other Shoot-right they incur a very high negative reward as the landmine is not destroyed but the food-convoy route is damaged. Once the shoot action occurs, the problem is restarted (the UA Vs face a landmine the next day). 2.3 The Police Patrolling Domain Given our assumption that the police know the partial model of the robber we formulate the patrolling domain as a game. The most basic version of this game consists of two players: the se- curity agent (the leader) and the robber (the follower) in a world consisting ofm houses,1...m. The security agent’s set of pure strategies consists of possible routes ofd houses to patrol (in an order). The security agent can choose a mixed strategy so that the robber will be unsure of ex- actly where the security agent may patrol, but the robber will know the mixed strategy the security agent has chosen. For example, the robber can observe over time how often the security agent patrols each area. With this knowledge, the robber must choose a single house to rob. We assume that the robber generally takes a long time to rob a house. If the house chosen by the robber is not on the security agent’s route, then the robber successfully robs the house. Otherwise, if it is on 10 the security agent’s route, then the earlier the house is on the route, the easier it is for the security agent to catch the robber before he finishes robbing it. We model the payoffs for this game with the following variables: • v l,x : value of the goods in housel to the security agent. • v l,q : value of the goods in housel to the robber. • c x : reward to the security agent of catching the robber. • c q : cost to the robber of getting caught. • p l : probability that the security agent can catch the robber at the lth house in the patrol (p l <p l 0 ⇐⇒ l 0 <l). The security agent’s set of possible pure strategies (patrol routes) is denoted by X and in- cludes alld-tuplesi =<w 1 ,w 2 ,...,w d > withw 1 ...w d = 1...m. 
where no two elements are equal (the agent is not allowed to return to the same house). The robber’s set of possible pure strategies (houses to rob) is denoted by Q and includes all integers j = 1...m. The payoffs (security agent, robber) for pure strategiesi,j are: • −v l,x ,v l,q , forj =l / ∈i. • p l c x +(1−p l )(−v l,x ),−p l c q +(1−p l )(v l,q ), forj =l∈i. With this structure it is possible to model many different types of robbers who have differing motivations; for example, one robber may have a lower cost of getting caught than another, or may value the goods in the various houses differently. If the distribution of different robber types is known or inferred from historical data, then the game can be modeled as a Bayesian game 11 [Fudenberg and Tirole, 1991]. Table 6.2 provides an example of a bayesian game derived using the notation described above. More details about Bayesian games are provided in later chapters. 12 Chapter 3 Single Agent Security Problem Markovian models such as the Markov Decision Problems (MDPs), Partially Observable Markov Decision Problems (POMDPs) and Decentralized POMDPs (Dec-POMDPs) are now popular frameworks for building agent and agent teams [Pynadath and Tambe, 2002; Cassandra et al., 1994; Paquet et al., 2005; Emery-Montemerlo et al., 2004]. The basic assumptions of these models are that the agent/agent-teams are acting in accessible or inaccessible stochastic environ- ments with a known transition model. There are many domains in the real world where such agent/agent-team have to act in adversarial environments. In these adversarial domains, agent and agent teams based on single-agent or decentralized (PO)MDPs should randomize policies in order to avoid action predictability. Such policy randomization is crucial for security in domains where we cannot explicitly model our adversary’s actions and capabilities or its payoffs, but the adversary observes our agents’ actions and exploits any action predictability in some unknown fashion. Consider agents that schedule security inspections, maintenance or refueling at seaports or airports. Adversaries may be unobserved terrorists with unknown capabilities and actions, who can learn the schedule from observations. If the schedule is deterministic, then these adversaries may exploit schedule 13 predictability to intrude or attack and cause tremendous unanticipated sabotage. Alternatively, as mentioned in the previous chapter, consider a team of UA Vs (Unmanned Aerial Vehicles) [Beard and McLain, 2003] monitoring a region undergoing a humanitarian crisis. Adversaries may be humans intent on causing some significant unanticipated harm — e.g. disrupting food convoys, harming refugees or shooting down the UA Vs — the adversary’s capabilities, actions or payoffs are unknown and difficult to model explicitly. However, the adversaries can observe the UA Vs and exploit any predictability in UA V surveillance, e.g. engage in unknown harmful actions by avoiding the UA Vs’ route. Therefore, such patrolling UA V agent/agent-team need to randomize their patrol paths to avoid action predictability [Lewis et al., 2005]. One feature of the above mentioned domains is that the actions the agents take can be stochas- tic and hence contribute towards randomization. For example, in the airport scenario, airplanes can get delayed with a small probability. Common activities like refueling or maintainance can take more time than expected. 
This is also true of the patrolling scenario where a patrol agent can end up patrolling a certain house at a given time rather than an intended house due to real world uncertainties. However given that we intend to develop secure policies, we randomize the actions of the agents explicitly rather than rely on the randomization inherently present within any action like the delay in schedule of airplanes. Thus, our key assumption is that predictability of actions is itself inherently dangerous (because the adversary can exploit this predictability), regardless of the uncertainty in the outcome of the actions. (This is indeed why for example patrols in the real-world may be randomized, as discussed in chapter 8.) While we cannot explicitly model the adversary’s actions, capabilities or payoffs, in order to ensure security of the agent/agent-team we make two worst case assumptions about the adversary. (We show later that a weaker adversary, i.e. one who fails to satisfy these assumptions, will in 14 general only lead to enhanced security.) The first assumption is that the adversary can estimate the agent’s state or belief state. In fully observable domains, the adversary estimates the agent’s state to be the current world state which both can observe fully. If the domain is partially observable, we assume that the adversary estimates the agent’s belief states, because: (i) the adversary eaves- drops or spies on the agent’s sensors such as sonar or radar (e.g., UA V/robot domains); or (ii) the adversary estimates the most likely observations based on its model of the agent’s sensors; or (iii) the adversary is co-located and equipped with similar sensors. The second assumption is that the adversary knows the agents’ policy, which it may do by learning over repeated observations or obtaining this policy via espionage or other exploitation. Thus, we assume that the adversary may have enough information to predict the agents’ ac- tions with certainty if the agents followed a deterministic policy. Hence, our work maximizes policy randomization to thwart the adversary’s prediction of the agent’s actions based on the agent’s state and minimize adversary’s ability to cause harm. Unfortunately, while randomized policies are created as a side effect [Altman, 1999] and turn out to be optimal in some stochas- tic games [Littman, 1994; Koller and Pfeffer, 1995], little attention has been paid to intention- ally maximizing randomization of agents’ policies even for single agents. Obviously, simply randomizing an MDP/POMDP policy can degrade an agent’s expected rewards, and thus we face a randomization-reward tradeoff problem: how to randomize policies with only a limited loss in expected rewards. Indeed, gaining an explicit understanding of the randomization-reward tradeoff requires new techniques for policy generation rather than the traditional single-objective maximization techniques. However, generating policies that provide appropriate randomization- reward tradeoffs is difficult, a difficulty that is exacerbated in agent teams based on decentralized MDPs/POMDPs, as randomization may create miscoordination. 15 In particular, my thesis provides two key contributions to generate randomized policies: (1) I provide novel techniques that enable policy randomization in single agents, while attaining a cer- tain expected reward threshold. I measure randomization via an entropy-based metric (although our techniques are not dependent on that metric). 
In particular, I illustrate that simply maximiz- ing entropy-based metrics introduces a non-linear program that does not guarantee polynomial run-time. Hence, I introduce my CRLP (Convex combination for Randomization) and BRLP (Bi- nary search for Randomization) linear programming (LP) techniques that randomize policies in polynomial time with different tradeoffs as explained later. (2) I provide a new algorithm, RDR (Rolling Down Randomization), for generating randomized policies for decentralized POMDPs without communication, given a threshold on the expected team reward loss. RDR starts with a joint deterministic policy for decentralized POMDPs, then iterates, randomizing policies for agents turn-by-turn, keeping policies of all other agents fixed. A key insight in RDR is that given fixed randomized policies for other agents, I can generate a randomized policy via CRLP or BRLP, i.e., my efficient single-agent methods. The motivation for use of entropy-based metrics to randomize our agents’ policies stems from information theory. It is well known that the expected number of probes (e.g., observations) needed to identify the outcome of a distribution is bounded below by the entropy of that distribu- tion [Shannon, 1948; Wen, 2005]. Thus, by increasing policy entropy, we force the opponent to execute more probes to identify the outcome of our known policy, making it more difficult for the opponent to anticipate our agent’s actions and cause harm. In particular, in our (PO)MDP setting, the conflict between the agents and the adversary can be interpreted as a game, in which the agents generate a randomized policy above a given expected reward threshold; the adversary knows the agent’s policy and the adversary’s action is to guess the exact action of the agent/agent-team by 16 probing. For example, in the UA V setting, given our agent’s randomized policy, the adversary generates probes to determine the direction our UA V is headed from a given state. Thus, in the absence of specific knowledge of the adversary, we can be sure to increase the average number of probes the adversary uses by increasing the lower bound given by the entropy of the policy distribution at every state. In the rest of this chapter we focus on randomizing single agent MDP policies, i.e., a single MDP-based UA V agent is monitoring a troubled region, where the UA V gets rewards for sur- veying various areas of the region, but as discussed above, security requires it to randomize its monitoring strategies to avoid predictability. The case of multi-UA V teams will be discussed in the next chapter. 3.1 Markov Decision Problems(MDP) An MDP is a model of an agent interacting with a world. As shown in Figure 3.1, the agent takes as input the state of the world and generates as output actions, which themselves affect the state of the world. In the MDP framework, it is assumed that, although there may be great deal of uncertainty about the effects of an agent’s actions, there is never any uncertainty about the agent’s current state – it has complete and perfect perceptual abilities. 3.1.1 Basic Framework An MDP is denoted as a tuplehS,A,P,Ri, where S is a set of world states{s 1 ,...,s m }; A the set of actions{a 1 ,...,a k }; 17 Figure 3.1: Markov Decision Process P the set of tuplesp(s,a,j) representing the transition function and R the set of tuplesr(s,a) denoting the immediate reward for taking action a in state s. 
If $x(s,a)$ represents the expected number of times the MDP visits state $s$ and takes action $a$, and $\alpha_j$ represents the number of times the MDP starts in each state $j \in S$, then the optimal policy, maximizing expected reward, is derived via the following linear program [Dolgov and Durfee, 2003b]:

$$\begin{array}{ll} \max & \displaystyle\sum_{s \in S}\sum_{a \in A} r(s,a)\,x(s,a) \\ \text{s.t.} & \displaystyle\sum_{a \in A} x(j,a) - \sum_{s \in S}\sum_{a \in A} p(s,a,j)\,x(s,a) = \alpha_j \quad \forall j \in S \\ & x(s,a) \geq 0 \quad \forall s \in S,\ a \in A \end{array} \tag{3.1}$$

If $x^*$ is the optimal solution to (3.1), the optimal policy $\pi^*$ is given by (3.2) below, where $\pi^*(s,a)$ is the probability of taking action $a$ in state $s$. It turns out that $\pi^*$ is deterministic and uniformly optimal regardless of the initial distribution $\{\alpha_j\}_{j \in S}$ [Dolgov and Durfee, 2003b], i.e., $\pi^*(s,a)$ has a value of either 0 or 1. However, such deterministic policies are undesirable in domains like our UAV example.

$$\pi^*(s,a) = \frac{x^*(s,a)}{\sum_{\hat{a} \in A} x^*(s,\hat{a})}. \tag{3.2}$$

3.2 Randomness of a policy

We borrow from information theory the concept of the entropy of a set of probability distributions to quantify the randomness, or information content, in a policy of the MDP. For a discrete probability distribution $p_1,\ldots,p_n$ the only function, up to a multiplicative constant, that captures the randomness is the entropy, given by the formula $H = -\sum_{i=1}^{n} p_i \log p_i$ [Shannon, 1948]. We now introduce the weighted entropy function, borrowed from [Shannon, 1948], where it is defined for a Markoff process (an equivalent term for the Markov process we define here). The weighted entropy function is used to quantify the randomness in a policy $\pi$ of an MDP and to express it in terms of the underlying frequency $x$. Note from the definition of a policy $\pi$ in (3.2) that for each state $s$ the policy defines a probability distribution over actions. The weighted entropy is defined by adding the entropy of the distribution at every state, weighted by the likelihood that the MDP visits that state, namely

$$H_W(x) = -\sum_{s \in S} \frac{\sum_{\hat{a} \in A} x(s,\hat{a})}{\sum_{j \in S} \alpha_j} \sum_{a \in A} \pi(s,a)\log\pi(s,a) = -\frac{1}{\sum_{j \in S}\alpha_j}\sum_{s \in S}\sum_{a \in A} x(s,a)\log\frac{x(s,a)}{\sum_{\hat{a} \in A} x(s,\hat{a})}.$$

We note that the randomization approach we propose works for alternative functions of the randomness, yielding similar results. For example, we can define an additive entropy taking a simple sum of the individual state entropies as follows:

$$H_A(x) = -\sum_{s \in S}\sum_{a \in A} \pi(s,a)\log\pi(s,a) = -\sum_{s \in S}\sum_{a \in A} \frac{x(s,a)}{\sum_{\hat{a} \in A} x(s,\hat{a})} \log\frac{x(s,a)}{\sum_{\hat{a} \in A} x(s,\hat{a})}.$$

We now present three algorithms to obtain random solutions that maintain an expected reward of at least $E_{min}$ (a certain fraction of the maximal expected reward $E^*$ obtained by solving (3.1)). These algorithms result in policies that, in our UAV-type domains, enable an agent to get a sufficiently high expected reward, e.g., surveying enough area, using randomized flying routes to avoid predictability.

3.2.1 Maximal entropy solution

We can obtain policies with maximal entropy but a threshold expected reward by replacing the objective of Problem (3.1) with the definition of the weighted entropy $H_W(x)$. The reduction in expected reward can be controlled by enforcing that feasible solutions achieve at least a certain expected reward $E_{min}$. The following problem maximizes the weighted entropy while maintaining the expected reward above $E_{min}$:

$$\begin{array}{ll} \max & -\displaystyle\frac{1}{\sum_{j \in S}\alpha_j}\sum_{s \in S}\sum_{a \in A} x(s,a)\log\frac{x(s,a)}{\sum_{\hat{a} \in A} x(s,\hat{a})} \\ \text{s.t.} & \displaystyle\sum_{a \in A} x(j,a) - \sum_{s \in S}\sum_{a \in A} p(s,a,j)\,x(s,a) = \alpha_j \quad \forall j \in S \\ & \displaystyle\sum_{s \in S}\sum_{a \in A} r(s,a)\,x(s,a) \geq E_{min} \\ & x(s,a) \geq 0 \quad \forall s \in S,\ a \in A \end{array} \tag{3.3}$$

$E_{min}$ is an input domain parameter (e.g., a UAV mission specification).
Alternatively, if E ∗ denotes the maximum expected reward from (1), then by varying the expected reward threshold E min ∈ [0,E ∗ ] we can explore the tradeoff between the achievable expected reward and entropy, and then select the appropriate E min . Note that for E min = 0 the above problem finds the maximum weighted entropy policy, and for E min = E ∗ , Problem (3.3) returns the maximum expected reward policy with largest entropy. Solving Problem (3.3) is our first algorithm to obtain a randomized policy that achieves at leastE min expected reward (Algorithm 1). Algorithm 1 MAX-ENTROPY(E min ) 1: Solve Problem (3.3) withE min , letx Emin be optimal solution 2: returnx Emin (maximal entropy, expected reward≥E min ) Unfortunately entropy-based functions like H W (x) are neither convex nor concave in x, hence there are no complexity guarantees in solving Problem (3.3), even for local optima [Vava- sis, 1991]. This negative complexity motivates the polynomial methods presented next. 21 3.3 Efficient single agent randomization The idea behind these polynomial algorithms is to efficiently solve problems that obtain policies with a high expected reward while maintaining some level of randomness. (A very high level of randomness implies a uniform probability distribution over the set of actions out of a state, whereas a low level would mean deterministic action being taken from a state). We then obtain a solution that meets a given minimal expected reward value by adjusting the level of randomness in the policy. The algorithms that we introduce in this section consider two inputs: a minimal expected reward valueE min and a randomized solution ¯ x (or policy ¯ π). The input ¯ x can be any solution with high entropy and is used to enforce some level of randomness on the high expected reward output, through linear constraints. For example, one such high entropy input for MDP- based problems is the uniform policy, where ¯ π(s,a) = 1/|A|. Note that uniform policies need not always lead to highest entropy policies for an MDP. A simple example can be an MDP where every state has two actions say left and right. For the start state we assume that action left leads to end state but action right leads to a state which has two more actions and the tree is symmetric from thereafter. In this case a totally uniform policy gives lower entropy than a policy which deterministically chooses action right in the start state and is uniform thereafter. We enforce the amount of randomness in the high expected reward solution that is output through a parameter β∈ [0,1]. For a givenβ and a high entropy solution ¯ x, we output a maximum expected reward solution with a certain level of randomness by solving (3.4). 22 max X s∈S X a∈A r(s,a)x(s,a) s.t. X a∈A x(j,a)− X s∈S X a∈A p(s,a,j)x(s,a) =α j ∀j∈S x(s,a)≥β¯ x(s,a) ∀s∈S,a∈A. (3.4) which can be referred to in matrix shorthand as max r T x s.t. Ax =α x≥β¯ x. As the parameterβ is increased, the randomness requirements of the solution become stricter and hence the solution to (3.4) would have smaller expected reward and higher entropy. For β = 0 the above problem reduces to (3.1) returning the maximum expected reward solutionE ∗ ; and forβ = 1 the problem obtains the maximal expected reward (denotedE) out of all solutions with as much randomness as ¯ x. If E ∗ is finite, then Problem (3.4) returns ¯ x for β = 1 and E = P s∈S P a∈A r(s,a)¯ x(s,a). 
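To make Problems (3.1) and (3.4) concrete, here is a minimal sketch in Python using scipy.optimize.linprog. The two-state, two-action MDP, its sub-stochastic transition probabilities (the episode terminates with probability 0.1 per step, keeping the flow variables finite), the rewards and the start distribution are all invented for illustration; the code simply mirrors the flow-conservation constraints above and the $x \geq \beta\bar{x}$ constraint of (3.4), taking $\bar{x}$ as the occupation measure of the uniform policy. It is a sketch under those assumptions, not the implementation used in the thesis.

```python
import numpy as np
from scipy.optimize import linprog

# Tiny illustrative MDP (all numbers invented): 2 states, 2 actions.
S, A = 2, 2
# p[s, a, j]: probability of moving to state j from state s under action a.
# Rows sum to 0.9, i.e. the episode terminates with probability 0.1 per step.
p = np.array([[[0.8, 0.1], [0.2, 0.7]],
              [[0.4, 0.5], [0.1, 0.8]]])
r = np.array([[1.0, 0.2],      # r[s, a]: immediate rewards
              [0.0, 2.0]])
alpha = np.array([1.0, 0.0])   # start-state frequencies

# Flow-conservation constraints of (3.1):
#   sum_a x(j,a) - sum_{s,a} p(s,a,j) x(s,a) = alpha_j   for every j.
A_eq = np.zeros((S, S * A))
for j in range(S):
    for s in range(S):
        for a in range(A):
            A_eq[j, s * A + a] = (1.0 if s == j else 0.0) - p[s, a, j]
c = -r.flatten()               # linprog minimizes, so negate the reward

def weighted_entropy(x_flat):
    """H_W(x) as defined in Section 3.2, with a small epsilon guarding log(0)."""
    x = x_flat.reshape(S, A)
    visits = x.sum(axis=1)
    pi = x / np.maximum(visits[:, None], 1e-12)
    state_entropy = -(pi * np.log(np.maximum(pi, 1e-12))).sum(axis=1)
    return (visits * state_entropy).sum() / alpha.sum()

# Problem (3.1): maximum expected reward, deterministic optimal policy.
res = linprog(c, A_eq=A_eq, b_eq=alpha, bounds=[(0, None)] * (S * A))
x_star = res.x
print("E* =", -res.fun, " H_W(x*) =", weighted_entropy(x_star))

# x_bar: occupation measure of the uniform policy, from its flow equations.
pi_bar = np.full((S, A), 1.0 / A)
P_bar = np.einsum('sa,saj->sj', pi_bar, p)        # state-to-state matrix under pi_bar
y = np.linalg.solve(np.eye(S) - P_bar.T, alpha)   # expected state visit counts
x_bar = (pi_bar * y[:, None]).flatten()

# Problem (3.4): maximize reward subject to x >= beta * x_bar.
beta = 0.5
bounds = [(beta * x_bar[k], None) for k in range(S * A)]
res_beta = linprog(c, A_eq=A_eq, b_eq=alpha, bounds=bounds)
x_beta = res_beta.x
pi_beta = x_beta.reshape(S, A) / x_beta.reshape(S, A).sum(axis=1, keepdims=True)
print("reward(beta=0.5) =", -res_beta.fun, " H_W =", weighted_entropy(x_beta))
print("randomized policy via (3.2):\n", pi_beta)
```

As expected from the discussion above, increasing beta in this sketch trades expected reward for entropy: the reward of x_beta moves from E* toward the uniform policy's reward while the weighted entropy typically rises.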
Our second algorithm to obtain an efficient solution with a expected reward requirement of E min is based on the following result which shows that the solution to (3.4) is a convex combina- tion of the deterministic and highly random input solutions. Theorem 1 Consider a solution ¯ x, which satisfiesA¯ x =α and ¯ x≥ 0. Letx ∗ be the solution to (3.1) andβ∈ [0,1]. Ifx β is the solution to (3.4) thenx β = (1−β)x ∗ +β¯ x. 23 proof: We reformulate problem (3.4) in terms of the slackz =x−β¯ x of the solutionx overβ¯ x leading to the following problem : βr T ¯ x+ max r T z s.t. Az = (1−β)α z≥ 0, The above problem is equivalent to (3.4), where we used the fact that A¯ x = α. Let z ∗ be the solution to this problem, which shows that x β = z ∗ +β¯ x. Dividing the linear equation Az = (1−β)α, by(1−β) and substitutingu =z/(1−β) we recover the deterministic problem (3.1) in terms of u, withu ∗ as the optimal deterministic solution. Renaming variable u to x, we obtain 1 1−β z ∗ =x ∗ , which concludes the proof. Since x β = (1−β)x ∗ +β¯ x, we can directly find a randomized solution which obtains a target expected reward ofE min . Due to the linearity in relationship betweenx β andβ, a linear relationship exists between the expected reward obtained byx β (i.er T x β ) andβ. In fact setting β = r T x ∗ −E min r T x ∗ −r T ¯ x makes r T x β = E min . We now present below algorithm CRLP based on the observations made aboutβ andx β . Algorithm 2 CRLP(E min ,¯ x) 1: Solve Problem (3.1), letx ∗ be the optimal solution 2: Setβ = r T x ∗ −Emin r T x ∗ −r T ¯ x 3: Setx β = (1−β)x ∗ +β¯ x 4: returnx β (expected reward=E min , entropy based onβ¯ x) Algorithm CRLP is based on a linear program and thus obtains, in polynomial time, solutions to problem(3.4) with expected reward valuesE min ∈ [E,E ∗ ]. Note that Algorithm CRLP might 24 unnecessarily constrain the solution set as Problem (3.4) implies that at least β P a∈A ¯ x(s,a) flow has to reach each states. This restriction may negatively impact the entropy it attains, as experimentally verified in Section 5. This concern is addressed by a reformulation of Problem (3.4) replacing the flow constraints by policy constraints at each stage. For a givenβ∈ [0,1] and a solution ¯ π (policy calculated from ¯ x), this replacement leads to the following linear program max X s∈S X a∈A r(s,a)x(s,a) s.t. X a∈A x(j,a)− X s∈S X a∈A p(s,a,j)x(s,a) =α j , ∀j∈S x(s,a)≥β¯ π(s,a) X b∈A x(s,b), ∀s∈S,a∈A. (3.5) Forβ = 0 this problem reduces to (3.1) returningE ∗ , forβ = 1 it returns a maximal expected reward solution with the same policy as ¯ π. This means that for β at values 0 and 1, problems (3.4) and (3.5) obtain the same solution if policy ¯ π is the policy obtained from the flow function ¯ x. However, in the intermediate range of 0 to 1 forβ, the policy obtained by problems (3.4) and (3.5) are different even if ¯ π is obtained from ¯ x. Thus, theorem 1 holds for problem (3.4) but not for (3.5). Table 3.1, obtained experimentally, validates our claim by showing the maximum expected rewards and entropies obtained (entropies in parentheses) from problems (3.4) and (3.5) for various settings ofβ, e.g. forβ = 0.4, problem (3.4) provides a maximum expected reward of 26.29 and entropy of 5.44, while problem (3.5) provides a maximum expected reward of 25.57 and entropy of 6.82. 
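Algorithm 2 (CRLP) reduces to a few lines once Problem (3.1) has been solved. The sketch below reuses the variables r, x_star, x_bar and weighted_entropy from the toy example given earlier; those names are carried over from that illustrative sketch, not from the thesis code, and the snippet only illustrates the convex-combination step justified by Theorem 1.

```python
def crlp(E_min, x_star, x_bar, r_flat):
    """Algorithm 2 (CRLP): convex combination of the deterministic optimum x*
    and a high-entropy flow x_bar, hitting the target expected reward E_min."""
    E_star = r_flat @ x_star          # maximal expected reward E*
    E_bar = r_flat @ x_bar            # expected reward of the high-entropy flow
    # beta from Theorem 1 so that r^T x_beta = E_min (clipped to [0, 1]).
    beta = np.clip((E_star - E_min) / (E_star - E_bar), 0.0, 1.0)
    return (1.0 - beta) * x_star + beta * x_bar, beta

# Reusing r, x_star, x_bar and weighted_entropy from the sketch above:
x_target, beta = crlp(E_min=0.9 * (r.flatten() @ x_star),
                      x_star=x_star, x_bar=x_bar, r_flat=r.flatten())
print("beta =", beta, " reward =", r.flatten() @ x_target,
      " H_W =", weighted_entropy(x_target))
```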
Table 3.1 shows that for the same value of β in Problems (3.4) and (3.5) we get different maximum expected rewards and entropies implying that the optimal policies for both problems are different, hence Theorem 1 does not hold for (3.5). Indeed, while the expected reward of 25 Problem (3.4) is higher for this example, its entropy is lower than Problem (3.5). Hence to investigate another randomization-reward tradeoff point, we introduce our third algorithm BRLP, which uses problem (3.5) to perform a binary search to attain a policy with expected reward E min ∈ [E,E ∗ ], adjusting the parameterβ. Beta .2 .4 .6 .8 Problem(3.4) 29.14 (3.10) 26.29 (5.44) 23.43 (7.48) 20.25 (9.87) Problem(3.5) 28.57 (4.24) 25.57 (6.82) 22.84 (8.69) 20.57 (9.88) Table 3.1: Maximum expected rewards(entropies) for variousβ Algorithm 3 BRLP(E min ,¯ x) 1: Setβ l = 0,β u = 1, andβ = 1/2. 2: Obtain ¯ π from ¯ x 3: Solve Problem (3.5), letx β andE(β) be the optimal solution and expected reward value returned 4: while|E(β)−E min |> do 5: ifE(β)>E min then 6: Setβ l =β 7: else 8: Setβ u =β 9: end if 10: β = βu+β l 2 11: Solve Problem (3.5), letx β andE(β) be the optimal solution and expected reward value returned 12: end while 13: returnx β (expected reward=E min ±, entropy related toβ¯ x) Given input ¯ x, algorithm BRLP runs in polynomial time, since at each iteration it solves an LP and for tolerance of, it takes at mostO E(0)−E(1) iterations to converge (E(0) and E(1) expected rewards correspond to 0 and 1 values ofβ). 26 3.4 Incorporating models of the adversary Throughout this thesis, we set ¯ x based on uniform randomization ¯ π = 1/|A|. By manipulating ¯ x, we can accommodate the knowledge of the behavior of the adversary. For instance, if the agent knows that a specific states cannot be targeted by the adversary, then ¯ x for that state can have all values 0, implying that no entropy constraint is necessary. In such cases, ¯ x will not be a complete solution for the MDP but rather concentrate on the sets of states and actions that are under risk of attack. For ¯ x that do not solve the MDP, Theorem 1 does not hold and therefore Algorithm CRLP is not valid. In this case, a high-entropy solution that meets a target expected reward can still be obtained via Algorithm BRLP. 3.5 Applying to POMDPs Before turning to agent teams next, we quickly discuss applying these algorithms in single agent POMDPs [Cassandra et al., 1994; Kaelbling et al., 1998]. A POMDP can be represented as a tuple< S,A,T,ω,O,R >, where S is a finite set of states; A is a finite set of actions; T(s, a, s’) provides the probability of transitioning from state s to s’ when taking action a;ω is a finite set of observations; O(s’, a, o) is probability of observing o after taking an action a and reaching s’; R(s, a) is the reward function. Figure 3.2 shows a POMDP whole control is decomposed into two parts. The agent in this model makes observations and generates actions. It keeps an internal belief state b, that summarizes its previous experience. The component labeled SE is the state estimator: it is responsible for updating the belief state based on the last action, the current observation, and the previous belief state. The component labeledπ is the policy: as before in 27 the MDP, it is responsible for generating actions, but this time as a function of the agent’s belief state rather than the state of the world. 
Figure 3.2: Partially Observable Markov Decision Process For a single-agent finite-horizon POMDPs with known starting belief states [Paquet et al., 2005], we convert the POMDP to (finite horizon) belief MDP, allowing BRLP/CRLP to be ap- plied; returning a randomized policy. However, addressing unknown starting belief states is an issue for future work. 28 Chapter 4 Multi Agent Security Problem In previous chapter I developed efficient security algorithms for the single agent case. In this chapter I develop an efficient security algorithm for the multiagent case using the single agent security algorithms developed earlier. I assume that the agent team acts in a partially observable environment and hence I first introduce the decentralized POMDP framework to model the agent team. I then describe the procedure to obtain randomized policies in agent teams. 4.1 From Single Agent to Agent Teams We use notation from MTDP (Multiagent Team Decision Problem) [Pynadath and Tambe, 2002] for our decentralized POMDP model; other models are equivalent [Bernstein et al., 2000]. Given a team ofn agents, an MTDP is defined as a tuple:hS,A,P,Ω,O,Ri, where, • S is a finite set of world states{s 1 ,...,s m }. • A =× 1≤i≤n A i , whereA 1 ,...,A n , are the sets of action for agents 1 ton. A joint action is represented asha 1 ,...,a n i. 29 • P(s i ,ha 1 ,...,a n i,s f ), the transition function, represents the probability that the current state iss f , if the previous state iss i and the previous joint action isha 1 ,...,a n i. • Ω =× 1≤i≤n Ω i is the set of joint observations whereΩ i is the set of observations for agents i. • O(s,ha 1 ,...,a n i,ω), the observation function, represents the probability of joint obser- vation ω∈ Ω, if the current state is s and the previous joint action isha 1 ,...,a n i. We assume that observations of each agent are independent of each other’s observations, i.e. O(s,ha 1 ,...,a n i,ω) =O 1 (s,ha 1 ,...,a n i,ω 1 )·...·O n (s,ha 1 ,...,a n i,ω n ). • The agents receive a single, immediate joint rewardR(s,ha 1 ,...,a n i). For deterministic policies, each agent i chooses its actions based on its policy, Π i , which maps its observation history to actions. Thus, at timet, agenti will perform actionΠ i (~ ω t i ) where ~ ω t i = ω 1 i ,.....,ω t i . Π =hΠ 1 ,.....,Π n i refers to the joint policy of the team of agents. In this model, execution is distributed but planning is centralized; and agents don’t know each other’s observations and actions at run time. Unlike previous work, in our work, policies are randomized and hence agents obtain a prob- ability distribution over a set of actions rather than a single action. Furthermore, this probability distribution is indexed by a sequence of action-observation tuples rather than just observations, since observations do not map to unique actions. Thus in MTDP, a randomized policy mapsΨ t i to a probability distribution over actions, whereΨ t i =hψ 1 i ,...,ψ t i i andψ t i =ha t−1 i ,ω t i i. Thus, at timet, agenti will perform an action selected randomly based on the probability distribution 30 returned by Π i (Ψ t i ). Furthermore we denote the probability of an individual action under pol- icy Π i givenΨ t i asP Π i (a t i |Ψ t i ). However, there are many problems in randomizing an MTDP policy. 1. Existing algorithms for MTDPs produce deterministic policies. New algorithms need to be designed to specifically produce randomized policies. 2. 
Randomized policies in team settings may lead to miscoordination, unless policies are generated carefully, as explained in the section below. 3. Efficiently generating randomized policies is a key challenge as search space for random policies is larger than for deterministic policies. 4.2 Multiagent Randomization Let p i be the probability of adversary targeting agent i, and H W (i) be the weighted entropy for agent i’s policy. We design an algorithm that maximizes the multiagent weighted entropy, given by P n i=1 p i ∗H W (i), in MTDPs while maintaining the team’s expected joint reward above a threshold. Unfortunately, generating optimal policies for decentralized POMDPs is of higher complexity (NEXP-complete) than single agent MDPs and POMDPs [Bernstein et al., 2000], i.e., MTDP presents a fundamentally different class where we cannot directly use the single agent randomization techniques. Hence, to exploit efficiency of algorithms like BRLP or CRLP, we convert the MTDP into a single agent POMDP, but with a method that changes the state space considered. To this end, our new iterative algorithm called RDR (Rolling Down Randomization) iterates through finding the best randomized policy for one agent while fixing the policies for all other agents — we show 31 that such iteration of fixing the randomized policies of all but one agent in the MTDP leads to a single agent problem being solved at each step. Thus, each iteration can be solved via BRLP or CRLP. For a two agent case, we fix the policy of agenti and generate best randomized policy for agentj and then iterate with agentj’s policy fixed. Overall RDR starts with an initial joint deterministic policy calculated in the algorithm as a starting point. Assuming this fixed initial policy as providing peak expected reward, the algorithm then rolls down the reward, randomizing policies turn-by-turn for each agent. Rolling down from such an initial policy allows control of the amount of expected reward loss from the given peak, in service of gaining entropy. The key contribution of the algorithm is in the rolling down procedure that gains entropy (randomization), and this procedure is independent of how the initial policy for peak reward is determined. The initial policy may be computed via algorithms such as [Hansen et al., 2004] that determine a global optimal joint policy (but at a high cost) or from random restarts of algorithms that compute a locally optimal policy [Nair et al., 2003; Emery-Montemerlo et al., 2004], that may provide high quality policies at lower cost. The amount of expected reward to be rolled down is input to RDR. RDR then achieves the rolldown in 1/d steps whered is an input parameter. The turn-by-turn nature of RDR suggests some similarities to JESP [Nair et al., 2003], which also works by fixing the policy of one agent and computing the best-response policy of the second and iterating. However, there are significant differences between RDR and JESP, as outlined below: • JESP uses conventional value iteration based techniques whereas RDR creates randomized policies via LP formulations. 32 • RDR defines a new extended state and hence the belief-update, transition and reward func- tions undergo a major transformation. • The d parameter is newly introduced in RDR to control number of rolldown steps. • RDR climbs down from a given optimal solution rather than JESP’s hill-climbing up solu- tion. 4.3 RDR Details For expository purposes, we use a two agent domain, but we can easily generalize ton agents. 
We fix the policy of one agent (say agent 2), which enables us to create a single agent POMDP if agent 1 uses an extended state, i.e. at each timet, agent 1 uses an extended statee t 1 =hs t ,Ψ t 2 i. Here,Ψ t 2 is as introduced in the previous section. By usinge t 1 as agent 1’s state at timet, given the fixed policy of agent 2, we can define a single-agent POMDP for agent 1 with transition and observation function as follows. P 0 (e t 1 ,a t 1 ,e t+1 1 ) =P(hs t+1 ,Ψ t+1 2 i|hs t ,Ψ t 2 i,a t 1 ) =P(ω t+1 2 |s t+1 ,a t 2 ,Ψ t 2 ,a t 1 ) ·P(s t+1 |s t ,a t 2 ,Ψ t 2 ,a t 1 )·P(a t 2 |s t ,Ψ t 2 ,a t 1 ) =P(s t ,ha t 2 ,a t 1 i,s t+1 )·O 2 (s t+1 ,ha t 2 ,a t 1 i,ω t+1 2 ) ·P Π2 (a t 2 |Ψ t 2 ) (4.1) O 0 (e t+1 1 ,a t 1 ,ω t+1 1 ) = Pr(ω t+1 1 |e t+1 1 ,a t 1 ) =O 1 (s t+1 ,ha t 2 ,a t 1 i,ω t+1 1 ) (4.2) 33 Thus, we can create a belief state for agenti in the context ofj’s fixed policy by maintaining a distribution over e t i =hs t ,Ψ t j i. Figure 4.1 shows three belief states for agent 1 in the UA V domain. For instanceB 2 shows probability distributions overe 2 1 . Ine 2 1 = (LefthSense,OLi), Left implies landmine to the left is the current state, Sense is the agent 2’s action at time 1, OL (Observe Left) is agent 2’s observation at time 2. The belief update rule derived from the tran- sition and observation functions is given in 4.3, where denominator is the transition probability when actiona 1 from belief stateB t 1 results inω t+1 1 being observed. Immediate rewards for the belief states are assigned using 4.4. Figure 4.1: Belief Tree for UA V team domain generated by RDR B t+1 1 (e t+1 1 ) = X s t B t 1 (e t 1 )·P(s t ,(a t 1 ,a t 2 ),s t+1 )·P Π2 (a t 2 |Ψ t 2 ) ·O 2 (s t+1 ,(a t 1 ,a t 2 ),ω t+1 2 )·O 1 (s t+1 ,(a t 1 ,a t 2 ),ω t+1 1 ) /P(ω t+1 1 |B t 1 ,a 1 ) (4.3) 34 <(a t 1 ,B t 1 ) = X e t 1 B t 1 (e t 1 )· X a t 2 R(s t ,(a t 1 ,a t 2 ))·P Π2 (a t 2 |Ψ t 2 ) (4.4) Thus, RDR’s policy generation implicitly coordinates the two agents, without communication or a correlation device. Randomized actions of one agent are planned taking into account the impact of randomized actions of its teammate on the joint reward. Algorithm 4 presents the pseudo-code for RDR for two agents, and shows how we can use beliefs over extended statese t i to construct LPs that maximize entropy while maintaining a certain expected reward threshold. The inputs to this algorithm are the parameters d, percentdec and ¯ x. The parameter d specifies the number of iterations taken to solve the problem. It also decides the amount of reward that can be sacrificed at each step of the algorithm for improving entropy. Parameter percentdec specifies the percentage of expected reward the agent team forgoes for improving entropy. As with Algorithm 2, parameter ¯ x provides an input solution with high randomness; and it is obtained using a uniform policy as discussed in Section 3.3. Step 1 of the algorithm uses the Compute joint policy() function which returns a optimal joint deterministic policy from which the individual policies and the expected joint reward are extracted. Obtaining this initial policy is not RDR’s main thrust, and could be a global optimal policy obtained via [Hansen et al., 2004] or a very high quality policy from random restarts of local algorithms such as [Nair et al., 2003; Emery-Montemerlo et al., 2004]. The input variable percentdec varies between 0 to 1 denoting the percentage of the expected reward that can be sacrificed by RDR. 
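Equations 4.3 and 4.4 are the workhorses of the GenerateMDP and UPDATE routines in Algorithm 4 below. The following minimal sketch implements both for a two-agent team with agent 2's randomized policy fixed; the callables P, O1, O2 and pi2 stand in for the MTDP's transition, observation and policy functions, beliefs are kept as dictionaries over extended states (s, Psi_2), and all names are illustrative rather than the thesis's code.

```python
from collections import defaultdict

def update_belief(B, a1, w1, S, A2, W2, P, O1, O2, pi2):
    """One step of Equation 4.3.  B maps extended states (s, psi2) -> probability,
    where psi2 is agent 2's action-observation history (a tuple of (a2, w2) pairs).
    P(s,(a1,a2),s2), O1(s2,(a1,a2),w1), O2(s2,(a1,a2),w2) and pi2(a2, psi2) are
    supplied as callables."""
    B_next = defaultdict(float)
    for (s, psi2), b in B.items():
        for a2 in A2:
            pa2 = pi2(a2, psi2)
            for s2 in S:
                base = b * pa2 * P(s, (a1, a2), s2) * O1(s2, (a1, a2), w1)
                if base == 0.0:
                    continue
                for w2 in W2:
                    B_next[(s2, psi2 + ((a2, w2),))] += base * O2(s2, (a1, a2), w2)
    mass = sum(B_next.values())          # denominator of Eq. 4.3: P(w1 | B, a1)
    return ({e: p / mass for e, p in B_next.items()} if mass else {}), mass

def belief_reward(B, a1, A2, R, pi2):
    """Equation 4.4: expected immediate reward of action a1 in belief state B."""
    return sum(b * pi2(a2, psi2) * R(s, (a1, a2))
               for (s, psi2), b in B.items() for a2 in A2)
```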
The step size as calculated in the algorithm, denotes the amount of reward sacrificed during each iteration. RDR then iterates 1/d times (step 3). 35 Algorithm 4 RDR(d,percentdec,¯ x) 1: π 1 ,π 2 ,Optimalreward←Compute joint policy() 2: stepsize←percentdec·Optimalreward·d 3: fori← 1 to1/d do 4: MDP←GenerateMDP(b,Π (i+1)Mod2 ,T) 5: Entropy,Π iMod2 ←BRLP(Optimalrew−stepsize∗i,¯ x) 6: end for 1: GenerateMDP(b,π 2 ,T) : 2: reachable(0)←{b} 3: fort← 1 toT do 4: for allB t−1 ∈ reachable(t−1) do 5: for alla 1 ∈A 1 ,ω 1 ∈ Ω 1 do 6: trans,reachable(t) ∪ ← UPDATE(B t−1 ,a 1 ,ω 1 ) 7: end for 8: end for 9: end for 10: fort←T downto1 do 11: for allB t ∈ reachable(t) do 12: for alla 1 ∈A 1 do 13: for alls t ∈S,ψ t 2 ←ha t−1 2 ,ω t 2 i do 14: for alla t 2 givenΨ t 2 do{Equation 4.4} 15: < a 1 t (B t ) + ←B t (s t ,Ψ t 2 )·R s t ,ha t 1 ,a t 2 i .P(a t 2 |Ψ t 2 ) 16: end for 17: end for 18: end for 19: end for 20: end for 21: returnhB,A,trans,<i 1: UPDATE(B t ,a 1 ,ω 1 ) : 2: for alls t+1 ∈S,ψ t 2 ←ha t−1 2 ,ω t 2 i do 3: for alls t ∈S do{Equation 4.3} 4: B t+1 (s t+1 ,Ψ t+1 2 ) + ← B t (s t ,Ψ t 2 )· P(s t ,(a t 1 ,a t 2 ),s t+1 )· O 1 (s t+1 ,(a t 1 ,a t 2 ),ω t+1 1 )· O 2 (s t+1 ,(a t 1 ,a t 2 ),ω t+1 2 )·P(a t 2 |Ψ t 2 ) 5: end for 6: trans←normalize(B t+1 (s t+1 ,Ψ t+1 2 )) 7: end for 8: returntrans,B t+1 36 The function GenerateMDP generates all reachable belief states (lines 2 through 6) from a given starting belief state B and hence a belief MDP is generated. The number of such belief states isO(|A 1 ||Ω 1 |) T−1 whereT is the time horizon. The extended states in each B increases by a factor of|A 2 ||Ω 2 | with increasing T. Thus the time to calculate B(e) for all extended states e, for all belief states B in agent 1’s belief MDP isO(|S| 2 (|A 1 ||A 2 ||Ω 1 ||Ω 2 |) T−1 ). Lines 7 through 12 compute the reward for each belief state. The total computations to calculate the reward is O(|S||A 1 ||A 2 |(|A 1 ||A 2 ||Ω 1 ||Ω 2 |) T−1 ). The belief MDP generated is denoted by the tuplehB,A,trans,Ri. We reformulate the MDP obtained to problem 3.5 and use our polynomial BRLP procedure to solve it, using ¯ x as input. Thus, algorithm RDR is exponentially faster than an exhaustive search of a policy space, and comparable to algorithms that generate locally optimal policies [Nair et al., 2003]. 37 Chapter 5 Experimental Results In this chapter, we present three sets of experimental results. The first set of experiments evaluate the nonlinear algorithm and the two linear approximation algorithms we developed in chapter 3 for the single agent case. The second set of experiments evaluate the RDR algorithm developed for the multiagent case over the various possible parameters of the algorithm. The third set of experiments examine the tradeoffs in entropy of the agent/agent-team and the total number of observations (probes) the enemy needs to determine the agent/agent-team actions at each state. These experiments are performed to show that increasing entropy indeed increases security. 5.1 Single Agent Experiments Our first set of experiments examine the tradeoffs in run-time, expected reward and entropy for single-agent problems. Figures 5.1-a and 5.1-b show the results for these experiments based on generation of MDP policies. The results show averages over 10 MDPs where each MDP represents a flight of a UA V , with state space of 28-40 states. The states represent the regions monitored by the UA Vs. 
The transition function assumes that a UA V action can make a transition from a region to one of four other regions, where the transition probabilities were selected at 38 random. The rewards for each MDP were also generated using random number generators. These experiments compare the performance of our four methods of randomization for single-agent policies. In the figures, CRLP refers to algorithm 2; BRLP refers to algorithm 3; whereasH W (x) andH A (x) refer to Algorithm 1 with these objective functions. Figure 5.1 examines the tradeoff between entropy and expected reward thresholds. It shows the average weighted entropy on the y-axis and reward threshold percent on the x-axis. The average maximally obtainable entropy for these MDPs is 8.89 (shown by line on the top) and three of our four methods (except CRLP) attain it at about 50% threshold, i.e. an agent can attain maximum entropy if it is satisfied with 50% of the maximum expected reward. However, if no reward can be sacrificed (100% threshold) the policy returned is deterministic. Figure 5.1-b shows the run-times, plotting the execution time in seconds on the y-axis, and expected reward threshold percent on the x-axis. These numbers represent averages over the same 10 MDPs as in Figure 5.1-a. Algorithm CRLP is the fastest and its runtime is very small and remains constant over the whole range of threshold rewards as seen from the plot. Algorithm BRLP also has a fairly constant runtime and is slightly slower than CRLP. Both CRLP and BRLP are based on linear programs and hence their small and fairly constant runtimes. Algorithm 1, for bothH A (x) andH W (x) objectives, exhibits an increase in the runtime as the expected reward threshold increases. This trend that can be attributed to the fact that maximizing a non-concave objective while simultaneously attaining feasibility becomes more difficult as the feasible region shrinks. We conclude the following from Figure 5.1: (i) CRLP is the fastest but provides the lowest entropy. (ii) BRLP is significantly faster than Algorithm 1, providing 7-fold speedup on average over the 10 MDPs over the entire range of thresholds. (iii) Algorithm 1 withH W (x) provides 39 0 1 2 3 4 5 6 7 8 9 10 50 60 70 80 90 100 Reward Threshold(%) Ave. Weighted Entropy BRLP Hw(x) Ha(x) CRLP Max Entropy (a) 0 20 40 60 80 100 120 50 60 70 80 90 100 Reward Threshold(%) Execution Time (sec) BRLP Hw(x) Ha(x) CRLP (b) Figure 5.1: Comparison of Single Agent Algorithms 40 highest entropy among our methods, but the average gain in entropy is only 10% over BRLP. (iv) CRLP provides a 4-fold speedup on an average over BRLP but with a significant entropy loss of about 18%. In fact, CRLP is unable to reach the maximal possible entropy for the threshold range considered in the plot. Thus, BRLP appears to provide the most favorable tradeoff of run-time to entropy for the domain considered, and we use this method for the multiagent case. However, for time critical domains CRLP might be the algorithm of choice and therefore both BRLP and CRLP provide useful tradeoff points. 5.2 Multi Agent Experiments Our second set of experiments examine the tradeoffs in run-time, expected joint reward and en- tropy for the multiagent case. 
Table 5.1 shows the runtime results and entropy (in parenthesis) averaged over 10 instances of the UA V team problem based on the original transition and obser- vation functions from [Nair et al., 2003] and its variations.d, the input parameter controlling the number of rolldown steps of algorithm 4, varies from 1 to 0.125 for two values of percent reward threshold (90% and 50%) and time horizonT =2. We conclude that asd decreases, the run-time increases, but the entropy remains fairly constant ford≤ .5. For example, for reward threshold of 50%, for d = 0.5, the runtime is 1.47 secs, but the run-time increases more than 5-fold to 7.47 when d = 0.125; however, entropy only changes from 2.52 to 2.66 with this change in d. Reward Threshold 1 .5 .25 .125 90% .67(.59) 1.73(.74) 3.47(.75) 7.07(.75) 50% .67(1.53) 1.47(2.52) 3.4(2.62) 7.47(2.66) Table 5.1: RDR: Avg. run-time in sec and (Entropy),T = 2 41 Thus in our next set of graphs, we present results ford = .5, as it provides the most favorable tradeoff, if other parameters remain fixed. Figure 5.2-a plots RDR expected reward threshold percent on thex-axis and weighted entropy on the y-axis averaged over the same 10 UA V-team instances. Thus, if the team needs to obtain 90% of maximum expected joint reward with a time horizonT = 3, it gets a weighted entropy of 1.06 only as opposed to 3.62 if it obtains 50% of the expected reward for the samed andT . Similar to the single-agent case, the maximum possible entropy for the multiagent case is also shown by a horizontal line at the top of the graph. Figure 5.2-b studies the effect of changing miscoordination cost on RDR’s ability to improve entropy. As explained in Section 2.2, the UA V team incurs a high cost of miscoordination, e.g. if one UA V shoots left and the other shoots right. We now define miscoordination reduction factor (MRF) as the ratio between the original miscoordination cost and a new miscoordination cost. Thus, high MRF implies a new low miscoordination cost, e.g. an MRF of 4 means that the miscoordination cost is cut 4-fold. We plot this MRF on x-axis and entropy on y-axis, with expected joint reward threshold fixed at 70% and the time horizon T at 2. We created 5 reward variations for each of our 10 UA V team instances we used for 5.2-a; only 3 instances are shown, to reduce graph clutter (others are similar). For instance 2, the original miscoordination cost provided an entropy of 1.87, but as this cost is scaled down by a factor of 12, the entropy increases to 2.53. Based on these experiments, we conclude that: (i) Greater tolerance of expected reward loss allows higher entropy; but reaching the maximum entropy is more difficult in multiagent teams — for the reward loss of 50%, in the single agent case, we are able to reach maximum entropy, but we are unable to reach maximum entropy in the multiagent case. (ii) Lower miscoordination costs allow higher entropy for the same expected joint reward thresholds. (iii) Varyingd produces 42 0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 5 50 70 90 Reward Threshold(%) Weighted Entropy T=2 T=3 T=2 Max T=3 Max (a) 1.5 1.7 1.9 2.1 2.3 2.5 2.7 0 2 4 6 8 10 12 14 Miscoordination Reduction Factor Weighted Entropy Inst1 Inst2 Inst3 (b) Figure 5.2: Results for RDR 43 only a slight change in entropy; thus we can used as high as 0.5 to cut down runtimes. (iv) RDR is time efficient because of the underlying polynomial time BRLP algorithm. 
5.3 Entropy Increases Security: An Experimental Evaluation Our fourth set of experiments examine the tradeoffs in entropy of the agent/agent-team and the total number of observations (probes) the enemy needs to determine the agent/agent-team actions at each state. The primary aim of this experiment is to show that maximizing policy entropy indeed makes it more difficult for the adversary to determine/predict our agents’ actions, and thus more difficult for the adversary to cause harm, which was our main goal at the beginning of this thesis. Figures 5.3-a and 5.3-b plot the number of observations of enemy as function of entropy of the agent/agent-team. In particular for the experiment we performed, the adversary runs yes-no probes to determine the agent’s action at each state, i.e. probes that return an answer yes if the agent is taking the particular action at that state in which case the probing is stopped, and a no otherwise. The average number of yes-no probes at a state is the total number of observations needed by the adversary to determine the correct action taken by the agent in that state. The more deterministic the policy is, the fewer the probes the adversary needs to run; if the policy is completely deterministic, the adversary need not run any probes as it knows the action. Therefore, the aim of the agent/agent-team is to maximize the policy entropy, so that the expected number of probes asked by the adversary is maximized. In contrast, the adversary minimizes the expected number of probes required to determine the agents’ actions. Hence, for any given states, the adversary uses the Huffman procedure to optimize the number of probes [Huffman, 1952], and hence the total number of probes over the 44 0 2 4 6 8 10 12 14 0 2 4 6 8 10 Entropy # of observations Observe All Observe Select Observe Noisy (a) 0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 5 0 1 2 3 4 Joint Entropy Joint # of Observations Observe All Observe Select Observe Noisy (b) Figure 5.3: Improved security via randomization 45 entire MDP state space can be expressed as follows. LetS ={s 1 ,s 2 ,....,s m } be the set of MDP states andA ={a 1 ,a 2 ,....,a n } be the action set at each state. Letp(s,a) ={p 1 ,.....,p n } be the probabilities of taking the action set{a 0 1 ,.....,a 0 n },a 0 i ∈A at states sorted in decreasing order of probability. The number of yes-no probes at states is denoted byζ s =p 1 ∗1+.....+p n−1 ∗(n− 1)+p n ∗(n−1). If the weight of the states (see notion of weight introduced in section 3.2) isW = {w 1 ,w 2 ,......,w m }, then the number of observations over the set of states is denoted Observe-all = P s=1...m {w s ∗ζ s }. Setting some weights to zero implies that the adversary was not concerned with those states, and the number of observations in this situation is denoted Observe-select. While the number of observations in both Observe-all and Observe-select are obtained assuming the adversary obtains an accurate policy of the agent or agent team, in real situations, an adversary may obtain a noisy policy, and the adversary’s number of observations in such a case is denoted Observe-noisy. Figure 5.3-a demonstrates the effectiveness of entropy maximization using the BRLP method against an adversary using yes-no probes procedure as his probing method for the single agent case. The plot shows the number of observations on y-axis and entropy on the x-axis averaged over the 10 MDPs we used for our single-agent experiment. 
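Before examining the individual curves, the quantity being plotted can be made concrete. Below is a minimal sketch of the probe count zeta_s and of Observe-all exactly as defined above, assuming the adversary asks its yes-no probes in decreasing order of action probability; Observe-select is obtained simply by zeroing the weights of the states the adversary ignores. The helper names and the example numbers are illustrative.

```python
def probes_at_state(action_probs):
    """Expected number of yes-no probes at one state:
    zeta_s = p_1*1 + ... + p_{n-1}*(n-1) + p_n*(n-1),
    with the (nonzero) action probabilities sorted in decreasing order."""
    p = sorted((pi for pi in action_probs if pi > 0), reverse=True)
    n = len(p)
    if n <= 1:
        return 0.0                                   # deterministic: no probes needed
    return sum(p[i] * (i + 1) for i in range(n - 1)) + p[-1] * (n - 1)

def observe_all(policy, weights):
    """Observe-all = sum_s w_s * zeta_s over the MDP states.  policy[s] is the
    action distribution at state s and weights[s] its weight; zeroing some
    weights gives Observe-select."""
    return sum(weights[s] * probes_at_state(policy[s]) for s in policy)

# A more uniform distribution forces more probes than a near-deterministic one.
print(probes_at_state([0.25, 0.25, 0.25, 0.25]))     # 2.25
print(probes_at_state([0.97, 0.01, 0.01, 0.01]))     # 1.05
```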
The plot has 3 lines corresponding to the three adversary procedures namely Observe-all, Observe-select and Observe-noisy. Observe- all and Observe-select have been plotted to study the effect of entropy on the number of probes the adversary needs. For example, for Observe-all, when entropy is 8, the average number of probes needed by the adversary is 9. The purpose of the Observe-noisy plot is to show that the number of probes that the adversary requires can only remain same or increase when using a noisy policy, as opposed to using the correct agent policy. The noise in our experiments is that 46 two actions at each state of the MDP have incorrect probabilities. Each data point in the Observe- noisy case represents an average of 50 noisy policies, averaging over 5 noisy policies for each reward threshold over each of the 10 MDPs. Figure 5.3-b plots a similar graph as 5.3-a for the multiagent case, averaged over the 10 UA V- team instances with two UA Vs. The plot has three lines namely Observe-all, Observe-select and Observe-noisy with the same definitions as in the single agent case but in a distributed POMDP setting. However, in plot 5.3-b the y-axis represents joint number of yes-no probes and the x-axis represents joint entropy. Both these parameters are calculated as weighted sums of the individual parameters for each UA V , assuming that the adversary assigns equal weight to both the UA Vs. We conclude the following from plots 5.3-a and 5.3-b: (i) The number of observations (yes- no probes) increases as policy entropy increases, whether the adversary monitors the entire state space (observe-all) or just a part of it (observe-select). (ii) If the adversary obtains a noisy policy (observe-noisy), it needs a larger number of observations when compared to the adversary ob- taining an accurate policy. (iii) As entropy increases, the agents’ policy tends to become more uniform and hence the effect of noise on the number of yes-no probes reduces. In the extreme case where the policy is totally uniform the Observe-all and Observe-noisy both have same num- ber of probes. This can be observed at the maximal entropy point in plot 5.3-a. The maximal entropy point is not reached in plot 5.3-b as shown in the results for RDR. From the above we conclude that maximizing entropy has indeed made it more difficult for the adversary to deter- mine our agents’ actions and cause harm. Hence randomization with quality constraints is an effective procedure to generate secure policies when the agent has no model of the adversary. In the next chapter I introduce the scenario when the agent has a partial model of the adversary. It is shown in that chapter, that randomized policies are still an effective way for avoiding adversarial 47 actions when there is a partial model of the adversary. However, the amount of randomization needed is dependent on the type of adversaries present in the domain. 48 Chapter 6 Partial Adversary Model: Limited Randomization Procedure In the previous chapters we dealt with the problem of developing secure policies for agents when there is no model of the adversary. In the rest of this thesis I develop a limited randomization approach and an exact procedure for generating secure policies for agents assuming that there is a partial model of the adversary available. In particular I first introduce an efficient limited randomization approach for generating secure policies for agents in this chapter. 
The main ad- vantage of this approach is that it is efficient compared to previously existing approaches and also the policies generated are simple and easy to implement in real applications. In the next chapter, I use the techniques developed here to develop an efficient exact solution. In some settings, one player must commit to a strategy before the other players choose their strategies. These scenarios are known as Stackelberg games [Fudenberg and Tirole, 1991]. In a Stackelberg game, a leader commits to a strategy first, and then a follower (or group of followers) selfishly optimize their own rewards, considering the action chosen by the leader. Stackelberg games are commonly used to model attacker-defender scenarios in security domains. [Brown et al., 2006]. For example consider a domain in which a single security agent is responsible for patrolling a region, searching for robbers. Since the security agent (the leader) cannot be 49 in all areas of the region at once, it must instead choose some strategy of patrolling various areas within the region, one at a time. This strategy could be a mixed strategy in order to be unpredictable to the robbers (followers). The robbers, after observing the pattern of patrols over time, can then choose their own strategy of choosing a location to rob. Similarly in terms of the domain described earlier, a team of unmanned aerial vehicles (UA Vs) [Beard and McLain, 2003] monitoring a region undergoing a humanitarian crisis may also need to choose a patrolling policy. They must make this decision without knowing in advance whether terrorists or other adversaries may be waiting to disrupt the mission at a given location. It may indeed be possible to model the motivations of types of adversaries the agent or agent team is likely to face in order to target these adversaries more closely. However, in both cases, the security robot or UA V team will not know exactly which kinds of adversaries may be active on any given day. Although the follower in a Stackelberg game is allowed to observe the leader’s strategy before choosing its own strategy, there is often an advantage for the leader over the case where both players must choose their moves simultaneously. To see the advantage of being the leader in a Stackelberg game, consider a simple game with the payoff table as shown in Table 6.1, adapted from [Conitzer and Sandholm, 2006]. The leader is the row player and the follower is the column player. 1 2 1 2,1 4,0 2 1,0 3,2 Table 6.1: Payoff table for example normal form game. The only pure-strategy Nash equilibrium for this game is when the leader plays 1 and the follower plays 1 which gives the leader a payoff of 2; in fact, for the leader, playing 2 is strictly 50 dominated. However, if the leader can commit to playing 2 before the follower chooses its strat- egy, then the leader will obtain a payoff of 3, since the follower would then play 2 to ensure a higher payoff for itself. If the leader commits to a uniform mixed strategy of playing 1 and 2 with equal (0.5) probability, then the follower will play 2, leading to a payoff for the leader of 3.5. The problem of choosing an optimal strategy for the leader to commit to in a Stackelberg game is analyzed in [Conitzer and Sandholm, 2006] and found to be NP-hard in the case of a Bayesian game with multiple types of followers. 
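The payoffs in this example are easy to verify computationally. The short sketch below evaluates the leader's payoff in the Table 6.1 game under a committed mixed strategy, assuming the follower best-responds and breaks ties in the leader's favor (the assumption discussed later in this chapter); the helper name is illustrative.

```python
import numpy as np

# Payoff matrices from Table 6.1: entry (i, j) is the payoff when the
# leader (row player) plays i and the follower (column player) plays j.
R = np.array([[2.0, 4.0],    # leader rewards
              [1.0, 3.0]])
C = np.array([[1.0, 0.0],    # follower rewards
              [0.0, 2.0]])

def leader_payoff(x, R, C):
    """Leader's payoff when committing to mixed strategy x and the follower
    plays a best response, breaking ties in the leader's favor."""
    follower_vals = x @ C
    ties = np.flatnonzero(np.isclose(follower_vals, follower_vals.max()))
    j = max(ties, key=lambda jj: x @ R[:, jj])
    return x @ R[:, j]

print(leader_payoff(np.array([1.0, 0.0]), R, C))   # commit to strategy 1 -> 2.0
print(leader_payoff(np.array([0.0, 1.0]), R, C))   # commit to strategy 2 -> 3.0
print(leader_payoff(np.array([0.5, 0.5]), R, C))   # uniform mix         -> 3.5
```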
Methods for finding optimal leader strategies for non- Bayesian games [Conitzer and Sandholm, 2006] can be applied to this problem by converting the Bayesian game into a normal-form game by the Harsanyi transformation [Harsanyi and Selten, 1972]. However, by transforming the game, the compact structure of the Bayesian game is lost. In addition, this method requires running a set of multiple linear programs, some of which may be infeasible. If, on the other hand, we wish to compute the highest-reward Nash equilibrium, new methods using mixed-integer linear programs (MILPs) [Sandholm et al., 2005] may be used, since the highest-reward Bayes-Nash equilibrium is equivalent to the corresponding Nash equilibrium in the transformed game. Furthermore, since the Nash equilibrium assumes a simultaneous choice of strategies, the advantages of being the leader are not considered. Therefore, finding more efficient and compact techniques for choosing optimal strategies in instances of these games is an important open issue. To address this open issue, I first introduce an efficient randomization method for generating optimal leader strategy for security domains that have limited randomization, known as ASAP (Agent Security via Approximate Policies). This method has three key advantages. First, it directly searches for an optimal strategy, rather than a Nash (or Bayes-Nash) equilibrium, thus allowing it to find high-reward non-equilibrium strategies like the one in the above example. 51 Second, it generates policies with a support which can be expressed as a uniform distribution over a multiset of fixed size as proposed in [Lipton et al., 2003]. This allows for policies that are simple to understand and represent, as well as a parameter (the size of the multiset) that controls the simplicity of the policy and can be tuned. Third, the method allows for a Bayes-Nash game to be expressed compactly without requiring conversion to a normal-form game, allowing for large speedups over existing Nash methods such as [Sandholm et al., 2005] and [Lemke and Hawson, 1964]. Using the techniques used in deriving the ASAP procedure, I then introduced an efficient exact method for finding the optimal leader strategy for security domains, known as DOBSS (Decomposed Optimal Bayesian Stackelberg Solver). This method has two key advantages. First, the method allows for a Bayesian game to be expressed compactly without requiring conversion to a normal-form game via the Harsanyi transformation. Second, the method requires only one mixed-integer linear program to be solved, rather than a set of such programs. In most security patrolling domains, the security agents (which could be UA Vs [Beard and McLain, 2003] or police [Ruan et al., 2005]) cannot feasibly patrol all areas all the time. Instead, they must choose a policy by which they patrol various routes at different times, taking into account factors such as the likelihood of crime in different areas, possible targets for crime, and the security agents’ own resources (number of security agents, amount of available time, fuel, etc.). It is usually beneficial for this policy to be nondeterministic so that robbers cannot safely rob certain locations, knowing that they will be safe from the security agents as shown in the earlier chapters. To demonstrate the utility of our algorithm, we use a simplified version of the police patrolling domain, expressed as a bayesian game as described in chapter 2. 
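Throughout the rest of this chapter, the game is kept in this compact Bayesian form: one payoff-matrix pair per robber type plus the prior over types, rather than a single Harsanyi-transformed matrix. A minimal sketch of such a container follows; the class and field names are illustrative bookkeeping, not part of ASAP or DOBSS themselves.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class CompactBayesianGame:
    """Leader-vs-follower Bayesian game kept in compact (untransformed) form:
    one payoff-matrix pair (R^l, C^l) per follower type l, plus the a priori
    probability p^l of facing that type."""
    leader_payoffs: list     # R^l, each an (|X| x |Q|) array
    follower_payoffs: list   # C^l, same shapes
    type_probs: list         # p^l, summing to one

    def leader_value(self, x, responses):
        """Expected leader reward for mixed strategy x when follower type l
        plays the pure strategy responses[l]."""
        return sum(p * (x @ R[:, j])
                   for p, R, j in zip(self.type_probs,
                                      self.leader_payoffs, responses))
```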
52 6.1 Bayesian Games A Bayesian game contains a set ofN agents, and each agentn must be one of a given set of types θ n . For our patrolling domain, we have two agents, the security agent and the robber.θ 1 is the set of security agent types andθ 2 is the set of robber types. Since there is only one type of security agent,θ 1 contains only one element. During the game, the robber knows its type but the security agent does not know the robber’s type. For each agent (the security agent or the robber)n, there is a set of strategiesσ n and a utility functionu n :θ 1 ×θ 2 ×σ 1 ×σ 2 →<. A Bayesian game can be transformed into a normal-form game using the Harsanyi transfor- mation [Harsanyi and Selten, 1972]. Once this is done, new, linear-program (LP)-based methods for finding high-reward strategies for normal-form games [Conitzer and Sandholm, 2006] can be used to find a strategy in the transformed game; this strategy can then be used for the Bayesian game. While methods exist for finding Bayes-Nash equilibria directly, without the Harsanyi transformation [Koller and Pfeffer, 1997], they find only a single equilibrium in the general case, which may not be of high reward. Recent work [Sandholm et al., 2005] has led to efficient mixed- integer linear program techniques to find the best Nash equilibrium for a given agent. However, these techniques do require a normal-form game, and so to compare the policies given by ASAP and DOBSS against the optimal policy, as well as against the highest-reward Nash equilibrium, we must apply these techniques to the Harsanyi-transformed matrix. The next two subsections elaborate on how this is done. 53 6.1.1 Harsanyi Transformation The first step in solving Bayesian games is to apply the Harsanyi transformation [Harsanyi and Selten, 1972] that converts the incomplete information game into a normal form game. Given that the Harsanyi transformation is a standard concept in game theory, we explain it briefly through a simple example without introducing the mathematical formulations. Let us assume there are two robber typesa andb in the Bayesian game. Robbera will be active with probabilityα, and robber b will be active with probability 1−α. The rules described in Section 2.3 allow us to construct simple payoff tables. Assume that there are two houses in the world (1 and 2) and hence there are two patrol routes (pure strategies) for the agent:{1,2} and{2,1}. The robber can rob either house 1 or house 2 and hence he has two strategies (denoted as 1 l , 2 l for robber type l). Since there are two types assumed (denoted asa andb), we construct two payoff tables (shown in Table 6.2) corresponding to the security agent playing a separate game with each of the two robber types with probabilities α and 1−α. First, consider robber type a. Borrowing the notation from the domain section, we assign the following values to the variables: v 1,x = v 1,q = 3/4,v 2,x = v 2,q = 1/4,c x = 1/2,c q = 1,p 1 = 1,p 2 = 1/2. Using these values we construct a base payoff table as the payoff for the game against robber typea. For example, if the security agent chooses route{1,2} when robbera is active, and robbera chooses house 1, the robber receives a reward of -1 (for being caught) and the agent receives a reward of 0.5 for catching the robber. The payoffs for the game against robber typeb are constructed using different values. 
Using the Harsanyi technique involves introducing a chance node, that determines the rob- ber’s type, thus transforming the security agent’s incomplete information regarding the robber 54 Security agent: {1,2} {2,1} Robbera 1 a -1, .5 -.375, .125 2 a -.125, -.125 -1, .5 Robberb 1 b -.9, .6 -.275, .225 2 b -.025, -.025 -.9, .6 Table 6.2: Payoff tables: Security Agent vs Robbersa andb into imperfect information [Brynielsson and Arnborg, 2004]. The Bayesian equilibrium of the game is then precisely the Nash equilibrium of the imperfect information game. The transformed, normal-form game is shown in Table 6.3. In the transformed game, the security agent is the col- umn player, and the set of all robber types together is the row player. Suppose that robber type {1,2} {2,1} {1 a ,1 b } −1α−.9(1−α),.5α+.6(1−α) −.375α−.275(1−α),.125α+.225(1−α) {1 a ,2 b } −1α−.025(1−α),.5α−.025(1−α) −.375α−.9(1−α),.125α+.6(1−α) {2 a ,1 b } −.125α−.9(1−α),−.125α+.6(1−α) −1α−.275(1−α),.5α+.225(1−α) {2 a ,2 b } −.125α−.025(1−α),−.125α−.025(1−α) −1α−.9(1−α),.5α+.6(1−α) Table 6.3: Harsanyi Transformed Payoff Table a robs house 1 and robber type b robs house 2, while the security agent chooses patrol{1,2}. Then, the security agent and the robber receive an expected payoff corresponding to their payoffs from the agent encountering robbera at house 1 with probabilityα and robberb at house 2 with probability1−α. 6.1.2 Existing Procedure: Finding an Optimal Strategy Although a Nash equilibrium is the standard solution concept for games in which agents choose strategies simultaneously, in our security domain, the security agent (the leader) can gain an advantage by committing to a mixed strategy in advance. Since the followers (the robbers) will know the leader’s strategy, the optimal response for the followers will be a pure strategy. Given 55 the common assumption, taken in [Conitzer and Sandholm, 2006], in the case where followers are indifferent, they will choose the strategy that benefits the leader, there must exist a guaranteed optimal strategy for the leader [Conitzer and Sandholm, 2006]. From the Bayesian game in Table 6.2, we constructed the Harsanyi transformed bimatrix in Table 6.3. We denoteX = σ θ 2 1 = σ 1 andQ = σ θ 2 2 as the index sets of the security agent and robbers’ pure strategies, respectively, withR andC as the corresponding payoff matrices. R ij is the reward of the security agent andC ij is the reward of the robbers when the security agent takes pure strategyi and the robbers take pure strategyj. A mixed strategy for the security agent is a probability distribution over its set of pure strategies and will be represented by a vector x = (p x1 ,p x2 ,...,p x|X| ), wherep xi ≥ 0 and P p xi = 1. Here,p xi is the probability that the security agent will choose itsith pure strategy. The optimal mixed strategy for the security agent can be found in time polynomial in the num- ber of rows in the normal form game using the following linear program formulation from [Conitzer and Sandholm, 2006]. For every possible pure strategyj by the follower (the set of all robber types), max P i∈X p xi R ij s.t. ∀j 0 ∈Q, P i∈σ 1 p xi C ij ≥ P i∈σ 1 p xi C ij 0 P i∈X p xi = 1 ∀ i∈X ,p xi = 0 (6.1) 56 Then, for all feasible follower strategies j, choose the one that maximizes P i∈X p xi R ij , the reward for the security agent (leader). Thep xi variables give the optimal strategy for the security agent. 
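Both steps just described, building the Harsanyi-transformed matrices of Table 6.3 from Table 6.2 and solving the multiple LPs of formulation (6.1), can be prototyped in a few lines. The sketch below uses scipy's linprog; for each follower pure strategy j it maximizes the leader's reward subject to j being a best response and keeps the best feasible value. The function names are illustrative, and the matrices are read directly off Table 6.2, with rows the two patrol routes and columns the two houses.

```python
import numpy as np
from itertools import product
from scipy.optimize import linprog

# Payoff matrices from Table 6.2 (rows: routes {1,2},{2,1}; cols: house 1, house 2).
Ra = np.array([[0.5, -0.125], [0.125, 0.5]])      # security agent vs robber a
Ca = np.array([[-1.0, -0.125], [-0.375, -1.0]])   # robber a
Rb = np.array([[0.6, -0.025], [0.225, 0.6]])      # security agent vs robber b
Cb = np.array([[-0.9, -0.025], [-0.275, -0.9]])   # robber b

def harsanyi(Rs, Cs, probs):
    """Weight each type's matrices by its prior and cross the follower types'
    pure strategies, as in Table 6.3."""
    cols = list(product(*[range(Cl.shape[1]) for Cl in Cs]))
    m = Rs[0].shape[0]
    R = np.array([[sum(p * Rl[i, j] for p, Rl, j in zip(probs, Rs, col))
                   for col in cols] for i in range(m)])
    C = np.array([[sum(p * Cl[i, j] for p, Cl, j in zip(probs, Cs, col))
                   for col in cols] for i in range(m)])
    return R, C

def multiple_lp(R, C):
    """Formulation (6.1): one LP per follower pure strategy j, keep the best."""
    m, n = R.shape
    best = (-np.inf, None)
    for j in range(n):
        A_ub = (C - C[:, [j]]).T          # enforce that j is a follower best response
        res = linprog(-R[:, j], A_ub=A_ub, b_ub=np.zeros(n),
                      A_eq=np.ones((1, m)), b_eq=[1.0], bounds=(0, 1))
        if res.success and -res.fun > best[0]:
            best = (-res.fun, res.x)
    return best

alpha = 0.5
R, C = harsanyi([Ra, Rb], [Ca, Cb], [alpha, 1 - alpha])
value, x = multiple_lp(R, C)
print("leader value:", round(value, 3), "patrol mix:", np.round(x, 3))
```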
Note that while this method is polynomial in the number of rows in the transformed, normal- form game, the number of rows increases exponentially with the number of robber types. Using this method for a Bayesian game thus requires running|σ 2 | |θ 2 | separate linear programs. This is not a surprise, since finding the optimal strategy to commit to for the leader in a Bayesian game is NP-hard [Conitzer and Sandholm, 2006]. 6.2 Limited Randomization Approach In the limited randomization approach we limit the possible mixed strategies of the leader to se- lect actions with probabilities that are integer multiples of 1/k for a predetermined integerk. In the previous chapters of this thesis, I have shown that strategies with high entropy are beneficial for security applications when opponents’ utilities are completely unknown. In our domain, if utilities are not considered, this method will result in uniform-distribution strategies. One advan- tage of such strategies is that they are compact to represent (as fractions) and simple to under- stand; therefore they can be efficiently implemented by real organizations. We aim to maintain the advantage provided by simple strategies for our security application problem, incorporating the effect of the robbers’ rewards on the security agent’s rewards. Thus, the ASAP will produce strategies which are k-uniform. A mixed strategy is denotedk-uniform if it is a uniform distribu- tion on a multisetS of pure strategies with|S| = k. A multiset is a set whose elements may be repeated multiple times; thus, for example, the mixed strategy corresponding to the multiset{1, 57 1, 2} would take strategy 1 with probability 2/3 and strategy 2 with probability 1/3. ASAP allows the size of the multiset to be chosen in order to balance the complexity of the strategy reached with the goal that the identified strategy will yield a high reward. Another advantage of the ASAP procedure is that it operates directly on the compact Bayesian representation, without requiring the Harsanyi transformation. This is because the different fol- lower (robber) types are independent of each other. Hence, evaluating the leader strategy against a Harsanyi-transformed game matrix is equivalent to evaluating against each of the game matri- ces for the individual follower types. This independence property is exploited in ASAP to yield a decomposition scheme. Note that the LP method introduced by [Conitzer and Sandholm, 2006] to compute optimal Stackelberg policies is unlikely to be decomposable into a small number of games as it was shown to be NP-hard for Bayes-Nash problems. Finally, note that ASAP requires the solution of only one optimization problem, rather than solving a series of problems as in the LP method of [Conitzer and Sandholm, 2006]. For a single follower type, the algorithm works the following way. Given a particulark, for each possible mixed strategyx for the leader that corresponds to a multiset of sizek, evaluate the leader’s payoff fromx when the follower plays a reward-maximizing pure strategy. We then take the mixed strategy with the highest payoff. We need only to consider the reward-maximizing pure strategies of the followers (robbers), since for a given fixed strategyx of the security agent, each robber type faces a problem with fixed linear rewards. If a mixed strategy is optimal for the robber, then so are all the pure strategies in the support of that mixed strategy. 
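For a single follower type and a small game, the search just described can be written down almost literally: enumerate every multiset of size k over the leader's pure strategies, let the follower best-respond, and keep the best multiset. The brute-force sketch below is only meant to make the k-uniform idea concrete (the MILP developed in the next sections is what actually scales); the names are illustrative.

```python
import numpy as np
from itertools import combinations_with_replacement

def best_k_uniform(R, C, k):
    """Exhaustive search over k-uniform leader strategies for one follower type:
    each multiset of k pure strategies defines a mixed strategy, the follower
    plays a reward-maximizing pure response (ties broken in the leader's favor),
    and the best multiset is returned."""
    m = R.shape[0]
    best_val, best_multiset = -np.inf, None
    for multiset in combinations_with_replacement(range(m), k):
        x = np.bincount(multiset, minlength=m) / k     # e.g. {0,0,1} -> (2/3, 1/3)
        vals = x @ C
        ties = np.flatnonzero(np.isclose(vals, vals.max()))
        j = max(ties, key=lambda jj: x @ R[:, jj])
        if x @ R[:, j] > best_val:
            best_val, best_multiset = x @ R[:, j], multiset
    return best_val, best_multiset

# On the Table 6.1 game with k = 2, the best multiset mixes the two pure
# strategies uniformly and is worth 3.5 to the leader.
R = np.array([[2.0, 4.0], [1.0, 3.0]])
C = np.array([[1.0, 0.0], [0.0, 2.0]])
print(best_k_uniform(R, C, k=2))
```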
Note also that because we limit the leader’s strategies to take on discrete values, the assumption from Section 6.1.2 that the followers will break ties in the leader’s favor is not significant, since ties will be unlikely to arise. This is because, in domains 58 where rewards are drawn from any random distribution, the probability of a follower having more than one pure optimal response to a given leader strategy approaches zero, and the leader will have only a finite number of possible mixed strategies. Our approach to characterize the optimal strategy for the security agent makes use of prop- erties of linear programming. We briefly outline these results here for completeness, for detailed discussion and proofs see one of many references on the topic, such as [Bertsimas and Tsitsiklis, 1997]. Every linear programming problem, such as: maxc T x|Ax =b,x≥ 0, has an associated dual linear program, in this case: minb T y|A T y≥c. These primal/dual pairs of problems satisfy weak duality: For any x and y primal and dual feasible solutions respectively,c T x≤b T y. Thus a pair of feasible solutions is optimal ifc T x = b T y, and the problems are said to satisfy strong duality. In fact if a linear program is feasible and has a bounded optimal solution, then the dual is also feasible and there is a pairx ∗ ,y ∗ that satisfiesc T x ∗ = b T y ∗ . These optimal solutions are characterized with the following optimality conditions: • primal feasibility:Ax =b, x≥ 0 • dual feasibility:A T y≥c • complementary slackness:x i (A T y−c) i = 0 for alli. 59 Note that this last condition implies that c T x =x T A T y =b T y, which proves optimality for primal dual feasible solutionsx andy. In the following subsections, we first define the problem in its most intuitive form as a mixed- integer quadratic program, and then show how this problem can be converted into a mixed-integer linear program. 6.2.1 Mixed-Integer Quadratic Program We begin with the case of a single type of follower. Let the leader be the row player and the follower the column player. We denote byx the vector of strategies of the leader andq the vector of strategies of the follower. We also denoteX andQ the index sets of the leader and follower’s pure strategies, respectively. The payoff matricesR andC correspond to:R ij is the reward of the leader andC ij is the reward of the follower when the leader takes pure strategyi and the follower takes pure strategyj. Letk be the size of the multiset. We first fix the policy of the leader to somek-uniform policyx. The valuex i is the number of times pure strategyi is used in thek-uniform policy, which is selected with probabilityx i /k. 60 We formulate the optimization problem the follower solves to find its optimal response tox as the following linear program: max X j∈Q X i∈X 1 k C ij x i q j s.t. P j∈Q q j = 1 q≥ 0. (6.2) The objective function maximizes the follower’s expected reward givenx, while the constraints make feasible any mixed strategyq for the follower. The dual to this linear programming problem is the following: min a s.t. a≥ X i∈X 1 k C ij x i j∈Q. (6.3) From strong duality and complementary slackness we obtain that the maximum reward value for the followera is the value of every pure strategy withq j > 0, that is in the support of the optimal mixed strategy. Therefore each of these pure strategies is optimal. 
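These duality facts can be checked numerically for any fixed k-uniform policy x: the optimal value of the follower's LP (6.2) equals the optimal dual value a of (6.3), which is simply the best pure-strategy value, and every pure strategy in the support of the follower's optimal mix attains it. A small sketch with arbitrary example numbers follows; the helper name is illustrative.

```python
import numpy as np
from scipy.optimize import linprog

def follower_lp_and_dual(C, x, k):
    """Solve the follower's LP (6.2) for a fixed k-uniform leader policy x
    (x[i] = copies of pure strategy i, sum(x) = k) and compare its value with
    the dual bound a of (6.3)."""
    vals = (C.T @ x) / k                       # expected follower reward of each pure j
    n = len(vals)
    # Primal (6.2): max vals @ q  s.t.  sum(q) = 1, q >= 0.
    res = linprog(-vals, A_eq=np.ones((1, n)), b_eq=[1.0], bounds=(0, None))
    primal_value = -res.fun
    # Dual (6.3): min a  s.t.  a >= vals[j] for all j, i.e. a = max_j vals[j].
    dual_value = vals.max()
    support = np.flatnonzero(res.x > 1e-9)     # complementary slackness: q_j > 0
    return primal_value, dual_value, vals[support]

C = np.array([[1.0, 0.0, 2.0],
              [0.0, 2.0, 2.0]])
x = np.array([2, 1])                           # k = 3, leader plays (2/3, 1/3)
print(follower_lp_and_dual(C, x, k=3))         # primal = dual = support values = 2.0
```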
Optimal solutions to the follower’s problem are characterized by linear programming optimality conditions: primal feasibility constraints in (6.2), dual feasibility constraints in (6.3), and complementary slackness q j a− X i∈X 1 k C ij x i ! = 0 j∈Q. These conditions must be included in the problem solved by the leader in order to consider only best responses by the follower to thek-uniform policyx. 61 The leader seeks thek-uniform solutionx that maximizes its own payoff, given that the fol- lower uses an optimal responseq(x). Therefore the leader solves the following integer problem: max X i∈X X j∈Q 1 k R ij q(x) j x i s.t. P i∈X x i =k x i ∈{0,1,...,k}. (6.4) Problem (6.4) maximizes the leader’s reward with the follower’s best response (q j for fixed leader’s policy x and hence denoted q(x) j ) by selecting a uniform policy from a multiset of constant size k. We complete this problem by including the characterization of q(x) through linear programming optimality conditions. To simplify writing the complementary slackness conditions, we will constrainq(x) to be only optimal pure strategies by just considering integer solutions ofq(x). The leader’s problem becomes: max x,q X i∈X X j∈Q 1 k R ij x i q j s.t. P i x i =k P j∈Q q j = 1 0≤ (a− P i∈X 1 k C ij x i )≤ (1−q j )M x i ∈{0,1,....,k} q j ∈{0,1}. (6.5) Here, the constantM is some large number. The first and fourth constraints enforce ak-uniform policy for the leader, and the second and fifth constraints enforce a feasible pure strategy for the follower. The third constraint enforces dual feasibility of the follower’s problem (leftmost 62 inequality) and the complementary slackness constraint for an optimal pure strategy q for the follower (rightmost inequality). In fact, since only one pure strategy can be selected by the follower, sayq h = 1, this last constraint enforces thata = P i∈X 1 k C ih x i imposing no additional constraint for all other pure strategies which haveq j = 0. We conclude this subsection noting that Problem (6.5) is an integer program with a non- convex quadratic objective in general, as the matrix R need not be positive-semi-definite. Ef- ficient solution methods for non-linear, non-convex integer problems remains a challenging re- search question. In the next section we show a reformulation of this problem as a linear integer programming problem, for which a number of efficient commercial solvers exist. 6.2.2 Mixed-Integer Linear Program We can linearize the quadratic program of Problem 6.5 through the change of variables z ij = x i q j , obtaining the following problem max q,z P i∈X P j∈Q 1 k R ij z ij s.t. P i∈X P j∈Q z ij =k P j∈Q z ij ≤k kq j ≤ P i∈X z ij ≤k P j∈Q q j = 1 0≤ (a− P i∈X 1 k C ij ( P h∈Q z ih ))≤ (1−q j )M z ij ∈{0,1,....,k} q j ∈{0,1} (6.6) 63 Theorem 2 Problems (6.5) and (6.6) are equivalent. Proof: Considerx,q a feasible solution of (6.5). We will show thatq,z ij =x i q j is a feasible solution of (6.6) of same objective function value. The equivalence of the objective functions, and constraints 4, 6 and 7 of (6.6) are satisfied by construction. The fact that P j∈Q z ij = x i as P j∈Q q j = 1 explains constraints 1, 2, and 5 of (6.6). Constraint 3 of (6.6) is satisfied because P i∈X z ij =kq j . Let us now consider q,z feasible for (6.6). We will show that q and x i = P j∈Q z ij are feasible for (6.5) with the same objective value. In fact all constraints of (6.5) are readily satisfied by construction. 
To see that the objectives match, notice that ifq h = 1 then the third constraint in (6.6) implies that P i∈X z ih = k, which means that z ij = 0 for all i∈ X and all j 6= h. Therefore, x i q j = X l∈Q z il q j =z ih q j =z ij . This last equality is because both are 0 whenj6=h. This shows that the transformation preserves the objective function value, completing the proof. Given this transformation to a mixed-integer linear program (MILP), we now show how we can apply our decomposition technique on the MILP to obtain significant speedups for Bayesian games with multiple follower types. 64 6.3 Decomposition for Multiple Adversaries The MILP developed in the previous section handles only one follower. Since our security sce- nario contains multiple follower (robber) types, we change the response function for the follower from a pure strategy into a weighted combination over various pure follower strategies where the weights are probabilities of occurrence of each of the follower types. 6.3.1 Decomposed MIQP To admit multiple adversaries in our framework, we modify the notation defined in the previous section to reason about multiple follower types. We denote byx the vector of strategies of the leader andq l the vector of strategies of followerl, withL denoting the index set of follower types. We also denote byX andQ the index sets of leader and followerl’s pure strategies, respectively. We also index the payoff matrices on each followerl, considering the matricesR l andC l . Using this modified notation, we characterize the optimal solution of follower l’s problem given the leaders k-uniform policyx, with the following optimality conditions: X j∈Q q l j = 1 a l − X i∈X 1 k C l ij x i ≥ 0 q l j (a l − X i∈X 1 k C l ij x i ) = 0 q l j ≥ 0 Again, considering only optimal pure strategies for followerl’s problem we can linearize the complementarity constraint above. We incorporate these constraints on the leader’s problem that 65 selects the optimal k-uniform policy. Therefore, given a priori probabilities p l , with l∈ L of facing each follower, the leader solves the following problem: max x,q X i∈X X l∈L X j∈Q p l k R l ij x i q l j s.t. P i x i =k P j∈Q q l j = 1 0≤ (a l − P i∈X 1 k C l ij x i )≤ (1−q l j )M x i ∈{0,1,....,k} q l j ∈{0,1}. (6.7) Problem (6.7) for a Bayesian game with multiple follower types is indeed equivalent to Prob- lem (6.5) on the payoff matrix obtained from the Harsanyi transformation of the game. In fact, every pure strategy j in Problem (6.5) corresponds to a sequence of pure strategies j l , one for each followerl∈ L. This means thatq j = 1 if and only ifq l j l = 1 for alll∈ L. In addition, given the a priori probabilitiesp l of facing playerl, the reward in the Harsanyi transformation payoff table isR ij = P l∈L p l R l ij l . The same relation holds betweenC andC l . These relations between a pure strategy in the equivalent normal form game and pure strategies in the individual games with each followers are key in showing these problems are equivalent. 66 6.3.2 Decomposed MILP We can linearize the quadratic programming problem 6.7 through the change of variablesz l ij = x i q l j , obtaining the following problem max q,z P i∈X P l∈L P j∈Q p l k R l ij z l ij s.t. P i∈X P j∈Q z l ij =k P j∈Q z l ij ≤k kq l j ≤ P i∈X z l ij ≤k P j∈Q q l j = 1 0≤ (a l − P i∈X 1 k C l ij ( P h∈Q z l ih ))≤ (1−q l j )M P j∈Q z l ij = P j∈Q z 1 ij z l ij ∈{0,1,....,k} q l j ∈{0,1} (6.8) Theorem 3 Problems (6.7) and (6.8) are equivalent. 
Proof: Consider x, q l , a l with l ∈ L a feasible solution of (6.7). We will show that q l , a l , z l ij =x i q l j is a feasible solution of (6.8) of same objective function value. The equivalence of the objective functions, and constraints 4, 7 and 8 of (6.8) are satisfied by construction. The fact that P j∈Q z l ij =x i as P j∈Q q l j = 1 explains constraints 1, 2, 5 and 6 of (6.8). Constraint 3 of (6.8) is satisfied because P i∈X z l ij =kq l j . Lets now considerq l ,z l ,a l feasible for (6.8). We will show thatq l ,a l andx i = P j∈Q z 1 ij are feasible for (6.7) with the same objective value. In fact all constraints of (6.7) are readily satisfied 67 by construction. To see that the objectives match, notice for eachl oneq l j must equal 1 and the rest equal 0. Let us say thatq l j l = 1, then the third constraint in (6.8) implies that P i∈X z l ij l =k, which means thatz l ij = 0 for alli∈X and allj6=j l . In particular this implies that x i = X j∈Q z 1 ij =z 1 ij 1 =z l ij l , this last equality from constraint 6 of (6.8). Thereforex i q l j = z l ij l q l j = z l ij . This last equality is because both are 0 whenj6= j l . This shows that the transformation preserves the objective function value, completing the proof. We can therefore solve this equivalent linear integer program with efficient integer program- ming packages which can handle problems with thousands of integer variables. We implemented the decomposed MILP and the results are shown in the following section. 6.4 Experimental results The patrolling domain and the payoffs for the associated game are detailed in Sections 2.3 and 6.1. We performed experiments for this game in worlds of three and four houses with patrols consisting of two houses. The description given in Section 2.3 is used to generate a base case for both the security agent and robber payoff functions. The payoff tables for additional robber types are constructed and added to the game by adding a random distribution of varying size to the payoffs in the base case. All games are normalized so that, for each robber type, the minimum and maximum payoffs to the security agent and robber are 0 and 1, respectively. Using the data generated, we performed the experiments using four methods for generating the security agent’s strategy: 68 • uniform randomization • ASAP • the multiple linear programs method from [Conitzer and Sandholm, 2006] (to find the true optimal strategy) • the highest reward Bayes-Nash equilibrium, found using the MIP-Nash algorithm [Sand- holm et al., 2005] The last three methods were applied using CPLEX 8.1. Because the last two methods are de- signed for normal-form games rather than Bayesian games, the games were first converted using the Harsanyi transformation [Harsanyi and Selten, 1972]. The uniform randomization method is simply choosing a uniform random policy over all possible patrol routes. We use this method as a simple baseline to measure the performance of other procedures. We anticipated that the uniform policy would perform reasonably well since maximum-entropy policies have been shown to be effective in multiagent security domains. The highest-reward Bayes-Nash equilibria were used in order to demonstrate the higher reward gained by looking for an optimal policy rather than an equilibria in Stackelberg games such as our security domain. 
Based on our experiments we present three sets of graphs to demonstrate (1) the runtime of ASAP compared to other common methods for finding a strategy, (2) the reward guaranteed by ASAP compared to other methods, and (3) the effect of varying the parameterk, the size of the multiset, on the performance of ASAP. In the first two sets of graphs, ASAP is run using a multiset of 80 elements; in the third set this number is varied. The first set of graphs, shown in Figure 6.1 shows the runtime graphs for three-house (left column) and four-house (right column) domains. Each of the three rows of graphs corresponds 69 Figure 6.1: Runtimes for various algorithms on problems of 3 and 4 houses. 70 to a different randomly-generated scenario. The x-axis shows the number of robber types the security agent faces and they-axis of the graph shows the runtime in seconds. All experiments that were not concluded in 30 minutes (1800 seconds) were cut off. The runtime for the uniform policy is always negligible irrespective of the number of adversaries and hence is not shown. The ASAP algorithm clearly outperforms the optimal, multiple-LP method as well as the MIP-Nash algorithm for finding the highest-reward Bayes-Nash equilibrium with respect to run- time. For a domain of three houses, the optimal method cannot reach a solution for more than seven robber types, and for four houses it cannot solve for more than six types within the cutoff time in any of the three scenarios. MIP-Nash solves for even fewer robber types within the cutoff time. On the other hand, ASAP runs much faster, and is able to solve for at least 20 adversaries for the three-house scenarios and for at least 12 adversaries in the four-house scenarios within the cutoff time. The runtime of ASAP does not increase strictly with the number of robber types for each scenario, but in general, the addition of more types increases the runtime required. The second set of graphs, Figure 6.2, shows the reward to the patrol agent given by each method for three scenarios in the three-house (left column) and four-house (right column) do- mains. This reward is the utility received by the security agent in the patrolling game, and not as a percentage of the optimal reward, since it was not possible to obtain the optimal reward as the number of robber types increased. The uniform policy consistently provides the lowest reward in both domains; while the optimal method of course produces the optimal reward. The ASAP method remains consistently close to the optimal, even as the number of robber types increases. The highest-reward Bayes-Nash equilibria, provided by the MIP-Nash method, produced rewards higher than the uniform method, but lower than ASAP. This difference clearly illustrates the gains 71 Figure 6.2: Reward for various algorithms on problems of 3 and 4 houses. 72 in the patrolling domain from committing to a strategy as the leader in a Stackelberg game, rather than playing a standard Bayes-Nash strategy. The third set of graphs, shown in Figure 6.3 shows the effect of the multiset size on runtime in seconds (left column) and reward (right column), again expressed as the reward received by the security agent in the patrolling game, and not a percentage of the optimal reward. Results here are for the three-house domain. The trend is that as as the multiset size is increased, the runtime and reward level both increase. 
Not surprisingly, the reward increases monotonically as the multiset size increases, but what is interesting is that there is relatively little benefit to using a large multiset in this domain. In all cases, the reward given by a multiset of 10 elements was within at least 96% of the reward given by an 80-element multiset. The runtime does not always increase strictly with the multiset size; indeed in one example (scenario 2 with 20 robber types), using a multiset of 10 elements took 1228 seconds, while using 80 elements only took 617 seconds. In general, runtime should increase since a larger multiset means a larger domain for the variables in the MILP, and thus a larger search space. However, an increase in the number of variables can sometimes allow for a policy to be constructed more quickly due to more flexibility in the problem. 73 Figure 6.3: Reward for ASAP using multisets of 10, 30, and 80 elements 74 Chapter 7 Partial Adversary Model: Exact Solution The ASAP procedure introduced in the previous chapter can operate directly on the untrans- formed Bayesian game to find a high-quality strategy for the leader; however it does not guaran- tee an optimal solution. In this chapter we introduce the DBOSS algorithm which guarantees an optimal solution. 7.1 Approach Similar to the ASAP procedure, one advantage of the DOBSS algorithm is that it operates directly on the compact Bayesian representation, without requiring the Harsanyi transformation. This is because the different follower types are independent of each other. Hence, evaluating the leader strategy against a Harsanyi-transformed game matrix is equivalent to evaluating against each of the game matrices for the individual follower types. This independence property is exploited in DOBSS to yield a decomposition scheme. This decomposition is analogous to that of the ASAP method shown in the previous chapter; however, while ASAP provides limited randomization procedure for choosing the leader’s strategy, DOBSS provides the optimal solution without any bounds on randomization. In addition, our experiments show that ASAP is significantly slower 75 than DOBSS as the number of follower types increases and may be unable to provide a solution in many cases (where DOBSS is able to). Note that the LP method introduced by [Conitzer and Sandholm, 2006] to compute optimal Stackelberg policies is unlikely to be decomposable into a small number of games as it was shown to be NP-hard for Bayes-Nash problems; DOBSS has the advantage of decomposition, but must work with mixed-integer linear programs (MILPs) rather than LPs. Finally, note that DOBSS requires the solution of only one optimization problem, rather than solving a series of problems as in the LP method of [Conitzer and Sandholm, 2006]. For a single follower type, we simply take the mixed strategy for the leader that gives the highest payoff when the follower plays a reward-maximizing strategy. We need only to consider the reward-maximizing pure strategies of the followers, since for a given fixed strategyx of the leader, each follower type faces a problem with fixed linear rewards. If a mixed strategy is optimal for the follower, then so are all the pure strategies in the support of that mixed strategy. In the following subsections, we first define the problem in its most intuitive form as a mixed- integer quadratic program, and then show how this problem can be decomposed and converted into an MILP. 
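The independence property just mentioned is easy to check numerically: for any leader mixed strategy x and any joint pure response, one per follower type, weighting the per-type values by the priors p^l gives exactly the value of x against the corresponding column of the Harsanyi-transformed matrix. A small sketch with randomly generated matrices, all names illustrative:

```python
import numpy as np
from itertools import product

def harsanyi_column(Rs, probs, joint_col):
    """One column of the Harsanyi-transformed leader matrix: the per-type
    columns j_l weighted by the priors p^l."""
    return sum(p * Rl[:, j] for p, Rl, j in zip(probs, Rs, joint_col))

rng = np.random.default_rng(0)
Rs = [rng.random((3, 2)) for _ in range(2)]     # two follower types, 3 leader strategies
probs = [0.3, 0.7]
x = np.array([0.2, 0.5, 0.3])                   # some leader mixed strategy

for col in product(range(2), repeat=2):         # every joint pure response (j_1, j_2)
    harsanyi_value = x @ harsanyi_column(Rs, probs, col)
    decomposed_value = sum(p * (x @ Rl[:, j]) for p, Rl, j in zip(probs, Rs, col))
    assert np.isclose(harsanyi_value, decomposed_value)
print("per-type evaluation matches the Harsanyi-transformed evaluation")
```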
For a detailed discussion of MILPs, please see one of many references on the topic, such as [Wolsey, 1998].

7.1.1 Mixed-Integer Quadratic Program

We begin with the case of a single type of follower. Let the leader be the row player and the follower the column player. We denote by $x$ the vector of strategies of the leader and by $q$ the vector of strategies of the follower. We also denote by $X$ and $Q$ the index sets of the leader's and follower's pure strategies, respectively. The payoff matrices $R$ and $C$ are defined such that $R_{ij}$ is the reward of the leader and $C_{ij}$ is the reward of the follower when the leader takes pure strategy $i$ and the follower takes pure strategy $j$.

We first fix the policy of the leader to some policy $x$. The value $x_i$ is the proportion of times pure strategy $i$ is used in the policy. We formulate the optimization problem the follower solves to find its optimal response to $x$ as the following linear program:

\[
\begin{aligned}
\max_{q} \quad & \sum_{j \in Q} \sum_{i \in X} C_{ij} x_i q_j \\
\text{s.t.} \quad & \sum_{j \in Q} q_j = 1 \\
& q \ge 0.
\end{aligned}
\tag{7.1}
\]

The objective function maximizes the follower's expected reward given $x$, while the constraints make feasible any mixed strategy $q$ for the follower. The dual of this linear program is:

\[
\begin{aligned}
\min_{a} \quad & a \\
\text{s.t.} \quad & a \ge \sum_{i \in X} C_{ij} x_i, \quad j \in Q.
\end{aligned}
\tag{7.2}
\]

From strong duality and complementary slackness we obtain that the follower's maximum reward value, $a$, is the value of every pure strategy with $q_j > 0$, that is, of every pure strategy in the support of the optimal mixed strategy. Therefore each of these pure strategies is optimal. Optimal solutions to the follower's problem are characterized by the linear programming optimality conditions: primal feasibility constraints in (7.1), dual feasibility constraints in (7.2), and complementary slackness

\[
q_j \left( a - \sum_{i \in X} C_{ij} x_i \right) = 0, \quad j \in Q.
\]

These conditions must be included in the problem solved by the leader in order to consider only best responses by the follower to the policy $x$.

The leader seeks the solution $x$ that maximizes its own payoff, given that the follower uses an optimal response $q(x)$. Therefore the leader solves the following problem:

\[
\begin{aligned}
\max_{x} \quad & \sum_{i \in X} \sum_{j \in Q} R_{ij}\, q(x)_j\, x_i \\
\text{s.t.} \quad & \sum_{i \in X} x_i = 1 \\
& x_i \in [0,1].
\end{aligned}
\tag{7.3}
\]

Problem (7.3) maximizes the leader's reward given the follower's best response ($q_j$ for the fixed leader policy $x$, hence denoted $q(x)_j$). We complete this problem by including the characterization of $q(x)$ through the linear programming optimality conditions. To simplify writing the complementary slackness conditions, we constrain $q(x)$ to be an optimal pure strategy by considering only integer solutions of $q(x)$. The leader's problem becomes:

\[
\begin{aligned}
\max_{x,q} \quad & \sum_{i \in X} \sum_{j \in Q} R_{ij} x_i q_j \\
\text{s.t.} \quad & \sum_{i} x_i = 1 \\
& \sum_{j \in Q} q_j = 1 \\
& 0 \le \left( a - \sum_{i \in X} C_{ij} x_i \right) \le (1 - q_j) M \\
& x_i \in [0,1] \\
& q_j \in \{0,1\}.
\end{aligned}
\tag{7.4}
\]

Here, the constant $M$ is some large number. The first and fourth constraints enforce a feasible mixed policy for the leader, and the second and fifth constraints enforce a feasible pure strategy for the follower. The third constraint enforces dual feasibility of the follower's problem (leftmost inequality) and the complementary slackness constraint for an optimal pure strategy $q$ for the follower (rightmost inequality). In fact, since only one pure strategy can be selected by the follower, say $q_h = 1$, this last constraint enforces that $a = \sum_{i \in X} C_{ih} x_i$, imposing no additional constraint for all other pure strategies, which have $q_j = 0$.

We conclude this subsection by noting that Problem (7.4) is an integer program with a non-convex quadratic objective.
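Before moving to the decomposed formulation, the following sketch illustrates the single-follower case through a simple enumeration approach rather than the MIQP (7.4) itself: for each candidate follower pure strategy j, solve one LP that maximizes the leader's reward subject to j being a best response, in the spirit of the multiple-LPs method of [Conitzer and Sandholm, 2006]. The payoff matrices and the use of scipy are illustrative assumptions, not part of the thesis.

```python
import numpy as np
from scipy.optimize import linprog

R = np.array([[2.0, 0.0], [1.0, 3.0], [0.0, 2.0]])   # leader rewards R_ij (hypothetical)
C = np.array([[1.0, 2.0], [3.0, 0.0], [0.0, 2.0]])   # follower rewards C_ij (hypothetical)
n, m = R.shape                                       # |X| leader strategies, |Q| follower strategies

best_value, best_x = -np.inf, None
for j in range(m):
    # Best-response constraints: sum_i (C_ih - C_ij) x_i <= 0 for all h != j.
    A_ub = np.array([C[:, h] - C[:, j] for h in range(m) if h != j])
    b_ub = np.zeros(A_ub.shape[0])
    res = linprog(-R[:, j],                          # linprog minimizes, so negate
                  A_ub=A_ub, b_ub=b_ub,
                  A_eq=np.ones((1, n)), b_eq=[1.0],
                  bounds=[(0, 1)] * n, method="highs")
    if res.success and -res.fun > best_value:        # skip j if no x makes it a best response
        best_value, best_x = -res.fun, res.x

print(best_value, best_x)   # leader's optimal commitment and its expected reward
```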
Efficient solution methods for non-linear, non-convex integer programs remain a challenging research question. In the next section we show how this formulation can be decomposed in order to handle multiple follower types without requiring the Harsanyi transformation.

7.1.2 Decomposed MIQP

The MIQP developed in the previous section handles only one follower. To extend this Stackelberg model to handle multiple follower types we follow a Bayesian approach and assume that there is an a priori probability $p^l$ that a follower of type $l$ will appear, with $L$ denoting the set of follower types. We also adapt the notation defined in the previous section to the multiple-type setting: we denote by $q^l$ the vector of strategies of follower $l$. We also denote by $X$ and $Q$ the index sets of the leader's and follower $l$'s pure strategies, respectively, and we index the payoff matrices on each follower $l$, considering the matrices $R^l$ and $C^l$.

Using this modified notation, we characterize the optimal solution of follower $l$'s problem given the leader's policy $x$ with the following optimality conditions:

\[
\begin{aligned}
& \sum_{j \in Q} q^l_j = 1 \\
& a^l - \sum_{i \in X} C^l_{ij} x_i \ge 0 \\
& q^l_j \left( a^l - \sum_{i \in X} C^l_{ij} x_i \right) = 0 \\
& q^l_j \ge 0.
\end{aligned}
\]

Again, considering only optimal pure strategies for follower $l$'s problem, we can linearize the complementarity constraint above. We incorporate these constraints into the leader's problem, which selects the optimal policy. Therefore, given a priori probabilities $p^l$, with $l \in L$, of facing each follower type, the leader solves the following problem:

\[
\begin{aligned}
\max_{x,q} \quad & \sum_{i \in X} \sum_{l \in L} \sum_{j \in Q} p^l R^l_{ij} x_i q^l_j \\
\text{s.t.} \quad & \sum_{i} x_i = 1 \\
& \sum_{j \in Q} q^l_j = 1 \\
& 0 \le \left( a^l - \sum_{i \in X} C^l_{ij} x_i \right) \le (1 - q^l_j) M \\
& x_i \in [0,1] \\
& q^l_j \in \{0,1\}.
\end{aligned}
\tag{7.5}
\]

For a Bayesian Stackelberg game with multiple follower types, we will show that the decomposed MIQP of Problem (7.5) above is equivalent to the MIQP of Problem (7.4) on the payoff matrix obtained from the Harsanyi transformation of the game.

The Harsanyi transformation introduces a chance node that determines the follower's type according to the a priori probabilities $p^l$. For the leader it is as if there is a single follower whose action set is the cross product of the actions of every follower type. Each pure strategy $j$ of the single follower in the transformed game corresponds to a vector of pure strategies $(q^1_{j_1}, \ldots, q^{|L|}_{j_{|L|}})$, one for each follower type, and the rewards for these actions need to be weighted by the probabilities of occurrence of each follower type. Thus, for a Harsanyi pure strategy $j$ corresponding to $(j_1, \ldots, j_{|L|})$, the reward matrices for the Harsanyi transformation are constructed from the individual reward matrices as follows, as in [Harsanyi and Selten, 1972]:

\[
R_{ij} = \sum_{l \in L} p^l R^l_{ij_l} \quad \text{and} \quad C_{ij} = \sum_{l \in L} p^l C^l_{ij_l}.
\tag{7.6}
\]

Theorem 4 Problem (7.5) for a Bayesian game with multiple follower types is equivalent to Problem (7.4) on the payoff matrices given by the Harsanyi transformation (7.6).

Proof: We first notice that if a pure strategy $j$ corresponds to $(j_1, \ldots, j_{|L|})$ and $q_j = 1$, then

\[
\sum_{i \in X} \sum_{h \in Q} x_i R_{ih} q_h = \sum_{i \in X} x_i \sum_{l \in L} p^l R^l_{ij_l} = \sum_{i \in X} \sum_{l \in L} p^l x_i \sum_{h \in Q} R^l_{ih} q^l_h.
\tag{7.7}
\]

Consider $x, q^l, a^l$ with $l \in L$ a feasible solution to Problem (7.5). From its second constraint and the integrality of $q$ we have that for every $l$ there is a $j_l$ such that $q^l_{j_l} = 1$. Let $j$ be the Harsanyi pure strategy that corresponds to $(j_1, \ldots, j_{|L|})$. Note that $q_j = 1$ and thus the objectives match. We show that $x$, $q$ such that $q_j = 1$ and $q_h = 0$ for $h \ne j$, and $a = \sum_{l \in L} p^l a^l$, is feasible for Problem (7.4) and of the same objective function value. Constraints 1, 2, 4, 5 in (7.4) are easily satisfied by the proposed solution.
Constraint 3 in (7.5) means that $\sum_{i \in X} x_i C^l_{ij_l} \ge \sum_{i \in X} x_i C^l_{ih}$ for every $h \in Q$ and $l \in L$, leading to

\[
\sum_{i \in X} x_i C_{ij} = \sum_{l \in L} p^l \sum_{i \in X} x_i C^l_{ij_l} \ge \sum_{l \in L} p^l \sum_{i \in X} x_i C^l_{ih_l} = \sum_{i \in X} x_i C_{ih'},
\]

for any pure strategy $(h_1, \ldots, h_{|L|})$ of the followers and $h'$ its corresponding pure strategy in the Harsanyi game. We conclude this part by showing that

\[
\sum_{i \in X} x_i C_{ij} = \sum_{l \in L} p^l \sum_{i \in X} x_i C^l_{ij_l} = \sum_{l \in L} p^l a^l = a.
\]

Now we start with $(x, q, a)$ feasible for (7.4). This means that $q_j = 1$ for some pure strategy $j$. Let $(j_1, \ldots, j_{|L|})$ be the corresponding strategies for each follower type $l$. We show that $x$, $q^l$ with $q^l_{j_l} = 1$ and $q^l_h = 0$ for $h \ne j_l$, and $a^l = \sum_{i \in X} x_i C^l_{ij_l}$ with $l \in L$, is feasible for (7.5). By construction this solution satisfies constraints 1, 2, 4, 5 and has a matching objective function value. We now show that $\sum_{i \in X} x_i C^l_{ij_l} \ge \sum_{i \in X} x_i C^l_{ih}$ for all $h \in Q$ and $l \in L$. Assume it does not hold: that is, there is an $\hat{l} \in L$ and $\hat{h} \in Q$ such that $\sum_{i \in X} x_i C^{\hat{l}}_{ij_{\hat{l}}} < \sum_{i \in X} x_i C^{\hat{l}}_{i\hat{h}}$. Then, multiplying by $p^{\hat{l}}$ and adding $\sum_{l \ne \hat{l}} p^l \sum_{i \in X} x_i C^l_{ij_l}$ to both sides of the inequality, we obtain

\[
\sum_{i \in X} x_i C_{ij} < \sum_{i \in X} x_i \left( \sum_{l \ne \hat{l}} p^l C^l_{ij_l} + p^{\hat{l}} C^{\hat{l}}_{i\hat{h}} \right).
\]

The right-hand side equals $\sum_{i \in X} x_i C_{ih}$ for the pure strategy $h$ that corresponds to $(j_1, \ldots, \hat{h}, \ldots, j_{|L|})$, which is a contradiction, since constraint 3 of (7.4) implies that $\sum_{i \in X} x_i C_{ij} \ge \sum_{i \in X} x_i C_{ih}$ for all $h$.

7.1.3 Decomposed MILP

We can linearize the quadratic program of Problem (7.5) through the change of variables $z^l_{ij} = x_i q^l_j$, obtaining the following problem:

\[
\begin{aligned}
\max_{q,z} \quad & \sum_{i \in X} \sum_{l \in L} \sum_{j \in Q} p^l R^l_{ij} z^l_{ij} \\
\text{s.t.} \quad & \sum_{i \in X} \sum_{j \in Q} z^l_{ij} = 1 \\
& \sum_{j \in Q} z^l_{ij} \le 1 \\
& q^l_j \le \sum_{i \in X} z^l_{ij} \le 1 \\
& \sum_{j \in Q} q^l_j = 1 \\
& 0 \le \left( a^l - \sum_{i \in X} C^l_{ij} \Big( \sum_{h \in Q} z^l_{ih} \Big) \right) \le (1 - q^l_j) M \\
& \sum_{j \in Q} z^l_{ij} = \sum_{j \in Q} z^1_{ij} \\
& z^l_{ij} \in [0,1] \\
& q^l_j \in \{0,1\}.
\end{aligned}
\tag{7.8}
\]

Theorem 5 Problems (7.5) and (7.8) are equivalent.

Proof: Consider $x, q^l, a^l$ with $l \in L$ a feasible solution of (7.5). We will show that $q^l$, $a^l$, $z^l_{ij} = x_i q^l_j$ is a feasible solution of (7.8) with the same objective function value. The equivalence of the objective functions, and constraints 4, 7 and 8 of (7.8), are satisfied by construction. The fact that $\sum_{j \in Q} z^l_{ij} = x_i$, since $\sum_{j \in Q} q^l_j = 1$, explains constraints 1, 2, 5 and 6 of (7.8). Constraint 3 of (7.8) is satisfied because $\sum_{i \in X} z^l_{ij} = q^l_j$.

Let us now consider $q^l, z^l, a^l$ feasible for (7.8). We will show that $q^l$, $a^l$ and $x_i = \sum_{j \in Q} z^1_{ij}$ are feasible for (7.5) with the same objective value. In fact, all constraints of (7.5) are readily satisfied by construction. To see that the objectives match, notice that for each $l$ exactly one $q^l_j$ must equal 1 and the rest equal 0. Say $q^l_{j_l} = 1$; then the third constraint in (7.8) implies that $\sum_{i \in X} z^l_{ij_l} = 1$, which means that $z^l_{ij} = 0$ for all $i \in X$ and all $j \ne j_l$. In particular this implies that

\[
x_i = \sum_{j \in Q} z^1_{ij} = z^1_{ij_1} = z^l_{ij_l},
\]

the last equality following from constraint 6 of (7.8). Therefore $x_i q^l_j = z^l_{ij_l} q^l_j = z^l_{ij}$; this last equality holds because both sides are 0 when $j \ne j_l$. Effectively, constraint 6 ensures that all the adversaries are computing their best responses against the same fixed policy of the agent. This shows that the transformation preserves the objective function value, completing the proof.

We can therefore solve this equivalent mixed-integer linear program with efficient integer programming packages, which can handle problems with thousands of integer variables.
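As one concrete illustration, here is a minimal sketch of Problem (7.8) written with the PuLP modeling library and its default CBC solver; the two-strategy, two-type payoff data, the variable names, and the big-M value are hypothetical choices for the example, not values from the thesis.

```python
import pulp

# Hypothetical two-strategy, two-type game (not data from the thesis).
R = {0: [[2, 0], [1, 3]], 1: [[1, 2], [4, 0]]}   # leader rewards R^l_ij
C = {0: [[1, 2], [3, 0]], 1: [[0, 3], [2, 1]]}   # follower rewards C^l_ij
p = {0: 0.6, 1: 0.4}                             # a priori type probabilities p^l
L, X, Q = range(2), range(2), range(2)           # types, leader moves, follower moves
M = 10000                                        # big-M constant of (7.8)

prob = pulp.LpProblem("DOBSS_sketch", pulp.LpMaximize)
z = pulp.LpVariable.dicts("z", (L, X, Q), lowBound=0, upBound=1)   # z^l_ij = x_i q^l_j
q = pulp.LpVariable.dicts("q", (L, Q), cat="Binary")               # pure best responses
a = pulp.LpVariable.dicts("a", L)                                  # follower values a^l

# Objective: expected leader reward over follower types.
prob += pulp.lpSum(p[l] * R[l][i][j] * z[l][i][j] for l in L for i in X for j in Q)

for l in L:
    prob += pulp.lpSum(z[l][i][j] for i in X for j in Q) == 1      # constraint 1
    prob += pulp.lpSum(q[l][j] for j in Q) == 1                    # constraint 4
    for i in X:
        prob += pulp.lpSum(z[l][i][j] for j in Q) <= 1             # constraint 2
        if l > 0:                                                  # constraint 6: same x for all types
            prob += (pulp.lpSum(z[l][i][j] for j in Q)
                     == pulp.lpSum(z[0][i][j] for j in Q))
    for j in Q:
        prob += q[l][j] <= pulp.lpSum(z[l][i][j] for i in X)       # constraint 3 (left)
        prob += pulp.lpSum(z[l][i][j] for i in X) <= 1             # constraint 3 (right)
        val = pulp.lpSum(C[l][i][j] * pulp.lpSum(z[l][i][h] for h in Q) for i in X)
        prob += a[l] - val >= 0                                    # dual feasibility
        prob += a[l] - val <= (1 - q[l][j]) * M                    # linearized complementary slackness

prob.solve()
x = [sum(z[0][i][j].value() for j in Q) for i in X]                # recover leader strategy x_i
print(pulp.LpStatus[prob.status], x)
```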
We implemented the decomposed MILP, and the results are shown in the following section.

7.2 Experimental Results

We performed two sets of experiments. The first set compares the runtime of DOBSS against ASAP as well as the multiple linear programs method of [Conitzer and Sandholm, 2006]. Note that ASAP provides an approximation, while the other two methods provide the optimal solution. In addition, the method of [Conitzer and Sandholm, 2006] requires a normal-form game, and so the Harsanyi transformation is required as an initial step; this preprocessing time is not recorded here. In this set of experiments, we created games in worlds of three and four houses with patrols consisting of two houses, with payoff tables as described in the previous subsection.

The set of graphs in Figure 7.1 shows the runtime results for the three-house (left column) and four-house (right column) domains. Each of the graphs corresponds to a different randomly-generated scenario. The x-axis shows the number of follower types the leader faces and the y-axis shows the runtime in seconds. All experiments that were not concluded in 30 minutes (1800 seconds) were cut off.

Figure 7.1: DOBSS vs. ASAP and multiple LP methods

DOBSS outperforms the multiple-LPs method with respect to runtime. For a domain of three houses, the LPs method cannot reach a solution for more than seven follower types, and for four houses it cannot solve for more than six types within the cutoff time in any of the three scenarios. On the other hand, DOBSS runs much faster, and is able to solve for at least 20 adversaries in the three-house scenarios and for at least 12 adversaries in the four-house scenarios within the cutoff time. The reason for the speedup of the DOBSS procedure over the previous procedure, i.e., the multiple-LPs method, is as follows: the previous approach solves an LP over the exponentially blown-up Harsanyi-transformed matrix for each joint strategy of the adversaries, and these joint strategies are themselves exponential in number. In contrast, the DOBSS procedure introduces one integer variable per strategy for each new adversary. The branch-and-bound procedure in the DOBSS formulation then incurs at most one exponential blowup, thus saving the other exponent. For instance, with three pure strategies per follower type and ten follower types, the Harsanyi-transformed game has 3^10 = 59,049 joint follower pure strategies (and the multiple-LPs method must solve that many LPs), whereas the DOBSS formulation introduces only 30 integer variables.

The runtime of DOBSS does not always increase strictly as the number of follower types increases. However, the trend is that the addition of more types increases the runtime. The runtimes of DOBSS and ASAP are comparable in all scenarios; however, DOBSS actually finds the optimal solution, while ASAP is a non-optimal procedure that only approximates the solution. For example, when the problem is scaled up in the number of leader strategies, as discussed below, the limited-randomization character of ASAP creates difficulties in obtaining a feasible solution.

This experiment clearly shows the speedups of DOBSS over the multiple-LPs method, although for these examples its runtime is comparable to that of the ASAP procedure. We now present a second set of experimental results in Figure 7.2 to demonstrate the speedups of DOBSS over ASAP as the number of leader strategies increases. In this experiment, for a single scenario, the number of houses was varied between three and seven and the number of follower types was varied from one to nine. Patrols of length two were again used; thus the number of pure strategies for the leader ranged from three to 21 (and the number of pure strategies for each follower ranged from three to seven).
In this experiment, DOBSS outperformed ASAP in every instance, except in cases where both methods went past 1800 seconds and were cut off. In many cases, ASAP was unable to return a feasible solution; DOBSS returned a solution in all of these cases.

Figure 7.2: DOBSS vs. ASAP for larger strategy spaces

Figure 7.3 illustrates the changes in the leader's patrolling strategies in a single scenario (three houses, Scenario 1, from Figure 7.1) as the number of follower types increased; we focus here on the cases of two, six, and ten follower types. This figure shows how, in this experimental domain, the addition of follower types caused a significant difference in the strategies chosen by the leader, and did not simply cause small variations on a single strategy. In other words, the results for this experimental domain are meaningful in general because the addition of more follower types had a clear effect on the strategy chosen by the leader; they were not irrelevant additions that would be ignored by the MILP solver. For example, with two follower types, strategies 1 and 6 were chosen with the highest probability, whereas with ten follower types, strategy 1 was not chosen at all.

Figure 7.3: Effect of additional followers on the leader's strategy

Chapter 8 Related Work

There are five broad areas of related work, presented below. First, I present literature on randomized policies in the decision-theoretic frameworks of MDPs and POMDPs. In the next section I discuss work related to my security algorithms developed using game-theoretic techniques. In the third section I focus on the application of randomization to preserve the privacy of agents. The next section is devoted to related work on modeling patrol agents. In the final section I describe the use of randomization in the design of so-called randomized algorithms.

8.1 Randomized policies for MDPs/POMDPs

Decision-theoretic frameworks are extremely useful and powerful modeling tools that are increasingly applied to build agents and agent teams that can be deployed in the real world. The main advantages of modeling agents and agent teams using these tools are the following:

• The real world is uncertain, and decision-theoretic frameworks can model such real-world environmental uncertainty. In particular, the MDP framework [Puterman, 1994] can model stochastic actions and hence can handle transition uncertainty. The POMDP [Cassandra et al., 1994; Kaelbling et al., 1995] and decentralized POMDP [Pynadath and Tambe, 2002; Becker et al., 2003; Goldman and Zilberstein, 2003] frameworks are more general and can model both action and observation uncertainty.

• Efficient algorithms have been devised for generating optimal plans for agents and agent teams modeled using these frameworks [Puterman, 1994; Cassandra et al., 1997].

However, these optimal policy generation algorithms have focused on maximizing the total expected reward while taking the environmental uncertainties into account. Such optimal policies are useful when the agents act in environments where security is not an important issue. As agents get increasingly deployed in the real world, they will have to act in adversarial domains, often without any adversary model available. Hence, randomization of policies becomes critical. Randomization of policies in decision-theoretic frameworks has received little attention as a goal in itself, and is primarily seen as a side-effect of attaining other objectives.
CMDPs (constrained MDPs) are a standard framework for modeling resource-constrained agents that act in uncertain environments [Paruchuri et al., 2004; Dolgov and Durfee, 2003a]. Efficient algorithms, typically based on linear programming, have been developed to find optimal policies that ensure the resource constraints are not violated [Altman, 1999; Dolgov and Durfee, 2003c]. However, randomized optimal policies occur in CMDPs only as a side-effect of the resource constraints. Additionally, the amount of randomization is a function of the resource constraints and cannot be varied as needed for security purposes. Randomized policies in team settings, which arise as a side-effect of resource constraints, lead to miscoordination and hence lower team rewards [Paruchuri et al., 2004]. The effect of this miscoordination can be mitigated through costly communication actions or through techniques that take the miscoordination costs into consideration while generating optimal policies [Paruchuri et al., 2004]. Due to the miscoordination costs, deterministic policies are preferred over randomized policies in team settings, and research in CMDPs has focused on developing optimal deterministic policies [Dolgov and Durfee, 2003c].

Similarly, randomized policies occur as a side-effect in POMDPs where there are memory constraints on the representation of policies. The policies in such POMDPs map directly from the most recent observation to an action and are referred to as memoryless policies. When considering such POMDP policies, randomized policies obtain higher expected reward than deterministic policies [Jaakkola et al., 1994]. In addition, it has been pointed out [Parr and Russell, 1995; Kaelbling et al., 1995] that memoryless deterministic policies tend to exhibit looping behavior. A desire to escape this undesirable behavior motivated methods for obtaining randomized memoryless policies [Bartlett and Baxter, 2000; Jaakkola et al., 1994]. Unfortunately, such randomization is unable to achieve the goal of maximizing expected entropy while attaining a certain threshold reward, because the focus there is on maximizing the expected reward.

8.2 Related work in game theory

In the second half of my thesis I dealt with adversarial domains where the adversaries' actions and payoffs are known but the exact adversary type is unknown to the security agent. I described the police patrol domain in detail and modeled it as a Bayesian Stackelberg game. Stackelberg games [Stackelberg, 1934; Roughgarden, 2001] are commonly used to model attacker-defender scenarios in security domains [Brown et al., 2006]. In particular, [Brown et al., 2006] develop algorithms to make critical infrastructure more resilient against terrorist attacks by modeling the scenario as a Stackelberg game. However, they do not address the issue of incomplete information about adversaries, whereas agents acting in the real world quite frequently face many such adversaries and have incomplete information about them. Bayesian games have been a popular choice for modeling such incomplete-information games [Conitzer and Sandholm, 2006; Brynielsson and Arnborg, 2004], and the corresponding solution concept is called the Bayes-Nash equilibrium [Fudenberg and Tirole, 1991].
Nash equilibrium [Nash, 1950] as a solution concept has received a great deal of attention, and many efficient techniques have been developed to obtain equilibrium solutions for normal-form games [Lemke and Howson, 1964; McKelvey et al., 2004; Savani and Stengel, 2004; Porter et al., 2004; Sandholm et al., 2005]. The Bayes-Nash equilibrium, which is the Nash equilibrium analogue for Bayesian games, can be found by converting the Bayesian game into a normal-form game using the Harsanyi transformation [Harsanyi and Selten, 1972] and then applying the efficient Nash equilibrium techniques [Sandholm et al., 2005]. However, the exponential complexity of the Harsanyi transformation itself is a bottleneck for finding the Bayes-Nash equilibrium. The Gala toolkit is one method for finding equilibrium solutions for Bayesian games [Koller and Pfeffer, 1995] without requiring the game to be represented in normal form via the Harsanyi transformation [Harsanyi and Selten, 1972]. Much work has been done on finding optimal Bayes-Nash equilibria for subclasses of Bayesian games, on finding single Bayes-Nash equilibria for general Bayesian games using the sequence-form representation [Koller et al., 1996; Koller and Pfeffer, 1997], and on approximate Bayes-Nash equilibria [Singh et al., 2004]. Unfortunately, these works do not address the issue of finding the optimal strategy for the agent to commit to in a Stackelberg-style scenario, which is the focus of the latter half of my thesis. It has been shown in [Stengel and Zamir, 2004] that for any generic two-player normal-form game, the Nash equilibrium provides a lower bound for the value of the optimal strategy that the agent commits to in a Stackelberg game.

Less attention has been paid to finding the optimal strategy to commit to in a Bayesian game (the Stackelberg scenario). The complexity of this problem was shown to be NP-hard in the general case [Conitzer and Sandholm, 2006]. My thesis therefore focused on the development of efficient solution techniques for Bayesian Stackelberg games. While developing the ASAP procedure I used k-uniform strategies. These strategies are similar to the k-uniform strategies of [Lipton et al., 2003]. While that work provides epsilon error bounds based on k-uniform strategies, their solution concept is still that of a Nash equilibrium, and they do not provide efficient algorithms for obtaining such k-uniform strategies. This contrasts with ASAP, where my emphasis is on a highly efficient approach that is not focused on equilibrium solutions. The main advantage of my ASAP solution is that the policies are simple and easy to implement in practice. I then converted the discrete strategy set of the leader in ASAP into a continuous strategy space to obtain the optimal leader strategies using the DOBSS algorithm.

While Bayesian Stackelberg games are useful for modeling and developing efficient solutions for security applications, they can model other scenarios as well. Traditionally, Stackelberg games have been used to model duopoly markets where a leader firm moves first, choosing a quantity, and a follower firm moves next and picks a quantity based on the leader's choice [Wikipedia, 2007]. These scenarios occur in markets where the leader has a monopoly over the industry and the follower is a new entrant into the market. The market can be modeled as a Bayesian Stackelberg game as the number of followers increases.
8.3 Randomization for Privacy

The next two paragraphs elaborate on the use of randomization to increase the privacy of agent strategies. Intentional randomization of agent strategies has been used as a technique to increase privacy. [Otterloo, 2005] deals with the intentional randomization of agent strategies to increase privacy in strategic game settings. However, the tradeoff structure between randomization and reward there is different from ours, requiring the development of different optimization techniques. Furthermore, that work focuses on equilibrium solutions and proving equilibrium properties, whereas my thesis focuses on providing efficient policy generation algorithms for both single agents and multi-agent teams acting in uncertain environments.

Randomization has also been used to increase the privacy of data transmission in sensor networks. For example, [Ozturk et al., 2004] describes a flexible routing strategy called phantom routing. This routing scheme was designed for maintaining source-location privacy in an energy-constrained sensor network. Phantom routing is a two-stage routing scheme that first consists of a directed walk along a random direction, followed by routing from the phantom source to the sink. The algorithm was tested on a panda-hunter game, and the randomization was shown to preserve source-location privacy. Yet another application of randomization is described in [Borisov and Waddle, 2005]. This paper defines an entropy-based anonymity metric and proposes a randomized routing protocol to increase anonymity in structured peer-to-peer networks for purposes of user privacy.

8.4 Randomization and Patrolling

The patrolling problem which motivated my work has recently received growing attention from the multiagent community due to its wide range of applications [Chevaleyre, 2004; Machado et al., 2002; Carroll et al., 2005; Lewis et al., 2005; Billante, 2003; Ruan et al., 2005; Beard and McLain, 2003]. [Gui and Mohapatra, 2005] describes a surveillance application using wireless sensor networks. The SENSTROL protocol described in this work focuses on limiting the energy consumption involved in patrolling, as a result of which deterministic sweeping patterns are generated by the virtual patroller. In [Chevaleyre, 2004; Machado et al., 2002], an informal notion of a good patrolling strategy is defined, which states that a good strategy is one that minimizes the time lag between two passages to the same place, for all places. These papers then calculate the optimal policies using this notion of goodness and find that cyclic strategies, which are deterministic, are the best. However, deterministic strategies maximize the information given to the adversary, making them suboptimal if the adversary learns the agent's policy and acts on it. [Beard and McLain, 2003] describes a scenario where a team of UAVs cooperatively searches an area of interest that contains regions of opportunity and regions of potential hazard. The aim of the UAV team is to visit as many opportunities as possible while avoiding as many hazards as possible. The UAV team is further constrained by distance and collision constraints. The static nature of the hazards allows the UAVs to find an optimal deterministic policy in these scenarios, as opposed to the dynamic opponent in our patrolling domains. [Ruan et al., 2005] describes an algorithm for generating many sub-optimal patrol routes for agents acting in a stochastic environment.
The patrol team would then randomly choose one particular patrol route. However, this work does not have an explicitly defined metric for randomness, nor a procedure to evaluate the tradeoffs between the reward and randomness parameters of the domain. [Carroll et al., 2005; Lewis et al., 2005; Billante, 2003] describe practical patrol teams for which randomized policies have been found to be useful. [Carroll et al., 2005] describes a real patrol unit that automatically moves randomly to and throughout designated patrol areas. While on random patrol, the patrol unit conducts surveillance, checks for intruders, conducts product inventory, etc. [Lewis et al., 2005] describes an application for security/sentry vehicles using randomized patrols to avoid predictable movements. [Billante, 2003] describes how randomized police patrols turned out to be a key factor in the drop in the crime rate in New York City. However, in these papers no specific algorithm or procedure is provided for the generation of the randomized policies. In addition, no metrics are proposed to quantify randomization, and hence the randomness of the patrol policy cannot be explicitly controlled as a function of other parameters of the domain, such as reward. Also, neither single-agent nor decentralized MDP/POMDP teams were considered in these works. In my thesis, I focused on the development of efficient security algorithms that reason about the tradeoffs between randomness and domain constraints. The development of these efficient, polynomial-complexity security algorithms involved extensive use of mathematical optimization. While a significant amount of research has been done on global optimization algorithms, none of these have polynomial complexity in general [Vavasis, 1995; Floudas, 1999].

8.5 Randomized algorithms

Randomization has been used as a technique to devise efficient algorithms with good expected runtimes, in a large class of algorithms appropriately named randomized or probabilistic algorithms [Motwani and Raghavan, 1995]. These algorithms are used in a wide variety of applications such as cryptography and Monte Carlo methods, and even appear in simple sorting techniques such as the quicksort algorithm. [Ramanathan et al., 2004] describes the use of randomization techniques to derive an efficient algorithm for leader election in large-scale distributed systems. In all these algorithms, randomization serves the purpose of speeding up the process of obtaining the optimal solution and has no bearing on the type of the final solution obtained, which can be deterministic. This contrasts with my research, where the focus is on obtaining a final randomized policy.

Chapter 9 Conclusion

My thesis focuses on the problem of providing security for agents acting in uncertain adversarial environments with limited information about their adversaries. Such adversarial scenarios arise in a wide variety of situations that are becoming increasingly important, such as patrol agents providing security for a group of houses or regions [Carroll et al., 2005; Billante, 2003; Lewis et al., 2005], agents assisting routine security checks at airports [Poole and Passantino, 2003], or agents providing privacy in sensor-network routing [Ozturk et al., 2004]. I addressed this problem of providing security broadly by considering two realistic situations: first, when agents have no information about their adversaries, and second, when the agents have partial information about their adversaries.
In both of these cases the adversary is assumed to know the agent's policy and hence can exploit it to its advantage. For the case when the agents have no information about their adversaries, I developed policy randomization for single-agent and decentralized (PO)MDPs with guaranteed expected rewards, as a solution technique to minimize the information gained by the adversaries. To this end, my thesis provides two key contributions:

1. Novel algorithms, in particular the polynomial-time CRLP and BRLP algorithms, to randomize single-agent MDP policies while attaining a certain level of expected reward;

2. RDR, a new algorithm to generate randomized policies for decentralized POMDPs. RDR can be built on BRLP or CRLP, and is thus able to efficiently provide randomized policies.

While the techniques developed are applied to analyzing randomization-reward tradeoffs, they could potentially be applied more generally to analyze different tradeoffs between competing objectives in single-agent or decentralized (PO)MDPs.

When the agent has partial information about the adversaries, I model the security domains as Bayesian Stackelberg games. The reason is as follows: Bayesian games capture the fact that there are many adversary types, while Stackelberg games model an important property of these domains, namely that the agents act first while the adversaries observe the agent's policy and then act, i.e., there is an explicit notion of a leader and a follower. My contributions in this case are as follows:

1. I developed an efficient procedure named ASAP for finding the optimal agent policy with limited randomization when faced with multiple opponents. I provided experimental results showing the speedups provided by ASAP over previous approaches. Further, the policies generated by ASAP are simple and easy to implement in practice.

2. Based on the ideas developed for ASAP, I derived an exact procedure named DOBSS for finding the optimal agent policy.

For both of my methods, I developed an efficient mixed-integer linear program (MILP) implementation and provided experimental results illustrating significant speedups and higher rewards over previously existing approaches.

For multiagent systems to actually be deployed and function effectively in the real world, they should be able to address the challenges of environmental uncertainty and security. In addition, as agents become increasingly independent of human intervention, they might need to deal with unexpected security challenges. The following are some of the problems that I would like to address in the future towards the creation of secure agents:

• Incorporating machine learning: The environments in which agents act are often dynamic. It might not be possible to develop a complete model of the environment beforehand for the generation of optimal policies. For example, in a patrolling domain, the model describing the adversaries and their strategies may not correctly capture the situation in the real world. This is because new adversaries can appear who were previously unmodeled, or adversaries can devise new strategies for which the agents might not have planned. Learning components can be incorporated into the existing algorithms so that the current model is updated whenever new information is learned and optimal policies are generated against the updated model.

• Resource-constrained agents: Agents acting in the real world will encounter situations where there are not enough resources to execute their plans [Paruchuri et al., 2004].
This may arise because there are not enough resources in the environment or because there are other agents competing for these resources. Furthermore, the resource constraints may change dynamically, making the problem hard to solve.

• Modeling risk: The standard assumption in the game-theoretic and decision-theoretic literature on optimal policy generation is that the agent or agent team maximizes expected utility, which makes them risk neutral. However, agents can be risk seeking or risk averse, and thus require the development of different solution models, such as optimistic or robust solutions [Nilim and Ghaoui, 2003].

I believe that by addressing these uncertainty and security challenges I will bring agents closer to the reality of being deployed and functioning effectively in the real world.

Bibliography

E. Altman. Constrained Markov Decision Processes. Chapman and Hall, 1999.

P. Bartlett and J. Baxter. Estimation and approximation bounds for gradient-based reinforcement learning. Technical report, Australian National University, 2000.

R. Beard and T. McLain. Multiple UAV cooperative search under collision avoidance and limited range communication constraints. In IEEE CDC, 2003.

R. Becker, V. Lesser, and C.V. Goldman. Transition-independent decentralized Markov decision processes. In Proceedings of AAMAS, 2003.

D.S. Bernstein, S. Zilberstein, and N. Immerman. The complexity of decentralized control of MDPs. In UAI, 2000.

D. Bertsimas and J. Tsitsiklis. Introduction to Linear Optimization. Athena Scientific, 1997.

Nicole Billante. The beat goes on: Policing for crime prevention. http://www.cis.org.au/IssueAnalysis/ia38/ia38.htm, 2003.

N. Borisov and J. Waddle. Anonymity in structured peer-to-peer networks. Technical Report UCB/CSD-05-1390, University of California, Berkeley, 2005.

G. Brown, M. Carlyle, J. Salmeron, and K. Wood. Defending critical infrastructures. Interfaces, 36(6):530–544, 2006.

J. Brynielsson and S. Arnborg. Bayesian games for threat prediction and situation analysis. In FUSION, 2004.

Daniel M. Carroll, Chinh Nguyen, H.R. Everett, and Brian Frederick. Development and testing for physical security robots. http://www.nosc.mil/robots/pubs/spie5804-63.pdf, 2005.

A. Cassandra, L. Kaelbling, and M. Littman. Acting optimally in partially observable stochastic domains. In Proceedings of the National Conference on Artificial Intelligence, 1994.

A. Cassandra, M. Littman, and N. Zhang. Incremental pruning: A simple, fast, exact method for partially observable Markov decision processes. In Proceedings of the Thirteenth Annual Conference on Uncertainty in Artificial Intelligence (UAI-97), pages 54–61, 1997.

Y. Chevaleyre. Theoretical analysis of multi-agent patrolling problem. In Proceedings of AAMAS, 2004.

V. Conitzer and T. Sandholm. Choosing the optimal strategy to commit to. In ACM Conference on Electronic Commerce, 2006.

D. Dolgov and E. Durfee. Approximating optimal policies for agents with limited execution resources. In Proceedings of IJCAI, 2003a.

D. Dolgov and E. Durfee. Constructing optimal policies for agents with constrained architectures. Technical report, University of Michigan, 2003b.

D. Dolgov and E. Durfee. Constructing optimal policies for agents with constrained architectures, 2003c. URL http://citeseer.ist.psu.edu/dolgov03constructing.html.

R. Emery-Montemerlo, G. Gordon, J. Schneider, and S. Thrun. Approximate solutions for partially observable stochastic games with common payoffs. In AAMAS, 2004.

A. Christodoulos Floudas. Deterministic Global Optimization: Theory, Methods and Applications, volume 37 of Nonconvex Optimization and Its Applications. Kluwer, 1999.

D. Fudenberg and J. Tirole. Game Theory. MIT Press, 1991.

Claudia V. Goldman and Shlomo Zilberstein. Optimizing information exchange in cooperative multi-agent systems. In Proceedings of the Second International Joint Conference on Autonomous Agents and Multi Agent Systems (AAMAS-03), pages 137–144, 2003.

C. Gui and P. Mohapatra. Virtual patrol: A new power conservation design for surveillance using sensor networks. In IPSN, 2005.

E.A. Hansen, D.S. Bernstein, and S. Zilberstein. Dynamic programming for partially observable stochastic games. In AAAI, 2004.

J. C. Harsanyi and R. Selten. A generalized Nash solution for two-person bargaining games with incomplete information. Management Science, 18(5):80–106, 1972.

D. A. Huffman. A method for the construction of minimum redundancy codes. In Proceedings of the IRE, 1952.

T. Jaakkola, S. Singh, and M. Jordan. Reinforcement learning algorithm for partially observable Markov decision problems. Advances in NIPS, 7, 1994.

L. Kaelbling, M. Littman, and A. Cassandra. Planning and acting in partially observable stochastic domains. Technical report, Brown University, 1995.

L. Kaelbling, M. Littman, and A. Cassandra. Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101(2):99–134, 1998.

D. Koller, N. Megiddo, and B. V. Stengel. Efficient computation of equilibria for extensive two-person games. In Games and Economic Behavior, 1996.

D. Koller and A. Pfeffer. Generating and solving imperfect information games. In IJCAI, 1995.

D. Koller and A. Pfeffer. Representations and solutions for game-theoretic problems. Artificial Intelligence, 94(1):167–215, 1997.

C. Lemke and J. Howson. Equilibrium points of bimatrix games. Journal of the Society for Industrial and Applied Mathematics, 1964.

Paul J. Lewis, Mitchel R. Torrie, and Paul M. Omilon. Applications suitable for unmanned and autonomous missions utilizing the tactical amphibious ground support (TAGS) platform. http://www.autonomoussolutions.com/Press/SPIE%20TAGS.html, 2005.

R. J. Lipton, E. Markakis, and A. Mehta. Playing large games using simple strategies. In ACM Conference on Electronic Commerce, 2003.

M. Littman. Markov games as a framework for multi-agent reinforcement learning. In ML, 1994. URL citeseer.ist.psu.edu/littman94markov.html.

A. Machado, G. Ramalho, J. D. Zucker, and A. Drougoul. Multi-agent patrolling: an empirical analysis on alternative architectures. In MABS, 2002.

R. D. McKelvey, A. M. McLennan, and T. L. Turocy. Gambit: Software tools for game theory, version 0.97.1.5, 2004.

R. Motwani and P. Raghavan. Randomized Algorithms. Cambridge University Press, 1995.

R. Nair, D. Pynadath, M. Yokoo, M. Tambe, and S. Marsella. Taming decentralized POMDPs: Towards efficient policy computation for multiagent settings. In IJCAI, 2003.

J. Nash. Equilibrium points in n-person games. In Proceedings of the National Academy of Sciences, 1950.

A. Nilim and L. E. Ghaoui. Robustness in Markov decision problems with uncertain transition matrices. In NIPS, 2003.

S. Otterloo. The value of privacy: Optimal strategies for privacy minded agents. In AAMAS, 2005.

C. Ozturk, Y. Zhang, and W. Trappe. Source-location privacy in energy-constrained sensor network routing. 2004.

S. Paquet, L. Tobin, and B. Chaib-draa. An online POMDP algorithm for complex multiagent environments. In AAMAS, 2005.

R. Parr and S. Russell. Approximating optimal policies for partially observable stochastic domains. In Proceedings of IJCAI, 1995.

P. Paruchuri, M. Tambe, F. Ordonez, and S. Kraus. Towards a formalization of teamwork with resource constraints. In AAMAS, 2004.

R. Poole and G. Passantino. A risk based airport security policy. 2003.

R. Porter, E. Nudelman, and Y. Shoham. Simple search methods for finding a Nash equilibrium. In AAAI, 2004.

M.L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley and Sons, 1994.

D. V. Pynadath and M. Tambe. The communicative multiagent team decision problem: Analyzing teamwork theories and models. JAIR, 16:389–423, 2002.

M. Ramanathan, R. Ferreira, S. Jagannathan, and A. Grama. Randomized leader election. Technical report, Purdue University, 2004.

T. Roughgarden. Stackelberg scheduling strategies. In ACM Symposium on TOC, 2001.

S. Ruan, C. Meirina, F. Yu, K. R. Pattipati, and R. L. Popp. Patrolling in a stochastic environment. In 10th Intl. Command and Control Research Symposium, 2005.

T. Sandholm, A. Gilpin, and V. Conitzer. Mixed-integer programming methods for finding Nash equilibria. In AAAI, 2005.

Sasemas. About the workshop. 2005. URL http://www.sasemas.org/2005/.

R. Savani and B. V. Stengel. Exponentially many steps for finding a Nash equilibrium in a bimatrix game. In FOCS, 2004.

C. Shannon. A mathematical theory of communication. The Bell System Technical Journal, 27:379–423, 623–656, 1948.

S. Singh, V. Soni, and M. Wellman. Computing approximate Bayes-Nash equilibria with tree-games of incomplete information. In ACM Conference on Electronic Commerce, 2004.

M. Smith, S.L. Urban, and H.M. Avila. Retaliate: Learning winning policies in first-person shooter games. 2007.

H. V. Stackelberg. Marktform und Gleichgewicht. Springer, 1934.

B. V. Stengel and S. Zamir. Leadership with commitment to mixed strategies. CDAM Research Report LSE-CDAM-2004-01, 2004.

Jo Twist. Eternal planes to watch over us. http://news.bbc.co.uk/1/hi/sci/tech/4721091.stm, 2005.

S. A. Vavasis. Complexity issues in global optimization: a survey. In R. Horst and P.M. Pardalos, editors, Handbook of Global Optimization, pages 27–41. Kluwer, 1995.

S.A. Vavasis. Nonlinear Optimization: Complexity Issues. Oxford University Press, New York, 1991.

Y. Wen. Efficient network diagnosis algorithms for all-optical networks with probabilistic link failures. Thesis, MIT, 2005.

Wikipedia. Stackelberg competition. 2007. URL http://en.wikipedia.org/wiki/Stackelberg_competition.

L. Wolsey. Integer Programming. Wiley, 1998.
Abstract

Recent advances in the field of agent/multiagent systems bring us closer to agents acting in real-world domains, which can be uncertain and many times adversarial. Security, commonly defined as the ability to deal with intentional threats from other agents, is a major challenge for agents or agent teams deployed in these adversarial domains. Such adversarial scenarios arise in a wide variety of situations that are becoming increasingly important, such as agents patrolling to provide perimeter security around critical infrastructure or performing routine security checks. These domains have the following characteristics: (a) the agent or agent team needs to commit to a security policy while the adversaries may observe and exploit the policy committed to; (b) the agent or agent team potentially faces different types of adversaries and has varying information available about the adversaries (thus limiting the agents' ability to model its adversaries).