Handling Attacker's Preference in Security Domains: Robust Optimization and Learning Approaches

by

Yundi Qian

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY (COMPUTER SCIENCE)

December 2016

Copyright 2016 Yundi Qian

Acknowledgements

First and foremost, I would like to thank my advisor, Prof. Milind Tambe, for all of his support and encouragement throughout my Ph.D. When I first joined the Teamcore research group, I knew little about what it means to do research. Milind, with his endless patience and faithful encouragement, guided me through each step of doing meaningful research. I will never forget the moments when Milind stayed up with me until the last minute before conference deadlines, challenged me to push every idea to its limit, and iterated over papers and talks numerous times to bring them to excellence. In addition, I really appreciate the freedom Milind gave me to explore the research domains I am interested in, which made the past four years of research truly enjoyable. Milind's passion for research and commitment to students make him the best academic advisor I could ask for. I learned so much from him, not only about research but also about how to be a good mentor with care and patience.

I would also like to thank the other members of my thesis committee for providing valuable feedback on my research and pushing me to think about it at another level: Aram Galstyan, Jonathan Gratch, Maged Dessouky and Yilmaz Kocer.

I would also like to thank the many excellent researchers that I have had the privilege to work with over the years. This list includes Jason Tsai, Yevgeniy Vorobeychik, Christopher Kiekintveld, Chao Zhang, Bhaskar Krishnamachari, Victor Bucarey, Ayan Mukhopadhyay, Arunesh Sinha, William B. Haskell, Albert Xin Jiang, Ariel Procaccia and Nisarg Shah. Special thanks to Jason Tsai for guiding me when I initially joined the Teamcore group, Albert Xin Jiang for his very detailed guidance when I was writing my first paper, William B. Haskell for polishing my paper writing numerous times, Bhaskar Krishnamachari for his valuable insights on my work, and Chao Zhang for always being there when I want to discuss my research.

I would like to thank the entire Teamcore family: Albert Xin Jiang, Francesco Delle Fave, William Haskell, Arunesh Sinha, James Pita, Manish Jain, Jason Tsai, Jun-young Kwak, Zhengyu Yin, Rong Yang, Matthew Brown, Eric Shieh, Thanh Nguyen, Leandro Marcolino, Fei Fang, Chao Zhang, Debarun Kar, Benjamin Ford, Haifeng Xu, Amulya Yadav, Aaron Schlenker, Sara McCarthy, Yasi Abbasi, Shahrzad Gholami, Bryan Wilder, and Elizabeth Orrico. It is my great honor to share my Ph.D. experience with all of you, and I really enjoyed the time we worked and played together. Special thanks to Albert Xin Jiang for his tremendous help in doing research, Jun-young Kwak and Fei Fang for being lovely officemates when we were at PHE 101, Eric Shieh for treating me to good food and pushing me to practice my English, and Chao Zhang for being an awesome person to talk to.

Finally, I would like to thank my friends and family, who gave me tremendous help in completing my Ph.D. I would like to thank my parents for always supporting me, believing the best in me, and always being there when I need their help. I would also like to thank my boyfriend Ji Yang for being a constant source of support and encouragement.
This thesis would have never been possible without him.

Table of Contents

Acknowledgements
List Of Tables
List Of Figures
Abstract
Chapter 1 Introduction
  1.1 Problem Addressed
  1.2 Contributions
  1.3 Overview of Thesis
Chapter 2 Background
  2.1 Stackelberg Security Games
  2.2 POMDP
  2.3 Restless Multi-armed Bandit (RMAB) Problem
Chapter 3 Related Work
  3.1 Uncertainty in Stackelberg Security Games
  3.2 Adversary Behavioral Models
  3.3 Learning Attacker Payoffs
  3.4 Green Security Games
  3.5 Exploration-exploitation Tradeoff in Security Domains
  3.6 Indexability of Restless Multi-armed Bandit Problem
Chapter 4 Robust Strategy against Risk-aware Attackers in SSGs
  4.1 Model
  4.2 Preliminaries
  4.3 MIBLP Formulation
  4.4 BeRRA Algorithm
  4.5 Minimum Resources
  4.6 Discussions
  4.7 Experimental Evaluation
Chapter 5 Learning Attacker's Preference — Payoff Modeling
  5.1 Model
  5.2 GMOP Algorithm
  5.3 Fictitious Best Response
  5.4 Continuous Utility Scenario
  5.5 Unknown Extractor Scenario — Model Ensemble
  5.6 Experimental Evaluation
Chapter 6 Learning Attacker's Preference — Markovian Modeling
  6.1 Model
  6.2 Restless Bandit for Planning
  6.3 Computation of Passive Action Set
  6.4 Planning from POMDP View
  6.5 Experimental Evaluation
Chapter 7 Conclusion
  7.1 Contributions
  7.2 Future Work
Bibliography
Appendix A  An Example of BeRRA Algorithm
Appendix B  Proof for Indexability
  B.1 Preliminaries
  B.2 Proof of Theorem 6.2.1
  B.3 Proof of Theorem 6.2.2
List Of Tables

4.1 Utility Example
4.2 MIBLP vs BeRRA in Solution Quality
4.3 MIBLP vs BeRRA in Runtime (s)
5.1 ZMDP/APPL vs GMOP in Solution Quality
5.2 GMOP vs POMCP in Runtime (s)
5.3 GMOP vs POMCP in Runtime (s)
5.4 General vs Advanced Sampling in Runtime (s)
5.5 GMOP vs Heuristic in Runtime (s)
5.6 Ensemble vs Single in Runtime (s)
6.1 Planning Algorithm Evaluation in Solution Quality for Small-scale Problem Instances
7.1 Comparison Between Different Models
A.1 Example of BeRRA Algorithm

List Of Figures

4.1 Prospect Theory
4.2 Runtime of BeRRA
4.3 Solution Quality of RSE in Worst Case
4.4 Solution Quality of RSE in Average Case
4.5 "Price" of Being Robust
4.6 Solution Quality of RSE for Risk-aware Attackers
5.1 Bayesian Network when n = 4
5.2 Fictitious Quantal Response
5.3 Continuous Utility Scenario
5.4 Fictitious Best Response
5.5 Model Ensemble
6.1 Model Illustration
6.2 Special POMDPs vs Standard POMDPs
6.3 Planning Algorithm Evaluation in Solution Quality for Large-scale Problem Instances
6.4 Example when Myopic Policy Fails
6.5 Runtime Analysis of Whittle Index Policy: n_s = 2, n_o = 2
6.6 Evaluation of RMAB Modeling
B.1 An example of V^1_m(x)

Abstract

Stackelberg security games (SSGs) are now established as a powerful tool in security domains. In order to compute the optimal strategy for the defender in the SSG model, the defender needs to know the attacker's preferences over targets so that she can predict how the attacker would react under a given defender strategy. Uncertainty over attacker preferences may cause the defender to suffer significant losses. Motivated by this, my thesis focuses on addressing uncertainty in attacker preferences using robust and learning approaches.
In security domains with one-shot attacks, e.g., counter-terrorism domains, the defender is interested in robust approaches that can provide a performance guarantee in the worst case. The first part of my thesis focuses on handling the attacker's preference uncertainty with robust approaches in these domains. My work considers a new dimension of preference uncertainty that has not been taken into account in the previous literature, namely uncertainty over the attacker's risk preference, and proposes an algorithm to efficiently compute the defender's robust strategy against uncertain risk-aware attackers.

In security domains with repeated attacks, e.g., the green security domain of protecting natural resources, the attacker "attacks" (illegally extracts natural resources) frequently, so it is possible for the defender to learn the attacker's preferences from their previous actions and then use this information to better plan her strategy. The second part of my thesis focuses on learning attacker preferences in these domains. My thesis models the preferences from two different perspectives: (i) the preference is modeled as payoffs, and the defender learns the payoffs from attackers' previous actions; (ii) the preference is modeled as a Markovian process, and the defender learns the Markovian process from attackers' previous actions.

Chapter 1
Introduction

Stackelberg security games (SSGs) are now established as a successful tool in the infrastructure security domain [20,42,62]. In this domain, the security forces (defender) deploy security resources to protect key infrastructure (targets) against potential terrorists (attackers). With limited resources available, it is usually impossible to protect all targets at all times. SSGs optimize the use of defender resources via game-theoretic approaches. In the SSG model, the defender acts first and commits to a mixed strategy, while the attacker learns the mixed strategy after long-term surveillance and then chooses one target to attack [55].

The success of SSGs in the infrastructure security domain has inspired researchers' interest in applying game-theoretic models to other security domains with frequent interactions between defenders and attackers, e.g., wildlife protection [12,58]. However, these two domains are different. In the wildlife protection domain, attacks (poaching) happen frequently, which gives defenders (patrollers) the opportunity to learn attackers' (poachers') preferences from their previous actions and then plan patrol strategies accordingly; this learning opportunity does not arise in the counter-terrorism domain.

1.1 Problem Addressed

The computation of the optimal strategy for the defender requires the defender to know how the attacker views the importance of every target, since it involves predicting the attacker's action under a given defender strategy. If the defender is unable to predict the attacker's action correctly, she may suffer significant losses. Motivated by this, my thesis focuses on addressing uncertainty in attacker preferences over targets using robust and learning approaches.

The defender's uncertainty about how the attacker views the importance of every target may come from two different sources: (i) uncertainty over the attacker's payoffs, i.e., the defender is uncertain about the true payoffs of different targets for the attacker; (ii) uncertainty over the attacker's risk attitude, i.e., the attacker may not be risk-neutral and the defender is uncertain about the attacker's risk attitude.
In security domains with one-shot attacks, e.g., counter-terrorism domains, attacks happen rarely, so there is no chance for the defender to learn the attacker's preferences from previous actions. Thus, the defender is interested in robust approaches that can provide a performance guarantee in the worst case. Payoff uncertainty in SSGs has been addressed in previous literature [19], while risk attitude uncertainty has not been addressed yet. Therefore, the first part of my thesis focuses on handling the attacker's risk attitude uncertainty in SSGs with robust approaches.

In security domains with repeated attacks, e.g., the green security domain of protecting natural resources, the attacker "attacks" (illegally extracts natural resources) frequently, so it is possible for the defender to learn the attacker's preferences from their previous actions and then use this information to better plan her strategy. Therefore, the second part of my thesis focuses on learning attacker preferences and then planning accordingly in these domains. In this setting, the learned preference is the preference in the attacker's mind, which takes both the payoff and the risk attitude into account. My thesis models the preferences from two different perspectives:

1. The preference is modeled as payoffs, and the defender learns the payoffs from attackers' previous actions and then plans accordingly. However, this work is based on two key assumptions: (i) the attacker follows a certain behavioral pattern that is known to the defender; (ii) both the defender and the attacker can observe their opponent's activities at all targets. These two assumptions may not hold in some domains.

2. To relax these two assumptions, I model the preference as a Markovian process that evolves according to the defender's strategies. The defender learns the Markovian process from attackers' previous actions and then plans accordingly. This model needs no prior information about the attacker's behavioral patterns and is able to handle the exploration-exploitation tradeoff in these domains, which is caused by the fact that the defender can only observe the attack activities happening at protected targets.

1.2 Contributions

My contributions address uncertainty in attackers' preferences using robust and learning approaches.
In other words, the defender has uncertainty over the attacker’s degree of risk awareness. To address this issue, the first part of my thesis computes a robust defender strategy that optimizes the worstcaseagainstrisk-awareattackerswithuncertaintyinthedegreeofriskawareness[1], i.e., it provides a solution quality guarantee for the defender no matter how risk-aware the attacker is. To develop the robust strategy, I firstly build a robust SSG framework against an attacker with uncertainty in level of risk awareness. Second, building on previous work 4 on SSGs in mixed-integer programs, I propose a novel mixed-integer bilinear program- ming problem (MIBLP), and find that it only finds locally optimal solutions. While the MIBLP formulation is also unable to scale up, it provides key intuition for my new al- gorithm. This new algorithm, BeRRA (Binary search based Robust algorithm against Risk-Aware attackers) is my third contribution, and it finds globally ǫ-optimal solutions by solvingO(nlog( 1 ǫ )log( 1 δ )) linear feasibility problems. The key idea of the BeRRA algorithm is to reduce the problem from maximizing the reward with a given number of resources to minimizing the number of resources needed to achieve a given reward. This transformationallowsBeRRAtoscaleupviatheremovalofthebilineartermsandinteger variables as well astheutilization of key theoretical propertiesthat provecorrespondence of its potential “attack sets” [20] with that of the maximin strategy. Finally, I also show that the defender does not need to consider attacker’s risk attitude in zero-sum games. The experimental results show the solution quality and runtime advantages of my robust model and BeRRA algorithm. 1.2.2 Learning Attacker’s Preference — Payoff Modeling The second part of my thesis [50] focuses on learning the attacker’s payoffs in green security domains where there are frequent interactions between the defender and the attacker. In green security domains, the “defenders” (law enforcement agencies) try to protect these natural resources and “attackers” (criminals) seek to exploit them. In infrastructuresecuritygames,theattackerconductsextensivesurveillanceonthedefender andexecutesaone-shotattack, whileingreensecuritydomains,theattackeralsoobserves the defender’s strategy but carries out frequent illegal extractions. Therefore, there are 5 frequent interactions between thedefenderandtheattacker, whichgives thedefenderthe opportunitytolearn theattacker’s payoffs byobservingtheattacker’s actions. Motivated by this, the second part of my thesis develops the model and algorithm for the defender to learn target values from attacker’s actions and then uses this information to better plan her strategy. In this work, I model these interactions between the defender and the attacker as a repeated game. I then adopt a fixed model for the attacker’s behavior and recast this repeated game as a partially observable Markov decision process (POMDP). However, my POMDP formulation has an exponential number of states, making current POMDP solvers like ZMDP [54] and APPL [25] infeasible in terms of computational cost. Silver and Veness [53] have proposed the POMCP algorithm which achieves a high level of performance in large POMDPs. It uses particle filtering to maintain an approximation of the belief state of the agent, and then uses Monte Carlo Tree Search (MCTS) for online planning. However, the particle filter is only an approximation of the belief state. 
Therefore, there are frequent interactions between the defender and the attacker, which gives the defender the opportunity to learn the attacker's payoffs by observing the attacker's actions. Motivated by this, the second part of my thesis develops the model and algorithm for the defender to learn target values from the attacker's actions and then use this information to better plan her strategy.

In this work, I model these interactions between the defender and the attacker as a repeated game. I then adopt a fixed model for the attacker's behavior and recast this repeated game as a partially observable Markov decision process (POMDP). However, my POMDP formulation has an exponential number of states, making current POMDP solvers like ZMDP [54] and APPL [25] infeasible in terms of computational cost. Silver and Veness [53] have proposed the POMCP algorithm, which achieves a high level of performance in large POMDPs. It uses particle filtering to maintain an approximation of the belief state of the agent, and then uses Monte Carlo Tree Search (MCTS) for online planning. However, the particle filter is only an approximation of the belief state. By appealing to the special properties of my POMDP, I propose the GMOP algorithm (Gibbs sampling based MCTS Online Planning), which draws samples directly from the exact belief state using Gibbs sampling and then runs MCTS for online planning. My algorithm provides higher solution quality than the POMCP algorithm. Additionally, for a specific subclass of my game with an attacker who plays a best response against the defender's empirical distribution, and a uniform penalty of being seized across all targets, I provide an advanced sampling technique to speed up the GMOP algorithm, along with a heuristic that trades off solution quality for lower computational cost. Moreover, I explore the case of continuous utilities, where my original POMDP formulation becomes a continuous-state POMDP, which is generally difficult to solve. However, the special properties of the specific subclass of games mentioned above make possible the extension of the GMOP algorithm to continuous utilities. Finally, I explore the more realistic scenario where the defender is not only uncertain about the distribution of resources, but also uncertain about the attacker's behavioral model. I address this challenge by extending my POMDP formulation and the GMOP algorithm.

1.2.3 Learning Attacker's Preference — Markovian Modeling

My second contribution [50] assumes that defenders have knowledge of all poaching activities throughout the wildlife protected area. Unfortunately, given the vast geographic areas involved in wildlife protection, defenders do not have knowledge of poaching activities in areas they do not protect. Thus, defenders are faced with an exploration-exploitation tradeoff: whether to protect the targets that are already known to have a lot of poaching activity, or to explore the targets that have not been protected for a long time. My third contribution [52] aims to resolve this exploration-exploitation tradeoff.

The exploration-exploitation tradeoff here is different from that in the non-Bayesian stochastic multi-armed bandit problem [4]. In stochastic multi-armed bandit problems, the rewards of every arm are random variables with a stationary unknown distribution. However, in this problem, patrol affects attack activities: more patrol is likely to decrease attack activities and less patrol is likely to increase attack activities. Thus, the reward distribution changes depending on the player's choice: more selection (patrol) leads to lower reward (less attack activity) and less selection (patrol) leads to higher reward (more attack activity). On the other hand, the adversarial multi-armed bandit problem [5] is also not an appropriate model for this domain. In adversarial multi-armed bandit problems, the reward can change arbitrarily, whereas the attack activities in this problem are unlikely to change rapidly in a short period. This makes the adversarial multi-armed bandit model inappropriate for this domain.

In reality, how patrol affects attack activities can reasonably be assumed to follow a consistent pattern that can be learned from historical data (defenders' historical observations). I model this pattern as a Markov process and provide the following contributions in this work. First, I formulate the problem as a restless multi-armed bandit (RMAB) model to handle the limited observability challenge: defenders do not have observations for arms they do not activate (targets they do not protect). Second, I propose an EM-based learning algorithm to learn the RMAB model from defenders' historical observations.
Third, I use the solution concept of the Whittle index policy to solve the RMAB model and plan the defenders' patrol strategies. However, indexability is required for the existence of the Whittle index, so I provide two sufficient conditions for indexability and an algorithm to numerically evaluate indexability. Fourth, I propose a binary search based algorithm to find the Whittle index policy efficiently.

1.3 Overview of Thesis

This thesis is organized in the following manner. Chapter 2 discusses the necessary background material for the research presented in this thesis. Chapter 3 provides an overview of the relevant research. Chapter 4 discusses the algorithm to compute the robust strategy against risk-aware attackers. Chapter 5 presents the model to learn attackers' payoffs for different targets and then use this information to better plan defenders' patrol strategies. Chapter 6 presents the model in which the attacker's preference is modeled as a Markovian process. Chapter 7 concludes this thesis and presents ideas for future work.

Chapter 2
Background

2.1 Stackelberg Security Games

An SSG [20,42,62] is a two-player game between a defender and an attacker. We consider the problem with n targets, where T = {1,2,...,n} is the set of targets. The defender has a total of m resources to allocate among these n targets to protect them from attack. The defender commits to a mixed strategy c to protect these targets, where c_i ∈ [0,1] is the probability that target i is protected. We have the resource constraint Σ_{i∈T} c_i ≤ m. The attacker observes the defender's strategy c and then chooses one target to attack. If the attacker attacks a protected target i, the attack is unsuccessful and the attacker receives utility U^c_a(i) while the defender receives utility U^c_d(i). If the attacker attacks an unprotected target i, the attack is successful and the attacker receives utility U^u_a(i) while the defender receives utility U^u_d(i). Necessarily, U^u_d(i) < U^c_d(i) and U^c_a(i) < U^u_a(i), ∀i ∈ T. If U^u_d(i) + U^u_a(i) = 0 and U^c_d(i) + U^c_a(i) = 0, ∀i ∈ T, the SSG is a zero-sum game.

We define U_a(i,c) ≜ c_i U^c_a(i) + (1−c_i) U^u_a(i) to be the expected utility for the attacker when the defender's strategy is c and the attacker chooses to attack target i; similarly, U_d(i,c) ≜ c_i U^c_d(i) + (1−c_i) U^u_d(i) is the expected utility for the defender. Given the defender strategy c, the attacker attacks the target that maximizes his expected utility. When there are ties, the attacker is assumed to break ties in favor of the defender. Thus, a mixed-integer linear program (MILP) can be formulated to compute the defender's optimal strategy, as shown in Problem (2.1). Here, {q_i}_{i∈T} are auxiliary variables representing whether target i is chosen by the attacker, and M is a constant orders of magnitude larger than all target utilities. The solution c is called the Strong Stackelberg Equilibrium (SSE) strategy [11,24,55] of the game.

\[
\begin{array}{ll}
\displaystyle \max_{c,\{q_i\}_{i\in T},v,d} & v \\
\text{s.t.} & 0 \le c_i \le 1, \quad \forall i \in T \\
 & \sum_{i\in T} c_i \le m \\
 & q_i \in \{0,1\}, \quad \forall i \in T \\
 & \sum_{i\in T} q_i = 1 \\
 & v \le U_d(i,c) + (1-q_i)M, \quad \forall i \in T \\
 & 0 \le d - U_a(i,c) \le (1-q_i)M, \quad \forall i \in T
\end{array}
\tag{2.1}
\]
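To make Problem (2.1) concrete, the following is a minimal sketch of the SSE MILP using the PuLP modeling library. This is not code from the thesis: the three-target payoff table, the big-M value, and the solver choice are illustrative placeholders.

```python
# A minimal sketch of the SSE MILP in Problem (2.1) using PuLP.
# The example payoffs below are illustrative placeholders, not data from the thesis.
import pulp

# U[i] = (Ud_c, Ud_u, Ua_c, Ua_u): defender/attacker utilities when target i is
# covered (c) or uncovered (u).
U = {0: (1.0, -5.0, -2.0, 3.0),
     1: (0.5, -3.0, -1.0, 4.0),
     2: (2.0, -1.0, -3.0, 1.0)}
T = list(U)
m = 1          # number of defender resources
M = 1000.0     # big-M constant, orders of magnitude above all utilities

prob = pulp.LpProblem("SSE_MILP", pulp.LpMaximize)
c = {i: pulp.LpVariable(f"c_{i}", lowBound=0, upBound=1) for i in T}
q = {i: pulp.LpVariable(f"q_{i}", cat="Binary") for i in T}
v = pulp.LpVariable("v")   # defender's expected utility
d = pulp.LpVariable("d")   # attacker's best expected utility

prob += v                                     # objective: maximize defender utility
prob += pulp.lpSum(c[i] for i in T) <= m      # resource constraint
prob += pulp.lpSum(q[i] for i in T) == 1      # attacker picks exactly one target
for i in T:
    Ud_c, Ud_u, Ua_c, Ua_u = U[i]
    Ud = c[i] * Ud_c + (1 - c[i]) * Ud_u      # U_d(i, c)
    Ua = c[i] * Ua_c + (1 - c[i]) * Ua_u      # U_a(i, c)
    prob += v <= Ud + (1 - q[i]) * M          # v bounded by the attacked target's utility
    prob += d - Ua >= 0                       # d upper-bounds the attacker's utility
    prob += d - Ua <= (1 - q[i]) * M          # and is tight at the attacked target

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print({i: c[i].value() for i in T}, v.value())
```

Solving this toy instance returns a coverage vector c and the defender's SSE value v; the structure mirrors the big-M constraints of Problem (2.1) directly.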
2.2 POMDP

A POMDP is a generalization of a Markov decision process (MDP) in which the agent cannot directly observe the underlying state. Instead, the agent receives observations, which reveal the underlying state via a probability distribution. Therefore, the agent maintains a probability distribution (belief) over the set of possible states based on its observations, and plans its actions according to this distribution.

The POMDP framework can be used to model sequential decision processes in uncertain environments. At the beginning of each round, the agent has a probability distribution over the state it is currently in; it then executes the optimal action under this distribution and receives the corresponding observation; finally, it uses the observation to update its belief about the new state. Formally, a POMDP is a 7-tuple (S, A, T, R, Ω, O, γ), where

• S is a set of states
• A is a set of actions
• T is a set of conditional transition probabilities between states
• R : S × A → R is the reward function
• Ω is a set of observations
• O is a set of conditional observation probabilities
• γ ∈ [0,1] is the discount factor

For the standard POMDP formulation, the belief update is:

\[
b'(s') = \frac{P(o \mid s', a) \sum_{s \in S} b(s) P(s' \mid s, a)}{P(o \mid b, a)}
\tag{2.2}
\]

where

\[
P(o \mid b, a) = \sum_{s' \in S} P(o \mid s', a) \sum_{s \in S} b(s) P(s' \mid s, a)
\]

A POMDP can be solved by the value iteration algorithm, which I briefly present below. The value function is

\[
V'(b) = \max_{a \in A} \left( \sum_{s \in S} b(s) R(s,a) + \beta \sum_{o \in \Omega} P(o \mid b, a) V(b^o_a) \right)
\]

It can be broken up into simpler combinations of other value functions:

\[
V'(b) = \max_{a \in A} V_a(b), \qquad
V_a(b) = \sum_{o \in \Omega} V^o_a(b), \qquad
V^o_a(b) = \frac{\sum_{s \in S} b(s) R(s,a)}{|\Omega|} + \beta P(o \mid b, a) V(b^o_a)
\]

All of these value functions can be represented as V(b) = max_{α∈D} b · α, since the update process maintains this property; thus we only need to update the set D when updating the value function. The set D is updated according to the following process:

\[
D' = \mathrm{purge}\Big( \bigcup_{a \in A} D_a \Big), \qquad
D_a = \mathrm{purge}\Big( \bigoplus_{o \in \Omega} D^o_a \Big), \qquad
D^o_a = \mathrm{purge}\big( \{ \tau(\alpha, a, o) \mid \alpha \in D \} \big)
\]

where τ(α,a,o) is the |S|-dimensional vector given by

\[
\tau(\alpha, a, o)(s) = \frac{1}{|\Omega|} R(s,a) + \beta \sum_{s' \in S} \alpha(s') P(o \mid s', a) P(s' \mid s, a)
\]

and purge(·) takes a set of vectors and reduces it to its unique minimum form (removing redundant vectors that are dominated by other vectors in the set). ⊕ represents the cross sum of two sets of vectors: A ⊕ B = {α + β | α ∈ A, β ∈ B}. The updates of D' and D_a are intuitive, so I briefly explain the update of D^o_a here:

\[
\begin{aligned}
P(o \mid b, a) V(b^o_a)
&= P(o \mid b, a) \max_{\alpha \in D} \sum_{s' \in S} \alpha(s') P(s' \mid b, a, o) \\
&= P(o \mid b, a) \max_{\alpha \in D} \sum_{s' \in S} \alpha(s') \frac{P(o \mid s', a) \sum_{s \in S} b(s) P(s' \mid s, a)}{P(o \mid b, a)} \\
&= \max_{\alpha \in D} \sum_{s' \in S} \alpha(s') P(o \mid s', a) \sum_{s \in S} b(s) P(s' \mid s, a) \\
&= \max_{\alpha \in D} \sum_{s \in S} b(s) \left( \sum_{s' \in S} \alpha(s') P(o \mid s', a) P(s' \mid s, a) \right)
\end{aligned}
\]

Here, P(s' | b, a, o) is the belief of state s' in the next round when the belief in the current round is b, the agent takes action a and gets observation o, i.e., it is the b'(s') in Equation (2.2).
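As a concrete illustration of the belief update in Equation (2.2), here is a small self-contained sketch; the two-state, two-observation transition and observation tables are made-up toy numbers, not a model from this thesis.

```python
# Minimal sketch of the POMDP belief update in Equation (2.2).
# The two-state / two-observation model below is a made-up toy example.
import numpy as np

# P_trans[a][s, s'] = P(s' | s, a);  P_obs[a][s', o] = P(o | s', a)
P_trans = {0: np.array([[0.9, 0.1],
                        [0.3, 0.7]])}
P_obs   = {0: np.array([[0.8, 0.2],
                        [0.4, 0.6]])}

def belief_update(b, a, o):
    """Return b'(.) after taking action a and observing o."""
    pred = b @ P_trans[a]              # sum_s b(s) P(s' | s, a), for every s'
    unnorm = P_obs[a][:, o] * pred     # multiply by P(o | s', a)
    return unnorm / unnorm.sum()       # divide by P(o | b, a)

b = np.array([0.5, 0.5])
print(belief_update(b, a=0, o=1))
```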
It has been shown that Whittle index policy is asymptot- ically optimal under certain conditions as k and n tend to∞ with k/n fixed [56]. When k and n are finite, extensive empirical studies have also demonstrated the near-optimal performance of Whittle index policy [3,13]. Whittle index measures how attractive it is toactivate anarmbasedontheconceptofsubsidyforpassivity. Itgives thesubsidymto passive action (not activate) and the smallest m that would make passive action optimal for the current state is definedto bethe Whittle index for this arm at this state. Whittle index policy chooses to activate the k arms with the highest Whittle indices. Intuitively, the larger the m is, the larger the gap is between active action (activate) and passive action, the more attractive it is for the player to activate this arm. Mathematically, de- note V m (x;a = 0) (V m (x;a = 1)) to be the maximum cumulative reward the player can 15 achieve until the end if he takes passive (active) action at the first round at the state x with subsidy m. Whittle index I(x) of state x is then defined to be: I(x),inf m {m:V m (x;a =0)≥V m (x :a=1)} However, Whittle index only exists and Whittle index policy can only be used when theproblemsatisfiesapropertyknownasindexability, whichIdefinebelow. DefineΦ(m) to be the set of states for which passive action is the optimal action given subsidy m: Φ(m),{x:V m (x;a =0)≥V m (x :a=1)} Definition 2.3.1. An arm is indexable if Φ(m) monotonically increases from∅ to the whole state space as m increases from−∞ to +∞. An RMAB is indexable if every arm is indexable. Intuitively, indexability requires that for a given state, its optimal action can never switch from passive action to active action with the increase of m. The indexability of an RMAB is often difficult to establish and computing Whittle index can be complex. 16 Chapter 3 Related Work 3.1 Uncertainty in Stackelberg Security Games Previous approaches that handle uncertainty in SSGs can be divided into two categories: • modeluncertaintyintermsofdifferentattacker typesandsolvearesultingBayesian Stackelberg game [42,62] • apply robust optimization techniques to optimize the worst case for the defender over the range of model uncertainty [19,34,60] 3.1.1 Bayesian Stackelberg Games BayesianStackelberggamemodelsuncertaintybyallowingdifferentattacker types,where thereissomepriorprobabilitycorrespondingtoeachattackertype. Althoughthismethod is used to model payoff uncertainty in previous work [42,62], it can also beused to model different degrees of attacker risk awareness in SSGs. However, this approach requires a prior distribution of attacker types, which is usually inapplicable for many real-world 17 security domains [34]. In addition, it is difficult to apply this approach to infinitely many attacker types. 3.1.2 Robust Stackelberg Games Maximin method Maximin method for addressing uncertainties in SSGs focuses on maximizing the defender’s utility against the worst case of uncertainties. Yin et al. [60] computes a defender strategy that is robust against defender execution uncertainty as well as uncertainty in the attacker’s observations of the defender’s strategy. Kiekintveld etal.[19]focusonintervaluncertaintyintheattacker’s payoffs. Nguyenetal.[34]develop a robust strategy that takes the attacker’s bounded rationality into account as well as the uncertainties [19,60] discuss. Minimax regret method Minimax regret method captures another concept of robust- ness when handling uncertainties in SSGs. 
Chapter 3
Related Work

3.1 Uncertainty in Stackelberg Security Games

Previous approaches that handle uncertainty in SSGs can be divided into two categories:

• model uncertainty in terms of different attacker types and solve a resulting Bayesian Stackelberg game [42,62]
• apply robust optimization techniques to optimize the worst case for the defender over the range of model uncertainty [19,34,60]

3.1.1 Bayesian Stackelberg Games

A Bayesian Stackelberg game models uncertainty by allowing different attacker types, with a prior probability corresponding to each attacker type. Although this method is used to model payoff uncertainty in previous work [42,62], it can also be used to model different degrees of attacker risk awareness in SSGs. However, this approach requires a prior distribution over attacker types, which is usually unavailable in many real-world security domains [34]. In addition, it is difficult to apply this approach to infinitely many attacker types.

3.1.2 Robust Stackelberg Games

Maximin method. The maximin method for addressing uncertainties in SSGs focuses on maximizing the defender's utility against the worst case of the uncertainties. Yin et al. [60] compute a defender strategy that is robust against defender execution uncertainty as well as uncertainty in the attacker's observations of the defender's strategy. Kiekintveld et al. [19] focus on interval uncertainty in the attacker's payoffs. Nguyen et al. [34] develop a robust strategy that takes the attacker's bounded rationality into account, as well as the uncertainties discussed in [19,60].

Minimax regret method. The minimax regret method captures another concept of robustness when handling uncertainties in SSGs. In particular, it attempts to minimize the maximum "regret", or distance of a decision (e.g., the defender's strategy) from the actual optimal decision, over all instances within the uncertainty. Nguyen et al. [36] use this concept of robustness to handle interval uncertainty in the attacker's payoffs.

The previous work has addressed neither attacker risk awareness nor ambiguity about the attacker's risk profile. Although Kiekintveld et al. [19] and Nguyen et al. [36] try to capture uncertainty in the attacker's utilities, they are unable to fully capture the attacker's risk awareness. The mapped utilities are coupled in the risk awareness setting, since they are all mapped through the same utility function Û, and interval uncertainty is unable to model that. For example, suppose target t_1 has reward 1 and penalty −2, target t_2 has reward 2 and penalty −1, and the coverage probabilities are c_1 = c_2 = 0.5. A risk-averse attacker will always attack t_2, since Û must be strictly increasing. However, a model with interval uncertainty would consider both t_1 and t_2 to be potential targets for attack.

3.2 Adversary Behavioral Models

Some previous work explores humans' bounded rationality in decision making: humans do not necessarily choose the strategy that provides them the highest expected utility [8]. Quantal response [31,32] argues that humans are more likely to choose a strategy with a higher expected utility. Yang et al. [59] apply the concept of quantal response to security games and compute the optimal strategy for the defender assuming that the attacker's response follows quantal response. Nguyen et al. [35] propose the SUQR model by extending the quantal response concept with subjective utilities in security games. However, these approaches do not model risk awareness, nor do they model the uncertainty in risk awareness that I model in my thesis. In fact, models such as SUQR essentially address concerns that are orthogonal to the issue of risk awareness; future research may thus consider integrating bounded rationality models with risk awareness.

3.3 Learning Attacker Payoffs

There has been previous work on learning attacker payoffs in repeated security games [26,30]. Letchford et al. [26] develop an algorithm to uncover the attacker type in as few rounds as possible, while my work focuses on maximizing the defender's utility. Marecki et al. [30] use MCTS to maximize the defender's utility in the first few rounds. However, their algorithm is unable to offer guidance in later rounds because it does not allow for belief updating, which is a major component of my work. Additionally, Letchford et al. [26] and Marecki et al. [30] both assume that the defender plays a mixed strategy and the attacker plays a pure strategy that maximizes his expected utility given the defender's mixed strategy. However, illegal extractions happen frequently in green security domains, so the assumption that the attacker carries out surveillance over a long time to learn the exact mixed strategy of the defender does not hold. Furthermore, I relax the assumption that the attacker is perfectly rational, in order to handle more general behavior models such as quantal response.

3.4 Green Security Games

There is a significant body of literature discussing the activity of illegal extraction of natural resources [2,16,29]. In particular, this topic has also become popular in the AI community, which emphasizes mathematical approaches [15,18,58]. Haskell et al. [15] and Yang et al.
[58] model the game between the defender and the attacker as a repeated Stackelberg game, where the defender plays a mixed strategy and the attacker plays a pure strategy against the mixed strategy at every round. They assume that the attacker is a non-rational SUQR-playing agent [35] with unknown parameters. They use MLE (maximum likelihood estimation) to estimate those parameters from the attacker's actions and optimize the defender's strategy against the estimate. Fang et al. [12] extend these works by assuming that the attacker responds to a certain convex combination of the defender's mixed strategies in the previous few rounds.

One main difference of my work from these previous works is that I consider a short period as a round, so that the defender plays a pure strategy at every round, while previous works consider a long period as a round, so that the defender plays a mixed strategy at every round. My model has the following advantages: (i) from the modeling perspective, it is difficult for the attacker to realize that the defender has switched from one mixed strategy to another; furthermore, the attacker carries out illegal extractions frequently, so the attacker might not have enough time to fully observe the new mixed strategy; (ii) from the strategy flexibility perspective, my model is capable of designing more flexible strategies, since "playing a mixed strategy for a long period" can be represented as "playing a randomized strategy according to some probability distribution every day during that period", while most short-period-based strategies cannot be represented by long-period-based strategies.

These previous works also suffer from two further limitations. First, this research fails to capture the defender's lack of observation of attacks: in the real world, given a large area to patrol, the defender only has observations of attacks on the limited set of targets she patrolled in any given round. She does not have full knowledge of all of the attackers' actions as assumed in [12,15,18,58], leading to an unaddressed exploration-exploitation tradeoff for defenders: informally, should the defender allocate resources to protect targets that have already been visited and have been observed to suffer a lot of attacks, or should she allocate resources to protect targets that have not been visited for a long time and hence have no observations of attacks? Second, while significant work in security games has focused on uncertainty over attackers' observations of defender actions [60], the reverse problem has received little attention. Specifically, given frequent interactions with multiple attackers, the defender herself faces observation uncertainty in observing all of the attacker actions, even at the targets she does patrol. Addressing this uncertainty in the defender's observations is important when estimating attacks on targets and addressing the exploration-exploitation tradeoff.

3.5 Exploration-exploitation Tradeoff in Security Domains

The limited observability property and the exploration-exploitation tradeoff are also noticed by Klíma in the domain of border patrol, where the border covers a large area [21,22]. They model the problem as a stochastic/adversarial multi-armed bandit problem and use the (sliding-window) UCB algorithm [4] / EXP3 algorithm [5] to plan patrollers' strategies. However, the stochastic bandit formulation fails to model patrol's effect on attackers' actions, while the adversarial bandit formulation fails to capture attackers' behavioral patterns.
3.6 Indexability of Restless Multi-armed Bandit Problem

There is a rich literature on the indexability of restless multi-armed bandit problems. Glazebrook et al. [13] provide some indexable families of restless multi-armed bandit problems. Nino-Mora [37] proposes PCL-indexability and GCL-indexability and shows that they are sufficient conditions for indexability. Liu and Zhao [27] apply the concept of RMABs to dynamic multichannel access. In their model, every arm is a two-state Markov chain and the player only knows the state of the arm he chooses to activate. They prove the indexability of their problem and find a closed-form solution for the Whittle index. In [38], Ny et al. consider the same class of RMABs, motivated by the application of UAV routing. This problem shares some similarity with my problem, but my problem is more difficult in the following aspects: (i) I cannot directly observe the states (POMDP vs. MDP); (ii) different actions lead to different transition matrices in my model; (iii) I allow for more states and observations. A further extension of this work discusses the case with probing errors, where the player's observation of the state might be incorrect [28]. This concept is similar to what I assume in my model, but the detailed settings are different.

Chapter 4
Robust Strategy against Risk-aware Attackers in SSGs

This chapter discusses my contribution of computing the robust strategy against risk-aware attackers. I will use risk-averse attackers as an example to discuss the algorithm to compute the robust strategy, and then extend the algorithm to handle other types of risk-aware attackers.

A major motivation of this work is that the attacker is risk-averse in some domains, e.g., terrorists in the counter-terrorism domain. George Habash of the Popular Front for the Liberation of Palestine has said "the main point is to select targets where success is 100% assured" [17]. A report from the RAND Corporation [33] discusses the role of deterrence in the counter-terrorism domain. They mention the following evidence in the report:

    In the doctrine of groups like the Provisional Irish Republican Army, requirements for operational planning include explicit consideration of how pre-attack surveillance can be used to manage and reduce operational risks. Similarly, in a document captured from the Islamic State of Iraq/al Qaeda in Iraq, a group member laments the deleterious effects on potential suicide bombers when they suspect that poor planning may result in their lives being wasted on low-value targets.

The RAND report takes advantage of the fact that terrorists are risk-averse and hate uncertainty, and discusses several possible ways of increasing the uncertainty terrorists face in order to deter them. Besides that, creating uncertainty is already a key part of some security planning. For example, the Transportation Research Board [40] suggests one goal of security should be to "create a high degree of uncertainty among terrorists about their chances of defeating the system." A similar point was made by the Defense Science Board [39] with respect to deterrence as part of national defense against nuclear terrorism:

    The deterrent aspect of the protection equation involves the often-great differences between how a defender and an attacker will view the relative capabilities of the defense. The long history of offense/defense competitions is strongly characterized by both sides taking own-side-conservative views.
    More particularly, the annals of terrorism and counterterrorism are replete with instances in which a prospective attacker was deterred by aspects of the defense that may have seemed relatively weak and ineffectual to the defender. The terrorist may not be afraid to die, but he (or his master) does not want to fail. Dissuasion/deterrence by the adversary's fear of failure might work in a variety of ways. One aspect is that an attacker will want to know enough about the defense to design a robust, successful attack. If the capabilities of the defense can be improved enough that the attacker must know the details of defensive measures in place to understand how best to surmount them, then the attacker may expose himself to discovery during the planning phases of the attack or be altogether dissuaded from the attempt. Creating uncertainty in the attacker's mind will be critical to maximizing the success of defenses which, realistically, cannot aspire to perfection. To exploit the effects of uncertainty, the defense should be deliberately designed and deployed to create as much ambiguity for the attacker as possible as to where the boundaries of defense performance lie.

There is another thread of work that studies terrorist risk attitudes [45,47,49]. In [47], portfolio theory is applied to study a terrorist group's decision making process, and this research argues that terrorist strategies are risk-averse and are highly sensitive to the group's level of risk aversion. While this finding of risk aversion may appear counter-intuitive, notice that it is the terrorist groups (and the planners in these groups) that are found to be risk-averse due to resource limitations, not the individuals in the organization who finally launch an attack. [45] studies the risk preferences of Al Qaeda specifically and concludes that the group is risk-averse and consistently displays the same degree of risk aversion in its activities. This work is further extended in [49], where the degree of risk aversion for Al Qaeda is estimated empirically based on data from attacks over the last decade.

I first build a robust SSG framework against an attacker with uncertainty in the level of risk aversion. Second, building on previous work on SSGs in mixed-integer programs, I propose a novel mixed-integer bilinear programming problem (MIBLP), and find that it only finds locally optimal solutions. While the MIBLP formulation is also unable to scale up, it provides key intuition for my new algorithm. This new algorithm, BeRRA (Binary search based Robust algorithm against Risk-Averse attackers), is my third contribution; it finds globally ǫ-optimal solutions by solving O(n log(1/ǫ) log(1/δ)) linear feasibility problems. The key idea of the BeRRA algorithm is to reduce the problem from maximizing the reward with a given number of resources to minimizing the number of resources needed to achieve a given reward. This transformation allows BeRRA to scale up via the removal of the bilinear terms and integer variables, as well as the utilization of key theoretical properties that prove correspondence of its potential "attack sets" [20] with those of the maximin strategy. Finally, I also show that the defender does not need to consider the attacker's risk attitude in zero-sum games. The experimental results show the solution quality and runtime advantages of my robust model and BeRRA algorithm.

4.1 Model

The SSE strategy provides the optimal defender strategy when the attacker is risk-neutral.
However, as previously discussed, attackers are risk-averse rather than risk-neutral in several key domains. If the defender executes the SSE strategy against a risk-averse attacker, then the defender may suffer significant losses in solution quality. I show in Example 4.1.1 that these losses can be arbitrarily large.

Example 4.1.1. Suppose there are two targets, t_1 and t_2, in the game, and their utilities are as shown in Table 4.1. The defender has only 1 resource. The SSE strategy of the game is c_1 = 0.4, c_2 = 0.6. If the attacker is risk-averse, he would choose to attack t_1 (these two targets are identical to the attacker in terms of expected utility, but a risk-averse attacker prefers a small reward with high probability over a high reward with low probability), and the defender's reward under the SSE strategy would be 0.4 + 0.6x. However, if the defender executes the strategy c_1 = 1, c_2 = 0, then the attacker would attack t_2 and the defender would receive reward −1. Compared with −1, the loss of the SSE strategy can be arbitrarily large since x can be arbitrarily small.

Table 4.1: Utility Example

          U^c_d   U^u_d   U^c_a   U^u_a
  t_1       1       x      -1       1
  t_2       1      -1      -1       2

This example strongly motivates the need to consider risk-averse attackers. However, real-world defenders are uncertain about the attacker's degree of risk aversion, and the defender may suffer significant losses if she incorrectly estimates it. Therefore, I focus on a robust strategy in this work, i.e., my aim is to compute a defender strategy that is robust against all possible risk-averse attackers.

In the literature on risk, the utility function f, which maps values to utilities, is used to specify the risk preference. f is concave in the risk-averse case and convex in the risk-seeking case, while the risk-neutral case corresponds to the function y = Cx, C > 0. The agent makes decisions based on the mapped utilities.

In my problem, I define the mapping function Û that maps the utilities U^c_a(i) and U^u_a(i) to the attacker's mapped utilities. I denote Û_a(i,c) ≜ c_i Û(U^c_a(i)) + (1−c_i) Û(U^u_a(i)) as the attacker's expected utility under the mapping Û. I restrict Û to be strictly increasing, concave, and satisfying the equality Û(0) = 0: strictly increasing reflects the preference for more over less, concavity corresponds to risk aversion, and Û(0) = 0 distinguishes between gains and losses. According to this definition, the risk-averse case includes the risk-neutral case. I define U to be the set of all valid mapping functions Û.

Problem (4.1) describes the robust defender strategy through a bilevel optimization problem. In the upper level, the defender chooses c to maximize her expected utility U_d(k,c). The constraint k ∈ argmax_{i∈T} Û_a(i,c) requires target k to have the highest expected utility for the attacker under the utility mapping Û when the defender's strategy is c. The lower level states that the defender maximizes her worst-case reward over all possible attacker responses with utility mapping functions Û ∈ U. The lower level also implies that the attacker breaks ties against the defender, due to the concept of robustness. I define the solution c to be the Robust Stackelberg Equilibrium (RSE) strategy of the game.

\[
\begin{array}{ll}
\displaystyle \max_{c} \; \min_{\hat{U} \in \mathcal{U},\, k} & U_d(k,c) \;:\; k \in \operatorname{argmax}_{i \in T} \hat{U}_a(i,c) \\
\text{s.t.} & 0 \le c_i \le 1, \quad \forall i \in T \\
 & \sum_{i \in T} c_i \le m
\end{array}
\tag{4.1}
\]
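As a quick numerical check of Example 4.1.1, the short script below (a toy sketch, not code from the thesis) evaluates both targets under one particular concave mapping, Û(u) = 1 − e^{−u}, which is strictly increasing, concave, and has Û(0) = 0. It confirms that under c = (0.4, 0.6) this risk-averse attacker strictly prefers t_1, while under c = (1, 0) he prefers t_2, so the defender receives −1.

```python
# Toy check of Example 4.1.1 with the concave mapping U_hat(u) = 1 - exp(-u)
# (strictly increasing, concave, U_hat(0) = 0). Not code from the thesis.
import math

# (U_a^c, U_a^u) for t1 and t2, from Table 4.1
attacker_util = {"t1": (-1.0, 1.0), "t2": (-1.0, 2.0)}
U_hat = lambda u: 1.0 - math.exp(-u)

def mapped_eu(cov):
    """Attacker's expected mapped utility for each target under coverage cov."""
    return {t: cov[t] * U_hat(uc) + (1 - cov[t]) * U_hat(uu)
            for t, (uc, uu) in attacker_util.items()}

print(mapped_eu({"t1": 0.4, "t2": 0.6}))  # t1 strictly preferred -> defender gets 0.4 + 0.6x
print(mapped_eu({"t1": 1.0, "t2": 0.0}))  # t2 preferred -> defender gets -1
```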
4.2 Preliminaries

In its current form, the optimization problem (4.1) is not tractable, because it is a bilevel programming problem that requires the solution of uncountably many inner optimization problems indexed by U [6]. To take steps towards tractability, in Sections 4.2.1 and 4.2.2 I provide key concepts that are used in my MIBLP formulation (Section 4.3) and my BeRRA algorithm (Section 4.4).

4.2.1 Risk Aversion Modeling

In this section, I write the condition Û ∈ U in a computationally tractable way via linear constraints. For any utility function Û ∈ U, we are actually only interested in its values at 0 and at the points of the attacker's utility set U^c_a(i) and U^u_a(i), which I denote as Θ:

\[
\Theta = \{ U^u_a(i), U^c_a(i), \forall i \in T \} \cup \{0\} = \{\theta_1, \ldots, \theta_I\}, \quad \text{where } \theta_1 < \theta_2 < \cdots < \theta_I.
\]

Lemma 4.2.1. Choose ǫ_u > 0. (Since Problem (4.1) is invariant under scaling of Û, i.e., the attacker makes the same decision under either Û or αÛ for any α > 0, the value of ǫ_u does not affect the result.) Û ∈ U is equivalent to satisfying the linear constraints (4.2) on the values {Û(θ)}_{θ∈Θ}, i.e., every Û ∈ U satisfies constraints (4.2), and for every {Û′(θ)}_{θ∈Θ} that satisfies constraints (4.2), there exists Û ∈ U such that Û(θ) = Û′(θ) for all θ ∈ Θ.

\[
\begin{gathered}
\frac{\hat{U}(\theta_2) - \hat{U}(\theta_1)}{\theta_2 - \theta_1} \ge \frac{\hat{U}(\theta_3) - \hat{U}(\theta_2)}{\theta_3 - \theta_2} \ge \cdots \ge \frac{\hat{U}(\theta_I) - \hat{U}(\theta_{I-1})}{\theta_I - \theta_{I-1}} \ge \epsilon_u \\
\hat{U}(0) = 0
\end{gathered}
\tag{4.2}
\]

Proof. If Û ∈ U, then {Û(θ)}_{θ∈Θ} satisfies constraints (4.2) by definition. Conversely, if {Û′(θ)}_{θ∈Θ} satisfies constraints (4.2), the piecewise linear function that connects {(θ_1, Û′(θ_1)), (θ_2, Û′(θ_2))}, {(θ_2, Û′(θ_2)), (θ_3, Û′(θ_3))}, ..., {(θ_{I−1}, Û′(θ_{I−1})), (θ_I, Û′(θ_I))} belongs to U.

Based on Lemma 4.2.1, the condition Û ∈ U is completely captured by constraints (4.2). From now on I denote the constraints (4.2) compactly as Û ∈ U.

4.2.2 Possible Attack Set

In this section, to better understand Problem (4.1), I study the "possible attack set" S_p(c) and its complement S_i(c) = T − S_p(c).

Definition 4.2.2. Given the coverage probability c, the Possible Attack Set S_p(c) is defined to be the set of targets that may be attacked by a risk-averse attacker, i.e., the set of targets that have the highest expected utility for the attacker for some Û ∈ U. S_i(c) = T − S_p(c) is defined to be the set of targets that the attacker will never attack, i.e., the set of targets for which, under any Û ∈ U, there always exists another target i ∈ S_p(c) with a higher expected utility for the attacker.

Given the coverage probability c, we can compute S_p(c) and S_i(c) by testing the feasibility of the following constraints for every target:

\[
\begin{gathered}
\hat{U}_a(i,c) \ge \hat{U}_a(j,c), \quad \forall j \in T, j \ne i \\
\hat{U} \in \mathcal{U}
\end{gathered}
\tag{4.3}
\]

If these constraints are feasible for a target i, there exists a mapping Û ∈ U under which target i has the highest expected utility for the attacker, and thus i ∈ S_p(c); otherwise, i ∈ S_i(c).

In Problem (4.1), when the defender's strategy c is given, the defender's (worst-case) reward is

\[
\min_{\hat{U} \in \mathcal{U},\, k} \; U_d(k,c) \;:\; k \in \operatorname{argmax}_{i \in T} \hat{U}_a(i,c)
\]

which is equivalent to

\[
\min_{i \in S_p(c)} U_d(i,c)
\]

So Problem (4.1) can be written as

\[
\begin{array}{ll}
\displaystyle \max_{c} & \displaystyle \min_{i \in S_p(c)} U_d(i,c) \\
\text{s.t.} & 0 \le c_i \le 1, \quad \forall i \in T \\
 & \sum_{i \in T} c_i \le m
\end{array}
\tag{4.4}
\]
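The feasibility test in constraints (4.3) is simply a linear program in the values Û(θ_1), ..., Û(θ_I). Below is a minimal sketch of that test using scipy's linprog. It is an illustration only: the payoffs, coverage vector, and ǫ_u are toy placeholders (reusing the Example 4.1.1 numbers), and the encoding of constraints (4.2)/(4.3) follows the definitions above rather than the thesis's actual implementation.

```python
# Sketch of the feasibility test (4.3): is target i in the possible attack set S_p(c)?
# Decision variables are the values U_hat(theta) for theta in Theta; constraints (4.2)
# encode "increasing + concave + U_hat(0) = 0", and constraints (4.3) encode
# "target i is a best response for some such mapping". Toy payoffs, not thesis data.
import numpy as np
from scipy.optimize import linprog

Ua = {0: (-1.0, 1.0), 1: (-1.0, 2.0)}      # target -> (U_a^c, U_a^u)
cov = np.array([0.4, 0.6])                 # coverage probabilities c_i
eps_u = 1e-3

theta = sorted({0.0} | {v for pair in Ua.values() for v in pair})
idx = {v: k for k, v in enumerate(theta)}
I = len(theta)

def slope_row(k):
    """Row s with s . u = (u_{k+1} - u_k) / (theta_{k+1} - theta_k)."""
    row = np.zeros(I)
    d = theta[k + 1] - theta[k]
    row[k], row[k + 1] = -1.0 / d, 1.0 / d
    return row

def eu_row(i):
    """Row r with r . u = U_hat_a(i, c)."""
    row = np.zeros(I)
    uc, uu = Ua[i]
    row[idx[uc]] += cov[i]
    row[idx[uu]] += 1.0 - cov[i]
    return row

def in_possible_attack_set(i):
    A_ub, b_ub = [], []
    for k in range(I - 2):                     # concavity: slope_{k+1} <= slope_k
        A_ub.append(slope_row(k + 1) - slope_row(k)); b_ub.append(0.0)
    A_ub.append(-slope_row(I - 2)); b_ub.append(-eps_u)   # last slope >= eps_u
    for j in Ua:                               # target i is a best response
        if j != i:
            A_ub.append(eu_row(j) - eu_row(i)); b_ub.append(0.0)
    A_eq = np.zeros((1, I)); A_eq[0, idx[0.0]] = 1.0      # U_hat(0) = 0
    res = linprog(np.zeros(I), A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  A_eq=A_eq, b_eq=[0.0], bounds=[(None, None)] * I)
    return res.status == 0                     # feasible <=> i in S_p(c)

print({i: in_possible_attack_set(i) for i in Ua})
```

On the Example 4.1.1 payoffs with c = (0.4, 0.6), both targets come back feasible, matching the discussion above: a risk-neutral mapping (which is in U) is indifferent between them, while a more strongly concave one prefers t_1. This same test is what Algorithm 3 later invokes for each target.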
4.3 MIBLP Formulation

In this section, I formulate Problem (4.4) as an MIBLP to find the RSE strategy for the defender. While this approach does not scale up to large games, it provides several insights for my BeRRA algorithm. As in Problem (2.1), I use integer variables {q_i}_{i∈T} to denote whether target i belongs to S_p(c): I set q_i = 1 if i ∈ S_p(c) and q_i = 0 if i ∈ S_i(c). Problem (4.4) can then be converted to the formulation below.

\[
\begin{array}{ll}
\displaystyle \max_{c} & v \\
\text{s.t.} & 0 \le c_i \le 1, \quad \forall i \in T \\
 & \sum_{i \in T} c_i \le m \\
 & q_i \in \{0,1\}, \quad \forall i \in T \\
 & v \le U_d(i,c) + (1-q_i)M, \quad \forall i \in T \\
 & i \in S_p(c) \Leftrightarrow q_i = 1 \\
 & i \in S_i(c) \Leftrightarrow q_i = 0
\end{array}
\tag{4.5}
\]

When i ∈ S_p(c), constraints (4.3) are feasible for target i. When i ∈ S_i(c), for any utility mapping Û ∈ U there is always another target with a higher expected utility for the attacker. I approximate this strict inequality with a small ǫ_c > 0:

\[
\min_{\hat{U} \in \mathcal{U}} \Big( \max_{j \in T} \hat{U}_a(j,c) - \hat{U}_a(i,c) \Big) \ge \epsilon_c
\]

which states that for any Û ∈ U, there exists a target j ∈ T whose expected utility for the attacker is at least ǫ_c more than the expected utility of target i. By substituting max_{j∈T} Û_a(j,c) with the slack variable λ, the preceding bilevel optimization problem can be reduced to:

\[
\begin{array}{ll}
\displaystyle \min_{\hat{U}, \lambda} & \lambda - \hat{U}_a(i,c) \\
\text{s.t.} & \hat{U}_a(j,c) \le \lambda, \quad \forall j \in T \\
 & \hat{U} \in \mathcal{U}
\end{array}
\tag{4.6}
\]

If the optimal value of Problem (4.6) is larger than ǫ_c, then i ∈ S_i(c); otherwise, i ∈ S_p(c) (subject to the approximation error introduced by ǫ_c). Since Problem (4.6) is a minimization problem, it cannot directly substitute for the constraint i ∈ S_i(c) ⇔ q_i = 0 in Problem (4.5). So, I take the Lagrangian dual of Problem (4.6) to convert it into a maximization problem:

\[
\begin{array}{ll}
\displaystyle \max_{\alpha,\beta,\gamma,\kappa} & \beta \epsilon_u \\
\text{s.t.} & \displaystyle \sum_{j \in T} \gamma_j = 1 \\
 & \displaystyle \sum_{k \in T} \Big( \gamma_k c_k \mathbf{1}\{\theta_j = U^c_a(k)\} + \gamma_k (1-c_k) \mathbf{1}\{\theta_j = U^u_a(k)\} \Big)
   - c_i \mathbf{1}\{\theta_j = U^c_a(i)\} - (1-c_i) \mathbf{1}\{\theta_j = U^u_a(i)\} \\
 & \quad + \dfrac{\alpha_{j-2} \mathbf{1}\{j \ge 3\}}{\theta_j - \theta_{j-1}}
   - \dfrac{\alpha_{j-1} \mathbf{1}\{I-1 \ge j \ge 2\}}{\theta_{j+1} - \theta_j}
   - \dfrac{\alpha_{j-1} \mathbf{1}\{I-1 \ge j \ge 2\}}{\theta_j - \theta_{j-1}}
   + \dfrac{\alpha_j \mathbf{1}\{j \le I-2\}}{\theta_{j+1} - \theta_j} \\
 & \quad - \dfrac{\beta \mathbf{1}\{j = I\}}{\theta_j - \theta_{j-1}}
   + \dfrac{\beta \mathbf{1}\{j = I-1\}}{\theta_{j+1} - \theta_j}
   + \kappa \mathbf{1}\{\theta_j = 0\} = 0, \quad \forall j \in \{1,2,\ldots,I\} \\
 & \alpha_j \ge 0, \quad \forall j \in \{1,2,\ldots,I-2\} \\
 & \beta \ge 0 \\
 & \gamma_j \ge 0, \quad \forall j \in T
\end{array}
\tag{4.7}
\]

For succinct notation, I denote the constraints on the variables (α, β, γ, κ) in the above formulation as (α, β, γ, κ) ∈ D. By applying Problem (4.3) and Problem (4.7) to every target i to put constraints on q_i, I summarize my final MIBLP formulation in the next theorem.

Theorem 4.3.1. Problem (4.1) is (approximately) equivalent to

\[
\begin{array}{ll}
\max & v \\
\text{s.t.} & 0 \le c_i \le 1, \quad \forall i \in T \\
 & \sum_{i \in T} c_i \le m \\
 & q_i \in \{0,1\}, \quad \forall i \in T \\
 & v \le U_d(i,c) + (1-q_i)M, \quad \forall i \in T \\
 & \hat{U}^i_a(j,c) \le \hat{U}^i_a(i,c) + (1-q_i)M, \quad \forall i \in T, \forall j \in T, j \ne i \\
 & \hat{U}^i \in \mathcal{U}, \quad \forall i \in T \\
 & \beta^i \epsilon_u \ge \epsilon_c - q_i M, \quad \forall i \in T \\
 & (\alpha^i, \beta^i, \gamma^i, \kappa^i) \in D, \quad \forall i \in T
\end{array}
\tag{4.8}
\]

where the superscript i in α^i, β^i, γ^i, κ^i and Û^i marks different variables, and Û^i_a(j,c) is the attacker's expected utility for target j under the mapping Û^i and defender strategy c. (The approximation is due to the introduction of ǫ_c.)

Proof. If q_i = 1, the constraints Û^i_a(j,c) ≤ Û^i_a(i,c) + (1−q_i)M, ∀j ∈ T, j ≠ i and Û^i ∈ U ensure that i ∈ S_p(c); if q_i = 0, these constraints are always feasible and can be ignored. If q_i = 0, the constraints β^i ǫ_u ≥ ǫ_c − q_i M and (α^i, β^i, γ^i, κ^i) ∈ D ensure that i ∈ S_i(c) (approximately), since there exists a solution (α^i, β^i, γ^i, κ^i) ∈ D satisfying β^i ǫ_u ≥ ǫ_c; thus, the objective of Problem (4.7) is larger than ǫ_c, and i ∈ S_i(c). For the other direction, if the objective of Problem (4.7) is larger than ǫ_c, then these two constraints are also satisfied. If q_i = 1, these constraints are always feasible and can be ignored.

In summary, I have converted Problem (4.1) into Problem (4.8), which is an MIBLP: {q_i}_{i∈T} are integer variables, and Û^i_a(j,c) = c_j Û^i(U^c_a(j)) + (1−c_j) Û^i(U^u_a(j)) contains bilinear terms, since both c_j and Û^i(U^c_a(j)) / Û^i(U^u_a(j)) are variables. Problem (4.8) is a non-convex optimization problem and lacks efficient solvers. I used a powerful nonlinear solver, KNITRO, to search for locally optimal solutions to Problem (4.8). However, this approach does not scale up: the two-target scenario takes about 1 minute and the three-target scenario takes about 15 minutes to solve.
Faced with this scalability issue, I develop the BeRRA algorithm that finds the ǫ-optimal solution and provides significant scalability. 4.4 BeRRA Algorithm Problem (4.8) has two main hindrances to scaling up: the presence of Θ(n 2 ) bilinear terms and the presence of n integer variables. Thus, eliminating these bilinear terms and integer variables should allow us to scale the problem up. The bilinear terms in Problem (4.8) have two components: the coverage probability c i and the mapped attacker utilities b U(U c a (i))/ b U(U u a (i)). Intuitively, we can avoid the bilinearity by fixing one of these two terms. In addition, if the coverage probability c is fixed, then S p (c) is also fixed and we no longer need the integer variables{q i } i∈T to represent if i∈ S p (c). Based on the idea of fixingthecoverage probabilityc, I develop theBeRRA algorithm. Thisalgorithm computes an ǫ-optimal RSE strategy where ǫ can be made arbitrarily small. 36 The main idea of the BeRRA algorithm is to reduce the problem to computing the minimum amount of resources needed to achieve a given reward, which can be solved efficiently by using special properties of the problem. With this reduction, I use binary search to find the highest reward that the defender can achieve with the given number of resources. The high-level intuition of this reduction is that a fixed defender’s reward leads to fixed defender maximin strategy, which eliminates the bilinear terms and integer variables. Additionally, optimal strategy can be derived efficiently from the maximin strategy via the property S p (c max )=S p (c opt ). 4.4.1 Binary Search Reduction Algorithm 1 lists the steps of my BeRRA algorithm. The input to Algorithm 1 is the number of defender resources m and the defender’s and the attacker’s utilities U. The outputisthedefender’sRSEstrategycandherrewardlb. Thelower boundlbandupper boundubarefirstset to bethelowest andthehighest possiblerewards, respectively, that the defender may achieve (Line 2). The function MinimumResources(r, U) returns the strategypthat usestheminimumnumberof resources forthedefendertoachieve reward r. This function will be discussed in detail in Section 4.5. During the binary search phase (Lines 3∼ 11), the lower bound is set to be the defender’s achievable reward (the strategy p returned by the MinimumResources function is the solution) and the upper bound is set to be an unachievable reward. Therefore, the BeRRA algorithm achieves the ǫ-optimal solution and we can set ǫ arbitrarily small to get arbitrarily near-optimal solutions. 37 Algorithm 1 BeRRA Algorithm 1: function BeRRA (m,U) 2: lb←min i∈T U u d (i),ub←max i∈T U c d (i) 3: while ub−lb≥ǫ do 4: p← MinimumResources( lb+ub 2 , U) 5: if P i∈T p i ≤m then 6: lb← lb+ub 2 7: c←p 8: else 9: ub← lb+ub 2 10: end if 11: end while 12: return (c,lb) 13: end function 4.5 Minimum Resources I present Algorithm 2 in this section. This algorithm computes the defender strategy that requires as few resources as possible to achieve a given reward r, i.e., the Minimum- Resources function in Algorithm 1. I call this resource-minimizing strategy the optimal strategy and denote it as c opt for succinctness. Algorithm 2 Minimum Resources 1: function MinimumResources(r,U) 2: (flag,c max ,S p (c max ),S i (c max ))← Maximin(r,U) 3: if flag =false then 4: return (∞,∞,...,∞) ⊤ 5: end if 6: c opt ← Reduce(U,c max ,S p (c max ),S i (c max )) 7: return c opt 8: end function Algorithm 2 consists of two functions: Maximin and Reduce. 
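Before detailing those two functions, the control flow of Algorithm 1 can be summarized with a short Python sketch. This is an illustration, not the thesis's implementation: minimum_resources stands for the MinimumResources routine of Algorithm 2 and is passed in as a callable, and the tolerance default is arbitrary.

def berra(m, U_d_unc, U_d_cov, minimum_resources, eps=1e-3):
    """Sketch of Algorithm 1: binary-search for the highest defender reward r that
    the MinimumResources subroutine can reach with at most m resources."""
    lb, ub = min(U_d_unc), max(U_d_cov)   # lowest / highest rewards the defender could get
    best_c = None
    while ub - lb >= eps:
        r = (lb + ub) / 2.0
        p = minimum_resources(r)          # Algorithm 2: cheapest coverage achieving reward r
        if sum(p) <= m:                   # r is achievable with the available resources
            lb, best_c = r, p
        else:                             # r needs more than m resources; lower the target
            ub = r
    return best_c, lb                     # epsilon-optimal RSE strategy and its reward

If a reward r is infeasible for any amount of resources, MinimumResources returns an all-infinite vector, so the sum(p) <= m test fails and the upper bound is tightened, exactly as in Lines 3-11 of Algorithm 1.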
The Maximin function computes the maximin strategy c max for which the defender achieves reward r, as well as the corresponding sets S p (c max ) and S i (c max ). The variable flag is set to false when the input reward is not achievable for any amount of defender resources. In this 38 case, Algorithm 2 returns (∞,∞,...,∞) ⊤ (Lines 3∼ 5) so that Algorithm 1 knows r is not achievable. I will prove in Theorem 4.5.7 that if the reward r is achievable, then S p (c max ) =S p (c opt ) and S i (c max ) =S i (c opt ). Based on this property, the Reduce functionderivestheoptimal strategyc opt fromthemaximinstrategyc max . Section 4.5.1 and Section 4.5.2 discuss these two functions in detail. 4.5.1 Maximin Function The Maximin function is summarized in Algorithm 3. It first computes the maximin strategy c max for which the defender achieves reward r (Lines 2∼4) and then it assigns eachtargettoeitherS p (c max )orS i (c max )(Lines5∼15). Iftherewardrisnotachievable for any amount of resources, then it returns flag =false (Line 10). Algorithm 3 Maximin 1: function Maximin(r,U) 2: for i =1→n do 3: c max i ←min{1,max{ r−U u d (i) U c d (i)−U u d (i) ,0}} 4: end for 5: S p (c max ),S i (c max )←∅ 6: for i =1→n do 7: if Problem (4.3) is feasible for target i given c max then 8: S p (c max )←S p (c max )∪{i} 9: if r >U c d (i) then 10: return (false,c max ,S p (c max ),S i (c max )) 11: end if 12: else 13: S i (c max )← S i (c max )∪{i} 14: end if 15: end for 16: return (true,c max ,S p (c max ),S i (c max )) 17: end function Lines 2∼ 4 compute the maximin strategy for a given reward r. Given a cover- age probability c, the maximin setting assumes that the attacker attacks target i = 39 argmin j∈T U d (j,c), and thus the defender’s reward will be min i∈T U d (i,c). For the de- fender to achieve reward r, we should have U d (i,c)≥r,∀i∈T so that c max i = r−U u d (i) U c d (i)−U u d (i) (which is bounded by [0,1]). Given the maximin strategyc max , Lines 5∼15 iterate through all targets and assign them to either S p (c max ) or S i (c max ) by testing the feasibility of constraints (4.3). If these constraints are feasible, then i∈S p (c); otherwise, i∈S i (c). Next in Lemma4.5.2 I prove that∃i∈S p (c max ) that satisfies r >U c d (i) if and only if reward r is not achievable. In that case, Algorithm 3 returns flag =false (Lines 9∼11). Lemma 4.5.1. Given coverage probability c, the defender’s reward is min i∈Sp(c) U d (i,c). Proof. Follows from the form of problem 4.4. Lemma 4.5.2. Reward r is infeasible if and only if Algorithm 3 returns flag =false. Proof. I first prove that if the reward r is infeasible, Algorithm 3 returns with flag = false. If Algorithm 3 returns with flag = false, according to the steps of Algorithm 3, ∀i∈S p (c max ), we have U c d (i)≥r⇒U d (i,c max )≥r. So, according to Lemma 4.5.1, the reward of the strategy c max is at least r, so the reward r is feasible. Next I prove that if Algorithm 3 returns with flag = false, then the reward r is infeasible. If Algorithm 3 returns with flag = false, then there exists a target i∈ S p (c max ) such that r >U c d (i), and we have c max i =1 for this target. Since i∈S p (c max ), there exists a mapping b U ∈U under which b U a (i,c max ) = b U c a (i)≥ b U a (j,c max ),∀j ∈ T,j6= i. If the reward r is feasible, there must exist a strategy c which has reward at least r where i∈ S i (c), or else the reward will be at most U c d (i) < r. 
Thus, there exists a target j6= i that maximizes the expected utility of the attacker under the mapping 40 b U and strategy c such that b U a (j,c) > b U a (i,c). Thus b U a (j,c) > b U a (i,c) ≥ b U c a (i)≥ b U a (j,c max ), so c j < c max j , which implies U d (j,c) < U d (j,c max ). However, c j < c max j ⇒ c max j > 0⇒ c max j = min{1, r−U u d (j) U c d (j)−U u d (j) }⇒ U d (j,c max ) = min{U c d (j),r}≤ r, so we see U d (j,c) < U d (j,c max )≤ r. Since j∈ S p (c), the strategy c has a reward less than r, which contradicts the assumption that c has a reward at least r. In conclusion, r is infeasible if Algorithm 3 returns with flag =false. Theorem 4.5.7 demonstrates why we compute c max , S p (c max ) and S i (c max ). We see that S p (c max ) = S p (c opt ) and S i (c max ) = S i (c opt ). Therefore, we get S p (c opt ) and S i (c opt ) by computing S p (c max ) and S i (c max ). I introduce supporting lemmas before proving Theorem 4.5.7. ThenexttwolemmasexplainhowthesetS p (c)changeswhenthecoverageprobability for a certain target decreases. Lemma 4.5.3 shows that if the coverage probability for a target i∈ S p (c) decreases, then the set S p (c) “shrinks”. Lemma 4.5.4 shows that if the coverage probability for a target i∈ S i (c) decreases, then the set S p (c) also “shrinks” but target i might be added to it. Lemma 4.5.3. Given coverage probability c and another coverage probability c ′ which satisfies c ′ i <c i for a target i∈S p (c) and c ′ j =c j ,∀j∈T,j6=i, we have S p (c ′ )⊆S p (c). Proof. We prove S i (c ′ )⊇S i (c). ∀j ∈ S i (c),∀ b U ∈U,∃k∈ S p (c) such that b U a (k,c) > b U a (j,c). For this mapping b U, the targets j and k, b U a (k,c ′ )≥ b U a (k,c) since c ′ k ≤ c k and b U a (j,c ′ ) = b U a (j,c) since c ′ j =c j . So, we have b U a (k,c ′ )> b U a (j,c ′ ) and thus j∈S i (c ′ ). 41 Lemma 4.5.4. Given coverage probability c and another coverage probability c ′ which satisfies c ′ i < c i for a target i∈ S i (c) and c ′ j = c j ,∀j ∈ T,j 6= i, we have S p (c ′ )⊆ S p (c) S {i}. Proof. We prove S i (c ′ )⊇S i (c)\{i}. ∀j∈S i (c),j6=i,∀ b U∈U,∃k∈S p (c) such that b U a (k,c) > b U a (j,c). For this mapping b U, the targets j and k, b U a (k,c ′ ) = b U a (k,c) since c ′ k = c k and b U a (j,c ′ ) = b U a (j,c) since c ′ j =c j , so we have b U a (k,c ′ )> b U a (j,c ′ ) and thus j∈S i (c ′ ). The next two lemmas discuss key properties of c opt . Lemma 4.5.5 shows that the coverage probability for a target i ∈ S p (c opt ) must be max{ r−U u d (i) U c d (i)−U u d (i) ,0}; Lemma 4.5.6 shows that the coverage probability for a target i∈ S i (c opt ) is at most min{1, max{ r−U u d (i) U c d (i)−U u d (i) ,0}}. This property is used in the Reduce function that derives c opt from c max , as well as in the proof of Theorem 4.5.7. Lemma 4.5.5. Given a feasible reward r, all i∈ S p (c opt ) must satisfy U c d (i)≥ r and have expected reward max{U u d (i),r} for the defender, i.e., c opt i =max{ r−U u d (i) U c d (i)−U u d (i) ,0},∀i∈ S p (c opt ). Proof. According to Lemma 4.5.1,∀i∈ S p (c opt ), U d (i,c opt )≥ r, so we have U c d (i)≥ U d (i,c opt )≥ r. Additionally, U d (i,c opt )≥ r⇒ c opt i ≥ max{ r−U u d (i) U c d (i)−U u d (i) ,0}. Next I will prove c opt i =max{ r−U u d (i) U c d (i)−U u d (i) ,0} by contradiction. Suppose there exists a target i ∈ S p (c opt ) with coverage probability c opt i > max { r−U u d (i) U c d (i)−U u d (i) ,0}. 
I show that a more resource-conservative strategy c with c i = max { r−U u d (i) U c d (i)−U u d (i) ,0} < c opt i ,c j =c opt j ,∀j∈T,j6= i also has reward at least r for the defender. According to Lemma 4.5.1, we have U d (i,c opt )≥ r,∀i∈ S p (c opt ). According to Lemma 42 4.5.3, S p (c)⊂ S p (c opt ), so∀k∈ S p (c),k ∈ S p (c opt ), if k = i, U d (k,c) = U d (i,c) = max{U u d (i),r}≥ r; if k6= i, U d (k,c) = U d (k,c opt )≥ r, so the strategy c also provides reward at least r. Thusc opt is not optimal, which is a contradiction. Lemma 4.5.6. Given a feasible reward r,∀i∈ S i (c opt ), i has expected reward at most min{U c d (i),max{U u d (i),r}} for the defender, i.e., c opt i ≤min{1,max{ r−U u d (i) U c d (i)−U u d (i) ,0}},∀i∈ S i (c opt ). Proof. ∀i∈ S i (c opt ), if U c d (i) < r, max{ r−U u d (i) U c d (i)−U u d (i) ,0} > 1, so we have min{1,max{ r−U u d (i) U c d (i)−U u d (i) ,0}} =1. Clearly, c opt i ≤1. ∀i∈ S i (c opt ), if U c d (i)≥ r, min{1,max{ r−U u d (i) U c d (i)−U u d (i) ,0}} = max{ r−U u d (i) U c d (i)−U u d (i) ,0}≤ 1. I prove c opt i ≤max{ r−U u d (i) U c d (i)−U u d (i) ,0} by contradiction. Suppose there exists a target i ∈ S i (c opt ) with coverage probability c opt i > max { r−U u d (i) U c d (i)−U u d (i) ,0}. I show that a more resource-conservative strategy c where c i = max { r−U u d (i) U c d (i)−U u d (i) ,0} < c opt i ,c j = c opt j ,∀j ∈ T,j 6= i has reward at least r for the defender. According to Lemma 4.5.1, we have U d (k,c opt )≥r,∀k∈S p (c opt ). According to Lemma 4.5.4, S p (c)⊂ S p (c opt ) S {i}. It follows that∀k∈ S p (c),k∈ S p (c opt ) S {i}, if k = i, U d (k,c) = U d (i,c) = max{U u d (i),r}≥ r; if k∈ S p (c opt ), U d (k,c) = U d (k,c opt )≥ r, so the strategy c also provides reward at least r. Thus c opt is not optimal, which is a contradiction. We are now ready to combine these preliminary lemmas to prove Theorem 4.5.7. Theorem 4.5.7. Given a feasible reward r, S p (c max ) = S p (c opt ) and S i (c max ) = S i (c opt ). 43 Proof. First I present two results that will be used in the proof: (i)∀i∈S p (c max ), since therewardrisfeasible,Lemma4.5.2andAlgorithm3implythatU c d (i)≥rsothatc max i = max{ r−U u d (i) U c d (i)−U u d (i) ,0}; (ii)∀i∈ S p (c opt ), according to Lemma 4.5.5, U c d (i)≥ r and c opt i = max{ r−U u d (i) U c d (i)−U u d (i) ,0}. Furthermore, U c d (i)≥ r implies that c max i = max{ r−U u d (i) U c d (i)−U u d (i) ,0} according to Algorithm 3. Thus, c max i =c opt i ,∀i∈S p (c opt ). I will prove by contradiction that S p (c opt )⊆S p (c max ). Supposethere exists a target i∈ S p (c opt ) with i∈ S i (c max ), we have c max i = c opt i according to result (ii). Since i∈ S p (c opt ), b U a (i,c opt )≥ b U a (j,c opt ),∀j ∈ T,j 6= i under some mapping b U ∈U by definition. Since i∈ S i (c max ), for this mapping b U,∃j 6= i,j ∈ S p (c max ) such that b U a (j,c max ) > b U a (i,c max ) where c max j = max{ r−U u d (j) U c d (j)−U u d (j) ,0} according to result (i). So b U a (j,c max ) > b U a (i,c max ) = b U a (i,c opt )≥ b U a (j,c opt ), which leads to b U a (j,c max ) > b U a (j,c opt ). Thus, wehavec opt j >c max j =max{ r−U u d (j) U c d (j)−U u d (j) ,0}, whichcontradicts Lemmas 4.5.5 and 4.5.6. So, it must be that i∈S p (c max ) which implies S p (c opt )⊆S p (c max ). I will prove by contradiction that S i (c opt )⊆S i (c max ). Suppose there exists a target i∈ S i (c opt ) with i∈ S p (c max ). We have c max i = max{ r−U u d (i) U c d (i)−U u d (i) ,0} according to result (i). 
Sincei∈S p (c max )thereexistsamapping b U∈U suchthat b U a (i,c max )≥ b U a (j,c max ), ∀j∈ T,j6= i. Since i∈ S i (c opt ), for this mapping b U,∃j6= i,j∈ S p (c opt ) such that b U a (j,c opt ) > b U a (i,c opt ) by definition. We have c max j = c opt j according to result (ii). Thus b U a (i,c max )≥ b U a (j,c max ) = b U a (j,c opt ) > b U a (i,c opt ), which yields b U a (i,c max ) > b U a (i,c opt ). Then c opt i > c max i = max{ r−U u d (i) U c d (i)−U u d (i) ,0}, which contradicts Lemma 4.5.6. It follows that i∈S i (c max ) which implies S i (c opt )⊆S i (c max ). To conclude, S p (c opt )=S p (c max ), S i (c opt ) =S i (c max ). 44 4.5.2 Reduce Function: Derive c opt from c max Section 4.5.1 demonstrated that Algorithm 2 returns (∞,∞,...,∞) ⊤ if the reward r is infeasible; if the reward r is feasible, then S p (c max ) and S i (c max ) are the same as S p (c opt ) and S i (c opt ). It follows that Algorithm 4 correctly derives the optimal strategy c opt from c max . Algorithm 4 Computing c opt by reducing c max 1: function Reduce(U,c max ,S p (c max ),S i (c max )) 2: c opt =c max 3: for every i∈S i (c max ) do 4: lb← 0,ub←c opt i 5: while ub−lb≥δ do 6: c opt i ← lb+ub 2 7: if Problem (4.3) is feasible for target i given c opt then 8: lb← lb+ub 2 9: else 10: ub← lb+ub 2 11: end if 12: end while 13: c opt i ←ub 14: end for 15: return c opt 16: end function Givenc max ,S p (c max )andS i (c max ),Algorithm4returnsc opt i =c max i fori∈S p (c max ). For i∈ S i (c max ), Algorithm 4 uses binary search to find the minimum coverage proba- bility c i such that any further decrease 3 in coverage probability would add target i to the set S p (c opt ). The next lemma shows that this mechanism leads to the optimal strategy c opt . Lemma 4.5.8. Given a feasible reward r, Algorithm 4 returns the optimal strategy c opt . 3 δ can be arbitrarily small 45 Proof. Given a feasible reward r, we have S p (c opt ) = S p (c max ), S i (c opt ) = S i (c max ) according to Theorem 4.5.7. Denote the output of Algorithm 4 to be c. I first prove that c satisfies S p (c) = S p (c opt ), S i (c) = S i (c opt ) by proving S p (c) = S p (c max )andS i (c) =S i (c max ). ThestepsofAlgorithm4ensuresthatS i (c max )⊆S i (c), so we have S p (c)⊆ S p (c max ). Next I prove S p (c max )⊆ S p (c). ∀i∈ S p (c max ),∃ b U∈U such that b U a (i,c max ) ≥ b U a (j,c max ),∀j ∈ T,j 6= i. If i ∈ S i (c), for this mapping b U, ∃j ∈ S p (c) ⊆ S p (c max ) such that b U a (j,c) > b U a (i,c). According to the steps of Algorithm4,sincei,j∈S p (c max ),c max i =c i andc max j =c j ,wehave b U a (j,c) > b U a (i,c) = b U a (i,c max )≥ b U a (j,c max ) = b U a (j,c) , whichleadstoacontradiction. Thusi∈S p (c) and S p (c max )⊆S p (c). So S p (c) =S p (c max ) =S p (c opt ) and S i (c) =S i (c max )=S i (c opt ). Next I prove that the strategy c has reward at least r. ∀i∈ S p (c), i∈ S p (c max ). According to the steps of Algorithm 4 and Algorithm 3, U d (i,c) =U d (i,c max )≥r. Finally, I prove c = c opt . If c6= c opt , c opt consumes fewer resources than c, so P i∈T c opt i < P i∈T c i . According to Lemma 4.5.5, c opt i = c i ,∀i∈ S p (c opt ), so there exists at least one target i∈ S i (c opt ) with c opt i < c i . However, Algorithm 4 is designed so that any further decrease in c i would lead to a mapping b U ∈U such that b U a (i,c)≥ b U a (j,c),∀j∈S p (c opt ). Thus at least one target in S i (c opt ) would be added to S p (c opt ), which contradicts the assumption that c opt is the optimal solution. 
I conclude c = c opt . Theorem 4.5.9. Given reward r, Algorithm 2 either detects its infeasibility or provides the optimal strategy c opt . Proof. Follows from Lemmas 4.5.2 and 4.5.8. 46 4.6 Discussions 4.6.1 Computational Cost of BeRRA The main computational cost of my BeRRA algorithm comes from evaluating the fea- sibility of the linear constraints (4.3), which is a linear feasibility problem and can be solved in polynomial time. Algorithm 2 is calledO(log( 1 ǫ )) times, and every call to Algo- rithm 2 involves solving Problem (4.3)O(n+|S i (c max )|log( 1 δ )) times, which is bounded byO(nlog( 1 δ )). Thus Problem (4.3) is solvedO(nlog( 1 ǫ )log( 1 δ )) times in total for my BeRRA algorithm. 4.6.2 Extensions of BeRRA to General Risk Awareness Notice that we only requireU to be increasing in the preceding proofs and algorithms. Thus, my GMOP algorithm can also be used to compute the optimal robust strategy against other kinds of risk-aware attacker types, e.g., risk-seeking criminals [7,14]. Be- sides,theattackermayberisk-averseforgainsandrisk-seekingforlosses(S-shapedutility mappingfunctionasisshownintherightfigure)asissuggestedinthenobel-prize-winning prospect theory (PT) 4 model [31,32]. Here I use risk-seeking as an example to show how to apply GMOP to other attacker types. If the attacker is risk-seeking, b U should be a strictly increasing, convex function and satisfies b U(0) = 0. Therefore, when adapting my GMOP algorithm to deal with risk-seeking attackers, the only difference is in testing feasibility of constraints (4.3), where the condition b U∈U in constraints (4.3) should be written as: 4 PT also involves a mapping of the probability. I don’t consider this factor in this thesis. 47 ǫ u ≤ b U(θ 2 )− b U(θ 1 ) θ 2 −θ 1 ≤ b U(θ 3 )− b U(θ 2 ) θ 3 −θ 2 ≤...≤ b U(θ I )− b U(θ I−1 ) θ I −θ I−1 b U(0) =0 (4.9) Figure 4.1: Prospect Theory 4.6.3 Zero-sum Game When the game is a zero-sum game, the utilities for the defender and the attacker are strongly correlated with correlation coefficient−1, i.e., U c a (i) =−U c d (i) and U u a (i) = −U u d (i),∀i∈T. Based on this property, I obtain the following theorem. Theorem 4.6.1. For zero-sum games, the defender’s Robust Stackelberg Equilibrium (RSE) strategy and Maximin strategy are the same. Proof. I first prove that given a defender’s strategy c, the defender’s reward is the same in both settings. Given the defender’s strategy c, the defender’s reward in the Maximin setting is min j∈T U d (j,c),whichisalowerboundonthedefender’srewardintheRobustStackelberg game setting since min j∈T U d (j,c) is the minimum reward the defender can possibly 48 achieve with c. Meanwhile, since the risk-neutral case is a special case of the risk-averse case, i = argmin j∈T U d (j,c) = argmax j∈T U a (j,c)∈ S p (c). Thus, the attacker might attack target i so that the expected reward the defender achieves if the attacker attacks target i — min j∈T U d (j,c) is also an upperboundon thedefender’sreward in theRobust Stackelberg game setting. Therefore, the defender’s reward in the Robust Stackelberg game setting is min j∈T U d (j,c). Sincethedefender’srewardgiventhedefender’sstrategycisthesameinbothsettings, thestrategycthat maximizes thedefender’srewardinbothsettings is alsothesame. It is known that the solution concepts of Nash Equilibrium, minimax, maximin, and SSE all give the same answer for finite two-person zero-sum games. Therefore, Theorem 4.6.1 adds RSE to this equivalence list. 
4.7 Experimental Evaluation

I evaluate the performance of my algorithms in this section through extensive numerical experiments. Unless otherwise stated, all experimental results are averaged over 20 instances. U^c_d(i) and U^u_a(i) are generated as random variables between 11 and 40; U^u_d(i) and U^c_a(i) are generated as random variables between -11 and -40. To generate payoff matrices with correlation between the defender's and the attacker's utilities, I set U^c_a(i) <- -α·U^c_d(i) + sqrt(1-α²)·U^c_a(i) and U^u_a(i) <- -α·U^u_d(i) + sqrt(1-α²)·U^u_a(i), where -α is the correlation coefficient between U^c_a(i) (U^u_a(i)) and U^c_d(i) (U^u_d(i)). α = 1 corresponds to zero-sum games. n is the number of targets in the game and m is the number of resources the defender has.

4.7.1 MIBLP vs BeRRA

I first compare the performance of the MIBLP formulation and my BeRRA algorithm. Due to the limited scalability of the MIBLP, I only compare the cases n = 2 and n = 3, with m set to 1. The KNITRO solver is used to solve the MIBLP formulation.

MIBLP vs BeRRA: Solution Quality  The solution quality of the MIBLP formulation and my BeRRA algorithm is shown in Table 4.2. The BeRRA algorithm achieves a higher average reward than MIBLP, and the gap grows as the number of targets n increases. This is because KNITRO only finds a locally optimal solution while my BeRRA algorithm finds the globally ǫ-optimal solution, and a larger game scale leads to worse solution quality for the local optimum.

Table 4.2: MIBLP vs BeRRA in Solution Quality

  (a) n = 2
      α      MIBLP   BeRRA
      0.0     3.18    3.41
      0.2     2.78    2.99
      0.4     2.45    2.62
      0.6     1.72    1.82
      0.8     0.75    0.81
      1.0     0.53    0.53

  (b) n = 3
      α      MIBLP   BeRRA
      0.0    -5.69   -4.60
      0.2    -5.71   -5.32
      0.4    -6.75   -5.84
      0.6    -6.92   -6.31
      0.8    -7.24   -6.96
      1.0    -7.64   -7.64

MIBLP vs BeRRA: Runtime  The runtime of the MIBLP formulation and my BeRRA algorithm is shown in Table 4.3. BeRRA is much faster than MIBLP, and the gap grows as the number of targets n increases. This is because solving an MIBLP is hard and its computational cost increases exponentially with the problem size, while BeRRA only requires solving O(n log(1/ǫ) log(1/δ)) linear feasibility problems. MIBLP takes about 15 minutes even for the trivial case n = 3, which means it cannot scale up at all.

Table 4.3: MIBLP vs BeRRA in Runtime (s)

  (a) n = 2
      α      MIBLP   BeRRA
      0.0     70.4    0.95
      0.2     71.2    0.95
      0.4     72.0    0.94
      0.6     68.9    0.77
      0.8     73.4    0.48
      1.0     64.4    0.20

  (b) n = 3
      α      MIBLP   BeRRA
      0.0    863.1    1.75
      0.2   1004.4    1.63
      0.4    958.8    1.53
      0.6    886.9    1.36
      0.8   1119.8    1.10
      1.0    859.3    0.27

Runtime of BeRRA  Figure 4.2 further analyzes the runtime of my BeRRA algorithm. m is set to n/2 and all results are averaged over 100 instances. The runtime increases almost linearly with the number of targets n, and a game with 50 targets takes only about 2 minutes to solve, which demonstrates BeRRA's ability to scale up to larger problems. The figure also shows that the runtime does not change significantly with different α, but it drops sharply when α is increased from 0.9999 to 1. This is because |S_i(c^max)| is almost 0 in zero-sum games.
[Figure 4.2: Runtime of BeRRA — runtime (s) versus number of targets (10-50) for α = 0, 0.2, 0.4, 0.6, 0.8, 0.9999, and 1.]

4.7.2 Performance Evaluation of RSE Strategy against Risk-averse Attackers

In this section, I evaluate the solution quality of the RSE strategy in detail. Since BeRRA shows advantages in both solution quality and runtime compared with the MIBLP formulation, I use BeRRA to evaluate the performance of the RSE strategy.

Solution Quality in Worst Case  Figures 4.3(a) and 4.3(b) show the solution quality of the RSE strategy in the worst case — the attacker attacks target i = argmin_{j∈S_p(c)} U_d(j, c). I compare its performance with the SSE strategy, the Maximin strategy, and the robust strategy against interval uncertainty of U^c_a(i) and U^u_a(i) [19]. For the values of the intervals, I tried different intervals ranging from 1 to 20 and picked the best one among them.

Figure 4.3(a) shows how the performance comparison changes with different numbers of resources m. The RSE strategy significantly outperforms all of the other strategies. Since the robust strategy against interval uncertainty considers some type of "robustness", it outperforms the SSE strategy and the Maximin strategy. However, since interval uncertainty does not fully capture the risk aversion of the attacker, it is worse than the RSE strategy. The Maximin strategy is more conservative than the SSE strategy, leading to better performance compared with SSE.

Figure 4.3(b) shows the performance comparison with different α. It shows the same pattern, RSE > Interval Uncertainty > Maximin > SSE, as in Figure 4.3(a). Another observation is that the difference between these four strategies shrinks as α increases, and when α = 1 these four strategies perform the same, as is proved in Theorem 4.6.1. [Footnote 5: Interval Uncertainty = Maximin can be proved with techniques similar to those in the proof of Theorem 4.6.1.]

[Figure 4.3: Solution Quality of RSE in Worst Case — average reward versus number of resources ((a) α = 0, n = 50) and versus α ((b) m = 20, n = 50) for SSE, Maximin, Interval Uncertainty, and RSE.]

Solution Quality in Average Case  Figures 4.4(a) and 4.4(b) show the solution quality of the RSE strategy in the average case — the attacker randomly attacks a target i in S_p(c). I explore this case since unknown risk-averse attackers in the real world would not necessarily minimize the defender's reward. The performance comparison of these four strategies in the average case shows patterns similar to those in the worst case. Thus, even in the average case, the RSE strategy still performs best among them.

[Figure 4.4: Solution Quality of RSE in Average Case — average reward versus number of resources ((a) α = 0, n = 50) and versus α ((b) m = 20, n = 50) for SSE, Maximin, Interval Uncertainty, and RSE.]

"Price" of Being Robust  Figure 4.5 compares the three strategies (SSE, Maximin, RSE) when the attacker is risk-neutral and when the attacker is arbitrarily badly risk-averse. SSE is the optimal strategy when the attacker is risk-neutral and RSE is the optimal strategy when the attacker is arbitrarily badly risk-averse. We can see from the figure that the performance of the SSE strategy drops significantly if it wrongly estimates the attacker type (SSE-averse is very bad). However, the performance of the RSE strategy is only a little worse than that of the SSE strategy even if the attacker is risk-neutral (compare with the bad performance of SSE-averse).
This figure shows that, compared with the significant loss from wrongly estimating the attacker type, the loss from executing the robust strategy — the "price" of being robust — is acceptable.

[Figure 4.5: "Price" of Being Robust — average reward versus number of resources for SSE-averse, Maximin-averse, Maximin-neutral, RSE-averse, RSE-neutral, and SSE-neutral.]

4.7.3 Performance Evaluation of RSE Strategy against Other Attacker Types

In this section, I evaluate the solution quality of the RSE strategy against risk-seeking attackers and prospect theory attackers (S-shaped utility mapping function) in detail. Since the BeRRA algorithm shows advantages in both solution quality and runtime compared with MIBLP, I use BeRRA to evaluate the performance of the RSE strategy.

[Figure 4.6: Solution Quality of RSE for Risk-aware Attackers — average reward versus number of resources and versus α: (a) seeking, α = 0, n = 50; (b) seeking, m = 20, n = 50; (c) PT, α = 0, n = 50; (d) PT, m = 20, n = 50 (comparing SSE, Maximin, Interval Uncertainty, RSE); (e) PT, α = 0, n = 20; (f) PT, m = 10, n = 20 (comparing BRPT, SSE, RPT, RSE).]

Figures 4.6(a) and 4.6(b) show the solution quality of the RSE strategy against risk-seeking attackers. I compare its performance with the SSE strategy, the Maximin strategy, and the robust strategy against interval uncertainty of U^c_a(i) and U^u_a(i) [19]. For the values of the intervals, I tried different intervals ranging from 1 to 20 and picked the best one.

Figure 4.6(a) shows how the performance comparison changes with different numbers of resources m. The RSE strategy significantly outperforms all of the other strategies. Since the robust strategy against interval uncertainty considers some type of "robustness", it outperforms the SSE strategy and the Maximin strategy. However, since interval uncertainty does not fully capture the risk awareness of the attacker, it is worse than the RSE strategy. The Maximin strategy is more conservative than the SSE strategy, leading to better performance compared with SSE.

Figure 4.6(b) shows the performance comparison with different α. The main observation is that the difference between these four strategies shrinks as α increases, and when α = 1 these four strategies perform the same, as is proved in Theorem 4.6.1.

Figures 4.6(c) and 4.6(d) show the solution quality of the RSE strategy against unknown prospect-theory attackers (S-shaped utility mapping function). They show patterns similar to the risk-seeking case.

Figures 4.6(e) and 4.6(f) show the performance comparison of the RSE strategy with the BRPT and RPT algorithms [59] against unknown prospect theory attackers. The RSE strategy computes the robust strategy against any attacker with an S-shaped utility mapping (risk-averse for gains and risk-seeking for losses). The BRPT algorithm computes the defender's best response against one specific utility mapping function. The RPT algorithm adds some "robustness" against interval uncertainty on top of the BRPT algorithm. We can see from Figures 4.6(e) and 4.6(f) that the performance of BRPT and RPT is similar to that of SSE, and is much worse than that of the RSE strategy.
The reason is that both BRPT and RPT fail to capture the uncertainty in the degree of risk-awareness. 6 The proof of Interval Uncertainty = Maximin is similar to that of Theorem 4.6.1. 56 Chapter 5 Learning Attacker’s Preference — Payoff Modeling Insomesecurity domains, e.g., wildlifeprotection domain, theattacker’s frequentattacks provide the defender with the opportunity to learn about the attacker’s payoffs by ob- serving the attacker’s behavior. I begin with the assumption that at every round, the defenderchoosesonetarget toprotect andtheattacker simultaneouslychooses onetarget to attack. Both the attacker and the defender have full knowledge about each other’s previous actions (I will discuss my model and assumptions in more detail in Section 5.1). My workfocusesonconstructinganonlinepolicyforthedefendertomaximize herutility given observations of the attacker. In this work, I model these interactions between the defender and the attacker as a repeated game. I then adopt a fixed model for the attacker’s behavior and recast this repeated game as a partially observable Markov decision process (POMDP). However, my POMDP formulation has an exponential number of states, making current POMDP solvers like ZMDP [54] and APPL [25] infeasible in terms of computational cost. Silver and Veness [53] have proposed the POMCP algorithm which achieves a high level of performance in large POMDPs. It uses particle filtering to maintain an approximation 57 of the belief state of the agent, and then uses Monte Carlo Tree Search (MCTS) for online planning. However, the particle filter is only an approximation of the belief state. By appealing to the special properties of my POMDP, I propose the GMOP algorithm (Gibbs sampling based MCTSOnlinePlanning) which draws samples directly from the exact belief state using Gibbs sampling and then runs MCTS for online planning. My algorithm provides higher solution quality than the POMCP algorithm. Additionally, for a specific subclass of my game with an attacker who plays a best response against the defender’sempirical distribution, andauniformpenalty of beingseized across all targets, I provide an advanced sampling technique to speed up the GMOP algorithm along with a heuristic that trades off solution quality for lower computational cost. Moreover, I explore the case of continuous utilities where my original POMDP formulation becomes a continuous-state POMDP, which is generally difficult to solve. However, the special properties in the specific subclass of game mentioned above make possible the extension of the GMOP algorithm to continuous utilities. Finally, I explore the more realistic scenario where the defender is not only uncertain about the distribution of resources, but also uncertain about the attacker’s behavioral model. I address this challenge by extending my POMDP formulation and the GMOP algorithm. The rest of the chapter is organized as follows: Section 5.1 discusses my model and POMDP formulation. Section 5.2 presents the GMOP algorithm in the basic scenario. Section 5.3 speeds up the GMOP algorithm by exploring special structure when the attacker plays a best response against the defender’s empirical distribution. Section 5.4 extends my GMOP algorithm to the continuous utility scenario and Section 5.5 extends 58 theGMOP algorithm to thesituation wheretheattacker model is also unknown. Section 5.6 provides extensive experimental results. 
5.1 Model 5.1.1 Motivating Domain Myworkismotivated bythedomainofresourceconservation, forexample, illegal fishing, illegal oil extraction, water theft, crop theft, and illegal diamond mining, etc. In each case, illegal extractions happen frequently and the resources are spread over a large area that is impossible for the defender to cover in its entirety. Inthis model, I make theassumptionthat thedefenderand theattacker fullyobserve their opponent’s actions. The defender is usually a powerful government agency that has access to satellite imaging, multiple patrol assets, and the reports of local residents. The attacker learns about law enforcement tactics by exchanging information internally, covert observation, and by buying information from other sources. Theflagshipexampleisthereal-worldproblemfacedbytheU.S.CoastGuard(USCG) in the Gulf of Mexico of illegal fishing by fishermen from across-the-border. In this do- main, thedefender(USCG) performsdaily aircraft patrol surveillance 1 ; satellites arealso used to monitor illegal fishing 2 . Furthermore, illegal fishermen have well-organized sup- portfromacross-the-border;USCGprovidedevidencethatfishermenperformsurveillance on USCG boats. 1 http://www.uscg.mil/d8/sectCorpusChristi/ 2 http://wwf.panda.org/?206301/WWF-new-approach-to-fight-illegal-unreported-and-unregulated- fishing 59 5.1.2 Formal Model I now formalize the preceding story into a two-player repeated game between a defender and an attacker 3 . While both players are assumed to be humans or human organizations inthisgame,Iassumethatthedefenderisaidedbymydecisionaidbuttheattackerisnot. In my model, the amount of resources at each target will be fixed and the attacker will have full knowledge of this distribution. The defender will have to learn this distribution by observing the attacker’s behavior. I operate over the finite time horizon t∈T,{1,...,T}. Thereare n targets indexed byN,{1,2,...,n} that represent the locations of the natural resource in question: the attacker wants to steal resources from these targets and the defender wants to interdict theattacker. I represent thevalue of thetargets tothe attacker in terms of their utilities. Each target has a utility u(i) that is only known to the attacker. The utility space is discretized into m levels, u(i)∈ M ,{1,2,...,m}. Human beings cannot distinguish betweentiny differencesinutilities inthereal world, soIamjustifiedindiscretizingthese utilities. For n targets and m utility levels, there are m n possible sets of utilities across all targets. The distribution of resources is then captured by the vector of utilities at each target, and the set of possible resource distributions is: U,{(u(1),u(2),...,u(n)) : u(i)∈M,∀i∈N}=× i∈N M. (5.1) 3 Note that in this work I have begun with the assumption that there is a single extractor, and this already leads to very significant research challenges, needing significant research contributions, that I provide in this work. Generalizing to multiple extractors is clearly an important issue — that although will require me to scale-up my algorithm, fits naturally within my algorithmic framework – but it is left for future work. 60 At the beginning of the game, the defender may have some prior knowledge about the resource levels u(i) at each target i∈ N. This prior knowledge is represented as a probability density function p(u(i)) over M. If the defender does not know anything about u(i), then I adopt a uniform prior for u(i) overM. 
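As a concrete anchor for this notation, here is a minimal sketch of the setup; the sizes, seed, and variable names are illustrative, not part of the model.

import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 5                              # illustrative numbers of targets and utility levels
M = np.arange(1, m + 1)                  # discretized utility levels M = {1, ..., m}
prior = np.full((n, m), 1.0 / m)         # prior[i, k] = p(u(i) = M[k]); a uniform row encodes
                                         # "the defender knows nothing about target i"
true_u = rng.choice(M, size=n)           # hidden utilities, known only to the attacker
sampled_u = np.array([rng.choice(M, p=prior[i]) for i in range(n)])  # one draw from the prior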
At each time t∈T, the defender chooses a target a t ∈N to protect and the attacker simultaneously chooses a target o t ∈ N from which to steal. If a t = o t , the defender catches the attacker and the attacker is penalized by the amount P(o t ) < 0; if a t 6= o t , the attacker successfully steals resources from target o t and gets a payoff of u(o t ). For clarity, thedefender’sinterdictionisalwayssuccessfulwhenevershevisitsthesamesiteas theattacker. Additionally, thedefenderfullyobservesthemoves oftheattacker, likewise, theattacker fullyobserves themoves of thedefender. Note that thepenalty P(i),i∈Nis known to both the defender and the attacker. I adopt a zero-sum game, so the defender is trying to minimize the attacker’s payoffs. In most resource conservation domains, the attacker pays the same penalty P if he is seized independent of the target he visits. I allow for varying penalties across targets for greater generality. In this work, I make the basic assumption that the attacker is more likely to steal from targets with higher utilities u(i), lower penalties P(i), and that have not been visited often by the defender. Based on this assumption, I assume that the attacker’s actions depend on u(i),P(i),i∈N along with the defender’s actions in previous rounds. A reasonable assumption about the attacker’s behavioral model is the fictitious Quantal Response(FQR)model. Specifically,afictitiousattackerassumesthedefender’sempirical distribution will behis mixed strategy in the next round, and quantal response(QR) [31] 61 has been shown to be effective in describing human’s behavior through human subject experiments. For the FQR model, the attacker’s behavior could be described in the following way: in every round, he (i) computes the empirical coverage probability c i for every target i basedon thehistoryof thedefender’sactions; (ii) computes theexpected utility EU(i) = c(i)P(i)+(1−c(i))u(i) for every target; (iii) attempts to steal from the target i with the probability p(i) proportional to e λEU(i) : p(i) = e λEU(i) P j∈N e λEU(j) , whereλ≥0istheparameterrepresentingtherationalityoftheplayer(higherλrepresents a more rational player). 5.1.3 Protector’s POMDP Formulation ToimplementthemodelfromSection 5.1.2, Imustresolve twotechnical questions. First, at every round t, based on her current belief about u, how should the defender choose targets to protect in the next round? Second, after each round, how should the defender use the observation of the latest round to update her beliefs about u? I am studying decision making and belief updating in a partially observable environment where the payoffs u are unobservable and the attacker’s actions are observable, which is the exact setup for a POMDP. I now setup my two-player game as a POMDP{S,A,O,T,Ω,R} where the attacker follows a quantal response model. 62 State space: The state space of my POMDP isS=U×Z n , which is the cross product of the utility space and the count space. U is the utility space as defined in Equation 5.1. Z n is the set of possible counts of the defender’s visits to each target, where C t ∈Z n is an integer-valued vector where C t (i),i∈N is the number of times that the defender has protected target i at the beginning of round t∈ T. A particular state s∈S is written as s = (u,C), where u is the vector of utility levels for each target and C is the current state count. The initial beliefs are expressed by a distribution over s = (u,0), induced by the prior distribution on u. 
I define c t (i), Ct(i) t−1 to be the frequency with which the defender visits target i at the beginning of round t∈T. I set c 1 ,0 by convention. Action space: The action spaceA isN, representing the target the defender chooses to protect. Observation space: The observation spaceO isN, representing the target from which the attacker attempts to steal. Conditional transition probability: Let e a ∈R n denote the unit vector with a 1 in slot a∈N and zeros elsewhere. The conditional transition probability T governing the evolution of the state is T s ′ = u ′ , C ′ |s=(u,C), a = 1, u=u ′ , C ′ =C +e a , 0, otherwise. Specifically, the evolution of the state is deterministic. The underlying utilities do not change, and the count for the target visited by the defender increases by one while all others stay the same. 63 Conditional observation probability: I define EU(u,C)∈ R n to be the vector of empirical expected utilities for theattacker forall targets whentheactual utility is u and the count is C, [EU(u,C)](i) =c(i)P(i)+(1−c(i))u(i),∀i∈N, when t≥ 1. I set [EU(u, 0)](i) = u(i) by convention. Hence, the observation probabili- ties Ω are explicitly Ω(o|s ′ =(u,C),a) = e λ[EU(u,C−ea)](o) P i∈N e λ[EU(u,C−ea)](i) , the probability of observing the attacker takes action o when the defender takes action a and arrives at state s ′ . Note that both a and o are the actions the defender/attacker take at the same round. Reward function: The reward function R is R s=(u,C),s ′ =(u,C +e a ),a,o = −P(o), a=o, −u(o), a6=o. 5.2 GMOP Algorithm In Section 5.1, I modeled the repeated game as a POMDP in order to update the de- fender’s beliefs about the resource distribution and to allocate patrol assets. However, the size of the utility spaceU is m n , and the size of the count space isO( T n n! ). The com- putational cost of the latest POMDP solvers such as ZMDP and APPL soon becomes unaffordable as the problem size grows. For a small instance like n = 4, m = 5 and 5 64 rounds,thereare78750statesinthePOMDP.BoththeZMDPandAPPLsolversrunout of memory when attempting to solve this POMDP. This challenge is non-trivial because the models in reality are much larger than this toy example. Silver and Veness [53] have proposed the POMCP algorithm, which provides high quality solutions for large POMDPs. The POMCP algorithm uses a particle filter to approximate the belief state. Then, it uses Monte Carlo tree search (MCTS) for online planningwhere(i)statesamplesaredrawnfromtheparticlefilterand(ii)theactionwith the highest expected utility based on Monte Carlo simulations is chosen. However, the particlefilterisonlyanapproximationofthetruebeliefstateandislikely tomovefurther away from the actual belief state as the game goes on, especially when most particles get depleted and new particles need to be added. Adding new particles will either (i) make the particle filter a worse approximation of the exact belief state, if the added particles do not follow the distribution of the belief state or (ii) be as difficult as drawing samples directly from thebeliefstate, if theadded particles dofollow thedistributionof thebelief state. However, if we could efficiently draw samples directly from the exact belief state, then there would be no need to use a particle filter. This POMDP has specific structure that we can exploit. The count state in S is known and the utility state does not change, making it possible to draw samples directly from the exact belief state using Gibbs sampling. 
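The quantity that all of the later exact-belief computations multiply together is the observation probability Ω under the FQR model; a minimal sketch (with illustrative names, not the thesis's code) is given below.

import numpy as np

def fqr_attack_probs(u, C, P, lam):
    """Attack distribution of an FQR attacker reacting to visit counts C:
    a softmax, with rationality lam, of the empirical expected utilities EU(u, C)."""
    u, C, P = (np.asarray(x, dtype=float) for x in (u, C, P))
    total = C.sum()                          # equals t - 1 at the start of round t
    if total == 0:
        eu = u.copy()                        # convention: [EU(u, 0)](i) = u(i)
    else:
        cov = C / total                      # empirical coverage frequencies c(i)
        eu = cov * P + (1.0 - cov) * u       # [EU(u, C)](i)
    w = np.exp(lam * (eu - eu.max()))        # numerically stabilized softmax
    return w / w.sum()

def omega(o, u, C_after, a, P, lam):
    """Omega(o | s' = (u, C_after), a): the defender has just played a, so the
    attacker reacted to the counts C_after - e_a."""
    C_before = np.asarray(C_after, dtype=float).copy()
    C_before[a] -= 1.0
    return fqr_attack_probs(u, C_before, P, lam)[o]

Both the exact belief update and the Gibbs conditional derived below are simply products of these omega values over the observed history.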
I propose the GMOP algorithm that draws samples directly from the exact belief state using Gibbs sampling, and then runs MCTS. The samples drawn directly from the belief state better represent the true belief state comparedtosamples drawnfromaparticle filter. Ithusconjecturethat theGMOP 65 algorithm will yield higher solution quality than the POMCP algorithm for the problem, and this intuition is confirmed in the experiments. 5.2.1 GMOP Algorithm Framework TheGMOP algorithm is outlinedinAlgorithm 5. At ahighlevel, inroundtthedefender draws samples of state s from its belief state B t (s) usingGibbs samplingand then it uses MCTS to simulate what will happen in the next few rounds with those samples. Finally, it executes the action with the highest average reward in the MCTS simulation. MCTS starts with a tree that only contains a root node. Since the count state C t is already known, the defender only needs to sample the utility state u from B t . The sampled state s is comprised of the sampled utility u and the count C t . Algorithm 5 GMOP Algorithm Framework 1: function Play(C t ) 2: Initialize Tree 3: for i =1→numSamples do 4: u← GibbsSampling 5: Simulate(s=(u,C t )) 6: end for 7: a t ← action with the highest average reward 8: end function It has been shown that the UCT algorithm converges to the optimal value function in fully observable MDPs [23]. Based on this result, Silver and Veness have established the convergence of MCTS in POMDP online planning as long as the samples are drawn from the true belief state B t (s). It follows that the convergence of the GMOP algorithm is guaranteed. From Algorithm 5, we see that each iteration of the GMOP algorithm is composed of two parts: GIBBSSAMPLING which draws samples u directly from B t (u) using Gibbs 66 sampling, and SIMULATE which does Monte Carlo simulation of the sampled states s=(u,C t ) tofindthe“best”action toexecute. Thesamplingtechniquewillbediscussed in detail in Section 5.2.2 while the details of MCTS for POMDP are available in [53]. 5.2.2 Drawing Samples 5.2.2.1 Gibbs Sampling Overview Gibbs sampling[9] is a Markov chain Monte Carlo(MCMC) algorithm for samplingfrom multivariate probability distributions. Let X = (x 1 ,x 2 ,...,x n ) be a general random vector with n components and with finite support described by the multivariate proba- bility density p(X). Gibbs sampling only requires the conditional probabilities p(x i |x −i ) to simulate X, where x −i = (x j ) j6=i denotes the subset of all components of X except component i. Gibbs sampling is useful when direct sampling from p(X) is difficult. Suppose we want to obtain k samples of X = (x 1 ,x 2 ,...,x n ). Algorithm 6 shows howGibbssamplingworksingeneral toproducethesesamplesusingonlytheconditional probabilities p(x i |x −i ). It constructs a Markov chain whose steady-state distribution is given by p(X), so that the samples we draw also follow the distribution p(X). The states of this Markov chain are the possible realizations of X = (x 1 ,x 2 ,...,x n ), and a specific state X i is denoted as X i = (x i1 ,x i2 ,...,x in ) (there are finitely many such states by my assumption). The transition probabilities of this Markov chain, Pr(X j |X i ), follow from the conditional probabilities p(x i |x −i ). Specifically, Pr(X j |X i ) = p(x l |x −l ) when x jv =x iv for all v not equal to l, and is equal to zero otherwise, i.e., the state transitions only change one component of the vector-valued sample at a time. 
This Markov chain is 67 reversible (meaning p(X i )Pr(X j |X i )=p(X j )Pr(X i |X j ),∀i,j) so p(X) is its steady-state distribution. Algorithm 6 Gibbs Sampling 1: Initialization: X ={x 1 ,x 2 ,...,x n } satisfying p(X)>0 2: for i =1→ k do 3: for j =1→n do 4: x j ∼p(x j |x −j ) 5: end for 6: X i ←{x 1 ,x 2 ,...,x n } 7: end for 5.2.2.2 Applying Gibbs Sampling in GMOP I let B t be the probability distribution representing the defender’s beliefs about the true utilities at the beginning of round t≥ 1; B 1 represents the defender’s prior beliefs when the game starts. I adopt the notation B t (u) to denote the probability of the vector of utilities u with respect to the distribution B t . Let B be the prior belief distribution and B ′ be the posterior belief distribution. The Bayesian belief update rule to obtain B ′ from B and the observation is explicitly B ′ (s ′ =(u,C)) =ηΩ(o|s ′ ,a) X s∈S T(s ′ |s,a)B(s) =ηΩ(o|s ′ ,a)B(s=(u,C−e a )). 68 If a t and o t represent the actions that the defender and the attacker choose to take at round t, we have B t (u) =ηB t−1 (u)Ω(o t−1 |s=(u,C t ),a t−1 ) =η ′ B 1 (u)Π t−1 i=1 Ω(o i |s=(u,C i+1 ),a i ). (5.2) It follows that the posterior belief B t is proportional to the prior belief B 1 multiplied by the observation probabilities over the entire history. Since there are m n possible utilities, it is impractical to store and update B t when m and n are large, and thus it is impossible to sample directly from B t . Hence, I turn to Gibbs sampling, where we only need the conditional probabilities p(u i |u −i ),∀i∈N p(u i |u −i ) =ηp(u i ,u −i ) =ηB t (u i ,u −i ) =η ′ B 1 (u i ,u −i )Π t−1 j=1 Ω(o j |s=(u=(u i ,u −i ),C j+1 ),a j ) =η ′ B 1 (u i )B 1 (u −i )Π t−1 j=1 Ω(o j |s=(u =(u i ,u −i ),C j+1 ),a j ) =η ′′ B 1 (u i )Π t−1 j=1 Ω(o j |s=(u =(u i ,u −i ),C j+1 ),a j ). (5.3) This quantity is easy to compute where B 1 (u i ) is the prior probability that target i has utility u i . In this way, we are able to draw samples directly from the exact belief state in this POMDP using Gibbs sampling. Thus, the GMOP algorithm has much better solution quality compared with the POMCP algorithm which draws samples from the approximate belief state maintained by particle filter. Besides the conditional probability, we also need to find a valid u with B t (u) > 0 to initialize Gibbs sampling as shown in Line 1 of Algorithm 6. Finding such a u is 69 easy in my FQR model because any u with B 1 (u) > 0 satisfies B t (u) > 0 since B t (u) = η ′ B 1 (u)Π t−1 i=1 Ω(o i |s = (u,C i+1 ),a i ) and Ω(o i |s = (u,C i+1 ),a i ) > 0,∀i = 1,2,...,t−1. In other behavior models, where finding a valid u is not so intuitive, one possibility is to check the sampled utilities at the latest round to pick a valid one. 5.3 Fictitious Best Response In this section, I focus attention on a limiting case of the FQR model, a fictitious best response playing (FBR) attacker. An FBR attacker plays a best response against the empirical distribution of the defender and breaks ties randomly, a similar assumption is found in [26,30]. Additionally, I assume that all targets share the same penalty P. This assumption is satisfied in most resource conservation games. We will see that these two assumptions allow us to greatly speed up the GMOP algorithm. I also put forward a computationally inexpensive heuristic that offers high quality solutions. 
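Before specializing to the FBR attacker, it helps to see the step that the rest of this section accelerates: the generic Gibbs update of Section 5.2.2.2, which resamples one coordinate of u from the conditional in Equation (5.3) by replaying the whole history. The sketch below is illustrative; omega_fn is any observation model with signature omega_fn(o, u, C_after, a) (for instance the FQR sketch above with P and λ bound), and its per-round replay is what makes this step cost grow linearly with t.

import numpy as np

def gibbs_resample_target(k, u, history, prior_k, levels, omega_fn, rng):
    """One Gibbs step (Equation (5.3)): resample u[k] from p(u_k | u_-k), which is
    proportional to the prior of u_k times the product of observation probabilities
    over all past rounds.  history is a list of (a_j, o_j, C_{j+1}) tuples and
    prior_k[j] = B_1(u_k = levels[j])."""
    u = np.array(u, dtype=float)
    weights = np.empty(len(levels))
    for v_idx, v in enumerate(levels):       # try every utility level for target k
        u[k] = v
        w = prior_k[v_idx]                   # B_1(u_k = v)
        for (a_j, o_j, C_next) in history:   # replay the whole observed history
            w *= omega_fn(o_j, u, C_next, a_j)
            if w == 0.0:
                break
        weights[v_idx] = w
    u[k] = rng.choice(levels, p=weights / weights.sum())
    return u

Under the FQR model every weight is strictly positive, so the normalization is always well defined; the FBR-specific machinery developed next replaces the inner replay loop with constant-time bookkeeping.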
When the attacker is FBR, the POMDP is roughly the same as in the FQR case except that the conditional observation probabilities Ω are now Ω(o|s ′ =(u,C),a) = 1 |A(u,C−ea)| , o∈A(u,C−e a ), 0, otherwise, (5.4) where A(u, C) is the set of targets with maximal empirical expected utility when the actual utility is u and the count is C, i.e., A(u, C)=argmax i∈N [EU(u, C)](i)⊂N. 70 The FBR attacker is actually a limiting case of the more general FQR model: we obtain thiscase by takingλ→∞. If werunthePOMCPalgorithm foran FBR attacker, the particles producedby the particle filter will be depleted very quickly and most utility states will take on probability 0 after only a few rounds. For example, if n=10 and the defender observes that the attacker visits target 3 in the first round, then approximately 90% of possible utility states take on probability 0. Compared with FQR, more new particlesmustbeaddedintheFBRcase. Thus,theparticlefilterisaworseapproximation of the belief state, leading to worse performance of the POMCP algorithm. 5.3.1 Speeding Up GMOP Gibbssamplingrequirescomputationoftheconditionalprobabilityp(u i |u −i )asdescribed in Equation 5.3. However, t grows as the game evolves and the computational cost in- creases linearlywitht. UndertheassumptionsofanFBRattacker anduniformpenalties, wecanuseanadvancedalgorithmtocomputep(u i |u −i )withcomputationalcostbounded by constant time. Define I t (i,j), I t−1 (i,j), i6=o t−1 , max{I t−1 (i,j), 1−c t−1 (j) 1−c t−1 (i) }, i =o t−1 , and I 1 (i,j) , 0,∀i,j ∈ N. The quantities I t (i,j) can be computed recursively from I t−1 (i,j) at very little computational cost. Intuitively, I t (i,j) maintains the minimum allowed ratio u(i)−P u(j)−P for any u satisfying Π t−1 j=1 Ω(o j |s = (u,C j+1 ),a j ) > 0 as the game evolves. By checking if u satisfies u(i)−P u(j)−P ≥ I t (i,j),∀i,j ∈ N, we can figure out if 71 Π t−1 j=1 Ω(o j |s = (u,C j+1 ),a j ) is equal to 0 or not. I then compute the exact value of Π t−1 j=1 Ω(o j |s =(u,C j+1 ),a j ) whenever this probability is not 0. Proposition 5.3.1. For a specific u, Π t−1 i=1 Ω(o i |s = (u,C i+1 ),a i ) > 0 ⇐⇒ u(i)−P u(j)−P ≥ I t (i,j),∀i,j∈N. Proof. FromEquation5.3and5.4,Π t−1 j=1 Ω(o j |s=(u,C j+1 ),a j )>0 ⇐⇒ o j ∈A(u,C j ),∀j∈ {1,2,...,t−1}. o∈A(u,C) ⇐⇒ [EU(u,C)](o)≥[EU(u,C)](i),∀i∈N c(o)P +(1−c(o))u(o)≥c(i)P +(1−c(i))u(i),∀i∈N u(o)−P u(i)−P ≥ 1−c(i) 1−c(o) ,∀i∈N ∀u that u(i)−P u(j)−P ≥ I t (i,j),∀i,j ∈ N, we have o j ∈ A(u,C j ),∀j ∈{1,2,...,t− 1} by the definition of I t (i,j); ∀u that∃i,j ∈ N u(i)−P u(j)−P < I t (i,j), for that i,j,∃ round k∈{1,2,...,t−1} that 1−c k (j) 1−c k (i) = I t (i,j) and i = o k , so we have o k / ∈ A(u,C k ) because u(i)−P u(j)−P < I t (i,j) = 1−c k (j) 1−c k (i) . Here I proved o j ∈ A(u,C j ),∀j ∈{1,2,...,t− 1} ⇐⇒ u(i)−P u(j)−P ≥I t (i,j),∀i,j∈N. By checking ifu satisfies u(i)−P u(j)−P ≥I t (i,j),∀i,j∈N, wecan figureout if Π t−1 j=1 Ω(o j |s= (u,C j+1 ),a j ) equals 0. I now explain how to compute Π t−1 j=1 Ω(o j |s = (u,C j+1 ),a j ) if it does not equal 0. Define V t (i),{k : o k = i,k∈{1,2,...,t−1}},∀i∈N to be the set of rounds where the attacker attempts to steal from target i; define V eq t (i,j),{k : j∈ A(u,C k ),k∈V t (i)},∀i,j∈Nto betheset ofroundswheretheattacker attempts to steal from target i, but where target j gives the attacker the same expected utility. I define 72 V neq t (i,j) ,{k : j / ∈ A(u,C k ),k∈ V t (i)},∀i,j∈ N to be the set of rounds where the attacker attempts to steal from target i and target j gives the attacker lower expected utility. 
Additionally, I define Tie t (i,j),{k :I t (i,j) = 1−c k (j) 1−c k (i) ,k∈V t (i)},∀i,j∈N Like I t (i,j), Tie t (i,j) can be computed recursively at very little cost. By definition, V eq t (i,j)∩V neq t (i,j) =φ, V eq t (i,j)∪V neq t (i,j) =V t (i) and Tie t (i,j)⊆V t (i),∀i,j∈N. Proposition 5.3.2. If u(i)−P u(j)−P = I t (i,j), V eq t (i,j) = Tie t (i,j), V neq t (i,j) = V t (i)− Tie t (i,j); If u(i)−P u(j)−P >I t (i,j), V neq t (i,j) =V t (i), V eq t (i,j) =φ. Proof. If u(i)−P u(j)−P =I t (i,j): ∀k∈V eq t (i,j), c k (i)P+(1−c k (i))u(i) =c k (j)P+(1−c k (j))u(j) sincei,j∈A(u,C k ), so 1−c k (j) 1−c k (i) = u(i)−P u(j)−P =I t (i,j), k∈Tie t (i,j). So we have V eq t (i,j)⊆Tie t (i,j) ∀k∈Tie t (i,j), u(i)−P u(j)−P =I t (i,j) = 1−c k (j) 1−c k (i) , so c k (i)P +(1−c k (i))u(i) =c k (j)P +(1− c k (j))u(j), j∈A(u,C k ). So we have Tie t (i,j)⊆V eq t (i,j). Till now I proved when u(i)−P u(j)−P =I t (i,j), V eq t (i,j) =Tie t (i,j), so V neq t (i,j) =V t (i)− Tie t (i,j) by definition. The proof when u(i)−P u(j)−P >I t (i,j) is similar so I omit it here. Algorithm 7 shows how my advanced sampling technique resamples u(k) from the conditional probabilities p(u i |u −i ) by using the quantities I t and Tie t . The input u is the current set of sampled utilities; k is the index of u to be resampled according to p(u k |u −k ); andI and‘Tie’arethelatest I and‘Tie’that havebeencomputedrecursively. 73 Iset#A(j) =|A(u,C j )|todenotethenumberofsitesthathavemaximalexpectedutility for the attacker at round j, and I initialize these quantities to be1 because o k ∈A(u,C k ) by definition. Then, I check every pair of targets i,j ∈ N: (i) if u(i)−P u(j)−P < I t (i,j), then I set B t (u) = 0 according to Proposition 5.3.1; (ii) if u(i)−P u(j)−P = I t (i,j), then I set V eq t (i,j) =Tie t (i,j) according to Proposition 5.3.2, and I increase #A(k) by 1 for those k∈ Tie t (i,j) because j∈ A(u,C k ),∀k∈ V eq t (i,j) = Tie t (i,j); (iii) if u(i)−P u(j)−P > I t (i,j), then V eq t (i,j) = φ according to Proposition 5.3.2, so I do nothing. After checking all pairs i,j∈N, I determine: (i) whether B t (u) =0 and (ii) #A(k),∀k∈{1,2,...,t−1} if B t (u)>0. Based ontheseevaluations, theconditional probability Prob=p(u i |u −i ) used to resample u(k) is computed according to Equation 5.3. Finally, Prob is normalized and then I sample the new u(k). Algorithm 7 Advanced Sampling Technique 1: function DrawSample(u,k,I,Tie) 2: Prob=B 1 (u k ) 3: for i =1→m do 4: u(k)← i 5: #A(j)←1,∀j =1→currRound−1 6: for p=1→n,q =1→ n do 7: if u(p)−k u(q)−k <I(p,q) then 8: Prob(i)←0 9: break 10: else if u(p)−k u(q)−k =I(p,q) then 11: #A(j)←#A(j)+1,∀j∈Tie(p,q) 12: end if 13: end for 14: if Prob(i)6=0 then 15: Prob(i)←Prob(i)∗Π currRound−1 j=1 1 #A(j) 16: end if 17: end for 18: Normalize Prob 19: u(k)∼Prob 20: return u 21: end function 74 5.3.2 Myopic Planning Heuristic For GMOP algorithm, larger sample sizes in MCTS leads to higher solution quality but at theexpenseofgreater computational cost. Somedomainsrequiredecisions tobemade very quickly, so the defender may get poor performance with the GMOP algorithm due to an insufficient number of samples. With this motivation, I provide a myopic planning heuristic. This heuristic offers slightly lower solution quality compared with GMOP, but costs much less computing time. 
The myopic planning heuristic works as follows: it (i) approximately computes the posterior marginal probabilities of all targets' utilities based on all previous observations; (ii) computes the expected u(i) for each target using the posterior marginal probabilities; and (iii) plans myopically, protecting the target with the highest estimated expected utility for the attacker based on the expected u(i) computed in step (ii) and the empirical visit counts C (ties are broken with even probabilities). The key issue lies in step (i), the computation of the posterior marginal probabilities of the utilities u(i). This step can be viewed as inference in a Bayesian network. An example Bayesian network with n = 4 is shown in Figure 5.1. Here, $u(i), \forall i \in N$ are treated as the unobserved random variables in the Bayesian network, and they have prior probabilities $B_1(u(i))$ for all $i \in N$. I define $f(u(i),u(j)), \forall i,j \in N$ to be observable binary random variables that depend on u(i) and u(j):

\[
f(u(i),u(j)) \triangleq \begin{cases} 1, & \frac{u(i)-P}{u(j)-P} \geq I_t(i,j)\ \text{and}\ \frac{u(j)-P}{u(i)-P} \geq I_t(j,i), \\ 0, & \text{otherwise}. \end{cases}
\]

Figure 5.1: Bayesian network when n = 4 (variable nodes u(1), u(2), u(3), u(4); factor nodes such as f(u(1),u(2)), f(u(1),u(3)), f(u(3),u(4))).

In the Bayesian network, we have observations $f(u(i),u(j)) = 1, \forall i,j \in N$, and the aim is to infer the posterior marginal probabilities of $u(i), \forall i \in N$. According to Proposition 5.3.1, these factor nodes $f(u(i),u(j)) = 1, \forall i,j \in N$ fully describe the conditions a specific u must satisfy in order to have a positive posterior probability. I then use the widely known belief propagation algorithm [43] for inference, which yields approximate marginal probabilities.

Note that this heuristic does not take into consideration possible ties in the attacker's decision-making. In particular, recall that $\frac{u(i)-P}{u(j)-P} = I_t(i,j)$ and $\frac{u(i)-P}{u(j)-P} > I_t(i,j)$ correspond to two different cases, as shown in Proposition 5.3.2, and they are treated separately in Algorithm 7. Yet the Bayesian network is unable to distinguish between these two cases, and it treats them both as $\frac{u(i)-P}{u(j)-P} \geq I_t(i,j)$. Hence, the Bayesian network does not utilize all of the information that the defender has obtained and thus cannot offer an accurate description of the true belief state. Consequently, the posterior probability I compute from the Bayesian network is inaccurate even if an exact inference algorithm is used. However, I note that in the experiments I obtain satisfactory solution quality even though both the Bayesian network formulation and the belief propagation algorithm are inexact.
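To make step (i) concrete, the sketch below computes the posterior marginals exactly by enumerating the discrete utility grid and keeping only the utility vectors consistent with the pairwise conditions that the factors f encode. This brute-force enumeration is only feasible for small n and m and is meant to illustrate what belief propagation approximates; it is not the inference routine used in the thesis, and the uniform-prior assumption matches the experimental setup only.

```python
import itertools
import numpy as np

def posterior_marginals_bruteforce(n, levels, P, I):
    """Exact posterior marginals of u(i) under a uniform prior, for small n and m.

    levels: list of the m discrete utility values.
    P:      common penalty; I[i][j]: the maintained quantity I_t(i, j).
    A utility vector u has positive posterior iff (u[i]-P)/(u[j]-P) >= I[i][j] for all
    i != j, which is exactly the condition encoded by the pairwise factors f(u(i), u(j)).
    """
    m = len(levels)
    marginals = np.zeros((n, m))
    for u in itertools.product(range(m), repeat=n):
        vals = [levels[idx] for idx in u]
        feasible = all((vals[i] - P) / (vals[j] - P) >= I[i][j]
                       for i in range(n) for j in range(n) if i != j)
        if feasible:
            for i, idx in enumerate(u):
                marginals[i, idx] += 1.0   # uniform prior: every feasible u counts equally
    return marginals / marginals.sum(axis=1, keepdims=True)
```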
5.4 Continuous Utility Scenario

In previous sections, I discussed the model with discretized utilities, with the justification that humans cannot distinguish between tiny differences, which can be captured in a discrete model with a sufficiently large m. However, it is difficult to tell how "large" is large enough; furthermore, larger m leads to higher computational cost, so we cannot increase m arbitrarily. In this section, I extend my model to continuous utilities, i.e., $u(i) \in [0,1]$, making it more expressive in describing humans' perception of utilities.

In Section 5.1.3, the POMDP was built for discrete utilities. When the utilities are continuous, the formulation remains the same except that the utility space U becomes

\[
U \triangleq \{(u(1),u(2),\dots,u(n)) : u(i) \in [0,1], \forall i \in N\},
\]

which is a continuous space. Thus, the previous POMDP formulation becomes a continuous-state POMDP, which lacks efficient solutions.

The GMOP algorithm is composed of two steps: sampling from the utility space and running MCTS with those samples. For continuous utilities, the latter step remains the same, so the key issue is to sample from the continuous utility space, which involves the computation of the conditional probability

\[
p(u_i \mid u_{-i}) = \eta'' B_1(u_i) \prod_{j=1}^{t-1} \Omega(o_j \mid s=(u=(u_i,u_{-i}),C_{j+1}),a_j).
\]

In general, this computation involves the multiplication of several functions ($\prod_{j=1}^{t-1} \Omega(o_j \mid s=(u=(u_i,u_{-i}),C_{j+1}),a_j)$ with $u_i$ as the variable), which is hard to compute unless those functions have special properties, e.g., when the attacker is an FBR attacker and all sites share the same penalty P.

Proposition 5.4.1. When the attacker is an FBR player and all sites share the same penalty P, $\prod_{j=1}^{t-1} \Omega(o_j \mid s=(u=(u_i,u_{-i}),C_{j+1}),a_j)$ is a boxcar function (a function that is zero over the entire real line except for a single interval where it equals a constant) with non-zero interval [a,b], where $a = \max_{j \in N, j \neq i}\{P + I_t(i,j)(u(j)-P)\}$, $b = \min_{j \in N, j \neq i}\{P + \frac{u(j)-P}{I_t(j,i)}\}$, and the height on [a,b] equals $\frac{1}{b-a}$.

Proof. When the attacker is an FBR player and all sites share the same penalty P, the factors $\Omega(o_j \mid s=(u=(u_i,u_{-i}),C_{j+1}),a_j)$ are all boxcar functions of $u_i$, so their product is also a boxcar function. Recall Proposition 5.3.1:

\[
\prod_{i=1}^{t-1} \Omega(o_i \mid s=(u,C_{i+1}),a_i) > 0 \iff \frac{u(i)-P}{u(j)-P} \geq I_t(i,j), \forall i,j \in N.
\]

In this situation, $u_{-i}$ is given, and we are trying to find the smallest and largest $u_i$ satisfying $\prod_{i=1}^{t-1}\Omega(o_i \mid s=(u,C_{i+1}),a_i) > 0$, i.e., $\frac{u(i)-P}{u(j)-P} \geq I_t(i,j), \forall i,j \in N$. So we have

\[
u_i \geq P + (u_j - P)\,I_t(i,j), \forall j \in N, j \neq i, \qquad u_i \leq P + \frac{u_j - P}{I_t(j,i)}, \forall j \in N, j \neq i.
\]

Since the total area under the boxcar function is 1 (it is a probability distribution), the height is $\frac{1}{b-a}$.

With Proposition 5.4.1, we can compute $\prod_{j=1}^{t-1} \Omega(o_j \mid s=(u=(u_i,u_{-i}),C_{j+1}),a_j)$ very efficiently by computing only the lower limit a and the upper limit b, making it possible to draw samples from the continuous state space using Gibbs sampling and then search for the best strategy with MCTS. Clearly, at this stage the continuous utility scenario only works in restricted cases; nonetheless, the restriction that all sites share a single penalty is reasonable in some real-world domains, and the FBR restriction on the extractor may be a useful approximation in some situations. Understanding the appropriate use of the continuous utility scenario given the need to model human extractors, scaling it up, and relaxing the restrictions imposed all remain topics for future work.
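Under the conditions of Proposition 5.4.1 and a uniform prior on [0,1] (as in the experiments), the Gibbs step for $u_i$ reduces to computing the two limits and drawing uniformly between them. A minimal sketch follows; the intersection with [0,1] and the handling of zero entries in I are added assumptions.

```python
import numpy as np

def gibbs_step_continuous(i, u, P, I, rng=None):
    """Resample u[i] given the other utilities in the continuous FBR case with a common
    penalty P. I[j][k] holds I_t(j, k). By Proposition 5.4.1 the product of observation
    factors is a boxcar on [a, b]; with a uniform prior the conditional for u[i] is
    simply uniform on [a, b] intersected with [0, 1]."""
    rng = rng or np.random.default_rng()
    others = [j for j in range(len(u)) if j != i]
    a = max(P + I[i][j] * (u[j] - P) for j in others)
    uppers = [P + (u[j] - P) / I[j][i] for j in others if I[j][i] > 0]
    b = min(uppers) if uppers else 1.0
    a, b = max(a, 0.0), min(b, 1.0)        # intersect with the prior's support [0, 1]
    return rng.uniform(a, b)
```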
5.5 Unknown Extractor Scenario — Model Ensemble

The discussions in previous sections are based on the assumption that the attacker is an FQR attacker and that the defender knows the true attacker model. However, in real-world applications, the attacker may follow other behavioral models, and the defender may only have some rough idea of what the behavioral model is instead of knowing it exactly. In this section, I first discuss how to extend the POMDP formulation and GMOP algorithm to other behavioral models, and then discuss the scenario where the defender is uncertain about the attacker's behavioral model.

5.5.1 Dealing with Other Behavioral Models

In addition to the FQR model, there are other human behavioral models that might well describe the attacker's actions. For example, the attacker may have limited memory or may weigh recent defender activities more heavily, and optimize against this limited or "biased" memory. For the greatest generality, I only keep the very basic assumption that the attacker's decisions depend on $u(i), P(i), i \in N$ along with the defender's actions in previous rounds, i.e., the probability that the attacker visits target i is a function of u, P, and a. In this section, I discuss how the POMDP formulation and GMOP algorithm can be modified accordingly to deal with this broader category of behavioral models.

With this broader category of behavioral models, the attacker's decision making depends on the sequence of the defender's actions rather than merely on the count C, so the previous state space $S = U \times \mathbb{Z}^n$ is no longer enough to determine the attacker's actions. In response, I modify the previous POMDP formulation using a more expressive state space.

State space: The state space of the POMDP becomes $S = U \times \{N^i : i \in \{0,1,2,\dots,T\}\}$. U is still the utility space and $\{N^i : i \in \{0,1,2,\dots,T\}\}$ is now the entire history of the defender's actions, where $h \in \{N^i : i \in \{0,1,2,\dots,T\}\}$ is a vector in which h(t) is the target the defender visits at round $t \in T$. A particular state $s \in S$ is written as s = (u,h), where u is the vector of utility levels for each target and h is the current history of the defender's actions. The initial beliefs are expressed by a distribution over $s = (u,\emptyset)$, induced by the prior distribution on u.

Action space: The action space A remains N, representing the target the defender chooses to protect.

Observation space: The observation space O remains N, representing the target from which the attacker attempts to steal.

Conditional transition probability: The conditional transition probability is modified to describe how the defender's action history evolves:

\[
T\big(s'=(u',h') \mid s=(u,h),a\big) = \begin{cases} 1, & u=u',\ h'=h+a, \\ 0, & \text{otherwise}. \end{cases}
\]

The notation "+" here means appending a at the end of h.

Conditional observation probability: Suppose the behavioral model function f(u,P,h,o) determines the attacker's probability of stealing from target o given the utilities u, the penalties P and the history h. The conditional observation probability is defined as $\Omega(o \mid s'=(u,h),a) = f(u,P,h-a,o)$. The notation "−" here means removing a from the end of h.

Reward function: The reward function R becomes

\[
R\big(s=(u,h),\,s'=(u,h+a),\,a,\,o\big) = \begin{cases} -P(o), & a=o, \\ -u(o), & a \neq o. \end{cases}
\]

With the new definition of the POMDP, the GMOP algorithm remains the same except that the computation of the conditional probability during Gibbs sampling becomes

\[
p(u_i \mid u_{-i}) = \eta B_1(u_i) \prod_{j=1}^{t-1} \Omega(o_j \mid s=(u=(u_i,u_{-i}),h_{j+1}),a_j). \tag{5.5}
\]
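As one concrete instance of such a behavioral-model function f(u, P, h, o), the sketch below implements a quantal response against an exponentially discounted empirical coverage, in the spirit of the discounted-memory FQR variants used later in the experiments. The parameter names lam and gamma, and the exact functional form, are illustrative assumptions rather than the thesis's definition.

```python
import numpy as np

def discounted_memory_fqr(u, P, h, lam=1.0, gamma=0.9):
    """Illustrative behavioral model f(u, P, h, .): quantal response against an
    exponentially discounted empirical coverage built from the defender history h.
    u, P: per-target utilities and penalties; h: defender actions so far (most recent last).
    Returns a probability vector over targets."""
    n = len(u)
    cov = np.zeros(n)
    if len(h) > 0:
        weights = np.array([gamma ** (len(h) - 1 - t) for t in range(len(h))])
        for t, target in enumerate(h):
            cov[target] += weights[t]
        cov /= weights.sum()                     # discounted empirical coverage c(i)
    eu = cov * np.asarray(P) + (1.0 - cov) * np.asarray(u)   # empirical expected utility
    logits = lam * eu
    logits -= logits.max()                       # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()
```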
5.5.2 Dealing with an Unknown Extractor

In the real world, it is more likely that the defender only has some rough idea of what the attacker's behavioral model is rather than knowing it exactly. In this section, I extend my model to this scenario. To formally define the problem, I model the defender's "rough idea" of the attacker's behavioral model as a set of candidate behavioral models, and the defender is uncertain about which model best fits the attacker's behavior. As the game goes on, the defender gets a better idea of the utilities of different targets as well as of the attacker's behavioral model.

Suppose that there are k candidate attacker behavioral models Type 1, Type 2, ..., Type k that might model the attacker's behavior. I define the attacker's behavioral model space B to be the set of these k behavioral models: B = {Type 1, Type 2, ..., Type k}.

When the defender has perfect knowledge about the attacker's behavioral model, I formulate a POMDP in which the state space is the cross product of the utility space and the defender's action-history space, $S = U \times \{N^i : i \in \{0,1,2,\dots,T\}\}$, since these are the only two variables that determine the attacker's actions. However, when the defender is uncertain what the attacker's behavioral model is, the attacker's action also depends on his behavioral model in addition to these two variables. Thus, the attacker's behavioral model should also be included in the state space of the POMDP formulation.

State space: The state space becomes the cross product of the behavioral model space B, the utility space U and the history space $\{N^i : i \in \{0,1,2,\dots,T\}\}$: $S = B \times U \times \{N^i : i \in \{0,1,2,\dots,T\}\}$. A particular state $s \in S$ is written as s = (b,u,h), where $b \in B$ represents the attacker's behavioral model.

Action space: The action space A remains N, representing the target the defender chooses to protect.

Observation space: The observation space O remains N, representing the target from which the attacker attempts to steal.

Conditional transition probability: The conditional transition probability is modified accordingly:

\[
T\big(s'=(b',u',h') \mid s=(b,u,h),a\big) = \begin{cases} 1, & b=b',\ u=u',\ h'=h+a, \\ 0, & \text{otherwise}. \end{cases}
\]

The notation "+" here means appending a at the end of h.

Conditional observation probability: Suppose the behavioral model function f(b,u,P,h,o) determines the attacker's probability of stealing from target o given the behavioral model b, the utilities u, the penalties P and the history h. The conditional observation probability is defined as $\Omega(o \mid s'=(b,u,h),a) = f(b,u,P,h-a,o)$. The notation "−" here means removing a from the end of h.

Reward function: The reward function R becomes

\[
R\big(s=(b,u,h),\,s'=(b,u,h+a),\,a,\,o\big) = \begin{cases} -P(o), & a=o, \\ -u(o), & a \neq o. \end{cases}
\]

For this new POMDP, we need to draw samples from the behavioral model space in addition to the utility space for the GMOP algorithm. We can combine the behavioral model variable b and the utility variable u into a new multivariate. Gibbs sampling is then used to draw samples of b and u, which are fed into MCTS to find the optimal action for the defender to take. In Gibbs sampling, the computation of the conditional probability for utility variable $u_i$ becomes

\[
p(u_i \mid u_{-i},b) = \eta B_1(u_i) \prod_{j=1}^{t-1} \Omega(o_j \mid s=(b,u=(u_i,u_{-i}),h_{j+1}),a_j). \tag{5.6}
\]

Similarly, the computation of the conditional probability for the behavioral model variable b is

\[
p(b \mid u) = \eta\, p(b,u) = \eta\, B_t(b,u) = \eta' B_1(b,u) \prod_{j=1}^{t-1} \Omega(o_j \mid s=(b,u,h_{j+1}),a_j) = \eta'' B_1(b) \prod_{j=1}^{t-1} \Omega(o_j \mid s=(b,u,h_{j+1}),a_j). \tag{5.7}
\]

In the original formulation, the multivariate we sample in Gibbs sampling is the n-dimensional utility variable u. In this extension, the multivariate is a combination of the behavioral model variable b and the utility variable u, which is of dimension n+1. Thus, this extension to an unknown extractor costs roughly $\frac{n+1}{n}$ times the computational time of the original formulation, and the extra time is used to compute $p(b \mid u)$.
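The Gibbs step for the model variable b in Equation 5.7 is a simple weighted draw over the candidate types. A minimal sketch, assuming each candidate model is available as a function f(u, P, h, o) returning the attack probability (the log-space accumulation and the small additive constant are implementation conveniences, not part of the thesis):

```python
import numpy as np

def sample_model(u, P, models, prior_b, actions, observations):
    """Gibbs step for the behavioral-model variable b (Equation 5.7).
    models:       list of candidate functions f(u, P, h, o) -> probability, one per Type.
    prior_b:      prior B_1(b) over the candidates.
    actions[j]:   defender action a_j; observations[j]: observed attack o_j at round j."""
    logp = np.log(np.asarray(prior_b, dtype=float))
    for b_idx, f in enumerate(models):
        for j, o_j in enumerate(observations):
            h = actions[:j]               # h_{j+1} - a_j: defender actions before round j
            logp[b_idx] += np.log(f(u, P, h, o_j) + 1e-300)
    probs = np.exp(logp - logp.max())
    probs /= probs.sum()
    return np.random.choice(len(models), p=probs)
```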
5.6 Experimental Evaluation

I evaluate the performance of my models and algorithms in this section through extensive numerical experiments. The results strongly support the benefits of the techniques introduced in this work. For the experimental settings, unless stated otherwise, I use n = 10, m = 10, a penalty of P = −50 across all targets, and a uniform prior probability distribution $B_1(u_i), i \in N$. All results are averaged over 1000 simulation runs. For each simulation run, I randomly draw the true utilities $u(i), i \in N$ and then simulate the actions the defender and the attacker would take over the rounds of the game. Solution quality is assessed in terms of the average reward that the defender gets in the first few rounds of the game. There are two parameters in MCTS for the GMOP algorithm: numSamples is the number of samples used to simulate in the MCTS, and maxHorizon is the depth of the tree, i.e., the number of time steps we look ahead in the POMDP.

5.6.1 GMOP Algorithm Evaluation

In this section, I use the FBR and FQR attacker models to evaluate the performance of the GMOP algorithm in both the discrete and continuous utility scenarios.

5.6.1.1 GMOP vs ZMDP/APPL

To begin, I compare the GMOP algorithm with the ZMDP solver [54] and the APPL solver [25], both of which are general POMDP solvers. I show that the GMOP algorithm achieves almost the same solution quality as the ZMDP/APPL solvers on small problem instances. For a small problem instance like n = 4, m = 5 and total rounds maxHorizon = 5, there are 78750 states in the POMDP. Both the ZMDP and APPL solvers run out of memory even on this small problem instance. Hence, I test the two solvers on an even smaller instance with n = 3, m = 5, P = −10 and maxHorizon = 5, so that the resulting POMDP has only 7000 states. As a baseline, I also include a fixed policy in which the defender randomly chooses one site to protect at each round. Table 5.1 reports the average reward of these algorithms for both the FQR (λ = 0.5, 1 and 1.5) and FBR attackers. In this table, the columns titled H_i for i = 1,...,5 represent the GMOP algorithm with maxHorizon set to i and numSamples set to 10000. We see that the GMOP algorithm with maxHorizon varying from 1 to 5 and the APPL/ZMDP solvers are very close in terms of average reward, and all algorithms outperform the random policy for the FQR and FBR attacker models.

Table 5.1: ZMDP/APPL vs GMOP in Solution Quality
            Random  ZMDP  APPL  H_1   H_2   H_3   H_4   H_5
FQR(0.5)    1.13    3.85  3.85  3.90  3.89  3.95  3.90  3.91
FQR(1)      1.05    4.84  4.81  4.75  4.80  4.87  4.97  4.79
FQR(1.5)    1.03    5.35  5.39  5.35  5.36  5.42  5.36  5.34
FBR         1.09    6.32  6.31  6.25  6.24  6.27  6.32  6.36

5.6.1.2 Analysis of GMOP

Now I investigate the effect of numSamples and maxHorizon on the performance of MCTS in the GMOP algorithm for discrete utilities. Figures 5.2(a)/5.2(b) report the results for the FQR model, while Figures 5.4(a)/5.4(b) report the results for the FBR model. Figures 5.2(a)/5.4(a) show that the performance of MCTS improves as I increase numSamples while holding maxHorizon fixed, demonstrating the convergence of MCTS. Figures 5.2(b)/5.4(b) together show that: (i) if numSamples is large enough to ensure convergence for both larger and smaller maxHorizon (numSamples = 10000 here), planning more horizons ahead (larger maxHorizon) increases the reward the defender can get; (ii) if numSamples is not large enough to ensure convergence for larger maxHorizon (numSamples = 100 here), the reward the defender can get decreases as maxHorizon increases, because a larger maxHorizon implies a deeper Monte Carlo tree, so more samples are needed to ensure convergence, and the performance of MCTS deteriorates if convergence is not reached. Figures 5.3(a)/5.3(b) report the results for the FBR model with continuous utilities. Note that we are unable to solve problems with the FQR model in the continuous utility scenario. Those figures show patterns similar to Figures 5.2(a)/5.2(b) and 5.4(a)/5.4(b).
However, an interesting observation is that continuous utilities require more samples to ensure convergence: numSamples = 10000 is reasonably large for convergence with discrete utilities, while it is not large enough for continuous utilities. The reason is that the continuous case involves a far larger utility space, so more samples are required to reach convergence.

An interesting phenomenon observed in Figures 5.2(b)/5.4(b)/5.3(b) is that the performance difference between different maxHorizon values is tiny. However, this is not always the case. Here I provide an example where maxHorizon makes a big difference in performance. Suppose that we have 3 targets: u(1) = 5; u(2) is 10 with probability 40% and 4 with probability 60%; u(3) has the same utility as u(2). The penalties across all targets are all 0. We have 2 rounds in total. In round 1, if the defender chooses to protect target 1, the attacker gets 0.4 * 10 = 4; if the defender chooses to protect target 2 or 3, the attacker gets 0.4 * 0.5 * 10 + 0.6 * 5 = 5. Thus the defender will choose to protect target 1 if maxHorizon = 1. In round 2, if the defender protected target 1 in round 1, the attacker attacks target 2 or 3 with equal probability, so he gets 0.5 * 0.6 * 4 + 0.5 * 0.4 * 10 = 3.2, which gives the attacker a utility of 7.2 in total; if the defender protected target 2 or 3 in round 1, the defender learns exactly where the attacker is going to attack, and the attacker gets 0, which gives the attacker a utility of 5 in total. Thus the defender will choose to protect target 2 or 3 in round 1 if maxHorizon = 2 and gets an expected utility of −5 over these two rounds, while getting a utility of −7.2 if maxHorizon = 1.

5.6.1.3 POMCP (Particle Filter) vs GMOP (Gibbs Sampling)

In this work I use Gibbs sampling to drive MCTS instead of the particle filter used in the original POMCP algorithm [53]. In this way, the distribution of the samples is closer to the actual belief state. I now compare the performance of these two sampling techniques.
The runtime of Gibbs sampling roughly increases linearly with numSamples; the runtime of the particle filter roughly increases linearly with the size of the particle filter (number of particles). For a fair comparison, I fix the particle filter size as well as numSamples in Gibbs sampling. For the FQR model, I set the particle filter size to 100000 and numSamples in Gibbs sampling to 100. The total runtimes are recorded in Table 5.2, where we see that the runtime of the GMOP algorithm is shorter than the runtime of POMCP as numSamples varies from 100 to 100000. However, Figure 5.2(c) demonstrates that the performance of the GMOP algorithm with 100 samples exceeds the performance of the POMCP algorithm regardless of the value of numSamples. This performance gap between GMOP and POMCP grows with time because the particle filter gives an increasingly worse approximation of the belief state as time evolves.

Figure 5.2: Fictitious Quantal Response. (a) MCTS convergence (FQR: λ = 0.5, 50 rounds); (b) Different horizons (FQR: λ = 0.5, 50 rounds); (c) GMOP vs POMCP in solution quality (FQR: λ = 1.5, maxHorizon = 1); (d) Robustness (FQR, 50 rounds, numSamples = 1000, maxHorizon = 2).

Figure 5.3: Continuous Utility Scenario. (a) MCTS convergence (FBR, 100 rounds); (b) Different horizons (FBR, 100 rounds).

Figure 5.4: Fictitious Best Response. (a) MCTS convergence (FBR, 100 rounds); (b) Different horizons (FBR, 100 rounds); (c) GMOP vs POMCP in solution quality (FBR, maxHorizon = 1); (d) GMOP vs heuristic in solution quality (FBR).

Table 5.2: GMOP vs POMCP in Runtime (s), Fictitious Quantal Response (λ = 1.5), maxHorizon = 1
GMOP-100  POMCP-100  POMCP-1000  POMCP-10000  POMCP-100000
31.71     75.86      72.92       75.89        92.26

Table 5.3 and Figure 5.4(c) show the runtime and reward of GMOP with numSamples = 1000 vs POMCP with filter size 10000, for the FBR attacker. For the FBR attacker, we see the same pattern but with an even larger gap in solution quality. In the FBR attacker model, the particles are depleted much more quickly than in the FQR model, so more new particles must be added. However, these new particles do not follow the distribution induced by the current belief state, which is detrimental to the quality of the approximation of the belief state and thus leads to worse performance.
Table 5.3: GMOP vs POMCP in Runtime (s), Fictitious Best Response, maxHorizon = 1
GMOP-1000  POMCP-100  POMCP-1000  POMCP-10000  POMCP-100000
83.48      224.35     240.83      257.40       282.71

5.6.1.4 Robustness

While the extension to an unknown extractor can deal with the situation where the defender does not know the true value of λ that measures the attacker's rationality, here I investigate how the performance changes if the attacker's true value of λ is only slightly different from the defender's estimate. If the performance is very sensitive to the estimate of λ, we will have to include many attacker behavioral models with various λ in the extension where the attacker's behavioral model is unknown. However, if the performance is "robust" against the estimate of λ, it is not necessary to include many models in the extension. In this experiment, I allow the attacker's true value of λ to vary over a small range, taking values in {0.5, 1, 1.5}, and I allow the defender's estimate of λ to be any of {0.5, 1, 1.5}, for a total of 9 combinations of the true λ and its estimate. Figure 5.2(d) presents the results of this experiment. It turns out that the defender only does slightly worse when she incorrectly estimates the attacker's true λ, which shows the "robustness" of the original framework.

5.6.1.5 Evaluation of the Advanced Sampling Technique in the FBR Model

In Section 5.3.1, I proposed an advanced way to compute conditional probabilities when using Gibbs sampling in the FBR model. This technique is less computationally expensive than the general method. Table 5.4 compares the runtimes of the general sampling technique with the advanced sampling technique. As the number of rounds increases from 20 to 100, the total runtime of the advanced sampling technique increases linearly, implying that the sampling cost at each round is approximately the same. On the other hand, the total runtime of the general sampling technique increases with the square of the number of rounds in the game, implying that the sampling cost increases linearly with each round.

Table 5.4: General vs Advanced Sampling in Runtime (s), Fictitious Best Response, numSamples = 1000, maxHorizon = 1
Rounds      20     40      60      80      100
General     51.77  209.31  469.80  835.15  1303.95
Advanced    43.83  62.04   77.24   92.77   108.67

5.6.1.6 GMOP vs Myopic Planning Heuristic

The myopic heuristic trades solution quality for computational efficiency for an FBR attacker. Figure 5.4(d) compares the solution quality of the myopic planning heuristic versus GMOP, and Table 5.5 compares their total runtimes. For a fair comparison, I set maxHorizon to 1 in the GMOP algorithm. Figure 5.4(d) indicates that the heuristic gives better solutions than GMOP with numSamples = 100. However, the solution quality of the heuristic is worse than that produced by GMOP when numSamples equals 1000 or 10000. According to Table 5.5, the runtime of the myopic heuristic is much less than the runtime of GMOP.

Table 5.5: GMOP vs Heuristic in Runtime (s), Fictitious Best Response
Heuristic  GMOP-100  GMOP-1000  GMOP-10000
0.49       8.38      83.48      689.87

5.6.2 Model Ensemble Evaluation

In this section, I evaluate the performance of the GMOP algorithm with the model ensemble idea. I use 6 different attacker models in the experiments.
They are defined as:

• Model 1 — Fictitious Quantal Response with λ = 10
• Model 2 — Fictitious Quantal Response with λ = 1
• Model 3 — Fictitious Quantal Response with memory of the most recent 20 rounds and λ = 10
• Model 4 — Fictitious Quantal Response with memory of the most recent 20 rounds and λ = 1
• Model 5 — Fictitious Quantal Response with exponentially reduced memory (exponential factor = 0.9; the defender's action i rounds ago has weight 0.9^i, so more recent actions weigh more) and λ = 10
• Model 6 — Fictitious Quantal Response with exponentially reduced memory (exponential factor = 0.9) and λ = 1

In this experiment, I compare the performance of the ensemble agent that takes into consideration all 6 different models with that of the single agents that each assume the attacker has one exact behavioral model. Figure 5.5(a) shows the performance of those agents when the real attacker follows model 1. Similarly, Figures 5.5(b), 5.5(c), 5.5(d), 5.5(e) and 5.5(f) show the performance when the real attacker follows models 2, 3, 4, 5 and 6. The legend "Ensemble" represents the ensemble agent that takes into consideration all 6 different models, while the legend "Model i" represents the agent that assumes the attacker is of model i. Figures 5.5(a)-5.5(f) show how the performance of these 7 agents evolves as the game goes on. The observation is that, as the game goes on, the performance of "Ensemble" comes very close to that of "Model i" when the real attacker follows model i, and it outperforms the other single agents that assume the wrong attacker model. The ensemble agent gets a better idea of the attacker's behavioral model as the game goes on, and thus the ensemble agent performs better than the single agent when the attacker model is unknown to the defender. In Table 5.6, I compare the runtime of the ensemble agent and the single agents. We can see that using an ensemble agent brings only a little extra computational cost.

Table 5.6: Ensemble vs Single in Runtime (s), dealing with an attacker of model 1, 50 rounds
Ensemble  Model 1  Model 2  Model 3  Model 4  Model 5  Model 6
3059.33   2912.09  2764.36  2925.66  2757.56  3802.97  2980.68

Figure 5.5: Model Ensemble. (a)-(f) Average reward over 50 rounds when dealing with an attacker of model 1 through model 6, respectively.

Chapter 6 Learning Attacker's Preference — Markovian Modeling

My work discussed in Chapter 5 assumes that defenders have knowledge of all poaching activities throughout the wildlife protected area. Unfortunately, given vast geographic areas for wildlife protection, defenders do not have knowledge of poaching activities in areas they do not protect.
Thus, defenders are faced with the exploration-exploitation tradeoff: whether to protect the targets that are already known to have a lot of poaching activity or to explore the targets that have not been protected for a long time. The work in this chapter aims to solve this exploration-exploitation tradeoff.

The exploration-exploitation tradeoff here is different from that in the non-Bayesian stochastic multi-armed bandit problem [4]. In stochastic multi-armed bandit problems, the rewards of every arm are random variables with a stationary unknown distribution. However, in my problem, patrol affects attack activities: more patrol is likely to decrease attack activities and less patrol is likely to increase attack activities. Thus, the reward distribution changes depending on the player's choice: more selection (patrol) leads to lower reward (less attack activity) and less selection (patrol) leads to higher reward (more attack activity). On the other hand, the adversarial multi-armed bandit problem [5] is also not an appropriate model for this domain. In adversarial multi-armed bandit problems, the reward can change arbitrarily, while the attack activities in my problem are unlikely to change rapidly in a short period. This makes the adversarial multi-armed bandit model inappropriate for this domain.

In reality, how patrol affects attack activities can reasonably be assumed to follow a consistent pattern that can be learned from historical data (defenders' historical observations). I model this pattern as a Markov process and provide the following contributions in this work. First, I formulate the problem as a restless multi-armed bandit (RMAB) model to handle the limited observability challenge: defenders do not have observations for arms they do not activate (targets they do not protect). Second, I propose an EM-based learning algorithm to learn the RMAB model from defenders' historical observations. Third, I use the solution concept of the Whittle index policy to solve the RMAB model and plan defenders' patrol strategies. However, indexability is required for the existence of the Whittle index, so I provide two sufficient conditions for indexability and an algorithm to numerically evaluate indexability. Fourth, I propose a binary search based algorithm to find the Whittle index policy efficiently.

6.1 Model

6.1.1 Motivating Domains and their Properties

My work is mainly motivated by the domain of wildlife protection, such as protecting endangered animals and fish stocks [12,58]. Other motivating domains include police patrols to catch fare evaders in a barrier-free transit system [61], border patrol [21,22], etc. The model I describe in this work is based on the following assumptions about the nature of the interactions between defenders and attackers in these domains. Besides the frequent interactions between defenders (patrollers/police) and attackers (poachers/fare evaders/smugglers), these domains share two other important properties: (i) patrol affects attacking activities (poaching/fare evasion/smuggling); (ii) limited/partial observability. I next use the wildlife protection domain as an example to illustrate these two properties.

Poaching activity is a dynamic process affected by patrol. If patrollers patrol a certain location frequently, it is very likely that the poachers poaching in this location will switch to other locations. On the other hand, if a location has not been patrolled for a long time, poachers may gradually notice that and switch to this location for poaching.
In the wildlife protection domain, neither patrollers nor poachers have perfect observation of their opponents' actions. This observation imperfection lies in two aspects: (i) limited observability — patrollers/poachers do not know what happens at locations they do not patrol/poach; (ii) partial observability — patrollers/poachers do not have perfect observation even at locations they patrol/poach — the location might be large (e.g., a 2km×2km area), so it is possible that patrollers and poachers do not see each other even when they are at the same location.

These two properties make it extremely difficult for defenders to optimally plan their patrol strategies. For example, defenders may find a target with a large number of attack activities at the beginning, so they may start to protect this target frequently. After a period of time, attack activities at this target may start to decrease due to the frequent patrol. At this point, defenders have to decide whether to keep protecting this target (exploitation) or to switch to other targets (exploration). However, defenders do not have knowledge of attack activities at the other targets at that moment, which makes this decision extremely difficult.

Fortunately, the frequent interactions between defenders and attackers make it possible for defenders to learn the effect of patrol on attackers from historical data. With this learned effect, defenders are able to estimate attack activities at targets they do not protect. Based on this concept, I model these domains as a restless multi-armed bandit problem and use the solution concept of the Whittle index policy to plan defenders' strategies.

6.1.2 Formal Model

I now formalize the story in Section 6.1.1 into a mathematical model that can be formulated as a restless multi-armed bandit problem. There are n targets, indexed by $N \triangleq \{1,\dots,n\}$. Defenders have k patrol resources that can be deployed to these n targets. At every round, defenders choose k targets to protect. After that, defenders have an observation of the number of attack activities for the targets they protect, and no information for the targets they do not protect. The objective for defenders is to decide which k targets to protect at every round so as to catch as many attackers as possible.

Due to the partial observability on the defenders' side — defenders' observation of attack activities is not perfect even for targets they protect — I introduce a hidden variable, attack intensity, which represents the true degree of attack intensity at a certain target. Clearly, this hidden variable cannot be directly observed by defenders. Instead, defenders' observation is a random variable conditioned on this hidden attack intensity, and the larger the attack intensity is, the more likely it is for defenders to observe more attack activities during their patrol.

I discretize the hidden variable attack intensity into $n_s$ levels, denoted by $S = \{0,1,\dots,n_s-1\}$. Lower i represents lower attack intensity. For a certain target, its attack intensity transitions after every round. If this target is protected, the attack intensity transitions according to an $n_s \times n_s$ transition matrix $T^1$; if this target is not protected, the attack intensity transitions according to another $n_s \times n_s$ transition matrix $T^0$. The transition matrices represent how patrol affects attack intensity — $T^1$ tends to reduce attack intensity and $T^0$ tends to increase attack intensity. The randomness in the transition matrices models the attackers' partial observability discussed in Section 6.1.1. Note that different targets may have different transition matrices, because some targets may be more attractive to attackers (for example, some locations may have more animal resources in the wildlife protection domain), so that it is more difficult for the attack intensity to go down and easier for it to go up.

I also discretize defenders' observations of attack activities into $n_o$ levels, denoted by $O = \{0,1,\dots,n_o-1\}$. Lower i represents fewer observed attack activities. Note that defenders only have observations for targets they protect. An $n_s \times n_o$ observation matrix O determines how the observation depends on the hidden attack intensity. Generally, the larger the attack intensity is, the more likely it is for defenders to observe more attack activities during their patrol. As with the transition matrices, different targets may have different observation matrices.

While defenders get observations of attack activities during their patrol, they also receive rewards — arresting poachers/fare evaders/smugglers brings benefit. Clearly, the reward defenders receive depends on their observation, so I define the reward function $R(o), o \in O$, where a larger observation leads to a higher reward. For example, if o = 0 represents finding no attack activity and o = 1 represents finding attack activities, then R(0) = 0 and R(1) = 1. Note that defenders only get rewards for targets they protect.

To summarize, for the targets defenders protect, defenders get an observation depending on the current attack intensity, get the reward associated with the observation, and then the attack intensity transitions according to $T^1$; for the targets defenders do not protect, defenders do not get any observation, get reward 0, and the attack intensity transitions according to $T^0$. Figure 6.1 illustrates this process. In this model, the state discretization level $n_s$, the observation discretization level $n_o$ and the reward function R(o) are pre-specified by defenders; the transition matrices $T^1$ and $T^0$, observation matrix O and initial belief π can be learned from defenders' previous observations. I briefly discuss the learning algorithm in Section 6.1.3. After those parameters are learned, this model is formulated as a restless multi-armed bandit model to plan defenders' strategies.

Figure 6.1: Model Illustration (the attack intensity $s_t$ transitions under $T^1$ when $a_t = 1$ and under $T^0$ when $a_t = 0$; observations $o_t$ and rewards $R(o_t)$ are obtained only in rounds with $a_t = 1$).
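The per-target dynamics just described are easy to simulate, which is also how synthetic action/observation histories for learning can be generated. The sketch below is an illustrative simulator consistent with the model above (it is not the thesis's experimental code); all argument names are mine.

```python
import numpy as np

def simulate_target(T0, T1, O, R, actions, s0=None, rng=None):
    """Simulate one target's attack-intensity chain under a fixed patrol sequence.
    T0/T1: n_s x n_s transition matrices (not patrolled / patrolled); O: n_s x n_o
    observation matrix; R: length-n_o reward vector; actions: list of 0/1 patrol decisions.
    Returns the collected rewards and observations (None when the target is not patrolled)."""
    rng = rng or np.random.default_rng()
    ns = T0.shape[0]
    s = s0 if s0 is not None else rng.integers(ns)
    rewards, observations = [], []
    for a in actions:
        if a == 1:
            o = rng.choice(O.shape[1], p=O[s])   # observation drawn from the current intensity
            rewards.append(R[o])
            observations.append(o)
            s = rng.choice(ns, p=T1[s])          # patrolled: intensity tends to decrease
        else:
            rewards.append(0.0)
            observations.append(None)
            s = rng.choice(ns, p=T0[s])          # not patrolled: intensity tends to increase
    return rewards, observations
```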
6.1.3 Learning the Model From Defenders' Previous Observations

Given the defenders' action history $\{a_i\}$ and observation history $\{o_i\}$, my objective is to learn the transition matrices $T^1$ and $T^0$, the observation matrix O and the initial belief π. Due to the existence of the hidden variables $\{s_i\}$, the expectation-maximization (EM) algorithm is used for learning.

The expectation-maximization algorithm repeats the following steps until convergence:

1. compute $Q(\theta,\theta_d) = \sum_{z \in Z} P(z \mid x;\theta_d) \log[P(x,z;\theta)]$
2. set $\theta^{(d+1)} = \arg\max_\theta Q(\theta,\theta_d)$

where z are the latent variables, which are the hidden state sequence in my problem; x are the observed data, which are the observation sequence in my problem; and θ are the parameters to be estimated, namely the transition matrices $T^1/T^0$ (the transition matrix when the action is a = 1/0), the output matrix O and the initial hidden state distribution π.
Since $P(z \mid x;\theta_d) = \frac{P(x,z;\theta_d)}{P(x;\theta_d)}$, we can write

\[
\theta^{(d+1)} = \arg\max_\theta \sum_{z \in Z} P(x,z;\theta_d)\log[P(x,z;\theta)].
\]

Denote $\hat{Q}(\theta,\theta_d) \triangleq \sum_{z \in Z} P(x,z;\theta_d)\log[P(x,z;\theta)]$, so $\theta^{(d+1)} = \arg\max_\theta \hat{Q}(\theta,\theta_d)$. The complete-data likelihood is

\[
P(x,z;\theta) = \pi_{z_1} \prod_{t=1:a_t=1}^{T-1} T^1_{z_t z_{t+1}} \prod_{t=1:a_t=0}^{T-1} T^0_{z_t z_{t+1}} \prod_{t=1:a_t=1}^{T} O_{z_t x_t}.
\]

Taking the log gives us

\[
\log P(x,z;\theta) = \log\pi_{z_1} + \sum_{t=1:a_t=1}^{T-1}\log T^1_{z_t z_{t+1}} + \sum_{t=1:a_t=0}^{T-1}\log T^0_{z_t z_{t+1}} + \sum_{t=1:a_t=1}^{T}\log O_{z_t x_t}.
\]

Then we have

\[
\hat{Q}(\theta,\theta_d) = \sum_{z \in Z} P(x,z;\theta_d)\log\pi_{z_1} + \sum_{z \in Z}\sum_{t=1:a_t=1}^{T-1} P(x,z;\theta_d)\log T^1_{z_t z_{t+1}} + \sum_{z \in Z}\sum_{t=1:a_t=0}^{T-1} P(x,z;\theta_d)\log T^0_{z_t z_{t+1}} + \sum_{z \in Z}\sum_{t=1:a_t=1}^{T} P(x,z;\theta_d)\log O_{z_t x_t}.
\]

We also have the constraints

\[
\sum_{i=0}^{n_s-1}\pi_i = 1, \quad \sum_{j=0}^{n_s-1} T^1_{ij} = 1,\ \forall i, \quad \sum_{j=0}^{n_s-1} T^0_{ij} = 1,\ \forall i, \quad \sum_{j=0}^{n_o-1} O_{ij} = 1,\ \forall i.
\]

Using Lagrange multipliers, we have

\[
\hat{L}(\theta,\theta_d) = \hat{Q}(\theta,\theta_d) - \lambda_\pi\Big(\sum_{i=0}^{n_s-1}\pi_i - 1\Big) - \sum_{i=0}^{n_s-1}\lambda_{T^1_i}\Big(\sum_{j=0}^{n_s-1}T^1_{ij}-1\Big) - \sum_{i=0}^{n_s-1}\lambda_{T^0_i}\Big(\sum_{j=0}^{n_s-1}T^0_{ij}-1\Big) - \sum_{i=0}^{n_s-1}\lambda_{O_i}\Big(\sum_{j=0}^{n_o-1}O_{ij}-1\Big).
\]

Taking derivatives and setting them to 0, we get the update steps:

\[
\pi^{(d+1)}_i = P(s_1=i \mid x;\theta_d), \qquad
T^{1,(d+1)}_{ij} = \frac{\sum_{t=1:a_t=1}^{T-1} P(s_t=i,s_{t+1}=j \mid x;\theta_d)}{\sum_{t=1:a_t=1}^{T-1} P(s_t=i \mid x;\theta_d)}, \qquad
T^{0,(d+1)}_{ij} = \frac{\sum_{t=1:a_t=0}^{T-1} P(s_t=i,s_{t+1}=j \mid x;\theta_d)}{\sum_{t=1:a_t=0}^{T-1} P(s_t=i \mid x;\theta_d)}, \qquad
O^{(d+1)}_{ij} = \frac{\sum_{t=1:a_t=1}^{T} P(s_t=i \mid x;\theta_d)\,\mathbb{I}(o_t=j)}{\sum_{t=1:a_t=1}^{T} P(s_t=i \mid x;\theta_d)}.
\]

So we need to compute $P(s_t=i \mid x;\theta_d)$ and $P(s_t=i,s_{t+1}=j \mid x;\theta_d)$, which can be done through the forward-backward algorithm. Let $\alpha_i(t) = P(o_1=x_1,\dots,o_t=x_t,s_t=i;\theta)$. It can be computed recursively:

\[
\alpha_i(1) = \begin{cases} \pi_i O_{i x_1}, & a_1=1, \\ \pi_i, & a_1=0, \end{cases}
\qquad
\alpha_j(t+1) = \begin{cases} O_{j x_{t+1}} \sum_{i=0}^{n_s-1}\alpha_i(t)T^1_{ij}, & a_t=1,\ a_{t+1}=1, \\ O_{j x_{t+1}} \sum_{i=0}^{n_s-1}\alpha_i(t)T^0_{ij}, & a_t=0,\ a_{t+1}=1, \\ \sum_{i=0}^{n_s-1}\alpha_i(t)T^1_{ij}, & a_t=1,\ a_{t+1}=0, \\ \sum_{i=0}^{n_s-1}\alpha_i(t)T^0_{ij}, & a_t=0,\ a_{t+1}=0. \end{cases}
\]

Define $\beta_i(t) = P(o_{t+1}=x_{t+1},\dots,o_T=x_T \mid s_t=i;\theta)$. It can also be computed recursively:

\[
\beta_i(T) = 1,
\qquad
\beta_i(t) = \begin{cases} \sum_{j=0}^{n_s-1}\beta_j(t+1)T^1_{ij}O_{j x_{t+1}}, & a_t=1,\ a_{t+1}=1, \\ \sum_{j=0}^{n_s-1}\beta_j(t+1)T^0_{ij}O_{j x_{t+1}}, & a_t=0,\ a_{t+1}=1, \\ \sum_{j=0}^{n_s-1}\beta_j(t+1)T^1_{ij}, & a_t=1,\ a_{t+1}=0, \\ \sum_{j=0}^{n_s-1}\beta_j(t+1)T^0_{ij}, & a_t=0,\ a_{t+1}=0. \end{cases}
\]

So we have

\[
P(s_t=i \mid x;\theta) = \frac{\alpha_i(t)\beta_i(t)}{\sum_{j=0}^{n_s-1}\alpha_j(t)\beta_j(t)},
\qquad
P(s_t=i,s_{t+1}=j \mid x;\theta) = \begin{cases} \frac{\alpha_i(t)T^1_{ij}\beta_j(t+1)O_{j x_{t+1}}}{\sum_{k=0}^{n_s-1}\alpha_k(t)\beta_k(t)}, & a_t=1,\ a_{t+1}=1, \\ \frac{\alpha_i(t)T^0_{ij}\beta_j(t+1)O_{j x_{t+1}}}{\sum_{k=0}^{n_s-1}\alpha_k(t)\beta_k(t)}, & a_t=0,\ a_{t+1}=1, \\ \frac{\alpha_i(t)T^1_{ij}\beta_j(t+1)}{\sum_{k=0}^{n_s-1}\alpha_k(t)\beta_k(t)}, & a_t=1,\ a_{t+1}=0, \\ \frac{\alpha_i(t)T^0_{ij}\beta_j(t+1)}{\sum_{k=0}^{n_s-1}\alpha_k(t)\beta_k(t)}, & a_t=0,\ a_{t+1}=0. \end{cases}
\]
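A compact sketch of one EM iteration for this action-dependent hidden Markov model follows, written directly from the recursions above. It assumes the observation entries of x at non-patrolled rounds are ignored (they may hold any placeholder) and omits the scaling that a long sequence would need to avoid underflow; it is an illustrative sketch, not the thesis implementation.

```python
import numpy as np

def forward_backward(x, a, pi, T0, T1, O):
    """E-step: gamma[t, i] = P(s_t = i | x), xi[t, i, j] = P(s_t = i, s_{t+1} = j | x).
    x[t] is read only when a[t] == 1 (no observation is available otherwise)."""
    Tn, ns = len(a), len(pi)
    alpha = np.zeros((Tn, ns))
    beta = np.ones((Tn, ns))
    alpha[0] = pi * O[:, x[0]] if a[0] == 1 else pi
    for t in range(Tn - 1):
        Tr = T1 if a[t] == 1 else T0
        pred = alpha[t] @ Tr
        alpha[t + 1] = pred * O[:, x[t + 1]] if a[t + 1] == 1 else pred
    for t in range(Tn - 2, -1, -1):
        Tr = T1 if a[t] == 1 else T0
        msg = beta[t + 1] * O[:, x[t + 1]] if a[t + 1] == 1 else beta[t + 1]
        beta[t] = Tr @ msg
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)
    xi = np.zeros((Tn - 1, ns, ns))
    for t in range(Tn - 1):
        Tr = T1 if a[t] == 1 else T0
        msg = beta[t + 1] * O[:, x[t + 1]] if a[t + 1] == 1 else beta[t + 1]
        xi[t] = alpha[t][:, None] * Tr * msg[None, :]
        xi[t] /= xi[t].sum()
    return gamma, xi

def em_update(x, a, pi, T0, T1, O, no):
    """One EM iteration: the M-step re-estimates pi, T0, T1, O from expected counts."""
    gamma, xi = forward_backward(x, a, pi, T0, T1, O)
    a = np.asarray(a)
    act, pas = a[:-1] == 1, a[:-1] == 0
    pi_new = gamma[0]
    T1_new = xi[act].sum(axis=0) / gamma[:-1][act].sum(axis=0)[:, None]
    T0_new = xi[pas].sum(axis=0) / gamma[:-1][pas].sum(axis=0)[:, None]
    obs_rounds = a == 1
    O_new = np.zeros_like(O, dtype=float)
    for j in range(no):
        O_new[:, j] = gamma[obs_rounds][np.asarray(x)[obs_rounds] == j].sum(axis=0)
    O_new /= O_new.sum(axis=1, keepdims=True)
    return pi_new, T0_new, T1_new, O_new
```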
6.2 Restless Bandit for Planning

In this section, I formulate the model discussed in Section 6.1.2 as a restless multi-armed bandit problem and plan defenders' strategies using the solution concept of the Whittle index policy.

6.2.1 Restless Bandit Formulation

It is straightforward to formulate the model discussed in Section 6.1.2 as a restless multi-armed bandit problem. Every target is viewed as an arm, and defenders choose k arms to activate (k targets to protect) at every round. Consider a single arm (target): it is associated with $n_s$ (hidden) states, $n_o$ observations, $n_s \times n_s$ transition matrices $T^1$ and $T^0$, an $n_s \times n_o$ observation matrix O and a reward function $R(o), o \in O$, as described in Section 6.1.2. For the arms defenders activate, defenders get an observation, get the reward associated with the observation, and the state transitions according to $T^1$. Note that the defenders' observation is not the state; instead, it is a random variable conditioned on the state, and it reveals some information about the state. For the arms defenders do not activate, defenders do not get any observation, get reward 0, and the state transitions according to $T^0$.

Since defenders cannot directly observe the state, they maintain a belief b over the states of each target, based on which they make decisions. The belief is updated according to Bayes' rule. The following equation shows the belief update when defenders protect this target (a = 1) and get observation o, or do not protect this target (a = 0):

\[
b'(s') = \begin{cases} \eta \sum_{s \in S} b(s) O_{so} T^1_{ss'}, & a=1, \\ \sum_{s \in S} b(s) T^0_{ss'}, & a=0, \end{cases} \tag{6.1}
\]

where η is the normalization factor. When defenders do not protect this target (a = 0), they have no observation, so their belief is updated according to the state transition rule only. When defenders protect this target (a = 1), their belief is first updated according to their observation o ($b_{new}(s) = \eta b(s) O_{so}$ by Bayes' rule), and the new belief is then updated according to the state transition rule: $b'(s') = \sum_{s \in S} b_{new}(s) T^1_{ss'} = \sum_{s \in S}\eta b(s) O_{so} T^1_{ss'} = \eta\sum_{s \in S} b(s) O_{so} T^1_{ss'}$.

I now present the mathematical definition of the Whittle index for this problem. Denote by $V_m(b)$ the value function for belief state b with subsidy m; by $V_m(b;a=0)$ the value function for belief state b with subsidy m when defenders take the passive action; and by $V_m(b;a=1)$ the value function for belief state b with subsidy m when defenders take the active action. The following equations show these value functions:

\[
V_m(b;a=0) = m + \beta V_m(b_{a=0}),
\qquad
V_m(b;a=1) = \sum_{s \in S} b(s)\sum_{o \in O} O_{so}R(o) + \beta\sum_{o \in O}\sum_{s \in S} b(s)O_{so}V_m(b^o_{a=1}),
\qquad
V_m(b) = \max\{V_m(b;a=0),\,V_m(b;a=1)\}.
\]

When defenders take the passive action, they get the immediate reward m plus the β-discounted future reward, i.e., the value function at the new belief $b_{a=0}$, which is updated from b according to the case a = 0 in Equation 6.1. When defenders take the active action, they get the expected immediate reward $\sum_{s \in S} b(s)\sum_{o \in O} O_{so}R(o)$ plus the β-discounted future reward. The future reward is composed of the different observation cases: $\sum_{s \in S} b(s)O_{so}$ is the defenders' probability of observing o at belief state b, and $V_m(b^o_{a=1})$ is the value function at the new belief $b^o_{a=1}$, which is updated from b according to the case a = 1 with observation o in Equation 6.1. The value function $V_m(b)$ is the maximum of $V_m(b;a=0)$ and $V_m(b;a=1)$.

The Whittle index I(b) of belief state b is then defined to be

\[
I(b) \triangleq \inf_m\{m : V_m(b;a=0) \geq V_m(b;a=1)\}.
\]

The passive action set Φ(m), which is the set of belief states for which the passive action is optimal given subsidy m, is then defined to be

\[
\Phi(m) \triangleq \{b : V_m(b;a=0) \geq V_m(b;a=1)\}.
\]
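The single-target belief update of Equation 6.1 is the basic operation behind everything that follows, so here is a minimal NumPy sketch of it (argument names are mine):

```python
import numpy as np

def update_belief(b, a, T0, T1, O, o=None):
    """Belief update of Equation 6.1 for a single target.
    b: current belief over attack-intensity states; a: 1 if the target was protected;
    o: observation index, required only when a == 1."""
    if a == 0:
        return b @ T0                 # pure prediction step, no observation available
    weighted = b * O[:, o]            # Bayes correction with observation o
    new_b = weighted @ T1             # then predict with the "protected" dynamics T^1
    return new_b / new_b.sum()        # eta: normalization
```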
6.2.2 Sufficient Conditions for Indexability

In this section, I provide two sufficient conditions for indexability when $n_o = 2$ and $n_s = 2$. Denote the transition matrices by $T^0$ and $T^1$ and the observation matrix by O. Clearly in this problem, $O_{11} > O_{01}$ and $O_{00} > O_{10}$ (higher attack intensity leads to a higher probability of seeing attack activities when patrolling); and $T^1_{11} > T^1_{01}$, $T^1_{00} > T^1_{10}$, $T^0_{11} > T^0_{01}$, $T^0_{00} > T^0_{10}$ (positively correlated arms).

Define $\alpha \triangleq \max\{T^0_{11}-T^0_{01},\,T^1_{11}-T^1_{01}\}$. Since it is a two-state problem with S = {0,1}, I use one variable x to represent the belief state: $x \triangleq b(s=1)$, the probability of being in state 1. Define $\Gamma_1(x) = xT^1_{11} + (1-x)T^1_{01}$, which is the belief for the next round if the belief for the current round is x and the active action is taken. Similarly, $\Gamma_0(x) = xT^0_{11} + (1-x)T^0_{01}$, which is the belief for the next round if the belief for the current round is x and the passive action is taken.

I present below two theorems giving two sufficient conditions for indexability. The proofs are in Appendix B.

Theorem 6.2.1. When $\beta \leq 0.5$, the process is indexable, i.e., for any belief x, if $V_m(x;a=0) \geq V_m(x;a=1)$, then $V_{m'}(x;a=0) \geq V_{m'}(x;a=1)$ for all $m' \geq m$.

Theorem 6.2.2. When $\alpha\beta \leq 0.5$ and $\Gamma_1(1) \leq \Gamma_0(0)$, the process is indexable, i.e., for any belief x, if $V_m(x;a=0) \geq V_m(x;a=1)$, then $V_{m'}(x;a=0) \geq V_{m'}(x;a=1)$ for all $m' \geq m$.

6.2.3 Numerical Evaluation of Indexability

For problems other than those proved to be indexable in Section 6.2.2, we can numerically evaluate their indexability. I first provide the following proposition.

Proposition 6.2.3. If $m < R(0) - \beta\frac{R(n_o-1)-R(0)}{1-\beta}$, then $\Phi(m) = \emptyset$; if $m > R(n_o-1)$, then Φ(m) is the whole belief state space.

Proof. Suppose $m < R(0) - \beta\frac{R(n_o-1)-R(0)}{1-\beta}$. Denote $V_m(b;a=0) = m + \beta W_0$ and $V_m(b;a=1) = R(o) + \beta W_1$, where $W_1$ and $W_0$ represent the maximum future rewards. Since $W_0 \leq \frac{R(n_o-1)}{1-\beta}$ (achieving reward $R(n_o-1)$ at every round), $W_1 \geq \frac{R(0)}{1-\beta}$ (achieving reward R(0) at every round), and $R(o) \geq R(0)$, we have $V_m(b;a=1) - V_m(b;a=0) = R(o) - m + \beta(W_1-W_0) \geq R(0) - m + \beta\frac{R(0)-R(n_o-1)}{1-\beta} > 0$. Thus, being active is always the optimal action for any state, so $\Phi(m) = \emptyset$. If $m > R(n_o-1)$, then the strategy of always being passive dominates all other strategies, so Φ(m) is the whole belief state space.

Thus, we only need to determine whether the set Φ(m) increases monotonically for $m \in [R(0) - \beta\frac{R(n_o-1)-R(0)}{1-\beta},\,R(n_o-1)]$. Numerically, we can discretize this bounded m range and then evaluate whether Φ(m) increases monotonically with the discretized m. Given the subsidy m, Φ(m) can be determined by solving a special POMDP model whose conditional observation probability depends on the start state and action; I discuss the algorithm in detail in Section 6.3. This algorithm returns a set D which contains $n_s$-length vectors $d_1,d_2,\dots,d_{|D|}$. Every vector $d_i$ is associated with an optimal action $e_i$. Given the belief b, the optimal action is determined by $a_{opt} = e_i$, where $i = \arg\max_j b^T d_j$. Thus, $\Phi(m) = \bigcup_{i:e_i=0}\{b : b^T d_i \geq b^T d_j, \forall j\}$.

Given $m_0 < m_1$, my aim is to check whether $\Phi(m_0) \subseteq \Phi(m_1)$. Use the superscript 0 or 1 on the set D, the vectors d and the actions e to distinguish the returned solutions with subsidies $m_0$ and $m_1$. The following mixed-integer linear program (MILP) can be used to determine whether $\Phi(m_0) \subseteq \Phi(m_1)$:

\[
\begin{aligned}
\min_{b,\,z^0,\,z^1,\,\xi^0,\,\xi^1} \quad & \sum_{i=1}^{|D^0|} z^0_i e^0_i - \sum_{i=1}^{|D^1|} z^1_i e^1_i \\
\text{s.t.}\quad & b_i \in [0,1],\ \forall i \in S, \qquad \sum_{i \in S} b_i = 1 \\
& z^0_i \in \{0,1\},\ \forall i \in \{1,\dots,|D^0|\}, \qquad \sum_i z^0_i = 1 \\
& b^T d^0_i \leq \xi^0,\ \forall i \in \{1,\dots,|D^0|\} \\
& \xi^0 \leq b^T d^0_i + M(1-z^0_i),\ \forall i \in \{1,\dots,|D^0|\} \\
& z^1_i \in \{0,1\},\ \forall i \in \{1,\dots,|D^1|\}, \qquad \sum_i z^1_i = 1 \\
& b^T d^1_i \leq \xi^1,\ \forall i \in \{1,\dots,|D^1|\} \\
& \xi^1 \leq b^T d^1_i + M(1-z^1_i),\ \forall i \in \{1,\dots,|D^1|\}
\end{aligned}
\]

If the result of the above MILP is 0 or 1, then $\Phi(m_0) \subseteq \Phi(m_1)$. In the MILP, M is a given large number, b is the belief state, $z^{0/1}_i$ is a binary variable indicating whether $b^T d^{0/1}_i \geq b^T d^{0/1}_j, \forall j$ (1 indicates yes and 0 indicates no), $\xi^{0/1}$ is an auxiliary variable that equals $\max_i b^T d^{0/1}_i$, and $\sum_{i} z^{0/1}_i e^{0/1}_i$ is the optimal action for the problem with subsidy $m_{0/1}$. If the result of this MILP is 0 or 1, it means that there does not exist a belief b under which the optimal action for the problem with subsidy $m_0$ is passive (0) while the optimal action for the problem with subsidy $m_1$ is active (1). This means $\Phi(m_0) \subseteq \Phi(m_1)$.
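For completeness, the sketch below encodes this MILP with the PuLP modeling library. The arguments D0/D1 (lists of alpha-vectors), e0/e1 (their optimal actions) and the big-M constant are placeholders for the special-POMDP solver output; the thesis does not prescribe a particular MILP package, so this is only one possible encoding.

```python
import pulp

def phi_subset_check(D0, e0, D1, e1, ns, M=1e4):
    """Check whether Phi(m0) is a subset of Phi(m1) via the MILP above.
    A minimum objective value of -1 means a violating belief exists; 0 or 1 means subset."""
    prob = pulp.LpProblem("indexability_check", pulp.LpMinimize)
    b = [pulp.LpVariable(f"b_{i}", lowBound=0, upBound=1) for i in range(ns)]
    z0 = [pulp.LpVariable(f"z0_{i}", cat="Binary") for i in range(len(D0))]
    z1 = [pulp.LpVariable(f"z1_{i}", cat="Binary") for i in range(len(D1))]
    xi0, xi1 = pulp.LpVariable("xi0"), pulp.LpVariable("xi1")
    # objective: action chosen under subsidy m0 minus action chosen under subsidy m1
    prob += pulp.lpSum(z0[i] * e0[i] for i in range(len(D0))) - \
            pulp.lpSum(z1[i] * e1[i] for i in range(len(D1)))
    prob += pulp.lpSum(b) == 1
    prob += pulp.lpSum(z0) == 1
    prob += pulp.lpSum(z1) == 1
    for i, d in enumerate(D0):
        dot = pulp.lpSum(d[s] * b[s] for s in range(ns))
        prob += dot <= xi0
        prob += xi0 <= dot + M * (1 - z0[i])
    for i, d in enumerate(D1):
        dot = pulp.lpSum(d[s] * b[s] for s in range(ns))
        prob += dot <= xi1
        prob += xi1 <= dot + M * (1 - z1[i])
    prob.solve()
    return pulp.value(prob.objective) >= -0.5   # True iff Phi(m0) subset of Phi(m1)
```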
6.2.4 Computation of the Whittle Index Policy

Given indexability, the Whittle index can be found by a binary search within the range $m \in [R(0) - \beta\frac{R(n_o-1)-R(0)}{1-\beta},\,R(n_o-1)]$. Given the upper bound ub and lower bound lb, the problem with the midpoint $\frac{lb+ub}{2}$ as the passive subsidy is sent to the special POMDP solver to find the optimal action for the current belief. If the optimal action is active, then the Whittle index is greater than the midpoint, so $lb \leftarrow \frac{lb+ub}{2}$; otherwise $ub \leftarrow \frac{lb+ub}{2}$. This binary search algorithm can find the Whittle index with arbitrary precision. Naively, we could compute the Whittle index policy by computing the ε-precision indices of all arms and then picking the k arms with the highest indices. However, since we are actually only interested in which k arms have the highest Whittle indices and do not care what exactly those indices are, we can do better than this naive method, as demonstrated in Algorithm 8.

In Algorithm 8, A is the Whittle index policy to be returned and is set to ∅ at the beginning. S is the set of arms for which we do not yet know whether they belong to A, and is set to the whole set of arms at the beginning. Until the top-k arms are found (the loop between Line 4 and Line 21), the algorithm tests every arm in S for its optimal action with subsidy $\frac{lb+ub}{2}$. If the optimal action is 1, this arm's index is higher than $\frac{lb+ub}{2}$ and we add it to $S_1$; if the optimal action is 0, this arm's index is lower than $\frac{lb+ub}{2}$ and we add it to $S_0$ (Lines 6-13). At this point, we know that all arms in $S_1$ have higher indices than all arms in $S_0$. If there is enough space in A to include all arms in $S_1$, we add $S_1$ to A, remove those arms from S and set the upper bound to $\frac{lb+ub}{2}$, because we already know that $S_1$ belongs to the Whittle index policy set and all remaining arms have indices lower than $\frac{lb+ub}{2}$ (Lines 14-16). If there is not enough space in A, we remove $S_0$ from S and set the lower bound to $\frac{lb+ub}{2}$, because we already know that $S_0$ does not belong to the Whittle index policy set and all remaining arms have indices higher than $\frac{lb+ub}{2}$ (Lines 17-19).

Algorithm 8 Algorithm to Compute Whittle Index Policy
1: function FindWhittleIndexPolicy
2:   lb ← $R(0) - \beta\frac{R(n_o-1)-R(0)}{1-\beta}$, ub ← $R(n_o-1)$
3:   A ← ∅, S ← {1,2,...,n}
4:   while |A| < k do
5:     S_1 ← ∅, S_0 ← ∅
6:     for i ∈ S do
7:       a_opt ← POMDPSolve(P_i, (lb+ub)/2)
8:       if a_opt = 1 then
9:         S_1 ← S_1 ∪ {i}
10:      else
11:        S_0 ← S_0 ∪ {i}
12:      end if
13:    end for
14:    if |S_1| ≤ k − |A| then
15:      A ← A ∪ S_1, S ← S − S_1
16:      ub ← (lb+ub)/2
17:    else
18:      S ← S − S_0
19:      lb ← (lb+ub)/2
20:    end if
21:  end while
22:  return A
23: end function
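The single-arm binary search described above is small enough to sketch directly. Here, pomdp_optimal_action is an assumed helper that wraps the special POMDP solver of Section 6.3 and returns the optimal action (0 or 1) at the given belief under passive subsidy m.

```python
def whittle_index(belief, R, beta, no, pomdp_optimal_action, eps=1e-3):
    """Binary search for the Whittle index of one arm at a given belief.
    R: length-n_o reward vector; beta: discount factor; no: number of observation levels."""
    lb = R[0] - beta * (R[no - 1] - R[0]) / (1.0 - beta)
    ub = R[no - 1]
    while ub - lb > eps:
        m = (lb + ub) / 2.0
        if pomdp_optimal_action(belief, m) == 1:
            lb = m          # active is optimal: the index lies above the midpoint
        else:
            ub = m          # passive is optimal: the index lies at or below the midpoint
    return (lb + ub) / 2.0
```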
6.3 Computation of the Passive Action Set

In this section, I discuss the algorithm to compute the passive action set Φ(m) for a given subsidy m. This problem can be viewed as solving a special POMDP model whose conditional observation probability depends on the start state and the action, whereas the conditional observation probability depends on the end state and the action in standard POMDPs. Figure 6.2 illustrates the difference: the left panel represents special POMDPs and the right panel represents standard POMDPs. In both cases, the original state is s, the agent takes action a, and the state transitions to s' according to P(s'|s,a). However, the observation o the agent gets during this process depends on s and a in my special POMDPs, while it depends on s' and a in standard POMDPs.

Figure 6.2: Special POMDPs vs Standard POMDPs (in the special POMDP the observation is drawn from P(o|s,a); in the standard POMDP it is drawn from P(o|s',a)).

Despite this difference, the solution concept of the value iteration algorithm for standard POMDPs can be used to solve my special POMDP formulations with appropriate modifications. I discuss the special POMDP formulation for my problem in Section 6.3.1 and present the modified value iteration algorithm in Section 6.3.2.

6.3.1 Special POMDP Formulation

The special POMDP formulation for my problem is straightforward.

State space: The state space is $S = \{0,1,\dots,n_s-1\}$.

Action space: The action space is A = {0,1}, where a = 0 represents the passive action (do not protect) and a = 1 represents the active action (protect).

Observation space: The observation space is $O = \{-1,0,1,\dots,n_o-1\}$. It adds a "fake" observation o = −1 to represent having no observation when taking action a = 0. It is called "fake" because defenders observe o = −1 with probability 1, no matter what the state is, when they take action a = 0, so this observation provides no information. When defenders take action a = 1, they may observe observations in $O \setminus \{-1\}$.

Conditional transition probability: The conditional transition probability P(s'|s,a) is defined as $P(s'=j \mid s=i,a=1) = T^1_{ij}$ and $P(s'=j \mid s=i,a=0) = T^0_{ij}$.

Conditional observation probability: The conditional observation probability P(o|s,a) is defined as $P(o=-1 \mid s,a=0) = 1, \forall s \in S$, and $P(o=j \mid s=i,a=1) = O_{ij}$. Note that the conditional observation probability here depends on the start state s and action a, while it depends on the end state s' and action a in standard POMDP models. Intuitively, the defenders' observation of attack activities today depends on the attack intensity today, not on the transitioned attack intensity tomorrow.

Reward function: The reward function R is

\[
R(s,s',a,o) = \begin{cases} 0, & a=0, \\ R(o), & a=1. \end{cases}
\]

With the transition probability and observation probability, R(s,a) can be computed; note that this formulation is also slightly different because of the different definition of the observation probability:

\[
R(s,a) = \sum_{s' \in S} P(s' \mid s,a) \sum_{o \in O} P(o \mid s,a) R(s,s',a,o).
\]

6.3.2 Value Iteration for My Special POMDP

Differently from the standard POMDP formulation, the belief update in the special POMDP formulation is

\[
b'(s') = \frac{\sum_{s \in S} b(s) P(o \mid s,a) P(s' \mid s,a)}{P(o \mid b,a)}, \tag{6.2}
\]

where

\[
P(o \mid b,a) = \sum_{s' \in S}\sum_{s \in S} b(s) P(o \mid s,a) P(s' \mid s,a) = \sum_{s \in S} b(s) P(o \mid s,a).
\]

Note that this belief update process is consistent with that in Equation 6.1. As in the standard POMDP formulation, we have the value function

\[
V'(b) = \max_{a \in A}\left( \sum_{s \in S} b(s) R(s,a) + \beta\sum_{o \in O} P(o \mid b,a) V(b^o_a) \right),
\]

which can be broken up into simpler combinations of other value functions:

\[
V'(b) = \max_{a \in A} V_a(b), \qquad V_a(b) = \sum_{o \in O} V^o_a(b), \qquad V^o_a(b) = \frac{\sum_{s \in S} b(s) R(s,a)}{|O|} + \beta P(o \mid b,a) V(b^o_a).
\]

All the value functions can be represented as $V(b) = \max_{\alpha \in D} b \cdot \alpha$, since the update process maintains this property, so we only need to update the set D when updating the value function. The set D is updated according to the following process:

\[
D' = \text{purge}\Big(\bigcup_{a \in A} D_a\Big), \qquad D_a = \text{purge}\Big(\bigoplus_{o \in O} D^o_a\Big), \qquad D^o_a = \text{purge}(\{\tau(\alpha,a,o) \mid \alpha \in D\}),
\]

where $\tau(\alpha,a,o)$ is the |S|-vector given by

\[
\tau(\alpha,a,o)(s) = \frac{1}{|O|} R(s,a) + \beta P(o \mid s,a)\sum_{s' \in S}\alpha(s')P(s' \mid s,a),
\]

purge(·) takes a set of vectors and reduces it to its unique minimal form (removing redundant vectors that are dominated by other vectors in the set), and $\oplus$ represents the cross sum of two sets of vectors: $A \oplus B = \{\alpha+\beta \mid \alpha \in A, \beta \in B\}$.

The updates of D' and $D_a$ are intuitive, so I briefly explain the update of $D^o_a$ here (in fact, the only difference between value iteration for the special POMDP formulation and for the standard POMDP formulation is this update of $D^o_a$):

\[
P(o \mid b,a)V(b^o_a) = P(o \mid b,a)\max_{\alpha \in D}\sum_{s' \in S}\alpha(s')P(s' \mid b,a,o) = P(o \mid b,a)\max_{\alpha \in D}\sum_{s' \in S}\alpha(s')\frac{\sum_{s \in S} b(s)P(o \mid s,a)P(s' \mid s,a)}{P(o \mid b,a)} = \max_{\alpha \in D}\sum_{s' \in S}\alpha(s')\sum_{s \in S} b(s)P(o \mid s,a)P(s' \mid s,a) = \max_{\alpha \in D}\sum_{s \in S} b(s)\cdot\Big(P(o \mid s,a)\sum_{s' \in S}\alpha(s')P(s' \mid s,a)\Big).
\]

Here, P(s'|b,a,o) is the belief of state s' in the next round when the belief in the current round is b, the agent takes action a and gets observation o, which is the b'(s') in Equation 6.2.
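As a quick illustration of this backup, the sketch below computes one τ(α, a, o) vector with NumPy. The array names (R_sa, P_trans, P_obs) are mine, and the fake observation o = −1 is assumed to be stored as an ordinary column of the observation matrix; this is a sketch of the backup formula, not the modified solver itself.

```python
import numpy as np

def tau(alpha, a, o, R_sa, P_trans, P_obs, beta):
    """Compute the vector tau(alpha, a, o) of the modified value-iteration backup.
    alpha:   length-|S| vector from the current set D.
    R_sa:    |S| x |A| matrix of expected immediate rewards R(s, a).
    P_trans: list indexed by action of |S| x |S| matrices P(s' | s, a).
    P_obs:   list indexed by action of |S| x |O| matrices P(o | s, a)."""
    n_obs = P_obs[a].shape[1]
    future = P_trans[a] @ alpha        # sum_{s'} alpha(s') P(s' | s, a), one entry per s
    return R_sa[:, a] / n_obs + beta * P_obs[a][:, o] * future
```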
6.4 Planning from POMDP View

I discussed in Section 6.3.1 that every single target can be modeled as a special POMDP. Given that, we can combine these POMDP models at all targets to form one special POMDP that describes the whole problem, and solving this special POMDP yields the defenders' exact optimal strategy. Use the superscript $i$ to denote target $i$. The POMDP model for the whole problem is the cross product of the single-target POMDP models at all targets, with the constraint that only $k$ targets are protected in every round.

State space: The state space is $S = S^1 \times S^2 \times \dots \times S^n$. Denote $s = (s^1, s^2, \dots, s^n)$.

Action space: The action space is $A = \{(a^1, a^2, \dots, a^n) \mid a^j \in \{0,1\}\ \forall j \in N,\ \sum_{j \in N} a^j = k\}$, which encodes that only $k$ targets can be protected in a round. Denote $a = (a^1, a^2, \dots, a^n)$.

Observation space: The observation space is $O = O^1 \times O^2 \times \dots \times O^n$. Denote $o = (o^1, o^2, \dots, o^n)$.

Conditional transition probability: $P(s'|s,a) = \prod_{j \in N} P^j(s'^j \mid s^j, a^j)$.

Conditional observation probability: $P(o|s,a) = \prod_{j \in N} P^j(o^j \mid s^j, a^j)$.

Reward function: $R(s,s',a,o) = \sum_{j \in N} R(s^j, s'^j, a^j, o^j)$.

Naively, the modified value iteration algorithm discussed in Section 6.3.2 could be used to solve this special POMDP formulation. However, this formulation suffers from the curse of dimensionality: the problem size increases exponentially with the number of targets. Thus, the computational cost of value iteration soon becomes unaffordable as the problem size grows. Silver and Veness [53] have proposed the POMCP algorithm, which provides high-quality solutions and is scalable to large POMDPs. The POMCP algorithm only requires a simulator of the problem, so it also applies to my special POMDPs. At a high level, the POMCP algorithm is composed of two parts: (i) it uses a particle filter to maintain an approximation of the belief state; (ii) it draws state samples from the particle filter and then uses MCTS to simulate what will happen next to find the best action. It uses a particle filter to approximate the belief state because in many problems it is computationally infeasible even to update the belief state exactly, due to the extremely large size of the state space. However, in my problem, the all-target POMDP model is the cross product of the single-target POMDP models at all targets. Each single-target POMDP model is small, so it is computationally inexpensive to maintain its exact belief state. Thus, we can easily sample the state $s^i$ at target $i$ from its belief state and then compose these samples to obtain the state sample $s = (s^1, s^2, \dots, s^n)$ for the all-target POMDP model. The details of MCTS in POMDPs are available in [53], so I omit them here.
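A minimal sketch of this state-composition step is shown below; beliefs is an assumed list holding the exact belief vector of each single-target POMDP, and the function is only meant to illustrate how a joint state sample for POMCP's tree search could be drawn, not to reproduce the thesis implementation.

import numpy as np

rng = np.random.default_rng(0)

def sample_joint_state(beliefs):
    # draw each target's state from its own exact belief and compose the results
    return tuple(int(rng.choice(len(b), p=b)) for b in beliefs)

# example: three targets, each with two attack-intensity states
beliefs = [np.array([0.7, 0.3]), np.array([0.1, 0.9]), np.array([0.5, 0.5])]
joint_state = sample_joint_state(beliefs)   # e.g. (0, 1, 1)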
Although the POMCP algorithm shows better scalability than the exact POMDP algorithm, its scalability is still limited, because in my problem the action space and observation space are also exponential in $k$. Consider the problem instance with $n = 10$, $k = 3$ and $n_o = 2$: the number of actions is $\binom{10}{3} = \frac{10 \cdot 9 \cdot 8}{1 \cdot 2 \cdot 3} = 120$ and the number of observations is $\binom{10}{3} \cdot 2^3 = 960$. Since actions and observations are the branches in the MCTS, the tree soon becomes extremely large when planning more rounds ahead. This leads to two problems: (i) the algorithm quickly runs out of memory when planning more rounds ahead; (ii) a huge number of state samples is needed to establish convergence. Thus, the POMCP algorithm only applies to problem instances with small $k$. The experimental evaluation shows that the POMCP algorithm is unable to plan 3 rounds forward (it runs out of memory) for the problem instance with $n = 10$, $k = 3$ and $n_o = 2$. This means that for large problem instances, the POMCP algorithm reduces to the myopic policy (looking only one round ahead when planning).

6.5 Experimental Evaluation

In this section, I first evaluate the Whittle index policy in Section 6.5.1 and then evaluate the RMAB model in Section 6.5.2. Performance is evaluated in terms of the cumulative reward received within the first 20 rounds with discount factor $\beta = 0.9$. All results are averaged over 500 simulation runs.

6.5.1 Evaluation of Whittle Index Policy

I compare the Whittle index policy with four baseline algorithms:

Random: The defenders randomly choose $k$ targets to protect in every round.

Myopic: The defenders choose the $k$ targets with the highest immediate reward to protect in every round.

Exact POMDP: The defenders use the modified value iteration algorithm to solve the special POMDP problem discussed in Section 6.4 to plan patrol strategies in every round. Note that it only works for small-scale problems and yields the exact optimal patrol strategy the defenders may take.

POMCP: The defenders use the POMCP algorithm to solve the special POMDP problem discussed in Section 6.4 to plan patrol strategies in every round.

The computation of the Whittle index policy and the exact POMDP algorithm involves solving special POMDPs using the modified value iteration algorithm discussed in Section 6.3.2. I implement the modified value iteration algorithm by modifying the POMDP solver written by Anthony R. Cassandra (http://pomdp.org/code/). The detailed algorithm I use for value iteration is the incremental pruning algorithm [10].

There are two parameters in the POMCP algorithm: the number of state samples and the depth of the tree, i.e., the number of rounds we look ahead when planning. As the number of state samples increases, the performance of the POMCP algorithm improves, while the runtime also increases. Thus, for a fair comparison, I choose the number of state samples so that POMCP's runtime is similar to that of the Whittle index policy. For the depth of the tree, I choose the value with the largest cumulative reward.

Small Scale: Comparison with the Exact POMDP Algorithm. I evaluate these five planning algorithms in a small problem instance with $n = 2$, $k = 1$, $n_s = 2$ and $n_o = 2$. The result is shown in Table 6.1. From the table, we can see that my Whittle index policy and the POMCP algorithm perform very close to the optimal exact POMDP solution and are much better than the myopic policy and the random policy, demonstrating their high solution quality.

Table 6.1: Planning Algorithm Evaluation in Solution Quality for Small-scale Problem Instances
  Random          2.6534
  Myopic Optimal  3.1384
  POMCP           3.1694
  Exact POMDP     3.1798
  Whittle Index   3.1740
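For reference, the evaluation protocol of Section 6.5 can be sketched as follows, assuming that the quantity averaged over runs is the discounted sum $\sum_t \beta^t r_t$ over the first 20 rounds; run_episode is a hypothetical black-box simulator that returns the per-round rewards of one run and stands in for the game simulator, which is not part of the thesis text.

import numpy as np

def evaluate(policy, run_episode, beta=0.9, horizon=20, runs=500, seed=0):
    # average of the discounted cumulative reward over `runs` simulated runs
    rng = np.random.default_rng(seed)
    totals = []
    for _ in range(runs):
        rewards = run_episode(policy, horizon, rng)     # per-round rewards of one run
        totals.append(sum((beta ** t) * r for t, r in enumerate(rewards)))
    return float(np.mean(totals))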
Large Scale: I then evaluate my planning algorithms in a larger problem instance with $n = 10$. Figure 6.3(a) shows the solution quality comparison when $n_s = 2$ and $n_o = 2$; the x-axis shows the number of defenders ($k$) and the y-axis shows the cumulative reward. From this figure, we can see that the Whittle index policy performs better than the POMCP algorithm and the myopic policy, and all three of these algorithms perform much better than the random policy. One thing to note is that the POMCP algorithm shows poor scalability with respect to $k$: it is unable to plan 3 rounds forward (it runs out of memory) with $k = 3$. Figure 6.3(b) shows the solution quality comparison when $n_s = 3$ and $n_o = 3$, and it exhibits similar patterns to Figure 6.3(a).

[Figure 6.3: Planning Algorithm Evaluation in Solution Quality for Large-scale Problem Instances. Panel (a): $n_s = 2$, $n_o = 2$; panel (b): $n_s = 3$, $n_o = 3$. Both panels plot cumulative reward against the number of defenders ($k$) for Random, Myopic Optimal, POMCP and the Whittle Index Policy.]

An Example When Myopic Policy Fails: We can see from Figures 6.3(a) and 6.3(b) that the myopic policy performs only slightly worse than the Whittle index policy. Here I provide an example where the myopic policy performs significantly worse. Consider the case with 2 targets and 1 defender.

For target 0:
$T^0 = \begin{pmatrix} 0.95 & 0.05 \\ 0.05 & 0.95 \end{pmatrix}$, $T^1 = \begin{pmatrix} 0.99 & 0.01 \\ 0.1 & 0.9 \end{pmatrix}$, $O = \begin{pmatrix} 0.9 & 0.1 \\ 0.2 & 0.8 \end{pmatrix}$

For target 1:
$T^0 = \begin{pmatrix} 0.4 & 0.6 \\ 0.1 & 0.9 \end{pmatrix}$, $T^1 = \begin{pmatrix} 0.7 & 0.3 \\ 0.4 & 0.6 \end{pmatrix}$, $O = \begin{pmatrix} 0.7 & 0.3 \\ 0.3 & 0.7 \end{pmatrix}$

Figure 6.4 shows the performance of the different algorithms. In this case, the myopic policy performs similarly to the random policy and is much worse than the Whittle index policy.

[Figure 6.4: Example when Myopic Policy Fails. Cumulative reward of Random, Myopic Optimal, POMCP (planning horizons 1-3), Whittle Index and Exact POMDP on this instance.]

Runtime Analysis of Whittle Index Policy: Figure 6.5 analyzes the runtime of the Whittle index policy. The x-axis shows the number of targets ($n$) and the y-axis shows the average runtime. From the figure, we can see that the runtime increases linearly with the number of targets. This is because the Whittle index policy reduces an $n$-dimensional problem to $n$ one-dimensional problems, so the complexity is linear in $n$. Another observation is that, for a given $n$, the number of defenders ($k$) does not affect the runtime much.

[Figure 6.5: Runtime Analysis of Whittle Index Policy ($n_s = 2$, $n_o = 2$). Runtime in seconds against the number of targets ($n$) for $k/n = 0.25$, $0.5$ and $0.75$.]

6.5.2 Evaluation of RMAB Modeling

In this section, I compare my RMAB model with the algorithms (UCB, SWUCB, EXP3) used in [21] against a group of simulated attackers. Performance is evaluated in terms of the cumulative reward received within the first 20 rounds after several rounds of learning ($\beta = 0.9$).

Figure 6.6(a) demonstrates how the performance changes with different numbers of learning rounds and different $n_s$ ($n_o$). It shows that when the number of learning rounds is smaller (100), the model with $n_s = n_o = 2$ performs the best. This is because the models with higher $n_s$ ($n_o$) overfit the limited data at this point. When data is relatively abundant (1000 learning rounds), the models with higher $n_s$ ($n_o$) perform better. However, the difference is not significantly large.
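As context for the baseline comparison that follows, a generic UCB-style selection rule is sketched below, adapted here to pick $k$ targets per round. The exact UCB, SWUCB and EXP3 variants used in the experiments are those of Klima et al. [21]; this sketch should not be read as their implementation.

import math

def ucb_topk(counts, means, t, k, c=math.sqrt(2)):
    """Pick k targets by UCB score: empirical mean plus an exploration bonus.
    counts[i] is how often target i was patrolled, means[i] its average observed
    reward, and t the current round (1-based); unvisited targets are tried first."""
    scores = [float('inf') if counts[i] == 0
              else means[i] + c * math.sqrt(math.log(t) / counts[i])
              for i in range(len(counts))]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]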
Figure 6.6(b) shows the comparison of my RMAB model with the Random/UCB/SWUCB/EXP3 algorithms. When the number of learning rounds is smaller (100), my RMAB model performs similarly to the UCB algorithm and better than the other algorithms. When the number of learning rounds becomes larger, my RMAB model shows a significant advantage over the other algorithms.

[Figure 6.6: Evaluation of RMAB Modeling. Panel (a): cumulative reward of the RMAB model for $n_s = n_o = 2, 3, 4, 5$ against the number of learning rounds (100, 500, 1000). Panel (b): cumulative reward of Random, EXP3, SWUCB, UCB and RMAB ($n_s = 2$, $n_o = 2$) against the number of learning rounds.]

Chapter 7 Conclusion

7.1 Contributions

My contributions include addressing uncertainty in attackers' preferences using robust and learning approaches. My first contribution develops an algorithm to efficiently compute a robust strategy against risk-aware attackers in SSGs. My second contribution models the preference as payoffs and focuses on learning the payoffs and then planning accordingly in green security domains. My third contribution models the preference as a Markovian process that transitions according to the defender's strategies, in order to handle the exploration-exploitation tradeoff in these domains.

Robust Strategy against Risk-aware Attackers in SSGs: My first contribution focuses on handling the attacker's risk-preference uncertainty in security games with robust approaches. I compute a robust defender strategy that optimizes the worst case against risk-aware attackers with uncertainty in the degree of risk awareness [1], i.e., it provides a solution quality guarantee for the defender no matter how risk-aware the attacker is.
129 I model these interactions between the defender and the attacker as a repeated game. I then adopt a fixed model for the attacker’s behavior and recast this repeated game as a partially observable Markov decision process (POMDP). However, my POMDP formula- tionhasanexponentialnumberofstates, makingcurrentPOMDPsolverslikeZMDP[54] and APPL [25] infeasible in terms of computational cost. Silver and Veness [53] have proposed the POMCP algorithm which achieves a high level of performance in large POMDPs. It uses particle filtering to maintain an approximation of the belief state of theagent, andthenusesMonteCarloTreeSearch (MCTS)foronlineplanning. However, theparticle filter is only an approximation of thebelief state. By appealingtothespecial propertiesofmyPOMDP,IproposetheGMOPalgorithm(GibbssamplingbasedMCTS Online Planning) which draws samples directly from the exact belief state using Gibbs sampling and then runs MCTS for online planning. My algorithm provides higher solu- tion quality thanthePOMCPalgorithm. Additionally, foraspecificsubclassof mygame with an attacker who plays a best response against the defender’s empirical distribution, and a uniform penalty of being seized across all targets, I provide an advanced sampling techniquetospeeduptheGMOPalgorithmalongwithaheuristicthattradesoffsolution quality for lower computational cost. Moreover, I explore the case of continuous utilities where my original POMDP formulation becomes a continuous-state POMDP, which is generally difficult to solve. However, the special properties in the specific subclass of game mentioned above make possible the extension of the GMOP algorithm to continu- ous utilities. Finally, I explore the more realistic scenario where the defender is not only uncertain about the distribution of resources, but also uncertain about the attacker’s 130 behavioral model. I address this challenge by extending my POMDP formulation and the GMOP algorithm. There are two assumptions in this model: (i) both the defender and the attacker are able to observe their opponent’s actions even if they are protecting/attacking different targets; (ii) the defender knows the attacker’s behavioral model (or several candidate behavioral models). These two assumptions may not hold in some real-world domains. Thus in response, I have the third contribution that do not need these two assumptions. Learning Attacker’s Preference — Markovian Modeling My second contri- bution assumes that defenders have knowledge of all poaching activities throughout the wildlife protected area. Unfortunately, given vast geographic areas for wildlife protec- tion, defendersdonot have knowledge of poachingactivities in areas they donot protect. My third contribution then relaxes this assumption by modeling attacker’s preference as Markovian processes. Iassumethathowpatrolaffectsattack activitiescanbeassumedtofollowaconsistent pattern that can be learned from historical data (defenders’ historical observations). I modelthispattern as aMarkovian processandprovidethefollowing contributions inthis work. First, I formulate the problem into a restless multi-armed bandit (RMAB) model to handle the limited observability challenge — defenders do not have observations for arms they do not activate (targets they do not protect). Second, I propose an EM based learning algorithm to learn the RMAB model from defenders’ historical observations. Third, I use the solution concept of Whittle index policy to solve the RMAB model to plan for defenders’ patrol strategies. 
However, indexability is required for the existence of Whittle index, so I provide two sufficient conditions for indexability and an algorithm 131 to numerically evaluate indexability. Fourth, I propose a binary search based algorithm to find the Whittle index policy efficiently. Inthiscontribution,itdoesnotassumefullobservabilityorattacker’sexactbehavioral model. Instead, it assumes that the attacker’s behavior follows certain pattern that does not change rapidly over time since all the planning algorithm is based on the adversary model learnedfrom attacker’s previousactions. Inaddition, sincetheplanningalgorithm fully trusts the learned adversary model, it assumes that the learned adversary model is correct. One possible extension is to include some robustness in the learned adversary model. To summarize, since my thesis have two different contributions on learning, and they are also related to thegreen security game modeling [12,58], I show in thefollowing table the comparison between these models. Table 7.1: Comparison Between Different Models AAMAS’14 AAMAS’16 Green Security Game Strategy pure strategy pure strategy mixed strategy Observability by defender full observability limited observability full observability Observability by attacker full observability limited observability full observability assumption on behavioral models QR no SUQR These models have their advantages and disadvantages, and their combination may lead to possible directions for future work. For example, although the AAMAS’14 model is unable to handle the limited observability challenge and it assumes for attacker’s be- havioral model, it provides a more precise prediction result for attacker’s preference. Similarly, although the mixed strategy setting in the green security game modeling has its limitations in the real world, its combination with my model might be possible to improve the performance. 132 7.2 Future Work My thesis has discussed the algorithm to handle attacker’s preference with robust and learning approaches. However, these two methods are all passively responding to at- tacker’s preferences. Thus, an interesting question to ask is — is it possible for the defender to “manipulate” attacker’s actions utilizing their preferences? An intuitive idea isthatthedefendermay“manipulate”theattacker’sactionsby“manipulating”attacker’s penalties in domains where the attacker’s penalties are determined by the defender. For example, in the domain of protecting fish, some areas may be more “important” so that thedefendermaysetthepenaltieshigherintheseareas. Therefore,itbecomesachallenge for the defender as to how to set the penalties optimally to get a higher reward? One other possible direction for future work is to combine the payoff uncertainty and attacker’s risk attitude uncertainty together. Previous research [19] proposes the algorithm to compute the robust strategy against payoff uncertainty and my work [51] discusses the algorithm to compute the robust strategy against risk attitude uncertainty. Consideringthefact thatthesetwouncertainties mayexist atthesametimeinreal-world applications, a necessary next step is to develop an algorithm that can handle these two uncertainties at the same time. Another possible direction for futurework concerns with further improvements of the model for green security domains. My current model is based on a set of simplifying assumptions about the domain. For example, it discretizes the forest into a couple of gridsandviews each grid asasingle target. 
Inthisway, it ignores thegeometric relations between different grids — the poaching activities at neighboring grids may becorrelated. 133 Moreover, it assumes that the defender protects a grid at every round. However, the defender may take a patrol route that goes across several targets at every round. A necessary next step is to take those domain features into account. 134 Bibliography [1] Michele Aghassi and Dimitris Bertsimas. Robust game theory. Mathematical Pro- gramming, 2006. [2] David J Agnew, John Pearce, Ganapathiraju Pramod, Tom Peatman, Reg Watson, John R Beddington, and Tony J Pitcher. Estimating the worldwide extent of illegal fishing. PLoS One, 2009. [3] PS Ansell, Kevin D Glazebrook, Jos´ e Ni˜ no-Mora, and M O’Keeffe. Whittle’s index policy for a multi-class queueing system with convex holding costs. Mathematical Methods of Operations Research, 2003. [4] Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the mul- tiarmed bandit problem. Machine learning, 2002. [5] Peter Auer, Nicolo Cesa-Bianchi, Yoav Freund, and Robert E Schapire. Gambling in a rigged casino: The adversarial multi-armed bandit problem. In Foundations of Computer Science, 1995. Proceedings., 36th Annual Symposium on. IEEE, 1995. [6] Jonathan F Bard. Practical bilevel optimization: algorithms and applications. Springer, 1998. [7] Michael K Block and Vernon E Gerety. Some experimental evidence on differences between student and prisoner reactions to monetary penalties and risk. The Journal of Legal Studies, 1995. [8] Colin Camerer. Behavioral game theory: Experiments in strategic interaction. Princeton University Press, 2003. [9] George Casella and Edward I George. Explaining the gibbs sampler. The American Statistician, 1992. [10] Anthony Cassandra, Michael L Littman, and Nevin L Zhang. Incremental pruning: A simple, fast, exact method for partially observable markov decision processes. In UAI, 1997. [11] VincentConitzerandTuomasSandholm. Computingtheoptimalstrategytocommit to. In EC, 2006. [12] Fei Fang, Peter Stone, andMilindTambe. Whensecurity gamesgogreen: Designing defender strategies to prevent poaching and illegal fishing. In IJCAI, 2015. 135 [13] KD Glazebrook, D Ruiz-Hernandez, and C Kirkbride. Some indexable families of restless bandit problems. Advances in Applied Probability, 2006. [14] Jeffrey Grogger. Certainty vs. severity of punishment. Economic Inquiry, 1991. [15] WilliamBHaskell, DebarunKar,FeiFang,MilindTambe,SamCheung,andLtEliz- abeth Denicola. Robust protection of fisheries with compass. 2014. [16] Maria Hauck and NA Sweijd. A case study of abalone poaching in south africa and its impact on fisheries management. ICES Journal of Marine Science: Journal du Conseil, 1999. [17] Bruce Hoffman. The modern terrorist mindset: Tactics, targets and technologies. Center for the Study of Terrorism and Political Violence, St. Andrews University. As of April, 1997. [18] Matthew P. Johnson, Fei Fang, , and Milind Tambe. Patrol strategies to maximize pristine forest area. In Conference on Artificial Intelligence (AAAI), 2012. [19] Christopher Kiekintveld, Towhidul Islam, and Vladik Kreinovich. Security games with interval uncertainty. In AAMAS, 2013. [20] Christopher Kiekintveld, Manish Jain, Jason Tsai, James Pita, Fernando Ord´ o˜ nez, and Milind Tambe. Computingoptimal randomized resource allocations for massive security games. In AAMAS, 2009. [21] Richard Kl´ ıma, Christopher Kiekintveld, and Viliam Lis` y. Online learning methods for border patrol resource allocation. 
In this way, it ignores the geometric relations between different grid cells: the poaching activities at neighboring cells may be correlated. Moreover, it assumes that the defender protects a single cell in every round. However, the defender may take a patrol route that goes across several targets in every round. A necessary next step is to take these domain features into account.

Bibliography

[1] Michele Aghassi and Dimitris Bertsimas. Robust game theory. Mathematical Programming, 2006.
[2] David J. Agnew, John Pearce, Ganapathiraju Pramod, Tom Peatman, Reg Watson, John R. Beddington, and Tony J. Pitcher. Estimating the worldwide extent of illegal fishing. PLoS One, 2009.
[3] P. S. Ansell, Kevin D. Glazebrook, José Niño-Mora, and M. O'Keeffe. Whittle's index policy for a multi-class queueing system with convex holding costs. Mathematical Methods of Operations Research, 2003.
[4] Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 2002.
[5] Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. Gambling in a rigged casino: The adversarial multi-armed bandit problem. In Foundations of Computer Science, 1995. Proceedings., 36th Annual Symposium on. IEEE, 1995.
[6] Jonathan F. Bard. Practical Bilevel Optimization: Algorithms and Applications. Springer, 1998.
[7] Michael K. Block and Vernon E. Gerety. Some experimental evidence on differences between student and prisoner reactions to monetary penalties and risk. The Journal of Legal Studies, 1995.
[8] Colin Camerer. Behavioral Game Theory: Experiments in Strategic Interaction. Princeton University Press, 2003.
[9] George Casella and Edward I. George. Explaining the Gibbs sampler. The American Statistician, 1992.
[10] Anthony Cassandra, Michael L. Littman, and Nevin L. Zhang. Incremental pruning: A simple, fast, exact method for partially observable Markov decision processes. In UAI, 1997.
[11] Vincent Conitzer and Tuomas Sandholm. Computing the optimal strategy to commit to. In EC, 2006.
[12] Fei Fang, Peter Stone, and Milind Tambe. When security games go green: Designing defender strategies to prevent poaching and illegal fishing. In IJCAI, 2015.
[13] K. D. Glazebrook, D. Ruiz-Hernandez, and C. Kirkbride. Some indexable families of restless bandit problems. Advances in Applied Probability, 2006.
[14] Jeffrey Grogger. Certainty vs. severity of punishment. Economic Inquiry, 1991.
[15] William B. Haskell, Debarun Kar, Fei Fang, Milind Tambe, Sam Cheung, and Lt. Elizabeth Denicola. Robust protection of fisheries with COmPASS. 2014.
[16] Maria Hauck and N. A. Sweijd. A case study of abalone poaching in South Africa and its impact on fisheries management. ICES Journal of Marine Science: Journal du Conseil, 1999.
[17] Bruce Hoffman. The modern terrorist mindset: Tactics, targets and technologies. Center for the Study of Terrorism and Political Violence, St. Andrews University. As of April, 1997.
[18] Matthew P. Johnson, Fei Fang, and Milind Tambe. Patrol strategies to maximize pristine forest area. In Conference on Artificial Intelligence (AAAI), 2012.
[19] Christopher Kiekintveld, Towhidul Islam, and Vladik Kreinovich. Security games with interval uncertainty. In AAMAS, 2013.
[20] Christopher Kiekintveld, Manish Jain, Jason Tsai, James Pita, Fernando Ordóñez, and Milind Tambe. Computing optimal randomized resource allocations for massive security games. In AAMAS, 2009.
[21] Richard Klíma, Christopher Kiekintveld, and Viliam Lisý. Online learning methods for border patrol resource allocation.
Pearce, Janusz Marecki, Milind Tambe, Fernando Ordonez,andSaritKraus. Playinggameswithsecurity: Anefficientexact algorithm for bayesian stackelberg games. In AAMAS, 2008. [43] Judea Pearl. Probabilistic reasoning in intelligent systems: networks of plausible inference. Morgan Kaufmann, 1988. [44] Peter Phillips. The preferred risk habitat of al-qa’ida terrorists. European Journal of Economics, Finance and Administrative Sciences, 2010. [45] Peter Phillips. The preferred risk habitat of al-qa’ida terrorists. European Journal of Economics, Finance and Administrative Sciences, 2010. [46] Peter J Phillips. Applying modern portfolio theory to the analysis of terrorism. computing the set of attack method combinations from which the rational terrorist group will choose in order to maximise injuries and fatalities. Defence and Peace Economics, 2009. [47] Peter J Phillips. Applying modern portfolio theory to the analysis of terrorism. computing the set of attack method combinations from which the rational terrorist group will choose in order to maximise injuries and fatalities. Defence and Peace Economics, 2009. [48] Peter J Phillips. The end of al-qa’ida: rationality, survivability and risk aversion. International Journal of Economic Sciences, 2013. [49] Peter J Phillips. The end of al-qa’ida: rationality, survivability and risk aversion. International Journal of Economic Sciences, 2013. [50] Yundi Qian, William B. Haskell, Albert Xin Jiang, and Milind Tambe. Online planningforoptimalprotectorstrategiesinresourceconservationgames. InAAMAS, 2014. [51] Yundi Qian, William B. Haskell, and Milind Tambe. Robust strategy against un- known risk-averse attackers in security games. In AAMAS, 2015. [52] Yundi Qian, Chao Zhang, Bhaskar Krishnamachari, and Milind Tambe. Restless poachers: Handling exploration-exploitation tradeoffs in security domains. In AA- MAS, 2016. [53] DavidSilverandJoelVeness. Monte-carlo planninginlargepomdps. InNIPS,2010. [54] Trey Smith. Probabilistic Planning for Robotic Exploration. PhD thesis, Carnegie Mellon University, 2007. [55] Bernhard Von Stengel and Shmuel Zamir. Leadership with commitment to mixed strategies. 2004. [56] RichardRWeberandGideonWeiss. Onanindexpolicyforrestlessbandits. Journal of Applied Probability, 1990. 138 [57] Peter Whittle. Restless bandits: Activity allocation in a changing world. Journal of applied probability, 1988. [58] RongYang,BenjaminFord,MilindTambe,andAndrewLemieux. Adaptiveresource allocation for wildlife protection against illegal poachers. In AAMAS, 2014. [59] RongYang,ChristopherKiekintveld,FernandoOrdonez,MilindTambe,andRichard John. Improving resource allocation strategy against human adversaries in security games. In IJCAI, 2011. [60] Zhengyu Yin, Manish Jain, Milind Tambe, and Fernando Ordonez. Risk-averse strategiesforsecuritygameswithexecutionandobservationaluncertainty. InAAAI, 2011. [61] Zhengyu Yin, Albert Jiang, Matthew Johnson, Milind Tambe, Christopher Kiek- intveld, Kevin Leyton-Brown, Tuomas Sandholm, and John Sullivan. Trusts: Scheduling randomized patrols for fare inspection in transit systems. In IAAI, 2012. [62] Zhengyu Yin and Milind Tambe. A unified method for handling discrete and con- tinuous uncertainty in bayesian stackelberg games. In AAMAS, 2012. 139 Appendix A An Example of BeRRA Algorithm Consider the example below with 3 targets and 1 resource. 
Table A.1: Example of BeRRA Algorithm Target U u d U c d U u a U c a t 0 −26 39 18 −14 t 1 −25 15 25 −27 t 2 −39 39 30 −24 In Algorithm 1, the upper bound for defender’s reward is 39 and the lower bound is−39 at the beginning. Then it tries to find whether the reward 0 (the midpoint of upper bound and lower bound) is achievable with 1 resource. Algorithm 3 computes the maximin strategy c max to achieve the reward 0 for the defender is c max 0 = 0.4,c max 1 = 0.625,c max 2 = 0.5 and the possible attack set under the maximin strategy only contains t 0 , so S p (c max )={t 0 } and S i (c max ) ={t 1 ,t 2 }. Then it uses binary search in Algorithm 4 to reduce the coverage probability for targets in S i (c max ) and get the optimal strategy c opt , which is c opt 0 =0.4,c opt 1 =0.380769,c opt 2 =0.459269. It returns back to Algorithm 1 and Algorithm 1 computes that the minimum number of resources needed to achieve the reward 0 is 1.240038, which is larger than the resource we have, which is 1. Therefore, the reward 0 is not achievable for the defender with 1 resource and the upper bound of 140 defender’s reward is then set to be 0. This process goes on and on and until the upper bound and the lower bound become closer enough. 141 Appendix B Proof for Indexability I am going to prove Theorem 6.2.1 and Theorem 6.2.2 demonstrating two sufficient con- ditions for indexability. Consider the case with n o =2 and n s =2. Transition matrix: T 0 = T 0 00 T 0 01 T 0 10 T 0 11 T 1 = T 1 00 T 1 01 T 1 10 T 1 11 Observation matrix: O = O 00 O 01 O 10 O 11 In this problem, O 11 > O 01 , O 00 > O 10 (higher attack intensity leads to higher probability to see attack activities when patrolling); T 1 11 > T 1 01 , T 1 00 > T 1 10 ; T 0 11 > T 0 01 , T 0 00 >T 0 10 (positively correlated arms). Define α,max{T 0 11 −T 0 01 ,T 1 11 −T 1 01 }. 142 Since it is a two-state problem with S ={0,1}, I use one variable x to represent the belief state: x,b(s=1), which is the probability of being in state 1. DefineΓ 1 (x) =xT 1 11 +(1−x)T 1 01 , whichisthebeliefforthenextroundifthebelieffor thecurrentroundisxandtheactive action istaken. Similarly, Γ 0 (x) =xT 0 11 +(1−x)T 0 01 , whichisthebeliefforthenextroundifthebeliefforthecurrentroundisxandthepassive action is taken. We have Γ 1 (x 2 )−Γ 1 (x 1 ) = (T 1 11 −T 1 01 )(x 2 −x 1 )≤ α(x 2 −x 1 ),∀x 2 > x 1 and Γ 0 (x 2 )−Γ 0 (x 1 ) = (T 0 11 −T 0 01 )(x 2 − x 1 )≤ α(x 2 − x 1 ),∀x 2 > x 1 according to the definition of α. We normalize the reward function R(o) so that R(0) =0 and R(1) =1. B.1 Preliminaries Further expand the value function presented in thesis, we have: V m (x;a =0) =m+βV m (Γ 0 (x)) V m (x;a =1) =xO 11 +(1−x)O 01 +β[xO 11 +(1−x)O 01 ]V m (Γ 1 ( xO 11 xO 11 +(1−x)O 01 )) +β[xO 10 +(1−x)O 00 ]V m (Γ 1 ( xO 10 xO 10 +(1−x)O 00 )) V m (x) =max{V m (x;a =0),V m (x;a =1)} where V m (x) is the value function for belief state x with subsidy m, V m (x;a = 0) is the value function for belief state x with subsidy m and defenders take passive action, 143 and V m (x;a = 1) is the value function for belief state x with subsidy m and defenders take active action. Define V t m (x) to be the value function for state x with subsidy m when there are t rounds left. V t m (x;a =0) and V t m (x;a =1) are also defined similarly. Clearly, lim t→+∞ V t m (x) =V m (x) lim t→+∞ V t m (x;a =0) =V m (x;a =0) lim t→+∞ V t m (x;a =1) =V m (x;a =1) Proposition B.1.1. The value function for the last round is: V 1 m (x) = max{m,xO 11 + (1−x)O 01 } Proof. 
It can be easily computed by assuming V 0 m (x) =0,∀x∈[0,1]. Figure B.1 shows an example of V 1 m (x) with m=0.2, O 11 =0.4, O 01 =0.1 0 0.2 0.4 0.6 0.8 1 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.5 Figure B.1: An example of V 1 m (x) 144 My proof below is based on mathematical induction — I first prove the conclusion holdstrueforV 1 m (x); then proveit holds truefor V t+1 m (x) if it holdstruefor V t m (x). Then it holds true for V ∞ m (x), in other words, the conclusion holds true for V m (x). B.2 Proof of Theorem 6.2.1 To prove Theorem 6.2.1, I first prove a couple of lemmas. Lemma B.2.1. V m ′(x)≥V m (x),∀m ′ ≥m,∀x Proof. Clearly, V 1 m ′ (x)≥V 1 m (x),∀m ′ ≥m,∀x If V t m ′ (x) ≥ V t m (x), ∀m ′ ≥ m,∀x, since V t+1 m ′ (x;a = 0) ≥ V t+1 m (x;a = 0) and V t+1 m ′ (x;a =1)≥V t+1 m (x;a =1), we have V t+1 m ′ (x)≥V t+1 m (x),∀m ′ ≥m,∀x Lemma B.2.2. V m ′(x)−V m (x)≤ m ′ −m 1−β ,∀m ′ ≥m,∀x Proof. Clearly, V 1 m ′ (x)−V 1 m (x)≤m ′ −m≤ m ′ −m 1−β ,∀m ′ ≥m,∀x If V t m ′ (x)−V t m (x)≤ m ′ −m 1−β ,∀m ′ ≥m,∀x, we have: V t+1 m ′ (x;a =0)−V t+1 m (x;a =0) =m ′ +βV t m ′ (Γ 0 (x))−m−βV t m (Γ 0 (x))≤(m ′ −m)+ β m ′ −m 1−β = m ′ −m 1−β V t+1 m ′ (x;a =1)−V t+1 m (x;a =1)≤β m ′ −m 1−β ≤ m ′ −m 1−β so V t+1 m ′ (x)−V t+1 m (x)≤ m ′ −m 1−β ,∀m ′ ≥m,∀x Now we are ready to prove Theorem 6.2.1. Theorem (6.2.1). When β ≤ 0.5, the process is indexable, i.e., for any belief x, if V m (x;a =0)≥V m (x;a =1), then V m ′(x;a =0)≥V m ′(x;a =1),∀m ′ ≥m 145 Proof. According to Lemma B.2.1: V m ′(x;a = 0)−V m (x;a = 0) = m ′ +βV m ′(Γ 0 (x))− m−βV m (Γ 0 (x))≥m ′ −m According to Lemma B.2.2: V m ′(x;a =1)−V m (x;a =1)≤β m ′ −m 1−β ≤m ′ −m so V m ′(x;a = 0)−V m (x;a = 0)≥ V m ′(x;a = 1)−V m (x;a = 1), therefore V m ′(x;a = 0)≥V m ′(x;a =1) B.3 Proof of Theorem 6.2.2 To prove Theorem 6.2.2, we first prove the following lemma. Lemma B.3.1. When αβ≤0.5, we have: • the value function V m (x) is an increasing function with x • the optimal policy is a threshold policy with smaller x leads to a = 0 and larger x leads to a =1 • V m (x 2 )−V m (x 1 )≤ (x 2 −x 1 )(O 11 −O 01 ) 1−αβ ,∀x 2 >x 1 Proof. Obviously, it holds true for t=1. Suppose it holds true for V t m (x), i.e., V t m (x) is an increasing function with x, the optimal policy is a threshold policy with smaller x leads to a = 0 and larger x leads to a=1, and V t m (x 2 )−V t m (x 1 )≤ (x 2 −x 1 )(O 11 −O 01 ) 1−αβ ,∀x 2 >x 1 , consider V t+1 m (x). (i)WefirstprovethattheoptimalpolicyisathresholdpolicyforV t+1 m (x)withsmaller x leads to a=0 and larger x leads to a=1. It is equivalent to prove that: if V t+1 m (x 2 ;a = 0)≥ V t+1 m (x 2 ;a = 1), then V t+1 m (x 1 ;a = 0)≥ V t+1 m (x 1 ;a = 1),∀x 1 < x 2 146 if V t+1 m (x 1 ;a = 1)≥ V t+1 m (x 1 ;a = 0), then V t+1 m (x 2 ;a = 1)≥ V t+1 m (x 2 ;a = 0),∀x 1 < x 2 These conclusions are correct if V t+1 m (x 2 ;a = 1)−V t+1 m (x 1 ;a = 1)≥ V t+1 m (x 2 ;a = 0)−V t+1 m (x 1 ;a=0),∀x 2 >x 1 . 
V t+1 m (x 2 ;a=0)−V t+1 m (x 1 ;a =0) =m+βV t m (Γ 0 (x 2 ))−m−βV t m (Γ 0 (x 1 )) =β(V t m (Γ 0 (x 2 ))−V t m (Γ 0 (x 1 ))) ≤β (Γ 0 (x 2 )−Γ 0 (x 1 ))(O 11 −O 01 ) 1−αβ ≤αβ (x 2 −x 1 )(O 11 −O 01 ) 1−αβ ≤(x 2 −x 1 )(O 11 −O 01 ) V t+1 m (x 2 ;a =1)−V t+1 m (x 1 ;a=1) =(O 11 −O 01 )(x 2 −x 1 ) +β[x 2 O 11 +(1−x 2 )O 01 ]V t m (Γ 1 ( x 2 O 11 x 2 O 11 +(1−x 2 )O 01 )) +β[x 2 O 10 +(1−x 2 )O 00 ]V t m (Γ 1 ( x 2 O 10 x 2 O 10 +(1−x 2 )O 00 )) −β[x 1 O 11 +(1−x 1 )O 01 ]V t m (Γ 1 ( x 1 O 11 x 1 O 11 +(1−x 1 )O 01 )) −β[x 1 O 10 +(1−x 1 )O 00 ]V t m (Γ 1 ( x 1 O 10 x 1 O 10 +(1−x 1 )O 00 )) ≥(O 11 −O 01 )(x 2 −x 1 ) The last inequality is due to the increasing property of V t m (x). 147 So we have V t+1 m (x 2 ;a = 1)− V t+1 m (x 1 ;a = 1)≥ V t+1 m (x 2 ;a = 0)− V t+1 m (x 1 ;a = 0),∀x 2 > x 1 , and thus the optimal policy is a threshold policy with smaller x leads to a=0 and larger x leads to a=1 for V t+1 m (x). (ii) We next prove that V t+1 m (x) is an increasing function with the increase of x. This is true because V t+1 m (x;a = 1) and V t+1 m (x;a = 0) are all increasing functions (because V t m (x)is anincreasingfunction)sothat their maximization isalsoanincreasingfunction. (iii) Last we prove that V t+1 m (x 2 )−V t+1 m (x 1 )≤ (x 2 −x 1 )(O 11 −O 01 ) 1−αβ ,∀x 2 > x 1 . This can be proved by proving V t+1 m (x 2 ;a =0)−V t+1 m (x 1 ;a =0)≤ (x 2 −x 1 )(O 11 −O 01 ) 1−αβ ,∀x 2 >x 1 and 148 V t+1 m (x 2 ;a = 1)−V t+1 m (x 1 ;a = 1)≤ (x 2 −x 1 )(O 11 −O 01 ) 1−αβ ,∀x 2 > x 1 . The former one when a=0 has been proved in (i). We will next prove the latter one when a=1. V t+1 m (x 2 ;a =1)−V t+1 m (x 1 ;a =1) =(O 11 −O 01 )(x 2 −x 1 ) +β[x 2 O 11 +(1−x 2 )O 01 ]V t m (Γ 1 ( x 2 O 11 x 2 O 11 +(1−x 2 )O 01 )) +β[x 2 O 10 +(1−x 2 )O 00 ]V t m (Γ 1 ( x 2 O 10 x 2 O 10 +(1−x 2 )O 00 )) −β[x 1 O 11 +(1−x 1 )O 01 ]V t m (Γ 1 ( x 1 O 11 x 1 O 11 +(1−x 1 )O 01 )) −β[x 1 O 10 +(1−x 1 )O 00 ]V t m (Γ 1 ( x 1 O 10 x 1 O 10 +(1−x 1 )O 00 )) =(O 11 −O 01 )(x 2 −x 1 ) +β[x 1 O 11 +(1−x 1 )O 01 ](V t m (Γ 1 ( x 2 O 11 x 2 O 11 +(1−x 2 )O 01 ))−V t m (Γ 1 ( x 1 O 11 x 1 O 11 +(1−x 1 )O 01 ))) +β[x 2 O 10 +(1−x 2 )O 00 ](V t m (Γ 1 ( x 2 O 10 x 2 O 10 +(1−x 2 )O 00 ))−V t m (Γ 1 ( x 1 O 10 x 1 O 10 +(1−x 1 )O 00 ))) +β(x 2 −x 1 )(O 11 −O 01 )(V t m (Γ 1 ( x 2 O 11 x 2 O 11 +(1−x 2 )O 01 ))−V t m (Γ 1 ( x 1 O 10 x 1 O 10 +(1−x 1 )O 00 ))) ≤(O 11 −O 01 )(x 2 −x 1 )+β O 11 −O 01 1−αβ α( x 2 O 11 (x 1 O 11 +(1−x 1 )O 01 ) x 2 O 11 +(1−x 2 )O 01 −x 1 O 11 ) +β O 11 −O 01 1−αβ α(x 2 O 10 − x 1 O 10 (x 2 O 10 +(1−x 2 )O 00 ) x 1 O 10 +(1−x 1 )O 00 ) +β(x 2 −x 1 )(O 11 −O 01 ) (O 11 −O 01 ) (1−αβ) α x 2 O 11 (1−x 1 )O 00 −x 1 O 10 (1−x 2 )O 01 (x 2 O 11 +(1−x 2 )O 01 )(x 1 O 10 +(1−x 1 )O 00 ) =(O 11 −O 01 )(x 2 −x 1 )(1+ αβ 1−αβ ) = (O 11 −O 01 )(x 2 −x 1 ) 1−αβ 149 During the proof of Lemma B.3.1, we have the following result, which will be used in proving Lemma B.3.3. Remark B.3.2. When αβ≤0.5, we have: • V t m (x 2 ;a=0)−V t m (x 1 ;a=0)≤(x 2 −x 1 )(O 11 −O 01 ),∀x 2 >x 1 ,∀t • (O 11 −O 01 )(x 2 −x 1 )≤ V t m (x 2 ;a = 1)−V t m (x 1 ;a = 1)≤ (O 11 −O 01 )(x 2 −x 1 ) 1−αβ ,∀x 2 > x 1 ,∀t Lemma B.3.3. When αβ≤0.5 and Γ 1 (1)≤Γ 0 (0), we have: • V m ′(x 1 )−V m (x 1 )≥V m ′(x 2 )−V m (x 2 ),∀x 1 ≤x 2 • V m ′(0)−V m (0)−V m ′(1)+V m (1)≤m ′ −m • V m ′(x;a =0)−V m (x;a =0)≥V m ′(x;a =1)−V m (x;a =1),∀x Proof. 
Consider the following sets of conclusions: • V t m ′ (x 1 )−V t m (x 1 )≥V t m ′ (x 2 )−V t m (x 2 ),∀x 1 ≤x 2 • V t m ′ (0)−V t m (0)−V t m ′ (1)+V t m (1)≤m ′ −m • V t+1 m ′ (x;a =0)−V t+1 m (x;a =0)≥V t+1 m ′ (x;a =1)−V t+1 m (x;a =1),∀x Obviously, it holds true for t=0. Suppose it holds true for t. Consider t+1: 150 (i) We first prove V t+1 m ′ (x 1 )−V t+1 m (x 1 )−V t+1 m ′ (x 2 )+V t+1 m (x 2 )≥0 V t+1 m ′ (x 1 )−V t+1 m (x 1 )−V t+1 m ′ (x 2 )+V t+1 m (x 2 ) =max{V t+1 m ′ (x 1 ;a =0),V t+1 m ′ (x 1 ;a=1)} −max{V t+1 m (x 1 ;a =0),V t+1 m (x 1 ;a=1)} −max{V t+1 m ′ (x 2 ;a =0),V t+1 m ′ (x 2 ;a=1)} +max{V t+1 m (x 2 ;a =0),V t+1 m (x 2 ;a=1)} • Case 1: V t+1 m (x 1 ;a =0)≥V t+1 m (x 1 ;a =1) and V t+1 m ′ (x 2 ;a=0)≥V t+1 m ′ (x 2 ;a =1) V t+1 m ′ (x 1 )−V t+1 m (x 1 )−V t+1 m ′ (x 2 )+V t+1 m (x 2 ) ≥V t+1 m ′ (x 1 ;a =0)−V t+1 m (x 1 ;a =0)−V t+1 m ′ (x 2 ;a=0)+V t+1 m (x 2 ;a =0) =m ′ +βV t m ′(Γ 0 (x 1 ))−m−βV t m (Γ 0 (x 1 ))−m ′ −βV t m ′(Γ 0 (x 2 ))+m+βV t m (Γ 0 (x 2 )) =β(V t m ′(Γ 0 (x 1 ))−V t m (Γ 0 (x 1 ))−V t m ′(Γ 0 (x 2 ))+V t m (Γ 0 (x 2 )))≥0 • Case 2: V t+1 m (x 1 ;a =0)≥V t+1 m (x 1 ;a =1) and V t+1 m ′ (x 2 ;a=0)<V t+1 m ′ (x 2 ;a =1) V t+1 m ′ (x 1 )−V t+1 m (x 1 )−V t+1 m ′ (x 2 )+V t+1 m (x 2 ) ≥V t+1 m ′ (x 1 ;a=0)−V t+1 m (x 1 ;a =0)−V t+1 m ′ (x 2 ;a =1)+V t+1 m (x 2 ;a=1) ≥V t+1 m ′ (x 1 ;a=0)−V t+1 m (x 1 ;a =0)−V t+1 m ′ (x 2 ;a =0)+V t+1 m (x 2 ;a=0) ≥0 151 • Case 3: V t+1 m (x 1 ;a =0)<V t+1 m (x 1 ;a =1) and V t+1 m ′ (x 2 ;a=0)≥V t+1 m ′ (x 2 ;a =1) V t+1 m ′ (x 1 )−V t+1 m (x 1 )−V t+1 m ′ (x 2 )+V t+1 m (x 2 ) ≥V t+1 m ′ (x 1 ;a=0)−V t+1 m (x 1 ;a =1)−V t+1 m ′ (x 2 ;a =0)+V t+1 m (x 2 ;a=1) ≥−(x 2 −x 1 )(O 11 −O 01 )+(O 11 −O 01 )(x 2 −x 1 )=0 The last inequality is based on Remark 1. • Case 4: V t+1 m (x 1 ;a =0)<V t+1 m (x 1 ;a =1) and V t+1 m ′ (x 2 ;a=0)<V t+1 m ′ (x 2 ;a =1) V t+1 m ′ (x 1 )−V t+1 m (x 1 )−V t+1 m ′ (x 2 )+V t+1 m (x 2 ) ≥V t+1 m ′ (x 1 ;a=1)−V t+1 m (x 1 ;a =1)−V t+1 m ′ (x 2 ;a=1)+V t+1 m (x 2 ;a =1) =β[x 2 O 11 +(1−x 2 )O 01 ](V t m ′(Γ 1 ( x 1 O 11 x 1 O 11 +(1−x 1 )O 01 ))−V t m (Γ 1 ( x 1 O 11 x 1 O 11 +(1−x 1 )O 01 )) −V t m ′(Γ 1 ( x 2 O 11 x 2 O 11 +(1−x 2 )O 01 ))+V t m (Γ 1 ( x 2 O 11 x 2 O 11 +(1−x 2 )O 01 ))) +β[x 2 O 10 +(1−x 2 )O 00 ](V t m ′(Γ 1 ( x 1 O 10 x 1 O 10 +(1−x 1 )O 00 ))−V t m (Γ 1 ( x 1 O 10 x 1 O 10 +(1−x 1 )O 00 )) −V t m ′(Γ 1 ( x 2 O 10 x 2 O 10 +(1−x 2 )O 00 ))+V t m (Γ 1 ( x 2 O 10 x 2 O 10 +(1−x 2 )O 00 ))) +β(x 2 −x 1 )(O 11 −O 01 )(V t m ′(Γ 1 ( x 1 O 10 x 1 O 10 +(1−x 1 )O 00 ))−V t m (Γ 1 ( x 1 O 10 x 1 O 10 +(1−x 1 )O 00 )) −V t m ′(Γ 1 ( x 1 O 11 x 1 O 11 +(1−x 1 )O 01 ))+V t m (Γ 1 ( x 1 O 11 x 1 O 11 +(1−x 1 )O 01 ))) ≥0 152 (ii) We then prove that V t+1 m ′ (0)−V t+1 m (0)−V t+1 m ′ (1)+V t+1 m (1)≤m ′ −m V t+1 m ′ (0)−V t+1 m (0)−V t+1 m ′ (1)+V t+1 m (1) =max{m ′ +βV t m ′(Γ 0 (0)),O 01 +βV t m ′(Γ 1 (0))} −max{m+βV t m (Γ 0 (0)),O 01 +βV t m (Γ 1 (0))} −max{m ′ +βV t m ′(Γ 0 (1)),O 11 +βV t m ′(Γ 1 (1))} +max{m+βV t m (Γ 0 (1)),O 11 +βV t m (Γ 1 (1))} • Case 1: m ′ + βV t m ′ (Γ 0 (0)) ≥ O 01 + βV t m ′ (Γ 1 (0)) and m + βV t m (Γ 0 (1)) ≥ O 11 + βV t m (Γ 1 (1)) V t+1 m ′ (0)−V t+1 m (0)−V t+1 m ′ (1)+V t+1 m (1) ≤m ′ +βV t m ′(Γ 0 (0))−m−βV t m (Γ 0 (0))−m ′ −βV t m ′(Γ 0 (1))+m+βV t m (Γ 0 (1)) =β(V t m ′(Γ 0 (0))−V t m (Γ 0 (0))−V t m ′(Γ 0 (1))+V t m (Γ 0 (1))) ≤β(V t m ′(0)−V t m (0)−V t m ′(1)+V t m (1))≤m ′ −m 153 • Case 2: m ′ + βV t m ′ (Γ 0 (0)) ≥ O 01 + βV t m ′ (Γ 1 (0)) and m + βV t m (Γ 0 (1)) < O 11 + βV t m (Γ 1 (1)) V t+1 m ′ (0)−V t+1 m (0)−V t+1 m ′ (1)+V 
(ii) We then prove $V^{t+1}_{m'}(0) - V^{t+1}_m(0) - V^{t+1}_{m'}(1) + V^{t+1}_m(1) \leq m' - m$. Write
$V^{t+1}_{m'}(0) - V^{t+1}_m(0) - V^{t+1}_{m'}(1) + V^{t+1}_m(1) = \max\{m' + \beta V^t_{m'}(\Gamma_0(0)),\ O_{01} + \beta V^t_{m'}(\Gamma_1(0))\} - \max\{m + \beta V^t_m(\Gamma_0(0)),\ O_{01} + \beta V^t_m(\Gamma_1(0))\} - \max\{m' + \beta V^t_{m'}(\Gamma_0(1)),\ O_{11} + \beta V^t_{m'}(\Gamma_1(1))\} + \max\{m + \beta V^t_m(\Gamma_0(1)),\ O_{11} + \beta V^t_m(\Gamma_1(1))\}$.

• Case 1: $m' + \beta V^t_{m'}(\Gamma_0(0)) \geq O_{01} + \beta V^t_{m'}(\Gamma_1(0))$ and $m + \beta V^t_m(\Gamma_0(1)) \geq O_{11} + \beta V^t_m(\Gamma_1(1))$. Then the quantity is at most
$m' + \beta V^t_{m'}(\Gamma_0(0)) - m - \beta V^t_m(\Gamma_0(0)) - m' - \beta V^t_{m'}(\Gamma_0(1)) + m + \beta V^t_m(\Gamma_0(1)) = \beta\big(V^t_{m'}(\Gamma_0(0)) - V^t_m(\Gamma_0(0)) - V^t_{m'}(\Gamma_0(1)) + V^t_m(\Gamma_0(1))\big) \leq \beta\big(V^t_{m'}(0) - V^t_m(0) - V^t_{m'}(1) + V^t_m(1)\big) \leq m' - m$.

• Case 2: $m' + \beta V^t_{m'}(\Gamma_0(0)) \geq O_{01} + \beta V^t_{m'}(\Gamma_1(0))$ and $m + \beta V^t_m(\Gamma_0(1)) < O_{11} + \beta V^t_m(\Gamma_1(1))$. Then the quantity is at most
$m' + \beta V^t_{m'}(\Gamma_0(0)) - m - \beta V^t_m(\Gamma_0(0)) - O_{11} - \beta V^t_{m'}(\Gamma_1(1)) + O_{11} + \beta V^t_m(\Gamma_1(1)) = m' - m + \beta\big(V^t_{m'}(\Gamma_0(0)) - V^t_m(\Gamma_0(0)) - V^t_{m'}(\Gamma_1(1)) + V^t_m(\Gamma_1(1))\big) \leq m' - m$,
where the last inequality uses $\Gamma_1(1) \leq \Gamma_0(0)$ together with the first induction hypothesis, which makes the bracketed term nonpositive.

• Case 3: $m' + \beta V^t_{m'}(\Gamma_0(0)) < O_{01} + \beta V^t_{m'}(\Gamma_1(0))$ and $m + \beta V^t_m(\Gamma_0(1)) \geq O_{11} + \beta V^t_m(\Gamma_1(1))$. This case is impossible. It means $V^{t+1}_{m'}(0;a=0) < V^{t+1}_{m'}(0;a=1)$ and $V^{t+1}_m(1;a=0) \geq V^{t+1}_m(1;a=1)$. The former leads to $V^{t+1}_{m'}(1;a=0) < V^{t+1}_{m'}(1;a=1)$, since the optimal policy for $V^{t+1}_{m'}(x)$ is a threshold policy in which smaller $x$ leads to $a=0$ and larger $x$ leads to $a=1$. Together, $V^{t+1}_m(1;a=0) \geq V^{t+1}_m(1;a=1)$ and $V^{t+1}_{m'}(1;a=0) < V^{t+1}_{m'}(1;a=1)$ contradict the condition $V^{t+1}_{m'}(x;a=0) - V^{t+1}_m(x;a=0) \geq V^{t+1}_{m'}(x;a=1) - V^{t+1}_m(x;a=1)$ for all $x$.

• Case 4: $m' + \beta V^t_{m'}(\Gamma_0(0)) < O_{01} + \beta V^t_{m'}(\Gamma_1(0))$ and $m + \beta V^t_m(\Gamma_0(1)) < O_{11} + \beta V^t_m(\Gamma_1(1))$. Then the quantity is at most
$O_{01} + \beta V^t_{m'}(\Gamma_1(0)) - O_{01} - \beta V^t_m(\Gamma_1(0)) - O_{11} - \beta V^t_{m'}(\Gamma_1(1)) + O_{11} + \beta V^t_m(\Gamma_1(1)) = \beta\big(V^t_{m'}(\Gamma_1(0)) - V^t_m(\Gamma_1(0)) - V^t_{m'}(\Gamma_1(1)) + V^t_m(\Gamma_1(1))\big) \leq m' - m$.

(iii) Finally, we prove $V^{t+2}_{m'}(x;a=0) - V^{t+2}_m(x;a=0) \geq V^{t+2}_{m'}(x;a=1) - V^{t+2}_m(x;a=1)$ for all $x$:
$V^{t+2}_{m'}(x;a=0) - V^{t+2}_m(x;a=0) - V^{t+2}_{m'}(x;a=1) + V^{t+2}_m(x;a=1)$
$= m' - m + \beta\big(V^{t+1}_{m'}(\Gamma_0(x)) - V^{t+1}_m(\Gamma_0(x))\big)$
$\;\; - \beta[xO_{11}+(1-x)O_{01}]\big(V^{t+1}_{m'}(\Gamma_1(\tfrac{xO_{11}}{xO_{11}+(1-x)O_{01}})) - V^{t+1}_m(\Gamma_1(\tfrac{xO_{11}}{xO_{11}+(1-x)O_{01}}))\big)$
$\;\; - \beta[xO_{10}+(1-x)O_{00}]\big(V^{t+1}_{m'}(\Gamma_1(\tfrac{xO_{10}}{xO_{10}+(1-x)O_{00}})) - V^{t+1}_m(\Gamma_1(\tfrac{xO_{10}}{xO_{10}+(1-x)O_{00}}))\big)$
$\geq m' - m + \beta\big(V^{t+1}_{m'}(1) - V^{t+1}_m(1) - V^{t+1}_{m'}(0) + V^{t+1}_m(0)\big)$
$\geq m' - m + \beta(m - m') \geq 0$,
where the first inequality uses the fact that $V^{t+1}_{m'} - V^{t+1}_m$ is non-increasing in the belief (part (i)) and that the two observation weights sum to 1, and the second uses part (ii).

Now we are ready to prove Theorem 6.2.2.

Theorem (6.2.2). When $\alpha\beta \leq 0.5$ and $\Gamma_1(1) \leq \Gamma_0(0)$, the process is indexable, i.e., for any belief $x$, if $V_m(x;a=0) \geq V_m(x;a=1)$, then $V_{m'}(x;a=0) \geq V_{m'}(x;a=1)$ for all $m' \geq m$.

Proof. It follows from Lemma B.3.3 that when $\alpha\beta \leq 0.5$ and $\Gamma_1(1) \leq \Gamma_0(0)$, $V_{m'}(x;a=0) - V_m(x;a=0) \geq V_{m'}(x;a=1) - V_m(x;a=1)$ for all $x$. So if $V_m(x;a=0) \geq V_m(x;a=1)$, then $V_{m'}(x;a=0) \geq V_{m'}(x;a=1)$ for all $m' \geq m$.