Human Adversaries in Security Games: Integrating Models of Bounded Rationality and Fast Algorithms

by Rong Yang

A Dissertation Presented to the FACULTY OF THE USC GRADUATE SCHOOL, UNIVERSITY OF SOUTHERN CALIFORNIA, in Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (Computer Science)

April 2014

Copyright 2014 Rong Yang

Acknowledgments

Five years ago, I decided to join USC to pursue a PhD degree. This was one of the most important decisions I have made in my life. Finishing the PhD wasn't easy, but I have really enjoyed the past five years, with the privilege of having worked with a number of extraordinary people who have given me invaluable advice on both research and life.

First of all, I would like to give my special thanks to my PhD advisor, Milind Tambe, without whom I would not have been anywhere near where I am now. When I first joined the TEAMCORE research group, little did I know about what it means to do research. Milind, with his endless patience, guided me through each step of doing meaningful research. He taught me the importance of conducting research that has real-world impact. His dedication to his students has encouraged us not only to do better research but also to be better people. His unbounded passion for research and unbeatable work ethic has been and will always be an inspiration to me. I will always remember the days when he worked with me until the last minute before paper deadlines! I am also grateful for his support as I started my family during the course of pursuing my PhD. His understanding and consideration made the toughest time of my life so much smoother. Milind, thanks for always being there when I needed a discussion about research, a break to take care of my baby, or heartfelt advice about my future career path.
Next, I would like to thank the other members of my thesis committee: Fernando Ordóñez, Jonathan Gratch, Rajiv Maheswaran, Richard John, and Vincent Conitzer, for providing valuable feedback on my research and pushing me to think deeper. With my research being at the intersection of many disciplines, I would never have been able to push it to the height I have achieved without an interdisciplinary committee to help shape my ideas. A special thanks to Fernando, for his tremendous guidance and ceaseless confidence in me, without which my research would have suffered greatly. You introduced me to the world of large-scale optimization techniques. I will always remember the days when I could just walk into your office to ask questions whenever I encountered problems in my research. Thanks for flying all the way from Chile to support my research. Richard, thanks for opening the door to human subject research for me. Without your guidance, I would never have been able to reach the achievements in my research. I really appreciate your always bringing a different perspective to my work. As a computer scientist, I enjoy interacting with the human subjects in my experiments as much as I do with my code, thanks to your help.

I would also like to thank the many excellent researchers that I have had the privilege to work with over the years. This list includes Christopher Kiekintveld, Matthew Taylor, Bo An, Albert Xin Jiang, James Pita, Jun-young Kwak, Sarit Kraus, Thanh Nguyen, Fei Fang, Benjamin Ford and Debarun Kar. I thank all the students who helped develop the games for my experiments: Mayuresh Janorkar, Mohit Goenka, Karthik Rajagopal and Noah Olsman. I am also grateful for the support from the Army Research Office, the US Coast Guard and IBM for my research over the years. They have provided me not only the opportunity to work on real-world problems of research interest, but also the possibility to develop my own research interests.
I would like to particularly thank Janusz Marecki for his help in applying for the IBM fellowship and for being the best academic brother I could ever ask for.

During my time at USC, I enjoyed myself very much as a member of the large TEAMCORE family. I appreciate the time shared with those alongside me during my PhD career: Matthew Brown, Leandro Marcolino, Chao Zhang, Yundi Qian, Kumar Amulya Yadav, Haifeng Xu, Francesco Delle Fave and William Haskell. Special thanks to Manish Jain for all the advice you have given me over the years; Jason Tsai for being a great officemate with the patience to listen to my stories, from shopping for shoes to taking care of my baby boy; Eric Shieh for the delicious food you brought to the office; Zhengyu Yin for the many afternoons you spent discussing research with me; and Jun-young Kwak for always having good recommendations for Korean restaurants.

Finally, I would like to thank my family for their support over the years. In particular, thanks to my parents for always believing in me, for supporting my decision to fly abroad to pursue my career, and for coming over to the US to help take care of Dylan. I would also like to thank my in-laws for their support during the toughest time of my PhD career. Lastly, I would like to thank my dear husband Qing for always being there for me through happiness and toughness, for being the best friend of my life, for supporting me in pursuing my own career, and for bringing Dylan, the most precious gift ever, into my life.

Table of Contents

Acknowledgments
List of Figures
Abstract
Chapter 1: Introduction
  1.1 Problem Addressed
  1.2 Contributions
    1.2.1 Stochastic Models of Adversary Decision Making
    1.2.2 Algorithms for Optimizing Defender Strategy
    1.2.3 Adaptive Resource Allocation and Application for Protecting Wildlife
  1.3 Overview of Thesis
Chapter 2: Background
  2.1 Stackelberg Games
    2.1.1 Bayesian Stackelberg Games
    2.1.2 Strong Stackelberg Equilibrium
    2.1.3 Stackelberg Security Games
  2.2 Los Angeles International Airport
  2.3 Baseline Solvers
    2.3.1 Defender Optimal Strategy against a Perfectly Rational Adversary
    2.3.2 Defender Optimal Strategy against the ε-Optimal Adversary Response
  2.4 Human Subject Experiments
Chapter 3: Related Work
  3.1 Behavioral Game Theory
  3.2 Efficient Computation of Defender Optimal Strategy
  3.3 Robust Defender Strategies
  3.4 Learning Adversary Behavior in Repeated Games
Chapter 4: Modeling Adversary Decision Making
  4.1 Models for Predicting Attacker Behaviors
    4.1.1 Prospect Theory
    4.1.2 Quantal Response
    4.1.3 Quantal Response with Rank-related Expected Utility
  4.2 Computing Optimal Defender Strategy
    4.2.1 Computing against a PT-adversary
      4.2.1.1 BRPT
      4.2.1.2 RPT
    4.2.2 Computing against a QR-adversary
    4.2.3 Computing against a QRRU-adversary
  4.3 Parameter Estimation
    4.3.1 Selecting Payoff Structures
    4.3.2 Parameter Estimation for Prospect Theory
    4.3.3 Parameter Estimation for the QR Model
    4.3.4 Parameter Estimation for the QRRU Model
  4.4 Experimental Results and Discussion
    4.4.1 A Simulated Online SSG
    4.4.2 Experimental Settings
    4.4.3 Algorithm Parameters
    4.4.4 Quality Comparison
      4.4.4.1 Average Performance
      4.4.4.2 Performance Distribution
    4.4.5 Model Prediction Accuracy
Chapter 5: Quantal Response Model with Subjective Utility
  5.1 The SUQR Model
    5.1.1 Learning SUQR Parameters
    5.1.2 Prediction Accuracy of the SUQR Model
  5.2 Improving MATCH
    5.2.1 Selecting β for MATCH
  5.3 Experimental Results
    5.3.1 Results with AMT Workers, 8-target Games
    5.3.2 SU-BRQR vs MATCH
    5.3.3 SU-BRQR vs Improved MATCH
    5.3.4 Results with New Experimental Scenarios
      5.3.4.1 Security Intelligence Experts, 8-target Games
      5.3.4.2 SU-BRQR vs DOBSS
      5.3.4.3 SU-BRQR vs MATCH
    5.3.5 Bounded Rationality of Human Adversaries
    5.3.6 AMT Workers, 24-target Games
      5.3.6.1 SU-BRQR vs MATCH with Parameters Learned from the 8-target Games
      5.3.6.2 SU-BRQR vs DOBSS with Re-estimated Parameters
      5.3.6.3 SU-BRQR vs MATCH with Re-estimated Parameters
Chapter 6: Modeling Human Adversaries in Network Security Games
  6.1 Problem Definition
  6.2 Adversary Model
    6.2.1 Basic Quantal Response Model
    6.2.2 Quantal Response with Heuristics
  6.3 Model Parameter Estimation
    6.3.1 Data Collection
    6.3.2 Training the QR Model
    6.3.3 Training the QRH Model
  6.4 Computing Defender Resource Allocation Strategy
    6.4.1 Best Response to the QR Model
    6.4.2 Best Response to the QRH Model
  6.5 Experiment Results
    6.5.1 Experiment Settings
    6.5.2 Experiment Results
Chapter 7: Computing Defender Optimal Strategy
  7.1 Problem Definition
    7.1.1 Resource Assignment Constraint
  7.2 Binary Search Method
  7.3 GOSAQ
    7.3.1 GOSAQ with No Assignment Constraint
    7.3.2 GOSAQ with Assignment Constraints
  7.4 PASAQ
    7.4.1 PASAQ with No Assignment Constraint
    7.4.2 PASAQ with Assignment Constraints
  7.5 Experiments
    7.5.1 No Assignment Constraints
    7.5.2 With Assignment Constraints
Chapter 8: Scaling-up
  8.1 Generalized PASAQ
  8.2 COCOMO – A Branch-and-Price Algorithm
  8.3 BLADE – A Cutting-Plane Algorithm
    8.3.1 Master
    8.3.2 Separation Oracle
    8.3.3 WBLADE
    8.3.4 Quality and Runtime Trade-off
  8.4 Experimental Results
Chapter 9: Adaptive Resource Allocation and its Application to Wildlife Protection
  9.1 Domain
  9.2 Model in PAWS
    9.2.1 Stackelberg Game Formulation
    9.2.2 Behavioral Heterogeneity
    9.2.3 Adapting Patrolling Strategy using Historical Crime Data
  9.3 Research Advances in PAWS
    9.3.1 Learning the Behavioral Model
      9.3.1.1 Learning with the Identified Data
      9.3.1.2 Learning with the Anonymous Data
      9.3.1.3 Combining the Two Kinds of Data
    9.3.2 Adapting Patrolling Strategy
  9.4 Evaluation
    9.4.1 General Game Settings
    9.4.2 Results for the Deployment Area
Chapter 10: Conclusion
  10.1 Contributions
  10.2 Future Work
Bibliography
Appendix A: Error Bound of PASAQ

List of Figures

1.1 US Coast Guard at the port of Boston
1.2 QENP: The intended site of deployment. Ranger photo taken by Andrew Lemieux.
2.1 LAX Security
4.1 Prospect Theory empirical function forms
4.2 Piecewise approximation of the weighting function
4.3 Payoff Structure Clusters (color)
4.4 Game interface for our simulated online SSG
4.5 Defender average expected utility achieved by different strategies
4.6 Defender average expected utility (normalized between 0 and 1) achieved by different strategies
4.7 Defender average expected utility achieved by QR model based strategies
4.8 Distribution of defender's expected utility against each individual subject
4.9 Distribution of defender's expected utility against each individual subject
6.1 Game Interface (colored)
6.2 Graphs Tested in Data Collection
6.3 Graphs Tested in Evaluation Experiments
6.4 Average Defender Expected Utility
6.5 Average Defender Expected Utility
7.1 Piecewise Linear Approximation
7.2 Solution Quality and Runtime Comparison, without assignment constraints (better in color)
7.3 Solution Quality and Runtime Comparison, with assignment constraints (better in color)
8.1 Branching Tree
8.2 Minimizing weighted 1-norm distance
8.3 Runtime Comparison of the BLADE family
8.4 Comparing COCOMO and BLADE, QR Model
8.5 Runtime Comparison, QR-Sigmoid model
9.1 Lioness photo courtesy of John Coppinger, Remote Africa Safaris Ltd. Poacher snare photo taken by Andrew Lemieux.
9.2 Empirical marginal PDF of the SUQR parameters among all 760 subjects
9.3 Simulation results over rounds
9.4 Slow Capture vs. Fast Capture
9.5 Comparing cumulative EU at round 20
9.6 The QENP area of interest for our simulation
9.7 Simulation results over rounds for the 64 sq. km grid area in QENP
9.8 Patrolling coverage density in the park

Abstract

Security is a world-wide concern in a diverse set of settings, such as protecting ports, airports and other critical infrastructure; interdicting the illegal flow of drugs, weapons and money; preventing illegal poaching/hunting of endangered species and fish; suppressing crime in urban areas; and securing cyberspace. Unfortunately, with limited security resources, not all the potential targets can be protected at all times. Game-theoretic approaches — in the form of "security games" — have recently gained significant interest from researchers as a tool for analyzing real-world security resource allocation problems, leading to multiple deployed systems in day-to-day use to enhance the security of US ports, airports and transportation infrastructure.
One of the key challenges that remains open in enhancing current security game applications and enabling new ones originates from the perfect rationality assumption about the adversaries — an assumption that may not hold in the real world due to the bounded rationality of human adversaries, and that could therefore reduce the effectiveness of the solutions offered.

My thesis focuses on addressing human decision-making in security games. It seeks to bridge the gap between two important subfields of game theory: algorithmic game theory and behavioral game theory. The former focuses on efficient computation of equilibrium solution concepts, while the latter develops models to predict the behavior of human players in various game settings. More specifically, I provide: (i) an answer to the question of which of the existing models best represents the salient features of security problems, by empirically exploring different human behavioral models from the literature; (ii) algorithms to efficiently compute resource allocation strategies for security agencies under these new models of the adversaries; (iii) real-world deployed systems that range from port security to wildlife security.

Chapter 1: Introduction

Security is a world-wide concern in a variety of different settings, including protecting critical infrastructure such as ports, airports and flights; interdicting the illegal flow of drugs, weapons and money; preventing illegal poaching/hunting of endangered species and fish; suppressing crime in urban areas; and securing cyberspace. The key challenge across these security settings is that resources are limited, so not all the potential targets can be protected at all times. At the same time, the adversaries are conducting surveillance, and hence any deterministic allocation of the resources may be exploited by these intelligent adversaries.
Security agencies therefore often prefer to allocate their resources in a randomized fashion. Game-theoretic approaches have recently gained significant interest from researchers as a tool for analyzing real-world security resource allocation problems [Gatti, 2008a; Agmon et al., 2008; Basilico et al., 2009]. These models provide a sophisticated approach for generating unpredictable, randomized strategies that mitigate the ability of attackers to find weaknesses using surveillance. ARMOR [Pita et al., 2008], IRIS [Tsai et al., 2009] and GUARDS [Pita et al., 2011] are notable examples of real-world applications. At the heart of these applications is the leader-follower Stackelberg game model, where the leader (security forces) acts first by committing to a mixed strategy; the follower (attacker/adversary) observes the leader's strategy and responds to it.

1.1 Problem Addressed

One of the key assumptions in the existing real-world security systems concerns how attackers choose strategies based on their knowledge of the security strategy. Typically, such systems apply the standard game-theoretic assumption that attackers are perfectly rational and strictly maximize their expected utilities. This is a reasonable starting point for the first generation of deployed systems. However, in real-world security problems, the security forces face human adversaries whose decisions may be governed by their bounded rationality [Simon, 1956], which may lead them to deviate from the optimal choice. Hence, defense strategies based on the perfect rationality assumption may not be robust against attackers using different decision procedures. Such assumptions also fail to exploit known weaknesses in the decision-making of human attackers. Indeed, it is widely accepted that standard game-theoretic assumptions of perfect rationality are not ideal for predicting the behavior of humans in multi-agent decision problems [Camerer et al., 2004; Wright and Leyton-Brown, 2010].
Thus, it is critical to integrate more realistic models of human decision-making when solving real-world security problems. There are several open questions we need to address in moving beyond perfect rationality assumptions. First, a large variety of alternative models have been studied in behavioral game theory and cognitive psychology [Camerer et al., 2004; Costa-Gomes et al., 2001] that capture some of the deviations of human decisions from perfect rationality. However, there remains an important empirical question of which model best represents the salient features of human behavior in applied security contexts. Given that many of these models are descriptive, integrating any of them into a decision-support system (even for the purpose of empirical evaluation) requires developing computationally efficient representations of them.

Furthermore, many of these models imply mathematically complex representations of the adversary's decision-making procedure (e.g., nonlinear and non-convex functional forms), which in general leads to an NP-hard problem of computing the defender's optimal strategy. Therefore, developing efficient algorithms to solve such a computationally complex problem is critical, given the massive scale of real-world security problems.

The third open question originates from domains where actual adversary events occur often and generate significant amounts of collectible crime event data, such as preventing illegal poaching of wildlife. As a result, learning behavioral models from collected adversary data and addressing the heterogeneity among large populations of adversaries become new challenges in these domains. In particular, crime data can be anonymous, or it can be linked to confessed adversaries (i.e., identified). While the latter type of data provides rich information about individual adversaries, it is sparse and hard to collect.
The majority of collected data is evidence on crimes committed by anonymous adversaries. Compared to identified data, anonymous data provides no information about the identity of the adversary that committed the crime and therefore cannot be used to build accurate behavioral models at the individual level. The open question here is how to utilize both types of data to build and learn a better model of the large population of criminals. Moreover, how does the learned model help better predict future crime events and thus help law enforcement officials improve their resource allocation strategies?

1.2 Contributions

My thesis addresses these open questions to improve security resource allocation strategies against human adversaries in real-world security problems.

1.2.1 Stochastic Models of Adversary Decision Making

I first investigate different theories in the behavioral literature to develop models of human decision-making for predicting adversary behavior. More specifically, I have explored two fundamental theories, Prospect Theory [Kahneman and Tversky, 1979] and the Quantal Response (QR) model [McKelvey and Palfrey, 1995], to model the decision-making process of human adversaries [Yang et al., 2011, 2013b] through experiments with human subjects using a simulated security game that I developed. Prospect Theory, an important theory that helped Kahneman win the Nobel Prize in Economic Sciences, provides a descriptive model of human decision making. The Quantal Response model originates from the literature on discrete choice models [Train, 2003; McFadden, 1984], and models the player's behavior as stochastic choice making. In experiments with human subjects, the defender strategy computed using the quantal response model to predict the human adversary significantly outperformed its competitors, including the previous leading contender COBRA [Pita et al., 2010].
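The QR model just described admits a compact computational form. The sketch below (with illustrative payoffs and a hypothetical value of the rationality parameter λ, not values estimated in this thesis) shows how QR turns attacker expected utilities into a probability distribution over targets rather than a single best response:

```python
import math

def quantal_response(utilities, lam):
    """Attack probability on each target is proportional to
    exp(lam * expected utility). lam = 0 yields uniform random play;
    as lam grows, the distribution concentrates on the best target."""
    weights = [math.exp(lam * u) for u in utilities]
    total = sum(weights)
    return [w / total for w in weights]

# Illustrative attacker expected utility per target, given defender
# coverage x_i, attacker reward R_i and penalty P_i:
#   U_i = x_i * P_i + (1 - x_i) * R_i
coverage = [0.7, 0.2, 0.1]
reward = [5.0, 3.0, 1.0]
penalty = [-4.0, -2.0, -1.0]
utilities = [x * p + (1 - x) * r
             for x, r, p in zip(coverage, reward, penalty)]

probs = quantal_response(utilities, lam=0.8)
```

Here the heavily covered first target has the lowest attacker expected utility, so QR assigns it the smallest (but still nonzero) attack probability, which is exactly the deviation from perfect rationality that a best-response model cannot express.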
I then further extend the QR model from three different perspectives. First, I modified the QR model by replacing the expected utility with a more general utility function, rank-dependent utility [Yang et al., 2013b]. The rank-dependent utility function incorporates into the model the fact that individuals overweight low-probability events. It improves on the original quantal response model in cases where the defender faces potentially large damage on targets covered with very few resources. Second, I applied the QR model to network security games. In a network security game, computing the actual expected utility of each action becomes very complicated for the adversary. I discovered that extending the expected utility function with a set of easy-to-compute features improves the performance of the model significantly [Yang et al., 2012a]. Finally, I integrated the QR model with a novel subjective utility function, which is learned from the data collected from experiments with human subjects [Nguyen et al., 2013].¹ The subjective utility function captures the fact that humans put more weight on the probability of a successful attack in their decision-making process. Compared with the classic Quantal Response model, the new model is shown to provide better predictions of the behavior of human adversaries. Through extensive experiments with 547 human subjects playing 11102 games in total, I emphatically answer the question "Is there then any value in using data-driven methods to model human behavior in solving SSGs?" in the affirmative.

1.2.2 Algorithms for Optimizing Defender Strategy

Given the non-convexity of the mathematical models for predicting adversary behavior, the problem of computing defender optimal strategies is also non-convex, which is in general NP-hard [Vavasis, 1995]. To that end, I have provided two novel algorithms (GOSAQ and PASAQ) to solve the problem [Yang et al., 2012b].
These two novel algorithms are based on three key ideas: (i) use of a binary search method to solve the fractional optimization problem efficiently; (ii) construction of a convex optimization problem through a non-linear transformation; (iii) building a piecewise linear approximation of the non-linear terms in the problem. I also provided proofs of approximation bounds, along with detailed experimental results showing the advantages of GOSAQ and PASAQ in solution quality over the benchmark algorithm (BRQR) and the efficiency of PASAQ. Given these results, PASAQ is at the heart of the first version of the PROTECT system [Shieh et al., 2012] used by the US Coast Guard in the port of Boston for generating optimal patrolling strategies.²

Figure 1.1: US Coast Guard at the port of Boston

Given that many real-world security problems are massive, such as that of the Federal Air Marshals [Kiekintveld et al., 2009], further scaling-up of the computation of the defender strategy incorporating models of adversary bounded rationality is needed. Unfortunately, previously proposed branch-and-price approaches fail to scale up given the non-convexity of such models, as we show with a realization called COCOMO.

¹ This is work that I co-authored with Thanh Nguyen, who is the first author.
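Key idea (i) above can be seen in miniature. The toy one-variable fractional program below (not the actual GOSAQ/PASAQ subproblem, which optimizes over defender coverage vectors) illustrates the reduction: a ratio N(x)/D(x) with D(x) > 0 attains value at least r iff max over x of [N(x) − r·D(x)] is nonnegative, so the ratio can be maximized by binary search on r, solving only the simpler subtraction-form subproblem at each step:

```python
def binary_search_fractional(inner_max, lo, hi, eps=1e-6):
    """Maximize a ratio N(x)/D(x) with D(x) > 0 by binary search on the
    objective value r. inner_max(r) must return max_x [N(x) - r * D(x)];
    value r is achievable iff that maximum is nonnegative."""
    while hi - lo > eps:
        r = (lo + hi) / 2.0
        if inner_max(r) >= 0:
            lo = r  # r is achievable: search higher
        else:
            hi = r  # r is not achievable: search lower
    return lo

# Toy instance: maximize (2x + 1) / (x + 2) over x in [0, 1].
# The inner problem is linear in x, so its maximum lies at a vertex.
def inner(r):
    return max((2 * x + 1) - r * (x + 2) for x in (0.0, 1.0))

best = binary_search_fractional(inner, 0.0, 10.0)
# The ratio is increasing on [0, 1], so the optimum is (2*1 + 1)/(1 + 2) = 1.
```

The design payoff is that each iteration solves a problem without the ratio; in GOSAQ/PASAQ that inner problem is what ideas (ii) and (iii) make tractable.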
Therefore, I present a novel cutting-plane algorithm called BLADE [Yang et al., 2013a] to scale up SSGs with complex adversary models, with three novelties: (i) an efficient, scalable separation oracle to generate deep cuts; (ii) a heuristic that uses gradients to further improve the cuts; (iii) techniques for trading off quality and efficiency.

² Since then, newer algorithms have been developed, as will be discussed below.

1.2.3 Adaptive Resource Allocation and Application for Protecting Wildlife

To address the challenges of learning adversary behavioral models from historical crime data, I present the Protection Assistant for Wildlife Security (PAWS) application, a joint deployment effort with researchers at Uganda's Queen Elizabeth National Park (QENP) with the goal of improving wildlife ranger patrols [Yang et al., 2014]. First, we propose a stochastic behavioral model which extends the current state of the art to capture the heterogeneity in the decision-making process of a population of poachers. Second, we demonstrate how to learn the behavioral pattern of the poacher population from both identified data and anonymous data. Third, in order to overcome the sparseness of the identified data, we provide a novel algorithm, PAWS-Learn, which improves the prediction accuracy of the estimated behavioral model by combining the two types of data. Fourth, we develop a new algorithm, PAWS-Adapt, which adapts the rangers' patrolling strategy to the poacher population's behavioral model. Fifth, we show the effectiveness of PAWS in a general setting, but our main drive is to deploy PAWS in QENP; we also demonstrate PAWS's effectiveness when applied to an area of QENP. Our experimental results and the corresponding discussion illustrate the capabilities of PAWS and its potential to improve the efforts of wildlife law enforcement officials in managing and executing their anti-poaching patrols.

(a) Outline of QENP. (b) QENP rangers on patrol.
Figure 1.2: QENP, the intended site of deployment. Ranger photo taken by Andrew Lemieux.

1.3 Overview of Thesis

This thesis is organized as follows. Chapter 2 introduces the necessary background material for the research presented in this thesis. Chapter 3 provides an overview of related work. Chapter 4 discusses how models for predicting adversary decision making are developed by applying the existing literature on general human behavior to security games. Chapter 5 presents an extension of the existing quantal response model that further improves its performance in predicting adversary behavior in security games. Chapter 6 investigates the performance of the quantal response model in network security games. Chapter 7 explains the algorithms for efficiently computing the optimal defender strategy incorporating the behavioral model of the adversary. Chapter 8 provides further scale-up of the computation of the defender strategy for massive real-world security problems. Chapter 9 describes the real-world application for preventing wildlife crimes. Finally, Chapter 10 concludes the thesis and presents possible future directions.

Chapter 2: Background

The work in this thesis is based on using Stackelberg games to model security scenarios. As such, I will first outline the relevant background in Section 2.1 by introducing the general Stackelberg game model, the Bayesian extension (Section 2.1.1), the standard solution concept known as the Strong Stackelberg Equilibrium (Section 2.1.2), and a restricted class of Stackelberg games referred to as security games (Section 2.1.3). In Section 2.2, I will describe a real-world security problem at the Los Angeles International Airport (LAX), which motivates the experimental setup in this thesis. Section 2.3 overviews previous algorithms of relevance to this thesis.
Finally, Section 2.4 provides the justification for conducting experiments with human subjects as an approach for evaluating the models and algorithms in this thesis.

2.1 Stackelberg Games

There are two types of players in a Stackelberg game, the leader and the follower. The leader commits to a strategy first; the follower then responds after observing the leader's action by maximizing his utility [von Stackelberg, 2011]. For the remainder of this thesis, I will refer to the leader as 'her' and the follower as 'him' for explanatory purposes. In order to show the advantage of being the leader in a Stackelberg game, let us look at an example first presented by [Conitzer and Sandholm, 2006]. Table 2.1 shows the payoff matrix of the game, where the leader is the row player and the follower is the column player. If the players move simultaneously, the only pure-strategy Nash equilibrium for this game is when the leader plays l_a and the follower plays f_a, which gives the leader a payoff of 2. In fact, playing l_b is strictly dominated by playing l_a for the leader. However, if the row player moves first, she can commit to playing l_b, which will give her a payoff of 3, since the column player will play f_b to ensure a higher payoff. Furthermore, if the leader commits to a mixed strategy of playing l_a and l_b with equal probability (0.5), then the follower will play f_b, leading to a payoff of 3.5 for the leader.

        f_a    f_b
l_a     2,1    4,0
l_b     1,0    3,1

Table 2.1: Example of a Stackelberg game

Each player in a Stackelberg game has a set of pure strategies that they can play. A mixed strategy allows a player to play probabilistically over a subset of the pure strategies. We denote the mixed strategy of the leader by x. In a general Stackelberg game, x is an N-dimensional vector, where N is the number of pure strategies for the leader.
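The commitment advantage in Table 2.1 can be checked numerically. The sketch below is illustrative only (the helper names are mine, not from the thesis); ties in the follower's response are broken in the leader's favor, matching the Strong Stackelberg convention used throughout this thesis:

```python
# Payoffs from Table 2.1: rows are leader strategies (l_a, l_b), columns
# are follower strategies (f_a, f_b); U holds the leader's payoffs and
# V the follower's.
U = [[2, 4], [1, 3]]
V = [[1, 0], [0, 1]]

def follower_best_response(x):
    """Follower's best column against leader mixed strategy x; ties are
    broken in favor of the leader (Strong Stackelberg tie-breaking)."""
    ev_f = [sum(x[i] * V[i][j] for i in range(2)) for j in range(2)]
    ev_l = [sum(x[i] * U[i][j] for i in range(2)) for j in range(2)]
    best = max(ev_f)
    return max((j for j in range(2) if abs(ev_f[j] - best) < 1e-9),
               key=lambda j: ev_l[j])

def leader_payoff(x):
    j = follower_best_response(x)
    return sum(x[i] * U[i][j] for i in range(2))

print(leader_payoff([1.0, 0.0]))  # playing l_a: follower answers f_a, leader gets 2
print(leader_payoff([0.5, 0.5]))  # mixed commitment: follower plays f_b, leader gets 3.5
```

With the uniform commitment the follower is indifferent (expected utility 0.5 for each column), so under the tie-breaking rule he plays f_b and the leader obtains 3.5, exactly as argued above.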
The i-th element of x, x_i, represents the probability that the leader will play pure strategy i. For the purpose of computing the equilibria, it is sufficient to consider only pure-strategy responses of the follower, as shown in [Conitzer and Sandholm, 2006].

The payoffs for each player are defined over all possible joint pure-strategy outcomes. More formally, we define the payoff matrices for both players as U = (U_1 ... U_J) and V = (V_1 ... V_J). The vector U_j represents the payoffs for the leader when the follower plays pure strategy j. Similarly, the vector V_j represents the follower's payoffs for playing pure strategy j. Given a leader's mixed strategy x, the follower maximizes his expected utility by choosing one of his pure strategies. For each pure strategy j chosen by the follower, the expected utility for the leader under the mixed strategy x is a linear function of x: U_j^T x. At the same time, the follower gets an expected utility of V_j^T x.

2.1.1 Bayesian Stackelberg Games

In a Bayesian Stackelberg game, both players are extended to have multiple types. In this thesis, we consider only one type of defender (leader), who is trying to allocate her resources. However, there can be multiple types of attackers (followers). For example, a security force may be interested in both protecting against potential terrorist attacks and catching drug smugglers. Each type of follower has his own payoff matrix, as well as a payoff matrix for the leader. Formally, let λ ∈ {1, ..., Λ} index the follower types. The defender faces attacker type λ with a priori probability p^λ. The associated payoff matrices for the leader and a type-λ attacker are denoted (U^λ, V^λ), respectively. Given the payoff matrix for each type, the leader commits to a mixed strategy x knowing the prior probability distribution over the follower types, but not the exact type of the follower she faces.
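To make the type-weighted expected utility concrete, here is a minimal sketch with made-up payoffs and type probabilities (none of these numbers are from the thesis; follower ties are broken arbitrarily by index for brevity, not by the SSE rule):

```python
# Hypothetical 2-type example.  p[l] is the prior on follower type l;
# U[l][i][j] (resp. V[l][i][j]) is the leader's (resp. type-l follower's)
# payoff when the leader plays row i and the follower plays column j.
p = [0.6, 0.4]
U = [[[2, 4], [1, 3]], [[3, 0], [0, 2]]]
V = [[[1, 0], [0, 1]], [[0, 2], [1, 0]]]

def best_response(x, Vl):
    # Each follower type best-responds to x with a pure strategy.
    return max(range(2), key=lambda j: sum(x[i] * Vl[i][j] for i in range(2)))

def bayesian_leader_utility(x):
    # u(x, j) = sum over types l of p^l times the leader's utility vs type l.
    total = 0.0
    for pl, Ul, Vl in zip(p, U, V):
        j = best_response(x, Vl)
        total += pl * sum(x[i] * Ul[i][j] for i in range(2))
    return total

print(round(bayesian_leader_utility([0.5, 0.5]), 6))  # 0.6*1.5 + 0.4*1.0 = 1.3
```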
The follower, on the other hand, knows his own type and plays his best response, maximizing his expected utility according to his payoff matrix V^λ after observing the leader's mixed strategy.

The leader's goal is to maximize her expected utility, given the prior probability distribution over follower types and the payoff matrices. Let the vector j = <j^1, ..., j^Λ> denote the pure strategies played by all types of followers, with j^λ representing the pure strategy played by follower type λ. The leader's expected utility for playing mixed strategy x is then defined as u(x, j) = Σ_{λ=1}^{Λ} p^λ u^λ(x, j^λ), where u^λ(x, j^λ) = (U^λ_{j^λ})^T x is the leader's expected utility when facing a follower of type λ.

2.1.2 Strong Stackelberg Equilibrium

The most common solution concept in game theory is the Nash equilibrium, a profile of strategies in which no player can gain by changing only his or her own strategy. In other words, each player plays his or her best response, assuming that the other players also best respond by maximizing their expected utility. In a Stackelberg game, the solution concept most commonly adopted is the Strong Stackelberg Equilibrium (SSE). Besides the mutual best-response property entailed by a Nash equilibrium, an SSE also assumes that the adversary breaks ties in favor of the defender. Most existing algorithms for solving Stackelberg security games adopt the concept of SSE [Paruchuri et al., 2008; Kiekintveld et al., 2009]. This is because an SSE always exists in all Stackelberg games [Breton et al., 1988]. Furthermore, when ties exist, the leader can always induce the desired outcome by selecting a strategy arbitrarily close to the SSE strategy, against which the follower strictly prefers the desired action [von Stengel and Zamir, 2004]. Formally, a Strong Stackelberg Equilibrium is defined below:

Definition 1.
For a given Bayesian Stackelberg game with payoff matrices (U^1, V^1), ..., (U^Λ, V^Λ) and type distribution p, a pair of strategies (x, j) forms a Strong Stackelberg Equilibrium if and only if:

1. The leader plays a best response: u(x, j(x)) ≥ u(x', j(x')), for all x'.

2. The follower plays a best response: v^λ(x, j^λ(x)) ≥ v^λ(x, j), for all 1 ≤ λ ≤ Λ and all 1 ≤ j ≤ J.

3. The follower breaks ties in favor of the leader: u^λ(x, j^λ(x)) ≥ u^λ(x, j), for all 1 ≤ λ ≤ Λ and all j that are a best response to x as above.

In general, finding the equilibrium of a Bayesian Stackelberg game is NP-hard [Conitzer and Sandholm, 2006].

2.1.3 Stackelberg Security Games

I now introduce a restricted version of the Stackelberg game known as a security game. We consider a Stackelberg Security Game (SSG) [Yin et al., 2010] with a single leader (defender) and at least one follower (attacker). The defender has to protect a set of targets T = {t_1, t_2, ..., t_|T|} from being attacked, using a set of resources. In a security game, a pure strategy of the attacker is defined as attacking a single target, and a pure strategy of the defender is defined as an assignment of all the security resources to the set of targets. An assignment of a security resource to a target is also referred to as covering a target. The defender's strategy set includes all possible assignments of all the resources. The payoffs for both the defender and the attacker depend on which target is attacked, and whether that target is protected (covered) by the defender. Formally, let d and a denote the defender and the attacker, respectively. We use R^d_i to represent the defender's payoff (reward) for covering a target t_i that is attacked, and P^d_i as her payoff (penalty) for not covering that attacked target. Similarly for the attacker, we use P^a_i (penalty) and R^a_i (reward) to represent his payoff for attacking a target t_i when it is covered or uncovered by the defender, respectively.
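For intuition, when resources are interchangeable and unconstrained, the defender's pure strategies are exactly the subsets of targets whose size equals her resource pool. A toy sketch (the instance is invented for illustration):

```python
from itertools import combinations

# Hypothetical instance: 4 targets, 2 interchangeable security resources.
targets = ["t1", "t2", "t3", "t4"]
num_resources = 2

# Defender pure strategy: assign all resources to distinct targets.  With
# no scheduling constraints, that is every size-2 subset of the targets.
defender_pure_strategies = list(combinations(targets, num_resources))
print(len(defender_pure_strategies))  # C(4, 2) = 6

# Attacker pure strategy: attack a single target.
attacker_pure_strategies = list(targets)
print(len(attacker_pure_strategies))  # 4
```

The combinatorial growth of the defender's strategy set (C(|T|, m) for m resources) is precisely what motivates the compact representations and scale-up algorithms discussed later.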
An important feature of the security game is that R^d_i ≥ P^d_i and P^a_i ≤ R^a_i. In other words, adding resources to cover a target benefits the defender and hurts the attacker. In many real-world security problems, there are constraints on assigning the resources. For example, in the FAMS problem [Jain et al., 2010b], an air marshal is scheduled to protect 2 flights (targets) out of M total flights. The total number of possible schedules is C(M, 2). However, not all of the schedules are feasible, since the flights scheduled for an air marshal have to be connected; e.g., an air marshal cannot be on a flight from A to B and then on a flight from C to D. A resource assignment constraint implies that the feasible assignment set A is restricted; not all combinatorial assignments of resources to targets are allowed.

A compact representation of the defender strategy, introduced in [Kiekintveld et al., 2009], uses the probability that each target will be covered by a security resource. The defender's mixed strategy can then be denoted by a vector x = <x_1, ..., x_|T|>, where x_i denotes the probability that target t_i will be covered by a security resource. This compact representation of the defender strategies is provably equivalent to a distribution over the original pure strategies when there are no constraints on assigning the resources [Korzhyk et al., 2010]. In the presence of assignment constraints, such equivalence can usually be maintained by adding a set of linear constraints on x (Ax ≤ b).

Definition 2. We consider a marginal coverage x to be feasible if and only if there exist a_j ≥ 0 for A_j ∈ A such that Σ_{A_j ∈ A} a_j = 1 and, for all i ∈ T, x_i = Σ_{A_j ∈ A} a_j A_ij.

In fact, <a_j> is a mixed strategy over all the feasible assignments of the resources. With this compact representation, efficient algorithms have been developed to compute optimal defender strategies [Kiekintveld et al., 2009; Tsai et al., 2010].
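Definition 2 can be sketched directly: given a mixed strategy <a_j> over feasible assignments (each assignment written as an indicator vector A_j over targets), the induced marginal coverage is x_i = Σ_j a_j A_ij. The toy instance below is invented for illustration:

```python
# Hypothetical instance: 3 targets, 2 resources, every pair of targets is
# a feasible assignment A_j, written as an indicator vector over targets.
assignments = [(1, 1, 0), (1, 0, 1), (0, 1, 1)]

def marginal_coverage(a):
    """Definition 2: x_i = sum_j a_j * A_ij for a mixed strategy a over
    the feasible assignments (a_j >= 0, sum_j a_j = 1)."""
    assert all(aj >= 0 for aj in a) and abs(sum(a) - 1.0) < 1e-9
    n = len(assignments[0])
    return [sum(aj * A[i] for aj, A in zip(a, assignments)) for i in range(n)]

x = marginal_coverage([0.5, 0.3, 0.2])
print(x)        # per-target coverage probabilities [0.8, 0.7, 0.5]
print(sum(x))   # total coverage equals the 2 available resources
```

Note the direction of the definition: any mixed strategy over assignments induces a feasible marginal, whereas recovering some mixed strategy from a given marginal is exactly the feasibility question that assignment constraints complicate.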
I will show more details in Chapter 7 on the benefit of using this compact representation. Given the coverage vector x, the defender's expected utility when the attacker attacks target t_i is calculated using Equation 2.1, and the attacker's expected utility for attacking target t_i is calculated using Equation 2.2:

U^d_i(x) = (1 − x_i) P^d_i + x_i R^d_i    (2.1)
U^a_i(x) = x_i P^a_i + (1 − x_i) R^a_i    (2.2)

2.2 Los Angeles International Airport

While there are a number of security problems where game theory is potentially applicable, in this section I will focus on introducing the security scenario at the Los Angeles International Airport. It is also the basis of my experimental setup, due to its simplicity in constraints, which is ideal for an initial investigation with human subjects. Los Angeles International Airport (LAX) is the fifth busiest airport in the United States, and the largest destination airport in the country [Stevens et al., 2006]. It serves 60-70 million passengers each year [Stevens et al., 2006]. LAX is unfortunately one of the prime terrorist targets on the west coast of the United States, given its importance and the record of multiple attempted attacks by arrested plotters [Stevens et al., 2006]. The Los Angeles World Airports (LAWA) police have designed a security system to protect the airport, which includes vehicular checkpoints, police units patrolling the roads to the terminals and inside the terminals (with canines), and security screening and bag checks for passengers. Unfortunately, there are not enough resources to protect the entire airport all the time, given the size of the airport and the number of passengers. Setting up available checkpoints, canine units, or other patrols on deterministic schedules allows adversaries to learn the schedules and plot attacks that avoid the police checkpoints and patrols, which makes deterministic schedules ineffective.

(a) LAX Checkpoint  (b) Canine Patrol
Figure 2.1: LAX Security
The game-theoretic approach provides a solution for randomizing the allocation of LAWA's limited resources. In particular, the ARMOR system [Pita et al., 2008] was developed based on the security game framework to assist LAWA. Figure 2.1(a) shows a vehicular checkpoint set up on a road inbound towards LAX. Police officers examine cars that drive by, and if any car appears suspicious, they do a more detailed inspection of that car. ARMOR provides a randomized schedule for the LAWA police to set up these checkpoints for a particular time frame. At the same time, ARMOR also generates a random assignment of canines to patrol routes through the terminals inside LAX. Figure 2.1(b) illustrates a canine unit on patrol at LAX.

2.3 Baseline Solvers

The leader's goal in an SSG is to maximize her expected utility, given how the adversary responds to the defender's strategy. The behavioral modeling is done only on the attacker, who faces a decision-theoretic problem given the leader's commitment. Mathematically, the defender's optimal strategy can be computed by solving the following optimization problem:

x* = argmax_x Σ_i q_i(x) U^d_i(x)    (2.3)

where U^d_i(x) is the defender's expected utility if the attacker chooses to attack target t_i, as shown in Equation 2.1, and q_i(x) represents the attacker's response given defender strategy x.

One leading family of algorithms to compute such mixed strategies is DOBSS and its successors [Pita et al., 2008; Kiekintveld et al., 2009], which are used in the deployed ARMOR and IRIS applications. These algorithms follow the assumption of perfect rationality in the adversary's decision making. However, in many real-world domains, agents face human adversaries whose behavior may deviate from this assumption. COBRA [Pita et al., 2010] represents the best available benchmark for determining defender strategies in security games against human adversaries with ε-optimal responses.
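The objective in Equation 2.3 is easy to evaluate once a response model q(x) is fixed. A minimal sketch with invented payoffs, instantiating q(x) as the perfectly rational (point-mass) response:

```python
# Hypothetical 3-target instance; expected utilities follow Eqs. 2.1-2.2:
# U^d_i(x) = (1 - x_i) P^d_i + x_i R^d_i,  U^a_i(x) = x_i P^a_i + (1 - x_i) R^a_i
Rd, Pd = [5, 3, 4], [-2, -1, -3]
Ra, Pa = [4, 2, 6], [-3, -2, -4]

def Ud(x, i): return (1 - x[i]) * Pd[i] + x[i] * Rd[i]
def Ua(x, i): return x[i] * Pa[i] + (1 - x[i]) * Ra[i]

def defender_eu(x, q):
    """The objective of Eq. 2.3: sum_i q_i(x) * U^d_i(x)."""
    return sum(q[i] * Ud(x, i) for i in range(len(x)))

def rational_response(x):
    """Perfectly rational attacker: a point mass on the best target."""
    best = max(range(len(x)), key=lambda i: Ua(x, i))
    return [1.0 if i == best else 0.0 for i in range(len(x))]

x = [0.5, 0.2, 0.3]
print(round(defender_eu(x, rational_response(x)), 6))  # attacker hits t3 -> -0.9
```

The behavioral models developed later in the thesis replace `rational_response` with a stochastic q(x) (e.g. quantal response); the objective itself is unchanged.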
In this section, we describe the computation of the optimal defender strategy against two baseline models of the adversary: a perfectly rational adversary, and an ε-optimal adversary response.

2.3.1 Defender Optimal Strategy against a Perfectly Rational Adversary

The Strong Stackelberg Equilibrium assumes the adversary is perfectly rational, i.e., he strictly maximizes his expected utility. The computation of the defender's optimal strategy can then be formulated as follows (where m is the number of defender resources):

max_{x,q}   Σ_i q_i U^d_i(x)                                (2.4)
s.t.        Σ_{i=1}^{n} x_i ≤ m                             (2.5)
            0 ≤ x_i ≤ 1,  ∀i                                (2.6)
            q_i = 1 if U^a_i(x) ≥ U^a_{i'}(x), ∀i' ≠ i      (2.7)
            Σ_i q_i = 1                                     (2.8)
            q_i ∈ {0, 1},  ∀i                               (2.9)

The objective is to maximize the defender's expected utility, as shown in Equation (2.4). The constraints in Equations (2.7)-(2.9) enforce that the adversary selects the target which maximizes his expected utility. By introducing some auxiliary variables, the above optimization problem can be reformulated as a Mixed-Integer Linear Program (MILP), as shown below:

max_{x,a,d,q}   d                                           (2.10)
s.t.        Σ_{i=1}^{n} x_i ≤ m                             (2.11)
            0 ≤ x_i ≤ 1,  ∀i                                (2.12)
            0 ≤ a − U^a_i(x_i) ≤ M(1 − q_i),  ∀i            (2.13)
            Σ_i q_i = 1                                     (2.14)
            q_i ∈ {0, 1},  ∀i                               (2.15)
            M(1 − q_i) + U^d_i(x_i) ≥ d,  ∀i                (2.16)

The variable a in Equation (2.13) represents the attacker's expected utility. M is a very large constant, which enforces that q_i is set to 1 for the target that leads to the maximum expected utility for the attacker. Similarly, the variable d in the objective function and Equation (2.16) represents the defender's expected utility. The defender's optimal strategy against a perfectly rational adversary can then be computed by solving the above MILP.

2.3.2 Defender Optimal Strategy against the ε-Optimal Adversary Response

The ε-optimal response addresses the bounded rationality of the adversary. It assumes that, instead of strictly maximizing the expected utility, the adversary could deviate to any target with an expected utility within ε of the maximum.
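The following brute-force grid search is merely a stand-in for the MILP above, purely to illustrate what it computes; the two-target instance and all numbers are invented, and a real MILP solver should be used at any scale:

```python
# Naive grid-search illustration of the optimization in Eqs. (2.10)-(2.16).
# Hypothetical 2-target, 1-resource instance; the full budget is spent,
# so the coverage vector is x = (c, 1 - c).
Rd, Pd = [1, 1], [-3, -1]
Ra, Pa = [3, 1], [-1, -1]

def Ud(x, i): return (1 - x[i]) * Pd[i] + x[i] * Rd[i]
def Ua(x, i): return x[i] * Pa[i] + (1 - x[i]) * Ra[i]

def sse_value(x):
    """Attacker best-responds; ties broken in the defender's favor (SSE)."""
    best = max(Ua(x, i) for i in range(2))
    return max(Ud(x, i) for i in range(2) if abs(Ua(x, i) - best) < 1e-9)

best_x, best_v = None, float("-inf")
for k in range(1001):
    x = (k / 1000, 1 - k / 1000)
    v = sse_value(x)
    if v > best_v:
        best_x, best_v = x, v

print(best_x, best_v)  # coverage near (2/3, 1/3); defender value near -1/3
```

On this instance the optimum sits where the attacker is indifferent between the two targets (x_1 = 2/3), the same structure the MILP's big-M constraints recover exactly.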
The computation of the defender's optimal strategy against an ε-optimal response can then be formulated as follows:

max_{x,a,d,q,h}   d                                             (2.17)
s.t.        Σ_{i=1}^{n} x_i ≤ m                                 (2.18)
            0 ≤ x_i ≤ 1,  ∀i                                    (2.19)
            0 ≤ a − U^a_i(x_i) ≤ M(1 − q_i),  ∀i                (2.20)
            Σ_i q_i = 1                                         (2.21)
            q_i ∈ {0, 1},  ∀i                                   (2.22)
            ε(1 − h_i) ≤ a − U^a_i(x_i) ≤ M(1 − q_i) + ε,  ∀i   (2.23)
            h_i ∈ {0, 1},  ∀i                                   (2.24)
            q_i ≤ h_i,  ∀i                                      (2.25)
            M(1 − h_i) + U^d_i(x_i) ≥ d,  ∀i                    (2.26)

Note that the above MILP modifies the MILP in Equations (2.10)-(2.16). The variable h in Equations (2.23)-(2.25) represents the ε-optimal response of the adversary: h_i is set to 1 if the expected utility of attacking target t_i is within ε of a, the maximum expected utility the attacker can achieve.

2.4 Human Subject Experiments

Since my research is focused on addressing the boundedly rational behavior of human adversaries, conducting experiments with human subjects is necessary to evaluate the effectiveness of the proposed approaches. To that end, I conduct my experiments with human subjects using an online labor market, Amazon Mechanical Turk (AMT). AMT has been widely used in behavioral research as a tool to collect data [Mason and Suri, 2012]. There are many advantages to conducting experiments on AMT, including subject pool access, subject pool diversity, and low cost [Reips, 2002; Mason and Suri, 2012]. While conducting experiments with real terrorists is often infeasible, experimental analysis with the general population still points in the right direction and allows me to show how my approaches are expected to perform compared to alternative approaches.

One might argue that the psychiatric profile of terrorists differs so significantly from the general population that terrorists are completely irrational and make no strategic decisions in planning an attack. However, studies show that normalcy is in fact the primary shared characteristic of the psychiatric profile of terrorists [Richardson, 2007; Abrahms, 2008; Gill and Young, 2011].
In fact, they are highly rational and plan their attacks carefully [Richardson, 2007; Rosoff and John, 2009; Keeney and von Winterfeldt, 2010]. The question that follows is whether the perfect rationality assumption is sufficient for modeling the decision making of terrorists. First of all, many studies in economic behavior and cognitive science show that human decision makers suffer from bounded rationality and cognitive limitations. The bounded rationality of human decision makers may be caused by both external and internal factors [Simon, 1956, 1969; Hastie and Dawes, 2001]. On the one hand, the environment may be complicated, and human decision makers might have only limited information about it. On the other hand, humans have limited memory and other cognitive limitations which prevent them from making optimal choices. Indeed, many studies in the literature [Rubinstein, 1998; Camerer, 2003] have shown that human decision makers rely on heuristics in making decisions rather than strictly maximizing expected utility. Furthermore, terrorists sometimes face competing objectives and noisy information [Allison and Zelikow, 1999; Abrahms, 2008], which may lead them to deviate from the optimal strategy.

Additionally, the approaches developed in this thesis using human subject experiments with general populations may be of use beyond the counter-terrorism domain. In many other domains, the criminals are much closer to the general population, such as ticketless travelers in a metro train system, or villagers illegally hunting animals or extracting plants. In general, criminal activities can be broadly broken down into six categories: (i) property crimes, (ii) violent crimes, (iii) sex crimes, (iv) gangs and crime, (v) white-collar occupational crime, and (vi) drugs and crime [Pogrebin, 2012]. The responsibility of different security agencies is to prevent these criminal activities.
The approaches presented in this thesis can potentially help many of these agencies. Given the large range of criminal activities, human criminals also span a wide variety. Therefore, the use of the general population in human subject experiments can be of great value in providing insights into how the proposed approaches might apply across these different domains. In the future, we could further refine the approach for a specific type of criminal. More specifically, by defining the personality and demographic profile of the criminals involved in a specific type of crime, we can evaluate the approach in experiments with human subjects matching that profile. However, the difficulty of obtaining the correct population to examine is general to behavioral studies and might not be completely overcome.

Chapter 3: Related Work

Motivated by real-world security problems, many algorithms have been developed to compute optimal defender strategies in Stackelberg games [Paruchuri et al., 2008; Kiekintveld et al., 2009; Tsai et al., 2010]. The first such algorithm to be used in a real application is DOBSS (Decomposed Optimal Bayesian Stackelberg Solver) [Paruchuri et al., 2008], which is central to the ARMOR system [Pita et al., 2008] at LAX airport and the GUARDS system [Pita et al., 2011] built for the Transportation Security Administration. Other work related to Stackelberg security games includes that of Agmon et al. [Agmon et al., 2008, 2009] and of Gatti et al. [Gatti, 2008b; Basiloco et al., 2009] on multi-robot patrolling. However, an important limitation of all of this work is the assumption of a perfectly rational adversary, which may not hold in many real-world domains.

3.1 Behavioral Game Theory

Behavioral game theory aims at developing models of human decision making in strategic settings.
Many models have been proposed in the psychology and cognitive science literature to capture human bounded rationality in decision making [Train, 2003; McFadden, 1989; Starmer, 2000; Rubinstein, 1998]. A key challenge in applying these models within a game-theoretic framework to help design better strategies is the transition from a (sometimes descriptive) model to a computational model. On the other hand, there has been growing interest in the game theory literature in developing more realistic computational models incorporating human decision making in games [Camerer et al., 2004; Ficici and Pfeffer, 2008; Stahl and Wilson, 1994]. Most of these models find empirical support in data from humans playing games. However, few research efforts have been made to identify which of these models capture the salient features of human decision making in the important area of SSGs. To that end, my work focuses on extending existing models from the literature to apply to SSGs, as well as designing experiments to evaluate the effectiveness of these models with human subjects.

The work most closely related to this thesis is that of Pita et al. [Pita et al., 2010], who develop a new algorithm, COBRA, which provides a solution for designing better defender strategies against human adversaries by considering two factors in human behavior: (i) human deviation from the utility-maximizing strategy, and (ii) human anchoring bias when given limited observations of the defender's mixed strategy. COBRA significantly outperforms the baseline algorithm DOBSS, which assumes perfect rationality of the adversaries, in experiments with human subjects, and is considered the leading contender in addressing human bounded rationality in SSGs. However, COBRA exploits only two aspects of human bounded rationality.
There are many other models proposed in the behavioral game theory and cognitive psychology literature which could potentially be used to model adversary decision making in Stackelberg security games. Thus, it remains an open question whether there are other approaches that allow for fast solutions and outperform COBRA in addressing human behavior in security games.

Outside the area of Stackelberg security games, there have been several recent investigations of human subjects interacting with agents. For example, de Melo et al. [de Melo et al., 2011] investigate the impact of an automated agent's expression of anger or happiness on how a human participant plays the game. In repeated prisoner's dilemma games, agents' expressions are shown to significantly affect human subjects' cooperation or defection. Similarly, Azaria et al. [Azaria et al., 2011] focus on road selection games and on the advice an automated system may provide to human subjects; Peled et al. [Peled et al., 2011] focus on bilateral bargaining games, designing agents that negotiate proficiently with people. Aside from the obvious difference that our focus is on SSGs, another key difference is our focus on efficiently computing optimal mixed strategies for the defender.

3.2 Efficient Computation of Defender Optimal Strategy

A number of algorithms have been developed to compute the optimal defender strategy for massive real-world security problems [Paruchuri et al., 2008; Kiekintveld et al., 2009; Jain et al., 2010a]. In order to scale up the computation for SSGs with a combinatorial number of defender strategies, Kiekintveld et al. [Kiekintveld et al., 2009] exploit the underlying structure of security games and develop efficient algorithms based on a compact representation of the defender strategy.
More specifically, ORIGAMI computes the optimal defender strategy in polynomial time when there are no constraints on assigning the security resources; ERASER-C provides scale-up over ORIGAMI by computing the security coverage per schedule instead of computing the mixed strategy over a joint assignment for all security resources. However, Kiekintveld et al. [Kiekintveld et al., 2009] show that ERASER-C only addresses certain types of constraints on assigning the resources [Korzhyk et al., 2010] and may fail to produce a correct solution when facing arbitrary constraints. Jain et al. [Jain et al., 2010a] then developed a novel algorithm, ASPEN, based on the branch-and-price approach. ASPEN further advanced the state of the art and is able to handle arbitrary resource allocation constraints. In essence, ASPEN avoids representing the full space of defender pure strategies by starting with a small subset of that space and iteratively expanding it until it reaches the optimal solution.

At the same time, many algorithms have been developed to solve Bayesian Stackelberg games. DOBSS [Paruchuri et al., 2008] is the first algorithm developed for computing the optimal defender strategy in a Bayesian Stackelberg game. Jain et al. [Jain et al., 2011b] then proposed a hierarchical methodology that decomposes large Bayesian Stackelberg games into many smaller ones, and provided a framework to use the solutions of these smaller games to efficiently apply branch-and-bound on the original large game. Yin et al. [Yin and Tambe, 2012] further improved the state of the art by combining techniques from artificial intelligence, such as best-first search, and from operations research, such as Bender's decomposition. Unfortunately, all of this previous work assumes a perfectly rational adversary.
Given that most behavioral models imply mathematically complex representations of adversary decision making, it is unclear whether similar techniques can be applied to compute the optimal defender strategy incorporating these models. Therefore, new efficient algorithms need to be developed to address this new computational challenge.

3.3 Robust Defender Strategies

Another line of related work in Stackelberg security games aims to design more robust strategies that deal with different kinds of uncertainty [Aghassi and Bertsimas, 2006; Yin et al., 2011; Kiekintveld et al., 2011]. Yin et al. [Yin et al., 2011] proposed a unified efficient algorithm that addresses both execution uncertainty for the defender and observation uncertainty for the adversaries in SSGs. Kiekintveld et al. [Kiekintveld et al., 2011] address payoff uncertainty by introducing a general model of infinite Bayesian Stackelberg security games which allows payoffs to be represented using continuous payoff distributions. An et al. [An et al., 2012] consider the case where the adversaries do not have perfect surveillance of the defender's strategy; they provide a model of the adversary's belief update about the defender's strategy, as well as a formulation for computing the defender's optimal strategy under such imperfect surveillance. A key difference between this line of work and this thesis is that this work considers robustness against perfectly rational adversaries. Although simulation-based experiments showed promising results for these studies, the performance of these models against real human subjects is left unaddressed.

In order to address adversary bounded rationality, Pita et al. [Pita et al., 2012] introduce a new algorithm, MATCH, which computes a robust strategy for the defender in which the defender's cost is linearly correlated with that of the adversary.
More specifically, MATCH guarantees that if the adversary deviates from his optimal action, the cost of such a deviation to the defender is linearly correlated with the cost to the adversary. MATCH is designed to intentionally avoid explicitly modeling the decision making of the human adversary. It is unclear how such an approach compares to one based on building an explicit model to predict the adversary's decision making.

3.4 Learning Adversary Behavior in Repeated Games

There is a significant body of literature in game theory on learning with incomplete information [Brown, 1951; Sastry et al., 1994; Aumann and Maschler, 1995]. These studies focus on addressing uncertainty in the payoff information of the game in repeated settings via learning; the players are often assumed to be perfectly rational. Within the scope of security games, there has been work on learning attacker payoffs in repeated security games [Letchford et al., 2009; Marecki et al., 2012]. Letchford et al. [Letchford et al., 2009] develop an algorithm to uncover the attacker type in as few rounds as possible. Marecki et al. [Marecki et al., 2012] use Monte Carlo tree search to maximize the defender's utility in the first few rounds. Neither work provides guidance for the defender in the later rounds. In the most recent work [Qian et al., 2014], Qian et al. propose an algorithm combining Gibbs sampling with Monte Carlo tree search for online planning for the defender. In comparison, the work presented in this thesis focuses on learning the behavioral model of the adversaries from past crime data, and also on providing guidance for the defender to adapt her resource allocation strategy based on the updated belief about the model. Furthermore, this thesis addresses the interesting problem of learning the behavioral model from both labeled and unlabeled crime data, similar to the setting of semi-supervised learning [Chapelle et al., 2006].
Chapter 4: Modeling Adversary Decision Making

This chapter introduces my contribution towards moving beyond perfect-rationality assumptions about human adversaries in security games. In order to integrate more realistic models of human decision making into real-world security systems, several key challenges need to be addressed. First, the literature has introduced a multitude of potential models of human decision making [Kahneman and Tversky, 1979; Camerer et al., 2004; McKelvey and Palfrey, 1995; Costa-Gomes et al., 2001], but each of these models has its own set of assumptions and there is little consensus on which model is best for different types of domains. Therefore, there is an important empirical question of which model best represents the salient features of human behavior in the important class of applied security games. Second, integrating any of the proposed models into a decision-support system (even for the purpose of empirically evaluating the model) requires developing new algorithms for computing solutions to Stackelberg security games, since most existing algorithms are based on mathematically optimal attackers [Paruchuri et al., 2008; Kiekintveld et al., 2009]. One notable exception is COBRA, developed by Pita et al. [Pita et al., 2010]. COBRA is one example of modeling the bounded rationality of human adversaries by taking into account (i) the anchoring bias of humans when interpreting the probabilities of several events, and (ii) the limited computational ability of humans, which may lead to deviations from their best response. To the best of our knowledge, COBRA is the best performing strategy for Stackelberg security games in experiments with human subjects. Thus, the open question is whether there are other approaches that allow for fast solutions and outperform COBRA in addressing human behavior in security games. This chapter significantly expands the previous work on modeling human behavior in Stackelberg security games.
Section 4.1 presents the new models of adversary decision making based on Prospect Theory and Quantal Response. Following that, Section 4.2 describes the algorithms we developed to compute the optimal defender strategy against these new adversary models. In Section 4.3, we explain the methods we used to set the parameters of the different models. Section 4.4 presents our experimental setup and results.

4.1 Models for Predicting Attacker Behaviors

Existing models of adversary behavior in SSGs have poor performance in predicting the behavior of human adversaries [Pita et al., 2010]. In order to design better defender strategies, better models of adversary decision making need to be developed. In this section, we present three models of the adversary's behavior in SSGs, based on Prospect Theory and Quantal Response Equilibrium. All of the models have key parameters. We describe in the next section our methodology for setting these parameters in each case.

[Figure 4.1: Prospect Theory empirical function forms. (a) The weighting function π(x) = x^γ / (x^γ + (1−x)^γ)^{1/γ}; (b) the value function V(C) = C^α for C ≥ 0 and V(C) = −θ·(−C)^β for C < 0.]

4.1.1 Prospect Theory

Prospect Theory provides a descriptive model of how humans make decisions among risky alternatives; it describes a process of maximizing the 'prospect', defined below, rather than the expected utility. More formally, the prospect of a certain alternative is defined as

Σ_l π(x_l)·V(C_l)    (4.1)

In Equation (4.1), x_l denotes the probability of receiving C_l as the outcome. The weighting function π(·) describes how the probability x_l is perceived by individuals. An empirical function form of π(·) (Equation (4.2)) is shown in Fig. 4.1(a) [Kahneman and Tversky, 1992].
π(x) = x^γ / (x^γ + (1−x)^γ)^{1/γ}    (4.2)

The key concepts captured by the weighting function are that individuals overestimate low probabilities and underestimate high probabilities [Kahneman and Tversky, 1979, 1992]. Also, π(·) is not consistent with the definition of probability, i.e., π(x) + π(1−x) ≤ 1 in general. The value function V(C_l) in Equation (4.1) reflects the value of the outcome C_l. PT predicts that individuals are risk averse regarding gains but risk seeking regarding losses, implying an S-shaped value function [Kahneman and Tversky, 1979, 1992]. A key component of Prospect Theory is the reference point: outcomes lower than the reference point are considered losses, and higher ones gains.

V(C) = C^α,          C ≥ 0
V(C) = −θ·(−C)^β,    C < 0    (4.3)

Equation (4.3) is a general form for the value function, where C is the outcome relative to the reference point; in Equation (4.3), we assume the reference point to be at 0. α and β determine the extent of non-linearity in the curves. If α = 1.0 and β = 1.0, the function would be linear; typical values for both α and β are 0.88 [Kahneman and Tversky, 1992]. θ captures the idea that the loss curve is usually steeper than the gain curve; a typical value of θ is 2.25 [Kahneman and Tversky, 1992], which reflects the finding that losses are a little more than twice as painful as gains are pleasurable. The function is also displayed in Fig. 4.1(b) [Kahneman and Tversky, 1992]. Given these parameters, we will henceforth denote this value function by V_{α,β,θ}.

In an SSG, the prospect of attacking target t_i for the adversary is computed as

prospect(t_i) = π(x_i)·V_{α,β,θ}(P^a_i) + π(1−x_i)·V_{α,β,θ}(R^a_i)    (4.4)

According to Prospect Theory, subjects will choose the target with the highest prospect. Thus,

q_i = 1, if prospect(t_i) ≥ prospect(t_{i′}), ∀t_{i′} ∈ T
q_i = 0, otherwise    (4.5)

4.1.2 Quantal Response

Quantal Response is an important solution concept in behavioral game theory [McKelvey and Palfrey, 1995].
It is based on a long history of work on single-agent problems and brings that work into a game-theoretic setting [Stahl and Wilson, 1994; Wright and Leyton-Brown, 2010]. It assumes that instead of strictly maximizing utility, individuals respond stochastically in games: the chance of selecting a non-optimal strategy increases as the cost of such an error decreases. Given the strategy profile of all the other players, the response of a player is modeled as a quantal response (QR model): he/she selects action i with probability

q_i(x) = e^{λ·U^a_i(x)} / Σ_{t_k∈T} e^{λ·U^a_k(x)}    (4.6)

where U^a_i(x) is the expected utility for the attacker of selecting pure strategy i. Here, λ ∈ [0, ∞) is the parameter that captures the rationality level of the player: one extreme case is λ = 0, when the player plays uniformly at random; the other extreme case is λ → ∞, when the quantal response is identical to the best response. Combining Equations (4.6) and (2.2),

q_i(x) = e^{λ·R^a_i} e^{−λ·(R^a_i − P^a_i)·x_i} / Σ_{t_k∈T} e^{λ·R^a_k} e^{−λ·(R^a_k − P^a_k)·x_k}    (4.7)

In applying the QR model to the security game domain, we only consider noise in the response of the adversary. The defender uses a computer decision support system to choose her strategy and hence is able to compute the optimal strategy; moreover, since the attacker first observes the defender's strategy and then decides his response, adding noise to the defender's strategy can only hurt her. Recent work [Wright and Leyton-Brown, 2010] shows Quantal Level-k [Stahl and Wilson, 1994] to be best suited for predicting human behavior in simultaneous-move games. The key idea of level-k is that humans can perform only a bounded number of iterations of strategic reasoning: a level-0 player plays randomly, and a level-k (k ≥ 1) player best responds to the level-(k−1) player. We applied QR instead of Quantal Level-k to model the attacker's response because in Stackelberg security games the attacker observes the defender's strategy, so level-k reasoning is not applicable.
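To make the QR response concrete, the following minimal Python sketch computes the distribution of Equation (4.6). The payoff numbers are invented for illustration, and `quantal_response` is an illustrative helper, not part of any deployed system described in this thesis.

```python
import math

def attacker_eu(x_i, reward, penalty):
    # Adversary expected utility at a target: U^a_i(x) = x_i * P^a_i + (1 - x_i) * R^a_i
    return x_i * penalty + (1.0 - x_i) * reward

def quantal_response(x, rewards, penalties, lam):
    # q_i(x) = exp(lam * U^a_i) / sum_k exp(lam * U^a_k)   (Equation 4.6)
    utils = [attacker_eu(xi, r, p) for xi, r, p in zip(x, rewards, penalties)]
    m = max(utils)  # subtract the max utility for numerical stability
    exps = [math.exp(lam * (u - m)) for u in utils]
    z = sum(exps)
    return [e / z for e in exps]
```

With lam = 0 the distribution is uniform; as lam grows, the probability mass concentrates on the target with the highest attacker expected utility, approaching the best response.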
4.1.3 Quantal Response with Rank-related Expected Utility

We modify the Quantal Response model by taking into consideration the fact that individuals are attracted to extreme events, such as the least uncertain or the highest payoff. This idea is inspired by the rank-dependent Expected Utility model [Diecidue and Wakker, 2001], in which the utilities of choosing different alternatives are based on their ranks. We adapt this idea to security games, but we only consider such an effect on the target covered with minimum resources. That is, the adversary would prefer the target covered with minimum resources, since he is most likely to be successful attacking that target. This could significantly reduce the defender's reward in the case when this target with fewest resources also gives a large penalty to the defender.

We modify the QR model by adding extra weight to the target covered with minimum resources. We refer to this modified model as the Quantal Response with Rank-related expected Utility (QRRU) model, where the probability that the attacker attacks target t_i is computed as

q_i(x) = e^{λ_u·U^a_i(x_i)} e^{λ_s·S_i(x)} / Σ_{t_k∈T} e^{λ_u·U^a_k(x_k)} e^{λ_s·S_k(x)}    (4.8)

where S_i(x) ∈ {0, 1} indicates whether t_i is covered with the least resources:

S_i(x) = 1, if x_i ≤ x_{i′}, ∀t_{i′} ∈ T
S_i(x) = 0, otherwise    (4.9)

The denominator in Equation (4.8) only normalizes the probability distribution so that all the q_i sum up to 1. In the numerator, we have two terms deciding the probability that target t_i will be chosen by the adversary. The first term, e^{λ_u·U^a_i(x_i)}, relates to the expected utility for the adversary of choosing target t_i. U^a_i(x_i) is computed as in Equation (2.2). The parameter λ_u ≥ 0 represents the level of error in the adversary's computation of the expected utility, and is equivalent to λ in Equation (4.6). The second term, e^{λ_s·S_i(x)}, relates to the adversary's preference for the least covered target.
Note that if t_i is not covered with the minimum resources, this term equals 1, so no extra weight is added to non-minimally covered targets; if t_i is covered with the minimum resources, this term equals e^{λ_s} ≥ 1, adding extra weight to the probability that the adversary will choose t_i. The parameter λ_s ≥ 0 represents the level of the adversary's preference for the minimally covered target. λ_s = 0 indicates no preference for the minimally covered target; as λ_s increases, this preference becomes stronger.

4.2 Computing Optimal Defender Strategy

Given the new models of adversary behavior in SSGs, new algorithms need to be developed to compute the optimal defender strategy, since the existing algorithms are based on the assumption of a perfectly rational adversary. We now describe efficient computation of the optimal defender mixed strategy assuming a human adversary whose response follows one of the three models we proposed: Prospect Theory (PT-adversary), Quantal Response (QR-adversary), or Quantal Response with Rank-related Utility (QRRU-adversary).

4.2.1 Computing against a PT-adversary

Assuming that the adversary's response follows Prospect Theory (PT-adversary), we developed two methods to compute the optimal defender strategy.

4.2.1.1 BRPT

Best Response to Prospect Theory (BRPT) is a mixed-integer programming formulation for computing the optimal leader strategy against players whose responses follow a PT model. We first present an abstract version of our formulation of BRPT in Equations (4.10)-(4.14), and then present a more detailed operational version in Equations (4.15)-(4.27) that uses a piecewise linear approximation to provide the BRPT MILP (Mixed Integer Linear Program).

max_{x,q,a,d}  d    (4.10)
s.t.  Σ_{i=1}^{n} x_i ≤ M    (4.11)
      Σ_{i=1}^{n} q_i = 1,  q_i ∈ {0, 1}    (4.12)
      0 ≤ a − (π(x_i)·V(P^a_i) + π(1−x_i)·V(R^a_i)) ≤ K(1−q_i),  ∀i    (4.13)
      K(1−q_i) + (x_i·R^d_i + (1−x_i)·P^d_i) ≥ d,  ∀i    (4.14)

The objective is to maximize d, the defender's expected utility.
Equation (4.11) enforces the constraint on the total amount of resources. In Equation (4.12), the integer variables q_i represent the attacker's pure strategy. In BRPT, q_i is constrained to be a binary variable since, as justified and explained in [Paruchuri et al., 2008], we assume the adversary has a pure-strategy best response: q_i = 1 if t_i is attacked and 0 otherwise. Equation (4.13) is the key to deciding the attacker's strategy, given a defender's mixed strategy x = ⟨x_i⟩. The variable a represents the attacker's 'benefit' of choosing a pure strategy ⟨q_i⟩. Since we are modeling the attacker's decision making using Prospect Theory, the benefit perceived by the adversary for attacking target t_i is the attacker's 'prospect', which is calculated as π(x_i)·V(P^a_i) + π(1−x_i)·V(R^a_i), following Equation (4.1). The attacker tries to maximize a by choosing the target with the highest 'prospect', as enforced by Equation (4.13). In particular, the inequality on the left side of Equation (4.13) enforces that a is greater than or equal to the 'prospect' of attacking any target. On the right-hand side of Equation (4.13), we have a constant parameter K with a very large positive value. For targets with q_i = 0, the upper bound on the difference between a and the 'prospect' is K, so the bound is not binding. For the target with q_i = 1 (i.e., the target chosen by the attacker), the value of a is forced to equal the actual 'prospect' of attacking that target. In Equation (4.14), the constant parameter K ensures that d is only constrained by the target that is attacked by the adversary (i.e., q_i = 1).

We now present the BRPT MILP, based on our piecewise linear approximation of the weighting function as discussed earlier. We use the empirical functions introduced in Section 4.1.1 for the weighting function π(·) and the value function V(·).
Let (P^a_i)′ = V(P^a_i) and (R^a_i)′ = V(R^a_i) denote the adversary's values of the penalty P^a_i and reward R^a_i, which are both given as input to the optimization problem in Equations (4.11)-(4.14). The key challenge in solving that optimization problem is that the π(·) function is non-linear and non-convex. If we applied the function directly, we would have to solve a non-linear and non-convex mixed-integer optimization problem, which is difficult. Therefore, we approximately solve the problem by representing the non-linear π(·) function as a piecewise linear function. This transforms the problem into a MILP, which is shown in Equations (4.15)-(4.27).

max_{x,q,a,d,z}  d    (4.15)
s.t.  Σ_{i=1}^{n} Σ_{k=1}^{5} x_{ik} ≤ M    (4.16)
      Σ_{k=1}^{5} (x_{ik} + x̄_{ik}) = 1,  ∀i    (4.17)
      0 ≤ x_{ik}, x̄_{ik} ≤ c_k − c_{k−1},  ∀i, k = 1..5    (4.18)
      z_{ik}·(c_k − c_{k−1}) ≤ x_{ik},  ∀i, k = 1..4    (4.19)
      z̄_{ik}·(c_k − c_{k−1}) ≤ x̄_{ik},  ∀i, k = 1..4    (4.20)
      x_{i(k+1)} ≤ z_{ik},  ∀i, k = 1..4    (4.21)
      x̄_{i(k+1)} ≤ z̄_{ik},  ∀i, k = 1..4    (4.22)
      z_{ik}, z̄_{ik} ∈ {0, 1},  ∀i, k = 1..4    (4.23)
      x′_i = Σ_{k=1}^{5} b_k·x_{ik},   x̄′_i = Σ_{k=1}^{5} b_k·x̄_{ik},  ∀i    (4.24)
      Σ_{i=1}^{n} q_i = 1,  q_i ∈ {0, 1}    (4.25)
      0 ≤ a − (x′_i·(P^a_i)′ + x̄′_i·(R^a_i)′) ≤ K(1−q_i),  ∀i    (4.26)
      K(1−q_i) + Σ_{k=1}^{5} (x_{ik}·R^d_i + x̄_{ik}·P^d_i) ≥ d,  ∀i    (4.27)

Let π̃(·) denote the piecewise linear approximation of the weighting function π(·), as shown in Figure 4.2. We empirically set 5 segments for π̃(·).¹ The function is defined by the endpoints of the linear segments, {c_k | c_0 = 0, c_5 = 1, c_k < c_{k+1}, k = 0,...,5}, and the slopes of the segments, {b_k | k = 1,...,5}.

¹ This piecewise linear representation of π(·) achieves a small approximation error: sup_{z∈[0,1]} |π(z) − π̃(z)| ≤ 0.03.

[Figure 4.2: Piecewise linear approximation of the weighting function π(·), with x_i partitioned into segments x_{i1},...,x_{i5}.]

In order to represent the piecewise linear approximation, i.e.
π̃(x_i) (and simultaneously π̃(1−x_i)), we partition x_i (and 1−x_i) into five segments, denoted by the variables x_{ik} (and x̄_{ik}). Therefore x′_i, which equals π̃(x_i), can be calculated as the sum of the linear functions over the segments:

x′_i = π̃(x_i) = Σ_{k=1}^{5} b_k·x_{ik}

which is shown in Equation (4.24). At the same time, we enforce the correctness of the partitioning of x_i (and 1−x_i) by ensuring that segment x_{ik} (and x̄_{ik}) is positive only if the previous segment is used completely. This is enforced in Equations (4.17)-(4.23) using the auxiliary integer variables z_{ik} (and z̄_{ik}): z_{ik} = 0 indicates that the k-th segment of x_i (i.e., x_{ik}) has not been completely used, and therefore the following segments can only be set to 0, and vice versa. Equation (4.24) defines x′_i = π̃(x_i) as the value of the piecewise linear approximation at x_i, and x̄′_i = π̃(1−x_i) as the value of the piecewise linear approximation at 1−x_i.

4.2.1.2 RPT

Robust-PT (RPT) modifies the base BRPT method to account for possible uncertainty in the adversary's choice caused (for example) by imprecise computations [Simon, 1956]. Similar to COBRA, RPT assumes that the adversary may choose any strategy within ε of the best choice, defined here by the prospect of each action. It optimizes the worst-case outcome for the defender among the set of strategies whose prospect for the attacker is within ε of the optimal prospect.

max_{x,h,q,a,d,z}  d    (4.28)
s.t.
      Constraints (4.16)-(4.26)
      Σ_{i=1}^{n} h_i ≥ 1    (4.29)
      h_i ∈ {0, 1},  q_i ≤ h_i,  ∀i    (4.30)
      ε(1−h_i) ≤ a − (x′_i·(P^a_i)′ + x̄′_i·(R^a_i)′) ≤ K(1−h_i) + ε,  ∀i    (4.31)
      K(1−h_i) + Σ_{k=1}^{5} (x_{ik}·R^d_i + x̄_{ik}·P^d_i) ≥ d,  ∀i    (4.32)

We modify the BRPT optimization problem as follows: the first eleven constraints are equivalent to those in BRPT (Equations (4.16)-(4.26)); in Equation (4.29), the binary variable h_i indicates an ε-optimal strategy for the adversary; the ε-optimality assumption is embedded in Equation (4.31), which forces h_i = 1 for any target t_i that leads to a prospect within ε of the optimal prospect a; Equation (4.32) enforces d to be the minimum expected utility for the defender over the targets that lead to an ε-optimal prospect for the attacker. RPT attempts to maximize this minimum for the defender over the ε-optimal targets for the attacker, thus providing robustness against (human) attacker deviations within that ε-optimal set of targets.

4.2.2 Computing against a QR-adversary

Assuming the adversary follows a quantal response (QR-adversary), we now present the algorithm to compute the defender's optimal strategy. Given the quantal response of the adversary, described in Equation (4.7), the best response of the defender is to maximize her expected utility:

max_x  U^d(x) = Σ_{i=1}^{n} q_i(x)·U^d_i(x)

Combined with Equations (4.7) and (2.1), the problem of finding the optimal mixed strategy for the defender can be formulated as

max_x  Σ_{t_i∈T} e^{λR^a_i} e^{−λ(R^a_i−P^a_i)x_i} ((R^d_i−P^d_i)x_i + P^d_i) / Σ_{t_k∈T} e^{λR^a_k} e^{−λ(R^a_k−P^a_k)x_k}    (4.33)
s.t.  Σ_{i=1}^{n} x_i ≤ M    (4.34)
      0 ≤ x_i ≤ 1,  ∀i    (4.35)

Algorithm 1: BRQR
1: opt_g ← +∞
2: for it ← 1,...,IterN do
3:     x^(0) ← randomly generate a feasible starting point
4:     (opt_l, x*) ← Find-Local-Minimum(x^(0))
5:     if opt_g > opt_l then
6:         opt_g ← opt_l;  x_opt ← x*
7:     end
8: end
9: return opt_g, x_opt

Unfortunately, since the objective function in Equation (4.33) is non-linear and non-convex, finding the global optimum is extremely difficult.
Therefore, we focus on methods to find local optima. To compute an approximately optimal strategy against a QR-adversary efficiently, we develop the Best Response to Quantal Response (BRQR) heuristic described in Algorithm 1. We first take the negative of Equation (4.33), converting the maximization problem into a minimization problem. In each iteration, we find a local minimum using the fmincon() function in Matlab, running the Interior Point Algorithm from a given starting point. If there are multiple local minima, then by randomly setting the starting point in each iteration, the algorithm reaches different local minima with non-zero probability. By increasing the number of iterations, IterN, the probability of reaching the global minimum increases. We empirically set IterN to 300 in our experiments.

4.2.3 Computing against a QRRU-adversary

We now present the algorithm to compute the defender's optimal strategy assuming the adversary's behavior follows the QRRU model. The adversary's response under this model is computed as in Equation (4.8). The optimal defender strategy against a QRRU-adversary is computed by solving the following optimization problem:

max_{x,s,x_min}  Σ_{t_i∈T} e^{λ_u R^a_i} e^{−λ_u(R^a_i−P^a_i)x_i} e^{λ_s s_i} ((R^d_i−P^d_i)x_i + P^d_i) / Σ_{t_k∈T} e^{λ_u R^a_k} e^{−λ_u(R^a_k−P^a_k)x_k} e^{λ_s s_k}    (4.36)
s.t.  Constraints (4.34), (4.35)
      x_i − (1−s_i)·K ≤ x_min ≤ x_i,  ∀t_i ∈ T    (4.37)
      Σ_{t_i∈T} s_i = 1    (4.38)
      s_i ∈ {0, 1},  ∀t_i ∈ T    (4.39)

where the integer variables s_i are introduced to represent the function S_i(x) defined in Equation (4.9). In constraint (4.37), K is a constant with a very large value. Constraints (4.37) and (4.38) enforce x_min to be the minimum value among all the x_i; simultaneously, s_i is set to 1 if target t_i has the minimum coverage probability assigned, and to 0 otherwise. The above optimization problem is a non-linear and non-convex mixed-integer programming problem, which is difficult to solve directly.
Therefore, we developed Best Response to a QRRU-Adversary (BRQRRU), an algorithm that iteratively computes the defender's optimal strategy. The iterative approach breaks the mixed-integer non-linear programming problem down into sub-problems without integer variables. In each sub-problem, one of the targets is assumed to be the least covered target. Then, under this constraint, the maximum defender expected utility and the associated defender mixed strategy are computed by solving a non-linear programming problem (similar to BRQR). Finally, the sub-problem generating the highest maximum defender expected utility is identified as the 'actual' optimal solution, and the associated defender mixed strategy is the optimal defender strategy assuming a QRRU-adversary. Algorithm 2 shows the pseudo-code of the algorithm.

Algorithm 2: BRQRRU
1: opt_g ← +∞
2: for t_{i′} ∈ T do
3:     (opt_l, x*) ← Find-Optimal-Defender-Strategy(s_{i′} = 1)
4:     if opt_g > opt_l then
5:         opt_g ← opt_l;  x_opt ← x*
6:     end
7: end
8: return opt_g, x_opt

In each iteration, one target t_{i′} is conditioned to be covered with minimum resources, so s_{i′} = 1. This reduces the optimization problem to the following:

max_x  Σ_{t_i∈T} e^{λ_u R^a_i} e^{−λ_u(R^a_i−P^a_i)x_i} e^{λ_s s_i} ((R^d_i−P^d_i)x_i + P^d_i) / Σ_{t_k∈T} e^{λ_u R^a_k} e^{−λ_u(R^a_k−P^a_k)x_k} e^{λ_s s_k}    (4.40)
s.t.  Constraints (4.34), (4.35)
      x_{i′} ≤ x_i,  ∀t_i ∈ T    (4.41)

where no integer variables are involved, since the s_i, ∀t_i ∈ T, are all pre-defined parameters of the optimization problem. Therefore, we can solve it using the same method of local search with random restarts as in BRQR: Find-Optimal-Defender-Strategy(s_{i′} = 1) on Line 3 of Algorithm 2 calls Algorithm 1 to solve the optimization problem in Equations (4.40)-(4.41).

4.3 Parameter Estimation

In this section, we describe our methodology for setting the values of the parameters of the different models of human behavior introduced in the previous section.
We set the parameters for our later experiments using data collected in a preliminary set of experiments with human subjects playing the online game introduced in Section 4.4.1. We posted the game on Amazon Mechanical Turk as a Human Intelligence Task (HIT) and asked subjects to play the game. Subjects played the role of the adversary and were able to observe the defender's mixed strategy (i.e., the randomized allocation of security resources). In order to avoid non-compliant participants, we only allowed workers whose HIT approval rates were greater than 95% and who had more than 100 approved HITs to participate in the experiment.

Let G denote a game instance, which is a combination of a payoff structure {(R^a_i, P^a_i, R^d_i, P^d_i), t_i ∈ T} and a defender strategy x. Given a game instance G, we denote the choice of the j-th subject as τ_j(G) ∈ T. We include seven payoff structures in the experiments: four are selected using a classification method we explain in detail in Section 4.3.1; the other three are taken directly from Pita et al. [Pita et al., 2010]. For each payoff structure we tested five different defender strategies, resulting in 7 × 5 = 35 different game instances. Each subject played all 35 games. In total, 80 subjects participated in the preliminary experiment.

4.3.1 Selecting Payoff Structures

Even for a restricted class of games such as security games, there are an infinite number of possible game instances, depending on the specific values of the payoffs for each of the targets. Since we cannot conduct experiments on every possible game instance, we need a method to select a set of payoff structures to use in our experiments.
Our main criteria for selecting payoff structures are (1) to select a diverse set of payoff structures that cover different regions of the space of possible security games, and (2) to select payoff structures that will differentiate between the different behavioral models (in other words, the models should make different predictions in different test conditions). In the first round, our goal was to select game instances that would distinguish between the three key families of prediction methods (BRPT, RPT, BRQR). In the second round of selection, we needed to further differentiate within the families. Since there is not yet a well-understood method in the literature for selecting such game instances, we introduce a procedure for making such selections below.

We first randomly sample 1000 different payoff structures, each with 8 targets. R^a_i and R^d_i are integers drawn from [1, 10]; P^a_i and P^d_i are integers drawn from [−10, −1]. This scale is similar to the payoff structures used in [Pita et al., 2010]. We then use k-means clustering to group the 1000 payoff structures into four clusters based on eight features, which are defined in Table 4.1. Intuitively, features 1 and 2 describe how good the game is for the adversary, features 3 and 4 describe how good the game is for the defender, and features 5-8 reflect the level of conflict between the two players, in the sense that they measure the ratio of one player's gain over the other player's loss.

Table 4.1: A-priori defined features
Feature 1: mean(|R^a_i / P^a_i|)    Feature 2: std(|R^a_i / P^a_i|)
Feature 3: mean(|R^d_i / P^d_i|)    Feature 4: std(|R^d_i / P^d_i|)
Feature 5: mean(|R^a_i / P^d_i|)    Feature 6: std(|R^a_i / P^d_i|)
Feature 7: mean(|R^d_i / P^a_i|)    Feature 8: std(|R^d_i / P^a_i|)

[Figure 4.3: Payoff structure clusters, shown as a projection of the 1000 payoff structures onto the first two PCA components; the four clusters and the locations of payoff structures 1-7 are marked.]

In Fig.
4.3, all 1000 payoff structures are projected onto the first two Principal Component Analysis (PCA) dimensions for visualization. The three payoff structures (5-7) that were first used in Pita et al. [Pita et al., 2010] are marked in Fig. 4.3. All three of these payoff structures belong to cluster 3, indicating that the game instances used in the previous experiments were all similar in terms of the features we used for classification.²

To select specific payoff structures from these clusters, we first generated five defender strategies based on the following families of algorithms: DOBSS, COBRA, BRPT, RPT and BRQR. Here we select only one algorithm from each family (e.g., only one version of BRQR). At this point we did not have preliminary data to set the parameters of the algorithms, since we were still deciding which payoff structures to test on. Instead, we set the parameters as follows: DOBSS has no parameters; for COBRA we use parameters drawn from [Pita et al., 2010]; BRPT and RPT use the empirical parameter settings for Prospect Theory [Kahneman and Tversky, 1992]; BRQR uses λ = 0.76, which we set using the data reported in [Pita et al., 2010] (via the method described in Section 4.3.3). We use the following criteria to select payoff structures that differentiate among the different families of algorithms:

We define the distance between two mixed strategies, x^k and x^l, using the symmetrized Kullback-Leibler divergence: D(x^k, x^l) = D_KL(x^k ‖ x^l) + D_KL(x^l ‖ x^k), where D_KL(x^k ‖ x^l) = Σ_{i=1}^{n} x^k_i log(x^k_i / x^l_i). For each payoff structure, D(x^k, x^l) is measured for every pair of strategies; with five strategies, we have 10 such measurements.

² In [Pita et al., 2010], there were four payoff structures used, but we only use three of those here. The fourth payoff structure is a zero-sum game, and the deployed Stackelberg security games have not been zero-sum [Pita et al., 2008; Tsai et al., 2009].
Furthermore, in zero-sum games, the defender strategies computed by DOBSS, COBRA and MAXIMIN collapse into one: they turn out to be identical.

We remove payoff structures that have a mean or minimum of these 10 quantities below a given threshold. This results in a subset of about 250 payoff structures in total across the four clusters. We then select, from each of these subsets, the one payoff structure closest to the cluster center. The four payoff structures (1-4) we selected from the different clusters are marked in Fig. 4.3.

4.3.2 Parameter Estimation for Prospect Theory

An empirical setting of the parameter values is suggested in the literature [Kahneman and Tversky, 1992], based on various experiments conducted with human subjects. We also include this setting of parameter values in our experiments, to evaluate the benchmark performance of prospect theory. At the same time, we provide a method to estimate the parameter values for the PT model using a set of empirical response data collected for the SSG domain. In this section, we describe our method of estimating the parameter values using grid search.

The empirical functions we use in the PT model for the adversary have four parameters that must be specified: α, β, θ, γ, as shown in Equations (4.2) and (4.3). Varying the values of these four parameters changes the responses predicted by the PT model. We denote the weighting and value functions as π_γ(·) and V_{α,β,θ}(·) for a given set of parameter values. We then define the fit of a parameter setting to a given data set of subjects' choices as the percentage of subjects who choose the target predicted by the model. The fit can be computed as

Fit(α, β, θ, γ | G) = (1/N) Σ_{j=1..N} q_{τ_j(G)}(α, β, θ, γ | G) = Σ_{t_i∈T} (N_i / N)·q_i(α, β, θ, γ | G)

where q_i(·) ∈ {0, 1} indicates whether the PT model predicts target t_i to be chosen by the subjects and is computed using Equation (4.5), N_i is the number of subjects who choose target t_i, and N = Σ_{t_i∈T} N_i is the total number of subjects.
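As a concrete illustration of the Fit computation, the following Python sketch implements the empirical weighting and value functions of Section 4.1.1 and scores one candidate parameter setting on a single game instance. All data values and helper names here are hypothetical, invented only for illustration.

```python
import math

def weight(p, gamma):
    # pi(p) = p^gamma / (p^gamma + (1-p)^gamma)^(1/gamma)   (Equation 4.2)
    num = p ** gamma
    return num / (num + (1.0 - p) ** gamma) ** (1.0 / gamma)

def value(c, alpha, beta, theta):
    # S-shaped value function with reference point 0        (Equation 4.3)
    return c ** alpha if c >= 0 else -theta * (-c) ** beta

def prospect(x_i, r_a, p_a, alpha, beta, theta, gamma):
    # prospect(t_i) = pi(x_i) V(P^a_i) + pi(1-x_i) V(R^a_i) (Equation 4.4)
    return (weight(x_i, gamma) * value(p_a, alpha, beta, theta)
            + weight(1.0 - x_i, gamma) * value(r_a, alpha, beta, theta))

def fit(x, r_a, p_a, counts, alpha, beta, theta, gamma):
    # Fraction of subjects choosing the single target the PT model predicts:
    # q_i = 1 only for the highest-prospect target          (Equation 4.5)
    pros = [prospect(xi, r, p, alpha, beta, theta, gamma)
            for xi, r, p in zip(x, r_a, p_a)]
    best = pros.index(max(pros))
    return counts[best] / float(sum(counts))
```

In a grid search, `fit` would simply be summed over all 35 game instances for each candidate (α, β, θ, γ) combination.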
We estimate the parameter setting with the best fit for the PT model by maximizing the fit function over all 35 game instances:

max_{α,β,θ,γ}  Σ_G Fit(α, β, θ, γ | G)    (4.42)
s.t.  0 < α, β < 1,  θ ≥ 1,  0 < γ < 1    (4.43)

The constraints in (4.43) restrict the feasible ranges of all four parameters, as defined in the prospect theory model. The objective function in Equation (4.42) cannot be expressed in closed form in α, β, θ and γ. Without a closed form, it is difficult to apply gradient descent or any other analytical search algorithm to find the optimal solution. Therefore, we use grid search [Sen and Stoffa, 1995; Becsey et al., 1968] to solve the problem, as follows:

(1) We first uniformly sample a set of values for each parameter across its feasible range, with the following grid intervals: Δα = 0.05, Δβ = 0.05, Δγ = 0.05, and Δθ = 0.1. This gives a set of candidate values for each of the four parameters. For simplicity, we represent the four sets of sampled values as {α_{k1} = α_l + k1·Δα}, where α_l is the lower bound of the range, and similarly {β_{k2} = β_l + k2·Δβ}, {γ_{k3} = γ_l + k3·Δγ}, and {θ_{k4} = θ_l + k4·Δθ}. The feasible range of θ does not have an upper bound, so we set it to 5, which is twice the suggested empirical value [Kahneman and Tversky, 1992].

(2) In total, we have 20 × 20 × 20 × 40 = 320k different combinations of the four parameter values. We then evaluate the objective function at each combination (α_{k1}, β_{k2}, θ_{k4}, γ_{k3}) and take the parameter combination with the best aggregate fit as the solution:

(α*, β*, θ*, γ*) = argmax_{k1,k2,k3,k4} Σ_G Fit(α_{k1}, β_{k2}, θ_{k4}, γ_{k3} | G)

The parameter settings estimated using this method are (α*, β*, θ*, γ*) = (1.0, 0.6, 2.2, 0.6).

4.3.3 Parameter Estimation for the QR Model

We now explain how we estimate the parameter λ for the Quantal Response model (QR model). The parameter λ in the QR model represents the level of noise in the adversary's response function. We employ Maximum Likelihood Estimation (MLE) to fit λ using the data we collected.
Given a game instance $G$ and $N$ samples of the subjects' choices $\{\Gamma_j(G), j = 1..N\}$, the likelihood of $\lambda$ is

$$L(\lambda \mid G) = \prod_{j=1..N} q_{\Gamma_j}(\lambda \mid G)$$

where $\Gamma_j \in T$ denotes the target attacked by the $j$th player and $q_{\Gamma_j}(\lambda \mid G)$ can be computed by Equation (4.7). For example, if player $j$ attacks target $t_3$ in game $G$, we would have $q_{\Gamma_j}(\lambda \mid G) = q_3(\lambda \mid G)$. Furthermore, the log-likelihood of $\lambda$ is

$$\log L(\lambda \mid G) = \sum_{j=1}^{N} \log q_{\Gamma_j(G)}(\lambda \mid G) = \sum_{t_i \in T} N_i \log q_i(\lambda)$$

Combining with Equation (4.6),

$$\log L(\lambda \mid G) = \lambda \sum_{t_i \in T} N_i U^a_i(x_i) - N \log\Big(\sum_{t_i \in T} e^{\lambda U^a_i(x_i)}\Big)$$

We learn the optimal parameter setting for $\lambda$ by maximizing the total log-likelihood over all 35 game instances:

$$\max_{\lambda} \sum_{G} \log L(\lambda \mid G) \quad (4.44)$$
$$s.t.\ \lambda \geq 0 \quad (4.45)$$

The objective function in Equation (4.44) is concave, since for each $G$, $\log L(\lambda \mid G)$ is a concave function. This can be demonstrated by showing that the second-order derivative of $\log L(\lambda \mid G)$ is non-positive $\forall G$:

$$\frac{d^2 \log L}{d\lambda^2} = -N\, \frac{\sum_{i<j} \big(U^a_i(x_i) - U^a_j(x_j)\big)^2 e^{\lambda (U^a_i(x_i) + U^a_j(x_j))}}{\big(\sum_i e^{\lambda U^a_i(x_i)}\big)^2} \leq 0$$

Therefore, $\log L(\lambda \mid G)$ has only one local maximum. We use gradient descent to solve the above optimization problem. The MLE of $\lambda$ is $\lambda^* = 0.55$.

4.3.4 Parameter Estimation for the QRRU Model

For the QRRU model, we need to estimate two parameters, $\lambda_u$ and $\lambda_s$, as defined in Equation (4.8). We again apply Maximum Likelihood Estimation, similar to the method for the QR model.
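This two-parameter estimation can be sketched with projected gradient ascent (the corresponding log-likelihood and its Hessian are derived next). The subjective-utility values $S_i$ and the counts below are made up for illustration, and the learning rate is chosen small enough for the concave objective to increase monotonically.

```python
import math

def qrru_log_lik_and_grad(lu, ls, game):
    """Log-likelihood and gradient for one game under the QRRU model,
    with A_i = lu*U_i + ls*S_i and q_i = exp(A_i) / sum_j exp(A_j)."""
    U, S, N_i = game["U"], game["S"], game["N_i"]
    n = sum(N_i)
    A = [lu * u + ls * s for u, s in zip(U, S)]
    m = max(A)  # max-subtraction for numerical stability
    exps = [math.exp(a - m) for a in A]
    z = sum(exps)
    q = [e / z for e in exps]
    ll = sum(ni * a for ni, a in zip(N_i, A)) - n * (m + math.log(z))
    # Gradient = observed feature totals minus model-expected feature totals.
    g_lu = sum(ni * u for ni, u in zip(N_i, U)) - n * sum(qi * u for qi, u in zip(q, U))
    g_ls = sum(ni * s for ni, s in zip(N_i, S)) - n * sum(qi * s for qi, s in zip(q, S))
    return ll, g_lu, g_ls

def fit_qrru(games, steps=5000, lr=0.001):
    """Projected gradient ascent on the concave total log-likelihood;
    the constraints lu >= 0, ls >= 0 are enforced by clipping."""
    lu = ls = 0.5
    for _ in range(steps):
        g1 = g2 = 0.0
        for game in games:
            _, d1, d2 = qrru_log_lik_and_grad(lu, ls, game)
            g1 += d1
            g2 += d2
        lu = max(0.0, lu + lr * g1)
        ls = max(0.0, ls + lr * g2)
    return lu, ls

games = [{"U": [3.0, 1.0, -1.0], "S": [0.2, 0.8, 0.5], "N_i": [18, 14, 8]}]
lu, ls = fit_qrru(games)
print(round(lu, 2), round(ls, 2))
```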
Given a game instance $G$ and the responses of $N$ subjects $\{\Gamma_j(G), j = 1..N\}$, the log-likelihood of a parameter setting $(\lambda_u, \lambda_s)$ is

$$\log L(\lambda_u, \lambda_s \mid G) = \sum_{j=1}^{N} \log q_{\Gamma_j(G)}(\lambda_u, \lambda_s \mid G) = \sum_{t_i \in T} N_i \log q_i(\lambda_u, \lambda_s)$$

Combining with Equation (4.8),

$$\log L(\lambda_u, \lambda_s \mid G) = \lambda_u \sum_{t_i \in T} N_i U^a_i(x_i) + \lambda_s \sum_{t_i \in T} N_i S_i(x) - N \log\Big(\sum_{t_i \in T} e^{\lambda_u U^a_i(x_i) + \lambda_s S_i(x)}\Big)$$

We learn the optimal parameter settings for the QRRU model by maximizing the total log-likelihood over all 35 game instances:

$$\max_{\lambda_u, \lambda_s} \sum_{G} \log L(\lambda_u, \lambda_s \mid G) \quad (4.46)$$
$$s.t.\ \lambda_u \geq 0,\ \lambda_s \geq 0 \quad (4.47)$$

The objective function in Equation (4.46) is a concave function, since $\forall G$ the Hessian matrix of $\log L(\lambda_u, \lambda_s \mid G)$ is negative semi-definite. We include the details of the proof in the appendix and only show here that $\forall \langle \lambda_u, \lambda_s \rangle$,

$$\langle \lambda_u, \lambda_s \rangle\, H(\lambda_u, \lambda_s \mid G)\, \langle \lambda_u, \lambda_s \rangle^T \leq 0$$

where $H(\lambda_u, \lambda_s \mid G)$ is the Hessian matrix of $\log L(\lambda_u, \lambda_s \mid G)$, computed as

$$H(\lambda_u, \lambda_s \mid G) = -N \begin{pmatrix} \frac{\sum_{i<j} (U^a_i - U^a_j)^2 e^{A_i + A_j}}{(\sum_{t_i \in T} e^{A_i})^2} & \frac{\sum_{i<j} (U^a_i - U^a_j)(S_i - S_j) e^{A_i + A_j}}{(\sum_{t_i \in T} e^{A_i})^2} \\ \frac{\sum_{i<j} (U^a_i - U^a_j)(S_i - S_j) e^{A_i + A_j}}{(\sum_{t_i \in T} e^{A_i})^2} & \frac{\sum_{i<j} (S_i - S_j)^2 e^{A_i + A_j}}{(\sum_{t_i \in T} e^{A_i})^2} \end{pmatrix}$$

where $A_i = \lambda_u U^a_i(x_i) + \lambda_s S_i(x)$. Therefore, we can use gradient descent to solve the optimization problem in Equations (4.46) and (4.47). The MLE parameters based on our data set are $(\lambda_u^*, \lambda_s^*) = (0.6, 0.77)$.

[Figure 4.4: Game interface for our simulated online SSG]

4.4 Experimental Results and Discussion

We evaluated the performance of defender strategies as well as the accuracy of different adversary models with human subjects, using an online game, "The Guards and The Treasure", introduced below. We conducted two sets of evaluations: the first set uses the same 7 payoff structures as the experiments in the previous section; the second set focuses on a comparison between the QR model and the QRRU model.
4.4.1 A Simulated Online SSG

We developed a game called "The Guards and The Treasure" to simulate the security model at the LAX airport, which has eight terminals that can be targeted in an attack [Pita et al., 2008]. Fig. 6.1 shows the interface of the game. Players are introduced to the game through a series of explanatory screens describing how the game is played. In each game instance a subject is asked to choose one of the eight gates to open (attack). They are told that guards are protecting three of the eight gates, but not which ones. The defender's mixed strategy, represented as the marginal probability of covering each target, $\langle x_i \rangle$, is given to the subjects. At the same time, the subjects are also told the reward for successfully attacking each target as well as the penalty for getting caught at each target. The three gates protected by the guards are drawn randomly according to the probabilities shown on the game interface. If subjects select a gate protected by the guards, they receive a penalty; otherwise, they receive a reward. Subjects are rewarded based on the reward/penalty shown for each gate. For example, in the game shown in Figure 6.1, the probability that gate 1 (target 1) will be protected by a guard is 0.59. If a subject chooses gate 1, he/she gets a reward of 8 if gate 1 is not protected by a guard, or a penalty of -3 if it is.

4.4.2 Experimental Settings

The design of the simulated game was already described in Section 4.4.1. We now present a detailed description of the experimental settings. In total, we included 70 game instances (comprising 7 payoff structures and 10 strategies for each payoff structure) in the first set and 12 game instances (comprising 4 new payoff structures and 3 strategies for each payoff structure) in the second set. To avoid confusion between these two sets of payoff structures, we will number the first seven payoff structures as 1.1-1.7, and the next four as 2.1-2.4.
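Returning to the game mechanics of Section 4.4.1: the guarded gates are drawn randomly according to the displayed marginals $\langle x_i \rangle$. The thesis does not specify the sampling procedure it used; one standard way to draw exactly three guarded gates with the stated marginals (a sketch under that assumption, with illustrative numbers) is systematic "comb" sampling:

```python
import bisect
import random

def sample_coverage(marginals, rng=random):
    """Draw a set of guarded gates consistent with the displayed marginal
    coverage probabilities via systematic ("comb") sampling. Assumes the
    marginals sum to an integer k (here k = 3 guards) and each marginal is
    below 1; then exactly k distinct gates are drawn, and gate i is guarded
    with probability marginals[i]."""
    k = round(sum(marginals))
    cum, total = [], 0.0
    for m in marginals:
        total += m
        cum.append(total)
    u = rng.random()  # one uniform offset shared by the k comb "teeth"
    # Tooth t = u + m falls in gate i's interval [cum[i-1], cum[i]).
    return [min(bisect.bisect_right(cum, u + m), len(marginals) - 1)
            for m in range(k)]

# Illustrative marginals for 8 gates and 3 guards (they sum to 3).
x = [0.59, 0.30, 0.41, 0.70, 0.20, 0.30, 0.20, 0.30]
print(sample_coverage(x, random.Random(7)))
```

Because each marginal is below 1 and the teeth are spaced exactly 1 apart, no interval can catch two teeth, so the three drawn gates are always distinct.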
Each game instance is played by at least 80 different participants (the actual number of subjects for each game instance ranges between 80 and 91). Each subject is asked to play 40 of the 70 games. For the purpose of a within-subject comparison, we want a subject to play the 10 different strategies for the same payoff structure. Therefore, the 40 games are composed of 4 payoff structures with the 10 defender strategies for each. Furthermore, in order to mitigate the ordering effect on subject responses, we randomize the order of the game instances played by each subject. We generated 40 different orderings of the games using a Latin square design. The order played by each subject was drawn uniformly at random from the 40 possible orderings. To further mitigate the ordering effect, no feedback on success or failure is given to the subjects until the end of the experiment. As motivation, the subjects earn or lose money based on whether or not they succeed in attacking a gate; if a subject opens a gate not protected by the guards, they win; otherwise, they lose.

The participants were recruited on Amazon Mechanical Turk. Note that these participants differ from those who played the game to provide data for estimating the parameters, as discussed in the previous section. In order to avoid non-compliant participants, we only allowed workers whose HIT approval rates were greater than 95% and who had more than 100 approved HITs to participate in the experiment. They were first given detailed instructions explaining how the game is played. Two practice rounds were then provided to help them get familiar with the game. After this learning and practicing, they were given enough time to finish all the games. Each participant first received 50 cents for participating in the game. They then earned a bonus based on the outcomes of the games they played, with each point worth 1 cent.
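The thesis does not describe how the 40 Latin-square orderings were constructed; a cyclic Latin square, sketched below, is one standard construction with the counterbalancing property needed here: every game appears exactly once in every position across the 40 orderings.

```python
def latin_square_orders(n):
    """Row r is the ordering (r, r+1, ..., r+n-1) mod n. Every game index
    appears exactly once in each row (each subject plays every game once)
    and exactly once in each column (each game occupies every position
    exactly once across the n orderings)."""
    return [[(row + col) % n for col in range(n)] for row in range(n)]

orders = latin_square_orders(40)
print(orders[0][:5], orders[1][:5])
```

Each subject would then be assigned one of the 40 rows uniformly at random, as described above.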
On average, the subjects who participated in the first set of experiments (i.e., payoffs 1.1-1.7) received $1.45 as a bonus based on their total scores across the 40 game instances they played; the subjects who participated in the second set of experiments (i.e., payoffs 2.1-2.4) received $0.44 as a bonus based on their total scores across the 12 game instances they played. Participants were given 5 hours in total to finish the experiment, which was shown to be sufficiently long given that the average time they spent was 28 minutes for the first set of 40 games and 8 minutes for the second set of 12 games.

Payoff      1.1    1.2    1.3    1.4    1.5    1.6    1.7
COBRA-α     0.15   0.15   0.15   0.15   0.37   0      0.25
COBRA-ε     2.5    2.9    2.0    2.75   2.5    2.5    2.5
BRPT-E      (α, β, λ, γ) = (0.88, 0.88, 2.25, 0.64)
RPT-E       (α, β, λ, γ) = (0.88, 0.88, 2.25, 0.64), ε = 2.5
BRPT-L      (α, β, λ, γ) = (1, 0.6, 2.2, 0.6)
RPT-L       (α, β, λ, γ) = (1, 0.6, 2.2, 0.6), ε = 2.5
BRQR-76     λ = 0.76
BRQR-55     λ = 0.55
BRQRRU      (λ_u, λ_s) = (0.6, 0.77)
Table 4.2: Parameter settings for different algorithms

In the remainder of this section, we first describe the parameter settings for the different leader strategies. We then provide our experimental results, followed by analysis. We compare both the quality of the different defender strategies against the human participants and the accuracy of the different adversary models, i.e., how well the human participants follow the assumptions of these models.

4.4.3 Algorithm Parameters

For the seven payoff structures (1.1-1.7) introduced in Section 4.3, we tested ten different mixed strategies generated from seven different algorithms: MAXIMIN, DOBSS [Paruchuri et al., 2008], COBRA [Pita et al., 2010], BRPT, RPT, BRQR and BRQRRU. We include MAXIMIN as a benchmark algorithm; MAXIMIN assumes that the adversary always selects the target that is worst for the defender. Table 4.2 lists the parameter settings of these ten strategies for each of the seven payoff structures. DOBSS and MAXIMIN have no parameters.
For COBRA, we set the parameters following the methodology presented in [Pita et al., 2010] as closely as possible for payoff structures 1.1-1.4. In particular, the parameter values we set meet the entropy heuristic discussed in that work. For payoff structures 1.5-1.7, which are identical to payoff structures first used by Pita et al., we use the same parameter settings as in their work.

For both BRPT-E and RPT-E, the parameters for prospect theory are the empirical values suggested in the literature [Kahneman and Tversky, 1992]. For RPT-E, we empirically set ε to 25% of the maximum potential reward for the adversary, which is 10 in our experimental settings. We also tried another set of parameters for prospect theory, learned from our first set of experiments as described in Section 4.3.2. We denote these two algorithms as BRPT-L and RPT-L.

For BRQR, we tried two different values of the parameter λ: λ = 0.76 is the value learned from the data reported by Pita et al. [Pita et al., 2010], and λ = 0.55 is the value learned from the data collected in our first set of experiments with participants from Amazon Mechanical Turk. We will refer to the strategies resulting from these two parameter settings of the BRQR algorithm as BRQR-76 and BRQR-55, respectively. For BRQRRU, the parameters are learned from the data collected in our first set of experiments.

[Figure 4.5: Defender average expected utility achieved by different strategies]

4.4.4 Quality Comparison

We evaluated the performance of different defender strategies using the defender's expected utility, and the statistical significance of our results using the bootstrap-t method [Wilcox, 2003].
4.4.4.1 Average Performance

We first evaluated the average defender expected utility, $U^d_{avg}(x)$, of the different defender strategies based on the subjects' choices:

$$U^d_{avg}(x) = \frac{1}{N} \sum_{j=1}^{N} U^d_{\Gamma_j}(x) = \frac{1}{N} \sum_{t_i \in T} N_i U^d_i(x_i)$$

where $\Gamma_j$ is the target selected by the $j$th subject, $N_i$ is the number of subjects that chose target $t_i$, and $N$ is the total number of subjects. Fig. 4.5 displays $U^d_{avg}(x)$ for the different strategies in each payoff structure. We also display the normalized defender average expected utility of the different strategies within each payoff structure in Figure 4.6. After normalization, $U^d_{avg}(x)$ for each defender strategy varies between 0 and 1, with the highest $U^d_{avg}(x)$ in each payoff structure scaled to 1 and the lowest scaled to 0.

Overall, BRQR-76, BRQR-55 and BRQRRU performed better than the other algorithms. We compare the performance of these three algorithms with each of the other seven algorithms and report the levels of statistical significance in Tables 4.3, 4.4 and 4.5. We summarize the results below:

- MAXIMIN is outperformed by all three algorithms with statistical significance in all seven payoff structures. DOBSS is also outperformed by all three algorithms with statistical significance, except for payoff structure 1.6.

- In five of the seven payoff structures, COBRA is outperformed by all three algorithms with statistical significance. In payoff structure 1.3, the performance of COBRA is very close to the three algorithms, but there is no statistical significance either way. In payoff structure 1.5, COBRA is outperformed by all three algorithms, but no statistical significance is achieved.
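The two quantities plotted in Figures 4.5 and 4.6 can be computed directly from the subjects' choice counts. A small sketch with made-up numbers, using the standard SSG form $U^d_i(x_i) = x_i R^d_i + (1 - x_i) P^d_i$ for the defender's expected utility on target $i$:

```python
def avg_defender_eu(cov, Rd, Pd, N_i):
    """U^d_avg(x) = (1/N) * sum_i N_i * U^d_i(x_i), with
    U^d_i(x_i) = x_i * R^d_i + (1 - x_i) * P^d_i."""
    n = sum(N_i)
    return sum(ni * (x * r + (1 - x) * p)
               for ni, x, r, p in zip(N_i, cov, Rd, Pd)) / n

def normalize(values):
    """Rescale one payoff structure's strategies so the best maps to 1
    and the worst to 0, as in Figure 4.6."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Three (toy) strategies on a 2-target game, differing only in how the
# subjects' choices were distributed across the targets.
raw = [avg_defender_eu([0.6, 0.4], [5, 4], [-2, -3], n)
       for n in ([30, 10], [20, 20], [10, 30])]
print(raw, normalize(raw))
```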
[Figure 4.6: Defender average expected utility (normalized between 0 and 1) achieved by different strategies]

v.s.        DOBSS  MAXIMIN  COBRA  BRPT-E  RPT-E  BRPT-L  RPT-L
payoff 1.1  ***    ***      ***    ***     **     ***     ***
payoff 1.2  ***    ***      ***    ***     ***    ***     0.15
payoff 1.3  ***    ***      0.96   ***     0.21   ***     **
payoff 1.4  ***    ***      *      ***     0.25   ***     ***
payoff 1.5  ***    ***      0.26   ***     0.99   ***     ***
payoff 1.6  0.20   ***      ***    *       ***    0.13    ***
payoff 1.7  ***    ***      **     ***     **     ***     ***
Table 4.3: Level of statistical significance of comparing BRQR-76 to other algorithms: *** (p ≤ 0.01), ** (p ≤ 0.05), * (p ≤ 0.1)

- The three algorithms outperform BRPT-E with statistical significance in all seven payoff structures. Furthermore, BRPT-L is outperformed by the three algorithms in all seven payoff structures, with statistical significance in six cases (the exception being payoff structure 1.6).

- In four of the seven payoff structures, RPT-E is outperformed by the three algorithms with statistical significance. In payoff 1.3, RPT-E is outperformed by all three algorithms, but the result is not statistically significant. In payoff structure 1.4, RPT-E achieves performance very similar to BRQR-55 and is outperformed by BRQR-76 and BRQRRU. In payoff 1.5, RPT-E achieves performance very similar to BRQR-76 and is outperformed by BRQR-55 and BRQRRU. Furthermore, RPT-L is outperformed by all three algorithms with statistical significance in almost all seven payoff structures, except for payoff structure 1.2, where the comparison of BRQR-76 and BRQRRU with RPT-L does not reach statistical significance.

Overall, any of the three quantal response strategies (BRQR-76, BRQR-55 and BRQRRU) would be preferred over the other strategies.
However, the performance of the three strategies is close in this set of experiments. In order to further differentiate the three strategies, as well as to demonstrate the effectiveness of the QRRU model, we conducted a separate set of experiments.

v.s.        DOBSS  MAXIMIN  COBRA  BRPT-E  RPT-E  BRPT-L  RPT-L
payoff 1.1  ***    ***      **     ***     *      ***     ***
payoff 1.2  ***    ***      **     ***     ***    ***     *
payoff 1.3  ***    ***      0.86   ***     0.16   ***     **
payoff 1.4  ***    ***      **     ***     0.95   ***     **
payoff 1.5  ***    ***      0.37   ***     0.12   ***     **
payoff 1.6  0.16   ***      ***    **      ***    0.11    ***
payoff 1.7  ***    ***      ***    ***     ***    ***     ***
Table 4.4: Level of statistical significance of comparing BRQR-55 to other algorithms: *** (p ≤ 0.01), ** (p ≤ 0.05), * (p ≤ 0.1)

v.s.        DOBSS  MAXIMIN  COBRA  BRPT-E  RPT-E  BRPT-L  RPT-L
payoff 1.1  ***    ***      **     ***     *      ***     ***
payoff 1.2  ***    ***      **     ***     ***    ***     0.27
payoff 1.3  ***    ***      0.99   ***     0.27   ***     **
payoff 1.4  ***    ***      **     ***     0.18   ***     ***
payoff 1.5  ***    ***      0.40   ***     0.33   ***     *
payoff 1.6  0.15   ***      ***    **      ***    0.11    ***
payoff 1.7  ***    ***      ***    ***     ***    ***     ***
Table 4.5: Level of statistical significance of comparing BRQRRU to other algorithms: *** (p ≤ 0.01), ** (p ≤ 0.05), * (p ≤ 0.1)

We first select four new payoff structures from the 1000 random samples using the following rules. We first measure the distance between the BRQRRU strategy and each of the other two BRQR strategies using the symmetrized Kullback-Leibler (KL) divergence, $D(x^k, x^l) = D_{KL}(x^k \| x^l) + D_{KL}(x^l \| x^k)$, where $D_{KL}(x^k \| x^l) = \sum_{i=1}^{n} x^k_i \log(x^k_i / x^l_i)$. For each payoff structure, we measure this KL distance for the pair (BRQRRU, BRQR-76) and the pair (BRQRRU, BRQR-55), giving two such measurements per payoff structure. We sort the payoff structures in descending order of the mean of these two distances.
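The symmetrized KL distance used above to rank the payoff structures can be computed as follows (the small $\epsilon$ guarding against zero coverage entries is an implementation detail not specified in the text, and the vectors are illustrative):

```python
import math

def sym_kl(xk, xl, eps=1e-12):
    """D(x^k, x^l) = D_KL(x^k || x^l) + D_KL(x^l || x^k), with
    D_KL(x^k || x^l) = sum_i x^k_i * log(x^k_i / x^l_i)."""
    def kl(p, q):
        return sum(pi * math.log((pi + eps) / (qi + eps))
                   for pi, qi in zip(p, q))
    return kl(xk, xl) + kl(xl, xk)

a = [0.5, 0.3, 0.2]  # toy coverage vectors, not the actual strategies
b = [0.2, 0.3, 0.5]
print(round(sym_kl(a, b), 4))
```

Symmetrization makes the measure independent of which strategy is treated as the reference, which is why the distance for each pair is a single number.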
Among the top 10 payoff structures, we select two payoff structures where the targets assigned the minimum coverage probability by BRQR-76 or BRQR-55 have a large penalty for the defender, and two payoff structures where the defender's penalty on such targets is small. The details of these four payoff structures and the defender strategies are included in the appendix.

We conducted a new set of experiments with human subjects using these four payoff structures and the three QR-model-based strategies for each payoff structure. In total, 4 × 3 = 12 game instances were included in these experiments. Each subject was asked to play all 12 game instances, and 80 subjects were involved.

[Figure 4.7: Defender average expected utility achieved by QR model based strategies]

Figure 4.7 displays the defender average expected utility achieved by the three strategies. We report the statistical significance results in Table 4.6. In payoff structures 2.1 and 2.2, BRQRRU outperforms both BRQR-76 and BRQR-55 with statistical significance. In payoff structures 2.3 and 2.4, the three strategies have very close performance; no statistical significance is found in the results, as reported in Table 4.6.

payoff 2.1   BRQRRU v.s. BRQR-76   ***
             BRQRRU v.s. BRQR-55   **
payoff 2.2   BRQRRU v.s. BRQR-76   **
             BRQRRU v.s. BRQR-55   **
payoff 2.3   BRQR-76 v.s. BRQRRU   0.87
             BRQR-55 v.s. BRQRRU   0.40
payoff 2.4   BRQRRU v.s. BRQR-76   0.97
             BRQR-55 v.s. BRQRRU   0.35
Table 4.6: Statistical significance (**: p ≤ 0.05; ***: p ≤ 0.01)

As noted earlier, a very important feature of payoff structures 2.1 and 2.2, compared to payoff structures 2.3 and 2.4, is that the target covered with minimum resources by BRQR-76 and BRQR-55 (target 3 in both payoff structures 2.1 and 2.2) has a large penalty (-6) for the defender.
In the experiments with payoff structures 2.1 and 2.2, more than 10% of subjects selected these targets (target 3 in payoff structures 2.1 and 2.2) while playing against BRQR-76 or BRQR-55, while no subjects chose this target while playing against BRQRRU; BRQRRU covers these targets with more resources. This is the main reason why BRQRRU significantly outperforms BRQR in payoffs 2.1 and 2.2. In payoffs 2.3 and 2.4, a similar observation holds for the subjects' choices: the targets covered with minimum resources by BRQR-76 and BRQR-55 are selected more frequently than when BRQRRU is played. However, these targets (i.e., target 1 in payoff 2.3 and target 2 in payoff 2.4) have a very small penalty for the defender (-1). Therefore, we do not see significant differences in performance among the different BRQR strategies.

Based on the results of both sets of experiments, we conclude that the stochastic-model-based strategies are superior to their competitors, and that BRQRRU is the preferred strategy among them. In particular, BRQRRU achieves significantly better performance than BRQR when the target covered with minimum resources by BRQR has a potentially large penalty for the defender, and performs similarly to the other stochastic-model-based strategies otherwise.

4.4.4.2 Performance Distribution

We now analyze the distribution of the performance of each defender strategy while playing against different adversaries (subjects). Given a game instance $G$, the defender expected utility achieved by playing strategy $x$ against a subject $j$ is denoted $U^d_{G,j}(x)$. Figures 4.8 and 4.9 display the distribution of $U^d_{G,j}(x)$ for the different defender strategies against individual subjects in each payoff structure. The y-axis shows the range of the defender's expected utility against all the different subjects.
Each box with the extended dashed line in the figures shows the distribution of the defender expected utility for each of the ten defender strategies: the dashed line specifies the range of $U^d_{G,j}(x)$, with the bottom band showing the minimum value and the top band showing the maximum value; the box spans the 25th to 75th percentiles of $U^d_{G,j}(x)$, with the bottom showing the 25th percentile value and the top the 75th; and the band inside the box specifies the median (50th percentile) of $U^d_{G,j}(x)$. We compare the distributions of the different defender strategies from two perspectives:

Range: As presented in Figures 4.8 and 4.9, in general the defender expected utility has the smallest range when the MAXIMIN strategy is played (except that in payoff structure 1.7, the range of the defender expected utility when RPT-L is played is slightly smaller than when MAXIMIN is played). COBRA, RPT, BRQR and BRQRRU lead to a larger range of defender expected utility than MAXIMIN. The defender expected utility has the largest range when DOBSS or BRPT is played.

Worst Case: The lower band of the dashed line indicates the worst-case defender expected utility when the different strategies are played. MAXIMIN has the highest worst-case defender expected utility in general (except that in payoff 1.5, the worst-case defender expected utility from playing BRQR-76 is better than that from playing MAXIMIN). DOBSS and BRPT lead to the lowest worst-case defender expected utility. The worst-case defender expected utility from playing COBRA, RPT, BRQR and BRQRRU falls between the two extremes. Furthermore, BRQR and BRQRRU lead to higher worst-case defender expected utility than COBRA and RPT.
[Figure 4.8: Distribution of defender's expected utility against each individual subject (payoffs 1.1-1.4)]

[Figure 4.9: Distribution of defender's expected utility against each individual subject (payoffs 1.5-1.7)]

In general, by playing MAXIMIN, the defender expected utility against each individual adversary achieves the smallest variance; hence it is most robust to uncertainty in the adversary's choice. However, it does so by assuming that the adversary could select any target, thereby making the expected utility on each target equal. MAXIMIN does not exploit the different preferences the adversary may have among the targets. BRPT and DOBSS assume the subjects select the target that maximizes their expected utility and do not consider the possibility of deviations from the optimal choice by the adversary. This leads to arbitrarily lower defender expected utility when the adversary deviates from the predicted choice.

COBRA, RPT, BRQR and BRQRRU all try to be robust against such deviations. BRQR and BRQRRU allow some (possibly very small) probability of the adversary attacking any target, using a soft-max function.
In contrast, COBRA and RPT separate the targets into two groups, the ε-optimal set and the non-ε-optimal set, using a hard threshold. They then try to maximize the worst case for the defender assuming the response will be in the ε-optimal set, but assign fewer resources to the non-ε-optimal targets. When the non-ε-optimal targets have high defender penalties, COBRA and RPT become vulnerable to the adversary's deviations. For example, target 6 in payoff structure 1.2 has a small reward (1) and a large penalty (-10) for the attacker. Both COBRA and RPT consider this target to be in the non-ε-optimal set and assign a very small probability (≤ 0.05) to covering it. However, approximately 10% of the subjects chose this target. Since this target has a high defender penalty (-6), COBRA and RPT lose reward on this target. Similar examples include target 5 in payoff structure 1.4 and target 8 in payoff structure 1.1.

4.4.5 Model Prediction Accuracy

In this section, we evaluate how well each model predicts the actual responses of the human participants using three different metrics [Feltovich, 2000]: mean square deviation (MSD), proportion of inaccuracy (POI), and Euclidean distance (ED). We first extend the definition of MSD from that in [Feltovich, 2000], which is designed for a 2-action game, to suit our domain, where the player has 8 actions to take. Given the choices of the $N$ subjects, the MSD of a model is computed as

$$MSD = \Big\{\frac{1}{N} \sum_{n=1}^{N} \big(p_{\tau(n)} - 1\big)^2\Big\}^{1/2} \quad (4.48)$$

where $\tau(n)$ represents the index of the target chosen by subject $n$, and $p_i$ is the probability predicted by a model that target $i$ will be chosen.

The POI score is meant to put models with deterministic prediction on the same footing as those with stochastic prediction. It treats the target with the highest predicted probability as the predicted target, and computes the proportion of subjects who did not choose the predicted target.
The POI score is computed as

$$POI = \frac{1}{N} \sum_{n=1}^{N} \big(1 - \tilde{p}_{\tau(n)}\big) \quad (4.49)$$

where $\tau(n)$ is the index of the target chosen by subject $n$; $\tilde{p}_{\tau(n)} = 1$ if $\tau(n)$ is the predicted target, and $\tilde{p}_{\tau(n)} = 0$ otherwise. Note that for models with deterministic prediction, the POI score is exactly equal to the square of the MSD value.

The Euclidean distance measures the difference between the actual distribution of the subjects' choices and the prediction of the model. It is computed as

$$ED = \sqrt{\sum_{i \in T} \big(p_i - p^{act}_i\big)^2} \quad (4.50)$$

where $p_i$ is the probability predicted by the model that target $i$ will be chosen, and $p^{act}_i$ is the actual percentage of subjects who chose target $i$.

Table 4.7 presents the ability of the different models to predict the attacker's decision, measured with the three criteria (MAXIMIN does not make a prediction of adversary behavior, so we exclude it from this analysis). The measurements for both the out-of-sample data (70 rounds of games) and the in-sample data (35 rounds of games) are displayed in the table. Better predictive power is indicated by lower MSD, POI and ED values. The top four models all make deterministic predictions, and the three quantal-response-related models make stochastic predictions.

Table 4.7: Ability of behavioral models to predict attacker decision
            Out of sample              In sample
Model     MSD   POI         ED      MSD   POI         ED
DOBSS     0.81  0.67        0.76    0.85  0.73        0.80
PT-E      0.84  0.71        0.81    0.87  0.75        0.84
PT-L      0.84  0.71        0.81    0.86  0.74        0.83
QR-76     0.79  0.67        0.23    0.83  0.73        0.22
QR-55     0.81  0.67        0.22    0.84  0.73        0.21
QRRU      0.80  0.65        0.21    0.83  0.70        0.18
COBRA     0.91  0.83(0.35)  0.94    0.91  0.83(0.42)  0.93
RPT-E     0.93  0.87(0.52)  0.99    0.94  0.88(0.56)  0.99
RPT-L     0.93  0.86(0.49)  0.98    0.93  0.86(0.54)  0.96

The last three models (COBRA, RPT-E and RPT-L) do not have a strict definition of the prediction of the attacker's behavior; they are modifications of the base models for robustness.
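The three metrics in Equations (4.48)-(4.50) can be sketched as follows, with a toy prediction vector and choice list:

```python
import math

def msd(pred, choices):
    """Mean square deviation (Eq. 4.48): pred[i] is the model's predicted
    probability of target i; choices[n] is the target chosen by subject n."""
    return (sum((pred[t] - 1) ** 2 for t in choices) / len(choices)) ** 0.5

def poi(pred, choices):
    """Proportion of inaccuracy (Eq. 4.49): fraction of subjects who did
    not pick the target with the highest predicted probability."""
    top = max(range(len(pred)), key=pred.__getitem__)
    return sum(1 for t in choices if t != top) / len(choices)

def ed(pred, choices):
    """Euclidean distance (Eq. 4.50) between the predicted distribution
    and the empirical distribution of the subjects' choices."""
    n = len(choices)
    emp = [choices.count(i) / n for i in range(len(pred))]
    return math.sqrt(sum((p - e) ** 2 for p, e in zip(pred, emp)))

pred = [0.5, 0.3, 0.2]   # stochastic model prediction over 3 targets
choices = [0, 0, 1, 2]   # targets chosen by 4 subjects
print(msd(pred, choices), poi(pred, choices), ed(pred, choices))
```

For a deterministic model, pred is a one-hot vector, and the POI property noted above (POI equals MSD squared) follows directly from these definitions.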
For example, COBRA modifies DOBSS by assuming that the attacker may deviate from choosing the target with the highest expected utility to any other target whose expected utility is within ε of the highest value. However, within this subset of possibly chosen targets, the model does not explicitly predict the behavior of the attacker, but rather plays a maximin strategy (i.e., maximizing the lowest expected utility). RPT-E and RPT-L modify PT-E and PT-L in similar ways.

Given the above property of these three models, we compute the POI score in two different ways, using two different definitions of the model prediction. The first definition predicts the single target with the lowest expected utility for the defender within the subset of possible deviations; the POI score then counts the proportion of subjects who chose any other target. The second definition predicts all the targets within the subset of possible deviations; the POI score then counts only the targets outside this subset. The POI score computed with the first definition should be equal to or higher than the value computed with the second definition. Note that the second definition does not satisfy the property of a prediction, since the sum of the predictions over all targets might be larger than 1. We use this definition mainly to show the importance of accounting for deviations in the attackers' decisions. The POI values computed with the second definition are shown in parentheses in Table 4.7. The observations from the table are summarized below:

1. For the out-of-sample data, less than 30% of the subjects selected the target predicted by PT-E or PT-L; in other words, more than 70% of the subjects deviated from the prediction. For DOBSS, on average 67% of the subjects deviated from the predicted response. Similar patterns can be observed for the in-sample data.

2.
Both RPT and COBRA take into consideration the deviation of the subjects' responses from their optimal action. The percentage of subjects deviating from the model prediction decreases significantly: for the out-of-sample data, the POI score of COBRA is 0.35 compared to 0.67 for DOBSS; the POI score of RPT-E decreased by 0.19 compared to PT-E; and the POI score of RPT-L decreased by 0.22 compared to PT-L. Similar patterns are observed for the in-sample data.

3. The POI scores of QR-76 and QR-55 are the same as that of DOBSS. This is expected, since the target predicted by the QR model to be chosen with the highest probability is the target with the highest expected utility for the attacker, which is also the prediction of DOBSS. In other words, QR-76 and QR-55 have the same predicted target as DOBSS. At the same time, QRRU has the lowest POI score among all the models in both the out-of-sample and in-sample data. The MSD scores of the three QR-related models are better (lower) than those of the other models (except that in the out-of-sample data QR-55 has the same score as DOBSS).

4. The advantage of the three QR-related models is most significant under the ED score, which represents the error of the model in predicting the distribution of the subjects' choices. As shown in Table 4.7, the three QR-related models have significantly lower ED scores than the other models. This is essentially the reason why the three models achieved significantly better defender expected utility than the other models.

Chapter 5: Quantal Response Model with Subjective Utility

In this chapter, I compare the quantal response model to an alternative approach for addressing human bounded rationality: a robust optimization approach, which intentionally avoids modeling human decision making. The leading contender here is an algorithm called MATCH [Pita et al., 2012].
Instead of modeling the particular probabilities of the adversary's deviations from the optimal choice, MATCH only guarantees a bound on the loss to the defender if the adversary deviates from selecting optimally (the maximum-expected-value choice). It has been shown in [Pita et al., 2012] that MATCH significantly outperforms BRQR (Section 4.2), even when significant amounts of human subject data were used to tune the key parameter of the quantal response model. It hence became unclear whether there is still any value in using human behavior models in solving SSGs. In this chapter, using a large number of human subject experiments, I illustrate the importance of integrating human behavior models (and in particular the QR model) within algorithms to solve SSGs. Section 5.1 introduces an extended version of the quantal response model, integrating it with a novel subjective utility function. Section 5.2 provides an improved version of the MATCH algorithm by integrating it with the same subjective utility function learned from the data. Then, in Section 5.3, I describe experiments comparing the extended quantal response model with the MATCH algorithm under different settings with human subjects, including both Amazon Mechanical Turk workers and a group of security intelligence experts.

5.1 The SUQR Model

The key idea in subjective expected utility (SEU), as proposed in behavioral decision-making [Savage, 1972; Fischhoff et al., 1981], is that individuals have their own evaluations of each alternative during decision-making.¹ Recall that in an SSG, the information presented to the human subject for each choice includes: the marginal coverage on target $t$ ($x_t$); the subject's reward and penalty ($R^a_t$, $P^a_t$); and the defender's reward and penalty ($R^d_t$, $P^d_t$).
Inspired by the idea of SEU, we propose a subjective utility function of the adversary for SSGs as follows:

$$\hat{U}^a_t = w_1 x_t + w_2 R^a_t + w_3 P^a_t \quad (5.1)$$

The novelty of our subjective utility function is the linear combination of the values (reward/penalty) and the coverage probability. (Note that we are modeling the decision-making of the general population, not of each individual, as we do not have sufficient data for each specific subject.) While unconventional at first glance, as shown later, this model actually leads to higher prediction accuracy than the classic expected value function. A possible explanation is that humans may be driven by simple heuristics in their decision making. Other alternatives to this subjective utility function are feasible, e.g., including all the information presented to the subjects ($\hat{U}^a_t = w_1 x_t + w_2 R^a_t + w_3 P^a_t + w_4 R^d_t + w_5 P^d_t$), which we discuss later.

^1 A similar approach with a subjective utility function has been shown to predict human behavior well in previous work [Azaria et al., 2012].

We modify the QR model by replacing the classic expected value function with the SU function, leading to the SUQR model. In the SUQR model, the probability that the adversary chooses target $t$, $q_t$, is given by:

$$q_t = \frac{e^{\lambda \hat{U}^a_t}}{\sum_{t'} e^{\lambda \hat{U}^a_{t'}}} = \frac{e^{\lambda (w_1 x_t + w_2 R^a_t + w_3 P^a_t)}}{\sum_{t'} e^{\lambda (w_1 x_{t'} + w_2 R^a_{t'} + w_3 P^a_{t'})}} \quad (5.2)$$

The problem of finding the optimal strategy for the defender can therefore be formulated as:

$$\max_x \; \sum_{t=1}^{T} \frac{e^{\lambda (w_1 x_t + w_2 R^a_t + w_3 P^a_t)}}{\sum_{t'} e^{\lambda (w_1 x_{t'} + w_2 R^a_{t'} + w_3 P^a_{t'})}} \left( x_t R^d_t + (1 - x_t) P^d_t \right)$$
$$\text{s.t.} \quad \sum_{t=1}^{T} x_t \leq K, \quad 0 \leq x_t \leq 1 \quad (5.3)$$

Here, the objective is to maximize the defender's expected value given that the adversary attacks each target with the probability given by the SUQR model. Constraint (5.3) ensures that the coverage probabilities on all the targets satisfy the resource constraint.
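As a concrete illustration, the SUQR response of Equation (5.2) can be sketched in a few lines of Python; the toy coverage and payoff numbers below are illustrative and not data from the thesis:

```python
import numpy as np

def suqr_attack_probs(x, Ra, Pa, w, lam=1.0):
    """SUQR attack distribution (Eq. 5.2).

    x:  marginal coverage per target
    Ra, Pa: attacker reward/penalty per target
    w:  (w1, w2, w3) subjective-utility weights
    lam: quantal response parameter (set to 1 WLOG in the text)
    """
    su = lam * (w[0] * x + w[1] * Ra + w[2] * Pa)  # subjective utility per target
    e = np.exp(su - su.max())                       # numerically stable softmax
    return e / e.sum()

# Toy 3-target instance with illustrative numbers
x  = np.array([0.5, 0.3, 0.2])
Ra = np.array([4.0, 6.0, 8.0])
Pa = np.array([-3.0, -5.0, -4.0])
q = suqr_attack_probs(x, Ra, Pa, w=(-9.85, 0.37, 0.15))
print(round(q.sum(), 6))  # 1.0
```

With a strongly negative weight on coverage, the least-covered, high-reward target receives the largest attack probability, which is the qualitative behavior the SU function is meant to capture.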
Given that this optimization problem is similar to BRQR's, we use the same approach as BRQR to solve it. We refer to the resulting algorithm as SU-BRQR.

5.1.1 Learning SUQR Parameters

Without loss of generality, we set $\lambda = 1$. We employ Maximum Likelihood Estimation (MLE) to learn the parameters $(w_1, w_2, w_3)$. Given the defender strategy $x$ and $N$ samples of the players' choices, the log-likelihood of $(w_1, w_2, w_3)$ is given by:

$$\log L(w_1, w_2, w_3 \mid x) = \sum_{j=1}^{N} \log\left[q_{t_j}(w_1, w_2, w_3)\right]$$

where $t_j$ is the target chosen in sample $j$ and $q_{t_j}(w_1, w_2, w_3)$ is the probability that the adversary chooses target $t_j$ given the parameters $(w_1, w_2, w_3)$. Let $N_t$ be the number of subjects attacking target $t$. Then we have:

$$\log L(w_1, w_2, w_3 \mid x) = \sum_{t=1}^{T} N_t \log\left[q_t(w_1, w_2, w_3)\right]$$

Combining with Equation (5.2),

$$\log L(w_1, w_2, w_3 \mid x) = w_1 \sum_{t=1}^{T} N_t x_t + w_2 \sum_{t=1}^{T} N_t R^a_t + w_3 \sum_{t=1}^{T} N_t P^a_t - N \log\left(\sum_{t=1}^{T} e^{w_1 x_t + w_2 R^a_t + w_3 P^a_t}\right)$$

$\log L(w_1, w_2, w_3 \mid x)$ can be shown to be concave: the Hessian matrix of $\log L(w_1, w_2, w_3 \mid x)$ is negative semi-definite. Thus, any local maximum of this function is a global maximum, and we can use a convex optimization solver, e.g., fmincon in Matlab, to compute the optimal weights $(w_1, w_2, w_3)$.

5.1.2 Prediction Accuracy of the SUQR Model

As in some real-world security environments, we would want to learn the parameters of our SUQR model from limited data. To that end, we used the data of 5 payoff structures and 2 algorithms, MATCH and BRQR (10 games in total), from [Pita et al., 2012] to learn the parameters of the new SU function and its alternatives. In total, 33 human subjects played these 10 games using the 8-target, 3-guard setting of our on-line game. The learned parameters are: $(w_1, w_2, w_3) = (-9.85, 0.37, 0.15)$ for the 3-parameter SU function, and $(w_1, \ldots, w_5) = (-8.23, 0.28, 0.12, 0.07, 0.09)$ for the 5-parameter function.
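The MLE step above can be sketched as follows. The thesis used Matlab's fmincon; this scipy version is only an analogous sketch, and the choice counts below are synthetic:

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_likelihood(w, x, Ra, Pa, N_t):
    """Negative SUQR log-likelihood (lambda = 1), to be minimized.

    N_t holds the number of subjects who attacked each target."""
    su = w[0] * x + w[1] * Ra + w[2] * Pa
    m = su.max()
    lse = m + np.log(np.exp(su - m).sum())  # log-sum-exp, numerically stable
    return -(N_t @ su - N_t.sum() * lse)

# Synthetic data for one defender strategy over 3 targets
x  = np.array([0.5, 0.3, 0.2])
Ra = np.array([4.0, 6.0, 8.0])
Pa = np.array([-3.0, -5.0, -4.0])
N_t = np.array([5.0, 12.0, 23.0])

# The log-likelihood is concave, so its negation is convex and a
# local optimizer suffices.
res = minimize(neg_log_likelihood, x0=np.zeros(3), args=(x, Ra, Pa, N_t))
print(np.isfinite(res.fun))  # True
```

A real run would sum the log-likelihood over all games played (as the thesis does over 10 games), not over a single strategy.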
Table 5.1: Prediction accuracy
  QR: 8%    3-parameter SUQR: 51%    5-parameter SUQR: 44%

We ran Pearson's chi-squared goodness-of-fit test [Greenwood and Nikulin, 1996] on all 100 payoff structures in [Pita et al., 2012] to evaluate the prediction accuracy of the two proposed models as well as the classic QR model. The test examines whether the predicted distribution of the players' choices fits the observations. We set $\lambda = 0.76$ for the QR model, the same value as learned in [Yang et al., 2011]. The percentages of the payoff structures that fit the predictions of the three models (at a statistical significance level of $\alpha = 0.05$) are displayed in Table 5.1. The table clearly shows that the new SUQR model (with the SU function of Equation (5.1)) predicts human behavior more accurately than the classic QR model. In addition, even with more parameters, the prediction accuracy of the 5-parameter SUQR model does not improve on the 3-parameter model. Given this result, and the 3-parameter model's demonstrated superiority (as we will show in the experiments), we leave efforts to further improve the SUQR model for future work.

5.2 Improving MATCH

Since SUQR better predicts the distribution of the subjects' choices than the classic QR, and, as shown later, SU-BRQR outperforms MATCH, it is natural to investigate integrating the subjective utility function into MATCH. In particular, we replace the expected value of the adversary with the subjective utility function. The adversary's loss caused by his deviation from the optimal solution is then measured with respect to the subjective utility function.
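The goodness-of-fit check can be reproduced with scipy's chi-squared test; the observed counts and model-predicted distribution below are synthetic stand-ins for one payoff structure:

```python
import numpy as np
from scipy.stats import chisquare

# Observed attack counts over 8 targets (synthetic) and a model's
# predicted choice distribution for the same game.
observed = np.array([10, 6, 5, 4, 3, 3, 2, 2])
predicted = np.array([0.30, 0.18, 0.14, 0.11, 0.09, 0.08, 0.05, 0.05])

# chisquare expects expected *counts*, so scale the distribution by N.
stat, p_value = chisquare(f_obs=observed, f_exp=predicted * observed.sum())

# The model "fits" this game if we cannot reject it at alpha = 0.05.
fits = p_value > 0.05
print(fits)  # True
```

Repeating this per payoff structure and counting the fraction of games with `fits == True` yields percentages of the kind reported in Table 5.1.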
$$\max_{x, h, \eta, \gamma} \; \gamma \quad (5.4)$$
$$\text{s.t.} \quad \sum_{t \in T} x_t \leq K, \quad 0 \leq x_t \leq 1, \; \forall t \quad (5.5)$$
$$\sum_{t \in T} h_t = 1, \quad h_t \in \{0, 1\}, \; \forall t \quad (5.6)$$
$$0 \leq \eta - (w_1 x_t + w_2 R^a_t + w_3 P^a_t) \leq M(1 - h_t) \quad (5.7)$$
$$\gamma - (x_t R^d_t + (1 - x_t) P^d_t) \leq M(1 - h_t) \quad (5.8)$$
$$\gamma - (x_t R^d_t + (1 - x_t) P^d_t) \leq \beta \cdot \left( \eta - (w_1 x_t + w_2 R^a_t + w_3 P^a_t) \right), \; \forall t \quad (5.9)$$

We refer to this modified version as SU-MATCH; it is shown in Equations (5.4)-(5.9), where $h_t$ represents the adversary's target choice, $\eta$ represents the maximum subjective utility for the adversary, $\gamma$ represents the expected value for the defender if the adversary responds optimally, and $M$ is a large constant. Constraint (5.7) finds the optimal target for the adversary. In constraint (5.8), the defender's expected value is computed when the attacker chooses his optimal strategy. The key idea of SU-MATCH is in constraint (5.9): it guarantees that the loss in the defender's expected value caused by the adversary's deviation is no more than a factor of $\beta$ times the loss in the adversary's subjective utility.

5.2.1 Selecting $\beta$ for MATCH

In MATCH, the parameter $\beta$ decides how much the defender is willing to lose if the adversary deviates from his optimal strategy. Pita et al. set $\beta$ to 1.0, leaving its optimization for future work. In this section, we propose a method to estimate $\beta$ based on the SUQR model.

1  Initialize $\gamma^* \leftarrow -\infty$;
2  for $i = 1$ to $N$ do
3      $\beta \leftarrow$ Sample($[0, \text{MaxBeta}], i$);  $x \leftarrow$ MATCH($\beta$);
4      $\gamma \leftarrow \sum_t q_t U^d_t(x)$;
5      if $\gamma > \gamma^*$ then
6          $\gamma^* \leftarrow \gamma$;  $\beta^* \leftarrow \beta$;
7      end
8  end
9  return ($\beta^*$, $\gamma^*$);

In this method, $N$ values of $\beta$ are sampled uniformly from the range (0, MaxBeta). For each sampled value of $\beta$, the optimal strategy $x$ for the defender is computed using MATCH. Given this mixed strategy $x$, the defender's expected value, $\gamma$, is computed assuming that the adversary responds stochastically according to the SUQR model. The $\beta$ leading to the highest defender expected value is chosen.
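The grid search over the MATCH parameter can be sketched as follows. The MATCH solve itself is a mixed-integer program; the `match_strategy` function below is a stand-in placeholder of my own (an assumption for illustration), not the real MATCH algorithm, and the payoffs are synthetic:

```python
import numpy as np

def suqr_probs(x, Ra, Pa, w=(-9.85, 0.37, 0.15)):
    """SUQR attack distribution used to score each candidate strategy."""
    su = w[0] * x + w[1] * Ra + w[2] * Pa
    e = np.exp(su - su.max())
    return e / e.sum()

def match_strategy(beta, Ra, Pa, K=3):
    """Placeholder for MATCH(beta): a real implementation solves a MILP.
    Here we merely interpolate between uniform coverage and coverage
    proportional to target attractiveness as beta grows (illustrative)."""
    t = min(beta / 5.0, 1.0)
    uniform = np.full(len(Ra), K / len(Ra))
    attract = Ra - Pa
    proportional = K * attract / attract.sum()
    return np.clip((1 - t) * uniform + t * proportional, 0.0, 1.0)

def select_beta(Ra, Pa, Rd, Pd, max_beta=5.0, n=100):
    """Keep the beta whose strategy gives the best defender expected
    value against a SUQR-responding adversary (grid step 0.05)."""
    best_beta, best_gamma = None, -np.inf
    for beta in np.linspace(max_beta / n, max_beta, n):
        x = match_strategy(beta, Ra, Pa)
        q = suqr_probs(x, Ra, Pa)
        gamma = q @ (x * Rd + (1 - x) * Pd)   # defender expected utility
        if gamma > best_gamma:
            best_beta, best_gamma = beta, gamma
    return best_beta, best_gamma

Ra = np.array([6.0, 8.0, 5.0, 7.0, 4.0, 9.0, 3.0, 6.0])
Pa = np.array([-4.0, -6.0, -3.0, -5.0, -2.0, -7.0, -1.0, -4.0])
Rd, Pd = -Pa / 2, -Ra / 2                     # synthetic defender payoffs
beta, gamma = select_beta(Ra, Pa, Rd, Pd)
print(0 < beta <= 5.0)  # True
```

The structure of `select_beta` mirrors the pseudocode above: sample, solve, score under SUQR, keep the best.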
In practice, we set MaxBeta to 5, which provides an effective bound on the defender's loss given that the penalties and rewards of both players range from -10 to 10, and $N$ to 100, which gives a grid of step 0.05 for $\beta$ over the range (0, 5). We refer to the MATCH algorithm with this carefully selected $\beta$ as MATCHBeta.

5.3 Experimental Results

In this section, I present experimental results comparing the MATCH algorithm (and its extended versions) with the SU-BRQR algorithm. We leave the comparison between the SUQR model and the QRRU model (Section 4.1.3) for future work.

5.3.1 Results with AMT Workers, 8-target Games

Our first experiment compares SU-BRQR against MATCH and its improvements in the setting where we learned the parameters of the SUQR model, i.e., the 8-target, 3-guard game with AMT workers. In this 8-target game setting, the reported average for each game is over at least 45 human subjects. The experiments were conducted on the AMT system. When two algorithms were compared, we ensured that identical human subjects played both on the same payoff structures. Participants were paid a base amount of US $1.00. In addition, each participant was given a bonus based on their performance in the games to motivate them. Similar to [Pita et al., 2012]'s work, we ensured that players were not choosing targets arbitrarily by having each participant play two extra trivial games (i.e., games in which one target has the highest adversary reward, the lowest adversary penalty, and the lowest defender coverage probability). Players' results were removed if they did not choose that target.

We generated the payoff structures based on covariance games in GAMUT [Nudelman et al., 2004]. In covariance games, we can adjust the covariance value $r \in [-1, 1]$ to control the correlation between the rewards of the players. We first generated 1000 payoff structures with $r$ ranging from -1 to 0 in 0.1 increments (100 payoff structures per value of $r$).
Then, for each of the 11 values of $r$, we selected 2 payoff structures, ensuring that the strategies generated by the candidate algorithms (e.g., SU-BRQR and the versions of MATCH) are not similar to each other. One of the two has the maximum, and the other has the median, sum of 1-norm distances between the defender strategies generated by each pair of the algorithms. This leads to a total of 22 payoff structures. By selecting the payoffs in this way, we explore payoff structures with different levels of 1-norm distance between the generated strategies, so as to obtain an accurate evaluation of the performance of the tested algorithms. We evaluate the statistical significance of our results using the bootstrap-t method [Wilcox, 2003].

Table 5.2: SU-BRQR vs MATCH, AMT workers, 8 targets
  $\alpha = .05$:  SU-BRQR: 13   Draw: 8   MATCH: 1

5.3.2 SU-BRQR vs MATCH

This section evaluates the impact of the new subjective utility function via a head-to-head comparison between SU-BRQR and MATCH. In this initial test, the parameter $\beta$ of MATCH was set to 1.0 as in [Pita et al., 2012]. Figure 5.3.2a shows all available comparison results for completeness. More specifically, we show the histogram of the difference between SU-BRQR and MATCH in the average defender expected reward over all the choices of the participants. The x-axis shows the range of this difference in each bin, and the y-axis displays the number of payoff structures (out of 22) that fall into each bin. For example, in the third bin from the left, the average defender expected value achieved by SU-BRQR is larger than that achieved by MATCH, with the difference ranging from 0 to 0.4; 8 payoff structures fall into this category. Overall, SU-BRQR achieves a higher average expected defender reward than MATCH in 16 of the 22 payoff structures.

In Figure 5.3.2b, the second column shows the number of payoffs where SU-BRQR outperforms MATCH with statistical significance ($\alpha = .05$).
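The max/median 1-norm-distance selection can be sketched as follows, with random strategies standing in for the algorithms' actual outputs (all numbers here are synthetic):

```python
import numpy as np

def pairwise_l1_sum(strategies):
    """Sum of 1-norm distances between each pair of defender strategies."""
    total = 0.0
    for i in range(len(strategies)):
        for j in range(i + 1, len(strategies)):
            total += np.abs(strategies[i] - strategies[j]).sum()
    return total

rng = np.random.default_rng(0)
# 100 candidate payoff structures; for each, one 8-target mixed strategy
# per algorithm (4 hypothetical algorithms, 3 guards).
scores = [pairwise_l1_sum([rng.dirichlet(np.ones(8)) * 3 for _ in range(4)])
          for _ in range(100)]

order = np.argsort(scores)
chosen = [int(order[-1]), int(order[len(order) // 2])]  # max and median distance
print(scores[chosen[0]] == max(scores))  # True
```

Choosing both the maximum- and median-distance payoff structures per value of $r$ yields the 22 structures used in the experiments.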
The number of payoff structures where MATCH is better than SU-BRQR with statistical significance is shown in the fourth column. In the 22 payoff structures, SU-BRQR outperforms MATCH 13 times with statistical significance while MATCH defeats SU-BRQR only once; in the remaining 8 cases, no statistical significance is obtained either way. This result stands in stark contrast to [Pita et al., 2012]'s result and directly answers the question posed at the beginning of this chapter: there is indeed value in integrating models of human decision making when computing defender strategies in SSGs, but the use of SUQR rather than the traditional QR model is crucial. Furthermore, we ran Pearson's chi-squared goodness-of-fit test to evaluate the prediction accuracy of the SUQR model and the traditional QR model, as in Section 5.1. Of the 44 games (22 payoffs with 2 strategies each), 20 games fit the prediction of the SUQR model while only 7 games fit the prediction of the QR model.

5.3.3 SU-BRQR vs Improved MATCH

In Table 5.3, we compare MATCH and SU-BRQR against the three improved versions of MATCH: SU-MATCH, MATCHBeta, and SU-MATCHBeta (i.e., MATCH with both the subjective utility function and the selected $\beta$), when playing our 22 selected payoff structures. We report the results that hold with statistical significance ($\alpha = .05$). The first number in each cell of Table 5.3 shows the number of payoffs (out of 22) where the row algorithm obtains a higher average defender expected reward than the column algorithm; the second number shows where the column algorithm outperforms the row algorithm. For example, the second row, second column shows that MATCH outperforms SU-MATCH in 3 payoff structures with statistical significance, while SU-MATCH defeats MATCH in 11.

Table 5.3: Performance comparison, $\alpha = .05$
             SU-MATCH   MATCHBeta   SU-MATCHBeta
  MATCH       3, 11      1, 6        1, 8
  SU-BRQR     8, 2       8, 2        5, 3
Table 5.3 shows that the newer versions of MATCH achieve a significant improvement over MATCH. Additionally, SU-BRQR retains a significant advantage over both SU-MATCH and MATCHBeta. For example, SU-BRQR defeats SU-MATCH in 8 of the 22 payoff structures with statistical significance, as shown in Table 5.3; in contrast, SU-MATCH is better than SU-BRQR only twice.

Although SU-BRQR does not outperform SU-MATCHBeta to the extent it does MATCH (i.e., SU-BRQR performs better than SU-MATCHBeta only 5 times with statistical significance, while SU-MATCHBeta is better than SU-BRQR three times (Table 5.3)), SU-BRQR remains the algorithm of choice for the following reasons: (a) SU-BRQR does perform better than SU-MATCHBeta in more cases with statistical significance; and (b) selecting the parameter $\beta$ in SU-MATCHBeta can impose a significant computational overhead for large games, given that it requires testing many values of $\beta$.

5.3.4 Results with New Experimental Scenarios

All previous experiments are based on the 8-target, 3-guard game, which was motivated by the LAX security scenario. In addition, the games were played by AMT workers or college students. To evaluate the performance of the SUQR model in new scenarios, we introduce two new experimental settings: in one, the experiments are conducted against a new type of human adversary, i.e., security intelligence experts; in the other, we change the game to 24 targets and 9 guards.

Table 5.4: SU-BRQR vs MATCH, security experts
  $\alpha = .05$:  SU-BRQR: 6   Draw: 13   MATCH: 3

5.3.4.1 Security Intelligence Experts, 8-target Games

In this section, we evaluate our algorithm with security intelligence experts who serve in the best Israeli intelligence corps unit or are alumni of that unit. Our purpose is to examine whether SU-BRQR still works when we so radically change the subject population to security experts.
We use the same 22 payoff structures and the same subjective utility function as in the previous experiment with AMT workers. Each result below is averaged over the decisions of 27 experts.

5.3.4.2 SU-BRQR vs DOBSS

DOBSS performed poorly in 8-target games against AMT workers, as shown in Chapter 4. However, would DOBSS perform better in comparison to SU-BRQR against security experts? Our results show that SU-BRQR is better than DOBSS in all 22 tested payoff structures, 19 times with statistical significance. Thus, even these experts did not respond optimally (as anticipated by DOBSS) against the defender's strategies.

5.3.4.3 SU-BRQR vs MATCH

Figure 5.3.4.3a shows that SU-BRQR obtains a higher expected defender reward than MATCH in 11 payoff structures against our experts. Furthermore, SU-BRQR performs better than MATCH in 6 payoff structures with statistical significance, while MATCH is better than SU-BRQR in only 3 payoff structures with statistical significance (Figure 5.3.4.3b). These results still favor SU-BRQR over MATCH, although not as strongly as when playing against AMT workers (Figure 5.3.2). Nonetheless, what is crucially shown in this section is that changing the subject population to security experts does not undermine SU-BRQR: despite using the data from AMT workers, SU-BRQR is still able to perform better than MATCH. We re-estimated the parameters $(w_1, w_2, w_3)$ of the SU function using the data of the experts, obtaining $w_1 = -11.0$, $w_2 = 0.54$, and $w_3 = 0.35$. This shows that while the experts weighted the criteria differently from the AMT workers, the relative importance of the three parameters remained the same. Because of limited access to experts, we could not conduct experiments with these re-estimated parameters; we will show the impact of such re-estimation in our next experimental setting.
5.3.5 Bounded Rationality of Human Adversaries

We now compare the AMT workers and the security experts using the traditional metric of the "rationality level" of the QR model. To that end, we revert to the QR model with the expected value function to measure how close these players are to perfect rationality. In particular, we use QR's parameter $\lambda$ as a criterion to measure their rationality. We use all the data from AMT workers as well as experts on the chosen 22 games in the previous experiments to learn $\lambda$, obtaining $\lambda = 0.77$ for AMT workers and $\lambda = 0.91$ for experts. This result implies that security intelligence experts tend to be more rational than AMT workers (the higher the $\lambda$, the closer the players are to perfect rationality). Indeed, in 34 of 44 games, the experts obtained a higher expected value than the AMT workers. Of these, the experts' expected value was higher with statistical significance ($\alpha = .05$) 9 times, while the AMT workers' was higher only once. Nonetheless, the experts' $\lambda$ of 0.91 suggests that they do not play with perfect rationality (perfect rationality corresponds to $\lambda = \infty$).

5.3.6 AMT Workers, 24-target Games

In this section, we examine the performance of the algorithms in larger games, i.e., 24 targets and 9 defender resources. We expect that human adversaries may change their behavior here because of the tedious evaluation of risk and benefit for each target. Three algorithms were tested: SU-BRQR, MATCH, and DOBSS. We first ran experiments with the subjective utility function learned previously from the data of the 8-target games.

5.3.6.1 SU-BRQR vs MATCH with Parameters Learned from the 8-target Games

Figure 5.3.6.1a shows that SU-BRQR obtains a higher average defender expected value than MATCH in 14 of the 22 payoff structures, while MATCH is better than SU-BRQR in 8. These averages are reported over 45 subjects.
In addition, as can be seen in Figure 5.3.6.1b, SU-BRQR performs better than MATCH with statistical significance 8 times, while MATCH outperforms SU-BRQR 3 times. While SU-BRQR does perform better than MATCH, its superiority is not as pronounced as in the previous 8-target games.

Table 5.5: SU-BRQR vs MATCH, 24 targets, original parameters
  $\alpha = .05$:  SU-BRQR: 8   Draw: 11   MATCH: 3

Based on these results, we hypothesize that the parameters learned from the 8-target games do not predict human behavior as well in the 24-target games. Therefore, we re-estimated the parameters of the subjective utility function using the data of the previous experiment in the 24-target games. The training data contains 388 data points. The re-estimation yields $w_1 = -15.29$, $w_2 = 0.53$, $w_3 = 0.34$. As in the experts case, the weights in the 24-target games differ from those in the 8-target games, but their order of importance is the same.

5.3.6.2 SU-BRQR vs DOBSS with Re-estimated Parameters

Since DOBSS had not been tested in the 24-target setting, we test it as a baseline. SU-BRQR outperforms DOBSS with statistical significance in all 22 tested payoff structures, illustrating the superiority of SU-BRQR over a perfectly rational baseline.

5.3.6.3 SU-BRQR vs MATCH with Re-estimated Parameters

In this experiment, we evaluate the impact of the subjective utility function with re-estimated parameters on the performance of SU-BRQR in comparison with MATCH.

Table 5.6: SU-BRQR vs MATCH, 24 targets, re-estimated parameters
  $\alpha = .05$:  SU-BRQR: 11   Draw: 10   MATCH: 1

Figure 5.3.6.3a shows that SU-BRQR outperforms MATCH in 18 payoff structures, while MATCH beats SU-BRQR in only 4. Moreover, as seen in Figure 5.3.6.3b, SU-BRQR defeats MATCH with statistical significance 11 times, while MATCH defeats SU-BRQR only once with statistical significance. In other words, the new weights of the subjective utility function indeed improve the performance of SU-BRQR.
This result demonstrates that a more accurate SU function can improve SU-BRQR's performance.

Chapter 6: Modeling Human Adversaries in Network Security Games

In this chapter, I initiate the study of human behavior models of adversaries in network security games, as well as the problem of designing defender strategies against such human adversaries. Many real-world security domains have structure that is naturally modeled as graphs. For example, in response to the devastating terrorist attacks in 2008 [Chandran and Beitchman, 2008], Mumbai police deployed randomized checkpoints as one countermeasure to prevent future attacks [Ali, 2009]. This can be modeled as a network security game [Washburn and Wood, 1995; Tsai et al., 2010; Jain et al., 2011a]: a Stackelberg game on a graph with intersections as nodes and roads as edges, where certain nodes are targets for attacks. The defender (as the leader) can schedule randomized checkpoints on edges of the graph. The attacker (as the follower) chooses a path on the graph ending at one of the targets.

A common assumption of these previous studies is that the attacker is perfectly rational (i.e., chooses a strategy that maximizes his expected utility). This is a reasonable proxy for the worst case of a highly intelligent attacker, but it can lead to a defense strategy that is not robust against attackers using different decision procedures, and it fails to exploit known weaknesses in the decision-making of human attackers. Whereas in previous chapters we considered security domains where the human adversaries choose from a set of given targets, in a network security game the human attacker faces a more complex decision: that of choosing a path in a graph. On one hand, the rationality assumption is even more problematic here; on the other hand, existing behavior models do not explicitly consider the specific graphical structure of this domain.
For modeling human path planning in continuous terrains, Burgess and Darken [Burgess and Darken, 2004] proposed a fluid-simulation-based model; however, their model is less applicable to our domain, in which the choices are discrete.

In this chapter, I present the first systematic study of human behavior models applied to network security games. After formally defining the problem in Section 6.1, I consider two behavior models for attackers in Section 6.2. First, I adapt the quantal response model to network security games. The second model, which I call quantal response with heuristics, is motivated by studies showing that humans rely on heuristics to address complex decision problems (e.g., [Gigerenzer et al., 1999]). Then, in Section 6.3, I describe how the model parameters are estimated using data collected through a web-based game that I developed to simulate the decision tasks faced by the attacker. Section 6.4 then explains how to compute defender strategies that optimize defender utility against each of these behavior models of the attackers. Finally, Section 6.5 compares the performance of these strategies in a subsequent set of experiments on Amazon Mechanical Turk.

6.1 Problem Definition

We model a network security domain similar to that introduced by Tsai et al. [Tsai et al., 2010]. We use the following notation to describe the game, also listed in Table 6.1. The game is played on a graph $G = (V, E)$. The attacker starts at one of the source nodes $s \in S \subseteq V$ and travels along a path of his choosing to one of the target nodes $t \in T \subseteq V$. The attacker's set of pure strategies $\mathcal{A}$ then consists of all possible paths from some $s \in S$ to some $t \in T$, which we denote $A_1, \ldots, A_{|\mathcal{A}|} \subseteq E$. Meanwhile, the defender tries to catch the attacker by setting up checkpoints on edges before the attacker reaches the target.
Let $M$ be the total number of security resources, meaning the defender can set up at most $M$ simultaneous checkpoints in the network. Thus the set of the defender's pure strategies $\mathcal{D}$ consists of all subsets of $E$ with at most $M$ elements, which we denote $D_1, \ldots, D_{|\mathcal{D}|}$. If the attacker chooses a path that has at least one edge covered by the defender, then the attacker gets caught and receives a penalty, and the defender receives a reward for catching the attacker; otherwise, the attacker receives a reward for successfully attacking the target and the defender receives a penalty. Formally, assuming the defender plays an allocation $D_i$ and the attacker chooses a path $A_j$, the attacker succeeds if and only if $D_i \cap A_j = \emptyset$.

The game was assumed to be zero-sum in earlier work [Tsai et al., 2010; Jain et al., 2011a]. Here, we relax this assumption to consider a more general class of games. Specifically, successful attacks might yield different rewards to the attacker, since different targets might be of different value to him; meanwhile, catching the attacker on different paths might give different rewards to the defender. We use $R^a_j$ to denote the reward received by the attacker for a successful attack through path $A_j$, and $P^d_j$ to denote the corresponding penalty received by the defender. If the attacker gets caught on path $A_j$, we denote his penalty by $P^a_j$ and the reward received by the defender by $R^d_j$. Furthermore, we make the natural assumption that $R^a_i > P^a_i$ and $R^d_i > P^d_i$ for all $i \in \{1, \ldots, |\mathcal{A}|\}$. Taking everything together, we define a network security game as the tuple $\Gamma = (G, S, T, M, \{R^d_i\}, \{P^d_i\}, \{R^a_i\}, \{P^a_i\})$.
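The game tuple can be collected in a small data structure; this is only a sketch, with field names of my own choosing and toy payoffs:

```python
from dataclasses import dataclass

@dataclass
class NetworkSecurityGame:
    """A sketch of the tuple (G, S, T, M, {R^d_i}, {P^d_i}, {R^a_i}, {P^a_i});
    paths are represented as lists of edge identifiers."""
    edges: list      # E: edge identifiers
    sources: list    # S: source nodes
    targets: list    # T: target nodes
    M: int           # number of defender resources
    paths: list      # attacker paths A_i, each a list of edges
    Rd: list         # R^d_i per path
    Pd: list         # P^d_i per path
    Ra: list         # R^a_i per path
    Pa: list         # P^a_i per path

    def is_valid(self):
        # The natural assumption: R^a_i > P^a_i and R^d_i > P^d_i for all i
        return (all(r > p for r, p in zip(self.Ra, self.Pa))
                and all(r > p for r, p in zip(self.Rd, self.Pd)))

game = NetworkSecurityGame(
    edges=["e1", "e2", "e3"], sources=["s"], targets=["t"], M=1,
    paths=[["e1", "e2"], ["e3"]],
    Rd=[5, 4], Pd=[-10, -8], Ra=[10, 8], Pa=[-5, -4])
print(game.is_valid())  # True
```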
Table 6.1: Notation used in this chapter
  $(V, E)$        Network game graph
  $M$             Total number of defender resources
  $\mathcal{A}$   Set of attacker paths, $\mathcal{A} = \{A_i\}$
  $A_i$           The $i$-th attacker path
  $R^a_i$         Reward to the attacker for a successful attack through path $A_i$
  $P^a_i$         Penalty to the attacker if he gets caught on path $A_i$
  $R^d_i$         Reward to the defender for catching the attacker on path $A_i$
  $P^d_i$         Penalty to the defender for a successful attack through path $A_i$
  $\mathcal{D}$   Set of defender allocations (pure strategies), $\mathcal{D} = \{D_j\}$
  $D_j$           The $j$-th defender allocation
  $\Gamma$        A network security game
  $x_e$           Probability that edge $e$ is covered by a resource

The attacker conducts surveillance to learn the defender's strategy, so it is important for the defender to randomize her strategy to avoid exploitable patterns. In other words, the defender commits to a distribution over her pure strategies. We use $x_e$ to denote the probability that an edge $e \in E$ is covered by the defender, and $x = \langle x_e, \forall e \in E \rangle$ to denote the vector of marginal probabilities of covering each edge in the graph. In general, if the attacker chooses path $A_i$, the probability that he will be captured (denoted $p_i$) is the probability that at least one edge on the path $A_i$ is covered by the defender, which is not completely specified by the marginals $x$. Tsai et al. [Tsai et al., 2010] showed that given $x$, the sum of marginals on the edges of the path, $\sum_{e \in A_i} x_e$, is an upper bound on $p_i$, and this upper bound is reached if the defender can ensure that in each pure strategy $D_j$ played with positive probability, only one edge on the path $A_i$ is covered. Tsai et al. [Tsai et al., 2010] proposed algorithms that sample defender pure strategies from $x$; however, such techniques are not guaranteed to reach this upper bound in all cases. In this chapter, we make the simplifying assumption that the total amount of defender resources $M$ equals 1, which is consistent with our focus on small graphs.
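To see why the marginals alone do not determine $p_i$ when a pure strategy may cover several path edges (the motivation for assuming $M = 1$), consider this sketch; the two pure-strategy distributions below are illustrative and share the same marginals on the path's edges:

```python
path = ["e1", "e2"]

# Each distribution maps a pure strategy (a tuple of covered edges) to
# its probability. Both give marginals x_e1 = 0.3, x_e2 = 0.2.
dist_a = {("e1",): 0.3, ("e2",): 0.2, ("e3",): 0.5}        # path edges never co-occur
dist_b = {("e1", "e2"): 0.2, ("e1",): 0.1, ("e3",): 0.7}   # path edges co-occur

def capture_prob(dist, path):
    """P(at least one path edge is covered) under a pure-strategy mix."""
    return sum(p for cov, p in dist.items() if set(cov) & set(path))

def marginal_sum(dist, path):
    """Sum of marginal coverage over the path's edges: the upper bound on p_i."""
    return sum(p for cov, p in dist.items() for e in cov if e in path)

print(capture_prob(dist_a, path) == marginal_sum(dist_a, path))  # True: bound is tight
print(capture_prob(dist_b, path) <= marginal_sum(dist_b, path))  # True: bound only
```

When path edges are never covered together (dist_a), the capture probability equals the marginal sum; when they co-occur (dist_b), it is strictly smaller, which is exactly the gap the Tsai et al. bound describes.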
Then, since at most one edge of the graph is covered in any pure strategy $D_j$, we have $p_i = p_i(x) = \sum_{e \in A_i} x_e$ for all $i$. We can then write the expected utility of the defender if the attacker chooses path $A_i$ as

$$U^d_i(x; \Gamma) = p_i(x) R^d_i + (1 - p_i(x)) P^d_i \quad (6.1)$$

and the expected utility of the attacker if he chooses $A_i$ as

$$U^a_i(x; \Gamma) = (1 - p_i(x)) R^a_i + p_i(x) P^a_i \quad (6.2)$$

Let $q_i(x; \Gamma)$ denote the probability that the attacker chooses path $A_i$, given the defender's marginal coverage $x$ on all the edges. The optimal strategy for the defender maximizes her average expected utility:

$$\max_x \sum_{A_i \in \mathcal{A}} q_i(x; \Gamma)\, U^d_i(x; \Gamma) \quad (6.3)$$

It is thus important for the defender to accurately model the attacker's response to her strategy, i.e., $q_i(x; \Gamma)$ for all $i$. We assume that the attacker can observe $M$ (which is equal to 1), as well as the defender's marginal coverage $x$ on all the edges. A fully rational attacker would be able to deduce that $p_i = \sum_{e \in A_i} x_e$ for all $i$ and choose a path that maximizes his expected utility: $i^* = \arg\max_i U^a_i(x; \Gamma)$. However, in real-world security problems we face human attackers who may not respond optimally. The goal of this chapter is to explore models that better predict the behavior of human attackers.

6.2 Adversary Model

In this section, we propose several models of how a human attacker responds to the defender's strategy.

6.2.1 Basic Quantal Response Model

In our first model, the attacker's mixed strategy is a quantal response (QR) to the defender's strategy. Under this QR model, given a graph game $\Gamma$ and a defender strategy $x$, the probability that the adversary chooses path $A_i$ is

$$\text{QR:} \quad q_i(\Gamma \mid x; \lambda) = \frac{e^{\lambda U^a_i(x; \Gamma)}}{\sum_{A_k \in \mathcal{A}} e^{\lambda U^a_k(x; \Gamma)}} \quad (6.4)$$

where $\lambda \geq 0$ is the parameter of the quantal response model, which represents the error level of the adversary's quantal response.
When $\lambda = 0$, the adversary chooses each path with equal probability; as $\lambda \to \infty$, the adversary becomes fully rational and only selects the paths that give him the maximum expected utility. Many empirical studies have shown that $\lambda$ usually takes a positive finite value.

6.2.2 Quantal Response with Heuristics

In a network security game, in order to evaluate the expected utility $U^a_i(x; \Gamma)$ of a path $A_i$, the attacker has to compute $p_i$, which requires reasoning about a sequence of random events, i.e., whether or not each edge on the path will be covered by the defender. Even in our simplified games in which $M = 1$, so that a perfectly rational attacker can compute $p_i$ as the sum $\sum_{e \in A_i} x_e$, computing this probability can be difficult for bounded-rational human attackers who might not know this formula. Instead, the adversary might use simple heuristics to evaluate the "utility" of each path. We propose the following model of the attacker's behavior, which we call Quantal Response with Heuristics (QRH):

$$\text{QRH:} \quad h_i(\Gamma \mid x; \theta) = \frac{e^{\theta \cdot f_i(x)}}{\sum_{A_k \in \mathcal{A}} e^{\theta \cdot f_k(x)}} \quad (6.5)$$
On the other hand, since our focus for the QRH model is on simple heuristics, we use a set of five features that are easy to compute for humans and thus could be used as basis for heuristics. These features are listed in Table 6.2. 6.3 Model Parameter Estimation 6.3.1 Data Collection In order to estimate the values of the parameters of our models, we first need data on how humans behave when faced with the kind of decision tasks the attacker faces. We developed a web-based game which simulates the decision tasks faced by the attacker in network security games, and collected data on how human subjects play the game by posting the game as a Human Intelligent Task (HIT) on Amazon Mechanical Turk (AMT). 1 1 https://www.mturk.com 96 Figure 6.1: Game Interface (colored) Figure 6.1 displays the interface of the game. Players were introduced to the game through a series of explanatory screens describing how the game is played. In the game, the web interface presents a graph to the subjects and specifies the source(starting) nodes and the target nodes in the graph. The subjects are asked to select a path from one of the source nodes to one of the target nodes. They are also told that the defender is trying to catch them by setting up checkpoints on the edges. The probability that there will be a check point on each edge is given to the subjects, as well as the reward for successfully getting through the path and the penalty for being caught by the defender. Thus each instance of this game can be specified by a network security game and a defender strategy. Formally, we define a game sample asg = (;x), where is a network security game and x is a defender strategy. Each human subject plays multiple rounds in sequence, each corresponding to a different game sample. 
In each game round, after a subject selects a path in the network, the edges covered by the defender are sampled according to the probabilities shown in the figure. Subjects get a positive score if they successfully get through the path and a negative score if they select a path with an edge covered by the defender. In order to mitigate learning effects, subjects were not told the result of each game round until they had finished all game rounds. Each subject received $0.50 for participating in the experiments, and was paid a $0.01 bonus for each point earned. In our experiments subjects earned a $1.10 bonus on average.

Figure 6.2: Graphs Tested in Data Collection (Graphs 1-3)

We conducted a first set of experiments on three simple graphs, shown in Figure 6.2. Since the purpose of this set of experiments is to collect data to train our models, we want to use a wide variety of defender strategies. We first randomly generated 1000 different defender strategies for each graph. We then used k-means clustering to group these random strategies into K clusters, and selected the cluster centers as the representative strategies used in the experiments. We selected 10 strategies for Graph 1, 10 strategies for Graph 2 and 20 strategies for Graph 3; details on the strategies can be found in an online appendix.² In total, we tested 40 different game samples, each of which was played by 40 different subjects.

² http://anon-aamas2012-paper826.webs.com/

6.3.2 Training the QR Model

We first train the basic quantal response model, QR, using the data collected in the experiment described in Section 6.3.1. We use Maximum Likelihood Estimation (MLE) to tune the parameter λ. Given the choices of N subjects, with τ(n) denoting the path chosen by subject n, the log-likelihood of λ on game sample g is

log L_QR(λ | g) = Σ_{n=1..N} log q_{τ(n)}(x; λ)

Let N_i be the number of subjects attacking path A_i.
Then,

log L_QR(λ | g) = Σ_{A_i∈A} N_i log q_i(x; λ)

Combining with Equation (6.4),

log L_QR(λ | g) = λ Σ_{A_i∈A} N_i U_i^a(x) − N log( Σ_{A_i∈A} e^{λ U_i^a(x)} )    (6.6)

We train the model by maximizing the total log-likelihood over all 40 game samples:

max_λ  Σ_{g∈S} log L_QR(λ | g)    (6.7)

where S denotes the set of all 40 game samples. It is relatively straightforward to verify that the second-order derivative of log L_QR(λ | g) is always nonpositive; thus log L_QR(λ | g) is a concave function of λ for every g. Therefore the total log-likelihood in Equation (6.7) is concave and we can apply any local optimization solver (we used Matlab's fmincon). The maximum-likelihood estimate of λ based on the data is 0.34.

6.3.3 Training the QRH Model

In training the QRH model, we need to first decide which subset of the 5 features from Table 6.2 to use in the model, and then train the model for the selected features. Although in general the more features we select the better the fit will be, taking the set of all features can result in over-fitting. This feature-selection problem is well studied in statistics and machine learning, and techniques such as L1-regularized regression have been proposed to bias the fit towards smaller sets of features. In this work we apply a simple form of bias: we consider only subsets of features of sizes 1 and 2, and select the top-performing subsets of each size. Specifically, for each L ∈ {1, 2}, we do the following:

1. For each of the (5 choose L) possible subsets of size L, we train a QRH model using this subset of features via MLE;
2. We compare the models using 2-fold cross-validation, and pick the top two feature combinations.

Since we are only selecting from 5 features, we only have to evaluate a small number of models. In future work we plan to explore more sophisticated feature-selection techniques, which would allow us to select from a larger set of possible features.
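Because the total log-likelihood in Equation (6.7) is concave in λ, any one-dimensional search suffices to find the maximizer. A minimal sketch (ternary search is used here in place of fmincon; the synthetic data layout is an assumption of this sketch):

```python
import math

def log_lik_qr(lam, utilities, counts):
    """Log-likelihood of lam on one game sample, in the form of Equation (6.6).

    utilities -- attacker expected utility U_i^a for each path
    counts    -- number of subjects N_i that chose each path
    """
    n = sum(counts)
    logz = math.log(sum(math.exp(lam * u) for u in utilities))
    return lam * sum(c * u for c, u in zip(counts, utilities)) - n * logz

def fit_lambda(samples, lo=0.0, hi=10.0, iters=200):
    """Maximize the total log-likelihood by ternary search.

    Valid because the total log-likelihood is concave in lam; samples is a
    list of (utilities, counts) pairs, one per game sample.
    """
    total = lambda lam: sum(log_lik_qr(lam, u, c) for u, c in samples)
    for _ in range(iters):
        m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
        if total(m1) < total(m2):
            lo = m1
        else:
            hi = m2
    return (lo + hi) / 2
```

For a single sample with two paths, the maximizer satisfies the usual logit moment condition, so the fitted λ recovers the choice frequencies.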
In order to apply 2-fold cross-validation, we first randomly divided the 40 game samples into two equal-sized sets, S_1 and S_2. We conducted two rounds of training, one using S_1 and the other using S_2. In each round, the model is trained by maximizing the total log-likelihood of the game samples in the training set:

max_η  Σ_{g∈S_train} log L_QRH(η | g),    (6.8)

where S_train ∈ {S_1, S_2} is the training set, and log L_QRH(η | g) is the log-likelihood of the QRH model on game sample g, derived similarly to log L_QR(λ | g):

log L_QRH(η | g) = η · ( Σ_{A_i∈A} N_i f_i(x) ) − N log( Σ_{A_i∈A} e^{η · f_i(x)} ).    (6.9)

We can show that log L_QRH(η | g) is a concave function of η, since its Hessian matrix is negative semidefinite. Therefore it can be maximized using any local optimization solver.

Table 6.3: Fit (log L) of the QRH model using a single feature

          Train on S_1          Train on S_2
feature   training   testing    training   testing    Total testing
1         -707.2     -672.0     -636.8     -744.4     -1416.4
2         -693.6     -666.0     -658.0     -702.0     -1368.0
3         -636.0     -580.4     -573.6     -642.8     -1223.2
4         -677.2     -723.6     -710.0     -690.8     -1414.4
5         -667.6     -618.4     -606.4     -680.4     -1298.8
U_i^a     -645.6     -689.6     -682.8     -652.4     -1342.0

Table 6.4: Fit of the QRH model using two features

          Train on S_1          Train on S_2
features  training   testing    training   testing    Total testing
(1,2)     -693.2     -672.0     -635.2     -734.4     -1406.4
(1,3)     -630.4     -594.4     -570.0     -657.2     -1251.6
(1,4)     -603.6     -602.0     -573.6     -636.0     -1238.0
(1,5)     -648.8     -638.4     -606.0     -684.4     -1322.8
(2,3)     -636.0     -582.8     -572.4     -656.8     -1239.6
(2,4)     -631.6     -655.2     -638.0     -649.6     -1304.8
(2,5)     -643.6     -581.2     -566.4     -660.8     -1242.0
(3,4)     -616.0     -601.6     -573.6     -644.0     -1245.6
(3,5)     -636.0     -581.2     -571.2     -646.8     -1228.0
(4,5)     -610.4     -615.6     -592.4     -635.6     -1251.2

Given a combination of features f_i, let η_1 and η_2 be the training results on S_1 and S_2, respectively.
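The 2-fold procedure just described can be sketched as follows (train_fn and loglik_fn stand in for the MLE training of Equation (6.8) and the log-likelihood of Equation (6.9); both interfaces are assumptions of this sketch):

```python
from itertools import combinations

def rank_feature_subsets(s1, s2, n_features, train_fn, loglik_fn, max_size=2):
    """Rank feature subsets of sizes 1..max_size by 2-fold cross-validated fit.

    s1, s2    -- the two halves of the game samples
    train_fn  -- (subset, samples) -> fitted coefficients eta
    loglik_fn -- (eta, subset, samples) -> total log-likelihood on samples
    """
    ranked = []
    for size in range(1, max_size + 1):
        for subset in combinations(range(n_features), size):
            eta1 = train_fn(subset, s1)          # train on S1 ...
            eta2 = train_fn(subset, s2)          # ... and on S2
            # fit = held-out log-likelihood summed over the two folds
            fit = loglik_fn(eta1, subset, s2) + loglik_fn(eta2, subset, s1)
            ranked.append((fit, subset))
    ranked.sort(reverse=True)                    # higher log-likelihood first
    return ranked
```

With 5 features and subset sizes 1 and 2, only 15 models need to be evaluated, which is why this exhaustive form of selection is affordable here.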
We measure the model fit of f_i as the sum of the log-likelihoods of S_2 under the model with η_1 and of S_1 under the model with η_2:

Fit(f_i) = Σ_{g∈S_2} log L_QRH(η_1 | g) + Σ_{g∈S_1} log L_QRH(η_2 | g)    (6.10)

Table 6.3 displays the fit results for single features. For comparison, we also conducted MLE training with 2-fold cross-validation for the QR model and list its fitting result in the last row of Table 6.3. Overall, feature 3 (maximum edge coverage) achieves the best fitting performance, which is also better than the QR model. Additionally, feature 5 (average edge coverage) also achieves better fitting performance than the QR model. Table 6.4 displays the fit results with two features. The best two combinations are (1, 4) and (3, 4).

Table 6.5: Trained Parameter Values

features   parameter value η
3          -9.95
5          -6.26
(1,4)      (1.04, -10.60)
(3,4)      (-9.67, -1.95)

Based on the 2-fold cross-validation results, we selected four candidate feature combinations for the QRH model: feature 3; feature 5; features 1 and 4; and features 3 and 4. We then tuned the model parameters for these candidates by training on the whole data set S. The final values of the parameters are listed in Table 6.5.

6.4 Computing the Defender Resource Allocation Strategy

In this section, we describe how we compute optimal defender strategies against the different models of attackers.

6.4.1 Best Response to the QR Model

Given a QR model of the adversary, the defender's expected utility from playing strategy x in a network security game is

Σ_{A_i∈A} q_i(x; λ) U_i^d(x).    (6.11)

Combining with Equation (6.1), we have the following optimization problem for the defender's optimal strategy against a QR model of the adversary:

max_{x,p}  [ Σ_{A_i∈A} e^{λR_i^a} e^{−λ(R_i^a − P_i^a)p_i} ((R_i^d − P_i^d)p_i + P_i^d) ] / [ Σ_{A_i∈A} e^{λR_i^a} e^{−λ(R_i^a − P_i^a)p_i} ]    (6.12)

s.t.   Σ_{e∈E} x_e ≤ M    (6.13)
       p_i = Σ_{e∈A_i} x_e,  ∀A_i ∈ A    (6.14)
       0 ≤ x_e ≤ 1,  ∀e ∈ E    (6.15)

where λ = 0.34 as learned from the data.
Since the defender is assumed to have only one resource, Constraint (6.14) ensures that p_i is the probability that path A_i will be covered by the defender. The objective function, Equation (6.12), is a nonlinear fractional function and is therefore not guaranteed to be concave. We use a heuristic algorithm based on local optimization with random restarts, described in Algorithm 3. The algorithm generates a new starting point in each iteration and (at Line 5) calls FindLocalMaximum to find a locally optimal solution of (6.12). The best locally optimal solution found is returned at the end. We used Matlab's fmincon as the local optimizer.

Algorithm 3: Local Search with Random Multi-Restart
1  Input: IterN;
2  opt_g ← −∞;
3  for i ← 1, ..., IterN do
4      x_0 ← randomly generated feasible starting point;
5      (opt_l, x*) ← FindLocalMaximum(x_0);
6      if opt_l > opt_g then
7          opt_g ← opt_l, x_opt ← x*
8      end
9  end
10 return opt_g, x_opt;

6.4.2 Best Response to the QRH Model

In this section, we explain our approach for computing an optimal defender strategy against a QRH model, given any combination of features f_i(x) and the corresponding feature coefficients η. Given a network security game and the defender's strategy x, the probability that the attacker will select path A_i is h_i(x; η), as defined by Equation (6.5). The defender's expected utility can then be written as

Σ_{A_i∈A} h_i(x; η) U_i^d(x).

Therefore we can formulate the defender's optimal strategy as the solution of the following optimization problem:

max_{x,p}  [ Σ_{A_i∈A} e^{η·f_i(x)} ((R_i^d − P_i^d)p_i + P_i^d) ] / [ Σ_{A_i∈A} e^{η·f_i(x)} ]    (6.16)

s.t.   Σ_{e∈E} x_e ≤ M    (6.17)
       p_i = Σ_{e∈A_i} x_e,  ∀A_i ∈ A    (6.18)
       0 ≤ x_e ≤ 1,  ∀e ∈ E    (6.19)

where f_i(x) uses a subset of the features described in Table 6.2. Again, the objective function (6.16) is a nonlinear fractional function, so it is not guaranteed to be concave. Nevertheless we can apply Algorithm 3, using FindLocalMaximum to find a locally optimal solution of (6.16).
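Algorithm 3 is independent of the particular local optimizer. A minimal sketch (local_max stands in for FindLocalMaximum, e.g. a call to a gradient-based solver; its interface here is an assumption of this sketch):

```python
import random

def multi_restart_local_search(random_start, local_max, iter_n=20, seed=0):
    """Algorithm 3 sketch: keep the best of iter_n locally optimal solutions.

    random_start -- rng -> feasible starting point
    local_max    -- x0 -> (local_value, local_point); stands in for
                    FindLocalMaximum (e.g. a gradient-based solver call)
    """
    rng = random.Random(seed)
    opt_g, x_opt = float('-inf'), None
    for _ in range(iter_n):
        x0 = random_start(rng)
        opt_l, x_star = local_max(x0)
        if opt_l > opt_g:          # keep the best local optimum seen so far
            opt_g, x_opt = opt_l, x_star
    return opt_g, x_opt
```

The more restarts, the higher the chance of landing in the basin of the global optimum, but there is no guarantee; that limitation motivates the algorithms of Chapter 7.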
6.5 Experiment Results

In this section, we evaluate the performance of the different models in network security games. We use the same web-based game introduced in Section 6.3.1 to set up the experiments with human subjects. Unlike the first set of experiments, where we intended to collect data on how humans play the game in order to train the models, the goal of this new set of experiments is to play the defender strategies computed from the different models against human subjects, in order to compare the performance of these models.

6.5.1 Experiment Settings

Figure 6.1 shows the interface of the web-based game we developed; the game rules are detailed in Section 6.3.1. We now focus on describing the game instances included in these experiments. We tested eight different graph types, including the three graphs used in data collection, displayed in Figure 6.2. The other five graphs are displayed in Figure 6.3. Among the eight graphs, four have a single target (Graphs 1-4) and four have multiple targets (Graphs 5-8). Since the models are trained using data from single-target graphs only, we are interested in seeing how they perform on multi-target graphs. For each graph type, we designed two different sets of payoffs (i.e., the reward/penalty for the attacker and the defender on each path).³ Therefore, we have a total of 8 × 2 = 16 security games in the experiments.

Table 6.6: Attacker Models Evaluated

Uniform   Defender covers each edge with equal probability
Maximin   Attacker always chooses the worst path for the defender
Rational  Attacker maximizes his expected utility
QR        Quantal response (λ = 0.34)
QRH-1     QRH with maximum edge coverage (η = -9.95)
QRH-2     QRH with average edge coverage (η = -6.26)
QRH-3     QRH with number of edges and sum of edge coverage (η = ⟨1.04, -10.60⟩)
QRH-4     QRH with maximum edge coverage and sum of edge coverage (η = ⟨-9.67, -1.95⟩)
For each of these games, we computed the defender strategies from the eight different models listed in Table 6.6. Therefore, for each game instance, we have eight different defender strategies. In total, we have 8 × 16 = 128 different game samples (i.e., combinations of security games and defender strategies). Each game sample was played by 40 different subjects.

Figure 6.3: Graphs Tested in the Evaluation Experiments (Graphs 4-8)

³ The details of the payoffs can be found in the online appendix: http://anon-aamas2012-paper826.webs.com/

6.5.2 Experiment Results

We evaluate the performance of the different defender strategies using the defender's expected utility. Given that a subject selects path A_i, the defender's expected utility is computed with Equation (6.1).

Average Performance: We first evaluated the average defender expected utility, U_exp^d(x), of the different defender strategies based on all 40 subjects' choices:

U_exp^d(x) = (1/40) Σ_{A_i∈A} N_i U_i^d(x)

where N_i is the number of subjects that chose path A_i. Figure 6.4 displays the average performance of the different models across all the single-target games (left group of bars) and across all the multi-target games (right group of bars). In both cases, the QR model outperforms the three baseline models (Uniform, Maximin and Rational). Among the four QRH models, QRH-2 (average edge coverage) and QRH-3 (number of edges and sum of edge coverage) outperform the three baseline models in both single-target and multi-target graphs. There are two interesting observations from the figure. Between the QR model and the QRH models, we see that in the single-target graphs none of the QRH models achieves better performance than the QR model, while in the multi-target graphs all four QRH models outperform QR.
This is an unexpected result, since the QRH models were trained on the single-target graph data and do not use features specific to the multi-target graphs. The Rational model did worse in the multi-target graphs than in the single-target graphs, as compared to the QR and QRH models: in the single-target graphs, the average defender expected utility achieved by the Rational model was closer to that of the QR and QRH models, and it even outperformed two of the QRH models (QRH-1 and QRH-4), while in the multi-target graphs the Rational model was significantly outperformed by both the QR and QRH models. This is also a surprising result, since the QR and QRH models were trained on the single-target graphs and are thus expected to perform better in the single-target graphs. A possible reason for these two observations is that as the graph becomes more complex (i.e., more targets and more paths), it becomes more difficult for humans to compute the actual expected utility of each path, so they are more likely to rely on heuristics.

We also show the performance of the different models on each graph type: Figure 6.5(a) shows the average defender expected utility achieved by the eight models in the four single-target graphs, and Figure 6.5(b) displays the results in the four multi-target graphs. We can see from Figure 6.5(a) that the Rational model was outperformed by the QRH-3 model in all four graphs; it was also outperformed by the QR model in 3 of the 4 graphs, the exception being Graph 2, where the two models have roughly the same performance. In the multi-target graphs, Figure 6.5(b) shows that the Rational model was outperformed by both the QR and QRH models in three of the four graphs, the exception being Graph 7, where all of the models have very similar performance.

Model Fitting Performance: We also evaluated the fitting performance of the five trained models.
Table 6.7 reports the total log-likelihood of the different models on the multi-target games and the single-target games.⁴ It is clear that the four QRH models achieve much better fitting performance (i.e., higher log-likelihood) than the QR model. An interesting finding here is that better fitting performance does not necessarily lead to higher defender expected utility. In particular, QRH-4 has the best fitting performance in the single-target graphs, but the average defender expected utility achieved by QRH-4 was much worse than that of the other models except QRH-1, as shown in Figure 6.4.

⁴ Detailed model fitting results can be found at the online appendix: http://anon-aamas2012-paper826.webs.com.

Figure 6.4: Average Defender Expected Utility (single-target vs. multi-target games)

Table 6.7: Fitting Performance: log-likelihood of the different models

         Multi-Target         Single-Target
         (8 game instances)   (8 game instances)   Total
QR       -397.09              -232.04              -629.13
QRH-1    -297.51              -208.27              -505.78
QRH-2    -349.60              -193.31              -542.90
QRH-3    -283.74              -168.43              -452.18
QRH-4    -279.43              -168.03              -447.46

Figure 6.5: Average Defender Expected Utility: (a) Single-Target Graphs; (b) Multi-Target Graphs

Chapter 7: Computing the Defender's Optimal Strategy

In this chapter, I address the problem of computing optimal defender strategies in real-world security games against a quantal response model of attackers. The difficulties faced here include (1) efficiently solving a non-convex optimization problem for massive real-world security games; and (2) addressing constraints on assigning security resources, which add to the complexity of computing the optimal defender strategy.
I introduced BRQR in Section 4.2 to compute the defender's optimal strategy against a quantal response adversary. BRQR, however, is not guaranteed to converge to the optimal solution, as it uses a nonlinear solver with multiple restarts to obtain an efficient solution to a non-convex optimization problem. Furthermore, it does not consider the resource assignment constraints that arise in my real-world security domains. This chapter presents two new algorithms to address these difficulties. After formally defining the problem in Section 7.1, Section 7.2 introduces the basic idea of using binary search to iteratively compute the defender's optimal strategy against a quantal response model of the adversary. Section 7.3 then presents the GOSAQ algorithm, which computes the globally optimal defender strategy against a QR model of attackers when there are no resource constraints, and gives an efficient heuristic otherwise. Section 7.4 follows with PASAQ, which provides an efficient approximation of the optimal defender strategy with or without resource constraints. Finally, Section 7.5 presents detailed experimental results showing the advantages of GOSAQ and PASAQ in solution quality over the benchmark algorithm (BRQR), and the efficiency of PASAQ.

7.1 Problem Definition

Assume a QR-adversary, i.e., one with a quantal response ⟨q_i, i ∈ T⟩ to the defender's mixed strategy x = ⟨x_i, i ∈ T⟩. The value q_i is the probability that the adversary attacks target i, computed as

q_i(x) = e^{λ U_i^a(x_i)} / Σ_{k∈T} e^{λ U_k^a(x_k)}    (7.1)

where λ ≥ 0 is the parameter of the quantal response model, which represents the error level in the adversary's quantal response.
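Equation (7.1) can be evaluated directly once the attacker's expected utilities are known. A minimal sketch (it assumes the standard SSG payoff convention U_i^a(x_i) = x_i P_i^a + (1 − x_i) R_i^a from the earlier chapters):

```python
import math

def qr_attack_distribution(x, Ra, Pa, lam):
    """q_i(x) from Equation (7.1), with U_i^a(x_i) = x_i*Pa_i + (1-x_i)*Ra_i.

    x -- marginal coverage per target; Ra, Pa -- attacker reward/penalty lists.
    """
    Ua = [xi * pa + (1.0 - xi) * ra for xi, ra, pa in zip(x, Ra, Pa)]
    m = max(Ua)
    w = [math.exp(lam * (u - m)) for u in Ua]   # shift for numerical stability
    total = sum(w)
    return [wi / total for wi in w]
```

At λ = 0 the distribution is uniform over targets; as λ grows, probability mass concentrates on the targets with the highest attacker expected utility.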
Simultaneously, the defender (given her computer-aided decision-making tool) maximizes her expected utility

U^d(x) = Σ_{i∈T} q_i(x) U_i^d(x_i)

Therefore, in domains without constraints on assigning the resources, the problem of computing the optimal defender strategy against a QR-adversary can be written in terms of marginals as:

P1:  max_x  [ Σ_{i∈T} e^{λR_i^a} e^{−λ(R_i^a − P_i^a)x_i} ((R_i^d − P_i^d)x_i + P_i^d) ] / [ Σ_{i∈T} e^{λR_i^a} e^{−λ(R_i^a − P_i^a)x_i} ]

     s.t.   Σ_{i∈T} x_i ≤ M
            0 ≤ x_i ≤ 1,  ∀i ∈ T

Problem P1 has a polyhedral feasible region and a non-convex fractional objective function.

7.1.1 Resource Assignment Constraints

As I have shown in Section 2.1.3, there can be arbitrary constraints on assigning the security resources in real-world security problems. A resource assignment constraint implies that the set of feasible assignments A is restricted; not all combinatorial assignments of resources to targets are allowed. Hence, the marginal coverages on targets, x, are also restricted. In order to compute the defender's optimal strategy against a QR-adversary in the presence of resource assignment constraints, we need to solve P2, in which the constraints of P1 are modified to enforce feasibility of the marginal coverage:

P2:  max_{x,a}  [ Σ_{i∈T} e^{λR_i^a} e^{−λ(R_i^a − P_i^a)x_i} ((R_i^d − P_i^d)x_i + P_i^d) ] / [ Σ_{i∈T} e^{λR_i^a} e^{−λ(R_i^a − P_i^a)x_i} ]

     s.t.   Σ_{i∈T} x_i ≤ M
            x_i = Σ_{A_j∈A} a_j A_ij,  ∀i ∈ T
            Σ_{A_j∈A} a_j = 1
            0 ≤ a_j ≤ 1,  ∀A_j ∈ A

7.2 Binary Search Method

We need to solve P1 and P2 to compute the optimal defender strategy, which requires optimally solving a non-convex problem, in general NP-hard [Vavasis, 1995]. In this section, we describe the basic structure of a binary search method for solving the two problems. Further effort is required to convert this skeleton into efficiently runnable algorithms; we fill in the additional details in the next two sections.
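The binary search skeleton reduces the fractional program to a sequence of feasibility checks on candidate objective values r. A generic sketch, assuming a caller-supplied oracle that reports whether some feasible x satisfies r·D(x) − N(x) ≤ 0 (the oracle itself is developed in the following sections):

```python
def binary_search_value(check_feasible, lower, upper, eps=1e-6):
    """Binary search on the defender's achievable utility value r.

    check_feasible(r) -- (True, x_r) if some feasible x satisfies
                         r * D(x) - N(x) <= 0 (i.e. r <= p*), else (False, None)
    """
    best_x = None
    while upper - lower > eps:
        r = (lower + upper) / 2.0
        feasible, x_r = check_feasible(r)
        if feasible:
            lower, best_x = r, x_r     # r <= p*: raise the lower bound
        else:
            upper = r                  # r > p*: lower the upper bound
    return lower, best_x
```

Each iteration halves the interval, so the number of oracle calls is logarithmic in (upper − lower)/eps.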
Table 7.1: Symbols for Targets in SSGs

θ_i := e^{λ R_i^a} > 0
β_i := λ (R_i^a − P_i^a) > 0
α_i := R_i^d − P_i^d > 0

For notational simplicity, we first define, ∀i ∈ T, the symbols in Table 7.1. We then denote the numerator and denominator of the objective function of P1 and P2 by N(x) and D(x):

N(x) = Σ_{i∈T} θ_i α_i x_i e^{−β_i x_i} + Σ_{i∈T} θ_i P_i^d e^{−β_i x_i}
D(x) = Σ_{i∈T} θ_i e^{−β_i x_i} > 0

The key idea of the binary search method is to iteratively estimate the global optimal value (p*) of the fractional objective function of P1, instead of searching for it directly. Let X_f be the feasible region of P1 (or P2). Given a real value r, we can decide whether or not r ≤ p* by checking

∃x ∈ X_f  s.t.  r D(x) − N(x) ≤ 0    (7.2)

We now justify the correctness of the binary search method for solving any generic fractional programming problem max_{x∈X_f} N(x)/D(x), for any functions N(x) and D(x) > 0.

Lemma 1. For any real value r ∈ R:
(a) r ≤ p* ⟺ ∃x ∈ X_f s.t. r D(x) − N(x) ≤ 0
(b) r > p* ⟺ ∀x ∈ X_f, r D(x) − N(x) > 0

Proof. We only prove (a), as (b) is proven similarly. '⇐': since there exists x such that r D(x) ≤ N(x), we have r ≤ N(x)/D(x) ≤ p*. '⇒': since P1 optimizes a continuous objective over a compact set, there exists an optimal solution x* with p* = N(x*)/D(x*) ≥ r, which rearranges to give the result. □

Algorithm 4: Binary Search
1  Input: ε, P_M and numRes;
2  (U_0, L_0) ← EstimateBounds(P_M, numRes);
3  (U, L) ← (U_0, L_0);
4  while U − L ≥ ε do
5      r ← (U + L)/2;
6      (feasible, x_r) ← CheckFeasibility(r);
7      if feasible then
8          L ← r
9      else
10         U ← r
11     end
12 end
13 return L, x_L;

Algorithm 4 describes the basic structure of the binary search method. Given the payoff matrix (P_M) and the total number of security resources (numRes), it first initializes the upper bound (U_0) and lower bound (L_0) of the defender's expected utility on Line 2. Then, in each iteration, r is set to the mean of U and L. Line 6 checks whether the current r satisfies Equation (7.2).
If so, p* ≥ r, and the lower bound of the binary search is increased; in this case CheckFeasibility also returns a valid strategy x_r. Otherwise, p* < r, and the upper bound is decreased. The search continues until the upper and lower bounds are sufficiently close, i.e., U − L < ε. The number of iterations of Algorithm 4 is bounded by O(log((U_0 − L_0)/ε)). Specifically for SSGs, we can estimate the upper and lower bounds as follows:

Lower bound: Let s_u be any feasible defender strategy. The defender's utility from playing s_u against the adversary's quantal response is a lower bound on the optimal value of P1. A simple example of s_u is the uniform strategy.

Upper bound: Since P_i^d ≤ U_i^d ≤ R_i^d, we have U_i^d ≤ max_{i∈T} R_i^d. The defender's utility is computed as Σ_{i∈T} q_i U_i^d, where U_i^d is the defender's utility on target i and q_i is the probability that the adversary attacks target i. Thus, the maximum R_i^d serves as an upper bound on the defender's utility.

We now turn to feasibility checking, performed on Line 6 of Algorithm 4. Given a real number r ∈ R, in order to check whether Equation (7.2) is satisfied, we introduce CF-OPT:

CF-OPT:  min_{x∈X_f}  r D(x) − N(x)

Let δ* denote the optimal objective value of this problem. If δ* ≤ 0, Equation (7.2) must be true. Therefore, by solving CF-OPT and checking whether δ* ≤ 0, we can determine whether a given r is larger or smaller than the global maximum. However, the objective function of CF-OPT is still non-convex, so solving it directly remains hard. We introduce two methods to address this in the next two sections.

7.3 GOSAQ

We now present the Global Optimal Strategy Against Quantal response (GOSAQ), which adapts Algorithm 4 to efficiently solve problems P1 and P2.
It does so through the following nonlinear, invertible change of variables:

y_i = e^{−β_i x_i},  ∀i ∈ T    (7.3)

7.3.1 GOSAQ with No Assignment Constraints

We first focus on applying GOSAQ to solve P1 for problems with no resource assignment constraints. Here, GOSAQ uses Algorithm 4, but with CF-OPT rewritten as follows under the above variable substitution:

min_y  r Σ_{i∈T} θ_i y_i − Σ_{i∈T} θ_i P_i^d y_i + Σ_{i∈T} (θ_i α_i / β_i) y_i ln(y_i)

s.t.   Σ_{i∈T} −(1/β_i) ln(y_i) ≤ M    (7.4)
       e^{−β_i} ≤ y_i ≤ 1,  ∀i    (7.5)

We refer to this optimization problem as GOSAQ-CP.

Lemma 2. Let Obj_CF(x) and Obj_GC(y) be the objective functions of CF-OPT and GOSAQ-CP, respectively, and let X_f and Y_f denote their feasible domains. Then

min_{x∈X_f} Obj_CF(x) = min_{y∈Y_f} Obj_GC(y)    (7.6)

The proof, omitted for brevity, follows from the variable substitution in Equation (7.3). Lemma 2 indicates that solving GOSAQ-CP is equivalent to solving CF-OPT. We now show that GOSAQ-CP is in fact a convex optimization problem.

Lemma 3. GOSAQ-CP is a convex optimization problem with a unique optimal solution.

Proof. Both the objective function and the nonlinear constraint function (7.4) of GOSAQ-CP are strictly convex, as can be shown by taking second derivatives and verifying that the Hessian matrices are positive definite. The fact that the objective is strictly convex implies that there can be only one optimal solution. □

In theory, convex optimization problems like the one above can be solved in polynomial time through the ellipsoid method or an interior point method with the volumetric barrier function [Boyd and Vandenberghe, 2004] (in practice a number of nonlinear solvers can find the unique KKT point efficiently). Hence, GOSAQ entails running Algorithm 4, performing the feasibility check O(log((U_0 − L_0)/ε)) times, and each time solving the polynomially solvable GOSAQ-CP. Therefore, GOSAQ is a polynomial-time algorithm. We now bound GOSAQ's solution quality.

Lemma 4.
Let L* and U* be the lower and upper bounds of GOSAQ when the algorithm stops, and let x* be the defender strategy returned by GOSAQ. Then,

L* ≤ Obj_P1(x*) ≤ U*

where Obj_P1(x) denotes the objective function of P1.

Proof. Given r, let δ(r) be the minimum value of the objective function of GOSAQ-CP. When GOSAQ stops, we have δ(L*) ≤ 0, because Lines 6-8 of Algorithm 4 require this in order to update the lower bound. Hence, from Lemma 2, L* D(x*) − N(x*) ≤ 0 ⇒ L* ≤ N(x*)/D(x*). Similarly, δ(U*) > 0 ⇒ U* > N(x*)/D(x*). □

Theorem 1. Let x* be the defender strategy computed by GOSAQ. Then

0 ≤ p* − Obj_P1(x*) ≤ ε    (7.7)

Proof. p* is the global maximum of P1, so p* ≥ Obj_P1(x*). Let L* and U* be the lower and upper bounds when GOSAQ stops. By Lemma 4, L* ≤ Obj_P1(x*) ≤ U*. Simultaneously, Algorithm 4 guarantees that L* ≤ p* ≤ U*. Therefore, 0 ≤ p* − Obj_P1(x*) ≤ U* − L* ≤ ε. □

Theorem 1 indicates that the solution obtained by GOSAQ is an ε-optimal solution.

7.3.2 GOSAQ with Assignment Constraints

In order to address the assignment constraints, we need to solve P2. Note that the objective function of P2 is the same as that of P1; the difference lies in the extra constraints, which enforce that the marginal coverage is feasible. Therefore we once again use Algorithm 4 with the variable substitution given in Equation (7.3), but modify GOSAQ-CP as follows (referred to as GOSAQ-CP-C) to incorporate the extra constraints:

min_{y,a}  r Σ_{i∈T} θ_i y_i − Σ_{i∈T} θ_i P_i^d y_i + Σ_{i∈T} (θ_i α_i / β_i) y_i ln(y_i)

s.t.   Constraints (7.4), (7.5)
       −(1/β_i) ln(y_i) = Σ_{A_j∈A} a_j A_ij,  ∀i ∈ T    (7.8)
       Σ_{A_j∈A} a_j = 1    (7.9)
       0 ≤ a_j ≤ 1,  ∀A_j ∈ A    (7.10)

Equation (7.8) is a nonlinear equality constraint that makes this optimization problem non-convex. There are no known polynomial-time algorithms for generic non-convex optimization problems, which can have multiple local minima. We can attempt to solve such non-convex problems with one of the efficient nonlinear solvers, but we would then obtain a KKT point that may be only locally optimal.
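The substitution in Equation (7.3) can be checked term by term: with y_i = e^{−β_i x_i} (using the Table 7.1 shorthand θ_i = e^{λR_i^a}, β_i = λ(R_i^a − P_i^a), α_i = R_i^d − P_i^d), each −θ_i α_i x_i e^{−β_i x_i} term of r·D(x) − N(x) becomes (θ_i α_i / β_i) y_i ln(y_i), the strictly convex term behind Lemma 3. A small numerical check (the constants used are arbitrary):

```python
import math

def cf_term_in_x(x, theta, alpha, beta):
    """One -theta*alpha*x*exp(-beta*x) term of r*D(x) - N(x)."""
    return -theta * alpha * x * math.exp(-beta * x)

def cf_term_in_y(y, theta, alpha, beta):
    """The same term after y = exp(-beta*x): (theta*alpha/beta) * y * ln(y)."""
    return (theta * alpha / beta) * y * math.log(y)
```

Since y ln(y) is strictly convex on (0, 1], and the remaining terms of the transformed objective are linear in y, convexity of GOSAQ-CP follows for this part of the objective.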
There are a few research-grade global solvers for non-convex programs, but they are limited to specific problem classes or small instances. Therefore, in the presence of assignment constraints, GOSAQ is no longer guaranteed to return the optimal solution, as we may be left with locally optimal solutions to the subproblems GOSAQ-CP-C.

7.4 PASAQ

Since GOSAQ may be unable to provide a quality bound in the presence of assignment constraints (and, as shown later, may turn out to be inefficient in such cases), we propose the Piecewise linear Approximation of the optimal Strategy Against Quantal response (PASAQ), an algorithm that computes an approximately optimal defender strategy. PASAQ has the same structure as Algorithm 4. The key idea in PASAQ is to use a piecewise linear function to approximate the nonlinear objective function of CF-OPT, and thus to convert it into a Mixed-Integer Linear Program (MILP). Such a formulation easily accommodates assignment constraints, giving an approximate solution for an SSG against a QR-adversary with assignment constraints. In order to demonstrate the piecewise approximation in PASAQ, we first rewrite the nonlinear objective function of CF-OPT as:

Σ_{i∈T} θ_i (r − P_i^d) e^{−β_i x_i} − Σ_{i∈T} θ_i α_i x_i e^{−β_i x_i}

The goal is to approximate the two nonlinear functions f_i^(1)(x_i) = e^{−β_i x_i} and f_i^(2)(x_i) = x_i e^{−β_i x_i} by two piecewise linear functions on the range x_i ∈ [0, 1], for each i = 1..|T|.

Figure 7.1: Piecewise Linear Approximation: (a) approximation of f_i^(1)(x_i); (b) approximation of f_i^(2)(x_i)

We first uniformly divide the range [0, 1] into K pieces (segments).
Simultaneously, we introduce a set of new variables {x_ik, k = 1..K} to represent the portion of x_i in each of the K pieces {[(k−1)/K, k/K], k = 1..K}. Thus x_ik ∈ [0, 1/K], ∀k = 1..K, and x_i = Σ_{k=1}^K x_ik. In order to ensure that {x_ik} is a valid partition of x_i, the x_ik must satisfy: x_ik > 0 only if x_ik' = 1/K, ∀k' < k. In other words, x_ik can be non-zero only when all the previous pieces are completely filled. Figures 7.1(a) and 7.1(b) display two examples of such a partition. We can then represent the two nonlinear functions as piecewise linear functions of {x_ik}. Let {(k/K, f_i^(1)(k/K)), k = 0..K} be the K+1 cut-points of the linear segments of f_i^(1)(x_i), and let {γ_ik, k = 1..K} be the slopes of these segments. Starting from f_i^(1)(0), the piecewise linear approximation of f_i^(1)(x_i), denoted L_i^(1)(x_i), is

L_i^(1)(x_i) = f_i^(1)(0) + Σ_{k=1}^K γ_ik x_ik = 1 + Σ_{k=1}^K γ_ik x_ik

Similarly, we obtain the piecewise linear approximation of f_i^(2)(x_i), denoted L_i^(2)(x_i):

L_i^(2)(x_i) = f_i^(2)(0) + Σ_{k=1}^K μ_ik x_ik = Σ_{k=1}^K μ_ik x_ik

where {μ_ik, k = 1..K} are the slopes of the corresponding segments.

7.4.1 PASAQ with No Assignment Constraints

In domains without assignment constraints, PASAQ consists of Algorithm 4 with CF-OPT rewritten as follows:

min_{x,z}  Σ_{i∈T} θ_i (r − P_i^d)(1 + Σ_{k=1}^K γ_ik x_ik) − Σ_{i∈T} θ_i α_i Σ_{k=1}^K μ_ik x_ik

s.t.   Σ_{i∈T} Σ_{k=1}^K x_ik ≤ M    (7.11)
       0 ≤ x_ik ≤ 1/K,  ∀i, k = 1..K    (7.12)
       z_ik (1/K) ≤ x_ik,  ∀i, k = 1..K−1    (7.13)
       x_{i(k+1)} ≤ z_ik,  ∀i, k = 1..K−1    (7.14)
       z_ik ∈ {0, 1},  ∀i, k = 1..K−1    (7.15)

We refer to this MILP formulation as PASAQ-MILP.

Lemma 5. The feasible region for x = ⟨x_i = Σ_{k=1}^K x_ik, i ∈ T⟩ of PASAQ-MILP is equivalent to that of P1.

Justification. The auxiliary integer variable z_ik indicates whether or not x_ik = 1/K. Equation (7.13) enforces that z_ik = 0 whenever x_ik < 1/K. Simultaneously, Equation (7.14) enforces that x_{i(k+1)} is positive only if z_ik = 1.
Hence, {x_{ik}, k = 1..K} is a valid partition of x_i, with x_i = \sum_{k=1}^K x_{ik} and x_i \in [0, 1]. Thus, the feasible region of PASAQ-MILP is equivalent to that of P1. □

Lemma 5 shows that the solution provided by PASAQ is in the feasible region of P1. However, PASAQ approximates the minimum value of CF-OPT by using PASAQ-MILP, and furthermore solves P1 approximately using binary search. Hence, we need to show an error bound on the solution quality of PASAQ.

Table 7.2: Game constants used in the error bound proof

\bar{\theta} := \max_{i \in T} \theta_i      \underline{\theta} := \min_{i \in T} \theta_i
\bar{R}^d := \max_{i \in T} |R^d_i|          \bar{P}^d := \max_{i \in T} |P^d_i|
\bar{\beta} := \max_{i \in T} \beta_i        \bar{\alpha} := \max_{i \in T} \alpha_i

We first show Lemmas 6-9 on the way to building the proof of the error bound; full proofs are available in Appendix A. The notation used is defined in Table 7.2. Further, we define two constants which are decided by the game payoffs:

C_1 = (\bar{\theta}/\underline{\theta}) e^{\bar{\beta}} {(\bar{R}^d + \bar{P}^d)\bar{\beta} + \bar{\alpha}}    and    C_2 = 1 + (\bar{\theta}/\underline{\theta}) e^{\bar{\beta}}

In the following, we are interested in obtaining a bound on the difference between p* (the global optimum obtained from P1) and Obj_{P1}(x̃*), where x̃* is the strategy obtained from PASAQ. However, along the way, we have to obtain a bound for the difference between Obj_{P1}(x̃*) and its corresponding piecewise linear approximation \tilde{Obj}_{P1}(x̃*).

Lemma 6. Let L̃* and Ũ* be the final lower and upper bounds of PASAQ, and let x̃* be the defender strategy returned by PASAQ. Then,

L̃* \le \tilde{Obj}_{P1}(x̃*) \le Ũ*

Lemma 7. Let Ñ(x) = \sum_{i \in T} \theta_i \alpha_i L^{(2)}_i(x_i) + \sum_{i \in T} \theta_i P^d_i L^{(1)}_i(x_i) and D̃(x) = \sum_{i \in T} \theta_i L^{(1)}_i(x_i) > 0 be the piecewise linear approximations of N(x) and D(x) respectively. Then, \forall x \in X_f,

|N(x) - Ñ(x)| \le \bar{\theta}(\bar{\alpha} + \bar{P}^d \bar{\beta}) |T|/K
|D(x) - D̃(x)| \le \bar{\theta} \bar{\beta} |T|/K

Lemma 8. The difference between the objective function of P1, Obj_{P1}(x), and its corresponding piecewise linear approximation, \tilde{Obj}_{P1}(x), is less than C_1 (1/K).

Proof.
|Obj_{P1}(x) - \tilde{Obj}_{P1}(x)| = |N(x)/D(x) - Ñ(x)/D̃(x)|
  = |N(x)/D(x) - N(x)/D̃(x) + N(x)/D̃(x) - Ñ(x)/D̃(x)|
  \le (1/D̃(x)) ( |Obj_{P1}(x)| |D(x) - D̃(x)| + |N(x) - Ñ(x)| )

Based on Lemma 7, |Obj_{P1}(x)| \le \bar{R}^d, and D̃(x) \ge |T| \underline{\theta} e^{-\bar{\beta}}. Hence,

|Obj_{P1}(x) - \tilde{Obj}_{P1}(x)| \le C_1 (1/K)    □

Lemma 9. Let L̃* and L* be the final lower bounds of PASAQ and GOSAQ respectively. Then,

L* - L̃* \le C_1 (1/K) + C_2 ε

Theorem 2. Let x̃* be the defender strategy computed by PASAQ, and let p* be the global optimal defender expected utility. Then,

0 \le p* - Obj_{P1}(x̃*) \le 2 C_1 (1/K) + (C_2 + 1) ε

Proof. The first inequality is implied since x̃* is a feasible solution. Furthermore,

p* - Obj_{P1}(x̃*) = (p* - L*) + (L* - L̃*) + (L̃* - \tilde{Obj}_{P1}(x̃*)) + (\tilde{Obj}_{P1}(x̃*) - Obj_{P1}(x̃*))

Algorithm 5 indicates that L* \le p* \le U*, hence p* - L* \le ε. Additionally, Lemmas 8, 9 and 6 provide upper bounds on \tilde{Obj}_{P1}(x̃*) - Obj_{P1}(x̃*), L* - L̃* and L̃* - \tilde{Obj}_{P1}(x̃*) respectively, therefore

p* - Obj_{P1}(x̃*) \le ε + C_1 (1/K) + C_2 ε + C_1 (1/K) = 2 C_1 (1/K) + (C_2 + 1) ε    □

Theorem 2 suggests that, given a game instance, the solution quality of PASAQ is bounded linearly by the binary search threshold ε and the piecewise linear accuracy 1/K. Therefore the PASAQ solution can be made arbitrarily close to the optimal solution with sufficiently small ε and sufficiently large K.

7.4.2 PASAQ With Assignment Constraints

In order to extend PASAQ to handle assignment constraints, we modify PASAQ-MILP as follows, referred to as PASAQ-MILP-C:

min_{x,z,a}  \sum_{i \in T} \theta_i (r - P^d_i)(1 + \sum_{k=1}^K \gamma_{ik} x_{ik}) - \sum_{i \in T} \theta_i \alpha_i \sum_{k=1}^K \mu_{ik} x_{ik}

s.t.  Constraints (7.11)-(7.15)
      \sum_{k=1}^K x_{ik} = \sum_{A_j \in A} a_j A_{ij},   \forall i \in T   (7.16)
      \sum_{A_j \in A} a_j = 1                                               (7.17)
      0 \le a_j \le 1,   \forall A_j \in A                                   (7.18)

PASAQ-MILP-C is an MILP, so it can be solved optimally with any MILP solver (e.g., CPLEX). We can prove, similarly as we did for Lemma 5, that the above MILP formulation has the same feasible region as P2. Hence, it leads to a feasible solution of P2.
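The 1/K dependence in Lemma 8 and Theorem 2 is easy to check numerically. The following sketch (an illustration only, not part of PASAQ; it assumes a representative value β_i = 0.76) builds the K-piece linear interpolant of f^{(2)}_i(x_i) = x_i e^{-β_i x_i} on [0, 1] and measures the maximum approximation error as K grows:

```python
import math

beta = 0.76                                 # assumed value of beta_i, for illustration
f = lambda x: x * math.exp(-beta * x)       # f_i^{(2)}(x_i) = x_i e^{-beta_i x_i}

def max_pl_error(K, samples=1000):
    """Max |f(x) - L(x)| over [0,1], where L is the K-piece linear interpolant of f."""
    cuts = [k / K for k in range(K + 1)]
    def L(x):
        k = min(int(x * K), K - 1)          # index of the segment containing x
        t = (x - cuts[k]) * K               # relative position inside the segment
        return (1 - t) * f(cuts[k]) + t * f(cuts[k + 1])
    return max(abs(f(s / samples) - L(s / samples)) for s in range(samples + 1))

errors = {K: max_pl_error(K) for K in (5, 10, 20)}
```

The error shrinks at least linearly in 1/K, comfortably within the C_1/K guarantee of Theorem 2 (for smooth f_i it in fact shrinks faster).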
Furthermore, the error bound of PASAQ relies on the approximation accuracy of the objective function by the piecewise linear function, and on the fact that the subproblem PASAQ-MILP-C can be solved optimally. Neither condition changes in moving from the cases without assignment constraints to the cases with assignment constraints. Hence, the error bound is the same as that shown in Theorem 2.

7.5 Experiments

We separate our experiments into two sets: the first set focuses on cases where there is no constraint on assigning the resources; the second set focuses on cases with assignment constraints. In both sets, we compare the solution quality and runtime of the two new algorithms, GOSAQ and PASAQ, with the previous benchmark algorithm BRQR. The results were obtained using CPLEX to solve the MILP for PASAQ. For both BRQR and GOSAQ, we use the MATLAB toolbox function fmincon to solve nonlinear optimization problems.¹ All experiments were conducted on a standard 2.00GHz machine with 4GB main memory. For each setting of the experiment parameters (i.e., number of targets, amount of resources and number of assignment constraints), we tried 50 different game instances. In each game instance, payoffs R^d_i and R^a_i are chosen uniformly at random from 1 to 10, while P^d_i and P^a_i are chosen uniformly at random from -10 to -1; feasible assignments A_j are generated by randomly setting each element A_ij to 0 or 1. For the parameter λ of the quantal response in Equation (7.1), we used the same value (λ = 0.76) as learned in the experiments of Chapter 4.

¹ We also tried the KNITRO [Byrd et al., 2006] solver. While it gave the same solution quality as fmincon, it was three times slower than fmincon; as a result we report results with fmincon.

7.5.1 No Assignment Constraints

We first present experimental results comparing the solution quality and runtime of the three algorithms (GOSAQ, PASAQ and BRQR) in cases without assignment constraints.
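The random-instance setup above, together with the quantal response of Equation (7.1), can be sketched as follows (the helper names are hypothetical; λ = 0.76):

```python
import math
import random

def random_instance(n_targets, seed=0):
    """Random payoffs as in the experimental setup: rewards in [1,10], penalties in [-10,-1]."""
    rng = random.Random(seed)
    Rd = [rng.uniform(1, 10) for _ in range(n_targets)]
    Ra = [rng.uniform(1, 10) for _ in range(n_targets)]
    Pd = [rng.uniform(-10, -1) for _ in range(n_targets)]
    Pa = [rng.uniform(-10, -1) for _ in range(n_targets)]
    return Rd, Pd, Ra, Pa

def quantal_response(x, Ra, Pa, lam=0.76):
    """q_i(x): attack probability on each target under the QR model."""
    # adversary's expected utility on target i given marginal coverage x_i
    Ua = [xi * pa + (1 - xi) * ra for xi, ra, pa in zip(x, Ra, Pa)]
    w = [math.exp(lam * u) for u in Ua]
    total = sum(w)
    return [wi / total for wi in w]

Rd, Pd, Ra, Pa = random_instance(4)
q = quantal_response([0.25] * 4, Ra, Pa)
```

Raising the coverage on a target lowers the adversary's utility there and hence its attack probability, which is the monotonicity the algorithms exploit.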
Solution Quality: For each game instance, GOSAQ provides the ε-optimal defender expected utility, BRQR presents the best solution among all the local optima it finds, and PASAQ leads to an approximate global optimal solution. We measure the solution quality of the different algorithms using the average defender expected utility over all 50 game instances. Figures 7.2(a), 7.2(c) and 7.2(e) show the solution quality of the different algorithms under different conditions. In all three figures, the average defender expected utility is displayed on the y-axis. On the x-axis, Figure 7.2(a) varies the number of targets (|T|), keeping the ratio of resources (M) to targets and ε fixed as shown in the caption; Figure 7.2(c) varies the ratio of resources to targets, fixing |T| and ε as shown; and Figure 7.2(e) varies the value of the binary search threshold ε. Given a setting of the parameters (|T|, M and ε), the solution qualities of the different algorithms are displayed in a group of bars. For example, in Figure 7.2(a), |T| is set to 50 for the leftmost group of bars, M is 5 and ε = 0.01. From left to right, the bars show the solution quality of BRQR (with 20 and 100 iterations), PASAQ (with 5, 10 and 20 pieces) and GOSAQ.

Key observations from Figures 7.2(a), 7.2(c) and 7.2(e) include: (i) The solution quality of BRQR drops quickly as the number of targets increases; increasing the number of iterations in BRQR improves the solution quality, but the improvement is very small. (ii) The solution quality of PASAQ improves as the number of pieces increases, and it converges to the GOSAQ solution as the number of pieces becomes larger than 10. (iii) As the number of resources increases, the defender expected utility also increases; the resource count does not change the relative ranking of the algorithms. (iv) As ε becomes smaller, the solution quality of both GOSAQ and PASAQ improves. However, once ε becomes sufficiently small (ε \le 0.1), no substantial improvement is achieved by further decreasing its value; in other words, the solution quality of both GOSAQ and PASAQ converges. In general, BRQR has the worst solution quality and GOSAQ the best; PASAQ achieves almost the same solution quality as GOSAQ when it uses more than 10 pieces.

Runtime: We present the runtime results in Figures 7.2(b), 7.2(d) and 7.2(f). In all three figures, the y-axis displays the runtime, while the x-axis displays the variable we vary to measure its impact on the runtime of the algorithms. For BRQR, the runtime is the sum of the runtimes across all its iterations. Figure 7.2(b) shows the change in runtime as the number of targets increases; the number of resources and the value of ε are shown in the caption. BRQR with 100 iterations is seen to run significantly slower than GOSAQ and PASAQ. Figure 7.2(d) shows the impact of the ratio of resources to targets on the runtime; it indicates that the runtime of the three algorithms is independent of the number of resources. Figure 7.2(f) shows how the runtime of GOSAQ and PASAQ is affected by the value of ε. On the x-axis, the value of ε decreases from left to right; the runtime increases linearly as ε decreases exponentially. In both Figures 7.2(d) and 7.2(f), the number of targets and resources is displayed in the caption.

Overall, the results suggest that GOSAQ is the algorithm of choice when the domain has no assignment constraints. Clearly, BRQR has the worst solution quality, and it is the slowest of the set of algorithms. PASAQ has a solution quality that approaches that of GOSAQ when the number of pieces is sufficiently large (\ge 10), and GOSAQ and PASAQ also achieve comparable runtime efficiency. Thus, in cases with no assignment constraints, PASAQ offers no advantages over GOSAQ.
7.5.2 With Assignment Constraints

In the second set, we introduce assignment constraints into the problem; the feasible assignments are randomly generated. We present experimental results on both solution quality and runtime.

Solution Quality: Figures 7.3(a) and 7.3(b) display the solution quality of the three algorithms with varying number of targets (|T|) and varying number of feasible assignments (|A|). In both figures, the average defender expected utility is displayed on the y-axis. In Figure 7.3(a) the number of targets is displayed on the x-axis, and the ratio of |A| to |T| is set to 60. BRQR is seen to have very poor performance; furthermore, there is very little gain in solution quality from increasing its number of iterations. While GOSAQ provides the best solution quality, PASAQ achieves almost identical solution quality when the number of pieces is sufficiently large (> 10). Figure 7.3(b) shows how solution quality is impacted by the number of feasible assignments, which is displayed on the x-axis; specifically, the x-axis shows |A| set to 20 times, 60 times and 100 times the number of targets, with the number of targets set to 60. Once again, BRQR has significantly lower solution quality, which drops further as the number of assignments increases; and PASAQ again achieves almost the same solution quality as GOSAQ once the number of pieces is larger than 10.

Runtime: We present the runtime results in Figures 7.3(c), 7.3(e), 7.3(d) and 7.3(f). In all experiments, we set 80 minutes as the cut-off. Figure 7.3(c) displays the runtime on the y-axis and the number of targets on the x-axis. It is clear that GOSAQ runs significantly slower than both PASAQ and BRQR, and slows down exponentially as the number of targets increases. Figure 7.3(e) shows extended runtime results for BRQR and PASAQ as the number of targets increases; PASAQ runs in less than 4 minutes with 200 targets and 12000 feasible assignments.
BRQR runs significantly slower with a higher number of iterations.

Overall, the results suggest that PASAQ is the algorithm of choice when the domain has assignment constraints. Clearly, BRQR has significantly lower solution quality than PASAQ. PASAQ not only has a solution quality that approaches that of GOSAQ when the number of pieces is sufficiently large (\ge 10); PASAQ is also significantly faster than GOSAQ (which suffers an exponential slowdown as the domain scales up).

Figure 7.2: Solution quality and runtime comparison, without assignment constraints (better in color). Panels: (a) solution quality vs. |T| (M = 0.1|T|, ε = 0.01); (b) runtime vs. |T| (M = 0.1|T|, ε = 0.01); (c) solution quality vs. M (|T| = 400, ε = 0.01); (d) runtime vs. M (|T| = 400, ε = 0.01); (e) solution quality vs. ε (|T| = 400, M = 80); (f) runtime vs. ε (|T| = 400, M = 80).

Figure 7.3: Solution quality and runtime comparison, with assignment constraints (better in color). Panels: (a) solution quality vs. |T| (|A| = 60|T|); (b) solution quality vs. |A| (|T| = 60); (c) runtime vs. |T| (|A| = 60|T|); (d) runtime vs. |A| (|T| = 60); (e) extended runtime vs. |T| (|A| = 60|T|); (f) extended runtime vs. |A| (|T| = 60).

Chapter 8: Scaling-up

This chapter focuses on scaling up SSG algorithms integrated with any of a family of discrete choice models [Train, 2003; Goeree et al., 2005], an important class of bounded rationality models of adversary decision making, of which the quantal response model is an important representative. Unfortunately, PASAQ fails to scale up when faced with massive-scale SSGs, since it requires explicit enumeration of defender strategies, which is not feasible in domains such as the FAMS or the US Coast Guard in bigger ports. Previous work has provided branch-and-price (BnP) [Barnhart et al., 1994] as a key technique to avoid explicit enumeration of defender strategies in SSGs [Jain et al., 2010a]; however, how well BnP would handle bounded rationality models is unknown.
In this chapter, I present a novel algorithm called BLADE to scale up SSGs with complex adversary models. In Section 8.1, I extend the PASAQ algorithm to handle any member of the family of discrete choice models. Section 8.2 investigates the effectiveness of BnP in SSG algorithms handling bounded rationality, via a BnP algorithm called COCOMO. As I will show in more detail, the non-convexity of the objective function given bounded rationality adversary models creates enormous hurdles for scale-up. In Section 8.3, I provide a new algorithm called BLADE, which for the first time illustrates an efficient realization of the cutting-plane approach in SSGs [Kelley, 1960; Boyd and Vandenberghe, 2008]. The cutting-plane approach iteratively refines the solution space via cuts. Our key hypothesis is that with these cuts, BLADE can successfully exploit the structure of the solution space of defender strategies (generated due to the bounded rationality adversary models in SSGs), whereas BnP approaches are blind to this structure. BLADE is based on three novel ideas. First, I present a separation oracle that can effectively prune the search space via deep cuts; more importantly, I show that to handle massive-scale SSGs, not only must this separation oracle itself use a secondary oracle, but also that this two-level hierarchy of oracles is efficient. Second, I provide a novel heuristic to further speed up BLADE by exploiting the SSG objective function to improve its cuts. Third, BLADE provides a technique for a quality-efficiency tradeoff. Finally, in Section 8.4, I present experimental results comparing COCOMO and BLADE, and show that BLADE is significantly more efficient than COCOMO.

8.1 Generalized PASAQ

Before discussing scale-up, we generalize the PASAQ algorithm, which is specialized to the QR model, to solve SSGs integrated with any of this family of bounded rationality models.
In PASAQ, the objective function of P1 is

F(x) = \sum_i q_i(x) U^d_i(x_i) = \sum_i \frac{e^{\lambda U^a_i(x_i)}}{\sum_j e^{\lambda U^a_j(x_j)}} U^d_i(x_i)

QR is a representative of a more general form of discrete choice model [Train, 2003; Goeree et al., 2005] for the adversary's response, shown in Equation (8.1). In SSGs, typically f_i(x_i) \ge 0, \forall x_i \in [0, 1], is a monotonically decreasing function of x_i, indicating that as the defender's marginal coverage on target i increases, the probability that the adversary chooses this target decreases; e.g., in QR, f_i(x_i) = e^{\lambda U^a_i(x_i)} is an exponentially decreasing function of x_i.

q_i(x) = \frac{f_i(x_i)}{\sum_{j \in T} f_j(x_j)}    (8.1)

Furthermore, PASAQ handles the constraints on the defender's resource allocation by enumerating all the possible assignments. In general, there are spatio-temporal constraints: an air marshal's two flights have to be connected, e.g., an air marshal cannot fly from Los Angeles to New York and then from Chicago to Seattle; moreover, the second flight cannot depart before the first flight arrives. Furthermore, there might be user-specified constraints [An et al., 2010]: FAMS might want to cover 50% of the flights to Chicago, or require that at most 30% of the flights departing from the JFK airport be covered. Thus, the defender's optimization problem can be written as follows:

P1:  { max_x \sum_{i \in T} U^d_i(x_i) q_i(x)  |  x \in X_f = X_{f1} \cap X_{f2} }

X_{f1} := { x | x = \sum_{A_j \in A} a_j A_j,  \sum_j a_j = 1,  a_j \in [0, 1] }    (8.2)
X_{f2} := { x | Bx \le b }                                                          (8.3)

We denote the objective function in P1 as F(x). In F(x), U^d_i(x_i) is the defender's expected utility if the adversary chooses target i, and q_i(x) is the probability that the adversary chooses target i. q_i(x) depends on the model used; e.g., assuming a QR model of the adversary, q_i(x) is a logit function of x. This leads to a nonlinear fractional optimization problem, which in general is NP-hard [Vavasis, 1995]. Furthermore, X_f represents the feasible region of the marginal coverage vector, defined as the intersection of X_{f1}, which encompasses the spatio-temporal constraints, and X_{f2}, which encompasses the user-specified linear constraints on the marginals.

G-PASAQ generalizes PASAQ to solve P1 with the general form of q_i(x) in Equation (8.1). As with PASAQ, G-PASAQ solves this nonlinear fractional optimization problem using binary search. At each step of the binary search it solves a non-convex optimization problem whose objective is a sum of nonlinear functions of the marginal variables x_i. Approximating each of these single-variable nonlinear functions by a piecewise linear function with K segments, the non-convex problem is approximated by the MILP shown in Equations (8.4)-(8.9); this MILP is solved in each iteration of the binary search.

min_{x,z}  \sum_{i \in T} (r - P^d_i)(f_i(0) + \sum_{k=1}^K \gamma_{ik} x_{ik}) - \sum_{i \in T} \alpha_i \sum_{k=1}^K \mu_{ik} x_{ik}    (8.4)
s.t.  0 \le x_{ik} \le 1/K,      \forall i, k = 1..K      (8.5)
      z_{ik}/K \le x_{ik},       \forall i, k = 1..K-1    (8.6)
      x_{i(k+1)} \le z_{ik},     \forall i, k = 1..K-1    (8.7)
      z_{ik} \in {0, 1},         \forall i, k = 1..K-1    (8.8)
      x \in X_f                                           (8.9)

The objective function in Equation (8.4) is a piecewise linear approximation of \sum_{i \in T} (r - P^d_i) f_i(x_i) - \sum_{i \in T} \alpha_i x_i f_i(x_i), where \alpha_i = R^d_i - P^d_i is a constant, \gamma_{ik} is the slope of f_i(x_i) in the k-th segment, and \mu_{ik} is the corresponding slope of x_i f_i(x_i). The range of each x_i is divided into K segments, and x_i is replaced by the variables {x_{ik}, k = 1..K} such that x_i = \sum_{k=1}^K x_{ik}. The {z_{ik}, k = 1..K-1} in Equations (8.6)-(8.8) are integer variables that decide the particular segment that x_i lies in. For example, assuming K = 5, there are 5 possible sets of values for {z_{ik}} that satisfy the constraints in Equations (8.6)-(8.8); if we set {z_{i,1..3} = 1, z_{i,4} = 0}, then x_i is in the fourth segment, i.e., x_i \in [0.6, 0.8]. Equation (8.9) defines the feasible region for x. More details are in Chapter 7.
8.2 COCOMO – A Branch-and-Price Algorithm

Figure 8.1: Branching tree. Level i corresponds to marginal x_i; starting from the root, where each x_i \in [0, 1], the branches at level 1 fix x_1 \in [0, 0.5] or x_1 \in [0.5, 1], and similarly for x_2 at level 2.

G-PASAQ assumes that the set of defender pure strategies (A) can be explicitly enumerated. In massive SSGs, A cannot be enumerated; in such cases, COCOMO (COlumn generation for COmplex adversary MOdels) attempts to use the branch-and-price approach to scale up G-PASAQ. COCOMO exploits the fact that the integer variables in G-PASAQ represent the particular piecewise linear segment each marginal x_i belongs to, and defines the branching tree shown in Figure 8.1. Initially, at the root node, all the integer variables are relaxed to be continuous, indicating that none of the x_i are set to any fixed ranges. The i-th level in the tree is associated with marginal x_i. If each marginal is divided into K segments, each node has K children. For example, in Figure 8.1 (where K = 2), the two nodes at level 1 are associated with the two possible ranges of marginal x_1: the left node sets x_1 \in [0, 0.5], realized by setting z_11 = 0; the right node sets x_1 \in [0.5, 1], realized by setting z_11 = 1. As we move deeper, more integer variable values are set. The tree has a depth of |T| and K^{|T|} nodes in total.

COCOMO starts from the root node in the branching queue and iterates until the queue is empty. In each iteration, the top node in the branching queue is first branched into a set of children. For each child node, an upper bound (UB') and a lower bound (LB') are estimated; if the two bounds are not close enough, the node is added to the branching queue. COCOMO keeps a record of the best lower-bound solution (L̃B) found so far, and uses it to prune all the unvisited nodes in the branching queue. In the end, the defender strategy associated with this best lower bound is returned as the solution.
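This branch-bound-prune control flow is the standard branch-and-bound pattern. A minimal stand-alone illustration, on a toy 0/1 knapsack with a fractional relaxation supplying the node upper bounds (not COCOMO itself, whose bounds come from G-PASAQ), looks like:

```python
from collections import deque

values, weights, capacity = [6, 10, 12], [1, 2, 3], 5   # toy instance

def upper_bound(fixed):
    """Relaxation bound: fixed choices plus remaining items taken fractionally.
    At a leaf (all items fixed) this is the exact value."""
    val = sum(values[i] for i, take in enumerate(fixed) if take)
    cap = capacity - sum(weights[i] for i, take in enumerate(fixed) if take)
    if cap < 0:
        return float('-inf')                            # infeasible node
    rest = sorted(((values[i] / weights[i], weights[i], values[i])
                   for i in range(len(fixed), len(values))), reverse=True)
    for _, w, v in rest:
        take = min(1.0, cap / w)
        val += take * v
        cap -= take * w
    return val

best_LB, best_sol = float('-inf'), None
queue = deque([()])                                     # root: nothing fixed yet
while queue:
    node = queue.popleft()
    if len(node) == len(values):                        # leaf: a full assignment
        lb = upper_bound(node)
        if lb > best_LB:
            best_LB, best_sol = lb, node                # best lower bound so far
        continue
    for choice in (0, 1):                               # branch on the next item
        child = node + (choice,)
        if upper_bound(child) > best_LB:                # prune against the best LB
            queue.append(child)
```

COCOMO follows the same skeleton, but each node's bounds require a full column-generation run, which is the source of the inefficiency discussed below.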
Upper Bound Estimation: To generate tighter upper bounds, we run G-PASAQ at each node of COCOMO, where the values of some variables z_{ik} are set to either 0 or 1 (see Figure 8.1). We obtain the upper bound by relaxing the rest of the integer variables to be continuous, resulting in an LP called UpperBound-LP. UpperBound-LP cannot escape the large number of variables a_j and columns A_j; hence we apply the standard column generation technique: we start by solving UpperBound-LP with a subset of columns, i.e., defender strategies A_j, and iteratively add more columns with negative reduced cost. Let's first rewrite Equation (8.9) based on the definition of X_f from Equations (8.2) and (8.3):

\sum_{k=1}^K x_{ik} - \sum_{A_j \in A} a_j A_{ij} = 0,   \forall i \in T    (8.10)
\sum_{A_j \in A} a_j = 1,   a_j \ge 0,  A_j \in A                           (8.11)
\sum_{i \in T} B_{mi} \sum_{k=1}^K x_{ik} \le b_m,   \forall m              (8.12)

The reduced cost of column A_j is \omega^T A_j - \delta, where \omega and \delta are the duals of Equations (8.10) and (8.11) respectively. Given the optimal duals of the current iteration of UpperBound-LP, a separate Slave process provides a new column with the minimum reduced cost; the process iterates until convergence.

Slave: Given the spatio-temporal constraints, the Slave can often be formulated as a minimum-cost integer flow problem on a polynomial-sized network; [Jain et al., 2010a] provide such a Slave formulation for the FAMS domain. There, a target is a flight and is represented by a vertex in the network; if two flights can be covered by the same air marshal on the same schedule, there is an edge between the two corresponding vertices. A feasible flow in the network represents a feasible pure strategy of the defender. Similarly here, the dual value \omega_i is assigned as the cost of the vertex representing target i in the network.
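A toy Slave step might be sketched as follows (a hypothetical hand-enumerated column set stands in for the min-cost network flow; the reduced cost is written here as ω·A_j − δ):

```python
# toy set of feasible pure strategies A_j (0/1 columns over 4 targets);
# in the real Slave these are implicit in a min-cost flow over the flight network
columns = [
    (1, 1, 0, 0),
    (0, 1, 1, 0),
    (0, 0, 1, 1),
    (1, 0, 0, 1),
]

def slave(omega, delta):
    """Return the column minimizing the reduced cost omega . A_j - delta,
    or None if no column has negative reduced cost (column generation converged)."""
    cost, best = min((sum(w * a for w, a in zip(omega, col)) - delta, col)
                     for col in columns)
    return best if cost < 0 else None

col = slave([-2.0, 0.5, 1.0, -0.5], 0.0)
```

Returning `None` is the termination signal of column generation: no remaining column can improve the current UpperBound-LP solution.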
More generally, if the set A of pure strategy vectors A_j that satisfy the spatio-temporal constraints can be formulated as the feasible set of a polynomial number of integer linear constraints, A_j \in A = { s \in {0,1}^{|T|} : Cs \le c }, the Slave amounts to solving the following integer linear program:

{ min_s  s^T y - s^T B^T g + u  |  s_i \in {0,1}, \forall i \in T;  Cs \le c }

Lower Bound Estimation: A subset of the columns will have been generated while solving UpperBound-LP. The lower bound of the same node is computed by running G-PASAQ with this subset of columns.

Pruning: COCOMO keeps a record of the best lower-bound solution, L̃B, found so far. After branching the current node, pruning is applied to all the unvisited nodes in the branching queue: nodes whose upper bound is lower than L̃B are pruned from the branching queue.

The exponential size of COCOMO's branching tree (K^{|T|}) and the need to run G-PASAQ at each branching node ultimately lead to its inefficiency.

8.3 BLADE – A Cutting-Plane Algorithm

Despite our efforts for efficiency in COCOMO, the need to run column generation at each of the K^{|T|} nodes ultimately makes it impractical. BLADE (Boosted piecewise Linear Approximation of DEfender strategy with arbitrary constraints) uses the cutting-plane approach to scale up G-PASAQ, and avoids running column generation at each node. Algorithm 6 presents BLADE. The Master is a modified version of P1 with a relaxed defender strategy space, defined by the set of boundaries H̃x \le h̃. In Line 2, (H̃, h̃) is initialized with the user-specified constraints (B, b). The solution found by the Master, x̃, provides an upper bound (UB) on the solution of P1. In each iteration, the Separation Oracle checks whether or not x̃ \in X_f. If so, the optimal solution of P1 has been found; otherwise, a new cutting plane H_l x \le h_l is returned to further restrict the search space in the Master.
The Separation Oracle also returns a feasible solution x_f 'closest' to the infeasible solution x̃, which provides a lower bound (LB = F(x_f)) on the solution of P1. In Line 9, we improve the lower-bound estimate to further speed up the algorithm. The algorithm terminates when UB and LB are close enough, i.e., UB - LB \le ε.

Algorithm 6: BLADE
1  Input: {R^d_i, P^d_i, R^a_i, P^a_i}, (B, b), ε;
2  (H̃, h̃) ← (B, b); feasible ← false;
3  UB ← M; LB ← -M;
4  while UB - LB > ε do
5    (UB, x̃) ← Master(H̃, h̃);
6    (feasible, H_l, h_l, x_f) ← SeparationOracle(x̃);
7    H̃ ← H̃ ∪ H_l; h̃ ← h̃ ∪ h_l;
8    if feasible ≠ true then
9      (LB, x_l) ← LowerBoundEstimator(H̃, h̃);
10   end
11 end
12 return x_l;

8.3.1 Master

We first reformulate P1 by representing its feasible region using the set of boundaries instead of the extreme points:

P1.1:  { max_x F(x)  |  Hx \le h,  Bx \le b,  0 \le x_i \le 1, \forall i \in T }

H is an N-by-|T| matrix, where N is the number of linear boundaries of the convex hull; each row, H_l x \le h_l, represents a linear boundary of X_{f1}. In the presence of user-specified constraints, Bx \le b is added to the boundary set of X_f, as defined in Equation (8.3). However, we cannot directly solve P1.1, because H and h are not given up front. In BLADE, the Master solves P1.1 using G-PASAQ with a subset of the boundaries of X_f. More specifically, Equation (8.9) is rewritten as Equations (8.12) and (8.13):

\sum_{i \in T} H̃_{li} \sum_{k=1}^K x_{ik} \le h̃_l,   \forall l    (8.13)

(H̃, h̃) in Equation (8.13) represents the subset of the boundaries of X_f. The solution of the Master, denoted x̃, then provides an upper bound on the solution of P1.1: F(x̃) \ge F(x*), where x* denotes the optimal solution of P1.1. As the algorithm keeps refining the feasible region by adding new boundaries to the Master, this upper bound gets tighter. Given x̃ as the relaxed solution from the Master, we check whether it belongs to X_f. If so, we have found the optimal solution of P1.1. Otherwise, we further restrict the feasible region in the Master via a cut that separates the current infeasible solution from the original feasible region.
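The Master-Separation Oracle loop can be illustrated end to end on a toy problem, where everything is a stand-in: a concave function for F, the box [0,1]^2 for the initial relaxed region, x_1 + x_2 \le 1 for the hidden feasible set X_f, and grid search in place of G-PASAQ:

```python
# concave toy objective standing in for F(x)
F = lambda x1, x2: 4 * x1 + 3 * x2 - x1 ** 2 - x2 ** 2

cuts = []                        # each cut: (h1, h2, rhs) meaning h1*x1 + h2*x2 <= rhs

def master():
    """Maximize F over [0,1]^2 restricted by the current cuts (grid search)."""
    pts = [(i / 100, j / 100) for i in range(101) for j in range(101)]
    pts = [p for p in pts
           if all(h1 * p[0] + h2 * p[1] <= rhs + 1e-9 for h1, h2, rhs in cuts)]
    return max(pts, key=lambda p: F(*p))

def separation_oracle(x):
    """Hidden feasible set is x1 + x2 <= 1; return (feasible, cut, closest feasible point)."""
    if x[0] + x[1] <= 1 + 1e-9:
        return True, None, x
    excess = (x[0] + x[1] - 1) / 2              # project onto the boundary x1 + x2 = 1
    return False, (1.0, 1.0, 1.0), (x[0] - excess, x[1] - excess)

LB = float('-inf')
while True:
    x = master()                                # relaxed solution: an upper bound
    feasible, cut, x_f = separation_oracle(x)
    LB = max(LB, F(*x_f))                       # feasible point supplies a lower bound
    if feasible:
        break
    cuts.append(cut)                            # deep cut touching the feasible region
```

In BLADE the oracle itself is an LP over the exponentially many pure strategies, which is why it in turn needs the secondary oracle described next.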
Otherwise, we further restrict the feasible region in the Master via a cut to separate the current infeasible solution and the original feasible region. 8.3.2 Separation Oracle One standard approach for checking feasibility and generating cutting planes is to apply Farkas’ Lemma, as in [Papadimitriou and Roughgarden, 2008]. However, the resulting cutting planes are not guaranteed to be deep cuts that touch the feasible region and therefore eliminate as much of the infeasible region as possible. Instead, we use a norm-minimization approach for the Separa- tion Oracle in BLADE, which efficiently checks the feasibility of ~ x, and generates a deep cut to separateX f from an infeasible ~ x. Additionally, our approach finds a feasible point that is closest to ~ x, allowing us to compute a lower bound on the optimal objective. Check Feasibility: The Separation Oracle checks the feasibility of ~ x by minimizing its dis- tance to the feasible region. If the minimum distance is 0, ~ x is within the feasible region. We choose 1-norm to measure the distance between ~ x and any feasible point, as 1-norm leads to a 142 Linear Program (LP), which allows the use of column generation to deal with large defender strategy space. We first show theMin-1-Norm LP in Equation (8.14)-(8.18), min a;z X i2T z i (8.14) s.t. z +Aa ~ x (8.15) zAa~ x (8.16) BAab (8.17) X A j 2A a j = 1; a j 0;8A j 2A (8.18) In the above LP, a marginal coverage is represented by the set of defender pure strategies: Aa. Constraint (8.17) and (8.18) enforces thatAa satisfies both the spatio-temporal constraints and the user-specified constraints. The 1-norm distance between the given marginal ~ x and Aa is represented by vector z. This is obtained by combining Constraints (8.15) and (8.16): z jAa ~ xj z. The objective function minimizes the 1-norm of z, therefore the 1-norm distance between ~ x and any given feasible marginal is minimized. Lemma 10. 
Given a marginal ~ x, let (z ; a ) be the optimal solution of the correspondingMin-1-Norm LP . ~ x is feasible if and only if P i2T z i = 0. Furthermore,Aa provides the feasible marginal with the minimum 1-norm distance to ~ x. Generate Cut: If ~ x is infeasible, we need to further restrict the relaxed region in the Master. Theoretically, any hyperplane that separates ~ x from the feasible region could be used. In practice, a deep cut is preferred. Letw,v,g andu be the dual variables of Constraints (8.15), (8.16), (8.17) and (8.18) respectively; and lety = wv. 143 Lemma 11. Given an infeasible marginal ~ x, let (y ; g ;u ) be the dual values at the optimal solution of the corresponding Min-1-Norm LP . The hyperplane (y ) T x (g ) T b +u = 0 separates ~ x andX f : (y ) T ~ x (g ) T b +u > 0 (8.19) (y ) T x (g ) T b +u 0; 8x2X f (8.20) Proof. The dual of theMin-1-Norm LP is: max y;u;g ~ x T yb T g +u (8.21) s.t. A T yA T B T g +u 0 (8.22) 1 y1; g 0 (8.23) Equation (8.19) can be proved using LP duality. Since ~ x is infeasible, the minimum of the correspondingMin-1-Norm LP is strictly positive. Therefore, the maximum of the dual LP is also strictly positive. We now prove the contrapositive of Equation (8.20): (y ) T x (g ) T b +u > 0) x is not feasible Given any x 0 , there is a corresponding LP with the same formulation as that in Equation (8.21)- (8.23). Let (y 0 ;u 0 ;g 0 ) be the optimal solution of this LP. Note that, (y ;u ;g ) is a feasible solution of this LP. Therefore, (y 0 ) T x (g 0 ) T b +u 0 (y ) T x (g ) T b +u > 0 This indicates that the minimum 1-norm distance betweenx andX f is strictly positive. Hence,x is infeasible. 144 Lemma 12. Equation (8.20) is a deep cut that touches the feasible convex hullX f . Proof. For simplicity, we consider the cases without user-specified constraints. The cut in Equa- tion (8.20) then becomes (y ) T x +u 0. 
Let $a_j$ be the dual of the $j$th constraint in Equation (8.22), and let $a^* = \langle a_j^* \rangle$ be the dual values at the optimal solution of the LP in Equations (8.21)-(8.23). According to LP duality, $Aa^*$ is the optimal solution of the Min-1-Norm LP. Therefore, $Aa^*$ is the feasible marginal with the minimum 1-norm distance to $\tilde{x}$. Furthermore, for every $a_j^* > 0$, the corresponding constraint in Equation (8.22) is active, i.e., $(y^*)^T A_j + u^* = 0$. Hence, the extreme point $A_j$ lies on the cutting plane.

Therefore, by solving either the Min-1-Norm LP or its dual, the Separation Oracle can not only check the feasibility of a given marginal, but also generate a deep cut. We choose to solve the dual LP in Equations (8.21)-(8.23), since it gives the cut directly, as shown in Equation (8.20). However, since in our case the set of defender pure strategies is too large to be enumerated, the constraints of the LP cannot be enumerated either. We solve the LP using a constraint generation approach, outlined in Algorithm 6: we first solve the LP in Equations (8.21)-(8.23) with a subset of constraints, and use a Secondary Oracle to check whether the relaxed solution violates any of the remaining constraints.

Algorithm 6: Separation Oracle
1: Input: $\{R_i^d, P_i^d, R_i^a, P_i^a\}$, $(B, b)$, $\tilde{x}$, $\mathcal{A}^{(0)}$
2: $\mathcal{A} \leftarrow \mathcal{A}^{(0)}$, $\mathcal{A}_l \leftarrow \mathcal{A}_1$
3: while $\mathcal{A}_l \neq \emptyset$ do
4:   $\mathcal{A} \leftarrow \mathcal{A} \cup \mathcal{A}_l$
5:   $(y^*, u^*, g^*) \leftarrow$ Solve-Separation-Oracle-LP($\mathcal{A}$)
6:   $\mathcal{A}_l \leftarrow$ SecondaryOracle($y^*, u^*, g^*$)
7: end while
8: return $(y^*, u^*, g^*)$

Secondary Oracle: The Secondary Oracle is executed at Line 6 of Algorithm 6. If any constraint in Equation (8.22) is violated, the oracle returns the most violated one, i.e., the $A_l$ for which the left-hand side of Equation (8.22) is largest; otherwise, we have found the optimal solution of the LP. The Secondary Oracle is similar to the Slave in COCOMO.

8.3.3 WBLADE

The convergence of BLADE depends on how fast the cuts generated by the Separation Oracle approximate the feasible set around the optimal solution of P1.1.
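The constraint generation scheme of Algorithm 6 can be sketched generically: solve the relaxed LP over the current constraint subset, ask a secondary oracle for a violated constraint, and repeat until none remains. The sketch below is a minimal illustration with hypothetical callables, not the dissertation's CPLEX-based implementation:

```python
def constraint_generation(solve_relaxed, most_violated, initial_constraints):
    """Generic constraint-generation loop (sketch).

    solve_relaxed(active)   -> optimal value of the LP relaxed to `active`
    most_violated(solution) -> index of a violated constraint, or None
    """
    active = list(initial_constraints)
    while True:
        solution = solve_relaxed(active)
        violated = most_violated(solution)
        if violated is None:        # relaxed optimum satisfies all constraints
            return solution
        active.append(violated)     # add the violated constraint and re-solve

# Toy instance: maximize t subject to t <= v[j] for all j, so t* = min(v).
v = [5, 3, 8, 1, 7]
def solve_relaxed(active):
    return min(v[j] for j in active)          # relaxed optimum over a subset
def most_violated(t):
    j = min(range(len(v)), key=lambda k: v[k])
    return j if v[j] < t else None            # None means all constraints hold
```

Starting from the single constraint `j = 0`, the loop adds the binding constraint and terminates at the true optimum, mirroring how the Secondary Oracle feeds constraints back into the relaxed dual LP.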
We propose WBLADE, which modifies the Separation Oracle by replacing the norm used to measure the distance to $X_f$ with one that takes the objective function into account, biasing the generated cut toward the optimal solution. Formally, given the solution $\tilde{x}$ from the Master, instead of searching for the feasible point at minimum 1-norm distance, which weights all dimensions uniformly, we modify the objective function of the Min-1-Norm LP in Equation (8.14) as:

$$\sum_{i\in T} (\nabla_i F(\tilde{x}) + \epsilon)\, z_i \qquad (8.24)$$

where $\nabla_i F(\tilde{x})$ is the gradient of the objective function $F(x)$ at $\tilde{x}$ with respect to $x_i$, and $\epsilon$ is a pre-defined constant ensuring that $\nabla_i F(\tilde{x}) + \epsilon > 0$ for all $i$, so that the objective remains a norm. We refer to this modified LP as the Min-Weighted-Norm LP.

Lemma 13. A marginal $\tilde{x}$ is feasible if and only if the minimum of the corresponding Min-Weighted-Norm LP is 0.

Proof. We already showed that $z_i \geq 0$ represents the absolute difference between $\tilde{x}$ and the feasible point $Aa$ in the $i$th dimension. Combining this with $\nabla_i F(\tilde{x}) + \epsilon > 0$ for all $i$, we have $\sum_i (\nabla_i F(\tilde{x}) + \epsilon) z_i \geq 0$. According to Lemma 10, $\tilde{x}$ is feasible if and only if the minimum of $\sum_i z_i$ is 0. Hence, if there exists $(z^*, a^*)$ such that $\sum_i z_i^* = 0$, then $\sum_i (\nabla_i F(\tilde{x}) + \epsilon) z_i^* = 0$, and vice versa.

[Figure 8.2: Minimizing the weighted 1-norm distance. (a) Upper bounds over time for WBLADE vs. BLADE; (b) illustration of the weighted 1-norm projection.]

To provide some intuition into why tighter bounds can be obtained by solving the Min-Weighted-Norm LP, we consider the case where $\nabla_i F(\tilde{x}) > 0$ and $\tilde{x} \geq Aa$. First, we note that these are typical situations in security games, where having more defense resources tends to benefit the defender. This is the case even if the attacker is boundedly rational, as in the quantal response model. Therefore, for most values of $\tilde{x}$ the gradient $\nabla_i F(\tilde{x})$ will be positive.
As a result, a solution $\tilde{x}$ of the relaxed problem solved by the Master will tend to use more resources than what is feasible, i.e., $\tilde{x} \geq Aa$. These properties are confirmed in our numerical experiments. Then, if $\nabla_i F(\tilde{x}) > 0$ and $\tilde{x} \geq Aa$, the Min-Weighted-Norm LP is equivalent to minimizing $\nabla F(\tilde{x}) \cdot (\tilde{x} - Aa)$, and hence also to maximizing

$$F(\tilde{x}) + \nabla F(\tilde{x}) \cdot (Aa - \tilde{x}) \qquad (8.25)$$

Equation (8.25) is the first-order Taylor approximation of $F(x)$; maximizing it should provide a good lower bound if $\tilde{x}$ is close to the feasible region.

Regarding the cuts generated, since $\nabla F(\tilde{x}) > 0$, we can take $\epsilon = 0$. In this case the Min-Weighted-Norm LP looks for the projection point in $X_f$ on the highest level set perpendicular to $\nabla F(\tilde{x})$, and the cut generated will be this highest level set. Since the gradient points in the direction of maximum increase of its function, for sufficiently smooth functions and a sufficiently close projection point, points with higher function values than the projection point are eliminated by this cut. Figure 8.2(b) demonstrates an example: the shaded polyhedron represents the feasible region $X_f$. Given the infeasible point $x'$ and its gradient $\nabla F(x')$, the solid line perpendicular to $\nabla F(x')$ approximates the level set of $F(x)$. Therefore, the half-space on the $\nabla F(x')$ side of that line tends to contain points with higher values of $F(x)$ than the other half-space. This suggests that such a cut prunes more infeasible points with high objective values, leading to a tighter upper bound. Confirming this intuition, Figure 8.2(a) shows that over 30 random samples the upper bound decreases faster in WBLADE (x-axis: runtime; y-axis: upper bound).

8.3.4 Quality and Runtime Trade-off

The feasible point returned by the Separation Oracle provides a lower bound on P1.1. A better lower bound can be achieved by solving a restricted version of P1.1 (Line 9 in Algorithm 6).
This can be done by replacing $X_{f1}$ with the convex hull formed by the subset of defender pure strategies generated in solving the Separation Oracle. These upper and lower bounds allow us to trade off solution quality against runtime by controlling the threshold $\epsilon$: as soon as $UB - LB \leq \epsilon$, the algorithm returns the feasible solution associated with $LB$, which is guaranteed to be within $\epsilon$ of the optimal objective.

8.4 Experimental Results

In this section, we compare COCOMO and BLADE assuming two different bounded rationality models. We take FAMS as our example domain. For each setup, we tried 30 game instances. In each game, the payoffs $R_i^d$ and $R_i^a$ are random integers from 1 to 10, while $P_i^d$ and $P_i^a$ are random integers from -10 to -1; the feasible schedules for each unit of resources are generated by randomly selecting 2 targets per schedule (we assume that each air marshal can cover 2 flights on a single trip, similar to [Jain et al., 2010a]). In all experiments, the deployment-to-saturation (d:s) ratio is set to 0.5, which has been shown to be computationally harder than any other d:s ratio [Jain et al., 2012]. Furthermore, we set the number of piecewise linear segments to 15 for each $f_i(x_i)$, given that 10 segments provide a sufficiently good approximation [Yang et al., 2012b]. The results were obtained using CPLEX v12.2 on a standard 2.8GHz machine with 4GB main memory.

The Best BLADE: Given the two versions of BLADE, one with the non-weighted Separation Oracle and the other with the weighted Separation Oracle, we investigate whether it would be more effective to combine them so that two cuts are generated in each iteration. While this combined version, CBLADE, could reduce the total number of iterations by generating more cuts per iteration, the runtime of each iteration might be longer. Our first set of experiments investigates the efficiency of the three BLADE variants.
Figure 8.3 shows the average runtime of the three algorithms for different numbers of targets: WBLADE achieves the shortest runtime. Furthermore, although CBLADE takes fewer iterations to converge on average, it generates more cuts than both BLADE and WBLADE. For example, with 60 targets, CBLADE takes 17 iterations on average to converge, while WBLADE and BLADE take 23 and 29 iterations respectively; however, the total number of cuts generated by CBLADE is 34 (2 cuts per iteration), more than either WBLADE or BLADE. Given this result, we use WBLADE as the representative of the BLADE family in the rest of the experiments.

[Figure 8.3: Runtime comparison of the BLADE family (runtime in minutes vs. number of targets).]

Quantal Response Model: Figures 8.4(a), 8.4(b) and 8.4(c) present the average runtime of COCOMO and WBLADE assuming a QR model of the adversary with its parameter set to 0.76. In these experiments, we set 50 minutes as the runtime limit. A dashed line in the figures indicates that at that point at least some game instances were not completed within this time limit; the absence of any markers afterward implies the trend continues. COCOMO cannot scale to 80 targets, as shown in Figure 8.4(a). In Figure 8.4(b), we vary the number of user-specified constraints (as a percentage of the number of targets) while fixing the number of targets at 60. The constraints are randomly generated inequalities over the marginal coverage vector $\langle x_i \rangle$.
Increasing the number of user-specified constraints does not impact the runtime of WBLADE, but significantly slows down COCOMO. We then vary the threshold from 0.4 to 0.01, as shown in Figure 8.4(c). COCOMO is only able to converge when the threshold is larger than 0.05; in comparison, the runtime of WBLADE increases only slowly as the threshold decreases. Thus, WBLADE obtains much better solution quality in a significantly shorter amount of time than COCOMO. We further investigate the trade-off between solution quality and runtime of WBLADE in Figure 8.4(d): we gradually increase the solution quality by decreasing the threshold for different numbers of targets, illustrating the runtime-quality trade-off.

[Figure 8.4: Comparing COCOMO and BLADE under the QR model. (a) Runtime vs. number of targets (20% user constraints, threshold 0.02); (b) runtime vs. amount of user-specified constraints (60 targets, threshold 0.02); (c) runtime vs. threshold (60 targets, 20% user constraints); (d) runtime vs. solution quality (threshold) for 40, 50 and 60 targets.]

A More Complex Bounded Rationality Model: We now set $f_i(x_i) = \frac{1}{1+e^{\lambda_i x_i}}$, a more complex model than QR, and investigate the impact of model complexity on the runtime of COCOMO and WBLADE. Figures 8.5(a) and 8.5(b) display the runtime comparison of COCOMO and WBLADE. As shown in Figure 8.5(a), while COCOMO could not finish running within 50 minutes in any of the settings, the runtime of WBLADE was less than 7 seconds for 20 targets.

[Figure 8.5: Runtime comparison under the QR-Sigmoid model. (a) Runtime vs. number of targets (20% user constraints, threshold 0.02); (b) runtime vs. threshold (60 targets, 20% user constraints).]
In Figure 8.5(b), we show that the runtime of WBLADE gradually increases as the threshold decreases. In comparison, COCOMO is only able to finish running when the threshold is sufficiently large ($\geq 0.4$), leading to poor solution quality. Thus, as the bounded rationality model becomes more complex, BLADE's advantage over COCOMO is further magnified.

Chapter 9: Adaptive Resource Allocation and its Application to Wildlife Protection

In many domains, adversary events occur often and generate significant amounts of collectible event data. These domains present new research challenges and opportunities related to learning behavioral models from collected poaching data. One example of such domains is wildlife protection. Illegal poaching is an international problem that leads to the extinction of species and the destruction of ecosystems. As evidenced by the dangerously dwindling populations of endangered species, existing anti-poaching mechanisms are insufficient. Compared to the counter-terrorism domain, wildlife crime is an important domain that opens up a wide range of new deployments. In this chapter, I introduce the Protection Assistant for Wildlife Security (PAWS) application, a joint deployment effort with researchers at Uganda's Queen Elizabeth National Park (QENP) with the goal of improving wildlife ranger patrols.

9.1 Domain

The goal of PAWS is to help conservation agencies improve patrol efficiency so that poachers, from fear of being caught, are deterred from poaching in QENP. Wire snaring is one of the main techniques used by poachers in Africa, including QENP (see Figures 9.1(a) and 9.1(b)); poachers can set and leave snares unattended, and come back when they think an animal has been captured.
In addition, poachers can conduct surveillance on rangers' activities and patrol patterns; wildlife rangers are well aware that some neighboring villagers will inform poachers of when they leave for patrol and where they are patrolling [Moreto, 2013]. For any number of reasons, such as changes that impact animal migration habits, rangers may change their patrolling patterns; poachers, in turn, continually conduct surveillance on the rangers' changing patrol strategy and adapt their poaching strategies accordingly. As the law enforcement officers of the park, park rangers' primary objective is to stop poaching, and their main method of doing so is to patrol the park. During a patrol, rangers search for signs of illegal activity inside the park, confiscate any poaching equipment found, and apprehend any persons inside the park illegally (e.g., poachers). In addition to their normal patrol duties, rangers collect data on any observed or suspected illegal activity. In most cases, if rangers find wire snares, they will not find the poacher that set them. If the rangers do encounter and apprehend poachers, however, they are sometimes able to make the poachers confess to where they set their snares. After the rangers return to the outpost, the collected data is uploaded and analyzed. Eventually, enough data is collected that the ranger patrol strategies can be continually updated based on any emerging trends. If snares are found by a ranger patrol, they are recorded as data points. Since it is unknown who placed the snares, we refer to these as anonymous data points. Identified data points, obtained when a poacher is captured and divulges where he placed snares, are inherently more useful, as they can be used to build a more complete behavioral model that better predicts where future poachers will place their traps.

[Figure 9.1: (a) A lioness caught in a snare. (b) A caught poacher holding up a snare.]
(Figure 9.1 photo credits: lioness photo courtesy of John Coppinger, Remote Africa Safaris Ltd.; poacher snare photo taken by Andrew Lemieux.)

For this deployment, a poacher placing a snare in an area represents an attack. In order to have a tractable space for computing defender strategies, we discretize the park into a grid in which each cell represents 1 square kilometer; every cell in the grid could contain wildlife and is thus a valid target for attackers. Terrain also has an impact: poachers and rangers can travel further if they are traversing grasslands rather than a dense forest of varying elevations. To simplify distance calculations in our model, we currently focus on one type of terrain, grasslands; future work will focus on incorporating different types of terrain into the model. Areas of high animal density, such as areas that contain fresh water (e.g., watering holes, lakes), are known to be high-risk areas for poaching [Wato et al., 2006; Moreto, 2013; Montesh, 2013]. Distance is also an important factor: a snare density study demonstrated that density began to decrease significantly beyond 4 kilometers from the international border [Wato et al., 2006]. This finding is intuitive, as poachers need to carry back any poached animals or goods, and longer distances increase the chances of spoilage and apprehension. Even for Ugandan poachers, distance travelled is still a factor based on similar concerns. Despite the information available from these studies, there are still too many areas for rangers to patrol, and it is a huge cognitive burden to account for these factors (in addition to physical distance constraints and available rangers) while constantly creating new, unpredictable patrols. Based on all of these factors, PAWS aids patrol managers by determining an optimal strategy that enables park rangers to effectively cover these numerous areas with their limited resources.
9.2 Model in PAWS

9.2.1 Stackelberg Game Formulation

Based on our discussion of the wildlife crime domain and its various parameters of interest, we apply a game-theoretic framework, more specifically Stackelberg Security Games (SSGs), to the problem, and first model the interaction between the rangers and the poachers. In an SSG, there are two types of players: the defender (leader) commits to a strategy first; the follower then responds after observing the leader's strategy. The defender's goal is to protect a set of targets, with limited security resources, from being attacked by the adversary. The adversary first conducts surveillance to learn about the defender's strategy, and then he (he by convention) selects a target to attack.

In the wildlife crime problem, the rangers play the role of the leader and the poachers are the followers. While the rangers are trying to protect animals by patrolling the locations where they frequently appear, the poachers are trying to poach the animals in these areas. As discussed earlier, we discretize the area into a grid where each cell represents 1 square kilometer. We use $T$ to denote the set of locations that can be targeted by the poacher, where $i \in T$ represents the $i$th target. If the poacher selects target $i$ and it is covered by the rangers, he receives a utility of $U_{p,i}^c$. If the selected target is not covered by rangers, he receives a utility of $U_{p,i}^u$. The rangers' utility is denoted
Similarly, rangers will view these areas as costly if left unprotected. Distance is also a determining factor in poaching reward. Although a poacher may view an area with a large density of animals as attractive, it may be too far away to be rewarding. We also need to model the penalty for poachers (i.e.,U c p;i ) and reward for rangers (i.e.,U c r;i ). If the poachers attack a defended area, they will incur a fixed penalty that represents a fine. The poachers will also incur an additional penalty that increases with the distance that they travel from their starting point. Rangers will receive a flat (i.e., uniform) reward based on the poacher’s fixed penalty but not on the distance travelled. This uniform reward represents the ranger’s lack of preference on where or how poachers are found; as long as poachers are apprehended, the patrol is considered a success. In our SSG model for this wildlife crime problem, we assume a single leader (i.e., a single group of rangers who are executing the same patrolling strategy) and a population of poachers. We also assume that poachers respond to the rangers’ patrolling strategy independently, and we defer to future work to consider potential collaboration between poachers. We adopt a compact representation of the rangers’ patrolling strategy: x =hx i i wherex i denotes the probability thati will be covered by the rangers. The actual patrol can be derived from this compact representation using sampling techniques similar to those in previous SSG applications [Shieh et al., 2012; Tsai et al., 2009]. 
Given a defender strategy $x$, we denote the response of a poacher as $\langle q_i(\omega|x) \rangle$, where $q_i(\omega|x)$ represents the probability that the poacher will select target $i$. The parameter $\omega$ is associated with the poacher's behavioral model, which we discuss in more detail in Section 9.2.2. Table 9.1 lists the key notation used in this paper.

Table 9.1: Notation used in this paper
  $T$ — set of targets; $i \in T$ denotes target $i$
  $x_i$ — probability that target $i$ is covered by a resource
  $U_{r,i}^c$ — ranger utility for covering $i$ if it is selected by the poacher
  $U_{r,i}^u$ — ranger utility for not covering $i$ if it is selected
  $U_{p,i}^c$ — poacher utility for selecting $i$ if it is covered
  $U_{p,i}^u$ — poacher utility for selecting $i$ if it is not covered
  $\omega$ — parameter of the SUQR model
  $f(\omega)$ — probability density function of $\omega$
  $U_r(x|\omega)$ — ranger expected utility for playing strategy $x$ against a poacher with model parameter $\omega$
  $U_r(x)$ — ranger expected utility for playing strategy $x$ against the whole population of poachers
  $q_i(\omega|G)$ — probability that a poacher with parameter $\omega$ selects target $i$ in game $G$

We model the repeated crime activities of the poachers as follows: in each round of the interaction between the rangers and the poachers, the rangers execute the same mixed strategy over a period of time (e.g., a month); the poachers first conduct surveillance on the rangers' patrolling strategy and then respond. When the rangers switch patrolling strategies, a new round starts. We assume that the poachers are myopic (i.e., they make their decisions based on their knowledge of the rangers' strategy in the current round). In this paper, we also assume the poachers' surveillance grants them perfect knowledge of the rangers' strategy; we defer to future work the consideration of noise in the poachers' understanding of the rangers' strategy due to limited observations.
9.2.2 Behavioral Heterogeneity

The model we use to predict the behavior of the poachers is based on the SUQR model described in Chapter 5, replacing the assumption of a single parameter setting with a probability distribution over the model parameter in order to incorporate the heterogeneity of a large population of adversaries. SUQR extends the classic quantal response model by replacing the expected utility function with a subjective utility function:

$$SU_i(\omega) = \omega_1 x_i + \omega_2 U_{p,i}^u + \omega_3 U_{p,i}^c \qquad (9.1)$$

where the parameter $\omega = \langle \omega_1, \omega_2, \omega_3 \rangle$ measures the weight of each factor in the adversary's decision-making process. In Chapter 5, $\omega$ was learned using data collected from human subjects on Amazon Mechanical Turk, under the assumption of a single parameter $\omega$. We will show that the parameters learned for individuals in the data set differ from each other, and then show that the model's predictive power significantly improves if the parameter is changed from a single value to a probability distribution.

In the data set collected from the Amazon Mechanical Turk subjects, each subject played 25-30 games; in total, data was collected on about 760 subjects. We learn the SUQR parameter for each individual by maximizing the log-likelihood defined in Equation (9.2):

$$\log L(\omega) = \sum_k \log(q_{c_k}(\omega|G_k)) \qquad (9.2)$$

where $G_k$ denotes the $k$th game played by the subject and $c_k$ is the index of the target selected by the subject in this game. $q_{c_k}$ is the probability, as predicted by the SUQR model, that target $c_k$ is selected by the subject, computed as follows:

$$q_{c_k}(\omega|G_k) = \frac{e^{SU_{c_k}(\omega|G_k)}}{\sum_i e^{SU_i(\omega|G_k)}} \qquad (9.3)$$

where $SU_i(\omega|G_k)$ is the subjective utility function of Equation (9.1) for game instance $G_k$.

Table 9.2: Average log-likelihood
                Single parameter setting | Parameter learned per subject
  Training set:  -1.62                   | -1.09
  Testing set:   -1.68                   | -1.15

Figure 9.2 displays the empirical PDF of $\omega$.
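The target-choice probabilities of Equations (9.1) and (9.3) amount to a softmax over subjective utilities. A minimal sketch (function and argument names are our own; a max-shift is included for numerical stability):

```python
import math

def suqr_probs(omega, x, U_unc, U_cov):
    """SUQR choice probabilities, Eqs. (9.1) and (9.3).

    omega = (w1, w2, w3); x = coverage marginals; U_unc = poacher rewards
    if uncovered (U^u_p); U_cov = poacher penalties if covered (U^c_p).
    """
    w1, w2, w3 = omega
    # Subjective utility per target (Eq. 9.1).
    su = [w1 * xi + w2 * ru + w3 * rc for xi, ru, rc in zip(x, U_unc, U_cov)]
    # Logit/softmax response (Eq. 9.3), shifted by the max for stability.
    m = max(su)
    exps = [math.exp(s - m) for s in su]
    z = sum(exps)
    return [e / z for e in exps]
```

With a negative coverage weight $\omega_1$ (as found empirically), targets with higher coverage become less attractive, all else being equal.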
It shows the shape of a normal distribution in all three dimensions. Furthermore, we report in Table 9.2 the average log-likelihood of the SUQR model with the parameter value learned for each subject, alongside the log-likelihood of the SUQR model under the assumption that the parameter value is the same for all subjects. The results are evaluated using cross-validation. Table 9.2 shows that the predictive power of the model improves when the parameter is tuned for each subject, since the log-likelihood of the model's predictions increases. On average, the log-likelihood of the SUQR model with per-subject parameters is 0.53 higher than that with a uniform parameter across all subjects; in other words, the prediction of the former model is 1.70 (i.e., $e^{0.53}$) times more likely than that of the latter.

[Figure 9.2: Empirical marginal PDFs of the SUQR parameters $\omega_1$, $\omega_2$, $\omega_3$ among all 760 subjects.]

Given the results shown in Figure 9.2, we assume a normal distribution of the SUQR parameter $\omega$ in order to incorporate the heterogeneity of the decision-making processes of the whole population of poachers. The SUQR model with a specific value of $\omega$ essentially represents one type of poacher; with a continuous distribution over $\omega$, we are in effect facing a Bayesian Stackelberg game with infinitely many types. We denote the probability density function of $\omega$ as $f(\omega)$.

9.2.3 Adapting the Patrolling Strategy using Historical Crime Data

In the domain of wildlife crime, if the behavioral model of the whole adversary population is given, the optimal patrolling strategy $x^*$ is the one that maximizes the expected utility of the rangers:

$$x^* = \arg\max_x \int U_r(x|\omega) f(\omega)\, d\omega \qquad (9.4)$$

where $U_r(x|\omega)$ is the rangers' expected utility for executing strategy $x$ against a poacher with model parameter $\omega$. $U_r(x|\omega)$
is computed as follows:

$$U_r(x|\omega) = \sum_i U_{r,i}(x|\omega)\, q_i(\omega|G) \qquad (9.5)$$

where $U_{r,i}(x|\omega)$ is the rangers' expected utility if target $i$ is selected by the poacher. $U_r(x|\omega)$ is a nonlinear fractional function, given that $q_i(\omega|G)$ follows the prediction of the SUQR model.

In reality, the behavioral model of the adversary population is unknown to the rangers. Thus, a key challenge in obtaining an optimal patrolling strategy is to learn the poachers' behavioral models; more specifically, we want to learn the distribution of the SUQR model parameter. In the wildlife crime problem, data is often available about historical crime activity. Recall that these data points record the snares found by the rangers, which can be either anonymous or identified. Identified crime data that is linked to an individual poacher can be used to learn his behavioral model (i.e., to estimate the SUQR model parameter for that poacher). In contrast, it is impossible to directly use anonymous crime data to build a behavioral model for any individual. In theory, with enough identified crime data, we could estimate the underlying population distribution of $\omega$ directly; in reality, however, identified crime data is rare compared to anonymous crime data.

9.3 Research Advances in PAWS

Recall that existing techniques in SSGs cannot be applied directly to PAWS due to the new challenges arising from this new domain. In this section, we describe the novel research advances developed for solving the SSG in PAWS.

9.3.1 Learning the Behavioral Model

At the beginning of the game, the rangers only know that the distribution of the poacher population's model parameter follows a normal distribution. The goal is to learn this multivariate normal distribution (i.e., the mean $\mu$ and the covariance matrix $\Sigma$) of the 3-dimensional SUQR model parameter $\omega$ as data becomes available. As previously discussed, identified data, although sparse, can be used to directly learn poachers' individual behavioral models.
Since identified data is sparse, it takes a much longer time to collect enough of it to learn a reasonable distribution. In contrast, there is much more anonymous crime data collected. As we will show, we can learn the behavioral model of the poacher population using these two types of data; furthermore, we use the sparse identified data to boost the convergence of the learning.

Let us first define the format of the data collected in each round of the game. Let $N_a^{(t)}$ be the number of anonymous crimes observed by the rangers in round $t$ and $N_c^{(t)}$ be the number of captured poachers in round $t$. Furthermore, let $A^{(t)} = \{a_j^{(t)} \mid j = 1, \ldots, N_a^{(t)}\}$ denote the set of targets chosen by the anonymous poachers in round $t$, and $\Omega^{(t)} = \{\omega_k^{(t)} \mid k = 1, \ldots, N_c^{(t)}\}$ denote the set of parameter values associated with the captured poachers in round $t$; $\omega_k^{(t)}$ is the SUQR parameter of the $k$th poacher captured in round $t$. We assume that a captured poacher will confess his entire crime history over all previous rounds. For the $k$th poacher captured in round $t$, we denote by $C_k^{(t)} = \{c_{k,l}^{(t)}\}$ the set of crimes committed by him, where the index $l$ in $c_{k,l}^{(t)}$ represents his $l$th crime. $c_{k,l}^{(t)} = (\tau_{k,l}^{(t)}, x_{k,l}^{(t)})$ includes the target chosen by the poacher when the crime was committed (denoted $\tau_{k,l}^{(t)}$) and the rangers' resource allocation strategy at the time (denoted $x_{k,l}^{(t)}$). To simplify notation, we write $\tau_l$ for $\tau_{k,l}^{(t)}$ and $x_l$ for $x_{k,l}^{(t)}$ in the remainder of this section.

9.3.1.1 Learning with the Identified Data

For each captured poacher, the associated SUQR model parameter can be estimated with Maximum Likelihood Estimation (MLE):

$$\omega_k^{(t)} = \arg\max_\omega \log L(\omega | C_k^{(t)}) = \arg\max_\omega \sum_l \log(q_{\tau_l}(\omega | x_l)) \qquad (9.6)$$

where $q_{\tau_l}(\omega | x_l)$ is the predicted probability that the $k$th captured poacher chooses target $\tau_l$ when committing the crime after observing $x_l$ as the rangers' resource allocation strategy.
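The per-poacher MLE of Equation (9.6) can be sketched compactly. The snippet below (helper names, the crime-tuple layout, and the crude one-dimensional grid search are our own simplifications; the dissertation optimizes all three weights, exploiting the concavity of the log-likelihood) evaluates the log-likelihood of a parameter against a poacher's confessed crime history:

```python
import math

def suqr_loglik(omega, crimes):
    """log L(omega | C_k), Eq. (9.6): sum of log-probabilities of the targets
    an identified poacher chose, under each coverage vector he observed.
    crimes: list of (chosen_target, x, U_unc, U_cov) tuples."""
    w1, w2, w3 = omega
    ll = 0.0
    for chosen, x, U_u, U_c in crimes:
        su = [w1 * xi + w2 * ru + w3 * rc
              for xi, ru, rc in zip(x, U_u, U_c)]
        m = max(su)  # max-shift for numerical stability
        ll += (su[chosen] - m) - math.log(sum(math.exp(s - m) for s in su))
    return ll

def fit_w1(crimes, w2=0.5, w3=0.5, grid=None):
    """Illustrative 1-D grid search over the coverage weight w1 only."""
    grid = grid or [g / 10.0 for g in range(-100, 1)]
    return max(grid, key=lambda w1: suqr_loglik((w1, w2, w3), crimes))
```

A poacher who repeatedly strikes the least-covered target yields a strongly negative fitted coverage weight, matching the qualitative finding that coverage deters attacks.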
It can be shown that $\log L(\omega | C_k^{(t)})$ is a concave function, since its Hessian matrix is negative semi-definite.

At round $t$, a total of $\sum_{\tau=1}^t N_c^{(\tau)}$ poachers have been captured. After learning the model parameter $\omega$ for each of these poachers, we have $\sum_{\tau=1}^t N_c^{(\tau)}$ data samples drawn from the distribution of the poacher population. By applying MLE, the distribution of $\omega$ can be learned from these samples. Given that $\omega$ follows a 3-dimensional normal distribution, the mean and covariance matrix learned with MLE are calculated as follows:

$$\mu^{(t)} = \frac{1}{\sum_{\tau=1}^t N_c^{(\tau)}} \sum_{\tau=1}^t \sum_{\omega\in\Omega^{(\tau)}} \omega \qquad (9.7)$$

$$\Sigma^{(t)} = \frac{1}{\sum_{\tau=1}^t N_c^{(\tau)}} \sum_{\tau=1}^t \sum_{\omega\in\Omega^{(\tau)}} (\omega - \mu^{(t)})(\omega - \mu^{(t)})^T \qquad (9.8)$$

9.3.1.2 Learning with the Anonymous Data

Each anonymous data item records only the target selected by the poacher. Since no information is recorded about the individual poacher who committed the crime, it is impossible to estimate the model parameter as is done with identified data. One potential approach is to treat each anonymous data point as if it were committed by a different, independent poacher:

$$\omega_j^{(t)} = \arg\max_\omega \log L(\omega | a_j^{(t)}, x^{(t)}) \qquad (9.9)$$

where $x^{(t)}$ is the rangers' strategy in round $t$, and $\omega_j^{(t)}$ denotes the estimated model parameter of the anonymous poacher who committed the $j$th crime in round $t$. Note that in each round, the log-likelihood of any given value of $\omega$ depends only on the target that was selected by the poacher; different poachers with different model parameters are treated identically if they choose the same target in the same round. Let $\tilde{\Omega}^{(t)} = \{\omega_j^{(t)} \mid j = 1, \ldots, N_a^{(t)}\}$ represent the set of estimated model parameters associated with the $N_a^{(t)}$ anonymous crimes recorded in round $t$.

Algorithm 7: PAWS-Learn
1: Input: $t$; $C^{(\tau)}, A^{(\tau)}, x^{(\tau)}, \forall \tau = 1,\ldots,t-1$
2: $(\mu^{(t)}, \Sigma^{(t)}) \leftarrow$ Learn($\{C^{(\tau)}, \tau = 1,\ldots,t-1\}$)
3: $\{\omega_n\} \leftarrow$ Sample($\mu^{(t)}, \Sigma^{(t)}, N_s$)
4: $\lambda_n^o \leftarrow \frac{f(\omega_n | \mu^{(t)}, \Sigma^{(t)})}{\sum_{n'} f(\omega_{n'} | \mu^{(t)}, \Sigma^{(t)})}, \;\forall n$
5: $\langle \lambda_n \rangle \leftarrow$ Refine($\{C^{(\tau)}, \tau = 1,\ldots,t-1\}, \langle \lambda_n^o \rangle$)
6: return $(\{\omega_n\}, \langle \lambda_n \rangle)$

Similar to how the identified data was used, the maximum likelihood estimates of the mean and covariance matrix of the model parameter distribution can be computed as:

$$\tilde{\mu}^{(t)} = \frac{1}{\sum_{\tau=1}^t N_a^{(\tau)}} \sum_{\tau=1}^t \sum_{\omega\in\tilde{\Omega}^{(\tau)}} \omega \qquad (9.10)$$

$$\tilde{\Sigma}^{(t)} = \frac{1}{\sum_{\tau=1}^t N_a^{(\tau)}} \sum_{\tau=1}^t \sum_{\omega\in\tilde{\Omega}^{(\tau)}} (\omega - \tilde{\mu}^{(t)})(\omega - \tilde{\mu}^{(t)})^T \qquad (9.11)$$

9.3.1.3 Combining the Two Kinds of Data

Identified data provides an accurate measurement of an individual poacher's behavior; however, due to its sparseness, it leads to slow learning convergence for the population's behavioral model. Anonymous data provides only a noisy estimate of an individual poacher's behavioral model, but thanks to the large number of data points it gives a sufficiently accurate measurement of the crime distribution of the poacher population. We propose PAWS-Learn, an algorithm that improves the estimate of the model parameter distribution by combining the identified and the anonymous data. Algorithm 7 shows the outline of PAWS-Learn. At round $t$, PAWS-Learn first uses the identified data to learn the mean and covariance, as shown in Line 2. It then measures the accuracy of this estimate using the mean square error (MSE) between the predicted crime distribution and the one recorded by the anonymous data:

$$MSE^{(t)}(\mu^{(t)}, \Sigma^{(t)}) = \sum_{i\in T} \left( \bar{q}_i(x^{(t)} | \mu^{(t)}, \Sigma^{(t)}) - y_i^{(t)} \right)^2 \qquad (9.12)$$
n ;x (t) ) The quadratic program formulation for minimizing the MSE of the observed crime distribution is shown in Equations (9.13)-(9.15). min X i2T ( X n n q i (! n ;x (t) )y (t) i ) 2 (9.13) s:t: X n n = 1; n 2 [0; 1];8n (9.14) j n o n j o n ;8n (9.15) Equation (9.15) is to ensure the smoothness ofh n i since the values are essentially samples from the probability density function of a normal distribution. More specifically, it constrains n to be within a certain distance of the initial value o n . The parameter is set to decide the range of n proportion to o n . As shown in Line (4), o n is set to the pdf of the current estimated distribution N ( (t) ; (t) ): o n = Cf(! n j (t) ; (t) ), whereC = 1 P n f(!nj (t) ; (t) ) is the constant to make sure that P n n = 1. As shown in Line (5), PAWS-Learn refines the probabilities of the sampled parameter values by solving the above quadratic programming problem. 1 PAWS-Learn currently assumes that in the anonymous data collected by rangers in each round, the observed crime distribution is close to the true distribution. 166 Algorithm 8: PAWS-Adapt 1 Input:N (t) c ,N (t) a ; 2 x (1) MAXIMIN; 3 for = 1;::: do 4 (C () ;A () ) CollectData(x () ) ; 5 (f! n g;f n g) PAWSLearn(C () ;A () ;x () ); 6 x (+1) ComputeStrategy(f! n g;f n g) 7 end 9.3.2 Adapting Patrolling Strategy We propose PAWS-Adapt, a framework to adaptively design the patrolling strategy for the rangers. Let (C o ;A o ;x o ) be the initial data set. At roundt, PAWS-Adapt first estimates the behavioral model with all the historical data by calling PAWS-Learn. Let (! (t) n ;h (t) n ) be the learning results of the poacher population’s behaviorial model by PAWS-Learn. PAWS-Adapt then computes the optimal patrolling strategy, based on the current learning result, to execute in the next round. 
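The weight-refinement quadratic program of Eqs. (9.13)–(9.15), i.e. the Refine step of PAWS-Learn, can be sketched as follows. The function name and the choice of SciPy's SLSQP solver are illustrative assumptions; the proportional band of constraint (9.15) is folded, together with θ_n ∈ [0, 1], into per-variable bounds.

```python
import numpy as np
from scipy.optimize import minimize

def refine_weights(Q, y, theta0, beta=0.5):
    """Refine sample weights by solving the QP of Eqs. (9.13)-(9.15):
    minimize || Q^T theta - y ||^2 subject to theta summing to 1 and
    each theta_n staying within a factor (1 +/- beta) of theta0_n.
    Q[n, i] = q_i(omega_n; x): choice probability of target i under the
    sampled parameter omega_n; y[i]: observed crime proportion at target i;
    theta0: the pdf-based initial weights theta^o from Line (4)."""
    def mse(theta):
        return np.sum((Q.T @ theta - y) ** 2)
    cons = [{"type": "eq", "fun": lambda th: th.sum() - 1.0}]
    # Constraint (9.15) as box bounds, intersected with [0, 1] from (9.14).
    bounds = [(max(0.0, (1 - beta) * t0), min(1.0, (1 + beta) * t0))
              for t0 in theta0]
    res = minimize(mse, theta0, bounds=bounds, constraints=cons,
                   method="SLSQP")
    return res.x
```

Starting the solver at θ^o guarantees a feasible initial point, since θ^o already sums to one and trivially satisfies its own band constraint.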
Computing the optimal patrolling strategy under the given behavioral model requires solving the optimization problem in Equation (9.4), which is equivalent to a Bayesian Stackelberg game with infinitely many types. With the discretized-sample representation, we approximate the infinite types with the set of sampled model parameters {ω_n}. Given that the objective function in Equation (9.4) is non-convex, we solve it by finding multiple local optima with random restarts. Let x^(t) be the patrolling strategy computed by PAWS-Adapt for round t. The rangers then update their strategy in the new round. As shown in Algorithm 8, the rangers update the poachers' behavioral model each round after more data is collected; they then switch to a new strategy computed with the updated model.

9.4 Evaluation

9.4.1 General Game Settings

In the first set of experiments, we generate a random payoff matrix similar to the experiment in Chapter 5. The crime data points are simulated as follows: given the true distribution of ω, we first draw a set of random parameter values of ω to represent the whole poacher population. Let N_p be the total number of poachers. In each round, we first draw a subset of these N_p values to represent the subset of poachers who commit crimes in the current round. Given the patrolling strategy, we then simulate the target choices made by this subset of poachers; these choices are recorded as the anonymous data. Meanwhile, we randomly select a given number of poachers from this subset to represent the poachers that are captured by the rangers in the current round. Once a poacher is captured, the choices he made in previous rounds are linked and recorded as identified data points.
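The random-restart strategy computation described above can be sketched as follows: each sampled SUQR type responds quantally to a candidate coverage vector, and a local solver is restarted from random feasible points. The function names, the defender utility form x_i·Rd_i + (1 − x_i)·Pd_i, and the SLSQP solver are illustrative assumptions, not the exact implementation used in PAWS.

```python
import numpy as np
from scipy.optimize import minimize

def attacker_q(omega, x, R, P):
    # SUQR quantal response of one sampled adversary type to coverage x
    u = omega[0] * x + omega[1] * R + omega[2] * P
    e = np.exp(u - u.max())
    return e / e.sum()

def defender_eu(x, omegas, thetas, R, P, Rd, Pd):
    # Defender expected utility, averaged over sampled types with weights theta
    eu = 0.0
    for omega, theta in zip(omegas, thetas):
        q = attacker_q(omega, x, R, P)
        eu += theta * np.sum(q * (x * Rd + (1 - x) * Pd))
    return eu

def best_strategy(omegas, thetas, R, P, Rd, Pd, m, restarts=20, seed=0):
    """Approximate the non-convex problem (Eq. 9.4) by random restarts of
    a local solver; m is the number of security resources."""
    rng = np.random.default_rng(seed)
    n = len(R)
    cons = [{"type": "ineq", "fun": lambda x: m - x.sum()}]
    best_x, best_v = None, -np.inf
    for _ in range(restarts):
        x0 = np.clip(rng.dirichlet(np.ones(n)) * m, 0, 1)
        res = minimize(lambda x: -defender_eu(x, omegas, thetas, R, P, Rd, Pd),
                       x0, bounds=[(0, 1)] * n, constraints=cons,
                       method="SLSQP")
        if -res.fun > best_v:
            best_x, best_v = res.x, -res.fun
    return best_x, best_v
```

Each restart only finds a local optimum of the non-convex objective; keeping the best over many restarts is the heuristic the text describes.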
[Figure 9.3: Simulation results over rounds (20 targets, 5 security resources, 3 captures and 50 crimes per round). (a) Cumulative EU; (b) Strategy Convergence.]

In Figure 9.3(a), we show the cumulative expected utility (EU) of the rangers over the rounds. We compare three different approaches: PAWS-Learn, learning from only the identified data, and learning from only the anonymous data. We also include the maximin strategy as a baseline.

[Figure 9.4: Slow capture vs. fast capture: 1-norm distance to the optimal strategy over rounds, with 1 capture/round vs. 3 captures/round.]

The upper bound is computed assuming the rangers know the true distribution of ω. In Figure 9.3(a), we show the average result over 20 random game instances. We set the number of targets to 20 and the number of security resources to 5. The true distribution of ω is the same as that learned in Section 9.2.2. In each round, 50 anonymous data points are generated and 3 poachers are captured. As can be seen in the figure, PAWS-Learn outperforms the two learning approaches that each use only one type of data. Furthermore, learning indeed helps improve the patrolling strategy, since the three solid lines are much closer to the upper bound than the baseline maximin strategy. In Figure 9.3(b), we show the convergence of the patrolling strategy under the three different learning methods; PAWS-Learn converges faster than the other two. Thus, combining the two types of data indeed boosts the learning of the poacher population's behavioral model.
In order to show how the speed of capturing poachers impacts the performance of PAWS, we fix the number of anonymous data points at 50 per round and simulate the captured poachers at two different paces: 1 poacher vs. 3 poachers per round. Figure 9.4 shows the convergence of PAWS-Learn in these two cases. It is clear that the strategy converges faster when more poachers are captured.

We then compare the cumulative EU achieved by the three methods under a varying number of targets and a varying amount of resources. In both Figure 9.5(a) and 9.5(b), the y-axis displays the cumulative EU of the rangers at the end of round 20. In both figures, we simulate 50 crimes and randomly generate 3 captured poachers each round. In Figure 9.5(a), we vary the number of resources on the x-axis while fixing the number of targets to 20. It shows that the cumulative EU increases as more resources are added. In addition, PAWS-Learn outperforms the other two approaches regardless of resource quantity. Similarly, we vary the number of targets on the x-axis in Figure 9.5(b) while fixing the amount of resources to 5; the better performance of PAWS-Learn over the other two learning methods holds regardless of the number of targets.

[Figure 9.5: Comparing cumulative EU at round 20. (a) 20 targets, 3 captures, 50 crimes, varying resource units; (b) 5 security resources, 3 captures, 50 crimes, varying number of targets.]

9.4.2 Results for the Deployment Area

We now show the experimental results of applying PAWS to QENP. We focus on a 64 square kilometer area in QENP that features flat grasslands, an international trade route that connects to the nearby Democratic Republic of the Congo, smaller roads, and fresh water.
In our simulation area (Figure 9.6(b)), the series of lakes is modeled as areas of high animal density. Since the roads in this area provide multiple access points for poachers and rangers, both can leave the closest road at the point nearest to their targeted cells; we calculate travel distances according to that rationale. These representations of animal density and distance form the primary basis for the payoffs of both the rangers and the poachers. The poachers' reward (i.e., if they choose a cell not covered by rangers) depends on the relative animal density of the cell and the travelling cost to that cell. The travelling cost depends on the distance from the closest entry point (e.g., a road). Therefore, a lake close to a road is at high risk for poaching and is thus modelled as an area of high reward to the poachers. In turn, the poachers' penalty (i.e., if the chosen cell is covered by rangers) is determined by the travelling cost to the cell and the loss incurred by being captured by the rangers. The rangers' reward is considered uniform, since their goal is to search for snares and capture poachers regardless of location. The penalty for the rangers (i.e., failing to find snares at a location) is determined by the animal density of the cell. Further discussion of this rationale can be found in the domain description in Section 9.1.

We run simulations with a sample game, similar to the general setting explained in Section 9.4.1. Figure 9.7 displays the simulation results, where the number of resources is set to 16, indicating that a single patrol covers 16 grid areas in the map. The rangers' cumulative EU is shown in Figure 9.7(a); PAWS-Learn achieves performance very close to the optimal strategy. The convergence of the patrolling strategy to the optimal strategy is shown in Figure 9.7(b). To help visualize the change in the rangers' patrolling strategy, we show the coverage density of the 8-by-8 area at three different rounds in Figure 9.8.
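The payoff construction described above can be sketched as follows. The linear functional forms and the weights (w_d, w_c, capture_loss) are purely illustrative assumptions; the text specifies only the qualitative dependencies (animal density, travel cost, capture loss, and a uniform ranger reward).

```python
import numpy as np

def build_payoffs(density, dist, w_d=1.0, w_c=0.5, capture_loss=5.0,
                  ranger_reward=1.0):
    """Illustrative payoff construction for the QENP grid (forms assumed).
    density[i]: relative animal density of cell i; dist[i]: travel distance
    from the closest entry point (e.g., a road). Returns (R_p, P_p, R_r, P_r):
    poacher reward/penalty and ranger reward/penalty per cell."""
    travel_cost = w_c * dist
    R_p = w_d * density - travel_cost           # uncovered cell: density minus travel cost
    P_p = -travel_cost - capture_loss           # covered cell: travel cost plus capture loss
    R_r = np.full_like(density, ranger_reward)  # uniform ranger reward
    P_r = -w_d * density                        # missing snares hurts more where density is high
    return R_p, P_p, R_r, P_r
```

Under these assumptions, a high-density cell near a road (low dist) gets a high poacher reward, matching the lake-near-road example in the text.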
[Figure 9.6: The QENP area of interest for our simulation. (a) A zoomed-out view of the simulation area; (b) the 64 sq. km grid overlaid on the simulation area.]

Darker colors indicate less coverage in the area. Note that there are three lakes located in the lower-left area, where the density of animals is higher. It is clear that these areas are covered more by the rangers. The figure also shows a clear shift of the patrolling coverage over the rounds.

These results were enthusiastically received by our collaborator at QENP. While the existing framework requires manual analysis of the snare data, PAWS provides a systematic way of generating patrolling strategies based on automatic analysis of the data. PAWS will start to be tested in the field in March 2014, with actual deployment planned for the latter portion of 2014.

[Figure 9.7: Simulation results over rounds for the 64 sq. km grid area in QENP (64 targets, 16 security resources, 3 captures and 50 crimes per round). (a) Cumulative EU; (b) Strategy Convergence.]

[Figure 9.8: Patrolling coverage density in the park at Rounds 1, 5, and 20.]

Chapter 10: Conclusion

The game-theoretic approach has become a very important tool for solving real-world security problems. Its usefulness is proven by a number of real-world deployed applications, including ARMOR [Pita et al., 2008] for the Los Angeles International Airport, IRIS [Tsai et al., 2009] for the Federal Air Marshals Service, and GUARDS [Pita et al., 2011] for the Transportation Security Administration. These systems have often adopted the standard assumption of classic game theory, a perfectly rational adversary, which may not hold in the real world against human adversaries who have bounded rationality. While such an assumption is a reasonable start for this first generation of applications, it is critical to address the decision making of human adversaries as the next step.

My thesis aims to address this challenge by closing the gap between two important subfields of game theory: Behavioral Game Theory and Algorithmic Game Theory. The former provides empirical models for predicting the decision making of human players; the latter focuses on developing efficient computation of optimal strategies for the players. In addressing the decision making of human adversaries in real-world security problems, the key is to bridge the efforts from both sides.

To that end, my thesis on the one hand provides novel models for predicting adversary decision making in security games, by first investigating the effectiveness of different existing models and then extending the selected model with key insights drawn from data. On the other hand, my thesis develops efficient algorithms for optimizing the defender's resource allocation strategy while incorporating the behavioral model of the adversary. In particular, my thesis makes the following five key contributions.

10.1 Contributions

Stochastic model for adversary decision making: This work answers a critical question: which existing model should be used to predict adversary decision making? In particular, it presents: (i) new efficient algorithms for computing optimal strategic solutions using Prospect Theory and Quantal Response Equilibrium; (ii) the most comprehensive experiment to date studying the effectiveness of different models against human subjects in security games; and (iii) new techniques for generating representative payoff structures for behavioral experiments in generic classes of games.
Our results with human subjects show that our new techniques outperform the leading contender for modeling human behavior in security games.

More sophisticated models: This work analyzes the effectiveness of adversary modeling in security games in addressing the bounded rationality of the adversary. Through extensive experiments with human subjects, I emphatically answer the question in the affirmative, while providing the following key results: (i) our algorithm, SU-BRQR, based on a novel integration of a human behavior model with the subjective utility function, significantly outperforms a robust optimization approach, MATCH; (ii) we are the first to present experimental results with security intelligence experts, and find that even though the experts are more rational than the Amazon Mechanical Turk workers, SU-BRQR still outperforms an approach assuming perfect rationality (and, to a more limited extent, MATCH); (iii) we show the advantage of SU-BRQR in a new, large game setting and demonstrate that sufficient data enables it to improve its performance over MATCH.

GOSAQ and PASAQ: I provide two algorithms for efficient computation of the defender's optimal strategy incorporating a boundedly rational model of the adversary. They overcome the difficulties of solving a nonlinear and non-convex optimization problem and of handling constraints on assigning security resources in designing defender strategies. In addressing these difficulties, GOSAQ guarantees the globally optimal solution in computing the defender strategy against an adversary's quantal response; PASAQ provides more efficient computation of the defender strategy with nearly optimal solution quality. Both algorithms achieve much better solution quality than the benchmark algorithm BRQR. In the presence of resource assignment constraints, PASAQ is shown to achieve much better computational efficiency than both GOSAQ and the benchmark algorithm BRQR.
In fact, the approximation error of PASAQ is proven to be linearly bounded by the piecewise linear accuracy.

BLADE: I develop an algorithm that further scales up the computation of the defender's optimal strategy in massive security games with trillions of defender strategies, incorporating the bounded rationality of the adversary. BLADE is based on three novel ideas. First, we present a separation oracle that can effectively prune the search space via deep cuts; more importantly, we show that to handle massive-scale SSGs this separation oracle must itself use a secondary oracle, and that this two-level hierarchy of oracles is efficient. Second, we provide a novel heuristic that further speeds up BLADE by exploiting the SSG objective function to improve its cuts. Third, BLADE provides a technique for trading off quality against efficiency. As we experimentally demonstrate, BLADE is significantly more efficient than a Branch-and-Price based algorithm.

Adaptive resource allocation: In domains with collective data on adversary events, we face new challenges, including learning the behavioral model of the adversary from the collected data. Using wildlife protection as an example domain, I present PAWS, a novel application for improving wildlife crime patrols, which is essential to combating wildlife poaching. As demonstrated in the experimental results, PAWS successfully models the wildlife crime domain and optimizes wildlife crime patrols while remaining flexible enough to operate both generally and in a specific deployed area. Due to the unique challenges introduced by wildlife crime, we have also made a series of necessary technical contributions. Specifically, the success of PAWS depends on the following novel contributions: 1. a stochastic behavioral model extension that captures the population's heterogeneity; 2. PAWS-Learn, which combines both anonymous and identified data to improve the accuracy of the estimated behavioral model; 3. PAWS-Adapt, which adapts the rangers' patrolling strategy against the behavioral model generated by PAWS-Learn.

10.2 Future Work

In this thesis, I have shown how the game-theoretic approach can be applied to optimize resource allocation in security problems, with a focus on two domains: counter-terrorism and preventing illegal poaching of wildlife. As security remains an important global concern, there are numerous research opportunities available.

One key area is to translate the results obtained here in controlled experiments on AMT into specific, real-world security applications. Most of the issues related to making this transition are not unique to our work, but apply more generally to studies of agent/human interactions. For example, the specific conditions tested in the lab and the way in which decisions are presented are not likely to be exactly reflected in real interactions, and neither is the population of subjects identical to the population of adversaries in a real-world security setting. However, our methods are based on fundamental features of human decision making that are robustly supported by a large number of behavioral studies, and these methods should thus translate into real-world applications. In addition, the parameters offer some ability to tune the models over time to specific settings or populations of interest, and our methodology provides techniques for tuning these parameters. The parameter settings in our work can serve as initial settings in a real deployment, to be adapted over time; alternatively, the parameters can initially be set conservatively (e.g., somewhat close to settings that result in a standard equilibrium) and adapted from this starting point. Another interesting possibility for future work is to develop ways to incorporate different sources of information (such as prior knowledge of the biases of specific adversaries) into the models in a general way.
One other possible direction for future work concerns further improvement of the model in PAWS. The current model in PAWS is based on a set of simplifying assumptions about the domain. For example, the adversaries are currently assumed to be myopic and to have a static behavioral model. In reality, however, as the defender gradually improves its behavioral model of the adversaries, the adversaries might also adapt their strategies; considering such dynamics in PAWS and other security games with repeated settings will be critical to improving the performance of the model. In addition, PAWS does not consider collaboration and competition among individual poachers: it is assumed that individual poachers make independent decisions. Modeling such collaboration and competition will be a necessary next step to improve the performance of PAWS. Furthermore, the current learning model assumes that the rangers have perfect knowledge of the location distribution of past poaching events. However, given that the rangers only find evidence of poaching in areas they actually patrol, the distribution of poaching events in the uncovered areas is in fact unknown to the rangers. A necessary next step is to modify the learning model to take such noise in the data into account.

Another possible direction relates to the efficient computation of the defender's optimal strategy. My algorithms open the door to optimizing defender resource allocation in massive real-world security problems with large numbers of defender pure strategies. For problems similar to TRUSTS [Yin et al., 2012], with large numbers of adversary pure strategies, the door remains open for specialized techniques to further improve the computational efficiency of the algorithms. Furthermore, BLADE assumes a single type of adversary behavioral model.
In domains with a large population of adversaries, such as wildlife protection, efficient algorithms are still needed to handle Bayesian types of adversary behavioral models.

In the long run, a general framework should be built for applying the game-theoretic approach to any security domain while incorporating a behavioral model of human decision making. First, a behavioral model will need to be developed for predicting adversary behavior, accounting for the new features that might be involved in the adversaries' decision-making process in these new domains. One possible approach is to use a quantitative model similar to the quantal response model; the many features that might be involved in the adversary's decision making can be built into the model using machine learning methods if data is available. Once a model is developed for predicting adversary behavior, efficient algorithms for optimizing defender resource allocation will need to be developed. Approaches similar to those applied in this thesis for developing PASAQ and BLADE can potentially be used, given that a quantitative model of adversary decision making usually leads to a non-convex optimization problem.

Bibliography

Max Abrahms. What terrorists really want: Terrorist motives and counterterrorism strategy. International Security, 32(4):78–105, 2008.

Michele Aghassi and Dimitris Bertsimas. Robust game theory. Math. Program., 107:231–273, 2006.

Noa Agmon, Sarit Kraus, and Gal A. Kaminka. Multi-robot perimeter patrol in adversarial settings. In ICAT, 2008.

Noa Agmon, Sarit Kraus, Gal A. Kaminka, and Vladimir Sadow. Adversarial uncertainty in multi-robot patrol. In IJCAI, 2009.

S. A. Ali. Rs 18L seized in nakabandi at Vile Parle. Times of India, August 2009.

Graham Allison and Philip Zelikow. Essence of Decision: Explaining the Cuban Missile Crisis. Pearson, 1999.

Bo An, Manish Jain, Milind Tambe, and Christopher Kiekintveld.
Mixed-initiative optimization in security games: A preliminary report. In Proceedings of the AAAI Spring Symposium, 2010.

Bo An, David Kempe, Christopher Kiekintveld, Eric Shieh, Satinder Singh, Milind Tambe, and Yevgeniy Vorobeychik. Security games with limited surveillance. In AAAI, 2012.

Robert J. Aumann and M. B. Maschler. Repeated Games with Incomplete Information. The MIT Press, 1995.

Amos Azaria, Zinovi Rabinovich, Sarit Kraus, and Claudia V. Goldman. Strategic information disclosure to people with multiple alternatives. In AAAI, pages 594–600, 2011.

Amos Azaria, Zinovi Rabinovich, Sarit Kraus, and Claudia V. Goldman. Strategic information disclosure to people with multiple alternatives. In AAAI, 2012.

C. Barnhart, E. Johnson, G. Nemhauser, M. Savelsbergh, and P. Vance. Branch and price: Column generation for solving huge integer programs. Operations Research, 46:316–329, 1994.

Nicola Basilico, Nicola Gatti, and Francesco Amigoni. Leader-follower strategies for robotic patrolling in environments with arbitrary topologies. In AAMAS, 2009.

J. C. Becsey, Laszlo Berke, and James R. Callan. Nonlinear least squares methods: A direct grid search approach. Journal of Chemical Education, 45(11):728, 1968.
In Large-Scale Nonlinear Optimization, pages 35–59. G. di Pillo and M. Roma, eds, Springer-Verlag, 2006. Colin F. Camerer. Behavioral Game Theory: Experiments in Strategic Interaction. Princeton University Press, Princeton, New Jersey, 2003. Colin F. Camerer, Teck-Hua Ho, and Juin-Kuan Chongn. A congnitive hierarchy model of games. QJE, 119(3):861–898, 2004. R. Chandran and G. Beitchman. Battlefor mumbai ends, death toll rises to 195. Times of India, November 2008. Olivier Chapelle, Bernhard Scholkopf, and Alexander Zien. Semi-Supervised Learning. The MIT Press, Cambridge, Massachusetts, USA, 2006. Vincent Conitzer and Thomas Sandholm. Computing the optimal strategy to commit to. In In Proceedings of the ACM Conference on Electronic Commerce (ACM-EC), pages 82–90, 2006. Miguel Costa-Gomes, Vicent P. Crawford, and Brimp Broseta. Cognition and behavior in normal- form games: An experimental studys. Econometrica, 69(5):1193–1235, 2001. Celso de Melo, Peter Carnevale, and Jonathan Gratch. The effect of expression of anger and happiness in computer agents on negotiations with humans. In In AAMAS, 2011. Enrico Diecidue and Peter P. Wakker. On the intuition of rank-dependent utility. The Journal of Risk and Uncertainty, 23(3):281–289, 2001. Nick Feltovich. Reinforment-based vs. belief-based learning models in experimental asymmetric- information games. Econometrica, 68(3):605–641, May 2000. Sevan G. Ficici and Avi Pfeffer. Simultaneously modeling humans’ preferences and their beliefs about others’ preferences. In In AAMAS, 2008. Baruch Fischhoff, Bernard Goitein, and Zur Shapira. Subjective expected utility: A model of decision-making. Journal of American Society of Information Science, 32(5):391–399, 1981. 182 Nicola Gatti. Game theoretical insights in strategic patrolling: Model and algorithm in normal- form. In In ECAI-08, pages 403–407, 2008a. Nicola Gatti. Game theoretical insights in strategic patrolling model and algorithm in normal- form. 
In In ECAI, pages 403–407, 2008b. G. Gigerenzer, Todd P. M., and the ABC Research Group. Simple Heuristics that make us smart. Oxford University Press, 1999. Paul Gill and Joseph Young. Comparing role-specific terrorist profiles. In American Society of Criminology Annual Meeting, 2011. Jacob K. Goeree, Charles A. Holt, and Thomas R. Palfrey. Regular quantal response equilibrium. Experimental Economics, 8(4):347–367, December 2005. Priscilla E. Greenwood and Michael S. Nikulin. A Guide to Chi-squared Testing. John Wiley & Sons, Inc, 1996. Reid Hastie and Robyn M. Dawes. Rational Choice in an Uncertain World: the Psychology of Judgement and Decision Making. Sage Publications, Thounds Oaks, 2001. Manish Jain, Erim Kardes, Christopher Kiekintveld, Milind Tambe, and Fernando Ordonez. Se- curity games with arbitrary schedules: A branch and price approach. In In AAAI, 2010a. Manish Jain, James Pita, Jason Tsai, Christopher Kiekintveld, Shyamsunder Rathi, Fernando Ord´ o˜ nez, and Milind Tambe. Software assistants for patrol planning at lax and federal air marshals service. Interfaces, 40(4):267–290, 2010b. Manish Jain, Dmytro Korzhyk, Ondrej Vanek, Vincent Conitzer, Michal Pechoucek, and Milind Tambe. A double oracle algorithm for zero-sum security games on graphs. In AAMAS, 2011a. Manish Jain, Milind Tambe, and Christopher Kiekintveld. Quality-bounded solutions for finite bayesian stackelberg games: Scaling up. In In AAMAS, 2011b. Manish Jain, Kevin Leyton-Brown, and Milind Tambe. The deployment-to-saturation ratio in security games. In AAAI, 2012. Daniel Kahneman and Amos Tvesky. Prospect theory: An analysis of decision under risk. Econo- metrica, 47(2):263–292, 1979. Daniel Kahneman and Amos Tvesky. Advances in prospect theory: Cumulative representation of uncertainty. Journal of Risk and Uncertainty, 5:297–322, 1992. Gregory L. Keeney and Detlof von Winterfeldt. Identifying and structuring the objectives of terrorists. Risk Analysis, 30(12):1803–1816, 2010. J. E. 
Kelley. The cutting-plane method for solving convex programs. The Society for Industrial and Applied Mathematics, 8(4):703–713, 1960. Christopher Kiekintveld, Manish Jain, Jason Tsai, James Pita, Fernando Ord´ o˜ nez, and Milind Tambe. Computing optimal randomized resource allocations for massive security games. In In AAMAS, pages 689–696, 2009. 183 Christopher Kiekintveld, Janusz Marecki, and Milind Tambe. Approximation methods for infinite bayesian stackelberg games: Modeling distributional payoff uncertainty. In In AAMAS, 2011. Dmytro Korzhyk, Vincent Conitzer, and Ronald Parr. Complexity of computing optimal stackel- berg strategies in security resource allocation games. In In AAAI, 2010. Joshua Letchford, Vincent Conitzer, and Kamesh Munagala. Learning and approximating the optimal strategy to commit to. In In Algorithmic Game Theory. Springer, 2009. Janusz Marecki, Gerry Tesauro, and Richard Segal. Playing repeated stackelberg games with unknown opponents. In In AAMAS, 2012. Winter Mason and Siddharth Suri. Conducting behavioral research on amazons mechanical turk. Behavior Research Methods, 44(1):1–23, 2012. Daniel L. McFadden. Econometric analysis of qualitative response models. Handbook of Econo- metrics, 2:1395–1457, 1984. Daniel L. McFadden. A method of simulated moments for estimation of discrete choice models without numerical integration. Econometrica, 57(5):995–1026, 1989. Richard D. McKelvey and Thomas R. Palfrey. Quantal response equilibria for normal form games. Games and Economic Behavior, 2:6–38, 1995. Moses Montesh. Rhino poaching: A new form of organised crime. Technical report, College of Law Research and Innovation Committee of the University of South Africa, 2013. William Moreto. To Conserve and Protect: Examining Law Enforcement Ranger Culture and Operations in Queen Elizabeth National Park, Uganda. Thesis, Rutgers, 2013. Thanh H. Nguyen, Rong Yang, Amos Azaria, Sarit Kraus, and Milind Tambe. 
Analyzing the effectiveness of adversary modeling in security games. In In AAAI, 2013. E. Nudelman, J. Wortman, Y . Shoham, and K. Leyton-Brown. Run the gamut: A comprehensive approach to evaluating game-theoretic algorithms. In AAMAS, pages 880–887, 2004. Christos H. Papadimitriou and Tim Roughgarden. Computing correlated equilibria in multi- player games. Journal of the ACM, 55(3):14, July 2008. Praveen Paruchuri, Jonathan P. Pearce, Janusz Marecki, Milind Tambe, Fernando Ord´ o˜ nez, and Sarit Kraus. Playing games for security: An efficient exact algorithm for solving bayesian stackelberg games. In In AAMAS, 2008. Noam Peled, Ya’akov Gal, and Sarit Kraus. A study of computational and human strategies in revelation games. In In AAMAS, 2011. James Pita, Manish Jain, Fernando Ord´ o˜ nez, Christopher Portway, Milind Tambe, Craig Western, Praveen Paruchuri, and Sarit Kraus. Deployed armor protection: The application of a game theoretic model for security at the los angeles international airport. In In AAMAS, 2008. 184 James Pita, Manish Jain, Fernando Ord´ o˜ nez, Milind Tambe, and Sarit Kraus. Solving stackelberg games in the real-world: Addressing bounded rationality and limited observations in human preference models. Artificial Intelligence Journal, 174(15):1142–1171, 2010. James Pita, Milind Tambe, Chris Kiekintveld, Shane Cullen, and Erin Steigerwald. Guards - game theoretic security allocation on a national scale. In In AAMAS, 2011. James Pita, Richard John, Rajiv Maheswaran, Milind Tambe, and Sarit Kraus. A robust approach to addressing human adversaries in security games. In ECAI, pages 660–665, 2012. Mark R. Pogrebin. About Criminals: A View of the Offenders World. SAGE, 2012. Yundi Qian, William B. Haskell, Albert Xin Jiang, and Milind Tambe. Online planning for optimal protector strategies in resource conservation games. In In AAMAS, 2014. Ulf-Dietrich Reips. Conducting behavioral research on amazons mechanical turk. 
Experimental Psychology, 49(4):243–256, 2002. Louise Richardson. What Terrorists Want: Understanding the Enemy, Containing the Threat. Random House Trade Paperbacks, 2007. Heather Rosoff and Richard John. Decision analysis by proxy for the rational terrorist. In In QRASA at IJCAI, pages 25–32, 2009. Ariel Rubinstein. Modeling Bounded Rationality. MIT Press, Cambridge, Massachusetts, 1998. P. S. Sastry, V . V . Phansalkar, and M. Thathachar. Decentralized learning of nash equilibria in multi-person stochastic games with incomplete information. IEEE TRANSACTIONS ON SYSTEMS MAN AND CYBERNETICS, 24(5), 1994. Leonard J. Savage. The Foundations of Statistics. Dover Publications, 1972. Mrinal K. Sen and Paul L. Stoffa. Global optimization methods in geophysical inversion. Elsevier, New York, 1995. Eric Shieh, Bo An, Rong Yang, Milind Tambe, Craig Baldwin, Joseph DiRenzo, Ben Maule, and Garrett Meyer. Protect: A deployed game theoretic system to protect the ports of the united states. In In AAMAS, 2012. Herbert A. Simon. Rational choice and the structure of the environment. Psychological Review, 63(2):129–138, 1956. Herbert A. Simon. Science of the Artificial. MIT Press, Cambridge, Massachusetts, 1969. Dale O. Stahl and Paul W. Wilson. Experimental evidence on players’ models of other players. JEBO, 25(3):309–327, 1994. Chris Starmer. Developments in non-expected utility theory: The hunt for a descriptive theory of choice under risk. Journal of Economic Literature, 38(2):332–382, 2000. 185 Donald Stevens, Thomas Hamilton, Marvin Schaffer, Diana Dunham-Scott, Jamison J. Medby, Edward W. Chan, John Gibson, Mel Eisman, Richard Mesic, Charles T. Kelly Jr, Julie Kim, Tom LaTourrette, and J. Jack Riley. Implementing security improvement options at Los Angeles International Airport. RAND Corporation, 2006. URLhttp://www.rand.org/pubs/ documentedbriefings/2006/RANDDB499-1.pdf. Kenneth Train. Discrete Choice Methods with Simulation. 
Cambridge University Press, Cam- bridge, UK, 2003. Jason Tsai, Shyamsunder Rathi, Christopher Kiekintveld, Fernando Ord´ o˜ nez, and Milind Tambe. Iris - a tool for strategic security allocation in transportation networks. In In AAMAS, 2009. Jason Tsai, Zhengyu Yin, Jun young Kwak, David Kempe, Christopher Kiekintveld, and Milind Tambe. Urban security: Game-theoretic resource allocation in networked physical domains. In In AAAI, 2010. Stephen A. Vavasis. Complexity issues in global optimization: a survey. In Handbook of Global Optimization, pages 27–41. In R. Horst and P.M. Pardalos, editors, Kluwer, 1995. Heinrich Freiherr von Stackelberg. Market Structure and Equilibrium. Springer, 2011. Bernhard von Stengel and Shmuel Zamir. Leadership with commitment to mixed strategies. In Tech. rep. LSE-CDAM-2004-01, CDAM Research Report, 2004. Alan Washburn and Kevin Wood. Two-person zero-sum games for network interdiction. Opera- tions Research, 43(2):243–251, 1995. Yussuf Adan Wato, Geoffrey M. Wahungu, and Moses Makonjio Okello. Correlates of wildlife snaring patterns in tsavo west national park, kenya. Biological Conservation, 132(4):500–509, 2006. ISSN 0006-3207. doi: http://dx.doi.org/10.1016/j.biocon.2006.05.010. URL http: //www.sciencedirect.com/science/article/pii/S0006320706002047. Rand R. Wilcox. Applying contemporary statistical techniques. Academic Press, 2003. James R. Wright and Kevin Leyton-Brown. Beyond equilibrium: Predicting human behavior in normal-form games. In In AAAI, 2010. Rong Yang, Christopher Kiekintveld, Fernando Ord´ o˜ nez, Milind Tambe, and Richard John. Im- proving resource allocation strategy against human adversaries in security games. In In IJCAI, 2011. Rong Yang, Albert Xin Jiang, Fei Fang, Milind Tambe, Rajiv Maheswaran, and Karthik Ra- jagopal. Designing better strategies against human adversaries in network security games. In In AAMAS, 2012a. Rong Yang, Fernando Ord´ o˜ nez, and Milind Tambe. 
Appendix A: Error Bound of PASAQ

For simplicity, let us first define the following notation:

$F^{(r)}(x)$, the objective function of the CF-OPT problem associated with a given estimate $r$:
$$F^{(r)}(x) = \sum_{i \in T} \theta_i (r - P^d_i)\, e^{-\beta_i x_i} - \sum_{i \in T} \theta_i \alpha_i x_i\, e^{-\beta_i x_i}, \qquad x^{(r)} = \arg\min_{x} F^{(r)}(x)$$

$\tilde{F}^{(r)}(x)$, the objective function of the PASAQ-MILP problem associated with a given estimate $r$:
$$\tilde{F}^{(r)}(x) = \sum_{i \in T} \theta_i (r - P^d_i)\Big(1 + \sum_{k=1}^{K} a_{ik} x_{ik}\Big) - \sum_{i \in T} \theta_i \alpha_i \sum_{k=1}^{K} b_{ik} x_{ik}, \qquad \tilde{x}^{(r)} = \arg\min_{x} \tilde{F}^{(r)}(x)$$

We also define the game constants, determined by the payoffs, in Table A.1.

Table A.1: Game constants
$$\bar{\theta} := \max_{i \in T} \theta_i \qquad \underline{\theta} := \min_{i \in T} \theta_i \qquad R^d := \max_{i \in T} |R^d_i| \qquad P^d := \max_{i \in T} |P^d_i| \qquad \bar{\beta} := \max_{i \in T} \beta_i \qquad \bar{\alpha} := \max_{i \in T} \alpha_i$$

Lemma 14. For any real value $r \in \mathbb{R}$, one of the following two conditions holds:
(a) $r \le p^* \iff \exists x \in X_f$ s.t. $r D(x) - N(x) \le 0$;
(b) $r > p^* \iff \forall x \in X_f,\ r D(x) - N(x) > 0$.

Proof. We only prove (a), as (b) is proven similarly. '$\Leftarrow$': since $\exists x$ such that $r D(x) \le N(x)$, we have $r \le \frac{N(x)}{D(x)} \le p^*$. '$\Rightarrow$': since P1 optimizes a continuous objective over a closed convex set, there exists an optimal solution $x^*$ such that $p^* = \frac{N(x^*)}{D(x^*)} \ge r$, which after rearranging gives the result.

Lemma 15. The approximation error of the piecewise linear functions is bounded as follows:
$$|e^{-\beta_i x_i} - L^{(1)}_i(x_i)| \le \frac{\beta_i}{K}, \quad 0 \le x_i \le 1,\ \forall i \in T \quad (A.1)$$
$$|x_i e^{-\beta_i x_i} - L^{(2)}_i(x_i)| \le \frac{1}{K}, \quad 0 \le x_i \le 1,\ \forall i \in T \quad (A.2)$$

Proof. Let $f_i(x_i)$ be the original function and $L_i(x_i)$ the corresponding piecewise linear approximation. The following proof holds for both $f_i(x_i) = e^{-\beta_i x_i}$ and $f_i(x_i) = x_i e^{-\beta_i x_i}$:
$$\max_{0 \le x_i \le 1} |f_i(x_i) - L_i(x_i)| = \max_{k=1}^{K}\ \max_{\frac{k-1}{K} \le x_i \le \frac{k}{K}} |f_i(x_i) - L_i(x_i)| \quad (A.3)$$

We now bound $\max_{\frac{k-1}{K} \le x_i \le \frac{k}{K}} |f_i(x_i) - L_i(x_i)|$ in three steps:

1. Assuming $f_i(x_i) \ge L_i(x_i)$ for $\frac{k-1}{K} \le x_i \le \frac{k}{K}$ (note that $L_i$ is linear on the segment and coincides with $f_i$ at the breakpoints):
$$\max_{\frac{k-1}{K} \le x_i \le \frac{k}{K}} |f_i(x_i) - L_i(x_i)| \le \max_{\frac{k-1}{K} \le x_i \le \frac{k}{K}} f_i(x_i) - \min_{\frac{k-1}{K} \le x_i \le \frac{k}{K}} L_i(x_i) = \max_{\frac{k-1}{K} \le x_i \le \frac{k}{K}} f_i(x_i) - \min\Big\{L_i\Big(\tfrac{k-1}{K}\Big), L_i\Big(\tfrac{k}{K}\Big)\Big\}$$
$$= \max_{\frac{k-1}{K} \le x_i \le \frac{k}{K}} f_i(x_i) - \min\Big\{f_i\Big(\tfrac{k-1}{K}\Big), f_i\Big(\tfrac{k}{K}\Big)\Big\} \le \max_{\frac{k-1}{K} \le x_i \le \frac{k}{K}} f_i(x_i) - \min_{\frac{k-1}{K} \le x_i \le \frac{k}{K}} f_i(x_i) \le \frac{1}{K} \max_{0 \le x_i \le 1} |f'_i(x_i)|$$

2. Assuming $f_i(x_i) \le L_i(x_i)$ for $\frac{k-1}{K} \le x_i \le \frac{k}{K}$:
$$\max_{\frac{k-1}{K} \le x_i \le \frac{k}{K}} |f_i(x_i) - L_i(x_i)| \le \max_{\frac{k-1}{K} \le x_i \le \frac{k}{K}} L_i(x_i) - \min_{\frac{k-1}{K} \le x_i \le \frac{k}{K}} f_i(x_i) = \max\Big\{L_i\Big(\tfrac{k-1}{K}\Big), L_i\Big(\tfrac{k}{K}\Big)\Big\} - \min_{\frac{k-1}{K} \le x_i \le \frac{k}{K}} f_i(x_i)$$
$$= \max\Big\{f_i\Big(\tfrac{k-1}{K}\Big), f_i\Big(\tfrac{k}{K}\Big)\Big\} - \min_{\frac{k-1}{K} \le x_i \le \frac{k}{K}} f_i(x_i) \le \max_{\frac{k-1}{K} \le x_i \le \frac{k}{K}} f_i(x_i) - \min_{\frac{k-1}{K} \le x_i \le \frac{k}{K}} f_i(x_i) \le \frac{1}{K} \max_{0 \le x_i \le 1} |f'_i(x_i)|$$

3. If $f_i(x_i)$ and $L_i(x_i)$ cross within $[\frac{k-1}{K}, \frac{k}{K}]$, we can partition the interval into subintervals on which the two functions do not cross, and then apply case 1 or case 2 within each subinterval.

Combining the above three cases, we have
$$\max_{\frac{k-1}{K} \le x_i \le \frac{k}{K}} |f_i(x_i) - L_i(x_i)| \le \frac{1}{K} \max_{0 \le x_i \le 1} |f'_i(x_i)| \quad (A.4)$$

where $f'_i(x_i)$ denotes the first-order derivative of $f_i(x_i)$. Combining with Equation (A.3), we have
$$\max_{0 \le x_i \le 1} |f_i(x_i) - L_i(x_i)| \le \frac{1}{K} \max_{0 \le x_i \le 1} |f'_i(x_i)| \quad (A.5)$$

Hence, the approximation error bound is determined by the maximum absolute value of the first-order derivative. It can be shown that
$$\max_{0 \le x_i \le 1} \Big|\frac{d(e^{-\beta_i x_i})}{dx_i}\Big| = \Big|\frac{d(e^{-\beta_i x_i})}{dx_i}\Big|_{x_i = 0} = \beta_i \quad (A.6)$$
$$\max_{0 \le x_i \le 1} \Big|\frac{d(x_i e^{-\beta_i x_i})}{dx_i}\Big| = \Big|\frac{d(x_i e^{-\beta_i x_i})}{dx_i}\Big|_{x_i = 0} = 1 \quad (A.7)$$

Combining Equations (A.5)-(A.7) gives the result.

Lemma 16. Let $L^*$ and $U^*$ be the lower and upper bounds of GOSAQ when the algorithm stops, and let $x^*$ be the defender strategy returned by GOSAQ. Then,
$$L^* \le Obj_{P1}(x^*) \le U^*$$

Proof. When the algorithm stops, we have $F^{(L^*)}(x^*) \le 0 \Rightarrow L^* \le \frac{N(x^*)}{D(x^*)} = Obj_{P1}(x^*)$. At the same time, $F^{(U^*)}(x) > 0,\ \forall x \Rightarrow U^* > \frac{N(x^*)}{D(x^*)} = Obj_{P1}(x^*)$.

Lemma 17. Let $\tilde{L}^*$ and $\tilde{U}^*$ be the lower and upper bounds of PASAQ when the algorithm stops, and let $\widetilde{Obj}_{P1}(x)$ be the approximation of the objective function of P1 under the piecewise linear representations of $e^{-\beta_i x_i}$ and $x_i e^{-\beta_i x_i}$. Then,
$$\tilde{L}^* \le \widetilde{Obj}_{P1}(\tilde{x}^*) \le \tilde{U}^*$$
where $\tilde{x}^*$ is the defender strategy returned by PASAQ.

Proof. Same as that for Lemma 16.

Lemma 18. Let $Obj_{P1}(x)$ be the objective function of P1 and $\widetilde{Obj}_{P1}(x)$ the corresponding approximation under the piecewise linear representations of $e^{-\beta_i x_i}$ and $x_i e^{-\beta_i x_i}$. Then, $\forall x \in X_f$,
$$|Obj_{P1}(x) - \widetilde{Obj}_{P1}(x)| \le \frac{\bar{\theta}}{\underline{\theta}}\, e^{\bar{\beta}} \big\{(R^d + P^d)\bar{\beta} + \bar{\alpha}\big\} \frac{1}{K} \quad (A.8)$$

Proof. Let $\tilde{N}(x) = \sum_{i \in T} \theta_i \alpha_i L^{(2)}_i(x_i) + \sum_{i \in T} \theta_i P^d_i L^{(1)}_i(x_i)$ and $\tilde{D}(x) = \sum_{i \in T} \theta_i L^{(1)}_i(x_i) > 0$ be the piecewise linear approximations of the numerator and denominator of $Obj_{P1}$, respectively.
$$|Obj_{P1}(x) - \widetilde{Obj}_{P1}(x)| = \Big|\frac{N(x)}{D(x)} - \frac{\tilde{N}(x)}{\tilde{D}(x)}\Big| = \Big|\frac{N(x)}{D(x)} - \frac{N(x)}{\tilde{D}(x)} + \frac{N(x)}{\tilde{D}(x)} - \frac{\tilde{N}(x)}{\tilde{D}(x)}\Big|$$
$$\le \Big|\frac{N(x)}{D(x)} \cdot \frac{\tilde{D}(x) - D(x)}{\tilde{D}(x)}\Big| + \Big|\frac{N(x) - \tilde{N}(x)}{\tilde{D}(x)}\Big| = \frac{1}{\tilde{D}(x)} \Big(|Obj_{P1}(x)| \cdot |D(x) - \tilde{D}(x)| + |N(x) - \tilde{N}(x)|\Big)$$

Based on Lemma 15,
$$|N(x) - \tilde{N}(x)| \le \sum_{i \in T} \theta_i \alpha_i \frac{1}{K} + \sum_{i \in T} \theta_i |P^d_i| \frac{\beta_i}{K} \le (\bar{\alpha} + P^d \bar{\beta}) \frac{\bar{\theta}|T|}{K}$$
$$|D(x) - \tilde{D}(x)| \le \sum_{i \in T} \theta_i \frac{\beta_i}{K} \le \frac{\bar{\theta}\bar{\beta}|T|}{K}$$

At the same time, $|Obj_{P1}(x)| \le R^d$ and $\tilde{D}(x) \ge |T|\, \underline{\theta}\, e^{-\bar{\beta}}$. Hence,
$$|Obj_{P1}(x) - \widetilde{Obj}_{P1}(x)| \le \frac{\bar{\theta}}{\underline{\theta}}\, e^{\bar{\beta}} \big\{(R^d + P^d)\bar{\beta} + \bar{\alpha}\big\} \frac{1}{K}$$

Lemma 19. $\forall x \in X_f$, the following condition holds:
$$|F^{(r)}(x) - \tilde{F}^{(r)}(x)| \le (|r| + P^d) \sum_{i \in T} \theta_i \frac{\beta_i}{K} + \frac{\bar{\alpha}\bar{\theta}|T|}{K} \quad (A.9)$$

Proof. Let $L^{(1)}_i(x_i) = 1 + \sum_{k=1}^{K} a_{ik} x_{ik}$ be the piecewise linear approximation of the function $e^{-\beta_i x_i}$, and $L^{(2)}_i(x_i) = \sum_{k=1}^{K} b_{ik} x_{ik}$ that of the function $x_i e^{-\beta_i x_i}$. We have
$$|F^{(r)}(x) - \tilde{F}^{(r)}(x)| \le \Big|\sum_{i \in T} \theta_i (r - P^d_i) e^{-\beta_i x_i} - \sum_{i \in T} \theta_i (r - P^d_i) L^{(1)}_i(x_i)\Big| + \Big|\sum_{i \in T} \theta_i \alpha_i x_i e^{-\beta_i x_i} - \sum_{i \in T} \theta_i \alpha_i L^{(2)}_i(x_i)\Big|$$
$$\le \sum_{i \in T} \theta_i |r - P^d_i| \cdot |e^{-\beta_i x_i} - L^{(1)}_i(x_i)| + \sum_{i \in T} \theta_i \alpha_i |x_i e^{-\beta_i x_i} - L^{(2)}_i(x_i)|$$
$$\le (|r| + P^d) \sum_{i \in T} \theta_i |e^{-\beta_i x_i} - L^{(1)}_i(x_i)| + \bar{\alpha} \sum_{i \in T} \theta_i |x_i e^{-\beta_i x_i} - L^{(2)}_i(x_i)| \quad (A.10)$$

Combining Equations (A.1) and (A.2), we have
$$|F^{(r)}(x) - \tilde{F}^{(r)}(x)| \le (|r| + P^d) \sum_{i \in T} \theta_i \frac{\beta_i}{K} + \frac{\bar{\alpha}\bar{\theta}|T|}{K}$$

Lemma 20. Let $\tilde{L}^*$ be the estimated maximum of $Obj_{P1}(x)$ found by PASAQ. Then,
$$\tilde{F}^{(\tilde{L}^*)}(x) \ge -\epsilon\, \bar{\theta}\, |T|, \quad \forall x \in X_f \quad (A.11)$$

Proof. Let $\tilde{U}^*$ and $\tilde{L}^*$ be the upper and lower bounds when the algorithm stops. According to the stopping condition in Line 3 of Algorithm 1, $\tilde{U}^* - \tilde{L}^* \le \epsilon$. Therefore $\tilde{L}^* + \epsilon \ge \tilde{U}^*$, so the result of CheckFeasibility with input $\tilde{L}^* + \epsilon$ must be infeasible. In other words,
$$\tilde{F}^{(\tilde{L}^* + \epsilon)}(x) > 0, \quad \forall x \in X_f \quad (A.12)$$

On the other hand, since $L^{(1)}_i(x_i) \le 1$,
$$\tilde{F}^{(\tilde{L}^* + \epsilon)}(x) - \tilde{F}^{(\tilde{L}^*)}(x) = \epsilon \sum_{i \in T} \theta_i L^{(1)}_i(x_i) \le \epsilon\, \bar{\theta}\, |T|, \quad \forall x \in X_f \quad (A.13)$$

Combining Equations (A.12) and (A.13),
$$\tilde{F}^{(\tilde{L}^*)}(x) \ge \tilde{F}^{(\tilde{L}^* + \epsilon)}(x) - \epsilon\, \bar{\theta}\, |T| \ge -\epsilon\, \bar{\theta}\, |T|$$

Lemma 21. Let $L^*$ be the estimated maximum of P1 found by GOSAQ. Then,
$$L^* - \tilde{L}^* \le \frac{\bar{\theta}}{\underline{\theta}}\, e^{\bar{\beta}} \Big\{\epsilon + \frac{1}{K}\big((R^d + P^d)\bar{\beta} + \bar{\alpha}\big)\Big\} \quad (A.14)$$

Proof. According to Lemma 14, $F^{(L^*)}(x^{(L^*)}) \le 0$. At the same time,
$$F^{(L^*)}(x^{(L^*)}) = F^{(\tilde{L}^*)}(x^{(L^*)}) + (L^* - \tilde{L}^*) \sum_{i \in T} \theta_i e^{-\beta_i x^{(L^*)}_i}$$
$$\Rightarrow (L^* - \tilde{L}^*) \sum_{i \in T} \theta_i e^{-\beta_i x^{(L^*)}_i} \le -F^{(\tilde{L}^*)}(x^{(L^*)}) \quad (A.15)$$

Furthermore, Lemma 19 indicates that
$$F^{(\tilde{L}^*)}(x^{(L^*)}) \ge \tilde{F}^{(\tilde{L}^*)}(x^{(L^*)}) - (|\tilde{L}^*| + P^d) \sum_{i \in T} \theta_i \frac{\beta_i}{K} - \frac{\bar{\alpha}\bar{\theta}|T|}{K} \ge \tilde{F}^{(\tilde{L}^*)}(x^{(L^*)}) - \Big(\frac{(R^d + P^d)\bar{\beta}}{K} + \frac{\bar{\alpha}}{K}\Big)\bar{\theta}|T| \quad (A.16)$$
since $|\tilde{L}^*| \le \max_{i \in T} |R^d_i| = R^d$. Combining Equations (A.11), (A.15) and (A.16),
$$(L^* - \tilde{L}^*) \sum_{i \in T} \theta_i e^{-\beta_i x^{(L^*)}_i} \le \epsilon\, \bar{\theta}\, |T| + \Big(\frac{(R^d + P^d)\bar{\beta}}{K} + \frac{\bar{\alpha}}{K}\Big)\bar{\theta}|T|$$

Furthermore, $\sum_{i \in T} \theta_i e^{-\beta_i x^{(L^*)}_i} \ge |T|\, \underline{\theta}\, e^{-\bar{\beta}}$, so
$$L^* - \tilde{L}^* \le \frac{\bar{\theta}}{\underline{\theta}}\, e^{\bar{\beta}} \Big\{\epsilon + \frac{1}{K}\big((R^d + P^d)\bar{\beta} + \bar{\alpha}\big)\Big\}$$
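The piecewise-linear error bounds of Lemma 15 are easy to check numerically. The sketch below is not part of the thesis; it is an illustrative verification, with arbitrary values chosen for the game constant `beta` and the segment count `K`. It builds the uniform-breakpoint linear interpolants of $e^{-\beta x}$ and $x e^{-\beta x}$ on $[0,1]$ and confirms that the maximum gap stays within $\beta/K$ and $1/K$, respectively.

```python
import math

def pwl(f, K):
    """Piecewise-linear interpolant of f on [0,1] with K uniform segments,
    matching f exactly at the breakpoints k/K (as in PASAQ)."""
    xs = [k / K for k in range(K + 1)]
    ys = [f(x) for x in xs]
    def L(x):
        k = min(int(x * K), K - 1)      # segment index containing x
        t = (x - xs[k]) * K             # local coordinate within the segment
        return ys[k] + t * (ys[k + 1] - ys[k])
    return L

beta, K = 2.0, 50                       # hypothetical game constant and segment count
f1 = lambda x: math.exp(-beta * x)      # function approximated by L^(1)
f2 = lambda x: x * math.exp(-beta * x)  # function approximated by L^(2)
L1, L2 = pwl(f1, K), pwl(f2, K)

# Dense grid search for the worst-case approximation error on [0,1]
grid = [j / 10000 for j in range(10001)]
err1 = max(abs(f1(x) - L1(x)) for x in grid)
err2 = max(abs(f2(x) - L2(x)) for x in grid)

assert err1 <= beta / K                 # Lemma 15, Equation (A.1)
assert err2 <= 1.0 / K                  # Lemma 15, Equation (A.2)
```

In practice the observed error is much smaller than the lemma's bound: for twice-differentiable functions, linear interpolation on segments of width $1/K$ errs by roughly $\max|f''|/(8K^2)$, whereas the lemma only uses the first-derivative (Lipschitz) argument, which is what makes the $O(1/K)$ bound simple enough to carry through Lemmas 18-21.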