SEQUENTIAL DECISION-MAKING FOR SENSING, COMMUNICATION AND STRATEGIC INTERACTIONS

by

Dhruva Kartik Mokhasunavisu

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL ENGINEERING)

December 2021

Copyright 2021 Dhruva Kartik Mokhasunavisu

Dedicated to humanity's eternal pursuit of knowledge and wisdom...

Acknowledgements

Obtaining a PhD degree is a challenging adventure, which is impossible to carry out alone. Over the past five years, I have received support, feedback and encouragement from various individuals. This dissertation would not have been possible without their help.

First of all, I would like to express my sincere gratitude to my PhD advisors, Prof. Urbashi Mitra and Prof. Ashutosh Nayyar, for their support, guidance, care and patience. I cannot forget all the long, deep discussions I had with them on various subjects. These discussions enriched me as a researcher as well as a human being. I would like to thank my qualifying exam and dissertation committee members: Prof. Rahul Jain, Prof. Phebe Vayanos and Prof. Bhaskar Krishnamachari, for their intriguing questions and constructive comments that helped me in completing this thesis. I would like to thank my mentors Prof. Sanjay Bose and Prof. A. Rajesh, who guided and prepared me for research during my undergraduate studies at the Indian Institute of Technology Guwahati. I would like to thank the EEB staff members: Diane Demetras, Gerrielyn Ramos, Corine Wong, John Diaz and Susan Wiedem. They were always available to help and answer all my questions. I would also like to thank my group members: Dr. Junting Chen, Dr. Marcos Vasconcelos, Dr. Arpan Chattopadhyay, Dr. Sajjad Beygi, Dr. Sunav Choudhary, Dr. Amr Elnakeeb, Dr. Libin Liu, Tze-Yang Tung, Dr. Mukul Gagrani, Dr. Seyed Mohammad Asghari and Dr. Shiva Navabi for all the interesting discussions, arguments and fun that we had together.

Lastly and most importantly, I would like to express my deepest gratitude to my parents, Raghavendra Rao and Sarada, and my sister Dhanvi, for their enduring love, support, understanding and encouragement.
Abstract

Intelligent autonomous systems can be employed in many areas such as transportation, power grids, manufacturing, exploration etc. Agents in these systems act sequentially over time in uncertain and dynamically evolving environments. They may acquire data from their surroundings, communicate with other agents and act strategically in a cooperative or a non-cooperative manner. In this dissertation, we focus on three aspects of sequential decision-making: sensing (or data acquisition), communication and strategic interactions.

Sensing: We investigate the problem of active sensing for multi-hypothesis testing. In active hypothesis testing, the agent is interested in an unknown quantity. It sequentially performs experiments that generate data and based on the data acquired, an inference is made. We formulate Neyman-Pearson type active hypothesis testing problems wherein the objective is to minimize the probability of making an incorrect inference while ensuring that the correct inference probability is sufficiently large. We characterize the asymptotically optimal error exponents of these problems. We identify a novel class of asymptotically optimal strategies. This class subsumes most of the classical strategies but also includes new strategies that have superior non-asymptotic performance. We demonstrate the performance gain obtained by using our strategy in several anomaly detection models. We also consider the problem of adaptive sampling for estimating multiple unknown distributions with uniform accuracy. We provide a regret-free Bayesian UCB-type sampling algorithm that outperforms existing frequentist sampling approaches. We use this approach to design testing strategies for obtaining accurate SARS-CoV-2 seroprevalence estimates in various categories.

Strategic Interactions: Adversarial interactions in cyber-physical systems like power grids and transportation networks can be modeled as stochastic zero-sum games between teams. In such games, the players may act based on different information.
Finding the value and Nash equilibria of games between teams with incomplete and asymmetric information is notoriously hard. We consider a general model of zero-sum stochastic games between two competing teams. This model subsumes many previously considered team and zero-sum game models. For this general model, we provide bounds on the upper (min-max) and lower (max-min) values of the game. Furthermore, if the upper and lower values of the game are identical (i.e., if the game has a value), our bounds coincide with the value of the game. Our bounds are obtained using two dynamic programs based on a sufficient statistic known as the common information belief (CIB). We also identify certain information structures in which only the minimizing team controls the evolution of the CIB. In these cases, we show that one of our CIB based dynamic programs can be used to find the min-max strategy (in addition to the min-max value). We propose an approximate dynamic programming approach for computing the values (and the strategy when applicable) and illustrate our results with the help of an example.

Communication: We consider a two-agent dynamic system in which the agents must act in a coordinated manner. One of the agents knows the state and can communicate with the other agent over a rate-constrained unidirectional communication channel. The objective is to minimize a cost while ensuring that the coordination constraint is met. In this setting, we provide control and communication strategies for various kinds of communication channels.

Chapter 1
Introduction

1.1 Thesis Background

The functioning of our world is more reliant on autonomous systems than ever. There is hardly any sphere of life that is not influenced by an automated control system or an algorithm. Power transmission, automated driving, communication systems, process monitoring, surveillance, manufacturing, recommendation systems, credit risk analysis, algorithmic trading, elections and even scientific research are areas that have been significantly impacted by the advent of automation. The rise in the degree of automation can partly be attributed to humans' intrinsic desire to build tools that make their life easier. A more compelling factor that drives automation is the scale of operation of the aforementioned systems. Given today's scale, manually operating these systems seems inconceivable. Automation has the potential to dramatically increase the systems' efficiency and also make them robust to adversarial attacks. Thus, the ubiquity of automation is unsurprising. Clearly, the automated systems that we design and use every day heavily influence our society's functioning. Hence, it is crucial for us to have a thorough and deep understanding of such systems, and the work in this report is a tiny step forward in that direction.

Autonomous systems or agents generally operate over a period of time. They acquire data from their surroundings, process it and then act in order to achieve a well-defined objective. This procedure is referred to as sequential decision-making. Sequential decision-making may usually involve the following challenges:

1. System dynamics: The state of the system evolves over time, often in a stochastic manner. Since the agents' actions affect the state evolution, the control or decision-making strategy must take into account the long-term consequences of performing an action. Such problems are usually modeled as Markov Decision Processes (MDPs) and are solved using dynamic programming.
We often come across problems where the state space is too large and applying dynamic programming in a straightforward manner becomes intractable. In such cases, approximate dynamic programming methodologies and other approaches may be used to design sequential decision-making strategies.

2. Partial observability: Very often, we see systems in which the decision-making agent does not have full information regarding the system state. This is even more common in decentralized decision-making, in which multiple agents make their decisions based on different information. Since agents do not fully know the other agents' information, partial observability is inevitable in decentralized systems. Finding solutions for Partially Observable MDPs (POMDPs) is notoriously hard.

3. Presence of selfish strategic agents: In many systems, different agents may have different objectives (potentially competing). In such cases, we are interested in finding Nash equilibrium strategies for these agents. In centralized problems with complete information, i.e. when all agents use the same information to make their decisions, dynamic programming can be used to find Nash equilibrium strategies. However, this approach does not work for decentralized systems in general.

All these challenges make sequential decision-making a very complex task. However, when the problem admits additional structure, it may be possible to design reasonably tractable solutions despite the challenges mentioned above. We will focus on three such problems in this report. In the first part, we will discuss problems related to active sensing a.k.a. active hypothesis testing. In active sensing, we are interested in making an accurate inference regarding an unknown quantity by smartly acquiring data in an adaptive or data-driven manner. In the second part, we will discuss sequential decision-making for strategic interactions. Our focus will be on two-player zero-sum stochastic games in which players have asymmetric information. In the third part, we will discuss strategies for a two-agent system in which the agents need to act in a coordinated manner while communicating with each other over a rate-constrained communication channel. We will then discuss some directions for future research in these areas.

1.2 Problems Investigated in this Thesis

1.2.1 Part I: Active Hypothesis Testing

In active hypothesis testing, we have an unknown quantity that takes values from a finite set. We are interested in inferring this quantity by acquiring data. We have multiple sources or experiments that generate data and we can sequentially choose which experiment to perform at each time. We see such problems in applications like anomaly detection, target detection, state tracking, clinical trials and communication with feedback, etc. The central theme in active hypothesis testing is to design efficient strategies for selecting experiments and subsequently making an inference based on the data gathered. While this problem can be viewed from a control-theoretic lens and solved using approximate dynamic programming, tools from information theory and statistics help us in analyzing and obtaining tractable solutions for this problem.
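To make the data-driven inference loop concrete, here is a minimal Bayesian posterior-update sketch in Python. The two hypotheses, the two "sensor" experiments, their likelihood tables and the observation record are all hypothetical illustrations, not a model taken from this thesis.

```python
import numpy as np

# Hypothetical likelihoods: likelihoods[u][i, y] = P(Y = y | X = i, experiment u).
likelihoods = {
    "sensor_A": np.array([[0.9, 0.1],
                          [0.4, 0.6]]),
    "sensor_B": np.array([[0.5, 0.5],
                          [0.1, 0.9]]),
}

def posterior_update(belief, u, y):
    """Bayes update of the belief over hypotheses after seeing outcome y of experiment u."""
    unnormalized = belief * likelihoods[u][:, y]
    return unnormalized / unnormalized.sum()

# Start from a uniform prior over two hypotheses and process a short experiment/observation record.
belief = np.array([0.5, 0.5])
for u, y in [("sensor_A", 0), ("sensor_B", 1), ("sensor_A", 0)]:
    belief = posterior_update(belief, u, y)
print(belief)  # the belief concentrates on the hypothesis that best explains the data
```

An experiment selection strategy chooses which entry of the likelihood dictionary to query next based on the current belief; the chapters in Part I study how to make that choice well.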
For these problems, we characterize asymptotically optimal error-exponents. We also characterize a class of asymptotically optimal inference and experiment selection strategies that subsumes many classical strategies. Classical strategies tend to be randomized and some cases, open-loop. Our class of strategies includes adaptive deterministic strategies that are also asymptotically optimal and have superior non-asymptotic performance is some cases. In Chapter 3, we consider a special class of hypothesis testing problems in which we are inter- ested testing whether a multi-component system has anomalies or not. We investigate two sampling models: individual sampling and group sampling. For both these models, we provide asymptoti- cally optimal strategies. We observe through simulations that our deterministic adaptive strategies have superior non-asymptotic performance compared to the classical Chernoff-type strategies. In this chapter, we prove that for a special class of hypothesis testing problems, the error performance of our adaptive strategies is only a logarithmic term away from the optimal performance. In Chapter 4, we develop a computational framework that can be used to generate experiment selection strategies for active hypothesis testing. This framework uses dynamic programming with neural-network based function approximation. With appropriately tuned hyper-parameters, we 3 demonstrate through simulations that this framework can compute optimal strategies for some non-trivial hypothesis testing problems. In Chapter 5, we consider the problem of finding the peak of a unimodal signal based on noisy observations that can acquired adaptively. For this problem, we discuss some results associated with a greedy sampling algorithm. In Chapter 6, we consider the problem of designing adaptive sampling strategies for estimating multiple distributions with uniform accuracy. We propose a Bayesian Upper Confidence Bound (UCB)-type sampling strategy and prove that its regret performance is no worse than existing frequentist methods. Further, we demonstrate that its empirical performance is superior to the frequentist methods. This is because of its ability to compute tight confidence bounds. We propose to use this sampling methodology to design testing strategies for estimating seroprevalence in various categories of interest with uniform accuracy. 1.2.2 Part II: Stochastic Games In this part, we discuss sequential decision-making for strategic interactions. We focus on two-player zero-sum games in which players make their decisions based on different information. Stochastic zero-games with asymmetric information can be used to model adversarial attacks on cyber-physical systems such as power grids and transportation networks. In Chapter 7, we consider a general model of zero-sum games that captures a wide variety of information structures. For such games, we provide a dynamic programming characterization of the value if it exists. If the value does not exist, then the dynamic program provide bounds on the upper and lower values of the zero-sum game. Our dynamic program is given in terms of a statistic known as common information belief. While these dynamic programs are computationally intractable in general, they become tractable for games with specialized information structures. In Chapter 8, we consider stochastic games between two teams. We extend our results on the value characterization from Chapter 7 to this game between teams. 
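The single-stage problem that appears inside such dynamic programs is a zero-sum matrix game. The sketch below computes the value and a min-max mixed strategy of one such matrix game by linear programming; the cost matrix and the use of scipy are illustrative assumptions and not the method developed in Chapters 7 and 8.

```python
import numpy as np
from scipy.optimize import linprog

def matrix_game_value(A):
    """Value and an optimal mixed strategy for the row (minimizing) player of the
    zero-sum matrix game with cost matrix A (the row player pays A[i, j])."""
    m, n = A.shape
    # Variables: [x_1, ..., x_m, v]; minimize v subject to
    #   sum_i x_i * A[i, j] <= v  for every column j,   sum_i x_i = 1,   x >= 0.
    c = np.r_[np.zeros(m), 1.0]
    A_ub = np.c_[A.T, -np.ones(n)]                    # A^T x - v <= 0
    b_ub = np.zeros(n)
    A_eq = np.r_[np.ones(m), 0.0].reshape(1, -1)      # probabilities sum to one
    b_eq = np.array([1.0])
    bounds = [(0, None)] * m + [(None, None)]          # x >= 0, v free
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[-1], res.x[:m]

# Matching-pennies style example: the value is 0.5 with a uniform mixed strategy.
value, strategy = matrix_game_value(np.array([[1.0, 0.0], [0.0, 1.0]]))
print(value, strategy)
```

In the CIB-based dynamic programs, a (much larger) optimization of this flavor has to be solved at every belief point, which is one reason approximate dynamic programming is needed.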
In Chapter 8, we consider stochastic games between two teams. We extend our results on the value characterization from Chapter 7 to such games between teams. Additionally, we show that in several game models, only the minimizing team controls the evolution of the common information belief. For such games, we characterize a min-max strategy using dynamic programming. We provide several structural properties of the value functions in the dynamic program which can be used to make its computation simpler. We propose an architecture for solving the dynamic program and illustrate it with the help of an example.

In Chapter 9, we discuss an equivalence between static and dynamic games. This equivalence is a generalization of Witsenhausen's equivalence result for cooperative teams.

1.2.3 Part III: Coordination over Communication Channels

In this part, we consider a setting in which there are two agents that must act in a coordinated manner, i.e. the two agents must take exactly the same action with probability 1 at any given time. One of the agents knows the system state while the other does not. The more-informed agent can communicate with the other using a communication channel. The objective is to design control and communication strategies for these agents such that a cost function is minimized while ensuring that the coordination constraint is satisfied. We consider various kinds of channels (noiseless and noisy). For noiseless channels, we provide a dynamic program that can be used to find the optimal control and communication strategies for both agents. When the channel is noisy, we show that the multi-stage problem can be broken down into separate single-stage problems under additional statistical assumptions on the stochastic environment.

Part I
Sequential Decision-making for Sensing

Chapter 2
Fixed-horizon Active Hypothesis Testing

2.1 Introduction

We frequently encounter scenarios wherein we would like to deduce whether one of several hypotheses is true by gathering data or evidence. This problem is referred to as multi-hypothesis testing. If we have access to multiple candidate experiments or data sources, we can adaptively select more informative experiments to infer the true hypothesis. This leads to a joint control and inference problem commonly referred to as active hypothesis testing. There are numerous ways of formulating this problem and the precise mathematical formulation depends on the target application.

In this paper, we consider a scenario in which there is an agent that can perform a fixed number of experiments. Subsequently, the agent can decide on one of the hypotheses using the collected data. The agent is also allowed to declare the experiments inconclusive if needed. In this fixed-horizon setting, we consider two formulations. In the first formulation, we are interested in minimizing the probability of incorrectly declaring a particular hypothesis to be true while ensuring that the probability of correctly declaring the same hypothesis is moderately high. Thus, we would like to declare that this hypothesis is true only if we have very strong evidence supporting it. This formulation is intended for applications like anomaly detection wherein incorrectly declaring the system to be safe (i.e. anomaly-free) can be very expensive whereas a moderate number of false alarms can be tolerated. This formulation, and thus our results, can be viewed as a generalization of the classical Chernoff-Stein lemma [34] to an active multi-hypothesis testing setup.
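As a point of reference, the classical single-experiment, binary-hypothesis case behind the Chernoff-Stein lemma can be simulated in a few lines. The sketch below uses two hypothetical observation distributions and a simple log-likelihood-ratio threshold test; it only illustrates the exponential decay of the misdeclaration probability and is not the strategy analyzed in this chapter.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative binary-observation experiment: p0 and p1 are the observation
# distributions under the null hypothesis (X = 0) and the alternative (X != 0).
p0 = np.array([0.6, 0.4])
p1 = np.array([0.4, 0.6])
D = np.sum(p0 * np.log(p0 / p1))      # Kullback-Leibler divergence D(p0 || p1)
llr = np.log(p0 / p1)                 # per-sample log-likelihood ratio

def misdeclaration_rate(N, delta=0.05, trials=20000):
    """Estimate the probability of declaring X = 0 when X != 0, using the test
    'declare 0 iff the empirical mean LLR exceeds D - delta'."""
    samples = rng.choice(2, size=(trials, N), p=p1)   # data generated under the alternative
    mean_llr = llr[samples].mean(axis=1)
    return np.mean(mean_llr >= D - delta)

for N in [25, 50, 100]:
    # empirical error vs. the exp(-N (D - delta)) bound from the standard achievability argument
    print(N, misdeclaration_rate(N), np.exp(-N * (D - 0.05)))
```

The active, multi-hypothesis setting studied in this chapter replaces the single fixed experiment with a data-driven choice of experiments, which is what changes both the achievable exponents and the strategies that attain them.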
In the second formulation, we are interested in minimizing the probability of making an incorrect inference (misclassification probability) while ensuring that the true hypothesis is declared conclusively with moderately high probability. The key difference between the first and second formulations is that the former is asymmetric, i.e., it focuses on reliably inferring a particular hypothesis, whereas the latter formulation is symmetric in the sense that it aims to avoid misclassifying every hypothesis. This symmetric formulation is of particular interest when the penalty for making any incorrect inference is significantly higher than the penalty for making no decision. In such cases, it is reasonable for the agent to abstain from drawing conclusions unless there is strong evidence supporting one of the hypotheses.

In both these problems, the agent can select experiments at each time in a data-driven manner. We refer to the strategy used for selecting these experiments as the experiment selection strategy. We refer to the strategy used by the agent to make an inference (or to declare its experiments inconclusive) based on all the data collected as the inference strategy. Thus, the two problems described above involve optimization over the space of inference and experiment selection strategy pairs.

Our contributions in this paper pertaining to these hypothesis testing problems can be summarized as follows.

1. We find lower and upper bounds on the optimal misclassification probabilities in our constrained optimization problems. These bounds are asymptotically (w.r.t. the time-horizon) tight under some mild assumptions. Thus, we characterize the optimal misclassification error exponents in each problem.

2. We propose a novel approach for designing experiment selection strategies. Unlike the classical approach which results in randomized and, in some cases, open-loop strategies, this approach allows us to design deterministic and adaptive experiment selection strategies that are asymptotically optimal.

3. We demonstrate numerically that the experiment selection strategies designed using our approach, when coupled with appropriate inference strategies, achieve superior non-asymptotic performance in comparison to the classical approaches.

The rest of the paper is organized as follows. In Section 2.1.1, we summarize key prior literature on hypothesis testing and discuss how our problem is related to various other formulations. In Section 2.1.2, we describe our notation. We describe our system model in Section 2.2.1 and in Section 2.2.2, we formulate our problems. We state the main results in Section 2.2.3 and sketch the proof of our results. We provide a detailed analysis of our hypothesis testing problems in Sections 2.3 and 2.4. In Section 2.5, we discuss an anomaly detection example and provide the results of our numerical experiments. We conclude the paper in Section 2.6.

2.1.1 Related Work

Hypothesis testing is a long-standing problem and has been addressed in various settings. Works that are closely related to active hypothesis testing can be broadly classified into the following paradigms.

2.1.1.1 Fixed-horizon Hypothesis Testing

In the simplest fixed-horizon hypothesis testing setup, we have binary hypotheses and a single experiment. The inference is made based on a fixed number of i.i.d. observations obtained by repeatedly performing this experiment.
In this setup, there are two popular formulations: (i) the Neyman-Pearson type asymmetric formulation used in the Chernoff-Stein lemma [34]; (ii) the unconstrained symmetric formulation that involves minimizing the Bayesian error probability [34]. While our asymmetric formulation is a generalization of the Neyman-Pearson type formulation, our symmetric formulation is different from Bayesian error probability minimization in [34]. The key difference is that in Bayesian error minimization, the agent is not allowed to declare its experiments inconclusive at the end of the horizon and must declare one of the hypotheses to be true. More general works in this paradigm include [16, 15, 146, 86, 116]. All the aforementioned works are passive in the sense that there is only one experiment and thus, the experiment selection strategy is trivial. Nevertheless, we employ many of the analysis techniques developed in these works, all of which are available in the form of lecture notes in [117]. An active fixed-horizon formulation has been considered in [107] in which the objective is to minimize the maximal error probability. This formulation is symmetric and does not allow the inconclusive declaration. Allowing the inconclusive declaration makes the nature of our analysis and results significantly different from the formulation in [107].

2.1.1.2 Sequential Hypothesis Testing

In sequential hypothesis testing, the time horizon is not fixed and the agent can continue to perform experiments until a stopping criterion is met. The objective then is to minimize the Bayesian risk, which is a combination of the expected stopping time and the error probability. Inspired by Wald's sequential probability ratio test (SPRT) [151], Chernoff first addressed the problem of active sequential hypothesis testing in [27]. This work was later generalized in [14, 107, 99]. These works focused on characterizing the asymptotic behavior of the Bayesian risk (certain non-asymptotic results are also provided in [99]) in the regime where the cost of performing experiments is much lower than the cost of making an incorrect inference.

Our fixed-horizon formulations are most closely related to this sequential hypothesis testing framework. In both these settings, the optimal error rates and the strategies used to achieve them are very similar. Intuitively, this is because in both the sequential setting and our fixed-horizon setting, the agent conclusively declares a hypothesis to be true only if there is strong evidence supporting it. If strong enough evidence is not found, the agent in the sequential setting continues to perform experiments whereas the agent in our setting simply declares the experiments inconclusive. Fixed-horizon formulations are useful in applications with hard time constraints where the agent does not have the luxury to keep performing experiments until strong enough evidence is obtained. A key distinction between sequential and fixed-horizon formulations is that in the sequential setting, much of the analysis is focused on computing the expected stopping time. On the other hand, in our fixed-horizon setting, the stopping time is fixed but there are constraints on the probability of making a conclusive correct inference. Proving that the designed strategies satisfy these constraints calls for a different set of mathematical tools.

In all the aforementioned works on the sequential setting, the experiment selection strategy has a randomized component. Although these randomized strategies are asymptotically optimal, their non-asymptotic performance may be poor.
Deterministic strategies were proposed in [27, 96] but in many cases, these strategies are not asymptotically optimal. In this paper, we develop an approach that helps us in designing deterministic, adaptive and asymptotically optimal experiment selection strategies. Moreover, in some scenarios like anomaly detection, our deterministic strategies have a significantly better non-asymptotic performance. For some specific kinds of sequential hypothesis testing problems, deterministic strategies were designed and analyzed in [76] and [81]. Our strategy design is similar in spirit to these designs. While these works are more general than ours in certain respects such as allowing different experiment costs and infinite observation spaces, they make some other restrictive technical assumptions which are not required in our analysis. These assumptions include the last assumption in [76, pg. 513] and Assumption 1.6 in [81]. Because of these assumptions, there are settings where our results apply while those of [76, 81] do not. In Section 2.5.0.6, we provide one simple example where this is the case. Some of the techniques used to design and analyze our strategy share some similarities with those used in multi-armed bandits [7].

2.1.1.3 Anomaly Detection

Many anomaly detection problems can be viewed as active hypothesis testing problems. In anomaly detection, there are multiple normal processes which exhibit a certain kind of statistical behavior. Among these processes, there could be an anomaly with statistical characteristics distinct from the normal processes. There are various mechanisms to probe these processes and the objective is to reliably detect the anomaly as quickly as possible. Some recent works in anomaly detection include [66, 26, 145, 54, 153]. All these works are in the sequential setting. It has been noted in [66, 145, 26] that deterministic strategies achieve better performance. We believe that these deterministic strategies may be related to ours. The problem of oddball detection has been considered in [148, 147]. The approach used in [148, 147] is similar to Chernoff's approach in [27] but the key innovation is that they do not assume knowledge of the underlying distributions.

2.1.1.4 Variable-length Coding with Feedback

This problem is concerned with designing variable-length codes for discrete memoryless communication channels with perfect feedback [21, 94]. It can be viewed as a sequential hypothesis testing problem and a detailed discussion on this can be found in [94]. Our framework can be used to formulate a fixed-horizon analogue of the variable-length coding problem wherein the receiver is allowed to declare the transmission inconclusive if needed.

2.1.2 Notation

Random variables are denoted by upper case letters (X), their realizations by the corresponding lower case letter (x). We use calligraphic fonts to denote sets (U). The probability simplex over a finite set U is denoted by ΔU. In general, subscripts denote time indices unless stated otherwise. For time indices n_1 ≤ n_2, Y_{n_1:n_2} denotes the collection of variables (Y_{n_1}, Y_{n_1+1}, ..., Y_{n_2}). For a strategy g, we use P^g[·] and E^g[·] to indicate that the probability and expectation depend on the choice of g. For a hypothesis i, E_i^g[·] denotes the expectation conditioned on hypothesis i. For a random variable X and an event E, E[X; E] denotes E[X 1_E], where 1_E is the indicator function associated with the event E. The cross-entropy between two distributions p and q over a finite space Y is given by

H(p, q) ≐ − Σ_{y∈Y} p(y) log q(y).   (2.1)

The Kullback-Leibler divergence between distributions p and q is given by

D(p‖q) ≐ Σ_{y∈Y} p(y) log (p(y) / q(y)).   (2.2)

We use the convention that if x ≤ 0, then log x ≐ −∞.
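For concreteness, here is a minimal numpy rendering of the two quantities in (2.1) and (2.2); the example distributions are arbitrary and only serve to illustrate the definitions.

```python
import numpy as np

def cross_entropy(p, q):
    """H(p, q) = -sum_y p(y) log q(y) for distributions on a finite alphabet."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return -np.sum(p * np.log(q))

def kl_divergence(p, q):
    """D(p || q) = sum_y p(y) log(p(y) / q(y)); terms with p(y) = 0 contribute 0."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p = np.array([0.8, 0.2])
q = np.array([0.4, 0.6])
# The divergence is the extra cross-entropy of q over the entropy of p: D(p||q) = H(p, q) - H(p, p).
print(cross_entropy(p, q), kl_divergence(p, q))
```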
The cross-entropy between two distributionsp andq over a finite space Y is given by H(p,q) =− X y∈Y p(y) logq(y). (2.1) The Kullback-Leibler divergence between distributions p and q is given by D(p||q) = X y∈Y p(y) log p(y) q(y) . (2.2) We use the convention that if x≤ 0, then logx . =−∞. 2.2 Minimum Misclassification Error Problems In this section, we will formulate the two active hypothesis testing problems. We will describe our assumptions and state our main results on the asymptotic behavior of optimal misclassification probabilities. 2.2.1 System Model LetX ={0, 1,...,M− 1} be a finite set of hypotheses and let the random variable X denote the true hypothesis. The prior probability on X is ρ 1 . Without loss of generality, let us assume that the distributionρ 1 has full support. At each timen = 1, 2,..., an agent can perform an experiment 12 U n ∈U and obtain an observation Y n ∈Y. We assume that the setsU andY are finite. The observation Y n at time n is given by Y n =ξ(X,U n ,W n ), (2.3) where{W n : n = 1, 2,...} is a collection of mutually independent and identically distributed primitive random variables. The probability of observing y after performing an experiment u under hypothesis i is denoted by p u i (y), p u i (y) . =P(Y n =y|X =i,U n =u). The time horizon, that is, the total number of experiments performed is fixed a priori to N <∞. At timen = 1, 2,...,N, the information available to the agent, denoted by I n , is the collection of all experiments performed and the corresponding observations up to time n− 1, I n . ={U 1:n−1 ,Y 1:n−1 }. (2.4) Let the collection of all possible realizations of information I n be denoted byI n . At time n, the agent selects a distribution over the set of actionsU according to an experiment selection rule g n :I n → ΔU and the action U n is randomly drawn from the distribution g n (I n ), that is, U n ∼g n (I n ). (2.5) For a given experimentu∈U and information realizationI∈I n , the probabilityP g [U n =u|I n =I] is denoted by g n (I : u). The sequence{g n ,n = 1,...,N} is denoted by g and referred to as the experiment selection strategy. Let the collection of all such experiment selection strategies beG. After performing N experiments, the agent can declare one of the hypotheses to be true or it can declare that its experiments were inconclusive. We refer to this final declaration as the agent’s inference decision and denote it by ˆ X N . Thus, the inference decision can take values inX∪{ℵ}, whereℵ denotes the inconclusive declaration. Using the information I N+1 , the agent chooses a 13 distribution over the set of hypotheses according to an inference strategy f :I N+1 → Δ(X∪{ℵ}) and the inference ˆ X N is drawn from the distribution f(I N+1 ), i.e. ˆ X N ∼f(I N+1 ). (2.6) For a given inference ˆ x∈X∪{ℵ} and information realizationI∈I N+1 , the probabilityP f,g [ ˆ X N = ˆ x|I N+1 =I] is denoted byf N (I : ˆ x). Let the set of all inference strategies beF. For an experiment selection strategy g and an inference strategy f, we define the following probabilities. Definition 2.1. Fori∈X , letψ N (i) be the probability that the agent infers hypothesis i given that the true hypothesis is i, i.e. ψ N (i) . =P f,g [ ˆ X N =i|X =i]. (2.7) We refer to ψ N (i) as the correct-inference probability of type-i. Let φ N (i) be the probability that the agent infers i given that the true hypothesis is not i, i.e. φ N (i) . =P f,g [ ˆ X N =i|X6=i]. (2.8) We refer to φ N (i) as the misclassification probability of type-i. 
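To make the roles of the experiment selection strategy, the inference strategy and the probabilities in Definition 2.1 concrete, the following is a minimal simulation sketch that estimates ψ_N(i) and φ_N(i) by Monte Carlo. The likelihood table, the uniform prior, the random experiment selection rule and the posterior-threshold inference rule below are illustrative assumptions, not the designs developed later in this chapter.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy instance (not from the dissertation): M = 3 hypotheses,
# two experiments, binary observations with likelihoods p[u, i, y] = P(Y=y | X=i, U=u).
M, U, Y = 3, 2, 2
p = np.array([
    [[0.8, 0.2], [0.4, 0.6], [0.8, 0.2]],   # experiment u = 0
    [[0.8, 0.2], [0.8, 0.2], [0.4, 0.6]],   # experiment u = 1
])
rho1 = np.full(M, 1.0 / M)                  # uniform prior on the hypotheses

def bayes_update(rho, u, y):
    """Posterior update from rho_n to rho_{n+1} after observing y under experiment u."""
    post = rho * p[u, :, y]
    return post / post.sum()

def run_trial(i_true, g, f, N):
    """One horizon-N trial; returns the inference decision (a hypothesis index or None)."""
    rho = rho1.copy()
    for _ in range(N):
        u = g(rho)                              # experiment selection rule
        y = rng.choice(Y, p=p[u, i_true])       # observation drawn from p^u_i
        rho = bayes_update(rho, u, y)
    return f(rho)                               # None encodes the inconclusive declaration

def estimate_probabilities(i, g, f, N, trials=2000):
    """Monte Carlo estimates of the type-i probabilities psi_N(i) and phi_N(i)."""
    hits = sum(run_trial(i, g, f, N) == i for _ in range(trials))
    # Conditioning on X != i reduces to a uniform draw over the alternates
    # because the prior here is uniform.
    alts = [j for j in range(M) if j != i]
    misses = sum(run_trial(rng.choice(alts), g, f, N) == i for _ in range(trials))
    return hits / trials, misses / trials

# Example: uniformly random experiments; declare i = 0 only if its posterior is dominant.
g_unif = lambda rho: rng.integers(U)
f_thresh = lambda rho: 0 if rho[0] > 0.99 else None
print(estimate_probabilities(0, g_unif, f_thresh, N=50))
```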
We will also be interested in the event that the agent declares an incorrect hypothesis to be true. That is, we will consider the event∪ i∈X { ˆ X N = i,X 6= i}. We refer to this event as the misclassification event. Definition 2.2. Let γ N be the probability of making an incorrect inference, i.e. γ N . =P f,g [∪ i∈X { ˆ X N =i,X6=i}]. (2.9) We will refer to γ N as the misclassification probability. Remark 2.1. Note that the misclassification probability γ N can be expressed in terms of the mis- classification probabilities φ N (i) of type-i in the following manner γ N = X i∈X P f,g [ ˆ X N =i|X6=i]P[X6=i] (2.10) 14 = X i∈X φ N (i)(1−ρ 1 (i)). (2.11) 2.2.2 Problem Formulation and Preliminaries We will consider two active hypothesis testing formulations. The first one is an asymmetric formu- lation in which the focus is on a particular hypothesisi and involves minimizing the misclassification probability φ N (i) of type-i. The second formulation is a symmetric formulation that involves min- imizing the misclassification probability γ N . 2.2.2.1 The Asymmetric Formulation (P1) In this formulation, we are interested in designing an experiment selection strategy g and an inference strategy f that minimize the misclassification probability φ N (i) of type-i subject to the constraint that the correct-inference probabilityψ N (i) of type-i is sufficiently large. In other words, we would like to solve the following optimization problem: inf f∈F,g∈G φ N (i) (P1) subject to ψ N (i)≥ 1− N , where 0 < N < 1. Let the infimum value of this optimization problem be φ ∗ N (i). Note that this problem is always feasible because the agent can trivially satisfy the correct-inference prob- ability constraint by always declaring hypothesis i. We refer to this problem as the minimum misclassification error problem of type-i or simply Problem (P1). Remark 2.2. Problem (P1) can be seen as a binary hypothesis testing problem with null hypothesis {X =i} and alternate hypothesis{X6=i}. We observe that when there is only one experiment and two hypotheses, this formulation is identical to that of the Chernoff-Stein lemma in [34]. This formulation is helpful in modeling scenarios in which the cost of incorrectly declaring a particular hypothesis to be true is very high. For instance, consider a system which can potentially have various types of anomalies. We are interested in testing whether the system has no anomalies (hypothesis X =i) or has some anomaly (hypothesis X6=i). In such systems, a few false alarms may be tolerable but declaring that the system is free of anomalies when there is one can be very 15 expensive. Therefore, we would like to minimize the probability of falsely declaring the system to be free of anomalies subject to the constraint that the probability of raising false alarms is sufficiently small. Clearly, this scenario can be modeled using the asymmetric formulation (P1). 2.2.2.2 The Symmetric Formulation (P2) In this formulation, we are interested in designing an experiment selection strategy g and an infer- ence strategy f that minimize the misclassification probability γ N while satisfying the constraint that the correct-inference probabilityψ N (i) of type-i is sufficiently large for every hypothesisi∈X . In other words, we would like to solve the following optimization problem: min f∈F,g∈G γ N (P2) subject to ψ N (i)≥ 1− N ,∀i∈X, where 0 < N < 1. Let γ ∗ N denote the infimum value of this optimization problem. We define γ ∗ N . = ∞ if the optimization problem is infeasible. 
We refer to this problem as the minimum misclassification error problem or simply Problem (P2). The above formulation is intended for scenarios where the penalty for making an incorrect in- ference is much higher than the penalty for not making any inference. In such cases, it is reasonable for the agent to abstain from drawing conclusions when the evidence is not strong enough. The constraints on type-i correct-inference probabilities ψ N (i) ensure that the agent does not abstain from drawing conclusions too often. Thus, the optimization problem (P2) aims to find experiment selection and inference strategies that misclassify the least among all those strategies that make the correct inference with high probability. Definition 2.3 (Log-likelihood ratio). For an experimentu∈U and any pair of hypothesesi,j∈X let λ i j (u,y) . = log p u i (y) p u j (y) (2.12) be the log-likelihood ratio associated with an observation y∈Y. We make the following assumptions on our system model. 16 Assumption 2.1 (Common Support). For any given experiment u∈U, there exists a non-empty setY(u)⊆Y such that for every hypothesis i∈X , the support of the distribution p u i isY(u). In other words, for every hypothesis i∈X , p u i (y)> 0 if and only if y∈Y(u). Let B > 0 be a constant such that|λ i j (u,y)| < B for every experiment u∈U, observation y∈Y(u) and any pair of hypotheses i,j∈X . Note that the existence of such a constant B is guaranteed because of Assumption 3.1. Since our focus is on minimizing the incorrect inference probabilities, we make the following as- sumption on N . This assumption captures the fact that we need the correct-inference probabilities to be large, but not necessarily too large. Assumption 2.2. We have that the bound 1− N on the type-i correct-inference probability satisfies N → 0. Further, there exists a constant b> 0 such that lim N→∞ N N b =∞. The following assumption is standard in the active hypothesis testing literature and ensures that every pair of hypotheses can be distinguished by performing some experiment. Assumption 2.3. For any pair of distinct hypotheses i,j∈X , there exists an experiment u∈U such that D(p u i ||p u j )> 0. Before stating our main results, we will define some important quantities. Definition 2.4 (Max-min Kullback-Leibler Divergence). For each hypothesis i∈X , define D ∗ (i) . = max α∈ΔU min j6=i X u α(u)D(p u i ||p u j ) (2.13) = min β∈Δ ˜ X i max u∈U X j6=i β(j)D(p u i ||p u j ), (2.14) where ˜ X i = X\{i}. Note that α is a distribution over the set of experiments U and β is a distribution over the set of alternate hypotheses ˜ X i . The max-min Kullback-Leibler divergenceD ∗ (i) can be viewed as the value of a two-player zero-sum game [27, 110]. In this zero-sum game, the maximizing player selects a mixed strategy α ∈ ΔU and the minimizing player selects a mixed strategy β∈ Δ ˜ X i . The payoff associated with these strategies is X u X j6=i α(u)D(p u i ||p u j )β(j). (2.15) 17 The equality of the min-max and max-min values follows from the minimax theorem [110] because the setsU andX are finite and the Kullback-Leibler divergences are bounded byB due to Assumption 3.1. Let the max-minimizer in (3.13) be denoted byα i∗ and the min-maximizer in (3.14) be denoted by β i∗ . Definition 2.5 (Posterior Belief). The posterior beliefρ n on the hypothesisX based on information I n is given by ρ n (i) =P[X =i|U 1:n−1 ,Y 1:n−1 ] =P[X =i|I n ]. 
(2.16) Note that given a realization of the experiments and observations until time n, the posterior belief does not depend on the experiment selection strategy g or the inference strategy f. Definition 2.6 (Confidence Level). For a hypothesis i∈X and a distribution ρ onX such that 0<ρ(i)< 1, the confidence levelC i (ρ) associated with hypothesis i is defined as C i (ρ) . = log ρ(i) 1−ρ(i) . (2.17) The confidence level is the logarithm of the ratio of the probability (w.r.t. the distribution ρ) that hypothesis i is true versus the probability that hypothesis i is not true. We emphasize that this notion of confidence is not to be confused with the confidence levels in the context of confidence intervals [52]. 2.2.3 Main Results We will now state our three main results on some asymptotic aspects of Problems (P1) and (P2). 2.2.3.1 Asymptotic Decay-Rate of Optimal Misclassification Probabilityφ ∗ N (i) in Problem (P1) The following theorem can be viewed as a generalization of the classical Chernoff-Stein lemma [34] to the setting of active hypothesis testing. It states that the optimal value φ ∗ N (i) in Problem (P1) decays exponentially with the horizon N and its asymptotic rate of decay is equal to the max-min Kullback-Leibler divergence D ∗ (i) defined in Definition 3.3. 18 Theorem 2.1. The asymptotic rate of the optimal misclassification probability in Problem (P1) is given by lim N→∞ − 1 N logφ ∗ N (i) =D ∗ (i). (2.18) Proof sketch: To prove this result, we use the following approach. We first establish a lower bound on the misclassification probability φ N (i) for any pair of experiment selection and inference strategies (g,f) that satisfy the constraintψ N (i)≥ 1− N . The asymptotic decay-rate of this lower bound is equal to the max-min Kullback-Leibler divergence D ∗ (i). We then construct experiment selection and inference strategies that asymptotically achieve this rate. The details of the proof of this result are included in Section 2.3. The inference strategy constructed in the achievability proof of Theorem 2.1 has the following threshold structure f N (ρ N+1 :i) = 1 ifC i (ρ N+1 )−C i (ρ 1 )≥θ N 0 otherwise, (2.19) where θ N = ND ∗ (i)−o(N). The precise value of the threshold θ N is provided in Appendix A.7. The experiment selection strategy used is as follows: at each timen, randomly select an experiment u from the max-min distributionα i∗ as defined in Definition 3.3. This strategy design is motivated from the design in [27]. Note that this is a completely open-loop strategy, that is, the experiments are selected without using any of the information acquired in the past. The open-loop randomized experiment selection strategy described above is asymptotically op- timal for Problem (P1). However, in the non-asymptotic regime, there may be other experiment selection strategies that perform significantly better than this open-loop randomized strategy. For some specialized problems such as anomaly detection, it was observed in recent active hypothe- sis testing literature [66, 145, 26] that there exist deterministic and adaptive strategies that are asymptotically optimal and also outperform the classical Chernoff-type randomized strategies in the non-asymptotic regime. Even in [27], Chernoff proposed a deterministic and adaptive strat- egy (see Section 7, [27]) but he also presented a counter-example in which the strategy was not asymptotically optimal. 19 In this paper, we propose a class of deterministic and adaptive strategies that are asymptotically optimal for Problem (P1). 
To the best of our knowledge, this is the first proof of asymptotic optimality of such strategies in a general active hypothesis testing setting. For a simple anomaly detection example, we also demonstrate numerically in Section 10.3 that our deterministic adaptive strategy (when paired with an appropriate inference strategy) performs significantly better than the open-loop randomized strategy α i∗ in the non-asymptotic regime. 2.2.3.2 Asymptotically Optimal Experiment Selection Strategies for Problem (P1) Let the moment generating function of the negative log-likelihood ratios−λ i j (u,Y ) for an experi- ment u be denoted by μ i j (u,s), i.e. μ i j (u,s) . =E i [exp −sλ i j (u,Y ) ]. (2.20) Definition 2.7. For a given experiment u∈U, belief ρ∈ ΔX and 0≤s≤ 1, define M i (u,ρ,s) . = P j6=i (ρ(j)) s μ i j (u,s) P j6=i (ρ(j)) s . (2.21) Using this definition, we will now define a class of experiment selection strategies. Criterion 2.1. For a given time horizon N, consider an experiment selection strategy g N . = (g N 1 ,g N 2 ,...,g N N ) such that at each time n, α n . =g N n (I n )∈ ΔU satisfies X u∈U α n (u)M i (u,ρ n ,s N )≤ X u∈U α i∗ (u)M i (u,ρ n ,s N ), (2.22) with probability 1. Here, ρ n is the posterior belief at time n and s N . = min 1, s 2 log M N NB 2 . (2.23) Observation 2.1. Criterion 2.1 captures three experiment selection strategies of interest. These are: 1. Open-loop Randomized Strategy (ORS): At any time n, g N n (I n ) . =α i∗ . This is the open-loop randomized strategy discussed earlier in Section 2.2.3.1. 20 2. Deterministic Adaptive Strategy (DAS): At each timen, g N n (I n ) selects the experiment u∈U that minimizes M i (u,ρ n ,s N ). This is a deterministic and adaptive strategy. 3. Deterministic Adaptive Strategy with Restricted Support (DAS-RS): At each time n, g N n (I n ) selects the experiment u from the support of α i∗ that minimizesM i (u,ρ n ,s N ). This is also a deterministic and adaptive strategy. For all these experiment selection strategies, we have the following result. Theorem 2.2. Let f N be as defined in (2.19) and g N be an experiment selection strategy that satisfies Criterion 2.1. Then the class of strategies{(f N ,g N ) :N∈N} is asymptotically optimal. In other words, if ψ N (i) and φ N (i) are the correct-inference and misclassification probabilities associated with the strategy pair (f N ,g N ), then ψ N (i)≥ 1− N for every N and lim N→∞ − 1 N logφ N (i) =D ∗ (i). (2.24) Proof. See Section 2.3.3. Remark 2.3. Theorem 2.2 holds as long as the value of s N used in Criterion 2.1 satisfies s N → 0 and Ns N →∞. We chose the value of s N that maximizes the value of θ N defined in (B.11). Remark 2.4 (Zero-sum Game Interpretation). Note that selecting an experiment that minimizes M i (u,ρ,s) over the setU is equivalent to selecting an experiment that maximizes (1−M i (u,ρ,s))/s. When s is small, this function can be approximated as follows 1−M i (u,ρ,s) s = P j6=i (ρ(j)) s (1−μ i j (u,s)) s P j6=i (ρ(j)) s (2.25) ≈ P j6=i (ρ(j)) s D(p u i ||p u j ) P j6=i (ρ(j)) s , (2.26) since (1−μ i j (u,s))/s→D(p u i ||p u j ) as s→ 0. Thus, we can interpret the strategy DAS in terms of the zero-sum game discussed earlier in Section 2.2.2 after Definition 3.3. In the zero-sum game, if the minimizing player selects an alternate hypothesisj with probabilityβ(j) = (ρ(j)) s / P k6=i (ρ(k)) s , then the strategy DAS selects an approximate best-response to the minimizing player’s strategy with respect to the payoff function in (2.15). 
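As a rough illustration of Criterion 2.1 and Observation 2.1, the sketch below evaluates M_i(u, ρ, s) from Definition 2.7 for a finite observation alphabet and implements the DAS rule, which picks the experiment minimizing this quantity. The toy likelihood table and the fixed value of s (standing in for s_N) are assumptions made only for the example.

```python
import numpy as np

def mgf_neg_llr(p, i, j, u, s):
    """mu^i_j(u, s) = E_i[exp(-s * lambda^i_j(u, Y))] for a finite observation alphabet,
    which equals sum_y p^u_i(y)^(1-s) * p^u_j(y)^s under the common-support assumption."""
    pi, pj = np.asarray(p[u][i], float), np.asarray(p[u][j], float)
    return float(np.sum(pi ** (1.0 - s) * pj ** s))

def M_score(p, i, u, rho, s):
    """M_i(u, rho, s) from Definition 2.7: a rho^s-weighted average of the MGFs."""
    alts = [j for j in range(len(rho)) if j != i]
    w = np.array([rho[j] ** s for j in alts])
    mgfs = np.array([mgf_neg_llr(p, i, j, u, s) for j in alts])
    return float(np.dot(w, mgfs) / np.sum(w))

def das_experiment(p, i, rho, s):
    """Deterministic Adaptive Strategy (DAS): choose the experiment minimizing M_i(u, rho, s)."""
    scores = [M_score(p, i, u, rho, s) for u in range(len(p))]
    return int(np.argmin(scores))

# Toy likelihoods p[u][j][y] (assumed); s plays the role of s_N in Criterion 2.1.
p = [[[0.4, 0.6], [0.6, 0.4], [0.4, 0.6]],
     [[0.4, 0.6], [0.4, 0.6], [0.6, 0.4]]]
rho = np.array([0.5, 0.3, 0.2])
print(das_experiment(p, i=0, rho=rho, s=0.1))
```

DAS-RS follows the same pattern with the minimization restricted to the support of α^{i*}.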
21 Given a horizonN, the strategies DAS and DAS-RS described in this section are time-invariant functions of the posterior belief. However, they depend on the value ofs N and thus, on the horizon N of the problem. In some cases, these strategies turn out to be independent of the value s N which results in fully stationary (with respect to the posterior belief ρ n ) strategies. We show that this is the case in the example discussed in Section 10.3. It may be possible to show that for such stationary strategies, Lemma 2 in [27] holds and thus, they are asymptotically optimal even in the sequential formulation in [27]. 2.2.3.3 Asymptotic Decay-Rate of Optimal Misclassification Probabilityγ ∗ N in Problem (P2) Similar to the result in Theorem 2.1, we can characterize the decay-rate of the optimal misclassifi- cation probability γ ∗ N as follows. Theorem 2.3. The optimal misclassification probability γ ∗ N in Problem (P2) decays exponentially with the horizon N and its asymptotic decay-rate is equal to min i∈X D ∗ (i), i.e. lim N→∞ − 1 N logγ ∗ N = min i∈X D ∗ (i). (2.27) Proof sketch: The methodology used for proving this result is very similar to that of Theorem 2.1. We first obtain a lower bound on the misclassification probabilityγ N for any pair of experiment selection and inference strategies (g,f) that satisfy the constraints of Problem (P2). This lower bound is very closely related to the lower bounds established for Problem (P1). Then we construct a class of experiment selection and inference strategies that achieve this lower bound asymptoti- cally. This class of experiment selection strategies includes the randomized strategy proposed in [27] and also, deterministic strategies similar to DAS and DAS-RS introduced in Section 2.2.3.2. The derivation of the lower bound and the construction of the experiment selection and inference strategies are discussed in detail in Section 2.4. 22 2.3 Analysis of Problem (P1) In this section, we analyze the asymmetric formulation (P1). To optimize the misclassification error of type-i in Problem (P1), we need to design both an experiment selection strategy g and an inference strategy f. We will first arbitrarily fix the experiment selection strategy g. For a fixed g, we derive a lower bound on the misclassification probability φ N (i) associated with any inference strategyf that satisfies the constraint in Problem (P1). These bounds are obtained using the weak converse approach described in [117]. In these derivations, we will introduce some useful properties of the confidence level defined in Definition 3.5. Using these properties, we will then weaken the lower bounds to derive a bound that does not depend on the strategy g. Further, we will show that any experiment selection strategy satisfying Criterion 2.1 defined in Section 2.2.3.2, coupled with an appropriate inference strategy, can asymptotically achieve this strategy-independent lower bound. Finally, we will discuss some methods for obtaining better non-asymptotic lower bounds on the misclassification probability φ N (i) using the strong converse theorem in [117]. 2.3.1 Lower Bound for a Fixed Experiment Selection Strategy In this sub-section, we fix the experiment selection strategy to an arbitrary choice g and analyze the problem of optimizing the inference strategy for this particular experiment selection strategy. This analysis will help us in obtaining a lower bound on the misclassification probability and in designing inference strategies for Problem (P1). 
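Since Theorems 2.1 and 2.3 are stated in terms of the max-min Kullback-Leibler divergence, it may help to see how D*(i) and the max-minimizer α^{i*} of Definition 2.4 can be computed numerically. The sketch below solves the equivalent zero-sum matrix game with a standard linear program via scipy.optimize.linprog; the toy likelihood table is an assumption used only for illustration.

```python
import numpy as np
from scipy.optimize import linprog

def kl(p, q):
    """Kullback-Leibler divergence D(p||q) for finite distributions with common support."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def max_min_divergence(p, i):
    """D*(i) and the max-minimizer alpha^{i*} of Definition 2.4.

    p[u][j] is the observation distribution p^u_j over a finite alphabet; the inner
    minimum over j != i is handled by the usual zero-sum-game linear program."""
    n_u, n_x = len(p), len(p[0])
    alts = [j for j in range(n_x) if j != i]
    A = np.array([[kl(p[u][i], p[u][j]) for j in alts] for u in range(n_u)])  # payoff matrix
    # Variables: (alpha_0, ..., alpha_{U-1}, t); maximize t  <=>  minimize -t.
    c = np.zeros(n_u + 1); c[-1] = -1.0
    A_ub = np.hstack([-A.T, np.ones((len(alts), 1))])   # t - sum_u alpha(u) A[u, j] <= 0
    b_ub = np.zeros(len(alts))
    A_eq = np.hstack([np.ones((1, n_u)), np.zeros((1, 1))])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, 1)] * n_u + [(None, None)])
    return res.x[-1], res.x[:n_u]

# Toy likelihoods p[u][j] (assumed): 2 experiments, 3 hypotheses, binary observations.
p = [[[0.4, 0.6], [0.6, 0.4], [0.4, 0.6]],
     [[0.4, 0.6], [0.4, 0.6], [0.6, 0.4]]]
print(max_min_divergence(p, i=0))
```

Looping this computation over all hypotheses and taking the minimum gives the symmetric rate min_i D*(i) appearing in Theorem 2.3.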
Consider the following optimization problem. min f∈F φ N (i) (P3) subject to ψ N (i)≥ 1− N . To analyze problem (P3), we will first define some useful quantities related to the confidence level in Definition 3.5. For the hypothesisi and a strategyg∈G, define the likelihood distributionsP g N,i andQ g N,i over the setI N+1 as follows P g N,i (I N+1 ) . =P g [I N+1 =I N+1 |X =i] (2.28) 23 Q g N,i (I N+1 ) . =P g [I N+1 =I N+1 |X6=i]. (2.29) Proposition 2.1. Under any experiment selection strategy g, with probability 1, we have log P g N,i (I N+1 ) Q g N,i (I N+1 ) =C i (ρ N+1 )−C i (ρ 1 ). (2.30) Proof. For any instanceI N+1 ∈I N+1 such thatP g [I N+1 =I N+1 ]> 0, we have the following using Bayes’ rule log P g N,i (I N+1 ) Q g N,i (I N+1 ) =C i (ρ N+1 )−C i (ρ 1 ), (2.31) whereρ N+1 is the posterior belief associated with the instanceI N+1 . Note that because of Assump- tion 3.1,P g [I N+1 =I N+1 ]> 0 implies that both the numerator and the denominator in (2.31) are non-zero and thus, the expression in (2.31) is well-defined. Since equation (2.31) is true for every instanceI N+1 with non-zero probability, we have our result. Thus, the increment in confidence level is a log-likelihood ratio. Definition 2.8 (Expected Confidence Rate). We define the expected confidence rate J g N (i) as J g N (i) . = 1 N E g i [C i (ρ N+1 )−C i (ρ 1 )]. (2.32) Remark 2.5. Due to Proposition 2.1, the expected confidence rate is the averaged Kullback-Leibler divergence between the distributions P g N,i and Q g N,i . That is, J g N (i) = 1 N D(P g N,i ||Q g N,i ). When the experiment selection strategy is fixed, we can view the problem (P1) as a one-shot hypothesis testing problem in which we are trying to infer whether the collection of actions and observationsI N+1 is drawn from distributionP g N,i orQ g N,i . This interpretation allows us to use the classical results [34, 117] on one-shot hypothesis testing and derive various properties. 24 We will first obtain a lower bound on the misclassification probability φ N (i) in Problem (P3) using the data-processing inequality of Kullback-Leibler divergences [117]. This is commonly known as the weak converse [86, 117]. Lemma 2.1 (Weak Converse). Let g be any given experiment selection strategy. Then for any inference strategy f such that ψ N (i)≥ 1− N , we have − 1 N logφ N (i)≤ J g N (i) 1− N + log 2 N(1− N ) , (2.33) where J g N (i) is the expected confidence rate. Therefore, − 1 N logφ ∗ N (i)≤ sup g∈G J g N (i) 1− N + log 2 N(1− N ) , (2.34) where φ ∗ N (i) is the optimum value in Problem (P1). Proof. See Appendix A.2. Note that this lemma is true for every 0≤ N < 1. The bound (3.17) suggests that we can obtain a strategy-independent lower bound on φ ∗ N by obtaining upper bounds on the quantity sup g∈G J g N (i). In the next sub-section, we will focus on obtaining this upper bound. 2.3.2 Strategy-Independent Lower Bound We will first describe some important properties of the confidence levelC i (ρ) which will be used to derive a strategy-independent lower bound on the misclassifcation probability in problem (P1). Definition 2.9. For a given experiment selection strategy g and an alternate hypothesis j6= i, define the total log-likelihood ratio up to time n as Z n (j) . = n X k=1 λ i j (U k ,Y k ), where the log-likelihood ratio λ i j is as defined in equation (3.11). Also, let ¯ Z n . = X j6=i β i∗ (j)Z n (j), 25 whereβ i∗ is the min-maximizing distribution in Definition 3.3. 
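One concrete way to use the weak converse of Lemma 2.1 is to estimate the expected confidence rate J^g_N(i) of Definition 2.8 by simulation for a candidate strategy g. The sketch below does this for an assumed finite-alphabet model and an arbitrary experiment selection rule; the estimate can then be compared against the strategy-independent bounds developed next.

```python
import numpy as np

rng = np.random.default_rng(1)

def confidence(rho, i):
    """C_i(rho) = log( rho(i) / (1 - rho(i)) ), as in Definition 2.6."""
    return np.log(rho[i] / (1.0 - rho[i]))

def expected_confidence_rate(p, i, g, rho1, N, trials=1000):
    """Monte Carlo estimate of J^g_N(i) = (1/N) E^g_i[ C_i(rho_{N+1}) - C_i(rho_1) ]."""
    p = np.asarray(p, float)
    rho1 = np.asarray(rho1, float)
    total = 0.0
    for _ in range(trials):
        rho = rho1.copy()
        for _ in range(N):
            u = g(rho)
            y = rng.choice(p.shape[2], p=p[u, i])   # observation drawn from p^u_i
            rho = rho * p[u, :, y]
            rho /= rho.sum()
        total += confidence(rho, i) - confidence(rho1, i)
    return total / (trials * N)

# Assumed toy model and an open-loop rule that picks experiments uniformly at random.
p = [[[0.4, 0.6], [0.6, 0.4], [0.4, 0.6]],
     [[0.4, 0.6], [0.4, 0.6], [0.6, 0.4]]]
g_unif = lambda rho: rng.integers(len(p))
print(expected_confidence_rate(p, i=0, g=g_unif, rho1=[1/3, 1/3, 1/3], N=100))
```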
Notice that the processes Z n (j) for each j6=i and ¯ Z n are sub-martingales with respect to the filtration I n+1 when X =i. We will now establish the relationship between the total log-likelihood ratiosZ n (j) and the confidence level C i . Lemma 2.2. For any experiment selection strategy g and for each 1≤n≤N, we have C i (ρ n+1 )−C i (ρ 1 ) =− log X j6=i exp log ˜ ρ 1 (j)−Z n (j) , (2.35) where ˜ ρ 1 (j) =ρ 1 (j)/(1−ρ 1 (i)). Proof. This is a consequence of simple algebraic manipulations. See Appendix A.3 for details. Note that for any vector z, we have− log P j exp(−z(j))≈ min j z(j). Thus, the interpretation of this lemma is that the increment in confidence levelC i (ρ N+1 )−C i (ρ 1 ) approximately represents the smallest total log-likelihood ratio min j6=i Z N (j). Therefore, this lemma can be seen as the first step towards establishing the relationship between the average expected increment in confidenceJ g N and the max-min Kullback-Leibler divergenceD ∗ (i). To formally establish this relationship, we use Lemma 3.5 to decompose the increment in confidence into a non-positive cross-entropy term and a sub-martingale. This decomposition will be used in deriving strategy-independent lower bounds, both weak and strong, on the misclassification probability in Problem (P1). Lemma 2.3 (Decomposition). For any experiment selection strategy g, we have C i (ρ n+1 )−C i (ρ 1 ) =−H(β i∗ , ˜ ρ n+1 ) + ¯ Z n +H(β i∗ , ˜ ρ 1 ). Here, ˜ ρ n (j) =ρ n (j)/(1−ρ n (i)). As a result of the non-negativity of cross entropy, we have C i (ρ n+1 )−C i (ρ 1 )≤ ¯ Z n +H(β i∗ , ˜ ρ 1 ). (2.36) Proof. This is an algebraic consequence of Lemma 3.5. See Appendix B.5.2 for details. 26 Using Lemma 3.6, we will now establish the relationship between the confidence rate J g N (i) and the max-min Kullback-Leibler divergence D ∗ (i) defined in equation (3.13). This, in conjunction with Lemma 3.1, will give us a strategy-independent lower bound on φ ∗ N (i). Lemma 2.4. For any experiment selection strategy g, we have J g N (i)≤D ∗ (i) + H(β i∗ , ˜ ρ 1 ) N , (2.37) where ˜ ρ 1 (j) = ρ 1 (j)/(1−ρ 1 (i)). Further, using Lemma 3.1 and Assumption 3.2, we can conclude that lim sup N→∞ − 1 N logφ ∗ N (i)≤D ∗ (i). (2.38) Proof. See Appendix A.5. 2.3.3 Achievability of Decay-Rate D ∗ (i) in Problem (P1) We will now construct inference and experiment selection strategies that satisfy the constraint on hit probability ψ N (i) and asymptotically achieve misclassification decay-rate of D ∗ (i). We will begin with the construction and analysis of the inference strategy. The following is an upper bound on the misclassification probability associated with a determin- istic confidence-threshold based inference strategy of the form discussed in Section 3.3. Incidentally, this bound does not depend on the experiment selection strategy g. Similar upper bounds on error probabilities are commonly used in hypothesis testing [27, 34]. Lemma 2.5. Let f be a deterministic inference strategy in which hypothesis i is decided only if C i (ρ N+1 )−C i (ρ 1 )≥θ. Then φ N (i)≤e −θ . Proof. See Appendix A.6. The inference strategy f N is constructed as follows f N (ρ N+1 :i) = 1 ifC i (ρ N+1 )−C i (ρ 1 )≥θ N 0 otherwise, (2.39) 27 where θ N =ND ∗ (i)−o(N) and its precise value is provided in equation (A.61) in Appendix A.7. Due to Lemma B.3, we can say that under the inference strategy f N , any experiment selection strategy g achieves achieves φ N (i)≤ e −θ N . 
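The confidence-threshold inference rule in (2.39) and the guarantee of Lemma 2.5 are straightforward to implement; a minimal sketch follows. The particular prior, posterior and threshold values are placeholders, since the text sets θ_N close to N D*(i) minus a sublinear correction specified in Appendix A.7.

```python
import numpy as np

def confidence_increment(rho_final, rho_prior, i):
    """C_i(rho_{N+1}) - C_i(rho_1), the statistic used by the threshold test in (2.39)."""
    c = lambda rho: np.log(rho[i] / (1.0 - rho[i]))
    return c(np.asarray(rho_final, float)) - c(np.asarray(rho_prior, float))

def threshold_inference(rho_final, rho_prior, i, theta):
    """Declare hypothesis i only if the confidence increment clears theta; otherwise
    return None, i.e. the inconclusive declaration. By Lemma 2.5, any rule of this
    form guarantees phi_N(i) <= exp(-theta), regardless of the experiment strategy."""
    if confidence_increment(rho_final, rho_prior, i) >= theta:
        return i
    return None

# Illustrative use only: theta here is an arbitrary value, not the theta_N of the text.
rho_prior = [1/3, 1/3, 1/3]
rho_final = [0.999, 0.0006, 0.0004]
print(threshold_inference(rho_final, rho_prior, i=0, theta=5.0))
```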
However, the inference strategy and the experiment selection strategy must also satisfy the constraint ψ N (i)≥ 1− N . In Section 2.2.3.2, we discussed experiment selection strategies that satisfy Criterion 2.1. Let g N be any such experiment selection strategy that satisfies Criterion 2.1. Let the type-i correct-inference and misclassification proba- bilities associated with the strategy pair (g N ,f N ) be ψ N (i) and φ N (i), respectively. We can show that this strategy pair satisfies the constraintψ N (i)≥ 1− N . The proof is in Appendix A.7. Using the result (2.38), Lemma B.3 and the fact that θ N /N→D ∗ (i) as N→∞, we can say that lim N→∞ 1 N log 1 φ N (i) =D ∗ (i). (2.40) Thus, the experiment selection strategies that satisfy Criterion 2.1 (including ORS, DAS and DAS- RS discussed in Section 2.2.3.2), when used in conjunction with the inference strategyf N described above, are feasible solutions for the optimization problem (P1) and asymptotically achieve a type-i misclassification probability decay rate of D ∗ (i). This concludes the proof of Theorem 2.2. Since the inference strategy f N and the experiment selection strategy g N described above are feasible strategies with respect to the optimization problem in (P1), we can say that φ ∗ N (i)≤ φ N (i)≤e −θ N . And since under Assumption 3.2, θ N /N→D ∗ (i) as N→∞, we have lim inf N→∞ − 1 N logφ ∗ N (i)≥D ∗ (i). (2.41) Combining this result with the upper bound on the asymptotic decay rate of φ ∗ N (i) in Lemma 2.4, we have Theorem 2.1. Remark 2.6. We observe that our proof methodology shares some similarities with that of prior works [27, 107, 99] on active hypothesis testing. Nonetheless, many of our techniques are novel, especially those used to prove the asymptotic optimality of strategies that satisfy Criterion 2.1. 28 2.3.4 Tighter Non-asymptotic Lower Bounds In this section, we will provide an alternate approach to finding lower bounds on the misclassification probability φ N (i). This approach can be used to obtain tight lower bounds in some special cases. We will later illustrate this procedure with the help of an example. For any given pair of inference and experiment selection strategies f,g that are feasible in Problem (P1), recall that the increment in confidence can be viewed as a log-likelihood ratio (2.30). Therefore for this strategy pair f,g, we have the following for every χ∈R − logφ N (i) (2.42) a ≤ χ− log(ψ N (i)−P g i [C i (ρ N+1 )−C i (ρ 1 )>χ]) (2.43) b ≤ χ− log(1− N −P g i [C i (ρ N+1 )−C i (ρ 1 )>χ]) (2.44) = χ− log(P g i [C i (ρ N+1 )−C i (ρ 1 )≤χ]− N ) (2.45) Inequality (a) is a consequence of the strong converse theorem in [117]. Inequality (b) holds because ψ N (i)≥ 1− N . However, much like the weak converse in Lemma 3.1, this lower bound on φ N (i) depends on the experiment selection strategy g. We made use of the decomposition in Lemma 3.6 to obtain a strategy-independent lower bound in Lemma 2.4. We will follow a similar approach here. We have P g i [C i (ρ N+1 )−C i (ρ 1 )≤χ] (2.46) a =P g i [−H(β i∗ , ˜ ρ N+1 ) + ¯ Z N +H(β i∗ , ˜ ρ 1 )≤χ] (2.47) b ≥P g i [ ¯ Z N +H(β i∗ , ˜ ρ 1 )≤χ]. (2.48) Equality (a) is a consequence of Lemma 3.6, and since H(β i∗ , ˜ ρ N+1 )≥ 0, we have that the event {−H(β i∗ , ˜ ρ N+1 ) + ¯ Z N +H(β i∗ , ˜ ρ 1 )≤χ} (2.49) ⊇{ ¯ Z N +H(β i∗ , ˜ ρ 1 )≤χ}, (2.50) which results in the inequality (b). Combining (B.33) and (B.36) leads us to the following lemma. 29 Lemma 2.6 (Stong Converse). 
For any given pair of inference and experiment selection strategies f,g that are feasible in Problem (P1), we have for every χ∈R − logφ N (i)≤χ− log(P g i [ ¯ Z N +H(β i∗ , ˜ ρ 1 )≤χ]− N ), with the convention that logx . =−∞ if x≤ 0. Note that this lower bound is also dependent on the strategy g. However, it may be easier to derive strategy-independent lower bounds using the bound in Lemma B.2. This is because the process ¯ Z n −nD ∗ (i) is a super-martingale given X =i. In fact, if every experiment inU is in the support ofα i∗ (which is the case in many problems), then the process ¯ Z n −nD ∗ (i) is a martingale given X = i. Thus, a lower bound on φ N (i) may be obtained using a strategy-independent lower bound on the tail probabilityP g i [ ¯ Z N +H(β i∗ , ˜ ρ 1 )≤χ] [48]. In some special cases, it may even occur that the evolution of the process ¯ Z n is completely independent of the strategy g. We will discuss an example that satisfies this condition in Section 10.3. 2.4 Analysis of Problem (P2) In this section, we will analyze Problem (P2) and prove Theorem 2.3. Let f,g be inference and experiment selection strategies that satisfy the constraints in Problem (P2), i.e. ψ N (i)≥ 1− N for every i∈X , where ψ N (i) is the type-i correct-inference probability associated with strategies f,g. Let φ N (i) be the type-i misclassification probability associated with strategies f,g. Since the strategy pairf,g satisfiesψ N (i)≥ 1− N , we can use Lemma 3.1 to obtain the following inequality − 1 N logφ N (i)≤ J g N (i) + log 2 N 1− N , (2.51) for every i∈X . Let γ N be the misclassification probability associated with strategy pair f,g. Then, we have γ N = X i∈X (1−ρ 1 (i))φ N (i) (2.52) ≥ X i∈X (1−ρ 1 (i)) exp − (NJ g N (i) + log 2) 1− N ! . (2.53) 30 Rearranging the terms above, we have − 1 N logγ N (2.54) ≤− 1 N log X i∈X (1−ρ 1 (i))e − (NJ g N (i)+log2) 1− N (2.55) =− 1 N log X i∈X e − (NJ g N (i)+log2) 1− N +log(1−ρ 1 (i)) (2.56) a ≤− 1 N max i∈X − (NJ g N (i) + log 2) 1− N + log(1−ρ 1 (i)) ! (2.57) = min i∈X J g N (i) + log 2 N 1− N − log(1−ρ 1 (i)) N ! , (2.58) where inequality (a) is because log P i exp(x i )≥ max i (x i ). Using the definition of γ ∗ N , we have − 1 N logγ ∗ N (2.59) = sup f,g:ψ N (i)≥1− N ,i∈X − 1 N logγ N (2.60) b ≤ min i∈X sup g J g N (i) + log 2 N 1− N − log(1−ρ 1 (i)) N ! (2.61) c ≤ min i∈X D ∗ (i) + H(β i∗ ,˜ ρ 1 )+log 2 N 1− N − log(1−ρ 1 (i)) N . (2.62) Inequality (b) is due to the result in (2.58) and inequality (c) follows from Lemma 2.4. Thus, we can conclude that lim sup N→∞ − 1 N logγ ∗ N ≤ min i∈X D ∗ (i). (2.63) This establishes an upper bound on the asymptotic decay rate of the optimal misclassification probability γ ∗ N in Problem (P2). We will now show that this rate is asymptotically achievable by constructing appropriate experiment selection and inference strategies. 31 For a given horizon N and for each i∈X , let g N,i be an experiment selction strategy that satisfies Criterion 2.1 with respect to hypothesis i. Let the maximum likelihood (ML) estimate at time n be ¯ i n . = arg max i∈X ¯ ρ n (i), (2.64) where ties are broken arbitrarily in the arg max operator and ¯ ρ n is the posterior belief at time n formed using uniform prior at time 1 instead of ρ 1 . LetN . ={da l e :l∈N,da l e≤N}, where a> 1 is a constant and is defined in Appendix A.8. 2.4.0.1 Experiment selection strategy If n∈N , then the experiment U n is selected randomly according to the uniform distribution over U. 
If n / ∈N and ¯ i n = i, then U n is selected randomly with distribution g N,i n (I n ). We denote this experiment selection strategy by ¯ g N (or ¯ g). Note that if g N,i is ORS, then the strategy ¯ g N is identical to the one in [107]. 2.4.0.2 Inference strategy Consider the following deterministic inference strategy ¯ f N where for each i∈X ¯ f N (ρ N+1 :i) = 1 ifC i (ρ N+1 )−C i (ρ 1 )≥θ N (i) 0 otherwise, where−C i (ρ 1 )<θ N (i) =ND ∗ (i)−o(N) and is precisely defined in Appendix A.8. Notice that if ¯ f N (ρ N+1 :i) = 0 for every i∈X , then ¯ f N (ρ N+1 :ℵ) = 1. Since θ N (i)> 0 for every i∈X , the threshold condition in ¯ f N can be satisfied by at most one hypothesis. Thus, ¯ f N declares hypothesis i if and only if the confidence increment associated with i exceeds the threshold θ N (i). Hence, for each hypothesis i∈X , the inference strategy ¯ f N admits the structure required for Lemma B.3 and thus, using Lemma B.3, we can conclude that for each hypothesis i∈X , φ N (i)≤e −θ N (i) . Therefore, under the strategies ( ¯ f N , ¯ g N ), we have γ N ≤ X i∈X (1−ρ 1 (i)) exp(−θ N (i)). (2.65) 32 In Appendix A.8, we show that there exists an integer ¯ N such that for every N≥ ¯ N, the strategy pair ( ¯ f N , ¯ g N ) also satisfies all the type-i correct-inference probability constraints in problem (P2). Thus, for every N≥ ¯ N, we have γ ∗ N ≤γ N ≤ X i∈X (1−ρ 1 (i)) exp(−θ N (i)), and hence, lim inf N→∞ − 1 N logγ ∗ N (2.66) ≥ lim N→∞ − 1 N log X i∈X (1−ρ 1 (i)) exp(−θ N (i)) (2.67) ≥ lim N→∞ − 1 N log M max i∈X {(1−ρ 1 (i)) exp(−θ N (i))} = lim N→∞ − 1 N max i∈X {log (1−ρ 1 (i)) exp(−θ N (i)) } (2.68) = min i∈X lim N→∞ θ N (i)− log(1−ρ 1 (i)) N (2.69) = min i∈X D ∗ (i). (2.70) Using the results (2.63) and (2.66), we can conclude that lim N→∞ − 1 N logγ ∗ N = min i∈X D ∗ (i). (2.71) This concludes the proof of Theorem 2.3. 2.5 An Example: Anomaly Detection Consider a system with two sensorsA andB. These sensors can detect an anomaly in the system in their proximity. The system state X can take three values{0, 1, 2} where X = 0 indicates that the system is safe, i.e. there is no anomaly in the system. On the other hand, X = 1 indicates that there is an anomaly near sensor A and X = 2 indicates that there is an anomaly near sensor B. The prior belief ρ 1 is uniform over the set{0, 1, 2}. There is a controller that can activate one of these sensors at each time to obtain an observation. Thus, the collection of actions that the controller can select from isU ={A,B}. The observations are binary, i.e. Y ={0, 1}. The 33 X = 0 X = 1 X = 2 U =A 1−ν ν 1−ν U =B 1−ν 1−ν ν Table 2.1: Conditional probabilitiesP[Y = 1|X,U]. In our numerical experimentsν = 0.6 which indicates that the observations from these sensors are very noisy. conditional probabilitiesP[Y = 1|X,U] associated with the observations given various states and actions are given in Table 4.1. 2.5.0.1 Formulation After collecting N observations from these sensors, we are interested in determining whether the system is safe or unsafe. We consider a setting where incorrectly declaring the system to be safe can be very expensive while a few false alarms can be tolerated. In this setting, the inconclusive decisionℵ is treated as an alarm. 
Therefore, we would like to design an experiment (sensor) selection strategy g and an inference strategy f that minimize the probability φ N (0) of incorrectly declaring the system to be safe subject to the condition that the probability ψ N (0) of correctly declaring the system to be safe is sufficiently high. This can be formulated as min f∈F,g∈G φ N (0) (P4) subject to ψ N (0)≥ 1− N , where N = min{0.05, 10/N} in our numerical experiments. Notice that this fits the formulation of Problem (P1). 2.5.0.2 Asymptotically Optimal Rate and Weak Converse Using Theorem 2.1, we can conclude that the asymptotically optimal misclassification rate is D ∗ (0) = max α∈ΔU min j6=0 X u α(u)D(p u 0 ||p u j ) = (ν− 1/2) log ν 1−ν . 34 The max-minimizer α 0∗ (A) = α 0∗ (B) = 0.5 and the min-maximizer β 0∗ (1) = β 0∗ (2) = 0.5. For convenience, we will refer to α 0∗ as α ∗ and β 0∗ as β ∗ . Also, notice that the bound on the log- likelihood ratios B = log ν 1−ν , ν > 0.5. Therefore using the weak converse in Lemma 3.1 and Lemma 2.4, we have 1 N log 1 φ ∗ N (0) ≤ ν− 1/2 1− N log ν 1−ν + 2 log 2 N(1− N ) . 2.5.0.3 Strong Converse We will now use the lower bound in Lemma B.2 to derive a strategy-independent lower bound one φ N (0). Define L n . =β ∗ (1)λ 0 1 (U n ,Y n ) +β ∗ (2)λ 0 2 (U n ,Y n ) (2.72) Notice that given X = 0, for either experiment u∈U, we have β ∗ (1)λ 0 1 (u,Y ) +β ∗ (2)λ 0 2 (u,Y ) = 1 2 log ν 1−ν w.p. ν 1 2 log 1−ν ν w.p. 1−ν. Let the moment generating function of the variable above be ¯ μ(s). Therefore, for any strategy g, we have E g 0 [exp( n X k=1 s k L k )] =E g 0 [E 0 exp( n X k=1 s k L k )|I n ]] (2.73) =E g 0 [exp( n−1 X k=1 s k L k )E 0 [exp(s n L n )|I n ]] =E g 0 [exp( n−1 X k=1 s k L k )]¯ μ(s n ) = Π n k=1 ¯ μ(s k ). (2.74) 35 Thus, L n is an i.i.d. sequence and the process ¯ Z n is the sum of these i.i.d. random variables. We can also exploit the fact that the observations are binary valued. Let the number of 0’s in N observations be K. Then K has binomial distribution with parameters N,ν. Further, ¯ Z N ≤χ− log 2 ⇐⇒ K− N 2 log ν 1−ν ≤χ− log 2. (2.75) Define χ ∗ = (Q(2 N )−N/2) log ν 1−ν + log 2, where Q is the quantile function of the binomial distribution with parameters N,ν. Using the relation (2.75), we can conclude that P g 0 [ ¯ Z N +H(β i∗ , ˜ ρ 1 )≤χ ∗ ]≥ 2 N , (2.76) under any experiment selection strategy g. Finally, using Lemma B.2, we have for any pair of inference and experiment selections strategies that satisfy the constraints in Problem (P4) log 1 φ N (0) ≤χ ∗ − log( N ). (2.77) 2.5.0.4 Deterministic Adaptive Strategy (DAS) If the sensorA is selected, then the log-likehood ratioλ 0 2 (A,Y ) is identically 0. Therefore,μ 0 2 (A,s) = 1 for every s. Further, it can easily be verified that μ 0 1 (A,s) = 1 at s = 0 and s = 1. This fact combined with the convexity of μ 0 1 (A,s) implies that μ 0 1 (A,s)≤ 1 for every 0≤s≤ 1. Similarly, μ 0 1 (B,s) = 1 for every s and μ 0 2 (B,s)≤ 1 for every 0≤ s≤ 1. Because of this, the deterministic adaptive experiment selection strategy (DAS) in Section 2.2.3.2 reduces to the following: g n (ρ n :A) = 1 if ρ n (1)≥ρ n (2) 0 otherwise. (2.78) Due to Theorem 2.2, we know this strategy is asymptoticaly optimal. Note that the strategy described above is independent of the time-horizon N. This strategy happens to coincide with the one designed in [76, 81]. 
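For concreteness, the following sketch simulates the two-sensor example with the observation probabilities of Table 2.1 (ν = 0.6), the sensor selection rule (2.78) and a confidence-threshold inference rule, and estimates ψ_N(0) and φ_N(0) by Monte Carlo. The horizon, threshold and trial count are illustrative choices; the text instead tunes the threshold empirically so that ψ_N(0) ≥ 1 − ε_N with ε_N = min{0.05, 10/N}.

```python
import numpy as np

rng = np.random.default_rng(2)
nu = 0.6
# P[Y = 1 | X, U] from Table 2.1; rows are X = 0, 1, 2 and columns are U = A, B.
p1 = np.array([[1 - nu, 1 - nu],
               [nu,     1 - nu],
               [1 - nu, nu]])

def bayes_update(rho, u, y):
    like = p1[:, u] if y == 1 else 1.0 - p1[:, u]
    post = rho * like
    return post / post.sum()

def das_sensor(rho):
    """Rule (2.78): probe sensor A (u = 0) if rho(1) >= rho(2), else sensor B (u = 1)."""
    return 0 if rho[1] >= rho[2] else 1

def declares_safe(x_true, N, theta):
    """One horizon-N trial; True iff the confidence-threshold rule declares X = 0."""
    rho = np.full(3, 1.0 / 3.0)
    conf = lambda r: np.log(r[0]) - np.log(r[1] + r[2])   # C_0(rho), written stably
    c_init = conf(rho)
    for _ in range(N):
        u = das_sensor(rho)
        y = int(rng.random() < p1[x_true, u])
        rho = bayes_update(rho, u, y)
    return conf(rho) - c_init >= theta

# Illustrative horizon and threshold; the text tunes theta so that psi_N(0) >= 1 - eps_N.
N, theta, trials = 200, 6.0, 2000
psi = np.mean([declares_safe(0, N, theta) for _ in range(trials)])
phi = np.mean([declares_safe(int(rng.choice([1, 2])), N, theta) for _ in range(trials)])
print(f"psi_N(0) ~ {psi:.3f},   phi_N(0) ~ {phi:.4f}")
```

With the uniform prior of the example, conditioning on the system being unsafe amounts to drawing the anomaly location uniformly from {1, 2}, which is what the last estimate does.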
Figure 2.1: The plot depicts the performance of strategies ORS and DAS. Both are asymptotically optimal but DAS is better in the non-asymptotic regime. When the horizon N = 500, we see a 13 dB improvement in the misclassification probability with DAS. Also notice that the strong bound on misclassification probability is very close to the performance of DAS. (Plot of the logarithm of the misclassification probability, log φ_N(0), versus the time horizon N, for DAS, ORS, the strong bound and the weak bound.)

2.5.0.5 Numerical Results

We compare the performance of the open-loop randomized strategy $\alpha^{0*}$ and the deterministic adaptive strategy described above in Figure 2.1. We observe that the performance of DAS is significantly better in the non-asymptotic regime. We also plot the weak and strong bounds established earlier. We observe that the strong bound is very close to the performance of DAS. In our numerical experiments, the inference strategy is a confidence-threshold based strategy. Instead of computing the threshold, we empirically find the best threshold using binary search.

2.5.0.6 Chernoff's Deterministic Strategy

In [27], Chernoff described a fully deterministic strategy for sequential hypothesis testing but gave an example scenario for which his strategy was not asymptotically optimal (see Section 7, [27]). We will now demonstrate that even in such pathological scenarios, our strategies designed based on Criterion 2.1 are asymptotically optimal and also have a better performance in the non-asymptotic regime. Consider the same anomaly detection setup and formulation as in Section 2.5.0.1 with two additional experiments $C$ and $D$. The conditional distributions of the observations associated with these experiments are provided in Table 2.2. Chernoff's approach to deterministic strategy design is as follows. Let $\bar{j}_n$ be the most-likely alternate hypothesis at time $n$, i.e.,
$$\bar{j}_n \doteq \arg\max_{j \neq 0} \bar{\rho}_n(j), \quad (2.79)$$
where $\bar{\rho}_n$ is the posterior belief at time $n$ formed using a uniform prior. Then at time $n$, perform the experiment that maximizes $D(p^u_0 \| p^u_{\bar{j}_n})$. In other words, select the experiment that can best distinguish the most-likely alternate hypothesis from the hypothesis of interest (in this case, $X = 0$). In our setup, Chernoff's deterministic strategy reduces to the following
$$U_n = \begin{cases} A & \text{if } \rho_n(1) \geq \rho_n(2) \\ B & \text{otherwise.} \end{cases} \quad (2.80)$$
It can be shown that this strategy is not asymptotically optimal. Chernoff's randomized strategy (ORS) in this case reduces to selecting either experiment $C$ or $D$ with probability 0.5. On the other hand, with some simple calculations, we can show that our strategy DAS-RS described in Section 2.2.3.2 reduces to
$$U_n = \begin{cases} C & \text{if } \rho_n(1) \geq \rho_n(2) \\ D & \text{otherwise.} \end{cases} \quad (2.81)$$
Because of Theorem 2.2, we know that both ORS and DAS-RS are asymptotically optimal. The performances of ORS, DAS-RS and Chernoff's deterministic strategy are shown in Figure 2.2. In this case, the strategy DAS-RS does not depend on the time horizon N whereas the strategy DAS depends on N. We note that the last assumption in [76, pg. 513] and Assumption 1.6 in [81] do not apply for the above example.

Table 2.2: Conditional probabilities P[Y = 1 | X, U] for the problem setup in Section 2.5.0.6.

          X = 0    X = 1    X = 2
U = A     0.400    0.600    0.400
U = B     0.400    0.400    0.600
U = C     0.402    0.598    0.280
U = D     0.402    0.280    0.598
Figure 2.2: Type-0 misclassification probability associated with strategies DAS-RS, ORS and Chernoff's deterministic strategy. Both ORS and DAS-RS are asymptotically optimal and Chernoff's deterministic strategy is not. Notice that DAS-RS outperforms all the other strategies in the non-asymptotic regime. (Plot of the logarithm of the misclassification probability, log φ_N(0), versus the time horizon N, for DAS-RS, Chernoff's deterministic strategy, ORS and the weak bound.)

2.6 Conclusions

We formulated two fixed-horizon active hypothesis testing problems (asymmetric and symmetric) in which the agent can decide on one of the hypotheses or declare its experiments inconclusive. Using information-theoretic techniques, we obtained lower bounds on the optimal misclassification probability in these problems. We also derived upper bounds by constructing appropriate strategies and analyzing their performance. We proposed a novel approach to designing deterministic and adaptive strategies for these active hypothesis testing problems. We proved that these deterministic strategies are asymptotically optimal, and through numerical experiments, demonstrated that they have significantly better performance in the non-asymptotic regime in some problems of interest. In this paper, our analysis was restricted to the setting where the observation space is finite. We believe that our analysis can be extended to cases where the observation space is infinite and the log-likelihood ratios are sub-Gaussian. We hope to address this extension in future work. Another direction for future work is to consider potentially infinite composite hypothesis spaces which arise in a variety of problems like oddball detection [148] and drug efficacy testing [27].

Chapter 3

Active Hypothesis Testing and Anomaly Detection

3.1 Introduction

Active hypothesis testing and anomaly detection have been persistently studied since the 1940s; however, much of this prior art focuses on the asymptotic regime. In this paper, we study asymptotic as well as non-asymptotic aspects. For example, [154, 145, 66, 73] have noted that the classical Chernoff-type strategy may be asymptotically optimal, but in many cases, other design approaches result in strategies that are substantially superior in the non-asymptotic regime. One of our main goals is to provide a better explanation for this phenomenon. The finite-blocklength bounds for non-adaptive Neyman-Pearson type hypothesis testing in [115] are some of the strongest non-asymptotic bounds in the hypothesis testing literature. We extend these bounds to certain kinds of active anomaly detection problems. The logarithm of the error probability contains two dominant terms: (i) a first-order term which is linear in the horizon N and (ii) a second-order term which scales proportionally to $\sqrt{N}$. Asymptotically optimal strategies are first-order optimal. The second-order error performance plays a significant role in the non-asymptotic regime and different asymptotically optimal strategies can have substantially different second-order performance. We characterize the second-order coefficients for certain sampling strategies and use them to explain the performance gap between our designed strategies and classical strategies in the non-asymptotic regime. We believe that our results provide a framework for understanding the non-asymptotic behavior of a larger class of active hypothesis testing problems than those considered herein.
We consider a setting in which we are allowed to probe the components (individually or as groups) a fixed number of times, whereas most prior works [66, 154, 145] focus on the sequential setting with a stopping time. Fixed-horizon formulations are useful in applications with hard time/energy constraints where the agent does not have the luxury to keep performing experiments until strong enough evidence is obtained. We make a distinction between symmetric and asymmetric systems. In symmetric systems, when a group of components is probed, the statistics of the observations depend only on the size of the group and whether or not the group contains an anomaly. The statistics do not depend on the components' indices. This may not be true in asymmetric systems. Our main contributions in this paper can be summarized as follows.

1. We formulate a fixed-horizon Neyman-Pearson type problem for active anomaly detection for two sampling models: individual sampling and group sampling. For both these models, we derive the asymptotically optimal error rates and strategies.

2. We generalize the novel strategy design approach proposed in [73]. In [73], asymptotic optimality of the strategy was shown only for the case where the observation space is finite; we now consider arbitrary observation spaces.

3. We prove the asymptotic optimality of our proposed strategies; their strong performance is demonstrated via numerical results which confirm the improved non-asymptotic performance in comparison to Chernoff-type approaches.

4. The proposed approaches are data-driven and are straightforward to implement.

5. For the symmetric individual sampling problem, we derive strong non-asymptotic converse and performance bounds which facilitate explaining the performance differences between our new strategies and the classical ones in the non-asymptotic regime.

Much prior work has considered the problem of determining which component is anomalous. The problem of safety evaluation has received much less attention. The structure of safety evaluation makes it amenable to non-asymptotic analysis. However, we can exploit the methods in [73] to incorporate the identification of anomalies (if they exist) as well. In this case, an initial exploration phase will be used to assess whether there is an anomaly or not with a moderate level of confidence. If no anomalies are found in this exploration phase, then our strategies designed in this paper can be used to efficiently confirm the absence of anomalies with high confidence. If in the exploration phase we do find anomalies, we can use the search methodologies in [154, 145, 28] to identify them.

Related Work

Hypothesis testing is a long-standing problem and has been addressed in various settings. In the simplest fixed-horizon hypothesis testing setup, there is a single experiment with a fixed number of i.i.d. observations. For the Neyman-Pearson formulation [34], tight bounds on error probabilities were proved in [115]. There has been limited work on extending this framework to the case of active hypothesis testing [114]. We have formulated active Neyman-Pearson hypothesis testing problems and conducted asymptotic analyses [69, 73]. Herein, we further this analysis to the non-asymptotic case. In sequential hypothesis testing, the time horizon is not fixed and the agent can continue to perform experiments to minimize the expected stopping time and the Bayesian error probability.
Inspired by Wald’s sequential probability ratio test (SPRT) [151], Chernoff first addressed the problem of active sequential hypothesis testing in [27]. This work was later generalized in [14, 107, 99]. Although our formulation has a fixed time-horizon, it is closely related to the sequential active hypothesis testing framework. In contrast to all these approaches, a Gibbs sampling-based active sensing approach was also proposed in [25]. Some recent works on anomaly detection that are closely related to our work are [66, 26, 145, 54, 153]. All these works are in the sequential setting with a stopping time and focus on asymptotic optimality. In contrast, we have a fixed horizon and focus on the correct verification and incorrect verification probabilities. We should underscore that these prior works identify the anomaly while we consider safety evaluation which enables the non-asymptotic analysis. Individual sampling models were considered in [66] and [145]. In [66], it is assumed that at least one anomaly exists and thus the system is never safe. The model in [145] is closer to ours. The problem formulation differs; however the optimal error exponents and sampling strategy coincide with ours for certain conditions, but the non-asymptotic bounds obtained in our work for individual sampling however are significantly stronger than the bounds in [145]. Group sampling was considered in [154] and [28]. In [154], a hierarchical tree structure is imposed on the sampling and they propose a termination check at the root node. In contrast, we allow for arbitrary groups and we further show this tree-based approach is suboptimal for key 43 observation models. The group sampling model in [28] is very similar to ours but only considers Bernoulli observations models – herein, our observation model is very general. Furthermore, [28] assumes a single anomaly and designs algorithms for identifying the anomaly. We observe that if the system has at most one anomaly, we can combine our strategy with the strategy in [28] to design an algorithm that is capable of both checking for anomalies as well as finding it if one exists. Noisy group testing [6, 24, 36, 139] may also be used to check if a system has anomalies. However, most works on group testing assume simple observation models (Bernoulli or Z-channel) [130]. These works are typically non-adaptive or limited to two or three stage strategies. Notation Random variables are denoted by upper case letters (X), their realizations are denoted by the corresponding lower case letter (x). We use calligraphic fonts to denote sets (U). The probability simplex over a finite set U is denoted by ΔU. In general, subscripts denote time indices unless stated otherwise. For time indices n 1 ≤n 2 ,Y n 1 :n 2 denotes the collection of variables (Y n 1 ,Y n 1 +1 ,...,Y n 2 ). For a strategy g, we useP g [·] andE g [·] to indicate that the probability and expectation depend on the choice ofg. For a hypothesisi,E g i [·] denotes the expectation conditioned on hypothesis i. For a random variable X and an eventE,E[X;E] denotesE[X 1 E ], where 1 E is the indicator function associated with the eventE. The cross-entropy between two distributions p and q over a finite spaceY is given by H(p,q) =− X y∈Y p(y) logq(y). (3.1) The Kullback-Leibler divergence between distributions p and q is given by D(p||q) = X y∈Y p(y) log p(y) q(y) . (3.2) This paper is organized as follows. In Section 3.2, we describe our system model and formulate the verification problem. 
We discuss our main results on both individual and group sampling models in Section 3.3. An outline of the proof of our main results is provided in Section 3.4 and detailed proofs are provided in the appendices. We discuss our numerical experiments and results in Section 3.5 and conclude the paper in Section 4.6. 44 3.2 Signal and System Model Consider a system with multiple components. Let the set of all the components in the system be denoted byS . ={1,...,M} where M is a positive integer. Let the random variable X⊆S denote the set of anomalous components and letX be the set of all realizations of X. In this paper, we consider two cases forX : (i) there can be at most one anomalous component in the system, i.e.,X =X in . ={?,{1},...,{M}}, and (ii) any number of components can be anomalous, i.e., X =X gr . = 2 S , where 2 S is the power set ofS. Let the prior distribution on X be denoted by ρ 1 ∈ ΔX . LetU be a collection of subsets ofS, i.e.,U ⊆ 2 S . At each time n, the agent can select a groupU n ∈U of components and obtain an observationY n ∈Y. This action of selecting a group of components and obtaining an observation will be referred to as an experiment. In this paper, we will consider two experiment models. In the first model, we can only sample components individually in which caseU =U in . ={{1},...,{M}}. In the second model, we can group any (non-zero) number of components and thus,U =U gr . = 2 S \{?}. The observationY n can be used to infer whether the selected group U n contains an anomaly or not, and is given by Y n =ξ(X,U n ,W n ) . = Υ(U n ,W n ) if X∩U n =? ¯ Υ(U n ,W n ) otherwise , (3.3) where{W n : n = 1, 2,...} is a collection of mutually independent variables and Υ, ¯ Υ are some measurable mappings. The density associated with an observation y when a group of components u is selected is denoted by p u 1 if the group contains an anomaly and p u 0 otherwise. For each group u, the densities are with respect to a σ-finite measure ν over the observation spaceY. Thus, for a measurable set A⊆Y and and hypothesis x∈X , P[Y n ∈A|U n =u,X =x] = R A p u 0 (y)dν(y) if x∩u =? R A p u 1 (y)dν(y) otherwise. 45 The system is said to be symmetric if the statistics of the observation Y depend only on the size of the group and whether the selected group of components contains an anomaly or not. The total number of observations collected by the agent is fixed and is denoted by N. At time n = 1, 2,...,N, the information available to the agent, denoted by I n , is the collection of all experiments performed and the corresponding observations up to time n− 1, I n . ={U 1:n−1 ,Y 1:n−1 }. (3.4) Let the collection of all possible realizations of information I n be denoted byI n . At time n, the agent selects a distribution over the set of componentsU according to an experiment selection rule g n :I n → ΔU and the experiment U n is randomly drawn from the distribution g n (I n ), that is, U n ∼g n (I n ). (3.5) For a given experimentu∈U and information realizationI∈I n , the probabilityP g [U n =u|I n =I] is denoted by g n (I : u). The sequence{g n ,n = 1,...,N} is denoted by g and referred to as the experiment selection strategy or sampling strategy. Let the collection of all such experiment selection strategies beG. If the system does not have any anomalies, it is considered to be safe (denoted by ?), and if the system has an anomaly, it is considered to by unsafe (denoted byℵ). After performing N experiments, the agent can declare the system to be safe or unsafe. 
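As a small illustration of the observation model in (3.3), the sketch below generates Y_n as a function of whether the probed group intersects the anomaly set. The Gaussian mean-shift densities are an assumption made purely for the example; the chapter only requires densities p^u_0 and p^u_1 with respect to a σ-finite measure, and the same sampler covers both the individual and the group sampling models.

```python
import numpy as np

rng = np.random.default_rng(3)

def sample_observation(anomaly_set, group, shift=1.0):
    """Draw Y_n per the model in (3.3): the observation's law depends only on whether
    the probed group contains an anomalous component.

    The Gaussian densities below are an illustrative assumption; the chapter allows
    arbitrary p^u_0 (anomaly-free group) and p^u_1 (group containing an anomaly)."""
    contains_anomaly = len(anomaly_set & group) > 0
    mean = shift if contains_anomaly else 0.0
    return rng.normal(loc=mean, scale=1.0)

# Individual sampling: probe component {2} when components {2, 5} are anomalous.
print(sample_observation(anomaly_set={2, 5}, group={2}))
# Group sampling: probe the group {1, 3, 4}, which happens to contain no anomaly.
print(sample_observation(anomaly_set={2, 5}, group={1, 3, 4}))
```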
We refer to this final declaration as the agent’s inference decision and denote it by ˆ X N . Thus, the inference decision can take values in{?,ℵ}. Using the information I N+1 , the agent chooses a distribution over the set {?,ℵ} according to an inference strategy f :I N+1 → Δ({?,ℵ}) and the inference ˆ X N is drawn from the distribution f(I N+1 ), i.e. ˆ X N ∼f(I N+1 ). (3.6) For a given inference ˆ x∈{?,ℵ} and information realization I∈I N+1 , the probabilityP f,g [ ˆ X N = ˆ x|I N+1 =I] is denoted by f N (I : ˆ x). Let the set of all inference strategies beF. The system is said to be correctly verified if the agent declares the system to be safe when it was safe, and it is said to be incorrectly verified if the agent declares it to be safe when it was not 46 safe. For an experiment selection strategy g and an inference strategy f, we define the following probabilities. Definition 3.1. Let ψ N be the probability that the agent declares the system to be safe given that the system is safe, i.e. ψ N . =P f,g [ ˆ X N =?|X =?]. (3.7) We refer to ψ N as the correct-verification probability. Let φ N be the probability that the agent declares the system to be safe given that the system is not safe, i.e. φ N . =P f,g [ ˆ X N =?|X6=?]. (3.8) We refer to φ N as the incorrect-verification probability. We are interested in designing an experiment selection strategy g and an inference strategy f that minimize the incorrect-verification probability φ N subject to the constraint that the correct- verification probabilityψ N is sufficiently large. In other words, we would like to solve the following optimization problem: inf f∈F,g∈G φ N (P1) subject to ψ N ≥ 1− N , where 0 < N < 1. Let the infimum value of this optimization problem be φ ∗ N . Note that this problem is always feasible because the agent can trivially satisfy the correct-verification probability constraint by always declaring the system to be safe. Problem (P1) is a simple vs. composite hypothesis testing problem. The simple null hypothesis is that the system is safe (X =?) and the alternate hypothesis (X6=?) is composite because the system can be unsafe in several ways. In general, obtaining an optimal solution to Problem (P1) can be difficult. This is primarily because the optimization is over the strategy spaceG which grows doubly-exponentially with the horizon N. We can however design approximately optimal strategies. For such approximately optimal strategies, it is essential to characterize the optimality gap or regret which is defined as R N (f,g) = log(φ N )− log(φ ∗ N ), (3.9) 47 where φ N is the incorrect verification probability associated with the strategy pair f,g. We say that an experiment design f,g is asymptotically optimal if the regret is sub-linear in N, i.e., lim N→∞ R N (f,g) N → 0. (3.10) In subsequent sections, we consider individual sampling and group sampling separately. For each of these two settings, we first derive a lower bound on the corresponding optimal incorrect verification probability φ ∗ N . We then construct experiment selection and inference strategies that are feasible with respect to the constraints in Problem (P1) and derive an upper bound on the corresponding probability φ N . This upper bound when combined with the lower bound derived earlier gives us an upper bound on the regret. 3.3 Definitions and Main Results We first define some quantities and state some results [73] associated with general active hypothesis testing problems. 
We then obtain stronger results for various specialized models described in Section 3.2. Definition 3.2 (Pairwise LLR). For an experiment u∈U, a hypothesis x∈X and for an obser- vation y∈Y, the pairwise LLR is the logarithm of the ratio of the likelihood densities associated with hypothesis? and hypothesis x and is denoted by λ x (u,y) . = log p u 0 (y) p u 1 (y) if u∩x6=? 0 otherwise. (3.11) Also, define D u x = E u ? [λ x (u,Y )] where observation Y ∼ p u 0 . If x is a singleton component{j}, we denote λ {j} (u,y) with λ j (u,y) for simplicity and refer to it as component-wise LLR. Similarly, when the experiment u ={j} and the hypothesis x ={i} are both singleton sets, we will denote D u x =D {j} {i} simply with D j i . We make the following standard assumption [27, 107] on the log-likelihood ratios. 48 Assumption 3.1. For any given experiment u∈S and each k∈{0, 1}, Z y log p u 0 (y) p u 1 (y) 2 p u k (y)dν(y)<∞. (3.12) Definition 3.3 (Max-min Kullback-Leibler Divergence). Define D ∗ . = max α∈ΔU min x∈ ˜ X X u∈U α(u)D u x (3.13) = min β∈Δ ˜ X max u∈U X x∈ ˜ X β(x)D u x , (3.14) where ˜ X =X\?. Note thatα is a distribution over the set of experiments andβ is a distribution over the set of all possible unsafe configurations. Letα ∗ be the distribution that achieves the maximum in (3.13) and letβ ∗ be the distribution that achieves the minimum in (3.14). The distributions α ∗ andβ ∗ will be referred to as the maxminimizer and the minmaximizer. The equality of the min-max and max-min values follows from the minimax theorem [110] because the setU is finite and the Kullback-Leibler divergences are bounded due to Assumption 3.1. Definition 3.4 (Posterior Belief). The posterior beliefρ n on the hypothesisX based on information I n is given by ρ n (x) =P[X =x|U 1:n−1 ,Y 1:n−1 ] =P[X =x|I n ]. (3.15) Note that given a realization of the experiments and observations until time n, the posterior belief does not depend on the experiment selection strategy g or the inference strategy f. Definition 3.5 (Confidence). The confidence on the safe state? associated with information I n is given by C(I n ) . = log ρ n (?) 1−ρ n (?) − log ρ 1 (?) 1−ρ 1 (?) , (3.16) where ρ n is the posterior belief associated with information I n . Sometimes, we will refer toC(I n ) asC(I n ,ρ 1 ) to emphasize its dependence on the prior belief ρ 1 . The following lemma from [73] establishes a lower bound on the optimal misclassification error φ ∗ N . 49 Lemma 3.1 (Weak Converse). The optimum value φ ∗ N in Problem (P1) satisfies − 1 N logφ ∗ N ≤ D ∗ 1− N + log 2 +H(β ∗ , ˜ ρ 1 ) N(1− N ) . (3.17) Proof. For a given sampling strategy g, we can view Problem (P1) as a binary hypothesis testing problem and use the weak converse in [117] to obtain a lower bound that depends on the strategy g. Further, using some properties of the confidence defined in Definition 3.5, we can obtain the inequality in (3.17). A complete proof of this result is in [73]. Assumption 3.2. We have that the bound 1− N on the correct-verification probability in (P1) satisfies N → 0. Further, lim N→∞ − log N N = 0. (3.18) Definition 3.6 (Experiment Selection Strategy). Consider an experiment selection strategyg such that at time n, g selects an experiment that minimizes X x∈ ˜ X (ρ n (x)) s N μ x (u,s N ). (3.19) Here,μ x (u,s) is the moment generating function of the negative pairwise LLR defined in Definition 3.2, i.e., μ x (u,s) =E u ? [exp[−sλ x (u,Y )]], (3.20) where the observation Y ∼p u 0 . 
The parameter s N is such that s N → 0; lim N→∞ − log N Ns N = 0. (3.21) Definition 3.7 (Inference Strategy). Letf be an inference strategy such that the system is declared to be safe if and only if C i (I N+1 )≥θ N , (3.22) 50 where θ N is defined precisely in (B.11). Under Assumptions 3.1 and 3.2, we prove that for each N, the inference and experiment se- lections strategies f,g defined in Definitions 3.7 and 3.6 respectively satisfy the constraint on correct-verification probability in Problem (P1) and the corresponding incorrect-verification prob- ability φ N decays exponentially with N at rate D ∗ . This achievability argument leads us to the following theorem. Theorem 3.1. We have lim N→∞ − 1 N logφ ∗ N =D ∗ . (3.23) Proof. This result was proven in [73] under the assumption that the observationY is finite. We provide a proof for more general observation spaces in Appendix B.1. Remark 3.1. An alternative experiment selection strategy is the following. At each time n, ran- domly select a component u with probability α ∗ (u). As opposed to our strategy in Defition 3.6, this simple strategy is randomized and completely open-loop. This strategy is inspired from the design methodology in [27, 107, 99] and can be shown to be asymptotically optimal using similar arguments. However, it tends to have poor non-asymptotic performance. We provide a detailed discussion on this in Section 3.5. 3.3.1 Individual Sampling Recall that in individual sampling, the set of experimentsU isU in . ={{1},...,{M}}. For any experiment u ∈ U in and for any unsafe hypothesis x ∈ ˜ X , we have the following relationship between the LLR λ x (u,y) defined in (3.11) and the component-wise log-likelihood ratios. λ x (u,y) = X i∈x λ i (u,y). (3.24) 51 Lemma 3.2. In individual sampling, i.e., when the set of experiments isU in , the max-min Kullback- Leibler divergence defined in Definition 3.3 can be simplified as D ∗ = X i∈S 1 D i i ! −1 , (3.25) where D i i is as defined in Definition 3.2. The corresponding maxminimizing and minmaximizing distributions α ∗ and β ∗ are given by α ∗ (i) =β ∗ (i) =D ∗ /D i i for each i∈S. Proof. See Appendix B.2. Lemma 3.2 gives us a closed-form expression for the optimal asymptotic error rate as well as the minmaximizer and maxminimizer. We note that Lemma 3.2 holds both when the set of hypotheses X isX in (at most one anomaly) orX gr (arbitrarily many anomalies). Further, the minmaximizerβ ∗ has only singleton elements in its support even when the system has arbitrary many anomalies. This suggests that even though there are exponentially many hypotheses, only the singleton hypotheses are of significance. Now we state our non-asymptotic results for symmetric systems. The Kullback-Leibler diver- gence D u u in symmetric systems does not depend on the component u and thus, we simply denote it by D. Using Lemma 3.2, we can conclude that D ∗ = D/M, and that α ∗ and β ∗ are uniform distributions over the set of componentsS. Definition 3.8. For a given experiment selection strategy g and a component j∈S, define the total log-likelihood ratio up to time n as Z n (j) . = n X k=1 λ j (U k ,Y k ), where the log-likelihood ratio λ j is as defined in equation (3.11). Also, let L n . = X j∈S β ∗ (j)λ j (U n ,Y n ) ¯ Z n . = X j∈S β ∗ (j)Z n (j) = n X k=1 L k , where β ∗ is the min-maximizing distribution in Definition 3.3. 52 Since the system is symmetric, we have L n = 1 M X j∈S λ j (U n ,Y n ) = 1 M log p 0 (Y n ) p 1 (Y n ) . 
(3.26) Notice that given X = 0, the distribution of L n does not depend on U n and thus, on the strategy g. The following lemma plays a crucial role in the derivation of our non-asymptotic bounds for the symmetric system. Lemma 3.3. When the system is symmetric, L n is an i.i.d. sequence and the process ¯ Z n is the sum of these i.i.d. random variables. Proof. See Section B.5.1 in the supplementary material. Letinv n denote the quantile function (which is the same as the inverse-cdf if it exists) associated with the random varible ¯ Z n +D(β ∗ ||˜ ρ 1 ). Theorem 3.2. For symmetric systems, we have − logφ ∗ N ≤inv N N + N η + log η N (3.27) − logφ ∗ N ≥inv N N − N η −O log η N , (3.28) for any η> 1 as long as the argument of inv N ∈ (0, 1). Note that η may also depend on N. Remark 3.2. The bound in (3.28) is stated in big-O notation because the constants associated with the logarithmic term are difficult to determine in general. Herein, we only prove the existence of constants that achieve the bound (3.28). We would like to emphasize that Theorem 6.1 does not require Assumption 3.2. The bound in (3.27) is based on the strong converse theorem [115] and other properties of the log-likelihood ratios Z n (j). The result in (3.28) is obtained by constructing experiment selection and inference strategies and bounding their performance. The approach used for bounding performance is based on the well-known Chernoff-bound [126]. The experiment selection strategy that achieves the bound is as follows: at each time n, select the component j that minimizes Z n−1 (j)− log ˜ ρ 1 (j). The inference strategy is a simple confidence-threshold based strategy. These strategies are obtained 53 by specializing the strategies defined in Definitions 3.6 and 3.7 to the individual sampling model. Detailed descriptions of these strategies and the complete proof of Theorem 6.1 are provided in Section 3.4. Since L n is an i.i.d. collection of random variables, we can further simplify these bounds by approximating the quantile function inv N using the Berry-Esseen Theorem [115]. Corollary 3.1 (Berry-Esseen). If V . =E ? [(L 1 −D ∗ ) 2 ] and T . =E ? [|L 1 −D ∗ | 3 ]<∞, then − logφ ∗ N ≤ (3.29) ND ∗ − √ NVQ −1 N + N η + 6T √ NV 3 +O log η N , − logφ ∗ N ≥ (3.30) ND ∗ − √ NVQ −1 N − N η − 6T √ NV 3 −O log η N . Here, the Q-function is the tail distribution function of the standard normal distribution. The results above are valid only when the argument of Q −1 is between 0 and 1. The approximations obtained using the Berry-Esseen theorem provide us with a clearer picture of the non-asymptotic behavior of the optimal error probabilityφ ∗ N as well as the bounds obtained. It is clear from (3.29) and (3.30) that both bounds share the same linear term ND ∗ . Further, the √ N term in both bounds is scaled by very similar factors indicating the tightness of the bounds. Further discussion on the tightness of these bounds is provided in Section 3.5. 3.3.2 Group Sampling In this section, we will discuss our results associated with group sampling. Recall that in group sampling, the set of experimentsU isU gr = 2 S \{?}. While one can use the general theory on active hypothesis testing [73, 27, 99] to characterize asymptotically optimal rates (and strategies) in this case, it is in general difficult to compute and thus implement these strategies in practice due to the prohibitively large set of experiments (exponential in the number of components). 
However, under certain assumptions on the observation model, this computational challenge is alleviated. One such case is when the system is symmetric, i.e., the statistics of the observationY depend only on the size of the group and whether the selected group of components contains an anomaly or not. 54 Lemma 3.4. In symmetric group sampling, i.e., when the set of experiments isU gr , the max-min Kullback-Leibler divergence defined in Definition 3.3 can be simplified as D ∗ = max 1≤k≤M kD {1,...,k} {1} M , (3.31) where D {1,...,k} {1} is as defined in Definition 3.2. Let k ∗ be the maximizing size in (3.31). The corre- sponding maxminimizing distribution α ∗ is the uniform distribution over all groups of size k ∗ and the minmaximizing distribution β ∗ is the uniform distribution over all (M) singleton components. Proof. See Appendix B.3. Similar to the individual sampling case, Lemma 3.4 holds both when the set of hypothesesX isX in (at most one anomaly) orX gr (arbitrarily many anomalies). The classical Chernoff-type sampling strategy is to randomly select an experiment from the distributionα ∗ at any given time [27]. Note that this sampling strategy is open-loop. It was shown in [73] that this Chernoff-type strategy is asymptotically optimal for the fixed-horizon hypothesis testing problem. An alternative methodology was proposed in [73] for designing asymptotically optimal strategies. This approach generally leads to deterministic and adaptive strategy. Theorem 3.3. Let g be an experiment selection strategy such that any given time n, g n selects k ∗ components that are the most likely to be anomalous according to the posterior belief. Let f be the best confidence threshold based inference strategy given that g is used for experiment selection. For the symmetric group sampling problem with at most one anomaly, the strategy pair (f,g) is asymptotically optimal. Proof. We specialize the experiment selection strategy in Definition 3.6 to the group sampling model with at most one anomaly. Additionally, we restrict the set of experiments to the support ofα ∗ , i.e., groups of size k ∗ . It was shown in [73] that this restriction is without loss of optimality. For some group u∈U gr of size k ∗ , the metric in (3.19) is given by X x∈ ˜ X (ρ n (x)) s N μ x (u,s N ) (3.32) a = X j∈S (ρ n (j)) s N μ j (u,s N ) (3.33) 55 b = X j∈u (ρ n (j)) s N μ(s N ) + X j/ ∈u (ρ n (j)) s N (3.34) = X j∈u (ρ n (j)) s N (μ(s N )− 1) + X j∈S (ρ n (j)) s N . (3.35) Equality (a) follows from the structure of the log-likelihood ratios in (3.11) and (b) follows from the symmetry. From the last equation, it is clear that the groupu that has most likely components (according to ρ n ) minimizes the metric above. Remark 3.3. In individual sampling, the posterior can be factorized and can be updated easily. However, in group sampling with an arbitrary number of anomalies, updating the posterior belief may be computationally infeasible in general. Therefore, our experiment selection strategy in Def- inition 3.6 may not necessarily be useful in this case. However, we conjecture that, under certain conditions it is possible to use our strategy design methodology tractably. For instance, we believe that when the number of components M is a multiple of k ∗ , then the components can be clus- tered into groups of size k ∗ and can be treated as macro-components. These macro-components can then be sampled individually. A deeper understanding of group sampling with arbitrarily many components is a problem for future work. 
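To illustrate Lemma 3.4 and the selection rule of Theorem 3.3, the sketch below computes the optimal group size k* for a symmetric Bernoulli model with a linearly decaying crossover probability (an assumption here; a model of this form is used in Section 3.5.2) and implements the "sample the k* components most likely to be anomalous" rule. The posterior vector passed to `select_group` is a placeholder.

```python
import numpy as np

M = 16

def bern_kl(p, q):
    """KL divergence D(Bern(p) || Bern(q))."""
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

# Illustrative symmetric model: a size-k group reports the "clean" symbol with
# probability w(k) when anomaly-free and 1 - w(k) otherwise (assumption).
def w(k):
    return 0.6 - 0.1 * (k - 1) / (M - 1)

# D^{1..k}_{1} in Lemma 3.4 is the per-sample divergence of a size-k group.
rates = np.array([k * bern_kl(w(k), 1 - w(k)) / M for k in range(1, M + 1)])
k_star = int(np.argmax(rates)) + 1
D_star = rates[k_star - 1]
print(k_star, D_star)

def select_group(rho_singletons, k_star):
    """Theorem 3.3: sample the k* components most likely to be anomalous
    under the current posterior over singleton hypotheses."""
    return set(np.argsort(rho_singletons)[-k_star:])
```

For this particular scaling the maximizer is k* = 5, consistent with the value reported later in (3.43).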
3.4 Proof Outline of Theorem 6.1 We will first define some important quantities that will be used in our analysis. Let us arbitrarily fix the experiment selection strategy to be some g∈G. For a given instance ι N+1 ={u 1:N ,y 1:N } of information, let ι n+1 ={u 1:n ,y 1:n } for n≤N. Define P g N (ι N+1 ) . = N Y n=1 P g [u n |ι n ]p un 0 (y n ) (3.36) Q g N (ι N+1 ) . = X j∈U ˜ ρ 1 (j) N Y n=1 P g [u n |ι n ]p un 1(un=j) (y n ). (3.37) 56 The quantities P g N and Q g N are densities over the spaceI N+1 conditioned on X = 0 and X6= 0 respectively. The densities are with respect to the product measure| N ×ν N where| denotes the counting measure. Clearly, for any strategy g, C(I N+1 ) = log P g N (I N+1 ) Q g N (I N+1 ) (3.38) with probability 1 (see Proposition 1 [73]). The first step in analyzing Problem (P1) with the experiment selection strategy g fixed is to view it as a one-shot binary hypothesis testing problem in the following manner: N experiments are performed using the strategy g and then we observe the sequence I N+1 of experiments and observations. This observed sequence I N+1 can be viewed as a single observation. Based on I N+1 , we need to infer whether the system is safe (X =?) or it is unsafe (X6=?) using the inference strategyf. IfX =?, the density associated with the observed sequenceI N+1 isP g N and ifX6=?, then the density associated with the sequence I N+1 is Q g N . Thus,C(I N+1 ) is the log-likelihood ratio associated with this one-shot binary hypothesis testing problem. We can now use the strong converse (or the weak converse) [117] to obtain a lower bound on φ ∗ N . However, this bound will depend on the choice of strategy g. We can exploit various properties of the confidence levelC to obtain a strategy-independent lower bound onφ ∗ N . To do so, we will first establish the relationship between the total log-likelihood ratios Z n (j) and the confidence levelC. Lemma 3.5. For any experiment selection strategy g and for each 1≤n≤N, we have C(I n+1 ,ρ 1 ) =− log X j∈U exp log ˜ ρ 1 (j)−Z n (j) , where ˜ ρ 1 (j) =ρ 1 (j)/(1−ρ 1 (0)). Proof. The proof of this lemma can be found in [73]. We use Lemma 3.5 to decompose the confidence into a Kullback-Leibler divergence term and a sub-martingale (a simple i.i.d. sum in the case of symmetric systems). 57 Lemma 3.6 (Decomposition). For any experiment selection strategy g, we have C(I n+1 ) = h −D(β ∗ ||˜ ρ n+1 ) i + h ¯ Z n +D(β ∗ ||˜ ρ 1 ) i , where ˜ ρ n+1 (j) = ˜ ρ 1 (j)e −Zn(j) P k∈U ˜ ρ 1 (k)e −Zn(k) . Proof. See Section B.5.2 in the supplementary material. The decomposition in Lemma 3.6 allows us to obtain the bounds in Theorem 6.1. We can analyze the two terms in the decomposition separately. The key property to exploit is that ¯ Z n is a sum of i.i.d. variables and does not depend on the strategy g. Further, for the strong converse, we simply exploit the non-negativity of Kullback-Leibler divergence. For the achievability bound, we can show that the adaptive strategy keeps the Kullback-Leibler divergence term small by means of a Chernoff bound. The details of this proof are provided in Appendix B.4. 3.5 Numerical Results 3.5.1 Individual Sampling We first consider a setup with four components and the set of reals R as the observation space. The observation Y n given that X =x and U n =u is distributed as follows Y n ∼ N (1, 1) if x∩u =? N (−1, 1) otherwise, whereN (μ,σ 2 ) denotes the normal distribution with mean μ and variance σ 2 . 
We compare our deterministic adaptive strategy (DAS) in Definition 3.6 and the classical Chernoff-type open-loop randomized strategy (ORS) which selects each component with probability 0.25. Additionally, we consider a sampling strategy (RR) in which each component is selected in a round-robin manner. Clearly RR is deterministic and samples all components uniformly. The arguments for proving that ORS is asympotically optimal can be invoked for RR to show the same. Figure 3.1 depicts the error performance of these three strategies as well as lower bounds on the error probability. The first 58 bound is the weak bound from (3.17). There are two strong lower bounds: one that is applicable only for the strategy RR and the other is applicable for any sampling strategy. Strong converse bounds [115, 116] are primarily obtained by selecting an appropriate χ in the following inequality that holds for any real χ. logφ N ≥ log(P g ? [C(I N+1 )≤χ]− N )−χ. (3.39) Here, we use the convention that if x≤ 0, then logx . =−∞. Note that this bound depends on the strategy g and in order to choose the value of χ, we need to know the quantile function of the random variableC(I N+1 ) under the strategy g. For most strategies g, it is very difficult to characterize this quantile function. However, for RR, one can compute the quantile function (denoted by inv RR n ) easily. The RR bound in Figure 3.1 is obtained by setting χ n =inv RR n n + 1 √ n . (3.40) The following bound which is independent of the strategyg can be obtained using Lemmas 3.3 and 3.6 (see Appendix B.4.1 for more details). logφ N ≥ log(P ? [ ¯ Z n +D(β ∗ ||˜ ρ 1 )≤χ]− N )−χ. (3.41) Using this inequality, the Strong bound in Figure 3.1 is obtained by setting χ n =inv n n + 1 √ n . (3.42) In all the numerical experiments, the optimal threshold for the inference strategy is selected via binary search. While the performance curves are empirical averages computed via importance sampling, the converse bounds are theoretical and thereby exact. Notice that the performance of our strategy is very close to the strong converse bound indicating the tightness of the converse bound and efficiency of the designed strategy. We also note that the performance of ORS is generally worse than that of RR. Further, the strong converse for RR indicates that its performance cannot be better than the DAS strategy. 59 0 200 400 600 800 1000 1200 1400 1600 TimeHorizonN −800 −700 −600 −500 −400 −300 −200 −100 0 LogarithmofMisclassificationProbabilitylogφN(0) ORS RR DAS StrongBound RRbound WeakBound Figure 3.1: Performance curves and lower bounds for the 4 component system with individual sampling. We next increase the number of components from 4 to 128; performance curves are seen in Figure 3.2. Most key trends are maintained; however, as the number of components increases, we observe that the Chernoff-type asymptotically optimal strategy is substantially farther from the optimal performance in the non-asymptotic regime. This is due to the second-order term which is being captured by the corresponding converse bounds. Recall that the RR bound is based on the quantile function of the variableC(I n+1 ) which is closely related to the LLRs Z n (j) (see Lemma 3.5). The variance of Z n (j) scales as n/M. On the other hand, the strategy independent strong converse in (3.41) depends on the quantile function of ¯ Z n (j) whose variance scales as n/M 2 . This second-order effect of the variance is clear from the Figures 3.1 and 3.2. 
Further, Theorem 6.1 states that the performance of our DAS strategy is only a logarithmic term away from the strong converse. This is evident from these plots. We also observe that the strong converse bound does not lead to a very tight bound in this case and is only marginally better than the weak bound. Nonetheless, the strong bound has two key advantages: (i) it captures the first and second-order trends of our DAS strategy accurately, and (ii) it captures these trends even when N is a constant (such as 0.05). Recall that we need N → 0 for the weak bound to be asymptotically tight, but this is not needed for the strong bound. 60 0 200 400 600 800 1000 1200 1400 1600 TimeHorizonN −25 −20 −15 −10 −5 0 LogarithmofMisclassificationProbabilitylogφN(0) ORS RR DAS StrongBound RRbound WeakBound Figure 3.2: Performance curves and lower bounds for the 128 component system with individual sampling. 3.5.2 Group Sampling We consider a symmetric system with M = 16 components. The observation spaceY ={0, 1}. At any given time n, the agent can select any non-empty subset U n ∈U gr and the corresponding observation Y n is given by P[Y n = 1|U n =u,X =x] = $(|u|) if x∩u =? 1−$(|u|) otherwise, where$ :{1,..., 16}→ [0, 1] decreases linearly with $(1) = 0.6 and$(16) = 0.5. For this model, we compute arg max 1≤k≤M kD {1,...,k} {1} M = 5. (3.43) A plot of this KL-divergence metric as a function of the group size k is shown in Figure 3.3. This KL-divergence metric also represents the corresponding asymptotic error rate achieved when groups of size k are sampled uniformly. Larger groups are capable of testing many components simultaneously, but suffer from increased noise and thus may require more samples and vice-versa. For the Bernoulli observation model with linear scaling described above, this trade-off between group size and noise in observations is captured by the KL-divergence metric in Figure 3.3. It was 61 0 2 4 6 8 10 12 14 16 Groupsize k 0.000 0.002 0.004 0.006 0.008 0.010 0.012 0.014 KL-Divergencemetric Metric Figure 3.3: The KL-divergence metric in (3.31) as a function of the group size k. suggested in [154] that one can sample all the components as a group to test if the system has any anomalies or not. It is clear from the plot in Figure 3.3 that this strategy is inefficient and groups of size 5 are most efficient for our example. In this case, the Chernoff-type strategy is to randomly select a group of size 5 from a uniform distribution. Our strategy according Theorem 3.3 is to select the five components that are most likely to be anomalous according to the current posterior distribution. It is clear from Figure 3.4 that there is a large gap in the performance between the Chernoff-type strategy ORS and our DAS strategy. In the case of group sampling, it is difficult to obtain the strong converse bounds (as in (3.41)). This is because Lemma 3.3 does not hold in the group case. Finding similar converse bounds for the group sampling case is a problem for future work. 3.6 Conclusions Individual and group sampling strategies are investigated for active anomaly detection via a Neyman- Pearson formulation. The asymptotic detection rates are computed, are shown to be optimal and inform the design of new data-driven individual and group sampling strategies. Our data-driven strategies provide improved performance in the finite horizon regime over classical Chernof-type designs – the gain is substantial when the number of components to test is large. 
A methodology for computing a strong converse and achievability bounds is determined and applied to the finite 62 0 100 200 300 400 500 600 700 800 TimeHorizonN −14 −12 −10 −8 −6 −4 −2 LogarithmofMisclassificationProbabilitylogφN(0) DAS ORS WeakBound Figure 3.4: Performance curves and lower bounds for the sixteen component system with group sampling. horizon case. We generalized the novel class of experiment selection strategies in [73] to incorpo- rate infinite observation spaces. While we have focused on fast anomaly confirmation; anomaly identification can be addressed by combining our methodology with that of [28] or [154]. This is achieved by switching from our method to that of [28, 154] when confident. 63 Chapter 4 A Neural Network-based Framework for Strategy Design 4.1 Introduction Information theory provides us with a quantitative framework [131, 34] to analyze various notions associated with information processing such as communication and storage. Some of the major ad- vances in data storage and communication have been facilitated by information theory. Information theory has also been very successful in identifying performance limits such as channel capacity and compression rate. It guarantees the existence of policies that achieve optimal performance but in many cases, finding these optimal policies (like capacity-achieving encoding and decoding schemes) can be a difficult task. Many problems in statistics, such as hypothesis testing, also have strong connections with information theory. Stochastic control theory [79, 12] provides us with a framework to analyze sequential decision- making problems under uncertainty. Many real world decision-making problems can be modeled as Markov Decision Processes (MDPs) or Partially Observable Markov Decision Processes (POMDPs). It has been widely used in the areas of artificial intelligence, robotics and finance. Stochastic control theory provides us with strong tools such as Dynamic Programming (DP) that can help us characterize optimal solutions for these decision problems. For instance, optimal solutions for MDPs with complete model information can be computed efficiently using dynamic programming [12]. This efficiency, however, does not extend to POMDPs. It is known that finding optimal solutions for POMDPs in general is a PSPACE-hard problem [112]. Because of the computational hardness of these problem, various heuristic solutions are em- ployed in practice. For example, Point Based Value Iteration [113] is a well-known heuristic for 64 POMDPs. In this work, we examine the possibility of using deep learning to design better heuristics for problems in information and control theory. Deep learning is an emerging branch of machine learning and has found tremendous success in the areas of image and text processing [82]. Deep neural networks are universal approximators [64] and these networks can be trained in a supervised manner using the backpropagation algorithm [55]. Deep neural networks have lately been used to solve problems in communication [44, 45] and reinforcement learning [92]. In this work, we consider the problem of active sequential hypothesis testing, which involves a combination of information and control theory. The aim of active hypothesis testing is to infer an unknown hypothesis based on observations. The agent can adaptively make queries to obtain observations and we seek to design a sequential query selection policy that can reliably infer the underlying hypothesis using few queries. 
We define a notion of confidence and reformulate this problem as a confidence maximization problem in a fixed sample-size setting. The asymptotic version of the confidence maximization problem can be seen as an infinite-horizon, average-reward MDP. We design heuristics for this confidence maximization problem using deep neural networks. We first examine a design framework based on Recurrent Neural Networks (RNNs). RNNs are a category of neural networks with a recurring neural unit which maintains a hidden state vector for each input instance. They have been used for solving sequential problems with success [137, 29]. Their ability to store long-term dependencies of sequences within hidden states makes them apt for the task. One of the most popular and successful variants of recurrent networks are Long-Short Term Memory (LSTM) networks [61] which maintain their state using forget, input and output gates. The underlying structure of recurrent networks fits naturally for active hypothesis testing. We explore an LSTM architecture that can be trained simultaneously to adaptively select queries as well as learn to infer the true hypothesis based on observations obtained. We observe that the model manages to learn to infer the hypothesis for a given set of observations but fails to learn the query selection policy. We discuss the details of this architecture in Appendix C.1. We then design a heuristic based on deep reinforcement learning [92]. In this heuristic, the agent simulates the MDP associated with the confidence maximization problem. Based on its simulated experience, the agent tries to learn the optimal query selection policy using deep reinforcement learning. We observe in our numerical experiments that this heuristic policy comes very close to 65 optimality. The details of this approach are discussed in Section 4.4. In addition to the neural network based heuristics, we introduce a heuristic based on a KL-divergence zero-sum game. This policy is adaptive and also achieves near-optimal performance in our numerical experiments. The details of the policy are discussed in Section 4.5.3. The rest of the paper is organized as follows. In Section 6.8, we describe the prior works related to active hypothesis testing and deep learning. In Section 8.1.1, we discuss the mathematical notation used in this paper. The confidence maximization problem is formulated in Section 4.2 and expressed as an MDP in Section 4.3. The deep reinforcement learning approach for active hypothesis testing is discussed in Section 4.4. In Section 4.5 we describe a few policies from prior works and compare them with our designed policies based on numerical experiments. We conclude the paper in Section 4.6. 4.1.1 Related Work Active hypothesis testing was first formulated by Chernoff in [27] inspired by Wald’s Sequential Probability Ratio Test (SPRT) [151]. Thereafter, this work has been extended in various ways. In [107], the problem of multihypothesis testing is considered in both fixed sample size and sequential settings. In [99], a Bayesian setting is examined with a random stopping time and is formulated as a POMDP. Upper and lower bounds on the optimal value function of this POMDP were derived and some heuristic policies were proposed. All these works provide heuristic policies that are asymptotically optimal. However, optimal policies for the non-asymptotic formulations are not known. Furthermore, most of the heuristics proposed in these works are almost open-loop and randomized policies. 
This motivates us to seek better heuristics. The idea of using deep neural networks to solve POMDPs is relatively less explored. Reinforce- ment learning usually assumes perfect state observability. In [57, 68], the authors aim to perform deep reinforcement learning under partial observability. They use a combination of convolutional neural networks and recurrent neural networks to achieve this. In this model, the agent does not have model information and thus, cannot directly make Bayesian belief updates. The network model in [39] is very similar to our deep Q-network. However, the model in [39] cannot be used directly for hypothesis testing due to some issues discussed in Section 4.4.4. We make appropriate modifications to rectify these issues. To the best of our knowledge, deep neural networks have not 66 ρ(1) h = 1 u 1 y 1 u 1 . . . u 2 . . . y 2 u 1 . . . u 2 . . . u 2 . . . . . . h = 2 u 1 . . . . . . u 2 y 1 u 1 . . . u 2 . . . y 2 u 1 . . . u 2 . . . Nature Agent Nature Agent Figure 4.1: Agent’s choices and subsequent observations represented as a tree. Every instance of the probability space can be uniquely represented by a path in this tree. been used in the context of active hypothesis testing and our neural network design framework is the first of its kind. 4.2 Problem Formulation LetH⊂N be a finite set of hypotheses and let H be the true hypothesis. At each time n∈N, the agent can perform an experimentU n ∈U and obtain an observationY n ∈Y. The relation between U n and Y n is given by Y n =ξ(H,U n ,W n ), (4.1) where W n is a collection of independent primitive random variables. Thus, all the observations are independent conditioned on the hypothesis and the experiment. The probability of observing y after performing an experiment u under hypothesis h is denoted by p u h (y). For simplicity, let us also assume that the setsU andY are finite. The information available at the agent at time n is I n ={U 1:n−1 ,Y 1:n−1 }. (4.2) Actions of the agent at time n can be functions of I n (see Fig. 4.1). Let the experiment selection policy be U n =g n (I n ). (4.3) 67 0 0.2 0.4 0.6 0.8 1 −10 −5 0 5 10 p log p 1−p Figure 4.2: The logit function is the inverse of the logistic sigmoid function 1/(1 +e −x ). It is widely used in statistics and machine learning to quantify confidence level [65]. The sequence of all the policies{g n } is denoted by g which is referred to as a strategy. Let the collection of all such strategies beG. Using the available information, the agent forms a posterior belief ρ(n) on H at time n which is given by ρ h (n) =P[H =h|Y 1:n−1 ,U 1:n−1 ]. (4.4) Definition 4.1 (Bayesian Log-Likelihood Ratio). The Bayesian log-likelihood ratioC h (ρ) associated with an hypothesis h∈H is defined as C h (ρ) := log ρ h 1−ρ h . (4.5) The Bayesian log-likelihood ratio (BLLR) is the logarithm of the ratio of the probability that hypothesish is true versus the probability that hypothesis h is not true. BLLR can be interpreted as a confidence level on hypothesis h being true in logit form, which is also referred to as log-odds in statistics [65]. The logit function is the inverse of the logistic sigmoid function. Notice that the posterior belief ρ h and BLLR are related by the bijective increasing logit function (See Fig. 4.2). The objective is to design an experiment selection strategy g for the agent such that the confi- dence levelC H on the true hypothesis H increases as quickly as possible. 
In other words, the total 68 reward after acquiring N observations is the average rate of increase in the confidence level on the true hypothesis H and is given by C H (ρ(N + 1))−C H (ρ(1)) N . (4.6) More explicitly, we seek to design a policy g that maximizes the asymptotic expected reward R(g) which is defined as R(g) := lim N→∞ inf 1 N E g [C H (ρ(N + 1))−C H (ρ(1))]. Since the initial confidenceC H (ρ(1)) is a constant, we can ignore it for large values ofN. Henceforth, we refer to this problem as the Expected Confidence Maximization (ECM) problem. Remark 4.1. Generally, the objective is to maximize the decay rate of Bayesian error probability [107], or to use a stopping time and optimize a linear combination of expected stopping time and expected error probability [107, 99]. Our problem formulation is mathematically different from these frameworks but conceptually, all the formulations aim to capture the same phenomenon which is to infer the true hypothesis quickly and reliably. The precise mathematical relationship between these formulations is yet to be understood and is an avenue for future work. To describe an upper bound on the optimal performance of the confidence maximization prob- lem, we state the following theorem without proof. Theorem 4.1. For any query selection policy g and any hypothesis h, we have lim N→∞ sup 1 N E g [C H (ρ(N + 1))|H =h]≤R ∗ h , (4.7) where R ∗ i := max α∈ΔU min j6=i X u α u D(p u i ||p u j ). (4.8) Further, if the underlying hypothesis H =h, we have lim N→∞ sup 1 N [C h (ρ(N + 1))]≤R ∗ h , (4.9) 69 with probability 1. The upper bound on the expected confidence rate can be obtained using dynamic programming for infinite-horizon, average reward MDPs and the same inequality in an almost sure sense can be obtained with the help of Strong Law of Large Numbers (SLLN). 4.3 Markov Decision Process Formulation In this section, we show that the problem of maximizing R(g) can be formulated as an infinite- horizon, average-cost MDP problem. The state of the MDP is the posterior beliefρ(n). The agent’s observation and action spaces are the same as in Section 4.2. The posterior belief is updated using Bayes’ rule. Thus, if U n =u and Y n =y, we have ρ h (n + 1) = ρ h (n)p u h (y) P h 0ρ h 0(n)p u h 0 (y) . (4.10) For convenience, let us denote the Bayes’ update in (4.10) by ρ(n + 1) =F (ρ(n),U n ,Y n ). (4.11) Thus, we have P[ρ(n + 1) =F (ρ(n),u,y)|I n ,U n =u] = X y:ρ(n+1)=F (ρ(n),u,y) P[Y n =y|I n ,U n =u] = X y:ρ(n+1)=F (ρ(n),u,y) X h∈H ρ h (n)p u h (y). Clearly, the dynamics of this system are controlled Markovian. The expectation of the average confidence rate under a strategy g is given by R N (g) . = 1 N E g [C H (ρ(N + 1))−C H (ρ(1))] (4.12) = 1 N E g N X n=1 [C H (ρ(n + 1))−C H (ρ(n))] (4.13) =: 1 N E g N X n=1 r(ρ(n),U n ,Y n ). (4.14) 70 Thus, the instantaneous reward for this MDP is r(ρ,u,y), i.e. if the state is ρ, the experiment performed is u and the observation is y, then the instantaneous reward is given by r(ρ,u,y) =C(F (ρ,u,y))−C(ρ), (4.15) where for any belief state ρ C(ρ) = X i∈H ρ i log ρ i 1−ρ i = X i∈H ρ i C i (ρ). (4.16) We refer to the functionC(ρ) as Average Bayesian Log-Likelihood Ratio (ABLLR). Note that this is almost identical to the notion of average log-likelihood ratio U(ρ) =−C(ρ) in [95], which was used to design a greedy heuristic for the active hypothesis testing problem. 
The objective in this MDP problem is to find a strategyg ∗ that maximizes the following average expected reward R(g) = lim N→∞ inf 1 N N X n=1 E g (r(ρ(n),U n ,Y n )). (4.17) 4.4 Deep Q-learning for Hypothesis Testing In this section, we describe our deep learning approach for policy design for active sequential hypothesis testing. We use a variant of the Deep Q-Network (DQN) introduced in [92], which is a learning agent that combines reinforcement learning with deep neural networks. It is an adaptation of a popular off-policy Temporal Difference (TD) learning algorithm, known as Q-learning [138]. We create an artificial environment that simulates the Bayesian belief update (4.10) over mul- tiple episodes. The duration (N) of each episode is fixed. At the beginning of each episode, the underlying hypothesis H is randomly selected with probability ρ(1) and it remains fixed over the episode’s duration. At any given time, the agent interacts with this environment by making a query (u) based on the current state (ρ) using an appropriate exploration strategy (e.g. -greedy explo- ration [138]). The environment then reveals the next state (ρ 0 ) and its associated reward (r) to the agent. We refer to (ρ,u,ρ 0 ,r) as an experience tuple. Using this information, the agent updates its target policy g. This iterative simulated learning process, schematically illustrated in Figure 4.3, 71 Figure 4.3: The agent performs a query u at some state ρ. The environments simulates the belief update using u and ρ to generate the update belief ρ 0 and its associated reward r. is repeated until a convergence criterion is met. We elucidate this methodology in greater detail in the following sub-sections. 4.4.1 Discounted Reward Formulation The Q-learning algorithm is designed for a discounted reward MDP formulation. Therefore, we first convert our average reward formulation in Section 4.3 to a discounted reward formulation with a discount factor γ < 1. For an experiment selection policy g, let R d (g) :=E g " ∞ X n=1 γ n−1 r(ρ(n),U n ,Y n ) # , (4.18) be the total discounted reward. Sincer(ρ,u,y) is uniformly bounded, the discounted rewardR d (g) is also bounded and well-defined for any policy g. Our objective now is to find a strategy g ∗ that maximizes R d (g). Remark 4.2. When the state and action spaces of an MDP are finite, it is well-known [12] that the discounted reward and the average reward formulations are equivalent for a sufficiently large discount factor γ. However, the state space herein is uncountably infinite and this equivalence may not necessarily hold. Nonetheless, we observe in our numerical experiments that the solution to the discounted reward formulation is near-optimal with respect to the average reward formulation. 72 4.4.2 Action-value Function The action-value function [138] for a policy g is defined as q g (ρ,u) :=E g " ∞ X k=0 γ k r k+1 |ρ(1) =ρ,U 1 =u # , (4.19) where r n = r(ρ(n),U n ,Y n ). Let g ∗ be an optimal policy with respect to the discounted reward formulation and let q ∗ (ρ,u) be its corresponding action-value function. Then the optimal action- value function satisfies the fixed point equation, also known as Bellman optimality equation [138], q ∗ (ρ,u) =E[r(ρ,u,Y ) + max u 0 q ∗ (F (ρ,u,Y ),u 0 )], (4.20) for every belief stateρ and queryu. Note that the source of randomness in the fixed point equation is the variable Y and the expectation is with respect to the distribution P h∈H ρ h p u h (y). 
Further, if a policy g is such that, for every belief state ρ, g(ρ) = arg max u q ∗ (ρ,u), (4.21) then g is an optimal policy with respect to the discounted reward formulation. Thus, finding an optimal policy g ∗ can be reduced to finding the optimal action-value function q ∗ . The optimal action-value function can be obtained using the Q-learning algorithm in [138] when the state and action spaces are finite. However, the state space is infinite in our case and thus, we need a different approach to find the optimal action-value function. 4.4.3 Action-value Function as a Deep Neural Network The first challenge in performing Q-learning with an infinite state space is to find an appropriate representation for the action-value function. Notice that the posterior beliefρ is a finite-dimensional vector and the action space is finite. We can thus represent the action-value function as a deep neural network which takes posterior belief ρ as an input and outputs the action-value vector of dimension|U| as illustrated in Figure 4.4. The neural network is parameterized by a finite collection of weights θ and henceforth, we refer to the output of this neural network as Q θ (ρ,u). 73 Figure 4.4: The neural network takes the belief vector ρ as the input and outputs the Q-values for each action u. The hidden layers are fully connected with non-linear activation. Only the final layer has linear activation. The second challenge lies in making the Q-learning updates. We would ideally like to make an update of the following form Q 0 (ρ,u)← Q(ρ,u) +ζ[r +γ max u 0 Q(ρ 0 ,u 0 )−Q(ρ,u)], where (ρ,u,ρ 0 ,r) is an experience tuple. Notice, however, that the action-value function is charac- terized by a collection of weights (θ) and thus, one has to update these weights so that the neural network outputs the corresponding updated Q-values Q 0 (ρ,u). To achieve this, we can modify the weights θ using gradient descent such that the following Mean-Squared Error (MSE) loss is minimized L(θ) = (Q 0 (ρ,u)−Q θ (ρ,u)) 2 . (4.22) This naive update rule can make the network unstable because it closely fits the network to the updated Q-value Q 0 (ρ,u) for the current state-action pair (ρ,u) but the Q-values associated with other state-action pairs may be disturbed. To ensure this does not happen, a method known as experience replay [92] is employed. At each time, the agent stores experience tuples (ρ,u,ρ 0 ,r) in its memoryD. Whenever the weights are updated, a random mini-batch B of experience tuples is 74 sampled from the memory and the MSE loss is minimized using gradient descent over this mini- batch with the following loss function L(θ) = X (ρ,u,ρ 0 ,r)∈B (Q 0 (ρ,u)−Q θ (ρ,u)) 2 . (4.23) 4.4.4 Additional Challenges Generally, it is necessary to explore all the states to learn the state transition and reward structure. However, since the state space is uncountably infinite, we cannot possibly explore all the states. Therefore, we choose a large value for (≈ 0.8) so that the state space is sufficiently explored. We observe in our numerical experiments that training over a large number of episodes results in an efficient query selection policy despite this exploration issue. Another challenge is that as the belief on the true hypothesis gets close to 1, the belief on all the alternate hypotheses becomes very small. Improvement in the confidence level, i.e. the instantaneous rewardr is very sensitive to the belief on alternate hypotheses. 
Thus, the DQN fails to select optimal queries when the belief on alternate hypotheses is too small. To counter this, we normalize the belief on the alternate set of hypotheses and augment it to the belief vector. The normalized alternate belief is denoted by ˜ ρ and is given by ˜ ρ j = ρ j 1−ρ i , (4.24) where i is the most likely hypothesis with respect to the belief ρ and j6=i. The overall Deep Q-learning algorithm for active sequential hypothesis testing is described in Algorithm 1. The agent and the environment operate in an interleaved manner. Their combined behavior is captured by Algorithm 1. The comment on each instruction specifies whether the instruction is meant for the agent (A) or the artificial environment (E). The Q-value update is denoted by QUP θ and is given by QUP θ (ρ,u,ρ 0 ,r) =Q θ (ρ,u) +ζ[r +γ max u 0 Q θ (ρ 0 ,u 0 )−Q θ (ρ,u)]. 75 Note that ζ is a small constant less than 1. Note that the letter α is generally used in place of ζ. We select ζ to avoid notational conflict with distribution α ∗ i which will be introduced later. Algorithm 1 Deep Q-learning algorithm for active sequential hypothesis testing 1: Initialize memoryD to capacity K . A 2: Initialize DQN with random weights θ . A 3: for episode = 1, EpiNum do 4: Randomly select H with prob. ρ(1) . E 5: Initialize state ρ =ρ(1) . A 6: for n = 1,N do 7: With probility , select random query u . A 8: Otherwise, select u = arg max u Q θ (ρ,u) . A 9: Perform query u . A 10: Generate Y =ξ(H,u,W ) . E 11: Update belief ρ 0 =F (ρ,u,Y ) . E 12: Compute reward r =C(ρ 0 )−C(ρ) . E 13: Reveal ρ 0 and r to A . E 14: Store (ρ,u,ρ 0 ,r) inD . A 15: Assign ρ←ρ 0 . A 16: Sample random minibatchB fromD . A 17: Duplicate DQN θ 0 ←θ . A 18: for epoch = 1, EpochNum do 19: for each (ˆ ρ, ˆ u, ˆ ρ 0 , ˆ r) inB do 20: Q 0 (ˆ ρ, ˆ u)← QUP θ (ˆ ρ, ˆ u, ˆ ρ 0 , ˆ r) . A 21: Perform gradient descent step on . A 22: (Q 0 (ˆ ρ, ˆ u)−Q θ 0(ˆ ρ, ˆ u)) 2 23: Assign θ←θ 0 . A 24: return DQN θ 4.5 Numerical Experiments In this section, we numerically compare our DQN model with other popular heuristics used for active hypothesis testing. We also propose a new heuristic based on a Kullback-Leibler divergence zero- sum game and demonstrate numerically that this heuristic’s performance is close to the maximum achievable confidence rate. We first briefly describe all the heuristics we use in our experiments. 76 4.5.1 Extrinsic Jensen-Shannon (EJS) Divergence Extrinsic Jensen-Shannon divergence as a notion of information was first introduced in [96]. Using our notation, EJS for a queryu at some belief stateρ is simply the expected instantaneous reward, i.e. EJS(ρ,u) =E[C(F (ρ,u,Y ))−C(ρ)]. (4.25) Notice that the only random variable in the expression above is Y and the expectation is with respect to the distribution P h∈H ρ h p u h (y) onY. The EJS heuristic selects the experiment u that maximizes EJS(ρ,u) for a given state ρ. 4.5.2 Open Loop Verification (OPE) Open loop verification policy is the most widely used policy in prior literature [99, 107]. In this heuristic, the agent first explores for a while using an appropriate exploration strategy. We refer to this phase as the exploration phase. Whenever the confidence on some hypothesisi is large enough, i.e. ρ i > ¯ ρ, the queries are randomly selected in an open-loop manner from the distribution α ∗ i which is defined as α ∗ i := arg max α∈ΔU min j6=i X u α u D(p u i ||p u j ), (4.26) where ΔU is the set of all distributions over the set of experimentsU. 
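The max-min in (4.26) is the value of a small matrix game and can be computed with a linear program. Below is a sketch using scipy, with hypothesis h_0 and the two queries of Table 4.1 (Section 4.5.4) as an illustrative instance; for that instance it returns α*_0 ≈ (0.5, 0.5), reflecting the fact that both queries must be used under h_0.

```python
import numpy as np
from scipy.optimize import linprog

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

# Conditional distributions p^u_h(y) from Table 4.1 (queries u1, u2; hypotheses h0-h2)
p = {
    "u1": np.array([[0.8, 0.2], [0.2, 0.8], [0.8, 0.2]]),
    "u2": np.array([[0.8, 0.2], [0.8, 0.2], [0.2, 0.8]]),
}
queries = list(p)
i = 0                                    # compute alpha*_i for hypothesis h0
alt = [j for j in range(3) if j != i]

# Payoff matrix of the zero-sum game: D[u, j] = D(p^u_i || p^u_j)
D = np.array([[kl(p[u][i], p[u][j]) for j in alt] for u in queries])

# LP: maximize t subject to sum_u alpha_u D[u, j] >= t for all j, alpha in the simplex
n_u = len(queries)
c = np.zeros(n_u + 1); c[-1] = -1.0                       # minimize -t
A_ub = np.hstack([-D.T, np.ones((len(alt), 1))])          # t - alpha^T D[:, j] <= 0
b_ub = np.zeros(len(alt))
A_eq = np.hstack([np.ones((1, n_u)), np.zeros((1, 1))])   # sum_u alpha_u = 1
res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
              bounds=[(0, None)] * n_u + [(None, None)])
alpha_star, value = res.x[:n_u], res.x[-1]
print(alpha_star, value)                                  # ~[0.5, 0.5], ~0.416
```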
We refer to this phase as the verification phase. In our implementation, we use EJS for the exploration phase and set the threshold ¯ ρ = 0.7. 4.5.3 KL-divergence Zero-sum Game (HEU) This heuristic is similar to OPE but the query selection policy in the verification phase is adaptive. For each hypothesisi, we can formulate a zero-sum game [110] in which the first player (maximizing) selects an experiment u∈U and the second player (minimizing) selects an alternate hypothesis 77 j∈ ˜ H i :=H\{i}. The payoff for this zero-sum game is the KL-divergence D(p u i ||p u j ). In the verification phases, i.e., whenever ρ i > ¯ ρ, the agent picks an experiment u that maximizes P i (ρ,u) := X j6=i ˜ ρ j D(p u i ||p u j ), where ˜ ρ j =ρ j /1−ρ i . This strategy can be interpreted as the first player’s best-response when the second player uses the mixed strategy ˜ ρ j to select an alternate hypothesis. Note that the mixed strategyα ∗ i used in OPE is an equilibrium strategy for the maximizing player in the zero-sum game described above. 4.5.4 Simulation Setup To simulate these heuristics, we first consider a simple setup with three hypotheses and two queries. The conditional distributions p u i (y) for each of these queries are illustrated in Figure 4.1. y = 0 y = 1 h 0 0.8 0.2 h 1 0.2 0.8 h 2 0.8 0.2 (a) Query u 1 y = 0 y = 1 h 0 0.8 0.2 h 1 0.8 0.2 h 2 0.2 0.8 (b) Query u 2 Table 4.1: Conditional distributions p u i (y) for each query The queries are designed such that when H =h 0 , the agent is forced to make both queries u 1 and u 2 . This is because hypotheses h 0 and h 2 are indistinguishable under query u 1 and similarly, hypothesesh 0 andh 1 are indistinguishable under query u 2 . However, whenH =h 1 , the agent can eliminate h 0 and h 2 simultaneously using query u 1 alone and similarly, when H = h 2 , the agent only needs to perform the query u 2 . We observe that all the four heuristics manage to learn this scheme of query selection. However, the rate at which confidence is maximized is different for each heuristic. We illustrate the evolution of expected confidence rate R N under hypothesis h 0 in Figure 4.5. The heuristics DQN, EJS and HEU come very close to the maximum achievable rate computed in Chapter 2. OPE eventually achieves maximal rate but very slowly. 78 0 10 20 30 40 50 TimeHorizonN 0.28 0.30 0.32 0.34 0.36 0.38 0.40 0.42 ExpectedConfidenceRateRN DQN EJS HEU OPE OptimalRate Figure 4.5: Evolution of expected confidence rate R N under hypothesis h 0 in the first setup with queries u 1 and u 2 . Note the subpar performance of OPE in this setup. In the second experimental setup, we include two additional queries u 3 and u 4 characterized by the distributions in Figure 4.2. When H =h 0 the queries u 3 and u 4 together can eliminate at a much faster rate than u 1 and u 2 . Intuitively, this is because when the agent performs u 3 and observes y = 1, the belief on h 1 decreases drastically because y = 1 is extremely unlikely under hypothesis h 1 . Similarly, u 4 is very effective in eliminating h 2 . y = 0 y = 1 h 0 0.8 0.2 h 1 1−δ δ h 2 0.8 0.2 (a) Query u 3 y = 0 y = 1 h 0 0.8 0.2 h 1 0.8 0.2 h 2 1−δ δ (b) Query u 4 Table 4.2: Conditional distributions p u i (y) for each query. Here, δ = 0.0000001. The evolution of expected confidence rate under hypothesis h 0 with additional experiments u 3 andu 4 is shown in Figure 4.6. The heuristics DQN, HEU and OPE select queries u 3 andu 4 under hypothesish 0 . 
But the greedy heuristic EJS usually selects only u 1 andu 2 and fails to realize that queries u 3 and u 4 are more effective under hypothesis h 0 . The greedy EJS approach fails because queries u 3 and u 4 are constructed in such way that they are optimal over longer horizons but are sub-optimal over shorter horizons. Thus the assumption required for asymptotic optimality of EJS 79 0 20 40 60 80 100 TimeHorizonN 0.2 0.4 0.6 0.8 1.0 1.2 1.4 ExpectedConfidenceRateRN DQN HEU OPE EJS OptimalRate Figure 4.6: Evolution of expected confidence rate R N under hypothesis h 0 in the second setup with additional queries u 3 and u 4 . Note the subpar performance of OPE and EJS in this setup. in [96] does not hold in this setup. This also demonstrates that DQN is not simply selecting its queries greedily and manages to learn the long-term consequences of selecting queries. 4.6 Conclusion In this chapter, we considered a computational framework for designing experiment selection strate- gies in active hypothesis testing. We defined a notion of confidence, called Bayesian log-likelihood ratio and reformulated the hypothesis testing problem as a confidence maximization problem which can be seen as an infinite-horizon, average-reward MDP over a finite-dimensional belief space. We proposed a deep reinforcement learning based policy design framework for this MDP. We also proposed an experiment selection policy based on a KL-divergence zero-sum game. Using numer- ical experiments, we compared our policies with those in prior works and demonstrated that our designed policies perform significantly better than existing methods in some scenarios. 80 Chapter 5 Non-parametric Target Localization Target localization, from measurements of a signal induced by the target, is a ubiquitous problem. In key cases, however, the target signature admits a unimodal structure such that the location of the target coincides with the peak of the signal, and the problem of localizing the target is reduced to localizing the peak of the signal. Acquiring measurements can be expensive and ideally, we would like the strategy to localize using as few samples as possible. It is well known [33] that when the measurements are noiseless, the peak of the signal can be localized using a group testing approach as follows: a few locations are measured and the maximum among these samples is found. Due to unimodality, the true peak is in the vicinity of the peak of the sub-sampled signal and subsequent search is restricted to this smaller region. If the decay of the signal is strictly monotonic, then the search space reduction is geometric and the peak can be localized using O(log(N)) queries where N is the number of possible peak locations. This greedy approach is order optimal. However, the classical approach is not necessarily optimal in the presence of noise. The noise may lead to the formation of false local maxima due to which the classical approach would fail almost invariably. Thus, an appropriate strategy has to be designed that takes noise into account based on its statistical properties. Moreover, the classical algorithm samples greedily to minimize the number of measurements, i.e. samples are acquired only in the region where the peak is believed to exist and all the samples outside are dropped. Surprisingly, our numerical experiments suggest that this greedy approach is not always optimal in the presence of noise. 
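For reference, the classical noiseless reduction described above can be sketched as a ternary-style search over a strictly unimodal array, which localizes the peak with on the order of log N samples; it is precisely this scheme that the false local maxima induced by noise break.

```python
def noiseless_peak(sample, lo, hi):
    """Locate the peak of a strictly unimodal array on indices [lo, hi], where
    sample(i) returns the noiseless measurement at location i. Each pair of
    samples discards a constant fraction of the search space, so roughly
    O(log N) measurements suffice."""
    while hi - lo > 2:
        m1 = lo + (hi - lo) // 3
        m2 = hi - (hi - lo) // 3
        if sample(m1) < sample(m2):
            lo = m1 + 1          # the peak cannot lie in [lo, m1]
        else:
            hi = m2 - 1          # the peak cannot lie in [m2, hi]
    return max(range(lo, hi + 1), key=sample)

# Example: a unimodal vector with peak at index 700 (illustrative)
v = [-abs(i - 700) for i in range(1000)]
print(noiseless_peak(v.__getitem__, 0, len(v) - 1))
```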
We believe that these aspects are instrumental in designing a provably optimal algorithm and in this paper, we present our preliminary results on the optimality of greedy sampling for robust peak localization. The main contributions of this paper can be summarized as follows: 81 1. We formulate the problem of active peak localization under non-parametric priors using novel notions of feasible peak locations and localizing strategies. 2. We extend prior work [32] on non-Bayesian localization wherein feasibility is examined via unimodal regression [136]. Prior work employed a static binary search type strategy; the current approach is data adaptive. 3. Prior work [160, 32] often requires assumptions to ensure geometric reduction and O(logN) sample complexity. Herein, we examine the impact of each sample thus enabling the consid- eration of all sample complexities between O(logN) and O(N). 4. We characterize the cases where greedy sampling is not optimal; and show that these cases occur with low probability and this probability is bounded. This work is motivated by prior work [31, 32] on target localization wherein the search space is two-dimensional and the signature admits both separability and unimodality. Separability can be exploited to obtain a coarse reconstruction of the signature with very few measurements using low rank matrix reconstruction [22], and the top singular vectors of this coarse estimate can be used to localize the peak in a binary search manner. For clarity, we restrict our discussion to one- dimensional unimodal arrays or vectors in this work, which is a key element of [31, 32]. However, our methods and bounds are easily generalized to higher dimensional spaces as well. For a broader perspective on active target localization problems, we refer the reader to works on controlled sensing [106], active hypothesis testing [97], active target detection [30, 31], and generalized binary search [108, 98]. These works assume the existence of posterior distributions and likelihood functions which are not well-defined in our case due to lack of prior information on the signal. Some of the more recent works that are very close to our work are [32] and [160]. The observation and noise models in these works are different from ours and they make additional growth assumptions to ensure geometric search space reduction. Another approach [119, 160] obtains information theoretic lower bounds to show optimality. Our goal is to bypass the computation of these lower bounds for proving optimality of our algorithm using a different approach. 82 5.0.1 Notation and Auxiliary Results We use lowercase boldface letters to denote column vectors (e.g. z) and uppercase boldface letters to denote matrices (e.g. A). Random variables/vectors are denoted by italicized boldface letters (e.g. v). The global search space is an ordered setR 0 :={1,...,N}. We denote a region within the global search space by anR with appropriate subscripts. An index setI in a regionR is an ordered subset ofR. For an index setI, H I is a projection matrix whose rows are transposed canonical vectors e i for each i∈I. The set of all unimodal vectors inR n with peak at i isU n i , i.e. it is the set of all vectors x∈R n such that e T j x≤e T j+1 x if 0<j mσ 2 + √ mη 2 i ≤ exp " − η 4 8σ 4 ∧ √ mη 2 8σ 2 !# , we haveP[N p ]≥ 1−p. Thus, the definition of the feasible peak locations ensures that an index j inH I v is a feasible peak if and only if there exist vectors z∈U m j and n 0 ∈N p such thatH I v = z +n 0 . 
This further implies thatP[π∈g ∗ (I)]≥ 1−p. Definition 5.2 (Localizing Strategies). Letψ := (f,g,h) be a strategy such that for everyv 0 ∈U N , and for given constants 0b (k,l) n |I n ]≤δ n /2 (6.12) P ρ 1 [p (k,l) 1 time-steps. Thus, Player i’s information at time t is given by I i t ={Y 1:2 1:t−s ,U 1:2 1:t−s ,Y i t−s+1:t ,U i t−s+1:t−1 }. The common nformation C t = {Y 1:2 1:t−s ,U 1:2 1:t−s }, which is increasing with time. Player i’s private information isP i t ={Y i t−s+1:t ,U i t−s+1:t−1 } which satisfies Assumption 8.1. Notice that the new information received by Player 1 at time t is Y 1 t ,Y 2 t−s ,U 2 t−s . Thus, in our model, players may receive signals that depend on past states and actions. 6. Bounded private memory: Consider a setup in which players’ actions are common information. Each player stores its observations on a private device with bounded memory. Let the size of this memory be s. Player i’s information is given by I i t ={Y i t−s:t ,U 1:2 1:t−1 }. In this case, the common information C t = U 1:2 1:t−1 and for player i, private information P i t = Y i t−s:t . Clearly, this information structure satisfies Assumption 8.1. Also, note that players do not have perfect recall in this game. 7. Bounded private memory with strategic memory updates: Consider a setup where there is no common information and players do not have perfect recall. Instead, at each time t, player i 116 has a private memory state M i t that can take values in a finite setM. Based on its current memory state, the player picks an action U i t to influence the state evolution and a memory update actionL i t . HereL i t is an action that affects the memory update through the following fixed transformation: M i t+1 = ξ i t+1 (M i t ,Y i t+1 ,L i t ). This model can be viewed as an instance of our general model with M i t as player i’s private information and (U i t ,L i t ) as the player’s actions. 8. Information structure in [149]: In this model, playeri has a private stateX i t ,i = 1, 2. Player i knows its private state and both players’ actions are commonly observed. This information structure can be seen as a special case of our model in the following manner: let the state X t := (X 1 t ,X 2 t ) and the observation processes Y i t = X i t for i = 1, 2. Define the information sets at timet asI i t ={Y i 1:t ,U 1:2 1:t−1 }. In this case,C t ={U 1 1:t−1 ,U 2 1:t−1 } andP i t =Y i 1:t . Clearly, this information structure satisfies Assumption 8.1. Similarly, information structures in [111] and [102] can also be seen as special cases of our model. 7.2.3 Strategies and Values Players can use any information available to them to select their actions and we allow behavioral strategies for both players. Thus, player i chooses a distribution δU i t over its action space using a control law g i t :I i t → ΔU i t , i.e. δU i t =g i t (I i t ) =g i t (P i t ,C t ). (7.5) Player i’s action at time t is randomly chosen fromU i t according to the distribution δU i t . We will at times refer to δU i t as player i’s behavioral action at time t. It will be helpful for our analysis to explicitly describe the randomization procedure used by the players. To do so, we assume that player i has access to i.i.d. random variables K i 1:T that are uniformly distributed over the interval (0, 1]. The variablesK 1 1:T ,K 2 1:T are independent of each other and of the primitive random variables. 
Further, player i has access to a mechanism κ that takes as input K i t and a distribution overU i t and generates a random action with the input distribution. Thus, player i’s action at time t can be written as U i t =κ(g i t (I i t ),K i t ). 117 Remark 7.1. One choice of the mechanism κ can be described as follows: SupposeU i t ={1, 2,..n} and the input distribution is (p 1 ,...p n ). We can partition the interval (0, 1] into n intervals (a i ,b i ] such that the length of ith interval is b i −a i =p i . Then, U i t =k if K i t ∈ (a k ,b k ] for k = 1,...,n. The collection of control laws g i = (g i 1 ,...,g i T ) is referred to as the control strategy of player i, and the pair of control strategies (g 1 ,g 2 ) is referred to as a strategy profile. Let the set of all possible control strategies for player i beG i . The total expected cost associated with a strategy profile (g 1 ,g 2 ) is J(g 1 ,g 2 ) :=E (g 1 ,g 2 ) " T X t=1 c t (X t ,U 1 t ,U 2 t ) # , (7.6) where c t :X t ×U 1 t ×U 2 t →R is the cost function at time t. Player 1 wants to minimize the total expected cost, while Player 2 wants to maximize it. We refer to this zero-sum game as Game G. Definition 7.1. The upper value of the game G is defined as S u (G) := inf g 1 ∈G 1 sup g 2 ∈G 2 J(g 1 ,g 2 ). (7.7) The lower value of the game G is defined as S l (G) := sup g 2 ∈G 2 inf g 1 ∈G 1 J(g 1 ,g 2 ). (7.8) If the upper and lower values are the same, they are referred to as the value of the game and denoted by S(G). A Nash equilibrium of the zero-sum game G is a strategy profile (g 1∗ ,g 2∗ ) such that for every g 1 ∈G 1 and g 2 ∈G 2 , we have J(g 1∗ ,g 2 )≤J(g 1∗ ,g 2∗ )≤J(g 1 ,g 2∗ ). (7.9) Nash equilibria in zero-sum games satisfy the following property [110]. 118 Proposition 7.1. If a Nash equilibrium in GameG exists, then for every Nash equilibrium (g 1∗ ,g 2∗ ) in Game G, we have J(g 1∗ ,g 2∗ ) =S l (G) =S u (G) =S(G). (7.10) Remark 7.2. Note that the existence of a Nash equilibrium is not guaranteed in general. However, if players have perfect recall, i.e. {U i 1:t−1 }∪I i t−1 ⊆I i t (7.11) for every i and t, then the existence of a behavioral strategy equilibrium is guaranteed by Kuhn’s theorem [89]. The objective of this work is to characterize the upper and lower values S u (G) and S l (G) of GameG. To this end, we will define a virtual gameG v and an “expanded” virtual gameG e . These virtual games will be used to obtain bounds on the upper and lower values of the original game G. 7.3 Virtual Game G v The virtual game G v is constructed using the methodology in [103]. This game involves the same set of primitive random variables as in Game G. The two players of game G are replaced by two virtual players inG v . The virtual players operate as follows. At each timet, virtual playeri selects a function Γ i t that maps private information P i t to a distribution δU i t over the spaceU i t . We refer to these functions as prescriptions. LetB i t be the set of all possible prescriptions for virtual player i at time t (i.e.B i t is the set of all mappings fromP i t to ΔU i t ). Once the virtual players select their prescriptions, the actionU i t is randomly generated according to distribution Γ i t (P i t ). 
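This sampling step uses the same interval-partition mechanism κ described in Remark 7.1. As an illustration (the function names are ours and the example control law is hypothetical), a minimal sketch is given below.

from itertools import accumulate

def kappa(dist, k):
    # Remark 7.1: partition (0, 1] into intervals whose lengths equal the
    # action probabilities and return the action whose interval contains k.
    cutoffs = list(accumulate(dist))            # b_1, b_2, ..., b_n
    for action, b in enumerate(cutoffs, start=1):
        if k <= b:
            return action                       # actions labelled 1, ..., n
    return len(dist)                            # guard against rounding error

def behavioral_action(g_t, info, k):
    # Player i's action at time t: U = kappa(g_t(I_t), K_t).
    return kappa(g_t(info), k)

g_t = lambda info: [1/3, 1/3, 1/3]              # a control law mixing uniformly
print(behavioral_action(g_t, info=("y1", "u0"), k=0.62))   # -> 2

Making the randomization explicit in this way renders the realized action a deterministic function of the chosen distribution and of K_t^i, which is convenient in the equivalence arguments between the original and virtual games.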
More precisely, the system dynamics for this game are given by: X t+1 =f t (X t ,U 1:2 t ,W s t ) (7.12) P i t+1 =ξ i t+1 (P i t ,U i t ,Y i t+1 ) i = 1, 2, (7.13) Y i t+1 =h i t+1 (X t+1 ,U 1:2 t ,W i t+1 ) i = 1, 2, (7.14) U i t =κ(Γ i t (P i t ),K i t ) i = 1, 2, (7.15) 119 Z t+1 =ζ t+1 (P 1:2 t ,U 1:2 t ,Y 1:2 t+1 ), (7.16) where the functions f t ,h i t , ξ i t ,κ and ζ t are the same as inG. In the virtual game, virtual players use the common informationC t to select their prescriptions at time t. The ith virtual player selects its prescription according to a control law χ i t , i.e. Γ i t = χ i t (C t ). For virtual player i, the collection of control laws over the entire time horizon χ i = (χ i 1 ,...,χ i T ) is referred to as its control strategy. LetH i t be the set of all possible control laws for virtual player i at time t and letH i be the set of all possible control strategies for virtual player i, i.e.H i =H i 1 ×···×H i T . The total cost associated with the game for a strategy profile (χ 1 ,χ 2 ) is J (χ 1 ,χ 2 ) =E (χ 1 ,χ 2 ) " T X t=1 c t (X t ,U 1 t ,U 2 t ) # , (7.17) where the function c t is the same as in GameG. The following lemma establishes a connection between the original gameG and the virtual game G v constructed above. Lemma 7.1. LetS u (G v ) andS l (G v ) be, respectively, the upper and lower values of the virtual game G v . Then, S l (G) =S l (G v ) and S u (G) =S u (G v ). Consequently, if a Nash equilibrium exists in the original game G, then S(G) =S l (G v ) =S u (G v ). Proof. See Appendix G.1. The authors in [103] use the virtual game to find equilibrium costs and strategies for a stochastic dynamic game of asymmetric information. However, the methodology in [103] is applicable only under the assumption that the posterior beliefs on state X t and private information P 1,2 t given the common information C t do not depend on the strategy profile being used (see Assumption 2 in [103]). We will refer to this assumption as the strategy-independent beliefs (SIB) assumption. As pointed out in [103], the SIB assumption is satisfied by some special system models and information structures but is not true for general stochastic dynamic games. A simple example which does not satisfy the SIB assumption is the following delayed sharing information structure [104]: Consider gameG with common information C t ={Y 1,2 1:t−2 ,U 1,2 1:t−2 } and P i t ={Y i t ,Y i t−1 ,U i t−1 }. 120 Thus, we are faced with the following situation: if our zero-sum game satisfies the SIB as- sumption, we can adopt the results in [103] to find equilibrium costs (i.e. the value) of our game. However, if the zero-sum game does not satisfy the SIB assumption, then the methodology of [103] is inapplicable. In the next section, we will develop a methodology to bound the upper and lower values of the zero-sum gameG even when the game does not satisfy the SIB assumption. 7.4 Expanded Virtual Game G e with Prescription History In order to circumvent the SIB assumption, we now construct an expanded virtual game G e by increasing the amount of information available to virtual players in gameG v . In this new gameG e , the state dynamics, observation processes, primitive random variables and cost function are all the same as in the game G v . The only difference is in the information used by the virtual players to select their prescriptions. The virtual players now have access to the common information C t as well as all the past prescriptions of both players, i.e., Γ 1:2 1:t−1 . 
Virtual playeri selects its prescription at time t using a control law ˜ χ i t , i.e, Γ i t = ˜ χ i t (C t , Γ 1:2 1:t−1 ). Let ˜ H i t be the set of all such (measurable) control laws at time t for virtual player i. ˜ H i := ˜ H i 1 ×···× ˜ H i T is the set of all control strategies for player i. The total cost associated with the game for a strategy profile (˜ χ 1 , ˜ χ 2 ) is J (˜ χ 1 , ˜ χ 2 ) =E (˜ χ 1 ,˜ χ 2 ) " T X t=1 c t (X t ,U 1 t ,U 2 t ) # . (7.18) Remark 7.3. Note that any strategy χ i ∈X i is equivalent to the strategy ˜ χ i ∈ ˜ X i that satisfies the following condition: for each time t and for each realization of common information c t ∈C t , ˜ χ i t (c t ,γ 1:2 1:t−1 ) =χ i t (c t ) ∀γ 1:2 1:t−1 ∈B 1:2 1:t−1 . (7.19) Hence, with slight abuse of notation, we can say that the strategy spaceX i in the virtual game G v is a subset of the strategy space ˜ X i in the expanded game G e . For this reason, the functionJ in (8.16) can be thought of as an extension of the functionJ in (8.50). Remark 7.4. Expansion of information structures has been used in prior work to find equilibrium costs/strategies. See, for example, [9] which studies a linear stochastic differential game where 121 both players have a common noisy observation of the state. Similar virtual games with expanded information structures referred to as auxiliary games have also been used in [91, 121, 122, 125, 50]. 7.4.1 Upper and Lower Values of Games G v and G e We will now establish the relationship between the upper and lower values of the expanded game G e and the virtual game G v . To do so, we define the following mappings between the strategies in gamesG v andG e . Definition 7.2. Let % i : ˜ X 1 × ˜ X 2 →X i be an operator that maps a strategy profile (˜ χ 1 , ˜ χ 2 ) in virtual game G e to a strategy χ i for virtual player i in game G v as follows: For t = 1, 2,...,T, χ i t (c t ) := ˜ χ i t (c t , ˜ γ 1:2 1:t−1 ), (7.20) where ˜ γ j s = ˜ χ j s (c s , ˜ γ 1:2 1:s−1 ) for every 1≤s≤t− 1 and j = 1, 2. We denote the ordered pair (% 1 ,% 2 ) by %. The mapping% is defined in such a way that the strategy profile (˜ χ 1 , ˜ χ 2 ) and the strategy profile %(˜ χ 1 , ˜ χ 2 ) induce identical dynamics in the respective games G e andG v . Lemma 7.2. Let (χ 1 ,χ 2 ) and (˜ χ 1 , ˜ χ 2 ) be strategy profiles for games G v and G e , such that χ i = % i (˜ χ 1 , ˜ χ 2 ), i = 1, 2. Then, J (χ 1 ,χ 2 ) =J (˜ χ 1 , ˜ χ 2 ). (7.21) Proof. See Appendix F.2. The following theorem connects the upper and lower values of the two virtual games and the original game. Theorem 7.1. The lower and upper values of the three games defined above satisfy the following: S l (G) =S l (G v )≤S l (G e )≤S u (G e )≤S u (G v ) =S u (G). Consequently, if a Nash equilibrium exists in the original game G, then S(G) =S l (G e ) =S u (G e ). 122 Proof. See Appendix F.3. Using Theorem 7.1, we can obtain bounds on the upper and lower values of the original game by computing the upper and lower values of the expanded game G e . 7.4.2 The Dynamic Programming Characterization We now describe a methodology for finding the upper and lower values of the expanded game G e . Suppose the virtual players are using the strategy profile (˜ χ 1 , ˜ χ 2 ) in the expanded game G e . Let Π t be the virtual players’ belief on the state and private information based on their information in gameG e . Thus, Π t is defined as Π t (x t ,p 1:2 t ) :=P (˜ χ 1 ,˜ χ 2 ) (X t =x t ,P 1:2 t =p 1:2 t |C t , Γ 1:2 1:t−1 ),∀x t ,p 1 t ,p 2 t . 
We refer to Π t as the common information belief (CIB). Π t takes values in the setS t := Δ(X t × P 1 t ×P 2 t ). Definition 7.3. Given a belief π on the state and private informations at time t and mappings γ i ,i = 1, 2, fromP i t to ΔU i t , we define γ i (p i t ;u) as the probability assigned to action u under the probability distribution γ i (p i t ). Also, define ˜ c t (π,γ 1 ,γ 2 ) := X xt,p 1:2 t ,u 1:2 t c t (x t ,u 1 t ,u 2 t )π(x t ,p 1 t ,p 2 t )γ 1 (p 1 t ;u 1 t )γ 2 (p 2 t ;u 2 t ). (7.22) ˜ c t (π,γ 1 ,γ 2 ) is the expected value of the cost at time t if the state and private informations have π as their probability distribution and γ 1 ,γ 2 are the prescriptions chosen by the virtual players. Lemma 7.3. For any strategy profile (˜ χ 1 , ˜ χ 2 ), the common information based belief Π t evolves almost surely as Π t+1 =F t (Π t , Γ 1:2 t ,Z t+1 ), t≥ 1, (7.23) 123 where F t is a fixed transformation that does not depend on the virtual players’ strategies. Further, the total expected cost can be expressed as J (˜ χ 1 , ˜ χ 2 ) =E (˜ χ 1 ,˜ χ 2 ) " T X t=1 ˜ c t (Π t , Γ 1 t , Γ 2 t ) # , (7.24) where ˜ c t is as defined in equation (G.14). Proof. See Appendix G.2. Remark 7.5. Because (8.27) is an almost sure equality, the transformation F t in Lemma 8.2 is not necessarily unique. In Appendix G.2, we identify a class of transformations such that for any transformation F t in this class, Lemma 8.2 holds. We denote this class by B. We now describe two dynamic programs, one for each virtual player in G e . 7.4.2.1 The min-max dynamic program The minimizing virtual player (virtual player 1) in gameG e solves the following dynamic program. Define V u T +1 (π T +1 ) = 0 for every π T +1 . In a backward inductive manner, at each time t≤T and for each possible common information belief π t and prescriptions γ 1 t ,γ 2 t , define w u t (π t ,γ 1 t ,γ 2 t ) := ˜ c t (π t ,γ 1 t ,γ 2 t ) +E[V u t+1 (F t (π t ,γ 1:2 t ,Z t+1 ))|π t ,γ 1:2 t ] (7.25) V u t (π t ) := inf γ 1 t sup γ 2 t w u t (π t ,γ 1 t ,γ 2 t ). (7.26) 7.4.2.2 The max-min dynamic program The maximizing virtual player (virtual player 2) in gameG e solves the following dynamic program. Define V l T +1 (π T +1 ) = 0 for every π T +1 . In a backward inductive manner, at each time t≤T and for each possible common information belief π t and prescriptions γ 1 t ,γ 2 t , define w l t (π t ,γ 1 t ,γ 2 t ) := ˜ c t (π t ,γ 1 t ,γ 2 t ) +E[V l t+1 (F t (π t ,γ 1:2 t ,Z t+1 ))|π t ,γ 1:2 t ] (7.27) V l t (π t ) := sup γ 2 t inf γ 1 t w l t (π t ,γ 1 t ,γ 2 t ). (7.28) 124 Lemma 7.4. For any realization of common information based belief π t , the inf and sup in (8.23) are achieved, i.e. there exists a measurable mapping Ξ 1 t :S t →B 1 t such that V u t (π t ) = min γ 1 t max γ 2 t w u t (π t ,γ 1 t ,γ 2 t ) = max γ 2 t w u t (π t , Ξ 1 t (π t ),γ 2 t ). (7.29) Similarly, for any realization of common information based belief π t , the sup and inf in (8.25) are achieved, i.e, there exists a measurable mapping Ξ 2 t :S t →B 2 t such that V l t (π t ) = max γ 2 t min γ 1 t w l t (π t ,γ 1 t ,γ 2 t ) = min γ 1 t w l t (π t ,γ 1 t , Ξ 2 t (π t )). (7.30) Proof. See Appendix G.3. Definition 7.4. 
Define strategies ˜ χ 1∗ and ˜ χ 2∗ for virtual players 1 and 2 respectively as follows: for each instance of common information c t and prescription history γ 1:2 1:t−1 , let ˜ χ 1∗ t (c t ,γ 1:2 1:t−1 ) := Ξ 1 t (π t ) (7.31) ˜ χ 2∗ t (c t ,γ 1:2 1:t−1 ) := Ξ 1 2 (π t ), (7.32) where Ξ 1 t and Ξ 2 t are the mappings defined in Lemma 8.3 and π t (which is a function of c t ,γ 1:2 1:t−1 ) is obtained in a forward inductive manner using the relation π 1 (x 1 ,p 1 1 ,p 2 1 ) =P[X 1 =x 1 ,P 1 1 =p 1 1 ,P 2 1 =p 2 1 |C 1 =c 1 ]∀x 1 ,p 1 1 ,p 2 1 , (7.33) π τ+1 =F τ (π τ ,γ 1 τ ,γ 2 τ ,z τ+1 ), 1≤τ <t. (7.34) Note that F τ is the common information belief update function defined in Lemma 8.2. The following theorem establishes that the two dynamic programs described above characterize the upper and lower values of gameG e . Theorem 7.2. The upper and lower values of the expanded virtual game G e are given by S u (G e ) =E[V u 1 (Π 1 )], (7.35) S l (G e ) =E[V l 1 (Π 1 )]. (7.36) 125 Further, the strategies ˜ χ 1∗ and ˜ χ 2∗ as defined in Definition 8.3 are, respectively, min-max and max-min strategies in the expanded virtual game G e . Proof. See Appendix F.7. Theorem 8.2 gives us a dynamic programming characterization of the upper and lower values of the expanded game. As mentioned in Theorem 7.1, the upper and lower values of the expanded game provide bounds on the corresponding values of the original game. Further, if the original game has a Nash equilibrium, the dynamic programs of Theorem 8.2 characterize the value of the game. Note that this applies to any dynamic game of the form in Section 7.2 where the common information is non-decreasing in time and the private information has a “state-like” update equation (see Assumption 8.1). As noted before, a variety of information structures satisfy this assumption [105], [103]. The computational burden of solving the dynamic programs of Theorem 8.2 would depend on the specific information structure being considered, i.e., on the exact nature of common and private information. At one extreme, we can consider the following instance of the original game G: C t = (X 1:t ),P 1 t = P 2 t =∅. It is easy to see that in this case, the common information belief can be replaced by the current state in the dynamic programs and the prescriptions are simply distributions on the players’ finite action sets. Also, in this case, w u t and w l t are bilinear functions of the prescriptions and the min-max/max-min problems at each stage of the dynamic program can be solved by a linear program [118]. On the other extreme, we can consider an instance of game G with C t =∅,P i t =Y i 1:t ,i = 1, 2. In this case, the common information belief will be on the current state and observation histories of the two players and the prescriptions will take values in a large- dimensional space. Also, the functions w u t andw l t (fort<T ) in this case do no have any apparent structure that can be exploited for efficient computation of the min-max and max-min values in the dynamic program. One general approach that can be used for any instance of game G is to discretize the CIB belief space and compute approximate value functions V u t andV l t in a backward inductive manner. However, we believe that significant structural and computational insights can be obtained by specializing the dynamic programs of Theorem 8.2 to the specific instance of the game being considered. 
We demonstrate this in [71] where we consider an information structure in which one player has complete information while the other player has only partial information. For 126 this information structure, it is shown in [71] that the functions w u t andw l t turn out to be identical at all times t and they satisfy some structural properties that can be leveraged for computation. Comparison with [111] and [149]: In [111], the authors considered ann-player stochastic game model which can potentially be non-zero sum. In this model, each player has a private state that is privately observed by the corresponding player and a public state that is commonly observed by all the players. The model in [111] additionally allows players’ private information to be partially revealed in the form of common observations. The actions of all the players in this model are commonly observed. The authors also make the assumption that the evolution of the private states of the players is conditionally independent. The model in [149] can be viewed as a special case of the model in [111]. For these models, backward inductive algorithms were presented to compute perfect Bayesian equilibria. Consider the case when the number of players in the games of [111] and [149] is two and the games are zero-sum. Then: 1. The models in [111] and [149] can be viewed as special cases of our model in Section 7.2. 2. The players in these games have perfect recall. Hence, we can use Kuhn’s theorem to conclude that a Nash equilibrium and, thus, the value exists for these zero-sum games. Therefore, we can use the dynamic programs in Section 8.3.3 to the characterize the value of these zero- sum games. This characterization does not make any additional assumptions. The backward inductive algorithms in [111] and [149], however, require the existence of a particular kind of fixed point solution at each stage. This fixed point solution is not guaranteed to exist in general. Thus, there may be instances where the approaches in [111] and [149] fail to characterize the value of the game while our dynamic program in Section 8.3.3 can always characterize it. 7.4.3 Connections with the Recursive Formula in [91] A general model of zero-sum games is described in Section IV.3 of [91]. The state dynamics and the observation model in [91, Section IV.3] are similar to the state dynamics (equation (8.2)) and observation model (equation (8.3)) we described in Section 7.2. We would like to highlight two key differences between our model and the model in [91, Section IV.3]. Firstly, the model in [91, Section 127 IV.3] assumes that the players have perfect recall, whereas we do not make this assumption. In our setup, players may or may not have perfect recall. We only require the common information to be recalled perfectly by the players (see Assumption 8.1). Secondly, in the model of [91, Section IV.3], the new information (or the signal) a player gets at each stage can be seen as a function of the current and previous states, the actions taken at previous stage and some random noise. In our model, the new information that a player gets at each stage includes the new private information and an increment in common information that could depend on the history of past states and actions (through the term P i t in equations (8.4) and (8.5)) 2 . 
In order to accommodate this feature in the state dynamics and observation model of [91, Section IV.3], one needs to redefine the state in [91, Section IV.3] to consist of both the original system state and the state of players’ private information. Since the players in the model of [91, Section IV.3] have perfect recall, the value of the game exists. A characterization of this value is provided in [91] using a recursive formula. The recursive formula in [91] is in terms of certain probability distributions P t and has the following structure 3 : V T +1 (P T +1 ) = 0 (7.37) V t (P t ) = min g 1 t max g 2 t E (g 1 t ,g 2 t ) h c t (X t ,U 1 t ,U 2 t ) i +V t+1 (P t+1 ) . (7.38) Here, g i t is player i’s behavioral strategy at time t. While this recursive formula appears to be similar to our dynamic programming characterization, it has some key differences which are listed below: 1. The minimization and maximization in the recursive formula is over behavioral strategies at each time. Note that the behavioral strategy at each time is a mapping from a player’s entire information at that time to probability distributions over its action sets. In our dy- namic program, the minimization and maximization are over prescription spacesB 1 t andB 2 t , respectively. Recall that a prescription is a mapping from a player’s private information to probability distributions over its action sets. Since a player’s private information may be much smaller than the entire information available to it, the prescription spaces in our 2 For example, see the delayed sharing information structure in Section 7.2.2. 3 Our notation is different from [91, Section IV.3]. 128 dynamic program may be much smaller than behavioral strategy spaces. This conceptual difference in the two characterizations may also have computational implications since min- imizing/maximizing over the smaller space of prescriptions would generally be easier than minimizing/maximizing over the larger space of behavioral strategies at each time. 2. The information state in the recursive formula (i.e. the arguments of the value functions above) is a probability distribution on an abstract space referred to as the universal belief space 4 . The information state in our dynamic program is the common information belief, which is more tangible and is explicitly defined as a posterior probability distribution on the current state and players’ private information based on the common information. We believe that this explicit description of the information state as a common belief is both conceptually more illuminating and computationally more useful. 7.5 Conclusion In this paper, we considered a general model of zero-sum stochastic games with asymmetric in- formation. We model each player’s information as consisting of a private information part and a common information part. In our model, players need not have perfect recall. We only need the common information to be perfectly recalled. For this general model, we provided a dynamic programming approach for characterizing the value (if it exists). This dynamic programming char- acterization of value relies on our construction of two virtual games that have the same value as our original game. If the value does not exist in the original game, then our dynamic program provides bounds on the upper and lower values of the original game. 4 We refer the reader to [91, Chapter III] for a detailed discussion on the universal belief space. 
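As a concrete, if limited, companion to the dynamic programs of Section 7.4.2, the sketch below implements the backward induction for the first special case discussed there, namely C_t = (X_{1:t}) and empty private information, where the common information belief collapses to the current state and each stage reduces to a matrix game solvable by a linear program. This is only an illustrative sketch under those assumptions; the data layout (cost, trans) and the use of SciPy's linprog are our choices and are not part of the model.

import numpy as np
from scipy.optimize import linprog

def solve_matrix_game(A):
    # Value and optimal mixed strategy of the minimizing (row) player for the
    # zero-sum matrix game min_x max_y x^T A y, via the standard LP.
    m, n = A.shape
    c = np.r_[np.zeros(m), 1.0]                    # variables: [x_1..x_m, v]
    A_ub = np.c_[A.T, -np.ones(n)]                 # (A^T x)_j <= v for all j
    A_eq = np.r_[np.ones(m), 0.0].reshape(1, -1)   # sum_i x_i = 1
    res = linprog(c, A_ub=A_ub, b_ub=np.zeros(n), A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, None)] * m + [(None, None)])
    return res.x[-1], res.x[:m]

def backward_induction(T, states, U1, U2, cost, trans):
    # V_t^u when C_t = X_{1:t} and the private information is empty: the CIB
    # collapses to the current state and a prescription is simply a mixed action.
    # cost[t][x][u1][u2]  : stage cost for state x and pure actions (u1, u2)
    # trans[t][x][u1][u2] : dict mapping next states to transition probabilities
    V = {x: 0.0 for x in states}                   # V_{T+1}^u = 0
    policy = {}
    for t in range(T, 0, -1):
        V_new, pol_t = {}, {}
        for x in states:
            A = np.array([[cost[t][x][u1][u2]
                           + sum(p * V[nx] for nx, p in trans[t][x][u1][u2].items())
                           for u2 in range(len(U2))]
                          for u1 in range(len(U1))])
            V_new[x], pol_t[x] = solve_matrix_game(A)
        V, policy[t] = V_new, pol_t
    return V, policy   # V[x] = V_1^u(x); policy[t][x] = minimizer's mixed action

For general information structures, the same recursion would run over a discretization of the CIB space, with the stage matrix game replaced by a min-max computation over prescriptions, as discussed above.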
129 Chapter 8 Zero-sum Games between Teams 8.1 Introduction In decentralized team problems, players collaboratively control a stochastic system to minimize a common cost. The information used by these players to select their control actions may be different. For instance, some of the players may have more information about the system state than others [159]; or each player may have some private observations that are shared with other players with some delay [104]. Such multi-agent team problems with an information asymmetry arise in a multitude of domains like autonomous vehicles and drones, power grids, transportation networks, military and rescue operations, wildlife conservation [42] etc. Over the past few years, several methods have been developed to address decentralized team problems [105, 109, 120, 47, 159]. However, games between such teams are less understood. Many of the aforementioned systems are susceptible to adversarial attacks. Therefore, the strategies used by the team of players for controlling these systems must be designed in such a way that the damage inflicted by the adversary is minimized. Such adversarial interactions can be modeled as zero-sum games between competing teams, and our main goal in this paper is develop a framework that can be used to analyze and solve them. The aforementioned works [105, 109, 120, 47, 159] on cooperative team problems solve them by first constructing an auxiliary single-agent Markov Decision Process (MDP). The auxiliary state (state of the auxiliary MDP) is the common information belief (CIB). CIB is the belief on the system state and all the players’ private information conditioned on the common (or public) information. Auxiliary actions (actions in the auxiliary MDP) are mappings from agents’ private 130 information to their actions [105]. The optimal values of the team problem and this auxiliary MDP are identical. Further, an optimal strategy for the team problem can be obtained using any optimal solution of the auxiliary MDP with a simple transformation. The optimal value and strategies of this auxiliary MDP (and thus the team problem) can be characterized by dynamic programs (a.k.a. Bellman equations or recursive formulas). A key consequence of this characterization is that the CIB is a sufficient statistic for optimal control in team problems. We investigate whether a similar approach can be used to characterize values and strategies in zero-sum games between teams. This extension is not straightforward. In general games (i.e., not necessarily zero-sum), it may not be possible to obtain such dynamic programs (DPs) and/or sufficient statistics [102, 140]. However, we show that for zero-sum games between teams, the values can be characterized by CIB based DPs. Further, we show that for some specialized models, the CIB based DPs can be used to characterize a min-max strategy as well. A key implication of our result is that this CIB based approach can be used to solve several team problems considered before [105, 159, 47] even in the presence of certain types of adversaries. A phenomenon of particular interest and importance in team problems is signaling. Players in a team can agree upon their control strategies ex ante. Based on these agreed upon strategies, a player can often make inferences about the system state or the other players’ private information (which are otherwise inaccessible to the player). This implicit form of communication between players is referred to as signaling and can be vital for effective coordination. 
While signaling is beneficial in cooperative teams, it can be detrimental in the presence of an adversary. This is because the adversary can exploit it to infer sensitive private information and inflict severe damage upon the system. A concrete example that illustrates this trade-off between signaling and secrecy is discussed is Section 8.6. Our framework can be used optimize this trade-off in several stochastic games between teams. Related Work on Games Zero-sum games between two individual players with asymmetric information have been extensively studied. In [91, 125, 122, 50, 84, 72], stochastic zero-sum games with varying degrees of generality were considered and dynamic programming characterizations of the value of the game were provided. Various properties of the value functions (such as continuity) were also established and for some specialized information structures, these works also characterize 131 a min-max strategy. Linear programs for computing the values and strategies in certain games were proposed in [162, 83]; and methods based on heuristic search value iteration (HSVI) [134] to compute the value of some games were proposed in [63, 62]. Zero-sum extensive form games in which a team of players competes against an adversary have been studied in [150, 43, 161]. Structured Nash equilibria in general games (i.e. not necessarily zero-sum) were studied in [103, 111, 149] under some assumptions on the system dynamics and players’ information structure. A combination of reinforcement learning and search was used in [19] to solve two-player zero-sum games. While this approach has very strong empirical performance, a better analytical understanding of it is needed. Our work is closely related to [70, 72] and builds on their results. Our novel contributions in this paper over past works are summarized below. Contributions (i) In this paper, we study a general class of stochastic zero-sum games between two competing teams of players. Our general model captures a variety of information structures including many previously considered stochastic team [109, 47, 159] and zero-sum game models [72, 63, 62] as well as game models that have not been studied before. Our results provide a unified framework for analyzing a wide class of game models that satisfy some minimal assumptions. (ii) For our general model, we adapt the techniques in [72] to provide bounds on the upper (min-max) and lower (max-min) values of the game. These bounds provide us with fundamental limits on the performance achievable by either team. Furthermore, if the upper and lower values of the game are identical (i.e., if the game has a value), our bounds coincide with the value of the game. Our bounds are obtained using two dynamic programs (DPs) based on a sufficient statistic known as the common information belief (CIB). (iii) We also identify a subclass of game models in which only one of the teams (say the minimizing team) controls the evolution of the CIB. In these cases, we show that one of our CIB based dynamic programs can be used to find the min-max value as well as a min-max strategy 1 . (iv) Our result reveals that the structure of the CIB based min-max strategy is similar to the structure of team optimal strategies. Such structural results have been successfully used in prior works [47, 19] to design efficient strategies for significantly challenging team problems. 
(v) Lastly, we discuss an approximate dynamic programming approach along with 1 Note that this characterization of a min-max strategy is not present in [72]. A similar result for a very specific model with limited applicability exists in [70]. Our result is substantially more general than that in [70]. 132 key structural properties for computing the values (and the strategy when applicable) and illustrate our results with the help of an example. 8.1.1 Notation Random variables are denoted by upper case letters, their realizations by the corresponding lower case letters. In general, subscripts are used as time index while superscripts are used to index decision-making agents. For time indices t 1 ≤ t 2 , X t 1 :t 2 is the short hand notation for the vari- ables (X t 1 ,X t 1 +1 ,...,X t 2 ). Similarly, X 1:2 is the short hand notation for the collection of variables (X 1 ,X 2 ). Operators P(·) and E[·] denote the probability of an event, and the expectation of a random variable respectively. For random variables/vectors X andY ,P(·|Y =y),E[X|Y =y] and P(X = x| Y = y) are denoted byP(·|y),E[X|y] andP(x| y), respectively. For a strategy g, we useP g (·) (resp. E g [·]) to indicate that the probability (resp. expectation) depends on the choice of g. For any finite setA, ΔA denotes the probability simplex over the setA. For any two sets A and B (whereA is finite),F(A,B) denotes the set of all functions fromA toB. We define rand to be mechanism that given (i) a finite setA, (ii) a distribution d overA and a random variable K that is uniformly distributed over the interval (0, 1], produces a random variable X∈A with distribution d, i.e., X =rand(A,d,K)∼d. (8.1) 8.2 Problem Formulation Consider a dynamic system with two teams. Team 1 has N 1 players and Team 2 has N 2 players. The system operates in discrete time over a horizon T . Let X t ∈X t be the state of the system at time t, and let U i,j t ∈U i,j t be the action of Player j, j∈{1,...,N i }, in Team i, i∈{1, 2}, at time t. Let U 1 t . = U 1,1 t ,...,U 1,N 1 t ; U 2 t . = U 2,1 t ,...,U 2,N 2 t , 133 andU i t be the set of all possible realizations of U i t . We will refer to U i t as Team i’s action at time t. The state of the system evolves in a controlled Markovian manner as X t+1 =f t (X t ,U 1 t ,U 2 t ,W s t ), (8.2) where W s t is the system noise. There is an observation process Y i,j t ∈Y i,j t associated with each Player j in Team i and is given as Y i,j t =h i,j t (X t ,U 1 t−1 ,U 2 t−1 ,W i,j t ), (8.3) where W i,j t is the observation noise. Let us define Y 1 t . = Y 1,1 t ,...,Y 1,N 1 t ; Y 2 t . = Y 2,1 t ,...,Y 2,N 2 t . We assume that the setsX t ,U i,j t andY i,j t are finite for alli,j andt. Further, the random variables X 1 ,W s t ,W i,j t (referred to as the primitive random variables) can take finitely many values and are mutually independent. 8.2.0.1 Information Structure At timet, Playerj in Teami has access to a subset of all observations and actions generated so far. Let I i,j t denote the collection of variables (i.e. observations and actions) available to Player j in teami at timet. ThenI i,j t ⊆∪ i,j {Y i,j 1:t ,U i,j 1:t−1 }. The set of all possible realizations ofI i,j t is denoted byI i,j t . 
Examples of such information structures include I i,j t ={Y i,j 1:t ,U i,j 1:t−1 } which corresponds to the information structure in Dec-POMDPs [109] and I i,j t ={Y i,j 1:t ,Y 1:2 1:t−d ,U 1:2 1:t−1 } wherein each player’s actions are seen by all the players and their observations become public after a delay of d time steps. 134 Information I i,j t can be decomposed into common and private information, i.e. I i,j t = C t ∪ P i,j t ; common information C t is the set of variables known to all players 2 at time t. The private information P i,j t for Player j in Team i is defined as I i,j t \C t . Let P 1 t . = P 1,1 t ,...,P 1,N 1 t ; P 2 t . = P 2,1 t ,...,P 2,N 2 t . We will refer to P i t as Team i’s private information. LetC t be the set of all possible realizations of common information at time t,P i,j t be the set of all possible realizations of private information for Player j in Team i at time t andP i t be the set of all possible realizations of P i t . We make the following assumption on the evolution of common and private information. This is similar to Assumption 1 of [103, 72]. Assumption 8.1. The evolution of common and private information available to the players is as follows: 1. The common informationC t is non-decreasing with time, i.e. C t ⊆C t+1 . LetZ t+1 . =C t+1 \C t be the increment in common information. Thus, C t+1 ={C t ,Z t+1 }. Furthermore, Z t+1 =ζ t+1 (P 1:2 t ,U 1:2 t ,Y 1:2 t+1 ), (8.4) where ζ t+1 is a fixed transformation. 2. The private information evolves as P i t+1 =ξ i t+1 (P 1:2 t ,U 1:2 t ,Y 1:2 t+1 ), (8.5) where ξ i t+1 is a fixed transformation and i = 1, 2. As noted in [105, 72], a number of information structures satisfy the above assumption. Our analysis applies to any information structure that satisfies Assumption 8.1 including, among others, Dec-POMDPs and the delayed sharing information structure discussed above. 2 Note that a variable is part of common information Ct if and only if it is known to all players of all teams. A variable common to players of just one team is not part of Ct. 135 8.2.0.2 Strategies and Values Players can use any information available to them to select their actions and we allow behavioral strategies for all players. Thus, at time t, Player j in Team i chooses a distribution δU i,j t over its action space using a control law g i,j t :I i,j t → ΔU i,j t , i.e. δU i,j t =g i,j t (I i,j t ) =g i,j t (C t ,P i,j t ). (8.6) The distrubtion δU i,j t is then used to randomly generate the control action U i,j t as follows. We assume that player j of Team i has access to i.i.d. random variables K i,j 1:T that are uniformly distributed over the interval (0, 1]. These uniformly distributed variables are independent of each other and of the primitive random variables. The action U i,j t is generated using K i,j t and the randomization mechanism described in (8.1), i.e., U i,j t =rand(U i,j t ,δU i,j t ,K i,j t ). (8.7) The collection of control laws used by the players in Team i at time t is denoted by g i t . = (g i,1 t ,...,g i,N i t ) and is referred to as the control law of Team i at time t. Let the set of all possible control laws for Teami at timet be denoted byG i t . The collection of control lawsg i . = (g i 1 ,...,g i T ) is referred to as the control strategy of Teami, and the pair of control strategies (g 1 ,g 2 ) is referred to as a strategy profile. Let the set of all possible control strategies for Team i beG i . The total expected cost associated with a strategy profile (g 1 ,g 2 ) is J(g 1 ,g 2 ) . 
=E (g 1 ,g 2 ) " T X t=1 c t (X t ,U 1 t ,U 2 t ) # , (8.8) where c t :X t ×U 1 t ×U 2 t →R is the cost function at time t. Team 1 wants to minimize the total expected cost, while Team 2 wants to maximize it. We refer to this zero-sum game between Team 1 and Team 2 as GameG. Definition 8.1. The upper and lower values of the game G are respectively defined as S u (G) . = min g 1 ∈G 1 max g 2 ∈G 2 J(g 1 ,g 2 ), (8.9) 136 S l (G) . = max g 2 ∈G 2 min g 1 ∈G 1 J(g 1 ,g 2 ). (8.10) If the upper and lower values are the same, they are referred to as the value of the game and denoted by S(G). The minimizing strategy in (8.9) is referred to as Team 1’s optimal strategy and the maximizing strategy in (8.10) is referred to as Team 2’s optimal strategy 3 . A key objective of this work is to characterize the upper and lower values S u (G) and S l (G) of Game G. To this end, we will define two virtual games, G v and G e , with symmetric information. Each of these virtual games has two virtual players, one for each team in Game G. Unlike the players in GameG, the virtual players in gamesG v andG e use the same information to select their decisions at any given time. These virtual games will be used to obtain bounds on the upper and lower values of the original gameG. The bounds obtained happen to be tight when the upper and lower values of GameG are equal. For a sub-class of information structures, we will show that these virtual games can be used to obtain optimal strategies for one of the teams. We note that if the upper and lower values of Game G are the same, then any pair of optimal strategies (g 1∗ ,g 2∗ ) forms a Team Nash Equilibrium 4 , i.e., for every g 1 ∈G 1 and g 2 ∈G 2 , J(g 1∗ ,g 2 )≤J(g 1∗ ,g 2∗ )≤J(g 1 ,g 2∗ ). (8.11) In this case, J(g 1∗ ,g 2∗ ) is the value of the game, i.e. J(g 1∗ ,g 2∗ ) = S l (G) = S u (G) . = S(G). Conversely, if a Team Nash Equilibrium exists, then the upper and lower values are the same [110]. 8.3 Virtual Games G v and G e The virtual games G v andG e are constructed using the methodology in [72]. These games involve the same underlying system model as in game G. The key distinction between games G, G v and G e lies in the manner in which the actions used to control the system are chosen. In the virtual games, all the players in each team of game G are replaced by a virtual player. Thus, games G v 3 The strategy spacesG 1 andG 2 are compact and the cost J(·) is continuous in g 1 ,g 2 . Hence, the existence of optimal strategies can be established using Berge’s maximum theorem [53]. 4 When players in a team randomize independently, Team Nash equilirbia may not exist in general [5]. 137 and G e have two virtual players: virtual player i for Team i, where i = 1, 2. These virtual players in GamesG v andG e operate as described in the following sub-sections. 8.3.1 Virtual Game G v Consider virtual playeri associated with Teami,i = 1, 2. At each timet and for eachj = 1,...,N i , virtual player i selects a function Γ i,j t that maps private information P i,j t to a distribution δU i,j t over the spaceU i,j t . Thus, δU i,j t = Γ i,j t (P i,j t ). The set of all such mappings is denoted byB i,j t . = F(P i,j t , ΔU i,j t ). We refer to the tuple Γ i t . = (Γ i,1 t ,..., Γ i,N i t ) of such mappings as virtual player i’s prescription at time t. The set of all possible prescriptions for virtual player i at time t is denoted byB i t . =B i,1 t ×···×B i,N i t . 
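For illustration (the player labels, private-information realizations and probabilities below are hypothetical), a prescription for virtual player i can be stored as one lookup table per player of Team i, mapping that player's private-information realizations to distributions over its actions; the team's actions are then drawn with the rand mechanism of (8.1) and (8.7), each player using its own uniform variable, as formalized next.

import bisect
from itertools import accumulate

# A prescription for virtual player 1: one table per player of Team 1, i.e.
# an element of B_t^1 = F(P_t^{1,1}, ΔU_t^{1,1}) x ... x F(P_t^{1,N_1}, ΔU_t^{1,N_1}).
gamma_team1 = [
    {"p_low": [0.9, 0.1], "p_high": [0.2, 0.8]},   # player (1,1)
    {"p_low": [0.5, 0.5], "p_high": [1.0, 0.0]},   # player (1,2)
]

def rand_interval(dist, k):
    # rand(U, dist, k): interval mechanism of (8.1) applied to k in (0, 1].
    return bisect.bisect_left(list(accumulate(dist)), k)   # 0-based action index

def team_actions(prescription, private_info, uniforms):
    # Player j's action is drawn from prescription[j][P_t^{i,j}] using K_t^{i,j}.
    return [rand_interval(prescription[j][p], k)
            for j, (p, k) in enumerate(zip(private_info, uniforms))]

print(team_actions(gamma_team1, ["p_high", "p_low"], [0.35, 0.71]))   # -> [1, 1]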
Once virtual playeri selects its prescription, the action U i,j t is randomly generated according to the distribution Γ i,j t (P i,j t ). More precisely, U i,j t =rand(U i,j t , Γ i,j t (P i,j t ),K i,j t ), (8.12) where the random variable K i,j t and the mechanism rand are the same as in equation (8.7). In virtual gameG v , virtual players use the common informationC t to select their prescriptions at timet. Theith virtual player selects its prescription according to a control lawχ i t , i.e., Γ i t =χ i t (C t ). For virtual player i, the collection of control laws over the entire time horizon χ i = (χ i 1 ,...,χ i T ) is referred to as its control strategy. LetH i t be the set of all possible control laws for virtual player i at time t and letH i be the set of all possible control strategies for virtual player i, i.e. H i =H i 1 ×···×H i T . The total cost associated with the game for a strategy profile (χ 1 ,χ 2 ) is J (χ 1 ,χ 2 ) =E (χ 1 ,χ 2 ) " T X t=1 c t (X t ,U 1 t ,U 2 t ) # , (8.13) where the functionc t is the same as in GameG. The upper and lower values in GameG v are defined as S u (G v ) . = min χ 1 ∈H 1 max χ 2 ∈H 2 J (χ 1 ,χ 2 ), (8.14) S l (G v ) . = max χ 2 ∈H 2 min χ 1 ∈H 1 J (χ 1 ,χ 2 ). (8.15) 138 Lemma 8.1. The upper (resp. lower) value of Game G v is equal to the upper (resp. lower) value of Game G, i.e., S l (G v ) =S l (G) and S u (G v ) =S u (G). Consequently, if a Team Nash equilibrium exists in the original game G, then S(G) = S l (G v ) = S u (G v ). Proof. See Appendix G.1. 8.3.2 Expanded Vitual Game G e Consider virtual playeri associated with Teami,i = 1, 2. At each timet and for eachj = 1,...,N i , virtual player i selects a prescription Γ i t ∈B i t . The action U i,j t is chosen according to (8.12). The main difference between the virtual players in games G v and G e is in the information they use to select their prescriptions. The virtual players in game G e have access to the common information C t as well as all the past prescriptions of both players, i.e., Γ 1:2 1:t−1 . Virtual player i selects its prescription at time t using a control law ˜ χ i t , i.e., Γ i t = ˜ χ i t (C t , Γ 1:2 1:t−1 ). Let ˜ H i t be the set of all such control laws at timet and ˜ H i . = ˜ H i 1 ×···× ˜ H i T be the set of all control strategies for virtual player i. The total cost for a strategy profile (˜ χ 1 , ˜ χ 2 ) is J (˜ χ 1 , ˜ χ 2 ) =E (˜ χ 1 ,˜ χ 2 ) " T X t=1 c t (X t ,U 1 t ,U 2 t ) # . (8.16) The upper and lower values inG e are defined as S u (G e ) . = min ˜ χ 1 ∈ ˜ H 1 max ˜ χ 2 ∈ ˜ H 2 J (˜ χ 1 , ˜ χ 2 ), (8.17) S l (G e ) . = max ˜ χ 2 ∈ ˜ H 2 min ˜ χ 1 ∈ ˜ H 1 J (˜ χ 1 , ˜ χ 2 ). (8.18) Note that any strategy χ i ∈X i is equivalent to the strategy ˜ χ i ∈ ˜ X i that satisfies the following condition: for each time t and for each realization of common information c t ∈C t , ˜ χ i t (c t ,γ 1:2 1:t−1 ) =χ i t (c t ) ∀γ 1:2 1:t−1 ∈B 1:2 1:t−1 . (8.19) 139 Hence, with slight abuse of notation, we can say that the strategy spaceX i in the virtual gameG v is a subset of the strategy space ˜ X i in the expanded gameG e . Theorem 8.1. The lower and upper values of games G, G v and G e satisfy the following: S l (G) =S l (G v )≤S l (G e )≤S u (G e )≤S u (G v ) =S u (G). Further, all inequalities in the display above become equalities when a Team Nash equilibrium exists in Game G. Proof. The proof of this result follows from the same arguments in Appendix 3 of [72]. 
8.3.3 The Dynamic Programming Characterization The expanded game G e constructed in Section 8.3.2 is similar to the expanded game in [72]. This allows us to use the methodology in [72] to obtain a dynamic programming characterization of the upper and lower values of the expanded gameG e . The results (and their proofs) in this subsection are conceptually similar to those in Section 4.2 of [72] and therefore, we state these results without proofs. Note that we use the same notation (B i t ) as in [72] to denote the prescription spaces for virtual playeri at timet. However, the prescription spacesB i t in this paper are different (and more general) from those in [72]. This makes our results in this paper more general. Our dynamic program is based on a sufficient statistic for virtual players in game G e called the common information belief (CIB). Definition 8.2. At time t, the common information based belief (CIB), denoted by Π t , is defined as the virtual players’ belief on the state and private information based on their information in game G e . Thus, for each x t ∈X t ,p 1 t ∈P 1 t and p 2 t ∈P 2 t , we have Π t (x t ,p 1:2 t ) . =P h X t =x t ,P 1:2 t =p 1:2 t |C t , Γ 1:2 1:t−1 i . The belief Π t takes values in the setS t . = Δ(X t ×P 1 t ×P 2 t ). The following lemma describes an update rule that can be used to compute the CIB. 140 Lemma 8.2. For any strategy profile (˜ χ 1 , ˜ χ 2 ) in Game G e , the common information based belief Π t evolves almost surely as Π t+1 =F t (Π t , Γ 1:2 t ,Z t+1 ), (8.20) where F t is a fixed transformation (see Definition G.2 in Appendix G.2) that does not depend on the virtual players’ strategies. Further, the total expected cost can be expressed as J (˜ χ 1 , ˜ χ 2 ) =E (˜ χ 1 ,˜ χ 2 ) " T X t=1 ˜ c t (Π t , Γ 1 t , Γ 2 t ) # , (8.21) where the function ˜ c t is as defined in Definition (G.3) in Appendix G.2. 8.3.3.1 Values in Game G e We now describe two dynamic programs, one for each virtual player inG e . The minimizing virtual player (virtual player 1) in gameG e solves the following dynamic program. Define V u T +1 (π T +1 ) = 0 for everyπ T +1 . In a backward inductive manner, at each timet≤T and for each possible common information beliefπ t and prescriptionsγ 1 t ,γ 2 t , define the upper cost-to-go functionw u t and the upper value function V u t as w u t (π t ,γ 1 t ,γ 2 t ) (8.22) . = ˜ c t (π t ,γ 1 t ,γ 2 t ) +E[V u t+1 (F t (π t ,γ 1:2 t ,Z t+1 ))|π t ,γ 1:2 t ], V u t (π t ) . = min γ 1 t max γ 2 t w u t (π t ,γ 1 t ,γ 2 t ). (8.23) The maximizing virtual player (virtual player 2) in gameG e solves the following dynamic program. Define V l T +1 (π T +1 ) = 0 for every π T +1 . In a backward inductive manner, at each time t≤T and for each possible common information beliefπ t and prescriptionsγ 1 t ,γ 2 t , define the lower cost-to-go function w l t and the lower value function V l t as w l t (π t ,γ 1 t ,γ 2 t ) (8.24) . = ˜ c t (π t ,γ 1 t ,γ 2 t ) +E[V l t+1 (F t (π t ,γ 1:2 t ,Z t+1 ))|π t ,γ 1:2 t ], 141 V l t (π t ) . = min γ 1 t max γ 2 t w l t (π t ,γ 1 t ,γ 2 t ). (8.25) Lemma 8.3. For each t, there exists a measurable mapping Ξ 1 t :S t →B 1 t such that V u t (π t ) = max γ 2 t w u t (π t , Ξ 1 t (π t ),γ 2 t ). Similarly, there exists a measurable mapping Ξ 2 t :S t →B 2 t such that V l t (π t ) = min γ 1 t w l t (π t ,γ 1 t , Ξ 2 t (π t )). Proof. This lemma can be proved using the approach in [72]. We provide a substantially simpler proof for it in Appendix G.3. Theorem 8.2. 
The upper and lower values of the expanded virtual game G e are given by S u (G e ) =E[V u 1 (Π 1 )]; S l (G e ) =E[V l 1 (Π 1 )]. (8.26) Theorem 8.2 gives us a dynamic programming characterization of the upper and lower values of the expanded game. As mentioned in Theorem 8.1, the upper and lower values of the expanded game provide bounds on the corresponding values of the original game. Further, if the original game has a Team Nash equilibrium, then the dynamic programs described above characterize the value of the game. 8.3.3.2 Optimal Strategies in Game G e The mappings Ξ 1 and Ξ 2 obtained from the dynamic programs described above (see Lemma 8.3) can be used to construct optimal strategies for both virtual players in game G e in the following manner. Definition 8.3. Define strategies ˜ χ 1∗ and ˜ χ 2∗ for virtual players 1 and 2 respectively as follows: for each instance of common information c t and prescription history γ 1:2 1:t−1 , let ˜ χ 1∗ t (c t ,γ 1:2 1:t−1 ) . = Ξ 1 t (π t ); ˜ χ 2∗ t (c t ,γ 1:2 1:t−1 ) . = Ξ 1 2 (π t ), where Ξ 1 t and Ξ 2 t are the mappings defined in Lemma 8.3 and π t (which is a function of c t ,γ 1:2 1:t−1 ) is obtained in a forward inductive manner using the update rule F t defined in Lemma 8.2. 142 Theorem 8.3. The strategies ˜ χ 1∗ and ˜ χ 2∗ as defined in Definition 8.3 are, respectively, min-max and max-min strategies in the expanded virtual game G e . 8.4 Virtual Player 1 Controls the Common Information based Belief In this section, we consider a special class of instances of Game G and show that the dynamic program in (8.23) can be used to obtain a min-max strategy for Team 1, the minimizing team in game G. The key property of the information structures considered in this section is that the common information belief Π t is controlled 5 only by virtual player 1 in the corresponding expanded gameG e . This is formally stated in the following assumption. Assumption 8.2. For any strategy profile (˜ χ 1 , ˜ χ 2 ) in Game G e , the CIB Π t evolves almost surely as Π t+1 =F t (Π t , Γ 1 t ,Z t+1 ), (8.27) where F t is a fixed transformation that does not depend on the virtual players’ strategies. We will now describe some instances of Game G that satisfy Assumption 8.2. 8.4.1 Game Models Satisfying Assumption 8.2 8.4.1.1 All players in Team 2 have the same information Consider an instance of game G in which every player in Team 2 has the following information structure I 2,j t = n Y 2 1:t ,U 2 1:t−1 o , j = 1,...,N 2 . (8.28) Further, Team 2’s information is known to every player in Team 1. Thus, the common information C t =I 2,j t . Under this condition, players in Team 2 do not have any private information. Thus, their private information P 2 t =?. Any information structure satisfying the above conditions satisfies 5 Note that the players in Team 2 might still be able to control the state dynamics through their actions. 143 Assumption 8.2, see Appendix G.4.0.1 for a proof. Since Team 1’s information structure is relatively unrestricted, the above model subsumes many information structures previously considered in [50, 162, 63, 159, 37] 8.4.1.2 Team 2’s observations become common information with one-step delay Consider an an instance of game G where the current private information of Team 2 becomes common information in the very next time-step. More specifically, we have C t+1 ⊇{Y 2 1:t ,U 2 1:t } and for each Playerj in Team 2,P 2,j t =Y 2,j t . Note that unlike in [162, 63], players in Team 2 have some private information in this model. 
Any information structure that satisfies the above conditions satisfies Assumption 8.2, see Appendix G.4.0.2 for a proof. 8.4.1.3 Team 2 does not control the state Consider an instance of GameG in which the state evolution and the players’ observations are given by X t+1 =f t (X t ,U 1 t ,W s t ); Y i,j t =h i,j t (X t ,U 1 t−1 ,W i,j t ). Further, let the information structure of the players be such that the common and private infor- mation evolve as Z t+1 =C t+1 \C t =ζ t+1 (P 1:2 t ,U 1 t ,Y 1:2 t+1 ) (8.29) P i t+1 =ξ i t+1 (P 1:2 t ,U 1 t ,Y 1:2 t+1 ). (8.30) In this model, Team 2’s actions do not affect the system evolution and Team 2’s past actions are not used by any of the players to select their current actions. Any information structure that satisfies these conditions satisfies Assumption 8.2, see Appendix G.4.0.3 for a proof. 144 8.4.1.4 Global and local states Consider a system in which the system stateX t = (X 0 t ,X 1 t ,X 2 t ) comprises of a global stateX 0 t and a local state X i t = (X i,1 t ,...,X i,N i t ) for Team i = 1, 2. The state evolution is given by X 0 t+1 =f 1 t (X 0 t ,X 1 t ,U 1 t ,U 2 t ,W s,1 t ) (8.31) X 1 t+1 =f 2 t (X 0 t ,X 1 t ,U 1 t ,U 2 t ,W s,2 t ) (8.32) X 2 t+1 =f 3 t (X 0 t ,U 1 t ,U 2 t ,W s,3 t ). (8.33) Note that Team 2’s current local state does not affect the state evolution. Further, we have C t ={X 0 1:t ,U 1 1:t−1 ,U 2 1:t−1 } (8.34) P 1,j t ={X 1,j t } ∀j = 1,...,N 1 (8.35) P 2,j t ={X 2,j t } ∀j = 1,...,N 2 . (8.36) Under this system dynamics and information structure, we can prove that Assumption 8.2 holds. The proof of this is provided in Appendix G.4.0.4. Remark 8.1. Zero-sum games between two individual players that satisfy a property similar to Assumption 8.2 were studied in [50]. 8.4.2 Min-max Value and Strategy in Game G 8.4.2.1 Dynamic Program Since we are considering special cases of GameG, we can use the analysis in Section 8.3.3 to write the min-max dynamic program for virtual player 1. Because of Assumption 8.2, the belief update F t (π t ,γ 1:2 t ,z t+1 ) in (8.22) is replaced by F t (π t ,γ 1 t ,z t+1 ). Using Theorems 8.2 and 8.3, we can can conclude that the upper value of the expanded game S u (G e ) =E[V u 1 (Π 1 )] and that the strategy ˜ χ 1∗ obtained from the dynamic program is a min-max strategy for virtual player 1 in Game G e . 145 Algorithm 4 Strategy g 1,j∗ for Player j in Team 1 Input: Ξ 1 t (π) obtained from DP for all t and all π for t = 1 to T do Current information: C t ,P 1,j t . C t ={C t−1 ,Z t } if t = 1 then Initialize CIB Π t using C t else Update CIB Π t =F t−1 (Π t−1 , Ξ 1 t−1 (Π t−1 ),Z t ) Get prescription Γ 1 t = Ξ 1 t (Π t ) Get distribution δU 1,j t = Γ 1,j t (P 1,j t ) Select action U 1,j t =rand(U 1,j t ,δU 1,j t ,K 1,j t ) 8.4.2.2 Min-max Value and Strategy The following results provide a characterization of the min-max valueS u (G) and a min-max strategy g 1∗ in gameG under Assumption 8.2. Theorem 8.4. Under Assumption 8.2, we have S u (G) =S u (G e ) =E[V u 1 (Π 1 )]. Proof. See Appendix G.5. Theorem 8.5. Under Assumption 8.2, the strategyg 1∗ defined in Algorithm 4 is a min-max strategy for Team 1 in the original game G. Proof. See Appendix G.5. 8.5 Joint Randomization 8.5.1 Strategies and Values So far, we considered a setting in which players randomize independently. In this independent randomization mechanism (see Section 8.2.0.2), each player j in Team i has access to a uniformly distributed random variableK i,j t at timet. 
This variable is used to randomly select an action using the mechanism in (8.7). We will now allow the players in a team to jointly randomize their actions in the following manner. At each time t, in addition to its information I i,j t , each Player j in Team i has access to a random variable K i t that is uniformly distributed over the interval (0, 1]. This random variable K i t is used for joint randomization by the players in Team i at time t. At each time t, each player in Team i can use all information available to it at that time and the common 146 randomnessK i t to select its actions. More precisely, Player j of Teami chooses actionU i,j t using a (measurable) control law g i,j t :I i,j t × (0, 1]→U i,j t , i.e. U i,j t =g i,j t (I i,j t ,K i t ). (8.37) The control law of Team i at time t is denoted by g i t := (g i,1 t ,...,g i,N i t ). Let the set of all control laws for Team i at time t be denoted byG i t . The collection of control laws g i = (g i 1 ,...,g i T ) is referred to as the control strategy of Team i, and the pair of control strategies (g 1 ,g 2 ) is referred to as a strategy profile. Let the set of all possible control strategies for Team i beG i . The random variablesK 1 1:T ,K 2 1:T used for joint randomization are independent of each other and other primitive random variables. The total expected cost associated with a strategy profile (g 1 ,g 2 ) is J jr (g 1 ,g 2 ) :=E (g 1 ,g 2 ) " T X t=1 c t (X t ,U 1 t ,U 2 t ) # , (8.38) where c t :X t ×U 1 t ×U 2 t →R is the cost function at time t. Team 1 wants to minimize the total expected cost, while Team 2 wants to maximize it. We refer to this zero-sum game between Team 1 and Team 2 as GameG jr where the superscript jr denotes joint randomization. Definition 8.4. The upper value of the game G jr is defined as S u (G jr ) := inf g 1 ∈G 1 sup g 2 ∈G 2 J jr (g 1 ,g 2 ). (8.39) The lower value of the game G is defined as S l (G jr ) := sup g 2 ∈G 2 inf g 1 ∈G 1 J jr (g 1 ,g 2 ). (8.40) If the upper and lower values are the same, they are referred to as the value of the game and denoted by S(G jr ). Note that in gamesG andG jr , the strategy spaces are identical. However, the joint randomization mechanism in game G jr subsumes the independent randomization mechanism in game G. This can be shown in the following manner. For any given integer k, let μ k : (0, 1]→ (0, 1] k be a 147 measurable mapping such that if K is uniformly distributed then: (i) each component μ j k (K) of μ k (K) := [μ 1 k (K),...,μ k k (K)] is uniformly distributed over the interval (0, 1] for 1≤j≤k, and (ii) the collection of variablesμ 1 k (K),...,μ k k (K) is mutually independent. Teami can use the mapping μ N i to generateN i independent and uniformly distributed variables fromK i t as h K i,1 t ,...,K i,N i t i = μ N i (K i t ). Thus, any given strategy g i for Team i in gameG can be implemented in gameG jr using the strategy ¯ g i where ¯ g i,j t (I i,j t ,K i t ) =g i,j t (I i,j t ,μ j N i (K i t )). (8.41) This superiority of the joint randomization mechanism over the independent randomization mech- anism leads to the following relationship between the upper and lower values of games G andG jr . Lemma 8.4. We have S l (G)≤S l (G jr )≤S u (G jr )≤S u (G). (8.42) 8.5.2 Virtual Game G v We will construct a virtual game analogous to Game G v in Section 8.3.1. This game involves the same set of primitive random variables as in Game G. All the players in each team of game G are replaced by two virtual players inG v . 
The virtual players operate as follows. At each timet, virtual Playeri selects a function Γ i t that maps private informationP i t to a distributionδU i t over the space U i t . We will refer to these functions as prescriptions. The prescription Γ i t cannot be any arbitrary mapping fromP i t to ΔU i t . We require that Virtual Player i select its prescription from a setB i t of admissible prescriptions which we define below in Definition 8.6. Let Q i t := N i Y j=1 F(P i,j t ,U i,j t ). (8.43) We denote a generic element of Q i t by $ i := ($ i,1 ,...,$ i,N i ), where $ i,j ∈ F(P i,j t ,U i,j t ) for j = 1,...,N i . We write $ i (p i ) =u i if $ i,j (p i,j ) =u i,j for all j = 1,...,N i .Q i t can be thought of as the collection of all admissible pure (i.e., non-randomized) mappings fromP i t toU i t . A probability 148 distribution δ over the spaceQ i t induces a mappingL i t (δ)∈F(P i t , ΔU i t ), where the transformation L i t is defined below. Definition 8.5 (Induced Prescription). For each δ∈ ΔQ i t , define a transformationL i t : ΔQ i t → F(P i t , ΔU i t ) asL i t (δ) =γ i t , where for each p i t ,u i t , γ i t (p i t ;u i t ) := X $∈Q i t δ($)1 $(p i t )=u i t . (8.44) Here, γ i t (p i t ;u i t ) denotes 6 the probability assigned to team action u i t by the distribution γ i t (p i t ). The set of admissible prescriptionsB i t is the collection of all the prescriptions induced by the transformationL i t defined above. Definition 8.6 (Admissible Prescriptions). The set of admissible prescriptionsB i t is defined as B i t := n L i t (δ) :δ∈ ΔQ i t o . (8.45) Remark 8.2. For any admissible prescription γ i t , we know there exists a distribution δ∈ ΔQ i t such that γ i t =L i t (δ). One approach for constructing such a distribution δ given an admissible prescription γ i t , is to find a solution to the following linear constraints. γ i t (p i t ;u i t ) = X $∈Q i t δ($)1 $(p i t )=u i t ∀p i t ,u i t (8.46) X $∈Q i t δ($) = 1 (8.47) δ($)≥ 0 ∀$∈Q i t . (8.48) We will denote the solution 7 to the linear program above with ¯ L i t (γ i t ). 6 This notation will be used throughout the paper. 7 We choose one arbitrarily if multiple solutions exist. 149 Once the virtual players select their prescriptions (from their respectiveB i t ), the team action U i t is randomly generated according to distribution Γ i t (P i t ) for i = 1, 2. More precisely, the system dynamics for this game are given by: U i t =rand(U i t , Γ i t (P i t ),K i t ), (8.49) where the random variable K i t and the mechanism rand are the same as in equation (8.37). In the virtual game, virtual players use the common informationC t to select their prescriptions at time t. The i-th virtual player selects its prescription according to a control law χ i t , i.e. Γ i t = χ i t (C t ). For virtual player i, the collection of control laws over the entire time horizon χ i = (χ i 1 ,...,χ i T ) is referred to as its control strategy. LetH i t be the set of all possible control laws for virtual player i at time t and letH i be the set of all possible control strategies for virtual player i, i.e.H i =H i 1 ×···×H i T . The total cost associated with the game for a strategy profile (χ 1 ,χ 2 ) is J (χ 1 ,χ 2 ) =E (χ 1 ,χ 2 ) " T X t=1 c t (X t ,U 1 t ,U 2 t ) # , (8.50) where the function c t is the same as in GameG. The following lemma establishes a connection between the original gameG and the virtual game G v constructed above. Lemma 8.5. 
LetS u (G v ) andS l (G v ) be, respectively, the upper and lower values of the virtual game G v . Then, S l (G) =S l (G v ) and S u (G) =S u (G v ). Consequently, if a Nash equilibrium exists in the original game G, then S(G) =S l (G v ) =S u (G v ). Proof. See Appendix G.6. Remark 8.3. There exists a transformationN i (see Appendix G.1) such that if a strategy χ i is an -optimal strategy in the virtual game G v , then g i :=N i (χ i ) is -optimal in the original game G. More precisely, if sup χ 2J (χ 1 ,χ 2 ) < S u (G v ) + then sup g 2J(N 1 (χ 1 ),g 2 ) < S u (G) + and if inf χ 1J (χ 1 ,χ 2 )>S l (G v )− then inf g 1J(g 1 ,N 2 (χ 2 ))>S l (G)−. This is a direct consequence of Lemma 8.1 and Lemma G.4 in Appendix G.1. 150 8.5.3 Virtual Expanded Game Construction of the virtual expanded game involves the same steps that were used in Sections 8.3 and 8.4 for the case of independent randomization. The results on value and strategy character- izations that were established in Sections 8.3 and 8.4 can be established using similar arguments. The key difference between the virtual expanded game in Sections 8.3 and 8.4 and herein is the prescription spaceB i t . 8.6 Structural Properties of the Value Functions and Numerically Solving Zero-sum Games Consider an instance of GameG in which Team 1 has two players and Team 2 has only one player. At each time t, Player 1 in Team 1 observes the state perfectly, i.e. Y 1,1 t = X t , but the player in Team 2 gets an imperfect observationY 2 t defined as in (8.3). Player 1 has complete information: at each timet, it knows the entire state, observation and action histories of all the players. The player in Team 2 has partial information: at each time t, it knows its observation history Y 2 1:t and action histories of all the players. Player 2 in Team 1 has the same information as that of the player in Team 2. Thus, the total information available to each player at t is as follows: I 1,1 t ={X 1:t ,Y 2 1:t ,U 1:2 1:t−1 }; I 1,2 t =I 2 t ={Y 2 1:t ,U 1:2 1:t−1 }. Clearly, I 2 t ⊆ I 1,1 t . The common and private information for this game can be written as follows: C t = I 2 t , P 1,1 t ={X 1:t } and P 1,2 t = P 2 t = ?. The increment in common information at time t is Z t ={Y 2 t ,U 1:2 t−1 }. In the game described above, the private information in P 1,1 t includes the entire state history. However, Player 1 in Team 1 can ignore the past states X 1:t−1 without loss of optimality. Lemma 8.6 (Proof in App. H.2). There exists a min-max strategy g 1∗ such that the control law g 1,1∗ t at time t uses only X t and I 2 t to select δU 1,1 t , i.e., δU 1,1 t =g 1,1∗ t (X t ,I 2 t ). The lemma above implies that, for the purpose of characterizing the value of the game and a min-max strategy for Team 1, we can restrict player 1’s information structure to beI 1,1 t ={X t ,I 2 t }. 151 Thus, the common and private information become: C t = I 2 t , P 1,1 t ={X t } and P 2 t = P 1.1 t = ?. We refer to this game with reduced private information as Game H. The corresponding expanded virtual game is denoted byH e . A general methodology for reducing private information in decentralized team and game problems can be found in [142]. The information structure inH is a special case of the first information structure in Section 8.4.1, and thus satisfies Assumption 8.2. Therefore, using the dynamic program in Section 8.4.2, we can obtain the value function V u 1 and the min-max strategy g 1∗ . Numerical Experiments We consider a particular example of game H described above. 
In this example, there are two entities ($l$ and $r$) that can potentially be attacked and at any given time, exactly one of the entities is vulnerable. Player 1 of Team 1 knows which of the two entities is vulnerable whereas all the other players do not have this information. Player 2 of Team 1 can choose to defend one of the entities. The attacker in Team 2 can either launch a blanket attack on both entities or launch a targeted attack on one of the entities. When the attacker launches a blanket attack, the damage incurred by the system is minimal if Player 2 in Team 1 happens to be defending the vulnerable entity and the damage is substantial otherwise. When the attacker launches a targeted attack on the vulnerable entity, the damage is substantial irrespective of the defender's position. But if the attacker targets the invulnerable entity, the attacker becomes passive and cannot attack for some time. Thus, launching a targeted attack involves high risk for the attacker. The state of the attacker (active or passive) and all the players' actions are public information. The system state $X_t$ thus has two components, the hidden state ($l$ or $r$) and the state of the attacker ($a$ or $p$). For convenience, we will denote the states $(l,a)$ and $(r,a)$ with 0 and 1 respectively. The only role of Player 1 in Team 1 in this game is to signal the hidden state using two available actions $\alpha$ and $\beta$. The main challenge is that both the defender and the attacker can see Player 1's actions. Player 1 needs to signal the hidden state to some extent so that its teammate's defense is effective under blanket attacks. However, if too much information is revealed, the attacker can exploit it to launch a targeted attack and cause significant damage. In this example, the key is to design a strategy that can balance between these two contrasting goals of signaling and secrecy. A precise description of this model is provided in Appendix H.3.

Figure 8.1: (a) An estimate of the value function $V^u_1(\cdot)$. (b) Prescriptions at $t = 1$ for Player 1 in Team 1. In these plots, the $x$-axis represents $\pi_1(0)$ and we restrict our attention to those beliefs where $\pi_1(0) + \pi_1(1) = 1$, i.e. when the attacker is active. In Figure 8.1b, the blue and red curves respectively depict the Bernoulli probabilities associated with the distributions $\gamma^{1,1}_1(0)$ and $\gamma^{1,1}_1(1)$, where $\gamma^{1,1}_1$ is Player 1's prescription in Team 1.

In order to solve this problem, we used the approximate DP approach discussed in Appendix H.1. The value function $V^u_1(\cdot)$ thus obtained is shown in Figure 8.1a. The tension between signaling and secrecy can be seen in the shape of the value function in Figure 8.1a. When the CIB $\pi_1(0) = 0.5$, the value function is concave in its neighborhood and decreases as we move away from 0.5. This indicates that in these belief states, revealing the hidden state to some extent is preferable. However, as the belief goes further away from 0.5, the value function starts increasing at some point. This indicates that the adversary has too much information and is using it to inflict damage upon the system. Figure 8.1b depicts Player 1's prescriptions leading to non-trivial signaling patterns at various belief states.
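Before examining the prescriptions in Figure 8.1b more closely, the following minimal sketch shows how the common-information belief reacts to Player 1's action under a revealing prescription versus a nearly uninformative one. The numerical values of the prescriptions below are made-up placeholders, not the solution of the model in Appendix H.3.

```python
import numpy as np

def cib_update(pi, gamma, action):
    """Bayes update of the belief pi over the hidden state {0, 1} after everyone
    observes Player 1's action (alpha = 0, beta = 1).
    gamma[x] is the probability that Player 1 plays beta in hidden state x."""
    like = np.array([gamma[0] if action == 1 else 1 - gamma[0],
                     gamma[1] if action == 1 else 1 - gamma[1]])
    post = like * pi
    return post / post.sum()

pi = np.array([0.5, 0.5])            # attacker's prior on the hidden state

# Distinct action distributions for the two hidden states => strong signaling.
gamma_signal = {0: 0.2, 1: 0.8}      # P(beta | hidden state), illustrative numbers
# Nearly identical distributions => the action reveals almost nothing.
gamma_secret = {0: 0.49, 1: 0.51}

for name, gamma in [("signaling", gamma_signal), ("secrecy", gamma_secret)]:
    post = cib_update(pi, gamma, action=1)   # suppose beta is observed
    print(name, "posterior on state 0:", round(post[0], 3))
```

Under the first prescription the belief moves sharply away from 0.5 (useful for the defender but exploitable by the attacker), while under the second it barely moves; the optimal prescriptions in Figure 8.1b interpolate between these extremes depending on the current CIB.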
Notice that the distributions γ 1,1 1 (0) and γ 1,1 1 (1) for hidden states 0 and 1 are quite distinct when π 1 (0) = 0.5 (indicating significant signaling) and are nearly identical when π 1 (0) = 0.72 (indicating negligible signaling). A more detailed discussion on our experimental results can be found in Appendix H.3. 8.7 Conclusions We considered a general model of stochastic zero-sum games between two competing decentralized teams and provided bounds on their upper and lower values in the form of CIB based dynamic programs. When game has a value, our bounds coincide with the value. We identified several in- stances of this game model (including previously considered models) in which the CIB is controlled 153 only by one of the teams (say the minimizing team). For such games, we also provide a character- ization of the min-max strategy. Under this strategy, each player only uses the current CIB and its private information to select its actions. The sufficiency of the CIB and private information for optimality can potentially be exploited to design efficient strategies in various problems. We dis- cussed a methodology for obtaining similar value and strategy characterizations when players in a team can randomize jointly in a particular manner. Finally, we proposed a computational approach for approximately solving the CIB based DPs. There is significant scope for improvement in our computational approach. Tailored forward exploration heuristics for sampling the belief space and adding a policy network can improve the accuracy and tractability of our approach. 154 Chapter 9 Equivalent Static and Dynamic Games 9.1 Introduction Decision-making problems with multiple agents can be classified into two categories based on the dependence of information and decisions. In static problems (also called one-shot or single-stage problems), the information available for making a decision does not depend on other decisions. In dynamic problems, the information available for making a decision may depend on other decisions. This dependence of information on decisions is common in multi-stage problems where decision taken now usually affect information available later. The interaction of decisions and information can lead to effects such as signaling where an agent uses its decision as a means to communicate some information to a future agent. Such signaling effects are absent in static problems. If the multi-agent decision problem is a team problem where all agents want to cooperatively minimize an expected cost, then Witsenhausen [156] showed that a dynamic problem is equivalent to a static problem if a technical condition holds. Further, this condition always holds if the random uncertainties, observations and decisions involved in the problem are discrete-valued. In this paper, we extend Witsenhausen’s observation to games where each agent is interested in minimizing its own expected cost. 155 9.2 System Model 9.2.1 Notation We use capital letters with appropriate subscripts (e.g. Y t ) to represent random variables and small letters (e.g. y t ) realizations of the random variable. The range of possible values of a random variable is denoted by the corresponding calligraphic letter (e.g.Y t ). A vector of variables (y i ,...,y j ) is denoted by y i:j . For a collection of functions γ 1:T , we use γ −t to represent γ 1:T \γ t . Probability distributions are represented as boldface letters (e.g. P). For conciseness, P(Y = y,U =u) is often represented as P(y,u). 
We use P γ to denote that distribution P is defined for a fixed choice of γ. 9.2.2 Dynamic Game We consider a game G d with T non-cooperative agents. The agents/players act in a pre-specified sequence. We useA t to refer to thet−th agent in this sequence. The actions taken by an agent are based on information available to it which can potentially depend on the actions of the preceding agents. Following are the parameters that characterize the game G d : 1. Number of agents T (T is finite). 2. An underlying probability space (Ω,F, P) which is a product of T + 1 independent factors (Ω 0 ,F 0 , P 0 ) and (Ω t ,F t , P t ) wheret∈{1, 2,...,T}. W 0 andW t are random variables defined on spaces (Ω 0 ,F 0 , P 0 ) and (Ω t ,F t , P t ), respectively. W 0:T are jointly independent random variables. 3. Measurable spaces (Y t ,F y t ) representing the observations made by agent A t . 4. Measurable spaces (U t ,F u t ) representing the action taken by agent A t . 5. Measurable functions for the observation of agent A t g t : (Ω 0 ,F 0 )× (Ω t ,F t )× t−1 Y τ=1 (U τ ,F u τ )→ (Y t ,F y t ). 156 Thus, the observation of A t is given as Y t =g t (W 0 ,W t ,U t−1 ). 6. In addition to its own observationY t , agentA t has access to some observations made by agents that acted before it. The subset k t ⊆{1, 2,...,t} denotes the indices of the observations available to agent A t . Thus, the information of agent A t is I t :={Y τ :τ∈k t }. 7. Real measurable functions representing the cost associated with agent A t V t : (Ω 0 ,F 0 )× T Y τ=1 ((Y τ ,F y τ )× (U τ ,F u τ ))→R The strategy of agent A t is a measurable function from its information to its actions, that is, γ t : Y τ∈kt (Y τ ,F y τ )→ (U t ,F u t ) or equivalently U t =γ t (I t ). Let γ := (γ 1 ,γ 2 ,...,γ T ). We call γ the strategy profile of the agents. Let the set of all possible strategies for agent A t be Γ t . Γ = Q T t=1 Γ t . The cost for agent A t associated with strategy profile γ∈ Γ is defined as J t (γ) =E γ [V t (W 0 ,Y 1:T ,U 1:T )] where U t =γ t (I t ) for t = 1, 2,...,T . Definition 9.1 (Nash equilibrium). A strategy profile γ ∗ is a Nash equilibrium if and only if for every t J t (γ ∗ )≤J t (γ t ,γ ∗ −t ) for every γ t ∈ Γ t . 157 9.2.3 Equivalent Games Consider two instances G and ˜ G of the dynamic game model described earlier. The parameters associated with the second game are distinguished by a tilde. Definition 9.2. The games G and ˜ G are equivalent if • The number of agents is the same, i.e. T = ˜ T . • For eacht, (U t ,F u t ) is isomorphic to ( ˜ U t , ˜ F u t ) and (Y t ,F y t ) is isomorphic to ( ˜ Y t , ˜ F y t ). Further, k t = ˜ k t for each t. Thus, for each t, there exists a bijection between the sets of possible strategies Γ t and ˜ Γ t in the two games. This bijection is referred to as strategy correspondence. • If all corresponding components ofγ and ˜ γ are related by strategy correspondence, thenJ(γ) = ˜ J(˜ γ). The above definition of equivalence between games is analogous to the equivalence of team problems defined in [156]. 9.3 Static Reduction Static games are the games in which the observations of an agent do not depend on actions taken in the past. In the model of Section 10.2, this means that the functions g t do not depend on the past actions U 1:t−1 . In static games, observations Y 1:T are random variables with known prior distribution that do not depend on the choice of agents’ strategies. 
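Before carrying out the reduction, the following toy instance makes the dynamic game model above and Definition 9.1 concrete. The game, its costs, and the brute-force equilibrium check are our own illustrative construction (not part of the model): the second agent's observation depends on the first agent's action, which is exactly the dependence that the static reduction removes.

```python
import itertools

# A toy finite dynamic game with T = 2 agents. Agent 1 observes W0 and picks U1;
# Agent 2 observes Y2 = U1 (an action-dependent observation) and picks U2.
W0_vals, U1_vals, U2_vals = [0, 1], [0, 1], [0, 1]
P_W0 = {0: 0.5, 1: 0.5}

# Illustrative costs: agent 1 is penalized when agent 2 matches W0 exactly,
# while agent 2 wants to match W0.
def V1(w0, u1, u2):
    return 1.0 if u2 == w0 else 0.0

def V2(w0, u1, u2):
    return 0.0 if u2 == w0 else 1.0

# Pure strategies: g1 maps W0 -> U1 and g2 maps Y2 = U1 -> U2.
G1 = list(itertools.product(U1_vals, repeat=len(W0_vals)))
G2 = list(itertools.product(U2_vals, repeat=len(U1_vals)))

def J(g1, g2, V):
    return sum(P_W0[w0] * V(w0, g1[w0], g2[g1[w0]]) for w0 in W0_vals)

def is_nash(g1, g2):
    # Definition 9.1: no unilateral deviation lowers an agent's own expected cost.
    if any(J(d1, g2, V1) < J(g1, g2, V1) - 1e-12 for d1 in G1):
        return False
    if any(J(g1, d2, V2) < J(g1, g2, V2) - 1e-12 for d2 in G2):
        return False
    return True

equilibria = [(g1, g2) for g1 in G1 for g2 in G2 if is_nash(g1, g2)]
print("pure-strategy Nash equilibria:", equilibria)
```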
In the rest of the paper, we will assume that all random variables, observations and actions in G d take values in finite sets with the power-set as the associated sigma-algebra. We will show that the gameG d of Section 10.2 is equivalent to a static game. Following the approach for teams used in [156], the key idea in reducing the dynamic game to a static game is to transfer the dependence of observations on actions to the cost part. This is done based on the functions f t and probability distribution Q t defined as follows. 158 Definition 9.3. For each agentA t in the dynamic game, definef t and Q t such that the conditional probability of Y t =y t given W 0 =ω 0 ,U 1 =u 1 ,...,U t−1 =u t−1 in the game G d can be written as P(Y t =y t |W 0 =ω 0 ,U 1 =u 1 ,...,U t−1 =u t−1 ) =f t (y t ,ω 0 ,u 1 ,...,u t−1 )Q t (y t ) for every y t ,ω 0 ,u 1 ,...,u t−1 . SinceY t is finite, Q t (y t ) can be any full-support distribution overY t and f t (y t ,ω 0 ,u 1 ,...,u t−1 ) = P(Y t =y t |W 0 =ω 0 ,U 1 =u 1 ,...,U t−1 =u t−1 ) Q t (y t ) Lemma 9.1. Consider a strategy profile forγ∈ Γ in G d . Then, the joint distribution of variables W 0 ,Y 1 ,...,Y t underγ is given by P γ (W 0 =ω 0 ,Y 1:t =y 1:t ) = P(ω 0 ) t Y τ=1 f τ (y τ ,ω 0 ,u 1:τ−1 )Q τ (y τ ), where u t =γ t (i t ). Proof. Using chain rule, we have P γ (W 0 =ω 0 ,Y 1:t =y 1:t ) = P(ω 0 )P(y 1 |ω 0 ) t Y τ=2 P γ (y τ |ω 0 ,y 1:τ−1 ) = P(ω 0 )P(y 1 |ω 0 ) t Y τ=2 P(y τ |ω 0 ,u 1:τ−1 ) = P(ω 0 ) t Y τ=1 f τ (y τ ,ω 0 ,u 1:τ−1 )Q τ (y τ ). 159 9.3.1 Equivalent Static Game Consider a static gameG s whose parameters 1 are defined based on those ofG d as follows: 1. Number of agents T ∗ =T . 2. (Ω 0 ,F 0 , P 0 ) is degenerate, and (Ω ∗ t ,F ∗ t , P ∗ t ) = (Y t ,F y t , Q t ) for t∈{1, 2,...,T}. 3. (Y ∗ t ,F ∗y t ) = (Y t ,F y t ) for all t. 4. (U ∗ t ,F ∗u t ) = (U t ,F u t ) for all t. 5. The observation of agent A t is given by Y ∗ t =g ∗ t (W ∗ t ,U ∗ 1:t−1 ) =W ∗ t . Note that the observations are mutually independent. 6. k ∗ t =k t for all t. 7. Cost function for A t is V ∗ t (y ∗ 1:T ,u ∗ 1:T ) = X ω 0 ∈Ω 0 " P(ω 0 )V t (ω 0 ,y ∗ 1:T ,u ∗ 1:T ) T Y τ=1 f τ (y ∗ τ ,ω 0 ,u ∗ 1:τ−1 ) # With the above definitions, the strategy space of G s andG d are the same, tht is, Γ ∗ = Γ. The cost for agent A t associated with γ ∗ ∈ Γ ∗ is J ∗ t (γ ∗ ) =E γ ∗ [V ∗ t (Y ∗ 1:T ,U ∗ 1:T )] where U ∗ t =γ ∗ t (I ∗ t ) for t = 1, 2,...,T . Theorem 9.1. The static game G s is equivalent to the dynamic game G d . 1 The parameters of the static game are distinguished by an asterisk mark. 160 Proof. Clearly, the isomorphism conditions are satisfied because the observation and actions spaces are identical. For a strategy profileγ, the expected cost for A t inG d is J t (γ) = X ω 0 ,y 1:T [V t (ω 0 ,y 1:T ,u 1:T )P(W 0 =ω 0 ,Y 1:T =y 1:T )], where u t =γ t (i t ). Using Lemma 9.1 this can be written as X ω 0 ,y 1:T " P(ω 0 )V t (ω 0 ,y 1:T ,u 1:T ) T Y τ=1 Q τ (y τ )f τ (y τ ,ω 0 ,u 1:τ−1 ) # = X y 1:T " V ∗ t (y 1:T ,u 1:T ) T Y τ=1 Q τ (y τ ) # =J ∗ t (γ). from the above equivalence, we can establish equivalence of Nash equilibria in gamesG d andG s . Theorem 9.2. A strategy profileγ∈ Γ is a Nash-equilibrium in the dynamic game G d if and only if it is a Nash-equilibrium in the static game G s . Proof. Assume thatγ is an equilibrium in the dynamic game and not an equilibrium in the static equivalent. Sinceγ is not an equilibrium inG s , there exists an agent A t and strategy γ 0 t such that J ∗ t (γ 0 t ,γ −t ) < J ∗ t (γ). 
From performance equivalence of static and dynamic games, J t (γ 0 t ,γ −t ) < J t (γ) in G d which is a contradiction since γ is an equilibrium in G d . A similar argument can be given for the converse as well. 9.4 Behavioral Strategies The strategies used in Sections 10.2 and 9.3 are deterministic or pure strategies. Pure strategies can be generalized by allowing randomization over the action spaces. These randomized strategies are called behavioral strategies. A behavioral strategy maps the available information to a distribution over action space. Let ΔU t be the set of all distributions overU t . Thus, γ t :I t → ΔU t is a behavioral strategy. Under this strategy, if i t is the realization of I t , then action u is taken with 161 probabilityγ t (i t )(u). Note that it is implicit in the above description that different agents randomize independently of each other. Let the set of all possible behavioral strategies for agent A t be Γ b t . Theorem 9.3. The static game G s is equivalent to the dynamic game G d even when behavioral strategies are allowed. Proof. Consider a behavioral strategy profileγ in G d . Under this strategy profile the joint distri- bution of W 0 and the observations and actions inG d can be written as P γ (W 0 =ω 0 ,Y 1:t =y 1:t ,U 1:t =u 1:t ) = P(ω 0 )× t Y τ=1 P γ (y τ |ω 0 ,y 1:τ−1 ,u 1:τ−1 )P γ (u τ |ω 0 ,y 1:τ ,u 1:τ−1 ) = P(ω 0 ) t Y τ=1 P(y τ |ω 0 ,u 1:τ−1 )P γ (u τ |i τ ) = P(ω 0 ) t Y τ=1 f τ (y τ ,ω 0 ,u 1:τ−1 )Q τ (y τ )(γ τ (i τ )(u τ )). The expected cost for A t underγ can be written as J t (γ) = X ω 0 ,y 1:T ,u 1:T [V t (ω 0 ,y 1:T ,u 1:T )P(ω 0 ,y 1:T ,u 1:T )] = X ω 0 ,y 1:T ,u 1:T P(ω 0 )V t (ω 0 ,y 1:T ,u 1:T )× T Y τ=1 Q τ (y τ )f τ (y τ ,ω 0 ,u 1:τ−1 )(γ τ (i τ )(u τ )) = X y 1:T ,u 1:T " V ∗ t (y 1:T ,u 1:T ) T Y τ=1 Q τ (y τ )(γ τ (i τ )(u τ )) # = X y 1:T ,u 1:T [V ∗ t (y 1:T ,u 1:T )P ∗ (y 1:T ,u 1:T )], where P ∗ (y 1:T ,u 1:T ) = Q T τ=1 Q τ (y τ )(γ τ (i τ )(u τ )) is the joint distribution of actions and observations in the static game under the behavioral strategy profileγ. Thus, the expression X y 1:T ,u 1:T [V ∗ t (y 1:T ,u 1:T )P ∗ (y 1:T ,u 1:T )] 162 is simply the expected value of V ∗ t (·) in the static game under the behavioral strategy profile γ which, by definition, is J ∗ t (γ). Corollary 9.1. A behavioral strategy profileγ∈ Γ b is a Nash-equilibrium in the dynamic gameG d if and only if it is a Nash-equilibrium in the static game G s . Proof. The proof is identical to that of Theorem 9.2. Corollary 9.2. There exists a Nash equilibrium in behavioral strategy for the dynamic game G d . Proof. For every finite static (Bayesian) game, we know that there exists a behavioral strategy Nash equilibrium [110]. From the equivalence of equilibria in the static and dynamic games, it follows that there exists a behavioral strategy equilibrium in the dynamic game as well. 9.5 Team Nash Equilibrium In the dynamic game model, we assumed that the agents are non-cooperative. In this section, we consider a model in which agents can be grouped into teams. Agents in the same team coordinate their strategies and have the same cost function. The agents of the dynamic game G d are partitioned into k teams T 1 ,...,T k . The objective of each team is to minimize the cost K i (γ) =E γ [C i (W 0 ,Y 1:T ,U 1:T )] Definition 9.4 (Team Nash equilibrium). 
A strategy profile $\gamma^*$ is a Team Nash equilibrium if and only if for every team $i$
$$K^i(\gamma^*) \leq K^i(\gamma^i, \gamma^*_{-i})$$
for every $\gamma^i \in \Gamma^{\mathrm{team}}_i$, where $\Gamma^{\mathrm{team}}_i$ is the set of all feasible strategies of agents in Team $i$.

Theorem 9.4. A strategy profile $\gamma$ is a Team Nash equilibrium in the dynamic game if and only if it is a Team Nash equilibrium in the static game.

Proof. Similar to the proof of Theorem 9.2.

9.6 Example - Signaling Game

In this section, we illustrate the static reduction of a dynamic game with the help of an example.

(Block diagram: the source $X$ is observed by the transmitter, which sends $U_1$ to the intermediary; the intermediary sends $U_2$ to the receiver, which also sees $U_1$ and produces $U_3$.)

We consider a three-agent signaling game in which a transmitter observes a source $X$ and needs to transmit it to a receiver via an intermediary. The transmitter does not want the intermediary to know too much about $X$ but can communicate to the receiver only through the intermediary. The following facts are given:
• $X$ is a random variable with distribution $\mathbf{P}$.
• Observations: The transmitter's ($A_1$) observation $Y_1 = X$, the intermediary's ($A_2$) observation $Y_2 = U_1$ and the receiver's ($A_3$) observation $Y_3 = (U_1, U_2)$.
• Cost: The cost for $A_1$ is $(U_3 - X)^2 + (U_2 - X - \theta)^2$, it is $(U_2 - X)^2$ for $A_2$ and $(U_3 - X)^2$ for $A_3$.
Note that the agents in this game are non-cooperative and cannot have coordinated strategies. A static game can be constructed for the above signaling game with the following parameters:
• $A_1$ observes $Y_1$, which is a uniformly distributed variable on $\mathcal{X}$; $A_2$ observes $Y_2$, which is a uniformly distributed variable on $\mathcal{U}_1$; $A_3$ observes $Y_3$, which is a uniformly distributed variable on $\mathcal{U}_2$, and $Y_2$. $Y_1, Y_2, Y_3$ are mutually independent.
• The $f_t$ functions of Definition 9.3 are given as
$$f_1(y_1, x) = |\mathcal{X}|\,\delta(x - y_1), \qquad f_2(y_2, u_1) = |\mathcal{U}_1|\,\delta(y_2 - u_1), \qquad f_3(y_3, u_2) = |\mathcal{U}_2|\,\delta(u_2 - y_3).$$
• The cost function for $A_1$ is
$$\sum_{x \in \mathcal{X}} \mathbf{P}(x)\,|\mathcal{X}||\mathcal{U}_1||\mathcal{U}_2|\,\delta(x - y_1)\,\delta(y_2 - u_1)\,\delta(u_2 - y_3)\left((u_2 - x - \theta)^2 + (u_3 - x)^2\right).$$
• The cost function for $A_2$ is
$$\sum_{x \in \mathcal{X}} \mathbf{P}(x)\,|\mathcal{X}||\mathcal{U}_1||\mathcal{U}_2|\,\delta(x - y_1)\,\delta(y_2 - u_1)\,\delta(u_2 - y_3)\,(u_2 - x)^2.$$
• The cost function for $A_3$ is
$$\sum_{x \in \mathcal{X}} \mathbf{P}(x)\,|\mathcal{X}||\mathcal{U}_1||\mathcal{U}_2|\,\delta(x - y_1)\,\delta(y_2 - u_1)\,\delta(u_2 - y_3)\,(u_3 - x)^2.$$
From Theorem 9.2, the static game and the signaling game above have the same Nash equilibria.

9.7 Conclusion

We considered a dynamic game with $T$ agents and showed that in the case of finite variables this dynamic game is equivalent to a static game. This equivalence suggests that questions of equilibrium existence and equilibrium characterization in a dynamic game can potentially be investigated by considering its static reduction. Thus, we were able to establish the existence of an equilibrium in behavioral strategies in the dynamic game for any information structure without relying on Kuhn's theorem [89]. It remains to be explored whether this equivalence can be exploited for the purpose of computing and characterizing equilibria in dynamic games.

Part III
Sequential Decision-making for Communication

Chapter 10
Real-time Coordination over Communication Channels

10.1 Introduction

Decentralized decision problems typically involve multiple agents that need to make decisions based on different information about a changing environment. In many cases, these decisions need to be coordinated to satisfy some global system constraints and optimize performance. Such coordination can be achieved by communication.
For agents operating under strict time constraints, the coordination must be in real-time, that is, agents need to make coordinated decisions at each time instant based only on information currently available. The real-time nature of the coordination distinguishes the framework we consider here from the information-theoretic setups considered in [35]. We consider the problem of coordinating decisions between a pair of agents who may have different information about the state of their environment. Suppose at each time instant agent 1 can select one decision from the finite setU while agent 2 can select one decision from the finite setV. Certain decision pairs (u,v)∈U×V, however, are forbidden. This may be because these decision pairs lead to unacceptable performance or they violate some hard system constraints. Let F⊂U×V be the set of forbidden decisions. The two agents have to make sure that their joint decision (u,v) lies in the permitted setP =U×V\F. The best joint decision that should be chosen from the permitted set may depend on the state of the environment. If the two agents have the same information about the state of the environment, it is relatively straight forward to find decision strategies for the agents that select the best joint decision and avoid the forbidden pairs. The problem is more interesting if agents have different information. 167 In our setup, one agent is more informed about the environment than the other and the agents have limited communication ability before they need to make coordinated decisions. We further assume thatU =V and the set of permitted decisions is described asP ={(u,v)∈U×U : u = v}. Thus, the agents must select the same decision to achieve coordination. Exactly which decision should be taken will depend on the state of the environment. We first consider one-shot (single stage) coordination with noiseless and noisy communication channel between the agents. We observe that the coordination problem has some connections with zero-error communication [77], [100]. We next consider a sequential (multi-stage) problem with the state of environment modeled as a Markov process. In the case of noiseless communication, we adopt methods from real-time communication to address the coordination problem. When the communication is noisy, we show that the multi-stage coordination problem can be broken down to separate single-stage problems under additional statistical assumptions about the stochastic environment. 10.1.1 Organization The rest of the paper is organized as follows. We present notation and conventions in Section 10.1.2. We formulate the problem of real-time coordination over communication channels in Section 10.2. We illustrate the distinction between coordination and communication problems in Section 10.3. In Section 10.4, we establish the relationship between single-stage coordination problems and zero- error communication problems. In Sections 10.5 and 10.6, we analyze multi-stage coordination problems over noiseless and noisy channels. We conclude the paper in Section 10.7. 10.1.2 Notation Random variables/vectors are denoted by upper case letters, their realization by the corresponding lower case letter. In general, subscripts are used as time index while superscripts are used to index decision making agents. For time indicesn 1 ≤n 2 ,X n 1 :n 2 (resp. g n 1 :n 2 (·)) is the short hand notation for the variables (X n 1 ,X n 1 +1 ,...,X n 2 ) (resp. functions (g n 1 (·),...,g n 1 (·))). 
OperatorsP(·) andE[·] denote the probability of an event and the expectation of a random variable, respectively. For random variables/vectors X and Y ,P(·|Y = y),E[X|Y = y] andP(X = x| Y = y) are denoted byP(·|y),E[X|y] andP(x|y), respectively. For a strategy g, we useP g (·) (resp. E g [·]) to indicate 168 that the probability (resp. expectation) depends on the choice of g. We defineδ(a−b) = 1 ifa =b and 0 otherwise. The XOR operation is denoted by⊕. 10.2 Problem Formulation The coordination system consists of two agents: a transmitter A t and a receiver A r . The system operates in discrete time over a time horizonN. The transmitter observes a source processX n ∈X , and side-information processS n ∈S. The receiver can observe the side-information processS n but it does not have direct access to the source process X n . The transmitter encodes the source and transmits it toA r over a discrete memoryless channel. At any timen, the inputU n ∈U and output Y n ∈Y of the channel are related as follows: Y n =q n (U n ,W c n ), (10.1) where W c n is the channel noise. The channel noise is i.i.d. and independent of the source and side-information processes. Thus, the information available to A t and A r at time n is I t n = (X 1:n ,S 1:n ) (10.2) I r n = (Y 1:n ,S 1:n ), (10.3) respectively. Assumption 10.1. The source and side-information (X n ,S n ) evolve as an uncontrolled Markov process, i.e. P(X n+1 =x n+1 ,S n+1 =s n+1 |x 1:n ,s 1:n ) =P(X n+1 =x n+1 ,S n+1 =s n+1 |x n ,s n ). The encoding function used by A t at time n is f n :X n ×S n →U and thus, U n is given by U n =f n (X 1:n ,S 1:n ) =f n (I t n ). (10.4) 169 In order to coordinate, both the agents A t and A r must take the same decision D n ∈D at any given time. Let the respective decision functions used by the transmitter and receiver at time n be g t n :X n ×S n →D and g r n :Y n ×S n →D. We require that D n :=g t n (X 1:n ,S 1:n ) a.s. = g r n (Y 1:n ,S 1:n ). (10.5) Letf,g t andg r be the encoding and decision strategies over the horizon N and letψ := (f,g t ,g r ). We assume that all the random variables take values in finite sets. Definition 10.1 (Coordination Strategies). A strategy ψ := (f,g t ,g r ) is a coordination strategy if and only if g t n (X 1:n ,S 1:n ) a.s. = g r n (Y 1:n ,S 1:n ), for every n∈{1,...,N}. Let the set of all possible coordination strategies be Ψ. The cost incurred at time n is l n (X n ,S n ,D n ), and the total expected cost associated with a coordination strategy ψ∈ Ψ is J(ψ) =E ψ " N X n=1 l n (X n ,S n ,D n ) # . (10.6) The aim of this work is to find an optimal coordination strategy ψ ∗ , which is given by ψ ∗ ∈ arg min ψ∈Ψ J(ψ). (10.7) 10.3 Example: Coordination vs Communication In this section, we consider a coordination problem over a Binary Symmetric Channel (BSC) with time horizon N = 1, and compare it with a communication problem over the same channel. Let the source X be a Bernoulli random variable such that P[X = 1] = α > 1/2. The input U∈{0, 1} and output Y ∈{0, 1} of the binary symmetric channel are related as Y = (U⊕W ) (10.8) 170 whereW is an independent Bernoulli random variable such thatP[W = 0] =p and 1>p>α. Let the side-information be degenerate, i.e. S = 1 with probability 1. 10.3.1 The Communication Problem In the communication problem, the agent A t transmits U = f(X) using an encoding strategy f and A r estimates the source X based on its observation Y as ˆ X =g r (Y ). Let the cost function be l(x, ˆ x) = 0 if x = ˆ x 1 otherwise. 
The objective of the communication problem is to find encoding and decoding strategies that minimize the cost function J 1 (f,g r ) =E f,g r h l(X, ˆ X) i =P f,g r h X6= ˆ X i . For any encoding strategy f, the estimation strategy that minimizes the probability of error is the maximum a posteriori probability (MAP) estimate which is given by ˆ X = arg max x∈{0,1} P f [X =x|Y ]. It can be shown, by enumerating all possible strategiesf, that the identity mappingf(X) =X achieves optimal performance and thus, U = X. The MAP estimate for this identity mapping is ˆ X =g r (Y ) =Y and thus, the minimum achievable cost J ∗ 1 is 1−p. 171 10.3.2 The Coordination Problem In the coordination problem, the agentA t transmitsU =f(X) using an encoding strategyf. Both A t andA r must take the same decisionD∈{0, 1} using decision strategiesg t andg r , respectively, such that D :=g t (X) a.s. = g r (Y ). The objective of the coordination problem is to find encoding and decision strategies that minimize the cost function J 2 (f,g t ,g r ) =E (f,g t ,g r ) [l(X,D)] =P (f,g t ,g r ) [X6=D], subject to the constraint that (f,g t ,g r )∈ Ψ. If the encoding strategy isf(X) = 0 with probability 1, then g t (X) =g r (Y ) =g r (W ) with probability 1 if and only if g t and g r are constant mappings. Similarly, if f(X) = X, then g t (X) = g r (X⊕W ) with probability 1 if and only if g t and g r are constant mappings. By symmetry, this is true for f(X) = 1 and f(X) = 1⊕X as well. Thus, for any encoding strategy f, only constant decision strategies achieve coordination. Among the two constant decision strategy pairs, the pair (g t (X) = 1,g r (Y ) = 1) achieves optimal performance and the corresponding cost J ∗ 2 is 1−α. Notice that the optimal communication performanceJ ∗ 1 = 1−p is strictly less than the optimal coordination performanceJ ∗ 2 = 1−α because of the additional constraint on the decision strategies in the coordination problem. 10.4 Single Stage Coordination In this section, we analyze the coordination problem formulated in Section 10.2 for horizon N = 1. For notational convenience, we drop the time index in all the variables. 10.4.1 Noiseless Channel We first consider the case of noiseless channel. In this case, Y =U with probability 1. Proposition 10.1. For coordination over a noiseless channel, if the encoding strategy is f and the decision strategy at the receiver is g r , then the transmitter’s decision strategy is given by g t (x,s) = g r (f(x,s),s). 172 We will prove a more general statement in the next section. Thus, in case of noiseless channels, the coordination strategy is characterized by the encoding strategy at the transmitter and the decision strategy at the receiver. For any encoding strategyf, an optimal decision strategy is given by ˆ d f (y,s) = arg min d∈D X x:f(x,s)=y l(x,s,d)P[X =x|s]. (10.9) Let L(f) =E f [l(X,S, ˆ d f (Y,S))]. The coordination problem now reduces to minimizing L(f) over all possible encoding strategies. Notice that this problem is equivalent to a single-shot quantization problem and can be solved, under certain assumptions on the cost function, using algorithms such as Lloyd’s algorithm [34]. 10.4.2 Noisy Channel In this section, we establish that the problem of coordination over noisy channels can be reduced to a problem of coordination over noiseless channels. The solution for the noiseless case can then be used to find a solution for the noisy case with a simple transformation. Definition 10.2 (Single-shot Zero-error Capacity). 
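To make Proposition 10.1 concrete, here is a brute-force sketch of single-stage coordination over a noiseless channel for tiny alphabets: every encoder is enumerated, the receiver picks its best decision for each received symbol and side-information value, and the transmitter simply mimics that rule. The joint distribution and loss values below are randomly generated placeholders, not part of the model in the text.

```python
import itertools
import numpy as np

X, S, Z, D = range(3), range(2), range(2), range(3)
rng = np.random.default_rng(0)
P_XS = rng.dirichlet(np.ones(len(X) * len(S))).reshape(len(X), len(S))  # P(x, s)
loss = rng.random((len(X), len(S), len(D)))                              # l(x, s, d)

def receiver_rule(f):
    """Best decision for every (z, s) given the encoder f (a dict (x, s) -> z)."""
    g_r = {}
    for s in S:
        for z in Z:
            cell = [(x, P_XS[x, s]) for x in X if f[(x, s)] == z]
            if cell:
                g_r[(z, s)] = min(D, key=lambda d: sum(p * loss[x, s, d] for x, p in cell))
    return g_r

def expected_loss(f):
    g_r = receiver_rule(f)
    return sum(P_XS[x, s] * loss[x, s, g_r[(f[(x, s)], s)]] for x in X for s in S)

# Enumerate all encoders f : X x S -> Z and keep the best one.
encoders = [dict(zip(itertools.product(X, S), vals))
            for vals in itertools.product(Z, repeat=len(X) * len(S))]
f_star = min(encoders, key=expected_loss)
g_r_star = receiver_rule(f_star)
# The transmitter mimics the receiver, exactly as in Proposition 10.1.
g_t_star = {(x, s): g_r_star[(f_star[(x, s)], s)] for x in X for s in S}
print("optimal expected loss:", round(expected_loss(f_star), 4))
```

Because the channel is noiseless, coordination is automatic once the transmitter copies the receiver's rule, so the only real optimization is over encoders, which is the quantization-type search described next.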
Let the input and the output of a channel be related as y = q(u,w) and letM R be a set of R messages. Rate R is defined to be achievable if there exists an encoding strategy f :M R →U and a decoding strategy g :Y→M R such that m =g(y) =g(q(f(m),w))∀m∈M R ,w∈W. The single-shot zero-error capacity C 1 0 is defined as the maximum achievable rate. Definition 10.3 (Support of a Distribution). For random variables X and S, we define supp(X) ={x :P[X =x]> 0}, supp(X|S =s) ={x :P[X =x|S =s]> 0}. 173 The receiver’s uncertainty on the transmitter’s decision, conditioned on the side-information, can be at most C 1 0 to achieve coordination. This is formally stated as follows. Lemma 10.1. For any channel with single-shot zero error capacity C 1 0 , if (f,g t ,g r ) achieves co- ordination, then max s∈S | supp(D|S =s)| ≤C 1 0 , where D =g t (X,S). Proof. LetB(s) := supp(D|S =s) for each s∈S. Assume that for some s 0 ∈S,|B(s 0 )| >C 1 0 . Since the strategy achieves coordination, g t (x,s 0 ) =d⇔g r (y,s 0 ) =g r (q(f(x,s 0 ),w),s 0 ) =d, for every x∈ supp(X|S =s 0 ) and w∈W. Let g t in (d)∈ supp(X| S = s 0 ) be an arbitrary element in the inverse image of{d} under the strategyg t (·,s 0 ). Note that the inverse image is non-empty for everyd∈B(s 0 ). LetM :=B(s 0 ) be a set of messages. Consider the following encoding and decoding strategies ˜ f(m) :=f(g t in (m),s 0 ) and ˜ g(y) :=g r (y,s 0 ) over the set of messagesM. Since g t (g t in (m),s 0 ) =m, we have g r (q(f(g t in (m),s 0 ),w),s 0 ) =m, for every m∈B(s 0 ) and w∈W. Thus, ˜ g(y) = ˜ g(q( ˜ f(m),w)) =m∀m∈M,w∈W. And hence, rate|B(s 0 )| is achievable, which is a contradiction since it is larger thanC 1 0 . Therefore, |B(s)|≤C 1 0 for every s∈S. The following lemma implies that any coordination performance that is achievable using a noisy channel can be achieved using a noiseless channel with same single-shot zero error capacity and side-information. 174 Lemma 10.2. Let (f,g t ,g r ) be a coordination strategy over the noisy channel. Consider a noiseless channel with alphabetZ and|Z| = C 1 0 . There exists an injective function f s inj :B(s)→Z and its left inverse f 0 s inj :Z→D such that using (f s inj ◦g t ,g t ,f 0 s inj ) over the noiseless channel achieves coordination and the same performance as using (f,g t ,g r ) over the noisy channel. Proof. The decision strategies at the transmitter for strategies (f,g t ,g r ) and (f s inj ◦g t ,g t ,f 0 s inj ) are identical. Hence, they achieve the same performance. If the latter strategy also achieves coordination, then the two strategies are equivalent. By Lemma 10.1,|B(s)| must be less than or equal to|Z| = C 1 0 for every s∈S. And hence, for each s there exists an injective function f s inj :B(s)→Z. Letf s inj ◦g t be the encoding strategy for the noiseless channel. Since the noiseless channel has no error, the receiver receivesf s inj (D), whereD =g t (X,s), and can invert it usingf 0 s inj to recover D without error. Thus, coordination can be achieved with strategy (f s inj ◦g t ,g t ,f 0 s inj ) and the two strategies under consideration are equivalent. The converse of this lemma is also true and is stated as follows. Lemma 10.3. Given any coordination strategy (f z ,g t z ,g r z ) over a noiseless channel with alphabet Z of cardinality|Z| =C 1 0 , there exists an equivalent coordination strategy (f,g t ,g r ) over the noisy channel with zero-error capacity C 1 0 . Proof. 
Since the single-shot zero-error capacity of the noisy channel is|Z|, there exist functions φ :Z→U and φ 0 :Y→Z such that z =φ 0 (q(φ(z),w)) for all z∈Z and w∈W. Let f =φ◦f z , g t = g t z and g r = g r z ◦φ 0 . The strategy (f,g t ,g r ) achieves coordination because for every s∈S, g r (y,s) = g r z (φ 0 (q(φ(f z (x,s)),w)),s) = g r z (f z (x,s),s) = g t z (x,s) = g t (x,s) for all x,y,w. The last equality comes from the fact that (f z ,g t z ,g r z ) achieves coordination over the noiseless channel. Since both the strategies achieve coordination and have identical decision strategies at the transmitter, they are equivalent. The lemmas stated above imply the following result. Theorem 10.1. Let C 1 0 be the single-shot zero-error capacity of the channel and let (φ,φ 0 ) be encoding and decoding strategies that achieve capacity. Let (f z ,g t z ,g r z ) be an optimal coordination strategy for a noiseless channel with alphabet size C 1 0 . Then the strategy (φ◦f z ,g t z ,g r z ◦φ 0 ) achieves coordination and is optimal for the noisy channel. 175 10.5 Multi-stage Coordination over Noiseless Channels Let the channel available to A t and A r be a noiseless channel with input and output alphabet U =Y =Z. The objective is to find a strategy ψ := (f,g t ,g r )∈ Ψ which minimizes the total expected loss J(ψ) =E ψ " N X n=1 l n (X n ,S n ,D n ) # . An optimal solution for this coordination problem can be obtained with the help of the following structural results. Lemma 10.4. For coordination over a noiseless channel, if the encoding strategy is f and the decision strategy at the receiver is g r , then the transmitter’s decision strategy at n is given by g t n (x 1:n ,s 1:n ) =g r n (f 1 (x 1 ,s 1 ),...,f n (x 1:n ,s 1:n ),s 1:n ). Proof. The message z n =f n (x 1:n ,s 1:n ). The decision d n is given by d n =g t n (x 1:n ,s 1:n ) =g r n (z 1:n ,s 1:n ) =g r n (f 1 (x 1 ,s 1 ),...,f n (x 1:n ,s 1:n ),s 1:n ). Thus, the coordination strategy is completely characterized by the encoding strategy at the transmitter and the decision strategy at the receiver. The transmitter precisely knows the symbol received by the receiver because the communication is noiseless. This property makes this system similar to the class of causal communication systems studied in [152]. Optimal encoding and decoding solutions for such communication systems have been characterized in [152] (see also [67]). We use a similar framework to find structural properties of the optimal encoding and decision strategies in the coordination problem. Lemma 10.5. There is an optimal encoding strategy of the form f ∗ n (x 1:n ,s 1:n ,z 1:n−1 ) =φ n (x n ,s 1:n ,z 1:n−1 ). 176 Proof. The proof is similar to the one in [152]. Because of Lemma 10.5, we can restrict our analysis to encoding strategies of the formφ n (x n ,s 1:n ,z 1:n−1 ). Note that any such functionφ n :X×S n ×Z n−1 →Z can be seen as a mappingp n :S n ×Z n−1 → Γ, where Γ is the set of all mappings fromX toZ. Thus,γ n =p n (s 1:n ,z 1:n−1 ) andz n =γ n (x n ). Since the receiver’s information is common knowledge, an equivalent problem can be formulated in which A r is the only decision maker. At any time instant, the agent A r first decides a prescription Γ n ∈ Γ based on S 1:n ,Z 1:n−1 . The encoder merely uses the prescription to encode the source X n , i.e. Z n = Γ n (X n ). After receiving the message Z n , A r then selects an action D n based on S 1:n ,Z 1:n . The transmitter can mimic the receiver’s decision since S 1:n ,Z 1:n is common knowledge. 
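To make the reformulation above concrete, the following minimal sketch (with made-up alphabets) builds the prescription space and simulates one step of the equivalent system in which the receiver is the only decision maker.

```python
import itertools

X = [0, 1, 2]        # source alphabet (illustrative)
Z = ['a', 'b']       # noiseless channel alphabet (illustrative)
D = [0, 1]           # decision alphabet (illustrative)

# The prescription space Gamma: all |Z|^|X| deterministic mappings from X to Z.
prescriptions = [dict(zip(X, vals)) for vals in itertools.product(Z, repeat=len(X))]
print("number of prescriptions:", len(prescriptions))   # 2^3 = 8

# One step of the reformulated system:
gamma = prescriptions[3]     # pre-transmission action chosen from common information
x_n = 2                      # source symbol seen only by the transmitter
z_n = gamma[x_n]             # the transmitter merely applies the prescription
print("transmitted symbol:", z_n)
# After observing z_n (and the side-information), the receiver selects d_n; the
# transmitter can mimic that choice because gamma and z_n are common knowledge.
```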
The receiver’s problem of finding optimal prescriptions and decisions can be described as a Partially Observed Markov Decision Process (POMDP) problem as follows: 1. Time horizon: The receiver has to take 2N decisions. 2. State: The system state, before and after transmission, at time n is (X n ,S n ). The state evolves as an uncontrolled Markov process as described in Assumption 10.1. 3. Action processes: The pre-transmission action is Γ n and the post-transmission action is D n . 4. Observation processes: The pre-transmission observation at time n is S n , which is uncon- trolled. The post-transmission observation is Z n = Γ n (X n ). 5. Instantaneous costs: The pre-transmission cost is 0 and the post-transmission cost isl n (X n ,S n ,D n ). To characterize optimal strategies, we use the classical method [80] of solving POMDPs. The receiver’s posterior belief is used as the information state and the optimal actions for each stage are computed using dynamic programming. Definition 10.4. The pre-transmission belief on the state (X n ,S n ) is the ordered pair (Π n ,S n ), where Π n is given by Π n =P[X n |S 1:n ,Z 1:n−1 , Γ 1:n−1 ]. (10.10) 177 Similarly, the post-transmission belief is (Θ n ,S n ), where Θ n is given by Θ n =P[X n |S 1:n ,Z 1:n , Γ 1:n ]. (10.11) Also, for notational convenience, let M S n+1 (s|x,s 0 ) :=P[S n+1 =s|X n =x,S n =s 0 ], (10.12) M X n+1 (x,s|x 0 ,s 0 ) =P[X n+1 :=x,S n+1 =s|X n =x 0 ,S n =s 0 ]. (10.13) Lemma 10.6. The estimator’s belief evolves as follows: 1. Pre-transmission belief: (Π n+1 ,S n+1 ) = (Q n+1 (Θ n ,S n ,S n+1 ),S n+1 ), (10.14) where Q n+1 (θ,s,s 0 )(x) = P x 0 ∈X M X n+1 (x,s 0 |x 0 ,s)θ(x 0 ) P x 00 ∈X M S n+1 (s 0 |x 00 ,s)θ(x 00 ) . (10.15) 2. Post-transmission belief: (Θ n ,S n ) = (R n (Π n ,S n , Γ n ,Z n ),S n ), (10.16) where R n (π,s,γ,z)(x) = δ(z−γ(x))π(x) P x 0 ∈X δ(z−γ(x 0 ))π(x 0 ) . (10.17) Proof. This is a direct consequence of Bayes’ rule. The optimal decision strategies for the POMDP can be obtained by solving a dynamic program as shown in the following theorem. 178 Theorem 10.2. The following dynamic programming equations characterize optimal strategies for the receiver’s POMDP. W N+1 (π,s) = 0 (10.18) V n (θ,s) = min d∈D E[l n (X n ,s,d) +W n+1 (Π n+1 ,S n+1 )| Θ n =θ,S n =s] (10.19) W n (π,s) = min γ∈Γ E[V n (Θ n ,S n )| Π n =π,S n =s], (10.20) where Π n+1 =Q n+1 (Θ n ,S n ,S n+1 ) and Θ n =R n (Π n ,S n ,γ,Z n ). Also, P[S n+1 =s 0 | Θ n =θ,S n =s] = X x∈X M S n+1 (s 0 |x,s)θ(x), P[Z n =z| Π n =π,S n =s] = X x∈X δ(z−γ(x))π(x). Proof. This result follows from standard dynamic programming arguments for POMDPs. Notice that the decision d in equation (10.19) affects only the instantaneous cost and thus, the minimization can be further simplified. Definition 10.5. Let ξ(x) be a distribution onX and let s∈S. Define ˆ d n (θ,s) = arg min d∈D X x∈X l n (x,s,d)θ(x). Corollary 10.1. The optimal decision strategy at the receiver is given by g r n (s 1:n ,z 1:n ) = ˆ d n (θ n ,s n ), where θ n is the receiver’s post transmission belief at time n. The dynamic programming equations in Theorem 10.2 can be used to obtain the optimal encod- ing and decision policies in a backward induction manner. However, finding optimal prescriptions in equation (10.20) involves minimization over the function space Γ. This minimization may be hard in general and may need additional assumptions for computational tractability [101]. 
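As an illustration, the sketch below runs this backward induction exactly on a toy instance, recursing over the finitely many beliefs reachable from the initial one. The transition kernel, loss function and horizon are invented for illustration (the loss is taken time-invariant), and the brute-force minimization over prescriptions inside $W$ is precisely the step that becomes intractable as the alphabets grow, as noted above.

```python
import itertools
import numpy as np

X = [0, 1]; S = [0, 1]; Z = [0, 1]; D = [0, 1]; N = 2
rng = np.random.default_rng(1)

# M[x, s, x2, s2] = P(X_{n+1} = x2, S_{n+1} = s2 | X_n = x, S_n = s)
M = rng.dirichlet(np.ones(len(X) * len(S)), size=len(X) * len(S))
M = M.reshape(len(X), len(S), len(X), len(S))
loss = rng.random((len(X), len(S), len(D)))                  # l(x, s, d)

prescriptions = list(itertools.product(Z, repeat=len(X)))    # gamma[x] in Z

def post_belief(pi, gamma, z):
    """R: post-transmission belief given pre-belief pi, prescription gamma, symbol z.
    Only called when P(z | pi, gamma) > 0, so the normalizer is positive."""
    theta = np.array([pi[x] if gamma[x] == z else 0.0 for x in X])
    return theta / theta.sum()

def next_pre_belief(theta, s, s2):
    """Q: pre-transmission belief at n+1 given post-belief theta, S_n = s, S_{n+1} = s2."""
    num = np.array([sum(M[x, s, x2, s2] * theta[x] for x in X) for x2 in X])
    return num / num.sum()

def V(n, theta, s):
    """Post-transmission value: best decision now plus expected future value."""
    stage = min(sum(loss[x, s, d] * theta[x] for x in X) for d in D)
    future = 0.0
    for s2 in S:
        p_s2 = sum(M[x, s, :, s2].sum() * theta[x] for x in X)
        if p_s2 > 0:
            future += p_s2 * W(n + 1, next_pre_belief(theta, s, s2), s2)
    return stage + future

def W(n, pi, s):
    """Pre-transmission value: optimize over prescriptions gamma : X -> Z."""
    if n > N:
        return 0.0
    best = np.inf
    for gamma in prescriptions:
        val = 0.0
        for z in Z:
            p_z = sum(pi[x] for x in X if gamma[x] == z)
            if p_z > 0:
                val += p_z * V(n, post_belief(pi, gamma, z), s)
        best = min(best, val)
    return best

pi_1 = np.array([0.5, 0.5])     # initial belief on X_1 given S_1 (illustrative)
print("optimal expected cost from (pi_1, s = 0):", round(W(1, pi_1, 0), 4))
```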
Figure 10.1: A Bayesian network representing the conditional independence structure of the processes $X_n$ and $S_n$ under Assumption 10.2.

10.6 Multi-stage Coordination over Noisy Channels

In this section, we discuss the problem of multi-stage coordination over a noisy channel. Recall that the encoding strategy at $n$ is $f_n : \mathcal{X}^n \times \mathcal{S}^n \to \mathcal{U}$ and the decision strategies at the transmitter and receiver are $g^t_n : \mathcal{X}^n \times \mathcal{S}^n \to \mathcal{D}$ and $g^r_n : \mathcal{Y}^n \times \mathcal{S}^n \to \mathcal{D}$, respectively. Note that Lemma 10.5 may not hold in this case, i.e. the optimal encoder is not necessarily of the form $f_n(x_n, s_{1:n})$ and could potentially require all the past states $X_{1:n-1}$. To simplify the encoder's structure, we assume that the state $X_{n+1}$ is independent of all the past states $X_{1:n-1}$ conditioned on the side-information $S_{n+1}$. The joint evolution of the state and side-information $(X_n, S_n)$ is governed by Assumption 10.2. Figure 10.1 is a graphical representation of the conditional independence induced by Assumption 10.2.

Assumption 10.2. The source process $X_n$ and the side-information process $S_n$ have the following conditional independence structure:
$$\mathbb{P}[X_{n+1} = x, S_{n+1} = s \mid x_{1:n}, s_{1:n}] = \mathbb{P}[X_{n+1} = x \mid S_{n+1} = s]\,\mathbb{P}[S_{n+1} = s \mid x_n, s_n].$$

We show that, under the above assumption, the problem of multi-stage coordination over a horizon $N$ can be decomposed into $N$ single-stage coordination problems. Moreover, at each stage $n$, the agents use only current information $(X_n, S_n)$ and $(Y_n, S_n)$, respectively, for coordination. We use an approach similar to the one used in [144] and [157]. We first show that the decomposition result holds for two-stage and three-stage coordination problems. We then use these results inductively to show that the decomposition holds for $N$-stage problems as well.

Lemma 10.7. Under Assumption 10.2, we have
$$\mathbb{P}[x_{n:N}, s_{n+1:N} \mid x_{1:n-1}, s_{1:n}] = \mathbb{P}[x_{n:N}, s_{n+1:N} \mid s_n].$$
Proof. This result is a consequence of the conditional independence structure in Assumption 10.2 and can be shown using the chain rule.

Lemma 10.8 (Two-stage Lemma). Consider the problem of coordination with side-information over a time horizon $N = 2$. Let $\psi := (f, g^t, g^r)$ be a strategy that achieves coordination. Then, there exists a strategy $\psi' := (f', g'^t, g'^r)$ such that $f'_1 = f_1$, $g'^t_1 = g^t_1$ and $g'^r_1 = g^r_1$, and
$$U_2 = f'_2(X_2, S_2), \qquad D_2 = g'^t_2(X_2, S_2) \stackrel{a.s.}{=} g'^r_2(Y_2, S_2), \qquad J(\psi') \leq J(\psi).$$
Proof. See Appendix I.0.1.

Lemma 10.9 (Three-stage Lemma). Consider the problem of coordination with side-information over a time horizon $N = 3$. Let $\psi := (f, g^t, g^r)$ be a strategy that achieves coordination such that
$$U_3 = f_3(X_3, S_3), \qquad D_3 = g^t_3(X_3, S_3) \stackrel{a.s.}{=} g^r_3(Y_3, S_3).$$
Then, there exists a strategy $\psi' := (f', g'^t, g'^r)$ such that $f'_1 = f_1$, $g'^t_1 = g^t_1$, $g'^r_1 = g^r_1$, $f'_3 = f_3$, $g'^t_3 = g^t_3$ and $g'^r_3 = g^r_3$, and
$$U_2 = f'_2(X_2, S_2), \qquad D_2 = g'^t_2(X_2, S_2) \stackrel{a.s.}{=} g'^r_2(Y_2, S_2), \qquad J(\psi') \leq J(\psi).$$
Proof. See Appendix I.0.2.

Based on the above lemmas, we can now establish the following structural property for an optimal coordination strategy.

Theorem 10.3. Consider the problem of coordination with side-information over a time horizon $N$. Then there exists a strategy $\psi^*$ such that
$$U_n = f^*_n(X_n, S_n), \qquad D_n = g^{*t}_n(X_n, S_n) \stackrel{a.s.}{=} g^{*r}_n(Y_n, S_n), \qquad J(\psi^*) \leq J(\psi)$$
for every $n \in \{1, 2, \ldots, N\}$ and for every $\psi \in \Psi$.
Proof. See Appendix I.0.3.

The above theorem establishes optimality of memoryless strategies.
This allows us to break down the multi-stage problem into separate single-stage problems as described in the theorem below.

Theorem 10.4. For each $n$ and $s_n \in \mathcal{S}$, let $(f^*_n(x_n, s_n), g^{*t}_n(x_n, s_n), g^{*r}_n(y_n, s_n))$ be the solution to the following single-stage coordination problem:
$$\underset{(f_n,\, g^t_n,\, g^r_n)}{\text{minimize}} \quad \mathbb{E}[l_n(X_n, S_n, D_n) \mid S_n = s_n]$$
$$\text{subject to:} \quad g^t_n(X_n, s_n) \overset{\text{a.s.}}{=} g^r_n(Y_n, s_n).$$
Let $\psi^* := [(f^*_1, g^{*t}_1, g^{*r}_1), \ldots, (f^*_N, g^{*t}_N, g^{*r}_N)]$. Then $J(\psi^*) \leq J(\psi)$ for all $\psi \in \Psi$, and $\psi^* \in \Psi$.
Proof. See Appendix I.0.4.

10.6.1 Sub-optimal strategies without Assumption 10.2

If Assumption 10.2 does not hold, we can obtain sub-optimal strategies as follows. Let $C^1_0$ be the single-shot zero-error capacity of the noisy channel and let $(\phi, \phi')$ be encoding and decoding strategies that achieve this capacity. Let $(f_n, g^t_n, g^r_n)$ be an optimal coordination strategy at time $n$ for a multi-stage problem with a noiseless channel of alphabet size $C^1_0$. Then, it can be easily established that the strategy $(\phi \circ f_n, g^t_n, g^r_n \circ \phi')$ is a valid strategy for the multi-stage noisy channel problem, and that it achieves the same performance as using $(f_n, g^t_n, g^r_n)$ over the noiseless channel. According to Theorem 10.1, this approach is optimal for single-stage problems. However, it may not be optimal for multi-stage problems, and this sub-optimality is related to the super-multiplicative nature of the $n$-stage zero-error capacity of noisy channels.

10.7 Conclusion

We considered the problem of coordinating decisions between a pair of agents who may have different information about the state of their environment. We considered single-stage and multi-stage formulations with both noiseless and noisy channels. Using concepts from zero-error communication and real-time communication problems, we obtained characterizations of optimal coordination strategies. The environment in our setup was uncontrolled. The case of controlled environments as well as coordination with non-identical decisions by the agents remain to be explored.

References

[1] Shipra Agrawal and Navin Goyal. Further optimal regret bounds for Thompson sampling. In Artificial Intelligence and Statistics, pages 99–107. PMLR, 2013.
[2] Tansu Alpcan and Tamer Başar. Network security: A decision and game-theoretic approach. Cambridge University Press, 2010.
[3] Saurabh Amin, Xavier Litrico, Shankar Sastry, and Alexandre M Bayen. Cyber security of water SCADA systems—part I: Analysis and experimentation of stealthy deception attacks. IEEE Transactions on Control Systems Technology, 21(5):1963–1970, 2012.
[4] Saurabh Amin, Galina A Schwartz, Alvaro A Cardenas, and S Shankar Sastry. Game-theoretic models of electricity theft detection in smart utility networks: Providing new capabilities with advanced metering infrastructure. IEEE Control Systems Magazine, 35(1):66–81, 2015.
[5] Venkat Anantharam and Vivek Borkar. Common randomness and distributed control: A counterexample. Systems & Control Letters, 56(7-8):568–572, 2007.
[6] George K Atia and Venkatesh Saligrama. Boolean compressed sensing and noisy group testing. IEEE Transactions on Information Theory, 58(3):1880–1901, 2012.
[7] Peter Auer, Nicolo Cesa-Bianchi, Yoav Freund, and Robert E Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002.
[8] Robert J Aumann, Michael Maschler, and Richard E Stearns. Repeated games with incomplete information. MIT Press, 1995.
[9] T. Başar.
On the saddle-point solution of a class of stochastic differential games. Journal of Optimization Theory and Applications, 33(4):539–556, 1981. [10] Tamer Basar and Geert Jan Olsder. Dynamic noncooperative game theory, volume 23. Siam, 1999. [11] M J´ esus Bayarri and James O Berger. The interplay of Bayesian and frequentist analysis. Statistical Science, pages 58–80, 2004. [12] Dimitri P Bertsekas. Dynamic programming and optimal control, volume 1. Athena scientific Belmont, MA, 2005. [13] Dimitri P Bertsekas and John N Tsitsiklis. Neuro-dynamic programming, volume 5. Athena Scientific Belmont, MA, 1996. 184 [14] Stuart Alan Bessler. Theory and applications of the sequential design of experiments, k- actions and infinitely many experiments. Department of Statistics, Stanford University., 1960. [15] R Blahut. Information bounds of the Fano-Kullback type. IEEE Transactions on Information Theory, 22(4):410–421, 1976. [16] Richard Blahut. Hypothesis testing and information theory. IEEE Transactions on Informa- tion Theory, 20(4):405–417, 1974. [17] Elizabeth Bondi, Hoon Oh, Haifeng Xu, Fei Fang, Bistra Dilkina, and Milind Tambe. Using game theory in real time in the real world: A conservation case study. In Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems, pages 2336–2338. International Foundation for Autonomous Agents and Multiagent Systems, 2019. [18] Lawrence D Brown, T Tony Cai, and Anirban DasGupta. Interval estimation for a binomial proportion. Statistical science, pages 101–117, 2001. [19] Noam Brown, Anton Bakhtin, Adam Lerer, and Qucheng Gong. Combining deep reinforce- ment learning and search for imperfect-information games. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 17057–17069. Curran Associates, Inc., 2020. [20] Valeri˘ ı Vladimirovich Buldygin and Yu V Kozachenko. Metric characterization of random variables and random processes, volume 188. American Mathematical Soc., 2000. [21] Marat Valievich Burnashev. Data transmission over a discrete channel with feedback. random transmission time. Problemy peredachi informatsii, 12(4):10–30, 1976. [22] E. J. Candes and T. Tao. The power of convex relaxation: Near-optimal matrix completion. IEEE Transactions on Information Theory, 56(5):2053–2080, May 2010. [23] Alexandra Carpentier, Alessandro Lazaric, Mohammad Ghavamzadeh, R´ emi Munos, and Peter Auer. Upper-confidence-bound algorithms for active learning in multi-armed bandits. In International Conference on Algorithmic Learning Theory, pages 189–203. Springer, 2011. [24] Chun Lam Chan, Sidharth Jaggi, Venkatesh Saligrama, and Samar Agnihotri. Non-adaptive group testing: Explicit bounds and novel algorithms. IEEE Transactions on Information Theory, 60(5):3019–3035, 2014. [25] A. Chattopadhyay and U. Mitra. Optimal active sensing for process tracking. In 2018 IEEE International Symposium on Information Theory (ISIT), pages 551–555, June 2018. [26] Da Chen, Qiwei Huang, Hui Feng, Qing Zhao, and Bo Hu. Active anomaly detection with switching cost. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5346–5350. IEEE, 2019. [27] Herman Chernoff. Sequential design of experiments. The Annals of Mathematical Statistics, 30(3):755–770, 1959. [28] Sung-En Chiu and Tara Javidi. Low complexity sequential search with size-dependent mea- surement noise. IEEE Transactions on Information Theory, 2021. 
185 [29] Kyunghyun Cho, Bart Van Merri¨ enboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014. [30] S. Choudhary, N. Kumar, S. Narayanan, and U. Mitra. Active target detection with mobile agents. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4185–4189, May 2014. [31] S. Choudhary and U. Mitra. Analysis of target detection via matrix completion. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 3771–3775, April 2015. [32] Sunav Choudhary, Naveen Kumar, Srikanth Narayanan, and Urbashi Mitra. Active target lo- calization using low-rank matrix completion and unimodal regression. CoRR, abs/1601.07254, 2016. [33] Thomas H. Cormen, Clifford Stein, Ronald L. Rivest, and Charles E. Leiserson. Introduction to Algorithms. McGraw-Hill Higher Education, 2nd edition, 2001. [34] Thomas M Cover and Joy A Thomas. Elements of information theory. John Wiley & Sons, 2012. [35] Paul Warner Cuff, Haim H Permuter, and Thomas M Cover. Coordination capacity. IEEE Transactions on Information Theory, 56(9):4181–4206, 2010. [36] Peter Damaschke. Threshold group testing. In General theory of information transfer and combinatorics, pages 707–718. Springer, 2006. [37] Jilles Steeve Dibangoye, Christopher Amato, Olivier Buffet, and Fran¸ cois Charpillet. Op- timally solving dec-pomdps as continuous-state mdps. Journal of Artificial Intelligence Re- search, 55:443–497, 2016. [38] Yonathan Efroni, Shie Mannor, and Matteo Pirotta. Exploration-exploitation in constrained MDPs. arXiv preprint arXiv:2003.02189, 2020. [39] Maxim Egorov. Deep reinforcement learning with pomdps, 2015. [40] Fei Fang, Thanh Hong Nguyen, Rob Pickles, Wai Y Lam, Gopalasamy R Clements, Bo An, Amandeep Singh, Brian C Schwedock, Milind Tambe, and Andrew Lemieux. Paws-a deployed game-theoretic application to combat poaching. AI Magazine, 38(1):23–36, 2017. [41] Fei Fang, Peter Stone, and Milind Tambe. When security games go green: Designing de- fender strategies to prevent poaching and illegal fishing. In Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015. [42] Fei Fang, Milind Tambe, Bistra Dilkina, and Andrew J Plumptre. Artificial intelligence and conservation. Cambridge University Press, 2019. [43] Gabriele Farina, Andrea Celli, Nicola Gatti, and Tuomas Sandholm. Ex ante coordination and collusion in zero-sum multi-player extensive-form games. In Conference on Neural In- formation Processing Systems (NIPS), 2018. [44] Nariman Farsad and Andrea Goldsmith. Detection algorithms for communication systems using deep learning. arXiv preprint arXiv:1705.08044, 2017. 186 [45] Nariman Farsad, Milind Rao, and Andrea Goldsmith. Deep learning for joint source-channel coding of text. arXiv preprint arXiv:1802.06832, 2018. [46] Jerzy Filar and Koos Vrieze. Competitive Markov decision processes. Springer Science & Business Media, 2012. [47] Jakob Foerster, Francis Song, Edward Hughes, Neil Burch, Iain Dunning, Shimon Whiteson, Matthew Botvinick, and Michael Bowling. Bayesian action decoder for deep multi-agent reinforcement learning. In International Conference on Machine Learning, pages 1942–1951. PMLR, 2019. [48] David A Freedman et al. On tail probabilities for martingales. the Annals of Probability, 3(1):100–118, 1975. [49] Drew Fudenberg and Jean Tirole. Game Theory. 
MIT Press, Cambridge, MA, 1991. [50] Fabien Gensbittel, Miquel Oliu-Barton, and Xavier Venel. Existence of the uniform value in zero-sum repeated games with a more informed controller. Journal of Dynamics and Games, 1(3):411–445, 2014. [51] Fabien Gensbittel and J´ erˆ ome Renault. The value of Markov chain games with incomplete information on both sides. Mathematics of Operations Research, 40(4):820–841, 2015. [52] John A Gubner. Probability and random processes for electrical and computer engineers. Cambridge University Press, 2006. [53] A Hitchhiker’s Guide. Infinite dimensional analysis. Springer, 2006. [54] A. Gurevich, K. Cohen, and Q. Zhao. Sequential anomaly detection under a nonlinear system cost. IEEE Transactions on Signal Processing, 67(14):3689–3703, July 2019. [55] Martin T Hagan and Mohammad B Menhaj. Training feedforward networks with the mar- quardt algorithm. IEEE transactions on Neural Networks, 5(6):989–993, 1994. [56] Eric A Hansen. Solving pomdps by searching in policy space. In Proceedings of the Fourteenth conference on Uncertainty in artificial intelligence, pages 211–219, 1998. [57] Matthew Hausknecht and Peter Stone. Deep recurrent q-learning for partially observable mdps. CoRR, abs/1507.06527, 7(1), 2015. [58] OG Haywood Jr. Military decision and game theory. Journal of the Operations Research Society of America, 2(4):365–385, 1954. [59] On´ esimo Hern´ andez-Lerma and Jean B Lasserre. Discrete-time Markov control processes: basic optimality criteria, volume 30. Springer Science & Business Media, 2012. [60] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochre- iter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. arXiv preprint arXiv:1706.08500, 2017. [61] Sepp Hochreiter and J¨ urgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997. 187 [62] Karel Hor´ ak and Branislav Boˇ sansk` y. Solving partially observable stochastic games with pub- lic observations. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 2029–2036, 2019. [63] Karel Hor´ ak, Branislav Boˇ sansk` y, and Michal Pˇ echouˇ cek. Heuristic search value iteration for one-sided partially observable stochastic games. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 31, 2017. [64] Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are universal approximators. Neural networks, 2(5):359–366, 1989. [65] David W Hosmer Jr, Stanley Lemeshow, and Rodney X Sturdivant. Applied logistic regres- sion, volume 398. John Wiley & Sons, 2013. [66] Boshuang Huang, Kobi Cohen, and Qing Zhao. Active anomaly detection in heterogeneous processes. IEEE Transactions on Information Theory, 65(4):2284–2301, 2018. [67] Tara Javidi and Andrea Goldsmith. Dynamic joint source-channel coding with feedback. In Information Theory Proceedings (ISIT), 2013 IEEE International Symposium on, pages 16–20. IEEE, 2013. [68] Peter Karkus, David Hsu, and Wee Sun Lee. Qmdp-net: Deep learning for planning under partial observability. In Advances in Neural Information Processing Systems, pages 4694– 4704, 2017. [69] D. Kartik, A. Nayyar, and U. Mitra. Active hypothesis testing: Beyond chernoff-stein. In 2019 IEEE International Symposium on Information Theory (ISIT), pages 897–901, July 2019. [70] Dhruva Kartik and Ashutosh Nayyar. Stochastic zero-sum games with asymmetric informa- tion. In 58th IEEE Conference on Decision and Control. IEEE, 2019. 
[71] Dhruva Kartik and Ashutosh Nayyar. Zero-sum stochastic games with asymmetric informa- tion. arXiv preprint arXiv:1909.01445, 2019. [72] Dhruva Kartik and Ashutosh Nayyar. Upper and lower values in zero-sum stochastic games with asymmetric information. Dynamic Games and Applications, 11(2):363–388, 2021. [73] Dhruva Kartik, Ashutosh Nayyar, and Urbashi Mitra. Fixed-horizon active hypothesis test- ing. arXiv preprint arXiv:1911.06912, 2019. [74] Dhruva Kartik, Neeraj Sood, Urbashi Mitra, and Tara Javidi. Adaptive sampling for estimating distributions: A Bayesian upper confidence bound approach. arXiv preprint arXiv:2012.04137, 2020. [75] Emilie Kaufmann, Olivier Capp´ e, and Aur´ elien Garivier. On Bayesian upper confidence bounds for bandit problems. In Artificial intelligence and statistics, pages 592–600. PMLR, 2012. [76] Robert Keener et al. Second order efficiency in the sequential design of experiments. The Annals of Statistics, 12(2):510–532, 1984. [77] Janos Korner and Alon Orlitsky. Zero-error information theory. IEEE Transactions on Information Theory, 44(6):2207–2229, 1998. 188 [78] Samuel Kotz, Narayanaswamy Balakrishnan, and Norman L Johnson. Continuous multivari- ate distributions, Volume 1: Models and applications. John Wiley & Sons, 2004. [79] Panganamala Ramana Kumar and Pravin Varaiya. Stochastic systems: Estimation, identifi- cation, and adaptive control, volume 75. SIAM, 2015. [80] Panqanamala Ramana Kumar and Pravin Varaiya. Stochastic systems: Estimation, identifi- cation, and adaptive control. SIAM, 2015. [81] SP Lalley and G Lorden. A control problem arising in the sequential design of experiments. The Annals of Probability, 14(1):136–172, 1986. [82] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. nature, 521(7553):436, 2015. [83] Lichun Li and Jeff Shamma. LP formulation of asymmetric zero-sum stochastic games. In 53rd IEEE Conference on Decision and Control, pages 1930–1935. IEEE, 2014. [84] Xiaoxi Li and Xavier Venel. Recursive games: uniform value, tauberian theorem and the mertens conjecture. International Journal of Game Theory, 45(1-2):155–189, 2016. [85] Tianyi Lin, Chi Jin, and Michael Jordan. On gradient descent ascent for nonconvex-concave minimax problems. In International Conference on Machine Learning, pages 6083–6093. PMLR, 2020. [86] Jingbo Liu, Ramon van Handel, and Sergio Verd´ u. Second-order converses via reverse hyper- contractivity. arXiv preprint arXiv:1812.10129, 2018. [87] John Loch and Satinder P Singh. Using eligibility traces to find the best memoryless policy in partially observable markov decision processes. In Proceedings of the Fifteenth International Conference on Machine Learning, pages 323–331, 1998. [88] Olivier Marchal, Julyan Arbel, et al. On the sub-Gaussianity of the Beta and Dirichlet distributions. Electronic Communications in Probability, 22, 2017. [89] Michael Maschler, Eilon Solan, and Shmuel Zamir. Game Theory. Cambridge University Press, 2013. [90] Pascal Massart. Concentration inequalities and model selection. 2007. [91] Jean-Fran¸ cois Mertens, Sylvain Sorin, and Shmuel Zamir. Repeated games, volume 55. Cam- bridge University Press, 2015. [92] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015. [93] James D Morrow. Game theory for political scientists. Princeton University Press, 1994. [94] M. 
Naghshvar, T. Javidi, and M. Wigger. Extrinsic jensen–shannon divergence: Applications to variable-length coding. IEEE Transactions on Information Theory, 61(4):2148–2164, April 2015. 189 [95] Mohammad Naghshvar. Active learning and hypothesis testing. PhD thesis, UC San Diego, 2013. [96] Mohammad Naghshvar and Tara Javidi. Extrinsic jensen-shannon divergence with appli- cation in active hypothesis testing. In Information Theory Proceedings (ISIT), 2012 IEEE International Symposium on, pages 2191–2195. IEEE, 2012. [97] Mohammad Naghshvar and Tara Javidi. Active sequential hypothesis testing. Ann. Statist., 41(6):2703–2738, 12 2013. [98] Mohammad Naghshvar, Tara Javidi, and Kamalika Chaudhuri. Bayesian active learning with non-persistent noise. IEEE Transactions on Information Theory, 61(7):4080–4098, 2015. [99] Mohammad Naghshvar, Tara Javidi, et al. Active sequential hypothesis testing. The Annals of Statistics, 41(6):2703–2738, 2013. [100] Girish N Nair. A nonstochastic information theory for communication and state estimation. IEEE Transactions on automatic control, 58(6):1497–1510, 2013. [101] Ashutosh Nayyar, Tamer Ba¸ sar, Demosthenis Teneketzis, and Venugopal V Veeravalli. Op- timal strategies for communication and remote estimation with an energy harvesting sensor. IEEE Transactions on Automatic Control, 58(9):2246–2260, 2013. [102] Ashutosh Nayyar and Abhishek Gupta. Information structures and values in zero-sum stochastic games. In American Control Conference (ACC), 2017, pages 3658–3663. IEEE, 2017. [103] Ashutosh Nayyar, Abhishek Gupta, Cedric Langbort, and Tamer Ba¸ sar. Common infor- mation based Markov perfect equilibria for stochastic games with asymmetric information: Finite games. IEEE Transactions on Automatic Control, 59(3):555–570, 2014. [104] Ashutosh Nayyar, Aditya Mahajan, and Demosthenis Teneketzis. Optimal control strate- gies in delayed sharing information structures. IEEE Transactions on Automatic Control, 56(7):1606–1620, 2010. [105] Ashutosh Nayyar, Aditya Mahajan, and Demosthenis Teneketzis. Decentralized stochastic control with partial history sharing: A common information approach. IEEE Transactions on Automatic Control, 58(7):1644–1658, 2013. [106] S. Nitinawarat, G. K. Atia, and V. V. Veeravalli. Controlled sensing for multihypothesis testing. IEEE Transactions on Automatic Control, 58(10):2451–2464, Oct 2013. [107] Sirin Nitinawarat, George K Atia, and Venugopal V Veeravalli. Controlled sensing for multi- hypothesis testing. IEEE Transactions on Automatic Control, 58(10):2451–2464, 2013. [108] R. D. Nowak. The geometry of generalized binary search. IEEE Transactions on Information Theory, 57(12):7893–7906, Dec 2011. [109] Frans A Oliehoek and Christopher Amato. A concise introduction to decentralized POMDPs. Springer, 2016. [110] Martin J Osborne and Ariel Rubinstein. A course in game theory. MIT press, 1994. 190 [111] Yi Ouyang, Hamidreza Tavafoghi, and Demosthenis Teneketzis. Dynamic games with asym- metric information: Common information based perfect bayesian equilibria and sequential decomposition. IEEE Transactions on Automatic Control, 62(1):222–237, 2017. [112] Christos H Papadimitriou and John N Tsitsiklis. The complexity of markov decision processes. Mathematics of operations research, 12(3):441–450, 1987. [113] Joelle Pineau, Geoff Gordon, Sebastian Thrun, et al. Point-based value iteration: An anytime algorithm for pomdps. In IJCAI, volume 3, pages 1025–1032, 2003. [114] Y Polyanskiy and S Verdu. Binary hypothesis testing with feedback. 
In Information Theory and Applications Workshop (ITA), 2011. [115] Yury Polyanskiy. Channel coding: non-asymptotic fundamental limits. Princeton University, 2010. [116] Yury Polyanskiy, H Vincent Poor, and Sergio Verd´ u. Channel coding rate in the finite blocklength regime. IEEE Transactions on Information Theory, 56(5):2307, 2010. [117] Yury Polyanskiy and Yihong Wu. Lecture notes on information theory. Lecture Notes for ECE563 (UIUC) and, 6(2012-2016):7, 2014. available at http://people.lids.mit.edu/yp/ homepage/data/itlectures_v5.pdf. [118] J-P Ponssard and Sylvain Sorin. The lp formulation of finite zero-sum games with incomplete information. International Journal of Game Theory, 9(2):99–105, 1980. [119] Maxim Raginsky and Alexander Rakhlin. Information-based complexity, feedback and dy- namics in convex programming. IEEE Transactions on Information Theory, 57(10):7036– 7056, 2011. [120] Tabish Rashid, Mikayel Samvelyan, Christian Schroeder, Gregory Farquhar, Jakob Foerster, and Shimon Whiteson. Qmix: Monotonic value function factorisation for deep multi-agent reinforcement learning. In International Conference on Machine Learning, pages 4295–4304. PMLR, 2018. [121] J´ erˆ ome Renault. The value of Markov chain games with lack of information on one side. Mathematics of Operations Research, 31(3):490–512, 2006. [122] J´ erˆ ome Renault. The value of repeated games with an informed controller. Mathematics of operations Research, 37(1):154–179, 2012. [123] Aviv Rosenberg and Yishay Mansour. Online convex optimization in adversarial Markov decision processes. In International Conference on Machine Learning, pages 5478–5486, 2019. [124] Dinah Rosenberg. Duality and markovian strategies. International Journal of Game Theory, 27(4), 1998. [125] Dinah Rosenberg, Eilon Solan, and Nicolas Vieille. Stochastic games with a single controller and incomplete information. SIAM journal on control and optimization, 43(1):86–110, 2004. [126] Sheldon M Ross. Introduction to probability models. Academic press, 2014. [127] Walter Rudin et al. Principles of mathematical analysis, volume 3. McGraw-hill New York, 1964. 191 [128] Ekraam Sabir, Stephen Rawls, and Prem Natarajan. Implicit language model in lstm for ocr. In Document Analysis and Recognition (ICDAR), 2017 14th IAPR International Conference on, volume 7, pages 27–31. IEEE, 2017. [129] Nils R Sandell. Control of finite-state, finite memory stochastic systems. 1974. [130] Jonathan Scarlett. Noisy adaptive group testing: Bounds and algorithms. IEEE Transactions on Information Theory, 65(6):3646–3661, 2018. [131] Claude Elwood Shannon. A mathematical theory of communication. ACM SIGMOBILE mobile computing and communications review, 5(1):3–55, 2001. [132] Shubhanshu Shekhar, Mohammad Ghavamzadeh, and Tara Javidi. Adaptive sampling for estimating probability distributions. In International Conference on Machine Learning, 2020. [133] D. Shelar and S. Amin. Security assessment of electricity distribution networks under der node compromises. IEEE Transactions on Control of Network Systems, 4(1):23–36, March 2017. [134] Trey Smith and Reid Simmons. Heuristic search value iteration for pomdps. In Proceedings of the 20th conference on Uncertainty in artificial intelligence, pages 520–527, 2004. [135] Neeraj Sood, Paul Simon, Peggy Ebner, Daniel Eichner, Jeffrey Reynolds, Eran Bendavid, and Jay Bhattacharya. Seroprevalence of SARS-CoV-2–specific antibodies among adults in Los Angeles County, California, on april 10-11, 2020. Jama, 2020. [136] Quentin F. Stout. 
Optimal algorithms for unimodal regression. Computing Science and Statistics, 32:2000, 2000. [137] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pages 3104–3112, 2014. [138] Richard S Sutton, Andrew G Barto, Francis Bach, et al. Reinforcement learning: An intro- duction. MIT press, 1998. [139] Vincent YF Tan and George K Atia. Strong impossibility results for noisy group testing. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 8257–8261. IEEE, 2014. [140] Dengwang Tang, Hamidreza Tavafoghi, Vijay Subramanian, Ashutosh Nayyar, and Demos- thenis Teneketzis. Dynamic games among teams with delayed intra-team information sharing. arXiv preprint arXiv:2102.11920, 2021. [141] Jean Tarbouriech and Alessandro Lazaric. Active exploration in Markov decision processes. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 974–982, 2019. [142] Hamidreza Tavafoghi, Yi Ouyang, and Demosthenis Teneketzis. A sufficient information ap- proach to decentralized decision making. In 2018 IEEE Conference on Decision and Control (CDC), pages 5069–5076. IEEE, 2018. [143] D. Teneketzis. On the structure of optimal real-time encoders and decoders in noisy commu- nication. IEEE Transactions on Information Theory, 52(9):4017–4035, 2006. 192 [144] Demosthenis Teneketzis. On the structure of optimal real-time encoders and decoders in noisy communication. IEEE Transactions on Information Theory, 52(9):4017–4035, 2006. [145] A. Tsopelakos, G. Fellouris, and V. V. Veeravalli. Sequential anomaly detection with obser- vation control. In 2019 IEEE International Symposium on Information Theory (ISIT), pages 2389–2393, July 2019. [146] Ertem Tuncel. On error exponents in hypothesis testing. IEEE Transactions on Information Theory, 51(8):2945–2950, 2005. [147] N. K. Vaidhiyan, S. P. Arun, and R. Sundaresan. Neural dissimilarity indices that predict oddball detection in behaviour. IEEE Transactions on Information Theory, 63(8):4778–4796, Aug 2017. [148] N. K. Vaidhiyan and R. Sundaresan. Learning to detect an oddball target. IEEE Transactions on Information Theory, 64(2):831–852, Feb 2018. [149] Deepanshu Vasal, Abhinav Sinha, and Achilleas Anastasopoulos. A systematic process for evaluating structured perfect bayesian equilibria in dynamic games with asymmetric infor- mation. IEEE Transactions on Automatic Control, 64(1):78–93, 2019. [150] Bernhard von Stengel and Daphne Koller. Team-maxmin equilibria. Games and Economic Behavior, 21(1-2):309–321, 1997. [151] Abraham Wald. Sequential analysis. Courier Corporation, 1973. [152] Jean Walrand and Pravin Varaiya. Optimal causal coding-decoding problems. IEEE Trans- actions on Information Theory, 29(6):814–820, 1983. [153] C. Wang, K. Cohen, and Q. Zhao. Active hypothesis testing on a tree: Anomaly detection under hierarchical observations. In 2017 IEEE International Symposium on Information Theory (ISIT), pages 993–997, June 2017. [154] Chao Wang, Kobi Cohen, and Qing Zhao. Information-directed random walk for rare event detection in hierarchical processes. IEEE Transactions on Information Theory, 67(2):1099– 1116, 2020. [155] Alan Washburn and Kevin Wood. Two-person zero-sum games for network interdiction. Operations research, 43(2):243–251, 1995. [156] Hans S Witsenhausen. Equivalent stochastic control problems. Mathematics of Control, Signals and Systems, 1(1):3–11, 1988. [157] HS Witsenhausen. 
On the structure of real-time source coders. The Bell System Technical Journal, 58(6):1437–1451, 1979. [158] Manxi Wu and Saurabh Amin. Securing infrastructure facilities: When does proactive defense help? Dynamic Games and Applications, pages 1–42, 2018. [159] Yuxuan Xie, Jilles Dibangoye, and Olivier Buffet. Optimally solving two-agent decentral- ized pomdps under one-sided information sharing. In International Conference on Machine Learning, pages 10473–10482. PMLR, 2020. 193 [160] Songbai Yan, Kamalika Chaudhuri, and Tara Javidi. Active learning from noisy and absten- tion feedback. In 2015 53rd Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 1352–1357. IEEE, 2015. [161] Youzhi Zhang and Bo An. Computing team-maxmin equilibria in zero-sum multiplayer extensive-form games. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 2318–2325, 2020. [162] Jiefu Zheng and David A Casta˜ n´ on. Decomposition techniques for Markov zero-sum games with nested information. In 52nd IEEE Conference on Decision and Control, pages 574–581. IEEE, 2013. [163] Quanyan Zhu and Tamer Basar. Game-theoretic methods for robustness, security, and re- silience of cyberphysical control systems: games-in-games principle for optimal cross-layer resilient control systems. IEEE Control Systems Magazine, 35(1):46–65, 2015. 194 Appendices 195 Appendix A Fixed-horizon Active Hypothesis Testing A.1 Auxiliary Results The proposition and its corollary stated below are consequences of simple algebraic manipulations. Proposition A.1. Let i∈X be a hypothesis. For any j6= i and at each time n, let ˜ ρ n (j) = ρ n (j)/(1−ρ n (i)). Then for any experiment selection strategy g, we have ˜ ρ n (j) = e (log ˜ ρ 1 (j)−Z n−1 (j) ) P k6=i e (log ˜ ρ 1 (k)−Z n−1 (k) ) , (A.1) with probability 1. Here, Z n (j) is the total log-likelihood ratio defined in Definition 3.8. Corollary A.1. Under the setting of Proposition A.1, we have for each 0≤s≤ 1 and j6=i, (ρ n (j)) s P k6=i (ρ n (k)) s = e (s log ˜ ρ 1 (j)−sZ n−1 (j) ) P k6=i e (s log ˜ ρ 1 (k)−sZ n−1 (k) ) . (A.2) Lemma A.1. Let i,j∈X . If 0≤s≤ 1, then under any strategy g and for every n, we have E g i [exp (s log ˜ ρ 1 (j)−sZ n (j))]≤ 1. (A.3) Proof. We have E g i [exp (s log ˜ ρ 1 (j)−sZ n (j))] =E g i [exp (s log ˜ ρ 1 (j)−sZ n−1 (j))]E g i [μ i j (U n ,s)|I n ] a ≤E g i [exp (s log ˜ ρ 1 (j)−sZ n−1 (j))] 196 b ≤E g i [exp (s log ˜ ρ 1 (j))]≤ 1. Inequality (a) is because for any experiment u, μ i j (u,s) is convex and μ i j (u, 0) = μ i j (u, 1) = 1. Inequality (b) is obtained by inductively applying the same arguments. A.2 Proof of Lemma 3.1 Define a random variable X † as follows: X † . = 1 if ˆ X N =i and X † . = 0 otherwise. Thus, we have ψ N (i) =P f,g [X † = 1|X =i] (A.4) φ N (i) =P f,g [X † = 1|X6=i]. (A.5) In this proof, let us denote ψ N (i) with ψ and φ N (i) with φ for convenience. Notice that under strategies g and f, the variables X,I N+1 and X † form a Markov chain. That is, P f,g [X † = 1|X,I N+1 ] =P f,g [X † = 1|I N+1 ]. (A.6) Therefore, using the data-processing property of relative entropy [117, 86], we can conclude that D(P g N,i ||Q g N,i )≥ψ log ψ φ + (1−ψ) log 1−ψ 1−φ a ≥−ψ logφ +ψ logψ + (1−ψ) log (1−ψ) b ≥−ψ logφ− log 2 c ≥−(1− N ) logφ− log 2. Inequality (a) follows from the fact that 1−φ≤ 1. Inequality (b) holds because−ψ logψ− (1− ψ) log (1−ψ) is a binary entropy and can at most be log 2. Inequality (c) follows from our assertion that ψ N (i)≥ 1− N . 
Therefore, we have − 1 N logφ N (i)≤ J g N (i) 1− N + log 2 N(1− N ) . (A.7) 197 A.3 Proof of Lemma 3.5 We have log ρ n+1 (i) 1−ρ n+1 (i) − log ρ 1 (i) 1−ρ 1 (i) (A.8) a = log ρ 1 (i) Q n m=1 p Um i (Y m ) P j6=i ρ 1 (j) Q n m=1 p Um j (Y m ) − log ρ 1 (i) 1−ρ 1 (i) (A.9) = log Q n m=1 p Um i (Y m ) P j6=i ˜ ρ 1 (j) Q n m=1 p Um j (Y m ) (A.10) =− log X j6=i exp log ˜ ρ 1 (j) + n X m=1 λ j i (U m ,Y m ) =− log X j6=i exp log ˜ ρ 1 (j)−Z n (j) . (A.11) Equality (a) follows from the fact that the observation Y m is independent of the past I m = {U 1:m−1 ,Y 1:m−1 } conditioned on the hypothesis X and the current experiment U m . A.4 Proof of Lemma 3.6 Using the definition of cross-entropy, we have −H(β i∗ , ˜ ρ n+1 ) = X j6=i β i∗ (j) log ˜ ρ n+1 (j) a = X j6=i β i∗ (j) log ˜ ρ 1 (j)e −Zn(j) P k6=i ˜ ρ 1 (k)e −Zn(k) ! =−H(β i∗ , ˜ ρ 1 )− ¯ Z n − log X k6=i exp (log ˜ ρ 1 (k)−Z n (k)) b =−H(β i∗ , ˜ ρ 1 )− ¯ Z n +C i (ρ n+1 )−C i (ρ 1 ). Equality (a) follows from Proposition A.1 in Appendix A.1 and equality (b) is a consequence of Lemma 3.5. 198 A.5 Proof of Lemma 2.4 Using the definition of expected confidence rate, we have J g N (i) = 1 N E g i [C i (ρ N+1 )−C i (ρ 1 )] a ≤ H(β i∗ , ˜ ρ 1 ) N + 1 N E g i X j6=i β i∗ (j) N X n=1 λ i j (U n ,Y n ) b = H(β i∗ , ˜ ρ 1 ) N + 1 N E g i X j6=i β i∗ (j) N X n=1 D(p Un i ||p Un j ) = H(β i∗ , ˜ ρ 1 ) N + 1 N E g N X n=1 X j6=i β i∗ (j)D(p Un i ||p Un j ) c ≤ H(β i∗ , ˜ ρ 1 ) N + 1 N E g i " N X n=1 D ∗ (i) # = H(β i∗ , ˜ ρ 1 ) N +D ∗ (i). (A.12) Equality (a) is a consequence of Lemma 3.6. Equality (b) follows from the fact that E g i N X n=1 λ i j (U n ,Y n ) =E g i N X n=1 E i [λ i j (U n ,Y n )|U n ] (A.13) =E g i N X n=1 D(p Un i ||p Un j ). (A.14) Inequality (c) follows from the definition of the min-max distribution β i∗ . Combining inequalities (A.12) and (3.17) from Lemma 3.1 gives us (2.38). A.6 Proof of Lemma B.3 LetA N+1 be the region in which the inference policy f described in Lemma B.3 selects hypothesis i, that is A N+1 :={I :f(I :i) = 1,P f,g [I N+1 =I]6= 0}. We have P f,g [ ˆ X N+1 =i,X6=i] =P g [I N+1 ∈A N+1 ,X6=i] 199 = X I∈A N+1 P g [I N+1 =I,X =i]e h − log P g [I N+1 =I,X=i] P g [I N+1 =I,X6=i] i a = X I∈A N+1 P g [I N+1 =I,X =i] exp [−C i (ρ)] (A.15) b ≤ X I∈A N+1 P g [I N+1 =I,X =i] exp [−(θ +C i (ρ 1 ))] (A.16) c ≤ρ 1 (i)e −(θ+C i (ρ 1 )) = (1−ρ 1 (i))e −θ . (A.17) In equality (a), ρ is the posterior belief on X given information I. Equality (a) follows from the definition of confidence level. Inequality (b) follows from the fact thatC i (ρ)≥θ +C i (ρ 1 ) for every I∈A N+1 . And inequality (c) is simply becauseP g [I N+1 =I,X =i]≤P[X =i]. Therefore, φ N (i) =P f,g [ ˆ X N+1 =i|X6=i]≤e −θ . (A.18) A.7 Proof of Theorem 2.2 Let us fix the horizon N. In this proof, we will drop the superscript from g N and simply refer to it as g for convenience. Since the inference strategy has a threshold structure (see (2.19)), proving that ψ N (i)≥ 1− N is equivalent to proving that the probability P g i [C i (ρ N+1 )−C i (ρ 1 )<θ N ]≤ N . To this end, we will begin with obtaining upper bounds on the moment-generating function (MGF) of the confidence increment and then obtain a Chernoff bound based on these upper bounds. A.7.0.1 Confidence Level and Log-likelihood Ratios Let 0 0 such that for every i∈X , we haveP ¯ g i [T >n]≤Kn −b . Let N 0 . = & 2K N 1/b ' . (A.47) 204 This ensures that P ¯ g i [T > N 0 ]≤ N /2. Fix a hypothesis i. Define the following event for each n≥N 0 Z n ={ ¯ i k =X,N 0 ≤k≤n}. 
(A.48) Clearly, the eventsZ n are decreasing with n. Also, we have{T≤N 0 }⊆Z n , for every n≥N 0 . Due to the threshold structure of the inference strategy ¯ f N , proving that ψ N (i)≥ 1− N is equivalent to showing that P g i [C i (ρ N+1 )−C i (ρ 1 )<θ N (i)]≤ N . To do so, we will use a Chernoff-bound based approach similar to the approach in Appendix A.7. We have P g i [C i (ρ N+1 )−C i (ρ 1 )<θ N (i)] (A.49) =P g i [C i (ρ N+1 )−C i (ρ 1 )<θ N (i),T >N 0 ] (A.50) +P g i [C i (ρ N+1 )−C i (ρ 1 )<θ N (i),T≤N 0 ] (A.51) ≤ N /2 +P g i [C i (ρ N+1 )−C i (ρ 1 )<θ N (i),T≤N 0 ], (A.52) where the last inequality follows from the definition of N 0 . A.8.0.1 Bounds on the MGF of Confidence Increment For some 0≤s≤ 1, define τ n (s) . = 1 if n∈N exp(−sD ∗ (i) +s 2 B 2 /2) otherwise. (A.53) Consider the following E ¯ g i exp[−s(C i (ρ N+1 )−C i (ρ 1 ));T≤N 0 ] (A.54) 205 a ≤ X j6=i E ¯ g i [exp (s log ˜ ρ 1 (j)−sZ N (j)) ;T≤N 0 ] (A.55) b ≤ X j6=i E ¯ g i [exp (s log ˜ ρ 1 (j)−sZ N (j)) ;Z N ] (A.56) c ≤ X j6=i E ¯ g i [exp (s log ˜ ρ 1 (j)−sZ N−1 (j)) ;Z N ]τ N (s) (A.57) d ≤ X j6=i E ¯ g i [exp (s log ˜ ρ 1 (j)−sZ N−1 (j)) ;Z N−1 ]τ N (s) e ≤ X j6=i E ¯ g i [exp (s log ˜ ρ 1 (j)−sZ N 0 −1 (j))] (A.58) × (exp(−sD ∗ (i) +s 2 B 2 /2)) N−N 0 −N † +1 (A.59) f ≤M(exp(−sD ∗ (i) +s 2 B 2 /2)) N−N 0 −N † +1 . (A.60) Here, N † is the number of indices inN that are at least N 0 . Note that N † <dlog a Ne. Inequality (a) is a consequence of the result in equation (B.1). Inequality (b) holds because the event{T≤ N 0 }⊆ Z N . We consider two cases for obtaining inequality (c): (i) If N ∈N , then we select U N uniformly and inequality (c) follows from the same arguments used to prove Lemma A.1 in Appendix A.1; (ii) Notice that under the events Z N and{X = i}, we have ¯ i N = i, and by the construction of strategy ¯ g N , the experiment U N is selected using the control law g N,i N if N / ∈N . This control law at time N satisfies Criterion 2.1 which is the condition required for using Lemma B.1. Thus, inequality (c) can be obtained from the same arguments used to prove Lemma B.1. Inequality (d) holds because Z N ⊆ Z N−1 . Inequality (e) is obtained by inductively applying the arguments (b)− (d). Inequality (f) is due to Lemma A.1 in Appendix A.1. A.8.0.2 Chernoff Bound Let N 00 . =N−N 0 −N † + 1 and let ζ > 0 be any small constant. Define θ N (i) (A.61) . = max ( ζ−C i (ρ 1 ),N 00 D ∗ (i)− s N N 00 B 2 2 − 1 s N log 2M N ) , where s N is as defined in (3.21). Under Assumption 3.2, one can verify that θ N (i)/N → D ∗ (i). Further, we can say that there exists an integer ¯ N such that N 00 > 0 and for every i∈X and 206 N≥ ¯ N, we have θ N (i) > ζ−C i (ρ 1 ). Thus, for every N≥ ¯ N, using the Chernoff bound [52], we have P g i [C i (ρ N+1 )−C i (ρ 1 )<θ N (i),T≤N 0 ] ≤E g i exp[−s(C i (ρ N+1 )−C i (ρ 1 )−θ N (i));T≤N 0 ] a ≤M(exp(sθ N (i)−sN 00 D ∗ (i) +s 2 N 00 B 2 /2)) b ≤ N /2, where inequality (a) follows from (A.60) and inequality (b) is obtained by substituting the values of θ N (i) and s N . Combining this result with (A.52), we have P g i [C i (ρ N+1 )−C i (ρ 1 )≤θ N (i)]≤ N . Therefore, the strategy pair ( ¯ f N , ¯ g N ) defined in Section 2.4.0.1 satisfies the constraints in Problem (P2). 207 Appendix B Active Hypothesis Testing and Anomaly Detection B.1 Proof of Theorem 3.1 Let us fix the horizonN. Since the inference strategy has a threshold structure (see Definition 3.7), proving that ψ N (i)≥ 1− N is equivalent to proving that the probability P g i [C(I N+1 )<θ N ]≤ N . 
To this end, we will begin with obtaining upper bounds on the moment-generating function (MGF) of the confidence increment and then obtain a Chernoff bound based on these upper bounds. B.1.0.1 Confidence Level and Log-likelihood Ratios Let 0 0 for everyi∈S. SinceD i i > 0, the minimization in (B.15) can be restricted to singleton hypotheses. Further, based on a simple contradiction argument, we can conclude that for every i∈S, the max-minimizer α ∗ must satisfy α ∗ (i)D(p i 0 ||p i 1 ) =D ∗ . (B.17) Using the fact that α ∗ is a distribution overS, we can obtain D ∗ and α ∗ from (B.17). In order to determine the min-maximizing distribution β ∗ , let us consider the following: D ∗ = min β∈Δ ˜ X max u∈U in X x∈ ˜ X β(x)D u x (B.18) d = min β∈ΔS max i∈S X j∈S β(j)D i j (B.19) e = min β∈ΔS max i∈S β(j)D i i . (B.20) Every x such that β ∗ (x)> 0 must satisfy [110] D ∗ = X u∈U in α ∗ (u)D u x = X i∈x α ∗ (i)D i i . (B.21) Since α ∗ (i) = D ∗ /D i i , the above equality holds only when x is a singleton set. Therefore, β ∗ will assign non-zero probability to only singleton sets which leads to (d). Equality in (e) is a consequence of the log-likelihood ratios in (3.11). It is clear from (B.16) and (B.20) that β ∗ =α ∗ . B.3 Proof of Lemma 3.4 For any distribution α∈ ΔU and any β∈ Δ ˜ X , we have min x∈ ˜ X X u∈U α(u)D u x ≤ max u∈U X x∈ ˜ X β(x)D u x . (B.22) If α ∗ ∈ ΔU and β ∗ ∈ Δ ˜ X satisfy (B.22) with equality, then they are respectively maxminimizing and minmaximizing distributions. Using this approach, we prove that the uniform distribution 211 over groups of size k ∗ is a maxminimizer and the uniform distribution over M singleton sets of components is a minmaximizer. We have min x∈ ˜ X X u∈U α(u)D u x = 1 M k ∗ min x∈ ˜ X X u:|u|=k ∗ D u x (B.23) a = 1 M k ∗ min x∈ ˜ X X u:|u|=k ∗ u∩x6=? D u x (B.24) b = 1 M k ∗ min j∈S X u:|u|=k ∗ j∈u D u j (B.25) c = M−1 k ∗ −1 M k ∗ D {1,...,k ∗ } 1 (B.26) = k ∗ D {1,...,k ∗ } 1 M . (B.27) Equality in (a) holds becauseD u x = 0 if the selected groupu has no intersection with the hypothesis x. In (B.24), the summation involves all the experiments that have non-null intersection with x. Therefore, the number of terms in the summation is the smallest when x is singleton. This leads to (b). The result (c) is a direct consequence of our assumption that the system is symmetric. Further, we have max u∈U X x∈ ˜ X β(x)D u x = max u∈U 1 M X j∈u D u j (B.28) d = max 1≤k≤M kD {1,...,k} 1 M . (B.29) The result (d) follows from our assumption that the system is symmetric. Equations (B.27) and (B.29) imply that our distributionsα ∗ andβ ∗ satisfy (B.22) with equality and therefore are respec- tively the maxminimizer and minmaximizer. B.4 Complete Proof We prove this result for the case where the system can have at most one anomaly. When the system has arbitrarily many anomalies, the same arguments can be used to establish the strong converse since the two models have the same β ∗ . Further, letC(I N+1 ) be the confidence computed 212 assuming that there is at most one anomaly and letC 0 (I N+1 ) be the confidence computed assuming that there could be arbitrarily many anomalies. One can easily show using (3.24) and properties of the log-sum-exponential function that ifC(I N+1 ) > θ thenC 0 (I N+1 ) > θ−M. Because of this connection between the two forms confidence levels, it is sufficient to prove the achievability for the former model. 
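The achievability and converse bounds in the next two subsections are expressed through the max-min quantities $D^*$, $\alpha^*$ and $\beta^*$ characterized above. As a small numerical illustration (with made-up divergence values, not numbers from any experiment in this thesis), the Python sketch below computes $D^*$ and $\alpha^*$ from the fixed-point relation (B.17) and the optimal group size from (B.29); the group divergences $D^{\{1,\ldots,k\}}_1$ are given an assumed increasing shape purely for illustration.

```python
import numpy as np

# Made-up singleton divergences D(p0_i || p1_i) for M components (illustration only;
# in the symmetric model of Lemma 3.4 these would all be equal and alpha* would be uniform).
D_single = np.array([0.8, 1.1, 0.9, 1.3, 1.0])
M = len(D_single)

# Relation (B.17): alpha*(i) * D(p0_i || p1_i) = D* for every i, with alpha* a distribution,
# so D* = 1 / sum_i (1 / D_i) and alpha*(i) = D* / D_i.
D_star = 1.0 / np.sum(1.0 / D_single)
alpha_star = D_star / D_single
print("D* =", D_star, " alpha* =", alpha_star, " sums to", alpha_star.sum())

# Relation (B.29) for the symmetric group-testing model: the best group size k* maximizes
# k * D_1^{(1:k)} / M.  The group divergences below follow an assumed increasing shape.
D_group = np.log1p(np.arange(1, M + 1))          # placeholder for D_1^{(1:k)}, k = 1..M
values = np.arange(1, M + 1) * D_group / M
k_star = int(np.argmax(values)) + 1
print("k* =", k_star, " max-min value =", values[k_star - 1])
```

When all the singleton divergences are equal, the computed $\alpha^*$ reduces to the uniform distribution, consistent with the symmetric analysis above.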
B.4.1 Strong Converse For any given pair of inference and experiment selection strategies f,g that are feasible in Problem (P1), the confidence levelC can be viewed as a log-likelihood ratio. Therefore for this strategy pair f,g, we have the following for every χ∈R − logφ N (B.30) a ≤ χ− log(ψ N −P g ? [C(I N+1 ,ρ 1 )>χ]) (B.31) b ≤ χ− log(1− N −P g ? [C(I N+1 ,ρ 1 )>χ]) (B.32) = χ− log(P g ? [C(I N+1 ,ρ 1 )≤χ]− N ) (B.33) Here, we use the convention that if x≤ 0, then logx . =−∞. Inequality (a) is a consequence of the strong converse theorem in [117]. Inequality (b) holds because ψ N ≥ 1− N . However, this lower bound on φ N depends on the experiment selection strategy g. We can use the decomposition in Lemma 3.6 to obtain a strategy-independent lower bound. We have P g ? [C(I N+1 ,ρ 1 )≤χ] (B.34) a =P g ? [−D(β ∗ ||˜ ρ N+1 ) + ¯ Z N +D(β ∗ ||˜ ρ 1 )≤χ] (B.35) b ≥P ? [ ¯ Z N +D(β ∗ ||˜ ρ 1 )≤χ]. (B.36) Equality (a) is a consequence of Lemma 3.6, and since D(β ∗ ||˜ ρ N+1 )≥ 0, we have that the event {−D(β ∗ ||˜ ρ N+1 ) + ¯ Z N +D(β ∗ ||˜ ρ 1 )≤χ} (B.37) ⊇{ ¯ Z N +D(β ∗ ||˜ ρ 1 )≤χ}, (B.38) 213 which results in the inequality (b). Combining (B.33) and (B.36) leads us to the following lemma. Lemma B.2 (Stong Converse). For any given pair of inference and experiment selection strategies f,g that are feasible in Problem (P1), we have for every χ∈R − logφ N ≤χ− log(P g ? [ ¯ Z N +D(β ∗ ||˜ ρ 1 )≤χ]− N ), with the convention that logx . =−∞ if x≤ 0. The bound (3.27) in Theorem 6.1 is obtained from Lemma B.2 by assigning χ =inv N N + N η , (B.39) where inv N is the quantile function of ¯ Z N +D(β ∗ ||˜ ρ 1 ). The bound (3.29) in Corollary 3.1 is obtained by assigning χ ∗ . =ND ∗ − √ NVQ −1 N + N η + 6T √ NV 3 +D(β ∗ ||˜ ρ 1 ). Then using the Berry-Esseen theorem [115], we have P g ? [ ¯ Z N +D(β ∗ ||˜ ρ 1 )≤χ ∗ ]≥ N + N η . (B.40) The result above combined with Lemma B.2 leads to the strong converse bound in Corollary 3.1. B.4.2 Strong Achievability Lemma B.3. Let f be a deterministic inference strategy in which hypothesis 0 is decided only if C(I N+1 ,ρ 1 )≥θ. Then φ N ≤e −θ . Proof. This follows from the standard arguments associated with log-likelihood ratios [117]. A proof of this lemma is provided in Appendix G of [73] for the case when the observation spaceY is finite. 214 Consider the following inference strategy: decide that the system is safe ifC(I N+1 ,ρ 1 )≥ θ N and decide that it is unsafe otherwise, where the threshold θ N is given by θ N . =inv N N − N η −O log η N . (B.41) Using Lemma B.3, we can conclude that the inference strategy stated above (irrespective of which experiment selection strategy is used) achieves φ N ≤ exp(−θ N ). However, for a pair of experiment selection and inference strategies to be feasible, we also need to show that the constraint (ψ N ≥ 1− N ) in Problem (P1) is satisfied. To do so, all we need to show is that our deterministic adaptive strategy, combined with the threshold based inference described above, satisfies P g ? [C(I N+1 ,ρ 1 )<θ N ]≤ N . (B.42) Based on the decomposition in Lemma 3.6 and a union bound, a sufficient criterion for the satisfying condition above is the following: P g ? [−D(β ∗ ||˜ ρ n+1 )<θ N,1 ]≤ N /η (B.43) P g ? [ ¯ Z n +D(β ∗ ||˜ ρ 1 )<θ N,2 ]≤ N − N /η, (B.44) where η> 1 and θ N,1 =−O log η N (B.45) θ N,2 =inv N N − N η . (B.46) Notice that the condition (B.44) is trivially satisfied because of the definition of the θ N,2 and the quantile function inv N . Therefore, we just need to show that condition (B.43) is satisfied. 
To do so, we will use a Chernoff bound based argument in the following manner. Remark B.2. The result (3.30) in Corollary 3.1 can be obtained by assigning θ N,2 . = (B.47) 215 ND ∗ − √ NVQ −1 N − N η − 6T √ NV 3 +D(β ∗ ||˜ ρ 1 ). (B.48) Once again, using the Berry-Esseen theorem [115], P g ? [ ¯ Z n +D(β ∗ ||˜ ρ 1 )<θ N,2 ]≤ N − N /η. (B.49) Proof of Inequality (B.43) Define ζ n (j) . =Z n (j)− log ˜ ρ 1 (j)− ¯ Z n −H(β ∗ , ˜ ρ 1 ). (B.50) In the symmetric case, we have ζ n (j)−ζ n−1 (j) =λ j (U n ,Y n )− 1 M X k∈U λ j (U n ,Y n ) (B.51) = M−1 M log p 0 (Yn) p 1 (Yn) if U n =j 1 M log p 1 (Yn) p 0 (Yn) if U n 6=j (B.52) For convenience, define m(s) =E ? exp s(M− 1) M log p 1 (Y ) p 0 (Y ) (B.53) ¯ m(s) =E ? exp s M log p 0 (Y ) p 1 (Y ) . (B.54) Notice that the strategy DAS described earlier is equivalent to selecting the component j with the smallest ζ n (j) at time n + 1. Let ¯ j n . = arg min j∈U ζ n (j). (B.55) We have −H(β ∗ , ˜ ρ n+1 ) =− log X j∈U exp(−ζ n (j)) (B.56) 216 Therefore, for 0≤s≤ 1, E g 0 exp[sH(β ∗ , ˜ ρ n+1 )] =E g 0 X j∈U exp (−ζ n (j)) s =E g 0 X j∈U (exp (−sζ n (j))) 1/s s a ≤E g 0 X j∈U exp (−sζ n (j)) (B.57) = X j∈U E g 0 exp (−sζ n (j)). (B.58) Inequality (a) holds becausek·k 1/s ≤k·k 1 . Further, we have X j∈U E g 0 [exp (−sζ n+1 (j))] (B.59) a =E g 0 X j∈U exp (−sζ n (j))E g 0 [exp(−s(ζ n+1 (j)−ζ n ))|I n+1 ] =E g ? exp(−sζ n ( ¯ j n ))m(s) + X j6= ¯ jn exp(−sζ n (j)) ¯ m(s) , (B.60) where (a) follows from the tower property of conditional expectation and the fact that ζ n (j) is measurable w.r.t. I n+1 . Let % n . = exp(−sζ n ( ¯ j n )) P j∈U exp(−sζ n (j)) ≥ 1 M . (B.61) Then exp(−sζ n ( ¯ j n ))m(s) + P j6= ¯ jn exp(−sζ n (j)) ¯ m(s) P j∈U exp(−sζ n (j)) (B.62) =% n m(s) + (1−% n ) ¯ m(s). (B.63) Lemma B.4. For 0≤δ< 1/(M− 1), if % n ≤ 1+δ M then exp(−sζ n ( ¯ j n )) = max j∈U exp(−sζ n (j))≤ 1 +δ 1 +δ−Mδ . (B.64) 217 Proof. See Section B.5.4 in the supplementary material. Lemma B.5. There exist constants 0 < s ∗ < 1 and 0 < ς < 1 such that if % n > 1+δ M , then % n m(s ∗ ) + (1−% n ) ¯ m(s ∗ )<ς. Proof. Based on a first-order Taylor approximation at s = 0, we can conclude that m(s) = 1−s (M− 1)D M +o(s) (B.65) ¯ m(s) = 1 +s D M +o(s). (B.66) Therefore, there exists a neighborhood of s around 0 such that m(s) < ¯ m(s). Hence, in this neighborhood % n m(s) + (1−% n ) ¯ m(s) is a decreasing function of % n for fixed value of s. Further, % n m(s) + (1−% n ) ¯ m(s) = 1 +s D M −% n +o(s). (B.67) For % n = 1+δ M , the RHS in the expression above is 1−s Dδ M +o(s). (B.68) Therefore, there exists an s ∗ such that ς . = 1 +δ M m(s ∗ ) + 1− 1 +δ M ¯ m(s ∗ )< 1. (B.69) Hence, for every % n > 1+δ M , % n m(s ∗ ) + (1−% n ) ¯ m(s ∗ ) < ς < 1. This concludes the proof of the lemma. Henceforth, the value of s is assigned to be s ∗ defined in the proof of Lemma B.5. Based on Lemmas B.4 and B.5, we can consider the the following cases: 1. % n ≤ 1+δ M : In this case, we can conclude using Lemma B.4 that exp(−sζ n ( ¯ j n ))m(s) + X j6= ¯ jn exp(−sζ n (j)) ¯ m(s) (B.70) ≤ (1 +δ)(m(s) + (M− 1) ¯ m(s)) 1 +δ−Mδ . =K. (B.71) 218 2. % n > 1+δ M : In this case, we can conclude using Lemma B.5 that exp(−sζ n ( ¯ j n ))m(s) + X j6= ¯ jn exp(−sζ n (j)) ¯ m(s) (B.72) <ς X j∈U exp(−sζ n (j)) . (B.73) Therefore, we have exp(−sζ n ( ¯ j n ))m(s) + X j6= ¯ jn exp(−sζ n (j)) ¯ m(s) (B.74) < max{K,ς X j∈U exp(−sζ n (j))}. (B.75) This, using the result above and (B.60), we can conclude that X j∈U E g 0 exp (−sζ n+1 (j))<K +ς X j∈U E g ? exp(−sζ n (j)). 
(B.76) Using the result (B.76) inductively and combining it with (B.58), we have E g 0 exp[sH(β ∗ , ˜ ρ N+1 )]≤Mς N + N X n=1 Kς N−n (B.77) ≤M + K 1−ς . =K 0 . (B.78) Chernoff Bound We can use the Chernoff bound [126] to conclude that P g ? [−D(β ∗ ||˜ ρ n+1 )<θ N,1 ] (B.79) =P g ? [logM−H(β ∗ , ˜ ρ N+1 )<θ N,1 ] (B.80) ≤ exp(s(θ N,1 − logM))E g 0 exp[sH(β ∗ , ˜ ρ N+1 )] (B.81) a ≤ exp(s(θ N,1 − logM)) M + K 1−ς (B.82) b = N /η, (B.83) 219 where θ N,1 . = 1 s ∗ log N ηK 0 + logM =−O log η N . (B.84) Here, inequality (a) is a consequence of (B.78) and (b) follows from the definition of θ N,1 above. This concludes our argument that the condition (B.43) is satisfied. Remark B.3. We would like to emphasize that the constants in the logarithmic term such as δ,s ∗ etc. need to be chosen appropriately and at this point it is not clear how one might determine these constants in general. It remains to be investigated how tight the bound obtained herein would be once these constants are obtained. B.5 Supplementary Material B.5.1 Proof of Lemma 3.3 Let the moment generating function of L n be ¯ μ(s). Therefore, for any strategy g, we have E g ? [exp( n X k=1 s k L k )] =E g ? [E ? exp( n X k=1 s k L k )|I n ]] (B.85) =E g ? [exp( n−1 X k=1 s k L k )E ? [exp(s n L n )|I n ]] =E g ? [exp( n−1 X k=1 s k L k )]¯ μ(s n ) = Π n k=1 ¯ μ(s k ). (B.86) B.5.2 Proof of Lemma 3.6 Using the definition of cross-entropy, we have −H(β ∗ , ˜ ρ n+1 ) = X j∈U β ∗ (j) log ˜ ρ n+1 (j) a = X j∈U β ∗ (j) log ˜ ρ 1 (j)e −Zn(j) P k∈U ˜ ρ 1 (k)e −Zn(k) ! =−H(β ∗ , ˜ ρ 1 )− ¯ Z n − log X k∈U exp (log ˜ ρ 1 (k)−Z n (k)) b =−H(β ∗ , ˜ ρ 1 )− ¯ Z n +C(I n+1 ). 220 Equality (a) follows from the definition of ˜ ρ n+1 and equality (b) is a consequence of Lemma 3.5. The lemma then follows by adding H(β ∗ ,β ∗ ) on both sides. B.5.3 Proof of Lemma B.1 We have min u P x∈ ˜ X (ρ n+1 (x)) s μ x (u,s) P x∈ ˜ X (ρ n+1 (j)) s ≤μ ∗ (s) (B.87) ⇐⇒ min u X x∈ ˜ X exp (s log ˜ ρ 1 (x)−sZ n (x))μ x (u,s) ≤ X x∈ ˜ X exp (s log ˜ ρ 1 (j)−sZ n (x))μ ∗ (s). (B.88) Recall that the strategy g at time n + 1 selects an experiment that minimizes the LHS in (B.87). Thus, we have X x∈ ˜ X E g ? exp (s log ˜ ρ 1 (x)−sZ n+1 (x)) (B.89) = X x∈ ˜ X E g ? [E g ? [exp (s log ˜ ρ 1 (x)−sZ n+1 (x))|I n+1 ]] (B.90) = X x∈ ˜ X E g ? [exp (s log ˜ ρ 1 (x)−sZ n (x)) (B.91) ×E g ? [exp(−sλ x (U n+1 ,Y n+1 ))|I n+1 ]] a =E g ? min u X x∈ ˜ X exp (s log ˜ ρ 1 (x)−sZ n (x))μ x (u,s) b ≤E g ? X x∈ ˜ X exp (s log ˜ ρ 1 (x)−sZ n (x))μ ∗ (s), where equality (a) follows from the fact that the observation Y n+1 is conditionally independent of I n+1 given an experiment u. Inequality (b) follows from the result in (B.88). B.5.4 Proof of Lemma B.4 Consider the following facts: 221 1. We have min j∈U exp(−sζ n (j))≤ 1. This is because P j∈U ζ n (j) = 0 and thus max j∈U ζ n (j)≥ 0. 2. For every j∈U, we have exp(−sζ n (j))≤ exp(−sζ n ( ¯ j n )). This simply follows from the definition of ¯ j n . Combining the two facts stated above, we have X j∈U exp(−sζ n (j))≤ (M− 1) exp(−sζ n ( ¯ j n )) + 1. Therefore, 1 +δ M ≥ exp(−sζ n ( ¯ j n )) P j∈U exp(−sζ n (j)) (B.92) ≥ exp(−sζ n ( ¯ j n )) (M− 1) exp(−sζ n ( ¯ j n )) + 1 (B.93) =⇒ max j∈U exp(−sζ n (j))≤ 1 +δ 1 +δ−Mδ (B.94) This concludes the proof of the lemma. 222 Appendix C A Neural Network-based Framework for Strategy Design C.1 Recurrent Neural Network Architecture The first goal is to verify if the internal state of an LSTM can maintain hypothesis information. 
The model is a simpler version of the recurrent network shown in Figure C.2, which takes a se- quence of random queries and its results as input. This model is compared against Maximum A Posteriori (MAP) rule for hypothesis classification which is optimal for any input, in Figure C.1. The performance of LSTM comes close to that of MAP, which clearly shows that its hidden state maintains hypothesis information. We examine if LSTMs can learn query selection as well. Our model architecture in Figure C.2 predicts a query at each time-step which in turn is used to produce an input for the next time-step. True hypothesis is provided for training the model, but optimal query selection for any time-step is unknown. The model is expected to learn this implicitly. There are two practical issues with this architecture. Query result is produced by a non-differentiable black-box making the output and input of consecutive time-steps disconnected. This prevents explicit learning of query selection. However, it is known that recurrent networks can learn implicit tasks [128]. Second, query selection is a soft decision made by the model, whereas a discrete decision is preferred. Experiments show that the model fails to learn query selection. An improvement to this architecture can be made if hard decisions can be incorporated in a model. A discrete decision from the model also solves the problem of explicit query selection, since the output and input at consecutive time-steps can be connected. This direction of research leads to reinforcement learning, which requires further investigation. 223 Figure C.1: This plot compares the performance of the RNN network vs the MAP rule. Figure C.2: The LSTM network with query selection. 224 Appendix D Non-parametric Target Localization D.1 Proof of Lemma 5.2 We prove Lemma 5.2 by contradiction. Assume that for some l∈{1,...,k}, c l 6=μ(H I l x). Define c 0 =c k+1 = 0. Construct another vector z † such thatH I j z † =H I j z ∗ for all j6=l and every entry (c † l ) ofH I l z † is equal to 1. max(μ(H I l x),c l−1 ,c l+1 ) if c l >c l−1 ,c l+1 . 2. min(max(μ(H I l x),c l−1 ),c l+1 ) if c l−1 <c l <c l+1 . 3. min(max(μ(H I l x),c l+1 ),c l−1 ) if c l+1 <c l <c l−1 . Under this construction, it can be verified, that z † ∈U n i . Also, notice that c † l is constructed such that it is strictly closer to μ(H I l x) than c l . Let q =|I l |. ||x−z ∗ || 2 −||x−z † || 2 =||H I l x−H I l z ∗ || 2 −||H I l x−H I l z † || 2 =q[(c l −μ(H I l x)) 2 − (c † l −μ(H I l x)) 2 ] > 0. Thus, z ∗ is not an optimal solution and this in contradiction with the definition of z ∗ . Thus, for every l∈{1,...,k}, the entries of c l must be equal to μ(H I l x). 225 D.2 Proof of Lemma 5.3 Consider the eventE l := μ(H (1:k l (x,ρ,p)) y)−e T 1 x≥−ρ. Conditioned on this event, and using Lemma 5.2 we have μ(H (1:k l (x,ρ,p)) y) = 1 k l (x,ρ,p) l−1 X j=1 |I j |c j +|I ∗ l |c l ≥e T 1 x−ρ, whereI ∗ l :=I l ∩{1,...,k l (x,ρ,p)}. Since the weighted mean of c j s is larger than e T 1 x−ρ, the maximum among c j (which is equal to c l due to monotonicity) must also be larger than e T 1 x−ρ. Therefore, max j∈{1,...,l} c j =c l ≥e T 1 x−ρ. Thus, P[c l ≥e T 1 x−ρ] =P[c l ≥e T 1 x−ρ|E]P[E] +P[c l ≥e T 1 x−ρ,E c ] ≥ 1−p. The last inequality is due to the fact thatP[c l ≥e T 1 x−ρ|E] = 1 andP[E]≥ 1−p. The bound on P[E] can be easily obtained using Chernoff’s inequality (5.1) as follows: P[μ(H (1:k l (x,ρ,p)) y)−e T 1 x≥−ρ] =P h μ(H (1:k) n)≥−ρ− (μ(H (1:k) x)−e T 1 x) i ≥ 1−p. 
(From Chernoff’s inequality) D.3 Proof of Lemma 5.1 The proof can been divided into two parts. In the first part, we define an eventE 0 associated with the random vectorn and find bounds on the probability of this event. In the second part, we show 226 that for any noise vectorn∈E 0 ,R 0 2 ⊆R 0 1 (see Fig. 5.5) and thus, obtaining a bound on the event E. For conciseness, let k l = k l (H R 2 z 0 , 0,p) and k r = k r (H R 2 z 0 , 0,p). Define region R p := {1,...,l +k l − 1}∪{r−k r + 1,n}. Consider the event E p :={n∈R n :||H Rp n|| 2 ≤ (n−n 0 )σ 2 + ( √ n− √ n 0 )η 2 }. From Bernstein’s inequality, P[E p ]≥ 1− exp − ( √ n− √ n 0 )η 2 −k p σ 2 2 8(n−n 0 +k p )σ 4 (D.1) = 1− exp − ( √ n− √ n 0 ) √ −8 lnp−k p 2 8(n−n 0 +k p ) . (D.2) Consider another event E l :={n :μ(H (l:l+k l −1) n)≥−(μ(H (l:l+k l −1) z 0 )−e T l z 0 )}. Using the Chernoff bound, we have P[E l ]≥ 1− exp −k l (μ(H (l:l+k l −1) z 0 )−e T l z 0 ) 2 2σ 2 ! (D.3) (a) ≥ 1−p, (D.4) where (a) follows from the definition ofk l . A similar eventE r can be defined for the right end such that its probability is greater than 1−p. LetE 0 =E p ∩E l ∩E r . Using the union bound, we have P[E 0 ]≥ 1− 2p− exp − ( √ n− √ n 0 ) √ −8 lnp−k p 2 8(n−n 0 +k p ) . 227 We now show that if the noise vector n∈E 0 , thenR 0 2 ⊆R 0 1 . Thus,E⊇E 0 andP[E]≥P[E 0 ]. Suppose n∈E 0 . Let z =z 0 +n. If i∈R 0 2 , then by definition ofR 0 2 , ||z ∗ i −H R 2 z|| 2 2 ≤n 0 σ 2 + √ n 0 η 2 , where z ∗ i = arg min x∈U n 0 i 0 ||x−H R 2 z|| 2 . For every i∈R 0 2 , we show, by construction, that there is a vector y i such that ||y i −z|| 2 ≤nσ 2 + √ nη 2 , so that i∈R 0 1 and thus,R 0 2 ⊆R 0 1 . Let l 0 be the minimum of the set {k∈R 2 :e T k z ∗ i ≥e l z 0 }. This represents the first entry of z ∗ i that exceeds e l z 0 . Similarly, define r 0 for the right end of z ∗ i . Define vector z † , of size n 0 , as e T k z † = e T k z ∗ i if l 0 ≤k≤r 0 e T l z 0 if k<l 0 e T r z 0 if k>r 0 . Let the regions corresponding to the sets of indices (l : l 0 − 1), (l 0 : r 0 ) and (r 0 + 1 : r) beR l ,R c andR r , respectively. Let z l :=e T l z 0 and z r :=e T r z 0 . Then, ||z † −H R 2 z|| 2 ≤||z l 1−H R l z|| 2 −||H (1:l 0 −1) z ∗ i −H R l z|| 2 +n 0 σ 2 + √ n 0 η 2 +||z r 1−H Rr z|| 2 −||H (r 0 +1:n 0 ) z ∗ i −H Rr z|| 2 . Consider the term||z l 1−H R l z|| 2 −||H (1:l 0 −1) z ∗ i −H R l z|| 2 . Let k be an index in l :l 0 − 1. For each k, we have 228 • If e T k H R l z≥z l ≥e T k z ∗ i , then (z l −e T k H R l z) 2 − (e T k z ∗ i −e T k H R l z) 2 < 0. (D.5) • If e T k H R l z<z l , then (z l −e T k H R l z) 2 − (e T k z ∗ i −e T k H R l z) 2 ≤ (z l −e T k H R l z) 2 = (z l −e T k H R l z 0 −e T k H R l n) 2 (D.6) <|e T k H R l n| 2 . (D.7) Thus, we can conclude that||z l 1−H R l z|| 2 −||H (1:l 0 −1) z ∗ i −H R l z|| 2 ≤||H R l n|| 2 . We can have a similar inequality for the right end as well. Hence, we have ||z † −H R 2 z|| 2 ≤n 0 σ 2 + √ n 0 η 2 +||H R l n|| 2 +||H Rr n|| 2 . (D.8) LetR 1 =R L ∪R 2 ∪R R whereR L andR R are the regions to the left and right ofR 2 respectively. Let|R L | =n L and|R R | =n R . Let z † L :=H R L z 0 (D.9) z † R :=H R R z 0 . Construct a vector y i := [z †T L z †T z †T R ] T . By construction, y i is unimodal with peak at i. Also, from (D.8) and (D.9) we have ||y i −z|| 2 =||H R L z−z † L || 2 +||z † −H R 2 z|| 2 +||H R R z−z † R || 2 ≤||H R L n|| 2 +n 0 σ 2 + √ n 0 η 2 +||H R l n|| 2 +||H Rr n|| 2 +||H Rr n|| 2 (b) ≤ n 0 σ 2 + √ n 0 η 2 +||H Rp n|| 2 229 (c) ≤ n 0 σ 2 + √ n 0 η 2 + (n−n 0 )σ 2 + ( √ n− √ n 0 )η 2 =nσ 2 + √ nη 2 . 
Thus, i ∈ R 0 1 . The inequality (b) is a consequence of Lemma 5.3. According to Lemma 5.3, l 0 ≤k l and r 0 ≥k r under eventsE l andE r , respectively. Thus, the regionR p subsumes the region R L ∪R l ∪R r ∪R R . The inequality (c) holds because n∈E p . D.4 Proof of Theorem 5.1 Letθ be aκ-greedy algorithm. Every time θ reduces the search space, there is a chance of missing the peak. This probability is upper bounded byp due to Bernstein’s inequality (5.5). Search space reduction can happen at most N times (it usually much lower than this). The target is missed if it is missed even once and we can bound the probability of missing the target at least once using a union bound. Thus, P[π∈R T ]≥ 1−Np. For any ω∈ Ω under the strategy θ, there are two possibilities: 1. There is no search space reduction until the stopping time T . Then S 1:T ={1,...,N} since the algorithm terminates only when all the samples inside the search space are sampled. Thus, the terminal feasible region is computed by θ using all the samples and hence, R T =R ∗ N . 2. If there is some search space reduction, then|S 1:T ∩R T−1 |≤κN. LetE m be the event that the peak is never missed. Since the peak is in R t for every t, we can apply Lemma 5.1 with R 1 ={1,...,N} andR 2 =R T−1 to conclude thatP[R T ⊆R ∗ N |E m ]≥ 1−o(p γ 2 ). Thus, P[R T ⊆R ∗ N ]≥P[R T ⊆R ∗ N ,E m ] =P[R T ⊆R ∗ N |E m ]P[E m ] ≥ (1−o(p γ 2 ))(1−Np) = 1−o(Np +p γ 2 ). 230 Appendix E Adaptive Sampling for Estimating Distributions E.1 Proof of Lemma 6.1 A Beta distribution with parameters α,β is sub-Gaussian with parameter 1 4(α+β+1) [88]. And because of (6.10) we can conclude using the Chernoff bound [126] that P ρn p (k,l) > α (k,l) n α (k,0) n + v u u t ln 2 δn 2(α (k,0) n + 1) ≤δ n (E.1) P ρn p (k,l) < α (k,l) n α (k,0) n − v u u t ln 2 δn 2(α (k,0) n + 1) ≤δ n . (E.2) Since a (k,l) n and b (k,l) n are selected using the inverse cdf of the posterior distribution, we have b (k,l) n −a (k,l) n ≤ 2 v u u t ln 2 δn 2(α (k,0) n + 1) . (E.3) The second inequality in the lemma simply follows from the fact the α (k,0) n ≥T (k) n . E.2 Proof of Lemma 6.2 Under eventE, we have |u (k) n −c (k) | (E.4) = X l (p (k,l) ) 2 − (q (k,l) n ) 2 −p (k,l) +q (k,l) n (E.5) 231 a ≤ X l (p (k,l) ) 2 − (q (k,l) n ) 2 (E.6) = X l p (k,l) −q (k,l) n p (k,l) +q (k,l) n (E.7) ≤ X l len E (k,l) n p (k,l) +q (k,l) n (E.8) b ≤ 2 v u u t ln 2 δn 2(T (k) n + 1) X l p (k,l) +q (k,l) n (E.9) = 4 v u u t ln 2 δn 2(T (k) n + 1) . (E.10) Inequality in (a) is because of the triangle inequality and the inequality in (b) is due to Lemma 6.1. E.3 Proof of Theorem 6.1 We have δ≥P ρ 1 [E c ] =E ρ 1 [δ p ]≥E ρ 1 h δ p 1 η(p) (p) i =E % [δ p ]η, (E.11) where δ p . = P g [E c | p]. Note that the last equality in the display above follows from a simple change of measure argument. Sinceδ p is the probability of the unfavorable eventE c conditioned on the model p, we can obtain a regret bound in terms of δ p without local averaging using the same arguments as in the proofs of Lemma 1 and Theorem 1 in [132, Appendix D.3]. Thus, we have L (k) N (p (k) ,g)−ϕ ∗ (p,N) (E.12) ≤ (K + 5)LM (λ min ) 2 N 3/2 + δ p (λ min ) 2 N 1 + 6M √ N + 6(K− 1)M 2 (λ min ) 3 N 2 +Lδ p , (E.13) where M . = λ max p 8 log(2/δ N ) λ min C =O q log(ηN) (E.14) λ max . = min k,π∈η(p) c (k) P i c (i) ; λ max . = max k,π∈η(p) c (k) P i c (i) (E.15) C . = X i c (i) , (E.16) 232 and c (i) = P l π (i,l) (1−π (i,l) ). 
Using (E.11), we can then locally average the regret as E % h L (k) N (π (k) ,g)−ϕ ∗ (p,N) i (E.17) ≤ (K + 5)LM (λ min ) 2 N 3/2 + δ (ηλ min ) 2 N 1 + 6M √ N + 6(K− 1)M 2 (λ min ) 3 N 2 +L δ η (E.18) =O(log(η)N −3/2 ). (E.19) E.4 Incorporating Priors In this section, we will consider a scenario where we are provided with a priorρ 1 onp. We will first modify the distribution % used for local averaging as %(A) =ρ 1 (A∩η(p))/η, (E.20) for any Borel measurable setA⊆ Δ K . Here, η(p) is a ball around p with ρ 1 (η(p)) =η. Dirichlet priors When the prior ρ 1 is any factored Dirichlet distribution (i.e. not necessarily uniform) of the form in (6.8), one can easily extend all the results with the modified %. Intervals for Bernoulli parameters Another scenario of interest is when L = 2 and we are given that the Bernoulli parameter associated with p (k) lies in some interval [γ (k) l ,γ (k) u ]. We can then assume that the Bernoulli parameterp (k) is distributed uniformly and independently over the interval [γ (k) l ,γ (k) u ]. Since this prior does not correspond to a Beta distribution, the posterior belief may not be a Beta distribution. However, we still can easily find the cdf of the posteriors in this case because the posterior on the parameters is just going to be a truncated Beta distribution. Let F (k) n be the cdf of the posterior at time n with a uniform prior and let ˜ F (k) n be the posterior with the truncated prior. Then one can show that ˜ F (k) n (x) = F (k) n (x)−F (k) n (γ (k) l ) F (k) n (γ (k) u )−F (k) n (γ (k) l ) . (E.21) The confidence intervals in (6.12) and (6.13) can then be computed using the inverse of ˜ F (k) n . 233 E.5 Practical Considerations: Overall Estimate and Batch Sampling In seroprevalence estimation, we generally allocate samples in batches of size B. Also, we are generally interested in estimating the positivity of each category as well as the positivity in the overall population. Let the fraction of individuals of category k in the overall population be w (k) . Then the overall positivity r and its estimate ˆ r are given by r = X k w (k) p (k) (E.22) ˆ r N = X k w (k) ˆ p (k) N . (E.23) The mean squared error between r and ˆ r N is given by MSE(r, ˆ r N ) = X k (w (k) ) 2 c (k) T (k) . (E.24) If the mean squared error associated with r is not considered, then it may so happen that a tiny group (small w k ) with high positivity will be allocated too many samples. The contribution of this small group to the overall estimate would be small and thus, allocating too many samples to it could compromise the quality of the overall estimate r. Therefore, we need to determine an allocation that accounts for the quality of the overall estimate r as well. A suitable way to formalize this notion is to pose the following constraints on the oracle allo- cation T (k) c (k) T (k) ≤θ (k) , k = 1,...,K (C1) X k (w (k) ) 2 c (k) T (k) ≤θ (0) (E.25) X k T (k) ≤N (E.26) T (k) ≥ 0, k = 1,...,K, (E.27) 234 where θ (k) , k = 0,...,K are predetermined constants. A solution to the above set of constraints can be obtained by solving the following optimization problem min T (1) ,...,T (K) max ( max k ( c (k) θ (k) T (k) ) , X k (w (k) ) 2 c (k) θ (0) T (k) ) (C2) s.t. X k T (k) ≤N (E.28) T (k) ≥ 0, k = 1,...,K. (E.29) The Problem (C1) is feasible if and only if the optimum value of the optimization problem above is less than or equal to 1. In that case, the solution to Problem (C2) is a solution to Problem (C1). 
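To make this feasibility test concrete, note that each ratio $c^{(k)}/(\theta^{(k)} T^{(k)})$ and the weighted sum $\sum_k (w^{(k)})^2 c^{(k)}/(\theta^{(0)} T^{(k)})$ are convex in $T^{(k)} > 0$, so Problem (C2) is a convex program that an off-the-shelf solver can handle. The sketch below is illustrative rather than part of the thesis: it uses the cvxpy modeling library, and the function name solve_c2 and the numerical constants in the example are placeholders.

```python
import numpy as np
import cvxpy as cp

def solve_c2(c, theta, theta0, w, N):
    """Solve the min-max allocation Problem (C2).

    c      : per-category constants c^(k)
    theta  : per-category MSE thresholds theta^(k)
    theta0 : threshold theta^(0) for the overall-estimate MSE
    w      : population fractions w^(k)
    N      : total sample budget

    Problem (C1) is feasible iff the returned optimal value is <= 1.
    """
    K = len(c)
    T = cp.Variable(K, pos=True)

    # Normalized per-category MSE terms c^(k) / (theta^(k) T^(k)).
    per_category = cp.multiply(c / theta, cp.inv_pos(T))
    # Normalized MSE of the overall estimate r.
    overall = cp.sum(cp.multiply((w ** 2) * c / theta0, cp.inv_pos(T)))

    problem = cp.Problem(
        cp.Minimize(cp.maximum(cp.max(per_category), overall)),
        [cp.sum(T) <= N],
    )
    problem.solve()
    return T.value, problem.value

# Illustrative example with made-up constants.
c = np.array([0.09, 0.16, 0.21])
theta = np.array([1e-4, 1e-4, 1e-4])
w = np.array([0.5, 0.3, 0.2])
T_opt, value = solve_c2(c, theta, theta0=5e-5, w=w, N=50000)
print(T_opt, value)   # Problem (C1) is feasible iff value <= 1
```

The per-batch heuristic problem (C3) discussed next has the same structure, with $T^{(k)}$ replaced by $T^{(k)}_n + \tau^{(k)}$, the unknown constants $c^{(k)}$ replaced by their current estimates $u^{(k)}_n$, and the budget constraint $\sum_k \tau^{(k)} = B$.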
Notice that Problem (C1) is very similar to Problem (P1) except for the additional mean squared error term associated with the overall estimate $r$. Because of this distinction, it is not clear whether one can view Problem (C2) as a particular instance of Problem (P1) with appropriate modifications and simply apply the adaptive sampling strategy in Section 6.4 to the modified problem. Nonetheless, we provide a similar heuristic sampling approach that tracks the oracle quite well (see Figure E.1). At each time $n$, the heuristic is to allocate samples within each batch according to the solution of the following optimization problem
\begin{align}
\min_{\tau^{(1)},\ldots,\tau^{(K)},\lambda} \quad & \lambda \tag{C3} \\
\text{s.t.} \quad & \frac{u^{(k)}_n}{T^{(k)}_n + \tau^{(k)}} \leq \theta^{(k)} \lambda \tag{E.30} \\
& \sum_k \frac{(w^{(k)})^2 u^{(k)}_n}{T^{(k)}_n + \tau^{(k)}} \leq \theta^{(0)} \lambda \tag{E.31} \\
& \sum_k \tau^{(k)} = B \tag{E.32} \\
& \tau^{(k)} \geq 0 \quad \forall k. \tag{E.33}
\end{align}

[Figure E.1: bar chart of the number of samples allocated to each demographic group (American Indian, Asian, Black, Hispanic, Other, Pacific Islander, White) under the Survey, Oracle and Adaptive allocations.]

Figure E.1: This plot represents the number of samples collected by the seroprevalence survey in [135], the oracle allocation in Problem (C1) with appropriate constants $\theta^{(k)}$, and the allocation by the heuristic (denoted by Adaptive). Notice that the heuristic tracks the oracle closely. This plot tells us that in order to get a good overall estimate, we should allocate fewer samples to an underrepresented group like the Pacific Islanders ($w^{(k)} \approx 0.003$) than the number suggested by the oracle (P1) (see 6.2a). However, it also tells us that we can allocate substantially more samples than the number in the survey ($\approx w^{(k)} N$) to this group for a better estimate of their positivity without compromising the quality of the overall estimate.

Appendix F

Stochastic Games with Asymmetric Information

F.1 Proof of Lemma 8.1

It was shown in [103] that there exist bijective mappings $\mathcal{M}^i : \mathcal{G}^i \to \mathcal{H}^i$, $i = 1, 2$, such that for every $g^1 \in \mathcal{G}^1$ and $g^2 \in \mathcal{G}^2$, we have
\[
J(g^1, g^2) = \mathcal{J}(\mathcal{M}^1(g^1), \mathcal{M}^2(g^2)). \tag{F.1}
\]
Therefore, for any strategy $g^1 \in \mathcal{G}^1$, we have
\begin{align}
\sup_{g^2 \in \mathcal{G}^2} J(g^1, g^2) &= \sup_{g^2 \in \mathcal{G}^2} \mathcal{J}(\mathcal{M}^1(g^1), \mathcal{M}^2(g^2)) \tag{F.2} \\
&= \sup_{\chi^2 \in \mathcal{X}^2} \mathcal{J}(\mathcal{M}^1(g^1), \chi^2). \tag{F.3}
\end{align}
Consequently,
\begin{align}
\inf_{g^1 \in \mathcal{G}^1} \sup_{g^2 \in \mathcal{G}^2} J(g^1, g^2) &= \inf_{g^1 \in \mathcal{G}^1} \sup_{\chi^2 \in \mathcal{X}^2} \mathcal{J}(\mathcal{M}^1(g^1), \chi^2) \tag{F.4} \\
&= \inf_{\chi^1 \in \mathcal{X}^1} \sup_{\chi^2 \in \mathcal{X}^2} \mathcal{J}(\chi^1, \chi^2). \tag{F.5}
\end{align}
This implies that $S^u(\mathcal{G}) = S^u(\mathcal{G}_v)$. We can similarly prove that $S^l(\mathcal{G}) = S^l(\mathcal{G}_v)$.

Remark F.1. We can also show that a strategy profile $(g^1, g^2)$ is a Nash equilibrium in game $\mathcal{G}$ if and only if $(\mathcal{M}^1(g^1), \mathcal{M}^2(g^2))$ is a Nash equilibrium in game $\mathcal{G}_v$.

F.2 Proof of Lemma G.2

Let us consider the evolution of the virtual game $\mathcal{G}_v$ under the strategy profile $(\chi^1, \chi^2)$ and the expanded virtual game $\mathcal{G}_e$ under the strategy profile $(\tilde\chi^1, \tilde\chi^2)$. Let the primitive variables and the randomization variables $K^i_t$ in both games be identical. The variables such as the state, action and information variables in the expanded game $\mathcal{G}_e$ are distinguished from those in the virtual game $\mathcal{G}_v$ by means of a tilde. For instance, $X_t$ is the state in game $\mathcal{G}_v$ and $\tilde X_t$ is the state in game $\mathcal{G}_e$. We will prove by induction that the system evolution in both these games is identical over the entire horizon. This is clearly true at the end of time $t = 1$ because the state, observations and the common and private information variables are identical in both games. Moreover, since $\chi^i = \varrho^i(\tilde\chi^1, \tilde\chi^2)$, $i = 1, 2$, the strategies $\chi^i_1$ and $\tilde\chi^i_1$ are identical by definition (see Definition G.4). Thus, the prescriptions and actions at $t = 1$ are also identical.
For induction, assume that the system evolution in both games is identical until the end of time t. Then, X t+1 =f t (X t ,U 1:2 t ,W s t ) =f t ( ˜ X t , ˜ U 1:2 t ,W s t ) = ˜ X t+1 . Using equations (8.3), (8.5) and (8.4), we can similarly argue that Y i t+1 = ˜ Y i t+1 , P i t+1 = ˜ P i t+1 and C t+1 = ˜ C t+1 . Since χ i =% i (˜ χ 1 , ˜ χ 2 ), we also have ˜ Γ i t+1 = ˜ χ i t+1 ( ˜ C t+1 , ˜ Γ 1:2 1:t ) a =χ i t+1 ( ˜ C t+1 ) b = Γ i t+1 . (F.6) Here, equality (a) follows from the construction of the mapping% i (see Definition G.4) and equality (b) follows from the fact that C t+1 = ˜ C t+1 . Further, U i t+1 =κ(Γ i t+1 (P i t+1 ),K i t+1 ) =κ( ˜ Γ i t+1 ( ˜ P i t+1 ),K i t+1 ) (F.7) = ˜ U i t+1 . (F.8) Thus, by induction, the hypothesis is true for every 1≤ t≤ T . This proves that the virtual and expanded games have identical dynamics under strategy profiles (χ 1 ,χ 2 ) and (˜ χ 1 , ˜ χ 2 ). 238 Since the virtual and expanded games have the same cost structure, having identical dynamics ensures that strategy profiles (χ 1 ,χ 2 ) and (˜ χ 1 , ˜ χ 2 ) have the same expected cost in games G v and G e , respectively. Therefore,J (χ 1 ,χ 2 ) =J (˜ χ 1 , ˜ χ 2 ). F.3 Proof of Theorem 7.1 For any strategy χ 1 ∈X 1 , we have sup ˜ χ 2 ∈ ˜ X 2 J (χ 1 , ˜ χ 2 )≥ sup χ 2 ∈X 2 J (χ 1 ,χ 2 ), (F.9) becauseX 2 ⊆ ˜ X 2 . Further, sup ˜ χ 2 ∈ ˜ X 2 J (χ 1 , ˜ χ 2 ) = sup ˜ χ 2 ∈ ˜ X 2 J (% 1 (χ 1 , ˜ χ 2 ),% 2 (χ 1 , ˜ χ 2 )). (F.10) = sup ˜ χ 2 ∈ ˜ X 2 J (χ 1 ,% 2 (χ 1 , ˜ χ 2 )) (F.11) ≤ sup χ 2 ∈X 2 J (χ 1 ,χ 2 ), (F.12) where the first equality is due to Lemma G.2, the second equality is because % 1 (χ 1 , ˜ χ 2 ) =χ 1 and the last inequality is due to the fact that % 2 (χ 1 , ˜ χ 2 )∈X 2 for any ˜ χ 2 ∈ ˜ X 2 . Combining (F.9) and (F.12), we obtain that sup χ 2 ∈X 2 J (χ 1 ,χ 2 ) = sup ˜ χ 2 ∈ ˜ X 2 J (χ 1 , ˜ χ 2 ). (F.13) Now, S u (G e ) := inf ˜ χ 1 ∈ ˜ X 1 sup ˜ χ 2 ∈ ˜ X 2 J (˜ χ 1 , ˜ χ 2 ) (F.14) ≤ inf χ 1 ∈X 1 sup ˜ χ 2 ∈ ˜ X 2 J (χ 1 , ˜ χ 2 ) (F.15) = inf χ 1 ∈X 1 sup χ 2 ∈X 2 J (χ 1 ,χ 2 ), (F.16) =:S u (G v ). (F.17) 239 where the inequality (F.15) is true sinceX 1 ⊆ ˜ X 1 and the equality in (F.16) follows from (F.13). Therefore, S u (G e )≤S u (G v ). We can use similar arguments to show that S l (G v )≤S l (G e ). F.4 Proof of Lemma 8.2 We begin with defining the following transformations for each time t. Recall thatS t is the set of all possible common information beliefs at timet andB i t is the prescription space for virtual player i at time t. Definition F.1. (i) Let P j t :S t ×B 1 t ×B 2 t → Δ(Z t+1 ×X t+1 ×P 1 t+1 ×P 2 t+1 ) be defined as P j t (π t ,γ 1:2 t ;z t+1 ,x t+1 ,p 1:2 t+1 ) (F.18) := X xt,p 1:2 t ,u 1:2 t π t (x t ,p 1:2 t )γ 1 t (p 1 t ;u 1 t )γ 2 t (p 2 t ;u 2 t )P[x t+1 ,p 1:2 t+1 ,z t+1 |x t ,p 1:2 t ,u 1:2 t ]. (F.19) We will use P j t (π t ,γ 1:2 t ) as a shorthand for the probability distribution P j t (π t ,γ 1:2 t ;·,·,·). The distributionP j t (π t ,γ 1:2 t ) can be viewed as a joint distribution over the variablesZ t+1 ,X t+1 ,P 1:2 t+1 if the distribution on X t ,P 1:2 t is π t and prescriptions γ 1:2 t are chosen by the virtual players at time t. (ii) Let P m t :S t ×B 1 t ×B 2 t → ΔZ t+1 be defined as P m t (π t ,γ 1:2 t ;z t+1 ) = X x t+1 ,p 1:2 t+1 P j t (π t ,γ 1:2 t ;z t+1 ,x t+1 ,p 1:2 t+1 ). (F.20) The distribution P m t (π t ,γ 1:2 t ) is the marginal distribution of the variable Z t+1 obtained from the joint distribution P j t (π t ,γ 1:2 t ) defined above. 
(iii) Let F t :S t ×B 1 t ×B 2 t ×Z t+1 →S t+1 be defined as F t (π t ,γ 1:2 t ,z t+1 ) = P j t (πt,γ 1:2 t ;z t+1 ,·,·) P m t (πt,γ 1:2 t ;z t+1 ) if P m t (π t ,γ 1:2 t ;z t+1 )> 0 G t (π t ,γ 1:2 t ,z t+1 ) otherwise, (F.21) 240 where G t :S t ×B 1 t ×B 2 t ×Z t+1 →S t+1 can be any arbitrary measurable mapping. One such mapping is the one that maps every element π t ,γ 1:2 t ,z t+1 to the uniform distribution over the finite spaceX t+1 ×P 1 t+1 ×P 2 t+1 . Let the collection of transformations F t that can be constructed using the method described in Definition G.2 be denoted by B. Note that the transformations P j t ,P m t and F t do not depend on the strategy profile (˜ χ 1 , ˜ χ 2 ) because the termP[x t+1 ,p 1:2 t+1 ,z t+1 |x t ,p 1:2 t ,u 1:2 t ] in (G.10) depends only on the system dynamics (see equations (7.12) – (7.16)) and not on the strategy profile (˜ χ 1 , ˜ χ 2 ). Consider a strategy profile (˜ χ 1 , ˜ χ 2 ). Note that the number of possible realizations of common information and prescription history under (˜ χ 1 , ˜ χ 2 ) is finite. Let c t+1 ,γ 1:2 1:t be a realization of the common information and prescription history at timet + 1 with non-zero probability of occurrence under (˜ χ 1 , ˜ χ 2 ). For this realization of virtual players’ information, the common information based belief on the state and private information at time t + 1 is given by π t+1 (x t+1 ,p 1:2 t+1 ) =P (˜ χ 1 ,˜ χ 2 ) [X t+1 =x t+1 ,P 1:2 t+1 =p 1:2 t+1 |c t+1 ,γ 1:2 1:t ] =P (˜ χ 1 ,˜ χ 2 ) [X t+1 =x t+1 ,P 1:2 t+1 =p 1:2 t+1 |c t ,γ 1:2 1:t−1 ,z t+1 ,γ 1:2 t ] = P (˜ χ 1 ,˜ χ 2 ) [X t+1 =x t+1 ,P 1:2 t+1 =p 1:2 t+1 ,Z t+1 =z t+1 |c t ,γ 1:2 1:t ] P (˜ χ 1 ,˜ χ 2 ) [Z t+1 =z t+1 |c t ,γ 1:2 1:t ] . (F.22) Notice that the expression (F.22) is well-defined, that is, the denominator is non-zero be- cause of our assumption that the realization c t+1 ,γ 1:2 1:t has non-zero probability of occurrence. Let us consider the numerator in the expression (F.22). For convenience, we will denote it with P (˜ χ 1 ,˜ χ 2 ) [x t+1 ,p 1:2 t+1 ,z t+1 |c t ,γ 1:2 1:t ]. We have P (˜ χ 1 ,˜ χ 2 ) [x t+1 ,p 1:2 t+1 ,z t+1 |c t ,γ 1:2 1:t ] = X xt,p 1:2 t ,u 1:2 t π t (x t ,p 1:2 t )γ 1 t (p 1 t ;u 1 t )γ 2 t (p 2 t ;u 2 t )P (˜ χ 1 ,˜ χ 2 ) [x t+1 ,p 1:2 t+1 ,z t+1 |c t ,γ 1:2 1:t ,x t ,p 1:2 t ,u 1:2 t ] (F.23) = X xt,p 1:2 t ,u 1:2 t π t (x t ,p 1:2 t )γ 1 t (p 1 t ;u 1 t )γ 2 t (p 2 t ;u 2 t )P[x t+1 ,p 1:2 t+1 ,z t+1 |x t ,p 1:2 t ,u 1:2 t ] (F.24) =P j t (π t ,γ 1:2 t ;z t+1 ,x t+1 ,p 1:2 t+1 ), (F.25) 241 where π t is the common information belief on X t ,P 1 t ,P 2 t at time t given the realization 1 c t ,γ 1:2 1:t−1 andP j t is as defined in Definition G.2. The equality in (F.24) is due to the structure of the system dynamics in game G e described by equations (7.12) – (7.16). Similarly, the denominator in (F.22) satisfies 0 0 0 if α = 0, where 0 is a zero-vector of size|X t ×P 1 t ×P 2 t |. Having extended the domain of the above functions, we can also extend the domain of the argument π t in the functions w u t (·),w l t (·),V u t (·),V l t (·) defined in the dynamic programs of Section 8.3.3. First, for any 0≤ α≤ 1 and π T +1 ∈S T +1 , define V u T +1 (απ T +1 ) := 0. 
We can then define the following functions for every t≤T in a backward inductive manner: For any γ i t ∈B i t ,i = 1, 2, 0≤α≤ 1 and π t ∈S t , let w u t (απ t ,γ 1 t ,γ 2 t ) := ˜ c t (απ t ,γ 1 t ,γ 2 t ) + X z t+1 P m t (απ t ,γ 1:2 t ;z t+1 )V u t+1 (F t (απ t ,γ 1:2 t ,z t+1 )) (F.33) V u t (απ t ) := inf γ 1 t sup γ 2 t w u t (απ t ,γ 1 t ,γ 2 t ). (F.34) Note that whenα = 1, the above definition ofw u t is equal to the definition ofw u t in equation (8.23) of the dynamic program. We can similarly extend w l t and V l t . These extended value functions 243 satisfy the following homogeneity property. A similar result was shown in [83, Lemma III.1] for a special case of our model. Lemma F.1. For any constant 0≤ α≤ 1 and any π t ∈ ¯ S t , we have αV u t (π t ) = V u t (απ t ) and αV l t (π t ) =V l t (απ t ). Proof. The proof can be easily obtained from the above definitions of the extended functions. The following lemmas will be used in Appendix G.3 to establish some useful properties of the extended functions. Lemma F.2. Let V : ¯ S t+1 → R be a continuous function satisfying V (απ) = αV (π) for every 0≤α≤ 1 and π∈ ¯ S t+1 . Define V 0 (π t ,γ 1 t ,γ 2 t ) := X z t+1 P m t (π t ,γ 1:2 t ;z t+1 )[V (F t (π t ,γ 1:2 t ,z t+1 ))]. For a fixed γ 1 t ,γ 2 t , V 0 (·,γ 1 t ,γ 2 t ) is a function from ¯ S t+1 toR. Then, the family of functions F 1 :={V 0 (·,γ 1 t ,γ 2 t ) :γ i t ∈B i t ,i = 1, 2} (F.35) is equicontinuous. Similarly, the following families of functions F 2 :={V 0 (π t ,·,γ 2 t ) :γ 2 t ∈B 2 t ,π t ∈ ¯ S t } (F.36) F 3 :={V 0 (π t ,γ 1 t ,·) :γ 1 t ∈B 1 t ,π t ∈ ¯ S t } (F.37) are equicontinuous in their respective arguments. Proof. A continuous function is bounded and uniformly continuous over a compact domain (see Theorem 4.19 in [127]). Therefore, V is bounded and uniformly continuous over ¯ S t+1 . Using the fact that V (απ) =αV (π) and the definition of F t in Definition G.2, the function V 0 can be written as V 0 (π t ,γ 1 t ,γ 2 t ) = X z t+1 V P j t (π t ,γ 1:2 t ;z t+1 ,·,·) . (F.38) 244 Recall that P j t is trilinear in π t ,γ 1 t and γ 2 t with bounded coefficients for a fixed value of z t+1 (see (G.8)). Therefore, for each z t+1 , {P j t (·,γ 1 t ,γ 2 t ,z t+1 )} is an equicontinuous family of func- tions in the argument π t , where P j t (π t ,γ 1 t ,γ 2 t ,z t+1 ) is a short hand notation for the measure P j t (π t ,γ 1 t ,γ 2 t ,z t+1 ,·,·) over the spaceX t+1 ×P 1 t+1 ×P 2 t+1 . Also, since V is uniformly continuous, the family n V P j t (·,γ 1:2 t ,z t+1 ) o is equicontinuous in π t for each z t+1 . This is because composition with a uniformly continuous function preserves equicontinuity. Therefore, the family of functions F 1 is equicontinuous in π t . We can use similar arguments to prove equicontinuity of the other two families. Lemma F.3. Let w :B 1 t ×B 2 t →R be a function such that (i) the family of functions{w(·,γ 2 ) : γ 2 ∈B 2 t } is equicontinuous in the first argument; (ii) the family of functions{w(γ 1 ,·) :γ 1 ∈B 1 t } is equicontinuous in the second argument. Then sup γ 2w(γ 1 ,γ 2 ) is a continuous function of γ 1 and, similarly, inf γ 1w(γ 1 ,γ 2 ) is a continuous function of γ 2 . Proof. Let > 0. For a given γ 1 , there exists a δ> 0 such that |w(γ 1 ,γ 2 )−w(γ 01 ,γ 2 )|≤ ∀γ 2 ,∀||γ 1 −γ 01 ||≤δ. (F.39) Let ¯ γ 2 be a prescription such that w(γ 1 , ¯ γ 2 ) = sup γ 2 w(γ 1 ,γ 2 ). (F.40) Note that the existence of ¯ γ 2 is guaranteed because of continuity ofw(γ 1 ,·) in the second argument and compactness ofB 2 t . 
Pick anyγ 01 satisfying||γ 1 −γ 01 ||≤δ. Let ¯ γ 02 be a prescription such that w(γ 01 , ¯ γ 02 ) = sup γ 2 w(γ 01 ,γ 2 ). (F.41) Using (F.39), we have (i) w(γ 1 , ¯ γ 2 )−w(γ 01 , ¯ γ 02 )≥w(γ 1 , ¯ γ 02 )−w(γ 01 , ¯ γ 02 ) ≥−, (F.42) (ii) w(γ 1 , ¯ γ 2 )−w(γ 01 , ¯ γ 02 )≤w(γ 1 , ¯ γ 2 )−w(γ 01 , ¯ γ 2 ) 245 ≤. (F.43) Equations (F.40) - (F.43) imply that sup γ 2w(γ 1 ,γ 2 ) is a continuous function of γ 1 . We can use a similar argument for showing continuity of inf γ 1w(γ 1 ,γ 2 ) in γ 2 . F.6 Proof of Lemma 8.3 We first use the definitions of extensions of w u t ,w l t ,V u t ,V l t in Appendix F.5 and Lemmas G.1 and F.2 to establish the following equicontinuity result. Lemma F.4. The families of functions F a t :={w u t (·,γ 1 t ,γ 2 t ) :γ i t ∈B i t ,i = 1, 2} (F.44) F b t :={w u t (π t ,·,γ 2 t ) :γ 2 t ∈B 2 t ,π t ∈ ¯ S t } (F.45) F c t :={w u t (π t ,γ 1 t ,·) :γ 1 t ∈B 1 t ,π t ∈ ¯ S t } (F.46) are all equicontinuous in their arguments for every t≤T . A similar statement holds for w l t . Proof. We use a backward induction argument for the proof. For induction, assume that V u t+1 is a continuous function for some t≤ T . This is clearly true for t = T . Using the continuity of V u t+1 we will establish the statement of the lemma for time t and also prove the continuity of V u t . This establishes the lemma for all t≤T . Equicontinuity of w u t : Since ˜ c t (π t ,γ 1 t ,γ 2 t ) is linear in π t with uniformly bounded coefficients for any given γ 1:2 t (see (G.14)), it is equicontinuous in the argument π t . In Lemma G.1, we showed that the value functions V u t satisfy the condition V u t (απ) = αV u t (π) for every 0≤ α≤ 1, π∈S t . Further, due to our induction hypothesis, V u t+1 is continuous. Thus, using Lemma F.2, the second term of w u t , X z t+1 P m t (π t ,γ 1:2 t ;z t+1 )V u t+1 (F t (π t ,γ 1:2 t ,z t+1 )), is also equicontinuous in π t . Hence, the familyF a t is equicontinuous in π t . 246 Continuity of V u t : Due to the equicontinuity of the family F a t , we have the following. For any given > 0 and π t ∈ ¯ S t , there exists a δ> 0 such that |w u t (π t ,γ 1 t ,γ 2 t )−w u t (π 0 t ,γ 1 t ,γ 2 t )|< (F.47) for every γ 1 t ,γ 2 t and π 0 t satisfying||π t −π 0 t ||<δ. Therefore, w u t (π t ,γ 1 t ,γ 2 t )<w u t (π 0 t ,γ 1 t ,γ 2 t ) +∀γ 1 t ,γ 2 t (F.48) =⇒ sup γ 2 t w u t (π t ,γ 1 t ,γ 2 t )≤ sup γ 2 t w u t (π 0 t ,γ 1 t ,γ 2 t ) +∀γ 1 t (F.49) =⇒ inf γ 1 t sup γ 2 t w u t (π t ,γ 1 t ,γ 2 t )≤ inf γ 1 t sup γ 2 t w u t (π 0 t ,γ 1 t ,γ 2 t ) + =⇒V u t (π t )≤V u t (π 0 t ) +, (F.50) for every π 0 t that satisfies||π t −π 0 t ||<δ. Similarly, V u t (π t )≥V u t (π 0 t )− for every π 0 t that satisfies ||π t −π 0 t ||<δ. Therefore, V u t (π t ) is continuous at π t . Hence, by induction, we can say that the familyF a t is equicontinuous inπ t for everyt≤T . We can use similar arguments to prove the equicontinuity of the other families. The continuity of w u t established above implies that sup γ 2 t w u t (π t ,γ 1 t ,γ 2 t ) is achieved for ev- ery π t ,γ 1 t . Further, Lemma F.4 implies that w u t and w l t satisfy the equicontinuity conditions in Lemma F.3 for any given realization of belief π t . Therefore, we can use Lemma F.3 to argue that sup γ 2 t w u t (π t ,γ 1 t ,γ 2 t ) is continuous inγ 1 t . And sinceγ 1 t lies in the compact spaceB 1 t , a minmaximizer exists for the function w u t . 
Further, we can use the measurable selection condition (see Condition 3.3.2 in [59]) to prove the existence of measurable mapping Ξ 1 t (π t ) as defined in Lemma 8.3. A sim- ilar argument can be made to establish the existence of a maxminimizer and a measurable mapping Ξ 2 t (π t ) as defined in Lemma 8.3. This concludes the proof of Lemma 8.3. 247 F.7 Proof of Theorem 8.2 Let us first define a distribution ˜ Π t over the spaceX t ×P 1 t ×P 2 t in the following manner. The distribution ˜ Π t , given C t , Γ 1:2 1:t−1 , is recursively obtained using the following relation ˜ Π 1 (x 1 ,p 1 1 ,p 2 1 ) =P[X 1 =x 1 ,P 1 1 =p 1 1 ,P 2 1 =p 2 1 |C 1 ]∀x 1 ,p 1 1 ,p 2 1 , (F.51) ˜ Π τ+1 =F τ ( ˜ Π τ , Γ 1 τ , Γ 2 τ ,Z τ+1 ), τ≥ 1, (F.52) where F τ is as defined in Definition G.2 in Appendix G.2. We refer to this distribution as the strategy-independent common information belief (SI-CIB). Let ˜ χ 1 ∈ ˜ H 1 be any strategy for virtual player 1 in gameG e . Consider the problem of obtaining virtual player 2’s best response to the strategy ˜ χ 1 with respect to the cost J (˜ χ 1 , ˜ χ 2 ) defined in (8.16). This problem can be formulated as a Markov decision process (MDP) with common information and prescription history C t , Γ 1:2 1:t−1 as the state. The control action at time t in this MDP is Γ 2 t , which is selected based on the information C t , Γ 1:2 1:t−1 using strategy ˜ χ 2 ∈H 2 . The evolution of the state C t , Γ 1:2 1:t−1 of this MDP is as follows {C t+1 , Γ 1:2 1:t } ={C t ,Z t+1 , Γ 1:2 1:t−1 , ˜ χ 1 t (C t , Γ 1:2 1:t−1 ), Γ 2 t }, (F.53) where P (˜ χ 1 ,˜ χ 2 ) [Z t+1 =z t+1 |C t , Γ 1:2 1:t−1 , Γ 2 t ] =P m t [ ˜ Π t , Γ 1 t , Γ 2 t ;z t+1 ], (F.54) almost surely. Here, Γ 1 t = ˜ χ 1 t (C t , Γ 1:2 1:t−1 ) and the transformation P m t is as defined in Definition G.2 in Appendix G.2. Notice that due to Lemma 8.2, the common information belief Π t associated with any strategy profile (˜ χ 1 , ˜ χ 2 ) is equal to ˜ Π t almost surely. This results in the state evolution equation in (F.54). The objective of this MDP is to maximize, for a given ˜ χ 1 , the following cost E (˜ χ 1 ,˜ χ 2 ) " T X t=1 ˜ c t ( ˜ Π t , Γ 1 t , Γ 2 t ) # , (F.55) 248 where ˜ c t is as defined in equation (G.14). Due to Lemma 8.2, the total expected cost defined above is equal to the costJ (˜ χ 1 , ˜ χ 2 ) defined in (8.16). The MDP described above can be solved using the following dynamic program. For every realization of virtual players’ information c T +1 ,γ 1:2 1:T , define V ˜ χ 1 T +1 (c T +1 ,γ 1:2 1:T ) := 0. In a backward inductive manner, for each time t≤T and each realization c t ,γ 1:2 1:t−1 , define V ˜ χ 1 t (c t ,γ 1:2 1:t−1 ) := sup γ 2 t [˜ c t (˜ π t ,γ 1 t ,γ 2 t ) +E[V ˜ χ 1 t+1 (c t ,Z t+1 ,γ 1:2 1:t )|c t ,γ 1:2 1:t ]], (F.56) where γ 1 t = ˜ χ 1 t (c t ,γ 1:2 1:t−1 ) and ˜ π t is the SI-CIB associated with the information c t ,γ 1:2 1:t−1 . Note that the measurable selection condition (see condition 3.3.2 in [59]) holds for the dynamic program described above. Thus, the value functions V ˜ χ 1 t (·) are measurable and there exists a measurable best-response strategy for player 2 which is a solution to the dynamic program described above. Therefore, we have sup ˜ χ 2 J (˜ χ 1 , ˜ χ 2 ) =EV ˜ χ 1 1 (C 1 ). (F.57) Claim F.1. 
For any strategy ˜ χ 1 ∈ ˜ H 1 and for any realization of virtual players’ information c t ,γ 1:2 1:t−1 , we have V ˜ χ 1 t (c t ,γ 1:2 1:t−1 )≥V u t (˜ π t ), (F.58) where V u t is as defined in (8.23) and ˜ π t is the SI-CIB belief associated with the instance c t ,γ 1:2 1:t−1 . As a consequence, we have sup ˜ χ 2 J (˜ χ 1 , ˜ χ 2 )≥EV u 1 (Π 1 ). (F.59) 249 Proof. The proof is by backward induction. Clearly, the claim is true at time t =T + 1. Assume that the claim is true for all times greater than t. Then we have V ˜ χ 1 t (c t ,γ 1:2 1:t−1 ) = sup γ 2 t [˜ c t (˜ π t ,γ 1 t ,γ 2 t ) +E[V ˜ χ 1 t+1 (c t ,Z t+1 ,γ 1:2 1:t )|c t ,γ 1:2 1:t ]] ≥ sup γ 2 t [˜ c t (˜ π t ,γ 1 t ,γ 2 t ) +E[V u t+1 (F t (˜ π t ,γ 1:2 t ,Z t+1 ))|c t ,γ 1:2 1:t ]] ≥V u t (˜ π t ). The first equality follows from the definition in (F.56) and the inequality after that follows from the induction hypothesis. The last inequality is a consequence of the definition of the value function V u t . This completes the induction argument. Further, using Claim F.1 and the result in (F.57), we have sup ˜ χ 2 J (˜ χ 1 , ˜ χ 2 ) =EV ˜ χ 1 1 (C 1 )≥EV u 1 ( ˜ Π 1 ) =EV u 1 (Π 1 ). We can therefore say that S u (G e ) = inf ˜ χ 1 sup ˜ χ 2 J (˜ χ 1 , ˜ χ 2 )≥ inf ˜ χ 1 EV u 1 (Π 1 ) =EV u 1 (Π 1 ). (F.60) Further, for the strategy ˜ χ 1∗ defined in Definition 8.3, the inequalities (F.58) and (F.59) hold with equality. We can prove this using an inductive argument similar to the one used to prove Claim F.1. Therefore, we have S u (G e ) = inf ˜ χ 1 sup ˜ χ 2 J (˜ χ 1 , ˜ χ 2 )≤ sup ˜ χ 2 J (˜ χ 1∗ , ˜ χ 2 ) =EV ˜ χ 1∗ 1 (C 1 ) =EV u 1 (Π 1 ). (F.61) Combining (F.60) and (F.61), we have S u (G e ) =EV u 1 (Π 1 ). 250 Thus, the inequality in (F.61) holds with equality which leads us to the result that the strategy ˜ χ 1∗ is a min-max strategy in gameG e . A similar argument can be used to show that S l (G e ) =EV l 1 (Π 1 ), and that the strategy ˜ χ 2∗ defined in Definition 8.3 is a max-min strategy in game G e . 251 Appendix G Zero-sum Games between Teams G.1 Proof of Lemma 8.1 For a given strategy g i ∈G i , let us define a strategy χ i ∈X i in the following manner. For each time t, each instance of common information c t ∈C t and player j = 1,...,N i , let γ i,j t . =g i,j t (c t ,·), (G.1) and let γ i t . = (γ i,1 t ,...,γ i,N i t ). Note that the partial function g i,j t (c t ,·) is a mapping fromP i,j t to ΔU i,j t . Let χ i t (c t ) . =γ i t . (G.2) We will denote this correspondence between strategies in Game G and G v withM i :G i →H i , i = 1, 2, i.e., χ i =M i (g i ). One can easily verify that the mappingM i is bijective. Further, for every g 1 ∈G 1 and g 2 ∈G 2 , we have J(g 1 ,g 2 ) =J (M 1 (g 1 ),M 2 (g 2 )). (G.3) We refer the reader to Appendix A of [103] for a proof of the above statement. Therefore, for any strategy g 1 ∈G 1 , we have sup g 2 ∈G 2 J(g 1 ,g 2 ) = sup g 2 ∈G 2 J (M 1 (g 1 ),M 2 (g 2 )) (G.4) 252 = sup χ 2 ∈X 2 J (M 1 (g 1 ),χ 2 ). (G.5) Consequently, inf g 1 ∈G 1 sup g 2 ∈G 2 J(g 1 ,g 2 ) = inf g 1 ∈G 1 sup χ 2 ∈X 2 J (M 1 (g 1 ),χ 2 ) (G.6) = inf χ 1 ∈X 1 sup χ 2 ∈X 2 J (χ 1 ,χ 2 ). (G.7) This implies that S u (G) =S u (G v ). We can similarly prove that S l (G) =S l (G v ). Remark G.1. We can also show that a strategy g 1 is a min-max strategy in Game G if and only ifM 1 (g 1 ) is a min-max strategy in Game G v . A similar statement holds for max-min strategies as well. G.2 Belief Update and Instantaneous Cost Definition G.1 (Notation for Prescriptions). 
For virtual playeri at timet, letγ i = (γ i,1 ,...,γ i,N i )∈ B i t be a prescription. We use γ i,j (p i,j t ;u i,j t ) to denote the probability assigned to action u i,j t by the distribution γ i,j (p i,j t ). We will also use the following notation: γ i (p i t ;u i t ) . = N i Y j=1 γ i,j (p i,j t ;u i,j t ) ∀p i t ∈P i t ,u i t ∈U i t . Here,γ i (p i t ;u i t ) is the probability assigned to a team actionu i t by the prescriptionγ i when the team’s private information is p i t . We begin with defining the following transformations for each time t. Recall thatS t is the set of all possible common information beliefs at time t andB i t is the prescription space for virtual player i at time t. Definition G.2. 1. Let P j t :S t ×B 1 t ×B 2 t → Δ(Z t+1 ×X t+1 ×P 1 t+1 ×P 2 t+1 ) be defined as P j t (π t ,γ 1:2 t ;z t+1 ,x t+1 ,p 1:2 t+1 ) (G.8) := X xt,p 1:2 t ,u 1:2 t π t (x t ,p 1:2 t )γ 1 t (p 1 t ;u 1 t )γ 2 t (p 2 t ;u 2 t ) (G.9) 253 ×P[x t+1 ,p 1:2 t+1 ,z t+1 |x t ,p 1:2 t ,u 1:2 t ]. (G.10) We will use P j t (π t ,γ 1:2 t ) as a shorthand for the probability distribution P j t (π t ,γ 1:2 t ;·,·,·). The distributionP j t (π t ,γ 1:2 t ) can be viewed as a joint distribution over the variablesZ t+1 ,X t+1 ,P 1:2 t+1 if the distribution on X t ,P 1:2 t is π t and prescriptions γ 1:2 t are chosen by the virtual players at time t. 2. Let P m t :S t ×B 1 t ×B 2 t → ΔZ t+1 be defined as P m t (π t ,γ 1:2 t ;z t+1 ) . = X x t+1 ,p 1:2 t+1 P j t (π t ,γ 1:2 t ;z t+1 ,x t+1 ,p 1:2 t+1 ). (G.11) The distribution P m t (π t ,γ 1:2 t ) is the marginal distribution of the variable Z t+1 obtained from the joint distribution P j t (π t ,γ 1:2 t ) defined above. 3. Let F t :S t ×B 1 t ×B 2 t ×Z t+1 →S t+1 be defined as F t (π t ,γ 1:2 t ,z t+1 ) . = (G.12) P j t (πt,γ 1:2 t ;z t+1 ,·,·) P m t (πt,γ 1:2 t ;z t+1 ) if P m t (π t ,γ 1:2 t ;z t+1 )> 0 G t (π t ,γ 1:2 t ,z t+1 ) otherwise, (G.13) where G t :S t ×B 1 t ×B 2 t ×Z t+1 →S t+1 can be any arbitrary measurable mapping. One such mapping is the one that maps every element π t ,γ 1:2 t ,z t+1 to the uniform distribution over the finite spaceX t+1 ×P 1 t+1 ×P 2 t+1 . Let the collection of transformations F t that can be constructed using the method described in Definition G.2 be denoted byB. Definition G.3. The cost function ˜ c t is as defined as ˜ c t (π,γ 1 ,γ 2 ) := (G.14) X xt,p 1:2 t ,u 1:2 t c t (x t ,u 1 t ,u 2 t )π(x t ,p 1 t ,p 2 t )γ 1 (p 1 t ;u 1 t )γ 2 (p 2 t ;u 2 t ). 254 G.3 Domain Extension and Proof of Lemma 8.3 G.3.0.1 Domain Extension Recall thatS t is the set of all probability distributions over the finite setX t ×P 1 t ×P 2 t . Define ¯ S t :={απ t : 0≤α≤ 1,π t ∈S t }. (G.15) The functions ˜ c t in (G.14), P j t in (G.8), P m t in (G.11) and F t in (G.13) were defined for any π t ∈S t . We will extend the domain of the argument π t in these functions to ¯ S t as follows. For any γ i t ∈B i t ,i = 1, 2, z t+1 ∈Z t+1 , 0≤α≤ 1 and π t ∈S t , let 1. ˜ c t (απ t ,γ 1 t ,γ 2 t ) :=α˜ c t (π t ,γ 1 t ,γ 2 t ) 2. P j t (απ t ,γ 1:2 t ) :=αP j t (π t ,γ 1:2 t ) 3. P m t (απ t ,γ 1:2 t ) :=αP m t (π t ,γ 1:2 t ) 4. F t (απ t ,γ 1:2 t ,z t+1 ) := F t (π t ,γ 1:2 t ,z t+1 ) if α> 0 0 if α = 0, where 0 is a zero-vector of size|X t ×P 1 t ×P 2 t |. Having extended the domain of the above functions, we can also extend the domain of the argument π t in the functions w u t (·),w l t (·),V u t (·),V l t (·) defined in the dynamic programs of Section 8.3.3. First, for any 0≤ α≤ 1 and π T +1 ∈S T +1 , define V u T +1 (απ T +1 ) := 0. 
We can then define the following functions for every t≤T in a backward inductive manner: For any γ i t ∈B i t ,i = 1, 2, 0≤α≤ 1 and π t ∈S t , let w u t (απ t ,γ 1 t ,γ 2 t ) . = ˜ c t (απ t ,γ 1 t ,γ 2 t ) (G.16) + X z t+1 P m t (απ t ,γ 1:2 t ;z t+1 )V u t+1 (F t (απ t ,γ 1:2 t ,z t+1 )) V u t (απ t ) . = inf γ 1 t sup γ 2 t w u t (απ t ,γ 1 t ,γ 2 t ). Note that whenα = 1, the above definition ofw u t is equal to the definition ofw u t in equation (8.23) of the dynamic program. We can similarly extend w l t and V l t . 255 Lemma G.1. For any constant 0≤ α≤ 1 and any π t ∈ ¯ S t , we have αV u t (π t ) = V u t (απ t ) and αV l t (π t ) =V l t (απ t ). G.3.0.2 Proof of Lemma 8.3 We will first prove inductively that the function w u t : ¯ S t ×B 1 t ×B 2 t →R is continuous. Let us as assume that value function V u t+1 is continuous for some t≤ T . Note that this assumption clearly holds at t =T . At time t, we have w u t (π t ,γ 1 t ,γ 2 t ) = ˜ c t (π t ,γ 1 t ,γ 2 t ) +E[V u t+1 (F t (π t ,γ 1:2 t ,Z t+1 ))|π t ,γ 1:2 t ] = ˜ c t (π t ,γ 1 t ,γ 2 t ) + X z t+1 P m t (π t ,γ 1:2 t ;z t+1 )V u t+1 (F t (π t ,γ 1:2 t ,z t+1 )) = ˜ c t (π t ,γ 1 t ,γ 2 t ) + X z t+1 V u t+1 (P j t (π t ,γ 1:2 t ;z t+1 ,·,·)), (G.17) where the last inequality follows from the homogeneity property of the value functions in Lemma G.1 and the structure of the belief update in (G.13). The first term in (G.17) is clearly continuous (see G.14). Also, the transformationP j t (π t ,γ 1:2 t ;z t+1 ,·,·) defined in (G.8) is a continuous function. Therefore, by our induction hypothesis, the composition V u t+1 (P j t (·)) is continuous in (π t ,γ 1:2 t ) for every common observationz t+1 . Thus,w u t is continuous in its arguments. To complete our inductive argument, we need to show thatV u t is a continuous function and to this end, we will use the Berge’s maximum theorem (Lemma 17.31) in [53]. Since w u t is continuous andB 2 t is compact, we can use the maximum theorem to conclude that the following function v u t (π t ,γ 1 t ) . = sup γ 2 t w u t (π t ,γ 1 t ,γ 2 t ), (G.18) is continuous in π t ,γ 1 t . Once again, we can use the maximum theorem to conclude that V u t (π t ) = inf γ 1 t v u t (π t ,γ 1 t ) (G.19) 256 is continuous in π t . This concludes our inductive argument. Also, because of the continuity of v u t in (G.18), we can use the measurable selection condition (see Condition 3.3.2 in [59]) to prove the existence of the measurable mapping Ξ 1 t (π t ) as defined in Lemma 8.3. A similar argument can be made to establish the existence of a maxminimizer and a measurable mapping Ξ 2 t (π t ) as defined in Lemma 8.3. This concludes the proof of Lemma 8.3. G.4 Information Structures that Satisfy Assumption 8.2: Proofs For each model in 8.4.1, we will accordingly construct transformations Q t :S t ×B 1 t ×Z t+1 → R |X t+1 ×P 1:2 t+1 | and R t :S t ×B 1 t ×Z t+1 →R at each time t such that for all π t ,γ 1:2 t ,z t+1 , P m t (π t ,γ 1:2 t ;z t+1 )≤R t (π t ,γ 1 t ,z t+1 ), (G.20a) and for everyP m t (π t ,γ 1:2 t ;z t+1 )> 0, P j t (π t ,γ 1:2 t ,z t+1 ,·) P m t (π t ,γ 1:2 t ;z t+1 ) = Q t (π t ,γ 1 t ,z t+1 ) R t (π t ,γ 1 t ,z t+1 ) . (G.20b) Note that the transformations Q t and R t do not make use of virtual player 2’s prescription γ 2 t . 
Following the methodology in Definition G.2, we define F t as F t (π t ,γ 1:2 t ,z t+1 ) (G.21) = P j t (πt,γ 1:2 t ,z t+1 ,·) P m t (πt,γ 1:2 t ;z t+1 ) ifP m t (π t ,γ 1:2 t ;z t+1 )> 0 G t (π t ,γ 1:2 t ,z t+1 ) otherwise, where the transformation G t is chosen to be G t (π t ,γ 1:2 t ,z t+1 ) (G.22) = Qt(πt,γ 1 t ,z t+1 ) Rt(πt,γ 1 t ,z t+1 ) if R t (π t ,γ 1 t ,z t+1 )> 0 U(X t+1 ×P 1:2 t+1 ) otherwise, 257 whereU(X t+1 ×P 1:2 t+1 ) is the uniform distribution over the spaceX t+1 ×P 1:2 t+1 . Since the transforma- tionsQ t andR t satisfy (G.20a) and (G.20b), we can simplify the expression for the transformation F t in (G.21) to obtain the following F t (π t ,γ 1:2 t ,z t+1 ) (G.23) = Qt(πt,γ 1 t ,z t+1 ) Rt(πt,γ 1 t ,z t+1 ) if R t (π t ,γ 1 t ,z t+1 )> 0 U(X t+1 ×P 1:2 t+1 ) otherwise. This concludes the construction of an update ruleF t in the classB that does not use virtual player 2’s prescription γ 2 t . We will now describe the the construction of the transformations Q t and R t for each information structure in Section 8.4.1. G.4.0.1 All players in Team 2 have the same information In this case, Team 2 does not have any private information and any instance of the common observa- tionz t+1 includes Team 2’s action at timet (denote it with ˆ u 2 t ). The corresponding transformation P j t (see Definition G.2) has the following form. P j t (π t ,γ 1:2 t ;z t+1 ,x t+1 ,p 1 t+1 ) (G.24) = X xt,p 1:2 t ,u 1:2 t π t (x t ,p 1:2 t )γ 1 t (p 1 t ;u 1 t )γ 2 t (p 2 t ;u 2 t )P[x t+1 ,p 1:2 t+1 ,z t+1 |x t ,p 1:2 t ,u 1:2 t ] (G.25) =γ 2 t (?; ˆ u 2 t ) X xt,p 1 t ,u 1 t π t (x t ,p 1 t )γ 1 t (p 1 t ;u 1 t )P[x t+1 ,p 1 t+1 ,z t+1 |x t ,p 1 t ,u 1 t , ˆ u 2 t ] (G.26) =:γ 2 t (?; ˆ u 2 t )Q t (π t ,γ 1 t ,z t+1 ;x t+1 ,p 1 t+1 ). (G.27) Here, we use the fact that Team 2 does not have any private information and ˆ u 2 t is a part of the common observationz t+1 . Similarly, the corresponding transformationP m t (see Definition G.2) has the following form. P m t (π t ,γ 1:2 t ;z t+1 ) =γ 2 t (?; ˆ u 2 t ) X x t+1 ,p 1 t+1 Q t (π t ,γ 1 t ,z t+1 ;x t+1 ,p 1 t+1 ) (G.28) =:γ 2 t (?; ˆ u 2 t )R t (π t ,γ 1 t ,z t+1 ). (G.29) 258 Using the results (G.27) and (G.29), we can easily conclude that the transformations Q t and R t defined above satisfy the conditions (G.20a) and (G.20b). G.4.0.2 Team 2’s observations become common information with a delay of one-step In this case, any instance of the common observation z t+1 includes Team 2’s private information at timet (denote it with ˆ p 2 t ) and Team 2’s action at timet (denote it with ˆ u 2 t ). The corresponding transformationP j t (see Definition G.2) has the following form. P j t (π t ,γ 1:2 t ;z t+1 ,x t+1 ,p 1:2 t+1 ) = X xt,p 1:2 t ,u 1:2 t π t (x t ,p 1:2 t )γ 1 t (p 1 t ;u 1 t )γ 2 t (p 2 t ;u 2 t )P[x t+1 ,p 1:2 t+1 ,z t+1 |x t ,p 1:2 t ,u 1:2 t ] (G.30) =γ 2 t (ˆ p 2 t ; ˆ u 2 t ) X xt,p 1 t ,u 1 t π t (x t ,p 1 t , ˆ p 2 t )γ 1 t (p 1 t ;u 1 t )P[x t+1 ,p 1:2 t+1 ,z t+1 |x t ,p 1 t , ˆ p 2 t ,u 1 t , ˆ u 2 t ] (G.31) =:γ 2 t (ˆ p 2 t ; ˆ u 2 t )Q t (π t ,γ 1 t ,z t+1 ;x t+1 ,p 1:2 t+1 ). (G.32) Here, we use the fact that both ˆ p 2 t and ˆ u 2 t are part of the common observation z t+1 . Similarly, the corresponding transformationP m t (see Definition G.2) has the following form. P m t (π t ,γ 1:2 t ;z t+1 ) =γ 2 t (ˆ p 2 t ; ˆ u 2 t ) X x t+1 ,p 1:2 t+1 Q t (π t ,γ 1 t ,z t+1 ;x t+1 ,p 1:2 t+1 ) (G.33) =:γ 2 t (ˆ p 2 t ; ˆ u 2 t )R t (π t ,γ 1 t ,z t+1 ). 
(G.34) Using the results (G.32) and (G.34), we can conclude that the transformations Q t and R t defined above satisfy the conditions (G.20a) and (G.20b). G.4.0.3 Team 2 does not control the state In this case, the corresponding transformation P j t (see Definition G.2) has the following form. P j t (π t ,γ 1:2 t ;z t+1 ,x t+1 ,p 1:2 t+1 ) = X xt,p 1:2 t ,u 1:2 t π t (x t ,p 1:2 t )γ 1 t (p 1 t ;u 1 t )γ 2 t (p 2 t ;u 2 t )P[x t+1 ,p 1:2 t+1 ,z t+1 |x t ,p 1:2 t ,u 1:2 t ] (G.35) 259 = X xt,p 1:2 t ,u 1:2 t π t (x t ,p 1:2 t )γ 1 t (p 1 t ;u 1 t )γ 2 t (p 2 t ;u 2 t )P[x t+1 ,p 1:2 t+1 ,z t+1 |x t ,p 1:2 t ,u 1 t ] (G.36) = X xt,p 1:2 t ,u 1 t π t (x t ,p 1:2 t )γ 1 t (p 1 t ;u 1 t )P[x t+1 ,p 1:2 t+1 ,z t+1 |x t ,p 1:2 t ,u 1 t ] (G.37) =:Q t (π t ,γ 1 t ,z t+1 ;x t+1 ,p 1:2 t+1 ). (G.38) Note that (G.36) holds because u 2 t does not influence the evolution of state, common and private information and (G.37) follows by summing over u 2 t . Similarly, the corresponding transformation P m t (see Definition G.2) has the following form. P m t (π t ,γ 1:2 t ;z t+1 ) = X x t+1 ,p 1:2 t+1 Q t (π t ,γ 1 t ,z t+1 ;x t+1 ,p 1:2 t+1 ) (G.39) =:R t (π t ,γ 1 t ,z t+1 ). (G.40) Using the results (G.38) and (G.40), we can conclude that the transformations Q t and R t satisfy the conditions (G.20a) and (G.20b). G.4.0.4 Global and local states In this case, the private information variables of each player are part of the system stateX t because P 1 t =X 1 t andP 2 t =X 2 t . Therefore, the common information belief Π t is formed only on the system state. Let us first define a collection of beliefs with a particular structure S 0 t . = n π∈S t :π(x 0 ,x 1 ,x 2 ) = 1 ˆ x (x 0 )π 1|0 (x 1 |x 0 )π 2 (x 2 ) ∀x 0 ,x 1 ,x 2 o , (G.41) where π 1|0 and π 2 denote the respective conditional and marginal distributions formed using the joint distribution π, and ˆ x is some realization of the global state X 0 t . Under the dynamics and information structure in this case, we can show that the belief π t computed recursively using the 260 transformationF t as in (G.13) always lies inS 0 t (for an appropriate initial distribution on the state X 1 ). Therefore, we restrict our attention to beliefs in the restricted setS 0 t . Let π t ∈S 0 t such that π t (x 0 ,x 1 ,x 2 ) = 1 ˆ x (x 0 )π 1|0 t (x 1 |x 0 )π 2 t (x 2 ) ∀x 0 ,x 1 ,x 2 . (G.42) In this case, any instance of the common observation z t+1 comprises of the global state ˆ x 0 t+1 and players’ actions (denoted by ˆ u 1 t , ˆ u 2 t for Teams 1 and 2 respectively). The corresponding trans- formationP j t (see Definition G.2) has the following form. P j t (π t ,γ 1:2 t ;z t+1 ,x t+1 ) (G.43) = X xt,u 1:2 t π t (x t )γ 1 t (x 1 t ;u 1 t )γ 2 t (x 2 t ;u 2 t )P[x t+1 ,z t+1 |x t ,u 1:2 t ] (G.44) = X x 0 t ,x 1 t 1 ˆ x (x 0 t )π 1|0 t (x 1 t |x 0 t )γ 1 t (x 1 t ; ˆ u 1 t )P[x t+1 ,z t+1 |x 0 t ,x 1 t , ˆ u 1:2 t ] X x 2 t π 2 t (x 2 t )γ 2 t (x 2 t ; ˆ u 2 t ) (G.45) = X x 1 t π 1|0 t (x 1 t | ˆ x)γ 1 t (x 1 t ; ˆ u 1 t )P[x t+1 ,z t+1 | ˆ x,x 1 t , ˆ u 1:2 t ] X x 2 t π 2 t (x 2 t )γ 2 t (x 2 t ; ˆ u 2 t ) (G.46) =:Q t (π t ,γ 1 t ,z t+1 ;x t+1 ) X x 2 t π 2 t (x 2 t )γ 2 t (x 2 t ; ˆ u 2 t ) . (G.47) Similarly, the corresponding transformation P m t (see Definition G.2) has the following form. P m t (π t ,γ 1:2 t ;z t+1 ) = X x 2 t π 2 t (x 2 t )γ 2 t (x 2 t ; ˆ u 2 t ) X x t+1 Q t (π t ,γ 1 t ,z t+1 ;x t+1 ) (G.48) =: X x 2 t π 2 t (x 2 t )γ 2 t (x 2 t ; ˆ u 2 t ) R t (π t ,γ 1 t ,z t+1 ). 
(G.49) Using the results (G.47) and (G.49), we can conclude that the transformations Q t and R t satisfy the conditions (G.20a) and (G.20b). G.5 Proof of Theorem 8.5 The following mappings between the strategies in gamesG v andG e and the subsequent lemma will be useful in proving the theorem. 261 Definition G.4. Let % i : ˜ X 1 × ˜ X 2 →X i be an operator that maps a strategy profile (˜ χ 1 , ˜ χ 2 ) in virtual game G e to a strategy χ i for virtual player i in game G v as follows: For t = 1, 2,...,T, χ i t (c t ) . = ˜ χ i t (c t , ˜ γ 1:2 1:t−1 ), (G.50) where ˜ γ j s = ˜ χ j s (c s , ˜ γ 1:2 1:s−1 ) for every 1≤s≤t− 1 and j = 1, 2. We denote the ordered pair (% 1 ,% 2 ) by %. The following lemma is the analogue of Lemma 2 in [72]. Lemma G.2. Let (χ 1 ,χ 2 ) and (˜ χ 1 , ˜ χ 2 ) be strategy profiles for games G v and G e , such that χ i = % i (˜ χ 1 , ˜ χ 2 ), i = 1, 2. Then, J (χ 1 ,χ 2 ) =J (˜ χ 1 , ˜ χ 2 ). (G.51) Let ˜ χ 1∗ ∈ ˜ X 1 be the min-max strategy for virtual player 1 in the expanded virtual game G e as described in Section 8.4.2. Note that the strategy ˜ χ 1∗ uses only the common information c t and virtual player 1’s past prescriptions γ 1 1:t−1 . This is because the CIB update F t in Assumption 8.2 does not depend on virtual player 2’s prescription γ 2 t . Let us define a strategyχ 1∗ ∈X 1 for virtual player 1 in the virtual gameG v . At each time t and for each instance c t ∈C t , χ 1∗ t (c t ) . = Ξ 1 t (π t ). (G.52) Here, Ξ 1 t is the mapping obtained by solving the min-max dynamic program (see Lemma 8.3) and π t is computed using the following relation∀x 1 ,p 1 1 ,p 2 1 π 1 (x 1 ,p 1 1 ,p 2 1 ) =P[X 1 =x 1 ,P 1 1 =p 1 1 ,P 2 1 =p 2 1 |C 1 =c 1 ] π τ+1 =F τ (π τ , Ξ 1 τ (π τ ),z τ+1 ), 1≤τ <t, whereF t is the belief update transformation in Assumption 8.2. Note that the prescriptionχ 1∗ t (c t ) is the same as the one obtained in the “Get prescription” step in Algorithm 4 for common information c t in the t-th iteration. 262 Using Definition G.4, we have χ 1∗ =% 1 (˜ χ 1∗ , ˜ χ 2 ) (G.53) for any strategy ˜ χ 2 ∈ ˜ H 2 . Based on this observation and the fact that for a given χ 2 ∈H 2 , % 2 (˜ χ 1 ,χ 2 ) =χ 2 for every ˜ χ 1 ∈ ˜ H 1 , we have (χ 1∗ ,χ 2 ) =%(˜ χ 1∗ ,χ 2 ). (G.54) Further, due to Theorem 8.1, we have S u (G v )≥S u (G e ) (G.55) a = sup ˜ χ 2 ∈ ˜ H 2 J (˜ χ 1∗ , ˜ χ 2 ) (G.56) b ≥ sup χ 2 ∈H 2 J (˜ χ 1∗ ,χ 2 ) (G.57) c = sup χ 2 ∈H 2 J (χ 1∗ ,χ 2 ) (G.58) ≥S u (G v ). (G.59) where the equality in (a) is because ˜ χ 1∗ is a min-max strategy of G e . Inequality (b) holds because H 2 ⊆ ˜ H 2 . Equality in (c) is a consequence of the result in (G.54) and Lemma G.2. The last inequality simply follows from the definition of the upper value of the virtual game. Therefore, all the inequalities in the display above must hold with equality. Hence, χ 1∗ must be a min-max strategy of gameG v and S u (G v ) =S u (G e ). From Lemma 8.1, we know that S u (G v ) = S u (G) and thus, S u (G) = S u (G e ). Further, we can verify from the description of the strategy g 1∗ in Algorithm 4 that χ 1∗ =M 1 (g 1∗ ), whereM 1 is the transformation defined in the proof of Lemma 8.1. Based on the argument in Remarks G.1 in Appendix G.1, we can conclude that the strategy g 1∗ is a min-max strategy in GameG. G.6 Proof of Lemma 8.5 We will first define a mappingM i :G i →H i . Let g i ∈G i be any strategy for Team i in the original game G. Let c t ∈C t be an instance of common information at time t. Recall that U i,j t = 263 g i,j t (C t ,P i,j t ,K i t ) for j = 1,...,N i . 
For a realization c t of C t and k of K i t , the strategy g i t induces a collection of mappings $ := (g i t (c t ,·,k),...,g i t (c t ,·,k)) ∈ Q i t , whereQ i t is defined in (8.43). Therefore, for a realization c t of C t , the strategy g i t induces a distribution δ ct over the spaceQ i t . This distribution δ ct can formally be defined as δ ct ($) :=P hn g i t (c t ,p i t ,K i t ) =$(p i t )∀p i t ∈P i t oi . (G.60) Then define χ i t (c t ) at time t as χ i t (c t ) :=L i t (δ ct ). (G.61) whereL i t is defined in Definition 8.5. Let us denote this mapping betweenG i andH i withM i . Let us now define a mappingN i :H i →G i . Let χ i ∈H i be any strategy for virtual player i in the virtual gameG v . Letc t ∈C t be an instance of common information at timet. Sinceχ i t (c t )∈B i t , there exists a distribution δ ct overQ i t such that χ i t (c t ) =L i t (δ ct ). (G.62) Note that we can compute δ ct = ¯ L i t (χ i t (c t )) using the linear program described in Remark 8.2. Using the commonly available random variable K i t with teami, the team can first randomly select a mapping $∈Q i t with probability δ ct ($). Let $ i ct = ($ 1,1 ct ,...,$ 1,N i ct ) :=rand(Q i t ,δ ct ,K i t ). (G.63) Then Player j in Team i can then select an action based on this randomly chosen mapping $ i ct in the following manner U i,j t =g i,j t (c t ,p i,j t ,K i t ) :=$ i,j ct (p i,j t ). (G.64) Let us denote this mapping betweenH i andG i withN i . 264 Lemma G.3. Let (g 1 ,g 2 ) be a strategy profile in Game G and (χ 1 ,χ 2 ) be a strategy profile in virtual G v . For these strategy profiles, let one of the four following propositions be true. χ 1 =M 1 (g 1 ) and χ 2 =M 2 (g 2 ) (G.65a) χ 1 =M 1 (g 1 ) and g 2 =N 2 (χ 2 ) (G.65b) g 1 =N 1 (χ 1 ) and χ 2 =M 2 (g 2 ) (G.65c) g 1 =N 1 (χ 1 ) and g 2 =N 2 (χ 2 ). (G.65d) Then at any given time t, for any vector of actions u i t and for any realization (c t ,p i t ) of I i t , we have P g 1 ,g 2 [U i t =u i t |I i t = (c t ,p i t )] =P χ 1 ,χ 2 [U i t =u i t |I i t = (c t ,p i t )]. (G.66) As a consequence, we have J(g 1 ,g 2 ) =J (χ 1 ,χ 2 ). (G.67) Proof. Let χ 1 =M 1 (g 1 ). For a given time t, let c t ,p 1 t be a realization of Team 1’s information at timet in the original gameG. In the virtual gameG v , for the common information instance c t , the prescription chosen by the virtual team 1 is given by χ 1 t (c t ) =: γ 1 t . From the construction of the mappingM 1 , we have γ 1 t =L 1 t (δ 1 ct ), where δ 1 ct is the same as δ ct defined in (G.60) with i = 1. For notational convenience, define the events A $ := n g 1 t (c t ,q 1 t ,K 1 t ) =$(q 1 t )∀q 1 t ∈P 1 t o . (G.68) Then we have P χ 1 ,χ 2 [U 1 t =u 1 t |I 1 t = (c t ,p 1 t )] (G.69) =γ 1 t (p 1 t ,u 1 t ) (G.70) = X $∈Q 1 t δ 1 ct ($)1 $(p 1 t )=u 1 t (G.71) 265 = X $∈Q 1 t P [A $ ] 1 $(p 1 t )=u 1 t (G.72) = X $∈Q 1 t P h A $ ∩{$(p 1 t ) =u 1 t } i (G.73) = X $∈Q 1 t P h A $ ∩{g 1 t (c t ,p 1 t ,K 1 t ) =u 1 t } i (G.74) =P h g 1 t (c t ,p 1 t ,K 1 t ) =u 1 t i (G.75) =P g 1 ,g 2 [U 1 t =u 1 t |I 1 t = (c t ,p 1 t )]. (G.76) Here, (G.69) is because action u 1 t in the virtual game G v are chosen according to the prescription γ 1 t . Equation (G.71) follows from the definition ofL 1 t and (G.72) follows from the construction of δ 1 ct in (G.60). Equation (G.74) follows from the equivalence of the events under consideration and lastly, equation (G.75) follows from the fact that the events A $ are mutually exclusive and exhaustive. Now let g 1 =N 1 (χ 1 ). 
For an instance c t ∈C t , let us again denote χ 1 t (c t ) =: γ 1 t . Recall that according to the mechanismN 1 (see (G.62)), the action u 1 t is chosen using a mapping $∈Q 1 t which is randomly selected according to the distribution δ 1 ct that satisfiesL 1 t (δ 1 ct ) =γ 1 t . Therefore, according to this mechanism, we have P g 1 ,g 2 [U 1 t =u 1 t |I 1 t = (c t ,p 1 t )] (G.77) = X $∈Q 1 t δ 1 ct ($)1 $(p 1 t )=u 1 t (G.78) =γ 1 t (p 1 t ;u 1 t ) =P χ 1 ,χ 2 [U 1 t =u 1 t |I 1 t = (c t ,p 1 t )]. (G.79) Note that (G.79) follows from the definition of the transformationL 1 t . We can show the same for Team 2. Since the dynamics in games G and G v are identical, we can argue inductively that if one of the four propositions in the Lemma holds, then J(g 1 ,g 2 ) = J (χ 1 ,χ 2 ). The following lemma will help us in connecting the inf-sup and sup-inf values of games G and G v . 266 Lemma G.4. For any strategy g 1 ∈G 1 in game G and strategy χ 1 ∈H 1 in game G v , we have sup g 2 ∈G 2 J(g 1 ,g 2 ) = sup χ 2 ∈X 2 J (M 1 (g 1 ),χ 2 ) (G.80a) sup χ 2 ∈X 2 J (χ 1 ,χ 2 ) = sup g 2 ∈G 2 J(N 1 (χ 1 ),g 2 ). (G.80b) Similar inequalities can be obtained for the infimum with respect to g 1 and χ 1 in games G and G v respectively. Proof. We have sup g 2 ∈G 2 J(g 1 ,g 2 ) = sup g 2 ∈G 2 J (M 1 (g 1 ),M 2 (g 2 )) (G.81) ≤ sup χ 2 ∈X 2 J (M 1 (g 1 ),χ 2 ) (G.82) = sup χ 2 ∈X 2 J(g 1 ,N 2 (χ 2 )) (G.83) ≤ sup g 2 ∈G 2 J(g 1 ,g 2 ). (G.84) Equations (G.81) and (G.83) follow from cases (G.65a) and (G.65b) of Lemma 8.1, respectively. Inequalities (G.82) and (G.84) follow from the fact thatM 2 (g 2 )∈X 2 for every g 2 ∈G 2 and N 2 (χ 2 )∈G 2 for every χ 2 ∈X 2 , respectively. Thus, all the inequalities above hold with equality. Therefore, sup g 2 ∈G 2 J(g 1 ,g 2 ) = sup χ 2 ∈X 2 J (M 1 (g 1 ),χ 2 ). (G.85) We can prove (G.80b) using similar arguments. Further, we have inf g 1 ∈G 1 sup g 2 ∈G 2 J(g 1 ,g 2 ) = inf g 1 ∈G 1 sup χ 2 ∈X 2 J (M 1 (g 1 ),χ 2 ) (G.86) ≥ inf χ 1 ∈X 1 sup χ 2 ∈X 2 J (χ 1 ,χ 2 ) (G.87) = inf χ 1 ∈X 1 sup g 2 ∈G 2 J(N 1 (χ 1 ),g 2 ) (G.88) 267 ≥ inf g 1 ∈G 1 sup g 2 ∈G 2 J(g 1 ,g 2 ). (G.89) Here, equations (G.86) and (G.88) follow from Lemma G.4. Inequalities (G.87) and (G.89) follow from the fact thatM 1 (g 1 )∈X 1 for everyg 1 ∈G 1 andN 1 (χ 1 )∈G 1 for everyχ 1 ∈X 1 , respectively. It also follows from similar arguments used in this proof that if g 1∗ is an inf-sup strategy in Game G, thenM 1 (g 1∗ ) is an inf-sup solution in the virtual game G v . Similarly, if χ 1∗ is an inf- sup strategy in Game G v , thenN 1 (χ 1∗ ) is an inf-sup strategy in Game G. We can make similar statements on the sup-inf values and strategies as well. 268 Appendix H Numerically Solving Zero-sum Games and Structural Properties H.1 Solving the DP: Methods and Challenges In the previous section, we provided a dynamic programming characterization of the upper value S u (G) and a min-max strategy g 1∗ in the original game G. We now describe an approximate dynamic programming methodology [13] that can be used to compute them. The methodology we propose offers broad guidelines for a computational approach. We do not make any claim about the degree of approximation achieved by the proposed methodology. In Section H.1.1, we discuss some structural properties of the cost-to-go and value functions in the dynamic program that may be useful for computational simplification. We illustrate our approach and the challenges involved with the help of an example in Section 8.6. 
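Before detailing these steps, the following Python skeleton previews the overall structure of the backward pass described in the next few paragraphs: sample a set of common information beliefs at each time, approximately solve a min-max problem at each sampled belief using the fitted value estimate from the next stage, and then fit an interpolating approximation to carry backwards. This sketch is not from the thesis; the function names are placeholders, the uniform Dirichlet sampling and nearest-neighbour interpolation are crude stand-ins for the choices discussed below, and the problem-specific min-max computation (e.g., by gradient descent-ascent) is left abstract as the callable solve_minmax.

```python
import numpy as np

def approximate_backward_pass(T, belief_dim, num_samples, solve_minmax):
    """Schematic backward pass for the approximate dynamic program.

    solve_minmax(t, pi, V_next) should return an approximation of
    min_{gamma1} max_{gamma2} w_t(pi, gamma1, gamma2), where V_next is a
    callable approximating the value function at time t + 1.
    """
    def fit(points, values):
        # Step 3 (interpolation): a 1-nearest-neighbour fit stands in for
        # the parametric regression in (H.3).
        def V_hat(pi):
            idx = int(np.argmin(np.linalg.norm(points - pi, axis=1)))
            return values[idx]
        return V_hat

    V_next = lambda pi: 0.0          # the terminal value function is zero
    fitted = {}
    for t in range(T, 0, -1):
        # Step 1: sample belief points, here uniformly over the simplex.
        points = np.random.dirichlet(np.ones(belief_dim), size=num_samples)
        # Step 2: approximately solve the min-max problem at each sample.
        values = np.array([solve_minmax(t, pi, V_next) for pi in points])
        V_next = fit(points, values)
        fitted[t] = V_next
    return fitted
```

In practice, the uniform sampling in step 1 and the nearest-neighbour fit in step 3 would be replaced by the exploration heuristics and parametric value-function models discussed in the remainder of this section.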
At each timet, let ˆ V t+1 (π t+1 ,θ t+1 ) be a an approximate representation of the upper value func- tion V u t+1 (π t+1 ) in a suitable parametric form, where θ t+1 is a vector representing the parameters. We proceed as follows. Sampling the belief space At timet, we sample a setS t of belief points from the setS t (the set of all CIBs). A simple sampling approach would be to uniformly sample the spaceS t . Using the ar- guments in Appendix 5 of [72], we can say that the value functionV u t is uniformly continuous in the CIBπ t . Thus, if the setS t is sufficiently large, computing the value function only at the belief points in S t may be enough for obtaining an -approximation of the value function. It is likely that the required size ofS t to ensure an -approximation will be prohibitively large. For solving POMDPs, more intelligent strategies for sampling the belief points, known as forward exploration heuristics, 269 exist in literature [134, 63, 159]. While these methods rely on certain structural properties of the value functions (such as convexity), it may be possible to adapt them for our dynamic program and make the CIB sampling process more efficient. A precise understanding of such exploration heuristics and the relationship between the approximation error and the associated computational burden is needed. This is a problem for future work. Compute value function at each belief point Once we have a collection of pointsS t , we can then approximately compute the value V u t (π t ) for each π t ∈S t . For each belief vector π t ∈S t , we will compute ¯ V t (π t ) which is given by ˆ w t (π t ,γ 1 t ,γ 2 t ) . = ˜ c t (π t ,γ 1 t ,γ 2 t ) +E[ ˆ V t+1 (F t (π t ,γ 1 t ,Z t+1 ),θ t+1 )|π t ,γ 1:2 t ] (H.1) ¯ V t (π t ) . = min γ 1 t max γ 2 t ˆ w t (π t ,γ 1 t ,γ 2 t ). (H.2) For a given belief point π t , one approach for solving the min-max problem in (H.2) is to use the Gradient Descent Ascent (GDA) method [85, 60]. This can however lead to local optima because in general, the cost ˆ w t is neither convex nor concave in the respective prescriptions. In some cases such as when Team 2 has only one player, the inner maximizing problem in (H.2) can be substantially simplified and we will discuss this in Section H.1.1. Interpolation For solving a particular instance of the min-max problem in (H.2), we need an esti- mate of the value functionV u t+1 . Generally, knowing just the value of the function at different points may be insufficient and we may need additional information like the first derivatives/gradients es- pecially when we are using gradient based methods like GDA. In that case, it will be helpful to choose an appropriate parametric form (such as neural networks) for ˆ V t that has desirable proper- ties like continuity or differentiability. The parameters θ t can be obtained by solving the following regression problem. min θt X πt∈St ( ˆ V t (π t ,θ t )− ¯ V t (π t )) 2 . (H.3) Standard methods such as gradient descent can be used to solve this regression problem. 270 H.1.1 Structural Properties of the DP We will now prove that under Assumption 8.2, the cost-to-go function w u t (see (8.22)) in the min- max dynamic program defined in Section 8.4.2 is linear in virtual player 2’s prescription in its product form γ 2 t (p 2 t ;u 2 t ) (see Definition G.1 in Appendix G.2). Lemma H.1. 
H.1.1 Structural Properties of the DP

We will now prove that, under Assumption 8.2, the cost-to-go function $w^u_t$ (see (8.22)) in the min-max dynamic program defined in Section 8.4.2 is linear in virtual player 2's prescription in its product form $\gamma^2_t(p^2_t;u^2_t)$ (see Definition G.1 in Appendix G.2).

Lemma H.1. The cost-to-go $w^u_t(\pi_t,\gamma^1_t,\gamma^2_t)$ is linear in $\gamma^2_t(p^2_t;u^2_t)$ (see Definition G.1 in Appendix G.2), i.e., there exists a function $a_t : \mathcal{P}^2_t \times \mathcal{U}^2_t \times \mathcal{S}_t \times \mathcal{B}^1_t \to \mathbb{R}$ such that

$w^u_t(\pi_t,\gamma^1_t,\gamma^2_t) = \sum_{p^2_t,u^2_t} \pi_t(p^2_t)\,\gamma^2_t(p^2_t;u^2_t)\,a_t(p^2_t,u^2_t,\pi_t,\gamma^1_t)$.

Here, $\pi_t(p^2_t)$ denotes the marginal probability of $p^2_t$ with respect to the CIB $\pi_t$.

Proof. We have

$w^u_t(\pi_t,\gamma^1_t,\gamma^2_t) = \tilde{c}_t(\pi_t,\gamma^1_t,\gamma^2_t) + \mathbb{E}[V^u_{t+1}(F_t(\pi_t,\gamma^{1:2}_t,Z_{t+1})) \mid \pi_t,\gamma^{1:2}_t]$   (H.4)
$= \tilde{c}_t(\pi_t,\gamma^1_t,\gamma^2_t) + \sum_{z_{t+1}} P^m_t(\pi_t,\gamma^{1:2}_t;z_{t+1})\,V^u_{t+1}(F_t(\pi_t,\gamma^1_t,z_{t+1}))$   (H.5)
$=: \sum_{p^2_t,u^2_t} \pi_t(p^2_t)\,\gamma^2_t(p^2_t;u^2_t)\,a_t(p^2_t,u^2_t,\pi_t,\gamma^1_t)$,   (H.6)

where the last equality uses the fact that the belief update $F_t$ does not depend on $\gamma^2_t$, and that both $\tilde{c}_t$ (see (G.14)) and $P^m_t$ (see (G.11)) are linear in $\gamma^2_t(p^2_t;u^2_t)$. We then rearrange terms in (H.5) and appropriately define $a_t$ to obtain (H.6). Here, $\pi_t(p^2_t)$ represents the marginal distribution of the second player's private information with respect to the common information based belief $\pi_t$.

Remark H.1. When there is only one player in Team 2, the prescription $\gamma^2_t$ coincides with its product form in Definition G.1. In that case, the cost-to-go $w^u_t(\pi_t,\gamma^1_t,\gamma^2_t)$ is linear in $\gamma^2_t$ as well.

At any given time $t$, a pure prescription for virtual player 2 in Game G_e is a prescription $(\gamma^{2,1},\dots,\gamma^{2,N_2}) \in \mathcal{B}^2_t$ such that for every $j$ and every instance $p^{2,j}_t \in \mathcal{P}^{2,j}_t$, the distribution $\gamma^{2,j}(p^{2,j}_t)$ is degenerate. Let $\mathcal{Q}^2_t$ be the set of all such pure prescriptions at time $t$ for virtual player 2.

Lemma H.2. Under Assumption 8.2, for any given $\pi_t \in \mathcal{S}_t$ and $\gamma^1_t \in \mathcal{B}^1_t$, the cost-to-go function in (8.22) satisfies

$\max_{\gamma^2_t \in \mathcal{B}^2_t} w^u_t(\pi_t,\gamma^1_t,\gamma^2_t) = \max_{\gamma^2_t \in \mathcal{Q}^2_t} w^u_t(\pi_t,\gamma^1_t,\gamma^2_t)$.

Proof. Using Lemma H.1, for a fixed value of the CIB $\pi_t$ and virtual player 1's prescription $\gamma^1_t$, the inner maximization problem in (8.23) can be viewed as a single-stage team problem. In this team problem, there are $N_2$ players and the state of the system is $P^2_t$, which is distributed according to $\pi_t$. Player $j$'s information in this team is $P^{2,j}_t$. Using this information, the player selects a distribution over $\mathcal{U}^{2,j}_t$ which is then used to randomly generate an action $U^{2,j}_t$. The reward for this team is given by the function $a_t$ defined in Lemma H.1. It can be easily shown that, for a single-stage team problem, we can restrict attention to deterministic strategies without loss of optimality. It is easy to see that the set of all deterministic strategies in the single-stage team problem described above corresponds to the collection of pure prescriptions $\mathcal{Q}^2_t$. This concludes the proof of Lemma H.2.

Lemma H.2 allows us to transform the non-concave maximization problem over the continuous space $\mathcal{B}^2_t$ in (8.23) and (H.2) into a maximization problem over the finite set $\mathcal{Q}^2_t$. This is particularly useful when the players in Team 2 do not have any private information because, in that case, $\mathcal{Q}^2_t$ has the same size as the action space $\mathcal{U}^2_t$. Other structural properties and simplifications are discussed in Appendix H.1.
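To see how Lemma H.2 is used computationally, the following small sketch enumerates the finite set $\mathcal{Q}^2_t$ of pure prescriptions and evaluates the cost-to-go through the linear representation of Lemma H.1. The coefficient array `a` and the sizes of the private-information and action sets are hypothetical placeholders; the point is only that, by linearity, enumerating degenerate prescriptions recovers the maximum over the whole product-form prescription space.

```python
# Illustration of Lemma H.2 with hypothetical problem sizes: the inner max over
# the continuous prescription space is replaced by enumeration of Q^2_t.
import itertools
import numpy as np

N_P2, N_U2 = 3, 2                     # |P^2_t| and |U^2_t| (placeholders)
rng = np.random.default_rng(0)
pi_marginal = rng.dirichlet(np.ones(N_P2))   # pi_t(p^2_t)
a = rng.normal(size=(N_P2, N_U2))            # a_t(p^2_t, u^2_t, pi_t, gamma^1_t)

def w_linear(gamma2):
    """Cost-to-go of a product-form prescription gamma2[p2, u2],
    written via the linear representation from Lemma H.1."""
    return float(np.sum(pi_marginal[:, None] * gamma2 * a))

# Enumerate Q^2_t: every deterministic map from p^2_t to an action u^2_t.
best_value, best_map = -np.inf, None
for assignment in itertools.product(range(N_U2), repeat=N_P2):
    gamma2 = np.zeros((N_P2, N_U2))
    gamma2[np.arange(N_P2), assignment] = 1.0   # degenerate distributions
    value = w_linear(gamma2)
    if value > best_value:
        best_value, best_map = value, assignment

# Because the objective decouples across p^2_t, the pointwise argmax of a
# gives the same maximum, which is also the max over all mixed prescriptions.
assert np.isclose(best_value, float(np.sum(pi_marginal * a.max(axis=1))))
print(best_map, best_value)
```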
H.2 Proof of Lemma 8.6

For proving Lemma 8.6, it will be helpful to split Team 1's strategy $g^1$ into two parts $(g^{1,1},g^{1,2})$, where $g^{1,1} = (g^{1,1}_1,\dots,g^{1,1}_T)$ and $g^{1,2} = (g^{1,2}_1,\dots,g^{1,2}_T)$. The set of all such strategies for Player $j$ in Team 1 will be denoted by $\mathcal{G}^{1,j}$. We will prove the lemma using the following claim.

Claim H.1. Consider any arbitrary strategy $g^1 = (g^{1,1},g^{1,2})$ for Team 1. Then there exists a strategy $\bar{g}^{1,1}$ for Player 1 in Team 1 such that, for each $t$, $\bar{g}^{1,1}_t$ is a function of $X_t$ and $I^2_t$, and

$J((\bar{g}^{1,1},g^{1,2}),g^2) = J((g^{1,1},g^{1,2}),g^2), \quad \forall g^2 \in \mathcal{G}^2$.

Suppose that the above claim is true. Let $(h^{1,1},h^{1,2})$ be a min-max strategy for Team 1. Due to Claim H.1, there exists a strategy $\bar{h}^{1,1}$ for Player 1 in Team 1 such that, for each $t$, $\bar{h}^{1,1}_t$ is a function only of $X_t$ and $I^2_t$, and $J((\bar{h}^{1,1},h^{1,2}),g^2) = J((h^{1,1},h^{1,2}),g^2)$ for every strategy $g^2 \in \mathcal{G}^2$. Therefore, we have

$\sup_{g^2 \in \mathcal{G}^2} J((\bar{h}^{1,1},h^{1,2}),g^2) = \sup_{g^2 \in \mathcal{G}^2} J((h^{1,1},h^{1,2}),g^2) = \inf_{g^1 \in \mathcal{G}^1} \sup_{g^2 \in \mathcal{G}^2} J(g^1,g^2)$.

Thus, $(\bar{h}^{1,1},h^{1,2})$ is a min-max strategy for Team 1 wherein Player 1 uses only the current state and Player 2's information.

Proof of Claim H.1: We now proceed to prove Claim H.1. Consider any arbitrary strategy $g^1 = (g^{1,1},g^{1,2})$ for Team 1. Let $\iota^2_t = \{u^{1:2}_{1:t-1}, y^2_{1:t}\}$ be a realization of Team 2's information $I^2_t$ (which is the same as Player 2's information in Team 1). Define the distribution $\Psi_t(\iota^2_t)$ over the space $(\prod_{\tau=1}^t \mathcal{X}_\tau) \times \mathcal{U}^{1,1}_t$ as follows:

$\Psi_t(\iota^2_t; x_{1:t}, u^{1,1}_t) := \mathbb{P}^{g^1,h^2}[X_{1:t}, U^{1,1}_t = (x_{1:t},u^{1,1}_t) \mid I^2_t = \iota^2_t]$,

if $\iota^2_t$ is feasible, that is, $\mathbb{P}^{g^1,h^2}[I^2_t = \iota^2_t] > 0$, under the open-loop strategy $h^2 := (u^2_{1:t-1})$ for Team 2. Otherwise, define $\Psi_t(\iota^2_t; x_{1:t}, u^{1,1}_t)$ to be the uniform distribution over the space $(\prod_{\tau=1}^t \mathcal{X}_\tau) \times \mathcal{U}^{1,1}_t$.

Lemma H.3. Let $g^1$ be Team 1's strategy and let $g^2$ be an arbitrary strategy for Team 2. Then for any realization $x_{1:t}, u^{1,1}_t$ of the variables $X_{1:t}, U^{1,1}_t$, we have

$\mathbb{P}^{g^1,g^2}[X_{1:t}, U^{1,1}_t = (x_{1:t},u^{1,1}_t) \mid I^2_t] = \Psi_t(I^2_t; x_{1:t}, u^{1,1}_t)$, almost surely.

Proof. From Team 2's perspective, the system evolution can be seen in the following manner. The system state at time $t$ is $S_t = (X_{1:t}, U^{1,1}_t, I^2_t)$. Team 2 obtains a partial observation $Y^2_t$ of the state at time $t$. Using the information $\{Y^2_{1:t}, U^{1:2}_{1:t-1}\}$, Team 2 then selects an action $U^2_t$. The state then evolves in a controlled Markovian manner (with dynamics that depend on $g^1$). Thus, from Team 2's perspective, this system is a partially observable Markov decision process (POMDP). The claim in the lemma then follows from the standard result in POMDPs that the belief on the state given the player's information does not depend on the player's strategy [79].

For any instance $\iota^2_t$ of Team 2's information $I^2_t$, define the distribution $\Phi_t(\iota^2_t)$ over the space $\mathcal{X}_t \times \mathcal{U}^{1,1}_t$ as follows:

$\Phi_t(\iota^2_t; x_t, u^{1,1}_t) = \sum_{x_{1:t-1}} \Psi_t(\iota^2_t; x_{1:t}, u^{1,1}_t)$.   (H.7)

Define the strategy $\bar{g}^{1,1}$ for Player 1 in Team 1 such that, for any realization $x_t, \iota^2_t$ of the state $X_t$ and Player 2's information $I^2_t$ at time $t$, the probability of selecting an action $u^{1,1}_t$ at time $t$ is

$\bar{g}^{1,1}_t(x_t,\iota^2_t;u^{1,1}_t) := \begin{cases} \dfrac{\Phi_t(\iota^2_t;x_t,u^{1,1}_t)}{\sum_{u^{1,1\prime}_t} \Phi_t(\iota^2_t;x_t,u^{1,1\prime}_t)} & \text{if } \sum_{u^{1,1\prime}_t} \Phi_t(\iota^2_t;x_t,u^{1,1\prime}_t) > 0 \\ \mathcal{U}(\cdot) & \text{otherwise,} \end{cases}$   (H.8)

where $\mathcal{U}(\cdot)$ denotes the uniform distribution over the action space $\mathcal{U}^{1,1}_t$. Notice that the construction of the strategy $\bar{g}^{1,1}$ does not involve Team 2's strategy $g^2$.
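As a concrete illustration of the marginalization in (H.7) and the normalization in (H.8), the following small sketch (with hypothetical, toy array shapes) computes $\bar{g}^{1,1}_t(x_t,\iota^2_t;\cdot)$ from a stored conditional distribution $\Psi_t(\iota^2_t;\cdot,\cdot)$:

```python
# Illustration (with hypothetical array shapes) of (H.7)-(H.8): marginalize Psi_t
# over the past states x_{1:t-1}, then normalize over u^{1,1}_t for each (x_t, iota).
import numpy as np

n_x, t, n_u = 3, 4, 2               # |X|, current time, |U^{1,1}_t| (placeholders)
rng = np.random.default_rng(1)

# Psi[iota][x_1, ..., x_t, u] = Psi_t(iota; x_{1:t}, u^{1,1}_t), one array per
# realization iota^2_t of Team 2's information.
Psi = {"iota_example": rng.dirichlet(np.ones(n_x ** t * n_u)).reshape((n_x,) * t + (n_u,))}

def g_bar(iota, x_t):
    """Return the distribution g_bar^{1,1}_t(x_t, iota; .) over u^{1,1}_t."""
    Phi = Psi[iota].sum(axis=tuple(range(t - 1)))   # (H.7): sum out x_{1:t-1}
    row = Phi[x_t]
    total = row.sum()
    if total > 0:
        return row / total                          # (H.8), normalized case
    return np.full(n_u, 1.0 / n_u)                  # uniform otherwise

print(g_bar("iota_example", x_t=0))
```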
Lemma H.4. For any strategy $g^2$ for Team 2, we have

$\mathbb{P}^{(g^1,g^2)}[U^{1,1}_t = u^{1,1}_t \mid X_t, I^2_t] = \bar{g}^{1,1}_t(X_t, I^2_t; u^{1,1}_t)$

almost surely for every $u^{1,1}_t \in \mathcal{U}^{1,1}_t$.

Proof. Let $x_t, \iota^2_t$ be a realization that has a non-zero probability of occurrence under the strategy profile $(g^1,g^2)$. Then, using Lemma H.3, we have

$\mathbb{P}^{(g^1,g^2)}[X_{1:t}, U^{1,1}_t = (x_{1:t},u^{1,1}_t) \mid \iota^2_t] = \Psi_t(\iota^2_t; x_{1:t}, u^{1,1}_t)$,   (H.9)

for every realization $x_{1:t-1}$ of the states $X_{1:t-1}$ and $u^{1,1}_t$ of the action $U^{1,1}_t$. Summing over all $x_{1:t-1}, u^{1,1}_t$ and using (H.7) and (H.9), we have

$\mathbb{P}^{(g^1,g^2)}[X_t = x_t \mid I^2_t = \iota^2_t] = \sum_{u^{1,1}_t} \Phi_t(\iota^2_t; x_t, u^{1,1}_t)$.   (H.10)

The left-hand side of the above equation is positive since $x_t, \iota^2_t$ is a realization of positive probability under the strategy profile $(g^1,g^2)$. Using Bayes' rule, (H.7), (H.8) and (H.9), we obtain

$\mathbb{P}^{g^1,g^2}[U^{1,1}_t = u^{1,1}_t \mid X_t = x_t, I^2_t = \iota^2_t] = \frac{\Phi_t(\iota^2_t;x_t,u^{1,1}_t)}{\sum_{u^{1,1\prime}_t} \Phi_t(\iota^2_t;x_t,u^{1,1\prime}_t)} = \bar{g}^{1,1}_t(x_t,\iota^2_t;u^{1,1}_t)$.

This concludes the proof of the lemma.

Let us define $\bar{g}^1 = (\bar{g}^{1,1}, g^{1,2})$, where $\bar{g}^{1,1}$ is as defined in (H.8). We can now show that the strategy $\bar{g}^1$ satisfies $J(\bar{g}^1,g^2) = J(g^1,g^2)$ for every strategy $g^2 \in \mathcal{G}^2$. Because of the structure of the cost function in (8.38), it is sufficient to show that, for each time $t$, the random variables $(X_t, U^1_t, U^2_t, I^2_t)$ have the same joint distribution under the strategy profiles $(g^1,g^2)$ and $(\bar{g}^1,g^2)$. We prove this by induction. It is easy to verify that at time $t = 1$, $(X_1, U^1_1, U^2_1, I^2_1)$ have the same joint distribution under the strategy profiles $(g^1,g^2)$ and $(\bar{g}^1,g^2)$. Now assume that at time $t$,

$\mathbb{P}^{g^1,g^2}[x_t,u^1_t,u^2_t,\iota^2_t] = \mathbb{P}^{\bar{g}^1,g^2}[x_t,u^1_t,u^2_t,\iota^2_t]$,   (H.11)

for any realization of the state, actions and Team 2's information $x_t,u^1_t,u^2_t,\iota^2_t$. Let $\iota^2_{t+1} = (\iota^2_t, u^{1:2}_t, y^2_{t+1})$. Then we have

$\mathbb{P}^{g^1,g^2}[x_{t+1},\iota^2_{t+1}] = \sum_{\bar{x}_t} \mathbb{P}[x_{t+1},y^2_{t+1} \mid \bar{x}_t, u^{1:2}_t, \iota^2_t]\, \mathbb{P}^{g^1,g^2}[\bar{x}_t, u^{1:2}_t, \iota^2_t]$
$= \sum_{\bar{x}_t} \mathbb{P}[x_{t+1},y^2_{t+1} \mid \bar{x}_t, u^{1:2}_t, \iota^2_t]\, \mathbb{P}^{\bar{g}^1,g^2}[\bar{x}_t, u^{1:2}_t, \iota^2_t]$   (H.12)
$= \mathbb{P}^{\bar{g}^1,g^2}[x_{t+1},\iota^2_{t+1}]$.   (H.13)

The equality in (H.12) is due to the induction hypothesis. Note that the conditional distribution $\mathbb{P}[x_{t+1},\iota^2_{t+1} \mid x_t,u^1_t,u^2_t,\iota^2_t]$ does not depend on the players' strategies (see equations (8.2) and (8.3)). At $t+1$, for any realization $x_{t+1},u^1_{t+1},u^2_{t+1},\iota^2_{t+1}$ that has non-zero probability of occurrence under the strategy profile $(g^1,g^2)$, we have

$\mathbb{P}^{g^1,g^2}[x_{t+1},u^1_{t+1},u^2_{t+1},\iota^2_{t+1}]$   (H.14)
$= \mathbb{P}^{g^1,g^2}[x_{t+1},\iota^2_{t+1}]\, g^2_t(\iota^2_{t+1};u^2_{t+1})\, g^{1,2}_t(\iota^2_{t+1};u^{1,2}_{t+1})\, \mathbb{P}^{g^1,g^2}[u^{1,1}_{t+1} \mid x_{t+1},\iota^2_{t+1}]$   (H.15)
$= \mathbb{P}^{g^1,g^2}[x_{t+1},\iota^2_{t+1}]\, g^2_t(\iota^2_{t+1};u^2_{t+1})\, g^{1,2}_t(\iota^2_{t+1};u^{1,2}_{t+1})\, \bar{g}^{1,1}_t(x_{t+1},\iota^2_{t+1};u^{1,1}_{t+1})$   (H.16)
$= \mathbb{P}^{\bar{g}^1,g^2}[x_{t+1},\iota^2_{t+1}]\, g^2_t(\iota^2_{t+1};u^2_{t+1})\, g^{1,2}_t(\iota^2_{t+1};u^{1,2}_{t+1})\, \bar{g}^{1,1}_t(x_{t+1},\iota^2_{t+1};u^{1,1}_{t+1})$   (H.17)
$= \mathbb{P}^{\bar{g}^1,g^2}[x_{t+1},\iota^2_{t+1}]\, g^2_t(\iota^2_{t+1};u^2_{t+1})\, g^{1,2}_t(\iota^2_{t+1};u^{1,2}_{t+1})\, \mathbb{P}^{\bar{g}^1,g^2}[u^{1,1}_{t+1} \mid x_{t+1},\iota^2_{t+1}]$   (H.18)
$= \mathbb{P}^{\bar{g}^1,g^2}[x_{t+1},u^1_{t+1},u^2_{t+1},\iota^2_{t+1}]$,   (H.19)

where the equality in (H.15) is a consequence of the chain rule and the manner in which the players randomize their actions. The equality in (H.16) follows from Lemma H.4, and the equality in (H.17) follows from the result in (H.13). Therefore, by induction, the equality in (H.11) holds for all $t$. This concludes the proof of Claim H.1.
H.3 Numerical Experiments: Details

H.3.1 Game Model

States: There are two players in Team 1 and one player in Team 2. We will refer to Player 2 in Team 1 as the defender and the player in Team 2 as the attacker. Player 1 in Team 1 will be referred to as the signaling player. The state space of the system is $\mathcal{X}_t = \{(l,a), (r,a), (l,p), (r,p)\}$. For convenience, we will denote these four states by $\{0,1,2,3\}$ in the same order. Recall that $l$ and $r$ represent the two entities, and $a$ and $p$ denote the active and passive states of the attacker. Therefore, if the state is $X_t = (l,a)$, the vulnerable entity is $l$ and the attacker is active. A similar interpretation applies to the other states.

Actions: Each player in Team 1 has two actions: $\mathcal{U}^{1,1}_t = \mathcal{U}^{1,2}_t = \{\alpha,\beta\}$. Player 2's action $\alpha$ corresponds to defending entity $l$ and $\beta$ corresponds to defending entity $r$. Player 1's actions are meant only for signaling. The player (attacker) in Team 2 has three actions: $\mathcal{U}^{2,1}_t = \{\alpha,\beta,\mu\}$. For the attacker, $\alpha$ represents a targeted attack on entity $l$, $\beta$ represents a targeted attack on entity $r$, and $\mu$ represents a blanket attack on both entities.

Dynamics: In this particular example, the actions of the players in Team 1 do not affect the state evolution; only the attacker's actions do. We consider this type of setup for simplicity, and one can let the players in Team 1 control the state evolution as well. On the other hand, note that the evolution of the CIB is controlled only by Team 1 (the signaling player) and not by Team 2 (the attacker). The transition probabilities are provided in Table H.1. We interpret these transitions below.

Whenever the attacker is passive, i.e. $X_t = 2$ or $3$, the attacker remains passive with probability 0.7, in which case the system state does not change. With probability 0.3, the attacker becomes active; given that the attacker becomes active, the next state $X_{t+1}$ is either 0 or 1 with probability 0.5 each. In the passive states, even the attacker's actions do not affect the transitions.

When the current state is $X_t = 0$, the following cases are possible:
1. Attacker plays $\alpha$: the attacker launches a successful targeted attack. The next state $X_{t+1}$ is either 0 or 1 with probability 0.5.
2. Attacker plays $\beta$: the next state $X_{t+1}$ is 2 with probability 1. Thus, the attacker becomes passive if it targets the invulnerable entity.
3. Attacker plays $\mu$: the state does not change with probability 0.7. With probability 0.3, $X_{t+1} = 2$.
The attacker's actions can be interpreted similarly when the state is $X_t = 1$.

Table H.1: The transition probabilities $\mathbb{P}[X_{t+1} \mid X_t, U^2_t]$. Each entry is the distribution $(\mathbb{P}[X_{t+1}=0], \mathbb{P}[X_{t+1}=1], \mathbb{P}[X_{t+1}=2], \mathbb{P}[X_{t+1}=3])$. Note that the dynamics do not depend on time $t$ or on Team 1's actions.

         X_t = 0                X_t = 1                X_t = 2                 X_t = 3
alpha    (0.5, 0.5, 0.0, 0.0)   (0.0, 0.0, 0.0, 1.0)   (0.15, 0.15, 0.7, 0.0)  (0.15, 0.15, 0.0, 0.7)
beta     (0.0, 0.0, 1.0, 0.0)   (0.5, 0.5, 0.0, 0.0)   (0.15, 0.15, 0.7, 0.0)  (0.15, 0.15, 0.0, 0.7)
mu       (0.7, 0.0, 0.3, 0.0)   (0.0, 0.7, 0.0, 0.3)   (0.15, 0.15, 0.7, 0.0)  (0.15, 0.15, 0.0, 0.7)

Cost: Player 1's actions in Team 1 do not affect the cost in this example (again for simplicity). The cost, as a function of the state and the other players' actions, is provided in Table H.2. Note that when the attacker is passive, the system does not incur any cost. If the attacker launches a successful targeted attack, the cost is 15. If the attacker launches a failed targeted attack, the cost is 0. When the attacker launches a blanket attack, the cost incurred is 10 if the defender happens to be defending the vulnerable entity, and 20 otherwise.

Table H.2: The cost $c(X_t, U^{1,2}_t, U^{2,1}_t)$. Each row corresponds to a pair of actions $(U^{1,2}_t, U^{2,1}_t)$ and each column corresponds to a state $X_t$. Note that the actions of Player 1 in Team 1 do not affect the cost.

                  X_t = 0   X_t = 1   X_t = 2   X_t = 3
(alpha, alpha)       15         0         0         0
(alpha, beta)         0        15         0         0
(alpha, mu)          10        20         0         0
(beta, alpha)        15         0         0         0
(beta, beta)          0        15         0         0
(beta, mu)           20        10         0         0

We consider a discounted cost with a horizon of $T = 15$. Therefore, the cost function at time $t$ is $c_t(\cdot) = \delta^{t-1} c(\cdot)$, where the function $c(\cdot)$ is defined in Table H.2. The discount factor $\delta$ in our problem is 0.9.
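For reference, Tables H.1 and H.2 can be encoded directly as arrays. The following sketch is one possible encoding (the index ordering of the arrays is our own choice, not something specified in the thesis) and includes a sanity check that each row of the transition kernel sums to one.

```python
# Encoding of Tables H.1 and H.2 as numpy arrays.
# States: 0=(l,a), 1=(r,a), 2=(l,p), 3=(r,p).
# Attacker actions: 0=alpha (target l), 1=beta (target r), 2=mu (blanket).
# Defender actions: 0=alpha (defend l), 1=beta (defend r).
import numpy as np

# P[a2, x, x_next] = P[X_{t+1} = x_next | X_t = x, U^2_t = a2]   (Table H.1)
P = np.array([
    # attacker alpha
    [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0], [0.15, 0.15, 0.7, 0.0], [0.15, 0.15, 0.0, 0.7]],
    # attacker beta
    [[0.0, 0.0, 1.0, 0.0], [0.5, 0.5, 0.0, 0.0], [0.15, 0.15, 0.7, 0.0], [0.15, 0.15, 0.0, 0.7]],
    # attacker mu
    [[0.7, 0.0, 0.3, 0.0], [0.0, 0.7, 0.0, 0.3], [0.15, 0.15, 0.7, 0.0], [0.15, 0.15, 0.0, 0.7]],
])

# C[d, a2, x] = c(X_t = x, U^{1,2}_t = d, U^{2,1}_t = a2)        (Table H.2)
C = np.array([
    # defender alpha:  attacker alpha, attacker beta, attacker mu
    [[15, 0, 0, 0], [0, 15, 0, 0], [10, 20, 0, 0]],
    # defender beta
    [[15, 0, 0, 0], [0, 15, 0, 0], [20, 10, 0, 0]],
])

DELTA, T = 0.9, 15                         # discount factor and horizon

def cost_t(t, x, d, a2):
    """Stage cost c_t = delta^(t-1) * c(.), for t = 1, ..., T."""
    return DELTA ** (t - 1) * C[d, a2, x]

assert np.allclose(P.sum(axis=-1), 1.0)    # every row is a probability distribution
print(cost_t(2, x=0, d=1, a2=2))           # blanket attack, wrong defense: 0.9 * 20 = 18.0
```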
H.3.2 Architecture and Implementation

At any time $t$, we represent the value function $V^u_t$ using a deep fully-connected ReLU network. The network (denoted by $\hat{V}_t$ with parameters $\theta_t$) takes the CIB $\pi_t$ as input and returns the value $\hat{V}_t(\pi_t,\theta_t)$. The state $X_t$ in our example takes only four values, and thus we sample the CIB belief space uniformly. For each CIB point in the sampled set, we compute an estimate $\bar{V}_t$ of the value function as in (H.2). Since Team 2 has only one player, we can use the structural result in Section H.1.1 to simplify the inner maximization problem in (H.2), and we solve the outer minimization problem in (H.2) using gradient descent. The neural network representation of the value function is helpful here, as we can compute gradients using backpropagation. Since this minimization is not convex, we may not converge to the global optimum; therefore, we use multiple initializations and pick the best solution. Finally, the interpolation step is performed by training the neural network $\hat{V}_t$ with an $\ell_2$ loss. We illustrate this iterative procedure in Section H.3.4.

For each time $t$, we compute the upper value function $\bar{V}_t$ at the sampled belief points using the estimated value function $\hat{V}_{t+1}$. In the figures in Section H.3.4, these points are labeled "Value function" and plotted in blue. The weights of the neural network $\hat{V}_t$ are then adjusted by regression to obtain an approximation; this approximated value function is plotted in orange. For the particular example discussed above, we can obtain very close approximations of the value functions. We can also observe that, due to discounting, the value function converges as the horizon of the problem increases.
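As an illustration of this parametric representation, a fully-connected ReLU network and its $\ell_2$ regression fit could be set up as in the following sketch. The layer widths, optimizer settings, epoch count and the toy target values are assumptions made purely for illustration, not the exact configuration used for the results reported here.

```python
# Sketch of the value-function parameterization and the l2 regression step (H.3).
# Layer widths, learning rate, epochs and target values are illustrative assumptions.
import torch
import torch.nn as nn

class ValueNet(nn.Module):
    """Fully-connected ReLU network V_hat_t(pi_t, theta_t): CIB -> scalar value."""
    def __init__(self, belief_dim=4, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(belief_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, pi):
        return self.net(pi).squeeze(-1)

def fit_value_net(beliefs, targets, epochs=500, lr=1e-3):
    """Regression step: fit V_hat_t to the computed values V_bar_t with an l2 loss."""
    model = ValueNet(belief_dim=beliefs.shape[1])
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(beliefs), targets)
        loss.backward()
        opt.step()
    return model

if __name__ == "__main__":
    # Uniformly sampled CIB points over the four states, with placeholder targets.
    pis = torch.distributions.Dirichlet(torch.ones(4)).sample((200,))
    vbar = 65.0 - 10.0 * (pis[:, 0] - 0.28) ** 2      # toy stand-in for V_bar_t
    model = fit_value_net(pis, vbar)
    print(model(pis[:3]))
```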
H.3.3 The Value Function and Team 1's Strategy

The value function computed using the methodology described above is depicted in Figure 8.1a. Since the initial belief is given by $\pi_1(0) = \pi_1(1) = 0.5$, we can conclude that the min-max value of the game is 65.2. We note that the value function in Figure 8.1a is neither convex nor concave. As discussed earlier in Section 8.6, the trade-off between signaling and secrecy can be understood from the value function: revealing too much (or too little) information about the hidden state is unfavorable for Team 1. The most favorable belief according to the value function appears to be $\pi_1(0) = 0.28$ (or 0.72).

We also compute a min-max strategy for Team 1 and the corresponding best response¹ for the attacker in Team 2. The min-max strategy is computed using Algorithm 4.

1. In the best response, the attacker decides to launch a targeted attack (while it is active) on entity $l$ if $0 \leq \pi_1(0) < 0.28$ and on entity $r$ if $0.72 < \pi_1(0) \leq 1$. For all other belief states, it launches a blanket attack. Clearly, the attacker launches a targeted attack only if it is confident enough.

2. According to our computed min-max strategy, Player 2 in Team 1 simply computes the CIB and chooses to defend the entity that is more likely to be vulnerable.

3. The signaling strategy for Player 1 in Team 1 is more complicated. This can be seen from the strategy depicted in Figure 8.1b. The signaling strategy can be divided into three broad regions, as shown in Figure 8.1b.
(a) The grey region is where the signaling can be arbitrary. This is because, in these belief states, the attacker is confident enough about which entity is vulnerable and will decide to launch a targeted attack immediately.
(b) In the brown region, the signaling is negligible. This is because even slight signaling might make the attacker confident enough to launch a targeted attack.
(c) In the yellow region, there is substantial signaling. In this region, the attacker uses the blanket attack, and signaling can help Player 2 in Team 1 defend better.

¹This is not to be confused with the max-min strategy for the attacker.

We also observe an interesting pattern in the signaling strategy. Consider the CIB $\pi_1(0) = 0.6$ in Figure 8.1b and let the true state be 0. In this scenario, Player 1 in Team 1 knows that the state is 0 while Player 2 believes that state 1 is more likely. Therefore, Player 1's signaling strategy must be such that it quickly resolves this mismatch between the state and Player 2's belief. However, when the true state is 1, Player 2's belief is consistent with the state, and too much signaling in this case can make the attacker more confident about the system state. Thus, there is an asymmetry with respect to the state in the signaling requirement: signaling is favorable when the state is 0 but unfavorable when the state is 1. The strategy for states 0 and 1 depicted in Figure 8.1b achieves this goal of asymmetric signaling efficiently.

H.3.4 Value Function Approximations

In this section, we present the successive approximations of the value functions obtained using our methodology in Section H.1. Each figure plots the computed values $\bar{V}_t$ at the sampled belief points ("Value function") together with the fitted approximation ("Estimated Value function") as a function of the belief.

[Figure H.1: Estimated value function $\hat{V}_{15}$]
[Figure H.2: Estimated value function $\hat{V}_{14}$]
[Figure H.3: Estimated value function $\hat{V}_{13}$]
[Figure H.4: Estimated value function $\hat{V}_{12}$]
[Figure H.5: Estimated value function $\hat{V}_{11}$]
[Figure H.6: Estimated value function $\hat{V}_{10}$]
[Figure H.7: Estimated value function $\hat{V}_9$]
[Figure H.8: Estimated value function $\hat{V}_8$]
[Figure H.9: Estimated value function $\hat{V}_7$]
[Figure H.10: Estimated value function $\hat{V}_6$]
[Figure H.11: Estimated value function $\hat{V}_5$]
[Figure H.12: Estimated value function $\hat{V}_4$]
[Figure H.13: Estimated value function $\hat{V}_3$]
[Figure H.14: Estimated value function $\hat{V}_2$]
[Figure H.15: Estimated value function $\hat{V}_1$]
Appendix I
Real-time Coordination over Communication Channels

I.0.1 Proof of Lemma 10.8

The existence of $\psi'$ is shown by construction. Let $f'_1 = f_1$, $g'^t_1 = g^t_1$ and $g'^r_1 = g^r_1$. Thus,

$\mathbb{E}^{\psi}[l_1(X_1,S_1,D_1)] = \mathbb{E}^{\psi'}[l_1(X_1,S_1,D_1)]$.

Define

$F(x_1,w_1,s_{1:2}) = \mathbb{E}^{\psi}[l_2(X_2,S_2,D_2) \mid x_1,w_1,s_{1:2}]$,

and, for a given $s_2$, let

$(x^*_1,s^*_1,w^*_1) = \arg\min_{(x_1,s_1,w_1)} F(x_1,w_1,s_{1:2})$.

Consider the event $S_2 = s_2$. Then, using the tower property of conditional expectation, we have

$\mathbb{E}^{\psi}[l_2(X_2,S_2,D_2) \mid s_2] = \mathbb{E}\big[\mathbb{E}^{\psi}[l_2(X_2,S_2,D_2) \mid X_1,W_1,S_{1:2}] \mid s_2\big] = \mathbb{E}[F(X_1,W_1,S_{1:2}) \mid s_2] \geq F(x^*_1,w^*_1,s^*_1,s_2)$.

Let $y^*_1 = q_1(f_1(x^*_1,s^*_1),w^*_1)$, and define

$f'_2(x_2,s_2) := f_2(x^*_1,x_2,s^*_1,s_2)$
$g'^t_2(x_2,s_2) := g^t_2(x^*_1,x_2,s^*_1,s_2)$
$g'^r_2(y_2,s_2) := g^r_2(y^*_1,y_2,s^*_1,s_2)$.

Under this definition, $X_2$ and $Y_2$ are independent of $X_1$, $W_1$ and $S_1$ conditioned on $S_2 = s_2$. Thus,

$\mathbb{P}^{\psi'}[g'^t_2(X_2,S_2) \neq g'^r_2(Y_2,S_2) \mid s_2]$
$= \mathbb{P}^{\psi'}[g^t_2(x^*_1,X_2,s^*_1,S_2) \neq g^r_2(y^*_1,Y_2,s^*_1,S_2) \mid s_2]$
$= \mathbb{P}^{\psi'}[g^t_2(x^*_1,X_2,s^*_1,S_2) \neq g^r_2(y^*_1,Y_2,s^*_1,S_2) \mid x^*_1,w^*_1,s^*_1,s_2]$
$= \mathbb{P}^{\psi}[g^t_2(X_{1:2},S_{1:2}) \neq g^r_2(Y_{1:2},S_{1:2}) \mid x^*_1,w^*_1,s^*_1,s_2] = 0$.

The last equality follows from the fact that $\psi$ is a valid coordination strategy. Therefore, $\psi'$ is also a valid strategy that achieves coordination. Under the strategy $\psi'$, $X_2$ and $D_2$ are independent of $X_1$, $W_1$ and $S_1$ conditioned on $S_2 = s_2$. Thus, we have

$\mathbb{E}^{\psi'}[l_2(X_2,S_2,D_2) \mid s_2] = \mathbb{E}^{\psi'}[l_2(X_2,S_2,D_2) \mid x^*_1,w^*_1,s^*_1,s_2] = \mathbb{E}^{\psi}[l_2(X_2,S_2,D_2) \mid x^*_1,w^*_1,s^*_1,s_2] = F(x^*_1,w^*_1,s^*_1,s_2) \leq \mathbb{E}^{\psi}[l_2(X_2,S_2,D_2) \mid s_2]$.

Since this is true for every $s_2 \in \mathcal{S}$, $\mathbb{E}^{\psi'}[l_2(X_2,S_2,D_2)] \leq \mathbb{E}^{\psi}[l_2(X_2,S_2,D_2)]$ and thus $J(\psi') \leq J(\psi)$.

I.0.2 Proof of Lemma 10.9

The existence of $\psi'$ is shown by construction. Let $f'_1 = f_1$, $g'^t_1 = g^t_1$, $g'^r_1 = g^r_1$, $f'_3 = f_3$, $g'^t_3 = g^t_3$ and $g'^r_3 = g^r_3$. Thus,

$\mathbb{E}^{\psi}[l_1(X_1,S_1,D_1)] = \mathbb{E}^{\psi'}[l_1(X_1,S_1,D_1)]$.

Define

$G(x_1,w_1,s_{1:2}) := \mathbb{E}[l_2(X_2,S_2,D_2) \mid x_1,w_1,s_{1:2}] + \mathbb{E}[l_3(X_3,S_3,D_3) \mid x_1,w_1,s_{1:2}]$,

and

$(x^*_1,s^*_1,w^*_1) := \arg\min_{(x_1,s_1,w_1)} G(x_1,w_1,s_{1:2})$.

Consider the event $S_2 = s_2$. Using the tower property, we have

$\mathbb{E}^{\psi}[l_2(X_2,S_2,D_2) + l_3(X_3,S_3,D_3) \mid s_2] = \mathbb{E}[G(X_1,W_1,S_{1:2}) \mid s_2] \geq G(x^*_1,w^*_1,s^*_1,s_2)$.

Let $y^*_1 = q_1(f_1(x^*_1,s^*_1),w^*_1)$, and define

$f'_2(x_2,s_2) := f_2(x^*_1,x_2,s^*_1,s_2)$
$g'^t_2(x_2,s_2) := g^t_2(x^*_1,x_2,s^*_1,s_2)$
$g'^r_2(y_2,s_2) := g^r_2(y^*_1,y_2,s^*_1,s_2)$.

Notice that $\psi'$ is a valid strategy; this can be shown in exactly the same way as in the proof of the two-stage lemma. Under the strategy $\psi'$, the variables $X_2$, $D_2$, $X_3$, $S_3$ and $D_3$ are jointly independent of $X_1$, $S_1$ and $W_1$ conditioned on $S_2 = s_2$. Thus, we have

$\mathbb{E}^{\psi'}[l_2(X_2,S_2,D_2) + l_3(X_3,S_3,D_3) \mid s_2] = \mathbb{E}^{\psi'}[l_2(X_2,S_2,D_2) + l_3(X_3,S_3,D_3) \mid x^*_1,w^*_1,s^*_1,s_2] = \mathbb{E}^{\psi}[l_2(X_2,S_2,D_2) + l_3(X_3,S_3,D_3) \mid x^*_1,w^*_1,s^*_1,s_2] = G(x^*_1,w^*_1,s^*_1,s_2) \leq \mathbb{E}^{\psi}[l_2(X_2,S_2,D_2) + l_3(X_3,S_3,D_3) \mid s_2]$.

Since this is true for every $s_2 \in \mathcal{S}$, $\mathbb{E}^{\psi'}[l_2(X_2,S_2,D_2) + l_3(X_3,S_3,D_3)] \leq \mathbb{E}^{\psi}[l_2(X_2,S_2,D_2) + l_3(X_3,S_3,D_3)]$ and thus $J(\psi') \leq J(\psi)$.
I.0.3 Proof of Theorem 10.3

Let $\psi$ be a strategy that achieves coordination. Consider the two-stage problem defined as follows:

$\tilde{X}_1 = X_{1:N-1}$; $\tilde{X}_2 = X_N$
$\tilde{S}_1 = S_{1:N-1}$; $\tilde{S}_2 = S_N$
$\tilde{Y}_1 = Y_{1:N-1}$; $\tilde{Y}_2 = Y_N$
$\tilde{f}_1 = f_{1:N-1}$; $\tilde{f}_2 = f_N$
$\tilde{g}^t_1 = g^t_{1:N-1}$; $\tilde{g}^t_2 = g^t_N$
$\tilde{g}^r_1 = g^r_{1:N-1}$; $\tilde{g}^r_2 = g^r_N$
$\tilde{l}_1(\tilde{X}_1,\tilde{S}_1,\tilde{D}_1) = \sum_{n=1}^{N-1} l_n(X_n,S_n,D_n)$
$\tilde{l}_2(\tilde{X}_2,\tilde{S}_2,\tilde{D}_2) = l_N(X_N,S_N,D_N)$.

Let the collection of strategies in the two-stage system be $\tilde{\psi}$. Using the two-stage lemma, we can say that there exists another strategy $\tilde{\psi}'$ for the two-stage system such that the strategies at stage 1 are unchanged and

$\tilde{U}_2 = \tilde{f}'_2(\tilde{X}_2,\tilde{S}_2)$
$\tilde{D}_2 = \tilde{g}'^t_2(\tilde{X}_2,\tilde{S}_2) = \tilde{g}'^r_2(\tilde{Y}_2,\tilde{S}_2)$
$J(\tilde{\psi}') \leq J(\tilde{\psi})$.

Let $\psi'$ be the strategy for the $N$-stage system which is formed by replacing the strategies of $\psi$ at stage $N$ with those at stage 2 in $\tilde{\psi}'$, leaving all other strategies unchanged. Clearly,

$J(\psi) = J(\tilde{\psi}) \geq J(\tilde{\psi}') = J(\psi')$.

Thus, at stage $N$ there is no loss of optimality in using strategies that depend only on $X_N, S_N$ at the transmitter and $Y_N, S_N$ at the receiver.

Now, let $\psi$ be a strategy such that, at any stage $i > n$ ($1 < n < N$), $U_i$ and $D_i$ are functions of $X_i, S_i$ at the transmitter, and $D_i$ is a function of $Y_i, S_i$ at the receiver. Consider a three-stage problem for $1 < n < N$ defined as follows:

$\tilde{X}_1 = X_{1:n-1}$; $\tilde{X}_2 = X_n$; $\tilde{X}_3 = X_{n+1:N}$
$\tilde{S}_1 = S_{1:n-1}$; $\tilde{S}_2 = S_n$; $\tilde{S}_3 = S_{n+1:N}$
$\tilde{Y}_1 = Y_{1:n-1}$; $\tilde{Y}_2 = Y_n$; $\tilde{Y}_3 = Y_{n+1:N}$
$\tilde{f}_1 = f_{1:n-1}$; $\tilde{f}_2 = f_n$; $\tilde{f}_3 = f_{n+1:N}$
$\tilde{g}^t_1 = g^t_{1:n-1}$; $\tilde{g}^t_2 = g^t_n$; $\tilde{g}^t_3 = g^t_{n+1:N}$
$\tilde{g}^r_1 = g^r_{1:n-1}$; $\tilde{g}^r_2 = g^r_n$; $\tilde{g}^r_3 = g^r_{n+1:N}$
$\tilde{l}_1(\tilde{X}_1,\tilde{S}_1,\tilde{D}_1) = \sum_{i=1}^{n-1} l_i(X_i,S_i,D_i)$
$\tilde{l}_2(\tilde{X}_2,\tilde{S}_2,\tilde{D}_2) = l_n(X_n,S_n,D_n)$
$\tilde{l}_3(\tilde{X}_3,\tilde{S}_3,\tilde{D}_3) = \sum_{i=n+1}^{N} l_i(X_i,S_i,D_i)$.

Let the collection of strategies in the three-stage system be $\tilde{\psi}$. Notice that, by Lemma 10.7, the three-stage system retains the conditional independence structure in Assumption 10.2, and $\tilde{\psi}$ satisfies the conditions required for using the three-stage lemma. Thus, we can say that there exists another strategy $\tilde{\psi}'$ for the three-stage system such that the strategies at stages 1 and 3 are unchanged, and

$\tilde{U}_2 = \tilde{f}'_2(\tilde{X}_2,\tilde{S}_2)$
$\tilde{D}_2 = \tilde{g}'^t_2(\tilde{X}_2,\tilde{S}_2) = \tilde{g}'^r_2(\tilde{Y}_2,\tilde{S}_2)$
$J(\tilde{\psi}') \leq J(\tilde{\psi})$.

Let $\psi'$ be the strategy for the $N$-stage system which is formed by replacing the strategies of $\psi$ at stage $n$ with those at stage 2 in $\tilde{\psi}'$, leaving all other strategies unchanged. Clearly,

$J(\psi) = J(\tilde{\psi}) \geq J(\tilde{\psi}') = J(\psi')$.

Thus, at stage $n$ there is no loss of optimality in using strategies that depend only on $X_n, S_n$ at the transmitter and $Y_n, S_n$ at the receiver, given that the strategies at later stages are also memoryless in a similar manner.

Let $\psi^{N+1}$ be an optimal strategy for the given system. Based on the two arguments stated above, one can find a strategy $\psi^i$, starting from $i = N$ down to $i = 1$ in a backward inductive manner, such that $J(\psi^i) \leq J(\psi^{i+1})$ and, at any iteration $i$, all the strategies at every time instant $j \geq i$ are memoryless. Hence, every strategy in $\psi^1$ is memoryless and

$J(\psi^1) \leq J(\psi^{N+1}) \leq J(\psi)$, for every $\psi \in \Psi$.
I.0.4 Proof of Theorem 10.4

Because of Theorem 10.3, we can restrict our attention to memoryless coordination strategies that use only current information; let $\psi$ be one such strategy. Then

$J(\psi) = \mathbb{E}^{\psi}\Big[\sum_{n=1}^N l_n(X_n,S_n,D_n)\Big]$
$= \sum_{n=1}^N \mathbb{E}\big[\mathbb{E}^{(f_n,g^t_n,g^r_n)}[l_n(X_n,S_n,D_n) \mid S_n]\big]$
$= \sum_{n=1}^N \mathbb{E}\big[\mathbb{E}[l_n(X_n,S_n,g^t_n(X_n,S_n)) \mid S_n]\big]$
$\geq \sum_{n=1}^N \mathbb{E}\big[\mathbb{E}[l_n(X_n,S_n,g^{*t}_n(X_n,S_n)) \mid S_n]\big]$
$= \sum_{n=1}^N \mathbb{E}\big[\mathbb{E}^{(f^*_n,g^{*t}_n,g^{*r}_n)}[l_n(X_n,S_n,D_n) \mid S_n]\big]$
$= J(\psi^*)$.

Note that the inequality above follows from the definition of $g^{*t}_n$ as the best single-stage decision strategy satisfying the coordination constraints. Also, the strategies $f_n$ and $g^r_n$ do not appear explicitly in the equations above; however, they are implicitly present in the form of the coordination constraints.