Learning and Decision Making in Networked Systems

by

Mukul Gagrani

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL ENGINEERING)

December 2020

Copyright 2020 Mukul Gagrani

To My Family.

Acknowledgements

First and foremost, I would like to express my gratitude to my advisor, Prof. Ashutosh Nayyar. I would like to thank him for mentoring me and for inspiring me to always strive for the highest quality of research. I never cease to be amazed by his sharp insights, extraordinary foresight and clarity of thought. His meticulous feedback on my writings and presentations has helped me to improve significantly. I want to thank him for his constant support and guidance over the course of my graduate studies. I will fondly remember our meetings and discussions related to both research and life.

I would also like to thank Prof. Rahul Jain for his constant mentorship over the past years and for taking a deep interest in my research. He introduced me to a very interesting research topic which turned out to be quite fruitful. I am grateful to my qualifying exam and dissertation committee members Prof. Ketan Savla, Prof. Bhaskar Krishnamachari and Prof. Urbashi Mitra for their constructive suggestions and queries, which improved the quality of this dissertation.

This work was a result of collaboration with some brilliant researchers. Special thanks to Yi Ouyang for being a senior mentor to me and for generously sharing his knowledge with me while we worked together on the learning problem. I would also like to thank Marcos Vasconcelos for his valuable advice on research life and for always being ready to lend a helping hand in and outside of our research work. Thanks to all my labmates and colleagues at USC, Mohammad, Shiva, Dhruva and Xi for many valuable discussions and hearty interactions.

My time in Los Angeles was made much more enjoyable by a wonderful group of friends. I would like to thank Gaurav for being a great roommate and for being my lunch, dinner & tea partner. Thanks to Mayank for coming down from San Diego for the fun weekends. I would also like to thank Kartik, Saket, Mayank Anand, Vamsi and Ajinkya for all the memorable times we spent together.

I am grateful to my family for their unconditional love and support through all my endeavors. I am indebted to my parents for their sacrifices which have enabled me to be what I am today. Thanks to my sister Yukta who inspires me to be a better version of myself. Finally, I would like to thank my wife for always putting my happiness above her own and for pushing me to aim higher. Thank you Shivangi for all the joy you have brought in my life and for standing with me through all the ups and downs in this thing called life.

Contents

Acknowledgements
Contents
List of Tables
List of Figures
Abstract

1 Introduction
1.1 Some key concepts
1.1.1 Thompson Sampling
1.1.2 Common Information Approach
1.2 Organization and Contributions of the thesis
1.2.1 Learning to control an unknown MDP
1.2.2 Learning to control an unknown Linear system
1.2.3 Thompson Sampling for some decentralized control problems
1.2.4 Networked Estimation
1.2.5 Worst case guarantees for remote estimation
1.2.6 Decentralized minimax control problems with partial history sharing
1.2.7 Weakly coupled constrained Markov decision process in Borel spaces

2 Learning to Control an Unknown MDP
2.1 Introduction
2.2 Problem Formulation
2.2.1 Preliminaries
2.2.2 Reinforcement Learning for Weakly Communicating MDPs
2.3 Thompson Sampling with Dynamic Episodes
2.3.1 Main Result
2.3.2 Approximation Error
2.4 Analysis
2.4.1 Number of Episodes
2.4.2 Regret Bound
2.5 Simulations
2.6 Conclusion

3 Learning to Control an Unknown Linear System
3.1 Introduction
3.2 Problem Formulation
3.2.1 Preliminaries: Stochastic Linear Quadratic Control
3.2.2 Reinforcement Learning with Stationary Parameter
3.3 Thompson Sampling Based Control Policies
3.4 Regret of the TSDE-LQ Algorithm
3.5 Simulations
3.6 Mean field LQ learning
3.6.1 Problem formulation
3.6.2 Preliminaries: Optimal Control for known θ
3.6.3 Naive Thompson Sampling
3.6.4 Algorithm
3.6.5 Posterior update
3.6.6 Regret Analysis
3.7 Conclusion

4 Thompson Sampling for some Decentralized Control Problems
4.1 Introduction
4.1.1 Notation
4.2 Problem 1: Decoupled dynamics
4.3 Thompson Sampling
4.3.1 Algorithm
4.3.2 Analysis
4.4 Problem 2: One step delayed sharing
4.4.1 Algorithm
4.4.2 Analysis
4.5 Conclusion

5 Networked Estimation
5.1 Introduction
5.1.1 Related literature, connections with prior work and contributions
5.1.2 Notation
5.2 Network Estimation: IID Case
5.2.1 Problem statement
5.2.1.1 Signaling
5.2.2 Main result
5.2.3 Information Structures
5.2.3.1 Information structure expansion
5.2.4 An equivalent problem with a coordinator
5.2.4.1 Dynamic program
5.2.5 Solving the dynamic program
5.2.6 Computation of optimal thresholds
5.2.7 Illustrative examples
5.2.7.1 Optimal blind scheduling
5.2.8 Extensions
5.2.8.1 The N sensor case
5.2.8.2 Unequal weights and communication costs
5.3 Network Estimation: Markov Case
5.3.1 Problem formulation
5.3.2 Optimal estimation strategy
5.3.3 Optimal scheduling strategy
5.3.4 Numerical example
5.3.5 Structural Results
5.3.6 Characterizing no transmission region
5.3.7 Discussion
5.4 Conclusions

6 Worst Case Guarantees for Remote Estimation
6.1 Introduction
6.2 Minimax Control with Maximum Instantaneous Cost Objective
6.3 Problem Formulation
6.4 An Equivalent Problem
6.5 Globally optimal strategies
6.5.1 Homogenous noise
6.6 Conclusion

7 Decentralized Minimax Control Problems with Partial History Sharing
7.1 Introduction
7.1.1 Notation
7.1.2 Organization
7.2 System Model
7.3 Coordinator's Problem
7.3.1 The coordinated system
7.3.2 Equivalence between the two models
7.3.3 Centralized minimax control problem
7.4 Optimal Strategies in Problem 7.1
7.5 Additive cost
7.6 Generalization of the model
7.7 Example
7.8 Conclusion

8 Weakly Coupled Constrained Markov Decision Process in Borel Spaces
8.1 Introduction
8.2 Problem formulation
8.2.1 Discussion
8.3 Optimal Strategies
8.4 Constrained Linear Quadratic Systems
8.5 Numerical Experiments
8.6 Conclusion

9 Concluding Remarks
9.1 Summary
9.1.1 Learning
9.1.2 Estimation
9.1.3 Decentralized control
9.2 Future Directions
9.2.1 Decentralized Learning
9.2.2 Decentralized control with constraints

Bibliography

Appendices
A Appendix: Learning to Control an unknown MDP
A.1 Bound on the number of macro episodes
A.2 Proof of Lemma 3.6
A.3 Proof of Lemma 3.7
A.4 Proof of Theorem 2.2
B Appendix: Learning to Control an unknown Linear System
B.1 Centralized LQR
B.2 Mean-field LQ
C Appendix: Thompson Sampling for some Decentralized Control Problems
C.1 Proof of Lemma 4.1
D Appendix: Networked Estimation - IID case
D.1 Auxiliary results
D.2 Proof of Lemma 5.5
D.2.1 Empty battery
D.2.2 Nonempty battery
D.3 Optimal thresholds for the asymmetric case
E Appendix: Networked Estimation - Markov Case
E.1 Proof of Lemma 5.8
E.2 Proof of Lemma 5.9
E.3 Proof of Lemma 5.11
F Appendix: Worst Case Guarantees for remote estimation
F.1 Proof of Theorem 6.1
F.2 Proof of Lemma 6.1
F.3 Proof of Lemmas 6.4 and 6.5
F.4 Proof of Theorem 6.2
F.5 Proof of Lemma 6.9
G Appendix: Weakly coupled constrained MDP in Borel spaces
G.1 Proof of Theorem 8.1
G.2 Proof of Theorem 8.2 and Lemma 8.1

List of Tables

7.1 Optimal cost incurred for L = 8, T = 4, m = 2, d = 2, C = 10.

List of Figures

2.1 Expected Regret vs Time for random MDPs
2.2 Expected Regret vs Time for RiverSwim
3.1 Scalar Systems
3.2 Multi-Dimensional Systems
3.3 Expected log regret for a scalar system under exponential schedule
5.1 Schematic diagram for the remote sensing system with two sensor-estimator pairs and an energy harvesting scheduler.
5.2 Optimal threshold function for the scheduling of two i.i.d. standard Gaussian sources. The threshold is a function of the energy level and time.
5.3 Comparison between the performances of the optimal open-loop and closed-loop strategies as a function of the battery capacity, B. The relative gap between these two curves is defined as the Value of Information.
5.4 Optimal performance J* of the systems with and without harvesting of Examples 1 and 2 as a function of the communication cost c.
5.5 Decision region for the network manager at t = 1
6.1 Remote Estimation setup
6.2 Timeline of state realization, observations and actions in coordinator's problem
7.1 Time ordering of Observations, Actions and Memory updates.
7.2 Time ordering of Observations, Actions and Memory updates with common observation
7.3 Optimal cost incurred for L = 8, T = 4, m = 2, d = 2, C = 10, X_0^1 = 4
8.1 Trajectory of the running average cost
8.2 Trajectory of the running average constraint function

Abstract

Networked systems are ubiquitous in today's world. Such systems consist of agents who have to make a series of decisions in order to achieve a common goal. There are two key challenges in the design of optimal decision strategies for the agents: i) Uncertainty: Agents have to make decisions in the presence of uncertainty. The uncertainty could manifest itself in three different forms: stochastic uncertainty, model uncertainty and adversarial uncertainty. ii) Decentralization of decision making: The agents may have different information about the network, the environment and each other and need to make decentralized decisions using only their local information.
In this thesis, we consider instances of sequential decision making problems under different types of uncertainty and information structures in the network. In the context of model uncertainty, we consider the problem of controlling an unknown Markov Decision Process (MDP) and Linear Quadratic (LQ) system in a single agent setting. We pose this as an online learning problem and propose a Thompson sampling (TS) based algorithm for regret minimization. We show that the regret achieved by our proposed algorithm is order optimal up to logarithmic factors. In the multi-agent setting, we consider a mean-field LQ problem with unknown dynamics. We propose a TS based algorithm and derive theoretical guarantees on the regret of our scheme. Finally, we also study a TS algorithm for a multi-agent MDP with two classes of information structure and dynamics.

In the context of stochastic uncertainty, we study a networked estimation problem with multiple sensors and non-collocated estimators. We study the joint design of scheduling strategy for the scheduler and estimation strategies for the estimators. This leads to a sequential team problem with non-classical information structure. We characterize the jointly optimal scheduling and estimation strategies under the two models of sensor state dynamics: i) IID dynamics, ii) Markov dynamics. We also study a weakly coupled constrained MDP in Borel spaces. This is a decentralized control problem with constraints. We derive the optimal decentralized strategies using an occupation measure based linear program. We further consider the special case of multi-agent LQ systems and show that the optimal control strategies can be obtained by solving a semi-definite program (SDP).

In the context of adversarial uncertainty, we first look at a sequential remote estimation problem of finding a scheduling strategy for the sensor and an estimation strategy for the estimator to jointly minimize the worst-case maximum instantaneous estimation error over a finite time horizon. We obtain a complete characterization of optimal strategies for this decentralized minimax problem. Finally, we consider a broader class of minimax sequential team problems with the partial history sharing information structure, for which we characterize the optimal decision strategies for the agents using a dynamic programming decomposition.

Chapter 1
Introduction

Networked systems are ubiquitous in today's world. Modern engineering systems such as Cyber-Physical Systems (CPS), Network Control Systems (NCS), Wireless Sensor Networks, Power Systems etc. are all examples of networked systems [1–5]. Decision-making is an important aspect of such systems. These networked systems consist of agents who have to make a series of decisions in order to achieve a common goal. For example, in a sensor network, the sensors have to decide when to take measurements and when to communicate the measurements to the base station. A self-driving car has to decide steering controls in order to navigate successfully in its environment. In a smart grid, multiple sources of power generation have to decide how much energy to generate so that collectively they can satisfy the total energy demand while minimizing their total cost of production. The agents in these networks map their information using a decision strategy to make their decisions. Finding good decision strategies for such networks is important for good performance, efficient use of resources and avoidance of system failures.

The focus of this thesis is to find optimal/near-optimal decision strategies for some instances of decision-making problems in networked systems. One of the key challenges in the design of optimal decision strategies is the presence of uncertainty. The uncertainty could manifest itself in multiple forms. Some common types of uncertainty are:

1. Stochastic/Aleatoric Uncertainty: Stochastic or aleatoric uncertainty is present in the form of random noise in the dynamics of the environment and noisy observations at the agents. Stochastic team theory provides a methodology for designing optimal decision making strategies for a group of agents in the presence of stochastic uncertainty. However, most of the results in team theory apply to special types of information structures for the agents (e.g., the partially nested information structure) and do not extend readily to general decentralized decision making problems.

2. Model/Epistemic Uncertainty: In practice, the environment in which the agents are interacting is often not known precisely to them. Such uncertainty could be present in the form of unknown system dynamics/parameters, uncertainty about the other agents in the network, uncertainty about the network topology, etc. Optimal strategy design depends on the underlying model of the system, which makes finding optimal decision strategies even more challenging in the presence of model uncertainty.

3. Adversarial Uncertainty: If a set of agents starts acting against the common goal/interest of the network, then we have a system with adversarial uncertainty. This type of uncertainty could be present in the form of adversarial attacks on the network by agents who have turned malicious. Designing decision strategies which can make the system robust against adversarial uncertainty is crucial for the safety and efficiency of the network.

In the case of multi-agent systems, the agents in networked systems interact with the environment through their decisions and interact with each other by exchanging information through a communication network. Each agent has different information about the network based on its interactions. This leads to decentralized decision making at the agents, since the agents have to make decisions using only their local information. This decentralization renders the problem of finding optimal decision making strategies challenging in general. Also, since the agents interact with each other and the network over a period of time, the decision making at the agents is sequential in nature. Then, in addition to minimizing/maximizing a system objective, the decisions of an agent also determine the quality of information it receives and the quality of information other agents receive at future time instants. This dual aspect of decision making in sequential problems makes them non-trivial.

In this thesis, we study optimal strategy design for sequential decision-making problems in single/multi-agent systems under the three aforementioned types of uncertainty. In the context of model uncertainty, we consider the problem of controlling an unknown Markov Decision Process (MDP) and an unknown Linear Quadratic (LQ) system in a single agent setting in Chapters 2 and 3. In the multi-agent setting, we investigate the problem of learning to control a mean-field LQ system in Chapter 3 and also look at some decentralized learning and control problems in multi-agent MDPs in Chapter 4. In the context of stochastic uncertainty, we study a networked estimation problem under two different models of the source dynamics in Chapter 5. We also study a decentralized control problem with constraints in Chapter 8. Finally, in the context of adversarial uncertainty, we look at a sequential remote estimation problem in Chapter 6 and then consider a more general class of decentralized minimax control problems in Chapter 7.

1.1 Some key concepts

In this section, we introduce two key concepts which play a central role in different parts of this thesis.

1.1.1 Thompson Sampling

In Chapters 2-4, we consider sequential decision making problems with model uncertainty. For real-world systems, it is hardly the case that the model and its parameters are known precisely to the agent(s). Typically, only a set in which the model parameters lie is known. Furthermore, for many problems, we do not have the luxury of first performing system identification and then using the identified model to design the control strategy. The agent(s) want to maintain adequate control in the presence of uncertainty about the true model of the system. We refer to this problem as the problem of "learning" to control dynamical systems. This problem is well known as the adaptive control problem [6]. Classical adaptive control [6–10] mostly provides asymptotic guarantees on the performance of the agents. However, our objective is to learn the model parameters and the corresponding optimal controller simultaneously at the fastest possible non-asymptotic rate.

We take an online learning [11] approach in Chapters 2-4 for learning the optimal control strategy. Thompson sampling [12] has recently emerged as a popular online learning approach due to its good performance in online learning problems [13, 14]. Thompson sampling (TS) is a Bayesian approach where the agent(s) start with a prior distribution on the unknown parameters of the model and maintain a posterior on the basis of their information at each time. Then, at certain carefully chosen times, the agent generates a random sample from the posterior and applies the optimal strategy corresponding to the generated sample. In Chapters 2-4, we design TS based algorithms with dynamic sampling schedules which are provably order optimal in terms of the rate of learning the model parameters.

1.1.2 Common Information Approach

In Chapters 5-7, we study sequential decision making problems with multiple agents. Multi-agent decision making problems where the agents have a common objective are also referred to as team problems. One important aspect of a team problem is its information structure, which describes what information is available to each decision maker. The information structure is said to be "classical" if each decision maker knows all the information available to all the agents that acted before it. The information structure is said to be "non-classical" if it is not classical.

Sequential team problems with non-classical information structure are usually nonconvex and are, in general, difficult to solve. The common information approach [15] provides a principled way to solve such team problems. The core idea of this approach is to first convert the team problem into an equivalent single-agent partially observed Markov decision process (POMDP). The information available to the agent in the equivalent POMDP is the common information among the agents in the original team problem.
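Before continuing with the common information approach, the Thompson sampling idea of Section 1.1.1 can be made concrete with a small, self-contained sketch. The two-armed Bernoulli bandit, its success probabilities, and the Beta/Bernoulli conjugate model below are illustrative assumptions and are not taken from this thesis; the point is only the prior-sample-act-update cycle. The dynamic sampling schedules designed in Chapters 2-4 replace the "resample at every step" rule used here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-armed Bernoulli bandit; these success probabilities are
# placeholders for illustration, not values used anywhere in the thesis.
true_success_prob = np.array([0.4, 0.6])

# Beta(1, 1) priors on each arm's unknown success probability.
alpha = np.ones(2)
beta = np.ones(2)

for t in range(1000):
    # Thompson sampling: draw one sample of the unknown parameters from the
    # current posterior and act optimally with respect to that sample.
    theta_sample = rng.beta(alpha, beta)
    arm = int(np.argmax(theta_sample))

    # Observe the outcome and update the posterior of the chosen arm (Bayes' rule).
    reward = float(rng.random() < true_success_prob[arm])
    alpha[arm] += reward
    beta[arm] += 1.0 - reward
```

Over time the posterior concentrates on the better arm, so the sampled parameters select it with increasing probability; this is exactly the exploration-exploitation mechanism described above. Returning to the common information approach: the team problem has been recast as a single-agent POMDP over the common information.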
Then, the equivalent POMDP is solved using tools from Markov decision theory [6] and finally this solution is transformed to obtain the optimal decentralized strategies for the original team problem. We use common information approach to find the optimal decision strategies in the problem of estimation over a network in Chapter 5. We also generalize and use the idea of common information approach for team problems with adversarial noise in Chapters 6 and 7. 1.2 Organization and Contributions of the thesis Next, we describe the problems studied in this thesis and briefly summarize the main ideas. 1.2.1 Learning to control an unknown MDP In Chapter 2, we consider the problem of learning to control an unknown MDP in the infinite horizon setting. The agent is interested to minimizing its accumulated cost when 6 the transition probabilities of the MDP are not known. Variations of this problem have been studied in adaptive control and reinforcement learning [10, 16–19]. We take an online learning approach and pose this problem as a regret minimization prob- lem. We propose a Thompson/Posterior sampling based decision making algorithm for the agent. We provide non-asymptotic guarantees on the regret of the proposed algorithm and show that the regret is order optimal with respect to the time horizon up to logarithmic factors. 1.2.2 Learning to control an unknown Linear system In Chapter 3, we move on to the problem of controlling an LQ system in the infinite horizon setting when the system dynamics are unknown. We extend the idea of Thompson sampling to the single agent LQ problem and provide theoretical guarantees on the regret of our proposed algorithm. Later in the chapter, we also consider the multi-agent problem of learning a mean-field LQ system with unknown dynamics. By carefully exploiting the structure of the mean-field problem, we design a low complexity TS algorithm. We show that the regret of our scheme is order optimal with respect to the time horizon up to logarithmic factors and does not grow with the number of agents. 1.2.3 Thompson Sampling for some decentralized control problems In Chapter 4, we consider a team learning problem over an infinite time horizon under two different dynamics and information sharing models: i) Decoupled dynamics with no 7 information sharing, ii) Coupled dynamics with one-step delayed information sharing. The state transition kernels are parametrized by an unknown but fixed parameter taking values in a finite space. We study a decentralized Thompson sampling based approach to learn the underlying parameter where each agent maintains a belief about the underlying parameter. The agents draw a sample from their beliefs at each time and select their actions using the benchmark policy for the sampled parameter. We show that under some assumptions on the state transition kernels, the regret achieved by Thompson sampling is upper bounded by a constant independent of the time horizon. 1.2.4 Networked Estimation In Chapter 5, we consider the problem of networked estimation over a finite time horizon with multiple sensors and non-collocated estimators. Each sensor is observing a state of interest which is modeled as a stochastic process. Each estimator is interested in forming real-time estimates of the state of its corresponding sensor. The sensors can communicate their state to the corresponding estimator via a shared communication network. 
In order to avoid packet collisions, an energy harvesting scheduler decides at each time which sensor’s state (if any) will be transmitted over the network based on the realization of the states. The communication is energy consuming and costly. We study the joint design of scheduling strategy for the scheduler and estimation strategies for the estimators in order to minimize the sum of communication cost and estimation error. This leads to a sequential decentralized decision making problem. 8 First, we consider the case when the state process is i.i.d in time. We completely characterize the jointly optimal scheduling and estimation strategies in this case. We show that a scheduling strategy which transmits the state of the sensor with the highest norm if it exceeds a certain threshold is optimal. We also show that the optimal estimate at each estimator in the absence of a received observation is the expected value of the state process. Our approach consists of first relaxing the information constraints in the problem and then using ideas from the common information approach [20] we derive the optimal strategies. Next, we move on to the case when the state process is evolving in time with Markovian dynamics. In this case, under some restrictions on the space of scheduling strategies we characterize the optimal scheduling and estimation strategies. We show that the most recently received observation at the estimator is the optimal estimate and provide a dynamic program to characterize the optimal scheduling strategy. 1.2.5 Worst case guarantees for remote estimation In Chapter 6 we switch gears and move to a decision making problem with adversarial noise. Specifically we look at a remote estimation problem with a single sensor and a corresponding remote estimator. The sensor wants to communicate the state of an uncertain source to the estimator over a finite time horizon. The uncertain source is modeled as an autoregressive process with bounded noise. Given that the sensor has a limited communication budget, the sensor must decide when to transmit the state to the estimator who has to produce real-time estimates of the source state. We consider the problem of finding a scheduling strategy for the sensor and an estimation strategy for the estimator to jointly minimize the worst-case 9 maximum instantaneous estimation error over the time horizon. This leads to a two-agent decentralized minimax decision-making problem. We obtain a complete characterization of optimal strategies for this decentralized minimax problem. In particular, we show that an open loop communication scheduling strategy is optimal and the optimal estimate depends only on the most recently received sensor observation. 1.2.6 Decentralized minimax control problems with partial history shar- ing In Chapter 7, we move on to a broader class of decision making problem with adversarial uncertainty. We consider a decentralized minimax control problem with the partial history sharing information structure. The partial history sharing model is a general decentralized model where (i) controllers sequentially share part of their past data (past observations and control actions) with each other by means of a shared memory; and (ii) all controllers have perfect recall of the shared data (common information). We model the noise as uncertain quantities that take values in some fixed and known finite sets. The objective is to find control strategies that minimize the worst-case cost. We first consider a terminal cost problem. 
We provide a common information based dynamic program for this decentralized problem. The information state in the dynamic program is the set of feasible values of the current state and local information consistent with the information that is commonly known to all controllers. We also extend our results to the case of additive costs and common observations. 10 1.2.7 Weakly coupled constrained Markov decision process in Borel spaces In Chapter 8, we consider a decision making problem with constraints. In particular, we study a multi-agent stochastic control problem where the agents have decoupled system dy- namics. Each agent has an associated cost function and a constraint function. The agents want to find decentralized control strategies which minimizes their long term average cu- mulative cost function while keeping the long term average cumulative constraint function below a certain threshold. This problem is referred to as weakly coupled constrained Markov decision process (MDP). In this chapter, we consider the problem of weakly coupled con- strained MDP with Borel state and action spaces. We use the linear programming (LP) based approach of [21] to derive an occupation measure based LP for finding the opti- mal decentralized control strategies for our problem. We show that randomized stationary policies are optimal for each agent under some assumptions on the transition kernels, cost and the constraint functions. We further consider the special case of multi-agent Linear Quadratic Gaussian (LQG) systems and show that the optimal control strategies could be obtained by solving a semi-definite program (SDP). We illustrate our results through numerical experiments. 11 Chapter 2 Learning to Control an Unknown MDP 2.1 Introduction We consider the problem of reinforcement learning by an agent interacting with an environ- ment while trying to minimize the total cost accumulated over time. The environment is modeled by an infinite horizon Markov Decision Process (MDP) with finite state and action spaces. When the environment is perfectly known, the agent can determine optimal actions by solving a dynamic program for the MDP [22]. In reinforcement learning, however, the agent is uncertain about the true dynamics of the MDP. A naive approach to an unknown model is the certainty equivalence principle. The idea is to estimate the unknown MDP pa- rameters from available information and then choose actions as if the estimates are the true 12 parameters. But it is well-known in adaptive control theory that the certainty equivalence principle may lead to suboptimal performance due to the lack of exploration [6]. This issue actually comes from the fundamental exploitation-exploration trade-off: the agent wants to exploit available information to minimize cost, but it also needs to explore the environment to learn system dynamics. One common way to handle the exploitation-exploration trade-off is to use the optimism in the face of uncertainty (OFU) principle [23]. Under this principle, the agent constructs confidence sets for the system parameters at each time, find the optimistic parameters that are associated with the minimum cost, and then selects an action based on the optimistic parameters. The optimism procedure encourages exploration for rarely visited states and actions. Several optimistic algorithms are proved to possess strong theoretical performance guarantees [10, 16–19, 24, 25]. An alternative way to incentivize exploration is the Thompson Sampling (TS) or Posterior Sampling method. 
The idea of TS was first proposed by Thompson in [12] for stochas- tic bandit problems. It has been applied to MDP environments [26–31] where the agent computes the posterior distribution of unknown parameters using observed information and a prior distribution. A TS algorithm generally proceeds in episodes: at the beginning of each episode a set of MDP parameters is randomly sampled from the posterior distribution, then actions are selected based on the sampled model during the episode. TS algorithms have the following advantages over optimistic algorithms. First, TS algorithms can easily incorporate problem structures through the prior distribution. Second, they are more com- putationally efficient since a TS algorithm only needs to solve the sampled MDP, while an 13 optimistic algorithm requires solving all MDPs that lie within the confident sets. Third, empirical studies suggest that TS algorithms outperform optimistic algorithms in bandit problems [13, 14] as well as in MDP environments [27, 30, 31]. Due to the above advantages, we focus on TS algorithms for the MDP learning problem. The main challenge in the design of a TS algorithm is the lengths of the episodes. For finite horizon MDPs under the episodic setting, the length of each episode can be set as the time horizon [27]. When there exists a recurrent state under any stationary policy, the TS algorithm of [29] starts a new episode whenever the system enters the recurrent state. However, the above methods to end an episode can not be applied to MDPs without the special features. The work of [30] proposed a dynamic episode schedule based on the doubling trick used in [18], but a mistake in their proof of regret bound was pointed out by [32]. In view of the mistake in [30], there is no TS algorithm with strong performance guarantees for general MDPs to the best of our knowledge. We consider the most general subclass of weakly communicating MDPs in which meaningful finite time regret guarantees can be analyzed. We propose the Thompson Sampling with Dynamic Episodes (TSDE) learning algorithm. In TSDE, there are two stopping criteria for an episode to end. The first stopping criterion controls the growth rate of episode length. The second stopping criterion is the doubling trick similar to the one in [18, 19, 24, 25, 30] that stops when the number of visits to any state-action pair is doubled. Under a Bayesian framework, we show that the expected regret of TSDE accumulated up to timeT is bounded by ˜ O(HS √ AT ) where ˜ O hides logarithmic factors. Here S andA are the sizes of the state and action spaces,T is time, andH is the bound of the span. This regret bound matches the 14 best available bound for weakly communicating MDPs [18], and it matches the theoretical lower bound in order ofT except for logarithmic factors. We present numerical results that show that TSDE actually outperforms current algorithms with known regret bounds that have the same order in T for a benchmark MDP problem as well as randomly generated MDPs. 2.2 Problem Formulation 2.2.1 Preliminaries An infinite horizon Markov Decision Process (MDP) is described by (S,A,c,θ). HereS is the state space,A is the action space, c :S×A→ [0, 1] 1 is the cost function, and θ :S 2 ×A→ [0, 1] represents the transition probabilities such that θ(s 0 |s,a) = P(s t+1 = s 0 |s t = s,a t = a) where s t ∈S and a t ∈A are the state and the action at t = 1, 2, 3... . We assume thatS andA are finite spaces with sizesS≥ 2 andA≥ 2, and the initial state s 1 is a known and fixed state. 
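For concreteness in the sketches that accompany this chapter, it helps to fix one possible in-memory representation of a finite MDP (S, A, c, θ). The array layout below, and the helper that samples a random MDP from an independent Dirichlet prior (mirroring the prior used in the simulations of Section 2.5), are illustrative assumptions, not structures prescribed by the thesis.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class FiniteMDP:
    """A finite MDP (S, A, c, theta) with costs normalized to [0, 1]."""
    cost: np.ndarray   # shape (S, A); cost[s, a] = c(s, a)
    trans: np.ndarray  # shape (S, A, S); trans[s, a, s1] = theta(s1 | s, a)

def random_mdp(S=6, A=2, dirichlet_param=0.1, seed=0):
    """Sample a random finite MDP: uniform random costs and Dirichlet rows."""
    rng = np.random.default_rng(seed)
    cost = rng.random((S, A))
    trans = rng.dirichlet(dirichlet_param * np.ones(S), size=(S, A))
    return FiniteMDP(cost=cost, trans=trans)
```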
A stationary policy is a deterministic map π :S→A that maps a state to an action. The average cost per stage of a stationary policy is defined as J π (θ) = lim sup T→∞ 1 T E h T X t=1 c(s t ,a t ) i . Here we use J π (θ) to explicitly show the dependency of the average cost on θ. To have meaningful finite time regret bounds, we consider the subclass of weakly commu- nicating MDPs defined as follows. 1 SinceS andA are finite, we can normalize the cost function to [0, 1] without loss of generality. 15 Definition 2.1. An MDP is weakly communicating (or weak accessible) if its states can be partitioned into two subsets: in the first subset all states are transient under every stationary policy, and every two states in the second subset can be reached from each other under some stationary policy. From MDP theory [22], we know that if the MDP is weakly communicating, the optimal average cost per stage J(θ) = min π J π (θ) satisfies the Bellman equation J(θ) +v(s,θ) = min a∈A n c(s,a) + X s 0 ∈S θ(s 0 |s,a)v(s 0 ,θ) o (2.1) for alls∈S. The corresponding optimal stationary policyπ ∗ is the minimizer of the above optimization given by a =π ∗ (s,θ). (2.2) Since the cost function c(s,a)∈ [0, 1], J(θ)∈ [0, 1] for all θ. If v satisfies the Bellman equation,v plus any constant also satisfies the Bellman equation. Without loss of generality, let min s∈S v(s,θ) = 0 and define the span of the MDP as sp(θ) = max s∈S v(s,θ). 2 We define Ω ∗ to be the set of all θ such that the MDP with transition probabilities θ is weakly communicating, and there exists a number H such that sp(θ)≤ H. We will focus on MDPs with transition probabilities in the set Ω ∗ . 2 See [18] for a discussion on the connection of the span with other parameters such as the diameter appearing in the lower bound on regret. 16 2.2.2 Reinforcement Learning for Weakly Communicating MDPs We consider the reinforcement learning problem of an agent interacting with a random weakly communicating MDP (S,A,c,θ ∗ ). We assume thatS,A and the cost functionc are completely known to the agent. The actual transition probabilitiesθ ∗ is randomly generated at the beginning before the MDP interacts with the agent. The value of θ ∗ is then fixed but unknown to the agent. The complete knowledge of the cost is typical as in [18, 29]. Algorithms can generally be extended to the unknown costs/rewards case at the expense of some constant factor for the regret bound. At each timet, the agent selects an action according toa t =φ t (h t ) whereh t = (s 1 ,s 2 ,...,s t ,a 1 ,a 2 ,...,a t−1 ) is the history of states and actions. The collection φ = (φ 1 ,φ 2 ... ) is called a learning al- gorithm. The functions φ t allow for the possibility of randomization over actions at each time. We focus on a Bayesian framework for the unknown parameter θ ∗ . Letμ 1 be the prior dis- tribution forθ ∗ , i.e., for any set Θ,P(θ ∗ ∈ Θ) =μ 1 (Θ). We make the following assumptions on μ 1 . Assumption 2.1. The support of the prior distribution μ 1 is a subset of Ω ∗ . That is, the MDP is weakly communicating and sp(θ ∗ )≤H. In this Bayesian framework, we define the expected regret (also called Bayesian regret or Bayes risk) of a learning algorithm φ up to time T as R(T,φ) =E h T X t=1 h c(s t ,a t )−J(θ ∗ ) ii (2.3) 17 where s t ,a t ,t = 1,...,T are generated by φ and J(θ ∗ ) is the optimal per stage cost of the MDP. The above expectation is with respect to the prior distribution μ 1 for θ ∗ , the randomness in state transitions, and the randomized algorithm. 
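Before turning to the learning algorithm, it is worth noting how the quantities J(θ), v(·, θ) and π*(·, θ) defined by (2.1)-(2.2) can be computed for a fixed θ. The sketch below uses relative value iteration with the normalization min_s v(s, θ) = 0; this is one standard choice of planner, assumed here for illustration (the thesis does not commit to a particular solution method), and it is the kind of subroutine the algorithm of the next section invokes at the start of each episode.

```python
import numpy as np

def relative_value_iteration(cost, trans, tol=1e-8, max_iter=100_000):
    """Approximately solve the average-cost Bellman equation (2.1)-(2.2).

    cost:  shape (S, A), entries in [0, 1].
    trans: shape (S, A, S), trans[s, a, s1] = theta(s1 | s, a).
    Returns (J, v, policy) with the normalization min_s v(s) = 0.
    """
    S, A = cost.shape
    v = np.zeros(S)
    J = 0.0
    for _ in range(max_iter):
        # One Bellman backup: Q(s, a) = c(s, a) + sum_{s1} theta(s1 | s, a) v(s1).
        q = cost + trans @ v          # trans @ v has shape (S, A)
        w = q.min(axis=1)
        J = float(w.min())            # running estimate of the optimal average cost
        v_new = w - J                 # keep min_s v(s) = 0, as in Section 2.2.1
        if np.abs(v_new - v).max() < tol:
            v = v_new
            break
        v = v_new
    policy = (cost + trans @ v).argmin(axis=1)   # minimizer in (2.2)
    return J, v, policy
```

At a fixed point, J and v satisfy (2.1) exactly; in practice the iteration is stopped early, which is one source of the approximation error ε discussed later in Section 2.3.2.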
The expected regret is an important metric to quantify the performance of a learning algorithm. 2.3 Thompson Sampling with Dynamic Episodes In this section, we propose the Thompson Sampling with Dynamic Episodes (TSDE) learn- ing algorithm. The input of TSDE is the prior distributionμ 1 . At each timet, given the his- toryh t , the agent can compute the posterior distribution μ t given byμ t (Θ) =P(θ ∗ ∈ Θ|h t ) for any set Θ. Upon applying the action a t and observing the new state s t+1 , the posterior distribution at t + 1 can be updated according to Bayes’ rule as μ t+1 (dθ) = θ(s t+1 |s t ,a t )μ t (dθ) R θ 0 (s t+1 |s t ,a t )μ t (dθ 0 ) . (2.4) Let N t (s,a) be the number of visits to any state-action pair (s,a) before time t. That is, N t (s,a) =|{τ <t : (s τ ,a τ ) = (s,a)}|. (2.5) With these notations, TSDE is described as follows. 18 Algorithm 1 Thompson Sampling with Dynamic Episodes (TSDE) Input: μ 1 Initialization: t← 1, t k ← 0 for episodes k = 1, 2,... do T k−1 ←t−t k t k ←t Generate θ k ∼μ t k and compute π k (·) =π ∗ (·,θ k ) from (2.1)-(2.2) while t≤t k +T k−1 and N t (s,a)≤ 2N t k (s,a) for all (s,a)∈S×A do Apply action a t =π k (s t ) Observe new state s t+1 Update μ t+1 according to (4.5) t←t + 1 end while end for The TSDE algorithm operates in episodes. Let t k be start time of the kth episode and T k =t k+1 −t k be the length of the episode with the conventionT 0 = 1. From the description of the algorithm, t 1 = 1 and t k+1 ,k≥ 1, is given by t k+1 = min{t>t k : t>t k +T k−1 or N t (s,a)> 2N t k (s,a) for some (s,a)}. (2.6) At the beginning of episode k, a parameter θ k is sampled from the posterior distribution μ t k . During each episode k, actions are generated from the optimal stationary policy π k 19 for the sampled parameter θ k . One important feature of TSDE is that its episode lengths are not fixed. The length T k of each episode is dynamically determined according to two stopping criteria: (i) t > t k +T k−1 , and (ii) N t (s,a) > 2N t k (s,a) for some state-action pair (s,a). The first stopping criterion provides that the episode length grows at a linear rate without triggering the second criterion. The second stopping criterion ensures that the number of visits to any state-action pair (s,a) during an episode should not be more than the number visits to the pair before this episode. Remark 2.1. Note that TSDE only requires the knowledge of S, A, c, and the prior distribution μ 1 . TSDE can operate without the knowledge of time horizon T , the bound H on span used in [18], and any knowledge about the actual θ ∗ such as the recurrent state needed in [29]. 2.3.1 Main Result Theorem 2.1. Under Assumption 2.1, R(T, TSDE)≤ (H + 1) q 2SAT log(T ) + 49HS q AT log(AT ). The proof of Theorem 2.1 appears in Section 2.4. Remark 2.2. Note that our regret bound has the same order in H,S,A and T as the optimistic algorithm in [18] which is the best available bound for weakly communicating 20 MDPs. Moreover, the bound does not depend on the prior distribution or other problem- dependent parameters such as the recurrent time of the optimal policy used in the regret bound of [29]. 2.3.2 Approximation Error At the beginning of each episode, TSDE computes the optimal stationary policy π k for the parameter θ k . This step requires the solution to a fixed finite MDP. Policy iteration or value iteration can be used to solve the sampled MDP, but the resulting stationary policy may be only approximately optimal in practice. 
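To make the episode schedule concrete, the following schematic sketch combines the conjugate Dirichlet posterior update corresponding to (2.4), posterior sampling at episode starts, and the two stopping criteria of Algorithm 1. The environment interface env_step and the planner callable (any routine returning an approximately optimal stationary policy for the sampled MDP, such as the relative value iteration sketch above) are assumptions made for illustration; this is a sketch of the algorithmic structure rather than the exact implementation used for the experiments of Section 2.5.

```python
import numpy as np

def tsde(env_step, S, A, cost, planner, T, prior=1.0, seed=0):
    """Schematic TSDE loop (Algorithm 1) with an independent Dirichlet prior,
    with parameter `prior`, over each transition row theta(. | s, a).

    env_step(s, a) -> next state drawn from the true (unknown) MDP.
    planner(cost, theta) -> stationary policy, an array of shape (S,).
    """
    rng = np.random.default_rng(seed)
    dirichlet = np.full((S, A, S), prior)    # posterior parameters for theta
    visits = np.zeros((S, A), dtype=int)     # N_t(s, a) as in (2.5)
    s = 0                                    # the known, fixed initial state
    t, T_prev = 0, 0                         # T_prev plays the role of T_{k-1}

    while t < T:
        # Start of episode k: sample theta_k from the posterior and plan.
        theta_k = np.array([[rng.dirichlet(dirichlet[si, ai]) for ai in range(A)]
                            for si in range(S)])
        policy_k = planner(cost, theta_k)
        t_k, visits_at_tk = t, visits.copy()

        while t < T:
            a = int(policy_k[s])
            s_next = env_step(s, a)
            dirichlet[s, a, s_next] += 1.0   # conjugate form of update (2.4)
            visits[s, a] += 1
            s = s_next
            t += 1
            # The episode ends when either while-condition of Algorithm 1 would
            # fail at the next time step.
            if (t - t_k) > T_prev or np.any(visits > 2 * visits_at_tk):
                break
        T_prev = t - t_k
    return dirichlet
```

With the earlier planner sketch, one possible wiring is planner = lambda c, th: relative_value_iteration(c, th)[2].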
We call π an −approximate policy if c(s,π(s)) + X s 0 ∈S θ(s 0 |s,π(s))v(s 0 ,θ)≤ min a∈A n c(s,a) + X s 0 ∈S θ(s 0 |s,a)v(s 0 ,θ) o +. When the algorithm returns an k −approximate policy ˜ π k instead of the optimal station- ary policy π k at episode k, we have the following regret bound in the presence of such approximation error. Theorem 2.2. If TSDE computes an k −approximate policy ˜ π k instead of the optimal stationary policy π k at each episode k, the expected regret of TSDE satisfies R(T, TSDE)≤ ˜ O(HS √ AT ) +E h X k:t k ≤T T k k i . Furthermore, if k ≤ 1 k+1 , E h P k:t k ≤T T k k i ≤ p 2SAT log(T ). Proof. See Appendix A. 21 Theorem 2.2 shows that the approximation error in the computation of optimal station- ary policy is only additive to the regret under TSDE. The regret bound would remain ˜ O(HS √ AT ) if the approximation error is such that k ≤ 1 k+1 . 2.4 Analysis 2.4.1 Number of Episodes To analyze the performance of TSDE over T time steps, define K T = arg max{k :t k ≤T} be the number of episodes of TSDE until time T . Note that K T is a random variable because the number of visits N t (x,u) depends on the dynamical state trajectory. In the analysis for timeT we use the convention thatt (K T +1) =T +1. We provide an upper bound on K T as follows. Lemma 2.1. K T ≤ q 2SAT log(T ). Proof. Define macro episodes with start times t n i ,i = 1, 2,... where t n 1 =t 1 and t n i+1 = min{t k >t n i : N t k (s,a)> 2N t k−1 (s,a) for some (s,a)}. The idea is that each macro episode starts when the second stopping criterion happens. Let M be the number of macro episodes until time T and define n (M+1) =K T + 1. 22 Let ˜ T i = P n i+1 −1 k=n i T k be the length of the ith macro episode. By the definition of macro episodes, any episode except the last one in a macro episode must be triggered by the first stopping criterion. Therefore, within the ith macro episode, T k = T k−1 + 1 for all k =n i ,n i + 1,...,n i+1 − 2. Hence, ˜ T i = n i+1 −1 X k=n i T k = n i+1 −n i −1 X j=1 (T n i −1 +j) +T n i+1 −1 ≥ n i+1 −n i −1 X j=1 (j + 1) + 1 = 0.5(n i+1 −n i )(n i+1 −n i + 1). Consequently, n i+1 −n i ≤ q 2 ˜ T i for all i = 1,...,M. From this property we obtain K T =n M+1 − 1 = M X i=1 (n i+1 −n i )≤ M X i=1 q 2 ˜ T i . (2.7) Using (16) and the fact that P M i=1 ˜ T i =T we get K T ≤ M X i=1 q 2 ˜ T i ≤ v u u t M M X i=1 2 ˜ T i = √ 2MT (2.8) where the second inequality is Cauchy-Schwarz. From Lemma A.1 in the Appendix A, the number of macro episodes M ≤ SA log(T ). Substituting this bound into (17) we obtain the result of this lemma. Remark 2.3. TSDE computes the optimal stationary policy of a finite MDP at each episode. Lemma 2.1 ensures that such computation only needs to be done at a sublinear rate of p 2SAT log(T ). 23 2.4.2 Regret Bound As discussed in [27, 32, 33], one key property of Thompson/Posterior Sampling algorithms is that for any functionf,E[f(θ t )] =E[f(θ ∗ )] ifθ t is sampled from the posterior distribution at timet. This property leads to regret bounds for algorithms with fixed sampling episodes since the start time t k of each episode is deterministic. However, our TSDE algorithm has dynamic episodes that requires us to have the stopping-time version of the above property. Lemma 2.2. Under TSDE, t k is a stopping time for any episode k. Then for any measur- able function f and any σ(h t k )−measurable random variable X, we have E h f(θ k ,X) i =E h f(θ ∗ ,X) i Proof. From the definition (3.19), the start timet k is a stopping-time, i.e. t k isσ(h t k )−measurable. 
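For instance, in the simulation setting of Section 2.5 with S = 6, A = 2 and T = 10^5, the bound of Lemma 2.1 evaluates (with log taken as the natural logarithm) to K_T ≤ √(2 · 6 · 2 · 10^5 · log 10^5) ≈ 5.3 × 10^3, i.e., the sampled MDP is re-solved at most a few thousand times over 10^5 steps.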
Note thatθ k is randomly sampled from the posterior distributionμ t k . Sincet k is a stopping time, t k and μ t k are both measurable with respect to σ(h t k ). From the assumption, X is also measurable with respect to σ(h t k ). Then conditioned on h t k , the only randomness in f(θ k ,X) is the random sampling in the algorithm. This gives the following equation: E h f(θ k ,X)|h t k i =E h f(θ k ,X)|h t k ,t k ,μ t k i = Z f(θ,X)μ t k (dθ) =E h f(θ ∗ ,X)|h t k i (2.9) since μ t k is the posterior distribution of θ ∗ given h t k . Now the result follows by taking the expectation of both sides. 24 For t k ≤ t < t k+1 in episode k, the Bellman equation (2.1) holds by Assumption 2.1 for s =s t , θ =θ k and action a t =π k (s t ). Then we obtain c(s t ,a t ) =J(θ k ) +v(s t ,θ k )− X s 0 ∈S θ k (s 0 |s t ,a t )v(s 0 ,θ k ). (2.10) Using (2.10), the expected regret of TSDE is equal to E h K T X k=1 t k+1 −1 X t=t k c(s t ,a t ) i −TE h J(θ ∗ ) i =E h K T X k=1 T k J(θ k ) i −TE h J(θ ∗ ) i +E h K T X k=1 t k+1 −1 X t=t k h v(s t ,θ k )− X s 0 ∈S θ k (s 0 |s t ,a t )v(s 0 ,θ k ) ii =R 0 +R 1 +R 2 , (2.11) where R 0 , R 1 and R 2 are given by R 0 =E h K T X k=1 T k J(θ k ) i −TE h J(θ ∗ ) i , R 1 =E h K T X k=1 t k+1 −1 X t=t k h v(s t ,θ k )−v(s t+1 ,θ k ) ii , R 2 =E h K T X k=1 t k+1 −1 X t=t k h v(s t+1 ,θ k )− X s 0 ∈S θ k (s 0 |s t ,a t )v(s 0 ,θ k ) ii . We proceed to derive bounds on R 0 , R 1 and R 2 . Based on the key property of Lemma 2.2, we derive an upper bound on R 0 . Lemma 2.3. The first term R 0 is bounded as R 0 ≤E[K T ]. 25 Proof. From monotone convergence theorem we have R 0 =E h ∞ X k=1 1 {t k ≤T} T k J(θ k ) i −TE h J(θ ∗ ) i = ∞ X k=1 E h 1 {t k ≤T} T k J(θ k ) i −TE h J(θ ∗ ) i . Note that the first stopping criterion of TSDE ensures thatT k ≤T k−1 +1 for allk. Because J(θ k )≥ 0, each term in the first summation satisfies E h 1 {t k ≤T} T k J(θ k ) i ≤E h 1 {t k ≤T} (T k−1 + 1)J(θ k ) i . Note that 1 {t k ≤T} (T k−1 + 1) is measurable with respect to σ(h t k ). Then, Lemma 2.2 gives E h 1 {t k ≤T} (T k−1 + 1)J(θ k ) i =E h 1 {t k ≤T} (T k−1 + 1)J(θ ∗ ) i . Combining the above equations we get R 0 ≤ ∞ X k=1 E h 1 {t k ≤T} (T k−1 + 1)J(θ ∗ ) i −TE h J(θ ∗ ) i =E h K T X k=1 (T k−1 + 1)J(θ ∗ ) i −TE h J(θ ∗ ) i =E h K T J(θ ∗ ) i +E h K T X k=1 T k−1 −T J(θ ∗ ) i ≤E h K T i where the last equality holds because J(θ ∗ )≤ 1 and P K T k=1 T k−1 =T 0 + P K T −1 k=1 T k ≤T . Note that the first stopping criterion of TSDE plays a crucial role in the proof of Lemma 2.3. It allows us to bound the length of an episode using the length of the previous episode which is measurable with respect to the information at the beginning of the episode. 26 The other two terms R 1 and R 2 of the regret are bounded in the following lemmas. Their proofs follow similar steps to those in [27, 30]. Lemma 2.4. The second term R 1 is bounded as R 1 ≤E[HK T ]. Proof. See Appendix A. Lemma 2.5. The third term R 2 is bounded as R 2 ≤ 49HS q AT log(AT ). Proof. See Appendix A. We are now ready to prove Theorem 2.1. Proof of Theorem 2.1. From (2.11),R(T, TSDE) =R 0 +R 1 +R 2 ≤E[K T ] +E[HK T ] +R 2 where the inequality comes from Lemma 2.3, Lemma 2.4. Then the claim of the theorem directly follows from Lemma 2.1 and Lemma 2.5. 2.5 Simulations In this section, we compare through simulations the performance of TSDE with three learn- ing algorithms with the same regret order: UCRL2 [19], TSMDP [29], and Lazy PSRL [30]. UCRL2 is an optimistic algorithm with similar regret bounds. 
TSMDP and Lazy PSRL are TS algorithms for infinite-horizon MDPs. TSMDP has the same regret order in T given a recurrent state for resampling. The original regret analysis for Lazy PSRL is incorrect, but its regret bounds are conjectured to be correct [32]. We chose δ = 0.05 for the implementation of UCRL2 and assume an independent Dirichlet prior with parameters [0.1, 0.1, ..., 0.1] over the transition probabilities for all TS algorithms.

We consider two environments: randomly generated MDPs and the RiverSwim example [34]. For randomly generated MDPs, we use the independent Dirichlet prior over 6 states and 2 actions but with a fixed cost. We select the resampling state s_0 = 1 for TSMDP here since all states are recurrent under the Dirichlet prior. The RiverSwim example models an agent swimming in a river who can choose to swim either left or right. The MDP consists of six states arranged in a chain, with the agent starting in the leftmost state (s = 1). If the agent decides to move left, i.e., with the river current, then he is always successful, but if he decides to move right he might fail with some probability. The cost function is given by: c(s,a) = 0.8 if s = 1, a = left; c(s,a) = 0 if s = 6, a = right; and c(s,a) = 1 otherwise. The optimal policy is to swim right to reach the rightmost state, which minimizes the cost. For TSMDP in RiverSwim, we consider two versions, with s_0 = 1 and with s_0 = 3 as the resampling state. We simulate 500 Monte Carlo runs for both examples and run for T = 10^5.

From Figure 2.1 we can see that TSDE outperforms all three algorithms in randomly generated MDPs. In particular, there is a significant gap between the regret of TSDE and that of UCRL2 and TSMDP. The poor performance of UCRL2 reinforces the motivation to consider TS algorithms. From the specification of TSMDP, its performance heavily hinges on the choice of an appropriate resampling state, which is not possible for a general unknown MDP. This is reflected in the randomly generated MDPs experiment.

[Figure 2.1: Expected Regret vs Time for random MDPs — regret curves of UCRL2, TSMDP, Lazy PSRL, and TSDE over a horizon of 10^5.]

In the RiverSwim example, Figure 2.2 shows that TSDE significantly outperforms UCRL2, Lazy PSRL, and TSMDP with s_0 = 3. Although TSMDP with s_0 = 1 performs slightly better than TSDE, there is no way to pick this specific s_0 if the MDP is unknown in practice. Since Lazy PSRL is also equipped with the doubling-trick criterion, the performance gap between TSDE and Lazy PSRL highlights the importance of the first stopping criterion on the growth rate of episode length. We would also like to point out that in this example the MDP is fixed and is not generated from the Dirichlet prior. From the result, we conjecture that TSDE also has the same regret bounds under a non-Bayesian setting.

[Figure 2.2: Expected Regret vs Time for RiverSwim — regret curves of UCRL2, TSMDP with s_0 = 3, TSMDP with s_0 = 1, Lazy PSRL, and TSDE.]

2.6 Conclusion

We propose the Thompson Sampling with Dynamic Episodes (TSDE) learning algorithm and establish Õ(HS√(AT)) bounds on expected regret for the general subclass of weakly communicating MDPs. Our result fills a gap in the theoretical analysis of Thompson Sampling for MDPs. Numerical results validate that the TSDE algorithm outperforms other learning algorithms for infinite-horizon MDPs.
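As a concluding illustration, the following is a minimal Python sketch of the TSDE loop analyzed in this chapter. It assumes an independent Dirichlet prior over each transition-probability row (as in the experiments above) and a known cost table; `step` is a hypothetical sampler for the true, unknown transition kernel, and `solve_average_cost` is a plain relative value iteration standing in for whatever solver is used to compute the (approximately) optimal stationary policy of the sampled MDP.

```python
import numpy as np

def solve_average_cost(P, c, iters=2000):
    """Relative value iteration for the average-cost Bellman equation (2.1).
    P has shape (S, A, S) and c has shape (S, A)."""
    S, A = c.shape
    v = np.zeros(S)
    for _ in range(iters):
        q = c + np.einsum('xay,y->xa', P, v)
        v_new = q.min(axis=1)
        v = v_new - v_new[0]                   # re-center to keep v bounded
    return q.argmin(axis=1)                    # greedy stationary policy

def tsde(step, c, S, A, T, seed=0):
    """Sketch of TSDE for a finite MDP.  `step(s, a, rng)` samples the next
    state from the true (unknown) kernel; `c` is the known cost table."""
    rng = np.random.default_rng(seed)
    alpha = np.full((S, A, S), 0.1)            # Dirichlet prior parameters
    visits = np.zeros((S, A), dtype=int)       # N_t(s, a)
    s, t, T_prev = 0, 0, 0
    while t < T:
        t_k, visits_k = t, visits.copy()       # a new episode starts
        theta = np.array([[rng.dirichlet(alpha[x, a]) for a in range(A)]
                          for x in range(S)])  # sampled transition kernel
        policy = solve_average_cost(theta, c)
        # First criterion: episode length may exceed the previous one by one.
        # Second criterion: stop when some visit count doubles.
        while t < T and t - t_k <= T_prev and np.all(visits <= 2 * visits_k):
            a = policy[s]
            s_next = step(s, a, rng)
            alpha[s, a, s_next] += 1.0         # conjugate posterior update
            visits[s, a] += 1
            s, t = s_next, t + 1
        T_prev = t - t_k                       # length of the finished episode
```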
30 Chapter 3 Learning to Control an Unknown Linear System 3.1 Introduction In this chapter, we consider a linear stochastic system with quadratic cost (an LQ system) with unknown parameters. If the true parameters are known, then the problem is the classic stochastic LQ control where optimal control is a linear function of the state. In the learning problem, however, the true system dynamics are unknown. This problem is also known as the adaptive control problem [6, 35]. The early works in the adaptive control literature made use of the certainty equivalence principle. The idea is to estimate the parameters from collected data and apply the optimal control by taking the estimates to be the true parameters. It was shown that the certainty 31 equivalence principle may lead to the convergence of the estimated parameters to incorrect values [36] and thus results in suboptimal performance. This issue arises fundamentally from the lack of exploration. The controller must explore the environment to learn the system dynamics but at the same time it also needs to exploit the information available to minimize the accumulated cost. This leads to the well known exploitation-exploration trade-off in learning problems. One approach to actively explore the environment is to add perturbations to the controls (see, for examples, [37]). However, the persistence of perturbations lead to suboptimal performance except in the asymptotic region. To overcome this issue, Campi and Kumar [38] proposed a cost-biased maximum likelihood algorithm and proved its asymptotic optimality. More recent works [39] show a connection between the cost-biased maximum likelihood and the optimism in the face of uncertainty (OFU) principle [23] in online learning. The OFU principle handles the exploitation-exploration trade-off by making use of optimistic parameters. Based on the OFU principle, [39] design algorithms that achieve ˜ O( √ T ) bounds on regret accumulated up to time T with high probability. Here ˜ O(·) hides the constants and logarithmic factors. Recently, [40] showed that the regret for online LQR has a lower bound ofO( √ T ) and hence the regret of [39] is of the optimal order except for logarithmic factors. One drawback of the OFU-based algorithms is their computational requirements. Each step of an OFU-based algorithm requires optimistic parameters as the solution of an optimization problem. Solving the optimization is computationally expensive. In recent years, Thompson sampling (TS) has become a popular alternative to OFU due to its computational simplicity 32 (see [41] for a recent tutorial). It has been successfully applied to multi-armed bandit problems [14, 42–44] as well as to Markov Decision Processes (MDPs) [27, 29, 30].The idea dates back to 1933 due to Thompson [12]. TS is a Bayesian approach where a posterior distribution over the unknown parameter is maintained at each time. Then, at certain carefully chosen times a random sample is generated from the posterior and optimal control policy corresponding to the generated sample is applied until a new sample is generated. Without solving any optimization problem, PS-based algorithms are computationally more efficient than OFU-based algorithms in general. A recent paper [45] applied the TS idea to stochastic control problems with finite state and finite action spaces. However, it has the following limitations: (a) The model of [45] only considers finite state and finite action spaces as well as a finite parameter space. 
Therefore, the algorithm proposed by [45] is not applicable to the LQ control problem which builds on continuous spaces. In fact, continuous spaces require a totally different algorithm design because any count-based idea won’t work in the uncountable continuous domains. (b) In addition to the finite parameter restriction, [45] makes an identifiability assumption: any two values in the parameter space differ by a strictly positive relative entropy. Under this assumption, any two possible parameters are statistically distinguishable, and it implies that the system is always identifiable under any control law. Therefore, the main challenge of exploration disappears under the assumption of [45]. This assumption, however, does not hold in most learning problems as discussed in [6, 35, 36] . The idea of TS has not been applied to learning in LQ control until very recently [30, 46, 47]. One key challenge to adapt TS to LQ control is to appropriately design the length of the 33 episodes. Abbasi-Yadkori and Szepesvari [30] designed a dynamic episode schedule for PS- based on their OFU-based algorithm [39]. Their TS-based algorithm was claimed to have a ˜ O( √ T ) growth, but a mistake in the proof of their regret bound was pointed out by [32]. A modified dynamic episode schedule was proposed in [46], but it suffers a ˜ O(T 2 3 ) (non-Bayesian) regret. Their regret bound was improved to ˜ O( √ T ) for systems with 1- dimensional state and action spaces in [48]. [47] appeared soon after a preliminary version of this chapter [49] and also presents a PS-based algorithm. An exponential exploration schedule is proposed but no numerical results that validate the √ T -regret are given. In this chapter, we consider the LQ control problem with unknown parameters and propose a Thompson sampling with dynamic episodes (TSDE-LQ) learning algorithm. In TSDE-LQ, there are two stopping criteria for an episode to end. The first stopping criterion controls the growth rate of episode length. The second stopping criterion is the doubling trick similar to the ones in [30, 39, 46] that stops when the determinant of the sample covariance matrix becomes less than half of the previous value. Instead of a high probability bound on regret as derived in [39, 46], we choose the expected (Bayesian) regret as the performance metric for the learning algorithm. The reason is because in LQ control, a high probability bound does not provide a desired performance guarantee as the system cost may go unbounded in the bad event with small probability. Under some conditions on the prior distribution, we show that the expected regret of TSDE-LQ accumulated up to time T is bounded by ˜ O(d 0.5 x (d x +d u ) √ T ) where d x is the state dimension and d u is the control dimension. The performance of TSDE-LQ is verified numerically. Next, we consider a mean-field LQ learning problem. This is a multi-agent problem where 34 the agents are coupled via the mean-field of the state and the actions. Mean-field control of large-scale systems has gained considerable importance in the last 10-15 years [50–54]. We refer the reader to [55] for a more detailed survey. A naive application of TSDE-LQ to this problem requires complete state information at each agent and its computational complexity and regret bound grows polynomially with the number of agents. We propose TSDE-MF which decomposes the problem into two separate learning procedures and adaptively picks the observation of an agent at each time to update the posterior distribution. 
We show that by carefully exploiting the structure of the model, it is possible to design a learning algorithm for mean-field LQ systems where the regret does not grow in the number of agents. Moreover, TSDE-MF is not a fully centralized algorithm and its computational complexity is independent of the number of agents. The main contributions of this chapter are: (i) The proposed TSDE-LQ algorithm uses the TS idea to tackle the exploration challenge instead of the persistence perturbation method used in traditional adaptive control. In the absence of persistence perturbation, TSDE-LQ achieves finite time performance guarantees in the form of a sub-linear regret. (ii) The TSDE-LQ algorithm improves the theoretical regret bound of Thompson sampling based algorithms to ˜ O( √ T ) for the LQ control problem. (iii) TSDE-LQ is computationally feasible because the posterior distribution can be updated by a Kalman filter-like procedure. On the contrary, OFU-based methods require solving optimization problems which are computationally intractable without heuristic approximations. (iv) We propose TSDE-MF algorithm for the mean-field LQ learning problem. We show that its regret is bounded by ˜ O( √ T ) and does not grow with the number of agents. 35 Notation: E denotes the expectation operator,1 denotes the indicator function andN (μ, Σ) denotes the normal distribution with meanμ and covariance Σ. We useDARE(A,B,Q,R) to denote the solution S to the following discrete algebraic riccati equation: S =Q +A > SA−A > SB(R +B > SB) −1 B > SA (3.1) 3.2 Problem Formulation 3.2.1 Preliminaries: Stochastic Linear Quadratic Control Consider a linear system controlled by a controller. The system dynamics are given by x t+1 =Ax t +Bu t +w t , (3.2) where x t ∈ R dx is the state of the system plant, u t ∈ R du is the control action by the controller, and w t is the system noise which has the Gaussian distributionN (0,σ 2 I). A and B are system matrices with proper dimensions. The initial state x 1 is assumed to be zero. The control action u t = π t (h t ) at time t is a function π t of the history of observations h t = (x 1:t ,u 1:t−1 ) including states x 1:t := (x 1 ,··· ,x t ) and controls u 1:t−1 = (u 1 ,··· ,u t−1 ). We callπ = (π 1 ,π 2 ,... ) a (adaptive) control policy. The control policy allows the possibility of randomization over control actions. 36 The cost incurred at time t is a quadratic instantaneous function c t =x > t Qx t +u > t Ru t (3.3) where Q and R are positive definite matrices. Let θ > = [A,B] be the system parameter including both the system matrices. Then θ∈ R d×dx where d = d x +d u with compact support Ω 1 . When θ is perfectly known to the controller, minimizing the infinite horizon average cost per stage is a standard stochastic Linear Quadratic (LQ) control problem. Let J(θ) be the optimal per stage cost under θ. That is, J(θ) = min π lim sup T→∞ 1 T T X t=1 E π [c t |θ] (3.4) It is well-known that the optimal cost is given by J(θ) =σ 2 tr(S(θ)) (3.5) if the following Riccati equation has a unique positive definite solution S(θ). S(θ) =Q +A > S(θ)A−A > S(θ)B(R +B > S(θ)B) −1 B > S(θ)A. (3.6) 37 Furthermore, for any θ and any x, the optimal cost function J(θ) satisfies the Bellman equation J(θ) +x > S(θ)x = min u n x > Qx +u > Ru +E h x > t+1 (u)S(θ)x t+1 (u)|x,θ io (3.7) wherex t+1 (u) =θ > [x > ,u > ] > +w t , and the optimal control that minimizes (3.7) is equal to u =G(θ)x (3.8) with the gain matrix G(θ) given by G(θ) =−(R +B > S(θ)B) −1 B > S(θ)A. 
(3.9) 3.2.2 Reinforcement Learning with Stationary Parameter The problem we are interested in is the case when the system matrices A,B are fixed but unknown system matrices. When θ > 1 = [A,B] is unknown, the problem becomes a reinforcement learning problem where the controller needs to learn the system parameter while minimizing the cost. We adopt a Bayesian setting and assume that there is a prior distribution μ 1 for θ 1 . We measure the performance of an adaptive control policy π using the notion of regret. Regret compares the cost incurred under the adaptive policy against the optimal costJ(θ 1 ) over a time horizonT . The expected regret of a policyπ over a time 38 horizon of T is given as, R(T,π) =E h T X t=1 h c t −J(θ 1 ) ii . (3.10) The above expectation is with respect to the randomness for w t , the prior distribution μ 1 for θ 1 , and the randomized algorithm. The objective is to find a learning algorithm that minimizes the expected regret. We note that Thompson sampling-type algorithms are Bayesian algorithms and a prior distribution is needed. In practice, a prior is chosen that makes computation of posterior distribution easy. In theory, choosing a wrong prior will only add a multiplicative factor to the regret upper bound. Note that if R(T,π)/T goes to zero, as T →∞, it implies that the average cost converges to the optimal expected average reward J(θ 1 ) where θ 1 is the true but unknown parameter. Thus, the regret is really capturing the non-asymptotic rate of convergence to J(θ 1 ). 3.3 Thompson Sampling Based Control Policies In this section, we develop Thompson Sampling (PS)-based control policies for the problems with stationary parameters. TS is a Bayesian approach where a posterior distribution μ t (·) over the unknown parameter θ 1 is maintained at each time. TS-based algorithms generally proceed in episodes. At the beginning of each episode, a sample is generated from the posterior distribution μ t . Then, optimal control corresponding to the generated sample is applied until the next episode begins. For the reinforcement learning problem with stationary parameters, we make the following assumption on the prior distribution μ 1 . 39 Assumption 3.1. The prior distributionμ 1 consists of independent Gaussian distributions projected on a compact support Ω 1 ⊂R d×dx such that for any θ∈ Ω 1 , the Riccati equation (3.6) with [A,B] =θ > has a unique positive definite solution. Specifically, there exist ˆ θ 1 (i)∈R d fori = 1,...,d x and a positive definite matrix Σ 1 ∈R d×d such that for any θ∈R d×dx μ 1 = ¯ μ 1 | Ω 1 , ¯ μ 1 (θ) = dx Y i=1 ¯ μ 1 (θ(i)) (3.11) ¯ μ 1 (θ(i))≡N ( ˆ θ 1 (i), Σ 1 ) for i = 1,...,d x . (3.12) Here θ(i) denotes θ’s ith column (θ = [θ(1),...,θ(d x )]). Note that under the prior distribution, the mean ˆ θ 1 (i) for each column of θ 1 may be dif- ferent, but they have the same covariance matrix Σ 1 . At each time t, given the history of observations h t = (x 1:t ,u 1:t−1 ), we define μ t to be the posterior belief of θ 1 given by μ t (Θ) =P(θ 1 ∈ Θ|h t ). (3.13) The posterior belief can be computed according to the following lemma. Lemma 3.1. The posterior belief μ t on the parameter θ 1 satisfies μ t = ¯ μ t | Ω 1 , ¯ μ t (θ) = dx Y i=1 ¯ μ t (θ(i)) (3.14) ¯ μ t (θ(i))≡N ( ˆ θ t (i), Σ t ) (3.15) 40 where ˆ θ t (i),i = 1,...,d x and Σ t can be sequentially updated using observations as follows. 
ˆ θ t+1 (i) = ˆ θ t (i) + Σ t z t (x t+1 (i)− ˆ θ t (i) > z t ) σ 2 +z > t Σ t z t (3.16) Σ t+1 = Σ t − Σ t z t z > t Σ t σ 2 +z > t Σ t z t (3.17) where z t = [x > t ,u > t ] > ∈R d . Lemma 3.1 can be proved using arguments for the least square estimator. For example, see [56] for a proof. Remark 3.1. Instead of the Kalman filter-type equation (3.17), Σ t can also be computed by Σ −1 t+1 = Σ −1 t + 1 σ 2 z t z > t . (3.18) Let’s introduce the Thompson Sampling with Dynamic Episodes (TSDE-LQ) learning algo- rithm. 41 Algorithm 2 TSDE-LQ Input: Ω 1 , ˆ θ 1 , Σ 1 Initialization: t← 1, t k ← 0 for episodes k = 1, 2,... do T k−1 ←t−t k t k ←t Generate ˜ θ k ∼μ t k Compute G k =G( ˜ θ k ) from (3.7)-(3.8) while t≤t k +T k−1 and det(Σ t )≥ 0.5 det(Σ t k ) do Apply control u t =G k x t Observe new state x t+1 Update μ t+1 according to (3.16)-(3.17) t←t + 1 end while end for The TSDE-LQ algorithm operates in episodes. Let t k be start time of the kth episode and T k =t k+1 −t k be the length of the episode with the conventionT 0 = 1. From the description of the algorithm, t 1 = 1 and t k+1 ,k≥ 1, is given by t k+1 = min{t>t k : t>t k +T k−1 or det(Σ t )< 0.5 det(Σ t k )}. (3.19) At the beginning of episode k, a parameter ˜ θ k is sampled from the posterior distribution 42 μ t k . During each episode k, controls are generated by the optimal gain G k for the sampled parameter ˜ θ k . One important feature of TSDE-LQ is that its episode lengths are not fixed. The lengthT k of each episode is dynamically determined according to two stopping criteria: (i) t>t k +T k−1 , and (ii) det(Σ t )< 0.5 det(Σ t k ). The first stopping criterion provides that the episode length grows at a linear rate without triggering the second criterion. The second stopping criterion ensures that the determinant of sample covariance matrix during an episode should not be less than half of the determinant of sample covariance matrix at the beginning of this episode. We note that without such an exploration schedule, the TSDE-LQ algorithm does not get sublinear regret in numerical experiments. In fact, with exponential schedules (e.g., as proposed in [47]), the algorithm has super-linear regret as we show in numerical experiments in Figure 3.3. A dynamic schedule such as we propose seems necessary for TS-type algorithms to converge. It is well known in adaptive control that to converge to the optimal control law, a persistence of excitation condition must be satisfied by an adaptive control algorithm. Thus, Thompson sampling may be seen as providing enough randomization for persistence of excitation and in such a way that the convergence rate is optimal, and even non-asymptotically so. Remark 3.2. Following the idea in [19], the TSDE-LQ algorithm can be extended to the time-varying parameter case. In particular, TSDE-LQ can adapt to the jumps of the model parameter by re-initialization of the posterior distribution. However, the analysis of the time-varying parameter case is beyond the scope of this thesis. Remark 3.3. TSDE-LQ algorithm generates the control policy based on the sampled param- eter ˜ θ k . In contrast, model free reinforcement learning algorithms for linear systems (e.g. 43 [57, 58]) directly estimate the optimal control policy without estimating the parameters of the system. TS based ideas solve the problem of efficient exploration in reinforcement learning via parameter estimation since the idea is borrowed from multi-armed bandit models where online learning is akin to a parameter estimation problem. 
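To complement the pseudocode of Algorithm 2, the following is a minimal Python sketch of the three computational pieces of TSDE-LQ: solving the Riccati equation (3.6) and gain (3.9) for a sampled parameter, drawing a sample from the Gaussian posterior restricted to Ω_1 (here realized by simple rejection sampling with the spectral-radius check of Assumption 3.2, which is one possible way to implement the projection, not something mandated by the algorithm), and the rank-one posterior update (3.16)-(3.17). The use of scipy's discrete Riccati solver is our choice for the sketch.

```python
import numpy as np
from scipy.linalg import solve_discrete_are

def lqr_gain(theta, Q, R, dx):
    """Solve the Riccati equation (3.6) for theta^T = [A, B] and return the
    gain G(theta) of (3.9).  theta has shape (dx + du, dx)."""
    A, B = theta[:dx].T, theta[dx:].T
    S = solve_discrete_are(A, B, Q, R)
    return -np.linalg.solve(R + B.T @ S @ B, B.T @ S @ A)

def sample_parameter(theta_hat, Sigma, Q, R, dx, delta, rng, max_tries=100):
    """Draw theta column-wise from N(theta_hat(i), Sigma), rejecting samples
    whose closed loop violates rho(A + B G(theta)) <= delta (a rejection-
    sampling stand-in for restricting the posterior to Omega_1)."""
    L = np.linalg.cholesky(Sigma)
    for _ in range(max_tries):
        theta = theta_hat + L @ rng.standard_normal(theta_hat.shape)
        try:
            G = lqr_gain(theta, Q, R, dx)
        except (np.linalg.LinAlgError, ValueError):
            continue                           # Riccati equation not solvable
        A, B = theta[:dx].T, theta[dx:].T
        if np.max(np.abs(np.linalg.eigvals(A + B @ G))) <= delta:
            return theta, G
    raise RuntimeError("no stabilizing sample found")

def posterior_update(theta_hat, Sigma, z, x_next, sigma2=1.0):
    """Rank-one update (3.16)-(3.17) after observing x_{t+1} = theta^T z + w."""
    denom = sigma2 + z @ Sigma @ z
    theta_hat = theta_hat + np.outer(Sigma @ z, x_next - theta_hat.T @ z) / denom
    Sigma = Sigma - np.outer(Sigma @ z, Sigma @ z) / denom
    return theta_hat, Sigma
```

An episode built from these pieces then runs until t > t_k + T_{k-1} or det(Σ_t) < 0.5 det(Σ_{t_k}), exactly as in Algorithm 2.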
3.4 Regret of the TSDE-LQ Algorithm In this section, we analyze the regret of TSDE-LQ. In the regret analysis, we make the following assumption on the prior distribution. Assumption 3.2. There exists a positive number δ< 1 such that for any θ∈ Ω 1 , we have ρ(A 1 +B 1 G(θ))≤ δ. Here ρ(·) is the spectral radius of a matrix, i.e. the largest absolute value of its eigenvalues. This assumption ensures that the closed-loop system is stable under the learning algorithm. A weaker assumption in [59] can ensure that Assumption 3.2 is satisfied for θ = ˜ θ k with high probability. In [60], it is argued that the set of parameters where such a stabilizability property does not hold has measure zero as the number of samples goes to infinity. A sketch of a plausible argument for Assumption 3.2 to hold is given in [47]. Other papers have assumed that a stabilizable set of parameters is available as an input to the algorithm in which case, the algorithm can reject sampled parameters if they fall outside such a given set. 44 Since J(·), S(·), and G(·) are well-defined functions on the compact set Ω 1 , there exists finite numbers M J ,M θ ,M S , andM G such that Tr(S(θ))≤M J ,||θ||≤M θ ,||S(θ)||≤M S , and||[I,G(θ) > ]||≤M G for all θ∈ Ω 1 . The main result of this section is the following bound on expected regret of TSDE-LQ in the stationary parameter case. Theorem 3.1. Under Assumptions 3.1 and 3.2, the expected regret (3.41) of TSDE-LQ satisfies R(T,TSDE-LQ)≤ ˜ O σ 2 d 0.5 x (d x +d u ) √ T (3.20) where ˜ O(·) hides all constants and logarithmic factors. Remark 3.4. Note that the expectation is respect to a product probability measure that depends on the prior μ 1 but the upper bound on it is independent of the prior. The prior is also used in the algorithm. If a wrong prior is used, then the upper bound will also involve a multiplicative constant which would be the Radon-Nikodym derivate of the wrong prior with respect to the true prior. To prove Theorem 3.1, we first provide bounds on the system state and the number of episodes. Then, we give a decomposition for the expected regret and derive upper bounds for each term of the regret. Let X T = max t≤T kx t k be the maximum value of the norm of the state andK T be the number of episodes over the horizonT . Then we have the following properties. 45 Lemma 3.2. For any j≥ 1 and any T we have E h X j T i ≤O (1−δ) −j ·σ j · log(T ) . (3.21) Lemma 3.3. The number of episodes is bounded by K T ≤O v u u t 2dT log T X 2 T σ 2 ! ≤ ˜ O √ dT . (3.22) The proofs of Lemmas 3.2 and 3.3 are in the Appendix B. Following the steps in [39] using the Bellman equation (3.7), for t k ≤ t < t k+1 during the kth episode, the cost of TSDE-LQ satisfies c t =J( ˜ θ k ) +x > t S( ˜ θ k )x t −E h x > t+1 S( ˜ θ k )x t+1 |x t , ˜ θ k i + (θ > 1 z t ) > S( ˜ θ k )θ > 1 z t − ( ˜ θ > k z t ) > S( ˜ θ k ) ˜ θ > k z t . (3.23) Then from (3.23), the expected regret of TSDE-LQ can be decomposed into R(T,TSDE-LQ) =E h K T X k=1 t k+1 −1 X t=t k c t i −TE h J(θ 1 ) i =R 0 +R 1 +R 2 (3.24) where R 0 =E h K T X k=1 T k J( ˜ θ k ) i −TE h J(θ 1 ) i , (3.25) 46 R 1 =E h K T X k=1 t k+1 −1 X t=t k h x > t S( ˜ θ k )x t −x > t+1 S( ˜ θ k )x t+1 ii , (3.26) R 2 =E h K T X k=1 t k+1 −1 X t=t k h (θ > 1 z t ) > S( ˜ θ k )θ > 1 z t − ( ˜ θ > k z t ) > S( ˜ θ k ) ˜ θ > k z t ii . (3.27) In the following, we proceed to derive bounds on R 0 , R 1 and R 2 . 
As discussed in [27, 32, 33], one key property of Thompson Sampling algorithms is that for any function f, E[f(θ t )] = E[f(θ 1 )] if θ t is sampled from the posterior distribution at timet. However, our TSDE-LQ algorithm has dynamic episodes that requires us to have the stopping-time version of the above property. Lemma 3.4 (Lemma 2.2 in chapter 2). Under TSDE-LQ, t k is a stopping time for any episodek. Then for any measurable functionf and anyσ(h t k )−measurable random variable X, we have E h f( ˜ θ k ,X) i =E h f(θ 1 ,X) i . (3.28) Based on the key property of Lemma 3.4, we establish an upper bound on R 0 . Lemma 3.5. The first term R 0 is bounded as R 0 ≤σ 2 M J E[K T ]≤σ 2 ˜ O √ dT . (3.29) Proof. From monotone convergence theorem, we have R 0 =E h ∞ X k=1 1 {t k ≤T} T k J( ˜ θ k ) i −TE h J(θ 1 ) i 47 = ∞ X k=1 E h 1 {t k ≤T} T k J( ˜ θ k ) i −TE h J(θ 1 ) i . (3.30) Note that the first stopping criterion of TSDE-LQ ensures that T k ≤ T k−1 + 1 for all k. Because J( ˜ θ k )≥ 0, each term in the first summation satisfies E h 1 {t k ≤T} T k J( ˜ θ k ) i ≤E h 1 {t k ≤T} (T k−1 + 1)J( ˜ θ k ) i . (3.31) Note that 1 {t k ≤T} (T k−1 + 1) is measurable with respect to σ(h t k ). Then, Lemma 3.4 gives E h 1 {t k ≤T} (T k−1 + 1)J( ˜ θ k ) i =E h 1 {t k ≤T} (T k−1 + 1)J(θ 1 ) i . (3.32) Combining the above equations, we get R 0 ≤ ∞ X k=1 E h 1 {t k ≤T} (T k−1 + 1)J(θ 1 ) i −TE h J(θ 1 ) i =E h K T X k=1 (T k−1 + 1)J(θ 1 ) i −TE h J(θ 1 ) i =E h K T J(θ 1 ) i +E h K T X k=1 T k−1 −T J(θ 1 ) i ≤M J σ 2 E h K T i (3.33) where the last equality holds because J(θ 1 ) = σ 2 Tr(S(θ 1 ))≤ σ 2 M J and P K T k=1 T k−1 ≤ T . The statement of the lemma then follows by using the bound on number of episodes K T in lemma 3.3. The term R 1 can be upper bounded using K T and X T . 48 Lemma 3.6. The second term R 1 is bounded by R 1 ≤M S E h K T X 2 T i ≤ ˜ O σ 2 √ dT . (3.34) Proof. See Appendix B We now derive an upper bound for R 2 . Lemma 3.7. The third term R 2 is bounded by R 2 ≤O M 2 q (T +E[K T ])E[X 4 T log(TX 2 T )] ≤ ˜ O σ 2 p d x d √ T (3.35) where M 2 =M S M θ M 2 G q 32d 2 dx λ min and λ min is the minimum eigenvalue of Σ −1 1 . Proof. See Appendix B Using the bounds on R 0 , R 1 and R 2 , we are now ready to prove Theorem 3.1. Proof of Theorem 3.1. From the regret decomposition (3.24), Lemmas 3.5-3.7, and the bound on K T from Lemma 3.3, we obtain R(T,TSDE-LQ)≤O M 2 r (T +E[ q 2dT log(TX 2 T )])E[X 4 T log(TX 2 T )] +E h q 2dT log(TX 2 T )(M J +M S X 2 T ) i ≤ ˜ O p d x d r (T +E[ q T log(X T )])E[X 4 T log(X T )] +E hq dT log(X T )X 2 T i . (3.36) 49 From Lemma B.1 in the Appendix B, we have E[ p log(X T )]≤ ˜ O(1), E[ p log(X T )X 2 T ]≤ ˜ O σ 2 (1−δ) −2 , and E[X 4 T log(X T )]≤ ˜ O σ 4 (1−δ) −4 . Applying these bounds to (3.36) we get R(T,TSDE-LQ)≤ ˜ O p d x d q (T + √ T )σ 4 (1−δ) −4 + √ dTσ 2 (1−δ) −2 = ˜ O σ 2 p d x (d x +d u ) √ T (1−δ) −2 . (3.37) 3.5 Simulations In this section, we illustrate the performance of the TSDE-LQ algorithm through numerical simulations. The prior distribution used in TSDE-LQ are set according to (3.12) with ˆ θ 1 (i) = 1, Σ 1 = I, and Ω 1 ={θ :||A 1 +B 1 G(θ)||≤ δ} where δ is a simulation parameter. The parameter δ can be seen as the level of accuracy of the prior distribution. The smaller δ is, the more accurate the prior distribution is for the true system parameters. Note that Assumption 3.2 holds only when δ< 1. For each system, we select δ = 0.99 andδ = 2. We run 500 simulations and show the mean of regret with confidence interval for each scenario. 
We use σ^2 = 1 for all the experiments. We first consider two scalar systems: a stable system with A_1 = 0.9 and an unstable system with A_1 = 1.5. We set Q = 2, R = 1 and B_1 = 0.5 for both cases. Figures 3.1(a) and 3.1(b) show the regret curves for the stable and the unstable system, respectively. TSDE-LQ successfully learns and controls both the stable and the unstable system, as the regret grows at a sublinear rate (though not apparent, it grows as Õ(√T)). Although Assumption 3.2 does not hold when δ = 2, the results show that TSDE-LQ might still work in this situation.

[Figure 3.1: Scalar Systems. (a) Expected regret for a stable scalar system; (b) expected regret for an unstable scalar system. Each panel shows regret vs. horizon for δ = 0.99 and δ = 2.]

Figure 3.2 illustrates the regret curves for a multi-dimensional system with d_x = d_u = 3. We again consider two systems: a stable system with 0.9 as the largest eigenvalue of A_1 and an unstable system with 1.5 as the largest eigenvalue of A_1. The results show that TSDE-LQ achieves sublinear regret in the multi-dimensional cases also. Figure 3.2(b) shows that the ratio R(T)/√T is converging, which verifies that the rate of growth matches the theoretical rate of Theorem 3.1, Õ(√T).

[Figure 3.2: Multi-Dimensional Systems. (a) Expected regret for a stable vector system with d_x = d_u = 3; (b) R(T)/√T for an unstable vector system with d_x = d_u = 3. Each panel shows the curves for δ = 0.99 and δ = 2.]

Remark 3.5. Note that we designed a linear schedule for TSDE-LQ since an exponential schedule does not obtain sublinear regret. [47] proposed a TS-based algorithm where the exploration schedule is deterministic and exponentially spaced in time, i.e., their algorithm generates a sample from μ_t if t = ⌊γ^m⌋ for some m = 0, 1, 2, ... and γ > 1. We considered a scalar system with A_1 = 1.5, B_1 = 0.5, Q = 2, R = 1. We set the parameter γ = 1.09 and the prior distribution according to (3.12) with θ̂_1(i) = 1, Σ_1 = I, and Ω_1 = {θ : ||A_1 + B_1 G(θ)|| ≤ 2}. Figure 3.3 shows the log of the regret for the algorithm in [47] averaged over 300 simulation runs. It can be observed that the regret curve is not sublinear (contrary to the claim in [47]) and actually has sharp jumps at some time instances. All the starting points of the jumps in regret coincide with the sampling times in the exploration schedule. This can be explained as follows: if at some sampling time the generated sample is such that the system becomes closed-loop unstable, then the system is stuck with this bad sample for an exponentially long time because of the deterministic exponential schedule. This leads to a large system state during that episode, which blows up the regret to a very large number and hence results in sharp jumps in regret. Note that TSDE-LQ performs much better for the same setup, as observed in Figure 3.1(b).

[Figure 3.3: Expected log regret for a scalar system under an exponential schedule (log(Regret) vs. horizon).]

3.6 Mean field LQ learning

The LQ learning problem in section 3.2 is considered in a single agent setting. In this section, we consider a multi-agent learning problem of mean-field LQ.
We will design a TS based learning algorithm TSDE-MF and establish theoretical guarantees on its regret. 3.6.1 Problem formulation Consider a large-scale system with n mean-field coupled agents. In particular, let x i t ∈R dx andu i t ∈R du be the state and control action of agenti, 1≤i≤n, at timet. The mean-field 53 of the states, ¯ x t , is the empirical mean of the states of all agents and similarly the mean-field of the control actions, ¯ u t , is the empirical mean of the actions of all agents. That is, ¯ x t := 1 n n X i=1 x i t and ¯ u t := 1 n n X i=1 u i t Each agent has linear state dynamics and the agents are dynamically coupled through the mean-field of the states and the actions. In particular, the state dynamics of agent i are given by, x i t+1 =Ax i t +Bu i t +D¯ x t +E¯ u t +w i t , (3.38) where A,B,D, and E are matrices of appropriate dimensions. The noise w i t is assumed to be a Gaussian random vector with zero mean and covariance matrixI. We assume that the noise is independent across both time and the agents. Let x t := vec(x 1 t ,...,x n t ) and u t := vec(u 1 t ,...,u n t ) denote the global system state and control actions at time t. Agent i determines its action at time t using the policy π i t as u i t =π i t (x 1:t , u 1:t−1 ). We callπ i := (π i 1 ,π i 2 ,...) the control policy of agenti and the collection π := (π 1 ,π 2 ,...,π N ) is referred to as the joint control policy of the agents. At each time, the system incurs a cost that consists of a term which is quadratic in the local state and action of each agent and a cost which is quadratic in the mean-field state and action. More precisely, the cost at time t is given by, c(x t , u t ) = ¯ x | t ¯ Q¯ x t + ¯ u | t ¯ R¯ u t + 1 n n X i=1 h (x i t ) | Qx i t + (u i t ) | Ru i t i , (3.39) 54 where Q,R, ¯ Q, ¯ R are positive definite matrices. Let θ | = [A,B,D,E] denote the system parameters. The infinite horizon cost of a joint control policy π is given as follows, J(θ;π) := lim sup T→∞ 1 T T X t=1 E π [c(x t , u t )|θ]. (3.40) If the system parameters θ are known, the infinite horizon mean-field team problem is to find a joint control policy π that minimizes the average infinite-horizon cost. We use J(θ) to denote the optimal infinite-horizon cost for a given θ, that is, J(θ) = inf π J(θ;π). We are interested in the problem of adaptively controlling the mean-field system without the knowledge of the underlying parameterθ. Thus we assume that the system parameterθ is fixed but unknown matrix. We study this problem in a Bayesian setting and assume that there is a prior distribution λ 1 for θ. We measure the performance of an adaptive control policy π using the notion of regret. Regret compares the cost incurred under the adaptive policy against the optimal cost J(θ) over a time horizon T . For a fixed θ, the expected regret of a joint policy π over a time horizon of T is given as, R(T,θ;π) =E h T X t=1 h c(x π t , u π t )−J(θ) ii . (3.41) The above expectation is with respect to the randomness of w i t and the randomized algo- rithm. The Bayes regret of the policy π is R(T ;π) = E λ 1 [R(T,θ;π)]. The mean-field LQ learning problem is to find a joint adaptive control policy π which minimizes the regret. 55 3.6.2 Preliminaries: Optimal Control for known θ Whenθ is known, the above mean-field problem is a linear quadratic regulator (LQR) with state x t and control action u t . Naively using the standard LQR result would involve solving an algebraic Riccati equation of size (nd x )× (nd x ). 
An alternative parameterization of the LQR solution was proposed in [61], which involves solving two Riccati equations of size d x ×d x . In particular, for each i define ˘ x i t := x i t − ¯ x t and ˘ u i t := u i t − ¯ u t as the state and control actions in relative coordinates (with the mean-field as origin). We will refer to ˘ x i t , ˘ u i t as the relative state and relative action of agent i. Then, the dynamics of relative and the mean-field states are given by ˘ x i t+1 =A˘ x i t +B˘ u i t + ˘ w i t (3.42) ¯ x t+1 = (A +D)¯ x t + (B +E)¯ u t + ¯ w t (3.43) where ¯ w t = P n i=1 w i t n and ˘ w i t =w i t − ¯ w t . Furthermore, the per-step cost may be written as c(x t , u t ) = ¯ c(¯ x t , ¯ u t ) + 1 n n X i=1 ˘ c(˘ x i t , ˘ u i t ) (3.44) where ¯ c(¯ x t , ¯ u t ) = ¯ x | t (Q + ¯ Q)¯ x t + ¯ u | t (R + ¯ R)¯ u t (3.45) ˘ c(˘ x i t , ˘ u i t ) = (˘ x i t ) | Q˘ x i t + (˘ u i t ) | R˘ u i t (3.46) 56 Thus, the overall linear system can be viewed asn+1 subsystems(one mean-field system and n relative systems) coupled only through noise. Let ˘ θ | := [A,B] and ¯ θ | := [A +D,B +E]. Let S( ˘ θ) and S( ¯ θ) denote the solutions of the following algebraic Riccati equations: S( ˘ θ) =DARE(A,B,Q,R) (3.47) S( ¯ θ) =DARE(A +D,B +E,Q + ¯ Q,R + ¯ R) (3.48) LetG( ˘ θ) andG( ¯ θ) denote the gain matrices corresponding to (3.47) and (3.48) respectively. Then, the optimal control strategy for the model in (3.38)-(3.39) is given by [61]: u i,∗ t =G( ˘ θ)˘ x i t +G( ¯ θ)¯ x t Also, the optimal infinite horizon cost is given by, J(θ) = ¯ J( ¯ θ) + ˘ J( ˘ θ) (3.49) where ¯ J( ¯ θ) = 1 n Tr(S( ¯ θ)) and ˘ J( ˘ θ) = 1− 1 n Tr(S( ˘ θ)). 3.6.3 Naive Thompson Sampling We can view the learning problem of the mean filed LQ system described in Section 3.6.1 as a centralized LQR problem with state x t , action u t and the following dynamics: x t+1 = ˜ Ax t + ˜ Bu t + w t (3.50) 57 where ˜ A, ˜ B are given as follows: ˜ A = A + D n D n ··· D n D n A + D n ··· D n . . . . . . ··· . . . D n D n ··· A + D n and ˜ B = B + E n E n ··· E n E n B + E n ··· E n . . . . . . ··· . . . E n E n ··· B + E n Then, we can apply TSDE-LQ to the mean-field LQ learning problem by viewing it as the above centralized LQR problem. However, this approach has some drawbacks which are listed in the following: 1. Fully centralized: Naive TSDE-LQ will require complete state information at each agent to compute the posterior and the control action at each time. 2. Computational complexity: The computational complexity of each iteration ofTSDE-LQ will scale asO(n 3 ) as the algorithm requires to compute the inverse and determinant of matrices whose size scales with n. 3. Regret bound: If we naively apply TSDE-LQ to the mean-field LQ problem, then Theorem 3.1 will give us the following regret bound R(T,TSDE−LQ)≤ ˜ O n 1.5 p d x (d x +d u ) √ T The above regret bound increases polynomially with the number of agents n. Thus, applying TSDE-LQ naively to the mean-field problem will result in a large regret bound as the number of agents grows large. 58 Since the dynamics of each agent is governed by the same parameters, the number of parameters to be learned is independent of n. Applying TSDE-LQ naively to the mean-field LQ problem does not take advantage of the underlying structure of the corresponding system matrices. In the next subsection, we will design a TS based algorithm which mitigates these drawbacks by exploiting the structure of the mean-field problem and the structure of its solution with known θ as described in Section 3.6.2. 
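Before turning to the algorithm, it may help to see how the decomposition of Section 3.6.2 is used computationally: rather than one (n·d_x)-dimensional Riccati equation, only the two d_x-dimensional equations (3.47) and (3.48) are solved, and each agent's action combines a relative-state term and a mean-field term. The following is a minimal sketch for known θ; the function names are ours and scipy's DARE solver is assumed.

```python
import numpy as np
from scipy.linalg import solve_discrete_are

def gain(A, B, Q, R):
    """Gain matrix associated with DARE(A, B, Q, R), as in (3.47)-(3.48)."""
    S = solve_discrete_are(A, B, Q, R)
    return -np.linalg.solve(R + B.T @ S @ B, B.T @ S @ A)

def mean_field_controls(x, A, B, D, E, Q, R, Qbar, Rbar):
    """Optimal controls for the known-theta mean-field LQ problem.

    x has shape (n, dx): row i is agent i's state.  Only two dx-dimensional
    Riccati equations are solved instead of one (n*dx)-dimensional one.
    """
    G_rel = gain(A, B, Q, R)                        # G(theta_breve)
    G_mf = gain(A + D, B + E, Q + Qbar, R + Rbar)   # G(theta_bar)
    x_bar = x.mean(axis=0)                          # mean-field state
    x_rel = x - x_bar                               # relative states
    # u_i = G(theta_breve) x_rel_i + G(theta_bar) x_bar, for every agent i
    return x_rel @ G_rel.T + x_bar @ G_mf.T
```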
3.6.4 Algorithm In this subsection, we propose a Thompson sampling based scheme TSDE-MF for the mean field LQ learning problem in Algorithm 3. TSDE-MF is inspired from the TSDE-LQ algorithm and the structure of the optimal solution for knownθ in the mean-field problem. Instead of keeping track of a posterior distribution over the complete parameterθ, the agents maintain separate posterior distributions over ˘ θ and ¯ θ conditioned on appropriate past information. At time t, the agents generate the iterates ¯ v t and ˘ v i t using the current sample of ¯ θ and ˘ θ. The agents then generate their action u i t by combining the iterates ˘ v i t and ¯ v t . 59 Algorithm 3 TSDE-MF Input: ¯ Ω, ˘ Ω, ¯ μ 1 , ˘ μ 1 , ¯ Σ 1 , ˘ Σ 1 Initialization: k 0 , k 1 , ¯ t 0 , ˘ t 0 , ¯ T −1 , ˘ T −1 ← 0 for t = 1, 2,... do if t− ¯ t k 0 > ¯ T k 0 −1 or det( ¯ Σ t )< 0.5 det( ¯ Σ t k 0 ) then k 0 ←k 0 + 1 Generate ¯ θ k 0 ∼ ¯ λ t Compute ¯ G =G( ¯ θ k 0 ) from (3.48) ¯ T k 0 −1 ←t−t k 0 −1 , ¯ t k 0 ←t end if if t− ˘ t k 1 > ˘ T k 1 −1 or det( ˘ Σ t )< 0.5 det( ˘ Σ t k 1 ) then k 1 ←k 1 + 1 Generate ˘ θ k 1 ∼ ˘ λ t Compute ˘ G =G( ˘ θ k 1 ) from (3.47) ˘ T k 1 −1 ←t− ˘ t k 1 −1 , ˘ t k 1 ←t end if Agent i applies control u i t = ¯ G¯ x t + ˘ G˘ x i t Observe new state x i t+1 ,∀i i ∗ t ← arg max i (˘ z i t ) | ˘ Σ t ˘ z i t Update ¯ λ t+1 and ˘ λ t+1 according to (3.54)-(3.57) end for TSDE-MF works as follows: 60 Let ¯ z t = [¯ x | t , ¯ u | t ] | , ˘ z i t = [(˘ x i t ) | , (˘ u i t ) | ] | . We maintain a posterior ¯ λ t on ¯ θ given the mean-field state and action history, that is, ¯ λ t ( ¯ Θ) =P( ¯ θ∈ ¯ Θ|¯ x 1:t , ¯ u 1:t−1 ) (3.51) Also, let ˘ λ t denote the posterior distribution on ˘ θ given the following relative state and action history, ˘ λ t ( ˘ Θ) =P( ˘ θ∈ ˘ Θ|{˘ x i ∗ s s , ˘ u i ∗ s s , ˘ x i ∗ s s+1 } 1≤s<t ) (3.52) That is, ˘ λ t is the distribution of ˘ θ conditioned on the tuples (˘ x i ∗ s s , ˘ u i ∗ s s , ˘ x i ∗ s s+1 ) where the s th tuple corresponds to the agent i ∗ s chosen adaptively at each time s. The agent i ∗ t is determined as follows: i ∗ t = arg max i (˘ z i t ) | ˘ Σ t ˘ z i t (3.53) where ˘ Σ t is the covariance of the probability distribution ˘ λ t . We will discuss the choice of the posterior ˘ λ t and i ∗ t in a later remark in the next section. A sample ¯ θ k 0 is generated from ¯ λ t whenever the sampling condition for ¯ λ t (Line 4 in Algo- rithm 3) becomes true and similarly a sample ˘ θ k 1 is generated from ˘ λ t based on its sampling condition. Note that the sampling conditions for ¯ θ and ˘ θ are decoupled from each other. The iterate ¯ v t is computed using the sample ¯ θ k 0 and the iterate ˘ v i t is computed using the sample ˘ θ k 1 . Agent i’s control action is computed by combining ¯ v t and ˘ v i t as follows: 61 ¯ v t =G( ¯ θ k 0 )¯ x t , ˘ v i t =G( ˘ θ k 1 )˘ x i t u i t = ˘ v i t + ¯ v t Next, we describe the update of the posterior distributions ¯ λ t and ˘ λ t . 3.6.5 Posterior update Let ¯ λ 1 and ˘ λ 1 be the prior on ¯ θ and ˘ θ respectively. We make the following assumption on the prior ¯ λ 1 Assumption 3.3. The prior distribution ¯ λ 1 consists of independent Gaussian distributions projected on a compact support ¯ Ω ⊂ R (dx+du)×dx such that for any ¯ θ ∈ ¯ Ω, the Riccati equation (3.48) with [A +D,B +E] = ¯ θ | has a unique positive definite solution. 
Specifically, there exist ¯ λ 1 (l) ∈ R dx+du for l = 1,...,d x and a positive definite matrix ¯ Σ 1 ∈R (dx+du)×(dx+du) such that for any ¯ θ∈R (dx+du)×dx ¯ λ 1 = ˜ λ 1 | ¯ Ω , ˜ λ 1 ( ¯ θ) = dx Y l=1 ˜ λ 1 ( ¯ θ(l)) ˜ λ 1 ( ¯ θ(l))≡N (¯ μ 1 (l), ¯ Σ 1 ) for l = 1,...,d x . Here ¯ θ(l) denotes ¯ θ’s lth column ( ¯ θ = [ ¯ θ(1),..., ¯ θ(d x )]). 62 Assumption 3.3 is similar to the assumption 3.1 on the prior distribution for the centralized LQR problem. This assumption says that the prior ¯ λ 1 consists of independent Gaussian components projected on a compact set ¯ Ω. The following lemma can be used for computing the posterior belief ¯ λ t Lemma 3.8. The posterior belief ¯ λ t on the parameter ¯ θ satisfies ¯ λ t = ˜ λ t | ¯ Ω , ˜ λ t ( ¯ θ) = dx Y l=1 ˜ λ t ( ¯ θ(l)) ˜ λ t ( ¯ θ(l))≡N (¯ μ t (l), ¯ Σ t ) for l = 1,...,d x . where ¯ μ t (l),l = 1,...,d x , and ¯ Σ t can be sequentially updated using observations as follows. ¯ μ t+1 (l) = ¯ μ t (l) +n ¯ Σ t ¯ z t (1 +n¯ z | t ¯ Σ t ¯ z t ) −1 (¯ x t+1 (l)− ¯ μ t (l) | ¯ z t ) (3.54) ¯ Σ −1 t+1 = ¯ Σ −1 t +n¯ z t ¯ z | t (3.55) Proof. Using (3.43), we can write, √ n¯ x t+1 = ( √ n¯ z t ) | ¯ θ + √ n ¯ w t The above equation is (3.43) scaled by a factor of √ n. The scaling is done to standardize the noise in the dynamics since the covariance of ¯ w t is 1 n I. The result then follows using the standard results from Gaussian linear regression. We now look at the update of ˘ λ t . First, we make a similar assumption for the prior ˘ λ 1 . We 63 assume that the prior ˘ λ 1 is a Gaussian distribution restricted on a subset ˘ Ω. We make this precise in the following. Assumption 3.4. The prior distribution ˘ λ 1 consists of independent Gaussian distributions projected on a compact support ˘ Ω ⊂ R (dx+du)×dx such that for any ˘ θ ∈ ¯ Ω, the Riccati equation (3.47) with [A,B] = ˘ θ | has a unique positive definite solution. Specifically, there exist ˘ μ 1 (l) ∈ R dx+du for l = 1,...,d x and a positive definite matrix ˘ Σ 1 ∈R (dx+du)×(dx+du) such that for any ˘ θ∈R (dx+du)×dx ˘ λ 1 = ˜ λ 1 | ˘ Ω , ˜ λ 1 ( ˘ θ) = dx Y l=1 ˜ λ 1 ( ˘ θ(l)) ˜ λ 1 ( ˘ θ(l))≡N (˘ μ 1 (l), ˘ Σ 1 ) for l = 1,...,d x . Here ˘ θ(l) denotes ˘ θ’s lth column ( ˘ θ = [ ˘ θ(1),..., ˘ θ(d x )]). The following lemma shows that the posterior belief ˘ λ t is also Gaussian projected on ˘ Ω Lemma 3.9. The posterior belief ˘ λ t on the parameter ˘ θ satisfies ˘ λ t = ˜ λ t | ˘ Ω , ˜ λ t ( ˘ θ) = dx Y l=1 ˜ λ t ( ˘ θ(l)) ˜ λ t ( ˘ θ(l))≡N (˘ μ t (l), ˘ Σ t ) for l = 1,...,d x . ˘ μ t (l),l = 1,...,d x , and ˘ Σ t can be sequentially updated using observations as follows. 64 ˘ μ t+1 (l) = ˘ μ t (l) + n n− 1 ˘ Σ t ˘ z i ∗ t t 1 + n n− 1 (˘ z i ∗ t t ) | ˘ Σ t ˘ z i ∗ t t −1 ˘ x i ∗ t t+1 (l)− (˘ μ t (l)) | ˘ z i ∗ t t (3.56) ( ˘ Σ t+1 ) −1 = ( ˘ Σ t ) −1 + n n− 1 ˘ z i ∗ t t (˘ z i ∗ t t ) | (3.57) Proof. Using (3.42), for each column of ˘ θ(j), we have the observation r n n− 1 ˘ x i ∗ t t+1 (l) = r n n− 1 ˘ z i ∗ t t | ˘ θ(l) + r n n− 1 ˘ w i ∗ t t (l) (3.58) We have scaled (3.42) by q n n−1 to standardize the noise since ˘ w i ∗ t t (l)∼N (0, 1− 1 n ). The result then follows using the standard results from Gaussian linear regression. 3.6.6 Regret Analysis In this subsection, we analyze the regret of TSDE-MF. 
The first key observation we make is that the regret can be decomposed into two components: regret due to mean field and regret due to relative states as follows R(T ) =E h T X t=1 h c(x t , u t )−J(θ) ii =E h T X t=1 h ¯ c(¯ x t , ¯ u t ) + 1 n X i ˘ c(˘ x i t , ˘ u i t )− ¯ J( ¯ θ) + ˘ J( ˘ θ) ii =E h ¯ c(¯ x t , ¯ u t )− ¯ J( ¯ θ) i + 1 n n X i=1 E h ˘ c(˘ x i t , ˘ u i t )− ˘ J( ˘ θ) i = ¯ R(T ) + 1 n n X i=1 ˘ R i (T ) (3.59) 65 where ¯ R(T ) :=E X t h ¯ c(¯ x t , ¯ u t )− ¯ J( ¯ θ) i ˘ R i (T ) :=E X t h ˘ c(˘ x i t , ˘ u i t )− ˘ J( ˘ θ) i The above follows using the decomposition of the instantaneous costc(x t , u t ) in (3.44)-(3.46) and the decomposition of optimal infinite horizon cost in (3.49). Therefore, we can bound the regret by bounding each term in (3.59). The second key observation is that the iterates ¯ v t and ˘ v i t are equal to the the mean field action ¯ u t and relative action ˘ u i t of agent i respectively. ¯ u t = 1 n X i ¯ v t + ˘ v i t = ¯ v t + 1 n X i ˘ v i t = ¯ v t ˘ u i t =u i t − ¯ u t = ¯ v t + ˘ v i t − ¯ v t = ˘ v i t where the above follows since P i ˘ v i t = ˘ G P i ˘ x i t = 0. Now, we note that each term in the regret decomposition can be viewed as the regret of a centralized LQR system as follows: 1. ¯ R(T ): This is equivalent to the regret of a centralized LQR system with state ¯ x t and action ¯ v t : • Dynamics: The mean-field state ¯ x t evolves as ¯ x t+1 = [¯ x t , ¯ v t ] | ¯ θ + ¯ w t . • Instantaneous cost: ¯ c(¯ x t , ¯ v t ) 66 We refer to this system as mean-field state dynamical system with parameter ¯ θ. Now, since the iterate ¯ v t is the optimal action for the mean-field state dynamical system with parameter ¯ θ k (where ¯ θ k is the sample being used at time t from the posterior of ¯ θ), it satisfies the following bellman equation, ¯ c(¯ x t , ¯ v t ) = ¯ J( ¯ θ k ) + ¯ x | t S( ¯ θ k )¯ x t −E[¯ x | t+1 S( ¯ θ k )¯ x t+1 |¯ x t , ¯ θ k ] + ( ¯ θ | ¯ z t ) | S( ¯ θ k ) ¯ θ | ¯ z t − ( ¯ θ | k ¯ z t ) | S( ¯ θ k ) ¯ θ | k ¯ z t (3.60) Let ¯ K T denote the number of times sample is generated from the posterior ¯ λ t . Then, using (3.60) we can further decompose the regret term ¯ R(T ) in (3.59) as follows: ¯ R(T ) = ¯ R 0 (T ) + ¯ R 1 (T ) + ¯ R 2 (T ) where ¯ R 0 (T ) =E h ¯ K T X k=1 ¯ T k J( ¯ θ k ) i −TE h J( ¯ θ) i , ¯ R 1 (T ) =E h ¯ K T X k=1 ¯ t k+1 −1 X t= ¯ t k h ¯ x | t S( ¯ θ k )¯ x t − ¯ x | t+1 S( ¯ θ k )¯ x t+1 ii , ¯ R 2 (T ) =E h ¯ K T X k=1 ¯ t k+1 −1 X t= ¯ t k h ( ¯ θ | ¯ z t ) | S( ¯ θ k ) ¯ θ | ¯ z t − ( ¯ θ | k ¯ z t ) | S( ¯ θ k ) ¯ θ | k ¯ z t ii . 2. ˘ R i (T ): This is equivalent to the regret of a centralized LQR system with state ˘ x i t and action ˘ v i t : 67 • Dynamics: The relative state ˘ x i t evolves as ˘ x i t+1 = [˘ x i t , ˘ v i t ] | ˘ θ + ˘ w i t . • Instantaneous cost: ˘ c(˘ x i t , ˘ v i t ) We refer to this system as relative-state dynamical system with parameter ˘ θ. The iterate ˘ v i t is the optimal action for the relative-state dynamical system with parameter ˘ θ (where ˘ θ k is the sample being used at time t from the posterior of ˘ θ) and hence satisfies the following bellman equation, ˘ c(˘ x i t , ˘ v i t ) = ˘ J( ˘ θ k ) + (˘ x i t ) | S( ˘ θ k )˘ x i t −E[(˘ x i t+1 ) | S( ˘ θ k )˘ x i t+1 |˘ x i t , ˘ θ k ] + ˘ θ | ˘ z i t ) | S( ˘ θ k ) ˘ θ | ˘ z i t − ( ˘ θ | k ˘ z i t ) | S( ˘ θ k ) ˘ θ | k ˘ z i t (3.61) Let ˘ K T denote the number of times sample is generated from the posterior ˘ λ t . 
Then, using (3.61), we can decompose the regret term ˘ R i T in (3.59) as follows: ˘ R i (T ) = ˘ R i 0 (T ) + ˘ R i 1 (T ) + ˘ R i 2 (T ) where ˘ R i 0 (T ) =E h ˘ K T X k=1 ˘ T k J( ˘ θ k ) i −TE h J( ˘ θ) i , ˘ R i 1 (T ) =E h ˘ K T X k=1 ˘ t k+1 −1 X t= ˘ t k h (˘ x i t ) | S( ˘ θ k )˘ x i t − (˘ x i t+1 ) | S( ˘ θ k )˘ x i t+1 ii , ˘ R i 2 (T ) =E h ˘ K T X k=1 ˘ t k+1 −1 X t= ˘ t k h ( ˘ θ | ˘ z i t ) | S( ˘ θ k ) ˘ θ | ˘ z i t − ( ˘ θ | k ˘ z i t ) | S( ˘ θ k ) ˘ θ | k ˘ z i t ii . 68 We can now bound each term in the regret decomposition (3.59) by following the anal- ysis of the regret bounds for centralized LQR in section 3.4. For that purpose, define ¯ X T = max 1≤t≤T ||¯ x t || , ˘ X i T = max 1≤t≤T ||˘ x i t ||. Also, since ¯ S(·), ¯ G(·) are well defined contin- uous functions on a compact set ¯ Ω there exist finite constants ¯ M J , ¯ M ¯ θ , ¯ M S , ¯ M G such that Tr( ¯ S( ¯ θ))≤ ¯ M J ,|| ¯ θ||≤ ¯ M ¯ θ ,|| ¯ S( ¯ θ)||≤ ¯ M S and||[I, ¯ G( ¯ θ)]||≤ ¯ M G for all ¯ θ∈ ¯ Ω. Similarly, there exist finite constants ˘ M J , ˘ M ˘ θ , ˘ M S , ˘ M G such thatTr( ˘ S( ˘ θ))≤ ˘ M J ,|| ˘ θ||≤ ˘ M ˘ θ ,|| ˘ S( ˘ θ)||≤ ˘ M S and||[I, ˘ G( ˘ θ)]||≤ ˘ M G for all ˘ θ∈ ˘ Ω. The next lemma upper bounds the number of samples ¯ K T , ˘ K T Lemma 3.10. ¯ K T ≤O q (d x +d u )T log(nT ¯ X 2 T ) ˘ K T ≤O v u u t (d x +d u )T log n n− 1 X t ( ˘ X i ∗ t T ) 2 ! Proof. See Appendix B. We make the following assumption which ensures that the closed loop dynamics of mean field state and relative state of each agent is stable. Assumption 3.5. There exists a positive number ¯ δ < 1 and ˘ δ < 1 such that for any [A +D,B +E] = ¯ θ | , ¯ θ 0 ∈ ¯ Ω and for any [A,B] = ˘ θ | , ˘ θ 0 ∈ ˘ Ω, we have ρ(A +BG( ˘ θ 0 ))≤ ˘ δ ρ(A +D + (B +E)G( ¯ θ 0 ))≤ ¯ δ 69 where ρ(·) is the spectral radius of a matrix. The following lemma upper bounds ¯ X T , ˘ X i T Lemma 3.11. For any j≥ 1 and any T we have E h ¯ X j T i ≤ 1 √ n j O log(T )(1− ¯ δ) −j E h ( ˘ X i T ) j i ≤ 1− 1 n j/2 O log(T )(1− ˘ δ) −j Proof. The proof is identical to the proof of Lemma 3.2 by noting that the noise covariance in the ¯ x t dynamical system is 1 n I and the noise covariance in the ˘ x i t dynamical system is 1− 1 n I. Then, we have the following lemma which bounds the regret due to the mean-field dynamical system: Lemma 3.12. ¯ R 0 (T )≤ 1 n ¯ M J E[ ¯ K T ]≤ 1 n ˜ O q (d x +d u )T (3.62) ¯ R 1 (T )≤ ¯ M S E[ ¯ K T ¯ X 2 T ]≤ 1 n ˜ O q (d x +d u )T (3.63) ¯ R 2 (T )≤O ¯ M 2 q (T +E[ ¯ K T )])E[ ¯ X 4 T log(T ¯ X 2 T )] ≤ 1 n ˜ O p d x (d x +d u ) √ T (3.64) where ¯ M 2 = ¯ M S ¯ M ¯ θ ¯ M 2 G r 32(dx+du) 2 dx λ min ( ¯ Σ −1 1 ) . 70 Proof. The lemma follows directly from lemma 3.5-3.7, lemma 3.10, 3.11 and the fact that covariance of noise ¯ w t in the mean-field dynamical system is 1 n I. The following lemma bounds the regret due to the relative state dynamical system of agent i. Lemma 3.13. ˘ R i 0 (T )≤ 1− 1 n ˘ M J E[ ˘ K T ]≤ 1− 1 n ˜ O q (d x +d u )T (3.65) ˘ R i 1 (T )≤ ˘ M S E[ ˘ K T ( ˘ X i T ) 2 ]≤ 1− 1 n ˜ O q (d x +d u )T (3.66) ˘ R i 2 (T )≤O ˘ M 2 v u u t (T +E[ ˘ K T )])E[( ˘ X i T ) 4 log( T X t=1 ( ˘ X i ∗ t T ) 2 )] ≤ 1− 1 n ˜ O p d x (d x +d u ) √ T (3.67) where ˘ M 2 = ˘ M S ˘ M ˘ θ ˘ M 2 G r 32(dx+du) 2 dx λ min ( ˘ Σ −1 1 ) Proof. The proof of (3.65),(3.66) follows from lemma 3.5,3.6 and lemma 3.10, 3.11. 
Follow- ing the proof of lemma 3.7 we can bound ˘ R i 2 (T ) similar to (22) as follows, ˘ R i 2 (T )≤ s E h X t || ˘ Σ −0.5 t ( ˘ θ− ˘ θ t )|| 2 i s E h X t ( ˘ X i T ) 2 || ˘ Σ 0.5 t ˘ z i t || 2 i (3.68) From Lemma B.2 in the appendix B, the first part of the above bound follows: s E h X t || ˘ Σ −0.5 t ( ˘ θ− ˘ θ t )|| 2 i ≤ q 4(d x +d u )d x (T +E[ ˘ K T ]). (3.69) 71 For the second part of the bound in (3.68), we note that X t || ˘ Σ 0.5 t ˘ z i t || 2 = T X t=1 (˘ z i t ) | ˘ Σ t ˘ z i t ≤ T X t=1 max 1, ˘ M 2 G ( ˘ X i T ) 2 ˘ λ min ! min(1, (˘ z i t ) | ˘ Σ t ˘ z i t ) ≤ T X t=1 max 1, ˘ M 2 G ( ˘ X i T ) 2 ˘ λ min ! min(1, (˘ z i ∗ t t ) | ˘ Σ t ˘ z i ∗ t t ) (3.70) where the last inequality follows from the definition of i ∗ t in (3.53). Using Lemma 8 of [30] we have T X t=1 min(1, (˘ z i ∗ t t ) | ˘ Σ t ˘ z i ∗ t t )≤ 2d log tr( ˘ Σ −1 1 ) + ˘ M 2 G P T t=1 (X i ∗ t T ) 2 d ! (3.71) Combining (3.70) and (3.71), we can bound the second part of (3.68) to the following s E h X t ( ˘ X i T ) 2 || ˘ Σ 0.5 t ˘ z i t || 2 i ≤O v u u t 2(d x +d u ) ˘ M 2 G ˘ λ min E h ( ˘ X i T ) 4 log( T X t=1 ( ˘ X i ∗ t T ) 2 ) i . (3.72) Combining (3.69),(3.70) with (3.68) we get the first inequality in (3.67). The second in- equality in (3.67) then follows from the bound on ˘ X i T in lemma B.3 in appendix B. We now state the main bound on the total regret in the following, 72 Theorem 3.2. The regret of TSDE-MF is upper bounded as follows, R(T,TSDE-MF)≤ ˜ O √ T (3.73) Proof. Using Lemma 3.12-3.11 we get the following bounds on ¯ R(T ) and ˘ R i (T ) ¯ R(T )≤ 1 n ˜ O( √ T ) (3.74) ˘ R i (T )≤ 1− 1 n ˜ O( √ T ) (3.75) The statement of the theorem follows from the regret decomposition in (3.59) and the above bounds. Remark 3.6. Consider an alternate posterior ˘ λ j t over ˘ θ which is defined as ˘ λ j t ( ˘ Θ) =P( ˘ θ∈ ˘ Θ|x j 1:t ,u j 1:t−1 ) for a fixed j∈{1,...,n}. So ˘ λ j t is the distribution of ˘ θ conditioned on the relative state and action history of agent j. It can be shown that the posterior of column l of ˘ θ, ˘ λ j t ( ˘ θ(l)), is a Gaussian distributionN (˘ μ j t (l), ˘ Σ j t ) projected on ˘ Ω where, ˘ μ j t+1 (l) = ˘ μ j t (l) + n n− 1 ˘ Σ j t ˘ z j t 1 + n n− 1 (˘ z j t ) | ˘ Σ j t ˘ z j t −1 ˘ x j t+1 (l)− (˘ μ j t (l)) | ˘ z j t ( ˘ Σ j t+1 ) −1 = ( ˘ Σ j t ) −1 + n n− 1 ˘ z j t (˘ z j t ) | The above update rule suggests that the inverse covariance of ˘ θ increases as ˘ z j t increase. Thus, instead of choosing a fixed agent’s relative state-action observations at each time t, we can adaptively pick an agent at time t based on a measure of ˘ z j t in order to reduce the 73 covariance of ˘ θ at a faster rate. Picking the observation of the agent which has the highest ||˘ z i t || 2 ˘ Σt = (˘ z i t ) | ˘ Σ t ˘ z i t at t leads to the posterior ˘ λ t which has been used for TSDE-MF. Remark 3.7. Note that TSDE-MF does not require complete state information at each agent. Hence, it is not a fully centralized algorithm. Each agent only requires the knowledge of its local state x i t , the mean-field state ¯ x t and the tuple (˘ x i ∗ t−1 t−1 , ˘ u i ∗ t−1 t−1 , ˘ x i ∗ t−1 t ) at time t to run the algorithm and update the posteriors. Remark 3.8. The agents can find i ∗ t in a distributed fashion as follows: Agent i compares its (˘ z i t ) | ˘ Σ t ˘ z i t with the running maximum max j<i (˘ z j t ) | ˘ Σ t ˘ z j t and passes the maximum to the agent i + 1 if i<n. Finally, agent n then determines max j (˘ z j t ) | ˘ Σ t ˘ z j t and broadcasts i ∗ t to all the agents. 
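As a small illustration of the message passing described above (the interface here is illustrative, not part of the algorithm specification), each agent only needs to forward the running maximum of its local statistic along the chain:

```python
import numpy as np

def select_agent(z_rel, Sigma_breve):
    """Selection of i*_t from (3.53), written as the sequential max-passing
    of Remark 3.8.  z_rel[i] is agent i's relative state-action vector and
    Sigma_breve is the current posterior covariance."""
    best_val, best_idx = -np.inf, -1
    for i, z in enumerate(z_rel):        # agent i receives (best_val, best_idx)
        val = z @ Sigma_breve @ z        # local statistic (z_i)' Sigma z_i
        if val > best_val:               # compare with the running maximum
            best_val, best_idx = val, i
        # ... agent i passes (best_val, best_idx) on to agent i + 1 ...
    return best_idx                      # agent n broadcasts i*_t
```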
Thus, in every iteration of TSDE-MF, agent i needs to do the following: (a) Compute (˘ z i t ) | ˘ Σ t ˘ z i t and compare it with the running maximum, (b) Generate sample from the posterior if the sampling condition is true, (c) Compute the iterates ¯ v t and ˘ v i t and (d) Update the posteriors. The computation complexity of each of these steps is independent of n. Therefore, the computational complexity of each iteration of TSDE-MF is O(1) (independent of n) per agent. 3.7 Conclusion In this chapter, we proposed a Thompson sampling with dynamic episodes TSDE-LQ learn- ing algorithm for control of stochastic linear systems with quadratic costs. Under some conditions on the prior distribution, we provide a ˜ O( √ T ) bound on expected regret of TSDE-LQ. This implies that the average regret per unit time goes to zero, thus the learning 74 algorithm asymptotically learns the optimal control policy. Numerical simulations confirm that TSDE-LQ indeed achieves sublinear regret which matches with the theoretical upper bounds. In addition to use of the Posterior sampling-based learning, the key novelty here is design of an exploration schedule that achieves sublinear regret. Later, we also consider the multi-agent setting of the mean-field LQ learning problem. We saw that naively applying TSDE-LQ to this problem requires full state information at each agent, has high computational complexity and results in a regret bound which scales polynomially in the number of agents. We proposed a Thompson sampling based learning algorithm TSDE-MF for the mean-field LQ learning problem. Inspired by the structure of the optimal control solution, TSDE-MF decomposes the problem to learning two parameters ( ¯ θ and ˘ θ) separately. We show that the regret of TSDE-MF does not grow with the number of agents and is of the order ˜ O( √ T ) under some conditions on the prior distribution. Moreover, TSDE-MF is not a fully centralized algorithm and has low computational complexity. 75 Chapter 4 Thompson Sampling for some Decentralized Control Problems 4.1 Introduction Decentralized control problems involve a distributed group of agents controlling a dynamic system to achieve a common goal under uncertainty. Decentralized control problems, also called team problems, have been studied extensively under different information structures for the agents, see for example [62–70]. Problems where the dynamic system is a controlled Markov process jointly controlled by several agents have also been studied in [15, 71, 72]. [73] provides a survey of such problems under partial observation models. All of these works assume that the model is perfectly known to all the agents in the sys- tem. However, for most real-world systems, it is hardly the case that the model and its 76 parameters are known precisely. For such decentralized control problems, Q-Learning based algorithms have been studied in [74–76]. A reinforcement learning based algorithm for team problems was presented in [77]. While [74–76] are heuristic based approaches without any guarantees, [77] gives asymptotic convergence results. However, these works don’t take into account the online performance of the learning algorithm. It would be desirable to learn the model parameters and perform optimal control simultaneously at the fastest possible non-asymptotic rate. Recent advances in Online Learning open the possibility of using Online Learning-based methods for finding the optimal controllers for unknown stochastic systems. 
In this chapter, we consider two decentralized control problems with different dynamics and information-sharing models: (i) decoupled dynamics with no information sharing, and (ii) coupled dynamics with one-step delayed information sharing. The state transition dynamics are parametrized by a parameter θ which can take finitely many values. We are interested in the problem of maximizing the infinite-horizon average reward when the underlying parameter θ is not known to the agents. The agents need to choose their actions such that they learn the true parameter while also maximizing their expected reward. This leads to the well-known problem of exploration vs. exploitation. Moreover, the learning should happen simultaneously at both agents, which makes this problem even more challenging.

We propose a decentralized Thompson sampling algorithm for an infinite-horizon average-reward two-agent team learning problem. Thompson sampling (TS) has recently received wide attention as an online learning algorithm due to its computational efficiency and performance. The idea of TS was first proposed by Thompson in [12] for stochastic bandit problems. It has been extensively applied to centralized Markov Decision Processes (MDPs) in [26–31], [78], [45], where the agent computes a belief about the unknown parameters using the observed information and a prior distribution. The agent draws a sample from its belief and selects an action using the benchmark policy for the sampled parameter. We extend the TS algorithm of [45] to the two decentralized control problems considered here. We show that, under some assumptions on the state transition kernels, the regret achieved by Thompson sampling is upper bounded by a constant independent of the time horizon. Note that, unlike previous works on team learning problems, we provide finite-time (non-asymptotic) guarantees on the performance of our learning algorithm.

4.1.1 Notation

Random variables are denoted by upper-case letters and their realizations by the corresponding lower-case letters. $X_{a:b}$ denotes the collection $(X_a, X_{a+1}, \cdots, X_b)$. The boldface letter $\mathbf{X}$ is used to denote the collection $(X^1, X^2)$. $\mathbb{P}(\cdot)$ is the probability of an event and $\mathbb{E}[\cdot]$ is the expectation of a random variable. For a collection of functions $g$, we use $\mathbb{P}^{g}(\cdot)$ and $\mathbb{E}^{g}[\cdot]$ to denote that the probability measure/expectation depends on the choice of functions in $g$. $\mathbb{I}$ denotes the indicator function. For a matrix $A$ with entries $a_{ij}$, $\|A\| := \max_i \sum_j |a_{ij}|$ denotes the matrix norm.

4.2 Problem 1: Decoupled dynamics

Consider a two-agent team problem with state process $\mathbf{X}_t = (X^1_t, X^2_t)$, $t \geq 0$, over an infinite time horizon. $X^i_t \in \mathcal{X}$ is the state process associated with agent $i \in \{1,2\}$. The agents take actions $U^i_t \in \mathcal{U}$ and obtain a joint reward $r(\mathbf{X}_t, \mathbf{U}_t)$ at each time $t$. The agents' states evolve in a decoupled fashion and in a controlled Markovian manner via the following kernels:
$$X^1_{t+1} \sim q^1_{\theta}(\cdot \mid X^1_t, U^1_t), \qquad X^2_{t+1} \sim q^2_{\theta}(\cdot \mid X^2_t, U^2_t),$$
where $\theta \in \Theta$ is a fixed but unknown parameter which parametrizes the system dynamics. The initial state of the agents is fixed to $(x^1_0, x^2_0)$. We assume that $\Theta, \mathcal{X}, \mathcal{U}$ are finite sets. Agent $i$ only observes $X^i_t$ at each time $t$. The agents are interested in finding control strategies which maximize the average reward defined as follows:
$$\lim_{T \to \infty} \frac{1}{T}\, \mathbb{E}\Big[\sum_{t=0}^{T-1} r(\mathbf{X}_t, \mathbf{U}_t)\Big]. \qquad (4.1)$$
If θ was known, this problem becomes an instance of a decentralized optimal stochastic control problem.
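Before turning to the known-θ benchmark, the following minimal sketch illustrates the decoupled model and the average-reward criterion in (4.1) by simulating the two-agent system under fixed stationary Markov policies and estimating the resulting average reward empirically. The transition kernels, reward function, and policies used here are arbitrary placeholders chosen only for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative finite model: |X| = 3 states, |U| = 2 actions per agent.
# q[theta][i] has shape (|X|, |U|, |X|) and encodes q^i_theta(x' | x, u);
# the numerical values are arbitrary.
def random_kernel(num_x=3, num_u=2):
    k = rng.random((num_x, num_u, num_x))
    return k / k.sum(axis=2, keepdims=True)

q = {theta: [random_kernel(), random_kernel()] for theta in (0, 1)}

def reward(x, u):
    # Arbitrary joint reward r(x_t, u_t) chosen for the example.
    return float(x[0] == x[1]) - 0.1 * (u[0] + u[1])

def average_reward(theta, policies, T=50_000):
    """Estimate (1/T) E[sum_t r(X_t, U_t)] under fixed stationary policies."""
    x = [0, 0]                                        # fixed initial state (x^1_0, x^2_0)
    total = 0.0
    for _ in range(T):
        u = [policies[i][x[i]] for i in range(2)]     # decentralized: u^i depends on x^i only
        total += reward(x, u)
        x = [rng.choice(3, p=q[theta][i][x[i], u[i]]) for i in range(2)]
    return total / T

# A pair of (arbitrary) stationary Markov policies mu^i : X -> U.
mu = [np.array([0, 1, 1]), np.array([1, 0, 1])]
print("empirical average reward under theta = 0:", average_reward(0, mu))
```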
It can be shown for this particular problem that the optimal strategy for each agent lies in the class of Markov policies (not necessarily stationary). We will further restrict our policy space to the class of stationary Markov policies. 79 Letμ i θ denote a benchmark stationary policy for agenti when the true parameter isθ. Note that the collection (μ 1 θ ,μ 2 θ ) represents a pair of desirable policies if the true parameter is θ. These strategies may or may not be globally optimal (for example, they could be person- by-person optimal strategies). The goal of each agent is to mimic the performance of the benchmark stationary policy when θ is unknown. We assume that μ i θ can be computed offline for each agent i and parameter θ∈ Θ. Under the policy μ i θ , the state process (X i t ,t≥ 1) is a time homogenous Markov chain for i = 1, 2. The probability transition matrix of the Markov chain is denoted by Q θ,i where Q θ,i (x,y) =q i θ (y|x,μ i θ (x)). We assume that the Markov chain is ergodic so that there exists a row constant stochastic matrix Q ∞ θ,i for each agent i and the following holds, lim t→∞ ||Q t θ,i −Q ∞ θ,i || = 0. (4.2) It can be shown via a covergence result from the theory of ergodic Markov chains [79], that there exists constants α θ,i > 0, 0<β θ,i < 1,i = 1, 2, such that ||Q t θ,i −Q ∞ θ,i ||≤α θ,i β t θ,i . (4.3) Let the history of the agenti at timet be denoted byH i t := (X i 1:t ,U i 1:t−1 ). In the absence of the knowledge of the true parameter, a learning algorithm can potentially map the history of the agent i (H i t ) to its action U i t . Let φ i = (φ i 1 ,φ i 2 ,...) be a sequence of mappings such that agent i chooses its actions as U i t =φ i t (H i t ). Then, the collection φ = (φ 1 ,φ 2 ) is called an online learning algorithm. 80 We use regret as a measure of the performance of any online learning algorithm. Regret compares the reward obtained by the learning algorithm with the reward obtained if the true parameter was known. We define the average regret by comparing the reward of the learning algorithmφ against the reward of the benchmark (or desirable) stationary Markov policies for the two agents i.e. R(T,φ,θ) := E μ θ T−1 X t=0 r(Y t ,μ θ (Y t ))−E φ T−1 X t=0 r(X t , U t ) . whereY i t denotes the state trajectory of agenti under the policyμ i θ andμ θ (Y t ) = (μ 1 θ (Y 1 t ),μ 2 θ (Y 2 t )). X i t ,U i t denote the state trajectory and control action of agenti under the learning algorithm φ. The worst case regret is then defined as: R(T,φ) := max θ∈Θ R(T,φ,θ). (4.4) In the next section, we look at a decentralized Thompson sampling algorithm and analyze its regret. 4.3 Thompson Sampling 4.3.1 Algorithm In this section, we present the Thompson sampling algorithm for decentralized learning. In Thompson sampling, agents incorporate the uncertainty about the underlying parameter 81 by maintaining a probability distribution/belief π i t (·) over the parameter space Θ, where π i t (θ) is the belief of the agent i at time t that the true parameter is θ. π i 0 (·) denotes agent i’s prior belief. The agents update their belief via Bayes’ rule as follows: π i t+1 (θ) = q i θ (X i t+1 |X i t ,U i t )π i t (θ) X ˜ θ∈Θ q i ˜ θ (X i t+1 |X i t ,U i t )π i t ( ˜ θ) (4.5) Thompson sampling algorithm works as follows: At each time t, each agent i generates a sample θ i t from its belief π i t . The agents then take θ i t to be the true parameter and apply control using the desirable stationary policy for θ i t . Algorithm 1 describes this procedure. 
Algorithm 4 Thompson Sampling Input: π i 0 ,i = 1, 2 for each time t≥ 0 and each agent i do Generate θ i t ∼π i t (·) Apply control u i t =μ θ i t (x i t ).Observe the new state X i t+1 Update π i t+1 according to (4.5) end for 4.3.2 Analysis We make the following assumption for analyzing the regret achieved under the Thompson sampling scheme. For each x i ,u i , let q i θ (x i ,u i ) =q i θ (·|x i ,u i ). 82 Assumption 4.1. For each i,x i ∈X,u i ∈U,θ 1 ,θ 2 ∈ Θ with θ 1 6=θ 2 there exists an > 0 such that K(q i θ 1 (x i ,u i )|q i θ 2 (x i ,u i ))≥, (4.6) whereK(p|q) is the KL divergence between any two probability distributions p and q and is defined as P j p j log p j q j when p is absolutely continous with respect to q and∞ otherwise. Remark 4.1. Note that Assumption 4.1 was also present in the analysis of Thompson sampling for centralized MDP [45]. This assumption allows us to show that the agents’ beliefs concentrate at the true parameter exponentially fast. Lemma 4.1. Suppose Assumption 4.1 holds true. Then, under Thompson sampling algo- rithm, there exists constants a θ ,b θ such that E τ θ [1−π i t (θ)]≤a θ e −b θ t ,i∈{1, 2}, (4.7) whereE τ θ denotes the expectation operator under Thompson sampling when the true param- eter is θ . The above lemma states that when the true parameter is θ, π i t (θ)→ 1 exponentially fast in expectation. The proof of this lemma is identical to the proof of Lemma 4 in [45]. We provide the proof in Appendix C for completeness. We are now ready to state and prove the main result. Theorem 4.1. The regret under the Thompson sampling algorithm (Algorithm 1) is upper bounded by a constant which is independent of the horizon T . 83 Proof. We follow the methodology of Theorem 5 in [45] with appropriate modifications to take into account the decentralized nature of our problem. Let ν(x) be the shorthand to r(x,μ 1 θ (x 1 ),μ 2 θ (x 2 )). Let R(T ) be the T -horizon regret of Thompson sampling algorithm when the true parameter is θ. We decompose the regret into two parts by adding and subtracting the termE τ [ P T−1 t=0 ν(X t )] and get an upper bound using the triangle inequality as follows: R(T ) = E μ θ " T−1 X t=0 ν(Y t ) # −E τ " T−1 X t=0 r(X t , U t ) # ≤ E μ θ " T−1 X t=0 ν(Y t ) # −E τ " T−1 X t=0 ν(X t ) # | {z } R 1 (T ) + E τ " T−1 X t=0 ν(X t )− T−1 X t=0 r(X t , U t ) # | {z } R 2 (T ) We will analyze and bound the two terms seperately. Let’s look at the first term R 1 (T ) of the above decomposition. R 1 (T ) = T−1 X t=0 X x 1 t ,x 2 t ν(x t ) P μ θ θ (Y t = x t )−P τ θ (X t = x t ) = T−1 X t=0 X x 1 t ,x 2 t ν(x t ) P μ 1 θ θ (Y 1 t =x 1 t )P μ 2 θ θ (Y 2 t =x 2 t )− P τ θ (X 1 t =x 1 t )P τ θ (X 2 t =x 2 t ) i = T−1 X t=0 X x 1 t ,x 2 t ν(x t ) " 2 Y i=1 Q t θ,i (x i 0 ,x i t )− 2 Y i=1 P τ θ (X i t =x i t ) # (4.8) where the second equality follows from the independence of the dynamics of the two agents and independence of the Thompson sampling done at the two agents. The last equality 84 follows from the fact that under the policy μ i θ , Y i t is Markov chain with transition matrix Q θ,i . Now, we can writeP τ θ (X i t =x i t ) =E τ θ P τ θ (X i t =x i t |H i t−1 ) where P τ θ (X i t =x i t |H i t−1 ) = X ˜ θ∈Θ π i t−1 ( ˜ θ)P τ θ (X i t =x i t |θ i t−1 = ˜ θ,H i t−1 ) =π i t−1 (θ)Q θ,i (X i t−1 ,x i t ) + X ˜ θ6=θ π i t−1 ( ˜ θ)q θ,i (x i t |X i t−1 ,μ i ˜ θ (X i t−1 )) =Q θ,i (X i t−1 ,x i t ) + X ˜ θ6=θ π i t−1 ( ˜ θ)[q θ,i (x i t |X i t−1 ,μ i ˜ θ (X i t−1 ))−Q θ,i (X i t−1 ,x i t )]. 
Therefore, we can writeP τ θ (X i t =x i t ) as follows: P τ θ (X i t =x i t ) =E τ θ [Q θ,i (X i t−1 ,x i t )] +E τ θ [ X ˜ θ6=θ π i t−1 ( ˜ θ) q θ,i (x i t |X i t−1 ,μ i ˜ θ (X i t−1 ))−Q θ,i (X i t−1 ,x i t ) ] | {z } Δ i t (x i t ) = X x i t−1 Q θ,i (x i t−1 ,x i t )P τ θ (X i t−1 =x i t−1 ) + Δ i t (x i t ) (4.9) LetP i t be the|X|-dimensional row vector whose entries areP τ θ (X i t =x i t ) and Δ i t be the row vector with entries Δ i t (x i t ). Then, (4.9) gives us the recursion P i t = P i t−1 Q θ,i + Δ i t , which leads to the following: P i t =P i 0 Q t θ,i + t X s=1 Δ i s Q t−s θ,i , (4.10) 85 whereP i 0 is the row vector with 1 at (x i 0 ) th component and 0 elsewhere. LetV = [ν(x 1 ,x 2 ) : x 1 ,x 2 ∈X ] denote the|X|×|X| matrix of rewards under the desirable stationary policies of the two agents corresponding to the parameterθ. Going back to the decomposition ofR 1 (T ) in (4.8), observe that we can write the second inner sum P x 1 t ,x 2 t ν(x 1 t ,x 2 t )P 1 t (x 1 t )P 2 t (x 2 t ) succintly in the matrix form as P 1 t V (P 2 t ) 0 . We can similarly write the first inner sum P x 1 t ,x 2 t ν(x 1 t ,x 2 t ) Q 2 i=1 Q t θ,i (x i 0 ,x i t ) asP 1 0 Q t θ,1 V (P 2 0 Q t θ,2 ) 0 by noting thatP i 0 Q t θ,i forms the vec- tor whose entries are Q t θ,i (x i 0 ,x i t ). Also, let S i t := P t s=1 Δ i s Q t−s θ,i = P t s=1 Δ i s (Q t−s θ,i −Q ∞ θ,i ) where the second equality follows because P x i t Δ i t (x i t ) = 0 andQ ∞ θ,i is a row constant matrix. Now, substituting (4.10) in (4.8), we get the following: R 1 (T ) = T−1 X t=0 P 1 0 Q t θ,1 V (P 2 0 Q t θ,2 ) 0 −P 1 t V (P 2 t ) 0 = T−1 X t=0 P 1 0 Q t θ,1 V (S 2 t ) 0 +S 1 t V (P 2 0 Q t θ,2 ) 0 +S 1 t V (S 2 t ) 0 ≤ T−1 X t=0 P 1 0 Q t θ,1 V (S 2 t ) 0 | {z } E 1 + T−1 X t=0 S 1 t V (P 2 0 Q t θ,2 ) 0 | {z } E 2 + T−1 X t=0 S 1 t V (S 2 t ) 0 | {z } E 3 (4.11) Next, we will bound each of the three terms in (4.11) seperately. For a matrix/vector A, let |A| be the matrix/vector whose entries are the absolute value of the corresponding entries of A. It is straightforward to observe that ABC 0 ≤|A||B||C| 0 for any row vector A,C and matrix B. Let 1 denote the row vector with each entry equal to 1. • Bounding E 1 /E 2 86 E 1 = T−1 X t=0 P 1 0 Q t θ,1 V (S 2 t ) 0 ≤ T−1 X t=0 |P 1 0 Q t θ,1 ||V||S 2 t | 0 ≤ T−1 X t=0 RP 1 0 Q t θ,1 1 0 1|S 2 t | 0 =R T−1 X t=0 1|S 2 t | 0 . (4.12) whereR = max x,u |r(x, u)|. The second inequality follows because each entry of|V| is bounded byR and the following equality holds because P 1 0 Q t θ,1 1 0 is 1 since P 1 0 Q t θ,1 is a probability vector. We will now bound the sum P T−1 t=0 1|S 2 t | 0 . First, note that, T−1 X t=0 1|S 2 t | 0 ≤ T−1 X t=0 1( t X s=1 |Δ 2 s ||Q t−s θ,2 −Q ∞ θ,2 |) 0 . (4.13) Now, for all x i ∈X we have, |Δ i s |(x i )≤|E τ θ h 1−π i s−1 (θ) i |≤a θ e −b θ (s−1) (4.14) which follows from Lemma 4.1 and the definition of Δ i s sinceq θ,i (x i t |X i t−1 ,μ i ˜ θ (X i t−1 ))− Q θ,i (X i t−1 ,x i t )≤ 1. We also have, |Q t−s θ,i −Q ∞ θ,i |(x i ,y i )≤||Q t−s θ,i −Q ∞ θ,i ||≤α θ β t−s θ (4.15) This follows from the convergence results of time homogenous ergodic Markov chains in (4.3) and defining the two constants α θ = max i α θ,i and β θ = max i β θ,i . Plugging 87 the bounds (4.14),(4.15) in (4.13) we get, T−1 X t=0 1|S 2 t | 0 ≤|X| T−1 X t=0 1 1 0 t X s=1 a θ e −b θ (s−1) α θ β t−s θ ≤|X| 2 a θ α θ e b θ ∞ X t=0 β t θ t X s=1 e −b θ s β −s θ ≤|X| 2 K θ ( ∞ X t=0 β t θ + ∞ X t=0 e −b θ t ) =|X| 2 K θ ( 1 1−β θ + 1 1−e −b θ ) (4.16) whereK θ = a θ α θ |β θ −e −b θ| . 
Using (4.16) along with (4.12) we get the following bound on E 1 , E 1 ≤R|X| 2 K θ ( 1 1−β θ + 1 1−e −b θ ). (4.17) It is easy to observe thatE 2 is also upper bounded by the same constant as in (4.17). • Bounding E 3 : The last term E 3 in (4.11) can be bounded as follows: E 3 = T−1 X t=0 S 1 t V (S 2 t ) 0 ≤ T−1 X t=0 |S 1 t ||V||S 2 t | 0 ≤R T−1 X t=0 |S 1 t |1 0 1|S 2 t | 0 ≤ T−1 X t=0 |S 1 t |1 0 ! T−1 X t=0 1|S 2 t | 0 ! ≤R|X| 4 K 2 θ 1 1−β θ + 1 1−e −b θ 2 where the last inequality can be obtained easily from the upper bound obtained on P T−1 t=0 1|S 2 t | 0 in (4.16). 88 Now, we bound the second regret term R 2 (T ). R 2 (T ) = E τ θ " T−1 X t=0 ν(X t )− T−1 X t=0 r(X t , U t ) # ≤ 2RE τ θ " T−1 X t=0 I(U 1 t 6=μ 1 θ (X 1 t )∪U 2 t 6=μ 2 θ (X 2 t )) # ≤ 2R T−1 X t=0 E τ θ h I[U 1 t 6=μ 1 θ (X 1 t )] i +E τ θ h I[U 2 t 6=μ 2 θ (X 2 t )] i ≤ 2R T−1 X t=0 E τ θ h 1−π 1 t (θ) i +E τ θ h 1−π 2 t (θ) i ≤ 4R T−1 X t=0 a θ e −b θ t ≤ 4Ra θ 1−e −b θ Since both R 1 (T ) and R 2 (T ) are bounded by constants, the result now follows. Discussion Recapping the proof, the regret was decomposed into two terms: R 1 (T ) and R 2 (T ). The first term R 1 (T ) connects the state trajectory Y i t (under the benchmark policy μ i θ ) and X i t (under Thompson sampling). It is the expectation of the difference of the function ν evaluated at two different stochastic processes Y i t and X i t . This depends on the difference between probability of reaching a state x i t under Thompson sampling and the benchmark policy μ i θ . This difference was bounded using the ergodicity of the Markov chains under the benchmark policy and the convergence of the belief using Lemma 4.1. The second term 89 R 2 (T ) depends upon the difference between the reward function computed on the same state trajectory X i t but with two different control policies (benchmark stationary policy and Thompson sampling). Since the state trajectories are the same, R 2 (T ) can be upper bounded by the number of times the agents take different actions under the two policies, which is again upper bounded by the number of times the sampled parameter is different from the true parameter. This can be bounded using the concentration of the belief at the true parameter in Lemma 4.1. 4.4 Problem 2: One step delayed sharing In this section, we consider the two-agent team learning problem with one-step delayed sharing information structure. The evolution of the state in this case has the following dynamics: X t+1 ∼q θ (·|X t , U t ). (4.18) Thus, the next state of each of the agent is affected by the current states and actions of both the agents. We assume that the agents share their information with a delay of one time unit. Hence, the information available at agenti at timet isH i t = (X i t , X 1:t−1 , U 1:t−1 ). We again restrict the policy space of each agent to the class of memoryless stationary policies. The agents are interested in finding a memoryless stationary control strategies which maximize the average reward defined as in (4.1). Letμ i θ denote the benchmark stationary memoryless policy for agenti when the true parameter isθ. Then under the policyμ i θ , the state process (X t ,t≥ 0) is a time homogenous Markov chain. The probability transition matrix of the 90 Markov chain is denoted by Q θ where Q θ (x, y) =q θ (y|x,μ 1 θ (x 1 ),μ 2 θ (x 2 )). We assume that the Markov chain is ergodic similar to in section 4.2, such that there exists a row constant stochastic matrix Q ∞ θ and constants α θ ,β θ for which, ||Q t θ −Q ∞ θ ||≤α θ β t θ . 
(4.19) Next, we analyze the decentralized Thompson sampling algorithm and its regret as defined in (4.4). 4.4.1 Algorithm LetC t denote the common information between the two agents at timet which is{X 1:t−1 , U 1:t−1 }. Here, the agents maintain a common belief π t (θ) = P(θ|C t ) over the underlying parame- ter. Note that this common belief only uses the common information and not the entire information available to the agents. We use the common belief instead of the private beliefs in this case because it allows us the possibility to extend this approach for more general information structures. The belief is updated over time as follows: π t+1 (θ) = q θ (X t |X t−1 , U t−1 )π t (θ) X ˜ θ∈Θ q ˜ θ (X t |X t−1 , U t−1 )π t ( ˜ θ) (4.20) The agents follow the Algorithm 1 as in section 4.3: At each time t, each agent i generates a sample θ i t from the common belief π t independently. The agents then take θ i t to be the 91 true parameter and apply control using the benchmark stationary policy for θ i t . We make the following assumption similar to section 4.3 to analyze the regret of Thompson sampling. Assumption 4.2. For each x, u,θ 1 ,θ 2 ∈ Θ with θ 1 6=θ 2 there exists an > 0 such that K(q θ 1 (x, u)|q θ 2 (x, u))≥ . (4.21) 4.4.2 Analysis Lemma 4.2. Suppose Assumption 4.2 holds true. Then, under Thompson sampling algo- rithm, there exists constants a θ ,b θ such that E τ θ [1−π t (θ)]≤a θ e −b θ (t−1) , (4.22) The proof of the above Lemma is identical to the proof of Lemma 4.1. We will next show that the regret achieved by Thompson sampling is again upper bounded by a constant independent of the horizon T . Theorem 4.2. The regret under the Thompson sampling algorithm is upper bounded by a constant which is independent of the horizon T . Proof. We sketch the outline of the proof here which is similar to the proof of Theorem 4.1. We can decompose the regret into components R 1 (T ) andR 2 (T ). The first termR 1 (T ) is, T−1 X t=0 X x 1 t ,x 2 t ν(x t ) P μ θ θ (Y t = x t )−P τ θ (X t = x t ) (4.23) 92 We can write P μ θ θ (Y t = x t ) = Q t θ (x 0 , x t ) since under the policy μ θ , Y t is Markov chain with transition matrix Q θ . Also, the second probability in (4.23) can be broken down as P τ θ (X t = x t ) =E τ θ [P τ θ (X t = x t |C t−1 , X t−1 )] where P τ θ (X t = x t |C t−1 , X t−1 ) = X ( ˜ θ 1 , ˜ θ 2 )∈Θ×Θ π t−1 ( ˜ θ 1 )π t−1 ( ˜ θ 2 )×P τ θ X t = x t θ 1 t−1 = ˜ θ 1 ,θ 2 t−1 = ˜ θ 2 ,C t−1 , X t−1 =Q θ (X t−1 , x t ) + X ( ˜ θ 1 , ˜ θ 2 )6=(θ,θ) π t−1 ( ˜ θ 1 )π t−1 ( ˜ θ 2 )× [q θ (x t |X t−1 ,μ 1 ˜ θ 1 (X 1 t−1 ),μ 2 ˜ θ 2 (X 2 t−1 ))−Q θ (X t−1 , x t )] (4.24) Then, we have, P τ θ (X t = x t ) = X x t−1 Q θ (x t−1 , x t )P τ θ (X t−1 = x t−1 ) + Δ t (x t ). (4.25) where Δ t (x t ) is the expectation of the second term on the right hand side of (4.24). Let P t be the|X| 2 dimensional row vector whose entries are P τ θ (X t = x t ). Then, we get the following recursive relationship between P t , P t =P 0 Q t θ + t X s=1 Δ s Q t−s θ . (4.26) 93 Define V = [ν(x) : x∈X×X ) to be the|X| 2 dimesnional column vector of rewards under the optimal policy. Then, we can get the following R 1 (T )≤ ∞ X t=0 t X s=1 |Δ s ||Q t−s θ −Q ∞ θ | ! |V|. (4.27) Using the definition of Δ s we obtain the following inequality, |Δ s (x s )|≤E τ θ [1−π 2 s−1 (θ)]≤ 2E τ θ [1−π s−1 (θ)]≤ 2a θ e −b θ (s−2) . (4.28) where the last inequality follows from Lemma 4.2. Also, using the convergence of the ergodic Markov chains we have|Q t−s θ −Q ∞ θ |(x, y)≤||Q t−s θ −Q ∞ θ ||≤ α θ β t−s θ . 
Then, following a similar analysis as in the proof of Theorem 4.1, we can upper bound R 1 (T ) by, R 1 (T )≤ 2R|X| 4 a θ α θ e b θ |β θ −e −b θ | 1 1−β θ + 1 1−e −b θ . (4.29) The second termR 2 (T ) =|E τ θ [ P T−1 t=0 ν(X t )− P T−1 t=0 r(X t , U t )]| can be upper bounded using Lemma 4.2 in a similar manner as done in Theorem 4.1 as: R 2 (T )≤ 4Ra θ e b θ 1−e −b θ . (4.30) The statement of the theorem now follows. 94 Remark 4.2. Note that the results derived in this section and for the model in Section 4.2 can be easily extended for a system with N > 2 agents under the same assumptions. 4.5 Conclusion We studied a two-agent team learning problem when the state transition kernels are parametrized by an unknown but fixed parameter taking values in a finite space. A decentralized Thomp- son sampling based algorithm is proposed for two different dynamics and information shar- ing models. The regret achieved by Thompson sampling is shown to be upper bounded by a constant independent of the time horizon for both the cases. The results obtained are limited by a key assumption on the state transition kernels (Assumption 4.1,4.2). It would be interesting to see if this approach can be generalized by relaxing this assumption. Another future direction is to see if the ideas presented can be extended to more general information sharing structure among the agents. 95 Chapter 5 Networked Estimation 5.1 Introduction Reliable real-time wireless networking is an essential requirement of modern cyber-physical and networked control systems [1, 80]. Due to their large scale, these systems are typically formed by multiple physically distributed subsystems that communicate over a wireless network of limited capacity. One way to model this communication constraint is to assume that, at any time instant, only one packet can be reliably transmitted over the network to its destination. This constraint forces the system designer to use strategies that allocate the shared communication resources among multiple communicating nodes. In addition to degrading the performance of the overall system, the fact that communication among the different agents in cyber-physical systems is imperfect often leads to team-decision problems with nonclassical information structures. Such problems are usually nonconvex, and are, in general, difficult to solve. 96 In this chapter, we consider a sequential remote estimation problem over a finite time horizon with non-collocated sensors and estimators. The system is comprised of multiple sensors, each of which has a stochastic process associated with it. Each sensor is paired with an estimator, which is interested in forming real-time estimates of its corresponding source process. The sensors communicate with their estimators via a shared communication network. Due to the limited capacity, at most one of the sensor’s observations can be transmitted at each time. To avoid collisions [81, 82], the communication is mediated by a scheduler, which observes the realization of each source. The scheduler decides at each time which of the observations (if any) gets transmitted over the communication network. In addition to the communication constraint, the framework also assumes that the scheduler operates under an energy constraint through a finite battery, which is capable of harvesting additional energy from the environment. 
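As a concrete, highly simplified illustration of this setup, the following sketch simulates one run of such a system with two scalar Gaussian sources, a battery of capacity B with Bernoulli energy harvesting, and an ad hoc "transmit the largest deviation if it exceeds a threshold" rule. The distributions, the threshold, and the cost weights are assumptions made purely for illustration; the formal model and the structure of the jointly optimal strategies are developed in the remainder of this chapter.

```python
import numpy as np

rng = np.random.default_rng(2)

def simulate(T=10, B=3, c=0.1, threshold=1.0):
    """One run of the remote estimation loop sketched above.

    Two i.i.d. zero-mean Gaussian sources, a battery of capacity B with unit
    Bernoulli harvesting, and an illustrative threshold scheduler.  At most
    one observation is transmitted per time step.
    """
    battery = B
    cost = 0.0
    for t in range(T):
        x = rng.standard_normal(2)               # source outputs X^1_t, X^2_t
        u = 0                                    # 0 means no transmission
        if battery > 0 and np.max(np.abs(x)) > threshold:
            u = int(np.argmax(np.abs(x))) + 1    # schedule the larger deviation
        x_hat = np.zeros(2)                      # estimators fall back to the prior mean
        if u != 0:
            x_hat[u - 1] = x[u - 1]              # the scheduled estimator sees its source exactly
            battery -= 1                         # each transmission costs one energy unit
        battery = min(battery + rng.integers(0, 2), B)   # harvest Z_t in {0, 1}, clip at capacity
        cost += np.sum((x - x_hat) ** 2) + c * (u != 0)
    return cost

print("accumulated cost over one run:", simulate())
```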
The designer's goal is to find scheduling and estimation strategies that jointly minimize an objective function consisting of a mean-squared estimation error criterion and a communication cost. This joint design problem is a team-decision problem with a nonclassical information structure, for which obtaining globally optimal solutions is, in general, a challenging task [83]. However, under certain assumptions on the underlying probabilistic model, and despite the difficulties imposed by the lack of convexity, this problem admits an explicit globally optimal solution, whose derivation is the centerpiece of this chapter. This problem is also motivated by applications such as the Internet of Things (IoT), where there is a need to coordinate access to limited communication resources by multiple heterogeneous devices in real time.

Figure 5.1: Schematic diagram of the remote sensing system with two sensor-estimator pairs and an energy-harvesting scheduler.
sha1_base64="/k/qisIgDVEWxxq5lpnbkvSnxos=">AAAB83icbVBNSwMxFHzrZ61fVY9egkXwVHZFsN4KInis4NpCu5Zsmm1Dk+yaZAtl6e/w4kHFq3/Gm//GbLsHbR0IDDPv8SYTJpxp47rfzsrq2vrGZmmrvL2zu7dfOTh80HGqCPVJzGPVDrGmnEnqG2Y4bSeKYhFy2gpH17nfGlOlWSzvzSShgcADySJGsLFS0BXYDAnm2c300etVqm7NnQEtE68gVSjQ7FW+uv2YpIJKQzjWuuO5iQkyrAwjnE7L3VTTBJMRHtCOpRILqoNsFnqKTq3SR1Gs7JMGzdTfGxkWWk9EaCfzkHrRy8X/vE5qonqQMZmkhkoyPxSlHJkY5Q2gPlOUGD6xBBPFbFZEhlhhYmxPZVuCt/jlZeKf165q7t1FtVEv2ijBMZzAGXhwCQ24hSb4QOAJnuEV3pyx8+K8Ox/z0RWn2DmCP3A+fwAI35HH</latexit> <latexit sha1_base64="/k/qisIgDVEWxxq5lpnbkvSnxos=">AAAB83icbVBNSwMxFHzrZ61fVY9egkXwVHZFsN4KInis4NpCu5Zsmm1Dk+yaZAtl6e/w4kHFq3/Gm//GbLsHbR0IDDPv8SYTJpxp47rfzsrq2vrGZmmrvL2zu7dfOTh80HGqCPVJzGPVDrGmnEnqG2Y4bSeKYhFy2gpH17nfGlOlWSzvzSShgcADySJGsLFS0BXYDAnm2c300etVqm7NnQEtE68gVSjQ7FW+uv2YpIJKQzjWuuO5iQkyrAwjnE7L3VTTBJMRHtCOpRILqoNsFnqKTq3SR1Gs7JMGzdTfGxkWWk9EaCfzkHrRy8X/vE5qonqQMZmkhkoyPxSlHJkY5Q2gPlOUGD6xBBPFbFZEhlhhYmxPZVuCt/jlZeKf165q7t1FtVEv2ijBMZzAGXhwCQ24hSb4QOAJnuEV3pyx8+K8Ox/z0RWn2DmCP3A+fwAI35HH</latexit> <latexit sha1_base64="/k/qisIgDVEWxxq5lpnbkvSnxos=">AAAB83icbVBNSwMxFHzrZ61fVY9egkXwVHZFsN4KInis4NpCu5Zsmm1Dk+yaZAtl6e/w4kHFq3/Gm//GbLsHbR0IDDPv8SYTJpxp47rfzsrq2vrGZmmrvL2zu7dfOTh80HGqCPVJzGPVDrGmnEnqG2Y4bSeKYhFy2gpH17nfGlOlWSzvzSShgcADySJGsLFS0BXYDAnm2c300etVqm7NnQEtE68gVSjQ7FW+uv2YpIJKQzjWuuO5iQkyrAwjnE7L3VTTBJMRHtCOpRILqoNsFnqKTq3SR1Gs7JMGzdTfGxkWWk9EaCfzkHrRy8X/vE5qonqQMZmkhkoyPxSlHJkY5Q2gPlOUGD6xBBPFbFZEhlhhYmxPZVuCt/jlZeKf165q7t1FtVEv2ijBMZzAGXhwCQ24hSb4QOAJnuEV3pyx8+K8Ox/z0RWn2DmCP3A+fwAI35HH</latexit> E 2 <latexit sha1_base64="8ztMtkBxPcvhd+nkzutmBPD6ntM=">AAAB9HicbVDLSgMxFL3js9ZX1aWbYBFclZkq2GVBBJcV7APasWTSTBuaScYkUyhDv8ONC0Xc+jHu/Bsz7Sy09UDgcM693JMTxJxp47rfztr6xubWdmGnuLu3f3BYOjpuaZkoQptEcqk6AdaUM0GbhhlOO7GiOAo4bQfjm8xvT6jSTIoHM42pH+GhYCEj2FjJ70XYjAjm6e3ssdovld2KOwdaJV5OypCj0S999QaSJBEVhnCsdddzY+OnWBlGOJ0Ve4mmMSZjPKRdSwWOqPbTeegZOrfKAIVS2ScMmqu/N1IcaT2NAjuZhdTLXib+53UTE9b8lIk4MVSQxaEw4chIlDWABkxRYvjUEkwUs1kRGWGFibE9FW0J3vKXV0mrWvEuK9X7q3K9ltdRgFM4gwvw4BrqcAcNaAKBJ3iGV3hzJs6L8+58LEbXnHznBP7A+fwBojOR+Q==</latexit> X 1 t <latexit sha1_base64="W42FtkaGxQv3mhFtPZ0y7j+U4Y4=">AAAB7HicbVBNS8NAEJ3Ur1q/qh69LBbBU0mqYI8FLx4rmLbQxrLZbtqlm03YnQil9Dd48aCIV3+QN/+N2zYHbX0w8Hhvhpl5YSqFQdf9dgobm1vbO8Xd0t7+weFR+fikZZJMM+6zRCa6E1LDpVDcR4GSd1LNaRxK3g7Ht3O//cS1EYl6wEnKg5gOlYgEo2glv/Po9bFfrrhVdwGyTrycVCBHs1/+6g0SlsVcIZPUmK7nphhMqUbBJJ+VepnhKWVjOuRdSxWNuQmmi2Nn5MIqAxIl2pZCslB/T0xpbMwkDm1nTHFkVr25+J/XzTCqB1Oh0gy5YstFUSYJJmT+ORkIzRnKiSWUaWFvJWxENWVo8ynZELzVl9dJq1b1rqq1++tKo57HUYQzOIdL8OAGGnAHTfCBgYBneIU3RzkvzrvzsWwtOPnMKfyB8/kDY5GOYA==</latexit> X 2 t <latexit sha1_base64="n+j1ZQYzld4uD+RVW4vF0h/lVCk=">AAAB7HicbVBNS8NAEJ3Ur1q/qh69LBbBU0mqYI8FLx4rmLbQxrLZbtqlm03YnQil9Dd48aCIV3+QN/+N2zYHbX0w8Hhvhpl5YSqFQdf9dgobm1vbO8Xd0t7+weFR+fikZZJMM+6zRCa6E1LDpVDcR4GSd1LNaRxK3g7Ht3O//cS1EYl6wEnKg5gOlYgEo2glv/NY62O/XHGr7gJknXg5qUCOZr/81RskLIu5QiapMV3PTTGYUo2CST4r9TLDU8rGdMi7lioacxNMF8fOyIVVBiRKtC2FZKH+npjS2JhJHNrOmOLIrHpz8T+vm2FUD6ZCpRlyxZaLokwSTMj8czIQmjOUE0so08LeStiIasrQ5lOyIXirL6+TVq3qXVVr99eVRj2PowhncA6X4MENNOAOmuADAwHP8ApvjnJenHfnY9lacPKZU/gD5/MHZReOYQ==</latexit> S t <latexit 
sha1_base64="M9VRWM5dM74qqPlq8CeLpvr3ipk=">AAAB6nicbVBNS8NAEJ34WetX1aOXxSJ4KkkV7LHgxWOl9gPaUDbbTbt0swm7E6GE/gQvHhTx6i/y5r9x2+agrQ8GHu/NMDMvSKQw6Lrfzsbm1vbObmGvuH9weHRcOjltmzjVjLdYLGPdDajhUijeQoGSdxPNaRRI3gkmd3O/88S1EbF6xGnC/YiOlAgFo2ilZnOAg1LZrbgLkHXi5aQMORqD0ld/GLM04gqZpMb0PDdBP6MaBZN8VuynhieUTeiI9yxVNOLGzxanzsilVYYkjLUthWSh/p7IaGTMNApsZ0RxbFa9ufif10sxrPmZUEmKXLHlojCVBGMy/5sMheYM5dQSyrSwtxI2ppoytOkUbQje6svrpF2teNeV6sNNuV7L4yjAOVzAFXhwC3W4hwa0gMEInuEV3hzpvDjvzseydcPJZ87gD5zPHzcqjbg=</latexit> Y 1 t <latexit sha1_base64="OSWUDKt+nSHjUFiRihUsS3ZWY2w=">AAAB7HicbVBNS8NAEJ3Ur1q/qh69LBbBU0mqYI8FLx4rmFZpY9lsN+3SzSbsToRS+hu8eFDEqz/Im//GbZuDtj4YeLw3w8y8MJXCoOt+O4W19Y3NreJ2aWd3b/+gfHjUMkmmGfdZIhN9H1LDpVDcR4GS36ea0ziUvB2Ormd++4lrIxJ1h+OUBzEdKBEJRtFK/sOj18NeueJW3TnIKvFyUoEczV75q9tPWBZzhUxSYzqem2IwoRoFk3xa6maGp5SN6IB3LFU05iaYzI+dkjOr9EmUaFsKyVz9PTGhsTHjOLSdMcWhWfZm4n9eJ8OoHkyESjPkii0WRZkkmJDZ56QvNGcox5ZQpoW9lbAh1ZShzadkQ/CWX14lrVrVu6jWbi8rjXoeRxFO4BTOwYMraMANNMEHBgKe4RXeHOW8OO/Ox6K14OQzx/AHzucPZRmOYQ==</latexit> Y 2 t <latexit sha1_base64="fij9saLZSJQGusgR+w8aMRjMGrA=">AAAB7HicbVBNS8NAEJ3Ur1q/qh69LBbBU0mqYI8FLx4rmFZpY9lsN+3SzSbsToRS+hu8eFDEqz/Im//GbZuDtj4YeLw3w8y8MJXCoOt+O4W19Y3NreJ2aWd3b/+gfHjUMkmmGfdZIhN9H1LDpVDcR4GS36ea0ziUvB2Ormd++4lrIxJ1h+OUBzEdKBEJRtFK/sNjrYe9csWtunOQVeLlpAI5mr3yV7efsCzmCpmkxnQ8N8VgQjUKJvm01M0MTykb0QHvWKpozE0wmR87JWdW6ZMo0bYUkrn6e2JCY2PGcWg7Y4pDs+zNxP+8ToZRPZgIlWbIFVssijJJMCGzz0lfaM5Qji2hTAt7K2FDqilDm0/JhuAtv7xKWrWqd1Gt3V5WGvU8jiKcwCmcgwdX0IAbaIIPDAQ8wyu8Ocp5cd6dj0VrwclnjuEPnM8fZp+OYg==</latexit> ˆ X 1 t <latexit sha1_base64="DzngWj0cwmNiLcpFmsolmTSJRsU=">AAAB8nicbVBNS8NAEJ3Ur1q/qh69LBbBU0mqYI8FLx4r2A9IY9lst+3SzSbsToQS+jO8eFDEq7/Gm//GbZuDtj4YeLw3w8y8MJHCoOt+O4WNza3tneJuaW//4PCofHzSNnGqGW+xWMa6G1LDpVC8hQIl7yaa0yiUvBNObud+54lrI2L1gNOEBxEdKTEUjKKV/N6YYtadPXp97JcrbtVdgKwTLycVyNHsl796g5ilEVfIJDXG99wEg4xqFEzyWamXGp5QNqEj7luqaMRNkC1OnpELqwzIMNa2FJKF+nsio5Ex0yi0nRHFsVn15uJ/np/isB5kQiUpcsWWi4apJBiT+f9kIDRnKKeWUKaFvZWwMdWUoU2pZEPwVl9eJ+1a1buq1u6vK416HkcRzuAcLsGDG2jAHTShBQxieIZXeHPQeXHenY9la8HJZ07hD5zPHzfokS0=</latexit> ˆ X 2 t <latexit sha1_base64="QBtrDPoLhbK225I6fwEO5Z+0gJI=">AAAB8nicbVBNS8NAEJ3Ur1q/qh69LBbBU0mqYI8FLx4r2A9IY9lst+3SzSbsToQS+jO8eFDEq7/Gm//GbZuDtj4YeLw3w8y8MJHCoOt+O4WNza3tneJuaW//4PCofHzSNnGqGW+xWMa6G1LDpVC8hQIl7yaa0yiUvBNObud+54lrI2L1gNOEBxEdKTEUjKKV/N6YYtadPdb62C9X3Kq7AFknXk4qkKPZL3/1BjFLI66QSWqM77kJBhnVKJjks1IvNTyhbEJH3LdU0YibIFucPCMXVhmQYaxtKSQL9fdERiNjplFoOyOKY7PqzcX/PD/FYT3IhEpS5IotFw1TSTAm8//JQGjOUE4toUwLeythY6opQ5tSyYbgrb68Ttq1qndVrd1fVxr1PI4inME5XIIHN9CAO2hCCxjE8Ayv8Oag8+K8Ox/L1oKTz5zCHzifPzlukS4=</latexit> Y 1 t 1 <latexit sha1_base64="/aKDFJvyujwse1FsQl1IFhzWBhg=">AAAB8HicbVDLSgNBEOz1GeMr6tHLYBC8GHajYI4BLx4jmBfJGmYnk2TIzO4y0yuEJV/hxYMiXv0cb/6Nk2QPmljQUFR1090VxFIYdN1vZ219Y3NrO7eT393bPzgsHB03TJRoxusskpFuBdRwKUJeR4GSt2LNqQokbwbj25nffOLaiCh8wEnMfUWHoRgIRtFK7faj10vx0pv2CkW35M5BVomXkSJkqPUKX91+xBLFQ2SSGtPx3Bj9lGoUTPJpvpsYHlM2pkPesTSkihs/nR88JedW6ZNBpG2FSObq74mUKmMmKrCdiuLILHsz8T+vk+Cg4qcijBPkIVssGiSSYERm35O+0JyhnFhCmRb2VsJGVFOGNqO8DcFbfnmVNMol76pUvr8uVitZHDk4hTO4AA9uoAp3UIM6MFDwDK/w5mjnxXl3Phata042cwJ/4Hz+AAhJj98=</latexit> Y 2 t 1 <latexit 
sha1_base64="FkgoacEMEuiAqaeRnUgPkdXpScM=">AAAB8HicbVDLSgNBEOz1GeMr6tHLYhC8GHajYI4BLx4jmBdJDLOT2WTIzOwy0yuEJV/hxYMiXv0cb/6Nk2QPmljQUFR1090VxIIb9LxvZ219Y3NrO7eT393bPzgsHB03TJRoyuo0EpFuBcQwwRWrI0fBWrFmRAaCNYPx7cxvPjFteKQecBKzniRDxUNOCVqp3X4s91O89Kf9QtEreXO4q8TPSBEy1PqFr+4goolkCqkgxnR8L8ZeSjRyKtg0300MiwkdkyHrWKqIZKaXzg+euudWGbhhpG0pdOfq74mUSGMmMrCdkuDILHsz8T+vk2BY6aVcxQkyRReLwkS4GLmz790B14yimFhCqOb2VpeOiCYUbUZ5G4K//PIqaZRL/lWpfH9drFayOHJwCmdwAT7cQBXuoAZ1oCDhGV7hzdHOi/PufCxa15xs5gT+wPn8AQnTj+A=</latexit> Z t <latexit sha1_base64="xOh8nj8Dc6NvIJvRgPBG4nLD2Qc=">AAAB6nicbVBNS8NAEJ34WetX1aOXxSJ4KkkV7LHgxWNF+4FtKJvtpl262YTdiVBCf4IXD4p49Rd589+4bXPQ1gcDj/dmmJkXJFIYdN1vZ219Y3Nru7BT3N3bPzgsHR23TJxqxpsslrHuBNRwKRRvokDJO4nmNAokbwfjm5nffuLaiFg94CThfkSHSoSCUbTS/WMf+6WyW3HnIKvEy0kZcjT6pa/eIGZpxBUySY3pem6CfkY1Cib5tNhLDU8oG9Mh71qqaMSNn81PnZJzqwxIGGtbCslc/T2R0ciYSRTYzojiyCx7M/E/r5tiWPMzoZIUuWKLRWEqCcZk9jcZCM0ZyokllGlhbyVsRDVlaNMp2hC85ZdXSata8S4r1burcr2Wx1GAUziDC/DgGupwCw1oAoMhPMMrvDnSeXHenY9F65qTz5zAHzifP0HUjb8=</latexit> Sensors <latexit sha1_base64="EfdB9Z0AMafAxGsXjnSvDZ1Q9MU=">AAAB7XicbVBNT8JAEJ3iF+IX6tFLIzHxRFou4o3Ei0eMVkigIdtlChu2u83u1oQ0/AgvHtR49f9489+4QA8KvmSSl/dmMjMvSjnTxvO+ndLG5tb2Tnm3srd/cHhUPT551DJTFAMquVTdiGjkTGBgmOHYTRWSJOLYiSY3c7/zhEozKR7MNMUwISPBYkaJsVLnHoWWSg+qNa/uLeCuE78gNSjQHlS/+kNJswSFoZxo3fO91IQ5UYZRjrNKP9OYEjohI+xZKkiCOswX587cC6sM3VgqW8K4C/X3RE4SradJZDsTYsZ61ZuL/3m9zMTNMGcizQwKulwUZ9w10p3/7g6ZQmr41BJCFbO3unRMFKHGJlSxIfirL6+ToFG/rnt3jVqrWaRRhjM4h0vw4QpacAttCIDCBJ7hFd6c1Hlx3p2PZWvJKWZO4Q+czx/tgo90</latexit> <latexit sha1_base64="EfdB9Z0AMafAxGsXjnSvDZ1Q9MU=">AAAB7XicbVBNT8JAEJ3iF+IX6tFLIzHxRFou4o3Ei0eMVkigIdtlChu2u83u1oQ0/AgvHtR49f9489+4QA8KvmSSl/dmMjMvSjnTxvO+ndLG5tb2Tnm3srd/cHhUPT551DJTFAMquVTdiGjkTGBgmOHYTRWSJOLYiSY3c7/zhEozKR7MNMUwISPBYkaJsVLnHoWWSg+qNa/uLeCuE78gNSjQHlS/+kNJswSFoZxo3fO91IQ5UYZRjrNKP9OYEjohI+xZKkiCOswX587cC6sM3VgqW8K4C/X3RE4SradJZDsTYsZ61ZuL/3m9zMTNMGcizQwKulwUZ9w10p3/7g6ZQmr41BJCFbO3unRMFKHGJlSxIfirL6+ToFG/rnt3jVqrWaRRhjM4h0vw4QpacAttCIDCBJ7hFd6c1Hlx3p2PZWvJKWZO4Q+czx/tgo90</latexit> <latexit sha1_base64="EfdB9Z0AMafAxGsXjnSvDZ1Q9MU=">AAAB7XicbVBNT8JAEJ3iF+IX6tFLIzHxRFou4o3Ei0eMVkigIdtlChu2u83u1oQ0/AgvHtR49f9489+4QA8KvmSSl/dmMjMvSjnTxvO+ndLG5tb2Tnm3srd/cHhUPT551DJTFAMquVTdiGjkTGBgmOHYTRWSJOLYiSY3c7/zhEozKR7MNMUwISPBYkaJsVLnHoWWSg+qNa/uLeCuE78gNSjQHlS/+kNJswSFoZxo3fO91IQ5UYZRjrNKP9OYEjohI+xZKkiCOswX587cC6sM3VgqW8K4C/X3RE4SradJZDsTYsZ61ZuL/3m9zMTNMGcizQwKulwUZ9w10p3/7g6ZQmr41BJCFbO3unRMFKHGJlSxIfirL6+ToFG/rnt3jVqrWaRRhjM4h0vw4QpacAttCIDCBJ7hFd6c1Hlx3p2PZWvJKWZO4Q+czx/tgo90</latexit> Scheduler <latexit sha1_base64="gE3hinBMVWKQEYjTmMu5yotPvOM=">AAAB73icbVBNT8JAEJ36ifiFevSykZh4Ii0X8UbixSNGKxhoyHY7hQ3bbbO7NSGEX+HFgxqv/h1v/hsX6EHBl0zy8t5MZuaFmeDauO63s7a+sbm1Xdop7+7tHxxWjo4fdJorhj5LRao6IdUouETfcCOwkymkSSiwHY6uZ377CZXmqbw34wyDhA4kjzmjxkqPd2yIUS5Q9StVt+bOQVaJV5AqFGj1K1+9KGV5gtIwQbXuem5mgglVhjOB03Iv15hRNqID7FoqaYI6mMwPnpJzq0QkTpUtachc/T0xoYnW4yS0nQk1Q73szcT/vG5u4kYw4TLLDUq2WBTngpiUzL4nEVfIjBhbQpni9lbChlRRZmxGZRuCt/zyKvHrtauae1uvNhtFGiU4hTO4AA8uoQk30AIfGCTwDK/w5ijnxXl3Phata04xcwJ/4Hz+AEoskDo=</latexit> <latexit 
sha1_base64="gE3hinBMVWKQEYjTmMu5yotPvOM=">AAAB73icbVBNT8JAEJ36ifiFevSykZh4Ii0X8UbixSNGKxhoyHY7hQ3bbbO7NSGEX+HFgxqv/h1v/hsX6EHBl0zy8t5MZuaFmeDauO63s7a+sbm1Xdop7+7tHxxWjo4fdJorhj5LRao6IdUouETfcCOwkymkSSiwHY6uZ377CZXmqbw34wyDhA4kjzmjxkqPd2yIUS5Q9StVt+bOQVaJV5AqFGj1K1+9KGV5gtIwQbXuem5mgglVhjOB03Iv15hRNqID7FoqaYI6mMwPnpJzq0QkTpUtachc/T0xoYnW4yS0nQk1Q73szcT/vG5u4kYw4TLLDUq2WBTngpiUzL4nEVfIjBhbQpni9lbChlRRZmxGZRuCt/zyKvHrtauae1uvNhtFGiU4hTO4AA8uoQk30AIfGCTwDK/w5ijnxXl3Phata04xcwJ/4Hz+AEoskDo=</latexit> <latexit sha1_base64="gE3hinBMVWKQEYjTmMu5yotPvOM=">AAAB73icbVBNT8JAEJ36ifiFevSykZh4Ii0X8UbixSNGKxhoyHY7hQ3bbbO7NSGEX+HFgxqv/h1v/hsX6EHBl0zy8t5MZuaFmeDauO63s7a+sbm1Xdop7+7tHxxWjo4fdJorhj5LRao6IdUouETfcCOwkymkSSiwHY6uZ377CZXmqbw34wyDhA4kjzmjxkqPd2yIUS5Q9StVt+bOQVaJV5AqFGj1K1+9KGV5gtIwQbXuem5mgglVhjOB03Iv15hRNqID7FoqaYI6mMwPnpJzq0QkTpUtachc/T0xoYnW4yS0nQk1Q73szcT/vG5u4kYw4TLLDUq2WBTngpiUzL4nEVfIjBhbQpni9lbChlRRZmxGZRuCt/zyKvHrtauae1uvNhtFGiU4hTO4AA8uoQk30AIfGCTwDK/w5ijnxXl3Phata04xcwJ/4Hz+AEoskDo=</latexit> Estimators <latexit sha1_base64="+iaUxylKaQpvfpQ7PJnIEpSpRyc=">AAAB8HicbVBNS8NAFHzxs9avqkcvi0XwVJJerLeCCB4rGFtsQ9lsN+3SzSbsvggl9F948aDi1Z/jzX/jts1BWwcWhpn32HkTplIYdN1vZ219Y3Nru7RT3t3bPzisHB0/mCTTjPsskYnuhNRwKRT3UaDknVRzGoeSt8Px9cxvP3FtRKLucZLyIKZDJSLBKFrp8cagiCkm2vQrVbfmzkFWiVeQKhRo9StfvUHCspgrZJIa0/XcFIOcahRM8mm5lxmeUjamQ961VNGYmyCfJ56Sc6sMSJRo+xSSufp7I6exMZM4tJM23sgsezPxP6+bYdQIcqHSDLlii4+iTBJMyOx8MhCaM5QTSyjTwmYlbEQ1ZWhLKtsSvOWTV4lfr13V3Lt6tdko2ijBKZzBBXhwCU24hRb4wEDBM7zCm2OcF+fd+ViMrjnFzgn8gfP5A0bokNA=</latexit> <latexit sha1_base64="+iaUxylKaQpvfpQ7PJnIEpSpRyc=">AAAB8HicbVBNS8NAFHzxs9avqkcvi0XwVJJerLeCCB4rGFtsQ9lsN+3SzSbsvggl9F948aDi1Z/jzX/jts1BWwcWhpn32HkTplIYdN1vZ219Y3Nru7RT3t3bPzisHB0/mCTTjPsskYnuhNRwKRT3UaDknVRzGoeSt8Px9cxvP3FtRKLucZLyIKZDJSLBKFrp8cagiCkm2vQrVbfmzkFWiVeQKhRo9StfvUHCspgrZJIa0/XcFIOcahRM8mm5lxmeUjamQ961VNGYmyCfJ56Sc6sMSJRo+xSSufp7I6exMZM4tJM23sgsezPxP6+bYdQIcqHSDLlii4+iTBJMyOx8MhCaM5QTSyjTwmYlbEQ1ZWhLKtsSvOWTV4lfr13V3Lt6tdko2ijBKZzBBXhwCU24hRb4wEDBM7zCm2OcF+fd+ViMrjnFzgn8gfP5A0bokNA=</latexit> <latexit sha1_base64="+iaUxylKaQpvfpQ7PJnIEpSpRyc=">AAAB8HicbVBNS8NAFHzxs9avqkcvi0XwVJJerLeCCB4rGFtsQ9lsN+3SzSbsvggl9F948aDi1Z/jzX/jts1BWwcWhpn32HkTplIYdN1vZ219Y3Nru7RT3t3bPzisHB0/mCTTjPsskYnuhNRwKRT3UaDknVRzGoeSt8Px9cxvP3FtRKLucZLyIKZDJSLBKFrp8cagiCkm2vQrVbfmzkFWiVeQKhRo9StfvUHCspgrZJIa0/XcFIOcahRM8mm5lxmeUjamQ961VNGYmyCfJ56Sc6sMSJRo+xSSufp7I6exMZM4tJM23sgsezPxP6+bYdQIcqHSDLlii4+iTBJMyOx8MhCaM5QTSyjTwmYlbEQ1ZWhLKtsSvOWTV4lfr13V3Lt6tdko2ijBKZzBBXhwCU24hRb4wEDBM7zCm2OcF+fd+ViMrjnFzgn8gfP5A0bokNA=</latexit> Battery <latexit sha1_base64="wRA9+951f+7fjTZlngHKgdlDBkg=">AAAB7XicbVBNT8JAEJ3iF+IX6tHLRmLiibRcxBvRi0dMrJBAQ7bLFDZst83u1oQ0/AgvHtR49f9489+4QA8KvmSSl/dmMjMvTAXXxnW/ndLG5tb2Tnm3srd/cHhUPT551EmmGPosEYnqhlSj4BJ9w43AbqqQxqHATji5nfudJ1SaJ/LBTFMMYjqSPOKMGit1bqgxqKaDas2tuwuQdeIVpAYF2oPqV3+YsCxGaZigWvc8NzVBTpXhTOCs0s80ppRN6Ah7lkoaow7yxbkzcmGVIYkSZUsaslB/T+Q01noah7YzpmasV725+J/Xy0zUDHIu08ygZMtFUSaIScj8dzLkCpkRU0soU9zeStiYKspsBrpiQ/BWX14nfqN+XXfvG7VWs0ijDGdwDpfgwRW04A7a4AODCTzDK7w5qfPivDsfy9aSU8ycwh84nz/R2Y9i</latexit> <latexit 
sha1_base64="wRA9+951f+7fjTZlngHKgdlDBkg=">AAAB7XicbVBNT8JAEJ3iF+IX6tHLRmLiibRcxBvRi0dMrJBAQ7bLFDZst83u1oQ0/AgvHtR49f9489+4QA8KvmSSl/dmMjMvTAXXxnW/ndLG5tb2Tnm3srd/cHhUPT551EmmGPosEYnqhlSj4BJ9w43AbqqQxqHATji5nfudJ1SaJ/LBTFMMYjqSPOKMGit1bqgxqKaDas2tuwuQdeIVpAYF2oPqV3+YsCxGaZigWvc8NzVBTpXhTOCs0s80ppRN6Ah7lkoaow7yxbkzcmGVIYkSZUsaslB/T+Q01noah7YzpmasV725+J/Xy0zUDHIu08ygZMtFUSaIScj8dzLkCpkRU0soU9zeStiYKspsBrpiQ/BWX14nfqN+XXfvG7VWs0ijDGdwDpfgwRW04A7a4AODCTzDK7w5qfPivDsfy9aSU8ycwh84nz/R2Y9i</latexit> <latexit sha1_base64="wRA9+951f+7fjTZlngHKgdlDBkg=">AAAB7XicbVBNT8JAEJ3iF+IX6tHLRmLiibRcxBvRi0dMrJBAQ7bLFDZst83u1oQ0/AgvHtR49f9489+4QA8KvmSSl/dmMjMvTAXXxnW/ndLG5tb2Tnm3srd/cHhUPT551EmmGPosEYnqhlSj4BJ9w43AbqqQxqHATji5nfudJ1SaJ/LBTFMMYjqSPOKMGit1bqgxqKaDas2tuwuQdeIVpAYF2oPqV3+YsCxGaZigWvc8NzVBTpXhTOCs0s80ppRN6Ah7lkoaow7yxbkzcmGVIYkSZUsaslB/T+Q01noah7YzpmasV725+J/Xy0zUDHIu08ygZMtFUSaIScj8dzLkCpkRU0soU9zeStiYKspsBrpiQ/BWX14nfqN+XXfvG7VWs0ijDGdwDpfgwRW04A7a4AODCTzDK7w5qfPivDsfy9aSU8ycwh84nz/R2Y9i</latexit> Figure 5.1: Schematic diagram for the remote sensing system two sensor-estimator pairs with an energy harvesting scheduler. is expected to be able to support a massive number of users for which the traditional scheduling techniques based on random access, collision resolution, and retransmission are not feasibly implementable. Therefore, new scheduling schemes where decisions are driven by data such as the one proposed herein are becoming increasingly more relevant. This framework is also applicable to Wireless Body Area Networks, which are systems where multiple biometric sensors deployed on humans communicate with remote sensing stations over a wireless network [84–86]. A mobile phone is used as a hub to coordinate the access of the network among multiple sensors. The phone acts as a scheduler by collecting data from different biometric sensors, and chooses in real-time which one of the measurements is transmitted over the network. 5.1.1 Related literature, connections with prior work and contributions Over the last few years, the problem of scheduling transmissions over limited capacity net- works shared by multiple estimators/control loops has received a lot of attention [87–89] and references therein. To the best of our knowledge, the works of Shi and Zhang [90] and 98 Xia et al. [91] were among the pioneers in characterizing the trade-offs between communica- tion frequency and the estimation error covariance for event-triggered scheduling schemes. Molin et al. [92] proposed a dynamic priority scheme for scheduling real-time data over a shared network for state estimation using the notion of Value of Information. Recently, the work of Knorn and Quevedo [93] and Knorn et al. [94] incorporated the features of energy-harvesting, energy-sharing and energy-leaking sensor batteries in the computation of optimal transmission scheduling schemes. Guo et al. [95] addressed the critical issue of security and corresponding robustness concerning cyber attacks in the remote estimation of multi-systems scheduled over a shared collision channel. There is a vast literature on scheduling in point-to-point communication between a single sensor and estimator. The work of Imer and Basar [96], and subsequently Lipsa and Martins [97], Nayyar et al. [20], and Wu et al. [98] were among the first to address the issues related to the joint design of scheduling and estimation strategies. Since then, critical new features have been incorporated into the base model. Leong et al. 
[99] characterized structural results of the optimal transmission scheduling function, displaying a threshold in the estimation error covariance and the battery’s energy level. Wu et al. [100] and Leong et al. [101] studied the issue of learning the optimal scheduling strategy when the probability of packet-drop by the channel is unknown. The works of Leong et al. [102] and Lu et al. [103] studied the optimal design of a threshold strategy for remote estimation in the presence of an eavesdropper under a secrecy constraint, also showing that the optimal scheduling strategy has a threshold structure. Our work relates and contributes to the existing literature in the following aspects. The 99 problem formulation considered herein can be seen as a generalization of the system studied in [96] to the case of multiple sensors with the addition of an energy harvesting scheduler. Unlike other results that make structural assumptions on the estimator (linearity or piece- wise linearity), our approach is to perform joint optimization without making any structural assumptions, which often leads to intractable optimization problems (see Section 5.2.1.1). Our results, however, make assumptions on the probabilistic model of the sources similar to the ones in [20, 97, 104]. Nonetheless, despite the simplicity of the system model, our results do not follow from trivial or any existing arguments. We first consider i.i.d. sources and an energy harvesting scheduler. Here, our approach is to first relax the problem by expanding the information sets at the estimators. We proceed by solving the relaxed problem using the common information approach [15]. We investigate the value functions of the dynamic program and completely characterize the jointly optimal scheduling and estimation strategies for the relaxed problem. We show that the globally optimal solution for the relaxed problem is independent of the additional information introduced in the expansion, and therefore it is also optimal for the original problem. Next, we consider the case when the state process of different subsystems are independent of each other but have Markovian dynamics (instead of the i.i.d assumption in the first part of the chapter). Also, for simplicity, in this part we assume that the scheduler does not have any energy constraints but only incurs a communication cost for each transmission. The main contributions of this work are: • I.I.D. Case 100 1. We establish the joint optimality of a pair of scheduling estimation strategies for a sequential problem formulation with i.i.d. sources and an energy-harvesting scheduler under symmetry and unimodality assumptions of the observations’ pdfs. 2. We provide a proof strategy that uses a combination of the expansion of infor- mation structures and the common information approach. 3. We illustrate our theoretical results with numerical examples. • Markov Case 1. We first characterize the optimal estimation strategy after restricting the schedul- ing strategies to the class of symmetric strategies. We then provide a dynamic program for obtaining the optimal scheduling strategy in the class of symmetric strategies. 2. We further investigate the dynamic program to gain structural insights about the optimal scheduling strategy 5.1.2 Notation We adopt the following notation: random variables and random vectors are represented using upper case letters, such as X. Realizations of random variables and random vectors are represented by the corresponding lower case letter, such as x. 
We use X a:b to denote the collection of random variables (X a ,X a+1 ,··· ,X b ). The probability density function (pdf) of a continuous random variable X, provided that it is well defined, is denoted by π. 101 Functions and functionals are denoted using calligraphic letters such asF. We useN (m,σ 2 ) to represent the Gaussian probability distribution of mean m and varianceσ 2 , respectively. The real line is denoted by R. The set of natural numbers is denoted by N. The set of nonnegative integers is denoted byZ ≥0 . The probability of an eventE is denoted by Pr(E); the expectation of a random variable Z is denoted by E[Z]. The indicator function of a statementS is defined as follows: I S def = 1 if S is true 0 otherwise. (5.1) We also adopt the following convention: • Consider the setW def ={1, 2,··· ,N} and a functionF :W→R are given. IfW is the subset of elements that maximizeF then arg max α∈W F(α) is defined as the smallest number inW. 5.2 Network Estimation: IID Case 5.2.1 Problem statement Consider a system with two sensor-estimator pairs and one energy harvesting scheduler. All the subsequent results hold for an arbitrary number of sensor-estimator pairs, a fact that will be formally stated in section 5.2.8.1. Therefore, the focus on two sensor-estimator pairs is without loss of generality. 102 The system operates sequentially over a finite time horizonT∈N. The role of the scheduler is to mediate the communication between the sensors and estimators such that, at any given time step, at most, one sensor-estimator pair is allowed to communicate. We proceed to define the stochastic processes observed at the sensors. Let X i t ∈ R n i denote the random vector observed at the i-th sensor, t∈{1,··· ,T}, i∈{1, 2}. Let n 1 +n 2 = n. We shall refer to X i t , i∈{1, 2}, as outputs of information sources at time t. Throughout the paper, we assume that the sources are independent and identically distributed in time. Moreover, the random variables X i t admit a pdf π i for all i∈{1, 2} and t∈{1,··· ,T}. We assume that the stochastic processes{X 1 t ,t≥ 1} and{X 2 t ,t≥ 1} are independent. The scheduler operates with a battery of finite capacity denoted byB∈N such thatB <T . Let the state of the battery, E t , be defined as the number of energy units available at time step t. At each time t, the scheduler makes a decision U t ∈{0, 1, 2}, where U t = 0 denotes that no transmissions are scheduled; U t = 1 denotes that the scheduler transmits X 1 t ; and U t = 2 denotes that the scheduler transmitsX 2 t . Each transmission depletes the battery by one energy unit and only no transmissions can be scheduled if the battery is empty, i.e., if E t = 0. Thus, the scheduling decision U t ∈U(E t ), where: U(E t ) def = {0, 1, 2} if E t > 0 {0} if E t = 0. (5.2) At time t, the scheduler harvests Z t units of energy from the environment. The random variable Z t is i.i.d. in time according to a probability mass function p Z (z),z∈Z ≥0 , and is independent of the information source processes. The state of the battery evolves according 103 to the following equation: E t+1 =F(E t ,U t ,Z t ), t∈{1,··· ,T− 1}, (5.3) where F(E t ,U t ,Z t ) def = min E t −I U t 6= 0 +Z t ,B , (5.4) and initial energy E 1 =B. We will assume that the communication between the scheduler and the estimators occurs over a so-called unicast network, where only the intended estimator receives the transmitted packet. 
Fori∈{1, 2}, the observation of the estimatorE i at timet is denoted byY i t , which is determined according to Y i t =h i (X i t ,U t ), where: h i (X i t ,U t ) def = X i t if U t =i ? if U t 6=i. (5.5) Remark 5.1. One way to think about the unicast network model is that there are inde- pendent point-to-point links between different sensors and estimator pairs. At each time instant, the scheduler chooses at most one of these links to be active, and the others remain idle. 104 Information sets and strategies Let X t def = (X 1 t ,X 2 t ) and Y t def = (Y 1 t ,Y 2 t ). The scheduler decides what to transmit based on its available information at time t, which isI S t def ={X 1:t ,E 1:t , Y 1:t−1 }. The decision variable U t is computed according to a function f t as follows: U t =f t (X 1:t ,E 1:t , Y 1:t−1 ). (5.6) We refer to the collection f def ={f 1 ,··· ,f T } as the scheduling strategy of the scheduler. Leti∈{1, 2}. The estimatorE i computes the state estimate based on the entire history of its observations,I E i t def ={Y i 1:t }, according to a function g i t as follows: ˆ X i t =g i t (Y i 1:t ). (5.7) We refer to the collection g i def ={g i 1 ,··· ,g i T } as the estimation strategy of estimatorE i . Remark 5.2. From now on, we assume that f t , g 1 t and g 2 t , t∈{1,··· ,T}, are measurable functions with respect to the appropriate sigma-algebras. Cost We consider a performance index that penalizes the mean squared estimation error and a communication cost for every transmission made by the scheduler. 105 The cost functional and optimization problem are defined as follows: J f, g 1 , g 2 def = T X t=1 E " X i∈{1,2} kX i t − ˆ X i t k 2 +cI(U t 6= 0) # . (5.8) Problem 5.1. For the model described in this section, given the statistics of the sensor’s observations, the statistics of the energy-harvesting process, the battery storage limit B, communication cost c, and the horizon T , find scheduling and estimation strategies f, g 1 and g 2 that jointly minimize the costJ (f, g 1 , g 2 ) in (5.8). 5.2.1.1 Signaling In problems of decentralized control and estimation with nonclassical information structures, the optimal solutions typically involve a form of implicit communication known as signaling. Signaling is the effect of conveying information through actions [105], and it is the reason why problems within this class are difficult to solve, e.g., [106]. In order to illustrate the fundamental difficulty imposed by signaling, consider the instance of problem 5.1 with two zero-mean independent scalar sources,c = 0 andT = 1. Here we will show how the coupling between scheduling and estimation leads to nonconvex optimization problems. First, consider a fixed scheduling function f 1 : R 2 →{1, 2}. Let i,j∈{1, 2}, such thati6=j. Since the cost is the mean squared error between the observations and the estimates, the optimal estimator is the conditional mean, i.e., g i? 1 (y) =E X i 1 |Y i 1 =y , i∈{1, 2}. (5.9) 106 When y = (i,x i 1 ), we have: g i? 1 (i,x i 1 ) =x i 1 , i∈{1, 2}. (5.10) However, when y =?, we have: g i? 1 (?) =E[X i |f 1 (X 1 1 ,X 2 1 )6=i], (5.11) from which two important points can be drawn: 1. the estimate g i? 1 (?) is an implicit function of the scheduling function f 1 ; 2. the event that X i was not transmitted always carries some implicit information aboutX i . That means that even no-transmission symbols received over the network can be used as side information for estimation. 
Therefore, solving the resulting optimization problem for the scheduling function f 1 , which seeks to minimize the cost functional J f 1 = X (i,j):i6=j Z R 2 x i 1 −g i? 1 (?) 2 I(f 1 (x 1 1 ,x 2 1 ) =j)π 1 (x 1 1 )π 2 (x 2 1 )dx 1 1 dx 2 1 , where g i? 1 (?) is given by (5.11), for arbitrary pdfs π 1 and π 2 , is intractable. If on the other hand, we fix the estimation functions g 1 1 and g 2 1 , such that the following identities are satisfied: g i 1 (y) = x i 1 if y = (i,x i 1 ) η i 1 if y =?, 107 where η i 1 ∈R, the optimal scheduler is determined by the following inequality: f ? 1 (x 1 ) = 1⇔|x 2 1 −η 2 1 |<|x 1 1 −η 1 1 |, which leads to the following nonconvex objective function: J g 1 1 ,g 2 1 =E h min n X 1 1 −η 1 1 2 , X 2 1 −η 2 1 2 oi . In both cases, the globally optimal solution to Problem 5.1 is nontrivial for arbitrary pdfs π 1 and π 2 , due to the coupling between f 1 , g 2 1 and g 1 2 . In this paper, we attempt to solve the more general problem statement for arbitrary T≥ 1 and c≥ 0 assuming that the pdfs π 1 and π 2 satisfy certain properties. 5.2.2 Main result The following definition will be used to state our main result. Definition 5.1 (Symmetric and unimodal probability density functions). Let π :R n →R be a probability density function (pdf). The pdfπ is symmetric and unimodal arounda∈R n if it satisfies the following property: kx−ak≤ky−ak⇒π(x)≥π(y), x,y∈R n . (5.12) Theorem 5.1. Provided that π 1 and π 2 are symmetric and unimodal around a 1 ∈ R n 1 and a 2 ∈ R n 2 , respectively, the following scheduling and estimation strategies are globally 108 optimal for Problem 5.1: f ? t (x,e) def = 0, if max i∈{1,2} kx i −a i k ≤τ ? t (e) arg max i∈{1,2} kx i −a i k, otherwise, (5.13) where τ ? t :Z→R is a threshold; and g i? t y i ) def = x i if y i =x i a i if y i =?, (5.14) for t∈{1,··· ,T}. 5.2.3 Information Structures Problem 5.1 can be understood as a sequential stochastic team with three decision-makers: the scheduler and the two estimators. One key aspect to note is that Problem 5.1 has a nonclassical information structure. Such team problems are usually nonconvex, and their solutions are found on a case-by-case basis. Our analysis relies on the common information approach [15], where the idea is to transform the decentralized problem into an equivalent centralized one where the information for decision-making is the common information among all the decision-makers in the decentralized system. We begin by establishing a structural result for the optimal scheduling strategy. The follow- ing lemma states that the scheduler may ignore the past state observations at each sensor without any loss of optimality. 109 Lemma 5.1. Without loss of optimality, the scheduler can be restricted to strategies of the form: U t =f t X t ,E 1:t , Y 1:t−1 . (5.15) Proof. Let the strategy profile of the estimators g 1 and g 2 be arbitrarily fixed. The problem of selecting the best scheduling policy (for the fixed estimation strategy pro- files g 1 and g 2 ) simplifies to a Markov Decision Process (MDP), whose state is defined as S t def = (X t ,E 1:t , Y 1:t−1 ). Using simple arguments involving conditional probabilities and the basic definitions of section 5.2.1, we can show that the state process{S t ,t≥ 1} is a controlled Markov chain, i.e., P(S t+1 |S 1:t ,U 1:t ) =P(S t+1 |S t ,U t ). 
(5.16) The cost incurred at time t of the equivalent MDP is: ρ(S t ,U t ) def = X i∈{1,2} X i t − ˆ X i t 2 +cI(U t 6= 0) (5.17) (a) = X i∈{1,2} X i t −g i t (Y i 1:t ) 2 +cI(U t 6= 0) (5.18) (b) = X i∈{1,2} X i t −g i t Y i 1:t−1 ,h i (X i t ,U t ) 2 +cI(U t 6= 0), (5.19) where (a) follows from (5.7) and (b) follows from (5.5). Thus, the problem of finding the optimal scheduling strategy to minimize the costJ f, g 1 , g 2 becomes equivalent to finding the optimal decision strategy for an MDP with state process 110 S t and instantaneous cost ρ(S t ,U t ). Standard results for MDPs [6] imply that there exists an optimal scheduling strategy of the form in lemma. Since this is true for any arbitrary g 1 and g 2 , it is also true for the globally optimal g 1? and g 2? . Under the structural result in lemma 5.1, the information sets available at the scheduler and estimators can be reduced to: I S t def = X t ,E 1:t , Y 1:t−1 (5.20) I E i t def = Y i 1:t , i∈{1, 2}, (5.21) without any loss of optimality. However, the information structure described by (5.20),(5.21) do not share any common information. In other words, the information setsI S t ,I E 1 t andI E 2 t have no common random variables, a fact that limits the utility of the common information approach. We resort to a technique which consists of judiciously expanding the information available at the decision-makers such that the common information approach can be more profitably employed. 5.2.3.1 Information structure expansion We expand the estimators’ information sets to the following: ¯ I E 1 t def = E 1:t , Y 1:t−1 ,Y 1 t (5.22) 111 ¯ I E 2 t def = E 1:t , Y 1:t−1 ,Y 2 t . (5.23) The optimal cost for Problem 5.1 under an expanded information structure is at least as good as the optimal cost under the original information structure (having more information at each estimator cannot worsen its performance). Moreover, if the optimal solution under the expanded information structure is adapted to the original information structure, then this solution is also optimal under the original information structure [83, Proposition 3.5.1]. We proceed by defining another problem identical to Problem 5.1 but with expanded infor- mation sets at the estimators. Problem 5.2. Consider the model of section 5.2.1 with the expanded information sets of (5.22),(5.23) at the estimatorsE 1 andE 2 , respectively. Given the statistics of the sensors’ observations, the statistics of the energy harvested at each time, the battery storage limit B, communication cost c, and the horizon T , find the scheduling and estimation strategies f, g 1 and g 2 that jointly minimize the costJ f, g 1 , g 2 in (5.8). Under the expanded information structure, the common information among the decision makers is: I com t def = E 1:t , Y 1:t−1 . (5.24) Notice that the common information contains several variables that were not initially avail- able to the estimators. However, we will eventually show at the end of section 5.2.5 that 112 the optimal estimation strategy for Problem 5.2 does not depend on this additional infor- mation. To show this independence, we first establish the following lemma, which provides a structural result for the estimation strategies under the expanded information sets. Lemma 5.2. Without loss of optimality, the search for optimal strategies for estimatorE i can be restricted to functions of the form: g i t (E 1:t , Y 1:t−1 ,Y i t ) = X i t if Y i t =X i t ˜ g i t (E 1:t , Y 1:t−1 ) otherwise. (5.25) Proof. 
Let the strategy of the scheduler be fixed to some arbitrary f. We can view Problem 5.2 from the perspective of the estimatorE i at time t as follows: inf g i t E kX i t − ˆ X i t k 2 + ˜ J, (5.26) where ˜ J def =E " T X k=1 cI(U k 6= 0) + T X k=1 X j6=i kX j k − ˆ X j k k 2 + X k6=t kX i k − ˆ X i k k 2 # . (5.27) Notice that the estimation function g i t only affects the value of the estimate ˆ X i t , i.e., ˆ X i t =g i t ( ¯ I E i t ), (5.28) which does not appear in (5.27). Since g i t does not affect ˜ J , the optimal estimate can be computed by solving: inf g i t E kX i t − ˆ X i t k 2 . (5.29) 113 This is the standard MMSE estimation problem whose solution is the conditional mean, i.e., ˆ X i t =E X i t ¯ I E i t . (5.30) Therefore, the optimal estimation strategy is of the form: g i? t ( ¯ I E i t ) = X i t if Y i t =X i t E X i t E 1:t , Y 1:t−1 ,Y i t =? otherwise. (5.31) Notice that (E 1:t , Y 1:t−1 ) is known toE i in Problem 5.2. Thus, ˜ g i t (E 1:t , Y 1:t−1 ) def =E X i t E 1:t , Y 1:t−1 ,Y i t =? . (5.32) Since (5.31) holds for any f, it also holds for the globally optimal scheduling strategy f ? . Therefore, the optimal estimate is of the form given in the lemma. 5.2.4 An equivalent problem with a coordinator In this section, we will formulate a problem that will be used to solve Problem 5.2. We consider the model of section 5.2.1 and introduce a fictitious decision-maker referred to as the coordinator, which has access to the common informationI com t . The coordinator is the only decision-maker in the new problem. The scheduler and the estimators act as “passive decision-makers” to which strategies chosen by the coordinator are prescribed. 114 The equivalent system operates as follows: Let n 1 and n 2 denote the dimensions of the observation made by sensors 1 and 2, respectively. At each time t, based onI com t , the coordinator chooses a map Γ t :R n 1 ×R n 2 →{0, 1, 2} for the scheduler, and a vector ˜ X i t ∈R n i for each estimatorE i ,i∈{1, 2}. The function Γ t and vectors ˜ X 1 t and ˜ X 2 t are referred to as the scheduling and estimation prescriptions. The scheduler uses its prescription to evaluate U t according to: U t = Γ t (X t ). (5.33) The estimatorE i uses its prescription to compute the estimate ˆ X i t according to: ˆ X i t = X i t if Y i t =X i t ˜ X i t otherwise. (5.34) The coordinator selects its prescriptions for the scheduler and the estimators using strategies d t ,` 1 t and ` 2 t as follows: Γ t =d t (E 1:t , Y 1:t−1 ) (5.35) and ˜ X i t =` i t (E 1:t , Y 1:t−1 ), i∈{1, 2}. (5.36) We refer to the collections d def ={d 1 ,··· ,d T } and` i def ={` i t ,··· ,` i T } as the prescription strate- gies for the scheduler and the estimatorE i , respectively. The strategies` 1 and` 2 must be a valid estimation strategies in Problem 5.2. The strategyd must be such that f t (X t ,E 1:t ,Y 1:t−1 ) def = d t (E 1:t ,Y 1:t−1 ) (X t ) (5.37) 115 is a valid scheduling strategy in Problem 5.2. The cost incurred by the prescription strategies d,` 1 and` 2 is identical as in (5.8), that is, ˆ J (d,` 1 ,` 2 ) = T X t=1 E " cI(U t 6= 0) + X i∈{1,2} kX i t − ˆ X i t k 2 # . (5.38) Problem 5.3. Find prescription strategies d,` 1 , and` 2 that jointly minimize ˆ J (d,` 1 ,` 2 ). Problem 5.3 is equivalent to Problem 5.2 in the sense that for every scheduling strategy f and estimation strategies g 1 , g 2 in Problem 5.2 there exist prescription strategies d,` 1 and ` 2 such thatJ (f, g 1 , g 2 ) = ˆ J (d,` 1 ,` 2 ) and vice-versa. 
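Before proceeding, the coordinator mechanism can be summarized operationally with a short simulation skeleton. This is only a sketch under simplifying assumptions (scalar sources and a user-supplied prescription strategy choose_prescription, both hypothetical names); it does not implement the optimal strategy, which is derived in the remainder of this section.

import numpy as np

def simulate_coordinator(T, B, c, sample_sources, sample_harvest, choose_prescription, seed=0):
    """One rollout of the coordinated system (two scalar sources).

    choose_prescription(e_hist, y_hist) returns (gamma, xtilde), where gamma maps an
    observation pair (x1, x2) to {0, 1, 2} and xtilde = (xtilde1, xtilde2) are the
    estimation prescriptions.  Both are chosen from the common information only.
    """
    rng = np.random.default_rng(seed)
    e = B
    e_hist, y_hist, cost = [B], [], 0.0
    for t in range(T):
        x = sample_sources(rng)                          # (X1_t, X2_t), i.i.d. in time
        gamma, xtilde = choose_prescription(e_hist, y_hist)
        u = gamma(x) if e > 0 else 0                     # the scheduler just evaluates gamma
        y = (x[0] if u == 1 else None, x[1] if u == 2 else None)
        xhat = [x[i] if y[i] is not None else xtilde[i] for i in range(2)]
        cost += c * (u != 0) + sum((x[i] - xhat[i]) ** 2 for i in range(2))
        e = min(e - (u != 0) + sample_harvest(rng), B)   # battery update as in (5.40)
        e_hist.append(e)
        y_hist.append(y)
    return cost

For instance, once the thresholds of Theorem 5.1 are available, choose_prescription can return the corresponding threshold map and the centers (a_1, a_2) as the estimation prescriptions.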
Thus, solving Problem 5.3 allows us to obtain optimal f ? , g 1? and g 2? for Problem 5.2. The same technique is used in [20] to prove a similar equivalence in a problem involving a single sensor-estimator pair. Problem 5.3 can be described as a centralized POMDP as follows: (i) State process: The state is S t def = (X t ,E t ). (ii) Action process: Let the setA(E t ) be defined as the collection of all measurable functions fromR n 1 × R n 2 →U(E t ), whereU is defined in (5.2). The coordinator selects the prescription for the network manager, Γ t ∈A(E t ), and the prescriptions for the estimators ˜ X 1 t ∈R n 1 and ˜ X 2 t ∈R n 2 . (iii) Observations: After choosing its action at time t, the coordinator observes Y t and E t+1 . 116 (iv) Instantaneous cost: Let ˜ X t def = ( ˜ X 1 t , ˜ X 2 t ). The instantaneous cost incurred is given by ρ(X t , Γ t , ˜ X t ) def = P i∈{1,2} kX i t − ˜ X i t k 2 if Γ t (X t ) = 0 c +kX 2 t − ˜ X 2 t k 2 if Γ t (X t ) = 1 c +kX 1 t − ˜ X 1 t k 2 if Γ t (X t ) = 2. (5.39) (v) Markovian dynamics: Since X t is an i.i.d process, X t+1 is independent of S t . The evolution of the energy E t+1 is given by: E t+1 = min E t −I γ t (X t )6= 0 +Z t ,B . (5.40) Noticing that (5.40) can be written as a function of the state S t , action γ t and the noise Z t , the state S t satisfies (5.16) and forms a controlled Markov chain. 5.2.4.1 Dynamic program Having established that Problem 5.3 is a POMDP, the optimal prescriptions can be com- puted by solving a dynamic program whose information state is the belief of the state process given the common information. However, since E t is perfectly observed, the coor- dinator only needs to form a belief on X t . Let x = (x 1 ,x 2 ). We define the belief state at 117 time t as: Π t (x) def =P X t = x|E 1:t , Y 1:t−1 . (5.41) Since the sources are i.i.d. and independent of the energy process, we have: Π t (x) =π(x), t∈{1,··· ,T}, (5.42) where, due to the independence of the sources, π(x) =π 1 (x 1 )π 2 (x 2 ). (5.43) Lemma 5.3. Define the functionsV π t :Z→R for t∈{0, 1,··· ,T + 1} as follows: V π T +1 (e) def = 0, e∈{0, 1,...,B}, (5.44) and V π t (e) def = inf ˜ xt,γt E h ρ(X t ,γ t , ˜ x t ) +V π t+1 F e,γ t (X t ),Z t i , (5.45) where ˜ x t ∈R n , γ t ∈A(e). If the infimum in (5.45) is achieved, then at each time t∈{1,··· ,T} and for each e∈ {0, 1,··· ,B}, the minimizing γ t and ˜ x t in (5.45) determines the optimal prescriptions for the network manager and the estimators, respectively. Furthermore,V 1 (B) is the optimal cost for Problem 5.3. 118 Proof. This result follows from standard dynamic programming arguments for POMDPs. 5.2.5 Solving the dynamic program In this section, we will find the optimal prescriptions using the dynamic program in lemma 5.3. For the remainder of this section, without loss of generality, we will assume thatπ 1 and π 2 are symmetric and unimodal around 0. The same arguments apply for general a i ∈R n i , i∈{1, 2}. Note that each step of the dynamic program in (5.45) is an optimization problem with respect to ˜ x t and γ t . This is an infinite dimensional optimization problem since γ t is a mapping which lies in A(E t ). The next lemma will describe the structure of the optimal prescription for the scheduler and show that the infinite-dimensional optimization in (5.45) can be reduced to a finite dimensional problem with respect to the vector ˜ x t . 
For that purpose, we define the functionsC 0 t+1 ,C 1 t+1 :Z→R as follows: C 0 t+1 (e) def =E h V π t+1 min{e +Z t ,B} i (5.46) C 1 t+1 (e) def =c +E h V π t+1 min{e− 1 +Z t ,B} i . (5.47) 119 V π t (e) = inf ˜ xt E h X i∈{1,2} kX i t − ˜ x i t k 2 i +C 0 t+1 (e) if e = 0 inf ˜ xt E h min n X i∈{1,2} kX i t − ˜ x i t k 2 +C 0 t+1 (e),kX 2 t − ˜ x 2 t k 2 +C 1 t+1 (e),kX 1 t − ˜ x 1 t k 2 +C 1 t+1 (e) oi if e> 0. (5.49) Lemma 5.4. Suppose the prescription to the estimators are ˜ x 1 t , ˜ x 2 t at time t. Then, the optimal prescription to the scheduler has the following form when e> 0: γ ? t (x t ) def = 0, if max i∈{1,2} kx i t − ˜ x i t k ≤τ ? t (e) arg max i∈{1,2} kx i t − ˜ x i t k , otherwise, (5.48) where τ ? t (e) def = q C 1 t+1 (e)−C 0 t+1 (e) 1 . Moreover, the value functionV π t of lemma 5.3 can be obtained by solving the finite dimensional optimization in (5.49). Proof. If e = 0, there is only one feasible scheduling policy: γ ? t (x t ) = 0, x t ∈R n . (5.50) Therefore, V π t (0) = inf ˜ xt E X i∈{1,2} kX i t − ˜ x i t k 2 +C 0 t+1 (0). (5.51) If e> 0, the value function in (5.45) can be written as in (5.52). 1 The functionC 1 t+1 (e) is larger thanC 0 t+1 (e). Therefore, the threshold τ ? t (e) is a real number for all e∈{1,··· ,B} and t∈{1,··· ,T}. 120 V π t (e) = inf ˜ xt ( inf γt Z " X i∈{1,2} kx i t −˜ x i t k 2 +C 0 t+1 (e) I(γ t (x t ) = 0)+ kx 2 t −˜ x 2 t k 2 +C 1 t+1 (e) I(γ t (x t ) = 1) + kx 1 t − ˜ x 1 t k 2 +C 1 t+1 (e) I(γ t (x t ) = 2) # π(x t )dx t ) (5.52) For any fixed ˜ x i t ∈R n i , i∈{1, 2}, the scheduling prescription that achieves the minimum in the inner optimization problem in (5.52) is determined as follows: • γ ? t (x t ) = 0 if and only if kx i t − ˜ x i t k 2 ≤C 1 t+1 (e)−C 0 t+1 (e), i∈{1, 2}; (5.53) • γ ? t (x t ) = 1 if and only if kx 1 t − ˜ x 1 t k 2 >C 1 t+1 (e)−C 0 t+1 (e) (5.54) and kx 1 t − ˜ x 1 t k≥kx 2 t − ˜ x 2 t k; (5.55) • γ ? t (x t ) = 2 if and only if kx 2 t − ˜ x 2 t k 2 >C 1 t+1 (e)−C 0 t+1 (e) (5.56) and kx 2 t − ˜ x 2 t k>kx 1 t − ˜ x 1 t k. (5.57) 121 Therefore, γ ? t (x t ) def = 0, if max i∈{1,2} kx i t − ˜ x i t k 2 ≤C 1 t+1 (e)−C 0 t+1 (e) arg max i∈{1,2} kx i t − ˜ x i t k , otherwise. (5.58) Using the optimal scheduling prescription in (5.58), the value function becomes: V t (e) = inf ˜ xt E h min n kX 1 t − ˜ x 1 t k 2 +kX 2 t − ˜ x 2 t k 2 +C 0 t+1 (e), kX 2 t − ˜ x 2 t k 2 +C 1 t+1 (e),kX 1 t − ˜ x 1 t k 2 +C 1 t+1 (e) oi . (5.59) Lemma 5.4 implies that the optimal solution to problem 5.3 can be found by solving the finite dimensional optimization problem in (5.49). We will show that (5.49) admits a globally optimal solution under certain conditions on the probabilistic structure of the problem. Lemma 5.5. Let X 1 t and X 2 t be independent continuous random vectors with pdfs π 1 and π 2 . Provided that π 1 and π 2 are symmetric and unimodal around zero 2 , then ˜ x ? t = 0 is a global minimizer in (5.49) for all e∈{0, 1,··· ,B}. Proof. The proof is in Appendix D.2. We are now ready to provide the proof of theorem 5.1. 2 This assumption is without loss of generality. The same result holds for pdfs symmetric and unimodal around arbitrary a i ∈R n i , i∈{1, 2}, with ˜ x ? t = (a 1 ,a 2 ) instead of ˜ x ? t = 0. 122 Proof of Theorem 5.1. We will first show that (f ? , g 1? , g 2? ) as defined in theorem 5.1 is globally optimal for problem 5.2. The optimal prescriptions for problem 5.3 are obtained using lemma 5.4,5.5. The optimal prescription for the scheduler is given by: γ ? 
t (x t ) def = 0, if max i∈{1,2} kx i t k <τ ? t (e) arg max i∈{1,2} kx i t k , otherwise, (5.60) whose threshold functions τ ? t (e) can be computed recursively (see section 5.2.6); and the optimal prescription for the estimators are: ˜ x i? t = 0, i∈{1, 2}. (5.61) Therefore, using the equivalence between problem 5.2 and problem 5.3, the optimal strategy profiles for problem 5.2 are f ? t (x t ,e t ) def = 0, if max i∈{1,2} kx i t k <τ ? t (e t ) arg max i∈{1,2} kx i t k, otherwise, (5.62) and g i? t y i t ) def = x i t if y i t =x i t 0 if y i t =? , i∈{1, 2}. (5.63) Moreover, since the solution to problem 5.2, (f ? , g 1? , g 2? ) does not depend on the additional 123 information provided to the estimators and is adapted to the original information structure of the estimators in problem 5.1, it is also a globally optimal strategy profile for problem 5.1. 5.2.6 Computation of optimal thresholds Once the structural result in theorem 5.1 is established, the optimal scheduling strategy is completely specified by the sequence of optimal threshold functions τ ? t , t∈{1,··· ,T}. The thresholdsτ ? t (e) are obtained using the functionsC 0 t+1 (e),C 1 t+1 (e) in (5.46),(5.47). The functionsC 0 t (·),C 1 t (·) can be computed by computing the value functionsV π t via a backward inductive procedure. Note that we can simplify the expression for the value function using lemma 5.5 and (5.49) to: V π t (0) =E h kX 1 t k 2 +kX 2 t k 2 +V π t+1 min{Z t ,B} i (5.64) and V π t (e) =E h min n kX 1 t k 2 +kX 2 t k 2 +C 0 t+1 (e),kX 2 t k 2 +C 1 t+1 (e),kX 1 t k 2 +C 1 t+1 (e) oi if e> 0. (5.65) The following algorithm outlines the recursive computation of the threshold function τ ? t : 124 Algorithm 5 Computing the optimal threshold functions τ ? t Initialization: t←T SetV π T +1 (e)← 0 for e∈{0,··· ,B} while t≥ 1 do ComputeC 0 t+1 (e) andC 1 t+1 (e) using (5.46),(5.47) for e∈{1,··· ,B} Set τ ? t (e)← q C 1 t+1 (e)−C 0 t+1 (e) for e∈{1,··· ,B} ComputeV π t (e) using (5.64),(5.65) for e∈{0,··· ,B} t←t− 1 end while Remark 5.3. The expectations in the algorithm are taken with respect to the random vec- tors X 1 t and X 2 t . Computing these expectations for high dimensional random vectors may be computationally intensive for some source distributions, but in practice, they can be ap- proximated using Monte Carlo methods. The remaining operations in the algorithm admit efficient implementations. 5.2.7 Illustrative examples 5.2.7.1 Optimal blind scheduling Before we provide a few numerical examples it is useful to introduce a scheduling strategy which is based exclusively on the statistics of the sources, and not on the observations. Consider the following blind scheduling strategy: if the battery is not empty, transmit the 125 source whose variance is the largest, i.e., f blind t (e t ) def = 0 if e t = 0 arg max i∈{1,2} n E kX i t −E[X i t ]k 2 o otherwise, (5.66) The estimation strategies associated with blind scheduling are: g blind i t y i t ) def = x i t if y i t =x i t E[X i t ] if y i t =? , i∈{1, 2}. (5.67) The performance of the blind scheduling and estimation strategies and the is given by: J blind (B) def = T X t=1 h P(E t = 0) X i∈{1,2} E kX i t −E[X i t ]k 2 + 1−P(E t = 0) min i∈{1,2} n E kX i t −E[X i t ]k 2 oi , (5.68) where the probabilities P(E t = 0),t∈{1,··· ,T} are computed recursively using (5.3),(5.4) and assuming E 1 =B > 0 with probability 1. Example 5.1 (Limited number of transmissions). Consider the scheduling of two i.i.d. 
zero-mean scalar Gaussian sources with variances σ_1^2 = σ_2^2 = 1. Assume that the total system deployment time is T, and that during that time the scheduler is only allowed to transmit B < T times. Furthermore, assume that during that time there is no energy being harvested, i.e., Z_t = 0 with probability 1, and there are no additional communication costs, i.e., c = 0.

The algorithm outlined in section 5.2.6 is used to compute the optimal thresholds, which are functions of the time index and the energy level of the battery. Figure 5.2a displays the optimal thresholds computed for this example with T = 100 and B = 30. Notice that when the energy level is greater than the remaining deployment time, the optimal threshold is zero, that is, the observation with the largest magnitude is always transmitted. On the other hand, if the energy level is below the remaining deployment time, the optimal threshold is strictly positive, and it increases as the energy level decreases. That means that as the battery depletes, the scheduler will only transmit observations whose magnitudes are increasingly larger.

Example 5.2 (Energy-harvesting scheduler). Consider a setup identical to that in example 5.1 with T = 100, but in addition assume that the energy-harvesting process Z_t is distributed according to one of two possible probability mass functions:

p^1_Z(0) = 0.85, p^1_Z(1) = 0.1, p^1_Z(2) = 0.05,  or  p^2_Z(0) = 0.7, p^2_Z(1) = 0.2, p^2_Z(2) = 0.1,  (5.69)

yielding on average 0.2 and 0.4 energy units per time step, respectively. The optimal thresholds obtained for the energy-harvesting system under p^1_Z are shown in figure 5.2b, and they are uniformly smaller than those of the system without harvesting. We also note a change in the "curvature" of the threshold function for a fixed t.

[Figure 5.2: Optimal threshold function for the scheduling of two i.i.d. standard Gaussian sources; the threshold is a function of the energy level and time. (a) No energy harvesting. (b) Energy harvesting with p^1_Z.]

[Figure 5.3: Comparison between the performances of the optimal open-loop and closed-loop strategies as a function of the battery capacity B. The relative gap between these two curves is defined as the Value of Information.]

[Figure 5.4: Optimal performance J* of the systems with and without harvesting of Examples 5.1 and 5.2 as a function of the communication cost c.]

Figure 5.3 shows the performance of the optimal strategy and the blind scheduling scheme as a function of the battery capacity B for three systems: no harvesting, harvesting with p^1_Z, and harvesting with p^2_Z. The optimal scheme proposed in this paper leads to a significant improvement upon the blind scheduling strategy of (5.66). For B = 10, without energy harvesting, the optimal performance is J* ≈ 147.37. However, in order to achieve a comparable performance using blind scheduling, a battery of capacity equal to 53 energy units would be required. Therefore, the energy savings in this case are approximately 81.13%. Finally, figure 5.4 illustrates the performance of the systems with and without harvesting for the scheduling of two standard Gaussian sources over a horizon T = 100 and a battery of fixed size B = 30, as a function of the communication cost c.
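The threshold computation of Algorithm 5 is straightforward to implement. The sketch below reproduces the setting of Example 5.1 (two standard scalar Gaussian sources), with the expectations over (X^1_t, X^2_t) approximated by Monte Carlo as suggested in Remark 5.3; the function names, sample size, and the dictionary format for the harvesting distribution are illustrative choices.

import numpy as np

def optimal_thresholds(T, B, c, pz, sigma=(1.0, 1.0), n_mc=50_000, seed=0):
    """Backward computation of tau*_t(e) (Algorithm 5) for two scalar Gaussian sources.

    pz: dict {z: prob} with integer energy increments Z_t (use {0: 1.0} for no harvesting).
    Returns tau[t][e] for t = 1..T and e = 1..B.
    """
    rng = np.random.default_rng(seed)
    x1 = sigma[0] * rng.standard_normal(n_mc)      # samples of X1_t (i.i.d. in t)
    x2 = sigma[1] * rng.standard_normal(n_mc)
    d1, d2 = x1 ** 2, x2 ** 2                      # squared distortions if not transmitted
    zs, pzs = np.array(list(pz)), np.array([pz[z] for z in pz])

    V = np.zeros(B + 1)                            # V_{T+1}(e) = 0
    tau = {}
    for t in range(T, 0, -1):
        def EV(e_next):                            # E[ V_{t+1}(min(e_next + Z_t, B)) ]
            return float(np.dot(pzs, V[np.minimum(e_next + zs, B)]))
        C0 = np.array([EV(e) for e in range(B + 1)])                       # (5.46)
        C1 = np.array([np.nan] + [c + EV(e - 1) for e in range(1, B + 1)]) # (5.47), e >= 1
        # guard against small Monte Carlo noise before taking the square root
        tau[t] = {e: np.sqrt(max(C1[e] - C0[e], 0.0)) for e in range(1, B + 1)}
        Vnew = np.empty(B + 1)
        Vnew[0] = (d1 + d2).mean() + EV(0)                                 # (5.64)
        for e in range(1, B + 1):
            Vnew[e] = np.minimum.reduce(                                   # (5.65)
                [d1 + d2 + C0[e], d2 + C1[e], d1 + C1[e]]).mean()
        V = Vnew
    return tau

# Example 5.1: T = 100, B = 30, no harvesting, no communication cost.
tau = optimal_thresholds(T=100, B=30, c=0.0, pz={0: 1.0})
print(tau[1][1], tau[1][30])   # threshold at t = 1 for a low and a full battery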
5.2.8 Extensions 5.2.8.1 The N sensor case Theorem 5.1 holds for any number of sensors (N≥ 2). Let x t = (x 1 t ,x 2 t ,··· ,x N t ), where x i t ∈R n i is the observation at the i-th sensor. Provided that the observations are mutually independent and, their pdfs are symmetric and unimodal around a 1 ,a 2 ,··· ,a N , where a i ∈R n i , i∈{1, 2,··· ,N}, the jointly optimal scheduling and estimation strategies are: f ? t (x t ,e t ) def = 0, if max i∈{1,···,N} kx i t −a i k ≤τ ? t (e t ) arg max i∈{1,···,N} kx i t −a i k , otherwise, (5.70) 130 and g i? t y i ) def = x i t if y i t =x i t a i if y i t =? , i∈{1,··· ,N}. (5.71) 5.2.8.2 Unequal weights and communication costs In specific applications, each sensor may be assigned a different weight in the expected distortion metric. This new metric is used to emphasize the importance of the observations made by one sensor relative to another. Additionally, different sensors may also have dif- ferent communication costs, which may reflect the dimension of the measurements or used to preserve the battery power, for instance. These cases are captured by the following cost functional: J f, g 1 , g 2 def = T X t=1 E " X i∈{1,2} w i kX i t − ˆ X i t k 2 +c i I(U t =i) # . (5.72) The globally optimal scheduling and estimation strategies for the more general cost func- tional in (5.72) are given by (5.74),(5.73): g i? t y i t ) def = x i t if y i t =x i t a i if y i t =? , i∈{1, 2}, (5.73) where the thresholdsτ 1 t andτ 2 t are computed by modified version of Algorithm 1, described in Appendix D.3. 131 f ? t (x t ,e t ) = 0, ifkx 1 t −a 1 k≤τ 1 t (e t ), kx 2 t −a 2 k≤τ 2 t (e t ) 1, ifkx 1 t −a 1 k>τ 1 t (e t ), w 1 kx 1 t −a 1 k 2 −w 2 kx 2 t −a 2 k 2 ≥w 1 τ 1 t (e t ) 2 −w 2 τ 2 t (e t ) 2 2, otherwise (5.74) 5.3 Network Estimation: Markov Case In this section, we will consider the case when the source state evolve in a Markovian manner. 5.3.1 Problem formulation Consider a system with N ≥ 2 stochastic subsystems and corresponding non-collocated estimatorsE 1 ,...,E N . Every subsytem i has a source state process X i t ∈R,t = 0, 1,...,T associated with it. The source processes evolve in time in an uncontrolled Markovian manner and independent of each other as follows: X i t+1 =X i t +W i t , (5.75) where W i t is the noise and X i 0 = 0∀i. We assume that W i t are independent for all t and i and have the common distribution f W . EstimatorE i wants to form a real-time estimate ˆ X i t of the process X i t at each time t. Subsystem i can communicate its state X i t to the corresponding estimator over a synchronous wireless network with a communication cost c≥ 0. However, the network can support only a single packet per time slot. Thus, a transmission is successful only if exactly one subsystem transmits its state and if more than 132 one subsystem communicate at the same time a collision occurs. In order to avoid collisions, the access to the shared network is scheduled by a network manager. The network manager collects the state of each subsystem and decides which state to transmit (if any) at each time t. Let U t ∈{0, 1, 2,...,N} denote the index of the scheduled subsystem. Note that U t = 0 corresponds to the decision when no subsystem is scheduled at time t. In this case a no transmission symbol? is sent across the network. The observation Y i t ofE i at time t is given by: Y i t = X i t if U t =i ? o.w. 
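As a concrete illustration of the model just described, the following sketch simulates N independent random-walk sources and the collision-free network for a user-supplied scheduling rule. The Gaussian noise, the hold-last-received-value estimator, and the greedy placeholder rule are illustrative assumptions; the optimality of the latter two is only established later (Lemma 5.8 and Corollary 5.1).

import numpy as np

def simulate_markov_network(T, N, schedule, noise_std=1.0, seed=0):
    """Simulate N independent random walks X^i_{t+1} = X^i_t + W^i_t (5.75) and the
    shared network: at most one state is delivered per slot, the rest receive '?'.
    schedule(x, z) returns an index in {0, 1, ..., N}, with 0 meaning no transmission."""
    rng = np.random.default_rng(seed)
    x = np.zeros(N)                  # X^i_0 = 0
    z = np.zeros(N)                  # most recently delivered value at each estimator
    costs = []
    for t in range(T):
        x = x + noise_std * rng.standard_normal(N)
        u = schedule(x, z)
        if u != 0:
            z[u - 1] = x[u - 1]      # estimator u receives X^u_t; the others receive '?'
        # distortion if each estimator holds its last received value
        costs.append(np.sum((x - z) ** 2))
    return np.mean(costs)

# Placeholder rule: transmit the source with the largest deviation from its last
# delivered value (this turns out to be optimal when c = 0, see Corollary 5.1).
greedy = lambda x, z: int(np.argmax(np.abs(x - z))) + 1
print(simulate_markov_network(T=100, N=2, schedule=greedy))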
Information and strategies The information available at the network manager at time t is I t ={X 1:t , Y 1:t−1 } where X t = (X 1 t ,...,X N t ) and Y t = (Y 1 t ,...,Y N t ). The network manager maps its information I t to its decision U t using the scheduling strategy γ t :I t →{0, 1,...,N}, that is, U t =γ t (I t ). (5.76) We denote by γ the collection (γ 1 ,γ 2 ,...,γ T ) and call it the network scheduling strategy. The information available at the estimatorE i at time t is I i t ={Y i 1:t }. Let ψ i t : I i t → R 133 denote the estimation strategy used byE i at time t. Then, ˆ X i t =ψ i t (I i t ). (5.77) The collection ψ i = (ψ i 1 ,...,ψ i T ) is referred to as the estimation strategy ofE i and ψ = (ψ 1 ,...,ψ N ) is referred to as the network estimation strategy. Cost In addition to the communication cost, the network incurs a distortion cost||X i t − ˆ X i t || 2 for each i,t. The total cost of a network scheduling strategy γ and a network estimation strategy ψ is given by: J(γ,ψ) =E " T X t=1 cI(U t 6= 0) + N X i=1 ||X i t − ˆ X i t || 2 !# (5.78) We can now state the problem formally as follows: Problem 5.4. Given the statistics of the noise W i t for each i,t, the constant c≥ 0, find a network scheduling strategy γ and a network estimation strategy ψ which minimizes the cost J(γ,ψ) in (5.78). Problem 5.4 is an instance of decentralized stochastic control problem. Next, we briefly discuss the role of signaling in decentralized control problems. Remark 5.4. In problems of decentralized control and estimation, agents can often use signaling to communicate implicitly. For example, suppose the network manager decides 134 not to transmit the state of subsystem i when it lies in some set A i . Then, decision of not transmitting the state of subsystem i conveys information about the state to the estimator i. Signaling is essentially propagating information through actions [107], and it is the reason why this class of problems are difficult to solve, cf. [108]. 5.3.2 Optimal estimation strategy In this section, we will derive the optimal estimation strategy but first we establish a structural result for the optimal scheduling strategy. The following lemma states that the network manager can ignore the past values of each source state without losing performance. Lemma 5.6. The scheduling strategy can be restricted to the form U t = γ t (X t , Y 1:t−1 ) without any loss in performance Proof. Fix the strategy profile of the estimators to any arbitrary choice ψ. We will argue that for the fixed choiceψ, there exists an optimal scheduling strategy of the form in lemma. Once the estimation strategy profile is fixed, the problem simplifies to a centralized markov decision process with the state process X t , Y 1:t−1 . This can be established by observing that conditioned on X t , Y 1:t−1 , the next state X t+1 , Y 1:t is independent of the past source states X 1:t−1 . Also, the cost can be written down as a function of the X t , Y 1:t−1 and decision U t . Therefore, using the standard results from Markov decision theory it follows that there exists an optimal strategy of the form in the lemma. Since this is true for any arbitrary ψ, it is also true for the globally optimal ψ. Hence, the result follows. 135 We will now state a simple structural result for the optimal strategy at the estimators which follows from the form of optimal MMSE estimators. Lemma 5.7. EstimatorE i ’s optimal estimate at time t is given by ˆ X i,∗ t =E[X i t |I i t ]. Proof. 
We can view the strategy optimization problem from the perspective of the estima- tion strategy ψ i t as inf ψ i t E[||X i t − ˆ X i t || 2 ] + ˜ J where ˜ J =E T X s=1 cI(U s 6= 0) + T X s=1 X j6=i ||X j s − ˆ X j s || 2 + X s:s6=t ||X i s − ˆ X i s || 2 . Since ψ i t does not affect the term ˜ J, the optimal estimate can be computed by solving inf ψ i t E[||X i t − ˆ X i t || 2 ]. This is the standard MMSE problem whose solution has the form given in the lemma. Let Z i t be the most recent state observation received atE i before time t with Z i 1 = 0,∀i . Z i t satisfies the following recursive relation, Z i t+1 = Y i t if Y i t 6=? Z i t otherwise (5.79) Define E i t =X i t −Z i t to be the difference between the current state and the most recently sent state value for subsystem i. Note that any scheduling strategy γ t (X t , Y 1:t−1 ) can be alternately specified in terms of E t , Z t and Y 1:t−1 as ˜ γ t (E t , Z t , Y 1:t−1 ). We will use the latter representation of the scheduling strategies for the rest of the paper. Let Γ sym be 136 the class of scheduling strategies such that γ t (E t , Z t , Y 1:t−1 ) =γ t (|E t |, Z t , Y 1:t−1 ), that is, the scheduling decision depends only on the absolute value of E i t for all i. We will restrict the search for the optimal scheduling strategy to the class of symmetric strategies Γ sym as stated below. Assumption 5.1. We restrict the search for scheduling strategy to the class Γ sym Furthermore, we make the following assumption on the noise distribution: Assumption 5.2. f W is symmetric and unimodal around 0. The restriction to the class of symmetric scheduling strategies helps us to simplify the structure of the estimator. We are now ready to state the optimal estimation strategy in the following lemma. Lemma 5.8. Under Assumption 5.1 and 5.2, the optimal estimate ofE i at time t is, ˆ X i,∗ t = Y i t if Y i t 6=? Z i t otherwise (5.80) Proof. See Appendix E.1. 5.3.3 Optimal scheduling strategy Dynamic Program Once we have derived the optimal estimation strategy, the problem reduces to finding the optimal scheduling strategy of the network manager which is a single decision maker 137 problem. For ease of exposition we will now assume N = 2 for the rest of the paper. However, all the results can be easily extended for a general N. We can formulate the network manager’s problem as a Markov Decision Process(MDP) as follows: 1. State Process: E t , Action Process: U t . 2. Controlled Markovian evolution of the states: The state evolves from E t to E t+1 based on the realization of E t and choice of U t . If U t 6=i, then E i t+1 =E i t +W i t else E i t+1 =W i t . 3. Instantaneous Cost: The cost incurred at timet isl(E t ,U t ) =cI(U t 6= 0)+ P i |E i t | 2 I(U t 6= i). Finding the optimal scheduling strategy for Problem 5.4 is equivalent to the finding the optimal decision strategy for the above MDP which minimizesE[ P T t=1 l(E t ,U t )]. Using the standard results from Markov decision theory, we can write down the dynamic program whose value functions are given by, V T +1 (E 1 T +1 ,E 2 T +1 ) = 0 (5.81) V t (E 1 t ,E 2 t ) = min c +|E 1 t | 2 +EV t+1 (E 1 t +W 1 t ,W 2 t ),c +|E 2 t | 2 +EV t+1 (W 1 t ,E 2 t +W 2 t ), |E 1 t | 2 +|E 2 t | 2 +EV t+1 (E 1 t +W 1 t ,E 2 t +W 2 t ) (5.82) 138 -3 -2 -1 0 1 2 3 E 1 1 -3 -2 -1 0 1 2 3 E 1 2 U = 2 U = 1 U = 1 U = 0 U = 2 Figure 5.5: Decision region for the network manager at t = 1 The minimization in (5.82) is over the choices of control actionsU t ∈{0, 1, 2}. 
The optimal scheduling strategy can be computed by finding the minimizer in equation (5.82) and com- puting the value functions V t in a backward inductive manner. This dynamic program can be solved numerically by discretizing the state space of the value functions V t (·). However, since E t is allowed to take any arbitrary value inR 2 , the complexity of solving the dynamic program numerically would be large. In the next subsection, we try to infer some structural properties of the value functions to gain more insight about the optimal scheduling strategy. 5.3.4 Numerical example In this section, we look at the no-transmission region for a three stage problem (T = 3) withN = 2 subsystems. We setc = 2 and assume that all the noise in the system have the standard normal distribution. We compute the value functions of the dynamic program by discretizing the state space using a uniform grid of step size of 0.02. Figure 5.5 shows the decision region for the network manager at t = 1 with the optimal action for each region annotated in the figure. We note the following: 139 1. The shape of the NT region is not a square unlike a single stage problem where the NT region is a square. 2. We also observe from simulations that the size of theNT region is increasing with time. This is due to the fact that estimation errors made early in time can have compouding effects in future. Hence, the optimal policy tends to schedule a subsystem more often early in the time. 5.3.5 Structural Results The following lemma states the main result of this section. Lemma 5.9. 1. V t (·) is a function of|E t |. 2. V t (·) is monotonically increasing in both|E 1 t | and|E 2 t |. 3. V t (x,y) =V t (y,x) for all x,y∈R. 4. If U ∗ t 6= 0 , then U ∗ t = argmax i |E i t | Proof. See Appendix E.2. The above lemma states that|E t | serves as the information state of the dynamic program. Also, it partially specifies the optimal scheduling strategy as follows: Whenever the network manager decides to schedule a subsystem, it will pick the subsystem with the largest|E i t |. The following is a corollary to the above lemma in the special case of c = 0. 140 Corollary 5.1. Consider the special case when there is no communication cost to transmit a subsystem’s state, that is, c = 0. In this case, the network manager will always schedule a subsystem at each time. This is because the last term in the dynamic program (5.82) can be shown to be always dominated by the first two terms in the minimization using the monotonicy of V t+1 (·). Hence, the optimal scheduling strategy is completely characterized using Lemma 5.9 and has the form U ∗ t = argmax i |E i t |. Remark 5.5. The results obtained in Section 5.3.2 and Section 5.3.3 can be extended to the vector case when X i t ∈ R n i . In the vector case, we define symmetric strategies to be those which depend only on||E i t || for all i. Then, under the same assumptions, Lemma 5.8 holds true and the statement of Lemma 5.9 holds with|E i t | replaced by||E i t ||. 5.3.6 Characterizing no transmission region Lemma 5.9 gives a clear guideline on which subsystem to pick under the optimal scheduling strategy given that U ∗ t 6= 0. However, its not yet clear, when should the network manager decide to not schedule any of the subsystem. We try to answer this question using the dynamic program in this subsection. To that end, define the function f t (x,y) =EV t+1 (x + W 1 t ,y +W 2 t ). 
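The dynamic program (5.81)-(5.82) can be solved numerically as described above. The following sketch reproduces the setting of the numerical example in section 5.3.4 (N = 2, T = 3, c = 2, standard normal noise) by backward induction on a truncated grid; the grid limits, the coarser step size, and the Monte Carlo approximation of the expectations are illustrative choices.

import numpy as np
from itertools import product

def solve_dp(T=3, c=2.0, lim=3.0, step=0.1, n_w=2000, seed=0):
    """Backward induction for the two-sensor DP (5.81)-(5.82) on a truncated grid."""
    rng = np.random.default_rng(seed)
    grid = np.arange(-lim, lim + step / 2, step)
    w1 = rng.standard_normal(n_w)          # Monte Carlo samples of W^1_t
    w2 = rng.standard_normal(n_w)          # Monte Carlo samples of W^2_t

    def lookup(V, e1, e2):                 # nearest-grid lookup with clipping at the boundary
        i = np.clip(np.rint((e1 + lim) / step).astype(int), 0, len(grid) - 1)
        j = np.clip(np.rint((e2 + lim) / step).astype(int), 0, len(grid) - 1)
        return V[i, j]

    V = np.zeros((len(grid), len(grid)))   # V_{T+1} = 0
    policy = None
    for t in range(T, 0, -1):
        Vnew = np.empty_like(V)
        act = np.empty_like(V, dtype=int)
        for (i, e1), (j, e2) in product(enumerate(grid), enumerate(grid)):
            q0 = e1**2 + e2**2 + lookup(V, e1 + w1, e2 + w2).mean()   # U_t = 0
            q1 = c + e2**2 + lookup(V, w1, e2 + w2).mean()            # U_t = 1
            q2 = c + e1**2 + lookup(V, e1 + w1, w2).mean()            # U_t = 2
            qs = (q0, q1, q2)
            Vnew[i, j], act[i, j] = min(qs), int(np.argmin(qs))
        V, policy = Vnew, act
    return grid, V, policy                 # policy holds the optimal action at t = 1

grid, V1, policy = solve_dp()
# the no-transmission region at t = 1 is {(e1, e2) : policy == 0}, cf. Figure 5.5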
Then, V t (E 1 t ,E 2 t ) = min c +|E 1 t | 2 +f t (E 1 t , 0),c +|E 2 t | 2 +f t (0,E 2 t ),|E 1 t | 2 + |E 2 t | 2 +f t (E 1 t ,E 2 t ) . It can be shown that f t also has the following properties similar to V t (see proof of Lemma 5.9): 1. f t (·) is a function of|x|,|y|. 2. f t (·) is monotonically increasing in both|x| and|y|. 141 3. f t (x,y) =f t (y,x) for all x,y∈R. The network manager will decide not to transmit any of the subsystems’ states at time t iff, |E 1 t | 2 +f t (E 1 t ,E 2 t )≤c +f t (E 2 t , 0) (5.83) |E 2 t | 2 +f t (E 1 t ,E 2 t )≤c +f t (E 1 t , 0) (5.84) Let NT be shorthand to the no-transmission region in the E 1 t -E 2 t space. Note, that NT is contained inside the square of width 2 √ c centered around the origin because if|E 1 t |> √ c then (5.83) fails to hold as f t (E 1 t ,E 2 t )≥ f t (E 2 t , 0) due to monotonicity of f t . Similarly, if |E 2 t | > √ c then (5.84) is not satisfied. However, it is not clear under what condition on |E 1 t |,|E 2 t | the inequalities (5.83),(5.84) hold true. In the rest of this subsection, we will try to derive another property of the value functions which can help us in characterizing NT region. We will focus only on the points in the non-negative quadrant in E 1 t -E 2 t space for the rest of this subsection, due to the symmetry of value functions. We will now use (x,y) to denote a particular(non-negative) realization of (E 1 t ,E 2 t ). We say (x 1 ,y 1 ) (x 2 ,y 2 ) if x 1 ≤x 2 and y 1 ≤y 2 . Definition 5.2. Let (x 1 ,y 1 ) (x 2 ,y 2 ). We say that the value function V t has the property S if U ∗ t = 0 for (x 2 ,y 2 ) implies that U ∗ t = 0 for (x 1 ,y 1 ). We will now find a sufficient condition under which the value function V t satisifes the property S. For that purpose, we define the following: 142 Definition 5.3. We say a functiong :R 2 + →R has increasing differences ifg(x,y)−g(x, 0) is a increasing function of x for all y> 0 and g(x,y)−g(0,y) is a increasing function of y for all x> 0. The following lemma gives a sufficient condition on the function f t under which V t has the property S. Lemma 5.10. V t has the property S if the function f t has increasing differences. Proof. Let (x 1 ,y 1 ) (x 2 ,y 2 ) and U ∗ t = 0 when (E 1 t ,E 2 t ) = (x 2 ,y 2 ). Then, f t (x 1 ,y 1 )−f t (x 1 , 0)≤f t (x 2 ,y 1 )−f t (x 2 , 0) ≤f t (x 2 ,y 2 )−f t (x 2 , 0)≤c−y 2 2 ≤c−y 2 1 where the first inequality follows sincef t has increasing differences, second follows from the monotonicity of f t , third inequality is due to (5.84) since U ∗ t = 0 when (E 1 t ,E 2 t ) = (x 2 ,y 2 ). Similarly, we can show that f t (x 1 ,y 1 )−f t (0,y 1 )≤c−x 2 1 . Therefore, (x 1 ,y 1 ) satisfies both (5.83),(5.84), which implies that U ∗ t = 0 when (E 1 t ,E 2 t ) = (x 1 ,y 1 ). Remark 5.6. Note that the terminal value function V T always satisfies the property S. This can be easily observed since f T (x,y) = 0 everywhere and it satsifies the increasing differences property trivially. As a matter of fact, NT at the terminal time T is a square centered around origin and of width 2 √ c. 143 The next lemma shows that the function f T−1 also has the increasing difference property. This would imply that the value function V T−1 satsifies property S. The function f T−1 is given as, f T−1 (x,y) =E min c+(W 1 T−1 +x) 2 ,c + (W 2 T−1 +y) 2 , (W 1 T−1 +x) 2 + (W 2 T−1 +y) 2 (5.85) Lemma 5.11. f T−1 has increasing differences. Consequently, V T−1 has the property S. Proof. See Appendix E.3. 
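Lemma 5.11 can also be verified numerically. The sketch below estimates f_{T-1} from (5.85) by Monte Carlo for standard normal noise and c = 2, and checks that f_{T-1}(x, y) - f_{T-1}(x, 0) is nondecreasing in x for a few fixed y > 0; the grid, sample size, and the tolerance allowed for Monte Carlo noise are illustrative choices.

import numpy as np

rng = np.random.default_rng(0)
c = 2.0
w1 = rng.standard_normal(400_000)          # samples of W^1_{T-1}
w2 = rng.standard_normal(400_000)          # samples of W^2_{T-1}

def f_Tm1(x, y):
    """Monte Carlo estimate of f_{T-1}(x, y) in (5.85)."""
    a, b = (w1 + x) ** 2, (w2 + y) ** 2
    return np.minimum.reduce([c + a, c + b, a + b]).mean()

xs = np.linspace(0.0, 3.0, 13)
for y in (0.5, 1.0, 2.0):
    diffs = np.array([f_Tm1(x, y) - f_Tm1(x, 0.0) for x in xs])
    # increasing differences (Definition 5.3): the gap should grow with x
    ok = bool(np.all(np.diff(diffs) >= -1e-2))   # tolerance for Monte Carlo noise
    print(f"y = {y}: difference nondecreasing in x? {ok}")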
It is hard to show analytically that the functions f t for t < T− 1 have the increasing differences property. However, intuitively it seems that if the scheduling decision is U ∗ t = 0 for some E t then it should be also be 0 for ˜ E t when ˜ E t E t . We conjecture the following, Conjecture 1. V t satisfies the property S for all t. The above conjecture has the following implication: If (x,y) lies in theNT region at timet then the whole rectangle centered around the origin with (x,y) as the top right corner also lies in the NT region. Remark 5.7. Consider figure 5.5 of the example in Section 5.3.4. It can be observed from the NT region that the value function V 1 indeed satisfies the property S which substantiates Conjecture 1. Also, the decision outside NT region agrees with the result in Lemma 5.9. 144 5.3.7 Discussion Property S can be used to find two squares A in and A out , both centered around origin, such that NT contains A in and is contained inside A out , i.e., A in ⊂ NT ⊂ A out . Given suchA in ,A out and the value function V t+1 , we can compute V t insideA in without carrying out the minimization and evaluating only the last expression corresponding to U t = 0 in (5.82). Also, for all points (x,y) lying outside A out , the optimal scheduling decision can be computed using Lemma 5.9. Therefore, to computeV t outsideA out , we again need not carry out the minimization and compute the appropriate term corresponding to the scheduling decision in (5.82). For a general N, this reduces the computation overhead from O(N) to O(1) for all the points inside A in and outside A out , which is significant when N is large. We can exploit the symmetry of the value functions to obtain A in via a binary search procedure as follows: Initialize the variablesx low = 0 andx high = √ c and let ¯ x = x low +x high 2 . We intialize x high = √ c since we know that the NT region is contained in the square of width 2 √ c centered around the origin. Check whether (¯ x, ¯ x) lies in the NT region using the dynamic program. If it does then set x low = ¯ x else set x high = ¯ x. Repeat this process till x low and x high are close within some accuracy. Finally, set A in to the square centered around origin with (x low ,x low ) as the top right corner. A out can also be obtained by a similar binary procedure but along one of the axis as follows: Initialize x low = 0 and x high = √ c and let ¯ x = x low +x high 2 . Check whether (¯ x, 0) lies in the NT region using the dynamic program. If it does then set x low = ¯ x else set x high = ¯ x. 145 Repeat this process till x low and x high are close within some accuracy. Set A out to the square centered around origin with (x high ,x high ) as the top right corner. 5.4 Conclusions In this chapter we studied the problem of optimal scheduling and estimation in a sequential remote estimation system where non-collocated sensors and estimators communicate over a shared medium. The access to the communication resources is granted by a scheduler, which implements an observation-driven medium access control scheme to avoid packet collisions. We first looked at the case when the sensors’ source dynamics are independent and identi- cally distributed in time and the scheduler is harvesting random energy from its environ- ment. Our objective was to jointly design the scheduling and estimation strategies which minimize the total communication cost and estimation error. 
We derive the globally optimal scheduling and estimation strategy under some assumption on the sensor state’s distribu- tion. Our approach was to first relax the problem by expanding the information at the estimators and then use the common information approach to solve the relaxed problem. Finally, we show that the solution to the relaxed problem is also feasible for our original problem and hence is optimal. Next, we moved on to the case when the sensors’ source dynamics are Markovian. Under the independence and symmetry assumptions on the noise distribution and by restricting the search for scheduling strategy to the class of symmetric strategies, we showed that the opti- mal estimate for each estimator is its most recently received observation. We also obtained 146 a dynamic program which characterized the optimal scheduling strategy. Furthermore, we established some properties of the value functions of the dynamic program which may be helpful in computing the best symmetric scheduling strategy. 147 Chapter 6 Worst Case Guarantees for Remote Estimation 6.1 Introduction Information collection is essential for most engineering systems. In many applications, sen- sors are deployed to collect and send information to a base station/control center to estimate or control the state of the system. In environmental monitoring, for example, remote sensors are used to measure environmental variables such as temperature, rainfall, soil moisture, etc. The sensors collect information and transmit it to the base station through wireless communication. For a sensor with limited battery, the energy spent in communication is a significant factor determining the battery lifespan. Since battery replacement is expen- sive for remote sensors, it is important for sensors to adopt a transmission schedule that 148 preserves energy while achieving a desired level of estimation accuracy. Similar scenarios of remote estimation also arise in other applications such as smart grids, networked control systems and healthcare monitoring [109–111]. The remote estimation problem with one sensor and one estimator has been studied un- der two different communication models: i) Remote estimation with pull communication protocol: In this class of problems the estimator decides when to get data from the sen- sor. Since the estimator is the only decision-maker in the system, this protocol leads to a centralized sequential decision-making problem. Instances of such problems have been studied in [112–115]. ii) Remote estimation with push communication protocol: Here, the sensor makes the decision about when to send data to the estimator. The estimator de- cides, at each time, what estimate to produce. This leads to a decentralized decision-making problem with the sensor and the estimator as the two decsion-makers. Computing jointly optimal scheduling and estimation strategies in a decentralized setup is difficult in general. However, several works have addressed this problem by placing some restrictions on the transmission/estimation strategies and/or by making certain assumptions about the source statistics. For example, [96, 116] studied the problem of remote estimation under limited number of transmissions when the state process is i.i.d. and the transmission strategy is restricted to be threshold-based. A continuous-time version of the problem is considered in [117] with a Markov state process, limited number of transmissions and a fixed estimation strategy. 
[118] derived the optimal communication schedule assuming a Kalman-like esti- mator. Jointly optimal scheduling and estimation strategies were derived in [20, 97, 119] 149 for Markov sources that satisfied certain symmetry assumptions on their probability distri- butions. The uncertainties in all the aforementioned work are modeled as random variables and the objective is to minimize the expected sum cost over a finite time horizon. However, in many applications, there is no statistical model for the system variables of interest. Furthermore, guarantees on estimation accuracy at each time instant may be critical for safety concerned systems such as healthcare monitoring. For example, while monitoring the heartbeat of a patient it is desirable that the estimation error at each time is minimal. In this chapter, we consider an uncertain source that can be modeled as a discrete-time autoregressive process with bounded noise. The source is observed by a sensor with limited communication budget. The sensor can communicate with a remote estimator that needs to produce real-time estimates of the source state. Given such a model, we are interested in the worst-case guarantee on estimation error at any time that can be achieved under a limited communication budget. Put another way, we want to find the minimum communication budget needed to ensure that the worst-case estimation error at any time is below a given threshold. In order to address these questions, we consider a minimax formulation of the remote estimation problem. Our goal is to design a communication scheduling strategy for the sensor and an estimation strategy for the estimator to jointly minimize the worst-case instantaneous estimation cost over all realizations of the source process. Centralized decision and control problems where the goal is to minimize a worst-case cost have long been studied in the literature. One prominent line of work has focused on develop- ing dynamic program type approaches for minimax problems [120–124]. These centralized 150 minimax dynamic programs use analogues of stochastic dynamic programming concepts such as information states and value functions. The centralized minimax dynamic program can be interpreted in terms of a zero-sum game between the controller and an adversary who selects the disturbances to maximize the cost metric [120, 121]. Dynamic games based approaches for minimax design problems were also studied in [125]. Minimax problems where the goal is to minimize the worst-case maximum instantaneous cost were studied in [126–128]. In the centralized minimax problems described above, the uncertainties are described in terms of the set of values they can take. In contrast, some minimax problems have looked at systems with stochastic uncertainties. In these problems, the parameters of the stochastic uncertainties are ambiguous. These parameters are either fixed apriori but unknown or they are chosen dynamically by an adversary. In either case, the objective of the control problem is to minimize the maximum expected cost corresponding to the worst-choice of unknown parameters. Examples of this line of work include [129–133]. Our minimax problem is most closely related to the minimax control problems studied in [120] and [127, 128]. The minimax problems in [120] and [127, 128] were centralized decision-making problem involving a single decision-maker acting over time. 
In contrast, our minimax problem involves two decision-makers, the sensor and the estimator, making decisions based on different information. The decentralized nature of our decision problem creates issues such as signaling where decision-makers may communicate implicitly through their actions. A decision to not communicate by the sensor, for example, can implicitly convey some information about the source to the estimator. Such signaling effects are a key 151 reason why the joint optimization of strategies becomes a difficult problem [20, 97]. A class of decentralized minimax control problems with partial history sharing were investigated in [134]. In order to jointly optimize the strategies for the sensor and the estimator while taking into account the signaling between them, we extend the coordinator-based approach of [15], [20] which was developed for a stochastic model and expected cost criterion to our minimax setting. Using this, we explicitly identify optimal communication scheduling and estimation strategy for our minimax problem. Organization: We start with a general centralized minimax control problem in Section 6.2 and then formulate the minimax remote estimation problem in Section 6.3. We formulate an equivalent centralized minimax control problem in Section 6.4 and derive the optimal scheduling and estimation strategies in Section 6.5. We conclude in Section 6.6. Notation and Uncertain Variables: X a:b denotes the collection of variables (X a ,X a+1 ,...,X b ). I A denotes the indicator function of an event A. We now review the concept of uncertain variables as defined in [135]. An uncertain variable is a mapping from some underlying sample space Ω to a space of interest. We use capital letters to denote uncertain variables while small letters denote their realizations and script letters denote the spaces of all possible realizations. For example, an uncertain variable X has a realization X(ω) =x∈X for an outcome ω∈ Ω. Instead of probability measures as in the case of random variables, uncertain variables can be analyzed using their ranges. The range ofX is defined as [[X]] ={X(ω) :ω∈ Ω}. Similarly, 152 for a collection of uncertain variables X 1 ,...,X n , [[X 1 ,...,X n ]] ={(X 1 (ω),...,X n (ω)) : ω∈ Ω}. The conditional range of X given Y = y is denoted by [[X|y]] (or [[X|Y =y]]) and is defined as{X(w) :Y (w) =y,w∈ Ω}. We also define the uncertain conditional range [[X|Y ]] as an uncertain variable that takes the value [[X|y]] when Y takes the value y. Using the ranges of uncertain variables, an analogue of statistical independence can be defined as follows [Definition 2.1 [135]]. Definition 6.1. Uncertain variables X 1 ,X 2 ,...,X n are unrelated if [[X 1 ,...,X n ]] = [[X 1 ]]×···× [[X n ]] (6.1) where× is the Cartesian product. The following property comes from the definition of unrelated uncertain variables [Lemma 2.1 [135]]. Property 1. If X is unrelated to (Y,Z), that is, [[X,Y,Z]] = [[X]]× [[Y,Z]], then [[X,Y|Z]] = [[X]]× [[Y|Z]]. (6.2) For a function f(x), we define sup X f(X) := sup x∈[[X]] f(x) to denote its supremum over the range of X. Similarly sup X 1:n f(X 1:n ) := sup x 1:n ∈[[X 1:n ]] f(x 1:n ). Also, sup X|y f(X) := sup x∈[[X|y]] f(x) denotes the supremum of f(x) over the conditional range. For a bivariate function f(x,y), we have the following property. 153 Property 2. If X,Y are uncertain variables. Then sup X,Y f(X,Y ) = sup x∈[[X]] sup y∈[[Y|x]] f(x,y). 
(6.3) Furthermore, if Z is another uncertain variable, then sup (X,Y )|z f(X,Y ) = sup x∈[[X|z]] sup y∈[[Y|x,z]] f(x,y). (6.4) Note that the above property is the analogue of the tower property of conditional expecta- tion with supremum playing the role of expectation. 6.2 Minimax Control with Maximum Instantaneous Cost Ob- jective Consider a discrete time system with stateS t ∈S and observationO t ∈O evolving accord- ing to the following dynamics: S t+1 =f t+1 (S t ,A t ,N t+1 ), (6.5) O t+1 =h t+1 (S t ,A t ,N t+1 ), (6.6) where A t is the control action, N t is the noise, t ∈ T = {1, 2,...,T}, S 1 = N 1 and O 1 =h 1 (S 1 ). The noise processN ={N t ,t = 1,...,T} is a sequence of unrelated uncertain 154 variables. We assume that the state has two components, S t = (S h t ,S o t ), where S h t is the hidden part and S o t ∈S o is the observable part. At each time t, the controller’s available information is Q t = (O 1:t ,S o 1:t ,A 1:t−1 ). Note that Q t includes the history of observations O 1:t , the history of observable part of the states S o 1:t and the past control actions A 1:t−1 . Q t denotes the set of all possible values of Q t . The set of available control actions att, which may depend on the directly observable state S o t , isA(S o t ). Based on the available information at t, the controller takes a control action according to a function η t :Q t 7→A(S o t ) as A t =η t (Q t ). (6.7) We call η = (η 1 ,η 2 ,...,η T ) a strategy of the controller. The instantaneous cost at time t is ρ t (S t ,A t ). The minimax control objective is to find a strategy η that minimizes the worst-case maximum instantaneous cost. Thus, the strategy optimization problem is inf η n sup N 1:T max t∈T ρ t (S t ,A t ) o . (6.8) Let Π t = [[S h t |Q t ]] be the conditional range of the hidden part of the state S h t given the available information Q t . LetB denote the space of all possible Π t . Note that S o t belongs to Q t , so conditional range of S o t given Q t is the singleton set i.e. [[S o t |Q t ]] ={S o t }. The conditional range Π t along with S o t can be used as an information state for decision- making in the minimax control problem. In particular, we can obtain the following dynamic programming result using arguments from [128]. 155 Theorem 6.1. For each t∈T , define functions V ∗ t :B×S o 7→R as follows: i) For π T ∈B,s o T ∈S o , V ∗ T (π T ,s o T ) := inf a T ∈A(s o T ) sup s h T ∈π T ρ T ((s h T ,s o T ),a T ), (6.9) ii) For t<T,π t ∈B,s o t ∈S o , V ∗ t (π t ,s o t ) := inf at∈A(s o t ) n sup s h t ∈πt,n t+1 ∈[[N t+1 ]] max ρ t (s t ,a t ),V ∗ t+1 (Π t+1 ,S o t+1 ) o . (6.10) where Π t+1 is given as follows, Π t+1 ={s h t+1 :s t+1 =f t+1 (s t ,a t ,n t+1 ),h t+1 (s t ,a t ,n t+1 ) =o t+1 s t = (s h t ,s o t ),s h t ∈π t ,n t+1 ∈ [[N t+1 ]]}. If the infimum in (6.9), (6.10) is achieved, then for eachπ t ∈B ands o t ∈S o the minimizing a t in (6.9)-(6.10) gives the optimal action at time t for t∈T . Moreover, the optimal cost is given by sup Q 1 V ∗ 1 (Π 1 ,S o 1 ). Proof. See Appendix F.1 156 X t Sensor Estimator ˆ X t Y t U t Figure 6.1: Remote Estimation setup 6.3 Problem Formulation Consider a communication problem between a sensor (transmitter) and an estimator (re- ceiver) over a finite time horizonT ={1, 2,...,T}, T≥ 1. The sensor perfectly observes a discrete-time uncertain processX t ∈R n which evolves according to the following dynamics X t+1 =λAX t +N t+1 , (6.11) where λ is a scalar and A is an orthogonal matrix. 
N t is an uncertain variable which lies in the ball of radius a t around the origin i.e. ||N t ||≤a t . We assume that the initial state X 1 =N 1 . The numbers a 1 ,...,a T are finite. Since all the noise in the system is bounded, the state X t also remains bounded for all t. LetX⊂R n denote a bounded set such that X t ∈X for t∈T . The sensor can send the observed state to the estimator through a perfect channel. However, each transmission consumes one unit of sensor’s energy, and the sensor has a limited energy budget of K units 1 with 1≤K <T . Let E t denote the energy available at time t. We use U t to denote the transmission decision at time t. U t is 1 if the current state observation is transmitted and 0 otherwise. Note that U t ∈U(E t ) whereU(E t ) ={0, 1} if E t > 0 and U(0) ={0} i.e. there can be no transmission at time t if E t = 0. The energy at time t + 1 1 K is a fixed known integer and not an uncertain variable 157 can be written as: E t+1 = max(E t −U t , 0). (6.12) The estimator receives Y t at time t which is given as, Y t =h(X t ,U t ) = X t if U t = 1, if U t = 0, (6.13) where denotes no transmission. The sensor makes the transmission decision at t based on available information X 1:t ,E 1:t ,Y 1:t−1 , U t =f t (X 1:t ,E 1:t ,Y 1:t−1 ), (6.14) where f t is the transmission strategy of the sensor at time t. We call the collection f = (f 1 ,f 2 ,...,f T ) the transmission strategy. The estimator produces an estimate of the state ˆ X t based on its received information Y 1:t at time t as follows: ˆ X t =g t (Y 1:t ), (6.15) where g t denotes the estimation strategy at time t. The collection g = (g 1 ,g 2 ,...,g T ) is referred to as the estimation strategy. The cost incurred under a transmission strategy f and estimation strategy g is the worst case maximum instantaneous distortion cost over the entire horizon, given by, J(f, g) = sup N 1:T max t∈T ||X t − ˆ X t ||. (6.16) 158 We can now formulate the following problem. Problem 6.1. Determine a transmission strategy f for the sensor and an estimation strat- egy g for the estimator which jointly minimize the cost J(f, g) in (6.16). min f,g J(f, g) subject to (6.11)− (6.15) Remark 6.1. Communication scheduling and remote estimation problems similar to Prob- lem 6.1 have been studied in [20, 96, 97]. The key differences between the problems in [20, 96, 97] and Problem 6.1 are: (i) source model- [20, 96, 97] deal with a stochastic source model whereas the source model in Problem 6.1 is non-stochastic; (ii) objective- [20, 96, 97] deal with minimizing an expected cumulative cost over a time horizon whereas the objective in Problem 6.1 is to minimize the worst-case instantaneous cost. The objective in Problem 6.1 may be more suitable for safety critical systems. Next, we provide a structural result which establishes that the sensor can ignore past values of the source and energy levels without losing performance. Lemma 6.1. The transmission strategy can be restricted to the form U t = f t (X t ,Y 1:t−1 ) without any loss in performance. Proof. See Appendix F.2. Problem 6.1 is a minimax sequential decision-making problem with two decision-makers (the sensor and the estimator). We will adopt the common information approach [20] for 159 stochastic remote estimation problem to our minimax problem. This involves formulating a single-agent sequential decision-making problem from the perspective of an agent who knows the common information. 
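For a scalar source and a short horizon, the worst-case cost (6.16) of a fixed strategy pair can be approximated by exhaustive search over a discretized noise sequence. The sketch below does this for an illustrative threshold transmission strategy and a hold-and-propagate estimator; the strategies, horizon, and discretization are assumptions made for illustration and are not the optimal design derived later in this chapter.

import numpy as np
from itertools import product

def worst_case_cost(T, K, a, lam, tau, n_grid=9):
    """Approximate sup over noise of max_t |X_t - Xhat_t| for a scalar source (n = 1).

    Transmission strategy: transmit when the error exceeds tau and energy remains.
    Estimation strategy: propagate the last received value by lam (0 before any reception).
    a[t] is the noise bound at time t+1; lam is the scalar lambda in (6.11) with A = 1.
    """
    worst = 0.0
    grids = [np.linspace(-a[t], a[t], n_grid) for t in range(T)]
    for noise in product(*grids):               # exhaustive search over discretized noise
        x, xhat, e, peak = 0.0, 0.0, K, 0.0
        for t in range(T):
            x = lam * x + noise[t]              # source update (6.11), X_1 = N_1
            xhat_pred = lam * xhat              # estimator's propagated estimate
            if e > 0 and abs(x - xhat_pred) > tau:
                xhat, e = x, e - 1              # transmit: the estimator receives X_t
            else:
                xhat = xhat_pred                # the estimator receives the symbol epsilon
            peak = max(peak, abs(x - xhat))
        worst = max(worst, peak)
    return worst

# Horizon 4, budget 2, unit noise bounds, lambda = 1, threshold 0.5:
print(worst_case_cost(T=4, K=2, a=[1.0] * 4, lam=1.0, tau=0.5))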
In our setup, we can adopt the estimator’s perspective to formulate the single-agent problem as done in the following section. 6.4 An Equivalent Problem We now formulate a new sequential decision problem that will help us to solve Problem 6.1. In the new problem, we consider the model of Section 6.3 with the following modification. At the beginning of t th time step, the estimator selects a mapping Γ t :X7→{0, 1}. Γ t will be referred to as the estimator’s prescription to the sensor. The sensor uses the prescription to evaluate U t as follows: U t = Γ t (X t ). (6.17) The estimator selects the prescription based on its available information, that is, Γ t =d t (Y 1:t−1 ), (6.18) where the functiond t is referred to as the prescription strategy at time t. At the end oft th time step, the estimator produces an estimate ˆ X t as follows ˆ X t =g t (Y 1:t ), (6.19) 160 whereg t is the estimation strategy at timet. The cost incurred by the prescription strategy d = (d 1 ,...,d T ) and the estimation strategy g = (g 1 ,...,g T ) is, ˆ J(d, g) = sup N 1:T max t∈T ||X t − ˆ X t ||. We consider the following problem, Problem 6.2. Determine a prescription strategy d and an estimation strategy g to mini- mize the cost ˆ J(d, g). min d,g ˆ J(d, g) subject to (6.11)− (6.13), (6.17), (6.18), (6.19) In Problem 6.2, the estimator is the sole decision-maker since the sensor merely evaluates the prescription at the current source state. Problem 6.2 can be shown to be equivalent to Problem 6.1 in a similar manner as in [20] for the stochastic remote estimation problem. The main idea is that for every choice of sensor strategy f there exists an equivalent prescription strategy d and vice-versa. Since this equivalence is true for every realization of the uncertain variables N 1:T , the stochastic case argument also holds in this minimax scenario. Problem 6.2 can be seen as an instance of the minimax problem formulated in Section 6.2 as follows: 161 X t Γ t Y t ^ X t t t+ t+1 E t ~ E t Figure 6.2: Timeline of state realization, observations and actions in coordinator’s problem 1. We can imagine the system operating with 2T decision points by splitting each time instant into two decision points: (i) At each time t, before the transmission at that time the estimator decides the prescription Γ t ; (ii) After receiving Y t , the estimator decides ˆ X t . We denote this decision point by t+ (See Figure 7.2). 2. State: At t, the state is S t = (S h t ,S o t ) = (X t ,E t ) since E t is observable by the estimator. At t+, S t+ = (S h t+ ,S o t+ ) = (X t , ˜ E t ) where ˜ E t is the post-transmission energy given as ˜ E t =E t − Γ t (X t ). (6.20) 3. Actions : At t, action A t = Γ t ∈A(E t ), whereA(E t ) is the collection of functions fromX toU(E t ). Recall thatU(0) ={0} andU(E t ) ={0, 1} for E t > 0. At t+, action A t+ = ˆ X t ∈R n . 4. Information: The information available at time t to choose a prescription is Q t = {Y 1:t−1 , Γ 1:t−1 , ˆ X 1:t−1 } and at time t+ to generate ˆ X t is Q t+ ={Y 1:t , Γ 1:t , ˆ X 1:t−1 }. 5. Cost: The instantaneous cost at timet,ρ t (S t ,A t ) = 0 and at timet+,ρ t+ (S t+ ,A t+ ) = ||X t − ˆ X t ||. 162 Since Problem 6.2 is an instance of the minimax problem of Section 6.2, we can use Theorem 6.1 to conclude that the optimal strategy is a function of the conditional range of the state (X t ,E t ) given the estimator’s information. SinceE t is known to the estimator, we just need to define the conditional range ofX t . 
For that purpose, we define Θ t as the pre-transmission conditional range of X t and Π t as the post-transmission conditional range of X t at time t as follows: Θ t = [[X t |Q t ]], Π t = [[X t |Q t+ ]]. The following lemma describes the evolution of the sets Θ t and Π t . Lemma 6.2. 1. The pre-transmission conditional range Θ t+1 at time t + 1 is a function of Π t i.e. Θ t+1 =φ t (Π t ). 2. The post-transmission conditional range Π t is a function of Θ t , Γ t and Y t i.e. Π t = ψ(Θ t , Γ t ,Y t ). Proof. 1. Given the post-transmission conditional range Π t , Θ t+1 is given as Θ t+1 = n x t+1 :x t+1 =λAx t +n t+1 for somex t ∈ Π t and||n t+1 ||≤a t+1 o . (6.21) :=φ t (Π t ) 163 2. Given the pre-transmission conditional range Θ t , Π t can be evaluated after receiving Y t as follows Π t = {Y t } ifY t 6= {x t ∈ Θ t : Γ t (x t ) = 0} otherwise (6.22) :=ψ(Θ t , Γ t ,Y t ) LetB denote the space of all possible realizations of Π t , Θ t andE ={0, 1,...,K}. Then, Theorem 6.1 can be used to write a dynamic program which characterizes the optimal estimates ˆ X t and the optimal prescriptions Γ t in Problem 6.2 as follows, Lemma 6.3. For t∈T , define the functions V t :B×E 7→ R and W t :B×E 7→ R as follows: (i) For π T ∈B and ˜ e T ∈E define 2 , V T (π T , ˜ e T ) := inf ˆ x T ∈R n sup x T ∈π T ||x T − ˆ x T ||, (6.23) (ii) For t∈T , θ t ∈B and e t ∈E define, W t (θ t ,e t ) := inf γt∈A(et) sup xt∈θt V t (ψ(θ t ,γ t ,y t ),e t −γ t (x t )), (6.24) 2 ˜ et denotes a realization of the post-transmission energy as defined in (6.20). 164 where y t =h(x t ,γ t (x t )). (iii) For t<T , π t ∈B and ˜ e t ∈E define, V t (π t , ˜ e t ) := inf ˆ xt∈R n sup xt∈πt {max (||x t − ˆ x t ||,W t+1 (φ t (π t ), ˜ e t ))}. (6.25) Suppose the infimum in (6.23),(6.24),(6.25) are always achieved. Then, for each θ t ∈B and e t ∈E the minimizing γ t in (6.24) gives the optimal prescription at time t. Also, for each π t (or π T )∈B, the minimizing ˆ x t (or ˆ x T ) gives the optimal estimate. Furthermore, W 1 ([[X 1 ]],K) is the optimal cost for Problem 6.2. Proof. The result follows by writing the dynamic program using Theorem 6.1, Lemma 6.2 and associating the function V t with the value function at time t+ and W t with the value function at time t. Note that the above dynamic program is computationally hard to solve because: i) It involves minimization over functions in (6.24) ii) The information state is the conditional range of the source state and thus can be any arbitrary subset ofX . In the next section, we will analyze the dynamic program to obtain certain properties of the value functions which will help us in identifying the structure of the optimal strategies. 6.5 Globally optimal strategies We now proceed with solving the dynamic program of Lemma 6.3. We proceed in four steps. 165 Step 1: Nature of optimal prescriptions We define a relation Q between sets which will be helpful in identifying the structure of the globally optimal prescriptions. To that end, we define the radius of a set S⊂ R n as r ∗ (S) := inf x∈R n sup y∈S ||y−x||. The following lemma gives the relation between the radius of a set E and the radius of its transformation φ t (E) defined by (6.21). Lemma 6.4. Let E⊂R n . Then, r ∗ (φ t (E)) =|λ|r ∗ (E) +a t+1 . (6.26) Proof. See Appendix F.3. We now define a relation Q between sets and a property Q for functions. Definition 6.2. 1. Let G,H⊂R n be two sets. We say GQH if r ∗ (G) =r ∗ (H). 2. We say that a function f :B×E7→R satisfies property Q if GQH =⇒ f(G,e) =f(H,e)∀e∈E. 
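For concreteness, the following Python sketch illustrates the conditional-range maps φ_t and ψ of Lemma 6.2 and the radius r*(·) defined above for a scalar source (n = 1, A = 1), with conditional ranges represented as finite grids of points. The discretization and the function names are illustrative choices rather than part of the model.

```python
import numpy as np

def radius(S):
    """Chebyshev radius r*(S) = inf_c sup_{x in S} |x - c| for a scalar set S."""
    S = np.asarray(sorted(S), dtype=float)
    return 0.5 * (S[-1] - S[0])

def phi(Pi, lam, a_next, n_grid=101):
    """Pre-transmission range: Theta_{t+1} = {lam*x + n : x in Pi, |n| <= a_{t+1}}."""
    noise = np.linspace(-a_next, a_next, n_grid)
    return {lam * x + n for x in Pi for n in noise}

def psi(Theta, gamma, y):
    """Post-transmission range: {y} if Y_t = X_t was received, otherwise the
    states of Theta that the prescription gamma maps to 'do not transmit'."""
    return {y} if y is not None else {x for x in Theta if gamma(x) == 0}

# Post-transmission update when nothing is received, under an illustrative
# threshold prescription: keeps exactly the states with |x| <= 1.
E = set(np.linspace(-2.0, 2.0, 101))          # a set with radius r*(E) = 2
Pi = psi(E, gamma=lambda x: 1 if abs(x) > 1.0 else 0, y=None)

# Numerical check of Lemma 6.4: r*(phi_t(E)) = |lam| * r*(E) + a_{t+1}.
lam, a_next = 0.8, 0.5
print(radius(phi(E, lam, a_next)), abs(lam) * radius(E) + a_next)   # both ~2.1
```

The last line provides a numerical check of Lemma 6.4: both printed quantities equal |λ| r*(E) + a_{t+1}.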
Letγ all denote the ’always transmit’ prescription, i.e. γ all (x) = 1,∀x∈X . Letγ none denote the ’never transmit’ prescription, i.e. γ none (x) = 0,∀x∈X . Lemma 6.5. 1. For each t∈T , the functions V t and W t of Lemma 6.3 satisfy property Q. 166 2. For each t∈T , either γ all or γ none is an optimal choice of prescription γ t in (6.24). Proof. See Appendix F.3. Consider two singleton sets {x 1 t } and{x 2 t }. The first part of Lemma 6.5 implies that V t ({x 1 t },e t ) = V t ({x 2 t },e t ) because{x 1 t }Q{x 2 t }. Thus, V t ({x t },e t ) does not depend on the value of x t and can be represented as function of energy alone, that is, V t ({x t },e t ) = K t (e t ). The second part of Lemma 6.5 implies that we can replace the infimum in (6.24) by minimzation over just two prescriptions, γ all and γ none . Using the above observations, we can reduce the dynamic program of Lemma 6.3 to the following: V T (π T , ˜ e T ) =r ∗ (π T ), (6.27) V t (π t , ˜ e t ) = max (r ∗ (π t ),W t+1 (φ t (π t ), ˜ e t )), for t<T, (6.28) where (6.27) and (6.28) follow from the definition of r ∗ (π t ) and the dynamic program in Lemma 6.3; for e t > 0, W t (θ t ,e t ) = min n sup xt∈θt V t (ψ(θ t ,γ all ,y t ),e t − 1), sup xt∈θt V t (ψ(θ t ,γ none ,y t ),e t ) o = min n sup xt∈θt V t ({x t },e t − 1), sup xt∈θt V t (θ t ,e t ) o = min{K t (e t − 1),V t (θ t ,e t )}, t∈T, (6.29) where K t (e t − 1) =V t ({x t },e t − 1) for any x t . For e t = 0, W t (θ t , 0) = sup xt∈θt V t (ψ(θ t ,γ none ,y t ), 0) =V t (θ t , 0). (6.30) 167 Step 2: Simplified information state We will now use property Q to simplify the information state of the dynamic program. Lemma 6.5 suggests that value functionsV t ,W t depend only on the radius of the conditional range. Thus, we would expect that the radius of the conditional range can act as an information state of the dynamic program. This idea is formalized in the following lemma. Lemma 6.6. Define ˜ V t :R + ×E→R and ˜ W t :R + ×E→R as follows: (i) For t =T , r∈R + and ˜ e∈E, ˜ V T (r, ˜ e) :=r, (6.31) (ii) For t∈T , r∈R + and ˜ e∈E, ˜ W t (r,e) := min( ˜ V t (r,e), ˜ V t (0,e− 1)), if e> 0, ˜ V t (r, 0) if e = 0. (6.32) (iii) For t<T , ˜ V t (r, ˜ e) := max r, ˜ W t+1 (|λ|r +a t+1 , ˜ e) . (6.33) Then, for t∈T , V t (π t , ˜ e t ) = ˜ V t (r ∗ (π t ), ˜ e t ), (6.34) W t (θ t ,e t ) = ˜ W t (r ∗ (θ t ),e t ). (6.35) 168 Proof. V T (π T , ˜ e T ) = ˜ V T (r ∗ (π t ), ˜ e t ) follows from (6.31), (6.27). We then proceed by induc- tion — we first show that (6.35) is true if (6.34) is true for t. (6.35) follows easily from (6.29),(6.30) and the induction hypothesis by noting that K t (e t − 1) = V t ({x},e t − 1) = ˜ V t (0,e t − 1). Next, we show that (6.34) is true for t if (6.35) is true for t + 1. Using (6.28) and the induction hypothesis together with the fact thatr ∗ (φ t (π t )) =|λ|r ∗ (π t )+a t+1 , (6.34) can be easily established. We can further eliminate ˜ V t from (6.31)-(6.33) to obtain a recursive relation among ˜ W t given as: ˜ W T (r,e) = 0 ife> 0, r ife = 0, (6.36) For t<T , ˜ W t (r,e) = min n max{r, ˜ W t+1 (|λ|r +a t+1 ,e)}, ˜ W t+1 (a t+1 ,e− 1) o , for e> 0 max{r, ˜ W t+1 (|λ|r +a t+1 , 0)} for e = 0 (6.37) The above equations can be seen as a reduced version of the dynamic program of Lemma 6.3 with the radius of the conditional range and the energy level as the information state. Unlike the dynamic program of Lemma 6.3, however, the above dynamic program is com- pletely deterministic, that is, it does not involve maximization over any uncertain variables. 
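Since the reduced dynamic program (6.36)-(6.37) involves only the scalar radius and the energy level, it can be evaluated directly by memoized backward induction. The following Python sketch is one such implementation; the function name and the way the noise radii a_1, ..., a_T are passed are illustrative choices.

```python
from functools import lru_cache

def reduced_dp(T, K, a, lam):
    """Evaluate the recursion (6.36)-(6.37) for W_tilde_t(r, e).

    T   : length of the horizon
    K   : transmission budget (1 <= K < T)
    a   : list [a_1, ..., a_T] of noise radii
    lam : the scalar lambda in the source dynamics

    Returns W_tilde_1(a_1, K), the optimal worst-case error of Problem 6.2.
    """
    a = {t: a[t - 1] for t in range(1, T + 1)}

    @lru_cache(maxsize=None)
    def W(t, r, e):
        if t == T:
            return 0.0 if e > 0 else r
        # 'Do not transmit': the current radius r is incurred as instantaneous
        # error and the radius grows to |lam| r + a_{t+1} at the next stage.
        no_tx = max(r, W(t + 1, abs(lam) * r + a[t + 1], e))
        if e == 0:
            return no_tx
        # 'Transmit': no instantaneous error, one unit of energy is spent, and
        # the radius at the next stage resets to a_{t+1}.
        tx = W(t + 1, a[t + 1], e - 1)
        return min(no_tx, tx)

    return W(1, a[1], K)

# Example call: reduced_dp(T=5, K=2, a=[1.0] * 5, lam=1.0)
```

Recording which of the two branches attains the minimum at each (t, r, e) also yields the optimal transmission epochs, which is how the strategies identified in the subsequent steps can be computed numerically.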
169 In the next step, we will connect this deterministic dynamic program to a deterministic optimal control problem and use it to identify optimal transmission strategy. Step 3: A deterministic control problem Consider a deterministic control system with state (X d t ,E d t )∈R + ×E and control action U d t ∈U(E d t ), whereU(E d t ) ={0, 1} if E d t > 0 andU(0) ={0}, operating for a time horizon T . The dynamics of the state are as follows: X d t+1 = |λ|X d t +a t+1 if U t = 0, a t+1 if U t = 1, E d t+1 = max(E d t −U d t , 0) with X d 1 =a 1 and E d 1 =K. The instantaneous cost is given by ρ(X d t ,U d t ) = X d t if U t = 0 0 if U t = 1. The deterministic control problem can be stated as follows. Problem 6.3. Determine a control sequence U d 1:T to minimize the cost J d (U d 1:T ) := max t∈T ρ(X d t ,U d t ). We are interested in the above deterministic control problem because of the following lemma. 170 Lemma 6.7. The optimal cost for the original problem (i.e, Problem 6.1), the coordinator’s problem (i.e, Problem 6.2) and the deterministic control problem (i.e, Problem 6.3) are equal. That is, min f,g J(f, g) = min d,g ˆ J(d, g) = min U d 1:T J d (U d 1:T ). (6.38) Proof. We have already discussed that Problems 6.1 and 6.2 are equivalent, so we will focus on the second equality in (6.38). Since the deterministic control problem is a special case of the minimax problem of Section 6.2, we can use Theorem 6.1 to write the following dynamic program for it: ˜ W d T (x d T ,e d T ) := 0 ife d T > 0, x d T ife d T = 0, ˜ W d t (x d t ,e d t ) := min u d t ∈U(e d t ) n max ρ(x d t ,u d t ), ˜ W d t+1 (x d t+1 ,e d t+1 ) o , for t<T ; with ˜ W d 1 (a 1 ,K) being the optimal cost for Problem 6.3. Comparing the above dynamic program with (6.36)-(6.37), it is easy to see that ˜ W d t (x,e) = ˜ W t (x,e),∀x,e,t. From Theorem 6.1, the optimal cost of Problem 6.3 is ˜ W d 1 (a 1 ,K) = ˜ W 1 (a 1 ,K), which is the same as the optimal cost of Problem 6.2. 171 Step 4: Optimal transmission and estimation strategies for Problem 6.1 - We can now identify optimal transmission and estimation strategies for Problem 6.1. We start with the estimation strategy. We define ˜ X 0 = 0 and for t∈T , ˜ X t = Y t ifU t = 1, λA ˜ X t−1 ifU t = 0. (6.39) Lemma 6.8. In Problem 6.1 and Problem 6.2, the globally optimal estimation strategy is g ∗ t (Y 1:t ) = ˜ X t , for t∈T . Proof. See Appendix F.4. Let U d∗ t be an optimal open loop control sequence for Problem 6.3. Since Problem 6.3 is an optimal control problem with determinstic dynamics we know that there exists such an open loop strategy and can be computed via the dynamic program. We can now identify the optimal strategies for Problem 6.1. Theorem 6.2. Let g ∗ be the estimation strategy as defined in Lemma 6.8 and f ∗ be defined as follows: f ∗ t (X t ,E t ,Y 1:t−1 ) =U d∗ t where U d∗ t is an optimal open loop control sequence for Problem 6.3. Then, (f ∗ , g ∗ ) are globally optimal strategies for Problem 6.1. Proof. See Appendix F.4. 172 Theorem 6.2 establishes that the globally optimal transmission strategy to minimize the worst-case instantaneous cost is an open-loop strategy that transmits at pre-determined time instants. Thus, even though the sensor has access to the state and transmission history, this information is not used by the optimal transmission strategy. Remark 6.2. We can compare the nature of optimal strategies in Theorem 6.2 with the optimal strategies in the stochastic remote estimation problem in [20, 97]. 
The optimal estimation strategy obtained in our minimax setup is identical to the one obtained in the stochastic case considered in [20, 97]. However, the optimal transmission strategy in [20, 97] is a threshold-based strategy in contrast to the deterministic strategy obtained in our setup. 6.5.1 Homogenous noise Consider the case when all the uncertain noise variables take values in the ball of same size i.e a t = a for all t∈T . It turns out that transmitting at uniformly spaced intervals is optimal in this case as made precise in the following lemma. Lemma 6.9. Define Δ := l T +1 K+1 m . Then, 1. The optimal cost for Problem 6.1 under homogenous noise model is, c ∗ (K,T,a,λ) := |λ| Δ−1 −1 |λ|−1 a when |λ|6= 1, (Δ− 1)a when |λ| = 1 173 2. An optimal control sequence for Problem 6.1 under homogenous noise model is given as follows: U t = 1 if t∈{Δ, 2Δ,...,KΔ}∩T 0 otherwise. (6.40) Proof. See Appendix F.5. Remark 6.3. In the case of homogenous noise, it is possible that the sensor does not utilize all the K available transmission opportunities under the transmission strategy f ∗ . For example, when T = 5,K = 3, the sensor will transmit only twice at t = 2, 4. Thus, the worst-case error achieved in this case would be the same even if K = 2. Therefore, one could also ask the following question: What is the minimum number of transmission opportunities (K ∗ ) required so that the worst-case error is at most ? K ∗ can be computed as follows: K ∗ = min{K≥ 1 :c ∗ (K,T,a,λ)≤} (6.41) Remark 6.4. Consider the problem where the estimator requests transmissions instead of the sensor deciding when to transmit. The cost of this problem is lower bounded by the cost of Problem 6.1 because the sensor has more information to make the transmission decision than the estimator. Moreover, since the optimal scheduling strategy obtained for Problem 6.1 is an open loop strategy, it can also be implemented in this new problem. Therefore, the results obtained for Problem 6.1 also hold for this problem. 174 Remark 6.5. Consider the problem where the sensor can observe the source state only M times instead of observing the state at each time with M≥K. In addition to the scheduling strategy, here the sensor must also decide when to observe the source. The cost of this problem is lower bounded by the cost of Problem 6.1 because the sensor has less information in this case compared to Problem 6.1. Also, since the optimal scheduling strategy for Problem 6.1 is an open loop strategy, the sensor in this problem can take observations at the fixed times when it transmits, thereby achieving the same cost as in Problem 6.1. Therefore, the results obtained for Problem 6.1 also hold for this problem. Remark 6.6. For each t∈T , letB t be any set such thatB t is symmetric (i.e. if n∈B t then−n∈B t ) and sup nt∈Bt ||n t || =a t . It can be shown that the optimal transmission and estimation strategy remains the same if the noise N t lies in the setB t . 6.6 Conclusion We considered the problem of remote estimation of a non-stochastic source over a finite time horizon where the sensor has a limited communication budget. Our objective was to find jointly optimal scheduling and estimation strategies which minimize the worst-case maxi- mum instantaneous estimation error over the time horizon. This problem is a decentralized minimax decision-making problem. Our approach started with the dynamic program (DP) for a general centralized minimax control problem. 
We framed our decentralized minimax problem from the estimator’s perspective and used the common information approach to write down a dynamic program. This dynamic program, however, involved minimization 175 over functions. By identifying a key property of the value functions, we were able to charac- terize the globally optimal strategies. In particular, we show that an open loop transmission strategy and simple Kalman-like estimator are jointly optimal. We also described related problems where the same optimal strategy holds. 176 Chapter 7 Decentralized Minimax Control Problems with Partial History Sharing 7.1 Introduction Decentralized control problems involve control of a dynamical system by multiple con- trollers with different information about the system and about each other. The controllers are cooperative, that is, they share a common objective such as minimizing a system level cost. The design problem is to find a control strategy for each controller that uses only the information available at that controller to achieve the common objective. A recent 177 survey of decentralized control problems in team-theoretic and closed-loop norm optimiza- tion frameworks was presented in [136]. In the team-theoretic framework, uncertainties are random variables and the objective is to minimize the expected value of a system cost [137],[138],[139],[140],[141],[142]. In the norm optimization framework, the system and the controllers are typically LTI systems and the objective is to minimize a norm (such as the H 2 -norm or theH ∞ -norm) of the closed-loop transfer function [143],[144],[145],[146]. In this chapter, we consider a decentralized minimax control problem with the partial his- tory sharing information structure [15]. The partial history sharing model is a general decentralized model where (i) controllers sequentially share part of their past data (past observations and control) with each other by means of a shared memory; and (ii) all con- trollers have perfect recall of the shared data (common information). This model subsumes a large class of decentralized control models in which information is shared among the con- trollers such as the delayed sharing model [147], [148], [149], control sharing model [141, 150] and periodic sharing model [151]. Unlike the stochastic model of [15], the noise variables in dynamics and observations in our model are not random variables with known distributions but simply uncertain quantities that take values in some fixed and known finite sets. The objective is to find control strategies that minimize the worst-case cost. This formulation is appropriate when the distributions of the noise variables is not known or when strict (i.e. worst-case) guarantees on system performance are desired. We first consider a terminal cost problem. For this case, our solution methodology combines the common information based methodology of [15] and the dynamic program for centralized minimax problems developed in [152]. This methodology provides a common information 178 based dynamic program for the decentralized problem. The information state (or sufficiently informative function in the language of [152]) in the dynamic program is the set of feasible values of the current state and local information consistent with the information that is commonly known to all controllers. Each step of this dynamic program involves selection of prescriptions that map controllers’ local information to actions. 
We then extend the terminal cost problem to incorporate additive costs and common observations. 7.1.1 Notation We use capital letters to denote uncertain variables 1 and the corresponding small letters to denote their realizations. X a:b is a short hand for the collection (X a ,X a+1 ,...,X b ) and similarly X c:d is a short hand for the collection (X c ,X c+1 ,...,X d ). The bold face letter X is used to denote the vector X 1:n . In general, subscripts are used to index time while superscripts are used for controller’s index. φ is used to denote the empty set. For any set X,|X| denotes its cardinality. If A 1 ,...,A n are n sets, then A 1:n denotes the product set A 1 ×···×A n . 7.1.2 Organization The rest of the chapter is organized as follows: In section 7.2 we present our model of a decentralized minimax control problem. We formulate an equivalent problem in section 7.3 and present our main results in section 7.4. In section 7.5 we consider the case when the 1 We use the term “uncertain variable” in its colloquial sense to refer to any quantity whose value is not fixed apriori. A more precise definition is provided in [135]. 179 system cost is additive and in section 7.6 we present a generalization of our model. We give an example in section 7.7 and finally conclude in section 7.8. 7.2 System Model We use the partial history sharing model of [15] as our system model. We describe this model below. Consider a dynamic system with state process X t ∈X t controlled by n controllers for a finite time horizon of T . Each controller i takes a control action U i t ∈U i t at time t. Let U t denote the vector (U 1 t ,U 2 t ,...,U n t ), then the state process is governed by the following dynamics: X t+1 =f t (X t , U t ,W 0 t ), (7.1) where W 0 t is the input disturbance known to take values in the setW 0 t . The initial state X 1 is assumed to lie in the setX 1 . Each controller i receives the measurement Y i t ∈Y i t of the following form Y i t =h i t (X t ,W i t ), (7.2) where W i t is the measurement noise which take values in the setW i t . We assume that X 1 ,W 0 t ,W i t take values independently of each other for all i andt. To be more precise, the vector (X 1 ,W 0 1 ,W 0 2 ,...,W n T ) takes value in the setX 1 ×W 0 1 ×W 0 2 ×···×W n T . Also, it is assumed that all the sets mentioned above are finite. 180 In addition to the local observations, each controller has access to two types of memories: a local memory and a shared memory. • Local Memory: Each controller i stores a subset of{Y i 1:t−1 ,U i 1:t−1 } in its local memory M i t ∈M i t , M i t ⊂{Y i 1:t−1 ,U i 1:t−1 }. (7.3) At t = 1 the local memory is empty i.e. M i 1 =φ∀i. • Shared Memory: A common memory C t is shared across all the controllers. Its contents are a subset of all the observations and actions until time t− 1, C t ⊂{Y 1:t−1 , U 1:t−1 } (7.4) where Y t and U t are the vectors (Y 1 t ,Y 2 t ,...,Y n t ) and (U 1 t ,U 2 t ,...,U n t ) respectively. The shared memory is empty at t = 1 i.e. C 1 =φ. • Updates : After taking the control action at time t, the controller i has the set {M i t ,Y i t ,U i t } available as its local information. It then decides to send Z i t ∈ Z i t according to a pre-specified protocol to the shared memory. This is captured by the following transformation: Z i t =ξ i t (M i t ,Y i t ,U i t ) (7.5) 181 The contentsC t+1 of the shared memory at time t + 1 isC t augmented with the new information Z t = (Z 1 t ,Z 2 t ,...,Z n t )∈Z 1:n t sent by all the controllers at time t. 
C t+1 ={C t , Z t } = Z 1:t (7.6) Observe that the shared memory is non-decreasing in time. The update function for the local memory is given as follows M i t+1 =ζ i t (M i t ,Y i t ,U i t ), (7.7) Controlleri chooses the control actionU i t as a function of the information available to it at time t. We can write for each controller i = 1,...,n, U i t =g i t (M i t ,Y i t ,C t ) (7.8) whereg i t is the control law of controlleri. The collection g i = (g i 1 ,...,g i T ) is denoted as the control strategy of controller i. The collection g 1:n = (g 1 ,..., g n ) is called as the control strategy of the system. At the end of time horizon T , the system incurs a cost c(X T ). The performance of any control strategy is measured by the worst-case cost J(g 1:n ) = max X 1 ,W 0 1:T−1 ,W i 1:T c(X T ). (7.9) 182 We consider the following problem. Problem 7.1. Given the state evolution functions f t , the observation functions h i t , the protocols for updating the local and the shared memories, the setsX 1 ,W 0 t ,W i t for all i = 1,...,n andt = 1,...,T , find a control strategy for the system that minimizes the worst-case cost given by (7.9). Note that this model assumes a general information sharing and memory updating protocol. Several existing models of decentralized control can be seen as special cases of our model with particular choices of protocols for updating local and shared memories. For example, • The delayed information sharing structure where the observations and control actions of each controller are shared with other controllers after a delay of s time steps. Here, the shared memory at time t is C t ={Y 1:t−s , U 1:t−s }, the local memory at the beginning of time t is M i t ={Y i t−s+1:t−1 ,U i t−s+1:t−1 }. Each controller sends Z i t = {Y i t−s+1 ,U i t−s+1 } after taking the control action U i t at time t. The shared and local memories at time t + 1 are updated as C t+1 ={Y 1:t−s+1 , U 1:t−s+1 } and M i t+1 = {Y i t−s+2:t ,U i t−s+2:t } respectively. • Control sharing information sharing structure where the control actions of each con- troller are shared with each other after a delay of 1 time step. In this model the shared memory at time t is C t ={U 1:t−1 } and the local memory at the beginning of time t is M i t ={Y i 1:t−1 }. Each controller sends Z i t =U i t after taking the control action at time t. The shared and local memories at time t + 1 are updated as C t+1 ={U 1:t } and M i t+1 ={Y i 1:t } respectively. 183 M i t Y i t U i t Z i t C t Z t t t+1 Shared memory Controller i Figure 7.1: Time ordering of Observations, Actions and Memory updates. Several other special cases of the partial history sharing model have been described in [15]. 7.3 Coordinator’s Problem In this section we will reformulate the above problem from a coordinator’s perspective which knows only the shared memory among the controllers. We will show that the new problem is equivalent to the original problem. This in turn reduces the above decentralized control problem to a problem of centralized control which is easier to tackle. 7.3.1 The coordinated system Consider a coordinated system which consists of a coordinator and n passive controllers. The coordinator has the knowledge of the shared memory C t at time t but it does not have access to the local observations Y i t and the local memories M i t . At each time t, the coordinator decides the mappings Γ i t :Y i t ×M i t →U i t , i = 1,...,n, as follows Γ t =d t (C t , Γ 1:t−1 ) (7.10) 184 where Γ t = (Γ 1 t ,..., Γ n t ). 
The function d t is called the coordination rule at time t and the collection d = (d 1 ,...,d T ) is called the coordination strategy. The mapping Γ i t is communicated to the controller i at time t. The controller i then uses the function Γ i t to determine the control action. More precisely, controller i decides its control action at time t as follows U i t = Γ i t (Y i t ,M i t ). (7.11) We call Γ i t the coordinator’s prescription to controller i at time t. The system dynamics is governed by the same model as described in Section 7.2. The state dynamics is given by (7.1), the observations at the controller by (7.2), and the memories are updated using the same protocol as in Problem 7.1. The system incurs a cost c(X T ) at the end of time horizon T and the performance of a coordination strategy is measured by the worst-case cost ˆ J(d) = max X 1 ,W 0 1:T−1 ,W i 1:T c(X T ) (7.12) The objective is to determine the coordination strategy with the minimum worst-case cost. Problem 7.2. Determine the coordination strategy d which minimizes the worst-case cost given in (7.12). 7.3.2 Equivalence between the two models Lemma 7.1. Problems 7.1 and 7.2 are equivalent in the following sense: 185 a) For any given strategy g 1:n in Problem 7.1, we can define the coordinator’s strategy in Problem 7.2 as d t (C t ) = (g 1 t (·,·,C t ),...,g n t (·,·,C t )), t = 1,...,T. Then, ˆ J(d) =J(g 1:n ). b) Conversely, for any given strategy d of the coordinator we can define a control strategy g 1:n in Problem 7.1 as g i 1 (·,·,C 1 ) =d i 1 (C 1 ) g i t (·,·,C t ) =d i t (C t , Γ 1:t−1 ) where Γ k =d k (C k , Γ 1:k−1 ), 1≤k≤t− 1 and d i t (·) is the i th component of d t (·). Then, J(g 1:n ) = ˆ J(d). Proof. The proof follows the same argument as in Proposition 3 in [15]. 7.3.3 Centralized minimax control problem The next lemma establishes that the coordinator’s problem is a centralized minimax control problem with the state process S t :={X t , Y t , M t }, observation process O t := Z t−1 and action process Γ t . The state S t takes values in the setS t :=X t ×Y 1:n t ×M 1:n t . 186 Lemma 7.2. a) There exist functions ˜ f t and ˜ h t , t = 1,...,T, such that S t+1 = ˜ f t (S t , Γ t ,W 0 t , W t+1 ) (7.13) O t+1 = Z t = ˜ h t (S t , Γ t ) (7.14) b) Furthermore, there exists a function ˜ c such that c(X T ) = ˜ c(S T ) (7.15) Thus, minimizing (7.12) is equivalent to minimizing ˆ J(d) = max X 1 ,{W 0 t } T t=1 ,{W i t } T t=1 ˜ c(S T ) (7.16) Proof. The existence of ˜ f t follows from (7.1), (7.2), (95), (7.11) and the definition of S t . Existence of ˜ h t follows from (95) and (7.11). Existence of ˜ c follows straight from the definition of S t . For a general centralized minimax control problem, a dynamic program based solution was provided in [152] to compute the optimal control strategy. A notion of sufficiently informative functions was introduced (analogous to the information state for the centralized stochastic control problem) to reduce the complexity of the dynamic program. For the centralized terminal cost problem, the sufficiently informative function at time t was shown to be the set of feasible values of the current state consistent with the observations and action histories until t. 187 Following the methodology of [152], we define the state uncertainty set Π t as the set of all possible values of the state S t consistent with the information available at the coordinator at time t which is C t , Γ 1:t−1 . 
If the realization of the coordinator’s information at time t is γ 1:t−1 , z 1:t−1 , then s∈ Π t if and only if there exist feasible values x 1 ,w 0 1:t−1 , w 1:t−1 of the initial state, disturbance and noise variables such that x 1 ,w 0 1:t−1 , w 1:t−1 and γ 1:t−1 , z 1:t−1 together satisfy system dynamics and measurement equations (7.13), (7.14) for all time before t and lead to the state s at time t. That is, Π t (S t |γ 1:t−1 , z 1:t−1 ) = ( s t ∈S t |∃x 1 ∈X 1 ,w 0 1:t−1 ∈ t−1 Y j=1 W 0 j ,w i 1:t−1 ∈ t−1 Y j=1 W i j ∀i, such that z k = ˜ h k (s k ,γ k ), s k+1 = ˜ f k (s k ,γ k ,w 0 k , w k+1 ), 1≤k≤t− 1 ) For brevity, we denote Π t (S t |γ 1:t−1 , z 1:t−1 ) as simply Π t when the coordinator’s information is clear from the context. Since C 1 =φ, we can define the state uncertainty set at t = 1 as π 1 ={(x 1 , y,φ) : x 1 ∈X 1 ,∃w i 1 ∈W i 1 such thaty i = h i 1 (x 1 ,w i 1 ),∀i}. The next lemma characterizes the evolution of the state uncertainty sets in time. Lemma 7.3. The state uncertainty set Π t+1 at timet + 1 can be computed from Π t , γ t and z t i.e. Π t+1 = Φ t (Π t ,γ t , z t ). (7.17) 188 Proof. Let Π t be given. After receiving z t , we can remove some states from Π t that are inconsistent with the new information. For that purpose we define Θ t as Θ t = ( (x t , y t , m t )∈ Π t |z i t =ξ i t (m i t ,y i t ,u i t ), u i t =γ i t (y i t ,m i t ), 1≤i≤n ) (7.18) Hence, Θ t is a transformation of Π t ,γ t , z t . Now, the feasible values of the next state S t+1 can be derived using the system dynamics as follows: Π t+1 = ( (x t+1 , y t+1 , m t+1 ) :x t+1 =f t (x t , u t ,w 0 t ), y i t+1 =h i t+1 (x t+1 ,w i t+1 ), 1≤i≤n m i t+1 =ζ i t+1 (m i t ,y i t ,u i t ), 1≤i≤n u i t =γ i t (y i t ,m i t ), 1≤i≤n (x t , y t , m t )∈ Θ t ,w 0 t ∈W 0 t ,w i t ∈W i t ∀i ) . From the above equation it follows that Π t+1 = Φ t (Π t ,γ t , z t ). We would like to use the dynamic program for the centralized minimax problem in [152] to write a corresponding dynamic program for the coordinator’s problem. However the observation model of the coordinator’s problem is slightly different from the model used in [152]. Note that in the coordinator’s problem the observation O t is a function ofS t−1 , Γ t−1 (the state and action at time t− 1) whereas in [152] the observation at time t is a function 189 of the state at time t and a noise variable. It turns out that the results of [152] can be extended in a straightforward way to deal with an observation model where the current observation is a function of the previous state and action [153]. Using this extension, the following lemma provides the dynamic program for the coordinator’s problem. The value functions in this dynamic program are functions of state uncertainty set Π t . Lemma 7.4. For each possible state uncertainty set π t , define the value functionV t (π t ) for time t =T,T− 1,..., 1, as follows: V T (π T ) = max s T ∈π T ˜ c(s T ), (7.19) V t (π t ) = min γt max zt∈ ˆ Ot(πt,γt) V t+1 (Φ t (π t ,γ t , z t )), t≤T− 1, (7.20) where ˆ O t (π t ,γ t ) is the set of the feasible values that the next observation z t can take and is given as ˆ O t (π t ,γ t ) ={z t ∈Z 1:n t : z t = ˜ h t (s t ,γ t ), wheres t ∈π t }. (7.21) For each timet and uncertainty setπ t , the optimal prescriptions are given by the minimizing γ t in (7.20). Complexity of the Dynamic Program: Given the value function at time t + 1, V t (π t ) can be computed by finding the minimax in (7.20). 
For that purpose, we need to compute V t+1 (Φ t (π t ,γ t , z t )) at all feasible prescription-observation pairs (γ t , z t ). Let the cardinal- ity of the space of prescriptions γ t be denoted by |Γ t |. It is easy to see that |Γ t | = Q n i=1 |U i t | |Y i t ||M i t | . The cardinality of the set ˆ O t (π t ,γ t ) can at most be|Z 1:n t | where|Z 1:n t | = 190 Q n i=1 |Z i t |. Hence, to evaluate the minimax in (7.20) for a particular π t we need to carry out at most|Γ t ||Z 1:n t | evaluations of Φ t . In order to characterize the value function at time t, this process has to be repeated for every possible state uncertainty set π t and the number of such sets at time t can at most be 2 |St| , where|S t | =|X t | Q n i=1 |Y i t ||M i t |. Thus the computation of V t (·) will require at most|Γ t ||Z 1:n t |2 |St| evaluations of Φ t . 7.4 Optimal Strategies in Problem 7.1 Using Lemma 7.1 and the optimal strategy for the coordinator’s problem we can conclude the following theorem. Theorem 7.1. In Problem 7.1, there exist optimal control strategies of the form U i t = ˆ g i t (Y i t ,M i t , Π t ), i = 1,...,n, (7.22) where Π t is the uncertainty set of X t , Y t , M t given C t . LetB t denote the space of all possible realizations of Π t . Consider a strategy ˆ g i of the form specified in Theorem 7.1 for controlleri. Then the control law ˆ g i t at timet is a function from Y i t ×M i t ×B t toU i t . The control policy ˆ g i t can also be described by specifying{ˆ g i t (·,·,π)} π∈Bt where each element of this collection is mapping fromY i t ×M i t toU i t . For a fixed realization of π of Π t , ˆ g i t (·,·,π) tells the controller i how to map its local observation and memory to its control action. We call ˆ g i t (·,·,π) the partial control law of controller i at time t for the given realization π of Π t . 191 We now use Lemma 7.4 to give a dynamic program for Problem 7.1. The dynamic program allows us to evaluate the optimal partial control laws at each controller for any realization π of the consistency set of Π t in a backward inductive manner. Theorem 7.2. Define the value functions V t :B t →R, for t =T,T− 1,..., 1, as follows V T (π T ) = max x T ∈π T c(x T ), (7.23) V t (π t ) = min γ 1 t ,...,γ n t max zt∈ ˆ Ot(πt,γt) V t+1 (Φ t (π t ,γ t , z t )), t≤T− 1, (7.24) where γ t = (γ 1 t ,...,γ n t ) and ˆ O t (π t ,γ t ) is the set of the feasible values that the next observa- tion z t can take and is given as ˆ O t (π t ,γ t ) = ( z∈Z 1:n t :z i =ξ i t (m i t ,y i t ,u i t ) u i t =γ i t (y i t ,m i t ), i = 1,...,n (x t , y t , m t )∈π t ) . (7.25) For t = 1,...,T− 1 and for each π∈B t , an optimal control law for controller i is the minimizing choice of γ i t in the definition of V t (π). Letη t (π) be the argmin in the right hand side of equation (7.24) and η i t (π) be its i-th component. Then, an optimal partial control law for controller i is given as ˆ g i t (·,·,π) =η i t (π). (7.26) 192 7.5 Additive cost We now consider the case where the system incurs a costc(X t , U t ) at each timet. Consider the system model described in Section 7.2 with the objective of minimizing the following cost function ˆ J(g) = max X 1 ,W 0 1:T−1 ,W i 1:T T X t=1 c(X t , U t ) (7.27) As in [152], this problem can be transformed to a terminal cost problem by introducing an additional state variable. Lemma 7.5. For the case of additive cost structure, define the extended state as ˜ X t = (X t ,A t ) where A t = P t i=1 c(X i , U i ) is the cost accumulated till time t. 
Then this problem is equivalent to the Problem 7.1 with ˜ X t as the state of the system. Proof. The time evolution of A t is characterized as A t+1 = A t +c(X t , U t ) with A 1 = 0. Hence, for the extended state ˜ X t we can write ˜ X t+1 = X t+1 A t+1 = f t (X t , U t ,W 0 t ) A t +c(X t , U t ) = ˜ F t ( ˜ X t , U t ,W 0 t ). (7.28) The total cost is simply A T which is a function of the terminal extended state ˜ X T . This fact together with (7.28) transforms the problem to the model of Section 7.2 with ˜ X t as the state of the system. 193 M i t Y i t U i t Z i t C t Z t t t+1 Shared memory Controller i π t Y com t Figure 7.2: Time ordering of Observations, Actions and Memory updates with common observation The above lemma implies that we can apply the result of Theorem 7.2 for the additive cost problem using the extended state ˜ X t . 7.6 Generalization of the model The methodology described in Section 7.3 relies on the fact that the shared memory is common information among all controllers. In some cases, in addition to the shared memory, controllers may have a common observation. The general approach of Section 7.3 can be easily modified to include such cases as well. Consider the model of section 7.2 with the following changes : 1. In addition to its local observation, each controller has a common observation at time t Y com t =h com t (X t ,W com t ), (7.29) whereW com t ∈W com t is a noise variable taking values independently from all the other uncertain quantities (namely, the initial state, disturbances and noise variables) of the system. 194 2. The shared memory C t at time t is a subset of{Y 1:t−1 , U 1:t−1 ,Y com 1:t−1 } 3. The control action of controller i is chosen using a control law of the form U i t =g i t (M i t ,Y i t ,C t ,Y com t ). (7.30) 4. The memory updates Z i t is a subset of{Y i t ,U i t ,M i t ,Y com t } and necessarily includes Y com t . Thus, the memory updates can be described as follows Z i t =ξ i t (M i t ,Y i t ,U i t ,Y com t ). (7.31) Also, this means that the history of common observations form a part of the shared memory i.e. Y com 1:t−1 ⊂C t . The rest of the model is same as in Section 7.2 and the objective is to find control strategies which minimize the worst case terminal cost given by (7.9). We can proceed in a manner similar to Section 7.3 with the new observation process at the coordinator defined asO t+1 ={Z t ,Y com t+1 }. Also, the uncertainty set of the coordinator state S t is the set of all possible values of the state S t consistent with the information available at the coordinator at time t which is γ 1:t−1 , z 1:t−1 ,y com t . Lemma 7.6. The state uncertainty set Π t+1 at time t + 1 evolve as follows Π t+1 = Φ t (Π t ,γ t , z t ,y com t+1 ) (7.32) 195 Proof. Let Π t be given. After receiving z t , we can remove some states from Π t and form the set Θ t using (7.18). Then, the feasible values of the next stateS t+1 can be derived using the system dynamics and (7.29) as follows : Π t+1 = ( (x t+1 , y t+1 , m t+1 ) :x t+1 =f t (x t , u t ,w 0 t ) y i t+1 =h i t+1 (x t+1 ,w i t+1 ) 1≤i≤n m i t+1 =ζ i t+1 (m i t ,y i t ,u i t ) 1≤i≤n u i t =γ i t (y i t ,m i t ) (x t , y t , m t )∈ Θ t ,w 0 t ∈W 0 t ,w i t ∈W i t ∀i and∃w com t+1 ∈W com t+1 such thath com t+1 (x t+1 ,w com t+1 ) =y com t+1 ) (7.33) From the above equation it follows that Π t+1 = Φ t (Π t ,γ t , z t ,y com t+1 ). The structural results of Section 7.4 hold true for this model with Π t defined as above. 
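Before stating the modified dynamic program, we note that the recursion of Theorem 7.2 (and the modification below) can be evaluated numerically when all the sets involved are finite. The following Python sketch is one possible implementation; the model-specific ingredients, namely the prescription spaces, the feasible-observation sets, the update map Φ_t and the terminal cost, are assumed to be supplied by the user, and all names are illustrative placeholders.

```python
from functools import lru_cache

def coordinator_dp(T, prescriptions, feasible_obs, update, terminal_cost, pi_1):
    """Backward minimax recursion over state-uncertainty sets (cf. Theorem 7.2).

    T              : time horizon
    prescriptions  : prescriptions(t) -> iterable of joint prescriptions gamma_t
    feasible_obs   : feasible_obs(t, pi, gamma) -> feasible values of z_t
    update         : update(t, pi, gamma, z) -> Phi_t(pi, gamma, z), a frozenset
    terminal_cost  : terminal_cost(s) -> c(x_T) evaluated at s = (x, y, m)
    pi_1           : initial uncertainty set, given as a frozenset

    Returns the optimal worst-case cost and a dictionary mapping each reachable
    pair (t, pi) to an optimal joint prescription.
    """
    policy = {}

    @lru_cache(maxsize=None)
    def V(t, pi):
        if t == T:
            return max(terminal_cost(s) for s in pi)
        best_val, best_gamma = float("inf"), None
        for gamma in prescriptions(t):
            worst = max(V(t + 1, update(t, pi, gamma, z))
                        for z in feasible_obs(t, pi, gamma))
            if worst < best_val:
                best_val, best_gamma = worst, gamma
        policy[(t, pi)] = best_gamma
        return best_val

    return V(1, pi_1), policy
```

The number of evaluations of the update map matches the complexity estimate of Section 7.4, since every candidate prescription is paired with every feasible observation at each reachable uncertainty set.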
The dynamic programming decoposition of Section 7.4 can be easily modified to the following Lemma 7.7. Define the value functions for time t =T,T− 1,..., 1 as follows V T (π T ) = max x T ∈π T c(x T ), (7.34) V t (π t ) = min γt max (zt,y com t+1 )∈ ˆ Ot(πt,γt) V t+1 (Φ t (π t ,γ t , z t ,y com t+1 )), (7.35) 196 where ˆ O t (π t ,γ t ) is the set of the feasible values that (z t ,y com t+1 ) can take and is given as ˆ O t (π t ,γ t ) ={(z t ,y com t+1 ) :y com t+1 =h com t+1 (f t (x t , u t ,w 0 t ),w com t+1 ) z i t =ξ i t (m i t ,y i t ,u i t ), u i t =γ i t (y i t ,m i t ) where (x t , y t , m t )∈π t ,w 0 t ∈W 0 t ,w com t+1 ∈W com t+1 }. 7.7 Example Consider a system with two agents who want to track down a target. The target and both the agents can move across a finite one dimensional grid of size L. The objective of the agents is to surround the target from the two sides while being at a distance of at most d from the target. • The target’s position X 0 t ∈{1,...,L} at time t evolves in an uncontrolled manner as follows X 0 t+1 = X 0 t +W 00 t , if 1≤X 0 t +W 00 t ≤L 1, ifX 0 t +W 00 t < 1 L, ifX 0 t +W 00 t >L (7.36) where W 00 t ∈W 00 t ={−m,−m + 1,...,m− 1,m} and m is a positive integer. 197 • At each time the agents can decide to move left, right or stay in its place. The position X 1 t+1 of the agent 1 at time t + 1 is determined by the following dynamics X 1 t+1 =X 1 t +U 1 t +W 01 t , (7.37) where U 1 t ∈{−1, 0, 1} and W 01 t ∈W 01 t is the noise. Also, the position X 2 t+1 of the agent 2 at time t + 1 evolves in a controlled manner without any disturbance i.e. X 2 t+1 =X 2 t +U 2 t , (7.38) where U 2 t ∈{−1, 0, 1}. • The agents know their own and each other’s position perfectly. Also, agent 1 can observe the target’s position perfectly. However, agent 2 receives only a quantized version Y t of the target’s location X 0 t which is described as follows Y t = 2k, ifX 0 t ∈{2k− 1, 2k}. (7.39) • Let I opp be the indicator variable which takes value 0 if at time T both the agents end up on the same side (right/left) of the target and takes a value of 1 otherwise. Similarly, I d denotes the indicator variable which takes a value 1 if both the agents are at a distance of at most d from the target at time T . Then the cost function is defined as follows c(X 0 T ,X 1 T ,X 2 T ) 198 = 0, ifI opp = 1, I d = 1 |X 0 T −X 1 T | +|X 0 T −X 2 T |, ifI opp = 1, I d = 0 C +|X 0 T −X 1 T | +|X 0 T −X 2 T |, ifI opp = 0 where C > 0 is the additional penalty if the agents end up on the same side of the target. The objective is to come up with the control strategies of the agents which minimize the worst case cost. In this example the common observation is Y com t ={Y t ,X 1 t ,X 2 t } as agent 1 knows Y t from the perfect knowledge of X 0 t . The private memories are M 1 t = X 0 t and M 2 t =φ, the information shared by the agents Z t =φ and the shared memory isC t =Y com 1:t−1 . The coordinator state is S t = (X 0 t ,X 1 t ,X 2 t ) and its uncertainty set Π t is derived using γ 1:t−1 ,y com 1:t . Since the agents location is common information, the uncertainty set for X 1 t and X 2 t is going to be a singleton. Also, the uncertainty about the target’s position X 0 t can be at most a window of the form{2k− 1, 2k} or possibly a singleton. The number of singleton sets for the location of the agents/target isL and there are L 2 windows of size two which form the uncertainty about target’s location. Hence the number of possible instances of the state uncertainty set π t is 3L 2 ×L×L. 
Since Z_t = φ, we can characterize the time evolution of the state uncertainty set using (7.33):
\[
\pi_{t+1} = \Big\{ (x^0_{t+1}, x^1_{t+1}, x^2_{t+1}) : \; x^0_{t+1} = x^0_t + w^{00}_t, \;\;
x^0_{t+1} \in \{y^{com}_{t+1}(1),\, y^{com}_{t+1}(1)-1\}, \;\;
x^1_{t+1} = y^{com}_{t+1}(2), \;\; x^2_{t+1} = y^{com}_{t+1}(3), \;\;
(x^0_t, x^1_t, x^2_t) \in \pi_t, \; w^{00}_t \in \mathcal{W}^{00}_t, \; w^{01}_t \in \mathcal{W}^{01}_t \Big\},
\]
where y^{com}_t(i) denotes the i-th component of y^{com}_t.

 (i, j, k)    J*
 (1, 8, 8)    19
 (1, 4, 8)    15
 (1, 8, 4)     7
 (4, 1, 4)    13
 (4, 1, 8)     5
Table 7.1: Optimal cost incurred for L = 8, T = 4, m = 2, d = 2, C = 10.

[Figure 7.3: Optimal cost incurred for L = 8, T = 4, m = 2, d = 2, C = 10, X^0_1 = 4, plotted as a function of agent 1's and agent 2's initial positions (each ranging over 1-8).]

Table 7.1 gives the worst-case cost incurred by the optimal strategy (the value function computed at t = 1) for different initial configurations of the target and the agents. We assume that the noise at agent 1 is correlated with the target noise, with W^{01}_t = sign(W^{00}_t). The configuration (i, j, k) denotes that the target, agent 1 and agent 2 start at positions i, j and k respectively, and J* denotes the cost incurred by the optimal policy. Figure 7.3 shows the cost incurred by the optimal strategy when the initial position of the target is fixed to the center of the grid (X^0_1 = 4). It can be observed that the cost is lower if the agents' initial positions are close to the target's initial position, or if the agents start on opposite sides of the target.

7.8 Conclusion

In this chapter, we considered a decentralized minimax control problem with the partial history sharing information structure. We model the noise in the system as uncertain quantities taking values in known finite sets, and the objective is to compute control strategies that minimize the worst-case system cost. We start with a system that incurs only a terminal cost and formulate an equivalent problem from the perspective of a fictitious coordinator that has access to the data shared among the controllers (the common information). A dynamic-program-based solution is developed in which the information state is the set of feasible values of the current state and local information consistent with the information that is commonly known to all controllers. We further extend our results to a system with additive cost and to the case when all the controllers have a common observation.

Chapter 8

Weakly Coupled Constrained Markov Decision Process in Borel Spaces

8.1 Introduction

Consider a multi-agent system with N agents that have decoupled dynamics, i.e., each agent's state evolution depends only on its own state and actions. Each agent has an associated cost function and a constraint function that depend on its local state and local action. The agents are coupled because the time-average of the total constraint function (summed over all agents and over time) must be kept below a threshold. Such multi-agent problems are referred to as weakly coupled constrained Markov decision processes (MDPs) [154, 155].
The goal of the agent is to minimize its long term expected cost while keeping the constraint functions below a threshold. One approach to solve such single agent problems is based on the idea of occupation measures. These are joint measures on the state and action spaces that can be used to quantify the time-averaged cost and constraint values [21, 157–159]. Using such measures, the strategy design problem can be written as a linear program whose solution gives the optimal occupation measure. Linear programming (LP) based formulation are presented in [157, 158] for CMDPs with finite/countable state and action spaces. The idea of LP was extended for CMDPs with Borel state and action spaces in [21, 159, 160]. Weakly coupled MDP with finite state and action spaces have been studied in the literature. A resource allocation problem for multiple task completion was modeled as a weakly coupled MDP in [161]. Each individual task was modeled as an MDP with instantaneous resource constraints on control strategy. [154] considered the problem of budget allocation across independent MDPs. Optimal value functions are derived as a function of the available budget and the allocation problem is posed as multi-item, multiple choice knapsack problem for which a greedy algorithm is presented to determine budget allocation. A distributed 203 online learning based algorithm was proposed for weakly coupled MDPs in [155] where the system dynamics were assumed to be unknown. In this chapter, we consider the problem of weakly coupled constrained MDP with Borel state and action spaces. We use the linear programming based approach of [21] to derive an occupation measure based LP to find the optimal decentralized control strategies for our problem. Our main contributions could be summarized as follows: 1. We formulate a LP to show that randomized stationary strategies are optimal for each agent under some assumptions on the transition kernels, cost and the constraint functions. 2. We consider the special case of multi-agent Linear Quadratic Gaussian (LQG) systems and show that the optimal control strategy could be obtained by solving a semi-definite program (SDP). Finally, we also present some numerical experiments for a toy problem on multi-agent LQG. The following is the outline of this chapter: We will start with problem formulation in section 8.2 and present the LP to solve the general Borel case in section 8.3. We consider the multi- agent LQG case in section 8.4, provide numerical results in section 8.5 and conclude in section 8.6. Notation Random variables are denoted by upper case letters and their realizations by corresponding small letters. X a:b denotes the collection (X a ,X a+1 ,··· ,X b ). Boldface letter X is used to 204 denote the collection (X 1 ,X 2 ). E[·] is the expectation of a random variable. For a collection of functions g and a probability distribution f, we useE g f [·] to denote that the expectation depends on the choice of functions in g and the distribution f. For any Borel spaceS, let B(S) denote the set of all Borel sets ofS. A B means that (A−B) is positive semi-definite.N (m, Σ) denotes the Gaussian distribution with mean m and covariance Σ. 8.2 Problem formulation Consider a two-agent dynamical system with state process X t = (X 1 t ,X 2 t ),t≥ 0. X i t ∈X i is the state-component associated with agent i fori∈{1, 2}. The distribution of the initial state X i 0 is denoted by ν i . X 1 0 ,X 2 0 are independent and let ν denotes the pair (ν 1 ,ν 2 ). 
At time t, agent i takes a control action U i t ∈U i and the states of the two agents evolve in a decoupled manner according to the following stochastic kernel: X 1 t+1 ∼Q 1 (·|X 1 t ,U 1 t ), (8.1) X 2 t+1 ∼Q 2 (·|X 2 t ,U 2 t ). (8.2) Information and Strategies Each agent can observe its component of the state perfectly at each time. Agents do not share any information with each other. Hence, the information available to agent i at time t isI i t ={X i 1:t ,U i 1:t−1 }. Agent i maps its information to its corresponding action using a 205 randomized strategy π i t as follows, U i t ∼π i t (·|I i t ). where π i t (·|I i t ) is a probability distribution on the control spaceU i of agent i. We allow for randomized strategies because we are considering a control problem with constraints [157]. The collectionπ i ={π i t } t≥0 denotes the control strategy of agenti and the pairπ = (π 1 ,π 2 ) is referred to as the joint control strategy of the agents. Cost and Constraints Agent i incurs an instantaneous cost c i (X i t ,U i t ) at each time t. In addition, agent i also has an associated constraint cost function d i (X i t ,U i t ) at timet. The long-term average cost function and constraint function under a joint strategy pair π and initial distribution pair ν is defined as: J(π,ν) = lim sup T→∞ 1 T E π ν " T−1 X t=0 c 1 (X 1 t ,U 1 t ) +c 2 (X 2 t ,U 2 t ) # (8.3) K(π,ν) = lim sup T→∞ 1 T E π ν " T−1 X t=0 d 1 (X 1 t ,U 1 t ) +d 2 (X 2 t ,U 2 t ) # (8.4) The objective of the agents is to jointly minimize their long term average cost (8.3) while keeping the joint long term average constraint function (8.4) below a threshold k. We formally state the problem below. 206 Problem 8.1. Find a joint control strategyπ and initial distributionν for the agents which minimizes the cost J(π,ν) subject to the constraint K(π,ν)≤k, i.e., inf π,ν J(π,ν) subject to K(π,ν)≤k (8.5) Assumption 8.1. Problem 8.1 is feasible i.e. there exists a pair (π,ν) such thatK(π,ν)≤ k and J(π,ν)<∞. Remark 8.1. Constrained problems which consider long term average cost and a fixed initial distribution have been studied for single agent systems in [155, 157, 158, 162]. These problems are referred to as ”ergodic” problems in the literature. Problems which require the joint design of initial distribution and control strategy (as in Problem 8.1) are referred to as ”minimum pair” problems. Such constrained problems have been considered in [21, 163–166] for centralized (single-agent) systems. 8.2.1 Discussion Problem 8.1 is an instance of constrained team decision problem with additive cost and constraint function. In the absence of the constraint (8.5), this problem can be decomposed into two single agent (centralized) decision problems, the solution to which can be obtained using Markov decision theory [6]. However, constraint (8.5) couples the decision making of the two agents. This is because the choice of control strategy for agent 2 can affect the 207 choice of control strategy for agent 1 since (8.5) has to be satisfied jointly by the two agents. This coupling makes this problem non-trivial. Such problems are also referred to as weakly coupled Markov decision problems ([154, 155]) since the coupling among the agents is only due to the constraint (8.5). The framework we discuss in this chapter can be used to model problems where the agents are working as a team to achieve a common goal encoded by the constraint (8.5) while trying to minimize their cumulative individual costs. 
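As a concrete illustration of the objective (8.3) and the constraint (8.4), the following Python sketch (restricted to finite state and action sets, with hypothetical kernels, costs and strategies supplied as arrays) estimates the average cost and constraint value of a given pair of randomized stationary strategies by simulating the decoupled chains; the limsup in (8.3)-(8.4) is approximated by a finite-horizon average.

```python
import numpy as np

def simulate_average(T, nu, Q, phi, c, d, seed=0):
    """Monte Carlo estimate of the average cost (8.3) and constraint (8.4).

    nu[i]      : initial distribution of agent i over its states
    Q[i]       : transition kernel of agent i; Q[i][x, u] is a row over next states
    phi[i]     : randomized stationary strategy; phi[i][x] is a row over actions
    c[i], d[i] : cost / constraint arrays of agent i, indexed by (x, u)
    """
    rng = np.random.default_rng(seed)
    n_agents = len(Q)
    x = [rng.choice(len(nu[i]), p=nu[i]) for i in range(n_agents)]
    cost, constraint = 0.0, 0.0
    for _ in range(T):
        for i in range(n_agents):
            u = rng.choice(len(phi[i][x[i]]), p=phi[i][x[i]])   # sample U_t^i
            cost += c[i][x[i], u]
            constraint += d[i][x[i], u]
            x[i] = rng.choice(len(Q[i][x[i], u]), p=Q[i][x[i], u])  # next state
    return cost / T, constraint / T
```

Such a simulation only evaluates a fixed strategy pair; the optimal pair itself is characterized by the linear program developed in the next section.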
We give a few examples which can be posed as an instance of Problem 8.1. 1) Resource constrained problems: Consider a problem where the agents are sharing re- sources (e.g. control resources, energy resources) with each other. The goal of the agents is to minimize their total costs with a constraint on the resource utilization. Problems of such type can be modeled using the framework of Problem 8.1 where d i (X i t ,U i t ) and c i (X i t ,U i t ) is the resource consumption and the cost respectively for agent i. For example, consider a smart building with two air conditioning systems which are maintaining temperatures of two different rooms while sharing a common power supply. The state X i t denotes the temperature of room i while U i t denotes the amount of power consumed by air conditioner i. Suppose the temperature of room i evolves as follows: X i t+1 = A i X i t +B i U i t +W i t , where W i t is random noise. The objective of the agents is to minimize the deviation of the room temperature around a nominal value while keeping the total power consumed below a certain threshold i.e. min lim sup T→∞ 1 T E π ν " T−1 X t=0 2 X i=1 ||X i t −X i nom || 2 # 208 subject to lim sup T→∞ 1 T E π ν " T−1 X t=0 2 X i=1 ||U i t || 2 # ≤k 2) Remote estimation: Consider an estimation problem with multiple estimators. Estimator i wants to form an estimate ˆ X i t of a corresponding Markov source X i t at each time t. The sources are being observed by a shared sensor. In order to compute the estimate, estimator i can request the sensor to transmit X i t at time t using the decision variable A i t ∈{0, 1}. A i t = 1 indicates that estimator i has requested an observation. Due to limited power supply, the sensor can handle a limited number of observation requests on average. The objective of the agents is to minimize the cumulative estimation error such that the average cumulative number of requested observations is below a certain threshold. This problem can be easily modeled using the framework of Problem 8.1 withX i t as the state and ( ˆ X i t ,A i t ) as the action of agent i. 3) Mean-field constraint: Consider a two-agent problem where the state spaceX i ={0, 1}. Each agent has a control cost given by c i (U i t ). The agents goal is to minimize the time- averaged cost while keeping the time-averaged fraction of agents in state 1 below a threshold, i.e., lim sup T→∞ 1 T E " T−1 X t=0 1 2 2 X i=1 I(X i t = 1) # ≤k 209 8.3 Optimal Strategies We are going to restrict our attention to the case when the state space and the control spaceX i ,U i are Borel spaces 1 (e.g. Euclidean space). Single agent constrained Markov decision process in Borel spaces can be solved using infinite dimensional linear programming approach [21]. In this approach an optimal occupation measure (joint probability measure) of the state and control is computed using a linear program. The optimal pair of control strategy and an initial distribution is obtained using the optimal occupation measure. Building on the single-agent solution, we will provide an infinite dimensional LP which will characterize the solution to our multi-agent problem described in Problem 8.1. To do so, we will need the following definitions. Definition 8.1. 1. Let w i (x i ,u i ) := 1 +c i (x i ,u i ) and ˆ w i (x i ) := inf u i ∈U iw i (x i ,u i ). 2. LetF i (X i ×U i ) be the vector space of measurable functions fromX i ×U i to R with finite w i norm. That is, f i ∈F i (X i ×U i ) if sup (x i ,u i )∈X i ×U i |f i (x i ,u i )| w i (x i ,u i ) <∞ . 
1 A Borel space is a Borel subset of complete and separable metric space 210 3. LetM i + (X i ×U i ) be the vector space of positive measures onX i ×U i with finite w i variations. That is, μ i ∈M i + (X i ×U i ) if Z X i ×U i w i (x i ,u i )μ i dx i ,du i <∞. 4. Define the bilinear formhf i ,μ i i for f i ∈F i (X i ×U i ), μ i ∈M i + (X i ×U i ) as follows: hf i ,μ i i := Z X i ×U i f i (x i ,u i )μ i dx i ,du i Let μ i ∈M i + (X i ×U i ) be a probability measure on the joint state and action space. Note that any distributionμ i ∈M i + (X i ×U i ) can be decomposed in terms of its marginal onX i and a conditional distribution over the control space φ i (·|x i ) such that μ i (B i ,C i ) = Z B i φ i (C i |x i )ˆ μ i (dx i ),∀B i ∈B(X i ),C i ∈B(U i ) (8.6) whereμ i (B i ,C i ) denotes the measure of the rectangle B i ×C i and ˆ μ i (B i ) :=μ i (B i ,U i ) for allB i ∈B(X i ) is the marginal ofμ i onX i . We will write the measureμ i = ˆ μ i ·φ i when the corresponding decomposition is as in (8.6). We can now describe the linear program that characterizes the optimal control strategies. Let μ 1 ∈M 1 + (X 1 ×U 1 ) and μ 2 ∈M 2 + (X 2 ×U 2 ). Consider the following linear program: LP-1: min μ 1 ,μ 2 hμ 1 ,c 1 i +hμ 2 ,c 2 i 211 subject to:hμ 1 ,d 1 i +hμ 2 ,d 2 i≤k (8.7) μ i (B,U i ) = Z X i ×U i Q i (B|x i ,u i )μ i dx i ,du i , ∀i,B∈B(X i ) (8.8) μ i (X i ,U i ) = 1,μ i ∈M i + (X i ×U i ) (8.9) LP-1 is an infinite dimensional linear program whose solution consists of a probability measure on the state and action space for each agent. Theorem 8.1 characterizes the solution to Problem 8.1 in terms of the solution to the LP-1 under the following assumption Assumption 8.2. 1. c i (x i ,u i ) is non-negative and inf-compact, d i (x i ,u i ) is non-negative and lower semi continuous∀i. 2. The transition kernel Q i is weakly continuous∀i. 3. d i ∈F i (X i ×U i ), ∀i. 4. R X i ˆ w i (y i )Q i (dy i |·)∈F i (X i ×U i ), ∀i Assumption 8.2 ensures that there exists a solution to LP-1. Similar assumption has been made in the analysis of single agent constrained MDP (see [21]). We are now ready to state our main result. Theorem 8.1. Under Assumption 8.1 and 8.2 there exists μ 1 ∗ ,μ 2 ∗ that achieve the optimal value of LP-1. Let μ i ∗ = ˆ μ i ∗ ·φ i ∗ ,i∈{1, 2} be the decomposition of μ ∗ i into the marginal and 212 conditional distribution as in (8.6). Then, an optimal control strategy for agenti in Problem 8.1 is a randomized stationary strategy φ i ∗ (·|x i ) and the optimal initial distribution is ˆ μ i ∗ (·). Moreover, the optimal cost achieved under (φ 1 ∗ ,φ 2 ∗ ) when the initial state distribution is (ˆ μ 1 ∗ , ˆ μ 2 ∗ ) ishμ 1 ∗ ,c 1 i +hμ 2 ∗ ,c 2 i. Proof outline. The proof follows by considering a centralized problem where a single agent knows the entire state and action history and takes both actions. The optimal cost of the centralized problem serves as a lower bound for Problem 8.1. We then establish that this lower bound is achieved under the control strategy and initial distribution described in Theorem 8.1. Note that Theorem 8.1 implies that each agent’s optimal strategy is a stationary Markov strategy since the distribution of U i t depends only onX i t . See Appendix G.1 for the complete proof. Remark 8.2. The results obtained in this section hold true when the state spaceX i and ac- tion spaceU i are finite. 
In this case, the infinite dimensional linear program LP-1 simplifies to the following finite dimensional linear program: min μ 1 ,μ 2 2 X i=1 X x i ,u i μ i (x i ,u i )c i (x i ,u i ) subject to 2 X i=1 X x i ,u i μ i (x i ,u i )d i (x i ,u i )≤k X u i μ i (x i ,u i ) = X y i ,u i Q i (x i |y i ,u i )μ i (y i ,u i ) ∀i,x i X x i ,u i μ i (x i ,u i ) = 1 and μ i (x i ,u i )≥ 0 ∀i,x i ,u i 213 Using Theorem 8.1, the optimal control strategy is the conditional distribution of the action obtained from μ i ∗ as follows: φ i ∗ (u i |x i ) := μ i ∗ (x i ,u i ) P ˜ u iμ i ∗ (x i , ˜ u i ) (8.10) In the finite case, it can be established that the optimal cost is independent of the initial state distribution. Moreover, the optimal control strategy is given by (8.10) for any initial state distribution. Similar observations were made in [155]. Remark 8.3. Consider the case when the system has N > 2 agents under the same as- sumptions. In addition, the agents have to satisfy multiple constraints of the form in (8.5). The results obtained in this section can be easily generalized to handle this case. We can write down the LP-1 in which each agent has an associated measureμ i and add a constraint in the linear program of the form in (8.7) corresponding to each joint constraint of the form in (8.5). Theorem 8.1 applies to arbitrary dynamics, cost and constraint functions as described in (8.1)-(8.4). When the dynamics and cost are specialized, the infinite dimensional linear program may be reducible to more tractable optimization problems. We demonstrate this for the linear quadratic systems in the next section. 214 8.4 Constrained Linear Quadratic Systems In this section, we consider an instance of Problem 8.1 when the system dynamics are linear, cost and constraint function have a quadratic form and the disturbances are Gaussian. We refer to such systems as the constrained Linear Quadratic Gaussian (LQG) multi-agent systems. The state X i t ∈R n i of agent i evolves according to the following linear dynamics: X i t+1 =A i X i t +B i U i t +W i t (8.11) where W i t ∼N (0,I), U i t ∈ R m i and A i ,B i are matrices of appropriate dimensions. The instantaneous cost and constraint function are given as follows: c i (X i t ,U i t ) = (X i t ) 0 Q i X i t + (U i t ) 0 R i U i t , (8.12) d i (X i t ,U i t ) = (X i t ) 0 M i X i t + (U i t ) 0 N i U i t , (8.13) where Q i ,M i ,R i ,N i are symmetric positive definite matrices for i∈{1, 2}. This problem can be seen as a special case of Problem 8.1 where the state and action spaces are Borel spaces sinceX i =R n i ,U i =R m i . It can be verified easily that Assumption 8.2 holds true for this problem and hence we can obtain the optimal control strategy by solving the LP-1 and using Theorem 8.1. For that purpose, we define the following moments associated with 215 a measure μ i onX i ×U i , m i x = Z R n i xμ i (dx,U i ), Σ i xx = Z R n i xx 0 μ i (dx,U i ) m i u = Z R m i uμ i (X i ,du), Σ i uu = Z R m i uu 0 μ i (X i ,du) Σ i xu = Z R n i×R m i xu 0 μ i (dx,du) The next theorem shows that in the case of LQG systems the infinite dimensional linear program (LP-1) can be reduced to a finite dimensional semi-definite program (SDP). Theorem 8.2. 
Consider the following SDP:
LQG-SDP:
$$\min_{\Sigma^i_{xx}, \Sigma^i_{uu}, \Sigma^i_{xu}} \; \sum_{i=1}^{2} \mathrm{Tr}(Q^i \Sigma^i_{xx}) + \mathrm{Tr}(R^i \Sigma^i_{uu})$$
subject to:
$$\sum_{i=1}^{2} \mathrm{Tr}(M^i \Sigma^i_{xx}) + \mathrm{Tr}(N^i \Sigma^i_{uu}) \leq k \qquad (8.14)$$
$$\Sigma^i_{xx} = A^i \Sigma^i_{xx} (A^i)' + A^i \Sigma^i_{xu} (B^i)' + B^i \Sigma^i_{ux} (A^i)' + B^i \Sigma^i_{uu} (B^i)' + I \qquad (8.15)$$
$$\begin{bmatrix} \Sigma^i_{xx} & \Sigma^i_{xu} \\ (\Sigma^i_{xu})' & \Sigma^i_{uu} \end{bmatrix} \succeq 0 \qquad (8.16)$$
Suppose $\Sigma^{i,*}_{xx}, \Sigma^{i,*}_{uu}, \Sigma^{i,*}_{xu}$ is the solution of the LQG-SDP. Then, the Gaussian measure on $\mathcal{X}^i \times \mathcal{U}^i$ with mean $0$ and second moments $\Sigma^{i,*}_{xx}, \Sigma^{i,*}_{uu}, \Sigma^{i,*}_{xu}$ is optimal for LP-1. Moreover, the optimal control strategy for agent $i$ is a Gaussian stationary randomized policy given as
$$\phi^i_*(U^i \mid X^i) \sim \mathcal{N}\left(m^i_{u|x}, \Sigma^i_{u|x}\right) \qquad (8.17)$$
where $m^i_{u|x} = \Sigma^{i,*}_{ux} (\Sigma^{i,*}_{xx})^{-1} X^i$ and $\Sigma^i_{u|x} = \Sigma^{i,*}_{uu} - \Sigma^{i,*}_{ux} (\Sigma^{i,*}_{xx})^{-1} \Sigma^{i,*}_{xu}$ for $i \in \{1, 2\}$. Also, the corresponding optimal initial distribution $\nu^i_*$ for agent $i$ is $\mathcal{N}(0, \Sigma^{i,*}_{xx})$.
Proof. We first show that it is sufficient to consider Gaussian measures in LP-1. Consider a measure $\mu^i$ on $\mathcal{X}^i \times \mathcal{U}^i$ with means $m^i_x, m^i_u$ and second moment matrix $\begin{bmatrix} \Sigma^i_{xx} & \Sigma^i_{xu} \\ (\Sigma^i_{xu})' & \Sigma^i_{uu} \end{bmatrix}$. Now, observe that
$$\langle \mu^i, c^i \rangle = \mathrm{Tr}(Q^i \Sigma^i_{xx}) + \mathrm{Tr}(R^i \Sigma^i_{uu}), \qquad \langle \mu^i, d^i \rangle = \mathrm{Tr}(M^i \Sigma^i_{xx}) + \mathrm{Tr}(N^i \Sigma^i_{uu}).$$
Suppose $(X^i, U^i) \sim \mu^i$ and let $\tilde{X}^i = A^i X^i + B^i U^i + W^i$ be the next state. Then, (8.8) encodes the constraint that the distributions of $X^i$ and $\tilde{X}^i$ should be the same. Let $\mu^i$ be a feasible measure for LP-1 which satisfies (8.8). This means that the first and the second moments of $X^i$ and $\tilde{X}^i$ should match when $(X^i, U^i) \sim \mu^i$, i.e.,
$$m^i_x = A^i m^i_x + B^i m^i_u \qquad (8.18)$$
$$\Sigma^i_{xx} = A^i \Sigma^i_{xx} (A^i)' + A^i \Sigma^i_{xu} (B^i)' + B^i \Sigma^i_{ux} (A^i)' + B^i \Sigma^i_{uu} (B^i)' + I \qquad (8.19)$$
Now, consider a Gaussian measure $\mu^i_g$ which has the same first and second moments as $\mu^i$. If $(X^i, U^i) \sim \mu^i_g$ are jointly Gaussian, then $\tilde{X}^i = A^i X^i + B^i U^i + W^i$ is also Gaussian with mean and covariance given by the right-hand sides of (8.18) and (8.19) above. Thus, $X^i$ and $\tilde{X}^i$ are both Gaussian with the same mean and covariance since (8.18) and (8.19) hold for the moments of $\mu^i_g$. Hence, $\mu^i_g$ satisfies (8.8). Also, $\langle \mu^i_g, c^i \rangle = \langle \mu^i, c^i \rangle$ and $\langle \mu^i_g, d^i \rangle = \langle \mu^i, d^i \rangle$ as $\mu^i$ and $\mu^i_g$ have the same second moments. Thus, for any feasible $\mu^i$ there exists a feasible Gaussian measure $\mu^i_g$ which achieves the same value of the linear program. Hence, it is sufficient to consider the class of Gaussian measures in LP-1. Since a Gaussian measure is completely characterized by its first and second moments, we can reduce LP-1 to the SDP presented in the theorem by setting $m^i_x = m^i_u = 0$ without loss of generality. Finally, using Theorem 8.1 and the fact that the optimal $\mu^i_*$ is Gaussian, it can be easily shown that the optimal control strategy is Gaussian with mean $m^i_{u|x}$ and covariance $\Sigma^i_{u|x}$ as defined in the theorem.
Based on the optimal randomized strategy in Theorem 8.2 (see (8.17)), one can write the optimal action of agent $i$ as follows:
$$U^{i,*}_t = K^i_* X^i_t + V^i_t,$$
where $K^i_* := \Sigma^{i,*}_{ux} (\Sigma^{i,*}_{xx})^{-1}$ and $V^i_t \sim \mathcal{N}(0, \Sigma^i_{u|x})$. Note that agent $i$ uses its local state in a linear fashion. As noted earlier, in the absence of the constraint (8.5), Problem 8.1 would decompose into two single-agent unconstrained LQG control problems. This would imply that the optimal unconstrained controller for each agent is also a linear function of its local state. However, the gain matrix in the unconstrained problem may be different from that obtained in Theorem 8.2.
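To illustrate how Theorem 8.2 can be used computationally, here is a sketch (not from the original text) of LQG-SDP in CVXPY together with the extraction of the gain $\Sigma^{i,*}_{ux}(\Sigma^{i,*}_{xx})^{-1}$ and the conditional covariance $\Sigma^i_{u|x}$ appearing in (8.17). The function name and interface are our own; the sketch assumes the problem data $A^i, B^i, Q^i, R^i, M^i, N^i$ are supplied as lists of NumPy arrays (one entry per agent) and $k$ as a scalar.

```python
import numpy as np
import cvxpy as cp

def solve_lqg_sdp(A, B, Q, R, M, N, k):
    """Sketch of LQG-SDP (Theorem 8.2). A, B, Q, R, M, N are lists of NumPy
    arrays, one entry per agent; k is the constraint threshold. Returns, for
    each agent, the gain Sigma_ux Sigma_xx^{-1} and the conditional covariance
    Sigma_{u|x} of the randomized policy (8.17)."""
    cost, budget, constraints, blocks = 0, 0, [], []
    for Ai, Bi, Qi, Ri, Mi, Ni in zip(A, B, Q, R, M, N):
        n, m = Ai.shape[0], Bi.shape[1]
        # Joint second-moment matrix of (X^i, U^i); PSD=True enforces (8.16).
        S = cp.Variable((n + m, n + m), PSD=True)
        Sxx, Sxu, Suu = S[:n, :n], S[:n, n:], S[n:, n:]
        cost = cost + cp.trace(Qi @ Sxx) + cp.trace(Ri @ Suu)
        budget = budget + cp.trace(Mi @ Sxx) + cp.trace(Ni @ Suu)
        # Stationarity of the state marginal under x+ = A x + B u + w, i.e. (8.15).
        constraints.append(
            Sxx == Ai @ Sxx @ Ai.T + Ai @ Sxu @ Bi.T
                   + Bi @ Sxu.T @ Ai.T + Bi @ Suu @ Bi.T + np.eye(n)
        )
        blocks.append((S, n))
    constraints.append(budget <= k)        # shared budget, constraint (8.14)
    cp.Problem(cp.Minimize(cost), constraints).solve()

    policies = []
    for S, n in blocks:
        Sxx, Sxu, Suu = S.value[:n, :n], S.value[:n, n:], S.value[n:, n:]
        gain = Sxu.T @ np.linalg.inv(Sxx)               # K^i_* = Sigma_ux Sigma_xx^{-1}
        cov = Suu - Sxu.T @ np.linalg.inv(Sxx) @ Sxu    # Sigma_{u|x} in (8.17)
        policies.append((gain, cov))
    return policies
```

Modeling the joint second-moment matrix of $(X^i, U^i)$ as a single PSD variable encodes (8.16) directly; the linear equality encodes the stationarity condition (8.15), and the shared budget (8.14) is the only constraint coupling the two agents.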
Also, the optimal constrained controller obtained via Theorem 8.2 has a noise termV i t in contrast with the deterministic linear controller in the unconstrained case. 218 In the next lemma, we show that the agents can in fact ignore the control noise and use a deterministic linear control strategy. Lemma 8.1. Let g i ∗ be the following deterministic stationary linear controller: g i ∗ (X i t ) := Σ i,∗ ux (Σ i,∗ xx ) −1 X i t . (8.20) where Σ i,∗ ux , Σ i,∗ xx are obtained from the SDP in Theorem 8.2. Then, g i ∗ is an optimal control strategy for agent i. Proof Outline. It can be shown, using an induction argument, that the expected instanta- neous cost and constraint under the optimal policy φ ∗ = (φ 1 ∗ ,φ 2 ∗ ) from theorem 8.2 is lower bounded by the expected instantaneous cost and constraint under g ∗ = (g 1 ∗ ,g 2 ∗ ) when the initial distribution is ˆ μ ∗ . That is, E φ∗ ˆ μ∗ [c(X t , U t )]≥ E g∗ ˆ μ∗ [c(X t , U t )] and E φ∗ ˆ μ∗ [d(X t , U t )]≥ E g∗ ˆ μ∗ [d(X t , U t )] for all time t. Therefore, the average cost and constraint function achieved under the pair (g ∗ , ˆ μ ∗ ) is not more than the average cost and constraint function achieved under the pair (φ ∗ , ˆ μ ∗ ). Hence, g i ∗ is also an optimal control strategy for agent i. See Appendix G.2 for a complete proof. So far we assumed that the noise in the system dynamics W i t was Gaussian. The following extends our results to non-Gaussian noise. Lemma 8.2. Suppose the system dynamics are as in (8.11) and the noise W i t is non- Gaussian with mean 0 and covariance matrix I. The results of Theorem 8.2 and Lemma 8.1 hold true for this case. 219 Proof Outline. It can be shown that LQG-SDP is a relaxation of LP-1 when the system dynamics are as in (8.11) with non-Gaussian noise and the cost, constraint function have the quadratic form in (8.12),(8.13). Therefore, the optimal value of this SDP is a lower bound for the optimal value of LP-1. We can further show that this lower bound is achieved if the agents follow the control strategy g i ∗ as described in Lemma 8.1. Therefore, g i ∗ is optimal in the non-Gaussian case as well. 8.5 Numerical Experiments In this section, we will present numerical experiments for a multi-agent LQ problem with constraints. Consider a two agent system where X i t ∈ R 2 and U i t ∈ R 2 for i = 1, 2. The system dynamics is characterized by the following matrices: A 1 = 1 2 1 1 ,B 1 = 3 1 2 −1 , A 2 =A 1 ,B 2 =B 1 . The cost matrices are given as follows: Q 1 = 4 2.8 2.8 2 ,R 1 = 14.5 3.4 3.4 0.8 , Q 2 = 0.9 0.4 0.4 0.2 ,R 2 = 1.3 1.2 1.2 1.2 . 220 Figure 8.1: Trajectory of the running average cost The constraint matrices are set to the following: M 1 = 1.1 0.9 0.9 0.75 ,N 1 = 0.1 0.3 0.3 1.1 , M 2 = 0.35 1.2 1.2 4.4 ,N 2 = 0.15 0.15 0.15 0.18 . Let J(t) := 1 t P t−1 s=0 P 2 i=1 (X i s ) 0 Q i X i s + (U i s ) 0 R i U i s be the running average cost and similarly K(t) be the running average constraint function. We will compare these running averages under the optimal constrained controller obtained from the SDP in Theorem 8.2 with the optimal unconstrained controllers for each agent. The optimal unconstrained controllers can be obtained by solving the discrete Riccati equation for each agent [22]. Figure 8.1 shows the average cost J(t) as a function of time for the optimal constrained 221 Figure 8.2: Trajectory of the running average constraint function controller (referred to as SDP controller in the figure) and the optimal unconstrained con- troller. 
It can be seen that the optimal constrained controller performs worse than the optimal unconstrained controller in terms of the achieved average cost. Figure 8.2 shows the average constraint K(t) as a function of time for the optimal constrained and uncon- strained controller when the constraint threshold is set to k = 7.6. It can be observed that the controller obtained via the SDP is able to satisfy the constraint threshold while the unconstrained controller could not. Thus, the optimal constrained controller is able to meet the constraint at the expense of higher cost compared to the optimal unconstrained controller. 8.6 Conclusion We considered the problem of weakly coupled constrained MDP with Borel state and action spaces. We showed that randomized stationary policies are optimal for each agent under 222 some assumptions on the transition kernels, cost and the constraint functions. Our approach was to consider a centralized problem where a single agent knows the entire state and action history and takes both the actions. We solve the centralized problem using the occupation measure based LP of [21] and established that the obtained solution is optimal for our original problem. Further, we considered the case of multi-agent LQG and showed that the infinite dimensional LP can be simplified to a SDP for obtaining the optimal control strategy. Finally, we illustrated our results through some numerical experiments. 223 Chapter 9 Concluding Remarks 9.1 Summary In this thesis, we studied some instances of sequential decision-making problems under different types of uncertainties. We will summarize the main results of the thesis in this section. 9.1.1 Learning In Chapters 2-4, we considered the problem when the agent(s) are uncertain about the underlying model of the system and have to learn to control the system. The goal was to design learning algorithms for the agent(s) which converge to the optimal behavior at a fast rate. To capture the rate of learning, we considered the notion of regret which characterizes 224 the difference between the cost of the learning algorithm against the optimal cost for the known system. We started with a single-agent problem of controlling an unknown MDP over an infinite time horizon with finite state and action spaces in chapter 2. The goal of the agent was to minimize the regret with respect to the optimal infinite horizon cost. We proposed a Thomp- son Sampling-based reinforcement learning algorithm with dynamic episodes (TSDE). We established ˜ O(HS √ AT ) bounds on expected regret under a Bayesian setting, where S and A are the sizes of the state and action spaces, T is time, and H is the bound of the span. This regret bound matches the lower bound on regret of any learning algorithm for MDPs up to logarithmic factors in T . The TSDE algorithm determines the end of an episode by two stopping criteria. The second criterion comes from the doubling trick used in many re- inforcement learning algorithms. But the first criterion on the linear growth rate of episode length seems to be a new idea for episodic learning algorithms. This stopping criterion is crucial in the proof of regret bound (Lemma 3.5). The simulation results of TSDE versus Lazy PSRL further shows that this criterion is not only a technical constraint for proofs, it indeed helps balance exploitation and exploration. In Chapter 3, we studied the problem of controlling an unknown LQ system over an infinite time horizon. 
This problem is more challenging than the problem in Chapter 2 since the state, action space and the cost are no longer bounded in this case. We proposed TSDE-LQ which is a TS based algorithm with dynamic episode lengths. Under some conditions on the prior distribution, we showed a ˜ O(d 0.5 x (d x +d u ) √ T ) bound on expected regret of TSDE-LQ where d x and d u are the dimensions of the state and controls. We then moved on to 225 the mean-field LQ learning problem. This is a multi-agent problem where the agents are coupled via the mean-field of the state and actions in both the dynamics and the cost of the system. So the effective dimension of the state and the controls is nd x and nd u , where n is the number of agents. Therefore, if we naively apply TSDE-LQ to this problem, its computational complexity grows as O(n 3 ) and the regret bound is ˜ O(n 1.5 d 0.5 x (d x +d u ) √ T ) which grows polynomially inn. We proposed a TS based learning algorithm TSDE-MF which exploits the structure of the optimal solution for the mean-field LQ problem with known dynamics. The main idea of this scheme is to decompose the learning into learning the mean-field system ( ¯ θ) and the relative-state system of each agent ( ˘ θ) separately. Moreover, at each time one agent’s relative state-action observations are picked adaptively to learn ˘ θ. We show that the regret of TSDE-MF is upper bounded by ˜ O(d 0.5 x (d x +d u ) √ T ) which is independent of the number of agents n. Moreover, its computational complexity per agent is independent of n. In Chapter 4, we consider the problem of learning in two decentralized control problems with finite state and action spaces. The state transition kernels of the agents are parametrized by an unknown but fixed parameterθ taking values in a finite space. A decentralized Thompson sampling based algorithm is proposed for two different dynamics and information sharing models. The regret achieved by Thompson sampling is shown to be upper bounded by a constant independent of the time horizon for both the cases. One key assumption for getting this regret bound is that KL divergence between the state transition kernels under different choices of θ is lower bounded by a positive number (Assumption 4.1,4.2). This allowed us to show that posterior distribution converges to the true parameter exponentially fast in 226 expectation. 9.1.2 Estimation In Chapter 5, we studied the problem of optimal scheduling and estimation in a sequential remote estimation system where non-collocated sensors and estimators communicate over a shared medium. The access to the communication resources is granted by a scheduler, which implements an observation-driven medium access control scheme to avoid packet collisions. We first looked at the case when the sensors’ source dynamics are independent and identically distributed in time and the scheduler is harvesting random energy from its environment. The energy level at the scheduler has a stochastic dynamics, which couples the decision-making process in time. The optimal solutions to such remote estimation problems are typically challenging to find due to the presence of signaling between the scheduler and estimators. The main result herein is to establish, under certain assumptions on the probabilistic model of the sources, the joint optimality of a pair of scheduling and estimation strategies. 
In particular, we showed that the optimal scheduling to send the state of the source with highest norm if it exceeds a certain threshold which depends on the time and the energy level. The optimal estimate in the absence on a received observation is the expected value of the source state. More importantly, the globally optimal solution is obtained despite the lack of convexity in the objective function introduced by signaling. Our proof consists of a judicious expansion of the information sets at the estimators, which enables the use of the common information approach to solving a single dynamic program from the perspective of a fictitious coordinator. Finally, by noticing that the optimal solution to this “relaxed” 227 problem does not depend on the additional information introduced in the expansion, it is also shown to be optimal for the original optimization problem. As a byproduct, our proof technique also applies to more general settings with an arbitrary number of sensors, unequal weights, and communication costs. Next, we moved on to the case when the sensors’ source dynamics are Markovian. Under the independence and symmetry assumptions on the noise distribution and by restricting the search for scheduling strategy to the class of symmetric strategies, we showed that the optimal estimate for each estimator is its most recently received observation. We also obtained a dynamic program which characterized the optimal scheduling strategy. Furthermore, we established some properties of the value functions of the dynamic program which may be helpful in computing the best symmetric scheduling strategy. In Chapter 6, we studied the sequential remote estimation problem with a single sensor and estimator but with adversarial noise. The sensor has a limited communication budget to communicate with the estimator. Our objective was to find jointly optimal scheduling and estimation strategies which minimize the worst-case maximum instantaneous estima- tion error over the time horizon. This problem is a decentralized minimax decision-making problem with non-classical information structure. Common information approach [15] pro- vides a principled way to obtain optimal strategies in decentralized problems with common information among the agents. However, the ideas in [15] are only presented for prob- lems with stochastic noise. We argued that the ideas from common information approach [15] also extend to our minimax setting which allowed us to write down a dynamic pro- gram (DP) using the DP decomposition for centralized minimax problems. This dynamic 228 program, however, involved minimization over functions and thus was intractable computa- tionally. By identifying a key property of the value functions, we were able to characterize the globally optimal strategies. In particular, we showed that an open loop transmission strategy and a simple Kalman-like estimator are jointly optimal. We also described related problems where the same optimal strategy holds. 9.1.3 Decentralized control In Chapter 7, we considered a general decentralized minimax control problem with the par- tial history sharing information structure. We model the noise in the system as uncertain quantities taking values in known finite sets and the objective was to compute decentral- ized control strategies which minimize the worst-case system cost. 
This chapter combines the work of centralized minimax control by Bertsekas [120] and the common information approach [15] to develop a dynamic programming decomposition for decentralized minimax control problems. We start with a system which incurs only a terminal cost and formulate an equivalent problem from a fictitious coordinator’s perspective which has access to the shared data (common information) among the controllers. A dynamic program based so- lution is developed where the information state is the set of feasible values of the current state and local information consistent with the information that is commonly known to all controllers. We further extend our results to a system with additive cost and to the case where all the controllers have a common observation. Finally in Chapter 8, we move on to a stochastic decentralized control problem with con- straints. In this chapter, we considered the problem of weakly coupled constrained MDP 229 with Borel state and action spaces. We showed that randomized stationary policies are optimal for each agent under some assumptions on the transition kernels, cost and the con- straint functions. Our approach was to consider a centralized problem where a single agent knows the entire state and action history and takes both the actions. We solved the cen- tralized problem using the occupation measure based linear programming (LP) approach of [21] and established that the obtained solution is optimal for our original problem. Fur- ther, we considered the case of multi-agent LQ control problem with quadratic constraints and showed that the infinite dimensional LP can be simplified to a SDP for obtaining the optimal control strategies. 9.2 Future Directions In this section we discuss some questions which are of interest based on the topics studied in this thesis. 9.2.1 Decentralized Learning While designing learning algorithms for large scale systems, it is desired that they operate in a fully decentralized manner. In Chapter 3, we designed a TS based learning algorithm TSDE-MF for a mean-field LQ learning problem. AlthoughTSDE-MF does not require complete state information at each agents, it is not a fully decentralized algorithm. At each time t, each agent needs to know the relative state-action tuple history{(˘ x i ∗ s s , ˘ u i ∗ s s , ˘ x i ∗ s s+1 )} 1≤s<t to evaluate the posterior distribution ˘ λ t . This requires coordination among the agents to 230 compute i ∗ t and for communicating the tuple of the agent i ∗ t at time t. Thus, one can pose the following question: Can we design a fully decentralized learning algorithm for the mean-field learning problem with provable regret bounds? We outline a decentralized TS algorithm similar to TSDE-MF below. Suppose the information of agent i at time t is given byI i t ={x i 1:t , ¯ x 1:t ,u i 1:t−1 , ¯ u 1:t−1 }. Each agent starts with the prior distribution ˘ λ 1 on ˘ θ. Agent i maintains a posterior over ˘ θ at each time t as follows: ˘ λ i t ( ˘ θ) =P( ˘ θ∈ ˘ Θ|˘ x i 1:t , ˘ u i 1:t−1 ) (9.1) Also, each agent starts with the prior distribution ¯ λ 1 on ¯ θ and maintains a posterior over ¯ θ as follows ¯ λ t ( ¯ Θ) =P( ¯ θ∈ ¯ Θ|¯ x 1:t , ¯ u 1:t−1 ) (9.2) Agent i generates a sample ¯ θ k from ¯ λ t whenever the sampling condition for ¯ λ t becomes true. Similarly, agent i generates a sample ˘ θ i k from ˘ λ i t when the sampling condition for ˘ λ i t becomes true. 
Agent i’s control action is computed as follows: ¯ v t =G( ¯ θ k 0 )¯ x t , ˘ v i t =G( ˘ θ i k i )˘ x i t u i t = ˘ v i t + ¯ v t 231 The above learning scheme works in a decentralized fashion. However, it is not clear if we can get theoretical guarantees on the regret of this scheme. In Chapter 4, we designed decentralized TS based learning algorithms for two classes of decentralized control problems. There are two interesting questions one can pose in this direction: 1. The TS algorithm in Chapter 4 is restricted to only two special dynamics and informa- tion sharing models: i) Decoupled dynamics with no information sharing, ii) Coupled dynamics with one-step delayed information sharing. Can we design decentralized learning algorithms for more general information structures and dynamics model with provable guarantees on their performance? 2. The regret bounds in Chapter 4 are limited by a key assumption on the state transition kernels (Assumption 4.1,4.2). Can we obtain similar guarantees on the regret of the TS algorithm by relaxing Assumption 4.1,4.2? 9.2.2 Decentralized control with constraints In Chapter 8, we looked at a weakly coupled constrained MDP problem where the agents were coupled only via a constraint. In order to derive the optimal decentralized strategies we exploited the following two structural properties of this problem: 1. Decoupled dynamics 2. Additive cost and constraint function 232 However, if either of the above property is absent i.e. if the dynamics of the agents are coupled or the cost/constraint function is not additive across the agents, then it is not clear how to approach the problem of finding optimal decentralized strategies for team problems with constraints. One question which arises then is How to find optimal decentralized strategies for general team problems with constraints? One interesting direction to answer this question is to use the idea of common informa- tion approach [15]. Common information approach is a principled way to find optimal decentralized control strategies for general unconstrained team problems. The central idea of this approach is to convert the team problem into an equivalent single-agent partially observed markov decision process (POMDP) and then solve the POMDP to obtain the optimal strategies for the team problem. We outline a possible sketch of the solution methodology for the constrained team problems below: 1. Transform the constrained team problem into an equivalent POMDP with constraints by following the steps of transformation of a general unconstrained team problem to a POMDP in common information approach [15]. 2. Use tools and techniques from constrained POMDP (CPOMDP) literature [167–170] to solve this equivalent constrained POMDP. 3. Transform the solution of the constrained POMDP to obtain optimal strategies for the constrained team problem. 233 It would be interesting to see whether the above methodology can work for general team problems with constraints. 234 Bibliography [1] K.-D. Kim and P. R. Kumar, “Cyber–physical systems: A perspective at the centen- nial,” Proceedings of the IEEE, vol. 100, no. Special Centennial Issue, pp. 1287–1308, 2012. [2] J. P. Hespanha, P. Naghshtabrizi, and Y. Xu, “A survey of recent results in networked control systems,” Proceedings of the IEEE, vol. 95, no. 1, pp. 138–162, 2007. [3] R. A. Gupta and M.-Y. Chow, “Networked control system: Overview and research trends,” IEEE transactions on industrial electronics, vol. 57, no. 7, pp. 2527–2535, 2010. [4] A. L. Dimeas and N. D. 
Hatziargyriou, “Operation of a multiagent system for micro- grid control,” IEEE Transactions on Power systems, vol. 20, no. 3, pp. 1447–1455, 2005. [5] A. Vaccaro, G. Velotto, and A. F. Zobaa, “A decentralized and cooperative architec- ture for optimal voltage regulation in smart grids,” IEEE Transactions on Industrial Electronics, vol. 58, no. 10, pp. 4593–4602, 2011. 235 [6] P. R. Kumar and P. Varaiya, Stochastic systems: Estimation, identification, and adaptive control. SIAM, 2015. [7] K. J. Astrom and B. Wittenmark, Adaptive Control. Addison-Wesley Longman Pub- lishing Co., Inc., 1994. [8] S. Sastry and M. Bodson, Adaptive control: stability, convergence, and robustness. Prentice-Hall, Inc., 1989. [9] K. Narendra and A. Annaswamy, Stable adaptive systems. Prentice-Hall, Inc., 1989. [10] A. N. Burnetas and M. N. Katehakis, “Optimal adaptive policies for markov decision processes,” Mathematics of Operations Research, vol. 22, no. 1, pp. 222–255, 1997. [11] T. Lattimore and C. Szepesv´ ari, “Bandit algorithms,” preprint, 2018. [12] W. R. Thompson, “On the likelihood that one unknown probability exceeds another in view of the evidence of two samples,” Biometrika, vol. 25, no. 3/4, pp. 285–294, 1933. [13] S. L. Scott, “A modern bayesian look at the multi-armed bandit,” Applied Stochastic Models in Business and Industry, vol. 26, no. 6, pp. 639–658, 2010. [14] O. Chapelle and L. Li, “An empirical evaluation of thompson sampling,” in NIPS, 2011. [15] A. Nayyar, A. Mahajan, and D. Teneketzis, “Decentralized stochastic control with partial history sharing: A common information approach,” IEEE Transactions on Automatic Control, vol. 58, no. 7, pp. 1644–1658, 2013. 236 [16] M. Kearns and S. Singh, “Near-optimal reinforcement learning in polynomial time,” Machine Learning, vol. 49, no. 2-3, pp. 209–232, 2002. [17] R. I. Brafman and M. Tennenholtz, “R-max-a general polynomial time algorithm for near-optimal reinforcement learning,” Journal of Machine Learning Research, vol. 3, no. Oct, pp. 213–231, 2002. [18] P. L. Bartlett and A. Tewari, “Regal: A regularization based algorithm for reinforce- ment learning in weakly communicating mdps,” in UAI, 2009. [19] T. Jaksch, R. Ortner, and P. Auer, “Near-optimal regret bounds for reinforcement learning,” Journal of Machine Learning Research, vol. 11, no. Apr, pp. 1563–1600, 2010. [20] A. Nayyar, T. Ba¸ sar, D. Teneketzis, and V. V. Veeravalli, “Optimal strategies for com- munication and remote estimation with an energy harvesting sensor,” IEEE Trans- actions on Automatic Control, vol. 58, no. 9, pp. 2246–2260, 2013. [21] O. Hern´ andez-Lerma, J. Gonz´ alez-Hern´ andez, and R. R. L´ opez-Mart´ ınez, “Con- strained average cost markov control processes in borel spaces,” SIAM Journal on Control and Optimization, vol. 42, no. 2, pp. 442–468, 2003. [22] D. P. Bertsekas, Dynamic programming and optimal control, vol. 2. Athena Scientific, Belmont, MA, 2012. [23] T. L. Lai and H. Robbins, “Asymptotically efficient adaptive allocation rules,” Ad- vances in applied mathematics, vol. 6, no. 1, pp. 4–22, 1985. 237 [24] S. Filippi, O. Capp´ e, and A. Garivier, “Optimism in reinforcement learning and kullback-leibler divergence,” in Allerton, pp. 115–122, 2010. [25] C. Dann and E. Brunskill, “Sample complexity of episodic fixed-horizon reinforcement learning,” in NIPS, 2015. [26] M. Strens, “A bayesian framework for reinforcement learning,” in ICML, 2000. [27] I. Osband, D. Russo, and B. 
Van Roy, “(More) efficient reinforcement learning via posterior sampling,” in NIPS, 2013. [28] R. Fonteneau, N. Korda, and R. Munos, “An optimistic posterior sampling strategy for bayesian reinforcement learning,” in BayesOpt2013, 2013. [29] A. Gopalan and S. Mannor, “Thompson sampling for learning parameterized markov decision processes,” in COLT, 2015. [30] Y. Abbasi-Yadkori and C. Szepesv´ ari, “Bayesian optimal control of smoothly param- eterized systems.,” in UAI, 2015. [31] I. Osband and B. Van Roy, “Why is posterior sampling better than optimism for reinforcement learning,” EWRL, 2016. [32] I. Osband and B. Van Roy, “Posterior sampling for reinforcement learning without episodes,” arXiv preprint arXiv:1608.02731, 2016. [33] D. Russo and B. Van Roy, “Learning to optimize via posterior sampling,” Mathematics of Operations Research, vol. 39, no. 4, pp. 1221–1243, 2014. 238 [34] A. L. Strehl and M. L. Littman, “An analysis of model-based interval estimation for markov decision processes,” Journal of Computer and System Sciences, vol. 74, no. 8, pp. 1309–1331, 2008. [35] G. C. Goodwin and K. S. Sin, Adaptive filtering prediction and control. Courier Corporation, 2014. [36] A. Becker, P. Kumar, and C.-Z. Wei, “Adaptive control with the stochastic approxima- tion algorithm: Geometry and convergence,” IEEE T. on Automatic Control, vol. 30, no. 4, pp. 330–338, 1985. [37] H.-F. Chen and L. Guo, “Convergence rate of least-squares identification and adap- tive control for stochastic systems,” International Journal of Control, vol. 44, no. 5, pp. 1459–1476, 1986. [38] M. C. Campi and P. Kumar, “Adaptive linear quadratic gaussian control: the cost- biased approach revisited,” SIAM Journal on Control and Optimization, vol. 36, no. 6, pp. 1890–1907, 1998. [39] Y. Abbasi-Yadkori and C. Szepesv´ ari, “Regret bounds for the adaptive control of linear quadratic systems,” in Proceedings of the 24th Annual Conference on Learning Theory, pp. 1–26, 2011. [40] M. Simchowitz and D. J. Foster, “Naive exploration is optimal for online lqr,” arXiv preprint arXiv:2001.09576, 2020. [41] D. J. Russo, B. Van Roy, A. Kazerouni, I. Osband, Z. Wen, et al., “A tutorial on thompson sampling,” Foundations and Trends R in Machine Learning, 2018. 239 [42] S. Agrawal and N. Goyal, “Analysis of thompson sampling for the multi-armed bandit problem,” in Conference on Learning Theory, 2012. [43] E. Kaufmann, N. Korda, and R. Munos, “Thompson sampling: An asymptotically optimal finite-time analysis,” in International Conference on Algorithmic Learning Theory, pp. 199–213, Springer, 2012. [44] S. Agrawal and N. Goyal, “Thompson sampling for contextual bandits with linear payoffs.,” in ICML (3), pp. 127–135, 2013. [45] M. J. Kim, “Thompson sampling for stochastic control: The finite parameter case,” IEEE Transactions on Automatic Control, vol. 62, no. 12, pp. 6415–6422, 2017. [46] M. Abeille and A. Lazaric, “Thompson sampling for linear-quadratic control prob- lems,” in AISTATS 2017-20th International Conference on Artificial Intelligence and Statistics, 2017. [47] M. K. S. Faradonbeh, A. Tewari, and G. Michailidis, “On optimality of adaptive linear-quadratic regulators,” preprint arXiv:1806.10749, 2018. [48] M. Abeille and A. Lazaric, “Improved regret bounds for thompson sampling in linear quadratic control problems,” in International Conference on Machine Learning, pp. 1– 9, 2018. [49] Y. Ouyang, M. Gagrani, and R. 
Jain, “Control of unknown linear systems with thomp- son sampling,” in 55th Annual Allerton Conference on Communication, Control, and Computing, pp. 1198–1205, 2017. 240 [50] J.-M. Lasry and P.-L. Lions, “Mean field games,” Japanese Journal of Mathematics, vol. 2, no. 1, pp. 229–260, 2007. [51] M. Huang, P. E. Caines, and R. P. Malham´ e, “Large-population cost-coupled LQG problems with nonuniform agents: individual-mass behavior and decentralized epsilon-Nash equilibria,” IEEE Transactions on Automatic Control, vol. 52, no. 9, pp. 1560–1571, 2007. [52] M. Huang, P. E. Caines, and R. P. Malham´ e, “Social optima in mean field LQG control: centralized and decentralized strategies,” IEEE Transactions on Automatic Control, vol. 57, no. 7, pp. 1736–1751, 2012. [53] G. Y. Weintraub, C. L. Benkard, and B. V. Roy, “Oblivious Equilibrium: A Mean Field Approximation for Large-Scale Dynamic Games,” in Advances in Neural Infor- mation Processing Systems, pp. 1489–1496, Dec. 2005. [54] G. Y. Weintraub, C. L. Benkard, and B. Van Roy, “Markov perfect industry dynamics with many firms,” Econometrica, vol. 76, no. 6, pp. 1375–1411, 2008. [55] D. A. Gomes et al., “Mean field games models—a brief survey,” Dynamic Games and Applications, vol. 4, no. 2, pp. 110–154, 2014. [56] J. Sternby, “On consistency for the method of least squares using martingale theory,” IEEE T. on Automatic Control, vol. 22, no. 3, pp. 346–352, 1977. [57] Y. Abbasi-Yadkori, N. Lazic, and C. Szepesv´ ari, “Regret bounds for model-free linear quadratic control,” preprint arXiv:1804.06021, 2018. 241 [58] S. Tu and B. Recht, “Least-squares temporal difference learning for the linear quadratic regulator,” arXiv preprint arXiv:1712.08642, 2017. [59] M. Ibrahimi, A. Javanmard, and B. V. Roy, “Efficient reinforcement learning for high dimensional linear quadratic systems,” in Advances in Neural Information Processing Systems (NIPS), pp. 2636–2644, 2012. [60] M. K. S. Faradonbeh, A. Tewari, and G. Michailidis, “Finite time identification in unstable linear systems,” Automatica, vol. 96, pp. 342–353, 2018. [61] J. Arabneydi and A. Mahajan, “Linear quadratic mean field teams: Optimal and ap- proximately optimal decentralized solutions,” arXiv preprint arXiv:1609.00056, 2017. [62] H. S. Witsenhausen, “Separation of estimation and control for discrete time systems,” Proceedings of the IEEE, vol. 59, no. 11, pp. 1557–1566, 1971. [63] Y.-C. Ho, “Team decision theory and information structures,” Proceedings of the IEEE, vol. 68, no. 6, pp. 644–654, 1980. [64] N. Sandell and M. Athans, “Solution of some nonclassical lqg stochastic decision problems,” IEEE Transactions on Automatic Control, vol. 19, no. 2, pp. 108–116, 1974. [65] M. Rotkowitz and S. Lall, “A characterization of convex problems in decentralized control,” IEEE Transactions on Automatic Control, vol. 51, no. 2, pp. 274–286, 2006. [66] B. Bamieh and P. G. Voulgaris, “A convex characterization of distributed control problems in spatially invariant systems with communication constraints,” Systems & Control Letters, vol. 54, no. 6, pp. 575–583, 2005. 242 [67] L. Lessard and S. Lall, “Optimal control of two-player systems with output feedback,” IEEE Transactions on Automatic Control, vol. 60, no. 8, pp. 2129–2144, 2015. [68] A. Lamperski and J. C. Doyle, “TheH 2 control problem for quadratically invari- ant systems with delays,” IEEE Transactions on Automatic Control, vol. 60, no. 7, pp. 1945–1950, 2015. [69] A. Mahajan, N. C. Martins, M. C. Rotkowitz, and S. 
Y¨ uksel, “Information structures in optimal decentralized control,” in Decision and Control (CDC), 2012 IEEE 51st Annual Conference on, pp. 1291–1306, IEEE, 2012. [70] A. Nayyar, A. Mahajan, and D. Teneketzis, “Optimal control strategies in delayed sharing information structures,” IEEE Transactions on Automatic Control, vol. 56, no. 7, pp. 1606–1620, 2011. [71] J. Wu and S. Lall, “A dynamic programming algorithm for decentralized Markov decision processes with a broadcast structure,” in Decision and Control (CDC), 2010 49th IEEE Conference on, pp. 6143–6148, IEEE, 2010. [72] A. Mahajan, “Optimal decentralized control of coupled subsystems with control shar- ing,” in Decision and Control and European Control Conference (CDC-ECC), 2011 50th IEEE Conference on, pp. 5726–5731, IEEE, 2011. [73] C. Amato, G. Chowdhary, A. Geramifard, N. K. Ure, and M. J. Kochenderfer, “De- centralized control of partially observable Markov decision processes,” in Decision and Control (CDC), 2013 IEEE 52nd Annual Conference on, pp. 2398–2405, IEEE, 2013. 243 [74] J. Huang, B. Yang, and D.-y. Liu, “A distributed q-learning algorithm for multi-agent team coordination,” in Machine learning and cybernetics, 2005. proceedings of 2005 international conference on, vol. 1, pp. 108–113, IEEE, 2005. [75] L. Matignon, G. J. Laurent, and N. Le Fort-Piat, “Hysteretic q-learning: an algorithm for decentralized reinforcement learning in cooperative multi-agent teams,” in Intelli- gent Robots and Systems, 2007. IROS 2007. IEEE/RSJ International Conference on, pp. 64–69, IEEE, 2007. [76] S. Kapetanakis and D. Kudenko, “Reinforcement learning of coordination in heteroge- neous cooperative multi-agent systems,” in Adaptive Agents and Multi-Agent Systems II, pp. 119–131, Springer, 2005. [77] J. Arabneydi and A. Mahajan, “Reinforcement learning in decentralized stochastic control systems with partial history sharing,” in American Control Conference (ACC), 2015, pp. 5449–5456, IEEE, 2015. [78] Y. Ouyang, M. Gagrani, A. Nayyar, and R. Jain, “Learning unknown markov deci- sion processes: A thompson sampling approach,” in Advances in Neural Information Processing Systems, pp. 1333–1342, 2017. [79] C.-C. Huang, D. Isaacson, and B. Vinograde, “The rate of convergence of certain nonhomogeneous Markov chains,” Zeitschrift f¨ ur Wahrscheinlichkeitstheorie und Ver- wandte Gebiete, vol. 35, no. 2, pp. 141–146, 1976. [80] A. Bemporad, M. Heemels, M. Johansson, et al., Networked control systems, vol. 406. Springer, 2010. 244 [81] M. M. Vasconcelos and N. C. Martins, “Optimal estimation over the collision channel,” IEEE Transactions on Automatic Control, vol. 62, no. 1, pp. 321–336, 2017. [82] M. M. Vasconcelos and N. C. Martins, “Optimal remote estimation of discrete ran- dom variables over the collision channel,” IEEE Transactions on Automatic Control, vol. 64, pp. 1519 – 1534, April 2019. [83] S. Y¨ uksel and T. Basar, “Stochastic networked control systems,” AMC, vol. 10, p. 12, 2013. [84] U. Mitra, A. Emken, S. Lee, M. Li, V. Rozgic, G. Thatte, H. Vathsangam, D.-S. Zois, M. Annavaram, S. Narayanan, M. Levorato, D. Spruijt-Metz, and G. S. Sukhatme, “KNOW-ME: a case study in wireless body area sensor network design,” IEEE Com- munications Magazine, vol. 50, pp. 116–125, May 2012. [85] D.-S. Zois, M. Levorato, and U. Mitra, “Energy-efficient, heterogeneous sensor selec- tion for physical activity detection in wireless body area networks,” IEEE Transac- tions on Signal Processing, vol. 61, pp. 1581–1594, April 2013. [86] D.-S. 
Zois, “Sequential decision-making in healthcare IoT: Real-time health monitor- ing, treatments and interventions,” in IEEE 3rd World Forum on Internet of Things, 2016. [87] A. Kundu and D. E. Quevedo, “Stabilizing scheduling policies for networked control systems,” IEEE Transactions on Control of Network Systems, 2019. 245 [88] M. Kl¨ ugel, M. H. Mamduhi, O. Ayan, M. Vilgelm, K. H. Johansson, S. Hirche, and W. Kellerer, “Joint cross-layer optimization in real-time networked control systems,” 2019. [89] K. Liu, A. Ma, Y. Xia, Z. Sun, and K. H. Johansson, “Network scheduling and control co-design for multi-loop MPC,” IEEE Transactions on Automatic Control, vol. 64, pp. 5238–5245, Dec 2019. [90] L. Shi and H. Zhang, “Scheduling two Gauss-Markov systems: An optimal solution for remote state estimation under bandwidth constraint,” IEEE Transactions on Signal Processing, vol. 60, pp. 2038–2042, April 2012. [91] M. Xia, V. Gupta, and P. J. Antsaklis, “Networked state estimation over a shared communication medium,” IEEE Transactions on Automatic Control, vol. 62, no. 4, pp. 1729–1741, 2017. [92] A. Molin, H. Esen, and K. H. Johansson, “Scheduling networked state estimators based on value of information,” Automatica, vol. 110, p. 108578, 2019. [93] S. Knorn and D. E. Quevedo, Energy Harvesting for Wireless Sensor Networks: Tech- nology, Components and System Design, ch. Optimal energy allocation in energy har- vesting and sharing wireless sensor networks. De Gruyter Oldenbourg, 2019. [94] S. Knorn, S. Dey, A. Ahl´ en, and D. E. Quevedo, “Optimal energy allocation in multi- sensor estimation over wireless channels using energy harvesting and sharing,” IEEE Transactions on Automatic Control, vol. 64, pp. 4337–4344, Oct 2019. 246 [95] Z. Guo, Y. Ni, W. S. Wong, and L. Shi, “Time synchronization attack and counter- measure for multi-system scheduling in remote estimation,” 2019. [96] O. C. Imer and T. Basar, “Optimal estimation with limited measurements,” Inter- national Journal of Systems, Control and Communications, vol. 2, no. 1-3, pp. 5–29, 2010. [97] G. M. Lipsa and N. C. Martins, “Remote state estimation with communication costs for first-order lti systems,” IEEE Transactions on Automatic Control, vol. 56, no. 9, pp. 2013–2025, 2011. [98] J. Wu, Q. Jia, K. H. Johansson, and L. Shi, “Event-based sensor data scheduling: Trade-off between communication rate and estimation quality,” IEEE Transactions on Automatic Control, vol. 58, pp. 1041–1046, April 2013. [99] A. S. Leong, S. Dey, and D. E. Quevedo, “Transmission scheduling for remote state estimation and control with an energy harvesting sensor,” Automatica, vol. 91, pp. 54– 60, May 2018. [100] S. Wu, X. Ren, Q. Jia, K. H. Johansson, and L. Shi, “Learning optimal scheduling policy for remote state estimation under uncertain channel condition,” IEEE Trans- actions on Control of Network Systems, 2019. [101] A. S. Leong, A. Ramaswamy, D. E. Quevedo, H. Karl, and L. Shi, “Deep reinforce- ment learning for wireless sensor scheduling in cyber–physical systems,” Automatica, vol. 113, 2020. 247 [102] A. S. Leong, D. E. Quevedo, D. Dolz, and S. Dey, “Transmission scheduling for remote state estimation over packet dropping links in the presence of an eavesdropper,” IEEE Transactions on Automatic Control, vol. 64, pp. 3732–3739, Sep. 2019. [103] J. Lu, A. S. Leong, and D. E. Quevedo, “An event-triggered transmission scheduling strategy for remote state estimation in the presence of an eavesdropper,” 2019. [104] A. Molin and S. 
Hirche, “On the optimality of certainty equivalence for event-triggered control systems,” IEEE Transactions on Automatic Control, vol. 58, pp. 470–474, Feb 2013. [105] Y.-C. Ho, M. Kastner, and E. Wong, “Teams, signaling, and information theory,” IEEE Transactions on Automatic Control, vol. 23, pp. 305–312, April 1978. [106] H. S. Witsenhausen, “A counter-example in stochastic optimum control,” SIAM Jour- nal on Control, vol. 6, no. 1, pp. 131–147, 1968. [107] Y.-C. Ho, M. Kastner, and E. Wong, “Teams, signaling, and information theory,” IEEE Transactions on Automatic Control, vol. 23, no. 2, pp. 305–312, 1978. [108] H. S. Witsenhausen, “A counterexample in stochastic optimum control,” SIAM Jour- nal on Control, vol. 6, no. 1, pp. 131–147, 1968. [109] H. Li, L. Lai, and W. Zhang, “Communication requirement for reliable and secure state estimation and control in smart grid,” IEEE Transactions on Smart Grid, vol. 2, pp. 476–486, Sept 2011. [110] J. P. Hespanha, P. Naghshtabrizi, and Y. Xu, “A survey of recent results in networked control systems,” Proceedings of the IEEE, vol. 95, pp. 138–162, Jan 2007. 248 [111] M. S. Kiran, P. Rajalakshmi, K. Bharadwaj, and A. Acharyya, “Adaptive rule engine based iot enabled remote health care data acquisition and smart transmission system,” in Internet of Things (WF-IoT), 2014 IEEE World Forum on, pp. 253–258, IEEE, 2014. [112] M. Athans, “On the determination of optimal costly measurement strategies for linear stochastic systems,” Automatica, vol. 8, no. 4, pp. 397–412, 1972. [113] J. S. Baras and A. Bensoussan, “Optimal sensor scheduling in nonlinear filtering of diffusion processes,” SIAM Journal on Control and Optimization, vol. 27, no. 4, pp. 786–813, 1989. [114] W. Wu and A. Arapostathis, “Optimal sensor querying: General markovian and lqg models with controlled observations,” IEEE Transactions on Automatic Control, vol. 53, pp. 1392–1405, July 2008. [115] M. Naghshvar and T. Javidi, “Active hypothesis testing: Sequentiality and adaptivity gains,” in 2012 46th Annual Conference on Information Sciences and Systems (CISS), pp. 1–6, March 2012. [116] O. C. Imer and T. Basar, “Optimal estimation with limited measurements,” in Deci- sion and Control, 2005 and 2005 European Control Conference. CDC-ECC’05. 44th IEEE Conference on, pp. 1029–1034, IEEE, 2005. [117] M. Rabi, G. V. Moustakides, and J. S. Baras, “Adaptive sampling for linear state estimation,” SIAM Journal on Control and Optimization, vol. 50, no. 2, pp. 672–702, 2012. 249 [118] Y. Xu and J. P. Hespanha, “Optimal communication logics in networked control systems,” in Decision and Control, 2004. CDC. 43rd IEEE Conference on, vol. 4, pp. 3527–3532, IEEE, 2004. [119] J. Chakravorty and A. Mahajan, “Fundamental limits of remote estimation of autore- gressive markov processes under communication constraints,” IEEE Transactions on Automatic Control, vol. 62, pp. 1109–1124, March 2017. [120] D. Bertsekas and I. Rhodes, “Sufficiently informative functions and the minimax feedback control of uncertain dynamic systems,” IEEE Transactions on Automatic Control, vol. 18, no. 2, pp. 117–124, 1973. [121] H. S. Witsenhausen, “Minimax control of uncertain systems,” in IEEE Trans. Au- tomat. Contr, Citeseer, 1966. [122] H. Witsenhausen, “A minimax control problem for sampled linear systems,” IEEE Transactions on Automatic Control, vol. 13, no. 1, pp. 5–21, 1968. [123] J. S. Baras and M. R. James, “Robust and risk-sensitive output feedback control for finite state machines and hidden markov models,” tech. 
rep., 1994. [124] S. P. Coraluppi and S. I. Marcus, “Risk-sensitive and minimax control of discrete- time, finite-state markov decision processes,” Automatica, vol. 35, no. 2, pp. 301–309, 1999. [125] T. Ba¸ sar and P. Bernhard, H-infinity optimal control and related minimax design problems: a dynamic game approach. Springer Science & Business Media, 2008. 250 [126] P. Bernhard, “Expected values, feared values, and partial information optimal con- trol,” in New trends in dynamic games and applications, pp. 3–24, Springer, 1995. [127] P. Bernhard, “Max-plus algebra and mathematical fear in dynamic optimization,” Set-Valued Analysis, vol. 8, no. 1-2, pp. 71–84, 2000. [128] P. Bernhard, “Minimax-or feared value- L1/L∞ control,” Theoretical computer sci- ence, vol. 293, no. 1, pp. 25–44, 2003. [129] J. Gonz´ alez-Trejo, O. Hern´ andez-Lerma, and L. F. Hoyos-Reyes, “Minimax control of discrete-time stochastic systems,” SIAM Journal on Control and Optimization, vol. 41, no. 5, pp. 1626–1659, 2002. [130] J. K. Satia and R. E. Lave Jr, “Markovian decision processes with uncertain transition probabilities,” Operations Research, vol. 21, no. 3, pp. 728–740, 1973. [131] G. N. Iyengar, “Robust dynamic programming,” Mathematics of Operations Research, vol. 30, no. 2, pp. 257–280, 2005. [132] W. Wiesemann, D. Kuhn, and B. Rustem, “Robust markov decision processes,” Math- ematics of Operations Research, vol. 38, no. 1, pp. 153–183, 2013. [133] T. Osogami, “Robust partially observable markov decision process,” in Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, pp. 106–115, 2015. [134] M. Gagrani and A. Nayyar, “Decentralized minimax control problems with partial history sharing,” in 2017 American Control Conference (ACC), pp. 3373–3379, May 2017. 251 [135] G. N. Nair, “A nonstochastic information theory for communication and state es- timation,” IEEE Transactions on Automatic Control, vol. 58, pp. 1497–1510, June 2013. [136] A. Mahajan, N. C. Martins, M. C. Rotkowitz, and S. Y¨ uksel, “Information structures in optimal decentralized control,” in 2012 IEEE 51st IEEE Conference on Decision and Control (CDC), pp. 1291–1306, Dec 2012. [137] H. S. Witsenhausen, “A counterexample in stochastic optimum control,” SIAM Jour- nal on Control, vol. 6, no. 1, pp. 131–147, 1968. [138] Y. C. Ho and K. C. Chu, “Team decision theory and information structures in opti- mal control problems: Part i,” in Decision and Control, 1971 IEEE Conference on, pp. 383–387, Dec 1971. [139] N. Sandell and M. Athans, “Solution of some nonclassical lqg stochastic decision problems,” IEEE Transactions on Automatic Control, vol. 19, pp. 108–116, Apr 1974. [140] R. Bansal and T. Basar, “Solutions to a class of linear-quadratic-gaussian (lqg) stochastic team problems with nonclassical information,” in Decision and Control, 1987. 26th IEEE Conference on, vol. 26, pp. 1102–1103, Dec 1987. [141] A. Mahajan, “Optimal decentralized control of coupled subsystems with control shar- ing,” IEEE Transactions on Automatic Control, vol. 58, pp. 2377–2382, Sept 2013. [142] A. Nayyar and L. Lessard, “Structural results for partially nested lqg systems over graphs,” in 2015 American Control Conference (ACC), pp. 5457–5464, IEEE, 2015. 252 [143] M. Rotkowitz and S. Lall, “A characterization of convex problems in decentralized control,” IEEE Transactions on Automatic Control, vol. 51, pp. 274–286, Feb 2006. [144] P. Shah and P. A. 
Parrilo, “H 2 optimal decentralized control over posets: A state- space solution for state-feedback,” IEEE Transactions on Automatic Control, vol. 58, pp. 3084–3096, Dec 2013. [145] L. Lessard and S. Lall, “Optimal control of two-player systems with output feedback,” IEEE Transactions on Automatic Control, vol. 60, pp. 2129–2144, Aug 2015. [146] L. Lessard, “State-space solution to a minimum-entropyH ∞ -optimal control problem with a nested information constraint,” in 53rd IEEE Conference on Decision and Control, pp. 4026–4031, Dec 2014. [147] P. Varaiya and J. Walrand, “On delayed sharing patterns,” IEEE Transactions on Automatic Control, vol. 23, pp. 443–445, Jun 1978. [148] A. Nayyar, A. Mahajan, and D. Teneketzis, “Optimal control strategies in delayed sharing information structures,” IEEE Transactions on Automatic Control, vol. 56, pp. 1606–1620, July 2011. [149] H. S. Witsenhausen, “Separation of estimation and control for discrete time systems,” Proceedings of the IEEE, vol. 59, pp. 1557–1566, Nov 1971. [150] J. Bismut, “An example of interaction between information and control: The trans- parency of a game,” IEEE Transactions on Automatic Control, vol. 18, pp. 518–522, Oct 1973. 253 [151] J. M. Ooi, S. M. Verbout, J. T. Ludwig, and G. W. Wornell, “A separation theorem for periodic sharing information patterns in decentralized control,” IEEE Transactions on Automatic Control, vol. 42, pp. 1546–1550, Nov 1997. [152] D. Bertsekas and I. Rhodes, “Sufficiently informative functions and the minimax feedback control of uncertain dynamic systems,” IEEE Transactions on Automatic Control, vol. 18, pp. 117–124, Apr 1973. [153] M. Gagrani and A. Nayyar, “Centralized minimax control,” Technical report CENG- 2016-02, Sep 2016. http://ceng.usc.edu/techreports/2016/Nayyar%20CENG-2016- 02.pdf. [154] C. Boutilier and T. Lu, “Budget allocation using weakly coupled, constrained markov decision processes,” 2016. [155] X. Wei, H. Yu, and M. J. Neely, “Online learning in weakly coupled markov decision processes: A convergence time study,” Proceedings of the ACM on Measurement and Analysis of Computing Systems, vol. 2, no. 1, p. 12, 2018. [156] D. A. Dolgov and E. H. Durfee, “Optimal resource allocation and policy formulation in loosely-coupled markov decision processes.,” in ICAPS, pp. 315–324, 2004. [157] E. Altman, Constrained Markov decision processes, vol. 7. CRC Press, 1999. [158] A. Piunovskiy, Optimal control of random sequences in problems with constraints, vol. 410. Springer Science & Business Media, 2012. 254 [159] O. Hern´ andez-Lerma and J. Gonz´ alez-Hern´ andez, “Constrained markov control pro- cesses in borel spaces: the discounted case,” Mathematical Methods of Operations Research, vol. 52, no. 2, pp. 271–285, 2000. [160] M. Kamgarpour and T. Summers, “On infinite dimensional linear programming ap- proach to stochastic control,” IFAC-PapersOnLine, vol. 50, no. 1, pp. 6148–6153, 2017. [161] N. Meuleau, M. Hauskrecht, K.-E. Kim, L. Peshkin, L. P. Kaelbling, T. L. Dean, and C. Boutilier, “Solving very large weakly coupled markov decision processes,” in AAAI/IAAI, pp. 165–172, 1998. [162] L. I. Sennott, “Constrained average cost markov decision chains,” Probability in the Engineering and Informational Sciences, vol. 7, no. 1, pp. 69–83, 1993. [163] O. Hern´ andez-Lerma and J. B. Lasserre, Discrete-time Markov control processes: basic optimality criteria, vol. 30. Springer Science & Business Media, 2012. [164] M. 
[165] M. Kurano, J.-I. Nakagami, and Y. Huang, “Constrained Markov decision processes with compact state and action spaces: the average case,” Optimization, vol. 48, no. 2, pp. 255–269, 2000.

[166] O. Hernández-Lerma and J. González-Hernández, “Infinite linear programming and multichain Markov control processes in uncountable spaces,” SIAM Journal on Control and Optimization, vol. 36, no. 1, pp. 313–335, 1998.

[167] J. D. Isom, S. P. Meyn, and R. D. Braatz, “Piecewise linear dynamic programming for constrained POMDPs,” in AAAI, vol. 1, pp. 291–296, 2008.

[168] P. Poupart, A. Malhotra, P. Pei, K.-E. Kim, B. Goh, and M. Bowling, “Approximate linear programming for constrained partially observable Markov decision processes,” in AAAI, vol. 1, pp. 3342–3348, 2015.

[169] A. Undurti and J. P. How, “An online algorithm for constrained POMDPs,” in 2010 IEEE International Conference on Robotics and Automation, pp. 3966–3973, IEEE, 2010.

[170] D. Kim, J. Lee, K.-E. Kim, and P. Poupart, “Point-based value iteration for constrained POMDPs,” in IJCAI, pp. 1968–1974, 2011.

[171] G. Dasarathy, “A simple probability trick for bounding the expected maximum of n random variables,” available at: www.cs.cmu.edu/~gautamd/Files/maxGaussians.pdf.

[172] A. Klenke, Probability theory: a comprehensive course. Springer Science & Business Media, 2013.

Appendices

A Appendix: Learning to Control an unknown MDP

A.1 Bound on the number of macro episodes

Lemma A.1. The number $M$ of macro episodes of TSDE is bounded by $M \leq SA\log(T)$.

Proof. Since the second stopping criterion is triggered whenever the number of visits to a state-action pair is doubled, the start times of macro episodes can be expressed as
$$\{t_1\} \cup \bigcup_{(s,a)\in \mathcal{S}\times\mathcal{A}} \{t_k : k \in \mathcal{M}_{(s,a)}\}, \qquad \mathcal{M}_{(s,a)} = \{k \leq K_T : N_{t_k}(s,a) > 2 N_{t_{k-1}}(s,a)\}.$$
Since the number of visits to $(s,a)$ is doubled at every $t_k$ with $k \in \mathcal{M}_{(s,a)}$, the size of $\mathcal{M}_{(s,a)}$ is at most $O(\log(T))$. This argument is made rigorous as follows. If $|\mathcal{M}_{(s,a)}| \geq \log(N_{T+1}(s,a)) + 1$ we have
$$N_{t_{K_T}}(s,a) = \prod_{k \leq K_T,\, N_{t_{k-1}}(s,a) \geq 1} \frac{N_{t_k}(s,a)}{N_{t_{k-1}}(s,a)} > \prod_{k \in \mathcal{M}_{(s,a)},\, N_{t_{k-1}}(s,a) \geq 1} 2 \;\geq\; N_{T+1}(s,a).$$
But this contradicts the fact that $N_{t_{K_T}}(s,a) \leq N_{T+1}(s,a)$. Therefore, $|\mathcal{M}_{(s,a)}| \leq \log(N_{T+1}(s,a))$ for all $(s,a)$. From this property we obtain a bound on the number of macro episodes as
$$M \leq 1 + \sum_{(s,a)} |\mathcal{M}_{(s,a)}| \leq 1 + \sum_{(s,a)} \log(N_{T+1}(s,a)) \leq 1 + SA \log\Big(\sum_{(s,a)} N_{T+1}(s,a)/SA\Big) = 1 + SA\log(T/SA) \leq SA\log(T) \quad (3)$$
where the first inequality is the union bound and the third inequality holds because log is concave.

A.2 Proof of Lemma 3.6

Proof. We have
$$R_1 = \mathbb{E}\Big[\sum_{k=1}^{K_T} \sum_{t=t_k}^{t_{k+1}-1} \big[v(s_t,\theta_k) - v(s_{t+1},\theta_k)\big]\Big] = \mathbb{E}\Big[\sum_{k=1}^{K_T} \big[v(s_{t_k},\theta_k) - v(s_{t_{k+1}},\theta_k)\big]\Big] \leq \mathbb{E}\big[H K_T\big]$$
where the last inequality holds because $0 \leq v(s,\theta) \leq sp(\theta) \leq H$ for all $s,\theta$ from Assumption 2.1.

A.3 Proof of Lemma 3.7

Proof. For notational simplicity, we use $z = (s,a) \in \mathcal{S}\times\mathcal{A}$ and $z_t = (s_t,a_t)$ to denote the corresponding state-action pair. Then
$$R_2 = \mathbb{E}\Big[\sum_{k=1}^{K_T} \sum_{t=t_k}^{t_{k+1}-1} \Big[v(s_{t+1},\theta_k) - \sum_{s'\in\mathcal{S}} \theta_k(s'|z_t) v(s',\theta_k)\Big]\Big] = \mathbb{E}\Big[\sum_{k=1}^{K_T} \sum_{t=t_k}^{t_{k+1}-1} \Big[\sum_{s'\in\mathcal{S}} \big(\theta_*(s'|z_t) - \theta_k(s'|z_t)\big) v(s',\theta_k)\Big]\Big].$$
Since $0 \leq v(s',\theta_k) \leq H$ from Assumption 2.1, each term in the inner summation is bounded by
$$\sum_{s'\in\mathcal{S}} \big(\theta_*(s'|z_t) - \theta_k(s'|z_t)\big) v(s',\theta_k) \leq H \sum_{s'\in\mathcal{S}} \big|\theta_*(s'|z_t) - \theta_k(s'|z_t)\big| \leq H \sum_{s'\in\mathcal{S}} \big|\theta_*(s'|z_t) - \hat{\theta}_k(s'|z_t)\big| + H \sum_{s'\in\mathcal{S}} \big|\theta_k(s'|z_t) - \hat{\theta}_k(s'|z_t)\big|.$$
Here $\hat{\theta}_k(s'|z_t) = \frac{N_{t_k}(z_t,s')}{N_{t_k}(z_t)}$ is the empirical mean of the transition probability at the beginning of episode $k$, where $N_{t_k}(s_t,a_t,s') = |\{\tau < t_k : (s_\tau,a_\tau,s_{\tau+1}) = (s_t,a_t,s')\}|$.

Define the confidence set
$$B_k = \Big\{\theta : \sum_{s'\in\mathcal{S}} \big|\theta(s'|z) - \hat{\theta}_k(s'|z)\big| \leq \beta_k(z)\ \ \forall z \in \mathcal{S}\times\mathcal{A}\Big\} \quad (4)$$
where $\beta_k(z) = \sqrt{\frac{14 S \log(2 A t_k T)}{\max(1,\, N_{t_k}(z))}}$. Note that $B_k$ with this choice of $\beta_k(z)$ is the confidence set used in [19] with $\delta = 1/T$. Then we have
$$\sum_{s'\in\mathcal{S}} \big|\theta_*(s'|z_t) - \hat{\theta}_k(s'|z_t)\big| + \sum_{s'\in\mathcal{S}} \big|\theta_k(s'|z_t) - \hat{\theta}_k(s'|z_t)\big| \leq 2\beta_k(z_t) + 2\big(\mathbf{1}_{\{\theta_* \notin B_k\}} + \mathbf{1}_{\{\theta_k \notin B_k\}}\big).$$
Therefore,
$$R_2 \leq 2H\, \mathbb{E}\Big[\sum_{k=1}^{K_T} \sum_{t=t_k}^{t_{k+1}-1} \beta_k(z_t)\Big] + 2H\, \mathbb{E}\Big[\sum_{k=1}^{K_T} T_k \big(\mathbf{1}_{\{\theta_* \notin B_k\}} + \mathbf{1}_{\{\theta_k \notin B_k\}}\big)\Big]. \quad (5)$$
For the first term in (5) we have
$$\sum_{k=1}^{K_T} \sum_{t=t_k}^{t_{k+1}-1} \beta_k(z_t) = \sum_{k=1}^{K_T} \sum_{t=t_k}^{t_{k+1}-1} \sqrt{\frac{14 S \log(2 A t_k T)}{\max(1,\, N_{t_k}(z_t))}}.$$
Note that $N_t(z_t) \leq 2 N_{t_k}(z_t)$ for all $t$ in the $k$th episode from the second criterion. So we get
$$\sum_{k=1}^{K_T} \sum_{t=t_k}^{t_{k+1}-1} \beta_k(z_t) \leq \sum_{k=1}^{K_T} \sum_{t=t_k}^{t_{k+1}-1} \sqrt{\frac{28 S \log(2 A t_k T)}{\max(1,\, N_t(z_t))}} \leq \sum_{k=1}^{K_T} \sum_{t=t_k}^{t_{k+1}-1} \sqrt{\frac{28 S \log(2 A T^2)}{\max(1,\, N_t(z_t))}} = \sum_{t=1}^{T} \sqrt{\frac{28 S \log(2 A T^2)}{\max(1,\, N_t(z_t))}} \leq \sqrt{56 S \log(AT)} \sum_{t=1}^{T} \frac{1}{\sqrt{\max(1,\, N_t(z_t))}}. \quad (6)$$
Since $N_t(z_t)$ is the count of visits to $z_t$, we have
$$\sum_{t=1}^{T} \frac{1}{\sqrt{\max(1,\, N_t(z_t))}} = \sum_{z} \sum_{t=1}^{T} \frac{\mathbf{1}_{\{z_t = z\}}}{\sqrt{\max(1,\, N_t(z))}} = \sum_{z} \Big(\mathbf{1}_{\{N_{T+1}(z) > 0\}} + \sum_{j=1}^{N_{T+1}(z)-1} \frac{1}{\sqrt{j}}\Big) \leq \sum_{z} \Big(\mathbf{1}_{\{N_{T+1}(z) > 0\}} + 2\sqrt{N_{T+1}(z)}\Big) \leq 3 \sum_{z} \sqrt{N_{T+1}(z)}.$$
Since $\sum_z N_{T+1}(z) = T$, we have
$$3 \sum_{z} \sqrt{N_{T+1}(z)} \leq 3\sqrt{SA \sum_z N_{T+1}(z)} = 3\sqrt{SAT}. \quad (7)$$
Combining (6)-(7) we get
$$2H \sum_{k=1}^{K_T} \sum_{t=t_k}^{t_{k+1}-1} \beta_k(z_t) \leq 6\sqrt{56}\, HS \sqrt{AT \log(AT)} \leq 48\, HS \sqrt{AT \log(AT)}. \quad (8)$$
Let us now work on the second term in (5). Since $T_k \leq T$ for all $k$, we have
$$\mathbb{E}\Big[\sum_{k=1}^{K_T} T_k \big(\mathbf{1}_{\{\theta_* \notin B_k\}} + \mathbf{1}_{\{\theta_k \notin B_k\}}\big)\Big] \leq T\, \mathbb{E}\Big[\sum_{k=1}^{K_T} \big(\mathbf{1}_{\{\theta_* \notin B_k\}} + \mathbf{1}_{\{\theta_k \notin B_k\}}\big)\Big] \leq T \sum_{k=1}^{\infty} \mathbb{E}\big[\mathbf{1}_{\{\theta_* \notin B_k\}} + \mathbf{1}_{\{\theta_k \notin B_k\}}\big]. \quad (9)$$
Since $B_k$ is measurable with respect to $\sigma(h_{t_k})$, using Lemma 2.2 we get
$$\mathbb{E}\big[\mathbf{1}_{\{\theta_* \notin B_k\}} + \mathbf{1}_{\{\theta_k \notin B_k\}}\big] = 2\, \mathbb{E}\big[\mathbf{1}_{\{\theta_* \notin B_k\}}\big] = 2\, \mathbb{P}(\theta_* \notin B_k). \quad (10)$$
By the definition of $B_k$ in (4), [19, Lemma 17, setting $\delta = 1/T$] implies that
$$\mathbb{P}(\theta_* \notin B_k) \leq \frac{1}{15\, T\, t_k^6}. \quad (11)$$
Combining (9), (10) and (11) we obtain
$$2H\, \mathbb{E}\Big[\sum_{k=1}^{K_T} T_k \big(\mathbf{1}_{\{\theta_* \notin B_k\}} + \mathbf{1}_{\{\theta_k \notin B_k\}}\big)\Big] \leq \frac{4}{15} H \sum_{k=1}^{\infty} t_k^{-6} \leq \frac{4}{15} H \sum_{k=1}^{\infty} k^{-6} \leq H. \quad (12)$$
The statement of the lemma then follows by substituting (8) and (12) into (5).

A.4 Proof of Theorem 2.2

Proof. Since $\tilde{\pi}_k$ is an $\epsilon_k$-approximation policy, we have
$$c(s_t,a_t) \leq \min_{a\in\mathcal{A}} \Big\{c(s_t,a) + \sum_{s'\in\mathcal{S}} \theta_k(s'|s_t,a) v(s',\theta_k)\Big\} - \sum_{s'\in\mathcal{S}} \theta_k(s'|s_t,a_t) v(s',\theta_k) + \epsilon_k = J(\theta_k) + v(s_t,\theta_k) - \sum_{s'\in\mathcal{S}} \theta_k(s'|s_t,a_t) v(s',\theta_k) + \epsilon_k.$$
Then (2.11) becomes
$$R(T,\text{TSDE}) = \mathbb{E}\Big[\sum_{k=1}^{K_T} \sum_{t=t_k}^{t_{k+1}-1} c(s_t,a_t)\Big] - T\, \mathbb{E}\big[J(\theta_*)\big] \leq \mathbb{E}\Big[\sum_{k=1}^{K_T} T_k J(\theta_k)\Big] + \mathbb{E}\Big[\sum_{k=1}^{K_T} \sum_{t=t_k}^{t_{k+1}-1} \Big[v(s_t,\theta_k) - \sum_{s'\in\mathcal{S}} \theta_k(s'|s_t,a_t) v(s',\theta_k)\Big]\Big] + \mathbb{E}\Big[\sum_{k=1}^{K_T} T_k \epsilon_k\Big] - T\, \mathbb{E}\big[J(\theta_*)\big] = R_0 + R_1 + R_2 + \mathbb{E}\Big[\sum_{k=1}^{K_T} T_k \epsilon_k\Big].$$
Since $R_0 + R_1 + R_2 = \tilde{O}(HS\sqrt{AT})$ from the proof of Theorem 2.1, we obtain the first part of the result. If $\epsilon_k \leq \frac{1}{k+1}$, then since $T_k \leq T_{k-1} + 1 \leq \dots \leq k+1$, we get
$$\sum_{k=1}^{K_T} T_k \epsilon_k \leq \sum_{k=1}^{K_T} \frac{k+1}{k+1} = K_T \leq \sqrt{2 SAT \log T}$$
where the last inequality follows from Lemma 2.1.
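The episode-stopping logic analyzed in Lemma A.1 and the proofs above (an episode ends when its length exceeds the previous episode length by one, or when the visit count of some state-action pair doubles) is simple to state in code. The sketch below is a minimal illustration for a finite MDP with known costs; the independent Dirichlet posteriors over transition rows, the relative value iteration planner, the function names, and the environment interface env_step are illustrative assumptions, not the exact implementation used in the thesis.

```python
import numpy as np

def relative_value_iteration(P, c, iters=2000, tol=1e-8):
    """Average-cost planning for the sampled MDP.
    P: (S, A, S) transition probabilities, c: (S, A) known costs."""
    S, A = c.shape
    v = np.zeros(S)
    for _ in range(iters):
        q = c + P @ v                 # c(s,a) + sum_s' P(s'|s,a) v(s')
        v_new = q.min(axis=1)
        v_new = v_new - v_new[0]      # relative values (pin v at a reference state)
        if np.max(np.abs(v_new - v)) < tol:
            v = v_new
            break
        v = v_new
    return (c + P @ v).argmin(axis=1)  # greedy policy for the sampled model

def tsde(env_step, c, S, A, T, seed=0):
    """Thompson Sampling with Dynamic Episodes (sketch).
    env_step(s, a) returns the next state under the true, unknown MDP."""
    rng = np.random.default_rng(seed)
    alpha = np.ones((S, A, S))        # Dirichlet counts for each row theta(.|s,a)
    N = np.zeros((S, A))              # visit counts N_t(s,a)
    s, t, T_prev = 0, 0, 0            # T_prev = length of the previous episode
    total_cost = 0.0
    while t < T:
        # start of episode k: sample theta_k from the posterior and plan for it
        theta = np.array([[rng.dirichlet(alpha[si, ai]) for ai in range(A)]
                          for si in range(S)])
        policy = relative_value_iteration(theta, c)
        N_start, ep_len = N.copy(), 0
        while t < T:
            a = policy[s]
            total_cost += c[s, a]
            s_next = env_step(s, a)
            alpha[s, a, s_next] += 1   # posterior (count) update
            N[s, a] += 1
            s, t, ep_len = s_next, t + 1, ep_len + 1
            # stopping criteria: (i) episode one step longer than the previous one,
            # (ii) some state-action visit count has doubled since the episode began
            if ep_len > T_prev or np.any(N > 2 * N_start):
                break
        T_prev = ep_len
    return total_cost
```

Note that the doubling test compares against the counts recorded at the start of the current episode, which is exactly the second stopping criterion that yields the $SA\log(T)$ bound on the number of macro episodes in Lemma A.1.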
263 B Appendix: Learning to Control an unknown Linear Sys- tem B.1 Centralized LQR Proof of Lemma 3.2. During the kth episode, we have u t =G( ˜ θ k )x t . Then, ||x t+1 || =||(A 1 +B 1 G( ˜ θ k ))x t +w t ||≤||(A 1 +B 1 G( ˜ θ k ))x t || +||w t || ≤ρ(A 1 +B 1 G( ˜ θ k ))||x t || +||w t ||≤δ||x t || +||w t || (13) where the second inequality is the property of matrix norm, and the last inequality follows from Assumption 3.2. Iteratively applying (13), we get kx t k≤ X τ<t δ t−τ−1 kw τ k≤ X τ<t δ t−τ−1 max τ≤T kw τ k ≤ 1 1−δ max τ≤T kw τ k. Therefore, X j T ≤ 1 1−δ max t≤T kw t k j = (1−δ) −j σ j max t≤T kw standard t k j . (14) 264 where w standard t ∼N (0,I). Then, it remains to bound E[max t≤T kw standard t k j ]. Following the steps of [171], we have exp E[max t≤T kw standard t k j ] ≤E h exp max t≤T kw standard t k j i =E h max t≤T exp kw standard t k j i ≤E h X t≤T exp kw standard t k j i =TE h exp kw standard 1 k j i . (15) Combining (14) and (15), we obtain E[X j T ]≤(1−δ) −j σ j log TE h exp kw standard 1 k j i where the right-hand side is O (1−δ) −j σ j log(T ) . Proof of Lemma 3.3. Define macro-episodes with start timest n i ,i = 1, 2,... wheret n 1 =t 1 and t n i+1 = min{t k >t n i : det(Σ t k )< 0.5 det(Σ t k−1 )}. The idea is that each macro-episode starts when the second stopping criterion happens. Let M be the number of macro-episodes until time T and define n (M+1) =K T + 1. LetM be the set of episodes that is the first one in a macro-episode. Let ˜ T i = P n i+1 −1 k=n i T k be the length of the ith macro-episode. By definition of macro- episodes, any episode except the last one in a macro-episode must be triggered by the 265 first stopping criterion. Therefore, within the ith macro-episode, T k = T k−1 + 1 for all k =n i ,n i + 1,...,n i+1 − 2. Hence, ˜ T i = n i+1 −1 X k=n i T k = n i+1 −n i −1 X j=1 (T n i −1 +j) +T n i+1 −1 ≥ n i+1 −n i −1 X j=1 (j + 1) + 1 = 0.5(n i+1 −n i )(n i+1 −n i + 1). Consequently, n i+1 −n i ≤ q 2 ˜ T i for all i = 1,...,M. From this property, we obtain K T =n M+1 − 1 = M X i=1 (n i+1 −n i )≤ M X i=1 q 2 ˜ T i . (16) Using (16) and the fact that P M i=1 ˜ T i =T , we get K T ≤ M X i=1 q 2 ˜ T i ≤ v u u t M M X i=1 2 ˜ T i = √ 2MT (17) where the second inequality is by Cauchy-Schwarz. Since the second stopping criterion is triggered whenever the determinant of sample covari- ance is half, we have det(Σ −1 T )≥ det(Σ −1 tn M )> 2 det(Σ −1 t N M −1 )>···> 2 M−1 det(Σ −1 1 ) Since (tr(Σ −1 T )) d ≥ det(Σ −1 T ), we have tr(Σ −1 T )> (det(Σ −1 T )) 1/d > 2 (M−1)/d (det(Σ −1 1 )) 1/d ≥ 2 (M−1)/d λ min 266 where λ min is the minimum eigenvalue of Σ −1 1 . Note that from Remark 3.1, Σ −1 T = Σ −1 1 + 1 σ 2 T−1 X t=1 z t z > t and we obtain 2 (M−1)/d λ min < tr(Σ −1 1 ) + 1 σ 2 T−1 X t=1 z > t z t . Then, M≤1 +d log( 1 λ min (tr(Σ −1 1 ) + 1 σ 2 T−1 X t=1 z > t z t )) =O d log( 1 σ 2 T−1 X t=1 z > t z t ) ! . Note that,||z t || =||[I,G(θ) > ] > x t ||≤M G ||x t ||. Consequently, M≤O d log M 2 G 1 σ 2 T−1 X t=1 ||x t || 2 =O d log 1 σ 2 T−1 X t=1 ||x t || 2 ≤O d log(T X 2 T σ 2 ) Hence, from (17) we obtain the claim of the lemma. The following two lemmas will be useful in the proof of Lemma 3.6 and 3.7. Lemma B.1. The following bounds hold: E[ q log(X T )]≤ ˜ O(1) E[ q log(X T )X 2 T ]≤ ˜ O σ 2 (1−δ) −2 , E[X 4 T log(X T )]≤ ˜ O σ 4 (1−δ) −4 . 267 Proof. Using Jensen’s inequality and Lemma 3.2, we get E[ q log(X T )]≤ q E[log(X T )]≤ q log(E[X T ])] ≤O q log(σ log(T )(1−δ) −1 ) ≤ ˜ O(1). 
Similarly, E[ q log(X T )X 2 T ]≤ q E[log(X T )]E[X 4 T ]≤ q log(E[X T ])]E[X 4 T ] ≤O σ 2 (1−δ) −2 q log(T ) log(σ log(T )(1−δ) −1 ) ≤ ˜ O σ 2 (1−δ) −2 . E[X 4 T log(X T )]≤E[X 4 T log max(e,X T ))] ≤ q E[X 8 T ]E[log 2 max(e,X T )] where the second inequality follows from the cauchy-schwarz inequality. Now, log 2 (x) is a concave function for x≥e. Therefore, using jensen’s inequality we can write, E log 2 max(e,X T )) ≤ log 2 (E max(e,X T )) ≤ log 2 (e +E[X T ]) ≤ log 2 (e +TO(logT )) 268 = ˜ O(1) Also,E[X 8 T ]≤σ 8 O(logT ). Therefore, E[X 4 T log(X T )]≤σ 4 ˜ O(1) Lemma B.2. We have the following inequality: E h K T X k=1 t k+1 −1 X t=t k ||Σ −0.5 t (θ 1 − ˜ θ k )|| 2 i ≤4dn(T +E[K T ]). (18) Proof. From Lemma 9 of [30], we have ||Σ −0.5 t (θ 1 − ˜ θ k )|| 2 ≤||Σ −0.5 t k (θ 1 − ˜ θ k )|| 2 det(Σ t k ) det(Σ t ) ≤ 2||Σ −0.5 t k (θ 1 − ˜ θ k )|| 2 where the last inequality follows from the second stopping criterion of the algorithm. There- fore, K T X k=1 t k+1 −1 X t=t k ||Σ −0.5 t (θ 1 − ˜ θ k )|| 2 ≤ 2 K T X k=1 T k ||Σ −0.5 t k (θ 1 − ˜ θ k )|| 2 269 Now, taking the expectation and using T k ≤T k−1 + 1 we get, E h K T X k=1 T k ||Σ −0.5 t k (θ 1 − ˜ θ k )|| 2 i = ∞ X k=1 E h 1 {t k ≤T} T k ||Σ −0.5 t k (θ 1 − ˜ θ k )|| 2 i ≤ ∞ X k=1 E h 1 {t k ≤T} (T k−1 + 1)||Σ −0.5 t k (θ 1 − ˜ θ k )|| 2 i . Since 1 {t k ≤T} (T k−1 + 1) is measurable with respect to σ(h t k ), we get E h 1 {t k ≤T} (T k−1 + 1)||Σ −0.5 t k (θ 1 − ˜ θ k )|| 2 i =E h E h 1 {t k ≤T} (T k−1 + 1)||Σ −0.5 t k (θ 1 − ˜ θ k )|| 2 |h t k ii =E h 1 {t k ≤T} (T k−1 + 1)E h ||Σ −0.5 t k (θ 1 − ˜ θ k )|| 2 |h t k ii ≤E h 1 {t k ≤T} (T k−1 + 1)2dn i where the inequality holds because conditioned onh t k , each column of Σ −0.5 t k (θ 1 − ˜ θ k ) is the difference of two d-dimensional i.i.d. random vectors∼N (0,I). Therefore, we have the desired result as follows, E h K T X k=1 t k+1 −1 X t=t k ||Σ −0.5 t (θ 1 − ˜ θ k )|| 2 i ≤ 4dnE h 1 {t k ≤T} (T k−1 + 1) i ≤ 4dnE[T +K T ]. 270 Proof of Lemma 3.6. From the definition of R 1 we get R 1 =E h K T X k=1 t k+1 −1 X t=t k h x > t S( ˜ θ k )x t −x > t+1 S( ˜ θ k )x t+1 ii =E h K T X k=1 h x > t k S( ˜ θ k )x t k −x > t k+1 S( ˜ θ k )x t k+1 ii ≤E h K T X k=1 x > t k S( ˜ θ k )x t k i . (19) Since||S( ˜ θ k )||≤M S , we obtain R 1 ≤E h K T X k=1 M S ||x t k || 2 i ≤M S E h K T X 2 T i . (20) Proof of Lemma 3.7. Each term inside the expectation of R 2 is equal to ||S 0.5 ( ˜ θ k )θ > 1 z t || 2 −||S 0.5 ( ˜ θ k ) ˜ θ > k z t || 2 = ||S 0.5 ( ˜ θ k )θ > 1 z t || +||S 0.5 ( ˜ θ k ) ˜ θ > k z t || ||S 0.5 ( ˜ θ k )θ > 1 z t ||−||S 0.5 ( ˜ θ k ) ˜ θ > k z t || ≤ ||S 0.5 ( ˜ θ k )θ > 1 z t || +||S 0.5 ( ˜ θ k ) ˜ θ > k z t || ||S 0.5 ( ˜ θ k )(θ 1 − ˜ θ k ) > z t || Since||S 0.5 ( ˜ θ k )θ > z t ||≤M 0.5 S M θ M G X T forθ = ˜ θ k orθ =θ 1 , the above term can be further bounded by 2M 0.5 S M θ M G X T ||S 0.5 ( ˜ θ k )(θ 1 − ˜ θ k ) > z t || ≤2M S M θ M G X T ||(θ 1 − ˜ θ k ) > z t ||. 271 Therefore, R 2 ≤2M S M θ M G E h X T K T X k=1 t k+1 −1 X t=t k ||(θ 1 − ˜ θ k ) > z t || i . 
(21) From Cauchy-Schwarz inequality, we have E h X T K T X k=1 t k+1 −1 X t=t k ||(θ 1 − ˜ θ k ) > z t || i =E h X T K T X k=1 t k+1 −1 X t=t k ||(Σ −0.5 t (θ 1 − ˜ θ k )) > Σ 0.5 t z t || i ≤E h K T X k=1 t k+1 −1 X t=t k ||Σ −0.5 t (θ 1 − ˜ θ k )||×X T ||Σ 0.5 t z t || i ≤ v u u u tE h K T X k=1 t k+1 −1 X t=t k ||Σ −0.5 t (θ 1 − ˜ θ k )|| 2 i v u u u tE h K T X k=1 t k+1 −1 X t=t k X 2 T ||Σ 0.5 t z t || 2 i (22) From Lemma B.2 in the appendix, the first part of (22) is bounded by v u u u tE h K T X k=1 t k+1 −1 X t=t k ||Σ −0.5 t (θ 1 − ˜ θ k )|| 2 i ≤ q 4dn(T +E[K T ]). (23) For the second part of (22), note that K T X k=1 t k+1 −1 X t=t k ||Σ 0.5 t z t || 2 = T X t=1 z > t Σ t z t 272 Since||z t ||≤M G X T for all t≤T , Lemma 8 of [30] implies T X t=1 z > t Σ t z t ≤ T X t=1 max(1,M 2 G X 2 T /λ min ) min(1,z > t Σ t z t ) ≤2d max(1,M 2 G X 2 T /λ min ) log(tr(Σ −1 1 ) +TM 2 G X 2 T ) =O 2dM 2 G λ min X 2 T log(TX 2 T ) . Consequently, the second term of (22) is bounded by O s 2dM 2 G λ min E h X 4 T log(TX 2 T ) i . (24) Then, from (21), (22), (23) and (24), we obtain the result of the lemma. B.2 Mean-field LQ Lemma B.3. E " ( ˘ X i T ) 4 log( X t ( ˘ X i ∗ t T ) 2 ) # ≤ 1− 1 n 2 ˜ O(1) (25) Proof. E " ( ˘ X i T ) 4 log( X t ( ˘ X i ∗ t T ) 2 ) # ≤E " ( ˘ X i T ) 4 log(max(e, X t ( ˘ X i ∗ t T ) 2 )) # ≤ s E[( ˘ X i T ) 8 ] E[log 2 max(e, X t ˘ X i ∗ t T ) 2 ) ] 273 where the second inequality follows from the cauchy-schwarz inequality. Now, log 2 (x) is a concave function for x≥e. Therefore, using jensen’s inequality we can write, E log 2 max(e, X t ˘ X i ∗ t T ) 2 ) ≤ log 2 (E max(e, X t ˘ X i ∗ t T ) 2 )) ≤ log 2 e +E( X t ˘ X i ∗ t T ) 2 ) ! ≤ log 2 e +TO((1− 1 n ) logT ) = ˜ O(1) Also, E[( ˘ X i T ) 8 ]≤ 1− 1 n 4 O(logT ). Therefore, combining the above inequalities we have the following: E " ( ˘ X i T ) 4 log( X t ( ˘ X i ∗ t T ) 2 ) # ≤ v u u t E[( ˘ X i T ) 8 ] E h log 2 max(e, X t ˘ X i ∗ t T ) 2 ) ! i ≤ 1− 1 n 2 ˜ O(1) Proof of Lemma 3.10. We first bound ˘ K T . We can follow the same sketch as in proof of Lemma 3.3. Let ˘ M be the number of times the second stopping criterion is triggered for ˘ λ t . Then, we have ˘ K T ≤ q 2 ˘ MT (26) 274 Since the second stopping criterion is triggered whenever the determinant of sample covari- ance is half, we have det( ˘ Σ −1 T )≥ 2 ˘ M−1 det( ˘ Σ −1 1 ) Let d =d x +d u . Since ( 1 d tr( ˘ Σ −1 T )) d ≥ det( ˘ Σ −1 T ), we have tr( ˘ Σ −1 T )>d(det( ˘ Σ −1 T )) 1/d >d× 2 ( ˘ M−1)/d (det( ˘ Σ −1 1 )) 1/d ≥d× 2 ( ˘ M−1)/d ˘ λ min where ˘ λ min is the minimum eigenvalue of ˘ Σ −1 1 . Using (3.57) we have, ˘ Σ −1 T = ˘ Σ −1 1 + T−1 X t=1 n n− 1 ˘ z i ∗ t t (˘ z i ∗ t t ) and we obtain d× 2 ( ˘ M−1)/d λ min < tr( ˘ Σ −1 1 ) + T−1 X t=1 n n− 1 ||˘ z i ∗ t t || 2 Then, ˘ M≤1 +d log 1 dλ min tr( ˘ Σ −1 1 + T−1 X t=1 n n− 1 ||˘ z i ∗ t t || 2 !! =O d log( n n− 1 T−1 X t=1 ||˘ z i ∗ t t || 2 ) ! . 275 Note that,||˘ z i ∗ t t || =||[I,G( ˘ θ) | ] | ˘ x i ∗ t t ||≤M G ||˘ x i ∗ t t ||≤M G ˘ X i ∗ t T . Consequently, ˘ M≤O d log n n− 1 X t ( ˘ X i ∗ t T ) 2 Therefore, combining the above inequality with (26) we get, ˘ K T ≤O s (d x +d u )T log n n− 1 X t ( ˘ X i ∗ t T ) 2 (27) Using a similar analysis we can bound the ¯ K T as given in the statement of the lemma. 276 C Appendix: Thompson Sampling for some Decentralized Control Problems C.1 Proof of Lemma 4.1 Proof. Lett> 0 be fixed. 
We start by rewriting the posterior π i t (θ) in terms of the prior as follows: π i t (θ) = Q t s=1 q i θ (X i s |X i s−1 ,U i s−1 ) π i 0 (θ) P ˜ θ∈Θ Q t s=1 q i ˜ θ (X i s |X i s−1 ,U i s−1 ) π i 0 ( ˜ θ) = 1 1 + P ˜ θ∈Θt c i ˜ θ exp(− P t s=1 log Λ i ˜ θ,s ) (28) where Λ i ˜ θ,s = q i θ (X i s |X i s−1 ,U i s−1 ) q i ˜ θ (X i s |X i s−1 ,U i s−1 ) is the likelihood ratio at time s and c i ˜ θ = π i 0 ( ˜ θ) π i 0 (θ) . Θ t is defined as the following set Θ t ={ ˜ θ∈ Θ : ˜ θ6=θ,q i ˜ θ (X i s |X i s−1 ,U i s−1 )6= 0∀s≤t} (29) Define the filtrationH i s =σ(H i s ) and let Z i ˜ θ,s = P s k=1 log Λ i ˜ θ,k . Now, we add and subtract E τ θ [log Λ i ˜ θ,k |H i k−1 ] to obtain the following decomposition: Z i ˜ θ,s = s X k=1 (log Λ i ˜ θ,k −E τ θ [log Λ i ˜ θ,k |H i k−1 ]) + s X k=1 E τ θ [log Λ i ˜ θ,k |H i k−1 ], s≤t. (30) 277 The first summation on the right hand side of (30) is denoted by: M i ˜ θ,s = s X k=1 log Λ i ˜ θ,k −E τ θ [log Λ i ˜ θ,k |H i k−1 ], s≤t. (31) Also, defineM i ˜ θ,s :=M i ˜ θ,t for alls>t. Then, it is clear thatM i ˜ θ,s is a martingale with respect to the filtrationH i s . Define p i = min y,x,u, ˜ θ:q ˜ θ,i (y|x,u)6=0 q ˜ θ,i (y|x,u). Then, q ˜ θ,i (X i s |X i s−1 ,U i s−1 )≥ p i for all ˜ θ∈ Θ t . This would imply that Λ i ˜ θ,s is bounded from above and below by a finite constant for alls≤t. Thus,M i ˜ θ,s has finite increments almost surely, that is,∃d i > 0 such that, |log Λ i ˜ θ,s −E τ θ [log Λ i ˜ θ,s |H i s−1 ]|≤d i (32) Let the second summation of the decomposition in (30) be referred to asA i ˜ θ,s = P s k=1 E τ θ [log Λ i ˜ θ,k |H i k−1 ]. Then, each term inside this summation can be bounded from below as follows: E τ θ [log Λ i ˜ θ,k |H i k−1 ] =E τ θ [E τ θ [log Λ i ˜ θ,k |H i k−1 ,U i k−1 ]|H i k−1 ] =E τ θ [[K(q i θ (X i k−1 ,U i k−1 )|q i ˜ θ (X i k−1 ,U i k−1 ))]|H i k−1 ] ≥ where the last inequality follows from Assumption 4.1. Thus, we have A i ˜ θ,t ≥t . 278 Define the event B i ˜ θ,t ={|M i ˜ θ,t |≤δt} for some 0<δ <. Going back to (28) we can write the following expectation: E τ θ [π i t (θ)] =E τ θ 1 1 + P ˜ θ6=θ c i ˜ θ exp(−M i ˜ θ,t −A i ˜ θ,t ) ≥E τ θ 1 1 + P ˜ θ6=θ c i ˜ θ exp(−M i ˜ θ,t −t) =E τ θ I[∩ ˜ θ6=θ B i ˜ θ,t ] 1 + P ˜ θ6=θ c i ˜ θ exp(−M i ˜ θ,t −t) +E τ θ I[(∩ ˜ θ6=θ B i ˜ θ,t ) C ] 1 + P ˜ θ6=θ c i ˜ θ exp(−M i ˜ θ,t −t) ≥E τ θ E τ θ I[∩ ˜ θ6=θ B i ˜ θ,t ] 1 + P ˜ θ6=θ c i ˜ θ exp(−M i ˜ θ,t −t) ∩ ˜ θ6=θ B i ˜ θ,t ≥ P τ θ (∩ ˜ θ6=θ B i ˜ θ,t ) 1 + 1−π i 0 (θ) π i 0 (θ) exp(−(−δ)t) (33) where the first inequality follows from the fact that A i ˜ θ,t ≥t, the second inequality follows since we ignored the expectation term with the event I[(∩ ˜ θ6=θ B i ˜ θ,t ) C ] which is always non- negative and the last inequality follows since M i ˜ θ,t ≥−δt conditioned on the event B i ˜ θ,t . Now, a simple application of union bound gives us P τ θ (∩ ˜ θ6=θ B i ˜ θ,t )≥ 1− X ˜ θ6=θ P τ θ (B i ˜ θ,t ) C . Furthermore, since M i ˜ θ,t is a martingale with bounded increments (see (32)), we have the following by Azuma’s inequality [172]: P τ θ (B i ˜ θ,t ) C =P τ θ (|M i ˜ θ,t |≥δt)≤ 2exp −δ 2 t 2(d i ) 2 ! 
(34) Substituting (34) to (33) we get, E τ θ [π i t (θ)]≥ 1− 2(|Θ|− 1)exp −δ 2 t 2(d i ) 2 1 + 1−π i 0 (θ) π i 0 (θ) exp(−(−δ)t) (35) 279 Therefore, we can upper boundE τ θ [1−π i t (θ)] as follows: E τ θ [1−π i t (θ)]≤ 1− 1− 2(|Θ|− 1)exp −δ 2 t 2(d i ) 2 1 + 1−π i 0 (θ) π i 0 (θ) exp(−(−δ)t) = 1−π i 0 (θ) π i 0 (θ) exp(−(−δ)t) + 2(|Θ|− 1)exp −δ 2 t 2(d i ) 2 1 + 1−π i 0 (θ) π i 0 (θ) exp(−(−δ)t) ≤ 1−π i 0 (θ) π i 0 (θ) exp(−(−δ)t) + 2(|Θ|− 1)exp −δ 2 t 2(d i ) 2 ! If we choose δ = 2 and constants a θ = 2 max n 1−π 1 0 (θ) π 1 0 (θ) , 1−π 2 0 (θ) π 2 0 (θ) , 2(|Θ|− 1) o and b θ = min n 2 8(d 1 ) 2 , 2 8(d 2 ) 2 , 2 o then we get the desired result: E τ θ [1−π i t (θ)]≤a θ exp(−b θ t), i = 1, 2 (36) 280 D Appendix: Networked Estimation- IID case D.1 Auxiliary results The following two definitions and theorem can be found in [? ] and in [? ]. Definition D.1 (Symmetric rearrangement). Let A be a measurable set of finite volume in R n . Its symmetric rearrangement A ∗ is defined as the open ball centered at 0 n whose volume agrees withA. Definition D.2 (Symmetric decreasing rearrangement). Letf :R n →R be a nonnegative measurable function that vanishes at infinity. The symmetric decreasing rearrangement f ↓ of f is f ↓ (x) def = Z ∞ 0 I x∈{ξ∈R n |f(ξ)>t} ∗ dt. (37) Theorem D.1 (Hardy-Littlewood Inequality). Iff andg are two nonnegative measurable functions defined onR n which vanish at infinity, then the following holds: Z R n f(x)g(x)dx≤ Z R n f ↓ (x)g ↓ (x)dx, (38) where f ↓ and g ↓ are the symmetric decreasing rearrangements of f and g, respectively. 281 D.2 Proof of lemma 5.5 D.2.1 Empty battery Let e = 0. The value function in (5.49) is given by V π t (0) = inf ˜ xt E h X i∈{1,2} kX i t − ˜ x i t k 2 i +C 0 t+1 (0). (39) The infimum in the expression above is achieved by ˜ x ? t = E[X 1 t ],E[X 2 t ] . (40) Since π 1 and π 2 are symmetric around 0, ˜ x ? t = 0. (41) Therefore, if e = 0, the infimum in (5.49) is achieved by: ˜ x ? t = 0, i∈{1, 2}. (42) D.2.2 Nonempty battery Let e> 0. The value function in (5.49) is given by V π t (e) = inf ˜ xt E h min n kX 1 t − ˜ x 1 t k 2 +kX 2 t − ˜ x 2 t k 2 +C 0 t+1 (e), 282 kX 2 t − ˜ x 2 t k 2 +C 1 t+1 (e),kX 1 t − ˜ x 1 t k 2 +C 1 t+1 (e) oi . (43) The optimization problem in (43) is equivalent to: inf ˜ xt E h min n kX 1 t − ˜ x 1 t k 2 +kX 2 t − ˜ x 2 t k 2 ,kX 2 t − ˜ x 2 t k 2 +κ t (e),kX 1 t − ˜ x 1 t k 2 +κ t (e) oi , (44) where κ t (e) def =C 1 t+1 (e)−C 0 t+1 (e). (45) Consider the auxiliary cost functionJ e t :R n 1 ×R n 2 →R defined as J e t (˜ x t ) def =E h min kX 1 t − ˜ x 1 t k 2 +kX 2 t − ˜ x 2 t k 2 ,kX 2 t − ˜ x 2 t k 2 +κ t (e),kX 1 t − ˜ x 1 t k 2 +κ t (e) i , (46) where the expectation is taken with respect to the random vectors X 1 t and X 2 t . The remainder of the proof consists of solving the following optimization problem: inf ˜ xt J e t (˜ x t ). (47) Define the functionG :R n ×R n →R such that G e t (˜ x t ; x t ) def = min kx 1 t − ˜ x 1 t k 2 +kx 2 t − ˜ x 2 t k 2 ,kx 2 t − ˜ x 2 t k 2 +κ t (e),kx 1 t − ˜ x 1 t k 2 +κ t (e) . (48) 283 Using the fact that X 1 t and X 2 t are independent, and the functionG e t defined in (48), we can rewrite the functionJ e t (˜ x t ) in integral form as: J e t (˜ x t ) = Z R n 2 " Z R n 1 G e t (˜ x t ; x t )π 1 (x 1 t )dx 1 t # π 2 (x 2 t )dx 2 t . (49) The functionG e t can be alternatively represented as: G e t (˜ x t ; x t ) = min n kx 2 t − ˜ x 2 t k 2 +κ t (e),kx 1 t − ˜ x 1 t k 2 + min κ t (e),kx 2 t − ˜ x 2 t k 2 o . 
(50) Finally, let the functionH e t :R n ×R n →R be defined as: H e t (˜ x t ; x t ) def =kx 2 t − ˜ x 2 t k 2 +κ t (e)−G e t (˜ x t ; x t ). (51) Notice that the functionH e t vanishes as the norm of x 1 t tends to infinity, i.e., lim kx 1 t k→+∞ H e t (˜ x t ; x t ) = 0. (52) From the Hardy-Littlewood inequality (see Appendix D.1), we have: Z R n 1 H e t (˜ x t ; x t )π 1 (x 1 t )dx 1 t ≤ Z R n 1 H e↓ t (˜ x t ; x t )π ↓ 1 (x 1 t )dx 1 t , (53) where π ↓ 1 andH e↓ t denote the symmetric decreasing rearrangements of π 1 andH e t , respec- tively. The following facts hold: 284 1. Since π 1 is symmetric and unimodal around 0, π ↓ 1 =π 1 . (54) 2. SinceH e t (˜ x t ; x t ), as a function of x 1 t , is symmetric and unimodal around ˜ x 1 t (a fact that can be verified by inspection), we have: H e↓ t (˜ x t ; x t ) =H e t (0, ˜ x 2 t ); x t . (55) Therefore, the Hardy-Littlewood inequality implies that: Z R n 1 H e t (˜ x t ; x t )π 1 (x 1 t )dx 1 t ≤ Z R n 1 H e t (0, ˜ x 2 t ); x t π 1 (x 1 t )dx 1 t , (56) which is equivalent to: Z R n 1 G e t (0, ˜ x 2 t ); x t π 1 (x 1 t )dx 1 t ≤ Z R n 1 G e t (˜ x t ; x t )π 1 (x 1 t )dx 1 t . (57) Therefore, ˜ x 1? t = 0. (58) Fixing ˜ x 1? t = 0 and following the same sequence of arguments exchanging the roles of x 1 t and x 2 t , we show that ˜ x 2? t = 0. Therefore, ˜ x ? t = 0. (59) 285 D.3 Optimal thresholds for the asymmetric case In the case of asymmetric costs and weights the modified recursive algorithm is as follows. For t∈{1,··· ,T− 1}: Compute the functionC 0 t+1 according to (5.46) andC 1 t+1 andC 2 t+1 for according to: C i t+1 def =c i +E h V π t+1 min{e− 1 +Z t ,B} i , i∈{1, 2}, (60) where V π t (0) def =E h X i∈{1,2} w i kX i t −a i k 2 +V π t+1 min{Z t ,B} i (61) and V π t (e) def =E h min n X i∈{1,2} w i kX i t −a i k 2 +C 0 t+1 (e),w 2 kX 2 t −a 2 k 2 +C 1 t+1 (e),w 1 kX 1 t −a 1 k 2 +C 2 t+1 (e) oi . (62) Finally, the optimal thresholds are given by: τ i? t (e) def = s C i t+1 (e)−C 0 t+1 (e) w i , i∈{1, 2}. (63) 286 E Appendix: Networked Estimation - Markov Case E.1 Proof of Lemma 5.8 We will first expand the information structure of the estimatori toI expanded t which includes the observations of the remaining estimators i.e. I expanded t ={Y i 1:t , Y −i 1:t }. If the optimal estimator under the expanded informationI expanded t has the same form as in (5.80) then it is also optimal under the original informationI i t . Thus, it suffices to show that underI expanded t , the optimal choice of estimate is as described in the lemma. Using Lemma 5.7 we know that ˆ X i,∗ t = E[X i t |I expanded t ]. When Y i t 6= ?, then E[X i t |I expanded t ] = X i t = Y i t . When Y i t = ?, E[X i t |I expanded t ] = Z i t +E[E i t |I expanded t ]. If E[E i t |I expanded t ] = 0 whenever Y i t = ?, then the result of the lemma follows. We will show that if Y i t = ?, then E[E i t |I expanded t ] = 0, by showing that the belief of the estimator about E i t is symmetric around 0. For that purpose, define the pre-transmission belief Π t =P(E t |I expanded t−1 ) and the post-transmission belief Θ t =P(E t |I expanded t ). Letγ t be an arbitrary scheduling strategy from the class Γ sym . We proceed by induction to show that Π t is symmetric around 0. In that process it is revealed that when Y i t = ?, Θ t is symmetric around 0 along the i th component, that is, Θ t (e i t , e −i t ) = Θ t (−e i t , e −i t ),∀e i t , e −i t , which gives us the desired result. 1. Base Case: Π 1 (e 1 ) = Q N i=1 f W (e i 1 ) which is clearly symmetric around 0. 287 2. 
Let Π t be symmetric around 0 along all the components. Then, Θ t (e t ) = I(γt(et)=0)Πt(et) R ˜ e t I(γt(˜ et)=0)Πt(˜ et) if U t = 0 I(e j t =x)I(γt(et)=j)Πt(et) R ˜ e t I(˜ e j t =x)I(γt(˜ et)=j)Πt(˜ et) if U t =j,E j t =x (64) When U t = 0, Θ t (−e j t , e −j t ) = Θ t (e j t , e −j t ),∀e j t , e −j t ,k follows by the symmetry of γ t and Π t . WhenU t =k then it is easy to observe that Θ t (−e j t , e −j t ) = Θ t (e j t , e −j t ),∀e j t , e −j t ,j6= k. Hence, if Π t is symmetric then Θ t is symmetric around 0 along all the components other than k in this case. 3. We will now complete the induction and show that Π t+1 is symmetric around 0 along all the components using the structure of Θ t under the following two cases. i) U t = 0 - In this case, Θ t is symmetric around 0 along all the components. Also, E j t+1 =X j t+1 −Z j t+1 =X j t +W j t −Z j t =E j t +W j t for all j. Π t+1 (e t+1 ) =P(E t + W t = e t+1 |I expanded t ) = Z et N Y i=1 f W (e i t+1 −e i t ) Θ t (e t )de t Following shows the symmetry of Π t+1 for every i: Π t+1 (−e i t+1 , e −i t+1 ) = Z et f W (−e i t+1 −e i t ) Y j6=i f W (e j t+1 −e j t ) Θ t (e t )de t = Z et f W (e i t+1 +e i t ) Y j6=i f W (e j t+1 −e j t ) Θ t (e t )de t = Z et f W (e i t+1 −e i t )× Y j6=i f W (e j t+1 −e j t ) Θ t (−e i t ,e −i t )de t 288 = Z et f W (e i t+1 −e i t ) Y j6=i f W (e j t+1 −e j t ) Θ t (e i t ,e −i t )de t = Π t+1 (e i t+1 , e −i t+1 ) where the second equality follows from the symmetry of f W , third by change of vari- ables and last equality follows from the symmetry of Θ t . ii) U t = k - In this case E k t+1 = W k t and E j t+1 = E j t +W j t for all j6= k. Also, Θ t is symmetric around 0 along all the components j,j6=k as seen earlier. Π t+1 (e t+1 ) =P(W k t =e k t+1 ,E j t +W j t =e j t+1 ∀j6=k|I expanded t ) =f W (e k t+1 ) Z et Y j6=k f W (e j t+1 −e j t ) Θ t (e t )de t We can argue that Π t+1 is symmetric around 0 in a similar fashion as done in the previous case. Therefore, whenever Y i t = ?, Θ t is symmetric around 0 along the component i, which concludes the proof. E.2 Proof of Lemma 5.9 We will use the following lemma in proving Lemma 5.9. Lemma E.1. LetW be a random variable with a symmetric unimodal distribution around 0 and λ> 0 be a constant. Then, P(|W +x|>λ) is an monotonically increasing function of x for x∈R + . 289 Proof. P(|W +x|≥λ) = 1−P(−x−λ<W <−x +λ). Now,P(−x−λ<W <−x +λ) is the area under the curve of the density function ofW over the interval centered around the point−x and of length 2λ. Whenx↑, the length of the interval reamins unaltered but the center of the interval moves away from the origin. Since W is unimodal around the origin the area under the curve of the density function also decreases. Thus, P(|W 1 t +x|≥λ) is an increasing function of x. Now, we present the proof of Lemma 5.9. Proof of Lemma 5.9. Let f t (x,y) = EV t+1 (x +W 1 t ,y +W 2 t ). We can write the following recursive relation for f t , f t (x,y) =E min c + (W 1 t +x) 2 +f t+1 (W 1 t +x, 0),c + (W 2 t +y) 2 +f t+1 (0,W 2 t +y), (W 1 t +x) 2 + (W 2 t +y) 2 +f t+1 (W 1 t +x,W 2 t +y) . (65) We will first show using backward induction that f t has the following properties: 1. f t (·) is a function of|x|,|y|. 2. f t (·) is monotonically increasing in both|x| and|y|. 3. f t (x,y) =f t (y,x) for all x,y∈R. 290 The base case is true trivially since f T (x,y) = 0. Now, let f t+1 satisfies the above three properties then, 1. 
f t (x,y) depends only on|x|,|y|: f t (−x,y) =E min c + (W 1 t −x) 2 +f t+1 (W 1 t −x, 0),c + (W 2 t +y) 2 +f t+1 (0,W 2 t +y), (W 1 t −x) 2 + (W 2 t +y) 2 +f t+1 (W 1 t −x,W 2 t +y) =E min c + (−W 1 t −x) 2 +f t+1 (−W 1 t −x, 0),c + (W 2 t +y) 2 +f t+1 (0,W 2 t +y), (−W 1 t −x) 2 + (W 2 t +y) 2 +f t+1 (−W 1 t −x,W 2 t +y) =f t (x,y) where the second inequality follows from the fact that W 1 t and−W 1 t are identically distributed as W 1 t is symmetric around 0. Third equality follows from the induction hypothesis and (65). Similarly, f t (x,y) = f t (x,−y) = f t (−x,−y). Therefore, f t depends only on|x|,|y|. 2. f t (x,y) =f t (y,x): This property follows easily sinceW 1 t andW 2 t are independent and identically distributed and using the induction hypothesis thatf t+1 (x,y) =f t+1 (y,x). 3. f t (x,y) is monotonically increasing in both x,y when x≥ 0,y ≥ 0: Since f t de- pends only the absolute values of its argument we focus on charazterizing f t in the non-negative quadrant. Now assume, f t+1 (·) is monotonically increasing in both the arguments. Defineg t (x,y) =x 2 +y 2 +f t+1 (x,y). Then,g t (·) is also monotonically in- creasing. Then,f t (x,y) =E min(c+g t (W 1 t +x, 0),c+g t (0,W 2 t +y),g t (W 1 t +x,W 2 t +y)). 291 Define g −1 t,d (β) = min{x≥ 0 :g t (x,d)≥β}. Then, g t (x,d)≥β, ∀x≥g −1 t,d (β) since g t is a monotonically increasing function. Note that since g t (x,y) = g t (y,x) (using the property 2 of f t ), it follows that, g t (d,y)≥β, ∀y≥g −1 t,d (β). Fix y≥ 0, then f t (x,y) = ∞ Z 0 P min(c +g t (W 1 t +x, 0),c +g t (0,W 2 t +y),g t (W 1 t +x,W 2 t +y))≥λ dλ = ∞ Z 0 P c +g t (W 1 t +x, 0)≥λ,c +g t (0,W 2 t +y)≥λ,g t (W 1 t +x,W 2 t +y)≥λ dλ = ∞ Z 0 P |W 1 t +x|≥g −1 t,0 (λ−c),|W 2 t +y|≥g −1 t,0 (λ−c),g t (W 1 t +x,W 2 t +y)≥λ dλ = ∞ Z 0 Z |w 2 t +y|≥g −1 t,0 (λ−c) P |W 1 t +x|≥g −1 t,0 (λ−c),g t (W 1 t +x,w 2 t +y)≥λ f W (w 2 t )dw 2 t dλ = ∞ Z 0 Z |w 2 t +y|≥g −1 t,0 (λ−c) P |W 1 t +x|≥η f W (w 2 t )dw 2 t dλ where η = max(g −1 t,0 (λ−c),g −1 t,|w 2 t +y| (λ)). Using Lemma E.1, P(|W 1 t +x|≥ η) is an increasing function ofx. Therefore,f t (x,y) is increasing inx. We can similarly argue that f t (x,y) is increasing in y. Now,V t (E 1 t ,E 2 t ) = min(c+|E 1 t | 2 +f t (E 1 t , 0),c+|E 2 t | 2 +f t (0,E 2 t ),|E 1 t | 2 +|E 2 t | 2 +f t (E 1 t ,E 2 t )). Statement 1-3 of the lemma follow directly from the Property 1-3 of the functionf t . Lastly, whenU t 6= 0, sensor 1 will be scheduled if|E 2 t | 2 +f t (|E 2 t |, 0)≤|E 1 t | 2 +f t (|E 1 t |, 0) where we used the Property 2 of f t to write f t (0,|E 2 t |) = f t (|E 2 t |, 0). Using the monotonicity of f t its clear that sensor 1 will be scheduled if|E 1 t |≥|E 2 t | and vice-versa. Hence, the lemma follows. 292 E.3 Proof of Lemma 5.11 Proof. We will drop the subscriptT−1 from all the functions and variables in the proof for brevity. Define H(x,y,W 2 ) =E W 1 min(c + (W 1 +x) 2 ,c + (W 2 +y) 2 , (W 1 +x) 2 + (W 2 + y) 2 ). Then, f(x,y) =E W 2 H(x,y,W 2 ) = R H(x,y,w 2 )f W (w 2 )dw 2 . Thus, we can write the difference Δf x (y) = f(x,y)−f(0,y) = R [H(x,y,w 2 )−H(0,y,w 2 )]f W (w 2 )dw 2 . We will now show that Δf x (y) is an increasing function of y. Let w 2 be a fixed realization of W 2 . 
Consider the following two cases: i) If (w 2 +y) 2 ≤c, H(x,y,w 2 ) = (w 2 +y) 2 +E min(c, (W 1 +x) 2 ) = (w 2 +y) 2 + Z c 0 P((W 1 +x) 2 ≥z)dz ii) If (w 2 +y) 2 >c, then H(x,y,w 2 ) =c +E min((w 2 +y) 2 , (W 1 +x) 2 ) =c + Z (w 2 +y) 2 0 P((W 1 +x) 2 ≥z)dz 293 Therefore, H(x,y,w 2 )−H(0,y,w 2 ) = R c 0 F (x,z)dz if |w 2 +y|≤ √ c R (w 2 +y) 2 0 F (x,z)dz o.w. (66) where F (x,z) = P((W 1 +x) 2 ≥ z)−P((W 1 ) 2 ≥ z). Hence, we can write Δf x (y) = Δ 1 + Δ 2 + Δ 3 where, Δ 1 = Z |w 2 +y|≤ √ c c Z 0 F (x,z)f W (w 2 )dzdw 2 Δ 2 = Z w 2 +y> √ c (w 2 +y) 2 Z 0 F (x,z)f W (w 2 )dzdw 2 Δ 3 = Z w 2 +y<− √ c (w 2 +y) 2 Z 0 F (x,z)f W (w 2 )dzdw 2 We look at the three terms seperately. Firstly, we can write Δ 1 = c Z 0 F (x,z)dz Z |w 2 +y|≤ √ c f W (w 2 )dw 2 = ( c Z 0 F (x,z)dz)P(|W 2 +y|≤ √ c) (67) Now, we look at Δ 2 , 294 Δ 2 = Z w 2 > √ c−y (w 2 +y) 2 Z 0 F (x,z)f W (w 2 )dzdw 2 = ∞ Z 0 Z w 2 ≥max(−y+ √ z,−y+ √ c) F (x,z)f W (w 2 )dw 2 dz = c Z 0 ∞ Z −y+ √ c F (x,z)f W (w 2 )dw 2 dz + ∞ Z c ∞ Z −y+ √ z F (x,z)f W (w 2 )dw 2 dz = c Z 0 F (x,z)dz P(W 2 +y> √ c) + ∞ Z c F (x,z)P(W 2 +y> √ z)dz (68) where we used Fubini’s theorem to interchange the integrals in the second equality. Using a similar analysis, we can write Δ 3 as follows Δ 3 = c Z 0 F (x,z)dz P(W 2 +y<− √ c) + ∞ Z c F (x,z)P(W 2 +y<− √ z)dz (69) Using the decomposition in (67), (68) and (69) we can write, Δf x (y) = c Z 0 F (x,z)dz P(|W 2 +y|≤ √ c) + c Z 0 F (x,z)dz P(|W 2 +y+> √ c) + ∞ Z c F (x,z)P(|W 2 +y|> √ z)dz = c Z 0 F (x,z)dz + ∞ Z c F (x,z)P(|W 2 +y|> √ z)dz (70) Now, F (x,z)≥ 0 using Lemma E.1 and unimodality of W 1 . Also,P(|W 2 +y|> √ z) is an increasing function of y using the unimodality of W 2 and Lemma E.1. Therefore, Δf x (y) 295 is an increasing function ofy. Similarly, we can argue thatf(x,y)−f(x, 0) is an increasing function of x. Hence, the result of the lemma follows. 296 F Appendix: Worst Case Guarantees for remote estimation F.1 Proof of Theorem 6.1 To prove Theorem 6.1, we first derive some useful properties. Recall thatN = (N 1 ,N 2 ,...,N T ) is the collection of all the noise variables in the system. Note that given the strategy η, the stateS r and the information Q r can be written down as some function of N forr∈T . Thus, for any function f and r≥t we can write sup (Sr,Qr )|qt f(S r ,Q r ) = sup N|qt f(S r ,Q r ) For any strategy η, we define its “cost-to-go” function at time t as V η t (q t ) := sup N|qt max r≥t ρ r (S r ,η r (Q r )), (71) which is a function of the realization q t of available information at time t. Then it is clear that the worst case cost of strategy η is sup N max t∈T ρ t (S t ,η t (Q t )) = sup Q 1 V η 1 (Q 1 ). (72) We also define the value function of the problem at t to be V ∗ T (q T ) := inf a T ∈A(s o T ) n sup N|q T ρ t (S T ,a T ) o , (73) V ∗ t (q t ) := inf at∈A(s o t ) ( sup N|(qt,at) max ρ t (S t ,a t ),V ∗ t+1 (Q t+1 ) ) (74) We have the following result. 297 Lemma F.1. For any strategy η, at each time t and for every realization q t , we have V ∗ t (q t )≤V η t (q t ). (75) Proof. The proof is done by induction. At T we have V ∗ T (q T ) = inf a T ∈A(s o T ) n sup N|q T ρ t (S T ,a T ) o ≤ sup N|q T ρ T (S T ,η T (q T )) =V η T (q T ). (76) Suppose the lemma is true at t + 1. 
Then at t we have V η t (q t ) = sup N|qt max r≥t ρ r (S r ,η r (Q r )) = sup N|qt max ρ t (S t ,η t (q t )), max r≥t+1 ρ r (S r ,η r (Q r )) = max sup N|qt ρ t (S t ,η t (q t )), sup N|qt max r≥t+1 ρ r (S r ,η r (Q r )) (77) From Property 2 we get sup N|qt max r≥t+1 ρ r (S r ,η r (Q r )) = sup Q t+1 |(qt,At=ηt(qt)) sup N|(Q t+1 ,qt,At=ηt(qt)) max r≥t+1 ρ r (S r ,η r (Q r )) ! = sup Q t+1 |(qt,At=ηt(qt)) V η t+1 (Q t+1 ). (78) Now from (77)-(78) and the induction hypothesis we get V η t (q t ) = max sup N|qt ρ t (S t ,η t (q t )), sup Q t+1 |(qt,At=ηt(qt)) V η t+1 (Q t+1 ) 298 ≥ max sup N|qt ρ t (S t ,η t (q t )), sup Q t+1 |(qt,At=ηt(qt)) V ∗ t+1 (Q t+1 ) ≥V ∗ t (q t ). (79) It is straightforward to see that a strategy η ∗ achieving infimum at each stage in the defi- nition of V ∗ t (q t ) will be optimal and its cost will be sup Q 1 V ∗ 1 (Q 1 ). Let Θ t = [[S t |Q t ]] be the conditional range of the state at timet. Recall that Π t = [[S h t |Q t ]]. Note that Π t and Θ t are related as follows Θ t = [[S h t ,S o t |Q t ]] = [[S h t |Q t ]]×{S o t } = Π t ×{S o t }. (80) The evolution of Θ t has the following feature. Lemma F.2. There exists a function φ t (θ t ,a t ,o t+1 ,s o t+1 ) such that Θ t+1 =φ t (Θ t ,A t ,O t+1 ,S o t+1 ). (81) Proof. We can write (O t+1 ,S o t+1 ) = ˜ h t+1 (S t ,A t ,N t+1 ) for some function ˜ h t+1 . Under any strategy η, Θ t+1 = [[S t+1 |Q t+1 ]] = S t+1 |Q t ,A t ,O t+1 ,S o t+1 299 = hh f t+1 (S t ,A t ,N t+1 )|Q t , ˜ h t+1 (S t ,A t ,N t+1 ) = (O t+1 ,S o t+1 ) ii = n f t+1 (s t ,A t ,n t+1 ) : ˜ h t+1 (s t ,A t ,n t+1 ) = (O t+1 ,S o t+1 ), (s t ,n t+1 )∈ [[S t ,N t+1 |Q t ]] o = n f t+1 (s t ,A t ,n t+1 ) : ˜ h t+1 (s t ,A t ,n t+1 ) = (O t+1 ,S o t+1 ), s t ∈ [[S t |Q t ]],n t+1 ∈ [[N t+1 ]] o (82) where the last equality follows from Property 1 and the fact thatN t+1 is unrelated toS t and Q t . Therefore, (82) implies that Θ t+1 is a function ofA t ,O t+1 ,S o t+1 and Θ t = [[S t |Q t ]]. Now let’s prove Theorem 6.1. Its easy to observe using (80) that Θ t can be completely characterized using Π t ,S o t . Thus, to prove Theorem 6.1 it suffices to show that the optimal value function depends only on Θ t . Proof of Theorem 6.1. Lemma F.1 ensures that optimal costs and optimal strategies are characterized by the dynamic program V ∗ T (q T ) = inf a T n sup s T ∈θ T ρ t (s T ,a T ) o (83) V ∗ t (q t ) = inf at n max( sup st∈θt ρ t (s t ,a t ), sup Q t+1 |(qt,at) V ∗ t+1 (Q t+1 )) o (84) Therefore, it just remains to show that the above value function at t can be written as a function of θ t . Then the optimal value will depend only on θ t instead of the entire q t . This 300 claim about the value functions is proved by induction. At T , we have V ∗ T (q T ) = inf a T n sup s T ∈θ T ρ t (s T ,a T ) o =:V ∗ T (θ T ). (85) Suppose this claim is true att + 1. From Lemma F.2 and the induction hypothesis we have sup Q t+1 |(qt,at) V ∗ t+1 (Q t+1 ) = sup (qt,at,O t+1 ,S o t+1 )|(qt,at) V ∗ t+1 (Θ t+1 ) = sup (O t+1 ,S o t+1 )|(qt,at) V ∗ t+1 (φ t (θ t ,a t ,O t+1 ,S o t+1 )). 
(86) Since (O t+1 ,S o t+1 ) = ˜ h t+1 (S t ,A t ,N t+1 ) as in the proof of Lemma F.2, the above equation can be further expressed as sup Q t+1 |(qt,at) V ∗ t+1 (Q t+1 ) = sup (St,N t+1 )|(qt,at) V ∗ t+1 (φ t (θ t ,a t , ˜ h t+1 (S t ,a t ,N t+1 ))) = sup st∈θt,n t+1 ∈[[N t+1 ]] V ∗ t+1 (φ t (θ t ,a t , ˜ h t+1 (s t ,a t ,n t+1 ))) (87) where the last equality follows from Property 1 since θ t = [[S t |q t ,a t ]] depends on the realization ofQ t ,A t andN t+1 is unrelated to all variables beforet + 1. Therefore, the value function at t is equal to V ∗ t (q t ) = inf at n max sup st∈θt ρ t (s t ,a t ), sup st∈θt,n t+1 ∈[[N t+1 ]] V ∗ t+1 (φ t (θ t ,a t ,h t+1 (s t ,a t ,n t+1 ))) o =:V ∗ t (θ t ) (88) 301 which finishes the proof of the claim. It is straightforward to see that a strategy achieving infimum at each stage will have a cost equal to sup Q 1 V ∗ 1 (Q 1 ) = sup q 1 ∈[[Q 1 ]] V ∗ 1 (θ 1 ) where θ 1 = [[S 1 |q 1 ]]. Hence the proof is complete. F.2 Proof of Lemma 6.1 Fix the estimator’s strategy to some arbitrary g. Define S t = (X t ,E t ,Y 1:t−1 ). Then, S t+1 = X t+1 E t+1 Y 1:t = λAX t +N t max(E t −U t , 0) Y 1:t−1 ,h(X t ,U t ) =: ˜ F t (S t ,U t ,N t ). The instantaneous cost at time t can be written as C t =||X t − ˆ X t || =||X t −g t (Y 1:t )|| =||X t −g t (Y 1:t−1 ,Y t )|| =||X t −g t (Y 1:t−1 ,h(X t ,U t ))|| =:ρ(S t ,U t ). The problem of optimizing the transmission strategy is now an instance of the centralized minimax control problem discussed in Section 6.2 with S t as the directly observable state and U t as the action. Since there is no hidden state for the transmitter, the optimal transmission strategy at time t is a function of the current state S t . 302 Since the above argument holds for any arbitrary estimation strategy g, it holds true for an optimal estimation strategy as well. Therefore, it is sufficient to consider transmission strategies of the form U t = f t (S t ) = f t (X t ,E t ,Y 1:t−1 ). Moreover, since E t can be inferred from Y 1:t−1 , we can further restrict transmission strategies to the form U t =f t (X t ,Y 1:t−1 ) without any loss in performance. F.3 Proof of lemmas 6.4 and 6.5 Proof of Lemma 6.4 The proof is trivial if λ = 0, so we will focus on the case of λ6= 0. For a set S and x∈R n , define r(S,x) := sup z∈S ||z−x||. For a fixed x, we can write, r(φ t (π),x) = sup z∈φt(π) ||z−x|| = sup y∈π,||w||≤a t+1 ||λAy +w−x|| (89) ≤ sup y∈π ||λAy−x|| + sup ||w||≤a t+1 ||w|| =|λ| sup y∈π ||A(y− A −1 x λ )|| +a t+1 =|λ| sup y∈π ||y− ˜ x|| +a t+1 , where ˜ x = A −1 x λ , =|λ|r(π, ˜ x) +a t+1 (90) where we used the fact that for any vectoru,||Au|| =||u|| sinceA is an orthogonal matrix. 303 Let> 0. Then,∃y ∈π such that||y − ˜ x||>r(π, ˜ x)− |λ| . Takingw =a t+1 A(y−˜ x) ||y−˜ x|| sign(λ) and y =y we get sup y∈π,||w||≤a t+1 ||λAy +w−x||≥||λAy −x +a t+1 A(y − ˜ x) ||y − ˜ x|| sign(λ)|| =|λ|||y − ˜ x + a t+1 |λ| y − ˜ x ||y − ˜ x|| || =|λ|||y − ˜ x|| +a t+1 >|λ|r(π, ˜ x) +a t+1 − (91) Since is arbitrary (89) and (91) implies, =⇒ r(φ t (π),x)≥|λ|r(π, ˜ x) +a t+1 (92) Using (90) and (92) we get r(φ t (π),x) =|λ|r(π, ˜ x) +a t+1 . Thus, r ∗ (φ t (π)) = inf x (|λ|r(π, A −1 x λ ) +a t+1 ) =|λ|r ∗ (π) +a t+1 (93) where the second equality follows since 1 λ A −1 is invertible. Proof of Lemma 6.5 1) We start by showing that the lemma is true for t = T . Note that V T (π, ˜ e) = r ∗ (π) by definition ofr ∗ (π) andV T (π, ˜ e). Therefore, it follows trivially that V T satisfies property Q. Now, consider two sets θ and ˜ θ such that θQ ˜ θ. 
At t =T , observe that if e T > 0, the pre- scription γ all achieves the infimum in (6.24) and the corresponding infimum value is zero. Thus, W T (θ,e) = W T ( ˜ θ,e) = 0,∀e > 0. If e T = 0 then the only possible choice of γ T is 304 γ none . Observe from (6.22) that ψ(θ,γ none ,) =θ, thus it follows that W T (θ, 0) =V T (θ, 0) from (6.24). Since θQ ˜ θ and V T satisfies property Q, we have W T (θ, 0) = W T ( ˜ θ, 0). Thus, W T satisfies property Q. 2) We now proceed by induction to prove that the lemma is true for t<T . We first show that ifW t+1 satisfies property Q, then so doesV t . (6.25) can be simplified to the following: V t (π t , ˜ e t ) = max{r ∗ (π t ),W t+1 (φ t (π t ), ˜ e t ))}, (94) where r ∗ (π t ) = inf ˆ xt∈R sup xt∈πt ||x t − ˆ x t ||. Let π, ˜ π be two sets such that πQ˜ π. Then, r ∗ (π t ) =r ∗ (˜ π t ). Hence, the first term inside the maximization in (94) is the same for π and ˜ π. It follows from Lemma 6.4 that ifπQ˜ π thenφ t (π)Qφ t (˜ π). Then,W t+1 (φ t (π),e t ) =W t+1 (φ t (˜ π),e t ) follows using the induction hypothesis. Thus, both the terms in the maximization in (94) satisfy property Q. Therefore, V t also satisfies property Q. Next, we show that ifV t satisfies property Q then so doesW t . Observe that if{x 1 } and{x 2 } are two singleton sets thenV t ({x 1 },e) =V t ({x 2 },e) sinceV t satisfies property Q. Thus, we may write V t ({x},e) =K t (e) ∀x∈X. 305 Let e t > 0. Define W γ t (θ t ,e t ) for a given prescription γ as follows: W γ t (θ t ,e t ) = sup xt∈θt V t (ψ(θ t ,γ,y t ),e t −γ(x t )). (95) Then,W t (θ t ,e t ) = inf γ W γ t (θ t ,e t ). For any prescriptionγ, letA γ,θt :={x∈θ t :γ(x) = 0} be the set of the state values inθ t which are mapped to the control action 0. IfA γ,θt =∅, then W γ t (θ t ,e t ) =K t (e t − 1). If θ t \A γ,θt =∅, then W γ t (θ t ,e t ) =V t (θ t ,e t ). If neither A γ,θt or θ t \A γ,θt is empty, then W γ t (θ t ,e t ) = max{V t (A γ,θt ,e t ), sup x∈θt\A γ,θ t V t ({x},e t − 1)} = max{V t (A γ,θt ,e t ),K t (e t − 1)}≥K t (e t − 1). Also, it is easy to see that for the prescriptions γ all and γ none we have W γ all t (θ t ,e t ) = K t (e t − 1) and W γ none t (θ t ,e t ) =V t (θ t ,e t ) respectively. Thus, it is clear that W t (θ t ,e t ) = inf γ W γ t (θ t ,e t ) = min{K t (e t − 1),V t (θ t )} 306 = min{W γ all t (θ t ,e t ),W γ none t (θ t ,e t )}. (96) Thus, either γ all or γ none is an optimal prescription at time t. Now, ifθQ ˜ θ, then it follows from the induction hypothesis thatW t (θ,e t ) = min{V t (θ,e t ),K t (e t − 1)} = min{V t ( ˜ θ,e t ),K t (e t − 1)} = W t ( ˜ θ,e t ). Similar arguments can be made if e t = 0. Therefore, W t satisfies property Q. Thus, by induction, V t and W t satisfy property Q for all t = 1, 2,...,T . F.4 Proof of Theorem 6.2 Proof of Lemma 6.8 We first show that the post-transmission conditional range Π t is a ball centered around ˜ X t under a globally optimal prescription strategy. This can be done by a simple induction argument: At t = 1 one of the following two will happen 1. If γ 1 =γ all , then ˜ X 1 =X 1 and Π 1 ={X 1 }. 2. If γ 1 =γ none , then ˜ X 1 = 0 and Π 1 ={x 1 :||x 1 ||≤a 1 }. Hence, the claim is true for t = 1. Let the claim be true for t. Then, at time t + 1 one of the following will happen, 1. If γ t+1 =γ all , then ˜ X t+1 =X t+1 and Π t+1 ={X t+1 }. 307 2. If γ t+1 = γ none , then ˜ X t+1 = λA ˜ X t . In this case, Π t+1 = Θ t+1 ={x t+1 : x t+1 = λAx t +n t+1 , x t ∈ Π t ,||n t+1 ||≤ a t+1 } i.e. 
Π t+1 is obtained by rotating Π t using A, scaling it by λ and then adding it to a ball centered around origin of radius a t+1 . Using the induction hypothesis that Π t is a ball centered at ˜ X t , it follows that Π t+1 is a ball centered at ˜ X t+1 =λA ˜ X t . Thus, Π t is a ball centered around ˜ X t for all t. Therefore, the infimum in (6.25) will be achieved by ˜ X t . Hence, ˜ X t is the optimal estimate at time t. Proof of Theorem 6.2 We will argue that the strategies f ∗ , g ∗ achieve the globally optimal cost for Problem 6.1. Denote theK time instants 1 withU d∗ t equal to 1 by 1≤t 1 ≤...t K ≤T with the convention that t K+1 =T + 1,t 0 = 0 and X d 0 = 0. Now, in Problem 6.3, if t i + 1<t i+1 , the state grows in the interval [t i + 1,t i+1 − 1] for all i and in the interval [t K + 1,T ] if t K <T . Therefore, J d (U d∗ 1:T ) := max t∈T ρ(X d t ,U d∗ t ) = max 0≤i≤K X d t i+1 −1 I t i +1<t i+1 (97) Using (97) and the state dynamics we can write J d (U d∗ 1:T ) = max 0≤i≤K t i+1 −1 X j=t i +1 |λ| t i+1 −1−j a j I t i +1<t i+1 (98) 1 If ti =ti+1 for some i, the controller chooses control action 1 fewer than K times. 308 Now, consider the worst case instantaneous cost in Problem 6.1 under the strategy f ∗ , g ∗ . First consider the interval [1,t 1 ]. If t 1 = 1 then the estimation error is 0 in this interval. Whent 1 > 1, let 1≤t<t 1 , then ˆ X t = 0 under g ∗ . Then at timet, the worst case estimation error is sup N 1:t || P t−1 j=0 λ j A j N t−j || = P t j=1 |λ| t−j a j . Hence, the worst case estimation error in [1,t 1 ] is P t 1 −1 j=1 |λ| t 1 −1−j a j I 1<t 1 . Repeating this argument we get that the worst case estimation error in the interval [t i + 1,t i+1 − 1] is P t i+1 −1 j=t i +1 |λ| t i+1 −1−j a j I t i +1<t i+1 . The cost incurred by the pair f ∗ , g ∗ is the maximum of the worst case estimation error in each interval and thusJ(f ∗ , g ∗ ) =J d (U d∗ 1:T ) using (98). Now, sinceU d∗ 1:T is the optimal open loop sequence it must achieve the optimal cost for Problem 6.3 which is the same as the optimal cost for Problem 6.1 from Lemma 6.7. Therefore, (f ∗ , g ∗ ) is globally optimal. F.5 Proof of Lemma 6.9 Consider some open loop sequence U d t and let the K time instants with U d t equal to 1 be denoted by 1≤ t 1 ≤ ...t K ≤ T with the convention that t K+1 = T + 1,t 0 = 0. Define y i = t i −t i−1 for 1≤ i≤ K + 1. We refer to{y i } 1≤i≤K+1 as the partition of the time horizon. Then, P K+1 i=1 y i = T + 1. Since K < T , t i + 1 < t i+1 will hold for some i. Then, using the proof of Theorem 6.2, observe that the cost incurred for a partition{y i } would be (max i |λ| y i −1 −1 |λ|−1 )a = ( |λ| max i y i −1 −1 |λ|−1 )a when|λ|6= 1. We will show that max i y i is at least Δ for any partition. We first consider the case when T +1 K+1 is not an integer. Suppose max i y i < T + 1 K + 1 , then y i ≤ j T +1 K+1 k ∀i =⇒ K+1 X i=1 y i ≤ (K + 1) T + 1 K + 1 <T + 1. (99) 309 (99) gives a contradiction since P K+1 i=1 y i = T + 1. For the case when T +1 K+1 is an integer, a similar contradiction can be obtained by noting that y i ≤ T +1 K+1 − 1∀i. Thus, max i y i ≥ T + 1 K + 1 = Δ. Now, consider the strategy where U d t = 1 when t = mΔ for some m∈{1,...,K}. Note that lΔ≤T < (l + 1)Δ for some 1≤l≤K. It is easy to check that max i y i = Δ for this strategy and hence it achieves the optimal cost. The proof for the case when|λ| = 1 can be easily obtained in a similar manner. 
310 G Appendix: Weakly coupled constrained MDP in Borel spaces G.1 Proof of Theorem 8.1 Our approach will be to first consider a centralized problem where a single agent knows the entire state and action history and takes both actions. The optimal cost of the centralized problem will serve as a lower bound for Problem 8.1. We will then establish that this lower bound is achieved under the control strategy and initial distribution described in Theorem 1. We consider a new single-agent problem with state X t := (X 1 t ,X 2 t ), action U t = (U 1 t ,U 2 t ) and same system dynamics, cost and constraint in (8.1)-(8.4). In this new problem, the information available to the agent at time t isI C t ={X 1:t , U 1:t−1 }. The agent chooses the joint action using the randomized strategy π C t as, U t ∼π C t (·|I C t ). where π C t (·|I C t ) is a probability distribution on the joint control spaceU. We define π C as the collection{π C t } t≥0 . The cost and the constraint function under a control strategy π C and initial distributionν are the same as in (8.3),(8.4). We formally state this new problem below. Problem G.1. Find a joint control strategy π C and initial distribution ν for the agents which minimizes the cost J(π C ,ν) subject to the constraint K(π C ,ν)≤k. 311 Problem G.1 can be viewed as a single agent constrained Markov Decision Process (CMDP) as follows: 1. State Process X t ∈ X whereX := Q 2 i=1 X i and Control Process U t ∈ U where U := Q 2 i=1 U i . 2. Transition Kernel: Distribution of the next state when action u is taken in state x is Q(B 1 ×B 2 |x, u) :=Q 1 (B 1 |x 1 ,u 1 )Q 2 (B 2 |x 2 ,u 2 ) for all B i ∈B(X i ). 3. Instantaneous cost and constraint: The cost and the constraint function incurred at time t is c(X t , U t ) := c 1 (X 1 t ,U 1 t ) +c 2 (X 2 t ,U 2 t ) and d(X t , U t ) := d 1 (X 1 t ,U 1 t ) + d 2 (X 2 t ,U 2 t ) respectively. Single agent constrained MDP in Borel spaces can be solved using infinite dimensional linear programming approach [21]. In this approach, an optimal occupation measure (joint probability measure) of the state and control is determined using a linear program. The optimal pair of control strategy and an initial distribution is obtained using the optimal occupation measure. We introduce the occupation measure based LP below: LP-2: min μ hμ,ci subject to:hμ,di≤k (100) μ(B,U) = Z X×U Q(B|x, u)μ (dx,du), ∀B∈B(X ) (101) μ(X,U) = 1,μ∈M + (X×U) (102) 312 Note thatM + (X×U) is the vector space of positive measures onX×U with finite w variations (similar to Definition 8.1) wherew(x, u) := 1+c(x, u). The following lemma can be obtained using the results in [21]: Lemma G.1. Under Assumption 8.1 and 8.2 there exists a solution μ ∗ to LP-2. Suppose μ ∗ = ˆ μ ∗ ·φ ∗ , then, 1. An optimal control strategy for Problem G.1 is the randomized stationary policy φ ∗ (·|x) and the optimal initial distribution is ˆ μ ∗ . 2. The optimal long-term expected average cost is given by J(φ ∗ , ˆ μ ∗ ) =hμ ∗ ,ci. Lemma G.1 characterizes the solution to Problem G.1 in terms of the solution to the LP-2. The next lemma shows that LP-2 and LP-1 defined in section 8.3 are equivalent. Lemma G.2. The linear programs LP-2 and LP-1 are equivalent in the following sense: 1. 
For any feasible μ in LP-2, let μ 1 ,μ 2 be the marginals of μ as follows: μ 1 (B 1 ,C 1 ) :=μ B 1 ×X 2 ,C 1 ×U 2 (103) ∀B 1 ∈B(X 1 ),C 1 ∈B(U 1 ) μ 2 (B 2 ,C 2 ) :=μ X 1 ×B 2 ,U 1 ×C 2 (104) ∀B 2 ∈B(X 2 ),C 2 ∈B(U 2 ) Then, μ 1 ,μ 2 are feasible for LP-1 and the value of LP-1 at μ 1 ,μ 2 is identical to the value of LP-2 at μ. 313 2. For any feasible μ 1 ,μ 2 in LP-1, let μ be the product measure defined as: μ B 1 ×B 2 ,C 1 ×C 2 :=μ 1 (B 1 ,C 1 )μ 2 (B 2 ,C 2 ), ∀B i ∈B(X i ),C i ∈B(U i ),i∈{1, 2} (105) Then, μ is feasible for LP-2 and the value of LP-2 at μ is identical to the value of LP-1 at μ 1 ,μ 2 . Proof. 1) Letμ be feasible for LP-2 and letμ 1 ,μ 2 be its marginals as defined in (103),(104). Then, it is easy to observe that hμ,ci =hμ 1 ,c 1 i +hμ 2 ,c 2 i hμ,di =hμ 1 ,d 1 i +hμ 2 ,d 2 i Then, the value of LP-2 under μ will be the same as value of LP-1 under μ 1 ,μ 2 (Value of a LP is the value of its objective function). Also, constraint (8.7) holds true using the above equation and the fact that μ is feasible. Moreover, (8.9) is true since the marginals of a probability measure are also valid probability measures. We now verify that (8.8) also holds true for μ 1 ,μ 2 , μ 1 (B 1 ,U 1 ) =μ(B 1 ×X 2 ,U 1 ×U 2 ) = Z X,U Q 1 (B 1 |x 1 ,u 1 )Q 2 (X 2 |x 2 ,u 2 ) | {z } 1 μ(dx,du) = Z X 1 ,X 2 ,U 1 ,U 2 Q 1 (B 1 |x 1 ,u 1 )μ(dx,du) 314 = Z X 1 ,U 1 Q 1 (B 1 |x 1 ,u 1 )μ 1 (dx 1 ,du 1 ) Repeating the same argument for μ 2 , we can conclude that (8.8) holds true. Hence, μ 1 ,μ 2 are feasible for LP-1. 2) Let μ 1 ,μ 2 be feasible for LP-1 and consider the product measure μ in (105) for LP-2. Again it is easy to verify that hμ 1 ,c 1 i +hμ 2 ,c 2 i =hμ,ci hμ 1 ,d 1 i +hμ 2 ,d 2 i =hμ,di Therefore, the value of LP-2 and LP-1 will be the same underμ andμ 1 ,μ 2 respectively. The constraint (100) holds true using the above equation and the fact that μ 1 ,μ 2 are feasible. Also, it is easy to verify that (102) holds true since μ 1 ,μ 2 are valid probability measures their product μ is also an valid probability measure. We now verify that (101) holds true for μ: Z X,U Q(B 1 ×B 2 |x, u)μ(dx,du) = Z X 1 ,U 1 ,X 2 ,U 2 2 Y i=1 Q i (B i |x i ,u i )μ i (dx i ,du i ) = 2 Y i=1 Z X i ,U i Q i (B i |x i ,u i )μ i (dx i ,du i ) =μ 1 (B 1 ,U 1 )μ 2 (B 2 ,U 2 ) =μ(B 1 ×B 2 ,U) where we used Fubini’s theorem in the second equality. Hence, μ is feasible for LP-2. 315 We are now ready to prove Theorem 8.1. Proof. Let μ ∗ be the optimizing measure for LP-2. Using Lemma G.2, we can conclude that there exists measures μ i ∗ onX i ×U i such that for all B i ∈B(X i ),C i ∈B(U i ), μ ∗ (B 1 ×B 2 ,C 1 ×C 2 ) := 2 Y i=1 μ i ∗ (B i ,C i ) We can write μ ∗ = ˆ μ ∗ ·φ ∗ where ˆ μ ∗ (B 1 ×B 2 ) = ˆ μ 1 ∗ (B 1 )ˆ μ 2 ∗ (B 2 ) and φ ∗ (C 1 ×C 2 |x) =φ 1 ∗ (C 1 |x 1 )φ 2 ∗ (C 2 |x 2 ) (106) Thus, using lemma G.1, the optimal control strategy for Problem G.1 is given by the stationary randomized strategy φ ∗ (·|x). Let J ∗ c be the value of the average cost function and K ∗ c be the value of the average constraint function achieved under the pair (φ ∗ , ˆ μ ∗ ). Suppose agenti uses the stationary strategyφ i ∗ to generate its action. Then, the joint control strategy of the agents is a randomized stationary strategy with distributionφ 1 ∗ (·|x 1 )φ 2 ∗ (·|x 2 ) which is equal toφ ∗ (·|x) using (106). Moreover, when the initial state distribution of agent i is ˆ μ i ∗ , the joint initial state distribution is ˆ μ ∗ using (106). 
We are now ready to prove Theorem 8.1.

Proof. Let $\mu_*$ be the optimizing measure for LP-2. Using Lemma G.2, we can conclude that there exist measures $\mu^i_*$ on $\mathcal{X}^i \times \mathcal{U}^i$ such that for all $B^i \in \mathcal{B}(\mathcal{X}^i)$, $C^i \in \mathcal{B}(\mathcal{U}^i)$,
$$\mu_*(B^1 \times B^2, C^1 \times C^2) := \prod_{i=1}^{2} \mu^i_*(B^i, C^i).$$
We can write $\mu_* = \hat{\mu}_* \cdot \phi_*$, where
$$\hat{\mu}_*(B^1 \times B^2) = \hat{\mu}^1_*(B^1)\, \hat{\mu}^2_*(B^2) \quad \text{and} \quad \phi_*(C^1 \times C^2 \mid x) = \phi^1_*(C^1 \mid x^1)\, \phi^2_*(C^2 \mid x^2). \tag{106}$$
Thus, using Lemma G.1, the optimal control strategy for Problem G.1 is given by the stationary randomized strategy $\phi_*(\cdot \mid x)$. Let $J^*_c$ be the value of the average cost function and $K^*_c$ be the value of the average constraint function achieved under the pair $(\phi_*, \hat{\mu}_*)$.

Suppose agent $i$ uses the stationary strategy $\phi^i_*$ to generate its action. Then the joint control strategy of the agents is a randomized stationary strategy with distribution $\phi^1_*(\cdot \mid x^1)\, \phi^2_*(\cdot \mid x^2)$, which is equal to $\phi_*(\cdot \mid x)$ by (106). Moreover, when the initial state distribution of agent $i$ is $\hat{\mu}^i_*$, the joint initial state distribution is $\hat{\mu}_*$ by (106). Therefore, the cost achieved under $(\phi^1_*, \phi^2_*)$, $(\hat{\mu}^1_*, \hat{\mu}^2_*)$ is $J^*_c$ and the average constraint function is $K^*_c$. Now, $J^*_c$ is a lower bound on the optimal cost for Problem 8.1, since any feasible control strategy in Problem 8.1 is a feasible control strategy for the centralized problem. Therefore, $(\phi^1_*, \phi^2_*)$, $(\hat{\mu}^1_*, \hat{\mu}^2_*)$ is an optimal pair for Problem 8.1. Also, using Lemma G.1, it follows that the cost achieved by the optimal strategy is the value of the linear program, which is $\langle \mu^1_*, c^1 \rangle + \langle \mu^2_*, c^2 \rangle$.

G.2 Proof of Theorem 8.2 and Lemma 8.1

Proof of Theorem 8.2. We will first show that it is sufficient to consider Gaussian measures for LP-1. Consider a measure $\mu^i$ on $\mathcal{X}^i \times \mathcal{U}^i$ with means $m^i_x$, $m^i_u$ and second-moment matrix
$$\begin{bmatrix} \Sigma^i_{xx} & \Sigma^i_{xu} \\ (\Sigma^i_{xu})' & \Sigma^i_{uu} \end{bmatrix}.$$
Now, observe that
$$\langle \mu^i, c^i \rangle = \mathrm{Tr}(Q^i \Sigma^i_{xx}) + \mathrm{Tr}(R^i \Sigma^i_{uu}), \qquad \langle \mu^i, d^i \rangle = \mathrm{Tr}(M^i \Sigma^i_{xx}) + \mathrm{Tr}(N^i \Sigma^i_{uu}).$$
Suppose $(X^i, U^i) \sim \mu^i$ and let $\tilde{X}^i = A^i X^i + B^i U^i + W^i$ be the next state. Then (8.8) encodes the constraint that the distributions of $X^i$ and $\tilde{X}^i$ should be the same. Let $\mu^i$ be a feasible measure for LP-1 which satisfies (8.8). This means that the first and second moments of $X^i$ and $\tilde{X}^i$ must match when $(X^i, U^i) \sim \mu^i$, i.e.,
$$m^i_x = A^i m^i_x + B^i m^i_u, \tag{107}$$
$$\Sigma^i_{xx} = A^i \Sigma^i_{xx} (A^i)' + A^i \Sigma^i_{xu} (B^i)' + B^i \Sigma^i_{ux} (A^i)' + B^i \Sigma^i_{uu} (B^i)' + I. \tag{108}$$
Now, consider a Gaussian measure $\mu^i_g$ which has the same first and second moments as $\mu^i$. If $(X^i, U^i) \sim \mu^i_g$ are jointly Gaussian, then $\tilde{X}^i = A^i X^i + B^i U^i + W^i$ is also Gaussian, with mean and covariance given by the right-hand sides of (107) and (108) above. Thus $X^i$ and $\tilde{X}^i$ are both Gaussian with the same mean and covariance, since (107) and (108) hold for the moments of $\mu^i_g$. Hence $\mu^i_g$ satisfies (8.8). Also, $\langle \mu^i_g, c^i \rangle = \langle \mu^i, c^i \rangle$ and $\langle \mu^i_g, d^i \rangle = \langle \mu^i, d^i \rangle$, as $\mu^i$ and $\mu^i_g$ have the same second moments. Thus, for any feasible $\mu^i$ there exists a feasible Gaussian measure $\mu^i_g$ which achieves the same value of the linear program. Hence, it is sufficient to consider the class of Gaussian measures in LP-1.

Since a Gaussian measure is characterized by its first and second moments alone, we can reduce LP-1 to the following SDP:
$$\min_{\Sigma^i_{xx},\, \Sigma^i_{uu},\, \Sigma^i_{xu},\, m^i_x,\, m^i_u} \; \sum_{i=1}^{2} \mathrm{Tr}(Q^i \Sigma^i_{xx}) + \mathrm{Tr}(R^i \Sigma^i_{uu})$$
subject to:
$$\sum_{i=1}^{2} \mathrm{Tr}(M^i \Sigma^i_{xx}) + \mathrm{Tr}(N^i \Sigma^i_{uu}) \leq k,$$
$$\Sigma^i_{xx} = A^i \Sigma^i_{xx} (A^i)' + A^i \Sigma^i_{xu} (B^i)' + B^i \Sigma^i_{ux} (A^i)' + B^i \Sigma^i_{uu} (B^i)' + I,$$
$$m^i_x = A^i m^i_x + B^i m^i_u,$$
$$\begin{bmatrix} \Sigma^i_{xx} & \Sigma^i_{xu} \\ (\Sigma^i_{xu})' & \Sigma^i_{uu} \end{bmatrix} \succeq 0.$$
Also, without loss of generality we can set $m^i_x = m^i_u = 0$, since this does not affect the above SDP. This results in the LQG-SDP as presented in the theorem. Also, using Theorem 8.1, the optimal control strategy for agent $i$ is the randomized stationary strategy given by the conditional distribution of the action when the joint distribution of the state and action is $\mu^i_*$. Since $\mu^i_*$ is Gaussian, the conditional distribution of the control given the state of agent $i$ is also Gaussian, with mean $m^i_{u|x}$ and covariance $\Sigma^i_{u|x}$ as defined in the theorem.
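To make the reduction concrete, here is a minimal sketch of the per-agent LQG-SDP above (with $m^i_x = m^i_u = 0$ and unit-covariance noise) using cvxpy. The system matrices, cost and constraint weights and the budget k are arbitrary placeholders rather than data from the thesis, and the final two lines recover the Gaussian randomized policy by standard Gaussian conditioning of the control on the state.

# Minimal sketch of the per-agent LQG-SDP (illustrative; not the thesis code).
import cvxpy as cp
import numpy as np

n, m = 2, 1                              # hypothetical state/input dimensions
A = np.array([[0.9, 0.1], [0.0, 0.8]])   # hypothetical system matrices
B = np.array([[0.0], [1.0]])
Qc, R = np.eye(n), np.eye(m)             # cost weights Q^i, R^i
M, N = np.eye(n), np.eye(m)              # constraint weights M^i, N^i
k = 50.0                                 # constraint budget

# Joint second-moment matrix of (x, u); PSD=True encodes the block PSD constraint of the SDP.
S = cp.Variable((n + m, n + m), PSD=True)
Sxx, Sxu, Suu = S[:n, :n], S[:n, n:], S[n:, n:]

objective = cp.Minimize(cp.trace(Qc @ Sxx) + cp.trace(R @ Suu))
constraints = [
    cp.trace(M @ Sxx) + cp.trace(N @ Suu) <= k,
    # Stationarity of the second moment, cf. (108), with unit-covariance noise:
    Sxx == A @ Sxx @ A.T + A @ Sxu @ B.T + B @ Sxu.T @ A.T + B @ Suu @ B.T + np.eye(n),
]
cp.Problem(objective, constraints).solve()

# Gaussian randomized policy: u | x ~ N(K x, Sigma_u_given_x)
K = Sxu.value.T @ np.linalg.inv(Sxx.value)
Sigma_u_given_x = Suu.value - Sxu.value.T @ np.linalg.inv(Sxx.value) @ Sxu.value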
Proof of Lemma 8.1. If $\Sigma^i_{u|x} = 0$, then the statement of the lemma follows from Theorem 8.2. Suppose $\Sigma^i_{u|x} \neq 0$ and let $K^i_* := \Sigma^{i,*}_{ux} (\Sigma^{i,*}_{xx})^{-1}$. We can write the expected instantaneous cost and constraint function in (8.12), (8.13) under the strategy $\phi_* = (\phi^1_*, \phi^2_*)$ as
$$\mathbb{E}^{\phi_*}_{\hat{\mu}_*}[c(X_t, U_t)] = \sum_{i=1}^{2} \Big[ \mathrm{Tr}\big((Q^i + (K^i_*)' R^i K^i_*)\, \Sigma^{i,t}_{xx}\big) + \mathrm{Tr}(R^i \Sigma^i_{u|x}) \Big],$$
$$\mathbb{E}^{\phi_*}_{\hat{\mu}_*}[d(X_t, U_t)] = \sum_{i=1}^{2} \Big[ \mathrm{Tr}\big((M^i + (K^i_*)' N^i K^i_*)\, \Sigma^{i,t}_{xx}\big) + \mathrm{Tr}(N^i \Sigma^i_{u|x}) \Big],$$
where $\Sigma^{i,t}_{xx} := \mathbb{E}^{\phi_*}_{\hat{\mu}_*}[X^i_t (X^i_t)']$. We can easily obtain the following recursive equation for $\Sigma^{i,t}_{xx}$:
$$\Sigma^{i,t+1}_{xx} = (A^i + B^i K^i_*)\, \Sigma^{i,t}_{xx}\, (A^i + B^i K^i_*)' + B^i \Sigma^i_{u|x} (B^i)' + I. \tag{109}$$
Similarly, we can write the expected instantaneous cost and constraint function in (8.12), (8.13) under the strategy $g_* = (g^1_*, g^2_*)$ as follows:
$$\mathbb{E}^{g_*}_{\hat{\mu}_*}[c(X_t, U_t)] = \sum_{i=1}^{2} \mathrm{Tr}\big((Q^i + (K^i_*)' R^i K^i_*)\, \tilde{\Sigma}^{i,t}_{xx}\big),$$
$$\mathbb{E}^{g_*}_{\hat{\mu}_*}[d(X_t, U_t)] = \sum_{i=1}^{2} \mathrm{Tr}\big((M^i + (K^i_*)' N^i K^i_*)\, \tilde{\Sigma}^{i,t}_{xx}\big),$$
where $\tilde{\Sigma}^{i,t}_{xx} := \mathbb{E}^{g_*}_{\hat{\mu}_*}[X^i_t (X^i_t)']$. We can write the following recursive equation for $\tilde{\Sigma}^{i,t}_{xx}$:
$$\tilde{\Sigma}^{i,t+1}_{xx} = (A^i + B^i K^i_*)\, \tilde{\Sigma}^{i,t}_{xx}\, (A^i + B^i K^i_*)' + I. \tag{110}$$
Using (109) and (110), we can write
$$\Sigma^{i,t+1}_{xx} - \tilde{\Sigma}^{i,t+1}_{xx} = (A^i + B^i K^i_*)\big(\Sigma^{i,t}_{xx} - \tilde{\Sigma}^{i,t}_{xx}\big)(A^i + B^i K^i_*)' + B^i \Sigma^i_{u|x} (B^i)'.$$
Using the fact that $\Sigma^{i,0}_{xx} = \tilde{\Sigma}^{i,0}_{xx}$ and an induction argument over $t$, we can show that
$$\Sigma^{i,t}_{xx} \succeq \tilde{\Sigma}^{i,t}_{xx}, \quad \forall t.$$
Thus, since the weighting matrices in the trace terms above are positive semidefinite, $\mathbb{E}^{\phi_*}_{\hat{\mu}_*}[c(X_t, U_t)] \geq \mathbb{E}^{g_*}_{\hat{\mu}_*}[c(X_t, U_t)]$ and $\mathbb{E}^{\phi_*}_{\hat{\mu}_*}[d(X_t, U_t)] \geq \mathbb{E}^{g_*}_{\hat{\mu}_*}[d(X_t, U_t)]$ for all times $t$. Therefore, the average cost and constraint function achieved under the pair $(g_*, \hat{\mu}_*)$ are not more than the average cost and constraint function achieved under the pair $(\phi_*, \hat{\mu}_*)$. Moreover, using Theorem 8.2, we know that $\phi^i_*$ is an optimal control strategy for agent $i$. Hence, $g^i_*$ is also an optimal control strategy for agent $i$.
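The comparison argument above is easy to visualize numerically. The following sketch iterates the recursions (109) and (110) from a common initial second moment and checks that $\Sigma^{i,t}_{xx} \succeq \tilde{\Sigma}^{i,t}_{xx}$ and that the instantaneous cost under $\phi_*$ is never smaller than under $g_*$. The matrices A, B, K, the weights and Sigma_u_given_x are hypothetical placeholders rather than quantities computed from the thesis.

# Numerical illustration of the comparison step in the proof of Lemma 8.1.
import numpy as np

A = np.array([[0.9, 0.1], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])
K = np.array([[-0.2, -0.3]])          # stands in for K_* = Sigma_ux (Sigma_xx)^{-1}
Q, R = np.eye(2), np.eye(1)
Sigma_u_given_x = np.array([[0.5]])   # nonzero conditional covariance of the randomized policy
Acl = A + B @ K

Sigma = np.zeros((2, 2))              # Sigma^{i,t}_xx under the randomized strategy phi_*
Sigma_tilde = np.zeros((2, 2))        # tilde-Sigma^{i,t}_xx under the deterministic strategy g_*
# (both recursions start from the same, here zero, initial second moment)
for t in range(200):
    cost_phi = np.trace((Q + K.T @ R @ K) @ Sigma) + np.trace(R @ Sigma_u_given_x)
    cost_g = np.trace((Q + K.T @ R @ K) @ Sigma_tilde)
    assert cost_phi >= cost_g - 1e-9                                   # phi_* never beats g_*
    assert np.all(np.linalg.eigvalsh(Sigma - Sigma_tilde) >= -1e-9)    # Sigma >= Sigma_tilde (PSD order)
    Sigma = Acl @ Sigma @ Acl.T + B @ Sigma_u_given_x @ B.T + np.eye(2)   # recursion (109)
    Sigma_tilde = Acl @ Sigma_tilde @ Acl.T + np.eye(2)                   # recursion (110)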
Abstract
Networked systems are ubiquitous in today's world. Such systems consist of agents who have to make a series of decisions in order to achieve a common goal. There are two key challenges in the design of optimal decision strategies for the agents:

i) Uncertainty: Agents have to make decisions in the presence of uncertainty. The uncertainty could manifest itself in three different forms: stochastic uncertainty, model uncertainty and adversarial uncertainty.

ii) Decentralization of decision making: The agents may have different information about the network, the environment and each other, and need to make decentralized decisions using only their local information.

In this thesis, we consider instances of sequential decision-making problems under different types of uncertainty and information structures in the network. In the context of model uncertainty, we consider the problem of controlling an unknown Markov Decision Process (MDP) and an unknown Linear Quadratic (LQ) system in a single-agent setting. We pose this as an online learning problem and propose a Thompson sampling (TS) based algorithm for regret minimization. We show that the regret achieved by our proposed algorithm is order optimal up to logarithmic factors. In the multi-agent setting, we consider a mean-field LQ problem with unknown dynamics. We propose a TS based algorithm and derive theoretical guarantees on the regret of our scheme. Finally, we also study a TS algorithm for a multi-agent MDP with two classes of information structure and dynamics.

In the context of stochastic uncertainty, we study a networked estimation problem with multiple sensors and non-collocated estimators. We study the joint design of the scheduling strategy for the scheduler and estimation strategies for the estimators. This leads to a sequential team problem with non-classical information structure. We characterize the jointly optimal scheduling and estimation strategies under two models of sensor state dynamics: i) IID dynamics, ii) Markov dynamics.

We also study a weakly coupled constrained MDP in Borel spaces. This is a decentralized control problem with constraints. We derive the optimal decentralized strategies using an occupation measure based linear program. We further consider the special case of multi-agent LQ systems and show that the optimal control strategies can be obtained by solving a semi-definite program (SDP).

In the context of adversarial uncertainty, we first look at a sequential remote estimation problem of finding a scheduling strategy for the sensor and an estimation strategy for the estimator to jointly minimize the worst-case maximum instantaneous estimation error over a finite time horizon. We obtain a complete characterization of optimal strategies for this decentralized minimax problem. Finally, we consider a broader class of minimax sequential team problems with the partial history sharing information structure, for which we characterize the optimal decision strategies for the agents using a dynamic programming decomposition.