Sequential Decision Making and Learning in Multi-Agent Networked Systems

by

Sagar Sudhakara

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL ENGINEERING)

May 2024

Copyright 2024 Sagar Sudhakara

Acknowledgements

First and foremost, I wish to extend my heartfelt gratitude to my advisor, Prof. Ashutosh Nayyar. His guidance has been invaluable, inspiring me to continually pursue excellence in my research endeavors. I am continually impressed by his keen insights, visionary outlook, and clarity of thinking. His meticulous feedback on my work has been instrumental in my growth, and I am deeply appreciative of his unwavering support throughout my graduate studies. Our meetings and discussions, both academic and personal, will be cherished memories.

I also want to express my gratitude to Prof. Rahul Jain for his mentorship and for introducing me to a captivating research topic that proved to be exceptionally rewarding. I am thankful to my qualifying exam and dissertation committee members, Prof. Ketan Savla, Prof. Pierluigi Nuzzo, and Prof. Michael J. Neely, for their invaluable input and insightful questions, which greatly enhanced the quality of my dissertation.

This work has been enriched by collaboration with outstanding researchers. I am especially grateful to Prof. Aditya Mahajan, Yi Ouyang, Dhruva Karthik, and Mukul Gagrani for their mentorship and generous sharing of knowledge as we tackled challenges in sequential decision making and learning. I extend my appreciation to all my labmates and colleagues at USC, including Pranav Kadam, Vishnu Ratnam, Rajat Hebbar and Anirudh Kulkarni, for stimulating discussions and camaraderie. My time in Los Angeles and the Bay Area was made all the more enjoyable by the wonderful friendships I formed.
Special thanks to Rajesh, affectionately known as Rajesh Bhai, for hosting me during my Summer Internship at Samsung Research America in Dallas.

To my family, I owe an immense debt of gratitude for their unwavering love and support throughout my journey. I am deeply grateful to my parents for their sacrifices, which have paved the way for my achievements. Lastly, I am thankful to my brother, Sweekar Sudhakara, whose example continually motivates me to strive for excellence.

Table of Contents

Acknowledgements
List of Figures
Abstract

Chapter 1: Introduction
  1.1 Some key concepts
    1.1.1 Common Information Approach
    1.1.2 Thompson Sampling
  1.2 Problems Investigated in this Report
    1.2.1 Sequential Decision-making for Strategic Interactions
    1.2.2 Sequential Decision-making for Symmetric Strategic Interactions
    1.2.3 Online Learning in the presence of model uncertainty

Chapter 2: Optimal communication and control strategies in a cooperative multi-agent MDP problem
  2.1 Introduction
  2.2 Problem Formulation
  2.3 Preliminary Results and Simplified Strategies
  2.4 Centralized Reformulation Using Common Information
  2.5 Extensions
    2.5.1 Packet-drop channel with state
    2.5.2 Agents with communication constraints
  2.6 An Illustrative Example
  2.7 Conclusion

Chapter 3: Optimal Communication and Control Strategies for a Multi-Agent System in the Presence of an Adversary
  3.0.1 Related Works
  3.0.2 Notation
  3.1 Problem Formulation
    3.1.1 Examples of Information Structures Satisfying Assumption 3.1
      3.1.1.1 Maximum information
      3.1.1.2 Encrypted Communication with Global State Information
      3.1.1.3 Imperfect Encryption with Global State Information
  3.2 Preliminary Results and Simplified Game $G_s$
  3.3 Dynamic Program Characterization of a Min-max Strategy
    3.3.1 Virtual Game $G_e$
    3.3.2 Common Information Belief and the Dynamic Program
      3.3.2.1 Common Information Belief
      3.3.2.2 Dynamic Program
    3.3.3 Communication without Encryption
    3.3.4 Communication with Encryption
  3.4 Conclusions

Chapter 4: Optimal Symmetric Strategies in Multi-Agent Systems with Decentralized Information
  4.1 Introduction
  4.2 Problem Formulation
    4.2.1 Information structure and strategies
    4.2.2 Some specific information structures
    4.2.3 Why are randomized strategies needed?
  4.3 Common information approach
  4.4 Comparison of Problems 1b and 1c
    4.4.1 Special cases
      4.4.1.1 Specialized cost
      4.4.1.2 Specialized dynamics
  4.5 Conclusion

Chapter 5: Thompson sampling for linear quadratic mean-field teams
  5.1 Introduction
  5.2 Background on mean-field teams
    5.2.1 Mean-field teams model
      States, actions, and their mean-fields
      System dynamics and per-step cost
      Admissible policies and performance criterion
    5.2.2 Planning solution for mean-field teams
      Interpretation of the planning solution
  5.3 Learning for mean-field teams
    Prior and posterior beliefs
    The Thompson sampling algorithm
    Regret bounds
  5.4 Regret analysis
  5.5 Numerical Experiments
    Empirical evaluation of regret
    Comparison with naive TSDE algorithm
  5.6 Conclusion
    Distributed implementation of the algorithm

Chapter 6: Scalable regret for learning to control network-coupled subsystems with unknown dynamics
  6.1 Introduction
    Related work
    Organization
    Notation
  6.2 Model of network-coupled subsystems
    6.2.1 System model
      6.2.1.1 Graph structure
      6.2.1.2 State and dynamics
      6.2.1.3 Per-step cost
    6.2.2 Assumptions on the model
    6.2.3 Admissible policies and performance criterion
  6.3 Background on spectral decomposition of the system
    6.3.1 Spectral decomposition of the dynamics and per-step cost
    6.3.2 Planning solution for network-coupled subsystems
  6.4 Learning for network-coupled subsystems
    6.4.1 Simplifying assumptions
    6.4.2 Prior and posterior beliefs
    6.4.3 The Thompson sampling algorithm
    6.4.4 Regret bounds
  6.5 Regret analysis
    6.5.1 Bound on $R^{(\ell),i}(T)$
    6.5.2 Bound on $\breve{R}^i(T)$
    6.5.3 Proof of Theorem 6.2
  6.6 Some examples
    6.6.1 Mean-field system
    6.6.2 A general low-rank network
  6.7 Conclusion

Chapter 7: Future Directions
  7.1 Model-free reinforcement learning approach for multi-agent systems
  7.2 Decentralized Learning

Bibliography

Appendices
  A Appendix: Optimal communication and control strategies in a cooperative multi-agent MDP problem
    A.1 Proof of Lemma 2.1
    A.2 Proof of Proposition 2.1
    A.3 Proof of Lemma 2.3
  B Appendix: Optimal Communication and Control Strategies for a Multi-Agent System in the Presence of an Adversary
    B.1 Proof of Lemma 3.1
    B.2 Proof of Proposition 3.1
    B.3 Proof of Proposition 3.2
  C Appendix: Optimal Symmetric Strategies in Multi-Agent Systems with Decentralized Information
    C.1 Proof of Lemma 4.3
      C.1.1 Problem P1a
      C.1.2 Problem P1b
      C.1.3 Problem P1c
    C.2 Proof of Theorem 4.2
    C.3 Proof of Theorem 4.3
  D Appendix: Thompson sampling for linear quadratic mean-field teams
    D.1 Preliminary Results
    D.2 Proof of Lemma 5.3
  E Appendix: Scalable regret for learning to control network-coupled subsystems with unknown dynamics
    E.1 Preliminary Results
    E.2 Proof of Lemma 6.3

List of Figures

2.1 Performance achieved by three strategies: (i) jointly optimal communication and control strategies, (ii) always communicate, and (iii) never communicate. In this example, $p^1_a = p^2_a = 0.3$, $p^1_{d1} = p^2_{d1} = 0.6$, $p^1_{d2} = p^2_{d2} = 0.4$ and $\vartheta = 0.95$.
5.1 $R(T)$ vs $T$ for TSDE-MF
5.2 $R(T)/\sqrt{T}$ vs $T$ for TSDE-MF
5.3 TSDE-MF vs TSDE
6.1 $R(T)/\sqrt{T}$ vs $T$
6.2 Regret for mean-field system.
6.3 Graph $G^\circ$ with $n = 4$ nodes and its adjacency matrix
6.4 $R(T)/\sqrt{T}$ vs $T$
6.5 $R(T)/\sqrt{T}$ vs $n$.

Abstract

Networked systems are ubiquitous in today's world. Such systems consist of agents who have to make a series of decisions in order to achieve a common goal. In this report, the focus is on sequential decision making methodologies for three broad areas in the multi-agent setting: (i) communication and strategic interactions, (ii) symmetric strategic interactions, and (iii) online learning in the presence of model uncertainty (i.e., controlling a Linear Quadratic system with unknown dynamics).

Communication and Strategic Interactions: The problem of controlling cooperative multi-agent systems with varying information sharing models has garnered considerable attention. We propose a dynamic information sharing setup, where agents can decide at each step whether to share information, balancing communication costs and control objectives. Our approach demonstrates that agents can overlook certain private information without sacrificing system performance. We provide a common information approach based solution for the strategy optimization problem. We then extend this to include adversarial scenarios, where agents must coordinate effectively while safeguarding against eavesdropping. We model this as a stochastic zero-sum game, deriving a min-max strategy for the team considering the adversary's capabilities. Additionally, we provide structural insights to facilitate the computation of the min-max strategy.

Symmetric Strategic Interactions: We consider a setup where agents focus on symmetric strategies, i.e., the case where all agents use the same control strategy. In this setup, we use randomized actions, which help in minimizing the total expected cost associated with symmetric strategies.
We first show that agents can ignore a large part of their private information without compromising the system performance, and we then provide a common information approach based solution for the symmetric strategy optimization problem.

Online Learning: We consider the problem of controlling an unknown linear quadratic Gaussian (LQG) system consisting of multiple subsystems connected over a network. Our goal is to minimize and quantify the regret (i.e., loss in performance) of our learning and control strategy with respect to an oracle who knows the system model. Viewing the interconnected subsystems as one global system and directly applying existing LQG learning algorithms results in regret that increases super-linearly with the number of subsystems. Instead, we propose a new Thompson sampling based learning algorithm which exploits the structure of the underlying network and results in regret that scales linearly with the number of subsystems.

Chapter 1: Introduction

Networked systems are prevalent in today's world, encompassing various modern engineering systems such as Network Control Systems (NCS), Cyber-Physical Systems (CPS), Wireless Sensor Networks, and Power Systems [1-3]. Decision making plays a crucial role in these networked systems, involving agents who must make a series of decisions to achieve a common goal. For instance, in a smart grid, multiple power sources need to determine appropriate energy generation levels to collectively meet the total energy demand while minimizing production costs. Similarly, in a sensor network, sensors must decide when to take measurements and communicate them to the base station. In the case of a self-driving car, steering controls must be chosen to navigate its environment successfully. In all these scenarios, agents in the network employ decision strategies that map their available information to choices of actions.
Developing effective decision strategies is vital for efficient resource utilization, optimal performance, and avoidance of system failures. This report focuses on identifying optimal or near-optimal decision strategies for specific decision-making problems encountered in networked systems.

Networked systems typically operate over a duration of time. The agents use data obtained from their surroundings to inform their decisions and actions in pursuit of a concrete end goal. This procedure is referred to as sequential decision-making. Sequential decision-making usually involves the following challenges:

1. System dynamics: In a networked system, agents interact with an environment by taking actions. Actions affect the state evolution of the system. The decision-making strategy must take into account the long-term consequences of taking an action, and such problems are usually modeled as Markov Decision Processes (MDPs). Once the problem is expressed as an MDP, we can use dynamic programming to find the optimal policy. One major challenge we often face is that the state space is too large, and a straightforward application of dynamic programming becomes intractable. In such cases, approximate dynamic programming methodologies and other statistical approaches can be used to design a more computationally tractable decision-making strategy.

2. Partial observability: In most networked systems, the decision maker does not have complete information about the system state. This is commonly seen in decentralized decision-making, in which multiple decision makers act based on different information. Partial observability is inevitable in decentralized systems, since agents do not fully know the other agents' information. Solving the resulting Partially Observable MDP (POMDP) problem is notoriously hard.

3. Uncertainty: Agents have to make decisions in the presence of uncertainty.
The uncertainty could manifest itself in the form of model uncertainty, as in the problem of controlling a mean-field Linear Quadratic system with unknown dynamics.

All these challenges make sequential decision-making a very complex task. However, when the problem admits additional structure, it may be possible to design reasonably tractable solutions despite the challenges mentioned above. We will focus on three such problems in this report. In Chapters 2 and 3, we discuss communication and strategic interactions for multi-agent systems. Our focus is to jointly design decision/control strategies for the multiple agents in order to optimize a performance metric for the team, even in the presence of an adversary. In Chapter 4, our focus is on symmetric strategies, i.e., the case where all agents use the same control strategy. In Chapters 5 and 6, we investigate the problem of learning to control a mean-field LQ system and a system consisting of multiple subsystems connected over a network. We then discuss some directions for future research in these areas in Chapter 7.

1.1 Some key concepts

In this section, we introduce two key concepts which play a central role in different parts of this thesis.

1.1.1 Common Information Approach

In Chapters 2-4, we study sequential decision making problems with multiple agents. Multi-agent decision making problems where the agents have a common objective are also referred to as team problems. An important aspect of such problems is the information structure, which refers to the information available to each decision maker. If each decision maker has complete knowledge of the information available to all previous agents, the information structure is referred to as "classical"; otherwise, it is called "non-classical". Sequential team problems with non-classical information structures are difficult to solve, as they are usually non-convex. The common information approach [4] offers a systematic way of solving such team problems.
This approach converts the team problem into an equivalent single-agent partially observed Markov decision process (POMDP). The information available to the agent in the equivalent POMDP is the common information among the agents in the original team problem. The equivalent POMDP is then solved using tools from Markov decision theory [5]. Finally, the solution is transformed to obtain optimal decentralized strategies for the original team problem. In Chapters 2-4, we employ the common information approach to determine the optimal decision strategies.

1.1.2 Thompson Sampling

In Chapters 5-6, we consider sequential decision making problems with model uncertainty. For real-world systems, it is hardly ever the case that the model and its parameters are known precisely to the agents. Typically, only a set in which the model parameters lie is known. Furthermore, for many problems, we do not have the luxury of first performing system identification and then using the identified model to design a control strategy. The agents want to maintain adequate control in the presence of uncertainty about the true model of the system. We refer to this problem as the problem of "learning" to control dynamical systems; it is well known as the adaptive control problem [5]. Classical adaptive control [5-9] mostly provides asymptotic guarantees on the performance of the agents. However, our objective is to learn the model parameters and the corresponding optimal controller simultaneously at the fastest possible non-asymptotic rate. We take an online learning [10] approach in Chapters 5-6 for learning the optimal control strategy. Thompson sampling [11] has emerged as a popular online learning approach due to its good performance in online learning problems [12, 13]. Thompson sampling (TS) is a Bayesian approach in which the agents start with a prior distribution on the unknown parameters of the model and maintain a posterior based on their information at each time.
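As a toy illustration of this prior-to-posterior loop, consider a two-armed Bernoulli bandit (not the mean-field LQ algorithms developed in Chapters 5-6; the arm probabilities below are invented for illustration), where a Beta prior on each arm's unknown success probability admits a closed-form posterior update:

```python
import random

def thompson_bandit(true_probs, steps, seed=0):
    """Toy Thompson sampling on a two-armed Bernoulli bandit."""
    rng = random.Random(seed)
    alpha = [1, 1]  # Beta(1, 1) prior on each arm's unknown success probability
    beta = [1, 1]
    pulls = [0, 0]
    for _ in range(steps):
        # Sample one "model" from the posterior and act optimally for that sample.
        sampled = [rng.betavariate(alpha[a], beta[a]) for a in (0, 1)]
        arm = max((0, 1), key=lambda a: sampled[a])
        reward = 1 if rng.random() < true_probs[arm] else 0
        # Conjugate posterior update: Beta prior + Bernoulli observation.
        alpha[arm] += reward
        beta[arm] += 1 - reward
        pulls[arm] += 1
    return pulls

pulls = thompson_bandit([0.3, 0.7], steps=2000)
print("arm pull counts:", pulls)  # the better arm (index 1) dominates
```

Here the posterior is resampled at every step; the algorithms of Chapters 5-6 instead resample only at carefully chosen times.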
Then, at certain carefully chosen times, the agent generates a random sample from the posterior and applies the optimal strategy corresponding to the generated sample. In Chapters 5-6, we design TS based algorithms with dynamic sampling schedules that are provably order optimal in terms of the rate of learning the model parameters.

1.2 Problems Investigated in this Report

Next, we introduce the problems studied in this report and briefly summarize the main ideas.

1.2.1 Sequential Decision-making for Strategic Interactions

In this part, we discuss sequential decision-making for strategic interactions. The problem of controlling cooperative multi-agent systems under different models of information sharing among agents has received significant attention in the recent literature. In Chapter 2, we consider a setup where, rather than committing to a fixed and non-adaptive information sharing protocol (e.g., periodic sharing or no sharing), agents can dynamically decide at each time step whether to share information with each other and incur the resulting communication cost. This setup requires a joint design of the agents' communication and control strategies in order to optimize the trade-off between communication costs and the control objective. We first show that agents can ignore a large part of their private information without compromising the system performance. We then provide a common information approach based solution for the strategy optimization problem. This approach relies on constructing a fictitious POMDP whose solution (obtained via a dynamic program) characterizes the optimal strategies for the agents. We extend our solution to incorporate time-varying packet-drop channels and constraints on when and how frequently agents can communicate.

In Chapter 3, we consider a multi-agent system in which a decentralized team of agents controls a stochastic system in the presence of an adversary.
Instead of committing to a fixed information sharing protocol, the agents can strategically decide at each time whether to share their private information with each other. The agents incur a cost whenever they communicate, and the adversary may eavesdrop on their communication. Thus, the agents in the team must coordinate effectively with each other while being robust to the adversary's malicious actions. We model this interaction between the team and the adversary as a stochastic zero-sum game in which the team aims to minimize a cost while the adversary aims to maximize it. Under some assumptions on the adversary's capabilities, we characterize a min-max control and communication strategy for the team. We supplement this characterization with several structural results that can make the computation of the min-max strategy more tractable.

1.2.2 Sequential Decision-making for Symmetric Strategic Interactions

In this part, we discuss sequential decision-making for symmetric strategic interactions. We focus on multi-agent systems with control sharing, which arise naturally in various control and communication applications. In Chapter 4, we consider a setup where agents focus on symmetric strategies, i.e., the case where all agents use the same control strategy. In this setup, we use randomized actions, which help in minimizing the total expected cost associated with symmetric strategies. We first show that agents can ignore a large part of their private information without compromising the system performance. We then provide a common information approach based solution for the symmetric strategy optimization problem. This approach relies on constructing a fictitious POMDP whose solution (obtained via a dynamic program) characterizes the optimal symmetric strategies for the agents.
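Both reductions above end in a fictitious POMDP solved via a dynamic program. A minimal sketch of that generic machinery, on an invented two-state, two-action, two-observation POMDP (not the specific models of Chapters 2-4), propagates a belief with Bayes' rule and backs up optimal values over a finite horizon:

```python
# Invented two-state, two-action, two-observation POMDP (illustrative numbers).
P = {0: [[0.9, 0.1], [0.2, 0.8]],   # P[a][s][s2]: transition prob under action a
     1: [[0.5, 0.5], [0.5, 0.5]]}
O = [[0.8, 0.2], [0.3, 0.7]]        # O[s2][o]: observation likelihood at next state
R = {0: [1.0, 0.0], 1: [0.0, 1.5]}  # R[a][s]: immediate reward

def belief_update(b, a, o):
    """Bayes' rule: belief b, action a, observation o -> (posterior, P(o | b, a))."""
    pred = [sum(b[s] * P[a][s][s2] for s in range(2)) for s2 in range(2)]
    unnorm = [O[s2][o] * pred[s2] for s2 in range(2)]
    z = sum(unnorm)
    if z == 0.0:
        return b, 0.0
    return [u / z for u in unnorm], z

def value(b, horizon):
    """Optimal expected total reward over `horizon` steps from belief b."""
    if horizon == 0:
        return 0.0
    best = float("-inf")
    for a in (0, 1):
        v = sum(b[s] * R[a][s] for s in range(2))   # expected immediate reward
        for o in (0, 1):
            b2, pz = belief_update(b, a, o)
            v += pz * value(b2, horizon - 1)        # expected continuation value
        best = max(best, v)
    return best

print("optimal 3-step value from uniform belief:", round(value([0.5, 0.5], 3), 4))
```

In the common information approach, the belief is the coordinator's belief on the system state and the agents' private information given the common information, and the maximizing argument at each belief yields the decentralized strategy prescriptions.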
1.2.3 Online Learning in the presence of model uncertainty

In Chapter 5, we consider optimal control of an unknown multi-agent linear quadratic (LQ) system where the dynamics and the cost are coupled across the agents through the mean-field (i.e., empirical mean) of the states and controls. Mean-field coupled control systems have emerged as a popular modeling framework in multiple research communities, including Control Systems, Economics, Finance, and Statistical Physics [14-18]. Directly using single-agent LQ learning algorithms in such models results in regret which increases polynomially with the number of agents. We propose a new Thompson sampling based learning algorithm which exploits the structure of the system model, and we show that the expected Bayesian regret of our proposed algorithm is order optimal with respect to the time horizon up to logarithmic factors and does not grow with the total number of agents. We present detailed numerical experiments to illustrate the salient features of the proposed algorithm. We also generalize the problem to a system with a large population of heterogeneous agents. The agents belong to distinct sub-populations, and we show that the regret of our proposed algorithm does not grow with the number of agents.

In Chapter 6, we consider the problem of controlling an unknown linear quadratic Gaussian (LQG) system consisting of multiple subsystems connected over a network. Our goal is to minimize and quantify the regret (i.e., loss in performance) of our learning and control strategy with respect to an oracle who knows the system model. Viewing the interconnected subsystems as one global system and directly applying existing LQG learning algorithms results in regret that increases super-linearly with the number of subsystems. Instead, we propose a new Thompson sampling based learning algorithm which exploits the structure of the underlying network.
We show that the expected regret of the proposed algorithm is bounded by $\tilde{O}(n\sqrt{T})$, where $n$ is the number of subsystems and $T$ is the time horizon; the $\tilde{O}(\cdot)$ notation hides logarithmic factors in $T$. Thus, the regret scales linearly with the number of subsystems. We present numerical experiments to illustrate the salient features of the proposed algorithm.

Chapter 2: Optimal communication and control strategies in a cooperative multi-agent MDP problem

2.1 Introduction

The problem of sequential decision-making by a team of collaborative agents has received significant attention in the recent literature. The goal in such problems is to jointly design decision/control strategies for the multiple agents in order to optimize a performance metric for the team. The nature of this joint strategy optimization problem, as well as the best achievable performance, depends crucially on the information structure of the problem. Intuitively, the information structure of a multi-agent problem specifies what information is available to each agent at each time. Depending on the underlying communication environment, a wide range of information structures can arise. If communication is costless and unrestricted, all agents can share all information with each other. If communication is too costly or physically impossible, agents may not be able to share any information at all. It could also be the case that agents can communicate only periodically, or that the ability to communicate varies among the agents, leading to one-directional communication between certain pairs of agents. Each of these communication models corresponds to a different information structure which, in turn, specifies the class of feasible decision/control strategies for the agents. In this chapter, we consider a setup where, rather than committing to a fixed and non-adaptive information sharing protocol (e.g.
periodic sharing or no sharing), agents can dynamically decide at each time step whether to share information with each other and incur the resulting communication cost. Thus, at each time step, agents have to make two kinds of decisions: communication decisions that govern the information sharing, and control decisions that govern the evolution of the agents' states. The two kinds of strategies, communication strategies and control strategies, need to be jointly designed in order to optimize the trade-off between communication costs and the control objective.

Related Work

There is a significant body of prior work on decentralized control and decision-making in multi-agent systems. We focus on works where the dynamic system can be viewed as a Markov chain jointly controlled by multiple agents/controllers. We can organize this literature based on the underlying information structure (or the information sharing protocol). In Decentralized Markov decision processes (Dec-MDPs) and Decentralized partially observable Markov decision processes (Dec-POMDPs), each agent receives a partial or noisy observation of the current system state [19]. These agents cannot communicate or share their observations with each other and can only use their private action-observation history to select their control actions. Several methods for solving such generic Dec-POMDPs exist in the literature [20-25]. However, these generic methods either involve a prohibitively large amount of computation or cannot guarantee optimality. For certain Dec-MDPs and Dec-POMDPs with additional structure, such as transition independence in Dec-MDPs [26, 27] or one-sided information sharing [28], one can derive additional structural properties of the optimal strategy and use these properties to make the computation more tractable.
In the decentralized stochastic control literature, a variety of information structures (obtained from different information sharing protocols) have been considered [4, 29, 30]. For example, [29] considers the case where agents share their information with each other with a fixed delay; [4] provides a unified treatment of a range of information sharing protocols, including periodic sharing and sharing of control actions only; [30, 31] consider a setup where only the agents' actions are shared with others. In emergent communication, agents have access to a cheap-talk channel that can be used for communication, and [32–34] propose methods for jointly learning the control and communication strategies in such settings. The key communication issue in those works is to design the most effective way of encoding the available information into the communication alphabet. In contrast, the communication issue in our setup is whether the cost of sharing states is worth the potential control benefit.

In our model, agents at each time make an explicit choice about sharing their information with each other, and we seek to jointly design this information sharing strategy and the agents' control strategies. This problem, like many of the problems in the prior literature, can be reduced to a Dec-POMDP by a suitable redefinition of states, observations and actions. However, as demonstrated in [28], a generic Dec-POMDP based approach for problems with (limited) inter-agent communication involves a very large amount of computation since it ignores the underlying communication structure. Instead, we derive structural properties of the strategies that significantly simplify the strategy design, and we then provide a dynamic program based solution using the common information approach. To the best of our knowledge, our information sharing mechanism has not been analyzed before.
Contributions
(i) We first show that agents can ignore a large part of their private information without compromising system performance. This is done using an agent-by-agent argument in which we fix the strategies of one agent arbitrarily and find a sufficient statistic for the other agent. This sufficient statistic turns out to be a subset of the agent's private information. The reduction in private information narrows the search for optimal strategies to a class of simpler strategies.
(ii) We then adopt the common information based solution approach for finding the optimal strategies. This approach relies on constructing an equivalent POMDP from the perspective of a fictitious coordinator that knows the common information among the agents. The solution of this POMDP (obtained via a dynamic program) characterizes the optimal strategies for the agents.
(iii) Finally, we extend our setup to incorporate time-varying packet-drop channels and constraints on when and how frequently agents can communicate with each other. We show that our solution approach can be easily modified to incorporate these features through a natural augmentation of the state in the coordinator's POMDP.

Notation
Random variables are denoted by upper case letters ($X$, $Y$, etc.), their realizations by lower case letters ($x$, $y$, etc.), and their spaces of realizations by script letters ($\mathcal{X}$, $\mathcal{Y}$, etc.). Subscripts denote time and superscripts denote the subsystem; e.g., $X^i_t$ denotes the state of subsystem $i$ at time $t$. The shorthand $X^i_{1:t}$ denotes the collection $(X^i_1, X^i_2, \ldots, X^i_t)$. $X_t$ denotes $(X^1_t, X^2_t)$ and $M_t$ denotes $(M^1_t, M^2_t)$. $\Delta(\mathcal{X})$ denotes the probability simplex over the space $\mathcal{X}$. $\mathbb{P}(A)$ denotes the probability of an event $A$, and $\mathbb{E}[X]$ denotes the expectation of a random variable $X$. $\mathbb{1}_A$ denotes the indicator function of event $A$.
For simplicity of notation, we use $\mathbb{P}(x_{1:t}, u_{1:t-1})$ to denote $\mathbb{P}(X_{1:t} = x_{1:t}, U_{1:t-1} = u_{1:t-1})$, and similarly for conditional probabilities. We use $-i$ to denote the agent(s) other than agent $i$.

2.2 Problem Formulation
Consider a discrete-time system with two agents. Let $X^i_t \in \mathcal{X}^i$ denote the local state of agent $i$, $i = 1, 2$, and let $X_t := (X^1_t, X^2_t)$ denote the joint local state. The initial local states $(X^1_1, X^2_1)$ are independent random variables, with $X^i_1$ distributed according to $P_{X^i_1}$. Each agent perfectly observes its own local state. Let $U^i_t \in \mathcal{U}^i$ denote the control action of agent $i$ at time $t$ and $U_t := (U^1_t, U^2_t)$ the joint control action. The local state of agent $i$, $i = 1, 2$, evolves according to
$$X^i_{t+1} = k^i_t(X^i_t, U^i_t, W^i_t), \qquad (2.1)$$
where $W^i_t \in \mathcal{W}^i$ is the disturbance in the dynamics with probability distribution $P_{W^i}$. The initial states and the disturbances $\{W^i_t\}_{t \ge 1}$, $i = 1, 2$, are independent random variables. Note that the next local state of agent $i$ depends only on the current local state and control action of agent $i$; the dynamics of the two agents are independent of each other.

In addition to deciding the control actions at each time, the two agents need to decide whether or not to initiate communication at each time. We use the binary variable $M^i_t \in \{0, 1\}$ to denote the communication decision of agent $i$. Let $M^{or}_t := \max(M^1_t, M^2_t)$ and let $Z^{er}_t$ represent the information exchanged between the agents at time $t$. In our model, communication is initiated when either agent decides to communicate (i.e., $M^i_t = 1$), but the transmission may be lost with probability $p_e$. Under this communication model, $Z^{er}_t$ is given by
$$Z^{er}_t = \begin{cases} X^{1,2}_t, & \text{with probability } 1 - p_e, \text{ if } M^{or}_t = 1, \\ \phi, & \text{with probability } p_e, \text{ if } M^{or}_t = 1, \\ \phi, & \text{if } M^{or}_t = 0. \end{cases} \qquad (2.2)$$

Information structure and decision strategies: At the beginning of the $t$-th time step, the information available to agent $i$ is
$$I^i_t = \{X^i_{1:t}, U^i_{1:t-1}, Z^{er}_{1:t-1}, M^{1,2}_{1:t-1}\}. \qquad (2.3)$$
Agent $i$ uses this information to make its communication decision at time $t$: $M^i_t$ is chosen according to
$$M^i_t = f^i_t(I^i_t), \qquad (2.4)$$
where $f^i_t$ is referred to as the communication strategy of agent $i$ at time $t$. After the communication decisions are made and the resulting communication (if any) takes place, the information available to agent $i$ is
$$I^i_{t+} = \{I^i_t, Z^{er}_t, M^{1,2}_t\}. \qquad (2.5)$$
Agent $i$ then chooses its control action according to
$$U^i_t = g^i_t(I^i_{t+}), \qquad (2.6)$$
where $g^i_t$ is referred to as the control strategy of agent $i$ at time $t$. $f^i := (f^i_1, f^i_2, \ldots, f^i_T)$ and $g^i := (g^i_1, g^i_2, \ldots, g^i_T)$ are called the communication and control strategy of agent $i$, respectively.

Strategy optimization problem: At time $t$, the system incurs a cost $c_t(X^1_t, X^2_t, U^1_t, U^2_t)$ that depends on the local states and control actions of both agents. Whenever the agents decide to share their states, they incur a state-dependent communication cost $\rho(X_t)$, which captures the energy cost of transmission and the computation cost of encoding and decoding messages. The system runs for a time horizon $T$. The objective is to find communication and control strategies for the two agents that minimize the expected sum of control and communication costs over the horizon:
$$\mathbb{E}\left[\sum_{t=1}^{T} c_t(X_t, U_t) + \rho(X_t)\,\mathbb{1}_{\{M^{or}_t = 1\}}\right]. \qquad (2.7)$$

Remark 2.1. Although we formulate the problem for two agents, it extends readily to $n$ agents under the communication protocol in which all agents broadcast their states whenever any one agent initiates communication. All key results in the paper apply to the $n$-agent setup with minor adjustments to the proofs.
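As a concrete illustration of the channel model in (2.2), the following Python sketch simulates one communication step. The function name and the representation choices (using `None` for the empty symbol $\phi$) are illustrative, not part of the model.

```python
import random

def communication_step(x1, x2, m1, m2, p_e, rng=random):
    """Simulate one communication step following Eq. (2.2).

    If either agent requests communication (M^or_t = 1), the joint state
    (x1, x2) is delivered with probability 1 - p_e and dropped (phi, here
    represented by None) with probability p_e.  If neither agent requests
    communication, nothing is exchanged.
    """
    m_or = max(m1, m2)
    if m_or == 0:
        return None  # Z^er_t = phi: no communication attempt was made
    # A communication attempt was made; the packet survives w.p. 1 - p_e.
    return (x1, x2) if rng.random() >= p_e else None

# With p_e = 0, every attempted communication succeeds:
assert communication_step('a', 'b', 1, 0, p_e=0.0) == ('a', 'b')
# With no request, nothing is exchanged regardless of p_e:
assert communication_step('a', 'b', 0, 0, p_e=0.0) is None
# With p_e = 1, every attempt is dropped:
assert communication_step('a', 'b', 1, 1, p_e=1.0) is None
```

Note that the agents observe $M^{1,2}_t$ even when the packet is dropped, which is why a dropped attempt is informative in the belief updates derived later.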
2.3 Preliminary Results and Simplified Strategies
In this section, we show that the agents can ignore parts of their information without losing optimality. This removal of information narrows the search for optimal strategies to a class of simpler strategies and is a key step in our approach. To proceed, we first split the information available to the agents into two parts: common information (which is available to both agents) and private information (which is everything except the common information).

1. At the beginning of time step $t$, before the communication decisions are made, the common information is defined as
$$C_t := (Z^{er}_{1:t-1}, M^{1,2}_{1:t-1}). \qquad (2.8)$$
2. After the communication decisions are made and the resulting communication (if any) takes place, the common information is defined as
$$C_{t+} := (Z^{er}_{1:t}, M^{1,2}_{1:t}). \qquad (2.9)$$

The following lemma establishes a key conditional independence property that is critical for our analysis.

Lemma 2.1 (Conditional independence property). Consider any arbitrary choice of communication and control strategies for the two agents. Then, at any time $t$, the two agents' local states and control actions are conditionally independent given the common information $C_t$ (before communication) or $C_{t+}$ (after communication). That is, if $c_t, c_{t+}$ are the realizations of the common information before and after communication respectively, then for any realization $x_{1:t}, u_{1:t-1}$ of states and actions, we have
$$\mathbb{P}(x_{1:t}, u_{1:t-1} \mid c_t) = \prod_{i=1}^{2} \mathbb{P}(x^i_{1:t}, u^i_{1:t-1} \mid c_t), \qquad (2.10)$$
$$\mathbb{P}(x_{1:t}, u_{1:t} \mid c_{t+}) = \prod_{i=1}^{2} \mathbb{P}(x^i_{1:t}, u^i_{1:t} \mid c_{t+}). \qquad (2.11)$$
Further, $\mathbb{P}(x^i_{1:t}, u^i_{1:t-1} \mid c_t)$ and $\mathbb{P}(x^i_{1:t}, u^i_{1:t} \mid c_{t+})$ depend only on agent $i$'s strategy and not on the strategy of agent $-i$.

Proof. See Appendix A.1.

The following proposition shows that agent $i$ at time $t$ can ignore its past states and actions, i.e., $X^i_{1:t-1}$ and $U^i_{1:t-1}$, without losing optimality.
This allows the agents to use simpler strategies in which the communication and control decisions are functions only of the current local state and the common information.

Proposition 2.1. Agent $i$, $i = 1, 2$, can restrict itself to strategies of the form
$$M^i_t = \bar{f}^i_t(X^i_t, C_t), \qquad (2.12)$$
$$U^i_t = \bar{g}^i_t(X^i_t, C_{t+}) \qquad (2.13)$$
without loss of optimality. In other words, at time $t$, agent $i$ does not need the past local states and actions $X^i_{1:t-1}, U^i_{1:t-1}$ to make optimal decisions.

Proof. To prove this result, we fix agent $-i$'s strategy to an arbitrary choice and show that agent $i$'s decision problem can be modeled as an MDP in a suitable state space. The result then follows from the fact that Markovian strategies are optimal in an MDP. See Appendix A.2 for details.

2.4 Centralized Reformulation Using Common Information
In this section, we provide a centralized reformulation of the multi-agent strategy optimization problem using the common information approach of [4]. The main idea of the approach is to formulate an equivalent single-agent POMDP, solve the equivalent POMDP using a dynamic program, and then translate the results back to the original problem. Because of Proposition 2.1, we consider only strategies of the form given in (2.12) and (2.13).

Following the approach in [4], we construct an equivalent problem by adopting the point of view of a fictitious coordinator that observes only the common information among the agents (i.e., the coordinator observes $C_t$ before communication and $C_{t+}$ after $Z^{er}_t$ is realized) but not the current local states $X^i_t$, $i = 1, 2$. Before communication at time $t$, the coordinator chooses a pair of prescriptions $\Gamma_t := (\Gamma^1_t, \Gamma^2_t)$, where $\Gamma^i_t$ is a mapping from $\mathcal{X}^i$ to $\{0, 1\}$. A prescription is a directive to an agent about how it should use its local state information to make its communication decision.
Thus, agent $i$ generates its communication decision by evaluating the function $\Gamma^i_t$ at its current local state:
$$M^i_t = \Gamma^i_t(X^i_t). \qquad (2.14)$$
Similarly, after the communication decisions are made and $Z^{er}_t$ is realized, the coordinator chooses a pair of prescriptions $\Lambda_t := (\Lambda^1_t, \Lambda^2_t)$, where $\Lambda^i_t$ is a mapping from $\mathcal{X}^i$ to $\mathcal{U}^i$. Agent $i$ then generates its control action by evaluating $\Lambda^i_t$ at its current local state:
$$U^i_t = \Lambda^i_t(X^i_t). \qquad (2.15)$$
The coordinator chooses its prescriptions based on the common information:
$$\Gamma^1_t = d^1_t(C_t), \quad \Gamma^2_t = d^2_t(C_t), \quad \Lambda^1_t = d^1_{t+}(C_{t+}), \quad \Lambda^2_t = d^2_{t+}(C_{t+}), \qquad (2.16)$$
where $d^1_t, d^2_t, d^1_{t+}, d^2_{t+}$ are referred to as the coordinator's communication and control strategies for the two agents at time $t$. The collection of functions $(d^1_1, d^2_1, d^1_{1+}, \ldots, d^1_{T+}, d^2_{T+})$ is called the coordinator's strategy. The coordinator's strategy optimization problem is to find a strategy that minimizes the expected total cost in (2.7). The following lemma establishes the equivalence of the coordinator's strategy optimization problem and the original strategy optimization problem for the agents.

Lemma 2.2. Suppose $(d^{1*}_1, d^{2*}_1, \ldots, d^{1*}_{T+}, d^{2*}_{T+})$ is an optimal strategy for the coordinator. Then optimal communication and control strategies for the agents in the original problem are obtained as follows: for $i = 1, 2$,
$$\bar{f}^{i*}_t(X^i_t, C_t) = \Gamma^i_t(X^i_t), \ \text{where } \Gamma^i_t = d^{i*}_t(C_t), \qquad (2.17)$$
$$\bar{g}^{i*}_t(X^i_t, C_{t+}) = \Lambda^i_t(X^i_t), \ \text{where } \Lambda^i_t = d^{i*}_{t+}(C_{t+}). \qquad (2.18)$$

Proof. The lemma is a direct consequence of the results in [4].

Lemma 2.2 implies that the agents' strategy optimization problem can be solved by solving the coordinator's strategy optimization problem. The advantage of the coordinator's problem is that it is a sequential decision-making problem with the coordinator as the only decision-maker.
(Note that once the coordinator makes its decisions about which prescriptions to use, the agents act as mere evaluators and not as independent decision-makers.)

Coordinator's belief state: As shown in [4], the coordinator's problem can be viewed as a POMDP. Therefore, the coordinator's belief state serves as a sufficient statistic for selecting prescriptions. Before communication at time $t$, the coordinator's belief is given by
$$\Pi_t(x^1, x^2) = \mathbb{P}(X^1_t = x^1, X^2_t = x^2 \mid C_t, \Gamma_{1:t-1}, \Lambda_{1:t-1}). \qquad (2.19)$$
After the communication decisions are made and $Z^{er}_t$ is realized, the coordinator's belief is given by
$$\Pi_{t+}(x^1, x^2) = \mathbb{P}(X^1_t = x^1, X^2_t = x^2 \mid C_{t+}, \Gamma_{1:t}, \Lambda_{1:t-1}). \qquad (2.20)$$
Because of the conditional independence property identified in Lemma 2.1, the coordinator's beliefs factorize into beliefs on each agent's state, i.e.,
$$\Pi_t(x^1, x^2) = \Pi^1_t(x^1)\,\Pi^2_t(x^2), \qquad (2.21)$$
$$\Pi_{t+}(x^1, x^2) = \Pi^1_{t+}(x^1)\,\Pi^2_{t+}(x^2), \qquad (2.22)$$
where, for $i = 1, 2$, $\Pi^i_t$ is the marginal belief on $X^i_t$ obtained from (2.19) and $\Pi^i_{t+}$ is the marginal belief on $X^i_t$ obtained from (2.20). The coordinator can update its beliefs on the agents' states sequentially, as described in the following lemma.

Lemma 2.3. For $i = 1, 2$, $\Pi^i_1$ is the prior belief $P_{X^i_1}$ on the initial state $X^i_1$, and for each $t \ge 1$,
$$\Pi^i_{t+} = \eta^i_t(\Pi^i_t, \Gamma^i_t, Z^{er}_t, M_t), \qquad (2.23)$$
$$\Pi^i_{t+1} = \beta^i_t(\Pi^i_{t+}, \Lambda^i_t), \qquad (2.24)$$
where $\eta^i_t$ and $\beta^i_t$ are fixed functions derived from the system model. (We use $\beta_t(\Pi^{1,2}_{t+}, \Lambda^{1,2}_t)$ to denote the pair $\beta^1_t(\Pi^1_{t+}, \Lambda^1_t), \beta^2_t(\Pi^2_{t+}, \Lambda^2_t)$; similar notation is used for the pair $\eta^1_t(\cdot), \eta^2_t(\cdot)$.)

Proof. See Appendix A.3.
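For finite state spaces, the belief updates of Lemma 2.3 can be sketched as follows. This is a minimal illustration assuming beliefs are stored as dictionaries mapping states to probabilities; the function names mirror the lemma's notation, but the representation (and the convention that `z` is the agent's own state component, or `None` for $\phi$) is an illustrative choice.

```python
def eta(pi_i, gamma_i, z, m_i):
    """Post-communication update for one agent's marginal belief (eta in Lemma 2.3).

    pi_i    : dict state -> probability (pre-communication belief)
    gamma_i : dict state -> 0/1 (communication prescription)
    z       : agent i's state component of Z^er_t, or None for phi
    m_i     : the (publicly observed) communication decision of agent i
    """
    if z is not None:
        # Successful communication: belief collapses to a delta at the shared state.
        return {x: (1.0 if x == z else 0.0) for x in pi_i}
    # z = phi: condition on the prescription having produced decision m_i.
    # (The drop probability p_e is common to all states, so it cancels.)
    unnorm = {x: p * (1.0 if gamma_i[x] == m_i else 0.0) for x, p in pi_i.items()}
    total = sum(unnorm.values())
    return {x: p / total for x, p in unnorm.items()}

def beta(pi_plus, lam, kernel):
    """Propagate a belief through the dynamics (beta in Lemma 2.3).

    lam    : dict state -> action (control prescription)
    kernel : dict (state, action) -> dict next_state -> probability
    """
    nxt = {}
    for x, p in pi_plus.items():
        for x_next, q in kernel[(x, lam[x])].items():
            nxt[x_next] = nxt.get(x_next, 0.0) + p * q
    return nxt

# Observing m_i = 1 (with gamma mapping only state 1 to "communicate")
# concentrates the belief on state 1 even when the packet was dropped:
post = eta({0: 0.5, 1: 0.5}, {0: 0, 1: 1}, None, 1)
assert post == {0: 0.0, 1: 1.0}
```

This makes concrete why a dropped packet is still informative: the observed decisions $M_t$ reveal which level set of the prescription the state lies in.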
Finally, we note that given the coordinator's beliefs $\Pi^1_t, \Pi^2_t$ and its prescriptions $\Gamma^1_t, \Gamma^2_t$ at time $t$, the joint probability that $Z^{er}_t = \phi$ and $M_t = m_t$ is given by
$$\mathbb{P}(Z^{er}_t = \phi, M_t = m_t \mid \Pi^{1,2}_t, \Gamma^{1,2}_t) = \begin{cases} \sum_{x^{1,2}} \mathbb{1}_{\{\Gamma^1_t(x^1)=0\}} \mathbb{1}_{\{\Gamma^2_t(x^2)=0\}} \Pi^1_t(x^1)\Pi^2_t(x^2), & \text{if } m_t = (0,0), \\ \sum_{x^{1,2}} p_e\, \mathbb{1}_{\{\Gamma^1_t(x^1)=m^1_t\}} \mathbb{1}_{\{\Gamma^2_t(x^2)=m^2_t\}} \Pi^1_t(x^1)\Pi^2_t(x^2), & \text{otherwise.} \end{cases} \qquad (2.25)$$
Similarly, the probability that $Z^{er}_t = (x^1, x^2)$ is given by
$$\mathbb{P}(Z^{er}_t = (x^1, x^2) \mid \Pi^{1,2}_t, \Gamma^{1,2}_t) = (1 - p_e)\big[\max(\Gamma^1_t(x^1), \Gamma^2_t(x^2))\big]\, \Pi^1_t(x^1)\Pi^2_t(x^2). \qquad (2.26)$$

Coordinator's dynamic program: Using Lemma 2.3 and the probabilities in (2.25)–(2.26), we can write a dynamic program for the coordinator's POMDP. In the following theorem, $\pi^i$ denotes a general probability distribution on $\mathcal{X}^i$ and $\delta_{x^i}$ denotes the delta distribution centered at $x^i$.

Theorem 2.1. The value functions for the coordinator's dynamic program are as follows: for all beliefs $\pi^1, \pi^2$, $V_{T+1}(\pi^1, \pi^2) := 0$, and for $t = T, \ldots, 2, 1$,
$$V_{t+}(\pi^1, \pi^2) := \min_{\lambda^1, \lambda^2} \Big[ \sum_{x^{1,2}} c_t\big(x^{1,2}, \lambda^1(x^1), \lambda^2(x^2)\big)\, \pi^1(x^1)\pi^2(x^2) + V_{t+1}\big(\beta_t(\pi^{1,2}, \lambda^{1,2})\big) \Big], \qquad (2.27)$$
where $\beta_t$ is as described in Lemma 2.3, and
$$V_t(\pi^1, \pi^2) := \min_{\gamma^1, \gamma^2} \Big[ \sum_{x^{1,2}} \rho(x^{1,2}) \max(\gamma^1(x^1), \gamma^2(x^2))\, \pi^1(x^1)\pi^2(x^2) + \sum_{m} \mathbb{P}(Z^{er}_t = \phi, M_t = m \mid \pi^{1,2}, \gamma^{1,2})\, V_{t+}\big(\eta_t(\pi^{1,2}, \gamma^{1,2}, \phi, m)\big) + \sum_{\tilde{x}^{1,2}} \mathbb{P}(Z^{er}_t = \tilde{x}^{1,2} \mid \pi^{1,2}, \gamma^{1,2})\, V_{t+}(\delta_{\tilde{x}^1}, \delta_{\tilde{x}^2}) \Big], \qquad (2.28)$$
where $\eta_t$ is as described in Lemma 2.3 and $\mathbb{P}(Z^{er}_t = \phi, M_t = m \mid \pi^{1,2}, \gamma^{1,2})$, $\mathbb{P}(Z^{er}_t = \tilde{x}^{1,2} \mid \pi^{1,2}, \gamma^{1,2})$ are as described in (2.25)–(2.26). The coordinator's optimal strategy is to pick the minimizing prescription pair for each time and each $(\pi^1, \pi^2)$.

Proof. Since the coordinator's problem is a POMDP, it has a corresponding dynamic program. The value functions in the theorem are obtained by simple manipulations of the POMDP dynamic program for the coordinator.

Remark 2.2.
If the transition and cost functions are time-invariant, we can also consider an infinite-horizon discounted-cost analog of the problem formulation in this paper. The results above extend to this discounted setting in a straightforward manner using the approach in [4].

2.5 Extensions
2.5.1 Packet-drop channel with state
In the formulation of Section 2.2, the quality of the communication channel between the agents did not change with time. In this section, we consider an extension in which the packet-drop probability evolves over time as an uncontrolled Markov process. Let $E_t \in \mathcal{E}$ denote the channel state at time $t$, where $\mathcal{E}$ is a finite set of channel states. The process $E_t$ evolves as $E_{t+1} = l_t(E_t, W^e_t)$, where the random variables $\{W^e_t\}_{t \ge 1}$ are mutually independent and also independent of all other primitive random variables. The packet-drop probability of the channel at time $t$, denoted by $p_{e_t}$, is a function of the channel state: $p_{e_t} = \phi_t(E_t)$. Further, the communication cost also depends on the channel state and is given by $\rho(X_t, E_t)$. The channel state is known to both agents. The information available to agent $i$ at times $t$ (before communication) and $t+$ (after communication) is thus given by
$$I^i_t = \{X^i_{1:t}, U^i_{1:t-1}, Z^{er}_{1:t-1}, M^{1,2}_{1:t-1}, E_{1:t}\}, \qquad (2.29)$$
$$I^i_{t+} = \{I^i_t, Z^{er}_t, M^{1,2}_t\}. \qquad (2.30)$$
Our goal is to find communication and control strategies for the agents in the above setup. With some minor modifications, we can use the common information based methodology of Section 2.4 to solve this problem.

Given the channel state $E_t$, the coordinator's beliefs $\Pi^1_t, \Pi^2_t$ and its prescriptions $\Gamma^1_t, \Gamma^2_t$ at time $t$, the joint probability that $Z^{er}_t = \phi$ and $M_t = m_t$ is given by
$$\mathbb{P}(Z^{er}_t = \phi, M_t = m_t \mid \Pi^{1,2}_t, \Gamma^{1,2}_t, E_t) = \begin{cases} \sum_{x^{1,2}} \mathbb{1}_{\{\Gamma^1_t(x^1)=0\}} \mathbb{1}_{\{\Gamma^2_t(x^2)=0\}} \Pi^1_t(x^1)\Pi^2_t(x^2), & \text{if } m_t = (0,0), \\ \sum_{x^{1,2}} \phi_t(E_t)\, \mathbb{1}_{\{\Gamma^1_t(x^1)=m^1_t\}} \mathbb{1}_{\{\Gamma^2_t(x^2)=m^2_t\}} \Pi^1_t(x^1)\Pi^2_t(x^2), & \text{otherwise.} \end{cases} \qquad (2.31)$$
Similarly, the probability that $Z^{er}_t = (x^1, x^2)$ is given by
$$\mathbb{P}(Z^{er}_t = (x^1, x^2) \mid \Pi^{1,2}_t, \Gamma^{1,2}_t, E_t) = (1 - \phi_t(E_t))\big[\max(\Gamma^1_t(x^1), \Gamma^2_t(x^2))\big]\, \Pi^1_t(x^1)\Pi^2_t(x^2). \qquad (2.32)$$
The following theorem describes the modified dynamic program for the coordinator. The value functions and the coordinator's optimal strategy depend on the current channel state in addition to the coordinator's beliefs.

Theorem 2.2. The value functions for the coordinator's dynamic program are as follows: for all beliefs $\pi^1, \pi^2$ and all $e \in \mathcal{E}$, $V_{T+1}(\pi^1, \pi^2, e) := 0$, and for $t = T, \ldots, 2, 1$,
$$V_{t+}(\pi^1, \pi^2, e) := \min_{\lambda^1, \lambda^2} \Big[ \sum_{x^{1,2}} c_t\big(x^{1,2}, \lambda^1(x^1), \lambda^2(x^2)\big)\, \pi^1(x^1)\pi^2(x^2) + \mathbb{E}\big[V_{t+1}(\beta_t(\pi^{1,2}, \lambda^{1,2}), E_{t+1}) \mid E_t = e\big] \Big], \qquad (2.33)$$
where $\beta_t$ is as described in Lemma 2.3, and
$$V_t(\pi^1, \pi^2, e) := \min_{\gamma^1, \gamma^2} \Big[ \sum_{x^{1,2}} \rho(x^{1,2}, e) \max(\gamma^1(x^1), \gamma^2(x^2))\, \pi^1(x^1)\pi^2(x^2) + \sum_{m} \mathbb{P}(Z^{er}_t = \phi, M_t = m \mid \pi^{1,2}, \gamma^{1,2}, e)\, V_{t+}\big(\eta_t(\pi^{1,2}, \gamma^{1,2}, \phi, m), e\big) + \sum_{\tilde{x}^{1,2}} \mathbb{P}(Z^{er}_t = \tilde{x}^{1,2} \mid \pi^{1,2}, \gamma^{1,2}, e)\, V_{t+}(\delta_{\tilde{x}^1}, \delta_{\tilde{x}^2}, e) \Big], \qquad (2.34)$$
where $\eta_t$ is as described in Lemma 2.3 and the probabilities are as described in (2.31)–(2.32). The coordinator's optimal strategy is to pick the minimizing prescription pair for each time and each $(\pi^1, \pi^2, e)$.

2.5.2 Agents with communication constraints
In this section, we consider an extension of the problem formulated in Section 2.2 that incorporates constraints on the communication between agents. The underlying system model, information structure and total expected cost are the same as in Section 2.2, but the agents now face constraints on when and how frequently they can communicate. Specifically, we consider the following three constraints:
1. The minimum time between successive communication attempts (i.e., times at which $M^{or}_t = 1$) must be at least $s_{min}$ (where $s_{min} \ge 0$).
2. The maximum time between successive communication attempts cannot exceed $s_{max}$ (where $s_{max} \ge s_{min}$).
3. The total number of communication attempts over the time horizon $T$ cannot exceed $N$.

The strategy optimization problem is to find communication and control strategies for the agents that minimize the expected cost in (2.7) while ensuring that the above three constraints are satisfied. We assume that there is at least one choice of agents' strategies for which the constraints are satisfied (i.e., the constrained problem is feasible). Our framework allows some of the above constraints to be absent (e.g., setting $s_{min} = 0$ effectively removes the first constraint; setting $N = T$ effectively removes the third constraint).

We can follow the methodology of Section 2.4 for the constrained problem as well. The key difference is that, in addition to the coordinator's beliefs on the agents' states, we also need to keep track of (i) the time since the most recent communication attempt (denoted by $S^a_t$), and (ii) the total number of communication attempts so far (denoted by $S^b_t$). The variables $S^a_t, S^b_t$ are used by the coordinator to ensure that the prescriptions it selects do not violate the constraints. For example, if $S^a_t < s_{min}$, the coordinator can only select communication prescriptions that map $\mathcal{X}^i$ to 0 for each $i$, which ensures that the first constraint is satisfied. Similarly, if $S^a_t = s_{max}$, the coordinator must select a pair of communication prescriptions that guarantee a communication attempt at the current time. The following theorem describes the modified dynamic program for the coordinator in the constrained formulation.

Theorem 2.3. The value functions for the coordinator's dynamic program are as follows: for all beliefs $\pi^1, \pi^2$ and all non-negative integers $s^a, s^b$, $V_{T+1}(\pi^1, \pi^2, s^a, s^b) := 0$, and for $t = T, \ldots, 2, 1$,
$$V_{t+}(\pi^1, \pi^2, s^a, s^b) := \min_{\lambda^1, \lambda^2} \Big[ \sum_{x^{1,2}} c_t\big(x^{1,2}, \lambda^1(x^1), \lambda^2(x^2)\big)\, \pi^1(x^1)\pi^2(x^2) + V_{t+1}\big(\beta_t(\pi^{1,2}, \lambda^{1,2}), s^a, s^b\big) \Big], \qquad (2.35)$$
where $\beta_t$ is as described in Lemma 2.3; and, if $s_{min} \le s^a < s_{max}$ and $s^b < N$,
$$V_t(\pi^1, \pi^2, s^a, s^b) := \min_{\gamma^1, \gamma^2} \Big[ \sum_{x^{1,2}} \rho(x^{1,2}) \max(\gamma^1(x^1), \gamma^2(x^2))\, \pi^1(x^1)\pi^2(x^2) + \mathbb{P}(Z^{er}_t = \phi, M_t = (0,0) \mid \pi^{1,2}, \gamma^{1,2})\, V_{t+}\big(\eta_t(\pi^{1,2}, \gamma^{1,2}, \phi, (0,0)), s^a + 1, s^b\big) + \sum_{m \neq (0,0)} \mathbb{P}(Z^{er}_t = \phi, M_t = m \mid \pi^{1,2}, \gamma^{1,2})\, V_{t+}\big(\eta_t(\pi^{1,2}, \gamma^{1,2}, \phi, m), 0, s^b + 1\big) + \sum_{\tilde{x}^{1,2}} \mathbb{P}(Z^{er}_t = \tilde{x}^{1,2} \mid \pi^{1,2}, \gamma^{1,2})\, V_{t+}(\delta_{\tilde{x}^1}, \delta_{\tilde{x}^2}, 0, s^b + 1) \Big], \qquad (2.36)$$
where $\eta_t$ is as described in Lemma 2.3 and the probabilities are as described in (2.25)–(2.26). If $s^b = N$ or if $s^a < s_{min}$, the minimization over $\gamma^1, \gamma^2$ in (2.36) is replaced by setting $\gamma^1, \gamma^2$ to the prescriptions that map all states to 0. If $s^b < N$ and $s^a = s_{max}$, the minimization over $\gamma^1, \gamma^2$ in (2.36) is replaced by setting $\gamma^1, \gamma^2$ to the prescriptions that map all states to 1. The coordinator's optimal strategy is to pick the minimizing prescription pair for each time and each $(\pi^1, \pi^2, s^a, s^b)$.

2.6 An Illustrative Example
Problem setup: Consider a system with two entities that are susceptible to attacks. Each entity has an associated defender that decides whether and how to defend it. Each defender can take one of three actions: $\aleph$ (which denotes doing nothing), $d_1$, or $d_2$. Thus, the defenders are the decision-making agents in this model. The state $X^i_t \in \{0, 1\}$ of agent $i$ represents whether or not entity $i$ is under attack at time $t$; we use 1 to denote the attack state and 0 to denote the safe (non-attack) state. If entity $i$ is currently in the safe state, i.e., $X^i_t = 0$, then with probability $p^i_a$ the entity transitions to the attack state 1 (irrespective of the defender's action).
When entity $i$ is under attack, i.e., $X^i_t = 1$, if the corresponding defender chooses to do nothing, then the state does not change with probability 1, i.e., $X^i_{t+1} = X^i_t$. On the other hand, if the defender chooses to defend using defensive action $d_k$, $k = 1, 2$, then the entity transitions to the safe state 0 with probability $p^i_{d_k}$. The transition probabilities are listed in Table 2.1.

Table 2.1: Transition probabilities $\mathbb{P}[X^i_{t+1} \mid X^i_t, U^i_t]$, written as (probability of state 0, probability of state 1).

              $U^i_t = \aleph$     $U^i_t = d_1$              $U^i_t = d_2$
$X^i_t = 0$   $(1-p^i_a, p^i_a)$   $(1-p^i_a, p^i_a)$         $(1-p^i_a, p^i_a)$
$X^i_t = 1$   $(0, 1)$             $(p^i_{d_1}, 1-p^i_{d_1})$ $(p^i_{d_2}, 1-p^i_{d_2})$

If both entities are in the safe state, the cost incurred by the system is 0. If at least one entity is under attack, the cost incurred is 20. Further, an additional cost of 100 (respectively 150) is incurred if both defenders choose to defend using $d_1$ (resp. $d_2$) at the same time (in any state). More explicitly, the cost at time $t$ is given by
$$c_t(X_t, U_t) = \vartheta^{t-1}\big(20\,\mathbb{1}_{(X^1_t = 1 \text{ or } X^2_t = 1)} + 100\,\mathbb{1}_{(U^1_t = d_1)}\mathbb{1}_{(U^2_t = d_1)} + 150\,\mathbb{1}_{(U^1_t = d_2)}\mathbb{1}_{(U^2_t = d_2)}\big),$$
where $0 < \vartheta < 1$ is a discount factor. We assume that the packet-drop probability $p_e = 0$ and that the communication cost is a constant $\rho$.

When an entity is under attack, the associated defender must defend it at some point (if not immediately); otherwise, the system remains in the attack state perpetually. However, a heavy cost is incurred if both agents defend using the same defensive action at the same time. Therefore, the agents must defend their respective entities in a coordinated manner. Communicating with each other helps the agents coordinate effectively; on the other hand, communicating all the time leads to a high communication cost. This trade-off between communication and coordination can be balanced optimally using our approach discussed in Section 2.4.
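The single-entity dynamics of Table 2.1 and the undiscounted stage cost of this example are simple enough to encode directly. The following Python sketch is an illustrative encoding (action names and function names are our own choices):

```python
def transition_prob(x, u, p_a, p_d1, p_d2):
    """P(X_{t+1} = 1 | X_t = x, U_t = u) for one entity (Table 2.1).

    x is 0 (safe) or 1 (under attack); u is 'none' (aleph), 'd1' or 'd2'.
    """
    if x == 0:
        return p_a                  # an attack arrives regardless of the action
    if u == 'none':
        return 1.0                  # attack persists if the defender does nothing
    return 1.0 - (p_d1 if u == 'd1' else p_d2)  # defense succeeds w.p. p_dk

def stage_cost(x1, x2, u1, u2):
    """Per-stage cost of the example, before applying the discount factor."""
    cost = 20.0 if (x1 == 1 or x2 == 1) else 0.0
    if u1 == u2 == 'd1':
        cost += 100.0               # both defenders use d1 simultaneously
    if u1 == u2 == 'd2':
        cost += 150.0               # both defenders use d2 simultaneously
    return cost

# A safe entity is attacked w.p. p_a no matter what the defender does:
assert transition_prob(0, 'd1', p_a=0.3, p_d1=0.6, p_d2=0.4) == 0.3
# One entity under attack while both defenders pick d1: 20 + 100.
assert stage_cost(1, 0, 'd1', 'd1') == 120.0
```

The coupling in `stage_cost` (but not in the dynamics) is exactly what makes coordination, and hence communication, valuable in this example.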
Implementation: In our experiments, we consider an infinite-horizon discounted-cost version of the problem described above. Since the agents alternate between communication and control (see equations (2.14) and (2.15)), the coordinator's POMDP as described in Section 2.4 is not time-invariant. To convert it into a time-invariant POMDP, we introduce an additional binary state variable $X^c_t$ that represents whether the agents are currently in the communication phase or the control phase; $X^c_t$ alternates between 0 and 1 in a deterministic manner. For agent $i$ in the communication phase, the action $\aleph$ is interpreted as the no-communication decision ($M^i_t = 0$) and all other actions are interpreted as the communication decision ($M^i_t = 1$). With this transformation, we can use any infinite-horizon POMDP solver to obtain approximately optimal strategies for our problem. In our experiments, we use the SARSOP solver [35] available in the Julia POMDPs framework [36].

Results: We consider three scenarios in our experiments: (i) the jointly (approximately) optimal communication and control strategies computed using the coordinator's POMDP; (ii) the "never communicate" communication strategy along with control strategies optimized assuming no communication; and (iii) the "always communicate" communication strategy along with control strategies optimized assuming persistent communication. The expected costs of these three strategies are shown in Figure 2.1 as a function of the communication cost $\rho$. The approximation error achieved by the SARSOP solver is at most 0.001.

Figure 2.1: Performance achieved by three strategies: (i) jointly optimal communication and control strategies, (ii) always communicate, and (iii) never communicate. In this example, $p^1_a = p^2_a = 0.3$, $p^1_{d_1} = p^2_{d_1} = 0.6$, $p^1_{d_2} = p^2_{d_2} = 0.4$ and $\vartheta = 0.95$.
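At each belief point, the coordinator's dynamic program (Theorem 2.1) minimizes over pairs of communication prescriptions, i.e., over all maps from $\mathcal{X}^i$ to $\{0, 1\}$. For small state spaces this minimization can be carried out by brute-force enumeration, as the following Python sketch illustrates; the cost-to-go evaluation is abstracted as a callback, and all names are illustrative.

```python
from itertools import product

def all_prescriptions(states):
    """Enumerate all communication prescriptions: maps from states to {0, 1}.

    There are 2^|states| such maps, which is why this only scales to
    small local state spaces.
    """
    for bits in product([0, 1], repeat=len(states)):
        yield dict(zip(states, bits))

def best_prescription_pair(states1, states2, q_value):
    """Exhaustive minimization over prescription pairs, as a POMDP solver
    would perform at each belief point.

    q_value(g1, g2) returns the cost-to-go of using prescriptions (g1, g2)
    at the current belief (the bracketed expression in Eq. (2.28)).
    """
    return min(
        ((g1, g2) for g1 in all_prescriptions(states1)
                  for g2 in all_prescriptions(states2)),
        key=lambda pair: q_value(*pair),
    )

# Toy check with a synthetic cost-to-go that prefers communicating
# exactly from state 1 of each agent:
target = ({0: 0, 1: 1}, {0: 0, 1: 1})
g1, g2 = best_prescription_pair([0, 1], [0, 1],
                                lambda a, b: 0.0 if (a, b) == target else 1.0)
assert (g1, g2) == target
```

Since each agent has $2^{|\mathcal{X}^i|}$ communication prescriptions, the pair space grows as $2^{|\mathcal{X}^1| + |\mathcal{X}^2|}$; this is the bottleneck discussed next.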
Scaling up: A major challenge in solving the coordinator's POMDP is that the size of the prescription space is exponential in the size of the state space $\mathcal{X}^i$. Recall that in the coordinator's POMDP, the coordinator's action space is the space of prescription pairs, and POMDP solvers repeatedly optimize over the action space (in our case, the prescription space) as an intermediate step [35]. A naive approach is to enumerate every prescription pair and choose the one with the optimal value; this approach is commonly used when the POMDP's action space is fairly small. In [27, 37], an approach based on constraint optimization [38] was proposed to tackle the computational complexity of exhaustively enumerating all prescriptions, and it was noted there that this approach works significantly better in practice. Our current implementation based on the Julia framework can be used only when the prescription space is small. To solve larger-scale problems, one can modify the algorithm in [35] to incorporate the constraint optimization approach of [27, 37].

2.7 Conclusion
We considered a multi-agent problem in which agents can dynamically decide at each time step whether to share information with each other, incurring the resulting communication cost. Our goal was to jointly design the agents' communication and control strategies in order to optimize the trade-off between communication costs and the control objective. We showed that agents can ignore a large part of their private information without compromising system performance. We then provided a common information based solution to the strategy optimization problem; the approach constructs a fictitious coordinator's POMDP whose solution (obtained via a dynamic program) characterizes the optimal strategies for the agents. We extended our solution to incorporate time-varying packet-drop channels and constraints on when and how frequently agents can communicate. A setting in which a decentralized team of agents controls a stochastic system in the presence of an adversary is left for future work. One bottleneck we observed is the minimization over the prescription space, which calls for more efficient methods; restricting both agents to identical prescriptions is one simple way of reducing the prescription space and will be explored in future work.

Chapter 3
Optimal Communication and Control Strategies for a Multi-Agent System in the Presence of an Adversary

In multi-agent systems, the agents may not be able to fully observe the system state and the actions of other agents. A multi-agent system is said to have an asymmetric information structure when different agents have access to different information; each agent must select its actions based only on the limited information available to it. Decision-making scenarios with information asymmetry arise in a range of domains such as autonomous driving, power grids, transportation networks, cyber-security of networked computing and communication systems, and competitive markets and geopolitical interactions (see, for example, [4, 29, 30, 39, 40]).

Based on the nature of interactions between the agents, multi-agent systems can broadly be classified into three types: (i) teams, (ii) games and (iii) team-games. In teams, all the agents act cooperatively to achieve a shared objective. In games, each agent is self-interested and has its own objective. In team-games, agents within a team are cooperative, but the team as a whole is non-cooperative with respect to other teams. For agents in the same team, sharing information with each other aids coordination and improves performance. Various information sharing mechanisms [4] arise depending on the underlying communication environment.
For instance, if the agents have access to a perfect, costless communication channel, they can share their entire information with each other. On the other hand, if communication is too expensive, the agents may never share their information. In this paper, instead of fixing the information sharing mechanism for agents in a team, we consider a model in which the agents can strategically decide whether or not to share their information with other agents. By doing so, the agents in the team can balance the trade-off between the control cost and the communication cost. This joint design of control and communication strategies was considered in [41] and a team-optimal solution was provided using the common information approach [4]. In some scenarios (e.g. a battlefield), the team of agents may be susceptible to adversarial attacks. Moreover, the adversary may have the capability to intercept the communication among the agents. This makes the information sharing mechanism substantially more complicated. While sharing information with teammates may be beneficial for intra-team coordination, it can reveal sensitive information to the adversary. The adversary may exploit this information to inflict severe damage on the system. Such interactions between a team of cooperative agents and an adversary can be modeled as a zero-sum team-game [42]. In this paper, our focus is on a zero-sum game between a team of two agents and an adversary in which the team aims to minimize the control and communication cost while the adversary aims to maximize it. The system state in this game has three components: a local state for each agent in the team and a global state. The adversary controls the global state and each agent controls its respective local state. We restrict our attention to models in which the agents in the team are more informed than the adversary. Our model allows us to capture several scenarios of interest.
For example, the adversary in our model can affect the quality of, and the cost associated with, the agents' communication channel, and the agents can perfectly or imperfectly encrypt their communication. We analyze a family of such zero-sum team vs. adversary games and provide a characterization of an optimal (min-max) control and communication strategy for the team. This characterization is based on the common information belief based min-max dynamic program for team vs. team games discussed in [42].

3.0.1 Related Works

There is a large body of prior work on decision-making in multi-agent systems. In this section, we discuss related works on cooperative teams and team-games. In the decentralized stochastic control literature, a variety of information structures (obtained from different information sharing protocols) have been considered [4, 29, 30, 41]. Another well-studied class of multi-agent teams with asymmetric information is the class of Decentralized Partially Observable Markov Decision Processes (Dec-POMDPs). Several methods for solving such generic Dec-POMDPs exist in the literature [20–25]. Dynamic games among teams have received some attention over the past few years. Two closely related works are [43] and [42]. In [43], a model of games among teams where players in a team internally share their information with some delay was investigated. The authors of [43] characterize Team-Nash equilibria under certain existence assumptions. In [42], a general model of zero-sum games between two teams was considered. For this general model, the authors provide bounds on the upper and lower values of the zero-sum game. A relatively specialized model was also studied in [42] and for this model, a min-max strategy for one of the teams was characterized in addition to the min-max value. In [44], the authors formulate and solve a particular malicious intrusion game between two teams of mobile agents.
The works that are most closely related to our work are [41] and [42]. In [41], the authors consider a team problem in which the agents can strategically decide when to communicate with each other. While our model is inspired by the model in [41], our model is substantially more general and complicated because of the presence of an adversary. In team problems, the agents can use deterministic strategies without loss of optimality, whereas in games, the agents can benefit from randomization. Due to the randomness in the agents' strategies and the need to solve a min-max problem as opposed to a simpler minimization problem, different techniques are required for analyzing and solving the team-game. Our game model is a special case of one of the models studied in [42] and hence, we can use the results in [42] to characterize a min-max strategy for the team. While we borrow some results from [42], our results on private information reduction are novel.

3.0.2 Notation

Random variables are denoted by upper case letters and their realizations by the corresponding lower case letters. In general, subscripts are used as time indices while superscripts are used to index decision-making agents. For time indices t1 ≤ t2, X_{t1:t2} is the short-hand notation for the variables (X_{t1}, X_{t1+1}, ..., X_{t2}). Similarly, X^{1:2} is the short-hand notation for the collection of variables (X^1, X^2). Operators P(·) and E[·] denote the probability of an event and the expectation of a random variable, respectively. For random variables/vectors X and Y, P(·|Y = y), E[X|Y = y] and P(X = x | Y = y) are denoted by P(·|y), E[X|y] and P(x|y), respectively. For a strategy g, we use P^g(·) (resp. E^g[·]) to indicate that the probability (resp. expectation) depends on the choice of g. For any finite set A, ΔA denotes the probability simplex over the set A. For any two sets A and B, F(A, B) denotes the set of all functions from A to B.
We define rand to be a mechanism that, given (i) a finite set A, (ii) a distribution d over A, and (iii) a random variable K uniformly distributed over the interval (0, 1], produces a random variable X ∈ A with distribution d, i.e.,

X = rand(A, d, K) ∼ d.    (3.1)

3.1 Problem Formulation

Consider a discrete-time control system with a team of two agents (agent 1 and agent 2) and an adversary. The system comprises a global state and a local state for each agent in the team. Let X^0_t ∈ X^0 denote the global state and let X^i_t ∈ X^i denote the local state of agent i. X_t := (X^1_t, X^2_t) represents the local states of both agents in the team. The initial global state and the initial local states of both agents are independent random variables, with state X^i_1 having the probability distribution P_{X^i_1}, i = 0, 1, 2. Each agent perfectly observes its own local state, and the global state is perfectly observed by all agents (including the adversary). Let U^i_t ∈ U^i denote the control action of agent i at time t. U_t := (U^1_t, U^2_t) denotes the control actions of both agents at time t. Further, let U^a_t ∈ U^a denote the control action of the adversary at time t. The global and local states of the system evolve according to

X^0_{t+1} = k^0_t(X^0_t, U^a_t, W^0_t),    (3.2)
X^i_{t+1} = k^i_t(X^0_t, X^i_t, U^i_t, W^i_t), i = 1, 2,    (3.3)

where W^i_t ∈ W^i, i = 0, 1, 2, is the disturbance in the dynamics with probability distribution P_{W^i}. The initial states X^0_1, X^1_1, X^2_1 and the disturbances {W^i_t}_{t≥1}, i = 0, 1, 2, are independent random variables. Note that the next local state of agent i depends on the current local state and control action of agent i and the global state. The next global state depends on the current global state and the adversary's action. In addition to deciding the control actions at each time, the two agents in the team need to decide whether or not to initiate communication at each time.
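The rand mechanism in (3.1) can be realized by inverse-CDF sampling from a single uniform draw. A minimal sketch, assuming the distribution d is given as a list of probabilities aligned with the ordering of A (the function body is our own illustration, not from the text):

```python
import bisect
from itertools import accumulate

def rand(A, d, K):
    """Produce X ~ d over the finite set A from one uniform draw
    K in (0, 1], by inverting the cumulative distribution of d."""
    cdf = list(accumulate(d))             # running sums of d
    return A[bisect.bisect_left(cdf, K)]  # first index with cdf >= K
```

Because K is the only source of randomness, fixing K reproduces the same outcome; this is how the randomization variables K^i_t used later decouple the choice of strategy from the act of randomizing.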
We use the binary variable M^i_t to denote the communication decision taken by agent i. Let M^or_t := max(M^1_t, M^2_t) and let Z^er_t represent the information exchanged between the agents at time t. In this model, when the global state X^0_t = x, the agents lose packets, i.e. fail to communicate, with probability p_e(x) even when one (or both) of the agents decides to communicate, i.e. when M^or_t = 1. Here, p_e : X^0 → [0, 1] maps the global state to a failure probability. Based on the communication model described above, we can define the variable Z^er_t given that X^0_t = x as:

Z^er_t = X^{1,2}_t, w.p. 1 − p_e(x), if M^or_t = 1,
         φ,         w.p. p_e(x),     if M^or_t = 1,
         φ,                          if M^or_t = 0.    (3.4)

At time t+, the adversary observes a noisy version Y_t of the variable Z^er_t given by

Y_t = l_t(Z^er_t, M_t, X^0_t, W^y_t),    (3.5)

where W^y_t is the observation noise.

Information structure and decision strategies: At the beginning of the t-th time step, the information available to agent i is given by (i) the history of global states and its local states, (ii) its control actions, (iii) communication actions and messages, and (iv) the adversary's action and observation history:

I^i_t = {X^0_{1:t}, X^i_{1:t}, U^i_{1:t−1}, M^{1,2}_{1:t−1}, Z^er_{1:t−1}, U^a_{1:t−1}, Y_{1:t−1}}.    (3.6)

Agent i can use this information to make its communication decision at time t. We allow the agent to randomize its decision. Thus, agent i first selects a distribution δM^i_t over {0, 1} based on its information and then randomly picks M^i_t according to the chosen distribution:

δM^i_t = f^i_t(I^i_t),    (3.7)
M^i_t = rand({0, 1}, δM^i_t, K^i_t),    (3.8)

where K^i_t, i = 1, 2, t ≥ 1, are independent random variables uniformly distributed over the interval (0, 1] that are used for randomization (these variables are also independent of the initial states and all noises/disturbances). The function f^i_t is referred to as the communication strategy of agent i at time t. At this point, the adversary does not take any action.
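A minimal sketch of the erasure model in (3.4), with None standing in for the erasure symbol φ (the function and variable names are illustrative assumptions, not from the text):

```python
import random

def erasure_channel(x1, x2, m1, m2, pe_x, rng=random):
    """Sample Z^er_t per (3.4): if either agent initiates
    communication (M^or_t = 1), the pair of local states X^{1,2}_t
    is delivered with probability 1 - pe(x) and erased otherwise;
    if neither agent initiates, nothing is exchanged."""
    if max(m1, m2) == 1 and rng.random() >= pe_x:
        return (x1, x2)   # successful exchange of X^{1,2}_t
    return None           # erasure symbol phi
```

Note that the failure probability pe_x is a function of the current global state, which is how the adversary (who controls the global state) can degrade the channel.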
After the communication decisions are made and the resulting communication (if any) takes place, the information available to agent i is

I^i_{t+} = {I^i_t, Z^er_t, M^{1,2}_t, Y_t}.    (3.9)

I^a_t denotes the adversary's information just before the communication at time t and I^a_{t+} denotes the adversary's information after the communication at time t+. Our model allows for different scenarios of the adversary's information, which will be described later. Agent i and the adversary choose their control actions based on their post-communication information according to

δU^i_t = g^i_t(I^i_{t+}),    (3.10)
U^i_t = rand(U^i, δU^i_t, K^i_{t+}), i = 1, 2, a,    (3.11)

where K^i_{t+}, i = 1, 2, t ≥ 1, are independent random variables uniformly distributed over the interval (0, 1] that are used for randomization (these variables are also independent of all other randomization variables, the initial states and all noises/disturbances). The functions g^i_t and g^a_t are referred to as the control strategies of agent i and the adversary at time t. The tuples f^i := (f^i_1, f^i_2, ..., f^i_T) and g^i := (g^i_1, g^i_2, ..., g^i_T) are called the communication and control strategy of agent i, respectively. The collections f := (f^1, f^2) and g := (g^1, g^2) of communication and control strategies of both agents are called the communication and control strategy of the team. Similarly, g^a := (g^a_1, g^a_2, ..., g^a_T) is called the control strategy of the adversary. We can split the information available to the agents into two parts: common information and private information. Common information at a given time is the information available to all the decision-makers (including the adversary) at that time. Private information of an agent includes all of its information at the given time except the common information.

1.
At the beginning of time step t, before the communication decisions are made, the common information (C_t) and private information (P^i_t) are defined as

C_t := I^1_t ∩ I^2_t ∩ I^a_t,    (3.12)
P^i_t := I^i_t \ C_t, for all i ∈ {1, 2, a}.    (3.13)

2. After the communication decisions are made and the resulting communication (if any) takes place, the common and private information are defined as

C_{t+} := I^1_{t+} ∩ I^2_{t+} ∩ I^a_{t+},    (3.14)
P^i_{t+} := I^i_{t+} \ C_{t+}, for all i ∈ {1, 2, a}.    (3.15)

Assumption 3.1. We assume that the following conditions are satisfied:

1. Monotonicity: The adversary's information grows with time. Thus, I^a_t ⊆ I^a_{t+} ⊆ I^a_{t+1} for every t.

2. Nestedness: The adversary's information is common information, and each agent in the team has access to the adversary's information, i.e.,

C_t = I^a_t ⊆ I^1_t ∩ I^2_t =: C^team_t,
C_{t+} = I^a_{t+} ⊆ I^1_{t+} ∩ I^2_{t+} =: C^team_{t+}.

Therefore, P^a_t = P^a_{t+} = ∅.

3. Common Information Evolution: Let Z_{t+} := C_{t+} \ C_t and Z_{t+1} := C_{t+1} \ C_{t+} be the increments in common information at times t+ and t + 1, respectively. Thus, C_{t+} = {C_t, Z_{t+}} and C_{t+1} = {C_{t+}, Z_{t+1}}. The common information evolves as

Z_{t+} = ζ_{t+}(P^{1:2}_t, M^{1:2}_t, Z^er_t, Y_t),    (3.16)
Z_{t+1} = ζ_{t+1}(P^{1:2}_{t+}, U^{1:2}_t, X^{0:2}_{t+1}),    (3.17)

where ζ_{t+} and ζ_{t+1} are fixed transformations.

4. Private Information Evolution: The private information evolves as

P^i_{t+} = ξ^i_{t+}(P^{1:2}_t, M^{1:2}_t, Z^er_t, Y_t),    (3.18)
P^i_{t+1} = ξ^i_{t+1}(P^{1:2}_{t+}, U^{1:2}_t, X^{0:2}_{t+1}),    (3.19)

where ξ^i_{t+} and ξ^i_{t+1} are fixed transformations and i = 1, 2.

Due to the nestedness condition in Assumption 3.1, the team is always more informed than the adversary. Scenarios where the adversary has some private information are beyond the scope of this paper. The third and fourth conditions in Assumption 3.1 on the evolution of common and private information are very mild [4, 45] and most information structures of interest satisfy them.
Strategy optimization problem: At time t, the system incurs a cost c_t(X^0_t, X_t, U_t, U^a_t) that depends on the global state, the team's state, the control actions of both agents and the adversary's action. Whenever the agents decide to share their states with each other, they incur a state-dependent cost ρ(X^0_t, X_t). The system runs for a time horizon T. The total expected cost over the time horizon T associated with a strategy profile ((f, g), g^a) is:

J((f, g), g^a) = E^{((f,g),g^a)} [ Σ_{t=1}^{T} c_t(X^0_t, X_t, U_t, U^a_t) + ρ(X^0_t, X_t) 1{M^or_t = 1} ].    (3.20)

The objective of the team is to find communication and control strategies (f, g) for the team in order to minimize the worst-case total expected cost max_{g^a} J((f, g), g^a). This min-max optimization problem can be viewed as a zero-sum game between the team and the adversary. We refer to this zero-sum game as Game G. We denote the min-max value of Game G by S^u(G), i.e.,

S^u(G) = min_{(f,g)} max_{g^a} J((f, g), g^a).    (3.21)

Remark 3.1. The strategy spaces of all the players (agents in the team and the adversary) are compact and the cost J(·) is continuous in f, g, g^a. Hence, we can conclude using Berge's maximum theorem [46] that there exist strategies that achieve the maximum and minimum in (3.21).

3.1.1 Examples of Information Structures Satisfying Assumption 3.1

3.1.1.1 Maximum Information

Consider the case where the adversary's information at times t and t+ is given by

I^a_t = C^team_t,
I^a_{t+} = C^team_{t+}.

In this case, the adversary has access to all the information it can have while satisfying Assumption 3.1. Thus, the common information C_t = C^team_t (resp. for t+). The private information of agent i at time t is given by P^i_t = {X^i_{1:t}, U^i_{1:t−1}}. This information structure models scenarios in which the agents in the team do not use any form of encryption and any communication that happens between them can be observed by the adversary.
3.1.1.2 Encrypted Communication with Global State Information

Consider the following information structure for the adversary:

I^a_t = {X^0_{1:t}, U^a_{1:t−1}, M^{1,2}_{1:t−1}},    (3.22)
I^a_{t+} = {X^0_{1:t}, U^a_{1:t−1}, M^{1,2}_{1:t}},    (3.23)
Y_t = 0.    (3.24)

This information structure models scenarios in which the agents in the team have the capability to encrypt their messages. Since the adversary's observation Y_t is a constant, it has no knowledge of the messages exchanged by the team. The adversary, however, knows whether or not communication was initiated by the agents. The private information of agent i at time t is given by P^i_t = {X^i_{1:t}, U^i_{1:t−1}, Z^er_{1:t−1}}.

3.1.1.3 Imperfect Encryption with Global State Information

I^a_t = {X^0_{1:t}, U^a_{1:t−1}, M^{1,2}_{1:t−1}, Y_{1:t−1}},    (3.25)
I^a_{t+} = {X^0_{1:t}, U^a_{1:t−1}, M^{1,2}_{1:t}, Y_{1:t}}.    (3.26)

This information structure is very similar to the one discussed above, except that the encryption mechanism used by the agents may be imperfect.

3.2 Preliminary Results and Simplified Game Gs

In this section, we show that the agents in the team can ignore parts of their information without losing optimality. This removal of information narrows the search for optimal strategies to a class of simpler strategies and is a key step in our approach for finding optimal strategies. Let us define the team's common private information D_t before communication at time t and D_{t+} after communication at time t+ as

D_t := P^1_t ∩ P^2_t,    (3.27)
D_{t+} := P^1_{t+} ∩ P^2_{t+}.    (3.28)

The variables C_t, D_t (resp. C_{t+}, D_{t+}) constitute the team's common information at time t (resp. t+), i.e.,

C_t ∪ D_t = I^1_t ∩ I^2_t = C^team_t,    (3.29)
C_{t+} ∪ D_{t+} = I^1_{t+} ∩ I^2_{t+} = C^team_{t+}.    (3.30)

Notice that C_t and D_t depend on the adversary's information structure. However, since the team's information structure is fixed, C_t and D_t combined do not depend on the adversary's information structure.
The following lemma establishes a key conditional independence property that will be critical for our analysis.

Lemma 3.1 (Conditional independence property). At any time t, the two agents' local states and control actions are conditionally independent given the team's common information (C_t, D_t) (before communication) or (C_{t+}, D_{t+}) (after communication). That is, if c_t, d_t, c_{t+}, d_{t+} are the realizations of the common information and the common private information before and after communication, respectively, then for any realization x_{1:t}, u_{1:t−1} of states and actions, we have

P(x_{1:t}, u_{1:t−1} | c_t, d_t) = Π_{i=1}^{2} P(x^i_{1:t}, u^i_{1:t−1} | c_t, d_t),    (3.31)
P(x_{1:t}, u_{1:t} | c_{t+}, d_{t+}) = Π_{i=1}^{2} P(x^i_{1:t}, u^i_{1:t} | c_{t+}, d_{t+}).    (3.32)

Further, P(x^i_{1:t}, u^i_{1:t−1} | c_t, d_t) and P(x^i_{1:t}, u^i_{1:t} | c_{t+}, d_{t+}) depend only on agent i's strategy.

Proof. The proof of this lemma is very similar to the proof of Lemma 1 in [41]. For a detailed proof, see Appendix B.1 in [47].

The following proposition shows that agent i at times t and t+ can ignore its past states and actions, i.e. X^i_{1:t−1} and U^i_{1:t−1}, without losing optimality. This allows the agents in the team to use simpler strategies where the communication and control decisions are functions only of the current state and the team's common information.

Proposition 3.1. Agent i, i = 1, 2, can restrict itself to strategies of the form

M^i_t ∼ f̄^i_t(X^i_t, C_t, D_t),    (3.33)
U^i_t ∼ ḡ^i_t(X^i_t, C_{t+}, D_{t+}),    (3.34)

without loss of optimality. In other words, at times t and t+, agent i does not need the past local states and actions, X^i_{1:t−1}, U^i_{1:t−1}, for making optimal decisions.

Proof. See Appendix B.2 in [47].

Proposition 3.1 leads to a simplified game in which the information used by the players in the team is substantially reduced. We will refer to this game as Game Gs. Game Gs has the same dynamics and cost model as Game G.
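The factorization in (3.31) can be illustrated numerically: if the conditional joint distribution of the two agents' variables is a product of per-agent conditionals, then marginalizing the joint recovers each factor. A toy check with made-up numbers (purely illustrative, not from the text):

```python
# Per-agent conditionals P(x^i | c_t, d_t) on a binary state space
# (illustrative numbers).
p1 = {0: 0.3, 1: 0.7}
p2 = {0: 0.6, 1: 0.4}

# Conditionally independent joint, as asserted by Lemma 3.1.
joint = {(a, b): p1[a] * p2[b] for a in p1 for b in p2}

# Marginalizing the joint over agent 2 recovers agent 1's factor.
m1 = {a: sum(joint[(a, b)] for b in p2) for a in p1}
```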
The key difference between these two games lies in the team's information structure and strategy spaces. In Game Gs, the information used by player i in the team at times t and t+, respectively, is

I^i_t = {X^i_t} ∪ D_t ∪ C_t,    (3.35)
I^i_{t+} = {X^i_t} ∪ D_{t+} ∪ C_{t+}.    (3.36)

Therefore, the common information in the simplified game Gs is the same as in the original game G. In the simplified game Gs, the private information¹ is P^i_t = {X^i_t} ∪ D_t.

Corollary 3.1. If (f*, g*) is a min-max strategy in Game Gs, then it is a min-max strategy in Game G. Further, the min-max values of games G and Gs are identical.

Henceforth, we make the following mild assumption on the information structure of the agents in the simplified game Gs. It can be easily verified that the simplified games corresponding to all the models discussed in Section 3.1.1 satisfy this assumption.

Assumption 3.2. The information structure in the simplified game Gs with reduced private information satisfies Assumption 3.1.

¹With a slight abuse of notation, we use the same letter for denoting private information in both games G and Gs.

Remark 3.2. The reduced information in equations (3.35) and (3.36) is unilaterally sufficient information (see Definition 2.4 in [48]) for each player in the team. Proposition 3.1 can alternatively be shown using the concept of unilaterally sufficient information and Theorem 2.6 in [48].

3.3 Dynamic Program Characterization of a Min-max Strategy

It was shown in [42] that for certain zero-sum game models with a special structure, a virtual game Ge can be constructed based on the simplified Game Gs, and this virtual game can be used to obtain the min-max value and a min-max strategy for the minimizing team. In our game model described in Section 3.1, the adversary does not have any private information at any given time and hence, this model can be viewed as a special case of the game model described in paragraph (a), Section IV-A of [42].
Therefore, we can use the result in [42] to obtain the min-max value and a min-max strategy for our original Game G. The virtual game Ge involves the same underlying system model as game Gs. The main differences between games Gs and Ge lie in the manner in which the actions used to control the system are chosen. In the virtual game Ge, all the players in the team of game Gs are replaced by a virtual player (referred to as virtual player b) and the adversary is replaced by a virtual player (referred to as virtual player a). These virtual players in Game Ge operate as described in the following sub-section.

3.3.1 Virtual Game Ge

Consider virtual player a associated with the adversary. At each time t+, virtual player a selects a distribution Γ^a_t over the space U^a. The set of all such distributions is denoted by B^a_t := ΔU^a. Consider virtual player b associated with the team. At each time t and for each i = 1, 2, virtual player b selects a function Γ^i_t that maps the private information P^i_t to a distribution δM^i_t over the space {0, 1}. Thus, δM^i_t = Γ^i_t(P^i_t). The set of all such mappings is denoted by B^i_t := F(P^i_t, Δ{0, 1}). We refer to the tuple Γ_t := (Γ^1_t, Γ^2_t) as virtual player b's prescription at time t. The set of all possible prescriptions for virtual player b at time t is denoted by B_t := B^1_t × B^2_t. At each time t+ and for each i = 1, 2, virtual player b selects a function Λ^i_t that maps the private information P^i_{t+} to a distribution δU^i_t over the space U^i. Thus, δU^i_t = Λ^i_t(P^i_{t+}). The set of all such mappings is denoted by B^i_{t+} := F(P^i_{t+}, ΔU^i). We refer to the tuple Λ_t := (Λ^1_t, Λ^2_t) as virtual player b's prescription at time t+. The set of all possible prescriptions for virtual player b at time t+ is denoted by B_{t+} := B^1_{t+} × B^2_{t+}.
Once the virtual players select their prescriptions at times t and t+, the corresponding actions are generated as

M^i_t = rand({0, 1}, Γ^i_t(P^i_t), K^i_t),    (3.37)
U^i_t = rand(U^i, Λ^i_t(P^i_{t+}), K^i_{t+}),    (3.38)
U^a_t = rand(U^a, Γ^a_t, K^a_{t+}).    (3.39)

In the virtual game Ge, the virtual players' information I^v_t at time t comprises the common information C_t and the past prescriptions of both players Γ_{1:t−1}, Γ^a_{1:t−1}, Λ_{1:t−1}. At time t, virtual player b selects its prescription according to a control law χ^b_t, i.e., Γ_t = χ^b_t(I^v_t). Note that at time t, virtual player a does not take any action. At time t+, the virtual players' information I^v_{t+} comprises C_{t+} and all the past prescriptions of both players Γ_{1:t}, Γ^a_{1:t−1}, Λ_{1:t−1}. Virtual player a selects its prescription according to a control law χ^a_t, i.e., Γ^a_t = χ^a_t(I^v_{t+}), and virtual player b selects its prescription according to a control law χ^b_{t+}, i.e., Λ_t = χ^b_{t+}(I^v_{t+}). For virtual player a, the collection of control laws over the entire time horizon χ^a = (χ^a_1, ..., χ^a_T) is referred to as its control strategy; similarly for virtual player b, χ^b = (χ^b_1, χ^b_{1+}, ..., χ^b_T, χ^b_{T+}). Let H^a_t be the set of all possible control laws for virtual player a at time t and let H^a be the set of all possible control strategies for virtual player a, i.e., H^a = H^a_1 × ··· × H^a_T. Let H^b_t (resp. H^b_{t+}) be the set of all possible control laws for virtual player b at time t (resp. t+) and let H^b be the set of all possible control strategies for virtual player b. The total cost associated with the game for a strategy profile (χ^a, χ^b) is

J(χ^a, χ^b) = E^{(χ^a, χ^b)} [ Σ_{t=1}^{T} c_t(X^0_t, X_t, U_t, U^a_t) + ρ(X^0_t, X_t) 1{M^or_t = 1} ],    (3.40)

where the functions c_t and ρ are the same as in games G and Gs.
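Equations (3.37)-(3.38) apply a prescription to private information and then randomize the result. A toy sketch of this two-step action generation (the dictionary-based prescription encoding and inverse-CDF sampling are our own illustrative choices):

```python
import bisect
from itertools import accumulate

def apply_prescription(gamma_i, p_i, A, K):
    """Per (3.37)-(3.38): the prescription gamma_i maps private
    information p_i to a distribution over A, which is then sampled
    using the single uniform draw K (inverse-CDF sampling)."""
    d = gamma_i[p_i]                      # distribution over A
    cdf = list(accumulate(d))
    return A[bisect.bisect_left(cdf, K)]

# toy prescription for binary private information and binary actions
gamma = {0: [1.0, 0.0],    # in private state 0: always action 0
         1: [0.5, 0.5]}    # in private state 1: fair coin
```

The point of this construction is that the virtual player commits to the whole map gamma_i before the private information is realized, which is what makes the prescriptions part of the commonly known history.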
In this virtual game, virtual player a aims to maximize the cost while virtual player b aims to minimize it. The upper value of Game Ge is denoted by S^u(Ge).

3.3.2 Common Information Belief and the Dynamic Program

3.3.2.1 Common Information Belief

Before communication at time t, the common information belief (CIB) is given by:

Π_t(x^0, x, d) = P[X^0_t = x^0, X_t = x, D_t = d | I^v_t].    (3.41)

After the communication decisions are made and Z^er_t is realized, the CIB is given by:

Π_{t+}(x^0, x, d) = P[X^0_t = x^0, X_t = x, D_{t+} = d | I^v_{t+}].    (3.42)

The CIB satisfies two key properties: (i) the CIB can be computed without using the virtual players' strategies χ^a and χ^b; (ii) since the adversary does not have any private information at any given time, the CIB does not depend on the adversary's prescriptions (see Section IV-A and Appendix VI of [42]). This can be stated formally as the following lemma.

Lemma 3.2. Π_1(x^0_1, x_1, d_1) is the belief P(X^0_1 = x^0_1, X_1 = x_1, D_1 = d_1) and for each t ≥ 1,

Π_{t+} = η_t(Π_t, Γ_t, Z_{t+}),    (3.43)
Π_{t+1} = β_t(Π_{t+}, Λ_t, Z_{t+1}),    (3.44)

where η_t, β_t are fixed transformations derived from the system model using Bayes' rule (see Appendix VI of [42]).

We now describe the dynamic program that provides us with the value of the game G and an algorithm to compute a min-max strategy for the team.

3.3.2.2 Dynamic Program

Define the value function V_{T+1}(π) := 0 for all π at time T + 1. The cost-to-go functions w_t (resp. w_{t+}) and value functions V_t (resp. V_{t+}) for t = T, ..., 2, 1, are defined as follows:

w_{t+}(π, λ, γ^a) := E[ c_t(X^0_t, X_t, U_t, U^a_t) + V_{t+1}(β_t(π, λ, Z_{t+1})) | π, λ, γ^a ],
V_{t+}(π) := min_λ max_{γ^a} w_{t+}(π, λ, γ^a),    (3.45)

w_t(π, γ) := E[ ρ(X^0_t, X_t) 1{M^or_t = 1} + V_{t+}(η_t(π, γ, Z_{t+})) | π, γ ],
V_t(π) := min_γ w_t(π, γ).    (3.46)

Let Ξ_t(π) (resp. Ξ_{t+}(π)) be a minimizer (resp. min-maximizer) of the cost-to-go function in (3.46) (resp. (3.45)).

Theorem 3.1.
The min-max values of games G, Gs and Ge are identical, i.e., we have S^u(G) = S^u(Ge) = E[V_1(Π_1)]. Further, the strategy pair (f*, g*) described by Algorithm 1 is a min-max strategy for the team in the original game G.

Proof. Because of our assumption on the information structure of Game Gs (Assumption 3.2), the evolution of the CIB in Game Ge does not depend on virtual player a's prescription. This property allows us to use Theorems 4 and 5 in [42] to obtain our result.

The dynamic program is helpful for characterizing the min-max value and a min-max strategy in a general setting. However, solving the dynamic program involves computational challenges. The main cause of these challenges is that the private information space ({X^i_t} ∪ D_t) can be very large even after the private information reduction in the simplified game Gs. For instance, D_t = Z^er_{1:t−1} in the model described in Section 3.1.1.2. In the following sub-sections, we discuss some special cases in which the private information is small or can be reduced further to a manageable size. Once the private information has been reduced sufficiently, one can use the computational methodology discussed in Appendix X of [42] to solve the dynamic program.
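The min over λ and max over γ^a in (3.45) can be illustrated on small finite candidate sets. A toy sketch with an illustrative cost table (the actual DP optimizes over prescription and distribution spaces and iterates backward over CIBs):

```python
def solve_stage(cost, team_set, adv_set):
    """Toy analogue of one min-max stage of the DP: for each team
    choice, the adversary best-responds (inner max), and the team
    picks the choice with the smallest worst-case value (outer min)."""
    best_val, best_choice = float('inf'), None
    for lam in team_set:
        worst = max(cost(lam, ga) for ga in adv_set)  # adversary's max
        if worst < best_val:
            best_val, best_choice = worst, lam        # team's min
    return best_val, best_choice

# illustrative 2x2 cost table: rows = team choices, cols = adversary
C = [[1.0, 4.0],
     [2.0, 3.0]]
val, lam = solve_stage(lambda l, g: C[l][g], [0, 1], [0, 1])
```

Here the team choice with the smaller cost against a fixed adversary (row 0) is not the min-max choice: row 1 has the smaller worst-case value, which is the distinction between the minimization in a team problem and the min-max in the game.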
Algorithm 1 Strategies f^{i*}, g^{i*} for Player i in the Team
1: Input: Ξ_t(π), Ξ_{t+}(π) obtained from the DP for all t and all π
2: for t = 1 to T do
3:   Before communication:
4:   Current information: C_t, P^i_t  ▷ where C_t = {C_{(t−1)+}, Z_t}
5:   Update CIB Π_t = β_{t−1}(Π_{(t−1)+}, Ξ_{(t−1)+}(Π_{(t−1)+}), Z_t)  ▷ If t = 1, initialize CIB Π_1 using C_1
6:   Get prescription Γ_t = (Γ^1_t, Γ^2_t) = Ξ_t(Π_t)
7:   Get distribution δM^i_t = Γ^i_t(P^i_t) and select action M^i_t = rand({0, 1}, δM^i_t, K^i_t)
8:   After communication decisions are made:
9:   Current information: C_{t+}, P^i_{t+}  ▷ where C_{t+} = {C_t, Z_{t+}}
10:  Update CIB Π_{t+} = η_t(Π_t, Ξ_t(Π_t), Z_{t+})
11:  Get prescription Λ_t = (Λ^1_t, Λ^2_t) = Ξ_{t+}(Π_{t+})
12:  Get distribution δU^i_t = Λ^i_t(P^i_{t+}) and select action U^i_t = rand(U^i, δU^i_t, K^i_{t+})

3.3.3 Communication without Encryption

Consider the information structure in Section 3.1.1.1 in which the agents do not encrypt their information. In this case, D_t is empty and therefore, the only private information agent i uses at times t and t+ is X^i_t. Due to this reduced private information space, the prescription space in the virtual game Ge is substantially smaller. Further, the CIB at time t is a belief only on the current state X_t and thus can be updated easily. The smaller prescription space, smaller belief space and simpler belief update significantly improve the computational tractability of the dynamic program in Section 3.3.2.2. The CIB update rules for this information structure will be used in proving other results and hence, we denote these update rules by η̄_t and β̄_t.

3.3.4 Communication with Encryption

In this section, we consider the information structure described in Section 3.1.1.3. In this information structure, the agents in the team can encrypt their information perfectly or imperfectly.
At any given time, the adversary can observe whether or not communication was initiated by each agent in the team and, subsequently, the imperfectly encrypted message that was exchanged between the agents. Further, we assume that the adversary can observe whether or not the agents had a successful communication over the erasure channel. Let E_t be a Bernoulli variable such that E_t = 1 if and only if successful communication occurred at time t. Note that E_t = 1 − 1{Z^er_t = φ} and, therefore, E_t can be viewed as a part of the adversary's observation Y_t (see (3.5)). Let us define the time of the last communication at time t as

L_t = max{τ : τ < t, E_τ = 1},    (3.47)
L_{t+} = max{τ : τ ≤ t, E_τ = 1}.    (3.48)

Here, the maximum of an empty set is taken to be −∞. Note that L_{t+1} = L_{t+} and that the adversary can compute L_t and L_{t+} using its information.

Proposition 3.2. There exists a min-max strategy of the form

M^i_t ∼ f^i_t(X^i_t, X_{L_t}, C_t),    (3.49)
U^i_t ∼ g^i_t(X^i_t, X_{L_{t+}}, C_{t+}).    (3.50)

Proof. See Appendix B.3 in [47].

The main consequence of Proposition 3.2 is that agent i's private information at time t (resp. t+) is reduced from {X^i_t, Z^er_{1:t−1}} (resp. {X^i_t, Z^er_{1:t}}) to {X^i_t, X_{L_t}} (resp. {X^i_t, X_{L_{t+}}}). As discussed in the previous sub-section, this reduction leads to a simplification of the dynamic program described in Section 3.3.2.2.

3.4 Conclusions

We considered a zero-sum game between a team of two agents and a malicious adversary. The agents can strategically decide at each time whether or not to share their private information with each other. The agents incur a cost whenever they communicate with each other, and the adversary may eavesdrop on their communication. Under certain assumptions on the system dynamics and the information structure of the adversary, we characterized a min-max control and communication strategy for the team using a common information belief based min-max dynamic program.
For certain specialized information structures, we proved that the agents in the team can ignore a large part of their private information without losing optimality. This reduction in private information substantially simplifies the dynamic program and hence improves computational tractability.

Chapter 4
Optimal Symmetric Strategies in Multi-Agent Systems with Decentralized Information

4.1 Introduction

The problem of sequential decision-making by a team of collaborative agents operating in an uncertain environment has received significant attention in the recent control (e.g. [4, 29, 30, 41, 49]) and artificial intelligence (e.g. [21, 22, 24, 25, 50]) literature. The goal in such problems is to design decision/control strategies for the multiple agents in order to optimize a performance metric for the team. In some cooperative multi-agent (or team) problems, the agents are essentially identical and interchangeable. That is, they have similar sensors for gathering information and similar action capabilities. For example, each agent can observe its position in a grid and can decide to move one step in any direction. When dealing with identical agents, it may be convenient (and perhaps necessary in some cases) to design identical decision/control strategies for the agents. We will refer to such strategies as symmetric strategies. In this paper, we investigate a team problem with identical agents and focus on the problem of designing symmetric strategies for the agents in the team. As indicated above, one motivation for using symmetric strategies is convenience and simplicity. Instead of designing n different strategies for n agents in a team, we need to design just one strategy for all agents. Another reason for using symmetric strategies arises in situations where agents do not have any individualized identities.
This can happen in settings where the population of the agents is not fixed and agents are unaware of the total number of agents currently present or their own index in the population. An example of such a situation for a multi-access communication problem is described in [51]. When an agent 35 List of Figures doesn’t know its own index (”Am I agent 1 or agent 2?”), it makes sense to use symmetric (i.e. identical) strategies for all agents irrespective of their index. In this paper, our focus is on designing symmetric strategies to optimize a finite horizon team objective. We start with a general information structure and then consider some special cases. The constraint of using symmetric strategies introduces new features and complications in the team problem. For example, when agents in a team are free to use individualized strategies, it is well-known that agents can be restricted to deterministic strategies without loss of optimality [52]. However, we show in a simple example that randomized strategies may be helpful when the agents are constrained to use symmetric strategies. We adopt the common information approach [4] for our problem and modify it to accommodate the use of symmetric strategies. This results in a common information based dynamic program where each step involves minimization over a single function from the space of an agent’s private information to the space of probability distributions over actions. The complexity of this dynamic program depends in large part on the size of the private information space. We discuss some known approaches for reducing agents’ private information and why they may not work under the constraint of symmetric strategies. We present two specialized models where private information can be reduced using simple dynamic program based arguments. Notation: Random variables are denoted by upper case letters (e.g. X), their realization with lower case letters (e.g. x), and their space of realizations by script letters (e.g. 
X). Subscripts denote time and superscripts denote the agent index; e.g., X^i_t denotes the state of agent i at time t. The shorthand X^i_{1:t} denotes the collection (X^i_1, X^i_2, ..., X^i_t). Δ(X) denotes the probability simplex over the space X. P(A) denotes the probability of an event A. E[X] denotes the expectation of a random variable X. 1_A denotes the indicator function of event A. For simplicity of notation, we use P(x_{1:t}, u_{1:t−1}) to denote P(X_{1:t} = x_{1:t}, U_{1:t−1} = u_{1:t−1}), and a similar notation for conditional probability. For a strategy pair (g^1, g^2), we use P^{(g^1,g^2)}(·) (resp. E^{(g^1,g^2)}[·]) to indicate that the probability (resp. expectation) depends on the choice of the strategy pair. We use −i to denote the agent(s) other than agent i. U ∼ λ indicates that U is randomly distributed according to the distribution λ.

Organization: The rest of the paper is organized as follows. We formulate the team problem with symmetric strategies in Section 4.2 and discuss some of its key features and special cases. In Section 4.3, we present a common information based dynamic program for our problem (and for some special cases). In Section 4.4, we compare two information structures that differ only in the agents' private information. We conclude in Section 4.5. Proofs of key results are provided in the appendices.

4.2 Problem Formulation

Consider a discrete-time system with two agents. The system state consists of three components: a global state and two local states, one for each agent. X^i_t ∈ X denotes the local state of agent i, i = 1, 2, at time t, and X^0_t ∈ X^0 denotes the global state at time t. X_t denotes the triplet (X^0_t, X^1_t, X^2_t). Let U^i_t ∈ U denote the control action of agent i at time t. U_t denotes the pair (U^1_t, U^2_t).
The dynamics of the global and local states are as follows: X0 t+1 = f 0 t (X0 t , Ut , W0 t ), (4.1) Xi t+1 = ft(Xi t , X0 t , Ut , Wi t ), i = 1, 2, (4.2) where W0 t ∈ W0 and Wi t ∈ W are random disturbances with W0 t having the probability distribution p 0 W and Wi t , i = 1, 2, having the probability distribution pW . We use Wt to denote the triplet (W0 t , W1 t , W2 t ). Note that the next local state of agent i depends on its own current local state, the global state and the control actions of both the agents. Also note that the function ft in (4.2) is the same for both agents. The initial states X0 1 , X1 1 , X2 1 are independent random variables with X0 1 having the probability distribution α 0 and Xi 1 , i = 1, 2, having the probability distribution α. The initial states X0 1 , X1 1 , X2 1 and the disturbances W0 t , Wi t , t ≥ 1, i = 1, 2, are independent discrete random variables. These will be referred to as the primitive random variables of the system. 4.2.1 Information structure and strategies The information available to agent i , i = 1, 2, at time t consists of two parts: 1. Common information Ct - This information is available to both agents1 . Ct takes values in the set Ct . 2. Private information P i t - Any information available to agent i at time t that is not included in Ct must be included in P i t . P i t takes values in Pt . (Note that the space of private information is the same for both agents.) We use Pt to denote the pair (P 1 t , P2 t ). Ct should be viewed as an ordered list (or row vector) of some of the system variables that are known to both agents. Similarly, P i t should be viewed as an ordered list (or row vector). We assume that Ct is non-decreasing with time, i.e., any variable included in Ct is also included in Ct+1. Let Zt+1 be the increment in common information from time t to t + 1. 
We assume the following dynamics for Zt+1 and P i t+1 (i = 1, 2): Zt+1 = ζt(Xt , Pt , Ut , Wt); P i t+1 = ξ i t (Xt , Pt , Ut , Wt), Agent i uses its information at time t to select a probability distribution δUi t on the action space U. We will refer to δUi t as agent i’s behavioral action at time t. The action U i t is 1Ct does not have to be the entirety of information that is available to both agents; it simply cannot include anything that is not available to both agents. 37 List of Figures then randomly generated according to the chosen distribution, i.e., U i t ∼ δUi t . Thus, we can write δUi t = g i t (P i t , Ct), (4.3) where g i t is a mapping from Pt × Ct to ∆(U). The function g i t is referred to as the control strategy of agent i at time t. The collection of functions g i := (g i 1 , . . . , gi T ) is referred to as the control strategy of agent i. Let G denote the set of all possible strategies for agent i. (Note that the set of all possible strategies is the same for the two agents since the private information space, the common information space and the action space are the same for the two agents.) We use (g 1 , g2 ) to denote the pair of strategies being used by agent 1 and agent 2 respectively. We are interested in the finite horizon total expected cost incurred by the system which is defined as: J(g 1 , g2 ) :=(g 1 ,g2 ) "X T t=1 kt(Xt , Ut) # , (4.4) where kt is the cost function at time t. Our focus will be on the case of symmetric strategies, i.e., the case where both agents use the same control strategy. When referring to symmetric strategies, we will drop the superscript i in g i and denote a symmetric strategy pair by (g, g). Symmetric strategy optimization problem (Problem P1): Our objective is to find a symmetric strategy pair that achieves the minimum total expected cost among all symmetric strategy pairs. That is, we are looking for a strategy g ∈ G such that J(g, g) ≤ J(h, h), ∀h ∈ G. (4.5) Remark 4.1. 
We assume that the randomization at each agent is done independently over time and independently of the other agent [53]. This can be done as follows: Agent i has access to i.i.d. random variables K^i_{1:T} that are uniformly distributed over the interval (0, 1]. The variables K^1_{1:T}, K^2_{1:T} are independent of each other and of the primitive random variables. Further, agent i has access to a mechanism κ that takes as input K^i_t and a distribution δU^i_t over U and generates a random action with the input distribution. Thus, agent i's action at time t can be written as U^i_t = κ(δU^i_t, K^i_t).

Remark 4.2. If the private information space, the common information space and the action space are finite, then it can be shown that the strategy space G is a compact space and that J(g, g) is a continuous function of g ∈ G. Thus, an optimal g satisfying (4.5) exists.

Remark 4.3. Consider a realization c_t of C_t and suppose that both P^1_t and P^2_t happened to have the same realization p_t. Then, under a symmetric strategy pair, both agents will necessarily select the same behavioral action. In other words, any difference in the choice of behavioral actions by the two agents stems from differences in the realizations of private information.

We assume in Problem P1 that the specification of common and private information is given to us a priori. This means that the spaces P_t, C_t and, consequently, the strategy space G are fixed as part of the problem specification. If, for a fixed information structure, one changed the specification of common and private information (e.g., consider common information to be empty and all information to be private), then that would produce a new instance of Problem P1.

Remark 4.4. We have formulated the problem with two agents for simplicity. The number of agents can in fact be any positive integer n or even a deterministic time-varying sequence n_t. Our results extend to these cases with only notational modifications.
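The mechanism κ described in Remark 4.1 can be realized by standard inverse-transform sampling. A minimal sketch follows; representing the behavioral action δU^i_t as a list of (action, probability) pairs is our own assumption for illustration.

```python
def kappa(delta, k):
    """Generate a random action from distribution delta using a uniform draw k.

    delta: list of (action, probability) pairs summing to 1.
    k: a realization of K^i_t, uniform on (0, 1].
    Inverse-transform sampling: return the first action whose cumulative
    probability reaches k.
    """
    cum = 0.0
    for action, prob in delta:
        cum += prob
        if k <= cum:
            return action
    return delta[-1][0]  # guard against floating-point round-off
```

Because K^1_{1:T} and K^2_{1:T} are independent of each other and of the primitive random variables, feeding fresh draws into κ at every step yields exactly the independent randomization assumed in the analysis.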
4.2.2 Some specific information structures We will be particularly interested in the special cases of Problem P1 described below. Each case corresponds to a different information structure. In each case, the global state history until time t, X0 1:t , and the action history until t − 1, U1:t−1 are part of common information Ct . 1. One-step delayed sharing information structure: In this case, each agent knows its own local state history until time t and the local state history of the other agent until time t−1. Thus, the common and private information available to agent i at time t is given by Ct = (X0 1:t , U1:t−1, X1,2 1:t−1 ); P i t = Xi t . (4.6) We refer to the instance of Problem P1 with this information structure as Problem P1a. 2. Full local history information structure: In this case, each agent knows its own local state history until time t but does not observe the local states of the other agent. Thus, the common and private information available to agent i at time t is given by Ct = (X0 1:t , U1:t−1); P i t = Xi 1:t . (4.7) This information structure corresponds to the control sharing information structure of [30]. We refer to the instance of Problem P1 with this information structure as Problem P1b. 3. Reduced local history information structure: In this case, each agent knows its own current local state but does not recall its past local states and does not observe the local states of the other agent. Thus, the common and private information available to agent i at time t is given by Ct = (X0 1:t , U1:t−1); P i t = Xi t . (4.8) We refer to the instance of Problem P1 with this information structure as Problem P1c. Remark 4.5. Another special case of Problem 1 that might be of interest is the following: Consider a situation where the state dynamics are governed not by the vector of agents’ actions but only by an aggregate effect of agents’ actions. Let At = a(U 1 t , U2 t ) denote the aggregate action. We refer to a(·, ·) as the aggregation function. 
Some examples of a could be the sum or the maximum function. The state dynamics are as described in equations (4.1) and (4.2), except with U_t replaced by A_t. The agents only observe the aggregate actions taken in the past but not the individual actions. The common and private information are given as:

C_t = (X^0_{1:t}, A_{1:t−1}); P^i_t = X^i_t. (4.9)

We refer to the instance of Problem P1 corresponding to this case as Problem P1d.

4.2.3 Why are randomized strategies needed?

In team problems, it is well known that one can restrict agents to deterministic strategies without loss of optimality [52]. However, since the agents are restricted to use symmetric strategies in our setup, randomization can help. This can be illustrated by the following simple example.

Example 1: Let T = 1 and let (X^0_1, X^1_1, X^2_1) = (0, 0, 0) with probability 1. The action space is U = {0, 1}. The information structure is that of Problem P1c described in Section 4.2.2. The cost at t = 1 is given by

k_1(X_1, U_1) = 1{U^1_1 = U^2_1}.

Note that the cost function penalizes the agents for taking the same action. In this case, each agent has only two deterministic strategies: taking action 0 or taking action 1 at time 1. If both agents use the same deterministic strategy, then, clearly, U^1_1 = U^2_1 and hence the expected cost incurred is 1. Consider now the following randomized strategy for each agent: U^i_1 = 1 with probability p and U^i_1 = 0 with probability (1 − p). When the two agents use this randomized strategy, the expected cost is p^2 + (1 − p)^2. With p = 0.5, this cost is 0.5, which is less than the expected cost achieved by any deterministic symmetric strategy pair. Thus, when agents are restricted to use the same strategy, they can benefit from randomization.

4.3 Common information approach

We adopt the common information approach [4] for Problem P1. This approach formulates a new decision-making problem from the perspective of a coordinator that knows the common information.
At each time, the coordinator selects prescriptions that map each agent’s private information to its action. The behavioral action of each agent in this problem is simply the prescription evaluated at the current realization of its private information. Since Problem P1 requires symmetric strategies for the two agents, we will require the coordinator to select identical prescriptions for the two agents. To make things precise, let Bt denote the space of all functions from Pt to ∆(U). Let Γt ∈ Bt denote the prescription selected by the coordinator at time t. Then, the behavioral action of agent i, i = 1, 2, is given by: δUi t = Γt(P i t ). As in Problem P1, agent i’s action U i t is generated according to the distribution δUi t using independent randomization. The coordinator selects its prescription at time t based on the 40 List of Figures common information at time t and the history of past prescriptions. Thus, we can write: Γt = dt(Ct , Γ1:t−1), (4.10) where dt is a mapping from Ct × B1 . . . × Bt−1 to Bt . The collection of mappings d := (d1, . . . , dT ) is referred to as the coordination strategy. The coordinator’s objective is to choose a coordination strategy that minimizes the finite horizon total expected cost: J (d) :=d "X T t=1 kt(Xt , Ut) # . (4.11) The following lemma establishes the equivalence of the coordinator problem formulated above and the problem Problem P1. The use of identical prescriptions by the coordinator is needed to connect the coordinator’s strategy to symmetric strategies for the agents in Problem P1. Lemma 4.1. Problem P1 and the coordinator’s problem are equivalent in the following sense: (i) For any symmetric strategy pair (g, g), consider the following coordination strategy: dt(Ct) = gt(·, Ct). Then, J(g, g) = J (d). (ii) Conversely, for any coordination strategy d, consider the symmetric strategy pair defined as follows: gt(·, Ct) = dt(Ct , Γ1:t−1), where Γk = dk(Ck, Γ1:k−1) for k = 1, . . . , t − 1. Proof. 
The proof is based on Proposition 3 of [4] and the fact that the use of identical prescriptions for the two agents by the coordinator corresponds to the use of symmetric strategies in Problem P1. We now proceed with finding a solution for the coordinator’s problem. As shown in [4], the coordinator’s belief on (Xt , Pt) can serve as its information state (sufficient statistic) for selecting prescriptions. At time t, the coordinator’s belief is given as: Πt(x, p) = P(Xt = x, Pt = p|Ct , Γ1:(t−1)), (4.12) for all x ∈ X 0 × X × X , p ∈ Pt × Pt . The belief can be sequentially updated by the coordinator as described in Lemma 4.2 below. The lemma follows from arguments similar to those in Lemma 2 of [53] (or Theorem 1 of [4]). Lemma 4.2. For any coordination strategy d, the coordinator’s belief Πt evolves almost surely as Πt+1 = ηt(Πt , Γt , Zt+1), (4.13) where ηt is a fixed transformation that does not depend on the coordination strategy. 41 List of Figures Using the results in [4], we can write a dynamic program for the coordinator’s problem. Recall that Bt is the space of all functions from Pt to ∆(U). For a γ ∈ Bt and p ∈ Pt , γ(p) is a probability distribution on U. Let γ(p; u) denote the probability assigned to u ∈ U under the probability distribution γ(p). Theorem 4.1. The value functions for the coordinator’s dynamic program are as follows: Define VT +1(πT +1) = 0 for every πT +1. For t ≤ T and for any realization πt of Πt, define Vt(πt) = min γt∈Bt E[kt(Xt , Ut)+ Vt+1(ηt(πt , γt , Zt+1))|Πt = πt , Γt = γt ] (4.14) The coordinator’s optimal strategy is to pick the minimizing prescription for each time and each πt. Proof. As noted in [4], the coordinator’s problem can be seen as a POMDP. The theorem is simply the POMDP dynamic program for the coordinator. Remark 4.6. 
The expectation in (4.14) should be interpreted as follows: Z_{t+1} is generated according to its dynamics given in Section 4.2.1; U^i_t, i = 1, 2, is independently randomly generated according to the distribution γ_t(P^i_t); and the joint distribution on (X_t, P_t) is π_t.

Remark 4.7. It can be established by backward induction that the term being minimized in (4.14) is a continuous function of γ_t. This can be shown using an argument very similar to the one used in the proof of Lemma 3 in [54]. This continuity property, along with the fact that B_t is a compact set, ensures that the minimum in (4.14) is achieved.

For the instances of Problem P1 described in Problems P1a-P1c (see Section 4.2), the private information of an agent includes its current local state. Consequently, for these instances, the coordinator's belief is just on the private information of the agents and the current global state. The following lemma shows that this belief can be factorized into beliefs on each agent's private information and a degenerate belief on the global state.

Lemma 4.3. In Problems P1a-P1c, for any realization x^0 of the global state and any realizations p^1, p^2 of the agents' private information,

Π_t(x^0, p^1, p^2) = δ_{X^0_t}(x^0) Π^1_t(p^1) Π^2_t(p^2), (4.15)

where Π_t is the coordinator's belief (see (4.12)), Π^1_t, Π^2_t are the marginals of Π_t for each agent's private information, and δ_{X^0_t}(·) is a delta distribution located at X^0_t. (Recall that X^0_t is part of the common information in Problems P1a-P1c.) Further, for any coordination strategy d, Π^i_t, i = 1, 2, evolves almost surely as

Π^i_{t+1} = η^i_t(X^0_t, Π^i_t, Γ_t, Z_{t+1}), (4.16)

where η^i_t is a fixed transformation that does not depend on the coordination strategy.

Proof. See Appendix C.1.

Because of the above lemma, we can replace Π_t (and its realizations π_t) by (Π^1_t, Π^2_t, X^0_t) (and the corresponding realizations (π^1_t, π^2_t, x^0_t)) in the dynamic program of Theorem 4.1 for Problems P1a-P1c.
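To make the minimization in (4.14) concrete, consider the one-step setting of Example 1 in Section 4.2.3: there is no useful private information, so a prescription γ reduces to a single distribution on U = {0, 1}, parameterized by p = γ(1). The following sketch (a grid search over prescriptions, purely illustrative and our own construction) recovers the optimal randomized symmetric strategy.

```python
def one_step_dp_value(grid):
    """Sketch of the one-step symmetric-coordinator DP (4.14) for Example 1.

    Both agents draw their actions i.i.d. from the prescription gamma with
    gamma(1) = p, so the expected cost E[1{U^1 = U^2}] is p^2 + (1 - p)^2.
    The DP minimizes this expected cost over candidate prescriptions in grid.
    """
    costs = {p: p * p + (1 - p) * (1 - p) for p in grid}
    p_star = min(costs, key=costs.get)
    return costs[p_star], p_star
```

On a grid containing p = 0.5, this returns value 0.5 at p = 0.5, whereas restricting to the deterministic prescriptions p ∈ {0, 1} yields value 1, matching the discussion in Example 1.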
4.4 Comparison of Problems P1b and P1c

The information structures in Problems P1b and P1c differ only in the private information available to the agents: in P1b, each agent knows its entire local state history, whereas in P1c each agent knows only its current local state. If the agents were not restricted to use the same strategies, it is known that the two information structures are equivalent. That is, if a (possibly asymmetric) strategy pair (g^1, g^2) is optimal for the information structure in Problem P1c, then it is also optimal for the information structure in Problem P1b [30]. This effectively means that agents can ignore their past local states without any loss in performance. However, such an equivalence of the two information structures may not hold when agents are restricted to use symmetric strategies. In other words, an optimal symmetric strategy in Problem P1c may not be optimal for Problem P1b; and the optimal performance in Problem P1c may be strictly worse than the optimal performance in Problem P1b. We explore this point in more detail below.

One approach for establishing that agents can ignore parts of their private information that has been commonly used in prior literature on multi-agent/decentralized systems is the agent-by-agent (or person-by-person) approach [4, 55]. This approach works as follows: We start by fixing the strategies of all agents other than agent i to arbitrary choices and then show that agent i can make decisions based on a subset or a function of its private information without compromising performance. If this reduction in agent i's information holds for any arbitrary strategy of the other agents, we can conclude that this reduction would hold for globally optimal strategies as well. By repeating this argument for all agents, one can reduce the private information of all agents without losing performance. The problem with this approach is that it cannot accommodate the restriction to symmetric strategies.
The reduced-information based strategies obtained using this approach may or may not be symmetric. Thus, we cannot adopt this approach for reducing agents' private information in Problem P1b.

Another approach for reducing private information that has been used in some game-theoretic settings [54] involves the use of conditional probabilities of actions given reduced information. To see how this approach can be used, let us consider an arbitrary (possibly asymmetric) strategy pair (g^1, g^2) for the information structure of Problem P1b and define the following conditional probabilities for i = 1, 2:

P^{(g^1,g^2)}[U^i_t = u | X^i_t = x, C_t = c_t]. (4.17)

Note that (4.17) specifies a probability distribution on U for each x and c_t. Thus, it can be viewed as a valid strategy for agent i under the information structure of Problem P1c. This observation lets us define the following reduced-information strategies for the agents:

ḡ^i_t(x, c_t) := P^{(g^1,g^2)}[U^i_t = · | X^i_t = x, C_t = c_t], i = 1, 2. (4.18)

Further, it can be shown that the above construction ensures that the joint distributions of (X_t, U_t, C_t) under strategies (g^1, g^2) and (ḡ^1, ḡ^2) are the same for all t. This, in turn, implies that J(ḡ^1, ḡ^2) = J(g^1, g^2). This argument establishes that there is a reduced-information strategy pair with the same performance as an arbitrary full-information strategy pair. Thus, the optimal performance with reduced-information strategies must be the same as the optimal performance with full-information strategies for the information structure of Problem P1b.

We can try to use the above argument for symmetric strategy pairs. We start with an arbitrary symmetric strategy pair (g, g) in Problem P1b and use (4.18) to define a reduced-information strategy pair that achieves the same performance as (g, g).
The problem with this argument is that even though we started with a symmetric strategy pair (g, g), the reduced-information strategy pair constructed by (4.18) need not be symmetric. Hence, this reduced-information strategy pair may not be a valid solution for Problem P1c. We illustrate this point in the following example.

Example 2: Consider a setting where there is no global state, the action space is U = {a, b}, and the local states are i.i.d. (across time and across agents). Each local state is a Bernoulli(1/2) random variable. Consider the symmetric strategy pair (g, g) for Problem P1b where g_1 (the strategy at t = 1) is:

g_1(u^i_1 = a | x^i_1) = α if x^i_1 = 0, and β if x^i_1 = 1, (4.19)

where 0 ≤ α, β ≤ 1. And g_2 (the strategy at t = 2) is:

g_2(u^i_2 = a | x^i_1, x^i_2, u^1_1, u^2_1) = α if x^i_1 = x^i_2, and β if x^i_1 ≠ x^i_2. (4.20)

We now use (4.18) to define a reduced-information strategy. Even though we started with a symmetric strategy pair for the two agents, the conditional probability on the right-hand side of (4.18) may be different for the two agents. To see this, consider t = 2, C_2 = (U^1_1, U^2_1) = (a, b) and X^i_2 = 0. Then, for agent 1:

P^{(g,g)}(U^1_2 = a | X^1_2 = 0, U^1_1 = a, U^2_1 = b)
= P^{(g,g)}(U^1_2 = a, X^1_1 = 0 | X^1_2 = 0, U^1_1 = a, U^2_1 = b) + P^{(g,g)}(U^1_2 = a, X^1_1 = 1 | X^1_2 = 0, U^1_1 = a, U^2_1 = b)
= α P^{(g,g)}(X^1_1 = 0 | X^1_2 = 0, U^1_1 = a, U^2_1 = b) + β P^{(g,g)}(X^1_1 = 1 | X^1_2 = 0, U^1_1 = a, U^2_1 = b)
= α · α/(α + β) + β · β/(α + β) = (α^2 + β^2)/(α + β). (4.21)

On the other hand, a similar calculation for agent 2 shows that:

P^{(g,g)}(U^2_2 = a | X^2_2 = 0, U^1_1 = a, U^2_1 = b) = α · (1 − α)/(2 − α − β) + β · (1 − β)/(2 − α − β) = (α + β − α^2 − β^2)/(2 − α − β). (4.22)

The expressions in (4.21) and (4.22) are clearly different. For example, with α = 1/4 and β = 1/2, (4.21) evaluates to 5/12 while (4.22) evaluates to 7/20. Thus, the reduced-information strategies constructed by (4.18) are not symmetric and, therefore, invalid for Problem P1c.
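The two conditional probabilities in Example 2 can be verified by direct enumeration over the unobserved first state X^i_1. Since the two agents' local states and randomizations are independent, conditioning on the other agent's first action does not change agent i's posterior, so it suffices to condition on the agent's own (U^i_1, X^i_2). A minimal sketch in exact rational arithmetic (the function name is our own):

```python
from fractions import Fraction

def cond_prob_u2_a(alpha, beta, own_u1):
    """P(U^i_2 = a | X^i_2 = 0, U^i_1 = own_u1), enumerating the hidden X^i_1.

    g_1 plays 'a' w.p. alpha if x^i_1 = 0 and w.p. beta if x^i_1 = 1;
    g_2 plays 'a' w.p. alpha if x^i_1 = x^i_2 and w.p. beta otherwise.
    Here X^i_2 = 0, and X^i_1 is Bernoulli(1/2) a priori.
    """
    num, den = Fraction(0), Fraction(0)
    for x1 in (0, 1):
        p_a = alpha if x1 == 0 else beta             # P(U^i_1 = a | x1)
        likelihood = p_a if own_u1 == "a" else 1 - p_a
        weight = Fraction(1, 2) * likelihood         # prior times likelihood
        p_u2_a = alpha if x1 == 0 else beta          # x1 == x2 iff x1 == 0
        num += weight * p_u2_a
        den += weight
    return num / den

alpha, beta = Fraction(1, 4), Fraction(1, 2)
p_agent1 = cond_prob_u2_a(alpha, beta, "a")  # agent 1 played a at t = 1
p_agent2 = cond_prob_u2_a(alpha, beta, "b")  # agent 2 played b at t = 1
# p_agent1 = 5/12 and p_agent2 = 7/20, matching (4.21) and (4.22)
```

The mismatch between the two values confirms numerically that the reduced-information strategies constructed by (4.18) are asymmetric even when the original pair (g, g) is symmetric.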
4.4.1 Special cases In this section, we present two special cases under which Problems P1b and P1c can be shown to be equivalent, i.e., we can show that an optimal strategy for Problem P1c is also optimal for Problem P1b. 4.4.1.1 Specialized cost We assume that the cost function at each time t is non-negative, i.e., kt(X0 t , X1 t , X2 t , U1 t , U2 t ) ≥ 0. Further, we assume that for each possible local state x i of agent i there exists an action m(x i ) such that kt(x 0 , x1 , x2 , m(x 1 ), m(x 2 )) = 0 for all x 0 ∈ X 0 . An example of such a cost function is kt(X0 t , X1 t , X2 t , U1 t , U2 t ) = (X1 t − U 1 t ) 2 [(X2 t − U 2 t ) 2 + 1] + (X2 t − U 2 t ) 2 , where the states and actions are integer-valued. Recall that in Problem P1b the prescription space at time t is the space of functions from X t to ∆(U) and in Problem P1c the prescription space is the space of functions from X to ∆(U). Using the dynamic programs for Problems P1b and P1c, we can show that optimal prescriptions in both problems effectively coincide with the mapping m from X to U 2 . Theorem 4.2. The value functions for the coordinator’s dynamic programs in Problems P1b and P1c can be written as follows: For t ≤ T and for any realization π 1 t , π2 t , x0 t of Π1 t , Π2 t , X0 t , Vt(π 1 t , π2 t , x0 t ) := min γt∈Bt Qt(π 1 t , π2 t , x0 t , γt), (4.23) where the function Q satisfies Qt(π 1 t , π2 t , x0 t , γt) ≥ Qt(π 1 t , π2 t , x0 t , m) = 0, (4.24) 2With a slight abuse of notation, the function m from X to U can be viewed as a deterministic prescription from X to ∆(U) or from X t to ∆(U). 45 List of Figures Consequently, the coordinator’s optimal prescription is m at each time. Proof. See Appendix C.2. Since the coordinator’s optimal strategy is identical in Problems P1b and P1c, it follows that the optimal symmetric strategy for the agents in the two problems is also the same, namely U i t = m(Xi t ). 
4.4.1.2 Specialized dynamics We consider a specialized dynamics where the local states Xi 1:T , i = 1, 2, are i.i.d. uncontrolled random variables with probability distribution α and there is no global state. The following theorem shows the equivalence between Problems P1b and P1c in terms of optimal performance and strategies. Theorem 4.3. The optimal performance in Problem P1c is the same as the optimal performance in Problem P1b. Further, the optimal symmetric strategy for Problem P1c is optimal for Problem P1b as well. Proof. See Appendix C.3. In summary, for the specialized cases described above, one can reduce the private information of the agents without losing performance, even with the restriction to symmetric strategies. 4.5 Conclusion In this chapter, we focused on designing symmetric strategies to optimize a finite horizon team objective. We started with a general information structure and then considered some special cases. We showed in a simple example that randomized symmetric strategies may outperform deterministic symmetric strategies. We also discussed why some of the known approaches for reducing agents’ private information in teams may not work under the constraint of symmetric strategies. We modified the common information approach to obtain optimal symmetric strategies for the agents. This resulted in a common information based dynamic program whose complexity depends in large part on the size of the private information space. We presented two specialized models where private information can be reduced using simple dynamic program based arguments. 46 Chapter 5 Thompson sampling for linear quadratic mean-field teams 5.1 Introduction Linear dynamical systems with a quadratic cost (henceforth referred to as LQ systems) are one of the most commonly used modeling framework in Systems and Control. 
Part of the appeal of LQ models is that the optimal control action in such models is a linear or affine function of the state; therefore, the optimal policy is easy to identify and easy to implement. Broadly speaking, the regret of three classes of learning algorithms have been analyzed in the literature: Optimism in the face of uncertainty (OFU) based algorithms, certainty equivalence (CE) based algorithms, and Thompson sampling (TS) based algorithms. OFU-based algorithms are inspired by the OFU principle for multi-armed bandits [56]. Starting with the work of [57, 58], most of the papers following this approach [59–61] provide a high probability bound on regret. As an illustrative example, it is shown in [61] that, with high probability, the regret of a OFU-based learning algorithm is O˜(d 0.5 x (dx +du) √ T), where dx is the dimension of the state, du is the dimension of the controls, T is the time horizon, and the O˜(·) notation hides logarithmic terms in T. Certainty equivalence (CE) is a classical adaptive control algorithm in Systems and Control [62, 63]. Most papers following this approach [64–67] also provide a high probability bound on regret. As an illustrative example, it is shown in [67] that, with high probability, the regret of a CE-based algorithm is O˜(d 0.5 x du √ T + d 2 x ). Thompson sampling (TS) based algorithms are inspired by TS algorithm for multi-armed bandits [68]. Most papers following this approach [69–71] establish a bound on the expected Bayesian regret. As an illustrative example, [70] shows that the regret of a TS-based algorithm is O˜(d 0.5 x (dx + du) √ T). 47 List of Figures Two aspects of these regret bounds are important: the dependence on the time horizon T and the dependence on the dimensions (dx, du) of the state and the controls. For all classes of algorithms mentioned above, the dependence on the time horizon is O˜( √ T). 
Moreover, there are multiple papers which show that, under different assumptions, the regret is lower bounded by Ω(√ T) [67, 72]. So, the time dependence in the available regret bounds is nearly order optimal. Similarly, even though the dependence of the regret bound on the dimensions of the state and the control varies slightly for each class of algorithms, [67] recently showed that the regret is lower bounded by Ω( ˜ d 0.5 x du √ T). So, there is only a small scope of improvement in the dimension dependence in the regret bounds. The dependence of the regret bounds on the dimensions of the state and controls is critical for applications such as formation control of robotic swarms and demand response in power grids which have large numbers of agents (which can be of the order of 103 to 105 ). In such systems, the effective dimension of the state and the controls is ndx and ndu, where n is the number of agents and dx and du are the dimensions of the state and controls of each agent. Therefore, if we take the regret bound of, say, the OFU algorithm proposed in [61], the regret is O˜(n 1.5d 0.5 x (dx + du) √ T). Similar scaling with n holds for CE- and TS-based algorithms. The polynomial dependence on the number of agents is prohibitive and, because of it, the standard regret bounds are of limited value for large-scale systems. There are many papers in the literature on the design of large-scale systems which exploit some structural property of the system to develop low-complexity design algorithms [73–78]. However, there has been very little investigation on the role of such structural properties in developing and analyzing learning algorithms. Our main contribution is to show that by carefully exploiting the structure of the model, it is possible to design learning algorithms for large-scale LQ systems where the regret does not grow polynomially in the number of agents. 
In particular, we investigate mean-field coupled control systems, which have emerged as a popular modeling framework in multiple research communities including control systems, economics, finance, and statistical physics [14–18]. These models are used in applications ranging from demand response in smart grids to large-scale communication networks, UAVs, financial markets, and many others. We refer the reader to [79] for a survey. There has been considerable interest in reinforcement learning for such models [80–85], but all of these papers focus on identifying asymptotically optimal policies and do not characterize regret. Our main contribution is to design a TS-based algorithm for mean-field teams (a specific mean-field model proposed in [77, 78]) and show that the regret scales as Õ(|M|^{1.5} d_x^{0.5}(d_x + d_u)√T), where |M| is the number of types. We would like to highlight that although we focus on a TS-based algorithm in this paper, it will be clear from the derivation that it is possible to develop OFU- and CE-based algorithms with similar regret bounds. Thus, the main takeaway message of our paper is that there is significant value in developing learning algorithms which exploit the structure of the model.

5.2 Background on mean-field teams

5.2.1 Mean-field teams model

We start by describing a slight generalization of the basic model of mean-field teams proposed in [77, 78]. Mean-field teams are also called cooperative mean-field games or mean-field control in the literature [86]. Consider a system with a large population of agents. The agents are heterogeneous and have multiple types. Let M = {1, ..., M} denote the set of types of agents, N^m, m ∈ M, denote the set of all agents of type m, and N = ∪_{m∈M} N^m denote the set of all agents.

States, actions, and their mean-fields. Agents of the same type have the same state and action spaces.
In particular, the state and control action of agents of type m take values in R^{d_x^m} and R^{d_u^m}, respectively. For any generic agent i ∈ N^m of type m, we use x^i_t ∈ R^{d_x^m} and u^i_t ∈ R^{d_u^m} to denote its state and control action at time t. We use x_t = (x^i_t)_{i∈N} and u_t = (u^i_t)_{i∈N} to denote the global state and control actions of the system at time t.

The empirical mean-field (x̄^m_t, ū^m_t) of agents of type m, m ∈ M, is defined as the empirical mean of the states and actions of all agents of that type, i.e.,

x̄^m_t = (1/|N^m|) Σ_{i∈N^m} x^i_t  and  ū^m_t = (1/|N^m|) Σ_{i∈N^m} u^i_t.

The empirical mean-field (x̄_t, ū_t) of the entire population is given by x̄_t = (x̄^1_t, ..., x̄^{|M|}_t) and ū_t = (ū^1_t, ..., ū^{|M|}_t).

As an example, consider the temperature control of a multi-storied office building. In this case, N represents the set of rooms, M represents the set of floors, N^m represents all rooms on floor m, x^i_t represents the temperature in room i, x̄^m_t represents the average temperature on floor m, and x̄_t represents the collection of average temperatures on each floor. Similarly, u^i_t represents the heat exchanged by the air-conditioner in room i, ū^m_t represents the average heat exchanged by the air-conditioners on floor m, and ū_t represents the collection of average heat exchanged on each floor of the building.

System dynamics and per-step cost. The system starts at a random initial state x_1 = (x^i_1)_{i∈N}, whose components are independent across agents. For agent i of type m, the initial state is x^i_1 ∼ N(0, X^i_1), and at time t ≥ 1 the state evolves according to

x^i_{t+1} = A^m x^i_t + B^m u^i_t + D^m x̄_t + E^m ū_t + w^i_t + v^m_t + F^m v^0_t,   (5.1)

where A^m, B^m, D^m, E^m, F^m are matrices of appropriate dimensions, and {w^i_t}_{t≥1}, {v^m_t}_{t≥1}, and {v^0_t}_{t≥1} are i.i.d. zero-mean Gaussian processes which are independent of each other and of the initial state.
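As a minimal sketch of the mean-field computation above, the following code builds the empirical mean-field of a small heterogeneous population; the type labels, population sizes, and dimensions are illustrative values, not from the model.

```python
import numpy as np

rng = np.random.default_rng(0)
pop = {1: 4, 2: 6}                  # |N^m| for two illustrative types m = 1, 2
d_x = 3                             # common per-agent state dimension

# states[m] stacks the local states x^i_t of type m as an (|N^m|, d_x) array
states = {m: rng.normal(size=(n_m, d_x)) for m, n_m in pop.items()}

# Empirical mean-field of each type: xbar^m_t = (1/|N^m|) sum_i x^i_t
xbar = {m: s.mean(axis=0) for m, s in states.items()}

# Mean-field of the entire population: the tuple (xbar^1_t, ..., xbar^|M|_t)
xbar_all = np.concatenate([xbar[m] for m in sorted(pop)])
```

Each type contributes one d_x-dimensional block to the population mean-field, so `xbar_all` has dimension |M|·d_x.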
In particular, w^i_t ∈ R^{d_x^m}, v^m_t ∈ R^{d_x^m}, and v^0_t ∈ R^{d_v^0}, with w^i_t ∼ N(0, W^i), v^m_t ∼ N(0, V^m), and v^0_t ∼ N(0, V^0). Eq. (5.1) implies that all agents of type m have similar dynamical couplings. The next state of agent i of type m depends on its current local state and control action and the current mean-field of the states and control actions of the system, and is influenced by three independent noise processes: a local noise process {w^i_t}_{t≥1}, a noise process {v^m_t}_{t≥1} which is common to all agents of type m, and a global noise process {v^0_t}_{t≥1} which is common to all agents.

At each time step, the system incurs a quadratic cost c(x_t, u_t) given by

c(x_t, u_t) = x̄_t^⊺ Q̄ x̄_t + ū_t^⊺ R̄ ū_t + Σ_{m∈M} (1/|N^m|) Σ_{i∈N^m} [ (x^i_t)^⊺ Q^m x^i_t + (u^i_t)^⊺ R^m u^i_t ].   (5.2)

Thus, there is a weak coupling in the cost of the agents through the mean-field.

Admissible policies and performance criterion. There is a system operator who has access to the states and control actions of all agents and chooses the control actions according to a deterministic or randomized policy

u_t = π_t(x_{1:t}, u_{1:t−1}).   (5.3)

Let θ = (θ^m)_{m∈M}, where (θ^m)^⊺ = [A^m, B^m, D^m, E^m, F^m], denote the parameters of the system dynamics. The performance of any policy π = (π_1, π_2, ...) is given by

J(π; θ) = limsup_{T→∞} (1/T) E[ Σ_{t=1}^T c(x_t, u_t) ].   (5.4)

Let J(θ) denote the minimum of J(π; θ) over all policies. We are interested in the setup where the system dynamics θ are unknown and there is a prior p on θ. The Bayesian regret of a policy π operating for a horizon T is defined as

R(T; π) := E^π[ Σ_{t=1}^T c(x_t, u_t) − T J(θ) ],   (5.5)

where the expectation is with respect to the prior on θ, the noise processes, the initial conditions, and any potential randomization done by the policy π.

5.2.2 Planning solution for mean-field teams

In this section, we summarize the planning solution of mean-field teams presented in [77, 78] for a known system model.
Define the following matrices:

Ā = diag(A^1, ..., A^{|M|}) + rows(D^1, ..., D^{|M|}),
B̄ = diag(B^1, ..., B^{|M|}) + rows(E^1, ..., E^{|M|}),

and let Q̄̄ = diag(Q^1, ..., Q^{|M|}) + Q̄ and R̄̄ = diag(R^1, ..., R^{|M|}) + R̄. It is assumed that the system satisfies the following:

(A1) Q̄̄ > 0 and R̄̄ > 0. Moreover, for every m ∈ M, Q^m > 0 and R^m > 0.

(A2) The system (Ā, B̄) is stabilizable.¹ Moreover, for every m ∈ M, the system (A^m, B^m) is stabilizable.

Now, consider the following |M| + 1 discrete-time algebraic Riccati equations (DARE):²

S̆^m = DARE(A^m, B^m, Q^m, R^m),  m ∈ M,   (5.6a)
S̄ = DARE(Ā, B̄, Q̄̄, R̄̄).   (5.6b)

Moreover, define

L̆^m = −((B^m)^⊺ S̆^m B^m + R^m)^{−1} (B^m)^⊺ S̆^m A^m,  m ∈ M,   (5.7a)
L̄ = −(B̄^⊺ S̄ B̄ + R̄̄)^{−1} B̄^⊺ S̄ Ā,   (5.7b)

and let rows(L̄^1, ..., L̄^{|M|}) = L̄. Finally, define w̄^m_t = (1/|N^m|) Σ_{i∈N^m} w^i_t, w̄_t = (w̄^1_t, ..., w̄^{|M|}_t), and v̄_t = (v^1_t, ..., v^{|M|}_t). Let W̆^m = (1/|N^m|) Σ_{i∈N^m} var(w^i_t − w̄^m_t) and W̄ = var(w̄_t) + diag(V^1, ..., V^{|M|}) + diag(F^1 V^0, ..., F^{|M|} V^0). Note that since the noise processes are i.i.d., these covariances do not depend on time.

Now, split the state x^i_t of agent i of type m into two parts: the mean-field state x̄^m_t and the relative state x̆^i_t = x^i_t − x̄^m_t. Do a similar split of the controls: u^i_t = ū^m_t + ŭ^i_t. Since Σ_{i∈N^m} x̆^i_t = 0 and Σ_{i∈N^m} ŭ^i_t = 0, the per-step cost (5.2) can be written as

c(x_t, u_t) = c̄(x̄_t, ū_t) + Σ_{m∈M} (1/|N^m|) Σ_{i∈N^m} c̆^m(x̆^i_t, ŭ^i_t),   (5.8)

where c̄(x̄_t, ū_t) = x̄_t^⊺ Q̄̄ x̄_t + ū_t^⊺ R̄̄ ū_t and c̆^m(x̆^i_t, ŭ^i_t) = (x̆^i_t)^⊺ Q^m x̆^i_t + (ŭ^i_t)^⊺ R^m ŭ^i_t. Moreover, the dynamics of the mean-field and the relative components of the state are:

x̄_{t+1} = Ā x̄_t + B̄ ū_t + w̄_t + v̄_t + F̄ v^0_t,   (5.9)

¹ System matrices (A, B) are said to be stabilizable if there exists a gain matrix L such that all eigenvalues of A + BL are strictly inside the unit circle.
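As a hedged sketch of the computation in (5.6)–(5.7) for one type, the following code solves a DARE by iterating the Riccati recursion to a fixed point and then forms the corresponding gain; the matrices and iteration count are arbitrary illustrative choices, not values from the model.

```python
import numpy as np

# Illustrative stabilizable pair and positive-definite weights
A = np.array([[1.0, 0.1], [0.0, 0.9]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2)
R = np.eye(1)

# Fixed-point iteration of S <- A'SA - (A'SB)(R + B'SB)^{-1}(B'SA) + Q
S = Q.copy()
for _ in range(2000):
    K = np.linalg.solve(R + B.T @ S @ B, B.T @ S @ A)
    S = A.T @ S @ A - A.T @ S @ B @ K + Q

# Gain as in (5.7): L = -(B'SB + R)^{-1} B'SA
L = -np.linalg.solve(B.T @ S @ B + R, B.T @ S @ A)

# Under stabilizability, A + BL has all eigenvalues inside the unit circle
rho = max(abs(np.linalg.eigvals(A + B @ L)))
```

The same routine applied to (Ā, B̄, Q̄̄, R̄̄) gives S̄ and L̄ in (5.6b) and (5.7b).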
² For stabilizable (A, B) and Q > 0, DARE(A, B, Q, R) is the unique positive semidefinite solution of S = A^⊺ S A − (A^⊺ S B)(R + B^⊺ S B)^{−1}(B^⊺ S A) + Q.

where F̄ = diag(F^1, ..., F^{|M|}) and, for any agent i of type m,

x̆^i_{t+1} = A^m x̆^i_t + B^m ŭ^i_t + w̆^i_t,   (5.10)

where w̆^i_t = w^i_t − w̄^m_t. The result below follows from [78, Theorem 6]³:

Theorem 5.1. Under assumptions (A1) and (A2), the optimal policy for minimizing the cost (5.4) is given by

u^i_t = L̆^m x̆^i_t + L̄^m x̄_t.   (5.11)

Furthermore, the optimal performance is given by

J(θ) = Σ_{m∈M} Tr(W̆^m S̆^m) + Tr(W̄ S̄).   (5.12)

Interpretation of the planning solution. Note that ū_t = L̄ x̄_t is the optimal control for the mean-field system with dynamics (5.9) and per-step cost c̄(x̄_t, ū_t). Moreover, for agent i of type m, ŭ^i_t = L̆^m x̆^i_t is the optimal control for the relative system with dynamics (5.10) and per-step cost c̆^m(x̆^i_t, ŭ^i_t). Theorem 5.1 shows that for every agent i of type m, we can consider the two decoupled systems (the mean-field system and the relative system), solve them separately, and then simply add their respective controls, ū^m_t and ŭ^i_t, to obtain the optimal control action of agent i in the original mean-field team system. We will exploit this feature of the planning solution in order to develop a learning algorithm for mean-field teams.

5.3 Learning for mean-field teams

For ease of exposition, we describe the algorithm for the special case when all types have the same dimensions (i.e., d_x^m = d_x and d_u^m = d_u for all m ∈ M) and the same number of agents (i.e., |N^m| = n for all m ∈ M). We further assume that d_v^0 = d_x and F^m = I. Moreover, we assume the noise covariances are given as W^i = σ_w² I, i ∈ N, V^m = σ_v² I, m ∈ M, and V^0 = σ_{v0}² I. These assumptions are not strictly needed for the analysis, but we impose them because, under these assumptions, the covariance matrices Σ̄ and Σ̆^m are scaled identity matrices.
In particular, for any m ∈ M, Σ̆^m = (1 − 1/n) σ_w² I =: σ̆² I and Σ̄ = (σ_w²/n + σ_v² + σ_{v0}²) I =: σ̄² I. This simpler form of the covariance matrices simplifies the description of the algorithm and the regret bounds.

Following the decomposition presented in Sec. 5.2.2, we define θ̄^⊺ = [Ā, B̄] to be the parameters of the mean-field dynamics (5.9) and (θ̆^m)^⊺ = [A^m, B^m] to be the parameters of the relative dynamics (5.10). We let S̆^m(θ̆^m) and S̄(θ̄) denote the solutions to the Riccati equations (5.6) and L̆^m(θ̆^m) and L̄(θ̄) denote the corresponding gains (5.7). Let J̆^m(θ̆^m) = σ̆² Tr(S̆^m(θ̆^m)) and J̄(θ̄) = σ̄² Tr(S̄(θ̄)) denote the performance of the m-th relative system and the mean-field system, respectively. As shown in Theorem 5.1,

J(θ) = Σ_{m∈M} J̆^m(θ̆^m) + J̄(θ̄).   (5.13)

Prior and posterior beliefs: We assume that the unknown parameters θ̆^m, m ∈ M, lie in compact subsets Θ̆^m of R^{(d_x+d_u)×d_x}. Similarly, θ̄ lies in a compact subset Θ̄ of R^{|M|(d_x+d_u)×|M|d_x}. Let θ̆^m(ℓ) denote the ℓ-th column of θ̆^m; thus θ̆^m = cols(θ̆^m(1), ..., θ̆^m(d_x)). Similarly, let θ̄(ℓ) denote the ℓ-th column of θ̄; thus θ̄ = cols(θ̄(1), ..., θ̄(|M|d_x)). We use N(μ, Σ) to denote the Gaussian distribution with mean μ and covariance Σ, and p|_Θ to denote the projection of a probability distribution p on the set Θ. We assume that the priors p̄_1 and p̆^m_1, m ∈ M, on θ̄ and θ̆^m, respectively, satisfy the following:

(A3) p̄_1 is given as p̄_1(θ̄) = [ Π_{ℓ=1}^{|M|d_x} λ̄^ℓ_1(θ̄(ℓ)) ]|_Θ̄, where for ℓ ∈ {1, ..., |M|d_x}, λ̄^ℓ_1 = N(μ̄_1(ℓ), Σ̄_1) with mean μ̄_1(ℓ) ∈ R^{|M|(d_x+d_u)} and positive-definite covariance Σ̄_1 ∈ R^{|M|(d_x+d_u)×|M|(d_x+d_u)}.

(A4) p̆^m_1 is given as p̆^m_1(θ̆^m) = [ Π_{ℓ=1}^{d_x} λ̆^{m,ℓ}_1(θ̆^m(ℓ)) ]|_Θ̆^m, where for ℓ ∈ {1, ...
, d_x}, λ̆^{m,ℓ}_1 = N(μ̆^m_1(ℓ), Σ̆^m_1) with mean μ̆^m_1(ℓ) ∈ R^{d_x+d_u} and positive-definite covariance Σ̆^m_1 ∈ R^{(d_x+d_u)×(d_x+d_u)}.

These assumptions are similar to the assumptions on the prior in the recent literature on TS for LQ systems [69, 70]. Following the discussion after Theorem 5.1, we maintain separate posterior distributions on θ̄ and θ̆^m, m ∈ M. In particular, we maintain a posterior distribution p̄_t on θ̄ based on the mean-field state and action history as follows: for any Borel subset B of R^{|M|(d_x+d_u)×|M|d_x},

p̄_t(B) = P(θ̄ ∈ B | x̄_{1:t}, ū_{1:t−1}).   (5.14)

For every m ∈ M, we also maintain a separate posterior distribution p̆^m_t on θ̆^m as follows. At each time t > 1, we select an agent j^m_{t−1} ∈ N^m as argmax_{i∈N^m} (z̆^i_{t−1})^⊺ Σ̆^m_{t−1} z̆^i_{t−1}, where Σ̆^m_{t−1} is a covariance matrix defined recursively by (5.18b). Then, for any Borel subset B of R^{(d_x+d_u)×d_x},

p̆^m_t(B) = P(θ̆^m ∈ B | {x̆^{j^m_s}_s, ŭ^{j^m_s}_s, x̆^{j^m_s}_{s+1}}_{1≤s<t}).   (5.15)

See the supplementary file of [87] for a discussion of the rule used to select j^m_{t−1}. For ease of notation, we use z̄_t = (z̄^1_t, ..., z̄^{|M|}_t), where z̄^m_t = (x̄^m_t, ū^m_t), and z̆^i_t = (x̆^i_t, ŭ^i_t). Then, we can write the dynamics (5.9)–(5.10) of the mean-field and the relative systems as

x̄_{t+1} = θ̄^⊺ z̄_t + w̄_t + v̄_t + v^0_t,   (5.16a)
x̆^i_{t+1} = (θ̆^m)^⊺ z̆^i_t + w̆^i_t,  ∀i ∈ N^m, m ∈ M.   (5.16b)

Recall that σ̄² = σ_w²/n + σ_v² + σ_{v0}² and σ̆² = (1 − 1/n) σ_w².

Lemma 5.1. The posterior distributions are as follows:

1. The posterior on θ̄ is p̄_t = [ Π_{ℓ=1}^{|M|d_x} λ̄^ℓ_t(θ̄(ℓ)) ]|_Θ̄, where for ℓ ∈ {1, ..., |M|d_x}, λ̄^ℓ_t = N(μ̄_t(ℓ), Σ̄_t), and

μ̄_{t+1}(ℓ) = μ̄_t(ℓ) + [ Σ̄_t z̄_t (x̄_{t+1}(ℓ) − μ̄_t(ℓ)^⊺ z̄_t) ] / [ σ̄² + z̄_t^⊺ Σ̄_t z̄_t ],   (5.17a)
Σ̄_{t+1}^{−1} = Σ̄_t^{−1} + (1/σ̄²) z̄_t z̄_t^⊺.   (5.17b)

2. The posterior on θ̆^m, m ∈ M, at time t is p̆^m_t(θ̆^m) = [ Π_{ℓ=1}^{d_x} λ̆^{m,ℓ}_t(θ̆^m(ℓ)) ]|_Θ̆^m, where for ℓ ∈ {1, ...
, d_x}, λ̆^{m,ℓ}_t = N(μ̆^m_t(ℓ), Σ̆^m_t), and

μ̆^m_{t+1}(ℓ) = μ̆^m_t(ℓ) + [ Σ̆^m_t z̆^{j^m_t}_t (x̆^{j^m_t}_{t+1}(ℓ) − μ̆^m_t(ℓ)^⊺ z̆^{j^m_t}_t) ] / [ σ̆² + (z̆^{j^m_t}_t)^⊺ Σ̆^m_t z̆^{j^m_t}_t ],   (5.18a)
(Σ̆^m_{t+1})^{−1} = (Σ̆^m_t)^{−1} + (1/σ̆²) z̆^{j^m_t}_t (z̆^{j^m_t}_t)^⊺.   (5.18b)

Proof. Note that the dynamics of x̄_t and x̆^i_t in (5.16) are linear and the noises w̄_t + v̄_t + v^0_t and w̆^i_t are Gaussian. Therefore, the result follows from standard results in Gaussian linear regression [88].

The Thompson sampling algorithm: We propose a Thompson sampling algorithm, referred to as TSDE-MF, which is inspired by the TSDE (Thompson sampling with dynamic episodes) algorithm proposed in [69, 70] and by the structure of the optimal planning solution for mean-field teams described in Sec. 5.2.2. The TSDE-MF algorithm consists of a coordinator C and |M| + 1 actors: a mean-field actor Ā and a relative actor Ă^m for each m ∈ M. These actors are described below, and the whole algorithm is presented in Algorithm 2.

• At each time, the coordinator C observes the current global state (x^i_t)_{i∈N}, computes the mean-field state x̄_t and the relative states (x̆^i_t)_{i∈N}, and sends the mean-field state x̄_t to the mean-field actor Ā and the relative states x̆^m_t = (x̆^i_t)_{i∈N^m} of all the agents of type m to the relative actor Ă^m. The mean-field actor Ā computes the mean-field control ū_t and the relative actor Ă^m computes the relative control ŭ^m_t = (ŭ^i_t)_{i∈N^m} (as per the details presented below) and sends it back to the coordinator C. The coordinator then computes and executes the control action u^i_t = ū^m_t + ŭ^i_t for each agent i of type m.

• The mean-field actor Ā maintains the posterior p̄_t on θ̄ according to (5.17). The actor works in episodes of dynamic length. Let t̄_k and T̄_k denote the start and the length of episode k, respectively.
Episode k ends if the determinant of the covariance Σ̄_t falls below half of its value at the beginning of the episode (i.e., det(Σ̄_t) < 0.5 det(Σ̄_{t̄_k})) or if the length of the episode is one more than the length of the previous episode (i.e., t − t̄_k > T̄_{k−1}). Thus,

t̄_{k+1} = min{ t > t̄_k : det(Σ̄_t) < 0.5 det(Σ̄_{t̄_k}) or t − t̄_k > T̄_{k−1} }.   (5.19)

At the beginning of episode k, the mean-field actor Ā samples a parameter θ̄_k from the posterior distribution p̄_t. During episode k, the mean-field actor Ā generates the mean-field controls using the sample θ̄_k, i.e., ū_t = L̄(θ̄_k) x̄_t.

• Each relative actor Ă^m is similar to the mean-field actor. Actor Ă^m maintains the posterior p̆^m_t on θ̆^m according to (5.18). The actor works in episodes of dynamic length. The episodes of each relative actor Ă^m and the mean-field actor Ā are separate from each other.⁴ Let t̆^m_k and T̆^m_k denote the start and length of episode k, respectively. The termination condition for each episode is similar to that of the mean-field actor Ā. In particular,

t̆^m_{k+1} = min{ t > t̆^m_k : det(Σ̆^m_t) < 0.5 det(Σ̆^m_{t̆^m_k}) or t − t̆^m_k > T̆^m_{k−1} }.   (5.20)

At the beginning of episode k, the relative actor Ă^m samples a parameter θ̆^m_k from the posterior distribution p̆^m_t. During episode k, the relative actor Ă^m generates the relative controls using the sample θ̆^m_k, i.e., ŭ^m_t = (L̆^m(θ̆^m_k) x̆^i_t)_{i∈N^m}.

⁴ We use the episode count k as a local variable which is different for each actor.

Algorithm 2 TSDE-MF
1: initialize mean-field actor: Θ̄, (μ̄_1, Σ̄_1), t̄_0 = 0, T̄_{−1} = 0, k = 0
2: initialize relative-actor-m: Θ̆^m, (μ̆^m_1, Σ̆^m_1), t̆^m_0 = 0, T̆^m_{−1} = 0, k = 0
3: for t = 1, 2, . . .
do
4:   observe (x^i_t)_{i∈N}
5:   compute x̄_t and (x̆^m_t)_{m∈M}
6:   ū_t ← mean-field-actor(x̄_t)
7:   for m ∈ M do
8:     ŭ^m_t ← relative-actor-m(x̆^m_t)
9:     for i ∈ N^m do
10:      agent i applies control u^i_t = ū^m_t + ŭ^i_t

1: function mean-field-actor(x̄_t)
2:   global var t
3:   update p̄_t according to (5.17)
4:   if t − t̄_k > T̄_{k−1} or det(Σ̄_t) < 0.5 det(Σ̄_{t̄_k}) then
5:     T̄_k ← t − t̄_k, k ← k + 1, t̄_k ← t
6:     sample θ̄_k ∼ p̄_t
7:     L̄ ← L̄(θ̄_k)
8:   return L̄ x̄_t

1: function relative-actor-m((x̆^i_t)_{i∈N^m})
2:   global var t
3:   update p̆^m_t according to (5.18)
4:   if t − t̆^m_k > T̆^m_{k−1} or det(Σ̆^m_t) < 0.5 det(Σ̆^m_{t̆^m_k}) then
5:     T̆^m_k ← t − t̆^m_k, k ← k + 1, t̆^m_k ← t
6:     sample θ̆^m_k ∼ p̆^m_t
7:     L̆^m ← L̆^m(θ̆^m_k)
8:   return (L̆^m x̆^i_t)_{i∈N^m}

Note that the algorithm does not depend on the horizon T. A partially distributed version of the algorithm is presented in the conclusion.

Regret bounds: We make the following assumption to ensure that the closed-loop dynamics of the mean-field state and the relative states of each agent are stable. We use ∥·∥ to denote the induced norm of a matrix.

(A5) There exists δ ∈ (0, 1) such that
• for any θ̄, φ̄ ∈ Θ̄, where θ̄^⊺ = [Ā_θ̄, B̄_θ̄], we have ∥Ā_θ̄ + B̄_θ̄ L̄(φ̄)∥ ≤ δ;
• for any m ∈ M and θ̆^m, φ̆^m ∈ Θ̆^m, where (θ̆^m)^⊺ = [A_θ̆^m, B_θ̆^m], we have ∥A_θ̆^m + B_θ̆^m L̆^m(φ̆^m)∥ ≤ δ.

This assumption is similar to an assumption imposed in the literature on TS for LQ systems [70]. According to Theorem 11 of [67], the assumption is satisfied if

Θ̄ = {(Ā, B̄) : ∥Ā − Ā_0∥ ≤ ε̄, ∥B̄ − B̄_0∥ ≤ ε̄},
Θ̆^m = {(Ă^m, B̆^m) : ∥Ă^m − Ă^m_0∥ ≤ ε̆^m, ∥B̆^m − B̆^m_0∥ ≤ ε̆^m}

for stabilizable (Ā_0, B̄_0) and (Ă^m_0, B̆^m_0), and small constants ε̄, ε̆^m depending on the choice of (Ā_0, B̄_0) and (Ă^m_0, B̆^m_0). In other words, the assumption holds when the true system is in a small neighborhood of a known nominal system, and such a neighborhood can be learned with high probability by running some stabilizing procedure [67].
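As a minimal sketch of one actor's per-step work in Algorithm 2, the following code performs the rank-one Gaussian posterior update of (5.17)/(5.18) followed by the episode-termination test of (5.19)/(5.20). The function names and the use of the Sherman–Morrison identity for the covariance update are implementation choices, not part of the text.

```python
import numpy as np

def posterior_step(mu, Sigma, z, x_next, sigma2):
    """mu: (d, L), column ell is mu_t(ell); Sigma: (d, d) shared covariance;
    z: (d,) regressor z_t; x_next: (L,) observed next state; sigma2: noise
    variance per coordinate."""
    Sz = Sigma @ z
    denom = sigma2 + z @ Sz
    mu_new = mu + np.outer(Sz, x_next - mu.T @ z) / denom
    # (Sigma^{-1} + z z'/sigma2)^{-1} = Sigma - (Sigma z)(Sigma z)'/denom
    Sigma_new = Sigma - np.outer(Sz, Sz) / denom
    return mu_new, Sigma_new

def episode_over(t, t_k, T_prev, Sigma_t, det_start):
    """End the episode when det(Sigma_t) halves, or when the episode is one
    step longer than the previous one."""
    return bool(np.linalg.det(Sigma_t) < 0.5 * det_start or t - t_k > T_prev)

mu, Sigma = np.zeros((3, 2)), np.eye(3)
z, x_next = np.array([1.0, 0.5, -0.5]), np.array([0.2, -0.1])
mu1, Sigma1 = posterior_step(mu, Sigma, z, x_next, sigma2=1.0)
done = episode_over(t=2, t_k=0, T_prev=5, Sigma_t=Sigma1,
                    det_start=np.linalg.det(Sigma))
```

Here a single informative regressor already halves the covariance determinant, so the first episode ends immediately; with less informative data the length-doubling condition eventually triggers instead.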
The following result provides an upper bound on the regret of the proposed algorithm.

Theorem 5.2. Under (A1)–(A5), the regret of TSDE-MF is upper bounded as follows:

R(T; TSDE-MF) ≤ Õ( (σ̄² |M|^{1.5} + σ̆² |M|) d_x^{0.5} (d_x + d_u) √T ).

Recall that σ̄² = σ_w²/n + σ_v² + σ_{v0}² and σ̆² = (1 − 1/n) σ_w². So we can say that R(T; TSDE-MF) ≤ Õ(σ̄² |M|^{1.5} d_x^{0.5}(d_x + d_u)√T). Compared with the original TSDE regret Õ(n^{1.5} |M|^{1.5} √T), which scales super-linearly with the number of agents, the regret of the proposed algorithm is bounded by Õ(|M|^{1.5} √T), irrespective of the total number of agents. The following special cases are of interest:

• In the absence of common noise (i.e., σ_v² = σ_{v0}² = 0) and when n ≫ |M|, R(T; TSDE-MF) ≤ Õ(σ̆² |M| d_x^{0.5}(d_x + d_u)√T).

• For homogeneous systems (i.e., |M| = 1), we have R(T; TSDE-MF) ≤ Õ((σ̄² + σ̆²) d_x^{0.5}(d_x + d_u)√T). Thus, the scaling with the number of agents is Õ((1 + 1/n)√T).

Note that these results show that in mean-field systems with common noise the regret scales as O(|M|^{1.5}) in the number of types, while in mean-field systems without common noise the regret scales as O(|M|). Thus, the presence of common noise fundamentally changes the scaling of the learning algorithm.

5.4 Regret analysis

For ease of notation, we simply write R(T) instead of R(T; TSDE-MF) in this section. Eqs. (5.13) and (5.8) imply that the regret may be decomposed as

R(T) = R̄(T) + Σ_{m∈M} (1/n) Σ_{i∈N^m} R̆^{i,m}(T),   (5.21)

where

R̄(T) := E[ Σ_{t=1}^T c̄(x̄_t, ū_t) − T J̄(θ̄) ],  R̆^{i,m}(T) := E[ Σ_{t=1}^T c̆^m(x̆^i_t, ŭ^i_t) − T J̆^m(θ̆^m) ].

Note that R̄(T) is the regret associated with the mean-field system and R̆^{i,m}(T) is the regret of the i-th relative system of type m. Observe that the mean-field actor in our algorithm is essentially implementing the TSDE algorithm of [69, 70] for the mean-field system with dynamics (5.9) and per-step cost c̄(x̄_t, ū_t). This is because:

1.
As mentioned in the discussion after Theorem 5.1, we can view ū_t = L̄(θ̄) x̄_t as the optimal control action of the mean-field system.

2. The posterior distribution p̄_t on θ̄ depends only on (x̄_{1:t}, ū_{1:t−1}).

Thus, R̄(T) is precisely the regret of the TSDE algorithm analyzed in [70]. Therefore, we have the following.

Lemma 5.2. For the mean-field system,

R̄(T) ≤ Õ( σ̄² |M|^{1.5} d_x^{0.5} (d_x + d_u) √T ).   (5.22)

Unfortunately, we cannot use the same argument to bound R̆^{i,m}(T). Even though we can view ŭ^i_t = L̆^m(θ̆^m) x̆^i_t as the optimal control action of the LQ system with dynamics (5.10), the posterior p̆^m_t on θ̆^m depends on terms other than (x̆^i_{1:t}, ŭ^i_{1:t−1}). Therefore, we cannot directly use the results of [70] to bound R̆^{i,m}(T). In the rest of this section, we present a bound on R̆^{i,m}(T).

For ease of notation, for any episode k, we use L̆^m_k and S̆^m_k to denote L̆^m(θ̆^m_k) and S̆^m(θ̆^m_k). Recall that the relative value function for the average-cost LQ problem is x^⊺ S x, where S is the solution to the DARE. Therefore, at any time t in episode k, for agent i of type m in state x̆^i_t ∈ R^{d_x}, with ŭ^i_t = L̆^m_k x̆^i_t and z̆^i_t = (x̆^i_t, ŭ^i_t), the average-cost Bellman equation is

J̆^m(θ̆^m_k) + (x̆^i_t)^⊺ S̆^m_k x̆^i_t = c̆^m(x̆^i_t, ŭ^i_t) + E[ ((θ̆^m_k)^⊺ z̆^i_t + w̆^i_t)^⊺ S̆^m_k ((θ̆^m_k)^⊺ z̆^i_t + w̆^i_t) ].

Adding and subtracting E[(x̆^i_{t+1})^⊺ S̆^m_k x̆^i_{t+1} | z̆^i_t] and noting that x̆^i_{t+1} = (θ̆^m)^⊺ z̆^i_t + w̆^i_t, we get

c̆^m(x̆^i_t, ŭ^i_t) = J̆^m(θ̆^m_k) + (x̆^i_t)^⊺ S̆^m_k x̆^i_t − E[(x̆^i_{t+1})^⊺ S̆^m_k x̆^i_{t+1} | z̆^i_t] + ((θ̆^m)^⊺ z̆^i_t)^⊺ S̆^m_k ((θ̆^m)^⊺ z̆^i_t) − ((θ̆^m_k)^⊺ z̆^i_t)^⊺ S̆^m_k ((θ̆^m_k)^⊺ z̆^i_t).   (5.23)

Figure 5.1: R(T) vs T for TSDE-MF

Let K̆^m_T denote the number of episodes of the relative systems of type m until the horizon T. For each k > K̆^m_T, we define t̆^m_k to be T + 1.
Then, using (5.23), we have that for any agent i of type m,

R̆^{i,m}(T) = R̆^{i,m}_0(T) + R̆^{i,m}_1(T) + R̆^{i,m}_2(T),   (5.24)

where

R̆^{i,m}_0(T) := E[ Σ_{k=1}^{K̆^m_T} T̆^m_k J̆^m(θ̆^m_k) − T J̆^m(θ̆^m) ]  (regret due to sampling error),

R̆^{i,m}_1(T) := E[ Σ_{k=1}^{K̆^m_T} Σ_{t=t̆^m_k}^{t̆^m_{k+1}−1} ( (x̆^i_t)^⊺ S̆^m_k x̆^i_t − (x̆^i_{t+1})^⊺ S̆^m_k x̆^i_{t+1} ) ]  (regret due to the time-varying controller),

R̆^{i,m}_2(T) := E[ Σ_{k=1}^{K̆^m_T} Σ_{t=t̆^m_k}^{t̆^m_{k+1}−1} ( ((θ̆^m)^⊺ z̆^i_t)^⊺ S̆^m_k ((θ̆^m)^⊺ z̆^i_t) − ((θ̆^m_k)^⊺ z̆^i_t)^⊺ S̆^m_k ((θ̆^m_k)^⊺ z̆^i_t) ) ]  (regret due to model mismatch).

Lemma 5.3. The terms in (5.24) are bounded as follows:
1. R̆^{i,m}_0(T) ≤ Õ(σ̆² √((d_x + d_u) T)).
2. R̆^{i,m}_1(T) ≤ Õ(σ̆² √((d_x + d_u) T)).
3. R̆^{i,m}_2(T) ≤ Õ(σ̆² (d_x + d_u) √(d_x T)).

Proof. We provide an outline of the proof; see the supplementary file of [87] for complete details. The first term R̆^{i,m}_0(T) can be bounded using the basic property of Thompson sampling: for any measurable function f, E[f(θ̆^m_k)] = E[f(θ̆^m)], because θ̆^m_k is a sample from the posterior distribution on θ̆^m.

Figure 5.2: R(T)/√T vs T for TSDE-MF
Figure 5.3: TSDE-MF vs TSDE

The second term R̆^{i,m}_1(T) is a telescoping sum, which we can simplify to establish R̆^{i,m}_1(T) ≤ O(E[K̆^m_T (X̆^i_T)²]), where X̆^i_T = max_{1≤t≤T} ∥x̆^i_t∥ is the maximum norm of the relative state along the entire trajectory. The final bound on R̆^{i,m}_1(T) is obtained by bounding K̆^m_T and E[(X̆^i_T)²].

Using the sampling condition for p̆^m_t and an existing bound in the literature, we first establish that

R̆^{i,m}_2(T) ≤ √( E[ (X̆^i_T)² Σ_{t=1}^T (z̆^i_t)^⊺ Σ̆^m_t z̆^i_t ] ) × Õ(√T).

Then, we upper bound (z̆^i_t)^⊺ Σ̆^m_t z̆^i_t by (z̆^{j^m_t}_t)^⊺ Σ̆^m_t z̆^{j^m_t}_t, which follows from the definition of j^m_t. Finally, we show that E[(X̆^i_T)² Σ_t (z̆^{j^m_t}_t)^⊺ Σ̆^m_t z̆^{j^m_t}_t] is Õ(1), using the fact that (Σ̆^m_t)^{−1} is obtained by linearly combining {z̆^{j^m_s}_s (z̆^{j^m_s}_s)^⊺}_{1≤s<t} as in (5.18b).

Combining the three bounds in Lemma 5.3, we get that

R̆^{i,m}(T) ≤ Õ(σ̆² d_x^{0.5} (d_x + d_u) √T).
(5.25)

Substituting (5.22) and (5.25) into (5.21), we get the result of Theorem 5.2.

5.5 Numerical Experiments

In this section, we illustrate the performance of TSDE-MF for a homogeneous (i.e., |M| = 1) mean-field LQ system for different values of the number n of agents, with A = 1, B = 0.3, D = 0.5, E = 0.2, Q = 1, Q̄ = 1, R = 1, and R̄ = 0.5. We set the local noise variance σ_w² = 1. For the regret plots in Figures 5.1 and 5.2, we set the common noise variance to σ_v² + σ_{v0}² = 1. The prior distributions used in the simulation are set according to (A3) and (A4) with μ̆_1(ℓ) = [1, 1], μ̄_1(ℓ) = [1, 1], Σ̄_1 = I, Σ̆_1 = I, Θ̆ = {θ̆ : |A + B L̆(θ̆)| ≤ δ}, Θ̄ = {θ̄ : |A + D + (B + E) L̄(θ̄)| ≤ δ}, and δ = 0.99. In the comparison of TSDE-MF with TSDE in Figure 5.3, we consider the same dynamics and cost parameters as above but without common noise (i.e., σ_v² + σ_{v0}² = 0).

Empirical evaluation of regret: We run the system for 500 different sample paths and plot the mean and standard deviation of the expected regret R(T) for T = 5000. The regret for different values of n is shown in Figures 5.1–5.2. As seen from the plots, the regret decreases with the number of agents and R(T)/√T converges to a constant. Thus, the empirical regret matches the upper bound of Õ((1 + 1/n)√T) obtained in Theorem 5.2.

Comparison with the naive TSDE algorithm: We compare the performance of TSDE-MF with that of directly using the TSDE algorithm of [69, 70] for different values of n. The results are shown in Fig. 5.3. As seen from the plots, the regret of TSDE-MF is smaller than that of TSDE; more importantly, the regret of TSDE-MF decreases with n while that of TSDE increases with n. This matches their respective upper bounds of Õ((1 + 1/n)√T) and Õ(n^{1.5}√T). These plots clearly illustrate the significance of our results even for small values of n.

5.6 Conclusion

We consider the problem of controlling an unknown LQ mean-field team.
The planning solution (i.e., when the model is known) for mean-field teams is obtained by solving the mean-field system and the relative systems separately. Inspired by this feature, we propose a TS-based learning algorithm, TSDE-MF, which separately tracks the parameters θ̄ and θ̆^m of the mean-field and the relative systems, respectively. The part of the TSDE-MF algorithm that learns the mean-field system is similar to the TSDE algorithm for single-agent LQ systems proposed in [69, 70], and its regret can be bounded using the results of [69, 70]. However, the part of the TSDE-MF algorithm that learns the relative components is different, and we cannot directly use the results of [69, 70] to bound its regret. Our main technical contribution is to provide a bound on the regret of the relative systems, which allows us to bound the total regret under TSDE-MF.

Distributed implementation of the algorithm: It is possible to implement Algorithm 2 in a distributed manner as follows. Instead of a centralized coordinator which collects all the observations and computes all the controls, we can consider an alternative implementation in which there is an actor A^m associated with each type m and a mean-field actor Ā. Each agent observes its local state and action. The actor A^m for type m computes (j^m_t, x̄^m_t) using a distributed algorithm, sends x̄^m_t to the mean-field actor, and locally computes L̆^m(θ̆_k). The mean-field actor computes L̄(θ̄_k) and sends the m-th block column L̄^m(θ̄_k) to actor A^m. Each actor A^m then sends (x̄^m_t, L̄^m(θ̄_k), L̆^m(θ̆_k)) to each agent of type m using a distributed algorithm. Each agent then applies the control law (5.11).
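As a minimal sketch of this final composition step, the code below applies the control law (5.11) for one type: split each agent's state into mean-field and relative parts and add the two controls. The gains are hypothetical placeholders for L̆^m(θ̆_k) and the m-th block L̄^m(θ̄_k), not values from the model.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d_x = 5, 2
x = rng.normal(size=(n, d_x))       # local states x^i_t of the n agents

L_breve = -0.5 * np.eye(d_x)        # hypothetical relative gain
L_bar = -0.3 * np.eye(d_x)          # hypothetical mean-field gain (m-th block)

xbar = x.mean(axis=0)               # mean-field state
x_rel = x - xbar                    # relative states, which sum to zero

# u^i_t = L_breve x_rel^i_t + L_bar xbar_t for every agent i
u = x_rel @ L_breve.T + L_bar @ xbar
```

Because the relative states sum to zero, the average of the applied controls equals the mean-field control L_bar·xbar, consistent with the split u^i_t = ū^m_t + ŭ^i_t.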
Chapter 6

Scalable regret for learning to control network-coupled subsystems with unknown dynamics

6.1 Introduction

Large-scale systems comprising multiple subsystems connected over a network arise in a number of applications, including power systems, traffic networks, communication networks, and some economic systems [89]. A common feature of such systems is the coupling in their subsystems' dynamics and costs, i.e., the state evolution and local costs of one subsystem depend not only on its own state and control action but also on the states and control actions of other subsystems in the network. Analyzing various aspects of the behavior of such systems and designing control strategies for them under a variety of settings have been long-standing problems of interest in the systems and control literature [73–77]. However, there are still many unsolved challenges, especially at the interface between learning and control in the context of these large-scale systems.

In this paper, we investigate the problem of designing control strategies for large-scale network-coupled subsystems when some parameters of the system model are not known. Due to the unknown parameters, the control problem is also a learning problem. We adopt a reinforcement learning framework for this problem with the goal of minimizing and quantifying the regret (i.e., loss in performance) of our learning-and-control strategy with respect to the optimal control strategy based on complete knowledge of the system model.

The networked system we consider follows linear dynamics with quadratic costs and Gaussian noise. Such linear-quadratic-Gaussian (LQG) systems are one of the most commonly used modeling frameworks in numerous control applications.
Part of the appeal of LQG models is the simple structure of the optimal control strategy when the system model is completely known: the optimal control action in this case is a linear or affine function of the state, which makes the optimal strategy easy to identify and easy to implement. If some parameters of the model are not fully known during the design phase, or may change during operation, then it is better to design a strategy that learns and adapts online. Historically, both adaptive control [6] and reinforcement learning [90, 91] have been used to design asymptotically optimal learning algorithms for such LQG systems.

In recent years, there has been considerable interest in analyzing the transient behavior of such algorithms, which can be quantified in terms of the regret of the algorithm as a function of time. This allows one to assess, as a function of time, the performance of a learning algorithm compared to an oracle who knows the system parameters upfront. Several learning algorithms have been proposed for LQG systems [57–61, 64–67, 70, 71, 92], and in most cases the regret is shown to be bounded by Õ(d_x^{0.5}(d_x + d_u)√T), where d_x is the dimension of the state, d_u is the dimension of the controls, T is the time horizon, and the Õ(·) notation hides logarithmic terms in T. Given the lower bound of Ω̃(d_x^{0.5} d_u √T) (where the Ω̃(·) notation hides logarithmic terms in T) for regret in LQG systems identified in a recent work [67], the regret of the existing algorithms has near-optimal scaling in terms of time and dimension. However, when directly applied to a networked system with n subsystems, these algorithms would incur Õ(n^{1.5} d_x^{0.5}(d_x + d_u)√T) regret because the effective dimension of the state and the controls is n·d_x and n·d_u, where d_x and d_u are the dimensions of each subsystem.
This super-linear dependence on $n$ is prohibitive in large-scale networked systems because the regret per subsystem (which is $\tilde{O}(\sqrt{n})$) grows with the number of subsystems. The learning algorithms mentioned above are for a general LQG system and do not take into account any knowledge of the underlying network structure. Our main contribution is to show that, by exploiting the structure of the network model, it is possible to design learning algorithms for large-scale network-coupled subsystems where the regret does not grow super-linearly in the number of subsystems. In particular, we utilize a spectral decomposition technique, recently proposed in [93], to decompose the large-scale system into $L$ decoupled systems, where $L$ is the rank of the coupling matrix corresponding to the underlying network. Using the decoupled systems, we propose a Thompson sampling based algorithm with an $\tilde{O}(n d_x^{0.5}(d_x + d_u)\sqrt{T})$ regret bound.

Related work

Broadly speaking, three classes of low-regret learning algorithms have been proposed for LQG systems: certainty equivalence (CE) based algorithms, optimism in the face of uncertainty (OFU) based algorithms, and Thompson sampling (TS) based algorithms. CE is a classical adaptive control algorithm [6]. Recent papers [64–67, 92] have established near-optimal high-probability bounds on regret for CE-based algorithms. OFU-based algorithms are inspired by the OFU principle for multi-armed bandits [56]. Starting with the work of [57, 58], most of the papers following the OFU approach [59–61] also provide similar high-probability regret bounds. TS-based algorithms are inspired by the TS algorithm for multi-armed bandits [68]. Most papers following this approach [70, 71, 92] establish bounds on the expected Bayesian regret of similar near-optimal orders. As argued earlier, most of these papers show that the regret scales super-linearly with the number of subsystems; the resulting algorithms are, therefore, of limited value for large-scale systems.
There is an emerging literature on learning algorithms for networked systems, both for LQG models [94–99] and MDP models [100–102]. The papers on LQG models propose distributed value- or policy-based learning algorithms and analyze their convergence properties, but they do not characterize regret. Some of the papers on MDP models [101, 102] do characterize regret bounds for OFU and TS-based learning algorithms, but these bounds are not directly applicable to the LQG model considered in this paper.

An important special class of network-coupled systems is mean-field coupled subsystems [14, 15]. There has been considerable interest in reinforcement learning for mean-field models [84, 85], but most of the literature does not consider regret. The basic mean-field coupled model can be viewed as a special case of the network-coupled subsystems considered in this paper (see Sec. 6.6.1). In a preliminary version of this paper [87], we proposed a TS-based algorithm for mean-field coupled subsystems which has an $\tilde{O}((1 + 1/n)\sqrt{T})$ regret per subsystem. The current paper extends the TS-based algorithm to general network-coupled subsystems and establishes scalable regret bounds for arbitrarily coupled networks.

Organization

The rest of the paper is organized as follows. In Section 6.2, we introduce the model of network-coupled subsystems. In Section 6.3, we summarize the spectral decomposition idea and the resulting scalable method for synthesizing the optimal control strategy when the model parameters are known. Then, in Section 6.4, we consider the learning problem for unknown network-coupled subsystems and present a TS-based learning algorithm with a scalable regret bound. We subsequently provide the regret analysis in Section 6.5 and numerical experiments in Section 6.6. We conclude in Section 6.7.

Notation

The notation $A = [a^{ij}]$ means that $A$ is the matrix that has $a^{ij}$ as its $(i,j)$-th element. For a matrix $A$, $A^\intercal$ denotes its transpose. Given matrices (or vectors) $A_1, \ldots, A_n$ with the same number of rows, $[A_1, \ldots, A_n]$ denotes the matrix formed by horizontal concatenation. For a random vector $v$, $\operatorname{var}(v)$ denotes its covariance matrix. The notation $\mathcal{N}(\mu, \Sigma)$ denotes the multivariate Gaussian distribution with mean vector $\mu$ and covariance matrix $\Sigma$. For stabilizable $(A, B)$ and positive definite matrices $Q$ and $R$, $\operatorname{DARE}(A, B, Q, R)$ denotes the unique positive semidefinite solution of the discrete-time algebraic Riccati equation (DARE), which is given as
$$S = A^\intercal S A - (A^\intercal S B)(R + B^\intercal S B)^{-1}(B^\intercal S A) + Q.$$

6.2 Model of network-coupled subsystems

We start by describing a minor variation of the model of network-coupled subsystems proposed in [93]. The model in [93] was described in continuous time; we translate the model and the results to discrete time.

6.2.1 System model

6.2.1.1 Graph structure

Consider a network consisting of $n$ subsystems/agents connected over an undirected weighted simple graph denoted by $\mathcal{G}(N, E, \Psi)$, where $N = \{1, \ldots, n\}$ is the set of nodes, $E \subseteq N \times N$ is the set of edges, and $\Psi = [\psi^{ij}] \in \mathbb{R}^{n \times n}$ is the weighted adjacency matrix. Let $M = [m^{ij}] \in \mathbb{R}^{n \times n}$ be a symmetric coupling matrix corresponding to the underlying graph $\mathcal{G}$. For instance, $M$ may represent the underlying adjacency matrix (i.e., $M = \Psi$) or the underlying Laplacian matrix (i.e., $M = \operatorname{diag}(\Psi \mathbf{1}_n) - \Psi$).

6.2.1.2 State and dynamics

The states and control actions of agents take values in $\mathbb{R}^{d_x}$ and $\mathbb{R}^{d_u}$, respectively. For agent $i \in N$, we use $x^i_t \in \mathbb{R}^{d_x}$ and $u^i_t \in \mathbb{R}^{d_u}$ to denote its state and control action at time $t$. The system starts at a random initial state $x_1 = (x^i_1)_{i \in N}$, whose components are independent across agents.
For agent $i$, the initial state is $x^i_1 \sim \mathcal{N}(0, \Xi^i_1)$, and at any time $t \ge 1$, the state evolves according to
$$x^i_{t+1} = A x^i_t + B u^i_t + D x^{G,i}_t + E u^{G,i}_t + w^i_t, \qquad (6.1)$$
where $x^{G,i}_t$ and $u^{G,i}_t$ are the locally perceived influence of the network on the state of agent $i$ and are given by
$$x^{G,i}_t = \sum_{j \in N} m^{ij} x^j_t \quad\text{and}\quad u^{G,i}_t = \sum_{j \in N} m^{ij} u^j_t, \qquad (6.2)$$
$A$, $B$, $D$, $E$ are matrices of appropriate dimensions, and $\{w^i_t\}_{t \ge 1}$, $i \in N$, are i.i.d. zero-mean Gaussian processes which are independent of each other and of the initial state. In particular, $w^i_t \in \mathbb{R}^{d_x}$ and $w^i_t \sim \mathcal{N}(0, W)$. We call $x^{G,i}_t$ and $u^{G,i}_t$ the network-field of the states and control actions at node $i$ at time $t$. Thus, the next state of agent $i$ depends on its current local state and control action, the current network-field of the states and control actions of the system, and the current local noise.

We follow the same atypical representation of the "vectorized" dynamics as used in [93]. Define $x_t$ and $u_t$ as the global state and control actions of the system: $x_t = [x^1_t, \ldots, x^n_t]$ and $u_t = [u^1_t, \ldots, u^n_t]$. We also define $w_t = [w^1_t, \ldots, w^n_t]$. Similarly, define $x^G_t$ and $u^G_t$ as the global network-fields of states and actions: $x^G_t = [x^{G,1}_t, \ldots, x^{G,n}_t]$ and $u^G_t = [u^{G,1}_t, \ldots, u^{G,n}_t]$. Note that $x_t, x^G_t, w_t \in \mathbb{R}^{d_x \times n}$ and $u_t, u^G_t \in \mathbb{R}^{d_u \times n}$ are matrices and not vectors. The global system dynamics may be written as
$$x_{t+1} = A x_t + B u_t + D x^G_t + E u^G_t + w_t. \qquad (6.3)$$
Furthermore, since $M$ is symmetric, we may write $x^G_t = x_t M^\intercal = x_t M$ and $u^G_t = u_t M^\intercal = u_t M$.

6.2.1.3 Per-step cost

At any time $t$, the system incurs a per-step cost given by
$$c(x_t, u_t) = \sum_{i \in N} \sum_{j \in N} \big[ h^{ij}_x (x^i_t)^\intercal Q\, x^j_t + h^{ij}_u (u^i_t)^\intercal R\, u^j_t \big], \qquad (6.4)$$
where $Q$ and $R$ are matrices of appropriate dimensions and $h^{ij}_x$ and $h^{ij}_u$ are real-valued weights. Let $H_x = [h^{ij}_x]$ and $H_u = [h^{ij}_u]$.
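As a quick numerical sanity check, the per-node dynamics (6.1)–(6.2) and the vectorized form (6.3) can be verified to agree; the sketch below uses arbitrary made-up dimensions and randomly generated matrices, not tied to any particular application:

```python
import numpy as np

rng = np.random.default_rng(0)
n, dx, du = 6, 3, 2

# Made-up system matrices and a symmetric coupling matrix M.
A, B = rng.normal(size=(dx, dx)), rng.normal(size=(dx, du))
D, E = rng.normal(size=(dx, dx)), rng.normal(size=(dx, du))
Psi = rng.random((n, n))
M = (Psi + Psi.T) / 2          # symmetric, as assumed in the model

x = rng.normal(size=(dx, n))   # global state: column i is x^i_t
u = rng.normal(size=(du, n))   # global control
w = rng.normal(size=(dx, n))   # noise

# Per-node update (6.1) with the network fields (6.2).
x_next_nodes = np.column_stack([
    A @ x[:, i] + B @ u[:, i]
    + D @ sum(M[i, j] * x[:, j] for j in range(n))
    + E @ sum(M[i, j] * u[:, j] for j in range(n))
    + w[:, i]
    for i in range(n)
])

# Vectorized update (6.3), using x^G = x M and u^G = u M (M symmetric).
x_next_global = A @ x + B @ u + D @ (x @ M) + E @ (u @ M) + w
```

Since $M$ is symmetric, column $i$ of $x_t M$ is exactly $\sum_j m^{ij} x^j_t$, so the two updates coincide.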
It is assumed that the weight matrices $H_x$ and $H_u$ are polynomials of $M$, i.e.,
$$H_x = \sum_{k=0}^{K_x} q_k M^k \quad\text{and}\quad H_u = \sum_{k=0}^{K_u} r_k M^k, \qquad (6.5)$$
where $K_x$ and $K_u$ denote the degrees of the polynomials and $\{q_k\}_{k=0}^{K_x}$ and $\{r_k\}_{k=0}^{K_u}$ are real-valued coefficients. The assumption that $H_x$ and $H_u$ are polynomials of $M$ captures the intuition that the per-step cost respects the graph structure. In the special case when $H_x = H_u = I$, the per-step cost is decoupled across agents. When $H_x = H_u = I + M$, the per-step cost captures a cross-coupling between one-hop neighbors. Similarly, when $H_u = I + M + M^2$, the per-step cost captures a cross-coupling between one- and two-hop neighbors. See [93] for more examples of special cases of the per-step cost defined above.

6.2.2 Assumptions on the model

Since $M$ is real and symmetric, it has real eigenvalues. Let $L$ denote the rank of $M$ and $\lambda^{(1)}, \ldots, \lambda^{(L)}$ denote the non-zero eigenvalues. For ease of notation, for $\ell \in \{1, \ldots, L\}$, define
$$q^{(\ell)} = \sum_{k=0}^{K_x} q_k (\lambda^{(\ell)})^k \quad\text{and}\quad r^{(\ell)} = \sum_{k=0}^{K_u} r_k (\lambda^{(\ell)})^k,$$
where $\{q_k\}_{k=0}^{K_x}$ and $\{r_k\}_{k=0}^{K_u}$ are the coefficients in (6.5). Furthermore, for $\ell \in \{1, \ldots, L\}$, define
$$A^{(\ell)} = A + \lambda^{(\ell)} D \quad\text{and}\quad B^{(\ell)} = B + \lambda^{(\ell)} E.$$
We impose the following assumptions:

(A1) The systems $(A, B)$ and $\{(A^{(\ell)}, B^{(\ell)})\}_{\ell=1}^{L}$ are stabilizable.

(A2) The matrices $Q$ and $R$ are symmetric and positive definite.

(A3) The parameters $q_0$, $r_0$, $\{q^{(\ell)}\}_{\ell=1}^{L}$, and $\{r^{(\ell)}\}_{\ell=1}^{L}$ are strictly positive.

Assumption (A1) is needed to ensure that the average cost under the optimal policy is bounded. Assumptions (A2) and (A3) ensure that the per-step cost is strictly positive.

6.2.3 Admissible policies and performance criterion

There is a system operator who has access to the state and action histories of all agents and who selects the agents' control actions according to a deterministic or randomized (and potentially history-dependent) policy
$$u_t = \pi_t(x_{1:t}, u_{1:t-1}).$$
Let $\theta^\intercal = [A, B, D, E]$ denote the parameters of the system dynamics.
The performance of any policy $\pi = (\pi_1, \pi_2, \ldots)$ is measured by the long-term average cost given by
$$J(\pi; \theta) = \limsup_{T \to \infty} \frac{1}{T}\, \mathbb{E}^\pi\Big[ \sum_{t=1}^{T} c(x_t, u_t) \Big]. \qquad (6.6)$$
Let $J(\theta)$ denote the minimum of $J(\pi; \theta)$ over all policies. We are interested in the setup where the graph coupling matrix $M$, the cost coupling matrices $H_x$ and $H_u$, and the cost matrices $Q$ and $R$ are known, but the system dynamics $\theta$ are unknown and there is a prior distribution on $\theta$. The Bayesian regret of a policy $\pi$ operating for a horizon $T$ is defined as
$$R(T; \pi) := \mathbb{E}^\pi\Big[ \sum_{t=1}^{T} c(x_t, u_t) - T J(\theta) \Big], \qquad (6.7)$$
where the expectation is with respect to the prior on $\theta$, the noise processes, the initial conditions, and the potential randomizations done by the policy $\pi$.

6.3 Background on spectral decomposition of the system

In this section, we summarize the main results of [93], translated to the discrete-time model used in this paper. The spectral decomposition described in [93] relies on the spectral factorization of the graph coupling matrix $M$. Since $M$ is a real symmetric matrix with rank $L$, we can write it as
$$M = \sum_{\ell=1}^{L} \lambda^{(\ell)} v^{(\ell)} (v^{(\ell)})^\intercal, \qquad (6.8)$$
where $(\lambda^{(1)}, \ldots, \lambda^{(L)})$ are the non-zero eigenvalues of $M$ and $(v^{(1)}, \ldots, v^{(L)})$ are the corresponding orthonormal eigenvectors. We now present the decomposition of the dynamics and the cost based on (6.8), as described in [93].

6.3.1 Spectral decomposition of the dynamics and per-step cost

For $\ell \in \{1, \ldots, L\}$, define the eigenstates and eigencontrols as
$$x^{(\ell)}_t = x_t v^{(\ell)} (v^{(\ell)})^\intercal \quad\text{and}\quad u^{(\ell)}_t = u_t v^{(\ell)} (v^{(\ell)})^\intercal, \qquad (6.9)$$
respectively. Furthermore, define the auxiliary state and auxiliary control as
$$\breve{x}_t = x_t - \sum_{\ell=1}^{L} x^{(\ell)}_t \quad\text{and}\quad \breve{u}_t = u_t - \sum_{\ell=1}^{L} u^{(\ell)}_t, \qquad (6.10)$$
respectively. Similarly, define $w^{(\ell)}_t = w_t v^{(\ell)} (v^{(\ell)})^\intercal$ and $\breve{w}_t = w_t - \sum_{\ell=1}^{L} w^{(\ell)}_t$. We now obtain the dynamics of the eigen and auxiliary states.
Multiplying (6.3) on the right by $v^{(\ell)} (v^{(\ell)})^\intercal$ and observing that $v^{(\ell)}$ is an eigenvector of $M$, we get
$$x^{(\ell)}_{t+1} = (A + \lambda^{(\ell)} D)\, x^{(\ell)}_t + (B + \lambda^{(\ell)} E)\, u^{(\ell)}_t + w^{(\ell)}_t. \qquad (6.11)$$
Substituting (6.3) and (6.11) in (6.10), we get
$$\breve{x}_{t+1} = A \breve{x}_t + B \breve{u}_t + \breve{w}_t. \qquad (6.12)$$
Let $x^{(\ell),i}_t$ and $u^{(\ell),i}_t$ denote the $i$-th columns of $x^{(\ell)}_t$ and $u^{(\ell)}_t$, respectively; thus we can write $x^{(\ell)}_t = [x^{(\ell),1}_t, \ldots, x^{(\ell),n}_t]$ and $u^{(\ell)}_t = [u^{(\ell),1}_t, \ldots, u^{(\ell),n}_t]$. Similar interpretations hold for $w^{(\ell),i}_t$ and $\breve{w}^i_t$. Looking at a particular column of (6.10) and rearranging terms, we can decompose the state and control action at each node $i \in N$ as $x^i_t = \breve{x}^i_t + \sum_{\ell=1}^{L} x^{(\ell),i}_t$ and $u^i_t = \breve{u}^i_t + \sum_{\ell=1}^{L} u^{(\ell),i}_t$.

Eq. (6.11) implies that the dynamics of the eigenstate $x^{(\ell),i}_t$ depend only on $u^{(\ell),i}_t$ and $w^{(\ell),i}_t$, and are given by
$$x^{(\ell),i}_{t+1} = (A + \lambda^{(\ell)} D)\, x^{(\ell),i}_t + (B + \lambda^{(\ell)} E)\, u^{(\ell),i}_t + w^{(\ell),i}_t. \qquad (6.13)$$
Similarly, Eq. (6.12) implies that the dynamics of the auxiliary state $\breve{x}^i_t$ depend only on $\breve{u}^i_t$ and $\breve{w}^i_t$, and are given by
$$\breve{x}^i_{t+1} = A \breve{x}^i_t + B \breve{u}^i_t + \breve{w}^i_t. \qquad (6.14)$$
Furthermore, [93, Proposition 2] implies that the per-step cost decomposes as follows:
$$c(x_t, u_t) = \sum_{i \in N} \Big[ q_0\, \breve{c}(\breve{x}^i_t, \breve{u}^i_t) + \sum_{\ell=1}^{L} q^{(\ell)} c^{(\ell)}(x^{(\ell),i}_t, u^{(\ell),i}_t) \Big], \qquad (6.15)$$
where (recall that (A3) ensures that $q_0$ and $\{q^{(\ell)}\}_{\ell=1}^{L}$ are strictly positive)
$$\breve{c}(\breve{x}^i_t, \breve{u}^i_t) = (\breve{x}^i_t)^\intercal Q \breve{x}^i_t + \frac{r_0}{q_0} (\breve{u}^i_t)^\intercal R \breve{u}^i_t, \qquad c^{(\ell)}(x^{(\ell),i}_t, u^{(\ell),i}_t) = (x^{(\ell),i}_t)^\intercal Q x^{(\ell),i}_t + \frac{r^{(\ell)}}{q^{(\ell)}} (u^{(\ell),i}_t)^\intercal R u^{(\ell),i}_t.$$
Following [93, Lemma 2], we can show that for any $i \in N$,
$$\operatorname{var}(w^{(\ell),i}_t) = (v^{(\ell),i})^2 W \quad\text{and}\quad \operatorname{var}(\breve{w}^i_t) = (\breve{v}^i)^2 W, \qquad (6.16)$$
where $(\breve{v}^i)^2 = 1 - \sum_{\ell=1}^{L} (v^{(\ell),i})^2$. These covariances do not depend on time because the noise processes are i.i.d.

6.3.2 Planning solution for network-coupled subsystems

We now present the main result of [93], which provides a scalable method to synthesize the optimal control policy when the system dynamics are known.
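The decomposition above can be checked numerically. The following sketch (mean-field coupling and made-up matrices chosen purely for illustration) verifies that propagating the eigenstates via (6.11) and the auxiliary state via (6.12), then summing, reproduces the global dynamics (6.3):

```python
import numpy as np

rng = np.random.default_rng(1)
n, dx, du = 5, 3, 2
A, B = rng.normal(size=(dx, dx)), rng.normal(size=(dx, du))
D, E = rng.normal(size=(dx, dx)), rng.normal(size=(dx, du))
M = np.ones((n, n)) / n            # mean-field coupling: rank L = 1

lam, V = np.linalg.eigh(M)
keep = np.abs(lam) > 1e-9          # keep the L non-zero eigenvalues
lam, V = lam[keep], V[:, keep]

x, u, w = (rng.normal(size=(d, n)) for d in (dx, du, dx))
x_next = A @ x + B @ u + D @ (x @ M) + E @ (u @ M) + w   # global dynamics (6.3)

# Eigenstates/controls/noises (6.9) and auxiliary quantities (6.10).
P = [np.outer(V[:, l], V[:, l]) for l in range(lam.size)]
xl, ul, wl = [x @ p for p in P], [u @ p for p in P], [w @ p for p in P]
x_br, u_br, w_br = x - sum(xl), u - sum(ul), w - sum(wl)

# Propagate each decoupled system: (6.11) for eigen, (6.12) for auxiliary.
recon = A @ x_br + B @ u_br + w_br
for l in range(lam.size):
    recon += (A + lam[l] * D) @ xl[l] + (B + lam[l] * E) @ ul[l] + wl[l]
```

The check works because $M P^{(\ell)} = \lambda^{(\ell)} P^{(\ell)}$ for each projector $P^{(\ell)} = v^{(\ell)}(v^{(\ell)})^\intercal$, so the network terms $D x_t M$ and $E u_t M$ split exactly across the eigen directions.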
Based on the decomposition presented in the previous section, we can view the overall system as the collection of the following subsystems:

• Eigensystem $(\ell, i)$, $\ell \in \{1, \ldots, L\}$ and $i \in N$, with state $x^{(\ell),i}_t$, controls $u^{(\ell),i}_t$, dynamics (6.13), and per-step cost $q^{(\ell)} c^{(\ell)}(x^{(\ell),i}_t, u^{(\ell),i}_t)$.

• Auxiliary system $i$, $i \in N$, with state $\breve{x}^i_t$, controls $\breve{u}^i_t$, dynamics (6.14), and per-step cost $q_0\, \breve{c}(\breve{x}^i_t, \breve{u}^i_t)$.

Let $(\theta^{(\ell)})^\intercal = [A^{(\ell)}, B^{(\ell)}] := [(A + \lambda^{(\ell)} D), (B + \lambda^{(\ell)} E)]$, $\ell \in \{1, \ldots, L\}$, and $\breve{\theta}^\intercal = [A, B]$ denote the parameters of the dynamics of the eigen and auxiliary systems, respectively. Then, for any policy $\pi = (\pi_1, \pi_2, \ldots)$, the performance of the eigensystem $(\ell, i)$, $\ell \in \{1, \ldots, L\}$ and $i \in N$, is given by $q^{(\ell)} J^{(\ell),i}(\pi; \theta^{(\ell)})$, where
$$J^{(\ell),i}(\pi; \theta^{(\ell)}) = \limsup_{T \to \infty} \frac{1}{T}\, \mathbb{E}^\pi\Big[ \sum_{t=1}^{T} c^{(\ell)}(x^{(\ell),i}_t, u^{(\ell),i}_t) \Big].$$
Similarly, the performance of the auxiliary system $i$, $i \in N$, is given by $q_0 \breve{J}^i(\pi; \breve{\theta})$, where
$$\breve{J}^i(\pi; \breve{\theta}) = \limsup_{T \to \infty} \frac{1}{T}\, \mathbb{E}^\pi\Big[ \sum_{t=1}^{T} \breve{c}(\breve{x}^i_t, \breve{u}^i_t) \Big].$$
Eq. (6.15) implies that the overall performance of policy $\pi$ can be decomposed as
$$J(\pi; \theta) = \sum_{i \in N} q_0 \breve{J}^i(\pi; \breve{\theta}) + \sum_{i \in N} \sum_{\ell=1}^{L} q^{(\ell)} J^{(\ell),i}(\pi; \theta^{(\ell)}). \qquad (6.17)$$
The key intuition behind the result of [93] is as follows. By the certainty equivalence principle for LQ systems, we know that (when the system dynamics are known) the optimal control policy of a stochastic LQ system is the same as the optimal control policy of the corresponding deterministic LQ system where the noises $\{w^i_t\}_{t \ge 1}$ are assumed to be zero. Note that when the noises $\{w^i_t\}_{t \ge 1}$ are zero, the noises $\{w^{(\ell),i}_t\}_{t \ge 1}$ and $\{\breve{w}^i_t\}_{t \ge 1}$ of the eigen- and auxiliary systems are also zero. This, in turn, implies that the dynamics of all the eigen- and auxiliary systems are decoupled.
These decoupled dynamics, along with the cost decoupling in (6.17), imply that we can choose the controls $\{u^{(\ell),i}_t\}_{t \ge 1}$ for the eigensystem $(\ell, i)$, $\ell \in \{1, \ldots, L\}$ and $i \in N$, to minimize $J^{(\ell),i}(\pi; \theta^{(\ell)})$, and choose the controls $\{\breve{u}^i_t\}_{t \ge 1}$ for the auxiliary system $i$, $i \in N$, to minimize $\breve{J}^i(\pi; \breve{\theta})$. (The cost of the eigensystem $(\ell, i)$ is $q^{(\ell)} J^{(\ell),i}(\pi; \theta^{(\ell)})$; from (A3), $q^{(\ell)}$ is positive, so minimizing $q^{(\ell)} J^{(\ell),i}(\pi; \theta^{(\ell)})$ is the same as minimizing $J^{(\ell),i}(\pi; \theta^{(\ell)})$.) These optimization problems are standard optimal control problems. Therefore, similar to [93, Theorem 3], we obtain the following result.

Theorem 6.1. Let $\breve{S}$ and $\{S^{(\ell)}\}_{\ell=1}^{L}$ be the solutions of the following discrete-time algebraic Riccati equations (DARE):
$$\breve{S}(\breve{\theta}) = \operatorname{DARE}\Big(A, B, Q, \frac{r_0}{q_0} R\Big), \qquad (6.18a)$$
and for $\ell \in \{1, \ldots, L\}$,
$$S^{(\ell)}(\theta^{(\ell)}) = \operatorname{DARE}\Big(A^{(\ell)}, B^{(\ell)}, Q, \frac{r^{(\ell)}}{q^{(\ell)}} R\Big). \qquad (6.18b)$$
Define the gains
$$\breve{G}(\breve{\theta}) = -\Big( B^\intercal \breve{S}(\breve{\theta}) B + \frac{r_0}{q_0} R \Big)^{-1} B^\intercal \breve{S}(\breve{\theta}) A, \qquad (6.19a)$$
and for $\ell \in \{1, \ldots, L\}$,
$$G^{(\ell)}(\theta^{(\ell)}) = -\Big( (B^{(\ell)})^\intercal S^{(\ell)}(\theta^{(\ell)}) B^{(\ell)} + \frac{r^{(\ell)}}{q^{(\ell)}} R \Big)^{-1} (B^{(\ell)})^\intercal S^{(\ell)}(\theta^{(\ell)}) A^{(\ell)}. \qquad (6.19b)$$
Then, under assumptions (A1)–(A3), the policy
$$u^i_t = \breve{G}(\breve{\theta})\, \breve{x}^i_t + \sum_{\ell=1}^{L} G^{(\ell)}(\theta^{(\ell)})\, x^{(\ell),i}_t \qquad (6.20)$$
minimizes the long-term average cost in (6.6) over all admissible policies. Furthermore, the optimal performance is given by
$$J(\theta) = \sum_{i \in N} q_0 \breve{J}^i(\breve{\theta}) + \sum_{i \in N} \sum_{\ell=1}^{L} q^{(\ell)} J^{(\ell),i}(\theta^{(\ell)}), \qquad (6.21)$$
where
$$\breve{J}^i(\breve{\theta}) = (\breve{v}^i)^2 \operatorname{Tr}(W \breve{S}) \quad\text{and, for } \ell \in \{1, \ldots, L\},\quad J^{(\ell),i}(\theta^{(\ell)}) = (v^{(\ell),i})^2 \operatorname{Tr}(W S^{(\ell)}). \qquad (6.22)$$

6.4 Learning for network-coupled subsystems

For ease of notation, define $z^{(\ell),i}_t = (x^{(\ell),i}_t, u^{(\ell),i}_t)$ and $\breve{z}^i_t = (\breve{x}^i_t, \breve{u}^i_t)$. Then, we can write the dynamics (6.13) and (6.14) of the eigen and auxiliary systems as
$$x^{(\ell),i}_{t+1} = (\theta^{(\ell)})^\intercal z^{(\ell),i}_t + w^{(\ell),i}_t, \quad \forall i \in N,\ \forall \ell \in \{1, \ldots, L\}, \qquad (6.23a)$$
$$\breve{x}^i_{t+1} = \breve{\theta}^\intercal \breve{z}^i_t + \breve{w}^i_t, \quad \forall i \in N. \qquad (6.23b)$$
6.4.1 Simplifying assumptions

We impose the following assumptions to simplify the description of the algorithm and the regret analysis.

(A4) The noise covariance $W$ is a scaled identity matrix given by $\sigma_w^2 I$.

(A5) For each $i \in N$, $\breve{v}^i \neq 0$.

Assumption (A4) is commonly made in most of the literature on regret analysis of LQG systems. An implication of (A4) is that $\operatorname{var}(\breve{w}^i_t) = (\breve{\sigma}^i)^2 I$ and $\operatorname{var}(w^{(\ell),i}_t) = (\sigma^{(\ell),i})^2 I$, where
$$(\breve{\sigma}^i)^2 = (\breve{v}^i)^2 \sigma_w^2 \quad\text{and}\quad (\sigma^{(\ell),i})^2 = (v^{(\ell),i})^2 \sigma_w^2. \qquad (6.24)$$
Assumption (A5) is made to rule out the case where the dynamics of some of the auxiliary systems are deterministic.

6.4.2 Prior and posterior beliefs

We assume that the unknown parameters $\breve{\theta}$ and $\{\theta^{(\ell)}\}_{\ell=1}^{L}$ lie in compact subsets $\breve{\Theta}$ and $\{\Theta^{(\ell)}\}_{\ell=1}^{L}$ of $\mathbb{R}^{(d_x + d_u) \times d_x}$. Let $\breve{\theta}^k$ denote the $k$-th column of $\breve{\theta}$; thus $\breve{\theta} = [\breve{\theta}^1, \ldots, \breve{\theta}^{d_x}]$. Similarly, let $\theta^{(\ell),k}$ denote the $k$-th column of $\theta^{(\ell)}$; thus $\theta^{(\ell)} = [\theta^{(\ell),1}, \ldots, \theta^{(\ell),d_x}]$. We use $p\big|_{\Theta}$ to denote the restriction of a probability distribution $p$ to the set $\Theta$. We assume that $\breve{\theta}$ and $\{\theta^{(\ell)}\}_{\ell=1}^{L}$ are random variables that are independent of the initial states and the noise processes. Furthermore, we assume that the priors $\breve{p}_1$ and $\{p^{(\ell)}_1\}_{\ell=1}^{L}$ on $\breve{\theta}$ and $\{\theta^{(\ell)}\}_{\ell=1}^{L}$, respectively, satisfy the following:

(A6) $\breve{p}_1$ is given as $\breve{p}_1(\breve{\theta}) = \big[ \prod_{k=1}^{d_x} \breve{\xi}^k_1(\breve{\theta}^k) \big]\big|_{\breve{\Theta}}$, where for $k \in \{1, \ldots, d_x\}$, $\breve{\xi}^k_1 = \mathcal{N}(\breve{\mu}^k_1, \breve{\Sigma}_1)$ with mean $\breve{\mu}^k_1 \in \mathbb{R}^{d_x + d_u}$ and positive-definite covariance $\breve{\Sigma}_1 \in \mathbb{R}^{(d_x + d_u) \times (d_x + d_u)}$.

(A7) For each $\ell \in \{1, \ldots, L\}$, $p^{(\ell)}_1$ is given as $p^{(\ell)}_1(\theta^{(\ell)}) = \big[ \prod_{k=1}^{d_x} \xi^{(\ell),k}_1(\theta^{(\ell),k}) \big]\big|_{\Theta^{(\ell)}}$, where for $k \in \{1, \ldots, d_x\}$, $\xi^{(\ell),k}_1 = \mathcal{N}(\mu^{(\ell),k}_1, \Sigma^{(\ell)}_1)$ with mean $\mu^{(\ell),k}_1 \in \mathbb{R}^{d_x + d_u}$ and positive-definite covariance $\Sigma^{(\ell)}_1 \in \mathbb{R}^{(d_x + d_u) \times (d_x + d_u)}$.

These assumptions are similar to the assumptions on the prior in the recent literature on Thompson sampling for LQ systems [70].
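Since the learning algorithm repeatedly re-solves the synthesis step of Theorem 6.1 with sampled parameters, it is useful to see how the gains are computed. The sketch below uses a plain fixed-point iteration for the DARE (in practice one would use a structured solver such as `scipy.linalg.solve_discrete_are`); the scalar numbers are the mean-field example parameters from Section 6.6, used here purely for illustration:

```python
import numpy as np

def dare(A, B, Q, R, iters=2000):
    """Fixed-point iteration for S = Q + A^T S A - A^T S B (R + B^T S B)^{-1} B^T S A."""
    S = Q.copy()
    for _ in range(iters):
        S = Q + A.T @ S @ A - A.T @ S @ B @ np.linalg.solve(
            R + B.T @ S @ B, B.T @ S @ A)
    return S

def gain(A, B, S, R):
    """Gain as in (6.19): G = -(B^T S B + R)^{-1} B^T S A."""
    return -np.linalg.solve(B.T @ S @ B + R, B.T @ S @ A)

# Scalar mean-field example (Section 6.6): A=1, B=0.3, D=0.5, E=0.2, Q=R=1,
# lambda^(1) = 1, and effective cost ratios r0/q0 = r^(1)/q^(1) = 1.
A, B, D, E = (np.array([[v]]) for v in (1.0, 0.3, 0.5, 0.2))
Q = R = np.array([[1.0]])

S_br = dare(A, B, Q, R)            # auxiliary system, cf. (6.18a)
G_br = gain(A, B, S_br, R)         # cf. (6.19a)
S1 = dare(A + D, B + E, Q, R)      # eigensystem with A^(1) = A + D, B^(1) = B + E
G1 = gain(A + D, B + E, S1, R)     # cf. (6.19b)
```

The iteration converges because both $(A, B)$ and $(A^{(1)}, B^{(1)})$ are stabilizable and $Q$ is positive definite, matching (A1)–(A2).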
Our learning algorithm (and TS-based algorithms in general) keeps track of a posterior distribution on the unknown parameters based on the observed data. Motivated by the nature of the planning solution (see Theorem 6.1), we maintain separate posterior distributions on $\breve{\theta}$ and $\{\theta^{(\ell)}\}_{\ell=1}^{L}$.

For each $\ell$, we select a subsystem $i^{(\ell)}_*$ such that the $i^{(\ell)}_*$-th component of the eigenvector $v^{(\ell)}$ is non-zero (i.e., $v^{(\ell),i^{(\ell)}_*} \neq 0$). At time $t$, we maintain a posterior distribution $p^{(\ell)}_t$ on $\theta^{(\ell)}$ based on the corresponding eigenstate and action history of the $i^{(\ell)}_*$-th subsystem. In other words, for any Borel subset $B$ of $\mathbb{R}^{(d_x + d_u) \times d_x}$, $p^{(\ell)}_t(B)$ gives the conditional probability
$$p^{(\ell)}_t(B) = \mathbb{P}\big(\theta^{(\ell)} \in B \,\big|\, x^{(\ell),i^{(\ell)}_*}_{1:t}, u^{(\ell),i^{(\ell)}_*}_{1:t-1}\big). \qquad (6.25)$$
We maintain a separate posterior distribution $\breve{p}_t$ on $\breve{\theta}$ as follows. At each time $t > 1$, we select a subsystem
$$j_{t-1} = \arg\max_{i \in N} \frac{(\breve{z}^i_{t-1})^\intercal \breve{\Sigma}_{t-1} \breve{z}^i_{t-1}}{(\breve{\sigma}^i)^2},$$
where $\breve{\Sigma}_{t-1}$ is a covariance matrix defined recursively in Lemma 6.1 below. Then, for any Borel subset $B$ of $\mathbb{R}^{(d_x + d_u) \times d_x}$,
$$\breve{p}_t(B) = \mathbb{P}\big(\breve{\theta} \in B \,\big|\, \{\breve{x}^{j_s}_s, \breve{u}^{j_s}_s, \breve{x}^{j_s}_{s+1}\}_{1 \le s < t}\big). \qquad (6.26)$$
See [87] for a discussion of the rule used to select $j_{t-1}$.

Lemma 6.1. The posterior distributions $p^{(\ell)}_t$, $\ell \in \{1, \ldots, L\}$, and $\breve{p}_t$ are given as follows:

1. $p^{(\ell)}_1$ is given by Assumption (A7) and for any $t \ge 1$,
$$p^{(\ell)}_{t+1}(\theta^{(\ell)}) = \Big[ \prod_{k=1}^{d_x} \xi^{(\ell),k}_{t+1}(\theta^{(\ell),k}) \Big]\Big|_{\Theta^{(\ell)}},$$
where for $k \in \{1, \ldots, d_x\}$, $\xi^{(\ell),k}_{t+1} = \mathcal{N}(\mu^{(\ell),k}_{t+1}, \Sigma^{(\ell)}_{t+1})$, and
$$\mu^{(\ell)}_{t+1} = \mu^{(\ell)}_t + \frac{\Sigma^{(\ell)}_t z^{(\ell),i^{(\ell)}_*}_t \big( x^{(\ell),i^{(\ell)}_*}_{t+1} - (\mu^{(\ell)}_t)^\intercal z^{(\ell),i^{(\ell)}_*}_t \big)^\intercal}{(\sigma^{(\ell),i^{(\ell)}_*})^2 + (z^{(\ell),i^{(\ell)}_*}_t)^\intercal \Sigma^{(\ell)}_t z^{(\ell),i^{(\ell)}_*}_t}, \qquad (6.27a)$$
$$(\Sigma^{(\ell)}_{t+1})^{-1} = (\Sigma^{(\ell)}_t)^{-1} + \frac{1}{(\sigma^{(\ell),i^{(\ell)}_*})^2}\, z^{(\ell),i^{(\ell)}_*}_t \big( z^{(\ell),i^{(\ell)}_*}_t \big)^\intercal, \qquad (6.27b)$$
where, for each $t$, $\mu^{(\ell)}_t$ denotes the matrix $[\mu^{(\ell),1}_t, \ldots, \mu^{(\ell),d_x}_t]$.

2. $\breve{p}_1$ is given by Assumption (A6) and for any $t \ge 1$,
$$\breve{p}_{t+1}(\breve{\theta}) = \Big[ \prod_{k=1}^{d_x} \breve{\xi}^k_{t+1}(\breve{\theta}^k) \Big]\Big|_{\breve{\Theta}},$$
where for $k \in \{1, \ldots, d_x\}$, $\breve{\xi}^k_{t+1} = \mathcal{N}(\breve{\mu}^k_{t+1}, \breve{\Sigma}_{t+1})$, and
$$\breve{\mu}_{t+1} = \breve{\mu}_t + \frac{\breve{\Sigma}_t \breve{z}^{j_t}_t \big( \breve{x}^{j_t}_{t+1} - (\breve{\mu}_t)^\intercal \breve{z}^{j_t}_t \big)^\intercal}{(\breve{\sigma}^{j_t})^2 + (\breve{z}^{j_t}_t)^\intercal \breve{\Sigma}_t \breve{z}^{j_t}_t}, \qquad (6.28a)$$
$$(\breve{\Sigma}_{t+1})^{-1} = (\breve{\Sigma}_t)^{-1} + \frac{1}{(\breve{\sigma}^{j_t})^2}\, \breve{z}^{j_t}_t (\breve{z}^{j_t}_t)^\intercal, \qquad (6.28b)$$
where, for each $t$, $\breve{\mu}_t$ denotes the matrix $[\breve{\mu}^1_t, \ldots, \breve{\mu}^{d_x}_t]$.

Proof. Note that the dynamics of $x^{(\ell),i^{(\ell)}_*}_t$ and $\breve{x}^i_t$ in (6.23) are linear and the noises $w^{(\ell),i^{(\ell)}_*}_t$ and $\breve{w}^i_t$ are Gaussian. Therefore, the result follows from standard results in Gaussian linear regression [88, Theorem 3].

6.4.3 The Thompson sampling algorithm

We propose a Thompson sampling based algorithm called Net-TSDE, which is inspired by the TSDE (Thompson sampling with dynamic episodes) algorithm proposed in [70] and the structure of the optimal planning solution described in Sec. 6.3.2. The Thompson sampling part of our algorithm is modeled after the modification of TSDE presented in [103]. The Net-TSDE algorithm consists of a coordinator $C$ and $L + 1$ actors: an auxiliary actor $\breve{A}$ and an eigen actor $A^{(\ell)}$ for each $\ell \in \{1, \ldots, L\}$. These actors are described below and the whole algorithm is presented in Algorithm 3.

• At each time, the coordinator $C$ observes the current global state $x_t$, computes the eigenstates $\{x^{(\ell)}_t\}_{\ell=1}^{L}$ and the auxiliary state $\breve{x}_t$, sends the eigenstate $x^{(\ell)}_t$ to the eigen actor $A^{(\ell)}$, $\ell \in \{1, \ldots, L\}$, and sends the auxiliary state $\breve{x}_t$ to the auxiliary actor $\breve{A}$. The eigen actor $A^{(\ell)}$, $\ell \in \{1, \ldots, L\}$, computes the eigencontrol $u^{(\ell)}_t$ and the auxiliary actor $\breve{A}$ computes the auxiliary control $\breve{u}_t$ (as per the details presented below), and both send their computed controls back to the coordinator $C$. The coordinator then computes and executes the control action $u^i_t = \sum_{\ell=1}^{L} u^{(\ell),i}_t + \breve{u}^i_t$ for each subsystem $i \in N$.

• The eigen actor $A^{(\ell)}$, $\ell \in \{1, \ldots, L\}$, maintains the posterior $p^{(\ell)}_t$ on $\theta^{(\ell)}$ according to (6.27). The actor works in episodes of dynamic length.
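The recursive updates (6.27)–(6.28) that the actors maintain are standard rank-one updates for Bayesian linear regression. A generic sketch for a system $x_{t+1} = \theta^\intercal z_t + w_t$ (function name and the example dimensions are illustrative only):

```python
import numpy as np

def posterior_update(mu, Sigma, z, x_next, sigma2):
    """One step of the updates in (6.27)/(6.28).

    mu:     (dz, dx) posterior mean of theta (one column per state dimension)
    Sigma:  (dz, dz) covariance shared across columns
    z:      (dz,) regressor (state-action pair), x_next: (dx,) observed next state
    sigma2: scalar noise variance
    """
    denom = sigma2 + z @ Sigma @ z
    mu_new = mu + np.outer(Sigma @ z, x_next - mu.T @ z) / denom
    Sigma_new = np.linalg.inv(np.linalg.inv(Sigma) + np.outer(z, z) / sigma2)
    return mu_new, Sigma_new

# Illustrative usage with made-up numbers.
rng = np.random.default_rng(2)
dz, dxs = 4, 3
A0 = rng.normal(size=(dz, dz))
Sigma0 = A0 @ A0.T + dz * np.eye(dz)   # positive-definite prior covariance
mu0 = rng.normal(size=(dz, dxs))
z = rng.normal(size=dz)
x_next = rng.normal(size=dxs)
mu1, Sigma1 = posterior_update(mu0, Sigma0, z, x_next, sigma2=0.5)
```

By the Sherman-Morrison identity, the covariance update is equivalent to $\Sigma_{t+1} = \Sigma_t - \Sigma_t z z^\intercal \Sigma_t / (\sigma^2 + z^\intercal \Sigma_t z)$, which avoids the explicit inverses; the form above mirrors (6.27b) directly.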
Let $t^{(\ell)}_k$ and $T^{(\ell)}_k$ denote the starting time and the length of episode $k$, respectively. Each episode has a minimum length of $T^{(\ell)}_{\min} + 1$, where $T^{(\ell)}_{\min}$ is chosen as described in [103]. Episode $k$ ends if the determinant of the covariance $\Sigma^{(\ell)}_t$ falls below half of its value at time $t^{(\ell)}_k$ (i.e., $\det \Sigma^{(\ell)}_t < \frac{1}{2} \det \Sigma^{(\ell)}_{t^{(\ell)}_k}$) or if the length of the episode is one more than the length of the previous episode (i.e., $t - t^{(\ell)}_k > T^{(\ell)}_{k-1}$). Thus,
$$t^{(\ell)}_{k+1} = \min\Big\{ t > t^{(\ell)}_k + T^{(\ell)}_{\min} \;:\; t - t^{(\ell)}_k > T^{(\ell)}_{k-1} \ \text{or}\ \det \Sigma^{(\ell)}_t < \tfrac{1}{2} \det \Sigma^{(\ell)}_{t^{(\ell)}_k} \Big\}.$$
At the beginning of episode $k$, the eigen actor $A^{(\ell)}$ samples a parameter $\theta^{(\ell)}_k$ according to the posterior distribution $p^{(\ell)}_{t^{(\ell)}_k}$. During episode $k$, the eigen actor $A^{(\ell)}$ generates the eigencontrols using the sampled parameter $\theta^{(\ell)}_k$, i.e., $u^{(\ell)}_t = G^{(\ell)}(\theta^{(\ell)}_k)\, x^{(\ell)}_t$.

• The auxiliary actor $\breve{A}$ is similar to the eigen actor. Actor $\breve{A}$ maintains the posterior $\breve{p}_t$ on $\breve{\theta}$ according to (6.28). The actor works in episodes of dynamic length. The episodes of the auxiliary actor $\breve{A}$ and the eigen actors $A^{(\ell)}$, $\ell \in \{1, \ldots, L\}$, are separate from each other. (The episode count $k$ is used as a local variable for each actor.) Let $\breve{t}_k$ and $\breve{T}_k$ denote the starting time and the length of episode $k$, respectively. Each episode has a minimum length of $\breve{T}_{\min} + 1$, where $\breve{T}_{\min}$ is chosen as described in [103]. The termination condition for each episode is similar to that of the eigen actor $A^{(\ell)}$. In particular,
$$\breve{t}_{k+1} = \min\Big\{ t > \breve{t}_k + \breve{T}_{\min} \;:\; t - \breve{t}_k > \breve{T}_{k-1} \ \text{or}\ \det \breve{\Sigma}_t < \tfrac{1}{2} \det \breve{\Sigma}_{\breve{t}_k} \Big\}.$$
At the beginning of episode $k$, the auxiliary actor $\breve{A}$ samples a parameter $\breve{\theta}_k$ from the posterior distribution $\breve{p}_{\breve{t}_k}$. During episode $k$, the auxiliary actor $\breve{A}$ generates the auxiliary controls using the sampled parameter $\breve{\theta}_k$, i.e., $\breve{u}_t = \breve{G}(\breve{\theta}_k)\, \breve{x}_t$.

Algorithm 3 Net-TSDE
1: initialize eigen actors: $\Theta^{(\ell)}$, $(\mu^{(\ell)}_1, \Sigma^{(\ell)}_1)$, $t^{(\ell)}_0 = -T_{\min}$, $T^{(\ell)}_{-1} = T_{\min}$, $k = 0$, $\theta^{(\ell)}_k = 0$
2: initialize auxiliary actor: $\breve{\Theta}$, $(\breve{\mu}_1, \breve{\Sigma}_1)$, $\breve{t}_0 = -T_{\min}$, $\breve{T}_{-1} = T_{\min}$, $k = 0$, $\breve{\theta}_k = 0$
3: for $t = 1, 2, \ldots$ do
4:   observe $x_t$
5:   compute $\{x^{(\ell)}_t\}_{\ell=1}^{L}$ and $\breve{x}_t$ using (6.9) and (6.10)
6:   for $\ell = 1, 2, \ldots, L$ do
7:     $u^{(\ell)}_t \leftarrow$ eigen-actor($x^{(\ell)}_t$)
8:   $\breve{u}_t \leftarrow$ auxiliary-actor($\breve{x}_t$)
9:   for $i \in N$ do
10:    subsystem $i$ applies the control $u^i_t = \sum_{\ell=1}^{L} u^{(\ell),i}_t + \breve{u}^i_t$

1: function eigen-actor($x^{(\ell)}_t$)
2:   update $p^{(\ell)}_t$ according to (6.27)
3:   if ($t - t^{(\ell)}_k > T_{\min}$) and (($t - t^{(\ell)}_k > T^{(\ell)}_{k-1}$) or ($\det \Sigma^{(\ell)}_t < \frac{1}{2} \det \Sigma^{(\ell)}_{t^{(\ell)}_k}$)) then
4:     $T^{(\ell)}_k \leftarrow t - t^{(\ell)}_k$, $k \leftarrow k + 1$, $t^{(\ell)}_k \leftarrow t$
5:     sample $\theta^{(\ell)}_k \sim p^{(\ell)}_t$
6:   return $G^{(\ell)}(\theta^{(\ell)}_k)\, x^{(\ell)}_t$

1: function auxiliary-actor($\breve{x}_t$)
2:   update $\breve{p}_t$ according to (6.28)
3:   if ($t - \breve{t}_k > T_{\min}$) and (($t - \breve{t}_k > \breve{T}_{k-1}$) or ($\det \breve{\Sigma}_t < \frac{1}{2} \det \breve{\Sigma}_{\breve{t}_k}$)) then
4:     $\breve{T}_k \leftarrow t - \breve{t}_k$, $k \leftarrow k + 1$, $\breve{t}_k \leftarrow t$
5:     sample $\breve{\theta}_k \sim \breve{p}_t$
6:   return $\breve{G}(\breve{\theta}_k)\, \breve{x}_t$

Note that the algorithm does not depend on the horizon $T$.

6.4.4 Regret bounds

We impose the following assumption to ensure that the closed-loop dynamics of the eigenstates and the auxiliary states of each subsystem are stable.

(A8) There exists a $\delta \in (0, 1)$ such that

• for any $\ell \in \{1, \ldots, L\}$ and $\theta^{(\ell)}, \phi^{(\ell)} \in \Theta^{(\ell)}$, where $(\theta^{(\ell)})^\intercal = [A^{(\ell)}_{\theta^{(\ell)}}, B^{(\ell)}_{\theta^{(\ell)}}]$, we have $\rho\big(A^{(\ell)}_{\theta^{(\ell)}} + B^{(\ell)}_{\theta^{(\ell)}} G^{(\ell)}(\phi^{(\ell)})\big) \le \delta$;

• for any $\breve{\theta}, \breve{\phi} \in \breve{\Theta}$, where $\breve{\theta}^\intercal = [A_{\breve{\theta}}, B_{\breve{\theta}}]$, we have $\rho\big(A_{\breve{\theta}} + B_{\breve{\theta}} \breve{G}(\breve{\phi})\big) \le \delta$.

This assumption is similar to an assumption made in [103] for TS for LQG systems.
According to [104, Lemma 1] (see also [67, Theorem 11]), (A8) is satisfied if
$$\Theta^{(\ell)} = \{\theta^{(\ell)} \in \mathbb{R}^{(d_x + d_u) \times d_x} : \|\theta^{(\ell)} - \theta^{(\ell)}_\circ\| \le \varepsilon^{(\ell)}\} \quad\text{and}\quad \breve{\Theta} = \{\breve{\theta} \in \mathbb{R}^{(d_x + d_u) \times d_x} : \|\breve{\theta} - \breve{\theta}_\circ\| \le \breve{\varepsilon}\},$$
where $\theta^{(\ell)}_\circ$ and $\breve{\theta}_\circ$ are stabilizable and $\varepsilon^{(\ell)}$ and $\breve{\varepsilon}$ are sufficiently small. In other words, the assumption holds when the true system is in a small neighborhood of a known nominal system. Such a small neighborhood can be learned with high probability by running appropriate stabilizing procedures for a finite time [67, 104].

The following result provides an upper bound on the regret of the proposed algorithm.

Theorem 6.2. Under (A1)–(A8), the regret of Net-TSDE is upper bounded as follows:
$$R(T; \text{Net-TSDE}) \le \tilde{O}\big( \alpha^G \sigma_w^2\, d_x^{0.5} (d_x + d_u) \sqrt{T} \big),$$
where $\alpha^G = \sum_{\ell=1}^{L} q^{(\ell)} + q_0 (n - L)$. See Section 6.5 for the proof.

Remark 6.1. The term $\alpha^G$ in the regret bound partially captures the impact of the network on the regret. The coefficients $r_0$ and $\{r^{(\ell)}\}_{\ell=1}^{L}$ also depend on the network and also affect the regret, but their dependence is hidden inside the $\tilde{O}(\cdot)$ notation. It is possible to explicitly characterize this dependence, but doing so does not provide any additional insight. We discuss the impact of the network coupling on the regret in Section 6.6 via some examples.

Remark 6.2. The regret per subsystem is given by $R(T; \text{Net-TSDE})/n$, which is proportional to
$$\frac{\alpha^G}{n} = O\Big(\frac{L}{n}\Big) + O\Big(\frac{n - L}{n}\Big) = O\Big(1 + \frac{L}{n}\Big).$$
Thus, the regret per subsystem scales as $O(1 + L/n)$. In contrast, for the standard TSDE algorithm [70, 103], the regret per subsystem is proportional to $\alpha^G(\text{TSDE})/n = O(n^{0.5})$. This clearly illustrates the benefit of the proposed learning algorithm.

6.5 Regret analysis

For ease of notation, we simply write $R(T)$ instead of $R(T; \text{Net-TSDE})$ in this section. Based on (6.15) and Theorem 6.1, the regret may be decomposed as
$$R(T) = \sum_{i \in N} q_0 \breve{R}^i(T) + \sum_{i \in N} \sum_{\ell=1}^{L} q^{(\ell)} R^{(\ell),i}(T), \qquad (6.29)$$
where
$$\breve{R}^i(T) := \mathbb{E}\Big[ \sum_{t=1}^{T} \breve{c}(\breve{x}^i_t, \breve{u}^i_t) - T \breve{J}^i(\breve{\theta}) \Big],$$
and, for $\ell \in \{1, \ldots, L\}$,
$$R^{(\ell),i}(T) := \mathbb{E}\Big[ \sum_{t=1}^{T} c^{(\ell)}(x^{(\ell),i}_t, u^{(\ell),i}_t) - T J^{(\ell),i}(\theta^{(\ell)}) \Big].$$
Based on the discussion at the beginning of Sec. 6.3.2, $q_0 \breve{R}^i(T)$, $i \in N$, is the regret associated with auxiliary system $i$, and $q^{(\ell)} R^{(\ell),i}(T)$, $\ell \in \{1, \ldots, L\}$ and $i \in N$, is the regret associated with eigensystem $(\ell, i)$. We now bound $\breve{R}^i(T)$ and $R^{(\ell),i}(T)$ separately.

6.5.1 Bound on $R^{(\ell),i}(T)$

Fix $\ell \in \{1, \ldots, L\}$. For the component $i^{(\ell)}_*$, the Net-TSDE algorithm is exactly the same as the variation of the TSDE algorithm of [70] presented in [103]. Therefore, from [103, Theorem 1], it follows that
$$R^{(\ell),i^{(\ell)}_*}(T) \le \tilde{O}\big( (\sigma^{(\ell),i^{(\ell)}_*})^2 d_x^{0.5} (d_x + d_u) \sqrt{T} \big). \qquad (6.30)$$
We now show that the regret of the other eigensystems $(\ell, i)$ with $i \neq i^{(\ell)}_*$ satisfies a similar bound.

Lemma 6.2. The regret of eigensystem $(\ell, i)$, $\ell \in \{1, \ldots, L\}$ and $i \in N$, is bounded as follows:
$$R^{(\ell),i}(T) \le \tilde{O}\big( (\sigma^{(\ell),i})^2 d_x^{0.5} (d_x + d_u) \sqrt{T} \big). \qquad (6.31)$$
Proof. Fix $\ell \in \{1, \ldots, L\}$. Recall from (6.9) that $x^{(\ell)}_t = x_t v^{(\ell)} (v^{(\ell)})^\intercal$. Therefore, for any $i \in N$,
$$x^{(\ell),i}_t = x_t v^{(\ell)} v^{(\ell),i} = v^{(\ell),i}\, x_t v^{(\ell)},$$
where the last equality follows because $v^{(\ell),i}$ is a scalar. Since we are using the same gain $G^{(\ell)}(\theta^{(\ell)}_k)$ for all agents $i \in N$, we have
$$u^{(\ell),i}_t = G^{(\ell)}(\theta^{(\ell)}_k)\, x^{(\ell),i}_t = v^{(\ell),i}\, G^{(\ell)}(\theta^{(\ell)}_k)\, x_t v^{(\ell)}.$$
Thus, recalling that $i^{(\ell)}_*$ is chosen such that $v^{(\ell),i^{(\ell)}_*} \neq 0$, we can write, for all $i \in N$,
$$x^{(\ell),i}_t = \frac{v^{(\ell),i}}{v^{(\ell),i^{(\ell)}_*}}\, x^{(\ell),i^{(\ell)}_*}_t \quad\text{and}\quad u^{(\ell),i}_t = \frac{v^{(\ell),i}}{v^{(\ell),i^{(\ell)}_*}}\, u^{(\ell),i^{(\ell)}_*}_t.$$
Thus, for any $i \in N$,
$$c^{(\ell)}(x^{(\ell),i}_t, u^{(\ell),i}_t) = \Big( \frac{v^{(\ell),i}}{v^{(\ell),i^{(\ell)}_*}} \Big)^2 c^{(\ell)}(x^{(\ell),i^{(\ell)}_*}_t, u^{(\ell),i^{(\ell)}_*}_t). \qquad (6.32)$$
Moreover, from (6.22), we have
$$J^{(\ell),i}(\theta^{(\ell)}) = \Big( \frac{v^{(\ell),i}}{v^{(\ell),i^{(\ell)}_*}} \Big)^2 J^{(\ell),i^{(\ell)}_*}(\theta^{(\ell)}). \qquad (6.33)$$
Combining (6.32) and (6.33), we get
$$R^{(\ell),i}(T) = \Big( \frac{v^{(\ell),i}}{v^{(\ell),i^{(\ell)}_*}} \Big)^2 R^{(\ell),i^{(\ell)}_*}(T).$$
Substituting the bound for $R^{(\ell),i^{(\ell)}_*}(T)$ from (6.30) and observing that $(v^{(\ell),i}/v^{(\ell),i^{(\ell)}_*})^2 = (\sigma^{(\ell),i}/\sigma^{(\ell),i^{(\ell)}_*})^2$ gives the result.
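The proportionality across agents used in the proof above is easy to confirm numerically (made-up vectors, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(3)
n, dx = 6, 3
v = rng.normal(size=n)
v /= np.linalg.norm(v)                 # a normalized eigenvector v^(l)
x = rng.normal(size=(dx, n))           # global state

X_l = x @ np.outer(v, v)               # eigenstate x^(l) = x v v^T, cf. (6.9)
i_star = int(np.argmax(np.abs(v)))     # an index i* with v^(l),i* != 0

# Column i of x^(l) is (v_i / v_{i*}) times column i*, as in the proof.
ratios = np.array([v[i] / v[i_star] for i in range(n)])
```

Every column of $x^{(\ell)}_t$ is the same vector $x_t v^{(\ell)}$ scaled by $v^{(\ell),i}$, which is exactly why the per-agent regrets differ only by the factor $(v^{(\ell),i}/v^{(\ell),i^{(\ell)}_*})^2$.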
6.5.2 Bound on $\breve{R}^i(T)$

The update of the posterior $\breve{p}_t$ on $\breve{\theta}$ does not depend on the history of states and actions of any fixed agent $i$. Therefore, we cannot directly use the argument presented in [103] to bound the regret $\breve{R}^i(T)$; instead, we derive a bound from first principles below. For ease of notation, for any episode $k$, we use $\breve{G}_k$ and $\breve{S}_k$ to denote $\breve{G}(\breve{\theta}_k)$ and $\breve{S}(\breve{\theta}_k)$, respectively. From LQ optimal control theory [5], we know that the average cost $\breve{J}^i(\breve{\theta}_k)$ and the optimal policy $\breve{u}^i_t = \breve{G}_k \breve{x}^i_t$ for the model parameter $\breve{\theta}_k$ satisfy the following Bellman equation:
$$\breve{J}^i(\breve{\theta}_k) + (\breve{x}^i_t)^\intercal \breve{S}_k \breve{x}^i_t = \breve{c}(\breve{x}^i_t, \breve{u}^i_t) + \mathbb{E}\Big[ \big( \breve{\theta}^\intercal_k \breve{z}^i_t + \breve{w}^i_t \big)^\intercal \breve{S}_k \big( \breve{\theta}^\intercal_k \breve{z}^i_t + \breve{w}^i_t \big) \Big].$$
Adding and subtracting $\mathbb{E}[(\breve{x}^i_{t+1})^\intercal \breve{S}_k \breve{x}^i_{t+1} \mid \breve{z}^i_t]$ and noting that $\breve{x}^i_{t+1} = \breve{\theta}^\intercal \breve{z}^i_t + \breve{w}^i_t$, we get
$$\breve{c}(\breve{x}^i_t, \breve{u}^i_t) = \breve{J}^i(\breve{\theta}_k) + (\breve{x}^i_t)^\intercal \breve{S}_k \breve{x}^i_t - \mathbb{E}\big[ (\breve{x}^i_{t+1})^\intercal \breve{S}_k \breve{x}^i_{t+1} \mid \breve{z}^i_t \big] + (\breve{\theta}^\intercal \breve{z}^i_t)^\intercal \breve{S}_k (\breve{\theta}^\intercal \breve{z}^i_t) - (\breve{\theta}^\intercal_k \breve{z}^i_t)^\intercal \breve{S}_k (\breve{\theta}^\intercal_k \breve{z}^i_t). \qquad (6.34)$$
Let $\breve{K}_T$ denote the number of episodes of the auxiliary actor until horizon $T$. For each $k > \breve{K}_T$, we define $\breve{t}_k$ to be $T + 1$. Then, using (6.34), we have that for any agent $i$,
$$\breve{R}^i(T) = \underbrace{\mathbb{E}\Big[ \sum_{k=1}^{\breve{K}_T} \breve{T}_k \breve{J}^i(\breve{\theta}_k) - T \breve{J}^i(\breve{\theta}) \Big]}_{\text{regret due to sampling error} \;=:\; \breve{R}^i_0(T)} + \underbrace{\mathbb{E}\Big[ \sum_{k=1}^{\breve{K}_T} \sum_{t=\breve{t}_k}^{\breve{t}_{k+1}-1} \big[ (\breve{x}^i_t)^\intercal \breve{S}_k \breve{x}^i_t - (\breve{x}^i_{t+1})^\intercal \breve{S}_k \breve{x}^i_{t+1} \big] \Big]}_{\text{regret due to time-varying controller} \;=:\; \breve{R}^i_1(T)} + \underbrace{\mathbb{E}\Big[ \sum_{k=1}^{\breve{K}_T} \sum_{t=\breve{t}_k}^{\breve{t}_{k+1}-1} \big[ (\breve{\theta}^\intercal \breve{z}^i_t)^\intercal \breve{S}_k (\breve{\theta}^\intercal \breve{z}^i_t) - (\breve{\theta}^\intercal_k \breve{z}^i_t)^\intercal \breve{S}_k (\breve{\theta}^\intercal_k \breve{z}^i_t) \big] \Big]}_{\text{regret due to model mismatch} \;=:\; \breve{R}^i_2(T)}. \qquad (6.35)$$
Lemma 6.3. The terms in (6.35) are bounded as follows:

1. $\breve{R}^i_0(T) \le \tilde{O}\big( (\breve{\sigma}^i)^2 (d_x + d_u)^{0.5} \sqrt{T} \big)$.

2. $\breve{R}^i_1(T) \le \tilde{O}\big( (\breve{\sigma}^i)^2 (d_x + d_u)^{0.5} \sqrt{T} \big)$.

3. $\breve{R}^i_2(T) \le \tilde{O}\big( (\breve{\sigma}^i)^2 d_x^{0.5} (d_x + d_u) \sqrt{T} \big)$.

Combining these three bounds, we get
$$\breve{R}^i(T) \le \tilde{O}\big( (\breve{\sigma}^i)^2 d_x^{0.5} (d_x + d_u) \sqrt{T} \big). \qquad (6.36)$$
See the Appendix for the proof.
6.5.3 Proof of Theorem 6.2

For ease of notation, let $R^* = \tilde O(d_x^{0.5}(d_x + d_u)\sqrt{T})$. Then, by substituting the results of Lemmas 6.2 and 6.3 into (6.29), we get
$$R(T) \le \sum_{i \in N} q_0 (\breve\sigma^i)^2 R^* + \sum_{i \in N} \sum_{\ell=1}^{L} q^{(\ell)} (\sigma^{(\ell),i})^2 R^* \overset{(a)}{=} \sum_{i \in N} q_0 (\breve v^i)^2 \sigma_w^2 R^* + \sum_{i \in N} \sum_{\ell=1}^{L} q^{(\ell)} (v^{(\ell),i})^2 \sigma_w^2 R^* \overset{(b)}{=} \Big( q_0 (n - L) + \sum_{\ell=1}^{L} q^{(\ell)} \Big) \sigma_w^2 R^*, \tag{6.37}$$
where (a) follows from (6.24) and (b) follows from observing that $\sum_{i \in N} (v^{(\ell),i})^2 = 1$ and therefore $\sum_{i \in N} (\breve v^i)^2 = n - L$. Eq. (6.37) establishes the result of Theorem 6.2.

6.6 Some examples

6.6.1 Mean-field system

Consider a complete graph $G$ where all edge weights are equal to $1/n$. Let $M$ be the adjacency matrix of the graph, i.e., $M = \frac{1}{n} \mathbf{1}_{n \times n}$. Thus, the system dynamics are given by
$$x^i_{t+1} = A x^i_t + B u^i_t + D \bar x_t + E \bar u_t + w^i_t,$$
where $\bar x_t = \frac{1}{n} \sum_{i \in N} x^i_t$ and $\bar u_t = \frac{1}{n} \sum_{i \in N} u^i_t$. Suppose $K_x = K_u = 1$, $q_0 = r_0 = 1/n$, and $q_1 = r_1 = \kappa/n$, where $\kappa$ is a positive constant. In this case, $M$ has rank $L = 1$, the non-zero eigenvalue of $M$ is $\lambda^{(1)} = 1$, the corresponding normalized eigenvector is $\frac{1}{\sqrt{n}} \mathbf{1}_{n \times 1}$, and $q^{(1)} = r^{(1)} = q_0 + q_1 = (1 + \kappa)/n$. The eigenstate is given by $x^1_t = [\bar x_t, \ldots, \bar x_t]$ and a similar structure holds for the eigencontrol $u^1_t$. The per-step cost can be written as (see (6.15))
$$c(x_t, u_t) = (1 + \kappa)\big[ \bar x_t^\intercal Q \bar x_t + \bar u_t^\intercal R \bar u_t \big] + \frac{1}{n} \sum_{i \in N} \big[ (x^i_t - \bar x_t)^\intercal Q (x^i_t - \bar x_t) + (u^i_t - \bar u_t)^\intercal R (u^i_t - \bar u_t) \big].$$
Thus, the system is similar to the mean-field team system investigated in [77]. For this model, the network-dependent constant $\alpha^G$ in the regret bound of Theorem 6.2 is given by
$$\alpha^G = 1 + \frac{\kappa}{n} = O\Big(1 + \frac{1}{n}\Big).$$
Thus, for the mean-field system, the regret of Net-TSDE scales as $O(1 + \frac{1}{n})$ with the number of agents. This is consistent with the discussion following Theorem 6.2. We test these conclusions via numerical simulations of a scalar mean-field model with $d_x = d_u = 1$, $\sigma_w^2 = 1$, $A = 1$, $B = 0.3$, $D = 0.5$, $E = 0.2$, $Q = 1$, $R = 1$, and $\kappa = 0.5$.
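A minimal simulation of the scalar mean-field dynamics above is sketched below. For brevity, simple deadbeat gains are used as illustrative stand-ins for the gains that Net-TSDE would learn; this is not the full learning algorithm, only the mean-field / deviation structure of the closed loop:

```python
import numpy as np

# Sketch of the scalar mean-field dynamics with A=1, B=0.3, D=0.5, E=0.2.
# The deadbeat gains below are illustrative placeholders, not Net-TSDE gains.
rng = np.random.default_rng(1)
A, B, D, E = 1.0, 0.3, 0.5, 0.2
n, T = 10, 200
G_bar = -(A + D) / (B + E)     # drives the mean-field xbar to zero
G_dev = -A / B                  # drives the deviations (x^i - xbar) to zero
x = rng.standard_normal(n)
for t in range(T):
    x_bar = x.mean()
    u = G_dev * (x - x_bar) + G_bar * x_bar   # u^i = G(x^i - xbar) + Gbar xbar
    x = A * x + B * u + D * x.mean() + E * u.mean() \
        + 0.1 * rng.standard_normal(n)        # small process noise
assert np.all(np.abs(x) < 10)                 # closed loop stays bounded
```

The mean-field $\bar x_t$ evolves with $(A + D, B + E)$ and the deviations with $(A, B)$, which is exactly the decomposition that the eigen/auxiliary split exploits.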
The uncertain sets are chosen as $\Theta^{(1)} = \{\theta^{(1)} \in \mathbb{R}^2 : |A + D + (B + E) G^{(1)}(\theta^{(1)})| < \delta\}$ and $\breve\Theta = \{\breve\theta \in \mathbb{R}^2 : |A + B \breve G(\breve\theta)| < \delta\}$, where $\delta = 0.99$. The prior over these uncertain sets is chosen according to (A6)–(A7), where $\breve\mu_1 = \mu^{(1)}_1 = [1, 1]^\intercal$ and $\breve\Sigma_1 = \Sigma^{(1)}_1 = I$. We set $T_{\min} = 0$ in Net-TSDE. The system is simulated for a horizon of $T = 5000$ and the expected regret $R(T)$, averaged over 500 sample trajectories, is shown in Fig. 6.2. As expected, the regret scales as $\tilde O(\sqrt{T})$ with time and $O(1 + \frac{1}{n})$ with the number of agents.

[Figure 6.1: $R(T)/\sqrt{T}$ vs. $T$. Figure 6.2: Regret for the mean-field system.]

6.6.2 A general low-rank network

[Figure 6.3: Graph $G^\circ$ with $n = 4$ nodes and its adjacency matrix
$$M^\circ = \begin{bmatrix} 0 & a & 0 & b \\ a & 0 & a & 0 \\ 0 & a & 0 & b \\ b & 0 & b & 0 \end{bmatrix}.]$$

We consider a network with $4n$ nodes given by the graph $G = G^\circ \otimes C_n$, where $G^\circ$ is the 4-node graph shown in Fig. 6.3 and $C_n$ is the complete graph with $n$ nodes and each edge weight equal to $\frac{1}{n}$. Let $M$ be the adjacency matrix of $G$, which is given by $M = M^\circ \otimes \frac{1}{n}\mathbf{1}_{n \times n}$, where $M^\circ$ is the adjacency matrix of $G^\circ$ shown in Fig. 6.3. Moreover, suppose $K_x = 2$ with $q_0 = 1$, $q_1 = -2$, and $q_2 = 1$, and $K_u = 0$ with $r_0 = 1$. Note that the cost is not normalized per agent. In this case, the rank of $M^\circ$ is 2 with eigenvalues $\pm\rho$, where $\rho = \sqrt{2(a^2 + b^2)}$, and the rank of $\frac{1}{n}\mathbf{1}_{n \times n}$ is 1 with eigenvalue 1. Thus, $M = M^\circ \otimes \frac{1}{n}\mathbf{1}_{n \times n}$ has the same non-zero eigenvalues as $M^\circ$, given by $\lambda^{(1)} = \rho$ and $\lambda^{(2)} = -\rho$. Further, $q^{(\ell)} = (1 - \lambda^{(\ell)})^2$ and $r^{(\ell)} = 1$, for $\ell \in \{1, 2\}$. We assume that $a^2 + b^2 \ne 0.5$, so that the model satisfies (A3).

[Figure 6.4: $R(T)/\sqrt{T}$ vs. $T$. Figure 6.5: $R(T)/\sqrt{T}$ vs. $n$.]

For this model, the scaling parameter $\alpha^G$ in the regret bound of Theorem 6.2 is given by
$$\alpha^G = (1 - \rho)^2 + (1 + \rho)^2 + (4n - 2) = 4n + 2\rho^2.$$
Recall that $\rho^2 = (\lambda^{(1)})^2 = (\lambda^{(2)})^2$. Thus, $\alpha^G$ has an explicit dependence on the square of the eigenvalues and on the number of nodes. We verify this relationship via numerical simulations.
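The claimed spectrum of $M = M^\circ \otimes \frac{1}{n}\mathbf{1}_{n \times n}$ and the resulting constant $\alpha^G$ can be confirmed directly with numpy. The sketch below uses the paper's case $a = b = 0.05$ and an arbitrary $n = 10$:

```python
import numpy as np

# Verify: non-zero eigenvalues of M = M0 kron (1/n) 11^T are +/- rho,
# and alpha_G = (1-rho)^2 + (1+rho)^2 + (4n-2) = 4n + 2 rho^2.
a, b, n = 0.05, 0.05, 10
M0 = np.array([[0, a, 0, b],
               [a, 0, a, 0],
               [0, a, 0, b],
               [b, 0, b, 0]], dtype=float)
M = np.kron(M0, np.ones((n, n)) / n)
rho = np.sqrt(2 * (a ** 2 + b ** 2))
eig = np.sort(np.linalg.eigvalsh(M))
assert np.isclose(eig[0], -rho) and np.isclose(eig[-1], rho)
assert np.allclose(eig[1:-1], 0)             # all other eigenvalues vanish
alpha_G = (1 - rho) ** 2 + (1 + rho) ** 2 + (4 * n - 2)
assert np.isclose(alpha_G, 4 * n + 2 * rho ** 2)
```

Since the Kronecker factor $\frac{1}{n}\mathbf{1}_{n \times n}$ has eigenvalues $\{1, 0, \ldots, 0\}$, the products with the eigenvalues $\pm\rho$ of $M^\circ$ give exactly the two non-zero eigenvalues of $M$.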
We consider the graph above with two choices of the parameters $(a, b)$: (i) $a = b = 0.05$ and (ii) $a = b = 5$. For both cases, we consider a scalar system with the same parameters as the mean-field system considered in Sec. 6.6.1. The regret for both cases with different numbers of agents $4n \in \{4, 40, 80, 100\}$ is shown in Fig. 6.5. As expected, the regret scales as $\tilde O(\sqrt{T})$ with time and $O(4n + 2\rho^2)$ with the number of agents.

6.7 Conclusion

We consider the problem of controlling an unknown LQG system consisting of multiple subsystems connected over a network. By utilizing a spectral decomposition technique, we decompose the coupled subsystems into eigen and auxiliary systems. We propose a TS-based learning algorithm, Net-TSDE, which maintains separate posterior distributions on the unknown parameters $\theta^{(\ell)}$, $\ell \in \{1, \ldots, L\}$, and $\breve\theta$ associated with the eigen and auxiliary systems, respectively. For each eigensystem, Net-TSDE learns the unknown parameter $\theta^{(\ell)}$ and controls the system in a manner similar to the TSDE algorithm for single-agent LQG systems proposed in [70, 103]. Consequently, the regret for each eigensystem can be bounded using the results of [70, 103]. However, the part of the Net-TSDE algorithm that performs learning and control for the auxiliary system has an agent selection step and thus requires additional analysis to bound its regret. Combining the regret bounds for the eigen and auxiliary systems shows that the total expected regret of Net-TSDE is upper bounded by $\tilde O(n d_x^{0.5} (d_x + d_u) \sqrt{T})$. The empirically observed scaling of the regret with respect to the time horizon $T$ and the number of subsystems $n$ in our numerical experiments agrees with this theoretical upper bound. The results presented in this paper rely on the spectral decomposition developed in [93].
A limitation of this decomposition is that the local dynamics (i.e., the $(A, B)$ matrices) are assumed to be identical for all subsystems and the coupling matrix $M$ is symmetric. Interesting generalizations overcoming these limitations include settings where (i) there are multiple types of subsystems and the $(A, B)$ matrices are the same for subsystems of the same type but different across types; (ii) the coupling matrix $M$ is not symmetric; and (iii) the subsystems are not identical but approximately identical, i.e., there are nominal dynamics $(A^\circ, B^\circ)$ and the local dynamics $(A^i, B^i)$ of subsystem $i$ lie in a small neighborhood of $(A^\circ, B^\circ)$. It may be possible to extend the decomposition in [93] and the learning algorithm of this paper to handle cases (i) and (ii). For case (iii), it may be possible to approximate the non-identical subsystems by identical subsystems; however, such an approximation may lead to regret that grows linearly in time due to the approximation error. The decomposition in [93] exploits the fact that the dynamics and the cost couplings have the same spectrum (i.e., the same orthonormal eigenvectors). It is also possible to consider learning algorithms that exploit other features of the network, such as sparsity in the case of networked MDPs [101, 102].

Chapter 7

Future Directions

In this chapter, we discuss some questions of interest based on the topics studied in this report.

7.1 Model-free reinforcement learning approach for multi-agent systems

In Chapter 2, we delved into a scenario where each agent possesses private observations of its own type and public observations of other agents' actions, with the agents collaborating to minimize their collective cost. This scenario falls within the domain of decentralized control with partial information, presenting significant challenges in updating belief states without full knowledge of the system dynamics.
Therefore, we are prompted to ask the following question: Is it possible to update belief states without full knowledge of the system dynamics?

Our approach in Chapter 2 embraced the common information paradigm, wherein a fictitious coordinator selects the best policy based on beliefs inferred from the agents' current private states. These beliefs are updated independently for each agent using their action histories. However, updating belief states without complete knowledge of the system dynamics remains a challenge. We aim to explore model-free reinforcement learning algorithms tailored to optimize policies within a multi-agent system comprising $N$ cooperative agents. To achieve this goal, we can leverage particle filters, specifically the bootstrap filter, distributed across agents for belief updates. This approach would enable us to devise a model-free reinforcement learning method for multi-agent partially observable Markov decision processes (POMDPs), leveraging particle filters and sampled trajectories to estimate optimal policies for the agents.

We propose conducting a comparative analysis between model-based and model-free implementations of the reinforcement learning algorithm, highlighting the effectiveness of the particle filter method. Looking forward, there are promising directions for future research:

Complexity analysis: Conducting a thorough complexity analysis of the algorithm, particularly when dealing with a large number of agents ($N$), would be advantageous. Understanding how computational demands scale with the number of agents is crucial for real-world applicability.

By delving deeper into these areas, we can enhance our understanding of decentralized control problems with imperfect information and contribute to the development of more efficient and scalable solutions.

7.2 Decentralized Learning

When developing learning algorithms for large-scale systems, it is preferable for them to operate in a fully decentralized manner.
In Chapter 5, we introduced a TS-based learning algorithm named TSDE-MF for a mean-field LQ learning problem. Although TSDE-MF does not require complete state information at each agent, it is not entirely decentralized. At each time $t$, every agent requires knowledge of the relative state-action tuple history $\{(\breve x^{i^*_s}_s, \breve u^{i^*_s}_s, \breve x^{i^*_s}_{s+1})\}_{1 \le s < t}$ to evaluate the posterior distribution $\breve p_t$. This necessitates coordination among agents to compute $i^*_t$ and to communicate the tuple of agent $i^*_t$ at time $t$. Hence, the following question arises: Is it feasible to devise a fully decentralized learning algorithm for the mean-field learning problem with provable regret bounds?

Below, we outline a decentralized TS algorithm akin to TSDE-MF. Suppose the information available to agent $i$ at time $t$ is $I^i_t = \{x^i_{1:t}, \bar x_{1:t}, u^i_{1:t-1}, \bar u_{1:t-1}\}$. Each agent starts with the prior distribution $\breve p_1$ on $\breve\theta$. Agent $i$ updates a posterior over $\breve\theta$ at each time $t$ as follows:
$$\breve p^i_t(\breve\theta) = \mathbb{P}(\breve\theta \in \breve\Theta \mid \breve x^i_{1:t}, \breve u^i_{1:t-1}). \tag{7.1}$$
Similarly, each agent starts with the prior distribution $\bar p_1$ on $\bar\theta$ and updates a posterior over $\bar\theta$ as follows:
$$\bar p_t(\bar\Theta) = \mathbb{P}(\bar\theta \in \bar\Theta \mid \bar x_{1:t}, \bar u_{1:t-1}). \tag{7.2}$$
Whenever the sampling condition for $\bar p_t$ holds, agent $i$ generates a sample $\bar\theta_k$ from $\bar p_t$. Similarly, when the sampling condition for $\breve p^i_t$ holds, agent $i$ generates a sample $\breve\theta^i_k$ from $\breve p^i_t$. Agent $i$ then computes its control action as
$$\bar u_t = G(\bar\theta_{k_0}) \bar x_t, \qquad \breve u^i_t = G(\breve\theta^i_{k_i}) \breve x^i_t, \qquad u^i_t = \breve u^i_t + \bar u_t.$$
This learning scheme operates in a decentralized manner. However, it remains uncertain whether theoretical guarantees on the regret of this scheme can be obtained; this is an intriguing avenue for future research.
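The per-agent posterior update in (7.1) can be sketched for a scalar system with a standard Gaussian conjugate (Kalman-form) update. Everything below is illustrative: the true parameter, the noise variance, and the exploring input are hypothetical choices, not part of the TSDE-MF model:

```python
import numpy as np

# Sketch of a per-agent posterior update for x_{t+1} = theta^T z_t + w_t,
# z_t = [x_t, u_t], with known noise variance (illustrative values).
rng = np.random.default_rng(2)
theta_true = np.array([0.9, 0.3])        # hypothetical true (A, B)
mu, Sigma = np.zeros(2), np.eye(2)       # Gaussian prior on theta
x, sigma2 = 0.0, 1.0                     # state and noise variance
for t in range(500):
    u = -0.5 * x + rng.standard_normal()     # stabilizing + exploring input
    z = np.array([x, u])                     # regressor z_t = [x_t, u_t]
    x = theta_true @ z + rng.standard_normal()  # next state x_{t+1}
    # rank-one conjugate update of (mu, Sigma) with the pair (z_t, x_{t+1})
    K = Sigma @ z / (sigma2 + z @ Sigma @ z)
    mu = mu + K * (x - z @ mu)
    Sigma = Sigma - np.outer(K, z) @ Sigma
assert np.linalg.norm(mu - theta_true) < 0.3   # posterior mean approaches theta
```

Each agent running this recursion on its own $(\breve x^i, \breve u^i)$ history needs no communication, which is exactly what makes the scheme decentralized; the open question is whether the resulting per-agent samples still admit a sublinear regret bound.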
Publications

• Sagar Sudhakara, Dhruva Karthik, Rahul Jain, and Ashutosh Nayyar, "Optimal communication and control strategies in a cooperative multi-agent MDP problem", IEEE Transactions on Automatic Control, 2024.

• Sagar Sudhakara and Ashutosh Nayyar, "Optimal Symmetric Strategies in Multi-Agent Systems with Decentralized Information", accepted in 2023 62nd IEEE Conference on Decision and Control (CDC), Marina Bay Sands, Singapore. IEEE, 2023.

• Dhruva Karthik, Sagar Sudhakara, Rahul Jain, and Ashutosh Nayyar, "Optimal Communication and Control Strategies for a Multi-Agent System in the Presence of an Adversary", accepted in 2022 61st IEEE Conference on Decision and Control (CDC), Cancún, Mexico. IEEE, 2022.

• Mukul Gagrani, Sagar Sudhakara, Aditya Mahajan, Ashutosh Nayyar, and Yi Ouyang, "A modified Thompson sampling-based learning algorithm for unknown linear systems", accepted in 2022 61st IEEE Conference on Decision and Control (CDC), Cancún, Mexico. IEEE, 2022.

• Sagar Sudhakara, Aditya Mahajan, Ashutosh Nayyar, and Yi Ouyang, "Scalable regret for learning to control network-coupled subsystems with unknown dynamics", IEEE Transactions on Control of Network Systems, 2021.

• Sagar Sudhakara, Dhruva Karthik, Rahul Jain, and Ashutosh Nayyar, "Optimal communication and control strategies in a multi-agent MDP problem", 2021.

• Mukul Gagrani, Sagar Sudhakara, Aditya Mahajan, Ashutosh Nayyar, and Yi Ouyang, "A relaxed technical assumption for posterior sampling-based reinforcement learning for control of unknown linear systems", 2021.

• Mukul Gagrani, Sagar Sudhakara, Yi Ouyang, Aditya Mahajan, and Ashutosh Nayyar, "Thompson sampling for linear quadratic mean-field teams", in Proc. 2021 60th IEEE Conference on Decision and Control (CDC), Austin, USA, pp. 720–727. IEEE, 2021.

Bibliography

[1] K.-D. Kim and P. R. Kumar, "Cyber–physical systems: A perspective at the centennial," Proceedings of the IEEE, vol. 100, no. Special Centennial Issue, pp. 1287–1308, 2012.

[2] J. P. Hespanha, P.
Naghshtabrizi, and Y. Xu, "A survey of recent results in networked control systems," Proceedings of the IEEE, vol. 95, no. 1, pp. 138–162, 2007.

[3] R. A. Gupta and M.-Y. Chow, "Networked control system: Overview and research trends," IEEE Transactions on Industrial Electronics, vol. 57, no. 7, pp. 2527–2535, 2010.

[4] A. Nayyar, A. Mahajan, and D. Teneketzis, "Decentralized stochastic control with partial history sharing: A common information approach," IEEE Transactions on Automatic Control, vol. 58, no. 7, pp. 1644–1658, 2013.

[5] P. R. Kumar and P. Varaiya, Stochastic Systems: Estimation, Identification, and Adaptive Control, vol. 75. SIAM, 2015.

[6] K. J. Astrom and B. Wittenmark, Adaptive Control. Addison-Wesley Longman Publishing Co., Inc., 1994.

[7] S. Sastry and M. Bodson, Adaptive Control: Stability, Convergence, and Robustness. Prentice-Hall, Inc., 1989.

[8] K. Narendra and A. Annaswamy, Stable Adaptive Systems. Prentice-Hall, Inc., 1989.

[9] A. N. Burnetas and M. N. Katehakis, "Optimal adaptive policies for Markov decision processes," Mathematics of Operations Research, vol. 22, no. 1, pp. 222–255, 1997.

[10] T. Lattimore and C. Szepesvári, "Bandit algorithms," preprint, 2018.

[11] W. R. Thompson, "On the likelihood that one unknown probability exceeds another in view of the evidence of two samples," Biometrika, vol. 25, no. 3/4, pp. 285–294, 1933.

[12] S. L. Scott, "A modern Bayesian look at the multi-armed bandit," Applied Stochastic Models in Business and Industry, vol. 26, no. 6, pp. 639–658, 2010.

[13] O. Chapelle and L. Li, "An empirical evaluation of Thompson sampling," in NIPS, 2011.

[14] J.-M. Lasry and P.-L. Lions, "Mean field games," Japanese Journal of Mathematics, vol. 2, no. 1, pp. 229–260, 2007.

[15] M. Huang, P. E. Caines, and R. P. Malhamé, "Large-population cost-coupled LQG problems with nonuniform agents: individual-mass behavior and decentralized epsilon-Nash equilibria," IEEE Transactions on Automatic Control, vol.
52, no. 9, pp. 1560–1571, 2007.

[16] M. Huang, P. E. Caines, and R. P. Malhamé, "Social optima in mean field LQG control: centralized and decentralized strategies," IEEE Transactions on Automatic Control, vol. 57, no. 7, pp. 1736–1751, 2012.

[17] G. Y. Weintraub, C. L. Benkard, and B. V. Roy, "Oblivious equilibrium: A mean field approximation for large-scale dynamic games," in Neural Information Processing Systems, pp. 1489–1496, Dec. 2005.

[18] G. Y. Weintraub, C. L. Benkard, and B. Van Roy, "Markov perfect industry dynamics with many firms," Econometrica, vol. 76, no. 6, pp. 1375–1411, 2008.

[19] D. S. Bernstein, R. Givan, N. Immerman, and S. Zilberstein, "The complexity of decentralized control of Markov decision processes," Mathematics of Operations Research, vol. 27, no. 4, pp. 819–840, 2002.

[20] D. Szer, F. Charpillet, and S. Zilberstein, "MAA*: A heuristic search algorithm for solving decentralized POMDPs," arXiv preprint arXiv:1207.1359, 2012.

[21] S. Seuken and S. Zilberstein, "Formal models and algorithms for decentralized decision making under uncertainty," Autonomous Agents and Multi-Agent Systems, vol. 17, no. 2, pp. 190–250, 2008.

[22] A. Kumar, S. Zilberstein, and M. Toussaint, "Probabilistic inference techniques for scalable multiagent decision making," Journal of Artificial Intelligence Research, vol. 53, pp. 223–270, 2015.

[23] J. S. Dibangoye, C. Amato, O. Buffet, and F. Charpillet, "Optimally solving Dec-POMDPs as continuous-state MDPs," Journal of Artificial Intelligence Research, vol. 55, pp. 443–497, 2016.

[24] T. Rashid, M. Samvelyan, C. Schroeder, G. Farquhar, J. Foerster, and S. Whiteson, "QMIX: Monotonic value function factorisation for deep multi-agent reinforcement learning," in International Conference on Machine Learning, pp. 4295–4304, PMLR, 2018.

[25] H. Hu and J. N. Foerster, "Simplified action decoder for deep multi-agent reinforcement learning," in International Conference on Learning Representations, 2019.
[26] R. Becker, S. Zilberstein, V. Lesser, and C. V. Goldman, "Solving transition independent decentralized Markov decision processes," Journal of Artificial Intelligence Research, vol. 22, pp. 423–455, 2004.

[27] J. S. Dibangoye, C. Amato, and A. Doniec, "Scaling up decentralized MDPs through heuristic search," in Proceedings of the Twenty-Eighth Conference on Uncertainty in Artificial Intelligence, UAI'12, (Arlington, Virginia, USA), pp. 217–226, AUAI Press, 2012.

[28] Y. Xie, J. Dibangoye, and O. Buffet, "Optimally solving two-agent decentralized POMDPs under one-sided information sharing," in International Conference on Machine Learning, pp. 10473–10482, PMLR, 2020.

[29] A. Nayyar, A. Mahajan, and D. Teneketzis, "Optimal control strategies in delayed sharing information structures," IEEE Transactions on Automatic Control, vol. 56, no. 7, pp. 1606–1620, 2010.

[30] A. Mahajan, "Optimal decentralized control of coupled subsystems with control sharing," IEEE Transactions on Automatic Control, vol. 58, no. 9, pp. 2377–2382, 2013.

[31] J. Foerster, F. Song, E. Hughes, N. Burch, I. Dunning, S. Whiteson, M. Botvinick, and M. Bowling, "Bayesian action decoder for deep multi-agent reinforcement learning," in International Conference on Machine Learning, pp. 1942–1951, PMLR, 2019.

[32] S. Sukhbaatar, A. Szlam, and R. Fergus, "Learning multiagent communication with backpropagation," in Proceedings of the 30th International Conference on Neural Information Processing Systems, pp. 2252–2260, 2016.

[33] J. N. Foerster, Y. M. Assael, N. de Freitas, and S. Whiteson, "Learning to communicate with deep multi-agent reinforcement learning," in Proceedings of the 30th International Conference on Neural Information Processing Systems, pp. 2145–2153, 2016.

[34] K. Cao, A. Lazaridou, M. Lanctot, J. Z. Leibo, K. Tuyls, and S. Clark, "Emergent communication through negotiation," in International Conference on Learning Representations, 2018.

[35] H. Kurniawati, D.
Hsu, and W. S. Lee, "SARSOP: Efficient point-based POMDP planning by approximating optimally reachable belief spaces," in Robotics: Science and Systems, vol. 2008, Citeseer, 2008.

[36] M. Egorov, Z. N. Sunberg, E. Balaban, T. A. Wheeler, J. K. Gupta, and M. J. Kochenderfer, "POMDPs.jl: A framework for sequential decision making under uncertainty," Journal of Machine Learning Research, vol. 18, no. 26, pp. 1–5, 2017.

[37] J. S. Dibangoye, C. Amato, A. Doniec, and F. Charpillet, "Producing efficient error-bounded solutions for transition independent decentralized MDPs," in International Conference on Autonomous Agents and Multi-Agent Systems, pp. 539–546, 2013.

[38] R. Dechter, D. Cohen, et al., Constraint Processing. Morgan Kaufmann, 2003.

[39] A. Washburn and K. Wood, "Two-person zero-sum games for network interdiction," Operations Research, vol. 43, no. 2, pp. 243–251, 1995.

[40] R. J. Aumann, M. Maschler, and R. E. Stearns, Repeated Games with Incomplete Information. MIT Press, 1995.

[41] S. Sudhakara, D. Kartik, R. Jain, and A. Nayyar, "Optimal communication and control strategies in a multi-agent MDP problem," arXiv preprint arXiv:2104.10923, 2021.

[42] D. Kartik, A. Nayyar, and U. Mitra, "Fixed-horizon active hypothesis testing," arXiv preprint arXiv:1911.06912, 2019.

[43] D. Tang, H. Tavafoghi, V. Subramanian, A. Nayyar, and D. Teneketzis, "Dynamic games among teams with delayed intra-team information sharing," arXiv preprint arXiv:2102.11920, 2021.

[44] S. Bhattacharya and T. Başar, "Multi-layer hierarchical approach to double sided jamming games among teams of mobile agents," in 2012 IEEE 51st IEEE Conference on Decision and Control (CDC), pp. 5774–5779, IEEE, 2012.

[45] A. Nayyar, A. Gupta, C. Langbort, and T. Başar, "Common information based Markov perfect equilibria for stochastic games with asymmetric information: Finite games," IEEE Transactions on Automatic Control, vol. 59, no. 3, pp. 555–570, 2014.

[46] A. H.
Guide, Infinite Dimensional Analysis. Springer, 2006.

[47] D. Kartik, S. Sudhakara, R. Jain, and A. Nayyar, "Optimal communication and control strategies for a multi-agent system in the presence of an adversary," arXiv preprint arXiv:2209.03888, 2022.

[48] D. Tang, Games in Multi-Agent Dynamic Systems: Decision-Making with Compressed Information. PhD thesis, 2021.

[49] D. Kartik, S. Sudhakara, R. Jain, and A. Nayyar, "Optimal communication and control strategies for a multi-agent system in the presence of an adversary," arXiv preprint arXiv:2209.03888, 2022.

[50] D. Szer, F. Charpillet, and S. Zilberstein, "MAA*: A heuristic search algorithm for solving decentralized POMDPs," in Proceedings of the Twenty-First Conference on Uncertainty in Artificial Intelligence, pp. 576–583, 2005.

[51] M. J. Neely, "Repeated games, optimal channel capture, and open problems for slotted multiple access," arXiv preprint arXiv:2110.09638, 2021.

[52] S. Yüksel and T. Başar, Stochastic Networked Control Systems: Stabilization and Optimization under Information Constraints. Springer Science & Business Media, 2013.

[53] D. Kartik and A. Nayyar, "Upper and lower values in zero-sum stochastic games with asymmetric information," Dynamic Games and Applications, vol. 11, no. 2, pp. 363–388, 2021.

[54] D. Kartik, A. Nayyar, and U. Mitra, "Common information belief based dynamic programs for stochastic zero-sum games with competing teams," arXiv preprint arXiv:2102.05838, 2021.

[55] Y.-C. Ho, "Team decision theory and information structures," Proceedings of the IEEE, vol. 68, no. 6, pp. 644–654, 1980.

[56] P. Auer, N. Cesa-Bianchi, and P. Fischer, "Finite-time analysis of the multiarmed bandit problem," Machine Learning, vol. 47, no. 2-3, pp. 235–256, 2002.

[57] M. C. Campi and P. Kumar, "Adaptive linear quadratic Gaussian control: the cost-biased approach revisited," SIAM Journal on Control and Optimization, vol. 36, no. 6, pp. 1890–1907, 1998.

[58] Y. Abbasi-Yadkori and C.
Szepesvári, "Regret bounds for the adaptive control of linear quadratic systems," in Annual Conference on Learning Theory, pp. 1–26, 2011.

[59] M. K. S. Faradonbeh, A. Tewari, and G. Michailidis, "Finite time analysis of optimal adaptive policies for linear-quadratic systems," arXiv:1711.07230, 2017.

[60] A. Cohen, T. Koren, and Y. Mansour, "Learning linear-quadratic regulators efficiently with only √T regret," in International Conference on Machine Learning, pp. 1300–1309, PMLR, 2019.

[61] M. Abeille and A. Lazaric, "Efficient optimistic exploration in linear-quadratic regulators via Lagrangian relaxation," in International Conference on Machine Learning, pp. 23–31, PMLR, 2020.

[62] G. Goodwin, P. Ramadge, and P. Caines, "Discrete time stochastic multivariable adaptive control," IEEE Transactions on Automatic Control, vol. 19, pp. 449–456, June 1980.

[63] G. Goodwin, P. Ramadge, and P. Caines, "Discrete time stochastic adaptive control," SIAM Journal on Control and Optimization, vol. 19, pp. 829–853, Nov. 1981.

[64] S. Dean, H. Mania, N. Matni, B. Recht, and S. Tu, "Regret bounds for robust adaptive control of the linear quadratic regulator," in Neural Information Processing Systems, pp. 4192–4201, 2018.

[65] H. Mania, S. Tu, and B. Recht, "Certainty equivalent control of LQR is efficient," arXiv:1902.07826, 2019.

[66] M. K. S. Faradonbeh, A. Tewari, and G. Michailidis, "Input perturbations for adaptive control and learning," Automatica, vol. 117, p. 108950, 2020.

[67] M. Simchowitz and D. Foster, "Naive exploration is optimal for online LQR," in International Conference on Machine Learning, pp. 8937–8948, PMLR, 2020.

[68] S. Agrawal and N. Goyal, "Analysis of Thompson sampling for the multi-armed bandit problem," in Conference on Learning Theory, 2012.

[69] Y. Ouyang, M. Gagrani, and R. Jain, "Control of unknown linear systems with Thompson sampling," in Allerton Conference on Communication, Control, and Computing, pp. 1198–1205, 2017.

[70] Y.
Ouyang, M. Gagrani, and R. Jain, "Posterior sampling-based reinforcement learning for control of unknown linear systems," IEEE Transactions on Automatic Control, 2019.

[71] M. Abeille and A. Lazaric, "Improved regret bounds for Thompson sampling in linear quadratic control problems," in International Conference on Machine Learning, pp. 1–9, 2018.

[72] A. Cassel, A. Cohen, and T. Koren, "Logarithmic regret for learning linear quadratic regulators efficiently," in International Conference on Machine Learning, pp. 1328–1337, PMLR, 2020.

[73] J. Lunze, "Dynamics of strongly coupled symmetric composite systems," International Journal of Control, vol. 44, no. 6, pp. 1617–1640, 1986.

[74] M. K. Sundareshan and R. M. Elbanna, "Qualitative analysis and decentralized controller synthesis for a class of large-scale systems with symmetrically interconnected subsystems," Automatica, vol. 27, no. 2, pp. 383–388, 1991.

[75] G.-H. Yang and S.-Y. Zhang, "Structural properties of large-scale systems possessing similar structures," Automatica, vol. 31, no. 7, pp. 1011–1017, 1995.

[76] S. C. Hamilton and M. E. Broucke, "Patterned linear systems," Automatica, vol. 48, no. 2, pp. 263–272, 2012.

[77] J. Arabneydi and A. Mahajan, "Team-optimal solution of finite number of mean-field coupled LQG subsystems," in Conference on Decision and Control, (Kyoto, Japan), Dec. 2015.

[78] J. Arabneydi and A. Mahajan, "Linear quadratic mean field teams: Optimal and approximately optimal decentralized solutions," 2016. arXiv:1609.00056.

[79] D. A. Gomes and J. Saúde, "Mean field games models—a brief survey," Dynamic Games and Applications, vol. 4, no. 2, pp. 110–154, 2014.

[80] Y. Yang, R. Luo, M. Li, M. Zhou, W. Zhang, and J. Wang, "Mean field multi-agent reinforcement learning," in International Conference on Machine Learning, pp. 5567–5576, Jul 2018.

[81] J. Subramanian and A.
Mahajan, "Reinforcement learning in stationary mean-field games," in International Conference on Autonomous Agents and Multi-Agent Systems, pp. 251–259, 2019.

[82] N. Tiwari, A. Ghosh, and V. Aggarwal, "Reinforcement learning for mean field game," arXiv preprint arXiv:1905.13357, 2019.

[83] X. Guo, A. Hu, R. Xu, and J. Zhang, "Learning mean-field games," in Neural Information Processing Systems, pp. 4966–4976, 2019.

[84] S. G. Subramanian, P. Poupart, M. E. Taylor, and N. Hegde, "Multi type mean field reinforcement learning," arXiv preprint arXiv:2002.02513, 2020.

[85] M. A. uz Zaman, K. Zhang, E. Miehling, and T. Başar, "Reinforcement learning in non-stationary discrete-time linear-quadratic mean-field games," in 2020 59th IEEE Conference on Decision and Control (CDC), pp. 2278–2284, IEEE, 2020.

[86] A. Angiuli, J.-P. Fouque, and M. Laurière, "Unified reinforcement Q-learning for mean field game and control problems," arXiv:2006.13912, 2020.

[87] M. Gagrani, S. Sudhakara, A. Mahajan, A. Nayyar, and Y. Ouyang, "Thompson sampling for linear quadratic mean-field teams," arXiv preprint arXiv:2011.04686, 2020.

[88] J. Sternby, "On consistency for the method of least squares using martingale theory," IEEE Transactions on Automatic Control, vol. 22, no. 3, pp. 346–352, 1977.

[89] N. Sandell, P. Varaiya, M. Athans, and M. Safonov, "Survey of decentralized control methods for large scale systems," vol. 23, no. 2, pp. 108–128, 1978.

[90] S. J. Bradtke, "Reinforcement learning applied to linear quadratic regulation," in Neural Information Processing Systems, pp. 295–302, 1993.

[91] S. J. Bradtke, B. E. Ydstie, and A. G. Barto, "Adaptive linear quadratic control using policy iteration," in Proceedings of American Control Conference, vol. 3, pp. 3475–3479, IEEE, 1994.

[92] M. K. S. Faradonbeh, A. Tewari, and G. Michailidis, "On adaptive linear–quadratic regulators," Automatica, vol. 117, p. 108982, July 2020.

[93] S. Gao and A.
Mahajan, "Optimal control of network-coupled subsystems: Spectral decomposition and low-dimensional solutions," in print, IEEE Transactions on Control of Network Systems, 2022.

[94] H. Wang, S. Lin, H. Jafarkhani, and J. Zhang, "Distributed Q-learning with state tracking for multi-agent networked control," in AAMAS, pp. 1692–1694, 2021.

[95] G. Jing, H. Bai, J. George, A. Chakrabortty, and P. K. Sharma, "Learning distributed stabilizing controllers for multi-agent systems," IEEE Control Systems Letters, 2021.

[96] Y. Li, Y. Tang, R. Zhang, and N. Li, "Distributed reinforcement learning for decentralized linear quadratic control: A derivative-free policy optimization approach," in Proc. Conf. Learning for Dynamics and Control, pp. 814–814, June 2020.

[97] S. Alemzadeh and M. Mesbahi, "Distributed Q-learning for dynamically decoupled systems," in 2019 American Control Conference (ACC), pp. 772–777, IEEE, 2019.

[98] J. Bu, A. Mesbahi, and M. Mesbahi, "Policy gradient-based algorithms for continuous-time linear quadratic control," arXiv preprint arXiv:2006.09178, 2020.

[99] H. Mohammadi, M. R. Jovanovic, and M. Soltanolkotabi, "Learning the model-free linear quadratic regulator via random search," in Learning for Dynamics and Control, pp. 531–539, PMLR, 2020.

[100] K. Zhang, Z. Yang, H. Liu, T. Zhang, and T. Başar, "Fully decentralized multi-agent reinforcement learning with networked agents," in International Conference on Machine Learning, pp. 5872–5881, 2018.

[101] I. Osband and B. Van Roy, "Near-optimal reinforcement learning in factored MDPs," in Advances in Neural Information Processing Systems, vol. 27, Curran Associates, Inc., 2014.

[102] X. Chen, J. Hu, L. Li, and L. Wang, "Efficient reinforcement learning in factored MDPs with application to constrained RL," arXiv preprint arXiv:2008.13319, 2021.

[103] M. Gagrani, S. Sudhakara, A. Mahajan, A. Nayyar, and Y.
Ouyang, "A relaxed technical assumption for posterior sampling-based reinforcement learning for control of unknown linear systems," arXiv preprint arXiv:2108.08502, 2021.

[104] M. K. S. Faradonbeh, A. Tewari, and G. Michailidis, "Finite-time adaptive stabilization of linear systems," vol. 64, pp. 3498–3505, Aug. 2019.

[105] F. Gensbittel, M. Oliu-Barton, and X. Venel, "Existence of the uniform value in zero-sum repeated games with a more informed controller," Journal of Dynamics and Games (JDG), vol. 1, no. 3, pp. 411–445, 2014.

[106] Y. Abbasi-Yadkori and C. Szepesvári, "Bayesian optimal control of smoothly parameterized systems," in Uncertainty in Artificial Intelligence, 2015.

[107] Y. Abbasi-Yadkori and C. Szepesvari, "Bayesian optimal control of smoothly parameterized systems: The lazy posterior sampling algorithm," arXiv preprint arXiv:1406.3926, 2014.

Appendices

A Appendix: Optimal communication and control strategies in a cooperative multi-agent MDP problem

A.1 Proof of Lemma 2.1

We prove the lemma by induction.

Induction hypothesis at time $t$: Equation (2.10) holds at time $t$.

At $t = 1$, before communication decisions are made, the induction hypothesis is trivially true since there is no common information at this point and the agents' initial states are independent.

Induction step: Using the induction hypothesis at time $t$, we first prove (2.11) at time $t^+$. In order to do so, it suffices to show that the left-hand side of (2.11) can be factorized as follows:
$$P(x_{1:t}, u_{1:t} \mid c_{t^+}) = \chi^1(x^1_{1:t}, u^1_{1:t}, c_{t^+})\, \chi^2(x^2_{1:t}, u^2_{1:t}, c_{t^+}), \tag{3}$$
where $\chi^1$ and $\chi^2$ are some real-valued mappings with $\chi^i$ depending only on agent $i$'s strategy. We now factorize the joint distribution below. Recall that $c_{t^+} = (c_{(t-1)^+}, z^{er}_t, m_t)$, $I^i_{t^+} = (x^i_{1:t}, u^i_{1:t-1}, z^{er}_{1:t}, m_{1:t})$, and $I^i_t = (x^i_{1:t}, u^i_{1:t-1}, z^{er}_{1:t-1}, m_{1:t-1})$.
The left-hand side of (2.11) for $t^+$ can be written as
\[
\frac{P(x_{1:t}, u_{1:t}, z^{er}_t, m_t \mid c_{(t-1)^+})}{P(z^{er}_t, m_t \mid c_{(t-1)^+})}
= \frac{P(u_t \mid x_{1:t}, u_{1:t-1}, c_{t^+})\, P(z^{er}_t \mid x_{1:t}, u_{1:t-1}, c_{(t-1)^+}, m_t)\, P(m_t \mid x_{1:t}, u_{1:t-1}, c_{(t-1)^+})\, P(x_{1:t}, u_{1:t-1} \mid c_{(t-1)^+})}{P(z^{er}_t, m_t \mid c_{(t-1)^+})}
\]
\[
= \mathbb{1}(m^1_t = f^1_t(I^1_t))\, \mathbb{1}(u^1_t = g^1_t(I^1_{t^+}))\, P(x^1_{1:t}, u^1_{1:t-1} \mid c_{(t-1)^+}) \times \mathbb{1}(m^2_t = f^2_t(I^2_t))\, \mathbb{1}(u^2_t = g^2_t(I^2_{t^+}))\, P(x^2_{1:t}, u^2_{1:t-1} \mid c_{(t-1)^+}) \times \frac{P(z^{er}_t \mid x_{1:t}, u_{1:t-1}, c_{(t-1)^+}, m_t)}{P(z^{er}_t, m_t \mid c_{(t-1)^+})}, \tag{4}
\]
where the last equality follows from the fact that $c_{(t-1)^+} = c_t$ and the induction hypothesis at time $t$. Further, we have
\[
P(z^{er}_t \mid x_{1:t}, u_{1:t-1}, c_{(t-1)^+}, m_t) =
\begin{cases}
1 & \text{if } m_t = (0,0),\ z^{er}_t = \phi, \\
p_e & \text{if } m_t \neq (0,0),\ z^{er}_t = \phi, \\
(1-p_e)\,\mathbb{1}\{x_t = (\tilde{x}^1_t, \tilde{x}^2_t)\} & \text{if } m_t \neq (0,0),\ z^{er}_t = (\tilde{x}^1_t, \tilde{x}^2_t),
\end{cases} \tag{5}
\]
which can clearly be factorized. From equations (4) and (5), the joint distribution $P(x_{1:t}, u_{1:t} \mid c_{t^+})$ can be factorized as in (3) and thus (2.11) holds at time $t^+$.

Using this result, we now show that the induction hypothesis holds at time $t+1$. Recall that $c_{t+1} = c_{t^+}$. At time $t+1$, before communication decisions are made, the left-hand side of equation (2.10) can be written as
\[
P(x_{1:t+1}, u_{1:t} \mid c_{t+1}) = P(x_{t+1} \mid x_{1:t}, u_{1:t}, c_{t+1})\, P(x_{1:t}, u_{1:t} \mid c_{t+1})
= P(x^2_{t+1} \mid x^1_{t+1}, x_{1:t}, u_{1:t}, c_{t+1})\, P(x^1_{t+1} \mid x_{1:t}, u_{1:t}, c_{t+1})\, P(x_{1:t}, u_{1:t} \mid c_{t^+})
\]
\[
= \Big[ P(x^1_{t+1} \mid x^1_t, u^1_t)\, P(x^1_{1:t}, u^1_{1:t} \mid c_{t^+}) \Big] \times \Big[ P(x^2_{t+1} \mid x^2_t, u^2_t)\, P(x^2_{1:t}, u^2_{1:t} \mid c_{t^+}) \Big], \tag{6}
\]
where the last equality follows from the state dynamics in (2.1) and from equation (2.11) at time $t^+$. Using the factored form of $P(x_{1:t+1}, u_{1:t} \mid c_{t+1})$ in (6), we conclude that the induction hypothesis holds at time $t+1$. Therefore, by induction, equations (2.10) and (2.11) hold at all times.

A.2 Proof of Proposition 2.1

We will prove the result for agent $i$.
Throughout this proof, we fix agent $-i$'s communication and control strategies to be $f^{-i}, g^{-i}$ (where $f^{-i}, g^{-i}$ are arbitrarily chosen). Define $R^i_t = (X^i_t, Z^{er}_{1:t-1}, M^{1,2}_{1:t-1})$ and $R^i_{t^+} = (X^i_t, Z^{er}_{1:t}, M^{1,2}_{1:t})$. Our proof relies on the following two facts:

Fact 1: $\{R^i_1, R^i_{1^+}, R^i_2, R^i_{2^+}, \ldots, R^i_T, R^i_{T^+}\}$ is a controlled Markov process for agent $i$. More precisely, for any strategy choice $f^i, g^i$ of agent $i$,
\[
P(R^i_{t^+} = \tilde{r}^i_{t^+} \mid R^i_{1:t} = r^i_{1:t}, M^i_{1:t} = m^i_{1:t}, U^i_{1:t-1} = u^i_{1:t-1}) = P(R^i_{t^+} = \tilde{r}^i_{t^+} \mid R^i_t = r^i_t, M^i_t = m^i_t), \tag{7}
\]
\[
P(R^i_{t+1} = \tilde{r}^i_{t+1} \mid R^i_{1:t^+} = r^i_{1:t^+}, M^i_{1:t} = m^i_{1:t}, U^i_{1:t} = u^i_{1:t}) = P(R^i_{t+1} = \tilde{r}^i_{t+1} \mid R^i_{t^+} = r^i_{t^+}, U^i_t = u^i_t), \tag{8}
\]
where the probabilities on the right-hand side of (7) and (8) do not depend on $f^i, g^i$.

Fact 2: The costs at time $t$ satisfy
\[
\mathbb{E}[\rho(X^i_t, X^{-i}_t)\,\mathbb{1}(M^{or}_t = 1) \mid R^i_{1:t} = r^i_{1:t}, M^i_{1:t} = m^i_{1:t}, U^i_{1:t-1} = u^i_{1:t-1}] = \kappa^i_t(r^i_t, m^i_t), \tag{9}
\]
\[
\mathbb{E}[c_t(X_t, U_t) \mid R^i_{1:t^+} = r^i_{1:t^+}, M^i_{1:t} = m^i_{1:t}, U^i_{1:t} = u^i_{1:t}] = \kappa^i_{t^+}(r^i_{t^+}, u^i_t), \tag{10}
\]
where the functions $\kappa^i_t, \kappa^i_{t^+}$ in (9) and (10) do not depend on $f^i, g^i$.

Suppose that Facts 1 and 2 are true. Then the strategy optimization problem for agent $i$ can be viewed as an MDP over $2T$ time steps (i.e., time steps $1, 1^+, 2, 2^+, \ldots, T, T^+$) with $R^i_t$ and $M^i_t$ as the state and action at time $t$, and $R^i_{t^+}$ and $U^i_t$ as the state and action at time $t^+$. Note that at time $t$, agent $i$ observes $R^i_t$, selects $M^i_t$, and the "state" transitions to $R^i_{t^+}$ according to Markovian dynamics (see (7)). Similarly, at time $t^+$, agent $i$ observes $R^i_{t^+}$, selects $U^i_t$, and the "state" transitions to $R^i_{t+1}$ according to Markovian dynamics (see (8)). Further, from agent $i$'s perspective, the cost at time $t$ depends on the state and action at $t$ (i.e., $R^i_t$ and $M^i_t$, see (9)) and the cost at time $t^+$ depends on the state and action at $t^+$ (i.e., $R^i_{t^+}$ and $U^i_t$, see (10)).
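Such an alternating communication/control MDP can be solved by standard backward induction over the $2T$ stages, minimizing first over the control at each $t^+$ and then over the message at each $t$. The following sketch is purely illustrative (a hypothetical tabular MDP with dictionary-valued kernels, not the thesis model):

```python
def backward_induction(states, messages, controls,
                       P_comm, P_ctrl, k_comm, k_ctrl, T):
    """Dynamic program for a 2T-stage MDP alternating message/control stages.

    P_comm[s][m] and P_ctrl[s][u] are next-state distributions (dicts);
    k_comm(s, m) and k_ctrl(s, u) are the stage costs at t and t+.
    Returns the time-1 value function and the per-stage policies (f_t, g_t).
    """
    V = {s: 0.0 for s in states}  # terminal value after stage T+
    policy = []
    for t in reversed(range(T)):
        # control stage t+: choose u minimizing cost-to-go
        Q_ctrl = {s: {u: k_ctrl(s, u) + sum(p * V[s2] for s2, p in P_ctrl[s][u].items())
                      for u in controls} for s in states}
        g = {s: min(Q_ctrl[s], key=Q_ctrl[s].get) for s in states}
        V_plus = {s: Q_ctrl[s][g[s]] for s in states}
        # communication stage t: choose m minimizing cost-to-go
        Q_comm = {s: {m: k_comm(s, m) + sum(p * V_plus[s2] for s2, p in P_comm[s][m].items())
                      for m in messages} for s in states}
        f = {s: min(Q_comm[s], key=Q_comm[s].get) for s in states}
        V = {s: Q_comm[s][f[s]] for s in states}
        policy.append((f, g))
    policy.reverse()
    return V, policy
```

In this toy setting the "state" plays the role of $R^i_t$; in the actual problem the state also carries $Z^{er}_{1:t-1}, M^{1,2}_{1:t-1}$ and the kernels come from Facts 1 and 2.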
It then follows from standard MDP results that agent $i$ can find an optimal strategy (given agent $-i$'s strategy) of the form
\[
M^i_t = \bar{f}^i_t(R^i_t) = \bar{f}^i_t(X^i_t, Z^{er}_{1:t-1}, M^{1,2}_{1:t-1}), \tag{11}
\]
\[
U^i_t = \bar{g}^i_t(R^i_{t^+}) = \bar{g}^i_t(X^i_t, Z^{er}_{1:t}, M^{1,2}_{1:t}), \tag{12}
\]
which establishes the result of the proposition (recall that $C_t = (Z^{er}_{1:t-1}, M^{1,2}_{1:t-1})$ and $C_{t^+} = (Z^{er}_{1:t}, M^{1,2}_{1:t})$). We now prove Facts 1 and 2 stated above.

Proof of Fact 1: Let $\tilde{r}^i_{t^+} = (x^i_t, z^{er}_{1:t}, m_{1:t})$ and $r^i_{1:t}$ be such that $r^i_t = (x^i_t, z^{er}_{1:t-1}, m_{1:t-1})$. Then the left-hand side of (7) can be written as
\[
P(R^i_{t^+} = (x^i_t, z^{er}_{1:t}, m_{1:t}) \mid (x^i_{1:t}, z^{er}_{1:t-1}, m_{1:t-1}), m^i_{1:t}, u^i_{1:t-1})
= P(z^{er}_t \mid x^i_{1:t}, z^{er}_{1:t-1}, m_{1:t}, u^i_{1:t-1}) \times P(M^{-i}_t = m^{-i}_t \mid x^i_{1:t}, z^{er}_{1:t-1}, m_{1:t-1}, m^i_t, u^i_{1:t-1})
\]
\[
= P(z^{er}_t \mid x^i_{1:t}, z^{er}_{1:t-1}, m_{1:t}, u^i_{1:t-1}) \times P(M^{-i}_t = m^{-i}_t \mid z^{er}_{1:t-1}, m_{1:t-1}), \tag{13}
\]
where (13) follows from the conditional independence property of Lemma 2.1. We can further simplify the first term in (13) for the different cases as follows:
\[
P(z^{er}_t \mid x^i_{1:t}, z^{er}_{1:t-1}, m_{1:t}, u^i_{1:t-1}) =
\begin{cases}
(1-p_e)\,\mathbb{1}(\tilde{x}^i_t = x^i_t)\, P(\tilde{x}^{-i}_t \mid z^{er}_{1:t-1}, m_{1:t}) & \text{if } z^{er}_t = \tilde{x}_t,\ m_t \neq (0,0), \\
p_e & \text{if } z^{er}_t = \phi,\ m_t \neq (0,0), \\
1 & \text{if } z^{er}_t = \phi,\ m_t = (0,0).
\end{cases}
\]
We note that in all cases above $x^i_{1:t-1}$ does not affect the probability. This, combined with (13), establishes (7). We further note that the probabilities in the three cases above, and in the second term of (13), do not depend on agent $i$'s strategy. Equation (8) is a direct consequence of the Markovian state dynamics of agent $i$.

Proof of Fact 2: We have
\[
P[X_t = \tilde{x}_t, M_t = \tilde{m}_t \mid R^i_{1:t} = r^i_{1:t}, M^i_{1:t} = m^i_{1:t}, U^i_{1:t-1} = u^i_{1:t-1}]
= P[X_t = \tilde{x}_t, M_t = \tilde{m}_t \mid x^i_{1:t}, z^{er}_{1:t-1}, m_{1:t-1}, m^i_t, u^i_{1:t-1}] \tag{14}
\]
\[
= \mathbb{1}(\tilde{x}^i_t = x^i_t)\,\mathbb{1}(\tilde{m}^i_t = m^i_t)\, P[X^{-i}_t = \tilde{x}^{-i}_t, M^{-i}_t = \tilde{m}^{-i}_t \mid z^{er}_{1:t-1}, m_{1:t-1}],
\]
where the last equality follows from the conditional independence in Lemma 2.1.
Therefore, the probability distribution of $X_t, M_t$ conditioned on $R^i_{1:t}, M^i_{1:t}$ depends only on $(x^i_t, z^{er}_{1:t-1}, m_{1:t-1}, m^i_t) = (r^i_t, m^i_t)$. Also note that this conditional probability does not depend on agent $i$'s strategy. Hence, the conditional expectation in (9) can be expressed as a function of $r^i_t, m^i_t$. To prove (10), it suffices to show that
\[
P(x^{-i}_t, u^{-i}_t \mid (x^i_{1:t}, z^{er}_{1:t}, m_{1:t}), m^i_{1:t}, u^i_{1:t}) = P(x^{-i}_t, u^{-i}_t \mid (x^i_t, z^{er}_{1:t}, m_{1:t}), u^i_t), \tag{15}
\]
which follows from Lemma 2.1.

A.3 Proof of Lemma 2.3

Let $c_t := (z^{er}_{1:t-1}, m^{1,2}_{1:t-1})$ and $c_{t^+} := (z^{er}_{1:t}, m^{1,2}_{1:t})$ be realizations of $C_t, C_{t^+}$ respectively, and let $c_{t+1} = c_{t^+}$ be the corresponding realization of $C_{t+1}$. Let $\gamma_{1:t}, \lambda_{1:t}$ be the realizations of the coordinator's prescriptions $\Gamma_{1:t}, \Lambda_{1:t}$ up to time $t$. Assume that the realizations $c_{t+1}, \gamma_{1:t}, \lambda_{1:t}$ have non-zero probability. Let $\pi^i_t, \pi^i_{t^+}, \pi^i_{t+1}$ be the corresponding realizations of the coordinator's beliefs $\Pi^i_t, \Pi^i_{t^+}, \Pi^i_{t+1}$ respectively. These beliefs are given by
\[
\pi^i_t(x^i_t) = P(X^i_t = x^i_t \mid C_t = (z^{er}_{1:t-1}, m^{1,2}_{1:t-1}), \gamma_{1:t-1}, \lambda_{1:t-1}),
\]
\[
\pi^i_{t^+}(x^i_t) = P(X^i_t = x^i_t \mid C_{t^+} = (z^{er}_{1:t}, m^{1,2}_{1:t}), \gamma_{1:t}, \lambda_{1:t-1}),
\]
\[
\pi^i_{t+1}(x^i_{t+1}) = P(X^i_{t+1} = x^i_{t+1} \mid C_{t+1} = (z^{er}_{1:t}, m^{1,2}_{1:t}), \gamma_{1:t}, \lambda_{1:t}).
\]
There are two possible cases: (i) $z^{er}_t = (\tilde{x}^1_t, \tilde{x}^2_t)$, and (ii) $z^{er}_t = \phi$. We analyze these cases separately.

Case (i): When $z^{er}_t = (\tilde{x}^1_t, \tilde{x}^2_t)$ for some $(\tilde{x}^1_t, \tilde{x}^2_t) \in \mathcal{X}^1 \times \mathcal{X}^2$, at least one of the agents must have decided to communicate at time $t$ and the communication must have been successful. As described in (2.2), $Z^{er}_t = X_t$ when successful communication occurs. Thus, we have
\[
\pi^i_{t^+}(x^i_t) = P(X^i_t = x^i_t \mid z^{er}_{1:t}, m^{1,2}_{1:t}, \gamma_{1:t}, \lambda_{1:t-1}) = \mathbb{1}(x^i_t = \tilde{x}^i_t). \tag{16}
\]

Case (ii): In this case, $z^{er}_t = \phi$. Let $m_t := (m^1_t, m^2_t)$ and
\[
q(m_t) := P[Z^{er}_t = \phi \mid M_t = m_t] = \begin{cases} 1 & \text{if } m_t = (0,0), \\ p_e & \text{otherwise.} \end{cases} \tag{17}
\]
Using Bayes' rule, we have
\[
\pi^i_{t^+}(x^i_t) = P(X^i_t = x^i_t \mid z^{er}_{1:t}, m^{1,2}_{1:t}, \gamma_{1:t}, \lambda_{1:t-1})
= \frac{P(X^i_t = x^i_t, Z^{er}_t = \phi, M_t = m_t \mid z^{er}_{1:t-1}, m^{1,2}_{1:t-1}, \gamma_{1:t}, \lambda_{1:t-1})}{P(Z^{er}_t = \phi, M_t = m_t \mid z^{er}_{1:t-1}, m^{1,2}_{1:t-1}, \gamma_{1:t}, \lambda_{1:t-1})}
\]
\[
= \frac{q(m_t)\, P(M_t = m_t \mid x^i_t, c_t, \gamma_{1:t}, \lambda_{1:t-1})\, P(x^i_t \mid c_t, \gamma_{1:t}, \lambda_{1:t-1})}{\sum_{\hat{x}^i_t} q(m_t)\, P(M_t = m_t \mid \hat{x}^i_t, c_t, \gamma_{1:t}, \lambda_{1:t-1})\, P(\hat{x}^i_t \mid c_t, \gamma_{1:t}, \lambda_{1:t-1})}
\]
\[
\stackrel{(a)}{=} \frac{P(M_t = m_t \mid x^i_t, c_t, \gamma_{1:t}, \lambda_{1:t-1})\, P(x^i_t \mid c_t, \gamma_{1:t-1}, \lambda_{1:t-1})}{\sum_{\hat{x}^i_t} P(M_t = m_t \mid \hat{x}^i_t, c_t, \gamma_{1:t}, \lambda_{1:t-1})\, P(\hat{x}^i_t \mid c_t, \gamma_{1:t-1}, \lambda_{1:t-1})}
\stackrel{(b)}{=} \frac{\mathbb{1}(\gamma^i_t(x^i_t) = m^i_t)\, \pi^i_t(x^i_t)}{\sum_{\hat{x}^i_t} \mathbb{1}(\gamma^i_t(\hat{x}^i_t) = m^i_t)\, \pi^i_t(\hat{x}^i_t)}, \tag{18}
\]
where, in (a), we drop $\gamma_t$ from the term $P(X^i_t = x^i_t \mid c_t, \gamma_{1:t}, \lambda_{1:t-1})$ because $\gamma_t$ is a function of the rest of the terms in the conditioning given the coordinator's strategy. In (b), we use the fact that
\[
P(M_t = m_t \mid x^i_t, c_t, \gamma_{1:t}, \lambda_{1:t-1}) = P(m^{-i}_t \mid m^i_t, x^i_t, c_t, \gamma_{1:t}, \lambda_{1:t-1})\, P(m^i_t \mid x^i_t, c_t, \gamma_{1:t}, \lambda_{1:t-1}) = P(m^{-i}_t \mid c_t, \gamma_{1:t}, \lambda_{1:t-1})\, \mathbb{1}(\gamma^i_t(x^i_t) = m^i_t).
\]
Hence, we can update the coordinator's belief $\pi^i_{t^+}$ using $\pi^i_t, \gamma^i_t, z^{er}_t$ and $m_t$ as
\[
\pi^i_{t^+}(x^i_t) =
\begin{cases}
\dfrac{\mathbb{1}(\gamma^i_t(x^i_t) = m^i_t)\, \pi^i_t(x^i_t)}{\sum_{\hat{x}^i_t} \mathbb{1}(\gamma^i_t(\hat{x}^i_t) = m^i_t)\, \pi^i_t(\hat{x}^i_t)} & \text{if } z^{er}_t = \phi, \\[3ex]
\mathbb{1}(x^i_t = \tilde{x}^i_t) & \text{if } z^{er}_t = (\tilde{x}^1_t, \tilde{x}^2_t).
\end{cases} \tag{19}
\]
We denote the update rule described above by $\eta^i_t$, i.e.,
\[
\pi^i_{t^+} = \eta^i_t(\pi^i_t, \gamma^i_t, z^{er}_t, m_t). \tag{20}
\]
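The two-case update (19) is a standard Bayes-filter correction step: under an erasure, the factor $q(m_t)$ cancels between numerator and denominator (step (a) of (18)), so the belief is simply restricted to the states consistent with agent $i$'s message and renormalized. A minimal numerical sketch (hypothetical finite state set, dictionary representation; not code from the thesis):

```python
def eta(pi, gamma, z_er, m_i, x_tilde_i=None):
    """Coordinator belief update pi -> pi_{t+}, mirroring (19).

    pi:        dict state -> probability (belief on agent i's state)
    gamma:     dict state -> message prescribed to agent i
    z_er:      None for an erasure (phi); otherwise the broadcast state pair,
               in which case x_tilde_i is agent i's component of it
    m_i:       the message actually sent by agent i
    """
    if z_er is not None:
        # successful communication: belief collapses to the broadcast state
        return {x: 1.0 if x == x_tilde_i else 0.0 for x in pi}
    # erasure: condition on the event {gamma(x) = m_i}; q(m_t) cancels
    unnorm = {x: (pi[x] if gamma[x] == m_i else 0.0) for x in pi}
    z = sum(unnorm.values())
    return {x: p / z for x, p in unnorm.items()}
```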
Further, using the law of total probability, we can write $\pi^i_{t+1}(x^i_{t+1})$ as
\[
\pi^i_{t+1}(x^i_{t+1}) = P(X^i_{t+1} = x^i_{t+1} \mid z^{er}_{1:t}, m^{1,2}_{1:t}, \gamma_{1:t}, \lambda_{1:t}) \tag{21}
\]
\[
= \sum_{x^i_t} \sum_{u^i_t} \Big[ P(X^i_{t+1} = x^i_{t+1} \mid x^i_t, u^i_t, z^{er}_{1:t}, m^{1,2}_{1:t}, \gamma_{1:t}, \lambda_{1:t})\, P(U^i_t = u^i_t \mid x^i_t, z^{er}_{1:t}, m^{1,2}_{1:t}, \gamma_{1:t}, \lambda_{1:t})\, P(x^i_t \mid z^{er}_{1:t}, m^{1,2}_{1:t}, \gamma_{1:t}, \lambda_{1:t}) \Big]
\]
\[
\stackrel{(a)}{=} \sum_{x^i_t} \sum_{u^i_t} P(X^i_{t+1} = x^i_{t+1} \mid x^i_t, u^i_t)\, P(u^i_t \mid x^i_t, z^{er}_{1:t}, m^{1,2}_{1:t}, \gamma_{1:t}, \lambda_{1:t})\, P(x^i_t \mid z^{er}_{1:t}, m^{1,2}_{1:t}, \gamma_{1:t}, \lambda_{1:t-1}) \tag{22}
\]
\[
\stackrel{(b)}{=} \sum_{x^i_t} \sum_{u^i_t} P(X^i_{t+1} = x^i_{t+1} \mid x^i_t, u^i_t)\, \mathbb{1}(u^i_t = \lambda^i_t(x^i_t))\, \pi^i_{t^+}(x^i_t), \tag{23}
\]
where, in (a), we drop $\lambda_t$ from $P(X^i_t = x^i_t \mid z^{er}_{1:t}, m^{1,2}_{1:t}, \gamma_{1:t}, \lambda_{1:t})$ since $\lambda_t$ is a function of the rest of the terms in the conditioning given the coordinator's strategy, and (b) follows from (2.15). We denote the update rule described above by $\beta^i_t$, i.e.,
\[
\pi^i_{t+1} = \beta^i_t(\pi^i_{t^+}, \lambda^i_t). \tag{24}
\]

B Appendix: Optimal Communication and Control Strategies for a Multi-Agent System in the Presence of an Adversary

B.1 Proof of Lemma 3.1

We prove the lemma by induction.

Induction hypothesis at time $t$: Equation (3.31) holds at time $t$. At $t = 1$, before communication decisions are made, the induction hypothesis is trivially true since the team's common information at this point is the global state $X^0_1$, and the agents' initial states $X^1_1, X^2_1$ are independent given the global state $X^0_1 = x^0_1$ for any $x^0_1 \in \mathcal{X}^0$.

Induction step: Using the induction hypothesis at time $t$, we first prove (3.32) at time $t^+$. To do so, it suffices to show that the left-hand side of (3.32) can be factorized as
\[
P(x_{1:t}, u_{1:t} \mid c_{t^+}, d_{t^+}) = \zeta^1(x^1_{1:t}, u^1_{1:t}, c_{t^+}, d_{t^+})\, \zeta^2(x^2_{1:t}, u^2_{1:t}, c_{t^+}, d_{t^+}), \tag{25}
\]
where $\zeta^1$ and $\zeta^2$ are real-valued mappings such that $\zeta^i$ depends only on agent $i$'s strategy. We now factorize the joint distribution. Recall that $(c_{t^+}, d_{t^+}) = (c_t, d_t, z^{er}_t, m_t, y_t)$,
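The time-update (23)–(24) propagates the post-communication belief through the controlled dynamics with the control prescription $\lambda^i_t$. A small sketch of this prediction step (hypothetical transition kernel represented as nested dictionaries; illustration only):

```python
def beta(pi_plus, lam, P):
    """Belief propagation pi_{t+} -> pi_{t+1}, mirroring (23)-(24):
    pi_{t+1}(x') = sum_x P(x' | x, lam(x)) * pi_{t+}(x).

    pi_plus: dict state -> probability
    lam:     dict state -> control prescribed to agent i
    P:       dict state -> dict control -> dict next_state -> probability
    """
    out = {}
    for x, p in pi_plus.items():
        for x_next, q in P[x][lam[x]].items():
            out[x_next] = out.get(x_next, 0.0) + p * q
    return out
```

Composing `eta`-style correction steps with `beta`-style prediction steps yields the coordinator's belief trajectory $\pi^i_1, \pi^i_{1^+}, \pi^i_2, \ldots$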
$I^i_{t^+} = (x^0_{1:t}, u^a_{1:t-1}, x^i_{1:t}, u^i_{1:t-1}, z^{er}_{1:t}, m_{1:t}, y_{1:t})$ and $I^i_t = (x^0_{1:t}, u^a_{1:t-1}, x^i_{1:t}, u^i_{1:t-1}, z^{er}_{1:t-1}, m_{1:t-1}, y_{1:t-1})$. The left-hand side of (3.32) for $t^+$ can be written as
\[
\frac{P(x_{1:t}, u_{1:t}, z^{er}_t, m_t, y_t \mid c_t, d_t)}{P(z^{er}_t, m_t, y_t \mid c_t, d_t)}
= \frac{P(u_t \mid x_{1:t}, u_{1:t-1}, c_{t^+}, d_{t^+})\, P(y_t, z^{er}_t \mid x_{1:t}, u_{1:t-1}, c_t, d_t, m_t)\, P(m_t \mid x_{1:t}, u_{1:t-1}, c_t, d_t)\, P(x_{1:t}, u_{1:t-1} \mid c_t, d_t)}{P(z^{er}_t, m_t, y_t \mid c_t, d_t)}
\]
\[
= P(m^1_t \mid I^1_t)\, P(u^1_t \mid I^1_{t^+})\, P(x^1_{1:t}, u^1_{1:t-1} \mid c_t, d_t) \times P(m^2_t \mid I^2_t)\, P(u^2_t \mid I^2_{t^+})\, P(x^2_{1:t}, u^2_{1:t-1} \mid c_t, d_t) \times \frac{P(y_t, z^{er}_t \mid x_{1:t}, u_{1:t-1}, c_t, d_t, m_t)}{P(y_t, z^{er}_t, m_t \mid c_t, d_t)}, \tag{26}
\]
where the last equality follows from the induction hypothesis at time $t$. Further, we have
\[
P(y_t, z^{er}_t \mid x_{1:t}, u_{1:t-1}, c_t, d_t, m_t) = P(y_t \mid z^{er}_t, m_t, x^0_t) \times
\begin{cases}
1 & \text{if } m_t = (0,0),\ z^{er}_t = \phi, \\
p_e(x^0_t) & \text{if } m_t \neq (0,0),\ z^{er}_t = \phi, \\
(1 - p_e(x^0_t))\, \mathbb{1}\{x_t = (\tilde{x}^1_t, \tilde{x}^2_t)\} & \text{if } m_t \neq (0,0),\ z^{er}_t = (\tilde{x}^1_t, \tilde{x}^2_t),
\end{cases} \tag{27}
\]
which can clearly be factorized. From equations (26) and (27), the joint distribution $P(x_{1:t}, u_{1:t} \mid c_{t^+}, d_{t^+})$ can be factorized as in (25) and thus (3.32) holds at time $t^+$.

Using this result, we now show that the induction hypothesis holds at time $t+1$. Recall that $(c_{t+1}, d_{t+1}) = (c_{t^+}, d_{t^+}, x^0_{t+1}, u^a_t)$.
At time $t+1$, before communication decisions are made, the left-hand side of equation (3.31) can be written as
\[
P(x_{1:t+1}, u_{1:t} \mid c_{t+1}, d_{t+1}) = \frac{P(x_{1:t+1}, u_{1:t}, x^0_{t+1}, u^a_t \mid c_{t^+}, d_{t^+})}{P(x^0_{t+1}, u^a_t \mid c_{t^+}, d_{t^+})} \tag{28}
\]
\[
= \frac{P(x_{t+1} \mid x_{1:t}, u_{1:t}, c_{t+1}, d_{t+1})\, P(x^0_{t+1} \mid x_{1:t}, u_{1:t}, u^a_t, c_{t^+}, d_{t^+})\, P(u^a_t \mid x_{1:t}, u_{1:t}, c_{t^+}, d_{t^+})\, P(x_{1:t}, u_{1:t} \mid c_{t^+}, d_{t^+})}{P(x^0_{t+1}, u^a_t \mid c_{t^+}, d_{t^+})}
\]
\[
= \frac{P(x_{t+1} \mid x_{1:t}, u_{1:t}, c_{t+1}, d_{t+1})}{P(x^0_{t+1}, u^a_t \mid c_{t^+}, d_{t^+})} \times P(x^0_{t+1} \mid u^a_t, c_{t^+}, d_{t^+})\, P(u^a_t \mid c_{t^+}, d_{t^+})\, P(x_{1:t}, u_{1:t} \mid c_{t^+}, d_{t^+}) \tag{29}
\]
\[
= P(x^2_{t+1} \mid x^1_{t+1}, x_{1:t}, u_{1:t}, c_{t+1}, d_{t+1})\, P(x^1_{t+1} \mid x_{1:t}, u_{1:t}, c_{t+1}, d_{t+1})\, P(x_{1:t}, u_{1:t} \mid c_{t^+}, d_{t^+})
\]
\[
= P(x^2_{t+1} \mid x^0_t, x^2_t, u^2_t)\, P(x^1_{t+1} \mid x^0_t, x^1_t, u^1_t)\, P(x_{1:t}, u_{1:t} \mid c_{t^+}, d_{t^+}) \tag{30}
\]
\[
= \Big[ P(x^1_{t+1} \mid x^0_t, x^1_t, u^1_t)\, P(x^1_{1:t}, u^1_{1:t} \mid c_{t^+}, d_{t^+}) \Big] \times \Big[ P(x^2_{t+1} \mid x^0_t, x^2_t, u^2_t)\, P(x^2_{1:t}, u^2_{1:t} \mid c_{t^+}, d_{t^+}) \Big]. \tag{31}
\]
Here, (29) and (30) are consequences of the system dynamics in (3.3), and the last equation (31) follows from equation (3.32) at time $t^+$. Using the factored form of $P(x_{1:t+1}, u_{1:t} \mid c_{t+1}, d_{t+1})$ in (31), we conclude that the induction hypothesis holds at time $t+1$. Therefore, by induction, equations (3.31) and (3.32) hold at all times.

B.2 Proof of Proposition 3.1

We will prove the proposition using the following claim.

Claim .1. Consider any arbitrary strategy pair $(f, g)$ for the team. Then there exists a strategy pair $(\bar{f}, \bar{g})$ for the team such that, for each $t$, $\bar{f}^i_t$ (resp. $\bar{g}^i_t$) is a function of $X^i_t$ and $C_t, D_t$ (resp. $C_{t^+}, D_{t^+}$), and
\[
J((\bar{f}, \bar{g}), g^a) = J((f, g), g^a), \quad \forall g^a \in \mathcal{G}^a.
\]
Suppose that the above claim is true. Let $(f', g')$ be a min-max strategy for the team. By Claim .1, there exists a strategy $(\bar{f}', \bar{g}')$ for the team such that, for each $t$, $\bar{f}'^i_t$ (resp. $\bar{g}'^i_t$) is a function of $X^i_t$ and $C_t, D_t$ (resp. $C_{t^+}, D_{t^+}$), and $J((\bar{f}', \bar{g}'), g^a) = J((f', g'), g^a)$ for every strategy $g^a \in \mathcal{G}^a$.
Therefore, we have
\[
\sup_{g^a \in \mathcal{G}^a} J((\bar{f}', \bar{g}'), g^a) = \sup_{g^a \in \mathcal{G}^a} J((f', g'), g^a) = \inf_{f, g}\, \sup_{g^a \in \mathcal{G}^a} J((f, g), g^a).
\]
Thus, $(\bar{f}', \bar{g}')$ is a min-max strategy for the team wherein each agent uses only its current state and the adversary's information.

It suffices to prove Claim .1 for the information structure in Section 3.1.1.1, in which the adversary has the maximum possible information. Given a strategy $\hat{g}^a$ under any other information structure, there is an equivalent strategy $g^a$ under the maximal information structure of Section 3.1.1.1. Therefore, the set of all adversary strategies in the maximal information structure covers the adversary's strategy space under any other information structure. The proof of Claim .1 is similar to the proof of Lemma 3 in Appendix IX of [42]. However, there is an important difference in the construction of $(\bar{f}, \bar{g})$: for this construction, the conditional independence in Lemma 3.1 is crucial.

Proof of Claim .1: We now prove Claim .1 for the information structure in Section 3.1.1.1. Consider any arbitrary strategy pair $(f, g)$ for the team. Let $\iota^a_t = \{x^0_{1:t}, u^a_{1:t-1}, z^{er}_{1:t-1}, m_{1:t-1}, y_{1:t-1}\}$ be a realization of the adversary's information $I^a_t$ at time $t$. Define the distribution $\Psi_t(\iota^a_t)$ over the space $\big(\prod_{\tau=1}^{t} \prod_{i=1,2} \mathcal{X}^i_\tau \times \prod_{\tau=1}^{t-1} \prod_{i=1,2} \mathcal{U}^i_\tau\big) \times \{0,1\}^2$ as follows:
\[
\Psi_t(\iota^a_t; x_{1:t}, u_{1:t-1}, m_t) := P^{f,g,h^a}[X_{1:t}, U_{1:t-1}, M_t = (x_{1:t}, u_{1:t-1}, m_t) \mid \iota^a_t],
\]
if $\iota^a_t$ is feasible, that is, $P^{f,g,h^a}[I^a_t = \iota^a_t] > 0$, under the open-loop strategy $h^a := (u^a_{1:t-1})$ for the adversary. Otherwise, define $\Psi_t(\iota^a_t; \cdot)$ to be the uniform distribution over the same space. Let $\iota^a_{t^+} = \{x^0_{1:t}, u^a_{1:t-1}, z^{er}_{1:t}, m^{1,2}_{1:t}, y_{1:t}\}$ be a realization of the adversary's information $I^a_{t^+}$ (which is the same as the common information at time $t^+$).
Define the distribution $\Psi_{t^+}(\iota^a_{t^+})$ over the space $\prod_{\tau=1}^{t} \prod_{i=1,2} \mathcal{X}^i_\tau \times \mathcal{U}^i_\tau$ as follows:
\[
\Psi_{t^+}(\iota^a_{t^+}; x_{1:t}, u_{1:t}) := P^{f,g,h^a}[X_{1:t}, U_{1:t} = (x_{1:t}, u_{1:t}) \mid I^a_{t^+} = \iota^a_{t^+}],
\]
if $\iota^a_{t^+}$ is feasible, that is, $P^{f,g,h^a}[I^a_{t^+} = \iota^a_{t^+}] > 0$, under the open-loop strategy $h^a := (u^a_{1:t-1})$ for the adversary. Otherwise, define $\Psi_{t^+}(\iota^a_{t^+}; \cdot)$ to be the uniform distribution over the same space.

Lemma .1. Let $(f, g)$ be the team's strategy and let $g^a$ be an arbitrary strategy for the adversary. Then, for any realization $x_{1:t}, u_{1:t}, m_t$ of the variables $X_{1:t}, U_{1:t}, M_t$, we have
\[
P^{f,g,g^a}[X_{1:t}, U_{1:t-1}, M_t = (x_{1:t}, u_{1:t-1}, m_t) \mid I^a_t] = \Psi_t(I^a_t; x_{1:t}, u_{1:t-1}, m_t),
\]
\[
P^{f,g,g^a}[X_{1:t}, U_{1:t} = (x_{1:t}, u_{1:t}) \mid I^a_{t^+}] = \Psi_{t^+}(I^a_{t^+}; x_{1:t}, u_{1:t}),
\]
almost surely.

Corollary .1. At any given time $t$, the functions $\Psi_t$ and $\Psi_{t^+}$ can be factorized as
\[
\Psi_t(\iota^a_t; x_{1:t}, u_{1:t-1}, m_t) = \Psi^1_t(\iota^a_t; x^1_{1:t}, u^1_{1:t-1}, m^1_t) \times \Psi^2_t(\iota^a_t; x^2_{1:t}, u^2_{1:t-1}, m^2_t), \tag{32}
\]
\[
\Psi_{t^+}(\iota^a_{t^+}; x_{1:t}, u_{1:t}) = \Psi^1_{t^+}(\iota^a_{t^+}; x^1_{1:t}, u^1_{1:t}) \times \Psi^2_{t^+}(\iota^a_{t^+}; x^2_{1:t}, u^2_{1:t}). \tag{33}
\]
This is a direct consequence of Lemma 3.1.

For any instance $\iota^a_t$ of the adversary's information $I^a_t$, define the distribution $\Phi_t(\iota^a_t)$ over the space $\mathcal{X}_t \times \{0,1\}^2$ as
\[
\Phi_t(\iota^a_t; x_t, m_t) = \sum_{x_{1:t-1}} \sum_{u_{1:t-1}} \Psi_t(\iota^a_t; x_{1:t}, u_{1:t-1}, m_t). \tag{34}
\]
Similarly, for any instance $\iota^a_{t^+}$ of the adversary's information $I^a_{t^+}$, define the distribution $\Phi_{t^+}(\iota^a_{t^+})$ over the space $\mathcal{X}_t \times \mathcal{U}_t$ as
\[
\Phi_{t^+}(\iota^a_{t^+}; x_t, u_t) = \sum_{x_{1:t-1}} \sum_{u_{1:t-1}} \Psi_{t^+}(\iota^a_{t^+}; x_{1:t}, u_{1:t}). \tag{35}
\]
Using Corollary .1, the functions $\Phi_t$ and $\Phi_{t^+}$ can be factorized as
\[
\Phi_t(\iota^a_t; x_t, m_t) = \Phi^1_t(\iota^a_t; x^1_t, m^1_t)\, \Phi^2_t(\iota^a_t; x^2_t, m^2_t), \tag{36}
\]
\[
\Phi_{t^+}(\iota^a_{t^+}; x_t, u_t) = \Phi^1_{t^+}(\iota^a_{t^+}; x^1_t, u^1_t)\, \Phi^2_{t^+}(\iota^a_{t^+}; x^2_t, u^2_t). \tag{37}
\]
Define the communication strategy $\bar{f}^i$ for agent $i$ in the team such that, for any realization $x^i_t, \iota^a_t$ of the state $X^i_t$ and the adversary's information $I^a_t$ at time $t$, the probability of selecting message $m^i_t$ at time $t$ is
\[
\bar{f}^i_t(x^i_t, \iota^a_t; m^i_t) :=
\begin{cases}
\dfrac{\Phi^i_t(\iota^a_t; x^i_t, m^i_t)}{\sum_{m'_t} \Phi^i_t(\iota^a_t; x^i_t, m'_t)} & \text{if } \sum_{m'_t} \Phi^i_t(\iota^a_t; x^i_t, m'_t) > 0, \\[3ex]
\mathcal{U}(\cdot) & \text{otherwise,}
\end{cases} \tag{38}
\]
where $\mathcal{U}(\cdot)$ denotes the uniform distribution over the action space $\{0,1\}$. Notice that the construction of the strategy $\bar{f}^i$ does not involve the adversary's strategy $g^a$. Define the control strategy $\bar{g}^i$ for agent $i$ in the team such that, for any realization $x^i_t, \iota^a_{t^+}$ of the state $X^i_t$ and the adversary's information $I^a_{t^+}$ at time $t^+$, the probability of selecting action $u^i_t$ at time $t^+$ is
\[
\bar{g}^i_t(x^i_t, \iota^a_{t^+}; u^i_t) :=
\begin{cases}
\dfrac{\Phi^i_{t^+}(\iota^a_{t^+}; x^i_t, u^i_t)}{\sum_{u'_t} \Phi^i_{t^+}(\iota^a_{t^+}; x^i_t, u'_t)} & \text{if } \sum_{u'_t} \Phi^i_{t^+}(\iota^a_{t^+}; x^i_t, u'_t) > 0, \\[3ex]
\mathcal{U}(\cdot) & \text{otherwise,}
\end{cases} \tag{39}
\]
where $\mathcal{U}(\cdot)$ denotes the uniform distribution over the action space $\mathcal{U}^i_t$. Notice that the construction of the strategy $\bar{g}^i$ does not involve the adversary's strategy $g^a$. For convenience, define
\[
\bar{f}_t(x_t, \iota^a_t; m_t) = \bar{f}^1_t(x^1_t, \iota^a_t; m^1_t)\, \bar{f}^2_t(x^2_t, \iota^a_t; m^2_t), \tag{40}
\]
\[
\bar{g}_t(x_t, \iota^a_{t^+}; u_t) = \bar{g}^1_t(x^1_t, \iota^a_{t^+}; u^1_t)\, \bar{g}^2_t(x^2_t, \iota^a_{t^+}; u^2_t). \tag{41}
\]
Lemma .2. For any strategy $g^a$ of the adversary, we have
\[
P^{((f,g),g^a)}[M_t = m_t \mid X_t, I^a_t] = \bar{f}_t(X_t, I^a_t; m_t),
\]
Summing over all x1:t−1, ut and using (35), (34), (42) and (43), we have P ((f,g),ga) [Xt = xt | I a t = ι a t ] = X mt Φt(ι a t ; xt , mt). (44) P ((f,g),ga) [Xt = xt | I a t+ = ι a t+] = X ut Φt+(ι a t+; xt , ut). (45) The left hand side of the above equation is positive since xt , ia t , ιa t+ is a realization of positive probability under the strategy profile ((f, g), ga ). Using Bayes’ rule, (34), (38) and (42), we obtain P ((f,g),ga) [Mt = mt | Xt = xt , Ia t = ι a t ] = ¯ft(xt , ιa t ; mt). Using Bayes’ rule, (35), (39) and (43), we obtain P ((f,g),ga) [Ut = ut | Xt = xt , Ia t+ = ι a t+] = ¯gt(xt , ιa t+; ut). This concludes the proof of the lemma. Let us define (( ¯f, g¯), ga ), where ¯f, g¯ is as defined in (38), (39). We can now show that the strategy ( ¯f, g¯) satisfies J(( ¯f, g¯), ga ) = J((f, g), ga ), for every strategy g a ∈ Ga . Because of the structure of the cost function in (3.20), it is sufficient to show that for each time t, the random variables (Xt , Mt , Ia t ) have the same joint distribution under strategy profiles ((f, g), ga ) and (( ¯f, g¯), ga ) and at time t +, the random variables (Xt , Ut , Ua t , Ia t+ ) have the same joint distribution under strategy profiles ((f, g), ga ) and (( ¯f, g¯), ga ). We prove this by induction. It is easy to verify that at time t = 1, (X1, M1, Ia 1 ) have the same joint distribution under strategy profiles ((f, g), ga ) and 107 Appendix B. Chapter 3 (( ¯f, g¯), ga ). Now assume that at time t, P ((f,g),ga) [xt , mt , ιa t ] = P ((f,¯ g¯),ga [xt , mt , ιa t ], (46) for any realization of state, actions and adversary’s information xt , mt , ιa t . Let ι a t+ = (ι a t , zer t , mt , yt). Then we have P ((f,g),ga) [xt , ιa t+ ] = P[yt | z er t , xt , ιa t , mt ]P[z er t | xt , ιa t , mt ]P ((f,g),ga) [xt , ιa t , mt ] (47) = P[yt | z er t , xt , ιa t , mt ]P[z er t | xt , ιa t , mt ] × P ((f,¯ g¯),ga [xt , ιa t , mt ] (48) = P ((f,¯ g¯),ga [xt , ιa t+ ]. 
At $t^+$, for any realization $x_t, u_t, u^a_t, \iota^a_{t^+}$ that has non-zero probability of occurrence under the strategy profile $((f, g), g^a)$, we have
\[
P^{((f,g),g^a)}[x_t, u_t, u^a_t, \iota^a_{t^+}] = P^{((f,g),g^a)}[x_t, \iota^a_{t^+}]\, g^a_t(\iota^a_{t^+}; u^a_t)\, P^{((f,g),g^a)}[u^1_t \mid x^1_t, \iota^a_{t^+}]\, P^{((f,g),g^a)}[u^2_t \mid x^2_t, \iota^a_{t^+}] \tag{50}
\]
\[
= P^{((f,g),g^a)}[x_t, \iota^a_{t^+}]\, g^a_t(\iota^a_{t^+}; u^a_t)\, \bar{g}^1_t(x^1_t, \iota^a_{t^+}; u^1_t)\, \bar{g}^2_t(x^2_t, \iota^a_{t^+}; u^2_t) \tag{51}
\]
\[
= P^{((\bar{f},\bar{g}),g^a)}[x_t, \iota^a_{t^+}]\, g^a_t(\iota^a_{t^+}; u^a_t)\, \bar{g}^1_t(x^1_t, \iota^a_{t^+}; u^1_t)\, \bar{g}^2_t(x^2_t, \iota^a_{t^+}; u^2_t) \tag{52}
\]
\[
= P^{((\bar{f},\bar{g}),g^a)}[x_t, \iota^a_{t^+}]\, g^a_t(\iota^a_{t^+}; u^a_t)\, P^{((\bar{f},\bar{g}),g^a)}[u^1_t \mid x^1_t, \iota^a_{t^+}]\, P^{((\bar{f},\bar{g}),g^a)}[u^2_t \mid x^2_t, \iota^a_{t^+}] \tag{53}
\]
\[
= P^{((\bar{f},\bar{g}),g^a)}[x_t, u_t, u^a_t, \iota^a_{t^+}], \tag{54}
\]
where the equality in (50) is a consequence of the chain rule and the manner in which the players randomize their actions, the equality in (51) follows from Lemma .2, and the equality in (52) follows from (49). Then we have
\[
P^{((f,g),g^a)}[x_{t+1}, \iota^a_{t+1}] = \sum_{\bar{x}_t} \sum_{\bar{u}_t} P[x_{t+1}, x^0_{t+1} \mid \bar{x}_t, \bar{u}_t, u^a_t, \iota^a_{t^+}]\, P^{((f,g),g^a)}[\bar{x}_t, \bar{u}_t, u^a_t, \iota^a_{t^+}]
\]
\[
= \sum_{\bar{x}_t} \sum_{\bar{u}_t} P[x_{t+1}, x^0_{t+1} \mid \bar{x}_t, \bar{u}_t, u^a_t, \iota^a_{t^+}]\, P^{((\bar{f},\bar{g}),g^a)}[\bar{x}_t, \bar{u}_t, u^a_t, \iota^a_{t^+}] \tag{55}
\]
\[
= P^{((\bar{f},\bar{g}),g^a)}[x_{t+1}, \iota^a_{t+1}]. \tag{56}
\]
The equality in (55) is due to the induction hypothesis. Note that the conditional distribution $P[x_{t+1}, x^0_{t+1} \mid \bar{x}_t, \bar{u}_t, u^a_t, \iota^a_{t^+}]$ does not depend on the players' strategies. At $t+1$, for any realization $x_{t+1}, m_{t+1}, \iota^a_{t+1}$ that has non-zero probability of occurrence under the strategy profile $((f, g), g^a)$, we have
\[
P^{((f,g),g^a)}[x_{t+1}, m_{t+1}, \iota^a_{t+1}] \tag{57}
\]
\[
= P^{((f,g),g^a)}[m_{t+1} \mid x_{t+1}, \iota^a_{t+1}]\, P^{((f,g),g^a)}[x_{t+1}, \iota^a_{t+1}] \tag{58}
\]
\[
= \bar{f}_{t+1}(x_{t+1}, \iota^a_{t+1}; m_{t+1})\, P^{((\bar{f},\bar{g}),g^a)}[x_{t+1}, \iota^a_{t+1}] \tag{59}
\]
\[
= P^{((\bar{f},\bar{g}),g^a)}[m_{t+1} \mid x_{t+1}, \iota^a_{t+1}]\, P^{((\bar{f},\bar{g}),g^a)}[x_{t+1}, \iota^a_{t+1}] \tag{60}
\]
\[
= P^{((\bar{f},\bar{g}),g^a)}[x_{t+1}, m_{t+1}, \iota^a_{t+1}]. \tag{61}
\]
Therefore, by induction, the equality in (46) holds for all $t$. This concludes the proof of Claim .1.
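The construction in (38)–(39) is simply conditioning the marginal $\Phi$ on the agent's current state, with a uniform fallback on zero-probability states. A small tabular sketch (hypothetical joint $\Phi$ over (state, message) pairs; the adversary-information argument $\iota^a_t$ is suppressed since it is fixed):

```python
def make_fbar(phi, messages):
    """Build f_bar(x; m) = phi(x, m) / sum_{m'} phi(x, m'), mirroring (38).

    phi: dict (x, m) -> probability mass of the joint marginal Phi.
    Falls back to the uniform distribution over messages when the
    conditioning event {X = x} has zero mass, as in (38).
    """
    states = {x for (x, _) in phi}
    fbar = {}
    for x in states:
        tot = sum(phi[(x, m)] for m in messages)
        if tot > 0:
            fbar[x] = {m: phi[(x, m)] / tot for m in messages}
        else:
            fbar[x] = {m: 1.0 / len(messages) for m in messages}
    return fbar
```

The analogous construction with (state, control) pairs gives $\bar{g}^i_t$ as in (39).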
B.3 Proof of Proposition 3.2

Define the team's common-information belief on the global state and the agents' private states as
\[
B_t(x^0, x^1, x^2) = P(X^0_t = x^0, X^1_t = x^1, X^2_t = x^2 \mid C^{team}_t),
\]
\[
B_{t^+}(x^0, x^1, x^2) = P(X^0_t = x^0, X^1_t = x^1, X^2_t = x^2 \mid C^{team}_{t^+}).
\]
Note that this belief coincides with the common-information belief when the adversary has maximum information. Therefore, we can use the transformations $\bar{\eta}_t$ and $\bar{\beta}_t$ (see Section 3.3.3) to update the belief $B_t$.

Lemma .3. There exists a min-max strategy for the team of the form
\[
M^i_t \sim f^i_t(X^i_t, B_t, C_t), \tag{62}
\]
\[
U^i_t \sim g^i_t(X^i_t, B_{t^+}, C_{t^+}). \tag{63}
\]
Proof. This lemma can be shown by replacing the agents in the team with a virtual coordinator [4]. The coordinator sees only the team's common information and selects prescriptions for both agents, which the agents use to map their private information $X^i_t$ to an action. When the agents in the team are replaced with a virtual coordinator, the team-game can be viewed as a zero-sum game between two individual players: the coordinator and the adversary. It was shown in [105] that the more-informed player (the coordinator) can select its actions (prescriptions) based on its belief on the system state ($B_t$) and the adversary's information ($C_t$) without loss of optimality. Thus, the private information used by agent $i$ in the team is further reduced from $X^i_t, D_t$ to $X^i_t, B_t$.

Lemma .4. Let $(f, g)$ be a belief-based strategy as in (62)–(63) for the team. For any such strategy pair, the team's belief $B_t$ is given by
\[
B_t = \varpi_t(X_{L_t}, C_t), \tag{64}
\]
\[
B_{t^+} = \varpi_{t^+}(X_{L_{t^+}}, C_{t^+}), \tag{65}
\]
where $\varpi_t$ and $\varpi_{t^+}$ are transformations that may depend on the team's strategy $(f, g)$.

Proof. We first split the team's strategy into the coordinator's strategy and prescriptions in the following manner. The prescriptions for agent $i$ are defined as
\[
\Gamma^i_t = f^i_t(\cdot, B_t, C_t), \tag{66}
\]
\[
\Lambda^i_t = g^i_t(\cdot, B_{t^+}, C_{t^+}). \tag{67}
\]
The pair of prescriptions is denoted by $\Gamma_t := \vartheta_t(B_t, C_t)$ and $\Lambda_t := \vartheta_{t^+}(B_{t^+}, C_{t^+})$, where $\vartheta$ represents the coordinator's strategy. We prove this lemma by induction. Our induction hypothesis is that (64) holds at time $t$. Note that at $t = 1$, the induction hypothesis holds because the team's common information $C^{team}_1$ and the overall common information $C_1$ are identical.

We now prove that (65) holds at time $t^+$. The new common observations received by the team at time $t^+$ are $(Z_{t^+}, Z^{er}_t)$, where $Z_{t^+}$ denotes the adversary's observations at time $t^+$. If $Z^{er}_t \neq \emptyset$, then $Z^{er}_t = X_t = X_{L_{t^+}}$. Therefore, $B_{t^+}$ is simply the degenerate distribution, i.e., $B_{t^+}(x) = \mathbb{1}_{X_t}(x)$, and (65) holds. If $Z^{er}_t = \emptyset$, we have
\[
B_{t^+} \stackrel{(a)}{=} \bar{\eta}_t(B_t, \Gamma_t, (Z_{t^+}, \emptyset)) \tag{68}
\]
\[
= \bar{\eta}_t(B_t, \vartheta_t(B_t, C_t), (Z_{t^+}, \emptyset)) \tag{69}
\]
\[
\stackrel{(b)}{=:} \varpi_{t^+}(X_{L_t}, C_{t^+}) \tag{70}
\]
\[
= \varpi_{t^+}(X_{L_{t^+}}, C_{t^+}). \tag{71}
\]
Here, (a) follows from the fact that we can use $\bar{\eta}_t$ of Section 3.3.3 to update $B_t$, and (b) holds because, by the induction hypothesis at time $t$, $B_t = \varpi_t(X_{L_t}, C_t)$. We now prove that the induction hypothesis holds at time $t+1$. The common observations at time $t+1$ are $Z_{t+1}$. Using the belief-update transformation $\bar{\beta}_t$ of Section 3.3.3,
\[
B_{t+1} = \bar{\beta}_t(B_{t^+}, \Lambda_t, Z_{t+1}) \tag{72}
\]
\[
= \bar{\beta}_t(B_{t^+}, \vartheta_{t^+}(B_{t^+}, C_{t^+}), Z_{t+1}) \tag{73}
\]
\[
\stackrel{(c)}{=:} \varpi_{t+1}(X_{L_{t^+}}, C_{t+1}) \tag{74}
\]
\[
= \varpi_{t+1}(X_{L_{t+1}}, C_{t+1}). \tag{75}
\]
Here, (c) is a consequence of (71). Therefore, by induction, the lemma holds at all times $t$ and $t^+$. Proposition 3.2 is obtained by substituting $B_t$ in Lemma .3 with the simplified expressions in Lemma .4.

C Appendix: Optimal Symmetric Strategies in Multi-Agent Systems with Decentralized Information

C.1 Proof of Lemma 4.3

To prove Lemma 4.3, we first show that the private information of the agents is conditionally independent given the common information under any strategies.
For Problem P1a, this is straightforward since the disturbances in the dynamics are independent:
\[
P(X^{1,2}_t = x^{1,2}_t \mid C_t = (x^{1,2}_{1:t-1}, x^0_{1:t}, u_{1:t-1})) = P(f_{t-1}(x^1_{t-1}, x^0_{t-1}, u_{t-1}, W^1_{t-1}) = x^1_t) \times P(f_{t-1}(x^2_{t-1}, x^0_{t-1}, u_{t-1}, W^2_{t-1}) = x^2_t).
\]
For Problems P1b and P1c, we have the following lemma.

Lemma .5 (Conditional independence property). Consider any arbitrary (symmetric or asymmetric) choice of agents' strategies in Problems P1b and P1c. Then, at any time $t$, the two agents' private information is conditionally independent given the common information $C_t$. That is, if $c_t$ is the realization of the common information at time $t$, then for any realization $p_t$ of the private information, we have
\[
P^{(g^1, g^2)}(p_t \mid c_t) = \prod_{i=1}^{2} P^{g^i}(p^i_t \mid c_t). \tag{76}
\]
Further, $P^{g^i}(p^i_t \mid c_t)$ depends only on agent $i$'s strategy and not on the strategy of agent $-i$.

Proof. The proof is analogous to the proof of [30, Proposition 1], except for the possible randomization in the agents' strategies.

Using the above conditional independence property for Problems P1a–P1c, we can now prove (4.15). At time $t$, the coordinator's belief is given by
\[
\Pi_t(x^0_t, p^1_t, p^2_t) = P(X^0_t = x^0_t, P^1_t = p^1_t, P^2_t = p^2_t \mid C_t, \Gamma_{1:t-1}) \tag{77}
\]
for any realization $x^0_t$ of the global state and any realizations $p^1_t, p^2_t$ of the agents' private information. Since $X^0_t$ is part of $C_t$, the coordinator's belief can be factorized as
\[
P(x^0_t, p^1_t, p^2_t \mid C_t, \Gamma_{1:t-1}) = \delta_{X^0_t}(x^0_t)\, P(p^1_t, p^2_t \mid C_t, \Gamma_{1:t-1}) = \delta_{X^0_t}(x^0_t)\, P(p^1_t \mid C_t, \Gamma_{1:t-1})\, P(p^2_t \mid C_t, \Gamma_{1:t-1}) = \delta_{X^0_t}(x^0_t)\, \Pi^1_t(p^1_t)\, \Pi^2_t(p^2_t),
\]
where we used the above-mentioned conditional independence. We now prove (4.16) for Problems P1a–P1c.

C.1.1 Problem P1a

Let $\pi^i_{t+1}$ be the realization of the coordinator's marginal belief $\Pi^i_{t+1}$ on agent $i$'s private information at time $t+1$.
Then,
\[
\pi^i_{t+1}(x^i_{t+1}) = P(x^i_{t+1} \mid c_{t+1} = (x^{1,2}_{1:t}, x^0_{1:t+1}, u_{1:t}), \gamma_{1:t})
= \frac{P(x^i_{t+1}, x^0_{t+1} \mid x^{1,2}_{1:t}, x^0_{1:t}, u_{1:t}, \gamma_{1:t})}{P(x^0_{t+1} \mid x^{1,2}_{1:t}, x^0_{1:t}, u_{1:t}, \gamma_{1:t})}
= \frac{P(f_t(x^0_t, u_t, W^0_t) = x^0_{t+1})\, P(f_t(x^i_t, x^0_t, u_t, W^i_t) = x^i_{t+1})}{P(f_t(x^0_t, u_t, W^0_t) = x^0_{t+1})}
= P(f_t(x^i_t, x^0_t, u_t, W^i_t) = x^i_{t+1}). \tag{78}
\]
Thus $\pi^i_{t+1}(\cdot)$ is determined by $x^0_t$ and the increment in the common information.

C.1.2 Problem P1b

In this problem, the coordinator's belief on agent 1's private information at time $t+1$ is given by
\[
\pi^1_{t+1}(x^1_{1:t+1}) = P(x^1_{1:t+1} \mid c_{t+1} = (x^0_{1:t+1}, u_{1:t}), \gamma_{1:t})
= \frac{\sum_{x^2_{1:t}} P(x^1_{1:t+1}, x^0_{t+1}, u_t, x^2_{1:t} \mid x^0_{1:t}, u_{1:t-1}, \gamma_{1:t})}{\sum_{\tilde{x}^1_{1:t+1}} \sum_{\tilde{x}^2_{1:t}} P(\tilde{x}^1_{1:t+1}, x^0_{t+1}, u_t, \tilde{x}^2_{1:t} \mid x^0_{1:t}, u_{1:t-1}, \gamma_{1:t})}. \tag{79}
\]
The numerator of (79) can be written as
\[
\sum_{x^2_{1:t}} \Big[ P(x^1_{t+1} \mid x^1_t, x^0_t, u_t)\, P(x^0_{t+1} \mid x^0_t, u_t)\, \mathbb{1}_{u^1_t = \gamma(x^1_{1:t})}\, \mathbb{1}_{u^2_t = \gamma(x^2_{1:t})}\, \pi^1_t(x^1_{1:t})\, \pi^2_t(x^2_{1:t}) \Big]. \tag{80}
\]
Similarly, the denominator can be written as
\[
\sum_{\tilde{x}^1_{1:t+1}} \sum_{\tilde{x}^2_{1:t}} \Big[ P(\tilde{x}^1_{t+1} \mid \tilde{x}^1_t, x^0_t, u_t)\, P(x^0_{t+1} \mid x^0_t, u_t)\, \mathbb{1}_{u^1_t = \gamma(\tilde{x}^1_{1:t})}\, \mathbb{1}_{u^2_t = \gamma(\tilde{x}^2_{1:t})}\, \pi^1_t(\tilde{x}^1_{1:t})\, \pi^2_t(\tilde{x}^2_{1:t}) \Big]. \tag{81}
\]
Let $z^b_{t+1} := (x^0_{t+1}, u_t)$ be the increment in the common information in Problem P1b. Substituting equations (80) and (81) in equation (79), $\pi^1_{t+1}(x^1_{1:t+1})$ can be written as
\[
\frac{P(x^1_{t+1} \mid x^1_t, x^0_t, u_t)\, P(x^0_{t+1} \mid x^0_t, u_t)\, \mathbb{1}_{u^1_t = \gamma(x^1_{1:t})}\, \pi^1_t(x^1_{1:t})}{\sum_{\tilde{x}^1_{1:t+1}} P(\tilde{x}^1_{t+1} \mid \tilde{x}^1_t, x^0_t, u_t)\, P(x^0_{t+1} \mid x^0_t, u_t)\, \mathbb{1}_{u^1_t = \gamma(\tilde{x}^1_{1:t})}\, \pi^1_t(\tilde{x}^1_{1:t})}. \tag{82}
\]
Thus $\pi^i_{t+1}(\cdot)$ is determined by $x^0_t, \pi^i_t, \gamma_t, z^b_{t+1}$. We denote the update rule described above by $\eta^i_t$, i.e.,
\[
\pi^i_{t+1} = \eta^i_t(x^0_t, \pi^i_t, \gamma_t, z^b_{t+1}). \tag{83}
\]
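The update (82) combines a correction step (the indicator $\mathbb{1}_{u^1_t = \gamma(\cdot)}$, conditioning on the observed action) with a prediction step through the transition kernel, while the common factor $P(x^0_{t+1} \mid x^0_t, u_t)$ cancels between numerator and denominator. A sketch of the state-belief version of this step (the form that arises for Problem P1c), with a hypothetical kernel and prescription:

```python
def update_belief(pi, gamma, u1, P):
    """One step of the coordinator's belief update for agent 1:
    pi'(x') ∝ sum_x P(x' | x) * 1{gamma(x) == u1} * pi(x),
    mirroring the structure of (82) with history beliefs replaced by
    state beliefs. P: dict x -> dict x' -> prob; gamma: dict x -> action.
    """
    unnorm = {}
    for x, p in pi.items():
        if gamma[x] != u1:
            continue  # correction: keep only states consistent with u1
        for x2, q in P[x].items():
            unnorm[x2] = unnorm.get(x2, 0.0) + q * p  # prediction
    z = sum(unnorm.values())
    return {x2: v / z for x2, v in unnorm.items()}
```

(In (82) the kernel also depends on the observed $x^0_t, u_t$; here that dependence is absorbed into the fixed `P`.)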
C.1.3 Problem P1c

The argument here is similar to the one used for Problem P1b above:
\[
\pi^1_{t+1}(x^1_{t+1}) = P(x^1_{t+1} \mid c_{t+1} = (x^0_{1:t+1}, u_{1:t}), \gamma_{1:t})
= \frac{\sum_{x^{1,2}_t} P(x^1_{t+1}, x^0_{t+1}, u_t, x^{1,2}_t \mid x^0_{1:t}, u_{1:t-1}, \gamma_{1:t})}{\sum_{\tilde{x}^1_{t+1}, \tilde{x}^{1,2}_t} P(\tilde{x}^1_{t+1}, x^0_{t+1}, u_t, \tilde{x}^{1,2}_t \mid x^0_{1:t}, u_{1:t-1}, \gamma_{1:t})}. \tag{84}
\]
Let $z^c_{t+1} := (x^0_{t+1}, u_t)$ be the increment in the common information in Problem P1c. Then (84) can be written as
\[
\pi^1_{t+1}(x^1_{t+1}) = \frac{\sum_{x^1_t} P(x^1_{t+1} \mid x^1_t, x^0_t, u_t)\, P(x^0_{t+1} \mid x^0_t, u_t)\, \mathbb{1}_{u^1_t = \gamma(x^1_t)}\, \pi^1_t(x^1_t)}{\sum_{\tilde{x}^1_{t+1}, \tilde{x}^1_t} P(\tilde{x}^1_{t+1} \mid \tilde{x}^1_t, x^0_t, u_t)\, P(x^0_{t+1} \mid x^0_t, u_t)\, \mathbb{1}_{u^1_t = \gamma(\tilde{x}^1_t)}\, \pi^1_t(\tilde{x}^1_t)}. \tag{85}
\]
Thus $\pi^i_{t+1}(\cdot)$ is determined by $x^0_t, \pi^i_t, \gamma_t, z^c_{t+1}$. We denote the update rule described above by $\eta^i_t$, i.e.,
\[
\pi^i_{t+1} = \eta^i_t(x^0_t, \pi^i_t, \gamma_t, z^c_{t+1}). \tag{86}
\]

C.2 Proof of Theorem 4.2

We prove the theorem by backward induction. Consider Problem P1b. The value function for the coordinator's dynamic program at time $T$ can be written as follows: for any realization $\pi^1_T, \pi^2_T, x^0_T$ of $\Pi^1_T, \Pi^2_T, X^0_T$ respectively,
\[
V_T(\pi^1_T, \pi^2_T, x^0_T) = \min_{\gamma_T \in \mathcal{B}_T} Q_T(\pi^1_T, \pi^2_T, x^0_T, \gamma_T), \tag{87}
\]
where
\[
Q_T(\pi^1_T, \pi^2_T, x^0_T, \gamma_T) := \sum_{x_{1:T}} \sum_{u_T} k_T(x_T, u_T)\, \delta_{x^0_T}(x^0)\, \pi^1_T(x^1_{1:T})\, \pi^2_T(x^2_{1:T})\, \gamma_T(x^1_{1:T}; u^1_T)\, \gamma_T(x^2_{1:T}; u^2_T). \tag{88}
\]
$Q_T(\pi^1_T, \pi^2_T, x^0_T, \gamma_T) \geq 0$ because $k_T(\cdot, \cdot)$ is a non-negative function. The deterministic mapping $m$ from $\mathcal{X}$ to $\mathcal{U}$ can be viewed as a prescription $\gamma \in \mathcal{B}_T$ with $\gamma(x^i_{1:T}; m(x^i_T)) = 1$. Then
\[
Q_T(\pi^1_T, \pi^2_T, x^0_T, m) := \sum_{x_{1:T}} k_T(x_T, m(x^1_T), m(x^2_T))\, \pi^1_T(x^1_{1:T})\, \pi^2_T(x^2_{1:T})\, \delta_{x^0_T}(x^0) = 0, \tag{89}
\]
where we used the assumption on the cost function, namely, $k_T(x_T, m(x^1_T), m(x^2_T)) = 0$. Hence,
\[
Q_T(\pi^1_T, \pi^2_T, x^0_T, \gamma_T) \geq Q_T(\pi^1_T, \pi^2_T, x^0_T, m) = 0, \tag{90}
\]
and therefore,
\[
V_T(\pi^1_T, \pi^2_T, x^0_T) = \min_{\gamma_T \in \mathcal{B}_T} Q_T(\pi^1_T, \pi^2_T, x^0_T, \gamma_T) = 0. \tag{91}
\]
(91) Induction hypothesis: Assume the coordinator’s value function Vt+1(π 1 t+1, π2 t+1, x0 t+1) = 0 for any realization π 1 t+1, π2 t+1, x0 t+1 at time t + 1. The value function at time t can be written as Vt(π 1 t , π2 t , x0 t ) = min γt∈Bt Qt(π 1 t , π2 t , x0 t , γt), (92) where we define the function Qt as follows: Qt(π 1 t , π2 t , x0 t , γt) := X x1:t X ut π 1 t (x 1 1:t )π 2 t (x 2 1:t )δx 0 t (x 0 )× γt(x 1 1:t ; u 1 t )γt(x 2 1:t ; u 2 t )kt(xt , ut) + E[Vt+1(δx 0 t+1 , Π 1 t+1, Π 2 t+1)|(Π1 t , Π 2 t , X0 t , Γt) = (π 1 t , π2 t , x0 t , γt)] (93) Because of the induction hypothesis, the expectation of the value function at time t + 1 is 0 and we can simplify Qt as follows: Qt(π 1 t , π2 t , x0 t ,γt) := X x1:t X ut δx 0 t (x 0 )π 1 t (x 1 1:t )π 2 t (x 2 1:t ) × γt(x 1 1:t ; u 1 t )γt(x 2 1:t ; u 2 t )kt(xt , ut). (94) Using the same arguments as those used for QT , it follows that Qt(π 1 t , π2 t , x0 t , γt) ≥ Qt(π 1 t , π2 t , x0 t , m) = 0, (95) and therefore, Vt(π 1 t , π2 t , x0 t ) = 0. Thus, the induction hypothesis is true for all times. It is clear from the above argument that the optimal prescription for the coordinator in Problem P1b is m at each time and for any realization of its information state. Similar arguments can be repeated for the coordinator in Problem P1c as well. C.3 Proof of Theorem 4.3 Because of the specialized dynamics, the coordinator’s belief on each agent’s private information at time t is given by α 1:t for Problem P1b and by α for Problem P1c. At these beliefs, the value functions for the coordinators in Problems P1b and P1c are as follows: V b t (α 1:t , α1:t ) = min γ b t ∈Bt Q b t (α 1:t , α1:t , γb t ), V c t (α, α) = min γ c t ∈Bt Q c t (α, α, γc t ) (96) 114 Appendix D. 
where the functions \(Q^b_t\) and \(Q^c_t\) are defined as
\[
Q^b_t(\alpha^{1:t}, \alpha^{1:t}, \gamma^b_t) := \sum_{x_{1:t}} \sum_{u_t} \alpha^{1:t}(x^1_{1:t})\, \alpha^{1:t}(x^2_{1:t})\, \gamma^b_t(x^1_{1:t}; u^1_t)\, \gamma^b_t(x^2_{1:t}; u^2_t)\, k_t(x_t, u_t) + \mathbb{E}\big[V_{t+1}(\alpha^{1:t+1}, \alpha^{1:t+1})\big],
\]
\[
Q^c_t(\alpha, \alpha, \gamma^c_t) := \sum_{x_t} \sum_{u_t} \alpha(x^1_t)\, \alpha(x^2_t)\, \gamma^c_t(x^1_t; u^1_t)\, \gamma^c_t(x^2_t; u^2_t)\, k_t(x_t, u_t) + \mathbb{E}\big[V_{t+1}(\alpha, \alpha)\big]. \tag{97}
\]
Using a backward inductive argument, we can show that for any \(\gamma^b_t\) there exists a \(\gamma^c_t\) such that \(Q^b_t\) and \(Q^c_t\) defined above are the same (such a \(\gamma^c_t\) must satisfy equations of the form \(\sum_{x^i_{1:t-1}} \gamma^b_t(x^i_{1:t})\, \alpha^{1:t}(x^i_{1:t}) = \alpha(x^i_t)\, \gamma^c_t(x^i_t)\)). Similarly, we can show that for any \(\gamma^c_t\) there exists a \(\gamma^b_t\) such that \(Q^b_t\) and \(Q^c_t\) are the same (such a \(\gamma^b_t\) can be defined as \(\gamma^b_t(x^i_{1:t}) := \gamma^c_t(x^i_t)\)). This relationship between the two Q-functions implies the following equation for the corresponding value functions:
\[
V^b_t(\alpha^{1:t}, \alpha^{1:t}) = V^c_t(\alpha, \alpha). \tag{98}
\]
The optimal cost in each problem is the value function at time \(t = 1\) evaluated at the prior belief \(\alpha\). Therefore, (98) at \(t = 1\) implies that the two problems have the same optimal performance. Consequently, an optimal symmetric strategy in Problem P1c will achieve the optimal performance in Problem P1b as well.

D Appendix: Thompson sampling for linear quadratic mean-field teams

D.1 Preliminary Results

The analysis does not depend on the type \(m\) of the agent, so for simplicity we omit the superscript \(m\) in all the proofs in this appendix. Since \(\breve{S}(\cdot)\) and \(\breve{L}(\cdot)\) are continuous functions on the compact set \(\breve{\Theta}\), there exist finite constants \(\breve{M}_J, \breve{M}_{\theta}, \breve{M}_S, \breve{M}_L\) such that \(\operatorname{Tr}(\breve{S}(\breve{\theta})) \le \breve{M}_J\), \(\|\breve{\theta}\| \le \breve{M}_{\theta}\), \(\|\breve{S}(\breve{\theta})\| \le \breve{M}_S\) and \(\|[I, \breve{L}(\breve{\theta})^{\intercal}]\| \le \breve{M}_L\) for all \(\breve{\theta} \in \breve{\Theta}\), where \(\|\cdot\|\) is the induced matrix norm. Let \(\breve{X}^i_T = \max_{1 \le t \le T} \|\breve{x}^i_t\|\) be the maximum norm of the relative state along the entire trajectory. The next bound follows from [70, Lemma 2].

Lemma .6.
For any \(q \ge 1\) and any \(T\), we have
\[
\mathbb{E}\big[(\breve{X}^i_T)^q\big] \le \breve{\sigma}^q\, O\big(\log(T)\,(1-\delta)^{-q}\big),
\]
where \(\delta\) is as defined in (A5).

Appendix D. Chapter 5

The following lemma gives an almost sure upper bound on the number of episodes \(\breve{K}_T\).

Lemma .7. The number of episodes \(\breve{K}_T\) is bounded as follows:
\[
\breve{K}_T \le O\left(\sqrt{(d_x + d_u)\, T \log\Big(\frac{1}{\breve{\sigma}^2}\sum_t (\breve{X}^{j_t}_T)^2\Big)}\right).
\]
Proof. We can follow the same sketch as in the proof of Lemma 3 in [70]. Let \(\breve{\eta} - 1\) be the number of times the second stopping criterion is triggered before time \(T\). Using the analysis in the proof of Lemma 3 in [70], we get
\[
\breve{K}_T \le \sqrt{2\breve{\eta} T}. \tag{99}
\]
Since the second stopping criterion is triggered whenever the determinant of the sample covariance is halved, we have
\[
\det(\breve{\Sigma}^{-1}_T) \ge 2^{\breve{\eta}-1}\det(\breve{\Sigma}^{-1}_1).
\]
Let \(d = d_x + d_u\). Since \(\big(\frac{1}{d}\operatorname{Tr}(\breve{\Sigma}^{-1}_T)\big)^d \ge \det(\breve{\Sigma}^{-1}_T)\), we have
\[
\operatorname{Tr}(\breve{\Sigma}^{-1}_T) \ge d\,\big(\det(\breve{\Sigma}^{-1}_T)\big)^{1/d} \ge d \cdot 2^{(\breve{\eta}-1)/d}\big(\det(\breve{\Sigma}^{-1}_1)\big)^{1/d} \ge d \cdot 2^{(\breve{\eta}-1)/d}\,\breve{\lambda}_{\min},
\]
where \(\breve{\lambda}_{\min}\) is the minimum eigenvalue of \(\breve{\Sigma}^{-1}_1\). Using (5.18b), we have
\[
\breve{\Sigma}^{-1}_T = \breve{\Sigma}^{-1}_1 + \sum_{t=1}^{T-1}\frac{1}{\breve{\sigma}^2}\,\breve{z}^{j_t}_t(\breve{z}^{j_t}_t)^{\intercal}.
\]
Therefore \(\operatorname{Tr}(\breve{\Sigma}^{-1}_T) = \operatorname{Tr}(\breve{\Sigma}^{-1}_1) + \sum_{t=1}^{T-1}\frac{1}{\breve{\sigma}^2}\operatorname{Tr}\big(\breve{z}^{j_t}_t(\breve{z}^{j_t}_t)^{\intercal}\big)\). Note that \(\operatorname{Tr}\big(\breve{z}^{j_t}_t(\breve{z}^{j_t}_t)^{\intercal}\big) = \operatorname{Tr}\big((\breve{z}^{j_t}_t)^{\intercal}\breve{z}^{j_t}_t\big) = \|\breve{z}^{j_t}_t\|^2\). Thus,
\[
d \cdot 2^{(\breve{\eta}-1)/d}\,\breve{\lambda}_{\min} \le \operatorname{Tr}(\breve{\Sigma}^{-1}_1) + \sum_{t=1}^{T-1}\frac{1}{\breve{\sigma}^2}\|\breve{z}^{j_t}_t\|^2.
\]
Then,
\[
\breve{\eta} \le 1 + \frac{d}{\log 2}\log\left(\frac{1}{d\breve{\lambda}_{\min}}\left(\operatorname{Tr}(\breve{\Sigma}^{-1}_1) + \sum_{t=1}^{T-1}\frac{1}{\breve{\sigma}^2}\|\breve{z}^{j_t}_t\|^2\right)\right)
= O\left(d\log\left(\frac{1}{\breve{\sigma}^2}\sum_{t=1}^{T-1}\|\breve{z}^{j_t}_t\|^2\right)\right).
\]
Note that \(\|\breve{z}^{j_t}_t\| = \|[I, \breve{L}(\breve{\theta})^{\intercal}]^{\intercal}\breve{x}^{j_t}_t\| \le \breve{M}_L\|\breve{x}^{j_t}_t\| \le \breve{M}_L\breve{X}^{j_t}_T\). Consequently,
\[
\breve{\eta} \le O\left(d\log\left(\frac{1}{\breve{\sigma}^2}\sum_{t=1}^{T-1}(\breve{X}^{j_t}_T)^2\right)\right).
\]
Therefore, combining the above inequality with (99), we get
\[
\breve{K}_T \le O\left(\sqrt{(d_x + d_u)\, T \log\Big(\frac{1}{\breve{\sigma}^2}\sum_{t=1}^{T-1}(\breve{X}^{j_t}_T)^2\Big)}\right). \tag{100}
\]

D.2 Proof of Lemma 5.3

Proof. We will bound each part separately.

1) Bounding \(\breve{R}^i_0(T)\): From the monotone convergence theorem, we have
\[
\breve{R}^i_0(T) = \mathbb{E}\Big[\sum_{k=1}^{\infty}\mathbb{1}_{\{\breve{t}_k \le T\}}\,\breve{T}_k\,\breve{J}(\breve{\theta}_k)\Big] - T\,\mathbb{E}\big[\breve{J}(\breve{\theta})\big]
= \sum_{k=1}^{\infty}\mathbb{E}\Big[\mathbb{1}_{\{\breve{t}_k \le T\}}\,\breve{T}_k\,\breve{J}(\breve{\theta}_k)\Big] - T\,\mathbb{E}\big[\breve{J}(\breve{\theta})\big].
\]
Note that the first stopping criterion of TSDE-MF ensures that \(\breve{T}_k \le \breve{T}_{k-1} + 1\) for all \(k\). Since \(\breve{J}(\breve{\theta}_k) \ge 0\), each term in the first summation satisfies
\[
\mathbb{E}\big[\mathbb{1}_{\{\breve{t}_k \le T\}}\,\breve{T}_k\,\breve{J}(\breve{\theta}_k)\big] \le \mathbb{E}\big[\mathbb{1}_{\{\breve{t}_k \le T\}}\,(\breve{T}_{k-1} + 1)\,\breve{J}(\breve{\theta}_k)\big].
\]
Note that \(\mathbb{1}_{\{\breve{t}_k \le T\}}(\breve{T}_{k-1} + 1)\) is measurable with respect to \(\sigma(\{\breve{x}^{j_s}_s, \breve{u}^{j_s}_s, \breve{x}^{j_s}_{s+1}\}_{1 \le s < \breve{t}_k})\). Then, Lemma 4 of [70] gives
\[
\mathbb{E}\big[\mathbb{1}_{\{\breve{t}_k \le T\}}(\breve{T}_{k-1} + 1)\,\breve{J}(\breve{\theta}_k)\big] = \mathbb{E}\big[\mathbb{1}_{\{\breve{t}_k \le T\}}(\breve{T}_{k-1} + 1)\,\breve{J}(\breve{\theta})\big].
\]
Combining the above equations, we get
\[
\breve{R}^i_0(T) \le \sum_{k=1}^{\infty}\mathbb{E}\big[\mathbb{1}_{\{\breve{t}_k \le T\}}(\breve{T}_{k-1} + 1)\,\breve{J}(\breve{\theta})\big] - T\,\mathbb{E}\big[\breve{J}(\breve{\theta})\big]
= \mathbb{E}\Big[\sum_{k=1}^{\breve{K}_T}(\breve{T}_{k-1} + 1)\,\breve{J}(\breve{\theta})\Big] - T\,\mathbb{E}\big[\breve{J}(\breve{\theta})\big]
= \mathbb{E}\big[\breve{K}_T\,\breve{J}(\breve{\theta})\big] + \mathbb{E}\Big[\Big(\sum_{k=1}^{\breve{K}_T}\breve{T}_{k-1} - T\Big)\breve{J}(\breve{\theta})\Big]
\le \breve{M}_J\,\breve{\sigma}^2\,\mathbb{E}\big[\breve{K}_T\big],
\]
where the last inequality holds because \(\breve{J}(\breve{\theta}) = \breve{\sigma}^2\operatorname{Tr}(\breve{S}(\breve{\theta})) \le \breve{\sigma}^2\breve{M}_J\) and \(\sum_{k=1}^{\breve{K}_T}\breve{T}_{k-1} \le T\).

2) Bounding \(\breve{R}^i_1(T)\):
\[
\breve{R}^i_1(T) = \mathbb{E}\Big[\sum_{k=1}^{\breve{K}_T}\sum_{t=\breve{t}_k}^{\breve{t}_{k+1}-1}\big[(\breve{x}^i_t)^{\intercal}\breve{S}(\breve{\theta}_k)\breve{x}^i_t - (\breve{x}^i_{t+1})^{\intercal}\breve{S}(\breve{\theta}_k)\breve{x}^i_{t+1}\big]\Big]
= \mathbb{E}\Big[\sum_{k=1}^{\breve{K}_T}\big[(\breve{x}^i_{\breve{t}_k})^{\intercal}\breve{S}(\breve{\theta}_k)\breve{x}^i_{\breve{t}_k} - (\breve{x}^i_{\breve{t}_{k+1}})^{\intercal}\breve{S}(\breve{\theta}_k)\breve{x}^i_{\breve{t}_{k+1}}\big]\Big]
\le \mathbb{E}\Big[\sum_{k=1}^{\breve{K}_T}(\breve{x}^i_{\breve{t}_k})^{\intercal}\breve{S}(\breve{\theta}_k)\breve{x}^i_{\breve{t}_k}\Big].
\]
Since \(\|\breve{S}(\breve{\theta}_k)\| \le \breve{M}_S\), we obtain
\[
\breve{R}^i_1(T) \le \mathbb{E}\Big[\sum_{k=1}^{\breve{K}_T}\breve{M}_S\|\breve{x}^i_{\breve{t}_k}\|^2\Big] \le \breve{M}_S\,\mathbb{E}\big[\breve{K}_T(\breve{X}^i_T)^2\big].
\]
Now, from Lemma .7, \(\breve{K}_T \le O\big(\sqrt{T\log\big(\sum_{t=1}^T(\breve{X}^{j_t}_T)^2/\breve{\sigma}^2\big)}\big)\). Thus, we have
\[
\breve{R}^i_1(T) \le O\left(\sqrt{T}\;\mathbb{E}\Big[(\breve{X}^i_T)^2\sqrt{\log\Big(\frac{\sum_{t=1}^T(\breve{X}^{j_t}_T)^2}{\breve{\sigma}^2}\Big)}\Big]\right).
\]
Then, using Cauchy-Schwarz, we have
\[
\mathbb{E}\Big[(\breve{X}^i_T)^2\sqrt{\log\Big(\frac{\sum_{t=1}^T(\breve{X}^{j_t}_T)^2}{\breve{\sigma}^2}\Big)}\Big]
\le \sqrt{\mathbb{E}\big[(\breve{X}^i_T)^4\big]\;\mathbb{E}\Big[\log\Big(\frac{\sum_{t=1}^T(\breve{X}^{j_t}_T)^2}{\breve{\sigma}^2}\Big)\Big]}
\le \sqrt{\mathbb{E}\big[(\breve{X}^i_T)^4\big]\;\log\Big(\sum_{t=1}^T\frac{\mathbb{E}[(\breve{X}^{j_t}_T)^2]}{\breve{\sigma}^2}\Big)}
\le \tilde{O}(\breve{\sigma}^2),
\]
where the last inequality follows from Lemma .6. Therefore, we have \(\breve{R}^i_1(T) \le \tilde{O}\big(\breve{\sigma}^2\sqrt{T}\big)\).
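The episode count \(\breve{K}_T\) bounded in Lemma .7 is driven by the two TSDE stopping criteria: a new episode starts when the current episode's length exceeds the previous one's (so that \(\breve{T}_k \le \breve{T}_{k-1}+1\)), or when the determinant of the precision matrix \(\breve{\Sigma}^{-1}_t\) doubles, i.e. the sample covariance is halved. A self-contained sketch of this scheduling on synthetic regressors is below; the dimensions and data are illustrative placeholders, not the TSDE-MF algorithm itself.

```python
import numpy as np

rng = np.random.default_rng(1)
d, T = 4, 2000                            # regressor dimension and horizon (illustrative)
Sigma_inv = np.eye(d)                     # prior precision, Sigma^{-1}_1
det_at_start = np.linalg.det(Sigma_inv)   # determinant at the start of the episode
prev_len, cur_len, episodes = 0, 0, 1

for t in range(T):
    z = rng.normal(size=d)                # synthetic regressor z_t
    Sigma_inv += np.outer(z, z)           # rank-one precision update, cf. (5.18b)
    cur_len += 1
    # criterion 1: episode length exceeds the previous episode's length
    # criterion 2: precision determinant has doubled since the episode started
    if cur_len > prev_len or np.linalg.det(Sigma_inv) > 2 * det_at_start:
        episodes += 1
        prev_len, cur_len = cur_len, 0
        det_at_start = np.linalg.det(Sigma_inv)

print(f"episodes K_T = {episodes} over horizon T = {T}")
assert episodes < T                       # K_T grows much slower than T
```

Criterion 1 alone forces episode lengths to grow linearly, giving roughly \(\sqrt{2T}\) episodes; criterion 2 can only fire about \(\log_2\det(\breve{\Sigma}^{-1}_T)/\det(\breve{\Sigma}^{-1}_1)\) times, which is the mechanism behind (99)-(100).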
3) Bounding \(\breve{R}^i_2(T)\): Each term inside the expectation of \(\breve{R}^i_2\) is equal to
\[
\|\breve{S}^{0.5}(\breve{\theta}_k)\,\breve{\theta}^{\intercal}\breve{z}^i_t\|^2 - \|\breve{S}^{0.5}(\breve{\theta}_k)\,\breve{\theta}_k^{\intercal}\breve{z}^i_t\|^2
\le \Big(\|\breve{S}^{0.5}(\breve{\theta}_k)\breve{\theta}^{\intercal}\breve{z}^i_t\| + \|\breve{S}^{0.5}(\breve{\theta}_k)\breve{\theta}_k^{\intercal}\breve{z}^i_t\|\Big)\,\|\breve{S}^{0.5}(\breve{\theta}_k)(\breve{\theta}-\breve{\theta}_k)^{\intercal}\breve{z}^i_t\|
\le 2\breve{M}_S\breve{M}_{\theta}\breve{M}_L\breve{X}^i_T\,\|(\breve{\theta}-\breve{\theta}_k)^{\intercal}\breve{z}^i_t\|,
\]
since \(\|\breve{S}^{0.5}(\breve{\theta}_k)\breve{\phi}^{\intercal}\breve{z}^i_t\| \le \breve{M}^{0.5}_S\breve{M}_{\theta}\breve{M}_L\breve{X}^i_T\) for \(\breve{\phi} = \breve{\theta}\) or \(\breve{\phi} = \breve{\theta}_k\). Therefore,
\[
\breve{R}^i_2(T) \le 2\breve{M}_S\breve{M}_{\theta}\breve{M}_L\,\mathbb{E}\Big[\breve{X}^i_T\sum_{k=1}^{\breve{K}_T}\sum_{t=\breve{t}_k}^{\breve{t}_{k+1}-1}\|(\breve{\theta}-\breve{\theta}_k)^{\intercal}\breve{z}^i_t\|\Big]. \tag{101}
\]
From the Cauchy-Schwarz inequality, we have
\[
\mathbb{E}\Big[\breve{X}^i_T\sum_{k=1}^{\breve{K}_T}\sum_{t=\breve{t}_k}^{\breve{t}_{k+1}-1}\|(\breve{\theta}-\breve{\theta}_k)^{\intercal}\breve{z}^i_t\|\Big]
= \mathbb{E}\Big[\breve{X}^i_T\sum_{k=1}^{\breve{K}_T}\sum_{t=\breve{t}_k}^{\breve{t}_{k+1}-1}\|(\breve{\Sigma}^{-0.5}_t(\breve{\theta}-\breve{\theta}_k))^{\intercal}\breve{\Sigma}^{0.5}_t\breve{z}^i_t\|\Big]
\le \mathbb{E}\Big[\sum_{k=1}^{\breve{K}_T}\sum_{t=\breve{t}_k}^{\breve{t}_{k+1}-1}\|\breve{\Sigma}^{-0.5}_t(\breve{\theta}-\breve{\theta}_k)\|\cdot\breve{X}^i_T\|\breve{\Sigma}^{0.5}_t\breve{z}^i_t\|\Big]
\le \sqrt{\mathbb{E}\Big[\sum_{k=1}^{\breve{K}_T}\sum_{t=\breve{t}_k}^{\breve{t}_{k+1}-1}\|\breve{\Sigma}^{-0.5}_t(\breve{\theta}-\breve{\theta}_k)\|^2\Big]}\;\sqrt{\mathbb{E}\Big[\sum_{k=1}^{\breve{K}_T}\sum_{t=\breve{t}_k}^{\breve{t}_{k+1}-1}(\breve{X}^i_T)^2\|\breve{\Sigma}^{0.5}_t\breve{z}^i_t\|^2\Big]}. \tag{102}
\]
From Lemma 10 in [70], the first part of (102) is bounded by
\[
\mathbb{E}\Big[\sum_{k=1}^{\breve{K}_T}\sum_{t=\breve{t}_k}^{\breve{t}_{k+1}-1}\|\breve{\Sigma}^{-0.5}_t(\breve{\theta}-\breve{\theta}_k)\|^2\Big] \le 4d_x d\,(T + \mathbb{E}[\breve{K}_T]). \tag{103}
\]
For the second part of the bound in (102), we note that
\[
\sum_t\|\breve{\Sigma}^{0.5}_t\breve{z}^i_t\|^2 = \sum_{t=1}^T(\breve{z}^i_t)^{\intercal}\breve{\Sigma}_t\breve{z}^i_t
\le \sum_{t=1}^T\max\Big(1, \frac{\breve{M}^2_L(\breve{X}^i_T)^2}{\breve{\lambda}_{\min}}\Big)\min\big(1, (\breve{z}^i_t)^{\intercal}\breve{\Sigma}_t\breve{z}^i_t\big)
\le \sum_{t=1}^T\Big(1 + \frac{\breve{M}^2_L(\breve{X}^i_T)^2}{\breve{\lambda}_{\min}}\Big)\min\big(1, (\breve{z}^{j_t}_t)^{\intercal}\breve{\Sigma}_t\breve{z}^{j_t}_t\big), \tag{104}
\]
where the last inequality follows from the definition of \(j_t\). Using Lemma 8 of [106], we have
\[
\sum_{t=1}^T\min\big(1, (\breve{z}^{j_t}_t)^{\intercal}\breve{\Sigma}_t\breve{z}^{j_t}_t\big) \le 2d\log\left(\frac{\operatorname{Tr}(\breve{\Sigma}^{-1}_1) + \breve{M}^2_L\sum_{t=1}^T(\breve{X}^{j_t}_T)^2}{d}\right). \tag{105}
\]
Combining (104) and (105), we can bound the second part of (102) as follows:
\[
\mathbb{E}\Big[\sum_t(\breve{X}^i_T)^2\|\breve{\Sigma}^{0.5}_t\breve{z}^i_t\|^2\Big]
\le O\left(\mathbb{E}\Big[(\breve{X}^i_T)^4\log\Big(\sum_{t=1}^T(\breve{X}^{j_t}_T)^2\Big)\Big] + \mathbb{E}\Big[(\breve{X}^i_T)^2\log\Big(\sum_{t=1}^T(\breve{X}^{j_t}_T)^2\Big)\Big]\right). \tag{106}
\]
The bound on \(\breve{R}^i_2(T)\) in Lemma 5.3 then follows by combining (101)-(106) with the bound on \(\mathbb{E}\big[(\breve{X}^i_T)^q\log\big(\sum_{t=1}^T(\breve{X}^{j_t}_T)^2\big)\big]\) for \(q = 2, 4\) in Lemma .8 in the appendix.

Lemma .8.
For any \(q \ge 1\), we have
\[
\mathbb{E}\Big[(\breve{X}^i_T)^q\log\Big(\sum_t(\breve{X}^{j_t}_T)^2\Big)\Big] \le \breve{\sigma}^q\,\tilde{O}(1). \tag{107}
\]
Proof.
\[
\mathbb{E}\Big[(\breve{X}^i_T)^q\log\Big(\sum_t(\breve{X}^{j_t}_T)^2\Big)\Big]
\le \mathbb{E}\Big[(\breve{X}^i_T)^q\log\Big(\max\Big(e, \sum_t(\breve{X}^{j_t}_T)^2\Big)\Big)\Big]
\le \sqrt{\mathbb{E}\big[(\breve{X}^i_T)^{2q}\big]\;\mathbb{E}\Big[\log^2\Big(\max\Big(e, \sum_t(\breve{X}^{j_t}_T)^2\Big)\Big)\Big]},
\]
where the second inequality follows from the Cauchy-Schwarz inequality. Now, \(\log^2(x)\) is a concave function for \(x \ge e\). Therefore, using Jensen's inequality we can write
\[
\mathbb{E}\Big[\log^2\Big(\max\Big(e, \sum_t(\breve{X}^{j_t}_T)^2\Big)\Big)\Big]
\le \log^2\Big(\mathbb{E}\Big[\max\Big(e, \sum_t(\breve{X}^{j_t}_T)^2\Big)\Big]\Big)
\le \log^2\Big(e + \mathbb{E}\Big[\sum_t(\breve{X}^{j_t}_T)^2\Big]\Big)
\le \log^2\big(e + T\,O(\breve{\sigma}^2\log T)\big) = \tilde{O}(1),
\]
where we used Lemma .6 in the last inequality. Similarly, \(\mathbb{E}[(\breve{X}^i_T)^{2q}] \le \breve{\sigma}^{2q}\,O(\log T)\). Therefore, combining the above inequalities, we have
\[
\mathbb{E}\Big[(\breve{X}^i_T)^q\log\Big(\sum_t(\breve{X}^{j_t}_T)^2\Big)\Big]
\le \sqrt{\mathbb{E}\big[(\breve{X}^i_T)^{2q}\big]\;\mathbb{E}\Big[\log^2\Big(\max\Big(e, \sum_t(\breve{X}^{j_t}_T)^2\Big)\Big)\Big]}
\le \breve{\sigma}^q\,\tilde{O}(1).
\]

Appendix E. Chapter 6

E Appendix: Scalable regret for learning to control network-coupled subsystems with unknown dynamics

E.1 Preliminary Results

Since \(\breve{S}(\cdot)\) and \(\breve{G}(\cdot)\) are continuous functions on the compact set \(\breve{\Theta}\), there exist finite constants \(\breve{M}_J, \breve{M}_{\theta}, \breve{M}_S, \breve{M}_G\) such that \(\operatorname{Tr}(\breve{S}(\breve{\theta})) \le \breve{M}_J\), \(\|\breve{\theta}\| \le \breve{M}_{\theta}\), \(\|\breve{S}(\breve{\theta})\| \le \breve{M}_S\) and \(\|[I, \breve{G}(\breve{\theta})^{\intercal}]^{\intercal}\| \le \breve{M}_G\) for all \(\breve{\theta} \in \breve{\Theta}\), where \(\|\cdot\|\) is the induced matrix norm. Let \(\breve{X}^i_T = \breve{\sigma}^i + \max_{1 \le t \le T}\|\breve{x}^i_t\|\). The next two bounds follow from [103, Lemma 4] and [103, Lemma 5].

Lemma .9. For each node \(i \in N\), any \(q \ge 1\) and any \(T > 1\),
\[
\mathbb{E}\Big[\frac{(\breve{X}^i_T)^q}{(\breve{\sigma}^i)^q}\Big] \le O(\log T).
\]
Lemma .10. For any \(q \ge 1\), we have
\[
\mathbb{E}\Big[\frac{(\breve{X}^i_T)^q}{(\breve{\sigma}^i)^q}\log\sum_{t=1}^T\frac{(\breve{X}^i_T)^2}{(\breve{\sigma}^i)^2}\Big]
\le \mathbb{E}\Big[\frac{(\breve{X}^i_T)^q}{(\breve{\sigma}^i)^q}\log\sum_{t=1}^T\sum_{i\in N}\frac{(\breve{X}^i_T)^2}{(\breve{\sigma}^i)^2}\Big] \le \tilde{O}(1). \tag{108}
\]
The next lemma gives an upper bound on the number of episodes \(\breve{K}_T\).

Lemma .11. The number of episodes \(\breve{K}_T\) is bounded as follows:
\[
\breve{K}_T \le O\left(\sqrt{(d_x + d_u)\,T\log\Big(\sum_{t=1}^{T-1}\frac{(\breve{X}^{j_t}_T)^2}{(\breve{\sigma}^{j_t})^2}\Big)}\right).
\]
Proof. We can follow the same argument as in the proof of Lemma 5 in [103]. Let \(\breve{\eta} - 1\) be the number of times the second stopping criterion is triggered before time \(T\).
Using the analysis in the proof of Lemma 5 in [103], we can get the following inequalities:
\[
\breve{K}_T \le \sqrt{2\breve{\eta} T}, \tag{109}
\]
\[
\det(\breve{\Sigma}^{-1}_T) \ge 2^{\breve{\eta}-1}\det(\breve{\Sigma}^{-1}_1) \ge 2^{\breve{\eta}-1}\breve{\lambda}^d_{\min}, \tag{110}
\]
where \(d = d_x + d_u\) and \(\breve{\lambda}_{\min}\) is the minimum eigenvalue of \(\breve{\Sigma}^{-1}_1\). Combining (110) with \(\operatorname{Tr}(\breve{\Sigma}^{-1}_T)/d \ge \det(\breve{\Sigma}^{-1}_T)^{1/d}\), we get \(\operatorname{Tr}(\breve{\Sigma}^{-1}_T) \ge d\,\breve{\lambda}_{\min}\,2^{(\breve{\eta}-1)/d}\). Thus,
\[
\breve{\eta} \le 1 + \frac{d}{\log 2}\log\frac{\operatorname{Tr}(\breve{\Sigma}^{-1}_T)}{d\breve{\lambda}_{\min}}. \tag{111}
\]
Now, we bound \(\operatorname{Tr}(\breve{\Sigma}^{-1}_T)\). From (6.28b), we have
\[
\operatorname{Tr}(\breve{\Sigma}^{-1}_T) = \operatorname{Tr}(\breve{\Sigma}^{-1}_1) + \sum_{t=1}^{T-1}\frac{1}{(\breve{\sigma}^{j_t})^2}\underbrace{\operatorname{Tr}\big(\breve{z}^{j_t}_t(\breve{z}^{j_t}_t)^{\intercal}\big)}_{=\|\breve{z}^{j_t}_t\|^2}. \tag{112}
\]
Note that \(\|\breve{z}^{j_t}_t\| = \|[I, \breve{G}(\breve{\theta})^{\intercal}]^{\intercal}\breve{x}^{j_t}_t\| \le \breve{M}_G\|\breve{x}^{j_t}_t\| \le \breve{M}_G\breve{X}^{j_t}_T\). Using \(\|\breve{z}^{j_t}_t\|^2 \le \breve{M}^2_G(\breve{X}^{j_t}_T)^2\) in (112), substituting the resulting bound on \(\operatorname{Tr}(\breve{\Sigma}^{-1}_T)\) in (111), and then combining it with the bound on \(\breve{\eta}\) in (109), gives the result of the lemma.

Lemma .12. The expected value of \(\breve{K}_T\) is bounded as follows:
\[
\mathbb{E}[\breve{K}_T] \le \tilde{O}\big(\sqrt{(d_x + d_u)T}\big).
\]
Proof. From Lemma .11, we get
\[
\mathbb{E}[\breve{K}_T] \le O\left(\mathbb{E}\left[\sqrt{(d_x + d_u)T\log\sum_{t=1}^{T-1}\frac{(\breve{X}^{j_t}_T)^2}{(\breve{\sigma}^{j_t})^2}}\right]\right)
\overset{(a)}{\le} O\left(\sqrt{(d_x + d_u)T\log\mathbb{E}\Big[\sum_{t=1}^{T-1}\frac{(\breve{X}^{j_t}_T)^2}{(\breve{\sigma}^{j_t})^2}\Big]}\right)
\le O\left(\sqrt{(d_x + d_u)T\log\mathbb{E}\Big[\sum_{t=1}^{T-1}\sum_{i\in N}\frac{(\breve{X}^i_T)^2}{(\breve{\sigma}^i)^2}\Big]}\right)
\overset{(b)}{\le} \tilde{O}\big(\sqrt{(d_x + d_u)T}\big),
\]
where (a) follows from Jensen's inequality and (b) follows from Lemma .9.

E.2 Proof of Lemma 6.3

Proof. We will prove each part separately.

1) Bounding \(\breve{R}^i_0(T)\): From an argument similar to the proof of Lemma 5 of [70], we get that \(\breve{R}^i_0(T) \le (\breve{\sigma}^i)^2\breve{M}_J\mathbb{E}[\breve{K}_T]\). The result then follows from substituting the bound on \(\mathbb{E}[\breve{K}_T]\) from Lemma .12.

2) Bounding \(\breve{R}^i_1(T)\):
\[
\breve{R}^i_1(T) = \mathbb{E}\Big[\sum_{k=1}^{\breve{K}_T}\sum_{t=\breve{t}_k}^{\breve{t}_{k+1}-1}\big[(\breve{x}^i_t)^{\intercal}\breve{S}_k\breve{x}^i_t - (\breve{x}^i_{t+1})^{\intercal}\breve{S}_k\breve{x}^i_{t+1}\big]\Big]
= \mathbb{E}\Big[\sum_{k=1}^{\breve{K}_T}\big[(\breve{x}^i_{\breve{t}_k})^{\intercal}\breve{S}_k\breve{x}^i_{\breve{t}_k} - (\breve{x}^i_{\breve{t}_{k+1}})^{\intercal}\breve{S}_k\breve{x}^i_{\breve{t}_{k+1}}\big]\Big]
\le \mathbb{E}\Big[\sum_{k=1}^{\breve{K}_T}(\breve{x}^i_{\breve{t}_k})^{\intercal}\breve{S}_k\breve{x}^i_{\breve{t}_k}\Big]
\le \mathbb{E}\Big[\sum_{k=1}^{\breve{K}_T}\|\breve{S}_k\|\,\|\breve{x}^i_{\breve{t}_k}\|^2\Big]
\le \breve{M}_S\,\mathbb{E}\big[\breve{K}_T(\breve{X}^i_T)^2\big], \tag{113}
\]
where the last inequality follows from \(\|\breve{S}_k\| \le \breve{M}_S\).
Using the bound for \(\breve{K}_T\) in Lemma .11, we get
\[
\breve{R}^i_1(T) \le O\left(\sqrt{(d_x+d_u)T}\;\mathbb{E}\left[(\breve{X}^i_T)^2\sqrt{\log\sum_{t=1}^{T-1}\frac{(\breve{X}^{j_t}_T)^2}{(\breve{\sigma}^{j_t})^2}}\right]\right). \tag{114}
\]
Now, consider the term
\[
\mathbb{E}\left[(\breve{X}^i_T)^2\sqrt{\log\sum_{t=1}^{T-1}\frac{(\breve{X}^{j_t}_T)^2}{(\breve{\sigma}^{j_t})^2}}\right]
\overset{(a)}{\le} \sqrt{\mathbb{E}\big[(\breve{X}^i_T)^4\big]\;\mathbb{E}\Big[\log\sum_{t=1}^{T-1}\frac{(\breve{X}^{j_t}_T)^2}{(\breve{\sigma}^{j_t})^2}\Big]}
\overset{(b)}{\le} \sqrt{\mathbb{E}\big[(\breve{X}^i_T)^4\big]\;\log\mathbb{E}\Big[\sum_{t=1}^{T-1}\sum_{i\in N}\frac{(\breve{X}^i_T)^2}{(\breve{\sigma}^i)^2}\Big]}
\overset{(c)}{\le} \tilde{O}\big((\breve{\sigma}^i)^2\big), \tag{115}
\]
where (a) follows from Cauchy-Schwarz, (b) follows from Jensen's inequality, and (c) follows from Lemma .9. The result then follows from substituting (115) in (114).

3) Bounding \(\breve{R}^i_2(T)\): As in [70], we can bound the inner summand in \(\breve{R}^i_2(T)\) as
\[
(\breve{\theta}^{\intercal}\breve{z}^i_t)^{\intercal}\breve{S}_k(\breve{\theta}^{\intercal}\breve{z}^i_t) - (\breve{\theta}_k^{\intercal}\breve{z}^i_t)^{\intercal}\breve{S}_k(\breve{\theta}_k^{\intercal}\breve{z}^i_t) \le O\big(\breve{X}^i_T\|(\breve{\theta}-\breve{\theta}_k)^{\intercal}\breve{z}^i_t\|\big).
\]
Therefore,
\[
\breve{R}^i_2(T) \le O\left(\mathbb{E}\Big[\breve{X}^i_T\sum_{k=1}^{\breve{K}_T}\sum_{t=\breve{t}_k}^{\breve{t}_{k+1}-1}\|(\breve{\theta}-\breve{\theta}_k)^{\intercal}\breve{z}^i_t\|\Big]\right). \tag{116}
\]
The term inside \(O(\cdot)\) can be written as
\[
\mathbb{E}\Big[\breve{X}^i_T\sum_{k=1}^{\breve{K}_T}\sum_{t=\breve{t}_k}^{\breve{t}_{k+1}-1}\|(\breve{\theta}-\breve{\theta}_k)^{\intercal}\breve{z}^i_t\|\Big]
= \mathbb{E}\Big[\breve{X}^i_T\sum_{k=1}^{\breve{K}_T}\sum_{t=\breve{t}_k}^{\breve{t}_{k+1}-1}\|(\breve{\Sigma}^{-0.5}_{t_k}(\breve{\theta}-\breve{\theta}_k))^{\intercal}\breve{\Sigma}^{0.5}_{t_k}\breve{z}^i_t\|\Big]
\le \mathbb{E}\Big[\sum_{k=1}^{\breve{K}_T}\sum_{t=\breve{t}_k}^{\breve{t}_{k+1}-1}\|\breve{\Sigma}^{-0.5}_{t_k}(\breve{\theta}-\breve{\theta}_k)\|\cdot\breve{X}^i_T\|\breve{\Sigma}^{0.5}_{t_k}\breve{z}^i_t\|\Big]
\le \sqrt{\mathbb{E}\Big[\sum_{k=1}^{\breve{K}_T}\sum_{t=\breve{t}_k}^{\breve{t}_{k+1}-1}\|\breve{\Sigma}^{-0.5}_{t_k}(\breve{\theta}-\breve{\theta}_k)\|^2\Big]}\times\sqrt{\mathbb{E}\Big[\sum_{k=1}^{\breve{K}_T}\sum_{t=\breve{t}_k}^{\breve{t}_{k+1}-1}(\breve{X}^i_T)^2\|\breve{\Sigma}^{0.5}_{t_k}\breve{z}^i_t\|^2\Big]}, \tag{117}
\]
where the last inequality follows from the Cauchy-Schwarz inequality. Following the same argument as [103, Lemma 7], the first part of (117) is bounded by
\[
\mathbb{E}\Big[\sum_{k=1}^{\breve{K}_T}\sum_{t=\breve{t}_k}^{\breve{t}_{k+1}-1}\|\breve{\Sigma}^{-0.5}_{t_k}(\breve{\theta}-\breve{\theta}_k)\|^2\Big] \le O\big(d_x(d_x+d_u)T\big). \tag{118}
\]
For the second part of the bound in (117), we follow the same argument as [103, Lemma 8]. Recall that \(\breve{\lambda}_{\min}\) is the smallest eigenvalue of \(\breve{\Sigma}^{-1}_1\). Therefore, by (6.28b), all eigenvalues of \(\breve{\Sigma}^{-1}_t\) are no smaller than \(\breve{\lambda}_{\min}\), or, equivalently, all eigenvalues of \(\breve{\Sigma}_t\) are no larger than \(1/\breve{\lambda}_{\min}\). Using [58, Lemma 11], we can show that for any \(t \in \{\breve{t}_k, \dots, \breve{t}_{k+1}-1\}\),
\[
\|\breve{\Sigma}^{0.5}_{t_k}\breve{z}^i_t\|^2 = (\breve{z}^i_t)^{\intercal}\breve{\Sigma}_{t_k}\breve{z}^i_t
\le \frac{\det\breve{\Sigma}^{-1}_t}{\det\breve{\Sigma}^{-1}_{t_k}}\,(\breve{z}^i_t)^{\intercal}\breve{\Sigma}_t\breve{z}^i_t
\le F_1(\breve{X}^i_T)\,(\breve{z}^i_t)^{\intercal}\breve{\Sigma}_t\breve{z}^i_t, \tag{119}
\]
where \(F_1(\breve{X}^i_T) = \big(1 + \breve{M}^2_G(\breve{X}^i_T)^2/(\breve{\lambda}_{\min}\breve{\sigma}^2_w)\big)^{\breve{T}_{\min}\vee 1}\) and the last inequality follows from [103, Lemma 10]. Moreover, since all eigenvalues of \(\breve{\Sigma}_t\) are no larger than \(1/\breve{\lambda}_{\min}\), we have \((\breve{z}^i_t)^{\intercal}\breve{\Sigma}_t\breve{z}^i_t \le \|\breve{z}^i_t\|^2/\breve{\lambda}_{\min} \le \breve{M}^2_G(\breve{X}^i_T)^2/\breve{\lambda}_{\min}\). Therefore,
\[
(\breve{z}^i_t)^{\intercal}\breve{\Sigma}_t\breve{z}^i_t
\le \Big((\breve{\sigma}^i)^2 \vee \frac{\breve{M}^2_G(\breve{X}^i_T)^2}{\breve{\lambda}_{\min}}\Big)\Big(1 \wedge \frac{(\breve{z}^i_t)^{\intercal}\breve{\Sigma}_t\breve{z}^i_t}{(\breve{\sigma}^i)^2}\Big)
\le \Big((\breve{\sigma}^i)^2 + \frac{\breve{M}^2_G(\breve{X}^i_T)^2}{\breve{\lambda}_{\min}}\Big)\Big(1 \wedge \frac{(\breve{z}^{j_t}_t)^{\intercal}\breve{\Sigma}_t\breve{z}^{j_t}_t}{(\breve{\sigma}^{j_t})^2}\Big), \tag{120}
\]
where the last inequality follows from the definition of \(j_t\). Let \(F_2(\breve{X}^i_T) = (\breve{\sigma}^i)^2 + \breve{M}^2_G(\breve{X}^i_T)^2/\breve{\lambda}_{\min}\). Then,
\[
\sum_{t=1}^T(\breve{z}^i_t)^{\intercal}\breve{\Sigma}_t\breve{z}^i_t
\le F_2(\breve{X}^i_T)\sum_{t=1}^T\Big(1 \wedge \frac{(\breve{z}^{j_t}_t)^{\intercal}\breve{\Sigma}_t\breve{z}^{j_t}_t}{(\breve{\sigma}^{j_t})^2}\Big)
\overset{(a)}{\le} F_2(\breve{X}^i_T)\Big(2d\log\frac{\operatorname{Tr}(\breve{\Sigma}^{-1}_{T+1})}{d} - \log\det\breve{\Sigma}^{-1}_1\Big)
\overset{(b)}{\le} F_2(\breve{X}^i_T)\Big(2d\log\Big(\frac{1}{d}\Big(\operatorname{Tr}(\breve{\Sigma}^{-1}_1) + \breve{M}^2_G\sum_{t=1}^T\frac{(\breve{X}^{j_t}_T)^2}{(\breve{\sigma}^{j_t})^2}\Big)\Big) - \log\det\breve{\Sigma}^{-1}_1\Big), \tag{121}
\]
where \(d = d_x + d_u\), (a) follows from (6.28b) and the intermediate step in the proof of [107, Lemma 6], and (b) follows from (112) and the subsequent discussion. Using (119) and (121), we can bound the second term of (117) as follows:
\[
\mathbb{E}\Big[\sum_{t=1}^T(\breve{X}^i_T)^2\|\breve{\Sigma}^{0.5}_{t_k}\breve{z}^i_t\|^2\Big]
\le O\left(d\,\mathbb{E}\Big[F_1(\breve{X}^i_T)\,F_2(\breve{X}^i_T)\,(\breve{X}^i_T)^2\log\sum_{t=1}^T\frac{(\breve{X}^i_T)^2}{(\breve{\sigma}^i)^2}\Big]\right)
\le O\left(d\,(\breve{\sigma}^i)^4\,\mathbb{E}\Big[F_1(\breve{X}^i_T)\,\frac{F_2(\breve{X}^i_T)}{(\breve{\sigma}^i)^2}\,\frac{(\breve{X}^i_T)^2}{(\breve{\sigma}^i)^2}\log\sum_{t=1}^T\frac{(\breve{X}^i_T)^2}{(\breve{\sigma}^i)^2}\Big]\right)
\le \tilde{O}\big(d\,(\breve{\sigma}^i)^4\big), \tag{122}
\]
where the last inequality follows by observing that \(F_1(\breve{X}^i_T)\,\frac{F_2(\breve{X}^i_T)}{(\breve{\sigma}^i)^2}\,\frac{(\breve{X}^i_T)^2}{(\breve{\sigma}^i)^2}\log\sum_{t=1}^T\frac{(\breve{X}^i_T)^2}{(\breve{\sigma}^i)^2}\) is a polynomial in \(\breve{X}^i_T/\breve{\sigma}^i\) multiplied by \(\log\big(\sum_{t=1}^T(\breve{X}^i_T)^2/(\breve{\sigma}^i)^2\big)\), and using Lemma .10. The result then follows by substituting (118) and (122) in (117).
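Both appendices analyze variants of Thompson sampling with dynamic episodes: at the start of each episode the learner samples parameters from its posterior, plays the corresponding certainty-equivalent linear feedback for that episode, and updates a recursive least-squares posterior as in (5.18b)/(6.28b). A heavily simplified scalar sketch of this loop is below; the prior, the deadbeat gain used in place of the Riccati-based gain \(\breve{L}(\breve{\theta}_k)\), and all constants are illustrative placeholders, not the algorithm's actual subroutines.

```python
import numpy as np

rng = np.random.default_rng(3)
a_true, b_true, sigma_w = 0.8, 1.0, 0.1   # unknown scalar dynamics x' = a x + b u + w
T = 500

mu = np.array([0.5, 1.0])                 # prior mean over theta = (a, b) (illustrative)
Lam = np.eye(2)                           # prior precision
V = Lam @ mu                              # information vector, so that mu = Lam^{-1} V
x, L = 0.0, 0.0
prev_len, cur_len = 0, 0
det_start = np.linalg.det(Lam)

for t in range(T):
    # start a new episode on either TSDE stopping criterion
    if t == 0 or cur_len > prev_len or np.linalg.det(Lam) > 2 * det_start:
        a_s, b_s = rng.multivariate_normal(mu, np.linalg.inv(Lam))  # posterior sample
        # placeholder gain: deadbeat for the *sampled* model, standing in for the
        # Riccati-based gain of the actual algorithm
        L = a_s / b_s if abs(b_s) > 0.1 else 0.0
        prev_len, cur_len = cur_len, 0
        det_start = np.linalg.det(Lam)
    u = -L * x                             # certainty-equivalent linear feedback
    z = np.array([x, u])
    x = a_true * x + b_true * u + sigma_w * rng.normal()
    # recursive least-squares posterior update, cf. (5.18b)/(6.28b)
    Lam += np.outer(z, z) / sigma_w**2
    V += z * x / sigma_w**2
    mu = np.linalg.solve(Lam, V)
    cur_len += 1

print("posterior mean for (a, b):", mu)
```

Note that within an episode the regressor \(z_t = (x_t, u_t)\) is collinear (since \(u_t = -L x_t\)), so the parameters are only gradually identified as the sampled gain changes across episodes; this is the same regressor structure that appears in the bounds (112) and (121).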
Asset Metadata
Creator: Sudhakara, Sagar (author)
Core Title: Sequential Decision Making and Learning in Multi-Agent Networked Systems
School: Viterbi School of Engineering
Degree: Doctor of Philosophy
Degree Program: Electrical Engineering
Degree Conferral Date: 2024-05
Publication Date: 05/17/2024
Defense Date: 05/06/2024
Publisher: Los Angeles, California (original); University of Southern California (original); University of Southern California. Libraries (digital)
Tags: Common information approach, Multi-agent setting, OAI-PMH Harvest, online learning, sequential decision making, Stochastic zero-sum game
Format: theses (aat)
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisor: Nayyar, Ashutosh (committee chair); Nuzzo, Pierluigi (committee member); Savla, Ketan (committee member)
Creator Email: sagarsud@usc.edu
Permanent Link (DOI): https://doi.org/10.25549/usctheses-oUC113939995
Unique identifier: UC113939995
Identifier: etd-SudhakaraS-12951.pdf (filename)
Legacy Identifier: etd-SudhakaraS-12951
Document Type: Thesis
Rights: Sudhakara, Sagar
Internet Media Type: application/pdf
Type: texts
Source: 20240517-usctheses-batch-1154 (batch); University of Southern California (contributing entity); University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright.
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email: cisadmin@lib.usc.edu