LEARNING AND CONTROL IN DECENTRALIZED STOCHASTIC SYSTEMS

BY

NAUMAAN NAYYAR

DISSERTATION

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Electrical Engineering in the Viterbi School of Engineering of the University of Southern California, December 2015

Doctoral Committee:
Professor Rahul Jain, Chair
Professor Yan Liu
Professor Ashutosh Nayyar

Abstract

Decentralized stochastic systems are becoming increasingly prevalent due to advancements and proliferation of small-scale technology. With complex systems being implemented on smaller and more interconnected entities, decision-making in such environments necessarily becomes decentralized. Learning and control are closely related problems, with systems usually employing them together in decision-making. Decentralized versions of these problems crop up in numerous practical applications, including in communication networks, sensor networks, formation flight, power distribution grids, personalized medicine and cloud computing. My work focuses on two problems in decentralized stochastic systems, both having communication constraints: one in the realm of control and the other in online learning.

In decentralized control, we look at classes of interconnected systems where communication constraints manifest themselves in the form of delays in information sharing. We specifically look at a two-player control problem with different kinds of communication constraints and develop an approach that finds the optimal control law for them. We believe our approach is general enough to be extended to other forms of delayed interconnections as well.

The second problem that we tackle is in online learning for multiple players where communication is costly. We formulate a multi-player multi-armed bandit model and, through a series of results, show that near-optimal performance can be achieved by multiple players with limited communication.
We propose two new policies, dE^3 and dE^3-TS, that achieve this performance. Prior work in this area included policies with sub-linear regret, but not the order-optimality that we have been able to show with our new policies. An application of learning to schedule in healthcare operations is also considered, where we study a problem that traditionally has decentralized elements and show the advantages of treating it as a system with uncertainty.

To my mom, Hamida Afshan Nayyar, dad, Mohammed Ali Nayyar, and khalujan, to whom I owe an eternal debt of gratitude

Contents

List of Figures
1 Introduction
  1.1 Motivation
  1.2 Decentralized Optimal Control in Interconnected Systems with Communication Delays
    1.2.1 Interconnected systems with Communication Delays: Nested systems
    1.2.2 Existing Work
    1.2.3 Our Contributions
  1.3 Decentralized Online Learning in Multi-armed Bandits
    1.3.1 Multiplayer Multi-Armed Bandits
    1.3.2 Existing Work
    1.3.3 Our Contributions
  1.4 Learning in Restless Multi-armed Bandits
    1.4.1 Restless MABs
    1.4.2 Existing Work
    1.4.3 Our Contributions
  1.5 Scheduling in decentralized systems with uncertainty: A hospital operating room application
    1.5.1 Existing Work
    1.5.2 Our Contributions
  1.6 Published Work
2 Decentralized Control in Stochastic Systems
  2.1 Introduction
  2.2 The (1, 0) information sharing pattern
    2.2.1 Linearity of the optimal control law
    2.2.2 Derivation of the Optimal Control Law
    2.2.3 The Output-feedback problem: Optimal Control law
  2.3 The (1, ∞) information sharing pattern
    2.3.1 Linearity of the optimal control law
    2.3.2 Summary Statistic to Compute Optimal Control Law
    2.3.3 Deriving the Optimal Gain Matrix
    2.3.4 The output-feedback case
  2.4 Proofs of Theorems and Lemmas
  2.5 Future Work
3 Decentralized Learning in Multiarmed Bandit Problems
  3.1 Introduction
  3.2 Single-player bandit problems
    3.2.1 Model and Problem Formulation
    3.2.2 UCB1, Thompson Sampling and UCB4
    3.2.3 E^3 and E^3-TS
    3.2.4 Simulations
  3.3 Multiplayer bandit problems
    3.3.1 Model and Problem Formulation
    3.3.2 dUCB4
    3.3.3 dE^3 and dE^3-TS
    3.3.4 Simulations
  3.4 Future Work
4 Learning in Restless Multi-armed Bandits
  4.1 Introduction
  4.2 Model
  4.3 Structure of the optimal policy
    4.3.1 Optimal Value Function
    4.3.2 Properties of optimal value function
    4.3.3 Characterization of Optimal Decision Region
  4.4 Countable Policy Representation
  4.5 The non-Bayesian case
    4.5.1 Mapping to infinite-armed MAB
    4.5.2 MAB with countably infinite i.i.d. arms
    4.5.3 A variant policy for countably infinite i.i.d. arms
    4.5.4 Online Learning Policy for non-Bayesian RMAB with 2 positively correlated channels
    4.5.5 Analysis of R2PC
  4.6 Conclusion
5 Scheduling in decentralized systems with uncertainty: A hospital operating room application
  5.1 Learning to schedule in hospital operations data
    5.1.1 Introduction
    5.1.2 Data and Current Scheduling Policy
  5.2 Understanding the data
  5.3 Problem Formulation
  5.4 Simulations
    5.4.1 Notes about implementation
    5.4.2 Simulation results
6 Bibliography

List of Figures

2.1 Information communication for one-step delayed information-sharing in two-player D-LTI systems with nested structures
2.2 Information communication for one-step delayed information-sharing in two-player D-LTI systems with nested structures
3.1 Figure showing growth of cumulative regret of the UCB1, Thompson Sampling and UCB4 algorithms for a four-armed bandit problem with true means [0.1, 0.5, 0.6, 0.9]
3.2 Figure showing growth of cumulative regret of the UCB4 algorithm with a fixed per unit computation cost for a four-armed bandit problem with true means [0.1, 0.5, 0.6, 0.9]
3.3 Figure showing growth of cumulative regret of the E^3 and UCB4 algorithms for a four-armed bandit problem with true means [0.1, 0.5, 0.6, 0.9] (no computation cost)
3.4 Figure showing growth of cumulative regret of the E^3, E^3-TS and UCB1 algorithms for a four-armed single-player bandit problem with true means [0.1, 0.5, 0.6, 0.9] (no computation cost), with time plotted on log scale
3.5 Figure showing growth of cumulative regret of the dE^3 and dE^3-TS algorithms for a three-player, three-armed bandit setting with true means [0.2 0.25 0.3; 0.4 0.6 0.5; 0.7 0.9 0.8] (communication cost included), with time plotted on log scale
4.1 Examples of possible decision regions satisfying the optimal structure
4.2 Example of an optimal decision region to illustrate the countable policy representation
5.1 A model for outpatient scheduling
5.2 A typical scenario in hospital patient flow
5.3 A typical scenario in hospital patient flow
5.4 A histogram showing distribution of surgical cases by specialty at Keck Hospital
5.5 A histogram showing distribution of delay in wheeling-in patients
5.6 A scatter plot showing correlation between actual and scheduled wheels-in times
5.7 A histogram plot showing number of records by reason for delay (> 60) at Keck Hospital
5.8 A histogram of the number of records for delay due to surgeon performing another case by specialty
5.9 A schedule showing an overbooked surgeon. He performs two surgeries in parallel
5.10 A histogram showing distribution of times in OR
5.11 A histogram showing distribution of deviation between actual and scheduled times in OR
5.12 Comparison of scheduling policies
5.13 Throughput v/s idle time
5.14 Throughput v/s overtime
5.15 Throughput v/s waiting time
5.16 Idle time v/s predicted
5.17 Overtime v/s predicted

Acknowledgements

What a journey my period of doctoral studies has been! I remember the timid forays I made into the world of research as a wide-eyed neophyte: every area to me was teeming with untold problems waiting to be solved. As I look back to that time, I recall the sense of bewilderment and a vague lack of confidence in my skills to tackle these challenges. I am thankful to say that those feelings have been replaced now by a well-earned confidence in my ability to tackle anything that comes my way.

A lot of credit and gratitude for my success goes, first and foremost, to my advisor and mentor, Rahul Jain. His guidance in all my academic and professional endeavors went a very long way to making me the researcher I am today. I especially appreciate the considerable freedom he gave me in exploring diverse areas, which proved to be a very valuable experience. From putting in late nights to taking conference calls on vacations, Rahul has been the epitome of accessibility and support during my PhD. I could not have asked for a better or more interested person to guide me through this phase.

I would also like to thank Bhaskar Krishnamachari for his very insightful research lessons at the start of my PhD. It was a pleasure working with him and learning from him. I'm also very grateful to him and Yan Liu for acting as my referees and being a part of my exam committees. I owe a lot of my professional success to them. For tremendously useful discussions, I am grateful to Ashutosh Nayyar, my namesake, who had to put up with mix-ups in our package deliveries on many occasions.
In addition, I would like to thank Suvrajeet Sen and Edmond Jonckheere for being a part of my qualifying committee and giving feedback on my work. I also enjoyed my discussions with Jay Bartroff, Shanghua Teng and Jianfeng Zhang.

Any acknowledgement would be incomplete without expressing my gratitude to my labmate and long-time collaborator, Dileep Kalathil. Dileep's zeal for research motivated me on several occasions to push my own boundaries. I also enjoyed my enlightening discussions with my labmates, Harsha Honnappa, Hiteshi Sharma, Mehdi Jafarnia and Wenyuan Tang. Hiteshi and Mehdi: I really wish I were able to spend more time with them and see them bloom into excellent researchers.

I would like to thank the advisors and administrators I've had close ties with over the years, including Annie Yu, Diane Demetras and Tracy Charles. They ensured that my academic journey was as smooth as could be. I'd also like to thank the people who made my social life a joy through their friendships, discussions and support: my best friends Abdul Rahman Sher (thank you for being someone I can always turn to!), Pavan Srinadh, Mayank Manjrekar, Zameel Haris, Fangda Wang, Khalid Aldaham, Prashanth Harshangi, Hoda El-Safty, Engie Salama, Laith Shalalfeh and Abdullah Alfarrarjeh. The latter two, I'm also grateful to for two wonderful years of being my roommates and introducing me to an exotic cuisine that is now my staple. There are also many others I'd like to acknowledge in the Muslim Student Union, Ansar Service Partnership, and my Arab, Indian and Iranian soccer and volleyball teams.

To my fiancée and partner-in-all-things, Sameera Khan, I am grateful for being by my side at all times and making my experience joyful. I eagerly look forward to our journey through life together!
Lastly, but perhaps all-importantly, to my Mom and Dad, and my sisters, Afreen and Nisma: for always being there for me, for supporting me, for understanding me, for motivating me, for just being my wonderful family, I owe them my everlasting gratitude and thanks.

Chapter 1

Introduction

Decentralized stochastic systems are becoming increasingly prevalent due to advancements in technology and data analysis. Simply put, a stochastic system is one whose properties, at least in some part, are not deterministic. Such systems play a fundamental role in modelling phenomena in many fields of science, engineering, medicine, and business. With complex systems being implemented on smaller and more interconnected entities, learning and control in such environments necessarily become decentralized ("distributed").

Learning and control have had close ties for a very long time, either by working together to address a larger problem, or through one being used to address problems in the other's field. For instance, many problems in operations research first involve learning a system's properties, followed by optimization or control to reach desired objectives. This is an example of the former scenario. Examples of the latter are methodologies taken from control theory to tackle problems in reinforcement learning, where actions taken by the decision-maker can affect the state of the system. Decentralized learning and control problems have numerous practical applications, including in communication networks, sensor networks, formation flight, power distribution grids, personalized medicine and cloud computing.

While decentralized control problems have been around for a long time in the literature, their difficulty has led to several open problems in optimal control of interconnected systems. Many questions in this topic have only been answered in the last decade or so, and a lot of work still remains to be done in arriving at a unified theory of decentralized control in interconnected systems.
Decentralized learning, especially in online settings, is a nascent field, and only recently has significant progress been made in the area. The growth in sensor networks, personalized medicine and cognitive radio will spur the need for such online learning methodologies due to their inherent distributed nature.

Tackling these two closely related problems, and demonstrating the effects of decentralization in a specific application of hospital operating room scheduling, is the focus of my dissertation. Real-world topics in engineering, business and medicine are the primary areas to which this work can be applied. To motivate and illustrate applications of my work, the next section describes a few problems that fall within its scope.

1.1 Motivation

Formation flight: Drones are increasingly being used in hostile environments to reduce costs and losses to human life. Situations arise where multiple drones may have to fly in formation while performing their tasks. Classical control theory assumes that individual controllers have access to the same information and measurements. Clearly, this is not the case in this situation. Solving this problem requires control of subsystems with partial information that can be communicated with delay to other controllers.

Opportunistic Spectrum Access (OSA): Current regulations governing spectrum allocation result in large portions of unused spectrum, both spatially and temporally. OSA is a means to address this issue by having users (called secondary users in a cognitive radio framework) choose from multiple channels based on their availability and quality. Each user estimates the quality of channels available to him/her and chooses the best channel. However, with multiple users doing the same thing, and no user being able to know the information obtained by another user without being told explicitly, interferences are guaranteed to occur, and seamless communication cannot happen.
Solving this problem requires multiple users to be able to learn channel characteristics for themselves and to allocate themselves to these channels in a socially optimal manner.

Task scheduling: With the advent of cloud computing, computational tasks from many places are being submitted to processors across the globe. An entity allocates a task to a particular server and the task joins the corresponding queue. If the performance characteristics of the server are initially unknown, the allocator needs to learn them and allocate the tasks in a manner that minimizes cost. Solving this problem requires a type of scheduling where the task is not served immediately, but rather after reaching its turn in the queue.

These scenarios motivate the three principal questions that my work tries to answer:

Q1. Can we design decentralized optimal control policies for classes of interconnected systems? If yes, is there an underlying unifying mechanism for this design?

Q2. Can we design decentralized online learning policies that achieve global objectives?

Q3. Application: Can we design scheduling algorithms that leverage data to schedule tasks when case duration distribution characteristics are unknown?

The following introductory sections describe these problems in a little more detail.

1.2 Decentralized Optimal Control in Interconnected Systems with Communication Delays

Decentralized control problems come in several flavors. Relevant work in this area is discussed in the section on Existing Work. The following section describes the specific variant that our work focuses on.

1.2.1 Interconnected systems with Communication Delays: Nested systems

Nested systems are systems where the information flows in one direction, outwardly. In subsequent discussion, we will assume all systems to be discrete linear time-invariant (LTI) systems. Analysis of nested discrete LTI systems is simplified because they are a part of a larger class of problems having a partially nested structure.
Subsystems are associated with controllers, and each controller has access to information from its associated subsystem. Controllers share information subject to constraints in the form of communication delays. The information known to the controllers is called the information structure of the problem. The objective of the controllers is to find a decentralized control law u(t) that minimizes a global objective,

E\left[\sum_{t=0}^{N-1} \left(x(t)' Q x(t) + u(t)' R u(t)\right) + x(N)' S x(N)\right],   (1.1)

where Q, R and S are positive semi-definite matrices. This is analogous to the LQG problem in classical control theory.

1.2.2 Existing Work

These problems were first formulated by Marschak in the 1950s [35] as team decision problems, and further studied by Radner [38], though in such problems communication between the controllers was usually ignored.

In a celebrated paper [53], Witsenhausen showed that even for seemingly simple systems with communication between controllers but with multi-unit delays, non-linear controllers could outperform any linear controller. Witsenhausen also consolidated and conjectured results on separation of estimation and control in decentralized problems in [54]. However, the structure of decentralized optimal controllers for LQG systems with time-delays has been hard to identify. Indeed, in [51] it was proved that the centralized-like separation principle that was conjectured for delayed systems does not hold for a system having a delay of more than one timestep. The more general delayed sharing pattern was only recently solved by Nayyar et al. in [37]. It remains an open problem to actually compute the optimal decentralized control law even using such a structure result.

However, not all results for such problems have been negative. Building on results of Radner on team decision theory, Ho and Chu [18] showed that for a unit-delay information sharing pattern, the optimal controller is linear.
This was used by Kurtaran and Sivan [22], and others [40, 55], to derive an optimal controller using dynamic programming for the finite-horizon case. Unfortunately, the results do not extend to multi-unit delayed sharing patterns. This is because the former are examples of systems with partially nested information structure, for which linear optimal control laws are known to exist [18]. A criterion for determining convexity of optimal control problems, funnel causality, was presented in [?]. Recently, another characterization, quadratic invariance, was discovered under which optimal decentralized control laws can be found by convex programming [39]. This has led to a resurgence of interest in decentralized optimal control problems, which since the 1970s had been assumed to be intractable. Subsequently, in a series of papers, Lall, Lessard and others have computed the optimal (linear) control laws for a suite of decentralized nested LQG systems involving no communication delays [45, 46, 30]. More general networked structures that dealt with state-feedback delayed information sharing in networked control with sparsity and multiple players were also looked at in [27, ?], wherein recursive solutions for the optimal control laws were derived. Output feedback without delays in a network was considered in [36].

For a subclass of quadratic invariant problems known as poset-causal problems, Parrilo and Shah [41] showed that the computation of the optimal law becomes easier by lending itself to decomposition into subproblems. Solutions to certain cases of the state-feedback delayed information sharing problem have been presented by Lamperski et al. [25, 26], where a networked graph structure of the controllers (strongly connected) is considered, with constraints on system dynamics.

In this work, we restrict our attention to two-player systems. A summary of results pertaining to two-player decentralized controller systems is given in Table 1.1.

d12  | d21  | Literature       | Comments
0    | 0    | Classical        | no plant restrictions
1    | 1    | [22], [40], [55] | no plant restrictions
0/∞  | 0/∞  | [25, 26]         | b.d. B matrix, infinite horizon
0    | ∞    | [45, 46, 30]     | l.b.t. matrix
1    | ∞    | [27]             | l.b.t. matrix, state f/b, u.c.
∞    | ∞    | [29]             | b.d. dynamics, state f/b
1    | 0    | our work         | no plant restrictions
1    | ∞    | our work         | l.b.t. matrix

Table 1.1: Summary of results for some information sharing patterns with two players.

In Table 1.1, d12 is the delay in information transmission from player 1 to player 2, and vice versa. 'u.c.', 'b.d.' and 'l.b.t.' refer to uncoupled noise, block diagonal and lower block triangular matrices respectively. Standard conditions for stability and controllability of the system are assumed to hold in all cases.

1.2.3 Our Contributions

We consider a coupled two-player discrete linear time-invariant (LTI) system with a nested structure. Player 1's actions affect only block 1 of the plant, whereas player 2's actions affect both blocks 1 and 2.
Formally, the system dynamics are

\begin{bmatrix} x_1(t+1) \\ x_2(t+1) \end{bmatrix} =
\begin{bmatrix} A_{11} & 0 \\ A_{21} & A_{22} \end{bmatrix}
\begin{bmatrix} x_1(t) \\ x_2(t) \end{bmatrix} +
\begin{bmatrix} B_{11} & 0 \\ B_{21} & B_{22} \end{bmatrix}
\begin{bmatrix} u_1(t) \\ u_2(t) \end{bmatrix} +
\begin{bmatrix} v_1(t) \\ v_2(t) \end{bmatrix},
\qquad y(t) = C x(t) + w(t),   (1.2)

for t ∈ {0, 1, …, N−1}. Henceforth, we will use the notation x(t+1) = A x(t) + B u(t) + v(t) where the definition is clear from the context. The initial state x(0) is assumed to be zero-mean Gaussian, independent of the system noise v(t), and v(t) is zero-mean Gaussian with covariance V, independent across time t. The two players have a team/common objective: to find the decentralized control law u(t) that minimizes

E\left[\sum_{t=0}^{N-1} \left(x(t)' Q x(t) + u(t)' R u(t)\right) + x(N)' S x(N)\right],   (1.3)

where Q, R and S are positive semi-definite matrices. The information structure for the problem is

F_1(t) = \{y_1(0{:}t),\, u_1(0{:}t-1)\},
F_2(t) = \{y_1(0{:}t-1),\, u_1(0{:}t-1),\, y_2(0{:}t),\, u_2(0{:}t-1)\},   (1.4)

which we refer to as the (1, ∞) pattern. The (1, 0) pattern that is considered in this work as a preliminary result is similarly defined. In our work, we have derived the optimal control laws for the state-feedback cases of the (1, 0) and (1, ∞) delayed sharing patterns. We also extended this to full output feedback for both patterns.

1.3 Decentralized Online Learning in Multi-armed Bandits

Classical multi-armed bandits (MABs) have been studied extensively in the literature [10]. These models capture the essence of a learning problem wherein the learner must trade off between exploiting what has been learnt, and exploring more. The model can be understood through a simple game of choosing between two coins with unknown biases. The coins are tossed repeatedly and one of them is chosen at each instant. If at a given instant the chosen coin turns up heads, we get a reward of $1; otherwise we get no reward. It is known that one of the two coins has a better bias, but the identity of the coin is not known.
The question is: what is the optimal 'learning' policy that helps maximize the expected reward, i.e., to discover which coin has the better bias and at the same time maximize the cumulative reward as the game is played?

Bandit models represent an exploration-versus-exploitation trade-off, where the player must choose between exploring the environment to find better options, and exploiting based on her current knowledge. These models are widely applicable in display advertisements, sensor networks, route planning, spectrum sharing, etc. While single-player MABs have been studied extensively in the literature, multiplayer MABs have arisen only recently, and we present an introduction to them in the following section.

1.3.1 Multiplayer Multi-Armed Bandits

We consider an N-armed bandit with M players. At each instant t, each player plays an arm. There is no dedicated communication channel for coordination among the players. However, we do allow players to communicate with each other, either by sending packets over a predetermined channel, or by pulling arms in a specific order. For example, a player could communicate a bit sequence to other players by picking an arm to indicate 'bit 1' and any other arm for 'bit 0'. These communication overheads add to regret and affect the learning rate.

Potentially more than one player can pick the same arm at the same instant. If this happens, we regard it as a collision, and no player choosing that arm gets any reward. We will assume that player i playing arm k at time t obtains an i.i.d. reward S_{ik}(t) with density function f_{ik}(s), which will be assumed to have bounded support and, without loss of generality, taken to be in [0, 1]. Let \mu_{i,k} denote the mean of S_{ik}(t) with respect to the pdf f_{ik}(s). A non-Bayesian setting is considered: we assume that players have no information about the means, distributions or any other statistics of the rewards from various arms other than what they observe while playing.
We also assume that each player can only observe the rewards that he/she gets. If there is a collision on an arm, we assume that all players that chose the arm get zero reward. This can be relaxed so that players share the reward in some manner, though the results do not change appreciably.

Let X_{ij}(t) be the reward that player i gets from arm j at time t. Thus, if player i plays arm k at time t (and there is no collision), X_{ik}(t) = S_{ik}(t), and X_{ij}(t) = 0 for j ≠ k. Denote the action of player i at time t by a_i(t) ∈ A := {1, …, N}. Then, the history seen by player i at time t is H_i(t) = \{(a_i(1), X_{i,a_i(1)}(1)), \ldots, (a_i(t), X_{i,a_i(t)}(t))\} with H_i(0) = ∅. A policy \pi_i = (\pi_i(t))_{t=1}^{\infty} for player i is a sequence of maps \pi_i(t) : H_i(t) \to A that specifies the arm to be played at time t given the history seen by the player. Let P(N) be the set of vectors

P(N) := \{a = (a_1, \ldots, a_M) : a_i \in A,\; a_i \neq a_j \text{ for } i \neq j\}.

The players have a team objective: over a time horizon T, they want to maximize the expected sum of rewards E[\sum_{t=1}^{T} \sum_{i=1}^{M} X_{i,a_i(t)}(t)]. If the parameters \mu_{i,j} are known, this could easily be achieved by picking a bipartite matching,

k^* \in \arg\max_{k \in P(N)} \sum_{i=1}^{M} \mu_{i,k_i},   (1.5)

i.e., the optimal bipartite matching with mean arm rewards. Note that this may not be unique. Since the expected rewards \mu_{i,j} are unknown, players must pick learning policies that minimize the expected regret, defined for policies \pi = (\pi_i, 1 \le i \le M) as

R_\pi(T) = T \sum_{i} \mu_{i,k^*_i} - E\left[\sum_{t=1}^{T} \sum_{i=1}^{M} X_{i,\pi_i(t)}(t)\right].   (1.6)

To make the problem more realistic, we include index computation and communication cost terms in the regret as well. Distributed algorithms for bipartite matching are known [8, 56], which determine an ε-optimal matching with a 'minimum' amount of information exchange and computation.
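As a point of reference for (1.5), the optimal matching itself is straightforward to compute centrally once the means are known. Below is a minimal brute-force sketch in Python, using the 3x3 mean-reward matrix from the multiplayer simulations of Chapter 3 as a hypothetical input; the decentralized policies discussed in this work must instead learn these means and coordinate without a central solver:

```python
from itertools import permutations

# Mean arm rewards mu[i][k] for M = 3 players and N = 3 arms
# (the 3x3 example used in the multiplayer simulations of Chapter 3).
mu = [[0.2, 0.25, 0.3],
      [0.4, 0.6,  0.5],
      [0.7, 0.9,  0.8]]
M, N = len(mu), len(mu[0])

def optimal_matching(mu):
    """Brute-force search over P(N): assignments of distinct arms to
    players, maximizing the sum of mean rewards as in (1.5)."""
    best_value, best_k = float("-inf"), None
    for k in permutations(range(N), M):      # k[i] = arm played by player i
        value = sum(mu[i][k[i]] for i in range(M))
        if value > best_value:
            best_value, best_k = value, k
    return best_k, best_value

k_star, v_star = optimal_matching(mu)        # one maximizer: (0, 1, 2), value 1.6
```

Enumerating P(N) is exponential in M; practical implementations would use the Hungarian method or the distributed ε-optimal algorithms cited above. In exact arithmetic several matchings here tie at 1.6, illustrating that k* need not be unique.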
However, every run of a distributed bipartite matching algorithm incurs a cost due to the computation and communication needed to exchange information for decentralization. Let C be the cost per run, and let m(t) denote the number of times the distributed bipartite matching algorithm has been run by time t. Then, under policy \pi, the expected regret is

R^\pi(T) = T \sum_{i=1}^{M} \mu_{i,k^*_i} - E\left[\sum_{t=1}^{T} \sum_{i=1}^{M} X_{i,\pi_i(t)}(t)\right] + C\,E[m(T)],   (1.7)

where k^* is the optimal matching defined in equation (1.5). Our goal is to find a decentralized algorithm that players can use such that together they minimize this expected regret.

1.3.2 Existing Work

Single-player multi-armed bandits have been extensively studied in the literature since the introduction of the non-Bayesian bandit by Lai and Robbins in [24]. It was shown in [24] that there is no learning algorithm whose expected regret asymptotically grows slower than log T. A `learning' scheme was also constructed that asymptotically achieved this lower bound. This result was subsequently generalized by many people. In [5], Anantharam et al. generalized it to the case of multiple plays, i.e., the player can pick multiple arms (or coins) when there are more than two arms. In [1], Agrawal proposed a sample-mean-based index policy that asymptotically achieved log T regret. For the special case of rewards with bounded support, Auer et al. [6] introduced a simpler index-based UCB_1 algorithm that achieves logarithmic expected regret over finite time horizons. Recently, Thompson Sampling (TS) has seen a surge in interest due to its much better empirical performance in several situations [49, 11]. It is a probability matching policy which, unlike the UCB class of policies that use a deterministic confidence bound, draws samples from a distribution to determine which arm to play, based on the probability of its being optimal. The logarithmic regret performance of this policy was proved only very recently [2].
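The UCB_1 index of Auer et al. [6] mentioned above is simple to state: play each arm once, then play the arm with the largest sample mean plus confidence bonus. A minimal sketch (function names are ours, not from the original):

```python
import math

def ucb1_index(mean, count, t):
    """UCB_1 index of Auer et al.: sample mean plus a confidence bonus
    that shrinks as the arm is played more often."""
    return mean + math.sqrt(2.0 * math.log(t) / count)

def ucb1_select(means, counts, t):
    """Play each arm once first, then the arm with the highest index.
    means[k] is the sample-mean reward of arm k, counts[k] the number
    of times arm k has been played, and t the current time."""
    for k, c in enumerate(counts):
        if c == 0:
            return k
    return max(range(len(means)),
               key=lambda k: ucb1_index(means[k], counts[k], t))
```

The bonus term sqrt(2 log t / n_k) is exactly what drives the logarithmic regret bound: under-sampled arms keep a large index and are revisited at a rate proportional to log t.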
Deterministic sequencing algorithms with separate exploration and exploitation phases have also appeared in the literature as an alternative to the joint exploration-exploitation approaches of UCB-like and probability matching algorithms. Noteworthy among these is the Phased Exploration and Greedy Exploitation (PEGE) policy for linear bandits [?], which achieves O(\sqrt{T}) regret in general and O(log T) regret for finitely many linearly parametrized arms. Other noteworthy algorithms include the deterministic DSEE policy [50], which achieves logarithmic regret.

Single-player bandit problems have also been studied in the PAC framework, for instance in [?]. However, we restrict our attention to performance in the expected sense in this work.

Multiplayer MABs introduce two additional questions. First, can one construct an index-based decentralized learning algorithm that achieves poly-log, or at least sub-linear, expected regret? Second, what is the lower bound on the expected regret? Is it still log T, or is there a fundamental cost of decentralization that one must pay? The first question was answered by our earlier work in [20], wherein an index-based decentralized learning algorithm called dUCB_4 was proposed and shown to achieve expected regret that grows as log^2 T.

1.3.3 Our Contributions

In my current work, we present two policies, dE^3 and dE^3-TS, that yield expected regret growing at most as log T (near-log T with some assumptions relaxed) in both single- and multiplayer settings, which is order optimal. E^3 stands for Exponentially-spaced Exploration and Exploitation, as the policies are based on exploration and exploitation in pre-determined phases. These policies give a non-information-theoretic answer to the fundamental question of whether there is an inherent cost of decentralized learning; the short answer is no, as far as order is concerned.
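To make the phrase "exponentially-spaced exploration and exploitation" concrete, here is a hypothetical sketch of such a phase schedule (the epoch structure and parameters are illustrative assumptions, not the actual dE^3 specification): exploitation blocks grow geometrically while exploration blocks stay short, so over a horizon T only O(log T) slots are spent exploring.

```python
def phase_schedule(num_epochs, explore_len=10):
    """Illustrative exponentially-spaced schedule: epoch j consists of a
    fixed-length exploration block followed by an exploitation block of
    length 2**j. After J epochs the horizon is about 2**J, while total
    exploration time is only J * explore_len = O(log T)."""
    schedule = []
    for j in range(num_epochs):
        schedule.append(("explore", explore_len))   # sample arms in a fixed order
        schedule.append(("exploit", 2 ** j))        # commit to current best matching
    return schedule
```

Because the phase boundaries are deterministic and known to all players, no communication is needed to agree on when to explore, which is what keeps the communication cost term C E[m(T)] small in the multiplayer setting.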
We note that a deterministic sequence learning algorithm for a single-player setting was also introduced in [50]. Extensive simulations have been carried out comparing the performance of the new algorithms with existing algorithms in single- and multi-player settings.

1.4 Learning in Restless Multi-armed Bandits

Restless MABs are used to model systems with state-dependent rewards, where state transitions occur even when the system is not being observed. They find applications in dynamic channel allocation, financial investments, etc.

1.4.1 Restless MABs

Consider the problem of probing N independent Markovian channels. Each channel i has two states, good (denoted by 1) and bad (denoted by 0), with transition probabilities \{p^{(i)}_{01}, p^{(i)}_{11}\} for the transitions from 0 to 1 and from 1 to 1 respectively. At each time t, the player chooses one channel i to probe, denoted by the action U(t), and receives a reward equal to the state S_i(t) of the channel (0 or 1). The objective is to design a policy that chooses the channel at each time so as to maximize a long-term reward. It has been shown that a sufficient statistic for making an optimal decision is the conditional probability that each channel is in state 1, given all past observations and decisions [43]. We will refer to this as the belief vector, denoted by \Omega(t) := \{\omega_1(t), \ldots, \omega_N(t)\}, where \omega_i(t) is the conditional probability that the i-th channel is in state 1. Given the sensing action U(t) in observation slot t, the belief can be recursively updated as follows:

\omega_i(t+1) = p^{(i)}_{11}        if i \in U(t), S_i(t) = 1,
               p^{(i)}_{01}        if i \in U(t), S_i(t) = 0,
               \tau(\omega_i(t))   if i \notin U(t),              (1.8)

where \tau(\omega_i(t)) := \omega_i(t) p^{(i)}_{11} + (1 - \omega_i(t)) p^{(i)}_{01} denotes the one-step belief update for unobserved channels. In our study, we focused on the discounted reward criterion.
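The belief recursion (1.8) is a three-case update, sketched below for a single Gilbert-Elliott channel (function and argument names are ours for illustration):

```python
def belief_update(w, p01, p11, observed, state=None):
    """One-step belief update, eq. (1.8), for a two-state Markov channel.
    w is the current belief that the channel is in state 1;
    p01 and p11 are the 0->1 and 1->1 transition probabilities.
    If the channel was probed, the belief resets to the transition
    probability out of the observed state; otherwise it is propagated
    through tau(w) = w*p11 + (1 - w)*p01."""
    if observed:
        return p11 if state == 1 else p01
    return w * p11 + (1.0 - w) * p01
```

Note that probing makes the next belief independent of w: observing the state collapses the posterior, which is why the belief vector is a sufficient statistic.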
For a discount parameter \beta (0 \le \beta < 1), the reward is defined as

E\left[\sum_{t=0}^{\infty} \beta^t R_{\pi}(\Omega(t)) \,\middle|\, \Omega(0) = x_0\right],   (1.9)

where R_{\pi}(\Omega(t)) is the reward obtained by playing strategy \pi(\Omega(t)), and \pi : \Omega(t) \to U(t) is a policy, defined as a function that maps the belief vector \Omega(t) to the action U(t) in slot t. The discounted reward criterion was used due to its inherently tunable nature, giving importance both to the immediate reward, unlike the average reward criterion, and to the future, unlike the myopic criterion.

1.4.2 Existing Work

The problem of dynamic channel selection has recently been formulated and studied by many researchers [57, 3, 33, 14, 47, 32] under the framework of multi-armed bandits (MABs) [16]. In these papers, the channels are typically modelled as independent Gilbert-Elliott channels (i.e., described by two-state Markov chains, with a bad state "0" and a good state "1"). Many of the prior results apply to the Bayesian case, where it is assumed that the underlying Markov state transition matrices for all channels are known, so that the corresponding beliefs for each channel can be updated. For the special case when the channels evolve as identical chains, it has been shown that the myopic policy is always optimal for N = 2, 3, and also optimal for any N so long as the chains are positively correlated [57, 3]. In [33], the authors consider this problem in the general case when the channels can be non-identical. They show that a well-known heuristic, the Whittle's index, exists for this problem and can be computed in closed form. Moreover, it is shown in [33] that in the special case of identical channels, the Whittle's index policy in fact coincides with the myopic policy. However, as the Whittle's index is not in general the best possible policy, a question that has remained open is identifying the optimal solution for general non-identical channels in the Bayesian case.
1.4.3 Our Contributions

In our work, we considered both the Bayesian and non-Bayesian versions of this two-state restless multi-armed bandit problem with non-identical channels. We made two main contributions:

For the Bayesian version of the problem, when the underlying Markov transition matrices are known, we prove structural properties of the optimal policy. Specifically, we show that the decision region for a given channel is contiguous with respect to the belief of that channel, keeping all other beliefs fixed.

For the non-Bayesian version of the problem, in the special case of N = 2 positively correlated, possibly non-identical channels, we utilize the above-derived structure to propose a mapping to another multi-armed bandit problem with a countably infinite number of arms, each corresponding to a possible threshold policy (one of which must be optimal). We present an online learning algorithm for this problem, and prove that it yields near-logarithmic regret with respect to any policy that achieves an expected discounted reward within \epsilon of the optimal.

1.5 Scheduling in decentralized systems with uncertainty: A hospital operating room application

Resource allocation and task scheduling are very common problems that arise in many settings. In this work, motivated by hospital operations, we are concerned with the problem of scheduling appointments for tasks whose duration distributions are unknown beforehand.

1.5.1 Existing Work

The existing literature on scheduling in the hospital operations context comprises a variety of approaches, all of which fall into the following categories: heuristics, mathematical programming, simulation-based methods, Markov processes, and queueing theory. A very thorough survey of these approaches is given in [17, 19].

1.5.2 Our Contributions

In this real-world application, we have shown that a data-driven method of scheduling significantly outperforms existing practices.
A prediction model using regression has been developed to predict case durations. Subsequently, this prediction was combined with an integer programming formulation to devise an optimal week schedule given desired performance metrics.

1.6 Published Work

The work described in this proposal has appeared in or been submitted to the following publications.

N. Nayyar, H. Honnappa and R. Jain, "Statistical Analysis of Operating Room Schedules in Elective Surgery Hospitals", POMS Annual Conference, 2015.

N. Nayyar, D. Kalathil and R. Jain, "On Single and Multiplayer Algorithms for Multi-armed Bandits", under revision at IEEE JSTSP.

N. Nayyar, D. Kalathil and R. Jain, "Optimal Decentralized Control with Asymmetric One-Step Delayed Information Sharing", under revision at IEEE Transactions on Control of Network Systems, Jul 2014.

N. Nayyar, D. Kalathil and R. Jain, "Optimal Decentralized Control in Unidirectional One-Step Delayed Sharing Pattern with Partial Output Feedback", American Control Conference, pp. 1906-1911, 2014.

N. Nayyar, D. Kalathil and R. Jain, "Optimal Decentralized Control in Unidirectional One-Step Delayed Sharing Pattern", Allerton Conference on Communication, Control and Computing, pp. 374-380, 2013.

D. Kalathil, N. Nayyar and R. Jain, "Decentralized learning for multi-player multi-armed bandits", IEEE Transactions on Information Theory, vol. 60, no. 4, pp. 2331-2345, 2014.

D. Kalathil, N. Nayyar and R. Jain, "Decentralized learning for multi-player multi-armed bandits", IEEE Control and Decision Conference, pp. 3960-3965, 2012.

N. Nayyar, Y. Gai, B. Krishnamachari, "On a Restless Multi-Armed Bandit Problem with Non-Identical Arms", Allerton Conference on Communication, Control and Computing, pp. 369-376, 2011.

Chapter 2
Decentralized Control in Stochastic Systems

We consider optimal control of decentralized LQG problems for plants controlled by two players having asymmetric information sharing patterns between them.
In the main scenario, players are assumed to have a unidirectional, error-free, unlimited-rate communication channel with a unit delay in one direction and no communication in the other. A secondary model, presented for completeness, considers a channel with no delay in one direction and a unit delay in the other. Delayed information sharing patterns in general do not admit linear optimal control laws and are thus difficult to control optimally. However, in these scenarios, we show that the problems have a partially nested information structure, and thus linear optimal control laws exist. Summary statistics are identified and analytical solutions for the optimal control laws are derived. The state feedback case is solved for both scenarios, as is the output feedback case.

2.1 Introduction

Recently, many problems of decentralized control have arisen in practical systems. Examples include cyberphysical systems, formation flight, and other networked control systems wherein multiple agents try to achieve a common objective in a decentralized manner. Such situations arise, for instance, when controllers do not have access to the same information at the same time. One possible reason is network delays, and a consequent time lag in communicating observations to other controllers.

These problems were first formulated by Marschak in the 1950s [35] as team decision problems, and further studied by Radner [38], though in such problems communication between the controllers was usually ignored. In a celebrated paper [53], Witsenhausen showed that even for seemingly simple systems with communication between controllers but with multi-unit delays, non-linear controllers can outperform any linear controller. Witsenhausen also consolidated and conjectured results on the separation of estimation and control in decentralized problems in [54]. However, the structure of decentralized optimal controllers for LQG systems with time delays has been hard to identify.
Indeed, in [51] it was proved that the conjectured separation principle does not hold for a system with a delay of more than one timestep. The more general delayed sharing pattern was only recently solved by Nayyar et al. in [37]. It remains an open problem to actually compute the optimal decentralized control law even using such a structural result.

However, not all results for such problems have been negative. Building on Radner's results on team decision theory, Ho and Chu [18] showed that for a unit-delay information sharing pattern, the optimal controller is linear. This was used by Kurtaran and Sivan [22] to derive an optimal controller using dynamic programming for the finite-horizon case. Unfortunately, the results do not extend to multi-unit delayed sharing patterns. This is because the former are examples of systems with a partially nested information structure, for which linear optimal control laws are known to exist [18]. Recently, another characterization, called quadratic invariance, was discovered, under which optimal decentralized control laws can be found by convex programming [39]. This has led to a resurgence of interest in decentralized optimal control problems, which since the 1970s had been assumed to be intractable. In a series of papers, Lall and his co-workers have computed the optimal (linear) control laws for a suite of decentralized nested LQG systems involving no communication delays [45, 46, 30]. More general networked structures dealing with state-feedback delayed information sharing in networked control with multiple players were considered in [27], wherein recursive solutions for the optimal control laws were derived. For a subclass of quadratic invariant problems known as poset-causal problems, Parrilo and Shah [41] showed that the computation of the optimal law becomes easier, lending itself to decomposition into subproblems.
Solutions to certain cases of the state-feedback delayed information sharing problem have been presented by Lamperski et al. [25, 26], where a networked graph structure of the controllers (strongly connected) is considered, with constraints on the system dynamics. A summary of results pertaining to two-player decentralized controller systems is given in Table 2.1. In the table, d_12 is the delay in information transmission from player 1 to player 2, and d_21 vice versa. 'u.c.', 'b.d.' and 'l.b.t.' refer to uncoupled noise, block diagonal and lower block triangular matrices respectively. Standard conditions for stability and controllability of the system are assumed to hold in all cases.

d_12 | d_21 | Literature     | Comments
0    | 0    | Classical      | no plant restrictions
1    | 1    | [22],[40],[55] | no plant restrictions
0/1  | 0/1  | [25, 26]       | b.d. B matrix, infinite horizon
0    | inf  | [45, 46, 30]   | l.b.t. matrix
1    | 1    | [27]           | l.b.t. matrix, state f/b, u.c.
1    | inf  | [29]           | b.d. dynamics, state f/b
1    | 0    | here           | no plant restrictions
1    | inf  | here           | l.b.t. matrix

Table 2.1: Summary of results for some information sharing patterns with two players.

In this work, we consider two scenarios: the (1, 0) and (1, inf) information sharing patterns between two players in an LQG system. Both scenarios have a partially nested information structure and also satisfy quadratic invariance. They are, however, not poset-causal and do not lend themselves to easy decompositions. We derive optimal control laws for both cases. This amounts to Riccati-type iterations for computing the gain matrix of the optimal control law, which is linear.

2.2 The (1, 0) information sharing pattern

A two-player discrete linear time-invariant system is considered. The system dynamics are specified as

x(t+1) = A x(t) + B u(t) + v(t).   (2.1)

It is assumed that the initial state x(0) is zero-mean Gaussian, independent of the system noise v(t), and that the v(t) are zero-mean Gaussian random vectors with covariance V, independent across time.
The two players have a team (common) objective: to find the decentralized control law (u_1(\cdot), u_2(\cdot)) that minimizes the finite-time quadratic cost criterion

E\left[\sum_{t=0}^{N-1} \big(x(t)' Q x(t) + u(t)' R u(t)\big) + x(N)' S x(N)\right],   (2.2)

where Q, R and S are positive semi-definite matrices. At each instant, the control action u_i(t) taken by each player can depend only on the information available to them, denoted by F(t) = (F_1(t), F_2(t)). We consider the following:

F_1(t) = \{x_1(0:t), u_1(0:t-1), x_2(0:t), u_2(0:t-1)\},
F_2(t) = \{x_1(0:t-1), u_1(0:t-1), x_2(0:t), u_2(0:t-1)\},   (2.3)

where x(0:t) denotes the vector (x(0), \ldots, x(t)).

We refer to this as the (1, 0) information sharing pattern. Finding the optimal decentralized controller under this information pattern is very similar to the (1, 1) pattern [22]. We outline the approach for the sake of completeness, and will use some elements of it to solve the problem with the (1, inf) pattern. The optimal law is characterized by u_i(t) = f(F_i(t)) for some function f. This implies a huge function space in which to search for the optimal control law. In the next section, we show that the optimal law is linear.

2.2.1 Linearity of the optimal control law

It was shown by Varaiya and Walrand [51] that the optimal decentralized control law with general delayed information sharing between players may not be linear. However, Ho and Chu [18] had established earlier that to show the existence of a linear optimal control law, it is sufficient to prove that the LQG problem has a partially nested information structure [18, Theorem 2]. Formally, if player i's actions at time t_1 affect player j's information at time t_2, then a partially nested information structure requires F_i(t_1) \subseteq F_j(t_2), where F denotes the information available to the players when making their decisions.
Intuitively, this can be described as an information pattern in which all information used by a player to make a decision is communicated to all other players whose system dynamics are affected by that decision.

Proposition 1. The LQG problem (2.1)-(2.2) with the (1, 0)-delayed sharing information pattern (2.3) has a partially nested structure, and a linear optimal control law exists.

Proof. Observe that the information structure defined in (2.3) can be simplified as F_1(t) = \{x_1(0:t), x_2(0:t)\} and F_2(t) = \{x_1(0:t-1), x_2(0:t)\}. This is because, for a given control law, the control actions u_i(\cdot) can be derived from the available state information. It is easy to see that F_i(t) \subseteq F_i(t+\tau), \tau \ge 0, i = 1, 2, and F_1(t) \subseteq F_2(t+\tau), \tau \ge 1, for any time t. This is shown in the information communication diagram, Fig 2.1.

Figure 2.1: Information communication for one-step delayed information sharing in two-player D-LTI systems with nested structures.

The dependence of the state variables on the input is also the same as in Fig 2.1. The (1, 0)-delayed sharing LQG problem thus has a partially nested structure [18] and, by Theorem 2 therein, the optimal control law is linear.

Denote by H(t) = (x(0:t-1), u(0:t-1)) the common information at time t. Using the linearity of the optimal control law, it follows that the optimal control law can be written as

u(t) = [u_1(t), u_2(t)]' = [F_11(t)  F_12(t); 0  F_22(t)] [x_1(t); x_2(t)] + [L_1(H(t)); L_2(H(t))],   (2.4)

where F is the optimal gain matrix and the L_i(\cdot) are linear, possibly time-varying, functions. However, the arguments of L_i, namely the state observations, grow with time. So obtaining a closed-form expression for the optimal control law is intractable in the current form, necessitating the use of summary statistics.
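The partial-nestedness containment F_1(t) \subseteq F_2(t+1) can be checked mechanically by representing information sets as sets of (player, time) state observations. The sketch below is ours, not from the thesis; it checks only the information-containment half of partial nestedness (the structural condition on which actions affect which blocks must be verified separately, as in the proof above):

```python
def info_set(player, t, delays):
    """Simplified information set of a player at time t under a
    delayed-sharing pattern: the set of (source_player, time) state
    observations available. delays[(j, i)] is the delay with which
    player i sees player j's state; None means never (the 'inf' case)."""
    obs = set()
    for j in (1, 2):
        d = delays[(j, player)]
        if d is None:
            continue
        obs |= {(j, s) for s in range(0, t - d + 1)}
    return obs

def partially_nested(delays, horizon=5):
    """Check F_1(t) <= F_2(t+1) for all t up to the horizon: the
    containment needed when player 1's action at t can affect
    player 2's information at t+1."""
    return all(info_set(1, t, delays) <= info_set(2, t + 1, delays)
               for t in range(horizon))
```

For the (1, 0) pattern the check passes; for a two-step delay from player 1 to player 2 it fails, consistent with the Varaiya-Walrand result that multi-unit delays break the linear-optimality argument.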
2.2.2 Derivation of the Optimal Control Law

We will now show that the estimate of the current state given the history of observations available to the players is a summary statistic that characterizes the optimal control law. Recall that the information available to player 1 at time t is F_1(t) = \{x_1(t), x_2(t), H(t)\}, and that available to player 2 is F_2(t) = \{x_2(t), H(t)\}. Define the estimator \hat{x}(t) := E[x(t) | H(t)], the estimate of the current state based on the history of information available to the players. The dynamics of this estimator are given in Lemma 1.

Lemma 1. [23] For the system (2.1), the estimator dynamics for \hat{x}(t) are given by

\hat{x}(t+1) = A \hat{x}(t) + B u(t) + R(t) \nu(t)   (2.5)

with \hat{x}(0) = E[x(0)], where R(t) := R_{xx}(t) R_x^{-1}(t), \nu(t) := x(t) - \hat{x}(t), R_{xx}(t) := A T(t), R_x(t) = T(t), and T(t) := E[(x(t) - \hat{x}(t))(x(t) - \hat{x}(t))']. \nu(t) is zero-mean, uncorrelated Gaussian, with E[\nu(t)\nu(t)'] = T(t) and E[\nu(t) x(\tau)'] = 0 for \tau < t.

Note that R(t) = A, thereby simplifying the estimator dynamics above to \hat{x}(t+1) = A x(t) + B u(t). The proof is found, for instance, in [23]. The sufficiency of the estimator \hat{x}(t) for optimal control is established through the result below.

Theorem 1. For the system (2.1) with the information structure (2.3) and objective function (2.2), the optimal control law is given by

u(t) = F*(t) x(t) + G*(t) \hat{x}(t),   (2.6)

where F*(t) = [F_11(t)  F_12(t); 0  F_22(t)] is referred to as the optimal gain matrix (to be derived later) and G*(t) = -(F*(t) + H*(t)), where H*(t) is the optimal gain matrix for the classical LQR with dynamics described by (2.1).

The proof follows from a transformation of the original problem to a simpler problem to which existing LQR results can be applied. It is in a similar vein to the proof of Theorem 3 and is skipped.

Remark: Observe that the classical separation principle does not hold in the (1, 0) information sharing pattern.
The information statistic \hat{x}(t) by itself cannot be used to compute the optimal control law.

We now characterize the optimal gain matrix F*(t). As in [22], the stochastic optimization problem can be converted into a deterministic matrix optimization problem, after which standard matrix optimization tools can be applied.

Proposition 2. The optimal gain matrix F*(t) is given by the backward recursion

F*(t) = -(R + B' M(t+1) B)^{-1} (\Sigma(t) T^{-1}(t) + B' M(t+1) A),
M(t) = Q + H(t)' R H(t) + (A - B H(t))' M(t+1) (A - B H(t)),
M(N) = S.   (2.7)

The proof is a variant of Theorem 4.

2.2.3 The Output-feedback problem: Optimal Control law

We now consider the output-feedback problem. It will be shown that the results from the state-feedback problem extend to the output-feedback case with a change of notation. The dynamics of the problem are described by

x(t+1) = A x(t) + B u(t) + v(t),
y(t) = C x(t) + w(t),   (2.8)

where v(t) and w(t) are zero-mean Gaussian random vectors with covariance matrices V and W respectively, independent across time t, independent of the initial system state x(0), and independent of each other. x(0) is Gaussian with known mean and covariance. The information structure of the problem is

F_1(t) = \{y_1(0:t), u_1(0:t-1), y_2(0:t), u_2(0:t-1)\},
F_2(t) = \{y_1(0:t-1), u_1(0:t-1), y_2(0:t), u_2(0:t-1)\}.   (2.9)

We redefine the history of observations as H(t) = (y(0:t-1), u(0:t-1)). Given this information structure, the players' objective is to find the control law u_i(t) as a function of F_i(t) (i = 1, 2) that minimizes the finite-time quadratic cost criterion (2.2). Proposition 1 shows that the optimal control law is linear. Defining the estimator \hat{x}(t) := E[x(t) | H(t)] as before, the following results can be reconstructed from the state-feedback theorems.

Lemma 2. For the system (2.8), the estimator dynamics for the redefined \hat{x}(t) are as follows.
\hat{x}(t+1) = A \hat{x}(t) + B u(t) + K(t) \nu(t),   (2.10)

with \hat{x}(0) = E[x(0)], where K(t) = K_{xy}(t) K_y(t)^{-1} and \nu(t) = y(t) - C \hat{x}(t). K_{xy}(t) = A T(t) C', K_y(t) = C T(t) C' + W, and T(t) = E[(x(t) - \hat{x}(t))(x(t) - \hat{x}(t))']. \nu(t) is zero-mean, uncorrelated Gaussian, with E[\nu(t)\nu(t)'] = C T(t) C' + W and E[\nu(t) y(\tau)'] = 0 for \tau < t.

Theorem 2. For the system (2.8) with the information structure (2.9) and objective function (2.2),

[u_1(t); u_2(t)] = F*(t) [y_1(t); y_2(t)] + G*(t) \hat{x}(t),   (2.11)

where F*(t) = [F_11(t)  F_12(t); 0  F_22(t)] is the optimal gain matrix and G*(t) = -(F*(t) C + H*(t)).

Proposition 3. The optimal gain matrix F*(t) is given by the backward recursion

F*(t) = -(R + B' M(t+1) B)^{-1} (\Sigma(t) (C T(t) C' + W)^{-1} + B' M(t+1) R(t)),
M(t) = Q + H(t)' R H(t) + (A - B H(t))' M(t+1) (A - B H(t)),
M(N) = S.   (2.12)

The proofs are omitted due to space constraints.

2.3 The (1, inf) information sharing pattern

We now consider a coupled two-player discrete linear time-invariant (LTI) system with a nested structure. Player 2's actions affect only block 2 of the plant, whereas player 1's actions affect both blocks 1 and 2. Formally, the system dynamics are

[x_1(t+1); x_2(t+1)] = [A_11  0; A_21  A_22] [x_1(t); x_2(t)] + [B_11  0; B_21  B_22] [u_1(t); u_2(t)] + [v_1(t); v_2(t)],   (2.13)

for t \in \{0, 1, \ldots, N-1\}. Henceforth, we will use the notation x(t+1) = A x(t) + B u(t) + v(t) where the definition is clear from the context. The initial state x(0) is assumed to be zero-mean Gaussian, independent of the system noise v(t), and v(t) is zero-mean Gaussian with covariance V, independent across time t. The two players have a team (common) objective: to find the decentralized control law u(t) that minimizes

E\left[\sum_{t=0}^{N-1} \big(x(t)' Q x(t) + u(t)' R u(t)\big) + x(N)' S x(N)\right],   (2.14)

where Q, R and S are positive semi-definite matrices.
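The gain computations in this chapter all reduce to backward Riccati-type recursions (Propositions 2, 3 and 5). As a minimal illustration of the mechanics, here is a scalar sketch of the classical LQR recursion that produces the gain H(t) appearing in Theorem 1. It is written for scalars rather than matrices and assumes the u = -H x sign convention; it is not the coupled recursion (2.7) itself:

```python
def lqr_backward(A, B, Q, R, S, N):
    """Scalar backward Riccati recursion over horizon N:
        M(N) = S,
        H(t) = (R + B*M(t+1)*B)^{-1} * B*M(t+1)*A,
        M(t) = Q + H(t)*R*H(t) + (A - B*H(t))^2 * M(t+1).
    Returns the gains H(0), ..., H(N-1)."""
    M = S
    gains = [0.0] * N
    for t in range(N - 1, -1, -1):
        H = (B * M * A) / (R + B * M * B)   # gain at time t
        M = Q + H * R * H + (A - B * H) ** 2 * M   # cost-to-go update
        gains[t] = H
    return gains
```

The matrix version replaces divisions with inverses of (R + B' M B) and the square with the quadratic form (A - B H)' M (A - B H), exactly as in the M(t) recursions of (2.7) and (2.12).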
The information structure of the problem is

F_1(t) = \{x_1(0:t), u_1(0:t-1)\},
F_2(t) = \{x_1(0:t-1), u_1(0:t-1), x_2(0:t), u_2(0:t-1)\}.   (2.15)

The set of all control laws is characterized by u_i(t) = f(F_i(t)) for some time-varying function f. In the next subsection, we show that the optimal law is linear.

2.3.1 Linearity of the optimal control law

We now show that the (1, inf) pattern has a partially nested structure.

Proposition 4. The problem (2.13)-(2.14) with information structure (2.15) has a linear optimal control law.

Proof. Observe that the information structure defined in (2.15) can be simplified as F_1(t) = \{x_1(0:t)\} and F_2(t) = \{x_1(0:t-1), x_2(0:t)\}. Because of the nested system structure, x_1(t) does not depend on u_2(\cdot). At any time instant t, F_i(t) \subseteq F_i(t+\tau), \tau \ge 0, i = 1, 2, and F_1(t) \subseteq F_2(t+\tau), \tau \ge 1. This is illustrated in the information communication diagram in Fig 2.2, which shows how observations are communicated.

Figure 2.2: Information communication for one-step delayed information sharing in two-player D-LTI systems with nested structures.

Observe that A_12 = B_12 = 0 implies that the dependence of the state variables on the input is also as in Fig 2.2. The formulation thus has a partially nested structure [18] and, by Theorem 2 therein, the optimal law is linear.

Denote by H_i(t) = (x_i(0:t-1), u_i(0:t-1)) the history of observations for block i. Using the linearity of the optimal law, it follows that

u(t) = [u_1(t), u_2(t)]' = [F_11(t) x_1(t) + L_1(H_1(t)); F_22(t) x_2(t) + L_2(H_1(t), H_2(t))],   (2.16)

where the L_i(\cdot) are linear, possibly time-varying, functions.

2.3.2 Summary Statistic to Compute the Optimal Control Law

We now show that the estimate of the current state given the history of observations of the players is a summary statistic that characterizes the optimal control law. The notation used is summarized in Table 2.2.
Note that \hat{x}(t) and \hat{\hat{x}}(t) are the estimates of the current state by players 1 and 2 respectively. The dynamics of these estimators are derived in the following lemma.

Lemma 3. For the nested system (2.13), the estimator dynamics for \hat{x}(t) and \hat{\hat{x}}(t) are given by

\hat{x}(t+1) = A \hat{x}(t) + B [u_1(t); \hat{u}_2(t)] + R(t) \nu(t),   (2.17)
\hat{\hat{x}}(t+1) = A \hat{\hat{x}}(t) + B [u_1(t); u_2(t)] + \bar{R}(t) \mu(t),   (2.18)

with \hat{x}(0) = \hat{\hat{x}}(0) = E[x(0)], where \hat{u}_2(t) := E[u_2(t) | H_1(t)], R(t) := R_{x x_1}(t) R_{x_1}^{-1}(t), \bar{R}(t) := R_{xx}(t) R_x^{-1}(t), \nu(t) := x_1(t) - \hat{x}_1(t), \mu(t) := x(t) - \hat{\hat{x}}(t), R_{x x_1}(t) := A T(t), R_{x_1}(t) = T_1(t), R_{xx}(t) := A \bar{T}(t), R_x(t) := \bar{T}(t), T(t) := E[(x(t) - \hat{x}(t))(x_1(t) - \hat{x}_1(t))'], and \bar{T}(t) := E[(x(t) - \hat{\hat{x}}(t))(x(t) - \hat{\hat{x}}(t))']. Note that \bar{R}(t) = A. \nu(t) and \mu(t) are zero-mean, uncorrelated Gaussian random vectors with E[\nu(t)\nu(t)'] = T_1(t) and E[\mu(t)\mu(t)'] = \bar{T}(t). At each time step, \nu(t) is uncorrelated with H_1(t), and so E[\nu(t) x_1(\tau)'] = 0 for \tau < t. \mu(t) is uncorrelated with H_1(t) and H_2(t), and so E[\mu(t) x(\tau)'] = 0 for \tau < t.

H_i(t)           \{x_i(0:t-1), u_i(0:t-1)\}
F_1(t)           \{x_1(t), H_1(t)\}
F_2(t)           \{x_2(t), H_1(t), H_2(t)\}
\hat{x}(t)       E[x(t) | H_1(t)]
\hat{\hat{x}}(t) E[x(t) | H_1(t), H_2(t)]
\nu(t)           x_1(t) - \hat{x}_1(t)
\mu(t)           x(t) - \hat{\hat{x}}(t)

Table 2.2: A glossary of notation used in the following sections.

The proof is in the Appendix. Before showing that \hat{x}_1(t) and \hat{\hat{x}}_2(t) are summary statistics for the history of state observations, we first obtain the optimal decentralized control law for the LQG problem (2.13)-(2.14) with a nested information structure having no communication delay, the (0, inf) case. For such a system, the information structure is described by

z_1(t) = \{x_1(0:t), u_1(0:t-1)\},   (2.19)
z_2(t) = \{x_1(0:t), u_1(0:t-1), x_2(0:t), u_2(0:t-1)\}.

Proposition 5 addresses the (0, inf) state-feedback problem, already solved for output feedback in [30].

Proposition 5.
The optimal control law for the LQG problem (2.13)-(2.14) with information structure (2.19) is

[u_1(t); u_2(t)] = [K_11(t)  K_12(t)  0; K_21(t)  K_22(t)  J(t)] [x_1(t); \tilde{x}_2(t); x_2(t) - \tilde{x}_2(t)],
\tilde{x}_2(t+1) = A_21 x_1(t) + A_22 \tilde{x}_2(t) + B_21 u_1(t) + B_22 \tilde{u}_2(t),  \tilde{x}_2(0) = 0,   (2.20)

where \tilde{u}_2(t) = [K_21(t)  K_22(t)] [x_1(t); \tilde{x}_2(t)]. K(t) is obtained from the backward recursion

P(T) = S,
P(t) = A' P(t+1) A + A' P(t+1) B K(t) + Q,
K(t) = -(B' P(t+1) B + R)^{-1} B' P(t+1) A,

and J(t) is obtained from the backward recursion

\tilde{P}(T) = S_22,
\tilde{P}(t) = A_22' \tilde{P}(t+1) A_22 + A_22' \tilde{P}(t+1) B_22 J(t) + Q_22,
J(t) = -(B_22' \tilde{P}(t+1) B_22 + R_22)^{-1} B_22' \tilde{P}(t+1) A_22.

Proof. The proof follows from Theorems 10 and 12 of [30], replacing the information structure with the state-feedback structure given in (2.19). With this structure, the conditional expectations based on the players' information reduce to z(t) := E[x(t) | z_2(t)] = x(t) and \hat{z}(t) := E[x(t) | z_1(t)] = [x_1(t); \tilde{x}_2(t)], where \tilde{x}_2(t) := E[x_2(t) | z_1(t)]. The optimal control law for output feedback is

u(t) = K(t) \hat{z}(t) + \hat{K}(t) (z(t) - \hat{z}(t)),  where  \hat{K}(t) = [0  0; \hat{K}_21  J(t)].

Substituting the expressions for z(t) and \hat{z}(t), the control law reduces to

u(t) = K(t) [x_1(t); \tilde{x}_2(t)] + \hat{K}(t) [0; x_2(t) - \tilde{x}_2(t)]
     = K(t) [x_1(t); \tilde{x}_2(t)] + [0; J(t)] (x_2(t) - \tilde{x}_2(t))
     = [K_11(t)  K_12(t)  0; K_21(t)  K_22(t)  J(t)] [x_1(t); \tilde{x}_2(t); x_2(t) - \tilde{x}_2(t)].

Consequently, we also have \tilde{u}_2(t) := E[u_2(t) | z_1(t)] = [K_21(t)  K_22(t)] [x_1(t); \tilde{x}_2(t)], which, when substituted in (2.13), gives (2.20).

We can now establish the sufficiency of the estimators \hat{x}_1(t) and \hat{\hat{x}}_2(t) for optimal control.

Theorem 3.
For the nested system (2.13) with the information structure (2.15) and objective function (2.14), the optimal control law is given by
\[
u(t) = F(t)\begin{bmatrix} x_1(t) - \hat{x}_1(t) \\ x_2(t) - \hat{\hat{x}}_2(t) \end{bmatrix} + \tilde{K}(t), \tag{2.21}
\]
where $F(t) = \begin{bmatrix} F_{11}(t) & 0 \\ 0 & F_{22}(t) \end{bmatrix}$ is the optimal gain matrix and $\tilde{K}(t)$ is given by
\[
\tilde{K}(t) = \begin{bmatrix} K_{11}(t) & K_{12}(t) & 0 \\ K_{21}(t) & K_{22}(t) & J(t) \end{bmatrix}
\begin{bmatrix} \hat{x}_1(t) \\ \hat{x}_2(t) \\ \hat{\hat{x}}_2(t) - \hat{x}_2(t) \end{bmatrix},
\]
where $K(t)$ and $J(t)$ are as in Proposition 5. The proof is in the Appendix. Note that the classical separation principle does not hold here, since the information statistics $\hat{x}(t)$ and $\hat{\hat{x}}(t)$ alone cannot be used to compute the optimal law.

2.3.3 Deriving the Optimal Gain Matrix

To characterize the optimal gain matrix $F(t)$, the stochastic optimization problem is converted into a deterministic matrix optimization and solved analytically.

Theorem 4. The matrix $F(t)$ is given by the recursion
\[
F(t) = -\big(R + \tilde{B}'M(t+1)\tilde{B}\big)^{-1}\big(\Lambda(t) + \tilde{B}'M(t+1)V(t)'\big)\bar{T}^{-1}(t),
\]
\[
M(t) = \tilde{Q} + H(t)'RH(t) + \big(\tilde{A} + G(t)\big)'M(t+1)\big(\tilde{A} + G(t)\big), \quad M(N) = \tilde{S}, \tag{2.22}
\]
where
$\tilde{A} = \begin{bmatrix} A_{11} & 0 & 0 \\ A_{21} & A_{22} & 0 \\ 0 & 0 & A_{22} \end{bmatrix}$,
$G(t) = \begin{bmatrix} BH(t) \\ \begin{bmatrix} 0 & 0 & B_{22}J(t) \end{bmatrix} \end{bmatrix}$,
$\tilde{B} = \begin{bmatrix} B_{11} & 0 \\ B_{21} & B_{22} \\ 0 & B_{22} \end{bmatrix}$,
$H(t) = \begin{bmatrix} K_{11}(t) & K_{12}(t) & -K_{12}(t) \\ K_{21}(t) & K_{22}(t) & J(t) - K_{22}(t) \end{bmatrix}$,
$\tilde{Q} = \begin{bmatrix} Q & 0 \\ 0 & 0 \end{bmatrix}$,
$\tilde{S} = \begin{bmatrix} S & 0 \\ 0 & 0 \end{bmatrix}$,
$\Lambda(t)$ is the Lagrange multiplier matrix appearing in the proof, $V(t) = E[\eta(t)n_1(t)']$, and
$n_1(t) := \begin{bmatrix} (R(t)\varepsilon(t))_1 \\ (\bar{R}(t)\eta(t))_2 \\ (\bar{R}(t)\eta(t))_2 - (R(t)\varepsilon(t))_2 \end{bmatrix}$.
The proof is in the Appendix.

2.3.4 The output-feedback case

The approach extends to partial output-feedback, where one block's state is perfectly observed and the other is observed in noise.
The system dynamics are described by
\[
\begin{bmatrix} x_1(t+1) \\ x_2(t+1) \end{bmatrix} =
\begin{bmatrix} A_{11} & 0 \\ A_{21} & A_{22} \end{bmatrix}\begin{bmatrix} x_1(t) \\ x_2(t) \end{bmatrix} +
\begin{bmatrix} B_{11} & 0 \\ B_{21} & B_{22} \end{bmatrix}\begin{bmatrix} u_1(t) \\ u_2(t) \end{bmatrix} +
\begin{bmatrix} v_1(t) \\ v_2(t) \end{bmatrix},
\]
\[
\begin{bmatrix} y_1(t) \\ y_2(t) \end{bmatrix} =
\begin{bmatrix} C_{11} & 0 \\ C_{21} & C_{22} \end{bmatrix}\begin{bmatrix} x_1(t) \\ x_2(t) \end{bmatrix} +
\begin{bmatrix} 0 \\ w(t) \end{bmatrix}. \tag{2.23}
\]
Here, $v(t) := \begin{bmatrix} v_1(t) \\ v_2(t) \end{bmatrix}$ and $w(t)$ are zero-mean Gaussian random vectors with covariance matrices $V$ and $W$ respectively, independent across time, independent of the initial system state $x(0)$, and independent of each other. $x(0)$ is Gaussian with known mean and covariance. For brevity, we denote $y_1(t) := x_1(t)$.

The information structure for the problem is
\[
\mathcal{F}_1(t) = \{y_1(0:t),\ u_1(0:t-1)\}, \qquad
\mathcal{F}_2(t) = \{y_1(0:t-1),\ u_1(0:t-1),\ y_2(0:t),\ u_2(0:t-1)\}. \tag{2.24}
\]
Redefining the history of observations as $\mathcal{H}_i(t) = (y_i(t-1), \ldots, y_i(0), u_i(t-1), \ldots, u_i(0))$ for block $i$, $i = 1, 2$, the information available to Player 1 at time $t$ can also be expressed as $\mathcal{F}_1(t) = \{y_1(t), \mathcal{H}_1(t)\}$, and that available to Player 2 as $\mathcal{F}_2(t) = \{y_2(t), \mathcal{H}_1(t), \mathcal{H}_2(t)\}$. Given this information structure, the players' objective is to find control laws $u_i(t)$ as functions of $\mathcal{F}_i(t)$ ($i = 1, 2$) that minimize the finite-time quadratic cost criterion (2.14).

It can be seen from Proposition 4 that the optimal control law for the formulated output-feedback problem is linear. Further, defining the estimators $\hat{x}(t) := E[x(t) \mid \mathcal{H}_1(t)]$ and $\hat{\hat{x}}(t) := E[x(t) \mid \mathcal{H}_1(t), \mathcal{H}_2(t)]$, the following result can be reconstructed from the state-feedback version.

Lemma 4. For the nested system (2.23), the estimator dynamics for the redefined $\hat{x}(t)$ and $\hat{\hat{x}}(t)$ are as follows:
\[
\hat{x}(t+1) = A\hat{x}(t) + B\begin{bmatrix} u_1(t) \\ \hat{u}_2(t) \end{bmatrix} + R(t)\varepsilon(t), \tag{2.25}
\]
\[
\hat{\hat{x}}(t+1) = A\hat{\hat{x}}(t) + B\begin{bmatrix} u_1(t) \\ u_2(t) \end{bmatrix} + \bar{R}(t)\eta(t), \tag{2.26}
\]
with $\hat{x}(0) = \hat{\hat{x}}(0) = E[x(0)]$, where $\hat{u}_2(t) = E[u_2(t) \mid \mathcal{H}_1(t)]$, $R(t) = R_{xy_1}(t)R_{y_1}^{-1}(t)$, $\bar{R}(t) = \bar{R}_{xy}\bar{R}_y^{-1}$, $\varepsilon(t) = y_1(t) - C_{11}\hat{x}_1(t)$, and $\eta(t) = y(t) - C\hat{\hat{x}}(t)$.
$R_{xy_1} = AT(t)C_{11}'$, $R_{y_1} = C_{11}T_1(t)C_{11}' + W_{11}$, $\bar{R}_{xy} = A\bar{T}(t)C'$, $\bar{R}_y = C\bar{T}(t)C' + W$, $T(t) = E[(x(t) - \hat{x}(t))(x_1(t) - \hat{x}_1(t))']$, and $\bar{T}(t) = E[(x(t) - \hat{\hat{x}}(t))(x(t) - \hat{\hat{x}}(t))']$. Also, $\varepsilon(t)$ and $\eta(t)$ are zero-mean, uncorrelated Gaussian random vectors with $E[\varepsilon(t)\varepsilon(t)'] = C_{11}T_1(t)C_{11}' + W_{11}$ and $E[\eta(t)\eta(t)'] = C\bar{T}(t)C' + W$. Additionally, at each time step, $\varepsilon(t)$ is uncorrelated with $\mathcal{H}_1(t)$, and so $E[\varepsilon(t)y_1(\tau)'] = 0$ for $\tau < t$; $\eta(t)$ is uncorrelated with $\mathcal{H}_1(t), \mathcal{H}_2(t)$, and so $E[\eta(t)y(\tau)'] = 0$ for $\tau < t$.

Proof. The proof is similar to that of Lemma 3 and is omitted.

Theorem 5. For the nested system (2.23) with the information structure (2.24) and objective function (2.14), the optimal decentralized control law is given by
\[
u(t) = F(t)\begin{bmatrix} y_1(t) - \hat{x}_1(t) \\ y_2(t) - C_{21}\hat{\hat{x}}_1(t) - C_{22}\hat{\hat{x}}_2(t) \end{bmatrix} + \tilde{K}(t), \tag{2.27}
\]
where $F(t) = \begin{bmatrix} F_{11}(t) & 0 \\ 0 & F_{22}(t) \end{bmatrix}$ is the optimal gain matrix and
\[
\tilde{K}(t) = \begin{bmatrix} K_{11}(t) & K_{12}(t) \\ K_{21}(t) & K_{22}(t) \end{bmatrix}\begin{bmatrix} \hat{x}_1(t) \\ \hat{x}_2(t) \end{bmatrix} +
\begin{bmatrix} 0 \\ \begin{bmatrix} \hat{K}_{21}(t) & J(t) \end{bmatrix}\big(\hat{\hat{x}}(t) - \hat{x}(t)\big) \end{bmatrix},
\]
where $K(t)$ and $J(t)$ are as in Proposition 5, and $\hat{K}_{21}$ is the optimal control gain for the second player's input under the optimal centralized controller based only on common information [30, Theorem 12]. The proof is in the Appendix.

Remark: Observe that the classical separation principle does not hold in the $(1,\infty)$-delayed sharing pattern [?]. That is, $u_2(t)$ cannot be computed using just $E[x(t) \mid \mathcal{F}_2(t)]$.

Theorem 6. The optimal gain matrix $F(t)$ is given by the recursion
\[
F(t) = -\big(R + \tilde{B}'M(t+1)\tilde{B}\big)^{-1}\big(\Lambda(t) + \tilde{B}'M(t+1)V(t)'\big)\big(C\bar{T}(t)C' + W\big)^{-1},
\]
\[
M(t) = \tilde{Q} + H(t)'RH(t) + \big(\tilde{A} + G(t)\big)'M(t+1)\big(\tilde{A} + G(t)\big), \quad M(N) = \tilde{S}, \tag{2.28}
\]
where
$\tilde{A} = \begin{bmatrix} A & 0 \\ 0 & A \end{bmatrix}$,
$G(t) = \begin{bmatrix} BK(t) & 0 \\ 0 & B\begin{bmatrix} 0 & 0 \\ \hat{K}_{21}(t) & J(t) \end{bmatrix} \end{bmatrix}$,
$\tilde{B} = \begin{bmatrix} 0 \\ B \end{bmatrix}$,
$H(t) = \begin{bmatrix} K(t) \\ \begin{bmatrix} 0 & 0 \\ \hat{K}_{21}(t) & J(t) \end{bmatrix} \end{bmatrix}$,
$\tilde{Q} = \begin{bmatrix} Q & 0 \\ 0 & 0 \end{bmatrix}$,
$\tilde{S} = \begin{bmatrix} S & 0 \\ 0 & 0 \end{bmatrix}$,
$V(t) = E[\eta(t)n_1(t)']$, and
$n_1(t) := \begin{bmatrix} R(t)\varepsilon(t) \\ \bar{R}(t)\eta(t) - R(t)\varepsilon(t) \end{bmatrix}$.

2.4 Proofs of Theorems and Lemmas

Proof of Lemma 3. Let us first derive the dynamics of Player 1's estimator.
Denote $\varepsilon(t) := x_1(t) - \hat{x}_1(t)$. Then $\varepsilon(t)$ and $\mathcal{H}_1(t)$ are independent by the projection theorem for Gaussian random variables [23], and
\begin{align*}
\hat{x}(t+1) &= E[x(t+1) \mid \mathcal{H}_1(t+1)] = E[x(t+1) \mid \mathcal{H}_1(t), x_1(t), u_1(t)] \\
&= E[x(t+1) \mid \mathcal{H}_1(t), u_1(t)] + E[x(t+1) \mid \varepsilon(t)] - E[x(t+1)],
\end{align*}
where the last equality follows from the independence of $\varepsilon(t)$ and $\mathcal{H}_1(t)$ [23]. The first term on the right-hand side is
\[
E[x(t+1) \mid \mathcal{H}_1(t), u_1(t)] = E[Ax(t) + Bu(t) \mid \mathcal{H}_1(t), u_1(t)] = A\hat{x}(t) + B\begin{bmatrix} u_1(t) \\ \hat{u}_2(t) \end{bmatrix},
\]
where $\hat{u}_2(t) = E[u_2(t) \mid \mathcal{H}_1(t)]$. The second and third terms can be related through the conditional estimation of Gaussian random vectors as
\[
E[x(t+1) \mid \varepsilon(t)] - E[x(t+1)] = R_{xx_1}R_{x_1}^{-1}\varepsilon(t),
\]
where, defining $T(t) = E[(x(t) - \hat{x}(t))(x_1(t) - \hat{x}_1(t))']$ and partitioning $T(t)$ as $T(t) = \big[T_1(t) \mid T_2(t)\big]$, we have $R_{xx_1} = AT(t)$ and $R_{x_1} = T_1(t)$ [23]. Putting these three terms together, we have
\[
\hat{x}(t+1) = A\hat{x}(t) + B\begin{bmatrix} u_1(t) \\ \hat{u}_2(t) \end{bmatrix} + R(t)\varepsilon(t).
\]
Similarly, by defining $\eta(t) = x(t) - \hat{\hat{x}}(t)$ and proceeding as above, we have
\[
\hat{\hat{x}}(t+1) = A\hat{\hat{x}}(t) + B\begin{bmatrix} u_1(t) \\ u_2(t) \end{bmatrix} + \bar{R}(t)\eta(t).
\]
To derive the properties of $\varepsilon(t)$ and $\eta(t)$, note that the mean and variance follow immediately by observing that $E[\hat{x}_1(t)] = E[x_1(t)]$ and $E[\hat{\hat{x}}(t)] = E[x(t)]$. The projection theorem for Gaussian random vectors [23] implies that $\varepsilon(t)$ and $x_1(\tau)$ are uncorrelated for $\tau < t$, and that $\eta(t)$ is uncorrelated with $x(\tau)$, $\tau < t$.

Proof of Theorem 3. We begin by outlining our proof technique. The optimal control law for the $(1,\infty)$ state-feedback pattern, already shown to be linear because the problem is partially nested, is proven to be the same as the optimal control law for the dynamical system (2.13) with the objective function $\arg\min_{\bar{u}(t)} E[\sum_{t=0}^{N-1}(\bar{x}(t)'Q\bar{x}(t) + \bar{u}(t)'R\bar{u}(t)) + \bar{x}(N)'S\bar{x}(N)]$, where $\bar{x}(t) = [\hat{x}_1(t)\ \hat{\hat{x}}_2(t)]'$ and $\bar{u}(t)$ is an input variable that is an invertible transformation of the original input $u(t)$.
The crucial point behind this transformation of the objective function and the input variable is that $\bar{u}(t)$ will be shown to be a function of just the summary statistics $\hat{x}(t)$ and $\hat{\hat{x}}(t)$, hence allowing efficient computation of the control law. Since the new objective depends on $\bar{x}(t)$ and $\bar{u}(t)$, the dynamics of $\bar{x}(t)$ can be rewritten in terms of these. It can then be shown that the modified problem is of the form in Proposition 5. Thus, the optimal control law can be obtained for the modified problem and inverted to get the optimal law for the $(1,\infty)$ state-feedback pattern. As a preliminary result, the following lemma derives a required uncorrelatedness property and also proves the equivalence of two of the estimator variables.

Lemma 5. For the nested system (2.13), $\hat{x}_1(t) = \hat{\hat{x}}_1(t)$. Additionally, $\varepsilon(t) = \eta_1(t)$, which implies from Lemma 3 that $\varepsilon(t)$ is uncorrelated with $\mathcal{H}_2(t)$, and so $E[\varepsilon(t)x(\tau)'] = 0$, $\tau < t$.

Proof. Intuitively, the claim is true because Player 2 has no additional information about block 1's dynamics as compared to Player 1. Formally,
\[
\hat{\hat{x}}_1(t) = E[x_1(t) \mid \mathcal{H}_1(t), \mathcal{H}_2(t)] = A_{11}x_1(t-1) + B_{11}u_1(t-1) = \hat{x}_1(t),
\]
where we have used the independence of the zero-mean noise $v_1(t)$ from both $\mathcal{H}_1(t)$ and $\mathcal{H}_2(t)$, and the definitions of $\varepsilon(t)$ and $\eta(t)$.

Note that the estimator dynamics of $[\hat{x}_1(t)\ \hat{\hat{x}}_2(t)]'$ can be written in the following way:
\[
\begin{bmatrix} \hat{x}_1(t+1) \\ \hat{\hat{x}}_2(t+1) \end{bmatrix} =
\begin{bmatrix} A_{11} & 0 \\ A_{21} & A_{22} \end{bmatrix}\begin{bmatrix} \hat{x}_1(t) \\ \hat{\hat{x}}_2(t) \end{bmatrix} +
\begin{bmatrix} B_{11} & 0 \\ B_{21} & B_{22} \end{bmatrix}\begin{bmatrix} u_1(t) \\ u_2(t) \end{bmatrix} +
\begin{bmatrix} (R(t)\varepsilon(t))_1 \\ (\bar{R}(t)\eta(t))_2 \end{bmatrix}. \tag{2.29}
\]
This shall be denoted in shorthand notation as
\[
\bar{x}(t+1) = A\bar{x}(t) + Bu(t) + \begin{bmatrix} (R(t)\varepsilon(t))_1 \\ (\bar{R}(t)\eta(t))_2 \end{bmatrix}.
\]
Let us denote the estimation error by $e(t) = [e_1(t)\ e_2(t)]' := [x_1(t) - \hat{x}_1(t)\ \ x_2(t) - \hat{\hat{x}}_2(t)]'$. Then $e_1(t) = \eta_1(t)$ and $e_2(t) = \eta_2(t)$. By the projection theorem [23] and some manipulations, it can be verified that $E[e_i(t)\bar{x}_j(t)'] = 0$ for $i, j = 1, 2$.
Further, we define a transformed system input $\bar{u}(t)$ by
\[
\bar{u}(t) = \begin{bmatrix} \bar{u}_1(t) \\ \bar{u}_2(t) \end{bmatrix} := u(t) - F(t)e(t), \tag{2.30}
\]
where $F(t)$ is the optimal gain matrix in (2.16). Note that the transformation is invertible. From (2.16), it can be observed that the two components of $\bar{u}(t)$ are linear functions of $\mathcal{H}_1(t)$ and $(\mathcal{H}_1(t), \mathcal{H}_2(t))$ respectively, since
\[
\bar{u}(t) = u(t) - F(t)\begin{bmatrix} x_1(t) - \hat{x}_1(t) \\ x_2(t) - \hat{\hat{x}}_2(t) \end{bmatrix}
= \begin{bmatrix} F_{11}(t)\hat{x}_1(t) + L_1(\mathcal{H}_1(t)) \\ F_{22}(t)\hat{\hat{x}}_2(t) + L_2(\mathcal{H}_1(t), \mathcal{H}_2(t)) \end{bmatrix}
= \begin{bmatrix} L_1'(\mathcal{H}_1(t)) \\ L_2'(\mathcal{H}_1(t), \mathcal{H}_2(t)) \end{bmatrix}, \tag{2.31}
\]
since $\hat{x}(t)$ and $\hat{\hat{x}}(t)$ are linear functions of $\mathcal{H}_1(t)$ and $(\mathcal{H}_1(t), \mathcal{H}_2(t))$, respectively. Using the transformed input $\bar{u}(t)$, (2.29) can be rewritten as
\[
\bar{x}(t+1) = A\bar{x}(t) + B\bar{u}(t) + \tilde{v}(t), \tag{2.32}
\]
where
\[
\tilde{v}(t) = \begin{bmatrix} B_{11}F_{11}(t)\varepsilon(t) + (R(t)\varepsilon(t))_1 \\ B_{21}F_{11}(t)\varepsilon(t) + B_{22}F_{22}(t)\eta_2(t) + (\bar{R}(t)\eta(t))_2 \end{bmatrix}.
\]
It can be verified using Lemmas 3 and 5 that $\tilde{v}(t)$ is Gaussian, zero-mean and independent across time $t$.

With the estimator system well-characterized, recall that the first step of our proof was to show that the original objective function (2.14) could be equivalently expressed in terms of $\bar{x}(t)$ and $\bar{u}(t)$. To that end, rewriting (2.14) in terms of the new state variable $\bar{x}(t)$ and the transformed input $\bar{u}(t)$, after noting that the terms involving $e_1(t)$ and $e_2(t)$ are either zero or independent of the input $\bar{u}(t)$, we have
\begin{align*}
&\min_{u(t),\, 0 \le t \le N-1} E\Big[\sum_{t=0}^{N-1}\big(x(t)'Qx(t) + u(t)'Ru(t)\big) + x(N)'Sx(N)\Big] \\
&= \min_{\bar{u}(t),\, 0 \le t \le N-1} E\Big[\sum_{t=0}^{N-1}\big(\bar{x}(t)'Q\bar{x}(t) + \bar{u}(t)'R\bar{u}(t) + e(t)'F(t)'RF(t)e(t) + 2e(t)'F(t)'R\bar{u}(t)\big) + \bar{x}(N)'S\bar{x}(N)\Big]. \tag{2.33}
\end{align*}
The term $e(t)'F(t)'RF(t)e(t)$ is an estimation-error term and is independent of the control input $\bar{u}(t)$. This can be seen from the fact that $e_1(t) = \eta_1(t)$ and $e_2(t) = \eta_2(t)$ are independent of $\mathcal{H}_1(t)$ and $\mathcal{H}_2(t)$, by Lemmas 3 and 5, while $\bar{u}(t)$ is a linear function of $\mathcal{H}_1(t)$ and $\mathcal{H}_2(t)$ and so is independent of $e(t)$. In a sense, we have "subtracted" out the dependence on $e(t)$ in (2.30).
$\bar{u}_1(t)$ and $\bar{u}_2(t)$ are linear functions of $\mathcal{H}_1(t)$ and $(\mathcal{H}_1(t), \mathcal{H}_2(t))$ respectively, so the fourth term has zero expected value since $E[\varepsilon(t)\bar{x}(\tau)'] = 0 = E[\eta(t)\bar{x}(\tau)']$, $\tau < t$, from Lemmas 3 and 5. Thus, the following is an equivalent objective function:
\[
\min_{\bar{u}(t),\, 0 \le t \le N-1} E\Big[\sum_{t=0}^{N-1}\big(\bar{x}(t)'Q\bar{x}(t) + \bar{u}(t)'R\bar{u}(t)\big) + \bar{x}(N)'S\bar{x}(N)\Big]. \tag{2.34}
\]
Observe that, with this transformation, solving the $(1,\infty)$ state-feedback optimal control problem is equivalent to minimizing the cost function (2.34) for the system (2.32), which has no communication delay in the new state variables. Letting $[\bar{u}_1(t)\ \bar{u}_2(t)]'$ take up the role of the players' inputs, (2.32) has a nested system structure with no communication delays. By Proposition 5, the optimal controller for the system (2.32)-(2.34) is given by
\[
\begin{bmatrix} \bar{u}_1(t) \\ \bar{u}_2(t) \end{bmatrix} =
\begin{bmatrix} K_{11}(t) & K_{12}(t) & 0 \\ K_{21}(t) & K_{22}(t) & J(t) \end{bmatrix}
\begin{bmatrix} \hat{x}_1(t) \\ \breve{x}_2(t) \\ \hat{\hat{x}}_2(t) - \breve{x}_2(t) \end{bmatrix}, \tag{2.35}
\]
where $\breve{x}_2(t) = E[\hat{\hat{x}}_2(t) \mid x_1(0:t), u_1(0:t-1)]$. As shown below, $\breve{x}_2(t)$ is just $\hat{x}_2(t)$:
\begin{align*}
\breve{x}_2(t) &= E\big[E[x_2(t) \mid \mathcal{H}_1(t), \mathcal{H}_2(t)] \,\big|\, x_1(0:t), u_1(0:t-1)\big] \\
&= E[x_2(t) \mid x_1(0:t), u_1(0:t-1)] \\
&= E\big[E[x_2(t) \mid \mathcal{H}_1(t)] \,\big|\, x_1(0:t), u_1(0:t-1)\big] = \hat{x}_2(t),
\end{align*}
where the second equality follows from the tower rule, the third from the fact that, given $\mathcal{H}_1(t)$, $x_1(t)$ provides no additional information about $x_2(t)$, and the last from the fact that $\hat{x}_2(t)$ is a function of $(x_1(0:t-1), u_1(0:t-1))$.

Note that $\bar{u}(t)$ was allowed to be any linear function of the histories $\mathcal{H}_1(t)$ and $\mathcal{H}_2(t)$. However, $\bar{u}(t)$ finally depends on just the summary statistics $\hat{x}(t)$ and $\hat{\hat{x}}(t)$, thereby proving that these information statistics are indeed sufficient to compute the optimal control law. From (2.35) and (2.30), the optimal decentralized control law $u(t)$ is obtained as
\[
u(t) = F(t)\begin{bmatrix} x_1(t) - \hat{x}_1(t) \\ x_2(t) - \hat{\hat{x}}_2(t) \end{bmatrix} + \tilde{K}(t). \tag{2.36}
\]

Proof of Theorem 4. We need to solve for the optimal control law of the modified system (2.32) with the cost function defined by (2.34).
Assuming that the optimal gain matrix $F(t)$ is unknown, we replace it by an arbitrary $F(t)$ and solve for it by deterministic matrix optimization. From Theorem 3, using the modified input $\bar{u}(t) := u(t) - \begin{bmatrix} F_{11}(t)\varepsilon(t) \\ F_{22}(t)\eta_2(t) \end{bmatrix}$, we have
\[
\bar{u}(t) = H(t)\tilde{x}(t), \tag{2.37}
\]
where $H(t) = \begin{bmatrix} K_{11}(t) & K_{12}(t) & -K_{12}(t) \\ K_{21}(t) & K_{22}(t) & J(t) - K_{22}(t) \end{bmatrix}$ and $\tilde{x}(t) := \begin{bmatrix} \hat{x}_1(t) \\ \hat{\hat{x}}_2(t) \\ \hat{\hat{x}}_2(t) - \hat{x}_2(t) \end{bmatrix}$.

The dynamics of $\tilde{x}(t)$ are obtained from (2.18) and (2.32) as
\[
\tilde{x}(t+1) = \big(\tilde{A} + G(t)\big)\tilde{x}(t) + n(t), \tag{2.38}
\]
where $\tilde{A} := \begin{bmatrix} A & 0 \\ 0 & A_{22} \end{bmatrix}$, $G(t) := \begin{bmatrix} BH(t) \\ \begin{bmatrix} 0 & 0 & B_{22}J(t) \end{bmatrix} \end{bmatrix}$, and $n(t) := \begin{bmatrix} \tilde{v}(t) \\ \tilde{v}_2(t) - (R(t)\varepsilon(t))_2 \end{bmatrix}$. Let $\Sigma_F(t)$ denote the covariance of $n(t)$.

Since $F(t)$ minimizes (2.33), keeping only the terms that depend on $F(t)$, the relevant part of the optimization is
\[
J_F = \min_{F(t)} E\Big[\sum_{t=0}^{N-1}\big(\tilde{x}(t)'\tilde{Q}\tilde{x}(t) + \tilde{x}(t)'H(t)'RH(t)\tilde{x}(t) + e(t)'F(t)'RF(t)e(t)\big) + \tilde{x}(N)'\tilde{S}\tilde{x}(N)\Big], \tag{2.39}
\]
where $\tilde{Q} = \begin{bmatrix} Q & 0 \\ 0 & 0 \end{bmatrix}$ and $\tilde{S} = \begin{bmatrix} S & 0 \\ 0 & 0 \end{bmatrix}$. Using matrix algebra, (2.39) simplifies to
\[
J_F = \min_{F(t)} \sum_{t=0}^{N-1}\mathrm{trace}\big[\big(\tilde{Q} + H(t)'RH(t)\big)\Sigma(t)\big] + \mathrm{trace}\big[\tilde{S}\Sigma(N)\big] + \sum_{t=0}^{N-1}\mathrm{trace}\big[F(t)'RF(t)\bar{T}(t)\big], \tag{2.40}
\]
where we have defined $\Sigma(t) := E[\tilde{x}(t)\tilde{x}(t)']$, with dynamics obtained from (2.38) as
\[
\Sigma(t+1) = \big(\tilde{A} + G(t)\big)\Sigma(t)\big(\tilde{A} + G(t)\big)' + \Sigma_F(t), \quad \Sigma(0) = E[\tilde{x}(0)\tilde{x}(0)'] = 0, \tag{2.41}
\]
where $\Sigma_F(t)$ is the covariance matrix of $n(t)$. The optimal gain matrix $F(t)$ may now be obtained using the discrete matrix minimum principle [21] through the Hamiltonian minimizing (2.39):
\begin{align*}
\mathcal{H} = \ & \mathrm{trace}\big[\big(\tilde{Q} + H(t)'RH(t)\big)\Sigma(t)\big] + \mathrm{trace}\big[F(t)'RF(t)\bar{T}(t)\big] \\
&+ \mathrm{trace}\big[\big(\tilde{A} + G(t)\big)\Sigma(t)\big(\tilde{A} + G(t)\big)'M(t+1)'\big] + \mathrm{trace}\big[\Sigma_F(t)M(t+1)'\big] + \mathrm{trace}\big[2F(t)\Lambda(t)'\big], \tag{2.42}
\end{align*}
where $\Lambda(t) = \begin{bmatrix} 0 & \Lambda_{12}(t) \\ \Lambda_{21}(t) & 0 \end{bmatrix}$ is the Lagrange multiplier matrix, partitioned in the same way as $F(t)$, and $M(t)$ is the costate matrix.
The optimality conditions are
\[
\frac{\partial \mathcal{H}}{\partial F(t)} = 0, \qquad
\frac{\partial \mathcal{H}}{\partial M(t+1)} = \Sigma(t+1), \quad \Sigma(0) = 0, \qquad
\frac{\partial \mathcal{H}}{\partial \Sigma(t)} = M(t), \quad M(N) = \frac{\partial\, \mathrm{trace}[\tilde{S}\Sigma(N)]}{\partial \Sigma(N)}.
\]
Solving the above equations, we get
\[
2RF(t)\bar{T}(t) + \frac{\partial\, \mathrm{trace}[\Sigma_F(t)M(t+1)']}{\partial F(t)} + 2\Lambda(t) = 0,
\]
\[
\Sigma(t+1) = \big(\tilde{A} + G(t)\big)\Sigma(t)\big(\tilde{A} + G(t)\big)' + \Sigma_F(t), \quad \Sigma(0) = 0,
\]
\[
M(t) = \tilde{Q} + H(t)'RH(t) + \big(\tilde{A} + G(t)\big)'M(t+1)\big(\tilde{A} + G(t)\big), \quad M(N) = \tilde{S}. \tag{2.43}
\]
The noise term $n(t)$ in (2.38) can be rewritten as
\[
n(t) = \begin{bmatrix}
B_{11}F_{11}(t)\varepsilon(t) + (R(t)\varepsilon(t))_1 \\
B_{21}F_{11}(t)\varepsilon(t) + B_{22}F_{22}(t)\eta_2(t) + (\bar{R}(t)\eta(t))_2 \\
B_{22}F_{22}(t)\eta_2(t) + (\bar{R}(t)\eta(t))_2 - (R(t)\varepsilon(t))_2
\end{bmatrix}
= \tilde{B}F(t)e(t) + n_1(t),
\]
where $\tilde{B} := \begin{bmatrix} B_{11} & 0 \\ B_{21} & B_{22} \\ 0 & B_{22} \end{bmatrix}$ and $n_1(t) := \begin{bmatrix} (R(t)\varepsilon(t))_1 \\ (\bar{R}(t)\eta(t))_2 \\ (\bar{R}(t)\eta(t))_2 - (R(t)\varepsilon(t))_2 \end{bmatrix}$. Then,
\begin{align*}
\Sigma_F(t) &= E\big[\big(\tilde{B}F(t)e(t) + n_1(t)\big)\big(\tilde{B}F(t)e(t) + n_1(t)\big)'\big] \\
&\equiv E\big[\tilde{B}F(t)e(t)n_1(t)' + \tilde{B}F(t)e(t)\big(\tilde{B}F(t)e(t)\big)' + n_1(t)n_1(t)' + n_1(t)\big(\tilde{B}F(t)e(t)\big)'\big] \\
&= \tilde{B}F(t)V(t) + \tilde{B}F(t)\bar{T}(t)F(t)'\tilde{B}' + V(t)'F(t)'\tilde{B}' + E[n_1(t)n_1(t)'],
\end{align*}
where $V(t) = E[e(t)n_1(t)']$ and the equivalence in the second line is from the point of view of taking the partial derivative of $\Sigma_F(t)$ with respect to $F(t)$. Taking the partial derivative of $\mathrm{trace}[\Sigma_F(t)M(t+1)']$,
\[
\frac{\partial\, \mathrm{trace}[\Sigma_F(t)M(t+1)']}{\partial F(t)}
= \tilde{B}'M(t+1)V(t)' + \tilde{B}'M(t+1)\tilde{B}F(t)\bar{T}(t) + \tilde{B}'M(t+1)'\tilde{B}F(t)\bar{T}(t) + \tilde{B}'M(t+1)'V(t)'.
\]
Substituting into the optimality condition (2.43), we get
\begin{align*}
F(t) &= -\big(2R + \tilde{B}'M(t+1)\tilde{B} + \tilde{B}'M(t+1)'\tilde{B}\big)^{-1}\big(2\Lambda(t) + \tilde{B}'M(t+1)V(t)' + \tilde{B}'M(t+1)'V(t)'\big)\bar{T}^{-1}(t) \\
&= -\big(R + \tilde{B}'M(t+1)\tilde{B}\big)^{-1}\big(\Lambda(t) + \tilde{B}'M(t+1)V(t)'\big)\bar{T}^{-1}(t), \tag{2.44}
\end{align*}
where the last equality uses the symmetry of $M(t)$.
2.5 Future Work

We believe that the approach of combining linearity and summary statistics in the manner of our work can be used to compute optimal control laws for more general asymmetric information sharing scenarios. Although we have not been able to complete this task thus far, we are hopeful that the tools presented here can be used to answer these broader questions.

1. Extension to graphs with delayed communication over nodes. We conjecture that results along these lines can be extended to graphs with edges between nodes satisfying certain delay constraints. Work of a similar vein is found in [36]. Essentially, the idea here is to consider graph topologies that retain the nested structure of plant dynamics and control laws as studied in this chapter, allowing us to generalize results directly to such networks.

2. Extension to delays of multiple units. This is perhaps the most challenging problem: existing work has shown the difficulty of obtaining optimal control laws in a general delayed sharing setting [51]. With the exception of a structural result [37], there is no known solution. However, for suitably constrained plant matrices $A$, $B$, $C$, it may be possible to tractably solve for optimal laws.

Chapter 3

Decentralized Learning in Multiarmed Bandit Problems

We consider the problem of learning in multiarmed bandit models. These problems belong to a class of online learning problems that capture exploration versus exploitation tradeoffs. Policies for both single-player and multiplayer settings are proposed and analysed. In a multiarmed bandit, a player can pick one among many arms. Each time a player picks an arm, she gets an i.i.d. reward. However, the reward distribution is assumed to be unknown. In the multiplayer setting, arms may give different rewards to different players. If two or more players pick the same arm, there is a "collision" and neither gets anything. There is no dedicated control channel for coordination or communication among the players.
Players can communicate implicitly through their actions, but this is costly and adds to regret. We propose two decentralizable policies, $E^3$ and $E^3TS$, that can be used in both single- and multiplayer scenarios. These policies are shown to yield expected regret that grows at most as near-$O(\log T)$. It is well known that $\log T$ is the lower bound on the rate of growth of regret even in the centralized case. The proposed algorithms improve on prior work where regret grew as $O(\log^2 T)$. More fundamentally, these policies and their analyses suggest that in decentralized online learning, there is no additional cost due to "decentralization" in terms of the order of regret. This solves a problem of great relevance in many domains, in particular spectrum sharing among cognitive radios, that had been open for a while.

3.1 Introduction

Consider a simple game of choosing between two coins with unknown bias. The coins are chosen repeatedly. If, at a given instance, a coin turns up heads, we get a reward of \$1; otherwise we get no reward. One of the two coins has a better bias. The question is what is the
This result was subsequently generalized by many people. In [5], Anantharam, et al generalized this to the case of multiple plays, i.e., the player can pick multiple arms (or coins) when there are more than 2 arms. In [1], Agrawal proposed a sample mean based index policy that asymptotically achieved logT regret. For the special case of bounded support for rewards, Auer, et al [6] introduced a simpler index-based UCB 1 algorithm that achieved logarithmic expected regret non-asymptotically. Recently, there has been interest in multi-player learning in multi-armed bandits. The motivation comes from various contexts. Suppose there are two wireless users trying to choose between two wireless channels. Each wireless channel is random, and \looks" dierent for each user. If channel statistics were known, we would like to determine a matching wherein the expected sum-rate of the two users is maximized. But channel statistics are unknown, and they must be learnt by sampling the channels. Moreover, the two users have to do this independently and cannot share their observations. Furthermore, the users have no dedicated communication channel between them. They, however, may communicate implicitly for coordination, if they so choose but this would come at the expense of reduced rewards, and thus would add to regret. One can easily imagine a more general setting wherein we haveM users who have to be matched toN >M arms. There are two questions here. First, can one construct an index-based decentralized learning algorithm that achieves poly-log, or at least sub-linear expected regret? Second, what is the lower bound on the expected regret? Is it still logT , or is there a fundamental cost of decentralization that one must pay? The rst question was answered in [20], wherein an index-based decentralized learning algorithm called dUCB 4 was proposed. It was shown that it achieves expected regret that grows as log 2 T . 
The algorithm proceeds by running a distributed matching algorithm in the exploration phase, which could return a sub-optimal but feasible matching in finite time. It turned out that this suffices to achieve log-squared regret. From Lai and Robbins [24], we know that the lower bound on expected regret for any (causal) learning algorithm is $\log T$ in the centralized case. Hence, that is still a lower bound for the decentralized learning problem. What was unknown is whether there is a multiplicative logarithmic cost to pay due to decentralization. Attempts at an information-theoretic characterization of this have not been successful, since this problem is similar to a distributed universal source coding problem that has been open for several decades. What we do know from information theory is that in distributed source coding (when source statistics are known, and the sources could be correlated), there is no "loss" in lossless source coding when it is done in a distributed manner [12]. This can be accomplished via a Slepian-Wolf coding scheme [42]. Such a result, however, is not available for distributed universal source coding. Related work by Anandkumar et al. [4] considers a more simplistic model where channel rewards are assumed to be the same for all users. This reduces the problem to a learning version of a ranking problem, instead of the matching problem that appears in our assumed model.

In this work, we do not present an information-theoretic lower bound on decentralized learning in a multi-player multi-armed bandit problem. Such a result would be very interesting, as it would also yield insight into the exact role of information sharing between the players for the decentralized algorithm to work without an increase in expected regret. Instead, we present two decentralizable policies, $E^3$ and $E^3TS$, where $E^3$ stands for Exponentially-spaced Exploration and Exploitation.
Both policies yield expected regret growing at most as $\log T$ (near-$\log T$ with some assumptions relaxed) in both single- and multiplayer settings, which is order optimal. The policies are based on exploration and exploitation in pre-determined phases such that, over a long horizon $T$, there are only logarithmically many slots in the exploration phases. We note that in the exploration phase, a distributed matching algorithm is run that produces an approximately optimal matching in finite time. We get only an approximately optimal matching because (i) the algorithm can only produce an optimal matching asymptotically, and we can at most run it for finite time, and (ii) the players need to communicate the values of their indices, which are real-valued, to each other. Hence, these can be communicated only with a certain precision using a certain number of bits. It so turns out that, despite all this, the decentralized learning algorithm can achieve logarithmic regret, which is optimal. This also gives a non-information-theoretic answer to the fundamental question of whether there is an inherent cost to decentralized learning. The short answer is No, as far as order is concerned. We note that a deterministic sequence learning algorithm for a single-player setting was also introduced in [50].

The algorithms introduced in this work, and the corresponding results, hold even when the rewards are Markovian, i.e., each arm is a rested arm but the reward sequence from it forms a Markov chain [48]. However, we only present the i.i.d. case here and refer readers to our earlier paper [20] for ideas on extensions to the Markovian setting. Extensive simulations were conducted to evaluate the empirical performance of these policies, compared against existing policies in the literature, including the classical UCB$_1$ and Thompson Sampling policies. The decentralized policies dE$^3$ and dE$^3$TS are compared with the previously known dUCB$_4$ policy.

The rest of the work is organized as follows.
Section 3.2 deals with the classical single-player multiarmed bandit problem. The model and problem formulation are introduced in Section 3.2.1. Sections 3.2.2 and 3.2.3 deal with existing and proposed policies for the single-player bandit problem, respectively. The empirical performance of the algorithms is illustrated in Section 3.2.4. In Section 3.3, the multiplayer setting is considered. Section 3.3.1 introduces the model and formulates the problem. Existing and proposed policies for the multiplayer setting appear in Sections 3.3.2 and 3.3.3, respectively. Section 3.3.4 presents simulation results.

3.2 Single-player bandit problems

Of the several existing algorithms known for single-player multiarmed bandit problems, we highlight three in this section before proposing two new algorithms whose desirable properties lend them to decentralization in multiplayer bandit problems. The three existing algorithms either fail to decentralize or perform considerably worse in multiplayer settings. Each of these algorithms applies to the problem formulation given below.

3.2.1 Model and Problem Formulation

We consider an $N$-armed bandit problem. At each instant $t$, an arm $k$ is chosen, and a reward $X_k(t)$ is obtained. Each play of an arm generates a reward in an i.i.d. manner from a fixed, unknown distribution that is assumed to have bounded support and, without loss of generality, to lie in $[0,1]$. The arms differ in the mean reward $\mu_k$ associated with them. No information is assumed to be lost, and access to the entire history of rewards and actions, $\mathcal{H}(t)$, is available to the player; $\mathcal{H}(0) := \emptyset$. Denote the arm chosen at time $t$ by $a(t) \in \mathcal{A} := \{1, \ldots, N\}$. A policy $\pi$ is a sequence of maps $\pi(t) : \mathcal{H}(t) \to \mathcal{A}$ that specifies the arm to be chosen at time $t$ given the history $\mathcal{H}(t)$. The player's objective is to choose a policy that maximizes the expected reward over a finite time horizon $T$.
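The formulation above can be exercised with a small simulation model; a minimal sketch with illustrative Bernoulli arms follows (the arm means are assumptions for illustration, not values from the text).

```python
import numpy as np

class BernoulliBandit:
    """N-armed bandit with i.i.d. rewards in {0, 1} and unknown means,
    matching the model of Section 3.2.1 (rewards bounded in [0, 1])."""
    def __init__(self, mu, seed=0):
        self.mu = np.asarray(mu)
        self.rng = np.random.default_rng(seed)

    def pull(self, arm):
        """Play `arm` and return an i.i.d. Bernoulli(mu[arm]) reward."""
        return float(self.rng.random() < self.mu[arm])

bandit = BernoulliBandit([0.9, 0.5, 0.3])   # illustrative means
r = bandit.pull(0)                           # reward in {0.0, 1.0}
```

Any of the policies discussed below can be run against such an environment, with the policy seeing only the realized rewards, never the means.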
If the mean rewards of the arms were known, the problem would be trivially solved by always playing the arm with the highest mean reward, i.e., $\pi(t) = \arg\max_{1 \le i \le N} \mu_i$ for all $t$. However, since the mean rewards are not known, the notion of regret is used to compare policies. Regret is the difference between the total reward that would be obtained over the time horizon if the mean arm rewards were known (in which case the arm with the highest mean reward would always be chosen) and the total reward obtained by the policy. Formally, the player's objective is to minimize the expected regret, which is given by
\[
R^{\pi}(T) = T\mu^* - E\left[\sum_{t=1}^{T} X_{\pi(t)}(t)\right], \tag{3.1}
\]
where $\mu^*$ is the highest mean reward. In the subsequent discussion of the single-player multi-armed bandit problem, without loss of generality, arm 1 will be assumed to have the highest mean reward.

In the following section, three existing algorithms will be surveyed along with their expected regret. Detailed simulations comparing the performance of all the algorithms appear in Section 3.2.4.

3.2.2 UCB$_1$, Thompson Sampling and UCB$_4$

The UCB$_1$ algorithm was proposed by Auer et al. in their seminal paper [6]. Although it was among the first multiarmed bandit algorithms proven to have logarithmic growth in expected regret, it is still considered one of the most powerful algorithms and serves as a benchmark against which to test new algorithms. The algorithm, presented in Algorithm 1, is an index-based policy that, at each time instant $t$, chooses the arm with the highest value of a confidence bound index, defined as $g_j(t)$ for arm $j$:
\[
g_j(t) := \bar{X}_j(t) + \sqrt{\frac{2\log(t)}{n_j(t)}}, \tag{3.2}
\]
where $\bar{X}_j(t)$ is the average reward obtained by playing arm $j$ and $n_j(t)$ is the number of plays of arm $j$ by time $t$.
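The index (3.2) can be turned into a working policy in a few lines. The sketch below is a self-contained illustration against Bernoulli arms; the arm means are assumptions for illustration, not values from the text.

```python
import numpy as np

def ucb1(mu, T, rng):
    """UCB1: play each arm once, then always play the arm maximizing
    g_j(t) = mean_j + sqrt(2 log t / n_j), as in (3.2)."""
    N = len(mu)
    counts = np.zeros(N)
    sums = np.zeros(N)
    for t in range(1, T + 1):
        if t <= N:
            j = t - 1                                   # initialization: each arm once
        else:
            g = sums / counts + np.sqrt(2 * np.log(t) / counts)
            j = int(np.argmax(g))
        r = float(rng.random() < mu[j])                 # Bernoulli reward
        counts[j] += 1
        sums[j] += r
    return counts

rng = np.random.default_rng(2)
mu = np.array([0.8, 0.5, 0.2])                          # illustrative means
counts = ucb1(mu, T=20_000, rng=rng)
```

Over a long horizon, the best arm accounts for the vast majority of the plays, while each suboptimal arm is played only $O(\log T/\Delta_j^2)$ times, in line with the bound quoted below.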
It was shown that the expected regret incurred by the UCB$_1$ policy over a time horizon $T$ is bounded by [6]
\[
R^{UCB_1}(T) \le 8\log(T)\sum_{j>1}\frac{1}{\Delta_j} + \left(1 + \frac{\pi^2}{3}\right)\sum_{j>1}\Delta_j, \tag{3.3}
\]
where $\Delta_j := \mu_1 - \mu_j$.

Algorithm 1: UCB$_1$ [6]
1: Initialization: For each arm $j = 1, 2, \ldots, N$, play the arm once and update $g_j(1)$.
2: while ($t \le T$) do
3:   Play arm $i(t) := \arg\max_j g_j(t)$ and observe reward $r_t$.
4:   Update $g_j(t)$ for all arms $j = 1, 2, \ldots, N$.
5: end while

The Thompson Sampling algorithm has been around for quite some time in the literature [49], although it was not well studied in the context of bandit problems until quite recently [11, 2]. This algorithm, presented in Algorithm 2, is based on the idea of probability matching: at each time instant, an arm is played randomly according to its probability $\theta_i(t)$ of being optimal. Unlike a full Bayesian method such as the Gittins Index [15], Thompson Sampling can be implemented efficiently in bandit problems.

Algorithm 2: Thompson Sampling [2]
1: Initialization: For each arm $i = 1, 2, \ldots, N$, set $S_i = 0$, $F_i = 0$.
2: while ($t \le T$) do
3:   For each arm $i$, sample $\theta_i(t)$ from the Beta($S_i + 1, F_i + 1$) distribution.
4:   Play arm $i(t) := \arg\max_i \theta_i(t)$ and observe reward $\tilde{r}_t$.
5:   Perform a Bernoulli trial with success probability $\tilde{r}_t$ and observe output $r_t$.
6:   If $r_t = 1$, then $S_{i(t)} = S_{i(t)} + 1$, else $F_{i(t)} = F_{i(t)} + 1$.
7: end while

The regret of the Thompson Sampling algorithm over a time horizon $T$ is bounded by [2]
\[
R^{TS}(T) \le O\left(\left(\sum_{j>1}\frac{1}{\Delta_j^2}\right)^2 \log(T)\right), \tag{3.4}
\]
where the constants have been omitted for brevity. While Thompson Sampling has been found to empirically outperform UCB$_1$ in most settings [11], the upper bounds guaranteeing the performance of the algorithm are weaker.

UCB$_4$ is another confidence-bound-based index policy that was proposed recently to overcome some of the shortcomings of the UCB$_1$ policy, namely its reliance on frequent index computation and the difficulty of extending the algorithm to a multiplayer setting.
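Algorithm 2 admits an equally short implementation. The sketch below specializes to Bernoulli arms, so the reward itself plays the role of the Bernoulli trial in step 5; the arm means are illustrative assumptions, not values from the text.

```python
import numpy as np

def thompson_sampling(mu, T, rng):
    """Beta-Bernoulli Thompson Sampling (Algorithm 2): sample
    theta_i ~ Beta(S_i + 1, F_i + 1), play the argmax, update counts."""
    N = len(mu)
    S = np.zeros(N)       # successes per arm
    F = np.zeros(N)       # failures per arm
    counts = np.zeros(N)
    for _ in range(T):
        theta = rng.beta(S + 1, F + 1)
        i = int(np.argmax(theta))
        r = float(rng.random() < mu[i])   # Bernoulli reward
        S[i] += r
        F[i] += 1 - r
        counts[i] += 1
    return counts

rng = np.random.default_rng(3)
counts = thompson_sampling(np.array([0.8, 0.5, 0.2]), T=20_000, rng=rng)
```

The posterior sampling step concentrates on the best arm as its Beta posterior sharpens, which is the probability-matching behavior described above.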
The index $h_j(t)$ is very similar to the UCB$_1$ index, and the full policy is presented in Algorithm 3:

$$h_j(t) := \bar{X}_j(t) + \sqrt{\frac{3\log t}{n_j(t)}}. \qquad (3.5)$$

Algorithm 3: UCB$_4$
1: Initialization: Select each arm $j$ once for $t \le N$. Update the UCB$_4$ indices $h(t)$. Set $\tau = 1$.
2: while ($t \le T$) do
3:   if ($\tau = 2^p$ for some $p = 0, 1, 2, \ldots$) then
4:     Update the index vector $h(t)$;
5:     Compute the best arm $j^*(t)$;
6:     if ($j^*(t) \ne j^*(t-1)$) then
7:       Reset $\tau = 1$;
8:     end if
9:   else
10:    $j^*(t) = j^*(t-1)$;
11:  end if
12:  Play arm $j^*(t)$;
13:  Increment counter $\tau = \tau + 1$; $t = t + 1$;
14: end while

The expected regret of the UCB$_4$ algorithm over a time horizon $T$ is bounded by [20]

$$R_{\mathrm{UCB}_4}(T) \le \Delta_{\max} \left( \sum_{j>1} \frac{12\log(T)}{\Delta_j^2} + 2N \right). \qquad (3.6)$$

This being a looser bound than that of UCB$_1$, at first glance there might appear to be no advantage to using this policy. The strength of the UCB$_4$ algorithm, however, lies in its ability to incorporate a cost each time an index is computed, without losing much in terms of regret performance.

Computation costs are an important practical consideration because learning algorithms often involve a penalty or cost for computation or communication. This can be the case when there is a difficult optimization problem to be solved, or when an optimization problem must be solved in a distributed manner and information must be shared to find the optimal solution. In a multiplayer bandit setting, the latter is indeed the case, and we find it instructive to model the cost into the analysis for the single-player setting as well. Algorithms that do not require any communication to coordinate allocation would be exceptions to this cost assumption; however, we are not aware of any that exist. In fact, we do not believe such a policy can exist under the model assumed.

Let the computation cost be $C$ units, and let $m(t)$ denote the number of times the index is computed by time $t$.
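The distinguishing feature of UCB$_4$, its exponentially spaced index recomputation, can be sketched as follows. This is a minimal single-player sketch; the samplers, horizon, and the returned computation count are illustrative assumptions. The count of index computations is the quantity that the cost $C$ multiplies in the analysis.

```python
import math

def ucb4(reward_fns, horizon):
    """Sketch of the UCB4 schedule: the index h(t) is recomputed only
    when the frame counter tau hits a power of two, and tau resets to 1
    whenever the best arm changes. reward_fns are hypothetical samplers
    returning rewards in [0, 1]."""
    n = len(reward_fns)
    counts = [0] * n
    means = [0.0] * n
    computations = 0
    # Initialization: play each arm once.
    for j in range(n):
        r = reward_fns[j]()
        counts[j], means[j] = 1, r
    best, tau, t = 0, 1, n
    while t < horizon:
        t += 1
        if tau & (tau - 1) == 0:        # tau is a power of two
            computations += 1           # index recomputation (cost C each)
            h = [means[j] + math.sqrt(3.0 * math.log(t) / counts[j])
                 for j in range(n)]
            new_best = max(range(n), key=lambda j: h[j])
            if new_best != best:
                best, tau = new_best, 0  # reset frame counter
        r = reward_fns[best]()
        counts[best] += 1
        means[best] += (r - means[best]) / counts[best]
        tau += 1
    return computations, counts
```

Because recomputations occur only at exponentially spaced points within each frame, their number stays far below the number of plays, which is what keeps the cost term sub-linear.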
Then, under policy $\pi$, the expected regret becomes

$$\tilde{R}^{\pi}(T) := \sum_{j=1}^{N} \Delta_j \mathbb{E}[n_j(T)] + C\,\mathbb{E}[m(T)]. \qquad (3.7)$$

It can be seen that both the UCB$_1$ and Thompson Sampling algorithms give linear regret if a computation cost is included in the model. The expected regret of the UCB$_4$ algorithm over a time horizon $T$ with computation cost $C$ is bounded by [20]

$$\tilde{R}_{\mathrm{UCB}_4}(T) \le \left(\Delta_{\max} + C(1 + \log(T))\right) \left( \sum_{j>1} \frac{12\log(T)}{\Delta_j^2} + 2N \right). \qquad (3.8)$$

Thus, the expected regret is $O(\log^2(T))$.

It is worth mentioning here that UCB$_2$ [6] is another policy that achieves sub-linear expected regret when computation cost is modeled into the problem, actually achieving better performance than UCB$_4$, with logarithmic regret growth. However, the UCB$_2$ policy cannot be implemented in a multiplayer setting and, consequently, did not warrant further interest.

From the above discussion, it can be noted that although all bounds are tight in the order of optimality, the tightest bounds have been proved for UCB$_1$, followed by UCB$_4$ and Thompson Sampling. The empirical performance of the policies is presented in Section 3.2.4.

3.2.3 E$^3$ and E$^3$-TS

Up to this point, the algorithms presented were already known. In this section, we propose two new algorithms that achieve logarithmic regret even when a computation cost is included in the model. Additionally, these algorithms lend themselves to easy decentralization in a multiplayer multi-armed bandit problem.

Before proceeding with the regret analysis of these algorithms, it will be useful to add another dimension to the bandit problem. In practical implementations of an index-based algorithm, the indices are real numbers and may be known or computed only up to a certain precision. For example, an implementation may use only a small number of bits to represent indices, or there may be a cost to compute the indices to a greater precision.
This finite index precision becomes especially relevant in the multiplayer setting, where we will see that indices must be communicated to other players, and only a finite number of bits can be communicated in finite time. Thus, a learning algorithm should be robust and able to work with indices of limited precision. This is also important because, in empirical runs of index-based algorithms (e.g., UCB$_1$), the indices of the two best arms closely track each other even when their expected rewards are quite different. Whether the performance of an index-based policy is affected by limited index precision is therefore an important consideration, and it is helpful to include it in the overall regret analysis.

The E$^3$ and E$^3$-TS policies are presented in Algorithms 4 and 5, respectively. They are variants of the PEE policy [50].

Algorithm 4: Exponentially-spaced Exploration and Exploitation (E$^3$)
1: Initialization: Select each arm $\gamma$ times. Set $t = \gamma N$ and $l = 0$.
2: while ($t \le T$) do
3:   Main Loop: At each epoch $l$:
4:     Exploration Phase: Play each arm $j$, $1 \le j \le N$, $\gamma$ number of times;
5:     Update the sample mean $\bar{X}_j(l)$ for all $j$, $1 \le j \le N$;
6:     Compute the best arm $j^*(l) := \arg\max_{1 \le j \le N} \bar{X}_j(l)$;
7:     Exploitation Phase: Play arm $j^*(l)$ for $2^l$ time slots;
8:     Update $t = t + \gamma N + 2^l$, $l = l + 1$;
9: end while

Algorithm 5: Exponentially-spaced Exploration and Exploitation with Thompson Sampling (E$^3$-TS)
1: Initialization: For each arm $i = 1, 2, \ldots, N$, set $S_i = 0$, $F_i = 0$.
2: while ($t \le T$) do
3:   Main Loop: At each epoch $l$:
4:     Exploration Phase: Play each arm $j$, $1 \le j \le N$, $\gamma$ number of times;
5:     For each play of each arm $i$, sample $\theta_i(t)$ from the Beta$(S_i + 1, F_i + 1)$ distribution and obtain reward $\tilde{r}_i(t)$.
6:     Perform a Bernoulli trial with success probability $\tilde{r}_i(t)$ and observe output $r_i(t)$.
7:     If $r_i(t) = 1$, then set $S_i = S_i + 1$, else $F_i = F_i + 1$.
8:     Exploitation Phase: Compute the best arm $j^*(l) := \arg\max_{1 \le i \le N} \theta_i(l)$;
9:     Play arm $j^*(l)$ for $2^l$ time slots;
10:    Update $t = t + \gamma N + 2^l$, $l = l + 1$;
11: end while

We now present performance bounds for the two algorithms when there is an index computation cost $C$, as well as when the indices have limited precision. We first show that if $\Delta_{\min}$ is known, we can fix a precision $0 < \epsilon < \Delta_{\min}$ so that the E$^3$ algorithm achieves logarithmic regret growth. If $\Delta_{\min}$ is not known, we can pick a positive monotone sequence $\{\epsilon_t\}$ such that $\epsilon_t \to 0$ as $t \to \infty$.

Both algorithms will be analyzed concurrently, as their proof techniques are largely similar. The following concentration inequality will be used in the analysis and is introduced here for the reader's ease.

Fact 1 (Chernoff-Hoeffding inequality [9]). Let $X_1, \ldots, X_t$ be random variables with a common range such that $\mathbb{E}[X_t \mid X_1, \ldots, X_{t-1}] = \mu$. Let $S_t = \sum_{i=1}^{t} X_i$. Then for all $a \ge 0$,

$$P(S_t \ge t\mu + a) \le e^{-2a^2/t}, \quad \text{and} \qquad (3.9)$$
$$P(S_t \le t\mu - a) \le e^{-2a^2/t}. \qquad (3.10)$$

Theorem 7 (Regret bounds for E$^3$ and E$^3$-TS policies).

1. Indices are known exactly.

(a) If $\Delta_{\min}$ is known, set $\gamma := \lceil 2/\Delta_{\min}^2 \rceil$ for E$^3$ and $\gamma := \lceil 8/\Delta_{\min}^2 \rceil$ for E$^3$-TS. Then, the expected regret of the E$^3$ and E$^3$-TS policies with computation cost $C$ is

$$\tilde{R}_{E^3}(T) \le \gamma N \log(T)\,\Delta_{\max} + NC\log(T) + 8N\Delta_{\max},$$
$$\tilde{R}_{E^3\text{-TS}}(T) \le \gamma N \log(T)\,\Delta_{\max} + NC\log(T) + 16N\Delta_{\max}. \qquad (3.11)$$

(b) If $\Delta_{\min}$ is not known, then choose $\gamma = \gamma_t$, where $\{\gamma_t\}$ is a positive sequence such that $\gamma_t \to \infty$ as $t \to \infty$. Then, $\tilde{R}_{E^3}(T) \le O(N(\gamma_T + C)\log(T))$, and likewise $\tilde{R}_{E^3\text{-TS}}(T) \le O(N(\gamma_T + C)\log(T))$. Thus, by choosing an arbitrarily slowly increasing sequence $\{\gamma_t\}$, we can make the regret arbitrarily close to $\log(T)$.

2. Indices are known with finite precision.

(a) If $\Delta_{\min}$ is known, choose an $\epsilon$ with $0 < \epsilon < \Delta_{\min}$, and set $\gamma := \gamma(\epsilon) = \lceil 2/(\Delta_{\min} - \epsilon)^2 \rceil$ for E$^3$ and $\gamma := \gamma(\epsilon) = \lceil 8/(\Delta_{\min} - \epsilon)^2 \rceil$ for E$^3$-TS.
Then, the expected regret of the E$^3$ and E$^3$-TS policies with $\epsilon$-precise computations is given by

$$\tilde{R}_{E^3}(T) \le \gamma(\epsilon) N \log(T)\,\Delta_{\max} + NC\log(T) + 8N\Delta_{\max},$$
$$\tilde{R}_{E^3\text{-TS}}(T) \le \gamma(\epsilon) N \log(T)\,\Delta_{\max} + NC\log(T) + 16N\Delta_{\max}. \qquad (3.12)$$

Thus, $\tilde{R}_{E^3}(T) = O(\log T)$.

(b) If $\Delta_{\min}$ is not known, then choose $\gamma = \gamma_t$, where $\{\gamma_t\}$ is a positive sequence such that $\gamma_t \to \infty$ as $t \to \infty$, and choose $\epsilon = \epsilon_t$, where $\{\epsilon_t\}$ is a positive sequence such that $\epsilon_t \to 0$ as $t \to \infty$. Then, $\tilde{R}_{E^3}(T) \le O(N(\gamma_T + C)\log(T))$, and likewise $\tilde{R}_{E^3\text{-TS}}(T) \le O(N(\gamma_T + C)\log(T))$. Thus, by choosing an arbitrarily slowly increasing sequence $\{\gamma_t\}$ and an arbitrarily slowly decreasing sequence $\{\epsilon_t\}$, we can make the regret arbitrarily close to $\log(T)$.

Proof. We first consider the case when indices are known precisely and only an index computation cost $C$ exists; we then extend these results to indices with finite precision. In all cases, the policy subscript will be omitted where the context refers to both the E$^3$ and E$^3$-TS policies. Also, $R$ will be used to denote the regret of either policy when both are within context.

1) Indices are known exactly.

a) $\Delta_{\min}$ known: We denote the expected regret incurred in the exploration phases by $\tilde{R}^O(T)$ and the expected regret incurred in the exploitation phases by $\tilde{R}^I(T)$. Also, let $\tilde{R}^C(T)$ be the expected regret due to computation cost. Then,

$$\tilde{R}(T) = \tilde{R}^O(T) + \tilde{R}^I(T) + \tilde{R}^C(T), \qquad (3.13)$$

for both $\tilde{R}_{E^3}(T)$ and $\tilde{R}_{E^3\text{-TS}}(T)$.

Let $T$ be in the $l_0$-th exploitation epoch. Then, $T \ge \gamma N l_0 + 2^{l_0} - 2$ and hence $\log T \ge l_0$. We have

$$\tilde{R}^O(T) = \gamma l_0 \sum_{j=2}^{N} \Delta_j \le \gamma N l_0\,\Delta_{\max} \le \gamma N \log(T)\,\Delta_{\max}. \qquad (3.14)$$

Also, by definition,

$$\tilde{R}^C(T) = NC l_0 \le NC\log(T). \qquad (3.15)$$

Now, $\tilde{R}^I(T) = \mathbb{E}\big[\sum_{j=2}^{N} \Delta_j \tilde{n}_j(T)\big]$, where $\tilde{n}_j(T)$ is the number of times arm $j$ has been played during the exploitation phases. For E$^3$,

$$\tilde{n}_j(T) = \sum_{l=1}^{l_0} 2^l\, \mathbb{I}\{\bar{X}_j(l) = \max_{1 \le i \le N} \bar{X}_i(l)\} \qquad (3.16)$$
$$\le \sum_{l=1}^{l_0} 2^l\, \mathbb{I}\{\bar{X}_j(l) \ge \bar{X}_1(l)\}. \qquad (3.17)$$

Similarly, for E$^3$-TS, $\tilde{n}_j(T) \le \sum_{l=1}^{l_0} 2^l\, \mathbb{I}\{\theta_j(l) \ge \theta_1(l)\}$.
Thus,

$$\tilde{R}^I_{E^3}(T) \le \Delta_{\max} \sum_{l=1}^{l_0} \sum_{j=2}^{N} 2^l\, P(\bar{X}_1(l) < \bar{X}_j(l)), \qquad (3.18)$$

and

$$\tilde{R}^I_{E^3\text{-TS}}(T) \le \Delta_{\max} \sum_{l=1}^{l_0} \sum_{j=2}^{N} 2^l\, P(\theta_1(l) < \theta_j(l)). \qquad (3.19)$$

The following two lemmas bound the event probabilities above for the E$^3$ and E$^3$-TS policies.

Lemma 6. For E$^3$, with $\gamma = \lceil 2/\Delta_{\min}^2 \rceil$,

$$P(\bar{X}_1(l) < \bar{X}_j(l)) \le 2e^{-l}. \qquad (3.20)$$

Proof. The event $\{\bar{X}_1(l) < \bar{X}_j(l)\}$ implies at least one of the following events:

$$A_j := \{\bar{X}_j(l) - \mu_j > \Delta_j/2\}, \quad B_j := \{\bar{X}_1(l) - \mu_1 < -\Delta_j/2\}. \qquad (3.21)$$

This follows from the fact that three events $A$, $B$ and $C$ satisfy $A \subseteq (B \cup C)$ if and only if $A \cap \bar{B} \cap \bar{C} = \emptyset$. Using the Chernoff-Hoeffding bound and choosing $\gamma = \lceil 2/\Delta_{\min}^2 \rceil$, we get

$$P(A_j) \le e^{-2l\gamma \Delta_j^2/4} \le e^{-l}, \quad \text{and similarly,} \quad P(B_j) \le e^{-l}. \qquad (3.22)$$

By the union bound, we get $P(\bar{X}_1(l) < \bar{X}_j(l)) \le 2e^{-l}$.

Lemma 7. For E$^3$-TS, with $\gamma = \lceil 8/\Delta_{\min}^2 \rceil$,

$$P(\theta_1(l) < \theta_j(l)) \le 4e^{-l}. \qquad (3.23)$$

Proof. Without loss of generality, we assume the underlying reward distributions of the arms to be Bernoulli, to simplify the analysis. This eliminates the need for line 6 in the E$^3$-TS policy illustrated in Algorithm 5; the assumption can be relaxed without any change to the results.

As in Lemma 6, the event $\{\theta_1(l) < \theta_j(l)\}$ implies at least one of the events:

$$A_j := \{\theta_j(l) - \mu_j > \Delta_j/2\}, \quad B_j := \{\theta_1(l) - \mu_1 < -\Delta_j/2\}. \qquad (3.24)$$

Let $m_j(l)$ denote the number of plays of arm $j$ during the exploration phases after the $l$-th exploration epoch, and let $s_j(l)$ be the number of successes ($r = 1$) in these plays. Then, $\theta_j(l)$ is sampled from a Beta$(s_j(l) + 1,\ m_j(l) - s_j(l) + 1)$ distribution. Additionally, let $A(l)$ denote the event $\left\{\frac{s_j(l)}{m_j(l)} < \mu_j + \frac{\Delta_j}{4}\right\}$. Then,

$$P\left(\theta_j(l) \ge \mu_j + \frac{\Delta_j}{2}\right) \le P(A^c(l)) + P\left(\theta_j(l) \ge \mu_j + \frac{\Delta_j}{2},\ A(l)\right). \qquad (3.25)$$

The first term in the expression,

$$P(A^c(l)) = P\left(\frac{s_j(l)}{m_j(l)} \ge \mu_j + \frac{\Delta_j}{4}\right) \le \exp\left(-\frac{2\gamma l \Delta_j^2}{16}\right),$$

where the last inequality comes from the Chernoff-Hoeffding inequality, noting that $\frac{s_j(l)}{m_j(l)}$ is the average of i.i.d. random variables with mean $\mu_j$ and that $m_j(l) = \gamma l$.
The second term,

$$P\left(\theta_j(l) \ge \mu_j + \frac{\Delta_j}{2},\ A(l)\right) = P\left(\theta_j(l) \ge \mu_j + \frac{\Delta_j}{2},\ \frac{s_j(l)}{m_j(l)} < \mu_j + \frac{\Delta_j}{4}\right)$$
$$\le P\left(\theta_j(l) > \frac{s_j(l)}{m_j(l)} + \frac{\Delta_j}{4}\right)$$
$$= P\left(\mathrm{Beta}\big(s_j(l)+1,\ m_j(l)-s_j(l)+1\big) > \frac{s_j(l)}{m_j(l)} + \frac{\Delta_j}{4}\right)$$
$$= \mathbb{E}\left[F^B_{m_j(l)+1,\ \frac{s_j(l)}{m_j(l)}+\frac{\Delta_j}{4}}\big(s_j(l)\big)\right] \le \mathbb{E}\left[F^B_{m_j(l),\ \frac{s_j(l)}{m_j(l)}+\frac{\Delta_j}{4}}\big(s_j(l)\big)\right]. \qquad (3.26)$$

Here, $F^B_{n,p}(x)$ is the cdf of the Binomial$(n,p)$ distribution. The equality in the second-to-last line comes from the fact that $F^{\beta}_{a,b}(y) = 1 - F^B_{a+b-1,y}(a-1)$, where $F^{\beta}_{a,b}(y)$ is the cdf of the Beta$(a,b)$ distribution [2]. The inequality on the last line is a standard inequality for binomial distributions. But, by the Chernoff-Hoeffding inequality, it can be seen that $F^B_{n,p}(np - n\delta) \le \exp(-2n\delta^2)$. Thus,

$$P\left(\theta_j(l) \ge \mu_j + \frac{\Delta_j}{2},\ A(l)\right) \le \exp\left(-\frac{2\gamma l \Delta_j^2}{16}\right). \qquad (3.27)$$

Setting $\gamma := \lceil 8/\Delta_{\min}^2 \rceil$ in (3.26) and (3.27), we get

$$P\left(\theta_j(l) \ge \mu_j + \frac{\Delta_j}{2}\right) \le 2e^{-l}. \qquad (3.28)$$

Similarly, $P\left(\theta_1(l) \le \mu_1 - \frac{\Delta_j}{2}\right) \le 2e^{-l}$, and the claim of the lemma follows from the union bound.

Thus,

$$\tilde{R}^I_{E^3}(T) \le \Delta_{\max} \sum_{l=1}^{l_0} \sum_{j=2}^{N} 2^l \cdot 2e^{-l} \le 2N\Delta_{\max} \sum_{l=0}^{\infty} (2/e)^l = \frac{2N\Delta_{\max}}{1 - (2/e)} < 8N\Delta_{\max}, \qquad (3.29)$$

$$\tilde{R}^I_{E^3\text{-TS}}(T) \le \Delta_{\max} \sum_{l=1}^{l_0} \sum_{j=2}^{N} 2^l \cdot 4e^{-l} \le 4N\Delta_{\max} \sum_{l=0}^{\infty} (2/e)^l = \frac{4N\Delta_{\max}}{1 - (2/e)} < 16N\Delta_{\max}. \qquad (3.30)$$

Now, combining all the terms, we get

$$\tilde{R}_{E^3}(T) \le \gamma N \log(T)\,\Delta_{\max} + NC\log(T) + 8N\Delta_{\max}, \qquad (3.31)$$
$$\tilde{R}_{E^3\text{-TS}}(T) \le \gamma N \log(T)\,\Delta_{\max} + NC\log(T) + 16N\Delta_{\max}. \qquad (3.32)$$

b) $\Delta_{\min}$ unknown: Replacing $\gamma$ with $\gamma_t$, we get $\tilde{R}^O(T) \le N\gamma_T \log(T)\,\Delta_{\max}$ and $\tilde{R}^C(T) \le NC\log(T)$. Define $\gamma_{t_l}$ to equal $\gamma_t$ when $t = \gamma N l + 2^l - 2$. Since $\gamma_t \to \infty$ monotonically, it is clear that there exists an $l_0$ such that $\forall l > l_0$, $\gamma_{t_l} > \lceil 2/\Delta_{\min}^2 \rceil$. Using equations (3.21) and (3.22), we can conclude that the summation in equation (3.18) converges as $l_0 \to \infty$, and $\tilde{R}^I(T) = O(1)$. Thus, we get $\tilde{R}(T) = O(\gamma_T \log(T))$.

2) Indices are known with finite precision.

a) $\Delta_{\min}$ known: The proof is only a slight modification of case 1.a above and is presented here only for E$^3$. From (3.14), we get $\tilde{R}^O_{E^3}(T) \le \gamma(\epsilon) N \log(T)\,\Delta_{\max}$. Since the cost of computation is $C$, from (3.15) we get $\tilde{R}^C_{E^3}(T) \le NC\log(T)$.
Due to the precision error, $\tilde{n}_j(T) \le \sum_{l=1}^{l_0} 2^l\, \mathbb{I}\{\bar{X}_1(l) < \bar{X}_j(l) + \epsilon\}$. Thus,

$$\tilde{R}^I_{E^3}(T) \le \Delta_{\max} \sum_{l=1}^{l_0} \sum_{j=2}^{N} 2^l\, P(\bar{X}_1(l) < \bar{X}_j(l) + \epsilon). \qquad (3.33)$$

The event $\{\bar{X}_1(l) < \bar{X}_j(l) + \epsilon\}$ implies at least one of the following events:

$$A_j := \{\bar{X}_j(l) - \mu_j > (\Delta_j - \epsilon)/2\}, \quad B_j := \{\bar{X}_1(l) - \mu_1 < -(\Delta_j - \epsilon)/2\}. \qquad (3.34)$$

By the Chernoff-Hoeffding bound and using the fact that $\gamma = \gamma(\epsilon) = \lceil 2/(\Delta_{\min} - \epsilon)^2 \rceil$, we get

$$P(A_j) \le e^{-2l\gamma(\epsilon)(\Delta_j - \epsilon)^2/4} \le e^{-l}. \qquad (3.35)$$

Similarly, $P(B_j) \le e^{-l}$, and by the union bound, we get $P(\bar{X}_1(l) < \bar{X}_j(l) + \epsilon) \le 2e^{-l}$. Thus,

$$\tilde{R}^I_{E^3}(T) \le \Delta_{\max} \sum_{l=1}^{l_0} \sum_{j=2}^{N} 2^l \cdot 2e^{-l} \le 2N\Delta_{\max} \sum_{l=0}^{\infty} (2/e)^l = \frac{2N\Delta_{\max}}{1 - (2/e)} < 8N\Delta_{\max}. \qquad (3.36)$$

Combining all the terms, we get

$$\tilde{R}_{E^3}(T) \le \gamma(\epsilon) N \log(T)\,\Delta_{\max} + NC\log(T) + 8N\Delta_{\max}. \qquad (3.37)$$

b) $\Delta_{\min}$ unknown: The proof is again similar to case 1.b above. Replacing $\gamma$ with $\gamma_t$, we get $\tilde{R}^O(T) \le N\gamma_T \log(T)$. Also, since $\epsilon_t \to 0$ as $t \to \infty$, there exists a $t_0$ such that for $t > t_0$, $\epsilon_t < \Delta_{\min}$. Thus, for $T \ge t_0$, the analysis follows that in 2.a and we get $\tilde{R}^I(T) = O(1)$. Also, $\tilde{R}^C(T) \le NC\log(T)$. Combining all the terms, we get $\tilde{R}(T) \le O((\gamma_T + C)\log(T))$.

Although the performances of E$^3$ and E$^3$-TS are considerably poorer than those of UCB$_1$ and Thompson Sampling, they lend themselves to easy decentralization and can be extended to multiplayer multi-armed bandit problems with minimal effort. The performance of all the above algorithms is compared in Section 3.2.4.

3.2.4 Simulations

We conducted extensive simulations comparing the performance of these single-player multi-armed bandit policies and present the results in this section.

The single-player test scenario comprised a four-armed bandit with rewards for the arms drawn from Bernoulli distributions with means $0.1, 0.5, 0.6, 0.9$. The scenario was run over a fixed time horizon $T$ and the performance of the single-player policies was evaluated. Different true means and distributions were also considered and gave similar rankings for the algorithms. In the interests of space, these results are not presented.
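The E$^3$ loop evaluated in these simulations can be sketched as follows; the reward samplers, horizon, and the returned history of chosen arms are illustrative assumptions.

```python
def e3(reward_fns, horizon, gamma):
    """Sketch of E^3 (Algorithm 4): in epoch l, explore every arm
    gamma times, then exploit the empirically best arm for 2**l slots.

    reward_fns is a hypothetical list of samplers returning rewards
    in [0, 1]; returns the best arm chosen at each epoch."""
    n = len(reward_fns)
    sums = [0.0] * n
    pulls = [0] * n
    t, l = 0, 0
    best_history = []
    while t < horizon:
        # Exploration phase: gamma plays of each arm.
        for j in range(n):
            for _ in range(gamma):
                sums[j] += reward_fns[j]()
                pulls[j] += 1
                t += 1
        # Exploitation phase: play the empirical best arm for 2**l slots.
        best = max(range(n), key=lambda j: sums[j] / pulls[j])
        best_history.append(best)
        t += 2 ** l
        l += 1
    return best_history
```

The geometric growth of the exploitation phase is what makes the number of epochs, and hence the exploration and computation-cost terms, logarithmic in the horizon.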
Figure 3.1 compares the performance of the UCB$_1$, Thompson Sampling and UCB$_4$ algorithms over $T = 50{,}000$ slots. It can be seen that the empirical performance of Thompson Sampling is the best among the three algorithms, followed by UCB$_1$ and UCB$_4$, respectively. Figure 3.2 shows the reduction in performance of the UCB$_4$ policy when a fixed per-unit computation cost is included in the setting. The simulation is run for 50,000 time slots. The policy can be observed to perform within the theorized $\log^2$ bound. Note that both UCB$_1$ and Thompson Sampling give linear regret performance in this scenario.

[Figure 3.1: Growth of cumulative regret of the UCB$_1$, Thompson Sampling and UCB$_4$ algorithms for a four-armed bandit problem with true means [0.1, 0.5, 0.6, 0.9].]

In Figure 3.3, the performance of the newly proposed E$^3$ algorithm is illustrated. The time horizon is longer in this setting, with $T = 2{,}000{,}000$ time slots. $\Delta_{\min}$ is assumed to be known and, consequently, $\gamma$ is fixed. It can be observed that, although both algorithms appear to have the same logarithmic order of regret performance in time, the new E$^3$ algorithm performs slightly worse than the UCB$_4$ policy. This is attributable to the deterministic exploration phase length, which must take into account worst-case scenarios. However, as we shall see in the following sections, this gives us a significant performance advantage in the multiplayer setting.

Finally, a concise comparison of the policies is presented in Figure 3.4. Note that in Figure 3.4, computation cost is assumed to be zero. If computation cost were included, E$^3$ and E$^3$-TS would retain their logarithmic regret performance. However, the cumulative regret of UCB$_1$ would grow linearly, just as with TS [20].
[Figure 3.2: Growth of cumulative regret of the UCB$_4$ algorithm with a fixed per-unit computation cost for a four-armed bandit problem with true means [0.1, 0.5, 0.6, 0.9].]

3.3 Multiplayer bandit problems

In this section, we extend the classical multi-armed bandit problem to a multiplayer setting. Previous work on this topic can be found in [20]. The major issues encountered in decentralizing bandit policies are coordination among players and the finite precision of the indices being communicated. The following section describes the model and problem formulation.

3.3.1 Model and Problem Formulation

We consider an $N$-armed bandit with $M$ players. At each instant $t$, each player plays an arm. There is no dedicated communication channel for coordination among the players. However, we do allow players to communicate with each other, either by sending packets over a predetermined channel, or by pulling arms in a specific order. For example, a player could communicate a bit sequence to other players by picking one arm to indicate `bit 1' and any other arm for `bit 0'. These communication overheads add to regret and affect the learning rate.

[Figure 3.3: Growth of cumulative regret of the E$^3$ and UCB$_4$ algorithms for a four-armed bandit problem with true means [0.1, 0.5, 0.6, 0.9] (no computation cost).]

Potentially, more than one player can pick the same arm at the same instant. If this happens, we regard it as a collision. We will assume that player $i$ playing arm $k$ at time $t$ obtains an i.i.d.
reward $S_{i,k}(t)$ with density function $f_{i,k}(s)$, which will be assumed to have bounded support and, without loss of generality, taken to be in $[0, 1]$. Let $\mu_{i,k}$ denote the mean of $S_{i,k}(t)$ with respect to the pdf $f_{i,k}(s)$. A non-Bayesian setting is considered: we assume that players have no information about the means, distributions or any other statistics of the rewards from the various arms other than what they observe while playing. We also assume that each player can only observe the rewards that she gets. If there is a collision, we assume that all players choosing an arm on which there is a collision get a zero reward. This can be relaxed so that players share the reward in some manner, though the results do not change appreciably.

Let $X_{i,j}(t)$ be the reward that player $i$ gets from arm $j$ at time $t$. Thus, if player $i$ plays arm $k$ at time $t$ (and there is no collision), $X_{i,k}(t) = S_{i,k}(t)$, and $X_{i,j}(t) = 0$ for $j \ne k$. Denote the action of player $i$ at time $t$ by $a_i(t) \in \mathcal{A} := \{1, \ldots, N\}$.

[Figure 3.4: Growth of cumulative regret of the E$^3$, E$^3$-TS and UCB$_1$ algorithms for a four-armed single-player bandit problem with true means [0.1, 0.5, 0.6, 0.9] (no computation cost), with time plotted on log scale.]

Then, the history seen by player $i$ at time $t$ is $\mathcal{H}_i(t) = \{(a_i(1), X_{i,a_i(1)}(1)), \ldots, (a_i(t), X_{i,a_i(t)}(t))\}$, with $\mathcal{H}_i(0) = \emptyset$. A policy $\pi_i = (\pi_i(t))_{t=1}^{\infty}$ for player $i$ is a sequence of maps $\pi_i(t) : \mathcal{H}_i(t) \to \mathcal{A}$ that specifies the arm to be played at time $t$ given the history seen by the player. Let $\mathcal{P}(N)$ be the set of vectors

$$\mathcal{P}(N) := \{a = (a_1, \ldots, a_M) : a_i \in \mathcal{A},\ a_i \ne a_j \text{ for } i \ne j\}.$$

The players have a team objective: namely, over a time horizon $T$, they want to maximize the expected sum of rewards $\mathbb{E}\big[\sum_{t=1}^{T} \sum_{i=1}^{M} X_{i,a_i(t)}(t)\big]$.
If the parameters $\mu_{i,j}$ are known, this could easily be achieved by picking a bipartite matching

$$k^* \in \arg\max_{k \in \mathcal{P}(N)} \sum_{i=1}^{M} \mu_{i,k_i}, \qquad (3.38)$$

i.e., the optimal bipartite matching with mean arm rewards. Note that this may not be unique. Since the expected rewards $\mu_{i,j}$ are unknown, players must pick learning policies that minimize the expected regret, defined for policies $\pi = (\pi_i,\ 1 \le i \le M)$ as

$$R^{\pi}(T) = T\sum_{i=1}^{M} \mu_{i,k^*_i} - \mathbb{E}\left[\sum_{t=1}^{T} \sum_{i=1}^{M} X_{i,\pi_i(t)}(t)\right]. \qquad (3.39)$$

To make the problem more realistic, we include index computation and communication cost terms in the regret as well. Distributed algorithms for bipartite matching are known [8, 56], which determine an $\epsilon$-optimal matching with a `minimum' amount of information exchange and computation. However, every run of a distributed bipartite matching algorithm incurs a cost due to the computation and communication necessary to exchange information for decentralization. Let $C$ be the cost per run, and let $m(t)$ denote the number of times the distributed bipartite matching algorithm is run by time $t$. Then, under policy $\pi$, the expected regret is

$$\tilde{R}^{\pi}(T) = T\sum_{i=1}^{M} \mu_{i,k^*_i} - \mathbb{E}\left[\sum_{t=1}^{T} \sum_{i=1}^{M} X_{i,\pi_i(t)}(t)\right] + C\,\mathbb{E}[m(T)], \qquad (3.40)$$

where $k^*$ is the optimal matching defined in equation (3.38). Our goal is to find a decentralized algorithm that players can use such that together they minimize this expected regret.

Let $g(t) = (g_{i,j}(t),\ 1 \le i \le M,\ 1 \le j \le N)$ denote the vector of indices that the algorithm uses. We will refer to an $\epsilon$-optimal distributed bipartite matching algorithm as $\mathrm{dBM}_\epsilon(g(t))$; it yields a solution $k^\epsilon(t) := (k^\epsilon_1(t), \ldots, k^\epsilon_M(t)) \in \mathcal{P}(N)$ such that $\sum_{i=1}^{M} g_{i,k^\epsilon_i(t)}(t) \ge \sum_{i=1}^{M} g_{i,k_i}(t) - \epsilon$ for all $k \in \mathcal{P}(N)$, $k \ne k^\epsilon$. There exists an optimal bipartite matching $k^* \in \mathcal{P}(N)$ such that $k^* \in \arg\max_{k \in \mathcal{P}(N)} \sum_{i=1}^{M} \mu_{i,k_i}$. Denote $\mu^* := \sum_{i=1}^{M} \mu_{i,k^*_i}$, and define $\Delta_k := \mu^* - \sum_{i=1}^{M} \mu_{i,k_i}$ for $k \in \mathcal{P}(N)$. Let $\Delta_{\min} = \min_{k \in \mathcal{P}(N), k \ne k^*} \Delta_k$ and $\Delta_{\max} = \max_{k \in \mathcal{P}(N)} \Delta_k$. We assume $\Delta_{\min} > 0$.

In the following section, an existing algorithm is surveyed along with its expected regret.
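As a concrete illustration, for small $M$ and $N$ the optimal matching $k^*$ in (3.38) can be computed by exhaustive search over player-to-arm assignments. The decentralized setting instead relies on an $\epsilon$-optimal distributed algorithm, but a brute-force sketch clarifies the objective; the function name and interface below are assumptions.

```python
from itertools import permutations

def best_matching(mu):
    """Brute-force the optimal bipartite matching in (3.38).

    mu[i][j] is the (hypothetical, known) mean reward of player i on
    arm j. Returns the assignment perm (perm[i] = arm of player i)
    maximizing the sum of means. Exhaustive search is O(N!/(N-M)!),
    fine only for small M and N."""
    M = len(mu)
    N = len(mu[0])
    best, best_val = None, float("-inf")
    for perm in permutations(range(N), M):
        val = sum(mu[i][perm[i]] for i in range(M))
        if val > best_val:
            best, best_val = perm, val
    return best, best_val
```

For instance, with the three-player means used in the simulations of Section 3.3.4, the optimal matching attains a total mean reward of 1.6 (several matchings tie at this value, consistent with the non-uniqueness noted above).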
Simulations comparing the performances of all algorithms appear in Section 3.3.4.

3.3.2 dUCB$_4$

The dUCB$_4$ policy is a natural decentralization of the UCB$_4$ policy [20]. To the best of our knowledge, it is the first and only prior policy for multiplayer multi-armed bandit problems. The algorithm, presented in Algorithm 6, is an index-based policy that chooses the matching with the highest value of a confidence-bound index $\tilde{h}(t)$:

$$\tilde{h}_{i,j}(t) = \bar{X}_{i,j}(t) + \sqrt{\frac{(M+2)\log(n_i(t))}{n_{i,j}(t)}}. \qquad (3.41)$$

Algorithm 6: dUCB$_4$ for player $i$
1: Initialization: Play a set of matchings so that each player plays each arm at least once. Set counter $\tau = 1$.
2: while ($t \le T$) do
3:   if ($\tau = 2^p$ for some $p = 0, 1, 2, \ldots$) then
4:     // Decision frame:
5:     Update $g(t)$;
6:     Participate in the $\mathrm{dBM}_\epsilon(g(t))$ algorithm to obtain a match $k_i(t)$;
7:     if ($k_i(t) \ne k_i(t-1)$) then
8:       Use the interrupt phase to signal an INTERRUPT to all other players about the changed allocation;
9:       Reset $\tau = 1$;
10:    end if
11:    if (an INTERRUPT was received) then
12:      Reset $\tau = 1$;
13:    end if
14:  else
15:    // Exploitation frame:
16:    $k_i(t) = k_i(t-1)$;
17:  end if
18:  Play arm $k_i(t)$;
19:  Increment counter $\tau = \tau + 1$, $t = t + 1$;
20: end while

The following theorem gives an upper bound on the expected regret incurred by dUCB$_4$ over a finite time horizon $T$.

Theorem 8. Let $\epsilon > 0$ be the precision of the distributed bipartite matching algorithm and of the index representation. If $\Delta_{\min}$ is known, choose $\epsilon > 0$ such that $\epsilon < \Delta_{\min}/(M+1)$. Let $L$ be the length of a frame. Then, the expected regret of the dUCB$_4$ algorithm is

$$\tilde{R}_{\mathrm{dUCB}_4}(T) \le \left(L\Delta_{\max} + C(f(L))(1 + \log(T))\right) \left( \frac{4M^3(M+2)N\log(T)}{(\Delta_{\min} - \epsilon(M+1))^2} + NM(2M+1) \right).$$

Thus, $\tilde{R}_{\mathrm{dUCB}_4}(T) = O(\log^2(T))$. A slight modification to the policy, with increasing frame lengths, addresses the case when $\Delta_{\min}$ is unknown [20].

3.3.3 dE$^3$ and dE$^3$-TS

In this section, we propose two multiplayer policies that achieve logarithmic regret performance. These policies are decentralized extensions of the E$^3$ and E$^3$-TS policies presented in the previous section, and are given in Algorithms 7 and 8, respectively.

Algorithm 7: dE$^3$
1: while ($t \le T$) do
2:   Main Loop: At each epoch $l$:
3:     Exploration Phase: Each player $i$, $1 \le i \le M$, plays each arm $j$, $1 \le j \le N$, $\gamma$ number of times;
4:     Update the index $g_{i,j}(l) = \bar{X}_{i,j}(l)$;
5:     Exploitation Phase: Participate in the $\mathrm{dBM}_\epsilon(g(l))$ algorithm to obtain a match $k^\epsilon(l)$;
6:     Each player $i$ plays arm $k^\epsilon_i(l)$ for $2^l$ time slots;
7:     $t = t + \gamma MN + 2^l$, $l = l + 1$;
8: end while

Algorithm 8: dE$^3$-TS
1: Initialization: For each arm $j = 1, 2, \ldots, N$ and player $i = 1, \ldots, M$, set $S_{i,j} = 0$, $F_{i,j} = 0$;
2: while ($t \le T$) do
3:   Main Loop: At each epoch $l$:
4:     Exploration Phase: Each player $i$, $1 \le i \le M$, plays each arm $j$, $1 \le j \le N$, $\gamma$ number of times;
5:     For each play of each arm $j$ by player $i$, sample $\theta_{i,j}$ from the Beta$(S_{i,j} + 1, F_{i,j} + 1)$ distribution.
6:     If $r_{i,j} = 1$, then set $S_{i,j} = S_{i,j} + 1$, else $F_{i,j} = F_{i,j} + 1$.
7:     Exploitation Phase: Participate in the $\mathrm{dBM}_\epsilon(\theta(l))$ algorithm to obtain a match $k^\epsilon(l)$;
8:     Each player $i$ plays arm $k^\epsilon_i(l)$ for $2^l$ time slots;
9:     $t = t + \gamma MN + 2^l$, $l = l + 1$;
10: end while

In both these algorithms, the total regret can be viewed as the sum of three terms. All the time slots spent in exploration contribute to the first term, $\tilde{R}^O(T)$. At the end of every exploration phase, a bipartite matching algorithm is run, and each run adds cost $C$ to the second term, $\tilde{R}^C(T)$. The cost $C$ depends on two parameters: (a) the precision of the bipartite matching algorithm, $\epsilon_1 > 0$, and (b) the precision of the index representation, $\epsilon_2 > 0$. A bipartite matching algorithm has $\epsilon_1$-precision if it gives an $\epsilon_1$-optimal matching; this would happen, for example, when such an algorithm is run only for a finite number of rounds. The index has $\epsilon_2$-precision if any two indices are not distinguishable when they are closer than $\epsilon_2$.
This can happen, for example, when indices must be communicated to other players in a finite number of bits. Thus, the cost $C$ is a function of $\epsilon_1$ and $\epsilon_2$, denoted $C(\epsilon_1, \epsilon_2)$, with $C(\epsilon_1, \epsilon_2) \to \infty$ as $\epsilon_1$ or $\epsilon_2 \to 0$. Since $\epsilon_1$ and $\epsilon_2$ are parameters that are fixed a priori, we take $\epsilon = \min(\epsilon_1, \epsilon_2)$ to specify both precisions, and denote the computation and communication cost by $C(\epsilon)$, as different communication methods and implementations of distributed bipartite matching will give different costs. For example, a communication cost of one unit per bit transmitted for the dBM algorithm [20] results in a cost that is the sum of constants times $\log(1/\epsilon_1)$ and $\log(1/\epsilon_2)$. Details of this implementation are presented in [20]. Essentially, time slots are divided into exploration and exploitation phases. At the end of an exploration phase, players signal their arm preferences and bid values (the difference of the top two indices) using either a packet-based mechanism or by transmitting bits on a predetermined channel. Due to synchronization, the other players can receive the transmitted information and run the dBM algorithm.

The third term in the regret expression, $\tilde{R}^I(T)$, comes from non-optimal matchings in the exploitation phases, i.e., when the matching $k(l)$ is not the optimal matching $k^*$. Thus, the regret of the dE$^3$ and dE$^3$-TS policies is given by

$$\tilde{R}(T) = \tilde{R}^O(T) + \tilde{R}^I(T) + \tilde{R}^C(T). \qquad (3.42)$$

$R$ will be used to denote the regret of either policy when both are within context.

We first show that if $\Delta_{\min}$ is known, we can choose an $\epsilon < \Delta_{\min}/(M+1)$ so that the dE$^3$ and dE$^3$-TS algorithms achieve logarithmic regret growth with $T$. If $\Delta_{\min}$ is not known, we can pick a positive monotone sequence $\{\epsilon_t\}$ such that $\epsilon_t \to 0$ as $t \to \infty$. In a decentralized bipartite matching algorithm, the precision will depend on the amount of information exchanged.
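Before stating the regret bound, the epoch structure of dE$^3$ (Algorithm 7) can be sketched as a centralized simulation in which the $\epsilon$-optimal distributed matching dBM is replaced by exact brute force over permutations; this substitution, as well as the sampler interface and horizon, are illustrative assumptions.

```python
from itertools import permutations

def de3(reward_fns, horizon, gamma, M, N):
    """Centralized sketch of the dE^3 epoch structure: every epoch,
    each of M players explores each of N arms gamma times, then all
    players exploit a matching of empirical means for 2**l slots.

    reward_fns[i][j]() is a hypothetical sampler for player i, arm j.
    Returns the matching chosen at each epoch."""
    sums = [[0.0] * N for _ in range(M)]
    t, l = 0, 0
    matchings = []
    while t < horizon:
        # Exploration phase: gamma plays of every (player, arm) pair.
        for i in range(M):
            for j in range(N):
                for _ in range(gamma):
                    sums[i][j] += reward_fns[i][j]()
        t += M * N * gamma
        pulls = gamma * (l + 1)   # plays of each pair after epoch l
        means = [[sums[i][j] / pulls for j in range(N)] for i in range(M)]
        # Stand-in for dBM: exact matching by exhaustive search.
        match = max(permutations(range(N), M),
                    key=lambda p: sum(means[i][p[i]] for i in range(M)))
        matchings.append(match)
        t += 2 ** l               # exploitation phase
        l += 1
    return matchings
```

One matching computation per epoch, with exploitation phases doubling in length, is exactly why the $C(\epsilon)$ cost term in the analysis below grows only logarithmically with $T$.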
Note that, in addition to the near-logarithmic performance of the policy, the dependence on the number of players and the number of arms is cubic and linear, respectively.

Theorem 9. (i) Let $\epsilon > 0$ be the precision of the bipartite matching algorithm and of the index representation. If $\Delta_{\min}$ is known, choose $\epsilon$ such that $0 < \epsilon < \Delta_{\min}/(M+1)$, and set $\gamma = \lceil 2M^2/(\Delta_{\min} - \epsilon(M+1))^2 \rceil$ for dE$^3$ and $\gamma = \lceil 8M^2/(\Delta_{\min} - \epsilon(M+1))^2 \rceil$ for dE$^3$-TS. Then, the expected regrets of the dE$^3$ and dE$^3$-TS policies are

$$\tilde{R}_{\mathrm{dE}^3}(T) \le \gamma MN\,\Delta_{\max}\log(T) + C(\epsilon)\log(T) + 8MN\Delta_{\max},$$
$$\tilde{R}_{\mathrm{dE}^3\text{-TS}}(T) \le \gamma MN\,\Delta_{\max}\log(T) + C(\epsilon)\log(T) + 16MN\Delta_{\max}.$$

Thus, $\tilde{R}(T) = O(\log(T))$ for both policies.

(ii) If $\Delta_{\min}$ is not known, then choose $\gamma = \gamma_t$, where $\{\gamma_t\}$ is a positive sequence such that $\gamma_t \to \infty$ as $t \to \infty$. Also choose $\epsilon = \epsilon_t$, where $\{\epsilon_t\}$ is a positive sequence such that $\epsilon_t \to 0$ as $t \to \infty$. Then, $\tilde{R}(T) = O((\gamma_T + C(\epsilon_T))\log T)$. Thus, by choosing an arbitrarily slowly increasing sequence $\{\gamma_t\}$ and an arbitrarily slowly decreasing sequence $\{\epsilon_t\}$, we can make the regret arbitrarily close to $\log(T)$.

Proof. The proof is illustrated here only for the dE$^3$ policy, since the differences between it and the analysis of the dE$^3$-TS policy are similar to those found in Theorem 7.

(i) Let $T$ be in the $l_0$-th exploitation epoch. Then, $T \ge \gamma MN l_0 + 2^{l_0} - 2$ and hence $\log T \ge l_0$. Then,

$$\tilde{R}^O_{\mathrm{dE}^3}(T) = \gamma MN l_0\,\Delta_{\max} \le \gamma MN \log(T)\,\Delta_{\max}. \qquad (3.43)$$

Also, by definition,

$$\tilde{R}^C_{\mathrm{dE}^3}(T) = C(\epsilon) l_0 \le C(\epsilon)\log(T). \qquad (3.44)$$

A suboptimal matching occurs in the $l$-th exploitation epoch if the event $\big\{\sum_{i=1}^{M} \bar{X}_{i,k^*_i}(l) < \epsilon(M+1) + \sum_{i=1}^{M} \bar{X}_{i,k_i}(l)\big\}$ occurs for some $k \ne k^*$. Clearly, as in the single-player case,

$$\tilde{R}^I_{\mathrm{dE}^3}(T) \le \Delta_{\max} \sum_{l=1}^{l_0} 2^l\, P\left(\sum_{i=1}^{M} \bar{X}_{i,k^*_i}(l) < \epsilon(M+1) + \sum_{i=1}^{M} \bar{X}_{i,k_i}(l)\right). \qquad (3.45)$$

The event $\left\{\sum_{i=1}^{M} \bar{X}_{i,k^*_i}(l) < \epsilon(M+1) + \sum_{i=1}^{M} \bar{X}_{i,k_i}(l)\right\}$ implies at least one of the following events:

$$A_{i,j} := \{|\bar{X}_{i,j}(l) - \mu_{i,j}| > (\Delta_{\min} - \epsilon(M+1))/2M\}, \qquad (3.46)$$

for $1 \le i \le M$, $1 \le j \le N$. By the Chernoff-Hoeffding bound, and then using the fact that
By the Cherno-Hoeding bound, and then using the fact that 64 =d2M 2 =( min (M + 1)) 2 e, P(A i;j ) 2e 2l ( min (M+1)) 2 =4M 2 2e l : (3.47) Then, by using the union bound, ~ R I E 3(T ) max l 0 X l=1 2 l M X i=1 N X j=1 P(A i;j ) max 2MN 1 X l=0 (2=e) l ; (3.48) = 2 max MN=(1 (2=e)) < 8MN max : (3.49) Combining all the terms, we get ~ R dE 3(T )MN log(T ) max +C() log(T ) + 8MN max : (3.50) In a similar manner, ~ R dE 3 (T )MN log(T ) max +C() log(T ) + 16MN max ; (3.51) where =d8M 2 =( min (M + 1)) 2 e. The proof of part (ii) is similar to the proof of part 2.b. of Theorem 7, so the details are omitted. 3.3.4 Simulations We now present the empirical performance of the proposed dE 3 and dE 3 TS algorithms. We consider a three-player, three-armed bandit setting. Rewards for each arm are gener- ated independently from a binomial distribution with means 0:2; 0:25; 0:3 for player 1, 0:4; 0:6; 0:5 for player 2 and 0:7; 0:9; 0:8 for player 3. A time horizon spanning 20 epochs was considered. = 0:001 was used as the tolerance for the bipartite matching, which was done using a distributed implementation of Bertsekas' auction algorithm. The performance of the algorithm was averaged over 10 sample runs. was set equal to 100 for dE 3 and 400 for dE 3 TS (see analysis for the reason for diering 's). A xed per unit cost each time 65 the distributed bipartite matching algorithm dBM is run, is included in the setting to model communication cost in the decentralized setting. The plot of the growth of cumulative regret with time of dE 3 , dE 3 -TS and dUCB 4 is shown in Figure 3.5. We can see that the logarithmic regret performance of dE 3 and dE 3 -TS clearly outperforms the log 2 T -regret performance of our earlier dUCB 4 policy [20]. The dashed line curve is the theoretical upper bound on the performance of dE 3 -TS. 
[Figure 3.5 appeared here: cumulative regret (×10⁵) vs. timeslots (10³-10⁶, log scale), with curves for dE³, the dE³ upper bound, dE³-TS and dUCB₄, titled "Performance of different distributed learning policies".]

Figure 3.5: Growth of cumulative regret of the dE³ and dE³-TS algorithms for a three-player, three-armed bandit setting with true means [0.2 0.25 0.3; 0.4 0.6 0.5; 0.7 0.9 0.8] (communication cost included), with time plotted on a log scale.

3.4 Future Work

1. Decentralized contextual bandits. The contextual bandit model is a variation of the standard multi-armed bandit model, in which the player also observes context information and can use it to determine which arm to pull. Major work under this formulation was done by Langford et al. in [28, 31]. They were able to design an algorithm (Epoch-Greedy) whose regret scales as O(T^{2/3}) in the worst case, and as O(log T/Δ²) when the best and second-best arms have mean rewards separated by Δ. We believe it is possible to find a decentralization of this algorithm under the assumptions of our standard decentralized model.

Chapter 4

Learning in Restless Multi-armed Bandits

We consider the following learning problem, motivated by opportunistic spectrum access in cognitive radio networks. There are N independent Gilbert-Elliott channels with possibly non-identical transition matrices. It is desired to have an online policy that maximizes the long-term expected discounted reward from dynamically accessing one channel at each time. While there is a stream of recent results on this problem when the channels are identical, much less is known for the harder case of non-identical channels. We provide the first characterization of the structure of the optimal policy for this problem when the channels can be non-identical, in the Bayesian case (when the transition matrices are known). We also provide the first provably efficient learning algorithm for a non-Bayesian version of this problem (when the transition matrices are unknown).
Specifically, for the special case of two positively correlated channels, we use the structure we identify to develop a novel mapping to a different multi-armed bandit with countably infinite arms, in which each arm corresponds to a threshold-based policy. Using this mapping, we propose a policy that achieves near-logarithmic regret for this problem with respect to an ε-optimal solution.

4.1 Introduction

The problem of dynamic channel selection has recently been formulated and studied by many researchers [57, 3, 33, 14, 47, 32] under the framework of multi-armed bandits (MAB) [16]. In these papers, the channels are typically modelled as independent Gilbert-Elliott channels (i.e., described by two-state Markov chains, with a bad state "0" and a good state "1"). The objective is to develop a policy for the user to select a channel at each time, based on prior observations, so as to maximize some suitably defined notion of long-term reward. This can be viewed as a special case of restless multi-armed bandits (RMAB), a class of problems first introduced by Whittle [52]. Depending on whether the underlying state transition matrix is known or unknown, the problem can be respectively classified as Bayesian (because in this case the belief that the channel will be good in the next time step can be updated exactly for all channels based on the prior observations) or non-Bayesian.

Many of the prior results apply to the Bayesian case, where it is assumed that the underlying Markov state transition matrices for all channels are known, so that the corresponding beliefs for each channel can be updated. For the special case when the channels evolve as identical chains, it has been shown that the myopic policy is always optimal for N = 2, 3, and also optimal for any N so long as the chains are positively correlated [57, 3]. In [33], the authors consider this problem in the general case when the channels can be non-identical.
They show that a well-known heuristic, Whittle's index, exists for this problem and can be computed in closed form. Moreover, it is shown in [33] that for the special case of identical channels, the Whittle's index policy in fact coincides with the myopic policy. However, as the Whittle's index policy is not in general optimal, a question that has remained open is identifying the optimal solution for general non-identical channels in the Bayesian case.

In [57], it is also shown that when the channels are identical, the myopic policy has a semi-universal structure, in that it requires only the determination of whether the transition matrix is positively or negatively correlated, not the actual parameter values. In [14], it has been shown that this structure can be exploited to obtain an efficient online learning algorithm for the non-Bayesian version of the problem (where the underlying transition matrix is completely unknown). In particular, Dai et al. [14] show that near-logarithmic regret (defined as the difference between the cumulative reward obtained by a model-aware optimal-policy-implementing genie and that obtained by their policy) with respect to time¹ can be achieved by mapping two particular policies to arms in a different multi-armed bandit. For the more general case of non-identical arms, there have been some recent results that show near-logarithmic weak regret (measured with respect to the best possible single-channel-selection policy, which need not be optimal) [47, 32, 13]. Thus, there are currently no strong results showing a provably efficient online learning algorithm for the case of non-identical channels in the non-Bayesian case.

In our work, we consider both the Bayesian and non-Bayesian versions of this two-state restless multi-armed bandit problem with non-identical channels. We make two main contributions:

For the Bayesian version of the problem, when the underlying Markov transition matrices are known, we prove structural properties of the optimal policy. Specifically, we show that the decision region for a given channel is contiguous with respect to the belief of that channel, keeping all other beliefs fixed.

For the non-Bayesian version of the problem, for the special case of N = 2 positively correlated, possibly non-identical channels, we utilize the above-derived structure to propose a mapping to another multi-armed bandit problem with a countably infinite number of arms, each corresponding to a possible threshold policy (one of which must be optimal). We present an online learning algorithm for this problem, and prove that it yields near-logarithmic regret with respect to any policy that achieves an expected discounted reward within ε of the optimal.

¹ Note that it is desirable to have sub-linear regret with respect to time for non-Bayesian multi-armed bandits, as this indicates that asymptotically the time-averaged reward approaches that obtained by the model-aware genie.

4.2 Model

For our problem formulation, we will look at the following description of an RMAB. Consider the problem of probing N independent Markovian channels. Each channel i has two states, good (denoted by 1) and bad (denoted by 0), with transition probabilities {p^(i)_01, p^(i)_11} for the transitions from 0 to 1 and from 1 to 1, respectively. At each time t, the player chooses one channel i to probe, denoted by the action U(t), and receives a reward equal to the state S_i(t) of that channel (0 or 1). The objective is to design a policy that chooses the channel at each time so as to maximize a long-term reward, which will be mathematically formulated shortly. It has been shown that a sufficient statistic for making an optimal decision is given by the conditional probability that each channel is in state 1 given all past observations and decisions [43]. We will refer to this as the belief vector, denoted by Π(t) ≜ {ω_1(t), ..., ω_N(t)}, where ω_i(t) is the conditional probability that the i-th channel is in state 1. Given the sensing action U(t) in observation slot t, the belief can be recursively updated as follows:

ω_i(t+1) = p^(i)_11,     if i ∈ U(t), S_i(t) = 1,
           p^(i)_01,     if i ∈ U(t), S_i(t) = 0,
           τ(ω_i(t)),    if i ∉ U(t),   (4.1)

where τ(ω_i(t)) ≜ ω_i(t) p^(i)_11 + (1 − ω_i(t)) p^(i)_01 denotes the one-step belief update for unobserved channels.
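In code, the update (4.1) is a one-line case split per channel. The helper below is our own illustrative sketch (function name and representation p[i] = (p01^(i), p11^(i)) are assumptions, not from the text):

```python
# Belief update (4.1) for N Gilbert-Elliott channels.
def update_beliefs(omega, probed, state, p):
    """omega: current beliefs; probed: index of the channel sensed this slot;
    state: its observed state (0 or 1); p[i] = (p01, p11) for channel i."""
    new = []
    for i, w in enumerate(omega):
        if i == probed:
            # Observed channel: belief resets to p11 or p01.
            new.append(p[i][1] if state == 1 else p[i][0])
        else:
            # Unobserved channel: tau(w) = w*p11 + (1 - w)*p01.
            new.append(w * p[i][1] + (1.0 - w) * p[i][0])
    return new

# Example: two channels, channel 0 just observed in state 1; channel 0's
# belief resets to p11 while channel 1's belief drifts under tau.
beliefs = update_beliefs([0.5, 0.5], 0, 1, [(0.2, 0.8), (0.3, 0.9)])
```

Iterating τ on an unobserved channel drives its belief toward the chain's stationary probability, a fact used repeatedly in the structural results below.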
We will refer to this as the belief vector, denoted by (t),f! 1 (t);:::;! N (t)g, where ! i (t) is the conditional probability that the i-th channel is in state 1. Given the sensing action U(t) in observation slot t, the belief can be recursively updated as follows: w i (t + 1) = 8 > < > : p (i) 11 ; i2U(t);S i (t) = 1 p (i) 01 ; i2U(t);S i (t) = 0 (! i (t)) ; i = 2U(t) (4.1) where(! i (t)),! i (t)p (i) 11 + (1! i (t))p (i) 01 denotes the one-step belief update for unobserved channels. 69 In our study, we will focus on the discounted reward criterion. For a discount parameter (0 < 1), the reward is dened as, E 1 X t=0 t R ( (t)) j (0) =x 0 (4.2) where R ( (t)) is the reward obtained by playing strategy ( (t)), where : (t)!U(t) is a policy, which is dened to be a function that maps the belief vector (t) to the action in U(t) in slot t. The discounted reward criterion is used due to its inherently tunable nature that gives importance both to the immediate reward, unlike the average reward criterion, and the future, unlike the myopic criterion. 4.3 Structure of the optimal policy We will now derive the structure of the optimal policy for a 2-state Bayesian RMAB. For ease of exposition, we describe our result in the context of a problem with 2 channels. However, as we discuss, our key structural result in this section readily generalizes to any number of channels. We rst dene the value function, which is the the expected reward for the player when the optimal policy is adopted, recursively. Next, we derive a key property of the optimal value function and use it to characterize the structure of the optimal policy. 4.3.1 Optimal Value Function In this part, we will derive the optimal value function, i.e., the expected discounted reward obtained by a player who uses the optimal policy. Let V (! 1 ;! 2 ) denote the optimal value function with beliefs ! 1 and ! 2 of the two arms. If the player chooses arm 1, and sees a good state (which occurs with probability ! 
1 ), he gets an immediate reward of 1 and a future reward of V (p (1) 11 ;! 2 p (2) 11 + (1! 2 )p (2) 01 ). If he sees a bad state (which occurs with probability 1! 1 ), he gets an immediate reward of 0 and a future reward of V (p (1) 01 ;! 2 p (2) 11 + (1! 2 )p (2) 01 ). Putting these together, the discounted payo under the optimal policy given that the user chooses arm 1, is, V 1 (! 1 ;! 2 ) =! 1 (1 +V (p 11 ;! 2 q 11 + (1! 2 )q 01 )) +(1! 1 )(V (p 01 ;! 2 q 11 + (1! 2 )q 01 )) (4.3) 70 Similarly, given that the player chooses arm 2, his discounted payo under the optimal policy is, V 2 (! 1 ;! 2 ) =! 2 (1 +V (! 1 p 11 + (1! 1 )p 01 ;q 11 )) +(1! 2 )(V (! 1 p 11 + (1! 1 )p 01 ;q 01 )) (4.4) At each time instant, the optimal value function is the greater of the above two functions. Thus, the optimal value function is given by, V (! 1 ;! 2 ) = maxfV 1 (! 1 ;! 2 );V 2 (! 1 ;! 2 )g (4.5) 4.3.2 Properties of optimal value function Lemma 8. V 1 (! 1 ;! 2 ) is linear in ! 1 (keeping ! 2 xed) and V 2 (! 1 ;! 2 ) is linear in ! 2 (keeping ! 1 xed). Proof. These are trivial from the denition of V 1 and V 2 . Lemma 9. V 1 (! 1 ;! 2 ) is convex in! 2 , V 2 (! 1 ;! 2 ) is convex in! 1 , andV (! 1 ;! 2 ) is convex in ! 1 and ! 2 . Proof. This follows from convexity of optimal value functions proved in [44]. 4.3.3 Characterization of Optimal Decision Region We now present our main structural result. Theorem 10. The decision region where arm 1 is chosen is contiguous horizontally and the decision region for arm 2 is contiguous vertically. Mathematically, if (! 1 ;! 2 )2 1 and (! 0 1 ;! 2 )2 1 where ! 1 ! 0 1 , 1 is the region where arm 1 is chosen in the decision region space, then (! 00 1 ;! 2 )2 1 ,8! 00 1 2 [! 1 ;! 0 1 ] (and similarly for arm 2). Proof. We will prove the theorem for the decision region for arm 2, and the result for arm 1 can be proved in an analogous manner. 71 Let (! 1 ;! (1) 2 ); (! 1 ;! (2) 2 )2 2 , the decision region for arm 2. 
Then, for λ ∈ [0, 1], we have

V(ω₁, λω₂^(1) + (1−λ)ω₂^(2)) ≤ λ V(ω₁, ω₂^(1)) + (1−λ) V(ω₁, ω₂^(2))
 = λ V₂(ω₁, ω₂^(1)) + (1−λ) V₂(ω₁, ω₂^(2))
 = V₂(ω₁, λω₂^(1) + (1−λ)ω₂^(2))
 ≤ V(ω₁, λω₂^(1) + (1−λ)ω₂^(2)),   (4.6)

using the above lemmas. Hence all the inequalities hold with equality, and

V₂(ω₁, λω₂^(1) + (1−λ)ω₂^(2)) = V(ω₁, λω₂^(1) + (1−λ)ω₂^(2)).

In other words, λω₂^(1) + (1−λ)ω₂^(2) ∈ Φ₂.

Figure 4.1: Examples of possible decision regions satisfying the optimal structure.

Lemma 10. It is sufficient to restrict the analysis of the decision region of any policy governed solely by the beliefs of the states of the arms to the boundary of the rectangle with vertices given by min{p^(1)_01, p^(1)_11}, max{p^(1)_01, p^(1)_11}, min{p^(2)_01, p^(2)_11} and max{p^(2)_01, p^(2)_11}.

Proof. Recall that in our model we assumed the initial belief vector to be the vector of stationary probabilities of the Markov chains. It is easy to see that this lies between min{p^(i)_01, p^(i)_11} and max{p^(i)_01, p^(i)_11}. By the definition of the belief update in (4.1), the belief of the state of an arm can only take the values p^(i)_01, p^(i)_11 or τ(ω_i(t)). From the recursive nature of τ, the belief for an arm i always lies between (and possibly including) p^(i)_01 and p^(i)_11 and tends to the stationary distribution of the Markov chain for the arm. Hence, the two-dimensional belief vector lies in the rectangle formed by these extreme points. Further, since any policy always observes at least one of the states (consequently giving rise to a belief of p^(i)_01 or p^(i)_11 for the observed arm i), the decision region is restricted to the boundary of the rectangle.
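The recursion (4.3)-(4.5) can be evaluated numerically by value iteration on a discretized belief grid; the sketch below (transition probabilities and grid size are illustrative choices of ours, with a crude nearest-neighbour lookup) is one way to visualize decision regions such as those in Fig. 4.1.

```python
import numpy as np

# Value iteration for the two-armed recursion (4.3)-(4.5) on a belief grid.
p01, p11 = 0.3, 0.8   # arm 1 transition probabilities (positively correlated)
q01, q11 = 0.2, 0.7   # arm 2 transition probabilities
beta = 0.9            # discount factor
G = np.linspace(0.0, 1.0, 21)

def idx(w):
    """Nearest grid point; crude, but adequate for a sketch."""
    return int(round(w * (G.size - 1)))

V = np.zeros((G.size, G.size))
for _ in range(120):  # iterate the Bellman recursion to (near) convergence
    Vn = np.empty_like(V)
    for i, w1 in enumerate(G):
        for j, w2 in enumerate(G):
            t1 = w1 * p11 + (1 - w1) * p01   # tau for arm 1 if left unobserved
            t2 = w2 * q11 + (1 - w2) * q01   # tau for arm 2 if left unobserved
            V1 = w1 * (1 + beta * V[idx(p11), idx(t2)]) + (1 - w1) * beta * V[idx(p01), idx(t2)]
            V2 = w2 * (1 + beta * V[idx(t1), idx(q11)]) + (1 - w2) * beta * V[idx(t1), idx(q01)]
            Vn[i, j] = max(V1, V2)
    V = Vn
# V is bounded by 1/(1 - beta) and nondecreasing in each belief, as expected.
```

Marking, at each grid point, which of V₁, V₂ attains the maximum recovers the decision regions; by Theorem 10 the arm-1 region should be an interval along each row of the grid.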
Specically, for an RMAB with N arms, each having a good and a bad state, the decision region for arm i is contiguous along the i-th dimension axis of the belief space. Similarly Lemma 10 can be generalized to any N to indicate that it is su- cient to restrict attention to the hyper-plane boundaries of the corresponding N-dimensional hypercube. We will now show that the optimal policy can only have 0, 2 or 4 intersection points with the decision boundary. Lemma 11. The optimal policy having the structure derived in Theorem 10 can have only 0, 2 or 4 intersection points with the decision boundary dened in Lemma 10. Proof. We will outline the proof of this lemma in several steps. Recall from Lemma 10 that it is sucient to restrict our attention to the boundary of the rectangle with vertices given by minfp (1) 01 ;p (1) 11 g, maxfp (1) 01 ;p (1) 11 g, minfp (2) 01 ;p (2) 11 g, and maxfp (2) 01 ;p (2) 11 g. An immediate observation is that the number of intersection points of the optimal policy with the decision boundary is even (it cannot be odd because then the decision regions do not make sense). From Theorem 10, no edge of the decision boundary rectangle can have more than two points of intersection (if it does, then it violates one of the contiguity conditions). 73 Also from Theorem 10, no two adjacent edges can both have two intersection points at the same time (if they do, one of them violates the contiguity condition). The rst and second observation imply that the optimal policy has 0, 2, 4, 6 or 8 inter- section points with the boundary. The third implies that 8 points are not possible. It also implies that the only way to have 6 intersection points is to have two intersection points each on opposite edges. It is easy to see that this contradicts the contiguity condition for one of these edges. Hence, 6 is not possible. Remark 2. An immediate consequence of Theorem 10, Lemma 10 and Lemma 11 is that the optimal policy has a threshold structure. 
A few example regions are illustrated in Fig. 4.1.

4.4 Countable Policy Representation

We now focus our attention on the special case where N = 2 and both channels are positively correlated. We will show in this section that, in this case, the optimal policy can be represented in such a way that it must be one of a well-defined, countably infinite set of policies.

Lemma 12. For a positively correlated 2-armed, 2-state RMAB, the belief of an unobserved arm monotonically increases or decreases towards its steady-state probability.

Proof. For a positively correlated Markov chain, p_11 ≥ p_01. The monotonic nature of the belief update then follows from the definition of the update for an unobserved arm, τ, in (4.1).

Because of this monotonicity property, the above-derived thresholds on the belief space can be translated into lower and upper thresholds on the time spent on a channel. Consequently, the optimal policy in this case can be represented as a mapping from A to B, where A is a vector of three binary indicators: A[1] represents the current arm, A[2] represents the current arm's state, and A[3] represents the state of the other arm when it was last visited. B[1] and B[2] are lower and upper thresholds on the time spent on the current channel that trigger a switch to the other channel. B[1] and B[2] can take on the countably infinite values of all natural numbers and also the symbol ∞ (to represent "never"). The corresponding policy maintains a counter C(t) that is reset to 1 every time a new arm is played, and incremented by 1 each time the same arm is played again. The meaning of this mapping is that whenever the condition in A is satisfied, the policy should switch arms at the next time t+1 if the counter C(t) > B[1] or if C(t) < B[2].
A different way of putting this is that the optimal policy can be described by a 16-tuple, corresponding to all pairs B[1], B[2] in order, for each of the 8 different values that the three binary elements of A can take on. This, in turn, implies that the optimal policy can be searched for among a countably infinite set (consisting of all such 16-tuples).²

Figure 4.2: Example of an optimal decision region to illustrate the countable policy representation.

Example: We illustrate our countable policy representation with an example. Consider the case when the decision region looks like Fig. 4.2, with the steady-state probabilities of seeing a 1 in each channel represented by the vertical and horizontal dashed lines. For this example, the left-side threshold corresponds to the belief for arm 2 when it has not been played for 10 time steps, assuming it was last observed in state 1 while arm 1 was observed just in the last step to be in state 0. Note that because this threshold lies above the steady-state probability of arm 2, the belief could never reach it if arm 2 was last observed in state 0. Similarly, the right-top threshold corresponds to the belief for arm 1 when it has not been played for 20 time steps, assuming it was last observed in state 1 and arm 2 has just been observed in the last step to be in state 1. The optimal policy corresponding to this decision region can be depicted as in Table 4.1. For instance, when the last observed state on the previous arm was 0 and we observe the current arm, 1, to be in state 0, we will play it till the counter C(t) crosses 10. The entries in the B[1] and B[2] columns give the conditions on C(t) under which the policy switches arms. Note that some entries may be irrelevant.

² We omit the details for brevity, but it is possible to map these tuples to a countably infinite set so that the optimal policy corresponds to some finite-numbered element.
For instance, we would never switch from arm 1 on a 1, because it lies entirely in decision region Φ₁. Finally, the sixteen-tuple representation of this particular policy is simply

[10, ∞, ∞, ∞, ∞, ∞, ∞, ∞, ∞, ∞, ∞, ∞, ∞, ∞, ∞, 20].

current arm | state of current arm | last state of previous arm | B[1] | B[2]
     1      |          0           |             0              |  10  |  ∞
     1      |          1           |             0              |  ∞   |  ∞
     1      |          0           |             1              |  ∞   |  ∞
     1      |          1           |             1              |  ∞   |  ∞
     2      |          0           |             0              |  ∞   |  ∞
     2      |          1           |             0              |  ∞   |  ∞
     2      |          0           |             1              |  ∞   |  ∞
     2      |          1           |             1              |  ∞   |  20

Table 4.1: Threshold policy for Fig. 4.2

4.5 The non-Bayesian case

We now turn to the non-Bayesian case for N = 2 positively correlated channels, to develop an online learning policy that takes advantage of the countable policy representation described in the previous section. In the non-Bayesian case, the underlying transition probability matrices are not known to the user, who must adaptively select arms over time based on observations.

4.5.1 Mapping to an infinite-armed MAB

The crux of our mapping is to consider each possible 16-tuple description of the threshold-based optimal policy as an arm in a new multi-armed bandit. As there are countably many of them, they can be renumbered as arms 1, 2, 3, .... Now, even though we do not know the underlying transition probability matrices, we know that the optimal policy corresponds to one of these arms. Each arm can be thought of as having a countably infinite number of states (these states correspond to the belief vector of the states of the arms). The states evolve in Markovian fashion depending on the policy being used. Although the number of arms in the mapping has now increased dramatically (from two to infinity), this mapping simplifies a crucial aspect of the learning problem: we now need only identify the single best arm, and no longer have to switch dynamically between them. The only other known strong result on non-Bayesian restless MABs [14] also performs a similar mapping; however, in that case the mapping is only to two arms.
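To make the renumbering concrete, the sketch below (our own illustrative encoding, not necessarily the one alluded to in footnote 2) enumerates all 16-tuples by increasing total threshold budget, encoding ∞ as 0 and a finite threshold k as k; every tuple then appears at exactly one finite index, so in particular the optimal policy does.

```python
from itertools import count

INF = float("inf")  # stands for the symbol "infinity" ("never switch")

def compositions(s, k):
    """All k-tuples of non-negative integers summing to s."""
    if k == 1:
        yield (s,)
        return
    for first in range(s + 1):
        for rest in compositions(s - first, k - 1):
            yield (first,) + rest

def policies():
    """Enumerate every 16-tuple of thresholds exactly once: encode infinity
    as 0 and a finite threshold k as k, then list tuples by increasing sum."""
    for s in count(0):
        for tup in compositions(s, 16):
            yield tuple(INF if v == 0 else v for v in tup)

gen = policies()
first = next(gen)   # the all-infinity policy: never switch under any condition
second = next(gen)  # exactly one finite threshold (equal to 1), rest infinity
```

The enumeration is lazy, so an online learner can grow its arm set f(n) simply by drawing the next few tuples from this generator.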
We first present a sublinear-regret policy for a non-Bayesian MAB with countably infinite arms, each yielding i.i.d. rewards. We then present a variant policy that also achieves sublinear regret, with the key difference that it separates the exploration and exploitation phases. Finally, we give a policy for our 2-state, 2 positively correlated channels problem that builds on the variant policy for i.i.d. rewards, and show that it achieves sublinear regret with respect to an ε-optimality criterion.

4.5.2 MAB with countably infinite i.i.d. arms

We consider a multi-armed bandit problem with countably infinite arms, where time is indexed by n. The reward obtained from an arm i at time n, denoted by X_i(n), is an unknown random process that evolves i.i.d. over time. At each time slot, an arm i is selected under a strategy π, and the reward X_i(n) of the selected arm is observed. The optimal arm is defined to be the arm with the highest expected reward, and we assume that the index of the optimal arm, i.e., i* = arg max_i E[X_i], is finite. We assume that there exists a non-zero minimum difference between the expected rewards of the optimal arm and any suboptimal arm. We evaluate a policy π for this problem in terms of regret, which we define as follows:

R_π(n) = E[number of times suboptimal arms are played by policy π in n time slots].   (4.7)

We present the following policy (Algorithm 9) for this problem, and prove an upper bound on the regret it achieves. This policy, which we call UCB-CI, generalizes the well-known UCB1 policy of Auer et al. [7]. Let n be the total number of times the multi-armed bandit has been run and n_i the number of times arm i has been selected. Let M denote the set of arms added to the algorithm. f(n) is a slowly growing function. Define the index of arm i to be

index_i(n) = X̄_{i,n_i} + √(2 ln n / n_i),   (4.8)

where X̄_{i,n_i} is the sample mean of arm i over n_i selections of it.

Theorem 11.
For countably infinite i.i.d. arms, when the number of arms in the set M at time n is upper bounded by f(n), where f(n) is a discrete, non-decreasing, divergent function of time, using UCB-CI gives a regret that scales as³ O(f(n) ln(n)).

Algorithm 9 A policy for MAB with countably infinite arms yielding i.i.d. rewards (UCB-CI)
Initialize: Add f(0) arms to M. Select each arm in M once. Observe rewards and update n, n_i and X̄_{i,n_i}.
for slot n = 1, 2, ... do
  Select the arm with the highest index. Observe the reward and update n, n_i and the indices.
  Add additional arms to the set M s.t. the total number of arms is f(n+1). Select each new arm once. Observe rewards and update n, n_i and X̄_{i,n_i}.
end for

Proof. We will bound T_i(n), the number of times arm i (≠ i*) is played in n plays. Let I_t be the arm played at time t and 1{I_t = i} its indicator function. Let n ∈ ℕ be such that the set of arms in M before play n, which we will from now on denote by M_n, contains the optimal arm. Let n₀ be the smallest play number for which M_{n₀} contains the optimal arm. For any arm i ≠ i* in the set M_n, define X̄_{i,T_i} as the mean of the observed rewards of arm i over T_i plays. Define c_{t,s} = √(2 ln t / s) and Δ_i = E[X_{i*}] − E[X_i].

³ We use asymptotic notation here for simplicity; the proof of Theorem 11 in fact shows that the upper bound on regret holds uniformly over time.
Following a similar analysis as in [7],X s +c t;s X i;s i +c t1;s i implies at least one of the following X s E[X i ]c t;s (4.9) X i;s E[X i ] +c t;s i (4.10) E[X i ]<E[X i ] + 2c t;s i (4.11) By Cherno-Hoeding bound, PfX s E[X i ]c t;s ge 4 lnt =t 4 PfX i;s E[X i ] +c t;s i ge 4 lnt =t 4 79 For l =d 8 lnn 2 i e, the third event in (4.11) does not happen. Thus, we get, E[T i (n)] C +d 8 lnn 2 i e + 1 X t=1 t1 X s=1 t1 X s i =l 2t 4 C + 1 + 8 lnn 2 i + 2 3 C 0 + 8 lnn 2 i Since the number of arms at n-th play is upper bounded by f(n), we have, R UCBCI (n)C 0 f(n) + 8 lnnf(n) 2 i =O(f(n) ln(n)) 4.5.3 A variant policy for countably innite i.i.d arms We now present a variant policy, that we call DSEE-CI (for deterministic sequencing of exploration and exploitation for countably innite arms), for the same model considered in the previous part, which is an extension of the DSEE policy in [34]. The main dierence between DSEE-CI and UCB-CI is that the exploration and exploita- tion phases are separated out. Each arm is explored for the same number of times and then the arm having the highest sample mean is played for some time during the exploitation phase. It is important to note that the observations made in the exploitation phase are not used in updating the sample mean. This process is repeated with the relative duration of the exploitation phase increasing with time. The DSEE-CI policy is detailed in Algorithm 10. As before, f(n) is a slowly growing function that upper bounds the number of arms in play at time n. M is the actual number of arms in play at any given time and M(n) denotes the set M at time n. jMj denotes the cardinality of set M. At any time, let Z denote the cumulative number of slots used in the exploration phase up to that time. And let c be a positive constant. In the following theorem, we outline the proof of the upper bound of the regret of this algorithm. 
For simplicity, we adapt the proof of the UCB-CI policy for this; however, in principle it is possible to prove regret bounds under more general assumptions on the reward process (as shown in [34], for the nite arms case). 80 Algorithm 10 A variant policy for MAB with countably innite arms and i.i.d. rewards (DSEE-CI) Initialize: Addf(0) arms toM. Select arms inM once. Observe rewards and updaten i and X i;n i . Set Z to be f(0). while 1 do Exploitation: Select the arm with the highest sample mean X i;n i . Play the arm till slot n 0 where n 0 is determined by Z =cj(M(t 0 )j lnn 0 . Exploration: Add additional arms to setM s.t. the total number of arms isf(n 0 + 1). For each old arm inM, play the arm once and observe its reward. Updaten i andX i;m i . For each newly adder arm inM, play the arm till it has been played for the same number of times as the old arms. Observe rewards and update n i and X i;n i . Increment Z by the number of slots used up in the exploration phase. end while Theorem 12. For countably innite arms, when the number of arms in set M at time n is upper bounded by f(n), where f(n) is a discrete, increasing function of time, DSEE-CI yields a regret that is O(g(n) lnn), where g(n) is a monotonic function of f(n). Proof. The proof follows along similar lines to Theorem 11. Note that the second term of the index dened in (4.8) is the same for each arm before the start of an exploitation phase. Thus, the rst term, i.e, the sample mean, determines the index values for the arms. Also, by choosing c large enough (larger than all E[T i ] ln(n) in Theorem 11, which depends on the minimum distance between the optimal arm and any suboptimal arm), it can be argued that a sub-optimal arm is played in the exploitation phase only j(n) lnn times, where j(n) is a function of f(n). Also, for the exploration phase, the cumulative length of the phase is always less than Z = cj(M(n)j lnn. 
Thus, the total regret incurred by Algorithm 10 is O(g(n) lnn) where g(n) is some function of f(n), which can therefore be made to grow arbitrarily slowly. 4.5.4 Online Learning Policy for non-Bayesian RMAB with 2 pos- itively correlated channels We are now in a position to tackle the problem of identifying the optimal policy for our problem (2-state RMAB with two positively correlated channels). Recall our mapping of the policies dened by the 16-tuples for the original problem to countably innite arms in Section 4.4. We dene the arm that has the highest discounted reward as the optimal arm. It is given 81 as: i = arg max i E[ 1 X t=0 t R i (t)j (0) =x 0 ] (4.12) whereR i (t) is the reward obtained during the t-th play from armi. Recall that we have set up the mapping in such a way that the index of this optimal arm is a nite number. We now dene an -optimal arm to be an arm whose innite horizon discounted reward is within of that of i . Our regret denition with respect to an -optimal arm is as follows. R (n) =E[number of times an arm whose discounted reward is lesser than that of the optimal arm i by more than is played by strategy in n time slots] (4.13) We now dene a playable state and show that, using a predetermined strategy, we can reach a playable state in nite expected time from any initial state. A playable state is any feasible belief vector that can be reached by a strategy that has selected each channel at least once. It is described by a four-tuple of its attributes: (current channel, current channel's state, previous channel's last observed state and number of turns current channel has been played) with respect to such a strategy. Note that since we dene a playable state to be reached by some feasible strategy that selects a channel at each time, physically unrealizable states for a particular problem are eliminated. Lemma 13. There exists a predetermined strategy that takes any initial state to a playable state in nite expected time. Proof. 
Our predetermined strategy is as follows. We play the previous channel till we observe the last observed state of that channel for the playable state. We then switch to the other channel and play it for the number of turns in the playable state attribute. If we observe the current channel's state on the last turn, we stop; otherwise, we switch back to the previous channel and repeat the process. Since the probability of reaching any channel's state is finite and non-zero under Markovian transitions (if it were zero, then by the definition of a playable state, a strategy that switches to the current channel on the last observed state of the previous channel would reach the playable state trivially), there is a finite expected time to reach the playable state, because it is just a sequence of states, each having a positive probability of being reached.

From now onwards, x₀ will always denote a playable state. With our definition of regret, we can now equivalently restrict our horizon to any T ≥ T₀ plays of an arm by the following lemma.

Lemma 14. There exists a T₀ such that for all T ≥ T₀, an ε-optimal arm for the finite-horizon discounted reward criterion up to time T is the same as that for the infinite-horizon discounted criterion, if all rewards are finite and non-negative.

Proof. For the optimal arm i* defined by (4.12) and any T > 0, define c(T) as

c(T) = E[ Σ_{t=T+1}^∞ β^t R_{i*}(t) | Π(0) = x₀ ].   (4.14)

Since the rewards during a play are non-negative, this implies that c(T) is a non-increasing function of T. Therefore, for any 0 < c₀ < ε, there exists T₀ s.t.
E\Big[ \sum_{t=T_0+1}^{\infty} \beta^t R_{i^*}(t) \,\Big|\, x(0) = x_0 \Big] \leq c_0 < \epsilon.    (4.15)

Also note that for any sub-\epsilon-optimal arm i',

E\Big[ \sum_{t=0}^{\infty} \beta^t R_{i^*}(t) \,\Big|\, x(0) = x_0 \Big] - E\Big[ \sum_{t=0}^{\infty} \beta^t R_{i'}(t) \,\Big|\, x(0) = x_0 \Big] > \epsilon.    (4.16)

We then have

E\Big[ \sum_{t=0}^{\infty} \beta^t R_{i^*}(t) \,\Big|\, x(0) = x_0 \Big] - E\Big[ \sum_{t=0}^{\infty} \beta^t R_{i'}(t) \,\Big|\, x(0) = x_0 \Big] > \epsilon
\Rightarrow E\Big[ \sum_{t=0}^{T_0} \beta^t R_{i^*}(t) \,\Big|\, x(0) = x_0 \Big] - E\Big[ \sum_{t=0}^{\infty} \beta^t R_{i'}(t) \,\Big|\, x(0) = x_0 \Big] > \epsilon - E\Big[ \sum_{t=T_0+1}^{\infty} \beta^t R_{i^*}(t) \,\Big|\, x(0) = x_0 \Big]
\Rightarrow E\Big[ \sum_{t=0}^{T_0} \beta^t R_{i^*}(t) \,\Big|\, x(0) = x_0 \Big] - E\Big[ \sum_{t=0}^{\infty} \beta^t R_{i'}(t) \,\Big|\, x(0) = x_0 \Big] > \epsilon - c_0 > 0    (4.17)

The strict inequality follows from the definition of c_0. Thus, by the non-increasing nature of the residual discounted reward discussed above, \forall T \geq T_0, every \epsilon-optimal arm has a greater finite horizon discounted reward than even the infinite horizon discounted reward of any sub-\epsilon-optimal arm, and consequently greater than its finite horizon discounted reward.

We present in Algorithm 11 a policy, which we refer to as R2PC, for solving the 2-state RMAB problem with two positively correlated channels. The terms used in the description and the analysis of this policy are as follows. A T-play of an arm is defined to be playing the arm for T slots from the initial state x_0. The index of an arm is updated in the following way from the rewards obtained in the T-plays of that arm. m is the total number of T-plays and m_i is the number of T-plays of arm i. Let n(m) be the time at the beginning of the m-th T-play. Let Y_i(m) be the discounted reward obtained from that T-play of arm i, i.e., Y_i(m) = \sum_{t=0}^{T} \beta^t R_i(n(m)+t). \bar{Y}_{i,m_i} is the sample mean of the m_i different observations of Y_i. As before, f(n) is a slowly growing function of n that upper bounds the total number of arms at the n-th play. M is the set of arms at any time. Z is the cumulative duration of the exploration phase up to a given time.

Algorithm 11 An online learning policy for the two-state RMAB with two positively correlated channels (R2PC)
Initialize: Add f(0) arms to M. Play the predetermined strategy to reach x_0. Select an arm in M.
Play it for T slots and observe the rewards. Repeat for each arm in M. Update m, m_i and \bar{Y}_{i,m_i}. Set Z to f(0).
while 1 do
  Exploitation: Select the arm with the highest \bar{Y}_{i,m_i}. Play the arm till slot n', where n' is determined by Z = c |M(n')| \ln n'.
  Exploration: Add additional arms to the set M so that the total number of arms is f(n' + 1). For each old arm in M, play the predetermined strategy to reach x_0; then play the arm for T slots; observe rewards and update m, m_i and \bar{Y}_{i,m_i}. For each newly added arm in M, play the predetermined strategy to reach x_0; then play the arm for T slots; repeat till the arm has the same m_i as the old arms; observe rewards and update m, m_i and \bar{Y}_{i,m_i}. Increment Z by the number of T-plays in the exploration phase.
end while

4.5.5 Analysis of R2PC

We construct the proof of the regret bound for the R2PC policy in the following manner. First, we decompose the regret into the part obtained by playing the predetermined strategy to reach x_0 and the part obtained during the T-play phases. We show that the reward obtained across T-plays of an arm is i.i.d. and independent of the rewards from previous T-plays of other arms. Next, we prove the regret bound for the T-plays. Finally, we show that the time taken to reach x_0 in each play of an arm is finite and hence does not affect the order of the regret bound, thus completing our proof.

Lemma 15. The reward from each T-play of an arm is i.i.d. and independent of the rewards obtained in previous and future T-plays of any arm.

Proof. Since we start playing an arm and observing its rewards only from the initial state x_0, the reward from each play of an arm is i.i.d. over each T-play, because it is given by the state evolution of a Markov chain conditioned on the same initial state and played for the same duration. Also, by the same reasoning, the rewards obtained are independent of the past and future rewards from T-plays of any arm.

We now show our main theorem as follows.

Theorem 13.
The regret of the R2PC policy is bounded by O(h(n) \ln n).^4

Proof. Let us first consider the regret due to the T-plays. From Lemma 15, we know that the rewards obtained during each T-play of an arm in the exploration phase are i.i.d. and independent of the previous and future plays of any other arm. Therefore, we can apply Theorem 11, and the regret achieved by the algorithm in the exploitation phase after n slots is O(g_1(n) \ln n), where g_1(n) is a monotonic function of f(n). We now show that the finite time taken by the predetermined strategy to reach x_0 does not affect the order of the regret bound. When we factor in the time taken to reach x_0, the expected length of the cumulative exploration phase at time n is upper bounded by c |M(n)| \ln(n) (T + E[time taken to reach x_0]) \leq c' f(n) \ln(n) (from Lemma 13). Thus, the total regret incurred by the R2PC policy, which is the sum of the regrets during the exploration and exploitation phases, is O(h(n) \ln n), where h(n) is some monotonic function of f(n).

4.6 Conclusion

In our work, we have derived a structural result for the optimal policy for a two-state Bayesian restless multi-armed bandit with non-identical arms, under the infinite horizon discounted reward criterion. For the non-Bayesian version of this problem, in the special case of two positively correlated arms, we then developed a novel mapping to a different countably infinite-armed bandit problem. Using this mapping, we have proposed an online learning policy for this problem that yields near-logarithmic regret with respect to an \epsilon-optimal solution. Developing efficient learning policies for other cases remains an open problem. An alternative, possibly more efficient, approach to online learning in these kinds of problems might be to use the historical observations of each arm to estimate the P matrix, and use these estimates iteratively in making arm selection decisions at each time.
It is, however, unclear at present how to prove regret bounds using such an iterative estimation approach.

^4 For Theorem 13 also, the regret bound holds uniformly over time.

Chapter 5

Scheduling in decentralized systems with uncertainty: A hospital operating room application

5.1 Learning to schedule from hospital operations data

Scheduling is an important problem faced in the healthcare industry. Our motivation for treating these as applications of learning arose from data we obtained from the Keck School of Medicine. Our work aims to address problems in operating room scheduling experienced by the Keck Hospital Surgery department.

5.1.1 Introduction

Patient scheduling in hospitals is one of the most important aspects of hospital operations. It is a non-trivial problem due to several sources of uncertainty, both intrinsic and extrinsic to the hospital system. For instance, delays occur in the system due to surgeons performing other cases, pending labwork, patients being late, delays in room cleaning, etc. The scheduling problem impacts a wide range of factors, including patient satisfaction, surgeon remuneration, and resource utilization. Two such instances of patient scheduling are outpatient and operating room scheduling.

In the outpatient setting, patients arrive at a hospital to visit physicians. They join the queue and are given an estimated waiting time. When their turn in the queue comes, they are served by the attending physicians. The actual waiting time in the queue depends on the number of people in the queue and the service times of the physicians. The hospital has to predict waiting times in the queue correctly to enhance patient experience (cost of deviating lower) and avoid wasting physicians' time (cost of deviating higher).

Figure 5.1: A model for outpatient scheduling (arrivals enter a buffer and depart after service)

Operating room scheduling is a more complex problem due to the myriad of reasons for delays that could potentially occur during the process.
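The buffer/server model of Figure 5.1 can be made concrete with a short simulation. The sketch below uses the standard Lindley recursion W_{k+1} = max(0, W_k + S_k - A_{k+1}) for a single-server FIFO queue; the exponential arrival and service distributions and all parameter values are illustrative assumptions, not estimates from the Keck data.

```python
import random

def simulate_waits(n_patients, mean_interarrival, mean_service, seed=0):
    """Lindley recursion for waiting times in a single-server FIFO queue
    (the buffer/server model of Figure 5.1)."""
    rng = random.Random(seed)
    waits = [0.0]
    for _ in range(n_patients - 1):
        service = rng.expovariate(1.0 / mean_service)        # S_k
        interarrival = rng.expovariate(1.0 / mean_interarrival)  # A_{k+1}
        waits.append(max(0.0, waits[-1] + service - interarrival))
    return waits

# Utilization below one: waits settle around a finite mean.
stable = simulate_waits(5000, mean_interarrival=12.0, mean_service=10.0)
# Utilization above one: waits drift upward without bound.
overloaded = simulate_waits(5000, mean_interarrival=10.0, mean_service=12.0)
```

With utilization below one the waiting times hover around a finite mean, while pushing utilization above one makes them drift upward; predicting the former regime accurately is exactly the waiting-time estimation problem described above.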
A typical scenario is that, after a patient is determined to require surgery and approval is obtained, a date and time for the surgery is allotted. The patient arrives at that time and, depending on availability, is transferred to an intensive care unit (ICU) to be monitored and prepared for surgery. Patients without appointments who are in need of immediate surgery arrive at the emergency room and are then transferred to the ICU. When the patient is ready, he/she is moved to the operating room, where a team of nurses gets the patient ready for surgery. The surgeon(s) then arrives and performs the surgery, successful completion of which is followed by discharge to the post-anesthesia care unit (PACU) for recovery. Subsequently, the patient is either moved back to the ICU for intensive care, or to the general ward for non-intensive care. At this point, after a sufficient length of time depending on the monitoring results, the patient may either be taken for another surgical procedure or discharged from the hospital. This is detailed in Figure 5.2, which highlights the flow of a patient through the operating room. A description of the actual scheduling process is given in Figure 5.3.

Figure 5.2: A typical scenario in hospital patient flow

5.1.2 Data and Current Scheduling Policy

We obtained data from the Keck Hospital, which comprised the operating room schedule for the year 2013. A total of 10,674 patients underwent surgery at the hospital during this period.

Figure 5.3: A typical scenario in hospital patient flow

Data was obtained for the operative stage of the hospital visit only, and did not include the pre-op and post-op stages. A full list of the fields available to us is given in Table 5.1.
Table 5.1: List of fields in the Keck Hospital dataset

OR Case Number; Case Start Date; Patient Type; Anesthesia Start Time; Delay Reason(s); Surgery Start Time; Surgery Stop Time; Patient In Room Time; Scheduled Slot Type; Scheduled Start Time; ASA Class; Anesthesia Type; Scheduled Start Date/Time; Case Duration; Primary Surgeon; Primary Anesthesiologist(s); Operating Room; Patient Out of Room Time; Scheduled Operating Room; Actual Slot Type; Surgical Specialty; Case Create Date/Time; Admit Date/Time; First Circulator(s); Scheduled Case Duration; Scheduled Start Date; First Scrub(s); Anesthesia Stop Time; Primary Procedure (ORC Desc)

The current scheduling policy at the hospital is to allocate individual operating rooms to different surgical specialties for a day. Within this day, it is up to the specialty department to divide the period into as many surgeries as they want. Scheduling usually begins at 7:30am, and blocks are divided into four-, eight-, ten- or twelve-hour chunks. If a specialty does not need an operating room on a particular day, it notifies the scheduling system a certain number of days in advance, and the room is then allotted to a different specialty.

Our analysis objective was to improve on the current scheduling policy through data-driven prediction and optimization approaches. The following section presents our preliminary analysis.

5.2 Understanding the data

Preliminary data analysis was carried out to understand the system. Figure 5.4 shows the number of cases by surgical specialty. Urology and Orthopedics account for more than 36% of the total cases.

Figure 5.4: A histogram showing the distribution of surgical cases by specialty at Keck Hospital

The next series of plots looks at the delay in beginning surgeries in more detail. Figure 5.5 presents the distribution of the delay in wheeling a patient into the operating room after anesthesia. Over half the patients were taken in within +/- 15 minutes of the scheduled start time.
Figure 5.5: A histogram showing the distribution of delay in wheeling in patients

However, as can be seen in Figure 5.6, this delay becomes progressively worse as the day goes on. In fact, we were able to fit a quadratic trend line to the delay distribution. We investigated the reasons for the delays and found that surgeons performing other cases was the primary reason, warranting further analysis. Figure 5.7 shows the major contributors to delay (those with more than 60 cases) in starting surgeries. When ordered by specialty, it was observed that Orthopedics had the most delays due to surgeons being late, followed by Urology, although Urology performs significantly better when the ratio of delayed cases to total cases is used as a metric. Figure 5.8 shows the number of records of delay due to surgeons performing other cases, by specialty.

We conjectured that one reason for the above-mentioned delay could be the overbooking of surgeons, and on deeper analysis we found this indeed to be the case. At an extreme, Figure 5.9 shows a primary surgeon booked for two parallel cases in different operating rooms: he performs surgery on one patient while the other is still in the OR! This happens on multiple occasions, although sparsely. Feedback from the hospital informed us that this occurs when the surgeon is assisted by a resident, who completes the surgery by closing up the patient, etc.

Our next analysis pertained to the deviation between scheduled and actual case durations. This is an important metric because it tells us how accurately the surgeons were able to estimate their procedure times, which greatly impacts the scheduling process. Figure 5.10 shows the distribution of the actual time spent by patients in the operating rooms, while Figure 5.11 plots the distribution of the deviation between actual and scheduled times in the operating rooms.
It can be seen that there is a much greater spread in the deviation compared to the delay in starting surgeries, which indicates that the schedule allows a bit of leeway in the gaps between surgeries.

Figure 5.6: A scatter plot showing the correlation between actual and scheduled wheels-in times

5.3 Problem Formulation

The deterministic model generates a schedule deterministically using estimates of case durations. This section gives a complete characterization of the model.

System parameters. Consider I cases to be scheduled in J ORs by K surgeons over D days. Further, let \mathcal{T} = [0, T_{jd}] be the time period that OR j is scheduled to be open on day d, and let \mathcal{T}_{max} = [0, \max T_{jd}] be the maximum duration a room may remain open. Let A_{ik} be the indicator function that denotes whether case i is assigned to surgeon k, and A_{IK} the corresponding matrix. Let B_{jkdt} denote the indicator function that surgeon k's department is assigned OR j on day d at time t. D_i denotes the duration of case i. In our predictive model, we use a regression estimate for D_i based on historical case durations and surgeon estimates.

Figure 5.7: A histogram plot showing the number of records by reason for delay (> 60) at Keck Hospital

Availability parameters. Let C_{IDT} be the matrix whose entry C_{idt} is an indicator function denoting whether case i can be done on day d at time t. Similarly, C_{kdt} indicates whether surgeon k is available on day d at time t, with corresponding matrix C_{KDT}.

System variables. Let us introduce x_{ijdt}, the indicator variable that case i is assigned to OR j on day d with start time t. This is our base optimization variable. Define

y_{ijdt} = \sum_{s=0}^{t} x_{ijds} - \sum_{s=0}^{\max(t - D_i, -1)} x_{ijds} = \sum_{s=\max(t - D_i + 1, 0)}^{t} x_{ijds}, \quad \forall i, j, d, \; t \in \mathcal{T}_{max}    (5.1)

which takes value 1 when case i is ongoing in OR j on day d at time t, and value 0 otherwise.
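The equivalence of the two expressions for y_{ijdt} in (5.1) can be checked mechanically on a toy schedule. The sketch below uses made-up dimensions (two cases, one OR, one day, an eight-slot horizon) and is illustrative only, not part of the model code:

```python
import itertools

# Toy instance (hypothetical sizes): 2 cases, horizon of 8 slots.
T_max, D = 8, {0: 3, 1: 2}           # case durations D_i
x = {(i, t): 0 for i in D for t in range(T_max)}
x[(0, 1)] = 1                        # case 0 starts at slot 1
x[(1, 5)] = 1                        # case 1 starts at slot 5

def y_cumulative(i, t):
    # First form of (5.1): difference of two cumulative sums of x.
    return sum(x[(i, s)] for s in range(0, t + 1)) - \
           sum(x[(i, s)] for s in range(0, max(t - D[i], -1) + 1))

def y_window(i, t):
    # Second form of (5.1): sum over the window s in [max(t-D_i+1, 0), t].
    return sum(x[(i, s)] for s in range(max(t - D[i] + 1, 0), t + 1))

# The two forms agree for every case and slot.
for i, t in itertools.product(D, range(T_max)):
    assert y_cumulative(i, t) == y_window(i, t)

# Case 0 (start 1, duration 3) occupies slots 1-3; case 1 occupies 5-6.
assert [t for t in range(T_max) if y_window(0, t)] == [1, 2, 3]
assert [t for t in range(T_max) if y_window(1, t)] == [5, 6]
```

The -1 in the upper limit of the second cumulative sum simply makes that sum empty when t < D_i, so a case that starts at slot s is marked ongoing for exactly the D_i slots s, ..., s + D_i - 1.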
We define the start time of case i by s_i and the end time by e_i, where

s_i = \sum_{j,d,t} t \, x_{ijdt},    (5.2)
e_i = \sum_{j,d,t} (t + D_i) \, x_{ijdt}.    (5.3)

Objective variables. In terms of the system variables, we can now define various performance metrics. Denote by O_i the overtime due to case i, I_{jd} the idle time in OR j on day d, W_i the waiting time for patient i between his scheduled time and the actual time, and N the number of unscheduled cases. Then,

O_i = \sum_{d, j, \, t \geq T_{jd}} y_{ijdt}, \quad \forall i,
I_{jd} = T_{jd} - \sum_{i, \, t < T_{jd}} y_{ijdt}, \quad \forall j, d,
W_i = 0, \quad \forall i,
N = I - \sum_{i, j, d, \, t \in \mathcal{T}_{max}} x_{ijdt}    (5.4)

Waiting time is equal to zero because we are generating schedule variables and not actual times.

Figure 5.8: A histogram of the number of records of delay due to a surgeon performing another case, by specialty

System constraints. System constraints are defined to be those that are inherent to the scheduling system. These depend on the system parameters.

A case cannot be assigned more than once:
\sum_{j, d, \, t \in \mathcal{T}_{max}} x_{ijdt} \leq 1, \quad \forall i.    (5.5)

No surgery can go beyond \max T_{jd}:
e_i \leq \max T_{jd}, \quad \forall i.    (5.6)

No case can start at or after T_{jd}:
s_i < T_{jd}, \quad \forall i,    (5.7)
or, equivalently,
x_{ijdt} = 0, \quad \forall i, j, d, \; t \geq T_{jd}.    (5.8)

Figure 5.9: A schedule showing an overbooked surgeon. He performs two surgeries in parallel.

Figure 5.10: A histogram showing the distribution of times in the OR

An OR can only be in use by one case at a time:
\sum_{i} y_{ijdt} \leq 1, \quad \forall j, d, \; t \in \mathcal{T}_{max}.    (5.9)

A surgeon can only perform one case at a time:
\sum_{i, j} A_{ik} \, y_{ijdt} \leq 1, \quad \forall k, d, \; t \in \mathcal{T}_{max}.    (5.10)

Availability constraints. Availability constraints are defined to be those that depend on the availability parameters. Case availability by day and time: case availability is determined by C_{IDT} and C_{KDT}, which take into account patient availability and surgeon availability, respectively.
A joint availability parameter is created from them as

C_{IDT} = C_{IDT} \wedge A_{IK} C_{KDT}.    (5.11)

The case availability constraint is then expressed as

\sum_{i, j, d, t} (1 - C_{idt}) \, y_{ijdt} = 0.    (5.12)

Figure 5.11: A histogram showing the distribution of the deviation between actual and scheduled times in the OR

Optimization problem. The optimal schedule is given by the solution to the following binary integer linear program (BILP):

min \sum_i O_i + \sum_{j,d} I_{j,d} + N
s.t. (5.1)-(5.12)

5.4 Simulations

In this section, we present simulation results using our prediction and optimization approach. All models/algorithms have been implemented in Python 2.7 (https://www.python.org/download/releases/2.7/), using the NumPy package and the linear program (LP) modeler PuLP (https://pypi.python.org/pypi/PuLP and http://pythonhosted.org/PuLP/). The generated BILP model files are then given to two open-source solvers: CBC (Coin-OR Branch and Cut, https://projects.coin-or.org/Cbc) and/or GLPK (GNU Linear Programming Kit, http://www.gnu.org/software/glpk/). The former's routines are written in C++, and the latter's in ANSI C.

5.4.1 Notes about implementation

The following is a list of observations that we made while using PuLP to model our optimization problem in Python.

- Defining all parameters and known variables with NumPy arrays instead of Python lists speeds up runtime a little.
- Defining and manipulating all optimization variables with PuLP variables instead of NumPy arrays or Python lists speeds up runtime by orders of magnitude.
- The algorithms have been implemented using the fewest possible variables. As such, several variables from the formulation, such as s_i, e_i, etc., do not appear in the actual implementations.

5.4.2 Simulation results

We observed that our policy considerably outperformed the existing policy at Keck Hospital. In this section, we present performance comparisons of our proposed data-based policies to the current policy.
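Before turning to the comparisons, a miniature instance of the BILP gives a feel for how the objective trades off overtime, idle time and unscheduled cases. The sketch below brute-forces a hypothetical one-OR, one-day instance with two cases instead of calling PuLP/CBC; the interval-overlap test plays the role of the one-case-at-a-time constraint (5.9), and all sizes are made up for illustration:

```python
import itertools

# Miniature instance: one OR open for T_jd = 6 slots, two cases.
T_jd = 6
durations = [3, 4]                    # hypothetical case durations D_i
starts = list(range(T_jd)) + [None]   # None = case left unscheduled

def cost(assignment):
    """BILP objective: overtime + idle time + number of unscheduled cases."""
    intervals = [(s, s + durations[i])
                 for i, s in enumerate(assignment) if s is not None]
    # The OR can host only one case at a time (constraint (5.9)).
    for (a1, b1), (a2, b2) in itertools.combinations(intervals, 2):
        if a1 < b2 and a2 < b1:
            return float('inf')       # overlapping cases: infeasible
    overtime = sum(max(0, b - T_jd) for _, b in intervals)
    busy = sum(min(b, T_jd) - a for a, b in intervals)
    idle = T_jd - busy
    unscheduled = sum(1 for s in assignment if s is None)
    return overtime + idle + unscheduled

# Enumerate all start-time assignments and keep the cheapest feasible one.
best = min(itertools.product(starts, repeat=2), key=cost)
# Back-to-back scheduling of both cases (e.g. starts 0 and 3) leaves no
# idle time and costs one slot of overtime, beating dropping a case.
```

Even at this scale, the solution shows the trade-off the real schedule faces: packing the room tightly incurs a little overtime but dominates leaving a case unscheduled, which pays both the idle-time and the unscheduled-case penalties.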
Figure 5.12 shows the improvement that data-based policies have over a decentralized manual implementation of the scheduling system. We see that overtime and waiting time have been reduced by over 80%. This tremendous improvement, however, does not manifest itself in idle time, indicating that the current schedule is a very low-throughput schedule. Figures 5.13-5.15 illustrate the above fact by scheduling future cases into the same week. This was done by taking cases from subsequent weeks in such a way that the case load was represented proportionately. It can be observed that a performance improvement of 60% (light blue bar) can be achieved with the performance metrics (waiting time and overtime) not degrading beyond the current operating metrics. The last two figures, Figures 5.16 and 5.17, illustrate the difference in performance metrics between the theoretical model and simulation using real data.

Figure 5.12: Comparison of scheduling policies
Figure 5.13: Throughput v/s idle time
Figure 5.14: Throughput v/s overtime
Figure 5.15: Throughput v/s waiting time
Figure 5.16: Idle time v/s predicted
Figure 5.17: Overtime v/s predicted