Empirical methods in control and optimization
EMPIRICAL METHODS IN CONTROL AND OPTIMIZATION

BY DILEEP KALATHIL

DISSERTATION

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Electrical Engineering in the Viterbi School of Engineering of the University of Southern California, December, 2014

Doctoral Committee: Professor Rahul Jain, Chair; Professor Bhaskar Krishnamachari; Professor Yan Liu

Abstract

This dissertation addresses some problems in the area of learning, optimization and decision making in stochastic systems using empirical methods.

The first part of the dissertation addresses sequential learning and decision making with partial and noisy information, which is an important problem in many areas such as manufacturing systems, communication networks and clinical trials. Centralized sequential learning and decision making problems have been studied extensively in the literature, where such problems are modeled as (single-agent) multi-armed bandit (MAB) problems (Lai and Robbins, 1985). Surprisingly, a similar framework for multi-agent systems, a decentralized MAB framework, hasn’t received much attention. In our work, we developed a theory for decentralized learning in multi-player multi-armed bandits. We showed that a simple index-based algorithm is sufficient to maximize the team objective and explicitly designed such an algorithm. We also showed that the convergence rate of this algorithm is O((log^2 T)/T).

The second part of the thesis addresses learning and optimization in dynamical systems, with a focus on Markov decision processes (MDPs). The design and analysis of large and complex systems often involve large-scale and time-consuming computer simulations. This is because such systems are often inherently stochastic: the dynamics may be driven by some stochastic process whose characteristics may be unknown to the designer. A large number of such problems, like autonomous navigation of robots, are modeled as Markov decision process (MDP) problems.
Since the underlying transition kernel, which governs the stochastic transitions of the system state, is often unknown, an optimal control policy for such systems is often found by simulation. In our work, we addressed this problem of simulation-based learning and optimization in MDPs and developed a theory of empirical dynamic programming (EDP) for MDPs. We proposed simple and natural empirical variants of the classical dynamic programming algorithms, empirical value iteration (EVI) and empirical policy iteration (EPI), and gave convergence and sample complexity bounds for both. We also showed that these techniques can be used in many other situations, including minimax equilibrium computation for zero-sum stochastic games. We also introduce another algorithm, empirical Q-value iteration (EQVI), which gives a stronger (almost sure) convergence guarantee. Simulation results show a better convergence rate for these algorithms than for stochastic approximation/reinforcement learning schemes such as Q-learning and actor-critic learning.

We also address the problem of learning for multi-criterion optimization in MDPs. The problem of competing agents with multi-criterion performance objectives was first considered by Blackwell (Blackwell, 1956) in the context of static games with vector-valued payoffs. Blackwell introduced the notion of approachability: a target set is approachable for a given agent if its average payoff vector approaches this set for any strategy of the opponent(s). Blackwell characterized the games for which there is an affirmative answer and gave a learning algorithm which achieves this. The problem of approachability in a dynamic setting, where the state of the system evolves in a Markovian manner, is surprisingly still not well understood. In our work, we address approachability in MDPs and Stackelberg stochastic games.
We made two major contributions. First, we gave a characterization of the approachable sets (i.e., achievable target sets) in such systems and gave a simple and computationally tractable strategy for approachability. Second, using the theory of stochastic approximations, we gave a learning algorithm for approachability when the agents don’t know the probability kernel that governs the stochastic evolution of the environment.

Acknowledgments

First and foremost, I owe my deepest gratitude to my advisor, Professor Rahul Jain. Rahul has been the best advisor that I could ask for. He gave me great freedom in selecting my research area and exploring various problems, and he has been immensely patient with all my research adventures and misadventures. His unique skill for identifying important and challenging research problems, emphasis on rigour and preciseness, attention to detail, sharp insights and deep knowledge have greatly helped my research. I am deeply indebted to him for my transformation from an immature first-year graduate student to an independent researcher.

One of the best decisions that I made during my PhD years was to do an internship with Prof. Vivek Borkar last summer. I have benefited immensely from his broad knowledge of various disciplines of mathematics, his sharp thinking, intuition and fundamental understanding. His passion for research and endless energy are truly inspiring. I would like to express my deep gratitude to him.

I thank Prof. Larry Goldstein, who has been a mentor and the advisor for my MA thesis in mathematics. My regular interactions with him have greatly helped me to sharpen my mathematical skills and deepen my intuition. I really enjoyed working with him. I am grateful to Prof. Bhaskar Krishnamachari and Prof. Yan Liu for being on my qualifying examination committee and dissertation committee. I thank Prof. Suvrajeet Sen and Prof. Giuseppe Caire for being on my qualifying examination committee.
Special thanks to Prof. Mihaela van der Schaar for her encouragement and for being on my qualifying examination committee.

I have had a wonderful five years in LA because of my friends and colleagues. I am fortunate to have Nachi and Saurov as my close friends, who were there for any help. Thanks to Arunima for her friendship and for joining us in the last year. Special thanks to Mythili for her loving presence. Thanks to Harsha for those long discussions ranging from probability to politics. Thanks to Thaseem for his reassuring voice at times of disappointment. I would like to thank my labmates Naumaan, Wenyuan, Hanie and my other PhD colleagues Longbo, Chiru, Srinivas, Yi Gai, Sunav and Sundar. Thanks to my friends Anil, Priyanka, Dinakar and Ashy for the happy memories. Thanks to bodhicommons for the perspective. My sincere thanks to my close friends Arun, Shrihari and Smitha for their love and support, which kept me going in the early days of my PhD.

I am fortunate to have Sindhu as my friend and soul mate. It is her love and confidence that helped me to overcome the uncertainties. She has lighted my life and I am blessed to have her with me forever.

Last but not least, I want to express my love and gratitude to my dear mother, father and brother. I am where I am because of their unconditional love and support.

Table of Contents

List of Figures
1 Introduction
  1.1 Decentralized Learning for Multi-player Multi-armed Bandits
  1.2 Empirical Dynamic Programming for Markov Decision Processes
  1.3 Empirical Q-Value Iteration
  1.4 Learning for Multi-Criterion Optimization: Approachability in Dynamical Systems
  1.5 Incentives for Cooperation in Wireless Network
2 Decentralized Learning for Multi-player Multi-armed Bandits
  2.1 Model and Problem Formulation
    2.1.1 Arms with i.i.d. rewards
    2.1.2 Arms with Markovian rewards
  2.2 Some variations on single player multi-armed bandit with i.i.d. rewards
    2.2.1 UCB1 with index recomputation every L slots
    2.2.2 UCB4 Algorithm when index computation is costly
    2.2.3 Algorithms with finite precision indices
  2.3 Single Player Multi-armed Bandit with Markovian Rewards
  2.4 The Decentralized MAB problem with i.i.d. rewards
  2.5 The Decentralized MAB problem with Markovian rewards
  2.6 Distributed Bipartite Matching: Algorithm and Implementation
  2.7 Conclusions
3 Empirical Dynamic Programming for Markov Decision Processes
  3.1 Preliminaries
  3.2 Empirical Algorithms for Dynamic Programming
    3.2.1 Empirical Value Iteration
    3.2.2 Empirical Policy Iteration
  3.3 Iteration of Random Operators
    3.3.1 Probabilistic Fixed Points of Random Operators
    3.3.2 A Stochastic Process on N
    3.3.3 Dominating Markov Chains
    3.3.4 Convergence Analysis of Random Operators
  3.4 Sample Complexity for EDP
    3.4.1 Empirical Bellman Operator
    3.4.2 Empirical Value Iteration
    3.4.3 Empirical Policy Iteration
  3.5 Variations and Extensions
    3.5.1 Asynchronous Value Iteration
    3.5.2 Minimax Value Iteration
    3.5.3 The Newsvendor Problem
  3.6 Numerical Experiments
  3.7 Conclusions
  3.8 Proofs of Various Lemmas, Propositions and Theorems
    3.8.1 Proof of Fact 1
    3.8.2 Proof of Theorem 11
    3.8.3 Proof of Lemma 8
    3.8.4 Proof of Proposition 2
    3.8.5 Proof of Proposition 3
    3.8.6 Proof of Proposition 4
    3.8.7 Lemma 18
    3.8.8 Proof of Lemma 11
    3.8.9 Proof of Proposition 8
    3.8.10 Proof of Lemma 13
4 Empirical Q-Value Iteration
  4.1 Preliminaries and Main Result
    4.1.1 MDPs
    4.1.2 Value Iteration, Q-Value Iteration
    4.1.3 Empirical Q-Value Iteration
    4.1.4 Comparison with Classical Q-Learning
  4.2 Analysis
  4.3 Simulations
  4.4 Conclusions
5 Approachability in Dynamical Systems
  5.1 A Blackwell’s Approachability Theorem for MDPs
    5.1.1 Preliminaries
    5.1.2 Approachability Theorem for MDPs
    5.1.3 Reinforcement Learning Algorithm for Blackwell Approachability in MDPs
  5.2 Blackwell’s Approachability Theorem for Stackelberg Stochastic Games
    5.2.1 Preliminaries
    5.2.2 Approachability Theorem for Stackelberg Stochastic Games
    5.2.3 A Learning Algorithm for Approachability in Stochastic Stackelberg Games
  5.3 Conclusion
6 A Principal-Agent Approach to Spectrum Sharing
  6.1 The Physical Model and Related Work
  6.2 The Principal-Agent Framework and Contract Design
  6.3 First-best Contracts for Cognitive Spectrum Sharing
    6.3.1 Spectrum Contracts Without Time-Sharing
    6.3.2 Moral Hazard Problems
    6.3.3 First-best Spectrum Contracts with Time-Sharing
    6.3.4 Extension to multiple secondary users
  6.4 Hidden Information: Second-best Cognitive Spectrum Contracts
  6.5 Conclusion and Future Works
  6.6 Proofs of Theorems
7 References

List of Figures

2.1 Structure of the decision frame
3.1 Numerical performance comparison of EDP algorithms
4.1 Comparison of QL and EQVI
6.1 Gaussian Interference channel and achievable region with SIC
6.2 Contractual mechanism between Principal and Agent
6.3 Contract function under Case-A
6.4 Plot of contract functions
6.5 Contract function under Case-C
1 Introduction

This dissertation addresses some problems in the area of learning, optimization and decision making in stochastic systems, often involving multiple agents. We start with a few examples from the real world which serve as the motivating problems.

Opportunistic Spectrum Access: In many wireless networks, such as cognitive radio networks or wireless ad hoc networks, multiple wireless users have to access the wireless spectrum in an opportunistic way. Each wireless user has to select one channel to transmit on from the multiple channels available. However, the user knows nothing about the channel statistics, i.e., has no idea of how good or bad the channels are, and what rate to expect from each channel. The rates can be learnt only by exploring various channels. However, if multiple users transmit on one channel, the effective rate will be decreased due to interference. These networks are distributed over a geographic area and decentralized in nature. So, each user will know only the rates that it gets and will have no information about the other users’ preferences for each channel or the rates that they are getting. In some scenarios, users are part of a single team, like wireless nodes in a sensor network, who want to achieve a global objective (like maximizing the cumulative rate). In other scenarios, users are players in a game where they are rational and selfish individuals, like nodes in a wireless ad hoc network where each one wants to maximize its own rate. In both cases, each user has to employ a learning policy to learn about the channels and about the strategies of the other users. In team problems, the objective of such a learning policy is to achieve the joint objective asymptotically. In game problems, the objective of such a learning policy is to reach an equilibrium of the game eventually.

Packet Routing: Consider the problem of packet routing in a network like a local wireless mesh network or a large wired network.
Each node in the network may be sending packets to other nodes, and each of these packets may use a different route depending on the state of congestion in the intermediate nodes. The objective can be minimizing the delay or achieving some other quality of service criterion. This is a dynamic problem in which the state of the system, the size of the queue at each node, evolves over time. There are two interesting variations of this problem: 1) A single-agent problem of determining the optimal routing to ensure a certain quality of service. This can be modeled as a Markov decision problem, and an optimal solution can be found by dynamic programming. However, such dynamic programming solutions can be impractical for two reasons: the state space will be too large, and the transition probabilities corresponding to the evolution of the state of the system will be unknown. The only available information will be historical data or a simulation model of the system, and the system designer has to calculate the optimal routing policy from this. 2) A multi-agent problem where each data-transmitting node is rational and selfish and wants to maximize its own utility. Each user has to learn the behaviour of the other users as well as the evolution of the network. The system designer’s objective will be to design a learning policy such that the game dynamics eventually converge to an equilibrium of the game.

Resource Allocation: Agents that operate in the real world often need to consider several performance criteria simultaneously. For example, consider the budget allocation/optimization problem of a large retail corporation (like Walmart). It may want its average transportation cost, employment cost and advertisement cost to be less than some prescribed amounts and its average revenue to be greater than a prescribed target. The cost and revenue depend on the market state and the strategy of the opponents in the market.
Typically, this can be modeled as a decision making problem in a competitive stochastic dynamic system because the state of the system (market condition, inventory levels etc.) evolves stochastically, often in a Markovian manner, depending on the current state and the strategies of all the agents. Another example is that of bandwidth allocation in a wireless network, where QoS guarantees require that a certain minimum throughput be guaranteed to each client. At the same time, there may be other costs, such as power, to be kept within limits.

The above problems motivate the following questions:

(Q.1). How can we design simple decentralized learning algorithms for a team of players to achieve a global objective? What is the right metric for the characterization of such algorithms?

(Q.2). How can we solve large Markov decision problems when the transition kernels are unknown? How can we make use of the historical data or simulation models available? What performance guarantees can we provide for such solutions?

(Q.3). How can we design simple learning algorithms for multi-criterion optimization in dynamical systems?

1.1 Decentralized Learning for Multi-player Multi-armed Bandits

This subsection gives an introduction to Chapter 2, which attempts to solve the problem posed in (Q.1).

Sequential learning and decision making with partial and noisy information is an important problem in several technological and scientific disciplines such as manufacturing systems, pricing and revenue management, communication and transportation networks, clinical trials etc. As a (highly simplified) example, consider a simple game of choosing between two coins with unknown biases. The coins are chosen repeatedly. If, at a given instance, the chosen coin turns up heads, we get a reward of $1; otherwise we get zero. One of the two coins has a better bias.
The question is: what is the optimal ‘learning’ scheme that asymptotically helps us discover which coin has the better bias, while at the same time maximizing the cumulative reward as the game is played? This is an instance of the classical non-Bayesian multi-armed bandit problem that was introduced by Lai and Robbins in [1]. Such models capture the essence of a learning problem wherein the learner must trade off between exploiting what has been learnt and exploring more. Instead of the cumulative reward, the performance of the learning algorithm was quantified in terms of expected regret, i.e., the difference between the reward obtained by always choosing the better coin and the cumulative reward from some other policy. It was shown in [1] that there is no learning algorithm that asymptotically has expected regret growing slower than log T. A ‘learning’ scheme was also constructed that asymptotically achieved this lower bound.

This result was subsequently generalized by many people. In [2], Anantharam et al. generalized this to the case of multiple plays, i.e., the player can pick multiple arms (or coins) when there are more than 2 arms. In [3], Agrawal proposed a sample-mean-based index policy which achieves log T regret asymptotically. Assuming that the rewards come from a distribution of bounded support, Auer et al. [4] proposed a much simpler sample-mean-based index policy, called UCB1, which achieves log T regret uniformly over time, not only asymptotically. Also, unlike the policy in [3], the index doesn’t depend on the specific family of distributions that the rewards come from.

In [5], Anantharam et al. proposed a policy for the case where the arms are modelled as Markovian, not i.i.d. The rewards are assumed to come from a finite, irreducible and aperiodic Markov chain represented by a single-parameter probability transition matrix. The state of each arm evolves according to an underlying transition probability matrix when the arm is played and remains frozen when passive.
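To make the index-policy idea concrete, the following is a minimal sketch of the two-coin game played with a UCB1-style index (sample mean plus an exploration bonus of sqrt(2 ln t / n_i)), in the spirit of Auer et al. [4]. The function name `ucb1`, the coin biases 0.4 and 0.6, the horizon and the seed are assumptions chosen purely for illustration and are not taken from this dissertation. Regret is tallied in expectation, as the gap between the better coin's bias and the bias of the coin actually chosen.

```python
import math
import random


def ucb1(arm_means, horizon, seed=0):
    """Play Bernoulli arms with a UCB1-style index.

    Returns (pull counts per arm, accumulated expected regret).
    arm_means are the true biases, used only to simulate rewards
    and to tally regret; the policy itself never reads them.
    """
    rng = random.Random(seed)
    n_arms = len(arm_means)
    counts = [0] * n_arms      # number of times each arm has been played
    sums = [0.0] * n_arms      # total reward collected from each arm
    best_mean = max(arm_means)
    regret = 0.0

    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1        # initialization: play every arm once
        else:
            # index = sample mean + exploration bonus sqrt(2 ln t / n_i)
            arm = max(
                range(n_arms),
                key=lambda i: sums[i] / counts[i]
                + math.sqrt(2.0 * math.log(t) / counts[i]),
            )
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        regret += best_mean - arm_means[arm]   # expected regret of this play
    return counts, regret


# Hypothetical two-coin instance: the second coin has the better bias.
counts, regret = ucb1([0.4, 0.6], horizon=5000)
```

Under a log T regret guarantee, the number of plays of the worse coin should grow only logarithmically with the horizon, so in a run like this most plays are expected to go to the better coin.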
Such problems are called rested Markovian bandit problems (where rested refers to no state evolution until the arm is played). In [6], Tekin and Liu extended the UCB1 policy to the case of rested Markovian bandit problems. If some non-trivial bounds on the underlying Markov chains are known a priori, they showed that the policy achieves log T regret uniformly over time. Also, if no information about the underlying Markov chains is available, the policy can easily be modified to get a near-O(log T) regret asymptotically.

The models in which the state of an arm continues to evolve even when it is not played are called restless Markovian bandit problems. Restless models are considerably more difficult than the rested models and have been shown to be PSPACE-hard [7]. This is because the optimal policy will no longer be to “play the arm with the highest mean reward”. [8] employs a weaker notion of regret (weak regret) which compares the reward of a policy to that of a policy which always plays the arm with the highest mean reward. They propose a policy which achieves log T (weak) regret uniformly over time if certain bounds on the underlying Markov model are known a priori and achieves a near-O(log T) (weak) regret asymptotically when no such knowledge is available. [9] proposes another, simpler policy which achieves the same bounds for weak regret. [10] proposes a policy based on a deterministic sequence of exploration and exploitation and achieves the same bounds for weak regret. In [11], the authors consider the notion of strong regret and propose a policy which achieves near-log T (strong) regret for some special cases of the restless model.

Recently, there has been increasing interest in multi-armed bandit models, partly because of opportunistic spectrum access problems. Consider a user who must choose between N wireless channels. Yet, it knows nothing about the channel statistics, i.e., has no idea of how good or bad the channels are, and what rate it may expect to get from each channel.
The rates can be learnt by exploring various channels. Thus, these have been formulated as multi-armed bandit problems, and index-type policies have been proposed for choosing spectrum channels. In many scenarios, there are multiple users accessing the channels at the same time. Each of these users must be matched to a different channel. These have been formulated as a combinatorial multi-armed bandit problem [12] [13], and it was shown that an “index-matching” algorithm that at each instant determines a matching by solving a sum-index maximization problem achieves O(log T) regret uniformly over time, and this is indeed order-optimal.

In other settings, the users cannot coordinate, and the problem must be solved in a decentralized manner. Thus, settings where all channels (arms) are identical for all users with i.i.d. rewards have been considered, and index-type policies that can achieve coordination have been proposed that get O(log T) regret uniformly over time [14, 15, 16, 10]. A similar result for the Markovian reward model with weak regret has been shown by [10], assuming some non-trivial bounds on the underlying Markov chains are known a priori. The regret scales only polynomially in the number of users and channels. Surprisingly, the lack of coordination between the players asymptotically imposes no additional cost or regret.

In this chapter, we consider the decentralized multi-armed bandit problem with distinct arms for each player. We consider both the i.i.d. reward model and the rested Markovian reward model. All players together must discover the best arms to play as a team. However, since they are all trying to learn at the same time, they may collide when two or more pick the same arm. We propose an index-type policy, dUCB4, based on a variation of the UCB1 index. At its heart is a distributed bipartite matching algorithm such as Bertsekas’ auction algorithm [17]. This algorithm operates in rounds, and in each round prices for various arms are determined based on bid-values.
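The round-by-round bidding can be illustrated with a small, centralized and sequential sketch of an auction-style assignment. This is only an assumption-laden illustration, not the distributed implementation developed in Chapter 2: the function name `auction_matching`, the `benefit` matrix (standing in for each player's index for each arm) and the bid increment `eps` are all hypothetical. For integer benefits and `eps` smaller than 1/n, the classical auction algorithm is known to terminate with an optimal assignment.

```python
def auction_matching(benefit, eps):
    """Assign each player to a distinct arm via an auction with increment eps.

    benefit[i][j] is player i's value (e.g. an index) for arm j;
    the matrix is assumed square (as many arms as players).
    Returns (assigned, prices) where assigned[i] is player i's arm.
    """
    n = len(benefit)
    prices = [0.0] * n          # current price of each arm
    owner = [None] * n          # owner[j] = player currently holding arm j
    assigned = [None] * n       # assigned[i] = arm currently held by player i
    unassigned = list(range(n))

    while unassigned:
        i = unassigned.pop()
        # Net value of each arm to player i at the current prices.
        net = [benefit[i][j] - prices[j] for j in range(n)]
        best = max(range(n), key=lambda j: net[j])
        second = max((net[j] for j in range(n) if j != best), default=net[best])
        # Bid: raise the price by the margin over the second-best arm, plus eps.
        prices[best] += net[best] - second + eps
        if owner[best] is not None:
            assigned[owner[best]] = None
            unassigned.append(owner[best])   # the previous owner is outbid
        owner[best] = i
        assigned[i] = best
    return assigned, prices


# Made-up 3-player, 3-arm instance with integer benefits; eps < 1/3.
benefit = [[10, 5, 1],
           [9, 8, 2],
           [3, 4, 7]]
assignment, prices = auction_matching(benefit, eps=0.25)
total = sum(benefit[i][assignment[i]] for i in range(3))
```

In a decentralized setting, each bid in such a loop would have to be carried out by exchanging messages among the players rather than by updating shared arrays.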
This imposes a communication (and computation) cost on the algorithm that must be accounted for. Nevertheless, we show that when certain non-trivial bounds on the model parameters are known a priori, the dUCB4 algorithm that we introduce achieves (at most) near-O(log^2 T) growth non-asymptotically in expected regret. If no such information about the model parameters is available, the dUCB4 algorithm still achieves (at most) near-O(log^2 T) regret asymptotically. A lower bound, however, is not known at this point, and is a work in progress.

1.2 Empirical Dynamic Programming for Markov Decision Processes

This subsection gives an introduction to Chapter 3, which attempts to solve the problem posed in (Q.2).

Markov decision processes (MDPs) are natural models for decision making in a stochastic dynamic setting for a wide variety of applications. The ‘principle of optimality’ introduced by Richard Bellman in the 1950s has proved to be one of the most important ideas in stochastic control and optimization theory. It leads to dynamic programming algorithms for solving sequential stochastic optimization problems. And yet, it is well known that it suffers from a “curse of dimensionality” [18, 19, 20], and does not scale computationally with state and action space size. In fact, the dynamic programming algorithm is known to be PSPACE-hard [21].

This realization led to the development of a wide variety of ‘approximate dynamic programming’ methods, beginning with the early work of Bellman himself [22]. These ideas evolved independently in different fields, including the work of Werbos [23] and Kushner and Clark [24] in control theory, Minsky [25], Barto et al. [26] and others in computer science, and Whitt in operations research [27, 28]. The key idea was an approximation of value functions using basis function approximation [22], state aggregation [29], and subsequently function approximation via neural networks [30]. The difficulty was universality of the methods.
Different classes of problems require different approximations. Thus, alternative model-free methods were introduced by Watkins and Dayan [31], where a Q-learning algorithm was proposed as an approximation of the value iteration procedure. It was soon noticed that this is essentially a stochastic approximation scheme of the kind introduced in the 1950s by Robbins and Monro [32] and further developed by Kiefer and Wolfowitz [33] and Kushner and Clark [24]. This led to many subsequent generalizations, including the temporal difference methods [34] and actor-critic methods [35, 36]. These are summarized in [34, 37, 38]. One shortcoming in this theory is that most of these algorithms require a recurrence property to hold and, in practice, often work only for finite state and action spaces. Furthermore, while many techniques for establishing convergence have been developed [39], including the o.d.e. method [40, 41], establishing the rate of convergence has been quite difficult [42]. Thus, despite considerable progress, these methods are not universal, sample complexity bounds are not known, and so other directions need to be explored.

A natural thing to consider is simulation-based methods. In fact, engineers and computer scientists often do dynamic programming via Monte Carlo simulations. This technique affords a considerable reduction in computation, but at the expense of uncertainty about convergence. In this chapter, we analyze such ‘empirical dynamic programming’ algorithms. The idea behind the algorithms is quite simple and natural: in the DP algorithms, replace the expectation in the Bellman operator with a sample-average approximation. The idea is widely used in the stochastic programming literature, but mostly for single-stage problems. In our case, replacing the expectation with an empirical expectation operator makes the classical Bellman operator a random operator. In the DP algorithm, we must find the fixed point of the Bellman operator.
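In symbols, and under notation assumed here only for illustration (a discounted-cost MDP with state s, admissible actions A(s), one-stage cost c, discount factor \alpha in (0,1), transition kernel P, and n simulated next states per state-action pair), the exact operator and its sample-average counterpart might be written as:

```latex
% Classical Bellman operator: exact expectation over the next state s'.
[T v](s) \;=\; \min_{a \in A(s)} \Big\{ c(s,a) + \alpha \, \mathbb{E}\big[\, v(s') \mid s, a \,\big] \Big\},
\qquad v^{*} = T v^{*}.

% Empirical Bellman operator: expectation replaced by an average over
% n simulated next states s'_1, \dots, s'_n drawn i.i.d. from P(\cdot \mid s, a).
[\hat{T}_n v](s) \;=\; \min_{a \in A(s)} \Big\{ c(s,a) + \frac{\alpha}{n} \sum_{i=1}^{n} v\big(s'_i\big) \Big\}.

% Exact value iteration versus its empirical variant.
v_{k+1} = T v_k
\qquad \text{versus} \qquad
\hat{v}_{k+1} = \hat{T}_n \hat{v}_k .
```

Because fresh samples are drawn at every application, \hat{T}_n is a random operator, so the iterates \hat{v}_k form a stochastic process rather than a deterministic sequence.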
In the empirical DP algorithms, we must find a probabilistic fixed point of the random operator. In this chapter, we first introduce two notions of probabilistic fixed points that we call ‘strong’ and ‘weak’. We then show that asymptotically these concepts converge to the deterministic fixed point of the classical Bellman operator. The key technical idea of this chapter is a novel stochastic dominance argument that is used to establish probabilistic convergence of a random operator, and in particular, of our empirical algorithms. Stochastic dominance, the notion of an order on the space of random variables, is a well developed tool (see [43, 44] for a comprehensive study).

In this chapter, we develop a theory of empirical dynamic programming (EDP) for Markov decision processes (MDPs). Specifically, we make the following contributions in this chapter. First, we propose both empirical value iteration and policy iteration algorithms and show that these converge. Each is an empirical variant of the classical algorithms. In empirical value iteration (EVI), the expectation in the Bellman operator is replaced with a sample-average or empirical approximation. In empirical policy iteration (EPI), both policy evaluation and the policy iteration are done via simulation, i.e., replacing the exact expectation with the simulation-derived empirical expectation. We note that the EDP class of algorithms is not a stochastic approximation scheme. Thus, we don’t need a recurrence property as is commonly needed by stochastic approximation-based methods. Thus, the EDP algorithms are relevant for a larger class of problems (in fact, for any problem for which exact dynamic programming can be done). We provide convergence and sample complexity bounds for both EVI and EPI. But we note that EDP algorithms are essentially “off-line” algorithms just as classical DP is. Moreover, we also inherit some of the problems of classical DP such as scalability issues with large state spaces.
These can be overcome in the same way as one does for classical DP, i.e., via state aggregation and function approximation.

Second, since the empirical Bellman operator is a random monotone operator, it doesn’t have a deterministic fixed point. Thus, we introduce new mathematical notions of probabilistic fixed points. These concepts are pertinent when we are approximating a deterministic operator with an improving sequence of random operators. Under fairly mild assumptions, we show that our two probabilistic fixed point concepts converge to the deterministic fixed point of the classical monotone operator.

Third, since scant mathematical methods exist for convergence analysis of random operators, we develop a new technique based on stochastic dominance for convergence analysis of iteration of random operators. This technique allows for finite sample complexity bounds. We use this idea to prove convergence of the empirical Bellman operator by constructing a dominating Markov chain. We note that there is an extant theory of monotone random operators developed in the context of random dynamical systems [45], but its techniques for convergence analysis of random operators are not relevant to our context. Our stochastic dominance argument can be applied to more general random monotone operators than just the empirical Bellman operator.

We also give a number of extensions of the EDP algorithms. First, we show that EVI can be performed asynchronously, making a parallel implementation possible. Second, we show that a saddle point equilibrium of a zero-sum stochastic game can be computed approximately by using the minimax Bellman operator. Third, we also show how the EDP algorithm and our convergence techniques can be used even with continuous state and action spaces by solving the dynamic newsvendor problem.

A key question is how the empirical dynamic programming method differs from other methods for simulation-based optimization of MDPs, on which there is a substantial literature.
We note that most of these are stochastic approximation algorithms, also called reinforcement learning in computer science. Within this class, there are Q-learning algorithms, actor-critic algorithms, and approximate policy iteration algorithms. Q-learning was introduced by Watkins and Dayan, but its convergence as a stochastic approximation scheme was established by Bertsekas and Tsitsiklis [34]. Q-learning for the average cost case was developed in [46], and in a risk-sensitive context in [47]. The convergence rate of Q-learning was established in [48], and similar sample complexity bounds were given in [42]. Actor-critic algorithms as a two time-scale stochastic approximation were developed in [35]. But the most closely related work is optimistic policy iteration [49], wherein simulated trajectories are used for policy evaluation while policy improvement is done exactly. The algorithm is a stochastic approximation scheme and its almost sure convergence follows. This is true of all stochastic approximation schemes, but they do require some kind of recurrence property to hold. In contrast, EDP is not a stochastic approximation scheme, hence it does not need such assumptions. However, we can only guarantee its convergence in probability.

A class of simulation-based optimization algorithms for MDPs that is not based on stochastic approximation is the adaptive sampling methods developed by Fu, Marcus and co-authors [50, 51]. These are based on the pursuit automata learning algorithms [52, 53, 54] and combine multi-armed bandit learning ideas with Monte-Carlo simulation to adaptively sample state-action pairs to approximate the value function of a finite-horizon MDP.

Another closely related work is [55], which introduces simulation-based policy iteration (for average cost MDPs). It basically shows that almost sure convergence of such an algorithm can fail. Another related work is [56], wherein a simulation-based value iteration is proposed for a finite horizon problem.
Convergence in probability is established if the simulation functions corresponding to the MDP are Lipschitz continuous. Another closely related paper is [57], which considers value iteration with error. We note that our focus is on infinite horizon discounted MDPs. Moreover, we do not require any Lipschitz continuity condition. We show that EDP algorithms will converge (probabilistically) whenever the classical DP algorithms will converge.

A survey on approximate policy iteration is provided in [58]. Approximate dynamic programming (ADP) methods are surveyed in [38]. In fact, many Monte-Carlo-based dynamic programming algorithms are introduced therein (but without convergence proofs). Simulation-based uniform estimation of value functions was studied in [59, 60]. This gave PAC learning type sample complexity bounds for MDPs, and this can be combined with policy improvement along the lines of optimistic policy iteration.

1.3 Empirical Q-Value Iteration

This subsection gives an introduction to Chapter 4, which gives another approach to solve the problem posed in (Q.2).

The Q-learning algorithm of Watkins [61] has been an early and leading algorithm for approximate dynamic programming for Markov decision processes. An important feature of this and other algorithms of this ilk (actor-critic, TD($\lambda$), LSTD, LSPE, natural gradient, ...) has been that they are stochastic approximations, i.e., recursive schemes that update a vector incrementally based on observed payoffs. This is achieved by using step-sizes that are either decreasing slowly in a precise sense or equal to a small positive constant. In either case, this induces a slower time scale for the iteration compared to the ‘natural’ time scale on which the underlying stochastic phenomena evolve. Thus, two time scale effects such as averaging kick in, ensuring that the algorithm effectively follows an averaged dynamics, i.e., its original dynamics averaged out over the random processes affecting it on the natural time scale.
The iterations are designed such that this averaged dynamics has the desired convergence properties.

What we propose here is an alternative scheme for Q-learning that is not incremental and therefore evolves on the natural time scale. It does the usual Q-value iteration with the proviso that the conditional averaging with respect to the actual transition kernel of the underlying controlled Markov chain is replaced by a simulation-based empirical surrogate. One obvious advantage one might expect from this is that, if it works, it will have a much faster convergence. Our contribution is to provide a rigorous proof that it works and to provide simulation evidence that the expected fast convergence is indeed a reality.

The proof technique we use is of independent interest, based as it is upon constructs borrowed from the celebrated backward coupling scheme for exact simulation [62] (see also [63] for a discussion of the scheme and other related dynamics). In hindsight, this need not be surprising, as value and Q-value iterations in finite time yield finite horizon values / Q-values ‘looking backward’ with the initial guess as the terminal cost.

1.4 Learning for Multi-Criterion Optimization: Approachability in Dynamical Systems

This subsection gives an introduction to Chapter 2 which attempts to solve the problem posed in (Q.3).

Classical game theory, Markov Decision Processes (MDPs) and stochastic games typically deal only with scalar performance criteria: corresponding to each state and action of the agents, each agent incurs a scalar cost. The standard problem is to compute or learn a policy for each agent which will minimize her scalar performance objective conditioned on the policies of all other agents and the dynamics of the underlying system. Computing or learning equilibrium policies in a standard normal form strategic game or stochastic game, and computing or learning the optimal policy of average or discounted cost MDPs, are the typical examples.
However, many interesting problems fall outside this ‘scalar performance criteria’ class. For example, consider the problem of an automaker. They want to minimize their cost but also worry about reliability, perceived quality and customer satisfaction, all of which are quantified in some way and should be greater than some prescribed values. Bandwidth allocation in a wireless network involves throughput maximization while also providing certain delay guarantees.

Multi-objective optimization is a well studied area [65], though most methods focus on achieving a Pareto-optimal solution. In the context of decision making under a dynamically changing environment, these problems have been studied extensively under the class of ‘constrained MDPs’ [66]. A reinforcement learning algorithm for constrained MDPs was developed in [67]. [68] and [69] consider MDPs with an arbitrary reward process. Their setting is in the framework of regret minimization, which is different from our approach.

In this chapter, we address the problem of multi-objective optimization in a dynamically changing environment in the context of MDPs with vector-valued cost functions. Our objective is to learn the control policies that will drive the average vector-cost to a given ‘target set’. The problem formulation is different from that of constrained MDPs: the constrained optimization framework is replaced by a constraint satisfaction framework.

This problem formulation is inspired by the famous ‘Blackwell’s approachability theorem’ [70]. In his work, Blackwell posed the following question: does there exist a strategy in an arbitrary two-player game with vector payoffs which guarantees that player 1’s average payoff approaches (with probability 1) a given closed convex set $D$ irrespective of the other player’s strategy? Blackwell characterized the games for which there is an affirmative answer, and he also prescribed a strategy which achieves this.
We note that while the approachability question has mostly been asked in the context of games, with competing decision makers, more basically it is about multiple objectives.

Approachability in a stochastic game framework has been addressed before. [71] studied this problem, where the approachability from a given initial state was studied under some recurrence assumptions. Their scheme depends on updating strategies when the system returns to a fixed state $s_0$. This scheme was proposed because there appeared to be a need to keep the policy fixed for some duration in order to ‘exploit’ that policy before one ‘explores’ again. However, there are many computational difficulties associated with this approach, in particular with the return time to a fixed state. This approach has the undesirable effect of increasing the variance of the cost for the agents if the recurrence times are large, e.g., in very large systems. For this reason, the scheme will have slow convergence. [72] proposed an alternative scheme that required less restrictive assumptions. The basic idea is the ‘increasing time window’ method: keep the policy constant for the length of a time window whose duration increases gradually. In each window, the policy used is the equilibrium policy of an $N_i$-stage stochastic game, where $N_i$ is the length of the $i$th window. The computation of this policy is thus clearly non-trivial. [73] presents yet another scheme which has the same drawbacks. These schemes also did not address another important aspect of this problem: a learning algorithm for approachability when the transition kernel corresponding to the underlying Markov dynamics is unknown. [74] gave a learning algorithm that builds on the work in [71]. The key idea is to run $J$ learning algorithms in parallel, each one corresponding to a different steering direction. If $J$ is sufficiently large, $\epsilon$-approachability can be guaranteed. This scheme, however, is potentially computationally impractical in some scenarios.
In this chapter, we make the following contributions. We first consider Markov decision processes with vector cost functions. We give necessary and sufficient conditions for approachability of convex sets. We also give sufficient conditions for approachability of non-convex sets. Some of these conditions are similar to prior work, but the proof is entirely different as we rely on ideas from stochastic approximation theory for constructing an approachability strategy. It turns out that it is easy to construct a ‘learning’ scheme, i.e., one for when the model parameters are unknown, from our approachability strategy. We then consider Stackelberg stochastic games. These are of course special stochastic games. We give necessary and sufficient conditions for approachability of convex sets and sufficient conditions for non-convex sets. In that sense, our approachability results are weaker, since the prior works [71, 72, 73] all consider general stochastic games. However, again using multiple time scale stochastic approximation theory, we are able to give a ‘learning’ scheme for Stackelberg stochastic games. It seems difficult to derive a learning scheme from approachability strategies for stochastic games from any of the prior works (except [74]). Thus, our main contributions are: (i) new proofs for approachability for MDPs and Stackelberg stochastic games, and (ii) stochastic approximation-based learning schemes for approachability in them.

1.5 Incentives for Cooperation in Wireless Network

This subsection gives an introduction to Chapter 6, which is motivated by the opportunistic spectrum access problems introduced before.

The scarcity of spectrum is becoming an impediment to the growth of more capable wireless networks. Several measures are sought to address this problem: freeing up unused spectrum, sharing of spectrum through new paradigms such as cognitive radio sensing, as well as sophisticated information theoretic schemes and network coding methods.
Nearly all such methods presume perfect user cooperation. In particular, multi-user communication theory has proved tremendously successful in developing almost capacity-achieving schemes [75, 76]. Based on these, new hierarchical network architectures have been proposed [77] in which clusters of nodes act as multi-antenna arrays and use MIMO coding techniques to improve network capacity scaling from $O(n/\log n)$ [78] to $O(n)$, where $n$ is the number of nodes [79]. Perfect cooperation, however, is an unjustified assumption. And with non-cooperative, selfish users who act strategically, network (sum-rate) capacity can be arbitrarily bad, as shown for the single-hop Gaussian interference channel in unlicensed bands [80]. For licensed bands, one of the key challenges when users have cognitive radio capability is: why would primary users give up their ownership rights over their spectrum and share it with secondary users at the cost of performance degradation to themselves? FCC mandates are not going to solve the problem, as primary users can always transmit junk to keep channels busy and deter secondary users [81, 82].

This incentive issue has been sought to be addressed by guaranteeing the primary user a payment in lieu of sharing his spectrum and suffering some performance degradation. This has been sought to be implemented in various ways: dynamic competitive pricing [83, 84, 85] and spectrum auctions [86, 87, 88]. While competitive pricing is usually not incentive-compatible and not robust to manipulation by strategic users, carefully designed auctions can potentially be strategy-proof and yield socially optimal outcomes. In many scenarios, we can even operate them as double-sided auctions or markets when there are both buyers and sellers. Unfortunately, for auctions to be practical, they must be operated by a neutral, disinterested party as an auctioneer. Otherwise, the auctioneer can manipulate the auctions to his advantage. This situation is unlikely to arise in most cognitive radio systems.
The spectrum sharing and allocation must happen as a direct result of interaction between a primary user and one or more secondary users.

In information and communication theory, it is well-known that when users use “cooperative communication” schemes, better performance (in terms of higher sum-rate) can be achieved. In fact, using such schemes (e.g., successive interference cancellation (SIC)) can enable spectrum sharing between users without any performance degradation at all for the dominant user. Thus, spectrum sharing with naive coding, i.e., treating interference from other users as noise, can lead to inefficient outcomes as Nash equilibria. Thus, a question arises whether it is possible to alleviate this inefficiency by introducing an incentive alignment mechanism.

Our focus in this chapter is a licensed band setting with cognitive radios where there is a primary user who owns the spectrum band and a secondary user who wants to share the spectrum with the primary user. Spectrum sharing is desirable for two reasons. First, the primary user may not be using the channel all the time. When it is not being used, it can be used by the secondary user. Second, even when the primary user is using the channel, it can still share the spectrum with the secondary user with little or no performance degradation to itself. For example, when there are only two users sharing a Gaussian interference channel, if they both agree to cooperate in doing successive interference cancellation, the dominant user suffers no performance degradation at all. His achievable rate is the same as if the other user were not present.

This, however, does not imply that if the primary user acts as the dominant user, it does not suffer any externality cost due to the presence of the other user. For one, the achievable rate region $R_{SIC}$ is possible only asymptotically in the codeword length. So, with any practical SIC codes, there is going to be some performance loss.
Second, SIC requires cooperation between the users in codebook design. Thus, coding and decoding at the primary transmitter and receiver is more complex than before, again imposing a complexity externality. Even if these issues are ignored, the primary user still may not have an incentive to share the spectrum unless he is compensated for it in some way. Furthermore, in some scenarios complexity considerations can entail that the primary acts as a non-dominant user while the secondary acts as a dominant user.

Thus, it is amply clear that there is a need to introduce incentive alignment schemes that will induce the primary user to share spectrum with the secondary user, and moreover to cooperate in doing so by using advanced communication schemes such as SIC. As we argued before, auction mechanisms [86, 87] are not the right framework for this problem, since there is no independent, impartial entity that can coordinate the auction and act as an auctioneer. Here, the primary user is an interested party with incentives to manipulate the auction. We thus consider this as a principal-agent model [89, 90] where one user (possibly the primary) acts as a principal and offers several contracts to the agent(s) (possibly the secondary user(s)). The agent(s) then picks one of the possible contracts or may reject all of them. We specify the class of contracts that a primary user can offer such that both the primary and the secondary users are able to maximize their individual utilities while still achieving a social welfare objective. The principal-agent model and the contractual mechanism approach to spectrum sharing is new.

We focus on a two-user setting and assume that their radios are sophisticated enough to employ SIC techniques [93]. One receiver takes a dominant role and decodes its own signal last, after decoding the signals of all the other users.
Typically, this would be the primary user, but there can be scenarios where the secondary user radio is more sophisticated and acts as a dominant user. Again, either the primary or the secondary user can be the principal and offer contracts to the other user.

Our findings are the following: (i) Under the full information case, i.e., when all the exact channel coefficients are common knowledge, in general it is impossible to design (first-best) contracts that are Pareto-optimal at equilibrium. Nevertheless, we can specify channel conditions and contract formats which are Pareto-optimal at equilibrium under them. (ii) Even if we allow for time-sharing of a “dominant” role as part of the contractual negotiation, it is still impossible to design contracts that are Pareto-optimal at equilibrium, though as earlier, we can specify various contract formats that do result in Pareto-optimal outcomes under various channel conditions. (iii) Deviation from an agreed contract is a serious concern given the lack of policing. Nevertheless, we show that simple incentive schemes can be devised that make the contracts robust to any deviation post-contract. (iv) The contract design methodology can be extended to multiple secondary users. (v) We also show that under hidden information, when the primary user has a dominant role, neither user has an incentive to lie about their direct channel coefficients or to manipulate the cross channel measurements. This is not the case when the secondary user has a dominant role. Thus, the lesson we learn is that when the information about channel coefficients is hidden, it is better for the primary user to be the dominant user. This yields (second-best) contracts with (first-best) Pareto-optimal outcomes at equilibrium.

2 Decentralized Learning for Multi-player Multi-armed Bandits

In this chapter we consider the problem of distributed online learning with multiple players in multi-armed bandit (MAB) models. Each player can pick among multiple arms.
When a player picks an arm, it gets a reward. We consider both an i.i.d. reward model and a Markovian reward model. In the i.i.d. model, each arm is modelled as an i.i.d. process with an unknown distribution with an unknown mean. In the Markovian model, each arm is modelled as a finite, irreducible, aperiodic and reversible Markov chain with an unknown probability transition matrix and stationary distribution. The arms give different rewards to different players. If two players pick the same arm, there is a “collision”, and neither of them gets any reward. There is no dedicated control channel for coordination or communication among the players. Any other communication between the users is costly and will add to the regret. We propose an online index-based distributed learning policy called the dUCB$_4$ algorithm that trades off exploration vs. exploitation in the right way, and achieves expected regret that grows at most as near-$O(\log^2 T)$. The motivation comes from opportunistic spectrum access by multiple secondary users in cognitive radio networks, wherein they must pick among various wireless channels that look different to different users. To the best of our knowledge, this is the first distributed learning algorithm for multi-player MABs with heterogeneous players (that have player-dependent rewards).

The chapter is organized as follows. In Section 2.1, we present the model and problem formulation. In Sections 2.2 and 2.3 we present some variations on single player MAB with i.i.d. rewards and Markovian rewards, respectively. In Section 2.4, we introduce the decentralized MAB problem with i.i.d. rewards. We then extend the results to the decentralized case with Markovian rewards in Section 2.5. In Section 2.6 we present the distributed bipartite matching algorithm which is used in our main algorithm for decentralized MAB. Section 2.7 concludes the chapter.

2.1 Model and Problem Formulation

2.1.1 Arms with i.i.d. rewards

We consider an $N$-armed bandit with $M$ players.
In a wireless cognitive radio setting [94], each arm could correspond to a channel, and each player to a user who wants to use a channel. Time is slotted, and at each instant each player picks an arm. There is no dedicated control channel for coordination among the players. So, potentially more than one player can pick the same arm at the same instant. We will regard that as a collision. Player $i$ playing arm $k$ at time $t$ yields an i.i.d. reward $S_{ik}(t)$ with density function $f_{ik}(s)$. We will assume that the rewards are bounded, and without loss of generality lie in $[0,1]$. Let $\mu_{i,k}$ denote the mean of $S_{ik}(t)$ w.r.t. the pdf $f_{ik}(s)$. We assume that the players have no information about the means, the distributions or any other statistics of the rewards from the various arms other than what they observe while playing. We also assume that each player can only observe the rewards that they get. When there is a collision, we will assume that all players that choose the arm on which there is a collision get zero reward. This could be relaxed so that the players share the reward in some manner, though the results do not change appreciably.

Let $X_{ij}(t)$ be the reward that player $i$ gets from arm $j$ at time $t$. Thus, if player $i$ plays arm $k$ at time $t$ (and there is no collision), $X_{ik}(t) = S_{ik}(t)$, and $X_{ij}(t) = 0$ for $j \neq k$. Denote the action of player $i$ at time $t$ by $a_i(t) \in \mathcal{A} := \{1,\ldots,N\}$. Then, the history seen by player $i$ at time $t$ is $\mathcal{H}_i(t) = \{(a_i(1), X_{i,a_i(1)}(1)), \cdots, (a_i(t), X_{i,a_i(t)}(t))\}$ with $\mathcal{H}_i(0) = \emptyset$. A policy $\alpha_i = (\alpha_i(t))_{t=1}^{\infty}$ for player $i$ is a sequence of maps $\alpha_i(t) : \mathcal{H}_i(t) \to \mathcal{A}$ that specifies the arm to be played at time $t$ given the history seen by the player. Let $\mathcal{P}(N)$ be the set of vectors such that

$$\mathcal{P}(N) := \{a = (a_1,\ldots,a_M) : a_i \in \mathcal{A},\ a_i \neq a_j \text{ for } i \neq j\}.$$

The players have a team objective: namely, over a time horizon $T$, they want to maximize the expected sum of rewards $E\left[\sum_{t=1}^{T} \sum_{i=1}^{M} X_{i,a_i(t)}(t)\right]$.
If the parameters $\mu_{i,j}$ are known, this could easily be achieved by picking a bipartite matching

$$k^{**} \in \arg\max_{k \in \mathcal{P}(N)} \sum_{i=1}^{M} \mu_{i,k_i}, \tag{2.1}$$

i.e., the optimal bipartite matching with expected reward from each match. Note that this may not be unique. Since the expected rewards $\mu_{i,j}$ are unknown, the players must pick learning policies that minimize the expected regret, defined for policies $\alpha = (\alpha_i, 1 \le i \le M)$ as

$$R_{\alpha}(T) = T \sum_{i} \mu_{i,k_i^{**}} - E_{\alpha}\left[\sum_{t=1}^{T} \sum_{i=1}^{M} X_{i,\alpha_i(t)}(t)\right]. \tag{2.2}$$

Our goal is to find a decentralized algorithm that players can use such that together they minimize the expected regret.

2.1.2 Arms with Markovian rewards

Here we follow the model formulation introduced in the previous subsection, with the exception that the rewards are now considered Markovian. The reward that player $i$ gets from arm $j$ (when there is no collision), $X_{ij}$, is modelled as an irreducible, aperiodic, reversible Markov chain on a finite state space $\mathcal{X}^{i,j}$ and represented by a transition probability matrix $P^{i,j} := \left(p^{i,j}_{x,x'} : x, x' \in \mathcal{X}^{i,j}\right)$. We assume that rewards are bounded and strictly positive, and without loss of generality lie in $(0,1]$. Let $\pi^{i,j} := \left(\pi^{i,j}_x,\ x \in \mathcal{X}^{i,j}\right)$ be the stationary distribution of the Markov chain $P^{i,j}$. The mean reward from arm $j$ for player $i$ is defined as $\mu_{i,j} := \sum_{x \in \mathcal{X}^{i,j}} x\, \pi^{i,j}_x$. Note that the Markov chain represented by $P^{i,j}$ makes a state transition only when player $i$ plays arm $j$. Otherwise, it remains rested.

We note that although we use the ‘big O’ notation to emphasize the regret order, unless otherwise noted the results are non-asymptotic.

2.2 Some variations on single player multi-armed bandit with i.i.d. rewards

We first present some variations on the single player non-Bayesian multi-armed bandit model. These will prove useful later for the multi-player problem, though they should also be of independent interest.

2.2.1 UCB$_1$ with index recomputation every $L$ slots

Consider the classical single player non-Bayesian $N$-armed bandit problem.
At each time $t$, the player picks a particular arm, say $j$, and gets a random reward $X_j(t)$. The rewards $X_j(t)$, $1 \le t \le T$, are independent and identically distributed according to some unknown probability measure with an unknown expectation $\mu_j$. Without loss of generality, assume that $\mu_1 > \mu_i > \mu_N$ for $i = 2, \cdots, N-1$. Let $n_j(t)$ denote the number of times arm $j$ has been played by time $t$. Denote $\Delta_j := \mu_1 - \mu_j$, $\Delta_{\min} := \min_{j, j \neq 1} \Delta_j$ and $\Delta_{\max} := \max_j \Delta_j$. The regret for any policy $\alpha$ is

$$R_{\alpha}(T) := \mu_1 T - \sum_{j=1}^{N} \mu_j E_{\alpha}[n_j(T)]. \tag{2.3}$$

The UCB$_1$ index [4] is defined as

$$g_j(t) := \bar{X}_j(t) + \sqrt{\frac{2\log(t)}{n_j(t)}}, \tag{2.4}$$

where $\bar{X}_j(t)$ is the average reward obtained by playing arm $j$ by time $t$. It is defined as $\bar{X}_j(t) = \sum_{m=1}^{t} r_j(m)/n_j(t)$, where $r_j(m)$ is the reward obtained from arm $j$ at time $m$. If arm $j$ is played at time $m$, then $r_j(m) = X_j(m)$, and otherwise $r_j(m) = 0$. Now, an index-based policy called UCB$_1$ [4] is to pick the arm that has the highest index at each instant. It can be shown that this algorithm achieves regret that grows logarithmically in $T$ non-asymptotically.

An easy variation of the above algorithm, which will be useful in our analysis of subsequent algorithms, is the following. Suppose the index is re-computed only once every $L$ slots. In that case, it is easy to establish the following.

Theorem 1. Under the UCB$_1$ algorithm with recomputation of the index once every $L$ slots, the expected regret by time $T$ satisfies

$$R_{\mathrm{UCB}_1}(T) \le \sum_{j>1}^{N} \frac{8L\log T}{\Delta_j} + L\left(1 + \frac{\pi^2}{3}\right) \sum_{j>1}^{N} \Delta_j. \tag{2.5}$$

The proof follows [4], taking into account the fact that every time a suboptimal arm is selected, it is played for the next $L$ time slots. We omit it due to space considerations.

2.2.2 UCB$_4$ Algorithm when index computation is costly

Often, learning algorithms pay a penalty or cost for computation. This is particularly the case when the algorithms must solve combinatorial optimization problems that are NP-hard.
Such costs also arise in decentralized settings wherein algorithms pay a communication cost for coordination between the decentralized players. This is indeed the case, as we shall see later when we present an algorithm to solve the decentralized multi-armed bandit problem. Here, however, we will just consider an “abstract” communication or computation cost. The problem we formulate below can be solved with better regret bounds than what we present. At this time, though, we are unable to design algorithms with better regret bounds that also help in decentralization.

Consider a computation cost every time the index is recomputed. Let the cost be $C$ units. Let $m(t)$ denote the number of times the index is computed by time $t$. Then, under policy $\alpha$ the expected regret is now given by

$$\tilde{R}_{\alpha}(T) := \mu_1 T - \sum_{j=1}^{N} \mu_j E_{\alpha}[n_j(T)] + C\,E_{\alpha}[m(T)]. \tag{2.6}$$

It is easy to argue that the UCB$_1$ algorithm will give a regret $\Omega(T)$ for this problem. We present an alternative algorithm, called the UCB$_4$ algorithm, that gives sub-linear regret. Define the UCB$_4$ index

$$g_j(t) := \bar{X}_j(t) + \sqrt{\frac{3\log(t)}{n_j(t)}}. \tag{2.7}$$

We define an arm $j^*(t)$ to be the best arm if $j^*(t) \in \arg\max_{1 \le i \le N} g_i(t)$.

Algorithm 1: UCB$_4$
1: Initialization: Select each arm $j$ once for $t \le N$. Update the UCB$_4$ indices. Set $\eta = 1$.
2: while ($t \le T$) do
3:   if ($\eta = 2^p$ for some $p = 0, 1, 2, \cdots$) then
4:     Update the index vector $g(t)$;
5:     Compute the best arm $j^*(t)$;
6:     if ($j^*(t) \neq j^*(t-1)$) then
7:       Reset $\eta = 1$;
8:     end if
9:   else
10:     $j^*(t) = j^*(t-1)$;
11:   end if
12:   Play arm $j^*(t)$;
13:   Increment counter $\eta = \eta + 1$; $t = t + 1$;
14: end while

We will use the following concentration inequality.

Fact 1: Chernoff-Hoeffding inequality [95]. Let $X_1, \ldots, X_t$ be random variables with a common range $[0,1]$ such that $E[X_t \mid X_1, \ldots, X_{t-1}] = \mu$. Let $S_t = \sum_{i=1}^{t} X_i$. Then for all $a \ge 0$,

$$P(S_t \ge t\mu + a) \le e^{-2a^2/t}, \quad \text{and} \quad P(S_t \le t\mu - a) \le e^{-2a^2/t}. \tag{2.8}$$

Theorem 2.
The expected regret for the single player multi-armed bandit problem with per computation cost $C$ using the UCB$_4$ algorithm satisfies

$$\tilde{R}_{\mathrm{UCB}_4}(T) \le \left(\Delta_{\max} + C(1+\log T)\right) \cdot \left(\sum_{j>1}^{N} \frac{12\log T}{\Delta_j^2} + 2N\right).$$

Thus, $\tilde{R}_{\mathrm{UCB}_4}(T) = O(\log^2 T)$.

Proof. We prove this in two steps. First, we compute the expected number of times a suboptimal arm is played, and then the expected number of times we recompute the index.

Consider any suboptimal arm $j > 1$. Denote $c_{t,s} = \sqrt{3\log t/s}$ and the indicator function of the event $A$ by $I\{A\}$. Let $\tau_{j,m}$ be the time at which the player makes the $m$th transition to arm $j$ from another arm, and $\tau'_{j,m}$ be the time at which the player makes the $m$th transition from arm $j$ to another arm. Let $\tilde{\tau}'_{j,m} = \min\{\tau'_{j,m}, T\}$. Then,

$$\begin{aligned}
n_j(T) &\le 1 + \sum_{m=1}^{T} |(\tilde{\tau}'_{j,m} - \tau_{j,m})|\, I\{\text{Arm } j \text{ is picked at time } \tau_{j,m},\ \tau_{j,m} \le T\} \\
&\le 1 + \sum_{m=1}^{T} |(\tilde{\tau}'_{j,m} - \tau_{j,m})|\, I\{g_j(\tau_{j,m}-1) \ge g_1(\tau_{j,m}-1),\ \tau_{j,m} \le T\} \\
&\le l + \sum_{m=1}^{T} |(\tilde{\tau}'_{j,m} - \tau_{j,m})|\, I\{g_j(\tau_{j,m}-1) \ge g_1(\tau_{j,m}-1),\ \tau_{j,m} \le T,\ n_j(\tau_{j,m}-1) \ge l\} \\
&\overset{(a)}{\le} l + \sum_{m=1}^{T} \sum_{p=0}^{\infty} 2^p\, I\{g_j(\tau_{j,m}+2^p-2) \ge g_1(\tau_{j,m}+2^p-2),\ \tau_{j,m}+2^p \le T,\ n_j(\tau_{j,m}-1) \ge l\} \\
&\overset{(b)}{\le} l + \sum_{m=2}^{T} \sum_{p=0}^{\infty} 2^p\, I\{g_j(m+2^p-2) \ge g_1(m+2^p-2),\ m+2^p \le T,\ n_j(m-1) \ge l\} \\
&\le l + \sum_{m=1}^{T} \sum_{p \ge 0,\ m+2^p \le T} 2^p\, I\{\bar{X}_j(m+2^p-1) + c_{m+2^p-1,\,n_j(m+2^p-1)} \ge \bar{X}_1(m+2^p-1) + c_{m+2^p-1,\,n_1(m+2^p-1)},\ n_j(m-1) \ge l\}
\end{aligned} \tag{2.9}$$

$$\begin{aligned}
&\le l + \sum_{m=1}^{T} \sum_{p \ge 0,\ m+2^p \le T} 2^p\, I\Big\{\max_{l \le s_j < m+2^p} \bar{X}_j(m+2^p-1) + c_{m+2^p-1,\,s_j} \ge \min_{1 \le s_1 < m+2^p} \bar{X}_1(m+2^p-1) + c_{m+2^p-1,\,s_1}\Big\} \\
&\le l + \sum_{m=1}^{\infty} \sum_{p \ge 0,\ m+2^p \le T} 2^p \sum_{s_1=1}^{m+2^p} \sum_{s_j=l}^{m+2^p} I\{\bar{X}_j(m+2^p) + c_{m+2^p,\,s_j} \ge \bar{X}_1(m+2^p) + c_{m+2^p,\,s_1}\}.
\end{aligned} \tag{2.10}$$

In Algorithm 1 (UCB$_4$), if an arm is played for the $p$th time consecutively (without switching to any other arms in between), it will be played for the next $2^p$ slots. Inequality (a) uses this fact. In inequality (b), we replace $\tau_{j,m}$ by $m$, which is clearly an upper bound.
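As a concrete companion to the doubling behaviour used in inequality (a), the following is a minimal simulation sketch of Algorithm 1. It is illustrative only: the reward callback `pull`, the function name `ucb4` and the Bernoulli arms in the usage lines are assumptions that do not come from the text, and ties in the index are broken by lowest arm number.

```python
import math
import random


def ucb4(pull, n_arms, horizon):
    """Sketch of UCB_4: indices are recomputed only when eta is a power of two.

    `pull(j)` is a hypothetical callback returning a reward in [0, 1].
    Returns the per-arm play counts and the number of index computations.
    """
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    for j in range(n_arms):            # initialization: play each arm once
        sums[j] += pull(j)
        counts[j] += 1
    t = n_arms + 1
    eta = 1
    best = None
    updates = 0
    while t <= horizon:
        if eta & (eta - 1) == 0:       # eta is 1, 2, 4, 8, ...
            g = [sums[j] / counts[j] + math.sqrt(3.0 * math.log(t) / counts[j])
                 for j in range(n_arms)]
            new_best = max(range(n_arms), key=g.__getitem__)
            updates += 1
            if new_best != best:
                eta = 1                # best arm changed: restart the schedule
            best = new_best
        sums[best] += pull(best)       # otherwise keep playing the previous best arm
        counts[best] += 1
        eta += 1
        t += 1
    return counts, updates


rng = random.Random(0)
means = [0.9, 0.5, 0.2]                # hypothetical Bernoulli arm means
counts, updates = ucb4(lambda j: 1.0 if rng.random() < means[j] else 0.0,
                       len(means), 2000)
```

While the best arm stays the same, the gaps between consecutive index computations double, which is what keeps the number of computations small compared with recomputing in every slot.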
Now, observe that the event $\{\bar{X}_j(m+2^p) + c_{m+2^p,\,s_j} \ge \bar{X}_1(m+2^p) + c_{m+2^p,\,s_1}\}$ implies at least one of the following events,
\[
A := \{\bar{X}_1(m+2^p) \le \mu_1 - c_{m+2^p,\,s_1}\}, \quad B := \{\bar{X}_j(m+2^p) \ge \mu_j + c_{m+2^p,\,s_j}\}, \quad \text{or} \quad C := \{\mu_1 < \mu_j + 2c_{m+2^p,\,s_j}\}. \tag{2.11}
\]
Now, using the Chernoff-Hoeffding bound, we get
\[
P\left(\bar{X}_1(m+2^p) \le \mu_1 - c_{m+2^p,\,s_1}\right) \le (m+2^p)^{-6}, \qquad P\left(\bar{X}_j(m+2^p) \ge \mu_j + c_{m+2^p,\,s_j}\right) \le (m+2^p)^{-6}.
\]
For $l = \lceil 12\log T / \Delta_j^2 \rceil$, the last event in (2.11) is false. In fact,
\[
\mu_1 - \mu_j - 2c_{m+2^p,\,s_j} = \mu_1 - \mu_j - 2\sqrt{3\log(m+2^p)/s_j} \ge \mu_1 - \mu_j - \Delta_j = 0, \quad \text{for } s_j \ge \lceil 12\log T/\Delta_j^2 \rceil.
\]
So, we get,
\[
\begin{aligned}
E[n_j(T)] &\le \left\lceil \frac{12\log T}{\Delta_j^2} \right\rceil + \sum_{m=1}^{\infty} \sum_{p=0}^{\infty} 2^p \sum_{s_1=1}^{m+2^p} \sum_{s_j=1}^{m+2^p} 2(m+2^p)^{-6} \\
&\le \left\lceil \frac{12\log T}{\Delta_j^2} \right\rceil + 2\sum_{m=1}^{\infty} \sum_{p=0}^{\infty} 2^p (m+2^p)^{-4} \le \frac{12\log T}{\Delta_j^2} + 2.
\end{aligned} \tag{2.12}
\]
Next, we upper-bound the expectation of $m(T)$, the number of index computations performed by time $T$. We can write $m(T) = m_1(T) + m_2(T)$, where $m_1(T)$ is the number of index updates that result in an optimal allocation, and $m_2(T)$ is the number of index updates that result in a suboptimal allocation. Clearly, the number of updates resulting in a suboptimal allocation is no more than the number of times a suboptimal arm is played. Thus,
\[
E[m_2(T)] \le \sum_{j>1}^{N} E[n_j(T)]. \tag{2.13}
\]
To bound $E[m_1(T)]$, let $\tau_l$ be the time at which the player makes the $l$th transition to an optimal arm from a suboptimal arm and $\tau'_l$ be the time at which the player makes the $l$th transition from an optimal arm to a suboptimal arm. Then, $m_1(T) \le \sum_{l=1}^{n_{sub}(T)} \log|\tau_l - \tau'_l|$, where $n_{sub}(T)$ is the total number of such transitions by time $T$. Clearly, $n_{sub}(T)$ is upper-bounded by the total number of times the player picks a sub-optimal arm. Also, $\log|\tau_l - \tau'_l| \le \log T$. So,
\[
E[m_1(T)] \le \sum_{j>1}^{N} E[n_j(T)] \cdot \log T. \tag{2.14}
\]
Thus, from bounds (2.13) and (2.14), we get
\[
E[m(T)] \le \sum_{j>1}^{N} E[n_j(T)] \cdot (1+\log T). \tag{2.15}
\]
Now, using equation (2.6), the expected regret is
\[
\tilde{R}_{UCB_4}(T) = \sum_{j>1}^{N} E[n_j(T)] \cdot \Delta_j + C\,E[m(T)] \le \Delta_{\max} \sum_{j>1}^{N} E[n_j(T)] + C\,E[m(T)] \le \left(\Delta_{\max} + C(1+\log T)\right) \sum_{j>1}^{N} E[n_j(T)],
\]
by using (2.15).
Now, by bound (2.12), we get the desired bound on the expected regret.

Remarks.
1. It is easy to show that the lower bound for the single player MAB problem with computation costs is $\Omega(\log T)$. This can be achieved by the UCB$_2$ algorithm [4]. To see this, note that the number of times the player selects a suboptimal arm when using UCB$_2$ is $O(\log T)$. Since $E[n_j(T)] = O(\log T)$, we get $E[\sum_{j>1}^{N} n_j(T)] = O(\log T)$, and also $E[m_2(T)] = O(\log T)$. Now, since the epochs are not reset after every switch and are exponentially spaced, the number of updates that result in the optimal allocation satisfies $m_1(T) \le \log T$. These together yield
\[
\tilde{R}_{UCB_2}(T) \le \sum_{j>1}^{N} E[n_j(T)] \cdot \Delta_j + C\,E[m(T)] = O(\log T).
\]
2. However, it is unknown at this time if UCB$_2$ can be decentralized. This is the main reason for introducing the UCB$_4$ algorithm.

2.2.3 Algorithms with finite precision indices

Often, the indices might be known only up to a certain precision. This can happen either when an implementation has only a finite number of bits available to represent indices, or when there is a cost to compute the indices to greater precision. For example, as we will see in the decentralized setting, indices must be communicated to other players. Since only a finite number of bits can be communicated in finite time, the algorithm should be able to work with indices of limited precision. Thus, when indices are known up to some $\epsilon$ precision, it may not be possible to tell which of two indices is greater if they are within $\epsilon$ of each other. This becomes important because it can empirically be observed (see Figure ??) that indices track each other closely in index based algorithms. The question then is, how are the performances of various index-based policies affected if there are limits on index resolution, and only an arm with an $\epsilon$-highest index can be picked.
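The UCB$_4$ procedure of Algorithm 1, together with the $\epsilon$-precision rule just described, can be illustrated by a small simulation. The Python sketch below is only an illustration under stated assumptions: Bernoulli rewards, the function name `ucb4`, its parameters (`means`, `horizon`, `eps`, `seed`), and the tie rule (keep the current arm while its index is within `eps` of the largest index) are choices made here and do not come from the original text.

```python
import math
import random


def ucb4(means, horizon, eps=0.0, seed=0):
    """Simulate UCB_4 on Bernoulli arms (an assumed reward model).

    Returns (plays per arm, number of index computations m(T)).
    """
    rng = random.Random(seed)
    n_arms = len(means)
    counts = [0] * n_arms   # n_j(t): number of plays of arm j
    sums = [0.0] * n_arms   # cumulative reward of arm j

    def pull(j):
        reward = 1.0 if rng.random() < means[j] else 0.0
        counts[j] += 1
        sums[j] += reward

    # Initialization: play each arm once.
    for j in range(n_arms):
        pull(j)

    t = n_arms + 1
    eta = 1            # counter; indices are recomputed when eta is a power of 2
    best = None        # currently selected arm j*(t)
    computations = 0   # m(T)

    while t <= horizon:
        if eta & (eta - 1) == 0:  # eta = 2^p for some p >= 0
            index = [
                sums[j] / counts[j] + math.sqrt(3.0 * math.log(t) / counts[j])
                for j in range(n_arms)
            ]
            computations += 1
            top = max(index)
            # Assumed eps-precision rule: switch only if the current arm is
            # no longer within eps of the highest index.
            if best is None or index[best] < top - eps:
                best = index.index(top)
                eta = 1  # reset the counter after a switch
        pull(best)
        eta += 1
        t += 1

    return counts, computations
```

For instance, `ucb4([0.9, 0.1], 5000)` returns the play counts of the two arms and the number of index computations; since the counter is reset only on a switch, the number of computations stays far below the horizon once the selected arm stabilizes.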
We first show that if $\Delta_{\min}$ is known, we can fix a precision $0 < \epsilon < \Delta_{\min}$, so that the UCB$_4$ algorithm will achieve order log-squared regret growth with $T$. If $\Delta_{\min}$ is not known, we can pick a positive monotone sequence $\{\epsilon_t\}$ such that $\epsilon_t \to 0$ as $t \to \infty$. Denote the cost of computation for $\epsilon$-precision by $C(\epsilon)$. We assume that $C(\epsilon) \to \infty$ monotonically as $\epsilon \to 0$.

Theorem 3. (i) If $\Delta_{\min}$ is known, choose an $0 < \epsilon < \Delta_{\min}$. Then, the expected regret of the UCB$_4$ algorithm with $\epsilon$-precise computations is given by
\[
\tilde{R}_{UCB_4}(T) \le \left(\Delta_{\max} + C(\epsilon)(1+\log T)\right) \cdot \left( \sum_{j>1}^{N} \frac{12\log T}{(\Delta_j - \epsilon)^2} + 2N \right).
\]
Thus, $\tilde{R}_{UCB_4}(T) = O(\log^2 T)$.
(ii) If $\Delta_{\min}$ is unknown, denote $\epsilon_{\min} = \Delta_{\min}/2$ and choose a positive monotone sequence $\{\epsilon_t\}$ such that $\epsilon_t \to 0$ as $t \to \infty$. Then, there exists a $t_0 > 0$ such that for all $T > t_0$,
\[
\tilde{R}_{UCB_4}(T) \le \left(\Delta_{\max} + C(\epsilon_{\min})\right) t_0 + \left(\Delta_{\max} + C(\epsilon_T)(1+\log T)\right) \cdot \left( \sum_{j>1}^{N} \frac{12\log T}{(\Delta_j - \epsilon_{\min})^2} + 2N \right),
\]
where $t_0$ is the smallest $t$ such that $\epsilon_{t_0} < \epsilon_{\min}$. Thus by choosing an arbitrarily slowly decreasing sequence $\{\epsilon_t\}$, we can make the regret arbitrarily close to $O(\log^2 T)$ asymptotically.

Proof. (i) The proof is only a slight modification of the proof given in Theorem 2. Due to the $\epsilon$ precision, the player will pick a suboptimal arm if the event $\{\bar{X}_j(m+2^p) + c_{m+2^p,\,s_j} + \epsilon \ge \bar{X}_1(m+2^p) + c_{m+2^p,\,s_1}\}$ occurs. Thus equation (2.9) becomes,
\[
n_j(T) \le l + \sum_{m=1}^{\infty} \sum_{p \ge 0,\ m+2^p \le T} 2^p \sum_{s_1=1}^{m+2^p} \sum_{s_j=l}^{m+2^p} I\{\bar{X}_j(m+2^p) + c_{m+2^p,\,s_j} + \epsilon \ge \bar{X}_1(m+2^p) + c_{m+2^p,\,s_1}\}.
\]
Now, the event $\{\bar{X}_j(m+2^p) + c_{m+2^p,\,s_j} + \epsilon \ge \bar{X}_1(m+2^p) + c_{m+2^p,\,s_1}\}$ implies that at least one of the following events must occur:
\[
A := \{\bar{X}_1(m+2^p) \le \mu_1 - c_{m+2^p,\,s_1}\}, \quad B := \{\bar{X}_j(m+2^p) \ge \mu_j + \epsilon + c_{m+2^p,\,s_j}\}, \quad C := \{\mu_1 < \mu_j + \epsilon + 2c_{m+2^p,\,s_j}\}, \quad \text{or} \quad D := \{\mu_1 < \mu_j + \epsilon\}. \tag{2.16}
\]
Since $\{\bar{X}_j(m+2^p) \ge \mu_j + \epsilon + c_{m+2^p,\,s_j}\} \subseteq \{\bar{X}_j(m+2^p) \ge \mu_j + c_{m+2^p,\,s_j}\}$, we have
\[
P(\{\bar{X}_j(m+2^p) \ge \mu_j + \epsilon + c_{m+2^p,\,s_j}\}) \le P(\{\bar{X}_j(m+2^p) \ge \mu_j + c_{m+2^p,\,s_j}\}).
\]
Also, for $l = \lceil 12\log T/(\Delta_j - \epsilon)^2 \rceil$, the event $C$ cannot happen.
In fact,
\[
\mu_1 - \mu_j - \epsilon - 2c_{m+2^p,\,s_j} = \mu_1 - \mu_j - \epsilon - 2\sqrt{\frac{3\log(m+2^p)}{s_j}} \ge \mu_1 - \mu_j - \epsilon - (\Delta_j - \epsilon) = 0, \quad \text{for } s_j \ge \lceil 12\log T/(\Delta_j - \epsilon)^2 \rceil.
\]
If $\epsilon < \Delta_{\min}$, the last event ($D$) in equation (2.16) is also not true. Thus, for $0 < \epsilon < \Delta_{\min}$, we get
\[
E[n_j(T)] \le \frac{12\log T}{(\Delta_j - \epsilon)^2} + 2. \tag{2.17}
\]
The rest of the proof is the same as in Theorem 2. Now, if $\Delta_{\min}$ is known, we can choose $0 < \epsilon < \Delta_{\min}$ and by Theorem 2 and bound (2.17), we get the desired result.
(ii) If $\Delta_{\min}$ is unknown, we can choose a positive monotone sequence $\{\epsilon_t\}$ such that $\epsilon_t \to 0$ as $t \to \infty$. Thus, there exists a $t_0$ such that for $t > t_0$, $\epsilon_t < \epsilon_{\min}$. We may get a linear regret up to time $t_0$ but after that the analysis follows that in the proof of Theorem 2, and regret grows only sub-linearly. Since $C(\cdot)$ is monotone, $C(\epsilon_T) > C(\epsilon_t)$ for all $t < T$. The last part can now be trivially established using the obtained bound on the expected regret.

2.3 Single Player Multi-armed Bandit with Markovian Rewards

Now, we consider the scenario where the rewards obtained from an arm are not i.i.d. but come from a Markov chain. The reward from each arm is modeled as an irreducible, aperiodic Markov chain on a finite state space $\mathcal{X}^i$ and represented by a transition probability matrix $P^i := \left(p^i_{x,x'} : x, x' \in \mathcal{X}^i\right)$. Assume that the reward space $\mathcal{X}^i \subseteq (0,1]$. Let $X_i(1), X_i(2), \dots$ denote the successive rewards from arm $i$. All arms are mutually independent. Let $\pi^i := \left(\pi^i_x, x \in \mathcal{X}^i\right)$ be the stationary distribution of the Markov chain $P^i$. Since the Markov chains are ergodic under these assumptions, the mean reward from arm $i$ is given by $\mu_i := \sum_{x \in \mathcal{X}^i} x\,\pi^i_x$. Without loss of generality, assume that $\mu_1 > \mu_i > \mu_N$, for $i = 2, \dots, N-1$. As before, $n_j(t)$ denotes the number of times arm $j$ has been played by time $t$. Denote $\Delta_j := \mu_1 - \mu_j$, $\Delta_{\min} := \min_{j,\,j \ne 1} \Delta_j$ and $\Delta_{\max} := \max_j \Delta_j$. Denote $\pi_{\min} := \min_{1 \le i \le N,\ x \in \mathcal{X}^i} \pi^i_x$, $x_{\max} := \max_{1 \le i \le N,\ x \in \mathcal{X}^i} x$ and $x_{\min} := \min_{1 \le i \le N,\ x \in \mathcal{X}^i} x$. Let $\hat{\pi}^i_x := \max\{\pi^i_x, 1 - \pi^i_x\}$ and $\hat{\pi}_{\max} := \max_{1 \le i \le N,\ x \in \mathcal{X}^i} \hat{\pi}^i_x$.
Let $|\mathcal{X}^i|$ denote the cardinality of the state space $\mathcal{X}^i$, and $|\mathcal{X}|_{\max} := \max_{1 \le i \le N} |\mathcal{X}^i|$. Let $P^{i\prime}$ be the adjoint of $P^i$ on $l_2(\pi^i)$, defined as
\[
P^{i\prime}_{x,y} := \pi_y P^i_{y,x} / \pi_x. \tag{2.18}
\]
Then, $\tilde{P}^i := P^{i\prime} P^i$ denotes the multiplicative symmetrization of $P^i$. We assume that the $P^i$s are such that the $\tilde{P}^i$s are irreducible. Clearly, this is a less stringent assumption than the reversibility of the matrices $P^i$. Also, if $P^i_{x,x} > 0$ for all $x \in \mathcal{X}^i$, then it is easy to show that the $\tilde{P}^i$s are irreducible. Let $\rho^i$ be the eigenvalue gap, $1 - \lambda_2$, where $\lambda_2$ is the second largest eigenvalue of the matrix $\tilde{P}^i$. Denote $\rho_{\max} := \max_{1 \le i \le N} \rho^i$ and $\rho_{\min} := \min_{1 \le i \le N} \rho^i$, where $\rho^i$ is the eigenvalue gap of the $i$th arm.

The total reward obtained by time $T$ is then given by $S_T = \sum_{j=1}^{N} \sum_{s=1}^{n_j(T)} X_j(s)$. The regret for any policy $\alpha$ is defined as
\[
\tilde{R}_{M,\alpha}(T) := \mu_1 T - E_{\alpha} \sum_{j=1}^{N} \sum_{s=1}^{n_j(T)} X_j(s) + C\,E_{\alpha}[m(T)], \tag{2.19}
\]
where $C$ is the cost per computation and $m(T)$ is the number of times the index is computed by time $T$, as described in section 2.2. Define the index
\[
g_j(t) := \bar{X}_j(t) + \sqrt{\frac{\gamma\log(t)}{n_j(t)}}, \tag{2.20}
\]
where $\bar{X}_j(t)$ is the average reward obtained by playing arm $j$ by time $t$, as defined in the previous section. $\gamma$ can be any constant satisfying $\gamma > 168|\mathcal{X}|^2_{\max}/\rho_{\min}$.

We introduce one more notation here. If $\mathcal{F}$ and $\mathcal{G}$ are two $\sigma$-algebras, then $\mathcal{F} \vee \mathcal{G}$ denotes the smallest $\sigma$-algebra containing $\mathcal{F}$ and $\mathcal{G}$. Similarly, if $\{\mathcal{F}_t, t = 1, 2, \dots\}$ is a collection of $\sigma$-algebras, then $\vee_{t \ge 1} \mathcal{F}_t$ denotes the smallest $\sigma$-algebra containing $\mathcal{F}_1, \mathcal{F}_2, \dots$. We will use the following results on Markov chains.

Lemma 1. [5] Let $(X_t, t = 1, 2, \dots)$ be an irreducible, aperiodic Markov chain on a finite state space $\mathcal{X}$ with transition probability matrix $P$, an initial distribution that is non-zero in all states and a stationary distribution $\pi$. Let $\mathcal{F}_t$ be the $\sigma$-algebra generated by $(X_1, X_2, \dots, X_t)$. Let $\mathcal{G}$ be a $\sigma$-algebra independent of $\mathcal{F} = \vee_{t \ge 1} \mathcal{F}_t$. Let $\tau$ be a stopping time of $\mathcal{F}_t \vee \mathcal{G}$. Let $N(x,\tau) := \sum_{t=1}^{\tau} I\{X_t = x\}$.
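Then, $|E[N(x,\tau)] - \pi_x E[\tau]| \le K$, where $K \le 1/\pi_{\min}$ and $\pi_{\min} = \min_{x \in \mathcal{X}} \pi_x$. $K$ depends on $P$.

Lemma 2. [96] Let $(X_t, t = 1, 2, \dots)$ be an irreducible, aperiodic Markov chain on a finite state space $\mathcal{X}$ with transition probability matrix $P$, an initial distribution $\lambda$ and a stationary distribution $\pi$. Denote $N_{\lambda} = \left\| \left( \frac{\lambda_x}{\pi_x}, x \in \mathcal{X} \right) \right\|_2$. Let $\rho$ be the eigenvalue gap, $1 - \lambda_2$, where $\lambda_2$ is the second largest eigenvalue of the matrix $\tilde{P}$, where $\tilde{P}$ is the multiplicative symmetrization of the transition matrix $P$. Let $f : \mathcal{X} \to \mathbb{R}$ be such that $\sum_{x \in \mathcal{X}} \pi_x f(x) = 0$, $\|f\|_{\infty} \le 1$, $\|f\|_2^2 \le 1$. If $\tilde{P}$ is irreducible, then for any $\gamma > 0$,
\[
P\left( \sum_{a=1}^{t} f(X_a)/t \ge \gamma \right) \le N_{\lambda}\, e^{-t\rho\gamma^2/28}.
\]
The following can be derived easily from Lemma 1.

Lemma 3. If the reward of each arm is given by a Markov chain satisfying the hypothesis of Lemma 1, then under any policy $\alpha$ we have
\[
\tilde{R}_{M,\alpha}(T) \le \sum_{j=2}^{N} \Delta_j E_{\alpha}[n_j(T)] + K_{\mathcal{X},P} + C\,E_{\alpha}[m(T)], \tag{2.21}
\]
where $K_{\mathcal{X},P} = \sum_{j=1}^{N} \sum_{x \in \mathcal{X}^j} x/\pi^j_{\min}$ and $\pi^j_{\min} = \min_{x \in \mathcal{X}^j} \pi^j_x$.

Proof. Let $X_j(1), X_j(2), \dots$ denote the successive rewards from arm $j$. Let $\mathcal{F}^j_t$ denote the $\sigma$-algebra generated by $(X_j(1), \dots, X_j(t))$. Let $\mathcal{F}^j = \vee_{t \ge 1} \mathcal{F}^j_t$ and $\mathcal{G}^j = \vee_{i \ne j} \mathcal{F}^i$. Since arms are independent, $\mathcal{G}^j$ is independent of $\mathcal{F}^j$. Clearly, $n_j(T)$ is a stopping time with respect to $\mathcal{G}^j \vee \mathcal{F}^j_T$. The total reward is
\[
S_T = \sum_{j=1}^{N} \sum_{s=1}^{n_j(T)} X_j(s) = \sum_{j=1}^{N} \sum_{x \in \mathcal{X}^j} x\,N(x, n_j(T)), \quad \text{where } N(x, n_j(T)) := \sum_{t=1}^{n_j(T)} I\{X_j(t) = x\}.
\]
Taking the expectation and using Lemma 1, we have
\[
\left| E[S_T] - \sum_{j=1}^{N} \sum_{x \in \mathcal{X}^j} x\,\pi^j_x E[n_j(T)] \right| \le \sum_{j=1}^{N} \sum_{x \in \mathcal{X}^j} x/\pi^j_{\min},
\]
which implies
\[
\left| E[S_T] - \sum_{j=1}^{N} \mu_j E[n_j(T)] \right| \le K_{\mathcal{X},P},
\]
where $K_{\mathcal{X},P} = \sum_{j=1}^{N} \sum_{x \in \mathcal{X}^j} x/\pi^j_{\min}$. Since the regret is $\tilde{R}_{M,\alpha}(T) = \mu_1 T - E_{\alpha} \sum_{j=1}^{N} \sum_{t=1}^{n_j(T)} X_j(t) + C\,E_{\alpha}[m(T)]$ (c.f. equation (2.19)), we get
\[
\left| \tilde{R}_{M,\alpha}(T) - \left( \mu_1 T - \sum_{j=1}^{N} \mu_j E[n_j(T)] + C\,E_{\alpha}[m(T)] \right) \right| \le K_{\mathcal{X},P}.
\]

Theorem 4. (i) If $|\mathcal{X}|_{\max}$ and $\rho_{\min}$ are known, choose $\gamma > 168|\mathcal{X}|^2_{\max}/\rho_{\min}$.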
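Then, the expected regret using the UCB$_4$ algorithm with the index defined as in (2.20) for the single player multi-armed bandit problem with Markovian rewards and per computation cost $C$ is given by
\[
\tilde{R}_{M,UCB_4}(T) \le \left(\Delta_{\max} + C(1+\log T)\right) \cdot \left( \sum_{j>1}^{N} \frac{4\gamma\log T}{\Delta_j^2} + N(2D+1) \right) + K_{\mathcal{X},P},
\]
where $D = \frac{|\mathcal{X}|_{\max}}{\pi_{\min}}$. Thus, $\tilde{R}_{M,UCB_4}(T) = O(\log^2 T)$.
(ii) If $|\mathcal{X}|_{\max}$ and $\rho_{\min}$ are not known, choose a positive monotone sequence $\{\gamma_t\}$ such that $\gamma_t \to \infty$ as $t \to \infty$ and $\gamma_t \le t$. Then, $\tilde{R}_{M,UCB_4}(T) = O(\gamma_T \log^2 T)$. Thus, by choosing an arbitrarily slowly increasing sequence $\{\gamma_t\}$ we can make the regret arbitrarily close to $\log^2 T$.

Proof. (i) Consider any suboptimal arm $j > 1$. Denote $c_{t,s} = \sqrt{\gamma\log t/s}$. As in the proof of Theorem 2, we start by bounding $n_j(T)$. The initial steps are the same as in the proof of Theorem 2. So, we skip those steps and start from the inequality (2.9) there:
\[
n_j(T) \le l + \sum_{m=1}^{\infty} \sum_{p \ge 0,\ m+2^p \le T} 2^p \sum_{s_1=1}^{m+2^p} \sum_{s_j=l}^{m+2^p} I\{\bar{X}_j(m+2^p) + c_{m+2^p,\,s_j} \ge \bar{X}_1(m+2^p) + c_{m+2^p,\,s_1}\}.
\]
The event $\{\bar{X}_j(m+2^p) + c_{m+2^p,\,s_j} \ge \bar{X}_1(m+2^p) + c_{m+2^p,\,s_1}\}$ is true only if at least one of the events shown in display (2.11) is true. We note that, for any initial distribution $\lambda^j$ for arm $j$,
\[
N_{\lambda^j} = \left\| \left( \frac{\lambda^j_x}{\pi^j_x}, x \in \mathcal{X}^j \right) \right\|_2 \le \sum_{x \in \mathcal{X}^j} \left\| \frac{\lambda^j_x}{\pi^j_x} \right\|_2 \le \frac{1}{\pi_{\min}}. \tag{2.22}
\]
Also, $x_{\max} \le 1$. Let $n^j_x(s_j)$ be the number of times the state $x$ is observed when arm $j$ is pulled $s_j$ times. Then, the probability of the event $B$ in (2.11) is
\[
\begin{aligned}
P\left(\bar{X}_j(m+2^p) \ge \mu_j + c_{m+2^p,\,s_j}\right) &= P\left( \sum_{x \in \mathcal{X}^j} x\,n^j_x(s_j) \ge s_j \sum_{x \in \mathcal{X}^j} x\,\pi^j_x + s_j c_{m+2^p,\,s_j} \right) \\
&= P\left( \sum_{x \in \mathcal{X}^j} \left(n^j_x(s_j) - s_j\pi^j_x\right) \ge s_j c_{m+2^p,\,s_j}/x \right) \\
&\overset{(a)}{\le} \sum_{x \in \mathcal{X}^j} P\left( n^j_x(s_j) - s_j\pi^j_x \ge \frac{s_j c_{m+2^p,\,s_j}}{x|\mathcal{X}^j|} \right) \\
&= \sum_{x \in \mathcal{X}^j} P\left( \frac{\sum_{t=1}^{s_j} I\{X_j(t) = x\} - s_j\pi^j_x}{s_j\hat{\pi}^j_x} \ge \frac{c_{m+2^p,\,s_j}}{x|\mathcal{X}^j|\hat{\pi}^j_x} \right) \\
&\overset{(b)}{\le} \sum_{x \in \mathcal{X}^j} N_{\lambda^j}\,(m+2^p)^{-\gamma\rho^j/\left(28x^2|\mathcal{X}^j|^2(\hat{\pi}^j_x)^2\right)} \\
&\overset{(c)}{\le} \frac{|\mathcal{X}|_{\max}}{\pi_{\min}}\,(m+2^p)^{-\gamma\rho_{\min}/\left(28|\mathcal{X}|^2_{\max}\right)}.
\end{aligned}
\]
The inequality (a) follows after some simple algebra, which we skip due to space limitations.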
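The inequality (b) follows by defining the function $f(X_j(t)) = (I\{X_j(t) = x\} - \pi^j_x)/\hat{\pi}^j_x$ and using Lemma 2. For inequality (c) we used the facts that $N_{\lambda^j} \le 1/\pi_{\min}$, $x_{\max} \le 1$ and $\hat{\pi}_{\max} \le 1$. Thus,
\[
P\left(\bar{X}_j(m+2^p) \ge \mu_j + c_{m+2^p,\,s_j}\right) \le D\,(m+2^p)^{-\gamma\rho_{\min}/\left(28|\mathcal{X}|^2_{\max}\right)}, \tag{2.23}
\]
where $D = \frac{|\mathcal{X}|_{\max}}{\pi_{\min}}$. Similarly we can get,
\[
P\left(\bar{X}_1(m+2^p) \le \mu_1 - c_{m+2^p,\,s_1}\right) \le D\,(m+2^p)^{-\gamma\rho_{\min}/\left(28|\mathcal{X}|^2_{\max}\right)}. \tag{2.24}
\]
For $l = \lceil 4\gamma\log T/\Delta_j^2 \rceil$, the last event in (2.11) is false. In fact,
\[
\mu_1 - \mu_j - 2c_{m+2^p,\,s_j} = \mu_1 - \mu_j - 2\sqrt{\gamma\log(m+2^p)/s_j} \ge \mu_1 - \mu_j - \Delta_j = 0, \quad \text{for } s_j \ge \lceil 4\gamma\log T/\Delta_j^2 \rceil.
\]
Thus,
\[
\begin{aligned}
E[n_j(T)] &\le \left\lceil \frac{4\gamma\log T}{\Delta_j^2} \right\rceil + \sum_{m=1}^{\infty} \sum_{p=0}^{\infty} 2^p \sum_{s_1=1}^{m+2^p} \sum_{s_j=1}^{m+2^p} 2D\,(m+2^p)^{-\frac{\gamma\rho_{\min}}{28|\mathcal{X}|^2_{\max}}} \\
&= \left\lceil \frac{4\gamma\log T}{\Delta_j^2} \right\rceil + 2D \sum_{m=1}^{\infty} \sum_{p=0}^{\infty} 2^p\,(m+2^p)^{-\frac{\gamma\rho_{\min} - 56|\mathcal{X}|^2_{\max}}{28|\mathcal{X}|^2_{\max}}}.
\end{aligned} \tag{2.25}
\]
When $\gamma > 168|\mathcal{X}|^2_{\max}/\rho_{\min}$, the exponent is smaller than $-4$, the above summation converges to a value less than 1, and we get
\[
E[n_j(T)] \le \frac{4\gamma\log T}{\Delta_j^2} + (2D+1). \tag{2.26}
\]
Now, from the proof of Theorem 2 (equation (2.15)),
\[
E[m(T)] \le \sum_{j>1}^{N} E[n_j(T)] \cdot (1+\log T). \tag{2.27}
\]
Now, using inequality (2.21), the expected regret satisfies
\[
\tilde{R}_{M,UCB_4}(T) \le \sum_{j>1}^{N} E[n_j(T)] \cdot \Delta_j + C\,E[m(T)] + K_{\mathcal{X},P} \le \Delta_{\max} \sum_{j>1}^{N} E[n_j(T)] + C\,E[m(T)] + K_{\mathcal{X},P} \le \left(\Delta_{\max} + C(1+\log T)\right) \sum_{j>1}^{N} E[n_j(T)] + K_{\mathcal{X},P},
\]
by using (2.27). Now, by bound (2.26), we get the desired bound on the expected regret.
(ii) Replacing $\gamma$ with $\gamma_t$, equation (2.25) becomes
\[
E[n_j(T)] \le \left\lceil \frac{4\gamma_T\log T}{\Delta_j^2} \right\rceil + 2D \sum_{m=1}^{\infty} \sum_{p=0}^{\infty} 2^p\,(m+2^p)^{-\frac{\gamma_{m+2^p}\rho_{\min} - 56|\mathcal{X}|^2_{\max}}{28|\mathcal{X}|^2_{\max}}}.
\]
Since $\gamma_t \to \infty$ as $t \to \infty$, the exponent $-\frac{\gamma_{m+2^p}\rho_{\min} - 56|\mathcal{X}|^2_{\max}}{28|\mathcal{X}|^2_{\max}}$ becomes smaller than $-4$ for sufficiently large $m$ and $p$, and the above summation converges, yielding the desired result.

We note that the results for Markovian rewards just presented extend easily to finite precision indices. As before, suppose the cost of computation for $\epsilon$-precision is $C(\epsilon)$. We assume that $C(\epsilon) \to \infty$ monotonically as $\epsilon \to 0$. We will use this result in Section 2.5.

Theorem 5. (i) If $\Delta_{\min}$, $|\mathcal{X}|_{\max}$ and $\rho_{\min}$ are known, choose an $0 < \epsilon < \Delta_{\min}$, and a $\gamma > 168|\mathcal{X}|^2_{\max}/\rho_{\min}$.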
Then, the expected regret using the UCB$_4$ algorithm with the index defined as in (2.20) for the single player multi-armed bandit problem with Markovian rewards with $\epsilon$-precise computations is given by
\[
\tilde{R}_{M,UCB_4}(T) \le \left(\Delta_{\max} + C(\epsilon)(1+\log T)\right) \cdot \left( \sum_{j>1}^{N} \frac{4\gamma\log T}{(\Delta_j - \epsilon)^2} + N(2D+1) \right),
\]
where $D = \frac{|\mathcal{X}|_{\max}}{\pi_{\min}}$. Thus, $\tilde{R}_{M,UCB_4}(T) = O(\log^2 T)$.
(ii) If $\Delta_{\min}$, $|\mathcal{X}|_{\max}$ and $\rho_{\min}$ are unknown, choose positive monotone sequences $\{\epsilon_t\}$ and $\{\gamma_t\}$ such that $\gamma_t \le t$, $\epsilon_t \to 0$ and $\gamma_t \to \infty$ as $t \to \infty$. Then, $\tilde{R}_{M,UCB_4}(T) = O(C(\epsilon_T)\gamma_T\log^2 T)$. We can choose $\{\epsilon_t\}$ and $\{\gamma_t\}$ as two arbitrarily slowly decreasing and increasing sequences respectively, and the regret can be made arbitrarily close to $\log^2(T)$.

The proof follows by a combination of the proofs of Theorems 3 and 4, and is omitted.

2.4 The Decentralized MAB problem with i.i.d. rewards

We now consider the decentralized multi-armed bandit problem with i.i.d. rewards wherein multiple players play at the same time. Players have no information about the means or distributions of rewards from the various arms. If two or more players pick the same arm, we assume that neither gets any reward. This is an online learning problem of distributed bipartite matching.

Communication requirement: We assume that there are no dedicated control channels for coordination or communication between the players. However, we do allow players to communicate with each other by pulling arms in a specific order. For example, a player could communicate a bit sequence to other players by picking a designated arm to indicate 'bit 1' and any other arm for 'bit 0'. However, such a communication overhead would add to regret and affect the learning rate.

Distributed algorithms for bipartite matching are known [97, 98] which determine an $\epsilon$-optimal matching with a 'minimum' amount of information exchange and computation. However, every run of this distributed bipartite matching algorithm incurs a cost due to the computation and communication necessary to exchange some information for decentralization.
Let $C$ be the cost per run, and $m(t)$ denote the number of times the distributed bipartite matching algorithm is run by time $t$. Then, under policy $\alpha$ the expected regret is
\[
R_{\alpha}(T) = T \sum_{i=1}^{M} \mu_{i,k^{**}_i} - E_{\alpha}\left[ \sum_{t=1}^{T} \sum_{i=1}^{M} X_{i,\alpha_i(t)}(t) \right] + C\,E[m(T)], \tag{2.28}
\]
where $k^{**}$ is the optimal matching as defined in equation (2.1) in section 2.1.1.

Temporal Structure. We divide time into frames. Each frame is one of two kinds: a decision frame, and an exploitation frame. In the decision frame, the index is recomputed, and the distributed bipartite matching algorithm is run again to determine the new matching. The length of such a frame can be seen as the cost of the algorithm. We further divide the decision frame into two phases, a negotiation phase and an interrupt phase (see Figure 2.1). The information exchange needed to compute an $\epsilon$-optimal matching is done in the negotiation phase. In the interrupt phase, a player signals to other players if his allocation has changed. In the exploitation frame, the current matching is exploited without updating the indices. Later, we will allow the frame lengths to increase with time.

We now present the dUCB$_4$ algorithm, a decentralized version of UCB$_4$. For each player $i$ and each arm $j$, we define a dUCB$_4$ index at the end of frame $t$ as
\[
g_{i,j}(t) := \bar{X}_{i,j}(t) + \sqrt{\frac{(M+2)\log n_i(t)}{n_{i,j}(t)}}, \tag{2.29}
\]
where $n_i(t)$ is the number of successful plays (without collisions) of player $i$ by frame $t$, and $n_{i,j}(t)$ is the number of times player $i$ picks arm $j$ successfully by frame $t$. $\bar{X}_{i,j}(t)$ is the sample mean of rewards from arm $j$ for player $i$ from $n_{i,j}(t)$ samples. Let $g(t)$ denote the vector $(g_{i,j}(t), 1 \le i \le M, 1 \le j \le N)$. We will refer to an $\epsilon$-optimal distributed bipartite matching algorithm as dBM$_{\epsilon}(g(t))$; it yields a solution $k^*(t) := (k^*_1(t), \dots, k^*_M(t)) \in \mathcal{P}(N)$ such that $\sum_{i=1}^{M} g_{i,k^*_i(t)}(t) \ge \sum_{i=1}^{M} g_{i,k_i}(t) - \epsilon$, for all $k \in \mathcal{P}(N)$, $k \ne k^*$. There exists an optimal bipartite matching $k^{**} \in \mathcal{P}(N)$ such that $k^{**} \in \arg\max_{k \in \mathcal{P}(N)} \sum_{i=1}^{M} \mu_{i,k_i}$.
Denote $\mu^{**} := \sum_{i=1}^{M} \mu_{i,k^{**}_i}$, and define $\Delta_k := \mu^{**} - \sum_{i=1}^{M} \mu_{i,k_i}$, $k \in \mathcal{P}(N)$. Let $\Delta_{\min} = \min_{k \in \mathcal{P}(N),\ k \ne k^{**}} \Delta_k$ and $\Delta_{\max} = \max_{k \in \mathcal{P}(N)} \Delta_k$. We assume that $\Delta_{\min} > 0$.

Algorithm 2: dUCB$_4$ for User $i$
  Initialization: Play a set of matchings so that each player plays each arm at least once. Set counter $\eta = 1$.
  while ($t \le T$) do
    if ($\eta = 2^p$ for some $p = 0, 1, 2, \dots$) then
      // Decision frame:
      Update $g(t)$;
      Participate in the dBM$_{\epsilon}(g(t))$ algorithm to obtain a match $k^*_i(t)$;
      if ($k^*_i(t) \ne k^*_i(t-1)$) then
        Use the interrupt phase to signal an INTERRUPT to all other players about the changed allocation;
        Reset $\eta = 1$;
      end if
      if (Received an INTERRUPT) then
        Reset $\eta = 1$;
      end if
    else
      // Exploitation frame:
      $k^*_i(t) = k^*_i(t-1)$;
    end if
    Play arm $k^*_i(t)$;
    Increment counter $\eta = \eta + 1$, $t = t + 1$;
  end while

In the dUCB$_4$ algorithm, at the end of every decision frame, the dBM$_{\epsilon}(g(t))$ algorithm will give a legitimate matching with no two players colliding on any arm. Thus, the regret accrues either if the matching $k(t)$ is not the optimal matching $k^{**}$, or if a decision frame is employed by the players to recompute the matching. Every time a frame is a decision frame, it adds a cost $C$ to the regret. The cost $C$ depends on two parameters: (a) the precision of the bipartite matching algorithm $\epsilon_1 > 0$, and (b) the precision of the index representation $\epsilon_2 > 0$. A bipartite matching algorithm has an $\epsilon_1$-precision if it gives an $\epsilon_1$-optimal matching. This would happen, for example, when such an algorithm is run only for a finite number of rounds. The index has an $\epsilon_2$-precision if any two indices are not distinguishable when they are closer than $\epsilon_2$. This can happen, for example, when indices must be communicated to other players in a finite number of bits. Thus, the cost $C$ is a function of $\epsilon_1$ and $\epsilon_2$, and can be denoted as $C(\epsilon_1, \epsilon_2)$, with $C(\epsilon_1, \epsilon_2) \to \infty$ as $\epsilon_1$ or $\epsilon_2 \to 0$.
Since $\epsilon_1$ and $\epsilon_2$ are parameters that are fixed a priori, we consider $\epsilon = \min(\epsilon_1, \epsilon_2)$ to specify both precisions. We denote the cost as $C(\epsilon)$.

We first show that if $\Delta_{\min}$ is known, we can choose an $\epsilon < \Delta_{\min}/(M+1)$, so that the dUCB$_4$ algorithm will achieve order log-squared regret growth with $T$. If $\Delta_{\min}$ is not known, we can pick a positive monotone sequence $\{\epsilon_t\}$ such that $\epsilon_t \to 0$ as $t \to \infty$. In a decentralized bipartite matching algorithm, the precision $\epsilon$ will depend on the amount of information exchanged in the decision frames. It is thus some monotonically decreasing function $\epsilon = f(L)$ of their length $L$ such that $\epsilon \to 0$ as $L \to \infty$. Thus, we must pick a positive monotone sequence $\{L_t\}$ such that $L_t \to \infty$. Clearly, $C(f(L_t)) \to \infty$ as $t \to \infty$. This can happen arbitrarily slowly.

Theorem 6. (i) Let $\epsilon > 0$ be the precision of the bipartite matching algorithm and the precision of the index representation. If $\Delta_{\min}$ is known, choose $\epsilon > 0$ such that $\epsilon < \Delta_{\min}/(M+1)$. Let $L$ be the length of a frame. Then, the expected regret of the dUCB$_4$ algorithm is
\[
\tilde{R}_{dUCB_4}(T) \le \left(L\Delta_{\max} + C(f(L))(1+\log T)\right) \cdot \left( \frac{4M^3(M+2)N\log T}{(\Delta_{\min} - (M+1)\epsilon)^2} + NM(2M+1) \right).
\]
Thus, $\tilde{R}_{dUCB_4}(T) = O(\log^2 T)$.
(ii) When $\Delta_{\min}$ is unknown, denote $\epsilon_{\min} = \Delta_{\min}/(2(M+1))$ and let $L_t \to \infty$ as $t \to \infty$. Then, there exists a $t_0 > 0$ such that for all $T > t_0$,
\[
\tilde{R}_{dUCB_4}(T) \le \left(L_{t_0}\Delta_{\max} + C(f(L_{t_0}))\right) t_0 + \left(L_T\Delta_{\max} + C(f(L_T))(1+\log T)\right) \cdot \left( \frac{4M^3(M+2)N\log T}{(\Delta_{\min} - \epsilon_{\min})^2} + NM(2M+1) \right),
\]
where $t_0$ is the smallest $t$ such that $f(L_{t_0}) < \epsilon_{\min}$. Thus by choosing an arbitrarily slowly increasing sequence $\{L_t\}$ we can make the regret arbitrarily close to $\log^2 T$.

Proof. (i) First, we obtain a bound for $L = 1$. Then, we appeal to a result like Theorem 1 to obtain the result for general $L$. The implicit dependence between $\epsilon$ and $L$ through the function $f(\cdot)$ does not affect this part of the analysis. Details are omitted due to space limitations.

We first upper bound the number of sub-optimal plays.
We define $\tilde{n}_{i,j}(t)$, $1 \le i \le M$, $1 \le j \le N$, as follows: Whenever the dBM$_{\epsilon}(g(t))$ algorithm gives a non-optimal matching $k(t)$, $\tilde{n}_{i,j}(t)$ is increased by one for some $(i,j) \in \arg\min_{1 \le i \le M,\ 1 \le j \le N} n_{i,j}(t)$. Let $\tilde{n}(T)$ denote the total number of suboptimal plays. Then, clearly, $\tilde{n}(T) = \sum_{i=1}^{M} \sum_{j=1}^{N} \tilde{n}_{i,j}(T)$. So, in order to get a bound on $\tilde{n}(T)$ we first get a bound on $\tilde{n}_{i,j}(T)$.

Let $\tilde{I}_{i,j}(t)$ be the indicator function which is equal to 1 if $\tilde{n}_{i,j}(t)$ is incremented by one at time $t$. Thus $\tilde{I}_{i,j}(t)$ is equal to 1 whenever there is a non-optimal matching. Let this non-optimal matching be $k(t)$, where $k(t) \ne k^{**}$. In the following, we denote it as $k$, omitting the time index. A non-optimal matching $k$ is selected if the event
\[
\left\{ \sum_{i=1}^{M} g_{i,k^{**}_i}(m+2^p-1) \le (M+1)\epsilon + \sum_{i=1}^{M} g_{i,k_i}(m+2^p-1) \right\}
\]
happens. If each index has an error of at most $\epsilon$, the sum of $M$ terms may introduce an error of at most $M\epsilon$. In addition, the distributed bipartite matching algorithm dBM$_{\epsilon}$ itself yields only an $\epsilon$-optimal matching. This accounts for the term $(M+1)\epsilon$ above.

The initial proof steps are similar to those in Theorem 2. We define $\tau_{ij,m}$, $\tau'_{ij,m}$ in the same way as in the proof of Theorem 2 and let $\tilde{\tau}'_{ij,m} = \min\{\tau'_{ij,m}, T\}$. Let $k$ be the matching specified before.
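Then we get
\[
\begin{aligned}
\tilde{n}_{i,j}(T) &\le 1 + \sum_{m=1}^{T} |\tilde{\tau}'_{ij,m} - \tau_{ij,m}|\, I\left\{ \sum_{i=1}^{M} g_{i,k^{**}_i}(\tau_{ij,m}-1) \le (M+1)\epsilon + \sum_{i=1}^{M} g_{i,k_i}(\tau_{ij,m}-1) \right\} \\
&\le l + \sum_{m=1}^{T} |\tilde{\tau}'_{ij,m} - \tau_{ij,m}|\, I\left\{ \sum_{i=1}^{M} g_{i,k^{**}_i}(\tau_{ij,m}-1) \le (M+1)\epsilon + \sum_{i=1}^{M} g_{i,k_i}(\tau_{ij,m}-1),\ \tilde{n}_{i,j}(\tau_{ij,m}-1) \ge l \right\} \\
&\le l + \sum_{m=1}^{T} \sum_{p=0}^{\infty} 2^p\, I\left\{ \sum_{i=1}^{M} g_{i,k^{**}_i}(\tau_{ij,m}+2^p-1) \le (M+1)\epsilon + \sum_{i=1}^{M} g_{i,k_i}(\tau_{ij,m}+2^p-1),\ \tilde{n}_{i,j}(\tau_{ij,m}-1) \ge l \right\} \\
&\le l + \sum_{m=1}^{T} \sum_{p=0}^{\infty} 2^p\, I\left\{ \sum_{i=1}^{M} g_{i,k^{**}_i}(m+2^p-1) \le (M+1)\epsilon + \sum_{i=1}^{M} g_{i,k_i}(m+2^p-1),\ \tilde{n}_{i,j}(m-1) \ge l \right\} \\
&\le l + \sum_{m=1}^{T} \sum_{p=0}^{\infty} 2^p\, I\left\{ \sum_{i=1}^{M} \left( \bar{X}_{i,k^{**}_i}(m+2^p-1) + c_{m+2^p-1,\,n_{i,k^{**}_i}(m+2^p-1)} \right) \le (M+1)\epsilon + \sum_{i=1}^{M} \left( \bar{X}_{i,k_i}(m+2^p-1) + c_{m+2^p-1,\,n_{i,k_i}(m+2^p-1)} \right),\ \tilde{n}_{i,j}(m-1) \ge l \right\} \\
&\le l + \sum_{m=1}^{T} \sum_{p=0}^{\infty} 2^p\, I\left\{ \min_{1 \le s_{1,k^{**}_1}, \dots, s_{M,k^{**}_M} < m+2^p} \sum_{i=1}^{M} \left( \bar{X}_{i,k^{**}_i}(m+2^p-1) + c_{m+2^p-1,\,s_{i,k^{**}_i}} \right) \le (M+1)\epsilon + \max_{l \le s'_{1,k_1}, \dots, s'_{M,k_M} < m+2^p} \sum_{i=1}^{M} \left( \bar{X}_{i,k_i}(m+2^p-1) + c_{m+2^p-1,\,s'_{i,k_i}} \right) \right\} \\
&\le l + \sum_{m=1}^{\infty} \sum_{p=0}^{\infty} 2^p \sum_{s_{1,k^{**}_1}=1}^{m+2^p} \cdots \sum_{s_{M,k^{**}_M}=1}^{m+2^p}\ \sum_{s'_{1,k_1}=1}^{m+2^p} \cdots \sum_{s'_{M,k_M}=1}^{m+2^p} I\left\{ \sum_{i=1}^{M} \left( \bar{X}_{i,k^{**}_i}(m+2^p) + c_{m+2^p,\,s_{i,k^{**}_i}} \right) \le (M+1)\epsilon + \sum_{i=1}^{M} \left( \bar{X}_{i,k_i}(m+2^p) + c_{m+2^p,\,s'_{i,k_i}} \right) \right\}. \qquad (2.30)
\end{aligned}
\]
Now, it is easy to observe that the event
\[
\left\{ \sum_{i=1}^{M} \left( \bar{X}_{i,k^{**}_i}(m+2^p) + c_{m+2^p,\,s_{i,k^{**}_i}} \right) \le (M+1)\epsilon + \sum_{i=1}^{M} \left( \bar{X}_{i,k_i}(m+2^p) + c_{m+2^p,\,s'_{i,k_i}} \right) \right\}
\]
implies at least one of the following events:
\[
\begin{aligned}
A_i &:= \left\{ \bar{X}_{i,k^{**}_i}(m+2^p) \le \mu_{i,k^{**}_i} - c_{m+2^p,\,s_{i,k^{**}_i}} \right\}, \\
B_i &:= \left\{ \bar{X}_{i,k_i}(m+2^p) \ge \mu_{i,k_i} + c_{m+2^p,\,s'_{i,k_i}} \right\}, \quad 1 \le i \le M, \\
C &:= \left\{ \sum_{i=1}^{M} \mu_{i,k^{**}_i} < (M+1)\epsilon + \sum_{i=1}^{M} \mu_{i,k_i} + 2\sum_{i=1}^{M} c_{m+2^p,\,s'_{i,k_i}} \right\}.
\end{aligned} \tag{2.31}
\]
Using the Chernoff-Hoeffding inequality, we get $P(A_i) \le (m+2^p)^{-2(M+2)}$ and $P(B_i) \le (m+2^p)^{-2(M+2)}$, $1 \le i \le M$. For $l \ge \left\lceil \frac{4M^2(M+2)\log T}{(\Delta_{\min} - (M+1)\epsilon)^2} \right\rceil$, we get
\[
\begin{aligned}
\sum_{i=1}^{M} \mu_{i,k^{**}_i} - \sum_{i=1}^{M} \mu_{i,k_i} - (M+1)\epsilon - 2\sum_{i=1}^{M} c_{m+2^p,\,s'_{i,k_i}} &\ge \sum_{i=1}^{M} \mu_{i,k^{**}_i} - \sum_{i=1}^{M} \mu_{i,k_i} - (M+1)\epsilon - 2M\sqrt{\frac{(M+2)\log(m+2^p)}{l}} \\
&\ge \sum_{i=1}^{M} \mu_{i,k^{**}_i} - \sum_{i=1}^{M} \mu_{i,k_i} - (M+1)\epsilon - (\Delta_{\min} - (M+1)\epsilon) \ge 0,
\end{aligned} \tag{2.32}
\]
where we used the fact that $(\Delta_{\min} - (M+1)\epsilon) > 0$ by assumption.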
So, we get,
\[
\begin{aligned}
E[\tilde{n}_{i,j}(T)] &\le \left\lceil \frac{4M^2(M+2)\log T}{(\Delta_{\min} - (M+1)\epsilon)^2} \right\rceil + \sum_{m=1}^{\infty} \sum_{p=0}^{\infty} 2^p \sum_{s_{1,k^{**}_1}=1}^{m+2^p} \cdots \sum_{s_{M,k^{**}_M}=1}^{m+2^p}\ \sum_{s'_{1,k_1}=1}^{m+2^p} \cdots \sum_{s'_{M,k_M}=1}^{m+2^p} 2M(m+2^p)^{-2(M+2)} \\
&\le \left\lceil \frac{4M^2(M+2)\log T}{(\Delta_{\min} - (M+1)\epsilon)^2} \right\rceil + 2M \sum_{m=1}^{\infty} \sum_{p=0}^{\infty} 2^p (m+2^p)^{-4} \le \frac{4M^2(M+2)\log T}{(\Delta_{\min} - (M+1)\epsilon)^2} + (2M+1).
\end{aligned} \tag{2.33}
\]
Now, putting it all together, we get
\[
E[\tilde{n}(T)] = \sum_{i=1}^{M} \sum_{j=1}^{N} E[\tilde{n}_{i,j}(T)] \le \frac{4M^3(M+2)N\log T}{(\Delta_{\min} - (M+1)\epsilon)^2} + (2M+1)MN.
\]
Now, by the proof of Theorem 2 (c.f. equation (2.15)), $E[m(T)] \le E[\tilde{n}(T)](1+\log T)$. We can now bound the regret,
\[
\tilde{R}_{dUCB_4}(T) = \sum_{k \in \mathcal{P}(N),\ k \ne k^{**}} \Delta_k \sum_{i=1}^{M} E[\tilde{n}_{i,k_i}(T)] + C\,E[m(T)] \le \Delta_{\max} \sum_{k \in \mathcal{P}(N),\ k \ne k^{**}} \sum_{i=1}^{M} E[\tilde{n}_{i,k_i}(T)] + C\,E[m(T)] = \Delta_{\max} E[\tilde{n}(T)] + C\,E[m(T)].
\]
For a general $L$, by Theorem 1 we get
\[
\tilde{R}_{dUCB_4}(T) \le L\Delta_{\max} E[\tilde{n}(T)] + C(f(L))E[m(T)] \le \left(L\Delta_{\max} + C(f(L))(1+\log T)\right) E[\tilde{n}(T)].
\]
Now, using the above bound on $E[\tilde{n}(T)]$, we get the desired upper bound on the expected regret.
(ii) Since $\epsilon_t = f(L_t)$ is a monotonically decreasing function of $L_t$ such that $\epsilon_t \to 0$ as $L_t \to \infty$, there exists a $t_0$ such that for $t > t_0$, $\epsilon_t < \epsilon_{\min}$. We may get a linear regret up to time $t_0$ but after that, by the analysis of Theorem 2, regret grows only sub-linearly. Since $C(\cdot)$ is monotonically increasing, $C(f(L_T)) \ge C(f(L_t))$ for all $t \le T$, and we get the desired result. The last part is illustrative and can be trivially established using the obtained bound on the regret in (ii).

Remarks.
1. We note that in the initial steps, our proof followed the proof of the main result in [12].
2. The UCB$_2$ algorithm described in [4] performs computations only at exponentially spaced time epochs. So, it is natural to imagine that a decentralized algorithm based on it could be developed, and get a better regret bound. Unfortunately, the single player UCB$_2$ algorithm has an obvious weakness: regret is linear in the number of arms. Thus, the decentralized/combinatorial extension of UCB$_2$ would yield regret growing exponentially in the number of players and arms.
We use a similar index but a different scheme, allowing us to achieve poly-log regret growth and a linear memory requirement for each player.

2.5 The Decentralized MAB problem with Markovian rewards

Now, we consider the decentralized MAB problem with $M$ players and $N$ arms where the rewards obtained each time an arm is pulled are not i.i.d. but come from a Markov chain. The reward that player $i$ gets from arm $j$ (when there is no collision), $X_{i,j}$, is modelled as an irreducible, aperiodic, reversible Markov chain on a finite state space $\mathcal{X}^{i,j}$ and represented by a transition probability matrix $P^{i,j} := \left(p^{i,j}_{x,x'} : x, x' \in \mathcal{X}^{i,j}\right)$. Assume that $\mathcal{X}^{i,j} \subseteq (0,1]$. Let $X_{i,j}(1), X_{i,j}(2), \dots$ denote the successive rewards from arm $j$ for player $i$. All arms are mutually independent for all players. Let $\pi^{i,j} := \left(\pi^{i,j}_x, x \in \mathcal{X}^{i,j}\right)$ be the stationary distribution of the Markov chain $P^{i,j}$. The mean reward from arm $j$ for player $i$ is defined as $\mu_{i,j} := \sum_{x \in \mathcal{X}^{i,j}} x\,\pi^{i,j}_x$. Note that the Markov chain represented by $P^{i,j}$ makes a state transition only when player $i$ plays arm $j$. Otherwise, it remains rested.

As described in the previous section, $n_i(t)$ is the number of successful plays (without collisions) of player $i$ by frame $t$, $n_{i,j}(t)$ is the number of times player $i$ picks arm $j$ successfully by frame $t$ and $\bar{X}_{i,j}(t)$ is the sample mean of rewards from arm $j$ for player $i$ from $n_{i,j}(t)$ samples. Denote $\Delta_{\min} := \min_{k \in \mathcal{P}(N),\ k \ne k^{**}} \Delta_k$ and $\Delta_{\max} := \max_{k \in \mathcal{P}(N)} \Delta_k$. Denote $\pi_{\min} := \min_{1 \le i \le M,\ 1 \le j \le N,\ x \in \mathcal{X}^{i,j}} \pi^{i,j}_x$, $x_{\max} := \max_{1 \le i \le M,\ 1 \le j \le N,\ x \in \mathcal{X}^{i,j}} x$ and $x_{\min} := \min_{1 \le i \le M,\ 1 \le j \le N,\ x \in \mathcal{X}^{i,j}} x$. Let $\hat{\pi}^{i,j}_x := \max\{\pi^{i,j}_x, 1 - \pi^{i,j}_x\}$ and $\hat{\pi}_{\max} := \max_{1 \le i \le M,\ 1 \le j \le N,\ x \in \mathcal{X}^{i,j}} \hat{\pi}^{i,j}_x$. Let $|\mathcal{X}^{i,j}|$ denote the cardinality of the state space $\mathcal{X}^{i,j}$, and $|\mathcal{X}|_{\max} := \max_{1 \le i \le M,\ 1 \le j \le N} |\mathcal{X}^{i,j}|$. Let $\rho^{i,j}$ be the eigenvalue gap, $1 - \lambda_2$, where $\lambda_2$ is the second largest eigenvalue of the matrix $(P^{i,j})^2$. Denote $\rho_{\max} := \max_{1 \le i \le M,\ 1 \le j \le N} \rho^{i,j}$ and $\rho_{\min} := \min_{1 \le i \le M,\ 1 \le j \le N} \rho^{i,j}$.
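The quantities just defined (stationary distribution, mean reward, eigenvalue gap) can be made concrete on a toy chain. The Python sketch below is a hedged illustration, not part of the original development: the helper names (`stationary_distribution`, `mean_reward`, `eigenvalue_gap_two_state`) are hypothetical, the stationary distribution is approximated by repeated multiplication, and the eigenvalue gap is computed only for the two-state case, where the eigenvalues of a stochastic matrix are 1 and its trace minus 1.

```python
def stationary_distribution(P, iters=1000):
    """Approximate pi with pi = pi P by repeated multiplication.

    Assumes P is an irreducible, aperiodic stochastic matrix (list of rows).
    """
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[x] * P[x][y] for x in range(n)) for y in range(n)]
    return pi


def mean_reward(states, P):
    """mu = sum_x x * pi_x, where states[x] is the reward in state x."""
    pi = stationary_distribution(P)
    return sum(x * p for x, p in zip(states, pi))


def eigenvalue_gap_two_state(P):
    """Eigenvalue gap 1 - lambda_2 of P^2 for a two-state chain only.

    A two-state stochastic matrix has eigenvalues 1 and trace(P) - 1,
    so P^2 has eigenvalues 1 and (trace(P) - 1)^2.
    """
    lam = P[0][0] + P[1][1] - 1.0
    return 1.0 - lam * lam


# Toy two-state arm with rewards 0.2 and 1.0 (illustrative values).
P = [[0.9, 0.1],
     [0.5, 0.5]]
states = [0.2, 1.0]
pi = stationary_distribution(P)   # approximately (5/6, 1/6)
mu = mean_reward(states, P)       # approximately 1/3
rho = eigenvalue_gap_two_state(P)  # approximately 0.84
```

Every two-state chain is reversible, so this toy example satisfies the reversibility assumption of this section; for larger state spaces a numerical eigenvalue routine would be needed.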
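The total reward obtained by time $T$ is $S_T = \sum_{j=1}^{N} \sum_{i=1}^{M} \sum_{s=1}^{n_{i,j}(T)} X_{i,j}(s)$ and the regret is
\[
\tilde{R}_{M,\alpha}(T) := T \sum_{i=1}^{M} \mu_{i,k^{**}_i} - E_{\alpha}\left[ \sum_{j=1}^{N} \sum_{i=1}^{M} \sum_{s=1}^{n_{i,j}(T)} X_{i,j}(s) \right] + C\,E[m(T)]. \tag{2.34}
\]
Define the index
\[
g_{i,j}(t) := \bar{X}_{i,j}(t) + \sqrt{\frac{\gamma\log n_i(t)}{n_{i,j}(t)}}, \tag{2.35}
\]
where $\gamma$ is any constant such that $\gamma > (112+56M)|\mathcal{X}|^2_{\max}/\rho_{\min}$. We need the following lemma to prove the regret bound.

Lemma 4. If the reward of each player-arm pair $(i,j)$ is given by a Markov chain satisfying the properties of Lemma 1, then under any policy $\alpha$
\[
\tilde{R}_{M,\alpha}(T) \le \sum_{k \in \mathcal{P}(N),\ k \ne k^{**}} \Delta_k E[n_k(T)] + C\,E[m(T)] + \tilde{K}_{\mathcal{X},P}, \tag{2.36}
\]
where $n_k(T)$ is the number of times that the matching $k$ occurred by time $T$ and $\tilde{K}_{\mathcal{X},P}$ is defined as $\tilde{K}_{\mathcal{X},P} = \sum_{j=1}^{N} \sum_{i=1}^{M} \sum_{x \in \mathcal{X}^{i,j}} x/\pi^{i,j}_{\min}$.

Proof. Let $(X_{i,j}(1), X_{i,j}(2), \dots)$ denote the successive rewards for player $i$ from arm $j$. Let $\mathcal{F}^{i,j}_t$ denote the $\sigma$-algebra generated by $(X_{i,j}(1), \dots, X_{i,j}(t))$, $\mathcal{F}^{i,j} = \vee_{t \ge 1} \mathcal{F}^{i,j}_t$ and $\mathcal{G}^{i,j} = \vee_{(k,l) \ne (i,j)} \mathcal{F}^{k,l}$. Since arms are independent, $\mathcal{G}^{i,j}$ is independent of $\mathcal{F}^{i,j}$. Clearly, $n_{i,j}(T)$ is a stopping time with respect to $\mathcal{G}^{i,j} \vee \mathcal{F}^{i,j}_T$. The total reward is
\[
S_T = \sum_{j=1}^{N} \sum_{i=1}^{M} \sum_{t=1}^{n_{i,j}(T)} X_{i,j}(t) = \sum_{j=1}^{N} \sum_{i=1}^{M} \sum_{x \in \mathcal{X}^{i,j}} x\,N(x, n_{i,j}(T)),
\]
where $N(x, n_{i,j}(T)) := \sum_{t=1}^{n_{i,j}(T)} I\{X_{i,j}(t) = x\}$. Taking expectations and using Lemma 1,
\[
\left| E[S_T] - \sum_{j=1}^{N} \sum_{i=1}^{M} \sum_{x \in \mathcal{X}^{i,j}} x\,\pi^{i,j}_x E[n_{i,j}(T)] \right| \le \sum_{j=1}^{N} \sum_{i=1}^{M} \sum_{x \in \mathcal{X}^{i,j}} x/\pi^{i,j}_{\min},
\]
which implies
\[
\left| E[S_T] - \sum_{j=1}^{N} \sum_{i=1}^{M} \mu_{i,j} E[n_{i,j}(T)] \right| \le \tilde{K}_{\mathcal{X},P},
\]
where $\tilde{K}_{\mathcal{X},P} = \sum_{j=1}^{N} \sum_{i=1}^{M} \sum_{x \in \mathcal{X}^{i,j}} x/\pi^{i,j}_{\min}$. Now,
\[
\sum_{j=1}^{N} \sum_{i=1}^{M} \mu_{i,j} E[n_{i,j}(T)] = \sum_{j=1}^{N} \sum_{i=1}^{M} \sum_{k \in \mathcal{P}(N),\ (i,j) \in k} \mu_{i,k_i} E[n_{i,k_i}(T)] = \sum_{k \in \mathcal{P}(N)} \sum_{i=1}^{M} \mu_{i,k_i} E[n_{i,k_i}(T)] = \sum_{k \in \mathcal{P}(N)} \mu_k E[n_k(T)],
\]
where $\mu_k = \sum_{i=1}^{M} \mu_{i,k_i}$. Since the regret is defined as in equation (2.34),
\[
\left| \tilde{R}_{M,\alpha}(T) - \left( T\mu^{**} - \sum_{k \in \mathcal{P}(N),\ (i,j) \in k} \mu_{i,k_i} E[n_{i,k_i}(T)] + C\,E_{\alpha}[m(T)] \right) \right| \le \tilde{K}_{\mathcal{X},P}. \tag{2.37}
\]
The main result of this section is the following.

Theorem 7.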
(i) Let $\epsilon > 0$ be the precision of the bipartite matching algorithm and the precision of the index representation. If $\Delta_{\min}$, $|\mathcal{X}|_{\max}$ and $\rho_{\min}$ are known, choose $\epsilon > 0$ such that $\epsilon < \Delta_{\min}/(M+1)$ and $\kappa > (112 + 56M)|\mathcal{X}|^2_{\max}/\rho_{\min}$. Let $L$ be the length of a frame. Then, the expected regret of the dUCB$_4$ algorithm with index (2.35) for the decentralized MAB problem with Markovian rewards and per computation cost $C$ is given by

$$\tilde{R}_{M,\mathrm{dUCB}_4}(T) \le \left( L\Delta_{\max} + C(f(L))(1 + \log T) \right) \cdot \left( \frac{4M^3 \kappa N \log T}{(\Delta_{\min} - (M+1)\epsilon)^2} + (2MD+1)MN \right) + \tilde{K}_{\mathcal{X},P}.$$

Thus, $\tilde{R}_{M,\mathrm{dUCB}_4}(T) = O(\log^2 T)$.

(ii) If $\Delta_{\min}$, $|\mathcal{X}|_{\max}$ and $\rho_{\min}$ are unknown, denote $\epsilon_{\min} = \Delta_{\min}/(2(M+1))$ and let $L_t \to \infty$ as $t \to \infty$. Also, choose a positive monotone sequence $\{\kappa_t\}$ such that $\kappa_t \to \infty$ as $t \to \infty$ and $\kappa_t \le t$. Then, $\tilde{R}_{M,\mathrm{dUCB}_4}(T) = O(C(f(L_T))\, \kappa_T \log^2 T)$. Thus, by choosing arbitrarily slowly increasing sequences, we can make the regret arbitrarily close to $\log^2 T$.

Proof. (i) We skip the initial steps as they are the same as in the proof of Theorem 6. We start by bounding $\tilde{n}_{i,j}(T)$ as defined in the proof of Theorem 6. Then, from equation (2.30), we get

$$\tilde{n}_{i,j}(T) \le l + \sum_{m=1}^{\infty} \sum_{p=0}^{\infty} 2^p \sum_{s_{1,k^{**}_1}=1}^{m+2^p} \cdots \sum_{s_{M,k^{**}_M}=1}^{m+2^p} \; \sum_{s'_{1,k_1}=1}^{m+2^p} \cdots \sum_{s'_{M,k_M}=1}^{m+2^p} I\left\{ \sum_{i=1}^{M} \left( \bar{X}_{i,k^{**}_i}(m+2^p) + c_{m+2^p,\, s_{i,k^{**}_i}} \right) \le (M+1)\epsilon + \sum_{i=1}^{M} \left( \bar{X}_{i,k_i}(m+2^p) + c_{m+2^p,\, s'_{i,k_i}} \right) \right\}. \tag{2.38}$$

Now, the event in the braces $\{\cdot\}$ above implies at least one of the events $(A_i, B_i, C, D)$ given in the display (2.31). From the proof of Theorem 4 (equations (2.23), (2.24)),

$$P(A_i) \le D(m+2^p)^{-\frac{\kappa \rho_{\min}}{28|\mathcal{X}|^2_{\max}}}, \qquad P(B_i) \le D(m+2^p)^{-\frac{\kappa \rho_{\min}}{28|\mathcal{X}|^2_{\max}}}, \qquad 1 \le i \le M.$$

Similar to the steps in display (2.32), we can show that the event $C$ is false. Also, the event $D$ is false by assumption. So, similar to the proof of Theorem 6 (c.f. display (2.33)) we get

$$E[\tilde{n}_{i,j}(T)] \le \left\lceil \frac{4M^2 \kappa \log T}{(\Delta_{\min} - (M+1)\epsilon)^2} \right\rceil + \sum_{m=1}^{\infty} \sum_{p=0}^{\infty} 2^p \sum_{s_{1,k^{**}_1}=1}^{m+2^p} \cdots \sum_{s_{M,k^{**}_M}=1}^{m+2^p} \; \sum_{s'_{1,k_1}=1}^{m+2^p} \cdots \sum_{s'_{M,k_M}=1}^{m+2^p} 2MD(m+2^p)^{-\frac{\kappa \rho_{\min}}{28|\mathcal{X}|^2_{\max}}}$$

$$\le \left\lceil \frac{4M^2 \kappa \log T}{(\Delta_{\min} - (M+1)\epsilon)^2} \right\rceil + 2MD \sum_{m=1}^{\infty} \sum_{p=0}^{\infty} 2^p (m+2^p)^{-\frac{\kappa \rho_{\min} - 56M|\mathcal{X}|^2_{\max}}{28|\mathcal{X}|^2_{\max}}}$$

$$\le \frac{4M^2 \kappa \log T}{(\Delta_{\min} - (M+1)\epsilon)^2} + (2MD+1)$$

when $\kappa > (112 + 56M)|\mathcal{X}|^2_{\max}/\rho_{\min}$. The second inequality uses the fact that the $2M$ inner sums each contain $m + 2^p$ identical terms. Now, putting it all together, we get

$$E[\tilde{n}(T)] = \sum_{i=1}^{M} \sum_{j=1}^{N} E[\tilde{n}_{i,j}(T)] \le \frac{4M^3 \kappa N \log T}{(\Delta_{\min} - (M+1)\epsilon)^2} + (2MD+1)MN.$$

Now, by the proof of Theorem 2 (equation (2.15)), $E[m(T)] \le E[\tilde{n}(T)](1 + \log T)$. We can now bound the regret,

$$\tilde{R}_{M,\mathrm{dUCB}_4}(T) \le \sum_{k \in \mathcal{P}(N), k \neq k^{**}} \Delta_k \sum_{i=1}^{M} E[\tilde{n}_{i,k_i}(T)] + C\,E[m(T)] + \tilde{K}_{\mathcal{X},P}$$

$$\le \Delta_{\max} \sum_{k \in \mathcal{P}(N), k \neq k^{**}} \sum_{i=1}^{M} E[\tilde{n}_{i,k_i}(T)] + C\,E[m(T)] + \tilde{K}_{\mathcal{X},P} = \Delta_{\max} E[\tilde{n}(T)] + C\,E[m(T)] + \tilde{K}_{\mathcal{X},P}.$$

For a general $L$, by Theorem 1,

$$\tilde{R}_{M,\mathrm{dUCB}_4}(T) \le L\Delta_{\max} E[\tilde{n}(T)] + C(f(L))E[m(T)] + \tilde{K}_{\mathcal{X},P} \le \left( L\Delta_{\max} + C(f(L))(1 + \log T) \right) E[\tilde{n}(T)] + \tilde{K}_{\mathcal{X},P}.$$

Now, using the bound (2.39), we get the desired upper bound on the expected regret.

(ii) This can now easily be obtained using the above and following Theorem 6.

2.6 Distributed Bipartite Matching: Algorithm and Implementation

In the previous section, we referred to an unspecified distributed algorithm for bipartite matching, dBM, that is used by the dUCB$_4$ algorithm. We now present one such algorithm, namely, Bertsekas' auction algorithm [17], and its distributed implementation. We note that the presented algorithm is not the only one that can be used. The dUCB$_4$ algorithm will work with a distributed implementation of any bipartite matching algorithm, e.g. the algorithms given in [98].

Consider a bipartite graph with $M$ players on one side and $N$ arms on the other, with $M \le N$. Each player $i$ has a value $\mu_{i,j}$ for each arm $j$. Each player knows only his own values. Let us denote by $k^{**}$ a matching that maximizes the matching surplus $\sum_{i,j} \mu_{i,j} x_{i,j}$, where the variable $x_{i,j}$ is 1 if $i$ is matched with $j$, and 0 otherwise. Note that $\sum_i x_{i,j} \le 1, \forall j$, and $\sum_j x_{i,j} \le 1, \forall i$. Our goal is to find an $\epsilon$-optimal matching. We call any matching $k^*$ $\epsilon$-optimal if $\sum_i \mu_{i,k^{**}(i)} - \sum_i \mu_{i,k^*(i)} \le \epsilon$.
Algorithm 3: dBM$_\epsilon$ (Bertsekas Auction Algorithm)
1: All players $i$ initialize prices $p_j = 0$, $\forall$ channels $j$;
2: while (prices change) do
3: Player $i$ communicates his preferred arm $j^*_i$ and bid $b_i = \max_j (\mu_{ij} - p_j) - 2\max_j (\mu_{ij} - p_j) + \frac{\epsilon}{M}$ to all other players.
4: Each player determines on his own if he is the winner $i^*_j$ on arm $j$;
5: All players set prices $p_j = \mu_{i^*_j, j}$;
6: end while

Here, $2\max_j$ is the second highest maximum over all $j$. The best arm for a player $i$ is arm $j^*_i = \arg\max_j (\mu_{i,j} - p_j)$. The winner $i^*_j$ on an arm $j$ is the one with the highest bid.

The following lemma in [17] establishes that Bertsekas' auction algorithm will find the $\epsilon$-optimal matching in a finite number of steps.

Lemma 5. [17] Given $\epsilon > 0$, Algorithm 3 with rewards $\mu_{i,j}$, for player $i$ playing the $j$th arm, converges to a matching $k^*$ such that $\sum_i \mu_{i,k^{**}(i)} - \sum_i \mu_{i,k^*(i)} \le \epsilon$ where $k^{**}$ is an optimal matching. Furthermore, this convergence occurs in less than $(M^2 \max_{i,j}\{\mu_{i,j}\})/\epsilon$ iterations.

The temporal structure of the dUCB$_4$ algorithm is such that time is divided into frames of length $L$. Each frame is either a decision frame or an exploitation frame. In the exploitation frame, each player plays the arm it was allocated in the last decision frame. The distributed bipartite matching algorithm (e.g. based on Algorithm 3) is run in the decision frame. The decision frame has an interrupt phase of length $M$ and a negotiation phase of length $L - M$. We now describe an implementation structure for these phases in the decision frame.

Interrupt Phase: The interrupt phase can be implemented very easily. It has length $M$ time slots. On a pre-determined channel, each player by turn transmits a '1' if the arm with which it is now matched has changed, '0' otherwise. If any user transmits a '1', everyone knows that the matching has changed, and they reset their counter $\eta = 1$.
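The auction idea behind Algorithm 3 can be sketched in a few lines of Python. This is a minimal centralized simulation under stated assumptions, not the distributed implementation described in the text: bids are processed one unassigned player at a time (the sequential variant of the auction), and the price of the contested arm is raised by the winning bid increment, which is the textbook form of the price update and may differ from line 5 of Algorithm 3 as printed. The function name `auction_matching` and the example values are hypothetical.

```python
def auction_matching(mu, eps):
    """Sequential auction sketch for an M x N value matrix mu with M <= N.

    Returns (assigned, prices), where assigned[i] is the arm held by player i.
    """
    M, N = len(mu), len(mu[0])
    if M > N:
        raise ValueError("this sketch assumes at most as many players as arms")
    step = eps / M                 # minimum bid increment, eps / M as in Algorithm 3
    prices = [0.0] * N
    owner = [None] * N             # owner[j]: player currently holding arm j
    assigned = [None] * M          # assigned[i]: arm currently held by player i
    unassigned = list(range(M))
    while unassigned:
        i = unassigned.pop()
        net = [mu[i][j] - prices[j] for j in range(N)]
        best = max(range(N), key=net.__getitem__)
        # Second highest net value; falls back to the best one if N == 1.
        second = max((net[j] for j in range(N) if j != best), default=net[best])
        bid = net[best] - second + step
        prices[best] += bid        # assumed price update: raise by the bid increment
        previous = owner[best]
        if previous is not None:   # the previous holder is outbid and must bid again
            assigned[previous] = None
            unassigned.append(previous)
        owner[best] = i
        assigned[i] = best
    return assigned, prices


# Hypothetical example: 2 players, 2 arms.
mu = [[0.9, 0.1],
      [0.8, 0.7]]
assigned, prices = auction_matching(mu, eps=0.01)
print(assigned, prices)
```

Because every bid raises a price by at least `eps / M`, the loop terminates, and with all prices starting at zero the final matching is within `eps` of the optimal surplus. In the distributed setting of the text, each player would compute its own bid locally and only the pair (preferred arm, bid) would be exchanged.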
Figure 2.1: Structure of the decision frame

Negotiation Phase: The information needed to compute an $\epsilon$-optimal matching is exchanged in the negotiation phase. We first provide a packetized implementation of the negotiation phase. The negotiation phase consists of $J$ subframes of length $M$ each (see Figure 2.1). In each subframe, the users transmit a packet by turn. The packet contains bid information: (channel number, bid value). Since all users transmit by turn, all the users know the bid values by the end of the subframe, and can compute the new allocation and the prices independently. The number of subframes $J$ determines the precision $\epsilon$ of the distributed bipartite matching algorithm. Note that in the packetized implementation, $\epsilon_1 = 0$, i.e., bid values can be computed exactly, and for a given $\epsilon_2$, we can determine $J$, the number of rounds for which the dBM algorithm (Algorithm 3) must run in order to return an $\epsilon_2$-optimal matching.

If a packetized implementation is not possible, we can give a physical implementation. Our only assumption here is that each user can observe a channel and determine whether there was a successful transmission on it, a collision, or no transmission, in a given time slot. The whole negotiation phase is again divided into $J$ subframes. In each subframe, each user transmits by turn. It simply transmits $\lceil \log M \rceil$ bits to indicate a channel number, and then $\lceil \log 1/\epsilon_1 \rceil$ bits to indicate its bid value to precision $\epsilon_1$. The number of such subframes $J$ is again chosen so that the dBM algorithm (based on Algorithm 3) returns an $\epsilon_2$-optimal matching.

2.7 Conclusions

We have proposed a dUCB$_4$ algorithm for decentralized learning in multi-armed bandit problems that achieves a regret of near-$O(\log^2(T))$. The motivation for this came from opportunistic spectrum access for cognitive radios where users do not have a dedicated control channel for coordination.
The key contribution of this chapter is to lay down a framework in which to think about decentralized learning with heterogeneous players. Namely, think of this as an online decentralized bipartite matching problem in which players exchange "bids". However, since there is no dedicated channel for message exchange, only "quantized" bids can be exchanged, which affects both the optimality of the bipartite matching and the learning. Moreover, this message exchange comes at a "cost". Modularizing the problem in this manner is what made it tractable. Within this framework, we proposed using the dUCB$_4$ schedule for exploration-exploitation, in which case an $O(\log^2 T)$ bound on expected regret was obtained. However, it may be possible to use other schedules. In particular, a deterministic schedule such as the one proposed for the DSEE algorithm [99] may be used and may result in an improved $O(\log T)$ bound on expected regret.

A very important and fundamental question is whether a "fully" decentralized learning algorithm (without any coordination) even exists, and if so, what its lower bound would be. This has very interesting connections with distributed universal coding, which itself is a difficult problem. However, should this be achievable, it will yield very important insight about whether explicit communication/side information is necessary for distributed learning, or not, as in distributed source coding via Slepian-Wolf coding, which established that side information is not needed. In future work, we hope to be able to provide a lower bound by considering a more restricted, stylized setting by, for example, assuming availability of side information. Other directions for future work would be extension to distributions with heavy tails [100] and restless bandits [14, 101, 102].

3 Empirical Dynamic Programming for Markov Decision Processes

In this chapter we propose empirical dynamic programming algorithms for Markov decision processes (MDPs).
In these algorithms, the exact expectation in the Bellman operator in classical value iteration is replaced by an empirical estimate to get 'empirical value iteration' (EVI). Policy evaluation and policy improvement in classical policy iteration are also replaced by simulation to get 'empirical policy iteration' (EPI). Thus, these empirical dynamic programming algorithms involve iteration of a random operator, the empirical Bellman operator. We introduce notions of probabilistic fixed points for such random monotone operators. We develop a stochastic dominance framework for convergence analysis of such operators. We then use this to give sample complexity bounds for both EVI and EPI. We then provide various variations and extensions to asynchronous empirical dynamic programming and the minimax empirical dynamic program, and show how this can also be used to solve the dynamic newsvendor problem. Preliminary experimental results suggest a faster rate of convergence than stochastic approximation algorithms.

This chapter is organized as follows. In Section 3.1, we discuss preliminaries and briefly talk about classical value and policy iteration. Section 3.2 presents empirical value and policy iteration. Section 3.3 introduces the notion of random operators and relevant notions of probabilistic fixed points. In this section, we also develop a stochastic dominance argument for convergence analysis of iteration of random operators when they satisfy certain assumptions. In Section 3.4, we show that the empirical Bellman operator satisfies the above assumptions, and present sample complexity and convergence rate estimates for the EDP algorithms. Section 3.5 provides various extensions including asynchronous EDP, minimax EDP and EDP for the dynamic newsvendor problem. Basic numerical experiments are reported in Section 3.6.

3.1 Preliminaries

We first introduce a typical representation for a discrete time MDP as the 5-tuple $(\mathbb{S}, \mathbb{A}, \{A(s) : s \in \mathbb{S}\}, Q, c)$.
Both the state space $\mathbb{S}$ and the action space $\mathbb{A}$ are finite. Let $\mathcal{P}(\mathbb{S})$ denote the space of probability measures over $\mathbb{S}$, and we define $\mathcal{P}(\mathbb{A})$ similarly. For each state $s \in \mathbb{S}$, the set $A(s) \subset \mathbb{A}$ is the set of feasible actions. The entire set of feasible state-action pairs is $\mathbb{K} \triangleq \{(s,a) \in \mathbb{S} \times \mathbb{A} : a \in A(s)\}$. The transition law $Q$ governs the system evolution, $Q(\cdot|s,a) \in \mathcal{P}(\mathbb{S})$ for all $(s,a) \in \mathbb{K}$, i.e., $Q(j|s,a)$ for $j \in \mathbb{S}$ is the probability of visiting the state $j$ next given the current state-action pair $(s,a)$. Finally, $c : \mathbb{K} \to \mathbb{R}$ is a cost function that depends on state-action pairs.

Let $\Pi$ denote the class of stationary deterministic Markov policies, i.e., mappings $\pi : \mathbb{S} \to \mathbb{A}$ which only depend on history through the current state. We only consider such policies since it is well known that there is an optimal policy in this class. For a given state $s \in \mathbb{S}$, $\pi(s) \in A(s)$ is the action chosen in state $s$ under the policy $\pi$. We assume that $\Pi$ only contains feasible policies that respect the constraints $\mathbb{K}$. The state and action at time $t$ are denoted $s_t$ and $a_t$, respectively. Any policy $\pi \in \Pi$ and initial state $s \in \mathbb{S}$ determine a probability measure $P^{\pi}_s$ and a stochastic process $\{(s_t, a_t), t \ge 0\}$ defined on the canonical measurable space of trajectories of state-action pairs. The expectation operator with respect to $P^{\pi}_s$ is denoted $E^{\pi}_s[\cdot]$.

We will focus on infinite horizon discounted cost MDPs with discount factor $\alpha \in (0,1)$. For a given initial state $s \in \mathbb{S}$, the expected discounted cost for policy $\pi \in \Pi$ is denoted by

$$v^{\pi}(s) = E^{\pi}_s\left[ \sum_{t=0}^{\infty} \alpha^t c(s_t, a_t) \right].$$

The optimal cost starting from state $s$ is denoted by

$$v^*(s) \triangleq \inf_{\pi \in \Pi} E^{\pi}_s\left[ \sum_{t \ge 0} \alpha^t c(s_t, a_t) \right],$$

and $v^* \in \mathbb{R}^{|\mathbb{S}|}$ denotes the corresponding optimal value function in its entirety.

Value iteration

The Bellman operator $T : \mathbb{R}^{|\mathbb{S}|} \to \mathbb{R}^{|\mathbb{S}|}$ is defined as

$$[Tv](s) \triangleq \min_{a \in A(s)} \left\{ c(s,a) + \alpha E[v(\tilde{s})|s,a] \right\}, \quad \forall s \in \mathbb{S},$$

for any $v \in \mathbb{R}^{|\mathbb{S}|}$, where $\tilde{s}$ is the random next state visited, and

$$E[v(\tilde{s})|s,a] = \sum_{j \in \mathbb{S}} v(j) Q(j|s,a)$$

is the explicit computation of the expected cost-to-go conditioned on state-action pair $(s,a) \in \mathbb{K}$. Value iteration amounts to iteration of the Bellman operator. We have a sequence $\{v^k\}_{k \ge 0} \subset \mathbb{R}^{|\mathbb{S}|}$ where $v^{k+1} = Tv^k = T^{k+1}v^0$ for all $k \ge 0$ and an initial seed $v^0$. This is the well-known value iteration algorithm for dynamic programming.

We next state the Banach fixed point theorem, which is used to prove that value iteration converges. Let $U$ be a Banach space with norm $\|\cdot\|_U$. We call an operator $G : U \to U$ a contraction mapping when there exists a constant $\kappa \in [0,1)$ such that

$$\|Gv_1 - Gv_2\|_U \le \kappa \|v_1 - v_2\|_U, \quad \forall v_1, v_2 \in U.$$

Theorem 8. (Banach fixed point theorem) Let $U$ be a Banach space with norm $\|\cdot\|_U$, and let $G : U \to U$ be a contraction mapping with constant $\kappa \in [0,1)$. Then,
(i) there exists a unique $v^* \in U$ such that $Gv^* = v^*$;
(ii) for arbitrary $v^0 \in U$, the sequence $v^k = Gv^{k-1} = G^k v^0$ converges in norm to $v^*$ as $k \to \infty$;
(iii) $\|v^{k+1} - v^*\|_U \le \kappa \|v^k - v^*\|_U$ for all $k \ge 0$.

For the rest of the chapter, let $\mathcal{C}$ denote the space of contraction mappings from $\mathbb{R}^{|\mathbb{S}|} \to \mathbb{R}^{|\mathbb{S}|}$. It is well known that the Bellman operator $T \in \mathcal{C}$ is a contraction operator with constant $\kappa = \alpha$, and hence has a unique fixed point $v^*$. It is known that value iteration converges to $v^*$ as $k \to \infty$. In fact, $v^*$ is the optimal value function.

Policy iteration

Policy iteration is another well known dynamic programming algorithm for solving MDPs. For a fixed policy $\pi \in \Pi$, define $T_{\pi} : \mathbb{R}^{|\mathbb{S}|} \to \mathbb{R}^{|\mathbb{S}|}$ as

$$[T_{\pi}v](s) = c(s, \pi(s)) + \alpha E[v(\tilde{s})|s, \pi(s)].$$

The first step is a policy evaluation step. Compute $v^{\pi}$ by solving $T_{\pi}v^{\pi} = v^{\pi}$ for $v^{\pi}$. Let $c^{\pi} \in \mathbb{R}^{|\mathbb{S}|}$ be the vector of one period costs corresponding to a policy $\pi$, $c^{\pi}(s) = c(s, \pi(s))$, and $Q^{\pi}$ the transition kernel corresponding to the policy $\pi$. Then, writing $T_{\pi}v^{\pi} = v^{\pi}$, we have the linear system

$$c^{\pi} + \alpha Q^{\pi}v^{\pi} = v^{\pi}.$$
(Policy Evaluation)

The second step is a policy improvement step. Given a value function $v \in \mathbb{R}^{|\mathbb{S}|}$, find an 'improved' policy $\pi \in \Pi$ with respect to $v$ such that

$$T_{\pi}v = Tv.$$

(Policy Update)

Thus, policy iteration produces a sequence of policies $\{\pi^k\}_{k \ge 0}$ and value functions $\{v^k\}_{k \ge 0}$ as follows. At iteration $k \ge 0$, we solve the linear system $T_{\pi^k}v^{\pi^k} = v^{\pi^k}$ for $v^{\pi^k}$, and then we choose a new policy $\pi^{k+1}$ satisfying

$$T_{\pi^{k+1}}v^{\pi^k} = Tv^{\pi^k},$$

which is greedy with respect to $v^{\pi^k}$. We have a linear convergence rate for policy iteration as well. Let $v \in \mathbb{R}^{|\mathbb{S}|}$ be any value function, solve $T_{\pi}v = Tv$ for $\pi$, and then compute $v^{\pi}$. Then, we know [34, Lemma 6.2] that

$$\|v^{\pi} - v^*\| \le \alpha \|v - v^*\|,$$

from which convergence of policy iteration follows. Unless otherwise specified, the norm $\|\cdot\|$ we will use in this chapter is the sup norm.

We use the following helpful fact in the chapter. The proof is given in Appendix 3.8.1.

Remark 1. Let $X$ be a given set, and $f_1 : X \to \mathbb{R}$ and $f_2 : X \to \mathbb{R}$ be two real-valued functions on $X$. Then,
(i) $|\inf_{x \in X} f_1(x) - \inf_{x \in X} f_2(x)| \le \sup_{x \in X} |f_1(x) - f_2(x)|$, and
(ii) $|\sup_{x \in X} f_1(x) - \sup_{x \in X} f_2(x)| \le \sup_{x \in X} |f_1(x) - f_2(x)|$.

3.2 Empirical Algorithms for Dynamic Programming

We now present empirical variants of dynamic programming algorithms. Our focus will be on value and policy iteration. As the reader will see, the idea is simple and natural. In subsequent sections we will introduce the new notions and techniques to prove their convergence.

3.2.1 Empirical Value Iteration

We introduce empirical value iteration (EVI) first. The Bellman operator $T$ requires exact evaluation of the expectation

$$E[v(\tilde{s})|s,a] = \sum_{j \in \mathbb{S}} Q(j|s,a)v(j).$$

We will simulate and replace this exact expectation with an empirical estimate in each iteration. Thus, we need a simulation model for the MDP. Let

$$\psi : \mathbb{S} \times \mathbb{A} \times [0,1] \to \mathbb{S}$$

be a simulation model for the state evolution for the MDP, i.e. $\psi$ yields the next state given the current state, the action taken and an i.i.d. random variable.
Without loss of generality, we can assume that $\xi$ is a uniform random variable on $[0,1]$ and $(s,a) \in \mathbb{K}$. With this convention, the Bellman operator can be written as

$$[Tv](s) \triangleq \min_{a \in A(s)} \left\{ c(s,a) + \alpha E[v(\psi(s,a,\xi))] \right\}, \quad \forall s \in \mathbb{S}.$$

Now, we replace the expectation $E[v(\psi(s,a,\xi))]$ with its sample average approximation by simulating $\xi$. Given $n$ i.i.d. samples of a uniform random variable, denoted $\{\xi_i\}_{i=1}^{n}$, the empirical estimate of $E[v(\psi(s,a,\xi))]$ is $\frac{1}{n}\sum_{i=1}^{n} v(\psi(s,a,\xi_i))$. We note that the samples are regenerated at each iteration. Thus, the EVI algorithm can be summarized as follows.

Algorithm 4 Empirical Value Iteration
Input: $v^0 \in \mathbb{R}^{|\mathbb{S}|}$, sample size $n \ge 1$. Set counter $k = 0$.
1. Sample $n$ uniformly distributed random variables $\{\xi_i\}_{i=1}^{n}$ from $[0,1]$, and compute
$$v^{k+1}(s) = \min_{a \in A(s)} \left\{ c(s,a) + \frac{\alpha}{n} \sum_{i=1}^{n} v^k(\psi(s,a,\xi_i)) \right\}, \quad \forall s \in \mathbb{S}.$$
2. Increment $k := k+1$ and return to step 1.

In each iteration, we regenerate samples and use this empirical estimate to approximate $T$. Now we give the sample complexity of the EVI algorithm. The proof is given in Section 3.4.

Theorem 9. Given $\epsilon \in (0,1)$ and $\delta \in (0,1)$, fix $\epsilon_g = \epsilon/\eta^*$ and select $\delta_1, \delta_2 > 0$ such that $\delta_1 + 2\delta_2 \le \delta$, where $\eta^* = \lceil 2/(1-\alpha) \rceil$. Select an $n$ such that

$$n \ge n(\epsilon, \delta) = \frac{2(\kappa^*)^2}{\epsilon_g^2} \log \frac{2|\mathbb{K}|}{\delta_1}$$

where $\kappa^* = \max_{(s,a) \in \mathbb{K}} c(s,a)/(1-\alpha)$, and select a $k$ such that

$$k \ge k(\epsilon, \delta) = \log\left( \frac{1}{\delta_2\, \mu_{n,\min}} \right),$$

where $\mu_{n,\min} = \min_{\eta} \mu_n(\eta)$ and $\mu_n(\eta)$ is given by Lemma 8. Then

$$P\left\{ \|\hat{v}^k_n - v^*\| \ge \epsilon \right\} \le \delta.$$

Remark 2. This result says that, if we take $n \ge n(\epsilon, \delta)$ samples in each iteration of the EVI algorithm and perform $k > k(\epsilon, \delta)$ iterations, then the EVI iterate $\hat{v}^k_n$ is $\epsilon$ close to the optimal value function $v^*$ with probability greater than $1 - \delta$. We note that the sample complexity is $O\left( \frac{1}{\epsilon^2}, \log\frac{1}{\delta}, \log|\mathbb{S}|, \log|\mathbb{A}| \right)$.

The basic idea in the analysis is to frame EVI as iteration of a random operator $\hat{T}_n$ which we call the empirical Bellman operator. We define $\hat{T}_n$ as

$$\left[ \hat{T}_n(\omega)v \right](s) \triangleq \min_{a \in A(s)} \left\{ c(s,a) + \frac{\alpha}{n} \sum_{i=1}^{n} v(\psi(s,a,\xi_i)) \right\}, \quad \forall s \in \mathbb{S}. \tag{3.1}$$

This is a random operator because it depends on the random noise samples $\{\xi_i\}_{i=1}^{n}$. The definition and the analysis of this operator are done rigorously in Section 3.4.

3.2.2 Empirical Policy Iteration

We now define EPI along the same lines by replacing exact policy improvement and evaluation with empirical estimates. For a fixed policy $\pi \in \Pi$, we can estimate $v^{\pi}(s)$ via simulation. Given a sequence of noise $\omega = (\xi_i)_{i \ge 0}$, we have $s_{t+1} = \psi(s_t, \pi(s_t), \xi_t)$ for all $t \ge 0$. For $\gamma > 0$, choose a finite horizon $T$ such that

$$\max_{(s,a) \in \mathbb{K}} |c(s,a)| \sum_{t=T+1}^{\infty} \alpha^t < \gamma.$$

We use the time horizon $T$ to truncate simulation, since we must stop simulation after finite time. Let

$$[\hat{v}^{\pi}(s)](\omega) = \sum_{t=0}^{T} \alpha^t c(s_t(\omega), \pi(s_t(\omega)))$$

be the realization of $\sum_{t=0}^{T} \alpha^t c(s_t, a_t)$ on the sample path $\omega$.

The next algorithm requires two input parameters, $n$ and $q$, which determine sample sizes. Parameter $n$ is the sample size for policy improvement and parameter $q$ is the sample size for policy evaluation. We discuss the choices of these parameters in detail later. In the following algorithm, the notation $s_t(\omega_i)$ is understood as the state at time $t$ in the simulated trajectory $\omega_i$. Step 2 replaces solution of the system $v = c^{\pi} + \alpha Q^{\pi}v$ (policy evaluation). Step 3 replaces computation of $T_{\pi}v = Tv$ (policy improvement).

Algorithm 5 Empirical Policy Iteration
Input: $\pi_0 \in \Pi$, $\epsilon > 0$.
1. Set counter $k = 0$.
2. For each $s \in \mathbb{S}$, draw $\omega_1, \ldots, \omega_q \in \Omega$ and compute
$$\hat{v}^{\pi_k}(s) = \frac{1}{q} \sum_{i=1}^{q} \sum_{t=0}^{T} \alpha^t c(s_t(\omega_i), \pi_k(s_t(\omega_i))).$$
3. Draw $\xi_1, \ldots, \xi_n \in [0,1]$. Choose $\pi_{k+1}$ to satisfy
$$\pi_{k+1}(s) \in \arg\min_{a \in A(s)} \left\{ c(s,a) + \frac{\alpha}{n} \sum_{i=1}^{n} \hat{v}^{\pi_k}(\psi(s,a,\xi_i)) \right\}, \quad \forall s \in \mathbb{S}.$$
4. Increase $k := k+1$ and return to step 2.
5. Stop when $\|\hat{v}^{\pi_{k+1}} - \hat{v}^{\pi_k}\| \le \epsilon$.

We now give the sample complexity result for EPI. The proof is given in Section 3.4.

Theorem 10. Given $\epsilon \in (0,1)$, $\delta \in (0,1)$, select $\delta_1, \delta_2 > 0$ such that $\delta_1 + 2\delta_2 < \delta$. Also select $\delta_{11}, \delta_{12} > 0$ such that $\delta_{11} + \delta_{12} < \delta$. Then, select $\epsilon_1, \epsilon_2 > 0$ such that $\epsilon_g = \frac{\epsilon_2 + 2\alpha\epsilon_1}{1-\alpha}$ where $\epsilon_g = \epsilon/\eta^*$, $\eta^* = \lceil 2/(1-\alpha) \rceil$. Then, select a $q$ and $n$ such that

$$q \ge q(\epsilon, \delta) = \frac{2(\kappa^*(T+1))^2}{(\epsilon_1)^2} \log \frac{2|\mathbb{S}|}{\delta_{11}},$$

$$n \ge n(\epsilon, \delta) = \frac{2(\kappa^*)^2}{(\epsilon_2/\alpha)^2} \log \frac{2|\mathbb{K}|}{\delta_{12}},$$

where $\kappa^* = \max_{(s,a) \in \mathbb{K}} c(s,a)/(1-\alpha)$, and select a $k$ such that

$$k \ge k(\epsilon, \delta) = \log\left( \frac{1}{\delta_2\, \mu_{n,q,\min}} \right),$$

where $\mu_{n,q,\min} = \min_{\eta} \mu_{n,q}(\eta)$ and $\mu_{n,q}(\eta)$ is given by equation (3.11). Then,

$$P\{ \|v^{\pi_k} - v^*\| \ge \epsilon \} \le \delta.$$

Remark 3. This result says that, if we do $q \ge q(\epsilon, \delta)$ simulation runs for empirical policy evaluation, use $n \ge n(\epsilon, \delta)$ samples for the empirical policy update and perform $k > k(\epsilon, \delta)$ iterations, then the true value $v^{\pi_k}$ of the policy $\pi_k$ will be $\epsilon$ close to the optimal value function $v^*$ with probability greater than $1 - \delta$. We note that $q$ is $O\left( \frac{1}{\epsilon^2}, \log\frac{1}{\delta}, \log|\mathbb{S}| \right)$ and $n$ is $O\left( \frac{1}{\epsilon^2}, \log\frac{1}{\delta}, \log|\mathbb{S}|, \log|\mathbb{A}| \right)$.

3.3 Iteration of Random Operators

The empirical Bellman operator $\hat{T}_n$ we defined in equation (3.1) is a random operator. When it operates on a vector, it yields a random vector. When $\hat{T}_n$ is iterated, it produces a stochastic process and we are interested in the possible convergence of this stochastic process. The underlying assumption is that the random operator $\hat{T}_n$ is an 'approximation' of a deterministic operator $T$ such that $\hat{T}_n$ converges to $T$ (in a sense we will shortly make precise) as $n$ increases. For example, the empirical Bellman operator approximates the classical Bellman operator. We make this intuition mathematically rigorous in this section. The discussion in this section is not specific to the Bellman operator, but applies whenever a deterministic operator $T$ is being approximated by an improving sequence of random operators $\{\hat{T}_n\}_{n \ge 1}$.

3.3.1 Probabilistic Fixed Points of Random Operators

In this subsection we formalize the definition of a random operator, denoted by $\hat{T}_n$. Since $\hat{T}_n$ is a random operator, we need an appropriate probability space upon which to define $\hat{T}_n$.
So, we define the sample space $\Omega = [0,1]^{\infty}$, the $\sigma$-algebra $\mathcal{F} = \mathcal{B}^{\infty}$ where $\mathcal{B}$ is the inherited Borel $\sigma$-algebra on $[0,1]$, and the probability distribution $P$ on $\Omega$ formed by an infinite sequence of uniform random variables. The primitive uncertainties on $\Omega$ are infinite sequences of uniform noise $\omega = (\xi_i)_{i \ge 0}$ where each $\xi_i$ is an independent uniform random variable on $[0,1]$. We view $(\Omega, \mathcal{F}, P)$ as the appropriate probability space on which to define iteration of the random operators $\{\hat{T}_n\}_{n \ge 1}$.

Next we define a composition of random operators, $\hat{T}^k_n$, on the probability space $(\Omega^{\infty}, \mathcal{F}^{\infty}, P)$, for all $k \ge 0$ and all $n \ge 1$, where

$$\hat{T}^k_n(\omega)v = \hat{T}_n(\omega_{k-1})\,\hat{T}_n(\omega_{k-2}) \cdots \hat{T}_n(\omega_0)\,v.$$

Note that $\omega \in \Omega^{\infty}$ is an infinite sequence $(\omega_j)_{j \ge 0}$ where each $\omega_j = (\xi_{j,i})_{i \ge 0}$. Then we can define the iteration of $\hat{T}_n$ with an initial seed $\hat{v}^0_n \in \mathbb{R}^{|\mathbb{S}|}$ (we use the hat notation to emphasize that the iterates are random variables generated by the empirical operator) as

$$\hat{v}^{k+1}_n = \hat{T}_n \hat{v}^k_n = \hat{T}^{k+1}_n \hat{v}^0_n. \tag{3.2}$$

Notice that we only iterate $k$ for fixed $n$. The sample size $n$ is constant in every stochastic process $\{\hat{v}^k_n\}_{k \ge 0}$, where $\hat{v}^k_n = \hat{T}^k_n \hat{v}^0$, for all $k \ge 1$. For a fixed $\hat{v}^0_n$, we can view all $\hat{v}^k_n$ as measurable mappings from $\Omega^{\infty}$ to $\mathbb{R}^{|\mathbb{S}|}$ via the mapping $\hat{v}^k_n(\omega) = \hat{T}^k_n(\omega)\hat{v}^0_n$.

The relationship between the fixed points of the deterministic operator $T$ and the probabilistic fixed points of the random operator $\{\hat{T}_n\}_{n \ge 1}$ depends on how $\{\hat{T}_n\}_{n \ge 1}$ approximates $T$. Motivated by the relationship between the classical and the empirical Bellman operator, we will make the following assumption.

Assumption 1. $P\left( \lim_{n \to \infty} \|\hat{T}_n v - Tv\| \ge \epsilon \right) = 0$ $\forall \epsilon > 0$ and $\forall v \in \mathbb{R}^{|\mathbb{S}|}$. Also $T$ has a (possibly non-unique) fixed point $v^*$ such that $Tv^* = v^*$.

Assumption 1 is equivalent to $\lim_{n \to \infty} \hat{T}_n(\omega)v = Tv$ for $P$-almost all $\omega \in \Omega$. Here, we benefit from defining all of the random operators $\{\hat{T}_n\}_{n \ge 1}$ together on the sample space $\Omega = [0,1]^{\infty}$, so that the above convergence statement makes sense.
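To make the iteration of a random operator in (3.2) concrete, the following Python sketch applies the empirical Bellman operator (3.1) repeatedly to a small MDP and compares the result with exact value iteration. The two-state MDP (`cost`, `Q`, `alpha`), the inverse-CDF form of the simulation model `psi` and all function names are assumptions made for illustration; they are not taken from the text.

```python
import random


def bellman(v, cost, Q, alpha):
    """Exact Bellman operator T, using the transition law Q directly."""
    n_states = len(v)
    return [
        min(
            cost[s][a] + alpha * sum(Q[s][a][j] * v[j] for j in range(n_states))
            for a in range(len(cost[s]))
        )
        for s in range(n_states)
    ]


def psi(s, a, xi, Q):
    """Simulation model: next state drawn from Q(.|s,a) by inverse-CDF sampling."""
    acc = 0.0
    for j, p in enumerate(Q[s][a]):
        acc += p
        if xi < acc:
            return j
    return len(Q[s][a]) - 1  # guard against floating-point round-off


def empirical_bellman(v, cost, Q, alpha, n, rng):
    """One realization of the empirical Bellman operator (3.1) applied to v."""
    xis = [rng.random() for _ in range(n)]  # fresh noise samples on every call
    return [
        min(
            cost[s][a] + alpha / n * sum(v[psi(s, a, xi, Q)] for xi in xis)
            for a in range(len(cost[s]))
        )
        for s in range(len(v))
    ]


def sup_norm(u, v):
    return max(abs(x - y) for x, y in zip(u, v))


# Hypothetical MDP: 2 states, 2 feasible actions in every state.
cost = [[1.0, 0.5], [0.0, 0.8]]
Q = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.5, 0.5], [0.1, 0.9]]]
alpha = 0.5

# Reference: exact value iteration, run long enough to be very close to v*.
v_star = [0.0, 0.0]
for _ in range(200):
    v_star = bellman(v_star, cost, Q, alpha)

# Empirical value iteration: iterate the random operator with fixed n.
rng = random.Random(0)
v_hat = [0.0, 0.0]
for _ in range(50):
    v_hat = empirical_bellman(v_hat, cost, Q, alpha, n=2000, rng=rng)

print("sup-norm distance between the EVI iterate and v*:", sup_norm(v_hat, v_star))
```

The iterate does not settle at a single point, since new samples are drawn in every iteration; it keeps fluctuating in a neighbourhood of $v^*$ whose size shrinks as the sample size `n` grows. This is the behaviour that the probabilistic fixed point notions are meant to capture.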
Strong probabilistic fixed point: We now introduce a natural probabilistic fixed point notion for $\{\hat{T}_n\}_{n \ge 1}$, in analogy to the definition of a fixed point, $\|Tv^* - v^*\| = 0$, for a deterministic operator.

Definition 1. A vector $\hat{v} \in \mathbb{R}^{|\mathbb{S}|}$ is a strong probabilistic fixed point for the sequence $\{\hat{T}_n\}_{n \ge 1}$ if

$$\lim_{n \to \infty} P\left( \|\hat{T}_n \hat{v} - \hat{v}\| > \epsilon \right) = 0, \quad \forall \epsilon > 0.$$

We note that the above notion is defined for a sequence of random operators, rather than for a single random operator.

Remark 4. We can give a slightly more general notion of a probabilistic fixed point which we call an $(\epsilon, \delta)$-strong probabilistic fixed point. For a fixed $(\epsilon, \delta)$, we say that a vector $\hat{v} \in \mathbb{R}^{|\mathbb{S}|}$ is an $(\epsilon, \delta)$-strong probabilistic fixed point if there exists an $n_0(\epsilon, \delta)$ such that for all $n \ge n_0(\epsilon, \delta)$ we get $P\left( \|\hat{T}_n \hat{v} - \hat{v}\| > \epsilon \right) < \delta$. Note that all strong probabilistic fixed points satisfy this condition for arbitrary $(\epsilon, \delta)$ and hence are $(\epsilon, \delta)$-strong probabilistic fixed points. However, the converse need not be true. In many cases we may be looking for an $\epsilon$-optimal 'solution' with a $1 - \delta$ 'probabilistic guarantee' where $(\epsilon, \delta)$ is fixed a priori. In fact, this would provide an approximation to the strong probabilistic fixed point of the sequence of operators.

Weak probabilistic fixed point: It is well known that iteration of a deterministic contraction operator converges to its fixed point. It is unclear whether a similar property would hold for random operators, and whether they would converge to the strong probabilistic fixed point of the sequence $\{\hat{T}_n\}_{n \ge 1}$ in any way. Thus, we define an apparently weaker notion of a probabilistic fixed point that explicitly considers iteration.

Definition 2. A vector $\hat{v} \in \mathbb{R}^{|\mathbb{S}|}$ is a weak probabilistic fixed point for $\{\hat{T}_n\}_{n \ge 1}$ if

$$\lim_{n \to \infty} \limsup_{k \to \infty} P\left( \|\hat{T}^k_n v - \hat{v}\| > \epsilon \right) = 0, \quad \forall \epsilon > 0, \ \forall v \in \mathbb{R}^{|\mathbb{S}|}.$$

We use $\limsup_{k \to \infty} P\left( \|\hat{T}^k_n v - \hat{v}\| > \epsilon \right)$ instead of $\lim_{k \to \infty} P\left( \|\hat{T}^k_n v - \hat{v}\| > \epsilon \right)$ because the latter limit may not exist for any fixed $n \ge 1$.

Remark 5.
Similar to the definition that we gave in Remark 4, we can define an $(\epsilon, \delta)$-weak probabilistic fixed point. For a fixed $(\epsilon, \delta)$, we say that a vector $\hat{v} \in \mathbb{R}^{|\mathbb{S}|}$ is an $(\epsilon, \delta)$-weak probabilistic fixed point if there exists an $n_0(\epsilon, \delta)$ such that for all $n \ge n_0(\epsilon, \delta)$ we get $\limsup_{k \to \infty} P\left( \|\hat{T}^k_n v - \hat{v}\| > \epsilon \right) < \delta$. As before, all weak probabilistic fixed points are indeed $(\epsilon, \delta)$-weak probabilistic fixed points, but the converse need not be true.

At this point the connection between strong/weak probabilistic fixed points of the random operator $\hat{T}_n$ and the classical fixed point of the deterministic operator $T$ is not clear. Also it is not clear whether the random sequence $\{\hat{v}^k_n\}_{k \ge 0}$ converges to either of these two fixed points. In the following subsections we address these issues.

3.3.2 A Stochastic Process on $\mathbb{N}$

In this subsection, we construct a new stochastic process on $\mathbb{N}$ that will be useful in our analysis. We first start with a simple lemma.

Lemma 6. The stochastic process $\{\hat{v}^k_n\}_{k \ge 0}$ is a Markov chain on $\mathbb{R}^{|\mathbb{S}|}$.

Proof. This follows from the fact that each iteration of $\hat{T}_n$ is independent and identically distributed. Thus, the next iterate $\hat{v}^{k+1}_n$ only depends on history through the current iterate $\hat{v}^k_n$.

Even though $\{\hat{v}^k_n\}_{k \ge 0}$ is a Markov chain, its analysis is complicated by two factors. First, $\{\hat{v}^k_n\}_{k \ge 0}$ is a Markov chain on the continuous state space $\mathbb{R}^{|\mathbb{S}|}$, which introduces technical difficulties in general when compared to a discrete state space. Second, the transition probabilities of $\{\hat{v}^k_n\}_{k \ge 0}$ are too complicated to compute explicitly.

Since we are approximating $T$ by $\hat{T}_n$ and want to compute $v^*$, we should track the progress of $\{\hat{v}^k_n\}_{k \ge 0}$ to the fixed point $v^*$ of $T$. Equivalently, we are interested in the real-valued stochastic process $\{\|\hat{v}^k_n - v^*\|\}_{k \ge 0}$. If $\|\hat{v}^k_n - v^*\|$ approaches zero then $\hat{v}^k_n$ approaches $v^*$, and vice versa. The state space of the stochastic process $\{\|\hat{v}^k_n - v^*\|\}_{k \ge 0}$ is $\mathbb{R}$, which is simpler than the state space $\mathbb{R}^{|\mathbb{S}|}$ of $\{\hat{v}^k_n\}_{k \ge 0}$, but which is still continuous.
Moreover, $\{\|\hat{v}^k_n - v^*\|\}_{k \ge 0}$ is a non-Markovian process in general. In fact it would be easier to study a related stochastic process on a discrete, and ideally a finite, state space. In this subsection we show how this can be done.

We make a boundedness assumption next.

Assumption 2. There exists a $\kappa^* < \infty$ such that $\|\hat{v}^k_n\| \le \kappa^*$ almost surely for all $k \ge 0$, $n \ge 1$. Also, $\|v^*\| \le \kappa^*$.

Under this assumption we can restrict the stochastic process $\{\|\hat{v}^k - v^*\|\}_{k \ge 0}$ to the compact state space

$$B_{2\kappa^*}(0) = \left\{ v \in \mathbb{R}^{|\mathbb{S}|} : \|v\| \le 2\kappa^* \right\}.$$

We will adopt the convention that any element $v$ outside of $B_{\kappa^*}(0)$ will be mapped to its projection $\kappa^* \frac{v}{\|v\|}$ onto $B_{\kappa^*}(0)$ by any realization of $\hat{T}_n$.

Choose a granularity $\epsilon_g > 0$ to be fixed for the remainder of this discussion. We will break up $\mathbb{R}$ into intervals of length $\epsilon_g$ starting at zero, and we will note which interval is occupied by $\|\hat{v}^k_n - v^*\|$ at each $k \ge 0$. We will define a new stochastic process $\{X^k_n\}_{k \ge 0}$ on $(\Omega^{\infty}, \mathcal{F}^{\infty}, P)$ with state space $\mathbb{N}$. The idea is that $\{X^k_n\}_{k \ge 0}$ will report which interval of $[0, 2\kappa^*]$ is occupied by $\{\|\hat{v}^k_n - v^*\|\}_{k \ge 0}$. Define $X^k_n : \Omega^{\infty} \to \mathbb{N}$ via the rule:

$$X^k_n(\omega) = \begin{cases} 0, & \text{if } \|\hat{v}^k(\omega) - v^*\| = 0, \\ \eta \ge 1, & \text{if } (\eta - 1)\epsilon_g < \|\hat{v}^k(\omega) - v^*\| \le \eta\,\epsilon_g, \end{cases} \tag{3.3}$$

for all $k \ge 0$. More compactly,

$$X^k_n(\omega) = \left\lceil \|\hat{v}^k_n(\omega) - v^*\| / \epsilon_g \right\rceil,$$

where $\lceil \chi \rceil$ denotes the smallest integer greater than or equal to $\chi \in \mathbb{R}$. Thus the stochastic process $\{X^k_n\}_{k \ge 0}$ is a report on how close the stochastic process $\{\|\hat{v}^k_n - v^*\|\}_{k \ge 0}$ is to zero, and in turn how close the Markov chain $\{\hat{v}^k_n\}_{k \ge 0}$ is to the true fixed point $v^*$ of $T$.

Define the constant

$$N^* \triangleq \left\lceil \frac{2\kappa^*}{\epsilon_g} \right\rceil.$$

Notice $N^*$ is the smallest number of intervals of length $\epsilon_g$ needed to cover the interval $[0, 2\kappa^*]$. By construction, the stochastic process $\{X^k_n\}_{k \ge 0}$ is restricted to the finite state space $\{\eta \in \mathbb{N} : 0 \le \eta \le N^*\}$.

The process $\{X^k_n\}_{k \ge 0}$ need not be a Markov chain. However, it is easier to work with than either $\{\hat{v}^k_n\}_{k \ge 0}$ or $\{\|\hat{v}^k_n - v^*\|\}_{k \ge 0}$ because it has a discrete state space. It is also easy to relate $\{X^k_n\}_{k \ge 0}$ back to $\{\|\hat{v}^k_n - v^*\|\}_{k \ge 0}$.
Recall that $X \ge_{as} Y$ denotes almost sure inequality between two random variables $X$ and $Y$ defined on the same probability space. The stochastic processes $\{X^k_n\}_{k \ge 0}$ and $\{\|\hat{v}^k_n - v^*\|/\epsilon_g\}_{k \ge 0}$ are defined on the same probability space, so the next lemma follows by construction of $\{X^k_n\}_{k \ge 0}$.

Lemma 7. For all $k \ge 0$, $X^k_n \ge_{as} \|\hat{v}^k_n - v^*\|/\epsilon_g$.

To proceed, we will make the following assumptions about the deterministic operator $T$ and the random operator $\hat{T}_n$.

Assumption 3. $\|Tv - v^*\| \le \alpha \|v - v^*\|$ for all $v \in \mathbb{R}^{|\mathbb{S}|}$.

Assumption 4. There is a sequence $\{p_n\}_{n \ge 1}$ such that

$$P\left( \|Tv - \hat{T}_n v\| < \epsilon \right) > p_n(\epsilon)$$

and $p_n(\epsilon) \uparrow 1$ as $n \to \infty$ for all $v \in B_{\kappa^*}(0)$, $\forall \epsilon > 0$.

We now discuss the convergence rate of $\{X^k_n\}_{k \ge 0}$. Let $X^k_n = \eta$. On the event $F = \left\{ \|T\hat{v}^k_n - \hat{T}_n \hat{v}^k_n\| < \epsilon_g \right\}$, we have

$$\|\hat{v}^{k+1}_n - v^*\| \le \|\hat{T}_n \hat{v}^k_n - T\hat{v}^k_n\| + \|T\hat{v}^k_n - v^*\| \le (\alpha\eta + 1)\epsilon_g$$

where we used Assumption 3 and the definition of $X^k_n$. Now using Assumption 4 we can summarize:

If $X^k_n = \eta$, then $X^{k+1}_n \le \lceil \alpha\eta + 1 \rceil$ with a probability at least $p_n(\epsilon_g)$. (3.4)

We conclude this subsection with a comment about the state space of the stochastic process $\{X^k_n\}_{k \ge 0}$. If we start with $X^k_n = \eta$ and if $\lceil \alpha\eta + 1 \rceil < \eta$ then we must have improvement in the proximity of $\hat{v}^{k+1}_n$ to $v^*$. We define a new constant

$$\eta^* = \min\{\eta \in \mathbb{N} : \lceil \alpha\eta + 1 \rceil < \eta\} = \left\lceil \frac{2}{1-\alpha} \right\rceil.$$

If $\eta$ is too small, then $\lceil \alpha\eta + 1 \rceil$ may be equal to $\eta$ and no improvement in the proximity of $\hat{v}^k_n$ to $v^*$ can be detected by $\{X^k_n\}_{k \ge 0}$. For any $\eta \ge \eta^*$, $\lceil \alpha\eta + 1 \rceil < \eta$ and strict improvement must hold. So, for the stochastic process $\{X^k_n\}_{k \ge 0}$, we can restrict our attention to the state space $\mathcal{X} := \{\eta^*, \eta^* + 1, \ldots, N^* - 1, N^*\}$.

3.3.3 Dominating Markov Chains

If we could understand the behavior of the stochastic process $\{X^k_n\}_{k \ge 0}$, then we could make statements about the convergence of $\{\|\hat{v}^k_n - v^*\|\}_{k \ge 0}$ and $\{\hat{v}^k_n\}_{k \ge 0}$. Although simpler than $\{\|\hat{v}^k_n - v^*\|\}_{k \ge 0}$ and $\{\hat{v}^k_n\}_{k \ge 0}$, the stochastic process $\{X^k_n\}_{k \ge 0}$ is still too complicated to work with analytically. We overcome this difficulty with a family of dominating Markov chains.
We now present our dominance argument. Several technical details are expanded upon in the appendix.

We denote our family of "dominating" Markov chains (MC) by $\{Y_n^k\}_{k \ge 0}$. We construct these Markov chains to be tractable and to help us analyze $\{X_n^k\}_{k \ge 0}$. Notice that the family $\{Y_n^k\}_{k \ge 0}$ has explicit dependence on $n \ge 1$. We do not necessarily construct $\{Y_n^k\}_{k \ge 0}$ on the probability space $(\Omega^\infty, \mathcal{F}^\infty, P)$. Rather, we view $\{Y_n^k\}_{k \ge 0}$ as being defined on $(\mathbb{N}^\infty, \mathcal{N})$, the canonical measurable space of trajectories on $\mathbb{N}$, so $Y_n^k : \mathbb{N}^\infty \to \mathbb{N}$. We use $Q$ to denote the probability measure of $\{Y_n^k\}_{k \ge 0}$ on $(\mathbb{N}^\infty, \mathcal{N})$. Since $\{Y_n^k\}_{k \ge 0}$ will be a Markov chain by construction, the probability measure $Q$ is completely determined by an initial distribution on $\mathbb{N}$ and a transition kernel. We denote the transition kernel of $\{Y_n^k\}_{k \ge 0}$ by $\mathfrak{Q}_n$.

Our specific choice for $\{Y_n^k\}_{k \ge 0}$ is motivated by analytical expediency, though the reader will see that many other choices are possible. We now construct the process $\{Y_n^k\}_{k \ge 0}$ explicitly, and then compute its steady state probabilities and its mixing time. We define the stochastic process $\{Y_n^k\}_{k \ge 0}$ on the finite state space $\mathcal{X}$, based on our observations about the boundedness of $\{\|\hat v_n^k - v^*\|\}_{k \ge 0}$ and $\{X_n^k\}_{k \ge 0}$. For a fixed $n$ and $p_n(\epsilon_g)$ as in Assumption 4 (we drop the argument $\epsilon_g$ in the following for notational convenience), we construct the dominating Markov chain $\{Y_n^k\}_{k \ge 0}$ as
\[ Y_n^{k+1} = \begin{cases} \max\{Y_n^k - 1, \eta^*\}, & \text{w.p. } p_n, \\ N^*, & \text{w.p. } 1 - p_n. \end{cases} \tag{3.5} \]
The first value $Y_n^{k+1} = \max\{Y_n^k - 1, \eta^*\}$ corresponds to the case where the approximation error satisfies $\|T \hat v_n^k - \widehat T_n \hat v_n^k\| < \epsilon_g$, and the second value $Y_n^{k+1} = N^*$ corresponds to all other cases (giving us an extremely conservative bound in the sequel). This construction also ensures that $Y_n^k \in \mathcal{X}$ for all $k$, $Q$-almost surely. Informally, $\{Y_n^k\}_{k \ge 0}$ will either move one unit closer to zero until it reaches $\eta^*$, or it will move (as far away from zero as possible) to $N^*$.
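The transition rule (3.5) is simple enough to simulate directly. The following is a minimal sketch, assuming integer states and a success probability supplied by the caller; the names `dominating_step` and `simulate_dominating_chain` are hypothetical and used only for illustration.

```python
import random


def dominating_step(y: int, p_n: float, eta_star: int, n_star: int,
                    rng: random.Random) -> int:
    """One transition of the dominating chain in (3.5).

    With probability p_n the chain moves one unit toward eta_star
    (and stays there once it arrives); otherwise it jumps to n_star.
    """
    if rng.random() < p_n:
        return max(y - 1, eta_star)
    return n_star


def simulate_dominating_chain(y0: int, p_n: float, eta_star: int,
                              n_star: int, num_steps: int, seed: int = 0):
    """Return the sample path [Y^0, Y^1, ..., Y^num_steps]."""
    rng = random.Random(seed)
    path = [y0]
    for _ in range(num_steps):
        path.append(dominating_step(path[-1], p_n, eta_star, n_star, rng))
    return path
```

With `p_n = 1.0` the path decreases deterministically from its start to `eta_star` and stays there, while with `p_n = 0.0` every step lands on `n_star`; intermediate values of `p_n` interpolate between these two extremes, which is the conservative behavior the construction is designed to capture.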
We now summarize some key properties of $\{Y_n^k\}_{k \ge 0}$.

Proposition 1. For $\{Y_n^k\}_{k \ge 0}$ as defined above,
(i) it is a Markov chain;
(ii) the steady state distribution of $\{Y_n^k\}_{k \ge 0}$, and the limit $Y_n =_d \lim_{k \to \infty} Y_n^k$, exist;
(iii) $Q\{Y_n^k > \eta\} \to Q\{Y_n > \eta\}$ as $k \to \infty$ for all $\eta \in \mathbb{N}$.

Proof. Parts (i)-(iii) follow by construction of $\{Y_n^k\}_{k \ge 0}$ and the fact that this family consists of irreducible Markov chains on a finite state space.

We now describe a stochastic dominance relationship between the two stochastic processes $\{X_n^k\}_{k \ge 0}$ and $\{Y_n^k\}_{k \ge 0}$. The notion of stochastic dominance (in the usual sense) will be central to our development.

Definition 3. Let $X$ and $Y$ be two real-valued random variables. Then $Y$ stochastically dominates $X$, written $X \le_{st} Y$, when $E[f(X)] \le E[f(Y)]$ for all increasing functions $f : \mathbb{R} \to \mathbb{R}$.

The condition $X \le_{st} Y$ is known to be equivalent to
\[ E[1\{X \ge \theta\}] \le E[1\{Y \ge \theta\}] \quad \text{or} \quad \Pr\{X \ge \theta\} \le \Pr\{Y \ge \theta\}, \]
for all $\theta$ in the support of $Y$. Notice that the relation $X \le_{st} Y$ makes no mention of the respective probability spaces on which $X$ and $Y$ are defined: these spaces may be the same or different (in our case they are different).

Let $\{\mathcal{F}^k\}_{k \ge 0}$ be the filtration on $(\Omega^\infty, \mathcal{F}^\infty, P)$ corresponding to the evolution of information about $\{X_n^k\}_{k \ge 0}$. Let $[X_n^{k+1} \mid \mathcal{F}^k]$ denote the conditional distribution of $X_n^{k+1}$ given the information $\mathcal{F}^k$. The following theorem compares the marginal distributions of $\{X_n^k\}_{k \ge 0}$ and $\{Y_n^k\}_{k \ge 0}$ at all times $k \ge 0$ when the two stochastic processes start from the same state.

Theorem 11. If $X_n^0 = Y_n^0$, then $X_n^k \le_{st} Y_n^k$ for all $k \ge 0$.

Proof is given in Appendix 3.8.2.

The following corollary of Theorem 11 relates the stochastic processes $\{\|\hat v_n^k - v^*\|\}_{k \ge 0}$, $\{X_n^k\}_{k \ge 0}$, and $\{Y_n^k\}_{k \ge 0}$ in a probabilistic sense, and summarizes our stochastic dominance argument.

Corollary 1. For any fixed $n \ge 1$, we have
(i) $P\{\|\hat v_n^k - v^*\| > \eta \epsilon_g\} \le P\{X_n^k > \eta\} \le Q\{Y_n^k > \eta\}$ for all $\eta \in \mathbb{N}$ and all $k \ge 0$;
(ii) $\limsup_{k \to \infty} P\{X_n^k > \eta\} \le Q\{Y_n > \eta\}$ for all $\eta \in \mathbb{N}$;
(iii) $\limsup_{k \to \infty} P\{\|\hat v_n^k - v^*\| > \eta \epsilon_g\} \le Q\{Y_n > \eta\}$ for all $\eta \in \mathbb{N}$.

Proof. (i) The first inequality is true by construction of $X_n^k$. Then $P\{X_n^k > \eta\} \le Q\{Y_n^k > \eta\}$ for all $k \ge 0$ and $\eta \in \mathbb{N}$ by Theorem 11.
(ii) Since $Q\{Y_n^k > \eta\}$ converges (by Proposition 1), the result follows by taking the limit in part (i).
(iii) This again follows by taking the limit in part (i) and using Proposition 1.

We now compute the steady state distribution of the Markov chain $\{Y_n^k\}_{k \ge 0}$. Let $\mu_n$ denote the steady state distribution of $Y_n =_d \lim_{k \to \infty} Y_n^k$ (whose existence is guaranteed by Proposition 1), where $\mu_n(i) = Q\{Y_n = i\}$ for all $i \in \mathcal{X}$. The next lemma follows from standard techniques (see [103] for example). Proof is given in Appendix 3.8.3.

Lemma 8. For any fixed $n \ge 1$,
\[ \mu_n(\eta^*) = p_n^{N^* - \eta^* - 1}, \qquad \mu_n(N^*) = \frac{1 - p_n}{p_n}, \qquad \mu_n(i) = (1 - p_n)\, p_n^{(N^* - i - 1)}, \quad \forall i = \eta^* + 1, \ldots, N^* - 1. \]

Note that an explicit expression for $p_n$ in the case of the empirical Bellman operator is given in equation (3.6).

3.3.4 Convergence Analysis of Random Operators

We now give results on the convergence of the stochastic process $\{\hat v_n^k\}_{k \ge 0}$, which could equivalently be written $\{\widehat T_n^k \hat v^0\}_{k \ge 0}$. We also elaborate on the connections between our different notions of fixed points. Throughout this section, $v^*$ denotes the fixed point of the deterministic operator $T$ as defined in Assumption 1.

Theorem 12. Suppose the random operator $\widehat T_n$ satisfies Assumptions 1-4. Then for any $\epsilon > 0$,
\[ \lim_{n \to \infty} \limsup_{k \to \infty} P\left( \|\hat v_n^k - v^*\| > \epsilon \right) = 0, \]
i.e. $v^*$ is a weak probabilistic fixed point of $\{\widehat T_n\}_{n \ge 1}$.

Proof. Choose the granularity $\epsilon_g = \epsilon / \eta^*$. From Corollary 1 and Lemma 8,
\[ \limsup_{k \to \infty} P\{\|\hat v_n^k - v^*\| > \eta^* \epsilon_g\} \le Q\{Y_n > \eta^*\} = 1 - \mu_n(\eta^*) = 1 - p_n^{N^* - \eta^* - 1}. \]
By Assumption 4, $p_n \uparrow 1$, and taking the limit on both sides of the above inequality gives the desired result.
We now show that a strong probabilistic fixed point and the deterministic fixed point $v^*$ coincide under Assumption 1.

Proposition 2. Suppose Assumption 1 holds. Then,
(i) $v^*$ is a strong probabilistic fixed point of the sequence $\{\widehat T_n\}_{n \ge 1}$.
(ii) Let $\hat v$ be a strong probabilistic fixed point of the sequence $\{\widehat T_n\}_{n \ge 1}$; then $\hat v$ is a fixed point of $T$.

Proof is given in Appendix 3.8.4.

Thus the set of fixed points of $T$ and the set of strong probabilistic fixed points of $\{\widehat T_n\}_{n \ge 1}$ coincide. This suggests that a "probabilistic" fixed point would be an "approximate" fixed point of the deterministic operator $T$.

We now explore the connection between weak probabilistic fixed points and classical fixed points.

Proposition 3. Suppose the random operator $\widehat T_n$ satisfies Assumptions 1-4. Then,
(i) $v^*$ is a weak probabilistic fixed point of the sequence $\{\widehat T_n\}_{n \ge 1}$.
(ii) Let $\hat v$ be a weak probabilistic fixed point of the sequence $\{\widehat T_n\}_{n \ge 1}$; then $\hat v$ is a fixed point of $T$.

Proof is given in Appendix 3.8.5.

It is obvious that we need more assumptions to analyze the asymptotic behavior of the iterates of the random operator $\widehat T_n$ and establish the connection to the fixed point of the deterministic operator.

We summarize the above discussion in the following theorem.

Theorem 13. Suppose the random operator $\widehat T_n$ satisfies Assumptions 1-4. Then the following three statements are equivalent:
(i) $v$ is a fixed point of $T$,
(ii) $v$ is a strong probabilistic fixed point of $\{\widehat T_n\}_{n \ge 1}$,
(iii) $v$ is a weak probabilistic fixed point of $\{\widehat T_n\}_{n \ge 1}$.

This is quite remarkable because we see not only that the two notions of a probabilistic fixed point of a sequence of random operators coincide, but in fact they coincide with the fixed point of the related classical operator. Actually, it would have been disappointing if this were not the case.
The above result suggests that iterating a random operator a finite number $k$ of times, for a fixed $n$, yields an approximation to the classical fixed point with high probability. Thus, the notions of the $(\epsilon, \delta)$-strong and weak probabilistic fixed points coincide asymptotically; however, we note that non-asymptotically they need not be the same.

3.4 Sample Complexity for EDP

In this section we present the proofs of the sample complexity results for empirical value iteration (EVI) and empirical policy iteration (EPI) (Theorem 9 and Theorem 10 in Section 3.2).

3.4.1 Empirical Bellman Operator

Recall the definition of the empirical Bellman operator in equation (3.1). Here we give a mathematical basis for that definition which will help us analyze the convergence behaviour of EVI (since EVI can be framed as an iteration of this operator).

The empirical Bellman operator is a random operator, because it maps random samples to operators. Recall from Section 3.3.1 that we define the random operator on the sample space $\Omega = [0,1]^\infty$, where primitive uncertainties on $\Omega$ are infinite sequences of uniform noise $\omega = (\xi_i)_{i \ge 0}$ and each $\xi_i$ is an independent uniform random variable on $[0,1]$. This convention, rather than just defining $\Omega = [0,1]^n$ for a fixed $n \ge 1$, makes convergence statements with respect to $n$ easier to make.

Classical value iteration is performed by iterating the Bellman operator $T$. Our EVI algorithm is performed by choosing $n$ and then iterating the random operator $\widehat T_n$. We follow the notation introduced in Section 3.3.1, so the $k$th iterate of EVI, $\hat v_n^k$, is given by $\hat v_n^k = \widehat T_n^k \hat v_n^0$, where $\hat v_n^0 \in \mathbb{R}^{|S|}$ is an initial seed for EVI.

We first show that the empirical Bellman operator satisfies Assumptions 1-4. Then the analysis follows the results of Section 3.3.

Proposition 4. The Bellman operator $T$ and the empirical Bellman operator $\widehat T_n$ (defined in equation (3.1)) satisfy Assumptions 1-4.

Proof is given in Appendix 3.8.6.
We note that we can give an explicit expression for $p_n(\epsilon)$ (of Assumption 4) as below. For the proof, refer to the proof of Proposition 4:
\[ P\{\|\widehat T_n v - Tv\| < \epsilon\} > p_n(\epsilon) := 1 - 2|K|\, e^{-2(\epsilon/\alpha)^2 n / (2\kappa^*)^2}. \tag{3.6} \]
We can also give an explicit expression for $\kappa^*$ in Assumption 2 as
\[ \kappa^* := \frac{\max_{(s,a) \in K} |c(s,a)|}{1 - \alpha}. \tag{3.7} \]
For the proof, refer to the proof of Proposition 4.

3.4.2 Empirical Value Iteration

Here we use the results of Section 3.3 to analyze the convergence of EVI. We first give an asymptotic result.

Proposition 5. For any $\delta_1 \in (0,1)$, select $n$ such that
\[ n \ge \frac{2(\kappa^*)^2}{(\epsilon_g/\alpha)^2} \log \frac{2|K|}{\delta_1}; \]
then
\[ \limsup_{k \to \infty} P\{\|\hat v_n^k - v^*\| > \eta^* \epsilon_g\} \le 1 - \mu_n(\eta^*) \le \delta_1. \]

Proof. From Corollary 1, $\limsup_{k \to \infty} P\{\|\hat v_n^k - v^*\| > \eta^* \epsilon_g\} \le Q\{Y_n > \eta^*\} = 1 - \mu_n(\eta^*)$. For $1 - \mu_n(\eta^*)$ to be less than $\delta_1$, we compute $n$ using Lemma 8 as
\[ 1 - \delta_1 \le \mu_n(\eta^*) = p_n^{N^* - \eta^* - 1} \le p_n = 1 - 2|K|\, e^{-2(\epsilon_g/\alpha)^2 n / (2\kappa^*)^2}. \]
Thus, we get the desired result.

We cannot iterate $\widehat T_n$ forever, so we need a guideline for a finite choice of $k$. This question can be answered in terms of mixing times. The total variation distance between two probability measures $\mu$ and $\nu$ on $S$ is
\[ \|\mu - \nu\|_{TV} = \max_{A \subset S} |\mu(A) - \nu(A)| = \frac{1}{2} \sum_{s \in S} |\mu(s) - \nu(s)|. \]
Let $Q_n^k$ be the marginal distribution of $Y_n^k$ on $\mathbb{N}$ at stage $k$ and let
\[ d(k) = \|Q_n^k - \mu_n\|_{TV} \]
be the total variation distance between $Q_n^k$ and the steady state distribution $\mu_n$. For $\delta_2 > 0$, we define
\[ t_{mix}(\delta_2) = \min\{k : d(k) \le \delta_2\} \]
to be the minimum length of time needed for the marginal distribution of $Y_n^k$ to be within $\delta_2$ of the steady state distribution in total variation norm. We now bound $t_{mix}(\delta_2)$.

Lemma 9. For any $\delta_2 > 0$,
\[ t_{mix}(\delta_2) \le \log\left( \frac{1}{\delta_2\, \mu_{n,\min}} \right), \]
where $\mu_{n,\min} := \min_\eta \mu_n(\eta)$.

Proof. Let $\mathfrak{Q}_n$ be the transition matrix of the Markov chain $\{Y_n^k\}_{k \ge 0}$. Also let $\lambda_\star = \max\{|\lambda| : \lambda \text{ is an eigenvalue of } \mathfrak{Q}_n,\ \lambda \ne 1\}$. By [104, Theorem 12.3],
\[ t_{mix}(\delta_2) \le \log\left( \frac{1}{\delta_2\, \mu_{n,\min}} \right) \frac{1}{1 - \lambda_\star}, \]
but $\lambda_\star = 0$ by the lemma given in Appendix 3.8.7.

We now use the above bound on mixing time to get a non-asymptotic bound for EVI.
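Before turning to that bound, the sample-size rule of Proposition 5 and the expression (3.6) can be evaluated numerically. The sketch below is illustrative only: the function names `p_n_lower_bound` and `evi_sample_size` and the argument `num_pairs` (standing for $|K|$) are hypothetical, and the natural logarithm is assumed.

```python
import math


def p_n_lower_bound(eps: float, n: int, alpha: float,
                    kappa_star: float, num_pairs: int) -> float:
    """Right-hand side of (3.6):
    1 - 2|K| exp(-2 (eps/alpha)^2 n / (2 kappa*)^2).

    The value can be negative for small n, in which case the bound
    carries no information.
    """
    exponent = 2.0 * (eps / alpha) ** 2 * n / (2.0 * kappa_star) ** 2
    return 1.0 - 2.0 * num_pairs * math.exp(-exponent)


def evi_sample_size(eps_g: float, delta1: float, alpha: float,
                    kappa_star: float, num_pairs: int) -> int:
    """Smallest integer n satisfying the condition of Proposition 5:
    n >= 2 (kappa*)^2 / (eps_g/alpha)^2 * log(2|K| / delta1)."""
    bound = (2.0 * kappa_star ** 2 / (eps_g / alpha) ** 2
             * math.log(2.0 * num_pairs / delta1))
    return math.ceil(bound)
```

With this choice of `n`, the exponential term in (3.6) is at most `delta1`, so `p_n_lower_bound` evaluated at `eps_g` is at least `1 - delta1`. The required `n` grows like the inverse square of `eps_g` and only logarithmically in `num_pairs` and `1 / delta1`.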
Proposition 6. For any fixed $n \ge 1$,
\[ P\{\|\hat v_n^k - v^*\| > \eta^* \epsilon_g\} \le 2\delta_2 + (1 - \mu_n(\eta^*)) \]
for $k \ge \log\left( \frac{1}{\delta_2\, \mu_{n,\min}} \right)$.

Proof. For $k \ge \log\left( \frac{1}{\delta_2\, \mu_{n,\min}} \right) \ge t_{mix}(\delta_2)$,
\[ d(k) = \frac{1}{2} \sum_{i=\eta^*}^{N^*} |Q(Y_n^k = i) - \mu_n(i)| \le \delta_2. \]
Then $|Q(Y_n^k = \eta^*) - \mu_n(\eta^*)| \le 2 d(k) \le 2\delta_2$. So
\[ P\{\|\hat v_n^k - v^*\| > \eta^* \epsilon_g\} \le Q(Y_n^k > \eta^*) = 1 - Q(Y_n^k = \eta^*) \le 2\delta_2 + (1 - \mu_n(\eta^*)). \]

We now combine Propositions 5 and 6 to prove Theorem 9.

Proof of Theorem 9. Let $\epsilon_g = \epsilon / \eta^*$, and let $\delta_1, \delta_2$ be positive with $\delta_1 + 2\delta_2 \le \delta$. By Proposition 5, for $n \ge n(\epsilon, \delta)$ we have
\[ \limsup_{k \to \infty} P\{\|\hat v_n^k - v^*\| > \epsilon\} = \limsup_{k \to \infty} P\{\|\hat v_n^k - v^*\| > \eta^* \epsilon_g\} \le 1 - \mu_n(\eta^*) \le \delta_1. \]
Now, for $k \ge k(\epsilon, \delta)$, by Proposition 6,
\[ P\{\|\hat v_n^k - v^*\| > \eta^* \epsilon_g\} = P\{\|\hat v_n^k - v^*\| > \epsilon\} \le 2\delta_2 + (1 - \mu_n(\eta^*)). \]
Combining both we get $P\{\|\hat v_n^k - v^*\| > \epsilon\} \le \delta$.

3.4.3 Empirical Policy Iteration

We now consider empirical policy iteration. EPI is different from EVI, and seemingly more difficult to analyze, because it does not correspond to iteration of a random operator. Furthermore, it has two simulation components, empirical policy evaluation and empirical policy update. However, we show that the convergence analysis can be carried out in a manner similar to that of EVI.

We first give a sample complexity result for policy evaluation. For a policy $\pi$, let $v^\pi$ be the actual value of the policy and let $\hat v_q^\pi$ be the empirical evaluation. Then:

Proposition 7. For any $\pi \in \Pi$, $\epsilon \in (0, \gamma)$ and for any $\delta > 0$,
\[ P\{\|\hat v_q^\pi - v^\pi\| \ge \epsilon\} \le \delta \]
for
\[ q \ge \frac{2(\kappa^*(T+1))^2}{(\epsilon - \gamma)^2} \log \frac{2|S|}{\delta}, \]
where $\hat v_q^\pi$ is the evaluation of $v^\pi$ obtained by averaging $q$ simulation runs.

Proof. Let $v^{\pi,T} := E\left[ \sum_{t=0}^{T} \alpha^t c(s_t(\omega), \pi(s_t(\omega))) \right]$. Then,
\[ |\hat v_q^\pi(s) - v^\pi(s)| \le |\hat v_q^\pi(s) - v^{\pi,T}| + |v^{\pi,T} - v^\pi| \]
\[ \le \left| \frac{1}{q} \sum_{i=1}^{q} \sum_{t=0}^{T} \alpha^t c(s_t(\omega_i), \pi(s_t(\omega_i))) - v^{\pi,T} \right| + \gamma \]
\[ \le \sum_{t=0}^{T} \left| \frac{1}{q} \sum_{i=1}^{q} \Big( c(s_t(\omega_i), \pi(s_t(\omega_i))) - E[c(s_t(\omega), \pi(s_t(\omega)))] \Big) \right| + \gamma. \]
Then, with $\tilde\epsilon = (\epsilon - \gamma)/(T+1)$,
\[ P\left( |\hat v_q^\pi(s) - v^\pi(s)| \ge \epsilon \right) \le P\left( \left| \frac{1}{q} \sum_{i=1}^{q} \Big( c(s_t(\omega_i), \pi(s_t(\omega_i))) - E[c(s_t(\omega_i), \pi(s_t(\omega_i)))] \Big) \right| \ge \tilde\epsilon \right) \le 2 e^{-2 q \tilde\epsilon^2 / (2\kappa^*)^2}. \]
By applying the union bound, we get
\[ P\{\|\hat v_q^\pi - v^\pi\| \ge \epsilon\} \le 2|S|\, e^{-2 q \tilde\epsilon^2 / (2\kappa^*)^2}. \]
For $q \ge \frac{2(\kappa^*(T+1))^2}{(\epsilon - \gamma)^2} \log \frac{2|S|}{\delta}$ the above probability is less than $\delta$.

We define
\[ P\{\|\hat v_q^\pi - v^\pi\| < \epsilon\} > r_q(\epsilon) := 1 - 2|S|\, e^{-2 q \tilde\epsilon^2 / (2\kappa^*)^2}, \quad \text{with } \tilde\epsilon = (\epsilon - \gamma)/(T+1). \tag{3.8} \]
We say that empirical policy evaluation is $\epsilon_1$-accurate if $\|\hat v_q^\pi - v^\pi\| < \epsilon_1$. Then by the above proposition empirical policy evaluation is $\epsilon_1$-accurate with probability greater than $r_q(\epsilon_1)$.

The accuracy of the empirical policy update compared to the actual policy update depends on the empirical Bellman operator $\widehat T_n$. We say that the empirical policy update is $\epsilon_2$-accurate if $\|\widehat T_n v - Tv\| < \epsilon_2$. Then, by the definition of $p_n$ in equation (3.6), our empirical policy update is $\epsilon_2$-accurate with probability greater than $p_n(\epsilon_2)$.

We now give an important technical lemma. The proof is essentially a probabilistic modification of Lemmas 6.1 and 6.2 in [34] and is omitted.

Lemma 10. Let $\{\pi_k\}_{k \ge 0}$ be the sequence of policies from the EPI algorithm. For a fixed $k$, assume that $P\{\|v^{\pi_k} - \hat v_q^{\pi_k}\| < \epsilon_1\} \ge (1 - \delta_1)$ and $P\left( \|T \hat v_q^{\pi_k} - \widehat T_n \hat v_q^{\pi_k}\| < \epsilon_2 \right) \ge (1 - \delta_2)$. Then,
\[ \|v^{\pi_{k+1}} - v^*\| \le \alpha \|v^{\pi_k} - v^*\| + \frac{\epsilon_2 + 2\alpha\epsilon_1}{1 - \alpha} \]
with probability at least $(1 - \delta_1)(1 - \delta_2)$.

We now proceed as in the analysis of EVI given in the previous subsection. Here we track the sequence $\{\|v^{\pi_k} - v^*\|\}_{k \ge 0}$. Note that, this being a proof technique, the fact that the value $\|v^{\pi_k} - v^*\|$ is not observable does not affect our algorithm or its convergence behavior. We define
\[ X_{n,q}^k = \left\lceil \|\hat v^{\pi_k} - v^*\| / \epsilon_g \right\rceil, \]
where the granularity $\epsilon_g$ is fixed according to the problem parameters as $\epsilon_g = \frac{\epsilon_2 + 2\alpha\epsilon_1}{1 - \alpha}$. Then by Lemma 10,
\[ \text{if } X_{n,q}^k = \eta, \text{ then } X_{n,q}^{k+1} \le \lceil \alpha\eta + 1 \rceil \text{ with probability at least } p_{n,q} = r_q(\epsilon_1)\, p_n(\epsilon_2). \tag{3.9} \]
This is equivalent to the result for EVI given in equation (3.4). Hence the analysis is the same from here onwards. However, for completeness, we explicitly give the dominating Markov chain and its steady state distribution.
For $p_{n,q}$ given in display (3.9), we construct the dominating Markov chain $\{Y_{n,q}^k\}_{k \ge 0}$ as
\[ Y_{n,q}^{k+1} = \begin{cases} \max\{Y_{n,q}^k - 1, \eta^*\}, & \text{w.p. } p_{n,q}, \\ N^*, & \text{w.p. } 1 - p_{n,q}, \end{cases} \tag{3.10} \]
which exists on the state space $\mathcal{X}$. The family $\{Y_{n,q}^k\}_{k \ge 0}$ is identical to $\{Y_n^k\}_{k \ge 0}$ except that its transition probabilities depend on $n$ and $q$ rather than just $n$. Let $\mu_{n,q}$ denote the steady state distribution of the Markov chain $\{Y_{n,q}^k\}_{k \ge 0}$. Then by Lemma 8,
\[ \mu_{n,q}(\eta^*) = p_{n,q}^{N^* - \eta^* - 1}, \qquad \mu_{n,q}(N^*) = \frac{1 - p_{n,q}}{p_{n,q}}, \qquad \mu_{n,q}(i) = (1 - p_{n,q})\, p_{n,q}^{(N^* - i - 1)}, \quad \forall i = \eta^* + 1, \ldots, N^* - 1. \tag{3.11} \]

Proof of Theorem 10. First observe that by the given choice of $n$ and $q$, we have $r_q \ge (1 - \delta_{11})$ and $p_n \ge (1 - \delta_{12})$. Hence $1 - p_{n,q} \le \delta_{11} + \delta_{12} - \delta_{11}\delta_{12} < \delta_1$. Now by Corollary 1,
\[ \limsup_{k \to \infty} P\{\|\hat v^{\pi_k} - v^*\| > \epsilon\} = \limsup_{k \to \infty} P\{\|\hat v^{\pi_k} - v^*\| > \eta^* \epsilon_g\} \le Q\{Y_{n,q} > \eta^*\} = 1 - \mu_{n,q}(\eta^*). \]
For $1 - \mu_{n,q}(\eta^*)$ to be less than $\delta_1$ we need $1 - \delta_1 \le \mu_{n,q}(\eta^*) = p_{n,q}^{N^* - \eta^* - 1} \le p_{n,q}$, which is true as verified above. Thus we get $\limsup_{k \to \infty} P\{\|\hat v^{\pi_k} - v^*\| > \epsilon\} \le \delta_1$, similar to the result of Proposition 5. Selecting the number of iterations $k$ based on the mixing time is the same as in Proposition 6. Combining both as in the proof of Theorem 9 we get the desired result.

3.5 Variations and Extensions

We now consider some variations and extensions of EVI.

3.5.1 Asynchronous Value Iteration

The EVI algorithm described above is synchronous, meaning that the value estimates for every state are updated simultaneously. Here we consider each state to be visited at least once to complete a full update cycle. We modify the earlier argument to account for the possibly random time between full update cycles.

Classical asynchronous value iteration with exact updates has already been studied. Let $(x_k)_{k \ge 0}$ be any infinite sequence of states in $S$. This sequence $(x_k)_{k \ge 0}$ may be deterministic or stochastic, and it may even depend online on the value function updates. For any $x \in S$, we define the asynchronous Bellman operator $T_x : \mathbb{R}^{|S|} \to \mathbb{R}^{|S|}$ via
\[ [T_x v](s) = \begin{cases} \min_{a \in A(s)} \{ c(s,a) + \alpha E[v(\psi(s,a,\xi))] \}, & s = x, \\ v(s), & s \ne x. \end{cases} \]
The operator $T_x$ only updates the estimate of the value function for state $x$, and leaves the estimates for all other states exactly as they are. Given an initial seed $v^0 \in \mathbb{R}^{|S|}$, asynchronous value iteration produces the sequence $\{v^k\}_{k \ge 0}$ defined by $v^{k+1} = T_{x_k} T_{x_{k-1}} \cdots T_{x_0} v^0$ for $k \ge 0$.

The following key properties of $T_x$ are immediate.

Lemma 11. For any $x \in S$:
(i) $T_x$ is monotonic;
(ii) $T_x[v + \eta 1] = T_x v + \alpha \eta e_x$, where $e_x \in \mathbb{R}^{|S|}$ is the unit vector corresponding to $x \in S$.

Proof is given in Appendix 3.8.8.

The next lemma is used to show that classical asynchronous VI converges. Essentially, a cycle of updates that visits every state at least once is a contraction.

Lemma 12. Let $(x_k)_{k=1}^{K}$ be any finite sequence of states such that every state in $S$ appears at least once. Then the operator
\[ \widetilde T = T_{x_1} T_{x_2} \cdots T_{x_K} \]
is a contraction with constant $\alpha$.

It is known that asynchronous VI converges when each state is visited infinitely often. To continue, define $K_0 = 0$ and, in general,
\[ K_{m+1} := \inf\left\{ k : k \ge K_m,\ (x_i)_{i=K_m+1}^{k} \text{ includes every state in } S \right\}. \]
Time $K_1$ is the first time that every state in $S$ is visited at least once by the sequence $(x_k)_{k \ge 0}$. Time $K_2$ is the first time after $K_1$ that every state is visited at least once again by the sequence $(x_k)_{k \ge 0}$, etc. The times $\{K_m\}_{m \ge 0}$ depend completely on $(x_k)_{k \ge 0}$. For any $m \ge 0$, if we define
\[ \widetilde T = T_{x_{K_{m+1}}} T_{x_{K_{m+1}-1}} \cdots T_{x_{K_m+2}} T_{x_{K_m+1}}, \]
then we know $\|\widetilde T v - v^*\| \le \alpha \|v - v^*\|$ by the preceding lemma. It is known that asynchronous VI converges under some conditions on $(x_k)_{k \ge 0}$.

Theorem 14 [29]. Suppose each state in $S$ is included infinitely often by the sequence $(x_k)_{k \ge 0}$. Then $v^k \to v^*$.

Next we describe an empirical version of classical asynchronous value iteration. Again, we replace exact computation of the expectation with an empirical estimate, and we regenerate the sample at each iteration.
Algorithm 6 Asynchronous empirical value iteration
Input: $\hat v^0 \in \mathbb{R}^{|S|}$, sample size $n \ge 1$, a sequence $(x_k)_{k \ge 0}$. Set counter $k = 0$.
1. Sample $n$ uniformly distributed random variables $\{\xi_i\}_{i=1}^{n}$, and compute
\[ \hat v^{k+1}(s) = \begin{cases} \min_{a \in A(s)} \left\{ c(s,a) + \frac{\alpha}{n} \sum_{i=1}^{n} \hat v^k(\psi(s,a,\xi_i)) \right\}, & s = x_k, \\ \hat v^k(s), & s \ne x_k. \end{cases} \]
2. Increment $k := k + 1$ and return to step 1.

Step 1 of this algorithm replaces the exact computation $v^{k+1} = T_{x_k} v^k$ with an empirical variant. Using our earlier notation, we let $\widehat T_{x,n}$ be a random operator that only updates the value function for state $x$ using an empirical estimate with sample size $n \ge 1$:
\[ \left[ \widehat T_{x,n}(\omega) v \right](s) = \begin{cases} \min_{a \in A(s)} \left\{ c(s,a) + \frac{\alpha}{n} \sum_{i=1}^{n} v(\psi(s,a,\xi_i)) \right\}, & s = x, \\ v(s), & s \ne x. \end{cases} \]
We use $\{\hat v_n^k\}_{k \ge 0}$ to denote the sequence of asynchronous EVI iterates,
\[ \hat v_n^{k+1} = \widehat T_{x_k,n} \widehat T_{x_{k-1},n} \cdots \widehat T_{x_0,n} \hat v_n^0, \]
or more compactly $\hat v_n^{k+1} = \widehat T_{x_k,n} \hat v_n^k$ for all $k \ge 0$.

We can use a slightly modified stochastic dominance argument to show that asynchronous EVI converges in a probabilistic sense. Only now we must account for the hitting times $\{K_m\}_{m \ge 0}$ as well, since the accuracy of the overall update depends on the accuracy in $\widehat T_{x,n}$ as well as the length of the interval $\{K_m + 1, K_m + 2, \ldots, K_{m+1}\}$. In asynchronous EVI, we focus on $\{\hat v_n^{K_m}\}_{m \ge 0}$ rather than $\{\hat v_n^k\}_{k \ge 0}$. We check the progress of the algorithm at the end of complete update cycles.

In the simplest update scheme, we can order the states and then update them in the same order throughout the algorithm. The sequence $(x_k)_{k \ge 0}$ is deterministic in this case, and the intervals $\{K_m + 1, K_m + 2, \ldots, K_{m+1}\}$ all have the same length $|S|$. Consider
\[ \widetilde T = T_{x_{K_1}} T_{x_{K_1 - 1}} \cdots T_{x_1} T_{x_0}. \]
The operator $\widehat T_{x_0,n}$ introduces $\epsilon$ error into component $x_0$, the operator $\widehat T_{x_1,n}$ introduces $\epsilon$ error into component $x_1$, etc. To ensure that
\[ \widehat T = \widehat T_{x_{K_1},n} \widehat T_{x_{K_1 - 1},n} \cdots \widehat T_{x_1,n} \widehat T_{x_0,n} \]
is close to $\widetilde T$, we require each $\widehat T_{x_k,n}$ to be close to $T_{x_k}$ for $k = 0, 1, \ldots, K_1 - 1, K_1$.
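Algorithm 6 with a fixed cyclic update order can be sketched as follows. This is a minimal illustration, not a reference implementation: it assumes a finite MDP given by nested lists, that every action is feasible in every state, and that the simulation function is built by inverse-CDF sampling from a transition table. The names `make_psi` and `async_evi` are hypothetical.

```python
import random
from bisect import bisect_right
from itertools import accumulate


def make_psi(P):
    """Build a simulation function psi(s, a, xi) from transition
    probabilities P[s][a][j] by inverse-CDF sampling (an assumption
    about how the simulation model is realized)."""
    cdf = {(s, a): list(accumulate(P[s][a]))
           for s in range(len(P)) for a in range(len(P[s]))}

    def psi(s, a, xi):
        c = cdf[(s, a)]
        # Clamp the index to guard against rounding in the cumulative sums.
        return min(bisect_right(c, xi), len(c) - 1)

    return psi


def async_evi(c, psi, alpha, n, order, num_updates, v0, seed=0):
    """Asynchronous empirical value iteration with update sequence
    x_k = order[k mod len(order)].

    c[s][a] is the cost, alpha the discount factor and n the sample
    size; a fresh sample of n uniforms is drawn at every update.
    """
    rng = random.Random(seed)
    v = list(v0)
    for k in range(num_updates):
        x = order[k % len(order)]
        xis = [rng.random() for _ in range(n)]
        # Only component x is updated; all other components are kept.
        v[x] = min(
            c[x][a] + alpha / n * sum(v[psi(x, a, xi)] for xi in xis)
            for a in range(len(c[x]))
        )
    return v
```

When the transitions are deterministic the empirical average is exact, so the iterates coincide with classical asynchronous value iteration; this makes a convenient sanity check before trying stochastic transitions, where the iterates fluctuate around the fixed point.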
The following noise driven perspective helps with our error analysis. In general, we can view asynchronous empirical value iteration as
\[ v' = T_x v + \varepsilon \]
for all $k \ge 0$, where
\[ \varepsilon = \widehat T_{x,n} v - T_x v \]
is the noise for the evaluation of $T_x$ (and it has at most one nonzero component).

Starting with $v^0$, define the sequence $\{v^k\}_{k \ge 0}$ by exact asynchronous value iteration $v^{k+1} = T_{x_k} v^k$ for all $k \ge 0$. Also set $\tilde v^0 := v^0$ and define
\[ \tilde v^{k+1} = T_{x_k} \tilde v^k + \varepsilon_k \]
for all $k \ge 0$, where $\varepsilon_k \in \mathbb{R}^{|S|}$ is the noise for the evaluation of $T_{x_k}$ on $\tilde v^k$. In the following proposition, we compare the sequences of value functions $\{v^k\}_{k \ge 0}$ and $\{\tilde v^k\}_{k \ge 0}$ under conditions on the noise $\{\varepsilon_k\}_{k \ge 0}$.

Proposition 8. Suppose $-\eta 1 \le \varepsilon_j \le \eta 1$ for all $j = 0, 1, \ldots, k$, where $\eta \ge 0$ and $1 \in \mathbb{R}^{|S|}$ is the vector of ones, i.e. the error is uniformly bounded for $j = 0, 1, \ldots, k$. Then, for all $j = 0, 1, \ldots, k$:
\[ v^j - \left( \sum_{i=0}^{j} \alpha^i \right) \eta 1 \le \tilde v^j \le v^j + \left( \sum_{i=0}^{j} \alpha^i \right) \eta 1. \]

Proof is given in Appendix 3.8.9.

Now we can use the previous proposition to obtain conditions for $\|\widetilde T v - \widehat T v\| < \epsilon$ (for our deterministic update sequence). Starting with the update for state $x_0$, we can choose $n$ to ensure
\[ \|T_{x_0} v - \widehat T_{x_0,n} v\| < \epsilon / |S|, \]
similar to equation (3.6). However, in this case our error bound is
\[ P\left\{ \|\widehat T_{x,n} v - T_x v\| \ge \epsilon/|S| \right\} \le P\left\{ \max_{a \in A(x)} \left| \frac{1}{n} \sum_{i=1}^{n} v(\psi(x,a,\xi_i)) - E[v(\psi(x,a,\xi))] \right| \ge \epsilon / (\alpha |S|) \right\} \le 2|A|\, e^{-2(\epsilon/(\alpha|S|))^2 n / (2\kappa^*)^2}, \]
for all $v \in \mathbb{R}^{|S|}$ (which does not depend on $x$). We are only updating one state, so we are concerned with the approximation of at most $|A|$ terms $c(s,a) + \alpha E[v(\psi(s,a,\xi))]$ rather than $|K|$. At the next update we want
\[ \|\widehat T_{x_1,n} \hat v_n^1 - T_{x_1} \hat v_n^1\| < \epsilon / |S|, \]
and we get the same error bound as above.

Based on this reasoning, assume
\[ \|T_{x_k} \hat v_n^k - \widehat T_{x_k,n} \hat v_n^k\| < \epsilon / |S| \]
for all $k = 0, 1, \ldots, K_1$. In this case we will in fact get the stronger error guarantee
\[ \|\widetilde T v - \widehat T v\| < \frac{\epsilon}{|S|} \sum_{i=0}^{|S|-1} \alpha^i < \epsilon \]
from Proposition 8.
The error probabilities add across the updates in a cycle (union bound), so the probability that
\[ \|T_{x_k} \hat v_n^k - \widehat T_{x_k,n} \hat v_n^k\| < \epsilon / |S| \]
fails for some $k = 0, 1, \ldots, K_1$ is bounded above by $2|S||A|\, e^{-2(\epsilon/(\alpha|S|))^2 n / (2\kappa^*)^2}$. Accordingly we take
\[ p_n = 1 - 2|S||A|\, e^{-2(\epsilon/(\alpha|S|))^2 n / (2\kappa^*)^2}. \]
To understand this result, remember that $|S|$ iterations of asynchronous EVI amount to at most $|S||A|$ empirical estimates of $c(s,a) + \alpha E[v(\psi(s,a,\xi))]$. We require all of these estimates to be within error $\epsilon/|S|$.

We can take the above value for $p_n$ and apply our earlier stochastic dominance argument to $\{\|\hat v_n^{K_m} - v^*\|\}_{m \ge 0}$ without further modification. This technique extends to any deterministic sequence $(x_k)_{k \ge 0}$ where the lengths of a full update for all states, $|K_{m+1} - K_m|$, are uniformly bounded for all $m \ge 0$ (with the sample complexity estimate suitably adjusted).

3.5.2 Minimax Value Iteration

Now we consider a two player zero sum Markov game and show how an empirical min-max value iteration algorithm can be used to compute an approximate Markov perfect equilibrium. Let the Markov game be described by the 7-tuple
\[ (S, A, \{A(s) : s \in S\}, B, \{B(s) : s \in S\}, Q, c). \]
The action space $B$ for player 2 is finite and $B(s)$ accounts for feasible actions for player 2. We let
\[ K = \{(s,a,b) : s \in S,\ a \in A(s),\ b \in B(s)\} \]
be the set of feasible state-action pairs. The transition law $Q$ governs the system evolution: $Q(\cdot \mid s,a,b) \in \mathcal{P}(S)$ for all $(s,a,b) \in K$, where $Q(j \mid s,a,b)$ is the probability of next visiting the state $j$ given $(s,a,b)$. Finally, $c : K \to \mathbb{R}$ is the cost function (say of player 1) in state $s$ for actions $a$ and $b$. Player 1 wants to minimize this quantity, and player 2 is trying to maximize it.

Let the operator $T : \mathbb{R}^{|S|} \to \mathbb{R}^{|S|}$ be defined as
\[ [Tv](s) := \min_{a \in A(s)} \max_{b \in B(s)} \{ c(s,a,b) + \alpha E[v(\tilde s) \mid s,a,b] \}, \quad \forall s \in S, \]
for any $v \in \mathbb{R}^{|S|}$, where $\tilde s$ is the random next state visited and
\[ E[v(\tilde s) \mid s,a,b] = \sum_{j \in S} v(j)\, Q(j \mid s,a,b) \]
is the expected cost-to-go for player 1. We call $T$ the Shapley operator in honor of Shapley, who first introduced it [105].
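The exact Shapley operator defined above can be sketched for a small game as follows. The sketch assumes nested-list inputs and, as a simplification, that the feasible actions of player 2 in a state do not depend on player 1's action; the function name `shapley_operator` is hypothetical.

```python
def shapley_operator(v, c, Q, alpha):
    """Apply [Tv](s) = min_a max_b { c[s][a][b] + alpha * sum_j v[j] * Q[s][a][b][j] }.

    v is a list of values, c[s][a][b] the cost to player 1, and
    Q[s][a][b][j] the probability of moving from s to j under (a, b).
    """
    return [
        min(
            max(
                c[s][a][b]
                + alpha * sum(vj * q for vj, q in zip(v, Q[s][a][b]))
                for b in range(len(c[s][a]))
            )
            for a in range(len(c[s]))
        )
        for s in range(len(v))
    ]
```

In a single-state game with cost matrix `[[1, 3], [2, 2]]` and `alpha = 0.5`, player 1's second action guarantees a stage cost of 2, so the operator maps `[0.0]` to `[2.0]` and leaves `[4.0]` fixed. Repeated application from any seed approaches that fixed point at rate `alpha`.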
We can use $T$ to compute the optimal value function of the game, which is given by
\[ v^*(s) = \min_{a \in A(s)} \max_{b \in B(s)} \{ c(s,a,b) + \alpha E[v^*(\tilde s) \mid s,a,b] \}, \quad \forall s \in S; \]
this is the optimal value function for player 1. It is well known that the Shapley operator is a contraction mapping.

Lemma 13. The Shapley operator $T$ is a contraction.

Proof is given in Appendix 3.8.10 for completeness.

To compute $v^*$, we can iterate $T$. Pick any initial seed $v^0 \in \mathbb{R}^{|S|}$, take $v^1 = Tv^0$, $v^2 = Tv^1$, and in general $v^{k+1} = Tv^k$ for all $k \ge 0$. It is known [105] that this procedure converges to the optimal value function. We refer to this as classical minimax value iteration.

Now, using the simulation model $\psi : S \times A \times B \times [0,1] \to S$, the Shapley operator can be written as
\[ [Tv](s) := \min_{a \in A(s)} \max_{b \in B(s)} \{ c(s,a,b) + \alpha E[v(\psi(s,a,b,\xi))] \}, \quad \forall s \in S, \]
where $\xi$ is a uniform random variable on $[0,1]$.

We replace the expectation $E[v(\psi(s,a,b,\xi))]$ with an empirical estimate. Given a sample of $n$ uniform random variables $\{\xi_i\}_{i=1}^{n}$, the empirical estimate of $E[v(\psi(s,a,b,\xi))]$ is $\frac{1}{n} \sum_{i=1}^{n} v(\psi(s,a,b,\xi_i))$. Our algorithm is summarized next.

Algorithm 7 Empirical value iteration for minimax
Input: $v^0 \in \mathbb{R}^{|S|}$, sample size $n \ge 1$. Set counter $k = 0$.
1. Sample $n$ uniformly distributed random variables $\{\xi_i\}_{i=1}^{n}$, and compute
\[ v^{k+1}(s) = \min_{a \in A(s)} \max_{b \in B(s)} \left\{ c(s,a,b) + \frac{\alpha}{n} \sum_{i=1}^{n} v^k(\psi(s,a,b,\xi_i)) \right\}, \quad \forall s \in S. \]
2. Increment $k := k + 1$ and return to step 1.

In each iteration, we take a new set of samples and use this empirical estimate to approximate $T$. Since $T$ is a contraction with a known convergence rate $\alpha$, we can apply the exact same development as for empirical value iteration.

3.5.3 The Newsvendor Problem

We now show via the newsvendor problem that the empirical dynamic programming method can sometimes work remarkably well even for continuous state and action spaces. This, of course, exploits the linear structure of the newsvendor problem.

Let $D$ be a continuous random variable representing the stationary demand distribution.
Let $\{D_k\}_{k \ge 0}$ be an independent and identically distributed collection of random variables with the same distribution as $D$, where $D_k$ is the demand in period $k$. The unit order cost is $c$, the unit holding cost is $h$, and the unit backorder cost is $b$. We let $x_k$ be the inventory level at the beginning of period $k$, and we let $q_k \ge 0$ be the order quantity before demand is realized in period $k$.

For technical convenience, we only allow stock levels in the compact set $\mathcal{X} = [x_{\min}, x_{\max}] \subset \mathbb{R}$. This assumption is not too restrictive, since a firm would not want a large number of backorders and any real warehouse has finite capacity. Notice that since we restrict to $\mathcal{X}$, we know that no order quantity will ever exceed $q_{\max} = x_{\max} - x_{\min}$. Define the continuous function $\psi : \mathbb{R} \to \mathcal{X}$ via
\[ \psi(x) = \begin{cases} x_{\max}, & \text{if } x > x_{\max}, \\ x_{\min}, & \text{if } x < x_{\min}, \\ x, & \text{otherwise}. \end{cases} \]
The function $\psi$ accounts for the state space truncation. The system dynamic is then
\[ x_{k+1} = \psi(x_k + q_k - D_k), \quad \forall k \ge 0. \]
We want to solve
\[ \inf_{\pi \in \Pi} E_\nu^\pi \left[ \sum_{k=0}^{\infty} \alpha^k \left( c q_k + \max\{h x_k, -b x_k\} \right) \right], \tag{3.12} \]
subject to the preceding system dynamic. We know that there is an optimal stationary policy for this problem which only depends on the current inventory level. The optimal cost-to-go function for this problem, $v^* : \mathbb{R} \to \mathbb{R}$, satisfies
\[ v^*(x) = \inf_{q \ge 0} \{ c q + \max\{h x, -b x\} + \alpha E[v^*(\psi(x + q - D))] \}, \quad \forall x \in \mathbb{R}. \]
We will compute $v^*$ by iterating an appropriate Bellman operator.

Classical value iteration for Problem (3.12) consists of iteration of an operator in $C(\mathcal{X})$, the space of continuous functions $f : \mathcal{X} \to \mathbb{R}$. We equip $C(\mathcal{X})$ with the norm
\[ \|f\|_{C(\mathcal{X})} = \sup_{x \in \mathcal{X}} |f(x)|. \]
Under this norm, $C(\mathcal{X})$ is a Banach space.

Now, the Bellman operator $T : C(\mathcal{X}) \to C(\mathcal{X})$ for the newsvendor problem is given by
\[ [Tv](x) = \inf_{q \ge 0} \{ c q + \max\{h x, -b x\} + \alpha E[v(\psi(x + q - D))] \}, \quad \forall x \in \mathcal{X}. \]
Value iteration for the newsvendor can then be written succinctly as $v^{k+1} = T v^k$ for all $k \ge 0$.
We confirm that $T$ is a contraction with respect to $\|\cdot\|_{C(\mathcal{X})}$ in the next lemma, and thus the Banach fixed point theorem applies.

Lemma 14. (i) $T$ is a contraction on $C(\mathcal{X})$ with constant $\alpha$.
(ii) Let $\{v^k\}_{k \ge 0}$ be the sequence produced by value iteration; then $\|v^k - v^*\|_{C(\mathcal{X})} \to 0$ as $k \to \infty$.

Proof. (i) Choose $u, v \in C(\mathcal{X})$, and use Fact 1 to compute
\[ \|Tu - Tv\|_{C(\mathcal{X})} = \sup_{x \in \mathcal{X}} |[Tu](x) - [Tv](x)| \]
\[ \le \sup_{x \in \mathcal{X},\, q \in [0, q_{\max}]} \alpha \left| E[u(\psi(x + q - D))] - E[v(\psi(x + q - D))] \right| \]
\[ \le \alpha \sup_{x \in \mathcal{X},\, q \in [0, q_{\max}]} E\left[ |u(\psi(x + q - D)) - v(\psi(x + q - D))| \right] \]
\[ \le \alpha \|u - v\|_{C(\mathcal{X})}. \]
(ii) Since $C(\mathcal{X})$ is a Banach space and $T$ is a contraction by part (i), the Banach fixed point theorem applies.

Choose the initial form for the optimal value function as
\[ v^0(x) = \max\{h x, -b x\}, \quad \forall x \in \mathcal{X}. \]
It is chosen to represent the terminal cost in state $x$ when there are no further ordering decisions. Then, value iteration yields
\[ v^{k+1}(x) = \inf_{q \ge 0} \left\{ c q + \max\{h x, -b x\} + \alpha E\left[ v^k(\psi(x + q - D)) \right] \right\}, \quad \forall x \in \mathcal{X}. \]
We note some key properties of these value functions.

Lemma 15. Let $\{v^k\}_{k \ge 0}$ be the sequence produced by value iteration; then $v^k$ is Lipschitz continuous with constant $\max\{|h|, |b|\} \sum_{i=0}^{k} \alpha^i$ for all $k \ge 0$.

Proof. First observe that $v^0$ is Lipschitz continuous with constant $\max\{|h|, |b|\}$. For $v^1$, we choose $x$ and $x'$ and compute
\[ |v^1(x) - v^1(x')| \le \sup_{q \ge 0} \left\{ \left| \max\{h x, -b x\} - \max\{h x', -b x'\} \right| + \alpha \left| E[v^0(\psi(x + q - D))] - E[v^0(\psi(x' + q - D))] \right| \right\} \]
\[ \le \max\{|h|, |b|\}\, |x - x'| + \alpha \max\{|h|, |b|\}\, E\left[ |\psi(x + q - D) - \psi(x' + q - D)| \right] \]
\[ \le \max\{|h|, |b|\}\, |x - x'| + \alpha \max\{|h|, |b|\}\, |x - x'|, \]
where we use the fact that the Lipschitz constant of $\psi$ is one. The inductive step is similar.

From Lemma 15, we also conclude that the Lipschitz constant of any iterate $v^k$ is bounded above by
\[ L^* := \max\{|h|, |b|\} \sum_{i=0}^{\infty} \alpha^i = \frac{\max\{|h|, |b|\}}{1 - \alpha}. \]
We acknowledge the dependence of Lemma 15 on the specific choice of the initial seed $v^0$.

We can do empirical value iteration with the same initial seed $\hat v_n^0 = v^0$ as above. Now, for $k \ge 0$,
\[ \hat v_n^{k+1}(x) = \inf_{q \ge 0} \left\{ c q + \max\{h x, -b x\} + \frac{\alpha}{n} \sum_{i=1}^{n} \hat v_n^k(\psi(x + q - D_i)) \right\}, \quad \forall x \in \mathbb{R}. \]
Note that $\{D_1, \ldots, D_n\}$ is an i.i.d. sample from the demand distribution. It is possible to perform these value function updates exactly for finite $k$ based on [56]. Also note that the initial seed is piecewise linear with finitely many breakpoints. Because the demand sample is finite in each iteration, each iteration takes a piecewise linear function as input and produces a piecewise linear function as output (both with finitely many breakpoints). Lemma 15 applies without modification to $\{\hat v_n^k\}_{k \ge 0}$: all of these functions are Lipschitz continuous with constants bounded above by $L^*$.

As earlier, we define the empirical Bellman operator $\widehat T_n : \Omega \to C$ as
\[ \left[ \widehat T_n(\omega) v \right](x) = \inf_{q \ge 0} \left\{ c q + \max\{h x, -b x\} + \frac{\alpha}{n} \sum_{i=1}^{n} v(\psi(x + q - D_i)) \right\}, \quad \forall x \in \mathbb{R}. \]
With the empirical Bellman operator, we write the iterates of EVI as $\hat v_n^{k} = \widehat T_n^{k} \hat v_n^0$.

We can again apply the stochastic dominance techniques we have developed to the convergence analysis of the stochastic process $\{\|\hat v_n^k - v^*\|_{C(\mathcal{X})}\}_{k \ge 0}$. Similarly to equation (3.7), we get an upper bound
\[ \|v\|_{C(\mathcal{X})} \le \kappa^* := \frac{c\, q_{\max} + \max\{h x_{\max}, b x_{\min}\}}{1 - \alpha} \]
for the norm of the value function of any policy for Problem (3.12). By the triangle inequality,
\[ \|\hat v_n^k - v^*\|_{C(\mathcal{X})} \le \|\hat v_n^k\|_{C(\mathcal{X})} + \|v^*\|_{C(\mathcal{X})} \le 2\kappa^*. \]
We can thus restrict $\|\hat v_n^k - v^*\|_{C(\mathcal{X})}$ to the state space $[0, 2\kappa^*]$. For a fixed granularity $\epsilon_g > 0$, we can define $\{X_n^k\}_{k \ge 0}$ and $\{Y_n^k\}_{k \ge 0}$ as in Section 3.3. Our upper bound on probability follows.

Proposition 9. For any $n \ge 1$ and $\epsilon > 0$,
\[ P\left\{ \|\widehat T_n v - Tv\| \ge \epsilon \right\} \le P\left\{ \alpha \sup_{x \in \mathcal{X},\, q \in [0, q_{\max}]} \left| E[v(\psi(x + q - D))] - \frac{1}{n} \sum_{i=1}^{n} v(\psi(x + q - D_i)) \right| \ge \epsilon \right\} \]
\[ \le 2 \left\lceil \frac{9 (L^*)^2 q_{\max}^2}{\epsilon^2} \right\rceil e^{-2(\epsilon/3)^2 n / (2\|v\|_{C(\mathcal{X})})^2}, \]
for all $v \in C(\mathcal{X})$ with Lipschitz constant bounded by $L^*$.

Proof. By Fact 1, we know that
\[ \|\widehat T_n v - Tv\|_{C(\mathcal{X})} \le \alpha \sup_{x \in \mathcal{X},\, q \in [0, q_{\max}]} \left| E[v(\psi(x + q - D))] - \frac{1}{n} \sum_{i=1}^{n} v(\psi(x + q - D_i)) \right|. \]
Let $\{(x_j, q_j)\}_{j=1}^{J}$ be an $\epsilon/(3L^*)$ net for $\mathcal{X} \times [0, q_{\max}]$. We can choose $J$ to be the smallest integer greater than or equal to
\[ \frac{x_{\max} - x_{\min}}{\epsilon/(3L^*)} \times \frac{q_{\max}}{\epsilon/(3L^*)} = \frac{9 (L^*)^2 q_{\max}^2}{\epsilon^2}. \]
If we have
\[
\left| E[v(\psi(x_j+q_j-D))] - \frac{1}{n} \sum_{i=1}^{n} v(\psi(x_j+q_j-D_i)) \right| \le \epsilon/3
\]
for all $j = 1, \ldots, J$, then
\[
\left| E[v(\psi(x+q-D))] - \frac{1}{n} \sum_{i=1}^{n} v(\psi(x+q-D_i)) \right| \le \epsilon
\]
for all $(x, q) \in \mathcal{X} \times [0, q_{\max}]$ by Lipschitz continuity and construction of $\{(x_j, q_j)\}_{j=1}^{J}$. Then, by Hoeffding's inequality and the union bound, we get
\[
P\left\{ \alpha \sup_{j=1,\ldots,J} \left| E[v(\psi(x_j+q_j-D))] - \frac{1}{n} \sum_{i=1}^{n} v(\psi(x_j+q_j-D_i)) \right| \ge \epsilon/3 \right\} \le 2J e^{-2(\epsilon/3)^2 n / (2\|v\|_{C(\mathcal{X})})^2}.
\]
As before, we use the preceding complexity estimate to determine $p_n$ for the family $\{Y^k_n\}_{k \ge 0}$. The remainder of our stochastic dominance argument is exactly the same.

3.6 Numerical Experiments

We now provide a numerical comparison of EDP methods with other methods for approximate dynamic programming via simulation. Figure 3.1 shows the relative error ($\|\hat{v}^k_n - v^*\|$) of the Actor-Critic algorithm, the Q-Learning algorithm, Optimistic Policy Iteration (OPI), Empirical Value Iteration (EVI) and Empirical Policy Iteration (EPI). It also shows the relative error for exact Value Iteration (VI). The problem considered was a generic MDP with 10 states and 5 actions and an infinite-horizon discounted cost. From the figure, we see that EVI and EPI significantly outperform the Actor-Critic algorithm (which converges very slowly) and Q-Learning. Optimistic policy iteration performs better than EVI, since policy iteration-based algorithms are known to converge faster, but EPI outperforms OPI as well. The experiments were performed on a generic laptop with an Intel Core i7 processor and 4GB RAM, on a 64-bit Windows 7 operating system, in the Matlab R2009b environment.

These preliminary numerical results seem to suggest that EDP methods outperform other ADP methods numerically, and hold good promise. More definitive conclusions about their numerical performance require further work. We would also like to point out that EDP methods are easily parallelizable, and hence could potentially be useful for a wider variety of problem settings.
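To make the kind of experiment described above concrete, here is a minimal Python sketch of exact and empirical value iteration on a randomly generated finite MDP. It is an illustration under stated assumptions, not the Matlab code used for Figure 3.1: the helper names (`random_mdp`, `bellman_exact`, `bellman_empirical`, `iterate`), the Dirichlet-generated kernel, the uniform costs, and the sup-norm relative error are all hypothetical choices made for this sketch.

```python
import numpy as np

def random_mdp(S, A, rng):
    """Hypothetical test problem: random kernel P[s, a, s'] and costs c[s, a] in [0, 1)."""
    P = rng.dirichlet(np.ones(S), size=(S, A))  # each P[s, a, :] sums to one
    c = rng.random((S, A))
    return P, c

def bellman_exact(P, c, alpha, v):
    """Exact operator: [Tv](s) = min_a { c(s,a) + alpha * E[v(s') | s, a] }."""
    return (c + alpha * (P @ v)).min(axis=1)

def bellman_empirical(P, c, alpha, v, n, rng):
    """Empirical operator: the expectation is replaced by an average over n simulated next states."""
    S, A, _ = P.shape
    cdf = np.cumsum(P, axis=2)
    xi = rng.random((S, A, n))                      # fresh uniform noise for every (s, a)
    # Inverse-CDF simulation: count the cdf entries lying strictly below each noise draw.
    nxt = (xi[..., None] > cdf[:, :, None, :]).sum(axis=3)
    nxt = np.minimum(nxt, S - 1)                    # guard against round-off in the cdf
    return (c + alpha * v[nxt].mean(axis=2)).min(axis=1)

def iterate(op, v0, iters):
    """Apply the operator `op` to v0 a fixed number of times."""
    v = v0.copy()
    for _ in range(iters):
        v = op(v)
    return v

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    P, c = random_mdp(10, 5, rng)                   # 10 states, 5 actions
    alpha = 0.9                                     # assumed discount factor
    v_star = iterate(lambda v: bellman_exact(P, c, alpha, v), np.zeros(10), 500)
    v_hat = iterate(lambda v: bellman_empirical(P, c, alpha, v, 10, rng), np.zeros(10), 35)
    print("relative sup-norm error:", np.abs(v_hat - v_star).max() / np.abs(v_star).max())
```

Because the noise is redrawn at every iteration, the printed error fluctuates from run to run; increasing `n` shrinks those fluctuations at a proportional cost per iteration.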
3.7 Conclusions

In this chapter, we have introduced a new class of algorithms for approximate dynamic programming. The idea is actually not novel, and is quite simple and natural: just replace the expectation in the Bellman operator with an empirical estimate (or a sample average approximation, as it is often called). The difficulty, however, is that this makes the Bellman operator a random operator. This makes its convergence analysis very challenging, since (infinite horizon) dynamic programming theory is based on looking at the fixed points of the Bellman operator. However, the extant notions of 'probabilistic' fixed points for random operators are not relevant, since they are akin to classical fixed points of deterministic monotone operators when $\omega$ is fixed. We introduce two notions of probabilistic fixed points: strong and weak. Furthermore, we show that these asymptotically coincide with the classical fixed point of the related deterministic operator. This is reassuring, as it suggests that approximations to our probabilistic fixed points (obtained by finitely many iterations of the empirical Bellman operator) are going to be approximations to the classical fixed point of the Bellman operator as well.

[Figure 3.1: Numerical performance comparison of EDP algorithms. Number of samples taken: n = 10, q = 10 (refer to the EVI and EPI algorithms in Section 3.2 for the definition of n and q). Horizontal axis: iteration; vertical axis: relative error. Curves: Value Iteration (VI), Policy Iteration (PI), Empirical Value Iteration (Empirical VI), Empirical Policy Iteration (Empirical PI), Optimistic Policy Iteration (Optimistic PI), Q-learning, Actor-Critic.]

In developing this theory, we also developed a mathematical framework based on stochastic dominance for convergence analysis of random operators.
While our immediate goal was analysis of the iteration of the empirical Bellman operator in empirical dynamic programming, the framework is likely of broader use, possibly after further development.

We have then shown that many variations and extensions of classical dynamic programming work for empirical dynamic programming as well. In particular, empirical dynamic programming can be done asynchronously, just as classical DP can be. Moreover, a zero-sum stochastic game can be solved by a minimax empirical dynamic program. We also apply the EDP method to the dynamic newsvendor problem, which has continuous state and action spaces; this demonstrates the potential of EDP to solve problems over more general state and action spaces.

We have done some preliminary experimental performance analysis of EVI and EPI, and compared them to similar methods. Our numerical simulations suggest that EDP algorithms converge faster than stochastic approximation-based actor-critic, Q-learning and optimistic policy iteration algorithms. However, these results are only suggestive; we do not claim definitive performance improvement in practice over other algorithms. That would require an extensive and careful numerical investigation of all such algorithms.

We do note that EDP methods, unlike stochastic approximation methods, do not require any recurrence property to hold. In that sense, they are more universal. On the other hand, EDP algorithms would inherit some of the 'curse of dimensionality' problems associated with exact dynamic programming. Overcoming that challenge requires additional ideas, and is potentially a fruitful direction for future research. Some other directions of research are extending the EDP algorithms to the infinite-horizon average cost case, and to the partially-observed case. We will take up these issues in the future.

3.8 Proofs of Various Lemmas, Propositions and Theorems

3.8.1 Proof of Fact 1

Proof.
To verify part (i), note
\[
\begin{aligned}
\inf_{x \in X} f_1(x) &= \inf_{x \in X} \{ f_1(x) + f_2(x) - f_2(x) \} \\
&\le \inf_{x \in X} \{ f_2(x) + |f_1(x) - f_2(x)| \} \\
&\le \inf_{x \in X} \left\{ f_2(x) + \sup_{y \in X} |f_1(y) - f_2(y)| \right\} \\
&\le \inf_{x \in X} f_2(x) + \sup_{y \in X} |f_1(y) - f_2(y)|,
\end{aligned}
\]
giving
\[
\inf_{x \in X} f_1(x) - \inf_{x \in X} f_2(x) \le \sup_{x \in X} |f_1(x) - f_2(x)|.
\]
By the same reasoning,
\[
\inf_{x \in X} f_2(x) - \inf_{x \in X} f_1(x) \le \sup_{x \in X} |f_1(x) - f_2(x)|,
\]
and the preceding two inequalities yield the desired result. Part (ii) follows similarly.

3.8.2 Proof of Theorem 11

We first prove the following lemmas.

Lemma 16. $[Y^{k+1}_n \mid Y^k_n = \theta]$ is stochastically increasing in $\theta$ for all $k \ge 0$, i.e., $[Y^{k+1}_n \mid Y^k_n = \theta] \le_{st} [Y^{k+1}_n \mid Y^k_n = \theta']$ for all $\theta \le \theta'$.

Proof. We see that $\Pr\{Y^{k+1}_n \ge \eta \mid Y^k_n = \theta\}$ is increasing in $\theta$ by construction of $\{Y^k_n\}_{k \ge 0}$. If $\theta > \eta$, then $\Pr\{Y^{k+1}_n \ge \eta \mid Y^k_n = \theta\} = 1$ since $Y^{k+1}_n \ge \theta - 1$ almost surely; if $\theta \le \eta$, then $\Pr\{Y^{k+1}_n \ge \eta \mid Y^k_n = \theta\} = 1 - p_n$ since the only way $Y^{k+1}_n$ will remain larger than $\eta$ is if $Y^{k+1}_n = N^*$.

Lemma 17. $[X^{k+1}_n \mid X^k_n = \theta, \mathcal{F}^k] \le_{st} [Y^{k+1}_n \mid Y^k_n = \theta]$ for all $\theta$ and all $\mathcal{F}^k$, for all $k \ge 0$.

Proof. Follows from construction of $\{Y^k_n\}_{k \ge 0}$. For any history $\mathcal{F}^k$,
\[
P\{X^{k+1}_n \ge \theta - 1 \mid X^k_n = \theta, \mathcal{F}^k\} \le Q\{Y^{k+1}_n \ge \theta - 1 \mid Y^k_n = \theta\} = 1.
\]
Now,
\[
P\{X^{k+1}_n = N^* \mid X^k_n = \theta, \mathcal{F}^k\} \le P\{X^{k+1}_n > \theta - 1 \mid X^k_n = \theta, \mathcal{F}^k\} = 1 - P(X^{k+1}_n \le \theta - 1 \mid X^k_n = \theta) \le 1 - p_n,
\]
where the last inequality follows because $p_n$ is the worst case probability for a one-step improvement in the Markov chain $\{X^k_n\}_{k \ge 0}$.

Proof of Theorem 11

Proof. Trivially, $X^0_n \le_{st} Y^0_n$ since $X^0_n = Y^0_n$. Next, we see that $X^1_n \le_{st} Y^1_n$ by the previous lemma. We prove the general case by induction. Suppose $X^k_n \le_{st} Y^k_n$ for $k \ge 1$, and for this proof define the random variable
\[
Y(\theta) = \begin{cases} \max\{\theta - 1, \eta^*\}, & \text{w.p. } p_n, \\ N^*, & \text{w.p. } 1 - p_n, \end{cases}
\]
as a function of $\theta$. We see that $Y^{k+1}_n$ has the same distribution as $[Y(\Theta) \mid \Theta = Y^k_n]$ by definition. Since the $Y(\theta)$ are stochastically increasing, we see that
\[
[Y(\Theta) \mid \Theta = Y^k_n] \ge_{st} [Y(\Theta) \mid \Theta = X^k_n]
\]
by [44, Theorem 1.A.6] and our induction hypothesis.
Now, h Y(⇥) |⇥= X k n i st h X k+1 n |X k n , F k i by [44, Theorem 1.A.3(d)] for all histories F k . It follows that Y k+1 n st X k+1 n by transitivity. 3.8.3 Proof of Lemma 8 Proof. The stationary probabilities {µ n (i)} N ⇤ i=⌘ ⇤ satisfy the equations: µ n (⌘ ⇤ )=p n µ n (⌘ ⇤ )+p n µ n (⌘ ⇤ +1), µ n (i)=p n µ n (i+1), 8 i = ⌘ ⇤ +1,...,N ⇤ 1, µ n (N ⇤ )=(1 p n ) N ⇤ X i=⌘ ⇤ µ n (i), N ⇤ X i=⌘ ⇤ µ n (i)=1. We then see that µ n (i)=p (N ⇤ i) n µ n (N ⇤ ),8 i = ⌘ ⇤ +1,...,N ⇤ 1, and µ n (⌘ ⇤ )= p n 1 p n µ n (⌘ ⇤ +1) = p N ⇤ ⌘ ⇤ n 1 p n µ(N ⇤ ). We can solve for µ n (N ⇤ )using P N ⇤ i=⌘ ⇤ µ n (i) = 1, 1= N ⇤ X i=⌘ ⇤ µ n (i) = p N ⇤ ⌘ ⇤ n 1 p n µ n (N ⇤ )+ N ⇤ X i=⌘ ⇤ +1 p N ⇤ i n µ n (N ⇤ ) = 2 4 p N ⇤ ⌘ ⇤ n 1 p n + N ⇤ X i=⌘ ⇤ +1 p N ⇤ i n 3 5 µ n (N ⇤ ) = " p N ⇤ ⌘ ⇤ n 1 p n + p n p N ⇤ ⌘ ⇤ n 1 p n # µ n (N ⇤ ), = p n 1 p n µ n (N ⇤ ), 73 based on the fact that N ⇤ X i=⌘ ⇤ +1 p (N ⇤ i) n = N ⇤ ⌘ ⇤ 1 X i=0 p i n = 1 p (N ⇤ ⌘ ⇤ ) n 1 p n . We conclude µ(N ⇤ )= 1 p n p n , and thus µ n (i)=(1 p n )p (N ⇤ i 1) n ,8 i = ⌘ ⇤ +1,...,N ⇤ 1, and µ n (⌘ ⇤ )=p N ⇤ ⌘ ⇤ 1 n . 3.8.4 Proof of Proposition 2 Proof. (i) First observe that lim n!1 b T n (!)v ⇤ =Tv ⇤ , by Assumption 1. It follows that b T n (!)v ⇤ converges to v ⇤ =Tv ⇤ as n!1 , P almost surely. Almost sure convergence implies convergence in probability. (ii) Let ˆ v be a strong probabilistic fixed point. Then, P(kTˆ v ˆ vk ✏) P(kTˆ v b T n ˆ vk ✏/2)+P(k b T n ˆ v ˆ vk ✏/2) First term on the RHS can be made arbitrarily small by Assumption 1. Second term on RHS can also be made arbitrarily small by the definition of strong probabilistic fixed point. So, for suciently large n, we get P(kTˆ v ˆ vk ✏) < 1. Since the event in the LHS is deterministic, we getkTˆ v ˆ vk = 0. Hence, ˆ v =v ⇤ . 3.8.5 Proof of Proposition 3 Proof. (i) This statement is proved in Theorem 12. (ii) Fix the initial seed v2R |S| . 
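For a contradiction, suppose $\hat{v}$ is not a fixed point of $T$, so that $\|v^* - \hat{v}\| = \epsilon' > 0$ (we use here the fact that $v^*$ is unique). Now
\[
\|\hat{v} - v^*\| = \epsilon' \le \|\hat{T}^k_n v - \hat{v}\| + \|\hat{T}^k_n v - v^*\|
\]
for any $n$ and $k$ by the triangle inequality. For clarity, this inequality holds in the almost sure sense:
\[
P\left( \epsilon' \le \|\hat{T}^k_n v - \hat{v}\| + \|\hat{T}^k_n v - v^*\| \right) = 1
\]
for all $n$ and $k$. We already know that
\[
\lim_{n \to \infty} \limsup_{k \to \infty} P\left( \|\hat{T}^k_n v - v^*\| > \epsilon'/3 \right) = 0
\]
by Theorem 12, and
\[
\lim_{n \to \infty} \limsup_{k \to \infty} P\left( \|\hat{T}^k_n v - \hat{v}\| > \epsilon'/3 \right) = 0
\]
by assumption. Now
\[
P\left( \max\left\{ \|\hat{T}^k_n v - \hat{v}\|, \|\hat{T}^k_n v - v^*\| \right\} > \epsilon'/3 \right) \le P\left( \|\hat{T}^k_n v - v^*\| > \epsilon'/3 \right) + P\left( \|\hat{T}^k_n v - \hat{v}\| > \epsilon'/3 \right),
\]
so
\[
\lim_{n \to \infty} \limsup_{k \to \infty} P\left( \max\left\{ \|\hat{T}^k_n v - \hat{v}\|, \|\hat{T}^k_n v - v^*\| \right\} > \epsilon'/3 \right) = 0.
\]
However, $\epsilon' \le \|\hat{T}^k_n v - \hat{v}\| + \|\hat{T}^k_n v - v^*\|$ almost surely, so at least one of $\|\hat{T}^k_n v - \hat{v}\|$ or $\|\hat{T}^k_n v - v^*\|$ must be greater than $\epsilon'/3$ for all large $k$. This contradicts the preceding limit.

3.8.6 Proof of Proposition 4

Proof. (i) Assumption 1: Certainly,
\[
\|\hat{T}_n(\omega)(v) - Tv\| \le \alpha \max_{(s,a) \in K} \left| \frac{1}{n} \sum_{i=1}^{n} v(\psi(s,a,\xi_i)) - E[v(\psi(s,a,\xi))] \right|
\]
using Fact 1. We know that for any fixed $(s,a) \in K$,
\[
\left| \frac{1}{n} \sum_{i=1}^{n} v(\psi(s,a,\xi_i)) - E[v(\psi(s,a,\xi))] \right| \to 0 \quad \text{as } n \to \infty
\]
by the Strong Law of Large Numbers (the random variable $v(\psi(s,a,\xi))$ has finite expectation because it is essentially bounded). Recall that $K$ is finite to see that the right hand side of the above inequality converges to zero as $n \to \infty$.

(ii) Assumption 2: We define the constant
\[
\kappa^* := \frac{\max_{(s,a) \in K} |c(s,a)|}{1 - \alpha}.
\]
Then it can be easily verified that the value of any policy $\pi$ satisfies $v^\pi \le \kappa^*$. Then $v^* \le \kappa^*$, and without loss of generality we can restrict $\hat{v}^k_n$ to the set $B_{2\kappa^*}(0)$.

(iii) Assumption 3: This is the well known contraction property of the Bellman operator.

(iv) Assumption 4: Using Fact 1, for any fixed $s \in S$,
\[
|\hat{T}_n v(s) - Tv(s)| \le \max_{a \in A(s)} \left| \frac{\alpha}{n} \sum_{i=1}^{n} v(\psi(s,a,\xi_i)) - \alpha E[v(\psi(s,a,\xi))] \right|,
\]
and hence
\[
P\left\{ \|\hat{T}_n v - Tv\| \ge \epsilon \right\} \le P\left\{ \max_{(s,a) \in K} \left| \frac{\alpha}{n} \sum_{i=1}^{n} v(\psi(s,a,\xi_i)) - \alpha E[v(\psi(s,a,\xi))] \right| \ge \epsilon \right\}.
\]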
For any fixed $(s,a) \in K$,
\[
P\left\{ \left| \frac{\alpha}{n} \sum_{i=1}^{n} v(\psi(s,a,\xi_i)) - \alpha E[v(\psi(s,a,\xi))] \right| \ge \epsilon \right\} \le 2e^{-2(\epsilon/\alpha)^2 n/(v_{\max} - v_{\min})^2} \le 2e^{-2(\epsilon/\alpha)^2 n/(2\|v\|)^2} \le 2e^{-2(\epsilon/\alpha)^2 n/(2\kappa^*)^2}
\]
by Hoeffding's inequality. Then, using the union bound, we have
\[
P\left\{ \|\hat{T}_n v - Tv\| \ge \epsilon \right\} \le 2|K| e^{-2(\epsilon/\alpha)^2 n/(2\kappa^*)^2}.
\]
By taking complements of the above event we get the desired result.

3.8.7 Lemma 18

Lemma 18. For any fixed $n \ge 1$, the eigenvalues of the transition matrix $Q$ of the Markov chain $Y^k_n$ are 0 (with algebraic multiplicity $N^* - \eta^* - 1$) and 1.

Proof. In general, the transition matrix $Q_n \in \mathbb{R}^{(N^* - \eta^* + 1) \times (N^* - \eta^* + 1)}$ of $\{Y^k_n\}_{k \ge 0}$ looks like
\[
Q_n = \begin{bmatrix}
p_n & 0 & \cdots & \cdots & 0 & (1 - p_n) \\
p_n & 0 & \cdots & \cdots & 0 & (1 - p_n) \\
0 & p_n & 0 & \cdots & 0 & (1 - p_n) \\
\vdots & & \ddots & & \vdots & \vdots \\
0 & 0 & \cdots & p_n & 0 & (1 - p_n) \\
0 & 0 & \cdots & \cdots & p_n & (1 - p_n)
\end{bmatrix}.
\]
To compute the eigenvalues of $Q_n$, we want to solve $Q_n x = \lambda x$ for some $x \ne 0$ and $\lambda \in \mathbb{R}$. For $x = (x_1, x_2, \ldots, x_{N^* - \eta^* + 1}) \in \mathbb{R}^{N^* - \eta^* + 1}$,
\[
Q_n x = \begin{pmatrix}
p_n x_1 + (1 - p_n) x_{N^* - \eta^* + 1} \\
p_n x_1 + (1 - p_n) x_{N^* - \eta^* + 1} \\
p_n x_2 + (1 - p_n) x_{N^* - \eta^* + 1} \\
\vdots \\
p_n x_{N^* - \eta^* - 1} + (1 - p_n) x_{N^* - \eta^* + 1} \\
p_n x_{N^* - \eta^*} + (1 - p_n) x_{N^* - \eta^* + 1}
\end{pmatrix}.
\]
Now, suppose $\lambda \ne 0$ and $Qx = \lambda x$ for some $x \ne 0$. By the explicit computation of $Qx$ above,
\[
[Qx]_1 = p x_1 + (1 - p) x_{N^* - \eta^* + 1} = \lambda x_1
\]
and
\[
[Qx]_2 = p x_1 + (1 - p) x_{N^* - \eta^* + 1} = \lambda x_2,
\]
so it must be that $x_1 = x_2$. However, then
\[
[Qx]_2 = p x_1 + (1 - p) x_{N^* - \eta^* + 1} = p x_2 + (1 - p) x_{N^* - \eta^* + 1} = [Qx]_3,
\]
and thus $x_2 = x_3$. Continuing this reasoning inductively shows that $x_1 = x_2 = \cdots = x_{N^* - \eta^* + 1}$ for any eigenvector $x$ of $Q$. Thus, it must be true that $\lambda = 1$.

3.8.8 Proof of Lemma 11

Proof. (i) Suppose $v \le v'$. It is automatic that $[T_x v](s) = v(s) \le v'(s) = [T_x v'](s)$ for all $s \ne x$. For state $s = x$,
\[
c(s,a) + \alpha E[v(s') \mid s,a] \le c(s,a) + \alpha E[v'(s') \mid s,a]
\]
for all $(s,a) \in K$, so
\[
\min_{a \in A(s)} \left\{ c(s,a) + \alpha E[v(s') \mid s,a] \right\} \le \min_{a \in A(s)} \left\{ c(s,a) + \alpha E[v'(s') \mid s,a] \right\},
\]
and thus $[T_x v](s) \le [T_x v'](s)$.
(ii) We see that
\[
\min_{a \in A(s)} \left\{ c(s,a) + \alpha E[v'(s') + \eta \mid s,a] \right\} = \min_{a \in A(s)} \left\{ c(s,a) + \alpha E[v'(s') \mid s,a] \right\} + \alpha \eta
\]
for state $x$, and all other states are not changed.

3.8.9 Proof of Proposition 8

Proof. Starting with $v^0$,
\[
T_{x_0} v^0 - \eta 1 \le T_{x_0} v^0 + \varepsilon_0 \le T_{x_0} v^0 + \eta 1,
\]
which gives
\[
T_{x_0} v^0 - \eta 1 \le \tilde{v}^0 \le T_{x_0} v^0 + \eta 1,
\]
and
\[
v^1 - \eta 1 \le \tilde{v}^1 \le v^1 + \eta 1.
\]
By monotonicity of $T_{x_1}$,
\[
T_{x_1}[v^1 - \eta 1] \le T_{x_1} \tilde{v}^1 \le T_{x_1}[v^1 + \eta 1],
\]
and by our assumptions on the noise,
\[
T_{x_1}[v^1 - \eta 1] - \eta 1 \le \tilde{v}^2 \le T_{x_1}[v^1 + \eta 1] + \eta 1.
\]
Now $T_{x_1}[v - \eta 1] = T_{x_1} v - \alpha \eta e_x$, thus
\[
v^2 - \eta 1 - \alpha \eta 1 \le \tilde{v}^2 \le v^2 + \eta 1 + \alpha \eta 1.
\]
Similarly,
\[
v^3 - \alpha(\eta 1 + \alpha \eta 1) - \eta 1 \le \tilde{v}^3 \le v^3 + \alpha(\eta 1 + \alpha \eta 1) + \eta 1,
\]
and the general case follows.

3.8.10 Proof of Lemma 13

Proof. Using Fact 1 twice, compute
\[
\begin{aligned}
|[Tv](s) - [Tv'](s)| &= \left| \min_{a \in A(s)} \max_{b \in B(s)} \{ r(s,a,b) + \alpha E[v(\tilde{s}) \mid s,a,b] \} - \min_{a \in A(s)} \max_{b \in B(s)} \{ r(s,a,b) + \alpha E[v'(\tilde{s}) \mid s,a,b] \} \right| \\
&\le \max_{a \in A} \left| \max_{b \in B(s)} \{ r(s,a,b) + \alpha E[v(\tilde{s}) \mid s,a,b] \} - \max_{b \in B(s)} \{ r(s,a,b) + \alpha E[v'(\tilde{s}) \mid s,a,b] \} \right| \\
&\le \alpha \max_{a \in A} \max_{b \in B(s)} \left| E[v(\tilde{s}) - v'(\tilde{s}) \mid s,a,b] \right| \\
&\le \alpha \max_{a \in A} \max_{b \in B} E\left[ |v(\tilde{s}) - v'(\tilde{s})| \mid s,a,b \right] \\
&\le \alpha \|v - v'\|.
\end{aligned}
\]

4 Empirical Q-Value Iteration

In this chapter we propose a new algorithm for learning the optimal Q function of a Markov Decision Process (MDP) when the transition kernels are unknown. Unlike the classical learning algorithms for MDPs, such as Q-learning and 'actor-critic' algorithms, our algorithm doesn't depend on a stochastic approximation method. We show that our algorithm, called Empirical Q-Value Iteration (EQVI), converges almost surely to the optimal Q function. To the best of our knowledge, this is the first algorithm for learning in MDPs that guarantees almost sure convergence without using a stochastic approximation method. Preliminary experimental results suggest a faster rate of convergence for our algorithm than for stochastic approximation based algorithms.

4.1 Preliminaries and Main Result

4.1.1 MDPs

We will consider an MDP on a finite state space $S$ with a finite action space $A$.
Let $\mathcal{P}(A)$ denote the set of all probability vectors on $A$. Also given is a transition kernel $p : S \times A \times S \to [0,1]$ satisfying $\sum_{s' \in S} p(s'|s,a) = 1$. Let $c : S \times A \to \mathbb{R}_+$ denote the cost function, which depends on the state-action pair. An MDP is a controlled Markov chain $\{X_t\}$ on the set $S$, controlled by an $A$-valued control process $\{Z_t\}$ such that $P(X_{t+1} = s \mid X_k, Z_k, k \le t) = p(s \mid X_t, Z_t)$. Define $\Pi$ to be the class of stationary randomized policies: mappings $\pi : S \to \mathcal{P}(A)$ which depend on history only through the current state. Our objective is to minimize over all admissible $\{Z_t\}$ the infinite horizon discounted cost $E[\sum_{t=0}^{\infty} \alpha^t c(X_t, Z_t)]$, where $\alpha \in (0,1)$ is the discount factor. It is well known that $\Pi$ contains an optimal policy which minimizes the infinite horizon discounted cost [106]. Also, let $\Sigma$ denote the set of non-stationary policies $\{\sigma_t\}$ with $\sigma_t : S \to \mathcal{P}(A)$. For any $\pi \in \Pi$, we define the transition probability matrix $P^\pi$ as
\[
P^\pi(s,s') := \sum_{a \in A} p(s'|s,a) \pi(s,a). \tag{4.1}
\]
We make the following assumption.

Assumption 5. For any $\pi \in \Pi$, the Markov chain defined by the transition probability matrix $P^\pi$ is irreducible and aperiodic.

Remark 6. By Assumption 5, for any $\pi \in \Pi$, there exists a positive integer $r_\pi$ such that $(P^\pi)^{r_\pi}(s,s') > 0$, $\forall s,s' \in S$, where $(P^\pi)^{r_\pi}(s,s')$ denotes the $(s,s')$th element of the matrix $(P^\pi)^{r_\pi}$ [104, Proposition 1.7, Page 8].

Define the optimal value function $V^* : S \to \mathbb{R}_+$ as
\[
V^*(s) = \inf_{\pi \in \Pi} E\left[ \sum_{t=0}^{\infty} \alpha^t c(X_t, \pi(X_t)) \,\Big|\, X_0 = s \right]. \tag{4.2}
\]
Also define the Bellman operator $G : \mathbb{R}^{|S|}_+ \to \mathbb{R}^{|S|}_+$ as
\[
G(V)(s) := \min_{a \in A} \left[ c(s,a) + \alpha \sum_{s'} p(s'|s,a) V(s') \right]. \tag{4.3}
\]
The Bellman operator is a contraction mapping, i.e., $\|G(V) - G(V')\|_\infty \le \alpha \|V - V'\|_\infty$, and the optimal value function $V^*$ is the unique fixed point of $G(\cdot)$. Given the optimal value function, an optimal policy $\pi^*$ can be calculated as
\[
\pi^*(s) \in \arg\min_{a \in A} \left[ c(s,a) + \alpha \sum_{s'} p(s'|s,a) V^*(s') \right].
\]
(4.4)

4.1.2 Value Iteration, Q-Value Iteration

A standard scheme for finding the optimal value function (and hence an optimal policy) is value iteration. One starts with an arbitrary function $V_0$. At the $t$th iteration, given the current guess $V_t$, we calculate $V_{t+1} = G(V_t)$. Since $G(\cdot)$ is a contraction mapping, by the Banach fixed point theorem, $V_t \to V^*$.

Another way to find the optimal value function is via Q-value iteration. Though this requires more computation than value iteration, Q-value iteration is extremely useful in developing learning algorithms for MDPs. Define the Q-value operator $T : \mathbb{R}^d_+ \to \mathbb{R}^d_+$ as
\[
T(Q)(s,a) := c(s,a) + \alpha \sum_{s' \in S} p(s'|s,a) \min_b Q(s',b), \tag{4.5}
\]
where $d = |S||A|$. Similar to the Bellman operator $G$, the Q-value operator $T$ is also a contraction mapping, i.e., $\|T(Q) - T(Q')\|_\infty \le \alpha \|Q - Q'\|_\infty$. Let $Q^*$ be the unique fixed point of $T(\cdot)$, i.e.,
\[
Q^*(s,a) = c(s,a) + \alpha \sum_{s' \in S} p(s'|s,a) \min_b Q^*(s',b).
\]
This $Q^*$ is called the optimal Q-value. By the uniqueness of $V^*$, it is clear that $V^*(s) = \min_{a \in A} Q^*(s,a)$. Thus, given $Q^*$, one can compute $V^*$ and hence an optimal policy $\pi^*$.

The standard method to compute $Q^*$ is Q-value iteration. We start with an arbitrary $Q_0$ and then update $Q_{t+1} = T(Q_t)$. Due to the contraction property of $T$, $Q_t \to Q^*$.

4.1.3 Empirical Q-Value Iteration

The Bellman operator $G$ and the Q-value operator $T$ require knowledge of the exact transition kernel $p(\cdot|\cdot,\cdot)$. In real life applications, these transition probabilities may not be readily available, but it may be possible to simulate a transition according to any of these probabilities. Without loss of generality, we assume that the MDP is driven by uniform random noise according to the simulation function $\psi : S \times A \times [0,1] \to S$ such that
\[
\Pr(\psi(s,a,\xi) = s') = p(s'|s,a), \tag{4.6}
\]
where $\xi$ is a random variable distributed uniformly in $[0,1]$. Using this convention, the Q-value operator can be written as
\[
T(Q)(s,a) := c(s,a) + \alpha E\left[ \min_b Q(\psi(s,a,\xi), b) \right].
\]
(4.7)

In the empirical Q-value iteration (EQVI) algorithm, we replace the expectation in the above equation by an empirical estimate. Given a sample of $n$ i.i.d. random variables distributed uniformly in $[0,1]$, denoted $\{\xi_i\}_{i=1}^{n}$, the empirical estimate of $E[\min_b Q(\psi(s,a,\xi), b)]$ is $\frac{1}{n} \sum_{i=1}^{n} \min_b Q(\psi(s,a,\xi_i), b)$. We summarize our empirical Q-value iteration algorithm below.

Algorithm 8: Empirical Q-Value Iteration (EQVI) Algorithm

Input: $\hat{Q}_0 \in \mathbb{R}^d_+$, sample size $n \ge 1$, maximum iterations $t_{\max}$. Set counter $t = 0$.

1. For each $(s,a) \in S \times A$, sample $n$ uniformly distributed random variables $\{\xi^t_i(s,a)\}_{i=1}^{n}$, and compute
\[
\hat{Q}_{t+1}(s,a) = c(s,a) + \alpha \frac{1}{n} \sum_{i=1}^{n} \min_b \hat{Q}_t\left( \psi(s,a,\xi^t_i(s,a)), b \right).
\]
2. Increment $t = t + 1$. If $t > t_{\max}$, STOP. Else, return to step 1.

Main Result:

Theorem 15. The empirical Q-value iteration converges to the optimal Q function, i.e., $\hat{Q}_t \to Q^*$ a.s.

The proof is given in Section 4.2.

4.1.4 Comparison with Classical Q-Learning

The classical (synchronous) Q-learning algorithm for discounted MDPs works as follows (see [34, Section 5.6]). For every state-action pair $(s,a) \in S \times A$, we maintain a Q function and use the update rule
\[
Q_{t+1}(s,a) = Q_t(s,a) + \gamma_t \left( c(s,a) + \alpha \min_{b \in A} Q_t\left( \psi(s,a,\xi_t(s,a)), b \right) - Q_t(s,a) \right), \tag{4.8}
\]
where $\xi_t(s,a)$ is a random noise sampled uniformly from $[0,1]$ and $\{\gamma_t, t \ge 0\}$ is the standard stochastic approximation step sequence such that $\sum_t \gamma_t = \infty$ and $\sum_t \gamma_t^2 < \infty$. It can be shown that $Q_t \to Q^*$ almost surely [34]. The rate of convergence depends on the sequence $\{\gamma_t, t \ge 0\}$ [107]. In general, the convergence is very slow.

The empirical Q-value iteration algorithm does not use stochastic approximation. The rate of convergence will depend on the number of noise samples $n$.

4.2 Analysis

Let $(\Omega_1, \mathcal{F}_1, P_1)$ be the probability space of one sided infinite sequences $\omega$ such that $\omega = \{\omega_t : t \in \mathbb{Z}^*\}$, where $\mathbb{Z}^*$ is the set of non-negative integers. Each element $\omega_t$ of the sequence is a vector $\omega_t = (\xi^t_i(s,a), 1 \le i \le n, s \in S, a \in A)$, where $\xi^t_i(s,a)$ is a random noise distributed uniformly in $[0,1]$. We assume that the $\xi^t_i(s,a)$ are i.i.d. $\forall i$, $\forall (s,a) \in S \times A$ and $\forall t \in \mathbb{Z}^*$. $E_1$ denotes the expectation with respect to the measure $P_1$. For each $t \in \mathbb{Z}^*$, $\theta_t$ denotes the left shift operator, i.e.,
\[
\theta_t \omega := \{\omega_{\tau + t} : \tau \ge 0\}. \tag{4.9}
\]
Also, let $\Gamma$ be the projection operator such that $\Gamma(\theta_t \omega) = \omega_t$, $\forall t \in \mathbb{Z}^*$, $\forall \omega \in \Omega_1$. Recall that $\psi$ is the simulation function defined in equation (4.6), such that
\[
P_1\left( \psi(s,a,\xi^t_i(s,a)) = s' \right) = p(s'|s,a), \quad \forall i, t. \tag{4.10}
\]
Using $\psi$, for each $\omega \in \Omega_1$ we define a sequence of empirical transition kernels $\hat{p}(\omega) = (\hat{p}_t(\omega_t))_{t \ge 0}$ as
\[
\hat{p}_t(s'|s,a) := \frac{1}{n} \sum_{i=1}^{n} I\{\psi(s,a,\xi^t_i(s,a)) = s'\}. \tag{4.11}
\]
We dropped $\omega_t$ from the above definition for ease of notation. For any $\pi \in \Pi$, we also define the transition probability matrix $\hat{P}^\pi_t$ as
\[
\hat{P}^\pi_t(s,s') := \sum_{a \in A} \hat{p}_t(s'|s,a) \pi(s,a). \tag{4.12}
\]
Note that the rows of $\hat{P}^\pi_t$ are independent due to the independence assumption on the elements of the vector $\omega_t$. Also, the $\hat{P}^\pi_t$ are independent $\forall t$.

We define the empirical Q-value operator $\hat{T} : \Omega_1 \times \mathbb{R}^d_+ \to \mathbb{R}^d_+$ as
\[
\begin{aligned}
\hat{T}(\theta_t \omega, Q)(s,a) := \breve{T}_n(\Gamma(\theta_t \omega), Q)(s,a) &:= c(s,a) + \alpha \frac{1}{n} \sum_{i=1}^{n} \min_b Q\left( \psi(s,a,\xi^t_i(s,a)), b \right) \\
&= c(s,a) + \alpha \sum_{s'} \hat{p}_t(s'|s,a) \min_b Q(s',b).
\end{aligned} \tag{4.13}
\]
Then, the empirical Q-value iteration given in Algorithm 8 can be succinctly represented as
\[
\hat{Q}_{t+1}(\omega) = \hat{T}(\theta_t \omega, \hat{Q}_t(\omega)). \tag{4.14}
\]
We drop $\omega$ from the notation of $\hat{Q}_t$ whenever it is not necessary. Note that from equations (4.10) and (4.13), for any fixed $Q$,
\[
E_1\left[ \hat{T}(\theta_t \omega, Q) \right] = T(Q), \quad \forall t \in \mathbb{Z}^*, \tag{4.15}
\]
where $T$ is the Q-value operator defined in equation (4.5).

We define another probability space $(\Omega_2, \mathcal{F}_2, P_2)$ of one sided infinite sequences $\nu$ such that $\nu = \{\nu_t : t \in \mathbb{Z}^*\}$. Each element $\nu_t$ of the sequence $\nu$ is a $|S||A|$ dimensional vector, $\nu_t = (\nu_t(s,a), s \in S, a \in A)$, where $\nu_t(s,a)$ is a random noise distributed uniformly in $[0,1]$. We assume that the $\nu_t(s,a)$ are i.i.d. $\forall (s,a) \in S \times A$ and $\forall t \in \mathbb{Z}^*$.
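As a concrete illustration of the empirical kernel (4.11) and the update (4.13)-(4.14), the following Python sketch runs empirical Q-value iteration on a finite MDP stored as an array. It is a sketch under assumptions rather than the implementation behind the thesis experiments: the function names are hypothetical, the true kernel `P` is used only as a simulator, the counts of $n$ simulated next states are drawn with a single multinomial call (which has the same distribution as drawing $n$ i.i.d. noise variables and counting outcomes), and the loop performs exactly `t_max` updates, a simplification of the stopping rule in Algorithm 8.

```python
import numpy as np

def empirical_kernel(P, n, rng):
    """Draw p_hat(.|s,a) as in (4.11): empirical frequencies of n simulated next states."""
    S, A, _ = P.shape
    p_hat = np.empty((S, A, S))
    for s in range(S):
        for a in range(A):
            # Counts of n i.i.d. draws from p(.|s,a), normalized to frequencies.
            p_hat[s, a] = rng.multinomial(n, P[s, a]) / n
    return p_hat

def q_operator(P, c, alpha, Q):
    """Q-value operator (4.5): c(s,a) + alpha * sum_s' P[s,a,s'] * min_b Q(s',b)."""
    return c + alpha * (P @ Q.min(axis=1))

def eqvi(P, c, alpha, n, t_max, rng, Q0=None):
    """Empirical Q-value iteration: apply the operator with a fresh empirical kernel each step."""
    Q = np.zeros_like(c, dtype=float) if Q0 is None else np.asarray(Q0, dtype=float)
    for _ in range(t_max):
        p_hat = empirical_kernel(P, n, rng)   # new noise at every iteration
        Q = q_operator(p_hat, c, alpha, Q)    # second line of (4.13)
    return Q

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    S, A = 10, 5
    P = rng.dirichlet(np.ones(S), size=(S, A))  # hypothetical random kernel
    c = rng.random((S, A))                      # hypothetical costs
    Q_hat = eqvi(P, c, alpha=0.9, n=10, t_max=50, rng=rng)
    print("greedy policy:", Q_hat.argmin(axis=1))
```

Note that the iteration keeps no step-size sequence, in contrast with the Q-learning update (4.8); the only tuning parameter besides the iteration count is the sample size `n`.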
E 2 denotes the expectation with respect to P 2 . Let P be the product measure, P =P 1 ⌦ P 2 and let E denote the expectation with respect to P. For each ! 2⌦ 1 , i.e., for each sequence of transition kernels b p(!)=(b p t (! t )) t 0 ,wedefinea sequence of simulation functions t =( 1 t , 2 t ) t 0 as, 1 t : S⇥ A⇥ ⌦ 2 ! S (4.16) 2 t : S⇥ ⌦ 2 ! A (4.17) such that P 2 1 t (s,a,⌫ t (s,a)) =s 0 = b p t (s 0 |s,a) (4.18) and 2 t is the (randomized) control strategy that maps the output of the function 1 t to the action space. We note that the control strategy can be in the set ⇧ or ⌃. For t 2 >t 1 ,definethe 83 composition function t 2 t 1 as t 2 t 1 := t 2 1 t 2 2 ··· t 1 . (4.19) Givenan! 2⌦ 1 ,⌫ 2⌦ 2 andaninitialcondition(s 0 ,a 0 ),wecansimulate anMDPwithstate-action sequence (X t (!,⌫ ),Z t (!,⌫ )) t 0 as follows: (X t (!,⌫ ),Z t (!,⌫ )) = t 0 (s 0 ,a 0 ) and (X t+1 (!,⌫ ),Z t+1 (!,⌫ )) = t t 0 (s 0 ,a 0 ) (4.20) We call this simulation method as forward simulation. The dependence on the control strategy 2 t is implicit and is not used in the notation. Whenever not necessary, we also drop ! and ⌫ from the notation and denote the simulated chain by (X t ,Z t ). Since P 2 (X t+1 |X m ,Z m ,m t)=b p t (X t+1 |X t ,Z t ), the sequence (X t (!,⌫ )) t 0 is a controlled Markov chain. Next we prove a series of results about the controlled Markov chain (X t (!,⌫ )) t t 0 .Wewilluse these results in proving Theorem 15. Given an initial time t 0 and states s 0 ,s 0 2S,wedefinethe hitting time ⌧ !,⌫ of the controlled Markov chain (X t (!,⌫ )) t t 0 as ⌧ !,⌫ (s 0 ,s 0 ):=min{m 0|X t 0 +m (!,⌫ )=s 0 ,X t 0 (!,⌫ )=s 0 } (4.21) We first show that the expected value of the hitting time is finite when the chain is controlled by a stationary strategy, i.e., 2 t = ⇡ 2⇧ ,8 t. Proposition 10. Let (X t (!,⌫ ),⇡ (X t (!,⌫ ))) t t 0 be the sequence of state-action pairs for the MDP simulated according to (4.20) using a stationary control strategy 2 t = ⇡ 2⇧ ,8 t. 
Let ⌧ !,⌫ be the hitting time as defined in equation (4.21). Then, E ⇥ ⌧ !,⌫ (s 0 ,s 0 ) ⇤ <1 , 8 s 0 ,s 0 2S. Proof. Consider a sequence of states, (s t 0 +j ) r j=0 ,with s t 0 =s 0 and s t 0 +r =s 0 such that P ⇡ (s t 0 ,s t 0 +1 )···P ⇡ (s t 0 +r 1 ,s t 0 +r ) > 0. By Remark 6, such a sequence of states exists. Further- more, r can be picked independent of the choice of s 0 ,s 0 and we assume that it is so. Let W ⇡ =W ⇡ ((s t 0 +j ) r j=0 ):= b P ⇡ t 0 (s t 0 ,s t 0 +1 )··· b P ⇡ t 0 +r 1 (s t 0 +r 1 ,s t 0 +r ). Using (4.10)-(4.12), E 1 [ b P ⇡ t ]=P ⇡ ,8 t and since b P ⇡ t are i.i.d. 8 t, E 1 [W ⇡ ]=P ⇡ (s t 0 ,s t 0 +1 )···P ⇡ (s t 0 +r 1 ,s t 0 +r ) > 0. 84 So there exist✏> 0,> 0 such that P 1 (W ⇡ >✏)>.Then, P ⌧ !,⌫ (s 0 ,s 0 ) r P 2 ⌧ !,⌫ (s 0 ,s 0 ) r|W ⇡ >✏ P 1 (W ⇡ >✏) > ✏, because P 2 (⌧ !,⌫ (s 0 ,s 0 ) r|W ⇡ ) P 2 (X t 0 +r =s 0 ,X t 0 =s 0 |W ⇡ ) W ⇡ .Then P ⌧ !,⌫ (s 0 ,s 0 )>r (1 ✏ ). Due to the i.i.d. nature of ! and the Markov property of X t (!,⌫ ), it is clear that the above probability does not depend on t 0 and hence, for anyk> 0, P ⌧ !,⌫ (s,s 0 )>kr (1 ✏ ) k . Then, E ⇥ ⌧ !,⌫ (s,s 0 ) ⇤ = X t 0 P ⌧ !,⌫ (s,s 0 )>t X k 0 rP ⌧ !,⌫ (s,s 0 )>kr r X k 0 (1 ✏ ) k <1 Consider two controlled Markov chains X 1 t (!,⌫ ),X 2 t (! 0 ,⌫ 0 ),t t 0 , with di↵erent initial condi- tions, defined on (⌦ ⇥ ⌦ 0 ,F⇥F 0 ,P⇥ P 0 )where(⌦ 0 ,F 0 ,P 0 ) is another copy of (⌦ ,F,P). Define the coupling time,e ⌧ ! ⇤ ,⌫ ⇤ , for ! ⇤ := (!,! 0 ),⌫ ⇤ := (⌫,⌫ 0 ), as e ⌧ ! ⇤ ,⌫ ⇤ (s 1 0 ,s 2 0 ):=min m 0:X 1 t 0 +m (!,⌫ )=X 2 t 0 +m (! 0 ,⌫ 0 ),X 1 t 0 (!,⌫ )=s 1 0 ,X 2 t 0 (! 0 ,⌫ 0 )=s 2 0 (4.22) We prove that the expected value of the coupling time is finite when the chain is controlled by a stationary strategy. Proposition 11. Let (X 1 t (!,⌫ ),⇡ (X 1 t (!,⌫ ))) t t 0 ,(X 2 t (! 0 ,⌫ 0 ),⇡ (X 2 t (! 0 ,⌫ 0 ))) t t 0 be two sequences of state-action pairs for an MDP simulated according to (4.20) using a stationary control strategy 2 t = ⇡ 2⇧ ,8 t. Lete ⌧ ! 
⇤ ,⌫ ⇤ be the coupling time as defined in equation (4.22). Then, E ⇥ e ⌧ ! ⇤ ,⌫ ⇤ (s 1 0 ,s 2 0 ) ⇤ <1 ,8 s 1 0 ,s 2 0 2S. Proof. Consider two sequences of states, (s 1 t 0 +j ) r j=0 and (s 2 t 0 +j ) r j=0 with s 1 t 0 = s 1 0 ,s 2 t 0 = s 2 0 ,s 1 t 0 +r = s 2 t 0 +r =s, for some s2S such that P ⇡ s 1 t 0 ,s 1 t 0 +1 ···P ⇡ s 1 t 0 +r 1 ,s 1 t 0 +r > 0 and P ⇡ s 2 t 0 ,s 2 t 0 +1 ···P ⇡ s 2 t 0 +r 1 ,s 2 t 0 +r > 0. By Remark 6, such (s 1 t 0 +j ) r j=0 and (s 2 t 0 +j ) r j=0 exist. Using, by abuse of notation, some common 85 notation for entities defined on the two copies of (⌦ ,F,P), let W ⇡ 1 =W ⇡ 1 ((s 1 t 0 +j ) r j=0 ):= b P ⇡ t 0 s 1 t 0 ,s 1 t 0 +1 ··· b P ⇡ t 0 +r 1 s 1 t 0 +r 1 ,s 1 t 0 +r , W ⇡ 2 =W ⇡ 2 ((s 2 t 0 +j ) r j=0 ):= b P ⇡ t 0 s 2 t 0 ,s 2 t 0 +1 ··· b P ⇡ t 0 +r 1 s 2 t 0 +r 1 ,s 2 t 0 +r . As in the proof of Proposition 10, E 1 [W ⇡ 1 ]=P ⇡ s 1 t 0 ,s 1 t 0 +1 ···P ⇡ s 1 t 0 +r 1 ,s 1 t 0 +r > 0, E 1 [W ⇡ 2 ]=P ⇡ s 2 t 0 ,s 2 t 0 +1 ···P ⇡ s 2 t 0 +r 1 ,s 2 t 0 +r > 0. So there exist✏> 0,> 0 such that P 1 (W ⇡ 1 >✏)> and P 1 (W ⇡ 2 >✏)> . Moreover, due to the independence of b P ⇡ t 0 +j (s 1 t 0 +j ,s 1 t 0 +j+1 ) and b P ⇡ t 0 +j (s 2 t 0 +j ,s 2 t 0 +j+1 ), P 1 (W ⇡ 1 >✏,W ⇡ 2 >✏) > 2 . Also, P 2 X 1 t 0 +r =X 2 t 0 +r ,X 1 t 0 =s 1 0 ,X 2 t 0 =s 2 0 |W ⇡ 1 ,W ⇡ 2 W ⇡ 1 W ⇡ 2 . Then by an argument analogous to that of Proposition 10, we have P e ⌧ ! ⇤ ,⌫ ⇤ (s 1 0 ,s 2 0 ) r P 2 e ⌧ !,⌫ (s 1 0 ,s 2 0 ) r|W ⇡ 1 >✏,W ⇡ 2 >✏ P 1 (W ⇡ 1 >✏,W ⇡ 2 >✏) ✏ 2 2 , where the ✏, may be chosen independent of the choice of s 1 0 ,s 2 0 . Hence P e ⌧ ! ⇤ ,⌫ ⇤ (s 1 0 ,s 2 0 ) r (1 ✏ 2 2 ). NowthesameargumentsintheproofofProposition10canbeappliedtogetthedesiredconclusion. WenowextendtheresultofProposition10andProposition11tonon-stationarycontrolstrate- gies. Forthat,weusethefollowingresultfrom[108]forahomogeneousMDPdefinedbytheoriginal transition kernel p(·|·,·). Lemma 19. 
[108, Lemma 1.1, Page 42] (X t , t (X t )),t t 0 , be the sequence of state-action pairs corresponding to the homogeneous MDP defined by an arbitrary control strategy 2⌃ and the transition kernel p(·|·,·). Then there exist integer r ⇤ and✏> 0 such that P(⌧ (s,s 0 )>r ⇤ ) < 1 ✏, 8 s,s 0 2S. Proof. Suppose not. Then there exists a sequence of controlled Markov chains {X ↵ t ,t t 0 }, 86 ↵ =1,2,... governed by control strategies{ ↵ t ,t t 0 } such that the following holds: If ⌧ ↵ (s,s 0 ):= min{t 0|X ↵ t 0 +t =s 0 ,X ↵ t 0 =s},then, P ⌧ ↵ (s,s 0 )>↵ > 1 1 ↵ , ↵ 1. By dropping to a subsequence if necessary and invoking Skorohod’s theorem, we may assume that these chains are defined on a common probability space, and there exists a controlled Markov chain {X 1 t ,t t 0 } governed by a control strategy 1 with X 1 t 0 = s, such that (X ↵ t , ↵ t ) t 0 ! (X 1 t , 1 t ) t 0 a.s. Since P ⌧ ↵ (s,s 0 )>j =E " j Y t=1 I{X ↵ t 0 +t 6=s 0 }, # ↵,t =1,2,..., a straightforward limiting argument leads to Pr ⌧ 1 (s,s 0 )>↵ > 1 1 ↵ , ↵ 1. for ⌧ 1 (s,s 0 ):=min{t 0|X 1 t 0 +t = s 0 ,X 1 t 0 = s}.Then, ⌧ 1 =1 a.s. This is possible only if there exists a non-empty subset G of S\{s 0 } such that for each i 2 G, max k62 G min a2 A p(k|i,a) = 0. Let a i be the action at which the above minimum is achieved. Then the chain starting at G and governed by a stationary control strategy ⇡ such that ⇡ (i)= a i never leaves G. This contradicts Assumption5thatunderanystationarycontrolstrategy,Sisirreducible. Thusthegivenstatement must hold. Proposition 12. Let (X t (!,⌫ ), t (X t (!,⌫ ))) t t 0 be the sequence of state-action pairs for the MDP simulated according to (4.20) using an arbitrary control strategy 2 t = t ,8 t. Let ⌧ !,⌫ be the hitting time as defined in equation (4.21). Then, E ⇥ ⌧ !,⌫ (s 0 ,s 0 ) ⇤ <1 , 8 s 0 ,s 0 2S Proof. Proof is similar to that of Proposition 10. 
By Lemma 19, there exists a j ⇤ ,0<j ⇤ r ⇤ and a sequence of states, (s t 0 +j ) j ⇤ j=0 ,with s t 0 =s 0 and s t 0 +j ⇤ =s 0 such that P t 0 +1 (s t 0 ,s t 0 +1 )···P t 0 +j ⇤ (s t 0 +r 1 ,s t 0 +j ⇤ ) >0where P t is defined as in (4.1) by replacing ⇡ with t . Let W =W ((s t 0 +j ) j ⇤ j=0 ):= b P t 0 t 0 (s t 0 ,s t 0 +1 )··· b P t 0 +j ⇤ 1 t 0 +r 1 (s t 0 +r 1 ,s t 0 +j ⇤ ). where b P t is defined as in (4.12) by replacing ⇡ with t . As in the proof of Proposition 10E[ b P t t ]= 87 P t ,8 t and since b P t t are independent8 t, E 1 [W ]=P t 0 +1 (s t 0 ,s t 0 +1 )···P t 0 +j ⇤ (s t 0 +r 1 ,s t 0 +j ⇤ ) > 0 Then, there exists an✏> 0,> 0 such that P 1 (W >✏)> . Then, as in the proof of Proposition 10, P ⌧ !,⌫ (s 0 ,s 0 )>r (1 ✏ ). and, E ⇥ ⌧ !,⌫ (s 0 ,s 0 ) ⇤ <1 . Proposition 13. Let (X 1 t (!,⌫ ), t (X 1 t (!,⌫ ))) t t 0 ,(X 2 t (! 0 ,⌫ 0 ), t (X 2 t (! 0 ,⌫ 0 ))) t t 0 be two sequences of state-action pairs for an MDP simulated according to (4.20) using an arbitrary control strategy 2 t = t . Lete ⌧ ! ⇤ ,⌫ ⇤ be the coupling time as defined in equation (4.22). Then, E ⇥ e ⌧ ! ⇤ ,⌫ ⇤ (s 1 0 ,s 2 0 ) ⇤ <1 ,8 s 1 0 ,s 2 0 2S. Proof. Proof is straightforward by combining the proofs of Proposition 11 and Proposition 12. We now consider the backward simulation of an MDP. This is similar to the coupling from the past idea introduced by [62]. Note that for us this is a proof technique, a ‘thought experiment’, and not the actual algorithm. Given ! 2⌦ 1 ,⌫ 2⌦ 2 , the sequence of simulation functions ( t =( 1 t , 1 t )), a t 0 > 0, and an initial condition e X t 0 (!,⌫ )= s 0 , e Z t 0 (!,⌫ )= a 0 , we simulate a controlled Markov chain ( e X m (!,⌫ )) 0 m= t 0 of length t 0 + 1 using the backward simulation. As the first step, we do an o✏ine computation of all the possible simulation trajectories as follows. 1. Input t 0 . Initialize m = 1. 2. Compute e 1 m (s,a,⌫ (m) (s,a)) := 1 m (s,a,⌫ (m) (s,a)),8 (s,a)2S⇥ A. 3. m =m 1. Ifm< t 0 , stop. Else, return to step 2. 
Then we simulate ( e X m (!,⌫ )) 0 m= t 0 +1 as, e X m = e 1 m ( e X m 1 , e Z m 1 ,⌫ (m 1) ), (4.23) e Z m = e 2 m ( e X m ):= 2 m ( e X m ), (4.24) 88 starting from the initial condition e X t 0 (!,⌫ )= s 0 , e Z t 0 (!,⌫ )= a 0 . We define the composition function as e 0 t 0 := e 0 e 1 ··· e t 0 +2 e t 0 +1 (4.25) where e m =( e 1 m , e 2 m ). Recall that (c.f. (4.20)) in the forward simulation starting from t = 0, we go from a path of length t 0 to a path of length t 0 +1 by taking the composition t 0 t 0 0 (s 0 ,a 0 ). In backward simulation we do this by taking the composition e 0 t 0 +1 e t 0 (s 0 ,a 0 ). So, forward simulation is done by forward composition of simulation functions where as the backward simulation is done by backward composition of the simulation functions. Furthermore, in forward simulation, we can successively generate consecutive states of a single controlled Markov chain trajectory one transition at a time, whereas in backward simulation one is obliged to generate one transition per stateandanytrajectoryfrom t 0 to0hastobetracedoutofthiscollectionbychoosingcontiguous state transitions at each successive time. This feature is familiar from the Propp-Wilson backward simulation algorithm mentioned above. In the following, we fix the control strategy 2 t as, 2 t (s) = argmin e Q t (s,·)8 s, (4.26) where e Q t is defined as, e Q t (s,a):=E 2 " 1 X k= t ↵ k+t c( e X k , e Z k )+↵ t e Q 0 ( e X 0 , e Z 0 )| e X t =s, e Z t =a # (4.27) and e Q 0 (·,·)= h(·,·) for any bounded function h : S⇥ A ! R + . Note that the expectation in the above equation is with respect to the measure P 2 for a given ! (i.e., for a given sequence of transition kernels (b p t (! t )) t 0 . We now show an important connection between the e Q t iterate defined above and the empirical Q-value iterate b Q t . Proposition 14. Let b Q 0 (·,·)= e Q 0 (·,·)=h(·,·) for any bounded function h :S⇥ A! R + . Then, b Q t = e Q t for all t 0. Proof. We prove this by induction. 
First note that by the definition of b Q t given in equation (4.14), for all (s,a)2S⇥ A, we get b Q 0 (s,a)=h(s,a), b Q 1 (s,a)=c(s,a)+↵ X s 0 b p 0 (s 0 |s,a)min b h(s 0 ,b). 89 Now, by the definition in equation (4.27), e Q 0 (s,a)=E 2 h h( e X 0 , e Z 0 )| e X 0 =s, e Z 0 =a i =h(s,a), e Q 1 (s,a)=E 2 h c( e X 1 , e Z 1 )+↵ e Q 0 ( e X 0 , e Z 0 )| e X 1 =s, e Z 1 =a i =c(s,a)+↵ E 2 h e Q 0 ( 0 ( e X 1 , e Z 1 ,⌫ 1 ), e Z 0 )| e X 1 =s, e Z 1 =a i where e Z 0 = argmin e Q 0 ( 0 ( e X 1 , e Z 1 ,⌫ 1 ),·). Then, e Q 1 (s,a)=c(s,a)+↵ X s 0 b p 0 (s 0 |s,a)min b h(s 0 ,b), where we used the fact that e Q 0 =h. Now, assume that b Q m = e Q m for all m t 1. Then, e Q t (s,a)=E 2 " 1 X k= t ↵ k+t c( e X k , e Z k )+↵ t e Q 0 ( e X 0 , e Z 0 )| e X t =s, e Z t =a # =c(s,a)+E 2 " 1 X k= t+1 ↵ k+t c( e X k , e Z k )+↵ t e Q 0 ( e X 0 , e Z 0 )| e X t =s, e Z t =a # =c(s,a)+↵ E 2 " 1 X k= t+1 ↵ k+t 1 c( e X k , e Z k )+↵ t 1 e Q 0 ( e X 0 , e Z 0 )| e X t =s, e Z t =a # =c(s,a)+↵ E 2 h e Q t 1 ( e X t+1 , e Z t+1 )| e X t =s, e Z t =a i =c(s,a)+↵ E 2 h e Q t 1 ( t+1 ( e X t , e Z t ,⌫ t ), e Z t+1 )| e X t =s, e Z t =a i , where e Z t+1 = argmin e Q t 1 ( t+1 ( e X t , e Z t ,⌫ t ), · ). Then, e Q t (s,a)=c(s,a)+↵ X s 0 b p t 1 (s 0 |s,a)min b e Q t 1 (s 0 ,b) =c(s,a)+↵ X s 0 b p t 1 (s 0 |s,a)min b b Q t 1 (s 0 ,b) = b Q t (s,a). We shall need the following lemma of Blackwell and Dubins [109], [110, Chapter 3, Theorem 3.3.8]. Lemma 20. Let Y t ,t=1,2,...,1 be real random variable on a probability space (⌦ 1 ,F 1 ,P 1 ) 90 such that Y t ! Y 1 and E[sup t |Y t |] < 1 . Let {F t } be a family of sub- -fields of F which is either increasing or decreasing, with F 1 = _ t F t or \ t F t accordingly. Then lim t,j!1 E[Y t |F j ]= E[Y 1 |F 1 ] a.s. and in L 1 . Proposition 15. For a given ! 2 ⌦ 1 , let b p(!)=(b p t (!)) t 0 be the corresponding sequence of transition kernels and b Q t (!),t 0, be the corresponding Q value iterates as defined in equation (4.14). 
Then, there exists a random variable Q ⇤ (!) such that b Q t (!)! Q ⇤ (!),! a.s.. Proof. Consider the backward simulation described above. For ! 2⌦ 1 ,⌫ 2⌦ 2 , we trace out two MDPs with state-action sequences ( e X m (!,⌫ ), e Z m (!,⌫ )) 0 m= t , ( e X 0 m (!,⌫ ), e Z 0 m (!,⌫ )) 0 m= t , with initial conditions ( e X t (!,⌫ ), e Z t (!,⌫ )) = (s,a), ( e X 0 t (!,⌫ ), e Z 0 t (!,⌫ )) = (s 0 ,a 0 ). Decrease t until they couple. Once they couple, they follow the same sample path. In the follow- ing, we first show the following: Claim: These chains couple with probability 1 as t!1 . Proof of claim: Lete ⌧ t !,⌫ be the time after which these chains couple, i.e, e X t+e ⌧ t !,⌫ = e X 0 t+e ⌧ t !,⌫ and e X t+l 6= e X 0 t+l for all 0 l<e ⌧ t !,⌫ . Since these chains are of finite length (from t to 0), we may need to define the value ofe ⌧ t !,⌫ arbitrarily if they don’t couple during this time. To overcome this, we let these chains to run to infinity. This can be done without loss of generality as follows: For t m< 0, simulate the chains according the backward simulation methods speci- fied by (4.23)-(4.24). For m 0, continue the simulation to generate chains ( e X m (!,⌫ )) 1 m=1 , ( e X 0 m (!,⌫ ), e Z 0 m (!,⌫ )) 1 m=1 as, e X m = 1 m+t ( e X m 1 , e Z m 1 ,⌫ (m 1) ), (4.28) e Z m = 2 m+t ( e X m ). (4.29) It is easy to see that e ⌧ t !,⌫ has the same statistical properties as the coupling time defined in equation (4.22). So, by Proposition 13, E[e ⌧ t !,⌫ ]<1 . Now, X n 1 P 2e ⌧ t !,⌫ n =E[2e ⌧ t !,⌫ ]<1 . 91 Also, it is easy to see thate ⌧ t !,⌫ s are identically distributed8 t. So, X n 1 P 2e ⌧ t !,⌫ n = X n 1 P 2e ⌧ n !,⌫ >n <1 which implies X n 1 P ⇣ e ⌧ n !,⌫ n> n 2 ⌘ <1 . Then, by Borel-Cantelli lemma, e ⌧ n !,⌫ n!1 ,(!,⌫ ) a.s. Thus the chains will couple with probability 1. 
(End of 'proof of claim')

Now, by construction,
\[
\tilde Q_t(s,a) - \tilde Q_t(s',a') = E_2\Bigg[ \sum_{k=-t}^{(-t+\tilde\tau^t_{\omega,\nu}-1)\wedge(-1)} \alpha^{k+t}\Big( c(\tilde X_k,\tilde Z_k) - c(\tilde X'_k,\tilde Z'_k) \Big) + \alpha^{t\wedge\tilde\tau^t_{\omega,\nu}} \Big( \tilde Q_0(\tilde X_0,\tilde Z_0) - \tilde Q_0(\tilde X'_0,\tilde Z'_0) \Big) \,\Bigg|\, (\tilde X_{-t},\tilde Z_{-t})=(s,a),\ (\tilde X'_{-t},\tilde Z'_{-t})=(s',a') \Bigg].
\]
Since the chains couple with probability 1, the RHS of the above equation converges to a random variable $R(\omega)(s,a,s',a')$, $\omega$-a.s., as $t\to\infty$, i.e.,
\[
R_t(\omega)(s,a,s',a') := \tilde Q_t(s,a) - \tilde Q_t(s',a') \to R(\omega)(s,a,s',a'), \quad \omega\text{-a.s.} \tag{4.30}
\]
We revert to the 'forward time' picture henceforth. Now,
\[
\begin{aligned}
\hat Q_{t+1}(s,a) &= c(s,a) + \alpha \sum_{s'} \hat p_t(s'|s,a) \min_b \hat Q_t(s',b) \\
&= c(s,a) + \alpha \sum_{s'} \hat p_t(s'|s,a) \min_b \Big( \hat Q_t(s',b) - \hat Q_t(s,a) \Big) + \alpha \hat Q_t(s,a) \\
&= c(s,a) + \alpha \sum_{s'} \hat p_t(s'|s,a) \min_b R_t(\omega)(s',b,s,a) + \alpha \hat Q_t(s,a),
\end{aligned}
\]
where the last step uses $\hat Q_t = \tilde Q_t$ (Proposition 14). Since $\hat p_t$ depends only on $\omega$, we can define another random variable $R'_t(\omega)(s,a)$ such that
\[
R'_t(\omega)(s,a) := \sum_{s'} \hat p_t(s'|s,a) \min_b R_t(\omega)(s',b,s,a) = E\Big[ \min_b R_t(\omega)(s',b,s,a) \,\Big|\, \tilde{\mathcal F}_{t-1} \Big],
\]
where $\tilde{\mathcal F}_{t-1} := \sigma(\xi^{t'}_i(s,a),\ s\in S,\ a\in A,\ 1\le i\le n,\ t'<t)$. Since $R_t(\omega)\to R(\omega)$, $\omega$-a.s., it follows from the preceding lemma that there exists another random variable $R^*(\omega)$ such that $R'_t(\omega)\to R^*(\omega)$, $\omega$-a.s. Then,
\[
\begin{aligned}
\hat Q_{t+1}(s,a) &= c(s,a) + \alpha R'_t(\omega)(s,a) + \alpha \hat Q_t(s,a) \\
&= c(s,a) + \alpha R'_t(\omega)(s,a) + \alpha c(s,a) + \alpha^2 R'_{t-1}(\omega)(s,a) + \alpha^2 \hat Q_{t-1}(s,a) \\
&\;\;\vdots \\
&= c(s,a) \sum_{k=0}^{t} \alpha^k + \alpha \sum_{k=0}^{t} \alpha^k R'_{t-k}(\omega)(s,a) + \alpha^{t+1} \hat Q_0(s,a).
\end{aligned}
\]
Clearly,
\[
\hat Q_t(s,a) \to Q^*(\omega) := \frac{c(s,a)}{1-\alpha} + \frac{\alpha R^*(\omega)(s,a)}{1-\alpha}, \quad \omega\text{-a.s.}
\]

Now we give the proof of our main theorem (Theorem 15).

Proof. Let $(\Omega_1,\mathcal F_1,P_1)$ be the probability space as defined before. Let $\mathcal F_t := \sigma(\hat Q_m, m\le t)$. From Proposition 15, $\hat Q_t(\omega)\to \hat Q^*(\omega)$, $\omega$-a.s., and hence $\hat Q_t(\omega) - \hat Q_{t-1}(\omega)\to 0$. Taking conditional expectation and using Lemma 20 we get $E[\hat Q_t(\omega)|\mathcal F_{t-1}] - \hat Q_{t-1}(\omega)\to 0$. Since $\hat Q_t(\omega) = \hat T(\theta^{t-1}\omega, \hat Q_{t-1}(\omega))$, from equation (4.15), $E[\hat Q_t(\omega)|\mathcal F_{t-1}] = T(\hat Q_{t-1}(\omega))$, where $T$ is the Q-value operator defined in equation (4.5). This gives $T(\hat Q_{t-1}(\omega)) - \hat Q_{t-1}(\omega)\to 0$. Then, by the continuity of $T$, $T(\hat Q^*(\omega)) = \hat Q^*(\omega)$, which implies that $\hat Q^*(\omega)$ is indeed equal to the optimal Q function $Q^*$, by the uniqueness of the fixed point of $T$.

4.3 Simulations

In this section, we show some simulation results comparing the classical Q-learning (QL) algorithm (given in equation (4.8)) with our empirical Q-value iteration (EQVI). We generate a 'random' MDP with $|S| = 100$ and $|A| = 10$, where the transition matrix $P$ and the cost $c(s,a)$ are generated randomly. Figure 4.1 shows the comparison. The X axis is the iteration number $t$ and the Y axis is the relative error $\|Q_t - Q^*\|/\|Q^*\|$. The QL step size is of the form $1/t^\theta$, and the figure shows three different curves, for $\theta = 1.0, 0.8, 0.6$. We show two different curves for EQVI, for the number of samples $n = 1$ and $n = 5$. The figure shows faster convergence for EQVI.

[Figure 4.1 (plot). Title: Comparison: Empirical Q Learning (EQL) and classical Q Learning (QL). X axis: number of iterations (0 to 200). Y axis: $\|Q_t - Q^*\|/\|Q^*\|$ (0 to 1). Curves: EQL, n=1; EQL, n=5; QL, θ=1.0; QL, θ=0.8; QL, θ=0.6.]

Figure 4.1: QL with three different step sizes $1/t^\theta$, $\theta = 1.0, 0.8, 0.6$. EQVI with two different sample sizes, $n = 1, 5$.

4.4 Conclusions

We have presented a new Q-learning algorithm for discounted cost MDPs. We have rigorously established the convergence of this algorithm to the desired limit with probability one. Unlike the classical learning schemes for MDPs such as Q-learning and actor-critic algorithms, neither our algorithm nor its analysis uses a stochastic approximation method. Preliminary experimental results suggest a faster rate of convergence for our algorithm than for stochastic approximation algorithms.

We consider this contribution only as a first step towards solving a set of more interesting problems.

• Distributed and Asynchronous Implementation: We considered the case where all components of the vector $\hat Q_t$ are updated simultaneously at time $t$ and the outcome is immediately available for the next iteration.
However, there may be situations where only a subset of components gets updated in each iteration. Also, each component may be updated by a different processor, and the processors may exchange this updated information with some communication delay. Classical learning schemes such as Q-learning and actor-critic algorithms are known to work well in such distributed asynchronous implementations. We are currently working towards extending our results to such scenarios.

• Average Reward/Cost MDPs: Average reward MDPs are typically hard to analyze because the dynamic programming operator for an average reward MDP is not a contraction mapping. There are, however, provably convergent Q-learning and actor-critic algorithms for average reward MDPs due to the powerful ODE approach to stochastic approximation [46] [35]. It would be interesting to see whether our algorithm works for learning in MDPs with average cost.

• Rate of Convergence: The rate of convergence of stochastic approximation based algorithms is very slow in general because they are incremental in nature. We show in simulation that EQVI converges faster than the classical QL, which is not surprising given that our scheme is not incremental. The exact rate of convergence is an important problem to be addressed. Also, the tradeoff between higher computation and lower variance in the choice of the number $n$ of samples at each iterate could be quantified.

5 Approachability in Dynamical Systems

In this chapter we propose a new learning algorithm for approachability in dynamical systems. The notion of approachability was introduced by Blackwell [70] in the context of vector-valued repeated games. The famous 'Blackwell's approachability theorem' prescribes a strategy for approachability, i.e., for 'steering' the average vector-cost of a given player towards a given target set, irrespective of the strategies of the other players.
In this chapter, motivated by multi-objective optimization and decision making problems in dynamically changing environments, we address the approachability problem in Markov Decision Processes (MDPs) and Stackelberg stochastic games with vector-valued cost functions. We make two main contributions. Firstly, we give a simple and computationally tractable strategy for approachability for MDPs and Stackelberg stochastic games. Secondly, we give reinforcement learning algorithms to learn the approachable strategy when the transition kernel is unknown. We also show that the conditions that we give for approachability are both necessary and sufficient for convex sets, and thus form a complete characterization. We also give sufficient conditions for non-convex sets.

The chapter is organized as follows. In Section 5.1, we state and prove Blackwell's approachability theorem for MDPs. We also give a reinforcement learning algorithm for approachability when the transition kernels are not known. In Section 5.2, we give Blackwell's approachability theorem for Stackelberg stochastic games and give a learning algorithm. In Section 5.3, we conclude with some observations and point out some interesting future directions.

5.1 A Blackwell's Approachability Theorem for MDPs

5.1.1 Preliminaries

We consider an MDP with a finite state space $S$ and a finite action space $A$. Let $p(\cdot,\cdot,\cdot)$ be the transition kernel that governs the system evolution, such that $p(s,a,\cdot)\in\mathcal P(S)$ for all $(s,a)\in S\times A$. Let $c$ be a vector-cost function, $c: S\times A\to\mathbb R^K$. For a given state $s\in S$ and any action $a\in A$, $c(s,a) = [c_1(s,a),\ldots,c_K(s,a)]^\dagger$, where $c_j: S\times A\to\mathbb R$ for $1\le j\le K$. We assume that the cost function is bounded and, without loss of generality, that $|c_j(s,a)|\le 1$ for $1\le j\le K$, $\forall s\in S$, $\forall a\in A$.
For any arbitrary set $B$, let $\mathcal P(B)$ denote the space of probability measures over the set $B$. A stationary policy is a mapping $\pi: S\to\mathcal P(A)$ such that for any state $s\in S$ and action $a\in A$, $\pi(s,a)$ is the probability with which action $a$ is chosen in state $s$, regardless of the previous history. Any stationary policy $\pi$ and an initial state $s_0\in S$ induce a Markov chain on $S$. We make the following assumption:

Assumption 6. The Markov chain induced by any stationary randomized policy is irreducible.

This is a standard assumption made in reinforcement learning theory.

Let time be discrete and let $(s_n,a_n)$ be the state-action pair at time $n$. The average vector-cost incurred till time $n$ is denoted by $x_n = \frac{1}{n}\sum_{m=1}^{n} c(s_m,a_m)$. It is well known that under any stationary policy $\pi$, the average vector cost $x_n$ converges to $c(\pi)$, where
\[
c(\pi) := \Big[ \sum_{s,a} \eta_\pi(s)\,\pi(s,a)\,c_1(s,a),\ \ldots,\ \sum_{s,a} \eta_\pi(s)\,\pi(s,a)\,c_K(s,a) \Big]^\dagger, \tag{5.1}
\]
and $\eta_\pi$ is the stationary distribution of the Markov chain induced by the policy $\pi$.

Our objective is to specify the conditions under which a given closed set is approachable and to prescribe a policy (not necessarily stationary) for approachability. The notion of an approachable set is made precise in the following definition. Let $D\subset\mathbb R^K$ be any given set. Then for $x\notin D$ we define $\|x - D\| := \inf_{y\in D}\|x - y\|$. Let $\sigma$ be a possibly non-stationary policy and, for any initial state $s\in S$, let $\mu_s(\sigma)$ be the induced probability distribution on the sequence of vectors $\{x_n\}_{n\ge 0}$.

Definition 4 (Approachable Set). A set $D\subset\mathbb R^K$ is approachable if there exists a policy $\sigma$ (possibly non-stationary) such that $\|x_n - D\|\to 0$, $\mu_s(\sigma)$-almost surely, for all initial states $s\in S$.

A set is approachable if and only if its closure is approachable, and hence without loss of generality we can consider only closed sets $D$. We first restrict our attention to convex $D$. The extension to non-convex sets is discussed in Section 5.1.2. Thus, let $D$ be a closed convex set in $\mathbb R^K$.
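As an illustration of equation (5.1), the following is a minimal sketch (not part of the original text) of how the long-run average vector-cost $c(\pi)$ could be computed for a small MDP when the kernel is known. The two-state kernel `p_example`, the costs `c_example`, the policy `pi_example` and the helper names are all hypothetical. The stationary distribution is approximated by iterating the lazy chain $(P_\pi + I)/2$, which has the same stationary distribution as $P_\pi$ and is aperiodic.

```python
def stationary_distribution(P, iters=5000):
    """Approximate the stationary distribution of an irreducible chain P
    by power iteration on the lazy chain (P + I) / 2."""
    n = len(P)
    eta = [1.0 / n] * n
    for _ in range(iters):
        step = [sum(eta[i] * P[i][j] for i in range(n)) for j in range(n)]
        eta = [0.5 * eta[j] + 0.5 * step[j] for j in range(n)]
    return eta


def average_vector_cost(p, pi, c):
    """Evaluate c(pi) as in (5.1).

    p[s][a][t] : transition probability from s to t under action a
    pi[s][a]   : probability of action a in state s
    c[s][a]    : list of K cost components
    """
    n_s, n_a, k_dim = len(p), len(p[0]), len(c[0][0])
    # Markov chain induced by the stationary policy pi.
    P_pi = [[sum(pi[s][a] * p[s][a][t] for a in range(n_a))
             for t in range(n_s)] for s in range(n_s)]
    eta = stationary_distribution(P_pi)
    return [sum(eta[s] * pi[s][a] * c[s][a][k]
                for s in range(n_s) for a in range(n_a))
            for k in range(k_dim)]


# Hypothetical 2-state, 2-action example with K = 2 cost components.
p_example = [[[0.9, 0.1], [0.2, 0.8]],
             [[0.5, 0.5], [0.1, 0.9]]]
c_example = [[[1.0, 0.0], [0.5, 0.5]],
             [[0.0, 1.0], [0.5, 0.5]]]
pi_example = [[1.0, 0.0], [1.0, 0.0]]   # always play action 0

c_pi = average_vector_cost(p_example, pi_example, c_example)
```

Under this policy the induced chain has rows (0.9, 0.1) and (0.5, 0.5), whose stationary distribution is (5/6, 1/6), so `c_pi` should be close to (5/6, 1/6).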
For any $x\in\mathbb R^K\setminus D$, let $P_D(x)$ denote the (unique) projection of $x$ onto $D$. Let $\lambda(x) = (x - P_D(x))/\|x - P_D(x)\|$. For each $x\in\mathbb R^K\setminus D$, we define a scalar-valued MDP with the stage cost $\tilde c(s,a;x) = \langle c(s,a),\lambda(x)\rangle$. With respect to this scalar MDP parametrized by $x$, we define the following:
\[
c^*(x) = \min_\pi \langle c(\pi),\lambda(x)\rangle, \tag{5.2}
\]
\[
\Pi(x) = \arg\min_\pi \langle c(\pi),\lambda(x)\rangle. \tag{5.3}
\]
Thus, $\Pi(x)$ is the set of stationary policies which minimize the infinite horizon average cost
\[
\lim_{n\to\infty} \frac{1}{n}\sum_{m=1}^{n} \tilde c(s_m,a_m;x),
\]
and $c^*(x)$ is the corresponding optimal cost.

5.1.2 Approachability Theorem for MDPs

In this section, we state and prove our approachability theorem for MDPs.

Sufficient Condition and Strategy for Approachability for Convex Sets

Theorem 16. (i) (Sufficient Condition) A closed convex set $D$ is approachable in an arbitrary MDP (satisfying Assumption 6) if the following holds: for every $x\in\mathbb R^K\setminus D$, there exists a (possibly non-unique) stationary randomized policy $\pi(x)$ such that $\langle c(\pi(x)) - P_D(x),\lambda(x)\rangle\le 0$.
(ii) (Strategy for Approachability) A strategy for approachability is: at each time step $n+1$, select the action $a_{n+1}$ according to a policy $\pi_{x_n}$ such that $\pi_{x_n}\in\Pi(x_n)$.

The geometric meaning of the above theorem is intuitive. For every point $x$ outside $D$ the player has a strategy $\pi_x$ such that the expected vector-cost corresponding to that policy lies in the halfspace containing $D$ defined by the supporting hyperplane for the set $D$ at $P_D(x)$. However, the strategy for approachability is slightly non-obvious. Since we are changing the policy at each time instant, it is not obvious how the corresponding MDP gets enough time to converge to the value corresponding to that policy. Later, we will show that this indeed happens via a stochastic approximation scheme.

Note that there is a modeling issue here. If one knows or can compute a policy $\breve\pi$ such that $c(\breve\pi)\in D$, all one has to do is to implement it, to reach $D$.
Thus, the above model has a somewhat awkward viewpoint: while at time $n$ the policy $\pi_{x_n}$ can be figured out, no such $\breve\pi$ can be. However, our purpose in analyzing this case is merely to pave the way for the learning scheme that appears in the next section.

We first give an overview of the proof idea. The detailed proof is given in Section 5.1.2. The average vector-cost till time step $n+1$ can be written as
\[
\begin{aligned}
x_{n+1} &= \frac{1}{n+1}\sum_{m=1}^{n+1} c(s_m,a_m) = x_n + \gamma_1(n)\big(c(s_{n+1},a_{n+1}) - x_n\big) \\
&= x_n + \gamma_1(n)\big(c(s_{n+1},\pi_{x_n}) - x_n + M^{(1)}_{n+1}\big),
\end{aligned} \tag{5.4}
\]
where $\gamma_1(n) = 1/(n+1)$, $c(s_{n+1},\pi_{x_n}) = \sum_{a\in A}\pi_{x_n}(s_{n+1},a)\,c(s_{n+1},a)$ and $M^{(1)}_{n+1} = c(s_{n+1},a_{n+1}) - c(s_{n+1},\pi_{x_n})$. The key idea in the analysis of (5.4) is to show that it asymptotically tracks the differential inclusion given by
\[
\dot x(t) \in d(x(t)) - x(t), \tag{5.5}
\]
where $d(x) := \{c(\pi): \pi\in\Pi(x)\}$. Then, by showing that the dynamics given by (5.5) converge to the set $D$, we will conclude that the sequence $\{x_n\}$ also converges to the same set a.s.

Proof of Theorem 16

Before we give the detailed proof, it is necessary to verify the existence of a solution to the differential inclusion (5.5). We first state some regularity properties of (5.5).

Proposition 16. For each $x\in\mathbb R^K\setminus D$,
(i) $\sup_{y\in d(x)}\|y\| < K(1+\|x\|)$.
(ii) $d(x)$ is convex, compact and upper semicontinuous.

Both are rather straightforward to prove using standard analysis techniques and hence we omit the proof. It is well known that the differential inclusion in (5.5), with the regularity conditions given by Proposition 16, admits a solution through every initial point [111].

Even though equation (5.4) appears to be a standard single timescale stochastic approximation iteration, it is much more complicated than that and involves an implicit multiple timescale process. The sequence $\{x_n\}$ is 'affected' by the process $\{s_n\}$ running in the background on the true or 'natural' timescale, which corresponds to the time index $n$ itself. Since $\gamma_1(n)$ can be considered as time steps and $\gamma_1(n)\to 0$, the process $\{x_n\}$ evolves on a slower timescale than the process $\{s_n\}$. The analysis involved is usually called 'averaging the natural timescale' and is described in detail in [107, Section 6.2]. Here we give only the results relevant for our problem. Readers are referred to [107, Section 6.2] for proofs and details. [112], [113] contain pioneering work on stochastic approximations with differential inclusions. We shall, however, refer to [107] as a common source for all facts regarding stochastic approximations, for convenience and uniformity of notation.

The basic approach to the analysis of (5.4) is to construct a suitable continuous interpolated trajectory $x(t)$, $t\ge 0$, and show that it asymptotically almost surely approaches the solution set of (5.5). This is done as follows. Define $t(0) = 0$, $t(n) = \sum_{m=0}^{n-1}\gamma_1(m)$, $n\ge 1$. Clearly $t(n)\uparrow\infty$. Let $I_n := [t(n),t(n+1))$, $n\ge 0$. Define a continuous, piecewise linear $x(t)$, $t\ge 0$, by $x(t(n)) = x_n$, $n\ge 0$, with linear interpolation in each interval $I_n$. That is,
\[
x(t) = x_n + (x_{n+1} - x_n)\,\frac{t - t(n)}{t(n+1) - t(n)}, \quad t\in I_n. \tag{5.6}
\]
Define a $\mathcal P(S\times A)$-valued random process $\mu(t) = \mu(t,(s,a))$, $t\ge 0$, by
\[
\mu(t) := \delta_{(s_n,a_n)}, \quad t\in I_n,\ n\ge 0, \tag{5.7}
\]
where $\delta$ is the Kronecker delta function. Also, define for $t > v\ge 0$, and for $B$ Borel in $[v,t]$,
\[
\mu^t_v(B\times(s,a)) := \frac{1}{t-v}\int_B \mu(y,(s,a))\,dy.
\]
Two necessary conditions for the analysis in [107] to carry through are:
1. Almost surely, $\sup_n\|x_n\| < \infty$.
2. Almost surely, for any $t > 0$, the set $\{\mu^{v+t}_v, v\ge 0\}$ remains tight.
Since we are dealing only with finite spaces, we do not need to worry about the measurability issues discussed in [107]. For the same reason, conditions 1 and 2 are also true.

Let $\Phi$ be the set of ergodic occupation measures over the set $S\times A$, i.e., any $\phi\in\Phi$ can be decomposed as $\phi(s,a) = \eta_\pi(s)\pi(s,a)$, where $\pi(\cdot,\cdot)$ is a stationary randomized policy and $\eta_\pi(\cdot)$ is an invariant probability measure under the policy $\pi$. For any $x\in\mathbb R^K\setminus D$, let $\Phi(x)\subset\Phi$ be such that any $\phi\in\Phi(x)$ can be decomposed as $\phi(s,a) = \eta_{\pi_x}(s)\pi_x(s,a)$ with $\pi_x\in\Pi(x)$.
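The recursion (5.4) and the time grid $t(n)$ used for the interpolated trajectory can be checked on a toy sequence. The sketch below is illustrative only: the cost sequence `costs_example` and the function names are assumptions, not taken from the text. It verifies that the incremental update with step size $1/(n+1)$ reproduces the ordinary running average, and it builds $t(n)$ as partial sums of the step sizes.

```python
def running_average_path(costs):
    """Iterate x_{n+1} = x_n + gamma(n) * (c_{n+1} - x_n), gamma(n) = 1/(n+1).

    costs[n] plays the role of c(s_{n+1}, a_{n+1}); x_0 is taken as the
    zero vector, which is irrelevant because gamma(0) = 1.
    """
    x = [0.0] * len(costs[0])
    path = []
    for n, c in enumerate(costs):
        gamma = 1.0 / (n + 1)
        x = [xk + gamma * (ck - xk) for xk, ck in zip(x, c)]
        path.append(x)
    return path


def interpolation_times(n_max):
    """Return [t(0), ..., t(n_max)] with t(n) = sum_{m<n} 1/(m+1)."""
    t = [0.0]
    for m in range(n_max):
        t.append(t[-1] + 1.0 / (m + 1))
    return t


# Hypothetical sequence of observed vector costs (K = 2).
costs_example = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
path_example = running_average_path(costs_example)
times_example = interpolation_times(4)
```

Here the last entry of `path_example` is the plain average of the four cost vectors, and `times_example` grows like the harmonic series, which is why $t(n)\uparrow\infty$ even though the individual steps shrink to zero.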
For $\nu\in\mathcal P(S\times A)$, define
\[
\tilde h(x,\nu) := \sum_{(s,a)\in S\times A} (c(s,a) - x)\,\nu(s,a).
\]
For $\mu(\cdot)$ defined in (5.7), consider the non-autonomous o.d.e.
\[
\dot x(t) = \tilde h(x(t),\mu(t)). \tag{5.8}
\]
Let $x^v(t)$, $t\ge v$, denote the solution to (5.8) with $x^v(v) = x(v)$, for $v\ge 0$.

A set $A\subset\mathbb R^K$ is said to be an invariant set for the differential inclusion (5.5) if for $x(0)\in A$ there is some trajectory $x(t)$, $t\in(-\infty,\infty)$, that lies entirely in $A$. An invariant set is said to be an internally chain transitive invariant set if for any $x,y\in A$ and any $\epsilon > 0$, $T > 0$, there exist an $n\ge 1$ and points $x_0 = x, x_1,\ldots,x_{n-1},x_n = y$ in $A$, such that the trajectory of (5.5) initiated at $x_i$ meets the $\epsilon$-neighborhood of $x_{i+1}$ for $0\le i < n$ after a time $t\ge T$.

We use the following important result from [107].

Theorem 17. [107, Theorem 7, Chapter 6] Almost surely, $\{x(v+\cdot), v\ge 0\}$ converge to an internally chain transitive invariant set of the differential inclusion
\[
\dot x(t) \in \{\tilde h(x,\nu): \nu\in\Phi(x)\} := d(x(t)) - x(t) \tag{5.9}
\]
as $t\uparrow\infty$. In particular, $\{x_n\}$ converge a.s. to such a set.

We now show that any path corresponding to the differential inclusion dynamics given by (5.5) converges to the set $D$. Some definitions are in order. A compact invariant set $M$ is called an attractor of a dynamical system if it has an open neighborhood $O$ such that every trajectory in $O$ remains in $O$ and converges to $M$. The largest such $O$ is called the domain of attraction. An attractor $M$ is called a global attractor if the domain of attraction is $\mathbb R^K$.

Proposition 17. The set $D$ is a global attractor for the differential inclusion specified by (5.5).

Proof. Consider the Lyapunov function $V(x) = \min_{z\in D}\frac{1}{2}\|x - z\|^2$. We first note that by Danskin's Theorem ([114], p. 717), $\nabla V(x) = (x - P_D(x))$. So,
\[
\frac{d}{dt}V(x(t)) = \langle\nabla V(x(t)),\dot x(t)\rangle = \langle x(t) - P_D(x(t)), y(t)\rangle
\]
for $y(t)\in d(x(t)) - x(t)$. By assumption, there exists a policy $\pi(x)$ such that $\langle c(\pi(x)),\lambda(x)\rangle\le\langle P_D(x),\lambda(x)\rangle$. Then, for any optimal policy $\pi_x\in\Pi(x)$, which minimizes $\pi\mapsto\langle c(\pi),\lambda(x)\rangle$, we have $\langle c(\pi_x),\lambda(x)\rangle\le\langle P_D(x),\lambda(x)\rangle$.
So for any $x(t)$, $\langle c(\pi_{x(t)}) - P_D(x(t)), x(t) - P_D(x(t))\rangle\le 0$ and hence
\[
\langle c(\pi_{x(t)}) - x(t), x(t) - P_D(x(t))\rangle \le -\|x(t) - P_D(x(t))\|^2.
\]
This gives
\[
\frac{d}{dt}V(x(t)) \le -2V(x(t)) \implies V(x(t)) \le V(x(0))\,e^{-2t}.
\]
Thus, $D$ is a global attractor.

Using Theorem 17 and Proposition 17, we now prove the Approachability Theorem for MDPs, i.e., Theorem 16.

Proof. (of Theorem 16) By Theorem 17, $\{x_n\}$ converges a.s. to an internally chain transitive invariant set of the differential inclusion given by (5.5). Since $D$ is a global attractor, this internally chain transitive invariant set is a subset of $D$ [107]. Hence, $\{x_n\}$ converges a.s. to $D$.

Necessary Condition

We now show that the approachability condition given in Theorem 16 is also a necessary condition, and thus give a complete characterization of the approachable convex sets in MDPs.

Proposition 18 (Necessary Condition). If a closed convex set $D$ is approachable in an arbitrary MDP (satisfying Assumption 6), then
(i) every half-space containing $D$ is approachable, and
(ii) for every $x\in\mathbb R^K\setminus D$, there exists a (possibly non-unique) stationary randomized policy $\pi(x)$ such that $\langle c(\pi(x)) - P_D(x),\lambda(x)\rangle\le 0$.

Proof. Claim (i) is obvious. We now show that (i) implies (ii) and complete the argument. Let $x\in\mathbb R^K\setminus D$ and let $H_x$ be the supporting half-space to the set $D$ at the point $P_D(x)$, given by $H_x := \{y\in\mathbb R^K: \langle y - P_D(x),\lambda(x)\rangle\le 0\}$. Since every half-space containing $D$ is approachable, there exists a policy $\sigma$ (possibly non-stationary) for the player such that $\limsup_{n\to\infty}\langle x_n - P_D(x),\lambda(x)\rangle\le 0$. Since $|\langle x_n - P_D(x),\lambda(x)\rangle|$ is bounded,
\[
\inf_{\sigma\in\Sigma} E_\sigma\Big[\limsup_{n\to\infty}\langle x_n,\lambda(x)\rangle\Big] \le \langle P_D(x),\lambda(x)\rangle. \tag{5.10}
\]
The LHS of equation (5.10) is the optimal expected average cost corresponding to an MDP with scalar stage cost $\tilde c(s,a;x) = \langle c(s,a),\lambda(x)\rangle$. And from the theory of MDPs [108], there exists a stationary randomized policy $\pi(x)$ such that this cost is equal to $\langle c(\pi(x)),\lambda(x)\rangle$.
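The Lyapunov argument of Proposition 17 can be illustrated numerically. The sketch below is a toy example under stated assumptions, not the construction used in the text: the target set $D$ is taken to be the closed unit ball in $\mathbb R^2$, and the set of achievable average costs $\{c(\pi)\}$ is replaced by a hypothetical finite list `CANDIDATES` of points inside $D$, so the sufficient condition of Theorem 16 holds trivially. At each Euler step the candidate minimizing $\langle c, x - P_D(x)\rangle$ is selected, mimicking $\pi_x\in\Pi(x)$, and the value $V(x) = \frac{1}{2}\|x - P_D(x)\|^2$ is recorded.

```python
import math

# Hypothetical achievable average-cost vectors, all inside the unit ball D.
CANDIDATES = [(0.5, 0.0), (-0.5, 0.0), (0.0, 0.5), (0.0, -0.5)]


def project_ball(x):
    """Euclidean projection of x onto the closed unit ball."""
    r = math.hypot(x[0], x[1])
    return x if r <= 1.0 else (x[0] / r, x[1] / r)


def lyapunov(x):
    """V(x) = 0.5 * squared distance from x to the unit ball."""
    p = project_ball(x)
    return 0.5 * ((x[0] - p[0]) ** 2 + (x[1] - p[1]) ** 2)


def simulate(x0, h=0.01, steps=300):
    """Euler discretization of x' = c(pi_x) - x with step h."""
    x = x0
    values = [lyapunov(x)]
    for _ in range(steps):
        p = project_ball(x)
        d = (x[0] - p[0], x[1] - p[1])   # positive multiple of lambda(x)
        if d != (0.0, 0.0):
            # Steering choice: minimize the cost along the direction d.
            y = min(CANDIDATES, key=lambda c: c[0] * d[0] + c[1] * d[1])
            x = (x[0] + h * (y[0] - x[0]), x[1] + h * (y[1] - x[1]))
        values.append(lyapunov(x))
    return values


STEP = 0.01
values_example = simulate((3.0, 2.0), h=STEP, steps=300)
```

In this toy setting each Euler step is a convex combination of $x$ and a point of $D$, so the distance to $D$ shrinks by at least the factor $(1-h)$ per step, and the recorded values stay below the bound $V(x(0))e^{-2t}$ with $t = hk$.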
Extension to Non-Convex Sets

Here we give the sufficient condition and strategy for approachability when the target set is non-convex. Note that, given an $x\in\mathbb R^K\setminus D$, the closest point to $x$ in $D$ may not be unique. Let $\tilde P_D(x)$ be the set of all such points.

Theorem 18. (i) (Sufficient Condition) A closed set $D$ is approachable in an arbitrary MDP (satisfying Assumption 6) if the following holds: for every $x\in\mathbb R^K\setminus D$ and for every $P_D(x)\in\tilde P_D(x)$, there exists a (possibly non-unique) stationary randomized policy $\pi(x)$ such that $\langle c(\pi(x)) - P_D(x),\lambda(x)\rangle\le 0$.
(ii) (Strategy for Approachability) A strategy for approachability is: at each time step $n+1$, select a $P_D(x)\in\tilde P_D(x)$, and select the action $a_{n+1}$ according to a policy $\pi_{x_n}$ such that $\pi_{x_n}\in\Pi(x_n)$.

The condition given above is not necessary for approachability. This can be easily seen by considering a set $D$ that is the union of two disjoint convex sets.

The proof technique used in Theorem 16 is not directly applicable here because the Lyapunov function defined there may not be differentiable when the set $D$ is non-convex. We overcome this difficulty by using semidifferentials and a general version of the envelope theorem [115].

We first give some basic definitions and results from [115, Pages 29, 42-46] that are necessary for the proof.

Definition 5 (Semi-differentials). Let $V:\mathbb R^K\to\mathbb R$. The super- and sub-differential of $V$ (or semi-differentials) at $x$, $D^+V(x)$ and $D^-V(x)$, are defined as
\[
D^+V(x) := \Big\{ p\in\mathbb R^K: \limsup_{y\to x,\ y\in\mathbb R^K} \frac{V(y) - V(x) - p\cdot(y-x)}{|x-y|} \le 0 \Big\},
\]
\[
D^-V(x) := \Big\{ p\in\mathbb R^K: \liminf_{y\to x,\ y\in\mathbb R^K} \frac{V(y) - V(x) - p\cdot(y-x)}{|x-y|} \ge 0 \Big\}.
\]
Let $V$ be such that $V(x) := \inf_{z\in D} g(x,z)$, where $g:\mathbb R^K\times D\to\mathbb R$, $D\subset\mathbb R^K$. Assume that
\[
\text{(A1) } g \text{ is bounded and } g(\cdot,z) \text{ is differentiable at } x \text{ uniformly in } z, \tag{5.11}
\]
\[
\text{(A2) } z\mapsto D_x g(x,z) \text{ is continuous and } z\mapsto g(x,z) \text{ is lower semicontinuous}, \tag{5.12}
\]
where $D_x g$ is the partial derivative of $g$ w.r.t. $x$.
Let
\[
\tilde P_D(x) := \arg\min_{z\in D} g(x,z) := \{z\in D: V(x) = g(x,z)\}, \qquad Y(x) := \{D_x g(x,z): z\in\tilde P_D(x)\}.
\]
Also, the (one-sided) directional derivative of $V$ in the direction of $q$ is
\[
D^+V(x)(q) := \lim_{h\to 0^+} \frac{V(x+hq) - V(x)}{h}.
\]
We use the general version of the envelope theorem given below.

Proposition 19. [115, Proposition 2.13, Page 44] Let $D$ be a compact set and let $g$ satisfy assumptions (5.11)-(5.12). Then,
\[
Y(x)\ne\emptyset, \qquad D^+V(x) = \mathrm{co}\,Y(x), \qquad
D^-V(x) = \begin{cases} \{y\} & \text{if } Y(x) = \{y\}, \\ \emptyset & \text{if } Y(x) \text{ is not a singleton.} \end{cases}
\]
Moreover, $V$ has the (one-sided) directional derivative in any direction $q$, given by
\[
D^+V(x)(q) = \min_{y\in Y(x)} y\cdot q = \min_{p\in D^+V(x)} p\cdot q.
\]

Proof. (of Theorem 18) The proof is similar to that of Theorem 16, except for the fact that the Lyapunov function may not be differentiable when the set $D$ is non-convex. We overcome this difficulty by using semidifferentials. Let $V(x) = \min_{z\in D}\frac{1}{2}\|x - z\|^2$ be the Lyapunov function. Then, $D^+V(x) = \mathrm{co}\{(x - P_D(x)): P_D(x)\in\tilde P_D(x)\}$. Let $v(t) = V(x(t))$; by the chain rule [111, Proposition 7, Page 288],
\[
D^+v(t) = D^+V(x(t))(\dot x(t)) = D^+V(x(t))(y(t))
\]
for $y(t)\in d(x(t)) - x(t)$. As before, we can also show that, for all $P_D(x(t))\in\tilde P_D(x(t))$,
\[
\langle c(\pi_{x(t)}) - x(t), x(t) - P_D(x(t))\rangle \le -\|x(t) - P_D(x(t))\|^2 = -2V(x(t)) = -2v(t).
\]
Then,
\[
D^+v(t) = \min_{P_D(x(t))\in\tilde P_D(x(t))} \langle c(\pi_{x(t)}) - x(t), x(t) - P_D(x(t))\rangle \le -2v(t),
\]
and by [111, Proposition 8, Page 289], we get
\[
v(t) - v(0) + 2\int_0^t v(\tau)\,d\tau \le 0.
\]
Now, by Gronwall's inequality, $v(t)\le v(0)e^{-2t}$. The rest of the proof is the same as before.

Remark 7. (i) We note that our approachability theorems are similar to that of Shimkin and Shwartz [71] but weaker than that of Milman [72], which does not require Assumption 6. However, this generality comes at the cost of an inability to develop a learning algorithm for approachability. Our approachability strategy is different from that in [71] (and also [72]), and this has necessitated a proof completely different from any of the prior works.
Moreover, we are able to give a learning version of the approachability strategy, something quite difficult to obtain from either of the prior works. Kamal [73] gives a stochastic approximation-inspired iterative scheme for approachability, but it is not a learning scheme as it depends on knowing the model fully. Thus, it seems that ours is the first approachability result for MDPs which uses a stochastic approximation scheme for its proof and, as a by-product, yields a natural learning scheme.
(ii) We also note that all these works address the approachability question for general stochastic games, whereas our theorem above is for MDPs. We give an approachability theorem and a learning scheme for Stackelberg stochastic games in Section 5.2. In general, a learning scheme for general stochastic games is a long-standing open problem. We are able to give a learning scheme for a special case, namely the Stackelberg stochastic game.

5.1.3 Reinforcement Learning Algorithm for Blackwell Approachability in MDPs

In this section, we introduce a reinforcement learning algorithm for approachability in multi-objective MDPs. The Approachability Theorem for MDPs shows that if the agent selects her action at time step $n+1$ according to the policy $\pi_{x_n}$ such that $\pi_{x_n}\in\Pi(x_n)$, then $x_n$ approaches the desired set $D$. Given $x_n$, such a policy can be easily computed if one knows the transition kernel $p(\cdot,\cdot,\cdot)$. The problem of 'learning' arises when this transition kernel is unknown but one has access to a simulation device that can generate, for any $s\in S$, $a\in A$, an independent $S$-valued random variable whose probability law is $p(s,a,\cdot)$. So, the objective of a learning algorithm is to 'learn' such a policy $\pi_{x_n}$ at each time step $n$ using this simulation device.

We first review some basic theory for average-cost MDPs from [46]. It is known that under Assumption 6, one can associate a value function with the problem of finding the optimal cost of a standard average-cost MDP, i.e., one can find $V: S\to\mathbb R$ and a scalar $c_{opt}$ such that they satisfy the dynamic programming equation
\[
V(s) = \min_a \Big[ \tilde c(s,a;x) + \sum_{s'} p(s,a,s')V(s') \Big] - c_{opt}, \quad s\in S. \tag{5.13}
\]
Here $c_{opt}$ is the optimal cost, which is unique, while $V(\cdot)$ is unique only up to an additive constant. The same dynamic programming equation can be written in terms of the 'Q-value', defined by the expression in the square brackets on the right, as
\[
Q(s,a) = \tilde c(s,a;x) + \sum_{s'} p(s,a,s')\min_b Q(s',b) - c_{opt}, \quad s\in S,\ a\in A, \tag{5.14}
\]
with $V(s) = \min_a Q(s,a)$. Equation (5.14) is useful because it can be shown that a stationary policy $\pi$ is optimal if and only if $\pi(s)\in\arg\min Q(s,\cdot)$, $\forall s$.

Typically, the optimal average cost (along with the corresponding optimal value function or the optimal Q function) is computed by using a technique called Relative Value Iteration (RVI),
\[
V_{n+1}(s) = \min_a \Big[ \tilde c(s,a;x) + \sum_{s'} p(s,a,s')V_n(s') - f(V_n) \Big], \quad s\in S, \tag{5.15}
\]
where $f$ is any Lipschitz function satisfying the following properties: for the all-ones vector $e$, $f(e) = 1$, $f(y+ce) = f(y)+c$ and $f(cy) = cf(y)$ for $c\in\mathbb R$. As a simple example, we can set $f(V) = V(s_0)$ for a fixed $s_0\in S$. It is known that $V_n\to V$ such that $f(V) = c_{opt}$. Similarly, one can specify RVI for the Q function as
\[
Q_{n+1}(s,a) = \tilde c(s,a;x) + \sum_{s'} p(s,a,s')\min_b Q_n(s',b) - f(Q_n), \quad s\in S,\ a\in A. \tag{5.16}
\]
We kept $x$ fixed in the foregoing. We continue to do so in what follows and suppress the $x$ dependence in order to simplify the notation.

Both the RVI algorithms given above make use of the knowledge of the transition kernel $p(\cdot,\cdot,\cdot)$. In [46], Abounadi et al. proposed a learning algorithm for computing the optimal cost and optimal policy of an average-cost MDP. The algorithm is called RVI Q-learning. We first assume that we have access to a simulation function $\xi$ such that
\[
\xi: S\times A\times[0,1]\to S \tag{5.17}
\]
and $P(\xi(s,a,\omega) = s') = p(s,a,s')$, where $P(\cdot)$ is taken with respect to the uniform random variable $\omega\in[0,1]$.
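To make the two objects just introduced concrete, here is a minimal sketch on a hypothetical two-state, two-action MDP (the kernel `P`, the scalar costs `COST` and all function names are assumptions made for illustration). The function `xi` realizes a simulation function as in (5.17) through the inverse-CDF construction, one common way to obtain $P(\xi(s,a,\omega) = s') = p(s,a,s')$. The function `rvi_q` iterates (5.16) with the choice $f(Q) = Q(s_0,a_0)$ for a fixed reference pair, which satisfies the stated properties of $f$.

```python
# Hypothetical kernel p(s, a, .) and scalar stage cost for fixed x.
P = {(0, 0): [0.9, 0.1], (0, 1): [0.2, 0.8],
     (1, 0): [0.5, 0.5], (1, 1): [0.1, 0.9]}
COST = {(0, 0): 0.0, (0, 1): 0.0, (1, 0): 1.0, (1, 1): 1.0}
STATES = (0, 1)
ACTIONS = (0, 1)


def xi(s, a, omega):
    """Simulation function: map omega in [0, 1] to a next state by
    inverting the cumulative distribution of p(s, a, .)."""
    cumulative = 0.0
    for s_next, prob in enumerate(P[(s, a)]):
        cumulative += prob
        if omega < cumulative:
            return s_next
    return len(P[(s, a)]) - 1   # guard against rounding when omega == 1


def rvi_q(iters=2000, ref=(0, 0)):
    """Relative value iteration for Q-values, equation (5.16),
    with f(Q) = Q(ref)."""
    q = {sa: 0.0 for sa in P}
    for _ in range(iters):
        f_q = q[ref]
        v = [min(q[(s, b)] for b in ACTIONS) for s in STATES]
        q = {(s, a): COST[(s, a)]
             + sum(pr * v[t] for t, pr in enumerate(P[(s, a)]))
             - f_q
             for (s, a) in P}
    return q


q_star = rvi_q()
c_opt_estimate = q_star[(0, 0)]   # f(Q*) should equal the optimal cost
greedy_policy = {s: min(ACTIONS, key=lambda a: q_star[(s, a)])
                 for s in STATES}
```

For this toy kernel the best stationary policy plays action 0 in both states, giving stationary distribution (5/6, 1/6) and average cost 1/6, so `c_opt_estimate` should be close to 1/6. In a learning setting one would replace the exact expectation inside `rvi_q` by samples drawn with `xi`.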
The synchronous RVI Q-learning algorithm is given by
$$Q_{n+1}(s,a) = Q_n(s,a) + \gamma(n) \Big( \tilde{c}(s,a;x) + \min_b Q_n(\xi^{sa}_n, b) - f(Q_n) - Q_n(s,a) \Big), \tag{5.18}$$
where $\xi^{sa}_n = \xi(s,a,\omega_n)$ and the $\omega_n$ are independent and uniformly distributed on $[0,1]$. Here, $\gamma(n)$ is a standard stochastic approximation step size satisfying the conditions
$$\sum_n \gamma(n) = \infty, \qquad \sum_n \gamma^2(n) < \infty. \tag{5.19}$$
Similarly, the asynchronous version of RVI Q-learning is given by
$$Q_{n+1}(s,a) = Q_n(s,a) + \gamma(\nu(s,a,n))\, I\{(s,a) = (s_n,a_n)\} \Big( \tilde{c}(s,a;x) + \min_b Q_n(\xi^{sa}_n, b) - f(Q_n) - Q_n(s,a) \Big), \tag{5.20}$$
where $\nu(s,a,n) := \sum_{m=0}^{n} I\{(s,a) = (s_m,a_m)\}$ and $\gamma(n)$ satisfies the additional conditions
$$\sup_n \frac{\gamma([yn])}{\gamma(n)} < \infty, \qquad \frac{\sum_{m=0}^{[yn]} \gamma(m)}{\sum_{m=0}^{n} \gamma(m)} \to 1 \ \text{ uniformly in } y \in (0,1). \tag{5.21}$$
Also, the relative sampling frequency of state-action pairs should be bounded away from zero, i.e.,
$$\liminf_{n \to \infty} \frac{\nu(s,a,n)}{n+1} > 0 \quad \text{a.s.} \quad \forall (s,a) \in S \times A. \tag{5.22}$$
We can rewrite the synchronous RVI Q-learning equation (5.18) as
$$Q_{n+1}(s,a) = Q_n(s,a) + \gamma(n) \Big( T(Q_n)(s,a) - f(Q_n) - Q_n(s,a) + M^{(2)}_{n+1} \Big), \tag{5.23}$$
where the mapping $T$ is defined as
$$(TQ)(s,a) = \tilde{c}(s,a;x) + \sum_{s'} p(s,a,s') \min_b Q(s',b), \tag{5.24}$$
and
$$M^{(2)}_{n+1} = \tilde{c}(s,a;x) + \min_b Q_n(\xi^{sa}_n, b) - (TQ_n)(s,a). \tag{5.25}$$
The RVI Q-learning algorithm given in equation (5.23) is in the form of a standard stochastic approximation algorithm. The corresponding o.d.e. limit is
$$\dot{Q}(t) = c\,\big(T'(Q(t)) - Q(t)\big), \tag{5.26}$$
where $T'(Q) = T(Q) - f(Q)e$ and $c > 0$. Using standard stochastic approximation theory, it can be shown that the sequence $Q_n$ asymptotically tracks this o.d.e. So, if we can show that this o.d.e. converges to $Q^*$ such that $f(Q^*) = c_{\mathrm{opt}}$, then convergence of the RVI Q-learning algorithm to $Q^*$ can be deduced from that. The following relevant result is available in [46].

Theorem 19 ([46]). (i) $Q^*$ is the globally asymptotically stable equilibrium point for (5.26). (ii) The sequence $\{Q_n\}$ given by the synchronous RVI Q-learning algorithm (5.18) or its asynchronous version (5.20) is bounded almost surely.
(iii) In both the synchronous and the asynchronous RVI Q-learning algorithms, if $\{Q_n\}$ remains bounded almost surely, then $Q_n \to Q^*$.

Now we give our Q-learning algorithm for approachability in MDPs: select action $a_{n+1}$ according to the policy $\pi_n$ and update $x_n, Q_n, \pi_n$ as
$$x_{n+1} = x_n + \gamma_1(n)\big(c(s_{n+1},a_{n+1}) - x_n\big), \qquad \gamma_1(n) = 1/(n+1), \tag{5.27}$$
$$Q_{n+1}(s,a) = Q_n(s,a) + \gamma_2(\nu(s,a,n))\, I\{(s_n,a_n) = (s,a)\} \Big( \tilde{c}(s,a;x_n) + \min_b Q_n(s_{n+1},b) - f(Q_n) - Q_n(s,a) \Big), \tag{5.28}$$
$$\pi_{n+1}(s,\cdot) \in \arg\min_a Q_{n+1}(s,a), \tag{5.29}$$
where $\gamma_2(n)$ satisfies the conditions (5.19) - (5.21). It is clear that $\gamma_1(n)$ also satisfies these conditions. Moreover, $\gamma_2(n)$ should satisfy the condition
$$\frac{\gamma_1(n)}{\gamma_2(n)} \to 0. \tag{5.30}$$
The synchronous case is written analogously.

Remark 8. The proof of convergence of the above requires the condition given in (5.22). This is not ensured by the above choice of $\pi_{n+1}$, so one usually employs some randomization to ensure the adequate 'exploration' implicit in the condition (5.22). One way to do so is to choose, for some $0 < \epsilon \ll 1$, $\pi_{n+1}(s,\cdot) = (1-\epsilon)\eta + \epsilon\zeta$, where $\eta(\arg\min Q_{n+1}(s,\cdot)) = 1$ and $\zeta$ is uniform on $A$. This however ensures only near-optimality in online scenarios. This is standard for Q-learning and we shall not discuss it in detail.

Under the above assumptions we have the following theorem.

Theorem 20. Assume that the stochastic approximation step sizes $\gamma_1(n)$, $\gamma_2(n)$ and the relative sampling frequency $\nu(s,a,n)$ satisfy the assumptions (5.19) - (5.22) and (5.30). Then, in both the synchronous and asynchronous reinforcement learning algorithms for Blackwell's Approachability given by equations (5.27) - (5.29), $\|x_n - D\| \to 0$ almost surely.

Proof. For notational simplicity, we consider only the synchronous case.
First, we rewrite the equations (5.27) and (5.28) as
$$x_{n+1} = x_n + \gamma_1(n) \Big( c(s_{n+1},\pi_n) - x_n + M^{(1)}_{n+1} \Big), \tag{5.31}$$
$$Q_{n+1}(s,a) = Q_n(s,a) + \gamma_2(n) \Big( T(Q_n)(s,a) - f(Q_n) - Q_n(s,a) + M^{(2)}_{n+1} \Big), \tag{5.32}$$
where $c(s_{n+1},\pi_n) = \sum_{a \in A} c(s_{n+1},a)\,\pi_n(s_{n+1},a)$, $M^{(1)}_{n+1} = c(s_{n+1},a_{n+1}) - c(s_{n+1},\pi_n)$ and $M^{(2)}_{n+1}$ is as given in (5.25). This is a standard two timescale stochastic approximation iteration where $x_n$ is on a slower timescale compared to $Q_n$.

We first claim that $x_n$ can be considered 'quasi-static' for the analysis of (5.32). For this, we rewrite (5.31) as
$$x_{n+1} = x_n + \gamma_2(n) \Big( \epsilon^{(1)}_n + M^{(3)}_{n+1} \Big), \tag{5.33}$$
where $\epsilon^{(1)}_n = \frac{\gamma_1(n)}{\gamma_2(n)} \big(c(s_{n+1},\pi_n) - x_n\big)$ and $M^{(3)}_{n+1} = \frac{\gamma_1(n)}{\gamma_2(n)} M^{(1)}_{n+1}$. Clearly, $\epsilon^{(1)}_n \to 0$ almost surely. Then, by [107, Section 2.2], $(x_n, Q_n)$ will converge to the internally chain transitive invariant set of the o.d.e. $\dot{x}(t) = 0$, $\dot{Q}(t) = T'(Q(t)) - Q(t)$. Then, by Theorem 19, $Q_n - Q^*(x_n) \to 0$ a.s., where $Q^*(x_n)$ is the optimal Q function of the scalar-valued MDP with cost function $\tilde{c}(s,a;x_n)$. Then, we can conclude that $(x_n, \pi_n)$ converges to the set $\{(x,\pi_x) : x \in \mathbb{R}^K, \pi_x \in \Pi(x)\}$. It follows that $c(s,\pi_n) - c(s,\pi_{x_n}) \to 0$ almost surely.

Now, we rewrite equation (5.31) as
$$x_{n+1} = x_n + \gamma_1(n) \Big( \big(c(s_{n+1},\pi_{x_n}) - x_n\big) + \big(c(s_{n+1},\pi_n) - c(s_{n+1},\pi_{x_n})\big) + M^{(1)}_{n+1} \Big) = x_n + \gamma_1(n) \Big( \big(c(s_{n+1},\pi_{x_n}) - x_n\big) + \epsilon^{(2)}_n + M^{(1)}_{n+1} \Big), \tag{5.34}$$
where $\epsilon^{(2)}_n = c(s_{n+1},\pi_n) - c(s_{n+1},\pi_{x_n})$. From the argument above, it is clear that $\epsilon^{(2)}_n \to 0$ almost surely. Then, the asymptotic behavior of (5.34) is the same as that of the equation
$$x_{n+1} = x_n + \gamma_1(n) \Big( \big(c(s_{n+1},\pi_{x_n}) - x_n\big) + M^{(1)}_{n+1} \Big), \tag{5.35}$$
because $\{\epsilon^{(2)}_n\}$ will contribute only an additional error term which is asymptotically negligible. (See, e.g., [107], p. 17.) This is the same as equation (5.4) and hence, by Theorem 16, $\|x_n - D\| \to 0$ almost surely.

5.2 Blackwell's Approachability Theorem for Stackelberg Stochastic Games

In this section, we show analogous approachability theorems and learning algorithms for Stackelberg stochastic games.
We note that computation of equilibria of general stochastic games is a longstanding open problem. Computational algorithms and learning schemes are only known for some special cases, such as the zero-sum and the single-controller cases. The results from the previous section are not immediately relevant here because there are now two decision makers with a general-sum game between them.

5.2.1 Preliminaries

Consider a stochastic game with two players, finite state space $S$, and finite action space $A = A^1 \times A^2$, where $A^i$ is the action space of player $i$, $i = 1, 2$. An element $a = (a^1, a^2) \in A$ is called an action vector. Let $p(\cdot,\cdot,\cdot)$ be the transition kernel that governs the system evolution, $p(s,a,\cdot) \in \mathcal{P}(S)$ for all $s \in S$, $a \in A$. Let $c^i : S \times A \to \mathbb{R}^K$ be the vector-cost function for player $i$, and for a given state $s \in S$ and an action vector $a \in A$, $c^i(s,a) = [c^i_1(s,a), \ldots, c^i_K(s,a)]^{\dagger}$ where $c^i_j : S \times A \to \mathbb{R}$ for $1 \le j \le K$. We assume that the cost function is bounded and without loss of generality assume that $|c^i_j(s,a)| \le 1$, $1 \le j \le K$, $1 \le i \le 2$, $\forall s \in S$, $\forall a \in A$. The average vector-cost incurred by player $i$ till time $n$ is denoted by $x^i_n = \frac{1}{n} \sum_{m=1}^{n} c^i(s_m, a_m)$.

As we have argued in the introduction, developing a 'natural and dynamic' learning algorithm for a general stochastic game is inherently difficult. Here we consider a natural relaxation of the problem as 'games against nature', and formalize it as a Stackelberg stochastic game. In a Stackelberg stochastic game, at each time step $n$, player 1 takes an action $a^1_n$ first. Player 2 (the adversary) observes this action and then selects her action $a^2_n$. Let $\Sigma^i$ be the set of behavioral strategies of player $i$, $i = 1, 2$. A pair of strategies $(\sigma^1, \sigma^2) \in \Sigma^1 \times \Sigma^2$ together with an initial state $s$ induces a probability distribution $\mu_s(\sigma^1, \sigma^2)$ on the sequence of vectors $\{x^i_n, i = 1, 2\}$. Our objective is to specify the conditions under which a given closed set is approachable and prescribe a strategy for approachability.
The notion of an approachable set for stochastic games is made precise in the following definition.

Definition 6 (Approachable Set). A closed set $D$ is approachable for player $i$ if there exists a behavioral strategy $\sigma^i$ for player $i$ such that $\|x^i_n - D\| \to 0$, $\mu_s(\sigma^i, \sigma^{-i})$-almost surely for all $\sigma^{-i} \in \Sigma^{-i}$, and for all initial states $s \in S$.

Remark 9. In the following, we analyze the approachability conditions and the algorithm for player 1 (against the strategies of the adversary) and hence, we drop the superscript 1. Thus, $x_n$ indicates $x^1_n$, $c$ indicates $c^1$, etc.

Let $\Phi$ be the set of ergodic occupation measures over the set $S \times A$ such that any $\phi \in \Phi$ can be decomposed as $\phi(s,a^1,a^2) = \eta(s)\,\pi^1(a^1|s)\,\pi^2(a^2|s,a^1)$ where $\pi^1(\cdot|s) \in \mathcal{P}(A^1)$, $\pi^2(\cdot|s,a^1) \in \mathcal{P}(A^2)$, and $\eta$ is an invariant probability measure over $S$ induced by the policy pair $(\pi^1, \pi^2)$. For any $\phi \in \Phi$, define the ergodic vector-cost function for player $i$, $c^i : \Phi \to \mathbb{R}^K$, as
$$c^i(\phi) = \Big[ \sum_{s,a^1,a^2} \phi(s,a^1,a^2)\, c^i_1(s,a^1,a^2), \ \ldots, \ \sum_{s,a^1,a^2} \phi(s,a^1,a^2)\, c^i_K(s,a^1,a^2) \Big]^{\dagger}. \tag{5.36}$$
Note that $c^1(\phi)$ is the expected average vector-cost for player 1 if both players play a pair of stationary strategies $(\pi^1, \pi^2)$ such that $\phi$ is the ergodic occupation measure induced by $(\pi^1, \pi^2)$. We make the following assumption.

Assumption 7. The Markov chain induced by any pair of stationary strategies $(\pi^1, \pi^2)$ of the players is irreducible.

5.2.2 Approachability Theorem for Stackelberg Stochastic Games

In this section, we state and prove our approachability theorem for Stackelberg stochastic games.

Sufficient Condition and Strategy for Approachability of Convex Sets

As in the MDP case, we first restrict our attention to a convex set $D$. The extension to non-convex sets is given later in Section 5.2.2. We start by defining a few quantities. We reuse some of the notation from Section 5.1. Recall that for any $x \in \mathbb{R}^K \setminus D$, $P_D(x)$ denotes the (unique) projection of $x$ onto $D$. Let $\lambda(x) = (x - P_D(x)) / \|x - P_D(x)\|$.
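The projection and the unit direction defined above can be sketched for one concrete convex target set. The example below assumes, purely for illustration, that $D$ is an axis-aligned box $[lo, hi]^K$, for which the Euclidean projection is coordinate-wise clipping; the helper names `project_box` and `direction` are hypothetical.

```python
import numpy as np


def project_box(x, lo, hi):
    """Euclidean projection of x onto the box D = [lo, hi]^K."""
    return np.clip(x, lo, hi)


def direction(x, lo, hi):
    """Unit vector (x - P_D(x)) / ||x - P_D(x)||, or None when x is in D."""
    d = x - project_box(x, lo, hi)
    norm = np.linalg.norm(d)
    if norm == 0.0:
        return None  # the direction is only defined outside D
    return d / norm


x_out = np.array([2.0, 0.5])
p = project_box(x_out, 0.0, 1.0)     # closest point of the unit box
lam = direction(x_out, 0.0, 1.0)     # points from D towards x_out
lam_corner = direction(np.array([2.0, 2.0]), 0.0, 1.0)
inside = direction(np.array([0.3, 0.7]), 0.0, 1.0)
```

For a general closed convex set the projection would have to be computed by a convex program; the box is used here only because its projection has a closed form.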
For every $x \in \mathbb{R}^K \setminus D$, we define a scalar-valued Stackelberg stochastic game with the stage cost $\tilde{c}^i(s,a;x) = \langle c^i(s,a), \lambda(x) \rangle$. With respect to this game, we define the following quantities:
$$c^*(x) := \min_{\pi^1} \max_{\pi^2} \big\{ \langle c(\phi), \lambda(x) \rangle : \phi(s,a^1,a^2) = \eta(s)\,\pi^2(a^2|s,a^1)\,\pi^1(a^1|s) \big\}, \tag{5.37}$$
$$\Psi_{\pi^1}(x) := \arg\max_{\pi^2} \big\{ \langle c(\phi), \lambda(x) \rangle : \phi(s,a^1,a^2) = \eta(s)\,\pi^2(a^2|s,a^1)\,\pi^1(a^1|s) \big\}, \tag{5.38}$$
$$\Pi^1(x) := \arg\min_{\pi^1} \max_{\pi^2} \big\{ \langle c(\phi), \lambda(x) \rangle : \phi(s,a^1,a^2) = \eta(s)\,\pi^2(a^2|s,a^1)\,\pi^1(a^1|s) \big\}, \tag{5.39}$$
$$\Phi(x) := \big\{ \phi \in \Phi : \phi(s,a^1,a^2) = \eta(s)\,\pi^2(a^2|s,a^1)\,\pi^1(a^1|s),\ \pi^1 \in \Pi^1(x),\ \pi^2 \in \Psi_{\pi^1}(x) \big\}, \tag{5.40}$$
$$\tilde{\Phi}(x) := \big\{ \phi \in \Phi : \langle c(\phi), \lambda(x) \rangle \le c^*(x) \big\}. \tag{5.41}$$

Theorem 21. (i) (Sufficient Condition) A closed convex set $D$ is approachable from all initial states in a Stackelberg stochastic game (satisfying Assumption 7) if for every $x \in \mathbb{R}^K \setminus D$ there exists a (possibly non-unique) occupation measure $\phi$, $\phi(s,a^1,a^2) = \eta(s)\,\pi^2(a^2|s,a^1)\,\pi^1(a^1|s)$, such that $\langle c(\phi), \lambda(x) \rangle = c^*(x)$ and $\langle c(\phi) - P_D(x), \lambda(x) \rangle \le 0$. (ii) (Strategy for Approachability) A strategy for approachability is: at every time instant $n+1$, select action $a^1_{n+1}$ according to the policy $\pi^1_{x_n}$ such that $\pi^1_{x_n} \in \Pi^1(x_n)$.

Proof of Theorem 21

The proof here is also based on stochastic approximation ideas. However, since the system evolution now depends on actions taken by two players (who are not necessarily zero-sum adversaries), it is much more complicated. This requires additional work, which we show below. Other details are similar to the MDP case, and are either summarized or omitted.

The average vector-cost till time step $n+1$ can be written as
$$x_{n+1} = \frac{1}{n+1} \sum_{m=1}^{n+1} c(s_m,a_m) = x_n + \gamma(n)\big(c(s_{n+1},a_{n+1}) - x_n\big) = x_n + \gamma(n) \Big( c(s_{n+1},\pi_{x_n},a^2_{n+1}) - x_n + M^{(1)}_{n+1} \Big), \tag{5.42}$$
where $\gamma(n) = 1/(n+1)$, $c(s_{n+1},\pi_{x_n},a^2_{n+1}) = \sum_{a^1 \in A^1} \pi_{x_n}(a^1|s_{n+1})\, c(s_{n+1},a^1,a^2_{n+1})$ and $M^{(1)}_{n+1} = c(s_{n+1},a^1_{n+1},a^2_{n+1}) - c(s_{n+1},\pi_{x_n},a^2_{n+1})$.
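The first equality in (5.42), which turns the running average into a stochastic-approximation recursion with step size $1/(n+1)$, can be checked numerically. This is a small sketch with made-up cost vectors; `running_average` is a hypothetical helper name.

```python
import numpy as np


def running_average(costs):
    """Apply x <- x + (c - x) / (n + 1) for n = 0, 1, ... over the rows of costs."""
    x = np.zeros(costs.shape[1])
    for n, c in enumerate(costs):
        x = x + (c - x) / (n + 1)  # step size gamma(n) = 1 / (n + 1)
    return x


rng = np.random.default_rng(0)
costs = rng.uniform(-1.0, 1.0, size=(500, 3))  # 500 bounded cost vectors, K = 3
x_recursive = running_average(costs)
x_batch = costs.mean(axis=0)
```

Up to floating-point rounding, the recursive and batch averages coincide, which is why the averaged cost can be analysed with the same tools as any other iteration with step size $1/(n+1)$.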
Similar to the MDP case, the key idea in the analysis of (5.42) is to show that it asymptotically tracks the differential inclusion dynamics given by
$$\dot{x}(t) \in w(x(t)) - x(t), \tag{5.43}$$
where $w(x) := \overline{\mathrm{co}}\{ c(\tilde{\phi}) : \tilde{\phi} \in \tilde{\Phi}(x) \}$. Then, by showing that the dynamics given by (5.43) converge to the set $D$, we will conclude that the sequence $\{x_n\}$ also converges to the same set a.s. We start with the following proposition.

Proposition 20. The sets $\Phi(x)$ and $\tilde{\Phi}(x)$ are compact and non-empty, and the set-valued maps $\Phi : x \mapsto \Phi(x)$, $\tilde{\Phi} : x \mapsto \tilde{\Phi}(x)$ are upper semi-continuous.

Proof. Consider the Stackelberg stochastic game wherein at each time instant $n$, the state is $X_n \in S$ and player 1 moves first and generates a randomized action $Z^1_n \in A^1$ according to the conditional law $\pi^1(\cdot|X_n)$, followed by player 2 who observes $Z^1_n$ and chooses a randomized action according to the conditional law $\pi^2(\cdot|X_n, Z^1_n)$. Thus, player 2 faces an average reward MDP with state process $(X_n, Z^1_n)$, $n \ge 0$, and transition probability $\check{p}((\tilde{s},\tilde{a}^1)|(s,a^1),a^2) := p(s,a^1,a^2,\tilde{s})\,\pi^1(\tilde{a}^1|\tilde{s})$. For a given $x \in \mathbb{R}^K \setminus D$, the optimal adversarial policy $\pi^2$ of player 2 for the scalar-valued MDP with cost function $\langle c(s,a^1,a^2), \lambda(x) \rangle$ can be characterized as the maximizer in the associated dynamic programming equation, similar to equation (5.13), as
$$V((s,a^1)) = \max_{\pi^2} \Big( \sum_{(\tilde{s},\tilde{a}^1)} \sum_{a^2} \pi^2(a^2|s,a^1)\, \check{p}((\tilde{s},\tilde{a}^1)|(s,a^1),a^2) \big( \langle c(s,a^1,a^2), \lambda(x) \rangle + V((\tilde{s},\tilde{a}^1)) \big) \Big) - \beta. \tag{5.44}$$
This has a solution $(V, \beta)$ where $\beta$ is unique and $V$ is unique up to an additive scalar and can be rendered unique by arbitrarily fixing, say, $V((s_0,a^1_0)) = 0$. Then, $\Psi_{\pi^1}(x)$ defined above is simply the set of maximizers on the R.H.S. Being the set of maximizers of an affine function on a convex compact set, it is a non-empty convex compact set.

Now consider the MDP with state process $\{X_n\}$ and stationary randomized policies $\pi((a^1,a^2)|s) := \pi^1(a^1|s)\,\pi^2(a^2|s,a^1)$ where $\pi^1(\cdot|s) \in \mathcal{P}(A^1)$ and $\pi^2(\cdot|s,a^1) \in \Psi_{\pi^1}(x)$.
That is, the action space for the problem is $\Gamma := \bigcup_{\pi^1 \in \mathcal{P}(A^1)} \{\pi^1\} \times \Psi_{\pi^1}(x)$ with the relative topology inherited from $\mathcal{P}(A^1 \times A^2)$. By our definition of $\Psi_{\pi^1}(x)$, the optimal policy for this MDP corresponds to the min-max policy for the original problem, i.e., belongs to $\Pi^1(x)$. This in turn is given by the minimizers on the right hand side of the dynamic programming equation
$$\hat{V}(s) = \min_{\pi \in \Gamma} \Big( \sum_{\tilde{s}} \sum_{a^1,a^2} \pi^1(a^1|s)\,\pi^2(a^2|s,a^1)\, p(s,a^1,a^2,\tilde{s}) \big( \langle c(s,a^1,a^2), \lambda(x) \rangle + \hat{V}(\tilde{s}) \big) \Big) - \hat{\beta}, \tag{5.45}$$
which has a solution $(\hat{V}, \hat{\beta})$, where $\hat{\beta}$ is unique and $\hat{V}$ is unique up to an additive scalar and can be rendered unique by arbitrarily fixing, say, $\hat{V}(s_0) = 0$.

In either (5.44) or (5.45), if we replace $x$ by $x_n$, $x_n \to x_\infty$, then a subsequential limit $(V'_\infty, \beta'_\infty)$, $(\hat{V}'_\infty, \hat{\beta}'_\infty)$ of the corresponding $(V' = V'_n, \beta' = \beta'_n)$, $(\hat{V}' = \hat{V}'_n, \hat{\beta}' = \hat{\beta}'_n)$ must satisfy the respective dynamic programming equations (5.44), (5.45), with $V'_\infty((s_0,a^1_0)) = \hat{V}'_\infty(s_0) = 0$. By the uniqueness claim above, they are the appropriate value functions for $x = x_\infty$. Furthermore, if we pick $\pi^2_n$ to be a maximizer on the right hand side of (5.44) for $(V'_n, \beta'_n)$, $n \ge 1$, any limit point thereof as $n \uparrow \infty$ must be a maximizer of the same for $n = \infty$. A similar argument works for the minimizers of (5.45). It is easy to deduce from this that the graph of $\Phi$ is closed. Hence, $\Phi$ is upper semi-continuous. This in particular implies, by our definition of $\Phi(\cdot)$, that $c^*(\cdot)$ is lower semi-continuous. The claim regarding $\tilde{\Phi}(\cdot)$ follows from this.

First we will prove some conditions that will ensure the well-posedness of (5.43).

Proposition 21. For each $x \in \mathbb{R}^K \setminus D$, (i) $\sup_{y \in w(x)} \|y\| < K(1 + \|x\|)$. (ii) $w(x)$ is convex, compact and upper semicontinuous.

Proof. (i) is obvious from the boundedness assumption. For (ii), $w(x)$ is convex by definition. Now, consider the mapping $h(x) := \{ c(\tilde{\phi}) : \tilde{\phi} \in \tilde{\Phi}(x) \}$. Since $\tilde{\Phi}(x)$ is compact and $c(\cdot)$ is continuous, $h(x)$ is compact. Then, the closed convex hull of $h(x)$, $w(x)$, is also compact. The upper semicontinuity of $h$ is clear from the upper semicontinuity of $\tilde{\Phi}$. Now the closed convex hull of $h(x)$, $w(x)$, is also upper semicontinuous, by [107, Lemma 5, Chapter 5].

Now we prove Theorem 21.

Proof. The remaining details of the proof are very similar to those for MDPs. As before, we construct the interpolated trajectory $x(t)$, $t \ge 0$, of (5.42) as defined in equation (5.6). Using the same arguments and then applying the result of Theorem 17, we can show that almost surely, $\{x(v + \cdot), v \ge 0\}$ converges to an internally chain transitive invariant set of the differential inclusion given by (5.43). In particular, this implies that $\{x_n\}$ converges a.s. to such a set.

Consider the same Lyapunov function $V(x) = \min_{z \in D} \frac{1}{2} \|x - z\|^2$. Since $\nabla V(x) = x - P_D(x)$,
$$\frac{d}{dt} V(x(t)) = \langle \nabla V(x(t)), \dot{x}(t) \rangle = \langle x(t) - P_D(x(t)), y(t) \rangle$$
for $y(t) \in w(x(t)) - x(t)$. By Proposition 20 and our hypotheses, there exists an occupation measure $\phi$ such that $\phi(s,a^1,a^2) = \eta(s)\,\pi^2(a^2|s,a^1)\,\pi^1(a^1|s)$, and $c^*(x) = \langle c(\phi), \lambda(x) \rangle \le \langle P_D(x), \lambda(x) \rangle$. Then for any $\tilde{\phi}_x \in \tilde{\Phi}(x)$ we have $\langle c(\tilde{\phi}_x), \lambda(x) \rangle \le c^*(x) \le \langle P_D(x), \lambda(x) \rangle$. So for any $x(t)$, $\langle c(\tilde{\phi}_{x(t)}) - P_D(x(t)), x(t) - P_D(x(t)) \rangle \le 0$ for all $\tilde{\phi}_{x(t)} \in \tilde{\Phi}(x(t))$ and hence, $\langle c(\tilde{\phi}_{x(t)}) - x(t), x(t) - P_D(x(t)) \rangle \le -\|x(t) - P_D(x(t))\|^2$. This gives
$$\frac{d}{dt} V(x(t)) \le -2 V(x(t)) \quad \text{so that} \quad V(x(t)) \le V(x(0))\, e^{-2t}.$$
Thus, $D$ is a global attractor. Since $D$ is a global attractor, the internally chain transitive invariant set corresponding to the differential inclusion (5.43) is a subset of $D$ [107]. Hence, $\{x_n\}$ converges to $D$.

Necessary Condition

We now give a necessary condition for approachability for convex sets.

Proposition 22 (Necessary Condition).
If a closed convex set $D$ is approachable from all initial states in an arbitrary Stackelberg stochastic game (satisfying Assumption 7), then (i) every half-space containing $D$ is approachable, and (ii) there exists an occupation measure $\phi$, $\phi(s,a^1,a^2) = \eta(s)\,\pi^2(a^2|s,a^1)\,\pi^1(a^1|s)$, such that $\langle c(\phi), \lambda(x) \rangle = c^*(x)$ and $\langle c(\phi) - P_D(x), \lambda(x) \rangle \le 0$.

Proof. Claim (i) is obvious. We now show that (i) implies (ii) and complete the argument. Let $x \in \mathbb{R}^K \setminus D$ and let $H_x$ be the supporting half-space to the set $D$ at the point $P_D(x)$, given by $H_x := \{ y \in \mathbb{R}^K : \langle y - P_D(x), \lambda(x) \rangle \le 0 \}$. Since every half-space containing $D$ is approachable, there exists a strategy $\sigma^1$ for player 1 such that for any (Stackelberg) strategy $\sigma^2$ of player 2, $\limsup_{n \to \infty} \langle x_n - P_D(x), \lambda(x) \rangle \le 0$, $\mu(\sigma^1, \sigma^2)$-a.s. Since $|\langle x_n - P_D(x), \lambda(x) \rangle|$ is bounded,
$$\inf_{\sigma^1} \sup_{\sigma^2} E_{\mu(\sigma^1,\sigma^2)} \Big[ \limsup_{n \to \infty} \langle x_n, \lambda(x) \rangle \Big] \le \langle P_D(x), \lambda(x) \rangle. \tag{5.46}$$
Note that the L.H.S. of equation (5.46) is the min-max cost for player 1 in the average cost scalar-valued Stackelberg stochastic game with stage cost $\tilde{c}^1(s,a;x) = \langle c^1(s,a), \lambda(x) \rangle$. Then, by the arguments in the proof of Proposition 20, there exists a $\phi \in \Phi(x)$ such that the L.H.S. is equal to $\langle c(\phi), \lambda(x) \rangle$. This completes the proof.

Extension to Non-Convex Sets

We now give the approachability result when the target set is non-convex. The proof is the same as that of Theorem 21 except for the fact that the Lyapunov function may be non-differentiable. We overcome this difficulty by considering the semidifferentials as we did for MDPs (in Theorem 18). Since the techniques are the same, we omit the proof. As before, let $\tilde{P}_D(x)$ be the set of points in $D$ that are closest to $x \in \mathbb{R}^K \setminus D$.

Theorem 22. (i) (Sufficient Condition) A closed set $D$ is approachable from all initial states in the stochastic game satisfying Assumption 7 if for every $x \in \mathbb{R}^K \setminus D$ and for each $P_D(x) \in \tilde{P}_D(x)$, there exists an occupation measure $\phi$, $\phi(s,a^1,a^2) = \eta(s)\,\pi^2(a^2|s,a^1)\,\pi^1(a^1|s)$, such that $\langle c(\phi), \lambda(x) \rangle = c^*(x)$ and $\langle c(\phi) - P_D(x), \lambda(x) \rangle \le 0$.
(ii) (Strategy for Approachability) A strategy for approachability is: at every time instant $n+1$, select a $P_D(x) \in \tilde{P}_D(x)$, and select action $a^1_{n+1}$ according to the policy $\pi^1_{x_n}$ such that $\pi^1_{x_n} \in \Pi^1(x_n)$.

Remark 10. (i) Our approachability theorem is for a special case, namely Stackelberg stochastic games, while that of [72, 71, 73] is for general stochastic games. However, we are able to give a different approachability strategy based on stochastic approximations. This approachability strategy naturally leads to a dynamic learning algorithm. (ii) The only learning algorithm for approachability in stochastic games we are aware of is given in [74]. This algorithm is based on the approachability strategy in [71], and seems complicated and computationally impractical. Our learning algorithm is simpler but is applicable only to Stackelberg stochastic games. This is an interesting step in developing stochastic approximation based learning algorithms for general stochastic games, similar to reinforcement learning algorithms for MDPs. Such a goal, as we noted earlier, is a longstanding open problem, and has been pursued without success for a long time.

5.2.3 A Learning Algorithm for Approachability in Stochastic Stackelberg Games

The approachability theorem for Stackelberg stochastic games that we proved before shows that if player 1 selects her action at time step $n+1$ according to the policy $\pi^1_{x_n}$ such that $\pi^1_{x_n} \in \Pi^1(x_n)$, then $x_n$ approaches the desired set $D$. So the objective of the algorithm is to 'learn' such a policy $\pi_{x_n}$ at each time step $n$. We give an algorithm which indeed does this.

We do this by considering the problem as learning in two coupled MDPs. Thus, most of the tools that we used in Section 5.1 can be applied directly. We give the asynchronous learning algorithm below. The synchronous scheme may be written analogously.
$$x_{n+1} = x_n + \gamma_1(n)\big(c(s_{n+1},a_{n+1}) - x_n\big), \qquad \gamma_1(n) = 1/(n+1), \tag{5.47}$$
$$\hat{Q}_{n+1}((s,a^1),a^2) = \hat{Q}_n((s,a^1),a^2) + \gamma_2(\hat{\nu}(n,s,a^1,a^2))\, I\{((s_n,a^1_n),a^2_n) = ((s,a^1),a^2)\} \Big( \tilde{c}((s,a^1),a^2;x_n) + \max_z \hat{Q}_n((s_{n+1},a^1_{n+1}),z) - f(\hat{Q}_n) - \hat{Q}_n((s,a^1),a^2) \Big), \tag{5.48}$$
$$\hat{\pi}^2_{n+1}(s,a^1) = \arg\max_{a^2} \hat{Q}_{n+1}(s,a^1,a^2), \tag{5.49}$$
$$\tilde{Q}_{n+1}(s,a^1) = \tilde{Q}_n(s,a^1) + \gamma_3(\tilde{\nu}(n,s,a^1))\, I\{s_n = s, a^1_n = a^1\} \Big( \tilde{c}(s,a^1_n,a^2_n;x_n) + \min_y \tilde{Q}_n(s_{n+1},y) - f(\tilde{Q}_n) - \tilde{Q}_n(s,a^1) \Big), \tag{5.50}$$
$$\pi^1_{n+1}(s) = \arg\min_{a^1} \tilde{Q}_{n+1}(s,a^1), \tag{5.51}$$
where $\gamma_1(n), \gamma_2(n), \gamma_3(n)$ satisfy the conditions in equations (5.19), (5.21), (5.30), with the additional stipulations that $\gamma_3(n) = o(\gamma_2(n))$, $\gamma_3(n) = o(\gamma_1(n))$. Also, $\hat{\nu}(n,s,a^1,a^2) = \sum_{m=1}^{n} I\{(s,a^1,a^2) = (s_m,a^1_m,a^2_m)\}$ and $\tilde{\nu}(n,s,a^1) = \sum_{m=1}^{n} I\{(s,a^1) = (s_m,a^1_m)\}$.

Remark 11. Here also, as observed in Remark 8, we need to randomize $\pi^1_{n+1}$. We skip the details.

Theorem 23. Under the learning algorithm for approachability in Stackelberg stochastic games given by equations (5.47) - (5.51), $\|x_n - D\| \to 0$ almost surely.

Proof. The proof is very similar to that of Theorem 20 and hence we give only a sketch of the arguments. Here, $\hat{Q}_n$ is the Q-function for the MDP faced by player 2; note that $\hat{Q}_n$ and $\hat{\pi}^2_n$ are computed by player 1 only. $\tilde{Q}_n$ is the Q-function for the MDP faced by player 1. Both $x_n$ and $\pi^1_n$ are on a slower timescale compared to $\hat{Q}_n$ and can be considered 'quasi-static' for the analysis of (5.48). Thus, by the results of [46], we can conclude that $\|\hat{\pi}^2_n - \Psi_{\pi^1_n}(x_n)\| \to 0$ a.s.

Consider (5.50) next. Again, $x_n$ can be treated as quasi-static for this and $\hat{Q}_n$, which is on a faster time scale, as quasi-equilibrated (i.e., $\|\hat{\pi}^2_n - \Psi_{\pi^1_n}(x_n)\| \to 0$ a.s.), whence $\min_{a^1} \tilde{Q}_n(s_n,a^1) - \min_{a^1} \max_{a^2} \hat{Q}_n(s_n,a^1,a^2) \to 0$, a.s. Then, one can conclude that $\|\pi^1_n - \Pi^1(x_n)\| \to 0$ a.s.
This implies that asymptotically player 1's strategy is the same as the approachability strategy specified by Theorem 21 and hence, by the same theorem, $\|x_n - D\| \to 0$. These arguments can be made rigorous as in the proof of Theorem 20 by essentially following the steps in [107]. Since that is routine, we conclude the proof here.

5.3 Conclusion

We have presented a simple and computationally tractable strategy for Blackwell's approachability in MDPs and Stackelberg stochastic games. We have also given a reinforcement learning based algorithm to learn the approachability strategy when the transition kernels are unknown. The motivation for this came from multi-objective optimization and decision making problems in a dynamically changing environment. We note that while approachability questions are typically asked in the context of games, repeated or stochastic, they are basically about settings in which decision-makers are faced with multiple objectives. The conditions for approachability for MDPs and stochastic games are very similar. However, in our stochastic approximations-based approach, the proof techniques for the Stackelberg stochastic game setting are much more complicated since we must account for system dynamics depending on actions taken by both players. The learning algorithms we devise for the two settings are also different, since for MDPs we have a stochastic approximation scheme that 'averages over the natural time scale' while for the game setting we have a multiple time scale stochastic approximation scheme.

There are many interesting related questions that one can possibly address in the future. Extension to MDPs and stochastic games with discounted reward is one problem, but the solution is possibly very messy due to the dependence on the initial state. The optimal rate of convergence for the approachability problems in such systems is another important problem that has never been addressed before, and is likely very challenging.
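To make the MDP learning scheme of Section 5.1.3 concrete, here is a rough sketch of the asynchronous updates (5.27) - (5.29) with the $\epsilon$-randomization of Remark 8. Everything specific in it is an assumption made for illustration: the target set is a box $[lo, hi]^K$, $f(Q) = Q(0,0)$, the fast step size is $\gamma_2(k) = (k+1)^{-0.6}$ (so that $\gamma_1(n)/\gamma_2(n) \to 0$), and the toy kernel and vector costs are made up. It is not the author's implementation and makes no claim about convergence speed.

```python
import numpy as np


def approach_q_learning(P, c, lo, hi, steps=5000, eps=0.1, seed=0):
    """Asynchronous sketch of (5.27)-(5.29) for the box target D = [lo, hi]^K.

    P[s, a, s2] is used only as a simulator; c[s, a, :] is the vector cost.
    """
    rng = np.random.default_rng(seed)
    S, A, K = c.shape
    x = np.zeros(K)                       # running average vector cost x_n
    Q = np.zeros((S, A))                  # Q_n
    visits = np.zeros((S, A), dtype=int)  # local clocks nu(s, a, n)
    s, a = 0, int(rng.integers(A))
    for n in range(steps):
        # unit direction from D to x_n; zero vector when x_n already lies in D
        d = x - np.clip(x, lo, hi)
        norm = np.linalg.norm(d)
        lam = d / norm if norm > 0 else np.zeros(K)

        s_next = int(rng.choice(S, p=P[s, a]))  # one call to the simulator

        # a_{n+1} is drawn from the epsilon-randomised greedy policy of Q_n
        if rng.random() < eps:
            a_next = int(rng.integers(A))
        else:
            a_next = int(Q[s_next].argmin())

        # fast timescale (5.28): update only the visited pair (s_n, a_n)
        g2 = 1.0 / (visits[s, a] + 1) ** 0.6
        target = c[s, a] @ lam + Q[s_next].min() - Q[0, 0]
        Q[s, a] += g2 * (target - Q[s, a])
        visits[s, a] += 1

        # slow timescale (5.27) with gamma_1(n) = 1 / (n + 1)
        x = x + (c[s_next, a_next] - x) / (n + 1)
        s, a = s_next, a_next
    return x, Q


# Hypothetical toy model: 2 states, 2 actions, K = 2 cost components in [0, 1].
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
c_vec = np.array([[[1.0, 0.0], [0.0, 1.0]],
                  [[0.5, 0.5], [0.0, 0.0]]])
x_final, Q_final = approach_q_learning(P, c_vec, lo=0.0, hi=0.6, steps=2000)
```

The sketch keeps the two timescales visible: the averaged cost moves with step $1/(n+1)$ while each Q entry moves with a larger step driven by its own visit count.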
6 A Principal-Agent Approach to Spectrum Sharing

The recent development of dynamic spectrum access and allocation techniques has made feasible the vision of cognitive radio systems. However, a fundamental question arises: why would licensed primary users of a spectrum band allow secondary users to share the band and degrade performance for them? And how can we design incentive schemes to enable spectrum sharing using cooperative communication schemes? We consider a principal-agent framework, and propose a contracts-based approach. First, a single primary and a single secondary transmitter-receiver pair with a Gaussian interference channel between them are considered. The two users may contract to cooperate in doing successive-interference cancellation. Under full information, we give equilibrium contracts for various channel conditions. These equilibrium contracts yield Pareto-optimal rate allocations when physically possible. We then allow for time-sharing and observe that in equilibrium contracts there is no actual time-sharing. We show that the designed contracts can be made robust to deviation by either user post-contract. We also show how these can be extended to multiple secondary users. We show that under hidden information, when the primary user has a dominant role, neither user has an incentive to lie about their direct channel coefficients or to manipulate the cross channel measurements, and Pareto-optimal outcomes are achieved at equilibrium.

The chapter is organized as follows. In Section 6.1, we describe the physical layer channel model and some related work. In Section 6.2, we introduce the principal-agent framework and the contract design problem in a cognitive radio setting. In Section 6.3, we propose incentive-compatible contracts under full information. We then consider the incomplete information setting in Section 6.4. We end with a discussion of future work.
6.1 The Physical Model and Related Work

We consider a situation where there is a primary user 0 who owns the spectrum license, and $M-1$ secondary cognitive radio users who want to share the spectrum with the primary user. We will model it as a Gaussian Interference Channel (GIC) of bandwidth $W$ (assumed unity, for simplicity) with $M$ transmitter-receiver pairs. We consider a discrete-time channel model:
$$y_i[n] = \sum_{j=0}^{M-1} h_{j,i}\, x_j[n] + z_i[n], \quad i = 0, \ldots, M-1, \tag{6.1}$$
where $x_j$ is the signal from transmitter $j$, $y_i$ is the signal received at receiver $i$, $h_{j,i}$ is the channel attenuation coefficient from transmitter $j$ to receiver $i$, and the noise process $\{z_i\}$ is i.i.d. over time with distribution $N(0, N_0)$. We assume a flat fading channel. Each user could treat the signal from other users as interference. This is reasonable when interference cancellation and alignment schemes cannot be used due to limitations on decoder complexity, delay constraints, or uncertainty in channel estimation. We assume that users use random Gaussian codebooks for transmission. Then, the maximum rate that user $i$ can achieve is given by
$$R_i = \log\Big( 1 + \frac{c_{i,i} P_i}{N_0 + \sum_{j \ne i} c_{j,i} P_j} \Big), \quad i = 0, \ldots, M-1, \tag{6.2}$$
where $P_i$ is the transmitted power of user $i$, and $c_{i,j} = |h_{i,j}|^2$. Each transmitter has a power constraint. Thus, $P_i$ must satisfy $P_i \le \bar{P}_i$ for each $i$.

The spectrum sharing problem is to determine a set of power allocations $P = (P_0, \ldots, P_{M-1})$ that maximize a given global utility function (such as the achievable sum-rate $\sum_i R_i$) while satisfying the power constraints. However, users are selfish and may not cooperate, with each wanting to maximize its own rate. Thus, each user picks its power allocation to maximize its own rate, which leads to a spectrum sharing game between them.
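The rate expression (6.2) is easy to evaluate numerically. The sketch below assumes rates are measured in bits (base-2 logarithm, which the text does not specify) and stores the gains in a matrix with `C[j, i]` the gain from transmitter `j` to receiver `i`; the function name and the sample numbers are hypothetical.

```python
import numpy as np


def interference_limited_rates(C, P, N0=1.0):
    """Rates of (6.2): every receiver treats all other users' signals as noise.

    C[j, i] = |h_{j,i}|^2 is the power gain from transmitter j to receiver i,
    P[i] is the transmit power of user i and N0 is the noise variance.
    """
    C = np.asarray(C, dtype=float)
    P = np.asarray(P, dtype=float)
    signal = np.diag(C) * P            # c_{i,i} * P_i
    interference = C.T @ P - signal    # sum over j != i of c_{j,i} * P_j
    return np.log2(1.0 + signal / (N0 + interference))


# Hypothetical two-user example.
C = [[1.0, 0.5],
     [0.25, 2.0]]
rates = interference_limited_rates(C, P=[1.0, 2.0], N0=1.0)
```

Here receiver 0 sees interference $0.25 \cdot 2 = 0.5$ and receiver 1 sees $0.5 \cdot 1 = 0.5$, so the two signal-to-interference-plus-noise ratios are $1/1.5$ and $4/1.5$.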
To predict the outcome of such a game, we look at its Nash equilibrium (NE) $P^*$ such that given the power allocations of all the other users $P^*_{-i}$, user $i$'s rate is maximized at $P^*_i$, i.e.,
$$R_i(P^*_i, P^*_{-i}) \ge R_i(P_i, P^*_{-i}), \quad \forall P_i \le \bar{P}_i. \tag{6.3}$$
In [80], it was shown that in a flat fading GIC, a Nash equilibrium (NE) exists, all NE are pure strategy equilibria, and under certain conditions, full-spread power allocation is a NE. Moreover, it was shown that under certain conditions, full-spread is the unique NE. In most cases, however, the set of rate vectors that result from the full-spread NE are not Pareto optimal. So, there may be a significant performance loss if the $M$ users operate at any such point due to lack of cooperation. In fact, in many cases this inefficient outcome is the only possible outcome of the game. For general parallel GIC, existence of Nash equilibrium was proved in [116].

The above discussion assumed that users do not use cooperative communication schemes. In licensed settings, however, it can be possible to enable spectrum sharing using cooperative communication schemes [75, 76]. A particular scheme of relevance is Successive Interference Cancellation (SIC), which works as follows. Suppose user $M-1$ decodes his own signal by treating interference from all other users as noise; then he can achieve a rate
$$R_{M-1} = \log\Big( 1 + \frac{c_{M-1,M-1} P_{M-1}}{N_0 + \sum_{j < M-1} c_{j,M-1} P_j} \Big).$$
Now, user $i$ can do the same and first decode the signals of all users $j > i$ as above. Then, he can subtract these signals from his received signal, and decode his own signal by treating the signals of the remaining users as noise and achieve a rate
$$R_i = \log\Big( 1 + \frac{c_{i,i} P_i}{N_0 + \sum_{j < i} c_{j,i} P_j} \Big), \tag{6.4}$$
which is greater than the rate in equation (6.2) that he would obtain by treating all other users' signals as noise. Proceeding in this way, user 0 then achieves a rate $R_0 = \log\big( 1 + \frac{c_{0,0} P_0}{N_0} \big)$. We will call user 0 the dominant user, and user $i$ the $i$th non-dominant user. From the achievable rate region (Fig.
6.1), it is clear that all users could potentially gain if we could find a way for them to cooperate.

[Figure 6.1: The Gaussian interference channel, and its achievable region with SIC with assigned dominant/non-dominant roles]

6.2 The Principal-Agent Framework and Contract Design

We now introduce the principal-agent model [89, 90]. One agent acts as the principal who offers one or more contracts to one or more agents. The agent(s) then select(s) one or rejects all. Denote the power used by the principal as $P_p$ and that used by the agent as $P_a$. Let the utility of the principal be its rate $R_p(P_p, P_a)$ and the utility of the agent be its rate $R_a(P_a, P_p)$. Denote by $\tau(P_p, P_a)$ the payment that the agent makes to the principal. This can be positive or negative.

Definition 7. A spectrum contract is a tuple $(\tau(\cdot,\cdot), P_p, P_a)$ such that the operating point is $(P_p, P_a)$ and the payment is $\tau(P_p, P_a)$.

Denote the agent strategy space by $S_a = [0, \bar{P}_a] \times [0, \bar{P}_p]$, the payment function by $\tau : S_a \to \mathbb{R}$ and the outcome function by $f : S_a \to S_a$. First, the principal picks a payment function $\tau$ and an outcome function $f$. Then, the agent rejects the contract offered, or accepts and picks operating powers in $S_a$. We will typically have $f(P_a, P_p) = (P_a, P_p)$, i.e., the agent will pick the powers at the operating point, but at times we may restrict this. The principal wants to design a $(\tau, f)$ that maximizes his payoff $R_p + \tau$ once the agent has accepted and an operating point $(P_p, P_a)$ has been picked. Let $\bar{R}_p$ and $\bar{R}_a$ denote the reservation utilities (rates) for the principal and the agent that they can derive if the contract is not accepted.

Definition 8. We say that a spectrum contract function is individually rational (IR) for an agent if there exist feasible $(P_p, P_a)$ such that $R_a(P_a, P_p) - \tau(P_p, P_a) \ge \bar{R}_a$.
That is, it would be rational for the agent (or the principal) to participate only if he can pick a feasible operating point at which his payoff is at least as large as his reservation utility $\bar{R}_a$.

Definition 9. We say that a spectrum contract is incentive compatible (IC) for an agent if at the operating point $(P_p, P_a)$,

\[ R_a(P_a, P_p) - \psi(P_p, P_a) \ge R_a(P'_a, P_p) - \psi(P_p, P'_a), \quad \forall P'_a \le \bar{P}_a. \]

That is, among the IR operating points, the agent will pick an operating point that maximizes its payoff. So, in designing the contracts, the principal should take both the IR and IC constraints into account.

The contract design problem induces a game between the principal and the agent(s). The principal wants to design a contract that maximizes his payoff, but that at the same time gets accepted by the agent(s). From the set of contracts offered, the agent will accept the contract that maximizes its payoff [90]. The contract that finally gets accepted is called an equilibrium contract. The principal's contract design problem is given by the following optimization problem.

CD-OPT:

\[ \max_{P_p, P_a, \psi(\cdot,\cdot)} \; R_p(P_p, P_a) + \psi(P_p, P_a) \tag{6.5} \]
\[ \text{s.t. [IR]:} \quad R_a(P_a, P_p) - \psi(P_p, P_a) \ge \bar{R}_a \tag{6.6} \]
\[ \text{[IC]:} \quad R_a(P_a, P_p) - \psi(P_p, P_a) \ge R_a(P'_a, P_p) - \psi(P_p, P'_a), \quad \forall P'_a \le \bar{P}_a. \tag{6.7} \]

The optimization problem CD-OPT above is a non-convex, variational problem, which is in general difficult to solve. Existence of a solution can be established using standard arguments. We will, however, argue existence by construction in the subsequent discussion.

When there is no information asymmetry between the principal and the agent, a contract is called first-best. In such cases, the principal can make the agent operate at his reservation utility and extract all the surplus. When the principal has incomplete or imperfect information about the agent's type (e.g., the channel coefficients here), a contract is called second-best. In such cases, the principal "pays" an information rent to the agent.
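The IR and IC conditions of Definitions 8 and 9 can be checked numerically on a power grid. The following is only a minimal sketch under stated assumptions: the agent's rate function, the channel gains (`c_aa`, `c_pa`), the natural logarithm, the grid, and the two example payment functions are all hypothetical choices made for illustration, and the principal's power is held fixed while only the agent's power is searched.

```python
import math

def agent_rate(P_a, P_p, c_aa=1.0, c_pa=0.5):
    """Agent's rate when the principal's signal is treated as noise (N0 = 1).
    The gains c_aa and c_pa are illustrative assumptions."""
    return math.log(1 + c_aa * P_a / (1 + c_pa * P_p))

def surplus(payment, P_a, P_p):
    """Agent's net utility R_a(P_a, P_p) - psi(P_p, P_a)."""
    return agent_rate(P_a, P_p) - payment(P_p, P_a)

def is_ir(payment, P_p, grid, reservation=0.0, tol=1e-12):
    """IR: some feasible P_a gives the agent at least its reservation utility."""
    return any(surplus(payment, q, P_p) >= reservation - tol for q in grid)

def is_ic(payment, P_a, P_p, grid, tol=1e-12):
    """IC: at the operating point, no other feasible P_a gives a larger surplus."""
    u = surplus(payment, P_a, P_p)
    return all(u >= surplus(payment, q, P_p) - tol for q in grid)

# Feasible agent powers in [0, 1] (assumed power limit of 1).
grid = [i / 10 for i in range(11)]

# Payment equal to the agent's rate: the agent's surplus is zero everywhere.
full_extraction = lambda P_p, P_a: agent_rate(P_a, P_p)

# A flat fee: the agent then strictly prefers its maximum power.
flat_fee = lambda P_p, P_a: 0.1
```

With a payment equal to the agent's rate, every operating point satisfies both conditions when the reservation utility is zero, which is the mechanism the first-best contracts rely on. With the flat fee, only the maximum-power point is incentive compatible.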
Clearly, the principal's surplus under second-best contracts will be smaller than that under first-best contracts.

The principal and the agent can contract after they observe their types, i.e., after they estimate their channel coefficients. Such contracts are called ex post contracts. However, sometimes the principal and the agent can contract before they estimate their channel coefficients; such contracts are called ex ante contracts. In the subsequent discussion we will show that if there is no information asymmetry between the principal and the agent, then the optimal contracts are of ex post type, and if there is information asymmetry, then the optimal contracts are of ex ante type. The timing of the contractual mechanism between the principal and the agent is summarized in Figure 6.2.

Figure 6.2: Contractual mechanism between Principal and Agent. (The principal offers contract(s) $\psi(\cdot,\cdot)$; the agent accepts one or rejects all; the agent makes the payment $\psi(P_a, P_p)$; the two then share the channel.)

We define a social welfare function as $S(P_p, P_a) = R_p(P_p, P_a) + R_a(P_a, P_p)$. We say that a rate allocation $(R^{**}_p, R^{**}_a)$ is socially optimal if it is achieved by a power allocation $(P^{**}_p, P^{**}_a)$ that maximizes the social welfare subject to the power constraints. We will say that a spectrum contract is (socially) optimal if it achieves a socially optimal rate allocation.

Given the channel model and the communication scheme (like interference-limited communication, TDMA or FDMA), our aim is to design a spectrum contract which achieves a socially optimal rate allocation. In the following paragraphs we examine the optimum contract design for two typical system models. In particular, we argue that the equilibrium rate allocation achieved under those contract mechanisms can be far away from the socially optimal rate allocation. This motivates our use of SIC as the communication scheme.

The most obvious choice for a contract mechanism is the paying-for-interference model.
The primary user will allow the secondary users to share the spectrum while treating their signals as noise. Spectrum sharing will decrease the data rate of the primary user, who would otherwise have enjoyed an interference-free channel. So, the challenge is to design a contract mechanism which ensures that the primary user has an incentive to share the spectrum (via ensuring the payment from the secondary users), that the secondary users accept the offered contract (via ensuring a positive utility), and that the rate allocations are socially optimal. Denoting the received power from the $i$th transmitter at the $j$th receiver by $P_{i,j} = c_{i,j} P_i$, and setting $N_0 = 1$ without loss of generality, the contract design problem for a paying-for-interference model is given by CD-OPT (cf. equations (6.5)-(6.7)), where

\[ R_p(P_p, P_a) = \log\left(1 + \frac{P_{1,1}}{1 + P_{2,1}}\right), \qquad R_a(P_a, P_p) = \log\left(1 + \frac{P_{2,2}}{1 + P_{1,2}}\right). \]

Here, user 1 is the primary and user 2 is the secondary. Also, the primary is acting as the principal. The optimization problem CD-OPT is convex only if the Signal to Interference Ratio (SIR) of each user is very high. If the SIR of each user is very high, the achievable rate of user $i$, $\log(1 + \mathrm{SIR}_i)$, will be approximately equal to $\log(\mathrm{SIR}_i)$ and the problem can be solved by using Geometric Programming [117]. However, in practice, most systems operate with a very low SIR and the CD-OPT problem is non-convex. In [117], it was shown that this non-convex optimization problem can be solved iteratively by using Complementary Geometric Programming methods. However, the solution obtained may correspond to a local maximum rather than the global maximum. Thus, the equilibrium rate allocation achieved by the contract mechanism for the paying-for-interference model can be far away from the socially optimal rate allocation.

In a recent work [118], Duan et al. proposed a resource-exchange based spectrum sharing contract scheme for cognitive radio networks.
In their model, the interaction between the primary user and the secondary users involves three phases. In Phase 1, the primary transmitter broadcasts its data to the primary receiver and the involved secondary users' transmitters. In Phase 2, secondary receivers decode the primary user's data from Phase 1 and forward it to the primary receiver simultaneously, using the space-time codes assigned by the primary user. The use of synchronized space-time codes avoids interference at the primary receiver. As a reward for relaying its data, the primary user allocates dedicated time slots to the involved secondary users, and the secondary users share the spectrum using TDMA to avoid interference. The power spent by a secondary user to forward the primary user's data and the length of time that it gets in return are determined by designing a contract. However, here the contract is designed in such a way that the primary user always maximizes its utility. Also, in many scenarios, the contract design becomes a non-convex optimization problem, as in the paying-for-interference case. Moreover, the issue of social welfare optimality or fairness has not been addressed.

In this chapter, we assume that the primary and the secondary user(s), if they agree to a contract, use the "cooperative communication" scheme SIC [75]. Either can act as the dominant user, and either can be the principal, i.e., the one who proposes the contract. Such schemes typically assume that the channel coefficients (i.e., the types of the players) are common knowledge, and furthermore, that once the roles, rates and transmit powers to be used have been agreed upon, the users adhere to those commitments. We first design contracts for the complete information setting, and then discuss their robustness in the asymmetric/hidden information setting. We investigate the social optimality of the spectrum contract. Furthermore, either the principal or the agent(s) can deviate from the agreed contract: they can transmit at a higher power, or at a greater rate.
This can cause sub-optimal performance of the cooperative communication scheme. We will provide incentive mechanisms to avoid this moral hazard problem.

6.3 First-best Contracts for Cognitive Spectrum Sharing

We first assume that there is no informational asymmetry between the principal and the agent. The channel coefficients and the power constraints are known to all the users. This is a first step towards understanding the more interesting case of informational asymmetry, in which the channel coefficients are not common knowledge and users can manipulate their estimation. It turns out that the informational asymmetry case becomes easy to understand once we understand and design contracts for the full information case. We specify the conditions under which (i) a first-best contract exists, (ii) the first-best contract results in a Pareto-optimal operating point, and (iii) the first-best contract is socially optimal.

Denote the received power from the $i$th transmitter at the $j$th receiver by $P_{i,j} = c_{i,j} P_i$, and without loss of generality, set $N_0 = 1$. Now, if the SIC scheme is successful, the dominant user's rate depends only on its received power at its own receiver. For example, when the primary acts as the dominant user, its transmission rate will be

\[ R_{dd}(P_p, P_s) := \log(1 + P_{p,p}), \tag{6.8} \]

where $P_{p,p}$ is the received power of the primary transmitter at the primary receiver. The non-dominant user must transmit at a rate smaller than both

\[ R_{nd}(P_s, P_p) := \log\left(1 + \frac{P_{s,p}}{1 + P_{p,p}}\right) \quad \text{and} \quad R_{nn}(P_s, P_p) := \log\left(1 + \frac{P_{s,s}}{1 + P_{p,s}}\right). \tag{6.9} \]

For the SIC scheme to be successful, the non-dominant user must transmit at a rate smaller than $R_{nd}(P_s, P_p)$ so that its signal can be decoded by the dominant user's receiver. He must also transmit at a rate smaller than $R_{nn}(P_s, P_p)$ so that his own receiver can decode his own signal.
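The rate expressions (6.8) and (6.9) can be sketched in a few lines of code. This is an illustrative sketch only: the dictionary layout for the channel coefficients, the function name `sic_rates`, and the use of the natural logarithm (rates in nats) are assumptions made here, not something fixed by the text. The key `"sp"` stands for $c_{s,p}$, the gain from the secondary transmitter to the primary receiver, and so on.

```python
import math

def sic_rates(c, P_p, P_s):
    """Rates under SIC with the primary as dominant user and N0 = 1.

    c maps "pp", "ps", "sp", "ss" to channel gains c_{i,j}
    (transmitter i, receiver j). Returns (R_dd, R_nd, R_nn, R_s_max),
    where R_s_max is the largest rate the non-dominant user can use.
    """
    P_pp = c["pp"] * P_p   # primary signal at primary receiver
    P_ps = c["ps"] * P_p   # primary signal at secondary receiver
    P_sp = c["sp"] * P_s   # secondary signal at primary receiver
    P_ss = c["ss"] * P_s   # secondary signal at secondary receiver

    r_dd = math.log(1 + P_pp)                # equation (6.8)
    r_nd = math.log(1 + P_sp / (1 + P_pp))   # decodable at the dominant receiver
    r_nn = math.log(1 + P_ss / (1 + P_ps))   # decodable at its own receiver
    return r_dd, r_nd, r_nn, min(r_nd, r_nn)

# Example with symmetric gains and unit powers (illustrative values).
gains = {"pp": 1.0, "ss": 1.0, "sp": 0.5, "ps": 0.5}
r_dd, r_nd, r_nn, r_s = sic_rates(gains, 1.0, 1.0)
```

In this example the binding constraint for the non-dominant user is $R_{nd}$, i.e., decodability at the dominant user's receiver.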
6.3.1 Spectrum Contracts Without Time-Sharing

We first consider the problem of contract design with full information when the roles of the users are fixed (i.e., the principal offers contracts to the agent in which its role as a dominant or non-dominant user is fixed). We focus on a two-user setting for simplicity, with one primary and one secondary user. Extension to multiple secondary users can be done under certain conditions, and is discussed later.

In the two-user setting, there are four cases: either of the two users can be the principal, and either can be the dominant user. Below, we discuss only one of these four cases. The other cases can be analyzed in the same manner with similar results. The exact theorem statements for the other cases are in the appendix (Theorems 30-32). Their proofs are omitted as they are very similar to that of Theorem 24. We assume that the reservation utilities of the users are $\bar{R}_p = \log(1 + P_{p,p})$ and $\bar{R}_s = 0$.

Case A: Primary user is the principal and the dominant user. If the secondary user accepts a contract, and both cooperate in using the SIC scheme, then the maximum achievable rate of the primary user is given by (6.8). The secondary user's maximum transmission rate is limited by the minimum of the rates $R_{nd}$ and $R_{nn}$ in equation (6.9). Define (for principal = Primary, dominant = Primary)

\[ \psi^{nd,A}(P_s; \bar{P}_p) = R_{nd}(P_s, \bar{P}_p), \quad \text{and} \quad \psi^{nn,A}(P_s; \bar{P}_p) = R_{nn}(P_s, \bar{P}_p). \tag{6.10} \]

Note that in these contract functions, the power of the primary is fixed at $\bar{P}_p$, i.e., the secondary user (if he accepts) can only pick his own power $P_s$. We now specify sufficient conditions on the channel coefficients under which the primary user offers different first-best contract functions.

Theorem 24. Let $\eta_{P,1} := (c_{ss} - c_{sp})$ and $\eta_{P,2} := (c_{sp} c_{ps} - c_{pp} c_{ss})$.
(i) If $\eta_{P,1} > \eta_{P,2} \bar{P}_p$, then the (first-best) equilibrium contract is $(\psi^{nd,A}(\cdot; \bar{P}_p), (\bar{P}_p, \bar{P}_s))$, and the equilibrium rate allocation is $(R_{dd}(\bar{P}_p, \bar{P}_s), R_{nd}(\bar{P}_s, \bar{P}_p))$, which is socially optimal (and Pareto-optimal as well).
(ii) If $\eta_{P,1} \le \eta_{P,2} \bar{P}_p$ and $c_{p,p} \ge c_{p,s}$, then the (first-best) equilibrium contract is $(\psi^{nn,A}(\cdot; \bar{P}_p), (\bar{P}_p, \bar{P}_s))$, and the equilibrium rate allocation is $(R_{dd}(\bar{P}_p, \bar{P}_s), R_{nn}(\bar{P}_s, \bar{P}_p))$, which is socially optimal (and Pareto-optimal as well).
(iii) There exist channel conditions (in the case $\eta_{P,1} < \eta_{P,2} \bar{P}_p$, $c_{p,p} < c_{p,s}$) under which the (first-best) equilibrium contract does not yield Pareto-optimal rates.

Proof. (i) Consider the contract design problem CD-OPT in (6.5) with p = p(rimary) and a = s(econdary). In addition to the [IR] and [IC] constraints there, here we also have the following:

\[ R_p(P_p, P_s) = R_{dd}(P_p, P_s), \tag{6.11} \]
\[ R_s(P_s, P_p) \le R_{nd}(P_s, P_p) = \log\left(1 + \frac{P_{s,p}}{1 + P_{p,p}}\right), \tag{6.12} \]
\[ R_s(P_s, P_p) \le R_{nn}(P_s, P_p) = \log\left(1 + \frac{P_{s,s}}{1 + P_{p,s}}\right). \tag{6.13} \]

We will call this CD-OPT-PP, the contract design optimization problem for Case A. It is now easy to check that for $\eta_{P,1} > \eta_{P,2} \bar{P}_p$, constraint (6.12) is tight. This can be seen by comparing the right-hand sides of inequalities (6.12) and (6.13). Since $\bar{R}_s = 0$, an obvious solution for the optimal payment function $\psi^*(\cdot,\cdot)$ is $R_s(\cdot,\cdot)$. With this, the surplus of the agent (the secondary user) is zero at any operating point. Thus, the IR and IC constraints are satisfied. Substituting $\psi^*(\cdot,\cdot)$ in the objective function and maximizing gives the optimum power allocation of the users as $P_p = \bar{P}_p$ and $P_s = \bar{P}_s$. Now, assuming that the secondary user prefers a higher rate with zero surplus to a lower rate with zero surplus, he will pick $P_s = \bar{P}_s$. Note, however, that the secondary user can maximize his rate by making the interference from the primary user arbitrarily small, which will not be optimal for the primary user.
Thus, the primary user offers contracts with his transmission power fixed at $\bar{P}_p$. The secondary only gets to pick his own transmission power $P_s$ and the corresponding payment $\psi^{nd,A}(P_s; \bar{P}_p)$. Thus, offering $\psi^{nd,A}(\cdot; \bar{P}_p)$ by the primary user, and picking $P_s = \bar{P}_s$ by the secondary user, is an equilibrium. It is easy to check that the corresponding rate allocation maximizes the sum rate $R_p + R_s$, and is thus Pareto-optimal as well.

(ii) As before, consider the CD-OPT-PP problem, and assume that the channel coefficients are such that $\eta_{P,1} \le \eta_{P,2} \bar{P}_p$ and $c_{p,p} \ge c_{p,s}$. Then, for any feasible $P_p \in [0, \bar{P}_p]$, the right-hand side of inequality (6.13) is always smaller than the right-hand side of inequality (6.12). Thus, inequality (6.13) will be tight (an equality) at the optimum. Furthermore, since $\bar{R}_s = 0$, an obvious solution for $\psi^*(\cdot,\cdot)$, as before, is $R_s(\cdot,\cdot)$. With this, the surplus of the agent (the secondary user) is zero at any operating point and the IR and IC constraints are satisfied. Substituting $\psi^*(\cdot,\cdot)$ in the objective function and maximizing gives the optimum power allocation of the users as $P_p = \bar{P}_p$ and $P_s = \bar{P}_s$. Also, the secondary will pick $P_s = \bar{P}_s$ as before. The principal again fixes his transmission power at $\bar{P}_p$ since, given the choice, the secondary user would pick $P_p = 0$. Thus, the secondary only gets to pick $P_s$ and the corresponding payment $\psi^{nn,A}(P_s; \bar{P}_p)$. Thus, offering $\psi^{nn,A}(\cdot; \bar{P}_p)$ by the primary user, and picking $P_s = \bar{P}_s$ by the secondary user, is an equilibrium, with the corresponding rate allocation being sum-rate optimal and Pareto-optimal.

(iii) When $\eta_{P,1} < \eta_{P,2} \bar{P}_p$, $c_{p,p} < c_{p,s}$ and

\[ c_{pp}\left(1 + c_{ps} \bar{P}_p + c_{ss} \bar{P}_s + c_{sp} \bar{P}_s + c_{ps}^2 \bar{P}_p^2\right) < c_{ps} c_{ss} \bar{P}_s, \tag{6.14} \]

it can be checked that the optimal solution of the CD-OPT-PP optimization problem will occur at $P_s = \bar{P}_s$ and $P_p < \bar{P}_p$, in which case the corresponding rate pair will be in the interior of the achievable rate region, and cannot be Pareto-optimal.
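The case distinction in Theorem 24 reduces to comparing $\eta_{P,1}$ with $\eta_{P,2} \bar{P}_p$, which is equivalent to asking which of $R_{nd}$ and $R_{nn}$ is smaller at the primary's maximum power. A minimal sketch of that test follows; the function name, the dictionary layout of the channel gains, and the returned labels are hypothetical, and the two parameter sets are the ones quoted for Figure 6.4.

```python
def case_a_contract(c, P_p_max):
    """Which first-best contract function Theorem 24 selects in Case A.

    c maps "pp", "ps", "sp", "ss" to channel gains c_{i,j}.
    Returns "nd" for part (i), "nn" for part (ii), and "uncovered" when
    neither sufficient condition holds (the regime of part (iii)).
    """
    eta_1 = c["ss"] - c["sp"]
    eta_2 = c["sp"] * c["ps"] - c["pp"] * c["ss"]
    if eta_1 > eta_2 * P_p_max:
        return "nd"          # constraint (6.12) is the tight one
    if c["pp"] >= c["ps"]:
        return "nn"          # constraint (6.13) is the tight one
    return "uncovered"

# Channel parameters quoted for Figure 6.4 (i) and (ii).
fig_i = {"pp": 1.0, "ss": 1.0, "sp": 0.5, "ps": 0.5}
fig_ii = {"pp": 0.8, "ss": 0.7, "sp": 0.8, "ps": 0.4}

# For fig_ii the switch between the two contract functions happens at
# eta_1 / eta_2, which is roughly 0.42.
threshold_ii = (fig_ii["ss"] - fig_ii["sp"]) / (
    fig_ii["sp"] * fig_ii["ps"] - fig_ii["pp"] * fig_ii["ss"]
)
```

For the first parameter set $\eta_{P,2}$ is negative and $\eta_{P,1}$ is positive, so the "nd" contract applies for every positive power limit; for the second set the selection changes once $\bar{P}_p$ exceeds the threshold.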
We summarize the results of Theorem 24 in Figure 6.3. The operating points are the rate pairs $(R_p, R_s)$ at the equilibrium and are shown in Figure 6.1.

Figures 6.4 (i) and 6.4 (ii) show the value of the contract function $\psi(\cdot,\cdot)$ for different values of $\bar{P}_s$, $\bar{P}_p$. In Figure 6.4 (i), the channel parameters are $c_{pp} = c_{ss} = 1.0$, $c_{sp} = c_{ps} = 0.5$. So, $\eta_{P,1} > \eta_{P,2} \bar{P}_p$ for any positive value of $\bar{P}_p$, and by Theorem 24 the optimal contract function is $\psi^{nd,A}(\bar{P}_s, \bar{P}_p)$. In Figure 6.4 (ii), the channel parameters are $c_{pp} = 0.8$, $c_{ss} = 0.7$, $c_{sp} = 0.8$, $c_{ps} = 0.4$. So, $\eta_{P,1} \le \eta_{P,2} \bar{P}_p$ when $\bar{P}_p \in [0, 0.42]$ and $\eta_{P,1} > \eta_{P,2} \bar{P}_p$ when $\bar{P}_p \in [0.42, 1.0]$. So, by Theorem 24 the optimal contract function is $\psi^{nn,A}(\bar{P}_s, \bar{P}_p)$ when $\bar{P}_p \in [0, 0.42]$ and $\psi^{nd,A}(\bar{P}_s, \bar{P}_p)$ when $\bar{P}_p \in [0.42, 1.0]$.

Remarks.
1. The other three cases, in which the secondary user is either the principal or the dominant user, can also be analyzed in a manner similar to the above, and the form of the equilibrium contracts can be obtained. We state some notable observations.
2. (i) Case B: Secondary user is the principal and primary user is dominant. In this case, the principal only offers an arbitrarily small payment $\psi^B = \epsilon > 0$ to the primary user with fixed powers $(\bar{P}_p, \bar{P}_s)$, and this results in an equilibrium contract under most channel conditions. The reason is that, since the primary user is dominant, it suffers no performance degradation by sharing spectrum using SIC. Thus, the secondary user need not offer it any additional surplus.
(ii) Case C: Secondary user is the principal and dominant. In this case, the secondary user fixes $P_s = \bar{P}_s$, and lets the primary user pick its power.
(iii) Case D: Primary user is the principal and secondary user is dominant. In this case, the principal fixes $P_p = \bar{P}_p$, and lets the secondary user pick his own power.
3. In all four cases, there are some channel conditions under which no (first-best) Pareto-optimal equilibrium contract exists.

Lesson #1.
Under most channel conditions, first-best contracts that are Pareto-optimal at equilibrium can be designed. But under some channel conditions, with fixed roles, it is impossible to do so.

6.3.2 Moral Hazard Problems

An obvious question is whether either of the users has an incentive to deviate from the contractual agreement. That is, the principal could offer a menu of contracts to the agent. The agent may accept one. Then the payment is made. But at the time of communication, either of them may use a different rate or transmission power than agreed. This is called the moral hazard problem. And if it can happen, how can it be prevented?

The moral hazard problem can indeed happen. Consider the case where the primary user is the principal and dominant, with channel coefficients satisfying the assumptions in case (i) of Theorem 24. Then, $R_{allowed} = R_{nd}(\bar{P}_s, \bar{P}_p) < R_{achievable} = R_{nn}(\bar{P}_s, \bar{P}_p)$, i.e., the secondary user can increase his rate beyond the agreed rate $R_{allowed}$, and get a higher rate across to his own receiver while making a lower payment. This deviation by the secondary user, however, can be detected by the primary/dominant user, since it will cause the successive interference cancellation (SIC) to fail at the primary user, who will then get zero rate. To avoid the moral hazard problem, the principal can require the secondary user to make a refundable access deposit, say larger than the principal's reservation utility $\bar{R}_p$. If the secondary user deviates, this deposit is forfeited.

Channel conditions | Contract function | Powers | Operating point | Optimality
$\eta_{P,1} > \eta_{P,2} \bar{P}_p$ | $\psi^{nd,A}(\cdot; \bar{P}_p)$ | $(\bar{P}_p, \bar{P}_s)$ | B | Socially/Pareto
$\eta_{P,1} \le \eta_{P,2} \bar{P}_p$, $c_{p,p} \ge c_{p,s}$ | $\psi^{nn,A}(\cdot; \bar{P}_p)$ | $(\bar{P}_p, \bar{P}_s)$ | B | Socially/Pareto
$\eta_{P,1} \le \eta_{P,2} \bar{P}_p$, $c_{p,p} < c_{p,s}$ | $\psi^{nn,A}(\cdot; P_p)$ | $(P_p < \bar{P}_p, \bar{P}_s)$ | Interior | non-Pareto

Figure 6.3: Summary of the results for Case A under full information.
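The moral hazard argument of Section 6.3.2 can be illustrated numerically. The sketch below is a simplified model and not the thesis's formal mechanism: it assumes the secondary's benefit from deviating is the rate difference $R_{nn} - R_{nd}$, that a detected deviation costs it the whole deposit, and that rates use the natural logarithm. All function names and the example parameters are hypothetical.

```python
import math

def deviation_gain(c, P_p, P_s):
    """Extra rate the non-dominant user could get by ignoring the R_nd limit."""
    r_nd = math.log(1 + c["sp"] * P_s / (1 + c["pp"] * P_p))
    r_nn = math.log(1 + c["ss"] * P_s / (1 + c["ps"] * P_p))
    return max(0.0, r_nn - r_nd)

def primary_rate(c, P_p, P_s, secondary_rate, tol=1e-12):
    """Dominant user's rate: zero if SIC fails because the secondary's
    rate exceeds what the dominant receiver can decode."""
    r_nd = math.log(1 + c["sp"] * P_s / (1 + c["pp"] * P_p))
    if secondary_rate > r_nd + tol:
        return 0.0
    return math.log(1 + c["pp"] * P_p)

def deposit_deters(gain, deposit):
    """Assumed deterrence rule: forfeiting the deposit outweighs the gain."""
    return deposit >= gain

gains = {"pp": 1.0, "ss": 1.0, "sp": 0.5, "ps": 0.5}
gain = deviation_gain(gains, 1.0, 1.0)
deposit = math.log(1 + gains["pp"] * 1.0)   # the principal's reservation rate
```

In this example the deviation raises the secondary's rate but drives the primary's rate to zero, and a deposit equal to the principal's reservation rate exceeds the secondary's gain.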
Figure 6.4: (i) Plot of the optimal contract function $\psi(\bar{P}_s, \bar{P}_p)$ against $\bar{P}_s$ and $\bar{P}_p$ for Case A. Channel parameters are $c_{pp} = c_{ss} = 1.0$, $c_{sp} = c_{ps} = 0.5$. (ii) Plot of the optimal contract function $\psi(\bar{P}_s, \bar{P}_p)$ against $\bar{P}_s$ and $\bar{P}_p$ for Case A. Channel parameters are $c_{pp} = 0.8$, $c_{ss} = 0.7$, $c_{sp} = 0.8$, $c_{ps} = 0.4$.

In the same way, if the primary user is non-dominant, under some channel conditions it has an incentive to deviate from the agreed rate and powers. Again, such a deviation will cause the SIC scheme to fail at the secondary receiver, who will then get zero rate. It can thus be detected. The principal in this case can avoid the moral hazard problem by deferred payment, i.e., the secondary user makes the payment only after channel use. If the primary user deviates, the payment is forfeited.

Lesson #2. Using simple incentive schemes, the moral hazard problems can always be avoided.

6.3.3 First-best Spectrum Contracts with Time-Sharing

We now allow for time sharing between the primary and secondary users. The time-sharing roles are determined as part of the contractual negotiation. Denote by $\alpha$ the time-sharing variable. The principal acts as a dominant user an $\alpha$ fraction of the time, and as a non-dominant user a $\bar{\alpha} = 1 - \alpha$ fraction of the time, with $\alpha \in [0, 1]$. Either the primary or the secondary can act as the principal. Here we discuss only one case. The other case can be analyzed in the same manner.

Definition 10. A time-sharing spectrum contract is a tuple $(\alpha, \psi^{TS}(\cdot,\cdot,\cdot), P_p, P_a)$ such that the operating point is $(\alpha, (P_p, P_a))$ and the payment is $\psi^{TS}(\alpha, P_p, P_a)$.

Case: Primary user is the principal and dominant.
The contract design problem of the principal in the time-sharing case is the same as CD-OPT in (6.5) with p = p(rimary) and a = s(econdary), with the rate functions $R_p$ and $R_s$ given as

\[ R_p(P_p, P_s) = \alpha R_{dd}(P_p, P_s) + \bar{\alpha} R_n(P_p, P_s), \qquad R_s(P_s, P_p) = \alpha R_n(P_s, P_p) + \bar{\alpha} R_{dd}(P_s, P_p), \tag{6.15} \]

where

\[ R_n(P_s, P_p) \le R_{nd}(P_s, P_p) := \log\left(1 + \frac{P_{s,p}}{1 + P_{p,p}}\right), \tag{6.16} \]
\[ R_n(P_s, P_p) \le R_{nn}(P_s, P_p) := \log\left(1 + \frac{P_{s,s}}{1 + P_{p,s}}\right), \tag{6.17} \]
\[ R_n(P_p, P_s) \le R_{nd}(P_p, P_s) := \log\left(1 + \frac{P_{p,s}}{1 + P_{s,s}}\right), \tag{6.18} \]
\[ R_n(P_p, P_s) \le R_{nn}(P_p, P_s) := \log\left(1 + \frac{P_{p,p}}{1 + P_{s,p}}\right). \tag{6.19} \]

Define:

\[ \psi^{TS}_{nd}(\alpha; P_s; \bar{P}_p) := \alpha R_{nd}(P_s, \bar{P}_p) + \bar{\alpha} R_{dd}(P_s, \bar{P}_p), \tag{6.20} \]
\[ \psi^{TS}_{nn}(\alpha; P_s; \bar{P}_p) := \alpha R_{nn}(P_s, \bar{P}_p) + \bar{\alpha} R_{dd}(P_s, \bar{P}_p). \tag{6.21} \]

We now give the equilibrium contract functions and rate allocations in the time-sharing case.

Theorem 25. Let $\eta_{S,1} := (c_{pp} - c_{ps})$ and $\eta_{S,2} := (c_{ps} c_{sp} - c_{pp} c_{ss})$, and let $\eta_{P,1}, \eta_{P,2}$ be as defined in Theorem 24.
(i) If $\eta_{P,1} \le \eta_{P,2} \bar{P}_p$ and $c_{p,p} \ge c_{p,s}$, then the equilibrium contract is $(\alpha = 0, \psi^{TS}_{nn}(\cdot;\cdot; \bar{P}_p), (\bar{P}_p, \bar{P}_s))$. If $\eta_{S,1} > \eta_{S,2} \bar{P}_s$, the equilibrium rate allocation is $(R_{nd}(\bar{P}_p, \bar{P}_s), R_{dd}(\bar{P}_s, \bar{P}_p))$, which is Pareto-optimal. If $\eta_{S,1} \le \eta_{S,2} \bar{P}_s$, then the equilibrium rate allocation is $(R_{nn}(\bar{P}_p, \bar{P}_s), R_{dd}(\bar{P}_s, \bar{P}_p))$, which is Pareto-optimal.
(ii) If $\eta_{P,1} > \eta_{P,2} \bar{P}_p$ and $\frac{c_{p,s} - c_{s,s}}{c_{s,s}\, c_{p,p}} < \bar{P}_p$, then the equilibrium contract is $(\alpha = 0, \psi^{TS}_{nd}(\cdot; P_s; \bar{P}_p), (\bar{P}_p, \bar{P}_s))$. If $\eta_{S,1} \le \eta_{S,2} \bar{P}_p$, the equilibrium rate allocation is $(R_{nn}(\bar{P}_p, \bar{P}_s), R_{dd}(\bar{P}_s, \bar{P}_p))$, which is Pareto-optimal. If $\eta_{S,1} > \eta_{S,2} \bar{P}_p$, the equilibrium rate allocation is $(R_{nd}(\bar{P}_p, \bar{P}_s), R_{dd}(\bar{P}_s, \bar{P}_p))$, which is Pareto-optimal.
(iii) If $\eta_{P,1} > \eta_{P,2} \bar{P}_p$ and $\frac{c_{p,s} - c_{s,s}}{c_{s,s}\, c_{p,p}} > \bar{P}_p$, then the equilibrium contract is $(\alpha = 1, \psi^{TS}_{nd}(\cdot; P_s; \bar{P}_p), (\bar{P}_p, \bar{P}_s))$.
The equilibrium rate allocation is $(R_{dd}(\bar{P}_p, \bar{P}_s), R_{nd}(\bar{P}_s, \bar{P}_p))$, which is Pareto-optimal.
(iv) There exist channel conditions under which the equilibrium contract is not Pareto-optimal.

Proof. (i) As in the proof of Theorem 24, the obvious solution for $\psi^*(\cdot,\cdot)$ is $R_s(\cdot,\cdot)$. When $\eta_{P,1} \le \eta_{P,2} \bar{P}_p$ and $c_{p,p} \ge c_{p,s}$, the constraint (6.17) is tight. Then by equation (6.15), we can calculate $R_s$. It is easy to observe that in this case, $R_s(P_s, P_p) = \psi^{TS}_{nn}(\alpha; P_s; P_p)$. With this form of the contract, the net utility of the agent is zero at any operating point. Thus the IR and IC constraints are satisfied. Now, since $R_s(P_s, P_p)$ is monotonic in $P_s$, the agent will pick $P_s = \bar{P}_s$. Furthermore, the secondary user, if given the choice, will pick $P_p$ to make the interference from the primary user arbitrarily small. But this does not maximize the surplus of the primary user, who thus fixes his transmission power at $\bar{P}_p$, so that the secondary user only gets to pick his transmission power $P_s$ and the corresponding payment $\psi^{TS}_{nn}(\cdot; P_s; \bar{P}_p)$. Thus, offering $\psi^{TS}_{nn}(\cdot;\cdot; \bar{P}_p)$ by the primary user, and picking $P_s = \bar{P}_s$ by the secondary user, is an equilibrium. Under $\eta_{P,1} \le \eta_{P,2} \bar{P}_p$, it is easy to check that $R_{dd} \ge R_{nn}$. So, the secondary will choose $\alpha = 0$. Now, when $\eta_{S,1} < \eta_{S,2} \bar{P}_s$, $R_n(\cdot,\cdot) = R_{nn}(\cdot,\cdot)$, and when $\eta_{S,1} > \eta_{S,2} \bar{P}_s$, $R_n(\cdot,\cdot) = R_{nd}(\cdot,\cdot)$. So, the equilibrium rates can be calculated from equations (6.15). It is easy to check that the corresponding rate allocation is Pareto-optimal.
(ii) The argument in this case is very similar to case (i). The only difference is that the equilibrium contract is $\psi^{TS}_{nd}(\cdot;\cdot; P_p)$ with $\alpha = 0$.
(iii) The argument in this case is again very similar to case (i). The only difference is that the equilibrium contract is $\psi^{TS}_{nd}(\cdot;\cdot; P_p)$ with $\alpha = 1$.
(iv) When $\eta_{P,1} < \eta_{P,2} \bar{P}_p$, $c_{p,p} < c_{p,s}$ and $\eta_{S,1} < \eta_{S,2} \bar{P}_p$, $c_{s,s} < c_{s,p}$, whatever the choice of $\alpha$, the condition is similar to that in Theorem 24 (iii).
So, these channel conditions result in an equilibrium contract that does not yield Pareto-optimal rates.

Remarks. Above, we have given forms for equilibrium contracts which allow for time sharing of the dominant role. The time-sharing fraction is itself determined as part of the contract. The key observation here is that despite allowing for any time sharing, i.e., $\alpha \in [0, 1]$, at equilibrium we only see $\alpha \in \{0, 1\}$. That is, though the dominant role is now determined as part of the contractual "negotiation", we do not see any actual time sharing.

Lesson #3. While time-sharing contracts are more flexible, the problem of non-existence of Pareto-optimal first-best contracts under some channel conditions cannot be resolved even with this increased flexibility.

6.3.4 Extension to multiple secondary users

The analysis in the previous sections can be extended to the case of multiple secondary users. We illustrate this by considering a three-user case. We show that under some channel conditions, a (first-best) equilibrium contract exists and the equilibrium rate allocation is socially optimal.

We assume that the primary is the dominant user and that the data from both the secondary users will be decoded by the primary's receiver. Secondary user 1 treats the primary user's data as noise, but can decode secondary user 2's data. Secondary user 2 acts as the non-dominant user for the other two users. Being the dominant user, the primary user gets an interference-free channel and has a rate $R_0 = \log(1 + P_{p,p})$. The rate of secondary user 1 has two constraints: it should be decodable at the primary user's receiver as well as at its own receiver. These rate constraints are given by $R_{s_1,p} = \log\left(1 + \frac{P_{s_1,p}}{1 + P_{p,p}}\right)$ and $R_{s_1,s_1} = \log\left(1 + \frac{P_{s_1,s_1}}{1 + P_{p,s_1}}\right)$ respectively. Similarly, secondary user 2's rate has three constraints: it should be decodable at the primary user's receiver, at secondary user 1's receiver and at its own receiver.
These constraint rates are given by $R_{s_2,p} = \log\left(1 + \frac{P_{s_2,p}}{1 + P_{p,p} + P_{s_1,p}}\right)$, $R_{s_2,s_1} = \log\left(1 + \frac{P_{s_2,s_1}}{1 + P_{p,s_1} + P_{s_1,s_1}}\right)$ and $R_{s_2,s_2} = \log\left(1 + \frac{P_{s_2,s_2}}{1 + P_{p,s_2} + P_{s_1,s_2}}\right)$ respectively. Assuming that the primary user is acting as the principal, his contract design problem is as follows:

\[ \max_{P_p, P_{s_1}, P_{s_2}, \psi_1, \psi_2} \; R_0(P_p, P_{s_1}, P_{s_2}) + \psi_1(P_p, P_{s_1}, P_{s_2}) + \psi_2(P_p, P_{s_1}, P_{s_2}) \]
\[ \text{s.t. [IR1]:} \quad R_1(P_p, P_{s_1}, P_{s_2}) - \psi_1(P_p, P_{s_1}, P_{s_2}) \ge 0 \]
\[ \text{[IR2]:} \quad R_2(P_p, P_{s_1}, P_{s_2}) - \psi_2(P_p, P_{s_1}, P_{s_2}) \ge 0 \]
\[ \text{[IC1]:} \quad R_1(P_p, P_{s_1}, P_{s_2}) - \psi_1(P_p, P_{s_1}, P_{s_2}) \ge R_1(P_p, P'_{s_1}, P_{s_2}) - \psi_1(P_p, P'_{s_1}, P_{s_2}), \quad \forall P'_{s_1} \le \bar{P}_{s_1} \]
\[ \text{[IC2]:} \quad R_2(P_p, P_{s_1}, P_{s_2}) - \psi_2(P_p, P_{s_1}, P_{s_2}) \ge R_2(P_p, P_{s_1}, P'_{s_2}) - \psi_2(P_p, P_{s_1}, P'_{s_2}), \quad \forall P'_{s_2} \le \bar{P}_{s_2} \]
\[ R_1 = \min(R_{s_1,p}, R_{s_1,s_1}), \qquad R_2 = \min(R_{s_2,p}, R_{s_2,s_1}, R_{s_2,s_2}). \]

We now specify sufficient conditions on the channel coefficients under which the primary user offers socially optimal first-best contracts.

Theorem 26. (i) When the channel conditions are such that $c_{p,p} > c_{p,s_1}$, $c_{p,p} > c_{p,s_2}$, $c_{p,s_1} > c_{p,s_2}$, and $c_{s_1,s_1} > c_{s_1,s_2}$, there exists a (first-best) equilibrium contract. The contract functions are of the form $\psi_1(\bar{P}_p, \cdot, P_{s_2}) = R_1(\bar{P}_p, \cdot, P_{s_2})$, $\psi_2(\bar{P}_p, P_{s_1}, \cdot) = R_2(\bar{P}_p, P_{s_1}, \cdot)$. The equilibrium rate allocation is socially optimal (and Pareto-optimal as well).
(ii) There exist channel conditions under which the (first-best) equilibrium contract does not yield Pareto-optimal rates.

Proof. As in the two-user case, we assume that the IR constraints are tight, find $\psi_1$ and $\psi_2$, substitute them into the objective function and solve by treating it as an unconstrained optimization problem. Now, the exact expressions for $\psi_1$ and $\psi_2$ depend on $R_1$ and $R_2$, and on which of the constraints on $R_1$ and $R_2$ are tight. This depends on the channel conditions. So, there can be six different forms for the objective function.
By taking each case separately, it can easily be shown that the conditions given in the theorem are sufficient for the existence of a first-best contract. Here we give one particular case for illustration. Assume $R_1 = R_{s_1,s_1}$ and $R_2 = R_{s_2,s_2}$. Then the optimization problem reduces to

\[ \max_{P_p, P_{s_1}, P_{s_2}} \; \frac{(1 + P_{p,p})}{(1 + P_{p,s_1})} \cdot \frac{(1 + P_{p,s_1} + P_{s_1,s_1})}{(1 + P_{p,s_2} + P_{s_1,s_2})} \cdot (1 + P_{p,s_2} + P_{s_1,s_2} + P_{s_2,s_2}). \tag{6.22} \]

It is straightforward to show that the above function is monotonic in $(P_p, P_{s_1}, P_{s_2})$ under the conditions given in the theorem. So, each user will transmit at his maximum power and the sum rate will be on the Pareto-optimal boundary.

Now, consider the same case as above, except that $c_{p,p} < c_{p,s_1}$ instead of $c_{p,p} > c_{p,s_1}$. Similar to case (iii) in Theorem 24, it can be shown that the solution of the above optimization problem will be $P_{s_1} = \bar{P}_{s_1}$, $P_{s_2} = \bar{P}_{s_2}$, but $P_p < \bar{P}_p$, in which case the corresponding rate vector will be in the interior of the achievable rate region, and cannot be Pareto-optimal.

Lesson #4. With multiple secondary users, first-best contracts that are Pareto-optimal can be designed. But under some channel conditions, it is impossible to do so.

6.4 Hidden Information: Second-best Cognitive Spectrum Contracts

In the channel estimation phase each user sends a pilot signal, all receivers calculate the channel strength by estimating the received power, and each receiver then feeds the channel strength back to the corresponding transmitter. Clearly, this channel estimation requires cooperation from all the users. In the previous section, we assumed that the channel information is common knowledge and correct. However, when the users are selfish and rational, each one may want to maximize his own utility. So, each user may manipulate the channel measurements if he can increase his own payoff by doing so. Now the question is, are the contracts designed in the previous section robust to hidden information?
If not, can one design contracts that are robust and still Pareto-optimal at equilibrium?

Our main result is that hidden information doesn't always hurt. When the primary user is dominant, neither the agent nor the principal will manipulate the channel measurement. Thus, it is possible to design an ex post second-best contract which achieves the first-best outcome. However, when the secondary user is dominant, it is impossible to design an ex post second-best contract which achieves the first-best outcome. For those cases, we propose an ex ante second-best contract which achieves the first-best outcome. In the following discussion, we analyze Cases A-D and propose the optimal contract for each case. We present only Case A in detail. The other cases can be found in the appendix.

Case A: Primary user is the principal and the dominant user.

Theorem 27. Let the contract functions $\psi^{A}_{nd} = \psi^{nd,A}$ and $\psi^{A}_{nn} = \psi^{nn,A}$ be as defined in (6.10). In Case A, the agent will report the channel coefficients truthfully. The equilibrium contracts and rate allocations are the same as in the complete information case (as in Theorem 24).

Proof. Since the primary is the dominant user here, the secondary user's data is decoded at the primary user's receiver. If the secondary user tries to increase its rate above the rate $R_s$ agreed upon in the contract, this will lead to SIC decoding failure as explained in Section 6.3.2, and hence the deviation will be detected by the primary. This means that the secondary user cannot deviate from the agreed-upon rate. Since the maximum possible rate at which the secondary user can transmit depends on its channel coefficients, the best the secondary user can do is to report the true values of its channel coefficients to the primary user and get a contract for the maximum possible rate.
Now, if the primary user manipulates the channel measurements to increase the transmission rate of the secondary user (and hence the corresponding payment), the secondary user’s rate will be higher than the maximum possible decodable rate at the primary’s receiver. So, the SIC scheme will fail and hence the primary user’s own decoding will fail. Thus, the primary user won’t manipulate the channel measurements to get a higher payment.

Thus, in this case neither the primary nor the secondary user has an incentive to manipulate the channel measurements, as they cannot increase their payoff by doing so. So, the contract design will be the same as in the complete information scenario (Theorem 24).

Case B: Secondary user is the principal, primary user is dominant.

Similar to Case A, the primary is the dominant user here and the secondary user’s data is decoded at the primary’s receiver. So, by the argument of Theorem 27, neither the primary nor the secondary user will have an incentive to manipulate the channel measurements. So, the contract design will be the same as in the complete information scenario (Theorem 30).

Corollary 2. If the primary user is dominant, then the agent will report the true channel coefficients to the principal. The principal can design a second-best contract which achieves the first-best outcome. The equilibrium contract functions and the rate allocations will be the same as in the complete information case.

Case C: Secondary user is the principal and the dominant user.

Theorem 28. In Case C, it is impossible to design an ex post second-best contract which achieves the first-best outcome.

Proof. Here the secondary user is the principal and the primary user is the agent. So, the reservation utility of the agent is $\log(1 + c_{p,p}\bar{P}_p)$. So, the IR constraint of the optimization problem CD-OPT is $R_p(\cdot,\cdot) + \phi(P_p, P_s) \geq \log(1 + c_{p,p}\bar{P}_p)$, where $R_p(\cdot,\cdot)$ is the rate of the primary and $\phi(P_p, P_s)$ is the payment that it receives from the secondary.
So, for the primary to share the spectrum, the payment should satisfy the condition $\phi(\cdot,\cdot) \geq \log(1 + c_{p,p}\bar{P}_p) - R_p(\cdot,\cdot)$. Since the principal (secondary) has to depend on the agent (primary) to know the value of $c_{p,p}$, the agent (primary) can report a higher value of $c_{p,p}$ than the actual one. This is equivalent to reporting a higher reservation utility than the actual one, and thus the agent (primary) can get a non-zero net utility from this false reporting. Thus, any contract function which depends on the reported value of $c_{p,p}$ will not yield a first-best outcome. However, if the payment offered doesn’t satisfy the IR constraint of the agent (primary), it may not share the spectrum at all. So, it is important for the principal (secondary) to know the value of $c_{p,p}$ to design the contract. Thus, the principal cannot design an ex post contract which makes the agent operate at its reservation utility. So, no ex post second-best contract which achieves the first-best outcome can be designed.

Since no ex post contract which achieves the first-best outcome is possible in this case, we propose an ex ante contract design to overcome this problem. An ex ante contract is an agreement between the principal and the agent made before either of them knows the realization of any of the channel coefficients, though both know their distribution, which is common knowledge. The agent will accept the contract if his expected utility (ex ante utility) is at least equal to his expected reservation utility. So, the IR constraint in the optimization problem CD-OPT changes to $E[R_a(P_a, P_p) + \phi(P_p, P_a) - \bar{R}_a] \geq 0$. Once the contract is accepted, the channel coefficients are measured and the agent will pick his power allocation. In the following theorem, we establish that in an ex ante contract, neither the primary nor the secondary user will have an incentive to manipulate the channel coefficients. We use this result along with Theorem 31 (complete information, Case C) to give a complete description of the second-best contracts in this case.
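The contrast between the ex post and ex ante participation constraints can be checked numerically. The sketch below is illustrative only and rests on assumptions the text does not make: the direct gain $c_{p,p}$ is drawn from a unit-mean exponential distribution, the primary's rate under sharing is a fixed hypothetical constant `R_P`, and natural logarithms are used. All names and values are hypothetical.

```python
import math
import random

P_BAR = 2.0  # primary's maximum power (hypothetical value)
R_P = 0.3    # primary's rate while sharing (hypothetical constant)


def reservation_utility(c_pp, p_bar=P_BAR):
    """Rate the primary gets alone on the channel: log(1 + c_pp * P_bar)."""
    return math.log(1.0 + c_pp * p_bar)


def ex_post_payment(reported_c_pp, r_p=R_P):
    """Payment meeting the IR constraint with equality for the REPORTED gain."""
    return reservation_utility(reported_c_pp) - r_p


def ex_ante_payment(gain_samples, r_p=R_P):
    """Payment meeting the IR constraint in expectation; it uses only the
    gain distribution (here a sample), never a reported value."""
    mean_reservation = sum(reservation_utility(c) for c in gain_samples) / len(gain_samples)
    return mean_reservation - r_p


def surplus(true_c_pp, payment, r_p=R_P):
    """Agent's utility above its true reservation utility."""
    return r_p + payment - reservation_utility(true_c_pp)


rng = random.Random(0)
samples = [rng.expovariate(1.0) for _ in range(10_000)]

true_c = 1.0
honest_surplus = surplus(true_c, ex_post_payment(true_c))   # truthful report
inflated_surplus = surplus(true_c, ex_post_payment(1.5))    # over-reported gain

fixed_payment = ex_ante_payment(samples)
mean_ex_ante_surplus = sum(surplus(c, fixed_payment) for c in samples) / len(samples)
```

In this sketch an inflated report raises the ex post payment and leaves the agent with a positive surplus, whereas the ex ante payment has no report argument at all and leaves zero surplus on average over the assumed distribution.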
Define

$$\phi^{C}_{nd}(P_p; \bar{P}_s) = E\left[\log(1 + c_{p,p}\bar{P}_p)\right] - R_{nd}(P_p, \bar{P}_s), \qquad (6.23)$$

$$\phi^{C}_{nn}(P_p; \bar{P}_s) = E\left[\log(1 + c_{p,p}\bar{P}_p)\right] - R_{nn}(P_p, \bar{P}_s), \qquad (6.24)$$

where the expectation is with respect to the distribution of the channel coefficients. Let $\eta_{S,1}$ and $\eta_{S,2}$ be as defined in Theorem 25.

Theorem 29. In Case C,
(i) The agent (primary user) will report the channel coefficients truthfully. The principal (secondary user) will be able to calculate $\eta_{S,1}$ and $\eta_{S,2}$ correctly, based on the reported channel coefficients, as in Theorem 31.
(ii) If $\eta_{S,1} > \eta_{S,2}\bar{P}_s$, then the equilibrium ex ante second-best contract which achieves the first-best outcome is $(\phi^{C}_{nd}(\cdot; \bar{P}_s), (\bar{P}_p, \bar{P}_s))$, and the equilibrium rate allocation is $(R_{nd}(\bar{P}_p, \bar{P}_s), R_{dd}(\bar{P}_s, \bar{P}_p))$, which is Pareto-optimal as well.
(iii) If $\eta_{S,1} \leq \eta_{S,2}\bar{P}_p$ and $c_{s,s} \geq c_{s,p}$, then the equilibrium ex ante second-best contract which achieves the first-best outcome is $(\phi^{C}_{nn}(\cdot; \bar{P}_s), (\bar{P}_p, \bar{P}_s))$, and the equilibrium rate allocation is $(R_{nn}(\bar{P}_p, \bar{P}_s), R_{dd}(\bar{P}_s, \bar{P}_p))$, which is Pareto-optimal.
(iv) There exist channel conditions under which the (first-best) equilibrium contract does not yield Pareto-optimal rates.

Proof. (i) From the proof of Theorem 28, it is clear that the primary user has an incentive to report false channel coefficients because the IR constraint depends on the actual value of $c_{p,p}$. However, in an ex ante contract, the IR constraint is $E[R_a(P_a, P_p) + \phi(P_p, P_a) - \bar{R}_a] \geq 0$, which doesn’t depend on the actual value of $c_{p,p}$. It depends only on the distribution of the channel coefficients. So, the primary user has no incentive to manipulate the channel coefficients.

Now, the secondary user can increase his utility only by decreasing his payment to the primary. From equation (6.23), the general form of the contract function is $E\left[\log(1 + c_{p,p}\bar{P}_p)\right] - R_p(\cdots)$, where $R_p(\cdots)$ is the rate of the primary user and $R_p(\cdots) = \min(R_{nd}(\cdots), R_{nn}(\cdots))$.
If the secondary user manipulates the channel measurements to increase the transmission rate of the primary user (and decrease the corresponding payment), the primary user’s rate will be higher than the maximum possible decodable rate at the secondary’s receiver. So, the SIC scheme will fail and hence the secondary user’s own decoding will fail. Thus, the secondary user won’t manipulate the channel measurements to lower the payment.

Since the reported channel coefficients are correct, the principal can calculate the parameters $\eta_{S,1}$ and $\eta_{S,2}$ as in the complete information case (Theorem 31).

The rest of the proof is very similar to that of Theorem 24. So, we keep it short.

(ii) If $\eta_{S,1} > \eta_{S,2}\bar{P}_s$, then $R_{nd}(\cdots) \leq R_{nn}(\cdots)$ and hence $R_p = R_{nd}(\cdots)$, and the contract function $\phi^{C}_{nd}(\cdots)$ satisfies the ex ante IR constraint with equality. So, the surplus of the agent (primary) is zero at any operating point. Also, the solution of the optimization problem CD-OPT with $\phi(\cdots) = \phi^{C}_{nd}(\cdots)$ is $P_s = \bar{P}_s$, $P_p = \bar{P}_p$. So, by the same arguments as above, the principal (secondary) will offer the contract $\phi^{C}_{nd}(\cdot; \bar{P}_s)$ and the agent (primary) will pick $P_p = \bar{P}_p$. The corresponding rate allocation maximizes the sum rate and is Pareto-optimal.

(iii) When $\eta_{S,1} \leq \eta_{S,2}\bar{P}_p$ and $c_{s,s} \geq c_{s,p}$, it can easily be shown that $R_{nn}(\cdots) \leq R_{nd}(\cdots)$ and hence $R_p = R_{nn}(\cdots)$, and the contract function $\phi^{C}_{nn}(\cdots)$ satisfies the ex ante IR constraint with equality. So, the surplus of the agent (primary) is zero at any operating point. The solution of the optimization problem CD-OPT with $\phi(\cdots) = \phi^{C}_{nn}(\cdots)$ gives the optimal power allocation as $P_s = \bar{P}_s$, $P_p = \bar{P}_p$. So, the principal (secondary user) fixes his power at $\bar{P}_s$ and offers the contract $\phi^{C}_{nn}(\cdot; \bar{P}_s)$. Now, assuming that the primary user prefers a higher rate with zero surplus to a lower rate with zero surplus, he will pick $P_p = \bar{P}_p$.
It is easy to check that the corresponding rate allocation maximizes the sum rate $R_p + R_s$, and is Pareto-optimal as well.

(iv) When $\eta_{S,1} < \eta_{S,2}\bar{P}_s$, $c_{s,s} < c_{s,p}$ and

$$c_{ss}\left(1 + c_{sp}\bar{P}_s + c_{pp}\bar{P}_p + c_{ps}\bar{P}_p + c_{sp}^{2}\bar{P}_s^{2}\right) < c_{sp}\,c_{ss}\,\bar{P}_p, \qquad (6.25)$$

it can be checked that the optimal solution of the CD-OPT-PP optimization problem will occur at $P_p = \bar{P}_p$ and $P_s < \bar{P}_s$, in which case the corresponding rate pair will be in the interior of the achievable rate region, and cannot be Pareto-optimal.

| Channel conditions | Contract function | Powers | Operating point | Optimality |
| $\eta_{S,1} > \eta_{S,2}\bar{P}_s$ | $\phi^{C}_{nd}(\cdot; \bar{P}_s)$ | $(\bar{P}_s, \bar{P}_p)$ | A | Socially/Pareto |
| $\eta_{S,1} \leq \eta_{S,2}\bar{P}_s$, $c_{s,s} \geq c_{s,p}$ | $\phi^{C}_{nn}(\cdot; \bar{P}_s)$ | $(\bar{P}_s, \bar{P}_p)$ | A | Socially/Pareto |
| $\eta_{S,1} \leq \eta_{S,2}\bar{P}_s$, $c_{s,s} < c_{s,p}$ | $\phi^{C}_{nn}(\cdot, P_s)$ | $(P_s < \bar{P}_s, \bar{P}_p)$ | Interior | non-Pareto |

Figure 6.5: Summary of the results for Case C under hidden information

We summarize the results of Theorem 29 in Figure 6.5. The operating points are the rate pairs $(R_p, R_s)$ at the equilibrium and are shown in Figure 6.1.

Corollary 3. If the secondary user is dominant, it is impossible to design an ex post second-best contract which achieves the first-best outcome. An ex ante contract which achieves the first-best outcome can be designed, and it yields Pareto-optimal outcomes at equilibrium under most channel conditions.

Lesson #5. The main lesson learned from the above analysis is that when information about channel coefficients is hidden, it is better for the primary user to be the dominant user. We then have ex post second-best equilibrium contracts with first-best, Pareto-optimal outcomes.

6.5 Conclusion and Future Works

This chapter presents a new approach to ‘incentivized’ spectrum sharing for licensed bands in cognitive radio systems using “cooperative (or multi-user) communication” schemes. One of the impediments is that such schemes require cooperation between non-cooperative users, each of whom is independent and selfish.
There is little justification in assuming that users will expend their resources, particularly battery power, to aid the communication of other users. We have proposed an incentive mechanism approach that not only enables deployment of sophisticated cooperative communication schemes (such as SIC) but is also natural and easy to implement.

We have first considered a full information setting, in which we specified the equilibrium contracts that happen to be Pareto-optimal and, in fact, sum-rate maximizing under most channel conditions. However, there are channel conditions under which no first-best Pareto-optimal contract exists. This can be seen as an impossibility theorem, and hence a negative result. Nevertheless, we are able to characterize sufficient conditions on the channels under which the first-best contract is indeed Pareto-optimal at equilibrium. While there exist incentives for deviating from agreed contracts, we showed that using simple incentive schemes, this moral hazard problem can be easily avoided. When we allow for time-sharing as part of the contract, we have concluded that at equilibrium there actually will not be any time-sharing of roles.

The above assumed that all channel coefficients are known exactly to both users. In reality, users have to cooperatively determine the cross channel coefficients while the direct channel coefficients have to be elicited. This leaves room for users to lie or manipulate the channel measurements. We have shown that while, in general, this hidden information can lead to non-existence of an equilibrium contract, this can be avoided if the primary user is assigned a dominant role, in which case the equilibrium outcome is Pareto-optimal.

The setting we have considered can be extended in many ways to more realistic scenarios. Only a single secondary user has been considered.
But the results can be extended to settings with multiple secondary users (contracts with time-sharing do present a challenge with multiple secondary users), though the first-best equilibrium contract formats increase in complexity with more users. We have also considered utilities of users to be their rates. When general concave utility functions of rates are considered, Pareto-optimal equilibrium contracts cannot be designed, in general. We have also assumed a static, flat fading channel. In general, channels can have frequency selective fading, and can vary over time. In that setting, a repeated game setting can be considered, and would require more sophisticated, dynamic contracts. Dynamic mechanism design and contract theory is an area of very active research, and this will be considered in the future.

6.6 Proofs of Theorems

We provide the exact theorem statements for all the other cases. We omit the proofs as they are very similar to those of Theorem 24 and Theorem 29. Let $\eta_{P,1}$ and $\eta_{P,2}$ be as defined in Theorem 24 and let $\eta_{S,1}$ and $\eta_{S,2}$ be as defined in Theorem 25.

First-best Contracts without Time-Sharing: Cases B-D

Case B: Secondary user is the principal and primary user is dominant. We define the constant contract function $\phi_{B}(P_p; P_s) = \epsilon$.

Theorem 30. Let contract function $\phi_{B}(\cdot;\cdot)$ be as defined above. In Case B,
(i) If $\eta_{P,1} > \eta_{P,2}\bar{P}_p$, then the (first-best) equilibrium contract is $(\phi_{B}(\bar{P}_p; \bar{P}_s), (\bar{P}_p, \bar{P}_s))$, and the equilibrium rate allocation is $(R_{dd}(\bar{P}_p, \bar{P}_s), R_{nd}(\bar{P}_s, \bar{P}_p))$, which is Pareto-optimal.
(ii) If $\eta_{P,1} \leq \eta_{P,2}\bar{P}_p$ and $c_{p,p} \geq c_{p,s}$, then the (first-best) equilibrium contract is $(\phi_{B}(\bar{P}_p; \bar{P}_s), (\bar{P}_p, \bar{P}_s))$, and the equilibrium rate allocation is $(R_{dd}(\bar{P}_p, \bar{P}_s), R_{nn}(\bar{P}_s, \bar{P}_p))$, which is Pareto-optimal.
(iii) There exist channel conditions under which the second-best equilibrium contract does not yield Pareto-optimal rates.
Case C: Secondary user is the principal and the dominant user. We define the contract functions

$$\phi_{nd,C}(P_p; \bar{P}_s) = \log(1 + c_{p,p}\bar{P}_p) - R_{nd}(P_p, \bar{P}_s), \quad \text{and}$$

$$\phi_{nn,C}(P_p; \bar{P}_s) = \log(1 + c_{p,p}\bar{P}_p) - R_{nn}(P_p, \bar{P}_s).$$

Theorem 31. Let contract functions $\phi_{nd,C}(P_p; \bar{P}_s)$ and $\phi_{nn,C}(P_p; \bar{P}_s)$ be as defined above.
(i) If $\eta_{S,1} > \eta_{S,2}\bar{P}_s$, then the (first-best) equilibrium contract is $(\phi_{nd,C}(\cdot; \bar{P}_s), (\bar{P}_p, \bar{P}_s))$, and the equilibrium rate allocation is $(R_{nd}(\bar{P}_p, \bar{P}_s), R_{dd}(\bar{P}_s, \bar{P}_p))$, which is Pareto-optimal.
(ii) If $\eta_{S,1} \leq \eta_{S,2}\bar{P}_p$ and $c_{s,s} \geq c_{s,p}$, then the (first-best) equilibrium contract is $(\phi_{nn,C}(\cdot; \bar{P}_s), (\bar{P}_p, \bar{P}_s))$, and the equilibrium rate allocation is $(R_{nn}(\bar{P}_p, \bar{P}_s), R_{dd}(\bar{P}_s, \bar{P}_p))$, which is Pareto-optimal.
(iii) There exist channel conditions under which the (first-best) equilibrium contract does not yield Pareto-optimal rates.

Case D: Primary user is the principal and secondary user is dominant. We define the contract function $\phi_{D}(P_s; \bar{P}_p) = R_{nd}(P_s, \bar{P}_p)$.

Theorem 32. Let contract function $\phi_{D}(P_s; \bar{P}_p)$ be as defined above. In Case D,
(i) If $\eta_{S,1} > \eta_{S,2}\bar{P}_s$, then the (first-best) equilibrium contract is $(\phi_{D}(\cdot; \bar{P}_p), (\bar{P}_p, \bar{P}_s))$, and the equilibrium rate allocation is $(R_{nd}(\bar{P}_p, \bar{P}_s), R_{dd}(\bar{P}_s, \bar{P}_p))$, which is Pareto-optimal.
(ii) If $\eta_{S,1} \leq \eta_{S,2}\bar{P}_p$ and $c_{s,s} \geq c_{s,p}$, then the (first-best) equilibrium contract is $(\phi_{D}(\cdot; \bar{P}_p), (\bar{P}_p, \bar{P}_s))$, and the equilibrium rate allocation is $(R_{nn}(\bar{P}_p, \bar{P}_s), R_{dd}(\bar{P}_s, \bar{P}_p))$, which is Pareto-optimal.
(iii) There exist channel conditions under which the (first-best) equilibrium contract does not yield Pareto-optimal rates.

Hidden Information: Second-best Contracts in Cases B and D

Case B: Secondary user is the principal, primary user is dominant.

Theorem 33. Let contract function $\phi^{B}(\cdot;\cdot)$ be as defined in (6.6). In Case B, the agent will report the channel coefficient truthfully.
The form of the equilibrium contract and the equilibrium rate allocation are the same as in the complete information case (Theorem 30).

Case D: Primary user is the principal, secondary user is the dominant user. We define the contract function $\phi^{D}(P_s; \bar{P}_p) = E\left[\log(1 + c_{s,s}\bar{P}_s)\right]$.

Theorem 34. Let contract function $\phi^{D}(P_s; \bar{P}_p)$ be as defined above. In Case D,
(i) If $\eta_{S,1} > \eta_{S,2}\bar{P}_s$, then the equilibrium ex ante second-best contract which achieves the first-best outcome is $(\phi^{D}(\cdot; \bar{P}_p), (\bar{P}_p, \bar{P}_s))$, and the equilibrium rate allocation is $(R_{nd}(\bar{P}_p, \bar{P}_s), R_{dd}(\bar{P}_s, \bar{P}_p))$, which is Pareto-optimal as well.
(ii) If $\eta_{S,1} \leq \eta_{S,2}\bar{P}_p$ and $c_{s,s} \geq c_{s,p}$, then the equilibrium ex ante second-best contract which achieves the first-best outcome is $(\phi^{D}(\cdot; \bar{P}_p), (\bar{P}_p, \bar{P}_s))$, and the equilibrium rate allocation is $(R_{nn}(\bar{P}_p, \bar{P}_s), R_{dd}(\bar{P}_s, \bar{P}_p))$, which is Pareto-optimal.
(iii) There exist channel conditions under which the (first-best) equilibrium contract does not yield Pareto-optimal rates.

7 References

[1] T. Lai and H. Robbins, “Asymptotically efficient adaptive allocation rules,” Advances in Applied Mathematics, vol. 6, no. 1, pp. 4-22, 1985.
[2] V. Anantharam, P. Varaiya, and J. Walrand, “Asymptotically efficient allocation rules for the multi-armed bandit problem with multiple plays - part i: i.i.d. rewards,” IEEE Transactions on Automatic Control, vol. 32, no. 11, pp. 968-975, November, 1987.
[3] R. Agrawal, “Sample mean based index policies with O(log n) regret for the multi-armed bandit problem,” Advances in Applied Probability, vol. 27, no. 4, pp. 1054-1078, 1995.
[4] P. Auer, N. Cesa-Bianchi, and P. Fischer, “Finite-time analysis of the multiarmed bandit problem,” Machine Learning, vol. 47, no. 2, pp. 235-256, 2002.
[5] V. Anantharam, P. Varaiya, and J.
Walrand, “Asymptotically efficient allocation rules for the multi-armed bandit problem with multiple plays - part ii: Markovian rewards,” IEEE Transactions on Automatic Control, vol. 32, no. 11, pp. 977-982, November 1987.
[6] C. Tekin and M. Liu, “Online algorithms for the multi-armed bandit problem with markovian rewards,” Allerton Conference on Communication, Control, and Computing, October, 2010.
[7] C. Papadimitriou and J. Tsitsiklis, “The complexity of optimal queuing network control,” Mathematics of Operations Research, vol. 24, no. 2, pp. 293-305, May, 1999.
[8] C. Tekin and M. Liu, “Online learning in opportunistic spectrum access: A restless bandit approach,” International Conference on Computer Communications (INFOCOM), Shanghai, China, April 2011.
[9] W. Dai, Y. Gai, and B. Krishnamachari, “Efficient online learning for opportunistic spectrum access,” International Conference on Computer Communications (INFOCOM), Mini Conference, Orlando, USA, March, 2012.
[10] H. Liu, K. Liu, and Q. Zhao, “Learning in a changing world: Restless multi-armed bandit with unknown dynamics,” IEEE Transactions on Information Theory, Submitted, November, 2011.
[11] W. Dai, Y. Gai, B. Krishnamachari, and Q. Zhao, “The non-bayesian restless multi-armed bandit: A case of near-logarithmic regret,” International Conference on Acoustics, Speech and Signal Processing (ICASSP), May, 2011.
[12] Y. Gai, B. Krishnamachari, and R. Jain, “Combinatorial network optimization with unknown variables: Multi-armed bandits with linear rewards and individual observations,” IEEE/ACM Trans. on Networking, to appear, 2012.
[13] Y. Gai, B. Krishnamachari, and M. Liu, “On the combinatorial multi-armed bandit problem with markovian rewards,” IEEE Global Communications Conference (GLOBECOM), December, 2011.
[14] K. Liu and Q. Zhao, “Indexability of restless bandit problems and optimality of whittle index for dynamic multichannel access,” IEEE Transactions on Information Theory, vol. 56, no. 11, pp. 5547–5567, Nov. 2010.
[15] A. Anandkumar, N. Michael, A. Tang, and A. Swami, “Distributed algorithms for learning and cognitive medium access with logarithmic regret,” IEEE JSAC on Advances in Cognitive Radio Networking and Communications, April, 2011.
[16] Y. Gai and B. Krishnamachari, “Decentralized online learning algorithms for opportunistic spectrum access,” IEEE Global Communications Conference (GLOBECOM 2011), December, 2011.
[17] D. P. Bertsekas, “Auction algorithms for network flow problems: A tutorial introduction,” Computational Optimization and Applications, vol. 1, pp. 7-66, 1992.
[18] R. Bellman, Dynamic Programming. Princeton University Press, 1957.
[19] G. L. Nemhauser, Introduction to dynamic programming. Wiley, 1966.
[20] R. Howard, Dynamic Probabilistic Systems: Vol. 2: Semi-Markov and Decision Processes. John Wiley and Sons, 1971.
[21] C. H. Papadimitriou and J. Tsitsiklis, “The complexity of markov decision processes,” Math. Oper. Res, vol. 12, no. 3, pp. 441–450, 1987.
[22] R. Bellman and S. Dreyfus, “Functional approximations and dynamic programming,” Mathematical Tables and Other Aids to Computation, vol. 13, no. 68, pp. 247–251, 1959.
[23] P. Werbos, “Beyond regression: New tools for prediction and analysis in the behavioral sciences,” Ph.D. dissertation, 1974.
[24] H. J. Kushner and D. S. Clark, Stochastic approximation methods for constrained and unconstrained systems. Springer, 1978.
[25] M. Minsky, “Steps toward artificial intelligence,” Proceedings of the IRE, vol. 49, no. 1, pp. 8–30, 1961.
[26] A. Barto, S. Sutton, and P. Brouwer, “Associative search network: A reinforcement learning associative memory,” Biol. Cybernet., vol. 40, no. 3, pp. 201–211, 1981.
[27] W. Whitt, “Approximations of dynamic programs, i,” Math. Oper. Res, vol. 3, no. 3, pp. 231–243, 1978.
[28] ——, “Approximations of dynamic programs, ii,” Math. Oper. Res, vol. 4, no. 2, pp. 179–185, 1979.
[29] D. Bertsekas, Dynamic programming and optimal control. Athena Scientific Belmont, 2004, vol. 1 and 2.
[30] M. Anthony and P. Bartlett, Neural network learning: Theoretical foundations. Cambridge University Press, 2009.
[31] C. Watkins and P. Dayan, “Q-learning,” Machine learning, vol. 8, no. 3-4, pp. 279–292, 1992.
[32] H. Robbins and S. Monro, “A stochastic approximation method,” The Annals of Mathematical Statistics, pp. 400–407, 1951.
[33] J. Kiefer and J. Wolfowitz, “Stochastic estimation of the maximum of a regression function,” The Annals of Mathematical Statistics, vol. 23, no. 3, pp. 462–466, 1952.
[34] D. Bertsekas and J. Tsitsiklis, Neuro-Dynamic Programming. Athena Scientific, 1996.
[35] V. R. Konda and V. S. Borkar, “Actor-critic–type learning algorithms for markov decision processes,” SIAM Journal on control and Optimization, vol. 38, no. 1, pp. 94–123, 1999.
[36] V. R. Konda and J. Tsitsiklis, “Convergence rate of linear two-time-scale stochastic approximation,” Ann. Appl. Probab., pp. 796–819, 2004.
[37] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction. Cambridge Univ Press, 1998.
[38] W. B. Powell, Approximate Dynamic Programming: Solving the curses of dimensionality. John Wiley & Sons, 2007, vol. 703.
[39] V. S. Borkar, Stochastic approximation: A dynamical systems viewpoint. Cambridge University Press, Cambridge, 2008.
[40] L. Ljung, “Analysis of recursive stochastic algorithms,” IEEE Trans. Automat. Control, vol. 22, no. 4, pp. 551–575, 1977.
[41] V. S. Borkar and S. P. Meyn, “The ode method for convergence of stochastic approximation and reinforcement learning,” SIAM J. Control Optim., vol. 38, no. 2, pp. 447–469, 2000.
[42] S. M. Kakade, “On the sample complexity of reinforcement learning,” Ph.D. dissertation, University of London, 2003.
[43] A. Müller and D. Stoyan, Comparison methods for stochastic models and risks. Wiley, 2002, vol. 389.
[44] M. Shaked and J. G. Shanthikumar, Stochastic orders. Springer, 2007.
[45] I. Chueshov, Monotone random systems theory and applications. Springer, 2002, vol. 1779.
[46] J. Abounadi, D. Bertsekas, and V. S. Borkar, “Learning algorithms for Markov decision processes with average cost,” SIAM J. Control Optim., vol. 40, no. 3, pp. 681–698, 2001.
[47] V. S. Borkar, “Q-learning for risk-sensitive control,” Math. Oper. Res, vol. 27, no. 2, pp. 294–311, 2002.
[48] E. Even-Dar and Y. Mansour, “Learning rates for q-learning,” J. Mach. Learn. Res, vol. 5, pp. 1–25, 2004.
[49] J. Tsitsiklis, “On the convergence of optimistic policy iteration,” J. Mach. Learn. Res, vol. 3, pp. 59–72, 2003.
[50] H. S. Chang, M. Fu, J. Hu, and S. I. Marcus, Simulation-Based Algorithms for Markov Decision Processes. Springer, Berlin, 2006.
[51] ——, “A survey of some simulation-based algorithms for markov decision processes,” Commun. Inf. Syst., vol. 7, no. 1, pp. 59–92, 2007.
[52] M. L. Thathachar and P. S. Sastry, “A class of rapidly convergent algorithms for learning automata,” IEEE Tran. Sy. Man. Cyb., vol. 15, pp. 168–175, 1985.
[53] K. Rajaraman and P. S. Sastry, “Finite time analysis of the pursuit algorithm for learning automata,” Systems, Man, and Cybernetics, Part B: Cybernetics, IEEE Transactions on, vol. 26, no. 4, pp. 590–598, 1996.
[54] K. S. Narendra and M. Thathachar, Learning automata: an introduction. Dover Publications, 2012.
[55] W. Cooper, S. Henderson, and M. Lewis, “Convergence of simulation-based policy iteration,” Probab. Engrg. Inform. Sci., vol. 17, no. 02, pp. 213–234, 2003.
[56] W. Cooper and B. Rangarajan, “Performance guarantees for empirical markov decision processes with applications to multiperiod inventory models,” Oper. Res., vol. 60, no. 5, pp. 1267–1281, 2012.
[57] A. Almudevar, “Approximate fixed point iteration with an application to infinite horizon markov decision processes,” SIAM J. Control Optim., vol. 47, no. 5, pp. 2303–2347, 2008.
[58] D. Bertsekas, “Approximate policy iteration: A survey and some new methods,” J. Control Theory Appl., vol. 9, no. 3, pp. 310–335, 2011.
[59] R. Jain and P. Varaiya, “Simulation-based uniform value function estimates of markov decision processes,” SIAM J. Control Optim., vol. 45, no. 5, pp. 1633–1656, 2006.
[60] ——, “Simulation-based optimization of markov decision processes: An empirical process theory approach,” Automatica, vol. 46, no. 8, pp. 1297–1304, 2010.
[61] C. J. C. H. Watkins, Learning from Delayed Rewards. Ph.D. Thesis, University of Cambridge, 1989.
[62] J. G. Propp and D. B. Wilson, “Exact sampling with coupled Markov chains and applications to statistical mechanics,” Random structures and Algorithms, vol. 9, no. 1-2, pp. 223–252, 1996.
[63] P. Diaconis and D. Freedman, “Iterated random functions,” SIAM Review, vol. 41, no. 1, pp. 45–76, 1999.
[64] D. P. Bertsekas, Dynamic Programming and Optimal Control vol. 2, 4th ed. Athena Scientific, 2012.
[65] R. E. Steuer, Multiple criteria optimization: Theory, computation, and application. Krieger Malabar, 1989.
[66] E. Altman, Constrained Markov decision processes. CRC Press, 1999, vol. 7.
[67] V. S. Borkar, “An actor-critic algorithm for constrained Markov decision processes,” Systems and control letters, vol. 54, no. 3, pp. 207–213, 2005.
[68] J. Y. Yu, S. Mannor, and N. Shimkin, “Markov decision processes with arbitrary reward processes,” Mathematics of Operations Research, vol. 34, no. 3, pp. 737–757, 2009.
[69] E. Even-Dar, S. Kakade, and Y. Mansour, “Online markov decision processes,” Mathematics of Operations Research, vol. 34, no. 3, pp. 726–736, 2009.
[70] D. Blackwell, “An analog of the minimax theorem for vector payoffs,” Pacific Journal of Mathematics, vol. 6, no. 1, pp. 1–8, 1956.
[71] N. Shimkin and A. Shwartz, “Guaranteed performance regions in Markovian systems with competing decision makers,” IEEE Trans. Automat. Control, vol. 38, no. 1, pp. 84–95, 1993.
[72] E. Milman, “Approachable sets of vector payoffs in stochastic games,” Games Econom. Behav., vol. 56, no. 1, pp. 135–147, 2006.
[73] S. Kamal, “A vector minmax problem for controlled Markov chains,” Arxiv preprint, available at arXiv:1011.0675v1, 2010.
[74] S. Mannor and N. Shimkin, “A geometric approach to multi-criterion reinforcement learning,” J. Mach. Learn. Res., vol. 5, pp. 325–360, 2004.
[75] G. Kramer, I. Maric, and R. Yates, “Cooperative communications,” Foundations and Trends in Networking series, 2006.
[76] G. Kramer, “Topics in multi-user information theory,” Foundations and Trends in Communications and Information Theory series, 2006.
[77] D. Tse and P. Viswanath, “Fundamentals of wireless communication,” Cambridge University Press, 2005.
[78] P. Gupta and P. Kumar, “The capacity of wireless networks,” IEEE Transactions on Information Theory, 2000.
[79] M. Francechetti, O. Douse, D. Tse, and P. Thiran, “Closing the gap in the capacity of wireless networks via percolation theory,” IEEE Transactions on Information Theory, March 2007.
[80] R. Etkin, A. Parekh, and D. Tse, “Spectrum sharing for unlicensed bands,” IEEE J. Selected Areas in Communications, April 2007.
[81] A. Sahai, R. Tandra, and S. Mishra, “Fundamental design tradeoffs in cognitive radio systems,” Proc. TAPAS, 2006.
[82] G. Atia, A. Sahai, and V. Saligrama, “Spectrum enforcement and liability assignment in cognitive radio systems,” Proc. DySpan Conference, October 2008.
[83] Y. Xing, R. Chandramouli, and C. Cordeiro, “Price dynamics in competitive agile spectrum access markets,” IEEE J. Selected Areas in Communications, April, 2007.
[84] Z. Ji and K. J. Ray Liu, “Multi-stage pricing game for collusion-resistant dynamic spectrum allocation,” IEEE J. Selected Areas in Communications, January, 2008.
[85] D. Niyato and E. Hossain, “Competitive pricing for spectrum sharing in cognitive radio networks: Dynamic games, inefficiency of equilibria and collusion,” IEEE J. Selected Areas in Communications, January, 2008.
[86] J. Huang, R. Berry, and M. Honig, “Auction-based spectrum sharing,” ACM/Springer Journal of Mobile Networks and Applications (MONET), January, 2006.
[87] F. Fu and M. van der Schaar, “Learning for dynamic bidding in cognitive radio resources,” ACM/Springer Journal of Mobile Networks and Applications (MONET), 2007.
[88] H. Chang, K. C. Chen, N. Prasad, and C. Su, “Auction based spectrum management for cognitive radio networks,” IEEE Vehicular Technology Conference, 2009.
[89] J. Laffont and M. D., “The theory of incentives: The principal-agent model,” Princeton University Press, 2002.
[90] P. Bolton and M. Dewatrimont, “Contract theory,” MIT Press, 2005.
[91] R. Johari, S. Mannor, and J. Tsitsiklis, “A contract-based model for directed network formation,” Games and Economic Behavior, 2006.
[92] M. Feldman, J. Chuang, I. Stoica, and S. Shenker, “Hidden-action in network routing,” IEEE J. Selected Areas in Communications, 2007.
[93] T. Cover and J. Thomas, “Elements of information theory,” Wiley-Interscience, 2006.
[94] E. Hossain and V. K. Bhargava, “Cognitive wireless communication networks,” Springer, 2007.
[95] D. Pollard, “Convergence of stochastic processes,” Springer, 1984.
[96] P. Lezaud, “Chernoff-type bound for finite markov chains,” Ann. Appl. Prob., vol. 8, pp. 849-867, 1998.
[97] D. P. Bertsekas, “The auction algorithm: A distributed relaxation method for the assignment problem,” Annals of Operations Research, vol. 14, 1988.
[98] M. Zavlanos, L. Spesivtsev, and G. J. Pappas, “A distributed auction algorithm for the assignment problem,” Proceedings of the IEEE Conference on Decision and Control, December, 2008.
[99] S. Vakili, K. Liu, and Q. Zhao, “Deterministic sequencing of exploration and exploitation for multi-armed bandit problems,” IEEE Journal of Selected Topics in Signal Processing, to appear, 2013.
[100] K. Liu and Q. Zhao, “Multi-armed bandit problems with heavy tail reward distributions,” Allerton Conference on Communication, Control, and Computing, September, 2011.
[101] C. Tekin and M. Liu, “Online learning of rested and restless bandits,” IEEE Trans. on Information Theory, Submitted, 2012.
[102] H. Liu, K. Liu, and Q. Zhao, “Learning in a changing world: Restless multi-armed bandit with unknown dynamics,” IEEE Transactions on Information Theory, vol. 59, no. 3, pp. 1902–1916, Mar. 2013.
[103] S. M. Ross, Stochastic processes. John Wiley & Sons, 1996.
[104] D. Levin, Y. Peres, and E. L. Wilmer, Markov chains and mixing times. AMS, 2009.
[105] L. S. Shapley, “Stochastic games,” Proceedings Nat. Acad. of Sciences USA, vol. 39, no. 10, pp. 1095–1100, 1953.
[106] M. L. Puterman, Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 2009, vol. 414.
[107] V. S. Borkar, Stochastic approximation: a dynamical systems viewpoint. Cambridge University Press, 2008.
[108] ——, Topics in controlled Markov chains. Longman Scientific & Technical, 1991.
[109] D. Blackwell and L. Dubins, “Merging of opinions with increasing information,” The Annals of Mathematical Statistics, pp. 882–886, 1962.
[110] V. S. Borkar, Probability theory: an advanced course. Springer, 1995.
[111] J. P. Aubin and A. Cellina, Differential inclusions: Set-valued maps and viability theory. Springer-Verlag New York, 1984.
[112] M. Benaim, J. Hofbauer, and S. Sorin, “Stochastic approximations and differential inclusions,” SIAM J. Control Optim., vol. 44, no. 1, pp. 328–348, 2005.
[113] ——, “Stochastic approximations and differential inclusions, part ii: applications,” Math. Oper. Res., vol. 31, no. 4, pp. 673–695, 2006.
[114] D. Bertsekas, A. Nedić, and A. Ozdaglar, Convex analysis and optimization. Athena Scientific, 2003.
[115] M. Bardi and I. Capuzzo-Dolcetta, Optimal control and viscosity solutions of Hamilton-Jacobi-Bellman equations. Springer, 2008.
[116] W. Yiu, G. Ginis, and J. Cioffi, “Distributed multiuser power control for digital subscriber lines,” IEEE J. Selected Areas in Communications, 2002.
[117] M. Chiang, “Geometric programming for communication systems,” Foundations and Trends in Communications and Information Theory, 2005.
[118] L. Duan, L. Gao, and J. Huang, “Contract-based cooperative spectrum sharing,” IEEE Dynamic Spectrum Access Networks (DySPAN), May 2011.
Asset Metadata
Creator: Kalathil, Dileep Manisseri (author)
Core Title: Empirical methods in control and optimization
School: Viterbi School of Engineering
Degree: Doctor of Philosophy
Degree Program: Electrical Engineering
Publication Date: 10/14/2014
Defense Date: 07/08/2014
Publisher: University of Southern California (original), University of Southern California. Libraries (digital)
Tag: approachability, MDP, multi-armed bandits, OAI-PMH Harvest, online optimization, spectrum sharing
Format: application/pdf (imt)
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisor: Jain, Rahul (committee chair), Krishnamachari, Bhaskar (committee member), Liu, Yan (committee member)
Creator Email: manisser@usc.edu, mkdileep@gmail.com
Permanent Link (DOI): https://doi.org/10.25549/usctheses-c3-489169
Unique identifier: UC11286983
Identifier: etd-KalathilDi-3010.pdf (filename), usctheses-c3-489169 (legacy record id)
Legacy Identifier: etd-KalathilDi-3010.pdf
Dmrecord: 489169
Document Type: Dissertation
Rights: Kalathil, Dileep Manisseri
Type: texts
Source: University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA