Utilizing Context and Structure of Reward Functions to Improve Online Learning in Wireless Networks

by

Pranav Krishna Sakulkar

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY (ELECTRICAL ENGINEERING)

May 2018

Copyright 2018 Pranav Krishna Sakulkar

Dedication

To my beloved family members: Aai (Madhuri Sakulkar), Baba (Krishna Sakulkar) and Bayko (Rasika Chimalwar).

Acknowledgments

This work would not have been possible if it were not for the help of many people. First, I would like to thank my advisor, Prof. Bhaskar Krishnamachari. I could not have been more fortunate to have such a wonderful person as my advisor. He taught me everything I know about research and opened my mind to a world of opportunities. He has always been very kind, patient, and generous. I enjoyed all the insightful discussions I have had with him.

At the University of Southern California, I received the support of various faculty, students, and colleagues. I would like to thank the rest of my thesis committee, Professor Ashutosh Nayyar and Professor Leana Golubchik, for their guidance and support. Thanks to Prof. Michael Neely and Prof. Yan Liu for their feedback during my qualifying exam. I sincerely thank all my ANRG lab-mates for the wide range of discussions on research and life in general. ANRG members have been my support system during my time at USC. I have learned so much from them that will stay with me forever. Special thanks to all the EE staff members for their help. They have always been ready to resolve the problems arising over the years.

I thank all my advisors from the earlier stages of my academic career, each of whom has influenced me a lot. Dr. Anil Kayande has had a tremendous impact on my thought process and my way of approaching problems. I am immensely grateful to him for his invaluable guidance during my formative years.
Finally, I'd like to thank my parents, Madhuri and Krishna Sakulkar, my lovely wife, Rasika Chimalwar, and my friends for their moral and emotional support throughout this journey. Without their support, this PhD wouldn't have been possible.

Table of Contents

Dedication
Acknowledgments
List Of Figures
List Of Tables
Abstract

Chapter 1: Introduction
  1.1 Contextual Bandits
  1.2 Motivating Examples
    1.2.1 Dynamic Channel Selection
    1.2.2 Energy Harvesting Communications
    1.2.3 Multi-Objective Optimization
  1.3 Markov Decision Processes (MDPs)
  1.4 Contextual Combinatorial Bandits
    1.4.1 Distributed Computing Application

Chapter 2: Background: Bandit Problems
  2.1 Introduction to Bandit Problems
  2.2 Types of Bandit Problems
  2.3 Bandit Algorithms
    2.3.1 UCB1 Policy
    2.3.2 Epsilon-Greedy Policy
    2.3.3 DSEE Policy

Chapter 3: Related Work
  3.1 Contextual Bandits
  3.2 Online Learning over MDPs
  3.3 Contextual Combinatorial Bandits

Chapter 4: Contributions
  4.1 Contextual Bandits
  4.2 Online Learning over MDPs
  4.3 Contextual Combinatorial Bandits
    4.3.1 Wireless Distributed Computing
    4.3.2 Online Learning Algorithms

Chapter 5: Contextual Bandits with Known Reward Functions
  5.1 Problem Formulation
  5.2 Discrete Contexts
    5.2.1 Naive Approach
    5.2.2 Multi-UCB
    5.2.3 New Policy
  5.3 Regret Analysis of DCB(ε)
    5.3.1 Upper Bound on Regret
    5.3.2 Asymptotic Lower Bound
  5.4 Continuous Contexts
    5.4.1 Known Time Horizon
    5.4.2 Unknown Time Horizon
  5.5 Numerical Simulation Results
  5.6 Summary

Chapter 6: Online Learning over MDPs with Known Reward Functions
  6.1 System Model
    6.1.1 Problem Formulation
    6.1.2 Optimal Stationary Policy
  6.2 Online Learning Algorithms
    6.2.1 LPSM
    6.2.2 Epoch-LPSM
    6.2.3 Regret vs. Computation Tradeoff
  6.3 Multi-Channel Communication
    6.3.1 Online Learning Algorithm
    6.3.2 Regret Analysis of MC-LPSM
      6.3.2.1 Non-Optimal Power-Channel Mapping
      6.3.2.2 Non-Optimal State-Action Mapping
    6.3.3 Asymptotic Lower Bound
  6.4 Cost Minimization Problems
  6.5 Numerical Simulations
  6.6 Summary

Chapter 7: Contextual Combinatorial Bandits
  7.1 Problem Formulation
  7.2 Distributed Computing Application
    7.2.1 Context
    7.2.2 Cost and Latency
    7.2.3 Optimization Problem
    7.2.4 Optimal Policy
    7.2.5 Reward Functions
  7.3 PMF Learning Algorithm
    7.3.1 Concentration Inequality for Lipschitz Functions
    7.3.2 The Max Function
    7.3.3 Algorithm Analysis
      7.3.3.1 Infeasible Optimal Mapping
      7.3.3.2 Non-Optimal Latency Estimates
      7.3.3.3 Regret Bound
  7.4 Numerical Simulation Results
  7.5 Summary

Chapter 8: Conclusion

Appendix A: Sum of Bounded Random Variables
Appendix B: Proof of Theorem 1: Bound on the pulls of non-optimal arms
Appendix C: Proof of Lemma 1: High probability bound for UCB1(ε)
Appendix D: Proof of Theorem 2: Bound on the non-optimal pulls of optimal arms
Appendix E: Analysis of Markov Chain Mixing

Reference List

List Of Figures

1.1 A tree-based summary of the work.
1.2 Dynamic channel selection problem.
1.3 Power-aware channel selection in energy harvesting communications.
1.4 Job-aware computational offloading problem.
1.5 An example assignment of tasks from a task graph on the networked devices.
5.1 Simulation results for the channel selection problem with ε = 10^-2.
6.1 Power allocation over a wireless channel in energy harvesting communications.
6.2 Packet scheduling over a wireless channel.
6.3 Regret performance of LPSM algorithms.
6.4 Effect of the parameters n_0 and η on the regret of Epoch-LPSM.
6.5 Regret performance of the MC-LPSM algorithm.
7.1 An example task graph with nodes indicating the work-load of the task and edges indicating the required amount of data exchange.
7.2 Comparison of the regret results for the contextual combinatorial bandit problem.

List Of Tables

5.1 Notations for Algorithm Analysis
5.2 Regret for the channel selection problem when T = 10^5
5.3 Regret for the power-aware channel selection problem when T = 10^6
6.1 Actions for each state

Abstract

Many problems in wireless communications and networking involve sequential decision making in unknown stochastic environments, such as transmission channel selection and transmission power allocation over wireless channels. These problems can be modeled as bandit problems, where the agent needs to make certain decisions sequentially over time and learn about the environment during the process. A special feature of these problems is that there is some side-information available before making the decisions and the reward functions are known to the agent. In this dissertation, we analyze these settings, and design and analyze online learning algorithms for these problems by utilizing the side-information and the knowledge of the reward functions.

Many of these problems can be modeled as contextual bandit problems, which are natural extensions of the well-known multi-armed bandit problem. In contextual bandit problems, at each time, an agent observes some side information or context, pulls one arm and receives the reward for that arm. We consider a stochastic formulation where the context-reward tuples are independently drawn from an unknown distribution in each trial. Motivated by networking applications, we analyze a setting where the reward is a known non-linear function of the context and the chosen arm's current state. We first consider the case of discrete and finite context-spaces and propose DCB(ε), an algorithm that we prove, through a careful analysis, yields regret (the cumulative reward gap compared to a distribution-aware genie) scaling logarithmically in time and linearly in the number of arms that are not optimal for any context, improving over existing algorithms where the regret scales linearly in the total number of arms. We then study continuous context-spaces with Lipschitz reward functions and propose CCB(ε, δ), an algorithm that uses DCB(ε) as a subroutine. CCB(ε, δ) reveals a novel regret-storage trade-off that is parametrized by δ.
Tuning δ to the time horizon allows us to obtain sub-linear regret bounds, while requiring sub-linear storage. By exploiting joint learning for all contexts, we get regret bounds for CCB(ε, δ) that are unachievable by any existing contextual bandit algorithm for continuous context-spaces. We also show similar performance bounds for the unknown-horizon case.

We also consider the problem of power allocation over one or more time-varying channels with unknown distributions in energy harvesting communications. In the single-channel case, the transmitter chooses the transmit power based on the amount of stored energy in its battery with the goal of maximizing the average rate over time. We model this problem as a Markov decision process (MDP) with the transmitter as the agent, the battery status as the state, the transmit power as the action and the rate as the reward. The average reward maximization problem can be modeled by a linear program (LP) that uses the transition probabilities for the state-action pairs and their reward values to select a power allocation policy. This problem is challenging because the uncertainty in the channels implies that the mean rewards associated with the state-action pairs are unknown. We therefore propose two online learning algorithms, Linear Program of Sample Means (LPSM) and Epoch-LPSM, that learn these rewards and adapt their policies over time. For both algorithms, we prove that their regret is upper-bounded by a constant. To our knowledge, this is the first result showing constant-regret learning algorithms for MDPs with unknown mean rewards. We also prove an even stronger result about LPSM: its policy matches the optimal policy exactly in finite expected time. Epoch-LPSM incurs a higher regret compared to LPSM, while reducing the computational requirements substantially.
We further consider a multi-channel scenario where the agent also chooses a channel in each slot, and present our multi-channel LPSM (MC-LPSM) algorithm that explores different channels and uses that information to solve the LP during exploitation. MC-LPSM incurs a regret that scales logarithmically in time and linearly in the number of channels. Through a matching lower bound on the regret of any algorithm, we also prove the asymptotic order optimality of MC-LPSM.

When the agent needs to select a vector instead of a scalar as its action at any given time, the problem becomes a combinatorial bandit problem. These problems become especially interesting when there is side-information or context available to the agent. Such problems have wide applicability in networking in general and in distributed computing in particular. With Wireless Distributed Computing (WDC), multiple resource-constrained mobile devices connected wirelessly can collaborate to enable a variety of applications involving complex tasks that one device cannot support individually. It is important to consider the application task graph, the features of the instantaneous data-frame, the availability of the computing resources and the link connectivity to these devices, and determine the task assignments to balance the trade-off between the energy costs of the devices and the overall task execution latency. Considering the time-varying nature of the resource availability and the link conditions, we model the online task assignment problem as a contextual combinatorial bandit. Since each incoming data-frame may have different features and affect the optimal task assignment, the data-frame features act as context at each time-slot. We propose a novel online learning algorithm, called the PMF Learning algorithm, that learns the distributions of the internal random variables of the system over time and uses the empirical distributions to make task allocations at each step.
We prove that the regret of the PMF Learning algorithm scales logarithmically over time and linearly in the number of connected devices, which is a substantial improvement over the standard combinatorial bandit algorithms.

Chapter 1: Introduction

Many problems in networking, such as dynamic spectrum allocation in cognitive radio networks, involve sequential decision making in the face of uncertainty. The multi-armed bandit problem (MAB, see [1, 2, 3, 4]), a fundamental online learning framework, has been previously applied in networks for multi-user channel allocation [5] and distributed opportunistic spectrum access [6]. These problems also appear widely in online advertising and clinical trials. In the standard MAB version, an agent is presented with a sequence of trials where in each trial it has to choose an arm from a set of arms, each providing stochastic rewards over time with unknown distributions. The agent receives a payoff or reward based on its action in each trial. The goal is to maximize the total expected reward over time. On one hand, different arms need to be explored often enough to learn their rewards; on the other hand, prior observations need to be exploited to maximize the immediate reward. MAB problems, therefore, capture the fundamental trade-off between exploration and exploitation that appears widely in these decision making problems.

The contextual bandit problem considered here is a natural extension of the basic MAB problem where the agent can see a context, or some hint about the rewards in the current trial, that can be used to make a better decision. This makes contextual bandits more general and more suitable for various practical applications such as ad placements [7, 8] and personalized news recommendations [9], since settings with no context information are rare in practice. In our work, we consider applications of contextual bandits in wireless networks.
The idea is to show how the known contextual information can be combined with a pure bandit approach to improve the performance of online scheduling algorithms in unknown environments. Our bandit formulations are significant in real-world networking applications, because such applications tend to be rich with contextual information or observed variables, such as packet sizes, queue sizes, data types, job sizes, QoS needs, etc., that can be combined with learning over unknown stochastic variables to improve performance.

One interesting feature of the problems from wireless networks is that the reward function that maps the side-information and the internal random variables to the reward values is already known to the agent. This knowledge of the function can help the agent make better scheduling decisions at each time-slot. The idea is to use the available knowledge of the contextual information and the reward function to make better decisions at each time step. Since the reward function is also known in advance, the reward revealed in any given slot can also be used to improve the reward estimates for other contexts as well. These ideas motivate the central theme of the thesis work.

Thesis Statement: Algorithms involving wireless network optimization should utilize the observed side-information (the contexts) and the knowledge of the reward functions to improve the efficiency of the online decisions under unknown stochastic environments.

Figure 1.1: A tree-based summary of the work.

1.1 Contextual Bandits

In this section, we consider contextual bandit problems where the relationship between the contexts and the rewards is known, which we argue is the case with many network applications. In this case, the reward information revealed after an arm-pull for a certain context can also be used to learn about rewards for other contexts as well, since the reward function is already known.
The arm-pulls for one context thus help in reducing the exploration requirements of other contexts. The performance of any online learning algorithm is measured in terms of its regret, defined as the difference between the expected cumulative rewards of the Bayes-optimal algorithm and that of the agent. Since the bulk of the regret in bandit problems is due to the exploration of different arms, the knowledge of the reward function helps in reducing the regret substantially. As previous bandit algorithms cannot exploit this knowledge, we develop new contextual bandit algorithms for this setting. Also, the previously studied contextual bandit problems either assume a linear reward mapping [9, 10, 11, 12] or do not make explicit assumptions about the exact structure of the reward functions [7, 13, 14, 8, 15]. It must, however, be noted that the network applications often involve non-linear reward functions and thus warrant a new approach. The goal is to study these contextual bandits and develop efficient algorithms for them. In Chapter 5, we discuss the system model, propose our online learning schemes and prove performance guarantees for the proposed algorithms.

1.2 Motivating Examples

We present some examples of problems arising in communication networks where a contextual bandit formulation is applicable. In each case, there are a few key ingredients: a set of arms (typically corresponding to resources such as channels or servers) with associated attributes or performance measures (such as throughput capacity, channel gain, processing rate) evolving stochastically with an unknown distribution, and a context known to the user, such that the reward obtained at each step is a known (possibly non-linear) function of the context and the random attribute of the arm selected in that step.

Figure 1.2: Dynamic channel selection problem.
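To make these ingredients concrete, here is a minimal simulation sketch of the interaction protocol, with a toy min-style known reward function and hypothetical arms, contexts and capacity distributions (none of the numbers come from the dissertation). It illustrates why a distribution-aware genie that adapts its arm choice to the context accumulates more reward than a context-free policy:

```python
import random

random.seed(0)

# Known reward function g(context, arm_attribute): a toy min-style
# reward (number of context bits that fit through the chosen channel).
def g(y, x):
    return min(y, x)

# Two arms whose attribute distributions are unknown to the agent
# (hypothetical capacities for illustration only).
def draw_arm(arm):
    return 2 if arm == 0 else random.choice([0, 7])

contexts = [1, 5]      # observed packet sizes (the context)
T = 10000
agent_reward = 0.0
genie_reward = 0.0

for t in range(T):
    y = random.choice(contexts)              # 1) context revealed first
    arm = 0                                  # 2) context-free agent: always arm 0
    agent_reward += g(y, draw_arm(arm))      # 3) reward via the known function g
    # Distribution-aware genie picks per context:
    # E[g(1, arm0)] = 1.0 > E[g(1, arm1)] = 0.5, but
    # E[g(5, arm0)] = 2.0 < E[g(5, arm1)] = 2.5, so the best arm flips with y.
    best_arm = 0 if y == 1 else 1
    genie_reward += g(y, draw_arm(best_arm))

regret = genie_reward - agent_reward   # grows linearly for the context-free policy
```

The point of the sketch is the order of events: the context arrives before the decision, the arm's attribute is realized only after the pull, and the reward reaches the agent through a function g it already knows.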
1.2.1 Dynamic Channel Selection

The problem of selecting the best of several capacity-limited channels in every slot, based on the available number of data packets, can be modeled as a contextual bandit problem as shown in figure 1.2. Here, arms correspond to the different channels available, arm-values or attributes are the instantaneous capacities of these channels, which can be assumed to vary stochastically over time (e.g., due to fading or interference in the case of wireless networks), and the context is the number of data-bits that need to be transmitted during the slot. The reward, representing the number of successfully transmitted bits over the chosen channel, can be expressed as

g(y, x) = min{y, x},     (1.1)

where y is the number of data-bits to be transmitted and x is the capacity of the channel. In this problem, channels are to be selected sequentially over the trials based on the context information. Assuming no prior information about the channels, the goal in this problem is to maximize the number of successfully transmitted bits over time. When the number of bits to be sent at each time is finite, this represents the case of discrete contextual bandits considered in Chapter 5.

Variations of the dynamic channel selection problem have been previously studied from the perspective of MABs in [5, 6]. These formulations, however, do not consider any available side-information and always try to select a channel with maximum capacity. Although modeling the available data as context complicates the formulation, as we shall see in later sections, context-dependent channel selections improve the system performance.

1.2.2 Energy Harvesting Communications

Consider a power-aware channel selection problem in energy harvesting communications, shown in figure 1.3. In every slot, the current harvested power available for transmission, p, is known to the transmitter as context.
Based on this context information, a channel is to be selected sequentially in each slot for transmission.

Figure 1.3: Power-aware channel selection in energy harvesting communications.

The reward function in this case is the power-rate function, which can be assumed to be proportional to the AWGN channel capacity log(1 + px), where x denotes the instantaneous channel gain-to-noise ratio (the arm attribute) varying stochastically over time. The goal of this problem is to maximize the sum-rate over time. Note that the harvested power p can even be a continuous random variable, and log(1 + px) is Lipschitz continuous in p for all x >= 0. This represents the case of continuous contextual bandits considered in this work.

Such scheduling problems in networks have been previously studied from the perspective of stochastic network optimization in [16, 17, 18]. The goal of these problems is to maximize the overall network utility while keeping the queues at all the nodes bounded. These problems, however, assume that the network controller (or the agent) can observe the instantaneous channel states and queue states perfectly before making the scheduling decisions. This means that the instantaneous realizations of all the sources of randomness in the network get revealed to the agent before it makes scheduling decisions for that slot. This is in contrast to the online learning formulations, where the agent makes decisions in the face of uncertainty and the realizations of the internal random variables (in our case, the channel gains) are revealed to it only after its decisions are implemented in that slot.

1.2.3 Multi-Objective Optimization

Computational offloading involves transferring job-specific data to a remote server and performing further computations on the server. These problems involve transmission energy costs and remote computational costs that are incurred by offloading the jobs to remote servers.
There could also be other objectives, such as minimizing the overall latency, which depends on the transmission delay and the processing time required on the server. Consider a simple latency model from [19]: given a task i with workload m_i, if executed on device j, the execution delay can be expressed as

T_ex^j(i) = c * m_i / z_j,

where z_j is the CPU rate of device j and c is a constant. The transmission delay can be expressed as

T_tr^j(i) = d_i / B_j,

where d_i represents the amount of data to be transmitted and B_j denotes the data-rate of the link to the server. As we can see, for each server j the overall latency T_tr^j(i) + T_ex^j(i) depends on attributes such as the data-rate of the transmission link and the current effective processing rate of the server, which could be unknown stochastic processes, as well as job-specific features, such as the data transmission and processing requirements, which act as the context for this problem.

Figure 1.4: Job-aware computational offloading problem.

As shown in figure 1.4, this problem of sequential server allocation for various jobs to minimize the cumulative cost over time can thus be formulated as a contextual bandit problem. The computational offloading problems can, in general, have more complex cost functions that depend on the number of cores available at a server and job-specific features that determine whether a job is parallelizable or not.

The offloading problem considered in [19] assumes that the distributions of the random variables z_j and B_j are known to the agent. A more general offloading problem, MABSTA, is considered in [20], where online learning policies are employed in unknown environments. However, no context information is taken into consideration while making the offloading decisions. In our work, we use the context information in our online learning model to make better offloading decisions.
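As a toy illustration of this latency model, the following sketch computes T_tr^j(i) + T_ex^j(i) for each server and picks the minimizer, as a genie that knew z_j and B_j would. All job features, link rates and CPU rates below are hypothetical and chosen only to show the trade-off:

```python
# Total offloading latency for task i on server j:
#   T_tr^j(i) + T_ex^j(i) = d_i / B_j + c * m_i / z_j
def total_latency(d_i, m_i, B_j, z_j, c=1.0):
    return d_i / B_j + c * m_i / z_j

# Hypothetical job features (the context) and server parameters.
d_i, m_i = 8e6, 2e9                 # bits to ship, cycles of work
servers = [
    {"B": 2e6,  "z": 4e9},          # server 0: slow link, fast CPU
    {"B": 10e6, "z": 1e9},          # server 1: fast link, slow CPU
]

latencies = [total_latency(d_i, m_i, s["B"], s["z"]) for s in servers]
best = min(range(len(servers)), key=lambda j: latencies[j])
```

For this data-heavy job the fast link wins (server 1: 0.8 + 2.0 = 2.8 s versus server 0: 4.0 + 0.5 = 4.5 s), while a compute-heavy job with d_i = 1e6 and m_i = 8e9 would flip the choice to server 0 — which is exactly why the job features must act as context rather than being averaged away.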
1.3 Markov Decision Processes (MDPs)

Communication systems where the transmissions are powered by the energy harvested from nature have rapidly emerged as viable options for the next-generation wireless networks [21]. These advances are promising, as they help us reduce the dependence on the conventional sources of energy and the carbon footprint of our systems. The ability to harvest energy promises self-sustainability and prolonged network lifetimes that are limited only by the communication hardware rather than the energy storage. Energy can be harvested from various natural sources, such as solar, thermal, chemical and biological sources. Their technologies differ in terms of their efficiency and harvesting capabilities, depending on the mechanisms, devices and circuits used. Since these energy sources are beyond human control, energy harvesting brings up a novel aspect: an irregular and random energy source for communication. This demands a fresh look at the transmission schemes used by wireless networks. The next-generation wireless networks need to be designed while keeping these irregularities and randomness in mind.

The performance of energy harvesting communication systems depends on the efficient utilization of the energy that is currently stored in the battery, as well as that to be harvested over time. These systems must make decisions while keeping their impact on future operations in mind. Such problems of optimal utilization of the available resources can be classified into offline optimization [22, 23, 24] and online optimization [25, 26, 27, 28, 24] problems. In the offline optimization problems, the transmitter deterministically knows the exact amounts of the harvested energies and the data along with their exact arrival times. These assumptions are too optimistic, since energy harvesting communication systems are usually non-deterministic.
In the online optimization problems, however, the transmitter is assumed to know the distributions or some statistics of the harvesting and the data arrival processes. It may get to know their instantaneous realizations before making decisions in each slot. The problem considered in Chapter 6 falls in the category of the online optimization problems.

In our work, the channel gains are assumed to be i.i.d. over time with an unknown distribution, and the harvested energy is assumed to be stochastically varying with a known distribution. This is based on the fact that weather conditions are more predictable than radio frequency (RF) channels, which are sensitive to time-varying multi-path fading. In the single-channel case, the transmitter has to decide its transmit power level based on the current battery status, with the goal of maximizing the average expected transmission rate obtained over time. We model the system as a Markov decision process (MDP) with the battery status as the state, the transmit power as the action and the rate as the reward. The power allocation problem, therefore, reduces to the average reward maximization problem for an MDP. Since the channel gain distribution is unknown, the expected rates for different power levels are also unknown. The transmitter, or the agent, therefore cannot determine the optimal mapping from the battery state to the transmit power precisely. It needs to learn these rate values over time and make decisions along the way. We cast this problem as an online learning problem over an MDP.

In the multi-channel case, the agent also needs to select a channel from the set of channels for transmission. The gain distributions of these channels are, in general, different and unknown to the agent. Due to the different distributions, the optimal channels for different transmit powers can be different.
The agent, therefore, needs to explore different channels over time to be able to estimate their rate values and use these estimates to choose a power level and a channel at each time. This exploration of different channels leads the agent into making non-optimal power and channel decisions and thus hampers the performance of the online learning algorithm.

One interesting feature of this problem is that the data-rate obtained during transmission is a known function of the chosen transmit power level and the instantaneous channel condition. Whenever a certain power is used for transmission, the agent can figure out the instantaneous channel condition once the instantaneous rate is revealed to it. This information about the instantaneous gain of the chosen channel can, therefore, be used to update the rate estimates for all power levels. The knowledge of the rate function can be used to speed up the learning process in this manner.
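The estimate-sharing idea just described can be sketched as follows: because a rate of the form log(1 + p·x) is invertible in the gain x for any power p > 0, each revealed rate pins down the instantaneous gain, which then refreshes the sample-mean rate estimates of every power level at once. The power levels and the gain distribution below are hypothetical, and the sketch is illustrative rather than the dissertation's LPSM algorithm:

```python
import math
import random

random.seed(1)
powers = [1.0, 2.0, 4.0]              # hypothetical transmit power levels
rate_sums = {p: 0.0 for p in powers}  # running sums of inferred rates
n = 0                                 # one shared sample count for all powers

for t in range(1000):
    p_used = random.choice(powers)             # whichever power was transmitted with
    x = random.expovariate(1.0)                # hidden channel gain realization
    rate = math.log(1 + p_used * x)            # revealed reward (known rate function)
    x_hat = (math.exp(rate) - 1) / p_used      # invert the rate to recover the gain
    n += 1
    for p in powers:                           # update estimates for ALL power levels
        rate_sums[p] += math.log(1 + p * x_hat)

rate_estimates = {p: rate_sums[p] / n for p in powers}
```

A bandit-style learner without the known rate function would instead update only the estimate of the power level it actually used, so the shared update above is where the knowledge of the reward function buys faster learning.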
The proposed algorithm and its proof technique, however, apply to a range of problems in wireless networks.

1.4.1 Distributed Computing Application

Let us consider the specific application from distributed computing. In figure 1.5, we illustrate the basic idea of Wireless Distributed Computing (WDC). Given an application consisting of multiple tasks, we wish to assign them to the networked devices with the goal of minimizing the total expected latency while keeping the expected cost under some threshold. This cost can depend on other performance metrics like energy consumption.

Figure 1.5: An example assignment of tasks from a task graph on the networked devices.

Implementing such a computational offloading involves extra communication costs, since the application and the profiling data must be transferred to remote servers. The additional communication affects both the energy consumption and the overall latency. In general, an application can be modeled by a task graph. In the example task graph shown in figure 7.1, a task is represented by a node whose weight specifies its workload. Each edge specifies the data dependency between two tasks, and is labeled with the amount of data exchange required. The offloading algorithm needs to come up with a task assignment scheme to minimize the expected delay subject to the cost constraint. Note that these collaborating devices are connected by wireless links which have time-varying gains. The collaborating wireless devices are mobile and can also get disconnected arbitrarily and come back later. The link qualities are, therefore, highly dynamic and variable. The quantity of the resources shared by a device is dependent on its own local processes. Hence, the performance of each device also varies with time. As a part of exploration, the task-graph scheduler needs to learn about the device performance and the link qualities over time by scheduling different tasks to these devices and using their corresponding links for data transmission.
It also needs to exploit this information to improve its task-graph scheduling with time.

Chapter 2 Background: Bandit Problems

One of the earliest bandit problems, the multi-armed bandit (MAB) problem, was posed by Thompson [29] from the perspective of clinical trials. There have been a variety of bandit problems studied in the fields of online optimization and machine learning since then. Personalized news recommendations and advertising are some of the other applications that have been studied extensively. Recently, there have been some efforts to apply bandit problems in the fields of communications and networking, since these systems operate in dynamically changing and unknown environments, and constantly make decisions along the way.

2.1 Introduction to Bandit Problems

Bandit problems are essentially sequential decision making problems, where the agent needs to select one of the possible arms or actions in each time slot. Depending on the chosen arm and the state of the environment, the agent receives a reward for that action. There could also be some observations presented to the agent depending on the particular system. The agent needs to update its belief about the arms and the environment based on the observations and the reward, and is expected to improve its decision making over time. The MAB problem can be understood easily by imagining a gambler inside a casino. At a row of slot machines, the gambler needs to make several decisions, such as which machines to play, how many times to play each machine, and in which order to play them. When played, each machine returns a random reward drawn from a probability distribution specific to that machine. The goal of the gambler is to maximize the sum of rewards earned over time through a sequence of lever pulls. In the standard MAB problem, the gambler has no initial knowledge about the machines and needs to learn their behavior only by pulling the levers.
The important dilemma the gambler faces at each trial is between "exploitation" of the machine that it believes to have the highest expected payoff and "exploration" to get more information about the expected payoffs of the other machines. Any online decision-making involves this fundamental dilemma of exploration vs. exploitation. Exploitation means that the agent should make the best decision given the current information, and exploration means that the agent should gather more information about the system by making random choices. The best long-term strategy may involve short-term sacrifices for the sake of exploration. However, too much exploration also hampers the performance, since the goal of exploration is to aid the exploitation by reducing the probability of making incorrect decisions during exploitation. This trade-off needs to be addressed for each problem separately based on the model assumptions, so that the agent gathers enough information to make the best overall decisions. This trade-off between exploration and exploitation is also faced in reinforcement learning problems.

2.2 Types of Bandit Problems

In addition to the standard MAB problem, there are various types of bandit problems that differ in their assumptions, their system models, their time horizons, the types of observations, etc. Let us briefly go over some of these bandit problems.

1. Finite Horizon Bandits: In such problems [30], each arm pull is associated with a pulling cost and the agent has a fixed budget to work with. There are other variations of the problem such as [31], where only exploration needs to be performed over a fixed horizon.

2. Contextual Bandits: In these problems, the agent is presented with some side information or context about the current slot. This side information affects the rewards the agent receives for each arm. The agent, therefore, also needs to take into account the current context during decision making for the given time-slot.
These problems have been studied from the perspective of personalized news recommendations and advertising in [9, 11, 7].

3. Combinatorial Bandits: In these problems [5, 32], instead of a single discrete variable to choose from, the agent needs to choose values for a set of variables. Assuming that each variable is discrete, the number of possible choices in each slot is exponential in the number of variables. In networking, these problems have been used in routing applications.

4. Adversarial Bandits: In this problem [33], the agent chooses an arm in each time-slot and an adversary chooses the reward structures for each arm simultaneously. This is one of the strongest generalizations of the bandit problem, since it removes all the assumptions of probability distributions for the arm rewards, and a solution to the adversarial bandit problem is a generalized solution to the more specific bandit problems.

5. Dueling Bandits: This variant models the exploration vs. exploitation trade-off for relative feedback. In this problem [34], the gambler is allowed to pull two levers in each slot, but it only receives a binary feedback specifying which lever provided the better reward. This problem is more difficult, since the gambler has no way of observing the reward of its actions directly.

2.3 Bandit Algorithms

Let us consider some of the common approaches to solving the standard MAB problem. In this problem, we assume that the rewards associated with each arm evolve in an i.i.d. fashion over time. The performance of any bandit algorithm is compared in terms of its "regret", which is defined as the expected difference between the cumulative rewards of the optimal action and that of the agent's actions over time. This means that the lower the regret, the better the algorithm. Intuitively, if the regret grows sub-linearly with time, the time-averaged regret tends to zero over time.
2.3.1 UCB1 Policy

In [3], Auer presented an upper confidence bound based policy called UCB1, which considers arms with non-negative rewards whose distributions have a finite support. We present their policy as follows:

Algorithm 1 UCB1 Policy [3]
1: // Initialization
2: for n = 1 to N do
3:   Play n-th arm;
4: end for
5: // Main Loop
6: while 1 do
7:   n = n + 1;
8:   Play an arm i that solves the maximization problem:
       i = arg max_i { x̄_i + √(2 ln n / n_i) },   (2.1)
     where x̄_i is the average reward received for arm i, and n_i is the number of times arm i has been played up to the current time slot;
9: end while

Equation (2.1) in algorithm 1 shows how UCB1 manages the trade-off between exploration and exploitation. If an arm i is not played often enough, then its corresponding count n_i would be small and the term √(2 ln n / n_i) would be comparatively large and dominate the sum. Hence the underplayed arms are more likely to be played, leading to their exploration. On the contrary, if all the arms are played often enough, then an arm that has the highest observed mean, x̄_i, is more likely to be picked, leading to exploitation. The regret of the UCB1 algorithm scales logarithmically with time.

2.3.2 Epsilon-Greedy Policy

The ε-greedy policy is a simple and well-known randomized bandit policy, where with probability 1 − ε, the machine with the highest average reward is played (exploitation), and with probability ε, a randomly chosen machine is played (exploration). In this case, a constant exploration probability ε causes a linear growth in the regret. One obvious improvement is to let ε go to zero at a certain rate, so that the exploration probability decreases over time as our estimates for the expected rewards get more accurate. In [3], Auer proves that a rate of 1/n, where n is the slot index, achieves a logarithmic regret bound.
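As a concrete illustration, the index rule in (2.1) can be sketched in a few lines of Python. This is our own toy simulation with hypothetical Bernoulli arms, not code from the thesis:

```python
import math
import random

random.seed(0)
means = [0.2, 0.5, 0.8]          # hypothetical Bernoulli arm means
K = len(means)
counts = [0] * K                 # n_i: number of pulls of arm i
sums = [0.0] * K                 # cumulative reward of arm i

# Initialization: play every arm once.
for i in range(K):
    counts[i] += 1
    sums[i] += float(random.random() < means[i])

# Main loop: pull the arm maximizing x_bar_i + sqrt(2 ln n / n_i), as in (2.1).
for n in range(K + 1, 10001):
    ucb = [sums[i] / counts[i] + math.sqrt(2 * math.log(n) / counts[i])
           for i in range(K)]
    i = max(range(K), key=lambda j: ucb[j])
    counts[i] += 1
    sums[i] += float(random.random() < means[i])

# After many slots, the best arm (mean 0.8 here) dominates the pull counts,
# while each suboptimal arm accumulates only O(log n) pulls.
```

The confidence term shrinks as √(ln n / n_i), so an arm keeps getting revisited occasionally until its empirical mean is trusted, which is exactly the exploration-exploitation balance described above.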
2.3.3 DSEE Policy

In [4], a deterministic equivalent of the ε-greedy policy is proposed, where the time-slots are deterministically divided into two mutually exclusive sequences: the exploration sequence and the exploitation sequence. For the slots in the exploitation sequence, the currently known best arm is played, and for the exploration sequence, all the arms are played in a round robin fashion. By interleaving the two sequences and by tuning the length of the exploration sequence appropriately, the authors show that the Deterministic Sequencing of Exploration and Exploitation (DSEE) policy achieves a logarithmic regret bound with respect to time.

Chapter 3 Related Work

In this chapter, we study the state of previous research in each of the focus areas of our thesis.

3.1 Contextual Bandits

Lai and Robbins [1] wrote one of the earliest papers on the stochastic MAB problem and provided an asymptotic lower bound of Ω(K log T) on the expected regret for any bandit algorithm. In [2], sample-mean based upper confidence bound (UCB) policies are presented that achieve the logarithmic regret asymptotically. In [3], several variants of the UCB based policies, including UCB1, are presented and are proved to achieve the logarithmic regret bound uniformly over time for arm-distributions with finite support. Contextual bandits extend the standard MAB setting by providing some additional information to the agent about each trial. In the contextual bandits studied in [9, 10, 11, 12], the agent observes feature vectors corresponding to each arm. The expected reward of the chosen arm is assumed to be a linear function of its corresponding context vector. LinRel [10] and LinUCB [11] use the context features to estimate the mean reward and its corresponding confidence interval for each arm. The arm that achieves the highest UCB, which is the sum of the mean reward and its confidence interval, is pulled.
These confidence intervals are defined in such a way that the expected reward lies in the confidence interval around its estimated mean with high probability. In LinRel and LinUCB, mean and covariance estimates of the linear function parameters are used to estimate the upper confidence bounds for the arm rewards. The LinUCB algorithm has also been empirically evaluated on Yahoo!'s news recommendation database in [9]. An algorithm based on Thompson sampling is proposed in [12] for bandits with linear rewards. In this Bayesian approach, the agent plays an arm based on its posterior probability of having the best reward. The posterior probabilities are then updated according to the reward obtained after each pull. LinRel and LinUCB achieve a regret bound of Õ(√T), while the Thompson sampling algorithm achieves a bound of Õ(√(T^(1+ε))) for any ε > 0, where T is the time horizon. Since these algorithms consider linear functions and store estimates of the parameters characterizing the rewards, their storage requirements do not increase with time. More general contextual bandits that do not assume any specific relation between the context and reward vectors are studied in [7, 13, 14]. The epoch-greedy algorithm [7] separates the exploration and exploitation steps and partitions the trials into different epochs. Each epoch starts with one exploration step followed by several exploitation steps. It stores the history of contexts, actions and rewards for the exploration steps and requires O(T^(2/3)) storage. Its regret is also upper bounded by Õ(T^(2/3)). RandomizedUCB [13], however, does not make any distinction between exploration and exploitation. At each trial it solves an optimization problem to obtain a distribution over all context-action mappings and samples an arm as the action for the trial. Its regret bound is Õ(√T), but it needs O(T) storage, as it stores the history of all trials.
ILTCB [14] is a more efficient algorithm that solves an optimization problem only on a pre-specified set of trials. However, it still needs O(T) storage. Since these general contextual bandit algorithms do not make any assumption on the context-reward mapping, they store the raw history of trials and do not summarize the information into meaningful statistics. Storage requirements for these approaches, therefore, increase with time. Contextual bandits in general metric spaces are studied in [8, 15] under the assumption of a Lipschitz reward function. This formulation is also able to handle the case of continuous contexts, unlike the previously discussed general contextual bandit problems. The query-ad-clustering algorithm [8] partitions the metric spaces into uniform clusters, whereas the meta-algorithm from [15] considers adaptive partitions and uses a general bandit algorithm as a subroutine. Upper confidence indexes are evaluated knowing the corresponding clusters of incoming contexts, the cluster-wise empirical means of previously obtained rewards, and the number of pulls for each cluster. Regret bounds for these approaches involve the packing and covering dimensions of the context and arm spaces. Recently, contextual bandits with budget and time constraints have been studied in [35, 36]. Resourceful contextual bandits from [35] consider a general setting with random costs associated with the arms, a continuous context-space and multiple budget constraints. An algorithm called Mixture Elimination that achieves O(√T) regret is proposed for this problem. A simplified model with fixed costs, discrete and finite contexts, exactly one time constraint and one budget constraint is considered in [36]. For this model, an algorithm called UCB-ALP achieving O(log T) regret is proposed. In these problem formulations, time and budget constraints also affect the regular exploration-exploitation trade-off.
3.2 Online Learning Over MDPs

The offline optimization problems in energy harvesting communications assume a deterministic system and the exact knowledge of energy and data arrival times and their amounts. In [22], the goal is to minimize the time by which all the packets are delivered. In [23], a finite horizon setting is considered with the goal of maximizing the amount of transmitted data. These two problems are proved to be duals of each other in [23]. In the online problems, the system is usually modeled as an MDP with the objective being the maximization of the average reward over time. These problems make specific distributional assumptions about the energy harvesting process and the channel gains. In the Markovian setting considered in [25], the energy replenishment process and the packet arrival process are assumed to follow Poisson distributions. Each packet has a random value assigned to it and the reward, in this setting, corresponds to the sum of the values of the successfully transmitted packets. In [26], the power management problem is formulated as an MDP, where the transmitter is assumed to know the full channel state information before the transmission in each slot. The properties of the optimal transmission policy are characterized using dynamic programming for this setting. In [27], the energy harvesting process and the data arrivals are assumed to follow Bernoulli distributions, and a simple AWGN channel is assumed for transmission. A policy iteration based scheme is designed to minimize the transmission errors for the system. In [28], power allocation policies over a finite time horizon with known channel gain and harvested energy distributions are studied. In [37, 38], power control policies are analyzed for communication systems with retransmissions. These MDP settings assume that energy harvesting follows a stationary Bernoulli process.
In [37], the channel is assumed to follow a known fading model, and in [38], the packet error rate is assumed to be known. In [24], the offline and online versions of the throughput optimization problem are studied for a fading channel and a Poisson energy arrival process. In our system, however, we do not assume specific distributions for the energy arrival process and the channel gain-to-noise ratios. Our problem can be seen through the lens of contextual bandits, which are extensions of the standard multi-armed bandits (MABs). In the standard MAB problem [1, 3, 4], the agent is presented with a set of arms, each providing stochastic rewards over time with unknown distributions, and it has to choose an arm in each trial with the goal of maximizing the sum reward over time. In [1], Lai and Robbins provide an asymptotic lower bound of Ω(ln T) on the expected regret of any algorithm for this problem. In [3], an upper confidence bound based policy called UCB1 and a randomized policy that separates exploration from exploitation called ε-greedy are proposed and are also proved to achieve logarithmic regret bounds for arm distributions with finite support. In [4], a deterministic equivalent of ε-greedy called DSEE is proposed and proved to provide similar regret guarantees. In contextual bandits, the agent also sees some side-information before making its decision in each slot. In the standard contextual bandit problems [9, 7, 13, 14, 39], the contexts are assumed to be drawn from an unknown distribution independently over time. In our problem, the battery state can be viewed as the context. We model the context transitions by an MDP, since the agent's action at time t affects not only the instantaneous reward but also the context in slot t + 1. The agent, therefore, needs to decide the actions with the global objective in mind, i.e. maximizing the average reward over time.
The algorithms presented in [7, 13, 14] do not assume any specific relation between the context and the reward. The DCB(ε) algorithm presented in [39], however, assumes that the mapping from the context and the random instance to the reward is a known function, and uses this function knowledge to reduce the expected regret. It must, however, be noted that the MDP formulation generalizes the contextual bandit setting in [39], since the i.i.d. context case can be viewed as a single state MDP. Our problem is also closely related to the reinforcement learning problem over MDPs from [40, 41, 42]. The objective for these problems is to maximize the average undiscounted reward over time. In [40, 41], the agent is unaware of the transition probabilities and the mean rewards corresponding to the state-action pairs. In [42], the agent knows the mean rewards, but the transition probabilities are still unknown. In our problem, the mean rewards are unknown, while the transition probabilities of the MDP can be inferred from the knowledge of the arrival distribution and the action taken from each state. In contrast to the works above, for our problem motivated by the practical application in energy harvesting communications, we show that the learning incurs a constant regret in the single channel case.

3.3 Contextual Combinatorial Bandits

The MAB setting has been studied for the distributed computing application in [20]. However, it only considers a static task graph and doesn't allow for a dynamically changing context in each time-slot. The contextual combinatorial bandit problem has been previously analyzed from the perspective of online recommendations in [43]. However, it does not consider the constrained optimization problem studied in this thesis. Also, the reward and the cost functions are not known for the applications considered in [43]. Ours is the first work on the constrained contextual combinatorial bandit problem with known reward functions.
The bandit problems are widely useful in fields such as finance, industrial engineering and medicine. This thesis is, however, particularly motivated by their applications in communications and networking. Given the wide applicability of bandits to a range of communication systems, as testified by these papers, it is evident that bandits in general, and MABs in particular, will have a significant impact on the process of design and analysis of communication algorithms for networks operating in unknown stochastic environments.

Chapter 4 Contributions

In this chapter, we discuss the significant contributions of this thesis in each of the focus areas of our research.

4.1 Contextual Bandits

In chapter 5, we first analyze the case of discrete and finite context-spaces and propose a policy called discrete contextual bandits or DCB(ε) that requires O(MK) storage, where M is the number of distinct contexts and K is the number of arms. DCB(ε) uses upper confidence bounds similar to UCB1, a standard stochastic MAB algorithm from [3], and yields a regret that grows logarithmically in time and linearly in the number of arms not optimal for any context. A key step in our analysis is to use a novel proof technique to prove a constant upper bound on the regret contributions of the arms which are optimal for some context. In doing so, we also prove a high probability bound for UCB1 on the number of pulls of the optimal arm in the standard MAB problem. This high probability bound can be independently used in other bandit settings as well. Further, we use the standard MAB asymptotic lower bound results from [1] to show the order optimality of DCB(ε). Note that DCB(ε) outperforms UCB1 and Multi-UCB, a discretized version of the query-ad-clustering algorithm from [8]. The regret of UCB1 scales linearly with time, as it ignores the context information completely. Multi-UCB uses the context information to run separate instances of UCB1 for each context, but is unable to exploit the reward function knowledge.
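Multi-UCB can be sketched by keeping an independent UCB1 learner per discrete context. This is a minimal illustration under our own naming (the class and method names are not from the thesis):

```python
import math
from collections import defaultdict

class UCB1:
    """Standard UCB1 over K arms: pull arg max of mean_i + sqrt(2 ln n / n_i)."""
    def __init__(self, K):
        self.K, self.n = K, 0
        self.counts, self.means = [0] * K, [0.0] * K

    def select(self):
        self.n += 1
        for i in range(self.K):          # initialization: play each arm once
            if self.counts[i] == 0:
                return i
        return max(range(self.K), key=lambda i: self.means[i]
                   + math.sqrt(2 * math.log(self.n) / self.counts[i]))

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]

class MultiUCB:
    """One independent UCB1 instance per discrete context."""
    def __init__(self, K):
        self.learners = defaultdict(lambda: UCB1(K))

    def select(self, context):
        return self.learners[context].select()

    def update(self, context, arm, reward):
        self.learners[context].update(arm, reward)
```

Each context learns its arms from scratch, with no sharing of observations across contexts; this independence is what the known-reward-function approach removes.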
The regret of Multi-UCB, therefore, grows logarithmically in time, but unlike DCB(ε) it scales linearly in the total number of arms. For continuous context-spaces with Lipschitz rewards, we propose a novel algorithm called continuous contextual bandits or CCB(ε, δ) that quantizes the context-space and uses DCB(ε) over the quantized space as a subroutine. Our analysis of CCB(ε, δ) reveals an interesting trade-off between the storage requirement and the regret, where the desired performance can be achieved by tuning the parameter δ to the time horizon. Decreasing the quantization step-size increases the storage requirements while decreasing the regret. Thus, by exploiting the reward-function knowledge for joint learning across contexts, CCB(ε, δ) is able to obtain regrets even smaller than the lower bound of Ω(T^(2/3)) from [8] for continuous contextual bandits without such knowledge. For the case of an unknown time horizon, we employ a doubling technique that provides similar regret guarantees while using the same amount of storage. CCB(ε, δ), therefore, empowers the system designer by giving control over δ to tune the regret performance.

4.2 Online Learning Over MDPs

The problem of maximizing the average expected reward of an MDP can be formulated as a linear program (LP). The solution of this LP gives the stationary distribution over the state-action pairs under the optimal policy. If the MDP is ergodic, then there exists a deterministic optimal policy. In chapter 6, we model the problem of communication over a single channel using the harvested energy as an MDP and prove the ergodicity of the MDP under certain assumptions about the transmit power and the distribution of harvested energy. This helps us focus only on the deterministic state-action mappings, which are finite in number. The LP formulation helps us characterize the optimal policy that depends on the transition probabilities for the state-action pairs and their corresponding mean rewards.
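This type of LP can be sketched as follows: maximize Σ_{s,a} μ(s,a) r̄(s,a) over stationary state-action frequencies μ, subject to flow-balance and normalization constraints. The snippet below is a generic average-reward MDP formulation using scipy, with made-up transition probabilities and mean rewards; it is not the exact formulation from chapter 6:

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical 2-state, 2-action MDP (illustrative numbers only).
# P[s, a, s2] = transition probability, r[s, a] = mean reward.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
r = np.array([[1.0, 0.5],
              [0.2, 2.0]])

S, A = r.shape
n = S * A  # decision variables: stationary state-action frequencies mu(s, a)

# Objective: maximize sum_{s,a} mu(s,a) r(s,a)  ->  linprog minimizes, so negate.
c = -r.reshape(n)

# Flow balance: for each s', sum_a mu(s',a) = sum_{s,a} mu(s,a) P(s'|s,a),
# plus one normalization row forcing mu to sum to 1.
A_eq = np.zeros((S + 1, n))
for s2 in range(S):
    for s in range(S):
        for a in range(A):
            A_eq[s2, s * A + a] = P[s, a, s2] - (1.0 if s == s2 else 0.0)
A_eq[S, :] = 1.0
b_eq = np.zeros(S + 1)
b_eq[S] = 1.0

res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * n)
mu = res.x.reshape(S, A)
# For an ergodic MDP, a deterministic optimal policy can be read off by
# picking, in each state, the action carrying the largest frequency mu(s, a).
policy = mu.argmax(axis=1)
```

The optimal average reward is -res.fun, which is the genie benchmark that an online algorithm's regret is measured against; in LPSM-style schemes, the unknown r would be replaced by running sample-mean estimates.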
We use the optimal mean reward obtained from the LP as a benchmark against which to compare the performance of our algorithms. Since the mean rewards associated with the state-action pairs are unknown to the agent, we propose two online learning algorithms, LPSM and Epoch-LPSM, that learn these rewards and adapt their policies along the way. The LPSM algorithm solves the LP at each step to decide its current policy based on its current sample-mean estimates of the rewards, while the Epoch-LPSM algorithm divides the time into epochs, solves the LP only at the beginning of each epoch, and follows the obtained policy throughout that epoch. We measure the performance of our online algorithms in terms of their regrets, defined as the cumulative difference between the optimal mean reward and the instantaneous reward of the algorithm. We prove that the reward loss or regret incurred by each of these algorithms is upper bounded by a constant. To our knowledge this is the first result where constant regret algorithms are proposed for the average reward maximization problem over MDPs with stochastic rewards with unknown means. We further prove that the LPSM algorithm starts following the genie's optimal policy in finite expected time. The finite expected time is an even stronger result than the constant regret guarantee. Our proposed single channel algorithms greatly differ in their computational requirements. Epoch-LPSM incurs a higher regret compared to LPSM, but reduces the computational requirements substantially. LPSM solves a total of T LPs in time T, whereas Epoch-LPSM solves only O(ln T) LPs. We introduce two parameters, n_0 and γ, that reveal the computation vs. regret tradeoff for Epoch-LPSM. Tuning these parameters allows the agent to control the system based on its performance requirements. We extend our framework to the case of multiple channels, where the agent also needs to select the transmission channel in each slot.
We present our MC-LPSM algorithm that deterministically separates exploration from exploitation. MC-LPSM explores different channels to learn their expected rewards and uses that information to solve the average reward maximization LP during the exploitation slots. The length of the exploration sequence scales logarithmically over time and contributes to the bulk of the regret. This exploration, however, helps us bound the exploitation regret by a constant. We, therefore, prove a regret bound for MC-LPSM that scales logarithmically in time and linearly in the number of channels. This design of the exploration sequence, however, needs to know a lower bound on the difference in rates for the channels. We observe that this need for knowing some extra information about the system can be eliminated by using a longer exploration sequence as proposed in [4]. The regret of this design can be made arbitrarily close to the logarithmic order. We also prove an asymptotic regret lower bound of Ω(ln T) for any algorithm under certain conditions. This proves the asymptotic order optimality of the proposed MC-LPSM approach. We further show that, similar to Epoch-LPSM, the MC-LPSM algorithm also solves only O(ln T) LPs in time T. We show that the proposed online learning algorithms also work for cost minimization problems in packet scheduling with power-delay tradeoff with minor changes.

4.3 Contextual Combinatorial Bandits

In chapter 7, our contributions can be divided into two types, one being on the modeling aspects of the application from distributed computing, and the other being on the online learning algorithm aspects of the general problem.

4.3.1 Wireless Distributed Computing

The mapping of task graphs to the network of devices in distributed computing is performed with the objective of minimizing the overall execution latency.
In doing so, however, one also needs to consider the costs associated with the execution of the tasks on those devices and the transmission costs associated with moving the data from one device to another. In chapter 7, we explain what context means for these task assignment problems. We also propose a representation for the costs and delays as functions of the available context and the internal random variables of the system. We notice that the delays have a non-linearity associated with them, since a child task needs to wait for the execution of all its parents to be over and receive the outputs from them. We analyze the optimization problem that a genie that knows the distributions of the internal random variables would solve to get the optimal mapping at each time slot. Since the agent does not know these distributions, it needs to learn them over time and improve its decision making along the way.

4.3.2 Online Learning Algorithms

We propose a novel online learning algorithm, called the PMF learning algorithm, to learn the distributions of the internal random variables over time. As we learn about the devices and the links over time, our estimated distributions of the internal random variables of the system get closer to their true distributions. The non-linearity of the delays, however, does not affect the learning, since the max function is Lipschitz. We prove a concentration inequality for Lipschitz functions that is helpful in proving the regret guarantees for our online learning algorithm. The PMF learning algorithm needs to solve the optimal mapping problem at each time slot using the contextual information and the estimates of the PMF vectors for the internal random variables. The agent uses the lower confidence bounds (LCBs) for each PMF vector instead of the true PMFs in each time slot, since the true PMFs are unknown. Using the derived concentration inequality, we prove an upper bound on the regret of our algorithm.
The analysis of our algorithm includes several interesting steps, where we analyze the total number of slots where the agent fails to make the optimal decisions by separating the different reasons for getting a non-optimal mapping from the optimization problem. To the best of our knowledge, this is the first work that deals with contextual combinatorial bandits with known reward functions.

Chapter 5 Contextual Bandits with Known Reward Functions

The work in this chapter is based on [39]. In this chapter, we consider the problem of stochastic contextual bandits. First, we describe the system model and discuss a naive algorithm to solve the problem for discrete context spaces. Later, we propose our novel online learning algorithm, called DCB, and analyze its regret performance. Finally, we propose another algorithm, called CCB, that extends DCB to continuous context spaces.

5.1 Problem Formulation

In the general stochastic contextual bandit problem, there is a distribution D over (y, r), where y ∈ Y is a context vector in the context space Y and r ∈ R^K is the reward vector containing entries corresponding to each arm. The problem is a repeated game such that at each trial t, an independent sample (y_t, r_t) is drawn from D and the context vector y_t is presented to the agent. Once the agent chooses an arm a_t ∈ {1, …, K}, it receives a reward r_{a,t}, which is the a_t-th element of the reward vector. A contextual bandit algorithm A chooses an arm a_t to pull at each trial t, based on the previous history of contexts, actions and their corresponding rewards, (y_1, a_1, r_{a,1}), …, (y_{t−1}, a_{t−1}, r_{a,t−1}), and the current context y_t. The goal is to maximize the total expected reward Σ_{t=1}^T E_{(y_t, r_t)∼D}[r_{a,t}]. Consider a set H consisting of hypotheses h : Y → {1, …, K}. Each hypothesis maps the context y to an arm a.
The expected reward of a hypothesis $h$ is expressed as

$R(h) = \mathbb{E}_{(y,r)\sim D}\left[ r_{h(y)} \right]. \quad (5.1)$

The Bayes optimal hypothesis $h^*$ refers to the hypothesis with maximum expected reward, $h^* = \arg\max_{h \in \mathcal{H}} R(h)$, and maps contexts as $h^*(y) = \arg\max_{a \in \{1,\dots,K\}} \mathbb{E}[r_a \mid y]$. The agent's goal is to choose arms sequentially and compete with the hypothesis $h^*$. The regret of the agent's algorithm $A$ is defined as the difference between the expected reward accumulated by the best hypothesis $h^*$ and that accumulated by the algorithm over the trials, which can be expressed as

$R_A(n) = n R(h^*) - \mathbb{E}\left[ \sum_{t=1}^{n} r_{A(t),t} \right], \quad (5.2)$

where $A(t)$ denotes the arm pulled by algorithm $A$ in the $t$-th trial. Note that the expectation in (5.2) is over the sequence of random samples from $D$ and any internal randomization potentially used by $A$. The reward maximization problem is equivalently formulated as a regret minimization problem: the smaller the regret $R_A(n)$, the better the policy. We would like the regret to be sublinear in the time $n$ so that the time-averaged regret converges to zero. Such policies are asymptotically Bayes optimal, since $\lim_{n\to\infty} R_A(n)/n = 0$.

Let $x_{a,t}$ denote the state or the value of the $a$-th arm at the $t$-th slot and $\mathcal{X}$ denote the set of all possible states of an arm. We assume that the reward is a known bounded function of the selected arm's state and the context, represented as

$r_{a,t} = g(y_t, x_{a,t}), \quad (5.3)$

where $|g(y,x)| \le B$ for some $B > 0$ and for all $y \in \mathcal{Y}, x \in \mathcal{X}$. Note that, in the standard MAB problems, the reward of an arm is the same as its state. We assume semi-bandit feedback, which means only the state of the agent's chosen arm is revealed. If the reward is an injective function of the arm state for all contexts, then the arm state can be inferred without ambiguity from the reward observation, since the current context is already known to the user. Thus, from now on we assume that the chosen arm's state is either revealed by semi-bandit feedback or inferred from the reward observation.
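As a toy illustration of definition (5.2), the sketch below builds a small hypothetical instance (the reward function $g$ and the arm-state means are made up for illustration) and evaluates the expected regret of a fixed policy against the Bayes optimal hypothesis.

```python
import numpy as np

# Hypothetical toy instance: 2 equally likely contexts, 2 arms.
contexts = [0, 1]
p_context = np.array([0.5, 0.5])

def g(y, x):                        # assumed known context-reward mapping
    return x if y == 0 else 1.0 - x

mean_state = np.array([0.6, 0.4])   # E[x_a]; unknown to the agent in practice
# g is linear in x here, so E[g(y, x_a)] = g(y, E[x_a]):
mu = np.array([[g(y, m) for m in mean_state] for y in contexts])
h_star = mu.argmax(axis=1)          # Bayes optimal hypothesis h*(y)

# Expected regret of the naive "always pull arm 0" policy over n trials:
n = 1000
regret = n * sum(p_context[y] * (mu[y, h_star[y]] - mu[y, 0]) for y in contexts)
# regret grows linearly in n, as expected for a context-blind policy
```

Here $h^*$ picks arm 0 for context 0 and arm 1 for context 1, so a context-blind policy pays a constant expected regret per trial.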
Note that, since we are solving a combinatorial maximization problem over $a$, we do not need further concavity assumptions on the reward function $g$.

In this problem formulation, the context space $\mathcal{Y}$ can be either discrete or continuous. We first develop an algorithm called DCB($\epsilon$) for discrete context spaces in section 5.2 and later propose its extension CCB($\epsilon, \delta$) for continuous context spaces in section 5.4, assuming that the reward function is Lipschitz with respect to the context variable.

5.2 Discrete Contexts

In this section, we consider a discrete and finite context space $\mathcal{Y} = \{y^{(1)}, \dots, y^{(M)}\}$ with $M$ elements. As a convention, we use $j$ and $k$ to index the arms and $i$ to index the discrete contexts.

5.2.1 Naive Approach

In the standard MAB setting, the reward of an arm is the same as its value and all trials are stochastically identical. The UCB1 policy, proposed by Auer et al. in [3], pulls the arm that maximizes $\bar{X}_k + \sqrt{2 \ln n / m_k}$ at the $n$-th slot, where $\bar{X}_k$ is the mean observed value of the $k$-th arm and $m_k$ is the number of pulls of the $k$-th arm. The second term is the size of the one-sided confidence interval for the empirical mean within which the actual expected reward value falls with high probability. Applying such standard MAB algorithms directly to our problem results in poor performance, as they ignore the context-reward relationship. Playing the arm with the highest mean value is not necessarily a wise policy, since the trials are not all identical. When every context in $\mathcal{Y}$ appears with a non-zero probability, playing the same arm over different trials implies that a suboptimal arm is pulled with a non-zero probability at each trial. The expected regret, therefore, grows linearly over time. Although the regret of the naive approach is not sublinear, it serves as one of the benchmarks to compare our results against.

5.2.2 Multi-UCB

The main drawback of the plain UCB1 policy is that it ignores the contexts completely.
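The UCB1 index just described can be computed as below. This is a minimal sketch; the function name and array bookkeeping are ours.

```python
import numpy as np

def ucb1_pick(means, pulls, n):
    """Pick the arm maximizing X̄_k + sqrt(2 ln n / m_k) (the UCB1 index).
    `means` are empirical means, `pulls` the pull counts m_k, `n` the slot."""
    means = np.asarray(means, dtype=float)
    pulls = np.asarray(pulls, dtype=float)
    return int(np.argmax(means + np.sqrt(2.0 * np.log(n) / pulls)))
```

Note how a rarely pulled arm wins through its large confidence bonus even when its empirical mean is not the largest, which is what drives UCB1's exploration.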
One way to get around this problem is to run a separate instance of UCB1 for each context. Each UCB1 instance can be tailored to a particular context, where the goal is to learn the arm-rewards for that context by trying different arms over time. The UCB1 instance for context $y^{(i)}$ picks the arm that maximizes $\bar{g}_{i,k} + \sqrt{2 \ln n_i / m_{i,k}}$, where $\bar{g}_{i,k}$ denotes the empirical mean of the rewards obtained from the previous $m_{i,k}$ pulls of the $k$-th arm for context $y^{(i)}$ and $n_i$ denotes the number of occurrences of that context till the $n$-th trial. Since each instance incurs a regret that is logarithmic in time, the overall regret of Multi-UCB1 is also logarithmic in time. Compared to the naive approach, where the regret is linear in time, this is a big improvement. Notice, however, that Multi-UCB learns the rewards independently for each context and does not exploit the information revealed by the pulls for other contexts. Unless a single arm is optimal for all contexts, every arm's contribution to the regret is logarithmic in time, since the arms are explored independently for all contexts. Hence the regret scales as $O(MK \log n)$.

5.2.3 New Policy

Let $\mu_{i,j}$ be the expected reward obtained by pulling the $j$-th arm for context $y^{(i)}$, which is evaluated as

$\mu_{i,j} = \mathbb{E}\left[ r_j \mid y^{(i)} \right] = \mathbb{E}\left[ g(y^{(i)}, x_j) \right]. \quad (5.4)$

Let the corresponding optimal expected reward be $\mu_i^* = \mu_{i, h^*(y^{(i)})}$. We define $G_i$, the maximum deviation in the rewards, as

$G_i = \sup_{x \in \mathcal{X}} g(y^{(i)}, x) - \inf_{x \in \mathcal{X}} g(y^{(i)}, x). \quad (5.5)$

Since the reward function is assumed to be bounded over the domain of arm values, $G_i \le 2B$. Similar to Multi-UCB, the idea behind our policy is to store an estimator of $\mu_{i,j}$ for every context-arm pair. For these statistics, we compute upper confidence bounds similar to UCB1 [3], depending on the number of previous pulls of each arm, and use them to decide the current action. Since the selected arm's state can be observed or inferred irrespective of the context, it provides information about possible rewards for other contexts as well.
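The per-context bookkeeping of Multi-UCB can be sketched as follows; the class structure is our illustrative rendering of "one independent UCB1 instance per discrete context".

```python
import numpy as np

class MultiUCB:
    """One independent UCB1 instance per discrete context (sketch)."""
    def __init__(self, n_contexts, n_arms):
        self.sums = np.zeros((n_contexts, n_arms))
        self.pulls = np.zeros((n_contexts, n_arms), dtype=int)
        self.n = np.zeros(n_contexts, dtype=int)     # occurrences n_i

    def pick(self, i):
        self.n[i] += 1
        if (self.pulls[i] == 0).any():               # play each arm once first
            return int(np.argmin(self.pulls[i]))
        idx = (self.sums[i] / self.pulls[i]
               + np.sqrt(2.0 * np.log(self.n[i]) / self.pulls[i]))
        return int(np.argmax(idx))

    def update(self, i, arm, reward):
        # a pull only informs context i -- no sharing across contexts,
        # which is exactly why Multi-UCB pays O(MK log n) regret
        self.sums[i, arm] += reward
        self.pulls[i, arm] += 1
```

Contrast this with DCB($\epsilon$) below, where a single pull updates the estimates for every context through the known reward function.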
This allows joint learning of the arm-rewards for all contexts. Intuitively, this policy should be at least as good as Multi-UCB. Our policy, shown in algorithm 2, is called Discrete Contextual Bandit, or DCB($\epsilon$). In table 5.1, we summarize the notations used.

Algorithm 2 DCB($\epsilon$)
1: Parameters: $\epsilon > 0$.
2: Initialization: $(\hat{\mu}_{i,j})_{M \times K} = (0)_{M \times K}$, $(m_j)_{1 \times K} = (0)_{1 \times K}$.
3: for $n = 1$ to $K$ do
4:   Set $j = n$ and pull the $j$-th arm;
5:   Update $\hat{\mu}_{i,j}$ for all $i : 1 \le i \le M$ as $\hat{\mu}_{i,j} = g(y^{(i)}, x_{j,n})$, and set $m_j = 1$;
6: end for
7: // MAIN LOOP
8: while 1 do
9:   $n = n + 1$;
10:  Given the context $y_n = y^{(i)}$, pull the arm $A(n)$ that solves the maximization problem

$A(n) = \arg\max_{j \in S} \left\{ \hat{\mu}_{i,j} + G_i \sqrt{\frac{(2+\epsilon)\ln n}{m_j}} \right\}; \quad (5.6)$

11:  Update $\hat{\mu}_{i,j}$ and $m_j$ for all $i : 1 \le i \le M$ as

$\hat{\mu}_{i,j}(n) = \begin{cases} \dfrac{\hat{\mu}_{i,j}(n-1)\, m_j(n-1) + g(y^{(i)}, x_{j,n})}{m_j(n-1) + 1}, & \text{if } j = A(n) \\ \hat{\mu}_{i,j}(n-1), & \text{else} \end{cases} \quad (5.7)$

$m_j(n) = \begin{cases} m_j(n-1) + 1, & \text{if } j = A(n) \\ m_j(n-1), & \text{else} \end{cases} \quad (5.8)$

12: end while

Algorithmically, DCB($\epsilon$) is similar in spirit to UCB1. It differs from UCB1 in updating multiple reward averages after a single pull. Also, the reward-ranges $G_i$ for the various contexts are in general different, and better regret constants are obtained by scaling the confidence bounds accordingly. The major difference, however, comes from the parameter $\epsilon$, which needs to be strictly positive. This condition is essential for bounding the number of pulls of the arms that are optimal for some context.
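A minimal Python rendering of algorithm 2 follows. This is our sketch: the toy reward function $g(y,x) = \min(y,x)$ and the ranges $G_i$ used in the usage example are illustrative assumptions, not part of the algorithm itself.

```python
import numpy as np

class DCB:
    """Minimal sketch of DCB(eps) (algorithm 2): a single pull updates the
    estimate mu_hat[i, j] for EVERY context i via the known function g."""
    def __init__(self, g, contexts, n_arms, G, eps=1e-2):
        self.g, self.contexts, self.G, self.eps = g, contexts, G, eps
        self.mu_hat = np.zeros((len(contexts), n_arms))
        self.m = np.zeros(n_arms, dtype=int)    # pull counts m_j
        self.n = 0

    def pick(self, i):
        self.n += 1
        if (self.m == 0).any():                 # initialization: each arm once
            return int(np.argmin(self.m))
        bonus = self.G[i] * np.sqrt((2 + self.eps) * np.log(self.n) / self.m)
        return int(np.argmax(self.mu_hat[i] + bonus))       # eq. (5.6)

    def update(self, arm, x):                   # x: observed/inferred arm state
        for i, y in enumerate(self.contexts):   # eq. (5.7): update all contexts
            self.mu_hat[i, arm] = (self.mu_hat[i, arm] * self.m[arm]
                                   + self.g(y, x)) / (self.m[arm] + 1)
        self.m[arm] += 1                        # eq. (5.8)
```

Because `update` refreshes all $M$ rows of `mu_hat` from one observed arm state, the exploration burden is shared across contexts, which is the source of the improved regret constants compared to Multi-UCB.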
Table 5.1: Notations for Algorithm Analysis

K : number of arms
M : number of distinct contexts (for discrete contexts)
$\mathcal{Y}$ : set of all contexts
$\mathcal{X}$ : support set of the arm-values
$*$ : index to indicate the contextually optimal arm
$p_i$ : probability of the context being $y^{(i)}$ (for discrete contexts)
$\mathcal{Y}_j$ : set of all contexts for which the $j$-th arm is optimal
$q_j$ : sum probability of all contexts in $\mathcal{Y}_j$
$S$ : index set of all arms
$O$ : index set of all optimal arms
$\bar{O}$ : index set of all non-optimal arms
$T^O_j(n)$ : number of optimal pulls of the $j$-th arm in $n$ trials
$T^N_j(n)$ : number of non-optimal pulls of the $j$-th arm in $n$ trials
$T_j(n)$ : number of pulls of the $j$-th arm in $n$ trials (note: $T_j(n) = T^O_j(n) + T^N_j(n)$)
$A(n)$ : index of the arm pulled in the $n$-th trial by DCB($\epsilon$)

The analysis for these arms is sophisticated and differs significantly from the standard analysis of UCB1, and is therefore one of the main contributions of our chapter. DCB($\epsilon$) uses an $M \times K$ matrix $(\hat{\mu}_{i,j})_{M \times K}$ to store the reward information obtained from previous pulls. $\hat{\mu}_{i,j}$ is the sample mean of all rewards corresponding to context $y^{(i)}$ and the observed $x_j$ values. In addition, it uses a length-$K$ vector $(m_j)_{1 \times K}$ to store the number of pulls of the $j$-th arm up to the current trial. At the $n$-th trial, the $A(n)$-th arm is pulled and $x_{A(n),n}$ is revealed or inferred from the reward. Based on this, $(\hat{\mu}_{i,j})_{M \times K}$ and $(m_j)_{1 \times K}$ are updated. It should be noted that the time indexes in (5.7) and (5.8) are only for notational clarity; it is not necessary to store the previous matrices while running the algorithm. The storage required by the DCB($\epsilon$) policy is, therefore, only $O(MK)$ and does not grow with time. In section 5.3, we analyze and upper bound the cumulative regret of our policy and show that it scales logarithmically in time and linearly in the number of arms not optimal for any context.

5.3 Regret Analysis of DCB($\epsilon$)

In the standard MAB problems, regret arises when the user pulls the non-optimal arms.
Hence the regret upper bounds can be derived by analyzing the expected number of pulls of each non-optimal arm. In our problem, however, there can be multiple contextually optimal arms. It is, therefore, important to analyze not only the number of times an arm is pulled but also the contexts it is pulled for. Let $S = \{1, 2, \dots, K\}$ denote the index set of all arms. Let us define an optimal set $O$ and a non-optimal set $\bar{O}$ as

$O = \{ j \mid \exists\, y^{(i)} \in \mathcal{Y} : h^*(y^{(i)}) = j \}, \quad \bar{O} = S \setminus O. \quad (5.9)$

Note that the arms in the non-optimal set $\bar{O}$ are not optimal for any context. This means that every time an arm in $\bar{O}$ is pulled, it contributes to the expected regret. However, an arm in $O$ contributes to the expected regret only when it is pulled for contexts with different optimal arms. When these optimal arms are pulled for the contexts for which they are optimal, they don't contribute to the expected regret. We analyze these two cases separately and provide upper bounds on the number of pulls of each arm for our policy.

Theorem 1 (Bound on the pulls of non-optimal arms). For all $j \in \bar{O}$, under the DCB($\epsilon$) policy,

$\mathbb{E}\left[ T_j(n) \right] \le \frac{4(2+\epsilon)\ln n}{\left( \min_{1 \le i \le M} \Delta^{(i)}_j \right)^2} + 1 + \frac{\pi^2}{3}, \quad (5.10)$

where $\Delta^{(i)}_j = \mu_i^* - \mu_{i,j}$.

Theorem 2 (Bound on the non-optimal pulls of optimal arms). For all $j \in O$, under the DCB($\epsilon$) policy,

$\mathbb{E}\left[ T^N_j(n) \right] \le n_o + 2^{3+2\epsilon} K^{4+2\epsilon} c_\epsilon \sum_{y^{(i)} \in \mathcal{Y}_j} \frac{1}{p_i^{2+2\epsilon}} + \sum_{y^{(i)} \in \mathcal{Y}_j} \frac{2}{p_i^2} + \frac{\pi^2}{3} \sum_{y^{(i)} \notin \mathcal{Y}_j} p_i, \quad (5.11)$

where $c_\epsilon = \sum_{n=1}^{\infty} \frac{1}{n^{1+2\epsilon}}$ and $n_o$ is the minimum value of $n$ satisfying

$\left\lfloor \frac{p^o n}{2K} \right\rfloor > \left\lceil \frac{4(2+\epsilon)\ln n}{(\Delta^o)^2} \right\rceil, \quad (5.12)$

with $p^o = \min_{1 \le i \le M} p_i$ and $\Delta^o = \min_{i,j : h^*(i) \ne j} \Delta^{(i)}_j$.

Theorem 1, whose proof can be found in appendix B, provides an upper bound on the number of pulls of any non-optimal arm that scales logarithmically in time. As discussed previously, all such pulls contribute to the regret. Theorem 2 states that the number of non-optimal pulls of any optimal arm is bounded above by a constant. The regret contribution of the optimal arms is, therefore, bounded.
We sketch the proof of theorem 2 to emphasize the basic ideas of our novel proof technique and provide the detailed proof in appendix D. In order to bound the number of pulls of the optimal arms in DCB($\epsilon$), we require a high probability bound on the optimal pulls by UCB1 [3] in the standard MAB problem. Since we are not aware of any such existing result, we derive one ourselves. Lemma 1 shows a bound for the generalized UCB1 policy, called UCB1($\epsilon$), the proof of which can be found in appendix C. Note that the confidence interval in UCB1($\epsilon$) is $\sqrt{\frac{(2+\epsilon)\ln n}{m_k}}$, which reduces to that of UCB1 for $\epsilon = 0$. In this context, $\mu_j$ denotes the mean value of the $j$-th machine and $\mu^* = \max_{1 \le j \le K} \mu_j$.

Lemma 1 (High probability bound for UCB1($\epsilon$)). In the standard stochastic MAB problem, if the UCB1($\epsilon$) policy is run on $K$ machines having arbitrary reward distributions with support in $[0, 1]$, then

$\Pr\left\{ T^*(n) < \frac{n}{K} \right\} < \frac{2 K^{4+2\epsilon}}{n^{2+2\epsilon}}, \quad (5.13)$

for all $n$ such that

$\left\lfloor \frac{n}{K} \right\rfloor > \frac{4(2+\epsilon)\ln n}{\min_{j \ne j^*} (\mu^* - \mu_j)^2}. \quad (5.14)$

Note that $\frac{n}{\ln n}$ is an increasing sequence in $n$ and there exists some $n_0 \in \mathbb{N}$ such that condition (5.14) is satisfied for every $n \ge n_0$. $n_0$ is a constant whose actual value depends on the true mean values of the arms.

Proof sketch of theorem 2. For every optimal arm, there exists a non-zero probability of appearance of the contexts for which it is optimal. Over successive pulls, DCB($\epsilon$) leads us to pull the optimal arms with very high probability for their corresponding contexts. Thus, these arms are pulled at least a constant fraction of the time with high probability. Since this constant fraction of the time is much more than the logarithmic exploration requirement, these optimal arms need not be explored during other contexts. The idea is to show that the expected number of these non-optimal pulls is bounded. Let us define $K$ mutually exclusive and collectively exhaustive context-sets $\mathcal{Y}_j$ for $j \in S$ as

$\mathcal{Y}_j = \{ y^{(i)} \in \mathcal{Y} \mid h^*(y^{(i)}) = j \}.$

Further, let $N_i(n)$ denote the number of occurrences of context $y^{(i)}$ till the $n$-th trial.
At the $n$-th trial, if an optimal arm $j$ is pulled for a context $y_n = y^{(i)} \notin \mathcal{Y}_j$, this happens due to one of the following reasons:
1. The number of pulls of the $j$-th arm till the $n$-th trial is small and we err due to a lack of sufficient observations of the $j$-th arm's values, which means the arm is under-played.
2. We err in spite of having previously pulled the $j$-th arm enough number of times.
We bound the probability of occurrence of these two cases by functions of $n$ whose infinite series over $n$ are convergent. These reasons can be analyzed in three mutually exclusive cases as follows.

Under-realized contexts: In this case, the number of occurrences of the contexts in $\mathcal{Y}_j$ till the $n$-th trial is small. If all the contexts in $\mathcal{Y}_j$ are under-realized, this could lead to the $j$-th arm being under-explored at trial $n$. We use Hoeffding's inequality (see appendix A) to bound the expected number of occurrences of this event.

Under-exploited arm: We assume that no context $y^{(i)} \in \mathcal{Y}_j$ is under-realized and yet the $j$-th arm is not pulled in enough trials. For these contexts the $j$-th arm is optimal, but we still don't pull it often enough. In order to bound the expected number of occurrences of this event, we use the high probability bound from lemma 1.

Dominant confidence bounds: This case corresponds to the event where no context in $\mathcal{Y}_j$ is under-realized and the $j$-th arm is also pulled sufficiently often, yet it is pulled non-optimally. This occurs when the upper confidence bound for some other arm dominates the corresponding DCB($\epsilon$) index.

We prove the theorem by upper bounding the expected number of occurrences of these events over all trials.

5.3.1 Upper Bound on Regret

The upper bounds on the number of pulls of all arms in theorems 1 and 2 lead us to the regret bound in theorem 3. It states a regret upper bound that grows linearly in the number of non-optimal arms and logarithmically in time, i.e. $O(|\bar{O}| \ln n)$.

Theorem 3 (Regret bound for DCB($\epsilon$)).
The expected regret under the DCB($\epsilon$) policy till trial $n$ is at most

$\frac{4(2+\epsilon)\Delta_{\max}}{(\Delta^o)^2} |\bar{O}| \ln n + O(1), \quad (5.15)$

where $\Delta_{\max} = \max_{i,j} \Delta^{(i)}_j$.

Proof of theorem 3. Since the expected number of non-optimal pulls of the optimal arms is bounded (from theorem 2), their regret contribution is also bounded. Note that this regret contribution is still a function of $K$, but a constant in terms of $n$. Using theorem 1 and the definition of $\Delta^o$, the total number of pulls of any non-optimal arm till the $n$-th trial is upper bounded by $\frac{4(2+\epsilon)\ln n}{(\Delta^o)^2} + O(1)$. As the contribution to the expected regret of any non-optimal pull is upper bounded by $\Delta_{\max}$, considering all non-optimal arms proves the theorem.

Remark 1. Note that in order for theorem 2 to hold, DCB($\epsilon$) must be run with a strictly positive $\epsilon$. For $\epsilon = 0$, the confidence bound looks similar to the one from UCB1 [3]. It remains to be seen whether the condition $\epsilon > 0$ is an artifact of our proof technique or a stringent requirement for the constant regret bound in theorem 2.

Remark 2. If the reward function is not dependent on the context, this formulation reduces to the standard MAB problem. In this special case, our DCB($\epsilon$) algorithm reduces to UCB1. Thus, our results generalize the results from [3].

Remark 3. Strategies like deterministic sequencing of exploration and exploitation (DSEE) [4], $\epsilon$-greedy [3] and epoch-greedy [7] separate exploration from exploitation. All the arms are explored either deterministically equally in DSEE or uniformly at random in the others. Using these strategies for our problem leads to logarithmically scaling regret contributions from all the arms. Since the optimal arms are not known a priori, these strategies do not reduce the regret contribution from the optimal arms.

5.3.2 Asymptotic Lower Bound

For the standard MAB problem, Lai and Robbins provided an asymptotic lower bound on the regret in [1].
For every non-optimal arm $j$ and families of arm-distributions parametrized by a single real number, they showed:

$\liminf_{n \to \infty} \frac{\mathbb{E}\{T_j(n)\}}{\ln n} \ge D_j, \quad (5.16)$

where $D_j > 0$ is a function of the Kullback-Leibler divergence between the reward distribution of the $j$-th arm and that of the optimal arm. This result was extended by Burnetas and Katehakis to distributions indexed by multiple parameters in [44]. It is important to note that the lower bound holds only for consistent policies. A policy is said to be consistent if $R(n) = o(n^\alpha)$ for all $\alpha > 0$ as $n \to \infty$. Any consistent policy, therefore, pulls every non-optimal arm at least $\Omega(\ln n)$ times asymptotically.

The contextual bandit problem can be visualized as an interleaved version of several MAB problems, each corresponding to a distinct context. These MAB instances are, however, not independent. When the reward function is known, information gained from an arm pull for any context can be used to learn about the rewards for other contexts as well. In terms of learning, there is no distinction among the pulls of a specific arm for distinct contexts. The burden of exploration, thus, gets shared by all the contexts. Let $N_i(n)$ denote the number of occurrences of $y^{(i)}$ till the $n$-th trial. Under the conditions from [1], for a non-optimal arm $j$ under a consistent policy we get

$\liminf_{n \to \infty} \frac{\mathbb{E}\{T_j(n)\}}{\ln N_i(n)} \ge D_{i,j}, \quad (5.17)$

where $D_{i,j} > 0$ depends on the KL divergence between the conditional reward distribution of the $j$-th arm and that of the contextually optimal arm. By the law of large numbers, $N_i(n) \to p_i n$ as $n \to \infty$. Thus, we write

$\lim_{n \to \infty} \frac{\ln N_i(n)}{\ln n} = 1 + \lim_{n \to \infty} \frac{\ln (N_i(n)/n)}{\ln n} = 1.$

Condition (5.17), therefore, reduces to

$\liminf_{n \to \infty} \frac{\mathbb{E}\{T_j(n)\}}{\ln n} \ge D_{i,j}.$

Note that there is one such condition corresponding to each context and non-optimal arm pair.
Combining these conditions for a non-optimal arm $j$, we get

$\liminf_{n \to \infty} \frac{\mathbb{E}\{T_j(n)\}}{\ln n} \ge \max_{1 \le i \le M} D_{i,j} > 0. \quad (5.18)$

This gives an asymptotic lower bound of $\Omega(\ln n)$ on the number of pulls of a non-optimal arm under a consistent policy. Note that this bound matches the upper bound of $O(\ln n)$ for the DCB($\epsilon$) policy, proving its order optimality.

5.4 Continuous Contexts

In this section, we extend our algorithm for discrete contexts to the continuous case. For simplicity, we assume that $\mathcal{Y} \subseteq \mathbb{R}$. This can be easily extended to general metric spaces similar to [8, 15]. We additionally assume the reward function to be $L$-Lipschitz in the context, which means

$|g(y, x) - g(y', x)| \le L |y - y'|, \quad \forall y, y' \in \mathcal{Y},\ x \in \mathcal{X}. \quad (5.19)$

Our main idea is to partition the support of the context space into intervals of size $\delta$. We learn and store the statistics from the discrete-context algorithm for the center points of each interval. We quantize each incoming context value $y_t$ into $\hat{y}_t$ and use DCB($\epsilon$) as a subroutine to perform successive pulls. Since the context-distribution is not known, we resort to uniform quantization. Algorithm 3 presents our continuous contextual bandit algorithm based on these ideas. Theorem 4 states its regret upper bound in terms of the time horizon $T$.

Algorithm 3 CCB($\epsilon, \delta$)
1: Parameters: $\epsilon, \delta > 0$.
2: Partition $\mathcal{Y}$ into intervals of size $\delta$.
3: Set up a DCB($\epsilon$) instance with the set $\hat{\mathcal{Y}}$ containing the center points of each partition as the context-set.
4: for $t = 1$ to $T$ do
5:   Quantize $y_t$ into $\hat{y}_t$;
6:   Feed $\hat{y}_t$ as the context to DCB($\epsilon$) and pull the arm it recommends;
7:   Update the internal statistics of DCB($\epsilon$);
8: end for

Theorem 4 (Regret bound for CCB($\epsilon, \delta$)). The expected regret under the CCB($\epsilon, \delta$) policy is upper bounded by $\delta L T + O(\log T)$, where $T$ is the time horizon.

Proof. Let the instance of the DCB($\epsilon$) algorithm working with the quantized contexts $\hat{y}_t$ be $A$ and its expected regret be $R_A(T)$. Similarly, let $B$ denote the instance of the CCB($\epsilon, \delta$) algorithm and $R_B(T)$ its expected regret.
The Bayes optimal policies on $\mathcal{Y}$ and $\hat{\mathcal{Y}}$ are denoted by $h^*$ and $\hat{h}^*$, respectively. It is important to note that $\hat{h}^*(\hat{y}) = h^*(\hat{y})$ for all $\hat{y} \in \hat{\mathcal{Y}}$, and that $A(t) = B(t)$ since $B$ follows the recommendations of $A$. The main difference is the input context for $A$ and $B$, which changes the optimal rewards. The regret of $A$, in terms of these notations, is

$R_A(T) = \sum_{t=1}^{T} \mathbb{E}\left[ g(\hat{y}_t, x_{\hat{h}^*(t),t}) - g(\hat{y}_t, x_{A(t),t}) \right]. \quad (5.20)$

Thus, the regret of $B$ is described as

$R_B(T) = \sum_{t=1}^{T} \mathbb{E}\left[ g(y_t, x_{h^*(t),t}) - g(y_t, x_{B(t),t}) \right]$
$= \sum_{t=1}^{T} \mathbb{E}\left[ g(y_t, x_{h^*(t),t}) - g(\hat{y}_t, x_{h^*(t),t}) \right] + \sum_{t=1}^{T} \mathbb{E}\left[ g(\hat{y}_t, x_{h^*(t),t}) - g(\hat{y}_t, x_{B(t),t}) \right] + \sum_{t=1}^{T} \mathbb{E}\left[ g(\hat{y}_t, x_{B(t),t}) - g(y_t, x_{B(t),t}) \right]$
$\le \sum_{t=1}^{T} \mathbb{E}\left[ g(\hat{y}_t, x_{h^*(t),t}) - g(\hat{y}_t, x_{B(t),t}) \right] + \sum_{t=1}^{T} \frac{L\delta}{2} + \sum_{t=1}^{T} \frac{L\delta}{2} \qquad \text{($L$-Lipschitz)}$
$\le \delta L T + \sum_{t=1}^{T} \mathbb{E}\left[ g(\hat{y}_t, x_{\hat{h}^*(t),t}) - g(\hat{y}_t, x_{A(t),t}) \right]$
$= \delta L T + R_A(T). \quad (5.21)$

From theorem 3, we know that $R_A(T)$ is at most $O(\log T)$. Substituting this in (5.21) yields the result.

Note that the choice of $\delta$ is dependent on the time horizon $T$. The regret upper bound is, therefore, not linear in $T$ as it might appear from theorem 4. In the following subsections, we discuss how $\delta$ can be tuned to $T$ in order to obtain the desired storage and regret guarantees. Hereafter, we will use $\delta_T$ to denote the tuned parameter $\delta$.

5.4.1 Known Time Horizon

The CCB($\epsilon, \delta_T$) regret bound of $O(\delta_T L T)$ is largely controlled by the parameter $\delta_T$. Apart from the regret, another important concern for online learning algorithms is the required storage. Since there are $O(1/\delta_T)$ context intervals, the storage required is $O(K/\delta_T)$. This manifests a peculiar storage vs regret trade-off: as $\delta_T$ increases, the storage decreases, while the regret increases linearly. This trade-off arises due to the structure of our contextual bandit setting, which knows the context-reward mapping function. If the time horizon is fixed and known in advance, $\delta_T$ can be tuned based on the performance requirements.
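The uniform quantization step used by CCB($\epsilon, \delta$) can be sketched as follows (the function name and the grid origin `y_min` are our illustrative choices):

```python
import math

def quantize(y, delta, y_min=0.0):
    """Map a continuous context y to the center of its size-delta interval,
    on a uniform grid starting at y_min."""
    k = math.floor((y - y_min) / delta)
    return y_min + (k + 0.5) * delta
```

CCB($\epsilon, \delta$) feeds $\hat{y}_t = $ `quantize(y_t, delta)` to the underlying DCB($\epsilon$) instance; the quantization error $|y_t - \hat{y}_t| \le \delta/2$ is exactly what produces the $L\delta/2$ terms in the proof of theorem 4.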
Note that the regret bounds of CCB($\epsilon, \delta$) are only possible due to the joint learning for all context intervals. Multi-UCB can also be extended to handle continuous contexts by using this quantization technique. This essentially generalizes the Query-ad-clustering algorithm from [8] with a tunable $\delta$. A similar algorithm employing adaptive quantization intervals appears in [15]. Both of these algorithms cannot exploit the reward function knowledge, and they thus have to perform independent exploration of the arms for different context intervals. As $\delta$ decreases, the number of context intervals increases, which also increases the number of trials spent in exploring different arms. These algorithms achieve minimum regret bounds of $O(T^{2/3+\epsilon})$ for any $\epsilon > 0$ and $O(T^{2/3} \log T)$, respectively. Furthermore, a lower bound of $\Omega(T^{2/3})$ on all continuous contextual bandit algorithms is proved in [8]. Since we assume the reward function knowledge, the agent can obtain sub-linear regret bounds as small as $\Theta(\log T)$ by appropriately choosing $\delta_T$ in CCB($\epsilon, \delta$). This is a substantial improvement over the existing contextual bandit algorithms. Exploiting the knowledge of the reward function, therefore, helps in drastically reducing the regret for the continuous contextual bandits.

5.4.2 Unknown Time Horizon

Even when the time horizon is unknown, similar performance can be obtained using the so-called doubling trick from [8, 15]. It converts a bandit algorithm with a known time horizon into one that runs without that knowledge with essentially the same performance guarantees. Let $B_T$ denote the CCB($\epsilon, \delta_T$) instance for a fixed horizon $T$ with $\delta_T$ tuned for regret bounds of $O(L T^\gamma)$ for $\gamma \in [0, 1]$. The new algorithm $C$ runs in phases $m = 1, 2, \dots$ of $2^m$ trials each, such that each phase $m$ runs a new instance of $B_{2^m}$. The following recurrence relationship relates the regret of $C$ to that of $B_T$:

$R_C(T) = R_C(T/2) + R_{B_{T/2}}(T/2). \quad (5.22)$

Hence we get a regret bound of $O(L T^\gamma \log T)$, while using storage of the same order $O(K T^{1-\gamma})$.
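The doubling trick can be sketched as follows; the algorithm factory `make_algo` and the phase bookkeeping are our illustrative assumptions, standing in for "start a fresh fixed-horizon CCB instance tuned for horizon $2^m$ in each phase".

```python
def run_with_doubling(make_algo, total_T):
    """Doubling trick: run fresh fixed-horizon instances on phases of
    lengths 2, 4, 8, ... until total_T trials are consumed."""
    t, m, history = 0, 1, []
    while t < total_T:
        phase_len = min(2 ** m, total_T - t)
        algo = make_algo(horizon=2 ** m)   # instance tuned for horizon 2^m
        for _ in range(phase_len):
            history.append(algo.step())    # one bandit trial
        t += phase_len
        m += 1
    return history
```

Each phase discards the previous instance's statistics, which is why reusing the learned values across phases (as noted in the text) could reduce the regret further.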
Note that during the execution of $C$, previously learned values are discarded in each phase and a fresh instance of $B$ is run. Using those stored values for the next phases may help in decreasing the regret further. The system designer can analyze the regret-storage trade-off and tune $\delta_T$ based on the hardware specifications and performance requirements. This empowers the designer with more control over the algorithmic implementation than any of the existing contextual bandit algorithms.

5.5 Numerical Simulation Results

In this section, we present the results of numerical simulations for the channel selection problem from 1.2.1. Consider the case of $M = 4$ contexts uniformly distributed over the set $\mathcal{Y} = \{1, 2, 3, 4\}$. Let there be $K = 7$ arms whose arm-values are scaled Bernoulli random variables. For every $1 \le j \le 7$, the arm values are distributed as $\Pr\{X_j = j\} = \frac{8-j}{10}$, $\Pr\{X_j = 0\} = \frac{2+j}{10}$. Since the genie knows these distributions, it figures out the optimal arms for all the contexts based on the expected reward matrix $(\mu_{i,j})$ given by

$(\mu_{i,j}) = \begin{pmatrix} \boxed{0.7} & 0.6 & 0.5 & 0.4 & 0.3 & 0.2 & 0.1 \\ 0.7 & \boxed{1.2} & 1.0 & 0.8 & 0.6 & 0.4 & 0.2 \\ 0.7 & 1.2 & \boxed{1.5} & 1.2 & 0.9 & 0.6 & 0.3 \\ 0.7 & 1.2 & 1.5 & \boxed{1.6} & 1.2 & 0.8 & 0.4 \end{pmatrix},$

where the boxed components are the optimal arms. Hence, $h^*(i) = i$ for all $i$. Figure 5.1a plots the cumulative regret divided by the logarithm of the trial index. We observe that $R(n)/\log n$ converges to a constant for the DCB($10^{-2}$) policy, whereas for the UCB1 policy it continues to increase rapidly, since the regret of the UCB1 policy grows linearly for contextual bandits. It must be noted that the logarithmic regret of DCB($\epsilon$) is due to the presence of 3 non-optimal arms. The Multi-UCB regret is also logarithmic in time, albeit with higher constants due to independent exploration of the arms for each context.
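The expected reward matrix above can be reproduced numerically. The reward function $g(y, x) = \min(y, x)$ used below is our assumption for this channel selection example (at most $y$ packets can be delivered over a channel in state $x$), chosen because it matches the published matrix entry by entry.

```python
import numpy as np

def g(y, x):
    """Assumed reward for channel selection: min(context, channel state)."""
    return min(y, x)

contexts = [1, 2, 3, 4]                 # M = 4 contexts
arms = range(1, 8)                      # K = 7 arms, X_j in {0, j}
# E[g(y, X_j)] = min(y, j) * Pr{X_j = j} since g(y, 0) = 0:
mu = np.array([[g(y, j) * (8 - j) / 10 for j in arms] for y in contexts])

h_star = mu.argmax(axis=1) + 1          # contextually optimal arm per context
print(h_star)                           # → [1 2 3 4], i.e. h*(i) = i
```

The computed `mu` matches the matrix in the text, confirming the boxed diagonal of optimal arms.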
If we reduce the number of arms to 4 by removing the non-optimal arms 5, 6 and 7, then the expected reward matrix for the channel selection problem shrinks to

$(\mu_{i,j}) = \begin{pmatrix} \boxed{0.7} & 0.6 & 0.5 & 0.4 \\ 0.7 & \boxed{1.2} & 1.0 & 0.8 \\ 0.7 & 1.2 & \boxed{1.5} & 1.2 \\ 0.7 & 1.2 & 1.5 & \boxed{1.6} \end{pmatrix}.$

The regret performance for this case is plotted in figure 5.1b, which shows that the regret growth stops after some trials for DCB($10^{-2}$). It must be noted that we plot the net regret in this case and not the regret divided by $\log n$. Bounded regret is expected, since all arms are optimal and the logarithmic-in-time regret due to non-optimal arms vanishes, as $|\bar{O}| = 0$. Since Multi-UCB is unable to exploit the reward-function knowledge, its regret still grows logarithmically in time. Such a case demonstrates the significance of our policy, as it reduces the regret drastically by jointly learning the arm-rewards for all the contexts. Table 5.2 compares the regret for UCB1, Multi-UCB and DCB($10^{-2}$) when $T = 10^5$ for both channel selection examples. We see that the regret reduction using DCB($\epsilon$) is substantial, especially when the non-optimal arm set is empty.

Table 5.2: Regret for the channel selection problem when $T = 10^5$

                                   UCB1    Multi-UCB   DCB($10^{-2}$)
$M = 4, K = 7, |\bar{O}| = 3$      17262   4893        1294
$M = 4, K = 4, |\bar{O}| = 0$      15688   3278        28

Table 5.3: Regret for the power-aware channel selection problem when $T = 10^6$

                                   $\delta = T^{-1/3}$   $\delta = T^{-1/2}$   $\delta = T^{-2/3}$
Multi-UCB($\delta$)                15535.8               17583.9               23117.2
CCB($\epsilon,\delta$) - Unknown T 8645.7                6533.0                1476.2
CCB($\epsilon,\delta$) - Known T   3010.5                1163.4                481.8
UCB1                               25201.5

We compare the performance of continuous contextual bandit algorithms for the power-aware channel selection problem in energy harvesting communications from section 1.2.2. We use the same $K = 4$ arms from the previous example. The context is assumed to be uniformly distributed in $(0, 1)$. Note that arms 3 and 4 are the optimal arms in this case. We use $\delta = 1/\sqrt{T}$ with $T = 10^6$ trials for CCB($\epsilon, \delta$) with a known time horizon.
According to the theoretical results from section 5.4, CCB($\epsilon, \delta$) yields regret bounds of $O(\sqrt{T})$ and $O(\sqrt{T}\log T)$ in this setting with and without the knowledge of the horizon, respectively. We also run a Multi-UCB instance with the same quantization intervals. Table 5.3 shows the regret at the horizon for the algorithms tuned with different values of $\delta$. We notice that the regret of CCB($\epsilon, \delta$) decreases as $\delta$ decreases. Even when the time horizon is unknown, CCB($\epsilon, \delta$) outperforms Multi-UCB and UCB1. The numerical regret results for Multi-UCB concur with the theoretical result from [8] that quantization intervals of size $\Theta(T^{-1/3})$ yield minimum regret. Thus we observe that reducing $\delta$ does not necessarily reduce the regret of Multi-UCB as it does for CCB($\epsilon, \delta$).

5.6 Summary

Multi-armed bandits have been previously used to model many networking applications such as channel selection and routing. In many sequential decision making problems in networks, such as channel selection, power allocation and server selection, however, the agent knows some side-information, such as the number of packets to be transmitted, the transmit power available, or features of the job to be scheduled. Motivated by these applications, we have considered stochastic contextual bandit problems in this chapter. In our formulation, the agent also knows the reward functions, i.e. the relationship between the context and the reward.

For the case of discrete and finite context spaces, we have proposed a UCB-like algorithm called DCB($\epsilon$). It exploits the knowledge of the reward functions for updating the reward information for all contexts. We proposed a novel proof technique to bound the number of non-optimal pulls of the optimal arms by a constant. This helped us obtain a regret bound that grows logarithmically in time and linearly in the number of non-optimal arms. This regret is shown to be order optimal by a natural extension of the lower bound result for standard multi-armed bandits.
This regret performance is an improvement over bandit algorithms unaware of the reward functions, where the regret grows linearly with the number of arms. For the proposed DCB($\epsilon$) policy, the non-optimal pulls of the optimal arms are shown to be bounded for $\epsilon > 0$. It remains to be seen if such a guarantee can be provided for $\epsilon = 0$. While proving the regret results for DCB($\epsilon$), we also proved a high probability bound on the number of optimal pulls by UCB1 in the standard MAB setting. This result could independently have potential applications in other bandit problems.

Further contributions involve extending DCB($\epsilon$) to continuous context spaces. We proposed an algorithm called CCB($\epsilon, \delta$) for Lipschitz reward functions that uses DCB($\epsilon$) as a subroutine. The regret analysis of CCB($\epsilon, \delta$) uncovered an interesting regret vs storage trade-off parameterized by $\delta$, where the regret can be reduced by using larger storage. System designers can obtain sub-linear regrets by tuning $\delta$ based on the time horizon and storage requirements. Even when the time horizon is unknown, similar performance is guaranteed by the proposed epoch-based implementation of CCB($\epsilon, \delta$). The joint learning of arm-rewards in CCB($\epsilon, \delta$) yields regret bounds that are unachievable by any of the existing contextual bandit algorithms for continuous context spaces.

In the current setting, we assumed no queuing of the data packets or the harvested energy. When there exist data queues or batteries at the transmitter, the agent can decide to send some of those data-bits or use some of that energy in the current slot and potentially store the rest for later slots. Such a setting with this additional layer of decision making is a non-trivial extension that warrants further investigation.

[Figure 5.1a: Regret/log(t) vs number of trials for UCB1, Multi-UCB and DCB($\epsilon$); (a) $K = 7$ and $|\bar{O}| = 3$.]
[Figure 5.1(b): cumulative regret vs. the number of trials for UCB1, Multi-UCB and DCB($\epsilon$), with $K = 4$ and $O = 0$.]

Figure 5.1: Simulation results for the channel selection problem with $\epsilon = 10^{-2}$.

Chapter 6

Online Learning over MDPs with Known Reward Functions

In this chapter, we consider the problem of online learning over an MDP, where the transition probabilities of the MDP are known, but the reward process is unknown to the agent. This problem is particularly important because of its application in energy harvesting communications. In this work, we formulate the problem, propose multiple online learning algorithms to solve it, and analyze their regret performance over time.

6.1 System Model

We describe the model of the energy harvesting communication system considered in this chapter using a single channel. Consider a time-slotted energy harvesting communication system where the transmitter uses the harvested power for transmission over a channel with stochastically varying channel gains with unknown distribution, as shown in figure 6.1. (The work in this chapter is based on [45] and [46].) We further assume that the channel gain-to-noise ratio is i.i.d. over time. Let $p_t$ denote the harvested power in the $t$-th slot, which is assumed to be i.i.d. over time. Let $Q_t$ denote the stored energy in the transmitter's battery, which has a capacity of $Q_{\max}$. Assume that the transmitter decides to use $q_t$ ($\le Q_t$) amount of power for transmission in the $t$-th slot. We assume a discrete and finite number of power levels for the harvested and transmit powers. The rate obtained during the $t$-th slot is assumed to follow the relationship

$$r_t = B \log_2(1 + q_t X_t), \quad (6.1)$$

where $X_t$ denotes the instantaneous channel gain-to-noise ratio of the channel, which is assumed to be i.i.d. over time, and $B$ is the channel bandwidth.
The battery state gets updated in the next slot as

$$Q_{t+1} = \min\{Q_t - q_t + p_t,\ Q_{\max}\}. \quad (6.2)$$

The goal is to utilize the harvested power and choose a transmit power $q_t$ in each slot sequentially to maximize the expected average rate $\lim_{T\to\infty} \frac{1}{T}\,\mathbb{E}\!\left[\sum_{t=1}^{T} r_t\right]$ obtained over time.

[Figure 6.1: Power allocation over a wireless channel in energy harvesting communications. In each time slot, the transmitter (Tx) with battery state $Q_t$ receives harvested power $p_t$ and chooses transmit power $q_t$ over the wireless channel to the receiver (Rx).]

6.1.1 Problem Formulation

Consider an MDP $\mathcal{M}$ with a finite state space $\mathcal{S}$ and a finite action space $\mathcal{A}$. Let $\mathcal{A}_s \subseteq \mathcal{A}$ denote the set of allowed actions from state $s$. When the agent chooses an action $a_t \in \mathcal{A}_{s_t}$ in state $s_t \in \mathcal{S}$, it receives a random reward $r_t(s_t, a_t)$. Based on the agent's decision, the system undergoes a random transition to a state $s_{t+1}$ according to the transition probability $P(s_{t+1} \mid s_t, a_t)$. In the energy harvesting problem, the battery status $Q_t$ represents the system state $s_t$ and the transmit power $q_t$ represents the action $a_t$ taken at any slot $t$. In this chapter, we consider systems where the random rewards of the various state-action pairs can be modelled as

$$r_t(s_t, a_t) = f(s_t, a_t, X_t), \quad (6.3)$$

where $f$ is a reward function known to the agent and $X_t$ is a random variable internal to the system that is i.i.d. over time. Note that in the energy harvesting communications problem, the reward is the rate obtained in each slot and the reward function is defined in equation (6.1). In this problem, the channel gain-to-noise ratio $X_t$ corresponds to the system's internal random variable. We assume that the distribution of the harvested energy $p_t$ is known to the agent. This implies that the state transition probabilities $P(s_{t+1} \mid s_t, a_t)$ can be inferred by the agent based on the update equation (6.2). A policy is defined as any rule for choosing the actions in successive time slots.
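To make the dynamics concrete, the slot-by-slot evolution given by (6.1) and (6.2) can be simulated in a few lines. The following is an illustrative sketch, not code from the thesis: the bandwidth, power levels, gain distribution and the greedy policy used below are all hypothetical choices.

```python
import math
import random

def rate(q, x, B=1.0):
    """Per-slot rate r_t = B * log2(1 + q_t * X_t), as in (6.1)."""
    return B * math.log2(1.0 + q * x)

def battery_update(Q, q, p, Q_max):
    """Battery evolution Q_{t+1} = min(Q_t - q_t + p_t, Q_max), as in (6.2)."""
    assert 0 <= q <= Q, "transmit power cannot exceed the stored energy"
    return min(Q - q + p, Q_max)

# Illustrative run with a greedy policy q_t = Q_t (spend all stored energy).
random.seed(0)
Q_max, Q, total = 4, 0, 0.0
T = 1000
for t in range(T):
    q = Q                                  # greedy action, for illustration only
    x = random.choice([0.5, 1.0, 2.0])     # hypothetical i.i.d. gain-to-noise ratio X_t
    p = random.randint(0, Q_max)           # hypothetical i.i.d. harvested power p_t
    total += rate(q, x)
    Q = battery_update(Q, q, p, Q_max)
print("average rate of the greedy policy:", total / T)
```

The greedy policy is only a baseline; the chapter's point is precisely that smarter power allocation over this state process can do better.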
The action chosen at time $t$ may, therefore, depend on the history of previous states, actions and rewards. It may even be randomized, such that the action $a \in \mathcal{A}_s$ is chosen from some distribution over the actions. A policy is said to be stationary if the action chosen at time $t$ is only a function of the system state at $t$. This means that a deterministic stationary policy $\pi$ is a mapping from the state $s \in \mathcal{S}$ to its corresponding action $a \in \mathcal{A}_s$. When a stationary policy is played, the sequence of states $\{s_t \mid t = 1, 2, \dots\}$ follows a Markov chain. An MDP is said to be ergodic if every deterministic stationary policy leads to an irreducible and aperiodic Markov chain. According to section V.3 from [47], the average reward can be maximized by an appropriate deterministic stationary policy for an ergodic MDP with finite state space.

In order to arrive at an ergodic MDP for the energy harvesting communications problem, we make the following assumptions:

AS-1 When $Q_t > 0$, the transmit power $q_t > 0$.

AS-2 The distribution of the harvested energy is such that $\Pr\{p_t = p\} > 0$ for all $0 \le p \le Q_{\max}$.

Proposition 1. Under assumptions AS-1 and AS-2, the MDP corresponding to the transmit power selection problem in energy harvesting communications is ergodic.

Proof. Consider any policy $\pi$ and let $P_\pi^{(n)}(s, s')$ be the $n$-step transition probabilities associated with the Markov chain resulting from the policy. First, we prove that $P_\pi^{(1)}(s, s') > 0$ for any $s' \ge s$ as follows. According to the state update equations,

$$s_{t+1} = s_t - \pi(s_t) + p_t. \quad (6.4)$$

The transition probabilities can, therefore, be expressed as

$$P_\pi^{(1)}(s, s') = \Pr\{p = s' - s + \pi(s)\} > 0, \quad (6.5)$$

since $s' \ge s$ and $\pi(s) \ge 0$ for all states. This implies that any state $s' \in \mathcal{S}$ is accessible from any other state $s$ in the resultant Markov chain if $s \le s'$. Now, we prove that $P_\pi^{(1)}(s, s-1) > 0$ for all $s \ge 1$ as follows. From equation (6.5), we observe that

$$P_\pi^{(1)}(s, s-1) = \Pr\{p = \pi(s) - 1\} > 0, \quad (6.6)$$

since $\pi(s) \ge 1$ for all $s \ge 1$.
This implies that every state $s \in \mathcal{S}$ is accessible from the state $s + 1$ in the resultant Markov chain. Equations (6.5) and (6.6) imply that all the state pairs $(s, s+1)$ communicate with each other. Since communication is an equivalence relationship, all the states communicate with each other and the resultant Markov chain is irreducible. Also, equation (6.5) implies that $P_\pi^{(1)}(s, s) > 0$ for all the states, and the Markov chain is, therefore, aperiodic.

Under assumptions AS-1 and AS-2, the MDP under consideration is ergodic and we can restrict ourselves to the set of deterministic stationary policies, which we interchangeably refer to as policies henceforth. Let $\mu(s,a)$ denote the expected reward associated with the state-action pair $(s,a)$, which can be expressed as

$$\mu(s,a) = \mathbb{E}[r(s,a)] = \mathbb{E}_X[f(s,a,X)]. \quad (6.7)$$

For ergodic MDPs, the optimal mean reward $\rho^*$ is independent of the initial state (see [48], section 8.3.3). It is specified as

$$\rho^* = \max_{\pi \in \mathcal{B}} \rho(\pi, \mathbf{M}), \quad (6.8)$$

where $\mathcal{B}$ is the set of all policies, $\mathbf{M}$ is the matrix whose $(s,a)$-th entry is $\mu(s,a)$, and $\rho(\pi, \mathbf{M})$ is the average expected reward per slot using policy $\pi$. We use the optimal mean reward as the benchmark and define the cumulative regret of a learning algorithm after $T$ time-slots as

$$R(T) := T\rho^* - \mathbb{E}\left[\sum_{t=0}^{T-1} r_t\right]. \quad (6.9)$$

This definition of the regret of an online learning algorithm is used in the reinforcement learning literature [40, 41, 42]. With this definition, even the optimal policy incurs a regret when the initial state distribution is not the same as the stationary distribution. We characterize this regret using tools from Markov chain mixing in appendix E.

6.1.2 Optimal Stationary Policy

When the expected rewards $\mu(s,a)$ for all state-action pairs and the transition probabilities $P(s' \mid s, a)$ are known, the problem of determining the optimal policy maximizing the average expected reward over time can be formulated as a linear program (LP) (see e.g. [47], section V.3) shown below:

$$\max \sum_{s\in\mathcal{S}} \sum_{a\in\mathcal{A}_s} \mu(s,a)\,\omega(s,a)$$
$$\text{s.t.} \quad \omega(s,a) \ge 0, \quad \forall s\in\mathcal{S},\ a\in\mathcal{A}_s,$$
$$\sum_{s\in\mathcal{S}} \sum_{a\in\mathcal{A}_s} \omega(s,a) = 1,$$
$$\sum_{a\in\mathcal{A}_{s'}} \omega(s',a) = \sum_{s\in\mathcal{S}} \sum_{a\in\mathcal{A}_s} \omega(s,a)\,P(s' \mid s,a), \quad \forall s'\in\mathcal{S}, \quad (6.10)$$

where $\omega(s,a)$ denotes the stationary distribution of the MDP. The objective function of the LP from equation (6.10) gives the average rate corresponding to the stationary distribution $\omega(s,a)$, while the constraints make sure that this stationary distribution corresponds to a valid policy on the MDP. Such LPs can be solved by standard solvers such as CVXPY [49].

If $\omega^*(s,a)$ is the solution to the LP from (6.10), then for every $s \in \mathcal{S}$, $\omega^*(s,a) > 0$ for only one action $a \in \mathcal{A}_s$. This is due to the fact that the optimal policy is deterministic for ergodic MDPs in average reward maximization problems (see [48], section 8.3.3). Thus, for this problem, $\pi^*(s) = \arg\max_{a\in\mathcal{A}_s} \omega^*(s,a)$. Note that we henceforth drop the action index from the stationary distribution, since the policies under consideration are deterministic and the corresponding action is, therefore, deterministically known. In general, we use $\omega_\pi(s)$ to denote the stationary distribution corresponding to the policy $\pi$. It must be noted that the stationary distribution of any policy is independent of the reward values and only depends on the transition probabilities of the state-action pairs. The expected average reward depends on the stationary distribution as

$$\rho(\pi, \mathbf{M}) = \sum_{s\in\mathcal{S}} \omega_\pi(s)\,\mu(s, \pi(s)). \quad (6.11)$$

In terms of this notation, the LP from (6.10) is equivalent to maximizing $\rho(\pi, \mathbf{M})$ over $\pi \in \mathcal{B}$. Since the matrix $\mathbf{M}$ is unknown, we develop online learning policies for our problem in the next section.

6.2 Online Learning Algorithms

For the power allocation problem under consideration, although the agent knows the state transition probabilities, the mean rewards $\mu(s,a)$ for the state-action pairs are still unknown. Hence, the agent cannot solve the LP from (6.10) to figure out the optimal policy. Any online learning algorithm needs to learn the reward values over time and update its policy adaptively.
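As a concrete illustration of the quantities $\omega_\pi$ and $\rho(\pi, \mathbf{M})$ defined above: for a toy MDP one can bypass an LP solver entirely and enumerate the finitely many deterministic stationary policies, computing each induced chain's stationary distribution by power iteration and evaluating (6.11) directly. The sketch below is a hypothetical stand-in for the LP (6.10), with made-up transition probabilities and rewards; it assumes states are indexed $0,\dots,n-1$ and that each policy's chain settles to a limiting distribution from a uniform start.

```python
import itertools

def stationary_distribution(P, iters=2000):
    """Limiting distribution of the chain with row-stochastic matrix P,
    found by power iteration from a uniform start (adequate for this toy)."""
    n = len(P)
    w = [1.0 / n] * n
    for _ in range(iters):
        w = [sum(w[s] * P[s][s2] for s in range(n)) for s2 in range(n)]
    return w

def average_reward(policy, P_sa, mu):
    """rho(pi, M) = sum_s w_pi(s) * mu(s, pi(s)), as in (6.11)."""
    n = len(policy)
    P_pi = [P_sa[(s, policy[s])] for s in range(n)]
    w = stationary_distribution(P_pi)
    return sum(w[s] * mu[(s, policy[s])] for s in range(n))

def best_policy(states, actions, P_sa, mu):
    """Enumerate all deterministic stationary policies (a brute-force
    stand-in for solving the LP (6.10) on small instances)."""
    best = max(itertools.product(*(actions[s] for s in states)),
               key=lambda pi: average_reward(pi, P_sa, mu))
    return best, average_reward(best, P_sa, mu)

# Toy 2-state battery example with hypothetical numbers: from state 1 the agent
# may hold (a = 0) or transmit (a = 1); transmitting earns reward and may drain.
states = [0, 1]
actions = {0: [0], 1: [0, 1]}            # allowed action sets A_s
P_sa = {(0, 0): [0.5, 0.5],              # P(s' | s, a) as rows over s'
        (1, 0): [0.0, 1.0],
        (1, 1): [0.5, 0.5]}
mu = {(0, 0): 0.0, (1, 0): 0.0, (1, 1): 1.0}
pi_star, rho_star = best_policy(states, actions, P_sa, mu)
print("optimal policy:", pi_star, "mean reward:", round(rho_star, 3))
```

Enumeration is exponential in $|\mathcal{S}|$, which is exactly why the chapter uses the LP formulation instead; the sketch only serves to make $\omega_\pi$ and $\rho(\pi, \mathbf{M})$ tangible.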
One interesting aspect of the problem, however, is that the reward function $f$ from equation (6.3) is known to the agent. Since the reward function under consideration (6.1) is bijective in the channel gain, once the reward is revealed to the agent, it can infer the instantaneous realization of the random variable $X$. This can be used to predict the rewards that would have been obtained at that time for the other state-action pairs using the knowledge of the function. In our online learning framework, we store the average values $\hat\mu(s,a)$ of these inferred rewards for all state-action pairs. The idea behind our algorithms is to use the estimated sample-mean values in the optimization problem instead of the unknown $\mu(s,a)$ values in the objective function of the LP from (6.10). Since the $\hat\mu(s,a)$ values get updated after each reward revelation, the agent needs to solve the LP again and again. We propose two online learning algorithms: LPSM (linear program of sample means), where the agent solves the LP at each slot, and Epoch-LPSM, where the LP is solved at fixed pre-defined time slots. Although the agent is unaware of the actual $\mu(s,a)$ values, it learns the statistics $\hat\mu(s,a)$ over time and eventually figures out the optimal policy.

Let $B(s,a) \ge \sup_{x\in\mathcal{X}} f(s,a,x) - \inf_{x\in\mathcal{X}} f(s,a,x)$ denote any upper bound on the maximum possible range of the reward for the state-action pair $(s,a)$ over the support $\mathcal{X}$ of the random variable $X$. We use the following notations in the analysis of our algorithms: $B_0 := \max_{(s,a)} B(s,a)$ and $\Delta_1 := \rho^* - \max_{\pi \ne \pi^*} \rho(\pi, \mathbf{M})$. The total numbers of states and actions are specified as $S := |\mathcal{S}|$ and $A := |\mathcal{A}|$, respectively. Also, $\hat{\boldsymbol\mu}_t$ denotes the matrix containing the entries $\hat\mu_t(s,a)$ at time $t$.

6.2.1 LPSM

The LPSM algorithm presented in algorithm 4 solves the LP at each time-step and updates its policy based on the solution obtained. It stores only one value per state-action pair. Its required storage is, therefore, $O(SA)$.
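The key trick described above, inverting the observed reward to recover $X_t$ and then updating the sample mean of every action, can be sketched for the rate function (6.1), which is bijective in the gain for $q > 0$. The bandwidth and power levels below are illustrative values of ours, not parameters from the thesis.

```python
import math

B = 1.0                             # channel bandwidth (illustrative value)
powers = [1, 2, 4]                  # hypothetical transmit power levels
mu_hat = {q: 0.0 for q in powers}   # sample-mean rate estimates, one per power level
n = 0                               # number of reward observations so far

def infer_gain(r, q):
    """Invert the bijective rate function (6.1): r = B*log2(1 + q*x)  =>  x."""
    return (2.0 ** (r / B) - 1.0) / q

def update_all(r, q_used):
    """A single observed reward updates the estimates of *all* power levels."""
    global n
    x = infer_gain(r, q_used)
    for q in powers:
        r_q = B * math.log2(1.0 + q * x)   # reward power level q would have earned
        mu_hat[q] = (n * mu_hat[q] + r_q) / (n + 1)
    n += 1

# Transmit at q = 2 with true gain x = 0.5; every estimate moves, not just q = 2's.
update_all(r=B * math.log2(1.0 + 2 * 0.5), q_used=2)
print({q: round(v, 3) for q, v in mu_hat.items()})
```

This is what lets LPSM avoid any explicit exploration: every slot is informative about every action.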
In theorem 5, we derive an upper bound on the expected number of slots where the LP fails to find the optimal solution during the execution of LPSM. We use this result to bound the total expected regret of LPSM in theorem 6. These results guarantee that the regret is always upper bounded by a constant. Note that, for ease of exposition, we assume that time starts at $t = 0$. This simplifies the analysis and has no impact on the regret bounds.

Theorem 5. The expected number of slots where non-optimal policies are played by LPSM is upper bounded by

$$1 + \frac{(1+A)S\, e^{-\frac{1}{2}\left(\frac{\Delta_1}{B_0}\right)^2}}{1 - e^{-\frac{1}{2}\left(\frac{\Delta_1}{B_0}\right)^2}}. \quad (6.12)$$

Algorithm 4 LPSM
1: Initialization: For all $(s,a)$ pairs, $\hat\mu(s,a) = 0$.
2: for $n = 0$ do
3: Given the state $s_0$, choose any valid action;
4: Update all $(s,a)$ pairs: $\hat\mu(s,a) = f(s,a,x_0)$;
5: end for
6: // MAIN LOOP
7: while 1 do
8: $n = n + 1$;
9: Solve the LP from (6.10) using $\hat\mu(s,a)$ in place of the unknown $\mu(s,a)$;
10: In terms of the LP solution $\omega^{(n)}$, define $\pi_n(s) = \arg\max_{a\in\mathcal{A}_s} \omega^{(n)}(s,a),\ \forall s\in\mathcal{S}$;
11: Given the state $s_n$, select the action $\pi_n(s_n)$;
12: Update for all valid $(s,a)$ pairs: $\hat\mu(s,a) \leftarrow \frac{n\,\hat\mu(s,a) + f(s,a,x_n)}{n+1}$;
13: end while

Proof. Let $\pi_t$ denote the policy obtained by LPSM at time $t$ and let $\mathbb{I}(z)$ be the indicator function, defined to be 1 when the predicate $z$ is true and 0 otherwise. The number of slots where non-optimal policies are played can be expressed as

$$N_1 = 1 + \sum_{t=1}^{\infty} \mathbb{I}\{\pi_t \ne \pi^*\} \le 1 + \sum_{t=1}^{\infty} \mathbb{I}\{\rho(\pi^*, \hat{\boldsymbol\mu}_t) \le \rho(\pi_t, \hat{\boldsymbol\mu}_t)\}. \quad (6.13)$$

We observe that $\rho(\pi^*, \hat{\boldsymbol\mu}_t) \le \rho(\pi_t, \hat{\boldsymbol\mu}_t)$ implies that at least one of the following inequalities must be true:

$$\rho(\pi^*, \hat{\boldsymbol\mu}_t) \le \rho(\pi^*, \mathbf{M}) - \frac{\Delta_1}{2}, \quad (6.14)$$

$$\rho(\pi_t, \hat{\boldsymbol\mu}_t) \ge \rho(\pi_t, \mathbf{M}) + \frac{\Delta_1}{2}, \quad (6.15)$$

$$\rho(\pi^*, \mathbf{M}) < \rho(\pi_t, \mathbf{M}) + \Delta_1. \quad (6.16)$$

Note that the event from condition (6.16) can never occur, because of the definition of $\Delta_1$. Hence we upper bound the probabilities of the other two events. For the first event, from condition (6.14), we get

$$\begin{aligned}
\Pr\left\{\rho(\pi^*, \hat{\boldsymbol\mu}_t) \le \rho(\pi^*, \mathbf{M}) - \frac{\Delta_1}{2}\right\}
&= \Pr\left\{\sum_{s\in\mathcal{S}} \omega_{\pi^*}(s)\,\hat\mu_t(s, \pi^*(s)) \le \sum_{s\in\mathcal{S}} \omega_{\pi^*}(s)\,\mu(s, \pi^*(s)) - \frac{\Delta_1}{2}\right\} \\
&\le \Pr\left\{\text{for at least one state } s\in\mathcal{S}:\ \omega_{\pi^*}(s)\,\hat\mu_t(s, \pi^*(s)) \le \omega_{\pi^*}(s)\,\mu(s, \pi^*(s)) - \omega_{\pi^*}(s)\frac{\Delta_1}{2}\right\} \\
&\le \sum_{s\in\mathcal{S}} \Pr\left\{\hat\mu_t(s, \pi^*(s)) \le \mu(s, \pi^*(s)) - \frac{\Delta_1}{2}\right\} \\
&\overset{(a)}{\le} \sum_{s\in\mathcal{S}} e^{-2\left(\frac{\Delta_1}{2B(s,\pi^*(s))}\right)^2 t} \le S e^{-2\left(\frac{\Delta_1}{2B_0}\right)^2 t}, \quad (6.17)
\end{aligned}$$

where $(a)$ holds due to Hoeffding's inequality from lemma 3 (see appendix A). Similarly, for the second event, from condition (6.15), we get

$$\begin{aligned}
\Pr\left\{\rho(\pi_t, \hat{\boldsymbol\mu}_t) \ge \rho(\pi_t, \mathbf{M}) + \frac{\Delta_1}{2}\right\}
&\le \Pr\left\{\text{for at least one state-action pair } (s,a):\ \omega_{\pi_t}(s,a)\,\hat\mu_t(s,a) \ge \omega_{\pi_t}(s,a)\,\mu(s,a) + \omega_{\pi_t}(s,a)\frac{\Delta_1}{2}\right\} \\
&\le \sum_{s\in\mathcal{S}} \sum_{a\in\mathcal{A}_s} \Pr\left\{\hat\mu_t(s,a) \ge \mu(s,a) + \frac{\Delta_1}{2}\right\} \\
&\overset{(b)}{\le} \sum_{s\in\mathcal{S}} \sum_{a\in\mathcal{A}_s} e^{-2\left(\frac{\Delta_1}{2B(s,a)}\right)^2 t} \le SA\, e^{-2\left(\frac{\Delta_1}{2B_0}\right)^2 t}, \quad (6.18)
\end{aligned}$$

where $(b)$ holds due to Hoeffding's inequality from lemma 3 in appendix A. The expected number of slots with non-optimal policies, from equation (6.13), can therefore be bounded as

$$\begin{aligned}
\mathbb{E}[N_1] &\le 1 + \sum_{t=1}^{\infty} \Pr\{\rho(\pi^*, \hat{\boldsymbol\mu}_t) \le \rho(\pi_t, \hat{\boldsymbol\mu}_t)\} \\
&\le 1 + \sum_{t=1}^{\infty} \left[\Pr\left\{\rho(\pi^*, \hat{\boldsymbol\mu}_t) \le \rho(\pi^*, \mathbf{M}) - \frac{\Delta_1}{2}\right\} + \Pr\left\{\rho(\pi_t, \hat{\boldsymbol\mu}_t) \ge \rho(\pi_t, \mathbf{M}) + \frac{\Delta_1}{2}\right\}\right] \\
&\le 1 + (1+A)S \sum_{t=1}^{\infty} e^{-\frac{1}{2}\left(\frac{\Delta_1}{B_0}\right)^2 t} = 1 + \frac{(1+A)S\, e^{-\frac{1}{2}\left(\frac{\Delta_1}{B_0}\right)^2}}{1 - e^{-\frac{1}{2}\left(\frac{\Delta_1}{B_0}\right)^2}}. \quad (6.19)
\end{aligned}$$

It is important to note that even if the optimal policy is found by the LP and played during certain slots, this does not mean that the regret contribution of those slots is zero. According to the definition of regret from equation (6.9), the regret contribution of a given slot is zero if and only if the optimal policy is played and the corresponding Markov chain is at its stationary distribution. In appendix E, we introduce tools to analyze the mixing of Markov chains and characterize this regret contribution in theorem 17. These results are used to upper bound the LPSM regret in the next theorem.

Theorem 6. The total expected regret of the LPSM algorithm is upper bounded by

$$\left(1 + \frac{(1+A)S\, e^{-\frac{1}{2}\left(\frac{\Delta_1}{B_0}\right)^2}}{1 - e^{-\frac{1}{2}\left(\frac{\Delta_1}{B_0}\right)^2}}\right)\left(\frac{\mu_{\max}}{1-\beta} + \Delta_{\max}\right), \quad (6.20)$$

where $\beta = \max_{s,s'\in\mathcal{S}} \|P_{\pi^*}(s', \cdot) - P_{\pi^*}(s, \cdot)\|_{TV}$, $P_{\pi^*}$ denotes the transition probability matrix corresponding to the optimal policy, $\mu_{\max} = \max_{s\in\mathcal{S}, a\in\mathcal{A}_s} \mu(s,a)$ and $\Delta_{\max} = \rho^* - \min_{s\in\mathcal{S}, a\in\mathcal{A}_s} \mu(s,a)$.

Proof. The regret of LPSM arises either when non-optimal actions are taken or when optimal actions are taken but the corresponding Markov chain is not at stationarity. For the first source of regret, it is sufficient to analyze the number of instances where the LP fails to find the optimal policy. For the second source, however, we need to analyze the total number of phases where the optimal policy is found in succession. Since only the optimal policy is played in the consecutive slots of such a phase, the phase corresponds to transitions on the Markov chain associated with the optimal policy, and the tools from appendix E can be applied. According to theorem 17, the regret contribution of any such phase is bounded from above by $(1-\beta)^{-1}\mu_{\max}$. As proved in theorem 5, for $t \ge 1$, the expected number of instances of non-optimal policies is upper bounded by $\frac{(1+A)S\, e^{-\frac{1}{2}(\Delta_1/B_0)^2}}{1 - e^{-\frac{1}{2}(\Delta_1/B_0)^2}}$. Since any two optimal phases must be separated by at least one non-optimal slot, the expected number of optimal phases is upper bounded by $1 + \frac{(1+A)S\, e^{-\frac{1}{2}(\Delta_1/B_0)^2}}{1 - e^{-\frac{1}{2}(\Delta_1/B_0)^2}}$. Hence, for $t \ge 1$, the expected regret contribution from the slots following the optimal policy is upper bounded by

$$\left(1 + \frac{(1+A)S\, e^{-\frac{1}{2}\left(\frac{\Delta_1}{B_0}\right)^2}}{1 - e^{-\frac{1}{2}\left(\frac{\Delta_1}{B_0}\right)^2}}\right)\frac{\mu_{\max}}{1-\beta}. \quad (6.21)$$

Note that the maximum regret possible during one slot is $\Delta_{\max}$. Hence, for the slots where non-optimal policies are played, the corresponding expected regret contribution is upper bounded by $\left(1 + \frac{(1+A)S\, e^{-\frac{1}{2}(\Delta_1/B_0)^2}}{1 - e^{-\frac{1}{2}(\Delta_1/B_0)^2}}\right)\Delta_{\max}$. The overall expected regret of the LPSM algorithm is, therefore, bounded from above by expression (6.20).

Remark 4. It must be noted that we call two policies the same if and only if they recommend identical actions for every state. It is, therefore, possible for a non-optimal policy to recommend optimal actions for some of the states.
In the analysis of LPSM, we count all occurrences of non-optimal policies as regret-contributing occurrences in order to upper bound the regret.

Remark 5. Note that the LPSM algorithm presented above works for general reward functions $f$. The rate function in energy harvesting communications, however, does not depend on the state, which is the battery status: it is a function of the transmit power-level and the channel gain only. The LPSM algorithm, therefore, needs to store only one variable for each transmit power level and needs $O(A)$ storage overall. The probability of the event from condition (6.15) is then bounded by the tighter upper bound $A\, e^{-2\left(\frac{\Delta_1}{2B_0}\right)^2 t}$. The regret upper bound from theorem 6 is also tightened to

$$\left(1 + \frac{(S+A)\, e^{-\frac{1}{2}\left(\frac{\Delta_1}{B_0}\right)^2}}{1 - e^{-\frac{1}{2}\left(\frac{\Delta_1}{B_0}\right)^2}}\right)\left(\frac{\mu_{\max}}{1-\beta} + \Delta_{\max}\right). \quad (6.22)$$

For the LPSM algorithm, we can prove a stronger result about the convergence time. Let $Z$ be the random variable corresponding to the first time-slot after which LPSM never fails to find the optimal policy. This means that $Z - 1$ represents the last time-slot where LPSM finds a non-optimal policy. We prove that the expected value of $Z$ is finite, which means that LPSM takes only a finite amount of time in expectation before it starts following the genie. We present this result in theorem 7.

Theorem 7. For the LPSM algorithm, the expected value of the convergence time $Z$ is finite.

Proof. Since $Z - 1$ denotes the index of the last slot where LPSM errs, all the slots from $Z$ onward must have found the optimal policy $\pi^*$.
We use this idea to bound the following probability:

$$\begin{aligned}
\Pr\{Z \le Z_0\} &= \Pr\{\text{LPSM finds the optimal policy in all slots } Z_0, Z_0+1, \dots\} \\
&= 1 - \Pr\{\text{LPSM fails in at least one slot in } \{Z_0, Z_0+1, \dots\}\} \\
&\ge 1 - \sum_{t=Z_0}^{\infty} \Pr\{\text{LPSM fails at } t\} \\
&\ge 1 - \sum_{t=Z_0}^{\infty} (1+A)S\, e^{-2\left(\frac{\Delta_1}{2B_0}\right)^2 t} \quad \text{(from equation (6.19))} \\
&= 1 - \frac{(1+A)S\, e^{-\frac{1}{2}\left(\frac{\Delta_1}{B_0}\right)^2 Z_0}}{1 - e^{-\frac{1}{2}\left(\frac{\Delta_1}{B_0}\right)^2}}. \quad (6.23)
\end{aligned}$$

We, therefore, get the following exponential inequality for $Z_0 \ge 1$:

$$\Pr\{Z > Z_0\} \le \frac{(1+A)S\, e^{-\frac{1}{2}\left(\frac{\Delta_1}{B_0}\right)^2 Z_0}}{1 - e^{-\frac{1}{2}\left(\frac{\Delta_1}{B_0}\right)^2}}. \quad (6.24)$$

The expectation of $Z$ can now be bounded as

$$\begin{aligned}
\mathbb{E}[Z] &= \sum_{Z_0=0}^{\infty} Z_0 \Pr\{Z = Z_0\} = \sum_{Z_0=0}^{\infty} \Pr\{Z > Z_0\} \\
&\le 1 + \sum_{Z_0=1}^{\infty} \frac{(1+A)S\, e^{-\frac{1}{2}\left(\frac{\Delta_1}{B_0}\right)^2 Z_0}}{1 - e^{-\frac{1}{2}\left(\frac{\Delta_1}{B_0}\right)^2}} = 1 + \frac{(1+A)S\, e^{-\frac{1}{2}\left(\frac{\Delta_1}{B_0}\right)^2}}{\left(1 - e^{-\frac{1}{2}\left(\frac{\Delta_1}{B_0}\right)^2}\right)^2}. \quad (6.25)
\end{aligned}$$

The expected value of $Z$ is, therefore, finite.

Remark 6. Note that the result about finite expected convergence time is not directly implied by the constant regret result. The proof of theorem 7 relies on the exponential nature of the concentration bound, whereas it is possible to prove constant expected regret even for weaker concentration bounds [46].

6.2.2 Epoch-LPSM

The main drawback of the LPSM algorithm is that it is computationally heavy, as it solves one LP per time-slot. In order to reduce the computation requirements, we propose the Epoch-LPSM algorithm in algorithm 5. Epoch-LPSM solves the LP in each of the first $n_0$ slots, divides the later time into several epochs and solves the LPs only at the beginning of each epoch. The policy obtained by solving the LP at the beginning of an epoch is followed for the remaining slots in that epoch. We increase the lengths of these epochs exponentially as time progresses and our confidence in the obtained policy increases. In spite of solving a much smaller number of LPs, the regret of Epoch-LPSM is still bounded by a constant. First, we obtain an upper bound on the number of slots where the algorithm plays non-optimal policies in theorem 8, and later use this result to bound the regret in theorem 9.

Algorithm 5 Epoch-LPSM
1: Parameters: $n_0 \in \mathbb{N}$ and $\gamma \in \{2, 3, \dots\}$.
2: Initialization: $k = 0$, $n = 0$ and for all $(s,a)$ pairs, $\hat\mu(s,a) = 0$.
3: while $n < n_0$ do
4: Follow the LPSM algorithm to decide the action $a_n$, update the variables accordingly and increment $n$;
5: end while
6: while $n \ge n_0$ do
7: $n = n + 1$;
8: if $n = n_0\gamma^k$ then
9: $k = k + 1$;
10: Solve the LP from (6.10) with $\hat\mu(s,a)$ in place of the unknown $\mu(s,a)$;
11: In terms of the LP solution $\omega^{(n)}$, define $\pi^{(k)}(s) = \arg\max_{a\in\mathcal{A}_s} \omega^{(n)}(s,a),\ \forall s\in\mathcal{S}$;
12: end if
13: Given the state $s_n$, select the action $\pi^{(k)}(s_n)$;
14: Update for all $(s,a)$ pairs: $\hat\mu(s,a) \leftarrow \frac{n\,\hat\mu(s,a) + f(s,a,x_n)}{n+1}$;
15: end while

Theorem 8. The expected number of slots where non-optimal policies are played by Epoch-LPSM is upper bounded by

$$1 + (1+A)S\left(\frac{1 - e^{-\frac{1}{2}\left(\frac{\Delta_1}{B_0}\right)^2 n_0}}{e^{\frac{1}{2}\left(\frac{\Delta_1}{B_0}\right)^2} - 1}\right) + (\gamma - 1)(1+A)S\, n_0\, \lambda_{n_0,\gamma}, \quad (6.26)$$

where $\lambda_{n_0,\gamma} = \sum_{k=0}^{\infty} \gamma^k e^{-\frac{1}{2}\left(\frac{\Delta_1}{B_0}\right)^2 n_0 \gamma^k} < \infty$.

Proof. Note that epoch $k$ starts at $t = n_0\gamma^{k-1}$ and ends at $t = n_0\gamma^k - 1$. The policy obtained at $t = n_0\gamma^{k-1}$ by solving the LP is, therefore, played for $(\gamma^k - \gamma^{k-1})n_0$ slots. Let us analyze the probability that the policy played during epoch $k$ is not optimal. Let that policy be $\pi^{(k)}$. We have

$$\Pr\{\pi^{(k)} \ne \pi^*\} = \Pr\{\pi_{n_0\gamma^{k-1}} \ne \pi^*\} \overset{(a)}{\le} (1+A)S\, e^{-2\left(\frac{\Delta_1}{2B_0}\right)^2 n_0\gamma^{k-1}}, \quad (6.27)$$

where $(a)$ holds for all $k \ge 1$ as shown in the proof of theorem 5. When the LP fails to obtain the optimal policy at the beginning of an epoch, all the slots in that epoch play the obtained non-optimal policy. Let $N_2$ denote the total number of such slots.
We get

$$N_2 = 1 + \sum_{t=1}^{n_0-1} \mathbb{I}\{\pi_t \ne \pi^*\} + \sum_{k=1}^{\infty} n_0(\gamma^k - \gamma^{k-1})\,\mathbb{I}\{\pi^{(k)} \ne \pi^*\}.$$

In expectation, we get

$$\begin{aligned}
\mathbb{E}[N_2] &= 1 + \sum_{t=1}^{n_0-1} \Pr\{\pi_t \ne \pi^*\} + \sum_{k=1}^{\infty} n_0(\gamma^k - \gamma^{k-1}) \Pr\{\pi^{(k)} \ne \pi^*\} \\
&\le 1 + \sum_{t=1}^{n_0-1} (1+A)S\, e^{-2\left(\frac{\Delta_1}{2B_0}\right)^2 t} + \sum_{k=1}^{\infty} n_0(\gamma^k - \gamma^{k-1})(1+A)S\, e^{-2\left(\frac{\Delta_1}{2B_0}\right)^2 n_0\gamma^{k-1}} \\
&\le 1 + (1+A)S \sum_{t=1}^{n_0-1} e^{-\frac{1}{2}\left(\frac{\Delta_1}{B_0}\right)^2 t} + n_0(\gamma-1)(1+A)S \sum_{k=0}^{\infty} \gamma^k e^{-\frac{1}{2}\left(\frac{\Delta_1}{B_0}\right)^2 n_0\gamma^k} \\
&\le 1 + (1+A)S\left(\frac{1 - e^{-\frac{1}{2}\left(\frac{\Delta_1}{B_0}\right)^2 n_0}}{e^{\frac{1}{2}\left(\frac{\Delta_1}{B_0}\right)^2} - 1}\right) + n_0(\gamma-1)(1+A)S\, \lambda_{n_0,\gamma}, \quad (6.28)
\end{aligned}$$

where $\lambda_{n_0,\gamma} < \infty$ holds by the ratio test for series convergence, since

$$\lim_{k\to\infty} \frac{\gamma^{k+1} e^{-\frac{1}{2}\left(\frac{\Delta_1}{B_0}\right)^2 n_0\gamma^{k+1}}}{\gamma^k e^{-\frac{1}{2}\left(\frac{\Delta_1}{B_0}\right)^2 n_0\gamma^k}} = \lim_{k\to\infty} \gamma\, e^{-\frac{1}{2}\left(\frac{\Delta_1}{B_0}\right)^2 n_0(\gamma-1)\gamma^k} = 0.$$

Now we analyze the regret of Epoch-LPSM in the following theorem.

Theorem 9. The total expected regret of the Epoch-LPSM algorithm is upper bounded by

$$\left(1 + (1+A)S\left(\frac{1 - e^{-\frac{1}{2}\left(\frac{\Delta_1}{B_0}\right)^2 n_0}}{e^{\frac{1}{2}\left(\frac{\Delta_1}{B_0}\right)^2} - 1}\right)\right)\left(\frac{\mu_{\max}}{1-\beta} + \Delta_{\max}\right) + (\gamma-1)(1+A)S\, n_0\, \lambda_{n_0,\gamma}\, \Delta_{\max} + \left((1+A)S \sum_{k=1}^{\infty} e^{-2\left(\frac{\Delta_1}{2B_0}\right)^2 n_0\gamma^{k-1}}\right)\frac{\mu_{\max}}{1-\beta}. \quad (6.29)$$

Proof. First we analyze the regret contribution from the first $n_0$ slots. As argued in the proof of theorem 6, the regret contribution of the first $n_0$ slots is upper bounded by

$$\left(1 + (1+A)S\left(\frac{1 - e^{-\frac{1}{2}\left(\frac{\Delta_1}{B_0}\right)^2 n_0}}{e^{\frac{1}{2}\left(\frac{\Delta_1}{B_0}\right)^2} - 1}\right)\right)\left(\frac{\mu_{\max}}{1-\beta} + \Delta_{\max}\right). \quad (6.30)$$

Now we analyze the number of phases where the optimal policy is played in successive slots for $t \ge n_0$. Note that any two optimal phases are separated by at least one non-optimal epoch. We bound the expected number of non-optimal epochs $N_3$ as

$$\mathbb{E}[N_3] = \sum_{k=1}^{\infty} \Pr\{\pi^{(k)} \ne \pi^*\} \le (1+A)S \sum_{k=1}^{\infty} e^{-2\left(\frac{\Delta_1}{2B_0}\right)^2 n_0\gamma^{k-1}} < \infty \quad \text{(from equation (6.27))},$$

where the series $\sum_{k=1}^{\infty} e^{-2\left(\frac{\Delta_1}{2B_0}\right)^2 n_0\gamma^{k-1}}$ converges by the ratio test, since

$$\lim_{k\to\infty} \frac{e^{-\frac{1}{2}\left(\frac{\Delta_1}{B_0}\right)^2 n_0\gamma^{k+1}}}{e^{-\frac{1}{2}\left(\frac{\Delta_1}{B_0}\right)^2 n_0\gamma^k}} = \lim_{k\to\infty} e^{-\frac{1}{2}\left(\frac{\Delta_1}{B_0}\right)^2 n_0(\gamma-1)\gamma^k} = 0.$$

Hence, for $t \ge n_0$, there can be at most $\mathbb{E}[N_3]$ optimal phases in expectation. Since each of these phases can contribute a maximum of $(1-\beta)^{-1}\mu_{\max}$ regret in expectation, the total regret from slots with optimal policies for $t \ge n_0$ is upper bounded by $\mathbb{E}[N_3]\frac{\mu_{\max}}{1-\beta}$.
Also, the expected number of slots where a non-optimal policy is played for $t \ge n_0$ is bounded by $(\gamma-1)(1+A)S\, n_0\, \lambda_{n_0,\gamma}$, as derived in the proof of theorem 8. The regret contribution of these slots is, therefore, bounded by $(\gamma-1)(1+A)S\, n_0\, \lambda_{n_0,\gamma}\, \Delta_{\max}$, since the maximum expected regret incurred during any slot is $\Delta_{\max}$. The total expected regret of Epoch-LPSM is, therefore, upper bounded by expression (6.29).

Remark 7. Similar to the LPSM algorithm, the algorithm presented above works for general reward functions $f$. Since the rate function in energy harvesting communications does not depend on the state, the Epoch-LPSM algorithm needs to store only one variable per transmit power level and uses $O(A)$ storage overall. The regret upper bound from theorem 9 is also tightened to

$$\left(1 + (S+A)\left(\frac{1 - e^{-\frac{1}{2}\left(\frac{\Delta_1}{B_0}\right)^2 n_0}}{e^{\frac{1}{2}\left(\frac{\Delta_1}{B_0}\right)^2} - 1}\right)\right)\left(\frac{\mu_{\max}}{1-\beta} + \Delta_{\max}\right) + (\gamma-1)(S+A)\, n_0\, \lambda_{n_0,\gamma}\, \Delta_{\max} + \left((S+A) \sum_{k=1}^{\infty} e^{-2\left(\frac{\Delta_1}{2B_0}\right)^2 n_0\gamma^{k-1}}\right)\frac{\mu_{\max}}{1-\beta}. \quad (6.31)$$

6.2.3 Regret vs. Computation Tradeoff

The LPSM algorithm solves $T$ LPs in time $T$, whereas the Epoch-LPSM algorithm solves $n_0$ LPs in the initial $n_0$ slots and $\lceil \log_\gamma(T/n_0) \rceil$ LPs once the time gets divided into epochs. This drastic reduction in the required computation comes at the cost of an increase in the regret of Epoch-LPSM. It must, however, be noted that both algorithms have constant-bounded regrets. Also, increasing the value of the parameter $\gamma$ in Epoch-LPSM reduces the number of LPs to be solved over time by increasing the epoch lengths. Any non-optimal policy found by the LP, therefore, gets played over longer epochs, increasing the overall regret. Increasing $n_0$ increases the total number of LPs solved by the algorithm while reducing the expected regret. The system designer can analyze the regret bounds of these two algorithms and their own performance requirements to choose the parameters $n_0$ and $\gamma$ for the system.
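The LP counts behind this tradeoff are easy to tabulate. The helper below is an illustrative sketch with names of our choosing, not code from the thesis: LPSM solves one LP per slot, while Epoch-LPSM solves $n_0$ LPs up front and then roughly $\lceil \log_\gamma(T/n_0) \rceil$ more, one per epoch.

```python
import math

def lps_solved_lpsm(T):
    """LPSM solves one LP in every slot."""
    return T

def lps_solved_epoch_lpsm(T, n0, gamma):
    """Epoch-LPSM: one LP in each of the first n0 slots, then one per epoch.
    Epoch k begins at slot n0 * gamma^(k-1), so about ceil(log_gamma(T / n0))
    epochs, and hence that many more LPs, fit in the remaining horizon."""
    if T <= n0:
        return T
    return n0 + math.ceil(math.log(T / n0, gamma))

# Illustrative comparison with n0 = 100 and gamma = 2.
for T in (10**3, 10**5, 10**7):
    print(T, lps_solved_lpsm(T), lps_solved_epoch_lpsm(T, n0=100, gamma=2))
```

Even at a horizon of $10^7$ slots, the epoch-based schedule solves on the order of a hundred LPs instead of ten million, which is the computational saving the section describes.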
We analyze the effect of varying these parameters on the regret performance of Epoch-LPSM through numerical simulations in section 6.5.

6.3 Multi-Channel Communication

In this section, we extend the energy harvesting communications problem to a system where there exists a set of parallel channels with unknown statistics, one of which is to be selected for communication in each slot. The goal is to utilize the battery at the transmitter and maximize the amount of data transmitted over time. Given a time-slotted system, we assume that the agent is aware of the distribution of the energy arrivals. The agent sees the current state of the battery and needs to decide the transmit power-level and the channel to be used for transmission. This problem, therefore, involves an additional decision making layer compared to the single channel case. Note that we use the terms transmit power and action interchangeably in this section.

For this problem, we simplify the notations used previously and drop the state as a parameter of the reward function $f$, since the rate is not a function of the battery state and only depends on the transmit power-level and the channel gain. These channels will, in general, have different distributions of channel gains, and there may not be a single channel that is optimal for all transmit power-levels. The expected rate achieved by selecting the $j$-th channel from the set of $M$ channels and an action $a$ corresponding to the transmit power used is

$$\mu_j(a) = \mathbb{E}_{X_j}[f(a, X_j)], \quad (6.32)$$

where $X_j$ denotes the random gain of the $j$-th channel. Let us define $\phi^* : \mathcal{A} \to \{1, 2, \dots, M\}$ as the mapping from transmit power-levels to their corresponding optimal channels:

$$\phi^*(a) = \arg\max_{j\in\{1,2,\dots,M\}} \mu_j(a). \quad (6.33)$$

A genie that knows the distributions of the different channel gains can figure out the optimal channel mapping $\phi^*$. Once an action $a$ is chosen by the genie, there is no incentive to use any channel other than $\phi^*(a)$ for transmission during that slot.
Let $\mu^*(a)$ denote the expected rate of the best channel for action $a$, i.e. $\mu^*(a) = \mu_{\phi^*(a)}(a)$. The genie uses these values to solve the following LP:

$$\max \sum_{s\in\mathcal{S}} \sum_{a\in\mathcal{A}_s} \omega(s,a)\,\mu^*(a)$$
$$\text{s.t.} \quad \omega(s,a) \ge 0, \quad \forall s\in\mathcal{S},\ a\in\mathcal{A}_s,$$
$$\sum_{s\in\mathcal{S}} \sum_{a\in\mathcal{A}_s} \omega(s,a) = 1,$$
$$\sum_{a\in\mathcal{A}_{s'}} \omega(s',a) = \sum_{s\in\mathcal{S}} \sum_{a\in\mathcal{A}_s} \omega(s,a)\,P(s' \mid s,a), \quad \forall s'\in\mathcal{S}. \quad (6.34)$$

The genie obtains $\pi^* : \mathcal{S} \to \mathcal{A}$, the optimal mapping from the battery state to the transmit power-level, using the non-zero terms of the optimal stationary distribution $\omega^*(s,a)$. Note that the constraints of the optimization problem (6.34) ensure that the stationary distribution actually corresponds to some valid deterministic state-action mapping. Let $\mathcal{B}$ be the set of all state-action mappings. There are only a finite number of such mappings $\pi \in \mathcal{B}$, and the stationary distribution only depends on the matrix of state transition probabilities, which is assumed to be known. We use $\omega_\pi(s)$ to denote the stationary distribution corresponding to the state-action mapping $\pi$. We have dropped the action parameter from the previous notation, since it is implicit from $\pi$. The expected average reward of the power selection policy $\pi$ along with a channel selection policy $\phi$ is calculated as

$$\rho(\pi, \phi, \mathbf{M}) = \sum_{s\in\mathcal{S}} \omega_\pi(s)\, \mu_{\phi(\pi(s))}(\pi(s)), \quad (6.35)$$

where $\mathbf{M}$ denotes the matrix containing all the $\mu_j(a)$ values for the power-channel pairs. For the genie under consideration, the LP from (6.34) is equivalent to

$$\pi^* = \arg\max_{\pi\in\mathcal{B}} \rho(\pi, \phi^*, \mathbf{M}). \quad (6.36)$$

The expected average reward of the genie can, therefore, be defined as $\rho^* = \rho(\pi^*, \phi^*, \mathbf{M})$. Since the mean rate matrix $\mathbf{M}$ is unknown to the agent, we propose an online learning framework for this problem.

6.3.1 Online Learning Algorithm

Since the agent does not know the distributions of the channel gains, it needs to learn the rates for the various power-channel pairs, figure out $\phi^*$ over time and use it to make decisions about the transmit power-level in each slot. We propose an online learning algorithm called Multi-Channel LPSM (MC-LPSM) for this problem. We analyze the performance of MC-LPSM in terms of the regret as defined in equation (6.9).
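The genie's first step, collapsing the $M$ channels into the single best-channel rate $\mu^*(a)$ of (6.33), is a per-action argmax, after which the problem reduces to the single-channel LP. The sketch below uses a hypothetical mean-rate matrix in which neither channel dominates at every power level, so the mapping genuinely depends on the action.

```python
def best_channel_mapping(mu):
    """phi*(a) = argmax_j mu_j(a) and mu*(a) = mu_{phi*(a)}(a), as in (6.33).
    Here mu[j][a] holds the mean rate of channel j at power level a."""
    M, A = len(mu), len(mu[0])
    phi = [max(range(M), key=lambda j: mu[j][a]) for a in range(A)]
    mu_star = [mu[phi[a]][a] for a in range(A)]
    return phi, mu_star

# Hypothetical 2-channel, 3-power-level example: channel 1 is better at the
# lowest and highest power levels, channel 0 at the middle one.
mu = [[0.2, 0.9, 1.1],    # channel 0
      [0.3, 0.7, 1.5]]    # channel 1
phi, mu_star = best_channel_mapping(mu)
print(phi, mu_star)
```

MC-LPSM performs exactly this argmax on the sample-mean estimates $\hat\mu_j(a)$ instead of the unknown means, which is why the channel mapping it uses can be wrong early on and must be analyzed separately in the regret proof.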
The MC-LPSM algorithm stores estimates of the rates for all power-level and channel pairs based on the observed values of the channel gains. Whenever the rate obtained is revealed to the agent, it can infer the instantaneous gain of the chosen channel knowing the transmit power-level. Once the instantaneous gain of a channel is known, this information can be used to update the sample-mean rate estimates of all the power-levels for that channel. The algorithm divides time into two interleaved sequences: an exploration sequence and an exploitation sequence similar to the DSEE algorithm from [4]. In the exploitation sequence, the agent uses its current estimates of the rates to determine the transmit power and channel selection policies. First it selects a channel for each power-level that has the highest empirical rate for that transmit power. The sample-mean rate estimates for these power-channel pairs are, then, used to solve the LP from equation (6.34) with values replacing the values and to obtain a power-selection policy for that slot. In the exploration sequence, the agent selects all channels in a round-robin fashion in 93 order to learn the rates over time and chooses the transmit power-levels arbitrarily. The choice of the length of the exploration sequence balances the tradeo between exploration and exploitation. LetR(t) denote the set of time indexes that are marked as exploration slots up to time t. LetjR(t)j be the cardinality of the setR(t). At any given time t, m j stores the number of timesj-th channel has been chosen during the exploration sequence till that slot. Using these notations we present MC-LPSM in algorithm 6. Note that MC-LPSM stores a j (a) variable for every action action-channel pair (a;j) and anm j variable for every channelj. It, therefore, requiresO(MA) storage. Note that the agent updates the variables only during the exploration sequence when it tries dierent channels sequentially. 
Since the variables do not change during exploitation, the agent does not have to solve the LP in all exploitation slots. During a phase of successive exploitation slots, the channel and power selection policies obtained by solving the LPs remain unchanged. The agent, therefore, needs to solve the LP at time t only if the previous slot was an exploration slot. Since there are at mostjR(T )j exploration slots, MC-LPSM solves at mostjR(T )j number of LPs in T slots. 6.3.2 Regret Analysis of MC-LPSM Let us rst dene the notations used in the regret analysis. Since the agent is unaware of the matrix of expected rates M, it stores the estimates of the expected 94 Algorithm 6 MC-LPSM 1: Parameters: w> 2B 2 0 d 2 . 2: Initialization: For all a2A and j2f1; 2; ;Mg, j (a) = 0. Also m j = 0 for all channels j and n = 0. 3: while n<T do 4: n =n + 1; 5: if n2R(T ) then 6: // Exploration sequence 7: Choose channel j = ((n 1) mod M) + 1 and any valid power-level as action a; 8: Update j (a) variables for all actions a2A for the chosen channel j: j (a) m j j (a) +f(a;x n ) m j + 1 ; m j m j + 1; (6.37) 9: else 10: // Exploitation sequence 11: if n 12R(T ) then 12: Dene a channel mapping , such that (a) = max j j (a); 13: Solve the LP from (6.34) with (a) (a) instead of unknown (a) for all valid state-action pairs (s;a); 14: In terms of the LP solution (n) , dene(s) = arg max a2As (n) (s;a);8s2S; 15: end if 16: Given the state s n , select the power-level a n = (s n ) as action for trans- mission over channel (a n ); 17: end if 18: end while 95 rates in matrix . We dene(;; ) according to equation (6.35) with the actual mean values replaced by their corresponding estimates in . Let P denote the transition probability matrix corresponding to the optimal state-action mapping . 
We further define:

$$\gamma = \max_{s, s' \in \mathcal{S}} \|P^{\pi^*}(s', \cdot) - P^{\pi^*}(s, \cdot)\|_{TV} \quad (6.38)$$
$$\mu_{\max} = \max_{a \in \mathcal{A}} \mu_{\phi^*(a)}(a) \quad (6.39)$$
$$\mu_{\min} = \min_{a \in \mathcal{A},\, j \in \{1, 2, \dots, M\}} \mu_j(a) \quad (6.40)$$
$$\Delta_3 = \min_{a \in \mathcal{A},\, j \neq \phi^*(a)} \left\{ \mu_{\phi^*(a)}(a) - \mu_j(a) \right\} \quad (6.41)$$
$$\Delta_4 = \rho(\pi^*, \phi^*; M) - \max_{\pi \neq \pi^*} \rho(\pi, \phi^*; M) \quad (6.42)$$
$$B_0 = \max_{a \in \mathcal{A}} \left\{ \sup_{x \in \mathcal{X}} f(a, x) - \inf_{x \in \mathcal{X}} f(a, x) \right\}. \quad (6.43)$$

In terms of these notations, we provide an upper bound on the regret of MC-LPSM as follows.

Theorem 10. Given a constant $d \le \min\{\Delta_3, \Delta_4\}$, choose a constant $w > 2B_0^2 / d^2$. Construct an exploration sequence sequentially as follows:

1. $\mathcal{R}(1) = \{1\}$,
2. For any $t > 1$, include $t$ in $\mathcal{R}(t)$ iff $|\mathcal{R}(t-1)| < M \lceil w \ln t \rceil$.

Under this exploration sequence $\mathcal{R}$, the $T$-slot expected regret of the MC-LPSM algorithm is upper bounded by

$$\left( M \lceil w \ln T \rceil + 2AM\, c\!\left( \frac{wd^2}{2B_0^2} \right) \right) \left( \mu_{\max} + \frac{\mu_{\max}}{1 - \gamma} \right), \quad (6.44)$$

where $c(x) = \sum_{t=1}^{\infty} t^{-x} < \infty$ for $x > 1$.

Proof. In order to upper bound the regret of the MC-LPSM algorithm, we analyze the number of time-slots where the agent plays policy combinations other than $(\pi^*, \phi^*)$. Such a failure event at time $t$ corresponds to at least one of the following cases:

1. $t \in \mathcal{R}$, i.e. the exploration of different channels,
2. $\hat{\phi}_t \neq \phi^*$ during exploitation,
3. $\hat{\pi}_t \neq \pi^*$ during exploitation.

Let $N_4(T)$ be the total number of exploitation slots where MC-LPSM fails to find the optimal power-channel mapping or the optimal state-action mapping up to time $T$. Let us define the events $E_{1,t} = \{\hat{\phi}_t \neq \phi^*\}$ and $E_{2,t} = \{\hat{\pi}_t \neq \pi^*\}$. Now $N_4(T)$ can be expressed as

$$N_4(T) = \sum_{t \notin \mathcal{R},\, t \le T} \mathbb{I}\{E_{1,t} \cup E_{2,t}\} = \sum_{t \notin \mathcal{R},\, t \le T} \left( \mathbb{I}\{E_{1,t}\} + \mathbb{I}\{E_{2,t} \cap \bar{E}_{1,t}\} \right). \quad (6.45)$$

We analyze the two events separately and upper bound their probabilities.

6.3.2.1 Non-Optimal Power-Channel Mapping

We use $\hat{\mu}^*_t(a)$ to denote $\hat{\mu}_{\phi^*(a), t}(a)$ for all actions $a$.
The probability of the event $E_{1,t}$ can be bounded as

$$\Pr\{\hat{\phi}_t \neq \phi^*\} = \Pr\{\hat{\phi}_t(a) \neq \phi^*(a) \text{ for at least one action } a \in \mathcal{A}\}$$
$$\le \sum_{a \in \mathcal{A}} \Pr\{\hat{\phi}_t(a) \neq \phi^*(a)\}$$
$$\le \sum_{a \in \mathcal{A}} \Pr\{\hat{\mu}_{j,t}(a) \ge \hat{\mu}^*_t(a) \text{ for at least one channel } j \neq \phi^*(a)\}$$
$$\le \sum_{a \in \mathcal{A}} \sum_{j \neq \phi^*(a)} \Pr\{\hat{\mu}_{j,t}(a) \ge \hat{\mu}^*_t(a)\}. \quad (6.46)$$

In order for the condition $\hat{\mu}_{j,t}(a) \ge \hat{\mu}^*_t(a)$ to hold, at least one of the following must hold:

$$\hat{\mu}_{j,t}(a) \ge \mu_j(a) + \frac{\Delta_3}{2} \quad (6.47)$$
$$\hat{\mu}^*_t(a) \le \mu_{\phi^*(a)}(a) - \frac{\Delta_3}{2} \quad (6.48)$$
$$\mu_{\phi^*(a)}(a) < \mu_j(a) + \Delta_3. \quad (6.49)$$

Note that condition (6.49) cannot hold due to the definition of $\Delta_3$. Hence we upper bound the probabilities of the other two events. The construction of the exploration sequence guarantees that at $t \notin \mathcal{R}$ each channel has been explored at least $\lceil w \ln t \rceil$ times. Since $B_0$ upper bounds the maximum deviation in the range of rate values over channels, we bound the probability of the event from condition (6.47) using Hoeffding's inequality as

$$\Pr\left\{ \hat{\mu}_{j,t}(a) \ge \mu_j(a) + \frac{\Delta_3}{2} \right\} \le e^{-\frac{1}{2} \left( \frac{\Delta_3}{B_0} \right)^2 \lceil w \ln t \rceil} \le e^{-\frac{1}{2} \left( \frac{\Delta_3}{B_0} \right)^2 w \ln t} = t^{-\frac{w}{2} \left( \frac{\Delta_3}{B_0} \right)^2}. \quad (6.50)$$

Using Hoeffding's inequality again for condition (6.48), we similarly obtain

$$\Pr\left\{ \hat{\mu}^*_t(a) \le \mu_{\phi^*(a)}(a) - \frac{\Delta_3}{2} \right\} \le t^{-\frac{w}{2} \left( \frac{\Delta_3}{B_0} \right)^2}. \quad (6.51)$$

We can, therefore, express the upper bound from equation (6.46) as

$$\Pr\{\hat{\phi}_t \neq \phi^*\} \le \sum_{a \in \mathcal{A}} \sum_{j \neq \phi^*(a)} \left( \Pr\left\{ \hat{\mu}_{j,t}(a) \ge \mu_j(a) + \frac{\Delta_3}{2} \right\} + \Pr\left\{ \hat{\mu}^*_t(a) \le \mu_{\phi^*(a)}(a) - \frac{\Delta_3}{2} \right\} \right)$$
$$\le \sum_{a \in \mathcal{A}} \sum_{j \neq \phi^*(a)} 2\, t^{-\frac{w}{2} \left( \frac{\Delta_3}{B_0} \right)^2} \le 2A(M-1)\, t^{-\frac{w}{2} \left( \frac{\Delta_3}{B_0} \right)^2}. \quad (6.52)$$

6.3.2.2 Non-Optimal State-Action Mapping

We analyze the event $E_{2,t} \cap \bar{E}_{1,t}$, where the LP fails to find the optimal state-action mapping in spite of having found the optimal power-channel mapping $\phi^*$:

$$\Pr\{E_{2,t} \cap \bar{E}_{1,t}\} = \Pr\{\hat{\pi}_t \neq \pi^*,\ \hat{\phi}_t = \phi^*\} \le \Pr\{\rho(\pi^*, \phi^*; \hat{M}_t) \le \rho(\hat{\pi}_t, \phi^*; \hat{M}_t)\}. \quad (6.53)$$

For $\rho(\pi^*, \phi^*; \hat{M}_t) \le \rho(\hat{\pi}_t, \phi^*; \hat{M}_t)$ to hold, at least one of the following must hold:

$$\rho(\pi^*, \phi^*; \hat{M}_t) \le \rho(\pi^*, \phi^*; M) - \frac{\Delta_4}{2} \quad (6.54)$$
$$\rho(\hat{\pi}_t, \phi^*; \hat{M}_t) \ge \rho(\hat{\pi}_t, \phi^*; M) + \frac{\Delta_4}{2} \quad (6.55)$$
$$\rho(\pi^*, \phi^*; M) < \rho(\hat{\pi}_t, \phi^*; M) + \Delta_4. \quad (6.56)$$

The condition from equation (6.56) cannot hold due to the definition of $\Delta_4$.
We use the techniques from equations (6.17) and (6.18) in the proof of theorem 5 to upper bound the probabilities of the events of equations (6.54) and (6.55) as

$$\Pr\left\{ \rho(\pi^*, \phi^*; \hat{M}_t) \le \rho(\pi^*, \phi^*; M) - \frac{\Delta_4}{2} \right\} \le \min\{S, A\}\, e^{-\frac{1}{2} \left( \frac{\Delta_4}{B_0} \right)^2 \lceil w \ln t \rceil} \le A\, t^{-\frac{w}{2} \left( \frac{\Delta_4}{B_0} \right)^2} \quad (6.57)$$

$$\Pr\left\{ \rho(\hat{\pi}_t, \phi^*; \hat{M}_t) \ge \rho(\hat{\pi}_t, \phi^*; M) + \frac{\Delta_4}{2} \right\} \le A\, e^{-\frac{1}{2} \left( \frac{\Delta_4}{B_0} \right)^2 \lceil w \ln t \rceil} \le A\, t^{-\frac{w}{2} \left( \frac{\Delta_4}{B_0} \right)^2}. \quad (6.58)$$

Note that these concentration bounds are different from the single channel case, as the number of observations leading to $\hat{M}_t$ is only $\lceil w \ln t \rceil$, in contrast to $t$ observations for the single channel. Now we update the upper bound from equation (6.53) as

$$\Pr\{E_{2,t} \cap \bar{E}_{1,t}\} \le \Pr\left\{ \rho(\pi^*, \phi^*; \hat{M}_t) \le \rho(\pi^*, \phi^*; M) - \frac{\Delta_4}{2} \right\} + \Pr\left\{ \rho(\hat{\pi}_t, \phi^*; \hat{M}_t) \ge \rho(\hat{\pi}_t, \phi^*; M) + \frac{\Delta_4}{2} \right\} \le 2A\, t^{-\frac{w}{2} \left( \frac{\Delta_4}{B_0} \right)^2}. \quad (6.59)$$

The expected number of exploitation slots where non-optimal power and channel selection decisions are made, $E[N_4(T)]$, can be bounded using equations (6.52) and (6.59) as

$$E[N_4(T)] \le \sum_{t=1}^{T} \left( \Pr\{E_{1,t}\} + \Pr\{E_{2,t} \cap \bar{E}_{1,t}\} \right) \le \sum_{t=1}^{T} \left( 2A(M-1)\, t^{-\frac{w}{2} \left( \frac{\Delta_3}{B_0} \right)^2} + 2A\, t^{-\frac{w}{2} \left( \frac{\Delta_4}{B_0} \right)^2} \right)$$
$$\le 2AM \sum_{t=1}^{T} t^{-\frac{w}{2} \left( \frac{d}{B_0} \right)^2} \le 2AM\, c\!\left( \frac{wd^2}{2B_0^2} \right), \quad (6.60)$$

where $d \le \min\{\Delta_3, \Delta_4\}$. Since $w > 2B_0^2 / d^2$, the exponent exceeds 1 and the upper bound from equation (6.60) holds. The expected number of slots where non-optimal decisions are made, including exploration and exploitation sequences, is upper bounded by $M \lceil w \ln T \rceil + 2AM\, c\!\left( \frac{wd^2}{2B_0^2} \right)$. This implies that there can be at most $M \lceil w \ln T \rceil + 2AM\, c\!\left( \frac{wd^2}{2B_0^2} \right)$ phases where the optimal policies are played in succession, since any two optimal phases must have at least one non-optimal slot in between. The total expected regret of any optimal phase is bounded by $\frac{\mu_{\max}}{1 - \gamma}$, and the expected regret incurred during a non-optimal slot is bounded by $\mu_{\max}$. The total expected $T$-slot regret of the MC-LPSM algorithm is, therefore, upper bounded by the expression (6.44).

Note that the length of the exploration sequence specified in theorem 10 scales logarithmically in time.
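Both ingredients of the bound are easy to evaluate numerically: the exploration schedule of Theorem 10 and the convergent series constant $c(x)$. The sketch below is illustrative, with assumed parameter values:

```python
import math

def exploration_schedule(T, M, w):
    # Theorem 10's sequential construction: start with slot 1, then include
    # slot t iff fewer than M * ceil(w * ln t) slots are marked so far.
    R = {1}
    for t in range(2, T + 1):
        if len(R) < M * math.ceil(w * math.log(t)):
            R.add(t)
    return R

def c(x, terms=100_000):
    # Partial sum of c(x) = sum_{t >= 1} t^(-x); finite exactly when x > 1,
    # which is guaranteed by the choice w > 2 * B0**2 / d**2.
    return sum(t ** -x for t in range(1, terms + 1))
```

By construction, $|\mathcal{R}(T)|$ ends up equal to $M \lceil w \ln T \rceil$ for large enough $T$, so the number of LP solves grows only logarithmically.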
The MC-LPSM algorithm using this exploration sequence, therefore, solves $O(\ln T)$ LPs in $T$ slots, similar in order to the single channel Epoch-LPSM algorithm.

It must be noted that the logarithmic order regret is achievable by MC-LPSM only if we know $d$, a lower bound on $\Delta_3$ and $\Delta_4$. This is required in order to define a constant $w$ that leads to the series convergence in the regret proof. If no such knowledge is available, the exploration sequence needs to be expanded in order to achieve a regret that is arbitrarily close to the logarithmic order, similar to the DSEE techniques from [4]. The regret result for such an exploration sequence is as follows:

Theorem 11 (Theorem 2 from [4]). Let $g$ be any positive, monotonically non-decreasing sequence with $g(t) \to \infty$ as $t \to \infty$. Construct an exploration sequence as follows: for any $t > 1$, include $t$ in $\mathcal{R}$ iff $|\mathcal{R}(t-1)| < M \lceil g(t) \ln t \rceil$. Under this exploration sequence $\mathcal{R}$, the $T$-slot expected regret of the MC-LPSM algorithm is $O(g(T) \ln T)$.

While this regret is not logarithmic in time, one can approach arbitrarily close to the logarithmic order by reducing the diverging rate of $g(t)$. With this construction of the exploration sequence, the MC-LPSM algorithm solves $O(g(T) \ln T)$ LPs in time $T$.

6.3.3 Asymptotic Lower Bound

In the multi-channel scenario, there exist one or more channels that are optimal for some transmit power levels with non-zero stationary probability. For every optimal channel $j$, there exists some state $s \in \mathcal{S}$ such that $\phi^*(\pi^*(s)) = j$. There may also exist channels that are either not optimal for any transmit power level, or are optimal only for power levels that have zero stationary probability. We now present an asymptotic lower bound on the regret of any algorithm for the multi-channel energy harvesting communications problem under certain conditions. To prove the regret bound, we first present a lower bound on the number of plays of the non-optimal channels for any algorithm.
Our analysis is based on the asymptotic lower bound on the regret of the standard MAB problem by Lai and Robbins [1]. This MAB regret lower bound applies to settings where the arm-distributions are characterized by a single parameter. This result was extended by Burnetas and Katehakis to distributions indexed by multiple parameters in [44]. In our analysis, however, we restrict ourselves to single parameter channel-gain distributions.

Let the gain distribution of each channel be expressed by its density function $g(x; \theta)$ with respect to some measure $\nu$, where the density function $g(\cdot; \cdot)$ is known and $\theta$ is an unknown parameter from some set $\Theta$. Although we consider continuous distributions here, the analysis also holds for discrete distributions, where a probability mass function replaces the density and summations replace the integrals. Corresponding to a valid transmit power $a$ and parameter $\theta \in \Theta$, we define the expected rate as

$$\mu(a, \theta) = \int_{x \in \mathcal{X}} f(a, x)\, g(x; \theta)\, d\nu(x). \quad (6.61)$$

Let $I(\theta, \theta')$ denote the Kullback-Leibler distance defined as

$$I(\theta, \theta') = \int_{x \in \mathcal{X}} \ln\left( \frac{g(x; \theta)}{g(x; \theta')} \right) g(x; \theta)\, d\nu(x). \quad (6.62)$$

We now make the following assumptions about the density and the parameter set under consideration.

A1 Existence of mean: $\mu(a, \theta) < \infty$ exists for any $\theta \in \Theta$ and $a \in \mathcal{A}$.

A2 Denseness of $\Theta$: $\forall \theta \in \Theta$, $\forall a \in \mathcal{A}$ and $\forall \delta > 0$, $\exists \theta' \in \Theta$ such that $\mu(a, \theta) < \mu(a, \theta') < \mu(a, \theta) + \delta$.

A3 Positivity of distance: $0 < I(\theta, \theta') < \infty$ whenever $\mu(a, \theta) < \mu(a, \theta')$ for some $a \in \mathcal{A}$.

A4 Continuity of $I(\theta, \theta')$: $\forall \epsilon > 0$, $\forall a \in \mathcal{A}$ and $\forall \theta, \theta' \in \Theta$ such that $\mu(a, \theta) < \mu(a, \theta')$, $\exists \delta = \delta(a, \epsilon, \theta, \theta') > 0$ for which $|I(\theta, \theta') - I(\theta, \theta'')| < \epsilon$ whenever $\mu(a, \theta') < \mu(a, \theta'') < \mu(a, \theta') + \delta$.

For channel gain distributions satisfying these conditions, we present a lower bound on the number of plays of a non-optimal arm based on the techniques from [1].

Theorem 12. Assume that the density and the parameter set satisfy assumptions A1-A4. Let $\underline{\theta} = (\theta_1, \theta_2, \dots, \theta_M)$ be a valid parameter vector characterizing the distributions of the $M$ channels, and let $P_{\underline{\theta}}$ and $E_{\underline{\theta}}$ be the probability measure and expectation under $\underline{\theta}$.
Let $\mathcal{L}$ be any allocation rule that satisfies, for every $\underline{\theta}$ as $T \to \infty$, $R_{\mathcal{L}}(T) = o(T^b)$ for every $b > 0$ over an MDP $\mathcal{M}$. Let $N_i(T)$ denote the number of plays of the $i$-th channel up to time $T$ by the rule $\mathcal{L}$, and $O_{\underline{\theta}}$ the index set of the optimal channels under the parameter vector $\underline{\theta}$. Then for every channel $i \notin O_{\underline{\theta}}$,

$$\liminf_{T \to \infty} \frac{E_{\underline{\theta}}[N_i(T)]}{\ln T} \ge \max_{j \in O_{\underline{\theta}}} \frac{1}{I(\theta_i, \theta_j)}. \quad (6.63)$$

Proof. Without loss of generality, we assume that $1 \notin O_{\underline{\theta}}$ and $2 \in O_{\underline{\theta}}$ for the parameter vector $\underline{\theta}$. This means that $\exists a \in \mathcal{A}$ such that $\mu(a, \theta_2) > \mu(a, \theta_1)$ and $\mu(a, \theta_2) \ge \mu(a, \theta_j)$ for $3 \le j \le M$. Fix any $0 < \delta < 1$. By assumptions A2 and A4, we can choose $\lambda \in \Theta$ such that

$$\mu(a, \lambda) > \mu(a, \theta_2) \quad \text{and} \quad |I(\theta_1, \lambda) - I(\theta_1, \theta_2)| < \delta\, I(\theta_1, \theta_2). \quad (6.64)$$

Let us define a new parameter vector $\underline{\gamma} = (\lambda, \theta_2, \dots, \theta_M)$ such that under $\underline{\gamma}$, $1 \in O_{\underline{\gamma}}$. The basic argument is that any algorithm incurring regrets of order $o(T^b)$ for every $b > 0$ must play every channel a minimum number of times to be able to distinguish between the cases $\underline{\theta}$ and $\underline{\gamma}$.

Let $N_{i,j}(T)$ denote the number of times the $i$-th channel has been played up to time $T$ with power levels for which the $j$-th channel was the optimal channel. We, therefore, have $N_i(T) = \sum_{j=1}^{M} N_{i,j}(T)$, where $N_{i,j}(T) \ge 0$ for all $(i, j)$ pairs. We define $T_i(T)$ as the number of plays of the power levels for which the $i$-th channel is optimal up to time $T$. This implies that the allocation rule $\mathcal{L}$ plays channels other than the $i$-th channel for $T_i(T) - N_{i,i}(T)$ slots where they were non-optimal. Fix $0 < b < \delta$. Since $R_{\mathcal{L}}(T) = o(T^b)$ as $T \to \infty$, we have $E_{\underline{\gamma}}[N_{j,i}(T)] = o(T^b)$ when $i \neq j$. Hence for distributions parametrized by $\underline{\gamma}$, we have

$$E_{\underline{\gamma}}[T_1(T) - N_{1,1}(T)] = \sum_{j \neq 1} E_{\underline{\gamma}}[N_{j,1}(T)] = o(T^b). \quad (6.65)$$

We define a stationary distribution over channel plays under the optimal power selection policy for the MDP as

$$\sigma_j = \sum_{s \in \mathcal{S}} \sum_{a \in \mathcal{A}:\, \phi^*(a) = j} \pi^*(s, a). \quad (6.66)$$

Note that the optimal policies $\pi^*$ and $\phi^*$ are dependent on the choice of the parameter vector characterizing the channels, and so is $\sigma_j$. For channels $j \in O_{\underline{\gamma}}$, $\sigma_j > 0$ under $\underline{\gamma}$.
Since $R_{\mathcal{L}}(T) = o(T^b)$ asymptotically, we have $E_{\underline{\gamma}}[|\sigma_1 T - T_1(T)|] = o(T^b)$ as $T \to \infty$. Fix $0 < c < 1$. We have

$$P_{\underline{\gamma}}\{T_1(T) \le (1-c)\sigma_1 T\} = P_{\underline{\gamma}}\{\sigma_1 T - T_1(T) \ge c\sigma_1 T\}$$
$$\le P_{\underline{\gamma}}\{|\sigma_1 T - T_1(T)| \ge c\sigma_1 T\}$$
$$\le \frac{E_{\underline{\gamma}}[|\sigma_1 T - T_1(T)|]}{c\sigma_1 T} \quad \text{(Markov's inequality)}$$
$$= o(T^{b-1}). \quad (6.67)$$

We consider another event:

$$P_{\underline{\gamma}}\left\{ N_{1,1}(T) < \frac{(1-\delta)\ln T}{I(\theta_1, \lambda)},\ T_1(T) > (1-c)\sigma_1 T \right\}$$
$$\le P_{\underline{\gamma}}\left\{ T_1(T) - N_{1,1}(T) \ge T_1(T) - \frac{(1-\delta)\ln T}{I(\theta_1, \lambda)},\ T_1(T) > (1-c)\sigma_1 T \right\}$$
$$\le P_{\underline{\gamma}}\left\{ T_1(T) - N_{1,1}(T) \ge (1-c)\sigma_1 T - \frac{(1-\delta)\ln T}{I(\theta_1, \lambda)} \right\}$$
$$\le \frac{E_{\underline{\gamma}}[T_1(T) - N_{1,1}(T)]}{(1-c)\sigma_1 T - O(\ln T)} \quad \text{(Markov's inequality)}$$
$$= o(T^{b-1}). \quad (6.68)$$

For the event $\left\{ N_1(T) < \frac{(1-\delta)\ln T}{I(\theta_1, \lambda)} \right\}$, we have

$$P_{\underline{\gamma}}\left\{ N_1(T) < \frac{(1-\delta)\ln T}{I(\theta_1, \lambda)} \right\} \le P_{\underline{\gamma}}\left\{ N_{1,1}(T) < \frac{(1-\delta)\ln T}{I(\theta_1, \lambda)} \right\}$$
$$\le P_{\underline{\gamma}}\{T_1(T) \le (1-c)\sigma_1 T\} + P_{\underline{\gamma}}\left\{ N_{1,1}(T) < \frac{(1-\delta)\ln T}{I(\theta_1, \lambda)},\ T_1(T) > (1-c)\sigma_1 T \right\}$$
$$= o(T^{b-1}). \quad (6.69)$$

Note that the allocation rule $\mathcal{L}$ only knows the channel gain realizations of the arms it has played; it does not have exact distributional knowledge. Let $Y_1, Y_2, \dots$ denote the successive realizations of the first channel's gains. We define $L_m = \sum_{k=1}^{m} \ln\left( \frac{g(Y_k; \theta_1)}{g(Y_k; \lambda)} \right)$ and an event $\mathcal{E}_T$ as

$$\mathcal{E}_T = \left\{ N_1(T) < \frac{(1-\delta)\ln T}{I(\theta_1, \lambda)},\ L_{N_1(T)} \le (1-b)\ln T \right\}. \quad (6.70)$$

From the inequality in (6.69), we have

$$P_{\underline{\gamma}}\{\mathcal{E}_T\} = o(T^{b-1}). \quad (6.71)$$

Note the following relationship:

$$P_{\underline{\gamma}}\{N_1(T) = n_1, \dots, N_M(T) = n_M,\ L_{n_1} \le (1-b)\ln T\}$$
$$= \int_{\{N_1(T) = n_1, \dots, N_M(T) = n_M,\ L_{n_1} \le (1-b)\ln T\}} \prod_{k=1}^{n_1} \frac{g(Y_k; \lambda)}{g(Y_k; \theta_1)}\, dP_{\underline{\theta}}$$
$$= \int_{\{N_1(T) = n_1, \dots, N_M(T) = n_M,\ L_{n_1} \le (1-b)\ln T\}} e^{-L_{n_1}}\, dP_{\underline{\theta}}$$
$$\ge e^{-(1-b)\ln T}\, P_{\underline{\theta}}\{N_1(T) = n_1, \dots, N_M(T) = n_M,\ L_{n_1} \le (1-b)\ln T\}$$
$$= T^{-(1-b)}\, P_{\underline{\theta}}\{N_1(T) = n_1, \dots, N_M(T) = n_M,\ L_{n_1} \le (1-b)\ln T\}. \quad (6.72)$$

This result rests on the assumption that the allocation rule $\mathcal{L}$ can only depend on the channel gain realizations it has observed by playing, and possibly on some internal randomization in the rule. Note that $\mathcal{E}_T$ is a disjoint union of events of the form $\{N_1(T) = n_1, \dots, N_M(T) = n_M \text{ and } L_{n_1} \le (1-b)\ln T\}$ with $n_1 + \dots + n_M = T$ and $n_1 < \frac{(1-\delta)\ln T}{I(\theta_1, \lambda)}$.
It now follows from equations (6.71) and (6.72) that as $T \to \infty$:

$$P_{\underline{\theta}}\{\mathcal{E}_T\} \le T^{1-b}\, P_{\underline{\gamma}}\{\mathcal{E}_T\} \to 0. \quad (6.73)$$

By the strong law of large numbers, $\frac{L_m}{m} \to I(\theta_1, \lambda) > 0$ and $\max_{k \le m} \frac{L_k}{m} \to I(\theta_1, \lambda)$ almost surely under $P_{\underline{\theta}}$. Since $\frac{1-b}{1-\delta} > 1$, it follows that as $T \to \infty$:

$$P_{\underline{\theta}}\left\{ L_k > (1-b)\ln T \text{ for some } k < \frac{(1-\delta)\ln T}{I(\theta_1, \lambda)} \right\} \to 0. \quad (6.74)$$

From equations (6.73) and (6.74), we conclude that

$$\lim_{T \to \infty} P_{\underline{\theta}}\left\{ N_1(T) < \frac{(1-\delta)\ln T}{I(\theta_1, \lambda)} \right\} = 0.$$

In other words,

$$\lim_{T \to \infty} P_{\underline{\theta}}\left\{ N_1(T) < \frac{(1-\delta)\ln T}{(1+\delta)\, I(\theta_1, \theta_2)} \right\} = 0.$$

Since $\delta$ is arbitrary, this implies that

$$\liminf_{T \to \infty} \frac{E_{\underline{\theta}}[N_1(T)]}{\ln T} \ge \frac{1}{I(\theta_1, \theta_2)}. \quad (6.75)$$

Note that we only considered one optimal arm above. Results like equation (6.75) hold for all optimal arms. By combining the lower bounds for a fixed non-optimal arm, we get the result in equation (6.63).

Theorem 13. Assume that the density and the parameter set satisfy assumptions A1-A4. Let $\underline{\theta}$ denote the parameter vector whose $j$-th entry is $\theta_j$, and $O_{\underline{\theta}}$ denote the index set of optimal channels. Let $\mathcal{L}$ be any allocation rule that satisfies, for every $\underline{\theta}$ as $T \to \infty$, $R_{\mathcal{L}}(T) = o(T^b)$ for every $b > 0$ over an MDP $\mathcal{M}$. Then the regret of $\mathcal{L}$ satisfies

$$\liminf_{T \to \infty} \frac{R_{\mathcal{L}}(T)}{\ln T} \ge \Delta_3 \sum_{i \notin O_{\underline{\theta}}} \max_{j \in O_{\underline{\theta}}} \frac{1}{I(\theta_i, \theta_j)}. \quad (6.76)$$

Proof. Define a hypothetical allocation $\mathcal{L}'$ based on $\mathcal{L}$ such that whenever $\mathcal{L}$ plays a non-optimal channel for some power level during its execution, $\mathcal{L}'$ plays the optimal channel corresponding to the same power level. It follows $\mathcal{L}$ in the rest of the slots. If $N'$ denotes the count variables corresponding to $\mathcal{L}'$, then for $i \notin O_{\underline{\theta}}$ we have $N'_i(T) = 0$, and for $j \in O_{\underline{\theta}}$ we have

$$N'_j(T) = N_j(T) + \sum_{i \notin O_{\underline{\theta}}} N_{i,j}(T). \quad (6.77)$$

According to equation (6.41), $\Delta_3$ is the minimum expected gap between the optimal rate of any power level and the rate for any other channel at that power. Using $\Delta_3$, we relate the regrets of $\mathcal{L}$ and $\mathcal{L}'$ as

$$R_{\mathcal{L}}(T) \ge R_{\mathcal{L}'}(T) + E_{\underline{\theta}}\left[ \sum_{i \notin O_{\underline{\theta}}} N_i(T) \right] \Delta_3 \ge \Delta_3 \sum_{i \notin O_{\underline{\theta}}} E_{\underline{\theta}}[N_i(T)]. \quad (6.78)$$

Using theorem 12, we get equation (6.76).
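The two quantities driving this lower bound, the Kullback-Leibler distance of (6.62) and the expected rate of (6.61), can be computed directly for finite-support gain distributions (the discrete case mentioned in the text). The helpers below are illustrative names, with distributions represented as dicts mapping gain values to probabilities:

```python
import math

def kl(p, q):
    # Kullback-Leibler distance I(theta, theta') of eq. (6.62) for channels
    # with finite-support gain distributions.
    return sum(px * math.log(px / q[x]) for x, px in p.items() if px > 0)

def expected_rate(a, p, f):
    # Expected rate mu(a, theta) of eq. (6.61) under gain distribution p,
    # for an arbitrary rate function f(a, x).
    return sum(px * f(a, x) for x, px in p.items())
```

Channels whose gain distributions are close in KL distance to an optimal channel's distribution are the ones that force the most exploration, since $1 / I(\theta_i, \theta_j)$ in (6.63) is large.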
Figure 6.2: Packet scheduling over a wireless channel.

Theorem 13 implies that when the gain distributions are characterized by a single parameter for each channel and follow assumptions A1-A4, any algorithm with $R(T) = o(T^b)$ for every $b > 0$ must play the non-optimal channels at least $\Omega(\ln T)$ times asymptotically. In the presence of non-optimal channels, an asymptotic regret of $\Omega(\ln T)$ is, therefore, inevitable for any algorithm. Hence we conclude that our MC-LPSM algorithm is asymptotically order optimal when the system contains non-optimal channels.

6.4 Cost Minimization Problems

We have considered reward maximization problems for describing our online learning framework. This framework can also be applied to average cost minimization problems in packet scheduling with a power-delay tradeoff, as shown in figure 6.2. We describe this motivating example and the minor changes required in our algorithms.

Consider a time-slotted communication system where a sender sends data packets to a receiver over a stochastically varying channel with unknown distribution. Such a communication system has been studied previously in [50], assuming the channel to be non-stochastically varying over time. In our setting, the arrival of data packets is also stochastic with a known distribution. The sender can send multiple packets at a higher cost, or can defer some for later slots while incurring a fixed delay penalty per packet for every time-slot it spends in the sender's queue. Let $Q_t$ denote the number of packets in the queue at time $t$ and $r_t$ ($\le Q_t$) be the number of packets transmitted by the sender during the slot. Hence, $Q_t - r_t$ packets get delayed. The sender's queue gets updated as

$$Q_{t+1} = \min\{Q_t - r_t + b_t, Q_{\max}\}, \quad (6.79)$$

where $b_t$ is the number of new packet arrivals in the $t$-th slot and $Q_{\max}$ is the maximum queue size possible.
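The queue recursion (6.79) and the per-slot cost of eq. (6.80) can be sketched as follows. The power term here, $(w_p / X) \cdot 2^{r/B}$, is one plausible reading of the garbled cost expression in the text, and the default constants are illustrative assumptions:

```python
def queue_update(Q, r, b, Q_max):
    # Queue evolution of eq. (6.79): serve r packets, admit b arrivals,
    # truncate at the buffer size Q_max.
    assert 0 <= r <= Q
    return min(Q - r + b, Q_max)

def slot_cost(Q, r, X, w_d=1.0, w_p=1.0, B=1.0):
    # Per-slot cost of eq. (6.80): delay penalty on the Q - r deferred
    # packets plus the power cost of transmitting r packets over a channel
    # with instantaneous gain-to-noise ratio X. The power term is an
    # assumed reconstruction, not a verified transcription.
    return w_d * (Q - r) + (w_p / X) * 2 ** (r / B)
```

Sending more packets in one slot trades an exponentially growing power term against a linearly shrinking delay penalty, which is the power-delay tradeoff the MDP captures.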
Since the data-rate is modelled according to equation (6.1), the power cost incurred during the $t$-th slot by transmitting $r_t$ packets over the channel becomes $\frac{w_p}{X_t} 2^{r_t / B}$, where $w_p$ is a constant known to the sender and $X_t$ is the instantaneous channel gain-to-noise ratio, assumed to be i.i.d. over time. Assuming $w_d$ as the unit delay penalty, during the slot the sender incurs an effective cost

$$C_t = w_d (Q_t - r_t) + \frac{w_p}{X_t} 2^{r_t / B}. \quad (6.80)$$

This problem also represents an MDP, where the queue size is the state and the number of packets transmitted is the action taken. The goal of this problem is to schedule the transmissions $r_t$ sequentially and minimize the expected average cost over time:

$$\lim_{T \to \infty} \frac{1}{T} E\left[ \sum_{t=1}^{T} C_t \right]. \quad (6.81)$$

Note that the cost from equation (6.80) used in this scenario is also a function of the state, unlike the problem of energy harvesting communications.

The presented algorithms LPSM and Epoch-LPSM also apply to cost minimization problems with minor changes. If $\rho(\pi; M)$ denotes the average expected cost of the policy $\pi$, then $\rho^* = \min_{\pi \in \mathcal{B}} \rho(\pi; M)$. Using this optimal mean cost as the benchmark, we define the cumulative regret of a learning algorithm after $T$ time-slots as

$$R(T) := E\left[ \sum_{t=0}^{T-1} C_t \right] - T\rho^*. \quad (6.82)$$

In order to minimize the regret for this problem, the LP from (6.10) needs to be changed from a maximization LP to a minimization LP. With these changes to the algorithms, all the theoretical guarantees still hold, with the constants defined accordingly.

6.5 Numerical Simulations

We perform simulations for the power allocation problem with $\mathcal{S} = \{0, 1, 2, 3, 4\}$ and $\mathcal{A} = \{0, 1, 2, 3, 4\}$. Note that each state $s_t$ corresponds to $Q_t$ from equation (6.2) with $Q_{\max} = 4$, and $a_t$ corresponds to the transmit power $q_t$ from equation (6.1). The reward function is the rate function from equation (6.1), and the channel gain is a scaled Bernoulli random variable with $\Pr\{X = 10\} = 0.2$ and $\Pr\{X = 0\} = 0.8$.
The valid actions $\mathcal{A}_s$ and the optimal action $\pi^*(s)$ for each state $s$ are shown in table 6.1. We use CVXPY [49] for solving the LPs in our algorithms. For the simulations in figure 6.3, we use $n_0 = 2$ and $\eta = 10$, and plot the average regret performance over $10^3$ independent runs of different algorithms along with their corresponding regret upper bounds. Here, the naive policy never uses the battery, i.e. it uses all the arriving power for the current transmission. Playing such a fixed non-optimal policy causes linearly growing regret over time. Note that the optimal policy also incurs a regret because the corresponding Markov chain is not at stationarity. We observe that LPSM follows the performance of the optimal policy, with the difference in regret stemming from the first few time-slots, when the channel statistics are not properly learnt and LPSM thus fails to find the optimal policy. As time progresses, LPSM finds the optimal policy and its regret follows the regret pattern of the optimal policy. In Epoch-LPSM with $n_0 = 2$ and $\eta = 10$, the agent solves the LP at $t = 1$ and $t = 2$. Its LP solution at $t = 2$ is followed for the first epoch and thus the regret grows linearly till $t = 19$. At $t = 20$, a new LP is solved, which often leads to the optimal policy, and the regret contribution from later slots, therefore, follows the regret of the optimal policy.

Figure 6.3: Regret performance of LPSM algorithms.

It must be noted that Epoch-LPSM solves only 3 LPs during these slots, while LPSM solves 99 LPs. Epoch-LPSM, therefore, reduces the computational requirements substantially while incurring a slightly higher cumulative regret.

Table 6.1: Actions for each state
  s    A_s             pi*(s)
  0    {0}             0
  1    {1}             1
  2    {1, 2}          1
  3    {1, 2, 3}       2
  4    {1, 2, 3, 4}    3
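The benchmark in these plots, the long-run average performance $\rho(\pi; M)$ of a fixed policy, can also be evaluated numerically without an LP solver once the policy's transition matrix is fixed. The sketch below is an illustrative alternative to the LP-based computation in the text, using power iteration to find the stationary distribution:

```python
def average_reward(P, r, iters=10_000, tol=1e-12):
    # Evaluate rho(pi; M) for a fixed policy: iterate mu <- mu P until the
    # state distribution converges, then average the per-state reward r.
    n = len(P)
    mu = [1.0 / n] * n
    for _ in range(iters):
        nxt = [sum(mu[s] * P[s][t] for s in range(n)) for t in range(n)]
        done = max(abs(a - b) for a, b in zip(nxt, mu)) < tol
        mu = nxt
        if done:
            break
    return sum(mu[s] * r[s] for s in range(n))
```

Comparing the empirical average reward of a learned policy against this quantity for $\pi^*$ gives exactly the per-slot regret that the figures track.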
Figure 6.4: Effect of the parameters $n_0$ and $\eta$ on the regret of Epoch-LPSM.

In figure 6.4, we plot the average regret performance over $10^4$ independent runs of Epoch-LPSM for the previous system with different $(n_0, \eta)$ value pairs. As the value of $\eta$ increases for a fixed $n_0$, the length of the epochs increases and thus the potentially non-optimal policies are followed for longer epochs. The larger the value of $n_0$, the better the policies played in the initial slots, where the agent solves the LP at each time. These intuitions are consistent with figure 6.4, where Epoch-LPSM with $n_0 = 6$ and $\eta = 2$ has the lowest regret and the one with $n_0 = 2$ and $\eta = 6$ has the highest of the lot. We see the regret vs computation tradeoff in action, as the decrease in computation obtained by increasing $\eta$ or decreasing $n_0$ leads to larger regrets. We observe that $\eta$ has more impact on the regret than $n_0$. Notice that there are changes in the regret trends of Epoch-LPSM at $t = n_0 \eta^m$ for small $m$, because these are the slots where a new LP is solved by Epoch-LPSM. Once the optimal policy is found by the algorithm, its regret in later slots follows the trend of the optimal policy.

Figure 6.5: Regret performance of the MC-LPSM algorithm.

In figure 6.5, we plot the regret performance of MC-LPSM for a system with 2 communication channels. The channel gains are scaled Bernoulli random variables, where the gain of the first channel follows $\Pr\{X = 10\} = 0.5$ and $\Pr\{X = 0\} = 0.5$, while that of the other follows $\Pr\{X = 22\} = 0.4$ and $\Pr\{X = 0\} = 0.6$. The optimal action selection policy is the same as in the previous case, while the optimal mapping of transmit-power to the channels is $\phi^*(1) = 2$, $\phi^*(2) = 2$, $\phi^*(3) = 1$ and
$\phi^*(4) = 1$. For MC-LPSM, we set $w = 300$ and plot the regret divided by the logarithm of the time index, averaged over 100 realizations, in figure 6.5. We notice that whenever the exploration slots are densely packed, the regret grows linearly, as non-optimal policies are potentially played during exploration. As the exploration requirement gets satisfied over time, the agent solves the LP based on its rate estimates from the exploration phases, and the regret contribution from exploitation remains bounded. The regret divided by the logarithm of time saturates, as expected, to a constant value smaller than the asymptotic upper bound on the regret.

6.6 Summary

We have considered the problem of power allocation over a stochastically varying channel with unknown distribution in an energy harvesting communication system. We have cast this problem as an online learning problem over an MDP. If the transition probabilities and the mean rewards associated with the MDP are known, the optimal policy maximizing the average expected reward over time can be determined by solving an LP specified in the chapter. Since the agent is only assumed to know the distribution of the harvested energy, it needs to learn the rewards of the state-action pairs over time and make its decisions based on the learnt behaviour. For this problem, we have proposed two online learning algorithms: LPSM and Epoch-LPSM, which both solve the LP using the sample-mean estimates of the rewards instead of the unknown mean rewards. The LPSM algorithm solves the LP at each time-slot using the updated estimates, while Epoch-LPSM only solves the LP at certain pre-defined time-slots parametrized by $n_0$ and $\eta$, and thus saves a lot of computation at the cost of an increased regret. We have shown that the regrets incurred by both these algorithms are bounded from above by constants. The system designers can, therefore, analyze the regret versus computation tradeoff and tune the parameters $n_0$ and $\eta$ based on their performance requirements. Through the numerical simulations, we have shown that the regret of LPSM is very close to that of the optimal policy.
We have also analyzed the effect of the parameters $n_0$ and $\eta$ on the regret of the Epoch-LPSM algorithm, which approaches the regret of the optimal policy for small $\eta$ values and large $n_0$. For the case of multiple channels, there is an extra layer of decision making: selecting a channel for transmission in each slot. For this problem, we have extended our approach and proposed the MC-LPSM algorithm. MC-LPSM separates the exploration of different channels, where it learns their rates, from the exploitation, where the rate-estimates of the different channels are used to obtain a power selection policy and a channel selection policy in each slot. We have proved a regret upper bound that scales logarithmically in time and linearly in the number of channels. We have also shown that the total computational requirement of MC-LPSM scales similarly. In order to show the asymptotic order optimality of our MC-LPSM algorithm, we have proved an asymptotic regret lower bound of $\Omega(\ln T)$ for any algorithm under certain conditions. In this chapter, we have considered uniform exploration of different channels in MC-LPSM. Analyzing upper confidence bound algorithms, where the exploration of different channels gets tuned to their performance, remains future work.

While we have considered the reward maximization problem in energy harvesting communications for our analysis, we have shown that these algorithms also work for the cost minimization problems in packet scheduling with minor changes.

Chapter 7

Contextual Combinatorial Bandits

Combinatorial bandits are natural extensions of the standard bandit problems, where the agent needs to select a vector as its action. In the contextual version of this problem, there is some side-information available to the agent before making the decision in each time slot. We consider these contextual combinatorial bandit problems in this chapter, where the agent needs to solve a constrained optimization problem every time it selects its action.
This chapter is organized as follows. First, in section 7.1, we describe the general problem of constrained combinatorial contextual bandits. In section 7.2, we analyze a particular problem from distributed computing as a special case of our bandit problem and show how the general problem maps directly to this application. We propose our general PMF learning algorithm and its implementation for the WDC application, and analyze its regret performance, in section 7.3. Finally, in section 7.4, we provide the simulation results for our algorithm. (The work in this chapter is based on [51].)

7.1 Problem Formulation

Let $y_t$ be the context vector presented to the agent at time $t$ and $x_t$ be the action vector of the agent. Let $N$ denote the length of the action vector, each element of which comes from a set of size $M$. There are, therefore, a total of $M^N$ possible action vectors for the agent. If all the action vectors are considered independently as arms of a contextual bandit problem, then the regret of such an algorithm scales linearly in $M^N$, i.e. exponentially in $N$. We consider alternate approaches to solve this problem in this chapter. Let $F_1$ and $F_2$ be two metrics of interest for our system, where the goal is to find an action vector that solves the following optimization problem:

$$\text{P1}: \quad \underset{x \in [M]^N}{\text{minimize}}\ E[F_1(y, x)] \quad \text{subject to}\ E[F_2(y, x)] \le B. \quad (7.1)$$

Let there be a set of $U$ unknown random variables, out of which a maximum of $L$ random variables affect each of the metrics at any given slot, depending on the chosen action vector $x$. We assume that the functional forms of how these random variables affect the instantaneous values of the metrics $F_1$ and $F_2$ are known. It must be noted that $E[F_1(y, x)]$ and $E[F_2(y, x)]$ are functions of the PMFs of the random variables corresponding to the action vector $x$, which we assume to be Lipschitz with respect to these PMF vectors.
Additionally, let there be an oracle that solves the optimization problem P1 for the current context given the distributions of the random variables. A genie that knows the distributions of all the internal random variables can use this oracle to obtain the best possible action for the current context at any time slot. Since our agent does not know the distributions of these random variables, it needs to learn them over time and improve its decision making. The performance of such an online learning agent is evaluated in terms of its regret, which represents the difference between the expected objective values of the agent's policy and those of the optimal policy. The expected cumulative regret is calculated as

$$R_{\mathcal{A}}(n) = E\left[ \sum_{t=1}^{n} F_1^{\mathcal{A}(t)}(t) \right] - E\left[ \sum_{t=1}^{n} F_1^{*}(t) \right], \quad (7.2)$$

where $\mathcal{A}$ denotes the agent's algorithm, $E[F_1^{\mathcal{A}(t)}(t)]$ denotes the agent's expected objective metric, and $E[F_1^{*}(t)]$ denotes the genie's expected objective metric at time $t$.

7.2 Distributed Computing Application

Let us consider a concrete example from distributed computing and see how this problem gets formulated as a special case of the constrained contextual bandit problem. Suppose a data processing application consists of $N$ tasks, with dependencies specified by a directed acyclic graph (DAG) $G = (V, E)$ as shown in figure 1.5. The task precedence relation is described by a directed edge $(m, n)$, which indicates that task $n$ relies on the results of task $m$. Task $n$, therefore, cannot start till task $m$ finishes. Assume that there is an incoming data-stream to be processed, where each frame needs to go through all the tasks in order to be processed completely. There are $M$ collaborating devices, not all of which may be available in each slot. For every task graph, there exists an initial task (say, task 1) that initiates the application and a final task $N$ that terminates it. Every path from task 1 to task $N$ can be described by a sequence of nodes, where each pair of successive nodes is connected by a directed edge.
7.2.1 Context

Assume that there is an incoming data-stream. Each frame $t$ has some job-specific features, which are known as side-information to the agent before it assigns the tasks to devices. This vector of features, therefore, acts as a context for the assignment. These features affect the overall cost and latency for the data frame. In the example task graph shown in figure 7.1, the node-specific work-loads and the edge-specific amounts of data exchange required are specified. These work-loads and data exchange requirements are data-frame specific and act as contexts for each frame. In general, the context can include various features, depending, for example, on whether a particular task is parallelizable at a node. The offloading strategy has to assign these tasks to multiple devices considering the data transmission and computation costs, and also their corresponding latencies.

Figure 7.1: An example task graph with nodes indicating the work-load of the task and edges indicating the required amount of data exchange.

7.2.2 Cost and Latency

Let us first define the general latencies and costs, which represent the metrics of interest $F_1$ and $F_2$ for this application. Given a context vector of data-frame features $y$, let $C_{ex}^{(j)}(i, y)$ be the random execution cost of task $i$ on device $j$ and $C_{tr}^{(jk)}(d, y)$ be the random transmission cost of $d$ units of data from device $j$ to device $k$. The latency random variables, $T_{ex}^{(j)}(i, y)$ and $T_{tr}^{(jk)}(d)$, are similarly defined. A task assignment can be represented by $x \in \{1, \dots, M\}^N$, where the $i$-th entry, $x_i$, denotes the device executing task $i$. In terms of these notations, the total cost can be written as

$$\tilde{C}(y, x) = \sum_{i=1}^{N} C_{ex}^{(x_i)}(i, y) + \sum_{(m,n) \in E} C_{tr}^{(x_m x_n)}(d_{mn}). \quad (7.3)$$

As described in this equation, the total cost is additive over the nodes, or tasks, and the edges of the graph. On the other hand, the latency up to task $i$ depends on the tasks preceding it in a non-linear manner.
Let $D^{(i)}(\mathbf{y}, \mathbf{x})$ be the latency until the end of the execution of task $i$, which can be recursively defined as

$$D^{(i)}(\mathbf{y}, \mathbf{x}) = \max_{m \in \mathcal{C}(i)} \left\{ T_{tr}^{(x_m x_i)}(d_{mi}) + D^{(m)}(\mathbf{y}, \mathbf{x}) \right\} + T_{ex}^{(x_i)}(i, \mathbf{y}), \qquad (7.4)$$

where $\mathcal{C}(i)$ denotes the set of all children of node $i$. For example, in figure 1.5, the children of task 4 are task 2 and task 3. For each child node $m$, the latency is cumulative: the latency up to task $m$ plus the latency of the transmission of data $d_{mi}$. The total latency until task $i$ is, therefore, determined by the slowest path up to that task.

Note that the costs and latencies are not deterministic, because of the dynamic nature of the device behavior and the link qualities. They are random variables that depend on the features of the incoming data-frame.

7.2.3 Optimization Problem

Let us consider the optimization problem to be solved by the agent at every time slot for our WDC application. Given an application described by a task graph and a resource network described by the processing rates and the link connectivity between the available devices, the goal is to find a task assignment $\mathbf{x}$ that minimizes the total latency and satisfies the cost constraint. The optimization problem can be written as

$$\mathbf{P2}: \quad \underset{\mathbf{x} \in [M]^N}{\text{minimize}} \; \mathbb{E}[D^{(N)}(\mathbf{y}, \mathbf{x})] \quad \text{subject to} \quad \mathbb{E}[\tilde{C}(\mathbf{y}, \mathbf{x})] \le B. \qquad (7.5)$$

The cost $\tilde{C}$ and the latency $D^{(N)}$ are defined in equations (7.3) and (7.4), respectively. The constant $B$ that specifies the cost constraint depends on the energy consumption of the wireless devices. Note that P2 directly maps to P1, with $D^{(N)}$ mapping to $F_1$ and $\tilde{C}$ to $F_2$. Note that the optimization problem described above also has a deterministic variant. The setting with deterministic costs and latencies, and also the random setting with known distributions, has been previously studied in [19, 52]. These settings assume that all the data-frames have the same processing and transmission requirements and therefore do not consider context dynamism.
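The recursion (7.4) is a longest-path computation over the DAG and can be sketched directly, here for one fixed realization of the delays. The graph encoding and all delay values are invented for illustration:

```python
# Sketch of the latency recursion (7.4): D(i) = max over children of
# {transmission delay + D(child)} + execution delay of task i.
def total_latency(N, children, t_ex, t_tr):
    """children[i]: list of child tasks of i (tasks i depends on);
    t_ex[i]: execution delay of task i on its assigned device;
    t_tr[(m, i)]: transmission delay on edge (m, i) under the assignment."""
    memo = {}
    def d(i):
        if i not in memo:
            inner = max((t_tr[(m, i)] + d(m) for m in children.get(i, [])),
                        default=0.0)  # source tasks have no children
            memo[i] = inner + t_ex[i]
        return memo[i]
    return d(N)

# diamond graph 1 -> {2, 3} -> 4, half-unit transfers:
# total_latency(4, {2: [1], 3: [1], 4: [2, 3]},
#               {1: 1.0, 2: 2.0, 3: 3.0, 4: 1.0},
#               {(1, 2): 0.5, (1, 3): 0.5, (2, 4): 0.5, (3, 4): 0.5})  ->  6.0
```

The slower branch through task 3 (path 1-3-4) determines the total, matching the "slowest path" interpretation above.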
We now extend the model to tackle the setting with random variables with unknown distributions and dynamic context.

7.2.4 Optimal Policy

For each incoming data-frame, the best computational offloading assignment depends on the features of the data-frame (the context) and the random variables corresponding to the link rates and the processing rates of the servers. If the distributions of these internal random variables are known to an all-knowing genie, ideally it can solve P2 and identify the best offloading assignment. However, actually solving the optimization problem P2 involves deriving the distribution of $D^{(N)}$ from the distributions of the transmission times and the execution times, and determining the expected value of $D^{(N)}$ for every potential offloading assignment. Setting aside the problem of determining the distribution of $D^{(N)}$ for each potential offloading, the problem of determining the optimal assignment is in itself NP-hard, as proved in [19], even for deterministically known delays and costs. In this work, therefore, we analyze a weaker genie that solves the following optimization problem:

$$\mathbf{P3}: \quad \underset{\mathbf{x} \in [M]^N}{\text{minimize}} \; \tilde{D}^{(N)}(\mathbf{y}, \mathbf{x}) \quad \text{subject to} \quad \mathbb{E}[\tilde{C}(\mathbf{y}, \mathbf{x})] \le B, \qquad (7.6)$$

where $\tilde{D}^{(N)}(\mathbf{y}, \mathbf{x})$ can be recursively defined as

$$\tilde{D}^{(i)}(\mathbf{y}, \mathbf{x}) = \max_{m \in \mathcal{C}(i)} \left\{ \mathbb{E}[T_{tr}^{(x_m x_i)}(d_{mi})] + \tilde{D}^{(m)}(\mathbf{y}, \mathbf{x}) \right\} + \mathbb{E}[T_{ex}^{(x_i)}(i, \mathbf{y})]. \qquad (7.7)$$

With this modified definition of the delays, $\tilde{D}^{(i)}(\mathbf{y}, \mathbf{x})$ can be calculated given the incoming data-frame without needing to characterize the delay distribution of each $D^{(i)}(\mathbf{y}, \mathbf{x})$.

7.2.5 Reward Functions

In our model, we allow collaborative computing in a distributed manner, where different tasks of the same job are computed on different wirelessly connected devices. In such cases, the total number of possible task assignments is very large (exponential in the number of tasks).
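For a tiny instance, the weaker genie's problem P3 can be solved by brute force: enumerate all $M^N$ assignments, discard the ones violating the expected-cost budget, and minimize the expected-delay recursion (7.7). This is only a sketch showing the structure of P3, with invented data layouts and values; it is not a practical solver, which is exactly why the exponential action space matters:

```python
import itertools

# Brute-force sketch of P3: minimize the expected-delay recursion subject to
# an expected-cost budget B, over all M^N assignments.
def solve_p3(N, M, children, exp_ex, exp_tr, exp_cost, B):
    """exp_ex[i-1][j]: E[T_ex] of task i on device j; exp_tr[j][k]: E[T_tr]
    per edge from device j to device k; exp_cost[i-1][j]: E[cost] of task i
    on device j; children[i]: child tasks of i."""
    best, best_delay = None, float("inf")
    for x in itertools.product(range(M), repeat=N):
        cost = sum(exp_cost[i][x[i]] for i in range(N))
        if cost > B:
            continue  # outside the feasible set of P3
        memo = {}
        def d(i):
            if i not in memo:
                inner = max((exp_tr[x[m - 1]][x[i - 1]] + d(m)
                             for m in children.get(i, [])), default=0.0)
                memo[i] = inner + exp_ex[i - 1][x[i - 1]]
            return memo[i]
        delay = d(N)
        if delay < best_delay:
            best, best_delay = x, delay
    return best, best_delay
```

In the tiny example below, device 1 is faster but costly, and the transfer penalty makes the all-device-0 mapping optimal under budget $B = 2$.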
However, there is a lot of contextual information available, such as the communication and processing requirements of the tasks on the application task graph and the parallelizability of the tasks and the servers. This information can be used to make these combinatorial offloading decisions. This gives rise to the problem of combinatorial bandits with contextual information. Linear combinatorial bandits have been studied previously in [5]. The latency, however, involves a max function, as shown in equation (7.7). The non-linear nature of the objective function, the latency, complicates the problem. The aspect of contextual information has also not been studied in the previous literature on combinatorial bandits.

One interesting aspect of our model is that the functional forms of the latency and the cost are known. They are functions of the job-specific features and of the random variables corresponding to the device states and the channel qualities. Consider the simple latency model from [19]: given a task $i$ with workload $m_i$, if executed on device $j$, the execution delay can be expressed as

$$T_{ex}^{(j)}(i, \mathbf{y}) = c\,\frac{m_i}{z_j},$$

where $z_j$ is the CPU rate of device $j$ and $c$ is a constant. The transmission delay can be expressed as

$$T_{tr}^{(jk)}(d) = \frac{d}{B_{jk}},$$

where $d$ represents the amount of data to be transmitted from device $j$ to device $k$ and $B_{jk}$ denotes the data-rate of the link between the devices. As we can see, the execution and transmission delays depend on attributes such as the data-rate of the transmission link and the current effective processing rate of the server, which could be unknown stochastic processes, as well as on job-specific features such as the data transmission and processing requirements, which act as the context for our problem. Consider general cost functions $g$ as follows.
$$C_{ex}^{(j)}(m, \mathbf{y}) = g^{(m)}(\mathbf{y}, \mathbf{z}^{(j)}), \qquad C_{tr}^{(jk)}(d_{mn}) = g^{(mn)}(\mathbf{y}, \mathbf{z}^{(jk)}),$$

where $g$ represents the cost function, $\mathbf{z}^{(j)}$ represents the vector of device-specific features of device $j$, $\mathbf{z}^{(jk)}$ denotes the vector of link-specific features of link $jk$, and $m$, $n$ denote the tasks in the task graph. Let latency functions $h$ be defined similarly. Note that the vectors $\mathbf{z}$ are random vectors with unknown distributions. Only the instantaneous realizations get revealed to the agent after the job execution. Let $\mathbf{p}^{(j)}$ denote the PMF vector of the random variable of device features for device $j$, whose $l$-th element is $\Pr\{\mathbf{z}^{(j)} = \mathbf{z}_l\}$, with $\mathbf{z}_l$ being a vector from the corresponding sample space $\mathcal{Z}^{(j)}$. Let $\mathbf{p}^{(jk)}$ be PMF vectors similarly defined for communication costs between devices $j$ and $k$. The corresponding expected costs and delays are defined as:

$$\mathbb{E}[C_{ex}^{(j)}(m, \mathbf{y})] = \sum_{\mathbf{z} \in \mathcal{Z}^{(j)}} g^{(m)}(\mathbf{y}, \mathbf{z}) \Pr\{\mathbf{z}^{(j)} = \mathbf{z}\} = G_{(m)}^{(j)}(\mathbf{y}, \mathbf{p}^{(j)}),$$
$$\mathbb{E}[T_{ex}^{(j)}(m, \mathbf{y})] = \sum_{\mathbf{z} \in \mathcal{Z}^{(j)}} h^{(m)}(\mathbf{y}, \mathbf{z}) \Pr\{\mathbf{z}^{(j)} = \mathbf{z}\} = H_{(m)}^{(j)}(\mathbf{y}, \mathbf{p}^{(j)}),$$
$$\mathbb{E}[C_{tr}^{(jk)}(d_{mn})] = \sum_{\mathbf{z} \in \mathcal{Z}^{(jk)}} g^{(mn)}(\mathbf{y}, \mathbf{z}) \Pr\{\mathbf{z}^{(jk)} = \mathbf{z}\} = G_{(mn)}^{(jk)}(\mathbf{y}, \mathbf{p}^{(jk)}),$$
$$\mathbb{E}[T_{tr}^{(jk)}(d_{mn})] = \sum_{\mathbf{z} \in \mathcal{Z}^{(jk)}} h^{(mn)}(\mathbf{y}, \mathbf{z}) \Pr\{\mathbf{z}^{(jk)} = \mathbf{z}\} = H_{(mn)}^{(jk)}(\mathbf{y}, \mathbf{p}^{(jk)}).$$

Notice that the expected costs and delays are monotonically increasing in each entry of the PMF vectors. This property is very significant and is utilized later in the analysis of our online learning algorithm. A genie that knows the distributions of the internal random variables or features can solve the optimization problem for the given context at that slot to obtain the optimal allocation. Since the agent does not have this information, it needs to perform online learning from the revelations following a chosen allocation.

7.3 PMF Learning Algorithm

As we have seen in the previous section, the task graph mapping problem from distributed computing is a special case of the contextual bandit problem.
However, it is still interesting as it involves two different types of random variables: device-specific and link-specific random variables. In this section, we propose our PMF learning algorithm for the WDC application and generalize it for the original bandit problem.

In our online learning approach, the agent learns the probability mass functions (PMFs) of the unknown random variables over time. At each slot, the agent uses the current estimates of the PMFs of the internal random variables, together with the known reward functions, to come up with estimates of the potential costs and latencies for all assignments, and chooses an assignment that solves the optimization problem with these estimated cost and latency values. We shall assume that the PMF learning algorithm has access to an oracle that perfectly solves the optimization problem P3. The online learning agent can, therefore, call this oracle with its current cost and latency estimates to obtain the optimal policy for these values. Based on the instantaneous revelations after the allocation, the PMF estimates of the random variables get improved. This iterative process continues with each slot. We formalize this intuition in our PMF learning algorithm for the distributed computing problem in algorithm 7 and develop tools to analyze the regret performance of this algorithm. The general version of this algorithm involves learning the distributions of all the internal random variables over time and using an oracle solving the optimization problem P1 in each slot. We describe the general PMF algorithm in algorithm 8.

7.3.1 Concentration Inequality for Lipschitz Functions

In order to prove the regret guarantees for our algorithms, we first prove a concentration inequality for Lipschitz functions. Consider a function $f : [0, 1]^S \to \mathbb{R}$.

Algorithm 7 PMF Learning Algorithm in WDC
1: Initialization: $\hat{\mathbf{p}}^{(j)} = \mathbf{0}$, $\hat{\mathbf{p}}^{(jk)} = \mathbf{0}$, $m_j = 0$ and $m_{jk} = 0$ for all valid $j$, $k$.
2: for $n = 1$ to $M^2$ do
3:   Select an unused link and a mapping that uses it.
4:   Update the $\hat{\mathbf{p}}$ vectors for each link and device used in the mapping and increment their corresponding $m$ counters by 1.
5: end for
6: // MAIN LOOP
7: while 1 do
8:   $n = n + 1$;
9:   Given the context $\mathbf{y}_n$, choose the mapping by solving P3 using the lower-confidence PMF estimates (element-wise):
     in the objective: $\hat{\mathbf{p}}^{(j)} - \sqrt{\frac{(N+|E|+1)\ln n}{m_j}}$ and $\hat{\mathbf{p}}^{(jk)} - \sqrt{\frac{(N+|E|+1)\ln n}{m_{jk}}}$;
     in the constraint: $\hat{\mathbf{p}}^{(j)} - \sqrt{\frac{(N+|E|+2)\ln n}{2 m_j}}$ and $\hat{\mathbf{p}}^{(jk)} - \sqrt{\frac{(N+|E|+2)\ln n}{2 m_{jk}}}$.
10:  Update the $\hat{\mathbf{p}}$ vectors for each link and device used in the mapping and increment their corresponding $m$ counters by 1.
11: end while

Algorithm 8 General PMF Learning Algorithm
1: Initialization: $\hat{\mathbf{p}}^{(j)} = \mathbf{0}$, $m_j = 0$ for all random variables $j$.
2: for $n = 1$ to $U$ do
3:   Select an unused random variable and an action that uses it.
4:   Update the $\hat{\mathbf{p}}$ vectors for each random variable used in the action and increment their corresponding $m$ counters by 1.
5: end for
6: // MAIN LOOP
7: while 1 do
8:   $n = n + 1$;
9:   Given the context $\mathbf{y}_n$, choose the action by solving P1 using the lower-confidence PMF estimates (element-wise):
     in the objective: $\hat{\mathbf{p}}^{(j)} - \sqrt{\frac{(L+1)\ln n}{m_j}}$;
     in the constraint: $\hat{\mathbf{p}}^{(j)} - \sqrt{\frac{(L+2)\ln n}{2 m_j}}$.
10:  Update the $\hat{\mathbf{p}}$ vectors for each random variable used in the action and increment their corresponding $m$ counters by 1.
11: end while

For each $1 \le i \le S$, we assume that the function $f$ satisfies the following Lipschitz continuity property.

Property 1. Let $p_i \in [0, 1]$, for each $1 \le i \le S$. For any $\Delta$ such that $(\Delta + p_i) \in [0, 1]$, there exists $\lambda_i > 0$ such that

$$|f(p_1, p_2, \ldots, p_i, \ldots, p_S) - f(p_1, p_2, \ldots, p_i + \Delta, \ldots, p_S)| \le \lambda_i |\Delta|. \qquad (7.8)$$

We now assume that the function $f$ satisfies the Lipschitz property with known constants $\lambda_i$ for parameter indexes $i$. Let $\mathbf{p}$ be any valid PMF over the support set $\{1, 2, \ldots, S\}$ and $\hat{\mathbf{p}}$ be an estimate of $\mathbf{p}$.
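The state kept by Algorithms 7 and 8 per internal random variable is just an empirical PMF and a counter, plus the lower-confidence shift applied in step 9. The following is a minimal sketch under our own naming (the class, the `width` argument, and the tiny two-point example are illustrative, not the thesis's code):

```python
import numpy as np

class PMFEstimate:
    """Empirical PMF of one internal random variable with support {0,...,S-1},
    with the element-wise lower-confidence shift used in step 9."""
    def __init__(self, S):
        self.counts = np.zeros(S)
        self.m = 0                      # observation counter m_j
    def update(self, z):
        self.counts[z] += 1
        self.m += 1
    def p_hat(self):
        return self.counts / max(self.m, 1)
    def lcb(self, n, width):
        # p_hat - sqrt(width * ln n / m), clipped at 0; width stands in for
        # the (N+|E|+1)- or (N+|E|+2)/2-style constants of the algorithms
        shift = np.sqrt(width * np.log(n) / max(self.m, 1))
        return np.maximum(self.p_hat() - shift, 0.0)

# feeding revealed realizations drives p_hat toward the true PMF [0.3, 0.7]
est = PMFEstimate(2)
rng = np.random.default_rng(3)
for t in range(1, 501):
    est.update(rng.choice(2, p=[0.3, 0.7]))
```

As $m_j$ grows, the shift shrinks, so the lower-confidence estimates fed to the oracle converge to the true PMF entries.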
The deviation between the function values at these vectors can be bounded as

$$|f(\hat{\mathbf{p}}) - f(\mathbf{p})| \le \sum_{i=1}^{S} \lambda_i |\hat{p}_i - p_i| = \boldsymbol{\lambda}^T |\hat{\mathbf{p}} - \mathbf{p}|, \qquad (7.9)$$

where $\boldsymbol{\lambda}$ is the vector whose $i$-th element is $\lambda_i$, $|\hat{\mathbf{p}} - \mathbf{p}|$ contains the element-wise absolute values of $\hat{\mathbf{p}} - \mathbf{p}$, and $\hat{p}_i$ and $p_i$ represent the $i$-th elements of $\hat{\mathbf{p}}$ and $\mathbf{p}$, respectively.

Theorem 14. If $\hat{\mathbf{p}}_n$ represents the empirical estimate of the PMF $\mathbf{p}$ based on $n$ observations of the random variable, then

$$\Pr\{|f(\hat{\mathbf{p}}_n) - f(\mathbf{p})| \ge \epsilon\} \le 2 S e^{-2(\epsilon')^2 n}, \qquad (7.10)$$

where $\epsilon' = \epsilon / \sum_{i=1}^{S} \lambda_i$.

Proof. We write

$$\Pr\{|f(\hat{\mathbf{p}}_n) - f(\mathbf{p})| \ge \epsilon\} \le \Pr\{\text{for at least one } 1 \le i \le S: |\hat{p}_i - p_i| \ge \epsilon'\} \le \sum_{i=1}^{S} \Pr\{|\hat{p}_i - p_i| \ge \epsilon'\} \overset{(a)}{\le} \sum_{i=1}^{S} 2 e^{-2(\epsilon')^2 n} = 2 S e^{-2(\epsilon')^2 n}, \qquad (7.11)$$

where (a) holds due to Hoeffding's inequality (see appendix A).

This concentration inequality becomes useful in proving the regret bounds for the PMF learning algorithm. The basic idea is that as the PMF estimates of these random variables get close to the true PMFs, the estimates of the expected values calculated using these PMF estimates also get close to the actual expected values of the functions.

7.3.2 The Max Function

Notice that the overall latency for the WDC problem, modeled in equation (7.7), is a non-linear function of the execution and transmission delays. The non-linearity arises due to the presence of the max function. We analyze the max function and prove that it satisfies the Lipschitz continuity property from equation (7.8).

Lemma 2. Let $g_M(\mathbf{p}) = \max\{f_1(\mathbf{p}), f_2(\mathbf{p}), \ldots, f_M(\mathbf{p})\}$, where each function $f_j$ satisfies the Lipschitz continuity property 1 with constants $\boldsymbol{\lambda}_j$. Then the composite function $g_M$ also satisfies property 1 with $\bar{\boldsymbol{\lambda}}_M = \max\{\boldsymbol{\lambda}_1, \boldsymbol{\lambda}_2, \ldots, \boldsymbol{\lambda}_M\}$, where $\max$ refers to the element-wise maximum function.

Proof. We prove this lemma by induction. First we consider

$$g_2(\mathbf{p}) = \max\{f_1(\mathbf{p}), f_2(\mathbf{p})\}. \qquad (7.12)$$

Let $\mathbf{p}_1$ and $\mathbf{p}_2$ be any two vectors in the domain of $g$. We assume without loss of generality that $g(\mathbf{p}_2) \ge g(\mathbf{p}_1)$.
For ease of exposition, let us define the following indexes for $i \in \{1, 2\}$:

$$j_i = \arg\max_{j \in \{1,2\}} f_j(\mathbf{p}_i). \qquad (7.13)$$

Using these notations, we write

$$g(\mathbf{p}_2) - g(\mathbf{p}_1) = f_{j_2}(\mathbf{p}_2) - f_{j_1}(\mathbf{p}_1) \le f_{j_2}(\mathbf{p}_2) - f_{j_2}(\mathbf{p}_1) \le \boldsymbol{\lambda}_{j_2}^T |\mathbf{p}_2 - \mathbf{p}_1|. \qquad (7.14)$$

We also know that $\boldsymbol{\lambda}_{j_2} \le \bar{\boldsymbol{\lambda}}_2 = \max\{\boldsymbol{\lambda}_1, \boldsymbol{\lambda}_2\}$, where the inequalities are assumed to be element-wise. Exactly the same arguments work when we define the functions $g_M$ recursively. When we write $g_M(\mathbf{p}) = \max\{g_{M-1}(\mathbf{p}), f_M(\mathbf{p})\}$, we obtain the stated result with the constants also defined recursively as $\bar{\boldsymbol{\lambda}}_M = \max\{\bar{\boldsymbol{\lambda}}_{M-1}, \boldsymbol{\lambda}_M\}$.

7.3.3 Algorithm Analysis

Let us analyze the regret performance of the two variants of the PMF learning algorithm. Although the algorithms are very similar, we still analyze both of them separately, since the WDC application involves two different kinds of random variables: device-specific and link-specific random variables. We use the concentration inequality from theorem 14 in the analysis of PMF learning, since max is Lipschitz as shown in lemma 2. First, we provide the regret result for the PMF learning algorithm in the WDC application:

Theorem 15 (Regret Result in WDC). The expected regret under algorithm 7 over the first $T$ slots is upper bounded by

$$\max\{\Delta_{max}, P\} \Bigg[ \left( N |\mathcal{Z}^{(x)}| + |E| |\mathcal{Z}^{(xx)}| \right) \frac{\pi^2}{6} + (M + M^2) \left( 1 + \left( M |\mathcal{Z}^{(x)}| + M^2 |\mathcal{Z}^{(xx)}| \right) \frac{\pi^2}{3} \right) + \frac{4 M (M+1)(N + |E| + 1) \ln T}{\left( \frac{\Delta_{min}}{(M |\mathcal{Z}^{(x)}| + M^2 |\mathcal{Z}^{(xx)}|) \lambda_{max}} \right)^2} \Bigg], \qquad (7.15)$$

where $P$ denotes the penalty for playing an action outside of the true feasible set, $\Delta_{max}$ denotes the maximum regret corresponding to a non-optimal action, $\lambda_{max}$ denotes the maximum value of the Lipschitz coefficient across all PMF vectors, $\Delta_{min}$ represents the minimum distance between $\mathbb{E}[\tilde{D}^{(N)}(\mathbf{y}, \mathbf{x}_a)]$ and $\mathbb{E}[\tilde{D}^{(N)}(\mathbf{y}, \mathbf{x}^*(\mathbf{y}))]$ across all contexts for all mappings $a$, and $|\mathcal{Z}^{(x)}|$ and $|\mathcal{Z}^{(xx)}|$ denote the maximum supports of the random variables corresponding to the devices and the links, respectively.

Proof.
When the agent obtains a non-optimal mapping from the oracle at time $t$, it corresponds to the following events:

1. $\mathcal{E}_{1,t}$: the optimal mapping does not satisfy the cost constraint of the agent's optimization problem and is therefore not in the feasible set at $t$;
2. $\mathcal{E}_{2,t}$: there exists at least one other mapping with latency estimates lower than that of the optimal mapping in the agent's problem at $t$.

Let $N(T)$ be the total number of time-slots where non-optimal policies are played by the agent up to time $T$. We can now bound $N(T)$ as follows:

$$N(T) = \sum_{t=1}^{T} \mathbb{I}\{\mathcal{E}_{1,t} \cup \mathcal{E}_{2,t}\} \le \sum_{t=1}^{T} \left( \mathbb{I}\{\mathcal{E}_{1,t}\} + \mathbb{I}\{\mathcal{E}_{2,t}\} \right). \qquad (7.16)$$

Notice that the two events deal with different aspects of the optimization problem. While $\mathcal{E}_{1,t}$ deals with the constraint of the problem, $\mathcal{E}_{2,t}$ deals with the objective of the optimization. We analyze the two events separately and upper bound their probabilities.

7.3.3.1 Infeasible Optimal Mapping

Let $B^*(\mathbf{y})$ be the mean cost of the best mapping for context $\mathbf{y}$. We have

$$N_1(T) = \sum_{t=1}^{T} \mathbb{I}\{\mathcal{E}_{1,t}\} = \sum_{t=1}^{T} \mathbb{I}\left\{ \tilde{C}^{(LCB)}(\mathbf{y}, \mathbf{x}^*(\mathbf{y})) > B \right\} \le \sum_{t=1}^{T} \mathbb{I}\left\{ \tilde{C}^{(LCB)}(\mathbf{y}, \mathbf{x}^*(\mathbf{y})) \ge B^*(\mathbf{y}) \right\} \quad (\text{since } B^*(\mathbf{y}) \le B)$$
$$\le \sum_{t=1}^{T} \sum_{m_{h_1}=1}^{t} \cdots \sum_{m_{h_{|a^*(\cdot)|}}=1}^{t} \mathbb{I}\left\{ \sum_{i=1}^{N} G_{(i)}^{(x_i^*)}(\mathbf{y}, \hat{\mathbf{p}}_{(L)}^{(x_i^*)}(t)) + \sum_{(m,n)\in E} G_{(mn)}^{(x_m^* x_n^*)}(\mathbf{y}, \hat{\mathbf{p}}_{(L)}^{(x_m^* x_n^*)}(t)) \ge B^*(\mathbf{y}) \right\}$$
$$\le \sum_{t=1}^{T} \sum_{m_{h_1}=1}^{t} \cdots \sum_{m_{h_{|a^*(\cdot)|}}=1}^{t} \left[ \sum_{i=1}^{N} \mathbb{I}\left\{ G_{(i)}^{(x_i^*)}(\mathbf{y}, \hat{\mathbf{p}}_{(L)}^{(x_i^*)}(t)) \ge G_{(i)}^{(x_i^*)}(\mathbf{y}, \mathbf{p}^{(x_i^*)}) \right\} + \sum_{(m,n)\in E} \mathbb{I}\left\{ G_{(mn)}^{(x_m^* x_n^*)}(\mathbf{y}, \hat{\mathbf{p}}_{(L)}^{(x_m^* x_n^*)}(t)) \ge G_{(mn)}^{(x_m^* x_n^*)}(\mathbf{y}, \mathbf{p}^{(x_m^* x_n^*)}) \right\} \right], \qquad (7.17)$$

where $h_j$ ($1 \le j \le |a^*(\cdot)|$) represents the $j$-th element in $a^*(\cdot)$. Any mapping can have at most $N$ nodes, one for each task, and $|E|$ links, so at most $(N + |E|)$ random variables. Let us consider the probabilities for each of these terms independently and upper bound them. Let $c_{t,s} = \sqrt{\frac{(N+|E|+2)\ln t}{2s}}$.
$$\Pr\left\{ G_{(i)}^{(x_i^*)}(\mathbf{y}, \hat{\mathbf{p}}_{(L)}^{(x_i^*)}(t)) \ge G_{(i)}^{(x_i^*)}(\mathbf{y}, \mathbf{p}^{(x_i^*)}) \right\} = \Pr\left\{ \sum_{\mathbf{z}_l \in \mathcal{Z}^{(x_i^*)}} g^{(m)}(\mathbf{y}, \mathbf{z}_l)\, \hat{p}_{(L),l}^{(x_i^*)}(t) \ge \sum_{\mathbf{z}_l \in \mathcal{Z}^{(x_i^*)}} g^{(m)}(\mathbf{y}, \mathbf{z}_l)\, p_l^{(x_i^*)} \right\}$$
$$\le \sum_{\mathbf{z}_l \in \mathcal{Z}^{(x_i^*)}} \Pr\left\{ \hat{p}_{(L),l}^{(x_i^*)}(t) \ge p_l^{(x_i^*)} \right\} \le \sum_{\mathbf{z}_l \in \mathcal{Z}^{(x_i^*)}} \Pr\left\{ \hat{p}_{s,l}^{(x_i^*)} - c_{t,s} \ge p_l^{(x_i^*)} \right\} \le \sum_{\mathbf{z}_l \in \mathcal{Z}^{(x_i^*)}} e^{-(N+|E|+2)\ln t} \quad (\text{Lemma 3, Appendix A}) = |\mathcal{Z}^{(x_i^*)}|\, t^{-(N+|E|+2)}. \qquad (7.18)$$

Similarly, we obtain the following result for the other event:

$$\Pr\left\{ G_{(mn)}^{(x_m^* x_n^*)}(\mathbf{y}, \hat{\mathbf{p}}_{(L)}^{(x_m^* x_n^*)}(t)) \ge G_{(mn)}^{(x_m^* x_n^*)}(\mathbf{y}, \mathbf{p}^{(x_m^* x_n^*)}) \right\} \le |\mathcal{Z}^{(x_m^* x_n^*)}|\, t^{-(N+|E|+2)}. \qquad (7.19)$$

We bound the expected number of occurrences of an infeasible optimal mapping as

$$\mathbb{E}[N_1(T)] = \sum_{t=1}^{T} \Pr\{\mathcal{E}_{1,t}\} \le \sum_{t=1}^{\infty} \sum_{m_{h_1}=1}^{t} \cdots \sum_{m_{h_{|a^*(\cdot)|}}=1}^{t} \left( \sum_{i=1}^{N} |\mathcal{Z}^{(x_i^*)}|\, t^{-(N+|E|+2)} + \sum_{(m,n)\in E} |\mathcal{Z}^{(x_m^* x_n^*)}|\, t^{-(N+|E|+2)} \right)$$
$$\le \left( N |\mathcal{Z}^{(x)}| + |E| |\mathcal{Z}^{(xx)}| \right) \sum_{t=1}^{\infty} t^{-2} = \left( N |\mathcal{Z}^{(x)}| + |E| |\mathcal{Z}^{(xx)}| \right) \frac{\pi^2}{6}, \qquad (7.20)$$

where $|\mathcal{Z}^{(x)}|$ and $|\mathcal{Z}^{(xx)}|$ denote the maximum supports of the random variables corresponding to the devices and the links, respectively.

7.3.3.2 Non-Optimal Latency Estimates

Let us introduce counters $\tilde{N}_j(t)$ and $\tilde{N}_{jk}(t)$, after the initialization phase, corresponding to the computing nodes and their connecting links, respectively. If at time-slot $t$ a non-optimal allocation is chosen by the agent, and it has lower latency estimates than the optimal mapping in the agent's problem at $t$, we increment exactly one of the counters by 1. We increment the counter whose corresponding value of $m$ is the smallest. If there are multiple such counters with the same $m$ values, we break ties arbitrarily. This is inspired by the technique used in [5].
Since exactly one of the counters gets incremented each time a non-optimal allocation is used, the total number of non-optimal plays of this case, $N_2(T)$, is equal to the summation of all the counters:

$$N_2(T) = \sum_{t=1}^{T} \mathbb{I}\{\mathcal{E}_{2,t}\} = \sum_{j \in V} \tilde{N}_j(T) + \sum_{(j,k) \in E} \tilde{N}_{jk}(T). \qquad (7.21)$$

Also note that the following inequalities hold for the counters:

$$\tilde{N}_j(t) \le m_j(t) \quad \forall j \in V, \qquad \tilde{N}_{jk}(t) \le m_{jk}(t) \quad \forall jk \in E.$$

Let $\tilde{I}_j(t)$ and $\tilde{I}_{jk}(t)$ be the indicator functions that equal 1 if their corresponding counters are incremented by 1 at time $t$. For an arbitrary positive integer $l$, we have

$$\tilde{N}_j(T) = \sum_{t=|V|+|E|+1}^{T} \mathbb{I}\{\tilde{I}_j(t) = 1\} \le l + \sum_{t=|V|+|E|+1}^{T} \mathbb{I}\{\tilde{I}_j(t) = 1, \tilde{N}_j(t-1) \ge l\}. \qquad (7.22)$$

When $\tilde{I}_j(t) = 1$, a non-optimal action $a(t)$ has been picked and the value of $m$ for the $j$-th node must be the smallest among the nodes in $a(t)$. Notice that we denote this action as $a(t)$, since at each time with $\tilde{I}_j(t) = 1$ we could have different actions. Note that similar expressions also hold for the link-specific counters. Let $C_{t,s} = \sqrt{\frac{(N+|E|+1)\ln t}{s}}$. Now we have

$$\tilde{N}_j(T) \le l + \sum_{t=|V|+|E|+1}^{T} \mathbb{I}\left\{ \tilde{D}^{(N)}_{(LCB)}(\mathbf{y}, \mathbf{x}^*(\mathbf{y})) \ge \tilde{D}^{(N)}_{(LCB)}(\mathbf{y}, \mathbf{x}_{a(t)}),\; \tilde{N}_j(t-1) \ge l \right\} \le l + \sum_{t=|V|+|E|}^{T} \mathbb{I}\left\{ \tilde{D}^{(N)}_{(LCB)}(\mathbf{y}, \mathbf{x}^*(\mathbf{y})) \ge \tilde{D}^{(N)}_{(LCB)}(\mathbf{y}, \mathbf{x}_{a(t)}),\; \tilde{N}_j(t) \ge l \right\}. \qquad (7.23)$$

Note that $\tilde{N}_j(t) \ge l$ implies

$$l \le \tilde{N}_j(t) \le m_i(t) \quad \forall i \in \mathcal{A}_{a(t+1)}, \qquad l \le \tilde{N}_j(t) \le m_{ik}(t) \quad \forall ik \in \mathcal{A}_{a(t+1)}. \qquad (7.24)$$

Now we write

$$\tilde{N}_j(T) \le l + \sum_{t=|V|+|E|}^{T} \mathbb{I}\left\{ \max_{0 < m_{a^*(\cdot)} \le t} \tilde{D}^{(N)}_{(LCB), m_{a^*(\cdot)}}(\mathbf{y}, \mathbf{x}^*(\mathbf{y})) \ge \min_{l < m_{a(t)} \le t} \tilde{D}^{(N)}_{(LCB), m_{a(t)}}(\mathbf{y}, \mathbf{x}_{a(t)}) \right\}$$
$$\le l + \sum_{t=|V|+|E|}^{\infty} \sum_{m_{h_1}=1}^{t} \cdots \sum_{m_{h_{|a^*(\cdot)|}}=1}^{t} \sum_{m_{p_1}=l}^{t} \cdots \sum_{m_{p_{|a(t)|}}=l}^{t} \mathbb{I}\left\{ \tilde{D}^{(N)}_{(LCB), m_{a^*(\cdot)}}(\mathbf{y}, \mathbf{x}^*(\mathbf{y})) \ge \tilde{D}^{(N)}_{(LCB), m_{a(t)}}(\mathbf{y}, \mathbf{x}_{a(t)}) \right\}, \qquad (7.25)$$

where $h_j$ ($1 \le j \le |a^*(\cdot)|$) represents the $j$-th element in $a^*(\cdot)$ and $p_j$ ($1 \le j \le |a(t)|$) represents the $j$-th element in $a(t)$.
Note that $\mathbb{I}\{ \tilde{D}^{(N)}_{(LCB), m_{a^*(\cdot)}}(\mathbf{y}, \mathbf{x}^*(\mathbf{y})) \ge \tilde{D}^{(N)}_{(LCB), m_{a(t)}}(\mathbf{y}, \mathbf{x}_{a(t)}) \}$ means at least one of the following must be true:

$$\tilde{D}^{(N)}_{(LCB), m_{a^*(\cdot)}}(\mathbf{y}, \mathbf{x}^*(\mathbf{y})) \ge \mathbb{E}[\tilde{D}^{(N)}(\mathbf{y}, \mathbf{x}^*(\mathbf{y}))], \qquad (7.26)$$
$$\mathbb{E}[\tilde{D}^{(N)}(\mathbf{y}, \mathbf{x}_{a(t)})] \ge \tilde{D}^{(N)}_{(UCB), m_{a(t)}}(\mathbf{y}, \mathbf{x}_{a(t)}), \qquad (7.27)$$
$$\tilde{D}^{(N)}_{(UCB), m_{a(t)}}(\mathbf{y}, \mathbf{x}_{a(t)}) - \tilde{D}^{(N)}_{(LCB), m_{a(t)}}(\mathbf{y}, \mathbf{x}_{a(t)}) > \mathbb{E}[\tilde{D}^{(N)}(\mathbf{y}, \mathbf{x}_{a(t)})] - \mathbb{E}[\tilde{D}^{(N)}(\mathbf{y}, \mathbf{x}^*(\mathbf{y}))]. \qquad (7.28)$$

Let us find an upper bound on the probability of the event in (7.26). We have

$$\Pr\left\{ \tilde{D}^{(N)}_{(LCB), m_{a^*(\cdot)}}(\mathbf{y}, \mathbf{x}^*(\mathbf{y})) \ge \mathbb{E}[\tilde{D}^{(N)}(\mathbf{y}, \mathbf{x}^*(\mathbf{y}))] \right\} \le \Pr\Big\{ \text{the following holds for at least one } \mathbf{z}_k \in \mathcal{Z}^{(x_i)} \text{ or } \mathbf{z}_{ko} \in \mathcal{Z}^{(x_{ij})}: \hat{p}^{(x_i)}_{s,k} - C_{t,s} \ge p^{(x_i)}_k \text{ or } \hat{p}^{(x_{ij})}_{s,ko} - C_{t,s} \ge p^{(x_{ij})}_{ko} \Big\}$$
$$\le \left( M |\mathcal{Z}^{(x)}| + M^2 |\mathcal{Z}^{(xx)}| \right) e^{-2(N+|E|+1)\ln t} \quad (\text{Lemma 3, Appendix A}) = \frac{M |\mathcal{Z}^{(x)}| + M^2 |\mathcal{Z}^{(xx)}|}{t^{2(N+|E|+1)}}, \qquad (7.29)$$

since there can be at most $M$ devices and $M^2$ links used in the scheduled mapping. Similarly, we have the following bound for the event in equation (7.27):

$$\Pr\left\{ \mathbb{E}[\tilde{D}^{(N)}(\mathbf{y}, \mathbf{x}_{a(t)})] \ge \tilde{D}^{(N)}_{(UCB), m_{a(t)}}(\mathbf{y}, \mathbf{x}_{a(t)}) \right\} \le \frac{M |\mathcal{Z}^{(x)}| + M^2 |\mathcal{Z}^{(xx)}|}{t^{2(N+|E|+1)}}. \qquad (7.30)$$

Notice that the third condition, in equation (7.28), never holds for $l \ge \left\lceil \frac{4(N+|E|+1)\ln t}{\left( \frac{\Delta_{min}}{(M|\mathcal{Z}^{(x)}|+M^2|\mathcal{Z}^{(xx)}|)\lambda_{max}} \right)^2} \right\rceil$, where $\lambda_{max}$ denotes the maximum value of the Lipschitz coefficient across all PMF vectors and $\Delta_{min}$ represents the minimum distance between $\mathbb{E}[\tilde{D}^{(N)}(\mathbf{y}, \mathbf{x}_{a(t)})]$ and $\mathbb{E}[\tilde{D}^{(N)}(\mathbf{y}, \mathbf{x}^*(\mathbf{y}))]$ across all contexts for all mappings $a(t)$. Therefore, we choose $l = \left\lceil \frac{4(N+|E|+1)\ln T}{\left( \frac{\Delta_{min}}{(M|\mathcal{Z}^{(x)}|+M^2|\mathcal{Z}^{(xx)}|)\lambda_{max}} \right)^2} \right\rceil$ from now on. We obtain

$$\mathbb{E}[\tilde{N}_j(T)] \le \left\lceil \frac{4(N+|E|+1)\ln T}{\left( \frac{\Delta_{min}}{(M|\mathcal{Z}^{(x)}|+M^2|\mathcal{Z}^{(xx)}|)\lambda_{max}} \right)^2} \right\rceil + \sum_{t=1}^{\infty} \sum_{m_{h_1}=1}^{t} \cdots \sum_{m_{h_{|a^*(\cdot)|}}=1}^{t} \sum_{m_{p_1}=l}^{t} \cdots \sum_{m_{p_{|a(t)|}}=l}^{t} 2\, \frac{M|\mathcal{Z}^{(x)}|+M^2|\mathcal{Z}^{(xx)}|}{t^{2(N+|E|+1)}}$$
$$\le \frac{4(N+|E|+1)\ln T}{\left( \frac{\Delta_{min}}{(M|\mathcal{Z}^{(x)}|+M^2|\mathcal{Z}^{(xx)}|)\lambda_{max}} \right)^2} + 1 + 2\left( M|\mathcal{Z}^{(x)}|+M^2|\mathcal{Z}^{(xx)}| \right) \sum_{t=1}^{\infty} t^{-2} = \frac{4(N+|E|+1)\ln T}{\left( \frac{\Delta_{min}}{(M|\mathcal{Z}^{(x)}|+M^2|\mathcal{Z}^{(xx)}|)\lambda_{max}} \right)^2} + 1 + \left( M|\mathcal{Z}^{(x)}|+M^2|\mathcal{Z}^{(xx)}| \right) \frac{\pi^2}{3}. \qquad (7.31)$$

Note that a similar upper bound also holds for the other set of counters $\tilde{N}_{jk}(T)$.
From equation (7.21), we now have the following upper bound:

$$\mathbb{E}[N_2(T)] \le (M + M^2) \left( \frac{4(N+|E|+1)\ln T}{\left( \frac{\Delta_{min}}{(M|\mathcal{Z}^{(x)}|+M^2|\mathcal{Z}^{(xx)}|)\lambda_{max}} \right)^2} + 1 + \left( M|\mathcal{Z}^{(x)}|+M^2|\mathcal{Z}^{(xx)}| \right) \frac{\pi^2}{3} \right)$$
$$= \frac{4M(M+1)(N+|E|+1)\ln T}{\left( \frac{\Delta_{min}}{(M|\mathcal{Z}^{(x)}|+M^2|\mathcal{Z}^{(xx)}|)\lambda_{max}} \right)^2} + (M+M^2)\left( 1 + \left( M|\mathcal{Z}^{(x)}|+M^2|\mathcal{Z}^{(xx)}| \right) \frac{\pi^2}{3} \right). \qquad (7.32)$$

7.3.3.3 Regret Bound

The non-optimal events described above can lead the agent to play an action that fails to satisfy the constraint of the true optimization problem at that slot, which results in a penalty; or the action can satisfy the constraint but result in regret due to its non-optimality. Let $P$ denote the penalty for playing an action outside of the true feasible set and $\Delta_{max}$ denote the maximum regret corresponding to a non-optimal action. The total regret can, therefore, be bounded as follows:

$$R_A(T) \le \max\{\Delta_{max}, P\} \left( \mathbb{E}[N_1(T)] + \mathbb{E}[N_2(T)] \right) \le \max\{\Delta_{max}, P\} \Bigg[ \left( N|\mathcal{Z}^{(x)}|+|E||\mathcal{Z}^{(xx)}| \right)\frac{\pi^2}{6} + (M+M^2)\left(1 + \left( M|\mathcal{Z}^{(x)}|+M^2|\mathcal{Z}^{(xx)}| \right)\frac{\pi^2}{3}\right) + \frac{4M(M+1)(N+|E|+1)\ln T}{\left( \frac{\Delta_{min}}{(M|\mathcal{Z}^{(x)}|+M^2|\mathcal{Z}^{(xx)}|)\lambda_{max}} \right)^2} \Bigg]. \qquad (7.33)$$

Similarly, we obtain the following result for the general version of the PMF learning algorithm:

Theorem 16 (General Regret Result). The expected regret under algorithm 8 over the first $T$ slots is upper bounded by

$$\max\{\Delta_{max}, P\} \left[ \frac{4(L+1)U^3 \ln T}{\left( \frac{\Delta_{min}}{|\mathcal{Z}|\lambda_{max}} \right)^2} + U + (2U^2 + L)|\mathcal{Z}| \frac{\pi^2}{6} \right], \qquad (7.34)$$

where $P$ denotes the penalty for playing an action outside of the true feasible set, $\Delta_{max}$ denotes the maximum regret corresponding to a non-optimal action, $\lambda_{max}$ denotes the maximum value of the Lipschitz coefficient across all PMF vectors, $\Delta_{min}$ represents the minimum distance between $\mathbb{E}[F_1(\mathbf{y}, \mathbf{x})]$ and $\mathbb{E}[F_1(\mathbf{y}, \mathbf{x}^*(\mathbf{y}))]$ across all contexts, and $|\mathcal{Z}|$ denotes the maximum support of the internal random variables.

Proof. The regret proof follows similar steps as the previous result. We analyze the occurrences of the two types of events in $N_1(T)$ and $N_2(T)$.
First, we bound the expected number of occurrences of an infeasible optimal action vector as

$$\mathbb{E}[N_1(T)] \le L|\mathcal{Z}|\frac{\pi^2}{6}, \qquad (7.35)$$

where $|\mathcal{Z}|$ denotes the maximum support of the internal random variables. Next, we bound the number of occurrences of the event where the true optimal action vector is not optimal for the agent's optimization problem as

$$\mathbb{E}[N_2(T)] \le U\left( \frac{4(L+1)\ln T}{\left( \frac{\Delta_{min}}{U|\mathcal{Z}|\lambda_{max}} \right)^2} + 1 + U|\mathcal{Z}|\frac{\pi^2}{3} \right) = \frac{4(L+1)U^3 \ln T}{\left( \frac{\Delta_{min}}{|\mathcal{Z}|\lambda_{max}} \right)^2} + U + U^2|\mathcal{Z}|\frac{\pi^2}{3}. \qquad (7.36)$$

Let $P$ denote the penalty for playing an action outside of the true feasible set and $\Delta_{max}$ denote the maximum regret corresponding to a non-optimal action. The total regret can, therefore, be bounded as follows:

$$R_A(T) \le \max\{\Delta_{max}, P\}\left( \mathbb{E}[N_1(T)] + \mathbb{E}[N_2(T)] \right) \le \max\{\Delta_{max}, P\}\left[ \frac{4(L+1)U^3 \ln T}{\left( \frac{\Delta_{min}}{|\mathcal{Z}|\lambda_{max}} \right)^2} + U + U^2|\mathcal{Z}|\frac{\pi^2}{3} + L|\mathcal{Z}|\frac{\pi^2}{6} \right]. \qquad (7.37)$$

Notice that the structures of the regret bounds are very similar for both algorithms. The regret grows logarithmically over time and is polynomial in the total number of random variables in the system. This is in contrast with bandit algorithms that treat each action vector independently as an arm and end up incurring regret scaling exponentially in the total number of random variables.

Remark 8. Note that when the random variables are not discrete, we can discretize their support space, similarly to our CCB algorithm from chapter 5. However, in such cases, the maximum sizes of the support spaces, $|\mathcal{Z}^{(x)}|$ and $|\mathcal{Z}^{(xx)}|$, increase, thereby increasing the regret. These can be tuned according to our CCB algorithm by using the doubling trick.

Remark 9. The PMF learning algorithm does not rely on learning the statistics of the random variables corresponding to all contexts, since it calculates the required statistics using the estimates of the PMFs. This means that the context spaces do not need to be finite or discrete, which is the real strength of our algorithm.
The PMF learning algorithm, therefore, can be used in a variety of applications where the context spaces are not discrete. It is not restricted to the combinatorial setting discussed in this chapter.

7.4 Numerical Simulation Results

We now simulate a task graph with 4 tasks and a network of 3 devices connected to each other, as shown in figure 1.5. There are a total of 12 random variables, two corresponding to each of the devices and the links. The context is a vector of length 8 with entries for each of the task weights and the edge weights of the task graph, which are drawn from a Bernoulli distribution. The link- and device-specific random variables are shifted Bernoulli random variables that take values from $\{1, 2\}$. In figure 7.2, we compare our PMF learning algorithm against a naive algorithm that treats each mapping as an independent arm and runs a separate instance of UCB for each context.

Figure 7.2: Comparison of the regret results for the contextual combinatorial bandit problem.

We see that the cumulative regret divided by $\log n$ converges to a constant value much faster for our algorithm than for the naive algorithm, which is still in the linear regret region and is yet to converge.

7.5 Summary

We have considered the problem of contextual combinatorial bandits and its application in WDC. We have proposed a novel online PMF learning algorithm that exploits the knowledge of the functional forms of the metrics under consideration and learns the distributions of the internal random variables of the system over time. We have proved an upper bound on the cumulative regret of our algorithm that scales logarithmically with time and is polynomial in the length of the action vector. Such an approach to learning the distributions of the random variables can also be extended to continuous random variables using discretization and the doubling trick.

Chapter 8

Conclusion

In this thesis, we have discussed various bandit problems and their applications in wireless networks.
As we can see, sequential decision making in unknown environments is ubiquitous in networking. These problems are rich in the side-information, or context, available before making the decisions. They differ, however, from the traditional bandit problems considered in advertising, clinical trials, and news recommendations because of the knowledge of the reward functions. Depending on the application at hand, these costs and rewards can represent the data rate over a wireless channel, the execution delay of a task, the communication delay of a data transmission, etc. These costs and rewards depend on the side information and on some internal random variables whose distributions are unknown to the agent. The agent, therefore, needs to learn about the environment by making decisions over time and improve its policies along the way.

In chapter 5, we have considered the contextual bandits that arise in channel selection and power allocation over wireless channels and proposed two novel algorithms, DCB and CCB, for discrete context spaces and continuous context spaces, respectively. We have proved a logarithmic upper bound on the regret of DCB and also its order optimality by showing a matching lower bound. By extending DCB to continuous cases, we have shown a regret upper bound for CCB that is much lower than the previously known upper bounds for continuous context spaces.

In chapter 6, we have considered cases where the context is also affected by the agent's action in the previous time slot. We have modeled these problems, arising in energy harvesting communications, by Markov decision processes that can be solved by linear programming. However, because of the lack of distributional knowledge, the agent cannot solve the LP directly; it learns the reward estimates over time and uses them to solve the LP. We have proved a constant regret upper bound for this LPSM algorithm.
We have reduced the computational requirements for this problem by proposing an epoch-based version of LPSM, called epoch-LPSM, that also has a constant regret upper bound. Then, we have extended this algorithm to consider more complex problems involving multiple decisions in each slot, proposed our MC-LPSM algorithm, and shown a logarithmic upper bound on its regret. We have proved its order optimality by proving a matching lower bound for the problem.

We have also extended the contextual bandit setting to consider combinatorial decision making in distributed computing. In chapter 7, we have shown how the task graph features act as contexts and their mapping to the computing nodes acts as a combinatorial action for these problems. We have proposed a novel PMF learning algorithm, where the agent tries to learn the distributions of the internal random variables over time by using the computing devices and the links connecting them. We have used LCB-based estimates for the PMF vectors of the internal random variables in the optimization problem and proved a logarithmic regret upper bound for the algorithm. In the process, we have also proved an interesting concentration inequality for Lipschitz functions of random variables.

Through our work, we have shown how bandit algorithms can use the side information and the knowledge of the reward functions to improve their regret performance. We hope that online learning will be more widely used in networking in the years to come, given the novel algorithms and interesting analytical tools it offers.

Appendix A

Sum of Bounded Random Variables

We use the following version of Hoeffding's inequality [53].

Lemma 3 (Hoeffding's Inequality). Let $Y_1, \ldots, Y_n$ be i.i.d. random variables with mean $\mu$ and range $[0, 1]$. Let $S_n = \sum_{t=1}^{n} Y_t$.
Then for all $a \ge 0$,

$$\Pr\{S_n \ge n\mu + a\} \le e^{-2a^2/n}, \qquad \Pr\{S_n \le n\mu - a\} \le e^{-2a^2/n}.$$

Appendix B

Proof of Theorem 1: Bound on the Pulls of Non-Optimal Arms

For any $j \in \mathcal{O}$, we write

$$T_j(n) = 1 + \sum_{t=K+1}^{n} \mathbb{I}\{A(t) = j\} \le l + \sum_{t=K+1}^{n} \mathbb{I}\{A(t) = j, T_j(t-1) \ge l\}, \qquad (B.1)$$

where $\mathbb{I}(x)$ is the indicator function, defined to be 1 when the predicate $x$ is true and 0 otherwise, and $l$ is an arbitrary positive integer. Let $C_{t,s} = \sqrt{\frac{(2+\gamma)\ln t}{s}}$. Let $\hat{\mu}_{i,j,s}$ denote $\hat{\mu}_{i,j}$ when the $j$-th arm has been pulled $s$ times. We use a superscript $*$ to refer to the optimal arm's statistics. For example, we write $\hat{\mu}^*_{i,s}$ and $T^*_i(t-1)$ to respectively denote $\hat{\mu}_{i,j,s}$ and $T_j(t-1)$ for the optimal arm $j = h^*(\mathbf{y}^{(i)})$. Our idea is to upper bound the probability of the indicator of the event $\{A(t) = j, T_j(t-1) \ge l\}$ and bound the number of pulls as

$$\mathbb{E}[T_j(n)] \le l + \sum_{t=K+1}^{n} \Pr\{A(t)=j, T_j(t-1) \ge l\} = l + \sum_{t=K+1}^{n} \sum_{i=1}^{M} \Pr\{A(t)=j, T_j(t-1)\ge l, \mathbf{y}_t = \mathbf{y}^{(i)}\}$$
$$= l + \sum_{i=1}^{M} p_i \sum_{t=K+1}^{n} \Pr\{A(t)=j, T_j(t-1)\ge l \mid \mathbf{y}_t = \mathbf{y}^{(i)}\}$$
$$\le l + \sum_{i=1}^{M} p_i \sum_{t=K+1}^{n} \Pr\left\{ \hat{\mu}^*_{i,T^*_i(t-1)} + G_i C_{t-1,T^*_i(t-1)} \le \hat{\mu}_{i,j,T_j(t-1)} + G_i C_{t-1,T_j(t-1)},\; T_j(t-1)\ge l \right\}$$
$$\le l + \sum_{i=1}^{M} p_i \sum_{t=K+1}^{n} \Pr\left\{ \min_{1\le s< t} \left( \hat{\mu}^*_{i,s} + G_i C_{t-1,s} \right) \le \max_{l \le s_j < t} \left( \hat{\mu}_{i,j,s_j} + G_i C_{t-1,s_j} \right) \right\}$$
$$\le l + \sum_{i=1}^{M} p_i \sum_{t=1}^{\infty} \sum_{s=1}^{t-1} \sum_{s_j=l}^{t-1} \Pr\left\{ \hat{\mu}^*_{i,s} + G_i C_{t,s} \le \hat{\mu}_{i,j,s_j} + G_i C_{t,s_j} \right\}. \qquad (B.2)$$

Observe that $\hat{\mu}^*_{i,s} + G_i C_{t,s} \le \hat{\mu}_{i,j,s_j} + G_i C_{t,s_j}$ cannot hold unless at least one of the following conditions holds:

$$\hat{\mu}^*_{i,s} \le \mu^*_i - G_i C_{t,s}, \qquad (B.3)$$
$$\hat{\mu}_{i,j,s_j} \ge \mu_{i,j} + G_i C_{t,s_j}, \qquad (B.4)$$
$$\mu^*_i < \mu_{i,j} + 2 G_i C_{t,s_j}. \qquad (B.5)$$

Using Hoeffding's inequality (see appendix A) on (B.3) and (B.4), we get

$$\Pr\{\hat{\mu}^*_{i,s} \le \mu^*_i - G_i C_{t,s}\} \le e^{-2(2+\gamma)\ln t} = t^{-2(2+\gamma)}, \qquad \Pr\{\hat{\mu}_{i,j,s_j} \ge \mu_{i,j} + G_i C_{t,s_j}\} \le e^{-2(2+\gamma)\ln t} = t^{-2(2+\gamma)}.$$

For $s_j \ge l = \left\lceil \frac{4(2+\gamma)\ln n}{\left( \min_{1\le i\le M} \Delta_j^{(i)} \right)^2} \right\rceil$, we get

$$\mu^*_i - \mu_{i,j} - 2 G_i C_{t,s_j} = \Delta_j^{(i)} - 2 G_i \sqrt{\frac{(2+\gamma)\ln t}{s_j}} \ge \Delta_j^{(i)} - \min_{1\le i\le M} \Delta_j^{(i)} \ge 0.$$

Hence, the condition (B.5) is false, and (B.2) reduces to

$$\mathbb{E}[T_j(n)] \le \left\lceil \frac{4(2+\gamma)\ln n}{\left( \min_{1\le i\le M} \Delta_j^{(i)} \right)^2} \right\rceil + \sum_{i=1}^{M} p_i \sum_{t=1}^{\infty} \sum_{s=1}^{t-1} \sum_{s_j=l}^{t-1} \left( \Pr\{\hat{\mu}^*_{i,s} \le \mu^*_i - G_i C_{t,s}\} + \Pr\{\hat{\mu}_{i,j,s_j} \ge \mu_{i,j} + G_i C_{t,s_j}\} \right)$$
j 2 3 7 7 7 7 + M X i=1 p i 1 X t=1 t X s=1 t X s j =l 2t 2(2+) 160 2 6 6 6 6 4(2 +) lnn min 1iM (i) j 2 3 7 7 7 7 + 2 M X i=1 p i 1 X t=1 t 2(1+) 2 6 6 6 6 4(2 +) lnn min 1iM (i) j 2 3 7 7 7 7 + 2 3 M X i=1 p i 4(2 +) lnn min 1iM (i) j 2 + 1 + 2 3 ; which concludes the proof. 161 Appendix C Proof of Lemma 1: High probability bound for UCB1() Let denote the instance of UCB1() in the standard MAB problem and j the index of optimal arm. UCB1() stores the empirical means of arm-rewards. We use ^ X j;s and ^ X s to respectively denote the averages of arm rewards for j-th arm and the optimal arm when they have been pulled for s trials. If the optimal arm gets pulled less than n K times during the rst n trials, then according to the pigeonhole principle there must exist an arm that gets pulled more than n K times. Pr n T (n)< n K o Pr n 9j2Snfj g :T j (n)> n K o X j2Snfj g Pr n T j (n)> n K o : (C.1) 162 For the j-th arm, if T j (n) n K + 1, then at some n K + 1tn it must have been pulled for the ( n K + 1)-th time. We track this trial-index t and bound the probabilities for non-optimal arms. Pr n T j (n)> n K o n X t=b n K c+1 Pr n (t) =j;T j (t 1) = j n K ko n X t=b n K c+1 Pr n ^ X t (t1) +C t1;T (t1) ^ X j;b n K c +C t1;b n K c o n X t=b n K c+1 Pr min 1s<t ^ X s +C t1;s ^ X j;b n K c +C t1;b n K c n X t=b n K c+1 t1 X s=1 Pr n ^ X s +C t1;s ^ X j;b n K c +C t1;b n K c o : (C.2) Note that ^ X s +C t1;s ^ X j;b n K c +C t1;b n K c implies that at least one of the following must hold ^ X s C t;s (C.3) ^ X j;b n K c j C t;b n K c (C.4) < j + 2C t;b n K c (C.5) Using Hoeding's inequality we bound the probability of events (C.3) and (C.4) as Pr n ^ X s C t;s o e 2(2+) lnt =t 2(2+) ; 163 Pr n ^ X j;b n K c j C t;b n K c o e 2(2+) lnt =t 2(2+) : As stated in (5.14) for n K > 4(2+) lnn min j6=j ( j ) 2 , j 2C t;b n K c = j 2 s (2 +) lnt n K > j 2 s (2 +) lnn n K > j min 1jK ( j ) 0: (C.6) Thus, condition (C.5) is false when n K > 4(2+) lnn min j6=j ( j ) 2 . 
Using the union bound on (C.2), for every non-optimal arm we get
\begin{align}
\Pr\Big\{T_j(n) > \frac{n}{K}\Big\}
&\le \sum_{t=\lfloor n/K\rfloor+1}^{n} \sum_{s=1}^{t-1} \Big(\Pr\{\bar{X}^*_s \le \mu^* - C_{t,s}\} + \Pr\{\bar{X}_{j,\lfloor n/K\rfloor} \ge \mu_j + C_{t,\lfloor n/K\rfloor}\}\Big) \notag\\
&\le \sum_{t=\lfloor n/K\rfloor+1}^{n} \sum_{s=1}^{t-1} 2\, t^{-2(2+\epsilon)}
\le 2 \sum_{t=\lfloor n/K\rfloor+1}^{n} t^{-3-2\epsilon}
< 2 \sum_{t=\lfloor n/K\rfloor+1}^{n} \Big(\frac{n}{K}\Big)^{-3-2\epsilon}
< 2n \Big(\frac{n}{K}\Big)^{-3-2\epsilon}
= \frac{2K^{3+2\epsilon}}{n^{2+2\epsilon}}. \tag{C.7}
\end{align}
Substituting this in (C.1), for all $n$ satisfying condition (5.14), we get
\[
\Pr\Big\{T^*(n) < \frac{n}{K}\Big\} < \sum_{j\in S\setminus\{j^*\}} \frac{2K^{3+2\epsilon}}{n^{2+2\epsilon}} < \frac{2K^{4+2\epsilon}}{n^{2+2\epsilon}}.
\]

Appendix D

Proof of Theorem 2: Bound on the non-optimal pulls of optimal arms

Fix $j \in \mathcal{O}$. Let $E_{j,t}$ denote the event that the $j$-th arm is pulled non-optimally at the $t$-th trial. The total number of non-optimal pulls can therefore be written as $T^N_j(n) = \sum_{t=1}^{n} \mathbb{I}\{E_{j,t}\}$. Let $E^1_{j,t}$ denote the event that at least one of the contexts $y^{(i)} \in \mathcal{Y}_j$ has not occurred even half the number of its expected occurrences till the $t$-th trial. Additionally, let $E^2_{j,t}$ be the event that the $j$-th arm has not been pulled for at least a $\frac{1}{K}$ fraction of such occurrences by the optimal hypothesis till the $t$-th trial. In terms of these events, we write the following bound on the expected number of non-optimal pulls:
\[
\mathbb{E}[T^N_j(n)] = \sum_{t=1}^{n} \Pr\{E_{j,t}\} \le \sum_{t=1}^{n} \Big(\Pr\{E^1_{j,t}\} + \Pr\{E^2_{j,t}\cap \overline{E^1_{j,t}}\} + \Pr\{E_{j,t}\cap \overline{E^1_{j,t}}\cap \overline{E^2_{j,t}}\}\Big). \tag{D.1}
\]

Under-realized contexts

Let $N_i(n)$ denote the number of occurrences of $y^{(i)}$ till the $n$-th trial. Thus, $E^1_{j,t}$ corresponds to the event where $N_i(t) \le \frac{p_i t}{2}$ for at least one context $y^{(i)} \in \mathcal{Y}_j$. In terms of indicators, $N_i(n) = \sum_{t=1}^{n} \mathbb{I}\{y_t = y^{(i)}\}$ and $\mathbb{E}[N_i(n)] = p_i n$. These indicator random variables are i.i.d., and thus we use Hoeffding's inequality (Appendix A) with deviation $p_i n/2$ to get
\[
\Pr\Big\{N_i(n) \le \frac{p_i}{2} n\Big\} \le \exp\Big(-\frac{p_i^2}{2} n\Big). \tag{D.2}
\]
The exponential bound is important, since it helps us bound the number of occurrences of context under-realization $N(E^1_j)$ by a constant. Note that $N(E^1_j) = \sum_{n=1}^{\infty} \mathbb{I}\{E^1_{j,n}\}$.
We obtain a bound on its expectation, which also bounds the first term in equation (D.1), as follows:
\begin{align}
\mathbb{E}\big[N(E^1_j)\big] &= \sum_{n=1}^{\infty} \Pr\{E^1_{j,n}\} \notag\\
&\le \sum_{n=1}^{\infty} \sum_{y^{(i)}\in\mathcal{Y}_j} \exp\Big(-\frac{p_i^2}{2} n\Big) \quad \text{(union bound)} \notag\\
&\le \sum_{y^{(i)}\in\mathcal{Y}_j} \frac{\exp\big\{-\frac{p_i^2}{2}\big\}}{1 - \exp\big\{-\frac{p_i^2}{2}\big\}}
\le \sum_{y^{(i)}\in\mathcal{Y}_j} \frac{2}{p_i^2}. \tag{D.3}
\end{align}

Under-exploited arm

We now assume that no context $y^{(i)} \in \mathcal{Y}_j$ is under-realized and yet the $j$-th arm is not pulled often enough. For these contexts the $j$-th arm is optimal, and we nevertheless do not end up pulling it enough. We define an event $E^2_j$ corresponding to the existence of a context $y^{(i)} \in \mathcal{Y}_j$ for which the arm is under-exploited. We now bound the probability that for some context $y^{(i)} \in \mathcal{Y}_j$, $T^O_{j,i}(n) < \frac{N_i(n)}{K}$, where $T^O_{j,i}(n)$ denotes the number of pulls of the $j$-th arm for context $y^{(i)}$. Here, we use a high-probability bound (see Lemma 1) on the optimal arm for UCB1($\epsilon$), which we prove in Appendix C. We upper bound the second term from equation (D.1) as follows:
\begin{align}
\mathbb{E}\big[N(E^2_j \cap \overline{E^1_j})\big] &= \sum_{n=1}^{\infty} \Pr\{E^2_{j,n} \cap \overline{E^1_{j,n}}\} \notag\\
&\overset{(a)}{\le} n_o + \sum_{n=n_o}^{\infty} \sum_{y^{(i)}\in\mathcal{Y}_j} \Pr\Big\{T^O_{j,i}(n) < \frac{N_i(n)}{K},\ N_i(n) > \frac{p_i n}{2}\Big\} \notag\\
&\le n_o + \sum_{n=n_o}^{\infty} \sum_{y^{(i)}\in\mathcal{Y}_j} \sum_{m=\lceil p_i n/2\rceil}^{n} \Pr\Big\{T^O_{j,i}(n) < \frac{N_i(n)}{K},\ N_i(n) = m\Big\} \notag\\
&= n_o + \sum_{y^{(i)}\in\mathcal{Y}_j} \sum_{n=n_o}^{\infty} \sum_{m=\lceil p_i n/2\rceil}^{n} \Pr\Big\{T^O_{j,i}(n) < \frac{N_i(n)}{K} \,\Big|\, N_i(n) = m\Big\} \Pr\{N_i(n) = m\} \notag\\
&\le n_o + \sum_{y^{(i)}\in\mathcal{Y}_j} \sum_{n=n_o}^{\infty} \sum_{m=\lceil p_i n/2\rceil}^{n} \Pr\Big\{T^O_{j,i}(n) < \frac{m}{K}\Big\} \notag\\
&\overset{(b)}{\le} n_o + \sum_{y^{(i)}\in\mathcal{Y}_j} \sum_{n=n_o}^{\infty} \sum_{m=\lceil p_i n/2\rceil}^{n} \frac{2K^{4+2\epsilon}}{m^{2+2\epsilon}} \quad \text{(Lemma 1)} \notag\\
&\le n_o + \sum_{y^{(i)}\in\mathcal{Y}_j} \sum_{n=n_o}^{\infty} \sum_{m=\lceil p_i n/2\rceil}^{n} \frac{2K^{4+2\epsilon}}{(p_i n/2)^{2+2\epsilon}} \notag\\
&\le n_o + \sum_{y^{(i)}\in\mathcal{Y}_j} \frac{2^{3+2\epsilon} K^{4+2\epsilon}}{p_i^{2+2\epsilon}} \sum_{n=n_o}^{\infty} \frac{1}{n^{1+2\epsilon}} \notag\\
&\le n_o + 2^{3+2\epsilon} K^{4+2\epsilon}\, c \sum_{y^{(i)}\in\mathcal{Y}_j} \frac{1}{p_i^{2+2\epsilon}}, \tag{D.4}
\end{align}
where $c = \sum_{n=1}^{\infty} \frac{1}{n^{1+2\epsilon}} < \infty$, since the series converges for $\epsilon > 0$. Thus, we need $\epsilon > 0$ in order to provide theoretical performance guarantees. Note that $(a)$ holds because we bound the probabilities of the first $n_o$ terms by $1$, and $(b)$ holds since the pulls for one context can only decrease the exploration requirements of the other contexts.
For $(b)$ to hold, the condition of Lemma 1 needs to be satisfied, namely $\big\lfloor \frac{p_i n}{2K} \big\rfloor > \frac{4(2+\epsilon)\ln n}{(\Delta^{(i)}_j)^2}$ for all $n \ge n_o$. In order to make $n_o$ independent of $i$ and $j$, we choose $n_o$ as the minimum value of $n$ satisfying the following inequality:
\[
\Big\lfloor \frac{p_o n}{2K} \Big\rfloor > \Big\lceil \frac{4(2+\epsilon)\ln n}{(\Delta_o)^2} \Big\rceil, \tag{D.5}
\]
where $p_o = \min_{1\le i\le M} p_i$ and $\Delta_o = \min_{i,j:\, h^*(y^{(i)})\ne j} \Delta^{(i)}_j$. Note that all terms in the upper bound in (D.4) are constants, showing that the number of occurrences of under-exploited optimal arms is bounded.

Dominant confidence bounds

This case corresponds to the event where no context is under-realized and the optimal arms are also pulled sufficiently often, yet an optimal arm is pulled non-optimally. This means that the $j$-th arm is pulled for a context $y^{(i)} \notin \mathcal{Y}_j$ at trial $n$, while $N_i(n) \ge \frac{p_i n}{2}$ and $T^O_{j,i}(n) \ge \frac{N_i(n)}{K}$ for all $y^{(i)} \in \mathcal{Y}_j$. Let $q_j = \sum_{y^{(i)}\in\mathcal{Y}_j} p_i$. Note that we only need to analyze $n \ge n_o$, since we have already upper bounded the probabilities of non-optimal pulls in earlier trials by $1$, as in the previous case. Thus, we have
\[
T_j(n) \ge \sum_{y^{(i)}\in\mathcal{Y}_j} T^O_{j,i}(n) \ge \frac{1}{K} \sum_{y^{(i)}\in\mathcal{Y}_j} N_i(n) > \frac{n q_j}{2K}.
\]
The expected number of events $N(E_j \cap \overline{E^1_j} \cap \overline{E^2_j})$ is upper bounded as follows:
\begin{align}
\mathbb{E}\big[N(E_j \cap \overline{E^1_j} \cap \overline{E^2_j})\big] &= \sum_{n=n_o}^{\infty} \Pr\{E_{j,n} \cap \overline{E^1_{j,n}} \cap \overline{E^2_{j,n}}\} \notag\\
&\le \sum_{n=n_o}^{\infty} \sum_{y^{(i)}\notin\mathcal{Y}_j} \Pr\Big\{A(n) = j,\ T_j(n) \ge \frac{n q_j}{2K},\ y_n = y^{(i)}\Big\} \notag\\
&= \sum_{n=n_o}^{\infty} \sum_{y^{(i)}\notin\mathcal{Y}_j} p_i \Pr\Big\{A(n) = j,\ T_j(n) \ge \frac{n q_j}{2K} \,\Big|\, y_n = y^{(i)}\Big\}. \tag{D.6}
\end{align}
Following the line of argument from the proof of Theorem 1 in Appendix B, we get
\[
\mathbb{E}\big[N(E_j \cap \overline{E^1_j} \cap \overline{E^2_j})\big] \le 2 \sum_{y^{(i)}\notin\mathcal{Y}_j} p_i \sum_{n=n_o}^{\infty} n^{-2(1+\epsilon)} \le \frac{\pi^2}{3} \sum_{y^{(i)}\notin\mathcal{Y}_j} p_i = \frac{\pi^2}{3}\,(1 - q_j). \tag{D.7}
\]
Note that the arguments of Theorem 1 hold, since we only consider trials $n \ge n_o$. Combining the results from (D.3), (D.4), and (D.7), we get
\[
\mathbb{E}\big[T^N_j(n)\big] \le n_o + \sum_{y^{(i)}\in\mathcal{Y}_j} \frac{2}{p_i^2} + 2^{3+2\epsilon} K^{4+2\epsilon}\, c \sum_{y^{(i)}\in\mathcal{Y}_j} \frac{1}{p_i^{2+2\epsilon}} + \frac{\pi^2}{3}(1 - q_j),
\]
which concludes the proof.
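The behavior analyzed in Theorems 1 and 2 can be checked numerically. The Python sketch below is an illustration, not the dissertation's code: it runs a per-context UCB1($\epsilon$)-style index mirroring $C_{t,s} = \sqrt{(2+\epsilon)\ln t / s}$ on a toy problem whose context probabilities and Bernoulli reward means are invented for this example, and counts how often an arm that is suboptimal for the realized context is pulled.

```python
import math
import random

def ucb_index(mean_est, pulls, t, eps=0.5):
    """UCB1(eps)-style index: empirical mean plus sqrt((2+eps) ln t / s)."""
    return mean_est + math.sqrt((2 + eps) * math.log(t) / pulls)

def run_contextual_ucb(horizon, seed=0):
    """Count pulls that are suboptimal for the realized context.

    Two contexts, two arms; the context probability and the Bernoulli
    reward means below are invented purely for illustration.
    """
    rng = random.Random(seed)
    p_ctx0 = 0.5                          # Pr{context 0}
    means = [[0.9, 0.2], [0.1, 0.8]]      # means[i][j]: context i, arm j
    M, K = 2, 2
    pulls = [[0] * K for _ in range(M)]   # per-context pull counts
    est = [[0.0] * K for _ in range(M)]   # per-context empirical means
    wrong = 0
    for t in range(1, horizon + 1):
        i = 0 if rng.random() < p_ctx0 else 1
        if 0 in pulls[i]:                 # initialization: try each arm once
            j = pulls[i].index(0)
        else:
            j = max(range(K), key=lambda a: ucb_index(est[i][a], pulls[i][a], t))
        r = 1.0 if rng.random() < means[i][j] else 0.0
        pulls[i][j] += 1
        est[i][j] += (r - est[i][j]) / pulls[i][j]  # running-mean update
        if j != max(range(K), key=lambda a: means[i][a]):
            wrong += 1
    return wrong
```

Over a horizon of a few thousand trials, the suboptimal-pull count remains a small fraction of the horizon and grows only slowly with $n$, consistent with the logarithmic and constant bounds above.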
Appendix E

Analysis of Markov Chain Mixing

We briefly introduce the tools required for the analysis of Markov chain mixing (see [54], Chapter 4 for a detailed discussion). The total variation (TV) distance between two probability distributions $\nu$ and $\nu'$ on sample space $\Omega$ is defined by
\[
\|\nu - \nu'\|_{TV} = \max_{E\subseteq\Omega} |\nu(E) - \nu'(E)|. \tag{E.1}
\]
Intuitively, this means the TV distance between $\nu$ and $\nu'$ is the maximum difference between the probabilities assigned to a single event by the two distributions. The TV distance is related to the $L^1$ distance as follows:
\[
\|\nu - \nu'\|_{TV} = \frac{1}{2} \sum_{\omega\in\Omega} |\nu(\omega) - \nu'(\omega)|. \tag{E.2}
\]
We wish to bound the maximal distance between the stationary distribution $\pi$ and the distribution over states after $t$ steps of a Markov chain. Let $P^{(t)}$ be the $t$-step transition matrix, with $P^{(t)}(s,s')$ being the transition probability from state $s$ to $s'$ of the Markov chain in $t$ steps, and let $\mathcal{P}$ be the collection of all probability distributions on $\Omega$. Also let $P^{(t)}(s,\cdot)$ be the row, i.e., the distribution, corresponding to the initial state $s$. Based on these notations, we define a couple of useful $t$-step distances as follows:
\begin{align}
d(t) &:= \max_{s\in S} \|P^{(t)}(s,\cdot) - \pi\|_{TV} = \sup_{\nu\in\mathcal{P}} \|\nu P^{(t)} - \pi\|_{TV}, \tag{E.3}\\
\hat{d}(t) &:= \max_{s,s'\in S} \|P^{(t)}(s',\cdot) - P^{(t)}(s,\cdot)\|_{TV} = \sup_{\nu',\nu\in\mathcal{P}} \|\nu' P^{(t)} - \nu P^{(t)}\|_{TV}. \tag{E.4}
\end{align}
For irreducible and aperiodic Markov chains, the distances $d(t)$ and $\hat{d}(t)$ have the following special properties:

Lemma 4 ([54], Lemma 4.11). For all $t > 0$, $d(t) \le \hat{d}(t) \le 2 d(t)$.

Lemma 5 ([54], Lemma 4.12). The function $\hat{d}$ is sub-multiplicative: $\hat{d}(t_1 + t_2) \le \hat{d}(t_1)\, \hat{d}(t_2)$.

These lemmas lead to the following useful corollary:

Corollary 1. For all $t \ge 0$, $d(t) \le \hat{d}(1)^t$.

Consider an MDP with optimal stationary policy $\pi^*$. Since the MDP might not start at the stationary distribution corresponding to the optimal policy, even the optimal policy incurs some regret as defined in equation (6.9). We characterize this regret in the following theorem. Note that this regret bound is independent of the initial distribution over the states.
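Corollary 1 is easy to verify numerically for a small chain. The following Python sketch (an illustration, not part of the dissertation) computes the TV distance via the $L^1$ formula (E.2), estimates the stationary distribution of a 2-state ergodic chain chosen arbitrarily for this example, and checks that $d(t) \le \hat{d}(1)^t$ for the first few values of $t$.

```python
def tv(mu, nu):
    """Total variation distance via the L1 formula in (E.2)."""
    return 0.5 * sum(abs(a - b) for a, b in zip(mu, nu))

def mat_mul(A, B):
    """Multiply two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def stationary(P, iters=500):
    """Crude power iteration for the stationary distribution of P."""
    pi = [1.0 / len(P)] * len(P)
    for _ in range(iters):
        pi = [sum(pi[k] * P[k][j] for k in range(len(P))) for j in range(len(P))]
    return pi

# A 2-state ergodic chain chosen purely for illustration.
P = [[0.9, 0.1],
     [0.2, 0.8]]
pi = stationary(P)

# d_hat(1): maximum TV distance between any two rows of P, per (E.4).
d_hat_1 = max(tv(P[s], P[sp]) for s in range(2) for sp in range(2))

Pt = [row[:] for row in P]                        # Pt holds P^t
for t in range(1, 10):
    d_t = max(tv(Pt[s], pi) for s in range(2))    # d(t) from (E.3)
    assert d_t <= d_hat_1 ** t + 1e-12            # Corollary 1: d(t) <= d_hat(1)^t
    Pt = mat_mul(Pt, P)
```

For this chain $\hat{d}(1) = 0.7$, so the distance to stationarity decays at least geometrically, which is exactly the fact the regret bound in the theorem below exploits.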
Theorem 17 (Regret of the Optimal Policy). For an ergodic MDP, the total expected regret of the optimal stationary policy $\pi^*$ with transition probability matrix $P^*$ is upper bounded by $(1 - \lambda^*)^{-1} \bar{\mu}_{\max}$, where $\lambda^* = \max_{s,s'\in S} \|P^*(s',\cdot) - P^*(s,\cdot)\|_{TV}$ and $\bar{\mu}_{\max} = \max_{s\in S,\, a\in A} \bar{\mu}(s,a)$.

Proof. Let $\nu_0$ be the initial distribution over the states, let $\nu_t = \nu_0 P^{*(t)}$ be the corresponding distribution at time $t$, both represented as row vectors, and let $\bar{\pi}$ denote the stationary distribution of the optimal policy. Also, let $\bar{\mu}$ be a row vector whose entry corresponding to state $s$ is $\bar{\mu}(s, \pi^*(s))$. We use $d^*(t)$ and $\hat{d}^*(t)$ to denote the $t$-step distances from equations (E.3) and (E.4) for the optimal policy; note that $\lambda^* = \hat{d}^*(1)$. Ergodicity of the MDP ensures that the Markov chain corresponding to the optimal policy is irreducible and aperiodic, and thus Lemmas 4 and 5 hold. The regret of the optimal policy therefore simplifies as:
\begin{align}
R^{\pi^*}(\nu_0, T) &= T\bar{\mu}^* - \sum_{t=0}^{T-1} \nu_t \bar{\mu}^\top
= T(\bar{\pi}\bar{\mu}^\top) - \sum_{t=0}^{T-1} \nu_t \bar{\mu}^\top
= \sum_{t=0}^{T-1} (\bar{\pi} - \nu_t)\bar{\mu}^\top \notag\\
&\le \sum_{t=0}^{T-1} (\bar{\pi} - \nu_t)_+ \bar{\mu}^\top \quad \text{(negative entries ignored)} \notag\\
&= \sum_{t=0}^{T-1} \sum_{s\in S} \big(\bar{\pi}(s) - \nu_t(s)\big)_+ \bar{\mu}(s)
\le \bar{\mu}_{\max} \sum_{t=0}^{T-1} \sum_{s\in S} \big(\bar{\pi}(s) - \nu_t(s)\big)_+ \notag\\
&= \bar{\mu}_{\max} \sum_{t=0}^{T-1} \|\bar{\pi} - \nu_0 P^{*(t)}\|_{TV}
\le \bar{\mu}_{\max} \sum_{t=0}^{T-1} d^*(t) \notag\\
&\le \bar{\mu}_{\max} \sum_{t=0}^{T-1} \hat{d}^*(1)^t \quad \text{(from Corollary 1)}
\le \bar{\mu}_{\max} \sum_{t=0}^{\infty} (\lambda^*)^t
= \frac{\bar{\mu}_{\max}}{1 - \lambda^*}.
\end{align}

Reference List

[1] T. L. Lai and H. Robbins, "Asymptotically efficient adaptive allocation rules," Advances in Applied Mathematics, vol. 6, no. 1, pp. 4–22, 1985.
[2] R. Agrawal, "Sample mean based index policies with O(log n) regret for the multi-armed bandit problem," Advances in Applied Probability, pp. 1054–1078, 1995.
[3] P. Auer, N. Cesa-Bianchi, and P. Fischer, "Finite-time analysis of the multi-armed bandit problem," Machine Learning, vol. 47, no. 2-3, pp. 235–256, 2002.
[4] S. Vakili, K. Liu, and Q. Zhao, "Deterministic sequencing of exploration and exploitation for multi-armed bandit problems," IEEE Journal of Selected Topics in Signal Processing, vol. 7, no. 5, pp. 759–767, 2013.
[5] Y. Gai, B. Krishnamachari, and R. Jain, "Combinatorial network optimization with unknown variables: Multi-armed bandits with linear rewards and individual observations," IEEE/ACM Transactions on Networking (TON), vol. 20, no. 5, pp. 1466–1478, 2012.
[6] Y. Gai and B. Krishnamachari, "Distributed stochastic online learning policies for opportunistic spectrum access," IEEE Transactions on Signal Processing, vol. 62, no. 23, pp. 6184–6193, 2014.
[7] J. Langford and T. Zhang, "The epoch-greedy algorithm for multi-armed bandits with side information," in Advances in Neural Information Processing Systems, pp. 817–824, 2008.
[8] T. Lu, D. Pál, and M. Pál, "Contextual multi-armed bandits," in International Conference on Artificial Intelligence and Statistics, pp. 485–492, 2010.
[9] L. Li, W. Chu, J. Langford, and R. E. Schapire, "A contextual-bandit approach to personalized news article recommendation," in Proceedings of the 19th International Conference on World Wide Web, pp. 661–670, ACM, 2010.
[10] P. Auer, "Using confidence bounds for exploitation-exploration trade-offs," The Journal of Machine Learning Research, vol. 3, pp. 397–422, 2003.
[11] W. Chu, L. Li, L. Reyzin, and R. E. Schapire, "Contextual bandits with linear payoff functions," in International Conference on Artificial Intelligence and Statistics, pp. 208–214, 2011.
[12] S. Agrawal and N. Goyal, "Thompson sampling for contextual bandits with linear payoffs," in International Conference on Machine Learning, 2013.
[13] M. Dudik, D. Hsu, S. Kale, N. Karampatziakis, J. Langford, L. Reyzin, and T. Zhang, "Efficient optimal learning for contextual bandits," in Conference on Uncertainty in Artificial Intelligence, 2011.
[14] A. Agarwal, D. Hsu, S. Kale, J. Langford, L. Li, and R. E. Schapire, "Taming the monster: A fast and simple algorithm for contextual bandits," in International Conference on Machine Learning, pp. 1638–1646, 2014.
[15] A. Slivkins, "Contextual bandits with similarity information," The Journal of Machine Learning Research, vol. 15, no. 1, pp. 2533–2568, 2014.
[16] M. J. Neely, S. T. Rager, and T. F. La Porta, "Max weight learning algorithms for scheduling in unknown environments," IEEE Transactions on Automatic Control, vol. 57, no. 5, pp. 1179–1191, 2012.
[17] L. Huang and M. J. Neely, "Utility optimal scheduling in energy-harvesting networks," IEEE/ACM Transactions on Networking (TON), vol. 21, no. 4, pp. 1117–1130, 2013.
[18] L. Huang, X. Liu, and X. Hao, "The power of online learning in stochastic network optimization," in ACM SIGMETRICS Performance Evaluation Review, vol. 42, pp. 153–165, ACM, 2014.
[19] Y.-H. Kao and B. Krishnamachari, "Optimizing mobile computational offloading with delay constraints," in 2014 IEEE Global Communications Conference, pp. 2289–2294, Dec. 2014.
[20] Y.-H. Kao, K. Wright, B. Krishnamachari, and F. Bai, "Online learning for wireless distributed computing," arXiv preprint arXiv:1611.02830, 2016.
[21] S. Ulukus, A. Yener, E. Erkip, O. Simeone, M. Zorzi, P. Grover, and K. Huang, "Energy harvesting wireless communications: A review of recent advances," IEEE Journal on Selected Areas in Communications, vol. 33, no. 3, pp. 360–381, 2015.
[22] J. Yang and S. Ulukus, "Optimal packet scheduling in an energy harvesting communication system," IEEE Transactions on Communications, vol. 60, no. 1, pp. 220–230, 2012.
[23] K. Tutuncuoglu and A. Yener, "Optimum transmission policies for battery limited energy harvesting nodes," IEEE Transactions on Wireless Communications, vol. 11, no. 3, pp. 1180–1189, 2012.
[24] O. Ozel, K. Tutuncuoglu, J. Yang, S. Ulukus, and A. Yener, "Transmission with energy harvesting nodes in fading wireless channels: Optimal policies," IEEE Journal on Selected Areas in Communications, vol. 29, no. 8, pp. 1732–1743, 2011.
[25] J. Lei, R. Yates, and L. Greenstein, "A generic model for optimizing single-hop transmission policy of replenishable sensors," IEEE Transactions on Wireless Communications, vol. 8, no. 2, pp. 547–551, 2009.
[26] A. Sinha, "Optimal power allocation for a renewable energy source," in Communications (NCC), 2012 National Conference on, pp. 1–5, IEEE, 2012.
[27] Z. Wang, A. Tajer, and X. Wang, "Communication of energy harvesting tags," IEEE Transactions on Communications, vol. 60, no. 4, pp. 1159–1166, 2012.
[28] C. K. Ho and R. Zhang, "Optimal energy allocation for wireless communications with energy harvesting constraints," IEEE Transactions on Signal Processing, vol. 60, no. 9, pp. 4808–4818, 2012.
[29] W. R. Thompson, "On the likelihood that one unknown probability exceeds another in view of the evidence of two samples," Biometrika, vol. 25, no. 3/4, pp. 285–294, 1933.
[30] L. Tran-Thanh, A. C. Chapman, A. Rogers, and N. R. Jennings, "Knapsack based optimal policies for budget-limited multi-armed bandits," in AAAI, 2012.
[31] S. Bubeck, R. Munos, and G. Stoltz, "Pure exploration in multi-armed bandits problems," in International Conference on Algorithmic Learning Theory, pp. 23–37, Springer, 2009.
[32] B. Kveton, Z. Wen, A. Ashkan, and C. Szepesvari, "Combinatorial cascading bandits," in Advances in Neural Information Processing Systems, pp. 1450–1458, 2015.
[33] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire, "The nonstochastic multiarmed bandit problem," SIAM Journal on Computing, vol. 32, no. 1, pp. 48–77, 2002.
[34] Y. Yue, J. Broder, R. Kleinberg, and T. Joachims, "The k-armed dueling bandits problem," Journal of Computer and System Sciences, vol. 78, no. 5, pp. 1538–1556, 2012.
[35] A. Badanidiyuru, J. Langford, and A. Slivkins, "Resourceful contextual bandits," in Conference on Learning Theory, 2014.
[36] H. Wu, R. Srikant, X. Liu, and C. Jiang, "Algorithms with logarithmic or sublinear regret for constrained contextual bandits," in Advances in Neural Information Processing Systems, pp. 433–441, 2015.
[37] A. Aprem, C. R. Murthy, and N. B. Mehta, "Transmit power control policies for energy harvesting sensors with retransmissions," IEEE Journal of Selected Topics in Signal Processing, vol. 7, no. 5, pp. 895–906, 2013.
[38] A. Seyedi and B. Sikdar, "Energy efficient transmission strategies for body sensor networks with energy harvesting," IEEE Transactions on Communications, vol. 58, no. 7, pp. 2116–2126, 2010.
[39] P. Sakulkar and B. Krishnamachari, "Stochastic contextual bandits with known reward functions," arXiv preprint arXiv:1605.00176, 2016.
[40] P. Ortner and R. Auer, "Logarithmic online regret bounds for undiscounted reinforcement learning," in Proceedings of the 2006 Conference on Advances in Neural Information Processing Systems, vol. 19, p. 49, 2007.
[41] P. Auer, T. Jaksch, and R. Ortner, "Near-optimal regret bounds for reinforcement learning," in Advances in Neural Information Processing Systems, pp. 89–96, 2009.
[42] A. Tewari and P. L. Bartlett, "Optimistic linear programming gives logarithmic regret for irreducible MDPs," in Advances in Neural Information Processing Systems, pp. 1505–1512, 2008.
[43] L. Qin, S. Chen, and X. Zhu, "Contextual combinatorial bandit and its application on diversified online recommendation," in Proceedings of the 2014 SIAM International Conference on Data Mining, pp. 461–469, SIAM, 2014.
[44] A. N. Burnetas and M. N. Katehakis, "Optimal adaptive policies for Markov decision processes," Mathematics of Operations Research, vol. 22, no. 1, pp. 222–255, 1997.
[45] P. Sakulkar and B. Krishnamachari, "Online learning schemes for power allocation in energy harvesting communications," IEEE Transactions on Information Theory, 2017.
[46] P. Sakulkar and B. Krishnamachari, "Online learning of power allocation policies in energy harvesting communications," in Signal Processing and Communications (SPCOM), 2016 IEEE International Conference on, IEEE, 2016.
[47] S. Ross, Introduction to Stochastic Dynamic Programming. Academic Press, 1983.
[48] M. L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2005.
[49] S. Diamond and S. Boyd, "CVXPY: A Python-embedded modeling language for convex optimization," Journal of Machine Learning Research, 2016. To appear.
[50] Y. Wu, R. Kannan, and B. Krishnamachari, "Efficient scheduling for energy-delay tradeoff on a time-slotted channel," tech. rep., University of Southern California, Ming Hsieh Department of Electrical Engineering, Systems, 2015.
[51] P. Sakulkar and B. Krishnamachari, "Contextual combinatorial bandits in wireless distributed computing," in Asilomar Conference on Signals, Systems, and Computers, IEEE, 2017.
[52] Y.-H. Kao, B. Krishnamachari, M.-R. Ra, and F. Bai, "Hermes: Latency optimal task assignment for resource-constrained mobile computing," in 2015 IEEE Conference on Computer Communications (INFOCOM), pp. 1894–1902, IEEE, 2015.
[53] W. Hoeffding, "Probability inequalities for sums of bounded random variables," Journal of the American Statistical Association, vol. 58, no. 301, pp. 13–30, 1963.
[54] D. A. Levin, Y. Peres, and E. L. Wilmer, Markov Chains and Mixing Times. American Mathematical Soc., 2009.