ONLINE LEARNING ALGORITHMS FOR NETWORK OPTIMIZATION WITH UNKNOWN VARIABLES

by

Yi Gai

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL ENGINEERING)

December 2012

Copyright 2012 Yi Gai

Dedication

To my beloved parents, Hua Zhao and Wenquan Gai.

Acknowledgments

This work would not have been possible if it were not for the help of many people.

First, I would like to thank my advisor, Prof. Bhaskar Krishnamachari. I could not have been more fortunate to have such a fine advisor as Bhaskar. He opened my mind to a world of opportunities and taught me about research, learning, teaching, career planning, and so much else. He has always been very kind, patient, and generous. I enjoyed the many insightful discussions with him. Those moments of deep happiness from conquering research difficulties with him will always be precious to me.

I thank my dissertation committee for their guidance and support, including Prof. Rahul Jain and Prof. Shaddin Dughmi.

Special thanks go to my research collaborators. Matthew P. Johnson is a terrific friend and researcher who gave me plenty of very useful suggestions. I also thank Prof. Mingyan Liu, Prof. Qing Zhao, Prof. Amotz Bar-Noy, Wenhan Dai, George Rabanca and Naumaan Nayyar. This dissertation benefited enormously from all of them.

At the University of Southern California, I received the support of various faculty, students, and colleagues. Prof. Michael Neely's classes were the most fun and educational classes that I have ever taken. I thank Prof. David Kempe for his valuable feedback on my papers. I was also helped by Prof. Urbashi Mitra, Prof. Antonio Ortega, Ying Chen, Scott Moeller, Maheswaran Sathiamoorthy, Yi Wang, Vlad Balan, Mi Zhang, Hua Liu, Pai-Han Huang, Joon Ahn, Sangwon Lee, Suvil Singh, Amitabha Ghosh, Majed Alresaini, Marjan Baghaie, Sundeep Pattem, Avinash Sridharan, Yanting Wu, Qiumin Xu, Shangxing Wang and Peter Ungsunan.

I thank all my advisors from the earlier stages of my academic career, each of whom has influenced me, including Prof. Lin Zhang, who gave me lots of support at Tsinghua University.

Finally, I'd like to thank my parents and friends for their emotional support.

Table of Contents

Dedication
Acknowledgments
List of Figures
List of Tables
Abstract
Chapter 1: Introduction
  1.1 Contributions
    1.1.1 Learning with Linear Rewards
      1.1.1.1 I.I.D. Rewards
      1.1.1.2 Rested Markovian Rewards
      1.1.1.3 Restless Markovian Rewards
    1.1.2 Learning for Stochastic Water-Filling with Linear and Nonlinear Rewards
    1.1.3 Learning in Decentralized Settings
  1.2 Organization
Chapter 2: Background: Classic Multi-Armed Bandits
  2.1 Problem Formulation
  2.2 Online Learning Algorithms for Classic MABs
    2.2.1 Lai-Robbins Policy
    2.2.2 UCB1 Policy
Chapter 3: Related Work
  3.1 Independent Arms with Temporally I.I.D. Rewards
  3.2 Dependent Arms with Temporally I.I.D. Rewards
  3.3 Independent Arms with Rested Markovian Rewards
  3.4 Independent Arms with Restless Markovian Rewards
  3.5 Decentralized Policy for Multi-Armed Bandits
  3.6 Applications to Communication Networks
Chapter 4: Learning with I.I.D. Linear Rewards
  4.1 Overview
  4.2 Problem Formulation
  4.3 Policy Design
    4.3.1 A Naive Approach
    4.3.2 A New Policy
  4.4 Analysis of Regret
  4.5 Applications
    4.5.1 Maximum Weighted Matching
    4.5.2 Shortest Path
    4.5.3 Minimum Spanning Tree
  4.6 Examples and Simulation Results
  4.7 K Simultaneous Actions
  4.8 LLR with Approximation Algorithm
  4.9 Summary
Chapter 5: Learning with Rested Markovian Rewards
  5.1 Overview
  5.2 Problem Formulation
  5.3 Matching Learning for Markovian Rewards
  5.4 Analysis of Regret
  5.5 Examples and Simulation Results
  5.6 Summary
Chapter 6: Learning with Restless Markovian Rewards
  6.1 Overview
  6.2 Problem Formulation
  6.3 Policy Design
  6.4 Analysis of Regret
  6.5 Applications and Simulation Results
    6.5.1 Stochastic Shortest Path
    6.5.2 Stochastic Bipartite Matching for Channel Allocation
  6.6 Summary
Chapter 7: Learning for Stochastic Water-Filling
  7.1 Overview
  7.2 Problem Formulation
  7.3 Online Learning for Maximizing the Sum-Rate
    7.3.1 Policy Design
    7.3.2 Analysis of Regret
  7.4 Online Learning for Sum-Pseudo-Rate
    7.4.1 Policy Design
    7.4.2 Analysis of Regret
  7.5 Applications and Simulation Results
    7.5.1 Numerical Results for CWF1
    7.5.2 Numerical Results for CWF2
  7.6 Summary
Chapter 8: Decentralized Learning for Opportunistic Spectrum Access
  8.1 Overview
  8.2 Problem Formulation
  8.3 Selective Learning of the K-th Largest Expected Reward
  8.4 Distributed Learning with Prioritization
  8.5 Distributed Learning with Fairness
  8.6 Simulation Results
  8.7 Summary
Chapter 9: Conclusions and Open Questions
Bibliography

List of Figures

1.1 Overview of this dissertation.
4.1 An illustrative scenario.
4.2 Simulation results of a system with 7 orthogonal channels and 4 users.
4.3 Simulation results of a system with 9 orthogonal channels and 5 users.
5.1 Simulation results of example 1 with Δ_min = 0.1706.
5.2 Simulation results of example 2 with Δ_min = 0.0091.
6.1 An illustration of CLRMR.
6.2 Two example graphs for stochastic shortest path routing.
6.3 Normalized regret R(n)/ln n vs. n time slots.
6.4 Normalized regret R(n)/ln n vs. n time slots.
7.1 Normalized regret R(n)/log n vs. n time slots.
7.2 Numerical results of E[T̃_i(n)]/log n and the theoretical bound.
7.3 Normalized regret R(n)/log n vs. n time slots.
8.1 Illustration of rotating the prioritization vector.
8.2 Normalized regret R(n)/ln n vs. n time slots.
8.3 Number of times that channel i has been chosen by user m up to time n = 10^6.
8.4 Comparison of DLF and TDFS.

List of Tables

4.1 Notation.
4.2 Comparison of regret bounds.
4.3 Regret when t = 2×10^6.
5.1 Notation.
5.2 Transition probabilities.
5.3 Rewards on each state.
5.4 Expected rewards.
5.5 Number of times that resource j has been matched with user i up to time n = 10^7.
5.6 Rewards on each state.
5.7 Expected rewards.
5.8 Number of times that resource j has been matched with user i up to time n = 10^7.
6.1 Notation for Algorithm 7.
6.2 Notation for regret analysis.
6.3 Transition probabilities.
6.4 Transition probabilities.
6.5 Transition probabilities p_01, p_10 for each user-channel pair.
6.6 Transition probabilities p_01, p_10 for each user-channel pair.

Abstract

The formulations and theories of multi-armed bandit (MAB) problems provide fundamental tools for optimal sequential decision making and learning in uncertain environments. They have been widely applied to resource allocation, scheduling, and routing in communication networks, particularly in recent years, as the field is seeing an increasing focus on adaptive online learning algorithms to enhance system performance in stochastic, dynamic, and distributed environments. This dissertation addresses several key problems in this domain.

Our first focus is on MAB with linear rewards. Since multi-armed bandits are fundamentally about combinatorial optimization in unknown environments, one would expect to find even broader use of them in practice. However, a barrier to their wider application has been the limitation of the basic formulation and corresponding policies, which generally treat each arm as an independent entity. They are inadequate for many combinatorial problems of practical interest in which there are large numbers of arms. In such settings, it is important to consider and exploit any structure in terms of dependencies between the arms. In this dissertation, we show that when the dependencies take a linear form, they can be handled tractably with algorithms that have provably good performance in terms of regret as well as storage and computation. We develop a new class of learning algorithms for different problem settings, including i.i.d. rewards, rested Markovian rewards, and restless Markovian rewards, to improve the cost of learning, compared to prior work, for large-scale stochastic network optimization problems.

We then consider the problem of optimal power allocation over parallel channels with stochastically time-varying gain-to-noise ratios for maximizing information rate (stochastic water-filling), with both linear and non-linear multi-armed bandit formulations, and propose new efficient online learning algorithms for these.

Finally, we focus on learning in decentralized settings. The desired objective is to develop decentralized online learning algorithms, running at each user to make a selection among multiple choices with no information exchange, such that the sum-throughput of all distributed users is maximized. We make two contributions to this problem. First, we consider the setting where the users have a prioritized ranking, such that it is desired for the K-th ranked user to learn to access the arm offering the K-th highest mean reward. For this problem, we present the first distributed algorithm that yields regret that is uniformly logarithmic over time without requiring any prior assumption about the mean rewards. Second, we consider the case when a fair access policy is required, i.e., it is desired for all users to experience the same mean reward. For this problem, we present a distributed algorithm that yields order-optimal regret scaling with respect to the number of users and arms, better than previously proposed algorithms in the literature.

Chapter 1
Introduction

Multi-armed bandit (MAB) problems provide a fundamental approach to learning under stochastic rewards, and find rich applications in a wide range of networking contexts, from Internet advertising [69] to medium access in cognitive radio networks [59,75]. In the simplest, classic non-Bayesian version of the problem, studied by Lai and Robbins [52], there are K independent arms, each generating stochastic rewards that are i.i.d. over time.
The player is unaware of the parameters for each arm, and must use some policy to play the arms in such a way as to maximize the cumulative expected reward over the long term. The policy's performance is measured in terms of its regret, defined as the gap between the expected reward that could be obtained by an omniscient user that knows the parameters for the stochastic rewards generated by each arm and the expected cumulative reward of that policy. It is of interest to characterize the growth of regret with respect to time as well as with respect to the number of arms/players. Intuitively, if the regret grows sublinearly over time, the time-averaged regret tends to zero.

There is inherently a tradeoff between exploration and exploitation in the learning process in a multi-armed bandit problem: on the one hand, all arms need to be sampled periodically by the policy used, to ensure that the "true" best arm is found; on the other hand, the policy should play the arm that is considered to be the best often enough to accumulate rewards at a good pace. To quote Peter Whittle [78]: bandit problems "embody in essential form a conflict evident in all human action. This is the conflict between taking those actions which yield immediate reward and those (such as acquiring information or skill, or preparing the ground) whose benefit will come only later."

The classical treatment of multi-armed bandits assumes that arms are independent. Further, existing algorithms for the classic MAB, such as the Lai-Robbins policy [52] and Auer et al.'s UCB1 [13], yield regret that is linear in the number of arms. In many problems of practical interest, particularly those arising in the context of communication networks, such as stochastic shortest path and minimum spanning tree computation and scheduling based on maximum-weight matching on bipartite graphs (with unknown stochastic weights), there are dependencies between a large number of arms that can be described by a smaller set of unknown parameters (the number of arms could be exponential in the number of these parameters; for example, the number of possible paths vs. the number of edges).

As multi-armed bandits are fundamentally about combinatorial optimization in unknown environments, one would indeed expect to find even broader use of them. However, we argue that a barrier to their wider application in practice has been the limitation of the basic formulation and corresponding policies, which generally treat each arm as an independent entity. They are inadequate to deal with many combinatorial problems of practical interest in which there are large (exponential) numbers of arms.

Our first focus in this dissertation is to consider and exploit any structure in terms of dependencies between the arms to learn more efficiently for stochastic network optimization problems under different settings.

When the dependencies take a linear form, we show in Chapter 4 that, for i.i.d. rewards, they can be handled tractably with policies that have provably good performance in terms of regret as well as storage and computation.

While the majority of the literature on MAB has focused on the i.i.d. reward model, extensions exist to the rested Markovian reward model, where the reward state of each arm evolves as an unknown Markov process over successive plays and remains frozen when the arm is not played [10,74], and the restless Markovian reward model, where the reward state of each arm evolves dynamically following unknown stochastic processes whether or not the arm is played [25,56,57,75]. However, these prior works deal with the dependencies between arms inefficiently.
For rested and restless Markovian rewards, when the dependencies take a linear form, we present in Chapter 5 and Chapter 6 online learning algorithms that are designed for the setting where the edge weights are modeled by finite-state Markov chains with unknown transition matrices.

While Chapters 4 to 6 provide efficient algorithms for MAB with linear rewards, we are also interested in going beyond the linear reward formulation to solve more general formulations of MAB with non-linear rewards.

One classic optimization problem in communication systems with a non-linear objective function is that of rate-maximizing constrained power allocation over parallel channels, which is solved by the well-known water-filling algorithm in the deterministic setting [40]. We show in Chapter 7 two distinct but related MAB formulations of stochastic water-filling in which the gain-to-noise ratio for each channel evolves over time as an i.i.d. random process. For the first problem, we propose a cognitive water-filling algorithm that exploits the linear structure of this problem. For the second problem, we present the first MAB policy with provable regret performance to exploit non-linear dependencies between arms.

Another focus of this dissertation is on designing and analyzing decentralized algorithms for MAB. The classical MAB formulation considers only a single player, which, in a network setting, can only handle centralized configurations where all players act collectively as a single entity by exchanging observations and making decisions jointly. In many applications, however, information exchange among players and joint decision making can be costly or even infeasible due to the competition among players. Thus, we have a decentralized problem in which multiple distributed players learn from their local observations and make decisions independently. Decentralized players' actions affect each other since other players' observations and actions are unknown. Conflicts occur when multiple players choose the same arm at the same time, and conflicting players can only share the reward offered by the arm, not necessarily with conservation. All these make learning more difficult for decentralized MAB problems. We present in Chapter 8 decentralized learning algorithms in the context of opportunistic spectrum access.

[Figure 1.1: Overview of this dissertation.]

Figure 1.1 shows the structure of the research presented in this dissertation on online learning algorithms with multi-armed bandit formulations. We have also obtained some additional related results on learning algorithms for the non-Bayesian restless MAB (RMAB) that are not included in this dissertation. In [26], we propose an algorithm to learn the optimal policy for the non-Bayesian RMAB with identical transition matrices, by employing a suitable meta-policy which treats each policy from a finite set as an arm in a different non-Bayesian multi-armed bandit problem for which a single-arm selection policy is optimal. Our results on the non-Bayesian RMAB with non-identical transition matrices can be found in [66].
While this dissertation has focused on the stochastic reward model, we should note that another variant of the MAB problem, the adversarial reward model, has been widely studied [2,14]. In the adversarial reward model (or adversarial bandits), no statistical assumptions are made about the generation of rewards. An adversary, rather than a well-behaved stochastic process, has complete control over the payoffs.

1.1 Contributions

Given the ready applicability of MAB to a wide range of communication systems, as evidenced by many papers, it is clear that progress in expanding the boundaries of knowledge on algorithms and performance beyond classical MAB will have significant impact on the design of efficient communication network protocols in unknown stochastic environments. We now detail our contributions.

1.1.1 Learning with Linear Rewards

1.1.1.1 I.I.D. Rewards

In Chapter 4, we consider the following multi-armed bandit problem. There are N random variables with unknown mean that are each instantiated in an i.i.d. fashion over time. At each time, a particular set of multiple random variables can be selected, subject to a general arbitrary constraint on weights associated with the selected variables. All of the selected individual random variables are observed at that time, and a linearly weighted combination of these selected variables is yielded as the reward.

Our general formulation of multi-armed bandits with linear rewards is applicable to a very broad class of combinatorial network optimization problems with linear objectives. These include maximum weight matching in bipartite graphs (which is useful for user-channel allocations in cognitive radio networks), as well as shortest path and minimum spanning tree computation. In these examples, there are random variables associated with each edge on a given graph, and the constraints on the set of elements allowed to be selected at each time correspond to sets of edges that form relevant graph structures (such as matchings, paths, or spanning trees).

Because our formulation allows for arbitrary constraints on the multiple elements that are selected at each time, prior work on multi-armed bandits that only allows for a fixed number of multiple plays along with individual observations at each time [3,11] cannot be directly used for this more general problem. On the other hand, by treating each feasible weighted combination of elements as a distinct arm, it is possible to handle these constraints using prior approaches for multi-armed bandits with single play (such as the well-known UCB1 index policy of Auer et al. [13]). However, this approach turns out to be naive, and yields poor performance scaling in terms of regret, storage, and computation. This is because this approach maintains and computes quantities for each possible combination separately and does not exploit potential dependencies between them. In Chapter 4, we instead propose smarter policies to handle the arbitrary constraints that explicitly take into account the linear nature of the dependencies and base all storage and computations on the unknown variables directly. As we shall show, this saves not only on storage and computation, but also substantially reduces the regret compared to the naive approach.

Specifically, we first present a novel policy called Learning with Linear Rewards (LLR) that requires only O(N) storage, and yields a regret that grows essentially¹ as O(N^4 log n), where n is the time index. We also discuss how this policy can be modified in a straightforward manner, while maintaining the same performance guarantees, when the problem is one of cost minimization rather than reward maximization.
A key step in the policies we propose is solving a deterministic combinatorial optimization problem with a linear objective. While this is NP-hard in general (as it includes 0-1 integer linear programming), there are still many special-case combinatorial problems of practical interest which can be solved in polynomial time. For such problems, the policy we propose would thus inherit the property of polynomial computation at each step. Further, we present suitably relaxed results on the regret that would be obtained for computationally harder problems when an approximation algorithm with a known guarantee is used.

We also present in Chapter 4 a more general K-action formulation, in which the policy is allowed to pick K ≥ 1 different combinations of variables each time. We show how the basic LLR policy can be readily extended to handle this and present the regret analysis for this case as well.

[¹ This is a simplification of our key result in Section 4.4, which gives a tighter expression for the bound on regret that applies uniformly over time, not just asymptotically.]

The examples of combinatorial network optimization that we present are far from exhausting the possible applications of the formulation and the policies we present in this work; there are many other linear-objective network optimization problems [6,46]. Our framework, for the first time, allows these problems to be solved in stochastic settings with unknown random coefficients, with provably efficient performance. Besides communication networks, we expect that our work will also find practical application in other fields where linear combinatorial optimization problems arise naturally, such as algorithmic economics, data mining, finance, operations research and industrial engineering.

1.1.1.2 Rested Markovian Rewards

In Chapter 5, we formulate a novel combinatorial generalization of the multi-armed bandit problem that allows for rested Markovian rewards and propose an efficient policy for it. In particular, there is a given bipartite graph of M users and N ≥ M resources. For each user-resource pair (i,j), there is an associated state that evolves as an aperiodic irreducible finite-state Markov chain with unknown parameters, with transitions occurring each time the particular user i is allocated resource j. The user i receives a reward that depends on the corresponding state each time it is allocated the resource j. A key difference from the classic multi-armed bandit is that each user can potentially see a different reward process for the same resource. If we therefore view each possible matching of users to resources as an arm, then we have an exponential number of arms with dependent rewards. Thus, this new formulation is significantly more challenging than the traditional multi-armed bandit problems.

Because our formulation allows for user-resource matching, it could potentially be applied to a diverse range of networking settings, such as switching in routers (where inputs need to be matched to outputs), frequency scheduling in wireless networks (where nodes need to be allocated to channels), or server assignment problems (for allocating computational resources to various processes), with the objective of learning as quickly as possible so as to maximize the usage of the best options. For instance, our formulation is general enough to be applied to the channel allocation problem in cognitive radio networks considered in [35] if the rewards for each user-channel pair come from a discrete set and are i.i.d. over time (which is a special case of Markovian rewards).
Our main contribution in Chapter 5 is the design of a novel policy for this problem that we refer to as Matching Learning for Markovian Rewards (MLMR). Since we treat each possible matching of users to resources as an arm, the number of arms in our formulation grows exponentially. However, MLMR uses only polynomial storage, and requires only polynomial computation at each step. We analyze the regret for this policy with respect to the best possible static matching, and show that it is uniformly logarithmic over time under some restrictions on the underlying Markov process. We also show that when these restrictions are removed, the regret can still be made arbitrarily close to logarithmic with respect to time. In either case, the regret is polynomial in the number of users and resources.

1.1.1.3 Restless Markovian Rewards

We present in Chapter 6 an online learning algorithm that is designed for the setting where the edge weights are modeled by finite-state Markov chains with unknown transition matrices. We specifically model this problem as a combinatorial multi-armed bandit problem with restless Markovian rewards.

We consider a single-action regret definition, whereby the genie is assumed to know the transition matrices for all edges, but is constrained to stick with one action (corresponding to a particular network structure) at all times². We prove that our algorithm, which we refer to as CLRMR (Combinatorial Learning with Restless Markov Rewards), achieves a regret that is polynomial in the number of Markov chains (i.e., number of edges), and logarithmic with time. This implies that our learning algorithm, which does not know the transition matrices, asymptotically achieves the maximum time-averaged reward possible with any single-action policy, even if that policy is given advance knowledge of the transition matrices. By contrast, the conventional approach of estimating the mean of each edge weight and then finding the desired network structure via deterministic optimization would incur greater overhead and provide only linearly increasing regret over time, which is not asymptotically optimal.

[² Although a stronger notion of regret can be defined, allowing the genie to vary the action at each time, the problem of minimizing such a stronger regret is much harder and remains open even for simpler settings than the one we consider here.]

While recent work has shown how to address multi-armed bandits with restless Markovian rewards in the classic non-combinatorial setting [75], this dissertation is the first to show how to efficiently implement online learning for stochastic combinatorial network optimization when edge weights are dynamically evolving as restless Markovian processes. We perform simulations to evaluate our new algorithm on two combinatorial network optimization problems, stochastic shortest path routing and bipartite matching for channel allocation, and show that its regret performance is substantially better than that of the algorithm presented in [75], which can handle restless Markovian rewards but does not exploit the dependence between the arms, resulting in a regret that grows exponentially in the number of unknown variables.

1.1.2 Learning for Stochastic Water-Filling with Linear and Nonlinear Rewards

A fundamental resource allocation problem that arises in many settings in wireless communication systems is to allocate a constrained amount of power across many parallel channels in order to maximize the sum-rate.
Assuming that the power-rate function for each channel is proportional to log(1+SNR), as per Shannon's capacity theorem for AWGN channels, it is well known that the optimal power allocation can be determined by a water-filling strategy [24]. The classic water-filling solution is a deterministic algorithm, and requires perfect knowledge of all channel gain-to-noise ratios. Water-filling is used widely in practical wireless networks, for example, for power allocation to subcarriers in multi-user OFDM systems such as WiMax.

In practice, however, channel gain-to-noise ratios are stochastic quantities. Traditionally this is handled by estimating their mean and applying the deterministic optimization. We consider here an alternative approach based on online learning, specifically stochastic multi-armed bandits. We formalize stochastic water-filling as follows: time is discretized into slots; each channel's gain-to-noise ratio is modeled as an i.i.d. random variable with an unknown distribution. In our general formulation, the power-to-rate function for each channel is allowed to be any sub-additive function³.

[³ A function f is subadditive if f(x+y) ≤ f(x)+f(y); for any concave function g, if g(0) ≥ 0 (such as log(1+x)), g is subadditive.]

We consider in Chapter 7 two distinct but related MAB formulations of stochastic water-filling.

First, we seek a power allocation that maximizes the expected sum-rate (i.e., an optimization of the form E[Σ_i log(1+SNR_i)]). Even if the channel gain-to-noise ratios are random variables with known distributions, this turns out to be a hard combinatorial stochastic optimization problem. Our focus is thus on a more challenging case. We present a novel combinatorial policy for this first problem, which we call CWF1, that yields regret growing polynomially in N and logarithmically over time. Despite the exponentially growing set of arms, CWF1 observes and maintains information for P·N variables, one corresponding to each power level and channel, and exploits linear dependencies between the arms based on these variables.

Second, we consider identifying the power allocation that maximizes Σ_i log(1+E[SNR_i]). This is motivated by the fact that, typically, the randomness in the channel gain-to-noise ratios is dealt with by first estimating the mean channel gain-to-noise ratios from a finite set of training observations and then using the estimated gains in a deterministic water-filling procedure. Essentially this approach tries to identify the power allocation that maximizes a pseudo-sum-rate, which is determined based on the power-rate equation applied to the mean channel gain-to-noise ratios. For this second formulation, we present a different stochastic water-filling algorithm that we call CWF2, which learns the optimal power allocation to maximize this function in an online fashion. This algorithm observes and maintains information for N variables, one corresponding to each channel, and exploits non-linear dependencies between the arms based on these variables. To our knowledge, CWF2 is the first MAB algorithm to exploit non-linear dependencies between the arms. We show that the number of times CWF2 plays a non-optimal combination of powers is uniformly bounded by a function that is logarithmic in time. Under some restrictive conditions, CWF2 may also solve the first problem more efficiently.
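For concreteness, the deterministic water-filling baseline referred to above can be computed by a simple bisection search on the water level. The sketch below is our own illustration (the function name, the gain-to-noise list and the power budget are assumptions for the example), not the CWF1 or CWF2 policy of Chapter 7, and it assumes the gain-to-noise ratios are known and strictly positive.

from typing import List

def water_filling(gains: List[float], total_power: float, tol: float = 1e-9) -> List[float]:
    """Deterministic water-filling: maximize sum_i log(1 + g_i * p_i) subject to
    sum_i p_i = total_power and p_i >= 0, for known, strictly positive gain-to-noise
    ratios g_i. The optimal allocation is p_i = max(0, mu - 1/g_i) for a water level
    mu chosen to exhaust the budget; mu is found here by bisection."""
    lo, hi = 0.0, total_power + max(1.0 / g for g in gains)  # hi is a valid upper bound on mu
    while hi - lo > tol:
        mu = (lo + hi) / 2.0
        used = sum(max(0.0, mu - 1.0 / g) for g in gains)
        if used > total_power:
            hi = mu   # water level too high: allocation exceeds the budget
        else:
            lo = mu   # water level too low: budget not exhausted
    mu = (lo + hi) / 2.0
    return [max(0.0, mu - 1.0 / g) for g in gains]

# Example: three channels with gain-to-noise ratios 2.0, 1.0, 0.5 and a unit power budget.
print(water_filling([2.0, 1.0, 0.5], total_power=1.0))

Channels with better gain-to-noise ratios are "filled" first; CWF1 and CWF2 address the setting where these ratios are unknown random processes, so that an allocation of this kind must be learned online.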
1.1.3 Learning in Decentralized Settings

There are two problem formulations of interest when considering distributed MAB: a) the prioritized access problem, where it is desired to prioritize a ranked set of users so that the K-th ranked user learns to access the arm with the K-th highest reward, and b) the fair access problem, where the goal is to ensure that each user receives the same reward in expectation.

In Chapter 8, we make significant new contributions to both problem formulations. For the prioritized access problem, we present a distributed learning policy DLP that results in a regret that is uniformly logarithmic in time and, unlike the prior work in [9], does not require any prior knowledge about the arm reward means. For the fair access problem, we present another distributed learning policy DLF, which yields regret that is also uniformly logarithmic in time and that scales as O(M(N−M)) with respect to the number of users M and the number of arms N. As it has been shown in [59] that the lower bound of regret for distributed policies also scales as Ω(M(N−M)), this is not only a better scaling than the previous state of the art, it is, in fact, order-optimal.

A key subroutine of both decentralized learning policies running at each user involves selecting an arm with the desired rank order of the mean reward. For this, we present a new policy that we refer to as SL(K), which is a non-trivial generalization of UCB1 in [13]. SL(K) provides a general solution for selecting the arm with the K-th largest expected reward for classic MAB problems with N arms.

1.2 Organization

The rest of the dissertation is organized as follows. We present some background material and a brief survey of relevant studies in the literature in Chapter 2 and Chapter 3. In Chapter 4, we formulate the combinatorial multi-armed bandit (MAB) problem with linear rewards and individual observations, and develop new efficient policies for this problem, which are shown to achieve regret that grows logarithmically with time and polynomially in the number of unknown variables. In Chapter 5, we consider the problem with rested Markovian rewards. Then, we investigate the restless setting in Chapter 6. In Chapter 7, we consider the stochastic water-filling problem, which involves problem formulations with both linear and nonlinear rewards. We investigate decentralized online learning for multi-armed bandits in Chapter 8. Finally, we present concluding comments and indicate some open directions for future work in Chapter 9.

Chapter 2
Background: Classic Multi-Armed Bandits

The first multi-armed bandit (MAB) problem was posed by Thompson in 1933 for the application of clinical trials [76]. Since then, MAB has developed into an important branch of stochastic optimization and machine learning. It has recently gained increasing attention from the communications and networking research community due to its ability to formulate and tackle the optimization of learning and activation in a dynamic environment, often under unknown models.

2.1 Problem Formulation

In the simplest, classic non-Bayesian version of the problem, studied by Lai and Robbins [52], there are K independent arms, each generating stochastic rewards that are i.i.d. over time, defined by random variables X_{i,n} for 1 ≤ i ≤ K. Time is slotted and indexed by n ≥ 1. The stochastic rewards are generated independently across arms, i.e., X_{i,n_i} and X_{j,n_j} are independent (and usually not identically distributed) for each 1 ≤ i < j ≤ K and each n_i, n_j ≥ 1.
A policy, or allocation strategy, φ = {φ(n)}_{n=1}^∞, is an algorithm that chooses the next arm to play based on the sequence of past plays and obtained rewards.

The player is unaware of the parameters for each arm (or the distributions of the random variables), and must use some such policy to play the arms, based on the sequence of past plays and obtained rewards, in such a way as to maximize the cumulative expected reward over the long term.

The policy's performance is measured in terms of its regret, defined as the gap between the expected reward that could be obtained by an omniscient player that knows the parameters for the stochastic rewards generated by each arm and the expected cumulative reward of that policy. Let T_i(n) be the number of times arm i has been played by φ during the first n plays. Then the regret of φ after n plays is defined by

θ* n − Σ_{i=1}^{K} θ_i E[T_i(n)],    (2.1)

where θ_i is the expected reward of arm i and θ* = max_{1≤j≤K} θ_j.

It is of interest to characterize the growth of regret with respect to time as well as with respect to the number of arms/players. Intuitively, if the regret grows sublinearly over time, the time-averaged regret tends to zero.

2.2 Online Learning Algorithms for Classic MABs

2.2.1 Lai-Robbins Policy

Multi-armed bandit problems provide a fundamental approach to learning under stochastic rewards. Lai and Robbins [52] were among the earliest to study the classic multi-armed bandit problem with the i.i.d. formulation, for specific families of reward distributions indexed by a single real parameter.

Denote by τ_{i,t} the number of times that arm i has been played up to (but excluding) time t. Let S_{i,1}, ..., S_{i,τ_{i,t}} be the past observations obtained from arm i. Fix δ ∈ (0, 1/N). They proposed Algorithm 1, which achieves

E[T_i(n)] ≤ ( 1/I(θ_i, θ*) + o(1) ) ln n,    (2.2)

where o(1) → 0 as n → ∞ and I(θ_i, θ*) = E_{θ_i}[log(f(y, θ_i)/f(y, θ*))] is the Kullback-Leibler divergence between the two reward distributions parameterized by θ_i and θ*.

Lai and Robbins also showed a lower bound on regret for classic multi-armed bandit problems indexed by a single real parameter. They have shown that, for any suboptimal arm i,

E[T_i(n)] ≥ ln n / I(θ_i, θ*).    (2.3)

So the regret of Algorithm 1 is the best possible (order optimal).

Algorithm 1 Lai-Robbins Policy [52] for Classic MAB
1: // INITIALIZATION
2: for p = 1 to N do
3:   t = p;
4:   Play arm p;
5: end for
6: // MAIN LOOP
7: while 1 do
8:   t = t + 1;
9:   Among all the arms that have been played at least (t−1)δ times, denote by l_t the leader, which is the arm with the largest point estimate:
       l_t = argmax_{i: τ_{i,t} ≥ (t−1)δ} h_{τ_{i,t}}(S_{i,1}, ..., S_{i,τ_{i,t}}),    (2.4)
     where h_{τ_{i,t}}(S_{i,1}, ..., S_{i,τ_{i,t}}) is the point estimate of arm i.
10:  Let r_t = t mod N be the round-robin candidate at time t. Play the leader l_t if h_{τ_{l_t,t}}(S_{l_t,1}, ..., S_{l_t,τ_{l_t,t}}) > g_{t,τ_{r_t,t}}(S_{r_t,1}, ..., S_{r_t,τ_{r_t,t}}), and play the round-robin candidate r_t otherwise; g_{t,τ_{r_t,t}}(·) is an upper confidence bound.
11: end while

For the Gaussian, Bernoulli, and Poisson reward models, h_k and g_{t,k} satisfy (2.5) and (2.6) as follows:

h_k = (1/k) Σ_{j=1}^{k} S_{n,j},    (2.5)

g_{t,k} = inf{ λ : λ ≥ h_k and I(h_k, λ) ≥ log(t−1)/k }.    (2.6)
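To make (2.5) and (2.6) concrete, the sketch below computes the point estimate h_k and the upper confidence bound g_{t,k} for the Bernoulli reward model, where I(p, λ) is the Bernoulli Kullback-Leibler divergence. The bisection search and all function names are our own illustration, not code taken from [52].

import math

def bernoulli_kl(p: float, q: float) -> float:
    # Kullback-Leibler divergence I(p, q) between Bernoulli(p) and Bernoulli(q).
    eps = 1e-12
    p = min(max(p, eps), 1.0 - eps)
    q = min(max(q, eps), 1.0 - eps)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))

def point_estimate(observations) -> float:
    # h_k in (2.5): the sample mean of the k observations from the arm.
    return sum(observations) / len(observations)

def upper_confidence_bound(observations, t: int) -> float:
    # g_{t,k} in (2.6): the smallest lambda >= h_k with I(h_k, lambda) >= log(t-1)/k.
    # Since I(h_k, .) is increasing on [h_k, 1], it can be found by bisection.
    k = len(observations)
    h_k = point_estimate(observations)
    threshold = math.log(max(t - 1, 2)) / k   # guard against log(0) for very small t
    lo, hi = h_k, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if bernoulli_kl(h_k, mid) >= threshold:
            hi = mid
        else:
            lo = mid
    return hi

# Example: an arm observed 5 times with three successes, evaluated at time t = 100.
obs = [1, 0, 1, 1, 0]
print(point_estimate(obs), upper_confidence_bound(obs, t=100))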
2.2.2 UCB1 Policy

Our work in this dissertation is significantly influenced by the paper by Auer et al. [13], which considers arms with non-negative rewards that are i.i.d. over time with an arbitrary, un-parameterized distribution whose only restriction is that it have finite support. Further, they provide a simple policy (referred to as UCB1, shown in Algorithm 2) which achieves logarithmic regret uniformly over time, as shown in Theorem 1, rather than only asymptotically.

Algorithm 2 UCB1 Policy [13] for Classic MAB
1: // INITIALIZATION
2: for p = 1 to N do
3:   n = p;
4:   Play arm p;
5: end for
6: // MAIN LOOP
7: while 1 do
8:   n = n + 1;
9:   Play an arm i which solves the maximization problem
       i = argmax_i ( x̄_i + √(2 ln n / n_i) ),    (2.7)
     where x̄_i is the average reward observed on arm i and n_i is the number of times arm i has been played up to the current time slot.
10: end while

Equation (2.7) in Algorithm 2 shows how the algorithm handles the tradeoff between exploration and exploitation in MABs. If an arm i has not been played often enough, n_i is small, so √(2 ln n / n_i) is relatively big and dominates in (2.7); thus arms that have not been played often enough are more likely to be picked (exploration). On the other hand, if all the arms have been played often enough, x̄_i dominates in (2.7), so the arm with the highest observed mean is more likely to be played (exploitation).

Theorem 1. The expected regret under the UCB1 policy is at most

[ 8 Σ_{k: θ_k < θ*} (ln n / Δ_k) ] + (1 + π²/3) ( Σ_{k: θ_k < θ*} Δ_k ),    (2.8)

where Δ_k = θ* − θ_k.

Proof. See [13, Theorem 1].

The key idea of the proof presented by Auer et al. [13] is to find an upper bound on the expected number of times that each non-optimal arm is played, and then sum over all arms to get the upper bound on regret. Auer et al. show that the probability that a non-optimal arm i is played is bounded by the sum of the probabilities of three events: (i) the observed mean of the optimal arm is below a certain distance of its expectation; (ii) the observed mean of arm i is above a certain distance of its expectation; (iii) the expectation of the optimal arm is smaller than the expectation of arm i plus a value which is a function of the number of times arm i has been played. They then bound the probabilities of the first two events based on the Chernoff-Hoeffding bound [70], as stated in Lemma 1, for the concentration of the sample mean. They also show that the probability of the third event is zero when the number of times that arm i has been played is large enough.

Lemma 1 (Chernoff-Hoeffding bound [70]). Let X_1, ..., X_n be random variables with range [0,1] such that E[X_t | X_1, ..., X_{t−1}] = μ for all 1 ≤ t ≤ n. Denote S_n = Σ X_i. Then for all a ≥ 0,

P{S_n ≥ nμ + a} ≤ e^{−2a²/n},    P{S_n ≤ nμ − a} ≤ e^{−2a²/n}.    (2.9)

Our work in this dissertation reshapes the UCB1 algorithm and its proof in innovative ways for more general and practical problem settings.
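For reference, the following is a minimal sketch of the UCB1 policy of Algorithm 2, assuming rewards in [0,1]; the simulation harness and the Bernoulli arms in the example are our own illustration, not part of [13].

import math
import random

def ucb1(pull, num_arms: int, horizon: int):
    # Sketch of Algorithm 2: pull(i) returns a reward in [0, 1] for arm i.
    counts = [0] * num_arms    # n_i: number of times arm i has been played
    means = [0.0] * num_arms   # sample mean reward of arm i
    # Initialization: play each arm once.
    for i in range(num_arms):
        counts[i] = 1
        means[i] = pull(i)
    # Main loop: play the arm maximizing the index in (2.7).
    for n in range(num_arms + 1, horizon + 1):
        i = max(range(num_arms),
                key=lambda k: means[k] + math.sqrt(2.0 * math.log(n) / counts[k]))
        reward = pull(i)
        counts[i] += 1
        means[i] += (reward - means[i]) / counts[i]   # incremental sample-mean update
    return counts, means

# Example with three Bernoulli arms of unknown means; the best arm ends up played most.
true_means = [0.3, 0.5, 0.7]
counts, means = ucb1(lambda i: 1.0 if random.random() < true_means[i] else 0.0,
                     num_arms=3, horizon=10000)
print(counts, [round(m, 3) for m in means])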
Chapter 3
Related Work

In this chapter, we give an overview of the previous research and literature that are relevant to our studies. We summarize below prior work, which has treated a) independent arms with temporally i.i.d. rewards, b) non-independent arms with temporally i.i.d. rewards, c) independent arms with rested Markovian state-based rewards, d) independent arms with restless Markovian state-based rewards, e) distributed online learning policies, or f) applications to communication networks.

3.1 Independent Arms with Temporally I.I.D. Rewards

Lai and Robbins [52] wrote one of the earliest papers on the classic non-Bayesian infinite horizon multi-armed bandit problem. Assuming N independent arms, each generating rewards that are i.i.d. over time from a given family of distributions with an unknown real-valued parameter, they presented a general policy that provides expected regret that is O(N log n), i.e., linear in the number of arms and asymptotically logarithmic in n. They also show that this policy is order optimal, in that no policy can do better than Ω(N log n). Anantharam et al. [11] extend this work to the case when M multiple plays are allowed. Agrawal et al. [4] further extend this to the case when there are multiple plays and also switching costs are taken into account.

Our study is influenced by the works by Agrawal [3] and Auer et al. [13]. The work by Agrawal [3] first presented easy-to-compute upper confidence bound (UCB) policies based on the sample mean that also yield asymptotically logarithmic regret. Auer et al. [13] build on [3], present variants of Agrawal's policy including the so-called UCB1 policy, and prove bounds on the regret that are logarithmic uniformly over time (i.e., for any finite n, not only asymptotically), so long as the arm rewards have a finite support. There are similarities in the proof techniques used in both these works [3,13], which both use known results on large deviation upper bounds. In this thesis, we also make use of this approach, leveraging the same Chernoff-Hoeffding bound utilized in [13]. However, these works do not exploit potential dependencies between the arms¹. As we show in this thesis, a direct application of the UCB1 policy therefore performs poorly for our problem formulation for the i.i.d. case.

[¹ Both the papers by Agrawal [3] and Auer et al. [13] indicate in the problem formulation that the rewards are independent across arms; however, since their proof technique bounds separately the expected time spent on each non-optimal arm, the bounds on the expected regret that they obtain through linearity of expectation in fact apply even when the arm rewards are not independent. Nevertheless, as we indicate, the policies do not exploit any dependencies that may exist between the arm rewards.]

Unlike Lai and Robbins [52], Agrawal [3], and Auer et al. [13], we consider in Chapter 4 a more general combinatorial version of the problem (with the i.i.d. formulation) that allows for the selection of a set of multiple variables simultaneously so long as they satisfy a given arbitrary constraint. The constraint can be specified explicitly in terms of sets of variables that are allowed to be picked together. There is no restriction on how these sets are constructed. They may correspond, for example, to some structural property such as all possible paths or matchings on a graph. While we credit the paper by Anantharam et al. [11] for being the first to consider multiple simultaneous plays, we note that our formulation in Chapter 4 is much more general than that work. Specifically, the work by Anantharam et al. [11] considers only a particular kind of constraint: it allows selection of all combinations of a fixed number of arms (i.e., in [11], exactly M arms must be played at each time). For this reason, the algorithm presented in [11] cannot be directly used for the more general combinatorial problem in our formulation. For the same reason, the algorithm presented in [4] also cannot be used directly for our problem.

In our formulation in Chapter 4, we assume that the rewards from each individual variable in the selected combination are observed, and the total reward is a linearly weighted combination of these variables. Because we consider a fundamentally different and more general problem formulation, our proof strategy, while sharing structural similarities with [3,13], has a non-trivial innovative component as well. In particular, in our setting, it turns out to be difficult to directly bound the number of actual plays of each weighted combination of variables, and therefore we create a carefully-defined virtual counter for each individual variable, and bound that instead.

3.2 Dependent Arms with Temporally I.I.D. Rewards

While the above key papers and many others have focused on independent arms, there have been some works treating dependencies between arms.
The paper by Pandey et al. [69] divides arms into clusters of dependent arms (in our case there would be only one such cluster consisting of all the arms). Their model assumes that each arm provides only binary rewards, and, in any case, they do not present any theoretical analysis of the expected regret. Ortner [68] proposes to use an additional arm color to utilize the given similarity information of different arms to improve the upper bound of the regret. They assume that the difference of the mean rewards of any two arms with the same color is less than a predefined parameter δ, which is known to the user. This is different from the linear reward model in this thesis.

Mersereau et al. [63] consider a bandit problem where the expected reward is defined as a linear function of a random variable, and the prior distribution is known. They show that the upper bound of the regret is O(√n) and the lower bound of the regret is Ω(√n). Rusmevichientong and Tsitsiklis [72] extend [63] to the setting where the reward from each arm is modeled as the sum of a linear combination of a set of unknown static random numbers and a zero-mean random variable that is i.i.d. over time and independent across arms. The upper bound of the regret is shown to be O(N√n) on the unit sphere and O(N√n log^{3/2} n) for a compact set, and the lower bound of regret is Ω(N√n) for both cases. Although these papers also consider linear dependencies, a key difference is that in [63] and [72] it is assumed that only the total reward is observed at each time, not the individual rewards. In this dissertation, we instead assume that all the selected individual random variables are observed at each time (from which the total reward can be inferred). Because of the more limited coarse-grained feedback, the problems tackled in [63] and [72] are indeed much more challenging, perhaps explaining why they result in a higher regret bound order.

Both [12] and [27] consider linear reward models that are more general than ours, but also under the assumption that only the total reward is observed at each time. Auer [12] presents a randomized policy which requires storage and computation that grow linearly in the number of arms. This algorithm is shown to achieve a regret upper bound of O(√N √n log^{3/2}(n|F|)). Dani et al. [27] develop another randomized policy for the case of a compact set of arms, and show that the regret is upper bounded by O(N√n log^{3/2} n) for sufficiently large n with high probability, and lower bounded by Ω(N√n). They also show that when the difference in costs (denoted as Δ) between the optimal and next-to-optimal decision among the extremal points is greater than zero, the regret is upper bounded by O((N²/Δ) log³ n) for sufficiently large n with high probability.

Liu and Zhao [61] have also investigated multi-armed bandit problems where the dependencies between the arms take a linear form. In particular, their proposed policy can handle a more general reward model by extending the results for distributions with finite support to any light-tailed distribution for a general compact action space. They also consider heavy-tailed distributions for the special case when the action space is a polytope or finite. However, their work has also focused on the case where the feedback is coarse-grained, in that only the total arm rewards are known and not those of the individual components.

Another paper that is related to our work is by Awerbuch and Kleinberg [15]. They consider the problem of shortest path routing in a non-stochastic, adversarial setting, in which only the total cost of the selected path is revealed at each time.
For this problem, assuming the edge costs on the graph are chosen by an adaptive adversary that can view the past actions of the policy, they present a policy with regret scaling as O(n^{2/3} (log n)^{1/3}) over n time steps. However, although, as we discuss, our formulation can also be applied to online shortest path routing, our work is different from [15] in that we consider a stochastic, non-adversarial setting and allow for observations of the individual edge costs of the selected path at each time.

3.3 Independent Arms with Rested Markovian Rewards

There has been relatively less work on multi-armed bandits with Markovian rewards. Anantharam et al. [10] wrote one of the earliest papers with such a setting. They proposed a policy that picks M out of the N arms in each time slot and proved lower and upper bounds on regret. However, the rewards in their work are assumed to be generated by rested Markov chains with transition probability matrices defined by a single parameter θ and with identical state spaces. Also, the result for the upper bound is achieved only asymptotically.

For the case of single users and independent arms, a recent work by Tekin and Liu [74] has extended the results in [10] to the case with no requirement for a single parameter and identical state spaces across arms. They propose to use UCB1 from [13] for the multi-armed bandit problem with Markovian rewards and prove a logarithmic upper bound on the regret under some conditions on the Markov chain. We use elements of the proof from [74] in Chapter 5 of this thesis, which is however quite different in its combinatorial matching formulation (which allows for dependent arms).

3.4 Independent Arms with Restless Markovian Rewards

Restless arm bandits are so named because the arms evolve at each time, changing state even when they are not selected. Work on restless Markovian rewards with single users and independent arms can be found in [26,56,57,66,75]. In these papers there is no consideration of possible dependencies among arms, as in our work here.

Tekin and Liu [75] have proposed an RCA policy that achieves logarithmic single-action regret when certain knowledge about the system is known. We use elements of the policy and proof from [75] in Chapter 6 of this thesis, which is however quite different in its combinatorial matching formulation (which allows for dependent arms). Liu et al. [56,57] adopted the same problem formulation as in [75], and proposed a policy named RUCB, achieving a logarithmic single-action regret over time when certain system knowledge is known. They also extend the RUCB policy to achieve a near-logarithmic regret asymptotically when no knowledge about the system is available.

In our recent work [25], we have also considered the same formulation, and proposed a CEE policy. When no information is available about the dynamics of the arms, CEE is the first algorithm to guarantee near-logarithmic regret uniformly over time. When some bounds corresponding to the stationary state distributions and the state-dependent rewards are known, we show that CEE can be easily modified to achieve logarithmic regret over time with less additional information compared with RCA and RUCB.

In [26], we have adopted a stronger definition of regret: the difference in expected reward compared to a model-aware genie. We develop a policy that yields regret of order arbitrarily close to logarithmic for certain classes of restless bandits with a finite-option structure, such as restless MAB with two states and identical probability transition matrices.
In [66], we have developed a policy for a special case of two positively correlated restless multi-armed bandits, and prove that it yields near-logarithmic regret with respect to any policy that achieves an expected discounted reward that is within ε of the optimal.

3.5 Decentralized Policy for Multi-Armed Bandits

While most of the prior work on MAB has focused on centralized policies, motivated by the problem of opportunistic access in cognitive radio networks, Liu and Zhao [59,60] and Anandkumar et al. [8,9] have both developed policies for the problem of M distributed players operating N independent arms. There are two problem formulations of interest when considering distributed MAB: a) the prioritized access problem, where it is desired to prioritize a ranked set of users so that the K-th ranked user learns to access the arm with the K-th highest reward, and b) the fair access problem, where the goal is to ensure that each user receives the same reward in expectation. For the prioritized access problem, Anandkumar et al. [9] present a distributed policy that yields regret that is logarithmic in time, but requires prior knowledge of the arm reward means. For the fair access problem, they propose in [8,9] a randomized distributed policy that is logarithmic with respect to time and scales as O(M²N) with respect to the number of arms and users. Liu and Zhao [59,60] also treat the fair access problem and present the TDFS policy, which yields asymptotically logarithmic regret with respect to time and scales as O(M(max{M², (N−M)M})) with respect to the number of arms and users.

In Chapter 8 of this thesis, we make significant new contributions to both problem formulations. For the prioritized access problem, we present a distributed learning policy DLP that results in a regret that is uniformly logarithmic in time and, unlike the prior work in [8,9], does not require any prior knowledge about the arm reward means. For the fair access problem, we present another distributed learning policy DLF, which yields regret that is also uniformly logarithmic in time and that scales as O(M(N−M)) with respect to the number of users M and the number of arms N. As it has been shown in [59,60] that the lower bound of regret for distributed policies also scales as Ω(M(N−M)), this is not only a better scaling than the previous state of the art, it is, in fact, order-optimal.

Another recent work on the decentralized MAB problem is by Kalathil et al. [44]. They have considered a different decentralized multi-armed bandit problem where the rewards on each arm can be distinct for each player. Decentralized policies for both i.i.d. and rested Markovian rewards are proposed, based on the use of a distributed bipartite matching algorithm.

3.6 Applications to Communication Networks

While multi-armed bandits are broadly useful for other fields such as medicine, finance, and industrial engineering, this thesis is particularly motivated and inspired by their applicability to communication networks. We briefly survey these applications below.

The application in communications where bandit formulations have found particular use in recent years is the context of dynamic spectrum access in cognitive radio networks [1,5,7-9,35,41,43,51,55,59,60,62,64,67,75,80]. They have been applied to other problems in wireless communications such as downlink scheduling in wireless networks [54], [65], MIMO systems [77], and channel measurements for wideband communication [42]. They have been used for opportunistic routing [19] and intrusion detection [79] in ad hoc networks, and for network selection in heterogeneous wireless multimedia networks [73].
In the context of sensor systems, they have been applied to controlemissionsforlowprobabilityintercept sensors[48], forschedulingcommunica- tionstomaximizesensornetworklifetime[23],forjointcodingandschedulinginsensor 32 networks[58],andfornodediscoveryinmobilesensornetworks[30]. Multi-armedban- dit formulations have also been considered for path and wavelength selection in optical networks[45]. Given the ready applicability of MAB to a wide range of communication systems, as evidencedby thesemanypapers, itis clearthat progressin expandingtheboundaries of knowledge on algorithms and performance beyond classical MAB will have signif- icant impact on the design of efficient communication network protocols in unknown stochasticenvironments. 33 Chapter4 LearningwithI.I.D.LinearRewards 4.1 Overview In this chapter 1 , we formulate the following combinatorial multi-armed bandit (MAB) problem: thereareN randomvariableswithunknownmeanthatareeachinstantiatedin ani.i.d. fashionovertime. Ateachtimemultiplerandomvariablescanbeselected,sub- jecttoanarbitraryconstraintonweightsassociatedwiththeselectedvariables. Allofthe selected individual random variables are observed at that time, and a linearly weighted combination of these selected variables is yielded as the reward. The goal is to find a policy that minimizes regret. This formulation is broadly applicable and useful for stochastic online versions of many interesting tasks in networks that can be formulated astractablecombinatorialoptimizationproblemswithlinearobjectivefunctions,suchas maximumweighted matching,shortestpath, and minimumspanningtreecomputations. 1 Thischapterisbasedon[35]and[36]. 34 Priorworkonmulti-armedbanditswithmultipleplayscannotbeappliedtothisformula- tionbecauseofthegeneralnatureoftheconstraint. Ontheotherhand,themappingofall feasiblecombinationstoarmsallowsfortheuseofpriorworkonMABwithsingle-play, but results in regret, storage, and computation growing exponentially in the number of unknownvariables. Wepresentnewefficientpoliciesforthisproblem,thatareshownto achieve regret that grows logarithmically with time, and polynomially in the number of unknownvariables. Furthermore, these policiesonly require storagethat grows linearly inthenumberofunknownparameters. Forproblemswheretheunderlyingdeterministic problem is tractable, these policies further require only polynomial computation. For computationally intractable problems, we also present results on a different notion of regretthatissuitablewhenapolynomial-timeapproximationalgorithmisused. 4.2 ProblemFormulation We consider a discrete timesystemwithN unknownrandom processesX i (n),1≤ i≤ N,wheretimeisindexedbyn. WeassumethatX i (n)evolvesasani.i.d. randomprocess overtime,withtheonlyrestrictionthatitsdistributionhaveafinitesupport. Withoutloss ofgenerality,wenormalizeX i (n)∈ [0,1]. WedonotrequirethatX i (n)beindependent acrossi. Thisrandomprocessisassumedtohaveameanθ i =E[X i ]thatisunknownto theusers. Wedenotethesetofallthesemeansas Θ ={θ i }. At each decision period n (also referred to interchangeably as time slot), an N- dimensional action vectora(n) is selected under a policyφ(n) from a finite setF. We 35 assumea i (n)≥ 0forall1≤i≤N. Whenaparticulara(n)isselected,onlyforthosei witha i (n)6= 0, the valueofX i (n) is observed. WedenoteA a(n) ={i : a i (n)6= 0,1≤ i≤N},theindexsetofalla i (n)6= 0foranactiona. Therewardisdefinedas: R a(n) (n) = N X i=1 a i (n)X i (n). 
(4.1) Whenaparticularactiona(n)isselected,therandomvariablescorrespondingtonon- zero componentsofa(n) are revealed 2 ,i.e., thevalueofX i (n)is observedforallisuch thata(n)6= 0. Weevaluatepolicieswithrespecttoregret,whichisdefinedasthedifferencebetween theexpected reward that could beobtained by agenie thatcan pick an optimalaction at each time, and that obtained by the given policy. Note that minimizing the regret is equivalenttomaximizingtherewards. Regretcanbeexpressedas: R φ n (Θ) =nθ ∗ −E φ [ n X t=1 R φ(t) (t)], (4.2) where θ ∗ = max a∈F N P i=1 a i θ i , the expected reward of an optimal action. For the rest of the chapter, we use ∗ as the index indicating that a parameter is for an optimal action. If thereismorethanoneoptimalactionexist,∗refers toanyoneofthem. 2 Asnotedintherelatedwork,thisisakeyassumptioninourworkthatdifferentiatesitfromotherprior workonlineardependent-armbandits[12],[27]. Thisis averyreasonableassumptionin manycases, for instance,in the combinatorialnetworkoptimizationapplicationswe discuss in section4.5,it corresponds torevealingweightsonthesetofedgesselectedateachtime. 36 Intuitively, we would like the regretR φ n (Θ) to be as small as possible. If it is sub- linearwithrespecttotimen,thetime-averagedregretwilltendtozeroandthemaximum possibletime-averagedrewardcanbeachieved. Notethatthenumberofactions|F|can beexponentialinthenumberofunknownrandomvariablesN. 4.3 PolicyDesign 4.3.1 ANaiveApproach Auniquefeatureofourproblemformulationisthattheactionselectedateachtimecanbe chosensuchthatthecorrespondingcollectionofindividualvariablessatisfiesanarbitrary structural constraint. For this reason, as we indicated in our related works discussion, prior work on MAB with fixed number of multiple plays, such as [11], or on linear reward models, such as [27], cannot be applied to this problem. One straightforward, relativelynaiveapproachtosolvingthecombinatorialmulti-armedbanditsproblemthat we defined is to treat each arm as an action, which allows us to use the UCB1 policy given by Auer et al. [13]. Using UCB1, each action is mapped into an arm, and the action that maximizes ˆ Y k + q 2lnn m k will be selected at each time slot, where ˆ Y k is the meanobservedrewardonactionk,andm k isthenumberoftimesthatactionk hasbeen played. This approach essentially ignores the dependencies across the different actions, storing observed information about each action independently, and making decisions basedonthisinformationalone. 37 Note that UCB1 requires storage that is linear in the number of actions and yields regret growing linearly with the number of actions. In a case where the number of ac- tionsgrowexponentiallywiththenumberofunknownvariables,bothofthesearehighly unsatisfactory. Intuitively, UCB1 algorithm performs poorly on this problem because it ignores the underlying dependencies. This motivates us to propose a sophisticated policy which moreefficientlystoresobservationsfromcorrelatedactionsandexploitsthecorrelations tomakebetterdecisions. 4.3.2 ANewPolicy Ourproposedpolicy,whichwerefertoas“learningwithlinearrewards”(LLR),isshown inAlgorithm3. Table 4.1 summarizes some notation we use in the description and analysis of our algorithm. The key idea behind this algorithmis to store and use observations for each random variable, rather than for each action as a whole. Since the same random variablecan be observedwhileoperatingdifferentactions,thisallowsexploitationofinformationgained fromtheoperationofoneactiontomakedecisionsaboutadependentaction. 
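For contrast with the per-variable bookkeeping developed next, the following Python sketch (illustrative only, not taken from the thesis) spells out the naive baseline of Section 4.3.1: every feasible action in F is enumerated and treated as an independent UCB1 arm, so both the stored tables and the resulting regret bound scale with |F|, which can be exponential in N.

import math

def naive_ucb1_select(actions, y_hat, m, n):
    # actions: enumerated ids of all feasible actions in F
    # y_hat[k], m[k]: per-action sample-mean reward and play count
    best, best_idx = None, -float("inf")
    for k in actions:
        idx = float("inf") if m[k] == 0 else y_hat[k] + math.sqrt(2 * math.log(n) / m[k])
        if idx > best_idx:
            best, best_idx = k, idx
    return best

def naive_ucb1_update(y_hat, m, k, reward):
    m[k] += 1
    y_hat[k] += (reward - y_hat[k]) / m[k]   # per-action mean; per-variable observations are discarded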
We use two 1 by N vectors to store the information after we play an action at each time slot. One is ( ˆ θ i ) 1×N in which ˆ θ i is the average (sample mean) of all the observed values of X i up to the current time slot (obtained through potentially different sets of 38 Algorithm3LearningwithLinearRewards(LLR) 1: // INITIALIZATION 2: If max a |A a |isknown,letL = max a |A a |;else,L =N; 3: forp = 1toN do 4: n =p; 5: Playanyactionasuchthatp∈A a ; 6: Update( ˆ θ i ) 1×N ,(m i ) 1×N accordingly; 7: endfor 8: // MAIN LOOP 9: while1do 10: n =n+1; 11: Playanactionawhichsolvesthemaximizationproblem a = argmax a∈F X i∈Aa a i ˆ θ i + s (L+1)lnn m i ; (4.3) 12: Update( ˆ θ i ) 1×N ,(m i ) 1×N accordingly; 13: endwhile actionsovertime). Theotheroneis(m i ) 1×N inwhichm i isthenumberoftimesthatX i hasbeenobserveduptothecurrenttimeslot. At each time slotn, after an actiona(n) is played, we get the observation of X i (n) foralli∈A a(n) . Then ( ˆ θ i ) 1×N and (m i ) 1×N (bothinitializedto0attime0)areupdated asfollows: ˆ θ i (n) = ˆ θ i (n−1)m i (n−1)+X i (n) m i (n−1)+1 , ifi∈A a(n) ˆ θ i (n−1) , else (4.4) m i (n) = m i (n−1)+1 , ifi∈A a(n) m i (n−1) , else (4.5) 39 N : numberofrandomvariables. a: vectorsofcoefficients,definedonsetF. A a : {i :a i 6= 0,1≤i≤N}. ∗: indexindicatingthataparameterisforan optimalaction. m i : numberoftimesthatX i hasbeenobserved uptothecurrenttimeslot. ˆ θ i : average(samplemean)ofalltheobserved valuesofX i uptothecurrenttimeslot. NotethatE[ ˆ θ i (n)] =θ i . ˆ θ i,m i : average(samplemean)ofalltheobserved valuesofX i whenitisobservedm i times. Δ a : R ∗ −R a . Δ min : min Ra<R ∗ Δ a . Δ max : max Ra<R ∗ Δ a . T a (n): numberoftimesactionahasbeenplayed inthefirstntimeslots. a max : max a∈F max i a i . Table4.1: Notation. Note that while we indicate the time index in the above updates for notational clar- ity, it is not necessary to store the matrices from previous time steps while running the algorithm. LLR policy requires storage linearinN. In section 4.4, we will present the analysis oftheupperboundofregret,andshowthatitispolynomialinN andlogarithmicintime. Notethatthemaximizationproblem(4.3)needstobesolvedasthepartofLLRpolicy. It isadeterministiclinearoptimalproblemwithafeasiblesetF andthecomputationtime for an arbitraryF may not be polynomial inN. As we show in Section 4.5, there exist manypracticallyusefulexampleswithpolynomialcomputationtime. 40 4.4 AnalysisofRegret Traditionally, the regret of a policy for a multi-armed bandit problem is upper-bounded by analyzing the expected number of times that each non-optimal action is played, and the summing this expectation over all non-optimal actions. While such an approach will work to analyze the LLR policy too, it turns out that the upper-bound for regret consequently obtained is quite loose, being linear in the number of actions, which may growfasterthanpolynomials. Instead,wegivehereatighteranalysisoftheLLRpolicy thatprovidesanupperboundwhichisinsteadpolynomialinN andlogarithmicintime. Liketheregretanalysisin[13],thisupper-boundisvalidforfiniten. Theorem2. Theexpected regret under theLLR policyis atmost 4a 2 max L 2 (L+1)N lnn (Δ min ) 2 +N + π 2 3 LN Δ max . (4.6) Proof. DenoteC t,m i as q (L+1)lnt m i . We introduce e T i (n) as a counter after the initializa- tionperiod. Itisupdatedinthefollowingway: At each time slot after the initialization period, one of the two cases must happen: (1) an optimal action is played; (2) a non-optimal action is played. In the first case, ( e T i (n)) 1×N won’t be updated. 
When an non-optimal actiona(n) is picked at time n, there must be at least one i ∈ A a such that i = arg min j∈Aa m j . If there is only one such action, e T i (n)isincreasedby1. Iftherearemultiplesuchactions,wearbitrarilypickone, sayi ′ ,andincrement e T i ′ by1. 41 Eachtimewhenanon-optimalactionispicked,exactlyoneelementin( e T i (n)) 1×N is incrementedby1. Thisimpliesthatthetotalnumberthatwehaveplayedthenon-optimal actions is equal to the summation of all counters in ( e T i (n)) 1×N , i.e., P a:Ra<R ∗ T a (n) = N P i=1 e T i (n)andhence E[ X a:Ra<R ∗ T a (n)] =E[ N X i=1 e T i (n)]. (4.7) Therefore, wehave: X a:Ra<R ∗ E[T a (n)] = N X i=1 E[ e T i (n)]. (4.8) Alsonotefor e T i (n),thefollowinginequalityholds: e T i (n)≤m i (n),∀1≤i≤N. (4.9) Denoteby e I i (n)theindicatorfunctionwhichisequalto1if e T i (n)isaddedbyoneat timen. Letl beanarbitrarypositiveinteger. Then: e T i (n) = n X t=N+1 1{ e I i (t) = 1} ≤l+ n X t=N+1 1{ e I i (t) = 1, e T i (t−1)≥l} (4.10) where1(x) is the indicator function defined to be 1 when the predicate x is true, and 0 when it is false. When e I i (t) = 1, a non-optimal actiona(t) has been picked for which m i = min j {m j : ∀j ∈ A a(t) }. We denote this action asa(t) since at each time that e I i (t) = 1,wecouldgetdifferentactions. Then, 42 e T i (n)≤l+ n X t=N+1 1{ X j∈A a ∗ a ∗ j ( ˆ θ j,m j (t−1) +C t−1,m j (t−1) ) ≤ X j∈A a(t) a j (t)( ˆ θ j,m j (t−1) +C t−1,m j (t−1) ), e T i (t−1)≥l} ≤l+ n X t=N 1{ X j∈A a ∗ a ∗ j ( ˆ θ j,m j (t) +C t,m j (t) ) ≤ X j∈A a(t+1) a j (t+1)( ˆ θ j,m j (t) +C t,m j (t) ), e T i (t)≥l}. (4.11) Notethatl≤ e T i (t)implies, l≤ e T i (t)≤m j (t),∀j ∈A a(t+1) . (4.12) So, e T i (n)≤l+ n X t=N 1{ min 0<m h 1 ,...,m h |Aa∗| ≤t |Aa∗| X j=1 a ∗ h j ( ˆ θ h j ,m h j +C t,m h j ) ≤ max l≤mp 1 ,...,mp |A a(t+1) | ≤t |A a(t+1) | X j=1 a p j (t+1)( ˆ θ p j ,mp j +C t,mp j )} ≤l+ ∞ X t=1 t X m h 1 =1 ··· t X m h |A ∗ | =1 t X mp 1 =l ··· t X mp |A a(t+1) | =l 1{ |Aa∗| X j=1 a ∗ h j ( ˆ θ h j ,m h j +C t,m h j )≤ |A a(t+1) | X j=1 a p j (t+1)( ˆ θ p j ,mp j +C t,mp j )} whereh j (1≤ j ≤|A a∗ |)representsthej-thelementinA a∗ andp j (1≤ j ≤|A a(t+1) |) representsthej-thelementinA a(t+1) . 43 |Aa∗| P j=1 a ∗ h j ( ˆ θ h j ,m h j +C t,m h j )≤ |A a(t+1) | P j=1 a p j (t+1)( ˆ θ p j ,mp j +C t,mp j ) meansthat atleast oneofthefollowingmustbetrue: |Aa∗| X j=1 a ∗ h j ˆ θ h j ,m h j ≤R ∗ − |Aa∗| X j=1 a ∗ h j C t,m h j , (4.13) |A a(t+1) | X j=1 a p j (t+1) ˆ θ p j ,mp j ≥R a(t+1) + |A a(t+1) | X j=1 a p j (t+1)C t,mp j , (4.14) R ∗ <R a(t+1) +2 |A a(t+1) | X j=1 a p j (t+1)C t,mp j . (4.15) NowwefindtheupperboundforP{ |Aa∗| P j=1 a ∗ h j ˆ θ h j ,m h j ≤R ∗ − |Aa∗| P j=1 a ∗ h j C t,m h j }. Wehave: P{ |Aa∗| X j=1 a ∗ h j ˆ θ h j ,m h j ≤R ∗ − |Aa∗| X j=1 a ∗ h j C t,m h j } =P{ |Aa∗| X j=1 a ∗ h j ˆ θ h j ,m h j ≤ |Aa∗| X j=1 a ∗ h j θ h j − |Aa∗| X j=1 a ∗ h j C t,m h j } ≤P{Atleastoneofthefollowingmusthold: a ∗ h 1 ˆ θ h 1 ,m h 1 ≤a ∗ h 1 θ h 1 −a ∗ h 1 C t,m h 1 , a ∗ h 2 ˆ θ h 2 ,m h 2 ≤a ∗ h 2 θ h 2 −a ∗ h 2 C t,m h 2 , . . . a ∗ h |Aa∗| ˆ θ h 1 ,m h |Aa∗| ≤a ∗ h |Aa∗| θ h |Aa∗| −a ∗ h |Aa∗| C t,m h |Aa∗| } 44 ≤ |Aa∗| X j=1 P{a ∗ h j ˆ θ h j ,m h j ≤a ∗ h j θ h j −a ∗ h j C t,m h j } = |Aa∗| X j=1 P{ ˆ θ h j ,m h j ≤θ h j −C t,m h j }. (4.16) ∀1 ≤ j ≤ |A a∗ |, applying the Chernoff-Hoeffding bound stated in Lemma 1, we couldfindtheupperboundofeach itemintheaboveequationas, P{ ˆ θ h j ,m h j ≤θ h j −C t,m h j } =P{m h j ˆ θ h j ,m h j ≤m h j θ h j −m h j C t,m h j } ≤ e −2· 1 m h i j ·(m h j ) 2 · (L+1)lnt m h j =e −2(L+1)lnt =t −2(L+1) . 
Thus, P{ |Aa∗| X j=1 a ∗ h j ˆ θ h j ,m h j ≤R ∗ − |Aa∗| X j=1 a ∗ h j C t,m h j } ≤|A a∗ |t −2(L+1) ≤ Lt −2(L+1) . (4.17) Similarly,wecan gettheupperboundoftheprobabilityforinequality(4.14): P{ |A a(t+1) | X j=1 a p j (t+1) ˆ θ p j ,mp j ≥R a(t+1) + |A a(t+1) | X j=1 a p j (t+1)C t,mp j }≤ Lt −2(L+1) . (4.18) 45 Notethatforl≥ 4(L+1)lnn Δ a(t+1) Lamax 2 , R ∗ −R a(t+1) −2 |A a(t+1) | X j=1 a p j (t+1)C t,mp j =R ∗ −R a(t+1) −2 |A a(t+1) | X j=1 a p j (t+1) s (L+1)lnt m p j ≥R ∗ −R a(t+1) −La max r 4(L+1)lnn l ≥R ∗ −R a(t+1) −La max s 4(L+1)lnn 4(L+1)lnn Δ a(t+1) La max 2 ≥R ∗ −R a(t+1) −Δ a(t+1) = 0. (4.19) Equation (4.19) impliesthat condition(4.15) is false whenl = 4(L+1)lnn Δ a(t+1) Lamax 2 . If we letl = 4(L+1)lnn ( Δ min Lamax ) 2 ,then(4.15)isfalseforalla(t+1). Therefore, E[ e T i (n)]≤ 4(L+1)lnn Δ min Lamax 2 + ∞ X t=1 t X m h 1 =1 ··· t X m h |A ∗ | =1 t X mp 1 =l ··· t X mp |A a(t) | =l 2Lt −2(L+1) ≤ 4a 2 max L 2 (L+1)lnn (Δ min ) 2 +1+L ∞ X t=1 2t −2 ≤ 4a 2 max L 2 (L+1)lnn (Δ min ) 2 +1+ π 2 3 L. (4.20) 46 SounderLLRpolicy,wehave: R φ n (Θ) =R ∗ n−E φ [ n X t=1 R φ(t) (t)] = X a:Ra<R ∗ Δ a E[T a (n)] ≤ Δ max X a:Ra<R ∗ E[T a (n)] = Δ max N X i=1 E[ e T i (n)] ≤ " N X i=1 4a 2 max L 2 (L+1)lnn (Δ min ) 2 +N + π 2 3 LN # Δ max ≤ 4a 2 max L 2 (L+1)N lnn (Δ min ) 2 +N + π 2 3 LN Δ max . (4.21) Remark 1. Note that when the set of action vectors consists of binary vectors with a single “1”, the problem formulation reduces to an multi-armed bandit problem withN independent actions. In this special case, the LLR algorithm is equivalent to UCB1 in[13]. Thus, ourresultsgeneralizethatpriorwork. Remark 2. We have presented F as a finite set in our problem formation. We note that the LLR policy we have described and its analysis actually also work with a more general formulation whenF is an infinite set with the following additional constraints: the maximization problem in (4.3) always has at least one solution; Δ min exists; a i is bounded. Withtheaboveconstraints,Algorithm3willworkthesameandtheconclusion andallthedetailsoftheproofof Theorem 2 canremain thesame. 47 Remark3. Infact,Theorem2,alsoholdsforcertainkindsofnon-i.i.d. randomvariables X i ,1≤ i≤ N that satisfythe condition thatE[X i (t)|X i (1),...,X i (t−1)] = θ i ,∀1≤ i≤N. ThisisbecausetheChernoff-Hoeffdingboundusedintheregretanalysisrequires onlythisconditiontohold 3 . 4.5 Applications Wenow describesomeapplicationsand extensionsoftheLLR policyfor combinatorial networkoptimizationingraphswheretheedgeweightsareunknownrandomvariables. 4.5.1 MaximumWeightedMatching MaximumWeightedMatching(MWM)problemsarewidelyusedinthemanyoptimiza- tion problems in wireless networks such as the prior work in [16,21]. Given any graph G = (V,E),thereisaweightassociatedwitheachedgeandtheobjectiveistomaximize thesumweightsofamatchingamongallthematchingsinagivenconstraintset,i.e.,the generalformulationforMWMproblemis max R MWM a = |E| X i=1 a i W i s.t. aisamatching (4.22) whereW i istheweightassociatedwitheachedgei. 3 Thisdoesnot,however,includeMarkovchainsforwhichwehaveobtainedsomeweakerregretresults inChapter5andChapter6. 48 In many practical applications, the weights are unknown random variables and we need to learn by selecting different matchings over time. This kind of problem fits the generalframeworkofourproposedpolicyregardingtherewardasthesumweightanda matching as an action. Our proposed LLR policy is a solution with linear storage, and theregretpolynomialinthenumberofedges,andlogarithmicintime. 
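To make the LLR bookkeeping and decision step concrete for these applications, here is a minimal Python sketch (illustrative only, not the thesis code). The callables solve and play are hypothetical placeholders: solve(weights) is the application-specific deterministic solver returning an action in F that maximizes the weighted sum (e.g., a maximum weighted matching routine), and play(action) is the environment that reveals the selected variables. Initialization is handled here by an infinite index on unobserved variables, a simplification of the explicit initialization loop of Algorithm 3.

import math

class LLR:
    def __init__(self, N, L):
        self.N, self.L = N, L          # L = max_a |A_a| if known, otherwise N
        self.theta_hat = [0.0] * N     # per-variable sample means
        self.m = [0] * N               # per-variable observation counts
        self.n = 0                     # time-slot counter

    def step(self, solve, play):
        self.n += 1
        # index of Eq. (4.3): sample mean plus exploration bonus sqrt((L+1) ln n / m_i);
        # unobserved variables get an infinite index, forcing initial exploration
        idx = [self.theta_hat[i] + math.sqrt((self.L + 1) * math.log(self.n) / self.m[i])
               if self.m[i] > 0 else float("inf")
               for i in range(self.N)]
        action = solve(idx)                    # deterministic combinatorial maximization
        for i, x in play(action).items():      # only the selected variables are revealed
            self.m[i] += 1
            self.theta_hat[i] += (x - self.theta_hat[i]) / self.m[i]   # updates (4.4)-(4.5)
        return action

The same loop covers all of the combinatorial structures discussed in this section; only the solve oracle changes.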
Since there are various algorithms to solve the different variations in the maximum weightedmatchingproblems,suchastheHungarianalgorithmforthemaximumweighted bipartite matching [50], Edmonds’s matching algorithm [31] for a general maximum matching. Inthesecases,thecomputationtimeisalsopolynomial. Herewepresentageneralproblemofmultiuserchannelallocationsincognitiveradio network. ThereareM secondaryusersandQorthogonalchannels. Eachsecondaryuser requires a singlechannel for operation that does not conflict with the channels assigned to the other users. Due to geographic dispersion, each secondary user can potentially see different primary user occupancy behavior on each channel. Time is divided into discrete decision rounds. The throughput obtainable from spectrum opportunities on each user-channel combination over a decision period is denoted as S i,j and modeled as an arbitrarily-distributed random variable with bounded support but unknown mean, i.i.d. over time. This random process is assumed to have a mean θ i,j that is unknown to the users. The objective is to search for an allocation of channels for all users that maximizestheexpectedsumthroughput. 49 S1 P1 P2 S2 0.9 0.2 0.3 0.8 S2 C1 C2 S1 Figure4.1: Anillustrativescenario. Assuming an interference model whereby at most one secondary user can derive benefit from anychannel, ifthenumberofchannels is greater than thenumberofusers, anoptimalchannelallocationemploysaone-to-onematchingofuserstochannels,such thattheexpectedsum-throughputismaximized. Figure4.1illustratesasimplescenario. Therearetwosecondaryusers(i.e.,links)S1 andS2,thatareeachassumedtobeininterferencerangeofeachother. S1isproximateto primaryuserP1whoisoperatingonchannel1. S2isproximatetoprimaryuserP2whois operatingonchannel2. ThematrixshowsthecorrespondingΘ,i.e.,thethroughputeach secondary user could derive from being on the corresponding channel. In this simple example, the optimal matching is for secondary user 1 to be allocated channel 2 and user 2 to be allocated channel 1. Note, however, that, in our formulation, the users are not a priori aware of the matrix of mean values, and therefore must follow a sequential learningpolicy. Notethatthisproblemcanbeformulatedasamulti-armedbanditswithlinearregret, inwhich each action correspondsto amatchingoftheusers to channels,and thereward corresponds to the sum-throughput. In this channel allocation problem, there isM ×Q unknown random variables, and the number of actions are P(Q,M), which can grow 50 exponentially in the number of unknown random variables. Following the convention, instead of denoting the variables as a vector, we refer it as a M by Q matrix. So the rewardaseachtimeslotbychoosingapermutationaisexpressedas: R a = M X i=1 Q X j=1 a i,j S i,j (4.23) wherea∈F,F isasetwithallpermutations,whichisdefinedas: F ={a :a i,j ∈{0,1},∀i,j∧ Q X i=1 a i,j = 1∧ Q X j=1 a i,j = 1}. (4.24) WeusetwoM byQmatricestostoretheinformationafterweplayanactionateach timeslot. Oneis( ˆ θ i,j ) M×Q inwhich ˆ θ i,j istheaverage(samplemean)ofalltheobserved values of channel j by user i up to the current time slot (obtained through potentially different sets of actions over time). The other one is (m i,j ) M×Q in which m i,j is the numberoftimesthatchannelj hasbeenobservedbyuseriuptothecurrenttimeslot. ApplyingAlgorithm3,wegetalinearstoragepolicyforwhich( ˆ θ i,j ) M×Q and(m i,j ) M×Q arestoredandupdatedateachtimeslot. Theregretispolynomialinthenumberofusers and channels, and logarithmicin time. 
Also, the computationtimefor the policyis also polynomial since (4.3) in Algorithm 3 now becomes the following deterministic maxi- mumweightedbipartitematchingproblem: argmax a∈F X (i,j)∈Aa ˆ θ i,j + s (L+1)lnn m i,j ! (4.25) 51 on the bipartite graph of users and channels with edge weights ˆ θ i,j + q (L+1)lnn m i,j . It could be solved with polynomial computation time (e.g., using the Hungarian algo- rithm[50]). NotethatL = max a |A a | = min{M,Q}forthisproblem,whichislessthan M ×Q so that the bound of regret is tighter. The regret is O(min{M,Q} 3 MQlogn) followingTheorem2. 4.5.2 ShortestPath Shortest Path (SP) problem is another example where the underlying deterministic op- timization can be done with polynomial computation time. If the given directed graph is denoted as G = (V,E) with the source node s and the destination node d, and the cost (e.g., thetransmissiondelay)associated withedge (i,j) is denotedasD i,j ≥ 0, the objectiveisfindthepathfromstodwiththeminimumsumcost,i.e., min C SP a = X (i,j)∈E a i,j D i,j (4.26) s.t. a i,j ∈{0,1},∀(i,j)∈E (4.27) ∀i, X j a i,j − X j a j,i = 1 : i =s −1 : i =t 0 : otherwise (4.28) where equation (4.27) and (4.28) defines a feasible setF, such thatF is the set of all possiblepathesfromstod. When(D ij )arerandomvariableswithboundedsupportbut 52 unknown mean, i.i.d. over time, an dynamic learning policy is needed for this multi- armedbanditformulation. Note that corresponding to the LLR policy with the objective to maximize the re- wards, a direct variation of it is to find the minimum linear cost defined on finite con- straintsetF, by changingthe maximizationproblemin to a minimizationproblem. For clarity, this straightforward modification of LLR is shown below in Algorithm 4, which werefertoasLearningwithLinearCosts(LLC). Algorithm4LearningwithLinearCost(LLC) 1: // INITIALIZATION PART IS SAME AS IN ALGORITHM 3 2: // MAIN LOOP 3: while1do 4: n =n+1; 5: Playanactionawhichsolvestheminimizationproblem a = argmin a∈F X i∈Aa a i ˆ θ i − s (L+1)lnn m i ; (4.29) 6: Update( ˆ θ i ) 1×N ,(m i ) 1×N accordingly; 7: endwhile LLC (Algorithm4) is a policy for a general multi-armed bandit problem with linear costdefinedonanyconstraintset. ItisdirectlyderivedfromtheLLRpolicy(Algorithm 3),soTheorem2alsoholdsforLLC,wheretheregretisdefinedas: R φ n (Θ) =E φ [ n X t=1 C φ(t) (t)]−nC ∗ (4.30) whereC ∗ representstheminimumcost,whichiscostoftheoptimalaction. 53 UsingtheLLCpolicy,wemapeachpathbetweensandtasanaction. Thenumberof unknownvariablesare|E|,whilethenumberofactionscouldgrowexponentiallyinthe worstcase. SincethereexistpolynomialcomputationtimealgorithmssuchasDijkstra’s algorithm [29] and Bellman-Ford algorithm [18,32] for the shortest path problem, we could apply these algorithms to solve (4.29) with edge cost ˆ θ i − q (L+1)lnn m i . LLC is thus an efficient policy to solve the multi-armed bandit formulation of the shortest path problem with linear storage, polynomial computation time. Note thatL = max a |A a | = |E|. RegretisO(|E| 4 lnn). AnotherrelatedproblemistheShortestPathTree(SPT),whereproblemformulation is similar, and the objective is to find a subgraph of the given graph with the minimum totalcostbetweenaselectedrootsnodeandallothernodes. Itisexpressedas[17,47]: min C SPT a = X (i,j)∈E a i,j D i,j (4.31) s.t. a i,j ∈{0,1},∀(i,j)∈E (4.32) X (j,i)∈BS(i) a j,i − X (i,j)∈FS(i) a i,j = −n+1 : i =s 1 : i∈V/{s} (4.33) where BS(i) = {(u,v) ∈ E : v = i}, FS(i) = {(u,v) ∈ E : u = i}. (4.32) and (4.33) defines the constraint setF. 
We can also use the polynomial computation time algorithms such as Dijkstra’s algorithm and Bellman-Ford algorithm to solve (4.29) for theLLCpolicy. 54 NaivePolicy LLR MaximumWeighted O(|F|logn) O(min{M,Q} 3 MQlogn) Matching where|F| =P(Q,M)(whenQ≥M) or|F| =P(Q,M)(whenQ<M) ShortestPath O(|F|logn) O(|E| 4 logn) (CompleteGraph) where|F| = |V|−2 P k=0 (|V|−2)! k! and |V|(|V|−1) 2 =|E| MinimumSpanning O(|F|logn) O(|E| 4 logn) Tree(CompleteGraph) where|F| =|V| |V|−2 and |V|(|V|−1) 2 =|E| Table4.2: Comparisonofregretbounds. 4.5.3 MinimumSpanningTree MinimumSpanningTree(MST)isanothercombinatorialoptimizationwithpolynomial computation time algorithms, such as Prim’s algorithm [71] and Kruskal’s algorithm [49]. TheobjectivefortheMSTproblemcanbesimplypresentedas min a∈F C MST a = X (i,j)∈E a i,j D i,j (4.34) whereF isthesetofallspanningtreesinthegraph. WiththeLLCpolicy,eachspanningtreeistreatedasanaction,andL =|E|. Regret boundalsogrowsasO(|E| 4 logn). To summarize, we show in Table 4.2 a side by side comparison for the bipartite matching, shortest paths and spanning tree problems. For the matching problem, the graph is already restricted to bipartite graphs. The problem of counting the number of 55 pathsonagraphisknowntobe#-Pcomplete,sothereisnoknownsimpleformulafora general setting. Similarly,we are not aware of any formulasfor counting thenumberof spanningtreesonageneralgraph. Forthisreason,forthelattertwoproblems,wepresent comparativeanalyticalboundsforthespecialcaseofthecompletegraph,whereaclosed form expression for number of paths can be readily obtained, and Cayley’s formula can beusedforthenumberofspanningtrees[22]. 4.6 ExamplesandSimulationResults 0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 2 x 10 6 0 500 1000 1500 2000 2500 Time Regret/Log(t) Naive Policy LLR Policy Figure4.2: Simulationresultsofasystemwith7orthogonalchannelsand 4users. We present in these section the numerical simulation results with the example of multiuserchannelallocationsincognitiveradionetwork. Fig 4.2 shows the simulation results of using LLR policy compared with the naive policyin4.3.1. WeassumethatthesystemconsistsofQ = 7orthogonalchannelsinand M = 4 secondary users. The throughput{S i,j (t)} t≥1 for the user-channel combination 56 0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 2 x 10 6 0 0.5 1 1.5 2 2.5 3 x 10 4 Time Regret/Log(t) Naive Policy LLR Policy Figure4.3: Simulationresultsofasystemwith9orthogonalchannelsand 5users. is an i.i.d. Bernoulli process with mean θ i,j ((θ i,j ) is unknown to the players) shown as below: (θ i,j ) = 0.3 0.5 0.9 0.7 0.8 0.9 0.6 0.2 0.2 0.3 0.4 0.5 0.4 0.5 0.8 0.6 0.5 0.4 0.7 0.2 0.8 0.9 0.2 0.2 0.8 0.3 0.9 0.6 (4.35) where the components in the box are in the optimal action. Note that P(7,4) = 840 while 7×4 = 28, so the storage used for the naive approach is 30 times more than the LLR policy. Fig 4.2 showsthe regret (normalized withrespect to thelogarithmof time) over time for the naive policy and the LLR policy. We can see that under both policies theregretgrowslogarithmicallyintime. Buttheregretforthenaivepolicyisalothigher thanthatoftheLLRpolicy. Fig 4.3 is another example of the case whenQ = 9 and M = 5. The throughput is alsoassumedtobeani.i.d. Bernoulliprocess,withthefollowingmean: 57 (θ i,j ) = 0.3 0.5 0.9 0.7 0.8 0.9 0.6 0.8 0.7 0.2 0.2 0.3 0.4 0.5 0.4 0.5 0.6 0.9 0.8 0.6 0.5 0.4 0.7 0.2 0.8 0.2 0.8 0.9 0.2 0.2 0.8 0.3 0.9 0.6 0.5 0.4 0.6 0.7 0.5 0.7 0.6 0.8 0.2 0.6 0.8 . 
(4.36) For this example, P(9,5) = 15120, which is much higher than 9× 5 = 45 (about 336 times higher), so the storage used by the naive policy grows much faster than the LLR policy. Comparing with the regrets shown in Table 4.3 for both examples when t = 2×10 6 ,wecanseethattheregretalsogrowsmuchfasterforthenaivepolicy. NaivePolicy LLR 7channels,4users 2443.6 163.6 9channels,5users 24892.6 345.2 Table4.3: Regretwhent = 2×10 6 . 4.7 KSimultaneousActions The reward-maximizing LLR policy presented in Algorithm 3 and the corresponding cost-minimizingLLCpolicypresentedinAlgorithm4canalsobeextendedtothesetting whereK actions are played at each timeslot. Thegoal is to maximizethe totalrewards (or minimize the total costs) obtained by these K actions. For brevity, we only present 58 the policy for the reward-maximization problem; the extension to cost-minimization is straightforward. The modified LLR-K policy for picking the K best actions are shown inAlgorithm5. Algorithm5LearningwithLinearRewardswhileselectingK actions(LLR-K) 1: // INITIALIZATION PART IS SAME AS IN ALGORITHM 3 2: // MAIN LOOP 3: while1do 4: n =n+1; 5: Playactions{a} K ∈F withK largestvaluesin(4.37) X i∈Aa a i ˆ θ i + s (L+1)lnn m i ; (4.37) 6: Update( ˆ θ i ) 1×N ,(m i ) 1×N forallactionsaccordingly; 7: endwhile Theorem 3. The expected regret under the LLR-K policy withK actions selection is at most 4a 2 max L 2 (L+1)N lnn (Δ min ) 2 +N + π 2 3 LK 2L N Δ max . (4.38) Proof. The proof is similar to the proof of Theorem 2, but now we have a set of K actions with K largest expected rewards as the optimal actions. We denote this set as A ∗ ={a ∗,k ,1≤ k ≤ K} wherea ∗,k is theaction withk-th largest expected reward. As in the proof of Theorem 2, we define e T i (n) as a counter when a non-optimal action is playedinthesameway. Equation(4.43),(4.9),(4.10)and(4.12)stillhold. Notethateachtimewhen e I i (t) = 1,thereexistssomeactionsuchthatanon-optimal action is picked for which m i is the minimum in this action. We denote this action as a(t). Notethata(t) meansthereexistsm,1≤m≤ K,suchthatthefollowingholds: 59 e T i (n)≤ l+ n X t=N 1{ X j∈A a ∗,m a ∗,m j ( ˆ θ j,m j (t) +C t,m j (t) ) ≤ X j∈A a(t) a j (t)( ˆ θ j,m j (t) +C t,m j (t) ), e T i (t)≥ l}. (4.39) SinceateachtimeK actionsareplayed,soattimet,arandomvariablecouldbeobserved uptoKttimes. Then(4.13)shouldbemodifiedas: e T i (n)≤l+ ∞ X t=1 Kt X m h 1 =1 ··· Kt X m h |A ∗,m | =1 Kt X mp 1 =l ··· Kt X mp |A a(t) | =l 1{ |A a ∗,m| X j=1 a ∗,m h j ( ˆ θ h j ,m h j +C t,m h j )≤ |A a(t) | X j=1 a p j (t)( ˆ θ p j ,mp j +C t,mp j )}. (4.40) Equation(4.13)to(4.19)aresimilarbysubstitutinga ∗ witha ∗,m . So,wehave: E[ e T i (n)]≤ 4(L+1)lnn Δ min Lamax 2 + ∞ X t=1 Kt X m h 1 =1 ··· Kt X m h |A ∗ | =1 Kt X mp 1 =l ··· Kt X mp |A a(t) | =l 2Lt −2(L+1) ≤ 4a 2 max L 2 (L+1)lnn (Δ min ) 2 +1+ π 2 3 LK 2L . Hence,wegettheupperboundfortheregretas: R φ n (Θ)≤ 4a 2 max L 2 (L+1)N lnn (Δ min ) 2 +N + π 2 3 LK 2L N Δ max . 60 4.8 LLRwithApproximationAlgorithm One interesting question arises in the context of NP-hard combinatorial optimization problems,whereeventhedeterministicversionoftheproblemcannotbesolvedinpoly- nomial time with known algorithms. In such cases, if only an approximation algorithm withsomeknownapproximationguaranteeisavailable,whatcanbesaidabouttheregret bound? Forsuchsettings,letusconsiderthatafactor-β approximationalgorithm(i.e.,which for a maximizationproblem yields a solutionsthat havereward more than OPT β ) is used to solvethe maximizationstep in (4.3) in our LLR algorithm 3. 
Accordingly, we define an β-approximate action to be an action whose expected reward is within a factor β of that of the optimal action, and all other actions as non-β-approximate. Now we define β-approximationregret asfollows: R β,φ n (Θ) =E[totalnumberoftimesnon-β-approximateactions areplayedbystrategyφinntimeslots] =E[ X a:aisnotaβ-approximate action m a (n)] (4.41) where m a (n) is the total number of time that a has been played up to time n. We define Δ β min as the minimum distance between an β-approximate action and a non-β- approximateaction. WeassumeΔ β min > 0. WehavethefollowingtheoremregardingLLRwithaβ-approximationalgorithm. 61 Theorem 4. Theβ-approximation regret under the LLR policy with aβ-approximation algorithmisat most 4a 2 max L 2 (L+1)N lnn Δ β min 2 +N + π 2 3 LN (4.42) Proof. We modify the proof of Theorem 2 to show Theorem 4. We replace “optimal action”to“β-approximateaction”,and“non-optimalaction”to“non-β-approximateac- tion”everywhereshownintheproofofTheorem2,and westilldefineavirtualcounter ( e T i (n)) 1×N in a similar way. We still use * to refer to an optimal action. So (4.43) becomes, X a:aisanon-β-approximate action E[T a (n)] = N X i=1 E[ e T i (n)]. (4.43) Now we note that for LLR with a β-approximation algorithm, when e I i (t) = 1, a non-β-approximateactiona(t) has been pickedfor whichm i = min j {m j :∀j ∈A a(t) }. Defines max (t)istheoptimalsolutionfor(4.3)inAlgorithm3. Then,wehave X j∈A a(t) a j (t)( ˆ θ j,m j (t−1) +C t−1,m j (t−1) )≥ 1 β s max (t) ≥ 1 β X j∈A a ∗ a ∗ j ( ˆ θ j,m j (t−1) +C t−1,m j (t−1) ) (4.44) So, 62 e T i (n)≤l+ n X t=N+1 1{ 1 β X j∈A a ∗ a ∗ j ( ˆ θ j,m j (t−1) +C t−1,m j (t−1) ) ≤ X j∈A a(t) a j (t)( ˆ θ j,m j (t−1) +C t−1,m j (t−1) ), e T i (t−1)≥l} (4.45) Withasimilaranalysis,asin(4.11)to(4.13),wehave e T i (n)≤l+ ∞ X t=1 t X m h 1 =1 ··· t X m h |A ∗ | =1 t X mp 1 =l ··· t X mp |A a(t+1) | =l 1{ 1 β |Aa∗| X j=1 a ∗ h j ( ˆ θ h j ,m h j +C t,m h j )≤ |A a(t+1) | X j=1 a p j (t+1)( ˆ θ p j ,mp j +C t,mp j )} Nowwenotethat 1 β |Aa∗| P j=1 a ∗ h j ( ˆ θ h j ,m h j +C t,m h j )≤ |A a(t+1) | P j=1 a p j (t+1)( ˆ θ p j ,mp j +C t,mp j ) impliesthatatleastoneofthefollowingmustbetrue: 1 β |Aa∗| X j=1 a ∗ h j ˆ θ h j ,m h j ≤ 1 β R ∗ − 1 β |Aa∗| X j=1 a ∗ h j C t,m h j , (4.46) |A a(t+1) | X j=1 a p j (t+1) ˆ θ p j ,mp j ≥R a(t+1) + |A a(t+1) | X j=1 a p j (t+1)C t,mp j , (4.47) 1 β R ∗ <R a(t+1) +2 |A a(t+1) | X j=1 a p j (t+1)C t,mp j . (4.48) (4.46) and (4.47) are equivalent to (4.13) and (4.14), and we note that for (4.48), whenl≥ 4(L+1)lnn Δ β min Lamax ! 2 , 63 1 β R ∗ −R a(t+1) −2 |A a(t+1) | X j=1 a p j (t+1)C t,mp j ≥ 1 β R ∗ −R a(t+1) −Δ β min = 0. (4.49) Therefore,(4.41)stillholds,andwehavetheupperboundforβ-approximationregret as, R β,LLR n (Θ)≤ 4a 2 max L 2 (L+1)N lnn Δ β min 2 +N + π 2 3 LN 4.9 Summary Inthischapter,wehaveconsideredmulti-armedbanditproblemsinwhichateachtimean arbitrarilyconstrainedsetofrandomvariablesareselected,theselectedvariablesarere- vealed,andatotalrewardthatisalinearfunctionoftheselectedvariablesisyielded. For such problems, existing single-play MAB policies such as the well-known UCB1 [13] can be utilized, but have poor performance in terms of storage, computation, and re- gret. 
The LLR and LLR-K policies we have presented are smarter in that they store and make decisions at each time based on the stochastic observations of the underlying unknown-mean random variables alone; they require only linear storage and result in a regretthatisboundedbyapolynomialfunctionofthenumberofunknown-meanrandom variables. If the deterministic version of the corresponding combinatorial optimization 64 problem can be solvedin polynomialtime, ourpolicywill also require onlypolynomial computationper step. We have showna numberof problemsin thecontext of networks where this formulation would be useful, including maximum-weightmatching, shortest path and spanning tree computations. For the case where the deterministic version is NP-hard, one has often at hand a polynomial-timeapproximation algorithm. In section 4.8, we show that under a suitably relaxed definition of regret, the LLR algorithm can alsoemploysuchanapproximationtogiveprovableperformance. 65 Chapter5 LearningwithRestedMarkovianRewards 5.1 Overview Weconsider 1 acombinatorialgeneralizationoftheclassicalmulti-armedbanditproblem that is defined as follows. There is a given bipartite graph of M users and N ≥ M resources. For each user-resource pair (i,j), there is an associated state that evolves as an aperiodic irreducible finite-state Markov chain with unknown parameters, with transitions occurring each time the particular user i is allocated resource j. The user i receives a reward that depends on the corresponding state each time it is allocated the resourcej. Thesystemobjectiveistolearnthebestmatchingofuserstoresourcessothat the long-term sum of the rewards received by all users is maximized. This corresponds to minimizing (single-action) regret, defined here as the gap between the expected total reward that can be obtained by the best-possible static matching and the expected total reward that can beachievedby a givenalgorithm. We present apolynomial-storageand 1 Thischapterisbasedinparton[37]. 66 polynomial-complexity-per-stepmatching-learningalgorithmforthisproblem. Weshow that this algorithm can achieve a regret that is uniformly arbitrarily close to logarithmic intimeandpolynomialinthenumberofusersandresources. 5.2 ProblemFormulation We consider a bipartite graph withM users andN ≥ M resources predefined by some application. Time is slotted and is indexed byn. At each decision period (also referred to interchangeably as time slot), each of the M users is assigned a resource with some policy. For each user-resource pair (i,j), there is an associated statethat evolves as an ape- riodic irreducible finite-state Markov chain with unknown parameters. When user i is assigned resource j, assuming there are no other conflicting users assigned this re- source, i is able to receive a reward that depends on the corresponding state each time it is allocated the resource j. The state space is denoted by S i,j = {z 1 ,z 2 ,...,z |S i,j | }. The state of the Markov chain for each user-resource pair (i,j) evolves only when re- sourcej isallocated to useri. WeassumetheMarkovchains fordifferentuser-resource pairs are mutually independent. The reward got by user i while allocated resource j on state z ∈ S i,j is denoted by θ i,j z , which is also unknown to the users. We denote by P i,j ={p i,j (z a ,z b )} za,z b ∈S i,j thetransitionprobabilitymatrixfortheMarkovchain(i,j). Denote by π i,j z the steady state distribution for state z. The mean reward got by user i 67 on resourcej is denoted byμ i,j . 
Then we haveμ i,j = P z∈S i,j θ i,j z π i,j z . The set of all mean rewardsisdenotedbyμ ={μ i,j }. WedenotebyY i,j (n)theactualrewardobtainedbyauseriifitisassignedresource j at timen. We assumethatY i,j (n) = θ i,j z(n) , if useri is the only occupant of resourcej at timen wherez(n) is the state of Markov chain associated with (i,j) at timen. Else, if multiple users are allocated resource j, then we assume that, due to interference, at most oneof the conflicting usersj ′ gets rewardY i,j ′(n) = θ i,j ′ z ′ (n) wherez ′ (n) is thestate of Markov chain associated with (i,j ′ ) at timen, while the otherusers on the resources j 6= j ′ get zero reward, i.e., Y i,j (n) = 0. This interference model covers scenarios in manynetworkingsettings. A deterministic policy φ(n) at each time is defined as a map from the observation history{O t } n−1 t=1 to a vector of resourceso(n) to be selected at period n, whereO t is the observation at time t; the i-th element in o(n), o i (n), represents the resource al- location for user i. Then the observation history{O t } n−1 t=1 in turn can be expressed as {o i (t),Y i,o i (t) (t)} 1≤i≤M,1≤t<n . Due to the fact that allocating more than one user to a resource is always worse than assigning each a different resource in terms of sum-throughput, we will focus on collision-freepoliciesthat assignallusers distinctresources, whichwewillrefer to as a permutationormatching. ThereareP(N,M)suchpermutations. We formulate our problem as a combinatorial multi-armed bandit, in which each arm corresponds to a matching of the users to resources. We can represent the arm 68 corresponding to a permutation k (1 ≤ k ≤ P(N,M)) as the index setA k = {(i,j) : (i,j)isinpermutationk}. The stochastic reward for choosing arm k at time n under policyφisthengivenas Y φ(n) (n) = X (i,j)∈A φ(n) Y i,j (n) = X (i,j)∈A φ(n) θ i,j z φ(n) . Note that different from most prior work on multi-armed bandits, this combinatorial formulationresultsindependenceacrossarmsthatsharecommoncomponents. Akeymetricofinterestinevaluatingagivenpolicyforthisproblemis(single-action) regret, which is defined as the difference between the expected reward that could be obtained by the best-possible static matching, and that obtained by the given policy. It canbeexpressedas: R φ (n) =nμ ∗ −E φ [ n X t=1 Y φ(t) (t)] =nμ ∗ −E φ [ n X t=1 X (i,j)∈A φ(t) θ i,j z φ(t) ], (5.1) where μ ∗ = max k P (i,j)∈A k μ i,j , the expected reward of the optimal arm, is the expected sum-weight of the maximum weight matching of users to resources with μ i,j as the weight. We are interested in designing policies for this combinatorial multi-armed bandit problemwithMarkovianrewardsthatperformwellwithrespecttoregret. Intuitively,we 69 would like theregretR φ (n) to be as small as possible. If it is sub-linearwith respect to timen,thetime-averagedregretwilltendtozero. 5.3 MatchingLearningforMarkovianRewards A straightforward idea for the combinatorial multi-armed bandit problem with Marko- vian rewards is to treat each matching as an arm, apply UCB1 policy (given by Auer et al.[13])directly,andignorethedependenciesacrossthedifferentarms. Foreacharmk, two variables are stored and updated: the time average of all the observation values of arm k and the number of times that arm k has been played up to the current time slot. TheUCB1policymakesdecisionsbasedonthisinformationalone. However,thereareseveralproblemsthatariseinapplyingUCB1directlyintheabove setting. WenotethatUCB1requiresboththestorageandcomputationtimethatarelinear inthenumberofarms. 
SincethenumberofarmsinthisformulationgrowsasP(N,M), it is highly unsatisfactory. Also, the upper-bound of regret given in [74] will not work anymoresincetherewardsacrossarmsarenotindependentanymoreandthestatesofan arm may involve even when this arm is not played. No existing analytical result on the upper-boundofregretcanbeapplieddirectlyinthissettingtothebestofourknowledge. So we are motivated to propose a policy which more efficiently stores observations from correlated arms and exploits the correlations to make better decisions. Our key ideais to usetwoM byN matrices, ( ˆ θ i,j ) M×N and (m i,j ) M×N , to storetheinformation for each user-resource pair, rather than for each arm as a whole. ˆ θ i,j is the average 70 Algorithm6MatchingLearningforMarkovianRewards(MLMR) 1: // INITIALIZATION 2: forp = 1toM do 3: forq = 1toN do 4: n = (M−1)p+q; 5: Play anypermutationk suchthat(p,q)∈A k ; 6: Update( ˆ θ i,j ) M×N ,(m i,j ) M×N accordingly. 7: endfor 8: endfor 9: // MAIN LOOP 10: while1do 11: n =n+1; 12: Solve the Maximum Weight Matching problem (e.g., using the Hungarian al- gorithm [50]) on the bipartite graph of users and resources with edge weights ˆ θ i,j + q Llnn m i,j M×N toplayarmk thatmaximizes X (i,j)∈A k ˆ θ i,j + s Llnn m i,j ! (5.2) whereLisapositiveconstant. 13: Update( ˆ θ i,j ) M×N ,(m i,j ) M×N accordingly. 14: endwhile (sample mean) of all the observed values of resource j by user i up to the current time slot(obtainedthroughpotentiallydifferentsetsofarmsovertime). m i,j isthenumberof timesthatresourcej hasbeenassignedtouseriuptothecurrenttimeslot. Ateachtimeslotn,afteranarmk isplayed,wegettheobservationofY i,j (n)forall (i,j)∈A k . Then ( ˆ θ i,j ) M×N and (m i,j ) M×N (both initializedto 0 at time0)are updated asfollows: ˆ θ i,j (n) = ˆ θ i,j (n−1)m i,j (n−1)+Y i,j (n) m i,j (n−1)+1 , if(i,j)∈A k ˆ θ i,j (n−1) , else (5.3) 71 m i,j (n) = m i,j (n−1)+1, if(i,j)∈A k m i,j (n−1) , else (5.4) Note that while we indicate the time index in the above updates for notational clar- ity, it is not necessary to store the matrices from previous time steps while running the algorithm. Our proposed policy, which we refer to as Matching Learning for Markovian Re- wards,isshowninAlgorithm6. 5.4 AnalysisofRegret WesummarizesomenotationweuseinthedescriptionandanalysisofourMLMRpolicy inTable5.1. Theregretofapolicyforamulti-armedbanditproblemistraditionallyupper-bounded byanalyzingtheexpectednumberoftimesthateachnon-optimalarmisplayedandthen taking the summation over these expectation times the reward difference between an optimal arm and a non-optimal arm all non-optimal arms. Although we could use this approachtoanalyzetheMLMRpolicy,wenoticethattheupper-boundforregretconse- quentlyobtainedisquiteloose,whichislinearinthenumberofarms,P(N,M). Instead, we present here a novel analysis for a tighter analysis of the MLMR policy. Our anal- ysis shows an upper bound of the regret that is polynomial inM andN, and uniformly logarithmicovertime. 72 N : numberofresources. M : numberofusers,M ≤N. k : indexofaparameterusedforanarm, 1≤k≤P(N,M). i,j : indexofaparameterusedforuseri,resourcej. A k : {(i,j) : (i,j)isinpermutationk} K i,j : {A k : (i,j)∈A k } ∗: indexindicatingthataparameterisforthe optimalarm. Iftherearemultipleoptimalarms, ∗refers toanyofthem. m i,j : numberoftimesthatresourcej hasbeen matchedwithuseriuptothecurrenttimeslot. ˆ θ i,j : average(samplemean)ofallobservedvalues ofresourcej byuseriuptocurrenttimeslot. m k i : m i,j suchthat (i,j)∈A k atcurrenttimeslot. 
S i,j : statespaceoftheMarkovchainfor user-resourcepair(i,j). P i,j : transitionmatrixoftheMarkovchain associatedwithuser-resourcepair(i,j). π i,j z : steadystatedistributionforstatez ofthe Markovchainassociatedwith (i,j). θ i,j z : rewardobtainedbyuseriwhileaccess resourcej onstatez∈S i,j . μ i,j : P z∈S i θ i,j z π i,j z ,themeanrewardforuseriusing resourcej μ k : P (i,j)∈A k μ i,j μ ∗ : max k P (i,j)∈A k μ i,j Δ k : μ ∗ −μ k . Δ min : min k:μ k <μ ∗ Δ k . Δ max : max k Δ k . π min : min 1≤i≤M,1≤j≤N,z∈S i,j π i,j z . s max : max 1≤i≤M,1≤j≤N |S i,j |. s min : min 1≤i≤M,1≤j≤N |S i,j |. 73 θ max : max 1≤i≤M,1≤j≤N,z∈S i,j θ i,j z . θ min : min 1≤i≤M,1≤j≤N,z∈S i,j θ i,j z . ǫ i,j : eigenvaluegap,definedas 1−λ 2 ,whereλ 2 isthesecondlargesteigenvalueofP i,j . ǫ max : max 1≤i≤M,1≤j≤N ǫ i,j . ǫ min : min 1≤i≤M,1≤j≤N ǫ i,j . T k (n): numberoftimesarmk hasbeenplayedby MLMRinthefirstntimeslots. ˆ θ k (n): P (i,j)∈A k ˆ θ i,j (n). Itisthesummationofallthe averageobservationvaluesinarmk attimen. ˆ θ k i,m k i : ˆ θ i,j (n)suchthat (i,j)∈A k andm i,j (n) =m k i . ˆ θ k,m k 1 ,...,m k M : M P i=1 ˆ θ k i,m k i . Table5.1: Notation. ThefollowinglemmasareneededforourmainresultsinTheorem5: Lemma2. (Lemma2.1from[10]){X n ,n = 1,2,...}isanirreducibleaperiodicMarkov chain with state space S, transition matrix P, a stationary distribution π z , ∀z ∈ S, and an initial distribution q. Let F t be the σ-algebra generated by X 1 ,X 2 ,...,X t . Let G be a σ-algebra independent of F = ∨ t≥1 F t . Let τ be a stopping time with respect to the increasing family of σ-algebra G∨F t ,t≥ 1. Define N(z,τ) such that N(z,τ) = τ P t=1 I(X t =z). Then, |E[N(z,τ)−π z E[τ]]|≤A P , (5.5) forallqand allτ suchthatE[τ]<∞. A P isa constantthatdependsonP. 74 Lemma3. (Corollary1from[74])Letπ min betheminimumvalueamongthestationary distribution,whichis definedasπ min = min z∈S π z . ThenA P ≤ 1/π min . Lemma4. Foruser-resourcematching,ifthestateofrewardassociatedwitheachuser- resource pair (i,j) is given by a Markov chain, denoted by{X i,j 1 ,X i,j 2 ,...}, satisfying thepropertiesofLemma 2, thentheregret underpolicyφis boundedby: R φ (n)≤ P(N,M) X k=1 (μ ∗ −μ k )E φ [T φ k (n)]+A S,P,Θ , (5.6) whereA S,P,Θ isaconstantthatdependsonallthestatespaces{S i,j } 1≤i≤M,1≤i≤N ,transi- tionprobabilitymatrices{P i,j } 1≤i≤M,1≤i≤N andtherewardsset{θ i,j z ,z∈S i,j } 1≤i≤M,1≤i≤N . Proof. ∀1 ≤ i ≤ M, 1 ≤ j ≤ N, define G i,j = ∨ k6=i,l6=j F k,l where F k,l = ∨ t≥1 F i,j t , which applies to the Markove chain{X i,j 1 ,X i,j 2 ,...}. We note that the Markove chains ofdifferentuser-resourcepairsaremutuallyindependent,so∀i,j,G i,j isindependentof F i,j . F i,j satisfies the conditions in Lemma 2. Note that T φ i,j (n) is a stopping time with respectto{G i,j ∨F i,j n ,n> 1}. SincethestateofaMarkovechainevolvesonlywhenitisobserved,X i,j 1 ,...,X i,j T φ i,j (n) represents the successive states of the Markov chain up ton when assigning resource j touseri.Thenthetotalrewardobtainedunderpolicyφuptotimenisgivenby: n X t=1 Y φ(t) (t) = N X j=1 M X i=1 T φ i,j (n) X l=1 X z∈S i,j θ i,j z 1(X i,j l =z). (5.7) 75 Notethat∀i = 1,...,M,T φ k (n) =T φ(n),i k whereT φ(n),i k isthenumberoftimesupto n that the i-th component has been observed while playing arm k, and there exists one resourceindexj suchthat(i,j)∈A k . 
So,wehave: P(N,M) X k=1 μ k E φ [T φ k (n)] = P(N,M) X k=1 M X i=1 μ k i E φ [T φ k (n)] = P(N,M) X k=1 M X i=1 μ k i E φ [T φ,i k (n)] = N X j=1 M X i=1 μ i,j X A k ∈K i,j E φ [T φ,i k (n)] = N X j=1 M X i=1 μ i,j E φ [T φ i,j (n)] = N X j=1 M X i=1 X z∈S i,j θ i,j z π i,j z E φ [T φ i,j (n)]. Hence, |R φ (n)− P(N,M) X k=1 (μ ∗ −μ k )E φ [T φ k (n)]| = R φ (n)−(nμ ∗ − P(N,M) X k=1 μ k E φ [T φ k (n)]) = (nμ ∗ −E φ [ n X t=1 Y φ(t) (t)])−(nμ ∗ − P(N,M) X k=1 μ k E φ [T φ k (n)]) = E φ [ n X t=1 Y φ(t) (t)]− P(N,M) X k=1 μ k E φ [T φ k (n)] = E φ [ N X j=1 M X i=1 T φ i,j (n) X l=1 X z∈S i,j θ i,j z I(X i,j l =z)]− N X j=1 M X i=1 X z∈S i,j θ i,j z π i,j z E φ [T φ i,j (n)] 76 ≤ N X j=1 M X i=1 X z∈S i,j |E φ [ T φ i,j (n) X l=1 θ i,j z I(X i,j l =z)]−θ i,j z π i,j z E φ [T φ i,j (n)]| = N X j=1 M X i=1 X z∈S i,j θ i,j z |E φ [ T φ i,j (n) X l=1 I(X i,j l =z)]−π i,j z E φ [T φ i,j (n)]| = N X j=1 M X i=1 X z∈S i,j θ i,j z E φ [N(z,T φ i,j (n))]−π i,j z E π [T φ i,j (n)] . BasedonLemma2,wehave: |R φ (n)− P(N,M) X k=1 (μ ∗ −μ k )E φ [T φ k (n)]|≤ N X j=1 M X i=1 X z∈S i,j θ i,j z C P i,j =A S,P,Θ . (5.8) Lemma5. (Theorem 2.1 from [39])Let{X n ,n = 1,2,...} be an irreducibleaperiodic MarkovchainwithfinitestatespaceS,transitionmatrixP,astationarydistributionπ z , ∀z ∈ S, and an an initialdistributionq. LetN q =||( qz πz ),z ∈ S|| 2 . The eigenvalue gap ǫ is defined as ǫ = 1−λ 2 , where λ 2 is the second largest eigenvalue of the matrixP. ∀A⊂ S, definet A (n) as the total number of times that all states in the setA are visited up totimen. Then∀γ ≥ 0, P(t A (n)−nπ A ≥γ)≤ (1+ γǫ 10n N q e −γ 2 ǫ/20n ), (5.9) whereπ A = P z∈A π z . 77 Our main results on the regret of MLMR policy are shown in Theorem 5. We show that with using a constant L which is bigger than a value determined by the minimum eigenvalue gap of the transition matrix, maximum value of the number of states, and maximumvalueoftherewards, ourMLMRpolicyisguaranteed toachievearegret that isuniformlylogarithmicintime,andpolynomialinthenumberofusersandresources. Theorem 5. When using any constantL ≥ (50+40M)θ 2 max s 2 max ǫ min , the expected regret under theMLMR policyspecified inAlgorithm6 isat most " 4M 3 NLlnn (Δ min ) 2 +MN +M 2 N s max π min 1+ ǫ max √ L 10s min θ min ! π 3 # Δ max +A S,P,Θ , (5.10) where Δ min , Δ max , π min , s max , s min ,θ max , θ min , ǫ max ,ǫ min follow the definition in Table 5.1;A S,P,Θ followsthedefinitionin Lemma 4. Proof. LetC t,n be q Llnt n . DenoteC t,n A k = P (i,j)∈A k q Llnt m i,j = M P i=1 q Llnt n k i = M P i=1 C t,n k i . It isalsodenotedbyC t,(n k 1 ,...,n k M ) sometimesforaclearexplanationinthisproof. We introduce e T i,j (n) as a counter after the initialization period. It is updated in the followingway: Ateachtimeslotaftertheinitializationperiod,oneofthetwocasesmusthappen: (1) anoptimalarmisplayed;(2)anon-optimalarmisplayed. Inthefirstcase,( e T i,j (n)) M×N won’t be updated. When an non-optimal arm k(n) is picked at timen, there must be at least one (i,j)∈A k such thatm i,j (n) = min (i 1 ,j 1 )∈A k m i 1 ,j 1 . If there is only one such arm, 78 e T i,j (n) is increased by 1. If there are multiple such arms, we arbitrarily pick one, say (i ′ ,j ′ ),andincrement e T i ′ j ′ by1. Eachtimewhenanon-optimalarmispicked,exactlyoneelementin( e T i,j (n)) M×N is incrementedby1. Thisimpliesthatthetotalnumberthatwehaveplayedthenon-optimal armsisequaltothesummationofallcountersin ( e T i,j (n)) M×N . Therefore,wehave X k:μ k <μ∗ E[T k (n)] = M X i=1 N X j=1 E[ e T i,j (n)]. 
(5.11) Alsonotefor e T i,j (n),thefollowinginequalityholds e T i,j (n)≤m i,j (n),∀1≤i≤M,1≤j ≤N. (5.12) Denoteby e I i,j (n)theindicatorfunctionwhichisequalto1if e T i,j (n)isaddedbyone attimen. Letl beanarbitrarypositiveinteger. Then e T i,j (n) = n X t=MN+1 1{ e I i,j (t)}≤l+ n X t=MN+1 1{ e I i,j (t), e T i,j (t−1)≥l} (5.13) where1(x) is the indicator function defined to be 1 when the predicate x is true, and 0 whenitisfalse. When e I i,j (t) = 1, there exists some arm such that a non-optimal arm is picked for which m i,j is the minimum in this arm. We denote this arm by k(t) since at each time that e I i,j (t) = 1,wemaygetdifferentarms. Then, 79 e T i,j (n)≤l+ n X t=MN+1 1{ ˆ θ ∗ (t−1)+C t−1,n ∗ (t−1) ≤ ˆ θ k(t−1) (t−1)+C t−1,n A k(t−1) (t−1) , e T i,j (t−1)≥l} =l+ n X t=MN 1{ ˆ θ ∗ (t)+C t,n ∗ (t) ≤ ˆ θ k(t) (t)+C t,n A k(t) (t) , e T i,j (t)≥l}. Based on(5.12),l≤ e T i,j (t)implies: l≤ e T i,j (t)≤m i,j (t) =m k(t) i . So, ∀1≤i≤M,m k(t) i ≥l. Thenwecanbound e T i,j (n)as, e T i,j (n)≤l+ n P t=MN 1{ min 0<m ∗ 1 ,...,m ∗ M ≤t ˆ θ ∗ m ∗ 1 ,...,m ∗ M +C t,(m ∗ 1 ,...,m ∗ M ) ≤ max l≤m k(t) 1 ,...,m k(t) M ≤t ˆ θ k(t),m k(t) 1 ,...,m k(t) M +C t,(m k(t) 1 ,...,m k(t) M ) } ≤l+ ∞ P t=1 [ t P m ∗ 1 =1 ··· t P m ∗ M =1 t P m k(t) 1 =l ··· t P m k(t) M =l 1{ ˆ θ ∗ m ∗ 1 ,...,m ∗ M +C t,(m ∗ 1 ,...,m ∗ M ) ≤ ˆ θ k(t),m k(t) 1 ,...,m k(t) M +C t,(m k(t) 1 ,...,m k(t) M ) }]. ˆ θ ∗ m ∗ 1 ,...,m ∗ M +C t,(m ∗ 1 ,...,m ∗ M ) ≤ ˆ θ k(t),m k(t) 1 ,...,m k(t) M +C t,(m k(t) 1 ,...,m k(t) M ) means that at least oneofthefollowingmustbetrue: 80 ˆ θ ∗ m ∗ 1 ,...,m ∗ M ≤μ ∗ −C t,(m ∗ 1 ,...,m ∗ M ) , (5.14) ˆ θ k(t),m k(t) 1 ,...,m k(t) M ≥μ k(t) +C t,(m k(t) 1 ,...,m k(t) M ) , (5.15) μ ∗ <μ k(t) +2C t,(m k(t) 1 ,...,m k(t) M ) . (5.16) HerewefirstfindtheupperboundforP{ ˆ θ ∗ m ∗ 1 ,...,m ∗ M ≤ μ ∗ −C t,(m ∗ 1 ,...,m ∗ M ) }: P{ ˆ θ ∗ m ∗ 1 ,...,m ∗ M ≤μ ∗ −C t,(m ∗ 1 ,...,m ∗ M ) }=P{ M P i=1 ˆ θ ∗ i,m ∗ i ≤ M P i=1 μ ∗ i − M P i=1 C t,n ∗ i } ≤ M P i=1 P{ ˆ θ ∗ i,m ∗ i ≤μ ∗ i −C t,n ∗ i }. (5.17) ∀1≤i≤M, P{ ˆ θ i,m ∗ i ≤μ ∗ i −C t,n ∗ i } =P{ |S ∗ i | X z=1 θ ∗ i (z)m ∗ i (z) m ∗ i ≤ |S ∗ i | X z=1 θ ∗ i (z)π ∗ i (z)−C t,n ∗ i } =P{ |S ∗ i | X z=1 (θ ∗ i (z)m ∗ i (z)−m ∗ i θ ∗ i (z)π ∗ i (z))≤−m ∗ i C t,n ∗ i } ≤P{Atleastoneofthefollowingmusthold: θ ∗ i (1)m ∗ i (1)−m ∗ i θ ∗ i (1)π ∗ i (1)≤− m ∗ i |S ∗ i | C t,n ∗ i , . . . θ ∗ i (|S ∗ i |)m ∗ i (|S ∗ i |)−m ∗ i θ ∗ i (|S ∗ i |)π ∗ i (|S ∗ i |)≤− m ∗ i |S ∗ i | C t,n ∗ i } ≤ |S ∗ i | X z=1 P{θ ∗ i (z)m ∗ i (z)−m ∗ i θ ∗ i (z)π ∗ i (z)≤− m ∗ i |S ∗ i | C t,n ∗ i } 81 = |S ∗ i | X z=1 P{m ∗ i (z)−m ∗ i π ∗ i (z)≤− m ∗ i |S ∗ i |θ ∗ i (z) C t,n ∗ i } = |S ∗ i | X z=1 P{(m ∗ i − X l6=z m ∗ i (l))−m ∗ i (1− X l6=z π ∗ i (z))≤− m ∗ i |S ∗ i |θ ∗ i (z) C t,n ∗ i } = |S ∗ i | X z=1 P{ X l6=z m ∗ i (l)−m ∗ i X l6=z π ∗ i (z)≥ m ∗ i |S ∗ i |θ ∗ i (z) C t,n ∗ i }. (5.18) ∀1≤ z ≤|S ∗ i |, applyingLemma5, wecan find theupperboundofeach probability in(5.18)as, P{ ˆ θ i,m ∗ i ≤μ ∗ i −C t,n ∗ i }≤ |S ∗ i | X z=1 1+ ǫ i,j 10|S ∗ i |θ ∗ i (z) s Llnt n ∗ i N q i,j e − m ∗ i Llntǫ i,j 20|S ∗ i | 2 θ ∗ i (z) 2 n ∗ i ≤ |S ∗ i | X z=1 1+ ǫ max √ Lt 10s min θ min ! N q i,j e − Llntǫ min 20s 2 max θ 2 max ≤ s max π min √ t 1+ ǫ max √ L 10s min θ min ! t − Lǫ min 20s 2 max θ 2 max (5.19) = s max π min 1+ ǫ max √ L 10s min θ min ! t − Lǫ min −10s 2 max θ 2 max 20s 2 max θ 2 max , where(5.19)holdssinceforanyq i,j , N q i,j = q i,j z π i,j z ,z∈S i,j 2 ≤ |S i,j | X z=1 q i,j z π i,j z 2 ≤ |S i,j | X z=1 kq i,j z k 2 π min = 1 π min . 
Thus, P{ ˆ θ ∗ m ∗ 1 ,...,m ∗ M ≤θ ∗ −C t,(m ∗ 1 ,...,m ∗ M ) }≤ Ms max π min 1+ ǫ max √ L 10s min θ min ! t − Lǫ min −10s 2 max θ 2 max 20s 2 max θ 2 max . 82 With the similar calculation, we can also get the upper bound of the probability for (5.15): P{ ˆ θ k(t),m k(t) 1 ,...,m k(t) M ≥μ k +C t,(m k(t) 1 ,...,m k(t) M ) } ≤ M X i=1 P{ ˆ θ k i,m k i ≥ μ k i +C t,n k i } = M X i=1 P{ |S k i | X z=1 θ k i (z)m k i (z) m k i ≥ |S k i | X z=1 θ k i (z)π k i (z)+C t,n k i } ≤ M X i=1 |S k i | X z=1 P{θ k i (z)m k i (z)−m k i θ k i (z)π k i (z)≥ m k i |S ∗ i | C t,n k i } = M X i=1 |S k i | X z=1 P{m k i (z)−m k i π k i (z)≥ m k i |S k i |θ k i (z) C t,n k i } ≤ M X i=1 s max π min 1+ ǫ max √ L 10s min θ min ! t − Lǫ min −10s 2 max θ 2 max 20s 2 max θ 2 max ≤ Ms max π min 1+ ǫ max √ L 10s min θ min ! t − Lǫ min −10s 2 max θ 2 max 20s 2 max θ 2 max . (5.20) Notethatforl≥ 4Llnn Δ k(t) M 2 , μ ∗ −μ k(t) −2C t,(m k(t) 1 ,...,m k(t) M ) =μ ∗ −μ k(t) −2 M X i=1 s Llnt n k(t) i ≥μ ∗ −μ k(t) −M s 4Llnn 4Llnn Δ k(t) M 2 =μ ∗ −μ k(t) −Δ k(t) = 0. (5.21) 83 (5.21) implies that condition (5.16) is false when l = 4Llnn Δ k(t) M 2 . If we let l = 4Llnn Δ i,j min M ! 2 ,then(5.16)isfalseforallk(t),1≤t≤∞where, Δ i,j min = min k {Δ k : (i,j)∈A k }. (5.22) Therefore, E[ e T i,j (n)]≤ 4Llnn Δ i,j min M 2 + ∞ X t=1 t X m ∗ 1 =1 ··· t X m ∗ 1 =M t X m k 1 =1 ··· t X m k 1 =M 2M s max π min 1+ ǫ max √ L 10s min θ min ! t − Lǫ min −10s 2 max θ 2 max 20s 2 max θ 2 max ! ≤ 4M 2 Llnn Δ i,j min 2 +1+M s max π min 1+ ǫ max √ L 10s min θ min ! ∞ X t=1 2t − Lǫ min −(40M+10)s 2 max θ 2 max 20s 2 max θ 2 max ≤ 4M 2 Llnn Δ i,j min 2 +1+M s max π min 1+ ǫ max √ L 10s min θ min ! ∞ X t=1 2t −2 (5.23) = 4M 2 Llnn Δ i,j min 2 +1+M s max π min 1+ ǫ max √ L 10s min θ min ! π 3 , where(5.23)holdssinceL≥ (50+40M)θ 2 max s 2 max ǫ min . SounderourMLMRpolicy, R φ (n)≤ P(N,M) X k=1 (μ ∗ −μ k )E π [T k π (n)]+A S,P,Θ = X k:θ k <θ∗ Δ k E[T k (n)]+A S,P,Θ 84 Wehave, R φ (n)≤ Δ max X k:θ k <θ∗ E[T k (n)]+A S,P,Θ = Δ max M X i=1 N X j=1 E[ e T i,j (n)]+A S,P,Θ ≤ " M X i=1 N X j=1 4M 2 Llnn Δ i,j min 2 +1+M s max π min 1+ ǫ max √ L 10s min θ min ! π 3 # Δ max +A S,P,Θ ≤ " 4M 3 NLlnn (Δ min ) 2 +MN +M 2 N s max π min 1+ ǫ max √ L 10s min θ min ! π 3 # Δ max +A S,P,Θ . (5.24) Theorem 5 shows when we use a constant L which is large enough such that L ≥ (50+40M)θ 2 max s 2 max ǫ min , the regret of Algorithm 6 is upper-bounded uniformly over time n by a function that grows asO(M 3 N lnn). However, whenθ max , s max orǫ min is unknown, theupperboundofregretcannotbeguaranteedtogrowlogarithmicallyinn. So when no knowledge about the system is available, we extend the MLMR policy to achieve a regret that is bounded uniformly over time n by a function that grows as O(M 3 NL(n)lnn), by using any arbitrarily slowly diverging non-decreasing sequence L(n) such that L(n) ≤ n for any n in Algorithm 6. Since L(n) can grow arbitrarily slowly, the MLMR can achieve a regret arbitrarily close to the logarithmic order. We presentouranalysisinTheorem6. 85 Theorem6. Whenusinganyarbitrarilyslowlydivergingnon-decreasingsequenceL(n) (i.e.,L(n)→∞ asn→∞) in (5.2) such that∀n,L(n)≤ n, the expected regret under theMLMR policyspecified inAlgorithm6 isat most 4M 3 NL(n)lnn (Δ min ) 2 +MNB S,P,Θ +M 2 N s max π min 1+ ǫ max 10s min θ min π 3 Δ max +A S,P,Θ , (5.25) whereB S,P,Θ isa constantthatdepends onθ max ,s max andǫ min . Proof. LetC t,n be q L(t)lnt n . LetC t,n A k be P (i,j)∈A k q L(t)lnt m i,j . 
ThenreplacingLwithL(t) in the proofof Theorem 5, (5.11) to (5.22) still stand. The upper bound ofE[ e T i,j (n)] in (5.23)shouldbemodifiedasin(5.26). E[ e T i,j (n)]≤ 4L(n)lnn Δ i,j min M 2 + ∞ X t=1 t X m ∗ 1 =1 ··· t X m ∗ 1 =M t X m k 1 =1 ··· t X m k 1 =M 2M s max π min 1+ ǫ max p L(t) 10s min θ min ! t − L(t)ǫ min −10s 2 max θ 2 max 20s 2 max θ 2 max ≤ 4M 2 L(n)lnn Δ i,j min 2 +1+ Ms max π min 1+ ǫ max 10s min θ min ∞ X t=1 2 p L(t)t − L(t)ǫ min −(40M+10)s 2 max θ 2 max 20s 2 max θ 2 max ≤ 4M 2 L(n)lnn Δ i,j min 2 +1+M s max π min 1+ ǫ max 10s min θ min ∞ X t=1 2t − L(t)ǫ min −(40M+10)s 2 max θ 2 max 20s 2 max θ 2 max + 1 2 . (5.26) L(t) is a diverging non-decreasing sequence, so there exists a constantt 1 , such that forallt≥t 1 ,L(t)≥ (60+40M)θ 2 max s 2 max ǫ min ,whichimpliest − L(t)ǫ min −(40M+10)s 2 max θ 2 max 20s 2 max θ 2 max + 1 2 ≤ t −2 . Thus,wehave 86 E[ e T i,j (n)]≤ 4M 2 L(n)lnn Δ i,j min 2 +M s max π min 1+ ǫ max 10s min θ min ∞ X t=t 1 2t −2 +B S,P,Θ = 4M 2 L(n)lnn Δ i,j min 2 +M s max π min 1+ ǫ max √ L 10s min θ min ! π 3 +B S,P,Θ (5.27) whereB S,P,Θ isaconstantasshownin(5.28),whichdependsonθ max ,s max andǫ min . B S,P,Θ = 1+M s max π min 1+ ǫ max 10s min θ min t 1 −1 X t=1 2t − L(t)ǫ min −(40M+10)s 2 max θ 2 max 20s 2 max θ 2 max + 1 2 . (5.28) ThenfortheMLMRpolicywithL(n) R φ (n)≤ Δ max M X i=1 N X j=1 E[ e T i,j (n)]+A S,P,Θ ≤ 4M 3 NL(n)lnn (Δ min ) 2 +MNB S,P,Θ +M 2 N s max π min 1+ ǫ max 10s min θ min π 3 Δ max +A S,P,Θ . (5.29) 5.5 ExamplesandSimulationResults We consider a system that consists of M = 2 users and N = 4 resources. The state of each resource evolves as an irreducible, aperiodic Markov chain with two states “0” and “1”. For all the tables in this section, the element in the i-th row and j-th column 87 represents the value for the user-resource pair (i,j). The transition probabilities are shownintheTable5.2,andthemeanrewardsforeachstateareshownintheTable5.3. 0.5 0.4 0.7 0.3 0.2 0.9 0.9 0.7 p 01 0.6 0.7 0.8 0.9 0.9 0.5 0.4 0.4 p 10 Table5.2: Transitionprobabilities. 0.6 0.5 0.2 0.4 0.3 0.7 0.8 0.3 θ 0 0.8 0.2 0.7 0.5 0.5 0.3 0.6 0.6 θ 1 Table5.3: Rewardsoneach state. For 1 ≤ i ≤ M, 1 ≤ j ≤ N, the stationary distribution of user-resource pair (i,j) onstate“0”iscalculatedas p i,j 10 p i,j 01 +p i,j 10 ;thestationarydistributiononstate“1”iscalculated as p i,j 01 p i,j 01 +p i,j 10 . The eigenvalue gap isǫ i,j = p i,j 01 +p i,j 10 . The expected rewardμ i,j for all the pairscanbecalculatedasinTable5.4. 0.6909 0.3909 0.4333 0.425 0.3363 0.4429 0.6615 0.4909 μ Table5.4: Expectedrewards. We can see that the arm {(1,1),(2,3)} is the optimal arm with greatest expected rewardμ ∗ = 0.6909+0.6615 = 1.3524. Δ min = 0.1706. Figure 5.1 shows the simulation result of the regret (normalized with respect to the logarithmof time) for our MLMRpolicy for the abovesystem with different choices of L. We also show the theoretical upper bound for comparison. The value ofL to satisfy 88 0 1 2 3 4 5 6 7 8 9 10 x 10 5 10 0 10 1 10 2 10 3 10 4 10 5 10 6 Time Regret/Log(t) L = 2 L = 303 Theoretical Upper Bound Figure5.1: Simulationresultsofexample1withΔ min = 0.1706. the condition in Theorem 5 isL≥ (50+40M)R 2 s 2 max ǫ min = 303, so we pickedL = 303 in the simulation. NotethatintheproofofTheorem5,whenL< (50+40M)R 2 s 2 max ǫ min ,wehave − Lǫ min −(40M +10)s 2 max θ 2 max 20s 2 max θ 2 max >−2. Thisimplies ∞ P t=1 2t − Lǫ min −(40M+10)s 2 max θ 2 max 20s 2 max θ 2 max doesnotconvergeanymoreandthuswecannot bound E[ e T i,j (n)] any more. 
Empirically, however, in 5.1 the case when L = 2 also seems to yield logarithmic regret over time and the performance is in fact better than L = 303, since the non-optimal arms are played less when L is smaller. However, this may possibly be due to the fact that the cases when e T i,j (n) grows faster than ln(t) only happenswithverysmallprobabilitywhenL = 2. Table5.5showsthenumberoftimesthatresourcej hasbeenmatchedwithuseriup totimen = 10 7 . 89 999470 153 185 196 136 293 999155 420 m i,j (10 7 ),L = 2 892477 30685 39410 37432 26813 50341 850265 72585 m i,j (10 7 ),L = 303 Table 5.5: Number of times that resource j has been matched with user i up to time n = 10 7 . 0 1 2 3 4 5 6 7 8 9 10 x 10 5 10 0 10 1 10 2 10 3 10 4 10 5 10 6 10 7 10 8 10 9 Time Regret/Log(t) L = 2 L = 303 Theoretical Upper Bound Figure5.2: Simulationresultsofexample2withΔ min = 0.0091. Figure 5.2 shows the simulation results of the regret of another example with the same transition probabilities as in the previous example and different rewards on states asinTable5.6. 0.7 0.3 0.5 0.5 0.65 0.7 0.8 0.4 θ 0 0.4 0.6 0.7 0.45 0.5 0.5 0.6 0.55 θ 1 Table5.6: Rewardsoneach state. Theexpectedrewardμ i,j forallthepairscanbecalculatedasinTable5.7. {(1,1),(2,3)}is stillthe optimalarm. However, compared with theprevious exam- ple,wecanseethattheexpectedrewardofthreeotherarms{(1,3),(2,1)},{(1,3),(2,2)}, {(1,1),(2,2)}are all very close to theexpected reward of the optimalarm. For thisex- ample, Δ min = 0.0091, which is much smaller compared with the previous example. 90 0.5636 0.4091 0.5933 0.4875 0.6227 0.5714 0.6615 0.4954 μ Table5.7: Expectedrewards. In this case, the non-optimal arms are played much more compared with the previous example. This is because we have several arms of which the expected rewards are very close to μ ∗ , so the policy has to spend a lot more time to explore on those non-optimal armstomakesurethosearenon-optimalarms. ThisfactcanbeseenclearlyinTable5.8, which presents the number of times that resource j has been matched with user i up to timen = 10 7 underbothcaseswhenL = 2andL = 303. 817529 544 179832 2099 175583 3610 820097 714 m i,j (10 7 ),L = 2 346395 60031 472346 121232 301491 146317 482545 69651 m i,j (10 7 ),L = 303 Table 5.8: Number of times that resource j has been matched with user i up to time n = 10 7 . 5.6 Summary We have presented the MLMR policy for the problem of learning combinatorial match- ings of users to resources when the reward process is Markovian. We have shown that thispolicyrequiresonlypolynomialstorageandcomputationperstep,andyieldsaregret that grows uniformly logarithmically overtime and only polynomiallywith the number ofusersandresources. 91 Chapter6 LearningwithRestlessMarkovianRewards 6.1 Overview Inthischapter 1 ,weconsiderhowtosolvethecombinatorialnetworkoptimizationprob- lems when the edge weights vary as independent Markov chains with unknown dynam- ics. Using a stochastic restless multi-armed bandit approach, we propose CLRMR, a online learning algorithm. We prove that, compared to a genie that knows the Markov transition matrices and uses the single-best structure at all times, CLRMR yields regret thatispolynomialinthenumberofedgesandnearly-logarithmicintime. 6.2 ProblemFormulation WeconsiderasystemwithN edgespredefinedbysomeapplication,wheretimeisslotted and indexedbyn. Foreach edgei (1≤ i≤ N), thereis an associatedstatethat evolves 1 Thischapterisbasedinparton[38]. 
92 asadiscrete-time,finite-state,aperiodic,irreducibleMarkovchain 2 {X i (n),n≥ 0}with unknown parameters 3 . We denote the state space for the i-th Markov chain by S i . We assume these N Markov chains are mutually independent. The reward obtained from state x (x ∈ S i ) of Markov chain i is denoted as r i x . Denote by π i x the steady state distributionfor statex. The mean reward obtained on Markov chaini is denoted byμ i . Thenwehaveμ i = P z∈S i,j r i x π i x . Thesetofallmeanrewardsisdenotedbyμ ={μ i }. At each decision period n (also referred to interchangeably as time slot), an N- dimensional action vector a(n), representing an arm, is selected under a policy φ(n) from a finitesetF. We assumea i (n)≥ 0 forall 1≤ i≤ N. When a particulara(n) is selected, the value of r i x i (n) is observed, only for thosei witha i (n) 6= 0. We denote by A a(n) ={i :a i (n)6= 0,1≤ i≤ N}theindexsetofalla i (n)6= 0foranarma. Wetreat eacha(n)∈F asanarm. Therewardisdefinedas: R a(n) (n) = X i∈A a(n) a i (n)r i x i (n) (6.1) wherex i (n)denotesthestateofaMarkovchainiattimen. When aparticulararma(n)isselected,therewardscorrespondingtonon-zerocom- ponents of a(n) are revealed, i.e., the value of r i x i (n) is observed for all i such that a i (n)6= 0. 2 We alsoreferMarkovchain{X i (n),n≥ 0}andMarkovchainiinterchangeably. 3 Alternatively, for Markov chain{X i (n),n ≥ 0}, it suffices to assume that the multiplicative sym- metrizationofthetransitionprobabilitymatrixisirreducible. 93 ThestateoftheMarkovchainevolvesrestlessly,i.e.,thestatewillcontinuetoevolve independently of the actions. We denote by P i = (p i x,y ) x,y∈S i the transition probability matrixfortheMarkovchaini. Wedenoteby(P i ) ′ ={(p i ) ′ x,y } x,y∈S i theadjointofP i on l 2 (π), so (p i ) ′ x,y = p i y,x π i y /π i x . Denote ˆ P i = (P i ) ′ P as themultiplicativesymmetrization ofP i . ForaperiodicirreducibleMarkovchains, ˆ P i sareirreducible[28]. A key metric of interest in evaluating a given policy φ for this problem is regret, whichisdefinedasthedifferencebetweentheexpectedrewardthatcouldbeobtainedby thebest-possiblestaticaction,andthatobtainedby thegivenpolicy. Itcan beexpressed as: R φ (n) =nγ ∗ −E φ [ n X t=1 R φ(t) (t)] =nγ ∗ −E φ [ n X t=1 X i∈A a(t) a i (t)r i x i (t) ] (6.2) where γ ∗ = max a∈F P i∈A a(n) a i μ i is the expected reward of the optimal arm. For the rest of the chapter, we use∗ as the index indicating that a parameter is for an optimal arm. If there is more than one optimal arm,∗ refers to any one of them. We denote by γ a the expectedrewardofarma,soγ a = |Aa| P j=1 a p j μ p j . Forthiscombinatorialmulti-armedbanditproblemwithrestlessMarkovianrewards, our goal is to design policies that perform well with respect to regret. Intuitively, we would like the regretR φ (n) to be as small as possible. If it is sublinear with respect to timen,thetime-averagedregretwilltendtozero. 94 6.3 PolicyDesign For the above combinatorial MAB problem with restless rewards, we have two chal- lengeshereforthepolicydesign: (1) A straightforward idea is to apply RCA in [75], or RUCB in [57] directly and naively, and ignore the dependencies across the different arms. However, we note that RCA and RUCB both require the storage and computation time that are linear in the numberofarms. Sincethere couldbeexponentiallymanyarmsin thisformulation,it is highlyunsatisfactory. 
(2)Unlikeourpriorworkon combinatorialMABwithrested rewards,forwhichthe transitions only occur each time the Markov chains are observed, the policy design for the restless case is much more difficult, since the current state while starting to play a Markovchaindependsnotonlyonthetransitionprobabilities,butalsoonthepolicy. To deal with the first challenge, we want to design a policy which more efficiently storesobservationsfromthecorrelatedarms,andexploitsthecorrelationstomakebetter decisions. Insteadofstoringtheinformationforeacharm,ourideaistousetwo 1byN vectorstostoretheinformationforeachMarkovchain. Thenanindexforeacheacharm is calculated, based on the information stored for underlying components. This index is usedforchoosingthearmtobeplayedeachtimewhenadecionneedstobemade. To deal with the second challenge, for each arm a we note that the multidimen- sional Markov chain{X a (n),n ≥ 0} defined by underlying components as X a (n) = (X i (n)) i∈Aa is aperiodic and irreducible. Instead of utilizing the actual sample path of 95 N : numberofresources a: vectorsofcoefficients,definedonsetF; wemapeachaasanarm A a :{i :a i 6= 0,1≤i≤N} t: currenttimeslot t 2 : numberoftimeslotsinSB2uptothecurrent timeslot b: numberofblocksuptothecurrenttimeslot m i 2 :numberoftimesthatMarkovchainihasbeen observedduringSB2uptothecurrenttimeslot ¯ z i 2 : average(samplemean)ofalltheobserved valuesofMarkovchainiduringSB2 upto thecurrenttimeslot ζ i : statethatdeterminetheregenerativecyclesfor Markovchaini x i : theobservedstatewhenMarkovChainiis played;(x i ) i∈Aa istheobservedstatevector ifarmaisplayed Table6.1: NotationforAlgorithm7. all observations, we only take the observations from a regenerative cycle for Markov chainsanddiscardtherestinitsestimationoftheindex. Our proposed policy, which we refer to as Combinatorial Learning with Restless MarkovReward(CLRMR),isshowninAlgorithm7. Table6.1summerizesthenotation weuseforCLRMRalgorithm. ForAlgorithm7,(x i ) i∈Aa = (ζ i ) i∈Aa meansx i =ζ i ,∀i. CLRMR operates in blocks. Figure 6.1 illustrates one possible realization of this Algorithm 7. At the beginning of each block, an arma is picked and within one block, this algorithm always play the same arm. For each Markov chain{X i (n)}, we specifiy a state ζ i at the beginning of the algorithm as a state to mark the regenerative cycle. Then, for the multidimentional Markov chain {X a (n)} associated with this arm, the state(ζ i ) i∈Aa isusedtodefinearegenerativecyclefor{X a (n)}. 96 Algorithm7CombinatorialLearningwithRestlessMarkovReward(CLRMR) 1: // INITIALIZATION 2: t = 1,t 2 = 1;∀i = 1,··· ,N,m i 2 = 0, ¯ z i 2 = 0; 3: forb = 1toN do 4: t := t+1,t 2 := t 2 +1; Play any arma such thatb∈A a ; denote (x i ) i∈Aa as the observedstatevectorforarma; 5: ∀i ∈ A a(n) , let ζ i be the first state observed for Markov chain i if ζ i has never beenset; ¯ z i 2 := ¯ z i 2 m i 2 +r i x i m i 2 +1 ,m i 2 :=m i 2 +1; 6: while(x i ) i∈Aa 6= (ζ i ) i∈Aa do 7: t :=t+1,t 2 :=t 2 +1;Playarma;denote(x i ) i∈Aa astheobservedstatevector; 8: ∀i∈A a(n) , ¯ z i 2 := ¯ z i 2 m i 2 +r i x i m i 2 +1 ,m i 2 :=m i 2 +1; 9: endwhile 10: endfor 11: // MAIN LOOP 12: while1do 13: // SB1 STARTS 14: t :=t+1; 15: Playanarmawhichmaximizes max a∈F X i∈Aa a i ¯ z i 2 + s Llnt 2 m i 2 ! ; (6.3) whereLisaconstant. 
16: Denote(x i ) i∈Aa astheobservedstatevector; 17: while(x i ) i∈Aa 6= (ζ i ) i∈Aa do 18: t :=t+1;Playanarmaanddenote(x i ) i∈Aa astheobservedstatevector; 19: endwhile 20: // SB2 STARTS 21: t 2 :=t 2 +1; 22: ∀i∈A a(n) , ¯ z i 2 := ¯ z i 2 m i 2 +r i x i m i 2 +1 ,m i 2 :=m i 2 +1; 23: while(x i ) i∈Aa 6= (ζ i ) i∈Aa do 24: t :=t+1,t 2 :=t 2 +1; 25: Play anarmaanddenote(x i ) i∈Aa astheobservedstatevector; 26: ∀i∈A a(n) , ¯ z i 2 := ¯ z i 2 m i 2 +r i x i m i 2 +1 ,m i 2 :=m i 2 +1; 27: endwhile 28: // SB3 IS THE LAST PLAY IN THE WHILE LOOP. THEN A BLOCK COMPLETES. 29: b :=b+1,t :=t+1; 30: endwhile 97 SB1 SB2 SB3 SB1 SB1 SB2 SB2 SB3 SB3 play arm play arm play arm compute index compute index compute index compute index Figure6.1: AnillustrationofCLRMR. Each block is broken into three sub-blocks denoted by SB1, SB2 and SB3. In SB1, theselectedarmaisplayeduntilthestate(ζ i ) i∈Aa isobserved. Uponthisobservationwe enteraregenerativecycle,andcontinueplayingthesamearmuntill(ζ i ) i∈Aa isobserved again. SB2 includes all timeslots from the first visit of (ζ i ) i∈Aa up to but excludingthe secondvisitto(ζ i ) i∈Aa . SB3consistsasingletimeslotwiththesecondvisitto(ζ i ) i∈Aa . SB1isemptyifthefirstobservedstateis(ζ i ) i∈Aa . SoSB2includestheobservedrewards foraregenerativecycleofthemultidimentionalMarkovchain{X a (n)} associatedwith arma, which implies that SB2 also includes the observed rewards for one or more re- generativecyclesforeach underlyingMarkovchain{X i (n)},i∈A a . Thekeytothealgorithm7istostoretheobservationsforeachMarkovchaininstead ofthewholearm,andutilizetheobservationsonlyinSB2forthem,andvirtuallyassem- ble them (highlighted with thick lines in Figure 6.1). Due to the regenerative nature of the Markov chain, by putting the observations in SB2, the sample path has exactly the samestaticsasgivenbythetransitionprobabilitymatrix. Sotheproblemistractable. LLR policy requires storage linear in N. We use two 1 by N vectors to store the information for each Markov chain after we play the selected arm at each time slot in SB2. One is (¯ z i 2 ) 1×N in which ¯ z i 2 is the average (sample mean) of observed values in 98 SB2 up to the current timeslot (obtained through potentially different sets of arms over time). The other one is (m i 2 ) 1×N in which m i 2 is the number of times that{X i (n)} has beenobservedinSB2 uptothecurrenttimeslot. Line 1 to line 10 are the initialization, for which each Markov chain is observed at leastonce,andζ i isspecifiedasthefirststateobservedfor{X i (n)}. Aftertheinitialization,atthebeginningofeachblock,CLRMRselectsthearmwhich solvesthemaximizationproblemasin(6.3). Itisadeterministiclinearoptimalproblem withafeasiblesetF andthecomputationtimeforanarbitraryF maynotbepolynomial inN. But, as weshowinSection 6.5,thereexistmanypracticallyusefulexampleswith polynomialcomputationtime. 6.4 AnalysisofRegret We summarize some notation we use in the description and analysis of our CLRMR policyinTable6.2. WefirstshowinTheorem7anupperboundonthetotalexpectednumberofplaysof suboptimalarms. Theorem7. WhenusinganyconstantL≥ 56(H +1)S 2 max r 2 max ˆ π 2 max /ǫ min , we have X a:γ a <γ ∗ (γ ∗ −γ a )E[T a (n)]≤Z 1 lnn+Z 2 99 H : max a |A a |. 
NotethatH ≤N a(τ) : thearmplayedintimeτ b(n): numberofcompletedblocksuptotimen t(b): timeattheendofblockb t 2 (b): totalnumberoftimeslotsspentinSB2 uptoblockb B a (b): totalnumberofblockswithinthefirstb blocksinwhicharmaisplayed m i 2 (t 2 (b)): totalnumberoftimeslotsMarkovchaini isobservedduringSB2uptoblockb ¯ z i 2 (s): themeanrewardfromMarkovchaini whenitisobservedforthes-thtimeof onlythosetimesplayedduringSB2 T(n): timeattheendofthelastcompletedblock T a (n): totalnumberoftimeslotsarmaispalyed uptotimeT(n) m i x (s): numberoftimesthatstatexoccuredwhen Markovchainihasbeenobservedstimes Y i 1 (j): vectorofobservedstatesfromSB1ofthe j-thblockforplayingMarkovchaini Y i 2 (j): vectorofobservedstatesfromSB2ofthe j-thblockforplayingMarkovchaini Y i (j): vectorofobservedstatesfromthej-th blockforplayingMarkovchaini ˆ π i x : max{π i x ,1−π i x } ˆ π max : max i,x∈S i ˆ π i x π min : min i,x∈S i π i x π max : max i,x∈S i π i x ǫ i : eigenvaluegap,definedas 1−λ 2 ,where λ 2 isthesecondlargesteigenvalueofthe multiplicativesymmetrizationofP i ǫ min : min i ǫ i S max : max i |S i | r max : max i,x∈S i r i x a max : max i∈Aa,a∈F a i Δ a : γ ∗ −γ a Δ min : min γ a ≤γ ∗ Δ a 100 Δ max : max γ a ≤γ ∗ Δ a {X a (n)}: multidimentionalMarkovchaindefined byX a (n) = (X i (n)) i∈Aa ζ a : (ζ i ) i∈Aa ,statevectorthatdetermines theregenerativecyclesfor{X a (n)} Π a z : steadystatedistributionforstatez of{X a (n)} Π a min : min z∈S a Π a z Π min : min a,z∈S a Π a z M a z 1 ,z 2 : meanhittingtimeofstatez 2 starting fromaninitialstatez 1 for{X a (n)} M a max : max z 1 ,z 2 ∈S a M a z 1 ,z 2 γ ′ max : max γ a ≤γ ∗ γ a Table6.2: Notationforregretanalysis. where Z 1 = Δ max 1 Π min +M max +1 4NLH 2 a 2 max Δ 2 min Z 2 = Δ max 1 Π min +M max +1 N + πNHS max 3π min To proveTheorem 7, weusetheinequalitiesas statedin Theorem 3.3from [53]and atheoremfrom[20]. Lemma 6 (Theorem 3.3 from [53]). Consider a finite-state, irreducible Markov chain {X t } t≥1 with state space S, matrix of transition probabilitiesP, an initial distribution q and stationary distributionπ. Let N q = ( qx πx ,x∈S) 2 . Let ˆ P = P ′ P be the mul- tiplicative symmetrization of P where P ′ is the adjoint of P on l 2 (π). Let ǫ = 1−λ 2 , where λ 2 is the second largest eigenvalue of the matrix ˆ P. ǫ will be referred to as the 101 eigenvalue gap of ˆ P. Let f : S → R be such that P y∈S π y f(y) = 0,kfk ∞ ≤ 1 and 0<kfk 2 2 ≤ 1. If ˆ P is irreducible,thenforanypositiveintegernandall 0<δ≤ 1 P P n t=1 f(X t ) n ≥δ ≤N q exp − nδ 2 ǫ 28 . Lemma 7. If {X n } n≥0 is a positive recurrent homogeneous Markov chain with state spaceS,stationarydistributionπ andτ isastoppingtimethatisfinitealmostsurelyfor whichX τ =xthen forally∈S, E " τ−1 X t=0 I(X t =y)|X 0 =x # =E[τ|X 0 =x]π y . Proofof Theorem 7. Weintroduce e B i (b)asacounterfortheregretanalysistodealwith thecombinatorialarms. Aftertheinitializationperiod, e B i (b) isupdatedinthefollowing way: at thebeginningofany blockwhenanon-optimalarmischosen tobeplayed,find i such that i = arg min j∈Aa(b) m j 2 (i the index of the elements which are among the ones that have been observed least in SB2 in the non-optimal arm). If there is only one such arm, e B i (b) is increased by 1. If there are multiple such arms, we arbitrarily pick one, sayi ′ , and increment e B i ′ by 1. Based ontheabovedefinitionof e B i (b), each timeanon- optimal arm is chosen to be played at the beginning of a block, exactly one element in ( e B i (b)) 1×N isincrementedby 1. 
So thesummationofallcountersin ( e B i (b)) 1×N equals thetotalnumberofblocksinwhichwehaveplayednon-optimalarms, 102 X a:γ a <γ ∗ E[B a (b)] = N X i=1 E[ e B i (b)]. (6.4) Wealsohavethefollowinginequalityfor e B i (b): e B i (b)≤m i 2 (t(b−1)),∀1≤ i≤ N,∀b. (6.5) Denote byc t,s q Llnt s . Denote by e I i (b) the indicator function which is equal to 1 if e B i (b) is added by oneat blockb. Letl bean arbitrary positiveinteger. Then wecan get theupperboundofE[ e B i (b)] shownin(6.6), E[ e B i (b)] = b X β=N+1 P{ e I i (β) = 1}≤l+ b X β=N+1 P{ e I i (β) = 1, e B i (β−1)≥l} ≤ l+ b X β=N+1 P{ X k∈A a ∗ a ∗ k g k t 2 (β−1),m k 2 (t(β−1)) ≤ X j∈A a(h) a j (b)g j t 2 (β−1),m j 2 (t(β−1)) , e B i (β−1)≥l}. (6.6) where g i t,s = ¯ z i 2 (s) +c t,s anda(β) is defined as a non-optimal arm picked at block β when e I i (β) = 1. Note thatm i 2 = min j {m j 2 : ∀j ∈A a(β) }. We denote this arm bya(β) sinceateach blockthat e I i (β) = 1,wecouldgetdifferentarms. Notethatl≤ e B i (β−1)implies, l≤ e B i (β−1)≤m i 2 (t(β−1)),∀j∈A a(β) . (6.7) 103 So we can further derive the upper bound of E[ e B i (b)] shown in (6.8), where h j (1 ≤ j ≤ |A a∗ |) represents thej-th element inA a∗ ; p j (1 ≤ j ≤ |A a(β) |) represents thej-th elementinA a(β) orA a(t) .A a(τ) representsthearmplayedintheτ-thtimeslotscounting onlyinSB2. E[ e B i (b)]≤l+ b X β=N+1 P{ min 0<s h 1 ,...,s h |Aa∗| <t 2 (β−1) |Aa∗| X j=1 a ∗ h j g h j t 2 (β−1),s h j ≤ max t 2 (l)≤sp 1 ,...,sp |A a(β) | <t 2 (β−1) |A a(β) | X j=1 a p j (β)g p j t 2 (β−1),sp j } ≤l+ b X β=N+1 t 2 (β−1) X s h 1 =1 ··· t 2 (β−1) X s h |A ∗ | =1 t 2 (β−1) X sp 1 =t 2 (l) ··· t 2 (β−1) X sp |A a(β) | =t 2 (l) P{ |Aa∗| X j=1 a ∗ h j g h j t 2 (β−1),s h j ≤ |A a(β) | X j=1 a p j (β)g p j t 2 (β−1),sp j } ≤l+ t 2 (b) X τ=1 τ−1 X s h 1 =1 ··· τ−1 X s h |A ∗ | =1 τ−1 X sp 1 =l ··· τ−1 X sp |A a(β) | =l P{ |Aa∗| X j=1 a ∗ h j g h j τ,s h j ≤ |A a(τ) | X j=1 a p j (τ)g p j τ,sp j } (6.8) Notethat, P{ |Aa∗| X j=1 a ∗ h j g h j τ,s h j ≤ |A a(τ) | X j=1 a p j (t)g p j τ,sp j } =P{ |Aa∗| X j=1 a ∗ h j (¯ z h j 2 (s h j )+c τ,s h j )≤ |A a(τ) | X j=1 a p j (τ)(¯ z p j 2 (s p j )+c τ,sp j )} (6.9) 104 =P{Atleastoneofthefollowingmusthold: |Aa∗| X j=1 a ∗ h j ¯ z h j 2 (s h j )≤γ ∗ − |Aa∗| X j=1 a ∗ h j c τ,s h j , (6.10) |A a(τ) | X j=1 a p j (τ)¯ z p j 2 (s p j )≥γ a(τ) + |A a(τ) | X j=1 a p j (τ)c τ,sp j , (6.11) γ ∗ <γ a(τ) +2 |A a(τ) | X j=1 a p j (τ)c τ,sp j } (6.12) Now weshowtheupperboundon theprobabilitiesofinequalities(6.10), (6.11)and (6.12)separately. Wefirstfindanupperboundontheprobabilityof(6.10): P{ |Aa∗| X j=1 a ∗ h j ¯ z h j 2 (s h j )≤ γ ∗ − |Aa∗| X j=1 a ∗ h j c τ,s h j } =P{ |Aa∗| X j=1 a ∗ h j ¯ z h j 2 (s h j )≤ |Aa∗| X j=1 a ∗ h j μ h j − |Aa∗| X j=1 a ∗ h j c τ,s h j } ≤ |Aa∗| X j=1 P{a ∗ h j ¯ z h j 2 (s h j )≤a ∗ h j (μ h j −c τ,s h j )} = |Aa∗| X j=1 P{¯ z h j 2 (s h j )≤μ h j −c τ,s h j }. 
∀1≤j ≤|A a∗ |, P{¯ z h j 2 (s h j )≤μ h j −c τ,s h j } =P{ X x∈S h j ( r h j x m h j x (s h j ) s h j −r h j x π h j x )≤ X x∈S h j − c τ,s h j |S h j | } ≤ X x∈S h j P{ r h j x m h j x (s h j ) s h j −r h j x π h j x ≤− c τ,s h j |S h j | } 105 = X x∈S h j P{r h j x m h j x (s h j )−s h j r h j x π h j x ≤− s h j c τ,s h j |S h j | } = X x∈S h j P{r h j x (s h j − X y6=x m h j y (s h j ))−r h j x s h j (1− X y6=x π h j y )≤− s h j c τ,s h j |S h j | } = X x∈S h j P{ X y6=x m h j y (s h j )− X y6=x π h j y ≥ s h j c τ,s h j r h j x |S h j | = X x∈S h j P{ s h j P t=1 1(Y h j t 6=x)−s h j (1−π h j x ) ˆ π h j x s h j ≥ s h j c τ,s h j r h j x |S h j | } ≤ X x∈S h j N q h j τ − Lǫ h j 28(|S h j |r h j x ˆ π h j x ) 2 (6.13) ≤ |S h j | π min τ − Lǫ min 28S 2 max r 2 max ˆ π 2 max (6.14) where(6.13)followsfromLemma6byletting δ = s h j c τ,s h j r h j x |S h j | , f(Y i t ) = 1(Y i t 6=x)−(1−π i x ) ˆ π i x . 1(a) istheindicatorfunctiondefined tobe1 when thepredicateais true, and 0 when it is false. ˆ π i x is defined as ˆ π i x = max{π i x ,1−π i x } to guaranteekfk ∞ ≤ 1. We note that whenδ > 1thedeviationprobabilityiszero,sotheboundstillholds. (6.14)followsfromthefact thatforanyq i , N q i = q i x π i x ,x∈S i 2 ≤ |S i | X x=1 q i x π i x 2 ≤ |S i | X x=1 kq i x k 2 π min = 1 π min . 106 Notethatallthequantitiesincomputingtheindicesandtheprobabilitiesabovecome from SB2. Got foreverySB2 inablock, thequantitiesbeginwithstateζ a and end with a return to ζ a . So for each underlying Markov chain {X i (n)},i ∈ A a , the quantities got begin with state ζ i and end with a return to ζ i . Note that for all i, Markov chain {X i (n)} could be played in different arms, but the quantities got always begin with stateζ i and end with a return toζ i . Then by the strong Markov property, the process at thesestoppingtimeshas thesamedistributionas the originalprocess. Connecting these intervals together we form a continuous sample path which can be viewed as a sample path generated by a Markov chain with transition matrix identical to the original arm. ThisisthereasonwhywecanapplyLemma6tothisMarkovchain. Therefore, P{ |Aa∗| X j=1 a ∗ h j ¯ z h j 2 (s h j )≤γ ∗ − |Aa∗| X j=1 a ∗ h j c τ,s h j } ≤ HS max π min τ − Lǫ min 28S 2 max r 2 max ˆ π 2 max (6.15) Withasimilarderivation,wehave P{ |A a(τ) | X j=1 a p j (τ)¯ z p j 2 (s p j )≥γ a(τ) + |A a(τ) | X j=1 a p j (τ)c τ,sp j } ≤ |A a(τ) | X j=1 P{a p j (τ)¯ z p j 2 (s p j )≥a p j (τ)μ p j +a p j (τ)c τ,sp j } ≤ |A a(τ) | X j=1 X x∈S p j P{r p j x m p j x (s p j )−s p j r p j x π p j x ≥ s p j c τ,sp j |S p j | } 107 = |A a(τ) | X j=1 X x∈S p j P{ sp j P t=1 1(Y p j t =x)−s p j π p j x ˆ π p j x s p j ≥ s p j c τ,sp j r p j x |S p j | } ≤ |A a(τ) | X j=1 X x∈S p j N q p jτ − Lǫ p j 28(|S p j |r p j x ˆ π p j x ) 2 (6.16) ≤ HS max π min τ − Lǫ min 28S 2 max r 2 max ˆ π 2 max (6.17) where(6.16)followsfromLemma6byletting δ = s p j c τ,sp j r p j x |S p j | , f(Y i t ) = 1(Y i t =x)−π i x ˆ π i x . Notethatwhenl≥ 4Llnt 2 (b) Δ a(τ) Hamax 2 ,(6.12)isfalseforτ,whichgives, γ ∗ −γ a(τ) −2 |A a(τ) | X j=1 a p j (τ)c τ,sp j =γ ∗ −γ a(τ) −2 |A a(τ) | X j=1 a p j s Llnt 2 (b) s p j ≥γ ∗ −γ a(τ) −Ha max r 4Llnt 2 (b) l ≥γ ∗ −γ a(τ) −Ha max s 4Llnt 2 (b) 4Llnt 2 (b) Δ a(t) Ha max 2 (6.18) ≥γ ∗ −γ a(τ) −Δ a(τ) = 0. (6.19) Hence, when weletl≥ l 4LH 2 a 2 max lnt 2 (b) Δ 2 min m , (6.12)is falseforalla(τ). Therefore, we have(6.20). 
108 E[ e B i (b)]≤ 4LH 2 a 2 max lnt 2 (b) Δ 2 min + t 2 (b) X τ=1 τ−1 X s h 1 =1 ··· τ−1 X s h |A ∗ | =1 τ−1 X sp 1 =l ··· τ−1 X sp |A a(β) | =l 2HS max π min τ − Lǫ min 28S 2 max r 2 max ˆ π 2 max (6.20) Following(6.20), E[ e B i (b)]≤ 4LH 2 a 2 max lnn Δ 2 min +1+ HS max π min ∞ X τ=1 2τ − Lǫ min −56HS 2 max r 2 max ˆ π 2 max 28S 2 max r 2 max ˆ π 2 max (6.21) = 4LH 2 a 2 max lnn Δ 2 min +1+ HS max π min ∞ X τ=1 2τ −2 (6.22) = 4LH 2 a 2 max lnn Δ 2 min +1+ πHS max 3π min (6.22)followssinceL≥ 56(H +1)S 2 max r 2 max ˆ π 2 max /ǫ min . Accordingto(6.4), X a:γ a <γ ∗ E[B a (b)] = N X i=1 E[ e B i (b)]≤ 4NLH 2 a 2 max lnn Δ 2 min +N + πNHS max 3π min (6.23) Note that the total number of plays of arma at the end of blockb(n) is equal to the totalnumberofplaysofarma duringSB2s (theregenerativecyclesofvisitingstateζ a ) plusthetotalnumberofplaysbeforeenteringtheregenerativecyclesplusonemoreplay resultingfromthelastplayoftheblockwhichisstateζ a . Sowehave, E[T a (n)]≤ 1 Π a min +M a max +1 E[B a (b(n))]. 109 Therefore, X a:γ a <γ ∗ (γ ∗ −γ a )E[T a (n)] ≤ Δ max X a:γ a <γ ∗ 1 Π a min +M a max +1 E[B a (b(n))] (6.24) ≤ Δ max 1 Π min +M max +1 X a:γ a <γ ∗ E[B a (b(n))] (6.25) ≤Z 1 lnn+Z 2 where Z 1 = Δ max 1 Π min +M max +1 4NLH 2 a 2 max Δ 2 min , Z 2 = Δ max 1 Π min +M max +1 N + πNHS max 3π min NowweshowourmainresultsontheregretofCLRMRpolicyasinTheorem8. Theorem 8. When using any constantL≥ 56(H +1)S 2 max r 2 max ˆ π 2 max /ǫ min , the regret of CLRMR can beupper boundeduniformlyover timebythefollowing, R CLRMR (n)≤ Z 3 lnn+Z 4 (6.26) 110 where Z 3 =Z 1 +Z 5 4NLH 2 a 2 max Δ 2 min Z 4 =Z 2 +γ ∗ ( 1 π min +M max +1)+Z 5 (N + πNHS max 3π min ) and Z 5 =γ ′ max ( 1 Π min +M max +1− 1 π max )+γ ∗ M ∗ max Proof. Denotetheexpectationswithrespect topolicyCLRMRgivenζ byE ζ . Thenthe regretisboundedas, R CLRMR ζ (n) =γ ∗ E ζ [T(n)]−E ζ [ T(n) X t=1 X i∈A a(t) a i (t)r i x i (t) ] +γ ∗ E ζ [n−T(n)]−E ζ [ n X t=T(n)+1 X i∈A a(t) a i (t)r i x i (t) ] ≤ γ ∗ E ζ [T(n)]− X a γ a E ζ [T a (n)] ! +γ ∗ E ζ [n−T(n)] + X a γ a E ζ [T a (n)]−E ζ [ T(n) X t=1 X i∈A a(t) a i (t)r i x i (t) ] ≤Z 1 lnn+Z 2 +γ ∗ ( 1 Π min +M max +1) (6.27) + X a γ a E ζ [T a (n)]−E ζ [ T(n) X t=1 X i∈A a(t) a i (t)r i x i (t) ] . where(6.27)followsfromTheorem7andE ζ [n−T(n)]≤ 1 Π min +M max +1. 111 Notethat: X a γ a E ζ [T a (n)]−E ζ [ T(n) X t=1 X i∈A a(t) a i (t)r i x i (t) ] ≤γ ∗ E ζ [T ∗ (n)]+ X a:γ a <γ ∗ γ a E ζ [T a (n)] − X i∈A a ∗ X y∈S i a ∗ i r i y E ζ [ B ∗ (b(n)) X j X Y i t ∈Y i (j) 1(Y i t =y)] − X a:γ a <γ ∗ X i∈Aa X y∈S i a i r i y E ζ [ B a (b(n)) X j X Y i t ∈Y i 2 (j) 1(Y i t =y)] (6.28) wheretheinequalityabovecomesfromcountingonlyinY i 2 (j)insteadofY i (j)in(6.28). ThenapplyingLemma7to(6.28),wehave E ζ [ B a (b(n)) X j X Y i t ∈Y i 2 (j) 1(Y i t =y)] = π i y π i ζ i E ζ [B a (b(n))]. So, − X a:γ a <γ ∗ X i∈Aa X y∈S i a i r i y E ζ [ B a (b(n)) X j X Y i t ∈Y i 2 (j) 1(Y i t =y)] ≤− X a:γ a <γ ∗ γ a π max E ζ [B a (b(n))]. (6.29) Alsonotethat: X a:γ a <γ ∗ γ a E ζ [T a (n)]≤ X a:γ a <γ ∗ γ a ( 1 π a min +M a max +1)E ζ [B a (b(n))]. 
(6.30) 112 Inserting(6.29)and(6.30)into(6.28),weget, X a γ a E ζ [T a (n)]−E ζ [ T(n) X t=1 X i∈A a(t) a i (t)r i x i (t) ] ≤ γ ∗ E ζ [T ∗ (n)]+ X a:γ a <γ ∗ γ a ( 1 Π a min +M a max +1− 1 π max )E ζ [B a (b(n))] − X i∈A a ∗ X y∈S i a ∗ i r i y E ζ [ B ∗ (b(n)) X j X Y i t ∈Y i (j) 1(Y i t =y)] =Q ∗ (n)+ X a:γ a <γ ∗ γ a ( 1 Π a min +M a max +1− 1 π max )E ζ [B a (b(n))], where Q ∗ (n) =γ ∗ E ζ [T ∗ (n)]− X i∈A a ∗ X y∈S i a ∗ i r i y E ζ [ B ∗ (b(n)) X j X Y i t ∈Y i (j) 1(Y i t =y)]. We now consider the upper bound forQ ∗ (n). We note that the total numberof time slots for playingall suboptimalarms is at mostlogarithmic,so thenumber oftimeslots in which the optimal arm is not played is at most logarithmic. We could then combine the successive blocks in which the best arm is played, and denote by ¯ Y ∗ (j) the j-th combinedblock. Denote ¯ b ∗ as thetotalnumberof combinedblocks up to blockb. Each combinedblock ¯ Y ∗ startsafterdis-continuityinplayingtheoptimalarm,so ¯ b ∗ (n)isless thanorequaltototalnumberofcompletedblocksinwhichthebestarmisnotplayedup totimen. Thus, E ζ [ ¯ b ∗ (n)]≤ X a:γ a <γ ∗ E ζ [B a (b(n))]. (6.31) 113 Each combined block ¯ Y ∗ consists of two sub-blocks: ¯ Y ∗ 1 which contains the state vectorsfortheoptimalarmvisitedfrombeginningof ¯ Y ∗ (emptyifthefirststateisζ ∗ )to thestaterightbeforehittingζ ∗ andsub-block ¯ Y ∗ 2 whichcontainstherestof ¯ Y ∗ (arandom numberofregenerativecycles). Denotethelengthof ¯ Y ∗ 1 by| ¯ Y ∗ 1 |andthelengthof ¯ Y ∗ 2 by | ¯ Y ∗ 2 |. Wedenote ¯ Y i 2 (j)bythestatesforMarkovchainiforalli∈A a ∗ in ¯ Y ∗ 2 . ThereforewegettheupperboundforQ ∗ (n)as Q ∗ (n) =γ ∗ E ζ [T ∗ (n)]− X i∈A a ∗ X y∈S i a ∗ i r i y E ζ [ B ∗ (b(n)) X j X Y i t ∈Y i (j) 1(Y i t =y)] (6.32) ≤ X i∈A a ∗ X y∈S i a ∗ i r i y π i y E ζ [ ¯ b ∗ (n) X j=1 | ¯ Y ∗ 2 |] (6.33) − X i∈A a ∗ X y∈S i a ∗ i r i y E ζ [ ¯ b ∗ (n) X j=1 X Y i t ∈ ¯ Y i 2 (j) 1(Y i t =y)] (6.34) + X i∈A a ∗ X y∈S i γ ∗ E ζ [ ¯ b ∗ (n) X j=1 | ¯ Y ∗ 1 |] (6.35) ≤γ ∗ M ∗ max X a:γ a <γ ∗ E ζ [B a (b(n))] (6.36) where the inequality in (6.33) comes from counting only the rewards obtained in sub- block ¯ Y i 2 in(6.32). Also,notethatbasedonLemma7,(6.33)equals(6.34),andtherefore wehavetheinequality(6.36). Hence,∀ζ, R CLRMR ζ (n)≤Z 1 lnn+Z 2 +γ ∗ ( 1 π min +M max +1) + X a:γ a <γ ∗ γ a (M a max +1)E ζ [B a (b(n))]+γ ∗ M ∗ max X a:γ a <γ ∗ E ζ [B a (b(n))] 114 ≤Z 1 lnn+Z 2 +γ ∗ ( 1 π min +M max +1) +(γ ′ max ( 1 Π min +M max +1− 1 π max )+γ ∗ M ∗ max )E ζ [B a (b(n))] ≤Z 3 lnn+Z 4 , (6.37) where(6.37)followsfromTheorem7and(6.23),and Z 3 =Z 1 +Z 5 4NLH 2 a 2 max Δ 2 min Z 4 =Z 2 +γ ∗ ( 1 Π min +M max +1)+Z 5 (N + πNHS max 3π min ). Z 5 isdefined as Z 5 =γ ′ max ( 1 Π min +M max +1− 1 π max )+γ ∗ M ∗ max . Theorem 8 shows when we use a constant L ≥ 56(H + 1)S 2 max r 2 max ˆ π 2 max /ǫ min , the regret of Algorithm 7 is upper-bounded uniformly overtimen by a function that grows as O(N 3 lnn). However, when S max , r max , ˆ π max or ǫ min (or the bound of them) are unknown,theupperboundofregretcannotbeguaranteedtogrowlogarithmicallyinn. When no knowledge about the system is available, we extend the CLRMR pol- icy to achieve a regret bounded uniformly over time n by a function that grows as 115 O(N 3 L(n)lnn),usinganyarbitrarilyslowlydivergingnon-decreasingsequenceL(n)in Algorithm7. SinceL(n)couldgrowarbitrarilyslowly,thismodifiedversionofCLRMR, namedCLRMR-LN,couldachievearegretarbitrarilyclosetothelogarithmicorder. We presentouranalysisinTheorem9. Theorem9. 
Whenusinganyarbitrarilyslowlydivergingnon-decreasingsequenceL(n) (i.e.,L(n)→∞asn→∞), andreplacing(6.3)inAlgorithm7 accordinglywith max a∈F a i ¯ z i 2 + s L(n(t 2 ))lnt 2 m i 2 ! (6.38) where n(t 2 ) is the time when total number of time slots spent in SB2 is t 2 , the expected regret underthismodifiedversionof CLRMR, named CLRMR-LN policy,isat most R CLRMR−LN (n)≤Z 6 L(n)lnn+Z 7 (6.39) whereZ 6 andZ 7 areconstants. Proof. Replacing c t,s with q L(n(t))lnt s , and replacing L with L(n(t 2 (b))) or L(n(τ)) accordinglyintheproofofTheorem7,(6.4)to(6.21)stillstand. L(n(τ)) is a diverging non-decreasing sequence, so there exists a constant τ ′ such thatforallτ ≥τ ′ ,L(n(τ))≥ 56(H +1)S 2 max r 2 max ˆ π 2 max /ǫ min ,whichimplies τ − L(n(τ))ǫ min −56HS 2 max r 2 max ˆ π 2 max 28S 2 max r 2 max ˆ π 2 max ≤τ −2 . Thus,wehave, 116 E[ e B i (b)]≤ 4L(n(t 2 (b)))H 2 a 2 max lnn Δ 2 min +1+ HS max π min ∞ X τ=1 2τ −2 +Z 8 ≤ 4L(n)H 2 a 2 max lnn Δ 2 min +1+ πHS max 3π min +Z 8 (6.40) where Z 8 = HS max π min τ ′ −1 X τ=1 2τ − Lǫ min −56HS 2 max r 2 max ˆ π 2 max 28S 2 max r 2 max ˆ π 2 max (6.41) Thenwecanaccordinghave, X a:γ a <γ ∗ (γ ∗ −γ a )E[T a (n)]≤Z 9 L(n)lnn+Z 2 +Δ max 1 Π min +M max +1 NZ 8 . where Z 9 = Δ max 1 Π min +M max +1 4NH 2 a 2 max Δ 2 min . (6.42) So, R CLRMR−LN (n)≤Z 6 L(n)lnn+Z 7 , (6.43) where Z 6 =Z 9 +Z 5 4NH 2 a 2 max Δ 2 min 117 Z 7 =Z 2 +γ ∗ ( 1 Π min +M max +1)+Δ max 1 Π min +M max +1 NZ 7 (6.44) +Z 5 (N + πNHS max 3π min +NZ 7 ). 6.5 ApplicationsandSimulationResults We now present an evaluation of our policy over stochastic versions of two combina- torial network optimization problems of practical interest: stochastic shortest path (for routing),andstochasticbipartitematching(forchannelallocation). 6.5.1 StochasticShortestPath In the stochastic shortest path problem, given a graph G = (V,E), with edge weights (D ij ) stochasticallyvarying with time as restless Markov chains with unknown dynam- ics, we seek to find a path between a given source s and destination t with minimum expecteddelay. WecanapplytheCLRMRpolicytothisproblem,withsomeveryminor modifications to the policy and the corresponding regret definition to be applicable to a minimizationprobleminstead ofmaximization. For clarity,(6.3) in Algorithm7 should bereplacedby, a = argmin a∈F X i∈Aa a i ¯ z i 2 − s Llnt 2 m i 2 ! ; (6.45) 118 Andthedefinitionofregretshouldbeinsteadexpressedas, R φ (n) =E φ [ n X t=1 R φ(t) (t)]−nη ∗ . (6.46) whereη ∗ representstheminimumcost,whichiscostoftheoptimalarm. For the stochastic shortest path problems, each path between s and t is mapped to an arm. Although the number of paths could grow exponentially with the number of Markov chains, |E|. CLRMR efficiently solves this problem with polynomial storage |E|andregretscalingasO(|E| 3 logn). Also, sincethere exist polynomialtimealgorithmssuch as Dijkstra’s algorithm[29] and Bellman-Fordalgorithm[18,32]forshortestpath, wecan applythesealgorithmsto solve(6.45)withedgecost ¯ z i 2 − q Llnt 2 m i 2 . s t 1 2 3 5 4 e . 1 1 e . 1 2 e . 1 3 e . 1 4 e . 1 5 e . 1 e . 2 e . 3 e . 4 e . 5 e . 6 e . 7 e . 8 e . 9 e . 1 0 (a) A graph with 15 links and 96 acyclic paths be- tweensandt. s t 1 2 3 4 5 6 e . 1 e . 2 e . 3 e . 4 e . 5 e . 6 e . 7 e . 8 e . 9 e . 1 0 e . 1 1 e . 1 2 e . 1 3 e . 1 4 e . 1 5 e . 1 6 e . 1 7 e . 1 9 e . 1 8 (b) A graphwith 19 links and260 acyclicpaths be- tweensandt. Figure6.2: Twoexamplegraphsforstochasticshortestpathrouting. WeshowthenumericalsimulationresultswithtwoexamplegraphsinFigure6.2. 
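The arm-selection step (6.45) is what keeps the per-block computation light in these experiments: since the coefficients $a_i$ are 0/1 for routing, rather than enumerating the exponentially many paths it suffices to run one shortest-path computation per block on the index-adjusted edge costs $\bar{z}^i_2 - \sqrt{L\ln t_2/m^i_2}$. A minimal sketch of this step is given below (Python; the four-edge graph and the SB2 statistics are hypothetical and not taken from Figure 6.2, and truncating negative indices at zero is our own simplification so that Dijkstra's algorithm, which assumes nonnegative edge costs, remains applicable).

```python
import heapq
import math

def clrmr_edge_cost(z_bar, m, t2, L):
    """Lower-confidence index used as the per-edge cost in (6.45).

    z_bar : sample mean of the delays observed on this edge during SB2
    m     : number of SB2 observations of this edge
    t2    : total number of time slots spent in SB2 so far
    L     : the exploration constant of CLRMR
    """
    cost = z_bar - math.sqrt(L * math.log(t2) / m)
    # The confidence term can push the index below zero; since true expected
    # delays are nonnegative we truncate at 0 here (our practical choice,
    # not prescribed by the policy) so that Dijkstra applies.
    return max(cost, 0.0)

def dijkstra(adj, s, t):
    """Standard Dijkstra on nonnegative edge costs.

    adj: dict node -> list of (neighbor, edge_id, cost)
    Returns (total_cost, list_of_edge_ids) of a minimum-cost s-t path.
    """
    heap = [(0.0, s, [])]
    seen = set()
    while heap:
        dist, u, path = heapq.heappop(heap)
        if u == t:
            return dist, path
        if u in seen:
            continue
        seen.add(u)
        for v, eid, c in adj.get(u, []):
            if v not in seen:
                heapq.heappush(heap, (dist + c, v, path + [eid]))
    return math.inf, []

# Hypothetical statistics (z_bar, m); CLRMR would obtain these from its SB2 bookkeeping.
stats = {'e1': (0.30, 150), 'e2': (0.80, 90), 'e3': (0.55, 120), 'e4': (0.20, 30)}
t2, L = 200, 2
edges = {'e1': ('s', 'a'), 'e2': ('s', 'b'), 'e3': ('a', 't'), 'e4': ('b', 't')}
adj = {}
for eid, (u, v) in edges.items():
    c = clrmr_edge_cost(*stats[eid], t2=t2, L=L)
    adj.setdefault(u, []).append((v, eid, c))

cost, path = dijkstra(adj, 's', 't')
print(path, round(cost, 3))  # -> ['e1', 'e3'] 0.287; this path is the arm played next block
```

Note that the under-observed edge e4 receives a large exploration bonus, illustrating how the index trades off the sample-mean delay against how often an edge has been observed.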
Figure 6.2(a) is a graph with 15 links, and there are 96 acyclic paths between s and t. We assume each link has two states, with delay 0.1 on good links and 1 on bad links. Table 6.3 summarizes the transition probabilities on each link.

Link   p01, p10    Link    p01, p10    Link    p01, p10
e.1    0.9, 0.1    e.6     0.8, 0.2    e.11    0.8, 0.3
e.2    0.3, 0.9    e.7     0.2, 0.7    e.12    0.2, 0.7
e.3    0.2, 0.7    e.8     0.3, 0.8    e.13    0.8, 0.1
e.4    0.2, 0.7    e.9     0.1, 0.9    e.14    0.4, 0.8
e.5    0.3, 0.9    e.10    0.3, 0.6    e.15    0.1, 0.8

Table 6.3: Transition probabilities.

Figure 6.2(b) shows another graph with 19 links, only 4 links more than the graph in Figure 6.2(a). But there are $4\times(1+P(4,1)+P(4,2)+P(4,3)+P(4,4)) = 260$ acyclic paths between s and t in this graph, many more than the number of paths in Figure 6.2(a). We use P(N,M) here to denote the number of permutations that arrange M out of N choices. We again assume each link has two states, with delay 0.1 on good links and 1 on bad links. Table 6.4 summarizes the transition probabilities on each link.

Link   p01, p10    Link    p01, p10    Link    p01, p10
e.1    0.2, 0.8    e.8     0.3, 0.8    e.15    0.1, 0.8
e.2    0.3, 0.9    e.9     0.1, 0.9    e.16    0.8, 0.1
e.3    0.2, 0.7    e.10    0.9, 0.1    e.17    0.2, 0.7
e.4    0.7, 0.1    e.11    0.3, 0.8    e.18    0.9, 0.1
e.5    0.3, 0.9    e.12    0.2, 0.7    e.19    0.3, 0.8
e.6    0.2, 0.7    e.13    0.8, 0.1
e.7    0.2, 0.8    e.14    0.4, 0.8

Table 6.4: Transition probabilities.

(a) Simulation results for the stochastic shortest path problem in Figure 6.2(a) with L = 1324.
(b) Simulation results for the stochastic shortest path problem in Figure 6.2(b) with L = 1512.
(c) Simulation results for the stochastic shortest path problem in Figure 6.2(b) with L = 2 (logarithmic scale for the y-axis).
Figure 6.3: Normalized regret $R(n)/\ln n$ vs. $n$ time slots.

Figure 6.3 shows the simulation results for the graphs in Figure 6.2. In Figures 6.3(a) and 6.3(b), we let L = 1324 and L = 1512 respectively, such that $L \geq 56(H+1)S_{\max}^2 r_{\max}^2 \hat{\pi}_{\max}^2/\epsilon_{\min}$ (a short computation of these constants for two-state chains is sketched at the end of this subsection). We let L = 2 for Figure 6.3(c). We can see that in all three cases, for both graphs, our proposed CLRMR performs better than the naive application of RCA. We can also see that under both policies the regret grows logarithmically in time. We note that as the number of links increases from 15 to 19, so that the number of paths increases much faster, from 96 to 260, the gap between the RCA policy and our CLRMR policy becomes much larger.

Another observation is that when L is reduced from 1512 to 2, as shown in Figure 6.3(c), CLRMR and RCA still seem to yield logarithmic regret over time, and the performance is in fact much better than with L = 1512. Note that in the proof of Theorem 7, when $L < 56(H+1)S_{\max}^2 r_{\max}^2 \hat{\pi}_{\max}^2/\epsilon_{\min}$, we have $-\frac{L\epsilon_{\min}-56HS_{\max}^2 r_{\max}^2\hat{\pi}_{\max}^2}{28S_{\max}^2 r_{\max}^2\hat{\pi}_{\max}^2} > -2$. This implies that $\sum_{\tau=1}^{\infty} 2\tau^{-\frac{L\epsilon_{\min}-56HS_{\max}^2 r_{\max}^2\hat{\pi}_{\max}^2}{28S_{\max}^2 r_{\max}^2\hat{\pi}_{\max}^2}}$ no longer converges, and thus we can no longer bound $E[\tilde{B}_i(b)]$. Empirically, however, Figure 6.3(c) shows that the case $L < 1512$ also seems to yield logarithmic regret over time, and the performance is in fact better than that of $L \geq 1512$, since the non-optimal arms are played less when L is smaller. However, this may simply be because the cases in which $\tilde{B}_i(b)$ grows faster than $\ln t$ occur only with very small probability; the smaller L is, the greater this probability is.
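For two-state link chains, the constants entering the threshold $56(H+1)S_{\max}^2 r_{\max}^2\hat{\pi}_{\max}^2/\epsilon_{\min}$ can be computed in closed form: a two-state chain is reversible, so its multiplicative symmetrization is $\hat{P}=P^2$ with second eigenvalue $(1-p_{01}-p_{10})^2$, giving $\epsilon = 1-(1-p_{01}-p_{10})^2$. The sketch below (Python) carries this out for the links of Table 6.3; taking $H = 6$ for the longest acyclic s-t path in Figure 6.2(a) is our own assumption, chosen to be consistent with the value L = 1324 used above.

```python
import math

# Transition probabilities (p01, p10) of the 15 two-state link chains in Table 6.3.
links = {
    'e.1': (0.9, 0.1), 'e.2': (0.3, 0.9), 'e.3': (0.2, 0.7), 'e.4': (0.2, 0.7),
    'e.5': (0.3, 0.9), 'e.6': (0.8, 0.2), 'e.7': (0.2, 0.7), 'e.8': (0.3, 0.8),
    'e.9': (0.1, 0.9), 'e.10': (0.3, 0.6), 'e.11': (0.8, 0.3), 'e.12': (0.2, 0.7),
    'e.13': (0.8, 0.1), 'e.14': (0.4, 0.8), 'e.15': (0.1, 0.8),
}

def two_state_quantities(p01, p10):
    """pi_hat = max_x max{pi_x, 1 - pi_x} and the eigenvalue gap of the
    multiplicative symmetrization for a two-state chain (reversible, so
    P_hat = P^2 and its second eigenvalue is (1 - p01 - p10)^2)."""
    pi1 = p01 / (p01 + p10)              # stationary probability of state "1"
    pi_hat = max(pi1, 1.0 - pi1)
    eps = 1.0 - (1.0 - p01 - p10) ** 2   # eigenvalue gap of P_hat
    return pi_hat, eps

pi_hat_max = max(two_state_quantities(*pq)[0] for pq in links.values())
eps_min = min(two_state_quantities(*pq)[1] for pq in links.values())

S_max, r_max = 2, 1.0  # two states per link; delays are 0.1 (good) or 1 (bad)
H = 6                  # assumed length of the longest acyclic s-t path in Fig. 6.2(a)

L_threshold = 56 * (H + 1) * S_max**2 * r_max**2 * pi_hat_max**2 / eps_min
print(round(pi_hat_max, 2), round(eps_min, 2), round(L_threshold, 1))
# -> 0.9 0.96 1323.0, so any L >= 1324 satisfies the condition of Theorem 7.
```

The corresponding computation for the links of Table 6.4, with H = 7 for the larger graph, yields a threshold of 1512, which is consistent with the value of L used for Figure 6.3(b).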
6.5.2 Stochastic Bipartite Matching for Channel Allocation

As a second application, we consider a cognitive radio network where M secondary users interfering with each other need to be allocated to Q non-conflicting orthogonal channels. We assume that, due to geographic dispersion, each user may see different primary user occupancy behavior on each channel. The availability of spectrum opportunities on each user-channel combination (i,j) over a decision period is modeled as a restless two-state Markov chain. It is easy to show that applying CLRMR to this problem yields storage linear in MQ, and a regret bound that scales as $O(\min\{M,Q\}^2 MQ\log n)$, following Theorem 8.

The computation time of CLRMR is also polynomial, since there are various algorithms to solve the different variations of the maximum weighted matching problem, such as the Hungarian algorithm for maximum weighted bipartite matching [50] and Edmonds's matching algorithm [31] for a general maximum matching.

We show simulation results of our CLRMR algorithm and the naive RCA algorithm for opportunistic spectrum access in two scenarios: (i) a system consisting of Q = 7 orthogonal channels and M = 4 secondary users, and (ii) a system consisting of Q = 9 orthogonal channels and M = 5 secondary users. The transition probability matrices used for these two scenarios are presented in Tables 6.5 and 6.6.

       ch.1     ch.2     ch.3     ch.4     ch.5     ch.6     ch.7
u.1    0.5,0.6  0.2,0.7  0.3,0.9  0.8,0.1  0.2,0.8  0.2,0.6  0.2,0.9
u.2    0.2,0.7  0.2,0.9  0.1,0.8  0.3,0.7  0.3,0.6  0.1,0.7  0.8,0.2
u.3    0.7,0.1  0.2,0.7  0.1,0.8  0.2,0.8  0.5,0.6  0.2,0.8  0.2,0.7
u.4    0.2,0.6  0.2,0.8  0.1,0.8  0.4,0.6  0.9,0.2  0.1,0.7  0.2,0.8

Table 6.5: Transition probabilities p01, p10 for each user-channel pair.

       ch.1     ch.2     ch.3     ch.4     ch.5     ch.6     ch.7     ch.8     ch.9
u.1    0.5,0.6  0.2,0.7  0.2,0.9  0.8,0.1  0.2,0.7  0.3,0.7  0.2,0.9  0.2,0.7  0.1,0.9
u.2    0.3,0.8  0.1,0.9  0.2,0.8  0.3,0.7  0.3,0.6  0.2,0.8  0.4,0.7  0.2,0.8  0.9,0.2
u.3    0.8,0.1  0.2,0.7  0.3,0.7  0.2,0.8  0.5,0.6  0.2,0.7  0.2,0.7  0.2,0.8  0.1,0.9
u.4    0.3,0.9  0.2,0.8  0.2,0.9  0.4,0.6  0.9,0.2  0.2,0.9  0.2,0.9  0.2,0.9  0.2,0.9
u.5    0.5,0.6  0.2,0.7  0.3,0.9  0.2,0.7  0.5,0.5  0.2,0.7  0.8,0.1  0.3,0.9  0.3,0.9

Table 6.6: Transition probabilities p01, p10 for each user-channel pair.

The simulation results are shown and compared in Figure 6.4. For scenario (i), there are P(7,4) = 840 matchings but only 7 × 4 = 28 Markov chains. For scenario (ii), as the number of channels and users increases to 9 and 5, the number of matchings is much higher (P(9,5) = 15120), about 336 times the number of Markov chains (9 × 5 = 45). So the storage as well as the regret of the naive RCA policy grow much faster than those of the CLRMR policy, as indicated in Figures 6.4(a) and 6.4(b). For these two simulations, we pick the values of L as 922 and 1135 such that $L \geq 56(H+1)S_{\max}^2 r_{\max}^2\hat{\pi}_{\max}^2/\epsilon_{\min}$. We also show the simulation results when L is reduced from 1135 to 2 in Figure 6.4(c). Again, we see that the performance seems to improve in practice with smaller L values, even if this is not theoretically guaranteed.

(a) N = 7 channels, M = 4 secondary users, L = 922.
(b) N = 9 channels, M = 5 secondary users, L = 1135.
(c) N = 9 channels, M = 5 secondary users, L = 2 (logarithmic scale for the y-axis).
Figure 6.4: Normalized regret $R(n)/\ln n$ vs. $n$ time slots.

6.6 Summary

We have presented CLRMR, a provably efficient online learning policy for stochastic combinatorial network optimization with restless Markovian rewards. This algorithm is widely applicable to many networking problems of interest, as illustrated by our simulation-based evaluation of the policy over two different problems: stochastic shortest path and stochastic maximum weight bipartite matching.

Chapter 7

Learning for Stochastic Water-Filling

7.1 Overview

The classic water-filling algorithm is deterministic and requires perfect knowledge of the channel gain-to-noise ratios. In this chapter,1 we consider how to do power allocation over stochastically time-varying (i.i.d.) channels with unknown gain-to-noise ratio distributions. We adopt an online learning framework based on stochastic multi-armed bandits. We consider two variations of the problem, one in which the goal is to find a power allocation to maximize $\sum_i E[\log(1+\mathrm{SNR}_i)]$, and another in which the goal is to find a power allocation to maximize $\sum_i \log(1+E[\mathrm{SNR}_i])$. For the first problem, we propose a cognitive water-filling algorithm that exploits the linear structure of this problem, which we call CWF1. We show that CWF1 obtains a regret that grows polynomially in the number of channels and logarithmically in time. It therefore asymptotically achieves the optimal time-averaged rate that can be obtained when the gain distributions are known. For the second problem, we present an algorithm called CWF2, which is, to our knowledge, the first algorithm in the literature on stochastic multi-armed bandits to exploit non-linear dependencies between the arms. We prove that the number of times CWF2 picks the incorrect power allocation is bounded by a function that is polynomial in the number of channels and logarithmic in time, implying that its frequency of incorrect allocation asymptotically tends to zero.

1 This chapter is based in part on [34].

7.2 Problem Formulation

We define the stochastic version of the classic communication-theory problem of power allocation for maximizing rate over parallel channels (water-filling) as follows.

We consider a system with N channels, where the channel gain-to-noise ratios are unknown random processes $X_i(n)$, $1 \leq i \leq N$. Time is slotted and indexed by n. We assume that $X_i(n)$ evolves as an i.i.d. random process over time (i.e., we consider block fading), with the only restriction that its distribution has finite support. Without loss of generality, we normalize $X_i(n) \in [0,1]$. We do not require that $X_i(n)$ be independent across i. This random process is assumed to have a mean $\theta_i = E[X_i]$ that is unknown to the users. We denote the set of all these means by $\Theta = \{\theta_i\}$.

At each decision period n (also referred to interchangeably as a time slot), an N-dimensional action vector $a(n)$, representing a power allocation on these N channels, is selected under a policy $\phi(n)$. We assume that the power levels are discrete, and we can put any constraint on the selection of power allocations such that they are from a finite
We adopt a general formulation for water-filling, where the sum rate 2 obtained at timenbyallocatingasetofpowersa(n) isdefinedas: R a(n) (n) = X i∈A a(n) f i (a i (n),X i (n)). (7.1) where for alli, f i (a i (n),X i (n)) is a nonlinear continuous increasing sub-additivefunc- tioninX i (n),andf i (a i (n),0) = 0foranya i (n). Weassumef i isdefinedonR + ×R + . Our formulation is general enough to include as a special case of the rate function obtainedfromShannon’scapacitytheoremforAWGN,whichiswidelyusedincommu- nicationnetworks: R a(n) (n) = N X i=1 log(1+a i (n)X i (n)) 2 We refertorateandrewardinterchangeablyinthischapter. 129 In the typical formulation there is a total power constraint and individual power con- straints,thecorrespondingconstraintis F ={a : N X i=1 a i ≤P total ∧0≤ a i ≤P i ,∀i}, whereP total isthetotalpowerconstraintandP i isthemaximumallowedpowerperchan- nel. Our goal is to maximize the expected sum-rate when the distributions of all X i are unknown,asshownin(7.2). WerefertothisobjectiveasO 1 . max a∈F E[ X i∈Aa f i (a i ,X i ))] (7.2) Note that even whenX i have known distributions, this is a hard combinatorial non- linear stochastic optimization problem. In our setting, with unknown distributions, we canformulatethisasamulti-armedbanditproblem,whereeachpowerallocationa(n)∈ F is an arm and the reward function is in a combinatorialnon-linearform. The optimal arms are the ones with the largest expected reward, denoted asO ∗ ={a ∗ }. For the rest ofthechapter,weuse∗astheindexindicatingthataparameterisforanoptimalarm. If morethanoneoptimalarmexists,∗refers toanyoneofthem. We note that for the combinatorial multi-armed bandit problem with linear rewards where the reward function is defined byR a(n) (n) = P i∈A a(n) a i (n)X i (n),a ∗ is a solution to a deterministicoptimizationproblem because max a∈F E[ P i∈Aa a i X i ] = max a∈F P i∈Aa a i E[X i ]. 130 Different from the combinatorial multi-armed bandit problem with linear rewards, a ∗ hereisasolutiontoastochasticoptimizationproblem,i.e., a ∗ ∈O ∗ ={˜ a : ˜ a = argmax a∈F E[ X i∈Aa f i (a i ,X i ))]}. (7.3) WeevaluatepoliciesforO 1 withrespecttoregret,whichisdefinedasthedifference between the expected reward that could be obtained by a genie that can pick an optimal arm at each time, and thatobtained by thegivenpolicy. Notethat minimizingtheregret isequivalenttomaximizingtheexpectedrewards. Regretcan beexpressedas: R φ (n) =nR ∗ −E[ n X t=1 R φ(t) (t)], (7.4) whereR ∗ = max a∈F E[ P i∈Aa f i (a i ,X i ))],theexpectedrewardofanoptimalarm. Intuitively, we would like the regretR φ (n) to be as small as possible. If it is sub- linearwithrespecttotimen,thetime-averagedregretwilltendtozeroandthemaximum possible time-averaged reward can be achieved. Note that the number of arms|F| can beexponentialinthenumberofunknownrandomvariablesN. We also note that for the stochastic version of the water-filling problems, a typical way in practice to deal with the unknown randomness is to estimate the mean channel gaintonoiseratiosfirstandthenfindtheoptimizedallocationbasedonthemeanvalues. Thisapproachtriestoidentifythepowerallocationthatmaximizesthepower-rateequa- tionappliedtothemeanchannelgain-to-noiseratios. Werefertomaximizingthisasthe 131 sum-pseudo-rate over averaged channels. We denote this objective byO 2 , as shown in (7.5). max a∈F X i∈Aa f i (a i ,E[X i ]) (7.5) WewouldalsoliketodevelopanonlinelearningpolicyforO 2 . Notethattheoptimal arma ∗ ofO 2 is a solutionto a deterministicoptimizationproblem. 
So, we evaluatethe policies forO 2 with respect to the expected total number of times that a non-optimal power allocation is selected. We denote by T a (n) the number of times that a power allocationispickeduptotimen. Wedenoter a = P i∈Aa f i (a i ,E[X i ]). LetT φ non (n)denote the total number of times that a policyφ select a power allocationr a < r a ∗ . Denote by 1 φ t (a) theindicatorfunctionwhich is equal to 1 ifa isselected underpolicyφ at timet, and0else. Then E[T φ non (n)] =n−E[ n X t=1 1 φ t (a ∗ ) = 1] (7.6) = X ra<r a ∗ E[T a (n)]. 7.3 OnlineLearningforMaximizingtheSum-Rate Wefirstpresentinthissectionanonlinelearningpolicyforstochasticwater-fillingunder objectO 1 . 132 7.3.1 PolicyDesign A straightforward, naive way to solve this problem is to use the UCB1 policy proposed [13]. ForUCB1, eachpowerallocationistreatedasanarm, andthearmthatmaximizes ˆ Y k + q 2lnn m k will be selected at each time slot, where ˆ Y k is the mean observed reward on arm k, and m k is the number of times that arm k has been played. This approach essentially ignores the underlying dependencies across the different arms, and requires storage that is linear in the number of arms and yields regret growing linearly with the numberofarms. Sincetherecanbeanexponentialnumberofarms,theUCB1algorithm performspoorlyonthisproblem. We note that for combinatorial optimization problems with linear reward functions, anonlinelearningalgorithmLLRhasbeenproposedinChapter4asanefficientsolution. LLRstoresthemeanofobservedvaluesforeveryunderlyingunknownrandomvariable, as well as the number of times each has been observed. So the storage of LLR is linear in the number of unknown random variables, and the analysis in Chapter 4 shows LLR achieves a regret that grows logarithmicallyin time, and polynomiallyin the number of unknownparameters. However, the challenge with stochastic water-filling with objectiveO 1 , where the expectationisoutsidethenon-linearreward function,directlystoringthemeanobserva- tionsofX i willnotwork. Todealwiththischallenge,weproposetostoretheinformationforeacha i ,X i com- bination, i.e., ∀1 ≤ i ≤ N, ∀a i , we define a new set of random variables Y i,a i = 133 f i (a i ,X i ). So now the number of random variables Y i,a i is N P i=1 |B i |, whereB i = {a i : a i 6= 0}. Notethat N P i=1 |B i |≤ PN. Thentherewardfunctioncanbeexpressedas R a = X i∈Aa Y i,a i , (7.7) Notethat(7.7)isinacombinatoriallinearform. For this redefined MAB problem with N P i=1 |B i | unknown random variables and lin- ear reward function (7.7), we propose the following online learning policy CWF1 for stochasticwater-fillingasshowninAlgorithm8. Algorithm8OnlineLearningforStochasticWater-Filling: CWF1 1: // INITIALIZATION 2: If max a |A a |isknown,letL = max a |A a |;else,L =N; 3: forn = 1toN do 4: Playanyarmasuchthatn∈A a ; 5: ∀i∈A a ,∀a i ∈B i ,Y i,a i := Y i,a i m i +f i (a i ,X i ) m i +1 ; 6: ∀i∈A a ,m i :=m i +1; 7: endfor 8: // MAIN LOOP 9: while1do 10: n :=n+1; 11: Playanarmawhichsolvesthemaximizationproblem X i∈Aa (Y i,a i + s (L+1)lnn m i ); (7.8) 12: ∀i∈A a ,∀a i ∈B i ,Y i,a i := Y i,a i m i +f i (a i ,X i ) m i +1 ; 13: ∀i∈A a ,m i :=m i +1; 14: endwhile 134 Tohaveatighterboundofregret,differentfromtheLLRalgorithm,insteadofstoring the number of times that each unknown random variables Y i,a i has been observed, we use a 1 byN vector, denoted as (m i ) 1×N , to storethe numberof timesthatX i has been observeduptothecurrenttimeslot. 
Weusea1by N P i=1 |B i |vector,denotedas(Y i,a i ) 1× N P i=1 |B i | tostoretheinformationbased on the observed values. (Y i,a i ) 1× N P i=1 |B i | is updated in as shown in line 12. Each time an arma(n)isplayed,∀i∈A a(n) ,theobservedvalueofX i isobtained. Foreveryobserved value ofX i ,|B i | values are updated: ∀a i ∈ B i , the average valueY i,a i of all the values of Y i,a i up to the current time slot is updated. CWF1 policy requires storage linear in N P i=1 |B i |. 7.3.2 AnalysisofRegret Theorem10. Theexpected regret under theCWF1 policyisatmost 4L 2 (L+1)N lnn (Δ min ) 2 +N + π 2 3 LN Δ max . (7.9) where Δ min = min a6=a ∗ R ∗ −E[R a ],Δ max = max a6=a ∗ R ∗ −E[R a ]. NotethatL≤N. Proof. LetC t,m i denote q (L+1)lnt m i . Weintroduce e T i (n) as acounterafter theinitializa- tionperiod. Itisupdatedinthefollowingway: Ateachtimeslotaftertheinitializationperiod,oneofthetwocasesmusthappen: (1) an optimal arm is played; (2) a non-optimalarm is played. In the first case, ( e T i (n)) 1×N 135 won’t be updated. When an non-optimal arma(n) is picked at time n, there must be at least one i ∈ A a such that i = argmin j∈Aa m j . If there is only one such arm, e T i (n) is increased by 1. If there are multiple such arms, we arbitrarily pick one, say i ′ , and increment e T i ′ by1. Each time when a non-optimal arm is picked, exactly one element in ( e T i (n)) 1×N is incrementedby1. Thisimpliesthatthetotalnumberthatwehaveplayedthenon-optimal armsisequaltothesummationofallcountersin ( e T i (n)) 1×N . Therefore, wehave: X a:a6=a ∗ E[T a (n)] = N X i=1 E[ e T i (n)]. (7.10) Alsonotefor e T i (n),thefollowinginequalityholds: e T i (n)≤m i (n),∀1≤i≤N. (7.11) Denoteby e I i (n)theindicatorfunctionwhichisequalto1if e T i (n)isaddedbyoneat timen. Letl beanarbitrarypositiveinteger. Then: e T i (n) = n X t=N+1 1{ e I i (t) = 1} ≤l+ n X t=N+1 1{ e I i (t) = 1, e T i (t−1)≥l} (7.12) where 1(x) is the indicator function defined to be 1 when the predicate x is true, and 0 when it is false. When e I i (t) = 1, a non-optimal arma(t) has been picked for which 136 m i = min j {m j : ∀j ∈ A a(t) }. We denote this arm as a(t) since at each time that e I i (t) = 1,wecouldgetdifferentarms. We denote by Y i,a i ,m i the average (sample mean) of all the observed values of Y i,a i whenthecorrespondingX i isobservedm i times. LetE[Y i,a i ]denoteE[f i (a i ,X i )]. Thenwehave, e T i (n)≤l+ n X t=N+1 1{ X j∈A a ∗ (Y j,a ∗ j ,m j (t−1) +C t−1,m j (t−1) ) X j∈A a(t) (Y j,a j (t),m j +C t−1,m j (t−1) ), e T i (t−1)≥l} ≤l+ n X t=N 1{ X j∈A a ∗ (Y j,a ∗ j ,m j (t) +C t,m j (t) ) ≤ X j∈A a(t) (Y j,a j (t),m j (t) +C t,m j (t) ), e T i (t)≥l}. (7.13) Notethatl≤ e T i (t)implies, l≤ e T i (t)≤m j (t),∀j ∈A a(t) . (7.14) Then, e T i (n)≤l+ n X t=N+1 1{ X j∈A a ∗ (Y j,a ∗ j ,m j (t−1) +C t−1,m j (t−1) ) ≤ X j∈A a(t) (Y j,a j (t),m j +C t−1,m j (t−1) ), e T i (t−1)≥l} 137 ≤l+ n X t=N 1{ X j∈A a ∗ (Y j,a ∗ j ,m j (t) +C t,m j (t) ) ≤ X j∈A a(t) (Y j,a j (t),m j (t) +C t,m j (t) ), e T i (t)≥l}. where h j (1 ≤ j ≤ |A a∗ |) represents the j-th element inA a∗ and p j (1 ≤ j ≤ |A a(t) |) representsthej-thelementinA a(t) . |Aa∗| P j=1 (Y h j ,a ∗ h j ,m h j +C t,m h j )≤ |A a(t) | P j=1 (Y p j ,ap j (t),mp j +C t,mp j )meansthatatleastoneof thefollowingmustbetrue: |Aa∗| X j=1 Y h j ,a ∗ h j ,m h j ≤R ∗ − |Aa∗| X j=1 C t,m h j , (7.15) |A a(t) | X j=1 Y p j ,ap j (t),mp j ≥R a(t) + |A a(t) | X j=1 C t,mp j , (7.16) R ∗ <R a(t) +2 |A a(t) | X j=1 C t,mp j . 
(7.17) NowwefindtheupperboundforP{ |Aa∗| P j=1 Y h j ,a ∗ h j ,m h j ≤R ∗ − |Aa∗| P j=1 C t,m h j }. Wehave: P{ |Aa∗| X j=1 Y h j ,a ∗ h j ,m h j ≤R ∗ − |Aa∗| X j=1 C t,m h j } =P{ |Aa∗| X j=1 Y h j ,a ∗ h j ,m h j ≤ |Aa∗| X j=1 E[Y h j ,a ∗ h j ]− |Aa∗| X j=1 C t,m h j } ≤ |Aa∗| X j=1 P{Y h j ,a ∗ h j ,m h j ≤E[Y h j ,a ∗ h j ]−C t,m h j }. 138 ∀1 ≤ j ≤ |A a∗ |, applying the Chernoff-Hoeffding bound stated in Lemma 1 in Chapter2,wecouldfindtheupperboundofeach itemintheaboveequationas, P{Y h j ,a ∗ h j ,m h j ≤E[Y h j ,a ∗ h j ]−C t,m h j } =P{m h j Y h j ,a ∗ h j ,m h j ≤ m h j E[Y h j ,a ∗ h j ]−m h j C t,m h j } ≤e −2· 1 m h j ·(m h j ) 2 · (L+1)lnt m h j =e −2(L+1)lnt =t −2(L+1) . Thus, P{ |Aa∗| X j=1 Y h j ,a ∗ h j ,m h j ≤R ∗ − |Aa∗| X j=1 C t,m h j } ≤|A a∗ |t −2(L+1) ≤ Lt −2(L+1) . (7.18) Similarly,wecan gettheupperboundoftheprobabilityforinequality(7.16): P{ |A a(t) | X j=1 Y p j ,ap j (t),mp j ≥R a(t) + |A a(t) | X j=1 C t,mp j }≤Lt −2(L+1) . (7.19) 139 Notethatforl≥ 4(L+1)lnn Δ a(t) Lamax 2 , R ∗ −R a(t) −2 |A a(t) | X j=1 C t,mp j =R ∗ −R a(t) −2 |A a(t) | X j=1 s (L+1)lnt m p j ≥R ∗ −R a(t) −L r 4(L+1)lnn l ≥R ∗ −R a(t) −L s 4(L+1)lnn 4(L+1)lnn Δ a(t) L 2 ≥R ∗ −R a(t) −Δ a(t) = 0. (7.20) Equation (7.39) implies that condition (7.15) is false when l = 4(L+1)lnn Δ a(t) L 2 . If we letl = 4(L+1)lnn ( Δ min L ) 2 ,then(7.15)isfalseforalla(t). Therefore, E[ e T i (n)]≤ & 4(L+1)lnn Δ min L 2 ' + ∞ X t=1 t X m h 1 =1 ··· t X m h |A ∗ | =1 t X mp 1 =l ··· t X mp |A a(t) | =l 2Lt −2(L+1) ≤ 4L 2 (L+1)lnn (Δ min ) 2 +1+L ∞ X t=1 2t −2 ≤ 4L 2 (L+1)lnn (Δ min ) 2 +1+ π 2 3 L. (7.21) 140 SounderCWF1policy,wehave: R φ n (Θ) =R ∗ n−E φ [ n X t=1 R φ(t) (t)] = X a:Ra<R ∗ Δ a E[T a (n)] ≤ Δ max X a:Ra<R ∗ E[T a (n)] = Δ max N X i=1 E[ e T i (n)] ≤ " N X i=1 4L 2 (L+1)lnn (Δ min ) 2 +N + π 2 3 LN # Δ max ≤ 4L 2 (L+1)N lnn (Δ min ) 2 +N + π 2 3 LN Δ max . (7.22) Remark 4. For CWF1 policy, although there are N P i=1 |B i | random variables, the upper bound of regret remainsO(N 4 logn), which is the same as LLR, as shown by Theorem 2 in Chapter 4. DirectlyapplyingLLR algorithmto solvethe redefined MAB problemin (7.7)willresultin a regret thatgrowsasO(P 4 N 4 logn). Remark5. Algorithm8willevenworkforratefunctionsthatdonotsatisfysubadditivity. Remark6. WecandevelopsimilarpoliciesandresultswhenX i areMarkovianrewards asin Chapter5 andChapter 6. 141 7.4 OnlineLearningforSum-Pseudo-Rate We now show our novel online learning algorithm CWF2 for stochastic water-filling with objectO 2 . Unlike CWF1, CWF2 exploits non-linear dependencies between the choices of power allocations and requires lower storage. Under condition where the powerallocation that maximizeO 2 also maximizeO 1 , we will see through simulations thatCWF2 hasbetterregretperformance. 7.4.1 PolicyDesign Our proposed policy CWF2 for stochastic water filling with objectiveO 2 is shown in Algorithm9. Algorithm9OnlineLearningforStochasticWater-Filling: CWF2 1: // INITIALIZATION 2: If max a |A a |isknown,letL = max a |A a |;else,L =N; 3: forn = 1toN do 4: Playanyarmasuchthatn∈A a ; 5: ∀i∈A a ,X i := X i m i +X i m i +1 ,m i :=m i +1; 6: endfor 7: // MAIN LOOP 8: while1do 9: n :=n+1; 10: Playanarmawhichsolvesthemaximizationproblem max a∈F X i∈Aa f i (a i ,X i )+f i (a i , s (L+1)lnn m i ) ; (7.23) 11: ∀i∈A a(n) ,X i := X i m i +X i m i +1 ,m i :=m i +1; 12: endwhile 142 Weusetwo1byN vectorstostoretheinformationafterweplayanarmateachtime slot. 
Oneis(X i ) 1×N inwhichX i istheaverage(samplemean)ofalltheobservedvalues ofX i uptothecurrenttimeslot(obtainedthroughpotentiallydifferentsetsofarmsover time). The other one is (m i ) 1×N in which m i is the number of times that X i has been observeduptothecurrenttimeslot. So CWF2policyrequiresstoragelinearinN. 7.4.2 AnalysisofRegret Theorem 11. Under the CWF2 policy, the expected total number of times that non- optimalpower allocationsareselected isat most E[T φ non (n)]≤ N(L+1)lnn B 2 min +N + π 2 3 LN, (7.24) whereB min is a constantdefined byδ min andL;δ min = min a:ra<r ∗ (r ∗ −r a ). Proof. Wewillshowtheupperboundoftheregretinthreesteps: (1)introduceacounter e T i (n)(definedasbelow)andshowitsrelationshipwiththeupperboundoftheregret;(2) showtheupperboundofE[ e T i (n)];(3)showtheupperboundofE[T φ non (n)]. (1)The counter e T i (n) After the initialization period, ( e T i (n)) 1×N is introduced as a counter and is updated in the following way: at any time n when a non-optimal power allocation is selected, find i such that i = arg min j∈Aa(n) m j . If there is only one such power allocation, e T i (n) is increased by 1. Iftherearemultiplesuch powerallocations,wearbitrarilypick one,say 143 i ′ ,andincrement e T i ′ by1. Basedontheabovedefinitionof e T i (n),eachtimewhenanon- optimal power allocation is selected, exactly one element in ( e T i (n)) 1×N is incremented by 1. Sothesummationofall countersin ( e T i (n)) 1×N equalstothetotalnumberthatwe haveselectedthenon-optimalpowerallocations,asbelow: X a:Ra<R ∗ E[T a (n)] = N X i=1 E[ e T i (n)]. (7.25) Wealsohavethefollowinginequalityfor e T i (n): e T i (n)≤m i (n),∀1≤i≤N. (7.26) (2)ShowtheupperboundofE[ e T i (n)] Let C t,m i denote q (L+1)lnt m i . Denote by e I i (n) the indicator function which is equal to 1 if e T i (n) is added by one at timen. Letl be an arbitrary positiveinteger. Then, we could get the upper bound of E[ e T i (n)] as shown in (7.27), wherea(t) is defined as a non-optimalpowerallocationpickedattimetwhen e I i (t) = 1. Notethatm i = min j {m j : ∀j ∈A a(t) }. We denote this power allocation bya(t) since at each time that e I i (t) = 1, wecouldgetdifferentselectionsofpowerallocations. E[ e T i (n)] = n X t=N+1 P{ e I i (t) = 1} ≤l+ n X t=N+1 P{ e I i (t) = 1, e T i (t−1)≥l} 144 ≤l+ n X t=N+1 P{ X j∈A a ∗ f j (a ∗ j ,X j,m j (t−1) )+f j (a ∗ j ,C t−1,m j (t−1) ) ≤ X j∈A a(t) f j (a j (t),X j,m j (t−1) )+f j (a j (t),C t−1,m j (t−1) ) , e T i (t−1)≥l}. (7.27) Note that l ≤ e T i (t− 1) implies, l ≤ e T i (t− 1) ≤ m j (t− 1),∀j ∈ A a(t) . So we could get an upper bound of E[ e T i (n)] as shown in (7.28), (7.29), (7.30), (7.31) 3 , where h j (1 ≤ j ≤ |A a∗ |) represents the j-th element in A a∗ ; p j (1 ≤ j ≤ |A a(t) |) represents the j-th element inA a(t) ; r ∗ = |Aa∗| P j=1 f h j (a ∗ h j ,θ h j ) = P i∈A a ∗ f i (a i ,θ i )); r a(t) = |A a(t) | P j=1 f p j (a p j (t),θ p j ) = P i∈Aa f i (a i ,θ i ). E[ e T i (n)]≤l+ n X t=N+1 P{ min 0<m h 1 ,...,m h |Aa∗| <t |Aa∗| X j=1 f h j (a ∗ h j ,X h j ,m h j )+f h j (a ∗ h j ,C t−1,m h j ) ≤ max l≤mp 1 ,...,mp |A a(t) | <t |A a(t) | X j=1 f p j (a p j (t),X p j ,mp j )+f p j (a p j (t),C t−1,mp j ) } ≤l+ ∞ X t=2 t−1 X m h 1 =1 ··· t−1 X m h |A ∗ | =1 t−1 X mp 1 =l ··· t−1 X mp |A a(t) | =l P{ |Aa∗| X j=1 f h j (a ∗ h j ,X h j ,m h j )+f h j (a ∗ h j ,C t−1,m h j ) ≤ |A a(t) | X j=1 f p j (a p j (t),X p j ,mp j )+f p j (a p j (t),C t−1,mp j ) } (7.28) 3 Theseequationsareonthenextpageduetothespacelimitations. 
145 Sowehave, E[ e T i (n)]≤l+ ∞ X t=2 t−1 X m h 1 =1 ··· t−1 X m h |A ∗ | =1 t−1 X mp 1 =l ··· t−1 X mp |A a(t) | =l P{Atleastoneofthefollowingmusthold: |Aa∗| X j=1 f h j (a ∗ h j ,X h j ,m h j )≤ r ∗ − |Aa∗| X j=1 f h j (a ∗ h j ,C t−1,m h j ), (7.29) |A a(t) | X j=1 f p j (a p j (t),X p j ,mp j )≥r a(t) + |A a(t) | X j=1 f p j (a p j (t),C t−1,mp j ), (7.30) r ∗ <r a(t) +2 |A a(t) | X j=1 f p j (a p j (t),C t−1,mp j )} (7.31) Nowweshowtheupperboundoftheprobabilitiesforinequalities(7.29),(7.30)and (7.31)separately. Wefirstfindtheupperboundoftheprobabilityfor(7.29),asshownin (7.33). P{ |Aa∗| X j=1 f h j (a ∗ h j ,X h j ,m h j )≤r ∗ − |Aa∗| X j=1 f h j (a ∗ h j ,C t−1,m h j )} =P{ |Aa∗| X j=1 f h j (a ∗ h j ,X h j ,m h j )+f h j (a ∗ h j ,C t−1,m h j ) ≤ |Aa∗| X j=1 f h j (a ∗ h j ,θ h j )} ≤ |Aa∗| X j=1 P{f h j (a ∗ h j ,X h j ,m h j )+f h j (a ∗ h j ,C t−1,m h j )≤f h j (a ∗ h j ,θ h j )} ≤ |Aa∗| X j=1 P{f h j (a ∗ h j ,X h j ,m h j +C t−1,m h j )≤ f h j (a ∗ h j ,θ h j )} (7.32) = |Aa∗| X j=1 P{X h j ,m h j +C t−1,m h j ≤θ h j } (7.33) 146 Equation(7.32)holdsbecauseoflemma1. So∀j, f h j (a ∗ h j ,X h j ,m h j +C t−1,m h j )≤f h j (a ∗ h j ,X h j ,m h j )+f h j (a ∗ h j ,C t−1,m h j ). (7.34) (7.33)holdsbecause∀i,f i (a i ,X i )isanon-decreasingfunctioninX i foranyX i ≥ 0. In (7.33),∀1≤ j ≤|A a∗ |, applyingtheChernoff-HoeffdingboundstatedinLemma 1,wecouldfindtheupperboundofeach itemas, P{X h j ,m h j +C t−1,m h j ≤θ h j }≤e −2· 1 m h i j ·(m h j ) 2 · (L+1)ln(t−1) m h j = (t−1) −2(L+1) . Thus, P{ |Aa∗| X j=1 f h j (a ∗ h j ,X h j ,m h j )≤ r ∗ − |Aa∗| X j=1 f h j (a ∗ h j ,C t−1,m h j )} ≤|A a∗ |t −2(L+1) ≤ L(t−1) −2(L+1) . (7.35) Nowwecangettheupperboundoftheprobabilityforinequality(7.30),asshownin (7.36). P{ |A a(t) | X j=1 f p j (a p j (t),X p j ,mp j )≥r a(t) + |A a(t) | X j=1 f p j (a p j (t),C t−1,mp j )} =P{ |A a(t) | X j=1 f p j (a p j (t),X p j ,mp j )≥ |A a(t) | X j=1 f p j (a p j (t),θ p j )+f p j (a p j (t),C t−1,mp j ) } ≤ |A a(t) | X j=1 P{f p j (a p j (t),X p j ,mp j )≥ f p j (a p j (t),θ p j )+f p j (a p j (t),C t−1,mp j )} 147 ≤ |A a(t) | X j=1 P{f p j (a p j (t),X p j ,mp j )≥f p j (a p j (t),θ p j +C t−1,mp j )} = |A a(t) | X j=1 P{X p j ,mp j ≥θ p j +C t−1,mp j )≤L(t−1) −2(L+1) . (7.36) Equation(7.36)holds,followingasimilarreasoningasusedtoderive(7.35). For alli and given anya i , sincef i (a i ,x) is an increasing, continuous function inx, wecouldfindaconstantB i (a i )suchthat f i (a i ,B i (a i )) = δ min 2L . (7.37) DenoteB min (a) = min i∈Aa B i (a i ). Then∀i∈A a ,wehave f i (a i ,B min (a))≤ δ min 2L . (7.38) Notethatforl≥ l (L+1)lnn B 2 min (a(t)) m , r ∗ −r a(t) −2 |A a(t) | X j=1 f p j (a p j (t),C t−1,mp j ) =r ∗ −r a(t) −2 |A a(t) | X j=1 f p j (a p j (t), s (L+1)ln(t−1) m p j ) ≥ r ∗ −r a(t) −2 |A a(t) | X j=1 f p j (a p j (t), r (L+1)lnn l ) ≥ r ∗ −r a(t) −2 |A a(t) | X j=1 f p j (a p j (t), r (L+1)lnn l ) (7.39) 148 ≥r ∗ −r a(t) −2 |A a(t) | X j=1 f p j (a p j (t),B min (a(t))) ≥δ a(t) −2 |A a(t) | X j=1 δ min 2L ≥δ a(t) −δ min ≥ 0. (7.40) So(7.31)isfalsewhenl≥ l (L+1)lnn B 2 min (a(t)) m . WedenoteB min = min a∈F B min (a(t))), andlet l≥ l (L+1)lnn B 2 min m ,then(7.31)isfalseforalla(t). Therefore, wegettheupperboundofE[ e T i (n)]asin(7.41). E[ e T i (n)]≤ (L+1)lnn B 2 min + ∞ X t=2 t−1 X m h 1 =1 ··· t−1 X m h |A ∗ | =1 t−1 X mp 1 =l ··· t−1 X mp |A a(t) | =l 2L(t−1) −2(L+1) ≤ (L+1)lnn B 2 min +1+L ∞ X t=1 2t −2 ≤ (L+1)lnn B 2 min +1+ π 2 3 L. 
(7.41) (3)Upper boundofE[T φ non (n)] E[T φ non (n)] = X a:Ra<R ∗ E[T a (n)] = N X i=1 E[ e T i (n)] ≤ N(L+1)lnn B 2 min +N + π 2 3 LN. (7.42) 149 Remark7. CWF2 can beusedto solvethestochasticwater-fillingwith objectiveO 1 as well if∃a ∗ ∈O ∗ , suchthat∀a / ∈O ∗ , X i∈A a ∗ f i (a i ,θ i ))> X j∈Aa f j (a j ,θ j ). (7.43) Then theregret ofCWF2 is atmost R CWF2 (n)≤ N(L+1)lnn B 2 min +N + π 2 3 LN Δ max , (7.44) 7.5 ApplicationsandSimulationResults 7.5.1 NumericalResultsforCWF1 0 1 2 3 4 5 6 7 8 9 10 x 10 5 0 500 1000 1500 Time Regret/Log(t) UCB LLR CWF1 Figure7.1: Normalizedregret R(n) logn vs. ntimeslots. 150 We nowshow thenumerical resultsfor CWF2 policy. We considera OFDM system with4subcarriers. Weassumethebandwidthofthesystemis4MHz,andthenoiseden- sity is−80 dBw/Hz. We assume Rayleigh fading with parameter σ = (2,0.8,2.80.32) for4subcarriers. Weconsiderthefollowingobjectiveforoursimulation: max E " N X i=1 log(1+a i (n)X i (n)) # (7.45) s.t. N X i=1 a i (n)≤P total ,∀n (7.46) a 1 (n)∈{0,10,20,30},∀n (7.47) a 2 (n)∈{0,10,20,30},∀n (7.48) a 3 (n)∈{0,10,20,30,40},∀n (7.49) a 4 (n)∈{0,10,20},∀n (7.50) where P total = 60mW (17.8 dBm). The unit for above power constraints from (7.47) to (7.50)ismW.Notethat(7.46)to(7.50)definetheconstraintsetF. Forthisscenario,thereare140differentchoicesofpowerallocations,andtheoptimal powerallocationcanbecalculatedas (20,20,20,0). We compare the performance of our proposed CWF1 policy with UCB1 policy and LLR policy, as shown in Figure 7.1. As we can see from 7.1, naively applying UCB1 and LLR policy results in a worse performance than CWF1, since theUCB1 policy can notexploittheunderlyingdependenciesacrossarms,andLLRpolicydoesnotutilizethe observationsasefficientlyasCWF1 does. 151 7.5.2 NumericalResultsforCWF2 WeshowthesimulationresultsofCWF2 usingthesamesystemasin7.5.1. Weconsiderthefollowingobjectiveforoursimulation: max " N X i=1 log(1+a i (n)E[X i (n)]) # s.t. a∈F (7.51) whereF issameasin7.5.1. Forthisscenario,weassumeRayleighfadingwithparameterσ = (1.23,1.0,0.55,0.95) for4subcarriers. Andtheoptimalpowerallocationcanbecalculatedas (20,20,0,20). 0 0.5 1 1.5 2 2.5 3 x 10 7 10 3 10 4 10 5 10 6 10 7 10 8 Time Regret/Log(t) Theoretical Upper Bound CWF2 Figure7.2: NumericalresultsofE[ e T i (n)]/lognandtheoreticalbound. Figure7.2showsthesimulationresultsofthetotalnumberoftimesthatnon-optimal power allocations are chosen by running CWF2 up to 30 million time slots. We also 152 show the theoretical upper bound in figure 7.2. In this case, we see that the theoretical upperboundisquitelooseandthealgorithmdoesmuchbetterinpractice. For this setting, we note that (7.43) is satisfied, since (20,20,0,20) also maximizes (7.45). SoasstatedinRemark7,CWF2canalsobeusedtosolvestochasticwaterfilling withO 1 , with regretthat growslogarithmicallyin timeand polynomiallyin thenumber ofchannels. We show a comparison of the UCB1 policy, LLR policy, CWF1 policy and CWF2 policy under this setting in Figure 7.3. We can see that CWF2 performs the best by far sinceitincorporateawaytoexploitnon-lineardependenciesacrossarms,andlearnmore efficiently. 0 1 2 3 4 5 6 7 8 9 10 x 10 5 0 200 400 600 800 1000 1200 1400 1600 1800 Time Regret/Log(t) UCB LLR CWF1 CWF2 Figure7.3: Normalizedregret R(n) logn vs. ntimeslots. 153 7.6 Summary We have considered the problem of optimal power allocation over parallel channels with stochastically time-varying gain-to-noise ratios for maximizing information rate (stochastic water-filling) in this work. 
We approached this problem from the novel per- spective of online learning. The crux of our approach is to map each possible power allocation into arms in a stochastic multi-armed bandit problem. The significant new challenge imposed here is that the reward obtained is a non-linear function of the arm choice and the underlying unknown random variables. To our knowledge there is no priorworkonstochasticMABthatexplicitlytreatssuchaproblem. Wefirstconsideredtheproblemofmaximizingtheexpectedsumrate. Forthisprob- lemwe developedtheCWF1 algorithm. Despitethefact thatthenumberofarms grows exponentiallyinthenumberofpossiblechannels,weshowthattheCWF1 algorithmre- quiresonlypolynomialstorageand alsoyieldsaregretthatispolynomialinthenumber ofpowerlevelsperchannelandthenumberofchannels,andlogarithmicintime. We then considered the problem of maximizing the sum-pseudo-rate, where the pseudo rate for a stochastic channel is defined by applying the power-rate equation to its mean SNR (log(1 +E[SNR]). The justification for considering this problem is its connection to practice (where allocations over stochastic channels are made based on estimatedmean channel conditions). Albeitsub-optimalwithrespect to maximizingthe expected sum-rate, the use of the sum-pseudo-rate as the objective function is a more tractable approach. For this problem, we developed a new MAB algorithm that we call 154 CWF2. This is the first algorithmin the literature on stochasticMAB that exploitsnon- lineardependenciesbetween thearmrewards. Wehaveprovedthatthenumberoftimes this policy uses a non-optimal power allocation is also bounded by a function that is polynomialinthenumberofchannelsandpower-levels,andlogarithmicintime. Our simulations results show that the algorithms we develop are indeed better than naive application of classic MAB solutions. We also see that under settings where the power allocation for maximizing the sum-pseudo-rate matches the optimal power al- location that maximizes the expected sum-rate, CWF2 has significantly better regret- performancethanCWF1. Becauseourformulationsallowforverygeneralclassesofsub-additiverewardfunc- tions, we believe that our technique may be much more broadly applicable to settings other than power allocation for stochasticchannels. We would therefore liketo identify andexploresuchapplicationsinfuturework. 155 Chapter8 DecentralizedLearningforOpportunisticSpectrum Access 8.1 Overview Thefundamentalproblemofmultiplesecondaryuserscontendingforopportunisticspec- trum access overmultiplechannels in cognitiveradio networks has been formulated re- cently as a decentralized multi-armed bandit (D-MAB) problem. In a D-MAB problem there are M users and N arms (channels) that each offer i.i.d. stochastic rewards with unknown means so long as they are accessed without collision. The goal is to design distributed online learning policies that incur minimal regret. We consider two related problem formulations in this chapter 1 . First, we consider the setting where the users have a prioritized ranking, such that it is desired for the K-th-ranked user to learn to access the arm offering the K-th highest mean reward. For this problem, we present 1 Thischapterisbasedinparton[33]. 156 DLP,thefirstdistributedpolicythatyieldsregretthatisuniformlylogarithmicovertime without requiring any prior assumption about the mean rewards. Second, we consider thecasewhenafairaccesspolicyisrequired,i.e.,itisdesiredforalluserstoexperience thesamemeanreward. 
Forthisproblem,wepresentDLF,adistributedpolicythatyields order-optimal regret scaling with respect to the number of users and arms, better than previously proposed policies in the literature. Both of our distributedpolicies make use ofaninnovativemodificationofthewell-knownUCB1policyfortheclassicmulti-armed banditproblemthatallowsasingleusertolearnhowtoplaythearmthatyieldstheK-th largestmeanreward. 8.2 ProblemFormulation WeconsideracognitivesystemwithN channels(arms)andM decentralizedsecondary users (players). The throughput ofN channels are defined by random processesX i (n), 1≤ i≤ N. Timeis slotted and denoted by the indexn. We assumethatX i (n) evolves as an i.i.d. random process over time, with the only restriction that its distributionhave a finite support. Without loss of generality, we normalize X i (n) ∈ [0,1]. We do not require that X i (n) be independent across i. This random process is assumed to have a meanθ i =E[X i ], that is unknown to the users and distinct from each other. We denote thesetofallthesemeansas Θ ={θ i ,1≤i≤ N}. At each decisionperiodn(also referred to interchangeablyas timeslot), each ofthe M decentralized users selects an arm only based on its own observation histories under 157 a decentralized policy. When aparticulararmiis selected by userj, thevalueofX i (n) is only observed by userj, and if there is no other user playing the same arm, a reward of X i (n) is obtained. Else, if there are multiple users playing the same arm, then we assume that, due to collision, at most one of the conflicting users j ′ gets reward X i (n), while the other users get zero reward. This interference assumption covers practical modelsinnetworkingresearch,suchastheperfectcollisionmodel(inwhichnoneofthe conflicting users derive any benefit) and CSMA with perfect sensing (in which exactly one of the conflicting user derives benefit from the channel). We denote the first model asM 1 andthesecondmodelasM 2 . Wedenotethedecentralizedpolicyforuserjattimenasφ j (n),andthesetofpolicies for all users asφ = {φ j (n),1≤ j ≤ M}. We are interested in designing decentralized policies, under which there is no information exchange among users, and analyze them withrespecttoregret,whichisdefinedasthegapbetweentheexpectedrewardthatcould beobtainedbyagenie-aidedperfectselectionandthatobtainedbythepolicy. Wedenote O ∗ M as a set of M arms with M largest expected rewards. The regret can be expressed as: R φ (Θ;n) =n X i∈O ∗ M θ i −E φ [ n X t=1 S φ(t) (t)] (8.1) whereS φ(t) (t)isthesumoftheactualrewardobtainedbyallusersattimetunderpolicy φ(t),whichcouldbeexpressedas: S φ(t) (t) = N X i=1 M X j=1 X i (t)I i,j (t), (8.2) 158 where for M 1 , I i,j (t) is defined to be 1 if user j is the only user to play arm i, and 0 otherwise; for M 2 , I i,j (t) is defined to be 1 if user j is the one with the smallest index among all users playing arm i at time t, and 0 otherwise. Then, if we denote V φ i,j (n) =E[ P n t=1 I i,j (t)],wehave: E φ [ n X t=1 S φ(t) (t)] = N X i=1 M X j=1 θ i E[V φ i,j (n)] (8.3) Besides getting low total regret, there could be other system objectives for a given D-MAB. We consider two in this paper. In the prioritized access problem, we assume that each user has information of a distinct allocation order. Without loss of generality, we assume that the users are ranked in such a way that the m-th user seeks to access thearmwiththem-thhighestmeanreward. Inthefairaccess problem,usersaretreated equallytoreceivethesameexpectedreward. 
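As an illustration of the two interference models, the following sketch computes the per-slot system reward S_φ(t)(t) of (8.2) from the users' channel selections. The list-based representation of the selections and of the realized throughputs is an assumption of this example; the smallest-index tie-breaking under M2 follows the definition of I_{i,j}(t) above.

```python
from collections import Counter

def slot_reward(selections, x, model="M1"):
    """Per-slot system reward S_phi(t) of (8.2), as a sketch.

    selections[j] is the channel picked by user j in this slot, and x[i] is the
    realized throughput X_i(t) of channel i.  Under model M1 (perfect collision)
    a channel chosen by two or more users yields nothing; under M2 (CSMA with
    perfect sensing) exactly one of the colliding users -- by the definition
    above, the one with the smallest index -- still receives X_i(t), so the
    channel contributes once either way."""
    load = Counter(selections)            # how many users sit on each channel
    reward = 0.0
    for i, count in load.items():
        if count == 1 or model == "M2":
            reward += x[i]                # exactly one user derives benefit
        # count >= 2 under M1: collision, no one gets the reward
    return reward

# Example: 3 users on 5 channels; users 0 and 1 collide on channel 2.
x = [0.9, 0.8, 0.7, 0.6, 0.5]
print(slot_reward([2, 2, 0], x, model="M1"))   # 0.9  (channel 2 is wasted)
print(slot_reward([2, 2, 0], x, model="M2"))   # 1.6  (one colliding user still wins)
```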
8.3 SelectiveLearningoftheK-thLargestExpected Reward We first propose a general policy to play an arm with the K-th largest expected reward (1 ≤ K ≤ N) for classic multi-armed bandit problem withN arms and one user, since thekeyideaofourproposeddecentralizedpoliciesrunningateachuserinsection8.4and 8.5 is thatusermwill run alearning policytargeting an arm withm-th largest expected reward. 159 Our proposed policyof learning an arm withK-th largest expected reward is shown inAlgorithm10. Algorithm10SelectivelearningoftheK-thlargestexpectedrewards(SL(K)) 1: // INITIALIZATION 2: fort = 1toN do 3: Leti =tandplayarmi; 4: ˆ θ i (t) =X i (t); 5: m i (t) = 1; 6: endfor 7: // MAIN LOOP 8: while1do 9: t =t+1; 10: LetthesetO K containstheK armswiththeK largestvaluesin(8.4) ˆ θ i (t−1)+ s 2lnt m i (t−1) ; (8.4) 11: Playarmk inO K suchthat k = arg min i∈O K ˆ θ i (t−1)− s 2lnt m i (t−1) ; (8.5) 12: ˆ θ k (t) = ˆ θ k (t−1)m k (t−1)+X k (t) m k (t−1)+1 ; 13: m k (t) =m k (t−1)+1; 14: endwhile Weusetwo1byN vectorstostoretheinformationafterweplayanarmateachtime slot. Oneis ( ˆ θ i ) 1×N in which ˆ θ i is theaverage (samplemean)of all theobserved values ofX i uptothecurrenttimeslot(obtainedthroughpotentiallydifferentsetsofarmsover time). The other one is (m i ) 1×N in which m i is the number of times that X i has been observeduptothecurrenttimeslot. 160 NotethatwhileweindicatethetimeindexinAlgorithm10fornotationalclarity,itis notnecessarytostorethematricesfromprevioustimestepswhilerunningthealgorithm. SoSL(K)policyrequiresstoragelinearinN. Remark: SL(K)policygeneralizesUCB1in[13]andpresentsageneralwaytopick an armwiththeK-th largestexpected rewardsforaclassicmulti-armedbanditproblem withN arms(withouttherequirementofdistinctexpectedrewardsfordifferentarms). Nowwepresenttheanalysisoftheupperboundofregret,andshowthatitislinearin N andlogarithmicintime. WedenoteA K asthesetofarmswithK-thlargestexpected reward. Notethat Algorithm10 is a general algorithm for picking an arm with theK-th largest expected reward for the classic multi-armed bandit problems, where we allow multiplearmswithK-thlargestexpectedreward,andallthesearmsretreatedasoptimal arms. ThefollowingtheoremholdsforAlgorithm10. Theorem 12. Under the policy specified in Algorithm10, the expected number of times thatwe pickanyarmi / ∈A K afterntimeslotsisat most: 8lnn Δ K,i +1+ 2π 2 3 . (8.6) where Δ K,i =|θ K −θ i |,θ K is theK-th largestexpected reward. Proof. DenoteT i (n) asthenumberoftimesthatwepickarmi / ∈A K attimen. Denote C t,m i as q (L+1)lnt m i . Denote ˆ θ i,m i astheaverage(samplemean)ofalltheobservedvalues 161 of X i when it is observedm i time. O ∗ K is denoted as the set of K arms withK largest expectedrewards. DenotebyI i (n)theindicatorfunctionwhichisequalto1ifT i (n)isaddedbyoneat timen. Letl be an arbitrary positiveinteger. Then, for any armi which is not a desired arm,i.e.,i / ∈A K : T i (n) = 1+ n X t=N+1 1{I i (t)}≤ l+ n X t=N+1 1{I i (t),T i (t−1)≥l} ≤ l+ n X t=N+1 (1{I i (t),θ i <θ K ,T i (t−1)≥l}+1{I i (t),θ i >θ K ,T i (t−1)≥l}) (8.7) where1(x) is the indicator function defined to be 1 when the predicate x is true, and 0 whenitisfalse. Note that for the case θ i < θ K , arm i is picked at time t means that there exists an armj(t)∈O ∗ K ,suchthatj(t) / ∈O K . Thismeansthefollowinginequalityholds: ˆ θ j(t),T j(t) (t−1) +C t−1,T j(t) (t−1) ≤ ˆ θ i,T i (t−1)+C t−1,T i (t−1) . 
(8.8) Then,wehave n X t=N+1 1{I i (t),θ i <θ K ,T i (t−1)≥l} ≤ n X t=N+1 1{ ˆ θ j(t),T j(t) (t−1) +C t−1,T j(t) (t−1) ≤ ˆ θ i,T i (t−1) +C t−1,T i (t−1) ,T i (t−1)≥l} (8.9) 162 ≤ n X t=N+1 1{ min 0<m j(t) <t ˆ θ j(t),m j(t) +C t−1,m j(t) ≤ max l≤m i <t ˆ θ i,m i +C t−1,m i } ≤ ∞ X t=1 t−1 X m j(t) =1 t−1 X m i =l 1{ ˆ θ j(t),m j(t) +C t,m j(t) ≤ ˆ θ i,m i +C t,m i } ˆ θ j(t),m j(t) +C t,m j(t) ≤ ˆ θ i,m i +C t,m i impliesthatatleastoneofthefollowingmustbe true: ˆ θ j(t),m j(t) ≤ θ j(t) −C t,m j(t) , (8.10) ˆ θ i,m i ≥θ i +C t,m i , (8.11) θ j(t) <θ i +2C t,m i . (8.12) Applying the Chernoff-Hoeffding bound [70], we could find the upper bound of (8.10) and(8.11)as, P{ ˆ θ j(t),m j(t) ≤θ j(t) −C t,m j(t) }≤ e −4lnt =t −4 , (8.13) P{ ˆ θ i,m i ≥θ i +C t,m i }≤ e −4lnt =t −4 (8.14) Forl≥ l 8lnn Δ 2 K,i m , θ j(t) −θ i −2C t,m i ≥θ K −θ i −2 s 2Δ 2 K,i lnt 8lnn ≥θ K −θ i −Δ K,i = 0, (8.15) so(8.12)isfalsewhenl≥ l 8lnn Δ 2 K,i m . Notethatforthecaseθ i >θ K ,whenarmiispicked at timet, thereare two possibilities: eitherO K =O ∗ K , orO K 6=O ∗ K . IfO K =O ∗ K , the followinginequalityholds: 163 ˆ θ i,T i (t−1) −C t−1,T i (t−1) ≤ ˆ θ K,T K (t−1) −C t−1,T K (t−1) . IfO K 6=O ∗ K ,O K hasatleastonearmh(t) / ∈O ∗ K . Then,wehave: ˆ θ i,T i (t−1) −C t−1,T i (t−1) ≤ ˆ θ h(t),T h(t) (t−1) −C t−1,T h(t) (t−1) . So toconcludebothpossibilitiesforthecaseθ i >θ K ,ifwedenoteO ∗ K−1 =O ∗ K −A K , ateachtimetwhenarmiispicked,theseexistsanarmh(t) / ∈O ∗ K−1 ,suchthat ˆ θ i,T i (t−1) −C t−1,T i (t−1) ≤ ˆ θ h(t),T h(t) (t−1) −C t−1,T h(t) (t−1) . (8.16) Thensimilarly,wecanhave: n X t=N+1 1{I i (t),θ i >θ K ,T i (t−1)≥ l} ≤ ∞ X t=1 t−1 X m i =l t−1 X m h(t) =1 1{ ˆ θ i,m i −C t,m i ≤ ˆ θ h(t),m h(t) −C t,m h(t) } (8.17) Note that ˆ θ i,m i −C t,m i ≤ ˆ θ h(t),m h(t) −C t,m h(t) implies one of the following must be true: ˆ θ i,m i ≤ θ i −C t,m i , (8.18) ˆ θ h(t),m h(t) ≥θ h(t) +C t,m h(t) , (8.19) θ i <θ h(t) +2C t,m i . (8.20) 164 WeagainapplytheChernoff-HoeffdingboundandgetP{ ˆ θ i,m i ≤θ i −C t,m i }≤ t −4 , P{ ˆ θ h(t),m h(t) ≥θ h(t) +C t,m h(t) }≤t −4 . Alsonotethatforl≥ l 8lnn Δ 2 K,i m , θ i −θ h(t) −2C t,m i ≥θ i −θ K −Δ K,i ≥ 0, (8.21) so(8.20)isfalse. Hence, wehave E[T i (n)]≤ & 8lnn Δ 2 K,i ' + ∞ X t=1 t−1 X m j(t) =1 t−1 X m i =⌈(8lnn)/Δ 2 K,i ⌉ (P{ ˆ θ j(t),m j(t) ≤ θ j(t) −C t,m j(t) }+P{ ˆ θ i,m i ≥ θ i +C t,m i }) + ∞ X t=1 t−1 X m i =⌈(8lnn)/Δ 2 K,i ⌉ t−1 X m h(t) =1 (P{ ˆ θ i,m i ≤θ i −C t,m i }+P{ ˆ θ h(t),m h(t) ≥θ h(t) +C t,m h(t) }) ≤ 8lnn Δ 2 K,i +1+2 ∞ X t=1 t−1 X m j(t) =1 t−1 X m i =1 2t −4 ≤ 8lnn Δ 2 K,i +1+ 2π 2 3 . Thedefinitionofregret fortheaboveproblemisdifferentfromthetraditionalmulti- armed bandit problem with the goal of maximization or minimization, since our goal now is to pick the arm with the K-th largest expected reward and we wish we could 165 minimizethenumberoftimesthatwepickthewrongarm. Herewegivetwodefinitions oftheregrettoevaluatetheSL(K)policy. Definition 1. We define the regret of type 1 at each time slot as the absolute difference betweentheexpectedrewardthatcouldbeobtainedbyageniethatcanpickanarmwith K-th largest expected reward, and that obtained by the given policy at each time slot. Then thetotalregretoftype1bytimen isdefined as sumof theregret at eachtimeslot, whichis: R φ 1 (Θ;n) = n X t=1 |θ K −E φ [S φ(t) (t)]| (8.22) Definition 2. 
We define the total regret of type 2 by time n as the absolute difference betweentheexpectedrewardthatcouldbeobtainedbyageniethatcanpickanarmwith K-thlargestexpectedreward, andthatobtainedbythegivenpolicyafternplays,which is: R φ 2 (Θ;n) =|nθ K −E φ [ n X t=1 S φ(t) (t)]| (8.23) Here we note that∀n,R φ 2 (Θ;n) ≤R φ 1 (Θ;n) because|nθ K −E φ [ P n t=1 S φ(t) (t)]| = |nθ K − P n t=1 E φ [S φ(t) (t)]|≤ P n t=1 |θ K −E φ [S φ(t) (t)]|. Corollary1. The expected regret underbothdefinitionsis atmost X i:i/ ∈A k ( 8lnn Δ K,i )+(1+ 2π 2 3 ) X i:i/ ∈A k Δ K,i . (8.24) 166 Proof. UndertheSL(K)policy,wehave: R φ 2 (Θ;n)≤R φ 1 (Θ;n) = n X t=1 |θ K −E φ [S φ(t) (t)]| = X i:i/ ∈A k Δ K,i E[T i (n)] ≤ X i:i/ ∈A k ( 8lnn Δ K,i )+(1+ 2π 2 3 ) X i:i/ ∈A k Δ K,i . (8.25) Corollary 1 showsthe upperbound of the regret ofSL(K) policy. It growslogarith- micalintimeandlinearlyinthenumberofarms. 8.4 DistributedLearningwithPrioritization Wenowconsiderthedistributedmulti-armedbanditproblemwithprioritizedaccess. Our proposeddecentralizedpolicyforN armswithM usersisshowninAlgorithm11. Inthisalgorithm,line2to6istheinitializationpart,forwhichusermwillplayeach arm once to have the initial value in ( ˆ θ m i ) 1×N and (m m i ) 1×N . Line 3 ensures that there willbenocollisionsamongusers. SimilarasinAlgorithm10,weindicatethetimeindex fornotationalclarity. Onlytwo1byN vectors,( ˆ θ m i ) 1×N and(m m i ) 1×N ,areusedbyuser m to store the information after we play an arm at each time slot. We denote o ∗ m as 167 Algorithm 11 Distributed Learning Algorithm with Prioritization for N Arms with M UsersRunningatUserm(DLP) 1: // INITIALIZATION 2: fort = 1toN do 3: Playarmk suchthatk = ((m+t) mod N)+1; 4: ˆ θ m k (t) =X k (t); 5: m m k (t) = 1; 6: endfor 7: // MAIN LOOP 8: while1do 9: t =t+1; 10: Playanarmk accordingtopolicySL(m)specifiedinAlgorithm10; 11: ˆ θ m k (t) = ˆ θ m k (t−1)m m k (t−1)+X k (t) m m k (t−1)+1 ; 12: m m k (t) =m m k (t−1)+1; 13: endwhile the index of arm with them-th largest expected reward. Note that{o ∗ m } 1≤m≤M = O ∗ M . DenoteΔ i,j =|θ i −θ j |forarmi,j. Wenowstatethemaintheoremofthissection. Theorem 13. The expected regret under the DLP policy specified in Algorithm 11 is at most M X m=1 X i6=o ∗ m ( 8lnn Δ 2 o ∗ m ,i +1+ 2π 2 3 )θ o ∗ m + M X m=1 X h6=m ( 8lnn Δ 2 o ∗ h ,o ∗ m +1+ 2π 2 3 )θ o ∗ m . (8.26) Proof. DenoteT i,m (n)thenumberoftimesthatusermpickarmiattimen. For each userm, the regret under DLP policy can arise due to two possibilities: (1) usermplaysanarmi6=o ∗ m ;(2)otheruserh6=mplaysarmo ∗ m . Inbothcases,collisions 168 may happen, resulting a loss which is at mostθ o ∗ m . Considering these two possibilities, theregretofusermisupperboundedby: R φ (Θ,m;n)≤ X i6=o ∗ m E[T i,m (n)]θ o ∗ m + X h6=m E[T o ∗ m ,h (n)]θ o ∗ m (8.27) FromTheorem12,T i,m (n) andT o ∗ m ,h (n)areboundedby E[T i,m (n)]≤ 8lnn Δ 2 o ∗ m ,i +1+ 2π 2 3 , (8.28) E[T o ∗ m ,h (n)]≤ 8lnn Δ 2 o ∗ h ,o ∗ m +1+ 2π 2 3 . (8.29) So, R φ (Θ,m;n)≤ X i6=o ∗ m ( 8lnn Δ 2 o ∗ m ,i +1+ 2π 2 3 )θ o ∗ m + X h6=m ( 8lnn Δ 2 o ∗ h ,o ∗ m +1+ 2π 2 3 )θ o ∗ m (8.30) Theupperboundforregretis: R φ (Θ;n) = M X m=1 R φ (Θ,m;n) ≤ M X m=1 X i6=o ∗ m ( 8lnn Δ 2 o ∗ m ,i +1+ 2π 2 3 )θ o ∗ m + M X m=1 X h6=m ( 8lnn Δ 2 o ∗ h ,o ∗ m +1+ 2π 2 3 )θ o ∗ m (8.31) 169 If we define Δ min = min 1≤i≤N,1≤j≤M Δ i,j , and θ max = max 1≤i≤N θ i , we could get a more concise(butlooser)upperboundas: R φ (Θ;n)≤ M(N +M−2)( 8lnn Δ 2 min +1+ 2π 2 3 )θ max . 
(8.32) Theorem13showsthattheregretofourDLPalgorithmisuniformlyupper-bounded foralltimenbyafunctionthatgrowsasO(M(N +M)lnn). 8.5 DistributedLearningwithFairness Forthepurposeoffairnessconsideration,secondaryusersshouldbetreatedequally,and there should be no prioritization for the users. In this scenario, a naive algorithm is to apply Algorithm 11 directly by rotating the prioritization as shown in Figure 8.1. Each user maintains two 1 byN vectors ( ˆ θ m j,i ) M×N and (m m j,i ) M×N , where thej-th row stores only the observation values for the j-th prioritization vectors. This naive algorithm is showninAlgorithm12. ۏ ێ ێ ێ ۍ ͳ ʹ ڭ ܯ െ ͳ ܯ ے ۑ ۑ ۑ ې ۏ ێ ێ ێ ۍ ʹ ͵ ڭ ܯ ͳ ے ۑ ۑ ۑ ې ۏ ێ ێ ێ ۍ ܯ ͳ ʹ ڭ ܯ െ ͳ ے ۑ ۑ ۑ ې ڮ Figure8.1: Illustrationofrotatingtheprioritizationvector. 170 Algorithm 12 A Naive Algorithm for Distributed Learning Algorithm with Fairness (DLF-Naive)RunningatUserm. 1: At timet, run Algorithm 11 with prioritizationK = ((m+t) mod M) + 1, then updatetheK-throwof( ˆ θ m j,i ) M×N and (m m j,i ) M×N accordingly. WecanseethatthestorageofAlgorithm12growslinearinMN, insteadofN. And it does not utilize the observations under different allocation order, which will result a worse regret as shown in the analysis of this section. To utilize all the observations, we proposeourdistributedlearningalgorithmwithfairness(DLF)inAlgorithm13. Algorithm13DistributedLearningAlgorithmwithFairnessforN ArmswithM Users RunningatUserm(DLF) 1: // INITIALIZATION 2: fort = 1toN do 3: Playarmk suchthatk = ((m+t) mod N)+1; 4: ˆ θ m k (t) =X k (t); 5: m m k (t) = 1; 6: endfor 7: // MAIN LOOP 8: while1do 9: t =t+1; 10: K = ((m+t) mod M)+1; 11: Playanarmk accordingtopolicySL(K)specifiedinAlgorithm10; 12: ˆ θ m k (t) = ˆ θ m k (t−1)m m k (t−1)+X k (t) m m k (t−1)+1 ; 13: m m k (t) =m m k (t−1)+1; 14: endwhile SameasinAlgorithm11,onlytwo1byN vectors,( ˆ θ m i ) 1×N and(m m i ) 1×N ,areused byusermtostoretheinformationafterweplayanarmateach timeslot. Line 11 in Algorithm 13 means userm play the arm with theK-th largest expected reward with Algorithm 10, where the value of K is calculated in line 10 to ensure the 171 desiredarmtopickforeachuserisdifferent,andtheusersplayarmsfromtheestimated largesttotheestimatedsmallestinturnstoensurethefairness. Theorem14. TheexpectedregretundertheDLF-NaivepolicyspecifiedinAlgorithm12 isat most X o ∗ m ∈O ∗ m M X m=1 X i6=o ∗ m ( 8ln⌈n/M⌉ Δ 2 o ∗ m ,i +1+ 2π 2 3 )θ o ∗ m + X o ∗ m ∈O ∗ m M X m=1 X h6=m ( 8ln⌈n/M⌉ Δ 2 o ∗ h ,o ∗ m +1+ 2π 2 3 )θ o ∗ m . (8.33) Proof. Theorem14isadirectconclusionfromTheorem13byreplacingnwith⌈n/M⌉, andthentakethesumoverallM bestarmswhichareplayedinthealgorithm. TheabovetheoremshowsthattheregretoftheDLF-NaivepolicygrowsasO(M 2 (N+ M)lnn). Theorem 15. The expected regret under the DLF policy specified in Algorithm 13 is at most M N X i=1 ( 8lnn Δ 2 min,i +1+ 2π 2 3 )θ max +M(M −1) X i∈O ∗ M ( 8lnn Δ 2 min,i +1+ 2π 2 3 )θ i , (8.34) where Δ min,i = min 1≤m≤M Δ o ∗ m ,i . Proof. DenoteK ∗ m (t) as the index of the arm with theK-th (got by line 10 at timet in Algorithm13runningat userm) largestexpectedreward. DenoteQ m i (n) asthenumber oftimesthatusermpickarmi6=K ∗ m (t) for1≤t≤n. 172 We noticethat for any arbitrary positiveintegerl and any timet,Q m i (t)≥ l implies m i (t)≥l. So(8.7)to(8.21)intheproofofTheorem12stillholdbyreplacingT i (n)with Q m i (n) and replacingK withK ∗ m (t). Note that since all the channels are with different rewards,thereisonlyoneelementinthesetA K ∗ m (t) . 
To find the upper bound ofE[Q m i (n)], we should let l to be l ≥ l 8lnn Δ 2 min,i m such that (8.12)and(8.20)arefalseforallt. Sowehave, E[Q m i (n)] ≤ 8lnn Δ 2 min,i + ∞ X t=1 t−1 X m j(t) =1 t−1 X m i = l (8lnn)/Δ 2 K ∗ m (t),i m (P{ ˆ θ j(t),m j(t) ≤θ j(t) −C t,m j(t) } +P{ ˆ θ i,m i ≥θ i +C t,m i }) + ∞ X t=1 t−1 X m i = l (8lnn)/Δ 2 K ∗ m (t),i m t−1 X m h(t) =1 (P{ ˆ θ i,m i ≤θ i −C t,m i } +P{ ˆ θ h(t),m h(t) ≥θ h(t) +C t,m h(t) }) ≤ 8lnn Δ 2 min,i +1+2 ∞ X t=1 t−1 X m j(t) =1 t−1 X m i =1 2t −4 ≤ 8lnn Δ 2 min,i +1+ 2π 2 3 . (8.35) Hence for userm, we could calculate the upper bound of regret considering the two possibilitiesasintheproofofTheorem13as: R φ (Θ,m;n)≤ N X i=1 Q m i (n)θ max + X h6=m X i∈O ∗ M Q m h (n)θ i (8.36) 173 Sotheupperboundforregretformusersis: R φ (Θ;n) = M X m=1 R φ (Θ,m;n) ≤M N X i=1 ( 8lnn Δ 2 min,i +1+ 2π 2 3 )θ max +M(M −1) X i∈O ∗ M ( 8lnn Δ 2 min,i +1+ 2π 2 3 )θ i (8.37) Tobemoreconcise,wecouldalsowritetheaboveupperboundas: R φ (Θ;n)≤M(N +M(M−1))( 8lnn Δ min +1+ 2π 2 3 )θ max . (8.38) Theorem16. Whentimenislargeenoughsuchthat n lnn ≥ 8(N +M) Δ 2 min +(1+ 2π 2 3 )N +M, (8.39) theexpected regret undertheDLF policyspecifiedinAlgorithm13 isat most M X i/ ∈O ∗ M ( 8lnn Δ 2 min,i +1+ 2π 2 3 )θ max +M 2 (1+ 2π 2 3 )θ max +M(M −1)(1+ 2π 2 3 ) X i∈O ∗ M θ i . (8.40) Proof. Theinequality(8.35)impliesthatthetotalnumberoftimesthatthedesiredarms arepickedby usermat timenislowerboundedbyn− N P i=1 ( 8lnn Δ 2 min,i +1+ 2π 2 3 ). Sinceall 174 thearmswithM largestexpectedrewardsarepickedinturnbythealgorithm,∀i∈O ∗ M , wehave m i (n)≥ 1 M n− N X i=1 ( 8lnn Δ 2 min,i +1+ 2π 2 3 ) ! . (8.41) wherem i (n) refers to the numberof times that armi has been observed up to timen at userm. (Forthepurposeofsimplicity,weomitminthenotationofm i .) Notethatwhennisbigenoughsuchthat n lnn ≥ 8(N+M) Δ 2 min +(1+ 2π 2 3 )N+M,wehave, m i (n)≥ 1 M n− N X i=1 ( 8lnn Δ 2 min,i +1+ 2π 2 3 ) ! ≥⌈ 8lnn Δ 2 min ⌉. (8.42) When (8.42)holds,both(8.12)and(8.20)arefalse. Then∀i∈O ∗ M , whennis large enoughtosatisfy(8.42), E[Q m i (n)] = n X t=N+1 1{I i (t)} = n X t=N+1 (1{I i (t),θ i <θ K }+1{I i (t),θ i >θ K }) ≤ ∞ X t=1 t−1 X m j(t) =1 t−1 X m i =⌈(8lnn)/Δ 2 min ⌉ (P{ ˆ θ j(t),m j(t) ≤θ j(t) −C t,m j(t) }+P{ ˆ θ i,m i ≥θ i +C t,m i }) + ∞ X t=1 t−1 X m i =⌈(8lnn)/Δ 2 min ⌉ t−1 X m h(t) =1 (P{ ˆ θ i,m i ≤θ i −C t,m i }+P{ ˆ θ h(t),m h(t) ≥θ h(t) +C t,m h(t) }) ≤ 1+2 ∞ X t=1 t−1 X m j(t) =1 t−1 X m i =1 2t −4 ≤ 1+ 2π 2 3 . (8.43) 175 Sowhen(8.42)issatisfied,atighterboundfortheregretin(8.34)is: R φ (Θ;n) ≤M X i/ ∈O ∗ M ( 8lnn Δ 2 min,i +1+ 2π 2 3 )θ max +M 2 (1+ 2π 2 3 )θ max +M(M −1)(1+ 2π 2 3 ) X i∈O ∗ M θ i . (8.44) Wecouldalsowriteaconcise(butlooser)upperboundas: R φ (Θ;n)≤M(N −M)( 8lnn Δ min +1+ 2π 2 3 )θ max +M 3 (1+ 2π 2 3 )θ max . (8.45) ComparingTheorem14withTheorem15andTheorem16,ifwedefineC = 8(N+M) Δ 2 min + (1 + 2π 2 3 )N +M, we can see that the regret of the naive policy DLF-Naive grows as O(M 2 (N +M)lnn),whiletheregretoftheDLFpolicygrowsasO(M(N +M 2 )lnn) when n lnn < C, O(M(N −M)lnn) when n lnn ≥ C. So when n is large, the regret of DLFgrowsasO(M(N−M)lnn). Wealsonotethatthefollowingtheoremhasbeenshownin[8]onthelowerboundof regretunderanydistributedpolicy. 176 Theorem 17 (Proposition 1 from [8]). The regret of any distributed policy φ is lower- boundedby R φ (Θ;n)≥ M X m=1 X i/ ∈O ∗ M Δ min,i E[Q m i ]. (8.46) Lai and Robbins [52] showed that for any uniformly good policy, the lower bound of Q m i for a single user i grows as Ω(lnn). 
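Collecting the orders of growth established in this chapter side by side (this is only a restatement of Theorems 13-16 and of the lower bound just discussed, not a new result), with C = 8(N+M)/Δ²_min + (1 + 2π²/3)N + M the threshold defined in (8.39):

```latex
% Regret scalings of the three distributed policies, restated from
% Theorems 13--16, together with the lower bound discussed above.
\begin{align*}
  R^{\text{DLP}}(\Theta;n)        &= O\!\left(M(N+M)\ln n\right),\\
  R^{\text{DLF-Naive}}(\Theta;n)  &= O\!\left(M^{2}(N+M)\ln n\right),\\
  R^{\text{DLF}}(\Theta;n)        &= O\!\left(M(N+M^{2})\ln n\right) &&\text{if } n/\ln n < C,\\
  R^{\text{DLF}}(\Theta;n)        &= O\!\left(M(N-M)\ln n\right)     &&\text{if } n/\ln n \ge C,\\
  R^{\phi}(\Theta;n)              &= \Omega(\ln n)                   &&\text{for any uniformly good distributed policy } \phi.
\end{align*}
```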
So DLF is a decentralized algorithm with a finite-time, order-optimal regret bound for fair access.

8.6 Simulation Results

We present simulation results for the algorithms developed in this work, varying the number of users and channels to verify the performance of our proposed algorithms detailed earlier. In the simulations, we assume channels are in either an idle state (with throughput 1) or a busy state (with throughput 0). The state of each of the N channels evolves as an i.i.d. Bernoulli process across time slots, with the parameter set Θ unknown to the M users. (A short code sketch of this setup appears later in this subsection.)

Figure 8.2 shows the simulation results averaged over 50 runs using the three algorithms, DLP, DLF-Naive, and DLF, and compares their regrets. Figure 8.2(a) shows the simulations for N = 4 channels and M = 2 users, with Θ = (0.9, 0.8, 0.7, 0.6). In Figure 8.2(b), we have N = 5 channels, M = 3 users, and Θ = (0.9, 0.8, 0.7, 0.6, 0.5). Figure 8.2(c) shows the simulations for N = 7 channels and M = 4 users, with Θ = (0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3).

[Figure 8.2: Normalized regret R(n)/ln n vs. n time slots. (a) N = 4 channels, M = 2 secondary users, Θ = (0.9, 0.8, 0.7, 0.6). (b) N = 5 channels, M = 3 secondary users, Θ = (0.9, 0.8, 0.7, 0.6, 0.5). (c) N = 7 channels, M = 4 secondary users, Θ = (0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3).]

[Figure 8.3: Number of times that channel i has been chosen by user m up to time n = 10^6, under (a) the DLP policy, (b) the DLF policy, and (c) the DLF-Naive policy.]

As expected, DLF has the least regret, since one of the key features of DLF is that it does not favor any one user over another. The chance for each user to use any one of the M best channels is the same. DLF utilizes its observations on all the M best channels, and thus makes fewer mistakes while exploring. DLF-Naive not only has the greatest regret, it also uses more storage. DLP has greater regret than DLF since user m has to spend time exploring the M − 1 channels among the M best channels other than its own target channel o*_m. Not only does this result in a loss of reward, it also leads to collisions among users. To show this fact, we present in Figure 8.3 the number of times that each channel is accessed by all M users up to time n = 10^6.

Figure 8.2 also explores the impact of increasing the number of channels N and the number of secondary users M on the regret experienced by the different policies, with the minimum distance between arms Δ_min held fixed.
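A minimal sketch of this setup is given below. The per-user policy objects and their select/update interface stand in for Algorithms 11 and 13 and are placeholders, and the empirical sum of rewards is used in place of the expectation in (8.1), so this is a simplified reproduction of the experiment rather than the exact code behind the figures.

```python
import numpy as np

def run_dmab(theta, M, users, horizon, model="M1", seed=0):
    """Sketch of the Bernoulli setup of Section 8.6: N i.i.d. Bernoulli channels
    with means theta (unknown to the users) and M decentralized users.

    `users` is a list of M policy objects exposing select(t) -> channel and
    update(channel, reward, collided); they stand in for Algorithms 11 (DLP)
    and 13 (DLF) and are placeholders in this sketch, as is the interface."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta, dtype=float)
    genie_rate = np.sort(theta)[-M:].sum()       # sum of the M largest means, as in (8.1)
    cum_reward, regret = 0.0, []
    for t in range(1, horizon + 1):
        x = (rng.random(theta.size) < theta).astype(float)   # realized X_i(t)
        picks = [u.select(t) for u in users]
        counts = np.bincount(picks, minlength=theta.size)
        for j, i in enumerate(picks):
            collided = counts[i] > 1
            if not collided:
                got = x[i]
            elif model == "M2" and picks.index(i) == j:
                got = x[i]          # CSMA with perfect sensing: one collider still wins
            else:
                got = 0.0           # perfect-collision model M1, or a losing collider under M2
            users[j].update(i, got, collided)
            cum_reward += got
        # empirical counterpart of the regret (8.1); the figures average this
        # over independent runs and plot it normalized by ln(n)
        regret.append(genie_rate * t - cum_reward)
    return np.array(regret)
```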
It is clear that as the number of channels and secondary users increases, the regret, as well as the regret gap between the different algorithms, increases.

[Figure 8.4: Comparison of DLF and TDFS. Normalized regret R(n)/ln n vs. n time slots.]

In Figure 8.4, we compare the normalized regret R(n)/ln n of the DLF algorithm and the TDFS algorithm proposed by Liu and Zhao [59,60], in a system with N = 4 channels, M = 2 secondary users, and Θ = (0.9, 0.8, 0.7, 0.6). The results are obtained by averaging 50 runs of up to half a million time slots. We can see that, compared with TDFS, our proposed DLF algorithm not only has a better theoretical upper bound on regret, it also performs better in practice. Moreover, TDFS only works for problems with single-parameterized distributions; we do not have this requirement for DLF. Besides, the storage of TDFS is not polynomial.

8.7 Summary

The problem of distributed multi-armed bandits is a fundamental extension of the classic online learning framework that finds application in the context of opportunistic spectrum access for cognitive radio networks. We have made two key algorithmic contributions to this problem. For the case of prioritized users, we presented DLP, the first distributed policy that yields logarithmic regret over time without prior assumptions about the mean arm rewards. For the case of fair access, we presented DLF, a policy that yields order-optimal regret scaling in terms of the numbers of users and arms, which is also an improvement over prior results. Through simulations, we have further shown that the overall regret is lower for the fair access policy.

Chapter 9
Conclusions and Open Questions

In this dissertation, we have presented a new class of combinatorial multi-armed bandits and policies that exploit the structure of the dependencies to improve the cost of learning, compared to prior work, for large-scale stochastic network optimization problems in the following settings:
• i.i.d. rewards with linear dependencies;
• rested Markovian rewards with linear dependencies;
• restless Markovian rewards with linear dependencies;
• i.i.d. rewards with nonlinear dependencies.

We have shown that the dependencies can be handled tractably with policies that have provably good performance in terms of regret as well as storage and computation. Our proposed novel policies yield regret that grows polynomially in the number of unknown parameters, dramatically improving the cost of learning compared to prior work for large-scale stochastic network optimization problems.

Besides these, we have proposed decentralized online learning algorithms, running at each user, that make a selection among multiple choices for classic multi-armed bandits in both the prioritized and the fair setting.

While the results in this dissertation have provided useful insights into real-world optimization with unknown variables, there are a number of interesting open questions and extension directions to be explored in the future.

• Extensions on linear rewards:
The first open question is that, while we have developed a particular policy, LLR, with a uniform bound on regret that grows as an O(N^4 ln n) function in Chapter 4, it is unclear what the lower bound on regret for this problem is. It could be conjectured to be as low as O(N ln n); if this were true, then it should be possible to develop a more efficient policy than our current LLR policy. The upper bounds on the regret of MLMR (proposed in Chapter 5) and CLRMR (proposed in Chapter 6) could then also be further lowered.
• Nonlinearrewards: Itwouldbeofgreatinteresttoseeifitispossibletoalsotacklemoregeneralnon- linear reward functions beyond the work in Chapter 7, at least in structured cases thathaveprovedtobetractableindeterministicsettings,suchasconvexfunctions. 183 • Multidimensionalrewards: There are many problems in communication networks that multiple performance objectives(e.g.,delay,throughput,reliability,energy)areneededtobesatisfied. It is of interested to find a point on the Pareto-optimal boundary set for such prob- leminsteadofasingleoptimalsolution. So, apossiblewayistoconsiderrewards obtained at each time as a vector, and then maximize one of these rewards in ex- pectationsubjecttoaverageconstraintsontheotherelementsoftherewardvector. An open problem here is to find out if it is possible to solve such a problem by usingaconvexlinearcombinationoftherewardsasafeedbackforthelearning. • Strictregretinspecialcases: In this dissertation, we have derived results for combinatorial MAB with restless andrestedMarkovianrewardswithrespecttotheregretdefinedbytheoptimalarm asastaticarm. Inotherwork[26]and[66],wehavederivedresultsfortheoptimal policyforatwo-stateBayesianrestlessmulti-armedbanditwithidenticalandnon- identicalarmswithrespecttothetrueregretwherethegeniewillswitcharmswhen playingoptimally. Developingefficientlearningpoliciesforothercasesremainsan openproblem. Analternative,possiblymoreefficient,approachtoonlinelearning inthesekindsofproblemsmightbetousethehistoricalobservationsofeacharmto estimatetheP matrix, and usetheseestimatesiterativelyin makingarm selection decisionsateachtime. Itis,however,unclearatpresenthowtoproveregretbounds usingsuchaniterativeestimationapproach. 184 Bibliography [1] E. Moulines A. Alaya-Feki, B. Sayrac and A. LeCornec. On the combinatorial multi-armedbanditproblemwithmarkovianrewards. InIEEEGlobalTelecommu- nicationsConference(GLOBECOM), pages1–5,December2008. [2] J. Abernethy, E. Hazan, and A. Rakhlin. Competing in the dark: an efficient algo- rithm for bandit linear optimization. In the 21st Annual Conference on Learning Theory(COLT), pages263–274,July2008. [3] R. Agrawal. Sample mean based index policies witho(logn) regret for the multi- armedbanditproblem. Advances in AppliedProbability,27(4):1054–1078,1995. [4] R. Agrawal, M. V. Hegde, and D. Teneketzis. Multi-armed bandit problems with multipleplays and switchingcost. Stochastics and StochasticReports, 29(4):437– 459,1990. [5] S. Ahmad, M. Liu, T. Javidi, Q. Zhao, and B. Krishnamachari. Optimality of my- opic sensing in multi-channel opportunistic access. IEEE Transactions on Infor- mationTheory,55(9):4040–4050,2009. [6] R. K. Ahuja, T. L. Magnanti, and J. Orlin. Network Flows: Theory, Algorithms, andApplications. PrenticeHall,1993. [7] A. Alaya-Feki, E. Moulines, and A. LeCornec. Dynamic spectrum access with non-stationary multi-armed bandit. In IEEE 9th Workshop on Signal Processing Advancesin WirelessCommunications(SPAWC),pages416–420,July2008. [8] A. Anandkumar, N. Michael, A. Tang, and A. Swami. Distributed learning and allocation of cognitive users with logarithmic regret. IEEE Journal on Selected Areasin Communications,29(4):731–745,2011. [9] A. Anandkumar, N. Michael, and A. K. Tang. Opportunisticspectrum access with multiple users: learning under competition. In IEEE International Conference on ComputerCommunications(INFOCOM), pages1–9,March2010. [10] V. Anantharam, P. Varaiya, and J . Walrand. 
Asymptotically efficient allocation rules for the multiarmed bandit problem with multiple plays-part ii: Markovian rewards. IEEE Transactionson AutomaticControl,32(11):977–982,1987. 185 [11] V. Anantharam, P. Varaiya, and J. Walrand. Asymptotically efficient allocation rules for the multiarmed bandit problem with multiple plays-part i: i.i.d. rewards. IEEETransactionsonAutomaticControl,32(11):968–976,1987. [12] P. Auer. Using confidence bounds for exploitation-exploration trade-offs. The Journalof MachineLearningResearch,3:397–422,2003. [13] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed banditproblem. MachineLearning,47(2):235–256,2002. [14] P. Auer, N.Cesa-Bianchi, Y.Freund, and R. E.Schapire. Thenonstochasticmulti- armedbanditproblem. SIAM JournalonComputing,32(1):48–77,2002. [15] B. Awerbuch and R. D. Kleinberg. Adaptive routing with end-to-end feedback: Distributed learning and geometric approaches. In the 36th Annual ACM Sympo- siumonTheory ofComputing(STOC), pages45–53,June2004. [16] H. Balakrishnan, C. L. Barrett, V. S. A. Kumar, M. V. Marathe, and S. Thite. The distance-2 matching problem and its relationship to the mac-layer capacity ofad hocwireless networks. IEEE Journal on Selected Areas in Communications, 22(6):1069–1079,2004. [17] M. S. Bazaraa, J. J. Jarvis, and H. D. Sherali. Linear Programming and Network Flows(4th Edition). Wiley,2009. [18] R. Bellman. On a routing problem. Quarterly of Applied Mathematics, 16(1958):87–90,1958. [19] A.Bhorkar,M.Naghshvar,T.Javidi,andB.Rao. Exploringandexploitingrouting opportunities in wireless ad-hoc networks. In IEEE Conference on Decision and Control(CDC), pages4834–4839,December2009. [20] P. Bremaud. Markov Chains: Gibbs Fields, Monte Carlo Simulation,and Queues. Springer,1998. [21] A.Brzezinski,G.Zussman,andE.Modiano. Enablingdistributedthroughputmax- imization in wireless mesh networks: a partitioning approach. In the 12th annual international conference on Mobile computing and networking (MobiCom), pages 26–37,September2006. [22] A. Cayley. A theorem on trees. Quarterly Journal of Mathematics, 23:376–378, 1889. [23] Y. Chen, Q. Zhao, V. Krishnamurthy,and D. Djonin. Transmissionscheduling for optimizing sensor network lifetime: A stochastic shortest path approach. IEEE Transactionson SignalProcessing,55(5):2294–2309,2007. 186 [24] T.CoverandJ.Thomas. ElementsofInformationTheory. NewYork: Wiley,1991. [25] W. Dai, Y. Gai, and B. Krishnamachari. Efficient onlinelearning for opportunistic spectrumaccess. InIEEEInternationalConferenceonComputerCommunications (INFOCOM), pages3086–3090,March2012. [26] W.Dai,Y. Gai,B. Krishnamachari,andQ.Zhao. Thenon-bayesianrestlessmulti- armedbandit: acaseofnear-logarithmicregret. In IEEE InternationalConference on Acoustics, Speech and Signal Processing (ICASSP), pages 2940–2943, May 2011. [27] V.Dani,T.P.Hayes,andS.M.Kakade. Stochasticlinearoptimizationunderbandit feedback. In the21stAnnualConference onLearning Theory(COLT), pages355– 366,July2008. [28] P. Diaconis and L. Saloff-Coste. Nash inequalities for finite markov chains. Jour- nalofTheoretical Probability,9(2):459–510,1996. [29] E. W. Dijkstra. A note on two problems in connexion with graphs. Numerische Mathematik,1(1):269–271,1959. [30] V. Dyo and C. Mascolo. A node discoveryservice for partially mobilesensor net- works. Inthe2ndinternationalworkshoponMiddlewareforsensornetworks(Mid- Sens),pages13–18,Novermber2007. [31] J. Edmonds. Paths, trees, and flowers. Canadian Journal of Mathematics, 17(3):449–467,1965. 
Asset Metadata
Creator: Gai, Yi (author)
Core Title: Online learning algorithms for network optimization with unknown variables
School: Viterbi School of Engineering
Degree: Doctor of Philosophy
Degree Program: Electrical Engineering
Publication Date: 11/26/2012
Defense Date: 10/17/2012
Publisher: University of Southern California (original), University of Southern California. Libraries (digital)
Tags: algorithm design and analysis, machine learning, network optimization, OAI-PMH Harvest, online learning
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisor: Krishnamachari, Bhaskar (committee chair), Dughmi, Shaddin (committee member), Jain, Rahul (committee member)
Creator Email: ygai@usc.edu, yigaiee@gmail.com
Permanent Link (DOI): https://doi.org/10.25549/usctheses-c3-120026
Unique Identifier: UC11291530
Identifier: usctheses-c3-120026 (legacy record id)
Legacy Identifier: etd-GaiYi-1348.pdf
Dmrecord: 120026
Document Type: Dissertation
Rights: Gai, Yi
Type: texts
Source: University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA