SOME BANDIT PROBLEMS

by

Zhengyu Zhang

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(INDUSTRIAL AND SYSTEMS ENGINEERING)

May 2021

Copyright 2021 Zhengyu Zhang

Acknowledgements

I would like to first express my sincere appreciation to my advisor and committee chair, Professor Sheldon Ross, who continuously provided guidance and encouragement and was always willing and patient to help me throughout my PhD journey. Without his persistent support this dissertation would not have been possible.

I am sincerely thankful to Professor John Carlsson, Professor Meisam Razaviyayn and Professor Leana Golubchik for serving on my defense committee and providing insightful feedback on my thesis, and to Professor Sze-chuan Suen for serving on my qualifying exam committee. I would also like to thank the Daniel J. Epstein Department of Industrial and Systems Engineering, its faculty and staff, for their support.

I want to say thank you to all my friends and peers: Yang Cao, Jiachuan Chen, Yunan Zhou, Shuotao Diao, Ziyu He, Tianjian Huang, Mingdong Lyu, Peng Dai, Yuanxiang Wang, Maximilian Zellner, He Luan and Yuan Jin. It is you who made my days at USC more memorable and joyful.

Last but not least, deepest thanks to my parents and Yue for their encouragement, support and love.

Table of Contents

Acknowledgements
List of Tables
List of Figures
Abstract
Chapter 1: Introduction
  1.1 An Infinitely Many Armed Bandit Problem
  1.2 A Dueling Bandit Problem
  1.3 Estimating the Strengths of the Bradley-Terry Model
  1.4 Thesis Outline
Chapter 2: Literature Review
Chapter 3: An Infinitely Many Armed Bandit Problem
  3.1 Problem Formulation
  3.2 Optimal Policy
    3.2.1 Optimal Acceptance Rule
    3.2.2 Optimal Sampling Rule
  3.3 n-Failure Policies
    3.3.1 Optimal n-Failure Policy
    3.3.2 0-Failure Policy
    3.3.3 Arbitrary n-Failure Threshold Policy
    3.3.4 Additional Properties
  3.4 A Policy Improvement Strategy
  3.5 Numerical Results
Chapter 4: A Dueling Bandit Problem
  4.1 Minimizing the Weak Regret
    4.1.1 Beat the Winner
    4.1.2 Modified Beat the Winner
    4.1.3 Numerical Examples
    4.1.4 A Revisit to WS-W
  4.2 Minimizing the Strong Regret
    4.2.1 The Sampling Approach
    4.2.2 Numerical Experiments
  4.3 Finding the Best Dueler with Fixed Confidence
    4.3.1 A Gambler's Ruin Rule
    4.3.2 Modified Gambler's Ruin Rule
Chapter 5: Estimating the Strengths of the Bradley-Terry Model
  5.1 Problem Formulation
  5.2 The Estimators
  5.3 The Simulation Estimators
    5.3.1 The Simulation Estimators of Strengths
    5.3.2 The Simulation Estimators of Probabilities
  5.4 Numerical Examples
Chapter 6: Conclusions
Bibliography
Appendix A
Appendix B

List of Tables

3.1 The expected number of plays and threshold vector (k_1, ..., k_n) for the optimal n-failure threshold policy π*_n with (ε, δ) = (0.05, 0.05)
3.2 The expected number of plays and threshold vector (k'_1, ..., k'_n) for the n-failure threshold policy π'_n derived by M_1(π) with (ε, δ) = (0.05, 0.05)
3.3 Results of applying M_1^∞(π) to π*_i, i = 0, 1, ..., 4; statistics over 10^6 repeated runs, s.d. = standard deviation
3.4 Comparison between policy M_1^∞(π_0) and KL-LUCB, where TrBeta(α, β) is the truncated Beta distribution with support on (0, 0.95)
4.1 Numerical example of MGRR with 1000 replications
5.1 Numerical example with strengths generated from exponential(1) with 1000 replications
5.2 Numerical example with strengths generated from uniform(0, 1) with 1000 replications
5.3 Example of estimating probabilities of having the largest strength with 1000 replications

List of Figures

3.1 Visualization of the decision region for the optimal policy when the current arm has record (s, k)
4.1 Experiments with 100 arms. (a) Cumulative regret over 100 replications; (b) standard deviation of cumulative regret at T = 10^5
4.2 Experiments with 1000 arms. (a) Cumulative regret over 100 replications; (b) standard deviation of cumulative regret at T = 10^6
4.3 Experiments with 5 arms on exponential(1) strengths; 200 replications
4.4 Experiments with 5 arms on uniform(0, 1) strengths; 200 replications

Abstract

This dissertation mainly studies variants of the multi-armed bandit problem. A general framework of multi-armed bandits can be described as follows. Suppose there is a set of finitely or infinitely many arms. At each step, the learner samples one or a few arms and then observes a reward generated from the underlying distribution associated with the sampled arms.
One popular objective is to minimize the cumulative regret, which is the expected sum of differences between the rewards actually obtained and the rewards from taking the optimal action. The other popular goal is best arm identification, which aims at finding the best arm with high confidence after a certain number of plays. The problem arises naturally in domains that must trade off exploration against exploitation, such as online recommendation, search engines and crowdsourcing.

We first look at an infinitely many armed bandit problem. Suppose there is an infinite set of arms and each arm has independent Bernoulli rewards with unknown mean. With the goal of identifying an arm for which the posterior probability of its mean being at least 1 − ε is at least 1 − δ, we want to minimize the expected number of plays until such an arm is identified. We show that there is an optimal policy that never plays a previously discarded arm and that such optimal policies have a threshold structure. We propose a heuristic policy that limits the number of failures allowed for each arm, and we develop a policy improvement strategy that can improve upon an arbitrary policy.

We then consider the dueling bandit problem. Instead of playing a single arm, the learner draws a pair of arms at each time step and learns a noisy pairwise preference. Our first goal is to minimize the cumulative binary weak regret, that is, the total number of plays in which the best arm is not involved. Assuming the existence of a Condorcet winner, we propose an algorithm with a theoretical guarantee of finite regret over an infinite time horizon and develop an improved algorithm with better empirical performance. We then consider the objective of minimizing the cumulative binary strong regret. We design a Thompson sampling approach that determines the next pair based on sampled strengths, and we propose to sample from the posterior distribution with an MCMC approach. Finally, we study the objective of identifying the best arm with fixed confidence using the fewest plays, employing a knockout tournament structure with the winner of each duel determined by a gambler's ruin rule.

We also consider the problem of estimating the strengths of the Bradley-Terry model and estimating the probabilities that a given player has the largest strength. We take a Bayesian approach that utilizes simulation estimators, and develop two efficient variance reduction techniques to speed up the simulation.

Chapter 1: Introduction

The multi-armed bandit problem is a classical sequential decision model addressing the trade-off between exploration and exploitation (Berry and Fristedt 1985; Lai and Robbins 1985). In the classical bandit model, there is a set of arms, each with an unknown reward distribution. Upon playing an arm, the learner observes a reward sampled from its corresponding distribution. The learner aims to find an allocation strategy, or policy, which specifies how to adaptively sample arms based on past observations, so as to achieve some objective. Typical goals include minimizing the cumulative regret (the expected difference between the sum of rewards that could have been received by always playing the best arm and the sum of rewards actually received), finding the best arm (declaring one of the arms to be the best after a certain fixed number of plays), and so on. The multi-armed bandit problem has attracted a lot of attention in recent years due to its various applications.
For example, in medical treatment design (Durand et al. 2018), the actions refer to the different treatment options, and the reward is the outcome of the applied treatment (either success or failure). In such scenarios, the learner must weigh using a known drug (exploitation) against trying new drugs (exploration). Other practical applications range from recommendation systems (Kohli, Salek, and Stoddard 2016) and anomaly detection (Ding, Li, and H. Liu 2019) to online advertisement (Chakrabarti et al. 2009).

This thesis mainly focuses on two variants of the multi-armed bandit problem: an infinitely many armed bandit problem and a dueling bandit problem. In addition, we consider a related problem of estimating the strengths of the Bradley-Terry model.

1.1 An Infinitely Many Armed Bandit Problem

We consider an infinitely many armed bandit problem with Bernoulli rewards. We assume that the means of the arms are independently generated from a known prior distribution. For specified values ε and δ, the objective is to find the policy that minimizes the expected number of plays until we have identified an arm for which the posterior probability of its mean being at least 1 − ε is at least 1 − δ.

We decompose a policy into two parts: the acceptance rule, which tells the player when to accept an arm, and the sampling rule, which describes how to sequentially sample arms based on past observations. Say that an arm has record (s, k) if it has been played s + k times with s successes resulting. We first show that there exists a nondecreasing sequence s_k, k ≥ 0, such that it is optimal to accept an arm having record (s, k) if and only if s ≥ s_k, where the sequence s_k can be explicitly determined. Under this acceptance rule, we show that there is an optimal sampling rule that 1) never plays a previously used arm that is not the last arm played, and 2) continues to play the current arm having record (s, k) whenever s ≥ m_k, for some nondecreasing sequence m_k, k ≥ 1.

Since we are not able to determine the optimal thresholds m_k, k ≥ 1, either analytically or numerically, we consider an approximation of the optimal sampling policy, named the n-failure policy, obtained by limiting to n the number of failures before an arm must be abandoned. We show that the optimal sampling policy of this type has a threshold structure. We then derive the mean number of plays taken by an arbitrary n-failure threshold policy, which lets us identify the optimal threshold vector through exhaustive search. We also present several useful properties of n-failure threshold policies.

To improve a given sampling policy, we propose a policy improvement strategy. Specifically, whenever the original policy calls for switching to a new arm, we imagine giving the current arm an additional chance. If continuing with the current arm until the next failure or until the acceptance region is reached is better than switching, then we adjust our decision and stay with the current arm. We show that the resulting new policy always performs at least as well as the original policy. Numerical examples are given for several sampling policies, including the optimal n-failure policy, the n-failure policy derived by the policy improvement strategy, and the infinite-failure policy derived by the policy improvement strategy.

1.2 A Dueling Bandit Problem

In the classical multi-armed bandit problem, it is assumed that the learner plays one arm at each time step and observes a reward, which typically provides information about that arm.
An extension of this assumption is that at each time step two arms are chosen to play a game, and the learner observes the winner of the game. That is, instead of collecting information about a single arm, the learner observes relative information regarding the two arms. Compared with the classical multi-armed bandit, the dueling bandit weakens the requirement that a real-valued reward be available and thus can handle implicit and qualitative feedback.

In this problem, we suppose there is a set of n arms and at every stage two are chosen to play a game, with the result of the game learned. We consider three different objectives: 1) minimizing the cumulative weak regret; 2) minimizing the strong regret; and 3) finding the best arm with a fixed confidence level.

We first consider the objective of minimizing the cumulative weak regret. The binary weak regret r(t) at time period t is r(t) = 0 if the best arm is one of the selected arms and r(t) = 1 otherwise. We suppose there is a Condorcet winner that wins each game it plays with probability at least p > 1/2, with the value of p unknown. The objective is to choose duelers so as to maximize the number of times that one of the competitors is the best arm. We propose a policy, named Beat the Winner (BTW), and show that the expected total number of games not involving the best player under BTW is upper bounded by O(n^2), where n is the total number of arms. We also propose an improved version, named Modified Beat the Winner (MBTW), and show that it has better empirical performance.

We then consider two notions of strong regret. Under strong regret 1, the player avoids regret only if the top two arms are selected simultaneously. Under strong regret 2, the player wants to maximize the number of times that the best arm is chosen as both dueling arms. We suppose that arm i has unknown value v_i, i = 1, ..., n, and that arm i beats j with probability v_i/(v_i + v_j). We propose a Thompson sampling type algorithm and empirically show that its performance matches the state of the art.

Under the objective of finding the best arm with a fixed confidence level, we again assume that there is a Condorcet winner that wins all its matches with probability at least p > 0.5, but now with the value of p known. Our objective is to find the best arm, namely the Condorcet winner, with probability at least 1 − δ using as few plays as possible. By adopting a knockout tournament structure, we propose two policies that determine the winner of each duel by the gambler's ruin rule, and we provide upper bounds on the total number of plays for both policies.

1.3 Estimating the Strengths of the Bradley-Terry Model

Suppose there is a set of n players and each player has a value, with v_i being the value of player i, i = 1, ..., n. We further suppose that the results of all games are independent and that a game between players i and j is won by player i with probability v_i/(v_i + v_j). This model is well known as the Bradley-Terry model. The values are assumed unknown, and the problem is to use the data w_{i,j}, i ≠ j, to estimate these values and to obtain a feel for how likely it is that v_i is the largest of them, where w_{i,j} is the number of games between i and j that have been won by i.

1.4 Thesis Outline

The thesis is organized as follows. Chapter 2 reviews the literature on multi-armed bandit problems, with emphasis on the infinitely many armed bandit problem, the dueling bandit problem, and the best arm identification problem.
In Chapter 3, we present the infinitely many armed bandit problem with Bernoulli rewards. In Chapter 4, we turn our attention to the dueling bandit problem; we consider three sub-problems, with objectives of minimizing the cumulative weak regret, minimizing the cumulative strong regret, and finding the best dueler with fixed confidence. In Chapter 5, we present the simulation technique for estimating the strengths of the Bradley-Terry model. Conclusions are given in Chapter 6.

Chapter 2: Literature Review

One popular objective of the classical multi-armed bandit (MAB) is best arm identification, where the player seeks the optimal arm (with high probability) at the end of an exploration phase. This problem is motivated by scenarios where the learner aims at finding the optimal arm instead of maximizing the cumulative rewards obtained (Audibert and Bubeck 2010; S. Chen et al. 2014). Best arm identification has been extensively studied when n, the number of arms, is finite. There are two typical but distinct settings in the literature: the fixed budget setting and the fixed confidence setting.

In the fixed budget setting, the player can use only up to a fixed number of plays and the objective is to maximize the probability of finding the best arm. Audibert and Bubeck 2010 proposed an elimination algorithm, named Successive Rejects, which splits the budget into multiple rounds and discards an arm each round; they showed that the algorithm has a near-optimal probability of identifying the best arm. Bubeck, T. Wang, and Viswanathan 2013 extended the problem so that the objective is to identify the top m arms, proposing an algorithm that also relies on successive elimination of the seemingly bad arms. Gabillon et al. 2011 considered the multi-bandit problem where the goal is to identify the best arm in each of the bandits; they designed an algorithm that focuses on the arm whose mean is close to the mean of the best arm in the same bandit and showed an upper bound on the probability of error.

In the fixed confidence setting, the objective is to minimize the number of plays required until the player finds the desired arm with a fixed confidence level. The formulation was first introduced by Even-Dar, Mannor, and Mansour 2006, who developed an elimination method and provided an upper bound on the number of plays until an ε-optimal bandit is found. The problem was further extended by Kalyanakrishnan and Stone 2010 to m-best arm identification with a given confidence.

The classical MAB assumes that a finite number of arms is presented to the learner. In many cases, however, the learner may face a finite but extremely large number of arms; it is then almost impossible to play every arm even once, and the assumption of infinitely many arms becomes necessary. The infinitely many armed bandit problem was first considered by Berry and Fristedt 1985 under the objective of minimizing the cumulative regret with Bernoulli rewards. They showed that, under the assumption of a uniform prior on the means of the arms, the best achievable lower bound is √(2T), where T is the time span, and they exhibited a class of policies that achieve the lower bound by utilizing the number of success runs. Bonald and Proutiere 2013 presented a two-target algorithm that also achieves the lower bound, and further showed that their algorithm remains optimal for Bernoulli rewards with arbitrary priors. Y. Wang, Audibert, and Munos 2009 studied the problem under arbitrary bounded rewards and proposed a confidence bound algorithm.
Under the pure-exploration objective, the bandit problem with infinitely many arms has also attracted much attention in recent years. Carpentier and Valko 2015 studied such a problem under the fixed budget setting; assuming that both the reward distribution and the mean-reward distribution satisfy a specified regularity condition, they proposed a confidence-interval type algorithm and showed that it is optimal up to a constant factor. Aziz et al. 2018 adopted the fixed confidence setting under the same assumptions on the reward and mean-reward distributions; they proposed a two-phase algorithm in which a number of arms are initially drawn and then a best arm identification algorithm is run over the selected arms. Chandrasekaran and Karp 2014 supposed that each arm has mean p_1 with probability α or mean p_2 with probability 1 − α, where 0 < p_2 < p_1 < 1. Assuming that p_1, p_2 and α are known, and that the objective is to minimize the mean number of plays until identifying an arm for which the posterior probability of its mean being p_1 is at least 1 − δ, they developed a policy based on likelihood ratio information and showed that it is optimal for any starting history.

In this thesis, we also consider another variant of the multi-armed bandit, the dueling bandit problem. Instead of playing one single arm, the action is to compare a pair of arms, and the player observes which of the two arms is preferred. The problem arises naturally in domains where feedback takes the form of pairwise comparisons, such as information retrieval (Radlinski, Kurup, and Joachims 2008), online ranking (X. Chen et al. 2013) and crowdsourcing services (Guo et al. 2012).

The dueling bandit was originally raised by Yue and Joachims 2009 and has primarily been studied under the strong regret criterion. Various definitions of the best arm have been considered. Early algorithms, such as IF (Yue and Joachims 2009) and BTM (Yue and Joachims 2011), assumed that i is the winner over j with unknown probability p_{i,j}, i ≠ j, and supposed that the arms are totally ordered, in that for some permutation i_1, ..., i_n of 1, ..., n, p_{i_j, i_k} > 0.5 when j > k. Both algorithms adopted an exploration-then-exploitation strategy to control the regret. As an extension, the Condorcet winner model only assumes that there exists an arm that beats every other arm with probability greater than 0.5. Zoghi, Whiteson, Munos, et al. 2014 proposed the RUCB algorithm by adopting the UCB framework and provided the theoretical guarantee that the cumulative regret is upper bounded by O(n^2 log m), where m is the time horizon. The later mergeRUCB of Zoghi, Whiteson, and Rijke 2015 further tightened the regret upper bound to O(n log m). Other algorithms, including RMED (Komiyama, Honda, Kashima, et al. 2015) and WS-S (B. Chen and Frazier 2017), also achieve an O(n log m) regret upper bound. Beyond the Condorcet winner setting, the more general assumption of a Copeland winner, the arm that beats the most other arms, has also attracted attention recently. Zoghi, Karnin, et al. 2015 presented two algorithms, which successively eliminate non-Copeland winners, for small-scale and large-scale Copeland dueling bandit problems. Komiyama, Honda, and Nakagawa 2016 proposed the CW-RMED algorithm for the Copeland dueling bandit problem and derived an asymptotically optimal regret bound.

The dueling bandit under the weak regret criterion has also been considered previously (Yue and Joachims 2009; B. Chen and Frazier 2017).
To the best of our knowledge, the recent work by B. Chen and Frazier 2017 is the only paper that studied the weak regret and designed a specific algorithm (called WS-W) for it. A brief description of WS-W is as follows. Let the score of each arm be the number of wins minus the number of losses in all games that arm has played. Round k+1, k ≥ 0, begins with one arm having score (n−1)k and all others having score −k. The arm with score (n−1)k plays a randomly chosen one of the other arms a series of games that ends when one of their scores is −k−1. At that point the arm with score −k−1 stops playing in round k+1 and the other plays a randomly chosen one of the remaining arms until one of their scores hits −k−1, and so on. It was shown in B. Chen and Frazier 2017 that WS-W has an O(n^2) regret bound under the Condorcet winner setting and O(n log n) if the arms are totally ordered.

Another popular objective of the dueling bandit is ranking and identifying the best arm. Yue and Joachims 2011 proposed a PAC maximum selection algorithm when the arms are totally ordered. Under the assumptions of strong stochastic transitivity and the stochastic triangle inequality, Falahatgar, Orlitsky, et al. 2017 developed a knockout tournament algorithm to identify the best dueler; the algorithm proceeds in rounds with arms randomly paired up, eliminating half of the arms each round. Under the assumption of strong stochastic transitivity, Falahatgar, Hao, et al. 2017 designed a sequential elimination algorithm for best arm identification. Mohajer, Suh, and Elmahdy 2017 studied the top-k ranking problem when the arms are totally ordered.

In this thesis, we also consider the Bradley-Terry model, which was originally proposed by Zermelo 1929 and later popularized by Bradley and Terry 1952. It has multiple applications in analyzing sports data. For example, the World Chess Federation and the European Go Federation successfully adopted a similar model for ranking players (Hastie and Tibshirani 1998). Cattelan, Varin, and Firth 2013 developed a dynamic model to analyze the abilities of teams in sports tournaments. McHale and Morton 2011 presented a Bradley-Terry model for forecasting match results in men's tennis. One way of estimating the parameters of the Bradley-Terry model is maximum likelihood estimation (MLE); Hunter et al. 2004 proposed an efficient iterative optimization algorithm by adopting the minorization-maximization framework. In recent years, Bayesian inference has also been considered as an alternative approach. Guiver and Snelson 2009 presented an Expectation-Propagation method to quickly compute an approximation of the posterior that is potentially scalable to large real-world problems. Caron and Doucet 2012 proposed a simple Gibbs sampler for Bayesian inference based on a suitable set of latent variables.

Chapter 3: An Infinitely Many Armed Bandit Problem

We consider an infinite collection of Bernoulli arms whose means are independently generated from a known prior distribution. For specified values ε and δ, the objective is to find the policy that minimizes the expected number of plays until we have identified an arm for which the posterior probability of its mean being at least 1 − ε is at least 1 − δ. We first present the structure of the optimal policy. We then propose a heuristic policy and develop a policy improvement technique.

3.1 Problem Formulation

We consider a fixed-confidence pure-exploration bandit problem with infinitely many arms. At each time step, the player chooses an arm to play based on past observations and obtains a reward.
The selected arm can be either a new arm or one that has been played previously. The sequence of rewards for arm i, i = 1, 2, ..., is modeled by independent and identically distributed Bernoulli random variables with unknown mean p_i. We further assume that the mean rewards p_i are the values of independent random variables from a known distribution with density f. The player keeps sampling arms until an arm i is identified for which the posterior probability of its mean being at least 1 − ε is at least 1 − δ; that is,

P(p_i ≥ 1 − ε | observations) ≥ 1 − δ.

The objective is to find a policy that minimizes the expected number of plays taken.

3.2 Optimal Policy

In this section, we present the optimal policy. We characterize a policy from two perspectives:

1. Under what conditions an arm can be accepted with the guarantee that the posterior probability of its mean being at least 1 − ε is at least 1 − δ.
2. How to adaptively play arms based on past observations.

That is, a policy is composed of an acceptance rule, which ends the problem with an arm accepted, and a sampling rule, which describes how to sequentially sample arms. In the following, we first present the optimal acceptance rule and then describe the optimal sampling rule under the proposed acceptance rule. For convenience, we say a success (failure) is observed if the reward is 1 (0). Also, say that an arm has record (s, k) if it has been played s + k times with s successes resulting.

3.2.1 Optimal Acceptance Rule

Recall that the mean rewards of the arms are assumed independently generated from a known distribution with density f. Thus, we let

f_{s,k}(p) = C_{s,k} p^s (1 − p)^k f(p),  0 < p < 1,   (3.1)

be the posterior density of an arm with record (s, k), where C_{s,k} is the normalizing constant. Let X_{s,k} be a random variable with density f_{s,k}.

Definition 1 (Acceptance rule). An arm can be accepted if and only if its record (s, k) ∈ R, where

R = {(s, k) : P(X_{s,k} ≥ 1 − ε) ≥ 1 − δ}.

We claim that the above acceptance rule is optimal. To see this, notice that an arm can satisfy the requirement only if its record lies in R. Meanwhile, as the objective is to minimize the number of plays, one should accept the arm immediately once its record is contained in R. Therefore, the above acceptance rule is optimal. We can simplify the acceptance rule further by employing the following lemma.

Lemma 2. X_{s,k} is likelihood ratio increasing in s and decreasing in k.

Proof. Recall that X_{s,k} has density f_{s,k}(p) = C_{s,k} p^s (1 − p)^k f(p), 0 < p < 1. We have

f_{s,k+1}(p) / f_{s,k}(p) = (C_{s,k+1} / C_{s,k}) (1 − p)

and

f_{s+1,k}(p) / f_{s,k}(p) = (C_{s+1,k} / C_{s,k}) p.

Hence, X_{s,k} is likelihood ratio increasing in s and decreasing in k.

As likelihood ratio ordering implies stochastic ordering, it follows that X_{s,k} is stochastically increasing in s and decreasing in k. Hence, P(X_{s,k} ≥ 1 − ε) is monotone increasing in s. As a result, the proposed acceptance rule is equivalent to the following.

Definition 3 (Acceptance rule). An arm can be accepted if and only if its record (s, k) ∈ R, with

R = {(s, k) : s ≥ s_k},  where s_k = min{s : P(X_{s,k} ≥ 1 − ε) ≥ 1 − δ}.

Proposition 4. s_k is nondecreasing in k, k = 0, 1, ....

Proof. This follows because P(X_{s,k} ≥ 1 − ε) ≥ P(X_{s,k+1} ≥ 1 − ε).

Because s_k is nondecreasing in k and the arm is accepted immediately once its record falls in R, the accepted arm must have record (s_k, k) for some k ≥ 0. From now on, we restrict ourselves to policies that accept an arm if and only if the conditions of Definition 3 are satisfied. We turn our attention to the optimal sampling rule under this acceptance rule.
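As a concrete illustration of the acceptance rule, the following minimal sketch computes the thresholds s_k of Definition 3 numerically. It is not part of the original text: it assumes a Beta(a, b) prior, so that an arm with record (s, k) has posterior Beta(a + s, b + k), and it fixes ε = δ = 0.05; the function and variable names are ours.

```python
from scipy.stats import beta

EPS, DELTA = 0.05, 0.05  # accept when P(mean >= 1 - EPS | record) >= 1 - DELTA

def acceptance_threshold(k, a=1.0, b=1.0):
    """s_k = min{s : P(X_{s,k} >= 1 - EPS) >= 1 - DELTA} under a Beta(a, b) prior.
    An arm with record (s, k) then has posterior Beta(a + s, b + k), so the
    posterior tail probability is the Beta survival function at 1 - EPS."""
    s = 0
    while beta.sf(1 - EPS, a + s, b + k) < 1 - DELTA:
        s += 1  # the tail probability is nondecreasing in s (Lemma 2)
    return s

# First few acceptance thresholds s_0, s_1, ... for the uniform (Beta(1, 1)) prior
print([acceptance_threshold(k) for k in range(5)])
```

Because P(X_{s,k} ≥ 1 − ε) is nondecreasing in s and decreasing in k, the linear scan terminates and the resulting sequence s_k is nondecreasing, in line with Proposition 4.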
3.2.2 Optimal Sampling Rule

Suppose that the player has observed some data and currently no arm satisfies the acceptance conditions. To continue, the player must decide which arm to play next. Call the arm that was played in the last time slot the current arm, an arm that was played previously but not in the last time slot a discarded arm, and an arm that has never been played a new arm. The actions available at that moment fall into three categories: 1) continue to play the current arm; 2) play one of the discarded arms; 3) play a new arm. The following Lemmas 5-7 reveal the structure of the optimal sampling policy.

Lemma 5. There exists an optimal policy that never switches to a discarded arm.

Proof. We prove Lemma 5 by induction on the total number of times that the player is allowed to switch to a discarded arm. Suppose first that the player is allowed to switch to a discarded arm only once. Consider the moment t at which it is optimal to switch to a previously discarded arm, say arm c. Let E_1 denote the minimal expected additional number of plays if we switch back to c, and E_2 the minimal expected additional number of plays if we play a new arm. Because it is optimal to switch back, E_1 ≤ E_2. On the other hand, consider the moment t' at which arm c was played for the last time before time t. At moment t', let E'_1 = E_1 denote the minimal expected additional number of plays if we continue with c and also agree never to switch back to a discarded arm, and let E'_2 be the minimal expected additional number of plays if we play a new arm. As it was optimal to play a new arm at t', it follows that E'_2 ≤ E'_1. Also, because t' < t, there would be more options for a later switch back if we choose a new arm at time t than if we choose a new arm at time t', implying that E_2 ≤ E'_2. Combining these inequalities gives E_1 ≤ E_2 ≤ E'_2 ≤ E'_1 = E_1. Therefore E_1 = E_2, showing that there is an optimal policy that never switches back to a discarded arm.

Now suppose that a policy is allowed to switch to a discarded arm up to n times. The same argument shows that one can do as well if only n − 1, rather than n, switch-backs are allowed. Continuing in this way proves that there is an optimal policy that never switches to a discarded arm.

Lemma 5 implies that we can restrict ourselves to the class of sampling rules whose action is either to play the current arm or to choose a new arm. Therefore, the decision can be based only on the record of the current arm, and we can formulate the problem as a dynamic program whose state is the record of the current arm. Let V(s, k) denote the minimal expected additional number of plays when the record of the current arm is (s, k). If the arm cannot be accepted at state (s, k), the player must decide either to play the current arm or to choose a new arm. If the player continues with the current arm, the minimal expected additional number of plays is

1 + p(s, k) V(s+1, k) + (1 − p(s, k)) V(s, k+1),

where p(s, k) is the probability that the next play returns a success given the current record (s, k). Otherwise, the player uses a new arm, and the minimal expected additional number of plays is V(0, 0). Therefore, the function V(s, k) satisfies

V(s, k) = min{ 1 + p(s, k) V(s+1, k) + (1 − p(s, k)) V(s, k+1), V(0, 0) },   (3.2)

with V(s, k) = 0 for all (s, k) such that s ≥ s_k.
It is possible that continuing with the current arm results in the same minimal expected additional number of plays as using a new arm. For uniqueness, we specify that the optimal policy always plays a new arm when the actions are indifferent. Next, we show a useful property of the function V(s, k).

Lemma 6. V(s, k) is nonincreasing in s and nondecreasing in k.

Proof. We first show that V(s, k) ≥ V(s+1, k). Consider two scenarios, the first where the current state is (s, k) and the second where it is (s+1, k). Suppose that the optimal policy is employed in scenario 1, and let N_1 denote the additional number of plays until an arm is accepted. Let N_2 denote the additional number of plays until an arm is accepted in scenario 2 when the policy employed is to follow the policy of scenario 1, that is, to take the actions that would be optimal if the initial state were (s, k). We claim that N_1 is stochastically larger than N_2, and show it by a coupling argument. Let p denote the success probability of the current arm with record (s, k) in the first scenario, with its subsequent rewards being X_i, i ≥ 1. Similarly, let p' denote the success probability of the arm with record (s+1, k) in the second scenario, with its subsequent rewards being X'_i, i ≥ 1. As shown in Lemma 2, X_{s,k} is stochastically increasing in s, which implies that p is stochastically smaller than p'. Hence, we can couple the values of p and p' so that p ≤ p', and couple the rewards X_i and X'_i so that X_i ≤ X'_i for all i > 0. As a result, one of the following outcomes occurs:

1. N_2 < N_1 if the current arm is accepted in the second scenario.
2. N_2 = N_1 if the current arm is discarded in both scenarios.

Therefore, N_1 is stochastically larger than N_2, which implies

V(s, k) = E[N_1] ≥ E[N_2] ≥ V(s+1, k).

To show that V(s, k+1) ≥ V(s, k), consider two scenarios where the first is in state (s, k+1) and the second is in state (s, k); applying the same coupling argument again shows the result.

Lemma 6 yields the following result about the optimal sampling policy.

Lemma 7. The optimal sampling rule never discards an arm that just had a success.

Proof. We show the lemma by contradiction. Assume that the optimal policy stays with the current arm in state (s, k) and uses a new arm at state (s+1, k). By the optimality equation (3.2), V(s, k) < V(0, 0) and V(s+1, k) = V(0, 0), which implies V(s, k) < V(s+1, k), contradicting Lemma 6.

Say that a sampling rule is a threshold rule if it never switches to a discarded arm and, for some nondecreasing sequence m_k, k ≥ 1, continues to use an arm that currently has k failures if and only if its current number of successes is at least m_k. We now present the structure of the optimal sampling policy.

Theorem 8. The optimal sampling rule is a threshold rule.

Proof. Suppose that the optimal sampling policy stays with the current arm at state (s, k), that is, V(s, k) < V(0, 0). By Lemma 6, V(s, k) is nonincreasing in s; thus V(s', k) < V(0, 0) for all s < s' < s_k, which implies that the optimal policy stays with the current arm for all such s'. On the other hand, suppose that the optimal policy calls for switching at (s, k), that is, V(s, k) = V(0, 0). As V(s', k) ≥ V(s, k) if s' < s, we have V(s', k) = V(0, 0) for all s' < s, so the optimal action is to play a new arm for all s' < s. As a result, there exists some m_k, k = 1, 2, ..., such that, given record (s, k), it is optimal to continue with the current arm when s ≥ m_k and to switch otherwise.

To sum up, for an arm that is currently being played and has record (s, k), it is optimal to accept the arm if s ≥ s_k.
If the arm cannot be accepted, it is optimal to stay with it if s ≥ m_k; otherwise the arm is discarded and the player draws a new arm. Note that we can explicitly determine the values s_k but cannot do the same for the values m_k. Figure 3.1 visualizes the decision region of the optimal policy for an arm with record (s, k).

Figure 3.1: Visualization of the decision region for the optimal policy when the current arm has record (s, k).

3.3 n-Failure Policies

In the previous section, we showed that the optimal sampling rule is a threshold rule. In practice, however, it is hard to determine the optimal thresholds either theoretically or numerically. In this section, we turn our attention to approximations of the optimal sampling rule. An intuitive idea is to give up on an arm if it cannot be accepted within a certain number of plays. In the following, we consider a class of policies with the restriction that an arm must be discarded once it has a certain number of failures. Note that we retain the optimal acceptance rule of Definition 3 throughout.

Definition 9. A policy is said to be an n-failure policy if an arm may be played only while it has at most n failures: a new arm must be played once the (n+1)th failure is observed, regardless of the number of successes obtained.

3.3.1 Optimal n-Failure Policy

We first show that the optimal n-failure policy preserves the threshold structure.

Lemma 10. The optimal n-failure policy is a threshold policy.

Proof. Let V_n(s, k) denote the expected additional number of plays when the current state is (s, k) and the optimal n-failure policy is employed, k = 0, ..., n+1. The function V_n(s, k) satisfies

V_n(s, k) = min{ 1 + p(s, k) V_n(s+1, k) + (1 − p(s, k)) V_n(s, k+1), V_n(0, 0) },  k = 0, ..., n,
V_n(s, n+1) = V_n(0, 0),

with V_n(s, k) = 0 for all (s, k) such that s ≥ s_k. In other words, the optimality equation is exactly as before except that the only action upon seeing the (n+1)th failure is to play a new arm. As a result, Lemmas 6 and 7 remain true and the optimal sampling policy is still a threshold policy.

We now restrict ourselves to n-failure threshold policies. Let (k_1, ..., k_n) denote the n-failure threshold policy under which an arm continues to be played if at least k_i successes have been obtained when its ith failure is observed, i = 1, ..., n; otherwise the policy calls for switching to a new arm. Let S_π denote the total number of plays until an arm is accepted when using policy π. We are interested in determining E[S_{(k_1,...,k_n)}] for arbitrary threshold vectors (k_1, ..., k_n). We start with the simple policy that does not allow any failures.

3.3.2 0-Failure Policy

Consider the policy π_0 that switches to a new arm whenever a failure is observed. That is, if π_0 is used, an arm is either accepted if it yields s_0 consecutive successes or discarded at its first failure. To determine E[S_{π_0}], let X_i, i = 1, 2, ..., denote the number of plays of the ith arm used, and let N denote the index of the arm that is accepted. Clearly, X_i, i = 1, 2, ..., are independent and identically distributed, and N is a stopping time for the sequence.
Therefore, we can apply Wald's equation, which gives

E[S_{π_0}] = E[ \sum_{i=1}^{N} X_i ] = E[N] E[X_1].

To determine E[N], note that the number of arms needed is a geometric random variable with parameter

P_{π_0} = \int_0^1 p^{s_0} f(p) dp.

On the other hand, E[X_1] can be derived from

E[X_1] = \sum_{j \ge 0} P(X_1 > j) = \sum_{j=0}^{s_0-1} P(X_1 > j) = \sum_{j=0}^{s_0-1} \int_0^1 p^j f(p) dp.

If we let P have density f and set μ_i = E[P^i], then

E[S_{π_0}] = (1 + \sum_{j=1}^{s_0-1} μ_j) / μ_{s_0}.

3.3.3 Arbitrary n-Failure Threshold Policy

We now generalize the previous result to an arbitrary n-failure threshold policy (k_1, ..., k_n). Recall that an arm is discarded when the action is to draw a new arm. Lemma 11 gives a recursive formula for the expected number of plays of an arbitrary arm.

Lemma 11. Let X_{(k_1,...,k_n)} be the total number of plays of an arbitrary arm until it is either accepted or discarded when using policy (k_1, ..., k_n). Then E[X_{(k_1,...,k_n)}] can be determined recursively by

E[X_{(k_1,...,k_n)}] = E[X_{(k_1,...,k_{n-1})}] + E[A_{(k_1,...,k_n)}],

where A_{(k_1,...,k_n)}, the additional number of plays of an arm after its nth failure is observed, has mean

E[A_{(k_1,...,k_n)}] = \sum_{i_1=k_1}^{s_0-1} \sum_{i_2=\max\{i_1,k_2\}}^{s_1-1} \cdots \sum_{i_n=\max\{k_n,i_{n-1}\}}^{s_{n-1}-1} E[(P^{i_n} - P^{s_n})(1-P)^{n-1}].

Proof. It is immediate that X_{(k_1,...,k_n)} = X_{(k_1,...,k_{n-1})} + A_{(k_1,...,k_n)}. To derive the expected value of A_{(k_1,...,k_n)}, we condition on the success probability p of the arm and on the number of successes observed before the jth failure, denoted i_j, j = 1, ..., n. Noting that the jth failure must occur after the (j−1)th failure and the k_jth success, whichever is later, and before the s_{j-1}th success (at which point the arm would be accepted), we obtain

E[A_{(k_1,...,k_n)}] = \int_0^1 \sum_{i_1=k_1}^{s_0-1} \sum_{i_2=\max\{i_1,k_2\}}^{s_1-1} \cdots \sum_{i_n=\max\{k_n,i_{n-1}\}}^{s_{n-1}-1} p^{i_n} q^n \sum_{j=0}^{s_n-i_n-1} p^j f(p) dp
 = \int_0^1 \sum_{i_1=k_1}^{s_0-1} \sum_{i_2=\max\{i_1,k_2\}}^{s_1-1} \cdots \sum_{i_n=\max\{k_n,i_{n-1}\}}^{s_{n-1}-1} q^{n-1} (p^{i_n} - p^{s_n}) f(p) dp
 = \sum_{i_1=k_1}^{s_0-1} \sum_{i_2=\max\{i_1,k_2\}}^{s_1-1} \cdots \sum_{i_n=\max\{k_n,i_{n-1}\}}^{s_{n-1}-1} E[(P^{i_n} - P^{s_n})(1-P)^{n-1}],

where q = 1 − p. The proof is complete.

Lemma 12 gives a recursive formula for the probability of accepting an arbitrary arm.

Lemma 12. Let P_{(k_1,...,k_n)} be the probability that an arbitrary arm is accepted when employing policy (k_1, ..., k_n). Then P_{(k_1,...,k_n)} can be determined recursively by

P_{(k_1,...,k_n)} = P_{(k_1,...,k_{n-1})} + B_{(k_1,...,k_n)},

where B_{(k_1,...,k_n)}, the probability that the arm is accepted after its nth failure is observed, is given by

B_{(k_1,...,k_n)} = \sum_{i_1=k_1}^{s_0-1} \sum_{i_2=\max\{i_1,k_2\}}^{s_1-1} \cdots \sum_{i_n=\max\{k_n,i_{n-1}\}}^{s_{n-1}-1} E[P^{s_n}(1-P)^n].

Proof. It is immediate that P_{(k_1,...,k_n)} = P_{(k_1,...,k_{n-1})} + B_{(k_1,...,k_n)}. To derive B_{(k_1,...,k_n)}, we condition on the success probability of the arm and on the number of successes observed before the jth failure, denoted i_j, j = 1, ..., n. Similarly to Lemma 11,

B_{(k_1,...,k_n)} = \int_0^1 \sum_{i_1=k_1}^{s_0-1} \sum_{i_2=\max\{i_1,k_2\}}^{s_1-1} \cdots \sum_{i_n=\max\{k_n,i_{n-1}\}}^{s_{n-1}-1} p^{s_n} q^n f(p) dp
 = \sum_{i_1=k_1}^{s_0-1} \sum_{i_2=\max\{i_1,k_2\}}^{s_1-1} \cdots \sum_{i_n=\max\{k_n,i_{n-1}\}}^{s_{n-1}-1} E[P^{s_n}(1-P)^n].

With Lemmas 11 and 12, we are ready to derive the expected number of plays of an arbitrary n-failure threshold policy (k_1, ..., k_n).

Corollary 13. Recall that S_{(k_1,...,k_n)} denotes the total number of plays until an arm is accepted when policy (k_1, ..., k_n) is adopted. Then

E[S_{(k_1,...,k_n)}] = E[X_{(k_1,...,k_n)}] / P_{(k_1,...,k_n)}.

Proof. Let X_i, i = 1, 2, ..., be the number of plays of the ith arm and N the number of arms used until an arm is accepted. Following the same reasoning as in deriving E[S_{π_0}], we can apply Wald's equation, and thus

E[S_{(k_1,...,k_n)}] = E[ \sum_{i=1}^{N} X_i ] = E[N] E[X_1] = E[X_{(k_1,...,k_n)}] / P_{(k_1,...,k_n)}.
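The following sketch turns Lemmas 11 and 12 and Corollary 13 into a direct computation of E[S_{(k_1,...,k_n)}]. It is a minimal illustration rather than the author's code: it assumes a uniform(0,1) prior, for which E[P^a(1−P)^b] is the Beta integral a! b!/(a+b+1)!, and it fixes (ε, δ) = (0.05, 0.05); the helper names are ours.

```python
import math
from scipy.stats import beta

EPS, DELTA = 0.05, 0.05

def acceptance_threshold(k):
    """s_k for a uniform prior: an arm with record (s, k) has posterior Beta(s+1, k+1)."""
    s = 0
    while beta.sf(1 - EPS, s + 1, k + 1) < 1 - DELTA:
        s += 1
    return s

def moment(a, b):
    """E[P^a (1 - P)^b] for P ~ uniform(0, 1)."""
    return math.factorial(a) * math.factorial(b) / math.factorial(a + b + 1)

def expected_plays(ks):
    """E[S_{(k_1,...,k_n)}] via Lemmas 11-12 and Corollary 13 (ks may be empty)."""
    n_max = len(ks)
    s = [acceptance_threshold(j) for j in range(n_max + 1)]   # s_0, ..., s_n
    # Base case: plays until s_0 successes or the first failure (Section 3.3.2).
    EX = sum(moment(j, 0) for j in range(s[0]))
    Pacc = moment(s[0], 0)
    for n in range(1, n_max + 1):
        EA = Bn = 0.0
        def rec(level, prev):
            """Enumerate success counts k_j <= i_j <= s_{j-1} - 1 with i_1 <= ... <= i_n."""
            nonlocal EA, Bn
            for i in range(max(ks[level - 1], prev), s[level - 1]):
                if level == n:
                    EA += moment(i, n - 1) - moment(s[n], n - 1)   # Lemma 11 summand
                    Bn += moment(s[n], n)                          # Lemma 12 summand
                else:
                    rec(level + 1, i)
        rec(1, 0)
        EX += EA
        Pacc += Bn
    return EX / Pacc   # Corollary 13

print(expected_plays([]))    # 0-failure policy; about 274.13, matching Table 3.1
print(expected_plays([25]))  # 1-failure policy with k_1 = 25 (cf. Table 3.1)
```

The enumeration is exhaustive over the success counts at each failure, so it is only practical for small n, which is also the regime considered in Section 3.5.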
Corollary 13 is very useful in the sense that it lets us explicitly determine the total expected number of plays of an arbitrary n-failure threshold policy. In particular, when the prior is a Beta distribution, the expected number of plays can be computed analytically. In practice, we can find the optimal n-failure threshold policy through exhaustive search when n is small. Numerical examples are provided in Section 3.5 with a uniform prior and the number of allowed failures up to 4. For the rest of this section, we turn our attention to some additional properties of n-failure threshold policies.

3.3.4 Additional Properties

Our next result (Lemma 15) reveals the relationship between the minimal expected number of plays achievable by an n-failure policy, denoted E[S*_n], and its corresponding threshold vector, denoted (k*_1, ..., k*_n). In particular, assuming E[S*_n] is given, (k*_1, ..., k*_n) can be determined explicitly in a recursive fashion. To begin, consider an arm that currently has record (s, i) and has been neither accepted nor discarded under policy (k_1, ..., k_n). Let X_{(k_1,...,k_n)}(s, i) be the additional number of plays until it is either accepted or discarded, and let P_{(k_1,...,k_n)}(s, i) be the probability that it is accepted. Then:

Lemma 14. E[X_{(k_1,...,k_n)}(s, i)] / P_{(k_1,...,k_n)}(s, i) is a decreasing function of s when s < k_{i+1}.

The proof of Lemma 14 is given in Appendix A.2. With Lemma 14:

Lemma 15. Assume E[S*_n] is known, and let

k_{i,min} = min{ s : E[X_{(k*_1,...,k*_n)}(s, i)] / P_{(k*_1,...,k*_n)}(s, i) < E[S*_n], k*_{i-1} ≤ s ≤ k*_{i+1} }.

Then k*_i = k_{i,min} for all i = 1, ..., n.

Proof. Suppose that the player always takes the optimal action, and consider the scenario where the current arm has record (s, i), s < s_i. If the player continues to play the current arm, the expected additional number of plays is E[X_{(k*_1,...,k*_n)}(s, i)] + (1 − P_{(k*_1,...,k*_n)}(s, i)) E[S*_n]. If the player switches to a new arm, the expected additional number of plays is E[S*_n]. Therefore, it is optimal to continue with the current arm if

E[X_{(k*_1,...,k*_n)}(s, i)] + (1 − P_{(k*_1,...,k*_n)}(s, i)) E[S*_n] < E[S*_n],

which is equivalent to

E[X_{(k*_1,...,k*_n)}(s, i)] / P_{(k*_1,...,k*_n)}(s, i) < E[S*_n].

As Lemma 14 shows that this ratio is decreasing in s, the optimal value of the ith threshold is

k*_i = min{ s : E[X_{(k*_1,...,k*_n)}(s, i)] / P_{(k*_1,...,k*_n)}(s, i) < E[S*_n] },

which completes the proof.

Applying Lemma 15, we can recursively determine the optimal threshold vector given E[S*_n]: we start by obtaining the value of k*_n and then work back to k*_1. Note that E[X_{(k_1,...,k_n)}(s, i)] and P_{(k_1,...,k_n)}(s, i) can be determined in a fashion similar to E[X_{(k_1,...,k_n)}] and P_{(k_1,...,k_n)}; the formulas are given in Appendix A.1. In practice, Lemma 15 is useful in that one may start by approximating the value of E[S*_n], for example using methods from approximate dynamic programming; if the approximation is close to optimal, applying Lemma 15 yields a near-optimal or optimal threshold vector.

Finally, we mention two additional properties of n-failure threshold policies.

Lemma 16. E[S_{(k_1,...,k_n)}] is a unimodal function of k_i for k_{i-1} ≤ k_i < k_{i+1}, for any i = 1, ..., n.

The proof takes a bit of algebra and is given in Appendix A.3.
Lemma 16 provides a practical way of improving an n-failure policy: one can start with an arbitrary n-failure policy and then either randomly or sequentially select a threshold to update, with the unimodality property helping to find the local minimum quickly. Although the updated policy may not converge to the optimal n-failure policy, this procedure runs much faster than exhaustive search and thus provides an alternative way to identify a good policy.

The last lemma of this section shows that the expected return of the optimal n-failure policy converges to that of the optimal threshold policy.

Lemma 17. Let π* denote the optimal policy. Then E[S_{(k*_1,...,k*_n)}] → E[S_{π*}] as n → ∞.

Proof. Let (k̃_1, k̃_2, ...) denote the threshold vector of the optimal policy π*, with possibly infinitely many thresholds. Consider the truncated optimal policy π'_n under which up to n failures are allowed for an arbitrary arm, that is, π'_n = (k̃_1, ..., k̃_n). Recall that X_π denotes the number of plays of an arbitrary arm until it is either accepted or discarded, and P_π is the probability that it is accepted. By Lemma 11, E[X_{π'_n}] is monotone increasing in n; as it is upper bounded by E[X_{π*}], it follows that E[X_{π'_n}] → E[X_{π*}]. The same reasoning shows that P_{π'_n} → P_{π*}. By Corollary 13, E[S_{π'_n}] = E[X_{π'_n}] / P_{π'_n}, and thus E[S_{π'_n}] → E[S_{π*}]. On the other hand, as (k*_1, ..., k*_n) is the optimal n-failure policy, we must have E[S_{π*}] ≤ E[S_{(k*_1,...,k*_n)}] ≤ E[S_{π'_n}]. Therefore E[S_{(k*_1,...,k*_n)}] → E[S_{π*}] as n → ∞.

3.4 A Policy Improvement Strategy

One drawback of using the optimal n-failure policy is that finding the optimal thresholds suffers computationally from the curse of dimensionality. In this section, starting with an arbitrary policy, we give a technique that often leads to an improvement. In particular, for an arbitrary sampling policy π, we show how to obtain a modified policy M(π) that is at least as good as π.

To begin, suppose that we use an arbitrary sampling policy π. Consider the moment when π calls for switching to a new arm. Instead of following π, we compare the expected additional number of plays from following π with that from continuing with the current arm until either an additional failure occurs or the arm is accepted and, in the former case, then switching back to π. The action is modified to continue with the current arm if the latter is at least as good as switching. If the decision is to continue with the current arm and a failure occurs before the arm can be accepted, we switch back to π and play a new arm. That is, we give the current arm an additional chance in the sense that we stay with it until either the next failure or acceptance occurs. The formal definition of our modification strategy is as follows.

Definition 18 (Policy Improvement Strategy 1). Let M_1(π) be a modification of policy π such that M_1(π) always continues if π calls for continuing. When π calls for switching, M_1(π) computes E[X_add]/P_accept, where X_add is the additional number of plays until either a new failure is observed or the arm is accepted, and P_accept is the probability of the arm being accepted. If

E[X_add] / P_accept < E[S_π],   (3.3)

then M_1(π) continues with the current arm until either acceptance or the next failure; if the next failure occurs before acceptance, it then switches to a new arm.

Note that modifying the sampling policy requires that E[S_π] be known beforehand. This can be achieved either by simulation over repeated runs, or analytically via Corollary 13 if π has the threshold structure.
In the improvement strategy above, M_1(π) is forced to switch to a new arm if an additional failure occurs before acceptance. However, we may also consider modifications that do not necessarily switch upon seeing the next failure: for any failure observed afterwards, we can again make the comparison to decide what to do. That is, we recheck condition (3.3) and stay with the current arm as long as it holds. We claim that this new policy is also an improvement upon the original policy.

Definition 19 (Policy Improvement Strategy 2). Let M_1^∞(π) be a modification of policy π such that M_1^∞(π) always continues if π calls for continuing. When π calls for switching, M_1^∞(π) computes E[X_add]/P_accept, and if

E[X_add] / P_accept < E[S_π],   (3.4)

then M_1^∞(π) continues with the current arm. For any failure observed afterwards, the action is to continue with the current arm as long as (3.4) holds.

We now show that M_1(π) is always at least as good as the original policy.

Lemma 20. E[S_{M_1(π)}] ≤ E[S_π] for any sampling policy π.

Proof. Let M_1^n(π) be the policy in which the modification is used only the first n times it is called for. We show by induction that E[S_{M_1^n(π)}] ≤ E[S_π] for all n. To start, consider the moment when the decisions of M_1^1(π) and π differ. At that moment, the original policy switches to a new arm and takes E[S_π] expected additional plays, while M_1^1(π), which continues with the current arm until either one additional failure is observed or the acceptance region is reached, takes E[X_add] + (1 − P_accept) E[S_π] expected additional plays. By definition,

E[X_add] + (1 − P_accept) E[S_π] < P_accept E[S_π] + (1 − P_accept) E[S_π] = E[S_π].

Therefore E[S_{M_1^1(π)}] ≤ E[S_π]. Assume the result is true for n = m, and consider two scenarios, the first using policy M_1^m(π) and the second using M_1^{m+1}(π). We can couple the two scenarios so that the decisions are identical until the (m+1)st modification is called for. At that moment, M_1^m(π) has E[S_π] expected additional plays while M_1^{m+1}(π) has E[X_add] + (1 − P_accept) E[S_π]. Therefore E[S_{M_1^{m+1}(π)}] ≤ E[S_{M_1^m(π)}], and the result is proven.

We should note that because E[X_add] and P_accept are computed under the supposition that we continue with the current arm until either we reach the stopping set or a failure occurs, they can be calculated by the formulas given in Appendix A.1. In practice, we can apply M_1(π) to an n-failure threshold policy; we do this in the next section with numerical examples which, among other things, show that this policy modification applied to the 1-failure policy yields very good results.

Our proof of Lemma 20 also works for showing that M_1^∞(π) improves upon π.

Corollary 21. E[S_{M_1^∞(π)}] ≤ E[S_π] for any sampling policy π.

By employing policy improvement strategy 1, we now propose a specific way of obtaining an n-failure policy which in practice performs very well. Suppose that one currently uses an arbitrary n-failure policy (k_1, ..., k_n). Consider the moment that the (n+1)th failure of some arm is observed, and suppose that s successes have been obtained. At that moment, apply policy improvement strategy 1, which determines whether to continue or to switch to a new arm. By Lemma 14, E[X_add]/P_accept is a decreasing function of the current number of successes obtained. Therefore, there is an s* such that the modified policy switches to a new arm if s < s* and otherwise continues until one additional failure or acceptance. As a result, we obtain an (n+1)-failure threshold policy that always improves upon the current n-failure policy.
This result is summarized in the following Corollary 22.

Corollary 22. Let (k_1, ..., k_n) be an arbitrary n-failure threshold policy, and let (k_1, ..., k_n, k'_{n+1}) be the (n+1)-failure policy with

k'_{n+1} = min{ i : E[(P^i − P^{s_{n+1}})(1−P)^n] / E[P^{s_{n+1}}(1−P)^{n+1}] ≤ E[S_{(k_1,...,k_n)}] }.   (3.5)

Then

E[S_{(k_1,...,k_n,k'_{n+1})}] ≤ E[S_{(k_1,...,k_n)}].

To see how to obtain an n-failure policy from Corollary 22, suppose we start with the 0-failure policy, whose mean number of plays is E[S_{π_0}]. We extend it to a 1-failure policy by taking k_1 = min{ i : E[P^i − P^{s_1}] / E[P^{s_1}(1−P)] ≤ E[S_{π_0}] }. Suppose the resulting 1-failure policy has expected number of plays E[S_1]. We then extend it to a 2-failure policy by taking k_2 = min{ i : E[(P^i − P^{s_2})(1−P)] / E[P^{s_2}(1−P)^2] ≤ E[S_1] }, keeping the first threshold the same. Running this idea recursively yields an n-failure policy; a short computational sketch of the construction is given below. We show with the numerical examples in the next section that such an n-failure policy performs well relative to the optimal n-failure policy.
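The recursive construction of Corollary 22 can be sketched in a few lines on top of the helpers (acceptance_threshold, moment, expected_plays) defined in the sketch following Corollary 13. As before, this assumes the uniform prior and (ε, δ) = (0.05, 0.05) and is only an illustration, not the author's implementation.

```python
def improved_thresholds(max_failures):
    """Grow an n-failure threshold policy by repeatedly applying Corollary 22,
    starting from the 0-failure policy (uniform prior, as above)."""
    ks = []
    for n in range(max_failures):
        target = expected_plays(ks)            # E[S] of the current n-failure policy
        s_next = acceptance_threshold(n + 1)   # s_{n+1}
        i = 0
        # Smallest i satisfying condition (3.5).
        while (moment(i, n) - moment(s_next, n)) / moment(s_next, n + 1) > target:
            i += 1
        ks.append(i)
    return ks

print(improved_thresholds(4))  # compare with the thresholds reported in Table 3.2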
policy expected ips k 0 1 k 0 2 k 0 3 k 0 4 0 0 274.1290 0 1 246.0941 23 0 2 238.3873 23 54 0 3 235.5383 23 54 79 0 4 234.3232 23 54 79 111 Table 3.2: The expected number of plays and thresholds vector(k 0 1 ;:::;k 0 n ) for then-failure thresh- old policy 0 n derived byM 1 () with(;) = (0:05;0:05) Next we apply the policy modication strategyM 1 () on i ;i = 0;:::;4. Recall that when the original policy calls for switching to a new arm,M 1 () will stay with the current arm as long as condition 3.4 holds. Since the modied policy doesn’t retain the threshold structure and thus the expected number of plays can’t be computed by applying Corollary 13, we run simulation to estimate the expected return. The results are summarized in Table 3.3. We can observe that the modied policies always improve upon the originaln-failure policy and the improvement is signicant whenn is small. policy expected plays s.d. M 1 ( 0 ) 232.5916 0.173 M 1 ( 1 ) 233.45 0.175 M 1 ( 2 ) 233.348 0.175 M 1 ( 3 ) 232.964 0.174 M 1 ( 4 ) 232.662 0.174 Table 3.3: Results applying M 1 () on i ;i = 0;1:::;4. The statistics is summarized over 10 6 repeated runs and s.d. stands for standard deviation. 37 In the end, we compare policy M 1 ( 0 ) to the KL-LUCB algorithm proposed in Aziz et al. 2018. KL-LUCB is a two-phase algorithm such that it rst selects a pool of arms as candidates and then runs a best arm identication algorithm to nd the best arm in the candidates set. We borrow the numerical results in their paper, where they assumed Bernoulli arms with mean be- ing truncated beta with support on (0,0.95). For comparison, we use theM 1 ( 0 ), where 0 is the 0-failure policy. The results are shown in Table 3.4. As our model is able to explicitly exploit the prior information, while KL-LUCB only assumes some regularity conditions on the reward distribution and mean reward distribution, our policy shows much better performance than KL- LUCB. Prior KL-LUCB HeuristicM( 0 ) TrBeta(1,1) 0.06 0.05 113000 329.89 TrBeta(1,1) 0.098 0.1 10000 111.87 TrBeta(1,2) 0.234 0.05 79000 54.53 TrBeta(1,2) 0.234 0.1 65000 43.05 Table 3.4: Comparison between policyM 1 ( 0 ) and KL-LUCB, whereTrBeta(;) is truncated Beta distribution with support on(0;0:95). 38 Chapter4 ADuelingBanditProblem In this Chapter, we study a variant of the classical multi-armed bandit, dueling bandit problem. Instead of playing a single arm, two arms are chosen to play a game at each time step, with the result of the game learned. We consider three dierent objectives: minimizing the cumulative weak regret, minimizing the cumulative strong regret, and nding the best dueler with xed condence level. Under the objective of weak regret, we want to maximize the number of times that one of the competitors is the best arm. We present a policy that achieves nite regret, and a modied policy that empirically has better performance. Under the notion of strong regret, we are interested in maximizing the number of times that the contest is between the best arm itself. We develop a Thompson Sampling type algorithm. We also propose two policies which employ the structure of gambler’s ruin problem for nding the best dueler with xed condence using the minimal number of plays. 39 4.1 MinimizingtheWeakRegret In this section, we consider the dueling bandit problem under the objective of minimizing the cumulative weak regret. We suppose that there is a set ofn arms. At each stage, the player picks two arms to play a game, with the winner learned by the player. 
We letp i;j ,1i;jn, denote the probability that armi is preferred over armj wheni paired withj, wherep i;j is unknown. We assume that there exists a Condorcet winner, that is, there is an unknown arm i and an unknown probabilityp > 1=2, such that armi is the winner of each its duel with probability at least p. The binary weak regret r(t) at time period t is r(t) = 0 if the best arm is one of the chosen arms and r(t) = 1 otherwise. Our objective is to minimize the cumulative regret P 1 t=1 r(t) over innite time horizon. We propose two algorithms,BeattheWinner (BTW) and its improved version Modied Beat the Winner (MBTW). 4.1.1 BeattheWinner We now present our BTW algorithm. The BTW algorithm proceeds in rounds, with round k;k 1; consisting of two arms playing a sequence of games until one of them has won k times. BeattheWinnerRule • Arms are initially put in the queue in a random order. • For roundk = 1;2;::: 40 – Top two arms in queue play a sequence of games. The winner of the round is the rst to wink games. – The loser goes to the end of the queue, and the winner stays at the top of the queue. Now we show the expected cumulative regret of BTW is upper bounded byO(n 2 ) over innite time horizon. Lemma23. LetL k be the event thati is the loser of roundk: Witha = 2p1 P(L k ) expfka 2 g Proof. Becausei must play in roundk to be the loser of that round, it follows that P(L k )P(L k ji plays in roundk) Now, it follows by a coupling argument thatP(L k ji plays in roundk) is upper bounded by the probability that a total ofk heads occurs before a total ofk tails in a sequence of independent trials that each result in a head with probability q = 1p: Hence, with B being a binomial (2k1;q) random variable 41 P(L k )P(Bk) =P B(2k1)qka+q expf 2(ka+q) 2 2k1 g expfka 2 g where the next to last inequality follows from Cherno’s bound. With Lemma 23, we are able to give the upper bound of expected number of games thati is not involved. Theorem24. WithX being the total number of games that do not involvei ; E[X] (n2) 2 +e a 2 2K 3 +(2n5)K 2 (4.1) whereK = 1=(1e a 2 ): Proof. LetR be last round lost byi : Lemma 23 gives P(Rr) =P([ kr L k ) Kexp(ra 2 ) 42 Hence, E[R] = X r1 P(Rr) K 2 e a 2 and E[R 2 ] =2 X r1 rP(Rr)E[R] 2K 3 e a 2 E[R] Because there are at most2k1 games in roundk, andi plays in all rounds after roundR+n2 X R+n2 X k=1 (2k1) = (R+n2) 2 yielding that E[X] (n2) 2 +e a 2 2K 3 +(2n5)K 2 Note that the regret bound of the BTW matches that of the WS-W proposed in B. Chen and Frazier 2017 without assuming p i;j 6= 0:5 for all pairs of i;j (We will show later that WS-W actually doesn’t need that assumption). However, in practice BTM is not very competitive for smalln since the BTM takes relative long time to identify and extensively play the best arm. This 43 is also indicated by the regret bound derived in 4.1, wheren 2 is dominated by the constantK 3 whenn is small. On the other hand, whenn is large, the performance of BTW roughly matches WS-W and enjoys the advantage of having smaller variance. Numerical instances will be shown in the next section along with our proposed MBTW algorithm. 4.1.2 ModiedBeattheWinner Note that one of main drawbacks of BTW is that it doesn’t utilize any past records of arms. Hence, two apparently bad arms could play a huge amount of games, where we gain no meaningful information but the cumulative regret dramatically increases. To overcome this drawback, we consider allowing the learner to keep track of some records of arms. 
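For reference, the BTW queue dynamics analyzed above can be simulated in a few lines. The sketch below is hypothetical: the function name, the preference-matrix argument p (with p[i][j] the probability that arm i beats arm j), and the arbitrary tie-break at the horizon cutoff are all assumptions, and it is not the implementation used for the experiments in Section 4.1.3.

```python
import random

def beat_the_winner(p, best, horizon, seed=None):
    """Simulate the BTW queue dynamics and return the cumulative binary weak regret,
    i.e. the number of games that do not involve the best arm.  Assumes at least
    two arms; `best` is the index of the Condorcet winner."""
    rng = random.Random(seed)
    n = len(p)
    queue = list(range(n))
    rng.shuffle(queue)                       # arms start in a uniformly random order
    regret, games, k = 0, 0, 1
    while games < horizon:
        a, b = queue[0], queue[1]
        wins_a = wins_b = 0
        # round k: the top two arms play until one of them has won k games
        while max(wins_a, wins_b) < k and games < horizon:
            if rng.random() < p[a][b]:
                wins_a += 1
            else:
                wins_b += 1
            games += 1
            if best not in (a, b):
                regret += 1
        # if the horizon cuts the round short, the current leader is (arbitrarily) the winner
        winner, loser = (a, b) if wins_a > wins_b else (b, a)
        queue = [winner] + queue[2:] + [loser]   # loser to the back, winner stays on top
        k += 1
    return regret
```

Averaging beat_the_winner(p, best, 10**5) over independently generated preference matrices is one way to estimate the kind of cumulative-regret curves reported in the next subsection.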
Specically, we want to record the dierence between the number of rounds an arm wins and the number of rounds it loses for each arm. Based on such information, we propose the Modied Beat the Winner algorithm(MBTW) and empirically show that it signicantly outperforms both BTM and WS-W. Similar to BTW, MBTW also plays games in a round fashion, where each round consists of a series of games. However, the number of games is no longer determined by the number of past rounds, but by the records of the arms in the duel. We now show how to dene the records of the arms, and how to choose players of the next round. ModiedBeattheWinnerRule • The initial value ofr i ; the score of armi, isr i = 1;i = 1;:::n 44 • Choose an arm uniformly at random as thehost, and leth denote the index of the host. The current host is always one of the players of the next round. • For each round – Leti;i6= h; be the opponent of armh with probability r i P i6=h r i . The armi so chosen and h play a sequence of games until one of them has won r h games. Let w and l denote the indices of the winner and loser. – Resetr w =r w +1 – Resetr l = max(r l 1; 1) – Seth =w. (The winner of the current round becomes the host.) 4.1.3 NumericalExamples In this section, we show two simulated numerical instances with 100 arms and 1000 arms. The WS-W method is used as the benchmark to evaluate our proposed algorithms. For both instances, we generatep i;j from independentunif(0:2;0:8) fori < j;i6= i and setp j;i = 1p i;j . For i =i , we generatep i ;j fromunif(0:5;0:8) for the 100 arms case andp i;j fromunif(0:55;0:8) for the 1000 arms case. Note that we increase the minimal winning probability of the Condorcet winner for speeding up the convergence in the second instance. The results are shown in Figure 4.1 and Figure 4.2, including the plot of cumulative regret and standard deviation at xed time spotT = 10 5 . Each example takes 2000 repeated runs. For the example with 100 arms, we can see MBTW converges faster than WS-W, with its cumulative 45 Figure 4.1: Experiments with 100 arms. (a) Cumulative regret over 100 replications; (b) Standard deviation of cumulative regret atT = 10 5 . Figure 4.2: Experiments with 1000 arms. (a) Cumulative regret over 100 replications; (b) Standard deviation of cumulative regret atT = 10 6 . 46 weak regret atT = 10 5 being51% smaller than that of WS-W. As for the example with 1000 arms, the lead is even larger, where neither of BTW and WS-W has converged atT = 10 6 while MBTW has already converged atT = 1:2 10 5 . In addition, MBTW enjoys the advantage of smaller variance. As a result, we conclude that MBTW has the overall best empirical performance over the other two algorithms. 4.1.4 ARevisittoWS-W For the rest of this section, we provide a supplementary proof of WS-W showing that the upper bound of expected cumulative regret under the Condorcet winner setting can be further improved over what was shown in B. Chen and Frazier 2017. Suppose that p = min p i;j >0:5 p i;j , to p = min j6=i p i ;j . Compared to the original proof, our proof is still valid when there exists a pairi;j with p i;j = 0:5 and thus regret bound only depends on least winning p when the Condorcet player matched with other players. Now consider the WS-W algorithm. LetW k andL k be the event that the best arm wins round k and loses roundk, respectively. To slightly simply the following analysis, we assume that there aren+1 arms in total. Lemma 25. 
Condition on whether the best arm is the winner of roundk1, the probability that the best arm wins roundk is P(W k jW k1 ) 1(q=p) nkn+k 1(q=p) n(k+1) 47 P(W k jL k1 ) 1(q=p) 1(q=p) n(k+1) Lemma26. The probability that the best arm loses roundk is bounded by P(L k )< 2( q p ) k Lemma27. Consider gambler’s ruin problem which stops when the gambler is either upm1 or down1. LetE p [X] be the mean number of games when the gambler wins each bet with probability p. Then for8p2 (0;1) E p [X]< 2m The proofs of Lemma 25, 26 and 27 can be found in the Appendix B.1, B.2, B.3. Theorem28. LetX bethetotalcumulativeweakregretoverinnitetimehorizonemployingWS-W, then E[X] 2p 2 (2p1) 2 (n 2 +n) Proof. Letr =q=p, then E[X] = X k1 E[regret at roundk] X k1 2r k1 k(n 2 +n) = 2p 2 (2p1) 2 (n 2 +n) 48 where the inequality follows by lemma 26 and lemma 27. 4.2 MinimizingtheStrongRegret In this section, we restrict ourselves to the scenario where each arm has an unknown associated valuev i ;i = 1;:::;n. The probability that armi is preferred overj isv i =(v i +v j ). The objective is to minimize two versions of cumulative strong regret. Specically, under the notion of strong regret 1, two dierent arms are picked at each time slot and one can avoid the regret only if two best arms(i.e. two arms with largestv i ) are selected simultaneously. On the other hand, under the notion of strong regret 2, one is allowed to pick the same arm in the duel. The objective then is to minimize the the number of time that the best arm is not chosen as both the dueling arms. Note that we use binary regret under both settings meaning that regret is 0 if the corresponding optimal criteria is satised and 1 otherwise. We propose a new algorithm by adopting the Thompson sampling approach, originally intro- duced by Thompson 1933, to pairwise comparison. Existing works that employ the Thompson sampling idea on dueling bandit either assume the existence of Condorcet winner(RCB algorithm by Zoghi, Whiteson, Remi Munos, et al. 2014) or deal with the more general Copeland winner(DT algorithm by Wu and X. Liu 2016). Both algorithms maintained the posterior distribution for the preference matrix P and explicitly used independently sampled p i;j to pick the arms. In our model, however, the preference probabilities are completely determined by the associated values and thus sampling the value of arms should be more informative than from sampling preference 49 probabilities. In the following work, we develop a Markov chain Monte Carlo(MCMC) sampling approach that allows us to sample values of arms from the posterior distribution. We empirically compared our algorithm to 5 benchmarks. 4.2.1 TheSamplingApproach The approach of sampling values of arms at each time stage is as follows. We imagine that the strengths are the values ofn independent exponential random variables,V 1 ;:::;V n . Given this, it follows that ifw i;j denotes the current number of times playeri has beatenj, then the conditional density ofV 1 ;:::;V n would be f(x 1 ;:::;x N ) =Ce P i x i Y i6=j x i x i +x j w i;j (4.2) for a normalization factorC. Our algorithmic approach for strong regret 2 is to simulateV (1) = (V (1) 1 ;:::;V (1) n ) andV (2) = (V (2) 1 ;:::;V (2) n ) independently according to 4.2, then let I = argmax i V (1) i ;J = argmax i V (2) i and chooseI andJ to play with each other in the next round. (Note that ifI =J then 4.2 needs not be updated.) 
For strong regret 1, we simulate onlyV (1) = (V (1) 1 ;:::;V (1) n ) and choose the two indices with largest values play to the next game. However, because directly simulatingV from the posterior refd1 does not seem computation- ally feasible (for one thingC is dicult to compute), we utilize the Hasting-Metropolis algorithm 50 to generate a Markov chain whose limiting distribution is given by 4.2. The Markov chain is de- ned as follows. When its current state isx = (x 1 ;:::;x n ), a coordinate that is equally like to be any of 1;:::;n is selected. Ifi is selected, a random variable Y is generated from an exponential distribution with mean x i , and if Y = y, then (x 1 ;:::;x i1 ;y;x i+1 ;:::;x n ) is considered as the candidate next state. In other words, if we denotey = (x 1 ;:::;x i1 ;y;x i+1 ;:::;x n ), the density function for the proposed next state is q(yjx) = 1 n 1 x i e y=x i The transition from to the current statex to its next statex proceeds via the following step. x = 8 > > > < > > > : (x 1 ;:::;x i1 ;y;x i+1 ;:::;x n ) with probability(x;y) (x 1 ;:::;x i1 ;x i ;x i+1 ;:::;x n ) with probability1(x;y) where (x;y) = min ( f(y) f(x) q(xjy) q(yjx) ;1 ) • For strong regret 1, the simulation of the Markov chain stops afterk iterations and we use indices of two largest values of the nal state vector as the choice of players for the next round. • For strong regret 2, we let the simulation of Markov chain stop after 2k iterations. We choose index of largest value of the vector at iterationk and 2k as the choice of the rst player and the second player respectively. 51 Suppose that the nal stage vector (k iterations for strong regret 1 and 2k iterations for strong regret 2) is x 0 1 ;:::;x 0 n . Once that round has been completed, and we have updated the values of w i;j ; we let the initial value of the Markov chain used to obtain the next pair of duelists be x 0 1 ;:::;x 0 n : Because the conditional density should not change by much after a single game, we expect this will speed up the convergence of the chain. In practice, it turns out thatk = O(n) would be enough for each simulation. In addition, we compute by using the identity = exp(log()). 4.2.2 NumericalExperiments We empirically compare our Thompson sampling approach with 5 benchmarks: WS-S(B. Chen and Frazier 2017), RUCB(Zoghi, Whiteson, Remi Munos, et al. 2014), D-TS, D-TS*(Wu and X. Liu 2016) and RCB( Zoghi, Whiteson, Remi Munos, et al. 2014). We compare over the simulated data under strong regret 2 criteria. The comparison is conducted in two scenarios, where i.i.d expo- nential(1) and uniform(0,1) random variables are generated as the strengths of arms. As shown in Figure 4.3, our Thompson Sampling approach empirically outperforms all benchmarks when strengths are generated from exponential(1). While the strength are generated from uniform(0,1), as shown in Figure 4.4, our algorithm also matches the state-of-art. For both instances, the results are averaged over 200 replications. Note that we setk = 50 iterations for sampling V at each time slot. 52 0 2000 4000 6000 1e+02 1e+03 1e+04 1e+05 Time Horizon Cumulative Regret Algorithm TS WS−S RUCB D_TS D_TS+ RCS Figure 4.3: Experiments with 5 arms on exponential(1) strengths. Replication: 200 times 4.3 FindtheBestDuelerwithFixedCondence In this section, we consider the objective of nding the best dueler (arm) with xed condence. Same as before, we can draw a duel (i;j) at each time period and observe one of them as the winner. 
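Before turning to the fixed-confidence problem, the Hastings–Metropolis update of Section 4.2.1 can be sketched as follows. This is a minimal, hypothetical implementation: the function names are illustrative, w[i, j] holds the current number of wins of i over j with w[i, i] = 0, and the full log-density is recomputed at each step rather than updated incrementally. It is not the code used for the experiments in Section 4.2.2.

```python
import numpy as np

def mh_step(x, w, rng):
    """One Hastings-Metropolis update of the strength vector, targeting the
    unnormalized posterior density of Section 4.2.1."""
    n = len(x)
    i = rng.integers(n)                        # coordinate chosen uniformly at random
    y = x.copy()
    y[i] = rng.exponential(x[i])               # proposal: exponential with mean x_i

    def log_f(z):                              # log of e^{-sum z} * prod (z_i/(z_i+z_j))^{w_ij}
        pair = np.sum(w * (np.log(z)[:, None] - np.log(z[:, None] + z[None, :])))
        return -z.sum() + pair

    # proposal densities for the changed coordinate (exponential with mean x_i, resp. y_i)
    log_q_y_given_x = -np.log(x[i]) - y[i] / x[i]
    log_q_x_given_y = -np.log(y[i]) - x[i] / y[i]
    log_alpha = log_f(y) - log_f(x) + log_q_x_given_y - log_q_y_given_x
    if np.log(rng.random()) < min(0.0, log_alpha):
        return y
    return x

def sample_duelists(x0, w, k, rng=None):
    """Run k MH iterations from x0 and return the indices of the two largest
    sampled strengths (the strong regret 1 selection rule described above)."""
    rng = rng or np.random.default_rng()
    x = np.array(x0, dtype=float)
    for _ in range(k):
        x = mh_step(x, w, rng)
    return np.argsort(x)[-2:][::-1]
```

Under the strong regret 1 rule, sample_duelists would be called once per time slot, with the final state of the chain reused as the starting point of the next call, as described above.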
The probability that armi beatsj is equal to some unknown constantp i;j . We further assume that there exists a Condorcet winneri such that it beats any armj;j6=i with probability at leastp = 0:5+. Dierent from the assumption made in Section 4.1, here we assume is known. 53 0 5000 10000 15000 1e+02 1e+03 1e+04 1e+05 Time Horizon Cumulative Regret Algorithm TS WS−S RUCB D_TS D_TS+ RCS Figure 4.4: Experiments with 5 arms on uniform(0, 1) strengths. Replication: 200 times Our objective is to nd a policy that returns the best arm with probability at least1 using as few plays as possible. To begin, we present a tournament framework with knockout structure. We will present two policies based on this framework. The framework is dened as follow. KnockoutTournamentFramework • Initialization: all arms are alive • For roundj = 1;2;::: – Randomly pair up arms that are still alive 54 – For each pair, conduct a series of comparisons until claiming one of them as winner – The winner proceeds to the next round, with its opponents no longer being alive. • Claim the winner of the tournament as the best arm. In the following, we will present two ways of determining the winner for each match. Note that players who get a bye in some rounds automatically advance to the next round. For conve- nience of analysis, we assume that the total number of playersn = 2 k . 4.3.1 AGambler’sRuinRule Adopting the framework above, we propose aGambler’sRuinRule (GRR) to determine the winner of each match. We will show that the expected number of plays using GRR is upper bounded. To begin, letr =p=(1p) andm j = log r 2 j ;j = 1;:::. Gambler’sRuinRule • At roundj, play each pair until one is ahead bym j , with the leader being the winner. Lemma29. GRR identies the best dueleri with probability at least1. Proof. Given thati successfully proceeds to roundj, the probability thati is eliminated at round j, denoted byP j , can be upper bounded by using the gambler’s ruin probability P i 1r m j 1r 2m j = 1 1+r m j < 1 r m j = 2 j 55 To win the tournament,i needs to win allk rounds. Hence, P(i is eliminated) =P([ k j=1 i is eliminated at roundj) k X j=1 P(i is eliminated at roundj) < k X j=1 P j < which indicates that the probability of nding the best arm is at least1. Next, we will derive the expected number of plays by employing GRR. LetX m be the total number of plays for a match that ends until one of the player is ahead bym. Letp be the proba- bility that one player beats the other. The following Lemma 30 shows thatE[X m ] is maximized when ofp = 0:5. Lemma 30. The expected time until one of the player is ahead bym is a decreasing function ofp whenp> 1=2. The proof can be found in appendix B.4. As a result of Lemma 30,E[X m ] is maximized when p = 0:5 and thus can be upper bounded bym 2 . The total number of plays can be bounded by E[number of plays] k X j=1 2 kj m 2 j 56 4.3.2 ModiedGambler’sRuinRule One underlying drawback of GRR is that it may play too many games between two suboptimal arms to determine which seems better. In such cases, one might consider eliminating both arms as neither of them shows the potential to be best. Therefore, we can improve GRR by limiting the number of games in each match, and drop both arms if neither of them can win the match by the end. We name the new policy by Modied Gambler’s Ruin Rule (MGRR), which can described as follows. 
ModiedGambler’sRuinRule • At roundj, play each pair until either one is ahead bym j , with the leader being the winner, or the total number of comparison reachesn j , in which both arms are eliminated. wherem j = maxf 1 4 log( 2 j );1g andn j = 3m j . As a preparation of showing the correctness of MGRR, we need the following lemma. Lemma31. For0<x< 1 ( 1x 1+x ) 1 x e 2 Proof. Letf(x) = (1x)e 2x (1+x). It suces to show thatf(x) 0 for 0 < x < 1. By taking the derivative, f 0 (x) = 2e 2x e 2x 2xe 2x 1 f 00 (x) =4xe 2 x 57 Sincef 00 (x) < 0 for 0 < x < 1 andf 00 (0) = 0, thenf 0 (x) < f 0 (0) = 0 for 0 < x < 1. Hence, f(x)<f(0) = 0. Lemma32. MGRR identies the best armi with probability at least 1 -. Proof. Recall the best arm wins all other arms with probability at leastp = 0:5+. LetX i ;i = 1;2;::: be the i.i.d random variables such that X i = 8 > > < > > : 1 w.p. 1 + 1 w.p. 1 - LetS n = P n i=1 X i . Given that the best player successfully advances to roundj, the probability that the best player loses roundj, denoted byP j , is P j P(S n hitsm j beforem j [S n can’t hitm j withinn j steps) P(S n hitsm j beforem j )+P(S n can’t hitm j withinn j steps) 58 The rst term can be bounded by using the gambler’s ruin probability, where P(S n hitsm j beforem j ) = 1(p=q) m j 1(p=q) 2m j = 1 1+r m j < ( 1 r ) m j = ( 12 1+2 ) 1 4 log( 2 j ) <e log( 2 j+1 ) = 2 j+1 where the second inequality follows by Lemma 31. For the second term, we can obtain P(S n can’t hitm j withinn j steps)P(S n j <m j ) =P(S n j 2n j <m j 2n j ) < exp( (m j 2n j ) 2 2n j ) = exp( 25 24 log( 2 j+1 )) < exp(log( 2 j+1 )) = 2 j+1 where the second inequality uses Azuma inequality. As a result, 59 P j P(S n hitsm i beforem i )+P(S n can’t hitm i withinn i steps) < 2 i which is equal to the probability bound derived in GRR. Following the reasoning in Lemma 29, we can conclude that the probability that the best arm is identied employing MGRR is at least 1. Since the number of games is upper bounded in each match, we are able to derive the upper bound of the total number of games, which is number of game k X j=1 2 kj X j = 3 4 2 k X j=1 2 kj (log2 j+1 +log 1 ) = 3n 4 2 k X j=1 log2 j+1 +log 1 2 j < 3n 4 2 (4+log 1 ) =O( nlog 1 2 ) For the rest of this section, we will show a numerical example of MGRR and draw comparison to the start-of-art. To generate the preference matrix, we consider the example where each armi 60 has strengthv i and the probability thati beatsj isv i =(v i +v j ). We then suppose there are8 arms and generate the preference matrix by using the strength vector(0:2;0:2;0:2;0:2;0:2;0:2;0:2; 0:4). That is, the best arm wins each of its game with probability 2=3. We compare the perfor- mance of MGRR to two benchmarks, the maximum selection algorithm in Falahatgar, Orlitsky, et al. 2017 and top-1 selection algorithm in Mohajer, Suh, and Elmahdy 2017. Both benchmarks take knockout tournament structure, however, with variation on the approach of determining the winner of each match. For our example, we let = 0:05. The numerical result after 1000 replications is summarized in Table 4.1. Method Percentage of correct Mean number of game Standard deviation GMRR 0.967 208 1.61 Benchmark 1 0.995 446 0.411 Benchmark 2 0.997 9797 4.32 Table 4.1: Numerical example of MGRR with 1000 replications 61 Chapter5 EstimatingtheStrengthofBradleyTerry-Model In this Chapter, we consider the classical Bradley-Terry model. Suppose there is set ofn players, with each player having a strengthv i ;i = 1;:::;n. 
When playeri paired with playerj,i beatsj with probabilityp i;j . We take a Bayesian approach that supposes that thev i are the values of inde- pendent and identically distributed (i.i.d) exponential random variables. We will show how to use simulation to obtain both estimators of thev i as well as of the probability thatv i = max i v i . We use simulated data to compare two ecient ways of doing our simulation estimation procedure and compare our results with ones obtained by the minorization-maximization approach. 5.1 ProblemFormulation Suppose there aren players in an ongoing competition, with playeri having an unknown value v i ;i = 1;:::;n: We suppose that each game involves two players, and that, independently of what has previously occurred, a game betweeni andj is won byi with probabilityv i =(v i +v j ): Supposing that the results of already played games is thati has beatenj a total ofw i;j times, for 62 i6=j; our objective is to use these data to estimate the quantitiesv 1 ;:::;v n and also to obtain a feel for the probability that a particular playerk has the largest value. 5.2 TheEstimators The likelihood of the valuesv 1 ;:::;v n given the dataw i;j ;i6=j is L(v 1 ;:::;v n ) = Y i6=j ( v i v i +v j ) w i;j We will take a Bayesian approach that supposes that initially nothing is known about the strengths of the players and that v 1 ;:::;v n are the values of independent and identically dis- tributed random variablesV 1 ;:::;V n : BecauseL(v 1 ;:::;v n ) = L(cv 1 ;:::;cv n ) for every posi- tive constantc, it is natural to normalize and consider the quantities v i P n j=1 v j : Consequently, as we are supposing that nothing is initially known about the relative strengths, we will assume that V 1 ;:::;V n are such that V 1 P n j=1 V j ;:::; Vn P n j=1 V j has a Dirichlet prior density f(x 1 ;:::;x n ) = (n1)!; x 1 > 0;:::;x n > 0; n X i=1 x i = 1 63 As it is well known that V 1 P n j=1 V j ;:::; Vn P n j=1 V j has the Dirichlet density whenV 1 ;:::;V n are i.i.d exponential random variables, we will suppose thatV 1 ;:::;V n are independent exponentials with rate1. The posterior density ofV 1 ;:::;V n given the dataw i;j ;i6=j; is f(v 1 ;:::;v n jw i;j ;i6=j) =Cf(v 1 ;:::;v n ) Y i6=j ( v i v i +v j ) w i;j where f(v 1 ;:::;v n ) = Q n i=1 e v i : Hence, E[V k jw i;j ;i6=j] = E[V k Q i6=j ( V i V i +V j ) w i;j ] E[ Q i6=j ( V i V i +V j ) w i;j ] (5.1) and P(V k = max i V i jw i;j ;i6=j) = E[IfV k = max i V i g Q i6=j ( V i V i +V j ) w i;j ] E[ Q i6=j ( V i V i +V j ) w i;j ] (5.2) whereV 1 ;:::;V n are i.i.d exponentials with rate1. 5.3 TheSimulationEstimators 5.3.1 TheSimulationEstimatorsofStrengths We can estimate the quantitiesE[V k jw i;j ;i6= j]; k = 1;:::;n; as follows. In each simulation run generaten independent exponentials with rate 1, V 1 ;:::;V n ; letb = Q i6=j ( 2V i V i +V j ) w i;j ; and 64 leta k = V k b; k = 1;:::;n: Do this form runs. Lettingb (t) ;a (t) 1 ;:::;a (t) n be the values ofb and a 1 ;:::;a n obtained on runt, the raw simulation estimator ofE[V k jw i;j ;i6=j] is e k (raw) = P m t=1 a (t) k P m t=1 b (t) ; k = 1;:::;n Note that although the preceding is mathematically equivalent to deningb = Q i6=j ( V i V i +V j ) w i;j ; we have used Q i6=j ( 2V i V i +V j ) w i;j to keep these values from becoming so small that the computer rounds them to equal0. In practice, it turns out that the variance of Q i6=j ( 2V i V i +V j ) w i;j is very large. 
To reduce the number of runs needed, we employ two variance reduction techniques: stratied sampling and importance sampling. In stratied sampling, we stratify on the ordering of V i . That is, if we stratify on the permutation = ( 1 ;:::; 8 ) for whichV 1 ::: V 8 , doing one simulation to estimateE[ Q i6=j ( 2V i V i +V j ) w i;j jw i;j ;] andE[V k Q i6=j ( 2V i V i +V j ) w i;j jw i;j ;], k = 1;:::;n. If we letb anda ;k denote their simulated values, then the estimator ofE[V k jw i;j ] is e k (stratied) = P a ;i P b ; k = 1;:::;n The total number of simulation needed isn!: In our importance sampling estimator, we letY i be exponential with rate P n i=1 w i nw i ,i = 1;:::;n, wherew i = P j6=i w i;j represents the total number of wins by playeri. Then, withv = P n i=1 w i ; 65 E[ Y i6=j ( V i V i +V j ) w i;j ] = ( n Y i=1 nw i v )E[ Y i6=j ( Y i Y i +Y j ) w i;j n Y i=1 exp ( v nw i 1)Y i ] (5.3) and E[V k Y i6=j ( V i V i +V j ) w i;j ] = ( n Y i=1 nw i v )E[Y k Y i6=j ( Y i Y i +Y j ) w i;j n Y i=1 exp ( v nw i 1)Y i ] (5.4) For simulation runt, we generateY (t) i ;i = 1;:::;n; and let b (t) = Y i6=j ( 2Y (t) i Y (t) i +Y (t) j ) w i;j n Y i=1 exp ( v nw i 1)Y (t) i a (t) k =Y (t) k b (t) ; k = 1;:::;n Then the estimate ofE[V k jw i;j ;i6=j] afterm runs would be e k (importance) = P m t=1 a (t) k P m t=1 b (t) ;k = 1;:::;n 5.3.2 TheSimulationEstimatorsofProbabilities To estimatep k P(V r = max i V i jw i;j ;i6=j), we note from Equation 5.2 that 66 P(V k = max i V i jw i;j ;i6=j) = E[ Q i6=j ( V i V i +V j ) w i;j jV k = max i V i ] nE[ Q i6=j ( V i V i +V j ) w i;j ] =CE[ Y i6=j ( 2V i V i +V j ) w i;j jV k = max i V i ]; r = 1;:::;n: We can thus use simulation to estimateP 1 ;P 2 ;:::;P n as follows. In thet th simulation run, gen- eraten independent exponentials with rate 1,V 1 ;:::;V n and leti be such thatV i = max i V i : To estimateE[ Q i6=j ( 2V i V i +V j ) w i;j jV k = max i V i ], let X j (k) = 8 > > > > > > < > > > > > > : V j ; if j6=i ;j6=k V i ; if j =k V k ; if j =i and letb (t) k = Q i6=j ( 2X i (k) X i (k)+X j (k) ) w i;j : Do the preceding for eachr = 1;:::;n: If we dom simulation runs, then the estimator ofP(V k = max i V i jw i;j ;i6=j) is p k (raw) = P m t=1 b (t) k P n k=1 P m t=1 b (t) k 67 Remark33. Recallthatwhenweestimatevalueswithstratiedsampling,westratifyontheorder- ingofV i ,denotedby,andletb betheestimatedvalueofE[ Q i6=j ( 2V i V i +V j ) w i;j jw i;j ;]. Since in- dicatesthat 1 = argmax i V i ,wecanuseb toderiveanotherestimatorofP(V k = max i V i jw i;j ;i6= j). p k (stratied) = P : 1 =k b P b ; k = 1;:::;n: Note that by using the estimator above, we can obtain allb in the course of estimating values and thus no extra simulations runs are needed. Remark 34. When applying the preceding to sports players, one should usually have 2 values for eachplayer,onewhentheplayerisplayingathomeandonewhenitisplayingawayfromhome. In specialcases, onemightwanttohaveevenmorevaluesperplayer. Forinstance, inbaseballtheeare often players that have dominant pitchers and the player’s value might be dierent when they are pitching. 5.4 NumericalExamples In this section, we provide numerical examples using our proposed estimator. We assume that there are total number of 8 players, with player i associated with a value v i ;i = 1;:::;8 rep- resenting their strengths. When playeri matches up with playerj, the probabilityi beatsj is v i =(v i +v j ). We let each pair of players play 4 games and an estimate of their strengths is made based on their records. 
In the following examples, we start by generating the valuesv 1 ;:::;v n from a distribution G: Using the simulated values, we then generate the results of the games, 68 and then use the results to estimate the values. In this way we are able to compare the estimated values with the actual values. Although our derivations supposed that the values come from an exponential distribution, we think that the estimators should work well for any strength values. Thus, in our examples we consider cases both whereG is exponential and where it is uniform. We doN replications of the preceding experiment, starting each replication by generating the strengths from the distributionG. That is, in each replication we do the following. • Generatev = (v 1 ;:::;v 8 ) from distributionG. • Generatew i;j from Binomial(4,v i =(v i +v j )) wherei<j and letw j;i = 4w i;j whenj >i. The raw simulation estimators of the values, based onm simulation runs, could now be obtained by 1. Fort = 1;:::;m • GenerateV (t) i ;i = 1;:::;8 independently from Exponential(1). • Letb (t) = Q i6=j ( 2V (t) i V (t) i +V (t) j ) w i;j ; and leta (t) i =V (t) i b (t) ; i = 1;:::;8. 2. Returne k (raw) = P n t=1 a (t) k P n t=1 b (t) ,k = 1;:::;8. As we mentioned in Section 5.3 , we also use the stratied sampling and importance sampling for the purpose of variance reduction. The estimator based on stratied sampling can be obtained by the following approach StratiedSamplingApproach 69 1. For each permutation • GenerateV i ;i = 1;:::;8 independently from Exponential(1). • LetV (i) be the ordered value ofV i such thatV (1) >>V (8) . SetV i =V (i) ;i = 1;:::;8: • Letb = Q i6=j ( 2V i V i +V j ) w i;j ; and leta ;i =V i b ; i = 1;:::;8. 2. Returne k (stratied) = P a ;i P b ,k = 1;:::;8. where the total number of simulation needed is 8!. As for importance sampling, recall that we usew i to denote the total number of wins by playeri andv = P 8 i=1 w i . The strengths can be estimated by ImportanceSamplingApproach 1. Fort = 1;:::;m • GenerateY (t) i from Exponential( P 8 i=1 w i 8w i ),i = 1;:::;8 • Letb (t) = Q i6=j ( 2Y (t) i Y (t) i +Y (t) j ) w i;j Q n i=1 exp ( v nw i 1)Y (t) i and leta (t) i =Y (t) i b (t) ; i = 1;:::;8. 2. Returne k (importance) = P m t=1 a (t) k P m t=1 b (t) ;k = 1;:::;n. For our example, 5000 runs seems sucient. 70 To measure the goodness of our estimate, we dene the strength ratio vectorr = (r 1 ;:::;r 8 ) be such thatr i = v i P 8 i=1 v i ;i = 1;:::;8. If we letr j represents the actual strength ratios generated at replicationj and ^ r j denotes the estimated ratios, then the mean square error of the strength ratio overN replications can be dened as MSE = 1 N N X j=1 8 X i=1 (r j;i ^ r j;i ) 2 To illustrate the eectiveness of our procedure, we also implement the minorization-maximization(MM) algorithm proposed in Hunter et al. 2004, which can estimate the strength ratios, as the bench- mark. By doing 1000 replications, the numerical results are summarized in Table 5.1 and Table 5.2. Estimator MSE of estimated strength ratios Stratied Sampling 0.02178 Importance sampling 0.02107 MM 0.03378 Table 5.1: Numerical example with strengths generated from exponential(1) with 1000 replica- tions Estimator MSE of estimated strength ratios Stratied Sampling 0.01449 Importance Sampling 0.01560 MM 0.02498 Table 5.2: Numerical example with strengths generated from uniform(0,1) with 1000 replications To estimatep r P(V r = max i V i jw i;j ;i6= j), we adopt the estimator proposed in Section 5.3. 
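Before spelling out the probability estimator, the importance-sampling strength estimator described above can be sketched as follows. The sketch is hypothetical: the function names are assumptions, w is the n-by-n matrix of pairwise win counts with w[i, i] = 0, every player is assumed to have at least one win so that the proposal rates are finite, and the computation is done in log space to avoid the underflow issue mentioned in Section 5.3.1. It is not the exact code behind Tables 5.1 and 5.2.

```python
import numpy as np

def estimate_strengths_importance(w, runs=5000, rng=None):
    """Importance-sampling estimator of E[V_k | data] under i.i.d. Exponential(1)
    priors on the strengths."""
    rng = rng or np.random.default_rng()
    w = np.asarray(w, dtype=float)
    n = w.shape[0]
    wins = w.sum(axis=1)                      # w_i: total number of wins of player i
    v = wins.sum()
    means = n * wins / v                      # proposal Y_i ~ Exp(rate v/(n w_i)), mean n w_i / v

    log_bs = np.empty(runs)
    ys = np.empty((runs, n))
    for t in range(runs):
        y = rng.exponential(means)            # one draw of Y_1, ..., Y_n
        # log of prod_{i != j} (Y_i / (Y_i + Y_j))^{w_ij}
        log_like = np.sum(w * (np.log(y)[:, None] - np.log(y[:, None] + y[None, :])))
        # log likelihood ratio prod_i (n w_i / v) exp((v/(n w_i) - 1) Y_i)
        log_weight = np.sum(np.log(means) + (1.0 / means - 1.0) * y)
        log_bs[t] = log_like + log_weight
        ys[t] = y
    b = np.exp(log_bs - log_bs.max())         # common shift cancels in the ratio below
    return (ys * b[:, None]).sum(axis=0) / b.sum()
```

Because only the ratio of the sums of a_k and b matters, the same common-shift trick can be applied to the raw and stratified estimators as well.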
Recall thati is such thatV i (t) = maxV i (t). If we dom runs, the simulation estimator of probabilities can be obtained by 71 1. Fort = 1;:::;m (a) GenerateV (t) i ;i = 1;:::;8 independently from Exponential(1). (b) Fork = 1;:::;8 • let X j (r) = 8 > > > > > > < > > > > > > : V (t) j ; if j6=i ;j6=k V (t) i ; if j =k V (t) k ; if j =i • Setb (t) r = Q i6=j ( 2X i (k) X i (k)+X j (k) ) w i;j . 2. Return ^ p k = P m t=1 b (t) r P n r=1 P m t=1 b (t) r ;k = 1;:::;8: To intuitively show how our procedure works, we provide a numerical example as follows. The setup is similar as before, where we assume there are 8 players and each pair of players plays 4 games. In this case, however, the strengths of players are initially xed as constant. We doN = 1000 replications and for each replication, we generate the game results based on the xed strengths and then the estimation of p k ;k = 1;:::;8 is made based on their records. An example showing our estimatedP(V k = max i V i jw i;j ;i6= j) over 1000 replications paired with their strengths is listed in Table 5.3. Note that we take the average over 1000 replications as our nal estimation. 72 Strength Estimated probabilities having largest strength 12 0.5207 8 0.2182 7 0.1558 5 0.0554 4 0.0245 4 0.0240 2 0.0010 1 0.00002 Table 5.3: Example of estimating probabilities of having largest strength with 1000 replications 73 Chapter6 Conclusions In this thesis, we mainly study two bandit typed problems: an innitely many armed bandit problem and a dueling bandit problem. In addition, we also present a technique for estimating the strengths of Bradley-Terry model. We rst introduce the problem of minimizing the expected number of plays to nd a good Bernoulli arm in an innite collection. We assume arms have Bernoulli rewards, whose means are the values of independent random variables having a known densityf. At each time step, the player can draw an arm and observes the reward generated from its corresponding distribution. We want to identify an arm for which the posterior probability of the mean being at least1 is at least 1. The objective is to nd a policy that minimizes the expected number of plays until a desired arm is identied. We decompose a policy into two parts: the acceptance rule, which determines when to accept an arm, and the sampling rule, which describes how to sequentially sample arms. Letting an arm have record(s;k) if it has been playeds+k times withs successes resulting, we show that the optimal acceptance rule is such that it is optimal to accept the arm having record (s;k) if and 74 only ifss k for a nondecreasing sequences k ,k 0. Under this acceptance rule, we then show that there exists an optimal sampling rule such that it is optimal to continue playing the current arm having record(s;k) if and only ifsm k ;k 1 and to switch to a previously unseen arm otherwise. The valuess k can be explicitly determined while them k are unknown. We also consider sampling policies that limit the number of failures allowed on an arbitrary arm. We show that the optimal policy of such type also has the threshold structure. Given the threshold vector, we provide a recursive way of determining its expected number of plays. In addition, we give a modication strategy that can be used to improve an arbitrary sampling policy. Numerical results indicate that the modication of the policy that only allows an arm a single loss before switching performs very well. We then study another variant of the classical multi-armed bandit problem, the dueling bandit problem. 
We assume there is a set ofn arms. At each time step, the player chooses two arms to play a game, with the result of the game being learned. We consider three dierent objectives: minimizing the cumulative weak regret, minimizing the cumulative strong regret, and nding the best dueler with minimal number of plays. Under the objective of minimizing the weak regret, we assume there is a best arm, which beats any other arm with probability at leastp,p > 0:5 andp is unknown. The objective is to maxi- mize the number of times that one of the competitors is the best arm. We present a policy, named Beat the Winner(BTW), that achieves nite regret, and an improved policy, named Modied Beat the Winner(MBTW), that empirically performs better. Under the goal of minimizing the strong regret, we assume that each armi has strengthv i ,i = 1;:::;n. The probability that armi beats 75 j isv i =(v i +v j ). we are interested in maximizing the number of times that the game is between the best arm itself. We develop a Thompson Sampling type algorithm, which employs Metropo- lis–Hastings algorithm to sample from the posterior distribution. Numerical examples show that our algorithm matches the state-of-art. Under the objective of nding the best dueling, we again assume there is a best arm, which beats any arm with probability at leastp, withp known. We design two policies that makes use of the structure of gambler’s ruin problem. We derive the upper bound of the expected number of plays for both polices, and numerically compare to the state-of-art. Finally, we study the classical Bradley-Terry model. The model assumes that there are n players, with each player having a strength v i ;i = 1;:::;n. When player i paired with player j, i beatsj with probabilityv i =(v i +v j ). We take a Bayesian approach that supposes that the v i are the values of independent and identically distributed exponential random variables. We show how to use simulation to obtain both estimators of thev i as well as of the probability that v i = max i v i . We use simulated data to compare two ecient ways of doing our simulation estimation procedure. 76 Bibliography Audibert, Jean-Yves and Sébastien Bubeck (2010). “Best arm identication in multi-armed bandits”. In: Aziz, Maryam, Jesse Anderton, Emilie Kaufmann, and Javed Aslam (2018). “Pure exploration in innitely-armed bandit models with xed-condence”. In: arXiv preprint arXiv:1803.04665. Berry, Donald A and Bert Fristedt (1985). “Bandit problems: sequential allocation of experiments (Monographs on statistics and applied probability)”. In: London: Chapman and Hall 5, pp. 71–87. Bonald, Thomas and Alexandre Proutiere (2013). “Two-target algorithms for innite-armed bandits with Bernoulli rewards”. In: Advances in Neural Information Processing Systems, pp. 2184–2192. Bradley, Ralph Allan and Milton E Terry (1952). “Rank analysis of incomplete block designs: I. The method of paired comparisons”. In: Biometrika 39.3/4, pp. 324–345. Bubeck, Séebastian, Tengyao Wang, and Nitin Viswanathan (2013). “Multiple identications in multi-armed bandits”. In: International Conference on Machine Learning, pp. 258–265. Caron, Francois and Arnaud Doucet (2012). “Ecient Bayesian inference for generalized Bradley–Terry models”. In: Journal of Computational and Graphical Statistics 21.1, pp. 174–196. Carpentier, Alexandra and Michal Valko (2015). “Simple regret for innitely many armed bandits”. In: International Conference on Machine Learning, pp. 1133–1141. 
Cattelan, Manuela, Cristiano Varin, and David Firth (2013). “Dynamic Bradley–Terry modelling of sports tournaments”. In: Journal of the Royal Statistical Society: Series C (Applied Statistics) 62.1, pp. 135–150. 77 Chakrabarti, Deepayan, Ravi Kumar, Filip Radlinski, and Eli Upfal (2009). “Mortal multi-armed bandits”. In: Advances in neural information processing systems, pp. 273–280. Chandrasekaran, Karthekeyan and Richard Karp (2014). “Finding a most biased coin with fewest ips”. In: Conference on Learning Theory, pp. 394–407. Chen, Bangrui and Peter I Frazier (2017). “Dueling bandits with weak regret”. In: arXiv preprint arXiv:1706.04304. Chen, Shouyuan, Tian Lin, Irwin King, Michael R Lyu, and Wei Chen (2014). “Combinatorial pure exploration of multi-armed bandits”. In: Advances in Neural Information Processing Systems, pp. 379–387. Chen, Xi, Paul N Bennett, Kevyn Collins-Thompson, and Eric Horvitz (2013). “Pairwise ranking aggregation in a crowdsourced setting”. In: Proceedings of the sixth ACM international conference on Web search and data mining, pp. 193–202. Ding, Kaize, Jundong Li, and Huan Liu (2019). “Interactive anomaly detection on attributed networks”. In: Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, pp. 357–365. Durand, Audrey, Charis Achilleos, Demetris Iacovides, Katerina Strati, Georgios D Mitsis, and Joelle Pineau (2018). “Contextual bandits for adapting treatment in a mouse model of de novo carcinogenesis”. In: Machine Learning for Healthcare Conference, pp. 67–82. Even-Dar, Eyal, Shie Mannor, and Yishay Mansour (2006). “Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems”. In: Journal of machine learning research 7.Jun, pp. 1079–1105. Falahatgar, Moein, Yi Hao, Alon Orlitsky, Venkatadheeraj Pichapati, and Vaishakh Ravindrakumar (2017). “Maxing and ranking with few assumptions”. In: Advances in Neural Information Processing Systems, pp. 7060–7070. Falahatgar, Moein, Alon Orlitsky, Venkatadheeraj Pichapati, and Ananda Theertha Suresh (2017). “Maximum selection and ranking under noisy comparisons”. In: arXiv preprint arXiv:1705.05366. Gabillon, Victor, Mohammad Ghavamzadeh, Alessandro Lazaric, and Sébastien Bubeck (2011). “Multi-bandit best arm identication”. In: Advances in Neural Information Processing Systems, pp. 2222–2230. Guiver, John and Edward Snelson (2009). “Bayesian inference for Plackett-Luce ranking models”. In: proceedings of the 26th annual international conference on machine learning, pp. 377–384. 78 Guo, Shengbo, Scott Sanner, Thore Graepel, and Wray Buntine (2012). “Score-based bayesian skill learning”. In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, pp. 106–121. Hastie, Trevor and Robert Tibshirani (1998). “Classication by pairwise coupling”. In: Advances in neural information processing systems, pp. 507–513. Hunter, David R et al. (2004). “MM algorithms for generalized Bradley-Terry models”. In: The annals of statistics 32.1, pp. 384–406. Kalyanakrishnan, Shivaram and Peter Stone (2010). “Ecient Selection of Multiple Bandit Arms: Theory and Practice.” In: ICML. Vol. 10, pp. 511–518. Kohli, Pushmeet, Mahyar Salek, and Greg Stoddard (2016). “A fast bandit algorithm for recommendations to users with heterogeneous tastes”. In: Komiyama, Junpei, Junya Honda, Hisashi Kashima, and Hiroshi Nakagawa (2015). “Regret lower bound and optimal algorithm in dueling bandit problem”. In: Conference on learning theory, pp. 
1141–1154. Komiyama, Junpei, Junya Honda, and Hiroshi Nakagawa (2016). “Copeland dueling bandit problem: Regret lower bound, optimal algorithm, and computationally ecient algorithm”. In: arXiv preprint arXiv:1605.01677. Lai, Tze Leung and Herbert Robbins (1985). “Asymptotically ecient adaptive allocation rules”. In: Advances in applied mathematics 6.1, pp. 4–22. McHale, Ian and Alex Morton (2011). “A Bradley-Terry type model for forecasting tennis match results”. In: International Journal of Forecasting 27.2, pp. 619–630. Mohajer, Soheil, Changho Suh, and Adel Elmahdy (2017). “Active Learning for Top-K Rank Aggregation from Noisy Comparisons”. In: International Conference on Machine Learning, pp. 2488–2497. Radlinski, Filip, Madhu Kurup, and Thorsten Joachims (2008). “How does clickthrough data reect retrieval quality?” In: Proceedings of the 17th ACM conference on Information and knowledge management, pp. 43–52. Thompson, William R (1933). “On the likelihood that one unknown probability exceeds another in view of the evidence of two samples”. In: Biometrika 25.3/4, pp. 285–294. Wang, Yizao, Jean-Yves Audibert, and Rémi Munos (2009). “Algorithms for innitely many-armed bandits”. In: Advances in Neural Information Processing Systems, pp. 1729–1736. 79 Wu, Huasen and Xin Liu (2016). “Double thompson sampling for dueling bandits”. In: Advances in Neural Information Processing Systems, pp. 649–657. Yue, Yisong and Thorsten Joachims (2009). “Interactively optimizing information retrieval systems as a dueling bandits problem”. In: Proceedings of the 26th Annual International Conference on Machine Learning, pp. 1201–1208. — (2011). “Beat the mean bandit”. In: Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 241–248. Zermelo, Ernst (1929). “Die berechnung der turnier-ergebnisse als ein maximumproblem der wahrscheinlichkeitsrechnung”. In: Mathematische Zeitschrift 29.1, pp. 436–460. Zoghi, Masrour, Zohar S Karnin, Shimon Whiteson, and Maarten De Rijke (2015). “Copeland dueling bandits”. In: Advances in Neural Information Processing Systems, pp. 307–315. Zoghi, Masrour, Shimon Whiteson, Remi Munos, and Maarten Rijke (2014). “Relative upper condence bound for the k-armed dueling bandit problem”. In: International Conference on Machine Learning, pp. 10–18. Zoghi, Masrour, Shimon Whiteson, and Maarten de Rijke (2015). “Mergerucb: A method for large-scale online ranker evaluation”. In: Proceedings of the Eighth ACM International Conference on Web Search and Data Mining, pp. 17–26. 80 AppendixA ProofsinChapter3 A.1 DerivationofE[X (k 1 ;:::;k n ) (s;i)]andP (k 1 ;:::;k n ) (s;i) Following the reasoning in Lemma 11,E[X (k 1 ;:::;kn) (s;i)] can be recursively determined by E[X (k 1 ;:::;kn) (s;i)] =E[X (k 1 ;:::;k n1 ) (s;i)]+E[A (k 1 ;:::;kn) (s;i)] whereA (k 1 ;:::;kn) (s;i) is the additional number of plays of an arm after observing itsnth failure. Recall that random variableP has densityf. LetP s;i be the probability that an arm has record (s;i) afters+i plays, then P s;i = Z 1 0 s 0 1 X l 1 =k 1 s 1 1 X l 2 =maxfl 1 ;k 2 g ::: s X l i =maxfl i1 ;k i g p s q i f(p)dp 81 whereq = 1p. 
The expectation ofA (k 1 ;:::;kn) (s;i) can be obtained by E[A (k 1 ;:::;kn) (s;i)] = 1 P s;i Z 1 0 s 0 1 X i 1 =k 1 ::: s1 X l i =maxfl i1 ;k i g ::: s n1 1 X in=maxfkn;i n1 g p in q n snin1 X j=0 p j f(p)dp = 1 P s;i Z 1 0 s 0 1 X i 1 =k 1 ::: s1 X l i =maxfl i1 ;k i g ::: s n1 1 X in=maxfkn;i n1 g q n1 (p in p sn ) f(p)dp = 1 P s;i s 0 1 X i 1 =k 1 ::: s1 X l i =maxfl i1 ;k i g ::: s n1 1 X in=maxfkn;i n1 g E[(P in P sn )(1P) n1 ] Similarly,P (k 1 ;:::;k n ) (s;i) can be determined recursively by P (k 1 ;:::;kn) (s;i) =P (k 1 ;:::;k n1 ) (s;i)+B (k 1 ;:::;kn) (s;i) whereB (k 1 ;:::;kn) (s;i), the probability that the arm is accepted after thenth failure is observed, can be derived by B (k 1 ;:::;kn) (s;i) = 1 P s;i Z 1 0 s 0 1 X i 1 =k 1 ::: s1 X l i =maxfl i1 ;k i g ::: s n1 1 X in=maxfkn;i n1 g p sn q n f(p)dp = 1 P s;i s 0 1 X i 1 =k 1 ::: s1 X l i =maxfl i1 ;k i g ::: s n1 1 X in=maxfkn;i n1 g E[P sn (1P) n ] Note thatA (k 1 ;:::;kn) (s;i) andB (k 1 ;:::;kn) (s;i) are equal to 0 forn<i. 82 A.2 ProofofLemma14 LetC(i;s;j) =E[(P i P s )(1P) j ] andD(s;j) =E[P s (1P) j ]. Denote E = s l+1 1 X i l+1 =k l+1 C(i l+1 ;s l+1 ;l+1)+:::+ s l+1 1 X i l+1 =k l+1 ::: s l+1 1 X in=maxfkn;i n1 g C(i n ;s n ;n) and F = s l+1 1 X i l+1 =k l+1 D(s l+1 ;l+1)+:::+ s l+1 1 X i l+1 =k l+1 ::: s l+1 1 X in=maxfkn;i n1 g D(s n ;n) Because of the assumptions<k l+1 , we can write E[X (k 1 ;:::;kn) (s;j)] P (k 1 ;:::;kn) (s;j) = P s 0 1 l 1 =k 1 ::: P s1 l j =maxfl j1 ;k j g C(s;s j ;j)+E P s 0 1 l 1 =k 1 ::: P s1 l j =maxfl j1 ;k j g D(s j ;j)+F Now since both ofC(s;s j ;j)+E andD(s j ;j)+F are independent of the sequencel 1 ;:::;l j , we can get rid of the summation, which gives E[X (k 1 ;:::;kn) (s;j)] P (k 1 ;:::;kn) (s;j) = C(s;s j ;j)+E D(s j ;j)+F AsC(s;s j ;j) is a decreasing function ofs, whileE;F andD(s j ;j) are all constants regardings. We can conclude that E[X (k 1 ;:::;kn) (s;j)] P (k 1 ;:::;kn) (s;j) is also a decreasing function ofs. 83 A.3 ProofofLemma16 We want to show that, for any 1 l n, there exists some k l such that E[S (k 1 ;:::k l ;:::kn) ] is monotonically decreasing whenk l1 k l k l and monotonically increasing whenk l k l < k l+1 . Consider the thresholds vector (k 1 ;:::;k;:::;k n ) and (k 1 ;:::;k + 1;:::;k n ). Following the notation ofC;D;E;F at the beginning of A.2 and Lemma 11, we can obtain E[X (k 1 ;:::;k;:::kn) ]E[X (k 1 ;:::;k+1;:::kn) ] =E[A (k 1 ;:::;k) ]E[A (k 1 ;:::;k+1) ]+ n X j=l+1 (E[A (k 1 ;:::;k;:::;k j ) ]E[A (k 1 ;:::;k+1;:::;k j ) ]) = X (i 1 ;:::;i l1 ):i l =k C(k;s l ;l)+ s l+1 1 X i l+1 =k l+1 C(i l+1 ;s l+1 ;l+1)+::: + s l+1 1 X i l+1 =k l+1 ::: s l+1 1 X in=maxfkn;i n1 g C(i n ;s n ;n) = X (i 1 ;:::;i l1 ):i l =k (C(k;s l ;l)+E) Similarly, by Lemma 12 P (k 1 ;:::;k;:::kn) P (k 1 ;:::;k+1;:::kn) =B (k 1 ;:::;k) B (k 1 ;:::;k+1) + n X j=l+1 (B (k 1 ;:::;k;:::k j ) B (k 1 ;:::;k+1;:::k j ) ) = X (i 1 ;:::;i l1 ):i l =k D(s l ;l)+ s l+1 1 X i l+1 =k l+1 D(s l+1 ;l+1)+:::+ s l+1 1 X i l+1 =k l+1 ::: s l+1 1 X in=maxfkn;i n1 g D(s n ;n) = X (i 1 ;:::;i l1 ):i l =k (D(s l ;l)+F) 84 As a result, E[X (k 1 ;:::;k;:::;kn) ]E[X (k 1 ;:::;k+1;:::;kn) ] P (k 1 ;:::;k;:::;kn) P (k 1 ;:::;k+1;:::;kn) = P (i 1 ;:::;i l1 ):i l =k (C(k;s l ;l)+E) P (i 1 ;:::;i l1 ):i l =k (D(s l ;l)+F) = C(k;s l ;l)+E D(s l ;l)+F Note thatC(k;s l ;l) decreases ink andD(s l ;l);E;F are constant with regard tok. 
To show that E[X (k 1 ;:::;k;:::;kn) ] P (k 1 ;:::;k;:::;kn) is a unimodal function ofk, we observe E[X (k 1 ;:::;k;:::;kn) ]E[X (k 1 ;:::;k+1;:::;kn) ] P (k 1 ;:::;k;:::;kn) P (k 1 ;:::;k+1;:::;kn) E[X (k 1 ;:::;k+1;:::;kn) ] P (k 1 ;:::;k+1;:::;kn) = (C(k;s l ;l)+E)P (k 1 ;:::;k+1;:::;kn) (D(s l ;l)+F)E[X (k 1 ;:::;k+1;:::;kn) ] P (k 1 ;:::;k+1;:::;kn) (P (k 1 ;:::;k;:::;kn) P (k 1 ;:::;k+1;:::;kn) ) Now letf(k) = (C(k;s l ;l)+E)P (k 1 ;:::;k+1;:::;kn) (D(s l ;l)+F)E[X (k 1 ;:::;k+1;:::;kn) ]. Subtracting f(k) byf(k+1) gives f(k)f(k+1) = C(k;s l ;l)C(k+1;s l ;l) P (k 1 ;:::;k+1;:::;kn) > 0 Therefore,f(k) is a decreasing function ofk, which indicates there exists some value ofk , such thatf(k) 0 forkk andf(k) 0 forkk . As a result, E[X (k 1 ;:::;k;:::kn) ]E[X (k 1 ;:::;k+1;:::kn) ] P (k 1 ;:::;k;:::kn) P (k 1 ;:::;k+1;:::kn) E[X (k 1 ;:::;k+1;:::kn) ] P (k 1 ;:::;k+1;:::kn) 0 85 forkk and the other way around forkk . Since E[X (k 1 ;:::;k;:::;kn) ]E[X (k 1 ;:::;k+1;:::;kn) ] P (k 1 ;:::;k;:::;kn) P (k 1 ;:::;k+1;:::;kn) E[X (k 1 ;:::;k+1;:::;kn) ] P (k 1 ;:::;k+1;:::;kn) =) E[X (k 1 ;:::;k;:::;kn) ] P (k 1 ;:::;k;:::;kn) E[X (k 1 ;:::;k+1;:::;kn) ] P (k 1 ;:::;k+1;:::;kn) we can conclude E[X (k 1 ;:::;k;:::;kn) ] P (k 1 ;:::;k;:::;kn) is a unimodal ofk. 86 AppendixB ProofsinChapter4 B.1 ProofofLemma25 Proof. LetN denote the total number of arms. If the best arm is the winner of the roundk1, the best arm needs to beat all opponents to win roundk no matter in what order, which gives P(W k jW k1 ) = N Y i=1 P(best player wins iterationi) N Y i=1 1(q=p) N(k1)+k+i1 1(q=p) N(k1)+k+i = 1(q=p) NkN+k 1(q=p) N(k+1) 87 On the other hand, if the best arm is the challenger at roundk and suppose that it comes to play at iterationj,1jN, P(W k jL k1 ) = 1(q=p) 1(q=p) N(k1)+k+j N Y i=j+1 1(q=p) N(k1)+k+i1 1(q=p) N(k1)+k+i = 1(q=p) 1(q=p) N(k+1) B.2 ProofofLemma26 Proof. . Letr =q=p. By conditioning on whether the best player wins roundk1 P(L k ) =P(L k jW k1 )P(W k1 )+P(L k jL k1 )P(L k1 ) = 1P(W k jW k1 ) P(W k1 )+ 1P(W k jL k1 ) P(L k1 ) = r NkN+k r N(k+1) 1r N(k+1) 1P(L k1 ) + rr N(k+1) 1r N(k+1) P(L k1 ) = rr NkN+k 1r N(k+1) P(L k1 )+ r NkN+k r N(k+1) 1r N(k+1) <rP(L k1 )+r N(k1)+k Solving the recursion formula gives us P(L k )<r k +r k i=k1 X i=0 (r N(k1) )<r k (1+ 1 1+r N )< 2r k 88 B.3 ProofofLemma27 Proof. E p [X] = n 2p1 1(q=p) 1(q=p) n 1 n p6= 1 2 E p [X] =n1 p = 1 2 Letr =q=p and thusp = 1=(1+r). Whenp6= 1=2, substitutep byr E p [X] = n(1r) 1r n 1 1+r 1r Forr> 1 E p [X]2n = (n+ r n 1 r1 2n r n 1 1+r ) 1+r r n 1 = n (2r2)n r+1 +1 i1 X i=0 r i 1+r r n 1 < 0 89 Forr< 1 E p [X]2n = (n r n 1 r1 +2n r n 1 1+r ) 1+r 1r n > (n r n 1 r1 +n(r n 1)) 1+r 1r n = ( n1 X i=0 r i +nr n ) 1+r 1r n < 0 The proof is complete. B.4 ProofofLemma30 Proof. Letr =p=(1p). Using the result from gambler’s ruin problem, we can obtain E[X m ] = m 2p1 r m 1 r m +1 Writep as a function ofr, that is,p = r=(r +1). NowE[X m ] can be viewed as a function ofr, denoted byf(r), where f(r) = m 2r=(r+1)1 r m 1 r m +1 =m r+1 r1 r m 1 r m +1 90 Asr is a increasing function ofp, it suces to show thatf(r) is a decreasing function ofr. Take the derivative will result in f 0 (r) =m r m +1+m(r+1)r m1 (r1)(r m +1) (r1) 2 (r m +1) 2 r m +1+(r1)mr m1 (r+1)(r m 1) (r1) 2 (r m +1) 2 =m mr m+1 mr m1 r 2m +1 (r1) 2 (r m +1) 2 Letg(r) =mr m+1 mr m1 r 2m +1. It suces to show thatg(r)< 0 for allr> 0. 
$$g(r) = (r^2 - 1)mr^{m-1} - r^{2m} + 1 = (r^2 - 1)mr^{m-1} - (r^2 - 1)\sum_{i=0}^{m-1} r^{2i} = (r^2 - 1)\Big(mr^{m-1} - \sum_{i=0}^{m-1} r^{2i}\Big)$$

By the inequality of arithmetic and geometric means,

$$\frac{1}{m}\sum_{i=0}^{m-1} r^{2i} \ge \sqrt[m]{\prod_{i=0}^{m-1} r^{2i}} = r^{m-1}$$

Thus,

$$g(r) \le (r^2 - 1)\big(mr^{m-1} - mr^{m-1}\big) = 0,$$

which completes the proof.
Abstract
The dissertation mainly studied variants of multi-armed bandit problems. We first looked at an infinitely many armed bandit problem. Suppose there is a set of infinitely many arms and each arm has independent Bernoulli rewards with unknown mean. With the goal of identifying an arm such that the posterior probability of its mean being at least 1 − α is at least 1 − β, we want to minimize the expected number of plays until such an arm is identified. We were able to show that there is an optimal policy that never plays a previously discarded arm and that such optimal policies have a threshold structure. We proposed a heuristic policy that limits the number of failures allowed for all arms. We also developed a policy improvement strategy that can improve upon an arbitrary policy. We then considered the dueling bandit problem. Instead of playing a single arm, the learner draws a pair of arms at each time step and learns the noisy pairwise preference. Our first goal was to minimize the cumulative binary weak regret, that is, the total number of plays in which the best arm is not involved. Assuming the existence of a Condorcet winner, we proposed an algorithm with a theoretical guarantee of finite regret over an infinite time horizon and developed an improved algorithm with better empirical performance. We then considered the objective of minimizing the cumulative binary strong regret. We designed a Thompson sampling approach that determines the next pair based on the sampled strengths, and we proposed to sample from the posterior distribution by taking an MCMC approach. In the end, we studied the objective of identifying the best arm with fixed confidence using the fewest plays. We employed a knockout tournament structure, with the winner of each duel determined by a gambler's ruin rule. We also considered the problem of estimating the strengths of the Bradley-Terry model and estimating the probability that a given player has the largest strength. We took a Bayesian approach that uses simulation estimators to achieve this goal. Two efficient variance reduction techniques were developed to speed up the simulation.